Introduction

The Bantu expansion is an excellent example of prehistoric colonization on a continental scale. A fairly homogeneous population of Bantu-speaking people spread from the northwest of the equatorial forest in Cameroon and Nigeria throughout Central, Eastern, and Southern Africa. Although this process was completed in the recent past, our knowledge is based mostly on indirect archeological (e.g., P. de Maret 1984; M. Posnansky 1968), anthropological (e.g., G.P. Murdock 1959; J. Vansina 1990), genetic (e.g., L.L. Cavalli-Sforza et al. 1994; L. Pereira et al. 2002), and/or linguistic (e.g., J.H. Greenberg 1972; M. Guthrie 1967–1971; B. Heine 1973; D. Nurse 1997) evidence. The linguistic arguments have especially contributed to widely accepted interpretations of the Bantu expansion (M. Guthrie 1962; J. Vansina 1995). Nevertheless, traditional comparative linguistics was limited by the huge number of Bantu languages and the incompleteness of records. Therefore, most authors have focused on lexical data. The extensive comparative material on proto-Bantu roots collected by M. Guthrie (1967–1971) was converted to similarity matrices and treated by different tree-building methods (C. Flight 1988; A. Henrici 1973). These analyses were, however, limited by the small number of included languages. Consequently, standardized lexical data available in an almost complete set of languages/dialects was introduced (Y. Bastin et al. 1983, 1999; A. Coupez et al. 1975). Nevertheless, the computational approach used by lexicostatistics, although reasonable (e.g., S.M. Embleton 1986), is purely phenetic. The resulting trees were therefore not fully appropriate for the historical reconstruction of language evolution (H. Hoijer 1956; D. Nurse 1994–1995).

Thus, the introduction of phylogenetic methodology (R.D. Gray and F.M. Jordan 2000; K. Rexová et al. 2003) was needed. This crucial step was performed by C.J. Holden (2002), who calculated a maximum parsimony tree for 73 Bantu languages. In spite of its methodological purity, even this study suffers from a low number of characters. Therefore, we searched for additional characters. Grammatical data, although usually considered to be less prone to borrowing (e.g., D. Nurse 1994–1995) and therefore more suitable for the purpose of genetic classification, has been overlooked and has not yet been used for the construction of maximum parsimony trees.

The aims of this study are: (1) to analyze a combined data set consisting of both lexical and grammatical data, (2) to perform a phylogeographic interpretation of the resulting trees, and (3) to discuss the usefulness of phylogenetic methodology in the case of Bantu languages.

Materials and methods

We used the lexical data set collected by Y. Bastin et al. (1999). These data were originally generated for the lexicostatistical analyses of languages, i.e., they are recognized cognate linguistic forms according to the criteria of comparative linguistics. Lexical data were set up on the basis of a list of 100 lexical items (meanings) primarily chosen by M. Swadesh (1955) and reduced to 92 meanings by Yvonne Bastin. It is widely accepted that words corresponding to meanings on the list are relatively insensitive to borrowing. The grammatical data set (see S1 for description of grammatical characters) consisting of 52 phonological and morphological items was collected by Y. Bastin (1979).

Hence, 87 Bantu languages with both lexical (92) and grammatical (52) items were available for further analyses. The small number of items complicated the separate analyses of lexical and grammatical matrices, especially in the case of grammatical data analysis (see S2 and S3 for separate analyses of grammatical and lexical data set). Although a partition-homogeneity test revealed some statistically significant (P=0.01) incongruence between the lexical and grammatical matrices, we followed the total evidence approach.

The multistate matrix of 144 items was generated and processed with PAUP software (version 4.0b4a; D.L. Swofford 2000). A heuristic search (“addseq = random” and “nreps = 1,000”) was performed to find the most parsimonious trees. Items with more than one form in a single language (synonyms) were treated as polymorphic character states. Bremer indices (up to 4) and bootstrap values (nreps = 1,000) were computed to indicate support for the individual clades. In addition, weighted parsimony trees were constructed (“reweight”) to reduce the effect of possible borrowing. To root the tree of Bantu languages, we used Nen (A44). This language was selected because it comes from the source region of the Bantu expansion. Moreover, in our preliminary analyses of lexical matrices including some Bantoid non-Bantu languages (i.e., the putative sister groups), the Nen language kept a stable position at the base of the tree.
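To illustrate what a parsimony score over a multistate matrix with polymorphic entries means, the following is a minimal sketch of Fitch's small-parsimony count on a fixed rooted binary tree. It is not the PAUP implementation: the tip names, the toy characters, and the function names `fitch` and `tree_length` are hypothetical, and a polymorphic tip is simply treated as a set of admissible states, which is a simplification and may differ from how PAUP scores polymorphism.

```python
from typing import Dict, FrozenSet, List, Set, Tuple, Union

# A tree is either a tip name or a pair of subtrees (rooted, binary).
Tree = Union[str, Tuple["Tree", "Tree"]]
# One character: tip name -> set of observed states (more than one = polymorphic).
Character = Dict[str, Set[int]]


def fitch(tree: Tree, tips: Character) -> Tuple[FrozenSet[int], int]:
    """Return (state set at this node, minimum number of changes below it)."""
    if isinstance(tree, str):
        return frozenset(tips[tree]), 0
    left, right = tree
    left_states, left_steps = fitch(left, tips)
    right_states, right_steps = fitch(right, tips)
    shared = left_states & right_states
    if shared:
        # The children can agree on a state: no extra change is needed here.
        return shared, left_steps + right_steps
    # The children disagree: keep the union and count one change.
    return left_states | right_states, left_steps + right_steps + 1


def tree_length(tree: Tree, matrix: List[Character]) -> int:
    """Sum of Fitch steps over all characters (equal weights)."""
    return sum(fitch(tree, character)[1] for character in matrix)


# Toy data (hypothetical): four tips, two characters, tip "A" polymorphic in the second.
toy_matrix: List[Character] = [
    {"A": {0}, "B": {0}, "C": {1}, "D": {1}},
    {"A": {0, 1}, "B": {1}, "C": {0}, "D": {0}},
]
tree_ab_cd: Tree = (("A", "B"), ("C", "D"))
tree_ac_bd: Tree = (("A", "C"), ("B", "D"))

print(tree_length(tree_ab_cd, toy_matrix))  # 2
print(tree_length(tree_ac_bd, toy_matrix))  # 3
```

On this toy matrix the first topology needs fewer changes and would be preferred under maximum parsimony. A heuristic search repeats such a count over many candidate topologies, and a bootstrap repeats the whole search on matrices resampled by character.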

Next, we recoded the matrix into binary form and analyzed it in MrBayes 3.0B4 (J.P. Huelsenbeck and F. Ronquist 2001). In setting the parameters (samplefreq = 10,000; burnin = 300,000; others: default), we followed R.D. Gray and Q.D. Atkinson (2003), who performed an analysis of Indo-European languages.
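One common way to recode a multistate matrix into binary form is to turn every observed state of every item into its own presence/absence column. The sketch below shows that scheme on a toy matrix; whether the recoding used here followed exactly this scheme is an assumption, and the language names, state labels, and the function name `recode_binary` are hypothetical. An empty set stands for missing data and is written as `?`.

```python
from typing import Dict, List, Set, Tuple


def recode_binary(
    matrix: Dict[str, List[Set[str]]]
) -> Tuple[List[Tuple[int, str]], Dict[str, str]]:
    """Recode a multistate matrix into presence/absence columns.

    matrix maps each language to a list of state sets, one per item.
    Returns the column labels as (item index, state) pairs and, for each
    language, a string of '1', '0', and '?' characters.
    """
    languages = sorted(matrix)
    n_items = len(matrix[languages[0]])

    # One binary column per state observed for each item.
    columns: List[Tuple[int, str]] = []
    for item in range(n_items):
        states = sorted(set().union(*(matrix[lang][item] for lang in languages)))
        columns.extend((item, state) for state in states)

    recoded: Dict[str, str] = {}
    for lang in languages:
        row = []
        for item, state in columns:
            observed = matrix[lang][item]
            if not observed:
                row.append("?")  # item not recorded for this language
            else:
                row.append("1" if state in observed else "0")
        recoded[lang] = "".join(row)
    return columns, recoded


# Toy data (hypothetical): lang_b has two forms (synonyms) for item 0,
# lang_c has no record for item 1.
toy = {
    "lang_a": [{"x"}, {"p"}],
    "lang_b": [{"x", "y"}, {"q"}],
    "lang_c": [{"y"}, set()],
}
columns, binary = recode_binary(toy)
print(columns)  # [(0, 'x'), (0, 'y'), (1, 'p'), (1, 'q')]
print(binary)   # {'lang_a': '1010', 'lang_b': '1101', 'lang_c': '01??'}
```

Under this scheme a polymorphic entry needs no special treatment: a language with two forms for one item simply scores 1 in both corresponding columns.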

Results

Maximum parsimony analysis generated eight trees. The topology of the consensus tree (length=3,198, CI=0.46, RI=0.55, and RC=0.25; see Fig. 1) was almost the same as that based on the three trees resulting from RC-weighted parsimony (length=772.53, CI=0.56, RI=0.59, and RC=0.33); we therefore restrict further discussion to the former.
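The three reported indices are not independent: under the standard definition, the rescaled consistency index is the product of the consistency index and the retention index, RC = CI × RI. The short check below applies that definition to the values reported above. The inputs are the rounded figures from the text, so agreement is only expected to within rounding; the variable and function names are illustrative.

```python
def rescaled_consistency(ci: float, ri: float) -> float:
    """Rescaled consistency index, RC = CI * RI."""
    return ci * ri


# (label, CI, RI, reported RC), taken from the values quoted in the text.
reported = [
    ("equal-weights consensus", 0.46, 0.55, 0.25),
    ("RC-weighted parsimony", 0.56, 0.59, 0.33),
]

for label, ci, ri, rc in reported:
    computed = rescaled_consistency(ci, ri)
    # Inputs are rounded to two decimals, so allow a small tolerance.
    consistent = abs(computed - rc) < 0.01
    print(f"{label}: CI*RI = {computed:.3f}, reported RC = {rc:.2f}, consistent = {consistent}")
```

Both reported RC values agree with the product of the corresponding CI and RI to two decimal places.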

Fig. 1
Maximum parsimony tree of 87 languages based on 92 lexical and 52 grammatical characters. Numbers indicate bootstrap values (>50). Language names are adopted from Y. Bastin et al. (1999) and their alphanumerical codes provide information about the geographic location of the language [15]

As expected, all languages in the A zone (Cameroon) are placed at the base of the tree. The clade consisting of all other Bantu languages (B–S zones) was supported (bootstrap BS=72 and Bremer index Br=2). At the base of this clade, there are offshoots belonging to the C and D zones (N and NE Congo-Kinshasa), three of which are probably monophyletic. The rest of the Bantu languages belong to a well-supported superclade (BS=61 and Br>3). It contains several languages of zone D (E Congo-Kinshasa), L (SE Congo-Kinshasa, Br>3), M (S Congo-Kinshasa, Zambia), and a well-supported clade of Eastern and Southern Bantu languages (BS=59 and Br=2) covering the zones E, F, G, J, M, N, P, and S (from Uganda and Kenya to South Africa). The remaining languages of the superclade form a weakly supported western clade (Br=1) consisting of several clades covering the zones K, L, and R (SW Congo-Kinshasa, Angola, Zambia) and a single well-supported clade (BS=68 and Br>3) of the zones H and B (Congo-Brazzaville, SW Congo-Kinshasa).

Almost all statistically supported clades revealed by maximum parsimony were corroborated by the alternative Bayesian methodology (Fig. 2). The differences in tree topology concern the position of a few languages within the C–D zones. The superclade remained well supported (posterior probability P=95%). The basalmost offshoots of the superclade formed a single basal clade (D zone, P=99%); the remaining ones joined the western clade (zones L, M, and D28, P=93%).

Fig. 2
Bayesian tree of 87 languages. Numbers indicate posterior probabilities. Language names are adopted from Y. Bastin et al. (1999) and their alphanumerical codes provide information about the geographic location of the language [15]

Discussion

Our results suggest a scenario of Bantu expansion consisting of the following steps (see Fig. 3): (1) an initial radiation in Cameroon (A zone); (2) subsequent branching in the rainforest areas of Congo-Kinshasa (C and D zones); (3) the main radiation somewhere in SE Congo-Kinshasa, west of Lake Tanganyika (D and possibly L and M zones); (4) a westward spread from this area to the K, R, H, and B zones; and (5) migration from the area of the main radiation to E and S Africa (J, F, E, G, N, P, and S zones).

Fig. 3
Geographical distribution of studied languages. The individual clades are visualized by symbols. Open asterisks denote the putative initial radiation of Bantu languages; solid asterisks the subsequent branching in the rainforest; solid squares the basalmost clade of the “main radiation” (small ellipse); solid triangles, solid circles, and patterned squares other basal clades of the main radiation (big ellipse) placed to the western branch by Bayesian analysis; open circles the western clade; and open squares the eastern and southern Bantu clade

This scenario disagrees with the results of a previous phylogenetic analysis (C.J. Holden 2002) and the current interpretations of Bantu migration (J.L. Newman 1995; J. Vansina 1995) on one important point. Our analyses support the existence of a monophyletic superclade containing all the Bantu languages found in the territories south and east of the rainforest areas of Congo-Kinshasa. Consequently, none of the western Bantu languages south of the rainforest areas clustered with their northern neighbors. If confirmed, this result would require considerable revision of the scenarios of Bantu migration. Because a monophyletic group containing both northern and southern Bantu languages from Western Africa was not supported, the early split of the western and eastern branches of Bantu languages in the northern rainforest areas (C zone), followed by the migration of the western group to W Congo-Kinshasa, Congo-Brazzaville, and Angola, as proposed by Curtin et al. (1995) and J. Vansina (1990), loses its substantiation. The main phylogenetic signal of our data favors the colonization of Angola, SW Congo-Kinshasa, and surrounding territories from more eastern source areas.

Our scenario is more intuitive than the complex models of Bantu migration proposed by J. Vansina (1995). It requires just a single passage through the rainforest areas and/or Congo Basin, a single acquisition of the new technologies (iron metallurgy and grain cultivation), possibly somewhere around the Great Lakes, and the consequent demographic expansion followed by dispersal throughout the dry-forest and savanna regions of subequatorial Africa. The time and place of acquisition of the above-mentioned technologies are, however, a matter of ongoing discussion and dispute.

Our language tree corroborates the results of the recent cross-cultural linguistic comparison carried out by C. Ehret (1998). Our clade of the Eastern and Southern Bantu languages apparently corresponds to his “Mashariki” group and the four previous offshoots to his “Western savanna,” “Botatwe,” “Sabi,” and “Central savanna” groups, respectively. Moreover, C. Ehret (1998) provides independent evidence supporting close relationships among all the above-mentioned groups, which form a monophylum in our tree.

It should be noted that some ideas of early scholars are congruent with our phylogeographic scenario. M. Guthrie (1962), in his comprehensive study of Bantu languages, concluded that “the bush country to the south of the equatorial forest midway between the two coasts” was a nucleus from which Bantu languages radiated. R. Oliver (1966) combined the opinions of J.H. Greenberg (1966) and M. Guthrie (1962) and proposed a two-step model including an initial radiation in Cameroon/Nigeria, followed by migration to the “nucleus” and subsequent radiation south of the equatorial forest. Also, C. Flight (1988) concluded that savanna Bantu formed a single branch.

The rest of the clades do not greatly contradict the views of recent scholars (Y. Bastin et al. 1999). In general, the affinities of languages at the local scale are well understood, and phylogenetic methodology may not substantially improve the existing classification.

In our previous study (K. Rexová et al. 2003), we demonstrated a good correspondence between the results of cladistics and traditional comparative linguistics. Nevertheless, there are some specific pitfalls in the case of Bantu languages. Neighboring languages are frequently mutually intelligible and do not behave as fully independent units of evolution analogous to biological species. Borrowing and convergence (T.J. Hinnebusch 1999) may obscure the observed pattern of cultural evolution, and the analyses suffer from lowered congruence among the studied characters. Although these processes may explain the lower consistency indices found in phylogenetic analyses of Bantu languages (C.J. Holden 2002; this study) when compared to Indo-European ones (K. Rexová et al. 2003), the methodology used is robust enough to recognize the hidden phylogenetic signal.