Introduction

The East African water system, which includes three great lakes (Tanganyika, Malawi, and Victoria), many smaller lakes, and rivers (Fig. 1), abounds with a variety of fish taxa. The most species-rich among these taxa is the family Cichlidae (suborder Labroidei, order Perciformes [Kaufman and Liem 1982]), in particular, the tribe Haplochromini (Fryer and Iles 1972; Greenwood 1981). Haplochromines of the great lakes and of some of the smaller lakes have been studied extensively by morphological taxonomists and the studies have led to the description of several hundred species (Poll 1986; Greenwood 1981; Seehausen 1996). More recently, the “species flocks” of the great lakes have attracted the attention of molecular evolutionary biologists interested in the nature of the speciation process (reviewed by Meyer 1993). Much less attention has been paid to the riverine species, which appear to be less differentiated morphologically and are less accessible for collection than the lacustrine forms. Yet clues to the origin of the lacustrine species flocks undoubtedly lie in the riverine species, from which the former seem to have arisen. To understand the speciation processes in the great lakes it is therefore imperative to obtain information about the phylogenetic relationships and the distribution of variability among the haplochromine fish in the entire East African water system, both lacustrine and riverine.

Figure 1

River systems and major lakes of East Africa. Arabic numerals indicate localities at which haplochromines investigated in this study were collected. The mtDNA (control region) haplogroups found at these locations are indicated by Roman numerals in parentheses. Altogether, 460 samples were analyzed. The names of the localities are as follows. 1, Pond Apida; 2, Migori River; 3, Lake Manyara; 4, Lake Babati; 5, Lake Chala; 6, Pangani River; 7, Pangani River; 8, Wami River; 9, Wala River; 13, Lupa River; 14, Piti River; 15, Lake Rukwa; 16, Myunga River; 17, Wogo River; 18, Kitilda/Rukwa; 20, Mpanda; 26, Malagarasi River; 27, Malagarasi River; 28, Luiche; 30, Lake Singida; 31, Mpanda River; 32, Muze River; 33, Lake Rukwa; 34, Zimba River; 35, Nakanga/Lake Rukwa; 36, Lake Nabugabo, Lake Kayugi and Lake Kayania; 37, Lake Wamala; 38, Kabagole/Katonga River; 39, Katwe/Lake Edward; 40, Katunguru Bridge/Kazinga channel; 41, Lake Lutoto; 42, Lake Nshere; 43, Kashaka/Lake George; 44, Kasenyi/Lake George; 45, Butiaba/Lake Albert; 46, Bugoigo, Lake Albert; 47, Buzumu Gulf; 48, Mwanza Gulf; 49, Chamagati Island; 50, Zue Island; 51, Speke Gulf; 52, Muhuru; 53, Rusinga; 54, Anyanga; 57, Yetti; 58, Katavi.

At the molecular level, the first steps in this direction have been taken by the Tübingen group (Sültmann et al. 1995; Sültmann and Mayer 1997; Mayer et al. 1998; Nagl et al. 2000, 2001). These studies were based on the use of anonymous nuclear DNA markers (Sültmann et al. 1995; Sültmann and Mayer 1997; Mayer et al. 1998) and on the sequence variation in the control region of mitochondrial (mt) DNA (Mayer et al. 1998; Nagl et al. 2000, 2001). The use of the nuclear DNA markers established the recent Lake Tanganyika flocks as the phylogenetically oldest in East Africa, followed by some of the riverine forms and the Lake Malawi flocks; phylogenetically youngest appear to be some other riverine forms, haplochromines of the Lake Edward region (Lakes Edward, George, Albert, and others), and, especially, the flocks of Lake Victoria (see also Meyer et al. 1990; Meyer 1993; Zardoya et al. 1996; Takahashi et al. 2001). These findings are in agreement with the conclusions drawn from morphological studies (Stiassny 1991). The studies based on sequencing of the mtDNA control region revealed the existence of seven major haplogroups (groups I–VII, not counting those found in the haplochromines of Lakes Tanganyika and Malawi) distinguished by 7 to 41 diagnostic substitutions (Nagl et al. 2000). The haplogroups are differentially distributed in East Africa. Groups I and VII are found in the Malagarasi region, groups II and IV in the Lake Rukwa region, and groups II, III, and VI in the Pangani and Wami regions (Fig. 1). Group V is distributed throughout East Africa, but it is the only haplogroup found in the northern part, in the Lake Victoria and Lake Edward regions. This haplogroup is differentiated further into at least four subgroups (VA through VD), of which VA is restricted to southern East Africa (Lake Rukwa region), VB to the Lake Edward region, and VC and VD to Lake Victoria and its surroundings.

These data, specifically the distribution of the subgroups of haplogroup V, suggest that the Lake Victoria flock is not, contrary to the initial claim of Meyer et al. (1990), monophyletic and that it has not, contrary to the suggestion made by Booton et al. (1999), descended directly from fish of the Lake Edward region. However, this interpretation relies on only a few informative substitutions in sequences at a single locus. Recently, the origin of the Lake Victoria flock was inferred from mtDNA (Verheyen et al. 2003) and from nuclear markers (Seehausen et al. 2003). The two studies reached different conclusions, however, and incomplete lineage sorting makes the origin difficult to infer (Kocher 2003). In the present study, therefore, we tapped another source of phylogenetic information: the short interspersed elements (SINEs).

SINEs belong to the nonviral superfamily of retroposons, which are derived from cellular RNA by retroposition involving reverse transcriptase provided by long interspersed elements (LINEs) (Kajikawa and Okada 2002). In fish and many other eukaryotes, the cellular RNA is provided by tRNA (Okada 1991). Earlier work from the Okada laboratory has validated the SINE insertion method for phylogenetic inference (see Shedlock and Okada 2000 for review). Studies of salmon phylogeny (Murata et al. 1993) and of the origin and phylogeny of whales (Shimamura et al. 1997; Nikaido et al. 1999, 2001) provided clear answers to long-standing phylogenetic questions by using well-characterized fixed SINE loci. In addition, SINE loci that are not yet fixed in a population are useful markers for the inference of population structure, as shown by the study of charr distributed in the Northern Hemisphere (Hamada et al. 1998). In the latter case, especially when species radiated successively within a time span shorter than that required for fixation of a SINE in a population, polymorphisms and trans-specific variation in the presence or absence of a SINE are observed (Hamada et al. 1998).

The cichlid fish have been shown to possess a distinct family of SINEs designated AFC (Takahashi et al. 1998). In the present study, we identify new SINE loci of the AFC family and use them to gain insight into the events that led to the emergence of the Lake Victoria haplochromine species flocks. The data presented here combine, in a single study, the two aspects of the method described above: the determination of phylogeny by fixed SINE loci and the inference of population structure by unfixed SINE loci.

Materials and Methods

Source and Isolation of DNA

Fish were collected during expeditions in 1993, 1995, 1996, and 1998; a few were kindly provided by Dr. Lothar Seegers (Dinslaken, Germany). Voucher specimens of the collected species have been deposited at the Nationaal Natuurhistorisch Museum, Leiden, The Netherlands; the Musée Royal de l’Afrique Centrale, Tervuren, Belgium; and the Max-Planck-Institut für Biologie, Tübingen, Germany. Since the taxonomy of the riverine haplochromines remains unresolved, these samples are referred to only by the localities of their origin (Fig. 1). Pieces of fins from the collected specimens were stored in 70% ethanol and genomic DNA was isolated with the QIAamp Tissue Kit (Qiagen, Hilden, Germany). Contaminating RNA was removed by digestion with RNase A (30 min at 37°C) and genomic DNA was then isolated by phenol/chloroform extraction.

Construction and Screening of Genomic Libraries, Subcloning, and Sequencing

Genomic DNAs isolated from Neochromis nigricans and Ptyochromis xenognathus were digested with EcoRI endonuclease and the digests were size-fractionated by sucrose density gradient (10–40%, w/v) centrifugation. DNA fragments 2 to 4 kb in length were ligated to the arms of the λgt10 phage vector (Stratagene, Amsterdam-Zuidoost, The Netherlands) and packaged in vitro using MaxPlax Lambda Packaging Extract (Epicentre Technology, Madison, WI). From the libraries thus obtained, cichlid SINEs of the AFC family were isolated and characterized in the following way. Fragments corresponding to nucleotides 91–179 of the SINE AFC family were amplified by the polymerase chain reaction (PCR) using the primers F1 (5′-TCCTTGGGCAAGACACTTCAC-3′) and R1 (5′-ACTGACAGAGGCGAGGCTGCC-3′). Amplification was carried out in the PTC-100 Programmable Thermal Controller (MJ Research, Biozym, Hessisch Oldendorf, Germany). Genomic DNA (100 ng) was added to a reaction mixture containing 1× ExTaq PCR buffer, 0.2 mM of each of the four deoxynucleoside triphosphates, 0.1 µM of primers, and 1 unit of ExTaq polymerase (Takara, Shiga, Japan). The PCR program consisted of denaturation for 3 min at 94°C followed by 20 cycles, each cycle consisting of 30 s denaturation at 94°C, 30 s annealing at 55°C, and 30 s extension at 72°C. The PCR products were then isolated with the QIAquick PCR Purification Kit (Qiagen) and used as probes for screening the libraries by hybridization using GeneScreen Plus membranes (NEN Life Science Products, Boston). The probes were labeled by primer extension in the presence of [α-32P]dCTP. Hybridizations were carried out overnight at 42°C in a solution containing 50% (v/v) formamide, 6× SSC (1× SSC is 0.15 M NaCl and 0.015 M trisodium citrate, pH 7), 1% (w/v) SDS, 2× Denhardt’s solution [1× Denhardt’s solution is 0.02% (w/v) Ficoll 400, 0.02% (w/v) polyvinylpyrrolidone, and 0.02% (w/v) bovine serum albumin], and 100 µg herring sperm DNA/ml. The filters were washed in 2× SSC plus 1% SDS at 60°C for 1 h. Positive phage clones were isolated and their inserts were subcloned into pUC18 or pUC19 vectors. The inserts were sequenced using the LI-COR 4000 DNA sequencer and primers that corresponded or were complementary to the consensus sequence of the AFC family. The loci at which SINEs were integrated were designated by serial numbers: 1300 or 1800 for N. nigricans and 1400 or 1900 for P. xenognathus.

Once individual SINEs had been identified and their flanking sequences determined, primers specific for the 5′ and 3′ flanking regions were designed (Table 1) and used to amplify the SINEs by 30 cycles of PCR. The PCR products were analyzed by electrophoresis in 1.5% (w/v) NuSieve GTG and 1% (w/v) Seakem GTG agarose gels (FMC BioProducts, Rockland). For sequencing, the PCR products were purified by the QIAquick PCR Purification Kit (Qiagen). Direct sequencing was performed using the DNA Sequencing Kit (Applied Biosystems, Foster City, CA) and the Applied Biosystems Automated 310 or 3100 Sequencer. Sequencing reactions were carried out in PTC-100 Programmable Thermal Controllers. The purified PCR products (100–500 ng) were added to 5 pmol of a sequence primer and 1× Applied Biosystems sequencing solution. The sequencing reaction program consisted of denaturation for 3 min at 94°C, followed by 25 cycles, each cycle consisting of 30 s denaturation at 94°C, 30 s annealing at 50°C, and 2 min extension at 60°C.

Table 1 PCR primers used

Sequence and Phylogenetic Analysis

Sequences were aligned by using the CLUSTALW 1.82 program (Thompson et al. 1994) and were then modified by visual inspection. The nucleotide sequences were deposited in GenBank under accession numbers AB101301–AB101418.

Test for Deviation from Hardy–Weinberg Equilibrium

At each SINE locus, the presence and absence of the SINE insertion were treated as two alleles. The frequencies of the presence and absence alleles ($p$ and $q$, respectively) were calculated as $p = (2n_{++} + n_{+-})/[2(n_{++} + n_{+-} + n_{--})]$ and $q = 1 - p$, where $n_{++}$, $n_{+-}$, and $n_{--}$ indicate the numbers of individuals with the SINE on both homologous chromosomes, on only one chromosome, or on neither chromosome, respectively. The deviation from Hardy–Weinberg expectation was tested by a chi-square test with the statistic $\chi^2 = (n_{++} - np^2)^2/(np^2) + (n_{+-} - 2pqn)^2/(2pqn) + (n_{--} - nq^2)^2/(nq^2)$, where $n = n_{++} + n_{+-} + n_{--}$. The 5% significance value was obtained from the chi-square table (df = 1; critical value 3.84) for cases in which the sample size was larger than or equal to 30. When the sample size was smaller than 30, a permutation test was conducted. In the permutation test, the number of samples and the number of each allele were fixed and the genotypes assigned to each individual were randomized. The 5% cutoff value was obtained from 1000 randomization replicates.
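The test described above can be sketched in Python as follows. This is a minimal illustration, not the authors' program: the function names are hypothetical, the permutation step is one plausible reading of "the genotypes assigned to each individual were randomized" (allele copies are shuffled and re-paired into individuals), and terms with zero expected count are skipped to avoid division by zero, which the text does not discuss.

```python
import random


def allele_freqs(n_pp, n_pm, n_mm):
    """Return (p, q): frequencies of the SINE-present and SINE-absent alleles."""
    n = n_pp + n_pm + n_mm
    p = (2 * n_pp + n_pm) / (2 * n)
    return p, 1 - p


def hw_chi2(n_pp, n_pm, n_mm):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions."""
    n = n_pp + n_pm + n_mm
    p, q = allele_freqs(n_pp, n_pm, n_mm)
    observed = (n_pp, n_pm, n_mm)
    expected = (n * p * p, 2 * p * q * n, n * q * q)
    # Assumption: genotype classes with zero expectation are skipped.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def hw_permutation_cutoff(n_pp, n_pm, n_mm, replicates=1000, seed=0):
    """Approximate 5% cutoff of the statistic with allele counts held fixed."""
    rng = random.Random(seed)
    # 1 = SINE present, 0 = SINE absent; one entry per chromosome copy.
    alleles = [1] * (2 * n_pp + n_pm) + [0] * (2 * n_mm + n_pm)
    stats = []
    for _ in range(replicates):
        rng.shuffle(alleles)
        counts = [0, 0, 0]  # index = number of '+' alleles in an individual
        for k in range(0, len(alleles), 2):
            counts[alleles[k] + alleles[k + 1]] += 1
        stats.append(hw_chi2(counts[2], counts[1], counts[0]))
    stats.sort()
    return stats[int(0.95 * (replicates - 1))]


# Hypothetical counts: 50 (+/+), 0 (+/-), 50 (-/-) individuals.
print(hw_chi2(50, 0, 50) > 3.84)  # complete heterozygote deficit
```

For samples of 30 or more, the statistic would be compared with 3.84; for smaller samples, with the value returned by `hw_permutation_cutoff`.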

Phylogenetic Relationship of SINE 1357 Sequences

The phylogenetic relationship of 17 unique sequences was determined by taking into account the changes at each site, based primarily on the parsimony principle. The type A sequence of A. alluaudi was assumed to be the outgroup of all the other sequences. The sequences from Lake Malawi are thought to form a monophyletic cluster. Considering only the D-, E-, and F-type sequences, there are six parsimony-informative sites: 31, 115, 147, 155, 254, and 267. The changes at sites 115, 147, 254, and 267 were congruent with one another. The changes at site 31 were incongruent with those at these four sites, and the changes at site 155 were incongruent with those at site 267. The changes at sites 31 and 155 are, however, presumably recurrent. Therefore, sites 31 and 155 were excluded from the determination of the phylogenetic relationships of the sequences, except for the grouping of D3 and D4 by a G→C change at site 31.
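As an illustration of the site screening described above, the following Python sketch finds parsimony-informative sites in an alignment and checks whether two sites are mutually congruent. It uses the standard definition of an informative site (at least two states, each present in at least two sequences) and a pairwise compatibility check (the four-gamete criterion), which is an assumption here; the text does not state how congruence was assessed. The sequences are toy data, not SINE 1357 sequences.

```python
from collections import Counter


def parsimony_informative_sites(seqs):
    """Return 1-based positions where >= 2 states each occur in >= 2 sequences."""
    length = len(next(iter(seqs.values())))
    sites = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs.values() if s[pos] != "-")
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos + 1)
    return sites


def sites_congruent(seqs, site_a, site_b):
    """Two binary sites are congruent if fewer than four state pairs occur."""
    pairs = {(s[site_a - 1], s[site_b - 1]) for s in seqs.values()}
    return len(pairs) < 4


# Toy alignment (hypothetical sequences).
toy = {"s1": "ACGT", "s2": "ACGA", "s3": "ATGA", "s4": "ATCT"}
print(parsimony_informative_sites(toy))  # [2, 4]
print(sites_congruent(toy, 2, 4))        # False
```

Sites that fail the congruence check against several otherwise consistent sites are candidates for recurrent change, which is the reasoning applied to sites 31 and 155.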

Computation of the Divergence Times of the SINE 1357 Sequences

If X and Y are the two descendant clusters of an interior node (node 1 or 2), the average distance ($D$) between the sequences belonging to the two clusters was computed as

$D = \sum_{i \in X} \sum_{j \in Y} d_{ij} / (mn)$,

where $d_{ij}$ is the distance between sequences $i$ and $j$, which belong to clusters X and Y, respectively, and $m$ and $n$ are the numbers of sequences that belong to clusters X and Y, respectively. When the average distance at node 3 (Fig. 6) was computed, D1 was considered to be the sequence at this interior node, and all other D, E, and F types were considered to be derived from D1. The average distance at node 3 was computed as $D = \sum_{j} d_{ij}/n$ ($i$ = D1; $j$ = D2, D3, …, F7), where $n$ is the number of D-, E-, and F-type sequences except for D1 (Takeza et al. 1995).
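A minimal Python sketch of this average-distance calculation is given below. The function name, the dictionary layout, and the distance values are hypothetical; the text does not say which distance measure was used for the pairwise values, so they are simply taken as given.

```python
def average_distance(dist, cluster_x, cluster_y):
    """Mean pairwise distance between all sequences of X and all sequences of Y.

    dist maps frozenset({i, j}) to the distance between sequences i and j.
    """
    total = sum(dist[frozenset((i, j))] for i in cluster_x for j in cluster_y)
    return total / (len(cluster_x) * len(cluster_y))


# Hypothetical pairwise distances between two small clusters.
dist = {
    frozenset(("x1", "y1")): 0.010,
    frozenset(("x1", "y2")): 0.030,
    frozenset(("x2", "y1")): 0.020,
    frozenset(("x2", "y2")): 0.040,
}
print(average_distance(dist, ["x1", "x2"], ["y1", "y2"]))
```

The node 3 case corresponds to calling the same function with a single-sequence cluster, `["D1"]`, as X and the remaining D-, E-, and F-type sequences as Y.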

Figure 6

Network dendrogram of SINE 1357 sequences. The average distances computed for nodes 1, 2, and 3 were 0.012 (±0.003), 0.007 (±0.002), and 0.004 (±0.002), respectively. Assuming that the separation of Lake Malawi sequences (B and C) from those of Lake Victoria and riverine species occurred 2 my ago and that the number of nucleotide changes increases proportionally with time, the divergence times for nodes 1 and 3 were estimated as 3.36 (±0.95) and 1.30 (±0.52) my ago, respectively. ID, indels. Arrows between capital letters indicate changes at specified nucleotide sites. A, A. alluaudi; B, M. melanopterus, S (specimen) 5554, Lake Malawi; C, Stigmatochromis woodi, S 5593, Lake Malawi; D1, S 18, L (locality) 57; S 8688, L 41 (VB; mitochondrial haplogroup [Nagl et al. 2000]); S 8711, L 40 (VB); S 8712 and 8717, L 40 (VB, VII)*; S 8762 and 8767, L 39; S 1595, L 6 (II§); S 8989, L 45 (VB)*; S 8784, L 44 (VB, VII)*; S 9401, L 34 (IV)*; S 8632, L 37 (VC); S 9334, L 18 (IV)*; D2, S 8903, L 43 (VII); D3, S 179, L 48 (VC); S 320 allele 2, L. Victoria (VC); S 9354, L 33 (II)*; D4, S 348 allele 2, L 48 (VC)*; D5, S 8988, L 45 (VB)*; E1, S 131, L 57; N. nigricans, L. Victoria, F3 individual born in Tübingen; S 9044, L 46 (VB)*; S 334, L 48 (VC)*; E2, S 9281, L 30 (VC)*; F1, S 1738, L 5 (VI); S 8783, L 44 (VB,VII)*; F2, S 9201, L 4 (VI)*; S 8676, L 38 (VC§); F3, S 1511, L 20 (IV§); F4, S 183, L 48 (VC)*; F5, S 348 allele 1, L. Victoria (VC); F6, S 320 allele 1, L 57 (VC); F7, S 326, L 57 (VC); S 1605, L 16 (VA). * indicates that the mitochondrial haplogroup is not known for the specimen and that the type found in the same locality is shown. § indicates that the mtDNA types were obtained by W.E.M. (unpublished data).
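The proportional dating described in the caption above can be sketched as a simple rescaling. The sketch assumes that node 2 is the calibration point (the Lake Malawi split, 2 my ago); the function name is hypothetical. With the rounded distances quoted in the caption, the results come out near 3.4 and 1.1 my, close to but not identical with the reported 3.36 and 1.30 my, presumably because the published estimates were computed from unrounded distances.

```python
def divergence_time(d_node, d_calibration, t_calibration_my=2.0):
    """Scale a node's average distance to time, assuming a constant rate."""
    return t_calibration_my * d_node / d_calibration


# Rounded average distances from the caption; node 2 taken as calibration.
d_node1, d_node2, d_node3 = 0.012, 0.007, 0.004
print(round(divergence_time(d_node1, d_node2), 2))  # node 1, in my
print(round(divergence_time(d_node3, d_node2), 2))  # node 3, in my
```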

Results

Strategy. To obtain information about the phylogenetic relationships among East African haplochromine cichlids by using SINEs as markers, we applied two strategies. The first was to seek new SINEs which were inserted after the divergence of the species flocks in the three great lakes and to determine their distribution among the fish sampled from different East African localities. New SINEs were obtained by the procedure described under Materials and Methods. They were distinguished by their unique flanking region sequences corresponding to the different integration sites. The presence or absence of the individual SINEs was determined by PCR using primers specific for the flanking regions. The appearance of a single longer PCR product, corresponding in size to the combined length of the particular SINE and of the two flanks, was taken as evidence for the presence of two copies of the SINE in the genome of the tested individual (a +/+ homozygote; Fig. 2). The appearance of a single shorter PCR product corresponding in size to the length of the flanking regions alone indicated an absence of the SINE in both chromosomes of an individual (a −/− homozygote; Fig. 2). And the appearance of both the longer and the shorter PCR products was evidence for the presence of the SINE on one chromosome and its absence on the homologous chromosome (a +/− heterozygote; Fig. 2).
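The scoring rule described above maps each individual's band pattern to a genotype. A minimal Python sketch follows; the function name and the gel readings are hypothetical, and the handling of a failed reaction (no band at all) is an assumption, since the text does not cover that case.

```python
from collections import Counter


def call_genotype(has_long_band, has_short_band):
    """Translate the presence of the long and short PCR products to a genotype."""
    if has_long_band and has_short_band:
        return "+/-"  # SINE on one chromosome only
    if has_long_band:
        return "+/+"  # SINE on both chromosomes
    if has_short_band:
        return "-/-"  # SINE on neither chromosome
    return None       # assumption: no product means the PCR failed


# Hypothetical gel readings as (long band seen, short band seen).
readings = [(True, False), (True, True), (False, True), (True, True)]
genotypes = Counter(call_genotype(long, short) for long, short in readings)
print(genotypes["+/+"], genotypes["+/-"], genotypes["-/-"])  # 1 2 1
```

The three genotype counts are the quantities needed for the allele-frequency and Hardy–Weinberg calculations in Materials and Methods.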

Figure 2

An example of PCR results with primers specific for flanking region of SINE 1801. +/+, +/−, and −/− indicate the presence of SINE on two, one, or no chromosomes, respectively. Individuals of the indicated Lake Victoria region species are distinguished by numbers.

In the second strategy, each of the identified SINEs was sequenced in a small sample of distantly related individuals. The SINE that showed the highest variability among these individuals was then chosen for a large scale sequencing of fishes collected at different localities in East Africa. Here the aim was to identify lineage-specific, diagnostic phylogenetically informative substitutions. Since our interest was primarily in the origin of the Lake Victoria flock, we focused on a collection of fish from Lake Victoria and its surroundings, the Lake Edward region, and the East African rivers (Fig. 1). The collection overlapped to a large extent with that used by Nagl et al. (2000) so that the results of the two studies could be compared.

Distribution of SINEs in East African haplochromines. Screening of N. nigricans and P. xenognathus genomic libraries with AFC-family SINE probes identified 250 loci, all of which were analyzed and characterized in terms of the time of retroposition. Among them, 16 were shown to merit further study. The other loci were not informative because SINE sequences were inserted into the genomes of all species used in this study or we failed to amplify the loci by PCR. Primers specific for the flanking regions of the 16 loci were used to determine the distribution of the loci by PCR in a panel of fish (Table 2) which included samples from Lakes Tanganyika, Malawi, and Victoria, as well as East African riverine species and the nonendemic Astatoreochromis alluaudi (Greenwood 1959). None of the 16 SINEs were found in the Lake Tanganyika fish; 4 SINEs (1357, 1802, 1805, and 1844) were found to be shared by all tested fish except those from Lake Tanganyika; 6 SINEs (1303, 1304, 1306, 1327, 1807, and 1823) were shared by all haplochromines except Lake Tanganyika specimens and A. alluaudi; 2 SINEs (1350 and 1840) were shared by all except those from Lake Tanganyika, A. alluaudi, and Lake Malawi fishes (Fig. 3). These 12 SINEs appeared to be fixed in the population in which they were found. Their distribution can be used to draw a cladogram (Fig. 3) which suggests that in the evolution of the East African haplochromines, the Lake Tanganyika flock was the first to split off from the stem lineage, followed by A. alluaudi, then by fish of the Lake Malawi flock, and, finally, by the rest of the taxa, both riverine and lacustrine. This interpretation is in agreement with previously published phylogenies of East African haplochromine cichlids, although the mitochondrial sequence data tend to place A. alluaudi outside the Tropheini species of Lake Tanganyika (e.g., Sültmann et al. 1995; Mayer et al. 1998; Salzburger et al. 2002). 
In the 1807 elements, a deletion of 216 bp was observed (designated 1807d in Fig. 4). This deletion and the remaining 4 of the 16 SINEs (1424, 1801, 1909, and 1918) were found to be restricted to some riverine and lacustrine species, but they were not fixed in these taxa (Figs. 2 and 3). Instead they occurred as polymorphisms in most of the populations in which they were found to be present (i.e., they were present in some individuals but not in others in these populations; Fig. 2 and Table 3). Moreover, one of the four SINEs was found to occur in two forms: some of the 1801 elements (designated 1801i) contained an insertion of 274 bp (Fig. 4).
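The logic that turns the distribution of the 12 fixed SINEs into a branching order can be sketched as follows. The carrier sets are taken from the distribution reported above; the group labels, variable names, and functions are hypothetical. The sketch checks that the carrier sets are nested (each is a subset or superset of every other) and then orders the groups by the number of fixed insertions they carry, so that groups lacking more insertions are read as having split off earlier.

```python
GROUPS = ["Lake Tanganyika", "A. alluaudi", "Lake Malawi", "other haplochromines"]

# SINE locus -> set of groups in which the insertion was found (from the text).
CARRIERS = {}
for sine in (1357, 1802, 1805, 1844):
    CARRIERS[sine] = {"A. alluaudi", "Lake Malawi", "other haplochromines"}
for sine in (1303, 1304, 1306, 1327, 1807, 1823):
    CARRIERS[sine] = {"Lake Malawi", "other haplochromines"}
for sine in (1350, 1840):
    CARRIERS[sine] = {"other haplochromines"}


def is_nested(carriers):
    """True if every pair of carrier sets is related by inclusion."""
    sets = list(carriers.values())
    return all(a <= b or b <= a for a in sets for b in sets)


def branching_order(groups, carriers):
    """Order groups from the earliest to the latest split from the stem."""
    return sorted(groups, key=lambda g: sum(g in s for s in carriers.values()))


print(is_nested(CARRIERS))
print(branching_order(GROUPS, CARRIERS))
```

A non-nested pattern would signal conflicting insertions, for example through incomplete lineage sorting, and would not support a single cladogram.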

Figure 3

Cladogram indicating the distribution of 16 SINEs in the groups of East African haplochromines. Numbers above the arrowheads represent the SINE loci integrated at different sites; the arrowheads indicate the integration events. SINEs marked by an asterisk were found to be segregating for presence or absence in the clade.

Figure 4

Schematic diagram of 1357, 1801, 1909, and 1807 SINE organization. Approximate positions of indels (ID), insertions, and a deletion are indicated. Numbers on segments and bars are distances in base pairs. Positions of PCR primers are specified by arrows.

Table 2 Distribution of “fixed” SINEs among cichlid fish tested
Table 3 Distribution and frequencies of polymorphic SINEs in East African haplochromines

The distribution of the five polymorphic SINEs and their variants was then determined by PCR-typing a panel of 460 specimens collected from Lake Victoria and its immediate surroundings, the Lake Edward region, and the East African rivers (Table 3). Three (1424, 1801, and 1918) of the five SINE elements in Table 3 were found to be widely distributed across the entire region in terms of different localities (both lacustrine and riverine), species (where they could be identified morphologically), and the various mtDNA haplogroups. Generally, where they were not found, the sample size might have been too small for their detection. The 1424 element was present in a homozygous condition in all 130 individuals sampled from Lake Victoria. It may therefore be fixed in the endemic haplochromines of this lake. The frequencies of three SINEs (1801, 1807, and 1918) were around 70% in the Lake Victoria endemics, and at the various localities they ranged widely (Table 3). The fourth polymorphic SINE, the 1909 element, was found only in the endemic haplochromines of the Lake Victoria, Lake Albert, and Lake Edward regions; it could not be detected at any other locality. Similarly, the element was found only in fish of the mtDNA haplogroups V and VII and not at all in fish bearing any of the other five haplogroups. As for the variants, the 1807d element was found to be present in the Lake Victoria, Lake Edward, and Lake Rukwa regions, as well as in the rivers Katavi, Muze, and Zimba (Table 3). SINE 1801i was detected only in Lake George (1.2%) and the Kazinga Channel (4.5%).

In cases in which the morphologically defined species could be identified with some degree of confidence in the Lake Victoria region, all the species could be shown to be polymorphic for the 1807d variant and the 1801 and 1918 SINEs. The 1909 element was polymorphic in five Lake Victoria species (Paralabidochromis plagiodon, Haplochromis rockkribensis, H. velvet black, Ptyochromis sauvagei, and P. xenognathus), whereas eight other species in the Lake Victoria region appeared to lack this SINE. At the various localities, any of the following three situations could occur: a SINE could be present in all individuals from that locality, absent in all individuals, or present in only some individuals (Table 3). Overall, significantly fewer SINE heterozygotes were found than might have been expected from the frequencies with which the elements occurred in the combined sample (Table 3). This deficiency is probably attributable to the division of the population into many, often small, relatively isolated subpopulations. In the isolates, fixations or losses of different polymorphic SINEs may occur relatively rapidly, while occasional mixing of the subpopulations may restore the polymorphisms. In many localities, including Lakes Victoria, Wamala, Edward, and Albert, significant deviation from the Hardy–Weinberg equilibrium was observed (Table 3). In all the significant cases, the heterozygote frequencies were reduced. This result indicates that in many cases a population structure exists even within a locality, because the population is subdivided either physically or by reproductive barriers between different species.
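The link between population subdivision and a heterozygote deficit can be illustrated numerically. In population genetics this is usually called the Wahlund effect, a term the text itself does not use. The Python sketch below pools subpopulations that are each in Hardy–Weinberg proportions internally but differ in SINE frequency; the function name, subpopulation sizes, and frequencies are hypothetical.

```python
def pooled_heterozygosity(subpops):
    """Compare observed and expected heterozygosity in a pooled sample.

    subpops is a list of (size, p) pairs, where p is the frequency of the
    SINE-present allele and each subpopulation is in Hardy-Weinberg
    proportions internally.
    """
    total = sum(n for n, _ in subpops)
    observed = sum(n * 2 * p * (1 - p) for n, p in subpops) / total
    p_bar = sum(n * p for n, p in subpops) / total
    expected = 2 * p_bar * (1 - p_bar)
    return observed, expected


# Two hypothetical isolates of equal size with contrasting SINE frequencies.
obs, exp = pooled_heterozygosity([(50, 0.9), (50, 0.1)])
print(round(obs, 2), round(exp, 2))  # 0.18 0.5
```

The pooled sample contains fewer heterozygotes than its overall allele frequency predicts, and the deficit grows with the variance of allele frequencies among the isolates.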

Comparisons of SINE Sequences

The presence or absence of SINEs is one source of phylogenetic information; another is the evolutionary divergence of each of the SINEs in the positive individuals since its insertion in the haplochromine genome. To tap the latter source, we first carried out preliminary tests to determine the extent of the sequence variation of the different SINEs. To this end, in one experiment we sequenced six SINE loci (1318, 1319, 1322, 1327, 1329, and 1411) in two species, one from Lake Malawi (Melanochromis auratus) and the other from Lake Victoria (H. nyererei). In a second experiment, we sequenced eight SINEs (1303, 1304, 1306, 1327, 1357, 1802, 1823, and 1832) in seven individuals: Labidochromis caeruleus from Lake Malawi and one individual from six of the seven mtDNA haplogroups (II through VII). Based on the information thus generated, we then chose one SINE (1357) for an in-depth study in which we sequenced 32 individuals from Lake Victoria and 18 other localities, 2 individuals from Lake Malawi, and 1 A. alluaudi individual. Two individuals generated two different alleles. In total, 39 sequences were obtained. The data set obtained in this study is discussed below; the implications drawn from the other set of sequences are described in the next section.

The optimization of the alignment of the 39 SINE 1357 sequences requires the introduction of six indels (ID1 through ID6): one (ID2) in the 5′ flanking region, three (ID1, ID5, and ID6) in the region between the core and the LINE-related part of the element, and two (ID3 and ID4) in the LINE-related part (Figs. 5 and 6). The indels serve as markers differentiating clades of sequences: ID1 (7 bp long) distinguishes A. alluaudi from the rest of the sequences; ID2 (5 bp), ID3 (21 bp), and ID4 (6 bp) separate A. alluaudi sequences from the rest; and ID5 (10 bp) and ID6 (6 bp) distinguish the A. alluaudi + Lake Malawi clade from the rest. This distribution of indels is again consistent with the phylogeny of East African haplochromines mentioned earlier.

Figure 5

Distribution of substitutions at differential sites (numbers at the top should be read vertically downward) of SINE 1357. Dashes (—) indicate identity with the consensus nucleotide at the top; asterisks, indels (ID). The full sequence giving the numbering used is presented in the supplementary on-line material.

The phylogenetic relationship among the 14 alleles (D1, D2,…, F7) shown in Fig. 6 is one of many possibilities of how these alleles were generated. It is indeed one of the most parsimonious pathways that explain the nucleotide changes among the 14 alleles, but we assumed many recurrent nucleotide changes at some sites (five changes at site 31 and two changes at site 155) rather arbitrarily because these sites are near nucleotide repeats. Although the relationships of the different alleles in Fig. 6 do not assume any recombination among them, it is possible that recombination, together with mutation, played a role in generating them. Whether or not recombination occurred among these alleles, a few steps are necessary for them to be generated from a previous state. Further, the 14 alleles found in Lake Victoria and the neighboring river systems and lakes are quite different from those found in Lake Malawi, from which they are distinguished by two indels and three nucleotide changes, and many of them are shared by fish with different mtDNA types, which appear to have diverged more than 1 million years ago and are distributed in various localities. Like the mtDNA sequences (Nagl et al. 2000), the SINE 1357 sequences reveal the existence of allelic lineages at the studied locus. There are, however, conspicuous differences between the mtDNA and the nuclear lineages. First, within each locality, the mtDNA lineages are more homogeneous than the SINE 1357 lineages. Thus, all the endemic Lake Victoria haplochromines tested fall into a single mtDNA haplogroup (V) divisible into two subgroups (VC and VD [Nagl et al. 2000]). In contrast, a much more limited sampling of the Lake Victoria endemics revealed the existence of at least 14 SINE 1357 lineages (Fig. 6). Second, the nuclear SINE 1357 lineages have, as expected, longer coalescence times (Fig. 6) than the mtDNA lineages. The SINE 1357 lineages are also more widely distributed among the East African haplochromines than the mtDNA lineages.
The presence of diverse SINE 1357 lineages in Lake Victoria endemics and in various other lakes and rivers in East Africa indicates that these lineages must have already been present in the founding population of Lake Victoria. This observation implies, in turn, that the founding population could not have been small.

Discussion

Evolution of SINEs and Their Use as Phylogenetic Markers. Among haplochromines bearing the different mtDNA haplogroups, the five polymorphic SINEs are distributed as follows: 1807d in haplogroup V only; 1909 in groups V and VII; 1801 in groups III, IV, V, and VII; and 1424 and 1918 in haplogroups II through VII. Since haplogroup V is estimated to have diverged from the remaining six haplogroups 1.3 to 1.5 my ago and the subgroups VB and VC are similarly estimated to have separated from each other 150,000 to 220,000 years ago (Nagl et al. 2000), the SINE 1424, 1801, 1909, and 1918 insertions must have occurred in the interval between the separation of Lake Malawi species (2 my ago) and the separation of group V from the rest of the haplogroups (1.3 to 1.5 my ago). SINE 1807d, which is found only in group V, must have arisen after the separation of group V from the other haplogroups (1.3 to 1.5 my ago) but before the divergence of the group V subgroups (150,000 to 220,000 years ago). If speciation occurs in a period during which a SINE is polymorphic, the SINE can be expected to behave like any ancestral polymorphism in the speciation phase: it may become fixed in some species and lost in others. Moreover, since the fate of polymorphic SINEs in a population can be expected to be decided primarily by genetic drift, their sorting into the emerging species will be random (Hamada et al. 1998). As a consequence, the distribution of a single SINE among species generated by adaptive radiation within a short period of time (e.g., the haplochromine species of the Lake Victoria flock) is not a reliable marker of their phylogeny. As for other loci, multiple SINEs must be used to neutralize the effect of random sorting of allelic lineages (Takahashi et al. 2001). Alternatively, SINEs can serve as markers to investigate population dynamics by taking advantage of their polymorphism, as was shown by this study.

Evolution of East African Haplochromines

The known SINEs divide the East African haplochromines into four groups, which, for convenience, we refer to as SINE groups 1 through 4 (Fig. 3). In SINE group 1 is the large species flock of haplochromines endemic to Lake Tanganyika. The group contains several lineages that diverged 5 to 10 my ago (Nishida 1997). This flock has been studied extensively by other investigators (Nishida 1991, 1997; Kocher et al. 1993, 1995; Sturmbauer and Meyer 1993; Sturmbauer et al. 1994; Sültmann et al. 1995; Mayer et al. 1998; Takahashi et al. 2001; Salzburger et al. 2002), whereas in the present study it served merely as an outgroup. Fish related to it gave rise to an evolutionary lineage from which the other three SINE groups split off successively. The first to diverge was the nonendemic, monotypic species A. alluaudi. It was apparently originally restricted in its distribution to Lake Victoria and its satellites, the Lake Edward region, and the rivers associated with these two regions. It owes its present wide distribution to the human effort to use it as a biological agent against snails (McMahon et al. 1997). The early divergence of A. alluaudi is indicated by the absence of SINEs 1303, 1304, 1306, 1327, 1807, and 1823, which are shared by groups 3 and 4; by the presence of SINEs 1357, 1802, 1805, and 1844 (Fig. 3); by ID2, ID3, and ID4 in 1357, which presumably represent insertions shared by groups 3 and 4; and by substitutions 143GA, 293TC, 314AT, 392TA, and 394CT in SINE 1357 (Figs. 5 and 6). The next to diverge were the fish of group 3, the highly diversified Lake Malawi flock, characterized by the absence of SINEs 1350 and 1840. The last to diverge was group 4, which includes the haplochromines of Lake Victoria, the Lake Edward region, and all other localities in Fig. 1. The emergence of this group is heralded by the insertion of SINEs 1350 and 1840, indels ID5 and ID6, and substitutions 120GT, 193AG, and 301GT in SINE 1357. This interpretation agrees with that based on the mtDNA control region sequences (Nagl et al. 2000).
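The way nested presence and absence of fixed SINEs yields a branching order can be sketched as follows. The insertion sets are transcribed from the description above; the empty set for group 1 is an assumption made only for illustration (the text states merely that this group served as an outgroup), and the helper names are hypothetical. The sketch further assumes that insertions are irreversible and that these SINEs are fixed within each group.

```python
# Insertions named in the text, grouped by which SINE groups carry them.
SHARED_BY_2_3_4 = {"1357", "1802", "1805", "1844"}
SHARED_BY_3_4 = {"1303", "1304", "1306", "1327", "1807", "1823"}
ONLY_4 = {"1350", "1840"}

CARRIED = {
    # ASSUMPTION: the Lake Tanganyika outgroup lacks all listed insertions.
    "group 1": set(),
    "group 2": SHARED_BY_2_3_4,
    "group 3": SHARED_BY_2_3_4 | SHARED_BY_3_4,
    "group 4": SHARED_BY_2_3_4 | SHARED_BY_3_4 | ONLY_4,
}


def is_nested(carried):
    """True if, ordered by size, each insertion set is contained in the next."""
    ordered = sorted(carried.values(), key=len)
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


def branching_order(carried):
    """Groups from earliest to latest divergence (fewest insertions first).

    Only meaningful when is_nested(carried) is True.
    """
    return sorted(carried, key=lambda group: len(carried[group]))


if __name__ == "__main__":
    print(is_nested(CARRIED))
    print(branching_order(CARRIED))
```

A non-nested pattern would signal conflicting markers, which is the situation expected for SINEs that were still polymorphic during speciation.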

Origin of the Lake Victoria Species Flock

Earlier analysis of mtDNA control region sequences (Nagl et al. 2000) revealed that all the endemic Lake Victoria haplochromines tested belonged to one of the seven haplogroups (V). The majority of the fish belonged to the subgroup VC and only fish from a few localities were typed as carrying the subgroup VD. Both subgroups were largely restricted in their distribution to Lake Victoria, its satellites, and rivers flowing into it. They were distinguished by single diagnostic substitutions. By comparison, the SINE 1357 nuclear sequences reported in this study reveal greater variation than the mtDNA control region sequences in the geographical distribution of the different types. In a sample smaller than that typed for mtDNA, at least eight SINE 1357 alleles (D1, D3, D4, E1, F4, F5, F6, F7) can be distinguished among the Lake Victoria endemics, at least three of which (D1, D3, E1) are also present at several other localities. This result may seem paradoxical because one might expect to find more variants in the faster evolving mtDNA control region than at a nuclear locus. The paradox disappears, however, when one takes into account the different expected persistence times of mtDNA and nuclear DNA variants. The presence of the SINE 1357 alleles in Lake Victoria and at other localities indicates that they arose before the fish colonized the lake. In contrast, the VC and VD variants may have arisen after the colonization, while other variants, had they been present originally, may have since been lost. Two inferences follow from this hypothesis. First, if the founding population of Lake Victoria contained numerous ancestral alleles or allelic lineages, many of which persist to this day, its size could not have been small. This conclusion is in agreement with that reached on the basis of studies of the major histocompatibility complex (Mhc) and neutral nuclear locus polymorphism (Klein et al. 1993; Nagl et al. 1998). Second, the diversity of the ancestral SINE 1357 alleles in Lake Victoria suggests that these alleles originated either in a large lake or at multiple riverine localities. In either case, the birthplace of these alleles must have been a region in which the mtDNA haplogroup V dominated. A possible candidate is the Lake Rukwa region (Nagl et al. 2000), which may be where haplogroup V originated.
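The difference in expected persistence times invoked above can be made concrete with the standard neutral coalescent expectation for the age of the most recent common ancestor of a sample. The sketch below is not a calculation from this study: it assumes a constant effective population size, neutrality, an equal sex ratio, and strictly maternal inheritance of mtDNA, and the function name and numerical values are hypothetical.

```python
def expected_tmrca(effective_size, sample_size, locus):
    """Expected time (in generations) to the most recent common ancestor
    of `sample_size` sequences under the standard neutral coalescent.

    effective_size: number of breeding individuals N (assumed constant)
    locus: "nuclear" (autosomal, diploid) or "mtdna" (haploid, maternal)
    """
    # Number of gene copies transmitted per generation:
    #   nuclear autosomal locus: 2N; mtDNA with equal sex ratio: N / 2.
    copies = {"nuclear": 2 * effective_size, "mtdna": effective_size / 2}[locus]
    return 2 * copies * (1 - 1 / sample_size)


if __name__ == "__main__":
    n_e = 10_000  # assumed value, for illustration only
    nuclear = expected_tmrca(n_e, 20, "nuclear")
    mtdna = expected_tmrca(n_e, 20, "mtdna")
    print(nuclear, mtdna, nuclear / mtdna)
```

Under these assumptions, nuclear lineages are expected to persist about four times longer than mtDNA lineages, so ancestral nuclear alleles can survive a colonization event that mtDNA variants of the same age do not.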