Main

At the macroevolutionary level, the diversity of life has been shaped mainly by two antagonistic processes: evolutionary radiations increase, and extinction events decrease, organismal diversity over time5,8,9. Evolutionary radiations are referred to as adaptive radiations if new lifeforms evolve rapidly through adaptive diversification into a variety of ecological niches, which typically presupposes ecological opportunity1,2,3,10. Whether or not an adaptive radiation occurs depends on a variety of extrinsic and intrinsic factors as well as on contingency, whereas the magnitude of an adaptive radiation is determined by the interplay between its main components, speciation (minus extinction) and adaptation to distinct ecological niches1,2,4,11. Despite considerable scientific interest in the phenomenon of adaptive radiation as the cradle of organismal diversity1,2,10,12,13, many predictions regarding its drivers and dynamics remain untested, particularly in exceptionally species-rich instances. Here, we examine what some consider as the “most outstanding example of adaptive radiation”14, the species flock of cichlid fishes in Lake Tanganyika. This cichlid assemblage comprises about 240 species15, which together feature an extraordinary degree of morphological, ecological and behavioural diversity14,15,16,17. We construct a species tree of Lake Tanganyika’s cichlid fauna on the basis of genome-wide data, demonstrate the adaptive nature of the radiation, reconstruct eco-morphological diversification along the species tree, and test general and cichlid-specific predictions related to adaptive radiation.

In situ radiation in Lake Tanganyika

To establish the phylogenetic context of cichlid evolution in Lake Tanganyika, we estimated the age of the radiation through divergence time analyses based on cichlid and other teleost fossils18, and constructed time-calibrated species trees using 547 newly sequenced cichlid genomes (Supplementary Table 1). Our new phylogenetic hypotheses (Fig. 1, Extended Data Figs. 1–4, Supplementary Figs. 1, 2) support the assignment of the Tanganyikan cichlid fauna into 16 subclades (corresponding to the taxonomic grouping of species into tribes15) and confirm that the Tanganyikan representatives of the tribes Coptodonini, Oreochromini and Tylochromini belong to more ancestral and widespread lineages that have colonized the lake secondarily12,15,19 (Supplementary Discussion). It has been debated whether all endemic Tanganyikan cichlid tribes evolved within the confines of Lake Tanganyika or whether some of them evolved elsewhere before the formation of the lake20,21,22. Our time calibrations establish that the most recent common ancestor of the cichlid radiation in Lake Tanganyika lived around 9.7 million years ago (Ma) (95% highest-posterior-density age interval: 10.1–9.1 Ma) (Fig. 1), which coincides with the appearance of lacustrine conditions in the Tanganyikan Rift23. This suggests that the radiation commenced shortly after the lake had formed and that all endemic cichlid tribes have evolved and diversified in situ, that is, within the temporal and geographical context of Lake Tanganyika.

Fig. 1: Time-calibrated species tree of the cichlid fishes of African Lake Tanganyika.

The species tree was time calibrated with a relaxed-clock model and is based on a maximum-likelihood topology inferred from genome-wide SNPs. Species names are abbreviated using a six-letter code, whereby the first three letters represent the genus and the last three letters the species name (Supplementary Table 1; see Extended Data Fig. 2 for the phylogeny with full species names). Branches are coloured according to tribes, and for all lake species an illustration is shown. Representatives of riverine cichlids (grey font) are nested within the radiation. The inset shows the time-calibrated phylogeny of more ancestral cichlid lineages (estimated under the multi-species coalescent model, Extended Data Fig. 1), highlighting the phylogenetic positions of the Tanganyikan representatives of the tribes Coptodonini (Coptodon rendalli (Copren)), Oreochromini (Oreochromis tanganicae (Oretan)) and Tylochromini (Tylochromis polylepis (Tylpol)), which colonized the lake secondarily. The schematic map of the African continent shows the position of the three Great Lakes Victoria, Malawi and Tanganyika, with a magnified section of Lake Tanganyika. The presumed age of Lake Tanganyika23 (9–12 Myr) is indicated in blue along the time axes. Species trees based on alternative topologies are presented in Extended Data Figs. 2–4, and uncalibrated nuclear and mitochondrial phylogenies on the specimen level are shown in Supplementary Figs. 1, 2.
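
The six-letter abbreviation described in this caption can be sketched in a few lines of Python. This is a minimal illustration of the stated rule only (first three letters of the genus, then first three letters of the species name); the function name is hypothetical, and how the authors handle qualifiers or colliding codes is not covered here (see Supplementary Table 1 for the actual codes).

```python
def six_letter_code(binomial: str) -> str:
    """Abbreviate 'Genus species' as three genus letters plus three species letters."""
    genus, species = binomial.split()[:2]
    return genus[:3].capitalize() + species[:3].lower()


# The three examples named in the caption.
for name in ("Coptodon rendalli", "Oreochromis tanganicae", "Tylochromis polylepis"):
    print(name, "->", six_letter_code(name))
```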

Phenotypes correlate with environments

Because—in the case of adaptive radiation—diversification occurs via niche specialization, a strong association is expected in the extant fauna between the environment occupied by a species and the specific morphological features used to exploit it2,3. To quantify eco-morphological diversification across the radiation, we investigated three trait complexes through landmark-based morphometric analyses. Specifically, we quantified body shape and upper oral jaw morphology using 2D landmarks acquired from X-ray images and the shape of the lower pharyngeal jaw bone based on 3D landmarks derived from micro-computed tomography (μCT) scans (Extended Data Fig. 5). To approximate the ecological niche of each species, we used the carbon and nitrogen stable-isotope composition of muscle tissue, which provides information about the relative position along the benthic–pelagic axis (δ13C value) and the relative trophic level (δ15N value), respectively16,24—a pattern that we corroborate here for Lake Tanganyika (Extended Data Fig. 6a, Supplementary Discussion). The major axes of shape variation for each trait complex were identified through a principal component analysis (PCA). To test for phenotype–environment correlations and to identify the ecologically most relevant components of each of these trait complexes, we performed a two-block partial least-square analysis (PLS) with the stable-isotope measurements, and applied a phylogenetic generalized least-square analysis (pGLS) to account for phylogenetic dependence.

The quantification of variation in body shape revealed that principal component 1 (PC1) represented mainly differences in aspect ratio, whereas PC2 was loaded with changes in head morphology (Fig. 2a). The changes in aspect ratio (comparable to PC1) were correlated with the δ13C and δ15N values (PLS: Pearson’s r = 0.69, R2 = 0.48, P = 0.001; pGLS: R2 = 0.12, P < 0.001, λpGLS = 1.007). PC1 of upper oral jaw morphology mainly represented changes in the orientation and relative size of the premaxilla, which was also the main correlate to the stable C and N isotope composition (PLS: Pearson’s r = 0.62, R2 = 0.38, P = 0.001; pGLS: R2 = 0.09, P < 0.001, λpGLS = 1.023), whereas PC2 was defined by changes in the ratio of the rostral versus the lateral part of the bone (Fig. 2b). For lower pharyngeal jaw shape, we found that PC1 reflected mainly changes in the aspect ratio of the jaw bone in combination with an increased posterior thickness, whereas PC2 involved similar shifts in thickness, yet in this case in combination with changes in the lengths of the postero-lateral horns that act as muscle-attachment structures25 (Fig. 2c). The PLS revealed that shape changes similar to PC2 are best associated with stable-isotope values (PLS: Pearson’s r = 0.67, R2 = 0.45, P = 0.001; pGLS: R2 = 0.16, P < 0.001, λpGLS = 1.018). The PCAs further revealed that the occupied area of the morphospace and ecospace scales with the number of species in the tribes (Extended Data Figs. 6, 7; ecospace: Pearson’s r = 0.88, d.f. = 9, P < 0.001; body shape: Pearson’s r = 0.91, d.f. = 9, P < 0.001; upper oral jaw morphology: Pearson’s r = 0.88, d.f. = 9, P < 0.001; lower pharyngeal jaw shape: Pearson’s r = 0.83, d.f. = 9, P = 0.002), a pattern that is not driven by sample size only (Supplementary Discussion).

Fig. 2: Morphospace and ecospace occupation of the cichlid fishes of Lake Tanganyika.

a–c, PCA of body shape (a, n = 242 taxa; 2,197 specimens), upper oral jaw morphology (b, n = 242 taxa; 2,197 specimens) and lower pharyngeal jaw shape (c, n = 239 taxa; 1,168 specimens) along with the associated shape changes. d, Ecospace spanned by the stable C and N isotope composition (δ13C and δ15N values; n = 236 taxa; 2,259 specimens). The colour scale indicates the number of species in 20 by 20 bins across the trait space (see Extended Data Figs. 6, 7 for PCA and stable-isotope biplots with a focus on morpho- and ecospace occupation per tribe).

Overall, the significant association between each of the three traits and the stable C and N isotope composition underpins their adaptive value (Extended Data Fig. 8a–c). Considered jointly, the three traits show that deep-bodied cichlids with inferior mouths and thick lower pharyngeal jaws with short horns are associated with higher stable-isotope projections (high δ13C and low δ15N values), indicating that such fishes occur predominantly in the benthic/littoral zone of the lake and feed on plants and algae. By contrast, more elongated species with more superior mouths and longer and thinner lower pharyngeal jaws are generally associated with lower stable-isotope projections (low δ13C and high δ15N values), suggesting a more pelagic lifestyle and a higher position in the food chain.

Pulses of morphological diversification

Next, we investigated the temporal dynamics of how the observed eco-morphological disparity emerged over the course of the radiation. In addition to the three eco-morphological traits, we also scored male pigmentation patterns to approximate disparity along the signalling axis—another potentially important component of diversification in adaptive radiations1,6,7,26. For all four traits, we estimated morphospace expansion through time using ancestral-state reconstructions along the time-calibrated species tree and applying a variable-rates model of trait evolution27,28 (Extended Data Fig. 8d, e). We calculated morphological disparity as the extent of occupied morphospace in time intervals of 0.15 million years (Myr) in comparison to a null model that assumes Brownian motion. Likewise, evolutionary rates through time were calculated as mean evolutionary rates derived from the variable-rates model, sampled at the same time points along the phylogeny.

Our analyses uncovered a pattern of discrete pulses in morphospace expansion, which were followed, in most cases, by morphospace packing (Fig. 3). The timing of these pulses differed among the traits. For body shape, we found a pulse of rapid morphospace expansion early in the radiation, alongside the first pulse of lower pharyngeal jaw shape diversification (Fig. 3b, c); this early phase of the radiation also features the highest evolutionary rates for body shape (Fig. 3d). The pulse in upper oral jaw diversification occurred in the middle phase of the radiation. Evolutionary rates were increased during this period, and were even higher at a later phase that was dominated by packing of the upper oral jaw morphospace rather than its expansion (Fig. 3b–d). This suggests that, in that later phase, rapidly evolving lineages diverged into pre-occupied regions of the morphospace, ultimately resulting in convergent forms16. The second pulse in lower pharyngeal jaw morphospace expansion happened late in the radiation when evolutionary rates were also highest for this trait (Fig. 3b–d). Thus, the theoretical prediction that eco-morphological diversification is rapid early in an adaptive radiation and slows down through time as the available niche space becomes filled1,5 applies only to body shape. Yet, this early burst in body shape diversification was not connected to a substantial increase in lineage accumulation (Fig. 3c).

Fig. 3: Temporal dynamics of morphological diversification in the adaptive radiation of cichlid fishes in Lake Tanganyika.

a–d, First row: body shape, n = 232 taxa, 2,164 specimens; second row: upper oral jaw morphology, n = 232 taxa, 2,164 specimens; third row: lower pharyngeal jaw shape, n = 232 taxa, 1,148 specimens; fourth row: pigmentation patterns, n = 218 taxa, 1,016 specimens. a, Species tree (Fig. 1) with branches coloured according to the mean relative rates of trait evolution for each trait. PP, posterior probability for rate shift. b, Morphospace densities (number of lineages) through time for each trait. Blue lines indicate the point in time when 50% of the extant morphospace had become occupied. c, Comparison of slopes (blue) of morphospace expansion over time between the observed data and the Brownian motion null model of trait evolution (mean across 500 Brownian motion simulations with 95% quantiles). A difference in slopes above zero represents morphospace expansion and values below zero indicate morphospace packing relative to the null model. Lineage accumulation through time derived from the species tree is shown in dark grey. d, Mean relative rates of trait evolution over time with standard deviation (blue).

Pigmentation patterns showed a single pulse of diversification and increased evolutionary rates late in the radiation, a signature unlikely to be caused by a high turnover rate in this trait (Supplementary Discussion). This late pulse of diversification in pigmentation patterns, together with the consecutive pulses of morphospace expansion in the eco-morphological traits, is in agreement with the prediction that diversification in an adaptive radiation proceeds in discrete temporal stages: first in macrohabitat use, then by trophic specialization, followed by a final stage of divergence along the signalling axes1,6,7. However, in contrast to the conventional stages model, the most recent stage of the cichlid adaptive radiation in Lake Tanganyika, which coincides with a large number of speciation events (Fig. 3c), is characterized by temporally overlapping pulses of diversification in both a putative signalling trait and an ecologically relevant trait. The lower pharyngeal jaw shape is the only trait complex showing two discrete pulses of morphospace expansion: one early in the radiation and one late, when niche space had already become limited. This later pulse suggests that diversification in the pharyngeal jaw apparatus facilitated fine-scaled resource partitioning after body shape and upper oral jaw morphospaces had been explored, resulting in the densely packed niche space observed today (Figs. 2, 3b).

Genomic features and species richness

Finally, we examined whether the diversity patterns arising over the course of the radiation are linked with particular genomic features. It has previously been suggested—on the basis of five reference cichlid genomes—that the radiating African cichlid lineages are characterized by increased transposable element counts, increased levels of gene duplications, and genome-wide accelerated coding-sequence evolution13. Because of the phylogenetic substructure of Lake Tanganyika’s cichlid fauna and the widely differing species numbers among tribes, our data offered the opportunity to examine genomic features for an association with per-tribe species richness within a large-scale radiation. We did not find evidence that members of species-rich tribes exhibit greater numbers of transposable elements (Fig. 4a) or more duplicated genes in their genomes (Fig. 4b), nor do they feature elevated genome-wide signatures of selection in coding sequences (Fig. 4c) (see also Extended Data Fig. 9). However, we found that a tribe’s species richness scales positively with a common measure of genetic diversity: genome-wide heterozygosity (Fig. 4d). That genetic diversity is linked to species richness has been previously suspected, although the nature of this relationship and the determinants of genetic diversity are under debate29,30.

Fig. 4: Association between genomic features and species richness across the cichlid tribes in Lake Tanganyika.

Each genomic summary statistic was tested for a correlation with species richness per tribe (log transformed). To account for phylogenetic structure in the data, we calculated phylogenetic independent contrasts for each variable. Data points are coloured according to tribes; large points are tribe means shown with 95% confidence intervals, small points represent species means and are only shown for group sizes <40. a, Percentage of the genome identified as transposable elements (TEs) (Pearson’s r = −0.31, d.f. = 10, P = 0.33; tribe means are based on one genome per species; Extended Data Fig. 9a). b, Number of duplicated genes (Pearson’s r = −0.27, d.f. = 10, P = 0.40; tribe means are based on species means). c, Genome-wide dN/dS ratios as a measure of selection on coding sequences (Pearson’s r = 0.26, d.f. = 10, P = 0.42; tribe means are based on species means across a set of 15,294 genes per genome; Extended Data Fig. 9b). d, Percentage of heterozygous sites per genome (Pearson’s r = 0.70, d.f. = 10, P = 0.012; tribe means are based on species means). e, f4-ratio statistics as a measure of gene flow among species within each tribe (Pearson’s r = −0.35, d.f. = 9, P = 0.29; tribe means are based on all species triplets within each tribe; see Extended Data Fig. 10 for a summary of the f4-ratio statistics for all species comparisons). f, Mean percentage of heterozygous sites in simulations with within-tribe migration rates sampled from the observed f4-ratio statistics (Pearson’s r = 0.84, d.f. = 10, P = 0.00067; tribe means are based on species means across 20 simulations; Extended Data Fig. 9c).

Elevated levels of heterozygosity could potentially result from hybridization31, which has itself been suggested as a trigger of cichlid radiations22,32,33. In Tanganyikan cichlids, the level of gene flow within tribes (estimated using f4-ratio values34) does not correlate with a tribe’s species richness (Fig. 4e, Extended Data Fig. 10). Nevertheless, much of the variation in heterozygosity as well as its correlation with species richness can be explained by the observed levels of gene flow within tribes in combination with the reduced gene flow among them: through coalescent simulations of genome evolution along the species tree we show that variation in migration rates, sampled from the empirical f4-ratio estimates, can produce levels of heterozygosity that are similar to the ones observed in nature (Fig. 4f). Hence, the correlation between species richness and heterozygosity can be explained by gene flow and phylogenetic structure, which is consistent with the expectation that the effect of gene flow scales positively with the number of hybridizing species and the divergence among these. In the cichlid radiation in Lake Malawi, which is an order of magnitude younger than the one in Lake Tanganyika, heterozygosity levels vary much less among lineages and do not scale with species richness, which—according to our findings—can be explained by the much lower levels of genetic differentiation between the hybridizing species33.

Conclusion

On the basis of a comprehensive dataset on cichlid fishes from African Lake Tanganyika, we tested predictions related to the phenomenon of adaptive radiation. We establish that the Tanganyikan cichlid radiation unfolded within the temporal and spatial confines of the lake, giving rise to an endemic fauna consisting of about 240 species in 52 genera and 13 tribes in less than 10 Myr. Although the ancestors of these tribes initially found comparable ecological opportunity, present-day species numbers differ by two orders of magnitude among these phylogenetic sublineages. Our analyses of morphological, ecological and genomic information revealed that, taken as a whole, species-rich tribes occupy larger fractions of the morphospace and ecospace and contain species that are, at the per-genome level, genetically more diverse, which appears to be linked to gene flow. We demonstrate a phenotype–environment association in three trait complexes (body shape, upper oral jaw morphology and lower pharyngeal jaw shape) and pinpoint their most relevant adaptive components. We show that eco-morphological diversification was not gradual over the course of the radiation. Instead, we identified trait-specific pulses of accelerated phenotypic evolution, whereby only diversification in body shape shows an early burst1,5. The sequence of the trait-specific pulses essentially follows the pattern postulated in the stages model of adaptive radiation1,6,7, with the extension that the most recent stage of the cichlid adaptive radiation in Lake Tanganyika, which is characterized by a large number of speciation events, is defined by increased diversification in both an ecological (lower pharyngeal jaw) and a signalling (pigmentation) trait. To what extent the observed diversity and disparity patterns were shaped by past environmental fluctuations and extinction dynamics cannot be answered conclusively through the investigation of the extant fauna alone.

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Sampling

Sampling was conducted between 2014 and 2017 at 130 locations at Lake Tanganyika. To maximize taxon coverage, we included additional specimens from previous expeditions (4.9% of the samples) as well as from other collections (0.8%). The final dataset (301 taxa; n = 2,723 specimens) contained an almost complete taxon sampling of the cichlid fauna of Lake Tanganyika, as well as 18 representative cichlid species from nearby waterbodies, and 32 outgroup species. All analyses described below are based on the same set of typically 10 specimens per species, or subsets thereof (Supplementary Tables 1, 2, Supplementary Methods).

Whole-genome sequencing

Genomic DNA of typically one male and one female specimen per species (n = 547) was extracted from fin clips preserved in ethanol using the E.Z.N.A. Tissue DNA Kit (Omega Bio-Tek) and sheared on a Covaris E220 (60 μl with 10% duty factor, 175 W, 200 cycles for 65 s). Individual libraries were prepared using TruSeq DNA PCR-Free Sample Preparation kit (Illumina; low sample protocol) for 350-bp insert size, pooled (six per lane), and sequenced at 126-bp paired-end on an Illumina HiSeq 2500 (Supplementary Table 1 contains information on read depths).

Assessing genomic variation

After adaptor removal with Trimmomatic35 (v.0.36), reads of 528 genomes (all species belonging to the cichlid radiation in Lake Tanganyika plus additional species nested within this radiation and some selected outgroup species; Supplementary Table 1) were mapped to the Nile tilapia reference genome (RefSeq accession GCF_001858045.1; ref. 36) using BWA-MEM37 (v.0.7.12). Variant calling was performed with the HaplotypeCaller and GenotypeGVCFs tools38 (v.3.7) (GATK), applying a minimum base quality score of 30. Variant calls were filtered with BCFtools39 (v.1.6; FS < 20, QD > 2, MQ > 20, DP > 4,000, DP < 8,000, ReadPosRankSum > −0.5, MQRankSum > −0.5). We applied a filter to sites in proximity to indels with a minor allele count greater than 2, depending on the size of the indel. With SNPable (http://lh3lh3.users.sourceforge.net/snpable.shtml), we determined all sites within regions of the Nile tilapia reference genome in which read mapping could be ambiguous and masked these sites. Using VCFtools40 (v.0.1.14) we further masked, per individual, genotypes with a read depth below 4 or a genotype quality below 20. Sites that were no longer polymorphic after the filtering steps were excluded, resulting in a dataset of 57,751,375 SNPs. Called variants were phased with the software beagle41 (v.4.1). The phasing of the Neolamprologus cancellatus specimens, which appeared to be F1 hybrids, was further improved with a custom script. Further details are provided in the Supplementary Methods.
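
The site-level and genotype-level thresholds listed above can be restated as two small predicates. The following Python sketch is illustrative only and is not the authors' BCFtools/VCFtools invocation: it assumes that the listed expressions are the criteria a site must satisfy to be retained, and the function names and dictionary layout are hypothetical. The indel-proximity and mappability masks are not reproduced.

```python
def passes_site_filters(info: dict) -> bool:
    """Return True if a variant site meets all site-level thresholds.

    Assumption: the thresholds quoted in the text are retention criteria.
    `info` maps annotation names (as in a VCF INFO field) to numeric values.
    """
    return (
        info["FS"] < 20
        and info["QD"] > 2
        and info["MQ"] > 20
        and 4000 < info["DP"] < 8000          # total depth across all samples
        and info["ReadPosRankSum"] > -0.5
        and info["MQRankSum"] > -0.5
    )


def mask_genotype(read_depth: int, genotype_quality: int) -> bool:
    """Return True if an individual genotype should be masked (set to missing)."""
    return read_depth < 4 or genotype_quality < 20


# Hypothetical annotation values for one site.
site = {"FS": 5.0, "QD": 20.0, "MQ": 60.0, "DP": 6000,
        "ReadPosRankSum": 0.1, "MQRankSum": 0.0}
print(passes_site_filters(site), mask_genotype(3, 40))
```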

De novo genome assemblies

De novo genome assemblies were generated from the raw-read data for each individual following an approach described previously42,43, using CeleraAssembler44 (v.8.3) and FLASH45 (v.1.2.11). Eight genomes repeatedly failed to assemble and were therefore excluded from further analyses (specimen vouchers: A188, IRF6, IZC5, JWE7, JWG1, JWG2, LJD3 and LJE8). Assembly quality was assessed with QUAST46 (v.4.5) and completeness was determined with BUSCO47 (v.3). Assembly statistics summarized with MultiQC48 (v.1.7) are available on Dryad.

Determining the age of the radiation

To determine the age of the cichlid radiation in Lake Tanganyika, we applied phylogenomic molecular-clock analyses for representatives of all cichlid subfamilies and the most divergent tribes, together with non-cichlid outgroups (44 species; Extended Data Fig. 1). Following Matschiner et al.18 we identified and filtered orthologue sequences from genome assemblies and compiled ‘strict’ and ‘permissive’ datasets that contained alignments for 510 and 1,161 genes and had total alignment lengths of 542,922 and 1,353,747 bp, respectively. We first analysed the topology of the species with the multi-species coalescent model implemented in ASTRAL49 (v.5.6.3), based on gene trees that we estimated for both datasets with BEAST250 (v.2.5.0). As undetected past introgression can influence divergence-time estimates in molecular-clock analyses, we further tested for signals of introgression in the form of asymmetric species relationships in gene trees and excluded five species (Fundulus heteroclitus, Tilapia brevimanus, Pelmatolapia mariae, Tilapia sparrmanii, and Steatocranus sp. ‘ultraslender’) potentially affected by introgression from all subsequent molecular-clock analyses. We then estimated divergence times among the most divergent cichlid tribes and the age of the cichlid radiation in Lake Tanganyika with the multi-species coalescent model in StarBEAST251 (v.0.15.5), using the ‘strict’ set of gene alignments (Extended Data Fig. 1). Further details are provided in the Supplementary Methods.

Phylogenetic inference

To infer a complete phylogeny of the cichlid radiation in Lake Tanganyika (the Tanganyikan representatives of the more ancestral tribes Coptodonini, Oreochromini and Tylochromini were excluded) from genome-wide SNPs we applied additional filters, retaining only SNPs with <40% missing data and between-SNP distances of at least 100 bp. The remaining 3,630,997 SNPs were used to infer a maximum-likelihood phylogeny with RAxML52 (v.8.2.4; Fig. 1, Extended Data Fig. 2, Supplementary Fig. 1). The species-tree topology was further estimated under the multi-species coalescent model from a set of local phylogenies with ASTRAL (Extended Data Fig. 3); these local phylogenies were inferred with IQ-TREE53 (v.1.7-beta7) from alignments for 1,272 genomic regions determined to be particularly suitable for phylogenetic analysis (see Supplementary Methods). We also applied the multi-species coalescent model implemented in SNAPP54 (v.1.4.2) to the dataset of genome-wide SNPs (Extended Data Fig. 4). Species-level phylogenies resulting from these different approaches were used as topological constraints in subsequent relaxed-clock analyses of divergence times (see below). In addition, we estimated the mitochondrial phylogeny based on maximum-likelihood with RAxML (Supplementary Fig. 2). Further details are provided in the Supplementary Methods.
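
The additional SNP filter described above (less than 40% missing data, at least 100 bp between retained SNPs) can be sketched as a single pass over position-sorted sites. This Python example is a minimal illustration under stated assumptions: the greedy left-to-right choice of which SNP to keep is an assumption, not the documented procedure, and the function name and tuple layout are hypothetical.

```python
def thin_snps(snps, max_missing=0.40, min_distance=100):
    """Keep SNPs with < max_missing missing data and >= min_distance bp spacing.

    `snps` is an iterable of (chromosome, position, missing_fraction) tuples,
    assumed to be sorted by chromosome and position.
    """
    kept = []
    last_kept = {}  # chromosome -> position of the most recently kept SNP
    for chrom, pos, missing in snps:
        if missing >= max_missing:
            continue  # too much missing data
        previous = last_kept.get(chrom)
        if previous is not None and pos - previous < min_distance:
            continue  # too close to the previously retained SNP
        kept.append((chrom, pos))
        last_kept[chrom] = pos
    return kept


# Hypothetical input: linkage-group names and positions are made up.
example = [("LG1", 100, 0.00), ("LG1", 150, 0.10), ("LG1", 200, 0.50),
           ("LG1", 210, 0.00), ("LG2", 120, 0.39)]
print(thin_snps(example))
```

A greedy pass keeps the first acceptable SNP in each neighbourhood; other thinning strategies (for example, keeping the SNP with the least missing data per window) would satisfy the same two criteria but retain a different subset.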

Divergence time estimates within the radiation

For relaxed-clock analyses, the 1,272 alignments were further filtered by applying stricter thresholds on the proportion of missing data and the strength of recombination signals. Ten remaining alignments with a length greater than 2,500 bp and fewer than 130 hemiplasies (total length: 30,738 bp; completeness: 95.8%) were then used jointly to estimate divergence times with the uncorrelated-lognormal relaxed-clock model implemented in BEAST2. To account for phylogenetic uncertainty in downstream phylogenetic comparative analyses, we performed three separate sets of relaxed-clock analyses, in which the topology was fixed to the species-level phylogeny inferred with RAxML (Fig. 1, Extended Data Fig. 2), the species tree inferred with ASTRAL (Extended Data Fig. 3) or the Bayesian species tree inferred with SNAPP (Extended Data Fig. 4). Further details are provided in the Supplementary Methods.

Morphometrics

To quantify body shape and upper oral jaw morphology, we applied a landmark-based geometric morphometric approach to digital X-ray images (for the full set of 10 specimens per species whenever possible; n = 2,197). We selected 21 landmarks, of which 17 were distributed across the skeleton and four defined the premaxilla (Extended Data Fig. 5a). Landmark coordinates were digitized using FIJI55 (v2.0.0-rc-68/1.521i). To extract overall body shape information, we excluded landmark 16, which marks the lateral end of the premaxilla, hence minimizing the impact of the orientation of the upper oral jaw. We then applied a Procrustes superimposition to remove the effect of size, orientation, and translational position of the coordinates.
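
What Procrustes superimposition removes (position, size and orientation) can be illustrated with a pairwise alignment of one 2D landmark configuration to a reference. The sketch below is a simplification and not the analysis used in the study, which relied on the R packages geomorph and Morpho and superimposed all specimens jointly; the function names are hypothetical, and reflection is not handled.

```python
import math


def center_and_scale(points):
    """Translate landmarks to their centroid and scale to unit centroid size."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    size = math.sqrt(sum(x * x + y * y for x, y in centered))
    return [(x / size, y / size) for x, y in centered]


def align_to_reference(shape, reference):
    """Rotate the centred, scaled shape to best fit the reference (least squares)."""
    a = center_and_scale(shape)
    b = center_and_scale(reference)
    # Closed-form optimal rotation angle for 2D configurations.
    num = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
    den = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in a]


def procrustes_distance(shape, reference):
    """Root of summed squared landmark distances after superimposition."""
    a = align_to_reference(shape, reference)
    b = center_and_scale(reference)
    return math.sqrt(sum((ax - bx) ** 2 + (ay - by) ** 2
                         for (ax, ay), (bx, by) in zip(a, b)))


# A triangle and a rotated, enlarged, shifted copy have the same shape.
reference = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
copy = [(5.0, 5.0), (5.0, 11.0), (2.0, 5.0)]
print(procrustes_distance(copy, reference))  # effectively zero
```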

For upper oral jaw morphology, we used a subset of four landmarks. A crucial feature of the oral jaw morphology is the orientation of the mouth relative to the body axes. However, this component of the upper oral jaw morphology would be lost in a classical geometric morphometric analysis, in which only pure shape information is retained. To overcome this, we extracted the premaxilla-specific landmarks (1, 2, 16 and 21) after Procrustes superimposition of the entire set of landmarks and subsequently recentred the landmarks to align the specimens without rotation. Thus, the resulting landmark coordinates do not represent the pure shape of the premaxilla but additionally contain information on its orientation and size in relation to body axes and body size, respectively.

To quantify lower pharyngeal jaw bone shape in 3D, a landmark-based geometric morphometric approach was applied on μCT scans of the head region of five specimens per species (n = 1,168). To capture all potential functionally important structures of the lower pharyngeal jaw bone, we selected a set of 27 landmarks (10 true landmarks and 17 sliding semi-landmarks) well distributed across the left side of the bone (Extended Data Fig. 5b). Landmark coordinates were acquired using TINA56 (v.6.0). To retain the lateral symmetric properties of the shape data during superimposition, we reconstructed the right side of the lower pharyngeal jaw bone by mirroring the landmark coordinates across the plane of bilateral symmetry fitted through all landmarks theoretically lying on this plane. We then superimposed the resulting 42 landmarks while sliding the semi-landmarks along the curves by minimizing Procrustes distances and retained the symmetric component only.

To identify the major axes of shape variation across the multivariate datasets we performed a PCA for each trait. We also calculated morphospace size per tribe as the square root of the convex hull area spanned by species means of the PC1 and PC2 scores. We then tested for a correlation between morphospace size and estimated species richness of a tribe15 (log-transformed to obtain normal distribution). To account for phylogenetic non-independence, we calculated phylogenetic independent contrasts with the R package ape57 (v.5.2) using the species tree (Fig. 1) pruned to the tribe level. We then calculated Pearson’s correlation coefficients for independent contrasts using the function cor.table of the R package picante58 (v.1.8).
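
The morphospace-size measure defined above (square root of the convex hull area spanned by species means on PC1 and PC2) can be sketched in plain Python. This is an illustrative re-implementation using a standard monotone-chain hull and the shoelace formula, not the authors' R code; the function names are hypothetical.

```python
import math


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def morphospace_size(pc_scores):
    """Square root of the convex hull area of (PC1, PC2) species means."""
    return math.sqrt(polygon_area(convex_hull(pc_scores)))


# Hypothetical species means for one tribe: a 2 x 2 square plus an interior point.
tribe_scores = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]
print(morphospace_size(tribe_scores))
```

Taking the square root puts the measure on the same linear scale as the PC axes, so tribes with one or two species (hull area zero) receive a size of zero.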

All landmark coordinates for geometric morphometric analyses were processed and analysed in R59 (v.3.5.2) using the packages geomorph60 (v.3.0.7) and Morpho61 (v.2.6). Further details are provided in the Supplementary Methods.

Stable-isotope analysis

To approximate ecology for each species, we measured the stable carbon (C) and nitrogen (N) isotope composition of all available specimens from Lake Tanganyika (n = 2,259). We analysed a small (0.5–1 mg) dried muscle sample of each specimen with a Flash 2000 elemental analyser coupled to a Delta Plus XP continuous-flow isotope ratio mass spectrometer (IRMS) via a Conflo IV interface (Thermo Fisher Scientific). Carbon and nitrogen isotope data were normalized to the VPDB (Vienna Pee Dee Belemnite) and Air-N2 scales, respectively, using laboratory standards which were calibrated against international standards. Values are reported in standard per-mil notation (‰), and long-term analytical precision was 0.2‰ for δ13C values and 0.1‰ for δ15N values. Note that we have used some of these stable-isotope values in a previous study62.
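For reference, the standard per-mil (delta) notation expresses the isotope ratio of a sample relative to that of the reference standard. The definition below is the general convention rather than anything specific to this study; R denotes the ratio of the heavy to the light isotope.

```latex
\delta^{13}\mathrm{C} = \left( \frac{R_{\mathrm{sample}}}{R_{\mathrm{standard}}} - 1 \right) \times 1000,
\qquad R = \frac{{}^{13}\mathrm{C}}{{}^{12}\mathrm{C}}
```

The δ15N value is defined analogously with R = 15N/14N; the reference standards are VPDB for carbon and Air-N2 for nitrogen, as stated above, and the resulting values are in ‰.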

To confirm interpretability of the δ13C and δ15N values, we additionally collected and analysed baseline samples covering several trophic levels from the northern and the southern basin of Lake Tanganyika (Supplementary Methods, Supplementary Discussion).

To test for a correlation of ecospace size with species richness of the tribes, we applied the same approach as described above to the δ13C and δ15N values.

Phenotype–environment association

For each trait (body shape, upper oral jaw, lower pharyngeal jaw) we performed a two-block PLS analysis based on species means of the Procrustes-aligned landmark coordinates and the stable C and N isotope compositions using the function two.b.pls in geomorph. To account for phylogenetic dependence of the data we applied a pGLS as implemented in the R package caper63 (v.1.0.1) across the two sets of PLS scores (each morphological axis and the stable-isotope projection) using the time-calibrated species tree based on the maximum-likelihood topology. The strength of phylogenetic signal in the data was accounted for by optimizing the branch length transformation parameter lambda using a maximum-likelihood approach.

Scoring pigmentation patterns

To quantify a putative signalling trait in cichlids, we scored the pigmentation patterns in typically five male specimens per species (n = 1,016), on the basis of standardized images taken in the field after capture of the specimens (see Supplementary Methods). Following the strategy described in Seehausen et al.64, the presence or absence of 20 pigmentation features was recorded, whereby we extended the number of scored features to include additional body and fin pigmentation patterns (Extended Data Fig. 5c). We then applied a logistic PCA implemented in the R package logisticPCA65 (v.0.2) and used the PC1 scores as a univariate proxy for differentiation along the signalling axes for further analyses.

Trait evolution modelling and disparity estimates

To investigate the temporal dynamics of morphological diversification over the course of the radiation we essentially followed the strategy of Cooney et al.28 (which is based on measurements on extant taxa and assumes constant niche space and no or constant extinction over the course of the radiation), using the PLS scores of body shape, upper oral jaw morphology, and lower pharyngeal jaw shape and the PC1 scores of pigmentation patterns as well as the time-calibrated maximum-likelihood species tree topology. For each trait we assessed the phylogenetic signal in the data by calculating Pagel’s lambda and Blomberg’s K with the R package phytools66 (v.0.6-60). We then tested the fit of four models of trait evolution for each of the four traits. We applied a white noise model, a Brownian motion model, a single-optimum Ornstein–Uhlenbeck model and an early burst model of trait evolution using the function fitContinuous of the R package geiger67 (v.2.0.6.1). Additionally, we fitted a variable-rates model (a Brownian motion model which allows for rate shifts on branches and nodes) using the software BayesTraits (http://www.evolution.rdg.ac.uk/; v.3) with uniform prior distributions adjusted to our dataset (alpha: −1–1, sigma: 0–0.001 for morphometric traits; alpha: 0–10, sigma: 0–10 for pigmentation pattern) and applying single-chain Markov-chain Monte Carlo runs with one billion iterations. We sampled parameters every 100,000th iteration, after a pre-set burn-in of 10,000,000 iterations. We then tested for each trait for convergence of the chain using a Cramer–von Mises statistic implemented in the R package coda68 (v.0.19-3). The models were compared by calculating their log-likelihood and Akaike information criterion (AIC) difference (Extended Data Fig. 8d). Based on differences in AIC, the variable-rates model was best supported for all traits but body shape, which showed a strong signal of an early burst of trait evolution (Extended Data Fig. 8d; note that the variable-rates model has the highest log-likelihood for body shape as well). We nevertheless focused on the variable-rates model for further analyses of all traits to be able to compare temporal patterns of trait evolution among the traits.
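The AIC-based model comparison can be illustrated with a short Python sketch. It uses the standard definition AIC = 2k − 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood; the function names and all numbers in the usage note are hypothetical and do not reproduce values from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(fits):
    """fits: dict mapping model name -> (log-likelihood, number of free parameters).
    Returns each model's AIC minus the lowest AIC; the best model scores 0."""
    scores = {model: aic(ll, k) for model, (ll, k) in fits.items()}
    best = min(scores.values())
    return {model: score - best for model, score in scores.items()}
```

For example, `delta_aic({"BM": (-10.0, 2), "OU": (-9.5, 3), "EB": (-7.0, 3)})` ranks the hypothetical early burst fit first. A model with more parameters can have the highest log-likelihood and still rank lower on AIC, which is the situation noted above for body shape.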

To estimate morphospace expansion through time we used a maximum-likelihood ancestral-state reconstruction implemented in phytools. To account for differences in the rate of trait evolution along the phylogeny, we reconstructed ancestral states using the mean rate-transformed tree derived from the variable-rates model. We then projected the ancestral states onto the original species tree and calculated the morphospace extent (that is, the range of trait values) in time intervals of 0.15 million years (note that this is an arbitrary value; however, differently sized time intervals had no effect on the interpretation of the results). For each time point we extracted the branches existing at that time and predicted the trait value linearly between nodes. We then compared the resulting morphospace expansion over time relative to a null model of trait evolution. To this end, we simulated 500 datasets (PLS and PC1 scores) under Brownian motion given the original species tree with parameters derived from the Brownian motion model fit to the original data. For each simulated dataset we produced morphospace-expansion curves using the same approach as described above. We then compared the slopes of our observed data with each of the null models by calculating the difference of slopes through time (Fig. 3) using linear models fitted for each time interval with the two subsequent time intervals. Note that for body shape we also estimated morphospace expansion through time using the early burst model for ancestral-state reconstruction, which resulted in a very similar pattern of trait diversification.
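The time-slicing step (extracting the branches that exist at a given time, interpolating trait values linearly between nodes, and taking the range) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each branch is supplied as a tuple with the trait values at its two nodes already known (in the study these came from the ancestral-state reconstruction), time is measured forward from the root, and the function name is hypothetical.

```python
def extent_through_time(branches, step=0.15, eps=1e-9):
    """branches: iterable of (t_start, t_end, value_start, value_end) with
    t_start < t_end, time measured forward from the root.
    Returns a list of (time, extent) pairs, where extent is the range
    (max - min) of interpolated trait values across all branches existing
    at that time."""
    branches = list(branches)
    t_max = max(b[1] for b in branches)
    n_steps = int(t_max / step + eps)
    curve = []
    for i in range(n_steps + 1):
        t = i * step
        # linear interpolation along every branch that spans time t
        values = [
            v0 + (v1 - v0) * (t - t0) / (t1 - t0)
            for t0, t1, v0, v1 in branches
            if t0 - eps <= t <= t1 + eps
        ]
        curve.append((t, max(values) - min(values)))
    return curve
```

Applying the same function to each dataset simulated under Brownian motion would yield the null curves against which the observed curve is compared.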

Unlike other metrics of disparity (for example, variance or mean pairwise distances) morphospace extent is not sensitive to the density distribution of measurements within the morphospace and captures its full range69. Hence, comparing the extent of morphospace between observed data and the null model directly unveils the contribution of morphospace expansion relative to the null model; and because the increase in lineages over time is identical in the observed and the simulated data, this comparison also provides an estimate for morphospace packing.

To summarize evolutionary rates we calculated the mean rate of trait evolution inferred by the variable-rates model in the same 0.15-million-year intervals along the phylogeny.

To account for phylogenetic uncertainty in the tree topology we repeated the analyses of trait evolution using the time-calibrated trees based on tree topologies estimated with ASTRAL and SNAPP (Extended Data Figs. 3, 4; Supplementary Methods; Supplementary Discussion). Furthermore, to also account for uncertainty in branch lengths, we repeated the analysis on 100 trees from the Bayesian posterior distribution for each of the three trees (Extended Data Fig. 8d, e, results are provided on Dryad).

Further details can be found in the Supplementary Methods.

Characterization of repeat content

For the repeat content analysis, we randomly selected one de novo genome assembly per species of the radiation (n = 245). We performed a de novo identification of repeat families using RepeatModeler (v.1.0.11; http://www.repeatmasker.org). We then combined the RepeatModeler output library with the available cichlid-specific libraries (Dfam and RepBase; v.27.01.2017; http://www.repeatmasker.org; 258 ancestral and ubiquitous sequences, 161 cichlid-specific repeats, and 6 lineage-specific sequences; 65,118, 273,530 and 6,667 bp in total, respectively) and used the software RepeatMasker (v.4.0.7; http://www.repeatmasker.org) (-xsmall -s -e ncbi -lib combined_libraries.fa) to identify and soft-mask interspersed repeats and low complexity DNA sequences in each assembly. The reported summary statistics were obtained using RepeatMasker’s buildSummary.pl script (Fig. 4a, Extended Data Fig. 9a, results per genome are provided on Dryad).

Gene duplication estimates

Per genome, gene duplication events were identified with the structural variant identification pipeline smoove (population calling method; https://github.com/brentp/smoove, docker image cloned 20/12/2018), which builds upon lumpy70, svtyper71 and svtools (https://github.com/hall-lab/svtools). Variants were called per sample (n = 488 genomes, 246 taxa of the Tanganyika radiation) from the initial mapping files against the Nile tilapia reference genome with the function ‘call’. The union of sites across all samples was obtained with the function ‘merge’, then all samples were genotyped at those sites with the function ‘genotype’, and depth information was added with --duphold. Genotypes were combined with the function ‘paste’ and annotated with ‘annotate’ and the reference genome annotation file. The obtained VCF file was filtered with BCFtools to keep only duplications longer than 1 kb and of high quality (MSHQ >3 or MSHQ = −1, FMT/DHFFC[0] > 1.3, QUAL >100). The resulting file was loaded into R (v.3.6.0) with vcfR72 (v.1.8.0) and filtered to keep only duplications with less than 20% missing genotypes. Next, we removed duplication events with a length outside 1.5 times the interquartile range above the upper quartile of all duplication lengths, resulting in a final dataset of 476 duplications (Fig. 4b).
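The final length filter (removing duplications longer than the upper quartile plus 1.5 times the interquartile range) can be sketched in Python. The function name is hypothetical, and the quartile definition is an assumption: the "inclusive" method used here corresponds to linear interpolation between order statistics, which may differ slightly from the definition used in the original R analysis.

```python
import statistics


def drop_long_outliers(lengths):
    """Keep only lengths at or below Q3 + 1.5 * IQR (upper fence only)."""
    q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)
    return [x for x in lengths if x <= upper_fence]
```

Only the upper fence is applied, matching the description above; short duplications are already excluded by the 1 kb minimum length filter.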

Analyses of selection on coding sequence

To predict genes within the de novo genome assemblies, we used AUGUSTUS73 (v.3.2.3) with default parameters and ‘zebrafish’ as species parameter (n = 485 genomes, 245 taxa). For each prediction we inferred orthology to Nile tilapia genes (GCF_001858045.1_ASM185804v2) with GMAP (GMAP-GSNAP74; v.2017-08-15) applying a minimum trimmed coverage of 0.5 and a minimum identity of 0.8. We excluded specimens with fewer than 18,000 tilapia orthologous genes detected (resulting in n = 471 genomes, 243 taxa). Next, we kept only those tilapia protein coding sequences that had at least one of their exons present in at least 80% of the assemblies (260,335 exons were retained, representing 34,793 protein coding sequences). Based on the Nile tilapia reference genome annotation file, we reconstructed for each assembly the orthologous coding sequences. Missing exon sequences were set to Ns. We then kept a single protein coding sequence per gene (the one being present in the maximum number of species with the highest percentage of sequence length), resulting in 15,294 protein coding sequences. Per gene, a multiple sequence alignment was then produced using MACSE75 (v.2.01). We calculated for each specimen and each gene the number of synonymous (S) and non-synonymous (N) substitutions by pairwise comparison to the orthologous tilapia sequence using codeml with runmode –2 within PAML76 (v.4.9e). To obtain an estimate of the genome-wide sequence evolution rate that is independent of filtering thresholds, we calculated the genome-wide dN/dS ratio for each specimen based on the sum of dS and dN across all genes (Fig. 4c, Extended Data Fig. 9b).
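One way to read the last step is that the genome-wide ratio is the sum of per-gene dN values divided by the sum of per-gene dS values for a specimen. The Python sketch below implements that reading only; it is an assumption about the aggregation, the function name is hypothetical, and the input values in the usage note are invented for illustration.

```python
def genome_wide_dnds(per_gene):
    """per_gene: iterable of (dN, dS) pairs for one specimen, each obtained
    from a pairwise comparison with the reference sequence.
    Returns sum(dN) / sum(dS) across all genes."""
    pairs = list(per_gene)
    total_dn = sum(dn for dn, _ in pairs)
    total_ds = sum(ds for _, ds in pairs)
    if total_ds == 0:
        raise ValueError("no synonymous divergence; ratio is undefined")
    return total_dn / total_ds
```

For instance, `genome_wide_dnds([(0.01, 0.05), (0.02, 0.05), (0.0, 0.1)])` gives approximately 0.15. Summing before dividing means that genes with dS close to zero do not produce extreme per-gene ratios, so no per-gene threshold is needed.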

Signals of past introgression

We used the f4-ratio statistic34 to assess genomic evidence for interspecific gene exchange. We calculated the f4-ratio for all combinations of trios of species on the filtered VCF files using the software Dsuite77 (v.0.2 r20), with T. sparrmanii as outgroup species (we excluded N. cancellatus as all specimens of this species appeared to be F1 hybrids; Supplementary Methods). The f4-ratio statistic estimates the admixture proportion, that is, the proportion of the genome affected by gene flow. The results presented in this study (Fig. 4e, Extended Data Fig. 10) are based on the ‘tree’ output of the Dsuite function Dtrios, with each trio arranged according to the species tree on the basis of the maximum-likelihood topology. The per-tribe analyses (Fig. 4e) were based only on comparisons where all species within a trio belong to the same tribe (n = 243 taxa).

In addition to the f4-ratio we also identified signals of past introgression among species using a phylogenetic approach by testing for asymmetry in the relationships of species trios in 1,272 local maximum-likelihood trees generated using IQ-TREE (Supplementary Methods; Extended Data Fig. 10).

Heterozygosity

We calculated the number of heterozygous sites per genome (n = 488 genomes, 246 taxa from the Tanganyika radiation) from the VCF files using the BCFtools function stats and then quantified the percentage of heterozygous sites among the number of callable sites per genome (see above) (Fig. 4d).

To explore if the observed levels of heterozygosity per tribe can be explained by the levels of gene flow within tribes we performed coalescent simulations with msprime78 (v.0.7.4). We simulated genome evolution of all species of the radiation following the time-calibrated species tree (Fig. 1), assuming a generation time of 3 years79 and a constant effective population size of 20,000 individuals. Species divergences were implemented as mass migration events and introgression within tribes as migration between species pairs with rates set according to their introgression (f4-ratio) signals inferred with Dsuite. To convert the f4-ratio values into migration rates, we applied a scaling factor of 5 × 10^−6, which results in a close correspondence in magnitude of the simulated introgression signals to those observed empirically (Fig. 4, Extended Data Fig. 9c). In each of 20 separate simulations, we randomly sampled one pairwise f4-ratio value for each pair of species (there are many f4 ratios per species pair, one for each possible third species added to the test trio; the maximum values per pair are shown in Extended Data Fig. 10). The simulated data consisted of one chromosome of 100 kb (mutation rate: 3.5 × 10^−9 per bp per generation33, recombination rate: 2.2 × 10^−8 per bp per generation; see Supplementary Methods). Levels of heterozygosity were calculated for all simulated datasets as described for the empirical data.
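The sampling and scaling of f4-ratio values for one simulation replicate can be sketched in Python. The sketch assumes that the scaling factor is applied multiplicatively (migration rate = f4-ratio × scaling factor); the function name and data layout are hypothetical, and the coalescent simulation itself is not shown.

```python
import random

SCALING = 5e-6  # scaling factor stated in the text


def sample_migration_rates(f4_by_pair, rng):
    """f4_by_pair: dict mapping a species pair to the list of f4-ratio values
    available for that pair (one per possible third species in the trio).
    rng: a random.Random instance, so that replicates are reproducible.
    Returns one migration rate per species pair for a single replicate."""
    return {
        pair: rng.choice(values) * SCALING
        for pair, values in f4_by_pair.items()
    }
```

Calling the function with a differently seeded `random.Random` for each replicate would mirror the 20 separate simulations, each drawing one f4-ratio value per species pair.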

To account for between-tribe gene flow we further performed simulations in which migration between tribes was also sampled from the empirical f4-ratio distribution. For simplicity in setting up the simulation model, we assume that gene flow between tribes is ongoing until present day, which is clearly an overestimate (see Supplementary Discussion). Nevertheless, the results of these simulations support our hypothesized scenario, confirming that much of the variation in heterozygosity as well as its correlation with species richness can be explained by the observed levels of gene flow.

Correlation of genome-wide statistics with species richness

We tested for a correlation between tribe means (based on species means) of each genomic summary statistic (transposable element counts, number of gene duplications, genome-wide dN/dS ratio, per-genome heterozygosity, and f4-ratio, as well as the heterozygosity and f4-ratio statistics derived from simulated genome evolution) and species richness of the tribes, applying the same approach as described above for tests of correlation between morpho- and ecospace size and species richness.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.