Main

The history of the human settlement of Polynesia has long been examined by its residents5 and has been an open question worldwide since at least the time of Captain James Cook6,7. More recently, the prevalence of certain health conditions in these island founder populations has attracted the interest of medical geneticists8. However, although essential for both medical research and historical understanding, little is known about the human genetic structure of this oceanic expanse, our planet’s last habitable region to be settled.

Background

The settlement sequence of Polynesian islands remains particularly difficult to unravel using comparative linguistic or cultural approaches owing to the rapidity of the initial expansion and the subsequent cultural exchanges between islands7,9,10,11. Meanwhile, the archaeological estimates for settlement dates remain debated, and have recently been revised forward across eastern Polynesia by up to a millennium2,3,4,12,13. Previous region-wide Polynesian genetics studies have considered only globin gene polymorphisms14 or have been restricted to near (western) Polynesia15 and the Society Islands16 and lacked an ancestry-specific approach. Meanwhile, ancient DNA studies have sequenced only four samples from one island in western Polynesia and three near-modern samples from one island in eastern Polynesia, all with low genotype density, still lower between-sample genotype overlap, and different time frames17,18. Here we use a dataset of modern samples that is two orders of magnitude larger to examine detailed intra- and inter-island population substructure across all of Polynesia (Supplementary Tables 1, 2). We leverage our sample size to perform directionality and network analyses, and leverage our high-density, overlapping genotypes from coexistent individuals to perform within-generation autosomal haplotype matching, allowing us to date and reconstruct the settlement paths of these islands for the first time. This study also allows us to demonstrate new ancestry-specific techniques for analysing genomic data from underrepresented, admixed populations.

The Polynesians are predominantly descended from Austronesian-speaking voyagers17 who trace their linguistic origins to Taiwan9; their ancestral expansion is thought to have proceeded into Island Southeast Asia and eventually out into the Pacific19. The Austronesian-speaking settlers of the western Pacific (Fiji, Tonga and Samoa) went on to people the widely dispersed islands in the vast ocean to their east through extraordinary voyages of exploration and settlement2,20. Historians believe that family groups of 30–200 individuals sailed in double-hulled canoes across thousands of kilometres of open ocean to inhabit each new Polynesian island group21,22. The first arrivals to these isolated island groups are thought to have experienced rapid initial growth, driven by the abundant resources of unfished reefs, huge seabird colonies and flightless birds (that soon became extinct) unhabituated to humans2,7,22,23,24,25. These rapidly expanding island populations then initiated new voyages of exploration in search of—according to some theories—further untapped resources26, a model supported by early oral histories27. Geological analyses of Polynesian trade goods, particularly adzes, indicate that the remote Polynesian islands remained in trade contact with one another for several centuries26,28,29. However, these contacts were necessarily limited in frequency by the vast distances between island groups and limited in size by the capacities of the double-hulled sailing canoes21.

Under this historical model, we would expect the minor alleles on these isolated Pacific islands to be lost in a telescoping fashion following the order of the islands’ colonization—a range expansion30—owing to the compounding succession of founding bottlenecks. We confirm this hypothesis below and then use its consequence—that the genetic composition of each remote island group is dominated by the contribution of its founders (Extended Data Fig. 3), whose descendants rapidly populated it—to reconstruct the Polynesian settlement sequence. We finally evaluate this model for self-consistency to test its validity.

Dimensionality of Polynesian genetics

In direct contrast to continental (and nearshore island) populations, in which genetic substructure is shaped by large historical migrations, conquests and diffusions occurring freely across the two-dimensional landmass surface, thus producing two-dimensional projections of genetic variance that mirror geography31,32, we find that Polynesian population structure exhibits high dimensionality (Supplementary Fig. 1) not at all reflective of geography (Fig. 1a), with islands diverging separately in a standard principal component analysis (PCA) (Supplementary Figs. 2, 3). Indeed, the first two dimensions of major genomic variation—even in an ancestry-specific PCA of the Polynesian individuals (Fig. 1b)—do not separate islands geographically, as they do for within-continent populations33,34 (Extended Data Figs. 1, 2). Instead, each successive principal component captures the genetic drift of a particular island or island group (Fig. 1b, c, Supplementary Fig. 2), illustrating that genetic variance between these islands is dominated by their founder effects, not by diffusion clines or migration gradients. To further complicate such a standard variance-based approach (Supplementary Figs. 2–4) to genomic dimensionality reduction, the Polynesian islands differ widely in genetic diversity. Because the originating islands have much greater diversity (as discussed below), they dominate the first principal component when included in the PCA (Supplementary Fig. 3). Furthermore, many individuals, including all samples from certain islands, have some amount of non-Polynesian ancestry: European, Native American and African33. The presence of large-scale post-colonial admixture from such divergent ancestry sources completely confounds Polynesian-focused interpretations of within-island and between-island variance when these admixed samples are included in the PCA (Supplementary Fig. 4).

Fig. 1: Dimensionality reduction of genetic variation in Pacific Islanders.

a–c, Ancestry-specific PCA of islanders (with non-Asian derived ancestries, such as post-colonial European ancestry, masked) shows islands (a) diverging separately along each component (b, c), owing to the independence of genetic drift from each island’s founder effect. Neither geography nor settlement sequence can be discerned. The westernmost islands are omitted, as their greater diversity would otherwise dominate the first principal component (PC) (see Supplementary Fig. 2). The per cent variance explained by each of the first four principal component dimensions is listed along each axis. Dots represent individuals, and colours represent islands. d, Ancestry-specific t-SNE plot of all sampled islanders, providing superior separation of each island group. The ancestral western Pacific islands are on the left and the easternmost Polynesian island (Rapa Nui) on the right. Important patterns are now evident; for instance, Rarotonga and the Palliser group appear at the centre of the eastern Polynesian islands while the other eastern islands radiate out from them, consistent with the settlement patterns we infer below. t-SNE preserves local relationships, but not global relationships (between widely separated clusters).

To overcome these threefold obstacles to visualizing relationships between islands, we applied a novel ancestry-specific version of a nonlinear dimensionality reduction technique, t-distributed stochastic neighbour embedding (t-SNE), applying it only to the genomic segments of Polynesian ancestry in our sampled individuals and employing a matrix completion step (Fig. 1d, Supplementary Fig. 5). In a plot of this ancestry-specific t-SNE method (Fig. 1d), the islands of the ancestral west—Taiwan, Island Southeast Asia (Sumatra), Fiji, Tonga and Samoa—are grouped on the left and the more recently settled eastern islands are on the right. Islands in archipelagos, such as the Cook Islands of Mauke, Atiu and Rarotonga, form neighbouring clusters. Rarotonga and Palliser appear at the centre of the eastern Polynesian islands, with the other eastern islands radiating out from them. This pattern is consistent across alternative dimensionality-reduction methods (Methods), including our ancestry-specific formulations of uniform manifold approximation and projection (UMAP) (Supplementary Figs. 6, 7) and self-organizing map (SOM) (Supplementary Fig. 8), as well as our genetic-drift projection method (Supplementary Fig. 9).

Tree building and path reconstruction

Because individuals from each of the islands form coherent, separate clusters in all of the non-linear, variant-based projections (t-SNE, UMAP and SOM), we can define a meaningful variant-frequency vector for each island by averaging the single nucleotide polymorphism (SNP) dosage vectors across all individuals on that island. Again, we consider only genomic segments of Polynesian origin (Supplementary Tables 3, 4), since standard non-ancestry-specific analyses are confounded by the recent introduction of highly differentiated colonial ancestry, such as European, even when the proportion of that ancestry is small (Supplementary Fig. 10). Averaging across all individuals reduces noise and produces composite Polynesian-specific frequency vectors with little to no remaining missingness from masking. Using these island-specific Polynesian-variant frequency vectors, we compute statistics for each pair of islands (Extended Data Figs. 4a–d, 7, Supplementary Figs. 11–19), including the average number of pairwise differences35 (π), variant inner product36 (outgroup-F3), fixation index (Fst), and directionality index37,38 (range expansion statistic) (ψ).
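The averaging step and one of the pairwise statistics can be illustrated with a minimal sketch. It assumes dosages coded as 0, 0.5 or 1 with `None` marking a masked site, and it uses the textbook per-site form of the between-population pairwise difference; the function names and toy data are hypothetical, and the normalization used for π in the study itself may differ.

```python
from typing import List, Optional, Sequence

Dosage = Optional[float]  # 0, 0.5 or 1; None marks a masked (non-Polynesian) site


def island_frequency_vector(individuals: Sequence[Sequence[Dosage]]) -> List[Optional[float]]:
    """Average dosages site by site across individuals, skipping masked entries."""
    n_sites = len(individuals[0])
    freqs: List[Optional[float]] = []
    for site in range(n_sites):
        observed = [ind[site] for ind in individuals if ind[site] is not None]
        freqs.append(sum(observed) / len(observed) if observed else None)
    return freqs


def mean_pairwise_difference(p_a: Sequence[Optional[float]],
                             p_b: Sequence[Optional[float]]) -> float:
    """Per-site chance that an allele drawn from island A differs from one drawn from island B."""
    terms = [a * (1 - b) + b * (1 - a)
             for a, b in zip(p_a, p_b) if a is not None and b is not None]
    return sum(terms) / len(terms)


# Toy data (hypothetical): three individuals on island A, two on island B, four sites.
island_a = [[0, 0.5, None, 1], [0, 0.5, 1, 1], [0.5, None, 1, 1]]
island_b = [[1, 0.5, 0, None], [1, 0.5, 0, 0]]
freq_a = island_frequency_vector(island_a)
freq_b = island_frequency_vector(island_b)
pi_ab = mean_pairwise_difference(freq_a, freq_b)
```

Because each island vector pools every unmasked observation at a site, a site stays missing only when it is masked in all individuals on that island, which is why the composite vectors retain so little missingness.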

The directionality index ψ (Fig. 2a) measures the aggregate increase in frequencies of retained rare variants across the genome due to founder events, following the direction of a range expansion (Fig. 2b, Supplementary Discussion, ‘On psi’). The ψ-statistic gives crucial information that is not available from any genetic distance (π, F2, MixMapper39) or inner product (F3, TreeMix40) based methods; namely, a directionality arrow delineating a parental population from its child. Most human population studies have not required such directionality, as modern human populations are generally siblings, both having genetic drift from a no-longer extant, ancient parental population. That parental population, if available from ancient samples, is clearly indicated by the arrow of time (typically carbon dating). However, among the relatively recently settled Polynesian islands, genetic drift is created not by time, but by founder effects. Thus, the undrifted (parental) populations for most of these islands are still approximately extant: they are the populations of the originating islands. When constructing a population tree, this means that our dataset contains not only the terminal (leaf) nodes, but also the internal nodes, and we know their hierarchy from the ψ statistic. This directional knowledge enables us to use a tree-building algorithm that, unlike population tree algorithms currently in use36,39,40,41 (Supplementary Figs. 20, 21), is guaranteed to find the optimal tree out of the space of all possible trees in the presence of perfect data (see Methods section ‘Migration network reconstruction’). Using this more robust directionality-based algorithm (see Supplementary Discussion, ‘On tree-building’), the settlement path of Polynesia can be reconstructed (Fig. 2a).
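The asymmetry that ψ exploits can be illustrated with a simplified stand-in. The sketch below is not the estimator of refs. 37,38; it assumes only that a directionality signal can be read from the mean frequency difference over variants present in both populations, and the function name and toy frequencies are hypothetical.

```python
from typing import Optional, Sequence


def directionality_index(freq_a: Sequence[Optional[float]],
                         freq_b: Sequence[Optional[float]]) -> float:
    """Mean frequency difference (A minus B) over variants observed in both populations.

    A positive value means the shared variants sit at higher frequency in A,
    which under a range-expansion model marks A as the more bottlenecked
    (child) population.
    """
    diffs = [a - b for a, b in zip(freq_a, freq_b)
             if a is not None and b is not None and a > 0 and b > 0]
    if not diffs:
        raise ValueError("no variants are shared between the two populations")
    return sum(diffs) / len(diffs)


# Toy minor-allele frequencies (hypothetical). The child population has lost some
# variants entirely (site 2), while the variants it retained have drifted upward.
parent = [0.05, 0.10, 0.02, 0.00, 0.20]
child = [0.15, 0.30, 0.00, 0.10, 0.25]
psi_child_vs_parent = directionality_index(child, parent)
psi_parent_vs_child = directionality_index(parent, child)
```

The statistic is antisymmetric by construction, so its sign supplies the parent-to-child arrow that symmetric distances such as π or Fst cannot.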

Fig. 2: Serial bottlenecks and relatedness define the settlement sequence and timings for the Polynesian Islands.

a, Inferred genetic-based map of Polynesian origins for the islands sampled in our study (not to scale). The direction, line width and date for each arrow are based on inter-island statistics as described in the key and the text. For example, the widths of the arrows are inversely proportional to the value of the range expansion statistic (ψ) relative to Samoa. The order of arrow divergences indicates the order of shared drift among the child populations. Where they occur, these shared paths may indicate that one or more intermediate islands in the settlement sequence are missing from our dataset (Extended Data Fig. 5). This settlement sequence is consistent with a principal curve analysis (Extended Data Fig. 7). A sex-averaged generation time of 30 years was used, as found in several studies of pre-industrial populations (Supplementary Discussion, ‘On generation times and meiosis events’). Locations with prehistoric remains of megalithic statue building are also indicated (red asterisk). b, The range expansion statistic (ψ) shows a steady increase in retained rare variant frequencies (genetic surfing) along paths of settlement as a result of each successive founder effect. Note that each matrix element is computed on a different SNP set (rare variants found in some samples from both islands), so the matrix need not have a similar ordering across all rows or all columns—that it does is a confirmation of the range expansion process. Rapa Nui (Easter Island) is the easternmost island in our dataset with the most compounded series of founder effects. c, Example IBD segment length distributions for all pairs of individuals, one on Rapa Nui and the other on Mangareva (green), Palliser (blue), Rarotonga (red) and Samoa (black), used to fit the respective exponential decay constants (λ).

Dating

To estimate dates for the settlement events that we infer, we use a method for detecting DNA segments that have been inherited from a common ancestor (identical by descent (IBD)) for all pairs of individuals on different islands. Again, we consider only genomic segments of Polynesian ancestry. For each pair of islands A and B, we pool all of the Polynesian IBD segments shared between individuals on A paired with individuals on B, and fit an exponential curve to the resulting segment length distribution (Fig. 2c, Extended Data Fig. 4d). From the decay constant of this exponential curve, we compute the number of generations elapsed since divergence of the island pair (Extended Data Fig. 6, Supplementary Figs. 22–24). Fig. 2a shows the estimated divergence dates for all pairs of islands that are connected by a settlement path. Recent movement between islands, such as post-settlement trade contact, can introduce small numbers of longer, inter-island IBD segments, shifting the estimated divergence time towards the present, so we fit a truncated exponential. Nevertheless, these divergence dates should be seen as the terminus ante quem for the settlement of each child island (Fig. 2a, Extended Data Fig. 6, Extended Data Table 1). In the case of the most remote islands such as Rapa Nui, which are believed to have had no large-scale population exchanges with other islands, the IBD-based date should coincide closely with the actual date of settlement.
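A minimal sketch of such a fit follows. It assumes a maximum-likelihood fit of an exponential restricted to a length window, together with the textbook relationship that segments inherited from a common ancestor g generations back have mean length 100/(2g) cM. The window bounds, function names and simulated data are hypothetical, and the exact fitting procedure and conversion used in the study may differ.

```python
import math
import random
from typing import Sequence


def fit_truncated_exponential(lengths_cm: Sequence[float],
                              lower: float, upper: float) -> float:
    """Maximum-likelihood decay constant (per cM) using only lengths in [lower, upper]."""
    kept = [x for x in lengths_cm if lower <= x <= upper]
    width = upper - lower
    mean_excess = sum(kept) / len(kept) - lower

    def expected_excess(lam: float) -> float:
        # Mean of (length - lower) for an exponential truncated to [lower, upper].
        return 1.0 / lam - width / math.expm1(lam * width)

    lo, hi = 1e-9, 500.0 / width  # bisection bracket; expected_excess falls as lam grows
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected_excess(mid) > mean_excess:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def generations_since_divergence(lam_per_cm: float) -> float:
    """Assumed model: mean segment length is 100 / (2 g) cM, so g = 50 * lambda."""
    return 50.0 * lam_per_cm


# Simulated check: segments drawn with a true decay constant of 0.5 per cM (g = 25).
random.seed(0)
simulated = [random.expovariate(0.5) for _ in range(200_000)]
lam_hat = fit_truncated_exponential(simulated, lower=2.0, upper=12.0)
g_hat = generations_since_divergence(lam_hat)
```

Discarding segments above the upper bound is what keeps the few long segments from later contact out of the fit; the lower bound stands in for the minimum length at which IBD segments can be called reliably.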

The dates that we infer from our genome-wide network analyses support the radiocarbon-based ‘short chronology’ from the comprehensive re-analysis of Wilmshurst et al.12, as corrected by Mulrooney et al.3 (Extended Data Table 1), as opposed to the previous nearly-one-thousand-year-older ‘long chronology’2,4, and as opposed to the intermediate chronology suggested by Spriggs and Anderson13 (Marquesas ad 300–600, remainder of eastern Polynesia ad 600–950). Only in the settlement of the Marquesas Island group, dated by Mulrooney to the late 1100s, and the Southern Cook Islands, dated even later by Wilmshurst to the mid-1200s, do we find different (earlier) dates. However, as Mulrooney et al. explain, the small sample size of early-dated historical sites on each island means that new archaeological discoveries could revise Wilmshurst’s chronology (backward). Our dates, from the full island-wide ancestral history coded within modern Polynesians themselves, do not have these sampling issues affecting ancient DNA and artifacts. Indeed, modern genomes complement ancient artifacts, since issues affecting the artifacts—finding the earliest human sites on each island, determining whether objects within them are anthropogenic and determining whether those artifacts, often wood or charcoal, stem from young or old trees4,42 (inbuilt age)—do not affect the modern genomes, and vice versa.

Our date for the settlement of Rapa Nui is consistent with Wilmshurst and Mulrooney and also agrees closely with the date found by Hunt et al. (ad 1200) based on analyses of pollen in lake cores and soil erosion patterns43, as well as with recent radiocarbon dates of archaeological sites44. Furthermore, unlike the long chronology estimates (200 bc in the Marquesas), our settlement dates (ad 1140 on Fatu Hiva in the Marquesas, or 28.4 generations before 1989) agree with the genealogical oral histories of many Pacific Islanders themselves27 (ad 1005, or 29 generations before 1875, on Fatu Hiva). In the Tuamotus our dates (ad 1110, or 29.3 generations before 1989) agree even more closely with the islands’ oral histories45 (ad 1125, or 28 generations before 1965).
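The conversions quoted above follow from the 30-year sex-averaged generation time given in the legend of Fig. 2. The short sketch below reproduces the arithmetic; the function and variable names are hypothetical. The Fatu Hiva genetic estimate works out to 1989 − 28.4 × 30 = 1137, which the text presumably reports rounded to ad 1140.

```python
GENERATION_YEARS = 30  # sex-averaged generation time stated in the Fig. 2 legend


def settlement_year(generations: float, reference_year: int,
                    generation_years: float = GENERATION_YEARS) -> float:
    """Calendar year lying `generations` generations before `reference_year`."""
    return reference_year - generations * generation_years


fatu_hiva_genetic = settlement_year(28.4, 1989)  # about 1137
fatu_hiva_oral = settlement_year(29, 1875)       # 1005
tuamotu_genetic = settlement_year(29.3, 1989)    # about 1110
tuamotu_oral = settlement_year(28, 1965)         # 1125
```

On these numbers the genetic and genealogical estimates differ by roughly 130 years on Fatu Hiva and by about 15 years in the Tuamotus.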

Our later divergence dates (ad 1330–1360) for some islands within archipelagos—North Marquesas (Nuku Hiva) in the Marquesas, Raivavae and Rimatara in the Australs—fall within the period of greatest inter-island trade contact in eastern Polynesia26. Either the last islands were discovered during this period of long-distance trade voyaging, as suggested by the dates of Schmid et al.4, or sufficient migration-to-inhabitant ratios still existed within archipelagos then to influence IBD distribution dates (Supplementary Fig. 23). Note that our reconstruction of the settlement path is independent of these date estimates, which are overlaid on it, and is more robust to later sporadic contact than IBD distributions are (see Methods sections ‘Polynesian ancestry-specific allele frequency analyses’ and ‘F4’).

Discussion

Our analyses indicate the following scenario for the settlement of eastern Polynesia. From western Polynesia, Polynesian voyagers reached Rarotonga in the Cook Islands around ad 830, having passed from Samoa along a route shared with the settlement of Fiji and Tonga. Rarotonga is the largest of the Cook Islands and has the highest elevation, with fertile volcanic soil watered by orographic rainfall26, creating distinct clouds. These clouds, together with a prominent mountain, make the island visible for long distances at sea and probably facilitated its discovery46. From this base, we find that settlers continued south around ad 1190 to Rapa Iti (a branch recently hypothesized from linguistic evidence47) and, separately, east to the smaller Cook Islands (Mauke and Atiu in our dataset).

Settlers also fanned out from Rarotonga northeast to the Society Islands (represented by Tahiti in our dataset but also containing the culturally significant island Ra‘iātea) around ad 1050, thence northeast to the Tuamotu Archipelago (represented by Mataiva in the Palliser group in our dataset) by ad 1110. At this time the widely scattered Tuamotu hub and other critical atolls in the expansion path (e.g. Nororotu in the Austral group) would have only recently emerged above falling sea levels (ad 900) and finished solidifying their topsoil and forests45,48 (Extended Data Fig. 5). Thus, our inferred dates and settlement path lend support to the idea that expansion into eastern Polynesia was mediated by the birth of those intermediary island clusters at the turn of the last millennium.

Stretching across central eastern Polynesia, the Tuamotu Archipelago was previously hypothesized to have served as a regional voyaging hub20,26,28, and our analysis indicates that it was from this hub that settlers made their way north to the Marquesas Islands (Nuku Hiva and Fatu Hiva in our dataset) and south to the Gambier Islands (Mangareva in our dataset) beginning in the mid-1100s. From Mangareva, we find that the expansion reached the easternmost inhabited Polynesian island, Rapa Nui (Easter Island), around ad 1210. This final leg had been suggested by some based on similarities between the Mangarevan and Rapanui languages49, and by similarities in their traditional stone ceremonial platforms50. This settlement sequence is also supported by our marker frequency-based genetic analyses, including ancestry-specific UMAP (Supplementary Fig. 6), drift projection (Supplementary Fig. 9), F-statistics (Supplementary Figs. 11, 12), principal curve analyses (Extended Data Fig. 7, Supplementary Fig. 17), diversity statistics (Supplementary Figs. 25–31), and ADMIXTURE clustering (Supplementary Figs. 32, 33).

Notably, we find that the population of Raivavae in the Australs arrived via the distant Tuamotus and Mangareva rather than via the other Austral islands of Tubuai and Rimatara (Fig. 1a, Supplementary Figs. 6, 7). Together with even more distant North and South Marquesas and Rapa Nui, each also with inferred settlement stemming from the Tuamotus, Raivavae had an ancient tradition of carving monumental anthropomorphic statues in stone. No other Austral island had these51; indeed, such immense sculptures are found only on those far-flung islands that we now show to have a common genetic source in the Tuamotu archipelago (Fig. 2a). It is also notable that it is only on islands that we infer were settled via the Tuamotus that pre-colonial Native American genetic contact has been identified, and its timing corresponds closely with our voyaging dates for this region33. This supports the theory that that contact occurred while the Polynesians were embarking on their easternmost, and longest, voyages of discovery.

The modern peoples of Polynesia harbour strong genetic evidence for a range expansion beginning in Samoa and propagating across eastern Polynesia through a series of telescoping founder events from the 11th and 12th centuries. Since this telescoping series of bottlenecks increased (via genetic surfing) the frequency of retained rare variants along the settlement path (see ψ-statistic, Fig. 2b), and since some of these variants are probably deleterious, future studies characterizing the individual frequencies and effects of these rare variants are desirable. We suggest that such large-scale sequencing and phenotyping studies should focus on the terminal islands in the settlement sequences that we have described, where compounded bottlenecking created the largest increase in frequencies (Fig. 2b). We have shown that these particular islands also have high levels of homozygosity (Supplementary Figs. 25–27), which should increase the power to detect trait associations, and significant IBD, enabling IBD mapping, another useful approach52. Of note, two large modern Polynesian populations lie at the geographic termini of these serial bottleneck chains, Hawaii in the north and New Zealand in the south, and are thus notable candidates for such future large-scale association studies. We have introduced ancestry-specific computational methods for detailed characterization of Polynesian variant frequencies within admixed, modern samples, so potential admixture within future cohorts from such diverse populations should not be considered a barrier to designing these studies. Continued partnerships with these communities will be crucial53, since such studies will benefit both the personalized health understandings of these populations and the global genetic understandings of all of us.

Methods

Data reporting

No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.

Sample collection and approvals

This work combines publicly available sequence data and newly generated SNP array data from samples collected over different time periods by the participating institutions (Supplementary Tables 1, 2). Written informed consent was obtained from all participants and research ethics approval and permits were obtained from the following institutions: Stanford University Institutional Review Board (IRB approval no. 20839), Oxford University Tropical Research Ethics Committee (reference no. 537-14), and the Scientific Ethics Committee of the Catholic University of Chile (reference no. 1971092). This study was also approved by the Council of Polynesian Elders for the community of Rapa Nui, along with local educational institutions, including the Lyceum Hoŋa’a o te Mana and the Lorenzo Baeza school for adults. Community engagement, including pre-participation presentations and post-participation return of results, was conducted throughout the project. Local approvals for engagement with the Rapa Nui community were obtained from the mayor (P. P. E. Paoa) of the municipality of Easter Island, and the study was registered with the National Corporation for Indigenous Development (CONADI), in accordance with the indigenous law no. 19.253. The guidelines of the UNESCO International Declaration on Human Genetic Data and the Declaration of Helsinki were followed throughout the study.

Genotyping

Sampled populations and genotyping platforms are detailed in Supplementary Tables 1, 2. A total of 26 populations were genotyped at the University of California, San Francisco (UCSF) using Affymetrix Axiom LAT-1 arrays. Genotype calling was performed following default parameters using Affymetrix’s Genotyping Console software. The average call rate was 98.5% for all newly genotyped samples. Before filtering and merging, the total number of SNPs called was 813,036. The resulting SNP density after merging with different reference panels varied across working datasets for downstream analyses, as detailed throughout the methods below.

Data curation

Quality control filters were applied across all sampled individuals using the Plink 1.9 package54, removing individuals with >1% of genotyped sites missing (--mind 0.01), removing genotyped sites missing in >1% of individuals (--geno 0.01), and removing sites (18 SNPs) with extreme deviations from Hardy–Weinberg equilibrium (P-value less than 10 × 10−110). The independence of drift between these separated, small island populations leads us to expect some deviation from Hardy–Weinberg in Polynesia, so we do not apply a typical threshold here. All samples were analysed on the GRCh37 (hg19) genome build55. REAP56 was used to determine kinship coefficients using the ADMIXTURE clustering results discussed below; individuals with a kinship coefficient of >0.2 (first-degree relatives) were iteratively removed. Total numbers of individuals from each population after all filters were applied are given in Supplementary Table 1. After merging reference sequence data with sample genotyped data, strand inconsistencies were flipped when unambiguous, while ambiguous SNPs were removed, leaving 689,899 SNP sites. The recombination map from the 1000 Genomes project was used to assign genetic positions57 in centimorgans (cM).
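The logic of the two missingness filters can be sketched in a few lines. This is an illustration only, not a substitute for Plink: it assumes genotypes held as a list of rows (one per individual) with `None` for a missing call, it applies the filters in the order the text lists them (the order in which Plink itself applies them may differ), and the toy matrix uses relaxed thresholds so that the effect is visible. All names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[int]  # 0, 1 or 2 alternate alleles; None means a missing call


def filter_missingness(genotypes: Sequence[Sequence[Genotype]],
                       max_ind_missing: float = 0.01,
                       max_site_missing: float = 0.01) -> Tuple[List[int], List[int]]:
    """Drop individuals, then sites, whose missing-call fraction exceeds the thresholds.

    Returns the indices of the retained individuals and of the retained sites.
    """
    n_sites = len(genotypes[0])
    kept_inds = [i for i, row in enumerate(genotypes)
                 if sum(g is None for g in row) / n_sites <= max_ind_missing]
    kept_sites = [s for s in range(n_sites)
                  if sum(genotypes[i][s] is None for i in kept_inds) / len(kept_inds)
                  <= max_site_missing]
    return kept_inds, kept_sites


# Toy matrix (hypothetical): four individuals by five sites.
toy = [
    [0, 1, 2, 0, 1],
    [0, None, 2, 0, 1],
    [None, None, None, 0, 1],
    [1, 1, None, 2, 0],
]
kept_individuals, kept_site_indices = filter_missingness(
    toy, max_ind_missing=0.25, max_site_missing=0.25)
```

Site missingness is recomputed over the retained individuals only, so a poorly genotyped sample that has already been dropped does not cause otherwise good sites to be discarded.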

Admixture analyses

Principal component analysis

EIGENSOFT 7.2.158 was used for all PCA. Linkage disequilibrium pruning (LD-pruning) was used across sliding 50-SNP windows with 10-SNP steps to remove variants with >0.5 squared correlation (--indep-pairwise 50 10 0.5), leaving 461,571 SNPs for PCA. Plots were made with ggplot2 3.1.059 using R 3.5.260.

Global ancestry clustering analysis

Unsupervised ancestry clustering was performed using ADMIXTURE 1.3.061 on the LD-pruned dataset described above for PCA combining samples from all Pacific island populations together with continental references from Africa (Yoruba), Europe (Britain and Spain), East Asia (Japan and China) and the Americas (Aymara, Mapuche, Huilliche and Pehuenche) for a total of 686 samples. The numbers of samples from each population are given in Supplementary Tables 1, 2. An elbow62 was found in the cross-validation error plot at K = 7 clusters, with larger numbers of clusters delivering little improvement (Supplementary Fig. 32).

Local ancestry analysis

Semi-supervised local ancestry inference was performed for all filtered Pacific island samples (430 samples, 689,899 SNP sites) using RFMix v1.5.463 with two expectation-maximization (EM) iterations and references from the five ancestry clusters, namely African (60 West African Yoruba individuals), European (30 Spanish and 30 British individuals), Native American (60 Native American individuals from Puno, Peru), Ni-Vanuatuan (all 19 individuals from Vanuatu) and Remote Oceanian (60 individuals with <1% ancestry from outside the Pacific islands as identified by ADMIXTURE). The existence of these five ancestries within the Pacific island samples had been indicated by the K = 7 unsupervised global ADMIXTURE clustering run discussed above (Supplementary Fig. 32). The recommended RFMix settings (two EM iterations and a 0.2-cM window size) were used, and unphased samples were first phased by SHAPEITv2.837 with default settings64. A few Pacific island individuals, particularly in the Marquesas, were found to also have >5% East Asian ancestry in the ADMIXTURE results. This is likely owing to the post-colonial movement to those islands of Hakka immigrants from China for work in the 19th century65. Those individuals were removed, so that this sixth ancestry did not need to be separately resolved by local ancestry analysis.

Masking

As discussed above, modern Pacific Islanders are often admixed, possessing European and occasionally Native American and African ancestries (Supplementary Fig. 32). European ancestries entered Polynesia during the colonial period with the first European explorer (Magellan) arriving in the 16th century and significant immigration commencing in the early 19th century65. Native American ancestry, particularly from emigration of admixed Hispanic individuals from Chile, which annexed Rapa Nui (Easter Island), and African ancestry also entered33. Because ancestries fully (or partially) introduced via colonial settlement did not necessarily follow the same island settlement process (or founder sizes and dates) as the original Polynesian settlement, such ancestries need to be distinguished, necessitating an ancestry-specific approach66 (Supplementary Figs. 10, 20). For this reason we removed European chromosomal segments, as well as African and Native American, from the Pacific island samples. This step is called masking67,68, since variants located in certain ancestry segments (identified above by RFMix via haplotype sequence pattern matching) are masked (removed) from the analysis. We refer to the remaining (unmasked) chromosomal segments as Polynesian ancestry chromosomal segments (Supplementary Table 3), and we refer to analyses that use only these segments as Polynesian ancestry-specific analyses. (Such analyses may still include as references non-Polynesian populations, such as from Europe or Taiwan. These reference populations will of course have non-Polynesian ancestry and are not masked.) A description of which analyses were performed masked and which unmasked, and when references were used, is given in Supplementary Table 4.

Polynesian ancestry-specific allele frequency analyses

Treemix analysis

Treemix40 was run on the combined set of Pacific island and reference populations (Supplementary Tables 1, 2) using raw marker counts for each population. It was also run on the Pacific island populations using only the counts of markers found in Polynesian-ancestry chromosomal segments for each population, as described above.

Creation of Polynesian ancestry-specific genotype frequency vectors and matrix

For each of an individual’s two haplotypes, variants located in non-Polynesian ancestry segments were masked, as described above. The two haplotypes for each individual were then averaged to create a genotype frequency vector having, for each site, 0 when no alternate allele was present, 0.5 when one alternate allele and one reference allele were present, and 1 when no reference allele was present. Some sites, where an individual had no Polynesian variant on either haplotype, remained missing. These missing values were accounted for in the following manner. The genotype frequency vector for each individual from the dataset was placed into the row of an N individuals × p genotyped markers matrix and the nuclear norm regularized matrix completion algorithm of Mazumder et al. was applied to create a reduced rank approximation to the original, incomplete 689,899-dimensional masked genotype matrix33,69. Unlike earlier methods70,71,72, this method permits the use of all samples rather than only a panel of reference samples for the completion step; thus, far more data is used allowing for more accurate completion. In addition, instead of using haplotypes (haploid genomes) as the unit of analysis, this method uses genotype frequency vectors (frequency vectors for the diploid genome). Since there is no linkage present in the genome across chromosome boundaries (owing to independent assortment of chromosomes), population phasing cannot resolve parental haplotypes across these boundaries. Thus, a genome-wide haplotype vector constructed by assembling all chromosomes sequentially into a single row vector will switch phase arbitrarily across chromosome boundaries and so is already a mixture of an individual’s two true parental haplotypes. Further, by explicitly averaging an individual’s two haplotype vectors to form a single genotype frequency vector for that individual, we are able to fill in much of the masked data that is missing from either of the two haplotypes.
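The masking-and-averaging step (not the matrix completion that follows it) can be sketched as below. The sketch assumes per-site ancestry labels for each haplotype, with a hypothetical label `"POL"` for Polynesian segments, and it reads the text as saying that a site masked on only one haplotype takes the value of the remaining haplotype. That reading, the labels and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

POLYNESIAN = "POL"  # hypothetical local-ancestry label for segments that are kept


def genotype_frequency_vector(hap1: Sequence[int], hap2: Sequence[int],
                              anc1: Sequence[str], anc2: Sequence[str],
                              keep: str = POLYNESIAN) -> List[Optional[float]]:
    """Average the two haplotypes at each site over their unmasked alleles.

    Alleles are coded 0 (reference) or 1 (alternate). A site is None only when
    both haplotypes carry non-Polynesian ancestry there.
    """
    vector: List[Optional[float]] = []
    for a1, a2, l1, l2 in zip(hap1, hap2, anc1, anc2):
        unmasked = [allele for allele, label in ((a1, l1), (a2, l2)) if label == keep]
        vector.append(sum(unmasked) / len(unmasked) if unmasked else None)
    return vector


# Toy individual (hypothetical) with European ancestry on parts of each haplotype.
vec = genotype_frequency_vector(
    hap1=[0, 1, 1, 0], hap2=[1, 1, 0, 0],
    anc1=["POL", "POL", "EUR", "EUR"], anc2=["POL", "POL", "POL", "EUR"])
```

Only the last site remains missing here, since it is masked on both haplotypes; entries of that kind are the ones left for the matrix completion step.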

Ancestry-specific drift projection

Each Pacific island individual’s Polynesian ancestry-specific genotype frequency vector, described above, was projected onto the axis (drift axis), defined as the axis between the centroid of the indigenous Taiwan (Atayal and Paiwan) genotype frequency vectors and the centroid of the Rapa Nui (Easter Island) genotype frequency vectors. Each Pacific island individual’s genotype frequency vector was also projected onto the first principal component of the subspace orthogonal to this axis to provide a second coordinate for two-dimensional visualization. The first principal component of this orthogonal subspace is computed by finding the residual of each data point after subtracting off its component parallel to the drift axis and then determining the direction of greatest variation for these residuals. The per cent variance explained by each dimension was computed directly by finding the variance of the projections on that dimension.
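The projection described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes completed (no missing values) genotype frequency vectors, places the origin at the first centroid, and uses a unit-length axis; the function names are hypothetical. The second coordinate would then come from a principal component analysis of the returned residuals, which is not shown here.

```python
import math
from typing import List, Sequence, Tuple


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean vector of a group of genotype frequency vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def drift_projection(x: Sequence[float], origin: Sequence[float],
                     target: Sequence[float]) -> Tuple[float, List[float]]:
    """Return the coordinate of x along the origin->target axis and the
    residual of x after removing its component parallel to that axis."""
    axis = [t - o for o, t in zip(origin, target)]
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("the two centroids coincide; the axis is undefined")
    unit = [a / norm for a in axis]
    centred = [xi - oi for xi, oi in zip(x, origin)]
    coordinate = sum(c * u for c, u in zip(centred, unit))
    residual = [c - coordinate * u for c, u in zip(centred, unit)]
    return coordinate, residual
```

In this sketch `origin` would be the centroid of the Taiwan vectors and `target` the centroid of the Rapa Nui vectors.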

Ancestry-specific t-SNE

The number of significant (P < 0.05) dimensions for the genotype frequency matrix, described above, was determined (n = 14) using a Tracy–Widom distribution58 and verified via a scree plot73. To ensure that all population structure was captured, the genotype frequency matrix was projected onto its first twenty principal component axes. A t-SNE was generated by applying the Barnes–Hut t-SNE implementation to this projected matrix using: theta = 0, perplexity = 15, exaggeration factor = 10, max iter = 10,000, and lying iter = 1,000 parameters74,75. Both a two-dimensional and three-dimensional embedding were created. Projections onto fewer dimensions yielded similar results, with some clusters beginning to disappear in the range 12–15 dimensions, as predicted by the Tracy–Widom analysis.

Ancestry-specific UMAP

The left singular vectors of the completed genotype frequency matrix were used as input for computing a two-dimensional UMAP with a Manhattan distance metric and 80 nearest neighbours76,77.

Ancestry-specific SOM

A two-dimensional SOM of the genotype frequency matrix was produced on a 100 × 100 rectangular grid using a Gaussian neighbourhood78. The package Somoclu, a massively parallel implementation of SOM, was used for optimization with parameters: 10 epochs, stdcoef 0.5, and linear cooling79.

Principal coordinate analysis and principal curves

Principal coordinate analysis (PCoA) and principal curves were constructed from the relevant distance matrices (either π or F3, described below) using R 3.5.2 together with the package buds80.

Population statistics

All population statistics described below (ψ, π, F3, F4, Fst and heterozygosity) were computed on population variant frequency vectors  created by computing, for each site, \({\mathop{p}\limits^{ \sim }}_{i}=\frac{{\mathop{a}\limits^{ \sim }}_{i}}{{\mathop{n}\limits^{ \sim }}_{i}}\), where ãi is the minor allele count at the site aggregated across all individuals’ haplotypes (two haploid genomes per individual) having that site located in a Polynesian chromosomal segment for population i, and ñi is the total count of Polynesian minor and major alleles for i. A tilde is used to denote counts from Polynesian-specific chromosomal haplotypes. Any sites not located in a Polynesian segment for any of the individuals within a population (or located in only one haplotype within the entire population) were removed from the dataset for all populations, so as to have no populations with one or fewer total allele observations at any site. This filtering resulted in the loss of 60,377 SNPs (8.75% of the total 689,899 SNPs), leaving 629,522 SNPs across all populations for computation of population allele frequency statistics.

Psi (ψ)

The range expansion statistic (ψ) of Peter et al.37 (see Supplementary Discussion, ‘On directionality’ and ‘On psi’) was computed first by polarizing all markers (identifying the minor allele) using the indigenous Taiwanese samples (Atayal and Paiwan) as an outgroup. To investigate the effect of using a different outgroup, a repolarization was performed in a separate analysis using the western islands (Tonga, Samoa, Fiji) as an outgroup. The latter calculation reduced the standard errors for the range expansion statistic on islands settled subsequent to western Polynesia, that is, the eastern Polynesian islands; nevertheless, the general ordering of islands in the range expansion was the same for both calculations (see comparison Supplementary Fig. 14). Because allele frequencies drifted during the Pacific island settlement process, some minor alleles in Taiwan would have become major alleles by the time the settlers reached western Polynesia (see Supplementary Discussion, ‘On psi’), so the intermediate repolarization using Tonga, Samoa, and Fiji as an outgroup increased the resolution of the range expansion statistic (reduced standard errors) for downstream islands. The larger number of samples from Tonga, Samoa, and Fiji (51), as opposed to Taiwan (22), also contributed, as it allowed us to set a more permissive bound for confirming that an allele observed minor in the outgroup samples was also minor in the outgroup population. This in turn increased the number of markers present in the latter analysis. (A 0.1 or lower minor allele frequency was required in the merged Tonga, Samoa, Fiji outgroup samples, yielding 228,262 SNPs, as opposed to the more stringent requirement of minor alleles being fixed in the Taiwan outgroup samples, following the procedure of Zhan et al.38, which yielded only 137,383 SNPs.) ψ was calculated using the formula of Peter et al.,

$$\psi (A,B)=\frac{1}{\text{No. of shared SNPs}}{\sum }_{j\in \text{shared SNPs}}({\mathop{p}\limits^{ \sim }}_{A,j}-{\mathop{p}\limits^{ \sim }}_{B,j}),$$

where the sum is taken only over SNPs shared polymorphic in both the population A sample and the population B sample81. When using Taiwan as an outgroup, we masked the small Ni-Vanuatuan segments seen within Polynesia, since these segments trace their predominant ancestral origin back to a Papuan outgroup in New Guinea, rather than to the Austronesian outgroup, Taiwan. (Admixture between populations stemming from these two sources occurred on Vanuatu and other Melanesian islands in the thousand years before the settlement of Polynesia and was carried into Polynesia during its settlement17,82.) However, both masking methods (both the Taiwan and the Tonga, Samoa, Fiji outgroup polarizations) gave the same ordering of islands settled.
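A minimal Python sketch of this average is given below. It assumes the allele frequencies have already been polarized against the chosen outgroup and ancestry-masked, that `None` marks a site with no Polynesian observation, and that "shared polymorphic" means a frequency strictly between 0 and 1 in both samples; the function name is hypothetical.

```python
from typing import Optional, Sequence


def psi(freq_a: Sequence[Optional[float]], freq_b: Sequence[Optional[float]]) -> float:
    """Mean frequency difference over SNPs polymorphic in both samples."""
    diffs = [
        pa - pb
        for pa, pb in zip(freq_a, freq_b)
        if pa is not None and pb is not None and 0.0 < pa < 1.0 and 0.0 < pb < 1.0
    ]
    if not diffs:
        raise ValueError("no shared polymorphic SNPs")
    return sum(diffs) / len(diffs)
```

By construction the sketch is antisymmetric, so `psi(a, b)` equals `-psi(b, a)`, which is what gives the statistic its directionality.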

Pi (π)

This quantity is the average number of pairwise differences per pair of haplotypes (haploid genomes) selected at random, one from each population, normalized by the number of sites35,83,84. Also known as the nucleotide diversity85, it can be computed by first taking the ratio of the total number of mismatch combinations at a site to the total number of combinations, that is, where a1 is the number of alleles of one type in population 1 at a biallelic marker, b1 is the number of the other type, n1 = a1 + b1 is the total number of haplotypes in population 1, and thus p1 = a1/n1 is the allele frequency in population 1, then at this site

$${\pi }_{12}=\frac{{a}_{1}\cdot {b}_{2}+{b}_{1}\cdot {a}_{2}}{({a}_{1}+{b}_{1})({a}_{2}+{b}_{2})}=\frac{{a}_{1}({n}_{2}-{a}_{2})+{a}_{2}({n}_{1}-{a}_{1})}{{n}_{1}{n}_{2}}={p}_{1}-{p}_{1}{p}_{2}+{p}_{2}-{p}_{2}{p}_{1}={p}_{1}(1-2{p}_{2})+{p}_{2}.$$

This is an unbiased estimator that can be averaged over all sites to find the average number of pairwise differences per haplotype pair per site35,85. Using this frequency-based formulation, this estimator can be generalized to Polynesian-specific allele frequencies for each island \({\mathop{{\rm{p}}}\limits^{ \sim }}_{i}\).
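The count-based and frequency-based forms of the per-site estimator can be checked against each other with a short Python sketch. The function names are hypothetical, and the input is assumed to be the Polynesian-specific allele counts for each population at each site.

```python
from typing import Iterable, Tuple


def pi_site(a1: int, n1: int, a2: int, n2: int) -> float:
    """Fraction of cross-population haplotype pairs that mismatch at one site."""
    return (a1 * (n2 - a2) + a2 * (n1 - a1)) / (n1 * n2)


def pi_site_from_frequencies(p1: float, p2: float) -> float:
    """Equivalent frequency form: p1(1 - 2 p2) + p2."""
    return p1 * (1.0 - 2.0 * p2) + p2


def pi_between(sites: Iterable[Tuple[int, int, int, int]]) -> float:
    """Average of the per-site estimator over (a1, n1, a2, n2) tuples."""
    values = [pi_site(a1, n1, a2, n2) for a1, n1, a2, n2 in sites]
    return sum(values) / len(values)
```

For example, with a1 = 3 of n1 = 10 and a2 = 4 of n2 = 8, both forms give (3·4 + 4·7)/80 = 0.5.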

F3

The F3 shared drift statistic of Patterson et al.86 was computed using the formula

$${\hat{F}}_{3}(C;A,B)=({\mathop{p}\limits^{ \sim }}_{C}-{\mathop{p}\limits^{ \sim }}_{A})({\mathop{p}\limits^{ \sim }}_{C}-{\mathop{p}\limits^{ \sim }}_{B})-{\hat{h}}_{C}/{\mathop{n}\limits^{ \sim }}_{C},$$

where \({\mathop{{\rm{p}}}\limits^{ \sim }}_{A}\) = \({\mathop{{\rm{a}}}\limits^{ \sim }}_{A}\)/\({\mathop{{\rm{n}}}\limits^{ \sim }}_{A}\) is the sample allele frequency in the ancestry of interest in population A (\({\mathop{{\rm{n}}}\limits^{ \sim }}_{A}\) total observations and \({\mathop{{\rm{a}}}\limits^{ \sim }}_{A}\) observations of the allele a) and

$${\hat{h}}_{A}=\frac{{\mathop{a}\limits^{ \sim }}_{A}({\mathop{n}\limits^{ \sim }}_{A}-{\mathop{a}\limits^{ \sim }}_{A})}{{\mathop{n}\limits^{ \sim }}_{A}({\mathop{n}\limits^{ \sim }}_{A}-1)}$$

and similarly for B and C. For multiple sites these values are computed for each site and then averaged across all sites36.
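A hedged Python sketch of the per-site estimator and its average follows. It takes the bias-correction term to be ĥ for population C divided by the total number of allele observations in C, following Patterson et al.; the function names and the tuple layout are assumptions of this sketch.

```python
from typing import Iterable, Tuple

Counts = Tuple[int, int]  # (allele count a, total observations n) at one site


def h_hat(a: int, n: int) -> float:
    """Unbiased heterozygosity-type term a(n - a) / (n(n - 1))."""
    return a * (n - a) / (n * (n - 1))


def f3_site(c: Counts, a: Counts, b: Counts) -> float:
    """Bias-corrected F3(C; A, B) at a single site."""
    p_c, p_a, p_b = c[0] / c[1], a[0] / a[1], b[0] / b[1]
    return (p_c - p_a) * (p_c - p_b) - h_hat(*c) / c[1]


def f3(sites: Iterable[Tuple[Counts, Counts, Counts]]) -> float:
    """Average of the per-site estimator across all sites."""
    values = [f3_site(c, a, b) for c, a, b in sites]
    return sum(values) / len(values)
```

Sites at which population C has fewer than two observations would make the correction term undefined, which is consistent with the filtering of such sites described under 'Population statistics'.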

F4

To detect departures from the reconstructed settlement tree (inter-island admixture), the F4 statistic was computed for each site using the formula of Patterson et al.36

$${\hat{F}}_{4}(A,B;C,D)=({\mathop{p}\limits^{ \sim }}_{A}-{\mathop{p}\limits^{ \sim }}_{B})({\mathop{p}\limits^{ \sim }}_{C}-{\mathop{p}\limits^{ \sim }}_{D}),$$

and was then averaged across all sites. The F4 statistic is expected to be zero unless groups A and B do not form a separate clade from C and D within the actual population tree. Thus, when computing statistics of the form F4(parental_island, child_island; Samoa, X), where X varies across all islands that are not descended from parental_island in our model, a zero value of F4 is expected if the data completely support our settlement model. This is because all non-descendant islands (X) must lie in a common clade with outgroup Samoa; that is, external to the parental_island, child_island subclade. We look for significant evidence (P < 0.001) of departure from this model for each parental_island, child_island pair in our settlement sequence, and across all possible non-descendant islands X, while accounting for the multiple tests (n = 52) with a Bonferroni correction. We find deviations from our settlement tree only for 3 of its branches: Mangareva–Raivavae (migration from Tahiti), Mangareva–Palliser (migration from Tahiti), and North Marquesas–Palliser (migration from Tahiti and also from the Cooks). The Tahitian migrations go only to French Polynesian islands and likely reflect modern (see Supplementary Fig. 23) introgression to those islands from Tahiti, the modern capital, source of teachers, ministers, and civil servants, and centre of employment, transportation, and residential education for French Polynesia. The migration from the Cooks directly to North Marquesas (bypassing the Palliser group) is intriguing, especially in light of our late dated Palliser–North Marquesas connection (AD 1330). It could be that North Marquesas (Nuku Hiva) was settled earlier more directly from the Cooks, whereas South Marquesas (Fatu Hiva) was, we have found, settled early (AD 1140) from Palliser. Later within-island-group migration between these neighbouring islands may have led North Marquesas to exhibit these two origin signals, one from Palliser and one from the Cooks. If so, North and South Marquesas would be an unusual case, where two neighbouring islands were settled from different parental islands, then, because they were not separated by large oceanic distances, were able to exchange enough subsequent migrants to leave a notable genetic trace within their post-growth population base.
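For illustration, the site-averaged F4 statistic can be sketched in Python as below, assuming the standard Patterson et al. form, (p_A − p_B)(p_C − p_D), and ancestry-specific frequency vectors defined at the same sites for all four populations. The function name is hypothetical, and the significance testing (block bootstrap and Bonferroni correction) is not shown.

```python
from typing import Sequence


def f4(p_a: Sequence[float], p_b: Sequence[float],
       p_c: Sequence[float], p_d: Sequence[float]) -> float:
    """Average over sites of (p_A - p_B)(p_C - p_D)."""
    if not (len(p_a) == len(p_b) == len(p_c) == len(p_d)):
        raise ValueError("all frequency vectors must cover the same sites")
    products = [(a - b) * (c - d) for a, b, c, d in zip(p_a, p_b, p_c, p_d)]
    return sum(products) / len(products)
```

In the notation of the text, a call such as `f4(parent, child, samoa, x)` would be expected to return a value near zero when island X lies outside the parent and child subclade.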

Fst

The Hudson estimator for Fst is

$${\hat{F}}_{{\rm{s}}{\rm{t}}}^{{\rm{H}}{\rm{u}}{\rm{d}}{\rm{s}}{\rm{o}}{\rm{n}}}=\frac{{({\mathop{p}\limits^{ \sim }}_{A}-{\mathop{p}\limits^{ \sim }}_{B})}^{2}-\frac{{\mathop{p}\limits^{ \sim }}_{A}(1-{\mathop{p}\limits^{ \sim }}_{A})}{{\mathop{n}\limits^{ \sim }}_{A}-1}-\frac{{\mathop{p}\limits^{ \sim }}_{B}(1-{\mathop{p}\limits^{ \sim }}_{B})}{{\mathop{n}\limits^{ \sim }}_{B}-1}}{{\mathop{p}\limits^{ \sim }}_{A}(1-{\mathop{p}\limits^{ \sim }}_{B})+{\mathop{p}\limits^{ \sim }}_{B}(1-{\mathop{p}\limits^{ \sim }}_{A})},$$

for a given SNP. For multiple sites, the numerator and the denominator (unbiased estimators of the variance between populations and the variance in the ancestral population respectively) are averaged across all SNPs separately before taking the ratio to create a consistent estimator87.
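The ratio-of-averages construction can be sketched as follows. The function name and the tuple layout (frequency and number of observations for each population at each site) are assumptions of this sketch; summing rather than averaging the numerator and denominator gives the same ratio.

```python
from typing import Iterable, Tuple

Site = Tuple[float, int, float, int]  # (p_A, n_A, p_B, n_B) at one SNP


def hudson_fst(sites: Iterable[Site]) -> float:
    """Hudson Fst as a ratio of sums of per-SNP numerators and denominators."""
    numerator = 0.0
    denominator = 0.0
    for p_a, n_a, p_b, n_b in sites:
        numerator += ((p_a - p_b) ** 2
                      - p_a * (1.0 - p_a) / (n_a - 1)
                      - p_b * (1.0 - p_b) / (n_b - 1))
        denominator += p_a * (1.0 - p_b) + p_b * (1.0 - p_a)
    if denominator == 0.0:
        raise ValueError("no informative SNPs")
    return numerator / denominator
```

Taking the ratio per SNP and then averaging would instead weight rare and common variants very differently, which is why the two sums are accumulated separately.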

Heterozygosity

The unbiased estimator for heterozygosity, first given by Nei and Roychoudhury88, for a specific site is

$$\hat{h}=1-\frac{N\,\sum {\mathop{p}\limits^{ \sim }}_{\ell }^{2}-1}{N-1},$$

where \({\mathop{p}\limits^{ \sim }}_{\ell }\) is the frequency of the \({\ell }\)th allele at the site, and N is the total number of alleles at that site (two for each of our SNPs). This estimator was aggregated across each SNP locus k using

$$\hat{H}={\sum }_{k=1}^{r}\frac{{\hat{h}}_{k}}{r},$$

for all r of our SNP loci35,88.

Standard errors

Standard errors for all allele frequency-based statistics were computed using the block bootstrap using 100 replicates and a block size of 1,000 markers89. This gives better variance estimates than the jackknife for these pairwise allele frequency comparisons35. The markers are bootstrapped together as long contiguous blocks to preserve the effects of linkage on the variance of the estimates36.
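A minimal sketch of a block bootstrap over contiguous markers is shown below. The defaults of 100 replicates and 1,000 markers per block are taken from the text; everything else (function name, fixed random seed, resampling as many blocks as the data contain, and passing the statistic as a callable on per-marker values) is an assumption of this illustration.

```python
import random
import statistics
from typing import Callable, List, Sequence


def block_bootstrap_se(values: Sequence[float],
                       statistic: Callable[[List[float]], float],
                       block_size: int = 1000,
                       replicates: int = 100,
                       seed: int = 0) -> float:
    """Standard error of `statistic`, resampling contiguous marker blocks."""
    blocks = [list(values[i:i + block_size])
              for i in range(0, len(values), block_size)]
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        resampled: List[float] = []
        for _ in blocks:  # draw as many blocks as the original data contain
            resampled.extend(rng.choice(blocks))
        estimates.append(statistic(resampled))
    return statistics.stdev(estimates)
```

Resampling whole blocks, rather than single markers, keeps linked markers together so that the variance estimate is not artificially deflated.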

Migration network reconstruction

The various population measures of distance and directionality (ψ, π, F3) between all pairs of islands define together tensors that annotate the complete graph of island connectivity. It remains to prune this graph judiciously to arrive at the tree representing the branching settlement process of the serially founded Pacific islands; that is, a tree describing which islands were settled from which other islands (Supplementary Discussion, ‘On differences between range expansion trees and typical population trees’ and ‘On tree building’).

In brief, we use the range expansion statistic ψ (Fig. 2b) to determine the upstream islands along the range expansion; that is, the set of potential parent islands for each island. Beginning with the island with the largest ψ (measured against Samoa), we work backward in order of decreasing ψ (Fig. 2b, Extended Data Fig. 4a), joining each still orphaned island (j) to its closest related potential parent island (i) as defined by ψ. To measure genetic distance (closeness), we use the average number of pairwise differences πij (Extended Data Fig. 4b, Supplementary Fig. 17), since πij has been shown to have higher correlation with the divergence time between two populations (i and j) than the outgroup-F3 statistic84 (Supplementary Discussion, ‘On different drift distance metrics’), although the same settlement sequence is also returned when using the latter metric instead (Supplementary Fig. 12).

Begin with the island with the most potential parents (at the end of the range expansion) (Fig. 2b) or, in other words, the largest ψ, Rapa Nui. Consistent with its terminal position in the range expansion, Rapa Nui also has the lowest heterozygosity (Supplementary Fig. 31) and the highest intra-island IBD (Supplementary Figs. 25, 26). Starting with this terminal island, we consider all potential parent islands as indicated by the ψ directionality index, and connect Rapa Nui to the most closely related potential parent as indicated by the smallest average number of pairwise differences (π). We then proceed to the island with the second most potential parents according to the ψ-statistic (here Raivavae (Fig. 2b)) and repeat. For Samoa, Fiji, Tonga and upstream islands, we use the ψ directionality index polarized using the Taiwan outgroup. For islands downstream of Samoa, as indicated by the Taiwan-polarization ψ, we use a ψ-statistic repolarized using the more proximal Samoa, Fiji and Tonga outgroup, since it has smaller standard errors (see ψ discussion above).

This recursive algorithm for building the branching settlement path of the Pacific islands is a form of the Chu–Liu–Edmonds algorithm, which is guaranteed to produce the minimum spanning tree of a directed acyclic graph90,91 (Supplementary Discussion, ‘On tree-building’). That our graph is acyclic can be shown (proof derived in Supplementary Discussion, ‘On the acyclicity of psi’) from the formal definition of ψ, which defines our edge directionality. The lack of significant internal cross-migration edges was determined by our F4 analysis above.
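The greedy parent-assignment step can be illustrated with a simplified Python sketch. It is not the authors' implementation: it assumes that each island's set of potential parents can be summarized by a single ψ value measured against the root (islands with smaller ψ are treated as upstream), whereas the text defines potential parents from the pairwise ψ statistics. The function name and the toy island labels in the usage example are hypothetical.

```python
from typing import Dict, FrozenSet


def build_settlement_tree(psi_to_root: Dict[str, float],
                          pi: Dict[FrozenSet[str], float]) -> Dict[str, str]:
    """Map each island to its inferred parent island.

    psi_to_root: psi of each island measured against the root (assumed summary).
    pi: average pairwise differences keyed by an unordered island pair.
    """
    parent: Dict[str, str] = {}
    # Work backward from the end of the range expansion (largest psi first).
    for child in sorted(psi_to_root, key=psi_to_root.get, reverse=True):
        candidates = [i for i in psi_to_root
                      if psi_to_root[i] < psi_to_root[child]]
        if not candidates:
            continue  # the root has no upstream island
        parent[child] = min(candidates,
                            key=lambda i: pi[frozenset((i, child))])
    return parent
```

With toy inputs `{"root": 0.0, "X": 0.1, "Y": 0.2}` and π values of 0.30 (root, X), 0.50 (root, Y) and 0.20 (X, Y), the sketch attaches Y to X and X to the root.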

In the case of parental islands with multiple child islands, we can now use an inner product measure, the F3 statistic (Extended Data Fig. 4c), which measures shared genetic drift, to determine whether any of those child islands share additional drift with each other beyond what they share with their common parent (Extended Data Fig. 7). Such additional shared drift is indicated in Fig. 2a by branching arrows; that is, arrows from a parent island that share an initial path before later branching to each child island. The order of arrow divergence indicates the ordering of shared drift among the child populations. These shared paths may suggest that intermediate islands in the settlement sequence are missing from our dataset (Extended Data Fig. 5), since the founding bottleneck of an intermediate island could account for the additional shared drift. To further verify our settlement sequence, and to look for signs of post-settlement inter-island admixture, we compute F4 statistics of the form F4 (parental island, child island; Samoa, X) with X ranging over all Polynesian islands not stemming from parental island in the settlement tree (described above). These F4 statistics indicate whether there is statistically significant evidence for deviations from our settlement model; that is, later migrations across the ocean of sufficient size to significantly alter the genetic base of the post-growth island populations. Only three branches in our settlement sequence show any significant deviations, and each of these indicates a migration from Tahiti to an outlying French Polynesian island, consistent with Tahiti’s recent role as the capital of French Polynesia.

Principal curve analysis

To independently verify our settlement sequence map, we compute unsupervised principal curves78,80 between the islands using genetic distances defined by both the outgroup-F3 and π metrics (Extended Data Fig. 7 and Supplementary Fig. 17, respectively).

IBD analyses

In highly related populations, such as populations that have passed through a population size bottleneck in the recent past, individuals will share many ancestors, and thus many identical-by-descent (IBD) genetic fragments92. In such cases, for example serially founded small island populations, IBD-based analyses become a powerful tool for reconstructing migrations.

Germline

GERMLINE 1.5.3 was run on the phased Pacific islander samples to find all IBD shared segments of 5 cM or greater using the -min_m flag. Fragments shorter than this length are prone to false positives owing to insufficient SNPs93,94,95. Up to four homozygous marker mismatches were permitted per IBD slice (-err_hom), and one heterozygous marker mismatch was permitted per IBD slice (-err_het). For a demonstration that our results are robust to IBD breaks due to phasing errors, see Extended Data Fig. 6.

Polynesian ancestry-specific filtering

To deconvolve Polynesian ancestral history from later (colonial and post-colonial) ancestry histories (for instance, European) we used an ancestry-specific approach to IBD66. Inter-island IBD segments lying wholly within post-colonial ancestries, or spanning post-colonial and pre-colonial ancestries, are necessarily the result of post-colonial inter-island contact events and were discarded. IBD segments lying wholly within chromosomal regions of known pre-colonial ancestry sources, that is Polynesian ancestry, were identified and analysed together.

Runs of homozygosity

Polynesian runs of homozygosity (ROH) were computed by summing together only Polynesian-specific IBD segments found shared between an individual’s two haploid genomes, then normalizing by the effective fraction of homozygous Polynesian ancestry segments found in that individual. These are the only segments of the diploid genome that could have shared a Polynesian ancestry ROH. Population Polynesian-specific ROH values were computed by averaging these values for all individuals within each island population. Standard errors were calculated by using the jackknife over individuals in a population96.

Ancestry-specific sum of IBD segment lengths

When analysing IBD segments, it has been typical to sum the total length (Wab) of segments shared between a pair of individuals (a and b), one from each of a pair of populations (A and B), and then sum over all such pairs to arrive at a total sum of IBD sharing between each pair of populations97. This sum can be normalized, dividing by the total number of possible cross-population pairs of individuals, one from each of the populations (nAnB), to give the average total IBD length shared (WAB) per cross-population individual pair94,97,98

$${W}_{{AB}}=\frac{{\sum }_{a\in A}{\sum }_{b\in B}{W}_{{ab}}}{{n}_{A}{n}_{B}}$$

This normalization can also be performed over the total number of cross-population haplotype (haploid genome) pairs (\(2{n}_{A}\cdot 2{n}_{B}\)), rather than all individual pairs66 (nAnB).

When considering only IBD segments found in those portions of both individuals’ genomes that belong to a particular ancestry, the normalization must be modified to reflect the reduced fraction of the pairs’ genomes that were considered. Thus, we replace the number of cross-population pair comparisons by an effective number of pair comparisons. If \({f}_{a}\) is the fraction of the genome of a particular ancestry in individual a, and similarly for \({f}_{b}\), then the expected fraction of pairwise overlap between the two individuals is \({f}_{a}{f}_{b}\), rather than 1 as it is for non-admixed individuals. The denominator of the normalization above is now modified by the factor \(\bar{{f}_{A}}\bar{{f}_{B}}\), where \(\bar{{f}_{A}}\) is the average fraction of the ancestry of interest in population A

$$\sum _{{\rm{a}}\in {\rm{A}}}\sum _{{\rm{b}}\in {\rm{B}}}{f}_{a}{f}_{b}=(\sum _{a\in A}{f}_{a})(\sum _{b\in B}{f}_{b})=({n}_{A}\bar{{f}_{A}})({n}_{B}\bar{{f}_{B}})$$

Within a single non-admixed population, the normalized intra-population IBD length sharing per haplotype pair is,

$${W}_{{AA}}=\frac{{\sum }_{a\in A}{\sum }_{{\rm{\alpha }}\in A}{W}_{a{\rm{\alpha }}}}{2{n}_{A}(2{n}_{A}-1)}$$

The ancestry-specific normalization factor for intra-population IBD in an admixed population can be derived by considering the sum of all possible same-ancestry haplotype pair comparisons within the population of interest

$$\begin{array}{c}\,\mathop{\sum }\limits_{i=1}^{{n}_{A}}({f}_{i}(\sum _{j < i}{f}_{j}))=\frac{1}{2}(\mathop{\sum }\limits_{i=1}^{{n}_{A}}\mathop{\sum }\limits_{j=1}^{{n}_{A}}{f}_{i}{f}_{j}-\mathop{\sum }\limits_{i=1}^{{n}_{A}}{f}_{i}^{2})\\ \,=\frac{1}{2}(\mathop{\sum }\limits_{i=1}^{{n}_{A}}{f}_{i}\mathop{\sum }\limits_{j=1}^{{n}_{A}}{f}_{j}-\mathop{\sum }\limits_{i=1}^{{n}_{A}}{f}_{i}^{2})=\frac{1}{2}({n}_{A}^{2}\,{\bar{f}}_{A}^{2}-\mathop{\sum }\limits_{i=1}^{{n}_{A}}{f}_{i}^{2})\end{array}$$

These ancestry-specific normalization factors make clear that, although the normalized total length of IBD sharing between two populations gives a measure of the relatedness of the populations, it is quite sensitive to an accurate estimation of the average fraction in each population of the ancestry of interest.
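The two effective pair counts derived above can be written as a short Python sketch, which also makes the sensitivity to the ancestry fractions easy to explore. The function names are hypothetical, and the inputs are assumed to be per-individual fractions of the ancestry of interest.

```python
from typing import Sequence


def effective_cross_pairs(f_a: Sequence[float], f_b: Sequence[float]) -> float:
    """Sum over a in A, b in B of f_a * f_b, equal to (n_A f̄_A)(n_B f̄_B)."""
    return sum(f_a) * sum(f_b)


def effective_intra_pairs(f: Sequence[float]) -> float:
    """Sum over i and j < i of f_i * f_j, via (1/2)[(sum f)^2 - sum f^2]."""
    total = sum(f)
    return 0.5 * (total * total - sum(x * x for x in f))


def normalized_ibd(total_ibd_length: float, effective_pairs: float) -> float:
    """Average ancestry-specific IBD length per effective pair."""
    if effective_pairs <= 0.0:
        raise ValueError("no effective pair comparisons")
    return total_ibd_length / effective_pairs
```

Because the effective pair count scales with the product of the ancestry fractions, a few per cent error in each fraction propagates directly into the normalized sum.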

A heat map showing the normalized Polynesian-specific IBD sum values for each pair of Pacific islands in our dataset is displayed in Supplementary Fig. 24. Trends of increasing IBD sharing along the course of the inferred settlement chain (see the map in Extended Data Fig. 5) are evident, but there is significant noise.

IBD segment length distributions

A better approach is to compute the distribution of lengths of IBD segments shared between pairs of individuals, one from each of the two populations being compared. Although the total count (integral) of this distribution will be influenced by the fraction in each population of the ancestry of interest, the shape (decay rate) of the distribution will not be. Such robustness to the estimate of each population’s ancestry fraction, which can vary by a few per cent between different ancestry inference methods, is of great benefit. In addition, the shape of the IBD length distribution (decay rate) changes steadily each generation. It does not depend, as genetic drift does, on the fluctuations, which are generally unknown, of the historical population sizes.

Assuming no interference, recombination can be modelled as a Poisson process occurring along the genome at a rate of one recombination break per generation per unit of genomic length (measured in Morgans)99. Thus, the length of a recombination segment, that is the distance between recombination events, is the waiting time of a Poisson process of rate T, where T is measured in generations. Hence, the distribution of the length of fragments (x) from a particular ancestor T generations ago will be exponential with λ = T decay rate100

$$f(x)=\lambda {e}^{-\lambda x}.$$

If we are considering recombination segments shared between two present day individuals stemming from the same common ancestor, that is IBD segments, we must adjust the rate for the number of recombination events per unit length that have occurred down both sides of the pedigree from this common ancestor, which gives a λ of 2T total95,98,101. Each of these 2T opportunities for recombination to occur along the genome is called a meiosis event. For our empirical calculations (and all plots), we use cM, rather than M, so the λ rate constant is divided by 100, yielding T/50.

The total distribution of tract lengths shared between all individuals can be viewed as independent samples from the same exponential distribution. Ralph and Coop have shown that the decay rate parameter λ of this distribution is a weighted average of the distribution of times to the most recent common ancestor across all genomic sites102. This distribution of times can be a complicated function of the demography, when the latter is not simple, leading to an ill-conditioned inverse problem102. However, for our problem—dating the founding of an isolated island group—the demography is amenable. Consider a parent island whose Polynesian explorers crossed thousands of kilometres of Pacific waves to discover and colonize a child island during the Polynesian settlement process. A pair of present-day Pacific Islanders, one from the child island and one from the parent island, cannot share a common ancestor at any site (in their Polynesian ancestry segments) more recently than the founding date of the child island. Moreover, because of the small size of the founding populations arriving on double-hulled sailing canoes2,7,21,22, all individuals on the child island will share ancestors with one another dating to at least the time of this founding bottleneck before which time they coalesce with the ancestors of modern individuals on the parent island. Thus, the decay rate parameter λ will measure the time (T/50 with T in generations) to the split of the parent and child island populations.

Example IBD length distributions for all pairs of individuals in our dataset—with one individual from Rapa Nui (Easter Island) and one from Mangareva, Palliser, Rarotonga, or Samoa—are plotted in Supplementary Fig. 22. Note that altering the normalization factor based on the estimated fractions of Polynesian ancestry amounts to a rescaling of the y-axis in Supplementary Fig. 22a or a translation of the y-axis in Supplementary Fig. 22b. This alters the amplitude, but not the decay rate shape parameter λ, of the exponential in Supplementary Fig. 22a, or, equivalently, the intercept, but not the slope of the lines in Supplementary Fig. 22b, thus demonstrating graphically that λ is robust to noise (errors) in ancestry normalization. However, the sum of IBD lengths, which is the integral of the curves in Supplementary Fig. 22a, is clearly not robust to such errors in the normalization (rescaling the y-axis).

Our empirically observed IBD length distributions are left truncated at 5 cM, since fragments shorter than this length are prone to false positives due to insufficient SNPs93,94,95. The distributions are also right truncated at 15 cM, because outlier segments longer than this are expected to stem from recent contact, dating to less than 10 generations ago (18th century or later) as computed from the expected generation time (g) based on a single fragment length \({\ell }\) in Morgans

$$E[g|{\ell }]\approx \frac{3}{2({\ell }/1M)}$$

where 1M is one Morgan94.
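As a quick arithmetic check of the 15 cM cut-off, the expectation above can be evaluated directly. The function name is hypothetical, and the input is assumed to be a single segment length in cM.

```python
def expected_generations(length_cm: float) -> float:
    """E[g | l] ~ 3 / (2 l), with the segment length l converted to Morgans."""
    if length_cm <= 0.0:
        raise ValueError("segment length must be positive")
    length_morgans = length_cm / 100.0
    return 3.0 / (2.0 * length_morgans)
```

For a 15 cM segment this gives 3/(2 × 0.15) = 10 generations, matching the threshold stated in the text.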

Such occasional post-colonial, inter-island Polynesian contact is not the focus of our Polynesian settlement analysis, so we filter out these few outlier inter-island IBD segments. Not removing these outliers does not change our island settlement dates significantly, but, by distorting the tail of the exponential decay distributions, does increase our standard error (see Supplementary Discussion, ‘On quantification of error in IBD dating’).

To estimate each pairwise λ, we use the maximum likelihood estimator for a left and right truncated exponential distribution. Since the exponential distribution is memoryless, the left truncation is trivially handled by translation of the distribution. That is, the distribution of the length in excess of 5 cM for each fragment is also exponential with the same decay constant λ. For what follows we assume that the IBD lengths have been thus recentred by subtracting 5 cM. The right truncation is less elegant to handle, yielding an equation for the maximum likelihood estimator (\(\hat{\lambda }\)) of λ given by103

$$\frac{\partial {\rm{ln}}{ {\mathcal L} }_{n}}{\partial \lambda }{|}_{\lambda =\hat{\lambda }}=n[{\hat{\lambda }}^{-1}-{x}_{0}{e}^{-\hat{\lambda }{x}_{0}}{(1-{e}^{-\hat{\lambda }{x}_{0}})}^{-1}-\bar{x}]=0,$$

where \({ {\mathcal L} }_{n}\) (the sample likelihood for the n IBD segments) is the product of their individual likelihoods \( {\mathcal L} \), \(\bar{x}\) is the (recentred) mean IBD length, x0 is the (recentred) right truncation point and n is sample size (number of IBD segments). This transcendental equation must be solved numerically. The standard error (SE) can be obtained directly from the observed Fisher Information I(\(\hat{\lambda }\))

$${\rm{SE}}\approx \frac{1}{\sqrt{n\cdot I(\hat{\lambda })}}$$

since

$$(\hat{\lambda }-\lambda )\sqrt{n\cdot I(\hat{\lambda })}\begin{array}{c}n\to {\rm{\infty }}\\ \sim \end{array}{\mathscr{N}}(0,1)$$

where the observed Fisher Information is found by

$$I(\hat{\lambda })=-\frac{{\partial }^{2}\,\mathrm{ln}\, {\mathcal L} }{\partial {\lambda }^{2}}{|}_{\lambda =\hat{\lambda }}=\frac{1}{{\hat{\lambda }}^{2}}-\frac{{x}_{0}^{2}}{4{\sinh }^{2}(\frac{1}{2}\hat{\lambda }{x}_{0})}$$

Using this method, the estimated λ values for the exponential distributions of Polynesian-specific IBD segment lengths for pairs of individuals, one from Rapa Nui (Easter Island) and one from each of the other remote Pacific islands in our dataset, are shown in Extended Data Fig. 4d. The pattern confirms the results of our drift statistics; Mangareva is the island most recently connected to Rapa Nui, while Samoa, the root of the expansion into remote Polynesia, is the most archaic connection.
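The estimation procedure described above can be sketched numerically as follows. This is an illustrative implementation only: the text states that the likelihood equation must be solved numerically but does not name a root finder, so bisection is an assumption here, as are the function names and the search bracket. The 5 cM and 15 cM truncation points are taken from the text.

```python
import math
from typing import Sequence, Tuple


def score(lam: float, mean_x: float, x0: float) -> float:
    """Per-segment score: 1/lam - x0 e^{-lam x0} / (1 - e^{-lam x0}) - mean."""
    # x0 * exp(-lam*x0) / (1 - exp(-lam*x0)) equals x0 / expm1(lam*x0).
    return 1.0 / lam - x0 / math.expm1(lam * x0) - mean_x


def fit_truncated_exponential(lengths_cm: Sequence[float], left: float = 5.0,
                              right: float = 15.0,
                              tol: float = 1e-12) -> Tuple[float, float]:
    """Return (lambda_hat, standard error) in units of 1/cM."""
    xs = [x - left for x in lengths_cm if left <= x <= right]  # recentre
    n = len(xs)
    if n == 0:
        raise ValueError("no segments inside the truncation window")
    mean_x = sum(xs) / n
    x0 = right - left
    lo, hi = 1e-6, 700.0 / x0  # assumed bracket; avoids overflow in expm1
    if score(lo, mean_x, x0) <= 0.0 or score(hi, mean_x, x0) >= 0.0:
        raise ValueError("no positive root inside the search bracket")
    while hi - lo > tol:  # the score decreases monotonically in lambda
        mid = 0.5 * (lo + hi)
        if score(mid, mean_x, x0) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    info = 1.0 / lam ** 2 - x0 ** 2 / (4.0 * math.sinh(0.5 * lam * x0) ** 2)
    return lam, 1.0 / math.sqrt(n * info)
```

A fitted λ in units of 1/cM corresponds to T = 50λ generations before any correction for non-productive recombination.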

A few caveats remain. The model of a Poisson process of recombination events along a continuous genome holds for small IBD segment lengths, that is, T > 5, but for more recent relatedness, leading to very long IBD segments, one must consider the finite size of the chromosomes themselves when computing the IBD length distribution104. In addition, the model of IBD segment independence holds only for (N >> T), where N is the population size and T is the number of generations100. Fortunately, our dataset has T values of 25−30 generations and N values in the thousands21, so we do not fall into these problematic regimes. However, because of the founding bottlenecks for each island, there is some intra-island IBD shared between pairs of individuals on the island (Supplementary Figs. 25, 26), so some recombination events will be non-productive. That is, when a haplotype stemming from one islander recombines with a haplotype from another with a recombination event occurring in the midst of an IBD segment on one haplotype that is shared with a third individual, the recombination break will occasionally not break up the IBD sequence, since the two different recombining haplotype segments might themselves be identical (IBD) at the recombination point and thus both in IBD with the third individual. This will happen with a frequency equal to the percentage of the genome shared IBD on average between pairs of individuals on the same island. Therefore, we can correct for the frequency of these non-productive recombination events at each meiosis event. The correction factor ρ (the proportion of the genome shared on average intra-island) is specific to each island (dependent on the intra-island IBD on each island accrued through preceding founding bottlenecks) and so must be applied separately to the two branches of the pedigree, one from the common ancestral population on the parent island down to the present population on the parent island (A) and one from the common ancestral population down to the child island (B). The number of effective meioses, which is equal to λ, can be expressed after correction as

$$\lambda =g(1-{\rho }_{A})+g(1-{\rho }_{B})=g(2-{\rho }_{A}-{\rho }_{B})$$

where g is the number of generations to a common ancestor of populations A and B, ρA is the average fraction of the genome in IBD between pairs of individuals on A, equivalent to the probability of non-productive recombination on A, and similarly for ρB.
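
As a minimal sketch of how this relation can be inverted to recover g from a fitted λ, consider the Python function below. The function name, the input values and the choice to express λ per Morgan (so that it counts meioses directly; a rate fitted per cM would first be multiplied by 100) are assumptions made for illustration and are not taken from our analysis code.

```python
def generations_to_common_ancestor(lam, rho_a, rho_b):
    """Invert lambda = g * (2 - rho_a - rho_b) to obtain g.

    lam   : fitted exponential rate, assumed here to be per Morgan
    rho_a : average fraction of the genome shared IBD on island A
    rho_b : average fraction of the genome shared IBD on island B
    """
    if not (0.0 <= rho_a < 1.0 and 0.0 <= rho_b < 1.0):
        raise ValueError("rho values must lie in [0, 1)")
    return lam / (2.0 - rho_a - rho_b)


# Hypothetical inputs, chosen only to illustrate the size of the correction.
uncorrected = generations_to_common_ancestor(48.0, 0.0, 0.0)
corrected = generations_to_common_ancestor(48.0, 0.03, 0.05)
print(uncorrected, corrected)
```

With ρ values of a few per cent, the corrected generation count is only slightly larger than the uncorrected value λ/2.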

The correction factor ρ can be found by dividing the average total sum of IBD segments (S) between pairs of individuals on an island by the length of the genome93,105 (35.3 M). Since we empirically observe only the sum of IBD segments longer than 5 cM, we must extrapolate the total sum of all IBD fragments by integrating our fitted exponential distribution of IBD segment lengths (for example, Supplementary Fig. 22a). Thus, the total sum of IBD for N total IBD segment matches in the population (generally unknown) is given by

$$S=N\cdot {\int }_{{\rm{0cM}}}^{\infty }x\lambda {{\rm{e}}}^{-\lambda x}{\rm{d}}x=\frac{N}{\lambda },$$

and the truncated sum of IBD is given by

$$s=N\cdot {\int }_{{\rm{5cM}}}^{\infty }x\lambda {{\rm{e}}}^{-\lambda x}{\rm{d}}x=N\frac{5\lambda +1}{\lambda {{\rm{e}}}^{5\lambda }}.$$

Dividing the expression for S by the expression for s eliminates the unknown N, giving

$$S=s\frac{{{\rm{e}}}^{5\lambda }}{5\lambda +1}.$$
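
The extrapolation above can be sketched in Python as follows, together with a numerical check of the closed-form truncated integral. All function names are hypothetical, λ is taken per cM to match the 5 cM cutoff, and the genome length of 3,530 cM is the 35.3 M quoted above; the numerical values are illustrative only.

```python
import math

GENOME_LENGTH_CM = 3530.0  # 35.3 Morgans, as quoted in the text


def truncated_fraction(lam, cutoff=5.0):
    """s / S: share of total IBD length carried by segments longer than cutoff.

    Follows from s = N (c*lam + 1) / (lam * e^(c*lam)) and S = N / lam.
    """
    return (cutoff * lam + 1.0) * math.exp(-cutoff * lam)


def extrapolate_total_ibd(s, lam, cutoff=5.0):
    """Recover S (all segments) from s (segments longer than cutoff)."""
    return s / truncated_fraction(lam, cutoff)


def rho_from_total_ibd(total_ibd_cm, genome_cm=GENOME_LENGTH_CM):
    """Correction factor rho: average pairwise IBD sum over genome length."""
    return total_ibd_cm / genome_cm


def numeric_truncated_fraction(lam, cutoff=5.0, upper=400.0, steps=200_000):
    """Midpoint-rule check of lam * integral of x*lam*exp(-lam*x) over [cutoff, upper]."""
    h = (upper - cutoff) / steps
    total = 0.0
    for n in range(steps):
        x = cutoff + (n + 0.5) * h
        total += x * lam * math.exp(-lam * x) * h
    return lam * total


# Hypothetical example: 100 cM observed in segments > 5 cM, lam = 0.2 per cM.
S = extrapolate_total_ibd(100.0, 0.2)
print(S, rho_from_total_ibd(S))
```

The numerical integral is included only as a sanity check on the algebra; in practice the closed form is all that is needed.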

The extrapolated sums of IBD in the Polynesian component on average between pairs of individuals on each island as a per cent of the genome are plotted in Supplementary Fig. 26, showing that these correction factors represent an adjustment of only a few per cent.

We can now construct the symmetric matrix of Polynesian-specific pairwise island λ values (shown in Supplementary Fig. 23), and convert it, using our ρ adjustment factors for each island, to a generation count to common ancestor for each island pair.

For a detailed discussion of the uncertainty in our estimates of these dates see Supplementary Discussion, ‘On quantification of error in IBD dating’ and Extended Data Fig. 6.

Some island pairs, particularly distantly related islands or island pairs each with small numbers of samples, have large standard errors. Removing all entries corresponding to λ values that have standard errors above 0.07 (representing errors larger than 15% of the average λ value) creates a matrix of more precise generation values, but with some missing entries. Because this is a distance matrix, entries must be consistent with the triangle inequality, so we can impute these missing entries using triangulation from the precisely known entries.
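
The filtering step can be illustrated with the short Python sketch below. The function name, the use of `None` to mark a removed entry and the toy matrices are assumptions for illustration; only the 0.07 threshold comes from the text, and both input matrices are assumed to be symmetric.

```python
def mask_imprecise(values, std_errors, max_se=0.07):
    """Return a copy of `values` with entries whose standard error exceeds max_se set to None."""
    n = len(values)
    return [
        [values[i][j] if std_errors[i][j] <= max_se else None for j in range(n)]
        for i in range(n)
    ]


# Toy 3-island example (made-up numbers): the (0, 2) pair is too uncertain.
lam = [[0.0, 0.45, 0.50],
       [0.45, 0.0, 0.40],
       [0.50, 0.40, 0.0]]
se = [[0.0, 0.02, 0.09],
      [0.02, 0.0, 0.03],
      [0.09, 0.03, 0.0]]
masked = mask_imprecise(lam, se)
print(masked)
```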

In fact, we can use something stronger than the standard triangle inequality

$$d(i,j)\le d(i,k)+d(j,k)\,{\rm{\forall }}\,i,j,k\in \{{\rm{i}}{\rm{s}}{\rm{l}}{\rm{a}}{\rm{n}}{\rm{d}}{\rm{s}}\},$$

for distances d(i,j) between two islands i and j and a third island k. We can instead use the ultrametric inequality

$$d(i,j)\le \max \{d(i,k),d(j,k)\}\,{\rm{\forall }}\,i,j,k\in \{{\rm{i}}{\rm{s}}{\rm{l}}{\rm{a}}{\rm{n}}{\rm{d}}{\rm{s}}\}.$$

This holds in our case, because all samples were taken from contemporary populations, and so all are leaf nodes of the ancestry tree dating to the same period (the present). Thus, so long as we use a distance metric d(i,j) that is uniform in time for each population, for instance the ρ-adjusted generation (or year) matrix, the distances back from each pair of populations to their common ancestor will be identical, yielding an ultrametric tree. This works because the per generation recombination rate is constant over time, so the distance from an ancestral population to any of its sampled contemporary descendants is the same when measured by segment length distributions. (Note that this matrix, measuring the total distance along the tree from one island to another, is twice the matrix measuring how many generations have passed since an island pair split, since the former sums down both tree branches descending from the split.) To complete the proof of the ultrametric inequality we notice that for any two contemporary populations i and j and a third population k, k must either coalesce in ancestry with i first, with j first, or after i and j have themselves first coalesced. In the last case, d(i,j) is clearly less than both d(i,k) and d(j,k), so the inequality above holds. In the first case, where k coalesces first with i, i and k have a shared common ancestor (m) before coalescing with the branch to j. Writing d(i,j) = d(i,m) + d(m,j), and noting that the distance to a common ancestor must be identical for terminal nodes, d(i,m) = d(k,m), when using years, we have d(i,j) = d(i,m) + d(m,j) = d(k,m) + d(m,j) = d(k,j). Hence, d(i,j) is equal to d(j,k) in case one (and similarly it is equal to d(i,k) in case two), so the ultrametric inequality holds with equality (the bound is tight) in both of these cases.
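
One way to check this property on a completed distance matrix is the three-point condition, which is equivalent to the ultrametric inequality: for every triple of islands, the two largest of the three pairwise distances must be equal. The Python sketch below is an illustration under that formulation; the function name, the tolerance and the toy matrices are assumptions, not part of our pipeline.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Check the three-point condition on a symmetric distance matrix d.

    For every triple (i, j, k), the two largest of d[i][j], d[i][k], d[j][k]
    must agree to within tol.
    """
    n = len(d)
    for i, j, k in combinations(range(n), 3):
        _, second, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if largest - second > tol:
            return False
    return True


# Toy tree: islands 0 and 1 split recently (distance 10); island 2 split earlier (30).
tree_like = [[0, 10, 30],
             [10, 0, 30],
             [30, 30, 0]]
# Same matrix with one perturbed entry, which breaks the condition.
perturbed = [[0, 10, 30],
             [10, 0, 25],
             [30, 25, 0]]
print(is_ultrametric(tree_like), is_ultrametric(perturbed))
```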

Using the ultrametric inequality, we can impute unknown distances d(i,j) simply by searching across all intermediate populations k and finding the minimum106

$$\mathop{\min }\limits_{k\in \text{pops}}\{\max \{d(i,k),d(j,k)\}\}.$$
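
A minimal Python sketch of this min-max imputation is given below. It assumes that missing entries are marked with `None`, that the matrix is symmetric, and that only the originally known entries are used as intermediates (a single pass, with no iteration over newly imputed values); these choices, the function name and the toy matrix are illustrative assumptions.

```python
def impute_ultrametric(d):
    """Fill None entries of a symmetric distance matrix with min over k of max(d[i][k], d[j][k])."""
    n = len(d)
    filled = [row[:] for row in d]
    for i in range(n):
        for j in range(n):
            if i == j or d[i][j] is not None:
                continue
            # Use only intermediates k for which both legs are known.
            candidates = [
                max(d[i][k], d[j][k])
                for k in range(n)
                if k not in (i, j) and d[i][k] is not None and d[j][k] is not None
            ]
            if candidates:
                filled[i][j] = min(candidates)
    return filled


# Toy tree: 0 and 1 are closest (10), 2 joins them at 20, 3 is the outgroup (40).
# The (0, 2) distance is missing and should be recovered as 20 via k = 1.
d = [[0, 10, None, 40],
     [10, 0, 20, 40],
     [None, 20, 0, 40],
     [40, 40, 40, 0]]
completed = impute_ultrametric(d)
print(completed[0][2])
```

An entry stays `None` if no intermediate island has both distances known, so a sparse matrix may need more known entries before it can be completed.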

From this completed distance matrix of generations, we can apply dates to each of the migrations using the average human generation time (see Supplementary Discussion, ‘On generation times and meiosis events’).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.