Estimating the number of contributors to a DNA profile

Egeland, Thore; Dalen, Ingvild; Mostad, Petter F.

doi:10.1007/s00414-003-0382-7

Original Article
Published: 14 August 2003

Volume 117, pages 271–275, (2003)

International Journal of Legal Medicine

Abstract

The broad topic of this paper is the evaluation of DNA evidence in criminal cases. More specifically, we deal with mixture evidence, which refers to cases where there are, or could be, several contributors to a biological stain based on, e.g., blood or semen. The present paper addresses DNA mixtures based on single nucleotide polymorphism (SNP) markers, i.e., diallelic markers. Based on STR analysis, it is in most cases easy to identify the presence of a mixture since three or four bands will show up with a high probability for at least one locus. Obviously, this will not be the case for diallelic markers and interpreting mixtures will be a great challenge. We address this problem by first approaching the more general problem of estimating the number of contributors to a stain. In addition we discuss how the markers should be selected and how many are required.

Introduction

The statistical interpretation of forensic DNA mixtures is well understood for many practical purposes and considerable progress has been made since Evett et al. (1991). The general formula for likelihood calculations in Weir et al. (1997) has been discussed and generalized, for instance by Fukshansky and Bär (1998), Curran et al. (1999) and Fung and Hu (2000). Several computer programs are available, see e.g. Mortera et al. (2003), Fung and Hu (2000, http://www.hku.hk/statistics/staff/wingfung/), and Curran et al. (1999, http://statgen.ncsu.edu/storey/). Some recent references are Hu and Fung (2003) and Fung and Hu (2002). There are, however, remaining challenges and some of these are addressed in the present paper. In particular, the number of contributors to a stain cannot normally be known with certainty. This problem has been handled in various ways: Weir (1995) presents alternative calculations assuming different numbers of contributors, while Brenner et al. (1996), Buckleton et al. (1998), Lauritzen and Mortera (2002) and others give bounds on likelihood ratios that can be used when the number of donors to a stain cannot be agreed upon. The above approaches do not use data to estimate the number of contributors in a formal manner beyond observing that a stain indicates the minimum number of contributors; for instance, at least three persons must have contributed if five different alleles are seen at one locus. Stockmarr (2000) estimated the number of contributors in a specific example by maximizing the likelihood. This paper continues this effort in a setting where it appears to be particularly relevant, namely for SNP (single nucleotide polymorphism) markers. For these diallelic markers, each locus will display one or two alleles. Consequently, it is more difficult to assess whether more than one person has contributed. In fact, Gill (2001) pointed out that "...the greatest challenge will be to identify and interpret mixtures". We address the question ("Is it a mixture?") by first approaching the more general problem of estimating the number of contributors to a stain. In addition we discuss how the markers should be selected and how many are required.

The next section discusses the methods. In particular the likelihood for SNP markers is written in a form that makes it easy to estimate the number of contributors and determine whether a stain is a mixture or not. The result section presents three examples based on simulated data. Based on these examples and the general methods, we draw some conclusions in the last section regarding the number of markers required and how these should be chosen.

Methods

We start this section by fixing some notation and reformulating the main research problems in more precise terms. For a specific marker we denote the less frequent variant B and the more frequent B^c. Let p_i denote the frequency of B at locus i, i=1,...,N, where N is the number of markers. An individual profile may be summarized by a vector of length N. Element i is 0, 1 or 2 depending on whether only B, only B^c or both alleles are seen. A stain from x≥1 persons may be summarized similarly by a vector of length N. A main problem may now be phrased and exemplified: "Has more than one person contributed to a specific stain, say (0, 1, 1, 2, 0, 1)?"

Likelihood and estimation

Using the notation explained above, the probabilities of observing 0, 1, or 2 for marker i may be written:

$$\begin{aligned} p_{0i} &= p_{i}^{2x}, \\ p_{1i} &= (1 - p_{i})^{2x}, \\ p_{2i} &= 1 - p_{i}^{2x} - (1 - p_{i})^{2x}. \end{aligned} \tag{1}$$

This follows by a direct argument and agrees with the more general formula in Weir et al. (1997). The above equation assumes independence between the two alleles from a person and independence between persons contributing. These independence assumptions may be relaxed, leading to modified versions of Eq. 1. If the markers are independent, the probability of observing

$${z = {\left( {0,1,1,2,0,1} \right)}}$$

equals

$$p_{01}\, p_{12}\, p_{13}\, p_{24}\, p_{05}\, p_{16}. \tag{2}$$

Generally, the likelihood for a profile (z₁,...,z_N) may be written using indicator functions I(·):

$$\begin{aligned} L(x) &= P(\mathrm{data} \mid x) = \prod_{i=1}^{N} p_{0i}^{I(z_i = 0)}\, p_{1i}^{I(z_i = 1)}\, p_{2i}^{I(z_i = 2)} \\ &= \prod_{i=1}^{N} p_{i}^{2x I(z_i = 0)}\, (1 - p_i)^{2x I(z_i = 1)} \left(1 - p_i^{2x} - (1 - p_i)^{2x}\right)^{I(z_i = 2)}. \end{aligned} \tag{3}$$
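As a minimal sketch, Eqs. 1 and 3 can be evaluated directly and the integer number of contributors found by a simple search. The function names, the search cap `x_max` and the allele frequencies used below are illustrative assumptions and are not taken from the paper.

```python
import math

def marker_probs(p, x):
    """Eq. 1: probabilities of observing 0 (only B), 1 (only B^c) or 2 (both)."""
    p0 = p ** (2 * x)
    p1 = (1 - p) ** (2 * x)
    return p0, p1, 1 - p0 - p1

def log_likelihood(z, freqs, x):
    """Logarithm of Eq. 3 for a stain profile z and allele frequencies freqs."""
    total = 0.0
    for zi, pi in zip(z, freqs):
        prob = marker_probs(pi, x)[zi]
        if prob <= 0.0:          # guard against numerical underflow
            return -math.inf
        total += math.log(prob)
    return total

def estimate_contributors(z, freqs, x_max=20):
    """Integer x in 1..x_max maximizing Eq. 3 (x_max is an assumed cap)."""
    return max(range(1, x_max + 1), key=lambda x: log_likelihood(z, freqs, x))

z = (0, 1, 1, 2, 0, 1)      # the example stain from the text
freqs = [0.5] * len(z)      # hypothetical frequencies, all equal to 0.5
x_hat = estimate_contributors(z, freqs)
```

Working on the log scale avoids underflow when many markers are multiplied together. With only six markers the estimate is of course very imprecise; the example only shows the mechanics.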

Consider next the case where all p_i=p and let n₀ and n₁ count the number of occurrences of 0's and 1's. In this particular case, all relevant statistical information is contained in the sufficient statistic (n₀,n₁) and the probabilities are given by the multinomial formula

$$P(n_0, n_1 \mid x) = a(n_0, n_1)\, p^{2x n_0}\, (1 - p)^{2x n_1} \left(1 - p^{2x} - (1 - p)^{2x}\right)^{N - n_0 - n_1}, \tag{4}$$

where a(n₀,n₁)=N!/(n₀!n₁!(N–n₀–n₁)!). In the general case, the unknown number of contributors may be estimated by maximizing Eq. 3 with respect to x. If all p_i=p, one may choose to use Eq. 4. A large number of computer programs will handle the maximization. Apparently, there is no simple formula for the maximum likelihood estimator except for the trivial case when all p_i=0.5. Then:

$$x^{*} = \frac{1}{2\log 2}\,\log \frac{2N}{n_{0} + n_{1}}.$$

The above estimator is finite if and only if n₀+n₁≥1. In the Appendix it is shown that the general likelihood Eq. 3 also has a unique and finite maximum if and only if n₀+n₁≥1.
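The closed form for p=0.5 can be checked numerically. Setting p=0.5 in Eq. 4, each marker shows a single allele with probability 2^(1−2x), which is estimated by (n₀+n₁)/N; solving for x gives x* = log(2N/(n₀+n₁))/(2 log 2). The sketch below uses hypothetical counts and illustrative function names.

```python
import math

def x_star(n0, n1, N):
    """Continuous maximizer of Eq. 4 when p = 0.5; infinite if n0 + n1 = 0."""
    if n0 + n1 < 1:
        return math.inf
    return math.log(2 * N / (n0 + n1)) / (2 * math.log(2))

def log_lik_half(x, n0, n1, N):
    """Log of Eq. 4 at p = 0.5 without the constant a(n0, n1); assumes x > 0.5."""
    q = 2.0 ** (1 - 2 * x)      # probability that a marker shows one allele only
    k = n0 + n1
    both = (N - k) * math.log(1 - q) if N > k else 0.0
    return k * math.log(q / 2) + both

# Hypothetical data: 100 markers, 25 of which show a single allele.
n0, n1, N = 12, 13, 100
estimate = x_star(n0, n1, N)
```

Here 2N/(n₀+n₁)=8, so the continuous estimate is log 8/(2 log 2)=1.5, and the log-likelihood is lower on either side of that value. Restricted to integers, x=1 and x=2 are the candidates to compare.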

Note that:

$$P(n_0 + n_1 \ge 1) = 1 - \prod_{i=1}^{N} \left(1 - p_i^{2x} - (1 - p_i)^{2x}\right) = 1 - \left(1 - p^{2x} - (1 - p)^{2x}\right)^{N}, \tag{5}$$

where the last equality assumes p_i=p, i=1,...,N. The right hand side of Eq. 5 is minimized for p=0.5 for fixed x.
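Eq. 5 can be inverted to give the smallest number of loci for which the estimate is finite with probability at least b. The following is a sketch under the common-frequency assumption p_i=p; the function names and the chosen values of p, x and b are illustrative.

```python
import math

def prob_finite_estimate(p, x, N):
    """Right hand side of Eq. 5 for a common allele frequency p."""
    g = 1 - p ** (2 * x) - (1 - p) ** (2 * x)
    return 1 - g ** N

def loci_required(p, x, b):
    """Smallest N with P(n0 + n1 >= 1) >= b, from 1 - g^N >= b.

    Assumes 0 < p < 1, 0 < b < 1 and x small enough that g < 1 in floating point.
    """
    g = 1 - p ** (2 * x) - (1 - p) ** (2 * x)
    return max(1, math.ceil(math.log(1 - b) / math.log(g)))

n_single = loci_required(0.5, 1, 0.99)   # one contributor, p = 0.5
n_five_half = loci_required(0.5, 5, 0.99)
n_five_rare = loci_required(0.1, 5, 0.99)
```

For five contributors the requirement is far larger at p=0.5 than at p=0.1, in line with the remark that the right hand side of Eq. 5 is smallest at p=0.5.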

Is it a mixture?

We next consider the question of determining whether the stain is a mixture or not. Two approaches are outlined, a frequentist and a Bayesian.

Frequentist approach

The parameter x can be considered fixed but unknown and two hypotheses formulated in the usual way:

$$\begin{aligned} \mathrm{H}_0 &: \text{One person contributed, i.e., } x = 1, \\ \mathrm{H}_1 &: \text{More than one person contributed, i.e., } x \ge 2. \end{aligned}$$

A reasonable approach is to reject H₀ when

$$K = \frac{\max_{j = 1,2,3,\ldots} P(\mathrm{data} \mid x = j)}{P(\mathrm{data} \mid x = 1)} > c. \tag{6}$$

The specific value of c can be determined by simulating K under H₀. Since K is discrete, it is not possible to achieve a precise level of significance. Example 2 in the next section indicates that a reasonable and simple solution is to reject H₀ and claim that a stain is a mixture when K>1.
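The behaviour of the rule K>1 can be explored by simulation. The sketch below assumes a common allele frequency p and caps the search at `x_max`; K>1 holds exactly when some x≥2 gives a strictly larger likelihood than x=1. The function names, the cap and the seed are illustrative assumptions, and this is not the S-PLUS code used for the paper.

```python
import math
import random

def log_lik(counts, p, x):
    """Log of Eq. 4 without a(n0, n1); counts = (n0, n1, n2)."""
    p0 = p ** (2 * x)
    p1 = (1 - p) ** (2 * x)
    total = 0.0
    for n, q in zip(counts, (p0, p1, 1 - p0 - p1)):
        if n > 0:
            if q <= 0.0:
                return -math.inf
            total += n * math.log(q)
    return total

def simulate_counts(p, x, N, rng):
    """Draw 2x alleles per locus and record which alleles are seen."""
    counts = [0, 0, 0]
    for _ in range(N):
        n_b = sum(rng.random() < p for _ in range(2 * x))
        if n_b == 2 * x:
            counts[0] += 1      # only B seen
        elif n_b == 0:
            counts[1] += 1      # only B^c seen
        else:
            counts[2] += 1      # both alleles seen
    return counts

def is_mixture(counts, p, x_max=10):
    """Reject H0 when K > 1, i.e. when some x >= 2 beats x = 1."""
    l1 = log_lik(counts, p, 1)
    return any(log_lik(counts, p, x) > l1 for x in range(2, x_max + 1))

def rejection_rate(p, x_true, N, reps, seed=0):
    """Fraction of simulated stains declared mixtures."""
    rng = random.Random(seed)
    hits = sum(is_mixture(simulate_counts(p, x_true, N, rng), p)
               for _ in range(reps))
    return hits / reps

type_one = rejection_rate(0.5, 1, 100, 200)   # data simulated under H0
power = rejection_rate(0.5, 2, 100, 200)      # data simulated with x = 2
```

Simulating with x_true=1 estimates the type I error of the rule, and simulating with x_true≥2 estimates its power.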

Bayesian approach

Sometimes there is information available in addition to the SNP markers. Different sources of data may be combined as explained below. Bayes theorem gives:

$$P(x = i \mid \mathrm{data}) = \frac{P(\mathrm{data} \mid x = i)\,\alpha(i)}{\sum_{j=1}^{\infty} P(\mathrm{data} \mid x = j)\,\alpha(j)},$$

where P(x=j)=α(j) is the prior distribution. The posterior odds for the stain to be a mixture can be written

$$\begin{aligned} \frac{P(x > 1 \mid \mathrm{data})}{P(x = 1 \mid \mathrm{data})} &= \sum_{j=2}^{\infty} \frac{P(x = j \mid \mathrm{data})}{P(x = 1 \mid \mathrm{data})} \\ &= \sum_{j=2}^{\infty} \frac{P(\mathrm{data} \mid x = j)\,\alpha(j)}{P(\mathrm{data} \mid x = 1)\,\alpha(1)}. \end{aligned}$$

To continue, some prior assumptions are required, and a formulation in terms of the prior odds for being a mixture, $R = \sum_{j=2}^{\infty} \alpha(j)/\alpha(1)$, seems reasonable. The posterior odds depend not only on R but on the entire prior distribution of x. However, we can find an upper bound for the posterior odds:

$$\begin{aligned} \sum_{j=2}^{\infty} \frac{P(\mathrm{data} \mid x = j)\,\alpha(j)}{P(\mathrm{data} \mid x = 1)\,\alpha(1)} &\le \frac{M \sum_{j=2}^{\infty} \alpha(j)}{P(\mathrm{data} \mid x = 1)\,\alpha(1)} \\ &= \frac{M}{P(\mathrm{data} \mid x = 1)}\, R, \end{aligned} \tag{7}$$

where

$$M = \max_{j = 2,3,4,\ldots} P(\mathrm{data} \mid x = j).$$

Observe that the above approach may only be used to statistically show that there is only one contributor. The previous frequentist approach applies more generally. However, if one is willing to assume more a priori information, the restriction on the Bayesian approach disappears. For instance, specifying the alternative hypothesis "x=2" corresponds to assuming α(j)=0 for j>2 and the posterior odds

$$\frac{P(\mathrm{data} \mid x = 2)\,\alpha(2)}{P(\mathrm{data} \mid x = 1)\,\alpha(1)}$$

can be used to distinguish between the alternatives for specified prior odds α(2)/α(1).
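The Bayesian quantities above can be computed in a few lines. The sketch below assumes a common allele frequency p and a prior with finite support; the counts, the prior and the function names are hypothetical and only illustrate the arithmetic of Bayes theorem and of the bound in Eq. 7.

```python
import math

def log_lik(counts, p, x):
    """Log of Eq. 4 without a(n0, n1); counts = (n0, n1, n2)."""
    p0 = p ** (2 * x)
    p1 = (1 - p) ** (2 * x)
    total = 0.0
    for n, q in zip(counts, (p0, p1, 1 - p0 - p1)):
        if n > 0:
            if q <= 0.0:
                return -math.inf
            total += n * math.log(q)
    return total

def posterior(counts, p, prior):
    """Posterior P(x = i | data) for a prior given as {x: alpha(x)}."""
    logs = {x: log_lik(counts, p, x) + math.log(a)
            for x, a in prior.items() if a > 0}
    top = max(logs.values())                       # rescale before exponentiating
    weights = {x: math.exp(v - top) for x, v in logs.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def odds_upper_bound(counts, p, prior_odds, x_max=10):
    """Right hand side of Eq. 7, with M maximized over 2..x_max (assumed cap)."""
    log_m = max(log_lik(counts, p, x) for x in range(2, x_max + 1))
    return math.exp(log_m - log_lik(counts, p, 1)) * prior_odds

counts = (1, 81, 18)                 # hypothetical counts for N = 100, p = 0.1
prior = {1: 0.5, 2: 0.3, 3: 0.2}     # hypothetical prior, so R = 1
post = posterior(counts, 0.1, prior)
bound = odds_upper_bound(counts, 0.1, prior_odds=1.0)
```

The bound needs only the prior odds R, whereas the full posterior needs the whole prior. For these counts the posterior odds for a mixture stay below the bound, as Eq. 7 requires.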

Results

The previous section has presented results regarding (1) estimation of the number of contributors to a stain, (2) testing if a stain is a mixture or not and (3) verification of a non-mixture allowing for inclusion of prior information or data. Three examples follow to demonstrate the practical implementation of the methods. The examples are based on 1000 simulated datasets in S-PLUS 6.0.

Example 1

This example discusses the number of loci required to accurately estimate the number of contributors. We provide a detailed explanation of the first line of Table 1. Column 1 shows that the data is simulated with x equal to 1, followed by a column indicating the number of loci, N=50 in this case. The two next columns list the fraction correctly identified for p=0.1 and p=0.5. In the former case 0.965 or 96.5% were correctly classified whereas there were no errors for p=0.5. As expected, the precision increases in N and decreases in x. If the number of contributors is 3 or less, the correct classification rate is always above 87% for N=200. Observe that the case with 5 contributors may not be resolved satisfactorily even with 1,000 markers for p=0.5. Figure 1 shows the standard deviation of the estimator of x as a function of p<0.5. Two intuitive results are confirmed: the uncertainty increases in x and decreases in p. Figure 2 displays the number of loci required to secure a finite estimate of the number of contributors with probability b. The plot is based on Eq. 5 and explains to some extent why it is difficult to estimate cases with many contributors for p close to 0.5.

Table 1. The fraction of correctly identified number of contributors is shown in the two rightmost columns for p=0.1 and p=0.5 for various values of x (the number of contributors) and number of markers (N)
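A simulation of the kind summarized in Table 1 can be sketched as follows. It assumes a common allele frequency p, samples each locus directly from the probabilities in Eq. 1, and caps the integer search at `x_max`. The function names, the cap, the seed and the number of replicates are illustrative assumptions, so the output will not reproduce the published table exactly.

```python
import math
import random

def log_lik(counts, p, x):
    """Log of Eq. 4 without a(n0, n1); counts = (n0, n1, n2)."""
    p0 = p ** (2 * x)
    p1 = (1 - p) ** (2 * x)
    total = 0.0
    for n, q in zip(counts, (p0, p1, 1 - p0 - p1)):
        if n > 0:
            if q <= 0.0:
                return -math.inf
            total += n * math.log(q)
    return total

def estimate_x(counts, p, x_max=15):
    """Integer maximum likelihood estimate of x in 1..x_max."""
    return max(range(1, x_max + 1), key=lambda x: log_lik(counts, p, x))

def sample_counts(p, x, N, rng):
    """Sample (n0, n1, n2) for N loci from the probabilities in Eq. 1."""
    p0 = p ** (2 * x)
    p1 = (1 - p) ** (2 * x)
    counts = [0, 0, 0]
    for _ in range(N):
        u = rng.random()
        counts[0 if u < p0 else 1 if u < p0 + p1 else 2] += 1
    return counts

def fraction_correct(p, x_true, N, reps, seed=1):
    """Fraction of simulated stains whose estimate equals x_true."""
    rng = random.Random(seed)
    hits = sum(estimate_x(sample_counts(p, x_true, N, rng), p) == x_true
               for _ in range(reps))
    return hits / reps

rate = fraction_correct(0.5, 1, 50, 300)   # one contributor, N = 50, p = 0.5
```

Varying `x_true`, `N` and `p` in `fraction_correct` gives the remaining cells of such a table.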

Example 2

Data was simulated first assuming x=2. The test statistic K defined in Eq. 6 was calculated and the null hypothesis was rejected when K>1. In other words, we conclude that two or more persons contributed if the maximal likelihood assuming x≥2 exceeds the likelihood assuming x=1. Table 2 summarizes the results for varying N (50, 100, 200, 500) and p (0.1 and 0.5). The power is high, or equivalently, the probability of a type II error is small. For N≥100 the probability of reaching the correct conclusion is 0.982 (N=100, p=0.1) or higher. It remains to check the significance level of the test and we simulated data with x=1 for this purpose. The two rightmost columns of Table 2 show that the test also performs well with respect to type I errors, i.e., the probability of falsely claiming a mixture is low.

Table 2. The properties of the K-statistic for mixtures defined in Eq. 6 are shown for various values of N and for p=0.1 and p=0.5

Example 3

Recall that Eq. 7 could be useful to prove that a stain is not a mixture, when that is indeed the case. We simulated data for x=1, N=100, and p=0.1, and computed the ratio M/P(data|x=1). In 95% of the cases, the ratio was smaller than 0.0008, reducing any prior odds for a mixture substantially towards zero. For p=0.5, the ratios are even smaller.

Discussion and concluding remarks

The examples of the paper have been based on simulated data and so we are able to see how the methods perform in cases where the truth is known. Another reason for simulating is that relevant case data using SNP markers do not appear to be available. Gill (2001) considered 50–150 markers. For practical forensic case work, confusing a mixed profile with a profile from a single person could have serious consequences, as a match between a stain and a reference person could be missed. Based on our results, it seems fair to conclude that a decision regarding mixture or not can be reached for the number of markers in the indicated range. Based on Table 2, we recommend 100 markers. In this case the type II error, i.e., the probability of missing a mixture stain, ranges from 0 to 0.018 while the type I error lies between 0 and 0.023. It is harder to estimate the precise number of contributors, particularly if a large number, say five or more, cannot a priori be excluded. Table 1 shows that with 1, 2 or 3 contributors, the correct classification rate is 75% or higher. This accuracy may be acceptable for investigative purposes, but insufficient for a court. It is possible to obtain a posterior distribution on the number of contributors. The evidence may then be weighed according to this distribution.

It remains to be seen what numbers will be available and if the problems of interpretation of data based on conventional markers (see Evett and Weir 1998; Evett et al. 1998) will be reduced for SNPs. Different contributors to a stain could have donated varying amounts and this information could be used to improve estimates.

The calculations are simplified by the diallelic structure of SNPs. For conventional markers similar calculations are obviously more complicated. However, numerical or simulation-based results are always obtainable. Moreover, the formulation of hypotheses would typically differ for conventional markers; the question of a mixture or not is typically not relevant. For instance, if one marker displays 5 alleles and the others fewer, one might want to test the null hypothesis x≤3 against the alternative x>3. The test procedure we have suggested extends easily to this case.

References

Brenner C, Fimmers R, Baur M (1996) Likelihood ratios for mixed stains when the number of donors cannot be agreed. Int J Legal Med 109:218–219
Buckleton J, Evett IW, Weir BS (1998) Setting bounds for the likelihood ratio when multiple hypotheses are postulated. Sci Justice 38:23–26
Curran JM, Triggs CM, Buckleton J, Weir BS (1999) Interpreting DNA mixtures in structured populations. J Forensic Sci 44:987–995
Evett IW, Weir BS (1998) Interpreting DNA evidence. Sinauer, Sunderland MA
Evett IW, Buffery C, Willott G, Stoney D (1991) A guide to interpreting single locus profiles of DNA mixtures in forensic cases. J Forensic Sci Soc 31:41–47
Evett IW, Gill P, Lambert J (1998) Taking account of peak areas when interpreting mixed DNA profiles. J Forensic Sci 43:62–69
Fukshansky N, Bär W (1998) Interpreting forensic DNA evidence on the basis of hypothesis testing. Int J Legal Med 111:62–66
Fung WK, Hu YQ (2000) Interpreting forensic DNA mixtures: allowing for uncertainty in population substructure and dependence. J R Statist Soc A 163:241–254
Fung WK, Hu YQ (2002) The statistical evaluation of DNA mixtures with contributors from different ethnic groups. Int J Legal Med 116:79–86
Gill P (2001) An assessment of the utility of single nucleotide polymorphisms (SNPs) for forensic purposes. Int J Legal Med 114:204–210
Hu YQ, Fung WK (2003) Interpreting DNA mixtures with the presence of relatives. Int J Legal Med 117:39–45
Lauritzen SL, Mortera J (2002) Bounding the number of contributors to mixed DNA stains. Forensic Sci Int 130:125–126
Mortera J, Dawid AP, Lauritzen SL (2003) Probabilistic expert systems for DNA mixture profiling. Theor Popul Biol 63:191–205
Stockmarr A (2000) The choice of hypotheses in the evaluation of DNA profile evidence. In: Gastwirth JL (ed) Statistical science in the courtroom. Springer, Berlin Heidelberg New York, pp 143–160
Weir BS (1995) DNA statistics in the Simpson matter. Nat Genet 11:366–368
Weir BS, Triggs C, Starling L, Stowell L, Walsh K, Buckleton J (1997) Interpreting DNA mixtures. J Forensic Sci 42:213–222

Acknowledgements

This work was supported by the Leverhulme Trust.

Author information

Authors and Affiliations

Biostatistics, Rikshospitalet University Hospital, 0027, Oslo, Norway
Thore Egeland
University of Oslo, Oslo, Norway
Ingvild Dalen
Chalmers Technical University, Göteborg, Sweden
Petter F. Mostad

Corresponding author

Correspondence to Thore Egeland.

Appendix

Consider the likelihood L(x) given in Eq. 3. Observe that the likelihood increases in x if all z_i=2 and decreases in x if all z_i<2. Assume that not all z_i equal 2. Then we show below that L(x) always has a single maximum for some x≥1.

Maximizing L(x) is equivalent to maximizing

$$\begin{aligned} f(x) &= \log L(x) \\ &= \sum_{i=1}^{N} \left[ I(z_i = 0)\, 2x \log p_i + I(z_i = 1)\, 2x \log(1 - p_i) + I(z_i = 2) \log\left(1 - p_i^{2x} - (1 - p_i)^{2x}\right) \right] \\ &= Cx + \sum_{i=1}^{N} I(z_i = 2) \log\left(g_i(x)\right), \end{aligned}$$

where C<0 and

$${g_{i} {\left( x \right)} = 1 - p^{{2x}}_{i} - {\left( {1 - p_{i} } \right)}^{{2x}} }$$

We see by direct computation that $g''_{i}(x) < 0$ for all x>0. Defining h_i(x)=log(g_i(x)), it follows that $h''_{i}(x) < 0$ for all x>0, and thus that f''(x)<0 for all x>0. Further, we get that lim_{x→∞} g_i(x)=1, that lim_{x→∞} h_i(x)=0, and that lim_{x→∞} f(x)=–∞. These two facts about f, concavity and divergence to –∞, show together that f, and thus L, has a unique maximum for some x≥0. For discrete x, one or two consecutive positive integers maximize L.
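The direct computation referred to above can be written out as follows. This is a sketch of the standard calculus, treating x as continuous and assuming 0<p_i<1; note that g_i(x)>0, so that h_i is defined, exactly when x>1/2.

$$\begin{aligned} g_i'(x) &= -2\log(p_i)\, p_i^{2x} - 2\log(1 - p_i)\,(1 - p_i)^{2x} > 0, \\ g_i''(x) &= -4\left(\log p_i\right)^2 p_i^{2x} - 4\left(\log(1 - p_i)\right)^2 (1 - p_i)^{2x} < 0, \\ h_i''(x) &= \frac{g_i(x)\, g_i''(x) - g_i'(x)^2}{g_i(x)^2} < 0 \quad \text{wherever } g_i(x) > 0, \\ f''(x) &= \sum_{i=1}^{N} I(z_i = 2)\, h_i''(x) \le 0. \end{aligned}$$

The last inequality is strict as soon as at least one z_i equals 2; if no z_i equals 2, f(x)=Cx is linear and decreasing, so the maximum is attained at the smallest admissible x.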

Egeland, T., Dalen, I. & Mostad, P.F. Estimating the number of contributors to a DNA profile. Int J Legal Med 117, 271–275 (2003). https://doi.org/10.1007/s00414-003-0382-7

Received: 20 December 2002
Accepted: 25 April 2003
Published: 14 August 2003
Issue Date: October 2003
DOI: https://doi.org/10.1007/s00414-003-0382-7
