Introduction

The statistical interpretation of forensic DNA mixtures is well understood for many practical purposes, and considerable progress has been made since Evett et al. (1991). The general formula for likelihood calculations in Weir et al. (1997) has been discussed and generalized, for instance by Fukshansky and Bär (1998), Curran et al. (1999) and Fung and Hu (2000). Several computer programs are available; see e.g. Mortera et al. (2002), Fung and Hu (2000, http://www.hku.hk/statistics/staff/wingfung/), and Curran et al. (1999, http://statgen.ncsu.edu/storey/). Some recent references are Hu and Fung (2003) and Fung and Hu (2002). There are, however, remaining challenges, and some of these are addressed in the present paper. In particular, the number of contributors to a stain cannot normally be known with certainty. This problem has been handled in various ways. Weir (1995) presents alternative calculations assuming different numbers of contributors, while Brenner et al. (1996), Buckleton et al. (1998), Lauritzen and Mortera (2002) and others give bounds on likelihood ratios that can be used when the number of donors to a stain cannot be agreed upon. These approaches do not use data to estimate the number of contributors in a formal manner beyond observing that a stain indicates the minimum number of contributors; for instance, at least three persons must have contributed if five different alleles are seen at a single marker. Stockmarr (2000) estimated the number of contributors in a specific example by maximizing the likelihood. This paper continues this effort in a setting where it appears to be particularly relevant, namely for SNP (single nucleotide polymorphism) markers. For these diallelic markers, each locus will display one or two alleles. Consequently, it is more difficult to assess whether more than one person has contributed. In fact, Gill (2001) pointed out that "...the greatest challenge will be to identify and interpret mixtures".
We address the question ("Is it a mixture?") by first approaching the more general problem of estimating the number of contributors to a stain. In addition we discuss how the markers should be selected and how many are required.

The next section discusses the methods. In particular, the likelihood for SNP markers is written in a form that makes it easy to estimate the number of contributors and to determine whether a stain is a mixture or not. The Results section presents three examples based on simulated data. Based on these examples and the general methods, we draw some conclusions in the last section regarding the number of markers required and how these should be chosen.

Methods

We start this section by fixing some notation and reformulating the main research problems in more precise terms. For a specific marker we denote the less frequent variant by \(B\) and the more frequent by \(B^c\). Let \(p_i\) denote the frequency of \(B\) at locus \(i\), and let \(N\) be the number of markers. An individual profile may be summarized by a vector of length \(N\). Element \(i\) is 0, 1 or 2 depending on whether only \(B\), only \(B^c\) or both alleles are seen. A stain from \(x \ge 1\) persons may be summarized similarly by a vector of length \(N\). A main problem may now be phrased and exemplified: "Has more than one person contributed to a specific stain, say (0, 1, 1, 2, 0, 1)?"

Likelihood and estimation

Using the notation explained above, the probabilities of observing 0, 1, or 2 for marker i may be written:

$${\matrix{ {{p_{{0i}} }} & { = } & {{p^{{2x}}_{i} }} \cr {{p_{{1i}} }} & { = } & {{{\left( {1 - p_{i} } \right)}^{{2x}} }} \cr {{p_{{2i}} }} & { = } & {{1 - p^{{2x}}_{i} - {\left( {1 - p_{i} } \right)}^{{2x}} .}} \cr } }$$
(1)

This follows by a direct argument and agrees with the more general formula in Weir et al. (1997). The above equation assumes independence between the two alleles from a person and independence between persons contributing. These independence assumptions may be relaxed, leading to modified versions of Eq. 1. If the markers are independent, the probability of observing

$${z = {\left( {0,1,1,2,0,1} \right)}}$$

equals

$${p_{{01}} p_{{12}} p_{{13}} p_{{24}} p_{{05}} p_{{16}} }$$
(2)
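As an illustration, Eq. 1 and the product in Eq. 2 can be evaluated with a few lines of Python. This is a minimal sketch, not the code used for the paper: the function names are hypothetical, and the common frequency 0.3 assigned to all six markers is an arbitrary value chosen for the example.

```python
def state_probs(p, x):
    """Eq. 1: probabilities of observing 0 (only B), 1 (only B^c), 2 (both)
    at a marker with B-frequency p when x persons contributed."""
    p0 = p ** (2 * x)
    p1 = (1 - p) ** (2 * x)
    return p0, p1, 1 - p0 - p1


def profile_probability(z, freqs, x):
    """Product over independent markers, as in Eq. 2."""
    prob = 1.0
    for z_i, p_i in zip(z, freqs):
        prob *= state_probs(p_i, x)[z_i]
    return prob


z = (0, 1, 1, 2, 0, 1)
freqs = [0.3] * 6  # assumed frequencies, for illustration only
for x in (1, 2, 3):
    print(x, profile_probability(z, freqs, x))
```

Comparing the printed values across x gives a first impression of which number of contributors the stain supports.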

Generally, the likelihood for a profile \((z_1,\ldots,z_N)\) may be written using indicator functions \(I(\cdot)\):

$$ \begin{array}{*{20}l} {{L{\left( x \right)}} \hfill} & { = \hfill} & {{P{\left( {data\left| x \right.} \right)} = {\mathop \prod \limits_{i = 1}^N }p^{{I{\left( {z_{i} = 0} \right)}}}_{{0i}} p^{{I{\left( {z_{i} = 1} \right)}}}_{{1i}} p^{{I{\left( {z_{i} = 2} \right)}}}_{{2i}} } \hfill} \\ {{} \hfill} & { = \hfill} & {{{\mathop \prod \limits_{i = 1}^N }p^{{2xI{\left( {z_{i} = 0} \right)}}}_{i} {\left( {1 - p_{i} } \right)}^{{2xI{\left( {z_{i} = 1} \right)}}} {\left( {1 - p^{{2x}}_{i} - {\left( {1 - p_{i} } \right)}^{{2x}} } \right)}^{{I{\left( {z_{i} = 2} \right)}}} } \hfill} \\ \end{array} $$
(3)

Consider next the case where all \(p_i = p\) and let \(n_0\) and \(n_1\) count the number of occurrences of 0's and 1's. In this particular case, all relevant statistical information is contained in the sufficient statistic \((n_0, n_1)\) and the probabilities are given by the multinomial formula

$$ P{\left( {n_{0} ,n_{1} \left| x \right.} \right)} = a{\left( {n_{0} ,n_{1} } \right)}p^{{2xn_{0} }} {\left( {1 - p} \right)}^{{2xn_{1} }} {\left( {1 - p^{{2x}} - {\left( {1 - p} \right)}^{{2x}} } \right)}^{{N - n_{0} - n_{1} }} $$
(4)

where \(a(n_0,n_1) = N!/(n_0!\,n_1!\,(N-n_0-n_1)!)\). In the general case, the unknown number of contributors may be estimated by maximizing Eq. 3 with respect to x. If all \(p_i = p\), one may choose to use Eq. 4. Standard numerical software will handle the maximization. There appears to be no simple formula for the maximum likelihood estimator except for the special case when all \(p_i = 0.5\). Then:

$${x^{*} = {1 \over {2\log 2}}\log {{2N} \over {n_{0} + n_{1} }}.}$$

The above estimator is finite if and only if \(n_0 + n_1 \ge 1\). In the Appendix it is shown that the general likelihood Eq. 3 also has a unique and finite maximum if and only if \(n_0 + n_1 \ge 1\).
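The maximization can be sketched as follows. The code evaluates the logarithm of Eq. 3 and searches over integer values of x up to a cap `x_max`, which is an assumption made here for illustration; the function names are likewise hypothetical. For the case where all frequencies equal 0.5, the continuous maximizer is written as log(2N/(n0+n1))/(2 log 2), the point at which the derivative of the log-likelihood vanishes.

```python
import math


def log_likelihood(z, freqs, x):
    """Log of Eq. 3; states are 0 (only B), 1 (only B^c), 2 (both alleles)."""
    total = 0.0
    for z_i, p_i in zip(z, freqs):
        p0 = p_i ** (2 * x)
        p1 = (1 - p_i) ** (2 * x)
        prob = (p0, p1, 1 - p0 - p1)[z_i]
        if prob <= 0.0:
            return float("-inf")
        total += math.log(prob)
    return total


def ml_contributors(z, freqs, x_max=20):
    """Integer x in 1..x_max maximizing Eq. 3 (x_max is an assumed cap)."""
    return max(range(1, x_max + 1), key=lambda x: log_likelihood(z, freqs, x))


def closed_form_half(n0, n1, n_loci):
    """Continuous maximizer when all frequencies are 0.5; needs n0 + n1 >= 1."""
    return math.log(2 * n_loci / (n0 + n1)) / (2 * math.log(2))


z = [2] * 9 + [0]      # hypothetical stain: nine loci show both alleles
freqs = [0.5] * 10
print(closed_form_half(1, 0, 10), ml_contributors(z, freqs))
```

The grid search returns an integer, whereas the closed form is in general not an integer; the two agree only after comparing the likelihood at the neighbouring integers.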

Note that:

$${P{\left( {n_{0} + n_{1} {\rm{ \geqq }}1} \right)} = 1 - {\prod\limits_{i = 1}^N {{\left( {1 - p^{{2x}}_{i} - {\left( {1 - p_{i} } \right)}^{{2x}} } \right)} = 1 - {\left( {1 - p^{{2x}} - {\left( {1 - p} \right)}^{{2x}} } \right)}^{N} } }}$$
(5)

where the last equality assumes \(p_i = p\), \(i = 1,\ldots,N\). The right hand side of Eq. 5 is minimized for p=0.5 for fixed x.
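Eq. 5 is easy to evaluate numerically in the equal-frequency case, and it can be inverted to give the smallest number of loci for which a finite estimate is obtained with at least a given probability b. The sketch below uses hypothetical function names and illustrative parameter values.

```python
import math


def prob_finite_estimate(p, x, n_loci):
    """Eq. 5 with all frequencies equal to p: probability that at least
    one of n_loci markers shows a single allele."""
    both = 1 - p ** (2 * x) - (1 - p) ** (2 * x)
    return 1 - both ** n_loci


def loci_required(p, x, b):
    """Smallest N such that prob_finite_estimate(p, x, N) >= b."""
    both = 1 - p ** (2 * x) - (1 - p) ** (2 * x)
    return math.ceil(math.log(1 - b) / math.log(both))


for x in (1, 2, 5):
    print(x, loci_required(0.5, x, 0.99), loci_required(0.1, x, 0.99))
```

Because the probability of seeing both alleles approaches 1 quickly as x grows when p is near 0.5, the required number of loci rises steeply with x in that case.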

Is it a mixture?

We next consider the question of determining whether the stain is a mixture or not. Two approaches are outlined, one frequentist and one Bayesian.

Frequentist approach

The parameter x can be considered fixed but unknown and two hypotheses formulated in the usual way:

$${\matrix{ {{{\rm{H}}_{{\rm{0}}} } \hfill} & {: \hfill} & {{{\rm{One}}\;{\rm{person}}\;{\rm{contributed}}{\rm{,}}\;{\rm{i}}{\rm{.e}}.,\;x{\rm{ = 1}}} \hfill} \cr {{{\rm{H}}_{{\rm{1}}} } \hfill} & {: \hfill} & {{{\rm{More}}\;{\rm{than}}\;{\rm{one}}\;{\rm{person}}\;{\rm{contributed}}{\rm{,}}\;{\rm{i}}{\rm{.e}}.,\;x \ge {\rm{2}}{\rm{.}}} \hfill} \cr } }$$

A reasonable approach is to reject H0 when

$${K = {{\max _{{j = 1,2,3,...}} P{\left( {data\left| {x = j} \right.} \right)}} \over {P{\left( {data\left| {x = 1} \right.} \right)}}} > c.}$$
(6)

The specific value of c can be determined by simulating K under H0. Since K is discrete, it is not possible to achieve a precise level of significance. Example 2 in the next section indicates that a reasonable and simple solution is to reject H0 and claim that a stain is a mixture when K>1.
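The test and the simulation of K under a given hypothesis can be sketched as below. This is an illustration under stated assumptions rather than the original implementation: the maximum in Eq. 6 is truncated at `x_max`, the statistic is kept on the log scale (so K>1 corresponds to log K>0), and the function names, seed and number of replicates are hypothetical choices.

```python
import math
import random


def log_likelihood(z, freqs, x):
    """Log of Eq. 3; states are 0 (only B), 1 (only B^c), 2 (both alleles)."""
    total = 0.0
    for z_i, p_i in zip(z, freqs):
        p0 = p_i ** (2 * x)
        p1 = (1 - p_i) ** (2 * x)
        prob = (p0, p1, 1 - p0 - p1)[z_i]
        if prob <= 0.0:
            return float("-inf")
        total += math.log(prob)
    return total


def log_k(z, freqs, x_max=10):
    """log K from Eq. 6, with the maximum truncated at x_max (an assumption)."""
    lls = [log_likelihood(z, freqs, j) for j in range(1, x_max + 1)]
    return max(lls) - lls[0]


def simulate_stain(freqs, x, rng):
    """Draw 2x independent alleles per locus and record the observed state."""
    z = []
    for p_i in freqs:
        n_b = sum(rng.random() < p_i for _ in range(2 * x))
        z.append(0 if n_b == 2 * x else 1 if n_b == 0 else 2)
    return z


def rejection_rate(freqs, x_true, n_sim=200, seed=1):
    """Fraction of simulated stains declared mixtures by the rule K > 1."""
    rng = random.Random(seed)
    hits = sum(log_k(simulate_stain(freqs, x_true, rng), freqs) > 0
               for _ in range(n_sim))
    return hits / n_sim


freqs = [0.5] * 100  # assumed: 100 markers, all with frequency 0.5
print("estimated type I error:", rejection_rate(freqs, x_true=1))
print("estimated power for x = 2:", rejection_rate(freqs, x_true=2))
```

Calling `rejection_rate` with `x_true=1` estimates the significance level of the rule K>1, and with `x_true=2` it estimates the power against two contributors.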

Bayesian approach

Sometimes there is information available in addition to the SNP markers. Different sources of data may be combined as explained below. Bayes' theorem gives:

$$ P{\left( {x = i\left| {data} \right.} \right)} = \frac{{P{\left( {data\left| {x = i} \right.} \right)}\alpha {\left( i \right)}}} {{{\sum\nolimits_{j = 1}^\infty {P{\left( {data\left| {x = j} \right.} \right)}\alpha {\left( j \right)}} }}}, $$

where P(x=j)=α(j) is the prior distribution. The posterior odds for the stain to be a mixture can be written

$$ \begin{array}{lcl} \dfrac{P\left( x > 1 \mid data \right)}{P\left( x = 1 \mid data \right)} & = & \sum\limits_{j = 2}^{\infty} \dfrac{P\left( x = j \mid data \right)}{P\left( x = 1 \mid data \right)} \\ & = & \sum\limits_{j = 2}^{\infty} \dfrac{P\left( data \mid x = j \right)\alpha\left( j \right)}{P\left( data \mid x = 1 \right)\alpha\left( 1 \right)} \end{array} $$

To continue, some prior assumptions are required and a formulation in terms of the prior odds for being a mixture, \({R = {\sum\nolimits_{j = 2}^\infty {\alpha {\left( j \right)}/\alpha {\left( 1 \right)}} }}\), seems reasonable. The posterior odds depend not only on R but on the entire prior distribution of x. However, we can find an upper bound for the posterior odds:

$${\matrix{ {{} \hfill} & {{{\sum\limits_{j = 2}^\infty {{{P{\left( {data\left| {x = j} \right.} \right)}\alpha {\left( j \right)}} \over {P{\left( {data\left| {x = 1} \right.} \right)}\alpha {\left( 1 \right)}}}} }} \hfill} \cr { \le \hfill} & {{{{M{\sum\nolimits_{j = 2}^\infty {\alpha {\left( j \right)}} }} \over {P{\left( {data\left| {x = 1} \right.} \right)}\alpha {\left( 1 \right)}}}} \hfill} \cr { = \hfill} & {{{M \over {P{\left( {data\left| {x = 1} \right.} \right)}}}R,} \hfill} \cr } }$$
(7)

where

$$ M = {\mathop {\max }\limits_{j = 2,3,4...} }\;P{\left( {data\left| {x = j} \right.} \right)}. $$

Observe that, since Eq. 7 is an upper bound, the above approach may only be used to show statistically that there is only one contributor. The previous frequentist approach applies more generally. However, if one is willing to assume more a priori information, the restriction on the Bayesian approach disappears. For instance, specifying the alternative hypothesis "x=2" corresponds to assuming α(j)=0 for j>2 and the posterior odds

$${{{P{\left( {data\left| {x = 2} \right.} \right)}\alpha {\left( 2 \right)}} \over {P{\left( {data\left| {x = 1} \right.} \right)}\alpha {\left( 1 \right)}}}}$$

can be used to distinguish between the alternatives for a specified prior on α(2)/α(1).
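The bound in Eq. 7 and the exact posterior odds for a prior with finite support can be computed as in the following sketch. The prior used in the example, the truncation of the maximum at `x_max`, the marker frequencies and all function names are assumptions introduced for illustration.

```python
import math


def log_likelihood(z, freqs, x):
    """Log of Eq. 3; states are 0 (only B), 1 (only B^c), 2 (both alleles)."""
    total = 0.0
    for z_i, p_i in zip(z, freqs):
        p0 = p_i ** (2 * x)
        p1 = (1 - p_i) ** (2 * x)
        prob = (p0, p1, 1 - p0 - p1)[z_i]
        if prob <= 0.0:
            return float("-inf")
        total += math.log(prob)
    return total


def posterior_odds_bound(z, freqs, prior_odds, x_max=10):
    """Right-hand side of Eq. 7, (M / P(data | x=1)) * R, with the maximum
    defining M truncated at x_max (an assumption)."""
    ll1 = log_likelihood(z, freqs, 1)
    log_m = max(log_likelihood(z, freqs, j) for j in range(2, x_max + 1))
    return math.exp(log_m - ll1) * prior_odds


def posterior_odds(z, freqs, prior):
    """Exact posterior odds of a mixture for a prior {x: alpha(x)} with
    finite support that includes x = 1."""
    ll1 = log_likelihood(z, freqs, 1)
    numerator = sum(math.exp(log_likelihood(z, freqs, j) - ll1) * alpha
                    for j, alpha in prior.items() if j >= 2)
    return numerator / prior[1]


z = (0, 1, 1, 2, 0, 1)
freqs = [0.3] * 6                    # assumed frequencies
prior = {1: 0.5, 2: 0.3, 3: 0.2}     # hypothetical prior on x
R = (prior[2] + prior[3]) / prior[1]
print(posterior_odds(z, freqs, prior), posterior_odds_bound(z, freqs, R))
```

Passing a prior supported on x=1 and x=2 only, for example `{1: 0.5, 2: 0.5}`, gives the two-hypothesis posterior odds described above.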

Results

The previous section has presented results regarding (1) estimation of the number of contributors to a stain, (2) testing if a stain is a mixture or not and (3) verification of a non-mixture allowing for inclusion of prior information or data. Three examples follow to demonstrate the practical implementation of the methods. The examples are based on 1000 simulated datasets in S-PLUS 6.0.

Example 1

This example discusses the number of loci required to accurately estimate the number of contributors. We explain the first line of Table 1 in detail. Column 1 shows that the data were simulated with x equal to 1, followed by a column indicating the number of loci, N=50 in this case. The next two columns list the fraction correctly identified for p=0.1 and p=0.5. In the former case 0.965 or 96.5% were correctly classified, whereas there were no errors for p=0.5. As expected, the precision increases in N and decreases in x. If the number of contributors is 3 or less, the correct classification rate is always above 87% for N=200. Observe that the case with 5 contributors may not be resolved satisfactorily even with 1,000 markers for p=0.5. Figure 1 shows the standard deviation of the estimator of x as a function of p<0.5. Two intuitive results are confirmed: the uncertainty increases in x and decreases in p. Figure 2 displays the number of loci required to secure a finite estimate of the number of contributors with probability b. The plot is based on Eq. 5 and explains to some extent why it is difficult to estimate cases with many contributors for p close to 0.5.
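A simulation of this kind can be sketched in Python for the equal-frequency case, using the sufficient statistic of Eq. 4. This is not the original S-PLUS code: the number of replicates, the seed, the cap `x_max` on the search and the function names are assumptions, so the output should not be expected to reproduce Table 1 exactly.

```python
import math
import random


def log_lik_equal_p(n0, n1, n_loci, p, x):
    """Log of Eq. 4 up to the constant a(n0, n1), which does not depend on x."""
    both = 1 - p ** (2 * x) - (1 - p) ** (2 * x)
    n2 = n_loci - n0 - n1
    ll = 2 * x * (n0 * math.log(p) + n1 * math.log(1 - p))
    if n2 > 0:
        ll += n2 * math.log(both)
    return ll


def fraction_correct(p, x_true, n_loci, n_sim=200, x_max=10, seed=1):
    """Share of simulated stains whose ML estimate equals x_true."""
    rng = random.Random(seed)
    p0 = p ** (2 * x_true)
    p1 = (1 - p) ** (2 * x_true)
    weights = [p0, p1, 1 - p0 - p1]
    correct = 0
    for _ in range(n_sim):
        # Draw the state (0, 1 or 2) of each locus from Eq. 1.
        states = rng.choices((0, 1, 2), weights=weights, k=n_loci)
        n0, n1 = states.count(0), states.count(1)
        estimate = max(range(1, x_max + 1),
                       key=lambda x: log_lik_equal_p(n0, n1, n_loci, p, x))
        correct += estimate == x_true
    return correct / n_sim


for p in (0.1, 0.5):
    print(p, fraction_correct(p, x_true=1, n_loci=50))
```

Varying `x_true` and `n_loci` in the final loop gives the other combinations of contributors and markers.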

Table 1. The fraction of correctly identified number of contributors is shown in the two rightmost columns for p=0.1 and p=0.5 for various values of x (the number of contributors) and number of markers (N)

Fig. 1. The standard deviation of the estimate of the number of contributors is plotted as a function of p for one contributor, i.e., x=1, (solid line) and x=2 based on a simulation exercise with 200 markers. The uncertainty increases in x and decreases in p

Fig. 2. The number of loci required to secure a finite maximum likelihood (ML) estimate of the number of contributors with probability b is plotted as a function of p based on Eq. 5

Example 2

Data were first simulated assuming x=2. The test statistic K defined in Eq. 6 was calculated and the null hypothesis was rejected when K>1. In other words, we conclude that two or more persons contributed if the maximal likelihood assuming x≥2 exceeds the likelihood assuming x=1. Table 2 summarizes the results for varying N (50, 100, 200, 500) and p (0.1 and 0.5). The power is high, or equivalently, the probability of a type II error is small. For N≥100 the probability of reaching the correct conclusion is 0.982 (N=100, p=0.1) or higher. It remains to check the significance level of the test, and we simulated data with x=1 for this purpose. The two rightmost columns of Table 2 show that the test also performs well with respect to type I errors, i.e., the probability of falsely claiming a mixture is low.

Table 2. The properties of the K-statistic for mixtures defined in Eq. 6 are shown for various values of N and for p=0.1 and p=0.5

Example 3

Recall that Eq. 7 could be useful to prove that a stain is not a mixture, when that is indeed the case. We simulated data for x=1, N=100, and p=0.1, and computed the ratio M/P(data|x=1). In 95% of the cases, the ratio was smaller than 0.0008, reducing any prior odds for a mixture substantially towards zero. For p=0.5, the ratios are even smaller.

Discussion and concluding remarks

The examples of the paper have been based on simulated data and so we are able to see how the methods perform in cases where the truth is known. Another reason for simulating is that relevant case data using SNP markers do not appear to be available. Gill (2001) considered 50–150 markers. For practical forensic casework, confusing a mixed profile with a profile from a single person could have serious consequences, as a match between a stain and a reference person could be missed. Based on our results, it seems fair to conclude that a decision regarding mixture or not can be reached with a number of markers in the indicated range. Based on Table 2, we recommend 100 markers. In this case the type II error, i.e., the probability of missing a mixture stain, ranges from 0 to 0.018 while the type I error lies between 0 and 0.023. It is harder to estimate the precise number of contributors, particularly if a large number, say five or more, cannot a priori be excluded. Table 1 shows that with 1, 2 or 3 contributors, the correct classification rate is 75% or higher. This accuracy may be acceptable for investigative purposes, but insufficient for a court. It is possible to obtain a posterior distribution on the number of contributors. The evidence may then be weighed according to this distribution.

It remains to be seen what numbers will be available and if the problems of interpretation of data based on conventional markers (see Evett and Weir 1998; Evett et al. 1998) will be reduced for SNPs. Different contributors to a stain could have donated varying amounts and this information could be used to improve estimates.

The calculations are simplified by the diallelic structure of SNPs. For conventional markers similar calculations are more complicated. However, numerical or simulation-based results are always obtainable. Moreover, the formulation of hypotheses would typically differ for conventional markers; the question of a mixture or not is typically not relevant. For instance, if one marker displays 5 alleles and the others fewer, one might want to test the null hypothesis x≤3 against the alternative x>3. The test procedure we have suggested extends easily to this case.