1 Introduction

Consider the multinomial distribution \({\mathbf{P}} =(p_1,p_2,\dots ,p_k)\), where \(k\ge 2\) is the number of categories and \(p_i>0\) is the probability of seeing an observation from category i. We are interested in estimating the minimum probability

$$\begin{aligned} p_0=\min \{ p_i; i=1,\ldots ,k\} \end{aligned}$$

in both the cases where k is known and where it is unknown.

Given an independent and identically distributed random sample \(X_1,X_2, \ldots , X_n\) of size n from \(\mathbf{P }\), let \(y_{i}=\sum _{j=1}^n 1(X_j=i)\) be the number of observations of category i. Here and throughout, we write \(1(\cdot )\) to denote the indicator function. The maximum likelihood estimator (MLE) of \(p_i\) is \(\hat{p}_i=y_i/n\) and the MLE of \(p_0\) is

$$\begin{aligned} \hat{p}_0=\min \{\hat{p}_{i};i=1,\ldots ,k\}. \end{aligned}$$

The MLE has the obvious drawback that \(\hat{p}_{0}\) is zero when we do not have at least one observation from each category. To deal with this issue, one generally uses a modification of the MLE. Perhaps the most prominent modification is the so-called Laplace smoothing estimator (LSE). This estimator was introduced by Pierre-Simon Laplace in the late 1700s to estimate the probability that the sun will rise tomorrow, see, e.g., [4]. The LSE of \(p_0\) is given by

$$\begin{aligned} \hat{p}_{0}^{\mathrm {LS}}=\min \{(y_{i}+1)/(n+k); i=1,\cdots ,k\}. \end{aligned}$$

Note that both \(\hat{p}_0\) and \(\hat{p}_{0}^{\mathrm {LS}}\) are based only on the smallest \(y_i\). Note further that, in situations where we have not seen all of the categories in the sample, we always have \(\hat{p}_0=0\) and \(\hat{p}_{0}^{\mathrm {LS}}=1/(n+k)\). This holds, in particular, whenever \(n<k\). Thus, in these cases, the estimators are fully deterministic and completely ignore the data.
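To make these definitions concrete, here is a minimal sketch in Python with NumPy (the function names are ours, purely for illustration) computing the MLE and the LSE from a vector of category counts.

```python
import numpy as np

def mle_min(counts):
    """MLE of p_0: the smallest relative frequency y_i / n."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    return (counts / n).min()

def laplace_min(counts):
    """Laplace smoothing estimator of p_0: min_i (y_i + 1) / (n + k)."""
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), counts.size
    return ((counts + 1) / (n + k)).min()

# Example with k = 4 categories, one of which is never observed (n = 10).
y = np.array([5, 3, 2, 0])
print(mle_min(y))      # 0.0 -- degenerate
print(laplace_min(y))  # 1 / (10 + 4), regardless of the other counts
```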

In this article, we introduce a new estimator for \(p_0\), which is based on a smooth approximation of the minimum. It uses information from all of the categories and thus avoids becoming deterministic for small sample sizes. We consider both the cases when the number of categories is known and when it is unknown. We show consistency of this estimator and characterize its asymptotic distributions. We also perform a small-scale simulation study to better understand its finite sample performance. Our numerical results show that, in certain situations, it outperforms both the MLE and the LSE.

The rest of the paper is organized as follows: In Sect. 2, we introduce our estimator for the case where the number of categories k is known and derive its asymptotic distributions. Then, in Sect. 3 we consider the case where k is unknown, and in Sect. 4 we consider the related problem of estimating the maximum probability. In Sect. 5 we give our simulation results, and in Sect. 6 we give some conclusions and directions for future work. Finally, the proofs are given in “Appendix”. Before proceeding, we briefly describe a few applications:

  1. One often needs to estimate the probability of a category that is not observed in a random sample. This probability is often estimated with the LSE, which always gives the deterministic value \(1/(n+k)\); a data-driven estimate would be preferable. When the sample size is relatively large, it is reasonable to assume that the unobserved category has the smallest probability, and our estimator can be used in this case. This situation comes up in a variety of applications including language processing, computer vision, and linguistics, see, e.g., [6, 14], or [15].

  2. In the context of ecology, we may be interested in the probability of finding the rarest species in an ecosystem. Aside from the intrinsic interest in this question, this probability may be useful as a diversity index. In ecology, diversity indices are metrics used to measure and compare the diversity of species in different ecosystems, see, e.g., [7, 8], and the references therein. Generally, one works with several indices at once, as they give different information about the ecosystem. In particular, the probability of the rarest species may be especially useful when combined with the index of species richness, which is the total number of species in the ecosystem.

  3. Consider the problem of internet ad placement. There are generally multiple ads shown on the same webpage, and at most one of these will be clicked. Thus, if there are \(k-1\) ads, then there are k possible outcomes, with the last outcome being that no ad is clicked. In this context, the probability of a click on a given ad is called the click-through rate, or CTR. Assume that \(k-1\) ads have been displayed together on the same page and that we have data on these. Now, the ad company wants to replace one of them with a new ad, for which there are no data. In this case, the minimum probability among the original \(k-1\) ads may give a baseline for the CTR of the new ad, which may be useful for pricing.

2 The Estimator When k Is Known

We begin with the case where the number of categories k is known. Let \({\mathbf{p}} =(p_1,\dots ,p_{k-1})\), \(\hat{\mathbf{p }}=(\hat{p}_1,\dots ,\hat{p}_{k-1})\), and note that \(p_k=1-\sum \limits _{i=1}^{k-1} p_i\) and \({\hat{p}}_k=1-\sum \limits _{i=1}^{k-1} {\hat{p}}_i\). Since \(p_0=g({\mathbf {p}})\), where \(g({\mathbf {p}})=\min \{p_1,p_2,\dots , p_k\}\), a natural estimator of \(p_0\) is given by

$$\begin{aligned} {\hat{p}}_0 = g(\hat{{\mathbf {p}}})=\min \{{\hat{p}}_1,{\hat{p}}_2,\dots , {\hat{p}}_k\}, \end{aligned}$$

which is the MLE. However, this estimator takes the value of zero whenever there is a category that has not been observed. To deal with this issue, we propose approximating g with a smoother function. Such approximations, which are sometimes called smooth minimums, are often used in optimization theory, see, e.g., [1, 9, 10], or [11]. Specifically, we introduce the function

$$\begin{aligned} g_n({\mathbf{p}} )= w^{-1}\sum \limits _{i=1}^k p_i e^{-n^\alpha p_i}, \end{aligned}$$
(1)

where \(w = w({\mathbf {p}})= \sum \limits _{j=1}^ke^{-n^\alpha p_j}\) and \(\alpha >0\) is a tuning parameter. Note that

$$\begin{aligned} \lim _{n\rightarrow \infty } g_n({\mathbf{p}} )= g({\mathbf{p}} )= p_0. \end{aligned}$$
(2)
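To see why (2) holds, note that multiplying the numerator and denominator of (1) by \(e^{n^\alpha p_0}\) gives

$$\begin{aligned} g_n({\mathbf{p}} )= \frac{\sum _{i=1}^k p_i e^{-n^\alpha (p_i-p_0)}}{\sum _{j=1}^k e^{-n^\alpha (p_j-p_0)}} \longrightarrow \frac{p_0\,\#\{i:p_i=p_0\}}{\#\{j:p_j=p_0\}} = p_0, \end{aligned}$$

since every term with \(p_i>p_0\) carries a factor \(e^{-n^\alpha (p_i-p_0)}\rightarrow 0\), while the terms with \(p_i=p_0\) are unaffected.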

This leads to the estimator

$$\begin{aligned} {\hat{p}}_0^*= g_n(\hat{\mathbf{p }})= {\hat{w}}^{-1}\sum \limits _{i=1}^k {\hat{p}}_i e^{-n^\alpha {\hat{p}}_i}, \end{aligned}$$
(3)

where \(\hat{w} = w(\hat{{\mathbf {p}}})= \sum \limits _{j=1}^k e^{-n^\alpha \hat{p}_j}\).
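As an illustration, \({\hat{p}}_0^*\) can be computed directly from the vector of counts; the following is a minimal Python sketch (the function name and the default value of \(\alpha\) are ours, chosen only for illustration).

```python
import numpy as np

def smooth_min_estimator(counts, alpha=0.49):
    """Smooth-minimum estimator p_0^* of eq. (3), with k known.

    counts : length-k vector of category counts y_1, ..., y_k.
    alpha  : tuning parameter, 0 < alpha < 1/2.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p_hat = counts / n                    # MLEs of p_1, ..., p_k
    weights = np.exp(-n**alpha * p_hat)   # e^{-n^alpha * p_i-hat}
    return np.sum(p_hat * weights) / weights.sum()

y = np.array([5, 3, 2, 0])
print(smooth_min_estimator(y))  # strictly positive even though one category is unseen
```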

We now study the asymptotic distributions of \({\hat{p}}_0^*\). Let \(\nabla g_n ({\mathbf{p}} ) = \left( \frac{\partial g_n({\mathbf{p}} )}{\partial p_1},\ldots ,\frac{\partial g_n({\mathbf{p}} )}{\partial p_{k-1}} \right) ^{T}.\) It is straightforward to check that, for \(1 \le i \le k-1\),

$$\begin{aligned} \frac{\partial g_n({\mathbf{p}} )}{\partial p_i}= e^{-n^{\alpha }p_i}w^{-1}[1+n^{\alpha }\left( g_n({\mathbf{p}} )-p_i\right) ] -e^{-n^{\alpha }p_k}w^{-1}[1+n^{\alpha }\left( g_n({\mathbf{p}} )-p_k\right) ]. \end{aligned}$$
(4)

Let r be the cardinality of the set \(\{ j:\,p_j=p_0,\ j=1,\ldots , k\}\), i.e., r is the number of categories that attain the minimum probability. Note that \(r\ge 1\) and that we have a uniform distribution if and only if \(r=k\). With this notation, we give the following result.

Theorem 2.1

Assume that \(0<\alpha <1/2\) and let \(\hat{\sigma }_n= \{\nabla g_n(\hat{\mathbf{p }})^T {\hat{\Sigma }} \nabla g_n(\hat{\mathbf{p }})\}^{1/2}\), where \({\hat{\Sigma }}={\text{ diag }}(\hat{\mathbf{p }})-\hat{\mathbf{p }}\hat{\mathbf{p }}^T\).

  (i) If \(r \ne k\), then

    $$\begin{aligned} \sqrt{n}\hat{\sigma }^{-1}_n \{{\hat{p}}_0^*-p_0\} \xrightarrow {D}{{\mathcal {N}}}(0,1). \end{aligned}$$
  (ii) If \(r = k\), then

    $$\begin{aligned} k^2 n^{1-\alpha }\{p_0-{\hat{p}}_0^*\} \xrightarrow {D}\upchi ^2_{(k-1)}. \end{aligned}$$

Theorem 2.1 establishes the consistency of \({\hat{p}}_0^*\) and characterizes its asymptotic distributions. Further, it allows us to construct asymptotic confidence intervals for \(p_0\). If \(r\ne k\), then an approximate \(100(1-\gamma )\%\) confidence interval is

$$\begin{aligned} {\hat{p}}_0^*\pm n^{-1/2}\hat{\sigma }_n z_{1-\gamma /2}, \end{aligned}$$

where \(z_{1-\gamma /2}\) is the \(100(1-\gamma /2)\)th percentile of the standard normal distribution. If \(r=k\), then the corresponding confidence interval is

$$\begin{aligned}{}[{\hat{p}}_0^* , {\hat{p}}_0^* +k^{-2}n^{\alpha -1} \chi ^2_{k-1,1-\gamma }], \end{aligned}$$

where \(\chi ^2_{k-1,1-\gamma }\) is the \(100(1-\gamma )\)th percentile of a Chi-squared distribution with \(k-1\) degrees of freedom.
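The following sketch (illustrative Python; it recomputes \({\hat{p}}_0^*\) and uses (4) for the gradient) assembles the normal-case interval of Theorem 2.1(i); the interval for the uniform case \(r=k\) would instead use the chi-squared quantile, as described above.

```python
import numpy as np
from scipy.stats import norm

def smooth_min_ci(counts, alpha=0.49, gamma=0.05):
    """Approximate 100(1-gamma)% CI for p_0 based on Theorem 2.1(i) (r != k)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p_hat = counts / n
    e = np.exp(-n**alpha * p_hat)
    w = e.sum()
    g = np.sum(p_hat * e) / w                      # p_0^* from eq. (3)

    # Gradient of g_n at p-hat, eq. (4): components i = 1, ..., k-1.
    term = e / w * (1 + n**alpha * (g - p_hat))
    grad = term[:-1] - term[-1]

    # Sigma-hat = diag(p-hat) - p-hat p-hat^T over the first k-1 coordinates.
    p = p_hat[:-1]
    Sigma = np.diag(p) - np.outer(p, p)
    sigma_n = np.sqrt(grad @ Sigma @ grad)

    half = sigma_n * norm.ppf(1 - gamma / 2) / np.sqrt(n)
    return g, (g - half, g + half)

print(smooth_min_ci(np.array([40, 30, 20, 10])))
```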

As far as we know, these are the first confidence intervals for the minimum to appear in the literature. In fact, to the best of our knowledge, the asymptotic distributions of the MLE and the LSE have not been established. One might think that a version of Theorem 2.1 for the MLE could be proved using the asymptotic normality of \(\hat{\mathbf{p }}\) and the delta method. However, the delta method cannot be applied since the minimum function g is not differentiable. Even in the case of the proposed estimator \({\hat{p}}_0^*\), where we use a smooth minimum, the delta method cannot be applied directly since the function \(g_n\) depends on the sample size n. Instead, a subtler approach is needed. The detailed proof is given in “Appendix”.

3 The Estimator When k Is Unknown

In this section, we consider the situation where the number of categories k is unknown. In this case, one cannot evaluate the estimator \({\hat{p}}_0^*\). The difficulty lies in the need to evaluate \({\hat{w}}\). Let \(\ell =\sum _{j=1}^k 1\left( y_j=0\right)\) be the number of categories that are not observed in the sample and note that

$$\begin{aligned} {\hat{w}} = \sum _{j=1}^k e^{-n^\alpha \hat{p}_j} = \sum _{j=1}^k e^{-n^\alpha \hat{p}_j} 1\left( y_j>0\right) + \ell . \end{aligned}$$

If we have an estimator \({\hat{\ell }}\) of \(\ell\), then we can take

$$\begin{aligned} {\hat{w}}^\sharp = \sum _{j=1}^k e^{-n^\alpha \hat{p}_j} 1\left( y_j>0\right) + {\hat{\ell }} \end{aligned}$$

and define the estimator

$$\begin{aligned} {\hat{p}}_0^\sharp = \frac{1}{{\hat{w}}^\sharp }\sum \limits _{i=1}^k {\hat{p}}_i e^{-n^\alpha {\hat{p}}_i}. \end{aligned}$$
(5)

Note that \({\hat{p}}_0^\sharp\) can be evaluated without knowledge of k since \({\hat{p}}_i=0\) for any category i that does not appear in the sample.
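A sketch of (5) in illustrative Python follows; `ell_hat` is whatever estimate of \(\ell\) one chooses, for example the one given in (6) below. Since unobserved categories contribute \({\hat{p}}_i e^{-n^\alpha {\hat{p}}_i}=0\) to the numerator and \(e^{0}=1\) each to \({\hat{w}}^\sharp\), only the observed counts are needed.

```python
import numpy as np

def smooth_min_unknown_k(observed_counts, ell_hat, alpha=0.49):
    """Estimator p_0^sharp of eq. (5); k itself is not required.

    observed_counts : counts of the categories that appear in the sample (all > 0).
    ell_hat         : an estimate of the number of unobserved categories.
    """
    y = np.asarray(observed_counts, dtype=float)
    n = y.sum()
    p_hat = y / n
    weights = np.exp(-n**alpha * p_hat)
    w_sharp = weights.sum() + ell_hat   # each unseen category would contribute e^0 = 1
    return np.sum(p_hat * weights) / w_sharp
```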

Now, assume that we have observed \(k^\sharp\) categories in our sample and note that \(k^\sharp \le k\). Without loss of generality, assume that these are categories \(1,2,\dots ,k^\sharp\). Assume that \(k^\sharp \ge 2\), let \(\hat{{\mathbf {p}}}^\sharp =({\hat{p}}_1,{\hat{p}}_2,\dots , {\hat{p}}_{k^\sharp -1})\), and note that \({\hat{p}}_{k^\sharp }=1-\sum _{i=1}^{k^\sharp -1}{\hat{p}}_i\). For \(i=1,2,\dots ,(k^\sharp -1)\) let

$$\begin{aligned} h_i = e^{-n^{\alpha }{\hat{p}}_i} \frac{1}{{\hat{w}}^\sharp }\left[ 1-n^{\alpha }\{{\hat{p}}_i-{\hat{p}}_0^\sharp \}\right] -e^{-n^{\alpha }{\hat{p}}_{k^\sharp }} \frac{1}{{\hat{w}}^\sharp }[1-n^{\alpha }\{{\hat{p}}_{k^\sharp }-{\hat{p}}_0^\sharp \}] \end{aligned}$$

and let \({\mathbf {h}} = (h_1,h_2,\dots ,h_{k^\sharp -1})\). Note that we can evaluate \({\mathbf {h}}\) without knowing k.

Theorem 3.1

Assume that \({\hat{\ell }}\) is such that, with probability 1, we eventually have \({\hat{\ell }}=0\). When \(k^\sharp \ge 2\), let \({\hat{\sigma }}_n^\sharp = \{{\mathbf {h}}^T {\hat{\Sigma }}^\sharp {\mathbf {h}}\}^{1/2}\), where \({\hat{\Sigma }}^\sharp = {\mathrm {diag}}(\hat{\mathbf {p}}^\sharp )-\hat{{\mathbf {p}}}^\sharp \left( \hat{\mathbf {p}}^\sharp \right) ^T\). When \(k^\sharp =1\), let \({\hat{\sigma }}_n^\sharp =1\). If the assumptions of Theorem 2.1 hold, then the results of Theorem 2.1 hold with \({\hat{p}}_0^\sharp\) in place of \({\hat{p}}_0^*\) and \({\hat{\sigma }}_n^\sharp\) in place of \({\hat{\sigma }}_n\).

Proof

Since k is finite and we eventually have \({\hat{\ell }}=0\), there exists an almost surely finite random variable N such that if the sample size \(n\ge N\), then \({\hat{\ell }}=0\), and we have observed each category at least once. For such n, we have \(k^\sharp =k\), \({\hat{w}}^\sharp ={\hat{w}}\), \(\hat{{\mathbf {p}}}^\sharp =\hat{{\mathbf {p}}}\), and \(\nabla g_n(\hat{{\mathbf {p}}})={\mathbf {h}}\). It follows that, for such n, \({\hat{\sigma }}_n^\sharp = {\hat{\sigma }}_n\) and \({\hat{p}}_0^\sharp ={\hat{p}}_0^*\). Hence \({\hat{\sigma }}_n^\sharp /{\hat{\sigma }}_n{\mathop {\rightarrow }\limits ^{p}}1\) and \(\sqrt{n}\hat{\sigma }^{-1}_n \{{\hat{p}}_0^*-{\hat{p}}^\sharp _0\}{\mathop {\rightarrow }\limits ^{p}}0\). From here, the case \(r\ne k\) follows by Theorem 2.1 and two applications of Slutsky’s theorem. The case \(r=k\) is similar and is thus omitted. \(\square\)

There are a number of estimators for \(\ell\) available in the literature, see, e.g., [2, 3, 5], or [16] and the references therein. One of the most popular is the so-called Chao2 estimator [3, 5], which is given by

$$\begin{aligned} \hat{\ell }= {\left\{ \begin{array}{ll} \frac{n-1}{n}\frac{f_1^2}{2f_2} &{} {\text{ if }} f_2 >0 \\ \frac{n-1}{n}\frac{f_1(f_1-1)}{2} &{} {\text{ if }} f_2=0, \end{array}\right. } \end{aligned}$$
(6)

where \(f_i=\sum _{j=1}^k1\left( y_j=i\right)\) is the number of categories that were observed exactly i times in the sample. Since k is finite, we will, with probability 1, eventually observe each category at least three times, so that eventually \(f_1=f_2=0\) and \(\hat{\ell }=0\). Thus, this estimator satisfies the assumptions of Theorem 3.1. In the rest of the paper, the notation \({\hat{p}}_0^\sharp\) will refer to the estimator in which \({\hat{\ell }}\) is given by (6).
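For completeness, here is a direct transcription of (6) in illustrative Python; it takes the vector of positive counts, so \(f_1\) and \(f_2\) are computed from the observed categories only.

```python
import numpy as np

def chao2_unseen(observed_counts):
    """Estimate of the number of unobserved categories, eq. (6)."""
    y = np.asarray(observed_counts)
    n = y.sum()
    f1 = np.sum(y == 1)   # categories observed exactly once
    f2 = np.sum(y == 2)   # categories observed exactly twice
    if f2 > 0:
        return (n - 1) / n * f1**2 / (2 * f2)
    return (n - 1) / n * f1 * (f1 - 1) / 2

# Combined with the earlier sketch:
# p0_sharp = smooth_min_unknown_k(y_obs, chao2_unseen(y_obs))
```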

4 Estimation of the Maximum

The problem of estimating the maximum probability is generally easier than that of estimating the minimum. Nevertheless, it may be interesting to note that our methodology can be modified to estimate the maximum. Let

$$\begin{aligned} p_\vee = \max \{ p_i:\ i=1,\cdots ,k\}. \end{aligned}$$

We begin with the case where the number of categories k is known. We can approximate the maximum function with a smooth maximum given by

$$\begin{aligned} g_n^{\vee }({\mathbf{p}} )= w_{\vee }^{-1}\sum \limits _{i=1}^k p_i e^{n^\alpha p_i}, \end{aligned}$$
(7)

where \(w_{\vee } = w_\vee ({\mathbf {p}}) = \sum \limits _{i=1}^ke^{n^\alpha p_i}\). Note that

$$\begin{aligned} g_n^{\vee }({\mathbf{p}} ) = -g_n(-{\mathbf{p}} ), \end{aligned}$$

where \(g_n\) is given by (1). It is not difficult to verify that \(g_n^{\vee }({\mathbf{p}} )\rightarrow p_\vee\) as \(n\rightarrow \infty\). This suggests that we can estimate \(p_\vee\) by

$$\begin{aligned} {\hat{p}}_\vee ^*=g^{\vee }_n(\hat{\mathbf {p}})= \hat{w}_{\vee }^{-1}\sum \limits _{i=1}^k \hat{p}_i e^{n^\alpha \hat{p}_i}, \end{aligned}$$
(8)

where \(\hat{w}_{\vee } = w_\vee (\hat{{\mathbf {p}}})=\sum \limits _{i=1}^k e^{n^\alpha \hat{p}_i} .\)
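A sketch of (8) in illustrative Python is obtained from the smooth-minimum code by flipping the sign of the exponent.

```python
import numpy as np

def smooth_max_estimator(counts, alpha=0.49):
    """Smooth-maximum estimator p_vee^* of eq. (8), with k known."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p_hat = counts / n
    weights = np.exp(n**alpha * p_hat)   # positive exponent: large p_i-hat dominate
    return np.sum(p_hat * weights) / weights.sum()

print(smooth_max_estimator(np.array([40, 30, 20, 10])))
```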

Let \(r_{\vee }\) be the cardinality of the set \(\{ j:\,p_j=p_{\vee }, j=1,\ldots , k\}\) and let \(\nabla g_n^{\vee } ({\mathbf{p}} ) = \left( \frac{\partial g_n^{\vee }({\mathbf{p}} )}{\partial p_1},\ldots ,\frac{\partial g_n^{\vee }({\mathbf{p}} )}{\partial p_{k-1}} \right) ^{T}\). It is easily verified that, for \(1 \le i \le k-1\),

$$\begin{aligned}&\frac{\partial g_n^{\vee }({\mathbf{p}} )}{\partial p_i}= \frac{\partial g_n(-{\mathbf{p}} )}{\partial p_i}= e^{n^{\alpha }p_i}w_{\vee }^{-1}[1+n^{\alpha }\{p_i-g_n^{\vee }({\mathbf{p}} )\}] \nonumber \\&\quad -e^{n^{\alpha }p_k}w_{\vee }^{-1}[1+n^{\alpha }\{p_k-g_n^{\vee }({\mathbf{p}} )\}]. \end{aligned}$$
(9)

We now characterize the asymptotic distributions of \({\hat{p}}_\vee ^*\).

Theorem 4.1

Assume that \(0<\alpha <1/2\) and let \(\hat{\sigma }^{\vee }_n= \{\nabla g_n^{\vee }(\hat{\mathbf{p }})^T {\hat{\Sigma }} \nabla g_n^{\vee }(\hat{\mathbf{p }})\}^{1/2}\), where \({\hat{\Sigma }}={\text{ diag }}(\hat{\mathbf{p }})-\hat{\mathbf{p }}\hat{\mathbf{p }}^T\).

  (i) If \(r_{\vee } \ne k\), then

    $$\begin{aligned} \frac{\sqrt{n}}{\hat{\sigma }^{\vee }_n} \{{\hat{p}}^*_\vee -p_\vee \} \xrightarrow {D}{{\mathcal {N}}}(0,1). \end{aligned}$$
  (ii) If \(r_{\vee } = k\), then

    $$\begin{aligned} k^2 n^{1-\alpha }\{{\hat{p}}_\vee ^*-p_\vee \} \xrightarrow {D}\upchi ^2_{(k-1)}. \end{aligned}$$

As with the minimum, we can consider the case where the number of categories k is unknown. In this case, we replace \({\hat{w}}_\vee\) with

$$\begin{aligned} {\hat{w}}^\sharp _{\vee } = \sum \limits _{i=1}^ke^{n^\alpha {\hat{p}}_i}1\left( y_i>0\right) + {\hat{\ell }}, \end{aligned}$$

for some estimator \({\hat{\ell }}\) of \(\ell\). Under the assumptions of Theorem 3.1 on \({\hat{\ell }}\), a version of that theorem for the maximum can be verified.
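Under our reading of this construction, the resulting estimator can be sketched as follows (illustrative Python; as in Sect. 3, unseen categories contribute \(e^{0}=1\) each to \({\hat{w}}^\sharp _{\vee }\) and nothing to the numerator).

```python
import numpy as np

def smooth_max_unknown_k(observed_counts, ell_hat, alpha=0.49):
    """Smooth-maximum analogue of eq. (5) when k is unknown."""
    y = np.asarray(observed_counts, dtype=float)
    n = y.sum()
    p_hat = y / n
    weights = np.exp(n**alpha * p_hat)
    w_sharp = weights.sum() + ell_hat   # each unseen category contributes e^0 = 1
    return np.sum(p_hat * weights) / w_sharp
```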

5 Simulations

In this section, we perform a small-scale simulation study to better understand the finite sample performance of the proposed estimator. We consider both the cases where the number of categories is known and where it is unknown. When the number of categories is known, we will compare the finite sample performance of our estimator \({\hat{p}}_0^*\) with that of the MLE \({\hat{p}}_0\) and the LSE \(\hat{p}_{0}^{\mathrm {LS}}\). When the number of categories is unknown, we will compare the performance of \({\hat{p}}_0^\sharp\) with modifications of the MLE and the LSE that do not require knowledge of k. Specifically, we will compare with

$$\begin{aligned} {\hat{p}}_{0,u} = \frac{y^\sharp _0}{n} {\text{ and }} {\hat{p}}^{\mathrm {LS}}_{0,u} = \frac{y_0^\sharp +1}{n+ k^\sharp }, \end{aligned}$$
(10)

where \(y_0^\sharp =\min \{y_i: y_i>0 , \ i=1,2,\dots ,k\}\) and \(k^\sharp =\sum _{i=1}^k 1(y_i>0)\). Clearly, both \({\hat{p}}_{0,u}\) and \({\hat{p}}^{\mathrm {LS}}_{0,u}\) can be evaluated without knowledge of k. Throughout this section, when evaluating \({\hat{p}}_0^*\) and \({\hat{p}}_0^\sharp\), we set the tuning parameter to \(\alpha =0.49\). We chose this value because it tends to work well in practice, being neither too large nor too small. If \(\alpha\) were larger, then (2) suggests that the estimator would be almost indistinguishable from the MLE. On the other hand, if \(\alpha\) were smaller, then the smooth minimum \(g_n\) would still be far from its limit \(p_0\), and the estimator would perform poorly.
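For reference, the two baselines in (10) can be computed from the observed counts alone; the following illustrative Python sketch assumes `observed_counts` holds only the categories with \(y_i>0\).

```python
import numpy as np

def baselines_unknown_k(observed_counts):
    """MLE- and LSE-type baselines of eq. (10) when k is unknown."""
    y = np.asarray(observed_counts, dtype=float)
    n, k_sharp = y.sum(), y.size   # k_sharp = number of observed categories
    y0 = y.min()                   # smallest positive count
    return y0 / n, (y0 + 1) / (n + k_sharp)
```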

In our simulations, we consider two distributions. These are the uniform distribution on k categories, denoted by U(k), and the so-called square-root distribution on k categories, denoted by S(k). The S(k) distribution has a probability mass function (pmf) given by

$$\begin{aligned} p(i) = C \frac{1}{\sqrt{i}}, \ \ \ i=1,2,\dots , k, \end{aligned}$$

where C is a normalizing constant. For each distribution, we will consider the case where \(k=10\) and \(k=20\). The true minimums for these distributions are given in Table 1.

Table 1 True minimums for the distributions considered

The simulations were performed as follows. For each of the four distributions and each sample size n ranging from 1 to 200, we simulated \(R=10000\) random samples of size n. For each of these random samples, we evaluated our estimator. This gave us the values \({\hat{p}}^*_{0,1},{\hat{p}}^*_{0,2},\dots ,{\hat{p}}^*_{0,R}\). We used these to estimate the relative root-mean-square error (relative RMSE) as follows:

$$\begin{aligned} {\text{ Relative }} {\text{ RMSE }} = \frac{1}{p_0}\sqrt{\frac{1}{R}\sum _{i=1}^R\left( {\hat{p}}^*_{0,i}-p_{0}\right) ^2}= \sqrt{\frac{1}{R}\sum _{i=1}^R\left( \frac{{\hat{p}}^*_{0,i}}{p_0}-1\right) ^2}, \end{aligned}$$

where \(p_0\) is the true minimum. We repeated this procedure with each of the estimators. Plots of the resulting relative RMSEs for the various distributions and estimators are given in Fig. 1 for the case where the number of categories k is known and in Fig. 2 for the case where k is unknown. We can see that the proposed estimator works very well for the uniform distributions in all cases. For the square-root distribution, it also beats the other estimators for a wide range of sample sizes.
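To make the procedure concrete, here is a sketch of the Monte Carlo loop for one distribution and one sample size (illustrative Python; the S(k) weights and the replication size \(R=10000\) are as described above). The same call works with `smooth_min_estimator` from Sect. 2, or any of the other estimators, in place of the MLE used here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sqrt_distribution(k):
    """pmf of the square-root distribution S(k): p(i) proportional to 1/sqrt(i)."""
    p = 1.0 / np.sqrt(np.arange(1, k + 1))
    return p / p.sum()

def relative_rmse(p, n, estimator, R=10_000):
    """Monte Carlo estimate of the relative RMSE of an estimator of p_0 = min(p)."""
    p0 = p.min()
    errs = np.empty(R)
    for r in range(R):
        counts = rng.multinomial(n, p)
        errs[r] = estimator(counts) / p0 - 1.0
    return np.sqrt(np.mean(errs**2))

p = sqrt_distribution(10)
print(relative_rmse(p, n=50, estimator=lambda y: (y / y.sum()).min()))  # MLE
```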

It may be interesting to note that, in the case where k is known, the relative RMSE of the MLE \({\hat{p}}_0\) is exactly 1 for smaller sample sizes. This is because, when we have not seen all of the categories in our sample, the MLE is exactly 0. In particular, this holds for any sample size \(n<k\). When the MLE is 0, the LSE \({\hat{p}}_0^{\mathrm {LS}}\) is exactly \(1/(n+k)\). Thus, when k is known and \(n<k\), both \({\hat{p}}_0\) and \({\hat{p}}_0^{\mathrm {LS}}\) are fully deterministic and ignore the data entirely. This is not the case with \({\hat{p}}_0^*\), which is always based on the data.

When k is unknown, we notice an interesting pattern in the errors of the MLE and the LSE. There is a dip at the beginning, where the errors decrease quickly before increasing just as quickly. After this, they level off and eventually begin to decrease slowly. While it is not clear what causes this, one possible explanation is as follows. From (10), we can see that, for relatively small sample sizes, the numerators of both estimators are likely to be small, as we would have seen only very few observations from the rarest category. As n begins to increase, the numerators should stay small while the denominators increase, which makes the estimators decrease and thus get closer to the value of \(p_0\). However, once n becomes relatively large, the numerators should begin to increase, and thus the errors increase as well. Only when n grows larger still do the errors begin to come down, owing to the statistical properties of the estimators. If this is correct, then the dip is just an artifact of the deterministic nature of these estimators. For comparison, in most cases the error of \({\hat{p}}^*_0\) simply decreases as the sample size increases. The one exception is under the square-root distribution when the number of categories is known. It is not clear what causes the dip in this case, but it may be a similar issue.

Fig. 1 Plots of the relative RMSE in the case where the number of categories k is known. The solid line is for the proposed estimator \({\hat{p}}_0^*\), the dashed line is for the MLE \({\hat{p}}_{0}\), and the dotted line is for the LSE \({\hat{p}}^{\mathrm {LS}}_{0}\)

Fig. 2 Plots of the relative RMSE in the case where the number of categories k is unknown. The solid line is for the proposed estimator \({\hat{p}}_0^\sharp\), the dashed line is for the MLE \({\hat{p}}_{0,u}\), and the dotted line is for the LSE \({\hat{p}}^{\mathrm {LS}}_{0,u}\)

6 Conclusions

In this paper, we have introduced a new method for estimating the minimum probability in a multinomial distribution. The proposed approach is based on a smooth approximation of the minimum function. We have considered the cases where the number of categories is known and where it is unknown. The approach is justified by our theoretical results, which verify consistency and characterize the asymptotic distributions. Further, a small-scale simulation study has shown that the method performs better than several baseline estimators for a wide range of sample sizes, although not for all sample sizes. A potential extension would be to prove asymptotic results in the situation where the number of categories increases with the sample size. This would be useful for studying the problem when there is a very large number of categories. Other directions for future research include obtaining theoretical results about the finite sample performance of the estimator and proposing modifications of the estimator aimed at reducing the bias using, for instance, a jackknife approach.