Introduction

Marshall and Olkin (1997) introduced a method for adding a parameter to a family of distributions, illustrating it with the exponential and Weibull distributions, in order to obtain new distributions. The procedure aims to develop new distributions with greater flexibility for modeling various types of data. According to the authors, if \(\overline{F}(x)\) denotes the survival function of a continuous random variable \(X\), then adding the parameter \(\alpha\) results in another survival function \(\overline{G}\left( x \right)\) defined by

$$\overline{G}\left( x;\,\alpha \right) = \frac{\alpha \overline{F}\left( x \right)}{1 - \left( 1 - \alpha \right)\overline{F}\left( x \right)}, \qquad -\infty < x < \infty,\ \alpha > 0,$$
(1)

where the probability density function (PDF) and the cumulative distribution function (CDF) corresponding to Eq. (1) are:

$$g\left( x;\,\alpha \right) = \frac{\alpha f\left( x \right)}{\left[ 1 - \left( 1 - \alpha \right)\overline{F}\left( x \right) \right]^{2}}, \qquad -\infty < x < \infty,\ \alpha > 0,$$
(2)

and

$$G\left( x;\,\alpha \right) = \frac{F\left( x \right)}{1 - \left( 1 - \alpha \right)\overline{F}\left( x \right)}, \qquad -\infty < x < \infty,\ \alpha > 0,$$
(3)

where \(f\left( x \right)\) is the PDF corresponding to \(\overline{F}\left( x \right)\).
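For instance, taking the exponential baseline with survival function \(\overline{F}\left( x \right)=e^{-\lambda x}\), \(x>0\), \(\lambda >0\) (one of the two cases considered by Marshall and Olkin), Eq. (1) gives the Marshall–Olkin extended exponential survival function

$$\overline{G}\left( x;\,\alpha ,\lambda \right) = \frac{\alpha e^{-\lambda x}}{1 - \left( 1 - \alpha \right)e^{-\lambda x}}, \qquad x > 0,$$

which reduces to the exponential baseline when \(\alpha = 1\).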

On the other hand, the hazard rate function of the Marshall–Olkin extended distribution is given by

$$h\left( x;\,\alpha \right) = \frac{r\left( x \right)}{1 - \left( 1 - \alpha \right)\overline{F}\left( x \right)},$$
(4)

where \(r\left(x\right)=\frac{f(x)}{\overline{F}\left(x\right)}\) is the hazard rate function of the baseline distribution.

Marshall and Olkin called the additional parameter the “tilt parameter,” since the hazard rate of the new family lies below \((\alpha >1)\) or above \(\left(0<\alpha \le 1\right)\) the hazard rate of the underlying distribution. In other words, for all \(x\ge 0\), \(h\left(x\right)\le r\left(x\right)\) when \(\alpha >1\), and \(h\left(x\right)\ge r\left(x\right)\) when \(0<\alpha \le 1\).
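As a minimal numerical sketch of Eqs. (1)–(4) and of this ordering (an illustration added here, not part of the original paper, using the exponential baseline above), the transform can be coded directly in R:

```r
## Marshall-Olkin transform of a baseline survival function S and density f (Eqs. 1 and 4)
mo_surv   <- function(x, alpha, S) alpha * S(x) / (1 - (1 - alpha) * S(x))
mo_hazard <- function(x, alpha, S, f) (f(x) / S(x)) / (1 - (1 - alpha) * S(x))

## Exponential baseline: S(x) = exp(-lambda x), constant hazard r(x) = lambda
lambda <- 1
S <- function(x) exp(-lambda * x)
f <- function(x) lambda * exp(-lambda * x)

x <- seq(0.01, 5, by = 0.01)
all(mo_hazard(x, alpha = 2,   S, f) <= lambda)  # TRUE: h(x) <= r(x) when alpha > 1
all(mo_hazard(x, alpha = 0.5, S, f) >= lambda)  # TRUE: h(x) >= r(x) when 0 < alpha <= 1
```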

Since the seminal work of Marshall and Olkin (1997), many researchers have introduced new distributions or generalized existing ones to model the behavior of real data sets. The main objective of proposing, extending, or generalizing models is to explain how a data set behaves in lifetime analysis, survival times, failure times, and reliability analysis. Furthermore, the proposed models have been applied in areas such as medicine, public health, biology, physics, computer science, finance and insurance, engineering, industry, and communications, among others (Jayakumar & Sankaran, 2019a, 2019b; Nassar et al., 2019; Rondero-Guerrero et al., 2020).

For instance, better modeling in the reliability analysis of system components is increasingly important in virtually all sectors of manufacturing, engineering, and administration. Reliability engineering studies the ability of a device to function without failure in order to predict or estimate the risk of failure; that is, it studies the capacity of a component or system to function for a specific time or over an interval of time (Oliveira et al., 2021). The use of different lifetime distributions has therefore become more critical given the global dynamics of trade: with a growing variety of products and an increasing focus on quality control, more companies are under pressure to perform reliability analyses of their products to understand failure and survival rates. In addition, statistical lifetime distributions have grown in importance in fields such as the biological sciences, life testing, and medicine, since they can predict disease behavior, in particular supporting control and mitigation measures in response to the social impact of epidemics and pandemics.

Although Oliveira et al. (2021), Algarni (2021), and Eghwerido et al. (2021) mention that lifetime probability distributions such as the Weibull, exponential, Lindley, and Weibull exponential distributions, among others, can be used to model such data, in many cases they do not provide a good fit for phenomena with non-monotone failure rates, such as upside-down bathtub-shaped failure rates. For this reason, many researchers have developed new, more flexible models in the last decade.

That is why this article aims to analyze, through a bibliometric approach, publications on new distribution models or generalizations of existing distributions that derive from the seminal work of Marshall and Olkin (1997). For this bibliometric study, we focus on publications found in the Web of Science (core collection) and Scopus databases. In addition, this study considers several bibliometric indicators related to authors, journals, and articles. The paper applies various bibliometric techniques using the open-source “R” environment (through “RStudio”) to map collaboration and co-citation networks. The contribution of this work to the body of literature lies in:

  • Describing how the contribution of Marshall and Olkin (1997) to the development of new distributions is organized and has advanced in terms of publications, authors, and journals, as well as identifying bibliometric trends.

  • Presenting the main characteristics of the new distributions or distribution families.

Materials and method

Bibliometric analysis is a type of quantitative analysis used to classify and report bibliographic data on a particular research topic. In other words, it is used to measure the impact of journals, identify authors, and detect new research lines. This type of analysis involves the mathematical and statistical treatment of scientific publications and their respective citations. Bibliometric analysis results provide relevant information about the level of activity (research) that exists among authors, organizations or countries, as well as the evolution of research topics (Cancino et al., 2019; Ferreira, 2018; Lei & Xu, 2020).

We chose a bibliometric study because it develops a systematic, transparent, and reproducible process of identifying relevant manuscripts. In addition, a series of techniques are applied to evaluate scientific production through objective and quantitative indicators of bibliographic data (Krainer et al., 2020). The protocol used for this bibliometric analysis is shown in Fig. 1. This process begins by establishing the research topic and then continues with four sequential stages that provide sufficient evidence on the contribution and development of scientific knowledge (development or generalization of new distributions).

Fig. 1 Protocol for the bibliometric study

Also, in this protocol, two main categories of analysis are considered: performance analysis and scientific mapping. The first one aims to present a descriptive analysis of the following parameters: authors, journals, institutions, and articles. The second focuses on mapping bibliometric networks to explore the interrelationship between authors and references (Ruggeri et al., 2019).

The “RStudio” software was used with the Bibliometrix 3.0.1 package. This package is appropriate for bibliometric and scientometric studies because it provides users with a greater degree of control over modifying and adjusting the input and output data (Aria & Cuccurullo, 2020).

Data collection

Oorschot et al. (2018) and Ruggeri et al. (2019) state that, to guarantee the maximum quality of the documents to be analyzed, it is vital to use the Web of Science (core collection) database, since the validity of any bibliometric analysis depends largely on the selection of publications. These authors indicate that Web of Science meets the highest standards in terms of impact factor and number of citations. On the other hand, Lei and Xu (2020) mention that another database with high quality standards is Scopus. According to Lei and Xu, Scopus is the world's largest abstract and citation database of peer-reviewed literature.

Therefore, to achieve the objective of this study, both the Web of Science (core collection) and Scopus databases were selected. Data collection began on April 30, 2021. For the Web of Science database, the following search string was used: Topic (Marshall-Olkin), Publication Years (1997–2021), and Document Types (Article). This search produced 339 articles. Each manuscript was then examined to verify that its subject matter was relevant to the topic; through this process, 239 documents were removed, resulting in a dataset of 100 documents.

For the Scopus database, the following search string was used: Article title, Abstract, Keywords (Marshall-Olkin), Year (1997–2021), and Document Type (Article). The result was 352 articles. The same screening process applied to the Web of Science results was used to verify the relevance of each manuscript, which resulted in 106 documents.

Finally, to combine the Web of Science and Scopus bibliographic data and eliminate duplicate articles, the “mergeDbSources” function of the Bibliometrix package was used (Aria & Cuccurullo, 2020), yielding a final total of 131 articles for the bibliometric analysis.
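A minimal sketch of this import-and-merge step is shown below; the file names are placeholders for the exports obtained from each database, and the calls use functions of the Bibliometrix package cited above (a sketch of typical usage, not the exact script of this study).

```r
library(bibliometrix)

## Import the search results exported from each database (file names are placeholders)
wos    <- convert2df("wos_marshall_olkin.bib",    dbsource = "wos",    format = "bibtex")
scopus <- convert2df("scopus_marshall_olkin.csv", dbsource = "scopus", format = "csv")

## Combine both collections and remove duplicate articles
M <- mergeDbSources(wos, scopus, remove.duplicated = TRUE)
```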

Bibliometric analysis

In this section, a quantitative analysis of the articles composing the database is presented and discussed. First, we carried out a descriptive analysis considering the following parameters: authors, journals, institutions, and articles. Second, we conducted a bibliometric network analysis considering collaboration and co-citation networks.

A total of 238 authors participated in the development of the 131 articles analyzed. Collaboration between authors is key to the development of new models. Table 1 shows that eight papers were written by a single author; the rate of collaboration between authors is 1.8. The main information about the bibliometric data is shown in Table 1.

Table 1 Main information about bibliometric data
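The descriptive indicators summarized in Table 1 and in the following subsections are the kind typically produced by the Bibliometrix performance-analysis functions; a minimal sketch (assuming the merged collection M from the data-collection step) is:

```r
## Descriptive (performance) analysis of the merged collection M
results <- biblioAnalysis(M, sep = ";")
summary(results, k = 10)  # main information, annual production, most productive authors and sources
plot(results, k = 10)     # basic descriptive plots
```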

Annual trends

Table 2 presents an overview of the annual scientific production from 1997 to April 21, 2021. As shown in the table, from 1997 to 2011 there were only six publications related to new classes or generalizations of distributions based on the seminal work by Marshall and Olkin (1997). From 2013 to 2015, interest in developing new probability distributions began to grow, and from 2016 to date there has been a significant increase in the publication of new distribution models. In terms of total citations (TC) and average citations per article (ATC), the most cited manuscripts correspond to 2007, although only two articles were published that year. Regarding the number of authors, there is an upward trend from 2013 to date, as shown in Table 2.

Table 2 Annual Scientific Production

The main authors, most influential articles and journals

Table 1 showed that 238 authors in the analyzed database have written about new distributions or have extended a class of distributions. This section presents the most productive authors, those with five or more publications. Under this criterion, 12 researchers were identified, as Table 3 shows. These 12 authors are ordered by number of publications, from highest to lowest. Cordeiro G. is the author who has contributed the most to the development of new distributions, with a total of 17 publications, followed by Afify A. with 14 articles. Regarding the total number of citations, the most cited authors are Cordeiro G. (123), Al-Awadhi F. (105), Alkhalfan L. (105), Ghitany M. (105), Yousof H. (101), and Afify A. (97).

Table 3 Most Relevant Authors

On the other hand, one of the objectives of the scientific community is to achieve a significant and recognized impact through publications. One way of assessing the impact of an author is by considering where and how often their work is cited; that is, author-level metrics are citation metrics that measure the bibliometric impact of individual authors. Table 3 presents three metrics that measure the impact of the authors analyzed here. As seen in Table 3, Cordeiro G. ranks first in two of the three indices presented (h-index and g-index), and Yousof H. has the highest m-index.
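For reference, the h-index is the largest h such that an author has h papers with at least h citations each; the g-index is based on the cumulative citations of the top g papers, and the m-index divides the h-index by the author's academic age in years. A minimal base-R sketch of the h-index computation (illustrative only, with a made-up citation vector) is:

```r
## h-index: largest h such that h papers each have at least h citations
h_index <- function(citations) {
  cites <- sort(citations, decreasing = TRUE)
  sum(cites >= seq_along(cites))
}
h_index(c(10, 8, 5, 4, 3, 0))  # returns 4
```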

Another aspect of a bibliometric analysis is identifying the most influential documents in the development of new distributions. As shown in Table 4, the document by Ghitany et al. (2007) is the most cited in the database, with a total of 105 citations. The authors of this article present a new variant of the extended family of Marshall and Olkin distributions, using the Lomax distribution as the baseline to generate a new model. A notable aspect of the publications shown in Table 4 is that two articles were written by a single author; the other eight papers were collaborations.

Table 4 The 10 most cited documents

Of the different document types available on scientific platforms and databases, journals are considered the most important means of communication and are among the most widely used sources in the scientific community. A journal serves as an official means of publicly recording scientific and academic findings; that is, it is a scientific and social institution that reflects aspects such as the contribution and prestige of researchers (authors), the disciplines of the most influential scientists, and the most productive institutions, countries, and publishers.

Table 5 shows the journals that have two or more publications in this database. As shown in the table, the journal “Communications in Statistics-Theory and Methods” ranks first in the number of articles and citations, with 13 and 156, respectively. However, the journal “Statistical Papers” has the highest proportion of citations per paper (3 papers and 85 citations). On the other hand, journal quality can be determined by the impact factor (in this case, the relationship between the number of articles and the sum of the citations of these articles). In this bibliometric analysis, the journal with the highest impact factor is “Journal of King Saud University – Science”, followed by “Journal of Computational and Applied Mathematics”, and “Statistical Papers”.

Table 5 The most productive international journals

Bibliometric networks

Network analysis is a technique widely used in bibliometrics and scientometrics studies. Bibliometric networks generally consist of nodes, which may be authors (researchers), universities, countries, journals, keywords or references, and links representing the relationships between them. In each case, the corresponding bibliometric network represents a set of documents for study or analysis. The software used in this work allows the following bibliometric networks to be created and represented visually: Collaboration Networks, Co-citation Networks, Coupling Networks, and Co-occurrences Networks.

The mapping method in the Bibliometrix 3.0.1 package consists of three stages: (1) standardization method, (2) type of network, and (3) clustering algorithm for the nodes in the network. To visualize bibliometric networks, the “net2VOSviewer” function of the Bibliometrix 3.0.1 package was used to export the obtained networks to the “VOSviewer” software (Aria & Cuccurullo, 2020). The collaboration network and co-citation network for this work are presented below.
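A minimal sketch of this network pipeline is given below (again assuming the merged collection M; the argument values, such as the number of nodes, are illustrative rather than the exact settings used for Figs. 2 and 3):

```r
## Author collaboration network (co-authorship), cf. Fig. 2
col_mat <- biblioNetwork(M, analysis = "collaboration", network = "authors", sep = ";")
col_net <- networkPlot(col_mat, n = 50, Title = "Author collaboration", type = "auto")

## Reference co-citation network, cf. Fig. 3
coc_mat <- biblioNetwork(M, analysis = "co-citation", network = "references", sep = ";")
coc_net <- networkPlot(coc_mat, n = 200, Title = "Co-citation of references", type = "auto")

## Export a network object to the VOSviewer software for visualization
net2VOSviewer(col_net)
```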

Collaboration network

This analysis identifies the level of collaboration in published research from three perspectives: authors, institutions, and countries. Its aim is to investigate the collaborative strength of research in a specific field.

Concerning co-authorship relationships, Table 1 shows that 12 of the 238 authors are listed as sole authors; that is, 91% of the articles in this database were published in co-authorship. Figure 2 shows the collaboration network between authors, where the most collaborative authors, in terms of publications, are represented by larger nodes. In Fig. 2, it can be seen that the largest nodes correspond to Cordeiro G. and Afify A., who have the most articles and the greatest total link strength, which is consistent with Table 3. On the other hand, 20 research clusters were identified, represented by different colors; the clusters of Afify A., Ozel G., Jamal F., Yousof H., Alizadeh M., and Nasir M. collaborate closely with Cordeiro G. The remaining researchers show a lower weight in publications and collaborative links, so their clusters are more isolated: some are connected through only one researcher, and others have no collaboration links with other research clusters.

Fig. 2 Collaboration networks: co-authorship

Co-citation network

According to Ruggeri et al. (2019), co-citation analysis is a bibliometric technique proposed by Small (1973) that aims to represent, through a network, the structure of a set of documents that are commonly cited together. In other words, the more often two papers are cited together, the stronger their association, which allows us to infer that the corresponding research is closely related (it belongs to the same research field).

Figure 3 presents the co-citation network of references. The database recorded a total of 3,505 cited references. Figure 3 shows the 200 most frequently co-cited references, i.e., the articles most often cited in publications related to the development of new distributions. The most cited reference is the seminal work by Marshall and Olkin (1997), which allows us to infer that this article represents the central knowledge base for developing new distributions; this is also congruent with the objective of the present study. On the other hand, of the 131 articles analyzed, the manuscript by Jayakumar and Mathew (2008) has the most local citations.

Fig. 3 Co-citation network: references

Main findings of the bibliometric analysis

According to the database, in the last six years it has mainly been researchers in statistics, mathematics, and engineering who have developed new distributions or have generalized and extended existing models to increase their versatility. The most popular and widely used distributions, such as the exponential, gamma, normal, and Weibull, among others, are limited in their characteristics and offer little flexibility. For this reason, many authors have used different techniques to build new models. According to El-Morshedy et al. (2020), the main reasons for these new models are:

  • To build heavy-tailed distributions to be able to model real data.

  • To make the kurtosis more flexible compared to the baseline model(s).

  • To generate distributions with symmetric, left-skewed, right-skewed, or reversed-J shape.

  • To provide more flexibility in the cumulative distribution function and the hazard rate function.

  • To provide a better fit than models generated under the same baseline distribution.

Another relevant point considered in the articles was the method for estimating the parameters of the model. Ninety-five percent of the papers presented a method for estimating the parameters of their model; the remaining manuscripts did not. The predominant method was maximum likelihood, used in 124 articles. In addition, 13 manuscripts applied Bayesian estimation. Other methods used to estimate parameters were maximum product spacing, least squares, and interval estimation.

The main topics addressed in the manuscripts are the survival function, hazard function, mean residual life, Rényi entropy, moments, moment generating function, quantile function, order statistics, stochastic orderings, estimation methods, simulation, and applications. Of the 131 works, 91 mention that they used software to find the parameter values in the different simulations and applications of the proposed models. Seventy-one manuscripts used the open-source software “R”; others used software such as Maple, Matlab, SAS, Python, Mathcad, Mathematica, and the Ox matrix programming language. The remaining papers did not mention which software they used.

Ninety-one percent of the publications illustrate the proposed model's practical importance by applying it to a real data set to show the new distribution's potential and flexibility. The data are fitted to the proposed model and compared with other existing models. For comparison purposes, the authors compute goodness-of-fit statistics such as the Akaike information criterion (AIC), consistent Akaike information criterion (CAIC), Bayesian information criterion (BIC), Hannan–Quinn information criterion (HQIC), Anderson–Darling (AD) statistic, Cramér–von Mises criterion (CVMC), and Kolmogorov–Smirnov (KS) statistic, together with the corresponding p-values. A notable aspect of the data used to demonstrate the new distributions' applications is that 109 articles take data from other publications, with reference years ranging from 1965 to 2011.
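As a minimal illustration of this estimation-and-comparison workflow (a sketch added here, not a reproduction of any particular article; the data vector and starting values are placeholders), the Marshall–Olkin extended exponential of Eqs. (2)–(3) can be fitted by maximum likelihood and evaluated with AIC, BIC, and the KS statistic:

```r
## Marshall-Olkin extended exponential: density (Eq. 2) and CDF (Eq. 3)
moee_pdf <- function(x, alpha, lambda)
  alpha * lambda * exp(-lambda * x) / (1 - (1 - alpha) * exp(-lambda * x))^2
moee_cdf <- function(x, alpha, lambda)
  (1 - exp(-lambda * x)) / (1 - (1 - alpha) * exp(-lambda * x))

## Negative log-likelihood; log-parametrization keeps both parameters positive
negloglik <- function(par, x) -sum(log(moee_pdf(x, exp(par[1]), exp(par[2]))))

set.seed(1)
y <- rexp(100, rate = 0.5)                 # placeholder lifetime data
fit <- optim(c(0, 0), negloglik, x = y)    # maximum likelihood fit

alpha_hat <- exp(fit$par[1]); lambda_hat <- exp(fit$par[2])
k <- 2; n <- length(y)
aic <- 2 * fit$value + 2 * k               # Akaike information criterion
bic <- 2 * fit$value + k * log(n)          # Bayesian information criterion
ks  <- ks.test(y, function(q) moee_cdf(q, alpha_hat, lambda_hat))  # Kolmogorov-Smirnov
c(alpha = alpha_hat, lambda = lambda_hat, AIC = aic, BIC = bic, KS = unname(ks$statistic))
```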

Some of the data used correspond to failure times of mechanical or electrical components, waiting times in banks, survival times of tuberculosis patients, survival times of cancer patients, strength tests of glass fibers, numbers of deaths from vehicle accidents, fatigue times of 6061-T6 aluminum coupons, wind speeds measured at a height of 20 m, remission times of cancer patients, breaking stress of carbon fibers, GDP growth (% per year), nicotine measurements, equipment or device failure rates, fatigue fracture, monthly tax income, sports data, distances (in miles) from a nuclear power plant to earthquake epicenters (used to assess the risks that nearby earthquakes pose to the plant), call times, average annual growth rate of carbon dioxide, maximum annual flood discharges, average maximum daily rainfall over 30 years, vehicular traffic, lifespans (in km) of front disk brake pads on randomly selected cars, marital status and divorce rates, and relief times of patients who received an analgesic. Each of the models proposed in this database can thus serve, alongside others available in the literature, as a candidate for modeling other real-life data sets.

One of the main characteristics of the new distributions or generalizations is the number of parameters added to the model to provide greater flexibility in modeling specific applications or data. Table 6 shows the name and number of parameters of each of the models proposed in the 131 articles analyzed. As can be seen in the table, 9.2% of the distributions consider only two parameters, 45% work with three parameters, 35.1% consider four parameters, 8.4% use five parameters, and 2.3% (three articles) consider distributions with six parameters, proposed by Handique and Chakraborty (2017a, 2017b), Yousof et al. (2016), and Jose et al. (2011).

Table 6 Families of distributions

Clearly, in the empirical applications of the models analyzed in this bibliometric study, the results reported by the authors show that the newly proposed distributions, which are generalizations of Marshall and Olkin (1997), produce better fits than other models that are already well known and widely applied.

Conclusion

The main objective of this manuscript was to present a bibliometric analysis of the distribution functions developed from the seminal work of Marshall and Olkin (1997) over the twenty-four years from 1997 to 2021. The Bibliometrix package for R was used for data mining, analysis, and bibliometric network mapping. This process made it possible to identify the main trends and contributions in this line of research.

Two research repositories, Web of Science and Scopus, were used to compile the database of 131 articles. Among the most relevant findings is that the new distributions typically add one or two parameters to the baseline models or to previous generalizations, although some distributions involve up to six parameters to achieve greater flexibility. The maximum likelihood method is predominant for parameter estimation, being used in 124 articles; however, other methods such as maximum product spacing, least squares, and interval estimation have also been used in recent years. The main topics addressed in the publications are the survival function, hazard function, mean residual life, Rényi entropy, moments, moment generating function, quantile function, order statistics, stochastic orderings, and estimation methods. In addition, to show the advantages of the proposed models, 91% of the analyzed publications carried out simulations and applications to real data to confirm and demonstrate the competitiveness of the new distributions.

Finally, we note that more than 90% of the distributions analyzed in this work are generalizations or extensions of models already existing in the literature. It is therefore likely that the structures used so far to build distributions will continue to be combined or expanded, and that new models will be developed to analyze the behavior of data sets arising from real-world problems.