1 Introduction

The application of data mining in the medical domain, including the diagnostic medical laboratory, has recently seen an increasing trend. Data mining can be broadly defined as ‘a set of mechanisms and techniques, realised in software, to extract hidden information from data’ [1]. It is a subprocess of knowledge discovery in data (KDD) which is the ‘non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data’ [2]. The technical approaches of applying data mining in the medical world include data clustering and classification, making predictions, finding frequent patterns, analysing changes, and detecting anomalies. Clinical laboratory test results are paramount to evidence-based medicine, with nearly 80% of medical decisions made on the information provided by laboratory reports [3]. Without an accompanying set of reference intervals (RIs), a test result on its own is of little value [4]. A reference interval (RI), as defined by Ceriotti [5], “is an interval that, when applied to the population serviced by the laboratory, correctly includes most of the subjects with characteristics similar to the reference group and excludes the others.”

The RI serves as a health-associated benchmark against which an individual test result is compared, and it is vital to the implementation of mobile health (mHealth) monitoring systems as we usher in the Fourth Industrial Revolution. Such systems would enable clinicians and empower patients by tracing critical physiological parameters, generating early warnings and alerts, and indicating the need for any significant changes to the results, consultation, medication, and treatments [6]. However, establishing accurate and reliable RIs is considerably complex [7]. Paediatric RIs (PRIs) should reflect the dynamic biological and biochemical changes throughout developmental growth to ensure correct diagnosis and treatment [8].

While the concept of RIs appears simple, defining paediatric reference intervals (PRIs) using the direct method, which requires 120 presumed healthy reference individuals per partition, is daunting and taxing. The cost of conducting a direct reference interval study, as estimated with the activity-based costing (ABC) method, is also high [9]. Because of these obstacles, an alternative, the indirect method, has started to garner considerable attention. The indirect reference interval (IRI) method involves data mining of routine paediatric laboratory results stored in the laboratory information system (LIS) that were collected for other purposes, including routine clinical care and screening. PRIs are subsequently established using appropriate statistical techniques [10]. The primary clinical data mining method used in IRI is descriptive cluster analysis, which finds groups of similar objects to form clusters. It is an unsupervised machine-learning approach that acts on unlabelled data. Each cluster comprises data points that belong to the same group, i.e., the data are partitioned into sets of similar points. The clustering methods commonly used include partitioning and density-based methods.

The IRI is based on identifying a distribution amid the data and does not require assessment of all individual results in the dataset as belonging to the reference population [10]. The IRI method assumes that the examined dataset consists of a mixture of parametrically distributed samples from healthy individuals and pathological samples not described by that distribution. In a sufficiently large dataset with a dominant fraction of physiological test results for the examined analyte, the distribution of non-pathological values can be estimated using advanced statistical methods and the pathological test results are assumed to have no substantial impact on the RIs [11].
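One way to write down the mixture assumption described above is sketched here. The notation is illustrative and is not taken from the cited works: $f_H$ denotes the parametric density of results from healthy individuals with parameters $\theta$, $f_P$ an unspecified density of pathological results, $\pi$ the dominant fraction of physiological results, and $F_H$ the cumulative distribution function of $f_H$.

```latex
f(x) = \pi\, f_H(x;\theta) + (1-\pi)\, f_P(x), \qquad \tfrac{1}{2} < \pi \le 1,
\qquad
\mathrm{RI} = \bigl[\, F_H^{-1}(0.025;\theta),\; F_H^{-1}(0.975;\theta) \,\bigr]
```

Under this reading, an indirect method estimates $\theta$ from the mixed data and reports the central 95% of the healthy component as the RI.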

This paper aims to highlight the historical aspect of IRI determination and to assess publications that have used data mining to establish IRIs in the paediatric population over the past ten years. The paper is arranged into five sections. Section 2 briefly explains data mining in the establishment of indirect PRIs. This is followed by a summary of selected articles that utilised data mining to establish PRIs in Sect. 3, a discussion of the results in Sect. 4, and the conclusion in Sect. 5.

2 Data Mining in IRI Establishment

This section describes the historical aspect of data mining in IRIs and the general steps involved in its establishment.

2.1 The Indirect Methods for RIs Establishment

The foundation for establishing IRIs from patients’ results stored in the LIS was laid as early as the 1960s by Robert G. Hoffmann, who proposed the application of statistics in medicine in the Journal of the American Medical Association [12]; the approach was initially documented by John Glick in 1972. However, it was not until the personal computer arrived in the 1980s that enough computing power was available to apply it generally [13]. Separately, in 1967, C. G. Bhattacharya explored a graphical method to identify Gaussian distribution components [14]. This paved the way for other scholars to apply the method in their research.

T. Kouri and his team developed RIs for haematological blood indices partitioned by gender by combining data mined from the LIS with diagnostic data. They surmised that mining data from hospitalised patients on the basis of diagnostic information may also be applicable to other analytes. In 1998, Horn and Pesce developed a robust approach for establishing RIs from small datasets [15]. Later, in 2019, a modified version was presented by Horn et al. to accommodate larger distributions of reference intervals [16]. The REALAB project by Grossi et al. in 2005 established RIs for 23 basic tests from approximately 15 million records using a multivariate algorithm [17]. In 2007, Arzideh et al. developed a novel approach that uses a kernel-smoothed density function, based on a bimodal method, to estimate the distribution of the combined data for both non-diseased and diseased populations. This is a more advanced procedure for determining RIs from data mined from laboratory databases without considering any diseased population distribution [18]. The Clinical and Laboratory Standards Institute (CLSI), in its 2010 guideline, provided guidance on establishing RIs for quantitative clinical laboratory tests. The indirect method was mentioned only briefly, as an alternative rather than a primary approach. With the explosion of big data technology, the challenges of recruiting reference individuals, and the exorbitant cost of developing RIs using the direct method, especially in the paediatric population, many researchers around the globe have taken an interest in developing PRIs by mining patient data from diagnostic laboratories using the indirect method. This has led to rapid improvement in the methodology and statistical techniques [18,19,20,21,22].

2.2 The Indirect Methods for Paediatric RIs (PRIs) Establishment

A long-standing gap exists in PRIs, especially in the neonatal and young infant subgroups, because of the difficulty and the ethical issues involved in obtaining blood from healthy children. With the growth of technology and the availability of large laboratory databases, the indirect reference interval method is seen as having the potential to fill this gap. The challenge is to identify the physiological samples among the pathological samples in the mixed laboratory dataset using either a metadata-driven or a primarily statistical strategy. The availability of many data points has also contributed to the development of continuous percentile charts, or dynamic reference intervals, for biochemical and haematological analytes, which better represent the dynamic physiological development of the paediatric population, especially during the neonatal and infantile period and throughout puberty [23, 24]. The indirect method has mainly been used to establish paediatric IRIs (PIRIs) for biochemical analytes such as calcium and bone markers, alkaline phosphatase, creatinine, lipids, arterial blood gases and trace minerals [22, 25,26,27,28,29,30,31,32,33]. The indirect method has also been used successfully to establish haematological reference intervals for full blood count indices and coagulation profiles in many countries [21, 34,35,36,37,38,39].

2.3 The Steps Involved in IRI Establishment

The general steps involved in IRI establishment, whether metadata-driven or statistically-driven, are data collection, data cleaning, data analysis and result verification. Data analysis comprises three main processes: partitioning the input dataset into the desired groups; statistical analysis, which includes outlier removal, calculating the cumulative frequency of each result, calculating the inverse cumulative distribution function (CDF) of a standard Gaussian distribution at each cumulative frequency, and graphing these inverse CDF values against the measured analyte values; and piecewise linear regression in R to identify the linear portion of the distribution. The linear part of the distribution is then graphed, and linear regression is used to determine the equation that represents it. Finally, the 2.5th and 97.5th centiles are extrapolated from the linear equation and taken as the lower and upper reference limits [40]. In metadata-driven studies, additional steps are taken during data cleaning to remove results associated with abnormalities in other analytes, results from patients with known diseases, or results excluded on the opinion of subject matter experts.
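The statistical steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the cited studies: the function name `hoffmann_sketch` and its `central` parameter are hypothetical, partitioning and outlier removal are omitted, and the piecewise linear regression that identifies the linear portion is replaced by a fixed central window of cumulative frequencies.

```python
import random
from statistics import NormalDist, mean


def hoffmann_sketch(values, central=(0.25, 0.75)):
    """Estimate (lower, upper) reference limits from mixed laboratory results.

    Assumption: the linear portion of the plot is taken to be the window of
    cumulative frequencies given by `central`, instead of being located by
    piecewise linear regression as described in the text.
    """
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()  # standard Gaussian, mean 0 and SD 1

    # Cumulative frequency of each result, then its inverse standard normal CDF.
    # The plotting position (i - 0.5) / n keeps the frequency strictly in (0, 1).
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n
        if central[0] <= p <= central[1]:
            points.append((std_normal.inv_cdf(p), x))
    if len(points) < 2:
        raise ValueError("too few results inside the central window")

    # Ordinary least squares of analyte value on z-score: value = intercept + slope * z.
    z_bar = mean(z for z, _ in points)
    v_bar = mean(v for _, v in points)
    slope = (sum((z - z_bar) * (v - v_bar) for z, v in points)
             / sum((z - z_bar) ** 2 for z, _ in points))
    intercept = v_bar - slope * z_bar

    # Extrapolate the fitted line to the 2.5th and 97.5th centiles.
    z_975 = std_normal.inv_cdf(0.975)
    return intercept - z_975 * slope, intercept + z_975 * slope


# Simulated data (purely illustrative): 90% "healthy" results, 10% "pathological".
random.seed(1)
simulated = ([random.gauss(100, 10) for _ in range(9000)]
             + [random.gauss(150, 20) for _ in range(1000)])
lower, upper = hoffmann_sketch(simulated)
print(f"estimated reference limits: {lower:.1f} to {upper:.1f}")
```

Regressing the analyte value on the z-score means the intercept estimates the centre of the healthy distribution and the slope its standard deviation, so the limits follow directly as intercept ± 1.96 × slope. A fixed window is cruder than piecewise regression and will be biased when the pathological fraction is large.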

When calculating continuous reference intervals, additional steps are needed. These involve dividing the dataset into overlapping time frames, excluding pathological values using statistical methods, and calculating the 2.5th, 50th and 97.5th percentiles of the remaining values of each parameter using statistical and graphical software such as R and RStudio. A special consideration in IRI studies is the verification of the various indirect approaches. To verify newly established IRIs, researchers may compare the results directly with previously published articles to assess agreement, or may perform an in-house verification using the standard procedure described in CLSI EP28-A3c, which sets out three approaches for verifying RIs: subjective assessment, testing a small number (n = 20) of reference individuals, or testing a larger number of reference individuals (n = 60 but fewer than 120). In the second and third approaches, if no more than 2 of the 20 samples (i.e., 10% of the test results) fall outside the RI, the RI may be accepted for use, at least provisionally. However, if 3 or 4 of the 20 samples fall outside the RI, a second set of 20 reference specimens should be obtained. If three or more of the new specimens (i.e., more than 10% of the test results) again fall outside the RI, or if 5 or more of the original 20 fall outside it, the user should re-examine the analytical procedures used and consider possible differences in the biological characteristics of the two populations sampled.
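The acceptance rule for the 20-sample verification can be expressed as a small decision function. This is a sketch of the rule as summarised in this section, not an implementation of the CLSI guideline itself: the function names and return strings are hypothetical, and treating a result exactly on a limit as inside the RI is an assumption.

```python
def count_outside(results, lower, upper):
    """Count results outside [lower, upper]; values on a limit count as inside (assumption)."""
    return sum(1 for r in results if r < lower or r > upper)


def verify_reference_interval(first_set, lower, upper, second_set=None):
    """Apply the 20-sample verification rule described in the text.

    Returns one of three hypothetical status strings:
    "accept", "collect second set" or "re-examine".
    """
    if len(first_set) != 20 or (second_set is not None and len(second_set) != 20):
        raise ValueError("the rule as described assumes sets of exactly 20 results")

    outside_first = count_outside(first_set, lower, upper)
    if outside_first <= 2:      # no more than 2 of 20 (10%) outside the RI
        return "accept"
    if outside_first >= 5:      # 5 or more of the original 20 outside the RI
        return "re-examine"

    # 3 or 4 of the original 20 fell outside: a second set of 20 is required.
    if second_set is None:
        return "collect second set"
    if count_outside(second_set, lower, upper) >= 3:
        return "re-examine"
    return "accept"


# Illustrative call with made-up values: 18 results inside 3.5-5.0 and 2 outside.
example = [4.0] * 18 + [3.0, 5.6]
print(verify_reference_interval(example, 3.5, 5.0))
```

Keeping the second set as an optional argument mirrors the two-stage nature of the rule: the function can be called once with the first 20 results and again only if a second collection turns out to be needed.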

At the time of writing, many algorithms for deriving IRIs have been published. These include the Hoffmann and modified Hoffmann methods, the Bhattacharya method, the Arzideh method, and the Wosniok method [41]. Simulation studies are highly recommended for comparing the diagnostic efficiency of the various indirect methods, as they allow appropriate statistical confidence analysis [42].

3 Summary of Published Studies on RI Establishment Using Indirect Method in the Paediatric Population

Tables 1, 2 and 3 summarise the studies that employed the indirect method to establish paediatric reference intervals. Three databases (Scopus, EBSCO Medline and WOS) were searched for publications from 2012 through July 2022 using the terms ‘data mining’, ‘data analytics’, ‘big data’, ‘calculating’, ‘constructing’, ‘developing’, ‘establishing’, ‘reference interval’, ‘normal range’, ‘reference limit’, ‘reference curves’, ‘paediatrics’, ‘child’, ‘adolescent’, ‘newborn’, and ‘neonate’.

Table 1 presents a detailed summary of published papers reporting the establishment of PRIs for biochemical assays by indirect methods. The selected studies are compared on several criteria: the year, the country in which the study was conducted, the analytes included, the methods used, whether discrete or continuous PRIs were produced, and the type of partitioning applied. Ten studies published from 2012 through 2022 were included.

In 2012, Eduardo et al. from Argentina established discrete age-specific IRIs for thyroid hormones using five years of laboratory results from 7581 children [43, 44]. This study was metadata-driven, as rigorous exclusion criteria were applied to the data before the final analysis. It established higher TSH and T4 values than a previous direct RI study done in Germany [44], highlighting the importance of population-specific RIs. In the same year, a group of researchers from Israel established discrete IRIs partitioned by age for TSH and free T3 using results from over 11,000 children and adolescents; they found that the RIs then in use were too low and suggested the transference of their results to other laboratories [45]. Neither study partitioned by gender. Another study, from the UK and published in 2013, established age- and gender-specific IRIs for serum prolactin to aid in diagnosing neurometabolic conditions affecting dopamine metabolism [46]. This study extracted over ten years of data from 2369 hospital patients. The established IRIs were comparable with previously published IRIs [47] and filled a knowledge gap by providing a prolactin RI for infants under one year. In the same year, a group of researchers in America established discrete age-specific IRIs for calcium using 4629 datasets. This metadata-driven study found that the calcium IRIs were broader than those currently in use and suggested that the differences may reflect seasonal or ethnic heterogeneity [48].

In 2014, a Canadian group published a paper examining the validity of establishing PRIs from hospital patient data by comparing age- and gender-specific discrete PRIs for 13 biochemical analytes, established using the indirect method (modified Hoffmann), with results obtained in the CALIPER study [40]. This statistically-driven study analysed over 200,000 data points per analyte and found that the indirect PRIs were generally wider than those of the CALIPER study. Another single-centre, metadata-driven study from Turkey, published in 2015, analysed 1709 data points and developed gestational age-specific continuous IRIs for TSH and free T4. The authors found that free T4 correlated with gestational age, whilst TSH remained unchanged irrespective of gestational age [49]. In the same year, a team of researchers from Denmark published the results of a multicentre, statistically-driven study that analysed the creatinine results of over 11,000 data sets. The continuous age- and gender-specific IRIs showed age dependency in both boys and girls from birth to adulthood [50].

A large multicentre, statistically-driven study in the Netherlands published in 2019 [51] analysed 7,574,327 results from children visiting their general practitioners and established discrete age- and gender-specific IRIs for 18 biochemical analytes, with the aim of adopting them as standardised national RIs. The authors found significant age effects for liver enzymes and creatinine. A single-centre, statistically-driven study from Pakistan was published in 2021. The group analysed 96,104 data points, established discrete IRIs for creatinine, and found that serum creatinine dynamics differ across gender and age groups. Their creatinine IRIs were lower than those of CALIPER. This is thought to be due to differences in genetic background and, again, highlights the importance of developing population-specific RIs. Another large statistically-driven multicentre study in Germany established high-resolution age- and gender-specific continuous IRIs for 15 biochemical analytes from an analysis of 217,883–982,548 samples per analyte; these showed high concordance with the continuous RIs of large direct studies (CALIPER and HAPPi Kids) [52].

Table 1. Published papers reporting the establishment of PRIs of biochemical assays by indirect methods

Table 2 presents a detailed summary of published papers reporting the establishment of PRIs for haematological and coagulation assays by indirect methods. Six studies are included. The first study was done in Romania and published in 2013. This group of researchers conducted a single-centre, metadata-driven study of 845 patient data sets to establish discrete IRIs for erythrocyte parameters specific to one-day-old neonates [34]. They found that the results were comparable to previously published direct RIs [53]. In the following year, the same team published discrete IRIs for platelet parameters in neonates on the first day of life, using 1124 patient datasets and partitioning the results by gender [54]. The values obtained for some parameters agreed with the literature, while others differed [55]. This supports the need for establishing population-specific haematology reference intervals.

In 2013, Zierk et al. published the results of a statistically-driven German single-centre study of age-specific continuous IRIs based on an analysis of 56,253–60,394 data points for various haematology indices [35]. The results were comparable to the previously published KiGGS study and captured biological events. In 2018, Weidhofer et al. from Austria established continuous age- and gender-specific IRIs for coagulation parameters [36]. This study extracted data from two centres and analysed 19,684–55,101 data sets. The resulting IRIs highlighted the age-dependent dynamics of the coagulation parameters, and some of the parameters showed concordance with previous literature [56]. In 2019, Zierk and his team published the results of a large German metadata-driven multicentre study that analysed 9,576,910 samples from 358,292 patients and established continuous percentile charts of various haematology parameters partitioned by age and gender [37]. They observed complex age- and sex-related dynamics in haematology analytes during all periods of childhood and adolescence. Compared with their 2013 work, the newer IRIs were narrower and showed high concordance with the KiGGS study. Another group of researchers from Germany published a further article in 2022 [21]. This metadata-driven study was done in Berlin and Brandenburg to establish discrete IRIs for various haematology parameters. A total of 27,554 patient datasets were analysed, and age- and sex-specific IRIs were established. The IRIs from this study differed from previously published results, which might be explained by a different population distribution due to high foreign influx [57, 58]. This further reiterates the need to establish population-specific reference intervals.

Table 2. Published papers reporting the establishment of PRIs of haematological assays by indirect methods

Table 3 summarises three published papers reporting the establishment of PRIs for various biochemical, haematological, coagulation and other multi-discipline assays by indirect methods. The first study was done in Germany by Zierk and his team [22]. This single-centre, statistically-driven study established age- and sex-dependent continuous reference intervals for 13 biochemical analytes and haematological parameters. In their research, electrolytes and total protein showed age-specific but not sex-specific changes. One of the analytes studied, alkaline phosphatase, showed complex dynamic patterns, and the IRIs for most analytes were comparable to those of the CALIPER and KiGGS studies.

A team from Korea presented the results of their large multicentre study in 2021. This metadata-driven study established discrete age- and gender-specific IRIs for haematology, biochemical and coagulation parameters. The PRIs determined in this study differed from existing results and from PRIs for other ethnicities. Subsequently, a team of researchers from America published age- and gender-specific discrete reference intervals for 266 individual analytes across multiple clinical disciplines [59]. A total of 71,594,330 patient test results from 13 laboratories were analysed in this statistically-driven study, and the team established IRIs with large sample sizes for each age bracket.

Table 3. Published papers reporting the establishment of PRIs of biochemical, haematological and coagulation assays by indirect methods

4 Discussion

This narrative review provides a historical account of data mining in paediatric IRI determination and an assessment of articles published over the past ten years that have used data mining to establish paediatric indirect reference intervals. The indirect method has many advantages over the direct method. Indirect methods harness the power of big data, which increases statistical power; they are representative of the true population; and they allow complex statistical analyses to be applied easily to thousands or even millions of de-identified data points pulled from the laboratory database of one or many centres, enabling fast outlier removal, transformation and partitioning to establish robust IRIs. Further statistical analysis allows the creation of continuous or dynamic percentile charts that better represent the fluid physiological changes seen in children. The indirect method also permits analysis of retrospective data for difficult-to-obtain samples such as body fluids, CSF, and amniotic fluid, as the steps involved are identical to those for serum, plasma, or whole blood samples.

In contrast, the direct method is tedious because it involves recruiting healthy reference individuals, who are hard to come by, especially in paediatric populations. The limited sample size reduces the statistical power, and applicability to a larger population is debatable. The typically small number of results hinders the ability to partition the data; hence, only discrete RIs can be established for certain arbitrarily set age brackets. Samples from reference individuals need to be collected, processed, and stored for batch analysis, which may take longer. This cycle may also introduce bias in the results. Ethical issues are among the more challenging hurdles, as researchers need to obtain informed consent from the parents of the paediatric reference individuals for data collection and venepuncture. The direct method is also more expensive, as it involves the cost of reimbursing reference individuals, the labour of testing, and the cost of reagents and consumables.

The number of publications on the indirect method has increased significantly, especially over the past five years. This signifies an interest in IRI establishment as an alternative to the laborious direct method. The growing interest among laboratorians in conducting and reporting indirect reference interval studies is most likely driven by the advantages of the indirect method discussed above, coupled with the advanced databases available in laboratories, the readily available volume of stored patient data, and easy access to statistical analysis tools developed by earlier groups of researchers. Many of the earlier publications reviewed did not include a thorough description of the data mining and statistical methods used in IRI establishment. However, subsequent publications have included detailed step-by-step descriptions of what IRI establishment entails.

Jones et al. [10] have proposed a checklist of the minimum requirements for publication of IRI studies which include details of study design, a description of the population and the data source, a description of available records of preanalytical and analytical processes, the data set selection and filtering criteria, the description of the data set inclusive of number of samples, median, kurtosis and initial analysis of partitioning. The description of the statistical process inclusive of outlier detection, method and transformation, results of statistical analysis, comparison with other statistically reliable peer-reviewed published studies and final recommendations and discussion of the study would also need to be included. This would allow future researchers to understand the overall steps involved and critique the study to find any weaknesses, strengths, and opportunities for improvements before conducting their own population-specific indirect reference studies.

4.1 Comparison of Indirect PRIs Between Countries

Most of the reviewed studies were done in European populations. Only two studies were done in Asian populations (Pakistan and Korea). Data mining and indirect sampling have allowed multiple laboratories in a country or a region to conduct IRI studies using the same methodology and analytical platforms to establish common reference intervals. However, caution needs to be exercised. This is exemplified by two studies in different regions of Germany that showed variability in their IRI results. The difference might be due to differences in the population, as one region is known to have a high foreign influx [21, 37]. Many other studies done in Europe have reported good agreement with previously reported RIs established by both the direct and indirect methods [35, 37]. However, variability is still seen, especially wider intervals for certain analytes [40, 48]. The two studies done on the Asian continent [30, 60] also produced IRIs that differed from those currently used in their populations and from previously published PRIs from other countries. This serves as a reminder to laboratories in other countries, especially those with diverse, multi-ethnic populations, to be cautious in transferring IRI results from different countries, and it further highlights the necessity of establishing their own population-specific reference intervals, preferably partitioned according to age, gender, and ethnicity.

4.2 Discrete vs. Continuous IRIs

Centile charts are familiar to most health care providers and parents, as they are used to assess children’s developmental growth. The application of centile charts to biochemical and haematological paediatric RIs has led to the transformation of discrete RIs into dynamic continuous percentile charts [61, 62]. Percentile RI charts remove the arbitrarily set age group partitions that may confuse the interpretation of results, especially in children whose age falls near the boundary between age brackets [63]. The intuitive percentile charts allow the physiological patterns and dynamics of paediatric analytes to be represented visually. Seven articles described in this review developed continuous reference intervals for various biochemical, haematological and coagulation assays [22, 35,36,37, 49, 50, 52]. The majority were multicentre studies and were statistically driven. Most of the studies applied the Arzideh method [64, 65, 22, 35,36,37, 52] and the kosmic software [52] in the calculation of continuous RIs. Even though there is a move towards developing continuous percentile charts, one major hurdle remains. Currently, many laboratory information systems are unable to incorporate advanced mathematical functions or graphical representations of patient results [3]. Hopefully, this obstacle will soon be overcome, and continuous reference percentiles can be integrated fully into clinical practice.

5 Conclusion

This paper observed that data mining techniques have been employed successfully in establishing PIRIs. Data cleansing must be done thoroughly to ensure the veracity of the established PRIs. There is still a paucity of data on PRIs for different ethnicities. Many of the published PRIs were based on Caucasian populations and might not be suitable for transference to medical diagnostic laboratories elsewhere. Therefore, many authors have highlighted the importance of establishing age-, sex- and ethnicity-specific PRIs for the local population. Many researchers are moving towards the establishment of dynamic continuous PRIs using a few recently published algorithms and programs; these help to explain the dynamic physiological changes in paediatric biochemistry and complement age-specific RIs in the tracking, interpretation and application of results in clinical patient management.