1 Introduction

Lactose is a disaccharide found in all mammalian milks and is very important for the nutrition of newborns and infants. To be digested, lactose has to be hydrolyzed by the enzyme lactase (lactase-phlorizin hydrolase, LPH) into the simple sugars glucose and galactose [1, 2]. Lactase is a transmembrane glycoprotein of the brush border membrane of small intestinal enterocytes [3, 4]. In humans it is encoded by the LCT gene, located on the long (q) arm of chromosome 2 at position 21.3 [5]. The gene is 49.3 kb long, consists of 17 exons, and is transcribed into a 6 kb transcript [6].

Lactase activity peaks in the first few months of life and decreases after the age of two years [7]. Deficient or absent lactase activity in the small intestine leaves the organism unable to digest lactose from milk and other dairy products, a condition called lactose intolerance. Besides congenital lactase deficiency, a very rare condition inherited in an autosomal recessive manner [8] and marked by total lactose intolerance from infancy, there are three other types of lactose intolerance: primary, secondary, and developmental lactase deficiency [9,10,11,12]. Developmental lactase deficiency, with reduced lactase activity, is found in infants born before 34 weeks of gestation [10]. Gray was among the first scientists to describe secondary lactase deficiency [13], which can occur as a consequence of small intestinal injury caused by many different factors, such as infections, surgery, chemotherapy, celiac disease, gastroenteritis, and prolonged use of antibiotics [11]. The most common type, primary lactase deficiency, appears in adulthood and is in most cases characterized by low lactase activity (hypolactasia) [12].

The prevalence of primary lactase deficiency in adults differs worldwide, ranging from less than 5% to almost 100% of the population [14]. A study that included data from 89 countries (approximately 84% of the world’s population) found lactose intolerance in 19–37% of the western, southern, and northern European population and in 57–83% of the Middle Eastern population [15].

Primary adult lactose intolerance is related to the absence of lactase persistence alleles, which produces the “lactase non-persistence” phenotype [16]. In contrast, some individuals maintain neonatal levels of lactase activity throughout adulthood because they carry lactase persistence alleles, which produce the “lactase persistence” phenotype. Except for some allelic differences (silent mutations), lactase persistent and lactase non-persistent individuals have identical coding sequences [6]. The lactase persistence/non-persistence phenotypes are associated with a few single nucleotide polymorphisms whose frequencies vary across world regions and ethnic groups [16]. The most investigated polymorphism associated with lactase persistence is LCT 13910*C/T (rs4988235), which is in almost full concordance with the LCT 22018*G/A (rs182549) polymorphism [17]. Both the LCT 13910 CC and LCT 22018 GG genotypes are strong predictors of lactase non-persistence [18].

In lactose intolerant individuals, undigested lactose passes from the small intestine to the colon, where it serves as a bacterial substrate. Depending on the amount of lactose ingested, people with lactose intolerance can experience discomfort and pain from various gastrointestinal symptoms shortly after consuming milk and dairy products. The most common symptoms of lactose intolerance are diarrhea, bloating, flatulence, nausea, gut distension, and abdominal pain [19].

Lactose intolerance can be distinguished from other disorders by diagnostic tests such as the lactose tolerance test, the hydrogen breath test, the stool acidity test (in children), or genetic testing. The first three tests require ingestion of a certain amount of lactose, which can cause discomfort and may be painful for the patient. Genetic testing for lactase polymorphisms associated with lactose intolerance is not widely available [20].

Machine learning is one of the most rapidly developing subfields of artificial intelligence research. It enables highly proficient intelligent data analysis. The inexpensive and relatively easy methods for collecting and storing data developed within the last two decades have also made machine learning procedures easier and more consistent. Machine learning has been used within the medical field since its beginnings [21]. Many hospitals and clinics worldwide monitor and collect data that can later be used for machine learning purposes. The machine learning methodology is most convenient for very specific diagnostic problems [22].

Approximating the explanation of a process can be considered the essence of machine learning. Such approximations generally do not and cannot explain the whole process, so where a complete explanation is required, other algorithms may be more convenient. Machine learning assumes that the patterns observed in the existing dataset will not change in future datasets regarding the same problem. In medicine, machine learning programs used for predicting medical diagnoses are mostly based on concrete biological and physical parameters [23, 24]. However, sometimes, as is the case here, a questionnaire oriented to the target condition and its symptoms can be used to create the machine learning system [25].

The fundamental basis of machine learning is the optimization of prediction performance using previously collected data or previously gained experience. Machine learning models can be classified into two groups: predictive and descriptive. A predictive model makes future estimates based on the collected data, while a descriptive model obtains knowledge from the data. Sometimes both can be combined into a single model [22].

Artificial neural networks (ANNs) are trained so that the optimal weight and bias values are acquired to obtain the desired mapping or clustering of data. In this manner, ANNs can find relationships within and between datasets without the exact mathematical principle behind them being defined. The connections between the neurons of a neural network define its architecture. There are two types of architecture: feedforward and feedback [26,27,28].

This paper presents the development and feasibility of an ANN for lactose intolerance prediction. Such a diagnostic tool can assist specialists in clinical practice by making the diagnostic process significantly faster and avoiding unnecessary lactose tolerance and genetic testing.

2 Materials and Methods

2.1 Dataset

The dataset used in the development of this neural network was based on symptoms reported in a lactose intolerance questionnaire and on the obtained LCT 13910 C/T and 22018 G/A genotypes. The study included 100 unrelated participants from Bosnia and Herzegovina. Genetic analysis was performed using the PCR-RFLP methodology proposed by Bulchoes et al. [29]. The restriction digestion products were analyzed by agarose gel electrophoresis, and the LCT 13910 and 22018 genotypes were determined from the sizes of the digestion products.

A specific questionnaire was designed to investigate the occurrence and severity of the main lactose intolerance symptoms and to analyze the symptoms with respect to the obtained LCT 13910 and 22018 genotypes and self-reported lactose tolerance.

The questions that showed the strongest correlation with the genotypes were:

  1. Do you have close family members who experience health problems after consuming milk or dairy products?
  2. Do you feel discomfort after consuming milk or dairy products?
  3. Do you feel nausea after consuming milk or dairy products?
  4. Do you feel flatulence after consuming milk or dairy products?
  5. Do you feel pain in your stomach after consuming milk or dairy products?
  6. Do you have diarrhea after consuming milk or dairy products?

The above questions, with answers given as symptom intensity, were the only parameters used as inputs to the ANN. The inputs are shown in Fig. 1, where Q1–Q6 denote the six questions in order.

Fig. 1 Architecture of ANN for lactose intolerance prediction

The dataset consisted of 100 samples whose distribution is presented in Table 1.

Table 1 Lactose tolerance dataset distribution

2.2 Development of Artificial Neural Network

A feedforward neural network architecture was chosen because it is well suited to classification problems.

The data were divided in a 90/10 ratio for training and testing of the artificial neural network, a ratio chosen after various trials. To prevent overfitting, and because of its usefulness in pattern recognition, the Bayesian regularization training algorithm was used. For each training iteration, the train/test performance was calculated as the mean squared error (MSE) between the actual and predicted values.
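The 90/10 division and the MSE performance measure described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names `split_dataset` and `mean_squared_error`, the fixed random seed, and the example values are assumptions made for illustration only.

```python
import random


def split_dataset(samples, train_fraction=0.9, seed=0):
    """Shuffle a copy of the samples and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mean_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# 100 sample indices stand in for the 100 participants.
train, test = split_dataset(range(100))
print(len(train), len(test))

# Made-up targets and network outputs, only to show the MSE calculation.
mse = mean_squared_error([0, 1, 1, 0], [0.1, 0.8, 0.6, 0.0])
print(mse)
```

With 100 samples, the split yields 90 training and 10 test samples, matching the ratio stated in the text.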

The most widely used training functions were tested in order to choose the appropriate architecture for further development. As can be seen from Table 2, the best performance was observed with the Bayesian regularization training algorithm (Trainbr) and 20 neurons in the hidden layer. Bayesian regularization is commonly used with datasets consisting of a small number of samples, and it was therefore expected to be the most suitable algorithm for this particular dataset [30].

Table 2 ANN performance evaluation with different combinations of training algorithms and neuron numbers

After the most suitable training algorithm was determined, the network was further tested with different combinations of transfer functions (Table 3). As can be inferred from Table 3, the best performance was achieved with 20 neurons and the Tansig transfer function in the hidden layer and the Logsig transfer function in the output layer, which are the defaults in Bayesian regularization.

Table 3 ANN performance evaluation with different transfer function combinations and different neuron numbers

3 Results and Discussion

The evaluation suggests that the most suitable architecture of the artificial neural network for lactose intolerance prediction is the one with the Bayesian regularization training algorithm, default transfer functions, and 20 neurons in the hidden layer.

The final model has 6 neurons in the input layer, one for each of the 6 questions from the questionnaire. The hidden layer has 20 neurons with the Tansig transfer function, and the output layer uses the Logsig transfer function. The output layer has only one neuron, whose final output is either 0 (lactose tolerant) or 1 (lactose intolerant).
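The 6-20-1 architecture described above can be sketched as a single forward pass in plain Python. This is an illustrative sketch only: the weights are random placeholders, not the trained values; the answer vector is a hypothetical set of intensity scores; and the 0.5 threshold used to turn the Logsig output into a 0/1 class is an assumption, since the text does not state how the output is binarized.

```python
import math
import random


def tansig(x):
    """Hyperbolic tangent sigmoid; output lies in (-1, 1)."""
    return math.tanh(x)


def logsig(x):
    """Logistic sigmoid; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(answers, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: 6 inputs -> 20 tansig neurons -> 1 logsig neuron."""
    hidden = [
        tansig(sum(w * a for w, a in zip(row, answers)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return logsig(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def classify(score, threshold=0.5):
    """Return 1 (lactose intolerant) or 0 (lactose tolerant); threshold is assumed."""
    return 1 if score >= threshold else 0


N_INPUTS, N_HIDDEN = 6, 20
rng = random.Random(0)
# Random placeholder parameters; a trained network would supply real values.
w_hidden = [[rng.uniform(-1, 1) for _ in range(N_INPUTS)] for _ in range(N_HIDDEN)]
b_hidden = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
w_out = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
b_out = rng.uniform(-1, 1)

answers = [1, 2, 0, 3, 1, 0]  # hypothetical intensity scores for Q1-Q6
score = forward(answers, w_hidden, b_hidden, w_out, b_out)
print(score, classify(score))
```

Because the weights are placeholders, the printed class carries no diagnostic meaning; the sketch only shows how the six answers flow through the layers.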

Subsequent validation was performed using 10 samples from the initial dataset, that is, 10% of the overall dataset. The evaluation of ANN performance through specificity, sensitivity, and accuracy is displayed in Table 4. Specificity is calculated as the number of correctly classified samples of the lactose tolerant group divided by the total number of lactose tolerant samples. Sensitivity is calculated as the number of correctly classified samples of the lactose intolerant group divided by the total number of lactose intolerant samples. Accuracy is the number of correctly classified samples divided by the total number of samples. All three parameters were 100% for the subsequent validation dataset, indicating that the neural network correctly differentiated lactose tolerance from lactose intolerance in these samples.
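The three definitions above translate directly into code. The following Python sketch assumes the label convention used in this paper (1 for lactose intolerant, 0 for lactose tolerant); the function names and the example labels are hypothetical and do not reproduce the data in Table 4.

```python
def confusion_counts(actual, predicted):
    """Count (tp, tn, fp, fn), treating 1 (lactose intolerant) as positive."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def evaluate(actual, predicted):
    """Return (sensitivity, specificity, accuracy) as fractions in [0, 1]."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the actual labels")
    sensitivity = tp / (tp + fn)  # correctly classified intolerant / all intolerant
    specificity = tn / (tn + fp)  # correctly classified tolerant / all tolerant
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Made-up labels for 10 samples, with one error in each class.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sensitivity, specificity, accuracy = evaluate(actual, predicted)
print(sensitivity, specificity, accuracy)
```

With a validation set of only 10 samples, each misclassification changes the accuracy by 10 percentage points, which is worth keeping in mind when reading the reported values.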

Table 4 Confusion matrix of subsequent validation dataset

The k-fold cross validation method was implemented as an additional step to test the performance of the ANN more thoroughly, according to the code presented in the appendix. The dataset was subdivided into 10 folds for training and testing, and the resulting accuracy varied slightly between runs. Averaged over the trials of k-fold cross validation, the accuracy was 92.2%, which is expected given the overall sample size. The most frequently obtained confusion matrix after cross validation is presented in Table 5.
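The fold construction behind 10-fold cross validation can be sketched in Python as follows. This is not the appendix code; `k_fold_indices`, `cross_validate`, and the `evaluate_fold` callback are hypothetical names, and the dummy evaluator stands in for training and testing the network on each fold.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, evaluate_fold, k=10, seed=0):
    """Average the score returned by evaluate_fold(train_idx, test_idx) over k folds."""
    scores = [evaluate_fold(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(100, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))


# Dummy evaluator returning the test-fold share of the data; a real one would
# train the network on train_idx and return its accuracy on test_idx.
def dummy_evaluator(train_idx, test_idx):
    return len(test_idx) / (len(train_idx) + len(test_idx))


mean_score = cross_validate(100, dummy_evaluator)
print(mean_score)
```

Each sample appears in exactly one test fold, so every participant contributes to the accuracy estimate once per run; changing the seed changes the fold assignment, which is one reason accuracy can vary slightly between runs.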

Table 5 Confusion matrix for most of the k-folds of cross validation

4 Conclusion

An artificial neural network for lactose intolerance prediction was presented in this paper. Training used 90 of the 100 samples in the dataset, and the remaining 10 were used for subsequent validation. The ANN demonstrated very high specificity and sensitivity, which indicates that successful and reliable ANNs based on lactase non-persistence symptoms can be created.

Diagnosis of lactose intolerance is not a straightforward procedure; it usually involves analysis of initial symptoms and medical history combined with the results of related biochemical and genetic testing. This ANN is an automatic diagnostic tool based solely on self-reported symptoms related to the digestion of lactose. The final result is a tool that, if clinically optimized, is able to predict lactose intolerance without any laboratory testing.

Future work will include gathering more samples and performing LCT SNP genotyping on them, which will broaden the scope of the training parameters and improve the efficiency of the network when unexpected symptoms are reported. This work has the potential to be used in healthcare and to provide medical professionals with a time- and cost-effective procedure for diagnosing lactose intolerance.