
1 Introduction

Cervical cancer is a pathology that develops in the lining of the cervix, the lower part of the uterus near the vaginal canal, and affects the female reproductive organs [1]. According to a survey conducted by the WHO [2], cervical cancer is the fourth most frequent cancer in women, with an estimated 570,000 new cases in 2018, representing 6.6% of all female cancers. Approximately 90% of deaths from cervical cancer occurred in low- and middle-income countries. The proliferation of cancer cells results from untreated lesions, most of which are caused by the transmissible human papillomavirus (HPV) [1].

The development of cervical cancer is usually slow and preceded by abnormalities in the cervix. However, the absence of early-stage symptoms can lead to neglect of prevention. Additionally, the lack of effective screening programs aimed at detecting and treating precancerous conditions is a key reason for the much higher cervical cancer incidence in developing countries.

Several studies have sought to detect this type of cancer using different machine learning classifiers, such as Decision Trees [3], Support Vector Machines (SVM) [4], Gaussian Naive Bayes [5] and Artificial Neural Networks [6]. The Recurrent Neural Network (RNN) is one of the underlying network architectures used to develop other deep learning architectures. However, RNNs can suffer from two problems: vanishing gradients and exploding gradients. The gradients carry the information used to update the RNN parameters; when the gradient becomes too small, the parameter updates become insignificant, which means no real learning is done. The LSTM is a special type of recurrent network created by [7] to solve the above problems of RNNs. Its popularity has grown in recent years as an RNN architecture for various applications, such as text compression in natural language [8, 9], text recognition [10, 11], speech recognition [12,13,14] and gesture recognition [15,16,17].

The traditional method for training recurrent networks is Backpropagation (BP). BP relies primarily on gradient descent to minimize the calculated squared error. However, traditional training algorithms have drawbacks such as slow convergence, a tendency to become trapped in local minima, and long training times [18].

This work designs an adaptation of the LSTM network in which metaheuristic algorithms search for optimal values of its weights and biases. In addition, the study can be applied in the oncological area to support clinical decisions for patients with suspected cervical cancer.

The rest of the article is organized as follows: after this introduction, we explain the essential RNN and LSTM fundamentals. In Sect. 3, the metaheuristic optimizers are briefly explained. Section 4 presents details about the data collection process. Section 5 provides the results of the comparison, followed by final remarks and future work in Sect. 6.

2 Neural Networks

2.1 RNN

Recurrent Neural Networks have the ability to receive signals from both the input layer and the hidden layer at the previous time iteration [19]. In some ways, the hidden layer simulates the operation of a memory. In addition, Recurrent Neural Networks work with sequential data on both the input layer and the output layer, fitting perfectly in the context of time series.

Fig. 1. Neuron of the hidden layer of the Recurrent Neural Network [Olah, 2014].

The author of [20] explains that the hidden state of the Recurrent Neural Network receives information from the independent variables, as well as from its own processing result from the previous iteration, as shown in Fig. 1. In this example, \(X_{t}\) represents the input data at time iteration t, which is processed by the neuron A, and \(h_{t}\) represents the output of neuron A at time iteration t, which can be used by the same neuron in the next iterations. The practical effect of this is similar to a short-term memory and is very useful in the context of time-series prediction, given that the data of the input and output layers are sequential [21].
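To make the recurrence concrete, the following NumPy sketch shows the hidden-state update described above. It is an illustration only, not the network used in this paper; the dimensions, weight matrices and input sequence are hypothetical.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: the new hidden state mixes the current input x_t
    with the previous hidden state h_prev, acting as a short-term memory."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Hypothetical dimensions for illustration.
n_in, n_hidden, T = 3, 5, 10
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(n_hidden, n_in))
W_hh = rng.normal(size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

h_t = np.zeros(n_hidden)                       # initial hidden state
for t in range(T):                             # unroll over a sequence of length T
    x_t = rng.normal(size=n_in)                # stand-in for the input at step t
    h_t = rnn_step(x_t, h_t, W_xh, W_hh, b_h)  # h_t is fed back at the next step
```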

2.2 Long Short-Term Memory (LSTM)

Long Short-Term Memory networks are a recurrent, deep neural network model. It is a typical neuron-based architecture that introduces the concept of a memory cell. The memory cell can keep its value for a short or long time as a function of its inputs, which allows the cell to remember what is important and not only its last computed value; that is, it can capture long-range dependencies and nonlinear dynamics.

This type of network has been widely used and has achieved some of the best results when compared to other methods [23]. This is especially observed in the field of Natural Language Processing, and in handwriting recognition it is considered the state of the art [24]. The architecture of an LSTM cell is displayed in Fig. 2.

Each LSTM block consists of a forget gate, an input gate and an output gate. Figure 2 shows a basic LSTM cell with a step-wise explanation of the gates, and, on top, another illustration of the cell connected into a network. In the first step, the forget gate looks at \(h_{t-1}\) and \(x_t\) to compute the output \(f_t\) (Eq. 1), a number between 0 and 1. This is multiplied by the cell state \(C_{t-1}\) and causes the cell to either forget everything or keep the information. In the next step, the input gate computes the update for the cell by first multiplying the outputs \(i_t\) and \(\tilde{C}_t\) (Eqs. 2 and 3) and then adding this product to \(C_{t-1}*f_t\), which was computed in the previous step (Eq. 4). Finally, the output value is computed by multiplying \(o_t\) by the tanh of the result of the previous step, as shown in Eqs. 5 and 6.

$$\begin{aligned} f_t = \sigma (W_{f} \cdot [h_{t-1}, x_t] + b_f) \end{aligned}$$
(1)
$$\begin{aligned} i_t = \sigma (W_{i} \cdot [h_{t-1}, x_t] + b_i) \end{aligned}$$
(2)
$$\begin{aligned} {\tilde{C}_t} = tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \end{aligned}$$
(3)
$$\begin{aligned} C_t = f_t * C_{t-1} + i_t * {\tilde{C}_t} \end{aligned}$$
(4)
$$\begin{aligned} o_t = \sigma (W_{o} \cdot [h_{t-1}, x_t] + b_o) \end{aligned}$$
(5)
$$\begin{aligned} h_t = o_t * tanh(C_t) \end{aligned}$$
(6)
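The gate computations of Eqs. 1-6 can be summarized in a short sketch. The following NumPy function is a minimal, illustrative implementation of one LSTM step (not the Keras implementation used later); each weight matrix acts on the concatenation \([h_{t-1}, x_t]\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step implementing Eqs. 1-6."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # Eq. 1: forget gate
    i_t = sigmoid(W_i @ z + b_i)          # Eq. 2: input gate
    C_tilde = np.tanh(W_C @ z + b_C)      # Eq. 3: candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # Eq. 4: updated cell state
    o_t = sigmoid(W_o @ z + b_o)          # Eq. 5: output gate
    h_t = o_t * np.tanh(C_t)              # Eq. 6: new hidden state
    return h_t, C_t
```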
Fig. 2. Basic architectures of LSTM cells.

3 Metaheuristic Optimizers

3.1 Cuckoo Search Algorithm - CSA

The Cuckoo Search Algorithm is an evolutionary metaheuristic developed by [25], based on the reproductive behavior of some cuckoo species. These species lay their eggs in the nests of other birds. Some of these eggs, which closely resemble the eggs of the host bird, have the opportunity to grow and become adult cuckoos. The technique has been widely applied to global optimization problems due to its simplicity and effectiveness, as well as its rapid convergence and ability to avoid local minima [25, 26]. CS can be described by the following three idealized rules: (1) each cuckoo lays one egg at a time and dumps it in a randomly chosen nest; (2) the best nests, with high-quality eggs, are carried over to the next generations; (3) the number of available host nests is fixed, and if a host bird identifies the cuckoo egg, which happens with probability \(p_a \in [0,1]\), it can either throw the egg away or abandon the nest and build a new one. A simplified sketch of this procedure is given below.
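The sketch below illustrates these rules for minimizing a fitness function over a real-valued vector (such as a flattened set of LSTM weights). It is deliberately simplified: the Lévy flight is approximated by a Gaussian step, and the values of \(p_a\), the step size and the bounds are assumptions, not the settings used in this paper.

```python
import numpy as np

def cuckoo_search(fitness, dim, n_nests=20, max_iter=100, pa=0.25,
                  step=0.1, lower=-1.0, upper=1.0):
    """Simplified Cuckoo Search (minimization); Levy flights replaced by
    Gaussian steps for brevity."""
    nests = np.random.uniform(lower, upper, (n_nests, dim))
    fit = np.array([fitness(n) for n in nests])

    for _ in range(max_iter):
        # Rule 1: a cuckoo generates a new egg by a random walk from a nest.
        i = np.random.randint(n_nests)
        new = np.clip(nests[i] + step * np.random.randn(dim), lower, upper)
        new_fit = fitness(new)
        # Rule 2: the new egg replaces a randomly chosen nest if it is better.
        j = np.random.randint(n_nests)
        if new_fit < fit[j]:
            nests[j], fit[j] = new, new_fit

        # Rule 3: a fraction pa of the worst nests is abandoned and rebuilt.
        n_abandon = max(1, int(pa * n_nests))
        worst = np.argsort(fit)[-n_abandon:]
        nests[worst] = np.random.uniform(lower, upper, (n_abandon, dim))
        fit[worst] = [fitness(n) for n in nests[worst]]

    best = fit.argmin()
    return nests[best], fit[best]
```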

3.2 Genetic Algorithm - GA

The Genetic Algorithm is one of the most well-established methods of computational intelligence [27]. It is inspired by the theory of evolution of species: the best solution for a given problem is found by combining candidate solutions that are improved at each iteration. At each iteration, or generation, a new population of candidate solutions (individuals) is created from the genetic information of the best individuals of the previous generation. The algorithm represents a candidate solution using a simple structure, called a chromosome, to which genetically inspired operators of selection, crossover and mutation are applied, simulating the evolutionary process of the solution. Each chromosome is made up of genes, each represented by bits.
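The sketch below shows one possible GA loop for the same kind of real-valued search space (a flattened weight vector), so the chromosome is real-coded rather than the bit string described above. The rates, binary tournament selection and uniform crossover are illustrative choices, not the paper's configuration.

```python
import numpy as np

def genetic_algorithm(fitness, dim, pop_size=20, generations=100,
                      crossover_rate=0.9, mutation_rate=0.02,
                      lower=-1.0, upper=1.0):
    """Minimal real-coded GA (minimization) with elitism."""
    pop = np.random.uniform(lower, upper, (pop_size, dim))

    def tournament(fit):
        i, j = np.random.randint(0, pop_size, 2)   # binary tournament selection
        return pop[i] if fit[i] < fit[j] else pop[j]

    for _ in range(generations):
        fit = np.array([fitness(ind) for ind in pop])
        children = [pop[fit.argmin()].copy()]       # elitism: keep the best individual
        while len(children) < pop_size:
            p1, p2 = tournament(fit), tournament(fit)
            if np.random.rand() < crossover_rate:   # uniform crossover
                mask = np.random.rand(dim) < 0.5
                child = np.where(mask, p1, p2)
            else:
                child = p1.copy()
            mutate = np.random.rand(dim) < mutation_rate   # random-reset mutation
            child[mutate] = np.random.uniform(lower, upper, mutate.sum())
            children.append(child)
        pop = np.array(children)

    fit = np.array([fitness(ind) for ind in pop])
    return pop[fit.argmin()], fit.min()
```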

3.3 Gravitational Search Algorithm - GSA

GSA belongs to the class of non-deterministic, population-based algorithms for solving optimization problems. As in other evolutionary approaches, it generates an initial (random) population and defines a strategy for updating the candidate population. In GSA this strategy is based on the laws of gravity and motion [28]. In this context, agents are treated as objects and their performance is measured by their masses. Every mass attracts and is attracted by the other masses due to the gravitational force, causing the overall movement of the whole set (population) considered in the problem. Thus, the masses interact through a direct form of communication: the gravitational force. To describe the GSA algorithm, consider a system with N agents (masses); the position of agent i is defined by:

$$\begin{aligned} X_i = (X_i^1, \cdots , X_i^d, \cdots , X_i^n), \qquad for \quad i = 1,2, \cdots , N \end{aligned}$$
(7)

where \(x_i^d\) represents the position of agent i in dimension d and n is the dimension of the search space. After evaluating the fitness of the current population, the mass of each agent is calculated as follows:

$$\begin{aligned} M_i(t) = \frac{m_i(t)}{\sum _{j=1}^N m_j(t)} \end{aligned}$$
(8)

where

$$\begin{aligned} m_i(t) = \frac{fit_i(t)- worst(t)}{best(t) - worst(t)} \end{aligned}$$
(9)

where \(fit_i(t)\) represents the fitness value of agent i at time t. best(t) and worst(t) are the best and worst fitness values among all agents, respectively, defined as follows:

$$\begin{aligned} best(t) = {min_{j \in {(1,\cdots N)}}} fit_j(t), \quad worst(t) = {max_{j \in {(1,\cdots N)}}} fit_j(t) \end{aligned}$$
(10)

To evaluate the acceleration of an agent, the total force applied to it by a set of heavier masses is considered, based on the law of gravity:

$$\begin{aligned} F_i^d(t) = \displaystyle {\sum _{j \in kbest, j \ne i } rand_j G(t)} \frac{M_j(t) \times M_i(t)}{R_{i,j}(t) + \epsilon } (x_j^d (t) - x_i^d (t)) \end{aligned}$$
(11)

where \(rand_{j}\) is a random number in the interval [0, 1], G(t) is the gravitational constant at time t, \(M_i\) and \(M_j\) are the masses of agents i and j, \(\epsilon \) is a small value and \(R_{i,j}(t)\) is the Euclidean distance between agents i and j. kbest is the set of the first K agents with the best fitness values and largest masses; it is a function of time, initialized to \(K_0\) at the beginning and decreased with time. Here \(K_0\) is set to N (the total number of agents) and is decreased linearly to 1. By the law of motion, the acceleration of agent i at time t in direction d, \(a_i^d(t)\), is given as follows:

$$\begin{aligned} a_i^d(t) = \frac{F_i^d(t)}{M_i^d(t)} = \displaystyle {\sum _{j \in kbest, j \ne i } rand_j G(t)} \frac{M_j(t)}{{||X_i(t),{X_{j}(t)||}_2} + \epsilon } (x_j^d (t) - x_i^d (t)) \end{aligned}$$
(12)

Finally, the searching strategy on this concept can be described by following equations:

$$\begin{aligned} v_i^d(t+1) = rand_i \times v_i^d(t) + a_i^d (t) \end{aligned}$$
(13)
$$\begin{aligned} x_i^d(t+1) = x_i^d(t) + v_i^d(t+1) \end{aligned}$$
(14)

where \(x_i^d\), \(v_i^d\) and \(a_i^d\) represent the position, velocity and acceleration of the ith agent in the dth dimension, respectively. \(rand_i\) is a uniform random variable in the interval [0, 1], applied to give a randomized characteristic to the search. It must be pointed out that the gravitational constant G(t) is important in determining the performance of GSA and is defined as a function of time t:

$$\begin{aligned} G(t) = G_0 \times exp \bigg (- \beta \times \frac{t}{t_{max}}\bigg ) \end{aligned}$$
(15)
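A compact sketch of the update loop defined by Eqs. 7-15 is shown below, for minimizing a fitness function over real-valued vectors. The control parameters \(G_0\) and \(\beta \) and the search bounds are illustrative assumptions, and kbest shrinks linearly from N to 1 as described above.

```python
import numpy as np

def gsa(fitness, dim, n_agents=20, max_iter=100, G0=100.0, beta=20.0,
        lower=-1.0, upper=1.0):
    """Minimal Gravitational Search Algorithm sketch (minimization)."""
    X = np.random.uniform(lower, upper, (n_agents, dim))   # Eq. 7: positions
    V = np.zeros((n_agents, dim))
    best_pos, best_fit = None, np.inf
    eps = 1e-12

    for t in range(max_iter):
        fit = np.array([fitness(x) for x in X])
        best, worst = fit.min(), fit.max()                  # Eq. 10
        if best < best_fit:
            best_fit, best_pos = best, X[fit.argmin()].copy()

        m = (fit - worst) / (best - worst + eps)            # Eq. 9
        M = m / (m.sum() + eps)                             # Eq. 8
        G = G0 * np.exp(-beta * t / max_iter)               # Eq. 15

        # kbest shrinks linearly from N agents to 1.
        k = int(round(n_agents - (n_agents - 1) * t / max_iter))
        kbest = np.argsort(fit)[:k]

        A = np.zeros((n_agents, dim))                       # Eqs. 11-12
        for i in range(n_agents):
            for j in kbest:
                if j != i:
                    R = np.linalg.norm(X[i] - X[j])
                    A[i] += np.random.rand() * G * M[j] * (X[j] - X[i]) / (R + eps)

        V = np.random.rand(n_agents, 1) * V + A             # Eq. 13
        X = np.clip(X + V, lower, upper)                    # Eq. 14

    return best_pos, best_fit
```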

3.4 Gray Wolf Algorithm - GWO

The Gray Wolf Optimizer (GWO) is a computational optimization technique created by [29], based on the hunting behavior of grey wolves (Canis lupus). This species usually lives in packs of 5 to 12 individuals that adopt a well-defined, strict hierarchy. The leader of the pack is the Alpha (\(\alpha \)), responsible for decisions related to hunting, time and place of rest, etc. The second level is the Beta (\(\beta \)), which supports the Alpha in decision making; Betas are also strong candidates to take the lead if an Alpha is lost. The third level is occupied by the Delta (\(\delta \)), responsible for the safety of the pack. The lowest level of the hierarchy is occupied by the Omega (\(\omega \)), which plays the role of scapegoat and must submit to the whole group [30]. The encircling behavior of grey wolves when hunting prey can be expressed as:

$$\begin{aligned} \varvec{D} = \Big | \varvec{C} \cdot \varvec{X}_p(t) - \varvec{X}(t) \Big | \end{aligned}$$
(16)
$$\begin{aligned} \varvec{X}(t+1) = \varvec{X}_p{(t)} - \varvec{A} \cdot \varvec{D} \end{aligned}$$
(17)

where t indicates the current iteration, \(\varvec{A} = 2 \varvec{a} \cdot \varvec{r}_1 - \varvec{a}\) and \(\varvec{C} = 2 \cdot \varvec{r}_2\), the components of \(\varvec{a}\) are linearly decreased from 2 to 0 over the course of the iterations, \(r_1\) and \(r_2\) are random vectors in [0, 1], \(\varvec{X}_p\) is the prey’s position vector, and \(\varvec{X}\) indicates the position vector of a grey wolf. The second main phase, hunting, can be modeled as:

$$\begin{aligned} \varvec{D}_{\alpha } = \Big |\varvec{C}_1 \cdot \varvec{X}_{\alpha } - \varvec{X}\Big | , \varvec{D}_{\beta } = \Big |\varvec{C}_2 \cdot \varvec{X}_{\beta } - \varvec{X}\Big |, \varvec{D}_{\delta } = \Big | \varvec{C}_3 \cdot \varvec{X}_{\delta } - \varvec{X}\Big | \end{aligned}$$
(18)
$$\begin{aligned} \varvec{X}_{1} = \varvec{X}_{\alpha } - \varvec{A}_{1} \cdot (\varvec{D_{\alpha }}) , \varvec{X}_2 = \varvec{X}_{\beta } - \varvec{A}_{2} \cdot (\varvec{D_{\beta }}), \varvec{X}_{3} = \varvec{X}_{\delta } - \varvec{A}_{3} \cdot (\varvec{D_{\delta }}) \end{aligned}$$
(19)
$$\begin{aligned} \varvec{X}(t+1) = \frac{\varvec{X_1} + \varvec{X_2} + \varvec{X_3}}{3} \end{aligned}$$
(20)
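The following sketch applies Eqs. 16-20 to minimize a fitness function over real-valued vectors; the population size, iteration count and bounds are illustrative assumptions.

```python
import numpy as np

def gwo(fitness, dim, n_wolves=20, max_iter=100, lower=-1.0, upper=1.0):
    """Minimal Gray Wolf Optimizer sketch (minimization)."""
    X = np.random.uniform(lower, upper, (n_wolves, dim))

    for t in range(max_iter):
        fit = np.array([fitness(x) for x in X])
        order = np.argsort(fit)
        # Pack leaders: the three best wolves.
        alpha, beta, delta = X[order[0]].copy(), X[order[1]].copy(), X[order[2]].copy()

        a = 2.0 - 2.0 * t / max_iter                  # 'a' decreases linearly from 2 to 0

        for i in range(n_wolves):
            X_new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = np.random.rand(dim), np.random.rand(dim)
                A = 2.0 * a * r1 - a                  # coefficient vector A
                C = 2.0 * r2                          # coefficient vector C
                D = np.abs(C * leader - X[i])         # Eq. 18
                X_new += leader - A * D               # Eq. 19
            X[i] = np.clip(X_new / 3.0, lower, upper) # Eq. 20

    fit = np.array([fitness(x) for x in X])
    return X[fit.argmin()], fit.min()
```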

3.5 Particle Swarm Optimization - PSO

The PSO algorithm originated from the observation of social behavior in bird flocks and fish shoals [31]. The problem is expressed through an objective function, and the quality of the solution represented by a particle is the value of the objective function at that particle's position. The term particle is used in analogy to physics: a particle has a well-defined position and velocity vector but no mass or volume. The term swarm, in turn, represents the set of candidate solutions. The velocity, which moves the particle position toward pbest (its own best position) and gbest (the overall best position among the particles), is given by:

$$\begin{aligned} v_{id}^{k+1} = w^{k} \times v_{id}^{k} + c_1 \times rand \times (pbest_{id} - s_{id}^{k}) + c_2 \times rand \times (gbest_{id} - s_{id}^{k}) \end{aligned}$$
(21)

The current searching position of the particle can be modified by:

$$\begin{aligned} s_{id}^{k+1} = s_{id}^{k} + v_{id}^{k+1} \end{aligned}$$
(22)

where \(s_{id}^{k}\) is the current searching point, \(c_1\) and \(c_2\) are learning factors, and \(w^{k}\) is the inertia weight applied to the velocity, given by:

$$\begin{aligned} w^{k} = w_{max} - \frac{w_{max} - w_{min}}{k_{max}} \cdot k \end{aligned}$$
(23)

where \(w_{max}\) and \(w_{min}\) are the maximum and minimum weights respectively, \(k_{max}\) and k are the maximum and current iteration number [32].
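A minimal sketch of the PSO loop defined by Eqs. 21-23 is given below for minimizing a fitness function; the learning factors, inertia bounds and search bounds are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def pso(fitness, dim, n_particles=20, max_iter=100, c1=2.0, c2=2.0,
        w_max=0.9, w_min=0.4, lower=-1.0, upper=1.0):
    """Minimal Particle Swarm Optimization sketch (minimization)."""
    S = np.random.uniform(lower, upper, (n_particles, dim))   # positions
    V = np.zeros((n_particles, dim))                          # velocities
    pbest = S.copy()
    pbest_fit = np.array([fitness(s) for s in S])
    gbest = pbest[pbest_fit.argmin()].copy()

    for k in range(max_iter):
        w = w_max - (w_max - w_min) * k / max_iter            # Eq. 23
        r1 = np.random.rand(n_particles, dim)
        r2 = np.random.rand(n_particles, dim)
        V = w * V + c1 * r1 * (pbest - S) + c2 * r2 * (gbest - S)   # Eq. 21
        S = np.clip(S + V, lower, upper)                      # Eq. 22

        fit = np.array([fitness(s) for s in S])
        improved = fit < pbest_fit                            # update personal bests
        pbest[improved], pbest_fit[improved] = S[improved], fit[improved]
        gbest = pbest[pbest_fit.argmin()].copy()              # update global best

    return gbest, pbest_fit.min()
```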

4 Data Description

The dataset was collected at Hospital Universitario de Caracas in Caracas, Venezuela, and was published in 2017 by the University of California, Irvine. According to the study [33], most of the patients belong to the lowest socioeconomic status, with low income and educational level, which is the population at highest risk. The age of the patients ranges from 13 to 84 years. All patients are sexually active and 98% of them have been pregnant at least once. The dataset comprises demographic information, habits, and historical medical records of 858 patients, with 32 features and four targets (Hinselmann, Schiller, Cytology and Biopsy). This paper focuses on the Biopsy target, as recommended by the literature review. The attributes in the dataset are presented in Table 1.

Table 1. Inputs with their respective type and valid amount of data.

4.1 Preprocessing

Real-world data tend to be of low quality and cannot be used directly in the mining process without pre-processing. Data quality can influence the performance of a learning system in numerous ways. Preprocessing consisted of two steps:

Data Cleaning. The cervical dataset suffers from a large number of missing cells due to the lack of information. Removing every observation that has one or more missing values is a quick solution and is typically preferred when the percentage of missing values is relatively low. Because the amount of data lost is very small relative to the size of the dataset, omitting the few samples with missing features was the best strategy for not biasing the analysis. The final dataset contained 668 instances and 34 attributes.

Normalization. Since the cervical cancer data has features with different ranges, normalization was necessary. The goal of normalization is to bring the values of numeric columns in the dataset to a common scale without distorting the differences in their ranges of values. Thus, all features in our experiments were scaled to the [0, 1] range.
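Both preprocessing steps can be expressed in a few lines of pandas. The sketch below is illustrative only: the file name, the '?' missing-value marker and the 'Biopsy' column name follow the public UCI version of the dataset and are assumptions about the local setup.

```python
import pandas as pd

# Load the dataset; the UCI CSV marks missing values with '?'.
df = pd.read_csv("risk_factors_cervical_cancer.csv", na_values="?")

# Data cleaning: drop every observation with at least one missing value.
df = df.dropna()

# Normalization: min-max scaling of each feature column to the [0, 1] range.
X = df.drop(columns=["Biopsy"])
y = df["Biopsy"]
X = (X - X.min()) / (X.max() - X.min())   # constant columns would need special care
```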

5 Simulation and Experiments

This section describes the main test scenarios that were conducted. All neural network models used in this article were built with the Keras framework [?]. Keras is a framework for building deep neural networks with Python and enables us to build state-of-the-art deep learning systems. All computations were performed on a PC with an Intel Core i7-5775R processor, 32 GB of DDR4 RAM, a GeForce GTX 1080 GPU and a 300 GB SATA (5 Gb/s) SSD.

The dataset was used to feed the models LSTM-CS, LSTM-GA, LSTM-GSA, LSTM-GWO and LSTM-PSO; the weights and biases were then updated to obtain better accuracy and a lower error rate. The main objective of these experiments is to compare all LSTM models while varying parameters such as the number of hidden neurons, the population size and the number of iterations chosen for each model (Table 2).
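A sketch of how the metaheuristics interact with the LSTM is shown below: a candidate solution is the flattened vector of all Keras weights and biases, and its fitness is the classification error on the training fold. The layer sizes, the single-timestep reshaping of the 32 tabular features and the use of the TensorFlow Keras API are assumptions for illustration; `pso` refers to the sketch in Sect. 3.5.

```python
import numpy as np
from tensorflow import keras

# Illustrative model: the 32 tabular features are treated as a one-step sequence.
model = keras.Sequential([
    keras.layers.LSTM(16, input_shape=(1, 32)),
    keras.layers.Dense(1, activation="sigmoid"),   # Biopsy: positive / negative
])

shapes = [w.shape for w in model.get_weights()]
dim = sum(int(np.prod(s)) for s in shapes)         # length of a candidate solution

def set_flat_weights(vector):
    """Unflatten a candidate solution back into the Keras weight tensors."""
    weights, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        weights.append(vector[start:start + size].reshape(shape))
        start += size
    model.set_weights(weights)

def fitness(vector, X_train, y_train):
    """Error rate on the training fold; the optimizer minimizes this value."""
    set_flat_weights(vector)
    pred = (model.predict(X_train, verbose=0) > 0.5).astype(int).ravel()
    return 1.0 - float((pred == y_train).mean())

# Example (with X_train of shape (n, 1, 32) and y_train of 0/1 labels):
# best_vector, best_err = pso(lambda v: fitness(v, X_train, y_train), dim)
```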

To avoid training and testing on the same set, the mean accuracy is determined using five-fold cross-validation.

Table 2. Parameters varied on a scale from 4 to 40. In each test, one parameter changes while the others remain fixed.
Fig. 3. Variations of the LSTM networks with the metaheuristic algorithms and their respective accuracy rates, with different numbers of neurons.

Figure 3 shows the accuracy of the classifiers with different numbers of neurons in the hidden layer, while the number of iterations and the population size remained fixed (at 40) for all architectures presented. LSTM-PSO and LSTM-GA achieved similar accuracies, reaching 92% and 91%, respectively, while LSTM-GSA had the worst result. In this test, increasing the number of neurons did not increase the accuracy of the models. Each hidden neuron influences the error on the nodes to which its output is connected; thus, too many hidden neurons cause too much flexibility, which leads to overfitting, and the neural network then overestimates the complexity of the target problem.

Fig. 4. Variations of the LSTM networks with the metaheuristic algorithms and their respective accuracy rates, with different population sizes.

Fig. 5. Variations of the LSTM networks with the metaheuristic algorithms and their respective accuracy rates, with different numbers of iterations.

Figure 4 shows the accuracy of the neural models as the population size increases. Each LSTM was tested with different population sizes (from 5 to 40) to find the best solution, increasing the population at each step, while the number of iterations and the number of neurons in the hidden layer were fixed at 40. This test shows that increasing the population size increases the accuracy of all models. The LSTM-PSO network achieved an accuracy rate of 96%.

Figure 5 shows the effect of increasing the number of iterations, while the other parameters remained fixed at 40. Despite the increase in the accuracy of the models, the improvement was quite small. A high number of iterations allows the algorithm to converge to the optimal solution, although in this case study the performance of all models except LSTM-GSA is roughly equivalent. Again, the LSTM-PSO model obtained the best result, reaching 91% accuracy.

6 Conclusions and Future Work

The mortality rate from cervical cancer worldwide has increased significantly in recent years. The cervical tumor threatens not only the motherhood of women but, above all, their lives. This paper proposed an LSTM network trained with metaheuristics as an alternative to the slow convergence and convergence to local minima of neural networks trained with backpropagation. Five different nature-inspired optimizers were used: Cuckoo Search, Genetic Algorithm, Gravitational Search, Gray Wolf Optimizer and Particle Swarm Optimization, and all models were applied to a cervical cancer dataset. The tests were performed on a dataset containing data from Venezuelan women by changing three parameters: the number of neurons, the number of iterations and the population size. The developed neural networks were applied to diagnose cancer patients based on a number of content-related features. The LSTM-BPTT approach was evaluated and compared with the networks trained by the five optimizers, and the five optimizers were then compared based on their accuracy results. The experiments showed that LSTM with PSO performed better than LSTM with the other algorithms in terms of optimization. The accuracy of LSTM with PSO was 0.92 when varying the number of neurons, 0.96 when varying the population size and 0.91 when varying the number of iterations.

In addition, other observations could be made from the experiments performed. Increasing the number of neurons was not beneficial for the experiment: accuracy declines as the number of neurons in the hidden layer grows, which might be due to the increased complexity of the network structure, requiring more training time to converge. On the other hand, increasing the population size was beneficial for the performance of the models, while increasing the number of iterations did not show significant improvements.

Finally, this paper has provided a method for detecting cervical cancer in women with higher accuracy by applying LSTM networks that overcome the problems of backpropagation. Future work will include using other ways to address the missing-data problem, verifying the optimization algorithms on other types of neural networks, and investigating the effectiveness of LSTM and the algorithms on larger datasets. Prediction of other diseases with different datasets using LSTM neural networks could also be considered.