Cervical Cancer Identification with Synthetic Minority Oversampling Technique and PCA Analysis using Random Forest Classifier

Geetha, R.; Sivasubramanian, S.; Kaliappan, M.; Vimal, S.; Annamalai, Suresh

doi:10.1007/s10916-019-1402-6


Patient Facing Systems
Published: 17 July 2019

Volume 43, article number 286, (2019)

Journal of Medical Systems


Abstract

Cervical cancer is the fourth most common malignant disease among women worldwide. In most cases, the symptoms of cervical cancer are not perceptible in its early stages. A number of factors increase the risk of developing cervical cancer, such as human papillomavirus, sexually transmitted diseases, and smoking. Identifying those factors and building a classification model that determines whether or not a case is cervical cancer remains a challenging research problem. This study aims to use cervical cancer risk factors to build a classification model using the Random Forest (RF) classification technique together with the synthetic minority oversampling technique (SMOTE) and two feature reduction techniques: recursive feature elimination (RFE) and principal component analysis (PCA). Most medical data sets are imbalanced because the number of patients is considerably smaller than the number of non-patients. SMOTE is used to address the imbalance of the data set used here. The data set comprises 32 risk factors and four objective variables: Hinselmann, Schiller, Cytology and Biopsy. The accuracy, sensitivity, specificity, PPA and NPA of the four variables remain accurate after SMOTE when compared with the values obtained before SMOTE. An ontology, RSOnto, has been created to visualize the progress in classification performance.


Introduction

To replace dead and damaged cells, a normal human body produces 50 to 70 billion cells every day. At times the growth of cells becomes uncontrolled, which results in benign or malignant tumors. Malignant tumors are referred to as cancer. This paper focuses on a specific type of cancer, cervical cancer. Two main groups of factors are responsible for cervical cancer:

1. Modifiable factors, such as sexual intercourse
2. Non-modifiable factors, such as mutational hormones [1].

Cervical cancer is one of the serious health issues faced by women today [2]. About 80% of cervical cancer cases occur in developing countries [3]. For the United States, an estimated 13,240 new cervical cancer cases and about 4,170 deaths were expected in 2018 [4], a death ratio of nearly 31.5%. Cervical cancer affects the female reproductive system by attacking the cervix. In the early stages it develops without any symptoms; symptoms start to appear only at a later stage, after the disease has spread to other organs. It is therefore very important to diagnose the disease at an early stage and so increase the survival ratio.

Because the number of infected women is increasing widely, machine learning techniques are used to support medical and disease diagnosis. In this paper we apply the Random Forest (RF) algorithm to deal with unbalanced data sets and to increase classification performance [5, 6]. It performs better than simple neural network techniques.

The Synthetic Minority Oversampling Technique (SMOTE) balances the dataset classes by quantitatively increasing the minority class. The minority class is enlarged using k-nearest neighbors until the classes are nearly equal in size. To lessen the processing time and remove unimportant features from the classification, Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA) are used. The Random Forest classification technique then classifies the cases into two categories, cervical cancer and non-cervical cancer. Performance is measured before and after SMOTE.

The paper is structured as follows.

Section II - Related work of cervical cancer classification.
Section III - Methods of machine learning, oversampling, features reduction techniques used.
Section IV - Experimental results discussed. Analysis and comparison shown.
Section V - Ontological Representation
Section VI - Conclusion and Future work presented.

Related Work

Much research has been carried out in the field of cervical cancer, using various approaches to detect and diagnose its presence. Various classification and segmentation methods have been used over time, and enhanced versions of these methods help to identify risk factors for cervical cancer. Game theory models [7, 8], dynamic genetic algorithms [9] and an Artificial Bee Colony based clustering approach [10] play a vital role in developing a medical system model and its ontological representation. This paper presents an ontological representation, RSOnto, for the enhanced study with SMOTE to enrich the research in this area.

In 2013, Tseng et al. [11] compared three classification models for cervical cancer:

1. C5.0
2. Support vector machine
3. Extreme learning machine

The dataset, collected from the Chung Shan Medical University Hospital, had 12 features for 168 cases, and two risk factors were identified. The results showed that C5.0 obtained the highest classification accuracy.

In 2014, Hu et al. [12] used artificial neural networks with back substitution and obtained the highest classification accuracy for cervical cancer.

In 2016, Sharma [13] obtained accurate results using naïve Bayes, which outperformed logistic regression.

In 2016, Sobar et al. [14] used behavior theory from social science and obtained accurate results using naïve Bayes, which outperformed logistic regression [15].

In 2017, Wu and Zhou [16] experimented with a classification model based on the Support Vector Machine (SVM) and obtained the highest accuracy ratio. Four target variables, Hinselmann, Schiller, Cytology and Biopsy, were determined from the relevant available factors. RFE and PCA were used to reduce the processing time.

On 3 April 2019, Kwanda Ngwenduna (www.colloquium2019.org.za/wp.../2019/04/kwanda_sydwell_ngwenduna_10h45.pdf) stated that class imbalance remains a problem and that SMOTE can be combined with undersampling and extended to regression and time series.

Proposed Methods

Random Forest (RF)

Random Forest (RF) is a well-known classification technique used in diverse classification areas [17, 18]. RF is also known as bagged decision trees [19, 20]. The algorithm [21] uses a group of weak learners to form a strong learner. RF builds on the Classification and Regression Tree (CART) technique [22].

It grows multiple uncorrelated decision trees and combines them using bootstrap aggregation (bagging) [23]. The CART technique learns the relationship between a dependent variable (y) and a set of independent variables (x) in order to classify correctly. To build each independent decision tree, Random Forest randomly selects a subset of the data. Tree construction is a repetitive process that splits the selected random subset from the root node into child nodes [24]. Splitting continues until each branch of the tree reaches a leaf node, without pruning. Each tree classifies the objective variable from the features independently and casts a vote for a class [25]. The overall classification is the class that receives the majority of the tree votes.

Steps for RF construction:

N - Number of bootstrap samples
M - Total number of attributes
m - Number of attributes sampled at each node
k - Next node

Step 1. Create N bootstrap samples from the dataset.
Step 2. At every node, take a random subset of m attributes, where m < M.
Step 3. Find the best split point among the m attributes selected in Step 2 and use it to create the next node k.
Step 4. Split the tree repeatedly until leaf nodes are reached; the tree is then complete.
Step 5. Train one tree on each bootstrap sample separately.
Step 6. Collect the predictions of the N trained trees as votes.
Step 7. Build the final RF prediction from the class with the most votes.
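
The construction steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the Gini impurity criterion, the midpoint thresholds, and the function names (`fit_forest`, `predict_forest`, and so on) are choices made here for the example, and class labels are assumed to be plain values such as 0 and 1.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Best (feature, threshold) among feature_ids, or None if no split helps."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between consecutive observed values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, m, rng):
    """Steps 2-4: grow an unpruned tree; each node samples m of the M attributes."""
    if len(set(labels)) == 1:
        return labels[0]                      # pure leaf
    M = len(rows[0])
    split = best_split(rows, labels, rng.sample(range(M), m))
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, rng))

def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(rows, labels, n_trees=25, m=1, seed=0):
    """Steps 1 and 5: one tree per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, row):
    """Steps 6 and 7: collect one vote per tree and return the majority class."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

Calling `fit_forest(rows, labels)` on a list of numeric feature rows and then `predict_forest(forest, row)` on a new row returns the majority vote of the trees.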

The proposed flow of activities has been diagrammatically represented in Fig. 1.

Features Selection Techniques

Two feature selection methods are used:

1. Principal Component Analysis (PCA)
2. Recursive Feature Elimination (RFE)

These techniques reduce the number of features without degrading model performance. The features that are not selected remain in the full-feature dataset.

Principal Component Analysis (PCA)

PCA is a statistical procedure that uses eigenvectors to describe the orientation of the features. The analysis maps the n-dimensional feature space into a k-dimensional space, where k < n, whose axes are known as principal components. The covariance matrix is calculated, and the result is used to define the eigenvectors and eigenvalues [26]. The first principal component is the eigenvector with the highest eigenvalue. This principal component is chosen from the cervical cancer dataset because it reveals the most important relationship among the dataset attributes [7]. The eigenvalues are arranged in descending order so that the most significant components are kept and the least significant are discarded. In this way high-dimensional data is reduced to lower-dimensional data [27].

To describe the deviation of the data in the dataset, the variance, which depicts the spread of the data, is calculated using eq. (1).

$$ Var(x)=\left(\frac{1}{n}\right){\sum}_{i=1}^n{\left({x}_{ij}-{\mu}_j\right)}^2 $$

(1)

The covariance is then determined to identify the relation between the dataset features. High values express a strong relation between features, and zero values indicate that there is no relation between features. The covariance is calculated using eq. (2).

$$ Cov\left(x,y\right)=\left(\frac{1}{n-1}\right){\sum}_{i=1}^n\left({x}_{ij}-{\mu}_{xj}\right)\left({y}_{ij}-{\mu}_{yj}\right) $$

(2)
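
As a small worked illustration of eqs. (1) and (2), the sketch below computes the variance as the mean squared deviation from the feature mean (1/n normalisation) and the sample covariance (1/(n-1) normalisation). The function names and the toy columns are assumptions introduced here for the example; they do not come from the paper.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def variance(values):
    """Spread of one feature: mean squared deviation from the feature mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def covariance(x, y):
    """Sample covariance of two equally long feature columns (1/(n-1) factor)."""
    mu_x, mu_y = mean(x), mean(y)
    return sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (len(x) - 1)

def covariance_matrix(columns):
    """Square matrix whose (p, q) entry is the covariance of columns p and q."""
    return [[covariance(p, q) for q in columns] for p in columns]

# Toy example: the second column is exactly twice the first, the third is constant.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
constant = [5.0, 5.0, 5.0, 5.0]
print(variance(x))              # spread of x
print(covariance(x, y))         # strongly related features give a large value
print(covariance(x, constant))  # no relation gives zero
```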

The eigenvalues and eigenvectors of the covariance matrix A are then determined. The eigenvalues λ are the solutions of the characteristic equation, eq. (3), where I is the identity matrix; the result is then transformed using a varimax orthogonal rotation.

$$ \det \left(A-\lambda I\right)=0 $$

(3)
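
For a two-feature covariance matrix the characteristic equation (3) is a quadratic, so the eigen-decomposition can be written out by hand. The sketch below does this for a symmetric 2x2 matrix and projects mean-centred data onto the leading component. It is an illustration under stated assumptions only: the function names are hypothetical, the restriction to two features is made here to keep the example short, and the varimax rotation mentioned in the text is not included.

```python
import math

def eigen_2x2_symmetric(a, b, d):
    """Eigenpairs of [[a, b], [b, d]], sorted by descending eigenvalue.

    det(A - lam*I) = 0 expands to lam^2 - (a + d)*lam + (a*d - b*b) = 0.
    """
    if abs(b) < 1e-12:
        # Already diagonal: the eigenvectors are the coordinate axes.
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda p: p[0], reverse=True)
    half_trace = (a + d) / 2
    det = a * d - b * b
    root = math.sqrt(max(half_trace ** 2 - det, 0.0))
    pairs = []
    for lam in (half_trace + root, half_trace - root):
        v = (b, lam - a)                 # solves (A - lam*I) v = 0 when b != 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs

def project(rows, component):
    """Scores of mean-centred 2-D rows along a unit-length component."""
    n = len(rows)
    mu = [sum(r[j] for r in rows) / n for j in range(2)]
    return [(r[0] - mu[0]) * component[0] + (r[1] - mu[1]) * component[1]
            for r in rows]

# Covariance matrix [[2, 1], [1, 2]]: the leading direction is the diagonal.
(first_value, first_vector), (second_value, _) = eigen_2x2_symmetric(2.0, 1.0, 2.0)
scores = project([[0.0, 0.0], [2.0, 2.0]], first_vector)
```

Keeping only the scores along the first component reduces the two features to one, which is the dimensionality reduction described in this section.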

Recursive Feature Elimination (RFE)

The RFE algorithm is also used with random forest for variable importance grouping [28]. RFE was proposed by Guyon et al. [29], who used a backward selection method with a linear support vector machine on gene microarray data, where the number of features was in the thousands. Díaz-Uriarte and Alvarez de Andrés [30] used RFE-RF for gene selection and class prediction. RFE also works with other linear classification methods. Figure 2 shows the pseudo-code for the algorithm.
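
The backward elimination loop behind RFE can be sketched as follows. This is not a reproduction of the pseudo-code in Figure 2; it is a generic illustration in which `rank_features` is a hypothetical callback standing in for whatever importance measure is used (for example, random forest variable importance recomputed on the remaining features), and the feature names and scores in the usage lines are invented for the example.

```python
def recursive_feature_elimination(features, rank_features, n_keep):
    """Repeatedly drop the least important feature until n_keep remain.

    rank_features(remaining) must return a dict mapping each remaining
    feature name to an importance score (higher means more important).
    Returns (kept_features, eliminated_features_in_order).
    """
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_keep:
        importance = rank_features(remaining)       # re-rank on every round
        weakest = min(remaining, key=lambda f: importance[f])
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated

# Usage with fixed, made-up importance scores (a real ranker would refit a model).
toy_scores = {"age": 0.9, "smokes": 0.1, "num_pregnancies": 0.5, "iud": 0.3}
kept, dropped = recursive_feature_elimination(
    list(toy_scores),
    lambda remaining: {f: toy_scores[f] for f in remaining},
    n_keep=2,
)
```

Re-ranking inside the loop matters because the importance of a feature can change once a correlated feature has been removed.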

Synthetic Minority Oversampling Technique (SMOTE)

Machine learning techniques face trouble when one class dominates the dataset, that is, when the number of records in one class greatly exceeds the number in the other classes. Such a dataset is called imbalanced; it misleads the classifier and affects the results. SMOTE is used to solve this problem. SMOTE is an oversampling technique introduced by Chawla et al. [31]. It synthetically increases the minority class, based on k-nearest neighbors, to balance the dataset [31]. The SMOTE algorithm has been used in different fields to solve the imbalance problem, including network intrusion detection systems [32], breast cancer detection [33] and sentence boundary detection in speech [34]. A synthetic sample x_syn is generated from a minority feature vector x_i, one of its k nearest neighbors x_knn, and a random number t between 0 and 1:

$$ {x}_{syn}={x}_i+\left({x}_{knn}-{x}_i\right)\ast t $$

(4)

SMOTE can be described by the following steps.

Step 1. Select a minority-class feature vector x_i and identify its k nearest neighbors x_knn.
Step 2. Calculate the difference between the feature vector and a nearest neighbor.
Step 3. Multiply the difference by a random number between 0 and 1.
Step 4. Add the result to the feature vector to obtain a new point on the line segment between the two.
Step 5. Repeat Steps 1 to 4 for the remaining feature vectors.
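
The five steps and eq. (4) can be sketched in plain Python as below. This is a minimal illustration, not the code used in the experiments: the function name `smote`, the Euclidean distance, the default k = 5, and the choice to cycle through the minority samples in order are assumptions made here.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points from a list of minority-class feature vectors.

    Requires at least two minority samples. Each synthetic point lies on the
    line segment between a sample x_i and one of its k nearest minority
    neighbours: x_syn = x_i + (x_knn - x_i) * t, with t drawn from [0, 1).
    """
    rng = random.Random(seed)
    synthetic = []
    for j in range(n_synthetic):
        x_i = minority[j % len(minority)]                     # Step 1: pick x_i
        neighbours = sorted(
            (p for p in minority if p is not x_i),
            key=lambda p: math.dist(p, x_i),
        )[:k]                                                 # k nearest neighbours
        x_knn = rng.choice(neighbours)
        t = rng.random()                                      # Step 3: random factor
        synthetic.append([a + (b - a) * t                     # Steps 2 and 4
                          for a, b in zip(x_i, x_knn)])
    return synthetic

# Toy minority class: the four corners of the unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(corners, n_synthetic=10, k=2, seed=1)
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class.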

Cervical Cancer Dataset

The cervical cancer dataset consists of medical history, habits and demographic information for 858 cases, with 32 features for each case [35]. The dataset has missing values because some patients decided not to answer all questions owing to privacy concerns. The information describes:

1. The dataset features
2. The total number of entries
3. The missing values for each feature

The dataset is from https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening.

Missing values are visible in the dataset. Features with a large number of missing values are removed, which decreases the number of features to 30. The remaining missing values are handled by replacing them with the mean. Figure 3 represents the features in the dataset. Figures 4A and 4B illustrate the values related to the features.
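
The two cleaning operations described above, dropping features that are mostly missing and mean-filling the rest, might look like the following sketch. The representation of missing entries as `None`, the column names, and the 50% cut-off `max_missing` are all assumptions made for this example; the paper does not state the threshold it used.

```python
def missing_fraction(column):
    """Share of entries in a column that are missing (None)."""
    return sum(v is None for v in column) / len(column)

def impute_with_mean(column):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mu = sum(observed) / len(observed)
    return [mu if v is None else v for v in column]

def clean_table(table, max_missing=0.5):
    """Drop columns with too many missing values, mean-impute the others.

    table maps a feature name to its list of values; max_missing is a
    hypothetical cut-off chosen for illustration.
    """
    return {name: impute_with_mean(column)
            for name, column in table.items()
            if missing_fraction(column) <= max_missing}

# Hypothetical mini-table with one usable and one mostly-missing feature.
raw = {"age": [20.0, None, 40.0], "mostly_missing": [None, None, 1.0]}
cleaned = clean_table(raw)
```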

Schiller's test is used to diagnose cervical cancer by applying iodine solution to the cervix [36]. Hinselmann's test is used to examine the cervix, vulva and vagina [35]. Cytology is a test that checks for cancer, precancerous conditions, and urinary tract infection. In a biopsy, a piece of tissue taken from the body is examined to determine whether it is normal or not. These four tests are examined and the results are compared before and after SMOTE.

Each of the 858 cases is labeled with Hinselmann, Schiller, Cytology and Biopsy, and an ontological representation of these is provided in this paper. Each objective variable represents a form of cervical cancer examination.

The number of patients is compared with the number of non-patients for each examination, before and after SMOTE.

1. Before SMOTE the data is imbalanced.
2. After the SMOTE algorithm is applied the dataset is balanced.

Evaluation Metrics

Accuracy, sensitivity, specificity, positive predicted accuracy (PPA) and negative predicted accuracy (NPA) are measured to assess the performance of the classification. Cervical cancer prediction is performed using random forest with SMOTE and two feature reduction techniques. In the pre-processing stage, records with missing values or lacking information are deleted from the unbalanced dataset. SMOTE is then applied to balance the dataset, and the feature selection techniques PCA and RFE are applied to reduce the number of features and decrease the processing time. The second phase is the classification phase, in which training is performed using random forest. The next phase uses 10-fold cross validation for validation and testing. The concluding phase compares the results with and without SMOTE, and the result obtained with the methodology is represented in the ontology [37]. With TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives, the metrics are defined in eqs. (5) to (9).

$$ Accuracy=\left( TP+ TN\right)/\left( TP+ TN+ FP+ FN\right) $$

(5)

$$ Sensitivity= TP/\left( TP+ FN\right) $$

(6)

$$ Specificity= TN/\left( TN+ FP\right) $$

(7)

$$ PPA= TP/\left( TP+ FP\right) $$

(8)

$$ NPA= TN/\left( TN+ FN\right) $$

(9)
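
The five metrics can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions: accuracy counts both true positives and true negatives in the numerator, and specificity is TN over TN + FP. The function name and the example counts are invented for illustration and are not results from the paper.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, PPA and NPA from confusion counts."""
    return {
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_ratio(tp, tp + fn),   # share of patients detected
        "specificity": safe_ratio(tn, tn + fp),   # share of non-patients cleared
        "ppa": safe_ratio(tp, tp + fp),           # positive predicted accuracy
        "npa": safe_ratio(tn, tn + fn),           # negative predicted accuracy
    }

# Made-up counts: 8 true positives, 9 true negatives, 1 false positive,
# 2 false negatives.
example = classification_metrics(tp=8, tn=9, fp=1, fn=2)
```

On an imbalanced dataset, accuracy alone can look high even when sensitivity is poor, which is why the five metrics are reported together.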

Simulation Experiment

The cost of misdiagnosing a cervical cancer case, or vice versa, is high. The dataset used is unbalanced, as the number of malignant records is smaller than the number of normal records, so the SMOTE algorithm is used to balance the classes. In this section three RF-based approaches are used to classify cervical cancer cases into patients and non-patients. To validate model performance, 10-fold cross validation is used. The experiments were done before and after SMOTE, with and without feature selection. Each experiment was executed separately to ensure the highest accuracy and to avoid misleading classification due to the nature of the dataset. The experiments are discussed in the following sections, with the record counts given in Table 1.
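
The 10-fold cross validation used for validation can be sketched as an index generator: the records are shuffled once, divided into ten folds, and each fold serves once as the test set while the other nine are used for training. The function name, the fixed seed and the striding scheme are assumptions made here for illustration.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# With the 858 cases of the dataset, every fold holds 85 or 86 records.
splits = list(k_fold_indices(858, k=10))
fold_sizes = [len(test) for _, test in splits]
```

Each record appears in exactly one test fold, so the averaged score uses every case for testing once.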

Table 1 Comparative patients and non-patients count


Objective Variable: Hinselmann

In the Hinselmann examination test, RF before SMOTE achieved a total accuracy of 95.91% with 35 patient records and 823 non-patient records. After applying the SMOTE algorithm, RF achieved a total accuracy of 97.91% with 805 patients and 823 non-patients. The SMOTE algorithm increased the accuracy along with the sensitivity, PPA and NPA, as shown in Tables 2 and 3.

Table 2 Hinselmann test (before SMOTE)


Table 3 Hinselmann test (after SMOTE)


Objective Variable: Schiller

In the Schiller examination test, RF before SMOTE achieved a total accuracy of 91.48% with 35 patient records and 823 non-patient records. After applying the SMOTE algorithm, RF achieved a total accuracy of 95.02% with 805 patients and 823 non-patients. The SMOTE algorithm increased the accuracy along with the sensitivity, PPA and NPA, as shown in Tables 4 and 5.

Table 4 Schiller test (before SMOTE)


Table 5 Schiller test (after SMOTE)


Objective Variable: Cytology

In the Cytology examination test, RF before SMOTE achieved a total accuracy of 94.58% with 35 patient records and 823 non-patient records. After applying the SMOTE algorithm, RF achieved a total accuracy of 95.02% with 805 patients and 823 non-patients. The SMOTE algorithm increased the accuracy along with the sensitivity, PPA and NPA, as shown in Tables 6 and 7.

Table 6 Cytology test (before SMOTE)


Table 7 Cytology test (after SMOTE)


Objective Variable: Biopsy

In the Biopsy examination test, RF before SMOTE achieved a total accuracy of 93.48% with 35 patient records and 823 non-patient records. After applying the SMOTE algorithm, RF achieved a total accuracy of 94.02% with 805 patients and 823 non-patients. The SMOTE algorithm increased the accuracy along with the sensitivity, PPA and NPA, as shown in Tables 8 and 9.

Table 8 Biopsy test (before SMOTE)


Table 9 Biopsy test (after SMOTE)


Analysis and Comparison

The results show that the Random Forest technique can classify the biased dataset with a better accuracy ratio in classifying cervical cancer data. They are represented graphically in Figs. 5 to 32.

A comparison with SVM and SMOTE has been tabulated using the values given in [16]. Accuracy, sensitivity, specificity, PPA and NPA are calculated for 805 patients among 835 non-patients and are given in Tables 10, 11, 12 and 13.

Table 10 Performance of Hinselmann test


Table 11 Performance of Schiller test


Table 12 Performance of Biopsy test


Table 13 Performance of Cytology test


Ontological Representation

An ontology is a form of knowledge representation. Knowledge takes the form of a vocabulary of concepts that are explicitly defined, together with the relationships among the concepts. An ontology is also a structured view of the domain with rich semantic meaning. Since the size and diversity of semantically represented datasets are growing dramatically, the computational load has increased significantly.

An ontology-based knowledge graph has the advantage of exhibiting relevant information visually, which helps us to analyze the computational load effectively and efficiently without losing any data. These requirements motivate us to learn how to recognize and process these inherent semantic structures and hierarchies in order to gain new insights and improve existing services.

Figure 33 shows the ontographical representation of RSOnto, which depicts the relations among the various classes. Figure 34 illustrates the classes and sub-classes in the RSOnto ontology. Figures 35 and 36 illustrate the comparative study with SVM and SMOTE. The ontology graphically represents the comparative study of the Hinselmann, Schiller, Biopsy and Cytology tests before and after SMOTE. The study relates the tests using the objective variable.

RSOnto depicts the accuracy, sensitivity, specificity, PPA and NPA for Hinselmann, Schiller, Biopsy and Cytology comparatively, before and after SMOTE, demonstrating the efficiency of SMOTE. Figures 35 and 36 give the graphical representation of the tests before and after SMOTE. The framework is based on RDF/OWL, which captures the dependencies between low-level domain concepts and complex activities. It defines the tests to capture the knowledge needed for detecting complex activities. This ontology-based semantic fusion serves as a baseline for recognizing events in a universal view in which complete multimodal data is recorded.

Limitations

SMOTE is used here only for two-dimensional data. In higher dimensions SMOTE is not very effective, because it does not consider adjacent nodes; this leads to overlapping and hence to inaccuracy. In further study, an enhanced version of SMOTE can be implemented for higher dimensions.

Conclusion and Future Work

The services and systems provided for cervical cancer require accurate and reliable consideration to meet the expected standard. Measuring the evaluation metrics of the features is not easy, since they carry various uncertainties; it is a difficult and ambiguous task. To balance the imbalanced dataset, SMOTE is applied, and the results are visualized using the RSOnto ontology, which improves the quality of the metrics.

We presented a brief evaluation of the metrics; in future work this can be made more efficient and accurate with several algorithms and various case studies.

References

1. Cancer Facts & Figures, American Cancer Society, Atlanta, GA, USA, 2018.
2. Saha, A., Chaudhury, A. N., Bhowmik, P., and Chatterjee, R., Awareness of cervical cancer among female students of premier colleges in Kolkata, India. Asian Pacific J. Cancer Prevention 11(4):1085–1090, 2010.
3. El-Moselhy, E. A., Borg, H. M., and Atlam, S. A., Cervical cancer: Sociodemographic and clinical risk factors among adult Egyptian females. J. Oncol. Res. Treat. 1(1):7, 2016.
4. Siegel, R. L., Miller, K. D., and Jemal, A., Cancer statistics, 2018. CA Cancer J. Clin. 68(1):7–30, Jan. 2018.
5. Vimal, S., Kalaivani, L., and Kaliappan, M., Collaborative approach on mitigating spectrum sensing data hijack attack and dynamic spectrum allocation based on CASG modeling in wireless cognitive radio networks. Cluster Computing, 2017. https://doi.org/10.1007/s10586-017-1092-0.
6. Mariappan, E., Kaliappan, M., and Vimal, S., Energy efficient routing protocol using Grover's searching algorithm using MANET. Asian Journal of Information Technology 15(24), 2016.
7. Kaliappan, M., and Paramasivan, B., Enhancing secure routing in Mobile Ad Hoc Networks using a Dynamic Bayesian Signalling Game model. Journal of Computers & Electrical Engineering 41:301–313, 2015.
8. Paramasivan, B., VijuPrakash, M. J., and Kaliappan, M., Development of a secure routing protocol using game theory model in mobile ad hoc networks. Journal of Communications and Networks 17(1), 2015.
9. Kaliappan, M., Augustine, S., and Paramasivan, B., Enhancing energy efficiency and load balancing in mobile ad hoc network using dynamic genetic algorithms. Journal of Network and Computer Applications 73:35–43, 2016.
10. SudhakarIlango, S., Vimal, S., Kaliappan, M., and Subbulakshmi, P., Optimization using Artificial Bee Colony based clustering approach for big data. Cluster Computing. https://doi.org/10.1007/s10586-017-1571-3.
11. Tseng, C.-J., Lu, C.-J., Chang, C.-C., and Chen, G.-D., Application of machine learning to predict the recurrence-proneness for cervical cancer. Neural Comput. Appl. 24(6):1311–1316, May 2014.
12. Hu, B. et al., A risk evaluation model of cervical cancer based on etiology and human leukocyte antigen allele susceptibility. Int. J. Infect. Dis. 28:8–12, 2014.
13. Sharma, S., Cervical cancer stage prediction using decision tree approach of machine learning. Int. J. Adv. Res. Comput. Commun. Eng. 5(4):345–348, 2016.
14. Sobar, S., Machmud, R., and Wijaya, A., Behavior determinant based cervical cancer early detection with machine learning algorithm, in Proc. 4th Int. Conf. Internet Services Technol. Inf. Eng., vol. 4, pp. 3120–3123, Jun. 2016.
15. Kannan, N., Sivasubramanian, S., Kaliappan, M., Vimal, S., and Suresh, A., Predictive big data analytic on demonetization data using support vector machine. Cluster Comput., March 2018. https://doi.org/10.1007/s10586-018-2384-8.
16. Wu, W., and Zhou, H., Data-driven diagnosis of cervical cancer with support vector machine-based approaches. IEEE Access 5:25189–25195, 2017.
17. Lin, W.-Z., Fang, J.-A., Xiao, X., and Chou, K.-C., iDNA-Prot: Identification of DNA binding proteins using random forest with grey model. PLoS ONE 6(9):e24756, Sep. 2011.
18. Seera, M., and Lim, C. P., A hybrid intelligent system for medical data classification. Expert Syst. Appl. 41(5):2239–2249, Apr. 2014.
19. Breiman, L., Random forests. Mach. Learn. 45(1):5–32, 2001.
20. Biau, G., Analysis of a random forests model. J. Mach. Learn. Res. 13:1063–1095, Apr. 2012.
21. Breiman, L., Friedman, J. H., Olshen, R., and Stone, C. J., Classification and Regression Trees. Belmont, CA, USA: Wadsworth, 1984.
22. Genuer, R., Poggi, J.-M., and Tuleau, C., Random forests: Some methodological insights, INRIA, Saclay, France, Res. Rep. RR-6729, Nov. 2008.
23. Liaw, A., and Wiener, M., Classification and regression by random forest. R Newslett. 2(3):18–22, 2002.
24. Suresh, A., Udendhran, R., Balamurgan, M. et al., J Med Syst 43(165), 2019. https://doi.org/10.1007/s10916-019-1302-9.
25. Suresh, A., Udendhran, R., and Balamurgan, M., Soft Comput, 2019. https://doi.org/10.1007/s00500-019-04066-4.
26. Kotu, V., and Deshpande, B., Predictive Analytics and Data Mining. San Mateo, CA, USA: Morgan Kaufmann, 2015, pp. 63–163.
27. Kavitha, R., and Kannan, E., An efficient framework for heart disease classification using feature extraction and feature selection technique in data mining, in Proc. Int. Conf. Emerg. Trends Eng., Technol. Sci. (ICETETS), Pudukkottai, India, pp. 1–5, 2016.
28. Zhang, C., Li, Y., Yu, Z., and Tian, F., Feature selection of power system transient stability assessment based on random forest and recursive feature elimination, in Proc. IEEE PES Asia Pacific Power Energy Eng. Conf. (APPEEC), Xi’an, China, pp. 1264–1268, 2016.
29. Guyon, I., Weston, J., Barnhill, S., and Vapnik, V., Gene selection for cancer classification using support vector machines. Mach. Learn. 46(1–3):389–422, 2002.
30. Díaz-Uriarte, R., and de Andrés, S. A., Gene selection and classification of microarray data using random forest. BMC Bioinf. 7(1):3, Jan. 2006.
31. Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P., SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1):321–357, 2002.
32. Cieslak, D. A., Chawla, N. V., and Striegel, A., Combating imbalance in network intrusion datasets, in Proc. IEEE Int. Conf. Granular Comput., pp. 732–737, 2006.
33. Fallahi, A., and Jafari, S., An expert system for detection of breast cancer using data preprocessing and Bayesian network. Int. J. Adv. Sci. Technol. 34(9):65–70, 2011.
34. Liu, Y., Chawla, N. V., Harper, M. P., Shriberg, E., and Stolcke, A., A study in machine learning from imbalanced data for sentence boundary detection in speech. Comput. Speech Lang. 20:468–494, Oct. 2006.
35. Chase, D. M., Kalouyan, M., and DiSaia, P. J., Colposcopy to evaluate abnormal cervical cytology in 2008. Am. J. Obstet. Gynecol. 200(5):472–480, May 2009. https://doi.org/10.1016/j.ajog.2008.12.025. PMID 19375565.
36. Schiller's test at Who Named It?
37. Vimal, S., Kalaivani, L., Kaliappan, M., Suresh, A., Gao, X.-Z., and Varatharajan, R., Development of secured data transmission using machine learning based discrete time partial observed markov model and energy optimization in Cognitive radio networks. Neural Comput. Applic., 2018. https://doi.org/10.1007/s00521-018-3788-3.


Author information

Authors and Affiliations

Bharath Institute of Higher Education and Research, Tamil Nadu, India
R. Geetha
Mohamed Sathak A J Engineering College, Chennai, India
S. Sivasubramanian
Department of Computer Science and Engineering, Ramco Institute of Technology, Rajapalayam, India
M. Kaliappan
Department of Information Technology, National Engineering College, Kovilpatti, India
S. Vimal
Department of CSE, Nehru Institute of Engineering and Technology, Coimbatore, India
Suresh Annamalai


Corresponding author

Correspondence to S. Sivasubramanian.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection on Patient Facing Systems


About this article

Cite this article

Geetha, R., Sivasubramanian, S., Kaliappan, M. et al. Cervical Cancer Identification with Synthetic Minority Oversampling Technique and PCA Analysis using Random Forest Classifier. J Med Syst 43, 286 (2019). https://doi.org/10.1007/s10916-019-1402-6


Received: 08 March 2019
Accepted: 25 June 2019
Published: 17 July 2019
DOI: https://doi.org/10.1007/s10916-019-1402-6
