
6.1 Introduction

Human trafficking cases have been increasing continuously. Traffickers target victims regardless of age, gender, and race; however, most reported victims worldwide are the most vulnerable, women and girls, and most are trafficked for sexual exploitation [1, 2]. In the Philippines, commercial sexual exploitation usually occurs near offshore gaming operations and tourist destinations [3]. Sex trafficking is now also prevalent in cyberspace, leading to many cybersex trafficking cases. The popularity of, and free access to, many online platforms such as social media and video-sharing sites have opened many opportunities for traffickers. Activities in chat rooms, social networking sites, online advertisements, and other social media sites have enabled traffickers to reach more victims [4].

As cyber trafficking continues unabated, the data related to these cases keep growing. However, there is no assurance that all the data stored in the government's database are "good" data because, according to [5], the data management process is poor. The challenge of obtaining good data for analysis paves the way for developing tools for information extraction, data mining, and machine learning. Experts can use these techniques to identify patterns with pertinent information on human trafficking on the internet [6]. In addition, a specialized vocabulary is used to signal human trafficking activity. For example, the terms "pimp" or "madam" indicate the probable trafficker, a "provider" refers to the person being sold, and "johns" refers to the customers of this online trafficking business [7]. Recognizing the challenges faced by the government and the opportunities to help, many researchers have analyzed the activities and strategies of traffickers. Thus, many studies have applied advanced technologies and innovations such as sentiment analysis [8,9,10] and natural language processing [11]. Most studies focused on a single platform, such as social media messages [12], the dark web [13], website advertisements [14, 15], open internet sources [16], and Twitter posts [17].

Data quality and the learning algorithms' efficacy both determine how successful a machine-learning solution will be [18]. Therefore, comparing multiple machine learning algorithms helps determine which ones may be combined to produce the most accurate predictive model for monitoring websites. Studies presenting classification predictions of cyber-trafficking websites are still limited, and most focus on only one platform or one gateway of cyber-trafficking. This paper therefore presents the development and comparison of Naive Bayes, Logistic Regression, KNN, and SVM classification models to predict trafficking and non-trafficking websites.

6.2 Review of Related Studies and Literature

This section presents reviews of literature and studies related to trafficking and analytics.

6.2.1 Review on the Human Trafficking Cases

The data for the visualizations in this section was derived from the Counter-Trafficking Data Collaborative (CTDC), the world’s first global data hub on human trafficking, which publishes standardized data from anti-trafficking groups worldwide [19, 20].

Figure 6.1 shows that traffickers do not target only the most vulnerable in society.

Fig. 6.1 Pie chart of the gender distribution of victims (female: 35,534; male: 13,267)

Human trafficking takes different forms and is active in both online and offline global trading. Figure 6.2 shows that sexual exploitation is the largest market for human traffickers, suggesting that traffickers gain the most profit from this form of trafficking.

Fig. 6.2 Tabulation of the different forms of human trafficking (sexual exploitation: 15,989; forced labor: 8,969; other: 7,063; slavery and similar practices: 359; forced marriage: 168)

Human trafficking is not bound by time and space, as presented in Fig. 6.3; instances occur around the globe. Unfortunately, the Philippines has the highest number of recorded trafficking cases among all nations, as shown in Fig. 6.3, with 11,365 instances, followed by Ukraine with 7,761 instances.

Fig. 6.3 Filled map chart of global trafficking cases (Philippines: 11,365 records). The chart notes that human trafficking has a global dimension and affects people of all races, religions, social classes, and education levels.

Figure 6.4 indicates that the vast majority of victims did not provide specific information about their relationship with their recruiters. Unfortunately, among the specified relationships, victims' own relatives are a primary channel through which victims are delivered into misery in exchange for monetary compensation.

Fig. 6.4 Bar chart of the recruiter's relationship to the victim; the largest category, "not specified", has 8,730 records, and the smallest has 24

6.2.2 Review on the Analytics to Combat Human Trafficking

Predictive analytics is the branch of advanced analytics used to predict unknown future events. This study used predictive analytics to identify potential trafficking websites using predetermined keywords. As in Search Engine Optimization (SEO), where keywords can be used to predict top search results, the study in [21] integrated keyword analysis with machine learning to predict SEO rankings. The same approach was applied in this study: a set of keywords was identified to flag websites containing terms most probably used by online traffickers.

Naive Bayes shows satisfactory results even on small-scale datasets [22] and was therefore used in this study. In addition, Naive Bayes classifiers are more resilient to missing data than support vector machines [20]. Other machine learning algorithms, namely logistic regression, KNN, and SVM, were also employed to evaluate their performance in predicting cyber trafficking.

6.3 Development of Cyber Trafficking Websites Classification Models

Naive Bayes, Logistic Regression, KNN, and SVM classifiers were used to develop the models for classifying cyber-trafficking websites, that is, for predicting whether a given website is a trafficking or non-trafficking site.

The data was extracted from websites scraped with an algorithm written in the Python programming language, using 37 keywords marked as red flags for traffickers. The websites were categorized into two types: trafficker and non-trafficker. The categorization was done with the help of an expert who visited the websites with special tools. Due to project time constraints, only 35 of the 63 scraped websites were classified as trafficker. The collected data was formatted into a spreadsheet.
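A minimal sketch of such a keyword-counting scraper is shown below; the keyword list and URL are illustrative placeholders, not the study's actual 37 red-flag terms.

import requests
from bs4 import BeautifulSoup
from collections import Counter

# Placeholder red-flag terms; the study's actual 37 keywords are not listed here.
RED_FLAG_KEYWORDS = ["escort", "companionship", "new in town"]

def count_keywords(url):
    # Fetch the page and count occurrences of each red-flag keyword in its text.
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text().lower()
    return Counter({kw: text.count(kw) for kw in RED_FLAG_KEYWORDS})

counts = count_keywords("https://example.com")
keyword, keycount = counts.most_common(1)[0]  # keyword with the highest occurrence
print(keyword, keycount)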

To prepare the data for analysis, it was first cleaned by removing less significant fields and converting string data to numeric form, since most of the columns are string-typed. String-type data is encoded so that the machine learning algorithms can perform their arithmetic operations on it; the algorithms used in this study accept only numeric values, which is why a label encoder is used. The data was then separated into two parts: the features, containing the type and keyword data, and the target column, encodedClass, containing the class label. The columns of the dataset are as follows (a minimal encoding sketch follows the list):

  • Url is the link to the website.

  • Type indicates the type of the website or its general description, such as news article, eCommerce, educational, nonprofit, blog, portfolio, portal, or search engine/job search engine.

  • Keywords is the one of the 37 red-flag keywords with the highest number of occurrences on the particular website.

  • Keycounts corresponds to the number of occurrences of that keyword on the website.

  • Date is the date the website was posted. When no date was indicated on the website or page containing the keywords, the date was set to the date of data collection.
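A minimal sketch of the encoding step, under the assumption that the column names mirror the list above and with hypothetical sample values:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows illustrating the dataset's structure.
df = pd.DataFrame({
    "type": ["eCommerce", "blogs", "news article"],
    "keyword": ["escort", "companionship", "escort"],  # placeholder keywords
    "keycount": [12, 3, 7],
    "class": ["trafficker", "non-trafficker", "trafficker"],
})

# Encode string columns to integers so the algorithms can operate on them.
for col in ["type", "keyword", "class"]:
    df[f"encoded_{col}"] = LabelEncoder().fit_transform(df[col])

features = df[["encoded_type", "encoded_keyword", "keycount"]]
encoded_class = df["encoded_class"]  # target: trafficker vs non-trafficker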

The accuracy of the models created with the Naive Bayes Algorithm, Logistic Regression, KNN, and SVM was compared to determine which is best suited to the data at hand. Each algorithm examines and weighs the occurrences of keywords found on trafficking and non-trafficking websites.

6.4 Evaluation of the Model

Using the train_test_split function from the sklearn package with an 80/20 ratio of training to validation data, the dataset was split into training and testing sets. The evaluation yields two classes: class 0 for non-traffickers and class 1 for traffickers.
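A minimal sketch of this step, assuming GaussianNB as the Naive Bayes variant (the chapter does not specify which) and reusing the features and encoded_class columns from the sketch in Sect. 6.3:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# 80/20 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, encoded_class, test_size=0.2, random_state=42
)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))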

Figure 6.5 shows the classification report of the logistic regression model. The model correctly classified 100% of the non-traffickers and, likewise, 100% of the traffickers. These results show that the model does a great job of predicting whether a website is a trafficking or non-trafficking site.

Fig. 6.5 Classification report for logistic regression (class 0: precision 1.00, recall 1.00, F1 score 1.00, support 3; class 1: precision 1.00, recall 1.00, F1 score 1.00, support 4)

Precision and recall on their own leave the accuracy comparison incomplete: comparing two models becomes difficult if one has low precision but high recall, or vice versa. Therefore, to evaluate the results of the accuracy test, the F1 score is used, since it accounts for both the precision and the recall of the predictions.
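For reference, the F1 score is the harmonic mean of precision and recall:

F1 = 2 × (precision × recall) / (precision + recall)

so a model scores high only when both precision and recall are high.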

Figure 6.6 shows the classification report of the K-Nearest Neighbors model. It matches the logistic regression output, with 100% of positives correctly identified and 100% precision on positive predictions, while the support shows a nearly balanced test set of three (3) non-traffickers and four (4) traffickers.

Fig. 6.6 Classification report for K-nearest neighbors (class 0: precision 1.00, recall 1.00, F1 score 1.00, support 3; class 1: precision 1.00, recall 1.00, F1 score 1.00, support 4)

Figure 6.7 shows that, like K-nearest neighbors and logistic regression, the Naive Bayes model correctly identified 100% of positives, with the same precision on positive predictions and the same support.

Fig. 6.7 Classification report for Naive Bayes (class 0: precision 1.00, recall 1.00, F1 score 1.00, support 3; class 1: precision 1.00, recall 1.00, F1 score 1.00, support 4)

Figure 6.8 shows the classification report of the SVM model. The model's precision for non-traffickers is 0.75, meaning 75% of the websites it labeled non-trafficker were correct, while its precision for traffickers is 1.00. The F1 score is near 1 for both classes, indicating that the model is good at predicting both non-traffickers and traffickers.

Fig. 6.8 Classification report for SVM (class 0: precision 0.75, recall 1.00, F1 score 0.86, support 3; class 1: precision 1.00, recall 0.75, F1 score 0.86, support 4)

Figure 6.9 shows the confusion matrix for logistic regression, KNN, and Naive Bayes. These three models share the same classification report and therefore the same confusion matrix. The matrix shows no misclassifications: the three non-trafficker websites and four trafficker websites of the classification report are all correctly classified, for a total of seven correctly classified websites.

Fig. 6.9 Confusion matrix for logistic regression, K-nearest neighbors (KNN), and Naive Bayes (true 0/predicted 0: 3; true 0/predicted 1: 0; true 1/predicted 0: 0; true 1/predicted 1: 4)

A confusion matrix is used because it allows readers to examine the outcomes of an algorithm at a glance: it condenses the analytical findings into a straightforward table that is easy to understand.
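A minimal sketch of producing such a matrix with sklearn, reusing the models and test split from the earlier sketch:

from sklearn.metrics import confusion_matrix

for name, model in models.items():
    cm = confusion_matrix(y_test, model.predict(X_test))
    print(name)
    print(cm)  # rows are true classes (0, 1); columns are predicted classes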

As shown in Fig. 6.10, the SVM model produces one misclassification: one trafficking website is labeled as non-trafficking. The remaining six websites, three non-trafficking and three trafficking, are correctly classified.

Fig. 6.10 Confusion matrix for SVM (true 0/predicted 0: 3; true 0/predicted 1: 0; true 1/predicted 0: 1; true 1/predicted 1: 3)

Figure 6.11 shows the cross-validation results for all the models. KFold validation was used to determine the accuracy of each model. Although three of the models share the same classification report, their cross-validated accuracies still differ; cross-validation is conducted to obtain more information about algorithm performance. The validation shows that Naive Bayes achieves the highest accuracy among all the models.
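A minimal cross-validation sketch, assuming ten folds (the chapter does not state the fold count used) and reusing the models and encoded data from the earlier sketches:

from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, features, encoded_class, cv=kfold)
    print(f"{name}: {scores.mean():.3f}")  # mean accuracy across folds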

Fig. 6.11 Cross-validation results (rounded to three decimals): logistic regression 0.818, SVM 0.636, KNN 0.636, Naive Bayes 0.909

6.5 Conclusion

The presented predictive approach, using four machine learning algorithms, namely the Naive Bayes Algorithm, Logistic Regression, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM), was able to predict trafficking and non-trafficking websites. The evaluation shows that the models perform well, with cross-validated accuracies of 91%, 81%, 64%, and 64%, respectively, in classifying trafficking and non-trafficking websites. Furthermore, although several algorithms produce identical values in their classification reports on the held-out test set, the models themselves are not identical. Precision reflects the repeatability of a measurement, while accuracy indicates how close a measurement is to a known or accepted value; measurements that are both precise and accurate are repeatable and close to the true values.

Moreover, Naive Bayes stands out with the highest accuracy rate, making it the most suitable model; the SVM results are also acceptable, so the two could be used together to obtain more effective predictions. The model can be integrated into a tool that helps law enforcement agencies analyze transactions on the web and identify possible cyber traffickers. It is an excellent line of defense for acting on early signs of human trafficking before it claims a victim. In this way, law enforcement can form a task force to watch over the red flags identified by the analytical models presented. Scrutinizing online transactions would also become much easier for law enforcement agencies, allowing them to act on the concerns of the larger community in need of their assistance. This shows that the model can help proactively combat traffickers roaming cyberspace.