1 Introduction

Human trafficking is a global problem affecting millions of people worldwide [29]. It is a form of modern slavery that involves the exploitation of individuals for various purposes, including forced labor, sexual exploitation, and organ removal [43]. In recent years, social networks have emerged as a key platform for human traffickers to operate and reach potential victims [39]. Researchers and policymakers are therefore interested in the role that social networks and other internet-based platforms play as facilitators of this crime, and efforts are being made to employ various methodologies to identify and prevent these forms of exploitation.

The misuse of technology at all stages of human trafficking is increasing. For example, traffickers use deception techniques to hide their identities and avoid detection [25]. They can also use social networks to lure potential victims, often employing emotional manipulation and other tactics to gain their trust [43]. In addition, the vast amount of data generated on social networks can make it difficult to identify and trace traffickers and their victims [4].

However, social media can also serve as a valuable tool for raising awareness of human trafficking and gathering support for victims and survivors [35]. Through social media campaigns and the sharing of information and resources, individuals and organizations can take action to combat human trafficking. Additionally, social media provides a space for survivors to share their stories and connect with others, helping to break down isolation and stigma. While social media functionalities have the potential to play a significant role in anti-trafficking efforts, it is important to recognize their limitations and the need for broader actions to address the root causes of human trafficking and provide support to victims [18].

In this context, technological advances present new opportunities for the detection and prevention of human trafficking. For instance, data generated on social networks can be utilized to identify patterns and trends in human trafficking activity [39]. Also, Machine Learning (ML) techniques and other data-driven approaches can be employed to create algorithms capable of automatically identifying and flagging suspicious activity on social networks, helping to address this social problem [17]. These technologies can also aid in the detection of human trafficking advertisements and the identification of relevant keywords used on social media to facilitate these crimes [47].

In this paper, we present a comprehensive review of the current work that explores the application of Machine Learning techniques in social networks and other platforms to tackle the issue of human trafficking. Additionally, we delve into the challenges and opportunities associated with using ML approaches to detect and prevent human trafficking on social networks. The article is organized as follows: Sect. 2 presents the fundamental knowledge about ML techniques. In Sect. 3, we describe the methodology to follow to collect the related work. Section 4 reviews the existing research on ML techniques that address human trafficking on social media. Finally, in Sect. 5, we conclude the paper and discuss future directions for research in this area.

2 Background Study

This section provides definitions of human trafficking, discusses how this phenomenon manifests on social networks, and outlines the concepts of Machine Learning, the various types of learning, and the techniques employed in each.

2.1 Human Trafficking on Social Media

Human trafficking is characterized by the use of force or coercion to recruit, transport, and exploit individuals for various purposes, including sexual exploitation, forced labor, and organ removal. It involves the abuse of power or vulnerability and the offering or receiving of payments or benefits in exchange for consent from those in positions of authority. Exploitation can include sexual exploitation, such as prostitution, as well as other forms, including forced labor, slavery, servitude, and organ removal [43].

Social media platforms have become increasingly important in recruiting and grooming potential trafficking victims. For example, traffickers may use social media to lure individuals with false promises of employment or romantic relationships and then exploit them once they have been recruited [43]. In some cases, traffickers may use social media to advertise their victims for sexual exploitation or forced labor or to arrange transportation to different locations [3].

There are several ways in which social media can facilitate human trafficking [43]. First, it allows traffickers to reach a broad audience and target specific demographics, such as young people or those vulnerable due to economic or social circumstances. Second, social media can provide anonymity and secrecy, allowing traffickers to operate without detection. Third, it can be used to obscure the true nature of the exploitation, for example, by presenting it as legitimate work or a consensual relationship.

Combating human trafficking on social media requires a multifaceted approach involving governments, law enforcement agencies, stakeholders, and technology companies. They must work together to combat human trafficking and protect victims’ rights. In addition, efforts to raise awareness and educate the public about the risks of human trafficking on social media can help prevent individuals from falling victim to these crimes [43]. Human trafficking is a severe global problem that the proliferation of social media has exacerbated.

2.2 Machine Learning Techniques

Machine Learning (ML) techniques involve using algorithms and statistical models to analyze and identify patterns in large amounts of data [15]. These techniques have been used in various fields, including image recognition [11], speech recognition [21], sentiment analysis [30, 31], and predictive modeling [22]. They can also be used in the battle against human trafficking: for instance, to examine the content of social media posts and flag those that might relate to trafficking [24], to identify patterns in the movements of individuals that may indicate trafficking activity [5, 10, 12], to detect anomalies in employment records that could indicate the presence of forced labor [34], to predict the likelihood of individuals becoming victims of trafficking [38, 42], and to identify potential victims before they are exploited [27]. ML techniques are categorized into supervised, unsupervised, and semi-supervised learning.

  • Supervised Learning: It is a subcategory of ML that uses labeled training data to predict outputs or classify data into predefined categories. The goal is to build a model to make accurate predictions or classifications based on input data. This is achieved by providing the model with a large set of labeled training examples consisting of input data and the corresponding correct outputs or classifications. The model is then trained to learn the relationship between the input data and the outputs or classifications, using this training data as a guide.

    In the human trafficking domain, several supervised learning algorithms have been used, including Linear Regression, Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Naive Bayes (NB), Random Forest (RF), Neural Networks (NN), Decision Trees (DT), and AdaBoost.

  • Unsupervised Learning: It is a type of ML in which a model is trained to discover patterns and relationships in a dataset without using labeled training examples [7]. In this learning, the goal is to find hidden patterns in the data rather than to make specific predictions or classifications [14]. This is achieved by providing the model with a large dataset and allowing it to learn the underlying structure of the data through techniques such as clustering [23] or dimensionality reduction [33].

    It is also important to evaluate the model’s performance using appropriate evaluation metrics, such as silhouette scores (for clustering) or reconstruction errors (for dimensionality reduction) [36].

  • Semi-supervised Learning: It is a type of ML that lies between supervised and unsupervised learning. It uses both labeled and unlabeled data to improve the accuracy of predictions or classifications; the goal is to leverage the available labeled data to make better predictions or classifications on the unlabeled data, using techniques such as self-training or co-training [44].

    There are several semi-supervised learning algorithms, including self-training, co-training, and multi-view learning algorithms [44]. Self-training algorithms use a single model trained on labeled and unlabeled data [26]. Co-training algorithms use two or more models trained on different views of the data to label the unlabeled data iteratively [28]. Multi-view learning algorithms use multiple models trained on different views of the data, whose outputs are combined into a final prediction or classification [48]. A minimal self-training sketch follows this list.
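The sketch below illustrates the self-training idea with scikit-learn on invented toy data; the features, labels, and confidence threshold are assumptions for demonstration only, with unlabeled samples marked by -1 as the library expects.

```python
# A minimal self-training sketch: a base classifier is fit on the labeled
# samples, then iteratively pseudo-labels the unlabeled ones (y == -1).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[0.10], [0.20], [0.90], [1.00], [0.15], [0.85]])
y = np.array([0, 0, 1, 1, -1, -1])  # -1 marks unlabeled samples

model = SelfTrainingClassifier(LogisticRegression(), threshold=0.7)
model.fit(X, y)  # pseudo-labels unlabeled samples it is confident about

print(model.predict([[0.12], [0.95]]))
```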

3 Methodology

The scope of this work is to review and compile previous approaches that address human trafficking detection on social networks using ML techniques, and to extract the main insights and trending directions used to tackle this problem. To this end, we define the following steps:

  1. Select the most relevant papers in this area based on a set of selection criteria.

  2. Provide a deep insight into the main aspects used to analyze human trafficking on social media with ML.

3.1 Paper Selection

The selection criteria required that each work satisfy the following aspects:

  • The work seeks to help address the problem of human trafficking, including sexual exploitation, forced labor, and modern slavery.

  • One of the methods used to address the human trafficking problem must be a Machine Learning technique.

  • The data used must come from a social network or a website accessible to anyone.

  • The work was published in 2016 or later.

The databases and repositories employed for this investigation included Scopus, ScienceDirect, ArXiv, IEEE Xplore, and SpringerLink. This study aims to comprehensively review existing research on human trafficking by employing these resources and combining relevant variables. Specifically, the fixed variable “human trafficking” was combined with three descriptive variables, namely “social media”, “social networks”, and “machine learning”, to create multiple search queries. The use of these variables allowed for the identification of a diverse range of relevant papers. At the end of this search, 23 papers published between 2016 and 2022 were selected.

3.2 Main Aspects

This work has considered six general aspects to analyze in each reviewed paper to identify the most relevant contributions. These aspects include the type of data used, the number of classes, the model or algorithm employed, the dataset utilized, the number of observations in the dataset, and the metrics that the paper considered to evaluate the performance of the ML algorithm.

4 Main Findings

This section analyzes, for each general aspect defined above, the various sub-aspects to consider when addressing human trafficking on social networks through ML algorithms.

4.1 Supervised Approaches

Table 1 summarizes the works using supervised algorithms. Supervised ML algorithms are commonly used for detecting human trafficking. However, obtaining labeled data for human trafficking is challenging, as it involves sensitive information and can potentially put individuals at risk. Despite this challenge, several approaches have been taken to obtain labeled data, including using pre-existing data or manually labeling data pulled from social media platforms like Twitter.

Table 1. Supervised Machine Learning Approaches to Address Human Trafficking on Social Media.

The data extraction typically involves web scraping techniques or social network APIs, followed by pre-processing steps such as removing links and non-relevant characters like emoticons and emojis. The labeled data is then fed into the supervised ML algorithm for further analysis. Once the data is ready, it is commonly split into training and testing subsets or, in some cases, into three subsets: training, validation, and testing. The proportion of the split varies between researchers, but a common split is 80% for training and 20% for testing.

Supervised learning algorithms in human trafficking detection include Support Vector Machines (SVM) [1, 5, 8, 9, 10, 12, 13, 38, 40, 49], Logistic Regression (LR) [1, 5, 9, 40], Random Forest (RF) [1, 5, 8, 9, 40], Gaussian Naive Bayes (NB) [8, 9, 10, 12, 47], and Artificial Neural Networks [8, 47]. These algorithms form the basis for supervised learning in classification, detection, and regression tasks.

To evaluate the performance of the classifiers, metrics such as Precision, Recall, F1-score, Accuracy, and AUC are commonly used. These metrics provide insights into the algorithm’s effectiveness in detecting human trafficking and can guide future improvements in the methodology. Overall, supervised learning approaches have proven effective in detecting human trafficking. Despite the challenges of obtaining labeled data, various approaches have been taken to acquire it, such as using pre-existing datasets and manually labeling data extracted from social media platforms.
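To make the workflow above concrete, here is a minimal sketch over a tiny hypothetical set of labeled ads; in the reviewed works the texts would come from scraped pages or social network APIs, and the classifier, features, and split ratio are only illustrative choices.

```python
# A minimal supervised sketch: 80/20 split, TF-IDF features, an SVM
# classifier, and the common evaluation metrics. LogisticRegression or
# RandomForestClassifier could be swapped in for SVC in the same pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts = ["new girls in town call anytime", "used car for sale low mileage",
         "discreet service no questions asked", "apartment for rent downtown"] * 25
labels = [1, 0, 1, 0] * 25  # 1 = suspicious ad, 0 = benign (illustrative)

# Common 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

clf = make_pipeline(TfidfVectorizer(), SVC(probability=True))
clf.fit(X_train, y_train)

# Precision, recall, F1-score, and accuracy per class, plus AUC.
print(classification_report(y_test, clf.predict(X_test)))
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```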

4.2 Unsupervised Approaches

Table 2 summarizes the works using unsupervised algorithms. These techniques have also been used to detect human trafficking because, unlike supervised learning, they do not require labeled data to train the algorithms. For instance, clustering algorithms [19, 20] are commonly used to group similar data together based on shared features; one of the most used clustering algorithms in human trafficking detection is k-means. Clustering [19, 20] and anomaly detection [6, 37] can be effective tools for identifying instances of human trafficking when labeled data is unavailable, and they can also be used in conjunction with supervised learning algorithms to improve accuracy.
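Below is a minimal sketch of the k-means approach just described, grouping a handful of hypothetical posts by TF-IDF similarity; the texts and cluster count are illustrative assumptions, not data from the reviewed works.

```python
# A minimal k-means clustering sketch over TF-IDF features; no labels
# are required, and posts with similar wording fall in the same cluster.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [  # hypothetical, illustrative texts only
    "new massage girls available tonight",
    "massage and spa girls available now",
    "football team wins the match tonight",
    "the football team played a great match",
]

X = TfidfVectorizer().fit_transform(posts)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # posts in the same cluster share vocabulary
```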

Once the unsupervised techniques have been applied, the researcher can further refine the results by manually examining the data points flagged as potential instances of human trafficking. This manual approach can eliminate false positives and increase the accuracy of the results.

Table 2. Unsupervised Machine Learning Approaches to Address Human Trafficking on Social Media.

4.3 Semi-supervised Approaches

Table 3 summarizes the works using semi-supervised techniques, which combine unsupervised and supervised methods and have shown promising results in human trafficking detection. One such hybrid approach is to use unsupervised techniques for word embedding and supervised techniques for the detection task.

Table 3. Semi-supervised Machine Learning Approaches to Address Human Trafficking on Social Media.

Word embedding is a technique for representing words in a vector space in which words with similar meanings lie closer to each other. Text representation and embedding techniques such as bag of words [41], TF-IDF, word2vec, FastText, and skip-grams [46] are commonly used. In the context of human trafficking detection, word embeddings can represent words and phrases commonly associated with human trafficking, such as “sex trafficking” or “forced labor”.
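As an illustration of the embedding step, the sketch below trains a word2vec skip-gram model with the gensim library on a tiny invented corpus; the corpus, vector size, and hyperparameters are assumptions for demonstration only.

```python
# A minimal word-embedding sketch: a word2vec skip-gram model (sg=1)
# learns vectors in which related words lie close together.
from gensim.models import Word2Vec

corpus = [  # hypothetical tokenized posts
    ["new", "girls", "massage", "spa", "available"],
    ["job", "offer", "abroad", "forced", "labor"],
    ["escort", "ads", "sex", "trafficking", "online"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
vector = model.wv["trafficking"]  # 50-dimensional vector for one word
print(model.wv.most_similar("trafficking", topn=2))
```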

Once the word embeddings are generated, they can be used as input to a supervised learning algorithm, such as an ANN or an SVM [1, 5]. The supervised algorithm is trained on labeled data to learn the relationship between the embeddings and the presence of human trafficking activity. The labeled data can be obtained using the methods described earlier, such as manual labeling or web scraping.

The advantage of a semi-supervised approach is that it leverages the strengths of both unsupervised and supervised techniques: unsupervised techniques can generate high-quality word embeddings, while supervised techniques learn the relationship between the embeddings and human trafficking activity. This approach can be especially effective when labeled data is limited, as it augments the available labeled data with embeddings generated from a larger, unlabeled dataset.

Moreover, a small amount of labeled data can be used to train a model, which is then used to label a larger amount of unlabeled data, as in the work of [16]. A new model can then be trained on the combined labeled and newly labeled data, and this iterative process can be repeated until the model’s accuracy is satisfactory.

4.4 Datasets

For data extraction, researchers use public web pages that may contain indications of human trafficking, such as adult-services and pornographic pages where sex work is offered [16]. Another common avenue for recruiters is social media, where they can more easily contact potential victims by posing as friends, acquaintances, or job recruiters. Traffickers make the widest use of microblogging platforms because these allow more meaningful interaction between strangers; since users share everyday thoughts there, traffickers can more easily identify potential victims and form relationships with them.

One of the most commonly used sites for data extraction using web scraping techniques is Backpage.com, a classified-ads website founded in 2004 that allowed users to post ads in categories such as personals, automotive, rentals, jobs, and adult services. The latter category was used by human trafficking rings to recruit potential victims for their networks, and for this reason it is widely used to extract advertisements related to human trafficking, as in [1].

Other sources used for data mining are news sites and advertisement listings [6, 8], which are known to contain job offers that traffickers use to recruit new victims. Likewise, Yelp reviews are used to detect places where sexual services are provided, based on keywords such as massage, spa, and so on.

Finally, social media contain a wealth of data that can be used to detect this problem. One of the most widely used data sources is Twitter [5, 6, 10, 12, 13, 19, 38, 41, 45], which offers a freely available API that facilitates the retrieval of tweets using queries and keywords, without resorting to web scraping techniques.
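A minimal retrieval sketch is shown below, assuming the tweepy library and Twitter API v2 credentials; the bearer token and query are placeholders, and current API access tiers may restrict or charge for this endpoint.

```python
# A minimal keyword-based tweet retrieval sketch with tweepy (API v2).
import tweepy

# Placeholder credential; a real bearer token is required.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Illustrative query with keywords of the kind used in this literature.
query = '("massage" OR "escort") lang:en -is:retweet'
response = client.search_recent_tweets(query=query, max_results=100)

for tweet in response.data or []:
    print(tweet.id, tweet.text[:80])
```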

Data Labeling. Due to the complexity of the problem of human trafficking, publicly available labeled data is very hard to find, since it may contain sensitive information about persons who may or may not be involved, such as phone numbers, addresses, and names. Therefore, to request such data, researchers usually must explain to the authors that it is for research purposes, describe their area of study, and state what they plan to do with the data.

Manual data labeling relies on hand-checking each piece of data and assigning it a label indicating whether or not it is related to human trafficking. This process depends heavily on the individual’s judgment, so the assigned labels may be biased. It therefore typically requires a person with human trafficking expertise to review and label the data. Because the process demands manual review, the amount of final labeled data is limited: few people with experience in human trafficking are willing to go through thousands or millions of data points and label them, and not all universities, research centers, or research groups have personnel with the necessary expertise to carry out this task.

In human trafficking research, there is a public dataset (available upon request to the authors) with labeled data that is used to train ML models to detect ads indicating human trafficking. This dataset, called Trafficking-10k, comprises 10,000 ads and seven labels: Certainly not, Probably not, Weakly no, Unsure, Weakly yes, Probably yes, Certainly yes. Several papers have used this dataset in their work [18, 40, 46, 47, 49].

4.5 Pre-processing

Researchers commonly follow several preprocessing steps when classifying text. These steps aim to clean and transform the raw text data into a form suitable for further analysis. Some of the most commonly used preprocessing steps are listed below; a minimal sketch after the list illustrates them:

  • Tokenization: It is the process of breaking a text into individual words or tokens. In tweets, tokenization can be challenging due to the presence of emoticons, hashtags, and mentions, so specialized tokenization techniques are often required. The works of [5, 10, 13] tokenize the text of tweets to help the model better capture the content and gain performance.

  • Stopword Removal: Stopwords are common words that do not carry much meaning, such as “the,” “a,” and “an.” Removing them can reduce the dimensionality of the data and improve the efficiency of subsequent analysis steps. However, the effectiveness of stopword removal has been debated in the literature, with some studies suggesting that it may harm the performance of classification models. The works [8, 9, 38] extract data from social media platforms and remove stopwords to lower the dimensionality of the model’s input and retain more meaningful data.

  • Stemming/Lemmatization: Stemming and lemmatization are techniques for reducing words to their root form. This can help reduce the data’s sparsity and improve the accuracy of subsequent analysis steps. However, stemming can also result in the loss of information, and lemmatization can be computationally expensive. Therefore, the choice of stemming/lemmatization technique may depend on the specific task and dataset.

  • Removing URLs, emojis, mentions, and hashtags: Tweets often contain URLs, mentions, emojis, and hashtags, which can be irrelevant or misleading for classification tasks; therefore, these elements are often removed before analysis. The works [12, 41] extract data from Twitter, which carries much noisy content of this kind that can degrade the performance of ML models.

  • Spell correction: Text from social media often contains misspellings and abbreviations, making it challenging to analyze the data accurately. Therefore, spell correction techniques, such as spell checkers, can improve the accuracy of subsequent analysis steps.

  • Normalization: Normalization refers to standardizing text data, which typically involves converting text to lowercase, removing punctuation, and replacing numbers with their word equivalents. The main goal of normalization is to reduce the variability in the text data and make it easier to process.
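The sketch below strings the steps above together for tweet-like text; it assumes the NLTK library with its 'punkt' and 'stopwords' resources, and the example tweet is invented for illustration.

```python
# A minimal preprocessing sketch: normalization, URL/mention/hashtag
# removal, tokenization, stopword removal, and stemming.
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    text = text.lower()                                  # normalization
    text = re.sub(r"http\S+|www\.\S+", " ", text)        # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # remove mentions/hashtags
    text = text.encode("ascii", "ignore").decode()       # drop emojis/non-ASCII
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)                         # tokenization
    return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Massage jobs available NOW!! DM @recruiter https://t.co/x #work"))
```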

4.6 Machine Learning Tasks

Human trafficking, a horrific crime that affects millions of individuals globally, relies on social media as a significant recruiting and victimization tool. Machine learning has become a potent weapon in the fight against human trafficking on social media, giving researchers the power to sift through massive volumes of data and spot possible instances of trafficking activity. Machine learning algorithms automate and streamline the detection and examination of probable human trafficking activity. In this context, ML can handle the following tasks (a small topic-modeling sketch follows the list):

  • Text Classification: ML algorithms are trained to automatically classify social media posts, comments, and messages as potential cases of human trafficking. For example, a study by [49] used supervised machine learning to classify online escort ads as either indicative of sex trafficking or not.

  • Entity extraction: ML is used to extract entities related to human trafficking, such as locations, names, and phone numbers, from social media posts. This can help to identify potential victims or perpetrators of trafficking. For example, a study by [47] used machine learning to extract entities related to human trafficking from ads.

  • Network analysis: ML techniques analyze the connections and interactions between individuals and groups involved in human trafficking on social media. For example, a study by [9] used network analysis and machine learning to identify the most influential countries in the human trafficking network.

  • Image analysis: ML is used to analyze images and identify potential instances of human trafficking. For example, a study by [10] used machine learning to analyze online images and identify potential victims of sex trafficking.

  • Topic modeling: ML techniques are employed to identify and analyze the topics and themes in social media posts related to human trafficking. This can help to identify patterns and trends in trafficking activity, as well as to understand the experiences and perspectives of victims and survivors. For example, the studies [6, 12] used topic modeling to analyze Twitter data related to human trafficking.

  • Sentiment analysis: ML techniques are used to analyze the sentiment of social media posts related to human trafficking, such as whether they express positive or negative emotions. This can help to identify potential victims or perpetrators of trafficking, as well as to understand the public perception of human trafficking. For example, a study by [5] used sentiment analysis to identify behavioral patterns related to human trafficking from social media posts.

  • Predictive modeling: ML techniques are employed to predict the likelihood of human trafficking activity based on social media data and to identify potential victims and perpetrators. For example, a study by [19] used machine learning with language-independent features to detect Twitter bots and predict human trafficking activity from online escort ads.
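As an example of the topic-modeling task, the sketch below fits an LDA model over a tiny invented corpus; the posts and topic count are illustrative assumptions rather than data from the cited studies.

```python
# A minimal topic-modeling sketch with Latent Dirichlet Allocation (LDA).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = [  # hypothetical, illustrative texts only
    "new spa massage girls available call now",
    "private massage service girls open late",
    "awareness campaign report trafficking hotline support survivors",
    "support survivors join the anti trafficking campaign",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)

# Print the top words of each discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")
```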

5 Conclusions

Using machine learning techniques to assess human trafficking on social media has shown promising results. Researchers have utilized supervised, unsupervised, and semi-supervised learning methods to analyze extensive datasets from social media platforms, aiming to identify potential victims and traffickers and to understand the patterns and networks of trafficking activity. These methods have demonstrated high accuracy and efficiency in detecting potential cases of human trafficking and have the potential to assist law enforcement agencies in their efforts to combat this horrible crime. However, challenges remain in ensuring the ethical use of data, selecting the sites from which data is extracted, assigning the corresponding labels to the data, and developing models that can adapt to the dynamic and evolving nature of human trafficking networks. Nevertheless, the use of machine learning in this field has opened new avenues for understanding and combating human trafficking and holds great potential for further advances. Moving forward, there is much work to do; one potential direction is the development of hybrid models that combine multiple Machine Learning techniques to improve the accuracy and efficiency of trafficking assessments.