1 Introduction

Between 2004 and 2006, Facebook and Twitter were founded; both platforms marked a turning point in social communication. Since then, new platforms for socio-digital interaction have emerged; these new social networks are also known as social media. Each year more users join social networks, and with them the flow of content and information between people grows. The “15th Study on the Habits of Internet Users in Mexico” [1], presented by the Internet MX Association, reported that there were 82.7 million Internet users in Mexico, of whom 82% (67.8 million) use the Internet to access social networks and 76% (62.8 million) use it to search for information.

Socio-digital interaction has brought with it an increase in practices harmful to social communication, such as misinformation and false news, or fake news; an investigation carried out by the Knight Foundation (2018) revealed that more than one million false tweets are published per day on Twitter. Various civil and government organizations have termed this phenomenon of misinformation an “infodemic”; within the context of the COVID-19 pandemic, international health organizations such as the World Health Organization (WHO) and the Pan American Health Organization (PAHO) issued a document on May 1, 2020, to explain the infodemic [2].

The fight against fake news has taken on greater relevance due to the COVID-19 pandemic; on May 21, 2020, the UN announced in a tweet from its official account the launch of a tool to verify news through the portal “Shareverified.com,” under the motto “There has never been a greater need for accurate and verified information.”

On Twitter, content dissemination campaigns with harmful aims have been generated, such as the spread of disinformation and fake news, the use of bots or automated digital positioning systems, and the dissemination of negative or hateful content, as indicated by Stella et al. [3].

Mønsted et al. [4] determined that socio-digital networks generate structures that propagate content or information, which can be analyzed more efficiently through complex contagion models; they also assume that the probability of content adoption depends on the number of unique sources of information. In this context of complex contagion, content-propagating bots can be identified as news disseminators carrying erroneous information, disinformation, or fake news.

The “Report on disinformation campaigns, fake news and their impact on the right to freedom of expression,” prepared by the National Human Rights Commission of Mexico [5], warns about the use of content that is inaccurate or not grounded in objective facts and that exploits the emotions or beliefs of audiences to attract more “likes” or “retweets” on the Facebook and Twitter platforms; it also notes that most citizens do not have the time, resources, or instruments to verify the content or information they receive in an increasingly connected society.

Bulletin UNAM-DGCS-318 mentions the study entitled “Radiography on the Dissemination of Fake News in Mexico” [6], prepared by UNAM, which states that at least 89% of Twitter users in Mexico have been exposed to this type of content. Fake news affects various areas, such as business marketing, as determined by Visentin et al. [7], who investigated the negative effect on brands that advertise in media or sites that spread fake news; their results indicate that marketing managers should monitor such sites, since the proliferation of fake news constitutes an increasing risk for the business sector. The growing use of socio-digital platforms has brought with it an increase in practices harmful to social communication, such as the spread of fake news, the use of bots, trolling, and artificial positioning strategies. Vosoughi et al. [8] determined that fake news spreads further, faster, and deeper than real news on social networks like Twitter, and that fake news on political issues spreads more than fake news on other social topics.

For the analysis of this growing and continuous volume of digital information, it is necessary to implement machine learning algorithms and other artificial intelligence tools, because they provide great data-processing capacity and have become indispensable support for the development of highly competitive predictive models.

2 Literature Review

To guide the construction of a system for the detection of fake news on Twitter in Mexico, this section condenses the most relevant prior work. Shao et al. [9] analyzed 14 million tweets sharing 400 thousand articles over 10 months between 2016 and 2017; their analysis found evidence that much of the misinformation is due to super-spreaders: social bots that automatically publish links to articles. The analysis tools they used were the Hoaxy and Botometer verification systems, developed by researchers at Indiana University. Davis et al. [10], developers of the BotOrNot system, report that their classifier generates more than 1,000 features from metadata and from information extracted from interaction patterns and content.

The Hoaxy system collects public tweets that contain links to news; the platform is freely accessible and allows large-scale systematic studies of topics or hashtags that are part of a fake news dissemination strategy. Shao et al. [11] used the Hoaxy platform to investigate the dissemination of erroneous information before and after the 2016 US presidential election; based on an analysis of the core of the propagation networks, they determined that the network of users is polarized between true and false information. The dissemination and propagation of fake news cover different areas or themes of society, but they can also be categorized by dimensions, as described by Shu et al. [12], who identified three dimensions (content, social, and temporal). Their research made it clear that fake news is not an insignificant matter, since it is built to deceive readers; from a social network analysis point of view, they propose a method of inoculation against the spread of fake news, which consists of identifying the main propagation nodes, routes, or links, so that strategies for inoculation, blocking, or containment can be created with this information. Ahmed et al. [13] focused on the detection of spam and fake news through text classification, for which they developed a new n-gram model.
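Ahmed et al.'s exact model is not reproduced here, but the core idea behind n-gram text classification, representing a text as counts of overlapping character sequences, can be sketched in a few lines (an illustrative toy, not the cited implementation):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams from a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_features(text, n=3):
    """Count n-gram frequencies, the raw features an n-gram classifier consumes."""
    return Counter(char_ngrams(text, n))

feats = ngram_features("Fake news", n=3)
# "fake news" yields 7 trigrams: "fak", "ake", "ke ", "e n", " ne", "new", "ews"
```

In practice these counts would be assembled into a feature matrix and fed to any of the classifiers discussed below.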

The detection of fake news is not easy; it requires models and systems that can summarize news and compare it with reliable sources in order to categorize it. That is why alternatives are sought, such as stance detection: automatically identifying the relationship between two pieces of text. Thota et al. [14] developed a model using a deep learning neural network architecture, with bag-of-words vectorization feeding a dense neural network to categorize stances; the model showed good results for categorizing headlines and news articles. Altunbey et al. [15] compared more than 20 supervised machine learning algorithms for the classification of fake news and determined that the decision tree algorithm obtained the best result.
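The bag-of-words-plus-classifier idea can be illustrated with a minimal scikit-learn sketch (the example texts and labels are invented, and a decision tree stands in for the many classifiers the cited studies compare):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy corpus; labels 1 = fake, 0 = real (hypothetical examples).
texts = [
    "aliens built the pyramids overnight",
    "city council approves new budget",
    "miracle cure hides from doctors",
    "university publishes enrollment report",
]
labels = [1, 0, 1, 0]

vec = CountVectorizer()                 # bag-of-words vectorization
X = vec.fit_transform(texts)            # sparse document-term matrix
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)

# Classify an unseen headline with the same vocabulary.
pred = clf.predict(vec.transform(["miracle cure found overnight"]))
```

A real system would train on thousands of annotated tweets and evaluate on held-out data rather than toy sentences.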

Oehmichen et al. [16] characterized the accounts that spread fake news; they built a dataset by collecting, over 4 months, tweets related to hashtags of the 2016 US presidential election, and, considering only tweets with more than 1,000 retweets, they assembled a dataset of 9,001 tweets. They found that fake-news-spreading accounts are recently created, mostly unverified, have fewer updates, use strange characters in the name and description, have few followers while following many more, and generally interact mainly through retweets.

Fake news can be used to stifle social protest movements. Zervopoulos et al. [17] used various machine learning techniques, such as naive Bayes, support vector machine, C4.5, and random forest, to classify the linguistic characteristics of fake news; for this, they took tweets in English and Chinese from a Twitter database. Zhou et al. [18] proposed a multimodal analysis system that integrates textual and visual analysis of the news; for this, they built a dataset with information from news verification sites in the USA.

An important space for analysis in the Spanish language has been the Workshop on Semantic Analysis (TASS), part of the actions of the Spanish Society for Natural Language Processing (SEPLN), which aims to encourage semantic analysis in Spanish. This effort has been integrated into IberLEF (Iberian Languages Evaluation Forum), a competition that encourages research on text processing for Iberian languages such as Spanish, Portuguese, Catalan, Basque, and Galician. Salas et al. [19] implemented a machine learning analysis scheme to detect Spanish and Mexican satire on Twitter. Their results showed high accuracy for detecting satire and no significant difference between the satire of the two countries.

Posadas et al. [20] conducted an investigation to detect fake news in the Spanish language, for which they created a new dataset of content broadcast on Twitter by formal media and by media that regularly publish false content; they used four machine learning classification algorithms: support vector machine, logistic regression, random forest, and boosting. Within this review, very few investigations focused on the detection of fake news in Mexican Spanish were found.
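As an illustrative sketch (not Posadas et al.'s pipeline), the same four classifier families can be compared in scikit-learn on synthetic features standing in for TF-IDF vectors of tweets:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for text features; a real study would use TF-IDF vectors.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM": LinearSVC(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(random_state=0),
    "Boosting": GradientBoostingClassifier(random_state=0),
}
# Fit each model and record its held-out accuracy.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

The same loop, applied to a real annotated corpus, is a common way to select the best-performing family before tuning.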

Table 1 summarizes the investigations considered relevant, because of their methods, to the construction of an efficient system for the detection of fake news on Twitter Mexico.

Table 1 Literature review. Most relevant papers for detecting fake news with machine learning

3 Proposed Method

The monitoring and analysis of socio-digital interaction have become essential for the analysis and planning of communication strategies, whether to gauge social opinion or to develop strategies for digital marketing campaigns in the business, social, or political sectors, as Antoniadis et al. [21] indicate.

This research proposes the implementation of machine learning algorithms for data processing and analysis, since they are capable of systematically analyzing large datasets and categorizing them without the interference of human bias.

After reviewing the literature and the state of the art on the detection of fake news on Twitter Mexico, this research proposes to use text categorization techniques on the text of the tweet and the body of the news (the URL indexed in the tweet), and the random forest algorithm for the classification of users. The analysis of content propagation will use concepts from social network analysis and the method for determining diffusion cascades of Goel et al. [22].
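Goel et al.'s measures are not reproduced here, but the basic cascade statistics their method builds on, the size and depth of a retweet tree, can be sketched with a breadth-first traversal (illustrative code; the edge list is a toy example):

```python
from collections import defaultdict, deque

def cascade_stats(edges, root):
    """Size and depth of a retweet cascade.

    edges: (parent_user, child_user) pairs, i.e. who was retweeted by whom.
    Returns (size, depth) of the cascade rooted at `root`.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    size, depth = 0, 0
    queue = deque([(root, 0)])          # breadth-first traversal from the root
    while queue:
        node, d = queue.popleft()
        size += 1
        depth = max(depth, d)
        for c in children[node]:
            queue.append((c, d + 1))
    return size, depth

# Toy cascade: A posts, B and C retweet A, D retweets B.
size, depth = cascade_stats([("A", "B"), ("A", "C"), ("B", "D")], "A")
# size == 4, depth == 2
```

Large, deep cascades are the kind of structural signal a viral/non-viral classifier could consume.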

  • The independent variables for the development of this research will be:

    (a) Tweet text

    (b) Text of the news (URL in tweet)

    (c) Broadcast users

    (d) Propagation of the tweet

  • The dependent variables for the text of the tweet and the text of the news contained in the URL of the tweet will be:

    (a) Fake news

    (b) Satire

    (c) Propaganda

    (d) Real news

  • The dependent variables for the independent variable @user will be:

    (a) Bot

    (b) Troll

    (c) Human

  • The dependent variables for tweet propagation will be:

    (a) Viral

    (b) Non-viral

The proposed independent and dependent variables will be integrated into a system capable of detecting fake news and other variants of the news content broadcast on Twitter Mexico (Fig. 1).
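As an illustrative sketch of one component, classifying broadcast users as bot, troll, or human with random forest, the following uses hypothetical account-metadata features inspired by the characteristics reported in the literature (the feature set and values are assumptions, not the project's actual features):

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-account features:
# [account_age_days, followers, following, tweets_per_day]
X = [
    [30,    12, 2000, 140.0],   # young, follows many, very active -> bot-like
    [15,     8, 1500, 200.0],
    [400,  150,  600,  60.0],   # mid-range, high-volume interaction -> troll-like
    [350,  120,  700,  80.0],
    [2500, 900,  400,   3.5],   # established, moderate activity -> human-like
    [3100, 450,  380,   1.2],
]
y = ["bot", "bot", "troll", "troll", "human", "human"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
pred = clf.predict([[20, 10, 1800, 170.0]])   # classify an unseen account
```

A real model would be trained on thousands of labeled accounts with features derived from the Twitter API rather than six invented rows.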

Fig. 1 Block diagram of the independent and dependent variables of the model proposed for the detection of fake news in the socio-digital media in Mexico

4 Development Phases

The first phase of the proposed system for detecting fake news on Twitter Mexico with machine learning will be the review of the literature on the various automatic techniques for detecting fake news.

The second phase consists of creating a dataset of tweets in Mexican Spanish for training and testing. For this, the dataset will be partitioned into two subsets:

  • The training partition, composed of 70% of the dataset, whose tweets will be classified manually.

  • The test partition, containing the remaining 30% of the dataset.

The training dataset will be processed and analyzed with various machine learning text-processing algorithms. Through the analysis of the text, each tweet can be classified as fake news, satire, propaganda, or real news.
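The 70/30 partition described above can be performed with scikit-learn's train_test_split (a sketch with placeholder tweets; the real dataset will be manually annotated Mexican-Spanish tweets):

```python
from sklearn.model_selection import train_test_split

# Placeholder labeled tweets (text, label); labels follow the proposed categories.
tweets = [f"tweet {i}" for i in range(10)]
labels = ["fake news" if i % 2 else "real news" for i in range(10)]

# 70% for training, 30% held out for testing; stratify keeps label
# proportions similar in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.30, stratify=labels, random_state=42)
```

Stratification matters here because fake news categories are typically imbalanced in collected corpora.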

The third phase will consist of designing and building a comprehensive algorithm for the detection of fake news in Mexican Spanish, which will have to analyze and categorize fake news automatically and at scale.

The fourth phase consists of testing the algorithm, analyzing the results, and verifying whether the proposed algorithm achieves an acceptable degree of efficiency both in execution time and in the detection of fake news.
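The fourth-phase verification can rely on standard classification metrics plus wall-clock timing; a sketch with hypothetical predictions (invented labels, not project results):

```python
import time

from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth and model output from a test run.
y_true = ["fake news", "real news", "satire", "real news", "fake news"]
y_pred = ["fake news", "real news", "satire", "fake news", "fake news"]

start = time.perf_counter()                     # execution-time measurement
acc = accuracy_score(y_true, y_pred)            # fraction of correct labels
f1 = f1_score(y_true, y_pred, average="macro")  # per-class F1, averaged
elapsed = time.perf_counter() - start
```

Macro-averaged F1 is a reasonable choice for the multi-class, likely imbalanced labels (fake news, satire, propaganda, real news) proposed in Sect. 3.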

The fifth phase will consist of communicating the results in a completed research paper.

Table 2 shows the phases for carrying out the project of the algorithm for the detection of fake news in the socio-digital networks in Mexico. Currently, the project is in the phase of building the dataset for subsequent training and testing. The next stage is the design and construction of the algorithm for the detection of fake news.

Table 2 Project calendar for the construction of the algorithm for the detection of fake news in the socio-digital networks in Mexico. The boxes in green are the actions that have already been carried out

5 Conclusion

The literature review found that studies and research on the detection of fake news with machine learning or artificial intelligence techniques have been carried out mainly for the English language or with the help of translators. The most used techniques for the classification and detection of fake news have been SVM, logistic regression, decision tree, and naive Bayes.

During the literature review, it was observed that various datasets have been created across investigations, drawing on official news media, alternative media, journalists’ accounts, fake news verification websites, and websites of fake news and satirical content. The topics covered by the papers reviewed have been mainly politics and society.

For Mexican Spanish on Twitter, very little research has been carried out in recent years, and it is limited to analyzing the text of the news attached through a link in a tweet; this limits the scope of detection, since it leaves out complementary signals that could be integrated for better detection of fake news.

The proposed research will be useful, first, for the detection of fake news disseminated on social networks; it will also be a tool that helps to report content detected as fake news on the different socio-digital networks. The system can also be of great help in the development and monitoring of social communication strategies for any type of organization, whether business, social, governmental, or political. The use of this algorithm will help improve the veracity of the content on socio-digital networks.