Keywords

1 Introduction

Our environment and how we interact as a society has lived a great transformation in recent years. Technology and especially advances in communications have been largely responsible for this, providing alternatives to carry out actions that previously were carried out in physical sites consuming a lot of time on a daily basis, and now they can be carried out with just one click. The portfolio of services available on the web every day is more diverse and complete, after the pandemic episode it is not just an alternative, this issue has accelerated the digital transformation of many organizations, retail and small business, which has been motivated to publish via web site their services, for that reason safety for people who interact with them is a subject of great importance.

In this field there are two main actors, providers that offer services to increase the number of clients who make use of them, seeking a good experience that ensures continued use on posterity, and on the other hand, there are the users of these services who despite the comfort that these services can provide, they are not willing to sacrifice security in their transactions that may affect them monetarily or violate their data privacy. According to studies such as [1] Anti-Phishing Working Group in Q4 of 2021, phishing as an attack is one of the most suffered by users on the web, this study shows an amount of 316,747 attacks in December 2021 and it is the month with the most count of phishing attacks in the history. This report also mentions that the number of attacks at the end of 2021 has tripled the number of early attacks in the early months of 2020. It also makes an analysis most targeted industry sectors and found that the financial sector is one of the most suffered (23.2%) followed closed by SAAS/Web-mail (19.5%) and e-commerce/retail services (17.3%)

The same previous analysis was done by [2] and [3] in his introduction 2 and 4 years ago respectively and surprisingly these stats have continued growing in the following years. That shows that, although the research on phishing detection is varied and not especially a recent topic, the application of mechanisms that allow not only mitigating but preventing users from falling into the early stages of phishing should be strengthened. This review paper intends to conduct a study of the state of the art in detecting phishing and establish key points where research still has shortcomings.

In [4]it is described a review of research related to phishing and what it calls security challenges, Fig. 1 shows the aspects that this article wants to highlight.

Fig. 1.
figure 1

Challenges in phishing detection models

The reality is that some research, especially engineering research, should be concerned with innovating previously unused methods, looking for different configurations that provide better results, or trying new approaches, it should also prioritize risks and reevaluate objectives. Phishing detection research is not particularly new and targeting efforts to detect as many malicious URLs as possible might seem like a good goal, seeking to remove the manual effort from security SOCs and ranking potentially dangerous URLs rather than a list of websites that are not. Far from the purpose of this article is to point out that this objective is bad or unnecessary, otherwise, it is considered very important and it is proposed a thought related to changing the approach in which the detection of phishing must act by mixing various methods, various actors in the process and overall considering of vital importance to act in the stage of prevention.

Another aspect to consider is the change of the characteristics of phishing in the time [4]. Phishing of 2022 is different from the phishing of 2018 in just 4 years the attackers have found ways to steal the information of the users through forms, hacked sites, free hosting services, and tunneling of local sites. For that reason, it is necessary to consider characteristics that could give us not only high accuracy in the detection currently if else also can achieve mechanisms that can be adapted to different techniques and that can identify key elements of the features of phishing to act in the early stages allowing to keep the performance of the solution for an acceptable time after the implementation.

This paper will check the studies related to phishing detection, taking into account the chronology of the different studies, and identifying the stages in which they act. The article is structured with an initial explanation of the methodology used, with the research questions formulation, and next with the literature selection to build a frame in phishing detection that allows analysis and synthesis using 4 pillars found in the review. Finally, the findings will enable it to reflect on the features identified in state of the art and adjust a model proposal for phishing identification that manages to act in the early stages of detection.

2 Methodology

2.1 Research Question Formulation

Let’s launch a premise: “It is necessary to avoid that phishing attacks catch the people”. So, there are many ways to be approached a possible solution, but particularly we are thinking here in some collection of programs, algorithms, and validations that alert previously a user that could fall on the attack. So the following guiding questions were raised taking into account the phishing detection broad topic:

  • What are the characteristics of a phishing attack?

  • What are the currently used mechanisms to identify phishing?

  • What are the characteristics found in the literature used in phishing detection?

  • When phishing attacks are found?

  • What characteristics are attributable to the brand affected by phishing and which are typical of a generic phishing attack?

  • What are the challenges and/or gaps in phishing detection?

2.2 Sourcing of Relevant Literature

The chosen articles were obtained from a search equation based on the research questions: TITLE-ABS-KEY (“Characteristics of a phishing” OR “mechanisms to identify phishing” OR “extract features phishing” OR “features brand phishing” OR “features generic phishing” OR “Machine learning phishing detection”) Bibliographic database such as IEEE explorer where used in this literature search. Finally another articles where added from the citations of the first articles, taking into account some review articles and organizations mentioned in the first articles such as AWPG.

2.3 Literature Selection

Based on the research question formulation, search equations were executed to reach a reference frame that allowed identifying 4 pillars of this research, these are: When? (detection stage), Where? (information sources), What? (Phishing features), How? (Phishing methods detection). 244 research articles were collected from the first decade of 2000 up to date. However, in the first review, it was found that more recent works already covered previous studies taking as global categories detection methods as List, Heuristically, Machine learning and it was also necessary to consider a great variation in the techniques used by cyber-criminals. The same happens concerning the technologies and procedures used in the implementation of web pages. So 57 articles were identified as key in phishing detection literature in recent years, taking into account that the objective given by the thematic is identifying the 4 pillars in the articles, identifying the point of action of each one of the investigations, identifying the information sources, the characterization of the attack and the methodology used for detection.

3 Phishing Detection Stages, When?

To understand when the studies of phishing attacks are acting, it is important to identify different types of approaches in the literature that deal with the problem, and according to that analyze and present their results. Lets to show it through an example:

Lets to analyze different studies acting in completely different detection stage, studies such as [5] focuses its efforts on validating phishing from an already deployed URL, it is a mean which already exists the attack. Probably some people already also has received the same URL and some of them have fallen into phishing, meanwhile, some others have been instructed to avoid falling into it. On the other side studies such as [6] focus on creating domain generation algorithms that allow it to act in zero time; it is mean that this algorithm could identify a potential attack of phishing even before this attack would be deployed.

Although both of them are aimed at detecting phishing, they differ greatly on: methodology, stage detection, sources, and techniques used. Considering the result [6] presents accuracy less than 5% and [5] accuracy above 90%, however in where no person had to have fallen on fraud, while the accuracy can cover a wider range of brands taking into account that cost can not be quantified on how many people might fall on the attack before it is being detected. Taking this into account here a proposal of the stages for phishing detection will be proposed:

Fig. 2.
figure 2

Description of the detection stage in which the reviewed papers are located

3.1 Prevention Stage

Where there are jobs for generating domains such as nakamura 2019 [6], Buber 2017 [7], Adil 2020 [8], Spaulding 2016 [9], Starov 2019 [10], Ginsberg 2018 [11] and Li 2016 [12] based on features extracted, this stage is perfect pipelining for applications that actually prevent the spread of phishing before reaching a user or a propagation medium.

3.2 Diffusion Stage

In this stage there are works such as Ya 2019 [5], Li 2017 [13], Li 2020 [14], Eshmawi 2019 [15], Balim 2019 [16], Dalgic 2018 [17], Yan 2020 [18], Sahoo 2018 [19], Baykara 2018 [20], Lingam 2018 [21] and Lingam 2019 [22] related to identify mechanisms in which phishing reaches end users; this is how analyzes are presented on social networks or email dissemination, among others.

3.3 Mitigation Stage

Final stage of action on which 83% models studied act based on community databases or reported URLs such as Phishtank or Openphish.

Currently, the 35 articles and investigations here studied have focused on the last two stages, seeking to identify and study the means by which phishing spreads and how it is dispersed or analyzing the final URL in which users have already fallen.

For practical purposes throughout this paper, we will refer to these 3 previous stages as prevention, propagation, and mitigation, as it is depicted in Fig. 2.

4 Phishing Information Sources ?‘Where?

In studies such as Li 2017 [13], Sharma 2017 [23] and Pande 2017 [24] different sources were found, used either for the own study of phishing characteristics, or for the validation of results such as Adil 2020 [8] and Li 2016 [12]. For this, it is essential to count the sources of information that link phishing sites or at least that allow extracting of URLs related to phishing features. Here are some sources considered useful at different stages of detection:

Table 1. Some sources used in phishing search

Table 1 shows some sources found on the web to obtain websites reported by the community or specialized teams, where usually researchers can put together their data sets and make a preliminary analysis of the characteristics of phishing attacks. However, it is necessary to mention that these types of sources present different types of utilities for different users involved in the detection. However, although it is a good way to centralize information regarding active phishing, acting with these data for the detection of phishing would help only in the mitigation stage. Thus, it is possible to highlight what stage of detection certain investigations are at, based on the choice of their data sources. An investigation that wants to act in stages of dissemination, will look for sources related to social networks, emails, or web advertising, while a mitigation stage would use sources related to Table 1 that would allow automating classification processes where it would help in more systematized processes to classify URLs with more elements; while for early stages it would be ideal to act within the domain registry itself, where clues begin to be given that the domain is focused on impersonating another web page.

Table 2. Some sources that can be used in prevention stages

Table 2 shows some sources focused on domain detection from its prevention stage. They are sources that, based on a keyword, can allow searches for recently registered domains, as well as displaying whois, associated security certificates, and other domains.

5 Phishing Characterization What?

5.1 Challenges for Feature Selection

In studies such as Zhu 2018 [27] and Yang 2019 [28] authors seek to diversify the characteristics used in searching for higher detection accuracy. However, the more characterized phishing today is more susceptible to future attack changes. Within the features mentioned in the literature such as Aung 2019 [29], Eshmawi 2019 [15], McGahagan 2019 [30] and Yuan 2018 [31] as a phishing alarm, a wide variety of options are presented. They are considered to act at different stages since some depend on whether the URL is already deployed with a phishing attack, while others, such as those extracted in heuristic methods, know what they are specifically looking for and this could allow them to identify these characteristics in early stages, These features have different extraction mechanisms and different requirements to be able to quantify them. Ideally, a complete system should encompass the extraction of characteristics in all possible stages, although the ideal would be to identify in the prevention stage, so the greatest number of possible features in the first stage would be ideal. However, features of this type should be searched in the sources shown in Table 2. It is difficult to search in these sources if it is not sure what it is looking for, it is at this point where features related to the brand can help us search in the great number of domains registered every second and can help us validate phishing before it is in a diffusion stage.

6 Phishing Detection How?

6.1 Surveyed Papers for Detecting Phishing

It is considered important to start with the current approaches in phishing detection since it will give a general idea of what is being implemented and possible approaches to these methods (Fig. 3).

Fig. 3.
figure 3

Distribution of methods used in the literature.

6.2 Blacklist-Whitelist Approach

It is inevitable to mention the topic corresponding to the use of stored data for blocking IP or domains already detected as fraudulent since in articles as Buber 2017 [7], Mondal 2019 [32] and Patil 2018 [3] they are considered important features to take into account in more structured systems, where they can help mitigate an attack, being the most relevant fact of this model has a very effective mitigation potential if there is a system of interconnected information browsers. [33]

These types of strategies are still useful because they allow acting where the other validation systems have failed and although in lesser numbers, they may be able to act in early stages based on notorious precedents, such as past attacks reactivation or malicious IPs blocking.

6.3 Heuristic Approaches

The heuristic approaches depend on the quality of the features that are extracted some articles such as Ali 2019 [34], Huang 2019 [35], nakamura 2019 [6], Nathezhtha 2019 [36], Baral 2019 [37]. In these works results allow visualizing expected behaviors, that is, in the case of these implementations, it is necessary to know what is being looked for and based on this build the algorithms that allow the identification of these expected characteristics. This type of implementation can act in any of the three stages depending on the approach for which it is designed as in [6], where it is used for early detection. But it can also be used in mitigation stages since it can be based on features of a URL of phishing already deployed. In this type of method, it is important to have information about keywords and patterns that can be effective in the prevention stage. The biggest disadvantage is the static detection that results in evaluating additional non-obvious aspects.

6.4 Machine Learning

The other mechanism is the use of tools that monitor URLs and seek to give a risk based on characteristics detected on a certain web page. The tools that use machine learning have proven to have the best results in recent years [38]. Based on this risk, decisions can be made to mitigate the impact of fraud [33].

Although machine learning detection algorithms give the best results, there is still a gap in differentiating between phishing detection and validation. And the difference between these two concepts lies in the stage where the tests of these models are implemented. In 25 models studied, tests are performed on deployed URLs, so rather than detection, they are phishing validation systems acting in the mitigation stages. There is a general absence of evaluating results, being implemented in real environments and over a long period of time, to evaluate the accuracy that machine learning algorithms give against the changes that attackers show in their attacks.

6.5 Detection Phishing Methods

The methods used in the detection phishing process describe a set of phases to structure the design, the technical implementation, the results analysis and also the context of the application, and the best conditions for use it is considered. In a General classification, there are three big methods used in current research these are based list detection, heuristics detection, and machine learning detection.

Fig. 4.
figure 4

Distribution of methods used in the literature.

The Fig. 4 can provide a review of methods used in detection phishing researches.

7 Opportunities for Future Researches

7.1 Identifying Challenges in the Review

An important aspect to consider is the pattern found in most of the studied identification systems. It is important to have in mind that there are studies [4] that have identified different challenges in each one of these processes and for the case of this work are considered appropriate to mention.

  1. 1.

    Source extraction stage: This stage presents the obtaining information challenge that is not biased to a certain group of threats, as well as obtaining the information in real-time, in general, shortening the gap that limits the quality of the data and the ideal development of research.

  2. 2.

    Data analysis and relevant data extraction: At this stage, there are still quite a few challenges to explore. [4] raises two main ones that are related to the quality of the data from which these characteristics are extracted and the time scaled since the attack is active.

  3. 3.

    Training and/or adjustment of the system: At this stage, the challenge of configuring the appropriate features are posed so that while the system learns it can be adaptable not only to the data which it was trained-configured but also could help with attacks that come to the future.

  4. 4.

    System evaluation: There are challenges in that proper evaluation parameter must be sought that not only depend on a correct interpretation of the results but also depends on the quality of the past stages to provide information beyond just the effectiveness of “validating phishing”.

7.2 Include of Brand Information to Improve Phishing Representation

It is necessary to identify the big majority of features related to phishing at an early stage and consider elements such as changing these features over time, the change in the technology, and security certificates. They should be considered in the analysis of the choice of features to change the global understanding of the problem to protect specific brands that allow user protection before the attack is widespread. For it, the industry should assume the role of protecting the service offered to the users and implement customized systems to address the problem from the characteristics of its fraud threats. In that way, some researchers were found to use a different approach that could be used under the concept of brand features such as Zuraiq 2019 [39], Ginsberg 2018 [11] and Concone 2019 [40]. Related work was found as an example of email-phish with high similarity, demonstrating recurrent neural networks with an accuracy of more than 98% [5]. Figure 5 presents features that can be extracted in each stage.

Additionally [41] shows how from NLP-W2V-based feature extraction it is possible to run a model that can be tracked in real-time. However, a connection is not established with phishing prevention stages, so it is not possible to determine that it can work in previous stages, but it shows how it could be used for the extraction of possible brand features and their vector representation.

Another approach [42] provides results using the extraction characteristics of logos showing an accuracy of 97%. So consideration should be given to using image validation from early stages.

Fig. 5.
figure 5

Features extraction diagram and its importance in the different stages

7.3 How Patterns and Brand Features Can be Used in Prevention Stages

The most valuable improvement that the new research can develop is the capability to increase the detection in prevention stages according to Fig. 2. Additionally, these must increase the accuracy in this stage, the majority of methods used in this stage are associated with heuristics and list methods but Machine Learning methods are low. The machine learning methods need a number of relevant features that can teach the model to learn the characteristics of phishing attacks. But what kind of features can a model take of a recently registered domain? the answer is probably none with the current approach. A recent domain just has a sequence of chars associated and probably information through a WHOIS request.

Here is where a new approach can appear taking into account the patterns and the acknowledgment of the brands, although it could sound like a start of heuristics methods, a machine learning model that can learn the patterns and features associated with a brand could be powerful recognizing phishing for a specific brand. So the importance of studying these features could be of vital importance in developing a detection phishing system in the prevention stage. Here are described global features to be taken into account for this proposal:

Attacker Characterization. A domain recently registered can detect a number of features related to the WHOIS record, such as registrant data, registrar, hosting, country, and actives services as possible MX record, as also can be detected ssl certified and his respective organization.

Character Analysis. Generally, the domains can contain the name of the affected brand or similar chars [6] that can be processed with just the domain considering additionally features as [12] and involving methods such as Ya 2019 [5], McGahagan 2019 [30] and Xiang [43].

Brand Characterization. As in Sect. 5.1 described knowledge about the brand can provide a detection system to provide security to users that want to access the offered services, the knowledge about the common colors used, the official domain or IPs associated, the language used in the official pages, as also the patterns used commonly in phishing attacks as keywords and patterns in paths could provide high-quality features for a robust system detection.

8 Conclusions

After reviewing the research, although there are good results regarding phishing “detection-validation” from a sample of URLs, showing accuracy detection above 90%, in most cases, there are still many challenges that suggest trying other approaches from different perspectives to achieve comprehensive action in the 3 stages described in this paper. Although some approximations may be appropriated for a certain group of threats, it might be good to review the methodologies used to be more effective in seeking benefits applicable to reality. There is a problem in providing models that act in the early stages where the objective is not mitigation, but prevention. Acting from the domain registry itself is not possible if it is not known what is being searched in the registered domains or what content within the domain can help to identify a potential threat. Additionally, the implemented model is required to have adaptability over time, an aspect that was not found in the results of any of the related works. The industry should understand that it should protect the service offered to the users by implementing custom systems to address the problem from its own features. For this reason, a particular approach to avoid phishing from the affected brand should be considered. Identifying particular brand features could help to be focused on early detection.