1 Background

“Dear valued customer, an unusual activity has been noticed in your bank account. To continue operating your account, please verify your details by clicking on the link below.” Anyone who receives such an e-mail from their bank and clicks the given link without verifying its authenticity is likely to fall prey to phishing. Phishing is a social engineering attack in which fraudsters manipulate the victim and exploit human error [1] with the aim of gaining access to restricted data, personal information, or login credentials, or of spreading malware. In most cases, it is performed by sending a URL or link through spoofed e-mails [2]. After clicking on the link, the victim is redirected to a fake website that the fraudsters have created by replicating the authentic one. The details keyed in by the victim on the fake web page are captured by the attackers, after which the victim is redirected to the authentic website without any suspicion being raised. By the time the victim realises that he or she has been phished, the damage is irreversible.

The COVID-19 pandemic compelled individuals to rely heavily on online services [3]. According to one report, total internet hits surged by between 50 and 70% following the outbreak of the pandemic [4]. With the pandemic spreading and restrictions on human interaction being imposed in almost every region, organisations were left with no choice but to take refuge in technology to continue their operations. This transformation occurred in almost every sector of society. From entertainment, education, e-travelling and shopping to working from home, meetings, and e-banking, no sphere of life was left untouched by the use of information systems or networks. Many organisations began to regard this situation as the new normal and considered a permanent transition to a hybrid model of operation (working from home and working from the office) [5, 6]. Members of the workforce found it difficult to come to terms with this shift in their work patterns.

However, the suddenness of this transformation exposed the vulnerabilities of these systems to cyber attackers. An unprecedented situation and the struggle for survival forced organisations to move online involuntarily, without adequate planning, infrastructure, or training. Many businesses did not even have a cyber-security policy in place to guide users on the plan of action in case of a cyber-attack. This lack of preparedness was exploited by fraudsters. Stress, fear, and anxiety amongst users during the COVID pandemic contributed to their falling for various phishing attacks [8,9,10,11,12].

1.1 Phishing statistics

The fraudsters leveraged the widespread panic caused by the pandemic and employed different technical and psychological techniques to persuade victims to click on phishing links. According to the Anti-Phishing Working Group (APWG) report [13], beginning in March 2020, cyber-criminals launched a variety of COVID-themed phishing and malware attacks against workers, healthcare facilities, and the recently unemployed. In many countries, government-aided financial assistance programs were started after the outbreak of COVID-19. Fraudsters used phishing to steal sensitive information from the beneficiaries and deceitfully applied for government benefits in their names. As per another study, COVID-based phishing e-mail attacks were up 600% in the first quarter of 2020 [14]. The 2020 annual report of the Internet Crime Complaint Centre (IC3) of the Federal Bureau of Investigation (FBI) states that there was an upsurge of 69% in the total number of cyber-crime cases in the USA, with losses exceeding $4.1 billion. Of this, phishing scams accounted for over $54 million [15]. There has been an unprecedented increase in the number of detected phishing websites over the last 10 years, as shown in Fig. 1. October 2022 saw the highest number of monthly phishing attacks reported in APWG history, a figure that had almost doubled since early 2022. Figure 2 shows the monthly growth in phishing websites in the year 2022. Apart from some decline in the initial months of the first quarter of 2022, there has been a significant rise in the number of unique phishing websites. Figure 3 shows the industry domains most targeted by phishing attacks in the fourth quarter of 2022 [7]. Financial institutions, along with webmail-based organisations and social media, were the sectors most targeted by fraudsters. About 55% of the total phishing attacks in 2022 were directed at these sectors.

Fig. 1 Unique phishing websites detected in the last 10 years [7]

Fig. 2 Unique phishing websites detected since January 2022 [7]

Fig. 3 Most targeted industries in Quarter 4, 2022 [7]

1.2 Motivation

Scores of defence techniques against phishing have been proposed by researchers over the years. Still, phishing attacks are on the rise. The perpetrators are technically sound: they outmanoeuvre the defence approaches being applied and devise new methods to deceive users. The main reason for the gap between anti-phishing defences and phishing attacks is the lack of sufficient knowledge about the phishing strategies applied by criminals. Ever since phishing first emerged, researchers have sought to summarise the various dimensions of phishing attacks effectively through surveys. There is an abundance of research work in this domain, but a considerable portion of it focuses mainly on anti-phishing. Few authors have focused on the overall approach utilised by phishers to commit fraud. Very little importance is given to the mediums of phishing distribution and the category into which a given phishing attack falls. With precise knowledge about the classification of different types of phishing attacks, competent phishing detection techniques can be designed in a customised manner to combat them.

Through this survey, a novel taxonomy of phishing is presented, in which phishing is classified based on the mediums of circulation, the intended targets, and the techniques used to perform phishing. The suggested taxonomy covers all the aspects of phishing attacks without being complex. We did not come across any existing literature in which the different subcategories of phishing circulation mediums are explained as thoroughly as in this work. To convey the gravity of the situation, notorious case studies involving phishing are also discussed.

The statistics related to phishing as reported by prominent groups in this domain such as APWG, Kaspersky, Cisco, Verizon, etc. are mentioned throughout the course of this survey. Some previous surveys and reviews in this genre are studied and analysed from the perspective of their contribution towards phishing research.

1.3 Contributions

The contributions of this survey, which we are sure will enable researchers to better understand the rapidly advancing threat of phishing, are as follows:

  1. Illustrating the objectives behind a phishing attack and providing the researchers with an insight into the magnitude of the situation through case studies and statistical analysis.

  2. Presenting a novel phishing profile that clearly identifies the intended targets of phishing, categorises the different mediums through which a phishing attack can be circulated, and explores the various phishing attack techniques that are currently employed by the phishers.

  3. Critical evaluation of various phishing detection techniques along with their comparative analysis.

  4. Identifying the open challenges in phishing countermeasures and suggesting future research directions.

The rest of the paper is organised as follows. Section 2 discusses surveys related to phishing, including a comparative analysis in the form of a table. Section 3 delves into the history of phishing, the phishing process, and the various goals of phishing attacks; infamous phishing case studies are also discussed. Section 4 presents a classification of phishing under different heads, namely circulation mediums, intended targets, and techniques. Section 5 presents an analysis and discussion of various phishing attacks. Section 6 discusses various phishing detection approaches along with their pros and cons. Section 7 summarises the conclusions and future scope.

2 Related surveys

Diverse phishing-related literature is available throughout various libraries. Some of the earliest works in this domain were presented by [16, 17] who were among the first to enlighten the researchers on various aspects of phishing. Authors in [16] have discussed different conventional phishing attack techniques employed by the threat actors, along with a methodology for preventing them.

[17] have presented an extensive discussion on the social engineering factors and phishing attack vectors. However, phishing detection approaches were not discussed. Rather, they focused more on phishing prevention and suggested traditional methods to combat phishing. Even though these works are more than a decade old, they have proven to be of great help in understanding the basics of phishing.

The author in [18] has presented a high-level insight about phishing by describing the entire phishing structure, i.e. from the time the idea of the attack is conceived to the time when the illegally obtained benefits are received by the attacker. Different categories of brands being targeted are also discussed. The author has illustrated some variations in the phishing attacks that fraudsters employ to communicate phishing URLs. Some advanced phishing techniques and countermeasures, along with their merits are discussed as well.

In the survey presented in [19], the authors emphasise that different anti-phishing approaches need to be viewed with respect to the entire process of phishing. The user education-based phishing detection approach is evaluated against the software-based approach, and it is concluded that user education alone cannot guarantee a positive response towards phishing awareness. It needs to be complemented with a software-based solution. Various software-based phishing detection techniques have been reviewed against the metrics of detection accuracy and low false positives.

The survey in [20] focuses mainly on e-mail-based phishing and overlooks other phishing mediums. The authors have termed phishing a type of spam that utilises two different approaches: a social engineering-based approach, which depends on spurious e-mails to obtain the victim’s data, and a technical subterfuge-based approach, which uses malware to exploit security gaps in the victim’s system to perform fraud. A survey of feature sets for phishing e-mail detection is also presented, which classifies the features into three groups based upon their method of extraction: (1) features extracted directly from the e-mail, such as structural, link, element, spam filter, and word list features; (2) features based on phishing keywords appearing together, such as click, URL/link, prize, and account; and (3) text-based features. Anti-phishing approaches have been discussed and classified as per their relevance in the various stages of a phishing attack.
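To make the second feature group concrete, the following is a minimal illustrative sketch, not the method of [20], of how keyword co-occurrence could be turned into features. The keyword list is taken from the examples quoted above; the function name `keyword_features` and the "two or more distinct keywords" rule are assumptions made purely for illustration.

```python
import re

# Keywords quoted in the text as examples; a real feature set would be larger.
PHISHING_KEYWORDS = ("click", "url", "link", "prize", "account")


def keyword_features(body: str) -> dict:
    """Count keyword occurrences and flag co-occurrence of distinct keywords.

    The co-occurrence rule (two or more distinct keywords present) is an
    assumption chosen for illustration, not a threshold from the literature.
    """
    tokens = re.findall(r"[a-z]+", body.lower())
    counts = {kw: tokens.count(kw) for kw in PHISHING_KEYWORDS}
    distinct = sum(1 for c in counts.values() if c > 0)
    return {
        "counts": counts,
        "distinct_keywords": distinct,
        "co_occurrence": distinct >= 2,
    }
```

Such counts would normally be only one input among many to a classifier, alongside the structural and text-based features of the other two groups.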

[21] have presented the life cycle of a phishing attack, focusing mainly on web-based phishing attacks. The authors have grouped the phishing strategies adopted by phishers into three main categories based on different stages of the phishing attack. First, the attacker imitates a legitimate sender and sends a fake message to the victims, instructing them to validate or update their credentials through a specific URL. These messages are carefully designed with logos and other visual details of the authentic sender. Second, when the victim clicks on the URL, a hoax web page opens that asks the victims for their details, following which the victim is redirected to the original website through the Man in the Middle (MITM) technique. Third, a variation of the MITM technique is used that requires users to enter their details through a pop-up window. The authors have also studied and evaluated various phishing detection approaches.

Techniques of phishing detection and filtering have been discussed in [22], along with their advantages and disadvantages. The authors have performed a relative analysis of these different techniques. Some emerging phishing attack trends and the attacker’s motivation behind them are also presented.

[23] categorises anti-phishing solutions as phishing prevention, user education, and phishing detection. Various phishing prevention techniques are mentioned, but to be successful, they are dependent on the user’s ability to understand them. Also, to be implemented, they require modifications to the existing system and have proven to be complex and expensive. Hence, the authors have focused on phishing detection techniques. Detection schemes are classified on the basis of their approach along with their pros and cons, novelty factor, dataset, and accuracy. The authors have also made suggestions about the scope of improvement in different phishing detection schemes.

Along with conventional mediums, [24] have explored phishing attacks and phishing detection techniques in new channels such as social networking sites and mobile phones. A phishing taxonomy comprising various dimensions of phishing, such as phishing communication media, the devices being targeted, phishing execution methods, and anti-phishing measures has been presented. To depict the factual significance of phishing detection, a comparison of commercial anti-phishing tools and research anti-phishing tools has been presented. Furthermore, tools are analysed for their performance and ranked accordingly.

Apart from basic information about phishing, such as history, life cycle, types, and countermeasures, the authors in [25] have addressed open issues and challenges being faced in the current scenario. The menace of phishing in the emerging domain of IoT has been communicated. For a better comprehension of the issue, prevailing resolutions, and future outlook, the authors have also discussed different datasets and tools currently being used by academics.

The authors in [26] have presented a systematic review of software-based phishing detection techniques. Along with a taxonomy of phishing detection, evaluation datasets and evaluation metrics have been discussed. To facilitate zero-day attack discovery, a newfound feature called Network Round Trip Time has been studied. A timeline-based record of different phishing detection techniques proposed over the years has been presented. Phishing detection features based on URL, website content, and website visual similarity have been summarised. Research guidance related to dataset selection, feature selection, and the detection scheme to be applied is also provided.
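As a hedged illustration of what URL-based features can look like in practice, the sketch below extracts a handful of lexical properties from a URL using only the Python standard library. The specific features chosen (URL length, dots in the host, presence of "@", IP-address host, HTTPS scheme) are common examples assumed here for illustration; they are not claimed to be the feature set summarised in [26], and the function name `url_features` is hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Extract a few simple lexical features from a URL (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A host given as a raw IP address rather than a domain name.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip,
        "uses_https": parsed.scheme == "https",
    }
```

None of these properties is conclusive on its own; in the surveyed literature such features are typically combined and fed to a learning-based or rule-based classifier.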

In [27], a comprehensive review of old and current phishing attacks is presented. The medium of phishing, the vector being used to transmit, and the phishing techniques used to perform the attack are reviewed for each phishing attack being discussed. The survey focuses on phishing attack techniques with a detailed description of the technical subterfuge involved in each one of them. The authors have also presented a state-of-the-art forecast about how different phishing attack techniques can be combined in the future to launch attacks of higher sophistication.

In [28], critical scrutiny of different anti-phishing genres, i.e. legal, educational, computerised using human-designed mechanisms, and intelligent machine learning mechanisms is performed. Authors have also stressed the importance of a user education-based anti-phishing approach. Content-based phishing detection methods have been described in detail. A comparison of various machine learning-based phishing detection techniques has been illustrated on the basis of performance, merits, and demerits.

The survey [29] re-examines the available phishing research literature from the point of view of current security challenges, namely zero-day attack detection, base-rate neglect, the time taken for attack detection, and the limited availability of diverse, good-quality (near to reality) datasets. In addition, the authors have categorised phishing detection techniques on the basis of different attack vectors. The features used, detection methods, dataset properties (availability, size and class ratio, diversity, etc.), and evaluation metrics of different phishing detection techniques have been listed. The need to include feature importance in research and the scarcity of diverse datasets have been highlighted.

In the study [30], a survey of machine learning (ML)-based (Random Forest, SVM, K-star, AdaBoost, etc.) and nature-inspired (NI) (Particle Swarm Optimisation, Firefly Algorithm, etc.) phishing detection techniques is presented. The survey covers phishing websites as well as e-mails. Various drawbacks of existing solutions are discussed, such as insufficient datasets, reliance on third parties, small feature sets, and dependency on a database. The authors have suggested the development of deep learning-based and NI-based phishing detection algorithms to enhance the overall performance of the model.

In [31], along with a survey of past and current phishing attack techniques, an extensive review of conventional and modern phishing detection techniques is carried out. The authors have discussed prevailing challenges and trends in the domain of phishing.

[32] concentrates on the detection of UBEs (unsolicited bulk e-mails), i.e. spam and phishing e-mails, through machine learning. Various UBE filtering approaches are broadly labelled as content-based and behaviour-based filters, case-based filters, heuristic filters, previous likeness-based filters, and adaptive filters. The working of various commercial UBE filters is summarised. A mechanism to process raw e-mail data based on forty distinguishing features is described. The readers are also shown how to determine feature importance. Distinct feature extraction approaches are explained. E-mail classification using many machine learning algorithms is illustrated and evaluated on the basis of performance.

[33] performs an extensive systematic search pertaining to web phishing detection. The anti-phishing solutions are categorised on the basis of input dimensions (URL/website address-based approach, textual content of the web page-based/similarity-based approach, hybrid approach). Website address-based approaches are further categorised as heuristic, list, and learning-based approaches. Textual content-based approaches are further categorised as ML-based and rule-based approaches. Different solutions proposed for each category are mentioned in detail and compared depending on the evaluation metrics, performance, benefits, and drawbacks. The authors conclude by suggesting the use of hybrid techniques with deep learning methodologies for improved performance and efficiency.

The study in [34] is confined to investigating the attack techniques practised by a particular phishing attack group that targets public institutions in South Korea for the purpose of intelligence gathering. Common distinguishing features of phishing e-mails originating from this attack group are identified and analysed. Following the analysis, their purpose is determined, and suggestions are given regarding phishing countermeasures to be applied by mail service providers.

In [35], the characteristics which make some individuals more vulnerable to being exposed to phishing are recognised. The current situation of phishing attacks is identified and prevalent phishing detection techniques are reviewed. A new phishing anatomy is proposed, which includes stages of the attack and their types, mediums of phishing, susceptibilities, threats, targets, and techniques. Some counteractive measures are also suggested.

[36] thoroughly examines various phishing methods and anti-phishing techniques. The evolution of phishing, its life cycle, and attacker motivation are also covered. A detailed taxonomy of phishing attacks on desktops as well as mobile devices is presented. Various open research challenges or research gaps have been identified and discussed.

Characteristics of phishing attacks during the COVID-19 pandemic are studied in [37]. Scientific studies, along with government reports and other literature, that investigated phishing during the pandemic are reviewed, and a comparative analysis is presented. Notable phishing attacks that were detected during the initial months of the pandemic are listed along with a description of motive, target, attack vector, and date. Current challenges are highlighted, and future research avenues are identified, which include the need for large benchmark datasets and the use of deep-learning-based methods to extract features and perform attack detection. The authors also stress that efforts should be made towards quantifying the impact of an attack.

Authors in [38] have provided a literature review on various Artificial Intelligence (AI)-based phishing detection techniques. A comparison chart that includes classification algorithms, feature selection methods, and accuracy of different artificial intelligence techniques, namely, machine learning, deep learning, hybrid learning, and scenario-based approach, proposed through the literature is illustrated. The current practices and challenges in this domain are discussed, and future research directions are proposed with the aim of developing adaptable and sturdy anti-phishing solutions.

Table 1 Summary of existing phishing-related surveys

A systematic literature review is presented in [39], which studies the application of natural language processing (NLP) to phishing e-mail detection. The structure of an e-mail is explained with a view to extracting and selecting features. The machine learning algorithms used for phishing e-mail detection, feature extraction techniques, evaluation tools and metrics, datasets and their properties, and optimisation algorithms used across various studies are briefly summarised. Based on their research, the authors have concluded that more research needs to be performed on using deep learning as a phishing e-mail detection technique and that attention needs to be given to phishing detection in languages other than English. The authors also found TF-IDF (Term Frequency-Inverse Document Frequency) and word embedding to be the most frequently used NLP techniques for detecting phishing e-mails. However, the survey is limited to works in the field of phishing e-mails, and little or no attention is given to other facets of phishing, such as phishing URLs, smishing, vishing, compromised domains, etc.
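For readers unfamiliar with TF-IDF, the sketch below shows one plain-Python way of computing the weights for a small set of e-mail bodies. It is a minimal illustration under stated assumptions, not the procedure of any study reviewed in [39]: term frequency is taken as the count divided by the document length, inverse document frequency as the natural logarithm of N/df without smoothing, tokenisation is a simple whitespace split, and the function name `tf_idf` is hypothetical. Library implementations commonly use smoothed or normalised variants.

```python
import math
from collections import Counter


def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per document.

    Assumes a non-empty list of documents; tokenisation is a lowercase
    whitespace split, chosen only to keep the example short.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenised for term in set(doc))

    weights = []
    for doc in tokenised:
        tf = Counter(doc)
        weights.append(
            {
                term: (count / len(doc)) * math.log(n_docs / df[term])
                for term, count in tf.items()
            }
        )
    return weights
```

A term that occurs in every document receives a weight of zero under this definition, which is why words shared by legitimate and phishing e-mails contribute little, while terms concentrated in a few messages are weighted more heavily.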

Another systematic review is done in [40] where phishing is divided into various types: website, web page, e-mail, SMS, tweet, financial data, and URL. The anti-phishing approaches proposed during the last decade are compared based on the type of phishing, classification algorithm used, type of dataset and performance evaluation method. Future research scope and insights have been provided, which include phishing research for non-English languages, expert validation of features, and standard threshold for performance evaluation. Table 1 summarises the above-discussed surveys and compares them with this work.

3 History, stages, modus-operandi, case studies

3.1 History of phishing

In the early nineties, when the internet had just become available to the general public, online security was a matter of concern only for government agencies. Private organisations took little responsibility for the cyber-security of their end users. America Online (AOL) paid the price when its users reported the first incident of phishing in 1995 [16]. A small community of self-identified computer hackers, mostly teenagers, wrote a software program called AOHell [41]. It provided an automated method of stealing passwords and credit card details. In those days, AOL did not issue any warnings about login and credit card scams to its users. The ‘New Member Lounge’ chat rooms were specifically targeted, as their users were new to the internet. Direct messages were sent to unsuspecting users, who were tricked into revealing their login credentials and became victims of the first-ever phishing incident. Even though the motivation behind the attack was merely to obtain uninterrupted access to the internet, it paved the way for the much more damaging phishing incidents that have occurred over the years.

3.2 Phishing case studies

  1. Phone phishing attack on Twitter [42, 43]: In July 2020, some Twitter employees were subjected to a phishing attack via phone. The attackers, who professed to be Twitter employees, exploited human vulnerabilities and manipulated them to divulge their login credentials. Through these credentials, the adversaries were able to access Twitter administrator tools, which further equipped them to access the Twitter accounts of many celebrities, send fake tweets, and ask for Bitcoin contributions on their behalf. A huge fan following of the celebrities ensured a transfer of more than $100,000 in bitcoins to bogus accounts.

  2. Phishing attack on Google and Facebook [44]: The perpetrators sent forged e-mails with fraudulent invoices perceived to be originating from Quanta Computers in Taiwan to some employees of the two technology giants. Since Quanta Computers regularly carried out business with Google and Facebook, no suspicion was raised, and more than $100 million were transferred to the fake company’s bank accounts between 2013 and 2015.

  3. Fake President Scam on FACC [45]: In 2016, an Austrian aerospace parts manufacturer company, FACC, lost around $61 million as a result of a phishing attack. The phisher masqueraded as the CEO of the company and sent a hoax e-mail to the finance department with instructions to transfer funds to the attacker-controlled bank account.

  4. Phishing attack on COVID vaccine supply chain [46]: In 2020, a phishing attack against global vaccine supply chain manufacturers was uncovered. The intention behind the attack was to access sensitive information related to the COVID vaccine cold chain distribution system. The attack was spread across multiple countries and targeted the employees of companies that were involved in the attempt to keep up the COVID vaccine supply chain, such as biomedical research organisations, medical equipment manufacturers, immunology experts and pharmaceutical firms. Logistics firms involved in the transportation of the vaccine were also targeted.

3.3 The phishing process

Fig. 4 Process of a phishing attack

Figure 4 shows the various stages of a phishing attack. A phisher takes the following course of action to steal information from the victims [21]:

  1. Phish Set-up: During this stage, a phisher identifies the victim or a group of victims, marks the information to be extracted, and sets up a phishing website to be used for deception. The phishing website is uploaded to a web-hosting server. The attacker can either acquire a domain and use it for malicious purposes or hack a legitimate domain and append the phishing web pages to it.

  2. Phish Spread: In this stage, the identified victims are exposed to the phishing website through a phishing link. The different methods of transmitting this link are discussed in the next section.

  3. Phish Strike: Once the victims click on the phishing link, they are redirected to the fake website created by the fraudsters. On the fake website, the victims provide login credentials or other personal information that the scammers may use for malicious purposes. Malware can also be installed on the victim’s computer. The fraudulent website has a similar look and feel to the original, including the same logo. The link has almost the same spelling as the original brand name, with only minute differences. This stage relies heavily on human error and a lack of online security awareness. It exploits the human tendency to ignore certain security aspects under urgency or anxiety.

  4. Phish Retreat: Once the purpose has been fulfilled, the victims are redirected to the authentic website, and the attackers attempt to delete all traces of the phishing attack (such as the fake website, server logs, etc.).
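The near-identical spelling of phishing links described under Phish Strike suggests a simple check: measuring how many single-character edits separate a candidate domain from a known brand domain. The sketch below illustrates this idea with a standard Levenshtein edit distance. It is an assumption-laden example rather than a detection technique drawn from [21]; the function names, the example domains, and the threshold of two edits are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution
                )
            )
        prev = curr
    return prev[-1]


def looks_like(candidate: str, brand: str, max_distance: int = 2) -> bool:
    """Flag a domain that is close to, but not identical to, a brand domain.

    The threshold max_distance=2 is an illustrative assumption.
    """
    distance = edit_distance(candidate.lower(), brand.lower())
    return 0 < distance <= max_distance
```

For instance, `looks_like("examp1ebank.com", "examplebank.com")` flags the substitution of the digit 1 for the letter l, whereas the genuine domain itself and unrelated domains are not flagged. A check of this kind would miss other tricks, such as visually similar Unicode characters or brand names embedded in subdomains.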

3.4 Phishing objectives

There might be multiple objectives behind a phishing attack. The pivotal ones are [47]:

  1. Financial gains: The attackers gain access to the online banking login credentials of the victims through the mimicked website and can perform monetary transactions. The majority of phishing attacks are motivated by a desire for financial gain.

  2. Defamation: Getting access to social media login details enables the phishers to send derogatory messages or upload obscene posts from the victim’s profile with the intent of defamation.

  3. Impersonation: The attackers assume the victim’s identity in order to carry out malicious activities. This can be done for financial benefit, for criminal activities such as committing fraud, or to malign the reputation of an individual or an organisation. The stolen identity can also be used to perform further phishing attacks.

  4. Identity fraud: Stolen identities are in huge demand on the dark web [48]. Instead of utilising the victim’s identity for financial gain directly, the phisher can sell it on the dark web, where it can be further utilised to perform unlawful activities and even acts of terrorism [49, 50]. The main reason that this nexus works is that there are no geographical limitations on the internet. Since the crime is spread across multiple locations, it becomes very difficult to track, and the scammers can continue with their endeavours for a longer duration.

  5. Espionage: Business rivals use phishing to spy on their counterparts and steal trade secrets. Proprietary business information includes details about products, pricing, corporate strategies, industrial research, and financial statements. This information is sensitive and, if revealed to a competing company, can be used by it to get ahead in acquiring a contract or to taint the public impression of an enterprise, leading to losses worth billions.

  6. Malware installation: Another purpose of phishing is to install malicious software on the target computer. In most cases, an e-mail containing malware as an attachment is sent to the victim. When the victim clicks on the attachment, the malware installs itself on the victim’s machine and performs the desired tasks, such as snooping, encrypting or corrupting data, or even opening a backdoor to the system for a much more hazardous attack later. The attack on the Ukrainian power grid [51] is a case in point: prior to the hijack of the SCADA system at the power grid, employees of the power grid companies were sent phishing e-mails containing BlackEnergy3 malware. This enabled the attackers to gain access to user credentials, some of which were for the VPN (Virtual Private Network) that the grid workers used to log into the SCADA system remotely. There are many variants of malware that pose a threat to the victim’s computer. Some of them are:

    1. (a)

      Ransomware: Malware that denies the victims access to their data by encrypting it until a ransom is paid, mostly in the form of cryptocurrencies. Almost 70% of malware breaches between November 2020 and October 2021 had ransomware involved [52]. As per a survey [53] conducted across 31 countries, 66% of organisations were hit by ransomware in the year 2021. This amounts to an increase of 78% as compared to the previous year. Out of these organisations, 46% paid the ransom to get their data back. However, only 61% of the data could be restored after paying the ransom. Only 4% of the ransom-paying organisations got all their data back. There has also been a 4.8-fold increase in the average ransom payment. There is an increase in different variants of ransomware, which makes it easy for them to evade the anti-malware software [54].

    2. (b)

      Spyware: Malware that is delivered mainly through phishing e-mails with the purpose of snooping on the victim’s data, tracking the websites visited, and monitoring online activity. The main motive for planting spyware on the victim’s computer is to gather information. Spyware can capture confidential data such as passwords, account PINs, browsing habits, and credit card details and send it to the spyware authors, who can sell it or use it to carry out a more damaging cyber attack later.

    3. (c)

      Viruses, worms and trojans: All of these are malicious software that can damage a computer, but the three differ in how they operate. A virus is malware that attaches itself to another program, which must be executed for the virus to be installed on the victim’s computer. Once installed, a virus can corrupt the data or hardware of the system. It can replicate itself and spread to other computers. However, a virus cannot spread without human intervention, such as executing an infected file or sharing it through e-mail, knowingly or unknowingly. A worm is malware that, like a virus, spreads to other computers, but it does not attach itself to another program and does not need human help to spread. It can replicate itself on the system and travel through networks, so it can transmit thousands of copies of itself, which spread further and cause widespread damage. The main objective of a worm is to consume the system’s resources, slowing it down, and to allow a malicious user to control the system remotely. A Trojan horse is a malicious program that disguises itself as a genuine application. It does not replicate itself but can be equally catastrophic. The primary purpose of a Trojan horse is to collect information.

    4. (d)

      Adware: Adware is malware that automatically displays or downloads advertising content, such as pop-ups or banners once the user is online. It enters the system through software that a person downloads from the internet, usually freeware, and discreetly installs itself on the victim’s computer. Adware harms the victim’s computer by slowing it down and hijacking the browser.

    5. (e)

      Keyloggers: Malware that covertly records the keystrokes on a keyboard in order to gain unauthorised access to sensitive information about the victim [55]. A keylogger can expose passwords, banking details, personal correspondence, and any other activity of the victim to the adversary, who can use it for malicious purposes.

4 Phishing classification

Figure 5 depicts the phishing classification suggested in this article.

4.1 Classification based on phishing circulation mediums

To trap the identified victim, the phisher needs to approach the victim through some medium. The medium provides the fraudster with a mechanism to dispatch the phishing link and then gather sensitive information from the details entered by the victim on the phoney web page. The following phishing circulation mediums, employed by phishers throughout the world, have been recognised:

4.1.1 Mobile phone phishing

An increase in the number of subscribers and the change in user requirements have led to a massive advancement in mobile technology in the last few years. Mobile phones have been replaced by smartphones, which have become such an integral part of our daily routine that it is hard to imagine stepping out without one. Netizens no longer need to own a computer or a laptop when they have access to a smartphone, which is compact, inexpensive, has a long battery life, and most importantly, performs similar functions. In 2016, there were 2.7 billion smartphone users worldwide. This figure rose to 6.5 billion in 2022 and is expected to reach about 7.7 billion by 2027 [56]. Mobile devices have evolved from a luxury to a utility and finally, a necessity. Apart from harbouring personal information like pictures, e-mails, and social media accounts, smartphones are also widely used to perform monetary transactions through e-commerce websites and for utility bill payments. Mobile devices are with users almost all of the time, and users’ increased reliance on them for daily chores has made them an appealing medium for phishers to carry out fraud. An APWG report [57] states that there was a 70% rise in mobile phone-based fraud in the second quarter of 2022 as compared to the first quarter. A mobile phone user is more susceptible to phishing attacks than a desktop user [58]. The main reasons behind this are the small screen size, the inability of the user to see background applications, and a lack of vigilance [59]. Mobile applications have simple user interfaces that can be easily fabricated [60]. In addition, mobile devices lack application identity indicators [61]. The high response rate of text messages as compared to e-mails is another reason for the growing interest of phishers in this domain. As per a study [62], the open rate of SMS is about 98%, 95% of which are read within the first 3 minutes. Mobile phishing can be carried out through the following means:

Fig. 5 Classification of phishing

Phishing through SMS Also known as smishing, the perpetrators send a malicious link to the victims through an SMS message on their mobile devices. SMS phishing can be done through one of the following methods [63]:

  • Malware The smishing message contains a link which on clicking installs malware on the victim’s device. The malware can monitor the victim’s online activity, capture the login credentials and other sensitive information and send the same to cyber-criminals. It can also disguise itself as a genuine app, tricking the users into revealing their confidential information.

  • Malicious website The link/URL in the smishing message leads the victims to a malicious website that may be masquerading as a reputed website. The victims may end up filling in the login details or banking information at the phoney website and suffer from monetary losses.

  • Contact through phone/e-mail Victims are asked to contact a given phone number or e-mail ID to claim complimentary gifts or vouchers. Upon being contacted, the attacker tricks the victims into divulging their credentials.

  • Self-replying SMS Malicious links are sent to the victims through a self-replying SMS that asks them to agree or disagree with a subscription.

Phishing through calls Also known as vishing, this is the use of voice to perform phishing. A survey [64] reveals that people are more cautious about e-mails than about phone calls. When they receive a call, even from an unknown person, people tend to believe it is genuine without verifying its authenticity. The victims receive phone calls purporting to originate from a trusted party, such as a bank or credit card company. The callers are trained to speak persuasively and try to create a sense of urgency or fear in the victims, for example by claiming that the bank account will stop working if the required details are not provided. In most cases, the attackers use VoIP (Voice over Internet Protocol) [65] or caller ID spoofing [66] to make the call. The frantic victims are convinced that they have no option but to reveal sensitive information such as social media account details, bank login credentials, or credit card details. The disclosure of private and confidential attributes can lead to further victimisation, such as stalking or online harassment. The victims are also at risk of becoming suspects in crimes committed by an imposter.

Phishing through QR codes QR (quick response) codes were first used in the Japanese auto industry in the 1990s [67] to overcome the limitations of one-dimensional barcodes, which could store only a small amount of data. Later, QR codes found their way into smartphone use because they are machine-readable and can store large amounts of data.

QR codes became more prevalent during the pandemic as they proved to be an efficient means to ensure social distancing. Furthermore, they are easy to deploy, free of cost, fast, and convenient. They are damage resistant as compared to one-dimensional barcodes [68]. It is also not mandatory to know the domain name of a website. The URL of the website is encoded through the QR code and the users are required to scan the same through their smartphone camera. The QR code reading application installed on the phone does the rest. This sheer convenience has contributed to raising their popularity among the masses. QR codes are being used in a wide variety of applications like physical access control, ticketing and logistics, identification, and electronic payments. However, the enhanced use of QR codes has attracted criminal minds to consider them as a potential medium to access users’ confidential information, spread malware, and most importantly for monetary benefits. Phishers can embed malicious URLs within the QR code [69] which upon scanning leads the victims to a mimicked website. They can also create an entirely new QR code and stick it over an authentic QR code on a card or a flyer at the retail stores. Usually, the users can decide whether or not to open the link on the phone browser. However, there are certain applications that directly visit the web page without waiting for approval from the users. The use of ‘URL shorteners’ also limits the user’s ability to assess a URL before visiting. Curiosity towards QR codes also leads the user to visit a web page without verifying its authenticity. As per a study [70], 85% of users who scanned a QR code chose to visit the URL in their phone browser even when the domain was unfamiliar. Since no conventional approach or authentication mechanism is employed to generate a QR code [71], they are always at risk of being used by fraudsters as a phishing medium until a comprehensive solution is achieved.

Phishing through Instant Messaging Today’s youngsters, who are constantly active on their smartphones, use instant messaging to stay in touch with their acquaintances. Instant messaging applications like WhatsApp and Telegram offer features ranging from audio and video calling to hyperlinks, emojis, photos, videos, and file sharing. This medium of communication is much more popular than conventional SMS and has attracted phishers.

4.1.2 E-mail phishing

As per the Verizon Data Breach Report [72], 96% of social attacks arrive through e-mail and almost 100% of social attacks involve phishing. E-mail is also one of the dominant mediums for circulating malware: 46% of organisations receive malware through e-mail, and 94% of malware is delivered through e-mail. This implies that e-mail is the most lucrative medium for propagating phishing. A report [73] suggests that malware delivered via e-mail tripled in the fourth quarter of 2021. The e-mails designed for malicious purposes are generally spoofed [2], that is, they appear to have originated from a trusted sender such as a bank, educational institution, credit card company, or business partner. The probability of a phishing attempt being successful is greatly enhanced when the phishing e-mail is received from someone known to the victim [74]. The contents of the e-mail appear genuine in terms of language and overall visual details, but they instigate feelings of fear, urgency, or greed and impair the victim’s judgement. For example, the e-mail may warn that a credit card will be blocked in case of non-compliance, or that a social media account will be deactivated if the user fails to change his or her password. Some phishing e-mails start with a congratulatory message about the victim winning a lottery and ask for account details to transfer the prize money. Prior to launching the phishing attack, the fraudsters perform social engineering and gather basic details about the victim. These details, when mentioned in the e-mail, weaken the victim’s power of reasoning and allay suspicion. Cisco’s 2021 Cyber-security threat trends report [75] suggests that at least one person clicked a phishing link in around 86% of organisations. As per the Avanan Global Phish Cyber Attack Report 2021 [76], one out of every 99 e-mails is malicious and 54% of phishing e-mails are sent with the objective of harvesting credentials. Another key finding states that Business E-mail Compromise accounts for 20.7% of all phishing attacks and that 2.2% of phishing attacks are motivated by extortion. The report also suggests the use of Artificial Intelligence to mitigate phishing attacks.

4.1.3 Phishing through social media

Phishing through social media refers to phishing attacks that are carried out through social networking platforms like Facebook, Instagram, Twitter, etc. The busy personal and professional lives of users have created a geographical barrier between friends and loved ones. People find it difficult and sometimes cumbersome to personally go and meet each other. Social media has come to the rescue in this situation and has given netizens an opportunity to be a part of the lives of loved ones without being physically present. In 2022, out of 5 billion internet users throughout the world, 93% have an account on social media [77]. This wide outreach of social networking sites has enticed the phishers to use them as a potential medium to spread phishing through impersonation, credential theft and data gathering. The phishers can create a fake social media profile of the victim and send friend requests to their acquaintances. Furthermore, they can ask for money from the victim’s friends and family on their behalf. Defamation can also be a reason to set up a fake profile. Fake profiles of business houses, brands, and celebrities also exist.

A report [78] states that 16% of all the accounts on Facebook are either fake or duplicates. According to a report by the Federal Trade Commission (FTC), more than one in four people who reported losing money to fraud in 2021 said it started on social media. As per a report [73] by PhishLabs, there was a two-fold increase in phishing attacks using social media as a channel in 2021.

4.1.4 Phishing through Wi-Fi

The never-ending need for users to be online has driven the rapid advancement of wireless technologies. Users can access the internet even when they are away from their homes or offices, using Wi-Fi hotspots at public places like airports, restaurants, and shopping malls. This has made users susceptible to Wi-Fi phishing, or the Evil-Twin attack [79]. The phishers create a fraudulent access point, known as an ‘Evil Twin’, with the same network name or Service Set Identifier (SSID) as the authentic access point operating in that area. The login interface of the Evil Twin is forged convincingly enough to appear genuine. In most cases, the Evil Twin promises free internet service, and once users log into the rogue Wi-Fi network, their sensitive information, such as passwords and bank details, can be snooped on by the phisher.
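One simple heuristic that follows from this description is to compare the access points seen in a scan against a list of hardware addresses (BSSIDs) known to belong to the genuine network. The sketch below is illustrative only: the function name `unknown_access_points`, the SSIDs, and the BSSIDs are hypothetical, and the check assumes that a trusted list exists and that the attacker has not also spoofed the BSSID.

```python
# Hypothetical scan results as (SSID, BSSID) pairs; all values are made up.
scan_results = [
    ("Airport_Free_WiFi", "00:11:22:33:44:55"),
    ("Airport_Free_WiFi", "66:77:88:99:aa:bb"),
    ("CoffeeShop", "de:ad:be:ef:00:01"),
]

# Assumed allowlist of BSSIDs published by the genuine network operator.
known_bssids = {"Airport_Free_WiFi": {"00:11:22:33:44:55"}}


def unknown_access_points(scan, known):
    """Return access points that reuse a known SSID with an unlisted BSSID."""
    return [
        (ssid, bssid)
        for ssid, bssid in scan
        if ssid in known and bssid not in known[ssid]
    ]


suspects = unknown_access_points(scan_results, known_bssids)
print(suspects)
```

An SSID that is absent from the allowlist (here `CoffeeShop`) is not flagged, since nothing is known about which access points are genuine for it.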

4.1.5 Phishing through Internet of Things (IoT)

The main purpose behind the development of IoT was to create a network of consumer-level devices, such as security cameras, lighting systems, doorbells, refrigerators, and televisions, that can interact over the network and be controlled through voice commands or smartphones. However, their growing use in various spheres of our lives has also enlarged the attack surface [80, 81], which is being exploited by phishers. A network of interconnected devices, if breached, can pose a serious threat to user privacy and security in the victim organisation. As discussed in [82], IoT systems are vulnerable to phishing attacks because the web portal used by the consumer to configure the system is seldom accessed. The user is therefore unfamiliar with the web portal and cannot discern a fake web page. Moreover, the lack of standard technologies or protocols across individual devices may lead to poor coordination and, eventually, a vulnerable system [83]. As per a report [84], 98% of data over the IoT is unencrypted and more than half of the devices are exposed to cyber-attacks. Once compromised, a node in an IoT network can be recruited into a botnet to affect other nodes or used to initiate a distributed denial of service attack.

4.2 Classification based on intended recipient

4.2.1 Common phishing

This is the most simplified version of phishing, which relies on the fact that some people are more focused on the task at hand and tend to ignore the subtle inconsistencies that exist in the received message. Phishers send the message with the phishing link/URL using any of the phishing distribution mediums to thousands of random victims and hope that some of them will respond. The messages are forged to convince the recipients to divulge their credentials. The probability of success is low.

4.2.2 Spear phishing

This version of phishing is targeted at specific individuals or organisations [85,86,87]. Prior to launching the attack, the offenders gather social and personal information about the victims by monitoring their social media activity. After gathering sufficient information, they send a message that seems to come from a trusted party, such as a friend, colleague, business associate, or an organisation such as a bank, credit card company, or the administrator of a social media account. The language and appearance of the message are crafted to persuade the victims to disclose their sensitive information. A report [88] suggests that, of all the known groups carrying out targeted cyber-attacks, two-thirds make use of spear phishing. According to research [89], the actual incidence of spear phishing may be much higher, as it is seldom reported. The authors examined the reasons why users do not report spear phishing and concluded that self-efficacy, expected negative outcomes from reporting, and cyber security self-monitoring are the factors that influence the likelihood of individuals reporting spear phishing e-mails.

4.2.3 Whaling

A variant of spear phishing in which the executives with high-level access to organisational resources and information are targeted. Being a targeted attack, it is more sophisticated as compared to a common phishing attack. The success rate is high and the phishers perform a great deal of preparation before launching such attacks. The whaling e-mails are crafted with much effort and include business terminology and tone. They also contain personal information about the targeted individual to avoid any suspicion. The phishing attack on Facebook and Google [44] executives mentioned in the previous section is an example of a whaling attack.

4.2.4 Business e-mail compromise

Business e-mail compromise (BEC) [90] is like whaling in the sense that both involve company executives. However, in a whaling attack the business executive is the target, whereas in BEC the executive (mainly the CEO) or a trusted party is impersonated and the main target is the business itself. Posing as the CEO or another individual with high authority in the business, the phisher sends spoofed e-mails [2] to lower-level employees instructing them to transfer funds. The e-mail has a personalised, genuine feel and uses business terminology, leaving the victim little room for suspicion. The fake president phishing attack on FACC [45] discussed in the phishing case studies is an example of BEC. As per the APWG phishing activity trends report for the second quarter of 2022 [57], 73% of BEC messages were sent from free webmail accounts, 72% of which used Google webmail addresses. Gift cards, requested in about 40% of BEC attacks, were the most popular means of cashing out. Payroll diversion was attempted in 26% of the cases, wire transfer in 9.6%, and advance fee fraud in 15.5%. BEC has also become more costly, with a 14% rise in the average wire transfer cost. A Public Service Announcement by the FBI’s Internet Crime Complaint Center (IC3) [91] mentions that from 2019 to 2021 there was a 65% climb in global losses due to BEC. For almost the same period, another report [92] mentions an increase in BEC complaints involving the use of virtual meeting platforms. As a result of the COVID-19 pandemic, most organisations conducted business through virtual meetings, and the threat actors used these platforms to commit BEC scams. BEC attacks are difficult to detect as they rely on the social engineering skills of the phisher. Malicious URLs or software are not used; it is the fraudster’s ability to impersonate a high-ranking official that convinces the victim to transfer funds.

4.3 Classification based on technique

4.3.1 Phishing sites hosted on compromised domains (PSHCD)

To stage a phishing attack, the fake web pages need to be hosted on a server, and it is the domain of this server to which the victim is redirected. However, hosting involves a cost that must be borne by the phishers. The phishing URL might also be blacklisted or blocked by the search engine once it is reported. To overcome these obstacles, the phishers host their phishing site not on a domain they own but on one they have hijacked by exploiting vulnerabilities in a website content management system such as WordPress, Joomla or Drupal [93, 94]. In a study [95] as early as 2007, 76% of phishing web pages were found to be hosted on compromised domains. A recent report [96] mentions that only 24% of phishing websites are hosted on domains owned by the attacker. The rest are hosted either on hijacked domains or on free hosting services.

4.3.2 DNS poisoning

The Domain Name System (DNS) converts web addresses into numeric IP addresses [97]. Whenever users type a web address into their internet browser, the corresponding record is fetched from the DNS cache and the user is directed to the matching IP address. DNS poisoning [98, 99], also known as DNS cache poisoning, is an attack in which a record in the DNS cache is replaced with an IP address controlled by malicious users. Internet traffic intended for the genuine IP address is then redirected to the altered one. The attackers can host a phishing site at the fake IP address and dupe the victim.
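The effect of a poisoned cache entry can be pictured with a toy model. The sketch below is not a real resolver: the cache is an ordinary dictionary, the host name `bank.example` is hypothetical, and the addresses come from the ranges reserved for documentation. It only shows that once the cached record is overwritten, every later lookup returns the attacker's address.

```python
# Toy resolver cache: host name -> IP address (illustrative values only).
dns_cache = {"bank.example": "192.0.2.10"}


def resolve(name):
    """Return the cached address for a name, or None if it is not cached."""
    return dns_cache.get(name)


# Normal lookup: the victim is sent to the genuine server.
genuine_ip = resolve("bank.example")

# Poisoning: the attacker manages to overwrite the cached record.
dns_cache["bank.example"] = "203.0.113.66"

# Same name, same lookup, but the answer now points at the attacker's host.
poisoned_ip = resolve("bank.example")

print(genuine_ip, poisoned_ip)
```

How the record gets overwritten in practice (for instance by forged responses) is outside the scope of this sketch.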

4.3.3 Phishing through Botnets

A botnet is a network of systems that have been infected with malware and can be remotely controlled and commanded by cyber-criminals [100,101,102]. The malware is called a bot or a zombie, and the party controlling the botnet is termed the bot-herder. Instead of targeting specific individuals, the bot seeks out vulnerable machines on the internet. After a machine is infected and added to the botnet, its defences are incapacitated, for example by disabling the anti-virus software. The bot-herders can then communicate with the bot to issue instructions, receive vital information, and specify the further course of action. The objective of phishers using this technique is to perform automated tasks by harnessing the combined processing capability of the large number of machines that make up the botnet. The phishers can use botnets to [103]: send phishing e-mails to millions of other users, act as proxy services, carry out distributed denial of service attacks, spy, spread malware, act as servers to host phishing sites, and send spam.

4.3.4 Phishing through content injection

Content injection refers to the insertion of malicious code, scripts, images, links, etc. into a legitimate site by exploiting vulnerabilities in its code. Apart from leading the victims to a phishing site, this attack can also install malware on their computers. There are many variations through which an attacker can carry out a content injection attack. Cross-site scripting [104], or XSS, is one such attack, in which the attacker injects malicious JavaScript code through a data entry field or a URL of a legitimate website. XSS can bypass the same-origin policy [105], which restricts interaction between scripts and data originating from two different domains. The malicious script executes when the page is loaded in the victim’s web browser. Upon execution, the script can access the victim’s personal information stored in the browser and convey it to the attacker’s server. The script can also present a counterfeit login form to the victim to capture his or her confidential details. The victim can also be redirected to a phishing site under the pretext of visiting a legitimate site. The PatchStack white paper [106] states that in 2021 almost 50% of all WordPress vulnerabilities were cross-site scripting vulnerabilities, about 36% more than in the previous year.
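To make the injection step concrete, the following minimal sketch contrasts a page fragment built by pasting user input directly into HTML with one that escapes the input first. The function names and the payload are hypothetical, and output escaping with the standard-library `html.escape` is shown only as one common mitigation, not as a technique taken from the cited works.

```python
import html


def render_comment_unsafe(user_text: str) -> str:
    # User input is pasted into the page as-is, so markup in it becomes live.
    return "<p>" + user_text + "</p>"


def render_comment_escaped(user_text: str) -> str:
    # html.escape replaces <, >, & and quotes with HTML entities,
    # so the browser displays the text instead of executing it.
    return "<p>" + html.escape(user_text) + "</p>"


# Hypothetical payload submitted through a data entry field.
payload = "<script>steal()</script>"

print(render_comment_unsafe(payload))
print(render_comment_escaped(payload))
# <p><script>steal()</script></p>
# <p>&lt;script&gt;steal()&lt;/script&gt;</p>
```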

Phishing through content injection can also be done by Cross-site Request Forgery [107, 108], or CSRF. To launch a CSRF attack, the phishers lure the victim into clicking on a malicious URL, which can be dispatched to the victim through e-mail or social media. This URL is crafted to send an unauthorised request to a web application with which the victim already has an active session. The attack exploits the target application’s inability to distinguish between genuine requests made by the user and unauthorised requests sent on the user’s behalf. The attacker can thus trick the victim into sending an unwanted request to the web application server, for example to change a password, transfer funds, or add or delete records. The report in [106] states that Cross-Site Request Forgery accounts for 11.18% of all WordPress vulnerabilities.
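The weakness described here is that the server cannot tell the two kinds of request apart. A widely used countermeasure, sketched below under stated assumptions, is to give each session a secret token that a genuine form submits back and a forged cross-site request cannot know. The session dictionary and the function names are hypothetical and stand in for whatever a real web framework provides.

```python
import hmac
import secrets
from typing import Optional


def new_session() -> dict:
    # A random per-session token that is embedded in the site's own forms.
    return {"csrf_token": secrets.token_hex(16)}


def is_request_allowed(session: dict, submitted_token: Optional[str]) -> bool:
    """Accept a state-changing request only if it carries the session token."""
    if submitted_token is None:
        return False
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(session["csrf_token"], submitted_token)


session = new_session()

# Genuine form submission: the page included the token, so it is sent back.
genuine_ok = is_request_allowed(session, session["csrf_token"])

# Forged request triggered by a malicious link: no token, or a guessed one.
forged_ok = is_request_allowed(session, None)
guessed_ok = is_request_allowed(session, "forged")

print(genuine_ok, forged_ok, guessed_ok)
# True False False
```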

Another variation of content injection attack is Cross-site malicious CAPTCHA attack [109]. This attack also attempts to outmanoeuvre the same-origin-policy [105] as the XSS attack. The victim is tricked into revealing their sensitive personal information to a fraud website set up by the phishers. The sensitive information of the victim is displayed on the phoney website in an obscured manner like a CAPTCHA code and the victim is asked to complete the details. Other methods to lure the victim include a gaming challenge or typing test. The naïve victim fills in the personal details which are forwarded to the phishers.

4.3.5 Phishing through search engine optimisation

After fabricating a phishing website, the aim of the phisher is to induce the victims to click on the link of the website. Through the search engine optimization [110, 111], the phishers make sure that while searching for the particular goods or services on the search engine, their phishing site is displayed amongst the top results. To the victim, the phishing site appears to be indexed legitimately by the search engine. The phishers can also incorporate black hat search engine optimization [112] to improve the page rank of their phishing website resulting in better search engine indexing. This is achieved by including keywords of prominent occasions or trends in the designed malicious website. This technique boosts the probability of the victim clicking on the phishing link.

4.3.6 Phishing through domain-squatting

Domain-squatting [113], or cyber-squatting, refers to the act of registering domain names that are deceptively similar to those of trusted legitimate websites. Apart from registering the domain, the phishers also set up a clone website that is almost identical to the original. When a user mistypes the domain name of a website and the mistyped name has already been registered, he or she is directed to the phoney website, which has the same visual appearance as the original. The unwary victim enters details in the login form at the phishing site and ends up disclosing their credentials. Apart from relying on victims to make a typographical error, the phishers can also forward the phishing links to unsuspecting potential victims. Fraudsters employ several techniques to carry out domain-squatting.

Typo-squatting [114,115,116] exploits typographical mistakes made by the victim while typing the target domain name, for example Citibak.com or Citibamk.com in place of Citibank.com. Typo-squatting can also be carried out with the names of well-known software packages, so that users unknowingly download malicious software onto their systems.
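The kind of variants a typo-squatter might register can be enumerated mechanically. The sketch below is a simplified illustration, not the prediction model of [114]: the function name is hypothetical, the keyboard adjacency map covers only a handful of keys, and only three error types (omission, adjacent-key substitution, transposition) are considered.

```python
# Tiny, illustrative subset of QWERTY key adjacency (an assumption).
ADJACENT_KEYS = {"n": "bm", "a": "sq", "i": "uo", "k": "jl"}


def typo_variants(label: str) -> set:
    """Enumerate simple single-error typos of a domain label."""
    variants = set()
    for i, ch in enumerate(label):
        # Omission: drop one character (citibank -> citibak).
        variants.add(label[:i] + label[i + 1:])
        # Substitution: hit a neighbouring key (citibank -> citibamk).
        for near in ADJACENT_KEYS.get(ch, ""):
            variants.add(label[:i] + near + label[i + 1:])
    # Transposition: swap two neighbouring characters (citibank -> ctiibank).
    for i in range(len(label) - 1):
        variants.add(label[:i] + label[i + 1] + label[i] + label[i + 2:])
    variants.discard(label)  # the correctly typed label is not a typo
    return variants


candidates = typo_variants("citibank")
print("citibak" in candidates, "citibamk" in candidates)
# True True
```

A defender could compare such a candidate list against newly registered domains; the attacker uses the same list to decide what to register.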

Bit-squatting [117] is a variation of cyber-squatting which relies on bit-flip faults that occur while the user has made a DNS query. The bits can flip due to various reasons which include cosmic rays, malfunctioning hardware, and operating the device outside the permissible temperature range. As per the authors in [117], bit flips can also occur due to the absence of Error Correcting Code RAM in a majority of computers and smartphones. The Error Correcting Code RAM has the potential to identify and correct the bit flips. To leverage this fault, the phishers register domain names that have a letter differing by one bit from the corresponding letter in the target legitimate domain.
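The one-bit difference can be made concrete by flipping each bit of each character in turn and keeping the results that are still valid hostname characters. This is a minimal sketch with a hypothetical function name; it assumes ASCII labels and does not filter out labels that begin or end with a hyphen.

```python
import string

# Characters usable in a hostname label (DNS treats letters case-insensitively).
VALID_CHARS = set(string.ascii_lowercase + string.digits + "-")


def bit_flip_variants(label: str) -> set:
    """Return labels that differ from `label` by one flipped bit in one character."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            # Discard flips that yield characters not usable in a hostname.
            if flipped in VALID_CHARS:
                variants.add(label[:i] + flipped + label[i + 1:])
    return variants


variants = bit_flip_variants("citibank")
print("bitibank" in variants, "citibaok" in variants)
# True True
```

For instance, `c` (0x63) and `b` (0x62) differ only in the lowest bit, so `bitibank` is one bit away from `citibank`.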

Sound-squatting [118], also known as homophone squatting, is a domain-squatting technique based on homophones, i.e. words that are spelt differently but sound identical, for example Citibank.com and Sitibank.com. The phishers can register multiple homophonic variations of a target domain and wait for a confused user to land on their phoney website.

Homograph-squatting [119] technique exploits the visual similarity between different characters and registers bogus domain names to fool the users. An example can be googIe.com in place of google.com (use of upper case ‘i’ in place of lower case ‘L’). This form of cyber-squatting has been further enhanced by the introduction of IDN (Internationalised Domain Name) homograph-squatting. In IDN homograph-squatting, the attacker replaces one or more characters of the target domain name with other indistinguishable characters from a different language. For example, the Greek letter omicron (o) and Latin small o (o), Cyrillic ‘a’ and Latin ‘a’.
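IDN homographs can often be spotted because the label mixes characters from more than one script. The sketch below uses the first word of each character's Unicode name as a rough script label; this is a simplification, the function names are hypothetical, and the check does not catch look-alikes within a single script, such as an upper case ‘i’ standing in for a lower case ‘L’.

```python
import unicodedata


def scripts_used(label: str) -> set:
    """Approximate the scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Names look like "LATIN SMALL LETTER G" or "CYRILLIC SMALL LETTER O".
            name = unicodedata.name(ch, "UNKNOWN")
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    return len(scripts_used(label)) > 1


# U+043E is CYRILLIC SMALL LETTER O, visually close to the Latin letter o.
spoofed = "g\u043e\u043egle"

print(sorted(scripts_used(spoofed)), is_mixed_script(spoofed))
# ['CYRILLIC', 'LATIN'] True
print(is_mixed_script("google"))
# False
```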

Combo-squatting [120] involves registering domains that leave the original target domain name intact but add extra keywords to it (for example, youtubelogin.com or facebooklive.com). Amongst all the domain-squatting mechanisms, combo-squatting attracts the most traffic [121]. After analysing more than 468 billion DNS entries over a period of six years, the authors in [120] concluded that even after 1000 days, 60% of combo-squatting domains were still alive. Table 2 shows various examples of domain-squatting.

Table 2 Domain-squatting of the URL www.phishing.com

The detection of domain-squatting can be performed by analysis of the target domain. Authors in [114] have proposed a model to predict different typo-squatted domains for a particular target domain. The authors in [122] have proposed the detection of bit-squatting by analysing the different permutations of bit-flips for all characters of the domain.

4.3.7 Phishing through URL obfuscation

Today’s internet users can usually tell a genuine URL from a fake one if there are visible differences between the two. To get a potential victim to click on a phishing URL without becoming suspicious, cyber-criminals practise URL obfuscation [17, 123]. The phishing URL is either hidden or shortened, or made to imitate the authentic URL. Several techniques to obscure a phishing URL have been identified. The attacker can register domain names similar to those of authentic popular websites and share the phishing URL with the victim, mainly through e-mails. This technique is also known as Bad Domain Names. A popular way of creating an obfuscated URL is to add a subdomain: the phisher registers a domain (for example, mydomain.com) and shares a URL that prepends the name of a popular website as a subdomain (for example, facebook.mydomain.com). Victims who are unaware of the technicalities of a URL may consider this genuine without noticing the actual domain. URL obfuscation can also be achieved by swapping the domain and subdomain, or by changing the top-level domain or country code top-level domain. An unsuspecting user can visit such a URL and end up providing login details to the phishers.
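The subdomain trick works because the part of the host name that identifies the owner is at the right-hand end, not the left. The sketch below extracts that part from a URL. It makes the naive assumption that the registered domain is always the last two labels of the host name; this fails for suffixes such as co.uk, so real code would consult the Public Suffix List. The function name is hypothetical.

```python
from urllib.parse import urlsplit


def registered_domain(url: str) -> str:
    """Naively return the last two labels of the URL's host name."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Assumption: "name.tld" identifies the owner (not true for e.g. co.uk).
    return ".".join(labels[-2:])


print(registered_domain("https://facebook.mydomain.com/login"))
# mydomain.com
print(registered_domain("https://www.facebook.com/login"))
# facebook.com
```

Although the first URL begins with the word facebook, the page is served from mydomain.com, which is the domain the phisher controls.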

URL shortening can also be used for the purpose of phishing. Third-party URL shorteners like tinyurl.com and smallurl.com provide a facility to shorten a URL. This helps to manage lengthy URLs which have various subdomains, multiple subfolders and query strings. Although these URL shortening services are not intended for it, phishers use them to fulfil their unlawful purposes. The fraudsters register a phoney website and use the third-party service to generate a shortened URL, which they share with the victim. Since the actual URL is obscured, the victim cannot tell whether the web page to be visited is genuine or forged. Table 3 lists some sample obfuscated URLs for the genuine URL www.new.legitimate.com.

Table 3 URL obfuscation for www.new.legitimate.com

4.3.8 Phishing through JavaScript obfuscation

JavaScript is a scripting language used for client-side web page design and is employed by most websites [124]. A fraudster can embed a malicious script in the client-side code, which executes when the page is loaded in the victim’s browser and either redirects the victim to the phishing site or installs malware on the victim’s computer. JavaScript can also be used to create a hoax address bar, padlock icon, and SSL certificate and make the URL of a phishing site appear legitimate. In the absence of any apparent discrepancies, the user does not become suspicious. A survey [125] reports that 83% of phishing sites use SSL certificates. With the growing incidence of phishing, users employ anti-virus software to guard against such attacks on client-side JavaScript code. In response, attackers obfuscate the phishing code to dodge the anti-virus software [126, 127]. The obfuscated code resembles machine-level code and is difficult to comprehend and analyse. Obfuscation also makes it difficult for researchers to reverse-engineer the phishing code and understand its origin [103].

4.3.9 Phishing through SQL injection

The development of Internet technology has led to an increase in the number of web-based applications. The pressure on web developers to deliver software by the deadline has become a cause of many security-related issues. As per a study [128], 75% of web applications and online businesses are susceptible to being attacked online, and the SQL injection attack [129, 130] is one such attack. Through an SQL injection, the attacker sends malicious SQL commands to the database via a data entry form in the website’s user interface. The attacker mainly targets web applications that lack proper security measures such as user input validation, a web application firewall, data sanitisation, etc. An SQL injection attack, if successful, enables the attacker to gain unauthorised access to information such as sensitive business attributes, personal details and financial information. The attacker can manipulate the back-end database and steal credentials for impersonation and privilege abuse. Administrative rights enabling the attacker to modify or delete entire tables can also be obtained through SQL injection attacks.
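The effect of missing input handling can be sketched with Python's standard `sqlite3` module and an in-memory database. The table layout, the credentials and the function names are illustrative assumptions, and parameterised queries are shown as one common mitigation rather than as a measure taken from the cited studies.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with one hypothetical user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")
    return conn


def login_vulnerable(conn, name: str, password: str) -> bool:
    # User input is pasted directly into the SQL text, so it can change
    # the structure of the query.
    query = (
        "SELECT COUNT(*) FROM users "
        f"WHERE name = '{name}' AND password = '{password}'"
    )
    return conn.execute(query).fetchone()[0] > 0


def login_parameterised(conn, name: str, password: str) -> bool:
    # Placeholders keep user input as data; it is never parsed as SQL.
    query = "SELECT COUNT(*) FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (name, password)).fetchone()[0] > 0


if __name__ == "__main__":
    conn = make_db()
    payload = "alice' --"  # closes the string and comments out the password check
    print(login_vulnerable(conn, payload, "wrong"))
    print(login_parameterised(conn, payload, "wrong"))
```

With the payload shown, the vulnerable function accepts the login because the password condition is turned into an SQL comment, whereas the parameterised version treats the whole payload as a (non-existent) user name.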

4.3.10 Phishing through man in the middle attack

As the name suggests, in a Man-in-the-Middle (MITM) attack [8, 17, 27], the phisher places itself between the user and the web application server and acts as a proxy to the server. By employing the MITM attack, the assailant can intercept, eavesdrop on and even alter the communication between the user and the server, and gain access to the victim’s private and confidential information without raising suspicion. The phisher sets up two separate connections: one between itself and the victim, and the other between itself and the server. The victim perceives the attacker’s server as the genuine server and discloses his or her credentials, which the attacker can use to escalate the attack or store for later misuse. The attacker’s server, in turn, communicates with the real web server while masquerading as the actual user. Thus, both the victim and the server correspond with the attacker’s proxy server. MITM attacks can be deployed over various communication channels [131] such as Bluetooth, Wi-Fi, and GSM. The attack is hard to detect because the two-way communication between the victim and the server proceeds seamlessly and nothing appears to be amiss. Even in the case of secured HTTPS communications, the attacker establishes its own SSL connections. To route the victim’s traffic through itself, the attacker deploys various methods such as DNS cache poisoning, URL obfuscation, transparent proxies, and browser proxy configuration. DNS cache poisoning and URL obfuscation have already been discussed in earlier parts of this section. In the transparent proxy method, the phisher reuses the source code of the genuine website to create its phishing version and sets up a transparent proxy cache [17]. The victim’s request for the genuine website is intercepted by the proxy, and the phishing version is returned. This fraudulent version takes over the communication between the victim and the genuine web server and can also store the victim’s credentials and sensitive information.
Browser proxy configuration [17] involves tweaking the proxy configuration at the victim’s web browser and influencing the entire web traffic to pass through the attacker’s server. In order to alter the proxy settings, the phisher initiates the attack in advance by deploying malware (maybe through an e-mail). After altering the proxy configurations in the browser, the attacker’s server acts as a proxy between the victim and the authentic server destination.

4.3.11 Phishing through ClickJacking

Also known as the UI redressing attack [27, 132,133,134], ClickJacking attempts to hijack the clicks made on the user interface of a web page, the link to which the victim might have received through a spoofed e-mail [2]. The victim is lured into clicking on a web page element (such as “Click on this button to claim your reward”) that is disguised as a genuine element. However, the click leads the victim to perform unintended tasks such as the inadvertent download of malware, execution of malicious scripts, submission of credentials, or making payments. The attacker exploits iframes, a web design feature that allows one web page to be displayed within, or overlaid on, a parent web page. The target web page is invisible and is embedded on top of the legitimate web page, which the user recognises. The user believes that the web element being clicked is genuine but is, in fact, clicking on a malicious invisible element placed on top of it. There are many variants of ClickJacking attacks. In LikeJacking [135], the user is tricked into liking a post on Facebook, which is transparently superimposed over some other web page element (such as the ‘Skip Ad’ button). In CursorJacking [133], the attacker replaces the genuine cursor with a decoy cursor; the victim becomes disoriented about the actual position of the cursor and clicks on an unintended region of the page. Another variant is an attack on the pointer [136], where an invisible iframe is attached to the pointer and moves across the screen with it. Whenever the user clicks, regardless of the user’s intentions, the invisible iframe is clicked. The attacker can also confuse the victim through Partial Overlays [137] by hiding a part of the target web element. For example, on an online payment page, the attacker can obscure the receiver details and amount but leave the ‘Make Payment’ button as it is. The oblivious victim goes ahead with the transaction without knowing the actual receiver of the amount. Finally, the attacker can exploit the response delay that users exhibit while clicking during any task [137] and launch a Timing Attack. The reaction time can be the time it takes to click while hovering over a display element or the time between the two individual clicks of a double click. The attacker can insert a web element (such as a ‘Pay Now’ button) over a decoy button just before the user clicks.

4.3.12 Phishing through embedded objects

Most of the phishing detection techniques [23, 138,139,140,141,142] rely on the source code and textual details of the suspicious website. They extract features accordingly and compare them with features of an authenticated site. If for a unique URL, the similarity is below a predefined threshold, the suspicious website is classified as a phishing website. To bypass these techniques, attackers replace the entire textual content of a web page with embedded objects such as images [103], flash, scripts, etc. Upon substituting the source code content with an embedded object (say, an image), the various phishing detection methods are not able to extract features, thus resulting in inaccurate classification.

4.3.13 Phishing through tab-napping

Tab-napping, also known as tab-jacking [143], is the hijacking of a browser tab. Users who habitually keep multiple tabs open while accessing the internet are the most vulnerable to this attack. The attacker shares a phishing link with the victim. Once the victim clicks on the link, a phishing page resembling a genuine web page is opened. Nothing remarkable happens while the victim is on that phishing web page. However, once the victim navigates to another open tab and the phishing site’s tab becomes inactive, a malicious script embedded in the phishing site executes, loads a hoax login screen (such as for an e-mail or social media account) and modifies the tab’s favicon and title. When the user’s focus shifts back to the phishing tab and he or she notices a login screen, the user assumes that an earlier session has expired. Since the user had opened the same website earlier, even though on a separate tab, he or she remains unsuspicious about the phishing website and submits the login credentials, which are sent to the attacker. The attacker exploits the victim’s presumption that once a tab is opened, its contents remain static. The victim is oblivious to the fact that even a pre-loaded tab can be made to open a phishing website by executing malicious JavaScript code. The execution of this attack was demonstrated in [144].

4.3.14 Phishing through session fixation

Also known as the preset session attack [17], this attack targets the session identifiers that the server uses to track the activity of the user throughout a session. When the user logs into a website after validation and performs various activities, such as selecting items to buy, making payments, or updating their profile, a unique session ID (SID) is assigned to that particular session. The SID keeps track of the user’s activity as he or she navigates through the web pages of the website. The SID can be carried in a URL, cookie, or form field. In a session fixation attack, the phishers create a session ID before the victim logs into the web server and lure the victim into starting a session with it. Thus, there is no need to obtain the victim’s session ID afterwards [145]. This may be done by sending the victim an e-mail containing a URL with the created SID. The URL may be that of a login form. As soon as the victim authenticates with his or her login details, the phisher can hijack the active session and perform unauthorised transactions. Since the URL is that of a legitimate website, the victim does not become suspicious.
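The mechanics of session fixation can be sketched with a toy in-memory session store. Everything here is a hypothetical model rather than the behaviour of any particular web framework: the class `SessionStore`, its methods and the flag `regenerate_on_login` are assumptions made for illustration, and issuing a fresh SID at login is shown as one widely used mitigation that the text above does not itself discuss.

```python
import secrets


class SessionStore:
    """Toy server-side session table mapping SID -> authenticated user."""

    def __init__(self, regenerate_on_login: bool):
        self.regenerate_on_login = regenerate_on_login
        self.sessions = {}

    def visit(self, sid=None) -> str:
        """Start an anonymous session; a client-supplied SID is accepted as is."""
        if sid is None:
            sid = secrets.token_hex(16)
        self.sessions.setdefault(sid, None)
        return sid

    def login(self, sid: str, user: str) -> str:
        """Authenticate the session and return the SID now bound to the user."""
        if self.regenerate_on_login:
            # Discard the pre-login SID so that a value chosen by an
            # attacker never becomes an authenticated session.
            self.sessions.pop(sid, None)
            sid = secrets.token_hex(16)
        self.sessions[sid] = user
        return sid

    def current_user(self, sid: str):
        return self.sessions.get(sid)


if __name__ == "__main__":
    fixed_sid = "sid-chosen-by-attacker"
    for regenerate in (False, True):
        store = SessionStore(regenerate_on_login=regenerate)
        store.login(store.visit(fixed_sid), "victim")
        print(regenerate, store.current_user(fixed_sid))
```

Without regeneration, the attacker who planted the SID can act as the victim after login; with regeneration, the planted SID no longer maps to any user.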

4.3.15 Phishing through phishing kits

Once a phishing site is reported, it can be efficiently blacklisted or blocked, so attackers continually need fresh sites, yet creating a phishing site from scratch is a time-consuming process. A phishing kit [146] enables criminals to create a fresh hoax website easily, just by following simple instructions and without advanced programming abilities. Rather than designing the phishing website themselves, the attackers deploy phishing kits, which are readily available on the dark web [48]. Phishing kits are ready-made fake templates of famous websites that have a vast customer base. The fraudsters only need to execute the instructions provided along with the phishing template to carry out a phishing attack. Consequently, they need not possess advanced technical skills to be successful phishers. Most phishing kits also facilitate the hosting of the phishing site, mainly on compromised websites or on websites that provide free hosting services. Some sophisticated phishing kits may also contain means of transmitting the phishing website to the victim and scripts to capture victim credentials. In 2021, Kaspersky blocked 1.2 million phishing websites which were generated through 469 phishing kits [147]. Phishing-as-a-Service (PhaaS) [31] is also available from a variety of online resources and can be purchased by anyone with money and malicious intent. PhaaS is a business model through which an experienced cyber-criminal becomes a service provider for a novice phisher. Through PhaaS, the phisher can develop and deploy phishing sites and manage their money-related aspects without any hassle.

4.3.16 Phishing through AI-generated URLs

Software-based phishing detection techniques use Artificial Intelligence (AI) to train systems for the detection of phishing websites. Phishers, in turn, can try to improve their attacks to bypass such anti-phishing approaches. To outline the approaches that threat actors use to evade AI-based phishing detection, the authors in [148] analysed more than a million phishing URLs to understand the strategies that phishers can use to create phishing URLs. The authors simulated how attackers may use deep neural networks to improve their efficiency. Using Long Short-Term Memory (LSTM) networks, they created an algorithm that generates synthetic URLs. These URLs were shown to have a much better likelihood of evading AI-based detection mechanisms.

5 Analysis and discussion on various phishing attacks

This section presents a detailed analysis of different categories of phishing attacks that have been identified in the previous section.

Phishing Distribution Mediums: In an attempt to widen their reach, the attackers are targeting mobile users through personal communication mediums at a much higher rate. A report [150] shows a 50% rise in attacks on mobile devices in 2022 compared to the previous year. Handheld devices are more vulnerable to phishing attacks than desktops because of architectural differences. There has also been a surge in the use of voice mail and text messages to carry out spear phishing and BEC attacks. The purpose behind this is to lend a sense of credibility to the sender. Attackers also target open Wi-Fi networks in public places to steal the user’s credentials. The use of a VPN (Virtual Private Network) can provide an additional layer of safety against this [151]. In 2022, attacks on another phishing distribution medium, IoT, rose by 65% [152]. The primary reasons identified behind this surge are the lack of sufficient security mechanisms in IoT devices, the unregulated addition of devices with dubious supply chains, which makes the network vulnerable to malware attacks, and uninterrupted access to the IoT devices through the internet, which opens a back-door for the fraudsters.

Though the attackers are exploring a wide range of phishing distribution mediums, a majority (96%) of phishing attacks continue to be delivered through phishing e-mails [72]. The primary reason is that e-mail is the most widely accepted, convenient and inexpensive means of sharing messages with millions of users at a mouse click. Moreover, the degree of anonymity that the attackers enjoy while sending a phishing e-mail is unparalleled. They can use spoofed or compromised sender addresses to make the e-mails appear genuine. As per [153], 21% of the users are unaware of the concept of e-mail spoofing. The attackers also launch targeted attacks that can bypass e-mail filtering mechanisms, resulting in 18.8% of phishing e-mails entering the victims’ inboxes [154].

Phishing Intended Recipients: Amongst targeted attacks, spear phishing is the most common variant. Spear phishing e-mails are difficult to identify as they do not show any signs of being fake. Also, most of them originate from hijacked e-mail accounts. Due to the availability of various platforms for hosting fake web pages, such as Microsoft Azure custom domains, it has become arduous to single out the fake ones. The presence of a Microsoft SSL (Secure Sockets Layer) certificate further dampens any suspicion. However, more than 80% of phishing sites are protected by SSL [125]. This implies that the mere presence of an SSL certificate can no longer be considered a sign of safe browsing.

Table 4 Analysis of various phishing attacks

BEC attacks showed an increase of 81% over the two halves of 2022 [155]. The open rate of BEC attack e-mails was 28%, and the reply rate among the opened e-mails was 15%. The major cause behind this substantial growth has been identified as the presence of victims’ information on LinkedIn, social networking sites, parent websites, etc. The threat actors can leverage these details to produce convincing e-mails with a higher likelihood of tricking the employees of the organisation.

Phishing Techniques: The phishing techniques discussed in the previous section employ a wide range of tactics to deceive the victims. Some of these techniques exploit weaknesses in the target organisation such as lack of sufficient security measures, firewalls, data validations, and coding vulnerabilities. Phishing through botnets, content injection, JavaScript obfuscation and SQL injection attacks fall into this category. Initially, JavaScript obfuscation was used to prevent web scams by obscuring the actual code. However, phishers now use the same strategy to evade detection. A report [156] reveals that, out of 10,000 malicious JavaScript samples, at least 25% used obfuscation. SQL injection attacks comprised 76% of all web application attacks in 2020 [157]. Cyber-criminals are increasingly using botnets to launch a series of attacks on unaware victims. In 2021, botnet attacks saw a rise of 23% [158].

Phishing attacks through session fixation, ClickJacking and Tab-Napping work by relying on the technical skills of the attacker. In order to generate phoney clicks on hidden adverts, fraudsters have long depended on malware or automated scripts. However, in recent years, criminal organisations have begun to switch to methods that hijack actual user clicks. A study [159] collected data on 250,000 websites and found ClickJacking scripts on 613 popular websites. Although this is not an astounding figure, the fact that these 613 websites attracted a daily traffic of 43 million hits is alarming.

Phishing through domain squatting and URL obfuscation depends on the victim’s ignorance or inability to distinguish between genuine and fake brand domain names. In both these attacks, popular brand names are either misspelt in the domain or used as is in the subdomain or path, with the intention of misleading the victim into visiting the URL. During the COVID-19 pandemic, 55% of phishing sites used brand names in the URL [160]. LinkedIn was the most impersonated brand in the first quarter of 2022 [161].

A large proportion of phishing sites are hosted on compromised domains [95, 96]. The rationale is that phishing sites hosted on compromised domains (PSHCD) can bypass list-based detection measures, as the legitimate domains on which they are hosted are not included in blacklists. Since these domains have been in existence for a long time, PSHCD can also bypass detection measures that take the domain registration time into account. Owing to the reputation of the legitimate domains, they are indexed in search engines and can therefore bypass search engine-based phishing detection approaches as well. It is very important for academia to devise an efficient technique to distinguish between maliciously owned and compromised domains. A different strategy needs to be followed once a phishing domain is identified as compromised, because the owner of the compromised domain is also a victim [93]. If the compromised domain is blacklisted, its owner would suffer monetary losses without being the culprit. Table 4 presents a summary of the critical analysis of various phishing attacks.

Phishing detection approaches mentioned here are discussed in detail in the following section.

6 Phishing countermeasures

This section presents a summary of different anti-phishing measures that are being applied to counter phishing. Throughout the literature, phishing countermeasures are segregated into user involvement-based countermeasures and software-based countermeasures. The main idea behind user education-based phishing countermeasures is that most phishers are able to fulfil their goal due to a lack of awareness among individuals. A large proportion of internet users are unaware of basic security etiquette that must be followed while being online. So, it is of utmost importance for them to be provided with effective training and guidance about the response to be followed during diverse interactions on the internet (such as e-mails from banks or any other online service for routine maintenance and data updates).

User education is a very important and effective constituent of phishing countermeasures. But, its main drawback is its reliance on the user’s skill and ability to understand the use of the system. Even the security experts are outwitted by the phishers, who are learning new skills to introduce new methods of deception. Moreover, users have to dedicate considerable man-hours towards learning the process.

6.1 Software-based phishing countermeasures

The researchers’ preference for software-based phishing countermeasures stems from their ability to withstand a phishing attack with minimal user involvement. Based on the features being used, the software-based phishing approaches are categorised as list-based approaches, heuristics-based approaches and visual similarity-based approaches. Their further categorisation is discussed here.

6.1.1 List-based phishing detection approach

This approach is further broken down into blacklist-based and whitelist-based approaches. Blacklist-based phishing detection techniques [162,163,164] maintain a database of resources such as URLs, websites, images, DOM (Document Object Model), etc., which are known to have been reported as phishing. Whenever the user clicks on a URL or visits a web page, it is first checked against the blacklist. If a match is found, the system warns the user about a possible phishing threat or even blocks the malicious web page from being loaded in the user’s browser. Whitelist-based phishing detection techniques [165,166,167] maintain a database of legitimate resources. The available resource of a suspicious website is matched against the whitelist and flagged as phishing if its similarity with an entry in the whitelist is above a predefined threshold but its domain does not match. In the case of URLs, if the URL matches an entry in the whitelist, it is deemed legitimate.
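The two list-based checks can be combined in a small sketch. It is an illustration under stated assumptions, not one of the cited systems: the list entries are made-up `.example` hosts, page text stands in for the stored "resource", the similarity measure is `difflib.SequenceMatcher` from the Python standard library, and the threshold of 0.8 is arbitrary.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit

# Hypothetical lists; real systems hold many entries and richer resources.
BLACKLIST = {"http://login.badsite.example/verify"}
WHITELIST = {
    "bank.example": "Welcome to Example Bank. Please sign in to your account.",
}


def classify(url: str, page_text: str, threshold: float = 0.8) -> str:
    """Return 'phishing', 'legitimate' or 'unknown' for a visited page."""
    # Blacklist: an exact match with a reported phishing URL.
    if url in BLACKLIST:
        return "phishing"
    host = (urlsplit(url).hostname or "").lower()
    # Whitelist: the host itself is a known legitimate domain.
    if host in WHITELIST:
        return "legitimate"
    # Whitelist similarity: content close to a legitimate entry served
    # from a different domain is flagged.
    for reference_text in WHITELIST.values():
        similarity = SequenceMatcher(None, page_text, reference_text).ratio()
        if similarity >= threshold:
            return "phishing"
    return "unknown"


if __name__ == "__main__":
    cloned = WHITELIST["bank.example"]
    print(classify("http://bank.example.evil.example/login", cloned))
    print(classify("https://bank.example/login", cloned))
```

The "unknown" outcome corresponds to pages that neither list can decide on, which is where the other detection approaches discussed in this section come in.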

Pros and Cons The main advantage of list-based techniques is that they are simple and lightweight to implement in the client’s browser. Their primary drawback is that they cannot detect zero-day phishing attacks, i.e. those phishing attacks that are as yet unknown to the users and the anti-phishing players in the industry. As per [168], most phishing websites (47%-83%) are added to the blacklist only after 12 h. Moreover, phishers make slight modifications to blacklisted URLs and are thereby able to evade the phishing detection filters. Another drawback is that the list needs to be updated periodically to keep pace with the exponential growth of phishing websites.

Table 5 Some URL-based features

6.1.2 Heuristic phishing detection approach

This category of phishing detection approach relies on the features/properties that known phishing websites typically display, and either trains a machine learning model or uses a rule-based approach to look for these properties in a suspicious website. The approach is based on features extracted from the URL, the source code, and third parties. URL-based heuristic phishing detection techniques [138, 169,170,171] select features from the suspected URL to detect phishing. These features may be count-based or binary. Some of the URL-based features are listed in Table 5. Source code-based heuristic phishing detection techniques extract common features present in the content of the suspicious web page being loaded. These can be based on hyperlinks [139] or textual keywords [172, 173]. Table 6 lists some of the source code-based features. Third-party-based heuristic phishing detection techniques [142, 174, 175] rely on data provided by a service other than the software or the user, such as search engine indexing, page rank, WHOIS information, domain age, etc. Search engine-based techniques [176] extract keywords along with the title, meta description, copyright information and domain from the website’s source code and generate a query. This query is then fed to the search engine. The given website is classified as legitimate only if its domain is returned in the top search engine results for the query. Hybrid techniques [140, 141, 177,178,179], which employ a combination of heuristic approaches (such as hyperlink-based with content-based or content-based with search engine-based), are also prevalent.
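Count-based and binary URL features of the kind mentioned above can be extracted with the Python standard library alone. The particular features below are illustrative assumptions chosen for the sketch; they are not taken from Table 5 or from any of the cited feature sets, and the function name `url_features` is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Extract a few count-based and binary features from a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = 1
    except ValueError:
        host_is_ip = 0
    return {
        # Count-based features.
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        # Binary features.
        "has_at_symbol": int("@" in url),
        "host_is_ip": host_is_ip,
        "uses_https": int(parts.scheme == "https"),
    }


if __name__ == "__main__":
    for sample in ("https://www.new.legitimate.com/",
                   "http://192.168.0.1/login@bank"):
        print(sample, url_features(sample))
```

The resulting dictionaries could be turned into feature vectors for a classifier or tested against hand-written rules; which features are informative is an empirical question that the cited works address.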

Table 6 Source code-based features
Table 7 Summary of various anti-phishing approaches

Pros and Cons The primary advantage that heuristics-based approaches enjoy over list-based approaches is that they are capable of detecting zero-day attacks. Machine learning algorithms help to achieve high accuracy, although they increase the computational overhead and training cost. Search engine-based techniques have low complexity and work in real time, but they fail when a newly registered genuine website is encountered, resulting in high false positives. Keyword-based techniques are language-dependent and work only for the English language. Another limitation of heuristics-based approaches is that not all phishing sites possess similar features. Once the fraudsters gain knowledge about the phishing detection scheme used, they can easily bypass the features and continue with their malicious designs. Also, these approaches are not able to correctly classify phishing sites hosted on compromised domains (PSHCD).

6.1.3 Visual similarity-based phishing detection approach

To circumvent heuristic phishing detection techniques, attackers replace the entire content of the web page with embedded objects such as images, Flash, Java applets, etc. To counter such phishing sites, visual similarity-based phishing detection techniques [180,181,182,183] are used, which are based on the assumption that the attacker tries to imitate the visual details of a targeted genuine website to deceive the victim. A database containing visual features (such as font details, images, logos, page layout, etc.) of known legitimate sites is maintained, and if the similarity score between the suspicious website and an entry in the pre-stored database is above a certain threshold while the domains do not match, the suspicious site is labelled as phishing.

Pros and Cons Visual similarity-based techniques can detect embedded objects in a web page that the heuristic techniques fail to detect. Moreover, these techniques use features that are common for the entire website. So, there is no need to extract different features for different web pages of a single website. However, they fail when non-pre-stored phishing sites are encountered. Insertion of empty HTML tags or deletion of unimportant tags also leads to their failure. These techniques suffer from large storage requirements and computational complexity. Furthermore, the attacker can evade these techniques by reducing the similarity in appearance.

Table 7 summarises various anti-phishing approaches on the basis of different properties exhibited by them.

6.2 Datasets

Researchers need access to a dataset of phishing and legitimate websites to train and test a proposed technique and to evaluate its performance. One of the benchmark datasets for malicious sites is PhishTank. Launched in 2006, PhishTank [184] is a community-based phishing verification system. A suspected phishing site is added to the dataset once it has been reported and verified. The dataset is updated periodically and has proved to be of great assistance to researchers in this field. Alexa Top Websites was a resource for legitimate websites. Though it was discontinued in 2022, other players like Ahrefs [185], Similarweb [186] and Majestic Million [187] have filled the gap.

To tailor the data to their requirements, researchers also create and publish their own datasets. The dataset [188], proposed by the authors of [169], is one such example. It incorporates 36,400 genuine URLs and 37,175 malicious URLs. Another dataset is ISCX-URL2016 [189], prepared by researchers at the University of New Brunswick. It is a repository of 114,250 phishing as well as legitimate URLs, created by consolidating the URLs from five different data sources. The Mendeley dataset [190] contains 58,000 legitimate and 30,647 phishing web instances having 111 features each. The computer emergency response team (CERT) of Japan released a dataset, JPCERT/CC [191], of phishing URLs that were confirmed from January 2019 to June 2022. All the above-mentioned datasets are publicly available and are of cardinal value to the research community.

6.3 Analysis of various phishing countermeasures

This section presents a comprehensive analysis of various phishing detection approaches that researchers have proposed over the years.

List-based approaches To tackle the problem of near-duplicate blacklisted URLs, [164] proposed an approach which detects variants of blacklisted sites by generating all possible URLs from the blacklist. The generated URLs are checked for maliciousness through content similarity and DNS queries. The technique fails when different URLs having the same phishing content are encountered. In [162], a third-party-independent approach is proposed to detect replicas of existing blacklisted sites. The source code features of suspicious sites are compared with those of the blacklisted web pages, and similarity is calculated using the Hamming distance. The approach does not give the desired results when the entire content of the web page is replaced with an image. [192] adopts a different viewpoint to identify malicious web pages and add them to blacklists. New URLs are found by following phishing forms iteratively and tracking the redirections obtained from blacklisted URLs. The whitelist approach works by classifying the resources that are not in the list as malicious [165]. But this approach classifies newly visited web pages as phishing. To overcome this issue, in [166], the new web pages that a user visits are first checked for legitimacy via DNS details and hyperlink information and, if found genuine, added to the whitelist. The approach is said to be able to detect zero-day attacks. A similar strategy is suggested in [167] but with a different method for comparing the domain name to the whitelist entries. However, both approaches suffer from third-party dependency.
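The Hamming-distance comparison mentioned for [162] can be illustrated as follows. This is only a sketch of the distance measure: the encoding of a page as a fixed-length binary feature vector, the feature names, and the tolerance `max_distance` are all assumptions and do not reproduce the features used in [162].

```python
def to_vector(features: dict, keys: list) -> list:
    """Encode selected boolean page features as a fixed-length 0/1 vector."""
    return [int(bool(features.get(key, False))) for key in keys]


def hamming_distance(a: list, b: list) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_replica(suspect: list, blacklisted: list, max_distance: int = 1) -> bool:
    """Treat the suspect page as a replica if it differs in few features."""
    return hamming_distance(suspect, blacklisted) <= max_distance


if __name__ == "__main__":
    # Hypothetical source-code features of a web page.
    keys = ["has_login_form", "has_external_favicon",
            "has_hidden_iframe", "has_password_field"]
    known = to_vector({"has_login_form": True, "has_external_favicon": True,
                       "has_password_field": True}, keys)
    suspect = to_vector({"has_login_form": True, "has_hidden_iframe": True,
                         "has_external_favicon": True,
                         "has_password_field": True}, keys)
    print(hamming_distance(suspect, known), is_replica(suspect, known))
```

A small distance indicates that the suspicious page shares almost all encoded features with a blacklisted page, which is the intuition behind detecting near-duplicates of known phishing sites.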

Heuristics-based approaches In [170], 14 URL-based features are extracted to train Naive Bayes and SVM (Support Vector Machine) classifiers. An accuracy of 90% is achieved through SVM. [169] presents a method for phishing detection that combines seven different machine learning algorithms with word vectors, hybrid features, and NLP (Natural Language Processing) based features. The suggested method can be used in any language, is independent of third parties, operates in real time, and can even identify newly launched phishing websites. With 73,575 URLs, the authors have produced a sizable phishing dataset. According to the experimental results, a Random Forest classifier combined with NLP-based features achieved the best accuracy of 97.98%. However, the method is less effective when used with phishing URLs that have short domains or no path. In [138], a phishing detection method with only 9 URL-based lexical features is proposed. The approach’s major goal is to create a system that can be used in Android applications and IoT (Internet of Things) contexts and quickly identify malicious URLs. Although the method yields a 99.5% accuracy, using too few features can lower the accuracy in real-world settings. In [171], 9 efficient features from the URL are extracted and used to train 7 different machine learning classifiers. The random forest model provided the best accuracy of 95.2%. In [194], 33 URL-based attributes are collected from more than 11,000 websites. These attributes are fed to various machine learning classifiers after preprocessing, and the proposed ensemble classifier LSD (Logistic Regression, SVM and Decision Tree) achieved the highest accuracy of 98.12%.

The title of the web page is appended to the domain name and given as a search query to the Google search engine in [174]. The website is considered legitimate if the domain name matches any of the domain names of the top search results. The user is warned of phishing if the domain is not on the search engine result page. A 95.95% accuracy is attained, although the method is language dependent and suffers from significant false positives when legitimate but less well-known, freshly released websites are encountered. A client-side deployable, third-party independent approach is proposed in [139]. To obtain a 98.4% accuracy rate, the suggested method extracts hyperlink-based features from the website source code and trains a logistic regression classifier. But if a phisher modifies the page source references (for instance, favicons, pictures, or JavaScript) or uses embedded objects, the approach can be circumvented. [176] integrates search engine-based and heuristic methods to propose a language-independent and lightweight solution that achieves a true positive rate of 98.15%. [193] presents a methodology that combines a web content-based approach, heuristic features, and blacklist-based characteristics. Extracting comprehensive features from data acquired from four separate sources, namely phishing sites, suspicious sites, legitimate sites, and spoofed web, improves detection accuracy.
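Hyperlink-based features of the kind mentioned for [139] can be sketched as follows. The example collects the `href` and `src` references in a page and computes the share that point to a foreign host and the share that are empty anchors. The two ratios and all names are illustrative assumptions; the actual features of [139] are not reproduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the values of href and src attributes from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def hyperlink_features(page_url: str, html: str) -> dict:
    """Hypothetical features: share of external and of empty ('null') links."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    external = null = 0
    for link in collector.links:
        if link.startswith("#") or link.lower().startswith("javascript:"):
            null += 1
            continue
        # Resolve relative references against the page URL before comparing.
        if urlsplit(urljoin(page_url, link)).hostname != page_host:
            external += 1
    total = len(collector.links)
    return {
        "total_links": total,
        "external_ratio": external / total if total else 0.0,
        "null_ratio": null / total if total else 0.0,
    }
```

A page that loads its logo and help links from the imitated brand's own domain would show a high external ratio. The sketch also makes the stated weakness visible: if the phisher rewrites all references to local copies, both ratios fall to values typical of a legitimate page.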

[173] propose a novel technique for phishing detection using plain text and domain-specific word embedding from the HTML source code. To evaluate their model, they used several word embeddings by utilising ensemble and multimodal approaches. The proposed approach, however, depends solely on the website content, and might not work if the content is changed to images.

To detect PSHCD, [142] suggests the incorporation of similarity-based features in addition to a search engine-based approach. An accuracy of 98.61% is achieved, but the technique fails when PSHCD that are indexed in search engines are encountered. Also, to minimise false positives, the similarity threshold is kept at 0. Another approach is proposed in [141], where PSHCD are determined by calculating the similarity score between the login and home page of the suspicious website using the Jaccard similarity coefficient. Other phishing websites are detected through hyperlinks and URL-based features with the TWSVM classifier. The accuracy achieved is 98.05%. The approach fails to detect phishing sites having the same login and home page.
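The Jaccard similarity coefficient used in [141] is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to two sets of page elements. What exactly is compared (here, sets of hyperlink targets), the threshold value and the decision rule that low similarity indicates a PSHCD are assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0  # assumed convention for the degenerate case
    return len(a & b) / len(a | b)


def looks_like_pshcd(login_items: set, home_items: set,
                     threshold: float = 0.2) -> bool:
    """Hypothetical rule: a login page that shares little with the home page
    of its own domain is flagged. The threshold is an arbitrary example."""
    return jaccard(login_items, home_items) < threshold


# Hypothetical hyperlink sets extracted from the two pages.
home = {"/about", "/contact", "/products", "/blog"}
login_planted = {"https://bank.example/help", "/verify.php"}
login_genuine = {"/about", "/contact", "/products", "/reset"}
```

When the login page and the home page are one and the same, the coefficient is 1 and the rule cannot fire, which corresponds to the failure case stated above.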

[59] trains a machine learning classifier with static and site popularity features extracted from the URL to present an efficient technique to detect mobile phishing. A detection accuracy of 93.85% is obtained through the Random Forest algorithm. [179] integrates the URL-based and text-based approaches to detect smishing SMS. The machine learning classifiers for both approaches are merged by a voting classifier to achieve an accuracy of 99.03%.
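The merging of a URL-based and a text-based classifier by voting, as described for [179], can be sketched with a simple soft-voting rule that averages the phishing probabilities returned by the individual models. Whether [179] uses soft or hard voting is not stated here, so the averaging rule, the weights and the 0.5 threshold are assumptions.

```python
def soft_vote(probabilities, weights=None, threshold=0.5):
    """Combine per-classifier phishing probabilities by a weighted mean.

    Returns (combined_score, is_phishing). All parameters are illustrative.
    """
    if not probabilities:
        raise ValueError("at least one probability is required")
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("one weight per probability is required")
    score = sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)
    return score, score >= threshold


# Hypothetical outputs of a URL-based and a text-based model for one SMS.
score, is_smishing = soft_vote([0.9, 0.4])
```

With equal weights, a confident URL-based model can outvote an undecided text-based model, and vice versa; unequal weights allow one view to dominate when it is known to be more reliable.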

Visual Similarity-based Approaches: An improved version of the approach in [162] is proposed in [182]. At the first level, similarity-based features (tag attributes, scripts, paths, filenames etc.) are used to establish similarity with a blacklisted site. To detect non-blacklisted or previously unseen phishing sites, the authors have implemented a second-level heuristic filter. An ensemble machine learning model is trained with URL and source code-based features to obtain an accuracy of 98.72%. [181] develops a technique for detecting ‘very similar’ and ‘locally similar’ phishing websites. For ‘very similar’ websites, the wHash (wavelet Hashing) process with the colour histogram has proved to be accurate and stable. For ‘locally similar’ websites, the SIFT (Scale-Invariant Feature Transform) approach is chosen. A cache is also included to shorten the detection time. In [183], visual renderings of target brand logos are extracted by HOG (Histogram of Oriented Gradients) in a scale-invariant manner. Further, an SVM classifier is used to reduce false positives. The technique achieved a precision of 93.50% and a recall of 77.94%. The approach is limited to already learnt logos.
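As a simplified illustration of colour-based visual comparison, the sketch below builds a coarse colour histogram for each screenshot and scores two screenshots by histogram intersection. This is a deliberately simpler stand-in and not the wHash or SIFT procedure of [181]; the pixel representation (a list of RGB tuples), the number of bins and the function names are assumptions.

```python
from collections import Counter


def colour_histogram(pixels, bins_per_channel=4):
    """Normalised histogram over quantised (r, g, b) values in 0..255."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima: 1.0 for identical distributions, 0.0 for disjoint."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())


# Tiny hypothetical "screenshots" given as lists of RGB pixels.
red = [(255, 0, 0)] * 4
blue = [(0, 0, 255)] * 4
mixed = red[:2] + blue[:2]
```

Because a histogram discards layout, two pages with the same palette but different structure score as similar, which is one reason why such global measures are typically paired with local descriptors such as SIFT.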

Table 8 Analysis of proposed phishing detection techniques

Table 8 presents a tabular analysis of various phishing countermeasures.

7 Challenges, future scope and conclusion

The phishers use various distribution mediums to share malicious links with the victims. After a detailed analysis in Sect. 5, it is observed that along with e-mails, the use of other distribution mediums such as mobiles, IoT etc. is also skyrocketing. The adversaries are focusing on targeted attacks as they are more likely to succeed and provide higher returns. Furthermore, apart from relying on the victims’ ignorance or unpreparedness, the phishers are creating opportunities for themselves by not only trying to find system vulnerabilities but also making themselves technically sound.

In this paper, a detailed anatomy of phishing is presented, which explores its multiple related aspects, namely the motivations, case studies, mediums of circulation, intended recipients, attack techniques and countermeasure approaches. Distinct surveys in the domain of phishing are reviewed and compared with this survey. The paper emphasises that, before beginning with the design of a phishing detection technique, it is important to understand the different aspects of phishing, such as the circulation mediums, the targeted victims, the intentions behind the attack and the attack technique being used. Researchers can develop efficient solutions with high precision if they have knowledge about the different types of phishing attacks that they are dealing with.

This paper identifies three broad categories of phishing attacks, i.e. phishing on the basis of medium, phishing on the basis of intended targets and phishing on the basis of technique. All these categories are discussed in detail along with supporting reports from eminent players in the domain of cyber-security and phishing. Even though the main focus of this survey is on categorising phishing attacks and discussing attack techniques, we have also discussed phishing countermeasures and various phishing detection approaches that have been proposed by researchers. The benefits and drawbacks of each of the phishing detection approaches are mentioned as well.

Some open research challenges in the current anti-phishing scenario have been identified that need to be addressed:

  • The list-based approaches are easy to deploy, but they are dependent on a database, which needs to be updated frequently, and are unable to detect zero-day attacks. These approaches also fail when different versions of the same phishing site are encountered.

  • The heuristics-based approach can be bypassed if the attacker gets to know about the detection algorithm and features being used.

  • Keyword-based solutions are language dependent.

  • Heuristic approaches based on search engines suffer from latency and are unable to distinguish newly launched legitimate websites. Additionally, the phishing sites hosted on compromised domains (PSHCD) are not classified correctly as they are already indexed with the search engine and can bypass URL-based approaches.

  • Heuristic techniques based on hyperlink features fail when the phishing web page has all the hyperlinks pointing to a common domain.

  • In phishing sites where the entire content is replaced with an embedded object such as an image, the features cannot be extracted by detection approaches that are based on textual data such as HTML or DOM.

  • Visual similarity-based approaches require high storage and computational costs. Moreover, if the fraudster creates a phishing website with a slight reduction in similarity, it results in high false negatives.

The bewildering rise in the number of phishing websites being reported demonstrates that, despite numerous researchers proposing a wide range of anti-phishing methodologies, the attackers always stay one step ahead and find a way to elude the phishing countermeasures. It is also worth mentioning that no single phishing detection approach is sufficient to detect all kinds of phishing attacks.

Based on our analysis of the literature related to phishing, we suggest the scope of future research directions in this domain:

  • Layered or hybrid phishing detection techniques that are efficient as well as robust and incorporate the benefits of various existing phishing countermeasures should be developed.

  • The approach should be such that the attacker is unable to evade the technique.

  • The measures should be lightweight and database independent, with the capability of detecting PSHCD and embedded objects without latency.

  • Researchers and developers need to decide on a trade-off between accuracy and computation time depending on the organisational requirements, and then settle on the phishing detection approach to be applied.
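The idea of a layered detection technique from the directions above can be sketched as a pipeline in which inexpensive list lookups run first and a costlier heuristic model is consulted only for unknown hosts. The ordering of the layers, the names and the threshold are assumptions for illustration; `heuristic_score` is a hypothetical callable standing in for any trained classifier.

```python
from typing import Callable
from urllib.parse import urlsplit


def layered_verdict(url: str,
                    whitelist: set,
                    blacklist: set,
                    heuristic_score: Callable[[str], float],
                    threshold: float = 0.5) -> str:
    """Return 'legitimate' or 'phishing' using cheap checks before costly ones."""
    host = urlsplit(url).hostname or ""
    if host in whitelist:          # layer 1: known good hosts
        return "legitimate"
    if host in blacklist:          # layer 2: known bad hosts
        return "phishing"
    # layer 3: heuristic model, reached only for previously unseen hosts
    return "phishing" if heuristic_score(url) >= threshold else "legitimate"


# Hypothetical lists and a toy scoring function, for illustration only.
WHITELIST = {"bank.example"}
BLACKLIST = {"evil.example"}
toy_score = lambda url: 0.9 if "login" in url else 0.1
```

The threshold is where the trade-off between accuracy and computation time becomes a configuration decision. Note that a whitelist-first ordering would pass a phishing page hosted on a compromised whitelisted domain, so a design that targets PSHCD would have to inspect page content even for listed hosts.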