1 Introduction

Counterfeit or fake goods are unauthorized replicas of products that attempt to pass as legitimate ones. They cover a large array of goods, such as pharmaceuticals [17], electronics [1], aircraft parts [37], and books [31].

Luxury goods, from brands such as Nike and Louis Vuitton, are among the most popular counterfeit products. Their popularity stems from high consumer demand, which leads to high profit margins [37] for those who sell them. In the U.S. alone, counterfeit goods seized at the border in 2017 had an estimated value of US$1.2 billion [36], had these products been genuine. In the EU, 2016 border seizures were valued at €670 million (US$ 743 million) [33]. In both cases, most shipments originated from China, which has also been identified as a major source of counterfeit shoes [28].

To be able to sell online, counterfeiters first have to attract potential buyers, and they use various tactics to do so. Wang et al. [38] have shown how counterfeiters often employ search engine optimization (SEO) in an attempt to improve their rankings in search engines. Social networking websites have also been employed [22]: in 2016, a large number of Instagram accounts were dedicated to disseminating counterfeit luxury goods (roughly 20% of the 150k analyzed posts [35], which contained links to stores dedicated to selling these types of products). Lastly, marketplaces such as Amazon [31] and eBay have been exploited by counterfeiters.

Buyers of these goods are often unaware that they are buying from a counterfeit webshop, and in many cases they end up receiving no product or a lower-quality version; they are scammed either way. Moreover, they may become victims of identity theft, given that they have to provide their credit card details and address information. Financial losses to online shoppers have also been widely reported by media outlets in The Netherlands [21,22,23], where such webshops have been known to exist in the .nl zone since 2016 (Figs. 11 and 12 in Appendix A and [5]). This is not unique to .nl: Germany’s .de was found to have more than 16,000 counterfeit shops, many active for several years [24].

In this paper, we focus on a subset of the counterfeit industry: the so-called luxury goods that are sold online, which often lead to financial loss for shoppers. We leverage our centralized vantage point as the country-code top-level domain (ccTLD) registry for The Netherlands (.nl), operated by SIDN [30]. Centralized, in this context, refers to the access we have to historical registration data of all .nl domain names, which also includes registrants’ contact details. Given that most webshops in The Netherlands are registered under the .nl ccTLD (and are available in Dutch), counterfeiters have an incentive to register their domains under .nl as well, to mimic what most legitimate webshops do. As such, our centralized vantage point allows us to leverage this strong association between ccTLD, country, and language.

This paper presents the results of a multiyear effort in detecting such webshops, which led to 4455 domain names being removed from the .nl zone. We present two detection systems, BrandCounter (Sect. 3) and FaDe (Sect. 4), which have been used in production over the past three years by our Abuse Handling Analysts to evaluate .nl domain names and notify registrars and/or registrants. BrandCounter, the first system, dating from 2017, employs a very simple but effective heuristic. We used its results in a case study with Registrar A, which ultimately led to the removal of \(\sim \)3.7k counterfeit webshops from the .nl zone (Sect. 3.1). FaDe, in turn, was developed in early 2019 to cope with the new tactics employed by counterfeiters (Sect. 4.2), who adapted after the initial takedowns based on BrandCounter’s results. We carried out another case study with the results from FaDe together with International Credit Cards (ICS, [11]), a major credit card issuer in The Netherlands with more than 3.5 million clients. This study led to the removal of an additional 747 domain names (Sect. 4.1). Lastly, we infer the popularity of the counterfeit domains among users by analyzing the volume of DNS queries to the .nl authoritative servers (Sect. 5).

2 Background

Domain Name Registration: Registering a domain name is the process of creating a unique name that is added to a DNS zone file. Next, we describe this process under .nl. It usually involves a registrant, registrar (or reseller), and registry. The registrant (a user) requests an accredited registrar to register an available domain name at the registry. The registrar only executes this request once certain requirements are met, such as registrant information and payment being cleared, as shown in the left part of Fig. 1.

Fig. 1. TLD operations: registration (left), domain resolution (right), and datasets.

Domains are registered for a period of one year, and registrations at .nl are renewed automatically. If a domain is cancelled, it expires and is put on hold for 40 days, after which it becomes available for a new registration by any registrant. The list of valid domain names is then used to generate a DNS zone file (Fig. 1) that contains all domains under .nl and their respective DNS records. These zone files are used as input on the authoritative name servers, which answer queries on .nl domain names.

Domain Name Resolution: Domain name resolution consists of resolving a domain name into, ultimately, its IP address or other specific types of DNS records [18]. To do that, a user’s application contacts the stub DNS resolver (Fig. 1) on the user’s computer, which, in turn, sends a DNS request to its DNS resolver [10]. The DNS resolver will, on behalf of the user, recursively resolve the requested domain name, and ultimately contact the appropriate authoritative name server. Caching on DNS resolvers [19, 20] avoids re-issuing frequently repeated queries, improving response times.

2.1 Datasets

We leverage three types of datasets available at the .nl registry. Two of them are passive data, while one is obtained through active measurements:

  • RegDB: We have access to the historical database of registration and removal of .nl second-level domains (such as example.nl), which covers 20+ years. This dataset contains complete information about registrant and registrar (and resellers, if applicable), as well as some of the DNS records of the respective domains [18].

  • Scans: We crawl all domains under .nl on a monthly basis. We scan four types of application data: DNS records, HTTP pages, SMTP, and TLS (including the certificates served on web pages). We employ DMap [40], an application we have developed to carry out these scans. In addition, the .nl zone is scanned daily by OpenIntel [26], a research project that crawls multiple TLD zones every day for various DNS record types.

  • AuthDNS: We have access to historical query data from two out of the four authoritative name servers for .nl. This data provides a centralized but sampled view (due to caching on the resolvers) of all queries issued to .nl. We use our open-source Hadoop-based ENTRADA [41] to store and process this dataset.

Fig. 2. BrandCounter suspicious domain results for .nl zone.

3 BrandCounter

While detecting phishing domains in the .nl zone in 2016 [5], we came across the first suspicious luxury goods webshops, which advertised goods at steep discounts, as shown in Fig. 11 in Appendix A. Upon inspection, we observed that they shared one common feature: long page titles (HTML element <title>) that listed a series of luxury brands, in an attempt to improve rankings on search engines [38].

That provided us with a simple but effective way to detect such shops in the entire .nl zone: we crawl the zone for web pages and, for each page, we count how many words in the page title match the words in our pre-compiled list of 1,100+ luxury brands and discount-related words, such as “discount” and “sale”, in both English and Dutch. We empirically determined a threshold of \(t\ge 5\) matching words to classify webpages as suspicious. We automated this process into a single tool (BrandCounter) and ran it roughly once a month for more than 1.5 years, as shown in Fig. 2. In total, BrandCounter detected 18952 suspicious webshops.
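The title heuristic described above can be sketched in a few lines of Python. This is a minimal illustration, not the production tool: the keyword set is a small hypothetical excerpt (the real list has 1,100+ entries), and the tokenization and the choice to count every occurrence of a matching word are assumptions.

```python
import re

# Hypothetical excerpt of the brand/discount word list (assumption for illustration).
KEYWORDS = {"nike", "adidas", "gucci", "prada", "discount", "sale", "korting", "uitverkoop"}
THRESHOLD = 5  # t >= 5 matching words, as stated in the text


def count_matches(title, keywords=KEYWORDS):
    """Count the words of a page title that appear in the keyword list."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return sum(1 for word in words if word in keywords)


def is_suspicious(title, threshold=THRESHOLD):
    """Flag a page whose title reaches the matching-word threshold."""
    return count_matches(title) >= threshold
```

For example, a title such as "Nike Adidas Gucci Prada Sale - Korting tot 70%" contains six listed words and is flagged, while the title of an ordinary bakery website is not.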

Results and Analysis: Nearly nineteen thousand allegedly counterfeit webshops is a large number: 0.3% of the entire .nl zone. We analyzed these domains and observed the following characteristics:

Domains are Cheap and Disposable: Given that it is relatively cheap to register a .nl domain (less than €10 in 2020), counterfeiters may choose to register a large number of domains, and even if some are taken down, the profits made from the remaining ones are enough to sustain the operation. The relatively short lifetime also indicates that domains are disposable (Fig. 6).

Registrar Concentration: Out of 18952 domains, 16512 were registered through 10 registrars, as can be seen in Fig. 3. The top registrar, Reg. A, alone accounts for 8017 (42.3%) of all detected shops. One reason may be that Reg. A ranks among the cheapest registrars and provides an API that allows for bulk registration of domains, which is convenient for automated registrations. Given this concentration, we carried out a case study with Reg. A (Sect. 3.1), in which a large part of these domains were suspended.

Fig. 3. Top 10 registrars with suspicious domains.

Fig. 4. Top 10 ASes (countries) hosting suspicious domains.

Similar But Yet Different Website Templates: We analyzed the home pages of some of these webshops and found that they differ from one another but seem to be built with only a few content-management systems (CMS). The webshops do not support HTTPS, and have a single image in the page footer that contains icons of most credit card companies with no link or a broken link. Such designs also suggest that automated tools are used to create these websites. Wang et al. [38] describe many doorway pages, which are non-shopping sites that are specifically designed to improve SEO results and redirect users to the real websites. In our work we do not see such pages, since we do not rely on search engine results; we see the actual automatically generated pages listing the counterfeit goods, always with large discounts.

Most Domains were Drop-Catch: 15242 shops (80.4%) are hosted on domains that expired and were re-registered by the counterfeiters. The majority of these domains were registered immediately after they became available (Fig. 5), a practice known as “drop-catch” [7]. By registering freshly expired domains to host counterfeit webshops, counterfeiters can benefit from the domains’ previously built reputation [14]. This precise timing, together with the fact that counterfeiters seem indifferent to the domain name itself (many were previously used by small businesses such as bakeries and beauty parlors), supports the idea that the registration process is automated.
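The drop-catch measurement boils down to a date difference per domain. The sketch below assumes each domain is represented by a hypothetical pair (day it became available, day it was re-registered); the cut-off for what counts as "immediately" is also an assumption and is exposed as a parameter.

```python
from datetime import date


def days_until_reregistration(available_on, reregistered_on):
    """Days between a domain becoming available and its re-registration."""
    return (reregistered_on - available_on).days


def drop_catch_share(pairs, max_gap_days=0):
    """Share of re-registered domains caught within max_gap_days of release."""
    gaps = [days_until_reregistration(a, r) for a, r in pairs]
    if not gaps:
        return 0.0
    caught = sum(1 for gap in gaps if 0 <= gap <= max_gap_days)
    return caught / len(gaps)
```

Plotting the distribution of these gaps would give a figure of the same kind as Fig. 5.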

Fig. 5. Suspicious domains: days in between domain expiration and re-registration.

Fig. 6. Suspicious domains lifetime: most domains are not renewed after one year, the registration period.

Chinese e-mails and Chinese Diurnal Registration Timing: Registrants are required to provide their e-mail address to register a domain with .nl. Out of 18925 suspicious domains, 4696 (24.81%) were registered using 163.com, a well-known Chinese e-mail provider that is not popular in The Netherlands (Fig. 7). Moreover, the diurnal registration patterns coincide with working hours in eastern China (Fig. 8).
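One way to inspect the diurnal pattern is to bucket registration timestamps by hour of day in China Standard Time (UTC+8, no daylight saving). The sketch below is an illustration under assumptions: the function name is hypothetical, the timestamps are assumed to be timezone-aware, and the text does not state which time zone Fig. 8 uses.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

CHINA_TZ = timezone(timedelta(hours=8))  # China Standard Time, fixed UTC+8 offset


def registrations_per_hour(timestamps, tz=CHINA_TZ):
    """Return a 24-element list with the number of registrations per local hour.

    timestamps must be timezone-aware datetime objects.
    """
    hours = Counter(ts.astimezone(tz).hour for ts in timestamps)
    return [hours.get(hour, 0) for hour in range(24)]
```

A histogram concentrated between roughly 9 and 18 in this time zone would be consistent with the working-hours observation.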

Fig. 7. Number of shops by the registrant’s e-mail domain.

Fig. 8. Number of shops by registration hour.

Hosting Provider Concentration: We see that 66.59% of the counterfeit webshops are hosted in 10 ASes—as can be seen in Fig. 4—and none of them are located in China. We also see that most of them, however, use default DNS services provided by their registrars during registration. We inspected a sample of websites from the .com zone hosted under some of the same IP addresses of AS197328. Some were counterfeit webshops in other languages, but we also found websites that seemed legitimate, such as small businesses in Turkey.

3.1 Registrar Notification Case Study

Counterfeiters used Registrar A to register 8017 suspicious domains (Fig. 2), out of the more than 15k detected over the entire period covered. Given this concentration, we partnered with Reg. A in a three-month case study in which we provided them with lists of domains that were labeled as suspicious by BrandCounter. In these three months, we sent 4106 domains to Reg. A.

Reg. A, in turn, would verify the identity of the registrants and take appropriate measures according to their regulations. While other registrars were also notified—and many also removed suspicious domain names—we single out Reg. A in this section, because we only tracked results for this registrar.

Table 1 shows the number of domains we notified to Reg. A—more than four thousand in the three notifications. Upon receiving the list of domains, Reg. A determined the accuracy of the registrant data and judged each domain individually. The column “Suspended” shows the number of suspended domains by Reg. A—meaning they changed their NS records to sinkhole-like authoritative name servers (e.g., sinkhole.example.nl), which they typically use for their suspended domains. To determine when the suspension occurs, we use daily crawls provided by OpenIntel [26].
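Determining the suspension date from daily NS measurements can be sketched as follows. The data layout (a mapping from ISO date to the set of NS hostnames seen that day) and the function names are assumptions for illustration; sinkhole.example.nl is the placeholder name used in the text, not a real sinkhole.

```python
def is_sinkhole(ns_name, sinkhole_zones):
    """True if the NS hostname equals, or is a subdomain of, a sinkhole zone."""
    name = ns_name.rstrip(".").lower()
    return any(name == zone or name.endswith("." + zone) for zone in sinkhole_zones)


def first_suspension_day(daily_ns, sinkhole_zones):
    """Return the first day on which every NS record points to a sinkhole.

    daily_ns maps an ISO date string to the set of NS hostnames measured that day.
    Returns None if the domain was never observed as suspended.
    """
    for day in sorted(daily_ns):  # ISO dates sort chronologically
        ns_set = daily_ns[day]
        if ns_set and all(is_sinkhole(ns, sinkhole_zones) for ns in ns_set):
            return day
    return None
```

Requiring all NS records to match is a conservative choice; a registrar that only replaces some of the records would need a looser rule.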

Table 1. Registrar A notification and suspension results.

We can see the effects of this notification in Fig. 2: first, a drop in the number of domains labeled as “suspicious” that were registered through Reg. A, followed by an overall drop in detected suspicious domains. The same figure shows the effectiveness of this intervention: as registrars started suspending domains, the number of domains classified as suspicious dropped. After October 2018, we see very little change in the volume of such domains. Overall, our notification study led to 3708 domains being ultimately suspended by Registrar A, potentially protecting users from scams.

4 FaDe

BrandCounter was initially effective in detecting counterfeit webshops, but after a first round of takedowns, we observed a sharp decrease in the number of suspicious domains (Fig. 2). Why was that? Had the counterfeiters given up, or had they learned to avoid detection by BrandCounter? We therefore set out to develop a new detector, FaDe (Fake Detector), which does not rely on the words in the web page title. Instead, we use a Support Vector Machine (SVM) [32] that employs nine features related to the registration itself and the infrastructure. We chose SVM because it is a robust method that has been successfully applied to classify various types of malicious activity [4, 12, 13].

SVM is a supervised learning method and relies upon labeled data for training. For that, we collaborated with the Abuse Department of ICS, a major credit card issuer in The Netherlands. ICS provided us with a list of 231 .nl domains labeled as fraudulent (Nov 2018–Jan 2019). We also randomly sampled 229 webshops from our zone, which we manually labeled as trustworthy webshops. This resulted in a data set of 460 samples.

Feature Selection: We employ nine features in FaDe that characterize counterfeit webshops (Table 2). The first three were inspired by the work of Hao et al. [6] and match patterns we also observed with BrandCounter (Sect. 3). Re-registration indicates whether the domain has been previously registered, Registration Hour represents the hour of the day in which the domain was registered, and the third is the registrar used.

The remaining six features (highlighted in Table 2) are based on other patterns we have seen with the domains detected by BrandCounter (Sect. 3) and the training set provided by ICS. E-mail provider indicates whether a suspicious e-mail domain is used by the registrant, given that we have seen a high concentration of unusual mail providers (Fig. 7). The fifth feature, the reported domains score, is the number of malicious domains reported via the Netcraft abuse list [15] divided by the number of all domains registered by a given registrar in 2018 in the .nl zone. The sixth feature captures the ratio of lowercase characters in the registrant’s name, given that we noticed that many registrants of counterfeit webshops enter their name in lowercase only. We observed that 227 of the 231 webshops reported by ICS did not configure mail servers (defined by their MX record [18]), which we then also use as a feature. The eighth feature is the issuer of the TLS certificate, because we observed that 3 issuers are responsible for 156 of the 183 webshops that were labeled by ICS as fraudulent and have TLS configured (websites have also been found employing TLS [27]). Finally, we consider the autonomous system of the domain’s A record, i.e., the AS of the hosting provider, given the high concentration of certain ASes (Fig. 4). All features are normalized to the same scale ([0, 1]) to ensure they all have the same influence on the distance metric.
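Three of the numeric steps above can be illustrated with a short sketch. The exact definitions used in FaDe are not given in the text, so the following are assumptions: the lowercase ratio is computed over alphabetic characters only, and normalization to [0, 1] is done with min-max scaling. All function names are hypothetical.

```python
def lowercase_ratio(registrant_name):
    """Fraction of alphabetic characters in the name that are lowercase."""
    letters = [ch for ch in registrant_name if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ch.islower()) / len(letters)


def reported_domains_score(reported_count, registered_count):
    """Reported malicious domains divided by all domains registered by a registrar."""
    if registered_count == 0:
        return 0.0
    return reported_count / registered_count


def min_max_scale(values):
    """Rescale a list of feature values to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]
```

With these definitions a name entered as "jan jansen" scores 1.0, while "Jan Jansen" scores 7/9, and registration hours 0 to 23 are mapped linearly onto [0, 1].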

Table 2. Features used by FaDe.

Model Training: To train our model, we start by randomly splitting our dataset of 460 samples into two sets: a training set (367 samples, roughly 80%) and a test set (93 samples, roughly 20%). We then use grid search [2] to find the optimal SVM parameters (i.e., kernel, C and \(\gamma \)). We employ cross-validation [8] so that we can use the full training set for both training and validation. The best scores over all folds, a mean precision of 0.98 and a mean recall of 0.97, were obtained using the RBF kernel with \(C=10\) and \(\gamma =0.1\). Next, we train our final model using these parameters and the full training set. This model was then applied to the test set, yielding a precision of 1.00 and a recall of 1.00. Although the test set is small, it at least indicates that our model performed well.
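The bookkeeping behind grid search with cross-validation can be sketched without any machine-learning library. The sketch below only enumerates parameter combinations and builds the folds; fitting and scoring an SVM on each fold is left out. The number of folds and the candidate values are assumptions, since the text reports only the winning combination (RBF kernel, \(C=10\), \(\gamma =0.1\)).

```python
import itertools
import random


def parameter_grid(kernels, c_values, gamma_values):
    """Enumerate every (kernel, C, gamma) combination to be scored."""
    return [
        {"kernel": kernel, "C": c, "gamma": gamma}
        for kernel, c, gamma in itertools.product(kernels, c_values, gamma_values)
    ]


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) so that each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, val_idx
```

For each grid entry, a classifier would be trained on every train_idx and scored on the matching val_idx, and the entry with the best mean precision and recall over the folds would be kept.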

Feature Importance: To estimate feature importance, we use the coefficients of the best SVM classifier with a linear kernel. We omit the exact coefficients so as not to help counterfeiters, and show only the relative importance in Table 2.

Results and Analysis: After training our model, we apply it to a subset of the .nl zone: only domains that are automatically classified as eCommerce by our crawler DMap [40]. We focus on this subset to prevent many false positives that could discourage abuse analysts. For this purpose, the crawler extracts technologies used on webpages using Wappalyzer [39] and some regular expressions that look at specific HTTP headers, HTML content and cookies. A domain is classified as eCommerce if it has at least one eCommerce related technology (e.g., Zen Cart or WooCommerce).

Ultimately, we evaluated 30k domains of our zone that were classified as eCommerce and were registered at most 365 days earlier, using data crawled in January 2019. Table 3 shows the results. In total, the FaDe classifier detected 1407 suspicious domains.

Table 3. FaDe results and validation.
Table 4. Notification and take down results.

To validate the results, we shared the lists of suspicious domains with ICS, where analysts manually verified every single domain in the period between 2019-01-29 and 2019-02-04, including evaluating the payment provider used by the website. Out of the 1407 domains, 181 domains (Table 4) were no longer reachable at the time of the validation: in 14 cases analysts reported a DNS error, and 167 domains were annotated with a generic ‘no response’ label, which could indicate failure at the DNS or server level. This left us with 1226 domains that were both suspicious and reachable. Out of these, 894 were confirmed as true positives (72.92% precision). ICS analysts reported notes on a few false positives: 38 were redirects to legitimate webshops and 8 were adult websites.
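The precision figure follows directly from the counts in this paragraph. The short calculation below reproduces it; the function name is hypothetical, and unreachable domains are excluded from the denominator as in the text.

```python
def validation_precision(detected, unreachable, confirmed):
    """Precision over the detected domains that were still reachable."""
    reachable = detected - unreachable
    return reachable, confirmed / reachable


# Counts taken from the validation described above.
reachable, precision = validation_precision(
    detected=1407,
    unreachable=14 + 167,  # DNS errors plus generic 'no response'
    confirmed=894,
)
```

Here reachable is 1226 and precision, expressed as a percentage, rounds to 72.92.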

4.1 Registrar Notification and Takedown

Being able to detect these counterfeit webshops is just the first step. To protect .nl users, we need to act upon these domains, and preferably take them down. We then split the true positives per registrar, as can be seen in Table 4, and notified the respective registrars of these domains, via two channels: ICS carried out their notifications and the registration department at SIDN also notified registrars. After receiving notifications, registrars can individually decide, according to their policies and processes, to suspend the domain—a process that we were not involved in.

To determine which domains were taken down, we could use the same approach shown in Sect. 3.1. However, different registrars may employ different take down methods: web page content changes, domain suspension, DNS records changes, among others. Given that we notify multiple registrars, we analyze changes in the content of web pages, domain cancellations, and nameserver changes in the period starting from the notification date until 01-05-2019. We use RegDB data and Scans data (Sect. 2.1) for this purpose.

Table 4 shows the results. Out of the 894 domains that we notified to registrars, 747 (83.56%) were effectively taken down, as measured by a change of webpage content. We can also see in the same table the method employed by the registrar: 133 (14.88%) domains were cancelled, resulting in NXDOMAIN responses, and 713 (79.75%) changed their NS records [18], which point to the authoritative name servers of a domain. We manually checked the name server changes: 677 domains changed to a sinkhole name server and 36 to a regular name server. This indicates that registrars employ different strategies to take down counterfeit webshops. For example, Reg. B suspended most domains by changing name servers, whereas all Reg. F domains were cancelled. The remaining 147 (16.44%) notified domains were not taken down. In the majority of those cases the registrar did not respond and the registrant details were legitimate, giving us no ground to remove the domain from our zone.

4.2 BrandCounter vs FaDe Compared: Evolving Tactics

Given that BrandCounter was effective with such a simple heuristic, we can deduce that counterfeiters were likely facing very little defensive pressure: they did not seem to make any effort to hide the suspicious characteristics of their websites, or at least not in early 2017. We could expect counterfeiters to adapt to our detection methods, especially because thousands of domains were taken down.

To determine why BrandCounter’s performance degraded over time (Fig. 2), we applied BrandCounter to the true positives generated by FaDe. Out of the 894 domains, 707 had a score of 0 matching words, and no domain had a score above 3. Given that we use a threshold of \(t\ge 5\), these domains evaded BrandCounter detection. In other words, counterfeiters adapted to BrandCounter. Upon inspection, we see that they have essentially removed references to popular brands and inserted generic product titles, colors, types of garment, and targeted age group/gender. This is surprising, given that up to that point we had not disclosed how we detected these websites.

Registrar and Email Provider Diversification: We have shown in Sect. 3.1 how Registrar A took down more than 3700 domain names upon our notifications. We could expect counterfeiters to respond to that. We see this in Fig. 3, in which Reg. B becomes the number one registrar employed by counterfeiters. More prominently, we see a diversification of the e-mail providers used by registrants (Fig. 7), moving from the dominant 163.com for domains detected by BrandCounter to a more diverse distribution for domains detected by FaDe.

Hosting Diversification: We still observe that counterfeit webshops are hosted on a small number of ASes. However, the ASes themselves did change over the years as can be seen in Fig. 4. AS 41204 and AS 204353 were frequently observed during the second study based on FaDe, while no shops were later hosted on AS 197328. Interestingly, the hosting infrastructure still does not map to Chinese IP addresses.

5 How Popular Are the Counterfeit Webshops?

Our notification campaigns led to 4.5k domains being removed or suspended. In this section, we explore the popularity of these counterfeit webshops.

We can indirectly infer a counterfeit webshop’s popularity by analyzing incoming queries at the .nl authoritative servers, leveraging our AuthDNS dataset described in Sect. 2.1. For each domain name d, we extract the number of queries and the unique IP addresses of resolvers that we observed in the week before the notification date (we chose one week given the known weekly and diurnal patterns of Internet traffic [25]). While the numbers of queries and resolvers do not correspond to the number of unique shoppers (due to caching at DNS resolvers), they provide an indication of how diverse the resolver population is.
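The per-domain popularity measure can be sketched as a simple aggregation over query records. This is an in-memory illustration only: the record layout (day, queried domain, resolver IP) and the function name are assumptions, and the actual analysis runs on the AuthDNS data stored in ENTRADA.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_popularity(query_log, domain, notified_on, window_days=7):
    """Average daily queries and average daily unique resolvers before notification.

    query_log is an iterable of (day, queried_domain, resolver_ip) tuples.
    The window covers the window_days days preceding notified_on.
    """
    start = notified_on - timedelta(days=window_days)
    queries_per_day = defaultdict(int)
    resolvers_per_day = defaultdict(set)
    for day, queried_domain, resolver_ip in query_log:
        if queried_domain == domain and start <= day < notified_on:
            queries_per_day[day] += 1
            resolvers_per_day[day].add(resolver_ip)
    avg_queries = sum(queries_per_day.values()) / window_days
    avg_resolvers = sum(len(ips) for ips in resolvers_per_day.values()) / window_days
    return avg_queries, avg_resolvers
```

Days without any observed query count as zero in both averages, which keeps quiet domains comparable with busy ones.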

Figure 9 shows the average number of daily queries for the taken-down domains in the week before the notification, while Fig. 10 shows the average daily number of resolvers. The baseline consists of a random set of 500k domain names that serve a website (defined by a 200 OK HTTP status code). We see a significant discrepancy in the popularity of counterfeit webshops: 50% of them have, on average, 100 daily queries prior to the notification, from \(\sim \)70 unique resolvers. However, some domains are very popular: 55 domains had an average of 1000 daily queries from 653 resolvers. We manually analyzed the queries of the top 10 counterfeit webshops and found that most queries originated from public resolvers and local ISPs, which is similar to normal query behavior. This suggests variability in domains’ popularity, which may coincide with their advertisement strategies.

Fig. 9. Average number of daily DNS queries for counterfeit shops one week prior to notification and a random subset of 500k domains that serve a website.

Fig. 10. Average number of daily unique resolvers for counterfeit shops one week prior to notification and a random subset of 500k domains that serve a website.

6 Privacy and Legal Considerations

Together with our legal department, we have developed a publicly available data privacy framework [3] that conforms to both EU and Dutch [3, 9] legislation. This framework has been implemented, including a privacy board that oversees SIDN Labs’ research. For the purpose of this research, only domain names and their associated labels (either legitimate or suspicious) were shared between SIDN Labs and, respectively, ICS and the registrars. This collaboration was formalised using a data sharing agreement.

Note that domains with counterfeit webshops were mostly taken down by registrars. SIDN only takes down domains based on content if it is clearly criminal or unlawful. However, .nl regulations [29] stipulate that registrant data must be legitimate. Failure to conform to the regulations may result in domain name removal from the zone, which is the legal instrument that has been used in some takedown procedures.

7 Related Work

Counterfeit Market: The counterfeit industry has been previously studied by criminology researchers [37]. However, they focus on sales in the streets and not online. The online world of counterfeit stores has been extensively studied and mapped by Wang et al. [38]. The authors’ starting point was Google search results. Our work, however, is based on 5.8M domains issued by .nl, with a focus on non-English results. In addition, we cover years of continuous efforts to mitigate such webshops, and we carry out notification campaigns with domain registrars and a credit card issuer, which led to 4.5k domains being taken down (and more belonging to other registrars, which our colleagues of the registration department notified but we did not cover in this study). We also show how counterfeiters adapted to our first classifier once their domains started being taken down.

Payment Systems: McCoy et al. [16] cover payment systems in abuse-advertised goods, and in 2018 they focused on bullet-proof payment systems [34]. We do not cover payment systems in this paper, but we collaborated with ICS, which is a major credit card provider that deals with payment systems themselves.

8 Conclusions

Counterfeit luxury goods are a very profitable business, and counterfeiters employ high levels of automation in both registration and hosting. Our results suggest that most registrations are done from China, but most hosting is not. We show that counterfeiters operate not only in English and in .com, as in previous works, but also in Dutch and on .nl, which illustrates how professional this industry is.

We have developed and used two systems to detect counterfeit webshops in production at .nl, detecting more than 20k suspicious webshops over a period of more than two years. By notifying registrars and teaming up with ICS, we carried out notification campaigns that resulted in 4455 domains being suspended, ultimately protecting users of the .nl zone from possible scams. Both detectors are relatively simple but at the same time effective, suggesting that counterfeiters were facing little defensive pressure. As such, we can expect that they will try to evade our detection systems again, as they have done with BrandCounter, which requires us to continuously adapt to evolving tactics.