Introduction

Counterfeit medicines pose a huge threat to public health [1]. The World Health Organization states that a counterfeit medicine is “one which is deliberately and fraudulently mislabelled with respect to identity and/or source. Counterfeiting can apply to both branded and generic products and counterfeit products may include products with the correct ingredients or with the wrong ingredients, without active ingredients, with insufficient active ingredients or with fake packaging” [2]. As a matter of fact, counterfeit medicines can range from inactive and useless formulations to harmful and toxic products [1].

Medicine counterfeiting not only exists in developing countries; industrialized countries, such as European countries, the USA, and Japan, are exposed to this health threat as well [1]. It is estimated that about 1 % of the total medicines market of industrialized countries consists of counterfeit medicines, while in African countries and parts of Asia and Latin America, about 30 % of the medicine market is covered by counterfeit pharmaceuticals [1]. In spite of effective regulatory systems and market control, the quantity of counterfeit medicines seized in Europe has increased exponentially in recent years [3]. A total of 148 cases of counterfeit medicines were registered by EU customs in 2005; by 2013, this number had risen to 1175 cases, after peaking in 2009 with 3368 registered cases [4]. This increase is most likely due to the expansion of the Internet and more thorough border controls by EU customs [5, 6]. The types of medicines most often sold as counterfeits in industrialized countries are commonly referred to as “life style drugs” and comprise phosphodiesterase type 5 (PDE-5) inhibitors for the treatment of erectile dysfunction, weight loss products, anabolic hormones, and products treating hair loss [3, 7]. These forged pharmaceuticals are often manufactured by uncontrolled or street laboratories [8], and therefore their safety, efficacy, and quality cannot be guaranteed [2].

Despite all efforts to tackle the distribution of counterfeit medicines [3, 9], high amounts keep entering the European market [4]. Moreover, the sale of counterfeit medicines is not restricted to the Internet, since there is a significant risk of these forged products entering the legal medicine supply chain. For instance, in the UK, nine pharmaceutical recalls, due to counterfeit medicines which had reached official pharmacies, were reported [9]. This clearly shows the need for analytical techniques able to detect these counterfeit pharmaceuticals and distinguish them from genuine medicines. Numerous analytical techniques have already been described in literature; they can be divided into two main groups: chromatographic and spectroscopic techniques [5]. Although spectroscopic techniques are often preferred due to their short analysis time and often non-destructive character, chromatographic techniques have proven to be useful as well [10].

Liquid chromatography coupled to UV detection (LC-UV) is a valuable tool in the detection and characterization of counterfeit medicines due to its low cost and ease of use. Its importance is demonstrated by the numerous methods described for the separation and quantification of PDE-5 inhibitors and detection of analogues [11–17]. However, LC-UV has also widely been used in the detection and characterization of other counterfeited pharmaceuticals such as anti-malaria medicines, antibiotics, and weight loss products [10]. Liquid chromatography equipped with mass spectrometry (LC-MS) is often preferred when screening counterfeit medicines since it allows target analysis, identification, and structural elucidation. Owing to LC-MS, a number of non-registered analogues of the PDE-5 inhibitors have been detected and identified (often in combination with NMR) [10]. This method has also proven to be useful as a screening method for the PDE-5 inhibitors and their analogues [14, 17–21]. In addition, LC-MS has also been used in a quantitative way by Lebel et al. [22]. An overview of available analytical techniques in the field of pharmaceutical counterfeiting, including the identification of unknown analogues, is provided in the review by Deconinck et al. [10].

In this study, a high-performance liquid chromatography–photodiode array (HPLC-PDA) and a high-performance liquid chromatography–mass spectrometry (HPLC-MS) method were developed for the analysis of genuine Viagra®, generic products of Viagra®, and counterfeit samples. The difference between these two methods and the UV/MS methods mentioned earlier is that these newly developed methods do not aim at identifying and quantifying active pharmaceutical ingredients (APIs) and/or their analogues but at obtaining fingerprints which contain as much information as possible from each sample. To do so, both methods were developed in such a way that the impurities and secondary components present are detected. Chromatographic fingerprinting has already proven its usefulness in the field of pharmacognosy for the identification and quality control of plants. A fingerprint is a characteristic profile which reveals the complex composition of a sample; it generates a holistic view of a sample rather than focusing on specific and predefined characteristics. Most of the literature dealing with the issue of characterizing counterfeit medicines focuses on the identification and quantification of the APIs present. This strategy has the disadvantage that a product can be evaluated as relatively safe based on the APIs present and their dosage while it can, in actual fact, contain potentially toxic secondary components such as impurities and residual solvents [23]. Therefore, the fingerprint approach, rather than a focus on active ingredients, might be more interesting for the detection (and possibly identification) of these secondary components in counterfeit medicines. Since the fingerprint approach is used in this study, large amounts of data are generated, which calls for chemometric data analysis (both explorative and supervised pattern recognition techniques). It will be tested whether a distinction can be obtained between genuine, generic, and counterfeit medicines. For this purpose, the influence of any APIs present will be eliminated, thereby ensuring that the intended discrimination is based solely on the detected secondary components and impurities.

First, both types of fingerprints (PDA and MS fingerprints) will be tested separately for their discriminating abilities. Secondly, the potential complementary character of both detection techniques will be explored. For this second aim, the discriminating abilities of PDA and MS will be compared in order to investigate which detection technique is most suited for the desired discrimination. Moreover, it will also be verified whether the combination of fingerprints from both detectors will result in an improvement of the acquired diagnostic models. To our knowledge, this is the first paper that explores which detection technique (or perhaps combination of detection techniques) is most suited to obtain the desired discrimination. This study will make it possible to determine which strategy is most successful in the detection and distinction of counterfeit medicines, which could be useful for other laboratories involved in the detection of counterfeit medicines.

Materials and methods

Samples

A sample set was tested consisting of 13 genuine Viagra® samples (Pfizer), 33 generic products of Viagra® (Pfizer, Apotex, Mylan, Sandoz, Eurogenerics, and Teva), and 97 counterfeit samples.

Genuine Viagra® samples and generic products were purchased in a local pharmacy. All three dosages (25, 50, and 100 mg sildenafil) were included. Inspection of the batch numbers of the genuine and generic products revealed that each sample originated from a different production batch. All counterfeit samples were donated by the Federal Agency for Medicines and Health Products (FAMHP) in Belgium. Not all counterfeit samples mentioned a dosage on the package; where a dosage was mentioned, it was stated as 100 mg sildenafil. All samples were delivered in blisters or closed jars and stored, protected from light, at ambient temperature.

Standards and reagents

Ethanol and methanol (HPLC grade) were purchased from Biosolve (Valkenswaard, The Netherlands). Formic acid was purchased from VWR Prolabo (Fontenay-Sous-Bois, France). Ammonium formate was procured from Sigma-Aldrich (St. Louis, USA). A sildenafil citrate reference standard was kindly donated by Pfizer (New York City, USA). The water, used during this study, was produced by a Milli-Q Gradient A10 system (Millipore, Billerica, USA) and will be referred to as “water” in the next paragraphs.

An ammonium formate buffer (0.020 M) pH = 3 was prepared which served as aqueous phase during the HPLC-PDA analysis.

A reference solution of sildenafil citrate (0.1 mg mL−1) in ethanol/water (50/50 v/v%) was prepared and analyzed under the same experimental conditions as the samples in order to determine the specific retention time.

Sample preparation

One tablet from each sample was crushed and homogenized using a pestle and mortar; capsules were opened and homogenized as well. Then, 30 mg of this powder mixture (Sartorius Analytic AC 210S, Goettingen, Germany) was brought to suspension in 10 mL of a mixture of ethanol/water (50/50 v/v%) and sonicated (M8800, Branson, Danbury, USA) for 15 min. Afterwards, the samples were centrifuged (Heraeus Multifuge 3SR, Thermo Scientific, Waltham, USA) at 894 × g for 10 min.

HPLC-PDA: equipment and chromatographic conditions

The samples were analyzed using a HPLC system (Waters 2695 Separations Module, Milford, USA) coupled to a PDA detector (Waters 2998 Photodiode Array Detector, Milford, USA). The analysis was performed on an Alltima C18 column (250 mm × 3 mm; 5 μm particle size) (Grace, Columbia, USA). The mobile phase consisted of a gradient with an ammonium formate buffer (0.020 M) pH = 3 and methanol. First, a ratio of 90 % buffer and 10 % methanol was held for 2 min. During the next 5 min, the ratio changed to 50 % buffer and 50 % methanol. This ratio was kept for 7 min. The next 6 min, the gradient altered to 10 % buffer and 90 % methanol, which was held for 5 min. During the last 5 min, the gradient returned to its starting condition, making a total run of 30 min for each sample. This gradient was run at a flow rate of 0.5 mL min−1. Five microliters of each sample was injected at a temperature of 15 °C, while the column temperature was set at 30 °C. PDA signals were measured in the range of 210 to 400 nm. Data acquisition was achieved using the Empower software version 3 (Waters, Milford, USA).
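As an illustration, the gradient programme described above can be written as a short list of segments, which makes it easy to verify the total run time of 30 min. This is only a sketch: the segment values are transcribed from the text, whereas the function names and the assumption of linear ramps between set points are ours and are not taken from the instrument method.

```python
# Gradient segments transcribed from the text: (duration in min, % methanol at segment end).
# Linear ramps between set points are an assumption; the text does not state the curve shape.
START_METHANOL = 10.0
SEGMENTS = [
    (2.0, 10.0),   # hold 90 % buffer / 10 % methanol
    (5.0, 50.0),   # change to 50 % buffer / 50 % methanol
    (7.0, 50.0),   # hold
    (6.0, 90.0),   # change to 10 % buffer / 90 % methanol
    (5.0, 90.0),   # hold
    (5.0, 10.0),   # return to the starting condition
]


def total_run_time(segments=SEGMENTS):
    """Total run time in minutes."""
    return sum(duration for duration, _ in segments)


def methanol_percent(t, segments=SEGMENTS, start=START_METHANOL):
    """Percent methanol at time t (min), interpolating linearly within each segment."""
    if t < 0 or t > total_run_time(segments):
        raise ValueError("time outside the run")
    elapsed, current = 0.0, start
    for duration, target in segments:
        if t <= elapsed + duration:
            fraction = (t - elapsed) / duration
            return current + fraction * (target - current)
        elapsed, current = elapsed + duration, target
    return current
```

The buffer percentage at any time is simply 100 minus the value returned by `methanol_percent`.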

HPLC-MS: equipment and chromatographic conditions

All the samples were analyzed a second time using a HPLC (Dionex Ultimate 3000 UHPLC+ focussed, Thermo Scientific, Waltham, USA) equipped with a MS system (Bruker, Billerica, USA). For these analyses the same Alltima C18 column (Grace) was used. Although the mobile phase gradient (used in the HPLC-PDA method) was transferred entirely, the aqueous and organic phases were slightly altered. The aqueous phase consisted of water, the organic phase of methanol. Formic acid was added to both phases in a concentration of 0.01 % (v/v). The other HPLC parameters (injection and column temperatures, injection volume, flow rate, and run time) remained the same.

The mass spectrometer used for this analysis was an AmaZon Speed ETD iontrap (Bruker). Ionization was obtained by electrospray which was operated in positive mode with a spray voltage of 4.5 kV and an end plate voltage of 500 V. The nebulizer was set to 3 bar. The desolvation gas temperature was set to 300 °C, and the flow rate was fixed at 12 L min−1. The mass spectrometer was operated in Auto MS2 mode in the mass range of 50 to 1200 m/z and total ion chromatograms (TIC) were collected. For the selection of MS/MS precursors, the most intense ions were isolated above the absolute intensity of 2500 and 5 % relative intensity threshold. The ion charge control was set to 200,000 with a maximum accumulation time of 200 ms. Collision-induced dissociation was performed with helium as collision gas. The target mass was set to 475 m/z, which is the mass of the sildenafil base, with a fragmentation amplitude of 100 % using SmartFrag™ Enhanced for amplitude ramping (75–150 %). The SmartFrag™ function enables the system to determine the optimal fragmentation voltage automatically depending on the stability of the precursors. Fragmentation time was set to 20 ms. After analyzing the samples, it was observed that the detection of impurities was not optimal due to the high quantities in which sildenafil is present and the fact that relative intensities are recorded. Signals due to impurities were hardly visible in the acquired fingerprints, although their visibility is a prerequisite when analyzing fingerprints by chemometrics. Furthermore, it was also observed that almost all impurities elute before sildenafil. As a consequence, the mass spectrometer was programmed to detect the first 17 min only, despite the fact that one run lasts 30 min. In that way, sildenafil, which has a retention time of 17.7 min, was not detected, resulting in higher relative intensities for the impurities present.

Chemometric approaches

The purpose of the HPLC-PDA and HPLC-MS analyses was to obtain as much information as possible about all samples. Therefore, it was chosen to include both MS1 and MS2 fingerprints in the data analysis. The MS1 fingerprints are the TIC profiles (relative intensity as a function of retention time). The MS2 fingerprints are a visualization of the fragments of precursors detected in MS1. All acquired fingerprints were surveyed at different UV wavelengths, including less specific wavelengths such as 210 and 230 nm. Furthermore, a review of the literature [17, 24–35] was performed to assist the choice of wavelengths to be included in the data analysis. Surprisingly, the best fingerprints were acquired at wavelengths 254, 270, and 290 nm in terms of the largest number of peaks and concomitant intensities. As a matter of fact, a lower number of peaks was observed at 210 and 230 nm and the peaks, visible at both wavelengths, showed higher intensity at 254, 270, and 290 nm. Therefore, it was decided to include these three wavelengths in the chemometric analysis.

Data pre-processing

During the collection of chromatograms, peaks can shift along the elution time axis due to column aging, instability of the instrument, or variance in mobile phase composition [36]. Despite the taken precautions to reduce peak shifts as much as possible, i.e., one batch of mobile phase and all samples analyzed on the same column in one series, the acquired chromatograms had to be aligned. Alignment of chromatograms is considered to be a critical step prior to the application of chemometric techniques [37]. For this purpose, correlation optimized warping (COW) was used.

COW is a technique which performs a fragment-wise stretching and compressing of the time axis in order to align chromatographic profiles. It uses the correlation coefficient as a similarity measure of the involved fingerprints [38]. First, a target profile (T) is selected with which the other profiles are aligned. The target profile is the one that is characterized by the highest mean correlation coefficient among all chromatograms [37]. Both the target profile and the profiles to be aligned are divided into a number of sections N, each containing approximately the same number of sampling points. Each section may be warped to a smaller or greater length by linear interpolation. However, only a finite number of possible warping magnitudes can be explored for each section [38].
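The selection of the target profile can be sketched as follows. The example assumes that all profiles have the same number of sampling points and are stored as the rows of an array; the function name `select_target` is hypothetical and is not part of the COW code referred to in this paper.

```python
import numpy as np


def select_target(profiles):
    """Return the row index of the profile with the highest mean correlation
    to all other profiles, together with the mean correlations."""
    profiles = np.asarray(profiles, dtype=float)
    corr = np.corrcoef(profiles)        # pairwise correlation between rows
    np.fill_diagonal(corr, np.nan)      # ignore the correlation of a profile with itself
    mean_corr = np.nanmean(corr, axis=1)
    return int(np.argmax(mean_corr)), mean_corr
```

For three otherwise identical peaks that are slightly shifted relative to each other, this rule would be expected to select the middle one, since it correlates best with both neighbours.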

All sections are aligned individually, starting at the end node and working backwards to the first section of the profile. Finding the optimal overall alignment is achieved by usage of dynamic programming, which explores all possible warping magnitudes for each section. The quality of alignment is determined by calculating the correlation coefficient between section i after alignment and the corresponding section of the target profile. During the dynamic programming, all suboptimal combinations of warping are discarded, retaining only the optimal warping combination. This optimal combination is characterized by the largest value of the summed correlation coefficients [38, 39]. More detailed information about COW can be found in refs. [38, 39].
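A strongly simplified sketch of this procedure is given below. It is not the implementation used in this study: it assumes sample and target profiles of equal length, uses one common slack for all sections, and replaces the explicit backward sweep by a memoised recursion, which explores the same warping combinations. The names `cow_align`, `n_sections`, and `slack` are ours.

```python
from functools import lru_cache

import numpy as np


def cow_align(sample, target, n_sections, slack):
    """Warp `sample` onto `target`; returns (warped profile, summed section correlation)."""
    sample = np.asarray(sample, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(target)
    if len(sample) != n:
        raise ValueError("this sketch assumes profiles of equal length")
    grid = np.arange(n)
    # Section boundaries (nodes) are fixed on the target and free on the sample.
    nodes = [int(v) for v in np.linspace(0, n - 1, n_sections + 1).round()]

    def warp(a, b, length):
        # Stretch or compress sample[a..b] to `length` points by linear interpolation.
        return np.interp(np.linspace(a, b, length), grid, sample)

    def section_corr(i, a, b):
        t_seg = target[nodes[i]:nodes[i + 1] + 1]
        s_seg = warp(a, b, len(t_seg))
        if t_seg.std() == 0 or s_seg.std() == 0:
            return 0.0  # correlation is undefined for a flat section
        return float(np.corrcoef(s_seg, t_seg)[0, 1])

    @lru_cache(maxsize=None)
    def best(i, a):
        # Best summed correlation for sections i..end when section i starts at sample index a.
        if i == n_sections:
            return (0.0, ()) if a == n - 1 else (-np.inf, ())
        base = nodes[i + 1] - nodes[i]
        best_val, best_path = -np.inf, ()
        for u in range(-slack, slack + 1):  # finite set of warping magnitudes
            b = a + base + u
            if b <= a or b > n - 1:
                continue
            tail_val, tail_path = best(i + 1, b)
            if tail_val == -np.inf:
                continue  # the remaining sections cannot reach the end point
            val = section_corr(i, a, b) + tail_val
            if val > best_val:
                best_val, best_path = val, (b,) + tail_path
        return best_val, best_path

    score, path = best(0, 0)
    bounds = (0,) + path
    warped = np.empty(n)
    for i in range(n_sections):
        lo, hi = nodes[i], nodes[i + 1]
        warped[lo:hi + 1] = warp(bounds[i], bounds[i + 1], hi - lo + 1)
    return warped, score
```

With `slack=0` no warping is possible and the profile is returned unchanged. A larger slack can only increase (or leave equal) the summed correlation, at the cost of computation time.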

Exploratory analysis of chromatographic fingerprints–principal component analysis (PCA)

PCA was performed to test whether this technique can visualize a discrimination between genuine, generic, and counterfeit samples.

PCA is a widely used method that projects high-dimensional data into a low-dimensional space, which is defined by new latent variables. These latent variables are commonly referred to as principal components (PCs) and are linear combinations of the original (high dimensional) variables. The first constructed PC represents the highest variance in the data; the second PC explains the highest residual variance around the first PC and is therefore, by definition, orthogonal to the first. The same principle is repeated for PC3 around the plane defined by PCs 1 and 2, etc. Since data structure can usually be summarized efficiently by a few PCs, PCA helps in reducing data dimensionality [40].

PCA results in two matrices: a loading matrix and a score matrix. The loadings express the contribution of each original variable to a given PC. The scores represent the projections of each object (= sample) on the constructed PCs. Therefore, they provide information about the (dis)similarities among the objects [5, 40].
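The relation between scores, loadings, and explained variance can be illustrated with a minimal sketch based on the singular value decomposition. Column centring as the only pre-treatment and the function name `pca` are assumptions made for this example; the toolbox used in this study may differ in its details.

```python
import numpy as np


def pca(X, n_components):
    """Column-centre X and return scores, loadings and % explained variance per PC."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # projections of the samples on the PCs
    loadings = Vt[:n_components].T                    # contribution of each variable to each PC
    explained = 100.0 * s ** 2 / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]
```

Plotting the first column of `scores` against the second yields a score plot of PC1 versus PC2. The entries of `explained` correspond to the percentages of explained variance per PC.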

It should be mentioned that the clustering, acquired when using PCA, is only a visual one. PCA is an unsupervised projection technique, not a clustering technique. Therefore, this technique does not explicitly search for clusters; it only aids in visualizing and interpreting data [40].

Selection of a training and test set

In order to validate any model, the data set was split into a training set and a test set using the Kennard and Stone algorithm. The training set is used to generate the classification models; the test set is selected to perform an external validation of the obtained prediction models.

The Kennard and Stone algorithm starts by selecting the sample which is situated closest to the data mean. This sample (s1) is assigned to the training set. The second sample, which is included in the training set (s2), is situated furthest away from s1. The third sample to be allocated to the training set (s3) is the one most remote from both s1 and s2. This procedure is repeated until the required number of samples in the training set is selected. The test set is composed of the remaining non-assigned samples [41].
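The selection procedure described above can be sketched in a few lines. The example uses the Euclidean distance and reads "most remote from both s1 and s2" as the largest distance to the nearest already selected sample, which is the usual interpretation but remains an assumption here; the function name `kennard_stone` is ours.

```python
import math


def kennard_stone(X, n_train):
    """Return (train_indices, test_indices) following the variant described in the text."""
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    n_vars = len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(n_vars)]
    # s1: the sample situated closest to the data mean
    first = min(range(n), key=lambda i: math.dist(X[i], mean))
    train = [first]
    remaining = [i for i in range(n) if i != first]
    while len(train) < n_train:
        # next sample: largest distance to its nearest already selected sample
        nxt = max(remaining, key=lambda i: min(math.dist(X[i], X[j]) for j in train))
        train.append(nxt)
        remaining.remove(nxt)
    return train, remaining
```

Because the selected samples are spread over the data space, the training set covers the variation in the data, while the test set consists of samples lying between them.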

Modelling techniques

A number of modelling techniques were applied to test whether appropriate classification models could be obtained which also might serve to classify unknown samples. It was chosen to include partial least squares-discriminant analysis (PLS-DA), soft independent modelling of class analogy (SIMCA), and k nearest neighbors (kNN) since these techniques are the main techniques found in literature for the discrimination between genuine and counterfeit samples. Moreover, they are relatively simple techniques, easy to understand, and are already applied successfully by our group [5, 42–45].

The genuine samples (Viagra®) are defined as class 1, the generic samples of Viagra® constitute class 2, and class 3 consists of all the counterfeit samples.

PLS-DA

PLS-DA is a supervised technique which aims to differentiate between groups of samples. The group membership of samples is indicated by a categorical dependent variable y. A PLS-DA model is acquired by constructing so-called PLS factors, which are linear combinations of the original variables. These PLS factors are constructed in a way that they represent maximum covariance between the original variables and response variable y. In order to obtain the best performing PLS-DA model, its complexity, i.e., number of PLS factors, is optimized using leave-one-out cross-validation [42, 44, 46].
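The principle can be sketched as follows. The example codes class membership as dummy variables (one column per class), fits a PLS2 regression, and assigns each sample to the class with the largest predicted response; both this coding and the assignment rule are assumptions, since the text does not specify them. All function names are ours, and the code is not the toolbox used in this study.

```python
import numpy as np


def pls_fit(X, Y, n_factors):
    """PLS2 regression; returns (x_mean, y_mean, B) with Y_hat = (X - x_mean) @ B + y_mean."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    E, F = X - x_mean, Y - y_mean
    W, P, Q = [], [], []
    for _ in range(n_factors):
        # Weight vector maximising the covariance between the X scores and Y
        w = np.linalg.svd(E.T @ F, full_matrices=False)[0][:, 0]
        t = E @ w                     # scores of the new PLS factor
        p = E.T @ t / (t @ t)         # X loadings
        q = F.T @ t / (t @ t)         # Y loadings
        E, F = E - np.outer(t, p), F - np.outer(t, q)   # deflation
        W.append(w)
        P.append(p)
        Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q).T
    B = W @ np.linalg.inv(P.T @ W) @ Q.T
    return x_mean, y_mean, B


def plsda_predict(model, X, classes):
    x_mean, y_mean, B = model
    Y_hat = (X - x_mean) @ B + y_mean
    return classes[np.argmax(Y_hat, axis=1)]   # class with the largest predicted response


def loo_accuracy(X, y, n_factors):
    """Leave-one-out correct classification rate for a given number of PLS factors."""
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)   # dummy coding of class membership
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = pls_fit(X[keep], Y[keep], n_factors)
        hits += int(plsda_predict(model, X[i:i + 1], classes)[0] == y[i])
    return hits / len(y)


def best_n_factors(X, y, max_factors):
    """Number of PLS factors with the highest leave-one-out classification rate."""
    rates = [loo_accuracy(X, y, a) for a in range(1, max_factors + 1)]
    return int(np.argmax(rates)) + 1, rates
```

When several model complexities give the same classification rate, `best_n_factors` returns the smallest one, which favours the simpler model.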

This technique not only enables the construction of a diagnostic model, it also gives insight into the data structure by exploring the space of the latent variables (PLS factors).

SIMCA

SIMCA is a supervised classification technique that models each class of samples separately by defining a number of PCs derived from PCA. First, the optimal number of PCs required to describe each training class individually is determined. This optimal number of PCs is found using a cross-validation procedure. Next, classification rules are constructed by considering two critical values: (1) one for the Euclidean distances towards the model and (2) one for the Mahalanobis distances calculated in the space of scores. These two critical values define a restricted space around the samples of one particular class. The position of a new sample (object) is calculated using the scores and loadings of the created model. If the object is situated within the restricted space around a training class, then the object is assigned to that class [5, 42].

Confidence limits were set at 95 %. Contrary to PLS-DA, SIMCA is a soft classification method, meaning that a sample can be assigned to one or more existing classes or to none [5, 42].
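The soft character of SIMCA can be illustrated with the following sketch. It builds one PCA model per class and accepts a sample for a class when both distances fall below their critical values. As a loudly stated simplification, the critical values are taken here as the empirical 95 % quantiles of the training distances rather than from a statistical distribution. The function names are ours, and the toolbox used in this study may compute the limits differently.

```python
import numpy as np


def distances(model, X):
    """Euclidean distance to the model and Mahalanobis distance in the score space."""
    Xc = np.atleast_2d(X) - model["mean"]
    scores = Xc @ model["loadings"]
    residuals = Xc - scores @ model["loadings"].T
    od = np.linalg.norm(residuals, axis=1)
    sd = np.sqrt(np.sum(scores ** 2 / model["score_var"], axis=1))
    return od, sd


def fit_class_model(X, n_pcs, quantile=0.95):
    """PCA model of one class plus empirical critical values for both distances."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    model = {
        "mean": mean,
        "loadings": Vt[:n_pcs].T,
        "score_var": s[:n_pcs] ** 2 / (len(X) - 1),
    }
    od, sd = distances(model, X)
    # Assumption: critical values are empirical quantiles of the training distances.
    model["od_crit"] = np.quantile(od, quantile)
    model["sd_crit"] = np.quantile(sd, quantile)
    return model


def simca_assign(models, x):
    """Return every class whose restricted space contains x (possibly none)."""
    accepted = []
    for label, model in models.items():
        od, sd = distances(model, x)
        if od[0] <= model["od_crit"] and sd[0] <= model["sd_crit"]:
            accepted.append(label)
    return accepted
```

`simca_assign` returns a list, so a sample can be accepted by several classes at once, or by no class at all when it lies outside every restricted space.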

kNN

kNN is a fairly simple technique helping to construct classification models. In this method, the Euclidean distance between an unknown object and each of the objects of the training set is calculated. If the training set includes n samples, then n distances are calculated. Subsequently, the k nearest objects to the unknown object are selected and a majority rule is applied, i.e., the object with unknown label is assigned to the class to which the majority of the k neighboring objects belongs. The number of nearest neighbors (k) to be included in the construction of a classification model has to be optimized [47]. A number of kNN models are built using a different number of neighbors. The best model is selected based on the cross-validation error obtained using a tenfold cross-validation procedure.
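A minimal sketch of the classification rule and of the optimization of k is given below. The assignment of samples to the cross-validation folds (every tenth sample in the same fold) is an assumption made for this example, and the function names are ours.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Assign x to the majority class among its k nearest training objects."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]   # ties go to the class encountered first


def cv_error(X, y, k, n_folds=10):
    """Fraction of misclassified samples in an n-fold cross-validation."""
    errors = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        tr_X = [X[i] for i in train_idx]
        tr_y = [y[i] for i in train_idx]
        errors += sum(knn_predict(tr_X, tr_y, X[i], k) != y[i] for i in test_idx)
    return errors / len(X)


def best_k(X, y, candidates, n_folds=10):
    """Number of neighbours with the lowest cross-validation error."""
    return min(candidates, key=lambda k: cv_error(X, y, k, n_folds))
```

If several values of k give the same cross-validation error, `best_k` returns the first one in `candidates`.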

Software

All data treatments were performed using Matlab version 8.0.0 (The Mathworks, Natick, USA). The algorithms for PCA, kNN, and Kennard and Stone were part of the ChemoAC toolbox (Freeware, ChemoAC Consortium, Brussels, Belgium, version 4.0). The toolboxes for SIMCA and PLS-DA were downloaded from Matlab Central [48, 49]. The COW algorithm was downloaded from http://www.models.kvl.dk/DTW_COW [50].

Results and discussion

Data pre-processing

Prior to the chemometric data analysis, all chromatograms were aligned using COW. This alignment procedure was performed separately for all three included wavelengths (i.e., 254, 270, and 290 nm) and both MS1 and MS2 profiles. The chromatograms recorded at the three wavelengths show a large sildenafil peak. This peak was used as a marker for the alignment. Both MS1 and MS2 profiles were aligned without a marker peak. As an example, Fig. 1 shows the marker peak of the chromatograms measured at 254 nm, before and after alignment.

Fig. 1 Overlay of the largest peak measured at 254 nm, before (a) and after (b) alignment

These aligned chromatograms were used as fingerprints. All PDA fingerprints were cut in order to limit the profile to the elution time window between 2 and 28 min since other regions did not contain any useful information. In order to focus the data analysis on secondary substances and impurities, the large sildenafil peak was eliminated as well by removing the section between 20 and 21.5 min. No cutting was performed on the MS1 and MS2 fingerprints since they were measured between 0 and 17 min only. Figure 2 shows a number of exemplary fingerprints.
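The cutting of the PDA fingerprints can be sketched as a simple filter on the time axis. Whether the boundary points of both time windows are kept or removed is not stated in the text; the inclusive limits used below, as well as the function name, are assumptions.

```python
def cut_fingerprint(times, intensities, window=(2.0, 28.0), removed=(20.0, 21.5)):
    """Keep points inside `window` and outside the `removed` section (times in min)."""
    kept = [
        (t, y)
        for t, y in zip(times, intensities)
        if window[0] <= t <= window[1] and not (removed[0] <= t <= removed[1])
    ]
    return [t for t, _ in kept], [y for _, y in kept]
```

The same function could be applied to every wavelength channel, so that all PDA fingerprints retain an identical set of time points.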

Fig. 2 Exemplary fingerprints obtained by PDA measured at 254 nm (a) and MS1 (b) for a genuine, generic, and counterfeit sample. For both types of fingerprints, the large sildenafil peak was eliminated. The dotted line on the PDA fingerprints indicates the time window between 20 and 21.5 min which was eliminated from the fingerprints

Prior to data analysis, all five types of fingerprints were normalized. In addition, both MS1 and MS2 fingerprints were log10-transformed.
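The text does not specify which normalization was applied. Purely as an illustration, the sketch below scales each fingerprint to unit Euclidean length and applies a log10 transformation with a small offset to avoid taking the logarithm of zero. Both the type of normalization and the `offset` parameter are assumptions and are not taken from the study.

```python
import math


def normalize(fingerprint):
    """Scale one fingerprint to unit Euclidean length (assumed type of normalization)."""
    norm = math.sqrt(sum(v * v for v in fingerprint))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero fingerprint")
    return [v / norm for v in fingerprint]


def log10_transform(fingerprint, offset=1e-6):
    """log10 of each intensity; `offset` (hypothetical) guards against log10(0)."""
    return [math.log10(v + offset) for v in fingerprint]
```

A log transformation compresses the intensity scale, so that small signals carry more weight relative to large ones in the subsequent data analysis.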

Throughout the data analysis, the measured UV intensities and relative MS intensities in the fingerprints were used as explanatory variables; the class numbers (see Table 1) were incorporated as response variables.

Table 1 Overview of the sample classification and composition of the used training and test set

Selection of the test and training set

The selection of the test and training set was performed by the Kennard and Stone algorithm on one large data set containing the data from all three wavelengths and both MS1 and MS2. That way, a number of samples were assigned to the test set. For each subsequent data analysis, the test set was composed of these respective samples. More details about the used test and training set can be found in Table 1.

PDA

Single wavelengths

Exploratory analysis

When performing a PCA for all three wavelengths separately, 254 nm generated the best result as can be seen on the corresponding score plots in Fig. 3. It was chosen to limit the number of PCs to two since 96.23 % of the total variance was explained for the data obtained at 254 nm (PC1 = 93.22 % and PC2 = 3.01 %). For the fingerprints measured at 270 nm, 97.17 % of the total variance was explained by two PCs (PC1 = 96.51 % and PC2 = 0.66 %), and at 290 nm, the percentage of explained variance totals 97.48 % (PC1 = 96.86 % and PC2 = 0.62 %).

Fig. 3 Score plots obtained by principal component analysis of each included wavelength separately: (a) 254 nm, (b) 270 nm, and (c) 290 nm

The score plot resulting from the data measured at 254 nm (Fig. 3a) shows one large cluster on the left side of the score plot, mainly consisting of counterfeit samples. Unfortunately, a number of genuine samples are part of this cluster. On the right side, two smaller clusters can be observed; the lower one contains generic samples and a small number of genuines only. The upper cluster is composed of four counterfeit samples and one genuine. Overall, the discrimination obtained at 254 nm is not optimal since a number of genuine samples cannot be distinguished from the counterfeits. However, this discrimination is better compared to the results obtained at 270 and 290 nm. At these wavelengths, a small cluster with counterfeit samples is indistinguishable from the generic and genuine samples (Fig. 3b, c).

Fingerprints obtained using the three wavelength channels were also analyzed separately using PLS-DA. The obtained score plots (figures not shown) were very similar to those obtained by PCA. In this case, 254 nm also resulted in the best, albeit suboptimal, discrimination; a number of genuine samples were clustered together with the counterfeit samples. The results acquired for 270 and 290 nm failed to support the differentiation of genuine/generic samples from counterfeit ones since a number of the latter were clustered together with the genuines/generics (just like with PCA).

Modelling techniques

Three different modelling techniques, i.e., PLS-DA, SIMCA, and kNN, were applied to verify whether they can successfully discriminate between the different sample classes. It was also tested if these methods can predict the class membership of unknown samples, using an external validation of the model (i.e., prediction of the test set samples).

An overview of the results obtained by each of the modelling techniques for the three included wavelengths can be found in the Electronic Supplementary Material (ESM, see Table S1). The best model for the data measured at 254 nm is obtained by kNN for k = 3. This model shows a correct classification rate of cross-validation of 97.37 % which is due to the misclassification of only three samples; three genuines are classified as generics. Therefore, this model shows a perfect discrimination between genuine/generic samples and counterfeit samples. The test set exhibits a correct classification rate of 96.55 % since only one sample is classified incorrectly; one genuine sample is considered to be counterfeit. Overall, kNN produces a satisfactory diagnostic model.

When analyzing the fingerprints acquired at 270 nm, the best model is acquired by the SIMCA approach. Seven PCs were used to model both classes 1 (genuines) and 2 (generics), whereas 12 PCs were necessary to describe class 3 (counterfeits). This model results in a 97.37 % correct classification rate of cross-validation due to three generic samples which are misclassified as counterfeit samples. The external validation shows a correct classification rate of 93.10 %. Two out of 29 test set samples are misclassified. Unfortunately, this misclassification concerns two genuine samples which are assigned to the counterfeit group.

The best diagnostic model constructed for the 290 nm data is offered by PLS-DA. The optimal PLS-DA model includes four PLS factors. This model is characterized by a correct classification rate of 91.23 % for cross-validation and 89.66 % for external validation. A total of ten training set samples are misclassified; these are the ten genuines present in the training set: three are misclassified as generic, and the remaining seven samples are considered to be counterfeit. This indicates that this model is not capable of classifying genuine medicines. This is also demonstrated by the test set; all three genuine samples are wrongly classified: one as generic, the two others as counterfeit.

Comparison of these models shows that the best model is obtained using kNN for the 254 nm data. Only this model results in a perfect discrimination between genuines/generics and counterfeits for the training set. The fact that three genuine medicines are recognized as generic pharmaceuticals does not pose any problems since both genuine and generic products have to comply with the same quality requirements. The test set contains only one misclassification: a genuine sample considered to be counterfeit. This misclassification is also acceptable since a genuine sample, which is suspected to be counterfeit, poses less risk to public health than a counterfeit sample, which is believed to be genuine.
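The reported correct classification rates can be reproduced from the number of misclassified samples. In the sketch below, the test set size of 29 is taken from the text, whereas the training set size of 114 is derived here as the 143 samples in total minus these 29 test samples; the function name is ours.

```python
def correct_classification_rate(n_samples, n_misclassified):
    """Percentage of correctly classified samples, rounded to two decimals."""
    return round(100.0 * (n_samples - n_misclassified) / n_samples, 2)


# Set sizes: 29 test samples are stated in the text; the training set size is
# derived as the 143 samples in total minus the 29 test samples.
N_TEST = 29
N_TRAIN = 143 - N_TEST
```

For example, `correct_classification_rate(N_TRAIN, 3)` corresponds to the three misclassified training samples of the kNN model at 254 nm, and `correct_classification_rate(N_TEST, 1)` to its single misclassified test sample.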

Combinations of wavelengths

All possible combinations of the three wavelengths were tested: (1) 254 nm_270 nm, (2) 254 nm_290 nm, (3) 270 nm_290 nm, and (4) 254 nm_270 nm_290 nm.
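The four combinations can be enumerated programmatically. How the fingerprints of one sample were combined is not stated explicitly; the end-to-end concatenation shown in `combine` is therefore an assumption, and both function names are ours.

```python
from itertools import combinations

WAVELENGTHS_NM = (254, 270, 290)


def wavelength_combinations(wavelengths=WAVELENGTHS_NM):
    """All combinations of two or more wavelengths."""
    return [
        combo
        for size in range(2, len(wavelengths) + 1)
        for combo in combinations(wavelengths, size)
    ]


def combine(fingerprints, combo):
    """Concatenate the fingerprints of one sample end to end (assumed combination scheme).

    `fingerprints` maps a wavelength in nm to the list of intensities at that wavelength.
    """
    return [value for wavelength in combo for value in fingerprints[wavelength]]
```

With concatenation, the number of variables per sample grows with each added wavelength, which would be consistent with a longer computation time for larger combinations.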

Exploratory analysis

The results obtained with PCA and PLS-DA for all four combinations (score plots not shown) are very similar to those shown in Fig. 3b, c. A number of counterfeit samples are clustered together with the genuine/generic samples. Therefore, no clear distinction could be made.

Modelling techniques

The correct classification rates obtained using all three modelling techniques for all tested PDA data combinations are summarized in the ESM (see Table S2). The best model for the 254 nm_270 nm fingerprint combination is generated by SIMCA. This model is characterized by a correct classification rate of cross-validation of 98.25 % which is due to the misclassification of two genuines as generic pharmaceuticals. The test set exhibits a correct classification rate of 93.10 %. Two genuine samples are assigned to a wrong class: one as a generic, the other as a counterfeit.

SIMCA also provides the best model for the 254 nm_290 nm data combination. Correct classification rates of 98.25 % for the training set and 93.10 % for the test set are obtained. Study of the occurring misclassifications shows that this SIMCA model not only results in the same type of misclassifications compared to the SIMCA model of 254 nm_270 nm but also that these misclassifications concern exactly the same genuine samples as in the 254 nm_270 nm SIMCA model.

From a chemometric point of view, the best model for the data combination 270 nm_290 nm is achieved by PLS-DA when including three PLS factors. This model has a correct classification rate of cross-validation of 91.23 % (ten samples are misclassified). Unfortunately, these misclassifications concern all ten genuine samples present in the training set. Three genuine samples are considered to be generic; the remaining seven are wrongly classified as counterfeit. The test set is characterized by an 89.66 % correct classification rate due to the misclassification of all three genuine samples: one as generic, the other two as counterfeit. This shows that the model is not capable of modelling or predicting the genuine samples correctly, and it is therefore less suitable.

When combining all three wavelengths, SIMCA results in the best model with a 100 % correct classification rate for the training set. The external validation presents a correct classification rate of 93.10 %. This percentage is due to the misclassification of two genuine samples: one is assigned to the generics class, and the second is considered to be a counterfeit.

Comparison of the abovementioned models demonstrates that the best model is obtained by SIMCA when combining all three wavelengths. This model shows a perfect discrimination between genuine, generic, and counterfeit medicines for the training set. Prediction of the test set results in two misclassifications. A genuine sample which is considered to be generic does not pose any problem for public health, and a genuine medicine which is regarded as counterfeit threatens public health much less than a counterfeit considered to be genuine. However, it should be mentioned that the test set misclassifications of the SIMCA models for the 254 nm_270 nm, 254 nm_290 nm, and 254 nm_270 nm_290 nm fingerprint combinations concern exactly the same genuine samples in all three cases. In addition, the training sets of 254 nm_270 nm and 254 nm_290 nm show only two genuine samples which are believed to be generics. The SIMCA model obtained for 254 nm_270 nm_290 nm therefore differs only slightly from the SIMCA models acquired for 254 nm_270 nm and 254 nm_290 nm, and the superiority of the three-wavelength combination could be questioned. Furthermore, the computation time for this three-wavelength combination was considerably longer than for the two-wavelength combinations and the single wavelength data.

Comparison of the models obtained for the single wavelength data sets and the mentioned combinations of wavelengths shows that, overall, the best prediction model is obtained by kNN for 254 nm, simply because this model exhibits the highest correct classification rate for external validation.

MS

The MS data were analyzed twice; firstly, the MS1 data were analyzed separately, and secondly, the combination of MS1 and MS2 data (MS1_MS2) was tested.

Exploratory analysis

PCA (score plots not shown) did not result in a good discrimination between genuine, generic, and counterfeit samples.

The resulting score plots of PLS-DA are shown in Fig. 4. The number of PLS factors was limited to two, since a third PLS factor did not provide any extra information. The score plot obtained for the MS1 data (Fig. 4a) does not show a clear distinction between the three groups of samples. However, a tendency towards discrimination is present. The genuine samples are clustered in the lower left corner, the counterfeit samples are mostly clustered in the upper part of the plot, and the generics are mainly grouped between the genuine and counterfeit samples. This tendency is also present in the plot acquired for the combination of MS1_MS2 data (Fig. 4b), where the trend appears to be clearer.

Fig. 4

Score plots obtained by partial least squares-discriminant analysis of both MS data sets: (a) MS1 and (b) MS1_MS2

Modelling techniques

An overview of the acquired results for both MS data sets is provided in the ESM (see Table S3). The best model for the MS1 data is obtained by PLS-DA. This model includes six PLS factors and exhibits a correct classification rate of cross-validation of 95.61 %. One genuine sample is considered to be generic. The four remaining misclassifications concern generic samples of which two are regarded as genuine and the other two as counterfeit. The test set presents a correct classification rate of 93.10 % since only two misclassifications occur. Unfortunately, these misclassifications concern two counterfeit samples of which one is classified as genuine, the other as generic.

The results obtained for the MS1_MS2 fingerprint combination show that PLS-DA (including seven PLS factors) clearly performs best, since both cross-validation and external validation are characterized by a correct classification rate of 100 %. This indicates that a perfect discrimination between genuine, generic, and counterfeit samples is achieved for both training and test set.

For both MS1 and MS1_MS2, the models obtained with SIMCA are not satisfactory. The external validation shows a correct classification rate of 82.76 % for both data sets. Inspection of the misclassifications reveals that for both data sets, all three genuine samples and both generic samples are classified as counterfeit, indicating that this model is not capable of discriminating between genuine/generic and counterfeit medicines. The kNN approach does not provide reliable models either. The test set presents a correct classification rate of 96.55 % for both data sets, since only one sample is misclassified. For the MS1 data, this misclassification concerns a genuine sample attributed to the counterfeits class; for the MS1_MS2 data, one counterfeit sample is considered to be genuine. However, these two models show a large number of misclassifications in the training set: 11 misclassified samples and one unclassified sample for the MS1 data and 11 misclassified and five unclassified samples for the MS1_MS2 data. Therefore, these kNN models are considered to be less suitable. When comparing the obtained SIMCA and kNN models, SIMCA shows a higher correct classification rate for the cross-validation than for the external validation, while kNN exhibits the opposite. This might indicate that the SIMCA model is overfitted, in contrast to kNN.

These results clearly show that the best model is obtained using PLS-DA for the MS1_MS2 data combination.

PDA-MS

The loadings of all three wavelengths in the respective PLS-DA and SIMCA models were surveyed to determine which wavelength should be combined with the MS1 data to obtain the best model. This survey suggested the combination of MS1 data with the data measured at 254 nm (254 nm_MS1).

Exploratory analysis

The score plots obtained by PCA and PLS-DA for this combination of data are shown in Fig. 5a, b. Only two PCs (Fig. 5a) were retained, since they explain 95.65 % of the total variance (PC1 = 92.26 % and PC2 = 3.39 %). For the PLS-DA analysis (Fig. 5b), two PLS factors were included. These two score plots are not only very similar to each other, but they also closely resemble the PCA plot obtained for the data measured at 254 nm (Fig. 3a). Although the obtained clustering is not optimal, a relatively good discrimination between the three classes of samples can be made. However, a number of genuine samples are an exception to this observation, as they are clustered together with the counterfeit samples.

Fig. 5

Score plots obtained by principal component analysis (a) and partial least squares-discriminant analysis (b) for the data combination 254 nm_MS1

Modelling techniques

kNN results in the best model for this data combination. It includes three nearest neighbors and shows a correct classification rate of cross-validation of 97.37 %. Three genuine samples are classified incorrectly, but these misclassifications do not pose any problem since all three are considered to be generics. The test set generates a correct classification rate of 96.55 %. Only one misclassification occurred: a genuine sample which is assigned to the counterfeit class.
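For readers less familiar with kNN, the principle of a three-neighbor classifier can be sketched as follows. This is an illustration only: the Euclidean distance, the toy two-dimensional data, and the rule that a tied vote leaves a sample unclassified are all assumptions, since this section does not state how the software used in the study measures distance or handles ties.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority class among the k nearest training samples.

    Returns None when the two most frequent classes receive the same number
    of votes (an assumed way for a sample to remain unclassified).
    """
    ranked = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in ranked[:k]).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None
    return votes[0][0]


# Hypothetical two-variable "fingerprints", not data from the study.
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (1.1, 0.9), (3.0, 3.0), (3.2, 2.9)]
train_y = ["genuine", "genuine", "generic", "generic", "counterfeit", "counterfeit"]

print(knn_predict(train_X, train_y, (0.2, 0.1)))  # genuine
print(knn_predict(train_X, train_y, (3.1, 3.0)))  # counterfeit
```

With three classes and three neighbors, a vote of one neighbor per class is possible, which is why the sketch returns `None` in that case instead of picking a class arbitrarily.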

The PLS-DA model obtained with eight PLS factors is also quite suitable. The training set features a correct classification rate of 98.25 % due to the misclassification of two genuine samples which are considered to be generic. Two out of 29 test set samples are misclassified as well, resulting in a 93.10 % correct classification rate. One genuine and one counterfeit sample are regarded as generic.

SIMCA, on the other hand, results in a less suitable model. Six PCs were retained to model class 1, nine PCs were used to describe class 2, and ten PCs were kept for class 3. Despite a 100 % correct classification rate of cross-validation, the test set is characterized by a correct classification rate of 89.66 %. All three genuine samples are misclassified: one as generic, the other two as counterfeit. This indicates that this model is not capable of correctly predicting the authentic nature of genuine samples.

Sensitivity and specificity

Table 2 presents an overview of the best models obtained for the different tested data sets. For both cross-validation and external validation, the performance of these models is expressed in correct classification rates.

Table 2 General overview of the best performing models

The associated confusion matrix is shown in Table 3.

Table 3 Overall confusion matrix summarizing the classification results obtained by the model performing best for each fingerprint combination

It could, however, also be interesting to express the performance of these models in terms of sensitivity and specificity. Since the classification problem considered in this study is a three-class problem and sensitivity and specificity are statistical measures for evaluating binary classification models, the considered classification has to be slightly modified. This modification can easily be performed by combining classes 1 and 2. Class 1 consists of genuine Viagra® samples, and class 2 is composed of generic products of Viagra®. Since both groups of medicines are produced in a legal way and have to comply with the same quality requirements, their fusion into one class is justified.

Sensitivity is a measure of the true positive rate; in this study, a true positive is defined as a counterfeit which is considered to be counterfeit. Specificity expresses the true negative rate, which is the rate of legal medicines (genuine and generic) regarded as legal. The sensitivity and specificity values for each of the models in Table 2 are presented in Table 4.

Table 4 Sensitivity and specificity of the best performing models

For all models, a sensitivity and specificity of 100 % is obtained for the training set despite the fact that the correct classification rates for the kNN models obtained for the 254 nm and 254 nm_MS1 data do not equal 100 % (Table 2). This is due to misclassifications of genuine samples as generics. Since genuines and generics constitute the same class for the calculation of sensitivity and specificity, these misclassifications are not taken into account.

Only the PLS-DA model obtained for the MS1_MS2 fingerprint combination exhibits a 100 % correct classification rate for external validation (Table 2), which is mirrored in the perfect sensitivity and specificity for the test set of this model. The remaining models show a perfect sensitivity and a specificity of 80 % which is due to the misclassification of a genuine as a counterfeit. The external validation of the SIMCA model obtained for the 254 nm_270 nm_290 nm data combination shows one additional misclassification, i.e., a genuine considered to be generic, which is not taken into account in the calculation of sensitivity and specificity.
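The reduction from three classes to a binary problem can be made explicit with a short calculation. The sketch below assumes a test set of three genuine, two generic, and 24 counterfeit samples; the first two counts follow from the misclassification descriptions above, while the 24 is derived as 29 minus 5 rather than quoted. The confusion matrices are reconstructed from the text, not copied from Table 3.

```python
GENUINE, GENERIC, COUNTERFEIT = 0, 1, 2
LEGAL = (GENUINE, GENERIC)  # classes 1 and 2 merged into one "legal" class


def sensitivity_specificity(confusion):
    """Compute sensitivity and specificity (in %) from confusion[true][predicted]."""
    true_pos = confusion[COUNTERFEIT][COUNTERFEIT]
    false_neg = sum(confusion[COUNTERFEIT][p] for p in LEGAL)
    # Genuine/generic confusions stay inside the merged class, so they count as true negatives.
    true_neg = sum(confusion[t][p] for t in LEGAL for p in LEGAL)
    false_pos = sum(confusion[t][COUNTERFEIT] for t in LEGAL)
    sensitivity = 100.0 * true_pos / (true_pos + false_neg)
    specificity = 100.0 * true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# kNN, 254 nm, test set: one genuine sample predicted as counterfeit.
knn_254_test = [[2, 0, 1], [0, 2, 0], [0, 0, 24]]
# SIMCA, three wavelengths, test set: one genuine as generic, one as counterfeit.
simca_3wl_test = [[1, 1, 1], [0, 2, 0], [0, 0, 24]]

print(sensitivity_specificity(knn_254_test))    # (100.0, 80.0)
print(sensitivity_specificity(simca_3wl_test))  # (100.0, 80.0)
```

Both matrices give the same result because the extra genuine-to-generic error of the SIMCA model falls within the merged legal class.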

Conclusion

Counterfeit medicines pose a threat to public health worldwide, even in Europe. Therefore, characterization of these products is a very important issue. During this study, a set of 143 samples was analyzed using a PDA and an MS (ion trap) detector in order to obtain different types of fingerprints revealing the impurities and other secondary components present. The purpose was to explore whether PDA and MS are complementary detection techniques by determining which technique (or perhaps a combination of both) is most suited to distinguish genuine/generic medicines from counterfeit ones.

Exploratory analysis of all combinations of PDA data revealed that neither PCA nor PLS-DA is capable of yielding a satisfactory discrimination between genuine, generic, and counterfeit medicines. Surprisingly, PLS-DA does not improve the discrimination compared to PCA, despite the fact that PLS-DA is a supervised method. For the MS fingerprints, however, the observations are different; the PCA score plot does not provide a useful clustering, while PLS-DA results in a clear tendency towards discrimination.

Since no optimal visual clustering is obtained, supervised techniques were applied to model the data. Overall, very adequate diagnostic models are obtained by means of three basic chemometric techniques. When comparing all models acquired for the PDA data combinations, it can be concluded that the 254 nm fingerprint set provides the best result, since the external validation generates the highest correct classification rate. When exploring the MS data, it is clear that the best diagnostic model is obtained for the MS1_MS2 combination, which is characterized by a perfect discrimination for both training set and test set. Based on a survey of the loadings, it was decided to combine the fingerprints measured at 254 nm and the MS1 data. For this data combination, the best diagnostic model is obtained by kNN; the only misclassification of importance is a genuine sample misclassified as a counterfeit. The acquired PLS-DA model also generates correct classification rates which are better than those obtained for the 254 nm and MS1 fingerprints separately. The only misclassification of importance concerns a test set counterfeit which is considered to be a generic sample. For SIMCA, conclusions are more complicated. The training set is characterized by a 100 % correct classification rate, but the test set performs worse than for the 254 nm data, owing to a single additional genuine sample which is misclassified as a counterfeit. However, this SIMCA model is quite suitable, since the misclassification of a genuine sample poses less risk to public health than a counterfeit sample which is considered to be genuine. A sample flagged as counterfeit will be withheld from the market until a thorough analysis identifies its true nature. If the sample turns out to be genuine after all, it will be released again.

Based on the results obtained for this data set, it could be concluded that MS provides less suitable models (except for PLS-DA), since several genuines and generics are classified as counterfeits and vice versa. This is probably due to the high complexity of the data and the good overall results obtained with the PDA data. However, when the appropriate chemometric techniques are selected carefully, the preferred detection method can be used. For instance, when combining MS1 and MS2 data, a perfect discrimination can be obtained using PLS-DA; when applying kNN, good classification models can be obtained by UV detection at 254 nm. This might be an interesting observation for the characterization of counterfeit drugs in developing countries, since more sophisticated equipment is often not available. Nevertheless, if no selection of chemometric tools can be performed in advance, the combination of PDA and MS data (254 nm_MS1) is likely to generate better classification models than PDA or MS individually. In general, taking all three modelling techniques into account, this combination results in fewer classification errors between the genuines/generics and counterfeits than the PDA and MS data separately. Most of the misclassifications that occur concern genuine samples which are considered to be generics, which does not pose any public health threat since both genuine and generic medicines have to comply with the same quality requirements. Therefore, this combination of data is preferred.

The results obtained in this study might be useful for other laboratories responsible for the detection of counterfeit medicines. Moreover, the strategy presented here could be tested on, and may prove useful for, other groups of medicines which are often counterfeited, such as slimming products, pain killers, and sleeping aids.