1 Problem statement

In the following, the term plagiarism refers to text plagiarism, i.e., the use of another author’s information, language, or writing, when done without proper acknowledgment of the original source. Plagiarism detection refers to the unveiling of text plagiarism. Existing approaches to computer-based plagiarism detection break down this task into manageable parts:

“Given a text d and a reference collection D, does d contain a section s for which one can find a document d i  ∈ D that contains a section s i such that under some retrieval model \({\mathcal{R}}\) the similarity \(\varphi_{{\mathcal{R}}}\) between s and s i is above a threshold θ?”
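For illustration, a minimal sketch of this decision under simple assumptions: a plain term-frequency vector space model as retrieval model \({\mathcal{R}}\), cosine similarity as \(\varphi_{{\mathcal{R}}}\), and an arbitrary threshold θ = 0.8; all names are illustrative.

    from collections import Counter
    from math import sqrt

    def cosine(a: Counter, b: Counter) -> float:
        # Similarity under a plain term-frequency vector space model.
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def external_detection(d_sections, reference_sections, theta=0.8):
        # d_sections: token lists of the sections of the suspicious document d;
        # reference_sections: token lists of sections drawn from the collection D.
        hits = []
        for i, s in enumerate(d_sections):
            for j, s_i in enumerate(reference_sections):
                if cosine(Counter(s), Counter(s_i)) >= theta:
                    hits.append((i, j))   # similarity above theta: report the pair
        return hits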

Observe that research on automated plagiarism detection presumes a closed world where a reference collection D is given. Since D can be extremely large—possibly the entire indexed part of the World Wide Web—the main research focus is on efficient search technology: near-similarity search and near-duplicate detection (Brin et al. 1995; Hoad and Zobel 2003; Bernstein and Zobel 2004; Henzinger 2006; Hinton and Salakhutdinov 2006; Yang and Callan 2006), tailored indexes for near-duplicate detection (Finkel et al. 2002; Bernstein and Zobel 2004; Broder et al. 2006), or similarity hashing techniques (Kleinberg 1997; Indyk and Motwani 1998; Gionis 1999; Stein 2005, 2007). This article, however, deals with technology to identify plagiarized sections in a text if no reference collection is given. We distinguish the two analysis challenges as external and intrinsic analysis, respectively. Note that human readers are able to identify plagiarism without having a reference collection at their disposal: shifts between brilliant and baffling passages, or a change in the narrative perspective, give hints to multiple authorship.

1.1 Intrinsic plagiarism analysis and authorship verification

Intrinsic plagiarism analysis is closely related to authorship verification: the goal of the former is to identify potential plagiarism by analyzing a document with respect to undeclared changes in writing style. Similarly, in an authorship verification problem one is given writing examples of an author A and is asked to determine whether or not a text of doubtful authorship is also from A. Intrinsic plagiarism analysis can be understood as a more general form of the authorship verification problem:

  1.

    one is given a single document only, and

  2.

    one is faced with the problem of finding the suspicious sections.

Intrinsic plagiarism analysis and authorship verification are one-class classification problems. A one-class classification problem defines a target class for which a certain number of examples exist. Objects outside the target class are called outliers, and the classification task is to tell outliers apart from target class members. Actually, the set of “outliers” can be much bigger than the target class, and an arbitrary number of outlier examples could be collected. Hence a one-class classification problem may look like a two-class discrimination problem, but there is an important difference: members of the target class can be considered as representatives for their class, whereas one will not be able to compile a set of outliers that is representative for some kind of “non-target class”. This fact is rooted in the huge number and the diversity of possible non-target objects. Put another way, solving a one-class classification problem means learning a concept (the concept of the target class) in the absence of representative counterexamples. However, in rare cases, knowledge about outliers can be used to construct representative counterexamples related to the target class. Then a standard discrimination strategy can be followed.

1.2 Decision problems

Within the classical authorship verification problem the target class comprises writing examples of a known author A, and each piece of text written by an author B, \(B\not=A,\) is considered as a (style) outlier. Intrinsic plagiarism analysis is an intricate variant of authorship verification, imposing particular constraints and assumptions on the availability of writing style examples. To organize existing research we introduce the following authorship verification problems, formulated as decision problems.

  1.

    Problem. AVextern

    • Given. A text d, written by author A, and a set of texts, D = {d 1, …, d n }, written by authors B, \(A\not\in{\bf B}\).

    • Question. Does d contain a section whose similarity to a section in d i , d i  ∈ D, is above a threshold θ?

  2.

    Problem. AVfind

    • Given. A text d, allegedly written by author A.

    • Question. Does d contain a section written by an author B, \(B\not=A\) ?

  3.

    Problem. AVoutlier

    • Given. A set of texts D = {d 1, …, d n }, written by author A, and a text d, allegedly written by author A.

    • Question. Is d written by an author B, \(B\not=A?\)

The problem class AVextern corresponds to the external plagiarism analysis problem mentioned at the outset; the problem class AVfind corresponds to the general intrinsic plagiarism analysis problem, and the problem class AVoutlier corresponds to the classical authorship verification problem. An instance π of AVfind can be reduced to m instances of AVoutlier, AVfind \(\le_{{tt}}^{p}\) AVoutlier, by applying a canonical chunking strategy that splits a document into m sections while asking for each section whether it forms an outlier or not. If at least one instance of AVoutlier is answered with yes, the answer to π is yes. Likewise, an instance π of AVoutlier can be reduced to an instance of AVfind, AVoutlier ≤ AVfind, by simply merging d and all documents in D into a single document. The different complexity of the problem classes is reflected by the reductions \(\le_{{tt}}^{p}\) and ≤.
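For illustration, a minimal sketch of the chunking reduction; the outlier decision procedure is_outlier and the section size are hypothetical placeholders.

    def av_find(d, is_outlier, chunk_size=5000):
        # Canonical chunking: split d into m sections of uniform length.
        sections = [d[i:i + chunk_size] for i in range(0, len(d), chunk_size)]
        # One AVoutlier instance per section s; the remaining sections serve
        # as the writing examples attributed to the alleged author A.
        for s in sections:
            targets = [t for t in sections if t is not s]
            if is_outlier(s, targets):   # hypothetical AVoutlier decision procedure
                return True              # at least one yes: the answer to pi is yes
        return False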

If the answer to an instance π of AVfind is given via a reduction of π to m AVoutlier problems, one can try to strengthen the evidence for this answer in a post-processing step: from the m potential outlier sections two sets D 1 and D 2 are formed, comprising those sections that have been classified as targets in one set, and those that have been classified as outliers in the other. Again, we ask whether the documents in these two sets are written by a single author, this time applying an analysis method which takes advantage of the two sample sets, D 1, D 2, and which hence is more reliable than the outlier analysis. Since this decision problem is important from an algorithmic viewpoint we introduce a respective problem class:

  3.′

    Problem. AVbatch

    • Given. A set of texts, D 1 = {d 1,1, …, d 1,k }, written by author A, and a second set of texts, D 2 = {d 2,1, …, d 2,l }, allegedly written by author A.

    • Question. Does D 2 contain a text written by an author B, \(B\not=A\)?

Obviously AVoutlier and AVbatch can be reduced to each other in polynomial time, hence AVoutlier ≡ AVbatch. However, it is important to note that both reductions, AVfind \(\le_{{tt}}^{p}\) AVoutlier and AVoutlier ≤ AVbatch, are constrained by a minimum text length that is necessary to perform a sensible style analysis. Experience shows that a style analysis becomes statistically unreliable for text lengths below 250 words (Stein and Meyer zu Eissen 2007).

1.3 Existing research

Authorship analysis divides into authorship verification problems and authorship attribution problems. The by far larger part of the research addresses the attribution problem: given a document d of unknown authorship and a set D of candidate authors with writing examples, one is asked to attribute d to one of the authors. In a verification problem (see above) one is given writing examples of an author A, and one is asked to verify whether or not a document d of unknown authorship is in fact written by A. Recent contributions to the authorship attribution problem include (Rudman 1997; Stamatatos 2001, 2007, 2009; Chaski 2005; Juola 2006; Malyutov 2006; Sanderson and Guenter 2006b); the authorship verification problem is addressed in Koppel and Schler (2004b), van Halteren (2004, 2007), Meyer zu Eissen and Stein (2006, 2007), Koppel et al. (2007), Stein and Meyer zu Eissen (2007), Stein et al. (2008) and Pavelec et al. (2008).

Several research areas are related to authorship verification, in particular: (1) stylometry, i.e., the construction of models for the quantification of writing style, text complexity, and grading level assessment, (2) outlier analysis and meta learning (Tax 2001; Tax and Duin 2001; Manevitz and Yousef 2001; Rätsch et al. 2002; Koppel and Schler 2003, 2004b, 2006), and (3) symbolic knowledge processing, i.e., knowledge representation, deduction, and heuristic inference (Russell and Norvig 1995; Stefik 1995).

In their excellent paper from 2004, Koppel and Schler give an illustrative discussion of authorship verification as a one-class classification problem (Koppel and Schler 2004b). In the same work they introduce the unmasking approach to determine whether a set of writing examples is a subset of the target class. Observe the term “set” in this connection: unmasking does not solve the one-class classification problem for a single object but requires a batch of objects all of which must stem either from the target class or not.

2 Building blocks to operationalize authorship verification

Plagiarism detection can be operationalized by decomposing a document into natural sections, such as sentences, chapters, or topically related blocks, and analyzing the variance of stylometric features for these sections. In this regard the decision problems in Sect. 1.2 are of decreasing complexity: instances of AVfind comprise both a selection problem (finding suspicious sections) and an AVoutlier problem; instances of AVbatch are a restricted variant of AVoutlier since one has the additional knowledge that all elements of a batch are (or are not) outliers at the same time.

Solving instances of AVfind involves various subtasks; Table 1 organizes them as building blocks—from left to right—following the logical text processing chain. Among others, the building blocks denote alternative decomposition strategies, alternative style models, alternative classification technology, as well as post-processing options whose objective is to improve the analysis’ overall precision and recall. The table highlights those building blocks that are combined in our analysis chain; the following subsections discuss them in greater detail. Note that even with a skillful combination and adaptation of these building blocks it is very difficult to end up with an analysis process comparable to the power of a human reader.

Table 1 Building blocks to operationalize authorship verification

2.1 Impurity assessment

How likely is it that a document d contains a section of another author? We expect that the lengths, the places, and the overall fraction θ of such sections depend on particular document characteristics. Hence it makes sense to analyze the document type (paper, dissertation), its genre (novel, factual report, research, dictionary entry), but also the issuing institution (university, company, public service). Algorithmic means to reveal such information interpret document lengths, genres, and occurring named entities.

2.2 Decomposition strategy

The simplest strategy is to decompose a text d into sections s 1, …, s n of uniform length; in Meyer zu Eissen and Stein (2006) the authors additionally integrate sentence boundary detection. However, a more sensible interpretation of structural boundaries (chapters, paragraphs) is possible, which may consider special text elements like tables, formulas, footnotes, or quotations as well (Reynar 1998). Though quite difficult, the detection of topical boundaries has a significant impact on the usefulness of a decomposition (Choi 2000). In Graham et al. (2005) the authors even try to identify stylistic boundaries.
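A minimal sketch of the simplest strategy, uniform-length decomposition combined with a naive sentence detection so that sections close at sentence boundaries; the section size and the splitting pattern are illustrative assumptions.

    import re

    def decompose(text, section_size=5000):
        # Naive sentence detection; a real system would use a trained splitter.
        sentences = re.split(r'(?<=[.!?])\s+', text)
        sections, current = [], ''
        for sentence in sentences:
            current += sentence + ' '
            if len(current) >= section_size:   # close the section at a sentence boundary
                sections.append(current.strip())
                current = ''
        if current.strip():
            sections.append(current.strip())
        return sections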

2.3 Style model construction

The statistical analysis of literary style is called stylometry, and the first ideas date back to 1851 (Holmes 1998). The automation of this task requires a quantifiable style model, and efforts in this direction became a more active research field in the 1930s (Zipf 1932; Yule 1944; Flesch 1948). In the meantime various stylometric features, also termed style markers, have been proposed. They measure writer-specific aspects like vocabulary richness (Honore 1979; Yule 1944), text complexity and understandability (Flesch 1948), or reader-specific grading levels that are necessary to understand a text (Dale and Chall 1948; Kincaid et al. 1975; Chall and Dale 1995). Note that the mentioned style features have been developed to judge longer texts, ranging from a few pages up to book size.

Style model construction must consider the decomposition strategy: different stylometric features have different strengths and also pose different constraints on text length, text genre, or topic variation. Since text plagiarism typically relates to sections that are shorter than a single page (Mansfield 2004), the decomposition of a document into sections s 1, …, s n must not be too coarse, and it is questionable which of the stylometric features will work for short sections. It should be clear that style features that employ measures like average paragraph length are not reliable in general. The authors in Meyer zu Eissen and Stein (2007) investigate the robustness of the vocabulary richness measures Yule’s K, Honore’s R, and the average word frequency class. They observe that the average word frequency class can be called robust: it provides reliable results even for short sections, which can be explained by its word-based granularity. In Meyer zu Eissen and Stein (2006) connections of this type have been analyzed for the Flesch–Kincaid Grade Level (1948, 1975), the Dale–Chall formula (1948, 1995), Yule’s K (1944), Honore’s R (1979), the Gunning Fog index (1952), and the average word frequency class (Meyer zu Eissen and Stein 2004).
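To make the measure concrete, a sketch of the average word frequency class under the usual definition: the frequency class of a word w is ⌊log2(f(w*)/f(w))⌋, where w* is the most frequent word of a background corpus; the background frequency dictionary is an assumed resource.

    from math import floor, log2

    def avg_word_frequency_class(section_words, background_freq):
        # background_freq: word -> occurrence count in a large background corpus.
        f_max = max(background_freq.values())   # frequency of the most frequent word
        classes = []
        for w in section_words:
            f = background_freq.get(w.lower())
            if f:                                # unknown words are skipped here
                classes.append(floor(log2(f_max / f)))
        return sum(classes) / len(classes) if classes else float('nan')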

Table 2 compiles an overview of important stylometric features that have been proposed so far; we distinguish between lexical features (character-based and word-based), syntactic features, and structural features. Our overview is restricted to the well-known style features and omits esoteric variants. Those features marked with an asterisk have been reported to be particularly discriminative for authorship analysis and are used within our stylometric analysis.

Table 2 Compilation of important and well-known features used within a stylometric analysis. Features that are implemented within our style model are marked with an asterisk

2.4 Outlier identification

The decomposition of a document d gives a sequence s 1, …, s n of sections, for which the computation of a style model gives a sequence \({\bf s}_1, \, \ldots ,{\bf s}_n\) of feature vectors, which in turn are analyzed with respect to outliers. The identification of outliers among the \({\bf s}_i\) has to be solved on the basis of positive examples only and hence poses a one-class classification problem. Following Tax, one-class classification approaches fall into one of the following three classes (Tax 2001):

  (a)

    Density methods, which directly estimate the probability distributions of features for the target class. Outliers are assumed to be uniformly distributed, and, for example, Bayes’ rule can be applied to separate outliers from target class members.

  (b)

    Boundary methods, which avoid the estimation of the multi-dimensional density function but try to define a boundary around the set of target objects. The boundary computation is based on the distances between the objects in the target set.

  (c)

    Reconstruction methods come into play if prior knowledge about the generation process of target objects is available. Outliers can be distinguished from targets because of the higher reconstruction error they incur during the model fit.

The main advantage of boundary methods, namely to get by without assessing the multi-dimensional density function, can also be achieved with a density-based approach under Naive Bayes. Moreover, for our domain it is not clear how a boundary around the target set should be defined. We have also developed and analyzed reconstruction methods that rely on factor analysis and principal component analysis, but experienced difficulties due to unsatisfactory generalization behavior. Here, within our analysis chain, we resort to a one-class classifier of Type (a), which is outlined in the following.

Let S t denote the event that a section s ∈ {s 1, …, s n } belongs to the target group (= not plagiarized); likewise, let S o denote the event that s belongs to the outlier group (= plagiarized). Given a document d and a single style feature x, the maximum a-posteriori hypothesis H ∈ {S t, S o} can be determined with Bayes’ rule:

$$ H = \mathop{{\rm argmax}}_{S\in\{S^t, S^o\}} \frac{P(x(s)\, |\, S)\cdot P(S)} {P(x(s))} $$
(1)

where x(s) denotes the value of style feature x for section s, and P(x(s) | S t) and P(x(s) | S o) denote the respective conditional probabilities that x(s) is observed in the target group or the outlier group. Since the fraction of outliers is small compared to all sections it is sensible to estimate the P(x(s) | S t) with a Gaussian distribution; the expectation and the variance for x are estimated from x(s 1), …, x(s n ), omitting those sections s i that maximize or minimize x(s i ). The outliers can stem from different authors, and hence the P(x(s) | S o) are estimated with a uniform distribution, following a least commitment consideration (Tax 2001). See Fig. 1 for an illustration of the assumed style feature distributions in target and outlier sections. The priors P(S t) and P(S o) correspond to 1 − θ and θ respectively and require an impurity assessment (see Sect. 2.1). If no information about θ is available a uniform distribution is assumed for the priors, i.e., we resort to the maximum likelihood estimator.

Fig. 1 Targets and outliers can be separated if they are differently distributed

Multiple style features x 1, …, x m require accounting for multiple conditional probabilities. Under the conditional independence assumption the naive Bayes approach can be applied; the accepted a-posteriori hypothesis is then computed as follows:

$$ H = \mathop{{\rm argmax}}_{S\in\{S^o, S^t\}} P(S)\cdot \prod_{i=1}^m P(x_i(s) \,|\, S) $$
(2)

For the maximum a-posteriori decision (2) only those style features x are considered whose values fall outside the uncertainty intervals (cf. Fig. 1), which are defined by 1.0 and 2.0 times the estimated standard deviation.
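A minimal univariate sketch of this decision rule under the stated assumptions: a Gaussian target density with trimmed parameter estimates, a uniform outlier density over the observed feature range, an assumed impurity θ as outlier prior, and abstention inside the uncertainty band.

    from math import exp, pi, sqrt
    from statistics import mean, stdev

    def classify_sections(values, theta=0.1):
        # values: x(s_1), ..., x(s_n) for one style feature; theta: assumed impurity.
        trimmed = sorted(values)[1:-1]             # omit the extreme sections
        mu, sigma = mean(trimmed), stdev(trimmed)  # needs a handful of sections
        span = (max(values) - min(values)) or 1.0  # support of the uniform outlier density
        labels = []
        for x in values:
            z = abs(x - mu) / sigma
            if 1.0 < z < 2.0:                      # uncertainty interval: no outlier vote
                labels.append('target')
                continue
            p_target = (1 - theta) * exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))
            p_outlier = theta / span
            labels.append('outlier' if p_outlier > p_target else 'target')
        return labels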

2.5 Outlier post-processing

The post-processing methods in Table 1 can be divided into knowledge-based methods and meta learning approaches. The former include heuristic voting, citation analysis, and human inspection. Heuristic voting, which is applied here, is the estimation and use of acceptance and rejection thresholds based on the number of classified outlier sections. Meta learning is brought into play if from the solution of several AVoutlier problems two sets D 1 (sections labeled as targets) and D 2 (sections labeled as outliers) are formed, obtaining this way an instance of the AVbatch problem. Possible meta learning approaches are:

  (a)

    Unmasking (Koppel and Schler 2004b), which is a representative of what Tax terms “reconstruction method” (Tax 2001); it measures the increase of a sequence of reconstruction errors, starting with a good reconstruction which then is successively impaired.

  (b)

    The Qsum heuristic (Morton and Michaelson 1990; Hilton and Holmes 1993), which compares the growth rates of two cumulative sums over a sequence of sentences. Basis for the sums are the deviations from the mean sentence length and the deviations of function word frequencies; a sketch follows after this list.

  (c)

    Batch means, which is applied within the analysis of simulation data in order to detect the end of a transient phase. For a series of values the variance development of the sample mean is measured while the sample size is successively increased.
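A minimal sketch of the Qsum idea from item (b), assuming sentence lengths and per-sentence function word counts as the two compared habits; for a single-author text the two cumulative sums are expected to grow at similar rates.

    def qsum_curves(sentence_lengths, habit_counts):
        # habit_counts: per-sentence counts of a habit, e.g. function words.
        n = len(sentence_lengths)
        mean_len = sum(sentence_lengths) / n
        mean_habit = sum(habit_counts) / n
        c_len = c_habit = 0.0
        curve_len, curve_habit = [], []
        for length, habit in zip(sentence_lengths, habit_counts):
            c_len += length - mean_len      # cumulative deviation from mean length
            c_habit += habit - mean_habit   # cumulative deviation of the habit
            curve_len.append(c_len)
            curve_habit.append(c_habit)
        return curve_len, curve_habit       # diverging growth hints at mixed authorship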

Unmasking has been successfully applied to solve instances of AVbatch (Sanderson and Guenter 2006b; Koppel and Schler 2004a; Koppel et al. 2007; Surdulescu 2004). The robustness of the approach is also reported by Kacmarcik and Gamon who develop methods for obfuscating document stylometry in order to preserve author anonymity (Kacmarcik and Gamon 2006). Since unmasking is a building block in our analysis chain it is explained in greater detail now. The use of unmasking for intrinsic plagiarism analysis was proposed in Stein and Meyer zu Eissen (2007), who consider a style outlier analysis as a heuristic to compile a potentially plagiarized and sufficiently large auxiliary document.

Recall that the set D 1 (targets) is attributed to author A, while the authorship of the sections in D 2 (outliers) is considered unsettled. With unmasking we seek further evidence for the hypothesis that a text in D 2 is written by an author B, \(B\not=A.\) At first, D 1 and D 2 are represented under a reduced vector space model, designated as \({\bf D}_1\) and \({\bf D}_2.\) As the initial feature set the 250 words with the highest relative frequency in \(D_1 \cup D_2\) are chosen. Unmasking then happens in the following steps (see Fig. 2):

  1.

    Model Fitting. Training of a classifier that separates \({\bf D}_1\) from \({\bf D}_2.\) In Koppel and Schler (2004b) the authors implement a tenfold cross-validation experiment with a linear kernel SVM to determine the achievable accuracy.

  2.

    Impairing. Elimination of the most discriminative features with respect to the model obtained in Step 1; construction of new collections \({\bf D}_1,\) \({\bf D}_2,\) which now contain impaired representations. Koppel and Schler (2004b) report convincing results from eliminating the six most discriminating features. This heuristic depends on the section length, which in turn depends on the length of d.

  3.

    Go to Step 1 until the feature set is sufficiently reduced. About 5–10 iterations are typical.

  4.

    Meta Learning. Analyze the degradation in the quality of the model fitting process: if after the last impairing step the sets \({\bf D}_1\) and \({\bf D}_2\) can still be separated with a small error, assume that d 1 and d 2 stem from different authors. Figure 3 shows a characteristic plot where unmasking is applied to short papers of 4–8 pages.

Fig. 2 Given are two sets of sections D 1 and D 2, allegedly written by a single author. Unmasking measures the separability of \({\bf D}_1\) versus \({\bf D}_2\) when the style model is successively impaired

Fig. 3 Unmasking at work: each line corresponds to a comparison of two papers. A solid red line belongs to papers of two different authors; a dashed green line belongs to papers of the same author

The rationale of unmasking: Two sets of sections, D 1, D 2, constructed from two different documents d 1 and d 2 of the same author can be told apart easily if the vector space model (VSM) is chosen as retrieval model. The VSM considers all words in \(d_1 \cup d_2,\) and hence it includes all kinds of open class and closed class word sets. If only the 250 most frequent words are selected, a large fraction of them will be function words and stop words. Among these 250 most frequent words a small number does the major part of the discrimination job; these words capture topical differences, differences that result from genre, purpose, or the like. By eliminating them, one approaches step by step the distinctive and subconscious manifestation of an author’s writing style. After several iterations the remaining features are not powerful enough to discriminate two documents of the same author. But, if d 1 and d 2 stem from two different authors, the remaining features will still quantify significant differences between \({\bf D}_1\) and \({\bf D}_2\).
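A compact sketch of the unmasking loop, assuming scikit-learn is available; here the six features with the largest absolute SVM weights are eliminated per iteration, and the matrices X1, X2 are assumed to hold term frequencies of the sections in \({\bf D}_1\) and \({\bf D}_2\) over the pre-selected most frequent words.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def unmasking_curve(X1, X2, iterations=10, eliminate=6):
        # Returns one cross-validated accuracy per iteration; each class needs
        # enough sections for the tenfold cross-validation to be meaningful.
        X = np.vstack([X1, X2]).astype(float)
        y = np.array([0] * len(X1) + [1] * len(X2))
        active = np.arange(X.shape[1])         # indices of surviving features
        curve = []
        for _ in range(iterations):
            clf = LinearSVC()
            curve.append(cross_val_score(clf, X[:, active], y, cv=10).mean())
            clf.fit(X[:, active], y)
            # Impairing: drop the most discriminative features (largest |weight|).
            worst = np.argsort(np.abs(clf.coef_[0]))[-eliminate:]
            active = np.delete(active, worst)
        return curve  # a curve that stays high hints at different authors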

3 Analysis

This section reports on the performance of the operationalized analysis chain. Figure 4 gives an illustration: the top row shows documents with original sections (green), plagiarized sections (red), and sections spotted by the classifier (hatched); the middle row shows the micro- and macro-averaged outlier classification performance; the bottom row shows three alternative post-processing strategies. These strategies differ with respect to the interpretation of the fraction θ′ of sections per document that are classified as outliers: under the minimum risk strategy a document d is considered as plagiarized if at least one outlier section is spotted, under the heuristic voting strategy θ′ is compared to a threshold τ, and under the unmasking strategy meta learning is applied if θ′ falls into an uncertainty interval. The remainder of this section gives the particulars.

Fig. 4 Illustration of the analysis chain. Top: corpus with five documents of author A, containing sections of some author \(B \ne A.\) Middle: micro- and macro-averaged analysis of the outlier identification performance. Bottom: outlier post-processing according to three alternative strategies; θ′ denotes the fraction of sections per document that are classified as outliers

3.1 Corpus

To run analyses on a large scale one has to resort to artificially plagiarized documents. Here, we use a subset of the corpus that has been constructed for the intrinsic plagiarism analysis task of the PAN’09 competition (Potthast et al. 2009). The PAN’09 corpus comprises about 3,000 generated cases of intrinsic plagiarism—more precisely: cases of style contamination—exhibiting varying degrees of obfuscation. The corpus is based on books from the English part of Project Gutenberg and contains mainly narrative text. Sections of varying length, ranging from a few sentences up to many pages, are inserted into other documents according to heuristic placement rules. In addition, obfuscation of the inserted sections is performed by replacing, shuffling, deleting, or adding words.

For our experiments the documents of the PAN’09 corpus are uniformly decomposed into candidate sections of 5,000 characters; each candidate section s in turn is categorized as being either non-plagiarized, if s contains no word from an inserted section, or plagiarized, if more than 50% of s stems from an inserted section. Otherwise s is discarded and excluded from further investigations. Documents with fewer than seven sections are removed from the corpus because they are considered to be too short for a reliable stylometric analysis.
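A sketch of this labeling rule; how inserted text is tracked (here, as a set of marked tokens) is an implementation assumption.

    def label_section(section_tokens, inserted_tokens):
        # inserted_tokens: tokens known to stem from inserted, i.e.
        # plagiarized, sections of the corpus construction.
        overlap = sum(1 for t in section_tokens if t in inserted_tokens)
        if overlap == 0:
            return 'non-plagiarized'           # no word from an inserted section
        if overlap / len(section_tokens) > 0.5:
            return 'plagiarized'               # more than 50% inserted text
        return None                            # discarded from further investigation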

In order to study the effect of document length and impurity on the performance of our analysis chain, four disjoint collections are compiled. For this purpose two levels of document length are introduced (short versus long) and combined with two levels of impurity (light versus strong). Short documents consist of less than 250,000 characters, which corresponds to approximately 40,000 words. The impurity θ of a document is defined as the portion of plagiarized characters, i.e., characters that belong to an inserted section. A document is considered to have a light impurity if θ ≤ 0.15; it has a strong impurity if θ > 0.15. Finally, the fraction of plagiarized documents per collection is set to 50%. The resulting test collections exhibit varying degrees of difficulty, both in terms of training data scarcity (document length) and class imbalance (impurity). We number the collections according to their level of difficulty and show selected summary statistics in Table 3.

Table 3 Selected summary statistics of the four test collections

3.2 Performance of outlier identification

Outlier identification is addressed with the density estimation method described in Sect. 2.4. To capture a broad range of writing styles a diverse set of stylometric features is employed, belonging to three of the four categories introduced in Sect. 2.3: lexical character features, lexical word features, and syntactic features. Among the employed stylometric features are the classical measures for vocabulary richness and text complexity, as well as stylometric features that have been reported to be particularly discriminative for authorship analysis, such as character n-grams and the frequency of function words (see Table 2). To capture syntactic variations in writing style, part-of-speech information in the form of part-of-speech trigrams is exploited; the tagging is done with the probabilistic part-of-speech tagger QTAG.

Table 4 shows the top 30 stylometric features with respect to their discriminative power; the F-Measure values pertain to the outlier class and are computed as micro-averaged means over the four collections. The decision whether or not a section is classified as an outlier is given by the maximum a-posteriori hypothesis of the univariate model in Eq. 1. Note that this ranking serves merely for illustration purposes and is not used for feature selection: the outlier analysis in the analysis chain is based on the multivariate use of all stylometric features. For each document in a collection an individual style classifier according to Eq. 2 is constructed and applied to each section of that document. The correctness of each classification decision is pooled over all documents. Table 5 summarizes the achieved classification results in terms of micro-averaged F-Measure for both the outlier class and the target class.

Table 4 Stylometric features ranked by their F-Measure performance in a style outlier detection task. The classification decision is given by the maximum a-posteriori hypothesis from Eq. 1
Table 5 Performance of the one-class classifier. The target class relates to sections of author A; the outlier class relates to sections of foreign authors \(B\not=A\)

Recall that the four collections are compiled in a way that sections with less than 50% plagiarism are discarded. If all sections with less than 90% plagiarism are discarded, the precision of the outlier class is unchanged, but its recall increases by 9% on average over all collections. On the other hand, if sections with less than 50% plagiarism are kept, the precision and the recall of the outlier class decrease by 4% on average.

3.3 Performance of meta learning

To illustrate the performance of the unmasking approach we evaluate the meta learner that is used in Step 4 of the unmasking procedure. Unmasking is parameterized as follows: documents are represented under the term frequency vector space model, defined by the 500 most frequent words of the input document sets, without stemming or stop word removal. In each of 30 unmasking iterations the 10 best features according to the information gain heuristic are removed, and the classification accuracy, acc i , of a linear kernel SVM is computed, based on fivefold cross-validation.

In practice the distribution of the outlier and target class is extremely unbalanced. In order to correct this class imbalance, the outlier class is over-sampled. Here, the SMOTE approach is used to create new, synthetic instances of the outlier class by interpolating between the original instances (Chawla et al. 2002). A meta learner is trained with vectors each of which comprises the following elements: the acc-values of iteration i, the Δ-acc-values to iteration i − 1, the Δ-acc-values to iteration i − 2, and a class label “plagiarized” or “non-plagiarized”. This meta learner is also realized as a linear kernel SVM; Table 6 reports on its performance.
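One plausible construction of these training vectors, read as one flattened vector per unmasking curve; whether the curve is encoded in exactly this way is an assumption.

    def meta_vector(acc):
        # acc: accuracy curve acc_1, ..., acc_k (a list) of one unmasking run.
        d1 = [acc[i] - acc[i - 1] for i in range(1, len(acc))]   # deltas to i-1
        d2 = [acc[i] - acc[i - 2] for i in range(2, len(acc))]   # deltas to i-2
        return acc + d1 + d2   # the class label is appended during training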

Table 6 Evaluation of the unmasking meta learner

The unmasking approach of Koppel and Schler decides for two sets of documents whether or not all documents stem from a single author. If both sets belong to the same author the associated unmasking curve drops away (cf. the dashed green lines in Fig. 3). This fact is exploited within our analysis chain in order to reduce the number of misclassified non-plagiarized documents, which are caused by the insufficient precision of the one-class classifier.

3.4 Performance of the analysis chain

We evaluate three strategies, from naive to sophisticated, to solve AVfind for a document d. Under the minimum risk strategy d is classified as plagiarized if at least one style outlier has been announced for d. Under the heuristic voting strategy d is classified as plagiarized if the detected fraction of outlier text is above a threshold τ. Under the unmasking strategy d is classified as plagiarized if the detected fraction of outlier text is above an upper threshold τ u ; d is classified as non-plagiarized if the detected fraction of outlier text is below a lower threshold τ l ; in all other cases unmasking is applied. Note that the values for τ, τ u , and τ l are collection-dependent. In our experiments τ and τ l are fitted to the averaged impurities of the collections, while τ u is chosen optimistically high. Table 7 summarizes the results: the minimum risk strategy classifies all documents as plagiarized because of the imprecision of the outlier detection, which claims at least one section in each document as an outlier. Heuristic voting and unmasking consider the outlier detection characteristic. A main observation is that especially unmasking can be used to substantially increase the precision when solving instances of AVfind.
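A sketch of the three decision strategies; the threshold values are placeholders since τ, τ u , and τ l are collection-dependent.

    def classify_document(theta_prime, strategy, tau=0.10, tau_l=0.05, tau_u=0.40,
                          unmask=None):
        # theta_prime: fraction of sections of d that are classified as outliers.
        if strategy == 'minimum_risk':
            return theta_prime > 0.0    # a single outlier section suffices
        if strategy == 'heuristic_voting':
            return theta_prime > tau
        if strategy == 'unmasking':
            if theta_prime >= tau_u:
                return True
            if theta_prime <= tau_l:
                return False
            return unmask()             # meta learning inside the uncertainty interval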

Table 7 Overall performance of the analysis chain. Performance of the solution of the AVfind problem under different strategies: minimum risk (columns 2–4), heuristic voting (columns 5–8), and unmasking (columns 9–12). Maximum precision values are shown in bold

4 Summary

Intrinsic plagiarism detection is the spotting of sections with undeclared writing style changes in a text document. It is a one-class classification problem that cannot be tackled with a single technique but requires the combination of algorithmic and statistical building blocks. Our article provides an overview of these building blocks and presents ideas to operationalize analysis chains that cope with the intrinsic plagiarism challenge.

Intrinsic plagiarism detection and authorship verification are two sides of the same coin. This fact is explained in this article, and, in order to organize existing research and to work out the intricate differences between problem variants, we introduce four problem classes for authorship verification problems. We propose and implement an analysis chain that integrates document chunking, style model computation, style outlier identification, and outlier post-processing. Style outlier identification is unreliable, among other reasons because it is difficult to quantify style and to spot style changes in short sections. Since we feel that plagiarism detection technology should avoid the announcement of wrongly claimed plagiarism at all costs, we propose to post-process the results of the outlier identification step. We employ the unmasking technology for this purpose, which has been developed to settle the authorship for a text in question—if sufficient sample text is at one’s disposal. The combination of outlier identification with unmasking entails a significant improvement of the precision (see Table 7 for details). However, we see room to improve several building blocks in the overall picture, among others: knowledge-based chunking, better style models, multivariate one-class classification, and bootstrapping for outlier identification.