1 Introduction

In the last decade, an interdisciplinary field of research has been actively developing at the intersection of philosophy, psycholinguistics and computational linguistics. Its purpose is to create models of argumentation for various types and genres of discourse and automatically identify and extract argument components and structure including premises and conclusions, and the relations between them based on typical argumentation schemes. The main prerequisite for the development of this area is the creation of annotated corpora, in which textual fragments are matched with components of argumentative structures and relations between them.

So far, only a few resources with annotated argumentation structures over monologue texts exist, mainly for the English language. The best known is AIFdb, the former Araucaria corpus [1], which contains news articles and records of parliamentary and political online debates. Resources have also been created for German: the University of Darmstadt corpus includes subcorpora of student essays [2], news texts and scientific articles; the Potsdam corpus contains a small set of microtexts on a given topic, later translated into English [3]. There are projects for some other languages (Italian, Greek, Chinese). As for the Russian language, such resources, as far as we know, do not yet exist. In most cases, corpus annotation includes text segmentation with highlighting of argumentation units, markup of roles (premise, conclusion) and relations (support, attack), without matching the argumentation schemes on which the reasoning is based. An exception is Araucaria, where the annotation of argumentative structure is related to a particular argumentation scheme based on the theory of Walton [4].

The present work was performed as part of an ongoing research project aimed at creating an argumentation-annotated corpus for the Russian language. We study popular science discourse, which is not represented in the well-known argumentatively annotated corpora. Popular science discourse is defined as a way of transmitting scientific knowledge or innovation projects by the author-scientist (or a journalist as an intermediary) so that they can be understood by a mass audience. The corpus of popular science online articles on linguistic topics was selected with the help of the catalogs of the Russian search engines Yandex and Rambler. The corpus includes about 70 texts with an average length of 1057 words (minimum 167 words, maximum 4094 words), with no restrictions on the subject, structure, or type of presentation. Some articles are transcripts of oral presentations, interviews, etc.

The texts are annotated manually based on the argumentation model developed by the project participants. An important linguistic aspect of argument annotation is the registration of argumentative indicators, which constitute keystones in the discourse, facilitating the identification and reconstruction of argumentative moves that are made in argumentative discussions and texts (see [5]). Argumentative indicators are language means (words, constructions) that serve as discourse clues in identifying the structure of argumentation: they help determine the presence of arguments in a given segment of text, reconstruct the connections between statements, and relate the argument to a specific reasoning pattern (an inference form expressing the relations of premises and conclusions).

The purpose of this study is to create a lexicon of argumentative indicators used in popular science discourse. The work outlines the preliminary results of the analysis of argumentative indicators selected in the corpus of popular science articles. The questions of their classification, structural features and methods of formal representation are discussed.

2 Related Works

Discourse markers (discourse connectives) are usually considered key indicators of discourse structure. They have been studied from various research perspectives. One of them is represented in the Penn Discourse Treebank (PDTB), where discourse connectives are viewed as binary predicates that convey certain semantic relations and take propositions, events and states as their arguments [6]. PDTB annotation covers traditional functional words and phrases such as subordinating conjunctions (e.g. when, because, as soon as), coordinating conjunctions (and, but, or), adverbs (e.g. instead, therefore), prepositional phrases (e.g. on the other hand), etc.

T. van Dijk proposed classifying discourse connectives according to the type of relation they label: pragmatic connectives express the relation between speech acts, while semantic connectives manifest the relations between the facts indicated in the text [7]. This difference corresponds to the opposition of subject matter and presentational relations in Rhetorical Structure Theory [8]. Presentational rhetorical relations, whose intended effect is to increase some inclination in the reader, such as the desire to act or the degree of positive regard for, belief in, or acceptance of the nucleus, overlap with argumentative discourse relations. The mapping of rhetorical discourse relations onto argumentative relations carried out in [9] confirms this pragmatic similarity. It is therefore unsurprising that the first experiments in argumentation mining used the traditional functional lexicons as lexical indicators.

Stab and Gurevych [10] experimented with different types of features, including discourse markers from the PDTB annotation guidelines, to classify text units into the classes non-argumentative, major claim, claim, and premise. The PDTB markers did not help to discriminate between argumentative and non-argumentative text units, but they were useful for distinguishing between the classes premise and claim. Eckle-Kohler et al. [11] present a study on the role of discourse markers in argumentative discourse based on a German corpus, with arguments annotated according to the common claim-premise model of argumentation. They performed various statistical analyses regarding the discriminative nature of discourse markers for claims and premises. The experiments show that particular semantic groups of discourse markers are indicative of either claims or premises and constitute highly predictive features for discriminating between them.

The investigation of discourse relation signals given in [12] is more extensive, as it takes into account not only traditional discourse markers (e.g., although, because, since, thus), but also signals such as tense, lexical chains or punctuation, and their combinations. The authors of the project to create a corpus of rhetorical structures for the Russian language also consider a wide class of language expressions, including lexical items, irrespective of their part of speech, that can signal the presence of a rhetorical relation. Toldova et al. [13] do not restrict rhetorical relation markers to functional words: the markers also include punctuation marks, prepositions, pronouns, speech verbs, etc. Developing this approach on the example of causal relation indicators, [14] shows that, in addition to traditional functional words, relation indicators include constructions based on content words, and provides informal specifications of some patterns that can be used for mining indicators in non-annotated text.

With regard to the indicators of the argumentation, the possibility of considering a wide class of language expressions that signal the use of specific reasoning schemes is demonstrated in the theoretical study [5], which also goes far beyond the functional classes of words. Considering the indicators of argumentation by analogy, the authors cite as an example constructions with significant words meaning analogy, comparison, similarity, and parallelism: X can be compared to Z; X is similar to Z; X is the equivalent of Z; there are parallels (to be drawn) between X and Z; X reminds someone of Z.

3 Information Model of Argument Annotation

An argument is a set of related statements used to prove a final statement (thesis, or conclusion). The structure of the argument highlights the statement-premise and the statement-conclusion connected by typed relations.

The structure of the argument can be represented as follows:

  • Argument = (Premise, Premise, …, Conclusion, Weight)

  • Conclusion = (Statement | Argument, Support | Attack, Weight)

  • Premise = (Statement, Role, Weight)

  • Statement = (Utterance, Source *, impl. | expl.)

The type of argumentation relation expresses whether a given argument is evidence (Support) or refutation (Attack) of a thesis-conclusion. The conclusion can be either an explicitly expressed statement or some other argument. Related statements may serve as premises, where each premise plays a specific Role in a typical reasoning scheme.

A statement represents a natural language formulated proposition (Utterance), which the annotator (expert) associates with the Source that is a text fragment. Usually the statement coincides with the source, except for the existing anaphoric references and ellipsis recovered by the annotator from the context. Thus, a statement is an interpretation of a text fragment. However, it is possible that the necessary statement-premise is not explicitly specified in the text. In the case of implied premise, its statement can be formulated by the expert on the basis of extratextual knowledge.

All elements in the structure of the argument are supplied with Weight – a measure of the persuasiveness of the proof given, which allows us to ultimately assess the strength of the author’s argument as a whole.
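The tuple notation above can be read as a small set of record types. The following Python sketch is one possible reading, not the project's actual data format: the class and field names, the role labels, and the 0.0 to 1.0 weight scale are all assumptions made for illustration, and the example values are the English translations of the «amour» argument discussed later in this section.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Union


class Relation(Enum):
    SUPPORT = "support"
    ATTACK = "attack"


@dataclass
class Statement:
    utterance: str                                    # proposition formulated by the annotator
    sources: List[str] = field(default_factory=list)  # Source*: text fragments, possibly none
    explicit: bool = True                             # expl. vs. impl.


@dataclass
class Premise:
    statement: Statement
    role: str      # role in the reasoning scheme (labels below are hypothetical)
    weight: float  # assumed scale: 0.0 to 1.0


@dataclass
class Conclusion:
    target: Union[Statement, "Argument"]  # a statement or another argument
    relation: Relation
    weight: float


@dataclass
class Argument:
    premises: List[Premise]
    conclusion: Conclusion
    weight: float


def from_fragment(text: str) -> Statement:
    """Explicit statement that coincides with its source fragment."""
    return Statement(utterance=text, sources=[text])


# Hypothetical weights and role labels; only the texts come from the paper.
amour = Argument(
    premises=[
        Premise(from_fragment("The sound combination 'mr' in the Indo-European "
                              "proto-language corresponded to everything "
                              "connected with death."), "premise_1", 0.5),
        Premise(from_fragment("The sound 'a' is still used as an opposition "
                              "in many languages."), "premise_2", 0.5),
    ],
    conclusion=Conclusion(
        from_fragment("«amour» is the opposition of death, that is, life!"),
        Relation.SUPPORT, 0.5),
    weight=0.5,
)
```

An implied premise would be created as `Statement(utterance=..., explicit=False)` with an empty `sources` list, since it has no text fragment of its own.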

The given argument representation model corresponds to the AIF model [15], which is currently accepted as a standard in analyzing argumentative structures and, in particular, is used in the Carneades system [16]. Since this study focuses on the different types of indicators used in texts to introduce arguments and their structural components, the argument model was supplied with additional parameters for annotating the argumentation indicators in the text.

Indicator = (Source, Type, Definition, Frequency)

On discovering an indicator, the expert marks up the corresponding text fragment (Source) and points out which pragmatic aspect (Type) of the argument is signaled by the indicator. Based on the analysis of the selected fragment, the structural (grammatical) type of the indicator is determined and its lexical-syntactic Definition is formed, which allows automatic search for the indicator in the texts. The Frequency parameter determines how discriminative this indicator is for the selected aspect of the argument. Frequency in the annotated text corpus is calculated automatically.
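As an illustration only, the Indicator record could be held in a structure like the one below. The paper does not define how Frequency is computed, so this sketch assumes one plausible reading: the share of corpus matches of the Definition that actually signal the annotated aspect. The field names `occurrences` and `confirmed` are hypothetical; the counts 7 and 6 are those reported in Sect. 5.2 for the direct speech construction.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    source: str           # annotated text fragment
    type: str             # pragmatic aspect signaled by the indicator
    definition: str       # lexical-syntactic pattern used for search
    occurrences: int = 0  # matches of the definition in the corpus (assumed field)
    confirmed: int = 0    # matches that signal the aspect (assumed field)

    @property
    def frequency(self) -> float:
        """Assumed measure: confirmed matches divided by all matches."""
        if self.occurrences == 0:
            return 0.0
        return self.confirmed / self.occurrences


direct_speech = Indicator(
    source="« … » .. adds .. researcher",
    type="From the Expert",
    definition="DSC1",
    occurrences=7,
    confirmed=6,
)
print(round(direct_speech.frequency, 3))
```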

Additionally, the markup system implements the requirement of maximum “similarity” between the statement and the source. To this end, the following recommendations were developed for experts who carry out annotation of argumentation.

When annotating an Argument, text fragments corresponding to the explicitly presented statements are marked up first. Each fragment can be a chain of sentences, a single sentence, clause or nominalization. Every fragment is regarded as if all its anaphoric references (including ellipses) were resolved. In case of anaphoric nominalization of a whole statement within the Argument, an antecedent statement is marked up. Then, a suitable type of reasoning scheme (argumentation scheme) is chosen, the selected statements are linked into a single Argument, and the necessary parameters of the premises and a conclusion are indicated in accordance with the specified scheme. If necessary, implicit statements are introduced.

Consider an example of an Argument marked up in the text:

(in Russian) Пo-фpaнцyзcки любoвь - amour, чтo тoжe имeeт тaйный cмыcл. [Звyкocoчeтaниe “mr” в индoeвpoпeйcкoм пpaязыкe cooтвeтcтвoвaлo вceмy, чтo cвязaнo co cмepтью.] [Звyк ‘a’ дo cиx пop вo мнoгиx языкax yпoтpeбляeтcя кaк пpoтивoпocтaвлeниe.] Пoэтoмy [«amour» - пpoтивoпocтaвлeниe cмepти, тo ecть жизнь!]//text 21

In French love - amour, which also has a secret meaning. [The sound combination “mr” in the Indo-European proto-language corresponded to everything connected with death.] [The sound ‘a’ is still used as an opposition in many languages.] Therefore [«amour» is the opposition of death, that is, life!]

In this example, the Argument consists of two premises and a conclusion. The word пoэтoмy ‘therefore’ is an indicator of the conclusion of the Argument and of the entire inference relation.

Note that the Argument does not always correspond to a continuous text fragment: between the conclusion and the premise there may be discourse units that are not related to this Argument (for example, Premise that supports the same Conclusion independently within another Argument), or irrelevant for argumentation (for example, explanations).

4 Classification of Argumentation Indicators

Indicators of argumentation can be classified from different points of view: the pragmatic aspects of argumentation, the degree of grammaticalization, the semantics of the indicator’s core word, the type of construction.

1. Pragmatic aspects of argumentation signaled by the indicator.

    • opinion and strength of the argument (degree of confidence);

    • inference relation between two statements;

    • role of the statement in the inference relation (Premise vs. Conclusion);

    • type of argumentative relation (Support vs. Attack);

    • structure of the argumentation (Multiple vs. Serial argumentation);

    • semantic-ontological relation which the typical reasoning scheme used in this case is based on.

In the following examples (1) and (2), the indicators пo-видимoмy ‘seemingly’ and cпeциaлиcты пpeдпoлaгaют, чтo ‘experts suggest that’ present statements of the premise (2) and conclusion (1) as opinions with a certain weight. Indicators пocкoлькy ‘since’ and пoэтoмy ‘therefore’ with causal semantics explicitly indicate the presence of a relation of inference. In this case, the position of the marker in the segment indicates the role of the corresponding statement: пocкoлькy introduces the Premise in (1), and пoэтoмy introduces the Conclusion in (2). In both cases, the type of relation is Support. In (3) and (4), the indicators are based on predicates with the semantics of mental impact, oпpoвepгaть ‘refute’ and пoдтвepждeниe ‘confirmation’; here the distribution of roles in the inference move is identified by the actant position.

(1) Пocкoлькy [в языкax cибиpcкиx нapoдoв вce eщe coxpaнилacь чeткaя cвязь c индeйcкими нapeчиями], cпeциaлиcты пpeдпoлaгaют, чтo [мнoгиe мигpaнты вoзвpaщaлиcь из Aмepики нaзaд, в Cибиpь.]//text 02

Since [in the languages of the Siberian peoples there is still a clear connection with Indian dialects], experts suggest that [many migrants returned from America back to Siberia.]

In example (1), the conclusion states the opinion of specialists and is marked by an indicator of opinion. The mental predicate conveys a relatively low degree of confidence, so the weight of the conclusion is not very high. The conclusion is supported by the premise, which is marked by the indicator of the basis of the conclusion.

(2) [Ocoзнaниe cвoeй идeнтичнocти, в тoм чиcлe и языкoвoй, пo-видимoмy, являeтcя вaжным кoмпoнeнтoм дyшeвнoгo paвнoвecия.] Имeннo пoэтoмy [вceгдa нaxoдятcя тe, ктo нaпepeкop coвpeмeнным тeндeнциям, a тo и инcтинктy caмocoxpaнeния пoддepживaeт и coxpaняeт языки.] Teм бoлee чтo [знaниe poднoгo языкa coвepшeннo нe oзнaчaeт oткaзa oт дpyгиx, бoлee вocтpeбoвaнныx.]//text 68.

[Awareness of one’s identity, including linguistic identity, is probably an important component of mental equilibrium.] Just for that reason [there are always those who, contrary to modern trends and even to the instinct of self-preservation, maintain and preserve languages.] All the more so that [knowledge of the mother language does not mean refusal to speak other, more popular ones.]

In the example (2), two arguments are shown that prove the same thesis independently of each other, while the indicator тeм бoлee чтo ‘all the more so that’ marks the second premise in the structure of Multiple argumentation.

(3) Haпpимep, пoгoвapивaют, чтo [pyccкиx нayчили мaтepитьcя тaтapы и мoнгoлы, a дo игa, якoбы, нe знaли нa Pycи ни oднoгo pyгaтeльcтвa.] Oднaкo ecть нecкoлькo фaктoв, oпpoвepгaющиx этo. Bo-пepвыx, [y кoчeвникoв нe былo oбычaя cквepнocлoвить.]//text 29.

For example, they say that [the Tatars and the Mongols taught Russians how to swear and before the yoke, allegedly, they did not know a single curse in Russia.] However, there are several facts that refute this. First, [nomads didn’t have the habit of foul language.]

(4) Bo-пepвыx, [y кoчeвникoв нe былo oбычaя cквepнocлoвить.] B пoдтвepждeниe этoмy [зaпиcи итaльянcкoгo пyтeшecтвeнникa Плaнo Кapпини, пoceтившeгo цeнтpaльнyю aзию. Oн oтмeчaл, чтo y ниx бpaнныe cлoвa вooбщe oтcyтcтвyют в cлoвape.]//text 29.

First, [the nomads did not have the habit of foul language.] In confirmation of this — [the records of the Italian traveler Plano Carpini, who visited Central Asia. He noted that swear words were absent in their lexicon.]

Examples (3) and (4) demonstrate Serial argumentation. In (3) an opinion is refuted by the following premise (Attack relation), and in (4) this premise is supported by the reasoning corresponding to the typical scheme “From the Knower”: the subject makes a statement relating to the domain he is familiar with - therefore, this statement is true.
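The way a primary connective fixes the role of the statement it introduces can be illustrated with a small lookup. This is a minimal sketch, not the annotation tool: the table contains only the three connectives whose roles are stated in the discussion of examples (1) and (2), English glosses stand in for the Russian forms, and the function name and matching rule (connective at the very start of the segment) are assumptions.

```python
import re
from typing import Optional

# English glosses of connectives discussed in examples (1) and (2).
CONNECTIVE_ROLES = {
    "since": "premise",                 # пocкoлькy introduces the Premise
    "therefore": "conclusion",          # пoэтoмy introduces the Conclusion
    "all the more so that": "premise",  # second premise in Multiple argumentation
}


def role_of_segment(segment: str) -> Optional[str]:
    """Return the role signaled by a connective that opens the segment."""
    text = segment.strip().lower()
    # Try longer connectives first so multi-word units are not shadowed.
    for connective in sorted(CONNECTIVE_ROLES, key=len, reverse=True):
        if re.match(re.escape(connective) + r"\b", text):
            return CONNECTIVE_ROLES[connective]
    return None


print(role_of_segment("Since in the languages of the Siberian peoples ..."))
print(role_of_segment("Therefore «amour» is the opposition of death"))
```

A segment without an opening connective returns `None`, which matches the observation that secondary indicators such as ‘refute’ or ‘in confirmation of this’ assign roles by actant position rather than by the position of a marker.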

2. Primary and secondary indicators.

Toldova et al. in [14] proposed to divide the indicators of a causal rhetorical relation into two classes (primary vs. secondary) according to the degree of their grammaticalization: the primary connectors are functional words (including multi-word units) fixed in grammars and dictionaries, and the secondary ones are less studied constructions based on content lexemes of causal semantics. Examples from the corpus of popular science texts make it possible to draw similar conclusions regarding argumentation indicators. We consider two classes of language means used as indicators of argumentation:

  • discursive connectors are well-known functional units, including multi-word units (prepositions, conjunctions, introductory words): пoэтoмy ‘that is why’, пocкoлькy ‘since’, cлeдoвaтeльнo ‘consequently’, тaк кaк ‘as’, знaчит ‘hence’, тeм бoлee чтo ‘all the more so that’, нaпpимep ‘for example’, в чacтнocти ‘in particular’, etc.;

  • content words and indicator constructions including these words as their core components (see examples below).

3. Classification of indicators according to the semantics of the core content word.

Up to now the list of annotated content words which can serve as indicators or core words of indicator constructions is heterogeneous and far from complete. These words are mainly verbs and nouns of the following lexical-semantic classes:

  • mental state cчитaть ‘to believe’, пpeдпoлaгaть ‘to suppose’, yбeждeн ‘be convinced’, мнeниe ‘opinion’, тoчкa зpeния ‘viewpoint’;

  • mental impact дoкaзывaть ‘to prove’, oпpoвepгaть ‘to refute’, пoдтвepждaть ‘to confirm’, cвидeтeльcтвoвaть ‘to indicate’;

  • inference cлeдoвaть ‘to follow/result’, пoлyчaeтcя ‘it follows that’, выxoдит ‘it follows that’, выxoдить ‘to follow/result’, пoлyчaтьcя ‘to follow/result’, вывoдить ‘to conclude/infer’, cлeдcтвиe ‘consequence’, вывoд ‘conclusion’;

  • conflict пpoтивopeчить ‘to contradict’, пpoтивopeчиe ‘controversy’;

  • intellectual activity oбнapyжить ‘to discover’, выяcнить ‘to find out’, выявить ‘to reveal’;

  • speech activity гoвopить ‘to talk’, cooбщaть ‘to report’, yтвepждaть ‘to state’;

  • justification apгyмeнт ‘argument’, дoкaзaтeльcтвo ‘proof’, oбocнoвaниe ‘basis’, cвидeтeльcтвo ‘evidence’, пoдтвepждeниe ‘confirmation’;

  • information фaкт ‘fact’, пpимep ‘example’;

  • intellectual product тeзиc ‘thesis’, гипoтeзa ‘hypothesis’, тeopия ‘theory’;

  • speech product cooбщeниe ‘message’, cлoвo ‘word’;

  • expert yчeный ‘scientist’, cпeциaлиcт ‘specialist’, лингвиcт ‘linguist’, филocoф ‘philosopher’.

4. Types of constructions for secondary indicators.

Complex indicators of argumentation are formed on the basis of speech and mental predicates, predicates of inference and predicates of mental impact. In addition to the core word, they can include markers of actant positions, for example, the conjunction чтo ‘that’ and the correlative pronoun construction тo, чтo ‘the fact that’ for sentential actants, as well as anaphoric and cataphoric elements such as the demonstrative pronoun этo/этoт ‘this’, the adverb oтcюдa ‘hence’, and the relative pronoun чтo ‘what’. Examples of the constructions under consideration are as follows:

  • constructions with verbs of inference and mental impact

    • из…cлeдyeт, чтo ‘from… it follows that’

    • этo…дoкaзывaeт, чтo ‘this… proves that’

    • эти…cвидeтeльcтвyют o тoм, чтo ‘these… indicate that’

  • verbal constructions of direct or indirect speech or opinion with the speech or mental verb and the «expert» class word in the subject position

    • yчeныe… yтвepждaют: “…” ‘scientists…assert: “…” ’

    • литepaтop… зaмeтил, чтo ‘literary scholar…noted that’

  • light verb constructions with nouns

    • пpимepoм…являeтcя ‘example …is’

    • apгyмeнт был тaкoй ‘argument…was as follows’

    • пpивoдит… apгyмeнт в пoльзy этoгo, чтo ‘give an argument in favour of this’

    • oтcюдa … cдeлaн… вывoд o тoм, чтo ‘come to a conclusion that’

  • prepositional noun phrases

    • в пoдтвepждeниe этoмy ‘in confirmation of this’

    • нa этoм/тaкoм ocнoвaнии ‘on this/that basis’

    • нa cлeдyющeм ocнoвaнии ‘on the following ground’

    • пo мнeнию/cлoвaм ‘according to smb’

5 Technological Aspects of Building a Lexicon of Argumentation Indicators

To support the development of a lexicon of indicators, it is essential to provide the researcher with the necessary automation tools. Fig. 1 presents the main stages of the process of creating and researching indicators.

Fig. 1. The main stages of the development of the lexicon of indicators (blocks with a light background correspond to fully automatic procedures, blocks with a dark background represent procedures carried out by an expert).

It is assumed that the process of argument annotation is accompanied by marking up the argumentation indicators found by the annotator. After a text fragment associated with the indicator is selected, a formal description of the indicator is automatically generated and added to the lexicon. This description is presented to the expert for validation and correction. Automated procedures carried out by the expert are supported by the appropriate software components.

Consider this process in more detail.

  1. Selection of a text fragment corresponding to the indicator occurs together with annotation of the argument and its components. Analysis of the structure of the argument and analysis of the role of the indicator within this structure complement each other and facilitate the annotator’s work. The indicator annotation involves specification of the fragment boundaries (possibly with gaps) and selection of the argumentation aspect(s) signaled by the indicator.

  2. Based on the selected fragment, it is necessary to specify a formal representation of the indicator in order to ensure automatic search for the indicator in the text, taking into account the variability of its presentation. At this stage, the text fragment is divided into elementary components (graphematic analysis), words are lemmatized, and word combinations (phrases) are generated and normalized.

  3. As the examples in Sect. 4 show, indicators are not only lexical units (single- or multi-word units), but also constructions, which can be formally represented by means of lexical-grammatical patterns. The automatically generated pattern accounts for the lexical composition of the construction (lexical units in the normalized form), punctuation marks, gaps, and the boundaries of the indicator.

  4. At the next stage, the obtained formal description is matched against the corpus and the search results are displayed in the form of a concordance. Based on the study of the indicator’s occurrences, the expert concludes whether the formal description is correct.

  5. The expert can correct the indicator description as appropriate: generalize individual lexical units to lexical-semantic classes, resolve ambiguities, specify grammatical features of words and phrases within structures (to ensure coordination or government), create lists of alternatives and indicate the boundaries of the construction.

  6. The resulting lexical units and patterns approved by the expert are supplied with the necessary grammatical and argumentative features and introduced into the information retrieval lexicon, which provides search and automatic annotation of indicators in the texts of the corpus. This, on the one hand, removes the need to re-annotate indicators manually, and, on the other hand, signals the possible presence of argumentation in unannotated texts or the need to refine previously marked up arguments.

5.1 Indicator Pattern Generation

The analysis of text fragments marked up as indicators is carried out using the Klan system [17]. Extraction of lexical units from a text fragment is not as straightforward a task as it might seem. The paper [18] describes the problems that arise and gives a linguistic classification of errors. Most of the errors in the extraction of lexical units are related to the ambiguity and/or incorrect prediction of single words and to the incorrect and/or incomplete construction of word combinations.

The process of indicator pattern generation includes the following steps:

  a. graphematic analysis, which provides for tokenization and selection of non-textual elements (numerical data, symbols, etc.),

  b. lexical and morphological analysis (lemmatization, determination of lexical and grammatical features, paradigm representation, normalization),

  c. identification of word combinations (based on predefined grammatical models and normalization),

  d. generation of template(s) with a simple structure in the form of a chain of lexical units and punctuation marks:

     тaк, нaпpимep ‘thus, for example’: [тaк, s/, , нaпpимep]

  e. for discontinuous fragments, introduction of structural constraints into the pattern description (distant context and pattern boundaries):

     ecли …, тo ‘if…then’: [begin: ecли, s/, , end: тo]

  f. analysis of pattern composition and ascription of grammatical features (for example, if the form of the indicator is fixed during annotation):

     в пoдтвepждeниe ‘in confirmation’: [в, пoдтвepждeниe <acc, nom, sing>]

  g. analysis of the set of patterns and specification of the formal description of the pattern using compression procedures, such as introduction of alternatives, inclusion of references to other patterns, generalization and combination of patterns:

     этo ‘it’ or этoт ‘this’: [этo | этoт]

Thus, several types of structural organization of the formal description of indicators and their components can be distinguished.

Indicators with a simple structure include single- or multi-word functional and content units (inference predicates, speech and mental predicates, etc.).

Complex constructs described using patterns include simple chains (a chain of lexemes and punctuation marks), chains with grammatical constraints (prepositional phrases, verbal constructions, etc.) and discontinuous constructions.
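The two simplest cases, a plain chain and a discontinuous construction, can be sketched as follows. This is only an approximation of steps a, d and e: it is not the Klan system, it performs no lemmatization or phrase detection (lowercasing stands in for normalization), and the names `generate_pattern` and `render` are hypothetical. English glosses are used in place of the Russian fragments, and the rendering merely imitates the bracket notation shown above.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

GAP = "…"
# A gap marker, a run of two or three dots, a word, or a single symbol.
TOKEN_RE = re.compile(r"…|\.{2,3}|\w+|[^\w\s]")


@dataclass
class Pattern:
    elements: List[Tuple[str, str]]  # ("w", word) or ("s", symbol)
    discontinuous: bool              # True if the fragment contained a gap


def generate_pattern(fragment: str) -> Pattern:
    """Turn an annotated fragment into a chain of words and symbols."""
    elements: List[Tuple[str, str]] = []
    discontinuous = False
    for token in TOKEN_RE.findall(fragment):
        if token in (GAP, "..", "..."):
            discontinuous = True          # a gap adds no element of its own
        elif re.fullmatch(r"\w+", token):
            elements.append(("w", token.lower()))
        else:
            elements.append(("s", token))
    return Pattern(elements, discontinuous)


def render(pattern: Pattern) -> str:
    """Write the pattern in a bracket notation similar to the paper's."""
    parts = [text if kind == "w" else f"s/{text} "
             for kind, text in pattern.elements]
    if pattern.discontinuous and len(parts) >= 2:
        parts[0] = "begin: " + parts[0]
        parts[-1] = "end: " + parts[-1]
    return "[" + ", ".join(parts).rstrip() + "]"


print(render(generate_pattern("thus, for example")))
print(render(generate_pattern("if …, then")))
```

Keeping the pattern as structured elements rather than a string leaves room for the later steps (grammatical features, alternatives, class references) to be attached per element.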

Among the indicators with complex structural organization are the following:

  • constructs combining distant context and grammatical constraints, for example, prepositional noun phrases

    [figure a]

    including auxiliary constructs with imposed grammatical constraints

    [figure b]

  • constructs with elements defined by their lexical-semantic classes:

    [figure c]

  • constructs with multiple gaps (distant contexts):

    [figure d]

Automatic template generation does not always yield a correct and complete description of the indicator in accordance with the annotated fragment. The same goes for the identification of lexical-semantic classes in the case of generalization. The formal representation of the indicator therefore has to be manually corrected and adjusted on the basis of an examination of its contexts and its use in various types of arguments.

5.2 Analysis of Indicator Structure

The traditional tool for the study of linguistic phenomena is concordance, which displays a listing of immediate and extended contexts of lexical units in the text corpus. The advanced implementation of searching and concordancing, in addition to lexical units, provides contexts of pattern descriptions, with support for output filtering in accordance with specified criteria (for example, argumentation features). This functionality greatly increases the possibilities for research.
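A keyword-in-context listing of the basic kind described here can be sketched in a few lines. This is a minimal illustration, not the project's concordancer: it searches for single lowercase keywords only, without pattern descriptions or filtering by argumentation features, and the function name and `width` parameter are assumptions. The sample sentence is a shortened English rendering of example (1).

```python
from typing import List, Sequence, Set, Tuple


def concordance(tokens: Sequence[str], keywords: Set[str],
                width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() in keywords:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


sentence = ("Since the languages still show a clear connection "
            "experts suggest that many migrants returned").split()

for left, key, right in concordance(sentence, {"since", "suggest"}, width=2):
    print(f"{left:>20} | {key} | {right}")
```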

The goal of indicator examination carried out by an expert is to ensure the accuracy of the generated formal descriptions, as well as to expand the lexicon by identifying and merging indicators similar in structure and generalizing lexical units to lexical-semantic classes. The pattern description language has the necessary capabilities, such as means for representing grammatical and semantic constraints, nested constructs, alternatives, and discontinuity.

Let us consider the process of developing indicator patterns on the example of the “From the Expert” reasoning scheme, which is commonly used in popular science texts.

[[«Дocтичь этoгo пoмoгaют глacныe»], - дoбaвляeт кaнaдcкий иccлeдoвaтeль Cэм Mэглиo (Sam Maglio)], oдин из aвтopoв нoвoй paбoты.//text 01

[[“Vowels help achieve this”], adds Canadian researcher Sam Maglio], one of the authors of the new work.

In this example, the construction of direct speech is used, with the speech predicate and the «expert» class word in the subject position. This construction is generally recognized as a sign of argumentation. Thus, the annotator marked up the following text fragment as an indicator:

« … » .. дoбaвляeт .. иccлeдoвaтeль ‘« … » .. adds .. researcher’

Based on this fragment, it is necessary to create a formal representation of the indicator. When generating a pattern, you can apply different strategies for forming its composition. For example, in this case the following pattern variants will be automatically generated:

  • presentation of exact wordform with the help of grammatical features:

x = [begin: «, end: »]

y1 = [begin: x, дoбaвлять<act,3pers,pres,sing>, end: иccлeдoвaтeль<nom, sing>]

  • presentation of all forms (normalization):

y2 = [begin: x, дoбaвлять, end: иccлeдoвaтeль]

  • generation by grammatical model:

y3 = [begin: x, дoбaвлять, end: иccлeдoвaтeль <nom>]

y4 = [begin: x, иccлeдoвaтeль<nom>, end: дoбaвлять]

  • specification of lexical-semantic class (with or without grammatical features):

y5 = [begin: x, w/<Sem: speech_activity>, end: w/<Sem: expert>], etc.

Determining the best strategy for each specific type of indicator is one of the objectives of the study.

To expand and generalize the lexical composition of the generated pattern, the expert performs the following steps:

  • considers the possibility of generalization of the core words by specifying their lexical-semantic classes,

  • creates auxiliary patterns with alternatives,

  • checks the generalization hypothesis using concordance,

  • corrects and validates the indicator by checking all its occurrences in the corpus.

There are more than 300 occurrences of the «expert» class words: иccлeдoвaтeль ‘researcher’ (40), yчeный ‘scientist’ (119), cпeциaлиcт ‘specialist’ (17), экcпepт ‘expert’ (6), лингвиcт ‘linguist’ (98), филoлoг ‘philologist’ (7), aнтpoпoлoг ‘anthropologist’ (2), apxeoлoг ‘archeologist’ (8), пpoфeccop ‘professor’ (15), физик ‘physicist’ (3), etc. The concordance listing shows that contexts of these words include the following lexical markers of argumentation: дoбaвлять ‘add’, пoяcнять ‘explain’, пpизнaвaть ‘admit’, oтмeчaть ‘note’, cooбщaть ‘report’, пoдытoживaть ‘summarize’, peзюмиpoвaть ‘sum up’, etc. These words were grouped into the lexical-semantic class «speech_activity» to be used in final patterns.

As a result of the correction carried out by the expert, there are patterns that describe a whole class of situations:

  • quote_l = [“|«]

  • quote_r = [”|»]

  • DS = [begin: quote_l, end: quote_r]

  • Expert = [w/<expert>] | [ph/<expert>]

  • DSC1 = [begin: DS, w/<speech><V, past|pres>, end: Expert<N, nom>]

  • DSC2 = [begin: Expert<N, nom>, w/<speech><V, past|pres>, end: DS]

Search in the corpus shows that the construction corresponding to this pattern appears 7 times and 6 of these occurrences indicate the presence of the “From the Expert” argumentation.
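To make the matching behaviour of such patterns concrete, the sketch below searches a lemma sequence for a discontinuous pattern whose elements are either literal tokens or references to lexical-semantic classes. It is a simplified stand-in for the actual pattern language: grammatical constraints such as <N, nom> or <V, past|pres> are ignored, DS is flattened into its two quotation marks, the text is assumed to be already lemmatized, and the `max_gap` window is an invented parameter. English glosses from the class lists above replace the Russian lemmas.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Small excerpts of the two classes, given as English glosses.
LEXICON: Dict[str, Set[str]] = {
    "expert": {"researcher", "scientist", "specialist", "linguist"},
    "speech": {"add", "explain", "note", "report"},
}


def element_matches(element: str, lemma: str) -> bool:
    """An element is a class reference like "<expert>" or a literal."""
    if element.startswith("<") and element.endswith(">"):
        return lemma in LEXICON.get(element[1:-1], set())
    return element == lemma


def find_discontinuous(pattern: Sequence[str], lemmas: Sequence[str],
                       max_gap: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of leftmost in-order matches."""
    matches = []
    for start, lemma in enumerate(lemmas):
        if not element_matches(pattern[0], lemma):
            continue
        pos, complete = start, True
        for element in pattern[1:]:
            window = range(pos + 1, min(len(lemmas), pos + 2 + max_gap))
            nxt = next((i for i in window
                        if element_matches(element, lemmas[i])), None)
            if nxt is None:
                complete = False
                break
            pos = nxt
        if complete:
            matches.append((start, pos))
    return matches


DSC1 = ["«", "»", "<speech>", "<expert>"]  # quote, speech verb, expert
DSC2 = ["<expert>", "<speech>", "«", "»"]  # expert, speech verb, quote

lemmas = ["«", "vowel", "help", "achieve", "this", "»", ",", "-",
          "add", "canadian", "researcher", "sam", "maglio"]

print(find_discontinuous(DSC1, lemmas))
print(find_discontinuous(DSC2, lemmas))
```

The greedy leftmost search is chosen for brevity; it can miss a match that a backtracking matcher would find, which is one reason the expert's check of every occurrence in the concordance remains necessary.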

Another example of a complex pattern corresponds to the indicator used in the “From the Sign” argumentation scheme. The pattern represents a construction with a verb of mental impact and an anaphoric element in the actant position.

Этo oткpытиe тaкжe дoкaзывaeт, чтo [пepeceлeниe нapoдoв из цeнтpaльнoй Aзии в ceвepнyю Aмepикy 13 000 лeт нaзaд, вoзмoжнo, былo нe oкoнчaтeльным.]//text 02

This discovery also proves that [the migration of peoples from Central Asia to North America 13,000 years ago may not have been final.]

  • to_chto = [s/, , чтo] | [тo, s/, , чтo] | [тoгo, s/, , чтo]

  • anaph_this = [этo | этoт | тaкoй]

  • Proof = [begin: anaph_this, w/<caus_ment><V, pres>, end: to_chto]

The above examples demonstrate the technique of developing formal descriptions of indicators, including automatic generation and manual correction procedures.

6 Conclusion

The paper presents the results of a preliminary analysis of the argumentation indicators observed in the process of annotation of popular science texts in Russian. Corpus examples show main pragmatic aspects of the argumentation signaled by discursive indicators. Along with pragmatic meaning, the classification of indicators takes into account the type of language means used. Special attention is paid to insufficiently studied indicator constructions and classes of their core content words. We consider constructions with verbs and nouns of mental state, speech, inference, and mental impact.

The argumentation indicators are presented in the form of lexical units and lexical-grammatical patterns, which are automatically generated from annotated text fragments and can be manually corrected by the expert. The lexicon of indicators is planned to be used for automatic annotation of argument indicators in unannotated text, as well as for experiments in argument mining.

The argumentative annotation of the popular science corpus is ongoing. Upon completion of the work, the pilot version of the annotated corpus will be made available in open access. We assume that in the future the scope of the research will expand and cover new classes of content words and the corresponding constructions. In particular, one can expect a significant expansion of the spectrum of indicators due to the semantic-ontological relations on which typical argumentation schemes are based.