1 Introduction

With the surge of user-generated and social media content, such as blogs, wikis, discussion forums, news comments and tweets, which are published non-commissioned by non-professionals (Vickery and Wunsch-Vincent 2007), social media outlets and networking sites are becoming a major source of knowledge and opinion, and are considered a catalyst of bottom-up communication practices that contribute towards the democratization of language. As a consequence, there is a growing need for a thorough multidisciplinary understanding of this type of communication, shaped by the specific communication goals as well as social and technical circumstances in which it is produced: rich in colloquialisms and foreign language elements, non-canonical spelling variants and syntax, idiosyncratic abbreviations and neologisms.

This highly participatory, interactive and multimodal communication is also accompanied by easily accessible and rich (sociodemographic) metadata, which opens a wide range of new and exciting research opportunities, not only in linguistics and natural language processing, but also in the digital humanities and social sciences, as well as bringing about new technical, linguistic and ethical challenges for scholars.

Despite relatively good coverage of reference (Logar et al. 2012; Erjavec et al. 2015) and specialized corpora (Erjavec 2015; Verdonik et al. 2013) for Slovene, none of them contain significant amounts of user-generated content (UGC); filling this gap was the goal of the Janes project presented in this paper. To make this possible, the project also had to develop UGC-specific tools and, in order to train them, manually annotated datasets. The resulting set of resources and tools was to be suitable both for HLT research and development and for linguistic investigations. To promote open science and to facilitate HLT development for Slovene, these resources should be openly accessible.

The paper presents the results of this undertaking and should be of interest both to those who would use the developed resources and tools and to those who would undertake similar projects for other languages. UGC in most languages is published on similar media, so similar issues and solutions can be expected in the data collection process. Furthermore, for processing the collected data, the machine learning paradigm exploited in the Janes project is easily transferable to other languages through (1) comparable manual annotation of training datasets and (2) training of the tools on those manually annotated datasets.

The paper is structured as follows: Sect. 2 gives a brief overview of UGC corpora and datasets for other languages; Sect. 3 details the Janes corpus, including its ingest, metadata, text annotation and encoding, size, and the public version of the corpus; Sect. 4 describes the annotation tools developed in the scope of the project, in particular the tokeniser adapted for non-standard text, word normaliser, tagger and lemmatiser, and named entity recogniser; Sect. 5 introduces the manually annotated datasets for training language technology tools and for linguistic investigations; and Sect. 6 gives conclusions and directions for future work.

2 Related corpora

Even though research on computer mediated communication (CMC) in natural language processing, corpus linguistics as well as social sciences has had a strong empirical focus from its very beginning, relatively few CMC corpora have been made available to the scientific community (Beißwenger and Storrer 2008). The three largest that are available for download are Suomi24 (Lagus et al. 2016), which contains 2.38 billion tokens from Finnish discussion forums; the German DEREKO-Wikipedia (Margaretha and Lüngen 2014), with 581 million tokens from article and user talk pages; and CoMeRe (Chanier et al. 2014), a French collection of corpora from e-mails, forums, chats, tweets and the French Wikipedia from various periods, comprising 80 million tokens.

For linguistic investigations, a whole order of magnitude smaller but carefully sampled and annotated corpora have been developed. The German Dortmund Chat Corpus (Beißwenger et al. 2015) comprises 1 million tokens of chat discourse, manually anonymised, annotated for selected CMC-specific phenomena and available through CLARIN-D for download. sms4science.ch (Dürscheid and Stark 2011) is a donation-based corpus of 650 thousand tokens of SMS messages written in German, French, Swiss German, Italian and Romansh from Switzerland, but available only for online querying, as is the DiDi corpus (Frey et al. 2016) that contains 570,000 tokens from Facebook posts written in German, Italian and South Tyrolean.

In addition to corpora, specialised training sets have been compiled to train NLP tools, such as sentiment analysis (Barbieri et al. 2016; Rei et al. 2016), named entity recognition (Derczynski et al. 2016; Rei et al. 2016), entity linking (Derczynski et al. 2015) and word-sense disambiguation (Johansson et al. 2016).

3 The Janes corpus

The Janes corpus contains five major types of publicly accessible UGC: tweets, forum posts, user comments from on-line news portals, blog posts along with their user comments, and talk and user pages from Wikipedia.Footnote 1 The corpus is limited to published user-generated content, which is why we deliberately excluded instant messaging applications and social networking sites that are primarily intended for private communication, such as WhatsApp and Snapchat. Facebook, on the other hand, was not included due to its restrictive terms-of-use, which prevent dissemination of the collected material, dissemination being a major goal of the project. Furthermore, since each new source requires dedicated tools for its harvesting and processing, we preferred to concentrate on other aspects of the project rather than further expand the range of sources.

The selected sources cover reactive (e.g. news and blog comments) as well as interactive (e.g. forum posts) types of UGC (Walther 2012). While the harvested content has in common that it was largely published, non-commissioned, by non-professional platform users, its communication goals, genres, registers, modalities, styles, as well as social and technological settings, vary, which is why a high degree of heterogeneity in the linguistic phenomena included in the corpus was both expected and desired for the purposes of the project.

The collection of tweets and Wikipedia talk pages is comprehensive in the sense that the corpus includes all the Slovene users and their posts that we could identify during the four-year harvesting period. For the other text types we selected only a small set of the most popular sources offering the most textual content, taking care, unlike in web corpora, to extract the structure and metadata for each source separately.

Apart from the kind of texts they contain, the five Janes subcorpora differ also in other aspects, including conditions on public use. We therefore consider them separately and in the remainder of this section detail their harvesting, structure, metadata, encoding, size and public versions.

3.1 Corpus sources and ingest

The subcorpora were gathered and processed with dedicated tools which, where necessary, took care of boiler-plate removal, structuring, and basic metadata extraction on a per-source basis. The subcorpora were additionally cleaned with respect to their character encoding, as several variants exist for Slovene; texts whose encoding could not be reliably normalised were not included in the corpus. The result of this process was five clean, UTF-8 encoded and valid XML subcorpora, each validated against its dedicated RNG schema.
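To illustrate the final validation step, the following is a minimal sketch (not the project's actual pipeline) of checking that a cleaned subcorpus is well-formed XML and valid against its RNG schema, using the lxml library; the file names are placeholders.

```python
# Minimal sketch (not the project's actual pipeline): check that a cleaned
# subcorpus is well-formed XML and valid against its RNG schema with lxml.
# File names are placeholders.
from lxml import etree

def validate_subcorpus(xml_path: str, rng_path: str) -> bool:
    schema = etree.RelaxNG(etree.parse(rng_path))  # compile the RNG schema
    doc = etree.parse(xml_path)                    # raises if the XML is ill-formed
    ok = schema.validate(doc)
    if not ok:
        print(schema.error_log)                    # report validation errors
    return ok

if __name__ == "__main__":
    validate_subcorpus("janes-tweet.xml", "janes-tweet.rng")
```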

3.1.1 Tweets

Slovene tweets were harvested with TweetCat (Ljubešić et al. 2014), which uses the Twitter Search API and was developed especially for building Twitter corpora of smaller languages. The tool can be run continuously, constantly expanding the pool of users identified as tweeting predominantly in the specified language and gathering their tweets. The limitations of the Twitter API mean that old tweets are typically not available;Footnote 2 however, TweetCat for Slovene was run more or less continuously from June 2013 to July 2017, so that we gathered at least three years of tweets.
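The core of the TweetCat approach can be sketched as follows; this is an illustrative outline rather than the actual tool, the fetch_candidate_tweets and fetch_timeline functions are hypothetical wrappers around the Twitter Search API, and language identification is delegated to the langid library with assumed thresholds.

```python
# Illustrative outline of TweetCat-style harvesting (not the actual tool).
# fetch_candidate_tweets() and fetch_timeline() are hypothetical wrappers around
# the Twitter Search API; langid does the language identification.
import langid

TARGET_LANG = "sl"               # Slovene
MIN_TWEETS, MIN_SHARE = 20, 0.8  # assumed thresholds

def predominantly_target(texts):
    """True if most of a user's tweets are identified as the target language."""
    if len(texts) < MIN_TWEETS:
        return False
    hits = sum(1 for t in texts if langid.classify(t)[0] == TARGET_LANG)
    return hits / len(texts) >= MIN_SHARE

def harvest(seed_terms, fetch_candidate_tweets, fetch_timeline, user_pool, corpus):
    # 1) search for language-specific seed terms to find candidate users
    for tweet in fetch_candidate_tweets(seed_terms):
        user = tweet["user"]
        if user in user_pool:
            continue
        # 2) keep the user only if they tweet predominantly in the target language
        timeline = fetch_timeline(user)
        if predominantly_target([t["text"] for t in timeline]):
            user_pool.add(user)
            corpus.extend(timeline)
    # 3) the real tool runs this loop continuously, periodically re-collecting
    #    new tweets from the growing user pool
```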

3.1.2 Forums

The forums included in the corpus are among the most active in Slovenia. The selection of the forums was based on a detailed analysis of 96 Slovenian online forums conducted by Lebar et al. (2012). As selection criteria we used the number of registered members, posting dynamics and the number of active topics on the forums. On this basis we chose the top-ranking forums in three diverse domains: medicine (med.over.net), automotive (avtomobilizem.com) and science (kvarkadabra.net). As the structure of the web sites differs substantially, we developed specialised web crawlers and scrapers for each forum separately, which took care to extract only clean, content-full text from the posts together with basic metadata.Footnote 3 These tools also kept the original structure of the forums, so that the Forum subcorpus is organised into distinct sub-forums and discussion topics or conversation threads. The forum extractor was run on a single occasion (January 2015) due to frequent changes in the appearance of the sites, which would necessitate re-writing the text extractors.
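A per-source scraper of this kind can be sketched as below; the CSS selectors are hypothetical and each forum would need its own, but the idea of keeping only the post body together with its author, date and position in the thread is the same.

```python
# Illustrative sketch of per-source forum scraping that keeps structure and
# metadata (not the project's actual scrapers). The CSS selectors are
# hypothetical and would differ for each forum.
import requests
from bs4 import BeautifulSoup

def scrape_topic(url):
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    posts = []
    for node in soup.select("div.post"):              # one node per post
        posts.append({
            "author": node.select_one(".username").get_text(strip=True),
            "date": node.select_one(".post-date").get_text(strip=True),
            # keep only the message body, dropping signatures, ads and
            # other boiler-plate that sits outside this element
            "text": node.select_one(".post-body").get_text(" ", strip=True),
        })
    return {"url": url, "posts": posts}
```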

3.1.3 News

News articles along with their comments were harvested from the national news portal rtvslo.si and two political weeklies, the left-wing mladina.si and right-wing oriented reporter.si. Here a key factor in deciding which sources to include was, apart from their popularity and hence a large number of user comments, also the fact that these portals do not keep their content behind a pay-wall, nor do they automatically delete old comments, unlike quite a number of other Slovene news portals. As with forums, the collection of texts proceeded with dedicated crawlers and scrapers written for each source separately, enabling us to keep the user- and time-stamped comment thread in the News subcorpus. For completeness, the news articles themselves are also included, even though they are not user-generated. As with forums, this subcorpus was gathered in one run (January 2015).

3.1.4 Blogs

Blogs and their comments were gathered from rtvslo.si and publishwall.si, which are popular among Slovene amateur bloggers. An additional reason to choose these two sites is that they have a uniform structure, which enabled quality harvesting of the majority of their content. Furthermore, as both sites are in the .si domain, they contain almost entirely Slovene text, unlike international sites such as blogger.com, where it is very difficult to separate Slovene posts from the rest, especially as nowadays such sites also contain blogs translated with Google Translate. Again, the text clean-up and structure extraction proceeded with per-site dedicated extractors, enabling us to keep (and distinguish) the blog texts from their comment threads. Here too the subcorpus was harvested once only (January 2016).

3.1.5 Wikipedia

Wikipedia discussion pages, either commenting on particular Slovene Wikipedia articles (pagetalk) or entering into a discussion with a particular Wikipedia author (usertalk), were collected with a dedicated tool, WikiTalkExtractor,Footnote 4 that takes the Wikipedia dump and two parameters (the type of page to collect and the language) and returns a cleaned and structured corpus. The subcorpus was obtained several times, with the last collection made, as with the Tweet subcorpus, in July 2017.
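The general idea of extracting talk pages from a MediaWiki dump can be sketched as follows (this is not the actual WikiTalkExtractor); it streams the dump and keeps only pages in the standard MediaWiki talk namespaces, 1 for article talk and 3 for user talk.

```python
# Minimal sketch of pulling talk pages out of a MediaWiki XML dump (not the
# actual WikiTalkExtractor). Namespace 1 is article talk, namespace 3 user talk.
import xml.etree.ElementTree as ET

TALK_NS = {"1", "3"}

def local(tag):
    """Strip the MediaWiki export namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]

def talk_pages(dump_path):
    title, ns, text = None, None, None
    for _, elem in ET.iterparse(dump_path):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = elem.text
        elif name == "text":
            text = elem.text
        elif name == "page":
            if ns in TALK_NS and text:
                yield {"title": title, "namespace": ns, "wikitext": text}
            title = ns = text = None
            elem.clear()            # free memory for large dumps
```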

3.2 Metadata

An important feature of the Janes corpus is the wealth of metadata it contains, enabling much richer linguistic or sociological analyses of its language than would otherwise be possible. Some metadata was collected directly in the process of harvesting, in particular the identifier (for tweets) or URL (for the other sources), the date and time of posting and the username of the author. For tweets, we also collected the number of retweets and favourites, while for the remaining sources we additionally extracted titles and up- and downvotes, when applicable.

In addition to metadata that could be relatively simply extracted from the HTML or the JSON object, the subcorpora were enriched with additional metadata added either manually or automatically, which we discuss below.

3.2.1 Author gender

Knowing the gender of the author is important for sociolinguistic and other research. For the Tweet subcorpus, as well as for the post authors of the Blog subcorpus, the gender of the authors was determined manually, mostly on the basis of the authors' profiles and usernames. In addition to male and female we also distinguish a neutral gender, used for impersonal authors, such as those of corporate accounts. Given that Slovene explicitly expresses the gender of the first person in the past and future tenses, we also developed an automatic method that uses the PoS-annotated texts of an author to determine his or her gender. As citations can introduce cues for the other gender, the masculine or feminine label is assigned only if the texts contain a significantly larger proportion of one gender than the other; otherwise the neutral gender is assigned. We used this method to gender-label all the other texts in the Janes corpus. The evaluation of the method on the manually labelled subcorpora showed that it gives the correct answer for 76% of the authors; however, in only 5% of cases did the method assign the male gender to a female user or vice versa. The method therefore does not introduce much noise in cases where the opposition of female vs. male writing is under investigation. A content analysis of the Janes corpus with respect to the gender variable (Verhoeven et al. 2017), once obvious gender markers were removed from the corpus, showed that men are most easily distinguished through swearing, the use of numerals and other non-alphabetic symbols, and talking about beer, food and sports, while women are easiest to spot through the use of emoticons (except :-P), interjections, character flooding, adjectives denoting attitude, and personal and possessive pronouns; these findings mostly follow well-known stereotypes already documented in the literature.
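The decision logic of the automatic gender assignment can be sketched as below; the predicates over MSD tags are hypothetical placeholders, and the 2:1 ratio threshold is an assumption, since the exact significance criterion is not reproduced here.

```python
# Illustrative sketch of the gender-assignment logic described above (not the
# project's exact method). is_first_person_cue() and gender_of() are hypothetical
# predicates over MULTEXT-East MSD tags; the 2:1 ratio threshold is an assumption,
# as the text only states that one gender must be significantly more frequent.
def author_gender(tagged_tokens, is_first_person_cue, gender_of, ratio=2.0):
    masc = fem = 0
    for token, msd in tagged_tokens:
        if is_first_person_cue(msd):          # e.g. 1st person sg. past/future form
            g = gender_of(msd)                # 'm' or 'f', read off the MSD
            if g == "m":
                masc += 1
            elif g == "f":
                fem += 1
    # assign m/f only if one gender clearly dominates; otherwise neutral,
    # which also covers corporate and mixed-author accounts
    if masc >= ratio * max(fem, 1):
        return "male"
    if fem >= ratio * max(masc, 1):
        return "female"
    return "neutral"
```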

3.2.2 Author type

Given that communication goals play an important part in the choice of linguistic means, we manually labelled the type of author—either private or corporate—for the tweet and blog authors. An analysis of the Tweet subcorpus showed that it contains 76% private users and 24% corporate ones, while among blog authors only 51% are private users and the other 49% corporate. We also compared the type of user with their gender, as we would expect corporate users to always be of neutral gender; while this is mainly true, 16% of corporate tweet users nevertheless have a non-neutral gender (13% male and 3% female), with almost exactly the same proportions holding for the corporate blog authors. For uniformity of metadata we also automatically assigned the author type to the other Janes subcorpora, where we simply took all forum posts and blog, news, and wiki comments to be authored by private individuals. While investigating the possibility of identifying the author type on Twitter without using the linguistic signal (Ljubešić and Fišer 2016a), we found, among other things, that corporate users use more URLs, publish more during working hours, write longer tweets and have more followers than friends, while private users more often reply to tweets, mention other users, favourite the tweets of other users, and write tweets of varying length and at varying times of the day.

3.2.3 Text standardness

The first analyses of the corpus showed that a significant portion of the included texts is written in perfectly standard Slovene, while the focus of the project was on exploring non-standard language. We therefore developed an automatic procedure (Ljubešić et al. 2015) which determines the standardness level of a text. It turned out to be advantageous to distinguish two types of (non-)standardness, technical and linguistic. Technical standardness (T) is, to a large extent, determined by the input medium or device, and is manifested in the non-use of capitalisation, punctuation and spaces, while linguistic standardness (L) relates to the lexis and syntax of the texts, comprising the choice and spelling of words, their morphosyntactic properties, and word order. For both dimensions the system returns a value between 1 (perfectly standard) and 3 (very non-standard), which we usually round to the closest integer; so, for example, the text “komunistična ideologija ubijaj,kradi laži...”, normalised as Komunistična ideologija: ubijaj, kradi, laži...—“Communist ideology: kill, steal, lie...” receives the label T3L1, while “A nis bla včer na Bledu?”, normalised as Ali nisi bila včeraj na Bledu?—“Weren't you in Bled yesterday?” gets the label T1L3. The evaluation of the method showed that its average absolute error is 0.38 for technical and 0.42 for linguistic standardness estimation. While the results thus miss the perfect score on average by between a third and a half of a point (on a two-point scale), they have nevertheless proven very useful in filtering the corpus texts down to those of interest for a given investigation. A series of experiments on Slovene Twitter data enriched with standardness information (Ljubešić and Fišer 2016b) showed that the standardness level of tweets is at its highest in the morning, slowly dropping throughout the day, with a sharp drop in the first hours of the following day. Additionally, exploiting both the gender and standardness metadata showed that women produce more non-standard linguistic material than men, a finding contrary to existing beliefs.
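For illustration, the kind of surface features that signal technical (non-)standardness can be computed as in the sketch below; the actual procedure of Ljubešić et al. (2015) is a trained model, so the features and their use here are illustrative only.

```python
# Sketch of surface features of the kind that signal technical (non-)standardness
# (missing capitalisation, punctuation and spacing). The actual procedure of
# Ljubešić et al. (2015) is a supervised model; these features are illustrative.
import re

def technical_features(text):
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_sent, n_words = max(len(sentences), 1), max(len(words), 1)
    return {
        # share of sentences starting with an uppercase letter
        "cap_sent_start": sum(s.strip()[0].isupper() for s in sentences) / n_sent,
        # punctuation not followed by a space, e.g. "ubijaj,kradi"
        "glued_punct": len(re.findall(r"[,;:!?](?=\S)", text)) / n_words,
        # share of tokens written fully in lowercase
        "all_lower": sum(w.islower() for w in words) / n_words,
        # is there any sentence-final punctuation at all?
        "has_final_punct": bool(re.search(r"[.!?]\s*$", text)),
    }

print(technical_features("komunistična ideologija ubijaj,kradi laži..."))
```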

3.2.4 Text sentiment

Sentiment labelling of UGC has become a very popular area of research (Liu 2015), as it makes it possible to determine whether the public is in favour of e.g. a presidential candidate, a proposed bill or a new product, and to observe trends in the sentiment on a particular topic. The most popular categorisation of sentiment is into the negative, positive and neutral classes, where the last also covers cases of mixed-sentiment texts. For labelling the texts in the Janes corpus with these three classes we used an SVM-based classifier (Smailović et al. 2014) trained on a large collection of manually labelled Slovene tweets (Mozetič et al. 2016). The evaluation on a sample from the Janes corpus (Fišer et al. 2016) showed that the inter-annotator agreement between three annotators was 0.56 using Krippendorff's alpha (Krippendorff 2012), while the system achieved a score of 0.43 against the majority class of the annotators. Agreement is usually taken to be satisfactory if it is above 0.4, so the system just passes this threshold. However, sentiment classification, especially of very short texts such as those commonly found in UGC, is a difficult task, as can also be seen from the fairly low inter-annotator agreement. The already mentioned investigation of the interdependence of the various metadata available in the Twitter part of the Janes corpus (Ljubešić and Fišer 2016b) showed that users produce more negative than positive content throughout the week, with Friday being a turning point towards the weekend, when the amounts of positive and negative content are comparable. Furthermore, men were shown to be more negative than women, and non-standard content to have a higher positive-to-negative ratio than standard content.
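As a generic stand-in for such a classifier (the actual system is the SVM-based classifier of Smailović et al. (2014) trained on the Mozetič et al. (2016) tweets), a three-class SVM sentiment pipeline can be put together as follows, here with toy training data.

```python
# Generic three-class SVM sentiment pipeline as a stand-in for the actual
# classifier (Smailović et al. 2014); the training data here is a toy placeholder
# for the manually labelled tweets of Mozetič et al. (2016).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["super film, priporočam", "grozna storitev", "jutri gremo v Ljubljano"]
train_labels = ["positive", "negative", "neutral"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),  # word unigrams and bigrams
    LinearSVC(),                                           # linear SVM, one-vs-rest
)
model.fit(train_texts, train_labels)
print(model.predict(["kakšen čudovit dan"]))
```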

3.3 Corpus encoding

The Janes corpus is encoded in XML, according to the Text Encoding Initiative Guidelines (TEI Consortium 2017). Each of the five subcorpora is stored as a separate TEI document which contains a TEI header with the metadata for the subcorpus, the body, and back matter containing the definition of the part-of-speech (or, rather, morphosyntactic) tagset used. This definition, which uses the MULTEXT-East MSD tagset for Slovene (Erjavec 2012), is encoded as a feature-structure library, where each element of the tagset is decomposed into its features, i.e. pairs of attributes and values.

The TEI header contains extensive metadata for each source, giving the authors and funders, the source description, principles of segmentation and interpretation, and the number and descriptions of the TEI elements used in the subcorpus. It also contains the standard values used in the markup of the corpus, in particular the classes of named entities. All the metadata is given both in Slovene and English.

The TEI text body contains the collected texts structured as nested divisions (e.g. subforums, discussion topics or discussion threads, wherever available from the source), with the bottom-most division corresponding to one text; the types of divisions are distinguished by their @type attribute, with the text marked as tweet, article, post or comment. TEI does not provide dedicated elements for the kinds of texts found in CMC/UGC, and there have been attempts to extend it in this direction (Beißwenger et al. 2012). However, that proposal introduces a host of new elements and redefines many existing ones, and has not yet reached its final form, so we opted to use standard TEI instead, encoding each text as a series of paragraphs and the metadata for each division simply as a feature structure whose features have string values.

As discussed in Sect. 4, the texts in the corpus have been automatically tokenised, word-level normalised to standard Slovene, and the normalised forms PoS (MSD) tagged and lemmatised. The encoding is illustrated in Fig. 1, where it should be noted that linguistically relevant whitespace is preserved and that it also covers cases of 1-n or n-1 mappings between the original and normalised words.

Fig. 1 TEI encoding of a text in the corpus. The text reads “Kaj ni to tazadnje AAjevska molitev?”, normalised as Kaj ni to ta zadnje AA-jevska molitev?—“Isn’t this last an AA prayer?”

3.4 Quantifying the corpus

Table 1 gives quantitative data on the Janes corpus and its subcorpora. The first column lists the subcorpora, where we distinguish the two types of texts in blogs and news. It should be noted that news articles are not marked for authorship, so the author is here “undefined”.

The second column gives the number of authors, where an author is defined by their username in a particular subcorpus; this means that if two persons have the same username in, e.g., two different forums, they are treated as the same author, but if the same username is used in, e.g., tweets and forums, it is counted as two authors. The number of authors is in any case imprecise, as one person could post under several usernames in the same subcorpus.

With these caveats in mind, we note that we collected texts from about 100,000 UGC authors, i.e. about 5% of the population of Slovenia, with by far the most authors coming from the Forum subcorpus and the fewest from blog posts and Wikipedia.

The third column gives the number of texts, i.e. tweets, posts, articles and comments, of which there are together almost 13 million. By far the largest here is the Twitter subcorpus with over 11 million tweets, while from the others only the Forum subcorpus comes at least close to a million texts.

The fourth column gives the numbers of tokens, where the complete corpus has almost 270 million, a respectable size for a language with 2 million speakers. Of this, 160 million come from tweets, with the next largest being forums with almost 50 million. The other subcorpora are quite a bit smaller, down to the Wikipedia comment one with only 5 million tokens.

The final two columns give the dates of the oldest and most recent text in the subcorpora. The latter date is the date of the (last) harvesting of the corpus, with the Twitter and Wikipedia the most recent, blogs at the start of 2016, and forums and news at the beginning of 2015. The date of the oldest texts depends on the source and method of collection: tweets are the youngest, starting only in mid-2013, while forums retain the oldest posts, going back to 2001. Of course, the distribution over time is not even, but typically heavily skewed towards recent texts.

Given the limited funding available in the Janes project, which is to be expected for most languages with a smaller number of speakers, we opted for including relatively few sources in the corpus (as each source requires development time for constructing the text cleaning and structuring software) while collecting as much data as possible from each chosen source. The Janes corpus is thus not ideally representative and is definitely not balanced with respect to the complete Slovene UGC production; we therefore leave it to each researcher and their specific agenda to perform sampling over the included texts, a process significantly aided by the wealth of text-associated metadata.

Table 1 Size of Janes corpus

3.5 The public version of the corpus

The Janes corpus discussed so far is the project-internal version, which was also the basis for the linguistic investigations of Slovene UGC undertaken in the scope of the project (Fišer 2018). However, an important goal was to make the developed resources and tools publicly available, in order to enable other researchers (and companies) to exploit and build on our work. The tools, as discussed further in Sect. 4, are made available as open source, while the manually annotated datasets (cf. Sect. 5) are also made freely available under CC-BY in their entirety; given their small size, and given that the Twitter terms of service allow publishing up to 50,000 tweets as free text, we do not take this to be problematic.

Making the complete Janes corpus publicly available is more difficult. While with traditional corpora the greatest obstacle to redistribution was copyright, with UGC corpora it is the terms-of-use of the platform owners and privacy protection (including the right to be forgotten) of the authors and of the people mentioned in the texts.

The situation is further complicated by the fact that we wanted to make the corpora available not only for on-line exploration, but also for download. For the former we use the two CLARIN.SI concordancers, namely KonTextFootnote 5 and noSketch Engine.Footnote 6 The two concordancers share their back-end, namely Manatee (Rychlý 2007), which enables complex queries over richly annotated corpora. Both can also be integrated with CLARIN Federated Content Search,Footnote 7 a functionality we plan to add in the future.

To enable download of the project corpora, we use the CLARIN.SI repository,Footnote 8 where each resource has its landing page (with a stable PID), giving the metadata of the resource, the way in which it should be cited, and the resource itself. Where available, the landing page gives a cross-reference to the concordancers for the particular corpus, and vice-versa. The repository also exposes its metadata, which is being harvested by a number of other services.

As the five Janes subcorpora have different restrictions on their redistribution, we make them available separately, as explained below.

3.5.1 Tweets

Twitter often changes its terms-of-use, but the main and constant restriction is that the texts of the tweets (except for relatively small samples) may not be redistributed. A common way around this problem is to distribute only tweet IDs; the texts of tweets that have not been deleted in the meantime can then be retrieved through the Twitter API. We adopted this solution for the downloadable Janes-Tweet corpus (Ljubešić et al. 2017b); however, we face the added complication that the words in our corpus are normalised and lemmatised, so the tweets could be reconstituted fairly well from these annotations alone. We therefore encode the tokens, their normalisations and their lemmas as offsets against the original texts, diffs against the tokens, and diffs against the normalised tokens, respectively. For encoding and decoding our Twitter corpus we developed a dedicated tool, TweetPub,Footnote 9 which enables users to make linguistically annotated tweet collections publishable. While we used the encoding option of the tool to prepare the corpus, users interested in retrieving it can re-collect the tweets with the Twitter API and regenerate the linguistically processed corpus with the decoding option of the tool. In the Janes-Tweet corpus available under the CLARIN.SI concordancers, we removed the author metadata and anonymised the personal names (including mentions and URLs) in the corpus. We also produced a subcorpus of Janes-Tweet called Janes-TwePo, which contains only the tweets of 91 Slovene politicians active on Twitter (Ljubešić et al. 2017c), and where each included user has additional metadata giving their party allegiance(s) and political function(s).
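The idea behind the offsets-and-diffs encoding can be sketched as follows; the actual TweetPub format differs, but the principle is the same: nothing published reproduces the tweet text itself, yet the annotations can be reattached once the tweet is re-collected.

```python
# Sketch of the offsets-and-diffs idea behind the downloadable Janes-Tweet
# encoding (the actual TweetPub format differs). Tokens are stored as character
# offsets into the original tweet, normalisations as edit scripts against the
# token, and lemmas as edit scripts against the normalised form.
import difflib

def edit_script(a, b):
    """Compact list of edits turning string a into string b."""
    return [(i1, i2, b[j1:j2])
            for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes()
            if tag != "equal"]

def apply_script(a, script):
    out, pos = [], 0
    for i1, i2, repl in script:
        out.append(a[pos:i1]); out.append(repl); pos = i2
    out.append(a[pos:])
    return "".join(out)

# encoding side (done before distribution)
tweet = "A nis bla včer na Bledu?"
start, end, norm, lemma = 2, 5, "nisi", "biti"     # the token "nis"
record = {"offset": (start, end),
          "norm_diff": edit_script(tweet[start:end], norm),
          "lemma_diff": edit_script(norm, lemma)}

# decoding side (after the user re-collects the tweet via the Twitter API)
token = tweet[record["offset"][0]:record["offset"][1]]
norm_decoded = apply_script(token, record["norm_diff"])
lemma_decoded = apply_script(norm_decoded, record["lemma_diff"])
print(token, norm_decoded, lemma_decoded)          # nis nisi biti
```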

3.5.2 Forums

For all three forums included in the Janes-Forum corpus (Erjavec et al. 2017b) we obtained permissions from the platform owners that allow us to freely redistribute the corpus, under the condition that it is anonymised not only for persons but also for company names. The reason for the latter condition is that users often criticise companies in their posts, and the forum owners had already been threatened with legal action as a consequence. The public Janes-Forum corpus, both on the concordancers and in the repository, is therefore distributed with the authors' usernames removed and with personal and organisation names anonymised.

3.5.3 News

For two out of the three sources of news articles and comments we received written permission to freely redistribute the corpus. For the third source, the national news portal rtvslo.si, we received verbal assurances that redistribution of the comments is not a problem; redistribution of the news articles themselves, however, is, as some of them come from the Slovenian Press Agency, which does not allow redistribution. As news articles are not UGC in any case, the public Janes-News corpus (Erjavec et al. 2017c) is distributed without the news articles and, as with most other subcorpora, with the authors' usernames removed and person names in the text anonymised.

3.5.4 Blogs

One of the blog sources was rtvslo.si, so the same permission applied as for the news corpus, while for the other, publishwall.si, we were unfortunately unable to obtain an answer to our request to enter into a redistribution agreement. Regardless, we make the Janes-Blog corpus (Erjavec et al. 2017a) available in the same way as most of the others, i.e. we deleted the authors' usernames and anonymised person names in the text.

3.5.5 Wikipedia

The simplest subcorpus to release was Janes-Wiki (Ljubešić et al. 2017c), as Wikipedia, including talk pages, is published under the CC BY-SA licence, and we adopt it for the downloadable corpus; we do not do any anonymisation on Janes-Wiki.

While the complete Janes corpus is available for download only per partes, we do make it available through the CLARIN.SI concordancers, but uniformly anonymised and with a limited context window.

As the preceding discussion shows, the situation regarding redistribution agreements and, in fact, anonymisation (which relies on automatic NE assignment and thus has a non-zero error rate) is not completely clean. But we consider that the public benefit of releasing the corpora outweighs the lack of written permissions in some cases and the lack of perfect anonymisation. CLARIN.SI also adopts a take-down policy, so should a platform owner request the removal of their texts, or a person request the removal of texts they consider harmful to them, we will comply with the request.

4 Annotation tools

Basic text annotation tools and datasets for Slovene already exist: morphosyntactic tagging and lemmatisation can be performed with ToTaLe (Erjavec et al. 2005) and Obeliks (Grčar et al. 2012), while new taggers, lemmatisers and syntactic parsers can be trained on the openly available manually annotated corpus ssj500k (Krek et al. 2013) and the morphological lexicon Sloleks (Dobrovoljc et al. 2015). However, these tools and resources deal only with standard Slovene, and it has often been shown that tools for annotating standard language perform poorly on UGC (Gimpel et al. 2011; Ljubešić et al. 2017a), as diacritics and punctuation are often omitted and phonetic spelling and slang words frequently used, leading to many words unknown to standard models. For this reason we developed a number of tools that are either specifically designed to deal with (Slovene) UGC and other non-standard language or can be trained to do so. We introduce these tools in the following subsections, giving an overview of related work at the end of each subsection.

4.1 Tokenisation and sentence segmentation

For tokenisation and sentence segmentation we developed a Python tool that currently covers Slovene, Croatian and Serbian (Ljubešić and Erjavec 2016). Like most tokenisers, ours is based on manually specified rules (implemented as regular expressions) and uses language-specific lexicons with, e.g., lists of abbreviations. However, the tokeniser also supports the option of specifying that the text to be processed is non-standard. In this case it uses rules that are less strict than those for standard language, as well as several additional rules. An example of the former is that a full stop can end a sentence even if the following word does not begin with a capital letter or is not even separated from the full stop by a space; nevertheless, tokens that end with a full stop and are on the list of non-sentence-ending abbreviations, e.g. “prof.”, do not end a sentence. An example of the latter is an additional regular expression devoted to recognising emoticons, e.g. “:-]”, “:-PPPP”, etc.
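An illustrative regex-based tokeniser with an emoticon rule, in the spirit of the non-standard mode described above (this is a sketch, not the actual Janes tokeniser; sentence splitting is omitted):

```python
# Illustrative regex tokeniser with an emoticon rule, in the spirit of the
# non-standard mode described above (not the actual Janes tokeniser).
import re

EMOTICON = r"[:;=8][-o^']?[)(\]\[DdPpOo/\\|*]+"
TOKEN_RE = re.compile(
    rf"(?:{EMOTICON})"          # emoticons, e.g. :-), :-], :-PPPP
    r"|https?://\S+"            # URLs
    r"|[@#]\w+"                 # mentions and hashtags
    r"|\w+(?:[-']\w+)*"         # words, incl. internal hyphens/apostrophes
    r"|\.{2,}"                  # ellipses
    r"|\S",                     # any other single non-space character
    re.UNICODE,
)

def tokenise(text):
    return TOKEN_RE.findall(text)

print(tokenise("A nis bla včer na Bledu? :-PPPP glej http://example.si"))
```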

An evaluation of the tool on highly non-standard tweets showed that sentence segmentation could still be significantly improved (86.3% accuracy), while tokenisation is relatively good (99.2%) taking into account that both tasks are very difficult for non-standard language.

Most tokenisers nowadays still follow the rule-based approach, as training a good statistical segmenter requires a significant amount of manually segmented data for all relevant phenomena to occur, whereas these can more easily be dealt with by a series of rules written by a researcher familiar with the problem. This is why one of the most popular tokenisers for Twitter is still the rule-based NLTK TweetTokenizer.Footnote 10 Given that manually annotated datasets were developed during the project (described in Sect. 5), we will in the future consider replacing the symbolic segmenter with a statistical one trained on those datasets.

4.2 Normalisation

Normalising non-standard word tokens to their standard form has two advantages. First, it becomes possible to search for a word without having to consider or be aware of all its variant spellings and, second, tools for standard language, such as part-of-speech taggers, can be used in further linguistic processing if they take the normalised forms of words as their input. In the Janes corpus all word tokens have been automatically normalised, where necessary, using a sequence of two steps.

Many UGC texts are written without diacritics (e.g. “krizisce” → “križišče”—“crossroads”), so we first use a dedicated tool (Ljubešić et al. 2016) to restore them. The tool learns its rediacritisation model from a large collection of texts with diacritics paired with the same texts with the diacritics removed. The evaluation showed that the tool achieves a token accuracy of 99.62% on standard texts (Wikipedia) and 99.12% on partially non-standard texts (tweets).
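A simple lexicon-lookup variant of rediacritisation conveys the idea (the actual tool learns a context-aware model): pair every token of a diacritic-bearing corpus with its diacritic-stripped form and restore the most frequent original for each stripped form.

```python
# A simple lexicon-lookup variant of rediacritisation (the actual tool of
# Ljubešić et al. (2016) learns a context-aware model): pair every token of a
# diacritic-bearing corpus with its stripped form and restore the most frequent
# original for each stripped form.
import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(token):
    # "križišče" -> "krizisce"
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def train(tokens_with_diacritics):
    counts = defaultdict(Counter)
    for tok in tokens_with_diacritics:
        counts[strip_diacritics(tok)][tok] += 1
    return {plain: c.most_common(1)[0][0] for plain, c in counts.items()}

def rediacritise(token, model):
    return model.get(strip_diacritics(token), token)

model = train(["križišče", "križišče", "krizisce"])   # toy "corpus"
print(rediacritise("krizisce", model))                # -> križišče
```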

In the second step the rediacritised word tokens are normalised with a method based on character-level statistical machine translation (CSMT) (Ljubešić et al. 2014, 2016). The goal of the translation is to normalise words written in a non-standard form (e.g. “jest”, “jst”, “jas”, “js”) to their standard equivalent (jaz—“I”). The translation model for Slovene was trained on the Janes-Norm dataset (Erjavec et al. 2016c) further discussed in Sect. 5.1, while the target (i.e. standard) language model was trained on the Kres balanced corpus of Slovene (Logar Berginc et al. 2012) and on the tweets from the Janes corpus labelled as linguistically standard. Our experiments (Ljubešić et al. 2016) showed that on non-standard tweets we achieve a word- and character-level error reduction of 70%, while the same error reduction on more standard tweets is 55%. The proposed method has been released as csmtiser,Footnote 11 a wrapper tool around the Moses SMT toolkit.
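The essential trick of CSMT is to treat each word as a "sentence" of characters; preparing parallel training files for a Moses-style pipeline such as csmtiser can thus be sketched as follows (the file names are placeholders).

```python
# The essence of character-level SMT: each (non-standard, standard) word pair
# becomes a parallel "sentence" of space-separated characters, which a Moses-style
# pipeline (e.g. csmtiser) can then train on. File names are placeholders.
pairs = [("jest", "jaz"), ("jst", "jaz"), ("js", "jaz"), ("bla", "bila")]

with open("train.nonstd", "w", encoding="utf-8") as src, \
     open("train.std", "w", encoding="utf-8") as tgt:
    for nonstd, std in pairs:
        src.write(" ".join(nonstd) + "\n")   # "j e s t"
        tgt.write(" ".join(std) + "\n")      # "j a z"
```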

As for standard practice in normalisation, aside from some older symbolic, lexicon- and distance-based approaches (Metzler et al. 2007; Baron and Rayson 2008), the current trend is data-driven, i.e. based on learning transformations from non-standard input to standard output from parallel data. Given that sequence-labelling techniques such as Hidden Markov Models, Conditional Random Fields (CRF) and Bidirectional Long Short-Term Memories (Bi-LSTM) require the data to already be aligned at the character level, the currently most popular and state-of-the-art method is CSMT (Tjong et al. 2017), with recent encoder-decoder neural models trying to catch up in this resource-poor setting as well (Bollmann et al. 2017). Namely, while encoder-decoder models outperform traditional SMT as long as very large amounts of data are available (Koehn and Knowles 2017), in the normalisation setting, where typically only thousands of word pairs are available for training the networks, CSMT still performs better (Tjong et al. 2017).

4.3 Tagging and lemmatisation

As the next step in the text annotation pipeline the normalised tokens are annotated with their morphosyntactic description (MSD) and lemma. For this we used a newly developed CRF-based tagger-lemmatiser that was trained for Slovene, Croatian and Serbian (Ljubešić and Erjavec 2016). The main innovation of the tool is that it does not use its lexicon directly, as a constraint on possible MSDs of a word, but rather indirectly, as a source of features; it thus makes no distinction between known and unknown words.
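The lexicon-as-features idea can be illustrated with a small sklearn-crfsuite sketch; the feature set, the toy lexicon and the MSD labels below are simplified stand-ins, not the actual tool or its training data.

```python
# Sketch of the lexicon-as-features idea with sklearn-crfsuite (not the actual
# tool). The lexicon, training sentence and MSD labels are simplified toy values.
import sklearn_crfsuite

def token_features(sent, i, lexicon):
    w = sent[i]
    feats = {
        "lower": w.lower(),
        "suffix3": w[-3:],
        "is_title": w.istitle(),
        "has_digit": any(c.isdigit() for c in w),
    }
    # the lexicon is used indirectly: every MSD it lists for this word form
    # becomes a feature, rather than a hard constraint on the output tag
    for msd in lexicon.get(w.lower(), []):
        feats["lex=" + msd] = True
    if i > 0:
        feats["prev_lower"] = sent[i - 1].lower()
    if i < len(sent) - 1:
        feats["next_lower"] = sent[i + 1].lower()
    return feats

def sent2features(sent, lexicon):
    return [token_features(sent, i, lexicon) for i in range(len(sent))]

lexicon = {"je": ["Va-r3s-n"], "pes": ["Ncmsn"]}          # toy lexicon
train_sents = [["pes", "je", "lajal", "."]]               # "the dog was barking ."
train_tags = [["Ncmsn", "Va-r3s-n", "Vmbp-sm", "Z"]]      # approximate MSD labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([sent2features(s, lexicon) for s in train_sents], train_tags)
print(crf.predict([sent2features(["pes", "lajal"], lexicon)]))
```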

For annotating the Janes corpus the tool was trained on the already mentioned ssj500k 1.3 corpus (Krek et al. 2013) and the Sloleks 1.2 lexicon (Dobrovoljc et al. 2015), as well as on the newly developed dataset Janes-Tag (Erjavec et al. 2016d) described in Sect. 5.2. The two available corpora were simply merged for training, with the significantly smaller Janes-Tag corpus repeated three times, as our experiments on annotating non-standard Slovene (Ljubešić et al. 2017a) have shown this to be the optimal ratio of standard to non-standard data. Compared to the previous best result for Slovene, using the Obeliks tagger (Grčar et al. 2012) trained on the same datasets, the CRF tagger reduces the relative error by almost 25%, achieving 94.3% accuracy on the test set comprising the last tenth of the ssj500k corpus. On non-standard text, applying the tagger without any adaptations increases the error rate five-fold, while performing both supervised (training data) and unsupervised (word clustering) adaptation of the tagger eliminates 80% of the error introduced by the non-standard data (Ljubešić et al. 2017a).

It should be noted that the MSD tagset used in Janes follows the (draft) MULTEXT-East Version 5 morphosyntactic specifications for Slovene,Footnote 12 which are identical to the Version 4 specifications (Erjavec 2012), except that they, following Bartz et al. (2014), introduce new MSDs for the annotation of UGC content, in particular Xw (e-mails, URLs), Xe (emoticons and emojis), Xh (hashtags, e.g. “#kvadogaja”—“#whatshappening”) and Xa (mentions, e.g. “@dfiser3”).

The lemmatisation, which is also part of the tool, takes into account the posited MSD and the lexicon: for word-form : MSD pairs that are already in the training lexicon it simply retrieves the lemma, while for others it uses its lemma prediction model.

Regarding best practices in part-of-speech tagging of non-standard data, CRFs with lexicon and distributional features still show very strong performance (Horsmann and Zesch 2016). Small improvements were recently obtained with Bi-LSTM-based taggers (Zampieri et al. 2018), which yield minor but statistically significant gains over CRFs with carefully engineered features (Ljubešić 2018). The main advantage of Bi-LSTMs over CRFs is, naturally, that no feature engineering is needed.

4.4 Named entity recognition

A final corpus processing step was the annotation of named entities with the goal of automatic anonymisation.

Within the Janes project we developed a named-entity recognition tool, Janes-NER,Footnote 13 which is CRF-based and uses a rather standard feature set relevant for identifying named entities, as well as distributional information in the form of Brown clusters (Brown et al. 1992), whose binary paths of different lengths are exploited as features.
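How Brown-cluster paths become features can be sketched as follows, assuming a precomputed mapping from word forms to binary path strings; the prefix lengths are illustrative.

```python
# Sketch of turning Brown-cluster binary paths into features of several prefix
# lengths, as used by Janes-NER; brown_paths is assumed to be a precomputed
# mapping from word forms to bitstring paths (here a toy example).
def brown_features(word, brown_paths, prefix_lengths=(4, 6, 10, 16)):
    feats = {}
    path = brown_paths.get(word.lower())
    if path:
        for n in prefix_lengths:
            # shorter prefixes give coarser, better-populated clusters;
            # longer prefixes give finer-grained ones
            feats["brown:{}".format(n)] = path[:n]
    return feats

brown_paths = {"ljubljana": "0110101101", "maribor": "0110101110"}
print(brown_features("Ljubljana", brown_paths))
# {'brown:4': '0110', 'brown:6': '011010', 'brown:10': '0110101101', 'brown:16': '0110101101'}
```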

There are currently two main sources of annotated corpora for training a Slovene NER system: the 200 thousand tokens of the ssj500k corpus that were manually annotated for named entities, and the Janes-Tag corpus described in Sect. 5.2, which was annotated following the same guidelines. For training the model used in annotating the Janes corpus we merged the relevant portion of ssj500k with Janes-Tag; given the larger size of ssj500k, we repeated the content of Janes-Tag twice. The decision to oversample was based on our NER experiments and previous experiments on tagging non-standard text (Ljubešić et al. 2017a). The Brown clusters used by the classifier were trained on the 1.2-billion-word slWaC corpus of the Slovene web (Erjavec et al. 2015). Our experiments on a combination of standard and non-standard data, both for training and testing, showed a macro-average F1 of 0.69. Split by NE category, the “other” class has the lowest F1 (0.30), followed by organisations (0.56) and locations (0.80), with the person class having the highest F1 (0.92).

Related work on performing NER both on standard and non-standard data shows that CRFs with additional sources of information such as word clusters and gazetteers are still competitive, but are gradually being superseded by a combination of a neural model for feature extraction and a CRF for inference (Huang et al. 2015).

5 Manually annotated datasets

A number of manually annotated datasets were produced in the scope of the project, some to train and test UGC annotation tools (Čibej et al. 2016; Erjavec et al. 2016b), and others primarily to enable empirically based linguistic research. In general, the annotation campaigns for all the datasets proceeded in the same way, essentially following the MATTER (Model, Annotate, Train, Test, Evaluate, and Revise) process (Pustejovsky and Stubbs 2012): the modelling of the phenomena under investigation was discussed, a preliminary version of the annotation guidelines was written, the Janes corpus was sampled according to the criteria for the particular dataset, a small trial dataset was annotated (leading to revisions of the guidelines and serving to train the annotators), and then text samples of the chosen size were annotated.

The annotation was performed in WebAnno (Yimam et al. 2013), a general-purpose web-based annotation tool that supports e.g. multi-layer annotation and features with multiple values. However, the tool is difficult to use for correcting tokenisation (and hence all the token-dependent layers), so we had to introduce multivalued features and some special symbols in order to be able to split and merge tokens and assign sentence boundaries. We also put special emphasis on format conversion. As mentioned, the Janes corpus is encoded in TEI P5 (TEI Consortium 2017), which WebAnno does not support. We therefore developed a conversion from TEI to the WebAnno TSV tabular format, and a merge operation that combines the TSV exported from WebAnno with the source TEI, resulting in a TEI encoding with corrected annotations. Given that it was possible to change token boundaries in WebAnno, this operation is fairly complex (Erjavec et al. 2016a).

As with the Janes corpora, all the discussed datasets are freely available for exploration under the CLARIN.SI concordancers and for download under the CC BY 4.0 licence from the CLARIN.SI repository. The texts have not been anonymised or otherwise changed, as we consider these datasets too small to give rise to privacy or terms-of-use concerns. In addition to the XML-encoded datasets, the repository items also contain the main publication relating to the resource, the annotation guidelines, and the corpus converted to the simpler vertical format used by the concordancers.

5.1 The Janes-Norm dataset

The Janes-Norm dataset (Erjavec et al. 2016c) contains manually annotated tokens, sentences and normalised (standardised) words. It consists of complete text samples from the Janes corpus, including tweets from private individuals, forum posts and comments on blog posts and news articles; priority was given to texts which were linguistically non-standard, i.e. half of the included texts are marked as L3. The details of the sampling procedure and annotation campaign are given in Erjavec et al. (2016b).

Our guidelines for UGC annotation mostly followed the guidelines for annotating historical Slovene texts (Erjavec 2015), but with some modifications reflecting the differences of the medium (e.g. emoticons, URLs). Special emphasis was given to the treatment of non-standard words with multiple spelling variants and without a standard form (e.g. “orng”, “ornk”, “oreng”, “orenk”—“very”), foreign language elements (e.g. “updateati”, “updajtati”, “updejtati”, “apdejtati” as differently spelled variants of the loan-word “to update”) and linguistic features that are not normalised (e.g. hashtags, non-standard syntax and stylistic issues). The annotation also covered cases where one non-standard word corresponds to two or more standard words or vice versa (e.g. “tamau” → “ta mali”—“junior”; “ne malo” → “nemalo”—“not little”/“considerable”).

All the texts were first automatically annotated with the initial tools and then checked and corrected manually by a team of students. Each text was annotated by two different annotators and then curated by the team leader. Janes-Norm contains 7816 texts and 184,755 tokens and was used for training the normaliser described in Sect. 4.2.

5.2 The Janes-Tag dataset

The Janes-Tag dataset (Erjavec et al. 2016d) is a subset of Janes-Norm in which, additionally, the MSD and lemma of each token were manually annotated, i.e. it is meant as a gold-standard dataset for PoS tagging and lemmatisation. As with Janes-Norm, the details of the sampling procedure and annotation campaign are given in Erjavec et al. (2016b).

The annotation guidelines followed the Guidelines for MSD and lemma annotation for standard (Holozan et al. 2008) and historical (Erjavec 2015) Slovene texts, again with some modifications regarding the differences of the medium. In particular, the Janes-Tag guidelines were designed to deal with foreign language elements, proper names and abbreviations as well as non-standard use of cases and particles.

Similar to the Janes-Norm dataset, all texts were first automatically annotated with the initial tools and then checked and corrected manually by the same team of students as for Janes-Norm. Each text was annotated by two different annotators and then curated by the team leader. Janes-Tag contains 2958 texts and 75,276 tokens and was used as one of the resources for training the tagger and lemmatiser described in Sect. 4.3.

5.3 The Janes-Syn dataset

We also produced a small dataset of manually syntactically annotated texts Janes-Syn (Arhar Holdt et al. 2016; Arhar Holdt et al. 2017), sampled from the Janes-Tag dataset. The overall annotation model follows the dependency framework used for the JOS treebank of Slovene (Erjavec et al. 2010), but with emphasis on the annotation of UGC-specific elements that required special treatment: foreign language elements, ellipsis and fragments, non-standard use of punctuation, and other non-standard language features. The texts were first annotated by the dependency parser for standard Slovene (Dobrovoljc et al. 2012) and corrected by one experienced annotator. Janes-Syn contains 168 tweets, 413 sentences and 4388 tokens.

5.4 The Janes datasets for linguistic investigations

Several datasets were also produced to provide the basis for linguistic investigations of various UGC phenomena in Slovene. While the main objective was to undertake and publish linguistic findings, it should be noted that these datasets could also be used for machine learning of more advanced aspects of Slovene UGC. The preparation and campaign in these cases proceeded similarly to the datasets described above, except that typically only one person, the main author of the investigation and annotation guidelines, was involved in the annotation.

Janes-Kratko (i.e. “Janes-Short”) (Goli et al. 2016, 2017) was concerned with modelling the types of shortening phenomena in Slovene tweets. The dataset contains 777 tweets (20,222 tokens) of various levels of standardness, in which instances of shortening strategies were observed on the levels of spelling, lexis and syntax. In total, 3464 instances of shortening strategies belonging to 32 different categories from the developed typology were annotated; the analysis showed that the highest number and widest range of shortening strategies arise on the orthographic level, whereas they are lowest on the syntactic level.

Janes-Vejica (i.e. “Janes-Comma”) (Popič et al. 2017; Popič and Fišer 2018) examined commas in Slovene tweets, as correct use of the comma is one of the most difficult aspects of Slovene grammar. The dataset contains 495 tweets (14,031 tokens) of various levels of standardness, in which all commas were annotated with the reason for their (in)correct use according to the developed typology. The results showed that in Slovene UGC comma use is problematic mostly with respect to missing commas, especially between dependent and independent clauses, and before and after small clauses.

Janes-Preklop (i.e. “Janes-Switch”) (Reher and Fišer 2018; Reher et al. 2017) contains annotations of code-switching in Slovene tweets. The dataset contains 3200 standard and non-standard tweets, half of which were sampled from those tweets in the Janes-Tweet corpus that contain at least one token tagged as a foreign word, and the other half from the tweets of the top 100 users with the most foreign language elements detected by the tagger. Each identified code-switch was manually annotated on five levels: language (English, German etc.), type of CS (intra- and intersentential), orthography (assimilated, partially assimilated or non-assimilated), morphology (not evident or evident from the ending/affix) and part of speech (noun phrase, verb phrase etc.).

6 Conclusion

The paper presented the Janes corpora, tools and manually annotated datasets developed for the processing and analysis of non-standard Slovene found on social media and networking sites. In comparison to classic web corpora, the Janes corpora stand out for their carefully preserved source data structure and rich metadata. The novelty in the corpus preprocessing toolchain is the high-quality normalisation of non-standard words prior to morphosyntactic tagging and lemmatisation. Another original contribution is the automatic assignment of technical and linguistic standardness measures to the texts in the corpus, which enables more focused linguistic research and facilitates the development of tools for processing non-standard Slovene. All the developed tools are available on GitHub. A major achievement of the project is also the set of steps taken to address privacy and terms-of-use restrictions, enabling the dissemination of the Janes corpora and datasets as widely as possible through the CLARIN.SI infrastructure.

In addition to scientific publications, the developed resources, tools and research methods have been, in the scope of the project, disseminated through a wide range of training events: summer schools for high school and university students, workshops and seminars for researchers of Slovene and for linguists interested in South Slavic languages.Footnote 14

In future work we would like to offer annual upgrades of the Twitter subcorpus, and continue improving the quality of the processing tools, both by extending their training sets and investigating new methods in annotation, in particular neural networks. We would also like to capitalise on the produced corpus and datasets for further linguistic investigations. With the experience gained in gathering and processing UGC we are also moving to new areas, in particular investigation of socially unacceptable discourse in Slovene UGC (Fišer et al. 2017).