1 Introduction

India is a land of many languages, and people on social media often use more than one language to express themselves. Multilingual speakers tend to exhibit code-mixing and code-switching on social media platforms, which poses problems for language processing. Code-mixing refers to the mixing of two or more languages or language varieties in speech. It occurs for various reasons. According to [1], the major reasons for code-mixing are real lexical needs (45%), talking about a particular topic (40%) and content clarification (5%). [3] noted that the complexity of analyzing code-mixed social media text (CMST) stems from non-adherence to a formal grammar, spelling variations, lack of annotated data, the inherently conversational nature of the text and, of course, code-mixing itself. There is therefore a need to create datasets and Natural Language Processing tools for code-mixed social media text, as traditional tools are ill-equipped for it. As a step in this direction, we present our work on building a POS tagger for Konkani-English code-mixed data collected from the social media site Facebook.

2 Related Work

Code-mixing, being a relatively new phenomenon, has gained the attention of researchers only in the past two decades. POS taggers on monolingual data achieve an accuracy of about 97.3% for English text [6]. POS tagging is usually treated as a sequence labeling problem, and taggers use context-based information in the form of lexical and sub-lexical characteristics of neighboring words. In a code-mixed setting, however, the context can be in a different language, which makes tagging more difficult. [3] reported challenges in processing Hindi-English CMST and performed initial experiments on POS tagging. Their POS tagger accuracy fell by 14% to 65% without gold language labels and normalization. Thus, language identification and normalization are critical for POS tagging [3]. [7] also built a POS tagger for Hindi-English CMST using Random Forests on 2,583 utterances with gold language labels and achieved an accuracy of 79.8%.

[8] further improved this POS tagger, increasing the accuracy to 93%. [11] worked on a complete pipeline for shallow parsing, performing tokenisation, language identification, normalisation, POS tagging and finally shallow parsing, and achieved an accuracy of 83.4% for code-mixed Hindi-English social media text.

3 Data Preparation

Significant studies and a dataset of the code-mixing phenomenon can be found in [2], which discusses the preparation and statistics of a Konkani-English code-mixed dataset as well as its linguistic nature. For POS tagging of Konkani-English text, we extracted the code-mixed corpus discussed in [2]. We then manually annotated each word with its language, its normalised form and its POS tag.

3.1 Dataset Annotation Guidelines

The creation of this linguistic resource involved three annotation layers: language identification, normalisation and POS tagging. The following paragraphs describe the annotation guidelines for these tasks in detail.

  1. Language Identification: Every word was given one of three tags, ‘en’, ‘kn’ or ‘rest’, to mark its language. Words that a bilingual speaker could identify as belonging to either Konkani or English were marked as ‘kn’ or ‘en’, respectively. The label ‘rest’ was given to symbols, emoticons, punctuation, named entities, acronyms and foreign words.

  2. Normalisation: Words with language tag ‘kn’ written in Roman script were labeled with their standard form in the native script of Konkani, Devanagari, i.e. a back-transliteration was performed. Words with language tag ‘en’ were labeled with their standard spelling. Words with language tag ‘rest’ were kept as they are.

  3. Part-of-Speech Tagging: The universal part-of-speech tagset [9] was used to label the POS of each word, as this tagset is applicable to both English and Konkani words and has a level of coarseness that suited our goals. The following case-specific guidelines were also observed:

    1. Sub-lexical code-mixed words were annotated based on their context, since POS is a function of a word in a given context.

    2. Words embedded in a sentence of another language were tagged as per the context of the matrix language, irrespective of the POS tag of the word in its original language.
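The three annotation layers can be pictured as one record per token. The following is a minimal sketch of such a record with a guideline check, not the annotation tool actually used. The class name, field names and `validate` helper are hypothetical, and the tag inventory assumes that [9] refers to the twelve-tag universal tagset.

```python
from dataclasses import dataclass

# Language tags from the guidelines above.
LANG_TAGS = {"en", "kn", "rest"}

# Assumption: [9] is the twelve-tag universal POS tagset.
UNIVERSAL_POS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}


@dataclass(frozen=True)
class AnnotatedToken:
    """One token with its three annotation layers (hypothetical schema)."""
    surface: str      # token as written in the post
    lang: str         # 'en', 'kn' or 'rest'
    normalised: str   # standard form after normalisation
    pos: str          # universal POS tag


def is_devanagari(text: str) -> bool:
    """True if the text contains at least one Devanagari code point."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)


def validate(token: AnnotatedToken) -> list[str]:
    """Return guideline violations; an empty list means the token passes."""
    problems = []
    if token.lang not in LANG_TAGS:
        problems.append(f"unknown language tag: {token.lang!r}")
    if token.pos not in UNIVERSAL_POS:
        problems.append(f"unknown POS tag: {token.pos!r}")
    if token.lang == "rest" and token.normalised != token.surface:
        problems.append("'rest' tokens must be kept as they are")
    if token.lang == "kn" and not is_devanagari(token.normalised):
        problems.append("'kn' tokens must be normalised to Devanagari")
    return problems


# Illustrative tokens only; they are not taken from the corpus.
print(validate(AnnotatedToken("gud", "en", "good", "ADJ")))   # passes
print(validate(AnnotatedToken(":)", "rest", ":-)", "X")))     # 'rest' was altered
```

A check of this kind only catches mechanical slips such as an altered ‘rest’ token; whether a tag is linguistically right still depends on the annotator's reading of the context.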

4 Experiments and Results

Here, we repeated the experiments performed by [2] and added a new component to the system. The original system first tokenizes an utterance into words. Then, a language identification module classifies each word as Konkani, English or Rest. Based on the language assigned, the Normalisation module runs the Konkani or the English normaliser. In this section, we explain the POS tagging system that runs after the Normalisation system.
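The flow of the original system can be sketched as a small pipeline that takes its modules as arguments. This is only an illustration of the control flow described above: `run_pipeline`, `toy_language_id`, the toy lexicon and the toy normalisers are all hypothetical stand-ins, not the trained modules from [2].

```python
from typing import Callable, Dict, List

Normaliser = Callable[[str], str]


def run_pipeline(
    utterance: str,
    tokenize: Callable[[str], List[str]],
    identify_language: Callable[[str], str],
    normalisers: Dict[str, Normaliser],
) -> List[dict]:
    """Tokenise, label each word's language, then apply the matching normaliser."""
    results = []
    for word in tokenize(utterance):
        lang = identify_language(word)        # 'kn', 'en' or 'rest'
        normalise = normalisers.get(lang)     # no normaliser is registered for 'rest'
        normalised = normalise(word) if normalise else word
        results.append({"word": word, "lang": lang, "norm": normalised})
    return results


# Toy stand-ins so the sketch runs; real modules would be trained classifiers.
ENGLISH_WORDS = {"gud", "good", "morning"}
SPELLING_FIXES = {"gud": "good"}


def toy_language_id(word: str) -> str:
    """Label symbol-only tokens 'rest', lexicon words 'en', anything else 'kn'."""
    if not any(ch.isalpha() for ch in word):
        return "rest"
    return "en" if word.lower() in ENGLISH_WORDS else "kn"


TOY_NORMALISERS: Dict[str, Normaliser] = {
    "en": lambda w: SPELLING_FIXES.get(w.lower(), w.lower()),
    # Placeholder: a real Konkani normaliser would back-transliterate to Devanagari.
    "kn": lambda w: w,
}

for token in run_pipeline("gud morning !", str.split, toy_language_id, TOY_NORMALISERS):
    print(token)
```

Passing the modules in as callables keeps the sketch independent of any particular classifier, so a POS tagging step can be appended to the same loop without changing the earlier stages.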

4.1 Part-of-Speech Tagging System

Part-of-speech (POS) tagging provides a basic level of syntactic analysis for a given word or sentence. It was modeled as a sequence labeling task using CRFs [10] and SVMs, following earlier work. The feature set comprised:

  1. Basic Word Features: Word-based features such as affixes, context and the word itself.

  2. LANG: Language label of the token, obtained from the Language Identification system. This can take the values ‘en’, ‘kn’ or ‘rest’.

  3. NORM: Lexical features extracted from the normalised form of the word. These include linguistic features such as bound and free morphemes, suffixes and prefixes.

  4. TPOS: Output of the Twitter POS tagger [8] for the given word.

  5. KPOS: Output of the Konkani POS tagger for the given word.
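As an illustration, the five feature groups could be assembled into one feature dictionary per token, a shape that dictionary-based CRF toolkits commonly accept. This is a sketch under stated assumptions rather than the feature extractor used in the experiments: the function names and input keys are hypothetical, the affix length of three characters is arbitrary, and the NORM morpheme features are approximated by character prefixes and suffixes.

```python
from typing import Dict, List

Token = Dict[str, str]  # expected keys: word, lang, norm, tpos, kpos (hypothetical)


def token_features(sent: List[Token], i: int) -> Dict[str, object]:
    """Build the feature dictionary for the i-th token of a sentence."""
    tok = sent[i]
    word, norm = tok["word"], tok["norm"]
    feats: Dict[str, object] = {
        # Basic word features: the word itself and its affixes.
        "word.lower": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        # LANG: label from the language identification system.
        "LANG": tok["lang"],
        # NORM: features of the normalised form (affixes stand in for morphemes).
        "NORM": norm,
        "NORM.prefix3": norm[:3],
        "NORM.suffix3": norm[-3:],
        # TPOS / KPOS: outputs of the two external taggers.
        "TPOS": tok["tpos"],
        "KPOS": tok["kpos"],
    }
    # Context: neighbouring words and their language labels.
    if i > 0:
        feats["-1:word.lower"] = sent[i - 1]["word"].lower()
        feats["-1:LANG"] = sent[i - 1]["lang"]
    else:
        feats["BOS"] = True   # beginning of sentence
    if i < len(sent) - 1:
        feats["+1:word.lower"] = sent[i + 1]["word"].lower()
        feats["+1:LANG"] = sent[i + 1]["lang"]
    else:
        feats["EOS"] = True   # end of sentence
    return feats


def sentence_features(sent: List[Token]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in the sentence."""
    return [token_features(sent, i) for i in range(len(sent))]
```

Keeping each group under its own key prefix makes it straightforward to switch a group on or off when measuring its contribution to accuracy.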

To obtain the Konkani POS tags, the output of the normalisation module was used, with Romanised Konkani words transliterated into WX notation. Konkani POS tags were obtained using the CDAC Konkani POS tagger (http://kbcs.in/tools.html). This POS tagger is trained on WX notation, so English and ‘rest’ words were also transliterated into WX notation. These transliterations, along with the normalised Konkani data, were sent to the POS tagger to obtain the final POS tags. Each feature was retained only if it increased the system accuracy. Table 1 presents the feature ablation results for the POS tagger.
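The rule of keeping a feature only when it raises accuracy amounts to a greedy forward selection. The sketch below shows one way to express that loop together with a token-level accuracy measure. It is an assumed reading of the procedure, not the authors' code: `greedy_feature_selection`, `token_accuracy` and the `evaluate` callback (which would train and score a tagger for a given feature set) are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


def token_accuracy(gold: Sequence[Sequence[str]], pred: Sequence[Sequence[str]]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = correct = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for g, p in zip(gold_sent, pred_sent):
            total += 1
            correct += int(g == p)
    return correct / total if total else 0.0


def greedy_feature_selection(
    baseline: Sequence[str],
    candidates: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Try candidates in order; keep one only if it strictly improves the score."""
    selected = list(baseline)
    best = evaluate(selected)
    for feature in candidates:
        score = evaluate(selected + [feature])
        if score > best:              # positive increase: keep the feature
            selected.append(feature)
            best = score
    return selected, best
```

With the basic word features as `baseline` and `["LANG", "NORM", "TPOS", "KPOS"]` as `candidates`, the function returns the retained feature groups and the final score. The outcome of such a greedy pass can depend on the order in which candidates are tried.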

Table 1. Token level POS Tagger Accuracy

5 Conclusion and Future Work

In this paper, we have focused on building the first step of a shallow parser for Konkani-English code-mixed social media data. We presented our efforts at applying various statistical methods to POS tagging of code-mixed social media data. We have attempted to build a part-of-speech tagger for this language pair, which we hope will lead to better data mining and sentiment analysis across the Indian subcontinent. We also created a standard dataset of 5088 code-mixed Konkani-English sentences for building supervised models of shallow parsing on this data, which we consider our immediate future work. In the future, we intend to continue creating more annotated code-mixed social media data and to use this dataset to build tools for code-mixed data such as morph analysers, chunkers and parsers.