Abstract
Fundamental Big Five personality traits (e.g., Extraversion) and their facets (e.g., Activity) are known to correlate with a broad range of linguistic features and, accordingly, the recognition of personality traits from text is a well-known Natural Language Processing task. Labelling text data with facet information, however, may require the use of lengthy personality inventories, and perhaps for that reason existing computational models of this kind are usually limited to the recognition of the fundamental traits. Based on these observations, this paper investigates the issue of personality facet recognition from text labelled only with information available from a shorter personality inventory. In doing so, we provide a low-cost model for the recognition of certain personality facets, and present reference results for further studies in this field.
1 Introduction
The Big Five personality model [4] comprises five fundamental categories of personality (Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness to experience), which are further divided into dozens of more specific facets. For instance, the Neuroticism category includes facets such as Anxiety and Depression. Big Five categories are strongly correlated with (and possibly defined by) language use and, as a result, the recognition of an individual’s personality traits from text is a well-established task in the Natural Language Processing (NLP) field [14].
Models for the recognition of personality traits from text are usually based on supervised machine learning methods that take as input a text corpus labelled with personality scores. These scores, in turn, are computed from a range of personality inventories (or questionnaires) such as the BFI-44 inventory [7]. The BFI-44 is a relatively short inventory of 44 multiple-choice items such as ‘I see myself as someone who is depressed, blue’. Items are answered on a scale from one (disagree strongly) to five (agree strongly).
Knowing the five fundamental categories of personality of an individual may be sufficient for a number of practical applications. For others, however, a more detailed assessment of personality facets may be called for. Assessing personality facets usually involves the use of a more extensive personality inventory, such as the 240-item NEO-PI-R [8]. From a computational perspective, however, large or complex inventories of this kind may be impractical, which may explain why studies on personality recognition from text [9, 11, 14, 17] are usually limited to the five main personality categories obtainable from short inventories such as the BFI-44.
Despite these difficulties, a compromise between convenience (as in the BFI-44) and expressiveness (as in NEO-PI-R) may still be possible. In particular, we notice that the work in [18] provided evidence that, although most facets cannot be explicitly captured by the BFI-44, a small subset of 10 facets (two from each of the main Big Five factors) is inferable from this short scale. Thus, it may be possible to obtain at least some of the facet labels available from NEO-PI-R at a much lower cost.
Based on these observations, the actual NLP question to be investigated in this paper is whether the 10 additional facets proposed in [18] may be automatically recognised from text labelled with BFI-44 information only. To this end, we developed a series of binary classifiers for Big Five facet recognition from a labelled corpus of Brazilian Facebook status updates, and we present reference results for further studies in this field. To the best of our knowledge, our work is the first attempt to learn personality facets in this way, and it is most likely the first of its kind to be devoted to the Brazilian Portuguese language.
2 Related Work
We are not aware of any large-scale work on Big Five facet recognition from text, but there is a wide range of studies focused on the more general task of recognising its main five personality categories. Given that the applicable methods are presumably similar, in what follows we briefly review a number of instances of the latter.
The work in [9] presents a comprehensive view of the personality recognition task from multiple computational perspectives (i.e., as classification, regression and ranking tasks), by comparing the use of written essays and speech corpora as input data, and by comparing the use of self-reported Big Five scores and those produced by specialists, among other issues. The study makes extensive use of psycholinguistic features provided by the LIWC [12] and MRC [3] databases, and results suggest that the combination of ranking algorithms, speech as input data, and personality reports produced by specialists works best.
In contrast to the psycholinguistics-motivated features in [9] and others, the work in [11] makes use of n-gram models to classify extremes of personality using both Naive Bayes and SVM models. Evaluation based on a corpus of personal blogs achieves a maximum accuracy of 65%.
In the context of the PAN-CLEF shared task series [14], a number of supervised models of personality recognition based on Twitter data labelled with personality scores obtained from a 10-item Big Five inventory have been developed. These include the overall winner of the competition [1], which combines second-order attributes with an LSA text representation; the work in [5], which makes use of character and POS n-gram models; and the work in [19], which makes use of TF-IDF counts and stylistic features. For details, we refer to [14].
3 Personality Facet Recognition
The present study aims to compare a number of models of personality facet recognition from text. More specifically, we consider the set of 10 personality facets that, according to the method discussed in [18], may be inferred from the BFI-44 inventory [7]: Assertiveness and Activity facets (under the main Extraversion category), Altruism and Compliance (under Agreeableness), Order and Self-discipline (under Conscientiousness), Anxiety and Depression (under Neuroticism), and Aesthetics and Ideas (under Openness to experience).
The method proposed in [18] consists of a series of theoretically-motivated calculations (in addition to those already performed to obtain the basic Big Five personality scores) over the set of 44 responses provided by the BFI-44 inventory. Thus, provided that the full set of BFI-44 responses about an individual is known, computing these 10 additional facet scores is straightforward.
For instance, according to [18], the Activity facet of the Big Five Extraversion category is defined as the simple average of two of the BFI-44 scores from which the main Extraversion score is obtained in the first place. In the present work, these facet scores are therefore taken as given, and we do not discuss the underlying method to obtain them. For details, see [18].
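As a minimal sketch of the kind of calculation involved, the code below averages a subset of inventory responses to obtain a facet score. The helper name `facet_score` and the item numbers are placeholders introduced here for illustration only; the actual item-to-facet mapping (and the treatment of any reverse-keyed items) is the one defined in [18].

```python
from statistics import mean

def facet_score(responses, item_ids):
    """Average the responses to the given inventory items.

    responses: dict mapping item number -> numeric answer.
    item_ids:  items assumed to define the facet (placeholders here,
               not the actual mapping from [18]).
    """
    return mean(responses[i] for i in item_ids)

# Hypothetical answers of one participant to two hypothetical items.
responses = {3: 4, 17: 2}
activity = facet_score(responses, item_ids=[3, 17])
```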
Following existing work on Big Five personality recognition for the English language and others [9, 11], personality facet recognition is presently regarded as a set of independent binary classification tasks. To this end, a document is to be labelled as a positive instance of a given facet if the corresponding author shows an above-average score for that facet when considering the entire set of authors in the domain. Since personality facets are, by definition, independent from each other [4], each document is to be assigned ten individual labels corresponding to each facet, which are to be classified one at a time.
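The labelling rule described above can be sketched as follows, assuming that facet scores are available as one value per author and that "above average" means strictly greater than the mean over all authors. The function name `binarise` and the toy scores are illustrative assumptions, not part of the original study.

```python
from statistics import mean

def binarise(scores):
    """Label each author 1 if their facet score is above the mean, else 0."""
    threshold = mean(scores.values())
    return {author: int(score > threshold) for author, score in scores.items()}

# Toy facet scores for four hypothetical authors.
anxiety = {"a1": 2.0, "a2": 3.5, "a3": 4.0, "a4": 2.5}
labels = binarise(anxiety)
```

Each document would then inherit the label of its author, and the procedure would be repeated once per facet to obtain the ten labels.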
4 Experiment
4.1 Overview
We devised an experiment to compare three binary classifiers for personality facet recognition from text:
- BoW: bag-of-words features from the 3000 most frequent words in corpus
- skip: average word vectors obtained from a skip-gram-1000 model
- cbow: average word vectors obtained from a cbow-1000 model
The BoW model is built using Naive Bayes classification. Both skip and cbow models are built using logistic regression and pre-trained word embeddings computed from a 150-million Brazilian Twitter corpus using word2vec [10] with window size = 5 and min_count = 10. In addition to these three classifiers, we also consider a simple Majority class baseline system for illustration purposes.
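The document representation used by the skip and cbow models can be illustrated with the sketch below, which averages the vectors of the words in a document. The function name, the toy 3-dimensional lookup table (standing in for the pre-trained word2vec vectors), and the handling of out-of-vocabulary words (skipped, with a zero vector as fallback) are assumptions made for illustration.

```python
def average_vector(tokens, embeddings, dim):
    """Mean of the embeddings of the known tokens; zeros if none is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Toy lookup table; real vectors would come from a trained word2vec model.
embeddings = {"bom": [1.0, 0.0, 2.0], "dia": [3.0, 2.0, 0.0]}
doc = ["bom", "dia", "pessoal"]  # "pessoal" is out of vocabulary here
features = average_vector(doc, embeddings, dim=3)
```

The resulting fixed-length vector would then serve as the input features of a classifier such as logistic regression.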
4.2 Data
We use the 2.2-million-word b5-post corpus of Brazilian Facebook status updates [13], comprising 194k status updates written by 1019 users, which are accompanied by self-reported BFI-44 [7] inventories filled in by every user. The b5-post corpus has previously been taken as the input to a number of author profiling tasks [6], including personality recognition [17].
The text portion of the corpus was subject to basic spell checking and term substitution (e.g., laugh expressions such as ‘haha’ were replaced by a common $LAUGH$ symbol). From the corpus inventories, 10 additional personality facets were inferred according to the method in [18]. This information constitutes the set of ten class labels for each document as discussed in the previous section.
4.3 Procedure
All models were built using 10-fold cross validation over the entire b5-post dataset. However, since we now intend to learn ten (facet) classes, and not only five (main categories), and since some facets may be considerably more sparse than others (e.g., the Depression facet of Neuroticism may be naturally less common than, say, Self-consciousness), data imbalance is a major concern in our work. As a means to alleviate this, we resort to SMOTE minority over-sampling [2] with \(k=5\) neighbours.
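The core idea of SMOTE [2] is to create synthetic minority examples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified variant written for illustration (base samples are drawn at random, and the function name, seed, and toy data are assumptions); it is not the implementation used in the experiments.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy 2-dimensional minority class with four samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay within the region already occupied by the minority class.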
5 Results
Table 1 shows reference results for the majority class baseline, and for the three models of interest. The first column represents mean F1 scores over the ten classification tasks, followed by the number of times (wins) in which each model was the overall winner, and the mean F1 measure for each individual class.
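The two summary statistics described above (mean F1 over the classification tasks, and the number of wins per model) can be derived from per-facet scores as in the sketch below. The function name and the toy scores are invented for illustration and are not the values reported in Table 1; ties are resolved in favour of the first model listed, which is also an assumption.

```python
from statistics import mean

def summarise(f1_by_model):
    """Return {model: (mean F1, wins)} from {model: {facet: F1}}."""
    facets = next(iter(f1_by_model.values())).keys()
    wins = {m: 0 for m in f1_by_model}
    for facet in facets:
        # The model with the highest F1 for this facet scores one win.
        best = max(f1_by_model, key=lambda m: f1_by_model[m][facet])
        wins[best] += 1
    return {m: (mean(s.values()), wins[m]) for m, s in f1_by_model.items()}

# Invented scores for two models and three facets (not the paper's results).
toy = {
    "BoW":  {"Anxiety": 0.50, "Order": 0.75, "Ideas": 0.25},
    "cbow": {"Anxiety": 0.75, "Order": 0.50, "Ideas": 0.50},
}
summary = summarise(toy)
```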
Although all models present a considerable improvement over our admittedly simple baseline, the distinction among them is narrow, particularly between BoW and skip. A slight advantage of the cbow model over the others is however noticeable in the number of classes (wins) for which cbow was the overall winner (7 out of 10 classification tasks).
As is usually the case in personality classification, some personality traits tend to be more evident from text than others. In the present setting, we notice that Compliance and Depression recognition were the most challenging tasks. However, it remains unclear whether these facets are less explicit in language use in general, or simply less explicit in our Facebook domain.
Finally, we notice that the present results are generally similar to those observed in Big Five personality classification in English [9] and other languages, and also along the lines of previous studies on the recognition of the main Big Five categories from the b5-post corpus [15, 16].
6 Final Remarks
This paper presented a number of models of Big Five facet recognition from a Brazilian Portuguese Facebook corpus and corresponding BFI-44 information. Our study suggests that, not unlike basic Big Five categories, the ten facets proposed in [18] may be recognised from text with reasonable accuracy when compared to a simple baseline system. In other words, our experiments suggest that we may in principle develop supervised models of personality recognition at a level of abstraction more specific than that obtainable from existing work, and without resorting to larger or more complex inventories to provide the required text labels.
The current work provides only initial reference results for further studies in this field, and a number of possible improvements are left as future work. In particular, we envisage the use of larger word embedding models and alternative learning architectures for this task, and further evaluation work by directly comparing our results against text labelled with actual facet information.
References
[1] Álvarez-Carmona, M., López-Monroy, A., Montes-y-Gómez, M., Villaseñor-Pineda, L., Escalante, H.: INAOE’s participation at PAN’15: author profiling task. In: CLEF 2015 (2015)
[2] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)
[3] Coltheart, M.: The MRC psycholinguistic database. Q. J. Exp. Psychol. Sect. A: Hum. Exp. Psychol. 33(4), 497–505 (1981)
[4] Goldberg, L.R.: An alternative description of personality: the Big-Five factor structure. J. Pers. Soc. Psychol. 59, 1216–1229 (1990)
[5] González-Gallardo, C., et al.: Tweets classification using corpus dependent tags, character and POS N-grams. In: CLEF 2015 (2015)
[6] Hsieh, F.C., Dias, R.F.S., Paraboni, I.: Author profiling from Facebook corpora. In: 11th International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan, pp. 2566–2570. ELRA (2018)
[7] John, O.P., Naumann, L.P., Soto, C.J.: Paradigm Shift to the Integrative Big-Five Trait Taxonomy: History, Measurement, and Conceptual Issues, pp. 114–158. Guilford Press, New York (2008)
[8] Costa Jr., P.T., McCrae, R.R.: Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI): Professional Manual. Psychological Assessment Resources, Odessa (1992)
[9] Mairesse, F., Walker, M., Mehl, M., Moore, R.: Using linguistic cues for the automatic recognition of personality in conversation and text. J. Artif. Intell. Res. (JAIR) 30, 457–500 (2007)
[10] Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of NAACL-HLT-2013, Atlanta, USA, pp. 746–751. Association for Computational Linguistics (2013)
[11] Nowson, S., Oberlander, J.: Identifying more bloggers: towards large scale personality classification of personal weblogs. In: Proceedings of the International Conference on Weblogs and Social Media, Boulder, Colorado, USA (2007)
[12] Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic Inquiry and Word Count: LIWC. Lawrence Erlbaum, Mahwah (2001)
[13] Ramos, R.M.S., Neto, G.B.S., da Silva, B.B.C., Monteiro, D.S., Paraboni, I., Dias, R.F.S.: Building a corpus for personality-dependent natural language understanding and generation. In: 11th International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan, pp. 1138–1145. ELRA (2018)
[14] Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: CLEF 2015 Evaluation Labs and Workshop, Toulouse, France (2015). CEUR-WS.org
[15] dos Santos, V.G., Paraboni, I., da Silva, B.B.C.: Big five personality recognition from multiple text genres. In: Ekštein, K., Matoušek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 29–37. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64206-2_4
[16] da Silva, B.B.C., Paraboni, I.: Learning personality traits from Facebook text. IEEE Latin Am. Trans. 16(4), 1256–1262 (2018). https://doi.org/10.1109/TLA.2018.8362165
[17] da Silva, B.B.C., Paraboni, I.: Personality recognition from Facebook text. In: Villavicencio, A., et al. (eds.) PROPOR 2018. LNCS (LNAI), vol. 11122, pp. 107–114. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99722-3_11
[18] Soto, C.J., John, O.P.: Ten facet scales for the Big Five Inventory: convergence with NEO PI-R facets, self-peer agreement, and discriminant validity. J. Res. Pers. 43(1), 84–90 (2009). https://doi.org/10.1016/j.jrp.2008.10.002
[19] Ṣulea, O.M., Dichiu, D.: Automatic profiling of twitter users based on their tweets. In: CLEF 2015 (2015)
Acknowledgements
This work received support from FAPESP grants 2017/06828-1 and 2016/14223-0.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
dos Santos, W.R., Paraboni, I. (2019). Personality Facets Recognition from Text. In: Crestani, F., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019. Lecture Notes in Computer Science, vol 11696. Springer, Cham. https://doi.org/10.1007/978-3-030-28577-7_15
DOI: https://doi.org/10.1007/978-3-030-28577-7_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28576-0
Online ISBN: 978-3-030-28577-7
eBook Packages: Computer Science, Computer Science (R0)