1 Introduction

The rapid growth of the web and the plethora of options that social media provide have increased the population of web users, especially in the most developed countries. As a result, large amounts of written web posts are produced on a daily basis. The automatic extraction of information from these online data concerns not only the text itself but also the gender, age and other demographic characteristics of the user, which are essential in e-government, security and e-commerce applications.

The detection of demographic information, and more specifically of age, among social media users may be important not only for commercial and sociological purposes but also for security reasons. Teenagers often use social media without adult supervision, a situation that can be fatal in extreme cases. It is thus important to be able to automatically estimate the age of an internet user from his/her written input on the web. Apart from security, estimating the user’s age can help in detecting the different trends, opinions, and political and social views of each age group. This can enable social scientists to derive important clues about the anthropography of social media users and about how different age groups behave online. Market analysts and advertisers may also be interested in studies of this kind, in order to promote a product or a service in an age-targeted way, according to the interests and opinions that users express.

Most studies on age identification treat the issue as a classification problem. In this article, instead of following an age category classification approach, we treat age estimation of bloggers as a numerical estimation problem and investigate the appropriateness of several regression algorithms for the task. In order to evaluate the performance of the regression methods, we relied on several text-based features that have been widely used in the literature for text classification, authorship attribution, and gender and age identification. The remainder of this paper is organized as follows: Sect. 2 presents the state of the art in theoretical and automatic age estimation. Section 3 describes the methodology followed for age estimation from web posts. Section 4 presents the experimental setup and the achieved results. Finally, Sect. 5 concludes this work.

2 Background Work

People of different age, gender, educational level, professional activity and geographical origin make different linguistic choices because of these social factors [1]. Matching a linguistic attitude to the corresponding social group is one of the objectives of sociolinguistics. Several sociolinguistic studies on age variation [2, 3] observed that teenagers use language in a more creative and unconventional way, producing new forms, whereas adults prefer more standard forms. Semantic neologisms, slang, loanwords and code expressions are produced by teens, whereas adults tend to have a more conservative linguistic attitude. This can be explained by the social role in the production/work cycle and the family responsibilities that adulthood brings, whereas teens and older people tend towards a “looser” use of language [4].

Whereas sociolinguistic research on age variation rests on theoretical and empirical findings, recent studies in text mining use machine learning algorithms and natural language processing methods for the automatic estimation of an author’s age. Schler et al. [5] created the “Blog Authorship Corpus” in order to identify the author’s age and gender. They used style-related features and content-based characteristics to detect gender and age, and observed that specific forms and unigrams are more frequent among young bloggers and that blogging style and topics differ among the 10’s, 20’s and 30’s age groups. Argamon et al. [6] used the corpus from their previous study [5] in order to go deeper into gender and age mining from text. They used stylistic and content-based features to demonstrate the significant variation in blogging between different genders and ages. Goswami et al. [7] performed a stylometric analysis in terms of gender and age, using non-dictionary forms and sentence length as features. Slang, smileys, out-of-dictionary words and chat abbreviations on the one hand, and sentence length on the other, proved to be highly distinctive among different ages and genders. Tam and Martell [8] performed age classification experiments using Bayesian and SVM classifiers. They extracted character n-grams and word meta-data features in order to classify the “NPS Chat Corpus” into five age groups. Peersman et al. [9] implemented age classification on short texts, using chat words along with character-based features, and achieved an accuracy of more than 88 %. Other studies on age prediction [10, 11] showed that content and stylistic features are highly significant and that, when the online users’ activity is added, the classification accuracy increases to approximately 80 % [10].

In their overview of PAN 2013, Rangel et al. [12] presented the different feature sets used by the participants in the Author Profiling Task, which were grouped into stylistic-based, content-based, n-gram-based, IR-based and collocation-based features. Many of the participants dealt with age detection; Flekova & Gurevych [13] focused on age and gender using surface, syntactic and punctuation, readability, semantic, content, lexical and stop-word features. They observed that age and gender profiling are not independent issues but are determined by the same features. Rangel & Rosso [14] used the PAN-AP-13 dataset to perform age and gender classification experiments, using features based on cognitive traits from neurology studies. Their approach was more effective in age than in gender prediction, and they demonstrated differences in language use between ages in both English and Spanish. The work in [15] is a comprehensive study of personality, gender and age detection of Facebook users. Standard approaches were implemented and a particular method was proposed for linguistic analysis and evaluation in terms of personality, age and gender, with reliable results, a contribution to interdisciplinary research, and the suggestion of new hypotheses and insights. Nguyen et al. [16] studied language use among different age categories of Twitter users. Their analysis showed that differences in style, references, conversation and sharing depended not only on the age category, but also on the life stage and the actual age of the user. Lately, author profiling has become a multilingual effort, and [17] is one of several studies carried out on a non-English corpus, supporting stylometric research and age, gender, opinion, authorship and personality experiments.

3 Proposed Age Estimation of Web Bloggers Using Regression Models

The estimation of an author’s age is a numerical estimation problem. Although some of the related work in the literature aims at identifying the age class to which the author belongs, according to some quantization of the age scale into intervals of interest, we aim at the direct estimation of the age value of the web blogger (i.e. the author). The problem is thus formulated as follows: each web blog post is represented by a feature vector \(V_n\), where \(n\) indexes the posts with \(1 \le n \le N\). A machine learning regression algorithm, \(f\), is used as a numerical estimator that assigns an age estimate, \(u\), to each feature vector \(V_n\), i.e. \(u=f(V_n)\).
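To make the formulation \(u=f(V_n)\) concrete, the following minimal sketch uses a k-nearest-neighbour regressor as the estimator \(f\). It is an illustration only: the function name `knn_regress`, the 3-dimensional toy vectors and the ages are assumptions made for the example and are not taken from the corpus or from the WEKA implementation.

```python
import math

def knn_regress(train_vectors, train_ages, query, k=3):
    """Estimate an age u = f(V_n) as the mean age of the k nearest training posts."""
    # Euclidean distance from the query vector to every training vector
    distances = [
        (math.dist(vector, query), age)
        for vector, age in zip(train_vectors, train_ages)
    ]
    # Keep the k closest posts and average the ages of their authors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(age for _, age in nearest) / len(nearest)

# Hypothetical 3-dimensional feature vectors (toy data, not from the corpus)
train_vectors = [(0.1, 0.9, 0.2), (0.2, 0.8, 0.1), (0.8, 0.1, 0.7), (0.9, 0.2, 0.9)]
train_ages = [15, 17, 34, 40]

estimate = knn_regress(train_vectors, train_ages, query=(0.15, 0.85, 0.15), k=2)
print(estimate)  # 16.0, the mean age of the two nearest posts
```

Unlike a classifier, the estimator returns a real-valued age, so the output is not restricted to predefined age intervals.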

To represent each web blog post as a feature vector, we used a number of well-known features that are widely used in text-based analyses and have been applied to the tasks of author, gender and age identification. The features are normalized and are presented in Table 1. The resulting feature vector has length 42, i.e. \(V_n \in \mathfrak{R}^{42}\).
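As a rough illustration of how a post can be mapped to a normalized feature vector, the sketch below computes five simple stylistic features. These five features, the function name `extract_features` and the example post are assumptions chosen for the example; they are not the 42 features of Table 1.

```python
import re

def extract_features(post):
    """Map a blog post to a small vector of length-normalized stylistic features."""
    words = re.findall(r"[A-Za-z']+", post)
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    # Guard against division by zero for empty posts
    n_chars = max(len(post), 1)
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)
    return [
        sum(len(w) for w in words) / n_words,           # average word length
        len(words) / n_sentences,                        # average sentence length in words
        sum(ch in ".,;:!?" for ch in post) / n_chars,    # punctuation ratio
        sum(ch.isupper() for ch in post) / n_chars,      # upper-case ratio
        sum(ch.isdigit() for ch in post) / n_chars,      # digit ratio
    ]

# Hypothetical example post
features = extract_features("Hello world. OMG this is so cool!")
print(features)
```

Dividing by the number of words, sentences or characters keeps the values comparable between short and long posts, which is the purpose of normalization here.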

Table 1. The description of features used in our study.

For the regression stage, we relied on a number of dissimilar machine learning algorithms, which have extensively been reported in the literature. In particular, we used:

  • The multilayer perceptron neural network (MLP) with three layers, which is capable of numerical prediction [18], since neurons are isolated and region approximations can be adjusted independently of each other,

  • The support vector machines (SVM) for regression using the sequential minimal optimization algorithm and two different kernels, the radial basis kernel (rbf) and the polynomial kernel (poly),

  • The M5 model tree (M5P) algorithm, which is a rational reconstruction of the M5 method,

  • The K-nearest neighbors algorithm (IBk),

  • The RepTree, a fast decision tree learner, which builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning (with back-fitting). RepTree only sorts values for numeric attributes once and missing values are dealt with by splitting the corresponding instances into pieces (i.e. as in C4.5),

  • The Additive Regression meta-classifier, which enhances the performance of a regression base classifier, combined with DecisionStump, SVMs with polynomial and radial basis kernels, and RepTrees,

  • The bagging algorithm, which aims to reduce variance, combined with RepTree and SVMs with polynomial and radial basis kernels,

  • The M5Rules, which generates a decision list for regression problems using separate-and-conquer. In each iteration, it builds a model tree using M5 and makes the “best” leaf into a rule.

All regression algorithms were implemented using the WEKA machine learning toolkit [19].
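The idea behind bagging a regression base learner can be sketched in a few lines. The example below is not the WEKA implementation: for brevity it swaps the REPTree base learner for a one-split regression stump, and the function names, the number of models, the seed and the toy data are all assumptions made for illustration.

```python
import random

def fit_stump(X, y):
    """Fit a one-split regression stump that minimizes the squared error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [t for row, t in zip(X, y) if row[feature] <= threshold]
            right = [t for row, t in zip(X, y) if row[feature] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = sum((t - left_mean) ** 2 for t in left) + sum(
                (t - right_mean) ** 2 for t in right
            )
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_mean, right_mean)
    if best is None:  # no valid split: predict the mean target everywhere
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean

def fit_bagging(X, y, n_models=25, seed=0):
    """Train each base model on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagging(models, row):
    """Average the base predictions, which is what reduces the variance."""
    return sum(predict_stump(m, row) for m in models) / len(models)

# Hypothetical one-dimensional toy data: feature value and author age
X = [(0.1,), (0.2,), (0.3,), (0.7,), (0.8,), (0.9,)]
y = [14, 15, 16, 35, 38, 41]

models = fit_bagging(X, y)
young = predict_bagging(models, (0.15,))
old = predict_bagging(models, (0.85,))
print(young, old)
```

Each base model sees a slightly different resampling of the training set, so individual models disagree, while their average is more stable than any single one.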

4 Experimental Setup and Results

For the present evaluation we used the “Blog Authorship Corpus” [5], a collection of blog posts from 19,320 bloggers. The posts were gathered from blogger.com in August 2004. The corpus contains 681,288 posts and over 140 million words, which corresponds to approximately 35 posts and 7,250 words per blogger. The bloggers fall into three age categories: 10’s, 20’s and 30’s. The 10’s age group consists of 8,240 blogs whose authors are between 13 and 17 years old. The 20’s group consists of 8,086 blogs by authors aged 23–27. Finally, the 30’s age group contains 2,994 blogs by bloggers between 33 and 47 years old. Each blog is stored in a separate file containing the blogger’s posts, the blogger’s id number, his/her gender, his/her exact age and, in many cases, other anonymised pieces of personal information.
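The corpus figures quoted above can be cross-checked with simple arithmetic. In the short sketch below the variable names are arbitrary, and the word total is taken as exactly 140 million, which is only a lower bound since the corpus contains "over" that number.

```python
bloggers_per_group = {"10s": 8_240, "20s": 8_086, "30s": 2_994}
total_bloggers = sum(bloggers_per_group.values())
total_posts = 681_288
total_words = 140_000_000  # assumption: lower bound of "over 140 million" words

posts_per_blogger = total_posts / total_bloggers
words_per_blogger = total_words / total_bloggers
# Fraction of all bloggers that falls into each age group
shares = {group: n / total_bloggers for group, n in bloggers_per_group.items()}

print(total_bloggers)               # 19320
print(round(posts_per_blogger, 1))  # 35.3
print(round(words_per_blogger))     # 7246
print({group: round(share, 3) for group, share in shares.items()})
```

The three group sizes sum to the stated 19,320 bloggers, and the per-blogger averages agree with the rounded values of 35 posts and 7,250 words. The shares also show that the 30’s group is considerably smaller than the other two.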

The regression algorithms were evaluated on the “Blog Authorship Corpus” for the task of age estimation, using the features described in the previous section. Performance was measured using the mean absolute error (MAE) and the root mean squared error (RMSE) of the difference (i.e. the error) between the actual and the estimated age of each web blogger. In order to avoid overlap between training and test subsets, a 10-fold cross-validation protocol was followed. The experimental results for the evaluated regression algorithms in terms of MAE and RMSE are tabulated in Table 2. The best performance for each metric is indicated in bold.
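For reference, with actual ages \(a_i\) and estimated ages \(u_i\) over \(M\) test samples, the two metrics are \(\mathrm{MAE}=\frac{1}{M}\sum_{i}|a_i-u_i|\) and \(\mathrm{RMSE}=\sqrt{\frac{1}{M}\sum_{i}(a_i-u_i)^2}\). The sketch below implements them together with a simple 10-fold index split. The symbols \(a_i\), \(u_i\) and \(M\), the function names and the toy ages are assumptions for illustration, and the interleaved fold assignment is one possible choice, not necessarily the one WEKA uses.

```python
import math

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    # Interleaved assignment: sample i belongs to fold i mod k
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Hypothetical actual and estimated ages
actual = [15, 24, 36, 17]
predicted = [18, 24, 30, 16]

mae = mean_absolute_error(actual, predicted)       # (3 + 0 + 6 + 1) / 4 = 2.5
rmse = root_mean_squared_error(actual, predicted)  # sqrt((9 + 0 + 36 + 1) / 4)
splits = list(k_fold_indices(25, k=10))
print(mae, round(rmse, 2))  # 2.5 3.39
```

Because the errors are squared before averaging, the single 6-year error dominates the RMSE, which is why RMSE is the more sensitive of the two metrics to large misestimates. The sketch splits generic sample indices; whether the folds in the experiments were formed over posts or over bloggers is not stated in the text.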

As can be seen in Table 2, the best-performing algorithm was Bagging with the RepTree base learner, achieving MAE and RMSE equal to 5.44 and 7.15, respectively. The second and third best performances were achieved by the standalone RepTree regression algorithm and by Additive Regression with the RepTree base learner, with MAE approximately equal to 5.67. The results show the appropriateness of the RepTree regression algorithm for the task of age estimation from web blog posts, since it outperforms all the other algorithms both as a base learner within a meta-classification scheme and as a standalone regression algorithm. The superiority of the RepTree regression algorithm is not restricted to the MAE criterion but is also observed in the RMSE criterion. Since RMSE penalizes large errors more heavily than MAE, this suggests that RepTrees produce fewer large estimation errors than the rest of the evaluated algorithms.

The only regression algorithm found to have performance comparable to RepTrees was the SVM with polynomial kernel, which performed slightly worse both as a standalone algorithm and as the base learner of a meta-classification algorithm. The good performance of the SVM algorithm is probably due to the fact that SVMs do not suffer from the curse of dimensionality. The worst performance was achieved by the M5Rules and M5P regression algorithms, which are based on model trees, in contrast to the best-performing RepTree, which is a regression tree. Their low performance is probably due to the fact that they exploit potential linearity at the leaf nodes and construct hard-decision rules based on the best leaf.

Table 2. Age estimation MAE and RMSE per regression algorithm.

5 Conclusion

We presented an evaluation of regression algorithms for the estimation of web bloggers’ age. For the estimation we relied on a number of text-based characteristics that are typical in text classification tasks related to gender and age identification, and constructed one feature vector for each blogger’s post. The evaluation results showed that age estimation can be adequately performed using regression methods. The RepTree algorithm outperformed all the other evaluated regression algorithms and achieved accurate age estimations both as a standalone regression algorithm and as the base learner of a meta-classification method. The application of regression algorithms on age categories dramatically increased age estimation accuracy in terms of both mean absolute error and root mean squared error, which indicates that age category classification followed by age regression per category would offer robust estimation of the web bloggers’ age.