1 Introduction

In sociological research, Web 2.0 has triggered a shift from traditional social studies to big data mining of online social networks because of the volume and convenience of the data. In the field of computational social science, personality [1, 14] refers to a synthesis of the features that characterize how one responds to other people and situations. It has been widely acknowledged [21, 32, 40] that personality explains much of one's outward behavior. It therefore has broad applications [49] such as human-computer interaction [39, 54], mental healthcare [23], business analysis [29], and human resource management [7].

A generally accepted and influential metric in psychology for characterizing and measuring personality traits is the OCEAN Model (also known as the Big Five model) [13], which consists of five traits: Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism. Details of these five traits are introduced in Table 1.

Table 1 Overview of the OCEAN model

A widely used method to obtain personality labels is the Big Five Inventory (Footnote 1). Target users have to answer all of the questions in the inventory, and their reports are then delivered to specialists for further analysis [8, 17, 18, 43]. Nevertheless, there are three main problems with this approach. First, the reliability of the results depends to a great extent on the participant's temporary psychological state rather than his or her stable, long-term characteristics (for example, an otherwise outgoing boy may be upset by a recent breakup with his girlfriend when he completes the questionnaire). Second, the criteria of different experts may not be the same, leading to deviations in practice [50]. Third, this method is costly and time-consuming, and thus not a wise choice for human resource departments.

With the penetration of online social media into our lives, studying one's personality traits from online social records instead of traditional questionnaires has been of great interest to researchers because of its convenience and efficiency. Many studies [3, 4, 6, 38] have suggested the feasibility of detecting users' personality traits from the social texts they generate. Compared with traditional psychology research, which is limited by the huge cost of manual analysis, computer science has a unique advantage: aiming at automatic data analysis, research in computer science has the potential to deal with the challenge of big data. Regrettably, there is no large, openly published labeled personality dataset because of the enormous cost of data annotation and privacy protection policies, let alone labeled personality data with network topology. These constraints have become severe obstacles to more thorough research, because most related works can only apply supervised approaches to small datasets and treat their samples independently, ignoring mutual influence within the group, which is far from addressing the forthcoming challenge of massive data with limited annotations for personality analysis.

We identify three tough obstacles impeding further research. The first is how to introduce semi-supervised or unsupervised methods so that we can meet the challenge of massive, sparsely labeled data in online social media. Once this is solved, another obstacle emerges: how to make the methods scalable so that they can be used on large datasets. The third obstacle is how to analyze one's personality based on the influence of the group, so that we can leverage limited datasets more comprehensively for better performance. In light of these problems, we propose a novel method to detect one's personality from his or her generated texts. Our principal contributions are summarized as follows.

  • In order to meet the challenge of massive data with sparse personality labels in online social media, we are, to the best of our knowledge, the first to introduce network representation learning (NRL) into the field of personality detection. NRL is an unsupervised feature learning method and can easily be implemented as a distributed algorithm.

  • We are the first to predict one's personality traits based on collaborative identification, which significantly improves performance. Experimental results on eight heterogeneous datasets (five personality datasets and three non-personality datasets), compared with more than ten well-known related methods, confirm our method's advantages and verify the significance of the group perspective and of unsupervised methods in personality analysis.

The significance of our work is as follows:

We are not only the first to push related research to the group level, which significantly improves performance, but we also propose new thinking on the current dilemmas faced by academic peers. Through our framework, we achieve better adaptability on limited datasets while gaining more potential to meet the challenges of large online social datasets with sparse personality labels in the near future.

The rest of this paper is organized as follows. We present related works in Section 2 and the basic concept in Section 3. Our model will be introduced in Section 4. In Section 5, we conduct various evaluations in multi-class classification and regression prediction for personality perception. Finally, we conclude our work in Section 6.

2 Related works

Due to the broad range of potential applications [36, 45, 52], personality detection has gradually come into the sight of computer science researchers. Compared to traditional sociological projects, online social media provides a lower-cost way to capture personality. Although the field is still in its infancy, related work has achieved fruitful outcomes. Early research mainly focused on hand-crafted feature design, such as Mairesse [25], and the models used were relatively simple [24, 37]. However, these hand-crafted features require long texts, whereas in online social networks most user-generated text records (e.g., on Twitter and Weibo) are very short and cannot meet the requirements of these features. Recently, researchers have started to move away from hand-crafted features toward automatic supervised feature learning. Deep learning methods have therefore been applied progressively, from shallow neural networks [49] to more complex networks [26, 46], and from pure text data to multi-modal data [19, 49]. Unfortunately, most datasets are too small to support the training of these deep learning models.

Broadly, the data types used in this area mainly include texts, images [16], videos [19], and likes [20]. Besides these types, network structure also deserves mention because of the common saying, “Birds of a feather flock together.” People with similar personalities are more likely to build connections on various social occasions [5]. As an effective way to study network topology, research on network representation learning (NRL) has become increasingly active in recent years. It aims at transforming the whole network, or each node, from intractable topology into tractable vectors in a low-dimensional space. For example, DeepWalk [35] transforms the network topology into a series of node sequences by random walks and then uses the word2vec method [28] to learn each node's representation. After that, many variants such as LINE [47] and node2vec [15] have emerged.
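
To make the random-walk step concrete, the following is a minimal, illustrative sketch of the unbiased walk used by DeepWalk-style models (networkx-based Python; the function name and default length are our assumptions, not taken from [35]):

    import random
    import networkx as nx

    def unbiased_walk(g: nx.Graph, start, length=80):
        # DeepWalk-style walk: hop to a uniformly chosen neighbour at each step
        walk = [start]
        while len(walk) < length:
            nbrs = list(g.neighbors(walk[-1]))
            if not nbrs:          # isolated node or dead end: stop early
                break
            walk.append(random.choice(nbrs))
        return walk

The resulting node sequences are treated as sentences and fed to word2vec, as described above.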

In recent years, network representation learning has been widely recognized as an effective feature learning method in various scenarios such as sentiment analysis [48], recommender systems [12], and community detection [10]. Despite this great potential, there is no prior work on predicting one's personality in this way. As previously mentioned, it is impractical to collect social network data with a personality label for each node. Consequently, most existing studies assume samples are independent and identically distributed (i.i.d.) and ignore mutual influence between individuals. In the following section, we show that the similarity between users' generated texts is correlated with the similarity of their personality traits. This suggests that, although there are no open personality datasets with following or retweeting networks, we can still construct a network based on the similarity of users' textual content as an instantiation of people's various relationships in real life.

3 NRL for personality analysis

Due to the limitations in personality analysis mentioned above, it is quite natural to introduce NRL models into this field. The reasons are threefold. First, most NRL models are unsupervised learning methods, which means they rely less on data annotation. Second, the latest models are generally scalable to large datasets, especially those based on random walks, because they can easily be implemented as distributed algorithms. Third, NRL models can capture mutual influences in the dataset; thus they match this application scenario in online social media well.

To be specific, for each personality dataset, we first construct a complete graph \(\mathcal {G}(\mathcal {V},\mathcal {E},\mathcal {W},{\mathscr{L}})\). Here \(\mathcal {V}\) is the user set and \(\mathcal {E}\) is the edge set. For each edge \(e(v_{i},v_{j}),v_{i},v_{j} \in \mathcal {V}\), the weight \(w(v_{i},v_{j}) \in \mathcal {W}\) measures the similarity between the texts generated by \(v_{i}\) and \(v_{j}\). \({\mathscr{L}}\) is the set of users' Big Five personality labels (Footnote 2). For each user \(v_{i}\), his or her personality label is a vector with five entries, denoted by \(\ell _{i}=\lbrack {\ell _{i}^{1}},{\ell _{i}^{2}},{\ell _{i}^{3}},{\ell _{i}^{4}},{\ell _{i}^{5}} \rbrack \). The label similarity of two users can then be calculated as the cosine similarity of their label vectors.
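
As a minimal sketch of this construction, assuming TF-IDF features with cosine similarity (the TF-IDF choice is stated in Section 6; the function names, the min_weight threshold, and the scikit-learn/networkx dependencies are our illustrative assumptions):

    import networkx as nx
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def build_similarity_graph(texts, labels, min_weight=0.0):
        """texts: one document per user; labels: 5-entry Big Five vectors."""
        tfidf = TfidfVectorizer().fit_transform(texts)   # |V| x vocab matrix
        sim = cosine_similarity(tfidf)                   # pairwise text similarity W
        g = nx.Graph()
        for i, l in enumerate(labels):
            g.add_node(i, label=np.asarray(l, dtype=float))
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if sim[i, j] > min_weight:               # drop negligible weights
                    g.add_edge(i, j, weight=float(sim[i, j]))
        return g

    def label_similarity(g, i, j):
        # cosine similarity of two users' Big Five label vectors
        a, b = g.nodes[i]["label"], g.nodes[j]["label"]
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))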

Following the above, we analyzed four personality datasets: Youtube, MyPersonality, PAN, and OpenPsychometrics. Details of these datasets are given in Section 5.2.1. A sample network based on text similarity in the Youtube dataset is shown in Figure 1, where edges with very small weights are removed. Edge colors stand for text similarity, while edge widths describe personality similarity. Two samples connected by a thicker edge with a deeper color are more similar in both personality and texts, from which we can see that the correlation between text similarity and personality similarity truly exists. Results on the other three datasets are very similar to Figure 1 and are omitted for brevity.

Figure 1

Texts Similarity w.r.t. Personality Similarity. This graph is generated from the Youtube dataset (see Section 5.2.1). The color and the width of each edge describe the text similarity and personality similarity, respectively. A pair of nodes connected by a thicker edge with a deeper color is more similar in both personality and texts

Problem definition

Our problem can be treated as multi-class classification or regression, depending on the values of the labels. To be specific, for each user \(v_{i}\), his or her personality label is a vector with five entries denoted by \(\ell _{i}=\left \lbrack {\ell _{i}^{1}},{\ell _{i}^{2}},{\ell _{i}^{3}},{\ell _{i}^{4}},{\ell _{i}^{5}} \right \rbrack \). Each entry stands for the score on one personality trait. In most datasets, the entry values are continuous (e.g., a value in [0, 5] standing for the confidence in one category). In other datasets, the entries are binary (for example, \({\ell _{i}^{j}}=0\) means the user does not belong to trait j, j = 1,2,3,4,5, and \({\ell _{i}^{j}}=1\) means he or she does). Given each user's feature vector as input, we want to predict the personality scores across all five entries when the label values are continuous, or the trait categories when the label values are binary. In the rest of this section, we give some mathematical foundations so that our model can be introduced more naturally.

Given a network \(\mathcal {G}(\mathcal {V},\mathcal {E},\mathcal {W},{\mathscr{L}})\), we want to learn a node embedding matrix \({\varPhi } \in \mathbb {R}^{|\mathcal {V}|\times d}\). The learning process follows the SkipGram method [28], a language model that maximizes the co-occurrence probability among the neighbors of a given node (word). To be specific, if we take the context nodes \(N(v_{j})=\{v_{j-\omega },\cdots ,v_{j+\omega }\}\setminus \{v_{j}\}\) of \(v_{j}, v_{j} \in \mathcal {V}\), where ω is the context size, then the following objective function is optimized:

$$ \max \limits_{\varPhi} \sum\limits_{v \in \mathcal{V}}\log \Pr(N(v)|{\varPhi}(v)) $$
(1)

where:

$$ \Pr(N(v_{j})|{\varPhi}(v_{j}))=\prod\limits_{\substack{i=j-\omega \\ i\neq j}}^{j+\omega}\Pr(v_{i}|{\varPhi}(v_{j})) $$
(2)

\(\Pr (v_{i}|{\varPhi }(v_{j}))\) can be calculated by hierarchical softmax or negative sampling [28]. In the following section, we propose a sampling method based on random walks; it transforms a network into a sentence-like corpus, and we then use the above optimization method to learn our node embeddings.
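
As a minimal sketch of this optimization step, the walk corpus can be fed to an off-the-shelf SkipGram implementation; below we assume gensim's Word2Vec, with hyperparameter values mirroring Section 5.1.2 (the function itself is illustrative, not the released code):

    from gensim.models import Word2Vec

    def learn_embeddings(walks, dim=128, context=20):
        # each walk is a list of node ids; nodes are treated as "words"
        sentences = [[str(v) for v in walk] for walk in walks]
        model = Word2Vec(sentences, vector_size=dim, window=context,
                         sg=1, min_count=1, workers=4)  # SkipGram with negative sampling
        # keys are the stringified node ids
        return {v: model.wv[v] for v in model.wv.index_to_key}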

4 AdaWalk: adaptive walk for NRL

In this section, we will discuss the main components of our algorithm and give some methods to accelerate computing.

4.1 AdaWalk method

Given an ongoing random walk path \(\langle v_{1},\cdots ,v_{t} \rangle \), the next node \(v_{t+1}\) is sampled with the following probability:

$$ \begin{array}{ll} & \Pr(v_{t+1}|v_{t},v_{t-1})=\\ & \left\{ \begin{array}{ll} \frac{\alpha(v_{t-1},v_{t},v_{t+1})w(v_{t},v_{t+1})}{Z},& \text{if } (v_{t},v_{t+1}) \in \mathcal{E} \\ 0, &\text{otherwise} \end{array} \right. \end{array} $$
(3)

where Z denotes the normalizing constant and \(w(v_{t},v_{t+1})\) is the weight of edge \((v_{t},v_{t+1})\). Inspired by the node2vec model [15], we design a biased walk strategy controlled by \(\alpha (v_{t-1},v_{t},v_{t+1})\), which is defined as:

$$ \begin{array}{ll} & \alpha(v_{t-1},v_{t},v_{t+1})=\\ & \left\{ \begin{array}{ll} \frac{1}{g_{1}}K(v_{t+1})\cdot K(v_{t-1}),& \text{if } d_{v_{t-1},v_{t+1}}=0 \\ K(v_{t+1})\cdot K(v_{t-1}), &\text{if } d_{v_{t-1},v_{t+1}}=1\\ \frac{1}{g_{2}}K(v_{t+1})\cdot (1-K(v_{t-1})), &\text{if } d_{v_{t-1},v_{t+1}}=2 \end{array} \right. \end{array} $$
(4)

The difference between AdaWalk and node2vec is that the global bias parameters (\(g_{1}\) and \(g_{2}\) here, p and q in node2vec) are useful when the model deals with diverse datasets but are not sufficient for fine-grained situations within the same dataset. For example, a node near a clique is more likely to be attracted to explore around the clique, while a node far from the clique is more likely to explore in a DFS-like manner. This problem can be solved by our proposed method because K(v) denotes the importance, or attraction, of node v to other nodes, which can be calculated in one of the following ways (a minimal sketch follows the list):

  • Degree: let d(v) be the degree of node v; we calculate K(v) as the normalized degree:

    $$ \begin{array}{@{}rcl@{}} K(v)=\frac{d(v)}{|\mathcal{V}|-1} \end{array} $$
  • Clustering Coefficient: let T(v) be the number of triangles through node v and d(v) the degree of v; then K(v) can be calculated as follows:

    $$ \begin{array}{@{}rcl@{}} K(v)=\frac{2T(v)}{d(v)(d(v)-1)} \end{array} $$
  • PageRank [31]: PageRank computes a ranking of the nodes in the graph based on the structure of the incoming links.

    \(K(v_{i})\) can then be calculated as:

    $$ \begin{array}{@{}rcl@{}} K(v_i)=\frac{1-\alpha}{|\mathcal{V}|}+\alpha \sum\limits_{v_j \in M(v_i)}\frac{K(v_j)}{d^{out}(v_j)} \end{array} $$

    where \(M(v_{i})\) is the set of nodes that link to \(v_{i}\), \(d^{out}(v_{j})\) is the number of outbound edges of \(v_{j}\), and α is a damping hyperparameter. Note that the kernel values should be normalized before being used in our model.
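
All three kernels can be obtained from standard graph routines; the sketch below (networkx-based, with our own function name and a final rescaling to [0, 1]) is illustrative rather than the exact implementation:

    import networkx as nx

    def node_kernel(g, kind="degree", alpha=0.85):
        n = g.number_of_nodes()
        if kind == "degree":
            k = {v: g.degree(v) / (n - 1) for v in g}   # normalized degree
        elif kind == "clustering":
            k = nx.clustering(g)                        # 2T(v) / (d(v)(d(v)-1))
        elif kind == "pagerank":
            k = nx.pagerank(g, alpha=alpha)             # damped PageRank scores
        else:
            raise ValueError(kind)
        m = max(k.values()) or 1.0                      # normalize before use
        return {v: val / m for v, val in k.items()}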

We compared the performance of the different kernels on several related datasets; Figure 2 shows the results on the Cora dataset (see Section 5.1.3), from which we find that the performances of the different kernels are very close.

Figure 2

Different kernels on the Cora dataset. The horizontal axis stands for the percentage of training data and the vertical axis stands for the Micro-F1 (a) and Macro-F1 (b) scores in the multi-classification task on the Cora dataset (see Section 5.1.3)

Intuitively, when selecting the next node after \(v_{t}\), if the value of \(K(v_{t-1})\) is large enough, the model is more likely to select nodes that are near \(v_{t-1}\). On the other hand, when the model faces several candidates in a similar situation, the candidate with a larger value of \(K(v_{t+1})\) is more likely to be chosen as the next hop. The pseudocode of AdaWalk is given in Algorithm 1, and our sampling strategy in Algorithm 2. AdaWalk has three phases. First, we preprocess the transition probabilities offline. Second, the corpus is generated by a series of random walks with complexity \(O(|\mathcal {E}|)\). Third, the optimization using SkipGram is executed sequentially.
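
A minimal sketch of one sampling step (Equations (3)-(4)) is given below; it assumes a precomputed kernel K and the hyperparameters g1 and g2, and the helper name next_node is ours, not the paper's pseudocode:

    import random

    def next_node(g, K, v_prev, v_curr, g1, g2):
        # score every neighbour of v_curr according to Eq. (4), weight by w, then sample
        neighbours = list(g.neighbors(v_curr))
        scores = []
        for v_next in neighbours:
            w = g[v_curr][v_next].get("weight", 1.0)
            if v_next == v_prev:                      # distance 0: returning
                a = (1.0 / g1) * K[v_next] * K[v_prev]
            elif g.has_edge(v_prev, v_next):          # distance 1: staying local
                a = K[v_next] * K[v_prev]
            else:                                     # distance 2: moving outward
                a = (1.0 / g2) * K[v_next] * (1.0 - K[v_prev])
            scores.append(a * w)                      # Eq. (3), up to the constant Z
        if not neighbours or sum(scores) == 0:
            return v_curr
        return random.choices(neighbours, weights=scores, k=1)[0]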

4.2 Algorithm acceleration

We list some methods to accelerate our model:

  • Firstly, some components can be calculated offline such as kernel K for each node, the transition probabilities, and each node’s one-hop neighbors.

  • Secondly, in line 10 of Algorithm 2, we do not need to traverse all one-hop neighbors of node \(v_{curr}\); a portion of them is sufficient for the model to perform well. This strategy is especially useful when the network is dense (see the sketch after this list).

  • Thirdly, the corpus generation (lines 2-7 in Algorithm 1) can be executed in parallel because each iteration is independent of the others.

  • Fourthly, according to [35], the frequency distribution of nodes in a corpus generated by random walks follows a power law, which means only a small number of nodes dominate the updates of Φ. Therefore, the SkipGram algorithm can be further optimized by various parallel versions of gradient descent (for example, asynchronous stochastic gradient descent with delay compensation [53]).
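
The following sketch illustrates the second and third points above (neighbor sub-sampling with ratio r, and parallel corpus generation); the joblib dependency, function names, and defaults are our assumptions:

    import random
    from joblib import Parallel, delayed

    def sampled_neighbors(g, v, r=0.1):
        # traverse only a fraction r of v's one-hop neighbours
        nbrs = list(g.neighbors(v))
        if not nbrs:
            return []
        return random.sample(nbrs, max(1, int(r * len(nbrs))))

    def one_pass(g, walk_fn, walk_length, seed):
        # one full pass over the shuffled node set; passes are mutually independent
        rng = random.Random(seed)
        nodes = list(g.nodes())
        rng.shuffle(nodes)
        return [walk_fn(g, v, walk_length) for v in nodes]

    def generate_corpus(g, walk_fn, num_walks=30, walk_length=80, n_jobs=4):
        batches = Parallel(n_jobs=n_jobs)(
            delayed(one_pass)(g, walk_fn, walk_length, seed) for seed in range(num_walks))
        return [walk for batch in batches for walk in batch]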

Algorithm 1 AdaWalk (pseudocode)
Algorithm 2 The sampling strategy (pseudocode)

5 Experiments

Studies of personality detection mainly treat the task as multi-label classification or regression, depending on the label values of the personality datasets. As the labels of most personality datasets are continuous numbers, it is more common to treat the task as regression. In this section, we evaluate our method from two aspects: (i) from non-personality datasets to personality datasets; (ii) from multi-class classification to regression.

First, as our method is an NRL model, it is necessary to compare it with other well-known NRL methods on network learning. This shows that our method outperforms the others, which is one of the reasons why we do not use other NRL methods in the later personality detection experiments. Therefore, in Section 5.1, we compare our method with five well-known NRL methods (Graph Factorization, DeepWalk, LINE, node2vec, HOPE) on the multi-class classification task over three heterogeneous non-personality datasets and one personality dataset. We do this because multi-class classification is one of the most important tasks in the NRL research field and is conducted in nearly every related NRL paper. However, most personality datasets are not suitable for multi-class classification because their label values are not discrete. At present, we have found only one valid personality dataset with discrete label values, the stream-of-consciousness essays (SoCE). To make our assessment more convincing, we supplement it with three datasets widely used in NRL evaluations: BlogCatalog, Cora, and Wiki. Second, we choose eight well-known personality detection methods as baselines for the personality regression task in Section 5.2. We compare them with our AdaWalk model and report the root mean square error (RMSE) on four diverse personality datasets.

STATEMENT

To ensure the replicability of our experiments, we release our source code (Footnote 3). We recognize the tension between privacy protection and academic research. Although all the datasets in this paper are obtained from open public sources, we still take careful measures to achieve the best balance between academic needs and users' privacy protection, including but not limited to: a) removing users' sensitive information; b) acquiring the necessary license agreements before using the related datasets; c) using the data only for academic purposes.

5.1 Multi-class classification

5.1.1 Baselines

To validate the performance of our approach we compare it against the following baselines:

  • Graph Factorization [2]: Graph Factorization is a distributed decomposition and inference framework for large-scale graphs.

  • DeepWalk [35]: DeepWalk uses unbiased random walks to generate the corpus and then uses SkipGram to learn node embeddings.

  • LINE [47]: LINE assumes that a pair of nodes should be placed close together not only when they are connected (first-order proximity) but also when they share similar neighbors (second-order proximity). It therefore uses a breadth-first search strategy to generate context nodes and tries to capture both the first-order and the second-order proximity.

  • node2vec [15]: Building on DeepWalk and LINE, node2vec discusses two search strategies, breadth-first search and depth-first search, and uses a biased random walk at a coarse-grained level to generate the corpus.

  • HOPE [30]: HOPE incorporates the Katz index to measure the proximity of nodes and preserves asymmetric transitivity in the network. However, calculating high-order proximity measurements incurs a high time complexity.

5.1.2 Parameter setup

Following the above, we randomly sample a portion (70%) of the labeled nodes as training data; the remaining nodes are used as the test set. We set the parameters as follows: number of walks γ = 30, walk length 80, embedding size d = 128, and context size ω = 20. Specifically, for AdaWalk and node2vec, the parameters g1, g2, p, q are selected from [0.25, 0.5, 2]. For LINE, Graph Factorization, and HOPE, we use the open-source implementations from OpenNE (Footnote 4) with default parameters. We repeat each test 10 times and report the best performance in terms of both Macro-F1 and Micro-F1. For the downstream multi-class classifier, we use one-vs-rest logistic regression to return the most probable labels (a minimal sketch of this evaluation step follows).
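
A minimal sketch of this evaluation protocol, assuming scikit-learn and the learned embeddings as the feature matrix (the function name and settings such as max_iter are ours):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier

    def evaluate_classification(embeddings, labels, train_ratio=0.7, seed=0):
        # random split, one-vs-rest logistic regression, Micro-/Macro-F1 on the test set
        x_tr, x_te, y_tr, y_te = train_test_split(
            embeddings, labels, train_size=train_ratio, random_state=seed)
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(x_tr, y_tr)
        pred = clf.predict(x_te)
        return (f1_score(y_te, pred, average="micro"),
                f1_score(y_te, pred, average="macro"))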

5.1.3 Datasets

We test our benchmarks on the following datasets:

  • SoCE [33]: The stream-of-consciousness essays (SoCE) dataset contains 2,467 persuasive anonymous essays tagged with the authors' personality traits: extroversion, neuroticism, agreeableness, conscientiousness, and openness. Each trait has a binary label from {0, 1}, where 0 means the sample does not belong to this trait and 1 means it does.

  • BlogCatalog [51]: it is an undirected network of social relationships from 10,312 bloggers. The network contains 10,312 nodes with 39 labels and 333,983 edges.

  • Cora [27]: Cora is a typical directed paper-citation network. It contains 2,708 nodes, 5,429 edges, and 7 labels.

  • Wiki [44]: It contains 2,405 Web pages from 19 categories and 17,981 links between them.

5.1.4 Experimental results

As shown in Table 2, we evaluate the Micro-F1 and Macro-F1 scores on BlogCatalog, Cora, Wiki, and SoCE with 70% of the nodes labeled. Numbers in bold represent the highest performance in each column. Note that we only sample 10% of the one-hop neighbors in each random walk yet still achieve comparable performance with these models.

Table 2 Multi-label classification results (70% labeled)

AdaWalk outperforms all baselines by at least 8% on BlogCatalog, 3% on Cora, and 7% on Wiki w.r.t. Micro-F1. Remarkably, on the personality dataset SoCE, the Micro-F1 score of our method reaches 97.74%. It also performs quite well in terms of Macro-F1. We also note that node2vec generally performs better than DeepWalk, while our model performs better than node2vec, which suggests that it is meaningful to move from unbiased random walks to biased walks that handle both coarse-grained and fine-grained situations.

Additionally, we vary the percentage of labeled nodes from 10% to 90% and compare with these baselines again. The results are shown in Figure 3, from which we observe that the performance of our model is very close to that of HOPE on Cora but beats most baselines on the other datasets, especially when the training-set percentage drops from 60% to 10%. This is especially useful when dealing with massive data with sparse annotations. The superiority of our model grows further when the training percentage rises from 70% to 90%.

Figure 3

Results w.r.t. the percentage of labeled nodes. Our model achieves the best performance even when the labeled nodes are reduced from 90% to 10%, which demonstrates its superiority in sparsely labeled cases

5.1.5 Parameter sensitivity

We compare our model under different one-hop neighbor sampling ratios r, walk lengths, and context sizes ω on the above multi-class classification task. Results are shown in Figure 4, from which we make the following observations:

  • In order to accelerate computing, we sample each node's one-hop neighbors with ratio r. From Figure 4a and b, we find that both Micro-F1 and Macro-F1 on Cora and Wiki increase when r changes from 0.1 to 0.2 but then become stable, which suggests that there is no need to traverse the whole set of one-hop neighbors in each random walk. This strategy is especially valuable when the dataset is large: compared with Cora and Wiki, performance on BlogCatalog barely fluctuates with r.

  • We also study the performance for walk lengths greater than 30. From Figure 4c and d, there is no drastic fluctuation when the walk length increases from 30 to 200. Besides, the larger dataset (BlogCatalog) behaves more stably than the smaller ones (Cora and Wiki).

  • Finally, we analyze the relationship between Micro/Macro-F1 and the context size ω. From Figure 4e and f, the Micro/Macro-F1 scores fluctuate slightly as ω becomes larger, but the performance differences are small in this case.

Figure 4

Results w.r.t. sampling ratio r, walk length, and context size ω

5.2 Regression prediction for personality

5.2.1 Datasets

Having compared our model with various baselines on multi-class classification, we now use AdaWalk to predict one's personality traits as a regression task. We extensively evaluate the related models on four openly published personality datasets, listed as follows:

  • Youtube Personality [9]: This dataset consists of a collection of speech transcriptions with their Big Five personality impression scores (score values range from 1 to 7). We use it to generate a text-similarity network with 404 nodes and 81,406 edges.

  • MyPersonality [11]: This dataset comes from an open project on Facebook. It contains 9,900 status updates from 250 users as well as their Big Five scores (ranging from 1 to 5). We translate this dataset into a similarity network with 250 nodes and 31,125 edges.

  • PAN [42]: This dataset comes from a well-known data science competition, PAN 2015, and includes datasets in four languages (Dutch, English, Italian, and Spanish). We select the English data to construct the network, which contains 294 nodes and 43,070 edges.

  • OpenPsychometrics: This dataset comes from the Open Source Psychometrics Project (Footnote 5). It contains 6 incomplete-sentence responses, gender, age, and Big Five scores. We translate the dataset into a network based on the similarity of the sentence responses. The network has 933 nodes and 434,778 edges.

5.2.2 Baselines and parameter setup

Baselines are as follows:

a) by artificial features: :
  • Mairesse [25]: Mairesse expands the LIWC (Linguistic Inquiry and Word Count) dictionary developed by Pennebaker et al. [34], making it one of the most widely used references for analyzing personality from text.

  • TF-IDF [41]: TF-IDF is the product of term frequency (TF) and inverse document frequency (IDF), and it is one of the most basic and standard methods for text classification.

b) by supervised feature learning: :
  • 2CLSTM [46]: In this work, a deep learning framework is designed to process text data in order to detect the personality traits of its authors. The architecture of 2CLSTM can be divided into two parts. In the first part, bidirectional LSTMs over word vectors encode sentence embeddings. In the second part, a group of CNNs learns sentence groups and generates the final feature vectors used as input to the downstream classifier.

  • Kampman et al. [19]: They use a deep learning method to detect one's personality from three channels (audio, text, and video). For the text channel, they use only CNNs for feature learning. Since we focus on text data, we use only the text channel in our later experiments.

  • Wei et al. [49]: They also use a CNN framework to process the text data, but before feeding the final CNN-learned features to the classifier, they concatenate them with LIWC features.

c) by unsupervised feature learning: :
  • doc2vec [22]: a well-known NLP model derived from word2vec that has been widely used in industry.

  • DeepWalk: An NRL method described in Section 5.1.1.

  • node2vec: An NRL method described in Section 5.1.1.

Given that the above datasets are relatively small in the number of nodes, we split the data into training and test sets at a 1:1 ratio. We use SVR (support vector regression) to predict personality scores and RMSE (root mean square error) as the evaluation metric. In order to demonstrate that NRL methods are truly useful in this application scenario, we also report the results of random guessing as the worst case (a minimal sketch of this evaluation step follows).
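
A minimal sketch of the regression protocol, assuming scikit-learn with default SVR hyperparameters (the function name and the per-trait loop are our illustrative assumptions):

    import numpy as np
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR

    def rmse_per_trait(embeddings, scores, seed=0):
        """scores: array of shape (n_users, 5), one column per Big Five trait."""
        x_tr, x_te, y_tr, y_te = train_test_split(
            embeddings, np.asarray(scores), train_size=0.5, random_state=seed)
        rmses = []
        for t in range(y_tr.shape[1]):
            pred = SVR().fit(x_tr, y_tr[:, t]).predict(x_te)
            rmses.append(float(np.sqrt(mean_squared_error(y_te[:, t], pred))))
        return rmses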

5.2.3 Experimental results

Results are shown in Table 3. We use EXT (Extraversion), AGR (Agreeableness), CON (Conscientiousness), EMO | STA | NEU (Emotion | Stability | Neuroticism), and OPN (Openness) to stand for the five traits of the Big Five personality model. Note that in the Big Five model, researchers use different names for the fourth dimension (EMO and NEU take values opposite to STA).

Table 3 RMSE for personality prediction (50% labeled)

From Table 3 we can see that NRL models are truly useful for personality prediction: their RMSE values are all lower than those of a random guess by 40%-96%. Furthermore, our model performs best on most personality traits. Despite the significant advantages our model achieves, some artificial features such as Mairesse still perform quite well in certain cases. For example, the results of Mairesse on the PAN dataset are very competitive but relatively poor on the other three datasets. This is because the texts in the PAN dataset are relatively longer than in the other datasets, so artificial features can perform well there. However, as mentioned before, most text records in online social networks are short, which means Mairesse may not perform well in that scenario. We also notice that most supervised feature learning methods, especially those based on deep learning such as 2CLSTM, Wei et al., and Kampman et al., struggle to reach their full potential because of the limited size of the datasets.

6 Limitations and conclusion

In this paper, we analyzed the feasibility of personality prediction from a group perspective. We introduced the NRL method to this field and extensively evaluated our model against other well-known works. The results confirm the significance of the group perspective and of unsupervised methods in personality analysis.

However, this work also has some limitations. For example, our text-generated networks are constructed in advance based on the TF-IDF method. We do this because it is the simplest way to measure the similarity of two documents and can be applied in all scenarios. Even though we have managed to build text networks, they remain a simulation of real-life relationships, not real social networks such as following networks or retweeting networks. However, to the best of our knowledge, there is no public social network data with personality annotations, leaving our method as the most feasible choice.

Our work stands at a momentous crossroads in the field of computational personality

We are now living in an era of big data. Researchers in social personality psychology should have a much greater chance than ever before to study personality in big-data settings, especially in online social networks. However, for a long time, researchers have been stuck in an awkward situation because of the huge cost of personality annotation. As a result, most open personality datasets are not big enough, and related research is not prepared for the forthcoming challenges of big data.

This contradictory situation has forced us to think anew:

how to leverage the existing small labeled datasets more comprehensively, and, at the same time, how to make our methods scalable enough to deal with large-scale data in the near future. In light of the above challenges faced by academic peers, we try to push related work to rely less on personality annotations, to leverage limited datasets more comprehensively, and to be more scalable for big data.