
1 Introduction

Personality traits identification plays a vital role in many everyday situations, such as the outcome of elections and the verdicts of courts [1]. It can also be used to identify leaders, assess student performance, and support interviews and the selection of candidates for research, jobs or defense [2]. For all of these applications, it is necessary to estimate the degree of different personality traits in images. This is because multiple personality traits, such as agreeableness, openness and conscientiousness, are required together, much like multiple skills, when choosing suitable candidates, especially for hiring software employees and researchers. At the same time, a few applications, such as finance, management and security jobs, may look for particular personality traits irrespective of their degree. It is also true that if we look at the facial, posture, pose and background information of a person in an image, or at multiple words in the text lines, one can observe multiple personality traits rather than a single personality trait.

This is illustrated in Fig. 1, where each image exhibits different personality traits according to its textual, facial and background information. For example, (i) Openness: the image shows a creative mask being worn for an ID photo, which indicates the openness-to-experience trait. From the text, the word “new” is recognized, which also indicates the openness class, the dominant trait here. From the dark imagery and the intention of hiding the face, we can also infer a high degree of neuroticism. (ii) Conscientiousness: the recognized text contains the word “educate”, an indicator of the conscientiousness class, which is the dominant trait. We can also infer agreeableness from the words “reduce poverty”. (iii) Extraversion: the image shows a huge gathering of people, all with happy faces and enjoying themselves, which is highly indicative of extraversion; none of the other traits is high in this image. (iv) Agreeableness: the image shows a sympathetic person and the recognized text contains the word “sorry”, both indicative of agreeableness, which is dominant here. (v) Neuroticism: the image shows a man hiding in dark clothing and the recognized text contains “madness”. The text also implies that a person must not lose their madness, which is indicative of mental instability and of neuroticism, the dominant trait here.

Therefore, one can infer that, for better solutions and decisions, assessing and estimating the degree of multiple personality traits from a single image is vital in many day-to-day situations. For personality traits identification, several methods have been developed in the past using facial information [1], handwritten text information [3], normal text uploaded on social media [4], and profile information, which provides multimodal information [4]. Most methods aim at detecting a single personality trait for the input image or text. Similarly, if personality traits identification is considered a classification problem, the methods developed for classification may not perform well for images containing people, actions and postures [5], because these models are limited to scene images. This observation has motivated us to introduce multiple personality traits identification in a single image through this work. To the best of our knowledge, this is the first work of its kind.

To localize the content in the image, the proposed work detects text in the image and segments words from the text lines. We believe the meaning of the text directly represents personality traits, just as facial information does. Therefore, for each word, the proposed method obtains a feature vector. In addition, for the whole input image, the proposed work uses the Google Cloud Vision API [6] to obtain labels (text annotations). Besides, if any accompanying text, such as the description of the image or profile, is available, the proposed work obtains a feature vector for it as well. The feature vectors obtained from each word, the annotations and the description of the image are considered textual features. To strengthen the feature extraction for personality traits identification, the proposed work segments the whole image into face and background clusters (the background being everything other than the face and text information in the input image). The features extracted from the image content are called visual features. Finally, the model fuses textual and visual features for estimating multiple personality traits in a single image. Feature extraction from textual and image information is motivated by contrastive learning [7], which has special properties for representing visual and textual features. Classification is then done using an Artificial Neural Network (ANN) model by feeding the feature vectors as inputs.

Fig. 1. Sample images with different degrees of personality traits assessments.

The main contributions of this work are as follows: (i) a new method to estimate multiple personality traits from a single image; (ii) the use of contrastive learning for extracting visual and textual features, which is novel here; (iii) a weighted approach for fusing visual and textual features for multiple personality traits estimation, which is also a novel contribution.

2 Related Work

For personality traits identification/assessment, many methods have been proposed in the past based on normal text uploaded on social media, handwritten text analyzed with a graphological approach, facial information, and combined image-text information. We review some of these methods in this section.

Kumar et al. [8] proposed a language embedding-based model for personality traits assessment using social network activities. The work considers reviews or any other text describing personality traits uploaded on social media. Anglekar et al. [9] proposed a deep learning-based method for self-assessment, especially for interview preparation. Dickmond et al. [10] explored machine learning for extracting features from curricula vitae to identify personality traits. It is noted that the scope of the above models is limited to textual information, and the researchers ignored image information for personality traits identification. Therefore, these methods may not be effective for estimating the degree of different personality traits in a single image.

There are methods for personality trait identification using a graphology-based approach, which uses handwritten characters to study the behavior of the writer [3, 11, 12]. Although these models use image information, they are developed based on an unscientific approach. Therefore, their results may not be consistent and reliable for estimating the degree of different personality traits in images.

To improve the performance of personality trait identification, some models use both image and textual information [3, 13, 14]. For example, Sun et al. [2] developed a method to evaluate the aptitude and psychological aspects of students at entrance. The approach extracts features from the facial region for assessing personality traits. Beyan et al. [15] extracted non-verbal features from key dynamic images for personality trait identification; a combination of convolutional and long short-term memory models was proposed to improve the performance of personality traits identification. Xu et al. [16] proposed a model for predicting the five personality traits using static facial images and different academic backgrounds. For a given input video frame, Ventura et al. [1] estimated the degree of different personality traits. This model requires a still frame, multiple frames and audio to achieve the best results, and therefore may not perform well for a single image.

Since sentiment analysis is related to personality traits identification, we review sentiment analysis methods to show that they are not effective for estimating the degree of personality traits. For instance, Yu et al. [17] proposed a transformer-based step to fuse image and text modalities with self-attention layers for sentiment analysis. Thus, estimating the degree of multiple personality traits for a given input image remains challenging. Hence, this work focuses on developing a new model based on segmented local information and contrastive learning for assessing multiple personality traits in an image.

3 Proposed Model

For a given input image, the proposed method segments the facial and textual regions and considers everything other than the textual and facial information as background. Each segment provides cues for different personality traits, especially if the image contains multiple people with emotions, expressions and actions. Therefore, this work segments the text using the Google Cloud Vision API and the face using the Haar cascades algorithm [18]. For the background, the method uses the Google Cloud Vision API to obtain labels. In the same way, for the detected text and the description of the image/profile, this work uses the Google Cloud Vision API to obtain recognition results.

Overall, the Google Cloud Vision API provides recognition results for the text in the image, the description of the image, and the background scene, which are grouped as textual information, while the facial information is treated as visual information. For extracting features from the textual and facial information, inspired by the success of contrastive learning in discriminating objects based on similarity and dissimilarity [6], we explore the same strategy for the feature extraction process in this work. The extracted features are fused for estimating the degrees of multiple personality traits in the image through five individual neural network regression models. The pipeline of the proposed method is shown in Fig. 2. For a given input image, the parts segmented for feature extraction are defined in Eq. (1).

$$ I_{S} = T_{R} + T_{A} + T_{B} + I_{F} $$
(1)

where \(I_{S}\) stands for the sample image, \(T_{R}\) for the recognized text, \(T_{A}\) for the annotated text, \(T_{B}\) for the background scene text and \(I_{F}\) for the facial part. The face is detected using the Haar cascades algorithm [18], which works on the sums of the pixel intensities in rectangular regions and the differences between these sums; the process is performed in a cascading fashion, and the extracted features are supplied to a machine learning classifier for face detection. The detected face and text regions are removed from the input image and replaced with black rectangular patches. This results in a background image as defined in Eq. (2).

$$ I_{B} = I_{S} - I_{F} $$
(2)

where, \(I_{B}\) is the background image. The background image is fed to the recognizer (Google Cloud Vision API) to determine labels.
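As a rough illustration of the segmentation step in Eqs. (1) and (2), the sketch below builds the face crops \(I_{F}\) with OpenCV's Haar cascade detector [18] and blacks out the face and text regions to obtain the background image \(I_{B}\). It assumes the word bounding boxes are already available from the text detector; the function and variable names are ours, not the authors' implementation.

```python
# Minimal sketch of the segmentation step (Eqs. 1-2); assumes OpenCV is installed
# and that `text_boxes` holds (x, y, w, h) word boxes from the text detector.
import cv2

def segment_image(image_path, text_boxes):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Haar cascade face detector shipped with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    face_boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    # I_F: one crop per detected face.
    faces = [image[y:y + h, x:x + w].copy() for (x, y, w, h) in face_boxes]

    # I_B: black out the face and text regions of I_S (Eq. 2).
    background = image.copy()
    for (x, y, w, h) in list(face_boxes) + list(text_boxes):
        cv2.rectangle(background, (x, y), (x + w, y + h), (0, 0, 0), thickness=-1)

    return faces, background
```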

Fig. 2. Block diagram of the proposed method for estimating the degree of multiple personality traits. The face, recognized text, annotations and background are extracted from the sample image and encoded. Clustering and fusion steps followed by regression are then performed to estimate the degrees of the personality traits.

3.1 Contrastive Learning for Textual Feature Extraction

This work considers three types of text for each sample. Firstly, the text recognized from the image is obtained using the Google Cloud Vision API [6]. Secondly, image annotations for the sample image are obtained, again using [6]; only the ten most confident annotations are taken into consideration. Thirdly, background scenes are recognized using ResNet18 [19], which has been trained on the Places dataset [20]. This dataset has 10 million images comprising 400+ unique scene categories. From this model, we choose the top 10 highest-confidence scene categories as the background scenes. While assessing the personality traits, the personality trait with the highest degree is known as the dominant personality trait. All texts obtained from the previous steps are compiled into a corpus, with the text parts as individual documents and the dominant personality trait as the document label.
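A minimal sketch of this background-scene step is given below. It assumes a ResNet18 with Places-trained weights saved as a local checkpoint and a matching list of scene category names; both are placeholders, and the exact key handling depends on how the released Places weights were saved.

```python
# Hypothetical loader for a Places-trained ResNet18 [19, 20]; paths are placeholders.
import torch
from torchvision import models, transforms
from PIL import Image

def top10_scene_labels(background_img_path, checkpoint_path, categories):
    model = models.resnet18(num_classes=len(categories))
    state = torch.load(checkpoint_path, map_location="cpu")   # Places weights (assumed)
    model.load_state_dict(state)
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    x = preprocess(Image.open(background_img_path).convert("RGB")).unsqueeze(0)

    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1).squeeze(0)

    # Keep the 10 most confident scene categories as the background-scene text T_B.
    _, top_idx = probs.topk(10)
    return [categories[i] for i in top_idx.tolist()]
```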

The text within the corpus is then used to train the representation space using contrastive learning. A pair of texts with the same class label is taken and fed through a neural network to obtain a pair of outputs, and the similarity score between these outputs is calculated using Eq. (3). In another instance, a pair of texts from two different classes is taken and fed through the same neural network, and again the similarity score between the outputs is calculated using Eq. (3). Finally, the loss is calculated using Eq. (4) and propagated backwards so that the network learns to minimize it. The output space of the trained neural network is known as the textual representation space. Texts with different labels repel each other, as they are contrastive samples, whereas texts with the same label attract each other. This is achieved using the NT-Xent loss, which relies on the following similarity function.

$$ sim\left( {t_{a} ,t_{b} } \right) = \frac{{t_{a}^{T} t_{b} }}{{\left\| {t_{a} } \right\|\left\| {t_{b} } \right\|}} $$
(3)

where \(t_{a}\) and \(t_{b}\) represent text parts a and b, \(t_{a}^{T}\) is the transpose of \(t_{a}\), and \(\left\| {t_{a} } \right\|\) is the norm of \(t_{a}\). The NT-Xent loss [6] is given by:

$$ L_{a,b} = - \log \frac{{e^{{sim\left( {t_{a} ,t_{b} } \right)/\tau }} }}{{\sum\nolimits_{k = 1}^{2N} 1_{{\left[ {k \ne a} \right]}} e^{{sim\left( {t_{a} ,t_{k} } \right)/\tau }} }} $$
(4)

where \(1_{{\left[ {k \ne a} \right]}}\) is an indicator function evaluating to 1 iff \(k \ne a\), \(\tau\) is a temperature parameter, and \(sim\left( {t_{a} ,t_{k} } \right)\) comes from Eq. (3). Minimizing the function in Eq. (4) over all attractive and repulsive pairs gives us the text representation space. Since the number of words varies and a variable-dimensional input to a neural network is not possible, the number of dimensions needs to be fixed. This is achieved by clustering the words in the representation space into 10 clusters using K-means clustering. The cluster centroids in the representation space are then considered the text embeddings. Separate text embeddings for the recognized text, the annotations and the background scenes, each with 10 centroids, are obtained. The value of 10 clusters is determined empirically. This process is shown in Fig. 3.

$$ E_{R} = TCLR\left( {T_{R} } \right),\quad E_{A} = TCLR\left( {T_{A} } \right),\quad E_{B} = TCLR\left( {T_{B} } \right) $$
(5)

where, \(T_{R} , T_{A}\), and \(T_{B}\) are taken from Eq. (1), and \(TCLR()\) is the Textual Contrastive Learning Representation function.
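A minimal sketch of this textual branch is given below: the NT-Xent objective of Eqs. (3)–(4) computed over a batch of positive pairs, and the \(TCLR()\) step of Eq. (5) that reduces a variable number of word representations to ten K-means centroids. The use of PyTorch and scikit-learn, the batching scheme and the encoder that produces the word representations are assumptions, not the authors' code.

```python
# Sketch of the NT-Xent loss (Eqs. 3-4) and of TCLR() (Eq. 5); library choices assumed.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss over a batch of N positive pairs (z_a[i], z_b[i])."""
    n = z_a.size(0)
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)        # 2N x d
    sim = z @ z.t() / temperature                               # cosine sim (Eq. 3) / tau
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                  # drop the k == a terms
    # The positive of sample i is i + N (and vice versa).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)                        # -log softmax (Eq. 4)

def tclr_embedding(word_vectors, n_clusters=10):
    """TCLR(): cluster word representations and return the 10 centroids (Eq. 5)."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(word_vectors)
    return kmeans.cluster_centers_                              # fixed-size embedding
```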

Fig. 3. Block diagram of word embedding using contrastive learning. Recognized words from the same class attract each other, whereas recognized words from differing classes repel each other.

3.2 Contrastive Learning for Visual Feature Extraction

The sample image may contain many facial regions owing to multiple people being present in the same image. Motivation has been drawn from other visual learning models such as [21]. These facial images are labelled according to their dominant personality traits, and the visual representation space for facial images is trained using contrastive learning in the same way. Similar to the textual counterpart, two facial images belonging to the same class are supplied to a neural network, the similarity between the outputs is computed, and the loss is calculated. This loss is propagated back through the network to bring facial images of the same class closer, whereas pairs of facial images from different classes are pushed further apart. This results in the learned visual representation space; hence the framework can be used for both texts and images. The same NT-Xent loss, shown in Eq. (4), is used for training. This yields an image embedding for every facial image in the visual representation space. Again, there can be multiple people in a single image, resulting in multiple facial image embeddings, so the length of the embedding needs to be fixed before it is supplied to a neural network. Clustering is therefore applied to the facial images in the visual representation space: K-means clustering is used to find two cluster centroids, which serve as the facial image embeddings. The method is shown in Fig. 4.

$$ E_{F} = VCLR\left( {I_{F} } \right) $$
(6)

where, \(I_{F}\) is taken from Eq. (1) and \(VCLR()\) is the Visual Contrastive Learning Representation function.
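Analogously to \(TCLR()\), a short sketch of \(VCLR()\) under the same assumptions is shown below; how an image with fewer faces than clusters is handled is our assumption, not the authors'.

```python
# Sketch of VCLR() (Eq. 6): two K-means centroids over the facial embeddings of one sample.
import numpy as np
from sklearn.cluster import KMeans

def vclr_embedding(face_vectors, n_clusters=2):
    face_vectors = np.asarray(face_vectors)     # one row per contrastively encoded face
    if face_vectors.shape[0] <= n_clusters:
        # Too few faces to cluster: repeat the rows so two "centroids" still exist.
        reps = int(np.ceil(n_clusters / face_vectors.shape[0]))
        return np.tile(face_vectors, (reps, 1))[:n_clusters]
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(face_vectors)
    return kmeans.cluster_centers_
```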

Fig. 4. Block diagram of image embedding using contrastive learning. Facial images from the same class attract each other, whereas facial images from differing classes repel each other.

3.3 Weighted Fusion and Estimating Degree of Personality Traits

In this step, the embeddings obtained in the previous steps are fused to obtain the final feature vector \(F_{v}\). This is performed using a weighted approach. The objective is to assign the weights so that the more discriminating features are given priority. With this motivation, the variance is used as a metric to evaluate the discriminating power of a feature. The embeddings of the recognized text, annotations, background and facial image are concatenated, and the variance is calculated. To calculate the weightage of the recognized text, the embeddings of all the parts other than the recognized text are concatenated and their variance is calculated. The square root of the squared difference between these two variances (i.e., the absolute difference) is taken as the weightage of the recognized text, denoted \(W_{R}\). This calculation is shown in Eq. (7).

$$ W_{R} = \sqrt {\left[ {\sigma^{2} \left( {E_{R} \oplus E_{A} \oplus E_{B} \oplus E_{F} } \right) - \sigma^{2} \left( {E_{A} \oplus E_{B} \oplus E_{F} } \right)} \right]^{2} } $$
(7)

where \(\sigma^{2}\) is the variance, \(\oplus\) stands for concatenation, \(E_{R}\), \(E_{A}\), and \(E_{B}\) come from Eq. (5), \(E_{F}\) comes from Eq. (6), and \(W_{R}\) is the weightage of the recognized text. Similarly, the weightages of the annotations, the background scene text and the facial image are calculated as follows:

$$ W_{A} = \sqrt {\left[ {\sigma^{2} \left( {E_{R} \oplus E_{A} \oplus E_{B} \oplus E_{F} } \right) - \sigma^{2} \left( {E_{R} \oplus E_{B} \oplus E_{F} } \right)} \right]^{2} } $$
(8)

where \(W_{A}\) stands for the weightage of the annotations. Similarly, \(W_{B}\) and \(W_{F}\), the weightages of the background and the face respectively, are obtained in the same way as Eq. (8). The weightage of the recognized text, \(W_{R}\), is multiplied with the recognized text embedding, \(E_{R}\), to obtain the final weighted text recognition embedding. Similarly, the weightage of the annotations, \(W_{A}\), is multiplied with the annotation embedding, \(E_{A}\), the weightage of the background, \(W_{B}\), with the background embedding, \(E_{B}\), and the weightage of the facial images, \(W_{F}\), with the facial embedding, \(E_{F}\), to obtain the respective final embeddings. These four embeddings are concatenated to obtain the final feature vector as shown in Eq. (9).

$$ F_{v} = W_{R} \cdot E_{R} \oplus W_{A} \cdot E_{A} \oplus W_{B} \cdot E_{B} \oplus W_{F} \cdot E_{F} $$
(9)

where, \(F_{v}\) is the final feature vector. This process has been shown in Fig. 5.
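A sketch of the variance-based weighting and fusion of Eqs. (7)–(9) is given below; flattening each embedding before concatenation is our assumption, and the names are illustrative.

```python
# Sketch of Eqs. (7)-(9): variance-drop weights and weighted concatenation into F_v.
import numpy as np

def fuse(E_R, E_A, E_B, E_F):
    embeddings = [np.ravel(E_R), np.ravel(E_A), np.ravel(E_B), np.ravel(E_F)]
    full_var = np.var(np.concatenate(embeddings))

    weighted = []
    for i, emb in enumerate(embeddings):
        # Variance of the concatenation with the i-th embedding left out.
        rest = np.concatenate([e for j, e in enumerate(embeddings) if j != i])
        w = np.sqrt((full_var - np.var(rest)) ** 2)   # |variance difference|, Eqs. (7)-(8)
        weighted.append(w * emb)

    return np.concatenate(weighted)                   # F_v, Eq. (9)
```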

Fig. 5. Variance-based weighted fusion and estimation of the degree of personality traits. The representations are multiplied by their weightages and concatenated to form the final feature vector Fv, which is then supplied to FC layers to estimate the degrees of the personality traits.

The final feature vector \(F_{v}\) is supplied to five separate neural network regression models, as shown in Fig. 5. Each network has four hidden layers with 512, 256, 64, and 16 activation units and the rectified linear unit (ReLU) activation function [22], and the output layer has one unit. The loss function used is the Mean Squared Error (MSE) [23] with the Adam optimizer [24]. The batch size is set to 32 with a learning rate of 0.001. Each neural network is trained separately for 50 epochs to obtain the predictions, as shown in Fig. 5.
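One of the five regression heads can be sketched as follows, using the layer sizes and training settings listed above; the dimensionality of \(F_{v}\) and the use of PyTorch are assumptions.

```python
# Sketch of one trait-regression head (512-256-64-16 hidden units, ReLU, one output),
# trained with MSE, Adam, lr 0.001, batch size 32, 50 epochs; one head per trait.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_trait_regressor(input_dim):
    return nn.Sequential(
        nn.Linear(input_dim, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 64), nn.ReLU(),
        nn.Linear(64, 16), nn.ReLU(),
        nn.Linear(16, 1),
    )

def train_trait_regressor(model, features, scores, epochs=50, lr=0.001):
    # features: N x dim(F_v) tensor, scores: N-dimensional tensor of trait degrees.
    loader = DataLoader(TensorDataset(features, scores), batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x).squeeze(1), y)
            loss.backward()
            optimizer.step()
    return model
```

Five such heads, one per trait (O, C, E, A, N), are trained independently on the same \(F_{v}\).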

4 Experimental Results

Our dataset includes 5000 Twitter images collected from the source given in [25], which provides 559 Twitter user accounts and their image posts. For all the collected images, the ground truth is generated via the NEO-PI-R questionnaire [26], a standard procedure for psychological data collection. In the test, responses from the users are recorded for multiple questions and measured on a Likert scale from 1 to 5. A weightage is assigned according to the question, and the average of the weights is considered the actual label of a personality trait. This calculation results in fine-grained scores for each personality trait per image. To test the performance of the proposed method on a standard dataset, the PERS Twitter dataset [27], a considerably large dataset containing 28434 profile pictures, has been used.

To test the effectiveness of the proposed method, a comparison is made with the following state-of-the-art models. Biswas et al. [4] use a multimodal approach for personality traits identification from Twitter posts. Wu et al. [5] use vision transformers for image classification. Further, we also implemented the state-of-the-art method [17] developed for sentiment analysis to show that sentiment analysis methods are ineffective for estimating the degree of personality traits in images. The existing methods are fine-tuned with the same training samples as the proposed method. The same setup is used for all the experiments, with a 70:30 split for training and testing.

For measuring the performance of the method, we use three standard metrics: (i) Spearman’s correlation coefficient, (ii) Pearson’s correlation coefficient, and (iii) Root Mean Square Error (RMSE). Spearman’s correlation coefficient: Spearman’s correlation coefficient is a nonparametric measure of the rank correlation between two variables, as defined in Eq. (10). A high value indicates better performance.

$$ R_{S} = 1 - \frac{{6\sum\nolimits_{i = 1}^{n} \left( {t_{i} - p_{i} } \right)^{2} }}{{n\left( {n^{2} - 1} \right)}} $$
(10)

where \(t_{i}\) and \(p_{i}\) are the ranks of the ground-truth and predicted values respectively, and n is the number of samples. Pearson’s correlation coefficient: Pearson’s correlation coefficient is a test statistic that measures the statistical relationship, or association, between two continuous variables. It is defined in Eq. (11). A high value indicates better performance.

$$ R_{P} = \frac{{n\sum t_{i} p_{i} - \sum t_{i} \sum p_{i} }}{{\sqrt {\left[ {n\sum t_{i}^{2} - \left( {\sum t_{i} } \right)^{2} } \right]\left[ {n\sum p_{i}^{2} - \left( {\sum p_{i} } \right)^{2} } \right]} }} $$
(11)

where, \(n\) is the number of samples. Root Mean Square Error (RMSE): RMSE shows how far predictions fall from measured true values using the Euclidean distance. It is defined in Eq. (12). A low value indicates better performance.

$$ RMSE = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} \left\| {t_{i} - p_{i} } \right\|^{2} }}{n}} $$
(12)

where n is the number of samples, \(t_{i}\) is the ground truth value, and \(p_{i}\) is the predicted value.
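The three measures can be computed per personality trait as in the sketch below, using SciPy and NumPy.

```python
# Sketch of the evaluation measures in Eqs. (10)-(12).
import numpy as np
from scipy.stats import spearmanr, pearsonr

def evaluate(t, p):
    """t: ground-truth trait scores, p: predicted scores (1-D arrays)."""
    r_s, _ = spearmanr(t, p)        # Spearman's rank correlation coefficient
    r_p, _ = pearsonr(t, p)         # Pearson's correlation coefficient
    rmse = float(np.sqrt(np.mean((np.asarray(t) - np.asarray(p)) ** 2)))
    return r_s, r_p, rmse
```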

4.1 Ablation Study

To estimate the degree of multiple personality traits for each image, the textual features, image features and weighted fusion are the key components of the proposed model. Therefore, to validate the effectiveness of each component, we conducted the following experiments: (i) only image features, (ii) only textual features, (iii) image along with textual features, (iv) image features with weighted fusion, (v) textual features with weighted fusion, and (vi) image and textual features with weighted fusion. Table 1 shows that the results of textual features alone are better than those of image features alone, which shows that textual features provide richer semantic information than image features. The same conclusion can be drawn from the results of experiments (iv) and (v). In the same way, combining image and textual features without weighted fusion, as reported in (iii), is better than the individual features and than image features with weighted fusion. This shows that image and textual features contribute equally to achieving the best results. However, when we combine image features, textual features and weighted fusion, which is the proposed method, the highest results are obtained compared to all other experiments. Therefore, one can infer that the key features are effective in coping with the challenges of multiple personality trait estimation. Note that a “ ” in the case of weighted fusion indicates a normal concatenation, whereas in the case of the features it indicates that the features are omitted. The reported correlations and RMSE values are averaged over the five personality trait scores.

Table 1. Assessing the contribution of each modality using our dataset.
Fig. 6. Illustration of the segmentation results of the proposed method. Text regions are separated by fitting bounding boxes, faces are separated by the face detection step, and everything other than face and text is considered the background scene.

4.2 Experiments on Segmentation

For evaluating the segmentation step, qualitative results of the proposed method on sample images of different personality traits are shown in Fig. 6, where it is noted that for all three input images, the steps segment text, face and background well. Therefore, we can conclude that the steps are robust to degradation caused by social media.

4.3 Estimating Multiple Personality Traits

Qualitative results of the proposed method for estimating the five personality traits (O, C, E, A, N) of each image are shown in Fig. 7, where it is noted that for all the images, the predicted values are close to the ground truth. This indicates that our method is capable of estimating multiple personality traits for each image. It is also noted from Fig. 7 that the predicted values shown in bold are closer to the ground truth than the scores of the other personality traits. The dominant score can therefore be used for classifying a single personality trait per image instead of multiple personality traits. Hence, the method can be used for estimating multiple personality traits as well as a single personality trait. To verify these conclusions, quantitative results of the proposed and existing methods [4, 5] on our dataset and the standard dataset (PERS Twitter) are reported in Table 2 and Table 3, respectively.

Fig. 7. Examples of successful results of the proposed method. Bold represents the actual personality trait class if we consider the dominant score as the actual label.

The Spearman’s and Pearson’s correlation coefficients and RMSE values reported in Table 2 and Table 3 on our dataset and the PERS dataset show that our method is better than the existing methods in terms of all three measures, except for the method [4] on one personality trait. Therefore, it can be concluded that the method estimates multiple personality traits accurately for each image in terms of statistical relationship, monotonic relationship and error. Since the RMSE of the proposed method is lower than that of the existing methods for almost all personality traits, the predicted scores are close to the ground truth. This observation confirms that the proposed method is accurate for multiple personality trait estimation. The existing methods report poor results because the method [4] was developed for single personality trait identification, while the method [5] was developed for image classification rather than personality traits identification. In the same way, since the method [17] was developed for sentiment analysis, it does not perform well for classification of multiple personality traits in images. For some personality traits, the method [4] achieves the highest result compared to the proposed and other existing methods [5]. This is because its objective is to estimate a single personality trait, while the objective of the proposed method is to estimate multiple personality traits.

Table 2. Spearman’s Correlation Coefficient, Pearson’s Correlation Coefficient and RMSE for the proposed and existing methods on our dataset.
Table 3. Spearman’s Correlation Coefficient, Pearson’s Correlation Coefficient and RMSE for the proposed and existing methods on PERS Twitter Dataset.
Fig. 8. Samples where poor results are reported. Bold represents the actual personality trait class if we consider the dominant score as the actual label.

Sometimes, when the image does not provide sufficient cues for multiple personality traits estimation, as shown in the sample images in Fig. 8, the method fails to achieve the best results. For example, the third and fifth images from the top-left lack the visual features needed to identify them as Extraversion (the ground truth), and the method misclassifies both as Conscientiousness. In the same way, for the first image, the visual features indicate Openness, while the textual features indicate Neuroticism, and the ground truth is Openness. When there is such a conflict among the features, the method fails to estimate the personality traits correctly. The same conflict can be seen in the second and fourth images, and hence the method classifies them as Neuroticism. To overcome this problem, we plan to develop an end-to-end transformer, which can cope with the challenges of degradation and extract minute information. Therefore, there is scope for improvement in the near future.

5 Conclusion and Future Work

In this work, we have proposed a new method for estimating multiple personality traits in a single image rather than a single personality trait per image. To extract the local information needed to study multiple personality traits, the proposed work segments text, face and background. For each segment, we apply contrastive learning for feature extraction, which extracts textual features from text information and visual features from image information. The work also introduces a weighted fusion operation for fusing the textual and visual features. The fused features are fed to five ANN-based regression models for estimating the five personality traits. Experimental results on our dataset and a standard dataset show that the proposed method outperforms the existing methods in terms of statistical relationship, monotonic relationship and root mean square error. However, when the image lacks clues or there are conflicts between the visual and textual features, the method performs poorly. We plan to address this in future work.