On 14 October 2019, the famous Korean superstar Sulli committed suicide in her own home. Her premature death at the age of 25 drew public attention back to the topic of suicide. Suicide has become the third leading cause of death in today’s world. It is estimated that there are 1.6 million suicide attempts each year, and that approximately 0.8 million people die of suicide [1]. Suicide has long been a topic of public discussion. In the twentieth century, the suicides of two famous female writers, Sanmao and Virginia Woolf, from China and the U.K. respectively, shocked the literary world. It is therefore crucial to find methods of detecting suicidal ideation or tendencies in order to prevent suicide.

Many significant efforts have been made recently. As a first step, several psychological dictionaries have been compiled [2,3,4,5,6]. With the help of these resources and computing technologies, remarkable progress has been made in detecting suicidal ideation and attempts. Most of this research has been conducted on social media materials or on suicide notes, and it has helped to prevent suicide among people with suicidal tendencies. Most of the psychological dictionaries compile specific words related to different themes, such as death, love and anger. However, language is full of duplicity, especially when it is used to express negative emotions, so it is problematic to rely on specific words alone to detect suicidal ideation. With the help of quantitative linguistic techniques, we can find markers that reflect language phenomena which writers cannot consciously hide. These markers may be used to analyze works in different languages, and they reveal both the general nature of language and the psychology of human beings.

In this paper, we use five quantitative linguistic markers (Richness, Pronoun, Sensory, Lambda and Zipf’s Exponent) to analyze the works of two female writers, Virginia Woolf and Sanmao. Woolf wrote in English, whereas Sanmao wrote in Chinese. The markers show clearly how major events influenced each writer’s style and course of life. After analysing the values of these markers, we also try to identify the different reasons why the two writers committed suicide.

Literary scholars tend to treat a writer’s life as auxiliary information for textual interpretation and analysis. Few of them have attempted to probe the mental condition of the creators of literary works by analyzing their texts. Most have applied Freudian psychoanalysis in their studies. Here we take the works of Virginia Woolf as an example. Woolf’s life and novels are both fascinating and perplexing to literary scholars. Tracing the remarks of Woolf’s protagonists about death, notably in her later works, Brombert has argued that Woolf considered writing “as a form of salvation”, “a struggle against death, an act of defiance, a way of protecting her sanity” [7]. Jean Thomson has focused on Mrs. Dalloway in particular, arguing that Woolf wove her own experiences of excitement and depression into the book in order to give an authentic portrayal of the mind of an ex-soldier, Septimus Smith. Thomson has also read Mrs. Dalloway as a kind of “Dance of Death” [8]. Other critical biographers, underlining Woolf’s manic-depressive side, have denied her intention to control her own life [9]. C. Caramagno, in a contentious essay, has also complained about the rather subjective nature of Woolf’s biographies [10]. Although literary scholars have tended to concentrate on biographical details, especially suicide attempts, their main focus has been on plot and characterization.

By combining linguistic features with computing technologies, linguists and computing experts are working to study the relation between features of writing style and suicidal ideation in writers. Earlier studies focused on which professions are more prone to mental disorder [11] or used simple linguistic features to distinguish ordinary letters from suicide notes [12]. Since the compilation of psychological dictionaries, especially LIWC, the analysis of suicide and language has become a classification problem addressed with natural language processing methods. Such work includes detecting whether users have suicidal ideation [13,14,15,16] or a mental disorder [17], and analyzing the emotions and language features of people with suicidal ideation and attempts [18,19,20]. Using the Suicide Probability Scale (SPS) and a classification model, scholars can distinguish people with a high suicide probability from those with a low probability [21,22,23]. In distinguishing suicide notes from ordinary notes, computers do a better job than mental health professionals [24]. However, technologies such as sentiment analysis, opinion mining and depression detection are not well suited to suicide detection [25]. Besides, most studies on detecting suicide in real life have focused on the language used in social media blogs. Methods based on psychological dictionaries amount to sentiment analysis; no real analysis of language is performed. In addition, no studies have focused on chronological changes in the writing styles of writers with suicidal ideation, as if these writers had suicidal ideation throughout their lifetimes.

In suicide detection studies, some scholars have used dictionaries to study changes in word use in the blog of a Chinese adolescent during the year before s/he committed suicide [26]. That study is distinctive in tracing such change in chronological order, but the period is too short to reflect the subject’s lifetime. Others have mainly used four quantitative linguistic markers, Richness, Pronoun, Activity Power and Sensory (RPAS) [27, 28], together with Critical Slowing Down [29], to compare two female English writers, Iris Murdoch and P. D. James, and to see how major events and Alzheimer’s disease influence a writer’s style. The RPAS markers were used to identify personality in previous studies [30,31,32], where they were shown to give good results. It is also believed that these markers can be used to identify mental illnesses such as Alzheimer’s disease and depression, and extreme behaviour such as terrorism. However, these studies have focused only on the English language. It is therefore necessary to test whether these measures are also effective for works in other languages.

In this paper, we have chosen the works of Sanmao and Virginia Woolf to study their change of writing style by using quantitative linguistic markers. Sanmao and Virginia Woolf were both famous female writers who died of suicide. Tables 1 and 2 show the works of Virginia Woolf and Sanmao that we have chosen.

We have chosen only the full-length novels of Virginia Woolf in order to control for genre. All the works are from Project Gutenberg, and their contents are in English. The Python package NLTK is used to tag parts of speech. For Sanmao, we have selected only novels and prose works, again to control for genre. It should be noted that Sanmao’s first published book is The Story of the Sahara and her second is Rainy Season Won’t Come Again. However, Rainy Season Won’t Come Again is a collection of essays and short stories that Sanmao wrote between the ages of seventeen and twenty-two, so we have placed it first in our analysis. To keep the selected works pure, everything written by others (including forewords, epilogues, etc.) is removed. The Python package Jieba is used to segment the Chinese texts and to tag parts of speech.

Quantitative Index Text Analyzer (QUITA) [33] is used to compute the markers R4, Hapax Percentage and Lambda. Altmann Fitter is used to compute Zipf's Exponent [34].


Richness is a measure of a person’s ability to use vocabulary (with a focus on its size). It reflects not only the person’s age and educational background, but also his/her mood and psychological activities [27, 28, 31, 32]. People in a normal or joyful mood tend to have higher Richness than people in a bad mood, such as people with suicidal ideation [12]. The traditional Richness measure, the Type-Token Ratio (TTR), is sensitive to text size. Therefore, R4 and Hapax Percentage are chosen as the Richness markers, as these two measures can eliminate the influence of text size.

R4 is one of the many ways of calculating the vocabulary richness of a text. The larger R4 is, the richer the vocabulary of the text. Formula (1) shows how R4 is calculated. Here V is the total number of types in a text, N is the total number of tokens in the text, r is the rank and f(r) is the frequency at rank r in the rank-frequency distribution of the text.

$$R4=1-\frac{1}{V}(V+1-\frac{2}{N}\sum _{r=1}^{V}rf(r))$$
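The R4 computation can be illustrated with a short Python sketch. This is not the QUITA implementation; it simply evaluates the formula above on a list of tokens. The function names `rank_frequencies` and `r4` and the toy token list are assumptions made for illustration.

```python
from collections import Counter


def rank_frequencies(tokens):
    """Return type frequencies in descending order (rank 1 = most frequent)."""
    return sorted(Counter(tokens).values(), reverse=True)


def r4(tokens):
    """Evaluate R4 = 1 - (1/V) * (V + 1 - (2/N) * sum(r * f(r)))."""
    freqs = rank_frequencies(tokens)
    v = len(freqs)   # V: number of types
    n = sum(freqs)   # N: number of tokens
    weighted = sum(r * f for r, f in enumerate(freqs, start=1))
    return 1 - (v + 1 - 2 * weighted / n) / v


# Toy input (hypothetical): frequencies are 3, 2 and 1.
print(r4("a a a b b c".split()))
```

When every token is a distinct type, the sketch returns 1, the maximum of the measure.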


Hapax Percentage is the simple ratio of the number of hapax legomena in a text to the number of tokens. Hapax legomena are words that occur only once in a text. Formula (2) shows how Hapax Percentage is calculated. Here NH is the number of hapax legomena in a text and N is the total number of tokens.

$$Hapax\,Percentage=\frac{NH}{N}$$
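As a minimal sketch of the ratio described above, the following Python function counts the types that occur exactly once and divides by the number of tokens. The name `hapax_percentage` and the example tokens are hypothetical; tokenization details such as case folding are not specified in the text and are left to the caller.

```python
from collections import Counter


def hapax_percentage(tokens):
    """Ratio of hapax legomena (types occurring once) to the token count."""
    counts = Counter(tokens)
    n_hapax = sum(1 for c in counts.values() if c == 1)  # NH
    return n_hapax / len(tokens)                         # NH / N


# Toy input (hypothetical): only "c" occurs exactly once among six tokens.
print(hapax_percentage("a a a b b c".split()))
```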



Pronoun is generally used to identify the author’s gender [27, 28, 31, 32]. It has been shown that females tend to use more personal pronouns and that the more pronouns a writer uses, the more emotional s/he is [35, 36]. In addition, people with suicidal ideation use more personal pronouns in their social media posts [18]. Formula (3) shows how Pronoun Percentage is calculated. Here P is the total number of personal pronouns in a text and N is the total number of tokens.

$$Pronoun\,Percentage=\frac{P}{N}$$
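The pronoun ratio can be sketched in Python as follows, assuming the text has already been tagged and is available as (token, tag) pairs. The function name `pronoun_percentage` and its `pronoun_tags` parameter are hypothetical. The default tag `PRP` is the Penn Treebank tag for personal pronouns, which NLTK's default tagger emits. For Jieba output, whose tag `r` covers pronouns in general, a word list would presumably be needed to restrict the count to personal pronouns; the text does not say how this was handled.

```python
def pronoun_percentage(tagged_tokens, pronoun_tags=frozenset({"PRP"})):
    """Share of tokens whose part-of-speech tag marks a personal pronoun.

    tagged_tokens: iterable of (token, tag) pairs from any POS tagger.
    pronoun_tags: tags counted as personal pronouns (an assumption here).
    """
    tagged = list(tagged_tokens)
    n_pronouns = sum(1 for _, tag in tagged if tag in pronoun_tags)  # P
    return n_pronouns / len(tagged)                                  # P / N


# Toy input (hypothetical), tagged by hand with Penn Treebank tags.
sample = [("she", "PRP"), ("saw", "VBD"), ("him", "PRP"), ("yesterday", "NN")]
print(pronoun_percentage(sample))
```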



Sensory words refer to visual, auditory, haptic, olfactory and gustatory adjectives [37]. Sensory words have been shown to stimulate the cerebral cortex [38]. Fewer sensory adjectives and smaller values of exclusivity can also be a sign of mental illness [28, 32]. In this paper, we use formula (4) to calculate Sensory [32]. Here \({\varphi }_{i}\) is the number of occurrences of the i-th adjective in the text, \({\vartheta }_{i}\) is the exclusivity value of that adjective, NK is the total number of adjectives in the sensory adjective list and N is the total number of tokens. For the English texts, the sensory adjective list with exclusivity is from [39]; for the Chinese texts, it is from [40]. To eliminate the influence of text size, every term in the summation is normalized by N.

$$Sensory=\sum _{i=1}^{i=NK}\frac{{\varphi }_{i}{\vartheta }_{i}}{N}$$
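A minimal Python sketch of this weighted sum is given below. The `exclusivity` dictionary stands in for the sensory adjective lists of [39] and [40]; its entries and values here are invented purely for illustration, and the function name `sensory_score` is likewise hypothetical.

```python
from collections import Counter


def sensory_score(tokens, exclusivity):
    """Sum of count * exclusivity over the listed adjectives, divided by N.

    tokens: list of word tokens from the text.
    exclusivity: dict mapping each sensory adjective to its exclusivity value.
    """
    counts = Counter(tokens)
    total = sum(counts[adj] * weight for adj, weight in exclusivity.items())
    return total / len(tokens)


# Hypothetical lexicon and text; the values are not taken from [39] or [40].
lexicon = {"bright": 0.8, "loud": 0.5}
text = "the bright light and the loud bright room".split()
print(sensory_score(text, lexicon))
```

Adjectives that are in the list but absent from the text contribute zero, because `Counter` returns 0 for missing keys.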


Lambda is an indicator that deals with the frequency structure of a text. It reflects both vocabulary richness and the relationship between the neighboring frequencies in a text. This measure also eliminates the influence of text size. Formula (5) shows its calculation.

$$Lambda=\frac{\left(\sum _{r=1}^{V-1}\sqrt{{\left(f(r)-f(r+1)\right)}^{2}+1}\right){\log }_{10}N}{N}$$
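The Lambda formula can be read as the arc length of the rank-frequency curve, scaled by the base-10 logarithm of N and divided by N. The sketch below follows that reading in Python. It is an illustration only, not the QUITA code, and the name `lambda_indicator` is an assumption.

```python
import math
from collections import Counter


def lambda_indicator(tokens):
    """Arc length of the rank-frequency curve times log10(N), divided by N.

    Assumes a non-empty token list.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    n_tokens = sum(freqs)  # N
    # Euclidean distance between neighbouring points (r, f(r)), (r+1, f(r+1)).
    arc_length = sum(
        math.sqrt((freqs[i] - freqs[i + 1]) ** 2 + 1)
        for i in range(len(freqs) - 1)
    )
    return arc_length * math.log10(n_tokens) / n_tokens


# Toy input (hypothetical): frequencies 3, 2, 1 give two segments of length sqrt(2).
print(lambda_indicator("a a a b b c".split()))
```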

Formula (6) shows Zipf’s formula. Here \({P}_{r}\) is the frequency of the word with rank r, and C and b are parameters, with b being the exponent. Studies have shown that the exponent b can distinguish languages. The exponent of English is around 1.11, which is the highest among natural languages [41, 42]. Other studies have also found that Zipf’s exponent might be related to language cognition and other aspects [43,44,45,46]. Zipf’s exponent is related to language complexity, meaning that children’s exponents are higher than adults’ [47]. We believe that Zipf’s exponent is also related to one’s mood: if the writer is in a good condition, the Zipf’s exponent of his/her language should be low.

$${P}_{r}=\frac{C}{{r}^{b}}$$
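To make the role of the exponent concrete, the following Python sketch estimates b and C from a frequency list, assuming the usual power-law form P_r = C / r^b, which matches the parameters described above. Note that it uses ordinary least squares on the log-log rank-frequency data, a simple stand-in chosen for illustration; Altmann Fitter uses its own fitting procedure, so its estimates may differ. The function name `zipf_exponent` and the input frequencies are hypothetical.

```python
import math


def zipf_exponent(freqs):
    """Estimate (b, C) in P_r = C / r**b by least squares on log-log data.

    freqs: positive word frequencies; they are ranked in descending order here.
    """
    ranked = sorted(freqs, reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # log P_r = log C - b * log r
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Hypothetical frequencies that follow 120 / r exactly.
b, c = zipf_exponent([120, 60, 40, 30, 24])
print(b, c)
```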


4 Analysis

4.1 Correlation Analysis

The two matrices in Fig. 1 show the Pearson correlations between the quantitative linguistic markers for Virginia Woolf’s works and Sanmao’s works respectively. The markers Richness, Sensory and Lambda clearly behave homogeneously: the higher their values, the better the writer’s mood and mental condition. The markers Pronoun and Zipf’s Exponent also behave homogeneously: the higher their values, the worse the writer’s mental condition. However, it would be improper to suggest that a higher Pearson correlation means that a marker performs better, because the different markers capture different aspects of language.
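Each cell of such a matrix is a Pearson correlation between two marker series measured over the same sequence of works. A minimal Python sketch of that computation follows. The function name `pearson` and the example series are hypothetical and are not values from this study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical marker values for four works, in chronological order.
richness = [0.61, 0.64, 0.70, 0.66]
pronoun = [0.09, 0.08, 0.05, 0.07]
print(pearson(richness, pronoun))
```

A full matrix is obtained by applying the function to every pair of marker series.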

4.2 Markers Visualization

Richness, Sensory and Lambda Visualization.

Figure 2 shows how the markers’ values change diachronically in the works of Virginia Woolf and Sanmao. R4, Hapax Percentage, Sensory and Lambda follow a similar trend in Woolf’s works. At the beginning, when she wrote her first two novels (V1 and V2), the values of the markers were relatively low. Writing these novels was painstaking and difficult for her, both physically and mentally. Leonard Woolf, her husband, believed The Voyage Out (V1) “to be the root cause of Virginia’s distress” [48]. While waiting for the publication of The Voyage Out, Woolf was extremely anxious about its critical reception. In September 1913, she attempted suicide [49]. On 23 February 1915, Woolf began to hallucinate, seeing her mother in the room talking wildly to her. It was not until September that she began to lead a normal life again. Woolf remained relatively stable in 1916 and 1917, but afterwards Vanessa’s desperate wish to get pregnant and the need to finish Night and Day (V2) worsened her depression [48]. The next decade was the prime of Woolf’s literary career. The marker values of V3 to V7 were at a high level, and Woolf’s condition was relatively stable. Although the Sensory values of V4 and V5 and the values of the other three markers for V5 were low, they were still relatively high compared with the values for V1, V2 and V8. Then, when she wrote and revised her eighth novel (V8), an anti-war novel, she was horrified and threatened by the imminent war, which left her extremely angst-ridden. The pressure of outside cruelty and the stress of finishing the novel drove Woolf into insomnia, headaches and a subsequent breakdown [48]. For V9, the values of the three markers other than Sensory were higher than the corresponding values for V8. This is because Woolf’s mood improved in her last years. It was a sudden event, the destruction of her beloved London in the horrific Second World War, that accelerated Woolf’s suicidal tendency.

With Sanmao, we can see a very low point in Dream Whispering Color (S7). In this book, Sanmao wrote down her sad feelings about the loss of her husband Jose (Hexi). She suffered a great deal during that time. Before S7, Sanmao wrote about her happy married life in the Sahara. After S7, she travelled to Latin America and Europe, which in a sense alleviated her pain, so the Richness values during this period were higher. At the end of her writing career (S10, S11 and S12), the marker values fall, foreshadowing her suicide. This also means that, unlike in Woolf’s case, no sudden event led to her death.

Pronoun and Zipf’s Exponent Visualization.

Figure 3 shows how the Pronoun and Zipf’s Exponent values change diachronically in the works of Virginia Woolf and Sanmao. The two markers follow a similar trend in Woolf’s works. Their values indicate a high emotional level at the beginning (V1 and V2), in the middle (V5) and almost at the end (V8). Heightened emotion can be caused by personal temperament and by deep involvement in the process of writing. As shown above, when Woolf wrote The Voyage Out (V1), Night and Day (V2) and The Years (V8), she suffered from cyclothymic depression, and her mood alternated between extreme depression and euphoria. In these periods, Woolf was severely depressed. To the Lighthouse (V5) is Woolf’s most “self-revealing” novel [7]; it is a personal work, an elegy for her parents. As for V4, even though she was in a mentally stable condition, the content of her writing was disturbing: Mrs. Dalloway deals with an ex-soldier who suffers from PTSD and commits suicide at the end of the novel. The emotional depth Woolf reveals through these works is profound.

Sanmao’s Pronoun Percentage and Zipf’s Exponent values reach their highest points in S7, which she wrote while suffering badly from the loss of her husband. The marker values of her last two books slope upward, which shows that Sanmao’s mental condition was deteriorating. The Zipf’s Exponent graphs also show that the exponent of the English works is higher, confirming the findings of a previous study [42].

4.3 Sequence Clustering

After normalization of all the markers, system clustering is used for sequence clustering [50]. Figure 4 shows the sequence clustering of the works of Sanmao and Woolf.
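One possible reading of this step is sketched below in Python. It min-max normalizes each marker and then performs agglomerative clustering in which only chronologically adjacent groups may merge, so that each cluster is a contiguous run of works. This is an assumption made for illustration: the exact procedure of [50], including its normalization and distance measure, is not described here and may differ. All function names and the input values are hypothetical.

```python
def min_max_normalize(rows):
    """Scale each marker (column) to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(x - lo) / sp if sp else 0.0 for x, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]


def centroid(rows):
    """Column-wise mean of a group of marker vectors."""
    return [sum(c) / len(rows) for c in zip(*rows)]


def distance(a, b):
    """Euclidean distance between two marker vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def sequence_clusters(rows, k):
    """Merge chronologically adjacent clusters until k clusters remain.

    rows: one marker vector per work, in chronological order.
    Returns a list of clusters, each a list of work indices.
    """
    data = min_max_normalize(rows)
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > k:
        # Centroid distance for every pair of neighbouring clusters.
        gaps = [
            distance(
                centroid([data[i] for i in clusters[j]]),
                centroid([data[i] for i in clusters[j + 1]]),
            )
            for j in range(len(clusters) - 1)
        ]
        j = gaps.index(min(gaps))
        clusters[j:j + 2] = [clusters[j] + clusters[j + 1]]
    return clusters


# Hypothetical single-marker series for six works.
print(sequence_clusters([[0.0], [0.1], [5.0], [5.1], [9.9], [10.0]], k=3))
```

Restricting merges to neighbours is what keeps the result interpretable as periods in a writer's career rather than as arbitrary groups of similar works.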

  • For Virginia Woolf, we understand:

  • V1, V2: Virginia Woolf’s earlier writings. As a novice in literature, Woolf suffered greatly from the pressure to finish the novels. So, her mood alternated between depression and euphoria.

  • V3, V4, V5: These three books can be seen as Woolf’s best literary works. This period was also the prime of her career and personal life.

  • V6, V7: These two books are generally considered as the best of her later career.

  • V8: The Years, Woolf’s antiwar outcry, took a heavy toll on her sensitive nerves, unstable mental state and vulnerable physical condition.

  • V9: We can see a recovery tendency in the period between Woolf’s last two novels. However, the horrific and inexorable war accelerated Woolf’s suicidal tendency.

  • For Sanmao, we understand:

  • S1, S2, S3, S4, S5, S6: These are her early works and the reflection of her life in the Sahara.

  • S7: It reveals the darkness of her life because of her beloved husband’s death.

  • S8, S9, S10: She created these works during or after her travels in Latin America and Europe.

  • S11, S12: Her last two works indicate a suicidal tendency arising from her negative emotions.

5 Conclusion

We use five quantitative linguistic markers to analyze the English and Chinese works of two female writers. Our results show that all the markers we have chosen can reflect the psychological conditions of the two writers. With the help of these markers, we can trace the psychological changes of writers who never told the public why they chose to end their lives. We also find that major events, such as mental disorder, the loss of a spouse and war, can influence one’s life. These influences may be concealed through the choice of emotional words, but they cannot be hidden when the language of a writer’s works is analyzed with quantitative linguistic markers. However, the markers we use are far from sufficient. More materials should be gathered to analyze people’s psychological activities. We hope that more useful features will be found for detecting suicidal ideation. We hope that anyone suffering from mental problems will indeed get the help they need -- indeed and in need.