1 Introduction

Statistical characterisation of languages and literary works has long been an intriguing domain for physicists, linguists and statisticians [1]. In particular, the pattern of the frequency distribution of words in a literary document has been a priority area of study [2]. The patterns mostly follow Zipf’s law [3, 4], which states that when the words are ranked by frequency, the frequency of the word of rank x varies as an inverse power of x.

Other distributions, such as the Zipf–Mandelbrot [5], lognormal [6, 7], Gauss–Poisson [6] and extended generalised Zipf [6] distributions, have also marked their presence. The corpora studied are not limited to English, but include languages such as Mongolian [8], Chinese [9], Japanese [10], Hindi [11] and many others [2].

Fig. 1 Word frequency distributions (log–log scale).

Another characteristic feature that has been exhaustively studied and applied is the entropic framework due to Shannon [12]. Entropy-based studies have been carried out for symbolic sequences [13], back-off language models [14], the constancy rate principle [15] and several other domains. Entropy-based long-range correlations have been found in two literary texts, for which the mutual information between pairs of letters and the entropies of the two documents were analysed; the mutual information was observed to decay as a power law, the entropy per letter to scale as the inverse square root of the number of subwords, and the word numbers to follow a stretched exponential [16]. A statistical analysis of English literary words has also been carried out, producing clusters of certain groups of words, and a relation was established between English content words and the entropy computed over their probability distribution [17].

In this paper, a statistical analysis of the ‘Bhagavad Gita’, one of the most sacred writings of the Hindu religion, is presented. The text has previously been studied extensively from various perspectives, such as management [18], psychiatry [19], theosophy [20] and ethics [21], and the plausible application of its teachings in these domains. The present work aims at a statistical characterisation of this text. The effort is interdisciplinary in the sense that it utilises the concepts of statistical physics for the analysis of the Bhagavad Gita in four of its versions. Measures such as entropy are employed to determine the pattern and randomness in the system (here, the text), and the power-law distribution deduced for the word frequencies likewise reveals statistical patterns in the text. The corpus for the study comprises four versions of the text in languages of the Indo-European family: Sanskrit, Hindi, French and English. The documents have been obtained from [22] for Sanskrit and Hindi, [23] for French and [24] for English.

2 A statistical study

2.1 Modelling the word frequency distribution

Statistical models have been deduced for various literatures and scriptures. For instance, Mehri and Jamaati deployed Zipf’s law to model word frequencies in translations of the Holy Bible into 100 living languages, obtaining Zipf exponents in the range 0.765–1.442 [25]. The fit of Zipf’s law has also been shown to perform poorly compared with a number of Pareto-type distributions [26].

The word frequencies of the ‘Bhagavad Gita’ texts in four different languages have been analysed in the present work. Figure 1 depicts the word frequency distributions (on a log–log scale) for the four versions. The curves can be seen to replicate the power-law pattern, with the longest tail in the case of the Sanskrit version of the Bhagavad Gita, which can be attributed to its larger number of unique words. The curves representing French, English and Hindi lie closer to one another than to Sanskrit and follow a similar pattern of descent, as reflected in the values of the Kullback–Leibler (KL) divergence (discussed in the succeeding subsection).
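As an illustration, the rank–frequency data behind figure 1 can be reproduced with a short script. The following is a minimal Python sketch, assuming each version of the text is available as a plain UTF-8 file; the file names are hypothetical placeholders.

```python
# Minimal sketch: rank-frequency curves on a log-log scale (cf. figure 1).
# The file names below are hypothetical placeholders.
import re
from collections import Counter

import matplotlib.pyplot as plt

def word_frequencies(path):
    """Return the word frequencies of a text, sorted in descending order."""
    with open(path, encoding="utf-8") as f:
        # \w+ matches Unicode word characters, so Devanagari and accented
        # Latin letters are kept intact.
        words = re.findall(r"\w+", f.read().lower())
    return sorted(Counter(words).values(), reverse=True)

for lang, path in [("English", "gita_en.txt"), ("French", "gita_fr.txt"),
                   ("Hindi", "gita_hi.txt"), ("Sanskrit", "gita_sa.txt")]:
    freqs = word_frequencies(path)
    plt.loglog(range(1, len(freqs) + 1), freqs, label=lang)

plt.xlabel("rank")
plt.ylabel("frequency")
plt.legend()
plt.show()
```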

Table 1 Results for modelling of word frequency probability distribution.

The probability distribution of word frequencies for the four documents has been modelled using four distributions, namely Zipf, Zipf–Mandelbrot [5], Pareto [27] and lognormal [6, 7].

Mathematically, Zipf’s law can be defined as \(p(x) = a/x^b\), where a denotes the normalising constant and b is the Zipf exponent; a and b can be computed empirically for a particular document. The Zipf–Mandelbrot distribution, a variation of Zipf’s law, shows a relatively better fit at lower ranks, where more functional words are present [28]; it captures this by introducing a new parameter c and takes the form

$$\begin{aligned} p(x) = \frac{a}{(1+cx)^b}. \end{aligned}$$
(1)

For comparative analysis, two more distributions have been considered: the Pareto and lognormal distributions. The Pareto distribution, in terms of the shape parameter a and the scale parameter b, can be defined as

$$\begin{aligned} p(x) = \frac{ab^a}{x^{a+1}}, \quad x\ge b \end{aligned}$$
(2)

and the lognormal distribution, with parameters \(\mu \) and \(\sigma \) (the mean and standard deviation of \(\log x\)), as

$$\begin{aligned} p(x) = \frac{1}{x \sigma \sqrt{2 \pi }} \exp \left( -\frac{(\log x - \mu )^2}{2 \sigma ^2}\right) , \quad x>0. \end{aligned}$$
(3)

The results are given in table 1. To validate the goodness of fit of each model, the sum of squared errors (SSE), \(R^2\) and the root mean square error (RMSE) have been calculated. The Zipf–Mandelbrot distribution provided the best fit, with the lowest SSE and RMSE, in the case of English, French and Hindi, and the corresponding values of \(R^2\) were close to 1. In the case of Sanskrit, the Zipf model provided the best fit. The Zipf exponent obtained was 0.777 for English, 0.754 for French, 0.732 for Hindi and 0.768 for Sanskrit, while the Zipf–Mandelbrot exponent was close to 1 for English, French and Hindi. In the case of Hindi, a negative value of \(R^2\) for the lognormal model shows that the fit was very poor.
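A least-squares fit of eq. (1) and of the plain Zipf form to the empirical rank–probability data, together with the SSE, RMSE and \(R^2\) statistics, can be sketched as below. This is one plausible reconstruction, not the authors’ code: the exact fitting procedure is not spelled out here, word_frequencies is the helper from the earlier sketch, and the file name is a placeholder. The Pareto and lognormal forms of eqs (2) and (3) can be fitted analogously.

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf(x, a, b):
    return a / x**b

def zipf_mandelbrot(x, a, b, c):
    return a / (1.0 + c * x)**b   # eq. (1)

def goodness_of_fit(y, y_hat):
    """SSE, RMSE and R^2 of a fitted curve against the data."""
    sse = np.sum((y - y_hat) ** 2)
    rmse = np.sqrt(sse / len(y))
    r2 = 1.0 - sse / np.sum((y - np.mean(y)) ** 2)
    return sse, rmse, r2

# word_frequencies() is the helper from the earlier sketch; the file
# name is a hypothetical placeholder.
freqs = np.array(word_frequencies("gita_en.txt"), dtype=float)
p_emp = freqs / freqs.sum()                 # empirical probability at each rank
ranks = np.arange(1, len(p_emp) + 1, dtype=float)

for model, p0 in [(zipf, (0.1, 0.8)), (zipf_mandelbrot, (0.1, 1.0, 0.1))]:
    # keep all parameters non-negative so the model stays well defined
    params, _ = curve_fit(model, ranks, p_emp, p0=p0, bounds=(0, np.inf))
    sse, rmse, r2 = goodness_of_fit(p_emp, model(ranks, *params))
    print(model.__name__, params, f"SSE={sse:.3g}, RMSE={rmse:.3g}, R^2={r2:.4f}")
```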

2.1.1 KL divergence among the four versions

The KL divergence [29, 30] provides an asymmetric quantitative measure of the discrepancy between two distributions. For probability distributions \(p_1\) and \(p_2\), the KL divergence \(D_\mathrm{KL}\) is given by

$$\begin{aligned} D_\mathrm{KL}(p_1\Vert p_2)=\sum _i p_1(i)\ln \frac{p_1(i)}{p_2(i)}. \end{aligned}$$
(4)
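As a concrete illustration, eq. (4) can be implemented in a few lines. The sketch below assumes the two rank-ordered probability distributions are compared over their common range of ranks and renormalised after truncation; how differing vocabulary sizes were actually handled is not stated here, so this is an assumption.

```python
import numpy as np

def kl_divergence(p1, p2):
    """D_KL(p1 || p2) of eq. (4), in nats, for two discrete distributions.

    Assumption: distributions of unequal length are truncated to their
    common range of ranks and renormalised (every rank then carries
    nonzero probability, so no smoothing is needed).
    """
    n = min(len(p1), len(p2))
    q1 = np.asarray(p1[:n], dtype=float)
    q2 = np.asarray(p2[:n], dtype=float)
    q1 /= q1.sum()
    q2 /= q2.sum()
    return float(np.sum(q1 * np.log(q1 / q2)))
```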

The KL divergence has been computed between the probability distributions derived for the four versions. Table 2 presents the KL divergence obtained for each pair of languages.

The entries in the last column record the KL divergence of the Sanskrit word frequency distribution from those of English, French and Hindi, which are 0.1638, 0.1913 and 0.2044, respectively. These values are notably larger than the KL divergences between English and French, English and Hindi, and French and Hindi.

Table 2 KL divergence for each pair of languages.

2.2 Vocabulary quotient

The vocabulary quotient, a measure of the randomness in a text related to the uniqueness of the words used, has been calculated for the documents in the four languages. The entropy is computed using the technique described in [31], from which the vocabulary quotient is then derived. Briefly, the frequency of each word used in the document is counted and its probability of occurrence determined. Using these probabilities, the Shannon entropy given by

$$\begin{aligned} S=-\sum p_i \log p_i \end{aligned}$$
(5)

is computed, and the vocabulary quotient is obtained by normalising the entropy value by the maximum possible entropy, computed as

$$\begin{aligned} S_{\max } = - \log (1/n) \end{aligned}$$
(6)

which represents the maximum entropy that a document of n words can have [31].
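A compact sketch of this computation, i.e. eqs (5) and (6), is given below; the natural logarithm is used, which is immaterial since the base cancels in the quotient.

```python
import math
from collections import Counter

def vocabulary_quotient(words):
    """Shannon entropy of the word distribution, eq. (5), normalised by
    the maximum possible entropy log(n) of an n-word document, eq. (6).
    The base of the logarithm cancels in the quotient."""
    n = len(words)
    counts = Counter(words)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    s_max = math.log(n)
    return entropy, s_max, entropy / s_max
```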

Table 3 Entropy analysis and vocabulary quotient of Bhagavad Gita in four languages.
Fig. 2 Probability distribution of word lengths in English.

Fig. 3 Probability distribution of word lengths in Hindi.

Table 3 presents the results obtained for the vocabulary quotient of each document. The column entitled ‘number of words’ gives the total number of words in the document. The maximum possible entropy and the entropy of the document, computed using the technique prescribed in [31], are recorded in columns 3 and 4, respectively. The maximum document entropy and the highest vocabulary quotient are obtained in the case of the Sanskrit version of the text, which may be attributed to the larger number of unique words used in the document. An extensive use of fusion words, formed by the combination of multiple words, was also observed in Sanskrit; this characteristic occurs much more frequently in Sanskrit than in Hindi, English and French and contributes to the vivacity of the document.

Fig. 4 Probability distribution of word lengths in French.

Fig. 5 Probability distribution of word lengths in Sanskrit.

2.3 Word-length distribution

Word length in a language is one of the factors in acquiring a new language. It is also a marker of how a language utilises its alphabet set and of features such as composition and fusion.

The distributions exhibited by word lengths have been studied by various researchers. The typical patterns reported are the Poisson and Ord distributions [32]. A study of the word-length distributions in Shakespeare’s and Bacon’s works has been carried out [33], and a variant of the gamma distribution has been observed for the word-length distributions of English, Swedish and German [34]. Words in the Sanskrit version of the Bhagavad Gita were found to be typically longer than in the English, Hindi and French versions. The distributions are depicted in figures 2–5. The largest word length, 54, was recorded in the case of Sanskrit, while in English it is 24. This highlights the immense presence of fusion in the Sanskrit version of the Bhagavad Gita. The most frequent word length in Sanskrit is one, owing to the abundant occurrence of functional words such as च (‘and’) and न (‘not’).
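The word-length statistics behind figures 2–5 reduce to a simple histogram; a minimal sketch follows. Note that len() counts Unicode code points, so for Devanagari text this includes combining vowel signs; the exact length convention used for the figures is not stated here, so this is an assumption.

```python
from collections import Counter

def word_length_distribution(words):
    """Return the probability of each word length and the maximum length.

    Assumption: length is the number of Unicode code points, which for
    Devanagari text counts combining vowel signs as separate characters.
    """
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    dist = {k: v / total for k, v in sorted(lengths.items())}
    return dist, max(lengths)
```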

3 Conclusion

A statistical characterisation of the Bhagavad Gita text has been performed in four languages: English, French, Hindi and Sanskrit. The study has been conducted along four dimensions: building a statistical model based on the word frequency distribution, the KL divergence between the word frequency distributions of the documents, the vocabulary quotient and the word-length distributions. The probability distribution of the word frequency was modelled using the Zipf, Zipf–Mandelbrot, Pareto and lognormal distributions. The texts in Hindi, French and English followed the Zipf–Mandelbrot pattern with exponents close to 1, whereas for Sanskrit the Zipf model fitted better, with an exponent of 0.7679. Next, the KL divergence was computed between the distributions; its higher values for Sanskrit reflected the marked difference between the probability distribution of Sanskrit and those of the translated versions. The vocabulary quotients, ranging from 0.6251 to 0.8853 with the highest value for Sanskrit, indicate a greater number of unique words in the Sanskrit document. Finally, the word-length distributions were plotted, and the Sanskrit words of the Bhagavad Gita were noted to be typically much longer than those of the other languages.

In the case of English, the word ‘the’ occurred most frequently in the document, followed by ‘and’. A similar trend was observed in Sanskrit, where the word च, which means ‘and’ in English, was the most frequent, and न, which means ‘not’ in English, was the second most frequently used.