1 Introduction

Statistical characterisation of languages and literary works has long been an intriguing domain for physicists, linguists and statisticians [1]. In particular, the pattern of the frequency distribution of words in a literary document has been a priority area of study [2]. The patterns mostly follow Zipf’s law [3, 4], which states that when the words are ranked by frequency, the frequency of the word of rank x varies as an inverse power of x.

Other distributions, such as the Zipf–Mandelbrot [5], lognormal [6, 7], Gauss–Poisson [6] and extended generalised Zipf [6] distributions, have also marked their presence. The corpora studied are not limited to English, but include languages such as Mongolian [8], Chinese [9], Japanese [10], Hindi [11] and many others [2].

Fig. 1 Word frequency distributions (log–log scale).

Another characteristic feature that has been exhaustively studied and applied is the entropic framework due to Shannon [12]. Entropy-based studies have been carried out for symbolic sequences [13], back-off language models [14], the constancy rate principle [15] and several other domains. Entropy-based long-range correlations have been found in two literary texts, for which the mutual information between pairs of letters and the entropies of the two documents were analysed; the mutual information was observed to decay as a power law, the entropy per letter to scale as the inverse square root of the number of subwords, and the word numbers to follow a stretched exponential [16]. A statistical analysis of English literary words has also been carried out, producing clusters of certain groups of words, and a relation was established between English content words and the entropy computed over their probability distribution [17].

In this paper, a statistical analysis of the ‘Bhagavad Gita’, one of the most sacred writings of the Hindu religion, is presented. The text has previously been studied extensively from various perspectives, such as management [18], psychiatry [19], theosophy [20] and ethics [21], and the plausible application of its teachings in these domains. The present work aims at a statistical characterisation of this text. The effort is interdisciplinary in the sense that it utilises the concepts of statistical physics for the analysis of the Bhagavad Gita in four of its versions. Measures such as entropy are employed to determine the pattern and randomness in the system (here, the text), and the power-law distribution deduced for the word frequencies likewise reveals statistical patterns in the text. The corpus for the study comprises four versions of the text in languages of the Indo-European family: Sanskrit, Hindi, French and English. The documents have been obtained from [22] for Sanskrit and Hindi, [23] for French and [24] for English.

2 A statistical study

2.1 Modelling the word frequency distribution

Statistical models have been deduced for various literatures and scriptures. For instance, Mehri and Jamaati deployed Zipf’s law to model word frequencies in translations of the Holy Bible into 100 living languages, obtaining Zipf exponents in the range 0.765–1.442 [25]. The fit of Zipf’s law has also been shown to perform poorly compared with a number of Pareto-type distributions [26].

The word frequencies of the ‘Bhagavad Gita’ texts in four different languages have been analysed in the present work. Figure 1 depicts the word frequency distributions (on a log–log scale) for the four versions. The curves can be seen to replicate the power-law pattern, with the longest tail in the case of the Sanskrit version of the Bhagavad Gita, which can be attributed to its larger number of unique words. The curves representing French, English and Hindi lie closer to one another than to Sanskrit and follow a similar pattern of descent, as reflected in the values of the Kullback–Leibler (KL) divergence (discussed in the succeeding subsection).
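As an illustration, the rank–frequency data behind figure 1 can be reproduced with a short script. The following is a minimal Python sketch, assuming each version of the text is available as a plain UTF-8 file; the file names are hypothetical placeholders.

```python
# Minimal sketch: rank-frequency curves on a log-log scale (cf. figure 1).
# The file names below are hypothetical placeholders.
import re
from collections import Counter

import matplotlib.pyplot as plt

def word_frequencies(path):
    """Return the word frequencies of a text, sorted in descending order."""
    with open(path, encoding="utf-8") as f:
        # \w+ matches Unicode word characters, so Devanagari and accented
        # Latin letters are kept intact.
        words = re.findall(r"\w+", f.read().lower())
    return sorted(Counter(words).values(), reverse=True)

for lang, path in [("English", "gita_en.txt"), ("French", "gita_fr.txt"),
                   ("Hindi", "gita_hi.txt"), ("Sanskrit", "gita_sa.txt")]:
    freqs = word_frequencies(path)
    plt.loglog(range(1, len(freqs) + 1), freqs, label=lang)

plt.xlabel("rank")
plt.ylabel("frequency")
plt.legend()
plt.show()
```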

Table 1 Results for modelling of word frequency probability distribution.

The probability distribution of word frequencies for the four documents has been modelled using four distributions, namely Zipf, Zipf–Mandelbrot [5], Pareto [27] and lognormal [6, 7].

Mathematically, Zipf’s law can be defined as \(p(x) = a/x^b\), where a denotes the normalising constant and b is the Zipf exponent; a and b can be computed empirically for a particular document. The Zipf–Mandelbrot distribution, a variation of Zipf’s law, shows a relatively better fit at lower ranks, where more functional words are present [28]; it captures this by introducing a new parameter c and takes the form

$$\begin{aligned} p(x) = \frac{a}{(1+cx)^b}. \end{aligned}$$
(1)

For comparative analysis, two more distributions have been considered: the Pareto and lognormal distributions. The Pareto distribution, in terms of the shape parameter a and the scale parameter b, can be defined as

$$\begin{aligned} p(x) = \frac{ab^a}{x^{a+1}}, \quad x\ge b \end{aligned}$$
(2)

and the lognormal distribution, with parameters \(\mu \) and \(\sigma \) (the mean and standard deviation of \(\log x\)), as

$$\begin{aligned} p(x) = \frac{1}{x \sigma \sqrt{2 \pi }} \exp \left( -\frac{(\log x - \mu )^2}{2 \sigma ^2}\right) , \quad x>0. \end{aligned}$$
(3)

The results are given in table 1. To validate the goodness of fit of each model, the sum of squared errors (SSE), \(R^2\) and the root mean square error (RMSE) have been calculated. The Zipf–Mandelbrot distribution provided the best fit, with the lowest SSE and RMSE, in the case of English, French and Hindi, and the corresponding values of \(R^2\) were close to 1. In the case of Sanskrit, the Zipf model provided the best fit. The Zipf exponent obtained was 0.777 for English, 0.754 for French, 0.732 for Hindi and 0.768 for Sanskrit, while the Zipf–Mandelbrot exponent was close to 1 for English, French and Hindi. In the case of Hindi, a negative value of \(R^2\) for the lognormal model shows that the fit was very poor.
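A least-squares fit of eq. (1) and of the plain Zipf form to the empirical rank–probability data, together with the SSE, RMSE and \(R^2\) statistics, can be sketched as below. This is one plausible reconstruction, not the authors’ code: the exact fitting procedure is not spelled out here, word_frequencies is the helper from the earlier sketch, and the file name is a placeholder. The Pareto and lognormal forms of eqs (2) and (3) can be fitted analogously.

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf(x, a, b):
    return a / x**b

def zipf_mandelbrot(x, a, b, c):
    return a / (1.0 + c * x)**b   # eq. (1)

def goodness_of_fit(y, y_hat):
    """SSE, RMSE and R^2 of a fitted curve against the data."""
    sse = np.sum((y - y_hat) ** 2)
    rmse = np.sqrt(sse / len(y))
    r2 = 1.0 - sse / np.sum((y - np.mean(y)) ** 2)
    return sse, rmse, r2

# word_frequencies() is the helper from the earlier sketch; the file
# name is a hypothetical placeholder.
freqs = np.array(word_frequencies("gita_en.txt"), dtype=float)
p_emp = freqs / freqs.sum()                 # empirical probability at each rank
ranks = np.arange(1, len(p_emp) + 1, dtype=float)

for model, p0 in [(zipf, (0.1, 0.8)), (zipf_mandelbrot, (0.1, 1.0, 0.1))]:
    # keep all parameters non-negative so the model stays well defined
    params, _ = curve_fit(model, ranks, p_emp, p0=p0, bounds=(0, np.inf))
    sse, rmse, r2 = goodness_of_fit(p_emp, model(ranks, *params))
    print(model.__name__, params, f"SSE={sse:.3g}, RMSE={rmse:.3g}, R^2={r2:.4f}")
```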

2.1.1 KL divergence among the four versions

The KL divergence [29, 30] provides an asymmetric quantitative measure of the discrepancy between two distributions. For probability distributions \(p_1\) and \(p_2\), the KL divergence \(D_\mathrm{KL}\) is given by

$$\begin{aligned} D_\mathrm{KL}(p_1\Vert p_2)=\sum _i p_1(i)\ln \frac{p_1(i)}{p_2(i)}. \end{aligned}$$
(4)
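As a concrete illustration, eq. (4) can be implemented in a few lines. The sketch below assumes the two rank-ordered probability distributions are compared over their common range of ranks and renormalised after truncation; how differing vocabulary sizes were actually handled is not stated here, so this is an assumption.

```python
import numpy as np

def kl_divergence(p1, p2):
    """D_KL(p1 || p2) of eq. (4), in nats, for two discrete distributions.

    Assumption: distributions of unequal length are truncated to their
    common range of ranks and renormalised (every rank then carries
    nonzero probability, so no smoothing is needed).
    """
    n = min(len(p1), len(p2))
    q1 = np.asarray(p1[:n], dtype=float)
    q2 = np.asarray(p2[:n], dtype=float)
    q1 /= q1.sum()
    q2 /= q2.sum()
    return float(np.sum(q1 * np.log(q1 / q2)))
```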

The KL divergence has been computed between the probability distributions derived for the four versions. Table 2 presents the KL divergence obtained for each pair of languages.

The entries in the last column record the KL divergence of the Sanskrit word frequency distribution from those of English, French and Hindi, which are 0.1638, 0.1913 and 0.2044, respectively. These values are notably larger than the KL divergences between English and French, English and Hindi, and French and Hindi.

Table 2 KL divergence for each pair of languages.

2.2 Vocabulary quotient

The vocabulary quotient, a measure of the randomness in a text related to the uniqueness of the words used, has been calculated for the documents in the four languages. The entropy is computed using the technique described in [31], from which the vocabulary quotient is then derived. Briefly, the frequency of each word used in the document is counted and its probability of occurrence determined. Using these probabilities, the Shannon entropy given by

$$\begin{aligned} S=-\sum p_i \log p_i \end{aligned}$$
(5)

is computed, and the vocabulary quotient is obtained by normalising the entropy value by the maximum possible entropy, computed as

$$\begin{aligned} S_{\max } = - \log (1/n) \end{aligned}$$
(6)

which represents the maximum entropy that a document of n words can have [31].
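A compact sketch of this computation, i.e. eqs (5) and (6), is given below; the natural logarithm is used, which is immaterial since the base cancels in the quotient.

```python
import math
from collections import Counter

def vocabulary_quotient(words):
    """Shannon entropy of the word distribution, eq. (5), normalised by
    the maximum possible entropy log(n) of an n-word document, eq. (6).
    The base of the logarithm cancels in the quotient."""
    n = len(words)
    counts = Counter(words)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    s_max = math.log(n)
    return entropy, s_max, entropy / s_max
```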

Table 3 Entropy analysis and vocabulary quotient of Bhagavad Gita in four languages.
Fig. 2 Probability distribution of word lengths in English.

Fig. 3 Probability distribution of word lengths in Hindi.

Table 3 presents the results obtained for the vocabulary quotient of each document. The column entitled ‘number of words’ gives the total number of words in the document. The maximum possible entropy and the entropy of the document, computed using the technique prescribed in [31], are recorded in columns 3 and 4, respectively. The maximum document entropy and the highest vocabulary quotient are obtained in the case of the Sanskrit version of the text, which may be attributed to the larger number of unique words used in the document. An extensive use of fusion words, formed by the combination of multiple words, was also observed in Sanskrit; this characteristic occurs much more frequently in Sanskrit than in Hindi, English and French and contributes to the vivacity of the document.

Fig. 4 Probability distribution of word lengths in French.

Fig. 5 Probability distribution of word lengths in Sanskrit.

2.3 Word-length distribution

Word length in a language is one of the factors in acquiring a new language. It is also a marker of how a language utilises its alphabet set and of features such as composition and fusion.

The distributions exhibited by word lengths have been studied by various researchers. The typical patterns reported are the Poisson and Ord distributions [32]. A study of the word-length distributions in Shakespeare’s and Bacon’s works has been carried out [33], and a variant of the gamma distribution has been observed for the word-length distributions of English, Swedish and German [34]. Words in the Sanskrit version of the Bhagavad Gita were found to be typically longer than in the English, Hindi and French versions. The distributions are depicted in figures 2–5. The largest word length, 54, was recorded in the case of Sanskrit, while in English it is 24. This highlights the immense presence of fusion in the Sanskrit version of the Bhagavad Gita. The most frequent word length in Sanskrit is one, owing to the abundant occurrence of functional words such as च (‘and’) and न (‘not’).
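The word-length statistics behind figures 2–5 reduce to a simple histogram; a minimal sketch follows. Note that len() counts Unicode code points, so for Devanagari text this includes combining vowel signs; the exact length convention used for the figures is not stated here, so this is an assumption.

```python
from collections import Counter

def word_length_distribution(words):
    """Return the probability of each word length and the maximum length.

    Assumption: length is the number of Unicode code points, which for
    Devanagari text counts combining vowel signs as separate characters.
    """
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    dist = {k: v / total for k, v in sorted(lengths.items())}
    return dist, max(lengths)
```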

3 Conclusion

A statistical characterisation of the Bhagavad Gita text has been performed in four languages: English, French, Hindi and Sanskrit. The study has been conducted along four dimensions: building a statistical model based on the word frequency distribution, the KL divergence between the word frequency distributions of the documents, the vocabulary quotient and the word-length distributions. The probability distribution of the word frequency was modelled using the Zipf, Zipf–Mandelbrot, Pareto and lognormal distributions. The texts in Hindi, French and English followed the Zipf–Mandelbrot pattern with exponents close to 1, whereas for Sanskrit the Zipf model fitted better, with an exponent of 0.7679. Next, the KL divergence was computed between the distributions; its higher values for Sanskrit reflected the marked difference between the probability distribution of Sanskrit and those of the translated versions. The vocabulary quotients, ranging from 0.6251 to 0.8853 with the highest value for Sanskrit, indicate a greater number of unique words in the Sanskrit document. Finally, the word-length distributions were plotted, and the Sanskrit words of the Bhagavad Gita were noted to be typically much longer than those of the other languages.

In the case of English, the word ‘the’ occurred most frequently in the document, followed by ‘and’. A similar trend was observed in Sanskrit, where the word च, which means ‘and’ in English, was the most frequent, and न, which means ‘not’ in English, was the second most frequently used.