1 Introduction

Text classification is an important means of text mining and a part of data mining. It is the supervised classification of documents: a text collection is divided and categorized into several subsets, called classes, such that the texts within a class are more similar to each other than to texts in different classes. Text classification differs from text clustering in that the labels within the data sets are known in advance, whereas clustering deals with unlabeled documents [1].

Recently, semantic similarity has been applied in many different applications, including the health and medical domains. Semantic similarity estimation is used to better understand textual resources [2].

Properly classifying and clustering texts based on suitable criteria and trends improves the understanding and relatedness of data, and decision makers find it easier to extract knowledge from a good classification [3]. Analyzing the relationships between documents based on concepts and terms is one of the semantic analysis methods; it is the best way to make use of retrieved information rather than rebuilding it from scratch [4].

Text categorization has recently become a hot topic in the area of information retrieval. Its main objective is to assign free-text documents to one or more predefined categories based on their content. Traditionally this process is performed manually by domain experts; it is very time-consuming and costly, which limits its applicability [5]. The concepts of text similarity and semantic similarity have been extensively developed to tackle this problem automatically. Many text similarity methods can be used to ease text categorization. Among them are the semantic sequence kin model (SSK), which extracts semantic sequences from a text as its feature strings and then takes into account both the words and their positions when two semantic sequences are compared [6], and the common semantic sequence model (CSSM), which is similar to the semantic sequence kin model but uses another formula to calculate the similarity of semantic sequences, in which word position is not considered [7]. Word similarity estimation has many direct applications. In word-sense disambiguation, context terms can be semantically compared to discover the most similar sense. In document categorization or clustering, the semantic similarity between words can be compared to group documents according to their subject. In word spelling correction, semantic similarity can assess which is the most appropriate correction for a potential misspelling according to its similarity to context words [3].

The concepts of similarity and distance are crucial in many scientific applications. Similarity and distance measures are needed to compute the similarities or distances between different objects, an essential requirement in almost all pattern recognition applications, including clustering, classification, feature selection, outlier detection, regression, and search. A large number of similarity measures exist in the literature; thus, selecting the most appropriate measure for a particular task is a crucial issue, and the success or failure of any pattern recognition technique largely depends on this choice. Which measures are suitable varies with the data types used [8].

The rest of this paper is organized as follows. Section 2 gives a brief explanation of the different types of distance-based similarity measures. Section 3 focuses on documents/texts, showing mathematical comparisons between the distance-based similarity measures. Finally, Sect. 4 presents conclusions and future work.

2 Types of Similarity Measures

Distance-based similarity measures are essential for classification and for measuring the similarity or dissimilarity between two or more vectors/objects (Fig. 1).

Fig. 1 Variables in similarity equations. a Set1. b Set2. c Intersection/common features. d Information system/dataset

The following distance metric families comprise 34 different distance metrics used for measuring the distance or similarity between any two or more vectors; representative members of each family are given below.
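As a concrete illustration, the following minimal Python sketch (illustrative code, not from the original study) derives the four counts of Fig. 1 from two binary feature vectors; it assumes the standard 2 × 2 contingency reading (a = shared presences, b = present only in the first vector, c = present only in the second, d = shared absences).

```python
def contingency_counts(A, B):
    """Return (a, b, c, d) for two equal-length binary vectors.

    Assumes the standard 2x2 contingency reading of Fig. 1's variables.
    """
    if len(A) != len(B):
        raise ValueError("vectors must have the same length")
    a = sum(1 for x, y in zip(A, B) if x == 1 and y == 1)  # shared presences
    b = sum(1 for x, y in zip(A, B) if x == 1 and y == 0)  # only in A
    c = sum(1 for x, y in zip(A, B) if x == 0 and y == 1)  # only in B
    d = sum(1 for x, y in zip(A, B) if x == 0 and y == 0)  # shared absences
    return a, b, c, d

# Example: two toy keyword-presence vectors.
A = [1, 1, 0, 1, 0]
B = [1, 0, 1, 1, 0]
print(contingency_counts(A, B))  # (2, 1, 1, 1)
```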

2.1 L1 Minkowski Distance Metric Family

In the L1 (Minkowski) distance metric family, all distance metrics are defined for \( p = 1 \); a computational sketch of this family's measures follows the list below.

  1.

    Sorenson distance [9]

    $$ d_{Sor} \left( {A,B} \right) = \frac{2c}{a + b} $$
    (1)

    Two points are said to be adjacent if their L1 distance equals 1. Equation 1 is known as the Sorenson distance, sometimes called the Bray-Curtis distance, where the difference is normalized by the sum of the feature vectors at each dimension.

  2.

    Kulczynski distance [9, 10]

    $$ d_{Kul} \left( {A,B} \right) = \frac{{\left( {\frac{a}{a + b} + \frac{a}{a + c}} \right)}}{2} $$
    (2)

    Equation 2 shows that the Kulczynski measure averages the proportions of the shared features a relative to the feature totals (a + b) and (a + c) of the two vectors A and B.

  3.

    Gower distance [9, 10]

    $$ d_{Gow} \left( {A,B} \right) = \frac{a + d}{{\sqrt {\left( {a + b} \right)\left( {a + c} \right)\left( {b + d} \right)\left( {c + d} \right)} }} $$
    (3)

    Gower defined a general similarity coefficient, given in Eq. 3, where for binary variables the entries of A and B stand for the presence or absence of that variable.
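A minimal sketch of Eqs. 1-3, implementing each printed formula verbatim; this is illustrative code under the stated assumptions, not the authors' implementation. Note that Eq. 1 reads most naturally with Fig. 1's set-size interpretation of a, b and c, while Eqs. 2 and 3 follow the contingency reading above.

```python
from math import sqrt

def sorenson(a, b, c):
    # Eq. 1: reads naturally with a, b as set sizes and c as common features.
    return 2 * c / (a + b)

def kulczynski(a, b, c):
    # Eq. 2: contingency reading (a shared, b only in A, c only in B).
    return (a / (a + b) + a / (a + c)) / 2

def gower(a, b, c, d):
    # Eq. 3: also uses the shared absences d.
    return (a + d) / sqrt((a + b) * (a + c) * (b + d) * (c + d))

print(kulczynski(2, 1, 1))  # (2/3 + 2/3) / 2 = 0.666...
```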

2.2 Inner Product Distance Metrics Family

The inner product distance metrics family deals exclusively with similarity measures that incorporate the inner product of two vectors. The inner product yields a scalar and is sometimes called the scalar product or dot product; a tiny worked example follows, and a sketch of this family's measures appears after the list below.
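```python
# Worked example of the dot product: elementwise products summed to a scalar.
A = [1, 2, 3]
B = [4, 5, 6]
dot = sum(x * y for x, y in zip(A, B))
print(dot)  # 1*4 + 2*5 + 3*6 = 32
```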

  1.

    Cosine Similarity [9–16]

    $$ d_{\text{Cos}} \left( {A,B} \right) = \frac{a}{{\sqrt {\left( {a + b} \right)\left( {a + c} \right)} }} $$
    (4)

    Ochiai and Carbó are other names for the cosine coefficient. As given in Eq. 4, the distance is measured based on the angle between the two vectors and is thus often called the angular metric.

  2.

    Jaccard Coefficient [9–11, 13, 15, 17–19]

    $$ d_{Jack} (A,B) = \frac{a}{a + b + c} $$
    (5)

    Similarly, Eqs. 4 and 5 measure the similarity between two vectors; both coefficients reach their maximum value for identical feature vectors.

  3.

    Dice Coefficient [9, 10, 17]

    $$ d_{Dice} \left( {A,B} \right) = \frac{2a}{2a + b + c} $$
    (6)

    The Dice coefficient is very sensitive to small changes because the shared-feature term in the numerator is multiplied by a factor of 2.
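The three coefficients of this family can be sketched as follows, assuming the contingency counts a, b, c introduced earlier (illustrative only, not the authors' implementation):

```python
from math import sqrt

def cosine_ochiai(a, b, c):
    return a / sqrt((a + b) * (a + c))   # Eq. 4

def jaccard(a, b, c):
    return a / (a + b + c)               # Eq. 5

def dice(a, b, c):
    return 2 * a / (2 * a + b + c)       # Eq. 6

# Identical vectors give b = c = 0, so each coefficient reaches its maximum of 1.
print(cosine_ochiai(3, 0, 0), jaccard(3, 0, 0), dice(3, 0, 0))  # 1.0 1.0 1.0
```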

2.3 Squared Chord Distance Metrics Family

  1.

    Squared Chord [9, 10]

    $$ d_{Schi} \left( {A,B} \right) = \sqrt {2\left( {1 - \frac{a}{{\sqrt {\left( {a + b} \right)\left( {a + c} \right)} }}} \right)} $$
    (7)

    The sum of geometric means, \( \sum \sqrt{A_i B_i} \), is referred to as the fidelity similarity, from which the squared chord distance metric is derived; the squared chord distance is given by Eq. 7. In the squared chord distance, the sum of the squared square-root differences at each dimension is taken, which increases the difference for more dissimilar feature vectors. This distance cannot be used for feature spaces with negative values.

  2.

    Hellinger distance [9]

    $$ d_{Heling} \left( {A,B} \right) = 2\sqrt {\left( {1 - \frac{a}{{\sqrt {\left( {a + b} \right)\left( {a + c} \right)} }}} \right)} $$
    (8)

    The Hellinger distance is another way to measure the similarity, or even the dissimilarity, of two or more points. Introduced by Ernst Hellinger in 1909, it is used to quantify the similarity between two probability distributions. In the discrete case, for two discrete probability distributions A and B, the Hellinger distance is defined as in Eq. 8, which is directly related to the Euclidean norm of the difference of the square-root vectors. A computational sketch of this family follows.
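A minimal sketch of Eqs. 7 and 8 as reconstructed above, where the inner term a / sqrt((a + b)(a + c)) is the cosine (Ochiai) coefficient of Eq. 4 (illustrative only):

```python
from math import sqrt

def squared_chord(a, b, c):
    return sqrt(2 * (1 - a / sqrt((a + b) * (a + c))))   # Eq. 7

def hellinger(a, b, c):
    return 2 * sqrt(1 - a / sqrt((a + b) * (a + c)))     # Eq. 8

# Identical vectors (b = c = 0) yield zero distance under both measures.
print(squared_chord(3, 0, 0), hellinger(3, 0, 0))  # 0.0 0.0
```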

2.4 Other Distance Metrics

  1.

    Average distance [9, 13]

    $$ d_{Avg} \left( {A,B} \right) = \frac{{\sum_{i} \left| {A_{i} - B_{i} } \right| + \max_{i} \left| {A_{i} - B_{i} } \right|}}{2} $$
    (9)

    The distance given by Eq. 9 is simply the average of the city block and Chebyshev distances.

  2.

    Sokal and Sneath II [10, 17]

    $$ S_{Sokal\&Sneath{-}II} \left( {A,B} \right) = \frac{a}{{a + 2b + 2c}} $$
    (10)

    The similarity measure in Eq. 10 has a minimum value of 0 when there are no matches (a = 0) and reaches its maximum value of 1 when there are no mismatches (b = 0 and c = 0); the mismatches b and c are given double weight. A sketch of both measures in this subsection follows.
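An illustrative sketch of Eqs. 9 and 10 under the same assumptions as the earlier sketches (not the authors' implementation):

```python
def average_distance(A, B):
    """Eq. 9: the mean of the city block and Chebyshev distances."""
    diffs = [abs(x - y) for x, y in zip(A, B)]
    return (sum(diffs) + max(diffs)) / 2

def sokal_sneath_2(a, b, c):
    """Eq. 10: matches a weighed against doubled mismatches b and c."""
    return a / (a + 2 * b + 2 * c)

print(average_distance([1, 2, 3], [4, 6, 3]))  # city block 7, Chebyshev 4 -> 5.5
print(sokal_sneath_2(2, 1, 1))                 # 2 / 6 = 0.333...
```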

3 Experimental Results

The main target of the study is to strengthen scientific research across the different faculties of Cairo University. To do so, we focused on mining theses, taking the Faculty of Medicine, Cairo University as a case study. Data was collected from the digital library of the Faculty of Medicine and then cleansed for further processing. The collected data consisted of theses documents, including the thesis title and the abstract with keywords. The theses are classified into Master's and doctorate theses, cover the last 10 years, and are separated and categorized by department within the medical school. In total, 4,878 theses with 15,808 keywords were collected across all departments.

The Faculty of Medicine at Cairo University is divided into 35 departments. The aim of this work is to find the departments that can most easily work together to increase research within the faculty, as well as to make it easier for each department to find ways to increase its own research.

Different distance measures were applied to find the nearest departments that could work together. The results include the Euclidean distance, city block distance, Sorenson distance, Kulczynski distance, cosine similarity, Jaccard coefficient, Dice coefficient, Hellinger distance, Sokal and Sneath measure, and the Johnson similarity equation.

As previously mentioned, the main objective is to find the department combinations that can collaborate most easily. After applying the cosine similarity to four randomly selected departments (Anatomy, Andrology, Anesthesia and Cardiology), the results show, for instance, that the department of Anatomy can easily work with the department of Cardiology, with a score of 60 %, while Anesthesia scores 90 % with Cardiology. After applying the Dice coefficient, the Anatomy-Cardiology combination decreased to 40 %. The scores decreased again with the Jaccard coefficient, Sokal and Sneath, and Sorenson measures, and increased again with the Kulczynski distance equation (Tables 1, 2, 3, 4, 5 and 6). A sketch of how such pairwise similarities can be computed appears after the tables.

Table 1 Applying cosine similarity
Table 2 Applying Dice coefficient
Table 3 Applying Jaccard coefficient
Table 4 Applying Kulczynski distance
Table 5 Applying Sokal and Sneath distance
Table 6 Applying Sorenson distance
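To illustrate how such pairwise department similarities could be produced, the following sketch represents each department by a keyword set and applies the cosine (Ochiai) coefficient to every pair. The department names match the four mentioned above, but the keyword sets are invented placeholders, not the actual Cairo University data, and this is not the study's actual pipeline.

```python
from math import sqrt

# Hypothetical keyword sets per department (placeholders, not the real data).
departments = {
    "Anatomy":    {"dissection", "morphology", "imaging", "heart"},
    "Andrology":  {"fertility", "hormones", "imaging"},
    "Anesthesia": {"sedation", "monitoring", "heart", "surgery"},
    "Cardiology": {"heart", "imaging", "monitoring", "surgery"},
}

def cosine_sets(X, Y):
    """Cosine (Ochiai) coefficient on sets: |X & Y| / sqrt(|X| * |Y|)."""
    return len(X & Y) / sqrt(len(X) * len(Y))

# Pairwise similarity for every department combination.
names = sorted(departments)
for i, u in enumerate(names):
    for v in names[i + 1:]:
        print(f"{u} - {v}: {cosine_sets(departments[u], departments[v]):.2f}")
```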

Comparing the different distance equations (cosine similarity, Dice coefficient, Jaccard coefficient, Sokal and Sneath distance, Kulczynski distance and Sorenson distance), we find that the Kulczynski distance performs better than the other equations used, as shown in the figures below, being the most consistent with human assessments (Figs. 2, 3, 4 and 5).

Fig. 2 Anatomy with others

Fig. 3 Andrology with others

Fig. 4 Anesthesia with others

Fig. 5 Cardiology with others

4 Conclusion and Future Work

In this paper, we showed mathematically how texts can be clustered by applying different similarity distance equations to documents. In our example, theses documents were clustered and classified best by the Kulczynski distance, which reached 92.51 % directly, without applying normalization, compared with the other distance measures. Other algorithms can be considered in future work, such as genetic programming and neural networks, comparing their results side by side.