
1 Introduction

With the development of computer science, simulating difficult real-world problems on a computer has become a common way to solve them. Meanwhile, with the advancement of artificial intelligence, big data analysis is helping judicial judgement come closer to the justice the law intends. Similarity analysis of judicial cases is a foundation of intelligent judicature. A formal judicial case contains the court, the accuser and the accused, the facts, and the result of the case. To be credible within a community, a trial must take all these complicated factors into consideration with reference to similar cases. With the explosion in the number of judicial cases, it is difficult to consider similar cases without omission. For this reason, we propose a novel recommendation method to assist judicial processing.

Starting with the study of Becker [1], researchers have focused on the factors that influence the optimal amount of enforcement, such as the cost of catching criminals and the subjective decisions that affect the outcome. In practice, however, these factors are subject to political, moral and many other subjective constraints. Our main purpose is to make use of the objective factors in judicial cases.

Although judicial research has made progress in several areas, such as legal word embeddings [2], penalty inference [3] and judicial data standards [4], a recommender system is still needed to cope with the large volume of judicial cases. In general, three filtering techniques are presented in the recommender system literature to filter records and identify relevant information: content-based [5], collaborative [6, 7] and hybrid filtering [8, 9]. Some recent collaborative filtering algorithms [10, 11] address the cold-start problem, in which users or user behaviours are lacking. Cold start is especially challenging in the judicial domain because many users are one-time users.

In view of this situation, we propose an effective way to produce recommendations from the judicial cases that a user submits. Our primary focus is to capture semantic similarities among the texts of judicial cases. Given the cases that a user inputs, the proposed model returns a list of relevant cases. The framework of our content-based judicial case recommendation is shown as a flow chart in Fig. 1.

In summary, we do the following work in this paper.

  • We propose a content-based recommendation method for judicial cases.

  • We develop a co-training process with TF-IDF and LDA to achieve reasonable performance.

  • We conduct extensive experiments to test the performance of the proposed method; the results show that it performs best when the number of topics is around 80.

The rest of this paper is organized as follows. Section 2 first describes relevant background of the models and algorithms, then sets out the proposed model and theoretical basis. Section 3 presents the experimental results and Sect. 4 summarizes this paper.

Fig. 1. Framework of content-based judicial case recommendation

2 Methodology

2.1 Background

In this part, we provide detailed background of the models and algorithms used in this paper.

Recommender Systems. Recommender systems suggest items that specific users may be interested in, such as books, news and movies. At present, recommender methods are mainly based on collaborative filtering [12], association rules [13], content, or hybrid algorithms [14]. LDA-based recommendation belongs to content-based recommendation.

Cold-start handling, which addresses a lack of users or user behaviours, has proved effective in many real projects. It also calls for attention in the judicial field, where there are a large number of one-time users.

Content-Based Recommendation with LDA. In natural language processing, topic modeling discovers the abstract “topics” that occur in a collection of documents. The LDA (Latent Dirichlet Allocation) model proposed by Blei et al. in 2003 [19] popularized topic models. LDA is a generative model: every word in a document is assumed to be produced by first selecting a topic with a certain probability and then selecting a word from that topic with a certain probability.

LDA has long been used to model user interests and to build systems that recommend friends with the same or similar interests [17]. However, because judicial cases lack labels for user interest and behaviour, it is difficult to rely on user-generated content. We therefore take a different direction: we analyze and classify the judicial cases that users input, treating them as the content. In addition, TF-IDF is another reasonable algorithm for case recommendation.

TF-IDF Algorithm. TF-IDF is a commonly used weighting technique in information retrieval and data mining. TF stands for term frequency and IDF for inverse document frequency. TF-IDF has proved useful and effective for stop-word filtering in various fields, including text summarization and classification [18].

  • TF score (term frequency) treats a document as a bag of words, ignoring word order. A document in which a term occurs 10 times is more relevant to that term than a document in which it occurs once.

  • IDF score (inverse document frequency) uses the frequency of the term across the collection for weighting and ranking. Rare terms are more informative than frequent terms, so frequent terms receive low positive weights and rare terms receive high weights.

2.2 Preliminaries

For convenience, Table 1 lists the notation and data formats used in this paper.

Table 1. Notations

Definition 1

Judicial Case. A judicial case m is represented as a collection of words \(R_m(c,q,l,p)\) with four elements c, q, l and p.

Definition 2

Topic. LDA defines each topic as a bag of words. Given a dataset of cases, topics maximize the posterior probability of the observed corpus.

2.3 Data Preprocessing

Because Chinese text, unlike text in Romance languages, is not delimited by spaces, we use the “jieba” text segmentation tool to obtain word sequences from the dataset. For each judicial case m in the dataset, we get the collection \(R_m(c,q,l,p)\). A special filter then removes key data and sensitive vocabulary from the cases to reduce interference. This gives the transformation \(R_m(c,q,l,p)\rightarrow W_m(c)\), which yields the filtered collection of words in judicial case m.
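As a rough illustration of this step, the sketch below segments one case description and drops filtered vocabulary. The names `preprocess_case`, `segment` and `blocked` are hypothetical and do not come from the paper. The segmenter is passed in as a function so that the example runs without extra packages; with jieba installed, `jieba.lcut` could be supplied in its place.

```python
from typing import Callable, Iterable, List, Set


def preprocess_case(
    text: str,
    segment: Callable[[str], Iterable[str]],
    blocked: Set[str],
) -> List[str]:
    """Segment one case description and drop blocked or empty tokens."""
    tokens = (tok.strip() for tok in segment(text))
    return [tok for tok in tokens if tok and tok not in blocked]


# Toy usage: a whitespace splitter stands in for a Chinese word segmenter,
# and the blocked set stands in for the paper's (unspecified) special filter.
words = preprocess_case("the defendant stole a phone", str.split, {"the", "a"})
print(words)
```

Keeping the segmenter as a parameter separates the filtering logic from the choice of segmentation tool, which makes the filter easy to check on small inputs.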

2.4 Information Extraction

TF-IDF and LDA are trained together in this part to build the recommendation knowledge.

First, in order to smooth the word frequencies in the preprocessed data of the M judicial cases, we use TF-IDF to obtain a new corpus for the subsequent training. TF-IDF assumes that if a word is important for a document, it appears repeatedly in that document while being relatively rare in other documents. The TF is associated with the former assumption and the IDF with the latter. TF-IDF is defined as

$$\begin{aligned} \text {tfidf}(t,d,D)=\text {tf}(t,d)\times \text {idf}(t,D) \end{aligned}$$

where \(\text {tf}(t,d)\) is the normalized frequency of term t in document d, defined as:

$$\text {tf}(t,d)=\frac{f_{d(t)}}{\max _{w\in d}f_{d(w)}}$$

Here \(f_{d(t)}\) is the frequency of term t in document d, and w ranges over the words that occur in d. \(\text {idf}(t,D)\) is the inverse document frequency of t, defined as

$$\text {idf}(t,D)=\log _2\left( \frac{|D|}{|\{d\in D: t\in d\}|}\right) $$

where |D| indicates the total number of documents in the corpus, and \(|\{d\in D: t\in d\}|\) is the number of documents in which the term t appears.

The remaining words are then filtered using the TF-IDF score. As defined above, TF-IDF measures the importance of a word to a document within a corpus: it increases with the number of occurrences in the document and decreases with the word's frequency across the corpus. We compute TF-IDF for each word of each document in the corpus and keep a fixed number of the highest-scoring words to refine the corpus.

Although LDA assumes that documents are given in bag-of-words (BoW) representation, we find that the TF-IDF representation also works well, as it can be considered a weighted bag of words. This changes \(\theta _m\) and \(\varphi _k\) in the LDA model, as shown in Fig. 2.
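The definitions above can be checked on a small corpus. The following sketch uses the max-normalized term frequency and the base-2 inverse document frequency given in this section; the function names, the tie-breaking rule in `keep_top` and the four toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List


def tf(doc: List[str]) -> Dict[str, float]:
    """Max-normalized term frequency: f_d(t) / max_w f_d(w)."""
    counts = Counter(doc)
    if not counts:
        return {}
    peak = max(counts.values())
    return {t: c / peak for t, c in counts.items()}


def idf(corpus: List[List[str]]) -> Dict[str, float]:
    """log2(|D| / number of documents that contain t)."""
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log2(len(corpus) / n) for t, n in doc_freq.items()}


def tfidf(corpus: List[List[str]]) -> List[Dict[str, float]]:
    """One {term: tf * idf} mapping per document."""
    weights = idf(corpus)
    return [{t: f * weights[t] for t, f in tf(doc).items()} for doc in corpus]


def keep_top(scores: Dict[str, float], n: int) -> List[str]:
    """The n highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


corpus = [
    ["theft", "phone", "theft", "night"],
    ["fraud", "phone", "bank"],
    ["theft", "car"],
    ["assault", "night"],
]
scores = tfidf(corpus)
print(scores[0])
print(keep_top(scores[1], 2))
```

In the first toy document, "theft" occurs twice and appears in two of the four documents, so its tf is 1 and its idf is log2(4/2) = 1. Terms that occur in only one document get an idf of 2, which is why they rank first in the second document.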

Fig. 2. Graphical representation of LDA model

We describe the LDA process of a judicial case data set in formal language, as shown below. \(\text {Dirichlet}()\) represents Dirichlet distribution and \(\text {Multi}()\) represents multinomial distribution.

  1. For each topic \(k\in \{1,\dots ,K\}\), draw \(\varphi _k\sim \text {Dirichlet}(\beta )\), denoting the specific word distribution for topic k.

  2. For each judicial case \(m\in \{1,\dots ,M\}\):

    • Draw \(\theta _m\sim \text {Dirichlet}(\alpha )\), indicating the distribution of topics embedded in judicial case m;

    • For the n-th word in case m, \(n\in \{1,\dots ,N_m\}\), draw a topic \(z_{m,n}\sim \text {Multi}(\theta _m)\) and then draw the word \(w_{m,n}\sim \text {Multi}(\varphi _{z_{m,n}})\), where \(w_{m,n}\in W_m(c)\).
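The generative process above can be simulated directly. The sketch below is a forward sampler only, not an inference procedure and not the implementation used in the experiments. It relies on the standard construction of a Dirichlet sample from normalized Gamma draws; the symmetric priors, the fixed number of words per case and the toy vocabulary are assumptions made for the example.

```python
import random
from typing import List, Tuple


def dirichlet(alpha: List[float], rng: random.Random) -> List[float]:
    """Sample from Dirichlet(alpha) by normalizing independent Gamma draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0.0:  # retry in the rare case every draw underflows to zero
            return [d / total for d in draws]


def generate_corpus(
    num_cases: int,
    num_topics: int,
    vocab: List[str],
    words_per_case: int,
    alpha: float,
    beta: float,
    seed: int = 0,
) -> Tuple[List[List[str]], List[List[float]], List[List[float]]]:
    rng = random.Random(seed)
    # Step 1: one word distribution phi_k per topic.
    phi = [dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus, thetas = [], []
    for _ in range(num_cases):
        # Step 2, first bullet: the topic mixture theta_m of this case.
        theta = dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(words_per_case):
            # Step 2, second bullet: draw a topic, then a word from that topic.
            z = rng.choices(range(num_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=phi[z])[0])
        corpus.append(words)
        thetas.append(theta)
    return corpus, thetas, phi


vocab = ["theft", "fraud", "phone", "bank", "car", "night"]
corpus, thetas, phi = generate_corpus(
    num_cases=3, num_topics=2, vocab=vocab, words_per_case=8, alpha=0.5, beta=0.5
)
print(corpus[0], thetas[0])
```

Training LDA runs this process in reverse: given only the observed words, it estimates the \(\theta _m\) and \(\varphi _k\) that most plausibly generated them.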

The process above can be used to extract knowledge from different kinds of judicial cases. To generate recommendations for users, we also need to perform information retrieval on the topic distributions.

2.5 Information Retrieval

For each judicial case \(m\in \{1,\dots ,M\}\), information extraction yields a vector of K topic proportions, defined as

$$m=(s_1,\ldots ,s_K)$$

Let \(s_i\) be the maximum among \(s_1,\ldots ,s_K\); topic i is then taken as the classification of case m. Since two cases are similar if they have similar topic distributions, similarity between cases is measured by the cosine of the angle between their vectors. Given a judicial case s input by a user and belonging to classification i, for each judicial case \(t\in \{1,\dots ,M_i\}\) in that classification we compute \(\text {Sim}(s,t)\), defined as:

$$\text {Sim}(s,t)=\cos {(s,t)}=\frac{s\cdot t}{\Vert s\Vert \times \Vert t\Vert }$$

The recommendation list is composed of the top 5 cases ranked by \(\text {Sim}(s,t)\).
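A minimal sketch of this retrieval step is given below, assuming the topic vectors are already available as plain lists of proportions. The case identifiers and vectors are invented toy values, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def dominant_topic(vec: Sequence[float]) -> int:
    """Index i of the largest topic proportion s_i."""
    return max(range(len(vec)), key=vec.__getitem__)


def cosine(s: Sequence[float], t: Sequence[float]) -> float:
    """Sim(s, t) = (s . t) / (||s|| * ||t||); 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(s, t))
    norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
    return dot / norm if norm else 0.0


def recommend(
    query: Sequence[float],
    cases: Dict[str, Sequence[float]],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Rank the cases that share the query's dominant topic by cosine similarity."""
    topic = dominant_topic(query)
    scored = [
        (case_id, cosine(query, vec))
        for case_id, vec in cases.items()
        if dominant_topic(vec) == topic
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy topic vectors with K = 3 (illustrative values only).
cases = {
    "a": [0.8, 0.1, 0.1],
    "b": [0.6, 0.3, 0.1],
    "c": [0.1, 0.8, 0.1],
    "d": [0.7, 0.1, 0.2],
}
ranking = recommend([0.8, 0.1, 0.1], cases)
print(ranking)
```

Restricting the comparison to cases with the same dominant topic means that only \(M_i\) similarities are computed per query rather than M, at the cost of missing similar cases whose largest topic differs.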

3 Experiments

In this part, we describe the full implementation of our framework.

3.1 Dataset

We perform experiments on the legal case dataset CAIL2018_Small, which contains 204,231 documents in total. After computing TF-IDF, we collect the low-value words (TF-IDF score under 0.025) and remove them from the dictionary, which leaves a dictionary of 311,024 words. In practice, a large number of judicial cases come without manually labeled results, so we use only the fact description field of this dataset. To eliminate interfering items, we screen times, places, persons and numbers before data preprocessing to obtain the final dataset. The specific methods are as follows:

  • Regular expressions are used to match time keywords that appear in the cases.

  • Regular expressions are used to match location keywords that appear in the cases, such as ‘province’, ‘city’, ‘district’.

  • Strings of the form “XXX” are replaced by the “PERSON” field.

  • Measurement units are matched with regular expressions; amounts of money are graded into seven levels and labeled as follows (Table 2):

Table 2. Measurement labels
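The screening rules can be sketched with regular expressions. Everything specific in the example below is an assumption: the two patterns are simple illustrations, the cut points are placeholders rather than the thresholds of Table 2, and the label format m1 to m7 is inferred from the "m7-level" wording used later in this section. Location and person masking would follow the same substitution pattern.

```python
import bisect
import re
from typing import Sequence

# Hypothetical patterns: a full Chinese date and an amount in yuan.
DATE_RE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")
MONEY_RE = re.compile(r"(\d+(?:\.\d+)?)元")


def money_grade(amount: float, cut_points: Sequence[float]) -> str:
    """Map an amount to a label m1..m(len(cut_points) + 1) using ascending cut points."""
    return f"m{bisect.bisect_right(cut_points, amount) + 1}"


def mask_case(text: str, cut_points: Sequence[float]) -> str:
    """Replace dates with TIME and amounts of money with their grade label."""
    text = DATE_RE.sub("TIME", text)
    return MONEY_RE.sub(
        lambda match: money_grade(float(match.group(1)), cut_points), text
    )


# Placeholder cut points, NOT the thresholds of Table 2.
CUTS = [1_000, 10_000, 100_000, 500_000, 1_000_000, 5_000_000]

# "On 5 March 2016 the defendant stole a mobile phone worth 3500 yuan."
masked = mask_case("2016年3月5日被告人盗窃手机一部，价值3500元", CUTS)
print(masked)
```

Six cut points produce seven grades, and `bisect_right` places an amount that equals a cut point in the higher grade. Whether the paper treats boundaries that way is not stated.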

To characterize the dataset as a whole, Fig. 3 gives the statistics of the amounts of money involved. Across the dataset, the proportion of criminal cases involving small amounts of money is very high, while the proportion involving large amounts is very low; there are no m7-level criminal cases. The figure reflects the characteristics of the CAIL2018_Small dataset with respect to money. The distribution of the cases over time is shown in Fig. 4.

Fig. 3. Statistics of money

Fig. 4. Statistics of time

3.2 Experimental Results

We use perplexity as the evaluation metric [19]. Perplexity is a statistical measure of how well a probability model predicts a sample. It is monotonically decreasing in the likelihood of the test data and is algebraically equivalent to the inverse of the geometric mean per-word likelihood. The lower the perplexity score, the better the generalization performance [20]. The perplexity of the held-out test set (\(D_{\text {test}}\)) is defined as follows:

$$\text {perplexity}(D_{\text {test}})=\exp \left( -\frac{\sum _{d=1}^{M}\log p(w_d)}{\sum _{d=1}^{M}N_d}\right) $$

where M is the total number of documents in \(D_{\text {test}}\), \(w_d\) denotes the words of document d, and \(N_d\) is the number of words in document d.
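To make the formula concrete, the sketch below computes perplexity from per-document log-likelihoods and document lengths. The function name and inputs are assumptions, and natural logarithms are assumed so that they match the exponential. As a sanity check, a model that assigns every word a probability of 1/V should have a perplexity of V.

```python
import math
from typing import Sequence


def perplexity(log_likelihoods: Sequence[float], lengths: Sequence[int]) -> float:
    """exp(-sum_d log p(w_d) / sum_d N_d), using natural logarithms."""
    total_words = sum(lengths)
    if total_words == 0:
        raise ValueError("the test set contains no words")
    return math.exp(-sum(log_likelihoods) / total_words)


# Sanity check with a uniform model over a vocabulary of V words:
# each word has probability 1 / V, so log p(w_d) = N_d * log(1 / V).
V = 50
lengths = [10, 25]
log_liks = [n * math.log(1 / V) for n in lengths]
ppl = perplexity(log_liks, lengths)
print(ppl)
```

Read this way, a perplexity of about 155 means the model is, on average, as uncertain about each word as a uniform choice among roughly 155 words.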

In the initial setting, for each number of topics \(k\in [10,150]\), we set the hyperparameters \(\alpha =\frac{50}{k}\) and \(\beta =0.01\), following [21]. Figure 5 shows the perplexity for different numbers of topics k.
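The search over k can be organized as a simple loop, sketched below. The helper names are hypothetical, the step size of the grid is an assumption because the text only gives the range, and `toy_evaluate` is a synthetic stand-in for training an LDA model and measuring its held-out perplexity, so its values carry no experimental meaning.

```python
from typing import Callable, Dict, Iterable, Tuple


def lda_priors(k: int) -> Tuple[float, float]:
    """Hyperparameters for k topics: alpha = 50 / k and beta = 0.01."""
    return 50.0 / k, 0.01


def best_k(
    candidates: Iterable[int],
    evaluate: Callable[[int, float, float], float],
) -> Tuple[int, Dict[int, float]]:
    """Evaluate each candidate k and return the one with the lowest perplexity."""
    scores = {}
    for k in candidates:
        alpha, beta = lda_priors(k)
        scores[k] = evaluate(k, alpha, beta)
    return min(scores, key=scores.get), scores


def toy_evaluate(k: int, alpha: float, beta: float) -> float:
    """Synthetic bowl-shaped curve; NOT a real perplexity measurement."""
    return float((k - 40) ** 2)


k_star, table = best_k(range(10, 151, 5), toy_evaluate)
print(k_star)
```

The same function can be called a second time with a narrower list of candidates to refine the choice around the coarse minimum.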

Fig. 5. Results of k-topic LDA model with TF-IDF in perplexity

As can be seen in Fig. 5, when the number of topics \(k\simeq 80\), perplexity reaches its minimum value of about 155, which is acceptable. The perplexity declines significantly for \(k\in [10,50]\), trends upward for \(k\in [80,95]\), and generally falls again for \(k>95\).

Next, to determine the exact value of k, we narrow the range to k = 75, 76, 77, 78, 79, 80 and calculate the perplexity, as shown in Fig. 6.

Fig. 6. Results of perplexity \(k\in [75,80]\)

As shown in Fig. 6, perplexity reaches its minimum value of nearly 154 at k = 78, so we choose k = 78 as the ideal number of topics. Figure 7 displays the 30 words with the highest TF-IDF values in the model with k = 78.

Fig. 7. Top 30 words

Fig. 8. Case input by user

Fig. 9. Recommended case

To test the model in practice, we simulate a series of tests. First, we build a classified corpus according to the topic distribution of each document in the CAIL2018_Small dataset: for each document, we choose the most probable topic as its subject category. We then build a similarity matrix index for each topic category. Once the corpus is classified, we can recommend cases to users. Here, the experiment simulates judicial cases input by a user. For example, a user enters the judicial case shown in Fig. 8.

We then load the index of the corresponding topic and calculate the cosine similarity between the input case and each case in that index. The 5 most similar cases are presented to the user as recommendations. The top three judicial cases are shown in Fig. 9; their cosine similarities are 0.9613, 0.9492 and 0.9462.

4 Conclusion

In this paper, we present a content-based method of judicial case recommendation to help users understand judicial cases in depth. Specifically, we develop a co-training process with TF-IDF and LDA to achieve reasonable model performance. Since LDA is an unsupervised learning algorithm, we conduct experiments to evaluate the performance of the proposed recommender system, and the results identify the optimal number of topics. Our recommendation method still has room for improvement. Putting state-of-the-art algorithms into practice with good performance remains a critical problem, which we will focus on in future work.