A Content-Based Recommendation Framework for Judicial Cases

Guo, Zichen; He, Tieke; Qin, Zemin; Xie, Zicong; Liu, Jia

doi:10.1007/978-981-15-0118-0_7


Part of the book series: Communications in Computer and Information Science (CCIS, volume 1058)

Included in the following conference series:

International Conference of Pioneering Computer Scientists, Engineers and Educators


Abstract

Against the background of the Judicial Reform of China, big data on judicial cases is widely used in judicial research. Similarity analysis of judicial cases is the basis of wisdom judicature. Because ineffective information must be discarded and useful rules and conditions must be extracted from descriptive documents, analyzing Chinese judicial cases, even those with a fixed format, is a major challenge. Hence, we propose a method that produces recommendations based on the content of judicial cases. Considering the particularity of the Chinese language, we use “jieba” text segmentation to preprocess the cases. Because labels of user interest and behavior are lacking, the proposed method exploits content information by combining TF-IDF with the LDA topic model, as opposed to traditional methods such as CF (Collaborative Filtering). Recommendations are produced by computing the cosine similarity between cases that share the same topic. In the experiments, we evaluate the performance of the proposed model on a dataset of nearly 200,000 judicial cases. The experimental results show that the proposed method performs best when the number of topics is around 80.



1 Introduction

With the development of computer science, simulating difficult real-world problems on a computer has become a very common way to solve them. Meanwhile, with the advancement of artificial intelligence, judicial judgement is getting closer to the justice of law with the aid of big data analysis. It is worth noting that the similarity analysis of judicial cases is the basis of wisdom judicature. A formal judicial case contains the court, the accuser and the accused, the facts, and the result of the case. To be credible within a community, jury trials must take all these complicated factors into consideration with reference to similar cases. With the explosion in the number of judicial cases, it is difficult to consider similar cases without omission. Because of this, we seek to provide a novel recommendation method to assist judicial processing.

Starting with the study of Becker [1], researchers have focused on the factors that influence the optimal amount of enforcement, such as the cost of catching criminals and the subjective decisions that affect the result. However, in practice, these factors are affected by political, moral and many other subjective constraints. Our main purpose is to make use of the objective factors among judicial cases.

Although judicial studies have made progress in many aspects, such as legal word embeddings [2], inferring of the penalty [3] and judicial data standards [4], a recommender system is still needed to deal with the large volume of judicial cases. In general, three filtering techniques, namely content-based [5], collaborative [6, 7] and hybrid filtering [8, 9], are presented in the recommender system literature to filter records and identify relevant information. Some advanced collaborative filtering algorithms [10, 11] take the cold-start problem into consideration, that is, the situation in which users or records of user behaviour are lacking. This problem is especially challenging in the judicial area because there are many one-time users.

In view of the current situation, we propose an effective way to produce recommendations, which is to work from the judicial cases that a user inputs. Our primary focus is to capture semantic similarities among the texts of judicial cases. Given the cases that a user inputs, the proposed model returns a recommended list of relevant cases. Our framework for content-based judicial case recommendation is shown as a flow chart in Fig. 1.

In summary, we do the following work in this paper.

We propose a content-based recommendation method for judicial cases.
We develop a co-training process with TF-IDF and LDA to gain a plausible performance.
We conduct extensive experiments to test the performance of our proposed method, and the results show that it performs best when the number of topics is around 80.

The rest of this paper is organized as follows. Section 2 first describes relevant background of the models and algorithms, then sets out the proposed model and theoretical basis. Section 3 presents the experimental results and Sect. 4 summarizes this paper.

2 Methodology

2.1 Background

In this part, we provide detailed background of the models and algorithms used in this paper.

Recommender Systems. Recommender systems recommend items that specific users may be interested in, such as books, news and movies. At present, recommender system methods are mainly based on collaborative filtering [12], association rules [13], content, or hybrid algorithms [14]. LDA-based recommendation belongs to content-based recommendation.

The cold-start problem arises when users or records of user behaviour are lacking, and methods that address it have proved effective in many real projects. It also calls for attention in the judicial field because there are a large number of one-time users.

Content-Based Recommendation with LDA. In natural language processing, topic modeling is a kind of modeling for discovering the abstract “topics” that occur in a collection of documents. The LDA (Latent Dirichlet Allocation) model proposed by Blei in 2003 [19] made topic models highly popular. LDA is a generative model: each word in a document is assumed to be produced by first selecting a topic with a certain probability and then selecting a word from that topic with a certain probability.

LDA has long been used to study user interests and to build systems that recommend friends with the same or similar interests [17]. However, because labels of user interest and behavior are lacking among judicial cases, it is difficult to rely on user-generated content. We therefore turn to a new direction, which is to analyze and classify the judicial cases input by users as the content, instead of user-generated content. In addition, TF-IDF is another reasonable algorithm for case recommendation.

TF-IDF Algorithm. TF-IDF is a commonly used weighting technique in information retrieval and data mining. TF stands for term frequency and IDF stands for inverse document frequency. TF-IDF has proved useful and effective for stop-word filtering in various subject fields, including text summarization and classification [18].

The TF (term frequency) score treats a document as a bag of words, ignoring word order. A document with 10 occurrences of a term is more relevant to that term than a document in which it occurs once.
We also want to use the frequency of the term in the collection for weighting and ranking. Rare terms are more informative than frequent terms, so we want low positive weights for frequent terms and high weights for rare terms.

2.2 Preliminaries

For convenience, the notations and data formats used in this paper are defined in Table 1.

Table 1. Notations


Definition 1

Judicial Case. A judicial case consists of a collection $R_m (c,q,l,p)$, which means that judicial case m is made up of the collections of words $R_m$ with four elements c, q, l, p.

Definition 2

Topic. LDA defines each topic as a bag of words. Given a dataset of cases, topics maximize the posterior probability of the observed corpus.

2.3 Data Preprocessing

Because Chinese text, unlike text in Romance languages, does not separate words with spaces, we use “jieba” text segmentation to obtain word sequences from the dataset. For each judicial case m in the dataset, we get the collection $R_m(c,q,l,p)$. In addition, a special filter removes key data and sensitive vocabulary from the cases to eliminate interference. We apply the transformation $R_m(c,q,l,p)\rightarrow W_m(c)$ to get the filtered collection of words in judicial case m.

2.4 Information Extraction

In this part, TF-IDF and LDA are used together to build the knowledge on which recommendations are based.

First, in order to smooth the frequency of words in the preprocessed data of M judicial cases, we use TF-IDF to obtain a new corpus for the subsequent training. TF-IDF assumes that if a word is important for a document, it appears repeatedly in that document, whereas it is relatively rare in other documents. The TF is associated with the former assumption and the IDF with the latter. TF-IDF is defined as

$$\begin{aligned} \text {tfidf}(t,d,D)=\text {tf}(t,d)\times \text {idf}(t,D) \end{aligned}$$

where $\text {tf}(t,d)$ is the normalized frequency of term t in document d. It is defined as:

$$\text {tf}(t,d)=\frac{f_{d(t)}}{\text {max}_{w\in d}f_{d(w)}}$$

Here $f_{d(t)}$ is the raw frequency of term t in document d, and w ranges over the words that occur in d. Also, idf(t, D) is the inverse document frequency of term t, which is defined as

$$\text {idf}(t,D)=\text {log}_2\left( \frac{|D|}{|\{d\in D: t\in d\}|}\right) $$

where |D| indicates the total number of documents in the corpus, and $|\{d\in D: t\in d\}|$ is the number of documents in which the term t appears.

The remaining words are filtered using the TF-IDF score. As seen above, TF-IDF measures the importance of a word to a document within a corpus: it increases with the number of occurrences in the document and decreases with the number of documents in the corpus that contain the word. We compute TF-IDF for each word of each document in the corpus and keep a certain number of the highest-scoring words to optimize the corpus.
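The TF-IDF definitions in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the toy corpus, the function names `tfidf_scores` and `keep_top_terms`, and the choice of keeping the top n terms per document are all hypothetical. Documents are assumed to be non-empty lists of already segmented words.

```python
import math
from collections import Counter


def tfidf_scores(corpus):
    """Return one {term: tf-idf} dict per document.

    tf uses max-normalization and idf uses log base 2, matching the
    formulas in this section. Documents must be non-empty token lists.
    """
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    num_docs = len(corpus)
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        peak = max(counts.values())  # frequency of the most common word in doc
        scores.append({
            term: (count / peak) * math.log2(num_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def keep_top_terms(doc_scores, n):
    """Keep the n highest-scoring terms of one document (hypothetical filter)."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:n]


# Toy corpus (hypothetical), standing in for segmented case descriptions.
corpus = [
    ["theft", "theft", "phone", "defendant"],
    ["assault", "injury", "defendant"],
    ["theft", "vehicle", "defendant"],
]
scores = tfidf_scores(corpus)
top_terms = keep_top_terms(scores[0], 2)
```

In this toy corpus the word "defendant" appears in every document, so its IDF and therefore its TF-IDF score is zero, which illustrates why uninformative, ubiquitous words drop out of the filtered corpus.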

Although LDA assumes that documents are given in a bag-of-words (bow) representation, we find success when using the TF-IDF representation, as it can be considered a weighted bag of words. It changes $\theta _m$ and $\varphi _k$ in the LDA model, as shown in Fig. 2.

We describe the LDA process of a judicial case data set in formal language, as shown below. $\text {Dirichlet}()$ represents Dirichlet distribution and $\text {Multi}()$ represents multinomial distribution.

1. For each topic $k\in \{1,\dots ,K\}$, draw $\varphi _k\sim \text {Dirichlet}(\beta )$, denoting the specific word distribution for topic k.
2. For each judicial case $m\in \{1,\dots ,M\}$:
- Draw $\theta _m\sim \text {Dirichlet}(\alpha )$, indicating the distribution of topics embedded in judicial case m;
- For the n-th word in case m, $n\in \{1,\dots ,N\}$, draw a topic $z_{m,n}\sim \text {Multi}(\theta _m)$ and then draw the word $w_{m,n}\sim \text {Multi}(\varphi _{z_{m,n}})$, where $w_{m,n}\in W_m(c)$.

The process above can be used to gain knowledge about different kinds of judicial cases. In order to generate recommendations for users, we also need to perform information retrieval on the topic distributions.
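The generative process listed in this section can be simulated directly, which may help make the roles of $\alpha$, $\beta$, $\theta _m$ and $\varphi _k$ concrete. The sketch below is an illustration only: it samples synthetic documents rather than fitting a model, words are represented by integer ids, all documents share one length, and the function names and toy parameter values are assumptions. Symmetric Dirichlet draws are obtained by normalizing Gamma samples from the standard library.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma samples."""
    draws = [rng.gammavariate(c, 1.0) for c in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index from a multinomial with a single trial."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_topics, vocab_size, num_docs, doc_len, alpha, beta, seed=0):
    rng = random.Random(seed)
    # Step 1: one word distribution phi_k per topic.
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    thetas, docs = [], []
    # Step 2: one topic distribution theta_m per document, then its words.
    for _ in range(num_docs):
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)           # topic of this word
            words.append(sample_categorical(phi[z], rng))  # word id from topic z
        thetas.append(theta)
        docs.append(words)
    return phi, thetas, docs


# Toy values (hypothetical); they are not the hyperparameters of the experiments.
phi, thetas, docs = generate_corpus(
    num_topics=3, vocab_size=10, num_docs=4, doc_len=6, alpha=0.5, beta=0.5, seed=42
)
```

Training LDA inverts this process: given only the observed words, it estimates the $\theta _m$ and $\varphi _k$ that most plausibly generated them.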

2.5 Information Retrieval

For each judicial case $m\in \{1,\dots ,M\}$, information extraction yields a vector of its distribution over the K topics, which is defined as

$$m=(s_1,\ldots ,s_K)$$

where $s_j$ is the weight of topic j in case m. Let $s_i$ be the maximum among $s_1,\ldots ,s_K$; topic i is then regarded as the classification of case m. Since two cases are similar if they have similar topic distributions, the similarity between cases is measured by the cosine of the angle between their vectors. Given a judicial case s input by the user, which belongs to classification i, for each judicial case $t\in \{1,\dots ,M_i\}$ in that classification we compute $\text {Sim}(s,t)$, which is defined as:

$$\text {Sim}(s,t)=\cos {(s,t)}=\frac{s\cdot t}{\Vert s\Vert \times \Vert t\Vert }$$

The recommendation list is composed of the top 5 cases ranked by Sim(s, t).
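The retrieval step described in this section can be sketched as follows. This is a minimal illustration, assuming that every case is already represented by its topic-distribution vector; the case identifiers, the toy vectors and the function names are hypothetical, and a dictionary scan stands in for whatever index structure an actual system would use.

```python
import math


def dominant_topic(topic_vector):
    """Index i of the largest component s_i, used as the case's class."""
    return max(range(len(topic_vector)), key=topic_vector.__getitem__)


def cosine_similarity(s, t):
    dot = sum(a * b for a, b in zip(s, t))
    norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
    return dot / norm if norm else 0.0


def recommend(query, cases, top_n=5):
    """Rank the cases that share the query's dominant topic by cosine similarity."""
    topic = dominant_topic(query)
    candidates = [
        (case_id, cosine_similarity(query, vector))
        for case_id, vector in cases.items()
        if dominant_topic(vector) == topic
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]


# Toy topic vectors with K = 3 (hypothetical data).
cases = {
    "case_a": [0.7, 0.2, 0.1],
    "case_b": [0.6, 0.3, 0.1],
    "case_c": [0.1, 0.8, 0.1],
    "case_d": [0.9, 0.05, 0.05],
}
query = [0.8, 0.1, 0.1]
results = recommend(query, cases, top_n=2)
```

Restricting the comparison to cases with the same dominant topic keeps the number of similarity computations proportional to the size of one class rather than the whole dataset, at the cost of missing similar cases whose dominant topic differs.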

3 Experiments

In this part, we describe the complete implementation of our framework.

3.1 Dataset

We perform experiments on the law case dataset CAIL2018_Small, which contains 204,231 documents in total. After computing TF-IDF, we retrieve a list of low-value words (TF-IDF score under 0.025) and filter them out of the dictionary. In the end, we get a dictionary with 311,024 words. In actual judicial processing, a large number of cases come without manual labeling results. Therefore, we use only the fact description in this dataset. In order to eliminate interfering items, we screen out time, place, person and number before data preprocessing, so as to get the final dataset. The specific methods for judicial cases are as follows:

Regular expressions are used to match time keywords that appear in the cases.
Regular expressions are used to match location keywords that appear in the cases, such as ‘province’, ‘city’, ‘district’.
Characters in the format of “XXX” are replaced by “PERSON” fields.
Measurement units are matched with regular expressions; amounts of money are graded into seven levels and labelled as follows (Table 2):

Table 2. Measurement labels

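The screening rules listed above can be illustrated with a short regular-expression sketch. It covers only the time and money rules and rests on several assumptions that are not taken from the paper: the two patterns, the replacement token `TIME`, and in particular the grade boundaries in `GRADE_BOUNDS` are hypothetical placeholders, since the actual boundaries of the seven grades are those given in Table 2.

```python
import bisect
import re

# Hypothetical patterns for dates such as 2017年5月3日 and amounts such as 3500元 or 2万元.
TIME_RE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")
MONEY_RE = re.compile(r"(\d+(?:\.\d+)?)(万)?元")

# Hypothetical boundaries in yuan: below 1,000 is m1, ..., 1,000,000 and above is m7.
GRADE_BOUNDS = [1_000, 10_000, 50_000, 100_000, 500_000, 1_000_000]


def money_grade(amount):
    """Map an amount in yuan to one of the seven labels m1..m7."""
    return f"m{bisect.bisect_right(GRADE_BOUNDS, amount) + 1}"


def _grade_match(match):
    amount = float(match.group(1))
    if match.group(2):  # the unit 万 means ten thousand
        amount *= 10_000
    return money_grade(amount)


def mask_case(text):
    """Replace dates with a TIME token and money amounts with their grade label."""
    text = TIME_RE.sub("TIME", text)
    return MONEY_RE.sub(_grade_match, text)


masked = mask_case("2017年5月3日盗窃手机一部，价值3500元")
```

Replacing concrete dates and amounts with a small set of tokens keeps incidental details from inflating the dictionary while preserving the coarse magnitude of the money involved.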

To analyze the dataset as a whole, we give the statistics of money in the dataset, as shown in Fig. 3. Across the whole dataset, the proportion of criminal cases involving small amounts of money is very high, while the proportion of cases involving large amounts of money is very low. The number of m7-level criminal cases is 0. This figure reflects the characteristics of the CAIL2018_Small dataset from the aspect of money. The timeline of the CAIL2018_Small dataset is shown in Fig. 4.

3.2 Experimental Results

We use perplexity as the evaluation indicator [19]. Perplexity is a statistical measure of how well a probability model predicts a sample. In information theory terms, perplexity decreases monotonically with the likelihood of the test data and is algebraically equivalent to the inverse of the geometric mean of the per-word likelihood. The lower the perplexity, the better the generalization performance [20]. The perplexity of a held-out test set ($D_{test}$) is defined as follows:

$$\text {perplexity}(D_{\text {test}})=\text {exp}(\frac{-\sum _{d=1}^{M}\log (p(w_d))}{\sum _{d=1}^{M}N_d})$$

where M is the total number of documents in the judicial dataset, $w_d$ denotes the words of document d, and $N_d$ is the number of words in document d.
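As a sanity check on the perplexity formula, the following sketch evaluates it for a model whose per-document log-likelihoods are already known. The function name and the toy numbers are assumptions made for illustration; in practice the log-likelihoods $\log p(w_d)$ would come from the trained topic model.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp of the negative total log-likelihood divided by the total word count."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))


# Toy check: a model that assigns probability 1/100 to every word,
# evaluated on two documents of 3 and 5 words.
doc_lengths = [3, 5]
doc_log_likelihoods = [n * math.log(1 / 100) for n in doc_lengths]
uniform_perplexity = perplexity(doc_log_likelihoods, doc_lengths)
```

For this uniform model the perplexity equals the vocabulary size of 100 (up to floating-point error), which gives an intuitive reading of the measure: a perplexity of about 155 means the model is, on average, as uncertain as a uniform choice among roughly 155 words.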

In the primary setting, for each number of topics $k\in [10,150]$, we set the hyperparameters $\alpha =\frac{50}{k}$ and $\beta =0.01$, following [21]. Figure 5 shows the perplexity for different numbers of topics k.

As can be seen in Fig. 5, when the number of topics $k\simeq 80$, the perplexity reaches its minimum value of about 155, which is acceptable. The perplexity declines significantly for $k\in [10,50]$, trends upward for $k\in [80,95]$, and generally falls again for $k>95$.

To determine the value of k more precisely, we narrow the range to k = 75, 76, 77, 78, 79, 80 and calculate the perplexity for each value, as shown in Fig. 6.

As shown in Fig. 6, when k = 78, the perplexity reaches its minimum value of nearly 154. We therefore choose k = 78 as the ideal number of topics. The 30 words with the highest TF-IDF values in the model with k = 78 are displayed in Fig. 7.

In order to test how our model behaves in practice, we run a series of simulated tests. First, we build a classified corpus according to the topic distribution of each document in the CAIL2018_Small dataset. More specifically, for each document, we choose its most probable topic as its subject catalog. After this, we build a matrix similarity index for each topic catalog. Once the corpus is classified, we can recommend cases to users. Here, the experiment simulates judicial cases input by a user. For example, a user enters a judicial case as follows (Fig. 8):

Then we load the index of the corresponding topic and calculate the cosine similarity between the input case and each case in that index catalog. We select the 5 most similar cases as the recommended judicial cases to present to the user. The top three judicial cases are shown in Fig. 9, and their cosine similarities are 0.9613, 0.9492 and 0.9462.

4 Conclusion

In this paper, we present a content-based method of judicial case recommendation to help users understand judicial cases in greater depth. Specifically, we develop a co-training process with TF-IDF and LDA to gain plausible model performance. Given that LDA is an unsupervised learning algorithm, we conduct experiments to evaluate the performance of the proposed recommender system. The results show the optimal number of topics. Our recommendation method still has room for improvement. Putting state-of-the-art algorithms into practice with good performance is always a critical problem, which we will focus on in the future.

References

1. Becker, G.S., Landes, W.M.: Essays in the Economics of Crime and Punishment. Number 3 in Human Behavior and Social Institutions. National Bureau of Economic Research: Distributed by Columbia University Press
2. He, T., Lian, H., Qin, Z., Zou, Z., Luo, B.: Word embedding based document similarity for the inferring of penalty. In: Meng, X., Li, R., Wang, K., Niu, B., Wang, X., Zhao, G. (eds.) WISA 2018. LNCS, vol. 11242, pp. 240–251. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-02934-0_22
3. He, T.-K., Lian, H., Qin, Z.-M., Chen, Z.-Y., Luo, B.: PTM: a topic model for the inferring of the penalty. J. Comput. Sci. Technol. 33(4), 756–767 (2018)
4. Qin, Z., He, T., Lian, H., Tian, Y., Liu, J.: Research on judicial data standard. In: 2018 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C), pp. 175–177. IEEE (2018)
5. Balabanovic, M., Shoham, Y.: Fab: content-based, collaborative recommendation. Commun. ACM 40, 66–72 (1997)
6. Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative filtering
7. Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: Proceedings of the 16th International Conference on World Wide Web - WWW 2007, p. 271. ACM Press (2007)
8. Badaro, G., Hajj, H., El-Hajj, W., Nachman, L.: A hybrid approach with collaborative filtering for recommender systems. In: 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC), pp. 349–354, July 2013
9. Strub, F., Mary, J., Gaudel, R.: Hybrid collaborative filtering with autoencoders (2016)
10. Ahn, H.J.: A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem. Inf. Sci. 178(1), 37–51 (2008)
11. Patra, B.Kr., Launonen, R., Ollikainen, V., Nandi, S.: A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data. Knowl.-Based Syst. 82(C), 163–177 (2015)
12. Ekstrand, M.D.: Collaborative filtering recommender systems 4(2), 81–173
13. Lin, W., Alvarez, S.A., Ruiz, C.: Efficient adaptive-support association rule mining for recommender systems. Data Min. Knowl. Disc. 6(1), 83–105 (2002)
14. Kardan, A.A., Ebrahimi, M.: A novel approach to hybrid recommendation systems based on association rules mining for content recommendation in asynchronous discussion groups. Inf. Sci. 219, 93–110 (2013)
15. Nagori, R., Aghila, G.: LDA based integrated document recommendation model for e-learning systems, pp. 230–233, April 2011
16. Luostarinen, T., Kohonen, O.: Using topic models in content-based news recommender systems. In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), pp. 239–251. Linköping University Electronic Press, Sweden (2013)
17. Pennacchiotti, M., Gurumurthy, S.: Investigating topic models for social media user recommendation. In: Proceedings of the 20th International Conference Companion on World Wide Web, WWW 2011, pp. 101–102. ACM, New York (2011)
18. Ramos, J.: Using TF-IDF to determine word relevance in document queries
19. Blei, D.M.: Latent Dirichlet allocation, p. 30
20. Arora, K.: Contrastive perplexity: a new evaluation metric for sentence level language models. CoRR, abs/1601.00248 (2016)
21. Yin, H., Sun, Y., Cui, B., Hu, Z., Chen, L.: LCARS: a location-content-aware recommender system, pp. 221–229, August 2013


Acknowledgment

The work is supported in part by the National Key Research and Development Program of China (2016YFC0800805) and the National Natural Science Foundation of China (61772014).

Author information

Authors and Affiliations

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210093, China
Zichen Guo, Tieke He, Zemin Qin, Zicong Xie & Jia Liu


Corresponding author

Correspondence to Jia Liu.

Editor information

Editors and Affiliations

Guilin University of Technology, Guilin, China
Xiaohui Cheng
Northeast Forestry University, Harbin, China
Weipeng Jing
Harbin University of Science and Technology, Harbin, China
Xianhua Song
National Academy of Guo Ding Institute of Data Science, Harbin, China
Zeguang Lu


About this paper

Cite this paper

Guo, Z., He, T., Qin, Z., Xie, Z., Liu, J. (2019). A Content-Based Recommendation Framework for Judicial Cases. In: Cheng, X., Jing, W., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2019. Communications in Computer and Information Science, vol 1058. Springer, Singapore. https://doi.org/10.1007/978-981-15-0118-0_7


DOI: https://doi.org/10.1007/978-981-15-0118-0_7
Published: 13 September 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0117-3
Online ISBN: 978-981-15-0118-0
eBook Packages: Computer Science, Computer Science (R0)