1 Introduction

Similar case analysis (SCA) aims to perform semantic analysis on legal cases and find similar ones. SCA has two main tasks: similar case matching (SCM) and similar case retrieval (SCR). SCM determines whether legal case documents are similar, while SCR finds cases similar to a target case among candidate cases and ranks them by similarity. SCA plays a significant role in the legal domain. In common law systems, such as those of the United States, Canada, and India, case law dominates, which means judgments are made according to similar and representative cases from the past (Zhong et al. 2020). Although statutes are primary in civil law systems, such as those of China, Germany, and Italy, similar cases still serve as references for legal professionals. As legal documents are generated and accumulated in large numbers, efficiently retrieving similar cases from vast amounts of legal document data has become a major challenge.

Theoretically, the SCR task could be solved with a collaborative approach, such as collaborative filtering, thus avoiding similarity calculation altogether. However, new cases are difficult to model because the user-item interactions available in recommendation scenarios are absent here (Yang et al. 2022). Therefore, in the SCR task, we need to exploit the semantic information of the text to model similarity and complete the retrieval. SCM determines whether legal case documents are similar, which is likewise based on similarity calculation. In summary, the core problem in SCA is calculating the similarity between legal documents.

With the development of deep learning in natural language processing (NLP), exploiting NLP techniques to assist legal tasks has drawn rapidly increasing attention. SCA is a crucial component of legal tasks, and consequently much research has explored combining NLP and SCA. For example, Mandal et al. (2021) convert legal documents into embedding vectors and calculate text similarity between them. Wehnert et al. (2021) combine BERT word embeddings with TF-IDF vectors to enrich document representations. Wu et al. (2021) improve search effectiveness by matching judgments to queries at the semantic level rather than at the keyword level. Shao et al. (2020) and Ma et al. (2021a) propose BERT models based on paragraph-level semantic information. There have been several benchmark efforts in legal artificial intelligence, including SCA, such as CAIL (Xiao et al. 2019), Legal TREC (Oard and Webber 2013), AILA (Bhattacharya et al. 2019), and COLIEE (Rabelo et al. 2020a). For Chinese corpora, CAIL has released benchmarks for Chinese legal tasks, such as CAIL-2019 (Xiao et al. 2019) and LeCaRD (Ma et al. 2021b). These datasets have served as a crucial foundation for research in Chinese SCA, such as Lawformer (Xiao et al. 2021) and LFESM (Hong et al. 2020). However, the Chinese SCA task still faces the following challenges:

Challenge 1: Similarity in semantic structure is not equivalent to case similarity. Existing methods have primarily focused on semantic structure, neglecting legal elements that can affect both the verdict and the similarity between cases, including crucial elements such as legal events. Taking the SCM task as an example, as depicted in Fig. 1, let us temporarily set aside the assumptions in Case B. Although the fact statements of Case A and Case B may be semantically similar, they are not actually similar, because Case A involves violent events while Case B does not. Consequently, Case B should be classified as theft, whereas Case A should be classified as robbery. Traditional SCA methods that rely solely on semantic similarity can easily be misled by semantic structure and often wrongly conclude that Case A is more similar to Case B, when the ground truth is that Case A is more similar to Case C. If, instead, we extract and compare the violent incidents in Case A, Case B, and Case C, the correct conclusion becomes evident. Figure 1 illustrates that the events of being threatened and beaten in Case A correspond to intentional injury events, which in turn correspond to being robbed and stabbed in Case C, whereas no such events appear in Case B. By incorporating this information, we can address the limitation of traditional language models that rely excessively on semantic similarity.

Fig. 1

An illustration of similar case matching. A and B are more similar in semantic structure, but in terms of the case facts, A and C are more similar because their events and the severity of those events are more alike

Some researchers have addressed this problem by extracting legal elements through manual design. Hong et al. (2020) leverage regular expressions to incorporate legal elements into text parsing. Hu et al. (2018) manually add attributes for charges in legal judgment prediction. However, manual rule-based approaches rely heavily on domain-specific prior knowledge and require significant human effort, which is inefficient. Additionally, treating a legal element as a single, context-free entity raises further problems. As shown in Fig. 1, if the assumption content is added to Case A, violent events would also be present there. At this point, if similarity were judged solely on the event sequence, we would again fall into the semantic-similarity trap: although a violent incident occurred in Case A, its severity was much lower than in Cases B and C. Therefore, when considering the events that occur in a case, it is essential to take into account not only the sequence of events but also the context in which they occur.

Overall, when it comes to legal judgments, events are crucial in determining case similarity for several reasons. Firstly, events directly impact case outcomes and significantly influence the legal analysis and decision-making process. Focusing on events allows us to capture the key actions and incidents that shape the case outcome. Secondly, events provide contextual relevance by reflecting the environment and background of the case, helping to identify similarities and differences. Furthermore, events serve as objective and identifiable elements that can be documented, analyzed, and compared, facilitating a systematic and consistent approach to case analysis. Gathering comprehensive information about events is often more feasible than quantifying complex legal elements or abstract concepts. However, it is important to note that other legal elements may hold greater importance in specific fields of law, which can be explored in our future work.

Besides, legal documents are professional texts that necessarily contain many common, formulaic structures, and these parts interfere with SCA. For example, a civil case document often contains the following information:

  • Personal information of plaintiff and defendant.

  • Description of the facts of the case and the plaintiff's claims.

  • The analysis of the court based on the factual description.

In an actual legal document, the above information is usually mixed together. However, only the fact description is key to SCA; the remaining parts interfere with the analysis. Moreover, legal documents use many professional terminologies, such as "found through trial" and "the focus of the dispute in this case". These terminologies may interfere with similarity calculation, since they rarely contain the key features that similarity calculation requires, yet they are difficult to remove through data cleaning.

Challenge 2: Combining multiple datasets for training. Existing methods usually train their models only on the dataset of a specific task, such as SCA. Thus, the models cannot exploit knowledge from other datasets, such as an event detection (ED) dataset. For example, to give a model the ability to perform SCA, we usually train it on a dedicated SCA dataset such as CAIL-2019 (Xiao et al. 2019). However, as mentioned above, event labels, which are important for SCM, are not provided in existing SCM datasets. Thus, performing multi-task training of SCM and ED would require manually labeling events on the SCM dataset, which is time-consuming. Therefore, leveraging an existing event detection dataset, such as LEVEN (Yao et al. 2022), to assist SCA tasks remains an unsolved problem.

Challenge 3: Properly integrating event information with textual semantic features. ED is an information extraction task that can effectively capture the sequence of events contained in a text. It aims to automatically extract event triggers from text and then classify their corresponding event types, and it is commonly formalized as a sequence labeling task. The Chinese legal system divides a criminal case into event sequences and the penalties corresponding to those events (Feng et al. 2022), and there is a clear causal relationship between incidents and penalties: when an event is established, a corresponding penalty must be imposed. Therefore, events play a crucial role in the penalty system and form a basis for judging the similarity of cases. While the concept of constituent elements is commonly associated with criminal law, comparative analysis of case facts and legal provisions is universally applicable in legal practice. This analysis extends to civil cases as well, where comparing facts between resolved cases and ongoing cases is an indispensable component. Even without the requirement of criminal punishment, events can provide valuable insights and help establish the sequence of actions, identify responsibilities, and determine liability in civil disputes. This practice is also widespread in common law jurisdictions.

There have been many studies on ED. Li et al. (2021) propose a method consisting of a semantic feature extractor, a statistical feature extractor, and a joint event discriminator to avoid confusion caused by varied contexts. Si et al. (2022) introduce the prompt-based learning strategy to the ED domain. Although the field of ED is developing rapidly, most current research is based on public datasets such as the common-domain ED dataset ACE2005 and has not explored downstream tasks of ED. Moreover, merely locating events is not sufficient to support judgments about similar cases. For example, the same event in different contexts reflects different information, such as the severity of the event. Therefore, integrating both events and their corresponding contexts when judging similar cases may be a better choice.

To summarize, this study addressed the following research questions (RQs):

RQ1: How can events be introduced into SCA as legal elements instead of relying solely on semantic similarity?

RQ2: How can multiple legal datasets be combined for joint training, so that knowledge from other datasets can be leveraged?

To address these issues, we propose a legal event-context model named LECM for SCA tasks.

Firstly, to integrate event and context information, an event-context integration mechanism is proposed to formalize events and their contextual semantic features based on the attention mechanism. By highlighting the contextual features related to events, it helps differentiate the features of the same event in different contexts and reduces the impact of legal terminology, narrative structure, and other features on the similarity calculation. Based on this event-context integration mechanism, the proposed LECM model can leverage both semantic and event features for inference, thus improving accuracy and interpretability. Then, to help LECM locate the key information of events, we use ED as an auxiliary task for the SCA tasks. Specifically, an ED module is pre-trained on an ED dataset to locate event types and trigger positions. As a bridge between the ED task and the SCA tasks, the ED module improves efficiency by avoiding manual event annotation on the SCA datasets. Finally, the event information obtained by the ED module and the related intermediate-layer features are used in subsequent similarity calculations for more accurate SCA. We conduct experiments on real-world SCA datasets to investigate the effectiveness of our model. The experimental results show that our method outperforms competitive baselines; specifically, LECM achieves the highest precision and accuracy.

The main contributions of this paper can be summarized as follows:

(1) We propose a novel legal event-context model named LECM with three characteristics: (a) it improves the accuracy and interpretability of SCA by detecting events and extracting event-context features based on the proposed event-context integration mechanism; (b) it helps SCA tasks locate key event information by integrating event detection; and (c) it improves efficiency by utilizing a pre-trained ED module instead of manually labeling events on the target datasets, i.e., the SCA datasets in this paper.

(2) To evaluate the proposed LECM model, we conduct extensive experiments on two SCA tasks, i.e., SCM and SCR. Compared with competitive baselines, LECM achieves the highest precision and accuracy. The experiments show that LECM yields substantial improvements on SCA tasks. Further ablation tests and a case study demonstrate the effectiveness of our method.

The rest of the paper is structured as follows: the related work on SCM and ED is introduced in Sect. 2. Section 3 elaborates on the problem definition and the proposed model. Experiment settings and results are discussed in Sect. 4. Finally, we conclude our work in Sect. 5.

2 Related works

2.1 Similar case analysis

SCA is an essential topic in legal artificial intelligence, consisting of SCM and SCR. SCM aims to measure the similarity between legal case documents, which is a particular form of semantic matching. SCR aims to find cases similar to a target case among candidate cases and rank them by similarity, which is an information retrieval task.

There are two broad approaches to the SCA task: graph-based methods and semantic-based methods. Graph-based methods (Minocha et al. 2015; Bhattacharya et al. 2020; Bi et al. 2022; Yang et al. 2022) construct a graph neural network based on existing correlation information between cases and use the similarity of nodes to represent the similarity of cases. However, new cases are difficult to model because of the lack of user-item interactions. Therefore, we mainly consider semantic-based methods in this paper.

Traditional semantic-based methods for SCA tasks often rely on bag-of-words models, such as TF-IDF (Salton and Buckley 1988), BM25 (Robertson and Walker 1994), and LMIR (Ponte and Croft 2017), which prioritize term-level similarities using statistical models. Traditional methods capture key features of text by comparing the frequency or weight of words and have achieved good results on certain tasks. In a study on Brazilian legal document retrieval, Souza et al. (2021) compare various variants of the BM25 algorithm and language models, demonstrating the effectiveness of bag-of-words models in SCA tasks and their strong performance in the legal domain. Mandal et al. (2017) perform extensive experiments on a large dataset of Indian Supreme Court cases to compare various methodologies (TF-IDF, topic modelling, neural networks) for measuring the textual similarity of legal documents. Although traditional methods have achieved significant results, they encounter challenges when dealing with complex texts such as legal documents, including high dimensionality and inaccurate context capture (Kusner et al. 2015; Zhao and Mao 2018; Ali et al. 2019). More recently, deep learning has been widely used in semantic matching. Based on the idea of representation learning, researchers (Wang et al. 2017; Jiang et al. 2019) began using latent space vectors of texts derived from deep learning models, with the similarity score between texts calculated from these vectors. Pre-trained language models, which are trained on unlabeled corpora, have been proven to benefit various downstream NLP tasks (Choi et al. 2020; Röttger and Pierrehumbert 2021).

Researchers from various countries have made significant contributions to SCA tasks using deep learning and neural networks. Bench-Capon et al. (2012) explore different statistical methods, learning techniques, logical analysis, and expert knowledge in this area. Saravanan et al. (2009) propose an ontological framework to improve user queries for retrieving truly relevant legal judgments. Liu et al. (2022) apply a conversational agent workflow, originally designed for web search, to legal case retrieval. Furthermore, Opijnen and Santos (2017) identify several limitations of general information retrieval methods in the legal domain and propose a framework with six dimensions to capture the concept of relevance in legal information retrieval. Shao et al. (2020) utilize BERT to capture paragraph-level relationships and then aggregate the paragraph-level representations to infer the relevance between two legal cases. Rabelo et al. (2019) apply a transformer-based technique to identify entailment relationships between a decision and candidate entailing paragraphs.

Several approaches have been explored in previous research on SCA, and a number of benchmarks have been published, such as CAIL (Xiao et al. 2019), Legal TREC (Oard and Webber 2013), AILA (Bhattacharya et al. 2019), and COLIEE (Rabelo et al. 2020a). In a previous COLIEE competition, a BERT-based language model, Legal-BERT (Chalkidis et al. 2020), was presented, which is pre-trained on a collection of English legal texts from several fields. Kim et al. (2017) introduce judicial document retrieval as an upstream task in judicial question answering and achieve excellent retrieval performance with Siamese networks. Furthermore, in COLIEE 2020 (Rabelo et al. 2020b), a combination of the universal sentence encoder, TF-IDF, and a support vector machine was proposed and achieved good performance on the case law retrieval task.

The SCA task on Chinese corpora has also attracted considerable attention. Xiao et al. (2021) propose Lawformer, a Longformer-based language model pre-trained on large-scale Chinese legal documents, which achieves improvements on a variety of legal artificial intelligence tasks. Hong et al. (2020) leverage regular expressions to extract auxiliary information and combine a Siamese network architecture to complete the semantic analysis of legal cases.

2.2 Event detection

Event detection (ED) is a crucial information extraction task in NLP. ED aims to extract event triggers from texts and then classify their corresponding event types.

Existing ED methods can be categorized into two classes: feature-based methods and representation-based methods. Early works mainly focus on feature-based methods (McClosky et al. 2011; Li et al. 2013). Recently, representation-based methods have attracted more attention. Li et al. (2021) introduce word-event co-occurrence frequencies into ED to reduce the impact of similar contexts. Deng et al. (2021) define links between different events to improve performance on rare events. Besides, some methods have been developed for the legal domain. Feng et al. (2022) manually label events and use them for downstream fine-tuning. Wang et al. (2019) apply adversarial training to event detection and employ dynamic pooling layers to obtain a trigger-specific representation for each candidate. Researchers have also proposed several deep neural networks for legal documents. Chen et al. (2020) extract entities and semantic relations from drug-related legal documents. Devlin et al. (2019) propose a document-level event-argument linking method. HGEED (Lv et al. 2021) introduces a document graph to model sentence-to-sentence dependencies. For Chinese legal text ED, a BiLSTM-CRF-based ED model (Li et al. 2020) has been proposed. Shen et al. (2020) propose a pedal attention mechanism to extract long-distance semantic relations. Li et al. (2019) present a mechanism to define focus events and a two-level labeling approach to automatically extract focus events from case materials. Similar to other specific domains, a legal ED dataset (Yao et al. 2022) has been developed.

Despite the great success of ED, few studies have explored its downstream tasks in the legal domain. Events are an important feature of legal case documents and an important basis for inferring the relevance between legal cases. Therefore, we take SCA as a downstream task of ED in this paper. However, since most current ED methods incur excessive computational complexity, we adopt the BERT + CRF method (Lafferty et al. 2001; Devlin et al. 2019) to reduce the computational cost.

3 Method

In this section, we elaborate on the proposed LECM in detail. First, we give the definition of the SCA tasks. Then, we give an overview of LECM, as shown in Fig. 2, and describe the details of each component. Notably, to adapt to different SCA tasks, some minor changes are required for LECM, which we describe in detail below.

Fig. 2

The framework of LECM. \(k\in \{{\mathbb{A}}_{p}, {\mathbb{B}}_{p}\}\) here

3.1 Problem definition

We evaluate the capability of the model for SCA through two specific tasks:

SCM: SCM aims to measure the similarity among legal documents and select the case most similar to the target case. In this paper, for simplicity, the input of SCM is assumed to be a triplet. Specifically, for a given triplet \((A, B, C)\), case A, case B, and case C represent different legal case fact descriptions. We denote the triplet by word sequences: \(A =\left[{w}_{1}^{a},{w}_{2}^{a}, \dots , {w}_{{l}_{a}}^{a}\right], B =[{w}_{1}^{b},{w}_{2}^{b}, \dots , {w}_{{l}_{b}}^{b}]\), and \(C =[{w}_{1}^{c},{w}_{2}^{c}, \dots , {w}_{{l}_{c}}^{c}]\), where \({l}_{i}\) is the length of the corresponding word sequence, \({w}_{j}^{i}\in V\) denotes a word, and \(V\) is a pre-set fixed vocabulary. The SCM task can be represented as predicting the label \({y}_{scm}\in \{0,1\}\), where \(y=1\) indicates that the similarity between A and B (denoted as \({sim }_{A,B}\)) is less than the similarity between A and C (denoted as \({sim }_{A,C}\)). Conversely, \(y=0\) indicates that the similarity between A and B is greater than the similarity between A and C. LECM outputs probability values for label 0 and label 1, and the label with the higher probability is taken as the model's final prediction.

SCR: Given a query case, the SCR task is to retrieve relevant cases from a pool of candidate cases. Unlike general recommendation tasks, SCR focuses solely on text similarity since user-item information is unavailable. In this study, we transform the SCR task into a query-candidate matching problem that involves calculating similarity. Specifically, given a query case \(Q\) and a set of candidate cases \(Ca=\{{Ca}_{1},{Ca}_{2},\dots ,{Ca}_{N}\}\), where \(Q =\left[{w}_{1}^{Q}, {w}_{2}^{Q},\dots , {w}_{{l}_{q}}^{Q}\right]\) and \({Ca}_{i}=\left[{w}_{1}^{c},{w}_{2}^{c} ,\dots , {w}_{{l}_{ca}}^{c}\right]\). During the training phase, SCR aims to calculate the similarity between the query and the candidates, denoted as \({y}_{scr}=\{{sim }_{{Ca}_{1},Q},\dots ,{sim }_{{Ca}_{\mathrm{N}},Q}\}\). For a given query, after calculating the similarity for all candidate cases, we sort them by their similarity scores and evaluate the model's performance based on this ranking.

Since the task forms of SCM and SCR are similar, LECM can be applied to both tasks with only a few modifications. Therefore, for the sake of brevity, unless otherwise specified, in SCM task, we use \({\mathbb{A}}\) to represent the case \(A\) and use \({\mathbb{B}}\) to represent the case \(B\). In SCR task, we use \({\mathbb{A}}\) to represent query \(Q\) and use \({\mathbb{B}}\) to represent the candidate \({Ca}_{i}\).

In addition to the target task, we utilize ED as an auxiliary task, so we give the problem definition of ED here:

ED: Given a token sequence \([{x}_{1},{x}_{2},\dots ,{x}_{l}]\), where \(l\) is the maximum number of tokens and \({x}_{i}\) is the \(i\)-th token, ED first identifies trigger words and then determines the corresponding event types. A trigger word is a keyword or phrase that initiates a specific event; such words are responsible for causing or triggering specific events in the text. Any token in the statement that does not qualify as a trigger word is classified as a non-trigger word. By identifying trigger words, the occurrence of an event can be determined, and trigger words are usually associated with specific event types. The model first determines whether \({x}_{i}\) is a trigger word. If it is, the model predicts the event type \({e}_{i}\) for it, where \({e}_{i}\in \{{Event}_{0}, {Event}_{1},\dots ,{Event}_{N}\}\), \({Event}_{i}\) represents a specific event type, and \(N\) is the total number of event types. In this paper, all non-trigger words, which do not themselves represent any type of event, are still assigned a dedicated label denoted as \({Event}_{0}\).

3.2 LECM model overview

The proposed legal event-context model learns to extract the representation of the event, and the fact description representation, which could be applied to the downstream task, such as SCA task. The architecture of LECM is shown in Fig. 2.

In LECM, the ED module is proposed to capture information about legal events. It is first pre-trained on the LEVEN dataset (Yao et al. 2022), which serves as an auxiliary dataset in this paper to assist the model on downstream tasks. After that, the ED module becomes part of the LECM model and jointly completes the SCA tasks. In the encoder layer, words are mapped to continuous vectors. We use BERT (Devlin et al. 2019) to obtain the contextual representation of the legal fact description. Inspired by BERT-PLI (Shao et al. 2020), the document is segmented into paragraphs and each paragraph is encoded separately. Since the number of paragraphs is variable, the model can handle both long and short legal documents. Then, in the event-context integration mechanism, the pre-trained ED module is utilized to extract the context features of events. More specifically, attention weights are calculated within a specific range of context based on the embedding representation of the event and the hidden-layer vectors of the ED module. Next, the interactive layer captures the interactive semantic information between paragraphs based on the original paragraph semantics and the event-context information. Finally, the aggregate and output layer aggregates the paragraph-level features and predicts the final results of SCA. For the SCM task, the similarities between case A and case B and between case A and case C are output here, while for the SCR task, this layer outputs the similarity between the query document and the candidate document.

3.3 Detail of LECM model

3.3.1 Paragraph segmentation

Most of the texts in the SCA datasets exceed the maximum input length of BERT, and truncating the text would result in information loss. To tackle this challenge, each legal document is segmented into paragraphs, interactive features are modeled at the paragraph level, and the paragraph-level features are then aggregated in the aggregate and output layer. Since the input forms of the SCM and SCR tasks differ, the two tasks are processed slightly differently at this layer. Specifically, for the SCM task, we first break the triplet into paragraphs, where the length of each paragraph is at most the maximum input length of BERT:

$$\begin{array}{c}A=\left[{A}_{1},{A}_{2},\dots ,{A}_{{N}_{A}}\right]\end{array}$$
(1)
$$\begin{array}{c}B=\left[{B}_{1},{B}_{2},\dots ,{B}_{{N}_{B}}\right]\end{array}$$
(2)
$$\begin{array}{c}C=\left[{C}_{1},{C}_{2},\dots ,{C}_{{N}_{C}}\right]\end{array}$$
(3)

where \({N}_{i}\) is the total number of paragraphs. For the SCR task, we break query case \(Q\) and candidate case \({Ca}_{i}\) into paragraphs, similar to the SCM task:

$$\begin{array}{c}Q=\left[{Q}_{1},{Q}_{2},\dots ,{Q}_{{N}_{Q}}\right]\end{array}$$
(4)
$$\begin{array}{c}{Ca}_{i}=\left[{Ca}_{i,1},{Ca}_{i,2},\dots ,{Ca}_{i,{N}_{C}}\right]\end{array}$$
(5)

The core of the following work is to model the interaction features between paragraphs: for the SCM task, the interaction features between \(A\) and \(B\), or between \(A\) and \(C\); for the SCR task, the interaction features between \(Q\) and \({Ca}_{i}\). The procedure for the two tasks is identical until the output step. Therefore, to keep the notation concise, in the following we use \({\mathbb{A}}\) and \({\mathbb{B}}\) to represent the case pair and \({\mathbb{A}}_{p}\) and \({\mathbb{B}}_{p}\) to represent the paragraphs whose interaction features need to be modeled in the two tasks. Every pair of paragraphs from \({\mathbb{A}}\) and \({\mathbb{B}}\) is fed into the ED module, the encoding step, and the interactive information calculation step.
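To make the segmentation step concrete, the sketch below splits a fact description into fixed-length paragraphs using a HuggingFace-style tokenizer. It is a minimal illustration, not the authors' implementation; the function name, the chunking policy, and the two-paragraphs-per-case setting (cf. Sect. 4.3) are assumptions.

```python
# A minimal sketch of the paragraph segmentation step, assuming a HuggingFace-style
# tokenizer; the function name, the chunking policy, and the two-paragraphs-per-case
# setting are illustrative assumptions, not the authors' code.
from transformers import BertTokenizer


def segment_into_paragraphs(text, tokenizer, max_len=512, num_paragraphs=2):
    """Split a fact description into `num_paragraphs` chunks of at most `max_len`
    tokens each (special tokens included), padding with empty chunks if needed."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    body_len = max_len - 2                      # reserve room for [CLS] and [SEP]
    chunks = [token_ids[i:i + body_len] for i in range(0, len(token_ids), body_len)]
    chunks = (chunks + [[]] * num_paragraphs)[:num_paragraphs]
    # Re-attach special tokens and pad every chunk to a fixed length.
    return [tokenizer.prepare_for_model(c, max_length=max_len,
                                        padding="max_length", truncation=True)
            for c in chunks]


if __name__ == "__main__":
    tok = BertTokenizer.from_pretrained("bert-base-chinese")
    paragraphs = segment_into_paragraphs("被告人于某年某月盗窃他人财物……", tok)
    print(len(paragraphs), len(paragraphs[0]["input_ids"]))   # 2 512
```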

3.3.2 ED module

ED aims to predict the event label \({e}_{i}\) for each individual token, taking into account the context and potential variations within each statement. In this paper, we consider events as legal elements that play an important role in subsequent SCA tasks. Although there are many successful ED models, using them as an upstream task would lead to excessive computational complexity for LECM. Taking DMBERT (Wang et al. 2019) as an example, its input must specify the position of the token to be predicted in the sentence. If there are \(m\) sentences and each sentence contains \(n\) tokens, then the time complexity of predicting all events is \(O(mn)\). Since our model needs to complete downstream tasks on top of ED, such time complexity is unacceptable. Therefore, we choose BERT + CRF, a low-complexity ED model: it performs the ED task on \(m\) sentences with a time complexity of only \(O(m)\), independent of the specific length of the text.

The ED module is pre-trained on the ED task before training LECM on the SCA task so that the event information can be leveraged by LECM in SCA task. Formally, denoting an input sequence \(k = \left[{w}_{1}^{ed},{w}_{2}^{ed}, \dots , {w}_{{l}_{e}}^{ed}\right]\), ED aims to predict the event label \({e}_{i}\) on \({w}_{i}^{ed}\).

Considering that pre-trained language models have been proven to benefit various downstream NLP tasks (Devlin et al. 2019; Choi et al. 2020; Röttger and Pierrehumbert 2021), we employ BERT, a general pre-trained language model, as the basic encoder in the ED module to generate the embedding of each token dynamically. Since all the legal documents in the datasets used in this work are written in Simplified Chinese, we adopt OpenCLaP (Zhong et al. 2019), a BERT model pre-trained on a large Chinese legal corpus, as our BERT model.

In the encoder of the ED module, BERT learns the representation of the legal text as follows:

$$\begin{array}{c}{h}^{ed,k}=BERT\left(k\right)\in {\mathbb{R}}^{{l}_{ed}\times {d}_{s}}\end{array}$$
(6)

where \({h}^{ed,k}\) represents the embedded representation of paragraph \(k\in \{{\mathbb{A}}_{p}, { {\mathbb{B}}}_{p}\}\) encoded by BERT and \({d}_{s}\) is the size of hidden states generated by BERT. In this way, the legal knowledge from the pre-trained corpus is brought into the text embedding \({h}^{ed,k}\). Then, we employ a fully-connected layer to make the final prediction of ED task:

$$\begin{array}{c}{\widehat{y}}^{ed}=\sigma \left({W}^{ed}{h}^{ed,k}+{b}^{ed}\right)\end{array}$$
(7)

where \({W}^{ed}\) and \({b}^{ed}\) are the parameters of the linear transformation, \(\sigma\) is the nonlinear activation function, and \({\widehat{y}}^{ed}\) is the probability distribution predicted by the ED module, which can be expressed as \({\widehat{y}}^{ed}=[{\widehat{y}}_{1}^{ed},{\widehat{y}}_{2}^{ed},{\dots ,\widehat{y}}_{{l}_{ed}}^{ed}]\).

For the training procedure of the ED module, following the work of Lample et al. (2016), the loss function is built upon a CRF. Specifically, we let one of the paths over \({\widehat{y}}^{ed}\) be \(e=\left[{e}_{1},{e}_{2},\dots ,{e}_{{l}_{e}}\right]\), where \(e\in E\) and \(E\) is the set of all possible paths. Then, we define the score of input text \(k\) and prediction path \(e\) as the combination of the transition probability matrix and the emission probability matrix:

$$\begin{array}{c}score\left( k,e\right)=\sum_{i=0}^{{l}_{e}}{T}_{{e}_{i},{e}_{i+1}}+\sum_{i=0}^{{l}_{e}}{F}_{k,{e}_{i}}\end{array}$$
(8)

where \(F\) is the emission matrix and \({F}_{k,{e}_{i}}\) represents the score of event label \({e}_{i}\) at the \(i\)-th position. \(T\) is the transition matrix and \({T}_{{e}_{i},{e}_{i+1}}\) represents the transition score from state \({e}_{i}\) to state \({e}_{i+1}\). Given an input text \(k\), the probability of an event label sequence \(e\) is:

$$\begin{array}{c}{p}_{crf}\left(\left.e\right|k\right)=softmax\left(score\left(k,e\right)\right)\end{array}$$
(9)

The current most probable path \({e}^{*}\) is calculated as:

$$\begin{array}{c}{e}^{*}={argmax}_{e\in E} {p}_{crf}\left(e|k\right)\end{array}$$
(10)

The loss function is log-likelihood loss:

$$\begin{array}{c}{\mathcal{L}}^{ed}=-\mathit{lo}g\left({p}_{crf}\left(\left.e\right|k\right)\right)\end{array}$$
(11)

By employing BERT + CRF to complete the ED task, we fine-tune the BERT model on the ED task and obtain, for a given legal text \(k\), the predicted event sequence \({e}^{*}\) and the hidden states \({h}^{ed,k}\), which contain semantic features relevant to the ED task.
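The following sketch illustrates a BERT + CRF event detector of the kind described above, assuming the `transformers` and `pytorch-crf` packages; class and variable names are illustrative, the exact architecture of our ED module may differ, and the activation of Eq. (7) is folded into the CRF emission layer here.

```python
# A sketch of a BERT + CRF event detector, assuming the `transformers` and
# `pytorch-crf` packages; names are illustrative, not the authors' code.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF


class EventDetector(nn.Module):
    def __init__(self, num_event_types, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)     # encoder, Eq. (6)
        hidden = self.bert.config.hidden_size
        self.emission = nn.Linear(hidden, num_event_types)   # token-level scores, cf. Eq. (7)
        self.crf = CRF(num_event_types, batch_first=True)    # transition scores, Eqs. (8)-(11)

    def forward(self, input_ids, attention_mask, event_labels=None):
        h_ed = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        scores = self.emission(h_ed)
        mask = attention_mask.bool()
        if event_labels is not None:
            # Negative log-likelihood of the gold label path, Eq. (11).
            return -self.crf(scores, event_labels, mask=mask, reduction="mean"), h_ed
        # Most probable label path e*, Eq. (10), plus the hidden states h^{ed,k}
        # that are reused later by the event-context integration mechanism.
        return self.crf.decode(scores, mask=mask), h_ed
```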

3.3.3 Encoder layer

The encoder layer maps the fact description of a case into continuous hidden states that contain contextual features. Similar to the ED module, we apply the pre-trained BERT from OpenCLaP (Zhong et al. 2019) to encode legal documents. Inspired by the Siamese network (Neculoiu et al. 2016), we design our encoder as a shared-weight BERT that encodes every paragraph, which reduces model parameters while fully considering the interaction information between different documents. Specifically, given a paragraph pair \({\mathbb{A}}_{p}\) and \({\mathbb{B}}_{p}\), the shared-weight BERT captures contextual representations:

$$\begin{array}{c}{h}^{k}=BERT\left(k\right)\in {\mathbb{R}}^{{l}_{k}\times {d}_{s}}\end{array}$$
(12)

where \(k\in \{{\mathbb{A}}_{p}, { {\mathbb{B}}}_{p}\}\).

3.3.4 Event-context integration mechanism

After encoding each paragraph, the ED module is utilized in this layer to capture event features. Specifically, we propose the event-context integration mechanism to model the interaction between events and their context. Our method differs from LFESM (Hong et al. 2020), which uses one-hot vectors to represent legal features: we treat events as legal features, map them into learnable vectors, and integrate the contextual features of events with the semantic features of the original text.

First, we leverage the ED module to obtain the event features. We load the parameters of the pre-trained ED module, and \({\mathbb{A}}_{p}\) and \({\mathbb{B}}_{p}\) are fed into it to obtain the event label sequences \({E}^{k}=\left[{e}_{1}^{k},{e}_{2}^{k},\dots ,{e}_{{l}_{k}}^{k}\right], k\in \{{\mathbb{A}}_{p}, { {\mathbb{B}}}_{p}\}\). To further extract the event information of the fact description, we feed the event labels \({E}^{k}\) and the hidden states \({h}^{ed,k}\) of the ED module into the event-context integration layer. Notably, for efficiency, the parameters of the ED module are frozen after pre-training. The effect of freezing the parameters is discussed in detail in Sect. 4.5 (LECM/FT).

To map the event sequence \({E}^{k}\) into a continuous vector space, a lookup matrix \(Emb\) that stores the embeddings of events is initialized, where \(Emb \in {\mathbb{R}}^{{N}_{e}\times {d}_{s}}\) and \({N}_{e}\) is the total number of event types. Before training begins, \(Emb\) is initialized randomly, so each event is assigned a randomly initialized embedding vector. Specifically, we set the embeddings of non-event labels to zero vectors to prevent interference with event fusion. Formally, the embedding \({h}^{e,k}\) of \({E}^{k}\) is defined as:

$$\begin{array}{c}{h}^{e,k}= Emb\left({E}^{k}\right)\in {\mathbb{R}}^{{l}_{k}\times {d}_{s}}\end{array}$$
(13)

where \(Emb\) takes out the vector of the corresponding row according to the index in \({E}^{k}\) as the embedding vector of the event. \({h}^{e,k}\) is the embedding of each event in the event sequence \({E}^{k}\) and can be represented as \({h}^{e,k}=[{h}_{1}^{e,k}, {h}_{2}^{e,k}, \dots ,{h}_{{l}_{k}}^{e,k}]\). The lookup matrix \(Emb\) is updated through backward gradient propagation.
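A minimal sketch of the event embedding lookup \(Emb\) is given below. It assumes the non-event label \({Event}_{0}\) is mapped to index 0 and uses PyTorch's padding_idx to pin that row to the zero vector, mirroring the zero non-event embeddings described above; the event-type count (108 LEVEN types plus the non-event class) is an assumption.

```python
# A minimal sketch of the event embedding lookup Emb (Eq. 13); the event-type
# count and all names are illustrative assumptions.
import torch
import torch.nn as nn


class EventEmbedding(nn.Module):
    def __init__(self, num_event_types=109, hidden_size=768):
        super().__init__()
        # Randomly initialized lookup matrix; row 0 (non-trigger tokens) stays zero.
        self.emb = nn.Embedding(num_event_types, hidden_size, padding_idx=0)

    def forward(self, event_labels):      # (batch, seq_len) event indices E^k
        return self.emb(event_labels)     # (batch, seq_len, hidden) = h^{e,k}


emb = EventEmbedding()
labels = torch.tensor([[0, 3, 0, 17]])    # a toy event label sequence
print(emb(labels).shape)                  # torch.Size([1, 4, 768])
```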

The second step is to capture the context features of events. If events were used only as features for interactive computation, the same event would still be represented by the same feature in different contexts. In fact, however, the context of an event contains information related to the event (e.g., its severity). Therefore, we integrate the features of events with their corresponding contextual features. Besides, the context related to an event is an important factor affecting similarity. If a piece of text contains no legal event, it is very likely to be a general description (e.g., personal information of the plaintiff and defendant) or to have little to do with the legal case as a whole. As long as such text is not included in the event context, its impact on the similarity of the cases can be reduced. Based on these assumptions, we propose the event-context integration mechanism.

The hidden states \({h}^{ed,k}\) of the ED module contain context semantic information related to ED. The interaction features between ED context semantic features \({h}^{ed,k}\) and SCA semantic features \({h}^{k}\) are calculated by:

$$\begin{array}{c}{h}^{es, k}={h}^{ed,k}{W}^{es}{h}^{k}+{b}^{es}\end{array}$$
(14)

where \({W}^{es}\in {\mathbb{R}}^{{d}_{s}\times {l}_{k}}\) and \({b}^{es}\in {\mathbb{R}}^{{d}_{s}}\). \({h}^{es, k}\) represents the integration of event features with SCA features and can also be expressed as \({h}^{es, k}=[{h}_{1}^{es,k},{h}_{2}^{es,k},\dots ,{h}_{{l}_{k}}^{es,k}]\). The superscript \(es\) identifies vectors related to this integration process, i.e., event detection features integrated with semantics. Inspired by Vaswani et al. (2017), we extract contextual features based on the attention mechanism. More specifically, the attention weight from the event at the \(i\)-th position to the token at the \(j\)-th position is computed as follows:

$$\begin{array}{c}{\mathcal{w}}_{i,j}^{k}=softmax\left({\left({w}_{i,j}^{e}{h}_{j}^{es,k}\right)}^{T}\cdot {h}_{i}^{e,k}+{m}_{i,j}\right),\forall i,j\in \left[1,..,{l}_{k}\right]\end{array}$$
(15)
$$\begin{array}{c}{m}_{i,j}=\left\{\begin{array}{ll}0, & \text{allowed to attend}\\ -\infty , & \text{prevented from attending}\end{array}\right.\end{array}$$
(16)

where \({w}_{i,j}^{e}\) is a learnable transformation parameter and \({m}_{i,j}\) controls the window size of event attention.

As Fig. 3 shows, the attention window of the event context is centred on the position of the event's trigger word. For the sake of logical clarity, the description of an event in legal text usually appears around its trigger word. Thus, we set the center of the attention window to the trigger word position, and only tokens within the window participate in the calculation of attention weights, corresponding to \({m}_{i,j}=0\); if a token is outside the window, it is not attended to by the event, corresponding to \({m}_{i,j}=-\infty\). In this way, each event focuses only on the context related to it. Moreover, text without events loses its features in this step, making the model focus more on modelling the semantic features of event-related context.

Fig. 3

The event-context integration mechanism. On the left is the standard self-attention mechanism. All the tokens from the document will be attended to. On the right is our event-context integration mechanism. We calculate the attention weights between events and tokens. The event only attends to the words close to the corresponding trigger words, and in the figure, the attention window size of the events is 2

After that, the event-context vector is represented as \({\mathcal{E}}^{k}=[{\mathcal{E}}_{1}^{k},\dots ,{\mathcal{E}}_{{l}_{k}}^{k}]\), where \({\mathcal{E}}_{i}^{k}\) is calculated by:

$$\begin{array}{c}{\mathcal{E}}_{i}^{k}=\sum_{j=0}^{{l}_{k}}{\mathcal{w}}_{i,j}^{k}{h}_{j}^{es,k}, \forall i\in \left[1,..,{l}_{k}\right]\end{array}$$
(17)

In (17), the context tokens within the attention window of event \({e}_{i}^{k}\) are aggregated according to the corresponding attention weights \({\mathcal{w}}_{i,j}^{k}\). In this way, the context features of event \({e}_{i}^{k}\) are compressed into \({\mathcal{E}}_{i}^{k}\).
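The sketch below illustrates the windowed event-context attention of Eqs. (14)-(17). It is a simplified reading, not the exact implementation: the bilinear fusion of Eq. (14) is approximated by a learned projection plus addition, and the per-position weights \({w}_{i,j}^{e}\) are folded into that projection; only the window mask of Eq. (16) is reproduced directly.

```python
# A simplified sketch of the event-context integration mechanism (Eqs. 14-17);
# names, shapes, and the fusion of Eq. (14) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventContextAttention(nn.Module):
    def __init__(self, hidden_size=768, window=64):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(hidden_size, hidden_size)  # stands in for W^{es} / w^{e}

    def forward(self, h_ed, h_k, h_e, is_trigger):
        # h_ed: ED hidden states, h_k: encoder states, h_e: event embeddings, all (B, L, d);
        # is_trigger: (B, L) boolean mask marking trigger positions.
        L = h_k.size(1)
        h_es = self.proj(h_ed) + h_k                              # fused features, cf. Eq. (14)
        scores = torch.matmul(h_e, h_es.transpose(1, 2))          # (B, L, L): event i vs. token j, cf. Eq. (15)
        # Window mask m_{i,j}: only tokens within `window` of position i are attended, Eq. (16).
        idx = torch.arange(L, device=h_k.device)
        in_window = (idx[None, :] - idx[:, None]).abs() <= self.window
        scores = scores.masked_fill(~in_window, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        ctx = torch.matmul(weights, h_es)                         # aggregated context E^k, Eq. (17)
        # Non-trigger positions carry no event, so their rows are zeroed out.
        return ctx * is_trigger.unsqueeze(-1).to(ctx.dtype)
```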

3.3.5 Interactive layer

In the previous steps, we mainly model the internal information of a case. However, to calculate the similarity of a case pair, we also need to model the interactive semantic information between the two cases. In this layer, we calculate this interactive semantic information based on the multi-head attention mechanism. As Fig. 4 shows, we use two layers of cross-attention modules to model the interactive semantic information of case pairs. The difference between our cross-attention and traditional self-attention is that our attention weights are computed from both paragraph \({\mathbb{A}}_{p}\) and paragraph \({\mathbb{B}}_{p}\), reflecting the interactive information between cases, whereas traditional attention weights come only from the query document.

Fig. 4

The framework of the interactive layer. \(n\) denotes the total number of heads in multi-head attention

More specifically, we first model the semantic information from \({\mathbb{A}}_{p}\) to \({\mathbb{B}}_{p}\). The key matrix \({K}_{i}^{{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}}\), value matrix \({V}_{i}^{{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}}\), and query matrix \({Q}_{i}^{{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}}\) are constructed as follows:

$$\begin{array}{c}{K}_{i}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}={h}^{{\mathbb{A}}_{p}}{W}_{i}^{k}\end{array}$$
(18)
$$\begin{array}{c}{V}_{i}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}={h}^{{\mathbb{A}}_{p}}{W}_{i}^{v}\end{array}$$
(19)
$$\begin{array}{c}{Q}_{i}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}={h}^{{\mathbb{B}}_{p}}{W}_{i}^{q}\end{array}$$
(20)

where \({W}_{i}^{k}\), \({W}_{i}^{v}\), \({W}_{i}^{q}\in {\mathbb{R}}^{{d}_{s}\times {d}_{s}}\) and \(i\) is the index of the head in multi-head attention. Then, the cross-attention from case \({\mathbb{A}}\) to case \({\mathbb{B}}\) is calculated by:

$$\begin{array}{c}{attn}_{i}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}} = softmax\left(\frac{{Q}_{i}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}{\left({K}_{i}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}\right)}^{T}}{\sqrt{{d}_{s}}}\right){V}_{i}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}\end{array}$$
(21)

For multi-head attention, the results of the individual attention heads are concatenated:

$$\begin{array}{c}{attn}_{multi}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}={attn}_{1}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}\oplus {attn}_{2}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}\oplus \dots \oplus {attn}_{n}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}\end{array}$$
(22)

where \(n\) denotes the total number of heads in multi-head attention, and \(\oplus\) denotes the concatenation operation. To measure the original similarity information between paragraph \({\mathbb{A}}_{p}\) and paragraph \({\mathbb{B}}_{p}\), the element-wise multiplication is calculated, and then the event-context features and the interactive semantic features are concatenated with the element-wise result:

$$\begin{array}{c}{I}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}={\mathcal{E}}^{{\mathbb{A}}_{p}}\oplus {attn}_{multi}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}\oplus \left({h}^{{\mathbb{A}}_{p}}\odot {h}^{{\mathbb{B}}_{p}}\right)\end{array}$$
(23)

Here, \({I}^{{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}}\) is the concatenation of those vectors, and \(\odot\) is the element-wise product of two vectors. \({I}^{{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}}\) is regarded as high-order interactive information from paragraph \({\mathbb{A}}_{p}\) to paragraph \({\mathbb{B}}_{p}\), which includes event-context features, interactive semantic features, and original similarity information. In this way, the semantic information of \({\mathbb{A}}_{p}\) can be fully utilized in the final similarity calculation. The semantic information from paragraph \({\mathbb{B}}_{p}\) to paragraph \({\mathbb{A}}_{p}\) is calculated in the same way as in (18)-(23), yielding \({I}^{{ {\mathbb{B}}}_{p}{\mathbb{A}}_{p}}\).
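A sketch of one direction of this cross-attention interaction (Eqs. 18-23) is shown below, using PyTorch's built-in multi-head attention as the cross-attention block. Names are illustrative, and the element-wise product assumes equal paragraph lengths, as in our fixed-length segmentation.

```python
# A sketch of one direction of the interactive layer (Eqs. 18-23); names are
# illustrative and the element-wise product assumes equal paragraph lengths.
import torch
import torch.nn as nn


class CrossInteraction(nn.Module):
    def __init__(self, hidden_size=768, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, h_a, h_b, ctx_a):
        # h_a, h_b: encoder outputs of paragraphs A_p and B_p, shape (B, L, d)
        # ctx_a:    event-context features E^{A_p} from the previous layer
        # Query from B_p, key/value from A_p (Eqs. 18-21).
        attn_ab, _ = self.cross_attn(query=h_b, key=h_a, value=h_a)
        # Concatenate event-context features, cross-attention output and the
        # element-wise product of the two paragraphs (Eq. 23) -> I^{A_p B_p}.
        return torch.cat([ctx_a, attn_ab, h_a * h_b], dim=-1)


layer = CrossInteraction()
h_a, h_b, ctx_a = (torch.randn(1, 512, 768) for _ in range(3))
print(layer(h_a, h_b, ctx_a).shape)  # torch.Size([1, 512, 2304])
```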

Similar to (18)-(22), we utilize the attention mechanism to integrate \({I}^{{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}}\) and \({I}^{{ {\mathbb{B}}}_{p}{\mathbb{A}}_{p}}\). The key matrix \({K}_{i}^{I}\), value matrix \({V}_{i}^{I}\), and query matrix \({Q}_{i}^{I}\) are constructed as follows:

$$\begin{array}{c}{K}_{i}^{I}={I}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}{W}_{i}^{k,I}\end{array}$$
(24)
$$\begin{array}{c}{V}_{i}^{I}={I}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}{W}_{i}^{v,I}\end{array}$$
(25)
$$\begin{array}{c}{Q}_{i}^{I}={I}^{{\mathbb{B}}_{p}{\mathbb{A}}_{p}}{W}_{i}^{q,I}\end{array}$$
(26)

where \({W}_{i}^{k,I}\), \({W}_{i}^{v,I}\), \({W}_{i}^{q,I}\in {\mathbb{R}}^{{d}_{s}\times {d}_{s}}\). We obtain the similarity features \({attn}_{multi}^{s,{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}}=[{s}_{1}^{{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}},{s}_{2}^{{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}},\dots ,{s}_{{l}_{{\mathbb{A}}_{p}}}^{{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}}]\) between paragraph \({\mathbb{A}}_{p}\) and paragraph \({\mathbb{B}}_{p}\) as:

$$\begin{array}{c}{attn}_{i}^{s,{\mathbb{A}}_{p}{\mathbb{B}}_{p}} = softmax\left(\frac{{Q}_{i}^{I}{\left({K}_{i}^{I}\right)}^{T}}{\sqrt{{d}_{s}}}\right) {V}_{i}^{I}\end{array}$$
(27)
$$\begin{array}{c}{attn}_{multi}^{s,{\mathbb{A}}_{p}{\mathbb{B}}_{p}}={attn}_{1}^{s,{\mathbb{A}}_{p}{\mathbb{B}}_{p}}\oplus {attn}_{2}^{s,{\mathbb{A}}_{p}{\mathbb{B}}_{p}}\oplus \dots \oplus {attn}_{n}^{s,{\mathbb{A}}_{p}{\mathbb{B}}_{p}}\end{array}$$
(28)

After that, it is fed into a max-pooling layer:

$$\begin{array}{c}{s}^{{\mathbb{A}}_{p}{\mathbb{B}}_{p}}=\mathit{Pooling}\left({attn}_{multi}^{s,{\mathbb{A}}_{p}{\mathbb{B}}_{p}}\right)\end{array}$$
(29)

where \(Pooling\) denotes the pooling operation over the sequence-length dimension, and \({s}^{{\mathbb{A}}_{p}{ {\mathbb{B}}}_{p}}\) represents the similarity information between paragraph \({\mathbb{A}}_{p}\) and paragraph \({\mathbb{B}}_{p}\).

3.3.6 Aggregate and output layer

The construction of the aggregate and output layer is shown in Fig. 5. After each paragraph pair from case \({\mathbb{A}}\) and case \({\mathbb{B}}\) passes through the event-context integration mechanism and the interactive layer, we obtain the similarity information between any two paragraphs. These are combined as in (30):

$$\begin{array}{c}{s}^{{\mathbb{A}}{\mathbb{B}} }=\left[\begin{array}{ccc}{s}^{{\mathbb{A}}_{1}{\mathbb{B}}_{1}}& \cdots & {s}^{{\mathbb{A}}_{1}{\mathbb{B}}_{{N}_{\mathbb{B}}}}\\ \vdots & \ddots & \vdots \\ {s}^{{\mathbb{A}}_{{N}_{\mathbb{A}}}{\mathbb{B}}_{1}}& \cdots & {s}^{{\mathbb{A}}_{{N}_{\mathbb{A}}}{\mathbb{B}}_{{N}_{\mathbb{B}}}}\end{array}\right]\end{array}$$
(30)

where \({N}_{\mathbb{A}}\) and \({N}_{\mathbb{B}}\) represent the total numbers of paragraphs in case \({\mathbb{A}}\) and case \({\mathbb{B}}\), respectively. \({s}^{{\mathbb{A}}{\mathbb{B}}}\) aggregates the similarity information between every paragraph in case \({\mathbb{A}}\) and every paragraph in case \({\mathbb{B}}\). Then, \({s}^{{\mathbb{A}}{\mathbb{B}}}\) is passed through a max-pooling layer to obtain the document-level similarity information:

$$\begin{array}{c}{d}^{{\mathbb{A}}{\mathbb{B}} }=Pooling\left({s}^{{\mathbb{A}}{\mathbb{B}} }\right)\end{array}$$
(31)

where \(Pooling\) represents a max-pooling operation over all paragraph-related dimensions. \({d}^{{\mathbb{A}}{\mathbb{B}}}\) represents the similarity information between case \({\mathbb{A}}\) and case \({\mathbb{B}}\), with all the similarity features compressed into it. Next, we need to shape \({d}^{{\mathbb{A}}{\mathbb{B}}}\) into the result required by each SCA task. Since the target outputs of the SCA tasks differ, we explain the output methods for the two tasks separately.
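The aggregation of Eqs. (30)-(31) can be sketched as a max-pool over the grid of paragraph-pair similarity vectors; the function name and shapes below are illustrative.

```python
# A sketch of the aggregate step (Eqs. 30-31): paragraph-pair similarity vectors
# are stacked into an N_A x N_B grid and max-pooled into one document-level
# vector d^{AB}; names and shapes are illustrative.
import torch


def aggregate_paragraph_similarities(pair_features):
    """pair_features: (N_A, N_B, d) tensor whose entry (i, j) is the similarity
    vector s^{A_i B_j} produced by the interactive layer."""
    return pair_features.flatten(0, 1).max(dim=0).values  # max-pool over both paragraph axes


s_ab = torch.randn(2, 10, 768)   # e.g. 2 query paragraphs x 10 candidate paragraphs
print(aggregate_paragraph_similarities(s_ab).shape)  # torch.Size([768])
```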

Fig. 5

The construction of the aggregate and output layer

For the SCM task, taking the similarity features \({d}^{AB}\) and \({d}^{AC}\) as input, the predicted distribution \({\widehat{y}}^{scm}\) is calculated as follows:

$$\begin{array}{c}R={d}^{AB }\oplus {d}^{AC }\end{array}$$
(32)
$$\begin{array}{c}{\widehat{y}}^{scm}=softmax\left({W}_{scm}^{y}R+{b}_{scm}^{y}\right)\end{array}$$
(33)

Here, \({d}^{AB}\) and \({d}^{AC}\) are concatenated into \(R\), and \({\widehat{y}}^{scm}\) represents the predicted probability distribution of the sample, which can be expressed as:

$$\begin{array}{c}{\widehat{y}}^{scm}=\left[{sim}_{A,B},s{im}_{A,C}\right]\end{array}$$
(34)

Finally, we use the cross-entropy loss function to train our model:

$$\begin{array}{c}{\mathcal{L}}^{scm}=-\sum_{i=0}^{\left|R\right|}{y}_{i}^{scm}\mathit{log}{\widehat{y}}_{i}^{scm} \end{array}$$
(35)

where \({y}_{i}^{scm}\) is the ground-truth label and \({\widehat{y}}_{i}^{scm}\) is the predicted result. \(R\) denotes the set of relevant labels.

For SCR task, the similarity information \({d}^{QC}\) is passed through a fully-connected layer followed by a \(softmax\) function to make a prediction as follows:

$$\begin{array}{c}{\widehat{y}}^{scr}=softmax\left({W}_{scr}^{y}{d}^{QC }+{b}_{scr}^{y}\right)\end{array}$$
(36)

The loss function is the same as the SCM task:

$$\begin{array}{c}{\mathcal{L}}^{scr}=-\sum_{i=0}^{\left|R\right|}{y}_{i}^{scr}\mathit{log}{\widehat{y}}_{i}^{scr}\end{array}$$
(37)
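The sketch below shows possible output heads for the two tasks (Eqs. 32-37); layer names and sizes are illustrative, and PyTorch's cross_entropy is used because it combines the softmax of Eqs. (33)/(36) with the log-likelihood losses of Eqs. (35)/(37).

```python
# A sketch of the two output heads (Eqs. 32-37); names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SCMHead(nn.Module):
    """Concatenates d^{AB} and d^{AC} and predicts which pair is more similar."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.fc = nn.Linear(2 * hidden_size, 2)

    def forward(self, d_ab, d_ac, label=None):
        logits = self.fc(torch.cat([d_ab, d_ac], dim=-1))   # Eqs. (32)-(33)
        if label is None:
            return logits.softmax(dim=-1)                   # [sim_{A,B}, sim_{A,C}]
        return F.cross_entropy(logits, label)               # Eq. (35)


class SCRHead(nn.Module):
    """Scores a query-candidate pair from d^{QC}."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.fc = nn.Linear(hidden_size, 2)

    def forward(self, d_qc, label=None):
        logits = self.fc(d_qc)                               # Eq. (36)
        if label is None:
            return logits.softmax(dim=-1)
        return F.cross_entropy(logits, label)                # Eq. (37)
```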

LECM takes legal events as the core basis for judging the similarity of cases and models the characteristics of legal events through the events themselves and their contexts. Legal case similarity differs from general textual similarity: it must be considered from a legal professional's point of view, and legal event features reflect this well. We design models for the two subtasks of SCA, which differ only slightly in detail; these differences are caused by the input and output forms of the tasks. Theoretically, event-based legal features are not limited to calculating the similarity of legal texts and are also needed in tasks such as crime prediction and sentence prediction. Therefore, in future work, we will explore more specific legal tasks.

4 Experiments

In this section, to investigate the effectiveness of LECM on similar case analysis, we carry out experiments on public datasets and compare the performance of our model with the baselines. Then, we conduct ablation experiments to investigate the effectiveness of each module in LECM. After that, we explore the impact of the auxiliary dataset and the attention window size on LECM. Finally, we select some typical cases from the datasets to illustrate the working mechanism of the model.

4.1 Datasets

SCA is formalized as two subtasks, SCM and SCR, both of which can be used to evaluate the SCA performance of a model. To evaluate the performance of LECM, we use the CAIL-2019 dataset and the LeCaRD dataset, corresponding to the two subtasks. In addition, as mentioned in Sect. 3.3.2, the LEVEN dataset (Yao et al. 2022), an ED dataset, is used to train the ED module of LECM.

CAIL-2019 is an open-source dataset that focuses on the SCM task. The input of CAIL-2019 is a triplet (A, B, C), where A, B, and C are fact descriptions of three cases. The objective is to determine whether case A is more similar to case B or to case C, which simplifies the task into binary classification. Positive or Negative labels are assigned based on the similarity between cases A and B: if case A is similar to B, the sample is labeled Positive; otherwise, it is labeled Negative. All legal documents in CAIL-2019 are collected from China Judgments Online. There are 8,138 samples in the dataset, of which 5,102 constitute the training set, 1,536 constitute the validation set, and the remaining 1,500 constitute the test set. All samples are civil cases, and the similarity between documents is defined by legal professionals. Table 1 provides an overview of the dataset, showing a balanced distribution of positive and negative samples. Notably, the average input length of this dataset is relatively short, enabling assessment of LECM's performance on such text.

Table 1 Data statistic

LeCaRD is a dataset for the SCR task. It is a legal case retrieval dataset under China's legal system, designed under the guidance of the official document published by the Supreme People's Court of China. LeCaRD consists of 107 query cases and 10,700 candidate cases, most of which are criminal cases. Each query has about 100 candidate cases, and the model needs to rank these candidates by their similarity to the query text: the higher the similarity, the higher the ranking. As evident from Table 1, the average query length is comparatively short, whereas each candidate case is considerably longer, surpassing the maximum input length of general language models such as BERT. This provides an opportunity to assess the performance of LECM on long documents.

In addition, our ED module is trained on the LEVEN dataset, which consists of 8,116 legal cases and 150,997 human-annotated event mentions. Similar to CAIL-2019, the LEVEN data are sourced from China Judgments Online, and the events are annotated by experienced legal experts. LEVEN encompasses 108 event types, covering common categories such as deception, violence, and accidents. The cases in the CAIL-2019 and LeCaRD datasets mainly belong to civil and criminal cases, and LEVEN covers frequent events from these domains. As LEVEN serves as an auxiliary dataset and is not used for performance testing, its training/testing split is not provided in Table 1. Importantly, the average document length of LEVEN is only 495.83, which falls within the maximum processing length of BERT; consequently, our ED module does not require extensive long-text processing on LEVEN. Example text snippets from the three datasets are shown in the Appendix.

4.2 Baselines

To verify the effectiveness of the proposed model, we compare our model with the following competitive baseline models:

  • TF-IDF: Term frequency-inverse document frequency (Salton and Buckley 1988) is used to extract features from the inputs, and an SVM (Suykens and Vandewalle 1999) is adopted as the classifier, forming a robust classification baseline.

  • LMIR: Language models for information retrieval (Ponte and Croft 2017) is a traditional retrieval model based on bag-of-words models.

  • TextCNN: TextCNN (Kim 2014) is a classic CNN-based text classification model. We employ TextCNN with a single-layer convolution as the fact encoder and classifier. Since TextCNN is not good at capturing long-text features, we implement a Siamese network-based version, denoted as TextCNNS.

  • SMASH-RNN: Jiang et al. (2019) propose a hierarchical RNN based on attention, which uses the document structure to improve the representation of long-form documents.

  • Lawformer: Lawformer (Xiao et al. 2021) is a Longformer-based pre-trained language model trained on large-scale legal case documents. Since Lawformer can handle longer texts, we implement two versions: one based on concatenation and one based on the Siamese network, denoted as LawformerC and LawformerS, respectively.

  • BERT: BERT (Devlin et al. 2019) is a mainstream pre-trained language model that has demonstrated superior performance on various downstream tasks. Since BERT limits the input length, we only implement a Siamese network-based version, denoted as BERTS.

  • BERT-PLI: BERT-PLI (Shao et al. 2020) breaks the text into paragraphs and calculates similarity at the paragraph level. In this way, BERT-PLI models the semantic interactions between paragraphs. Experiments show that it performs well on legal texts.

  • LFESM: Hong et al. (2020) extract legal elements via regular expressions and adopt BERT to capture long-range dependencies in the legal documents.

4.3 Experiment settings

For TF-IDF, we set the feature size to 2,000. The filter widths of TextCNN are {2, 3, 4, 5}, and the number of filters for each width is 25. For SMASH-RNN, the hidden state size is 768. For the BERT-based models, we adopt the bert-base-chinese checkpoint from OpenCLaP as the basic encoder. Since Lawformer can process longer sequences, we set the maximum input length to 700 for the Lawformer-based models and to 512 for the rest. Because the maximum length of 512 supported by BERT and LFESM is much smaller than the average document length of LeCaRD, we apply the paragraph-segmentation method in Sect. 3.3.1 to the encoders and output heads of BERT and LFESM so that they can handle longer text.
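A minimal sketch of the paragraph-segmentation step of Sect. 3.3.1 is given below, assuming simple character-level chunking up to the encoder's maximum length. The real implementation operates on tokenizer output and may respect sentence boundaries, so the function name and slicing here are illustrative only.

from typing import List

def segment_document(text: str, num_paragraphs: int, max_len: int = 512) -> List[str]:
    """Split a long fact description into at most `num_paragraphs` chunks of
    length <= max_len, padding with empty chunks so every case has a fixed shape."""
    chunks = [text[i:i + max_len] for i in range(0, len(text), max_len)]
    chunks = chunks[:num_paragraphs]                 # text beyond the kept chunks is truncated
    chunks += [""] * (num_paragraphs - len(chunks))  # pad short documents
    return chunks

In the settings described next, for instance, a LeCaRD candidate document would correspond to num_paragraphs = 10 and a query to num_paragraphs = 2.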

Note that the training procedure of our model is divided into two steps: pre-training the ED module and training the whole model with a frozen ED module. Each stage uses different hyper-parameters, which are tuned on the validation set. We pre-train the ED module on the LEVEN dataset with a dropout rate of 0.1 between layers, a batch size of 16, and a learning rate of 1e−5. We take 20% of the LEVEN data as the validation set and adopt the best-performing checkpoint as our ED module, using accuracy as the evaluation metric on LEVEN. The rest of our model is trained on the CAIL-2019 and LeCaRD datasets, with the same hyper-parameters on both datasets unless otherwise specified. For the input document, we adopt the paragraph-segmentation approach described in Sect. 3.3.1: the numbers of paragraphs for the query and the candidate in the SCR task are 2 and 10, respectively, and the number of paragraphs for each case in the SCM task is 2. The window size of the event-context integration mechanism is 64; since our method relies on event-context features, an appropriately sized window is important for performance (see Sect. 4.6). For the interactive layer, the hidden size of the multi-head attention layer is 768 and the number of heads is 4. Except for the ED module, the dropout rate per layer in LECM is 0.3. The batch size during training is 8 on CAIL-2019 and 2 on LeCaRD. We use Adam (Kingma and Ba 2015) as the optimizer, which is effective in neural model training, with a learning rate of 1e−5 and an l2-regularization coefficient λ of 1e−5. In addition, we use NVIDIA Apex to accelerate the training procedure.
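The two-stage schedule can be summarized by the following sketch, where `ed_module` and `rest_of_lecm` are illustrative stand-ins (simple linear layers) rather than the released architecture; the point is only that the ED parameters are excluded from the optimizer in the second stage.

import torch
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Disable gradient updates for a pre-trained sub-module."""
    for param in module.parameters():
        param.requires_grad = False

# Illustrative stand-ins for the real modules.
ed_module = nn.Linear(768, 108)      # e.g. a head over the 108 LEVEN event types
rest_of_lecm = nn.Linear(768, 2)

# Stage 1: the ED module is pre-trained on LEVEN (lr 1e-5, batch size 16); the
# checkpoint with the best validation accuracy is kept and then frozen.
freeze(ed_module)

# Stage 2: only the remaining LECM parameters are optimized on CAIL-2019 / LeCaRD.
trainable = [p for p in rest_of_lecm.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5, weight_decay=1e-5)  # weight_decay approximates the l2 coefficient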

Since SCM is a binary classification task and CAIL-2019 is a balanced dataset, we employ accuracy (Acc.) as the evaluation metric, which objectively reflects the effectiveness of LECM and the baselines. Note that the validation and test sets of CAIL-2019 are divided by the original authors, so the validation set also fairly reflects model performance; we therefore report results on both the validation and test sets. For the SCR task, we use precision (P) and normalized discounted cumulative gain (NDCG) (Järvelin and Kekäläinen 2002) to evaluate performance. Precision metrics include P@5, P@10, and mean average precision (MAP); NDCG metrics include NDCG@10, NDCG@20, and NDCG@30. P@k measures whether ground-truth cases appear in the top-k retrieval results, while NDCG@k considers the positions of the ground-truth cases in the ranked list. Following the literature (Ma et al. 2021b), we randomly sample 20% of the LeCaRD data as the test set to evaluate the models. Other standard parameters follow the default settings of the PyTorch framework.
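For reference, the following are common definitions of P@k and NDCG@k; the exact LeCaRD conventions (binary vs. graded relevance, log base) may differ slightly, so this is a hedged sketch rather than the official evaluation script.

import math
from typing import List

def precision_at_k(relevance: List[int], k: int) -> float:
    """Fraction of the top-k retrieved cases that are relevant (relevance > 0)."""
    return sum(1 for r in relevance[:k] if r > 0) / k

def ndcg_at_k(relevance: List[int], k: int) -> float:
    """NDCG@k with graded relevance; rewards placing relevant cases near the top."""
    def dcg(scores: List[int]) -> float:
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(scores))
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0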

4.4 Experimental results and discussion

We first evaluate the overall performance of all models on the two SCA subtasks, SCM and SCR. Tables 2 and 3 show the comparative results of our model and the baselines on the CAIL-2019 and LeCaRD datasets. Results in bold indicate the best-performing methods, and the symbol "\" indicates that the method cannot converge normally within a limited number of training epochs. According to the results, LECM significantly outperforms all previous baselines on both the SCM and SCR tasks. We discuss these experimental results in detail in the following subsections.

(1) For the SCM task, among all baselines, LFESM achieves the highest performance, indicating that the legal features captured by regular expressions are helpful for SCM. Compared with LFESM, our model achieves higher accuracy on both the validation and test sets, which verifies the effectiveness of our model for SCM. Compared with manually designed regular expressions, the cooperation of the ED module and the event-context integration mechanism extracts more comprehensive features. However, on the SCR task, LFESM does not perform well, because its regular-expression matching rules mainly target civil cases while the LeCaRD dataset is dominated by criminal cases. Therefore, LFESM cannot extract helpful legal features from LeCaRD. This also confirms the importance of introducing legal features into the SCA task.

(2) For the SCR task, BERT-PLI is the best-performing baseline model, but it does not perform well on the SCM task. This shows that although BERT-PLI can capture the dependencies between paragraphs in long texts, such dependencies do not help inference on short texts, where the similarity between paragraphs matters more. Except for NDCG@20, LECM achieves the highest performance on all remaining SCR metrics, indicating that LECM is more precise in selecting top-ranked candidate cases. Besides, our method also achieves better performance on the SCM task, indicating that the paragraph-level features in LECM can be leveraged for both long and short texts.

(3) Siamese-based models generally outperform concatenation-based models. The neural networks are better suited to encoding a single case than a concatenation of the case triplet, which reduces interfering information. This shows the importance and rationality of using a Siamese architecture in the encoder layer of LECM.

(4) The baselines based on the bag-of-words model perform much better on the SCR task than on the SCM task. Although TF-IDF and LMIR are weaker than neural models in semantic understanding, they can take advantage of the whole legal document. In contrast, TextCNN and SMASH-RNN cannot converge on the SCR task. Although they can accept long input text, they are not designed for long-text tasks, and longer inputs cause problems such as exploding or vanishing gradients. Therefore, it is hard for them to handle long-text retrieval.

Table 2 Similar case analysis results on LeCaRD for SCR task
Table 3 Similar case analysis results on CAIL-2019 for SCM task

From the overall results, LECM is superior to the best baseline by a large margin on the adopted evaluation metrics, which indicates that LECM has excellent SCA performance. We summarize the reasons as follows: (1) the event-context integration mechanism assigns different contextual information to different events, so that occurrences of the same event type within a description are distinguished by their contexts; (2) while introducing events as legal features, we also incorporate event-related contextual features into the inference process; (3) we break long documents into paragraphs, perform event detection on the segmented texts, and aggregate the results with a pooling layer at the end. This enables LECM to handle longer texts while retaining excellent performance on short texts. Moreover, the ED module cannot handle text that exceeds its input limit, and the paragraph-segmentation mechanism in LECM avoids this problem.

4.5 Ablation test

To study the impact of each layer in our model, we design several ablation tests to investigate the performance of LECM. The variants of our method are listed as follows:

  • LECM/EC: We remove the event-context integration mechanism and ED module.

  • LECM + RE: This model removes the ED module, and the event sequence is randomly initialized.

  • LECM/I: This model removes the interactive layer, and all hidden vectors are concatenated.

  • LECM/FT: We do not freeze the ED module parameters; LECM continues to update this module's parameters when training on the SCA datasets.

Tables 4 and 5 show the performance of these LECM variants. First, when we remove the event-context layer and the ED module, our method loses the ability to capture events and their contexts in the fact description. Due to the lack of event information and context, the performance of LECM/EC declines by a large margin, which shows that taking events and their contexts as legal features helps the model capture critical textual features. To further demonstrate the effectiveness of event contexts, we replace the ED module with a random sequence of events. On the one hand, the decline of LECM + RE verifies that the accuracy of ED affects the model; on the other hand, the decline is not as pronounced as we expected, because LECM can reduce the accumulation of ED errors by flexibly learning the event embedding vectors. Second, we remove the interactive layer and feed the results of the event-context integration mechanism directly into the aggregate and output layer. The drop in results demonstrates that the interactive layer plays an irreplaceable role in our model. Third, the performance of LECM/FT is almost the same as that of the original model. However, since the parameters of the ED module are no longer frozen, the model incurs higher computational costs during forward and backward propagation, and the results of LECM/FT show that these costs do not lead to improvement. Besides, since the parameters of the ED module change, the accuracy of the event detection task is no longer guaranteed; any improvement of LECM/FT over the baselines would mainly come from a more complex network structure rather than from accurate events and their contexts. This also leads to poor interpretability, which is vital in legal artificial intelligence.

Table 4 Ablation test results on LeCaRD for SCR task
Table 5 Ablation test results on CAIL-2019 for SCM task

4.6 Impact of window size

To further explore the effectiveness of the event-context integration mechanism, we test our model with various attention window sizes. The core of LECM lies in computing attention over events and their contexts, so the attention window size is an important hyperparameter for LECM. We gradually increase the window size by a factor of 2 and test the performance of LECM. Figure 6 shows the model performance with respect to the context window size.

Fig. 6 The impact of attention window size

It can be observed that LECM is sensitive to changes in the window size. More specifically, the performance with a window size of 4 or 16 is not ideal: the accuracy of the model is around 50% on the SCM task, approximately equal to random guessing. We suppose that, due to language habits, the words immediately adjacent to a trigger word are similar within a small range, so they cannot provide helpful contextual features and instead interfere with the original semantic information, which noticeably hurts model performance. Therefore, as the window size is gradually increased from 4, the performance of the model improves significantly. When the attention window size exceeds 64, performance starts to degrade slowly: with an overly large window, event-context attention degenerates into approximately global attention, and trigger words attend to tokens that do not describe them, which also hurts performance. The model achieves its best performance with an attention window size of 64 on all three test sets. Although the datasets contain many long texts, we apply the paragraph-segmentation mechanism, so the maximum length of each paragraph is still limited to 512, the maximum input length of BERT. Thus, we speculate that when the input length is 512, the optimal window size is about 64: the model can focus on the words that describe the event while avoiding interference from other words. Therefore, we adopt a window size of 64 in our method. There may be a correlation between the window size and the maximum input length, which we will investigate in future research.
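To make the role of the window concrete, the sketch below builds a boolean attention mask that restricts each event to tokens near its trigger. It assumes the window spans window // 2 tokens on each side of the trigger position and uses illustrative names; it is a simplification of the full event-context integration mechanism, which also involves event embeddings and multi-head attention.

import torch

def window_mask(trigger_positions: torch.Tensor, seq_len: int, window: int = 64) -> torch.Tensor:
    """Boolean mask of shape (num_events, seq_len): True where an event's trigger
    is allowed to attend, i.e. within window // 2 tokens of its position."""
    half = window // 2
    positions = torch.arange(seq_len)            # (seq_len,)
    centers = trigger_positions.unsqueeze(1)     # (num_events, 1)
    return (positions - centers).abs() <= half   # broadcasts to (num_events, seq_len)

# Example: two triggers in a 512-token paragraph with the adopted window size of 64.
mask = window_mask(torch.tensor([30, 400]), seq_len=512, window=64)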

4.7 Impact of auxiliary dataset

LECM involves two datasets during the training process: the main dataset of the SCA task and the auxiliary dataset LEVEN. To explore the impact of the auxiliary dataset on the performance of LECM, we make different transformations on the auxiliary dataset LEVEN as follows:

  • Data augmentation (LEVENDA): This transformation augments the entire fact description, including swapping sentence positions, randomly deleting a sentence, and copying the fact description.

  • Keep civil events (LEVEN+CE): The legal case texts of the LEVEN dataset are divided into two categories: criminal cases and civil cases. To explore the impact of different case types in LEVEN on downstream tasks, we use regular expressions to remove the criminal case documents and keep only the civil case documents.

  • Remove civil events (LEVEN-CE): Similar to LEVEN+CE, but we keep only the criminal case documents of LEVEN.

  • Random delete (LEVENRD): We randomly delete documents from LEVEN.

  • Copy sentences (LEVENCS): As Table 1 shows, the average case length of the LEVEN dataset is less than 512, while the averages of CAIL-2019 and LeCaRD both exceed 512. We randomly copy sentences within each LEVEN case until its length reaches 512, to observe whether case length affects model performance.

Table 6 shows the test performance of the ED module on the LEVEN dataset after the above transformations. Tables 7 and 8 show the test results of LECM with the different auxiliary datasets. We can observe that:

Table 6 Experimental results of ED Module on LEVEN
Table 7 Experimental results of different auxiliary dataset on LeCaRD for SCR task
Table 8 Experimental results of different auxiliary dataset on CAIL-2019 for SCM task

(1) For LEVENDA, the ED module performs better than it does on the original LEVEN dataset. Data augmentation enriches the LEVEN data to a certain extent, so the performance of the ED module improves. However, this does not lead to a significant change on the downstream SCA task: the ED module becomes more inclined to accept data with the same distribution as LEVENDA, but when performing the SCA task the model receives data from CAIL-2019 and LeCaRD, so the improvement of the ED module on LEVENDA cannot be generalized to the SCA datasets. We suspect that one reason for the distribution mismatch is the average input length. As shown in Table 1, the average case length of the LEVEN dataset is less than 512, while the average lengths of CAIL-2019 and LeCaRD exceed 512 (the excess is truncated). Therefore, we randomly replicate sentences of the case texts in LEVEN to make them longer than 512, constructing the LEVENCS dataset. The ED module shows a noticeable performance drop on LEVENCS, which is caused by overfitting to events in repeated sentences. However, the effect of LEVENCS on ED does not significantly affect the SCA task; we speculate that the closer length distribution of the data offsets the effect of overfitting.

(2) For LEVENRD, the ED module does not perform well on LEVEN, which affects the accuracy of LECM on CAIL-2019 and LeCaRD. Due to the small amount of data in LEVENRD, the generalization of the ED module is poor, resulting in low accuracy on LEVEN, and this error accumulates in LECM. This also shows that LECM relies on specific event types during inference, and wrong event types degrade model performance.

(3) From the results of LEVEN-CE and LEVEN+CE, it can be seen that whether LEVEN contains civil cases has a significant impact on the performance of LECM on CAIL-2019. In the absence of civil events (i.e., LEVEN-CE), most events are attributed to the other type, which results in a significant drop in LECM performance on CAIL-2019. Although LECM still learns the contextual information corresponding to the other type, without effective event embeddings it is difficult to capture the words that really matter in the context. For LEVEN+CE, since criminal events are infrequent in CAIL-2019, removing the criminal cases from LEVEN has only a small impact on CAIL-2019. We speculate that the performance of LECM on the SCA task is mainly derived from related event types: the performance on the CAIL-2019 dataset mainly depends on events related to civil cases, and the performance on the LeCaRD dataset mainly depends on events related to criminal cases, so the absence of relevant events leads to a performance loss. To further verify this speculation, we perform the same test on the LeCaRD dataset. For LEVEN-CE, we remove civil cases and keep criminal cases; the results show that LECM decreases to a certain extent on both CAIL-2019 and LeCaRD, but the reduction on LeCaRD is smaller and the reduction on CAIL-2019 is larger. LeCaRD suffers a substantial decline only under LEVEN+CE. This verifies our conjecture that the key to auxiliary dataset selection is whether it includes events related to the downstream SCA task.

4.8 Case study

To understand how integrating event and context information benefits the SCA task, we show the inference process behind SCM.

The core process of LECM lies in the computation of event-context attention weights. Thus, we visualize the event-context attention heat map to illustrate how LECM promotes SCM performance. Figure 7 shows a heatmap of the event-context attention matrix between the event sequence and the legal case fact description. We select four representative events from the complete event sequence. First, each event only attends to words within its window, avoiding attention to context that does not describe it. Taking the event "gambling" as an example, the dark part of the "gambling" row represents tokens outside the attention window, and the light part represents the words that can be attended to; the brighter the color, the more relevant the word is to the event. The attention ranges of different events may overlap to some extent. For example, the events "detain" and "buy" have the same attention window range in this part of the fact description because their trigger words are close to each other. Although the attention windows overlap, their attention weights are not the same: different events are mapped to different embedding vectors, so they focus on different parts of the same attention window.

Fig. 7 The heatmap shows the event-context attention matrix between the event sequence and the fact description. We sample partial events from the complete sequence of events for presentation

Furthermore, we cite a typical example from the training datasets to illustrate how our method works. As Fig. 8 shows, since the original text is in Simplified Chinese, the order and segmentation of the text cannot be fully reflected in the translation, so we do not add callout symbols to the translation. There are two events in this paragraph: drink alcohol and search/seizure. In the context of these events, we highlight the parts with high attention weight; note that each event attends to the relevant part of its context. In addition, no event occurs in the general text in the second half of the paragraph, so that text carries no event-context features. Note that LECM does not involve case pairs when extracting event features, so these features are also suitable for single-text legal tasks. Therefore, we consider exploring the application of LECM to more downstream legal tasks as our future work (Table 9).

Fig. 8 A typical example from the training dataset

Table 9 Example in error analysis

4.9 Error analysis

Error analysis is the process of identifying, examining, and understanding the mistakes made by a model in order to gain insights into its performance and improve its accuracy. It involves analyzing erroneous predictions and determining the underlying causes of these errors. We have conducted a thorough analysis of the erroneous predictions in the LECM and have identified the five most common types of errors. The detailed analysis is as follows.

1. Numeric dependency error: Among the error cases, the most common error is related to numerical values. In the test set, there are finance-related cases where the similarity is closely tied to the monetary amount involved. For instance, when the amount involved is as high as one million RMB, it should significantly influence the judgment of case similarity. However, LECM lacks the ability to effectively perceive and understand contextual information related to numerical values, such as the significance of different monetary amounts in financial cases. As a result, LECM fails to accurately calculate case similarity based on important numerical quantities such as the amount involved or the weight of drugs.

2. Word sense disambiguation: Word sense disambiguation poses a common challenge in ED. For instance, in the given example, the key trigger word "flood" was originally associated with Distributed Denial-of-Service (DDoS) attacks in the text. However, during the ED stage, the system mistakenly interpreted it as a natural disaster, resulting in an erroneous classification of the event as a flood. In event descriptions, the same words can represent different events, and their meanings vary depending on the context. This inherent ambiguity makes it more difficult to accurately identify and classify events in ED.

3. Event missing error: All events detected by LECM are derived from the LEVEN dataset. While the LEVEN dataset offers a comprehensive overview of judicial events, some omissions are inevitable. In the given example, the term "invade" indicates a network attack event, but this event type is not included in the LEVEN dataset, which makes it difficult for LECM to recognize such events. The fundamental reason for this error is that LECM heavily relies on the quality of the auxiliary dataset: if the auxiliary dataset is small and covers few event types, it is difficult for LECM to make reasonable inferences and accurately identify events. To address this issue, the auxiliary dataset should be continuously updated and expanded with a wider range of event types to enhance LECM's event recognition capabilities.

4. Attention window error: Encoding the context within the attention window around detected events is a crucial step of LECM. Currently, LECM treats the range of tokens adjacent to the event trigger words as the attention window. Previous experiments have demonstrated that this rule can usually capture the relevant context accurately. However, there are special cases where the context does not lie within the adjacent range. In the given example, the plaintiff mentioned the keyword "release the loan" but did not provide a specific description of this event, so the event context cannot be found in the adjacent text. Finding a more flexible approach to selecting context based on events will be one of our future areas of improvement.

5. Misjudgment of event type: This is a commonly encountered error. For our ED module, we opted to use BERT + CRF, which handles event complexity well but sacrifices some accuracy. When event type errors occur widely, LECM struggles to provide precise answers based on the events and their contexts. In the specific example, the term "owe" highlighted in bold fails to capture the concept of debt: in the Chinese context, the syntax of "owe" is similar to that of "debt", leading to an incorrect judgment of the event by the ED module.

5 Conclusion and future work

In this paper, we explore the task of SCA and propose the legal event-context model (LECM) to solve it. First, we propose the event-context integration mechanism to formalize events and their corresponding contextual information, which captures the contextual features related to events. The event-context integration mechanism introduces events as legal features into the reasoning process, calculating the similarity of text pairs from a more legal perspective to improve the accuracy and interpretability of SCA. Then, we leverage ED as an auxiliary task for the SCA tasks to help the model locate events and provide ED-related semantic features. The ED module acts as a bridge between the ED task and the SCA tasks while avoiding the difficulty of manually annotating events for SCA. The experimental results show that LECM outperforms state-of-the-art models on SCA tasks, which indicates that our model can effectively leverage event-context features from fact descriptions to improve performance and is expected to be applicable to other downstream subtasks of legal intelligence.

5.1 Discussion

Our study has three important theoretical implications. First, we propose the event-context integration mechanism to integrate legal events with relevant contexts. Different from common semantic matching models, our method enables the model to calculate case similarity along two dimensions: legal elements and semantic features. Current research on legal element extraction (Hong et al. 2020) is mainly based on manually pre-defined rules. However, defining such rules is challenging, and their scope of application is limited. Compared with rule-based methods, LECM, supported by the LEVEN dataset, covers element types more comprehensively, extracts legal elements more accurately, and avoids tedious rule-building work.

Second, although ED methods have been developed (McClosky et al. 2011; Li et al. 2013; Deng et al. 2021), their applications to downstream prediction tasks are rare. We build a legal event detection model based on the LEVEN dataset and apply it to the SCA task. Our experimental results show that introducing events can effectively improve the accuracy of SCA. Besides, the downstream applications of event detection are broader than similar case analysis; we believe it can play an important role in more judicial applications, which will be our future work.

Third, we introduce an additional ED dataset to avoid manually labeling events on the SCA datasets. Current multi-task learning methods (Sener and Koltun 2018; Hu et al. 2022) focus on a single dataset. In the legal artificial intelligence field, there are correlations between legal datasets, so it is necessary to utilize existing datasets to avoid heavy manual labeling. We adopt a two-stage training method to pre-train the ED task and integrate the ED model into the SCA model. Our findings reveal the regularity between the performance of the SCA model and the choice of ED dataset, which clarifies the basis for selecting the auxiliary dataset.

This study also provides noteworthy practical implications. First, due to the large number of cases, legal practitioners often spend considerable time and energy screening similar cases in order to quickly focus on the core of a case and prepare litigation strategies. LECM can accurately analyze similar cases and help practitioners judge whether a particular precedent is applicable or select similar cases from candidates, thereby saving judicial resources. Second, lacking judicial expertise, ordinary people often find it difficult to retrieve and understand similar cases accurately. An automated similar case analysis model allows ordinary people to understand the essence of a case through similar cases, while also easing the pressure on judicial practitioners.

5.2 Future work

In future work, we will explore more downstream tasks (e.g., legal judgment prediction, legal question answering) to investigate the effectiveness of LECM. Moreover, we will optimize the ED module in LECM so that the ED task and downstream tasks can be better integrated.