1 Introduction

Similar case matching (SCM) is a critical task in the legal domain, aiming to determine the similarity between legal case documents. SCM plays a significant role in both common law systems, such as the United States, Canada, and India, where judgments are made based on similar and representative cases in the past, and civil law systems, such as China, Germany, and Italy, where similar cases still serve as references for legal professionals, although statutes are the primary source of law [1, 2]. With the generation and accumulation of a large number of legal documents, retrieving similar cases efficiently from vast amounts of legal document data poses a significant challenge.

With the development of deep learning in natural language processing (NLP), the use of NLP techniques to assist the SCM task has drawn increasing attention. Liu et al. [3] apply the conversational agent workflow in web search to legal case retrieval. Mandal et al. [4] compare the performance of a series of vector generation models in document similarity calculation on a dataset of Indian legal texts. Yang et al. [5] construct a graph neural network based on the existing correlation information between cases. Pre-trained language models, trained on unlabeled corpora, have been shown to be beneficial to various downstream NLP tasks [6, 7]. Therefore, some studies focus on utilizing pre-trained language models specifically for the legal domain [8,9,10].

Although significant progress has been made in the development of SCM, this task is still faced with a few challenges.

The first challenge for SCM is how to improve accuracy by treating key legal elements, such as legal events, as essential components instead of relying solely on semantic similarity. Existing methods tend to overlook legal events that could influence the verdict and the similarity between cases. As illustrated in Fig. 1, although the fact statements of Case A and Case B are semantically similar, the two cases are not similar because violent events are present in Case A but absent from Case B. Therefore, Case B should be categorized as theft, while Case A should be categorized as robbery. Traditional SCM methods that rely solely on semantic similarity can be easily misled by semantic structures and may erroneously judge A to be more similar to B, whereas the ground truth is that A is more similar to C. Moreover, just locating events is not enough to support similar-case judgments: the severity of an event is the basis for the judgment. Therefore, the set of events together with their severities determines how similar the cases are. Some researchers have addressed this problem by extracting legal events or elements with hand-crafted rules. Hong et al. [1] leverage regular expressions to incorporate key legal elements into text parsing. Hu et al. [11] manually add attributes for charges in legal judgment prediction. Nevertheless, manual rules depend heavily on domain-specific prior knowledge and human effort, which is inefficient.

Another challenge is how to leverage knowledge from other legal datasets, such as event detection (ED) datasets, to improve the efficiency of SCM by joint training. Existing methods typically train their models only on a dataset dedicated to a particular task, without utilizing knowledge from other datasets. For example, to enable a model to perform SCM, it is usually trained on a dedicated SCM dataset, such as CAIL-2019 [12]. However, as mentioned earlier, the event labels, which are crucial for SCM, are not available in existing SCM datasets. This means that multi-task training of SCM and ED would require manually labeling events in the SCM dataset, which is time-consuming. Thus, leveraging existing event detection datasets, such as LEVEN [13], to assist with the SCM task remains an unresolved challenge.

Fig. 1. An illustration of similar case matching.

To address these challenges, we propose a model called the Event-Context Detection Model (ECDM) for SCM. To integrate event and context features, we introduce an event-context detection mechanism that formalizes the event and its context information. Specifically, ECDM learns event features from legal documents. Based on the observation that the severity of an event is often described by the context of the trigger word, we capture the context features and re-weight the event features accordingly. Subsequently, we integrate the re-weighted features with the hidden vectors of the context. This event-context detection mechanism enables the model to leverage both semantic and event features for inference, thereby improving accuracy and interpretability. Additionally, our approach avoids the labor-intensive task of manually labeling events in the target SCM dataset by pre-training an ED model, which serves as the auxiliary module of ECDM, and leveraging it to assist ECDM in completing the SCM task. We conduct experiments on the CAIL-2019 SCM dataset to evaluate the effectiveness of our proposed model.

To summarize, the main contributions of this paper are as follows:

(1) We propose a novel event-context detection model named ECDM with two characteristics: 1) it extracts event-context features in addition to detecting events, which helps improve the accuracy of SCM; 2) it utilizes an efficient pre-trained ED model instead of manually labeling events for the target dataset (the SCM dataset in this paper).

(2) To evaluate its performance, we conduct extensive experiments comparing ECDM with existing SCM and semantic matching models on a real-world dataset. The experiments show that ECDM yields substantial improvements in SCM. Further ablation tests and a case study demonstrate the effectiveness of our method.

2 Method

In this section, we will provide a detailed elaboration of the proposed ECDM. Firstly, we will define the SCM task. Then, we will present the overview of ECDM in Fig. 2, and discuss the specifics of each component in detail.

2.1 Problem Definition

The goal of SCM is to determine the similarity between legal documents and identify the case most similar to the target case. In this paper, for simplicity, the input of SCM is assumed to be a triplet. Given a triplet \((A, B, C)\) as input, case A, case B, and case C are three different legal case fact descriptions. We use word sequences to denote the triplet: \(A =\left[{w}_{1}^{a},{w}_{2}^{a}, \dots , {w}_{{l}_{a}}^{a}\right]\), \(B =\left[{w}_{1}^{b},{w}_{2}^{b}, \dots , {w}_{{l}_{b}}^{b}\right]\), and \(C =\left[{w}_{1}^{c},{w}_{2}^{c}, \dots , {w}_{{l}_{c}}^{c}\right]\), where \({l}_{k}\) is the length of the word sequence of case \(k\in \left\{a,b,c\right\}\), \({w}_{i}^{k}\in V\) denotes a character, and \(V\) is the fixed vocabulary. The SCM task can be represented as estimating a conditional probability \(P(y|A, B, C)\) based on the training set \({D}_{train}\), and the SCM model predicts the relative similarity for test examples by \({y}^{*} = \mathrm{argmax}_{y\in Y} P(y|A, B, C)\). Concretely, \(sim(A, B)\) denotes the similarity between case A and case B, and \(Y=\left\{0,1\right\}\), where \(y=1\) means that \(sim(A,B) < sim(A,C)\); otherwise \(y=0\).
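The labeling convention above can be sketched in a few lines of Python. This is only an illustration: `jaccard_sim` is a hypothetical stand-in for \(sim(\cdot ,\cdot )\) (character-set overlap), not the learned model; only the rule that \(y=1\) exactly when \(sim(A,B) < sim(A,C)\) is taken from the definition.

```python
def jaccard_sim(x: str, y: str) -> float:
    """Hypothetical stand-in for sim(., .): Jaccard overlap of character sets."""
    sx, sy = set(x), set(y)
    if not sx and not sy:
        return 1.0
    return len(sx & sy) / len(sx | sy)


def triplet_label(a: str, b: str, c: str, sim=jaccard_sim) -> int:
    """Return y = 1 if sim(A, B) < sim(A, C), otherwise y = 0."""
    return 1 if sim(a, b) < sim(a, c) else 0
```

Note that ties are mapped to \(y=0\), following the "otherwise" branch of the definition.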

Fig. 2. The framework of ECDM.

2.2 Model Overview

In this paper, we present the Event-Context Detection Model (ECDM), which learns to extract comprehensive representations of events, event contexts, and fact descriptions for downstream tasks. The architecture of our model is shown in Fig. 2. We first pre-train an ED model on the LEVEN dataset [13] as the auxiliary module. LEVEN serves as an auxiliary dataset to assist the model in downstream tasks. The auxiliary module is then integrated into ECDM to jointly complete the SCM task. In the encoding process, we use BERT [14] to obtain contextual representations of legal fact descriptions. Specifically, a triplet \((A, B, C)\) is considered as the input, where \(A\), \(B\), and \(C\) are fact descriptions of three cases. We propose the event-context mechanism to integrate event features and context information. In the event-context mechanism, the pre-trained auxiliary module is utilized to extract context features of the event. The interaction layer captures interactive semantic information between case pairs based on the original semantics of the fact descriptions and the event-context information. Finally, the output layer predicts the final results of SCM.

2.3 Detail of ECDM

Auxiliary Task.

We pre-train an auxiliary module on the ED task before training ECDM on SCM so that event information can be leveraged by ECDM in the SCM task. Formally, given an input sequence \(\left[{w}_{1}^{e},{w}_{2}^{e}, {w}_{3}^{e}, \dots , {w}_{{l}_{e}}^{e}\right]\), ED aims to predict the event label \({e}_{i}\) of each word. Although successful ED models exist, using them as an upstream task would lead to excessive computational cost for ECDM. Taking DMBERT [15] as an example, the input of this method must specify the position of the token to be predicted in the sentence, so each candidate token requires its own forward pass. If there are \(m\) sentences and each sentence contains \(n\) tokens, then predicting all events takes \(O(mn)\) forward passes. Since we need to complete downstream tasks on the basis of ED, such a cost is unacceptable. Therefore, we choose BERT + CRF, an ED model with low time complexity. It labels all tokens of a sentence at once, so performing the ED task on \(m\) sentences takes only \(O(m)\) forward passes, independent of text length.
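The complexity argument can be made concrete by counting encoder forward passes. The sketch below is illustrative only: the two function names are hypothetical, and the cost model (one pass per candidate token versus one pass per sentence) is an assumption that mirrors the \(O(mn)\) and \(O(m)\) figures stated above.

```python
def passes_per_token(sentences):
    """Candidate-position style: one encoder pass per (sentence, token) pair."""
    return sum(len(tokens) for tokens in sentences)


def passes_per_sentence(sentences):
    """Sequence-labeling style (encoder + CRF): one encoder pass per sentence."""
    return len(sentences)
```

For \(m\) sentences of \(n\) tokens each, the first count is \(mn\) and the second is \(m\), so the gap grows linearly with sentence length.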

Encoding.

As illustrated in Fig. 2, the encoder maps the triplet of fact descriptions into continuous hidden states, which contain contextual features. Inspired by the Siamese network, we design our encoder based on a shared-weight BERT that encodes every sequence in the triplet, which reduces the number of model parameters while fully considering the interaction information between different documents. Specifically, given a fact description triplet \((A, B, C)\), the shared-weight BERT is used to capture contextual representations for the triplet. Each fact description is represented as \({H}_{k}=\left[{h}_{1}^{k},{h}_{2}^{k}, \dots , {h}_{{l}_{k}}^{k}\right]\in {\mathbb{R}}^{{l}_{k}\times {d}_{s}}\), where \(k\in \left\{a,b,c\right\}\) and \({d}_{s}\) is the dimension of the hidden states.

Event-Context Detection Mechanism.

Fig. 3. An illustration of the event-context attention mechanism.

First, we load the parameters of the pre-trained auxiliary module and predict the event label sequence of each case as \({E}^{k}=[{e}_{1}^{k},{e}_{2}^{k},\dots ,{e}_{{l}_{k}}^{k}]\). To further extract the event information of the fact description, we feed the event labels \({E}^{k}\) and the contextual representation \({H}_{k}\) into the event-context detection layer. Specifically, we randomly initialize a lookup matrix that stores event embeddings, so that each event label corresponds to an embedding vector \({h}_{i}^{e}\). The embeddings of non-event labels are set to zero vectors to avoid interfering with event fusion. The event labels that the auxiliary module predicts for A, B, and C are then mapped to continuous vectors via the lookup matrix: \({H}_{a}^{e}=[{h}_{1}^{a,e},\dots ,{h}_{{l}_{a}}^{a,e}]\), \({H}_{b}^{e}=\left[{h}_{1}^{b,e},\dots ,{h}_{{l}_{b}}^{b,e}\right]\), \({H}_{c}^{e}=[{h}_{1}^{c,e},\dots ,{h}_{{l}_{c}}^{c,e}]\). Events have different severity levels in different contexts, which affects the similarity of cases. The event weight of case \(k\) is defined as:

$$W_{k}^{e} = \mathrm{softmax}\left( H_{k}^{e} W_{Q} \left( H_{k} W_{K} \right)^{T} + M \right)$$
(1)
$$M_{i,j} = \begin{cases} 0, & \text{allow to attend} \\ -\infty, & \text{prevent from attending} \end{cases}$$
(2)

where \({W}_{Q}\) and \({W}_{K}\) are learnable parameter matrices and \(M\) controls the window size of event attention. As Fig. 3 shows, since the severity information of an event is usually implied in its context, we allow each trigger word to attend only to the context adjacent to it, which prevents it from attending to information about other events. After that, the event-context vector \({\varepsilon }^{k}\) is calculated by:

$$\varepsilon^{k} = W_{k}^{e} \left( {H_{k} W_{V} } \right)$$
(3)

Here, \({W}_{V}\) is a learnable parameter matrix.
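A minimal pure-Python sketch of Eqs. (1)-(3) may help make the masking concrete. Several details here are assumptions rather than the authors' implementation: the window is taken to be symmetric around each position (`window // 2` tokens on each side), matrices are plain lists of rows, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]


def window_mask(length, window):
    """M[i][j] = 0 if |i - j| <= window // 2 (assumed window shape), else -inf."""
    half = window // 2
    return [[0.0 if abs(i - j) <= half else -math.inf for j in range(length)]
            for i in range(length)]


def softmax_rows(scores):
    """Row-wise softmax; masked entries (-inf) receive weight exactly 0."""
    out = []
    for row in scores:
        top = max(row)  # finite, because position i is always inside its own window
        exps = [math.exp(s - top) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def event_context(h_event, h, w_q, w_k, w_v, window):
    """Eq. (1): W = softmax(H^e W_Q (H W_K)^T + M); Eq. (3): context = W (H W_V)."""
    scores = matmul(matmul(h_event, w_q), transpose(matmul(h, w_k)))
    mask = window_mask(len(h), window)
    masked = [[s + m for s, m in zip(s_row, m_row)]
              for s_row, m_row in zip(scores, mask)]
    weights = softmax_rows(masked)
    return weights, matmul(weights, matmul(h, w_v))
```

In this sketch a token with a zero event embedding produces all-zero scores and therefore attends uniformly within its window; how such non-event positions are treated in the actual model is not specified here.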

Interaction.

In this layer, the interactive semantic information between case pairs is calculated with the multi-head attention mechanism [16]. Taking case A and case B as an example, we set keys \({K}_{ab}^{i}={H}_{a}{W}_{k}^{i}\), values \({V}_{ab}^{i}={H}_{b}{W}_{v}^{i}\), and queries \({Q}_{ab}^{i}={H}_{b}{W}_{q}^{i}\); that is, the hidden states of the encoder layer are linearly projected into keys, values, and queries for the \(i\)-th head. The semantic interaction features from case A to case B are calculated by:

$$\mathrm{attention}_{ab}^{\mathrm{multi}} = \mathrm{Multi\_head\_Attention}\left( Q_{ab}^{i}, K_{ab}^{i}, V_{ab}^{i} \right)$$
(4)

where \(Multi\_head\_Attention\) is the multi-head attention mechanism and \(i\in \left\{1,\dots ,n\right\}\) indexes the heads, with \(n\) denoting the number of heads.
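As a rough illustration of how the heads are combined, the sketch below assumes the standard scaled dot-product formulation of multi-head attention. It takes already-projected queries, keys, and values, splits the feature dimension into equal slices (one per head), and omits the learned per-head and output projections; all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of row vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        weights = softmax([sum(a * b for a, b in zip(q_row, k_row)) / scale
                           for k_row in k])
        out.append([sum(w * v_row[d] for w, v_row in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def split_heads(x, n_heads):
    """Split the feature dimension of x into n_heads equal slices."""
    size = len(x[0]) // n_heads
    return [[row[h * size:(h + 1) * size] for row in x] for h in range(n_heads)]


def multi_head_attention(q, k, v, n_heads):
    """Run each head independently and concatenate the per-head outputs."""
    heads = [single_head(qh, kh, vh)
             for qh, kh, vh in zip(split_heads(q, n_heads),
                                   split_heads(k, n_heads),
                                   split_heads(v, n_heads))]
    return [[x for head in heads for x in head[i]] for i in range(len(q))]
```

The sketch requires the key and value sequences to have the same length and the feature dimension to be divisible by the number of heads.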

Afterwards, we integrate the event-context features with the interactive semantic features. To measure the similarity between case A and case B, we calculate the difference and element-wise multiplication, then concatenate the results with the semantic features:

$$I_{ab} = \varepsilon^{a} \oplus \left( H_{a}^{e} \odot H_{b}^{a} \right) \oplus \mathrm{attention}_{ab}^{\mathrm{multi}}$$
(5)

The similarity features \({I}_{ba}\) from case B to case A are calculated in the same way as in Eq. (5). An attention mechanism is then used to compute the similarity features. We set keys \({K}_{ab}^{s}={I}_{ab}{W}_{K}^{s}\), values \({V}_{ab}^{s}={I}_{ab}{W}_{V}^{s}\), and queries \({Q}_{ab}^{s}={I}_{ba}{W}_{Q}^{s}\), and obtain the similarity features \({attention}_{ab}^{s}=[{s}_{1}^{ab},{s}_{2}^{ab},\dots ,{s}_{{l}_{a}}^{ab}]\) between case A and case B as:

$$\mathrm{attention}_{ab}^{s} = \mathrm{Multi\_head\_Attention}\left( Q_{ab}^{s}, K_{ab}^{s}, V_{ab}^{s} \right)$$
(6)

After that, it is fed into a max-pooling layer:

$$b_{s} = \mathrm{max\_pooling}\left( \mathrm{attention}_{ab}^{s} \right)$$
(7)

where \(max\_pooling\) stands for the pooling operation over the dimension of sequence length.
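Pooling over the sequence-length dimension, as in Eq. (7), can be sketched as a column-wise maximum. The function name is hypothetical, and the input is assumed to be a list of \(l\) feature vectors of equal dimension \(d\).

```python
def max_pool_over_sequence(features):
    """Collapse an (l x d) feature matrix to a d-dimensional vector
    by taking the maximum of each feature column across positions."""
    return [max(column) for column in zip(*features)]
```

For example, `max_pool_over_sequence([[1, 5], [3, 2], [0, 4]])` returns `[3, 5]`, so the output size does not depend on the sequence length.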

Prediction and Loss Function.

Taking as input the similarity features \({b}_{s}\) and \({c}_{s}\), where \({c}_{s}\) is obtained for case A and case C in the same way as \({b}_{s}\), the predicted distribution \(\widehat{y}\) is calculated as follows:

$$R = b_{s} \oplus c_{s}$$
(8)
$$\hat{y} = \mathrm{softmax}\left( W^{y} R + b^{y} \right)$$
(9)

Finally, we use the cross-entropy loss function to train our model:

$$\mathcal{L}^{s} = - \sum_{i} \left( y_{i} \log \hat{y}_{i} + \left( 1 - y_{i} \right) \log \left( 1 - \hat{y}_{i} \right) \right)$$
(10)

where \({y}_{i}\) is the ground-truth label and \({\widehat{y}}_{i}\) is the predicted result. The training objective of ECDM is to minimize the cross-entropy between the predicted distribution \(\widehat{y}\) and the ground-truth distribution \(y\).
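The output layer and loss in Eqs. (8)-(10) can be sketched as follows. This is a simplified illustration with hypothetical function names: \({W}^{y}\) is given as one weight row per class, and probabilities are clamped before taking logarithms, which is a numerical safeguard added here and not part of the stated formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(b_s, c_s, w, b):
    """Eq. (8)-(9): R = b_s concatenated with c_s, then softmax(W R + b)."""
    r = list(b_s) + list(c_s)
    logits = [sum(wi * ri for wi, ri in zip(row, r)) + bias
              for row, bias in zip(w, b)]
    return softmax(logits)


def binary_cross_entropy(y_true, p_pos, eps=1e-12):
    """Eq. (10) summed over examples; p_pos[i] is the predicted probability of y = 1."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total
```

With all-zero weights the prediction is uniform over the two classes, and each example then contributes \(\log 2\) to the loss.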

3 Experiments

In this section, we investigate the effectiveness of ECDM on SCM through a series of experiments conducted on a public dataset. We compare the performance of our model with several baselines to demonstrate its superiority. Additionally, we conduct ablation experiments to evaluate the effectiveness of each module in ECDM. Lastly, we present a typical case from the dataset to illustrate the working mechanism of our model.

3.1 Baselines

To verify the effectiveness of the proposed model, we compare our model with the following competitive baseline models.

TF-IDF.

As a robust classification baseline, TF-IDF [17] is used to extract features of the inputs, and an SVM [18] is adopted as the classifier.

TextCNN.

TextCNN [19] is a renowned CNN-based text classification model. However, due to its limitations in capturing long text features, we introduce a Siamese network-based variant called TextCNNS to overcome this challenge.

SMASH-RNN.

Jiang et al. [20] propose a hierarchical RNN based on attention, which uses the document structure to improve the representation of long-form documents.

Lawformer.

We implement two versions of Lawformer [10], a Longformer-based language model for legal case documents: LawformerC and LawformerS, based on concatenation and the Siamese network, respectively.

BERT.

BERT [14] is a mainstream pre-trained language model. Since BERT limits the length of its input, we implement a Siamese network-based version, denoted as BERTS.

BERT-PLI.

BERT-PLI [9] breaks the text into paragraphs and calculates similarity at the paragraph level.

LFESM.

Hong et al. [1] extract legal elements via regular expressions and adopt BERT to capture long-range dependencies in the legal documents.

3.2 Datasets and Experiment Settings

Hyper-parameters are tuned on the validation dataset. For TF-IDF, we set the feature size to \(2{,}000\). The filter widths of TextCNN are \(\{2,3,4,5\}\), and each filter size is \(25\). For SMASH-RNN, the hidden state size is \(768\). For the BERT-based models, we adopt the bert-xs checkpoint from OpenCLaP as the basic encoder. Since Lawformer can process longer inputs, we set the maximum length of each input to \(700\) for the Lawformer-based models and to \(512\) for the remaining models.

We train the auxiliary module on the LEVEN dataset with a dropout rate of \(0.1\) in each layer and a batch size of \(16\). The rest of our model is trained on the CAIL-2019 dataset. The window size of the event-context detection layer is \(64\). As for the interaction layer, the hidden size of the multi-head attention layer is set to \(768\), and the number of heads in the multi-head attention layer is \(8\). The dropout rate per layer in the student model is \(0.3\), and the batch size during the training process is \(8\). We use Adam [21] as the optimizer for the whole model and set the learning rate to \(1\times {10}^{-5}\).

Since SCM is a binary classification task with a balanced dataset (CAIL-2019), we use Accuracy (Acc.) as our evaluation metric to objectively measure the effectiveness of ECDM and the baselines. It is worth mentioning that the validation set and test set of CAIL-2019 are divided by the original authors, ensuring fair performance evaluation. Therefore, we report results on both the validation set and the test set.

3.3 Experimental Results

Table 1 presents the experimental results of our model compared to baselines on the validation and test datasets of CAIL-2019. Our model, ECDM, demonstrates significant improvements in accuracy over previous baselines. Specifically, compared to the previous state-of-the-art SCM model, our model achieves a 2.4% and 4.3% increase in accuracy on the validation and test datasets, respectively, validating the effectiveness of our approach. The results highlight that our model effectively extracts and utilizes context features of events, which are crucial for SCM. The event-context detection layer assigns different context to different events, mitigating the impact of the same event in the same description. The interaction layer enables comprehensive extraction of global semantic and local event features from the fact description. As a result, our model outperforms the baselines significantly.

Table 1. Similar Case Matching Results on CAIL-2019

In addition, we observe that Siamese-based models generally outperform concatenation-based models. This suggests that neural network models are better suited to encoding a single case than a concatenated case triplet, since encoding cases separately reduces interfering information. It supports the use of a Siamese-based architecture in the encoder layer of ECDM.

3.4 Ablation Study

To study the impact of each layer in our model, we designed several ablation tests to investigate the performance of ECDM. The results of the ablation test are shown in Table 1.

When we replace the Siamese network-based encoder layer with the concatenation-based version and replace the interaction layer with a self-attention mechanism, the performance degrades noticeably. As mentioned earlier, neural networks tend to encode cases individually, and ECDM is no exception.

Furthermore, when we remove the event-context detection mechanism and directly concatenate event sequences with original semantic features, the model loses the ability to capture the event context in fact descriptions. As a result, the accuracy drops by at least 3.9%. To further demonstrate the effectiveness of the auxiliary module and event-context detection layer, we replace the auxiliary module with a random sequence of events. The accuracy results show a decrease of at least 6.9%. This indicates that the accuracy of event prediction has a significant impact on the model’s performance. On the other hand, ECDM is able to reduce the accumulation of auxiliary module errors by flexibly learning the embedded vectors of events, which highlights the robustness and effectiveness of our proposed model.

Finally, we remove the interaction layer and feed the results of the event-context detection layer into the output layer. The decrease in the results demonstrates that the interaction layer plays an irreplaceable role in our model.

3.5 Impact of Window Size

To further explore the effectiveness of the event-context detection layer, we test our model with various attention window sizes. The results are shown in Fig. 4. We find that the performance with a window size of 4 or 16 is poor: the accuracy of the model is around 50%, which is approximately equal to random guessing. We suppose that, due to language habits, the words immediately adjacent to trigger words are similar within a small range, so they cannot provide helpful context features and instead interfere with the original semantic information. As a result, the model fails to converge. When the window size is too large, event-context detection degrades to approximately global attention, and trigger words attend to tokens that do not describe them, which hurts performance. At the same time, the model requires more training iterations to filter out, from the larger attention window, the words that determine the severity. The model achieves the best performance when the attention window size is 64. In this case, each trigger word can focus on the words that describe it and avoid interference from other words.

Fig. 4. Performance of different window sizes of ECDM.

3.6 Case Study

We present a typical example from the training dataset to illustrate how our method works, as shown in Fig. 5. Because the original text is in Simplified Chinese, the order and segmentation of the text cannot be reflected in the translation, so we do not add callout symbols to the translation. First, there are four events in this paragraph: pay, request, rent/borrow, and bodily harm. In the context of these events, we highlight the parts with high attention weights. Note that each event attends to the relevant part of its context. In addition, no event occurs in the general text at the end, so this part of the text has no event-context features. Note also that ECDM does not involve case pairs when extracting event features, so the features remain suitable for single-text legal tasks. Therefore, we consider exploring the application of ECDM to more downstream legal tasks as our future work.

4 Conclusion

In this paper, we explore the task of similar case matching and propose the event-context detection model (ECDM) to solve it. First, we introduce the event-context detection mechanism, which provides event-context information in addition to detecting events and helps improve the efficiency of downstream tasks by utilizing a pre-trained event detection model instead of manually labeling events for the target dataset. We then improve the performance of the SCM task with ECDM by utilizing event context as side information. The experiments show that ECDM outperforms the state-of-the-art model in accuracy, which indicates that our model can effectively leverage event-context features from fact descriptions to improve performance and has the potential to be applied to other downstream subtasks of legal intelligence. In future work, we will explore more downstream tasks to investigate the effectiveness of ECDM.

Fig. 5. A typical example from the training dataset.