1 Introduction

Peer review is central to research validation: multiple experts review a paper independently and then provide their opinions in the form of reviews. Reviewers are sometimes required to provide a ‘recommendation score’ to reflect their assessment of the work, and sometimes also a ‘confidence score’ to indicate how familiar they are with the related literature or how confident they are in their evaluation. Not only the review text but also additional signals such as recommendation and confidence scores from multiple reviewers help the chairs/editors form a better sense of the merit of the paper and assist them in reaching a decision on its acceptance or rejection at the concerned venue. The chair/editor then writes a meta-review that consolidates the reviewers’ views while justifying the decision on the paper’s fate, and finally communicates it to the authors.

With the growing number of scientific paper submissions to top-tier conferences, AI interventions [1] to counter the information overload seem justified. An AI assistant that generates an initial meta-review draft would help the chair craft a meaningful meta-review quickly. In this initial investigation, we explore whether we can leverage all the signals available to a human meta-reviewer (e.g., review text, reviewer’s recommendation, reviewer’s conviction, reviewer’s confidence [2], final judgment [3], and others) to automatically write a decision-aware meta-review. We design a deep neural architecture pipeline that performs the relevant sub-tasks at different stages to finally generate a decision-aware meta-review. Our primary motivation in this work is to replicate the stages of a human peer-review process while assisting the chairs in making informed decisions and writing good-quality meta-reviews.

Specifically, we present a decision-aware, transformer-based, multi-encoder deep neural architecture that generates a meta-review while also predicting the final acceptance decision, which in turn is fueled by predicting the reviewer’s sentiment, recommendation, conviction/uncertainty [4], and confidence in intermediate steps. The multi-encoder produces three separate representations of the three peer reviews as input to the decoder. We use the review text and the reviewer’s sentiment to predict the recommendation score. We then use the predicted recommendation score, along with the reviewer’s uncertainty (which we predict via a separate model [5]), to predict the confidence score. For each paper, we use the predicted recommendation score, uncertainty score, confidence score, sentiment, and the representations of the three reviews to arrive at the final decision. Finally, we use the decision to generate a decision-aware meta-review. We evaluate our generated meta-reviews both qualitatively and quantitatively. Although we achieve encouraging results, we emphasize that the current investigation is in its initial phase; further fine-tuning and a deeper probe would be required to consolidate our findings. Nevertheless, our approach to meta-review generation is novel and encompasses almost all the human factors in the peer-review process.

The rest of the paper is organised as follows: Relevant prior works are discussed in Sect. 2. The dataset is described in Sect. 3. Our proposed methodology is presented in Sect. 4 along with a description of the sub-tasks incorporated in the pipeline. The evaluation metrics, baselines and comparing systems are described in Sect. 5. Results and analysis are given in Sect. 6. Finally, the conclusion is drawn in Sect. 7.

2 Related work

Although the problem is ambitious and new, there are a handful of investigations in the recent literature. The most relevant is decision-aware meta-review generation [6], where the authors mapped the three reviews to a high-level encoder representation and used the last hidden states to predict the decision while using a decoder to automatically generate the meta-review. In MetaGen [7], the authors first generate an extractive draft and then use a fine-tuned UniLM [8] (Unified Language Model) for final decision prediction and abstractive meta-review generation. In [9], the authors investigate the role of summarization models and how far we are from meta-review generation with large pre-trained models. We attempt a similar task, but go one step further and perform multiple relevant sub-tasks at various stages of the peer-review process to automatically generate the meta-review, simulating the human peer-review process to a greater extent.

We also discuss some relevant work on decision prediction in peer reviews that adds further context to the problem. The PeerRead dataset [10] is the first publicly available peer-review resource that encouraged Natural Language Processing (NLP)/Machine Learning (ML) research on peer-review problems; its authors defined two novel NLP tasks, viz. decision and aspect-score prediction [11]. Another work on conference paper acceptance prediction [12] extracted features from the paper, such as title length, number of references, and number of tables and figures, to predict the final decisions using machine learning algorithms. The authors of DeepSentiPeer [13] used three channels of information (the paper, the corresponding review, and the review polarity) to predict the overall recommendation as well as the final decision. There are a few other works on NLP/ML for peer-review problems [14, 15], such as aspect extraction [16] and sentiment analysis, which are worth exploring to understand the related investigations in this domain.

3 Dataset

Research on the peer-review system has been limited because of data privacy, confidentiality, and its closed nature. However, in the last few years, open review platforms have emerged where reviews and comments, along with the decisions, are posted publicly. This new mode of reviewing has made data available for studying the process.

3.1 Data collection

We collect the required peer-review data (reviews, meta-reviews, recommendation scores, and confidence scores), along with the acceptance/rejection decisions, from the OpenReviewFootnote 1 platform for the top-tier ML conference ICLR for the years 2018, 2019, 2020, and 2021. Most papers in our dataset have three reviews. After pre-processing and eliminating some unusable reviews/meta-reviews, we arrive at 7,072 papers with associated peer-review data for our experiments. We use 75% of the data as the training set (5,304 papers), 15% as the test set (1,060), and the remaining 10% as the validation set (708). Our proposed model treats each review individually (it does not concatenate them), so for training we permute the order of the three reviews (3! = 6 orderings per paper), which expands the 5,304 training papers into 31,824 training instances. We provide the total number of reviews and meta-reviews and their lengths in Table 1.
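For clarity, a minimal sketch of this permutation step is shown below. The field names (`reviews`, `meta_review`) are illustrative placeholders, not our actual data schema.

```python
# Minimal sketch (assumed field names): expand each paper's three reviews into
# all 3! = 6 orderings, so 5,304 training papers yield 31,824 training instances.
from itertools import permutations

def expand_with_permutations(papers):
    """papers: list of dicts with keys 'reviews' (list of three review texts)
    and 'meta_review' (target text). Returns one training instance per ordering."""
    instances = []
    for paper in papers:
        for ordered_reviews in permutations(paper["reviews"]):
            instances.append({
                "review_1": ordered_reviews[0],
                "review_2": ordered_reviews[1],
                "review_3": ordered_reviews[2],
                "meta_review": paper["meta_review"],
            })
    return instances
```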

Table 1 Details of the reviews and meta-review in our dataset across ICLR editions
Table 2 Paper distribution for decision prediction task
Fig. 1 Recommendation and confidence score normalized data distribution across labels

Table 2 shows the distribution of reviews across paper-categories (Accepted or Rejected).

3.2 Data pre-processing

For the recommendation and confidence labels, we normalize the values to a Likert scale of 1 to 5 and remove label categories that account for less than 0.01% of the data (rare classes). A recommendation score of 1 means a strong reject and 5 a strong accept. Similarly, a confidence score of 1 indicates that the reviewer’s evaluation is an educated guess, either because the paper is not in the reviewer’s area or because it was complicated to understand; a confidence score of 5 indicates that the reviewer is absolutely sure their evaluation is correct and that they are very familiar with the relevant literature. From Fig. 1, we can see that 85% of the recommendation-score labels belong to only two classes, with the remaining three classes together accounting for only 15%; the confidence-score distribution is similar, with 78% taken up by two classes, leaving only 22% for the rest.

Ideally, a meta-review should contain all the key/deciding aspects collated from the multiple reviews along with the final decision on the concerned manuscript. Thus, we exclude manuscripts for which the meta-review or a review contains fewer than 10 word tokens, as such short meta-reviews/reviews contribute negligibly (sometimes even negatively) to the learning process.
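An illustrative sketch of this pre-processing is given below. The column names (`recommendation`, `confidence`, `meta_review`, `review_text`) are assumptions for illustration, not our actual schema.

```python
# Illustrative pre-processing sketch (column names are assumptions):
# drop rare score classes and papers with very short reviews/meta-reviews.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only Likert-scale labels (1-5) whose relative frequency is at least 0.01%.
    for col in ("recommendation", "confidence"):
        freq = df[col].value_counts(normalize=True)
        df = df[df[col].isin(freq[freq >= 0.0001].index)]

    # Discard rows whose meta-review or review has fewer than 10 word tokens.
    long_enough = (
        df["meta_review"].str.split().str.len().ge(10)
        & df["review_text"].str.split().str.len().ge(10)
    )
    return df[long_enough]
```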

4 Methodology

The peer-review decision is the central component of a meta-review. The chair writes the meta-review once they have already decided on the paper’s fate; hence, meta-reviews are decision-aware. We briefly discuss the sub-tasks in our pipeline in the subsequent sections.

Table 3 Example of sentiment encoding from VADER for review sentences with corresponding recommendation scores

4.1 The various prediction sub-tasks

Here we describe the three prediction sub-tasks which we utilize later (directly and indirectly) for aiding the main task of generating meta-reviews.

Recommendation Score Prediction We take the reviews along with sentence-level sentiment encodings to predict recommendation scores. In Table 3, we show examples of review sentences, their sentiment encodings (via VADER [17]), and the final recommendations made by the corresponding reviewers. We can see that the reviewer’s sentiment (positive/negative/neutral) correlates directly with the final recommendation score; hence, we incorporate sentiment encodings for recommendation score prediction. We fine-tune a pre-trained Bidirectional Encoder Representations from Transformers (BERT) [18] model for this task. BERT is a bidirectional transformer pre-trained with a combination of masked language modeling and next-sentence prediction objectives on a large corpus comprising the Toronto Book Corpus and Wikipedia. The BERT contextual representation of the review, augmented with the VADER sentiment encoding, is fed into a feed-forward network with ReLU, dropout, and batch-norm sub-layers; a final softmax layer performs multi-class classification to predict the score.
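The sketch below illustrates this sub-task. It assumes one VADER compound score per review sentence (padded/truncated to a fixed number of sentences) concatenated with the pooled BERT representation; the layer sizes, pooling choice, and sentence splitting are illustrative rather than our exact configuration, and the NLTK port of VADER (requiring the `vader_lexicon` resource) stands in for the original library.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # NLTK port of VADER

class RecommendationScorer(nn.Module):
    """Sketch: pooled BERT encoding of a review, concatenated with a fixed-length
    vector of sentence-level VADER compound scores, fed to an MLP classifier."""
    def __init__(self, n_classes=5, max_sents=30):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size + max_sents, 256),
            nn.ReLU(), nn.Dropout(0.2), nn.BatchNorm1d(256),
            nn.Linear(256, n_classes),   # logits; softmax applied in the loss
        )

    def forward(self, input_ids, attention_mask, sentiment_feats):
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        return self.head(torch.cat([pooled, sentiment_feats], dim=-1))

def sentiment_encoding(review_text, max_sents=30):
    """Sentence-level VADER compound scores, padded/truncated to max_sents."""
    sia = SentimentIntensityAnalyzer()
    sents = [s.strip() for s in review_text.split(".") if s.strip()][:max_sents]
    scores = [sia.polarity_scores(s)["compound"] for s in sents]
    return torch.tensor(scores + [0.0] * (max_sents - len(scores)))
```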

Confidence Score Prediction For confidence score prediction, we take the review’s BERT representation, the predicted recommendation score, and the uncertainty score as input. To generate the uncertainty score, we use a pre-trained hedge-detection model [5], an XLNet [19] trained on the BioScope Corpus [20] and the SFU Review Corpus [21], to detect uncertain words. From these predicted uncertain words, we define an uncertainty score as the ratio of the number of uncertain word tokens in a review to the total number of word tokens in that review. We deem uncertainty/hedge cues from the reviewer an important signal for predicting the reviewer’s confidence or conviction. We add the uncertainty score as a feature to the BERT contextual representation of the review and the predicted recommendation score. These features are passed through a linear layer, followed by dropout, batch norm, and ReLU, then another linear layer and a softmax for the final confidence score prediction. The model architecture used for recommendation score and confidence score prediction is shown in Fig. 2.
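The uncertainty score itself is a simple ratio; a sketch is shown below, where `detect_hedge_tokens` stands in for the pre-trained hedge-detection model of [5] (its interface here is hypothetical).

```python
def uncertainty_score(review_tokens, detect_hedge_tokens):
    """Ratio of uncertain/hedge word tokens to all word tokens in a review.

    review_tokens: list of word tokens of one review.
    detect_hedge_tokens: callable returning the subset of tokens flagged as
    uncertain by the hedge-detection model (hypothetical interface).
    """
    if not review_tokens:
        return 0.0
    uncertain = detect_hedge_tokens(review_tokens)
    return len(uncertain) / len(review_tokens)
```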

Fig. 2 Detailed architecture for recommendation score and confidence score prediction

Fig. 3 Detailed architecture for decision prediction. Rating here refers to the predicted recommendation scores

Decision Prediction Finally, we build a model that takes the three review representations, the predicted recommendation scores, the predicted confidence scores, the sentence-level sentiment encodings of the reviews from VADER, and the predicted uncertainty scores [5] as input to predict the final peer-review decision on the paper. We present the model architecture for decision prediction in Fig. 3. The inputs are fed into linear layers, pass through ReLU, dropout, and batch-norm sub-layers, and then reach another linear layer for the final binary classification, i.e., the decision prediction.
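A minimal sketch of this classifier is given below, assuming pre-computed review representations and a single concatenated vector of the score/sentiment/uncertainty features; the dimensions and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class DecisionPredictor(nn.Module):
    """Sketch: concatenate the three review representations with the predicted
    scores and sentiment/uncertainty features, then classify accept/reject.
    Dimensions are illustrative, not our exact configuration."""
    def __init__(self, review_dim=768, score_feat_dim=9, hidden=256):
        super().__init__()
        in_dim = 3 * review_dim + score_feat_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(), nn.Dropout(0.2), nn.BatchNorm1d(hidden),
            nn.Linear(hidden, 2),   # binary accept/reject logits
        )

    def forward(self, rev1, rev2, rev3, score_feats):
        x = torch.cat([rev1, rev2, rev3, score_feats], dim=-1)
        return self.net(x)
```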

4.2 Seq-to-seq meta-review generation: main task

Since our final problem is a generation task, we use a transformer-based sequence-to-sequence encoder-decoder architecture. As most papers in our data have three reviews, we use a transformer model with three encoders and one decoder to automatically generate the meta-review. To make our multi-source transformer decision-aware, we take the decision model’s last encoding as input and pass it into the decoder layers to provide the decision context (refer to Fig. 4).

Fig. 4 Final model architecture of seq-to-seq decision-aware meta-review generation leveraging the encoded decision from Fig. 3

The three encoders act as feature extractors that map the input to a high-level representation. With this representation, the decoder predicts the output sequence one token at a time, auto-regressively. The encoder consists of N layers of multi-head self-attention, a feed-forward network, and residual connections. The decoder consists of M layers with sub-layers of multi-head self-attention, a feed-forward network, and an additional cross-attention, also known as multi-head encoder-decoder attention. In a multi-source transformer, cross-attention over three sets of past key-value pairs can be modeled in several ways [22]; we use the parallel strategy to produce a rich representation from the three encoders. In addition, we train a byte-pair encoding tokenizer with the same special tokens as RoBERTa [23] and set its vocabulary size to 52,000.
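The tokenizer training step can be sketched as follows with the HuggingFace tokenizers library; the file path and the minimum-frequency setting are placeholders/assumptions, and only the vocabulary size and RoBERTa-style special tokens come from our setup.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer with RoBERTa-style special tokens.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["reviews_train.txt"],          # placeholder path to the raw review text
    vocab_size=52_000,
    min_frequency=2,                      # illustrative setting
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer")         # writes vocab.json and merges.txt
```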

5 Evaluation

In this section, we first describe the evaluation metrics chosen for automatic evaluation of the model-generated meta-reviews and then discuss the selected baselines and comparing systems.

5.1 Evaluation metrics

To evaluate the multi-class prediction (recommendation and confidence score) and binary prediction (final decision) tasks, we use accuracy, F1 score, and Root Mean Squared Error (RMSE) from scikit-learn.Footnote 2 For the meta-review generation task, we use popular automatic evaluation metrics for text generation and summarization. Since no single metric gives the best evaluation of a generated summary, we use ROUGE-1, ROUGE-2, ROUGE-3 [24], BERTScore [25], S3 [26], and BLEU [27]. Below, we describe these metrics for meta-review generation:
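A brief usage sketch of the scikit-learn metrics is shown below; the label arrays are placeholders, and the macro averaging for F1 is one possible choice rather than necessarily the setting we used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Illustrative usage on predicted vs. gold labels (arrays are placeholders).
y_true = np.array([4, 3, 5, 2, 4])
y_pred = np.array([4, 4, 5, 3, 4])

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="macro")      # macro-averaged F1 (one choice)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
```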

ROUGE This widely adopted summarization evaluation metric stands for Recall-Oriented Understudy for Gisting Evaluation. ROUGE scores range from 0 to 1: 0 means the reference summary shares no common n-gram unit with the generated summary, and 1 means all n-gram units in the reference summary are captured by the model-generated summary. ROUGE-N measures unigram, bigram, trigram, and higher-order n-gram overlap between the candidate and the reference.

ROUGE-N recall between a system-generated summary and a reference summary computes how much of the information contained in the reference summary is captured by the system-generated summary. It is calculated as follows:

$$\begin{aligned} {\text {ROUGE-N}}_\textrm{recall} = \frac{\#\,\text {matching n-grams between candidate and reference}}{\#\,\text {n-grams in reference}} \end{aligned}$$
(1)

On the other hand, ROUGE-N precision computes how much of the system-generated candidate summary actually overlaps with the reference summary and is calculated as follows:

$$\begin{aligned} {\text {ROUGE-N}}_\textrm{precision} = \frac{\#\,\text {matching n-grams between candidate and reference}}{\#\,\text {n-grams in candidate}} \end{aligned}$$
(2)
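As a concrete illustration, the snippet below computes ROUGE-1/2/3 with Google's rouge-score package; the two texts are placeholders, and this package is one possible implementation rather than necessarily the one we used.

```python
from rouge_score import rouge_scorer

# Hedged example with the rouge-score package; the texts are placeholders.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rouge3"], use_stemmer=True)
scores = scorer.score(
    target="the reviewers agree the paper is well motivated but lacks experiments",
    prediction="reviewers find the paper well motivated but experiments are limited",
)
print(scores["rouge1"].precision, scores["rouge1"].recall, scores["rouge1"].fmeasure)
```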

S3 This automatic scoring metric is a regression model trained on human judgment datasets from the TAC conferences. It uses existing automatic metrics, such as ROUGE, JS-divergence, and ROUGE-WE, as features and predicts a score; the regression model learns the feature combination that correlates best with human judgments. The authors used the Pyramid and Responsiveness annotations for their experiments, and models were trained and tested with leave-one-out cross-validation. The two underlying manual evaluation schemes are:

  • Responsiveness: Human annotators score summaries on a Likert scale ranging from 1 to 5.

  • Pyramid: Summarization Content Units (SCUs) are identified by annotating the information used to compare the content of summaries. SCUs are variable in length but are no bigger than a sentential clause; they emerge from annotating a corpus of summaries for the same input. SCUs that appear in more manual summaries receive greater weights, so a pyramid is formed after SCU annotation of the manual summaries. The SCUs in a peer summary are then compared against this pyramid to evaluate how much information the peer summary shares with the manual summaries. A key feature of a pyramid is that it quantitatively represents agreement among the human summaries.

BERTScore BERTScore evaluates generated text with pre-trained BERT contextual embeddings. It computes the similarity of the candidate and reference summaries as a sum of cosine similarities between their token embeddings. Contextual embeddings give different vector representations for the same word in different sentences, depending on the surrounding words that form the context of the target word.

Given a reference sentence tokenized into k tokens \(x=(x_1,x_2,\ldots ,x_k)\) and a candidate sentence tokenized into l tokens \(\hat{x}=(\hat{x}_1,\ldots ,\hat{x}_l)\), where each token is represented by its contextual embedding, BERTScore computes matches using cosine similarity. Recall is computed by matching each token in the reference sentence x to a token in the candidate sentence \(\hat{x}\), and precision by matching each token in the candidate sentence \(\hat{x}\) to a token in the reference sentence x. The F1 score combines precision and recall. A greedy approach matches each token to the most similar token in the other sentence to maximize the similarity score.

For a reference x and candidate \(\hat{x}\), the recall, precision, and F1 scores are:

$$\begin{aligned} R_\textrm{BERTScore} = \frac{1}{\vert x \vert }\sum _{x_i \in x} \max _{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j \end{aligned}$$
(3)
$$\begin{aligned} P_\textrm{BERTScore} = \frac{1}{\vert \hat{x} \vert }\sum _{\hat{x}_j \in \hat{x}} \max _{x_i \in x} x_i^\top \hat{x}_j \end{aligned}$$
(4)
$$\begin{aligned} F1_\textrm{BERTScore} = 2\cdot \frac{P_\textrm{BERTScore} \cdot R_\textrm{BERTScore}}{P_\textrm{BERTScore} + R_\textrm{BERTScore}} \end{aligned}$$
(5)
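In practice, BERTScore can be computed with the bert-score package as in the hedged example below; the candidate and reference texts are placeholders.

```python
from bert_score import score

# Hedged example with the bert-score package; texts are placeholders.
candidates = ["the reviewers recommend acceptance after minor revisions"]
references = ["all reviewers lean towards acceptance with minor revisions"]
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(P.mean().item(), R.mean().item(), F1.mean().item())
```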

BLEU This widely used machine-translation metric stands for Bilingual Evaluation Understudy. It is a precision-oriented metric that calculates the n-gram overlap between the candidate and the reference summary as follows:

$$\begin{aligned} \hbox {precision}_n = \frac{\displaystyle \sum _{c\in \text {candidates}}\;\sum _{\text {n-gram}\in c}\hbox {Count}_\textrm{clip}(\text {n-gram})}{\displaystyle \sum _{c\in \text {candidates}}\;\sum _{\text {n-gram}\in c}\hbox {Count}(\text {n-gram})} \end{aligned}$$
(6)
$$\begin{aligned} \hbox {Count}_\textrm{clip}(\text {n-gram}) = \min \Bigl (\text {matched n-gram count},\; \max _{r\in \textrm{refs}}\bigl (\text {n-gram count in } r\bigr )\Bigr ) \end{aligned}$$
(7)

A brevity penalty (BP) accounts for the length of the candidate summary: the score computed so far is multiplied by BP, which penalizes candidates that are shorter than the reference. If the candidate summary is at least as long as the reference summary, the multiplier is 1.

$$\begin{aligned} \hbox {BP} = {\left\{ \begin{array}{ll} 1, &{} \text {if } c > r \\ e^{\,1-\frac{r}{c}}, &{} \text {if } c \le r \end{array}\right. } \end{aligned}$$
(8)
$$\begin{aligned} \text {BLEU-N} = \hbox {BP}\cdot \exp \left( \sum _{n=1}^{N}W_{n}\log \hbox {precision}_{n}\right) \end{aligned}$$
(9)

where c and r denote the candidate and reference lengths, N is the maximum n-gram order (4 by default), and \(W_n\) are the weights of the individual n-gram precisions.
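The hedged example below computes corpus-level BLEU with sacreBLEU, one possible implementation rather than necessarily the one we used; the hypotheses and references are placeholders, and note that sacreBLEU reports scores on a 0-100 scale.

```python
import sacrebleu

# Hedged example with sacreBLEU; hypotheses and references are placeholders.
hypotheses = ["the reviewers recommend acceptance after minor revisions"]
references = [["all reviewers lean towards acceptance with minor revisions"]]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus-level BLEU on the 0-100 scale
```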

5.2 Baselines and comparing systems

Our initial experiments include PEGASUS and a BART-based summarizer, which we treat as baselines for comparison, along with two variants of our proposed model. We also use a pre-trained decision-aware MRG model and run it on our test data.

We set the learning rate to 5e-05, the number of beams for beam search to 4, the loss to cross-entropy, and the optimizer to Adam. We train the different models for 100 epochs with a linear learning-rate scheduler and choose the best variant in terms of validation loss.
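A minimal sketch of this optimization setup in PyTorch/transformers is shown below; the helper function shape, the zero warmup steps, and the caller-supplied `model` and `train_loader` are assumptions for illustration.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimization(model, train_loader, num_epochs=100, lr=5e-5):
    """Sketch of the optimization setup; `model` and `train_loader` are supplied
    by the caller (placeholders here)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    total_steps = num_epochs * len(train_loader)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps
    )
    criterion = torch.nn.CrossEntropyLoss()   # cross-entropy over target tokens
    return optimizer, scheduler, criterion

# Decoding at inference time uses beam search with num_beams = 4.
```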

PEGASUS [28] PEGASUS is an abstractive summarization model that uses a self-supervised objective, Gap Sentences Generation (GSG), to train a transformer-based encoder-decoder. In PEGASUS, important sentences are removed/masked from an input document and generated together as one output sequence from the remaining sentences, similar to an extractive summary. The best PEGASUS model was evaluated on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills, achieving state-of-the-art performance on all 12 datasets as measured by ROUGE.

Table 4 Model scores for automatic evaluation metrics
Table 5 Results with respect to F1 score and overall accuracy for decision prediction, where S \(\rightarrow \) sentiment and H \(\rightarrow \) uncertainty score

BART [29] BART uses a standard transformer-based seq2seq architecture with a bidirectional encoder and a unidirectional decoder. Its pre-training involves randomly shuffling the order of the original sentences and a novel in-filling scheme in which text spans are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation and also works well for comprehension tasks. We use the Hugging Face implementation with 12 encoder layers, 12 decoder layers, and pre-trained weights,Footnote 3 and fine-tune it on our dataset to generate the meta-review.

Simple Meta-Review Generator (S-MRG) This is a simple transformer-based architecture with three encoders, each with two encoder layers that map the inputs to a high-level representation, and a decoder of two decoder layers. Softmax normalization applied to the decoder’s last hidden state yields the probability distribution over the target vocabulary, from which the sequence is generated recursively, one token at a time, auto-regressively.

Decision-Aware MRG (MRG Decision) [6] MRG Decision predicts the decision from the encoders’ hidden states and carries the decision vector, encoded from the encoder hidden-state output, to the decoder layer to provide context to the generator module. The decoder’s last hidden state, after the softmax layer, predicts the sequence recursively.

Proposed Approach/Model: Seq-to-Seq Decision-Aware Meta-Review Generation (\({S2S}_{{MRG}}\)) We improve the decision prediction model by using richer input features, as we notice that MRG Decision falls short in decision making (accuracy of 63%). As shown in Fig. 4, our model \(S2S_{MRG}\) uses the encoded decision vectors in all decoder layers, where they are concatenated before the feed-forward sub-layer to provide context to the generator module.

Our proposed approach takes input from the decision-prediction module (hence decision-aware just as human chairs do) to generate the meta-reviews.

6 Results and analysis

We present the results of the automatic evaluation of the model-generated meta-reviews, along with a comparison with the selected baselines and comparing systems, in Table 4. In Table 5, we present the performance of our decision prediction task with various combinations of features and comparable systems.

6.1 Quantitative and qualitative analysis

Our seq-to-seq meta-review generation model outperforms all the baseline models in ROUGE precision and BLEU scores and achieves comparable results with the BART-based summarization model on all other scores. However, we argue that this comparison is not entirely fair, as MRG and summarization are not the same task. We also evaluated the previous decision-aware model for meta-review generation, MRG Decision [6], and found that our model outperforms it on all quantitative metrics.

Fig. 5 Confusion matrices of the three prediction sub-tasks

The Root Mean Squared Error (RMSE) for the recommendation score prediction task is 0.76 when we use only the review text; when we add sentence-level sentiment encodings, the RMSE reduces to 0.75 (see Table 3 for examples of sentence-level sentiment encodings). For the confidence score prediction task, using only the review text gives an RMSE of 0.86; incorporating the recommendation and uncertainty scores along with the review text reduces the RMSE to 0.82.

Table 6 Ground truths and automatically generated meta-review for a given paper

From Table 5, we can see that the final decision module also improves by 21% with respect to MRG Decision. In terms of feature combinations, the accuracy of our decision prediction model improves by 15% when we use the recommendation, confidence, hedge (uncertainty) score, sentiment encodings, and the text of the three reviews instead of the review text alone.

The confusion matrices for the three sub-tasks are shown in Fig. 5. It is evident that when we incorporate sentiment encodings for recommendation score prediction, and the predicted recommendation scores along with uncertainty scores for confidence score prediction, the predictions concentrate on classes 3 and 4 (see the confusion matrices on the right side of the top and middle rows). This is expected and aligns with the data distribution, where more than 78% of the data belongs to these two classes, as depicted in Fig. 1. Moreover, for decision prediction, we observe that taking the predicted recommendation score, predicted confidence score, uncertainty score, and sentiment encodings along with the review text results in improved prediction accuracy for the "Accept" class, thus improving the overall accuracy.

Table 6 shows the MRG outputs of the different techniques. We use the pre-trained models for PEGASUS and BART from HuggingFaceFootnote 4 and fine-tune them on our review dataset; our custom architectures, in two different setups, are trained entirely on our dataset. We find that although the PEGASUS-generated meta-review manifests sentences with polarity, the output is not detailed: the significant aspects of concern in the human-generated review are not prominent in the generated meta-review, and the overall polarity and the decision do not match the original meta-review. On the other hand, the output of BART, an extensive language model with 406 million parameters, is detailed; the generated meta-review also manifests polarity and highlights merits and demerits. Our model S-MRG does a reasonable job of capturing the polarity (see Table 6), and the generated meta-review is written in the third person. However, we notice that some irrelevant text, stemming from common primary keywords of other papers, is present in the generated meta-review, which is effectively noise in the output.

Although the Decision-aware MRG [6] model writes the meta-review in the third person (as a meta-reviewer) in coherence with the existing peer reviews, its decision prediction module has an accuracy of only 63%. Our proposed seq-to-seq decision-aware MRG model produces detailed outputs, writes the meta-review in the third person in coherence with the existing peer reviews, and brings out the merits and demerits of the paper; the generated meta-review also manifests polarity. Its decision prediction module has a higher accuracy of 84%, which can be further improved by adding review-paper interaction as an additional information channel to the model. We argue that the decision prediction module plays a key role in helping the model generate meta-reviews with the correct connotation, as meta-reviews are generally written by chairs/editors only after a decision regarding the fate of the manuscript has been made. Hence, the higher the decision accuracy, the higher the chance of generating a better meta-review.

6.2 Error analysis

We perform an initial error analysis on our generated output. The automatically generated meta-reviews sometimes contain repeated text. We also found a few papers for which the ground-truth decision is a reject, yet the generated meta-review recommends accepting the paper. Sometimes meta-reviewers write from outside the context of the reviews, or write one-liners (example: "The work brings little novelty compared to existing literature.") whose connotation does not obviously align with the final (negative) decision; in such cases, our model fails, probably due to the lack of proper context. We plan to perform a more in-depth error analysis in the future.

7 Conclusion

In this preliminary investigation, we propose a new technique for incorporating decision-awareness into the automatic generation of meta-reviews from the peer reviews of manuscripts. We do so by taking into account the various sub-tasks that form an integral part of a human peer-review process. Specifically, we first predict the recommendation scores based on the review texts and their sentiments. We then use the predicted recommendation scores along with uncertainty scores to predict the confidence scores of the respective reviews. Next, we use these scores and other features to predict the final decision on the manuscript. With the incorporation of these intermediate sub-tasks, we obtain an improvement of 21% in the decision prediction task, which is crucial for meta-review generation. Finally, we use the predicted decisions to generate the meta-reviews. Our proposed approach outperforms earlier work and performs comparably with BART, a large, complex neural architecture with 12 encoders and 12 decoders. However, we acknowledge that text summarization alone does not simulate a task as complex as automatically writing meta-reviews. As our immediate next step, we would like to investigate fine-tuning the specific sub-tasks in more depth, use the final-layer representations of the sub-tasks instead of their predictions, and perform a sensitivity analysis of each sub-task on the main task. Additionally, we would like to incorporate more fine-grained decisions, such as strong/weak accept/reject or minor/major revisions, instead of binary decisions. Finally, we would also like to explore a multi-tasking framework for meta-review generation in the future.