Laparoscopic surgery, introduced in the 1980s, is now widely accepted as a mainstream form of minimally invasive surgery (MIS) for many general surgical procedures, including gastrectomy, particularly for early gastric cancer (EGC). Laparoscopic gastrectomy (LG), first reported in 1994 [1], has been rapidly adopted in Asian countries. Meta-analyses have shown that LG is safe and feasible and offers several advantages over open gastrectomy, such as reduced invasiveness, less wound pain, earlier recovery of bowel movements, earlier discharge, and fewer pulmonary complications [2,3,4,5]. In addition, LG and open surgery reportedly have comparable rates of long-term morbidity and mortality in EGC [6, 7].

Although patients benefit from laparoscopic resections, several factors hinder the application of laparoscopic surgery: two-dimensional images, a decreased sense of touch [8], a long learning curve [9, 10] (especially for lymph node dissection), the intricate forceps manipulations required through fixed ports, and the uncomfortable position forced upon surgeons. Robotic systems were developed to address these limitations [11, 12]. Systems such as the da Vinci Surgical System provide three-dimensional views, a tremor filter, and improved dexterity. Since robotic gastrectomy (RG) was first reported by Hashizume et al. [13] in 2003, it has been thought to provide undoubted technical advantages [14]. However, although robotic systems allow precise operating in various fields of MIS, their role in gastric cancer remains controversial [15,16,17].

Recently, several observational clinical studies on this topic have been published, and three updated meta-analyses [18,19,20] showed that, compared with LG, RG was associated with a longer operative time, less estimated blood loss, and fewer complications. Mortality, overall survival (OS), and disease-free survival (DFS) were similar for LG and RG. Nevertheless, as more statistical tests are performed on accumulating data, the likelihood of observing a false-positive or false-negative result increases [21]. Trial sequential analysis (TSA) is an approach that retains the desired risk of random error when conventional significance testing is repeated on accumulating data in cumulative meta-analyses; it also provides the required information size for a meta-analysis [22, 23]. Therefore, we used the TSA method to control the risk of type I error in our meta-analyses.

Methods

The present study was registered in the PROSPERO international prospective register of systematic reviews (https://www.crd.york.ac.uk/PROSPERO/) under registration number CRD42018089637.

Criteria for considering studies for this review

The included studies met the following criteria: (1) prospective observational studies (POSs) and randomized controlled trials (RCTs) comparing RG and LG for gastric cancer; (2) any sample size; and (3) when more than one study reported results from the same patient population, only the most recent study was included.

The exclusion criteria were as follows: (1) studies published as an abstract without the appropriate data or publication of the full paper; (2) studies with considerable overlap between centers or patient cohorts evaluated in the published studies; and (3) case reports, reviews, clinical trial registrations with no results, and retrospective observational studies.

Outcome measures

The following outcomes were used to compare the RG and LG groups in patients with gastric cancer. Primary endpoints were operation time (min), blood loss (ml), hospital stay (days), complications based on the Clavien–Dindo classification, major complications, minor complications, OS, and DFS. Secondary endpoints were time to first flatus (days), retrieved lymph nodes (LN), proximal resection margin (PRM), distal resection margin (DRM), mortality, open conversion, reoperation, and hospital expenses.

Search strategies for identification of studies

The MOOSE (Meta-analysis of Observational Studies in Epidemiology) statement and guidelines were consulted during the design, analysis, and reporting of this meta-analysis. A systematic review of the medical literature was performed with the assistance of a medical librarian to identify all potential abstracts that compared RG with LG in patients with gastric cancer, regardless of publication status or language. Specifically, studies published before October 2017 were searched for in PubMed, Embase, Science Citation Index, Cochrane Library, and the Chinese Biomedical Database (CBM). Relevant studies were identified using the search terms gastric cancer and gastric adenocarcinoma. These results were combined with the terms robotic, laparoscopic, and gastrectomy. The “related article” function in PubMed was used to further identify potential articles that were eligible for inclusion in the meta-analysis. Then, the reference lists of relevant articles were searched manually to identify any other potential papers or electronic links.

Data collection and analysis

Data extraction and assessment of study quality

From the potentially eligible trials, two reviewers independently selected suitable trials on the basis of their titles and abstracts. The retrieved studies were critically appraised by the two review authors for inclusion according to the Newcastle-Ottawa scale (NOS) [24]. Studies that scored 0 in any category were classified as having a high risk of bias, and studies that scored 1 or ≥ 2 in all categories were classified as having a moderate or low risk of bias, respectively. To evaluate the quality of evidence from the pooled results, the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) system was used [25], and a summary table was created using the GRADE profiler software (version 3.6.1). Any disagreement was resolved by consensus discussions with the remaining members of the review team. Subsequently, trial data on the pre-defined endpoints were independently extracted by the two investigators.

Statistical analysis

Analysis was conducted using Review Manager (version 5.2). Effect sizes are presented as odds ratios (ORs) with their 95% confidence intervals (CIs) for dichotomous variables and as mean differences (MDs) for continuous outcomes. For the time-to-event endpoints (OS, DFS), hazard ratios (HRs) with the corresponding 95% CIs were calculated from the available numerical data using the methods reported by Parmar et al. [26] and were combined to assess the summary effects. A spreadsheet developed by Tierney et al. [27] was used to perform the calculations. The I² statistic estimates the percentage of total variation across studies that is due to heterogeneity rather than chance. I² values of 25%, 50%, and 75% were taken to represent low, moderate, and high heterogeneity, respectively. Pooled analyses were conducted using random and fixed effect models with the Mantel–Haenszel method when appropriate. Statistical heterogeneity was investigated using Cochran’s Q test (P < 0.10) and the I² statistic (> 50%). Sensitivity analysis was restricted to studies at low risk of bias. Subgroup analyses were conducted by distal gastrectomy and by country. Potential publication bias was assessed by visually inspecting the funnel plots in Review Manager.
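The heterogeneity statistics described above can be illustrated with a minimal Python sketch. It is not the Review Manager implementation: it assumes generic inverse-variance weighting of mean differences and a DerSimonian-Laird estimate of the between-study variance for the random-effects model, and the function name and the three input studies are hypothetical. The P value of Cochran's Q is omitted because it needs a chi-square distribution that the standard library does not provide.

```python
import math
from statistics import NormalDist


def pool_mean_differences(effects, std_errors):
    """Pool per-study mean differences with inverse-variance weights.

    Returns the fixed-effect estimate, Cochran's Q, I^2 (in percent),
    the DerSimonian-Laird tau^2, and the random-effects estimate with
    its 95% confidence interval.
    """
    k = len(effects)
    weights = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = k - 1

    # I^2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird between-study variance (assumed estimator).
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    re_weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_re = sum(re_weights)
    random_est = sum(w * y for w, y in zip(re_weights, effects)) / sum_re
    random_se = math.sqrt(1.0 / sum_re)
    z = NormalDist().inv_cdf(0.975)

    return {
        "fixed": fixed,
        "q": q,
        "i2": i2,
        "tau2": tau2,
        "random": random_est,
        "random_ci": (random_est - z * random_se, random_est + z * random_se),
    }


# Hypothetical mean differences (min) and standard errors for three studies.
result = pool_mean_differences([10.0, 20.0, 30.0], [5.0, 5.0, 5.0])
print(result)
```

With these hypothetical inputs, I² exceeds the 50% threshold named above, which in the workflow described here would point to the random-effects estimate.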

Trial sequential analysis

Cumulative meta-analyses of trials are at risk of producing random errors because of insufficient data and repetitive testing of the accumulating data; the amount of information required, which is analogous to the sample size of a single optimally powered clinical trial, may not be met [22, 28].

Trial sequential analysis (TSA) was applied to assess the statistical reliability of the data in a cumulative meta-analysis; it controls the alpha and beta values for sparse data and repetitive testing on accumulating data. TSA is a tool for estimating whether the currently available evidence is conclusive.
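How repeated testing can be kept within a fixed overall alpha may be sketched with an alpha-spending function. The text does not state which spending function was used; the Lan-DeMets O'Brien-Fleming type below is an assumption chosen for illustration, and the function name is hypothetical. The sketch shows only the cumulative alpha allowed at a given fraction of the required information, not the monitoring boundaries themselves, which require numerical integration.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def obrien_fleming_alpha_spent(information_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent at a given information fraction.

    Uses the Lan-DeMets O'Brien-Fleming-type spending function
    alpha*(t) = 2 - 2 * Phi(z_{1 - alpha/2} / sqrt(t)), with 0 < t <= 1.
    """
    if not 0.0 < information_fraction <= 1.0:
        raise ValueError("information_fraction must lie in (0, 1]")
    z = _NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _NORMAL.cdf(z / math.sqrt(information_fraction)))


# Very little alpha is available early; the full 5% is reached only
# when the required information size has been accrued (t = 1).
for t in (0.25, 0.50, 0.75, 1.00):
    print(f"information fraction {t:.2f}: alpha spent {obrien_fleming_alpha_spent(t):.5f}")
```

Under this assumed function, early looks at sparse data must clear a much stricter threshold than the conventional P < 0.05, which is why a cumulative Z-curve can cross the traditional boundary without crossing the monitoring boundary.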

Empirical evidence suggests that information size (IS) considerations and adjusted significance thresholds may reduce type I errors (early false-positive findings) caused by imprecision and repeated significance testing in updated meta-analyses [22, 23, 28, 29]. The adjusted required information size (RIS) was calculated using α = 0.05 (two-sided) and β = 0.20 (power 80%), with an empirical mean difference for continuous outcomes and a control group proportion obtained from the results of our meta-analysis for binary outcomes. Whether further evidence from more trials is needed can be decided on the basis of whether the cumulative Z-curve crosses the trial sequential monitoring boundaries (TSMB) or enters the futility zone. If the TSMB is not crossed, further trials are most likely needed. Trial sequential analysis version 0.9 beta (http://www.ctu.dk/tsa) was used for all these analyses [30].
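The required information size can be illustrated with a short Python sketch. It assumes the conventional two-group formula RIS = 4 (z₁₋α/₂ + z₁₋β)² ν / δ², inflated by 1 / (1 − H) for a heterogeneity fraction H; the function names, the control-group risk of 20%, and the heterogeneity value are hypothetical, and the TSA software may differ in detail.

```python
from statistics import NormalDist

_NORMAL = NormalDist()


def required_information_size(variance, delta, alpha=0.05, beta=0.20, heterogeneity=0.0):
    """Total participants needed to detect a difference `delta`.

    variance: outcome variance (continuous) or p*(1 - p*) (binary).
    heterogeneity: fraction in [0, 1) used to inflate the size (assumed
    adjustment; 0 means no inflation).
    """
    if not 0.0 <= heterogeneity < 1.0:
        raise ValueError("heterogeneity must lie in [0, 1)")
    z_alpha = _NORMAL.inv_cdf(1.0 - alpha / 2.0)  # two-sided alpha
    z_beta = _NORMAL.inv_cdf(1.0 - beta)          # power = 1 - beta
    unadjusted = 4.0 * (z_alpha + z_beta) ** 2 * variance / delta ** 2
    return unadjusted / (1.0 - heterogeneity)


def binary_required_information_size(control_risk, relative_risk_reduction, **kwargs):
    """RIS for a binary outcome from a control risk and an anticipated RRR."""
    experimental_risk = control_risk * (1.0 - relative_risk_reduction)
    mean_risk = (control_risk + experimental_risk) / 2.0
    variance = mean_risk * (1.0 - mean_risk)
    delta = control_risk - experimental_risk
    return required_information_size(variance, delta, **kwargs)


# Hypothetical example: 20% control-group complication rate, 20% RRR.
print(round(binary_required_information_size(0.20, 0.20)))
# The same scenario with an assumed heterogeneity fraction of 0.5.
print(round(binary_required_information_size(0.20, 0.20, heterogeneity=0.5)))
```

The inflation step shows why heterogeneous meta-analyses need more participants before the evidence can be called conclusive.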

Results

Selected studies

Through the literature search and selection based on the inclusion criteria, a total of 16 studies were included in the meta-analysis (Fig. 1) [31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46]. All these studies were POSs, with a total of 4576 patients, of which 1517 underwent RG and 3059 underwent LG.

Fig. 1 The literature search and selection

Baseline characteristics

Demographic and clinical characteristics of patients were extracted and are displayed in Table 1. Operative factors and tumor node metastasis (TNM) stages are shown in Table 2. Quality assessment scores of the studies are shown in Table 3; each study had a score of > 6 points. Among the 16 POSs, three enrolled patients from China [37, 38, 41], nine from Korea [32,33,34,35,36, 39, 42,43,44], two from Japan [40, 45], and two from Italy [31, 46]. Five [30, 36, 40, 43, 45, 50–55] were considered at low risk of bias, while the rest were at moderate risk of bias [31, 39, 43, 45, 46].

Table 1 Demographic and clinical characteristics of patients
Table 2 Operative factors and TNM stage
Table 3 Methodological quality

Short-term outcomes

Table 4 shows the results of the meta-analysis for each outcome. Operative time was significantly longer with RG than with LG (MD 57.98 min, P < 0.00001). The pooled results showed a significant reduction (23.71 ml) in intra-operative blood loss in the RG group (P = 0.005). Time to first flatus was slightly shorter with RG than with LG (MD − 0.20 days, P = 0.07). The pooled results showed no significant difference in hospital stay between the treatment groups (MD − 0.49 days, P = 0.06). More lymph nodes were harvested during RG than during LG (MD 1.81, P = 0.05). The pooled results showed no significant between-group difference in PRM or DRM.

Table 4 Summary of effect on clinical outcomes

For the RG group, morbidity rates ranged from 0 to 47.4%, whereas for the LG group, morbidity rates ranged from 4.8 to 38.6%. In the pooled results, overall complications did not differ significantly between RG and LG (OR 1.05, P = 0.65, Fig. 2A). Moreover, no significant difference was found in major or minor complications between RG and LG. The pooled results showed no significant between-group differences in the need for reoperation (OR 1.72, P = 0.11), mortality (OR 1.35, P = 0.56), open conversion (OR 1.58, P = 0.35), PRM (OR 0.34, P = 0.15), or DRM (OR 0.73, P = 0.23).
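The pooled odds ratios above come from the Mantel–Haenszel method named in the statistical analysis. A minimal Python sketch of the fixed-effect Mantel–Haenszel estimator follows; the function name and the two 2 × 2 tables are hypothetical and do not reproduce any included study.

```python
def mantel_haenszel_odds_ratio(tables):
    """Fixed-effect Mantel-Haenszel pooled odds ratio.

    Each table is (a, b, c, d):
      a = events in the RG arm,  b = non-events in the RG arm,
      c = events in the LG arm,  d = non-events in the LG arm.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty table carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these tables")
    return numerator / denominator


# Hypothetical complication counts from two studies.
studies = [
    (10, 90, 20, 80),  # study 1: 10/100 events with RG, 20/100 with LG
    (5, 45, 5, 45),    # study 2: 5/50 events in each arm
]
print(round(mantel_haenszel_odds_ratio(studies), 3))
```

Each study contributes in proportion to its size, so small studies with sparse events do not dominate the pooled estimate.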

Fig. 2 A The pooled results showed no significant decrease in overall complications with RG compared with LG. B, C The pooled results showed no significant difference in overall survival and disease-free survival between the treatment groups

Medical costs were compared in three POSs [41, 42, 44]. Huang et al. [41] showed that the robotic group was associated with higher medical costs than the laparoscopic group (RG $5714.2 ± 1591.7 vs. LG $2915.1 ± 1341.4). Park [42] reported that total medical costs were significantly lower for LG than for RG, with a difference of 4,886,298 KRW (US$ 3909). Meanwhile, Kim [44] showed that patients undergoing RG accrued significantly higher total costs than patients undergoing LG [13,748,422.5 KRW (US$ 13,470) (RG) vs. 9,165,862 KRW (US$ 8980) (LG); P < 0.001].

Long-term outcomes

The pooled results showed no significant difference in OS between the treatment groups (HR = 1.15, 95% CI 0.51–2.59, P = 0.73, Fig. 2B). In addition, no significant difference was observed in DFS between RG and LG (HR = 2.24, 95% CI 0.79–6.35, P = 0.13, Fig. 2C).
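The reported hazard ratios, confidence intervals, and P values can be checked against one another with a short Python sketch. It assumes the intervals were computed on the log scale with a normal approximation; the function name is hypothetical, and this is a consistency check on the pooled figures, not the extraction method of Parmar et al. or Tierney et al.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def p_value_from_hazard_ratio(hr, lower, upper, level=0.95):
    """Recover log(HR), its standard error, and a two-sided P value
    from a hazard ratio and its confidence interval."""
    z_crit = _NORMAL.inv_cdf(1.0 - (1.0 - level) / 2.0)
    log_hr = math.log(hr)
    # On the log scale the interval is symmetric: width = 2 * z_crit * SE.
    std_error = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = log_hr / std_error
    p_value = 2.0 * (1.0 - _NORMAL.cdf(abs(z)))
    return log_hr, std_error, p_value


# Pooled values reported in the text for OS and DFS.
for label, hr, lo, hi in (("OS", 1.15, 0.51, 2.59), ("DFS", 2.24, 0.79, 6.35)):
    log_hr, se, p = p_value_from_hazard_ratio(hr, lo, hi)
    print(f"{label}: log HR = {log_hr:.3f}, SE = {se:.3f}, P = {p:.2f}")
```

Under this assumption, the recovered P values lie close to the reported 0.73 and 0.13; small differences are expected because the inputs are rounded.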

Trial sequential analyses

For hospital stay (Fig. 3A), time to first flatus (Fig. 3D), and overall complications (Fig. 3F), neither the traditional boundary nor the trial sequential monitoring boundary was crossed, suggesting a lack of firm evidence and the need for more studies. For blood loss (Fig. 3B) and operative time (Fig. 3C), the cumulative Z-curve crossed either the traditional boundary or the TSMB, suggesting firm evidence of a difference between the RG and LG groups. A potential false-positive result was found for the number of lymph nodes harvested (Fig. 3E): the cumulative Z-curve crossed the conventional boundary for benefit but did not cross the trial sequential monitoring boundary or the futility boundaries. Therefore, more trials are necessary before drawing a conclusion. The meta-analyses of OS (Fig. 3G) and DFS (Fig. 3H) showed no statistically significant between-group difference; the cumulative Z-curve crossed neither the traditional boundary nor the trial sequential monitoring boundary, and the boundaries for α = 5% and β = 20% were ignored because too little information was available, suggesting a lack of firm evidence.

Fig. 3 Trial sequential analysis (TSA). The adjusted required information size was calculated using α = 0.05 (two-sided), β = 0.20 (power 80%), and an empirical mean difference. For hospital stay (A) and time to first flatus (D), neither the traditional boundary nor the trial sequential monitoring boundary (TSMB) was crossed, suggesting a lack of firm evidence and the need for more studies. For blood loss (B) and operative time (C), the cumulative Z-curve crossed either the traditional boundary or the TSMB, suggesting firm evidence of a difference between the RG and LG groups. E TSA of the number of retrieved LNs. The cumulative Z-curve crossed the conventional boundary for benefit but did not cross the trial sequential monitoring boundary or the futility boundaries. Therefore, more trials are necessary before drawing a conclusion. F For overall complications, neither the traditional boundary nor the trial sequential monitoring boundary was crossed, suggesting a lack of firm evidence and the need for more studies. The meta-analyses of overall survival (G) and disease-free survival (H) did not yield any sign of statistical significance; the cumulative Z-curve crossed neither the traditional boundary nor the trial sequential monitoring boundary, and the boundaries for α = 5% and β = 20% were ignored due to too little information, suggesting a lack of firm evidence

Subgroup analyses

Country-specific subgroup analyses of the number of lymph nodes harvested were conducted. Studies from China and Japan showed that significantly more lymph nodes were harvested with RG than with LG; however, studies from Korea and Italy showed no significant difference between RG and LG (Fig. 4).

Fig. 4 Country-specific subgroup analyses of the number of lymph nodes harvested. Studies from China and Japan showed that significantly more lymph nodes were harvested with RG than with LG; however, studies from Korea and Italy showed no significant difference between RG and LG

To check for further differences between RG and LG, additional subgroup analyses and trial sequential analyses were performed for operation time, blood loss, hospital stay, overall complications, and retrieved lymph nodes, stratified by low risk of bias, distal gastrectomy, and country (Table 5).

Table 5 Summary results of sensitivity and subgroup analysis

GRADE of the outcomes

The GRADE system was used to synthesize and rate the evidence for the outcomes (Table 6). The level of evidence was moderate for overall complications, major complications, minor complications, reoperation, mortality, OS, and DFS, and low for operation time, blood loss, hospital stay, number of lymph nodes harvested, open conversion, PRM, and DRM. The level of evidence was very low for time to first flatus.

Table 6 Strength of evidence for RG in patients with gastric cancer compared with LG

Evaluation of publication bias

Publication bias in this meta-analysis was assessed using a funnel plot of overall complications. The bilaterally symmetrical funnel plot of overall complications indicated a lack of publication bias (Fig. 5).

Fig. 5 The bilaterally symmetrical funnel plot of overall complications indicated a lack of publication bias

Discussion

The da Vinci Surgical System was developed as a robot-assisted surgical system for MIS, and it comprises three components: the surgeon console, the patient-side cart, and the vision system. Some studies reported learning curves for both RG and LG. LG showed a steep learning curve, whereas RG showed a shallower learning curve with better results from the first cases, indicating that robot-assisted surgery is easier to adopt [10, 47]. However, the main international guidelines [48, 49] for the management of gastric cancer do not discuss robotic technology. Therefore, the goal of the present analysis was to gather the available data to examine the actual role of minimally invasive surgery.

This meta-analysis of 16 POSs including 4576 patients with gastric cancer found that RG could be performed safely and effectively and was associated with less blood loss, a shorter time to first post-operative flatus, and higher medical costs. Although RG resulted in prolonged operative times, this did not translate into any increase in post-operative complications, open conversions, or mortality. Furthermore, there was no significant difference between the two groups in terms of the number of retrieved LNs, hospital stay, reoperation, DRM, PRM, OS, or DFS.

Based on the sequential monitoring boundary generated, the current evidence for the potential disadvantage of RG in operative time appeared reliable and conclusive. The longer average operating time of RG has been attributed to operations being conducted very carefully and to surgeons being unfamiliar with the docking procedure [35]. Some studies showed that increased operating time was associated with a higher BMI [50, 51]. It should also be considered that most surgeons had extensive experience with LG but no experience with RG.

Heterogeneity was substantial, and the present TSA of overall complications showed that the cumulative Z-curve crossed neither the traditional boundary nor the trial sequential monitoring boundary and did not reach the required information size. To explore the reason for the heterogeneity, we conducted subgroup analyses restricted to trials at low risk of bias, which showed a trend towards a reduced risk of overall complications in patients receiving RG; more patients need to be studied to demonstrate this potential benefit conclusively. Furthermore, our subgroup analyses suggested a possible beneficial effect of RG in distal gastrectomy; however, the cumulative Z-curve crossed the traditional boundary but not the trial sequential monitoring boundary, which suggested a lack of firm evidence for a 20% relative risk reduction in overall complications when comparing the RG group with the LG group.

Many surgeons used a harmonic scalpel to dissect the lymph nodes and coagulate the vessels; however, the harmonic scalpel does not have seven degrees of freedom. Robotic systems can help surgeons suture intracorporeally owing to the precise 3D view and instruments with seven degrees of freedom. Huang et al. [41] reported that lymphadenectomy is easier with RG than with LG, particularly in the infra-pyloric and supra-pancreatic areas. This agrees with our analysis, in which the TSMB was crossed by the cumulative Z-curve, providing firm evidence for a higher LN retrieval number and shorter hospital stay with RG.

Three included studies reported that medical costs were significantly higher for RG than for LG. One way to justify the additional expense for RG may be that the increase in cost is balanced out by a more favorable learning curve than LG [41].

In 2017, Obama reported a cohort analysis that revealed no statistically significant difference in 5-year OS or DFS (P = 0.4112 and P = 0.8733, respectively): 93.3% and 90.7% after RG and 91.6% and 90.5% after LG, respectively [52]. However, our analyses showed that the cumulative Z-curve crossed neither the traditional boundary nor the trial sequential monitoring boundary, suggesting the lack of firm evidence. Further studies on the long-term oncologic outcomes of robotic gastrectomy are warranted to reach more definitive conclusions.

Our present study has several strengths. The methodology was rigorous, with a comprehensive search to identify the relevant POSs without language limitations. Further, TSA incorporated both the information size and the effect size and it was more conservative and probably more accurate. In the setting of a non-significant result, TSA helped decide whether “more evidence is needed” (when the futility boundary is not crossed), thus reducing uncertainty.

The main objective of this study was to review the published outcome data comparing RG and LG in patients with gastric cancer. Several limitations must be taken into account when applying these results clinically. First, no RCT was included in the meta-analysis, and no information on quality of life was available. Second, the significant heterogeneity and the non-randomized nature of all the studies limit confidence in the findings. Third, different levels of expertise with the intervention may have introduced confounding. Finally, most included studies covered the period before the learning curve of RG had been completed.

In conclusion, the present study demonstrated that RG is as acceptable as LG in terms of short-term and long-term outcomes. RG is a promising approach for the treatment of gastric cancer. TSA demonstrated that further studies are not needed to evaluate the operative time and blood loss differences between these techniques.