Introduction

With the continuous development of medical information technology, the information platforms that support regional medical integration have gradually improved. Such a platform stores the past medical information, examination results and electronic medical records (EMR) of all patients in the region. These data conceal the health characteristics of the regional population and the developmental patterns of the corresponding diseases [1, 2], and they need to be analyzed with data mining technology. Under these conditions, the widespread use of clinical pathways can not only reduce medical costs, but also help improve the quality of medical care in regions with scarce medical resources [3].

The essence of a clinical pathway is a standard for the diagnosis and treatment of a single disease, which includes criteria for patient admission, diagnosis, treatment, care, and fees. Standardized diagnosis and treatment for single diseases eliminates redundant examinations and medications, standardizes doctors’ operations, shortens hospital stays, and improves bed turnover, thereby ensuring medical quality while reducing medical costs [4].

In order to improve the quality of patients’ medical care and reduce medical expenses, the implementation of clinical pathways requires a strict management system. A clinical pathway standard is only a guiding framework that needs to be adapted to local conditions when used in a specific area. At present, the research and application of clinical pathways in the United Kingdom, Japan, Singapore and other countries have entered a relatively mature stage. Data mining technology has indicated a new direction for clinical pathway development [5]. In 2004, Mary K et al. proposed applying data mining technology to medical data analysis, and enumerated some data mining strategies and algorithms [6]. Based on the state of disease progression, Arianna Dagliati et al. applied time-series-based pattern mining to the diagnosis of patients with type-2 diabetes [7]. Jochen et al. presented a process mining method based on clinical diagnosis and state log data, which had inherent advantages for the analysis of unstructured data [8]. These studies also laid a foundation for the development of clinical pathways in China. In 2013, Iwata et al. proposed a data-oriented method for maintaining existing clinical pathways and for developing clinical care plans for new diseases based on nursing records. The experimental evaluation on 10 diseases confirmed that the method was beneficial to the improvement of nursing quality [9].

Compared with the large-scale, multi-level deployment of clinical pathways from hospitals to communities in other countries, China started late, and the application of medical information systems is still largely limited to large cities and large hospitals. However, with the gradual deepening of the application of clinical pathways, research on clinical medicine, public health and prevention, and the integration of Chinese and Western medicine has been promoted. In particular, hospital management and medical quality have become the focus of research, the scope of big data and data mining applications is expanding, and the concept of “Internet + medical” closely integrates medicine with information technology. The application basis of the clinical pathway is the process control of the hospital information management system, so the use of computer-related technology has been proposed to solve the difficulties of clinical path application.

In the past ten years, data mining methods such as association analysis, clustering and decision trees have been applied to decision-making for clinical path optimization. Tan Jian et al. applied the state machine workflow model to replace the traditional sequential workflow for acute exacerbation of chronic obstructive pulmonary disease, which improved the adaptability and management efficiency of the clinical pathway [10]. Wang Chao et al. discussed the feasibility of extending single-disease pricing to other applications in the medical field by analyzing clustering algorithms [11]. Cao Shuzhen used data mining and On-Line Analytical Processing (OLAP) to analyze the factors affecting single-patient charges [12]. Xu Jing and Liu Zixian proposed a single-disease cost estimation model based on an HIS data warehouse combined with rough sets and a support vector machine, which had higher prediction accuracy than a support vector machine alone [13]. These studies have promoted the development of clinical pathways in China.

However, apart from external causes such as the slow informatization of medical institutions, the clinical path still faces many problems and challenges in its specific implementation. The most prominent difficulty is the continuous improvement and optimization of clinical pathways. Complex environmental factors lead to constant variability of the disease itself, and individual differences within regional populations also cause variations in the actual implementation of clinical pathways. In order to solve these problems, the medical and information industries at home and abroad have conducted a great deal of research. With the popularization of the information-based clinical path, how to design and maintain the clinical path in a timely and efficient manner has become the main difficulty and research direction of clinical path development.

This paper examined the clinical path optimization problem in regional medical integration. Aiming at the optimization requirements of widely applicable, universal clinical path standards, especially for chronic diseases, a data-mining-based clinical path optimization strategy built on the division of diagnosis and treatment units was proposed. Taking the clinical path of diabetes with hypertension as an example, the experimental results showed that the clinical path optimized by the strategy was better than the original standard, while the average length of hospital stay and the average hospitalization cost were basically unchanged. The object of optimization was the clinical path diagnosis and treatment criteria, and the selected clinical pathways were mostly those of chronic and conventional diseases. The purpose of the paper was to expand the scope of application of the clinical path standard so that it covered a larger patient group in the region. Against the background of regional medical integration and massive medical data, data mining technology was used to reveal the inherent rules of clinical treatment through clinical path optimization strategies, and finally to optimize the clinical path, so as to seek a balance between medical expenses and medical quality.

Related works

This section mainly introduced the basic concepts of the clinical pathway and the related optimization theories, the difficulties and challenges of clinical path optimization, the medical data, and the data mining techniques and algorithms adopted.

Basic concepts

The development of the clinical pathway must follow the medical principles, refine the operating procedures and drug lists, and adjust itself according to local characteristics. Clinical pathway optimization could avoid the recurrence of negative variation by scientifically analyzing clinical history and integrating efficient treatments.

The development and implementation of a clinical pathway involved multiple disciplines and required multiple departments to collaborate. However, the selection and formulation of clinical pathways followed common management principles. Diseases selected for clinical pathway management should be common and frequently occurring, with a clear treatment program and little variation in the process of diagnosis and treatment. Meanwhile, all of these principles had encouraged the informatization of clinical pathways [14]. There were two types of variation in a clinical pathway: positive and negative. Positive variation could improve treatment effects, shorten hospital stays or reduce medical costs, whereas negative variation had the opposite effect. The analysis of the factors behind positive variation could serve as the focus of improving the clinical pathway, and the analysis of the causes of negative variation allowed the clinical pathway to be corrected in time to avoid recurrence. Variation was therefore at the core of the continuous improvement and optimization of clinical pathways.

As shown in Fig. 1, the clinical pathway could be divided into three parts according to the diagnosis and treatment process. The first step was to determine the type of disease and its ICD code according to the patient’s condition and relevant examination data after admission, and then to recommend the corresponding clinical pathway treatment scheme based on the ICD code. The second step was to perform the treatment in accordance with the clinical pathway standards, monitor the treatment effect during diagnosis and treatment, and make appropriate adjustments according to the patient’s situation. The third step was to check whether a variation occurred in the patient’s situation while the clinical pathway scheme was in use. If there was a variation, the patient should exit the path and adopt the traditional diagnosis and treatment methods. Otherwise, the patient would recover smoothly after completing the treatment.

Fig. 1 Application process for clinical pathway

Clinical pathway optimization

Intelligent optimization decisions depended on various optimization algorithms, such as artificial neural networks, machine learning algorithms, and mixed optimization strategies. An optimization algorithm was a search process or rule that solved a problem so as to satisfy a given user requirement [15]. The most common decision-making method in clinical practice was the decision tree algorithm. The core of the decision tree algorithm was to use a tree structure to establish a decision model based on the data attributes, which could efficiently classify unknown data to solve classification and regression problems. Decision trees were simple but widely used classifiers whose technical difficulty lay in selecting a good branch value. Common algorithms included CART (Classification and Regression Tree) [16], random forests, and so on. These algorithms could obtain an optimal solution through a large amount of calculation and analysis, but the particularity of medical data limited the accuracy of the results.
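The difficulty of selecting a good branch value can be illustrated with a minimal sketch. It assumes a CART-style criterion (weighted Gini impurity) applied to a single numeric attribute; the function names `gini` and `best_split` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one numeric attribute."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: length of stay in days and whether a variation occurred.
stay_days = [3, 4, 5, 9, 10, 12]
variation = ["no", "no", "no", "yes", "yes", "yes"]
threshold, score = best_split(stay_days, variation)
```

In this toy case the candidate threshold midway between 5 and 9 days separates the two classes completely, so its weighted impurity is zero. Real medical attributes are rarely this clean, which is one reason the choice of branch value is hard.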

The difficulties of clinical path optimization

At present, most of the standard clinical pathway treatment schemes used by hospitals have been developed through discussions at multi-department expert meetings [17]. This traditional method itself had some problems. For example, the response to variation was not timely, and continuous adjustment required a lot of manpower and took a long time. In the context of regional diagnosis and treatment, especially for routine diseases and chronic diseases, the standard clinical pathway diagnosis and treatment scheme also had the following difficulties.

1) The classification of diseases in the clinical pathway was unclear, and the diagnosis and treatment plan contained a large number of redundant clinical items.

2) The regional medical information platform connected information systems with different functions in multiple medical institutions. Since these systems were mostly designed and maintained by third-party institutions, there were no unified standards.

3) With the accumulation of regional medical information data over time, higher requirements were placed on storage methods and security, and the requirements for centralized data analysis of big-data materials were also high.

The optimization scheme of clinical path

In this section, a clinical path optimization scheme based on medical data mining was proposed, and the structure and core of the strategy were introduced in detail. First, the characteristics and differences of the medical data used in clinical path optimization, as well as the problems and challenges faced by clinical data mining, were analyzed. Second, the structural relationship of the clinical path optimization strategy based on data mining was summarized. The key optimization factors of the clinical pathway diagnosis and treatment units were then analyzed from local to global.

The characteristics of data source

The data needed for data mining came from the information software commonly used by hospitals, including the Hospital Information System (HIS), Electronic Medical Record (EMR), Laboratory Information Management System (LIMS), and Picture Archiving and Communication Systems (PACS). Some medical institutions and researchers actively explored electronic construction schemes for clinical pathways on the basis of the hospital’s own medical information system, which not only saved development cost and reduced the amount of software, but also provided a large amount of basic data for the initial construction and later optimization of the clinical pathway. The data in these systems recorded the clinical pathways derived from actual clinical practice, that is, the universal treatment modes hidden in big data. According to these characteristics, the main data sources of clinical path optimization were the HIS data source and the EMR data source.

Medical information, such as digital, image and text, covered all data in the medical procedures and the activities of medical institutions that included clinical medical data and hospital management information [17]. The finiteness and incompleteness of medical records also demonstrated that it was impossible to fully grasp disease information, which made the medical data reflect the characteristics of lack of objectivity and unclear description. Massive raw medical information contained more uncertain, incomplete and noisy information. Therefore, the data must be preprocessed to transform medical information into a suitable form of processing before mining medical data.

Weighted voting clustering integration diagnosis based on clustering accuracy

Traditional voting methods had been widely used for classification integration. To use the voting method in clustering, the base clustering results first needed to be labeled. It was assumed that multiple base clustering results were obtained by the K-means algorithm, each of which divided the data into K clusters [18]. One of them was chosen as the benchmark, and the clusters in the other base clustering results that corresponded to the same benchmark cluster were marked with the same label. The specific marking method was described as follows.

(1) Obtaining two cluster label vectors for the medical data, Ua = [Ua(1), Ua(2), ⋯, Ua(k)]T and Ub = [Ub(1), Ub(2), ⋯, Ub(k)]T, from the base clustering results;

(2) Combining all data in Ua and Ub into a matrix;

(3) Outputting the maximum value in the matrix and deleting the elements in the row and column where the maximum value is located;

(4) Repeating steps (1), (2) and (3) until there were no elements in the matrix.
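The marking method above can be sketched as follows. The sketch assumes that the matrix in step (2) is the K × K overlap matrix, whose entry (i, j) counts the items labeled i in the benchmark and j in the other clustering; the paper does not state this explicitly. The function name `align_labels` and the sample labels are hypothetical.

```python
def align_labels(ref, other, k):
    """Greedily relabel `other` so that its clusters match the labels used in `ref`."""
    # overlap[i][j] = number of items labeled i in ref and j in other (assumed matrix)
    overlap = [[0] * k for _ in range(k)]
    for a, b in zip(ref, other):
        overlap[a][b] += 1
    free_rows, free_cols = set(range(k)), set(range(k))
    mapping = {}
    while free_rows and free_cols:
        # take the largest remaining cell; ties go to the lowest (row, column)
        i, j = max(
            ((r, c) for r in sorted(free_rows) for c in sorted(free_cols)),
            key=lambda rc: overlap[rc[0]][rc[1]],
        )
        mapping[j] = i          # cluster j of `other` corresponds to benchmark cluster i
        free_rows.discard(i)    # "delete" the row and the column of the maximum
        free_cols.discard(j)
    return [mapping[b] for b in other]

# Hypothetical labels for six clinical behaviors under two base clusterings.
benchmark = [0, 0, 1, 1, 2, 2]
second = [2, 2, 0, 0, 1, 0]
aligned = align_labels(benchmark, second, 3)
```

After alignment, the second clustering uses the benchmark's label names, so the two results can be compared or voted on item by item.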

Although randomly selecting the benchmark clustering result improved the marking efficiency, it could greatly affect the quality of the clustering results. Therefore, weighted voting based on feature relation was applied to cluster integration.

Diagnosis and treatment division of clustering accuracy

The goal of the paper was to optimize the clinical pathway for diagnosis and treatment, which was the division of diagnosis and treatment units to solve the regional issues. Therefore, the closer the division rule was to the standard clinical pathway for the diagnosis and treatment units, the better the late-stage optimization effect would be.

Assume that the clustering result generated for a single patient was C = [C1, C2, ⋯, Ck], which contained K diagnosis and treatment units, and that the standard clinical pathway units were N = [N1, N2, ⋯, Nk]. Let n(Nj, Ci) denote the number of clinical behaviors in unit Ci that also belonged to standard unit Nj, so that max{n(Nj, Ci)} was the largest number of clinical behaviors that Ci shared with any single standard unit. If unit Ci contained n(Ci) clinical behaviors, the clustering accuracy of unit Ci was defined as shown in eq. (1).

$$ \phi \left({C}_i\right)=\frac{\max \left\{n\left({N}_j,{C}_i\right)\right\}}{n\left({C}_i\right)} $$
(1)

The clustering accuracy ϕ of all clinical behavior classification results for a single patient was the weighted average of the clustering accuracy for all independent diagnosis and treatment units, which was defined as shown in eq. (2).

$$ \phi =\frac{1}{\sum_{i=1}^kn\left({C}_i\right)}{\sum}_{i=1}^kn\left({C}_i\right)\phi \left({C}_i\right) $$
(2)
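Equations (1) and (2) can be checked numerically with a short sketch. It assumes that each unit is represented as a set of clinical behavior codes and that n(Nj, Ci) is the size of the intersection of Nj and Ci; the function names and the codes I1 to I5 are hypothetical.

```python
def unit_accuracy(cluster, standard_units):
    """Eq. (1): phi(C_i) = max_j n(N_j, C_i) / n(C_i)."""
    cluster = set(cluster)
    best_overlap = max(len(cluster & set(unit)) for unit in standard_units)
    return best_overlap / len(cluster)

def patient_accuracy(clusters, standard_units):
    """Eq. (2): average of phi(C_i) weighted by the unit sizes n(C_i)."""
    total = sum(len(set(c)) for c in clusters)
    weighted = sum(len(set(c)) * unit_accuracy(c, standard_units) for c in clusters)
    return weighted / total

# Hypothetical standard units N and one patient's clustered units C.
standard = [{"I1", "I2", "I3"}, {"I4", "I5"}]
clusters = [{"I1", "I2", "I4"}, {"I3", "I5"}]
phi_units = [unit_accuracy(c, standard) for c in clusters]   # 2/3 and 1/2
phi = patient_accuracy(clusters, standard)                    # (3*(2/3) + 2*(1/2)) / 5
```

Because each term n(Ci)ϕ(Ci) equals the maximum overlap of Ci, ϕ is also the fraction of the patient's clinical behaviors that fall in the best-matching standard unit of their cluster.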

Therefore, the weighted voting clustering integration diagnosis and treatment unit process based on clustering accuracy was written as follows.

(1) Using the K-means algorithm to divide the clinical behaviors of each patient into K diagnosis and treatment units, forming the base clustering result C = [C1, C2, ⋯, Ck];

(2) Calculating the clustering accuracy ϕ of each patient by comparison with the standard clinical pathway diagnosis and treatment units;

(3) Taking the diagnosis and treatment unit division of the patient with the highest clustering accuracy as the benchmark and integrating the results by weighted voting;

(4) Dividing all clinical behaviors of all patients into K diagnosis and treatment units according to the unified standard to complete the division of diagnosis and treatment units.
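Steps (3) and (4) can be sketched under explicit assumptions, since the paper does not give the voting formula. The sketch assumes that every patient's unit labels have already been aligned to the benchmark, that each patient's vote is weighted by that patient's clustering accuracy ϕ, and that a clinical behavior is assigned to the unit with the largest accumulated weight. The function name `weighted_vote` and the sample data are hypothetical.

```python
from collections import defaultdict

def weighted_vote(aligned_results, weights):
    """Assign each clinical behavior to the unit with the largest total weight.

    aligned_results: list of {behavior_code: unit_label}, one dict per patient
    weights: clustering accuracy phi of each patient, in the same order
    """
    scores = defaultdict(lambda: defaultdict(float))
    for labels, w in zip(aligned_results, weights):
        for behavior, unit in labels.items():
            scores[behavior][unit] += w
    # ties are broken in favour of the smaller unit label
    return {
        behavior: min(units, key=lambda u: (-units[u], u))
        for behavior, units in scores.items()
    }

# Hypothetical aligned divisions of three patients and their accuracies.
results = [
    {"I1": 0, "I2": 0, "I3": 1},
    {"I1": 0, "I2": 1, "I3": 1},
    {"I2": 1, "I3": 0},
]
phis = [0.9, 0.6, 0.5]
division = weighted_vote(results, phis)
```

Here behavior I2 is placed in unit 1 because the two lower-accuracy patients together outweigh the single high-accuracy patient (0.6 + 0.5 against 0.9). Other weighting schemes would be equally compatible with the text.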

Clinical pathway optimization method for diagnosis and treatment unit based on association analysis

The clinical behaviors within the same treatment unit obtained by the preceding cluster analysis had a certain degree of association, but some of them were adopted by doctors in response to individual differences between patients. These clinical behaviors were not applicable to most patients and should not be assigned to the clinical pathway, so the original standard clinical pathway units could not be directly replaced with the divided diagnosis and treatment units [18].

The divided diagnosis and treatment units were compared with the original standard clinical pathway diagnosis and treatment units, and the clinical behaviors that differed were selected as the items to be optimized [19]. Then, according to the actual diagnosis and treatment data of the patients, the classic Apriori algorithm and the FP-growth algorithm were used to mine the association rules in the clinical behavior data of all patients and to calculate their confidence and support. The mining of association rules of clinical behavior could be divided into two parts:

(1) According to the number of clinical behaviors, the divided diagnosis and treatment units were successively compared with the standard clinical pathway units, and were matched and numbered one by one based on the principle of maximum similarity. The clinical behaviors contained in both units of a matched pair were extracted to form an independent small set αi (i = 1, 2, ⋯, k), and the remaining differing clinical behaviors were treated as pending optimization items and uniformly re-coded as βij (i = 1, 2, ⋯, n; j = 1, 2, ⋯, k);

(2) The re-coded clinical behavior data of each patient were used as the transaction database D, and the minimum confidence and minimum support were set before running the association rule algorithm, so as to obtain the association rules between the sets and the clinical behaviors.
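The re-coding and the two rule measures can be sketched as follows. Two assumptions are made that the paper does not state: matched unit pairs are supplied in the same order, and a patient's transaction contains the item for αi only if the patient received every clinical behavior in αi. The helper names and the behavior codes are hypothetical.

```python
def recode(divided_units, standard_units):
    """Split each matched pair of units into shared behaviors (alpha) and pending ones (beta)."""
    alpha = [set(d) & set(s) for d, s in zip(divided_units, standard_units)]
    beta = [set(d) ^ set(s) for d, s in zip(divided_units, standard_units)]
    return alpha, beta

def to_transaction(patient_behaviors, alpha, beta):
    """Build one transaction: alpha items (assumed all-or-nothing) plus pending behaviors."""
    behaviors = set(patient_behaviors)
    items = {f"alpha{i + 1}" for i, a in enumerate(alpha) if a and a <= behaviors}
    pending = set().union(*beta)
    return items | (behaviors & pending)

def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical matched units and patient records.
divided = [{"I1", "I2", "I6"}, {"I4", "I5"}]
standard = [{"I1", "I2", "I3"}, {"I4", "I5", "I7"}]
alpha, beta = recode(divided, standard)
patients = [
    ["I1", "I2", "I6", "I4", "I5"],
    ["I1", "I2", "I6", "I4"],
    ["I1", "I2", "I3", "I4", "I5", "I7"],
    ["I1", "I2", "I6", "I5", "I4"],
]
transactions = [to_transaction(p, alpha, beta) for p in patients]
```

With these records the rule alpha1 → I6 has support 0.75 and confidence 0.75, which would make I6 a candidate for inclusion in the first unit if the thresholds were set below those values.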

Because association rule algorithms differ from one another, different algorithms might behave differently when mining clinical behavior data. Both the Apriori algorithm and the FP-growth algorithm were therefore used in the paper to mine the association rules of the clinical behavior data [20].

A. Clinical behavior data processing based on the Apriori algorithm

(1) Scanning the clinical behaviors of all patients in the clinical behavior database, with each independent set αi (i = 1, 2, ⋯, k) treated as a single item; the independent sets and the pending optimization items were then counted and recorded as the candidate set X1;

(2) Setting the minimum support threshold min_supp and the minimum confidence threshold min_conf; the candidate items whose support reached min_supp formed the frequent 1-item set L1;

(3) Generating the candidate set X2 from the frequent 1-item set L1, then scanning all the clinical behavior data in the database again and counting each candidate; the frequent 2-item set L2 was obtained after judging against the threshold;

(4) Iterating the above steps continuously, and finally mining the association rules that satisfied min_conf from the frequent item sets.
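The Apriori procedure can be illustrated with a compact, unoptimized sketch. It follows the textbook form of the algorithm, in which min_supp filters the item sets and min_conf filters the rules; it is not the paper's implementation, and the function names and transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_supp):
    """Return {frozenset(itemset): support} for every frequent item set."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_supp}

    level = frequent({frozenset([i]) for t in transactions for i in t})  # L1
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # join step: unions of two frequent (k-1)-item sets that contain exactly k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)  # Lk
        result.update(level)
        k += 1
    return result

def rules(frequent_sets, min_conf):
    """Return (antecedent, consequent, confidence) for every rule reaching min_conf."""
    out = []
    for itemset, supp in frequent_sets.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent_sets[ante]
                if conf >= min_conf:
                    out.append((ante, itemset - ante, conf))
    return out

# Hypothetical re-coded transactions (alpha items plus pending behaviors).
db = [
    {"alpha1", "alpha2", "I6"},
    {"alpha1", "I6"},
    {"alpha1", "alpha2", "I3", "I7"},
    {"alpha1", "alpha2", "I6"},
]
freq = apriori(db, min_supp=0.5)
strong = rules(freq, min_conf=0.9)
```

Each pass over `transactions` inside `frequent` corresponds to one database scan, which is why Apriori becomes expensive when many candidate levels are needed.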

B. Clinical behavior data processing based on the FP-growth algorithm

(1) Setting the minimum support threshold min_supp and the minimum confidence threshold min_conf, the clinical behaviors of all patients in the clinical behavior database were scanned, with the independent sets αi (i = 1, 2, ⋯, k) regarded as K independent items, and the independent sets and the pending optimization items were counted. The FP tree was established through two scans of the transaction dataset, which removed the infrequent items while retaining the main information in the dataset;

(2) For each node in the FP tree, the conditional pattern base was extracted and a conditional FP tree was built using the same FP tree generation method; frequent item sets were then mined recursively until the tree contained only a single path.
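The two FP-growth steps can be sketched in a simplified form. The sketch uses absolute counts instead of relative support, rebuilds a small tree for every conditional pattern base, and omits the single-path shortcut mentioned in step (2); the class and function names and the transactions are hypothetical.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, a link to its parent, a count and its children."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_tree(weighted_transactions, min_count):
    """Scan 1 counts the items; scan 2 inserts each transaction in frequency order."""
    counts = defaultdict(int)
    for items, w in weighted_transactions:
        for i in items:
            counts[i] += w
    keep = {i for i, c in counts.items() if c >= min_count}  # drop infrequent items
    root, header = Node(None, None), defaultdict(list)
    for items, w in weighted_transactions:
        ordered = sorted((i for i in items if i in keep), key=lambda i: (-counts[i], i))
        node = root
        for i in ordered:
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += w
    return header, counts, keep

def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Return {frozenset(itemset): count} by mining conditional pattern bases recursively."""
    header, counts, keep = build_tree(weighted_transactions, min_count)
    result = {}
    for item in keep:
        itemset = suffix | {item}
        result[itemset] = counts[item]
        base = []  # conditional pattern base: prefix path of every node holding `item`
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, node.count))
        result.update(fp_growth(base, min_count, itemset))
    return result

# Hypothetical re-coded transactions, each with weight 1.
db = [
    {"alpha1", "alpha2", "I6"},
    {"alpha1", "I6"},
    {"alpha1", "alpha2", "I3", "I7"},
    {"alpha1", "alpha2", "I6"},
]
patterns = fp_growth([(t, 1) for t in db], min_count=2)
```

Unlike Apriori, the original database is scanned only twice, and all later work happens on the compressed tree and its conditional pattern bases.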

Experiment results and analysis

In this section, the clinical pathway of diabetes was taken as an example to verify the feasibility of the proposed strategy through experiments, and the performance of the optimized clinical pathway was evaluated through multiple evaluation indexes.

The data obtained from the hospital inevitably contained defects and noise, such as missing values in the data tables. In order to reduce inconsistencies in the data and improve the efficiency of data mining, it was necessary to preprocess the original data. The preprocessed data would be more consistent with the data type requirements of data mining technology, and would also improve the accuracy of data mining.

Data pre-processing

Two data mining methods, clustering and association analysis, were adopted to verify the feasibility of the proposed strategy. Therefore, the data type requirements of the two mining methods should be satisfied as far as possible in the data preprocessing. First, the unique codes of the different diagnosis and treatment items, such as I1, I2, ⋯, In, were identified in the doctor’s advice information. Second, owing to the influence of the distance definition in cluster mining, the time fields of the doctor’s advice information were normalized: the specific time data (including opening time, stop time, use time, etc.) were mapped onto the individual’s overall hospitalization time, so that the corresponding values lay in the interval [0, 1]. Finally, each non-numeric field in the doctor’s advice information was quantified.

Rules for the normalization processing of behavioral data

The normalization of behavioral data included data type conversion, attribute quantification and partial data reduction. After the data were cleaned, the original data still contained a large amount of time-type data, which were converted into integer numbers of days in this experiment. In order to meet the data type requirements of the data mining methods, the attribute data extracted by classification were further converted to numerical values. The magnitudes of these values only distinguished different attributes and had no practical significance. At the same time, the time data in the doctor’s advice information were normalized in order to link the use time of each doctor’s order with the patient’s hospitalization time and to reduce the complexity of the actual values in data mining. The specific rules were defined as follows:

Assuming that for patient xi (i = 1, 2, ⋯, n) the duration of a single hospitalization was D days, the doctor’s order fell on day X of the entire hospitalization stage, and the doctor’s order was executed for Y days, formulas (3) and (4) defined the duration coefficient w1 and the time-length coefficient w2 of each patient.

$$ {w}_{1i}\left(i=1,2,\cdots, n\right)=\frac{\mathrm{X}}{D} $$
(3)
$$ {w}_{2i}\left(i=1,2,\cdots, n\right)=\frac{\mathrm{Y}}{D} $$
(4)
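Formulas (3) and (4) amount to two divisions, as the following sketch shows. The function name `time_coefficients` and the sample values are hypothetical, and the check on D is an added safeguard rather than part of the paper's rules.

```python
def time_coefficients(order_day, execution_days, stay_days):
    """Return (w1, w2) = (X / D, Y / D) for one doctor's order."""
    if stay_days <= 0:
        raise ValueError("hospitalization duration D must be positive")
    return order_day / stay_days, execution_days / stay_days

# Hypothetical order: day X = 3 of a D = 10 day stay, executed for Y = 5 days.
w1, w2 = time_coefficients(order_day=3, execution_days=5, stay_days=10)
```

Since X and Y cannot exceed D, both coefficients lie in [0, 1], which keeps time fields on the same scale as the other normalized attributes when distances are computed during clustering.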

Evaluation indicators

This experiment evaluated the optimization effect of the clinical pathway from two aspects: the application rate and cure rate were selected as the evaluation indexes of medical quality, and the average length of hospital stay and average medical cost were chosen as the evaluation indexes of medical cost. The so-called cured patients included not only patients whose discharge records clearly stated a “cure” condition, but also patients who had been hospitalized for more than 50 days with the same diagnosis.

Due to the medical industry standards and related system requirements, the optimized effect cannot be directly verified by experiments, but the optimized result could be used as the predicted value of patients in the later stage and compared with the actual use of patients to simulate the effect of clinical pathway optimization.

The specific evaluation process for the clinical pathway optimization effect is as follows:

(1) Determining the patient population that used the optimized clinical pathway: the diagnosis and treatment data of all patients diagnosed with diabetes and hypertension in a certain period were selected, and the patients whose diagnosis and treatment data covered the optimized clinical path diagnosis and treatment plan were screened out; a patient whose data covered the plan was assumed to have used the optimized clinical path;

(2) Collecting statistics on the number of patients, length of stay, medical expenses and discharge status before and after the optimization of the clinical pathway respectively;

(3) Calculating the application rate, cure rate, average length of hospital stay and average medical cost before and after clinical pathway optimization respectively.
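The evaluation process can be sketched as follows. The paper does not give formulas for the four indexes, so the sketch assumes that the applicable (application) rate is the share of all selected patients whose records cover the pathway plan, and that the cure rate and the two averages are computed over those covered patients. The `Stay` record, the function name `evaluate` and the sample data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Stay:
    behaviors: frozenset   # clinical behavior codes recorded for the patient
    days: int              # length of hospital stay
    cost: float            # medical expenses
    cured: bool            # discharge status counted as "cured"

def evaluate(stays, pathway):
    """Compute the four indexes for one pathway plan (a set of behavior codes)."""
    covered = [s for s in stays if pathway <= s.behaviors]  # assumed pathway users
    n = len(covered)
    if n == 0:
        raise ValueError("no patient record covers the pathway plan")
    return {
        "application_rate": n / len(stays),
        "cure_rate": sum(s.cured for s in covered) / n,
        "avg_days": sum(s.days for s in covered) / n,
        "avg_cost": sum(s.cost for s in covered) / n,
    }

# Hypothetical records; the numbers are illustrative only.
stays = [
    Stay(frozenset({"I1", "I2", "I3"}), 10, 5000.0, True),
    Stay(frozenset({"I1", "I2"}), 8, 4000.0, True),
    Stay(frozenset({"I1", "I4"}), 12, 6000.0, False),
    Stay(frozenset({"I1", "I2", "I5"}), 12, 6000.0, False),
]
indexes = evaluate(stays, pathway=frozenset({"I1", "I2"}))
```

Calling `evaluate` once with the original plan and once with the optimized plan on the same records yields the before-and-after comparison described in step (3).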

Result analysis

This experiment selected the actual diagnosis and treatment data of the endocrinology department, and analyzed the effect before and after the clinical pathway optimization according to the evaluation process described above. The specific evaluation indexes are shown in Table 1.

Table 1 Evaluation indexes before and after clinical path optimization

Under the condition that the cure rate was guaranteed, the application rate of the optimized clinical pathway was higher than that of the original clinical pathway. Therefore, the medical quality of the optimized clinical pathway had been improved.

The average length of stay in the optimized clinical pathway was slightly shorter, but the average medical cost increased, which indicated that the optimization of some clinical behaviors was conducive to enhancing the treatment effect at the price of a corresponding increase in medical cost. However, the rate of increase in average medical cost was not high, and medical cost was increasingly affected by objective conditions. Thus, the effect of the clinical path optimization was comprehensively evaluated by experiments. From the aspects of improving the quality of medical care and controlling its cost, four evaluation indicators for clinical pathway optimization were used: the application rate, cure rate, average length of stay, and average medical cost. The experiments showed that, with the optimization strategy proposed in the paper, clinical path diagnosis and treatment achieved an enhancement in medical quality while the medical cost remained basically the same. Namely, with the cure rate remaining at 88.1%, the application rate increased from 69.1% to 73.7%.

Therefore, based on the diagnosis and treatment data of patients with diabetes and hypertension, all parts of the optimization strategy were verified in the experiment, and the clinical pathway diagnosis and treatment scheme obtained through the optimization strategy was confirmed to have a wider utilization rate and a better treatment effect. First, the principles and standards of data pretreatment were explained in detail, and the characteristics and application conditions of the two schemes for dividing diagnosis and treatment units were interpreted by experiments. Second, it was concluded that the FP-growth algorithm was more efficient in association analysis, and it was verified that the number of concurrent diagnosis and treatment units was smaller. Finally, four evaluation indexes and specific operations were adopted from the aspects of guaranteeing medical quality and controlling medical cost to evaluate the effect of the optimized clinical pathway. Through data analysis, it was affirmed that the clinical pathway optimization proposed in the paper had the advantage of enhancing medical quality.

Conclusions

The purpose of our proposed method was to expand the scope of application of the standard clinical pathway diagnosis and treatment program in the context of regional medical integration and large-scale medical data, to cover a larger group of patients within the region, and to enhance the quality of medical care in community hospitals and township hospitals, so that the medical resources of hospitals at all levels were fully utilized and the total cost of medical care for all people was reduced. This study focused on the optimization needs of general clinical pathway standards that applied to a wide range of areas, particularly for chronic diseases, in response to the regional medical environment. The object of optimization was the content of the clinical path diagnosis and treatment standards, and the selected clinical pathways were those of chronic and conventional diseases. Using data mining technology, a clinical path optimization strategy based on the division of diagnosis and treatment units was proposed. Moreover, the specific implementation of this optimization strategy was explained by using clustering and association analysis methods. Finally, the case of diabetes with hypertension was used as an example to test the performance of the proposed optimization strategy. The experimental results showed that the optimized clinical pathway obtained with the proposed strategy was superior to the original standard while keeping the average length of stay and the average hospitalization cost basically unchanged.