Introduction

The US Veterans Health Administration (VHA) has implemented multiple effective treatments for posttraumatic stress disorder (PTSD), including two specific psychotherapy protocols, cognitive processing therapy (CPT) and prolonged exposure (PE; Karlin et al. 2010). Cognitive processing therapy comprises 12 weekly 60-min sessions of cognitive therapy, in which veterans address maladaptive thoughts associated with their worst traumatic event. It can be administered in either an individual or a group format (Resick et al. 2015, 2017). Prolonged exposure consists of 9–12 weekly 90-min sessions of trauma-associated imaginal and in-vivo exposures. It is administered in an individual therapy format, although trials are underway to examine administration in a group format (e.g., Smith et al. 2015). Research trials of CPT and PE have resulted in statistically significant and clinically meaningful improvement in veterans’ PTSD symptoms (e.g., Haagen et al. 2015; Monson et al. 2006; Schnurr and Lunney 2015; Schnurr et al. 2007; Voelkel et al. 2015). The VHA Uniform Mental Health Services Package mandated the availability of these treatments in VHA clinics beginning in 2008 (Kussman 2008), and they have been implemented in many settings (Chard et al. 2010; Tuerk et al. 2011; Rosen et al. 2016, 2017; Yoder et al. 2012). Studies at the local or regional level have estimated that approximately 6–13% of VHA patients with PTSD receive PE or CPT (Lu et al. 2016; Mott et al. 2014; Shiner et al. 2013).

Given the high prevalence of PTSD among VHA patients (e.g., Seal et al. 2007), as well as the resources invested in training providers nationwide, it is important to be able to identify use of evidence-based psychotherapies for PTSD on a national level. First, it allows an understanding of implementation patterns, including barriers to and facilitators of use. Second, it facilitates evaluation of whether patients improve in naturalistic settings (outside of research trials) and which patients improve most when receiving these treatments. Third, it allows us to better understand dropout rates and associated factors. Consequently, we need accurate measures of the use of these psychotherapies. While administrative data indicate whether a patient received psychotherapy, they do not contain information about the specific treatments delivered, including whether CPT or PE was administered. Although CPT was implemented in 2006 and PE in 2007, templated notes for these treatments were not introduced until 2015 and are still inconsistently used (Shiner et al. in press). As a result, the VHA has not been able to study whether these treatments have been systematically implemented and whether they are effective on a larger scale.

Although manual review of treatment notes at a regional or national level would be challenging and labor-intensive, automated coding of note text using natural language processing (NLP) is one method that can efficiently extract important information from large bodies of unstructured data (Meystre et al. 2008). Shiner et al. (2012, 2013) used NLP methods to better understand the percentage of veterans with newly diagnosed PTSD who received CPT or PE within six sites in the New England VHA region. In this study, our goal was to extend Shiner and colleagues’ work by applying automated coding to a large national pool of mental health treatment notes in order to identify use of CPT and PE. Shiner et al.’s work was limited to a regional evaluation, and the tool they used (the Automated Retrieval Console, or ARC; D’Avolio et al. 2010) ran processes in series rather than in parallel, so it could not reasonably be scaled up for national, longitudinal work. Another significant difference is that this work allows us to identify evidence-based psychotherapies (EBPs) delivered in both group and individual formats. Group CPT is reported to have spread rapidly in the VA, and being able to identify it will allow us to detect patterns of implementation. We hypothesized that automated coding using NLP would be able to detect and discriminate between note text describing evidence-based protocols for PTSD and other psychotherapy.

Method

Participants

We retrospectively identified 255,968 veterans of the wars in Iraq and Afghanistan who had at least two post-deployment encounters (inpatient and/or outpatient) with a PTSD diagnosis (ICD-9 309.81) and at least one post-deployment clinic visit with a psychotherapy procedure code (see list of procedure codes in Appendix 1) at one of 130 VHA facilities from October 2001 to August 2015. Next, all psychotherapy clinic visits for these patients were identified and linked to clinical notes. We excluded 3085 (1.19%) patients who did not have any notes associated with their psychotherapy procedure-coded visits. Our text corpus consisted of a total of 8,168,330 clinical notes associated with psychotherapy visits for 255,933 patients across outpatient and inpatient settings.

Natural Language Processing Method

We created an NLP system to analyze the narrative text of the psychotherapy notes and determine the type of psychotherapy each document describes. The system was built using Leo packages that extend the Unstructured Information Management Architecture Asynchronous Scaleout (UIMA AS) framework (Ferrucci and Lally 2004; Cornia et al. 2014). The system utilized the LIBSVM implementation of the support vector machine (SVM) algorithm (Chang and Lin 2011); LIBSVM is a library for support vector classification and regression. SVM is a supervised algorithm that requires a set of labeled training examples to develop a machine learning model. We chose the SVM algorithm because it is robust, well suited to large sparse feature sets, and generally accepted to be among the most accurate approaches for imbalanced sets. In addition, applying a trained SVM model is fast, which is essential when working with large datasets (Baharudin et al. 2010). We performed manual annotation of psychotherapy notes in order to create a reference document set for training and validating the NLP system.
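To make the classification step concrete, the following is a minimal, pure-Python sketch of a linear max-margin classifier over sparse binary bag-of-words features. The study itself used LIBSVM inside a UIMA AS pipeline; the Pegasos-style trainer, the toy notes, and the labels below are illustrative stand-ins, not the authors' implementation or data:

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimal linear SVM trained with the Pegasos subgradient method.

    X is a list of sparse feature dicts; y holds labels in {-1, +1}.
    """
    rng = random.Random(seed)
    w = {}
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):  # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(w.get(f, 0.0) * v for f, v in X[i].items())
            for f in w:                      # regularization: shrink all weights
                w[f] *= max(0.0, 1 - eta * lam)
            if margin < 1:                   # hinge-loss subgradient step
                for f, v in X[i].items():
                    w[f] = w.get(f, 0.0) + eta * y[i] * v
    return w

def predict(w, x):
    score = sum(w.get(f, 0.0) * v for f, v in x.items())
    return 1 if score >= 0 else -1

# Toy binary bag-of-words notes: CPT-like (+1) vs. other psychotherapy (-1)
docs = [
    ({"cognitive": 1, "processing": 1, "therapy": 1, "stuck": 1}, 1),
    ({"cognitive": 1, "processing": 1, "worksheet": 1}, 1),
    ({"supportive": 1, "coping": 1, "skills": 1}, -1),
    ({"medication": 1, "review": 1, "coping": 1}, -1),
]
X = [d for d, _ in docs]
y = [label for _, label in docs]
w = train_linear_svm(X, y)
```

Real note text is far longer and noisier, and LIBSVM additionally handles kernels and multi-class aggregation, which this sketch omits.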

Annotation of a Reference Standard

The goal of manual annotation was to review each note in the selected set and assign a label that reflects the type and format of psychotherapy that the clinical note described. We used the following eight labels: (1) CPT individual, (2) CPT group, (3) PE individual, (4) PE group, (5) other individual psychotherapy, (6) other group psychotherapy, (7) other family or couples’ psychotherapy, and (8) not psychotherapy.

Two practicing VHA clinicians who are trained in and provide evidence-based psychotherapies at the VHA (one staff psychologist, K.B., and one psychology postdoctoral fellow, L.A.G.) and two professional clinical chart annotators performed multiple rounds of annotation. The clinicians and chart annotators collaborated to iteratively create annotation guidelines defining each code (see Appendix 3 for an example). All four annotators completed the first two rounds of annotation on the same document set to evaluate inter-annotator agreement. Once an acceptable level of agreement (kappa ≥ 0.8) was achieved, the professional clinical chart annotators reviewed the additional documents. The documents were labeled using annotation tools on a centralized virtual workspace provided by the VA Informatics and Computing Infrastructure (VINCI).
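Agreement thresholds like the kappa ≥ 0.8 criterion above are typically computed with Cohen's kappa, which corrects raw percent agreement for chance. A minimal sketch with invented toy labels (not the study's annotations):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently at random
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators agree on 3 of 4 documents
kappa = cohens_kappa(["CPT", "CPT", "PE", "PE"],
                     ["CPT", "CPT", "PE", "CPT"])
```

With more than two annotators, as in this study, pairwise Cohen's kappa or a multi-rater statistic such as Fleiss' kappa is commonly used.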

The main challenge in selecting documents for annotation was the large difference in prevalence across therapy types in our national body of clinical notes. Machine learning classifiers do not perform well if the training set is highly imbalanced, and one approach to creating an accurate model is to oversample classes with low prevalence (Batuwita and Palade 2013). However, we had no formal way of selecting documents so as to oversample relevant ones. We therefore used an iterative approach, repeating the following steps: (1) random stratified selection of documents from the complete unlabeled set; (2) annotation of the selected documents; (3) training a preliminary SVM classifier model on the annotated set; and (4) applying the newly trained preliminary model to the full dataset. In each iteration, we stratified either simply by location code within the VA (130 locations in total) or additionally by the labels assigned by the latest trained SVM model. The goal of each iteration was to arrive at a balanced document set in which each document type is evenly represented. In addition to random stratified selection, we performed targeted selection of documents indicating unusual situations, such as a patient having both PE and CPT sessions on the same day, which is highly unlikely.
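The stratified-selection step in this loop can be sketched as a helper that draws a fixed number of documents per stratum (facility code in early iterations, model-predicted label in later ones). The helper name, toy corpus, and counts below are hypothetical:

```python
import random
from collections import defaultdict

def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum defined by `key`.

    Hypothetical helper mirroring the paper's stratified selection.
    Sampling evenly per predicted label oversamples rare classes
    relative to their prevalence in the full corpus.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        rng.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample

# Toy corpus of (facility_code, predicted_label) pairs
corpus = [(i % 3, lab) for i, lab in enumerate(["CPT", "PE", "other"] * 20)]
# Later iterations stratify by the latest model's predicted label
balanced = stratified_sample(corpus, key=lambda d: d[1], per_stratum=5)
```

In the first iterations, `key` would instead return the facility code (`d[0]` here), since no model predictions exist yet.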

All preliminary training of SVM models was performed without manual feature selection. Once the last annotation iteration was completed and a balanced document set was identified, the full annotated document set was split into training and testing sets; only then was the training set manually reviewed to guide creation of the feature set for the final NLP system.

NLP System Development

The final annotated set of 3467 clinical documents was randomly split into a two-thirds training set (N = 2960) and a one-third (N = 1507) validation set. There were four steps in the development of our NLP system. First, we built a set of features for a machine learning classifier using a sparse binary bag-of-words document representation. This method encodes a document as an unordered set of words with a value of “1” if the word is present in the document; if the word is absent, a value of “0” is implied. We pruned this set by removing the most frequently used irrelevant words (e.g., a, an, is, was, the). Second, we removed misleading phrases. For example, in the phrase “CPT code,” CPT stands for current procedural terminology, a system for coding procedures performed at clinical visits, whereas elsewhere “CPT” stands for “cognitive processing therapy”; the phrase “CPT code” was therefore removed. Third, we created a set of features representing salient phrases indicative of one of the categories. The values of these features were set to “2” to give them greater weight. For example, occurrence of the phrase “cognitive processing therapy” in a document is more important in determining whether the document reports on a CPT session than the words “cognitive,” “processing,” and “therapy” separately. The total number of features in the training set was 16,516.
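The feature-construction steps above can be sketched as follows; the stop-word, misleading-phrase, and salient-phrase lists here are illustrative examples, not the study's actual 16,516-feature set:

```python
import re

def featurize(text, stopwords, misleading, salient):
    """Sketch of the described feature scheme: binary bag-of-words with
    stop words dropped, misleading phrases removed, and salient phrases
    given a weight of 2. The word lists are illustrative only."""
    t = text.lower()
    for phrase in misleading:                 # e.g. "cpt code"
        t = t.replace(phrase, " ")
    features = {}
    for phrase in salient:                    # weighted phrase features
        if phrase in t:
            features[phrase] = 2
    for word in re.findall(r"[a-z0-9]+", t):  # binary word features
        if word not in stopwords:
            features.setdefault(word, 1)
    return features

feats = featurize(
    "The CPT code was 90837; cognitive processing therapy session held.",
    stopwords={"the", "a", "an", "is", "was"},
    misleading={"cpt code"},
    salient={"cognitive processing therapy"},
)
```

Removing "cpt code" before tokenization keeps the stray "cpt" token from masquerading as a cognitive processing therapy mention, while the salient phrase still contributes a double-weighted feature.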

Document-level classification was performed using the LIBSVM implementation of a linear multi-class SVM classifier. The manually annotated document set indicated that clinical reports documenting PE group sessions were extremely rare (consistent with the fact that VA clinicians have not yet received training on this format), so only a single PE category was used. Similarly, all documents reporting other psychotherapy sessions were combined into a single “other psychotherapy” category. Thus, classification was performed using five labels: PE, CPT individual, CPT group, other psychotherapy, and not psychotherapy. Although the algorithm is designed for multi-class classification, internally it performs a series of “one-against-one” binary classifications and then aggregates their results by voting to assign the single most probable of the five labels (Hsu and Lin 2002). The accuracy of the classification model on the training set was 0.89. Finally, after developing the machine learning model, we validated the system on the test document set.
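The one-against-one voting scheme can be sketched as below. `binary_predict` is a hypothetical stand-in for a trained pairwise SVM; the toy rule in the example simply prefers the label that sorts first:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_vote(binary_predict, labels, x):
    """Aggregate pairwise classifiers by majority vote, as in LIBSVM's
    multi-class scheme (Hsu and Lin 2002): each (a, b) classifier casts
    one vote for document x, and the label with the most votes wins."""
    votes = Counter(binary_predict(a, b, x)
                    for a, b in combinations(labels, 2))
    return votes.most_common(1)[0][0]

labels = ["PE", "CPT individual", "CPT group",
          "other psychotherapy", "not psychotherapy"]
# Toy pairwise rule, for demonstration only
winner = one_vs_one_vote(lambda a, b, x: min(a, b), labels, x=None)
```

With five labels this requires 5 × 4 / 2 = 10 pairwise classifiers per document.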

We analyzed system accuracy for each category separately and for the system as a whole. Measures included the true positive count (TP: the number of documents in a category that both the system and the reference standard agreed described the performed psychotherapy; we also report true negative, false positive, and false negative counts), positive predictive value (PPV: the proportion of system-identified documents in a category that were true positives), sensitivity (the proportion of documents the annotators assigned to a category that the system also identified), specificity (the proportion of documents the annotators labeled as not belonging to a category that the system also labeled as negative), and classification accuracy (the proportion of all documents across categories on which the system and the reference standard agreed).
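These definitions reduce to standard confusion-matrix formulas; the counts in the example are made up for illustration:

```python
def metrics(tp, fp, tn, fn):
    """Per-category performance measures from confusion-matrix counts."""
    return {
        "ppv": tp / (tp + fp),            # positive predictive value
        "sensitivity": tp / (tp + fn),    # a.k.a. recall
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Illustrative counts for one category, not the study's results
m = metrics(tp=90, fp=10, tn=880, fn=20)
```

Note that with rare categories (as here, where most documents are negatives for any one EBP label), accuracy and specificity can be high even when PPV and sensitivity are modest, which is why all four measures are reported per category.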

Analysis of Psychotherapy Receipt

Once the automated coding process was complete, we performed analyses comparing various methods for assessing psychotherapy delivery. For each patient, we calculated psychotherapy received in three ways: (1) mean number of individual or group psychotherapy sessions using psychotherapy Current Procedural Terminology codes (see Appendix 1), (2) mean number of psychotherapy sessions using automated classification of all documents, and (3) mean number of sessions of each specific EBP using NLP-based automated coding of individual psychotherapy documents. We repeated these analyses for the subpopulation of patients who received any EBP.

In order to account for potential bias from varying observation times, we performed a separate analysis restricting the dataset to the first 4.1 years (the median observation time for the entire cohort) from the initial psychotherapy procedure-coded visit among the 130,416 patients who were observed for at least that long (i.e., whose initial visit was at least 4.1 years before the end of the study), and repeated the analyses described above. All statistical analyses were completed in SAS Enterprise Guide version 7.1 (SAS Institute, Cary, NC).

Results

After the initial two rounds of annotation and training, agreement between annotators was excellent (kappa = 0.88; 95% CI 0.85–0.90). Table 1 outlines the frequencies of documents in each category in the training and testing sets. The frequencies of some labels were too small to build an accurate machine learning model. As distinguishing among these categories was not essential, they were combined: prolonged exposure individual and group were combined into the PE category, and psychotherapy sessions other than PE and CPT were merged into the “other psychotherapy” category.

Table 1 Frequency of documents in each category for training and testing sets

NLP system validation showed an acceptable level of performance, with a PE accuracy of 0.99, CPT individual and CPT group accuracies of 0.97, and an overall classification accuracy of 0.92 (see Table 2). Additionally, sensitivity, specificity, PPV, and negative predictive value (NPV) for PE, CPT individual, and CPT group were all 0.90 or greater (see Table 2). The NLP system was then applied to the full dataset. In total, the automated coding process using NLP identified 3,705,968 psychotherapy notes, including 84,445 PE notes, 196,018 CPT individual notes, and 121,211 CPT group notes. As we moved from analytic methods reliant on procedure coding to methods reliant on automated coding of note text using NLP, our estimates of the use of psychotherapy decreased (Table 3). Using administrative coding, patients appeared to receive an average of 18.7 sessions of psychotherapy over a median of 4.1 years, whereas using automated review of note text they appeared to receive 14.5 sessions. This means that some services administratively coded as psychotherapy appeared to be other services when the notes were reviewed. A total of 51,852 patients (20.2%) received at least one session of PE or CPT over the study period. These patients received an average of 7.8 sessions of EBP, although they had an average total of 38.9 individual psychotherapy sessions. This means that patients who received EBP for PTSD also received an equal or greater number of sessions of other forms of individual psychotherapy as part of their course of treatment. Restricting the period of observation to 4.1 years did not meaningfully change the results (see Appendix 2).

Table 2 Performance characteristics of the developed system
Table 3 Comparing methods to estimate use of prolonged exposure and cognitive processing therapy among all patients

Discussion

We achieved our goal of deriving a method to identify VHA PTSD EBP notes on a national level and confirmed our hypothesis that an NLP system could distinguish between EBP notes and general psychotherapy notes at a large scale. Although the VHA invested in system-wide training programs to implement EBP for PTSD over a decade ago, this is the first time it has been possible to identify receipt of EBP from EMRs on a national level across the implementation period. Our system can facilitate further research to determine the percentage of veterans who receive minimally adequate EBP for PTSD, as well as their level of symptom improvement in a VHA clinic setting outside of a clinical trial, using patient-reported outcomes stored in the EMR (e.g., Maguen et al. 2014). More specifically, we can use the algorithm to determine how many sessions of EBPs each individual received and whether dose is associated with measures of PTSD symptom outcomes tracked in the EMR (Hebenstreit et al. 2015; Maguen et al. 2014; Seal et al. 2016).

Given that there are regional and site differences in the implementation of EBPs for PTSD, it is critical to be able to determine automated coding accuracy using NLP on a national level. While Shiner et al. (2013) demonstrated that automated coding of psychotherapy notes using NLP was possible on a regional level, we were able to build on and extend their work by using a different platform, expanding to a national level, and identifying EBP for PTSD delivered in different formats (e.g., CPT group). This will also allow for comparisons of delivery methods of EBP for PTSD (Dreyer et al. 2010), which to date have only been compared in clinical trials (e.g., CPT individual vs. group; Resick et al. 2017). We found that about 20% of Iraq and Afghanistan veterans received at least one session of EBP for PTSD over nearly 15 years of observation. This is lower than we would expect from other studies using administrative data that were not able to isolate receipt of EBP for PTSD (Cully et al. 2008; Harpaz-Rotem and Rosenheck 2011; Seal et al. 2010; Spoont et al. 2010). However, low receipt of EBP for PTSD is consistent with more recent studies demonstrating that few veterans are initiating EBPs for PTSD and that dropout levels are high among those who do initiate (Kehle-Forbes et al. 2016).

Being able to identify patients who engaged in EBP for PTSD will help improve care for veterans in several ways. It will help us understand which individuals are most likely to receive and benefit from these treatments. It will also help us understand how many sessions are needed to receive a “minimally adequate” dose of treatment. For example, if some individuals drop out of treatment because they are better, this is important information that can help modify the delivery of care. Being able to identify patients who engaged in EBP for PTSD will also help us to understand predictors of dropout and improvement. Additionally, given that we have longitudinal data, it can help inform us about the average length of time to EBP engagement as well as the typical trajectories of care. For example, we found that those receiving about eight sessions of EBP also received an average of nearly 40 individual psychotherapy sessions during the study period. In follow-up analyses, we examined the number of psychotherapy sessions prior to the first session of EBP as well as after the last session of EBP for those who completed any EBP sessions. We found that veterans attended a mean of 25 sessions (median = 10) prior to their first session of EBP and a mean of 19 sessions (median = 7) after their last session of EBP. Consequently, it seems that on average, patients are getting additional treatment before and after EBP, with a larger percentage of sessions happening prior to EBP. This could represent efforts to address patients’ readiness for PTSD treatment (e.g., Zubkoff et al. 2016), treatment of comorbid mental health conditions (e.g., Shiner et al. 2017), or post-EBP care for patients who may continue to have symptoms or relapse over time.

There are some important limitations to this work that should be noted. First, we conducted this study with Iraq and Afghanistan veterans, who are the newest veterans of war in the VHA system; consequently, results may not generalize to all veterans. Second, while rates of EBPs for PTSD were low, this may be due to patient preferences, which we were not able to assess. Third, participation in a CPT or PE session does not reflect the intensity or quality of the intervention. Although measuring the quality of individual sessions was not the focus of the current study, it is an important goal for future studies.

Despite these limitations, our findings suggest that automated coding using NLP is a method to identify use of EBPs. As far as we are aware, this is the first large-scale national application of automated coding to identify EBPs in VHA psychotherapy notes. This method holds great promise for answering multiple previously inaccessible questions that can assist clinicians and local and national leaders alike to understand current EBP practices and outcomes, and ultimately improve care for those with PTSD.