INTRODUCTION

Clinical notes are the means by which physicians document and communicate important information regarding the care of their patients.1, 2 Appropriate documentation is a necessary component of practice,3,4,5 with notes being used for patient care, medical education, billing, quality improvement, and legal proceedings.1, 2 Since the advent of electronic health records (EHRs), physician notes have become more legible and accessible,2, 6 but these advances have come at the cost of increased note length due to note “bloat,” or “clutter,” and increased errors created by “cut and paste” or “copy forward” practices.6,7,8 Faculty and trainees recognize these pitfalls9,10,11 while also perceiving these functions as helpful for efficiency.11 Despite this, there are few curricula available to teach trainees how to construct their notes appropriately6, 12 and a gold standard for assessing note quality does not exist.12

Some groups have created note templates to improve progress note7, 11 or discharge summary quality.12 These interventions have shown modest effect by reducing “clutter”7 or decreasing note length11; however, interventions based solely on compliance with institution-specific note templates have limited generalizability. Others have developed assessment tools to improve notes. QNOTE assesses outpatient notes,1, 2 while PDQI-9 assesses inpatient notes.6, 8, 13 Both QNOTE and PDQI-9 use subjective adjectives, such as “concise” or “up-to-date,” as assessment items, which do not provide concrete, actionable feedback to learners. PDQI-9 also requires assessors to be familiar with the patient or to perform significant chart review. The RED checklist13 assesses inpatient progress notes with four global measures of quality (truthful, reasoned, updated, and succinct) via open-ended questions and assesses individual note components via a checklist, but it requires assessors to review the previous progress note as well. Additionally, none of these tools integrate with competency-based frameworks such as the Accreditation Council for Graduate Medical Education (ACGME) milestones. While these tools have furthered our understanding of assessing trainee documentation, a tool is needed that simultaneously assesses the multiple functions of a note, targets note “bloat,” assesses overall clarity, and maps to other educational frameworks.

The purpose of this study was to develop, and generate initial validity evidence for, a single tool that assesses admission note quality and serves multiple functions, including assessment of key individual components of a note and global assessments mapped to the ACGME milestone framework.

METHODS

Setting

We conducted our study from 2017 to 2018 at the University of Cincinnati Medical Center. Our Internal Medicine residency program has approximately 92 categorical and preliminary residents each year. All documentation is entered into the electronic health record (Epic Hyperspace; Epic Systems, Verona, Wisconsin). This study was approved by the University of Cincinnati Institutional Review Board.

Assessment Tool Development

An initial draft of the Admission Note Assessment Tool (ANAT) was developed by two authors (DW and JH). The study team consisted of two internal medicine hospitalists (JH and DS) and four internal medicine-pediatrics hospitalists (DW, MK, BK, and JO). Members of this group have content expertise in tool development, learner assessment, and billable documentation. Tool development continued with the goal that the ANAT (Fig. 1) would help accomplish the following objectives:

  1. Ensure proper documentation for billing

  2. Decrease note “bloat”

  3. Provide global assessments mapped to the ACGME milestone framework

Fig. 1 Admission Note Assessment Tool (ANAT)

The ANAT consists of 16 discrete checklist items: the first 13 items are scored as “met” or “not met,” and the last three are scored as “met”, “partially met”, or “not met”. The two global assessment items are scored on a five-point, behaviorally-based scale mapped to ACGME sub-competencies (item one to ICS-3 and PROF-4; item two to PC-1 and ICS-2), as detailed below.

Validity evidence was sought utilizing Messick’s validity framework.14

The study group revised the initial draft of the ANAT based on discussion and consensus building15 around optimal tool content and format to meet the tool’s objectives. The ANAT then underwent further iterative revisions as follows. Each study group member used the tool individually to evaluate an admission note. The group then discussed, via think-alouds, how each member had interpreted and applied the assessment items when rating the note. The ANAT was then revised, and the process was repeated until there was agreement that the tool accomplished the above stated goals, had utility,16 and was easy to use. Notes were taken throughout this process, and the group’s experiences were used to create a rater training manual.

At the start of the project, two commonly identified issues were inadequate review of systems (ROS) and physical exam (PE) documentation needed for appropriate evaluation and management (E&M) billing.17 Because the majority of our patients are highly complex, billing requirements for a level three E&M encounter became the benchmark for assessing the elements of the note. We found that agreeing on the quality of certain elements (e.g. completeness of the history of present illness) was challenging given the subjective nature of assessing quality. Therefore, the scope of many items in the ANAT was narrowed to focus on billing criteria and scored as “met” or “not met” in a checklist format. Similarly, to help decrease note “bloat,” we included items aimed at reducing the inclusion of irrelevant historical labs or imaging. A separate area was created for narrative feedback related to each item. Incorporating narrative feedback throughout the tool became an important focus, allowing specific, actionable feedback on the more subjective or nuanced aspects of documentation not captured in the checklist and informing the global assessment ratings.

Within the assessment and plan (A&P) items, our goal was to assess clinical reasoning, but opinions differed within our group about what constituted adequate reasoning. Therefore, we decided to assess the presence or absence of clinical reasoning in the A&P items and to focus on narrative feedback, while assessing the overall adequacy of clinical reasoning in one of the global assessment items. We felt that scoring these items on a three-point scale (i.e. “met”, “partially met”, and “not met”) would still provide some discrimination between trainees.

With the development of this tool we created two global assessment items: 1) pertaining to the elements of a note, the learner can “document an initial hospital encounter”; and 2) pertaining to the quality of a note, the learner can “demonstrate ability to synthesize and document clinical reasoning during an initial hospital encounter”. We created a behaviorally-anchored five-point rating scale based on the amount of clarification/editing needed by a supervisor (e.g. “documentation requires substantial clarifications by supervisor”). To integrate ANAT ratings into our program of assessment, and to examine relationships to other variables of trainee performance in the future, a key aspect of development was mapping these global assessment items to ACGME sub-competencies. We mapped item one to interpersonal and communication skills (ICS)-3 (“appropriate utilization and completion of health records”) and professionalism (PROF)-4 (“exhibits integrity and ethical behavior in professional conduct”), and item two to patient care (PC)-1 [“gathers and synthesizes essential and accurate information to define each patient’s clinical problem(s)”] and ICS-2 [“communicates effectively in interprofessional teams (e.g. peers, consultants, nursing, ancillary professionals and other support personnel)”].4

Assessment Tool Piloting

Raters were trained prior to each pilot. Rater training consisted of a one-hour session with the principal investigator (DW), conducted in person or via conference call. Each component of the tool was explained, and raters were instructed on proper use of the tool through situational examples, the rater training manual, and simulated review of a sample note.

To determine whether the ANAT could be used by any assessor without firsthand knowledge of the patient, chart review, or review of other notes, pilot testing compared admission note ratings by the supervising attending with ratings by study team raters. For this comparison to be legitimate, notes needed to be reviewed in close proximity to the supervising attending’s time on service, to ensure memory of the patient and limit recall bias. The number of notes available was limited by the number of patients seen and thus determined the number of additional study team raters needed to power reliability calculations. In pilot one, a total of 28 notes were assessed: 18 notes were assessed by the supervising attending and two study team raters, and an additional 10 notes were assessed by the same two study team raters only. Ratings by the supervising attending and study team raters (i.e. the first 18 notes) were then compared with ratings by the study team raters alone (i.e. all 28 notes).

After reviewing the results from pilot one, the tool was refined using the same iterative process described previously. Feedback from the supervising attending was also incorporated into the discussion. A second round of pilot testing was then performed, again using a supervising attending and study team raters. A new supervising attending participated so that, again, note review could be completed in close proximity to that attending’s time on service. Based on power calculations, additional study team raters were used to decrease the number of notes needed for review. In pilot two, a total of 15 notes were assessed: 13 notes were assessed by the supervising attending and four study team raters, and an additional two notes were assessed by the same four study team raters only. Ratings by the supervising attending and study team raters were again compared with ratings by the study team raters alone. During pilot two, raters recorded the time spent on each note assessment. After reviewing the results from pilot two and discussing the group’s experience using the tool, the group agreed that no further revisions were needed.

A final pilot was performed with one study team rater and three attending physicians from other institutions, using the finalized tool, to assess the feasibility of using the tool at a different institution with similar rater training. A total of 18 notes were reviewed by all four raters. Ratings by the study team rater and the three outside attending physicians were compared with ratings by the three outside attending physicians alone.

Statistical Analysis

For interrater reliability calculations we assigned scores between 0 and 1 to each individual item on the ANAT. For all items, ratings were scored as follows: “not met” as 0 and “met” as 1; when applicable, “partially met” was scored as 0.5. The two global assessment items were rated on a five-point scale. As discussed by de Vet et al.,18 agreement parameters should be used for an instrument developed for evaluative purposes, where only the measurement error of the instrument itself matters and not the variability between subjects. Thus, average percent agreement was used to measure interrater reliability. All data were analyzed using R, version 3.3.3.19
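
For illustration only, the scoring scheme and average pairwise percent agreement described above can be computed in R (the software used for our analyses). The following is a minimal sketch with hypothetical ratings of three notes by three raters for a single item; it is not the study’s actual analysis code.

    # Hypothetical ratings of three notes by three raters for one ANAT item
    ratings <- data.frame(
      rater1 = c("met", "not met", "partially met"),
      rater2 = c("met", "met", "partially met"),
      rater3 = c("met", "not met", "met")
    )

    # Convert ratings to numeric scores: "not met" = 0, "partially met" = 0.5, "met" = 1
    score_map <- c("not met" = 0, "partially met" = 0.5, "met" = 1)
    scores <- as.data.frame(lapply(ratings, function(x) score_map[as.character(x)]))

    # Average percent agreement: for each note, the proportion of rater pairs whose
    # scores agree exactly, averaged over all notes and expressed as a percentage
    pairwise_agreement <- function(note_scores) {
      pairs <- combn(length(note_scores), 2)
      mean(apply(pairs, 2, function(p) note_scores[p[1]] == note_scores[p[2]]))
    }
    round(100 * mean(apply(scores, 1, pairwise_agreement)), 1)

In practice, the same calculation would be applied to each of the 16 checklist items and the two global assessment items across all rated notes.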

RESULTS

The ANAT (Fig. 1) includes 16 checklist items that are key “elements of the note” and two global assessment items.

The results of pilot two testing with five raters, consisting of the supervising attending and four study team raters, are shown in Table 1. Rater agreement ranged from 86% to 100% for thirteen of the items. The three A&P items had rater agreement ranging from 72% to 81%. Rater agreement on the two global assessment items was 69% and 68%, respectively. Ratings by the supervising attending and study team raters, compared with ratings by the study team raters alone, are also shown in Table 1. Results from pilot testing with raters from other institutions are shown in Table 2.

Table 1 Results of Pilot Testing Comparing a Supervising Attending Rater and Study Team Raters
Table 2 Results of Pilot Testing Comparing a Study Team Rater and Outside Institution Raters

Overall, raters in pilot two took an average of 12.3 min to complete a note assessment (SD 3.7). The supervising attending took an average of 11.2 min (SD 1.9), while study team raters took an average of 12.6 min (SD 3.9).

Using Messick’s validity framework, we gathered content, response process, and internal structure validity evidence for the ANAT. The ANAT was developed by faculty members with content expertise in tool development, learner assessment, and billable documentation, providing evidence of content validity. The think-alouds during consensus building, the standardized rater training, and the comprehensive rater training manual created from this process generated evidence of response process validity. The finding that scores from the supervising attending were comparable to scores from the study team raters (Table 1) provided evidence of internal structure validity.

DISCUSSION

The ANAT had high agreement for simple, objective items (e.g. chief complaint) and lower agreement for more complex items, such as ROS and PE, and for more subjective items such as the A&P and global assessment items. Lower agreement on subjective items is consistent with other published tools. For example, the RED checklist had lower agreement for its A&P items, with 71% agreement for the item “active problems are accompanied by clinical reasoning” and as low as 51% agreement for the item “problems are associated with brief, clear plans.”13

Lower agreement for the global assessment items is also expected given the nuances assessors bring to global assessment. Our behaviorally-anchored scale includes guidance for rating the global assessment items, and faculty were trained to rate based on the amount of clarification/editing needed, not on their own documentation attestation practices. However, just as variation is seen in actual practice, faculty bring their own interpretations to assessment of performance.20, 21 Further, the global assessment items are rated on a five-point scale, making perfect agreement less likely.

Results were similar between the supervising attending and study team raters across all items, indicating that use of the ANAT without prior knowledge of the patient can be expected to yield results similar to use of the ANAT by the supervising attending. The average time to complete a note assessment was 12 min. These findings support the feasibility of implementing a system for routine assessment of trainee notes, as assessments are not limited to the supervising attending, can be completed quickly, and can be completed at any time after the note is written. Results were also similar between raters from other institutions and a study team rater, suggesting that the ANAT can be implemented at other institutions with minimal rater training.

Unlike other published tools, the ANAT serves multiple functions. None of the published tools incorporate billing criteria, while the ANAT does. Additionally, items in the ANAT assess the incorporation of irrelevant historical labs and imaging. These items target note “bloat” and other pitfalls of EHR shortcuts without depending on adoption of a particular note template, unlike published interventions. The ANAT focuses on assessing documentation behaviors not limited to one EHR or one institution, suggesting it can be easily adopted by others. Maximizing assessment opportunities to provide additional feedback on practical skills, such as billing criteria and appropriate use of the EHR, gives trainees important education without additional faculty effort. PDQI-9 and the RED checklist require assessors to review information beyond the individual note being assessed, whereas the ANAT requires review only of the note being assessed, further maximizing the efficiency of faculty effort.

The ANAT evaluates individual components of notes via a checklist and provides global assessment via the two global assessment items. This structure addresses the need to provide both types of assessment for clinical documentation. A checklist is well suited to assessing discrete, concrete items such as components of a note and billing criteria, and it allows direct feedback on specific portions of a note. In contrast, global assessment is needed for more complex skills not well captured in a checklist, such as the learner’s ability to synthesize information and communicate their thought process. The only other assessment tool that evaluates both individual components of notes and global measures is the RED checklist developed by Bierman et al.13 Unlike the RED checklist, our global assessment items use a behaviorally-based numerical scale and are mapped to the sub-competencies for Internal Medicine within the ACGME milestone framework, an important part of the ANAT’s construct.22, 23 In the future, any resident who receives a level one rating will be flagged by our clinical competency committee for further review as part of our standard review processes. This will allow us to track and intervene on critical deficiencies in documentation.24

The ANAT provides space for narrative comments, allowing specific, directed feedback not otherwise captured in the checklist ratings. The incorporation of narrative feedback is emphasized throughout the rater training process to ensure that assessment is not limited to checklist criteria, to allow more subjective measures of quality to be assessed via narrative feedback, and to instruct assessors on aspects of the note that should inform the global assessment ratings. This emphasis on narrative comments throughout the ANAT can generate rich formative feedback aimed at concrete, actionable improvement, which is less of a focus in other published tools.

This study is limited by being conducted at a single institution within a single program, although pilot testing included attendings from multiple institutions. Another limitation is the lack of comparison of ANAT results with results from other published tools or other standards of documentation practice. Additionally, the ANAT incorporates billing criteria in its assessment items, which may not be applicable in the future if billing criteria are modified. However, a change in billing criteria would require only minimal tool modification and would not impact the tool’s underlying assessment construct. Importantly, we do not yet know whether use of the ANAT will affect documentation behaviors amongst learners. Further, we have not yet established evidence of relationships to other variables or of consequence validity, although including global assessment items mapped to the ACGME milestone framework is an important step toward establishing this validity evidence in the future.

CONCLUSIONS

We present initial validity evidence for the ANAT, which uniquely serves multiple functions by incorporating billing criteria, targeting note “bloat,” assessing individual note elements, and utilizing global assessment items mapped to the ACGME milestone framework, while simultaneously providing an opportunity for narrative feedback. Routine use of the ANAT as part of trainee assessment is feasible given that any attending can complete the evaluation with minimal training and time, with overall high agreement expected. Further testing of the ANAT should be undertaken to continue building validity evidence and to examine how the tool performs as part of routine assessment. Next steps should include development of a documentation curriculum that can be implemented alongside the ANAT, assessment of documentation behaviors after implementation of the ANAT, and spread to other institutions.