Introduction

Cancer screening by biennial mammography is recommended for women aged 40 years or above in Japan. Ultrasonography can detect breast cancers that escape mammography, and the supplemental contribution is particularly high in high-density breasts of young women. In Japan, a large-scale comparative study of whether the concomitant use of ultrasonography is effective for breast cancer screening in women in their 40s is in progress [1].

Ultrasonography is already used for breast cancer screening at many private health care organizations. Quality control is important for breast cancer screening by ultrasonography. While the equipment is an important component of quality control, the ability of the examiner is even more important, because ultrasonographic detection and diagnosis of the lesions are performed in real-time. To improve the skill of the examiner for the future nationwide introduction of breast cancer screening by ultrasonography, the Educational Committee of the Japan Association of Breast and Thyroid Sonology (JABTS), an NPO, has organized 2-day training programs on breast ultrasonography for physicians and technologists with tests using images to evaluate the ability of the participants at the end of the programs. In this report, the test results were analyzed.

Subjects and methods

Between April 2008 and March 2009, the Educational Committee of the JABTS sponsored training programs on breast ultrasonography 18 times (9 times for physicians, 9 times for technologists). The target population was physicians and technologists engaged, or expected to be engaged, in breast cancer screening by ultrasound and those at hospitals accepting secondary examinations of breast cancer screening. Each training program was held over 2 days, with at most 49 physicians or 48 technologists participating. The subjects of this study were 422 physicians and 415 technologists. The specialties of the physicians and the numbers practicing each specialty were as follows: breast surgeons (196), surgeons of other or unspecified fields (118), gynecologists (37), radiologists (33), physicians working at screening institutions (21), and physicians of internal medicine (15); data were missing for 2. The technologists included medical technologists and radiographers; both are allowed to perform ultrasound in Japan. The experience levels were divided into four groups according to the number of ultrasound examinations performed during the past 5 years: <100, 100–499, 500–999, and ≥1,000. The distribution of the physicians across these experience levels was 66, 126, 74, and 156, respectively, and that of technologists was 105, 110, 53, and 147, respectively.
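As a minimal illustration of the experience grouping described above, the following sketch maps a self-reported examination count to one of the four groups and checks that the reported group sizes add up to the stated numbers of subjects. The function name `experience_group` and the ASCII group labels are assumptions made for this example; they do not come from the study's own analysis code.

```python
def experience_group(n_exams: int) -> str:
    """Map a self-reported 5-year examination count to one of the four groups."""
    if n_exams < 0:
        raise ValueError("examination count cannot be negative")
    if n_exams < 100:
        return "<100"
    if n_exams < 500:
        return "100-499"
    if n_exams < 1000:
        return "500-999"
    return ">=1,000"


# Group sizes as reported in the text (labels are illustrative).
physicians = {"<100": 66, "100-499": 126, "500-999": 74, ">=1,000": 156}
technologists = {"<100": 105, "100-499": 110, "500-999": 53, ">=1,000": 147}

# The four groups account for all 422 physicians and 415 technologists.
assert sum(physicians.values()) == 422
assert sum(technologists.values()) == 415
```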

Table 1 shows the training program.

Table 1 Training program for physicians and technologists

Image-based tests were administered at the end of the training program on laptop computers, on which the observers also entered their answers. The contents of the tests were as follows.

Fifty questions using videos (each 15–25 s long)

Based on the video images that mimicked ultrasonography and that did not pause at the lesion, participants were asked whether a lesion that requires secondary examination (category 3 or more advanced) was present or absent. The questions consisted of 25 videos including lesions of category 3 or greater and 25 videos that included no lesions or included lesions of category 2.

Fifty questions using still images (each consisting of 1 or 2 orthogonal images)

Based on the still images, the category and disease considered most likely were selected. The questions included 18 sets of malignancies, 7 sets of benign lesions that need further examination, 17 sets of benign lesions that do not need recall (category 2), and 8 sets of normal breasts (including normal variations).

The videos were played using Windows Media Player. The observers were taught how to use the software during group training sessions. The answers (recommended categories and disease names) to questions presented using still images were determined based on a conference involving two doctors with more than 20 years of experience with breast ultrasound and one technologist with more than 10 years of experience. Table 2 defines the categories we use in Japan in comparison with BIRADS categories. Figure 1 and Table 3 show the criteria for category judgment. Table 4 shows the choices of disease names presented in the tests using still images. There were multiple possibly correct answers regarding both the category and disease name in still images, and the answer was regarded as correct if one of them was selected. The duration of tests using videos and those using still images combined was 100 min.

Table 2 Breast cancer categories and their definitions
Fig. 1 Mass image-forming lesions

Table 3 Non-mass image-forming lesions
Table 4 Choices of breast diseases

The following six items were evaluated for comparison:

  1. Percentage of correct judgments of diseases on video images (video sensitivity)

  2. Percentage of correct judgments of non-disease conditions on video images (video specificity)

  3. Percentage of correct judgments of diseases in still images (still image sensitivity)

  4. Percentage of correct judgments of non-disease conditions in still images (still image specificity)

  5. Percentage of category agreement for still images

  6. Percentage of disease agreement for still images

Here, the video sensitivity was defined as the percentage of videos containing a category 3 or more advanced lesion for which the observer answered “present,” and the video specificity as the percentage of videos not containing a category 3 or more advanced lesion for which the observer answered “absent.” The still image sensitivity and specificity were defined, respectively, as the percentage of category 3 or more advanced lesions scored as 3 or higher and the percentage of category 2 or less advanced lesions scored as 2 or lower. Category and disease agreement for still images were defined, respectively, as the percentage of answers that agreed with the recommended categories and the percentage of answers that agreed with the recommended disease names, in both cases across all image sets. Figure 2 shows an example of still image questions and answers and how the answers were evaluated.
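The video measures can be illustrated with a short sketch. The function name, the boolean encoding of the answers, and the example participant below are hypothetical; only the 25/25 split of the videos and the category 3 threshold are taken from the text.

```python
from typing import Sequence, Tuple


def video_sensitivity_specificity(
    truth: Sequence[bool], answers: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) in percent for one participant.

    truth[i]   -- True if video i contains a category 3 or more advanced lesion
    answers[i] -- True if the participant answered "present" for video i
    """
    if len(truth) != len(answers):
        raise ValueError("truth and answers must have the same length")
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative video")
    true_pos = sum(1 for t, a in zip(truth, answers) if t and a)
    true_neg = sum(1 for t, a in zip(truth, answers) if not t and not a)
    return 100 * true_pos / n_pos, 100 * true_neg / n_neg


# 25 videos with a category >= 3 lesion followed by 25 without (as in the test).
truth = [True] * 25 + [False] * 25
# Hypothetical participant: finds 21 of the 25 lesions and correctly
# answers "absent" for 20 of the 25 lesion-free videos.
answers = [True] * 21 + [False] * 4 + [False] * 20 + [True] * 5

sens, spec = video_sensitivity_specificity(truth, answers)
print(sens, spec)  # 84.0 80.0
```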

Fig. 2 Example of still image tests. The recommended answers for this image set are category 3 or 4. Although this lesion was actually an intraductal papilloma, the differential diagnoses include intraductal papilloma or DCIS (non-invasive intracystic papillary carcinoma is classified as DCIS in Japan). The answers were evaluated as follows. Category 2, intraductal papilloma: sensitivity, false negative; category agreement, incorrect; disease agreement, correct. Category 3, intraductal papilloma: sensitivity, true positive; category agreement, correct; disease agreement, correct. Category 4, DCIS: sensitivity, true positive; category agreement, correct; disease agreement, correct. Category 5, DCIS: sensitivity, true positive; category agreement, incorrect; disease agreement, correct.
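The evaluation rules in the Fig. 2 example can be written out as a small sketch. The class and function names are hypothetical, and the sketch assumes that all recommended categories for an image set lie on the same side of the category 3 threshold, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class StillImageKey:
    """Recommended answers for one still image set (hypothetical structure)."""
    recommended_categories: FrozenSet[int]
    recommended_diseases: FrozenSet[str]


def score_answer(key: StillImageKey, category: int, disease: str) -> Dict[str, object]:
    """Score one answer for recall decision, category agreement, disease agreement."""
    needs_recall = {c >= 3 for c in key.recommended_categories}
    if len(needs_recall) != 1:
        # Assumption of this sketch: recommended categories never straddle 2/3.
        raise ValueError("recommended categories straddle the recall threshold")
    if needs_recall.pop():
        recall = "true positive" if category >= 3 else "false negative"
    else:
        recall = "true negative" if category <= 2 else "false positive"
    return {
        "recall": recall,
        "category_agreement": category in key.recommended_categories,
        "disease_agreement": disease in key.recommended_diseases,
    }


# The image set from Fig. 2: category 3 or 4; intraductal papilloma or DCIS.
fig2 = StillImageKey(frozenset({3, 4}), frozenset({"intraductal papilloma", "DCIS"}))

for cat, dis in [(2, "intraductal papilloma"), (3, "intraductal papilloma"),
                 (4, "DCIS"), (5, "DCIS")]:
    print(cat, dis, score_answer(fig2, cat, dis))
```

Under these assumptions, the four answers listed in the figure caption receive the same evaluations as in the caption, for example a category 5 answer with DCIS counts as a true positive with correct disease agreement but incorrect category agreement.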

Regarding each of these six items, the significance of the following differences was examined:

  1. Difference between physicians and technologists

  2. Differences among physicians in various fields of specialty such as breast surgery, other surgery, gynecology, radiology, working at screening institutions, and internal medicine; and between technologists working at a screening facility and those working at a hospital

  3. Differences according to the number of patients the observers had examined by breast ultrasonography during the past 5 years (self-reported) (<100, 100–499, 500–999, and ≥1,000)

  4. Differences according to age (20s, 30s, 40s, and 50s or above) of the observers

The differences between two groups such as physicians and technologists were examined by the t test, and those among three or more groups were examined by one-way ANOVA and Tukey’s multiple comparison procedure. All statistical analyses were performed using SAS 9.13, with a significance level of 5%.
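For readers unfamiliar with these tests, the following sketch computes the two test statistics from first principles using only the Python standard library. It is not the SAS procedure used in the study: the pooled-variance (Student) form of the t test is an assumption, since the text does not say which variant was applied, the scores are made up, and neither p-values nor Tukey's procedure are covered.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def pooled_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """One-way ANOVA F statistic; returns (F, df_between, df_within)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n - k)), k - 1, n - k


# Hypothetical per-participant scores (percent correct), not study data.
group_a = [80.0, 84.0, 88.0]
group_b = [84.0, 88.0, 92.0]
group_c = [70.0, 76.0, 82.0]

t, df = pooled_t_statistic(group_a, group_b)
f, df_between, df_within = one_way_anova_f([group_a, group_b, group_c])
print(t, df)
print(f, df_between, df_within)
```

With exactly two groups, the ANOVA F statistic equals the square of the pooled t statistic, which is why the t test is the usual choice for two-group comparisons and ANOVA for three or more.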

Results

Comparisons between physicians and technologists

The video sensitivity was 84.0% in physicians and 85.9% in technologists, being significantly higher in technologists (p = 0.037) (Table 5). The still image specificity was 85.1% in physicians and 86.6% in technologists, being significantly higher in technologists (p = 0.026). The percentage of disease name agreement was 78.4% in physicians and 81.1% in technologists, being significantly higher in technologists (p < 0.0001).

Table 5 Comparisons between physicians and technologists

Comparisons according to physician fields of specialty

Among physicians, breast surgeons and radiologists showed better performance in many measures than doctors from other fields, especially gynecologists (Table 6). No significant difference was noted in the still image sensitivity.

Table 6 Differences among physicians in various fields of specialty and between technologists working at a screening facility and those working at a hospital

No significant difference was observed between technologists working at a screening facility and those working at a hospital.

Comparisons according to the number of patients the subjects examined by breast ultrasonography during the past 5 years (self-reported)

The video sensitivity improved with increases in the number of patients for both physicians and technologists (Table 7). In physicians, video sensitivity was 76.1% in those who had examined <100 patients and 85.5% in those who had examined 100 or more, with a significant difference (p < 0.0001). In technologists, it was 79.8% in those who had examined <100 patients and 87.9% in those who had examined 100 or more, again showing a significant difference (p < 0.0001). The video specificity was 73.1% in technologists who had examined <100 patients and 82.8% in those who had examined 100 or more, with a significant difference (p < 0.0001).

Table 7 Differences according to the number of patients the subjects had examined by breast ultrasonography during the past 5 years

The still image sensitivity was 93.3% in physicians who had examined <100 patients and 96.4% for those who had examined 100 or more, with a significant difference (p < 0.0001). The still image specificity was 81.7% in physicians who had examined <100 patients, being significantly lower than the 85.7% in those who had examined 100 or more (p = 0.0037) and 88.0% for those who had examined 1,000 or more patients (p < 0.0001). The value in those who had examined 1,000 or more patients was significantly higher than the 85.0% in those who had examined 100–499 (p = 0.0389) and 82.4% for those who had examined 500–999 patients (p = 0.0002). In technologists, the still image specificity was significantly higher at 88.4% in those who had examined 1,000 or more patients than the value in those who had examined <100 (p = 0.0033).

The percentage of category agreement was highest in physicians who had examined 1,000 or more patients at 87.3% and was significantly higher than the 81.0% in those who had examined <100 (p < 0.0001) and 84.2% in those who had examined 100–499 patients (p = 0.004). In technologists, also, the percentage of category agreement was 82.4% in those who had examined <100 patients, being significantly lower than the 85.1% in those who had examined 100 or more patients (p < 0.0001), 86.5% in those who had examined 500–999 patients (p = 0.0084), and 87.0% in those who had examined 1,000 or more patients (p < 0.0001).

The percentage of disease name agreement was 73.0% in physicians who had examined <100 patients, being significantly lower than the 79.4% in those who had examined 100 or more (p < 0.0001). It was also lower than the 77.7% in those who had examined 100–499 patients (p = 0.0102), 77.6% in those who had examined 500–999 patients (p = 0.0297), and 81.7% in those who had examined 1,000 or more patients (p = 0.0001). The results were also better in those who had examined 1,000 or more patients than in those who had examined 100–499 patients (p = 0.0029).

In technologists, the percentage of disease name agreement was 76.6% in those who had examined <100 patients, being significantly lower than the 81.1% in those who had examined 100 or more (p < 0.0001). It was also lower than the 80.6% in those who had examined 100–499 patients (p = 0.0079), 83.0% in those who had examined 500–999 patients (p = 0.003), and 84.0% in those who had examined 1,000 or more patients (p < 0.0001). It was also higher in those who had examined 1,000 or more patients than in those who had examined 100–499 (p = 0.0201).

Comparisons according to age

In physicians, the video specificity was 81.8% in those aged 50 years and above, being significantly lower than in those aged <50 years (p < 0.0001) (Tables 8, 9). It was also significantly lower than the 82.4% in those in their 30s (p < 0.0001) and 81.3% in those in their 40s (p < 0.0001). In technologists, neither video sensitivity nor video specificity differed significantly between those aged 50 years and above and those aged <50 years. This is considered to have been partly due to the small number of subjects.

Table 8 Differences according to physician or technologist age
Table 9 Comparison of differences between physicians and technologists according to age

In physicians, the still image sensitivity was 94.4% in those aged 50 years or above, being significantly lower than the 96.5% in those aged <50 years (p = 0.0020). It was also significantly lower than the 96.9% in those in their 30s (p = 0.0024) and 96.7% in those in their 40s (p = 0.0037). The still image specificity was highest at 87.7% in physicians in their 40s, being significantly higher than the 82.2% in those aged 50 years or above (p < 0.0001). In technologists, no significant difference was noted in either still image sensitivity or still image specificity according to age.

The percentage of category agreement using still images was 82.0% in physicians aged 50 years or above, being significantly lower than the 84.8% in those aged <50 years (p = 0.0001). It was also significantly lower than the 85.2% in those in their 30s (p = 0.0061) and 87.0% in those in their 40s (p < 0.0001). In technologists, the percentage of category agreement was 80.4% in those aged 50 years or above, also being significantly lower than the 85.5% in those aged <50 years (p < 0.0023). It was also lower than the 85.5% in those in their 20s (p = 0.0078), 86.2% in those in their 30s (p = 0.0012), and 84.7% in those in their 40s (p = 0.0339).

The percentage of disease name agreement using still images was 74.6% in physicians aged 50 years or above, being significantly lower than the 78.4% in those aged <50 years (p < 0.0001). It was also lower than the 79.2% in those in their 30s (p = 0.002) and 80.9% in those in their 40s (p < 0.0001). In technologists, no significant difference was noted in the percentage of disease name agreement according to age.

Discussion

In Japan, both the morbidity and mortality rates of breast cancer are increasing, and breast cancer became the most frequent type of cancer among Japanese women, overtaking stomach cancer in 1993 [2]. The incidence of breast cancer in Japanese women reaches a peak in the late 40s, unlike in Western countries. In Japan, screening for breast cancer has long been performed by palpation, but mammography was introduced after the results of randomized comparative studies in Western countries were reported. Today, breast cancer screening primarily by mammography has become widely available for women aged 40 years or above. However, the sensitivity of mammography screening for breast cancer in women in their 40s is 71.4%, which is lower than the figures in those in their 50s and 60s [3]. This is probably because many women in their 40s present high-density to extremely dense breasts on mammography. To increase the detection rate of breast cancer by screening, the use of MRI for high-risk groups is recommended in the United States [4]. In Japan, however, the concomitant use of ultrasonography, which can be performed at most hospitals, may be more convenient, and a large-scale comparative study is presently being conducted involving women in their 40s [1].

Since breast ultrasonographic diagnosis is performed in real-time, the ability of the examiner to detect lesions, evaluate them, and create appropriate records greatly affects the sensitivity and specificity of breast cancer screening. In Japan, it is stipulated that examination results must be evaluated by physicians but that breast ultrasonography can actually be performed by either a physician or a technologist. In the United States, the results of screening ultrasonography performed by physicians have been reported [5], but screening by technologists is considered to be advantageous from the viewpoints of the number of available examiners and cost. In this study, the video sensitivity was evaluated to assess the subjects’ ability to detect breast cancer. Since it was significantly higher in technologists, screening by technologists is not considered to lead to the overlooking of lesions.

A low specificity is a problem with breast ultrasonography, because many lesions other than breast cancer are detected in this way. We previously performed tests using images before and after a 2-day training program and reported that the program was effective, because the sensitivity of screening using video and still images improved significantly after the program despite a slight but nonsignificant decrease in specificity [6]. In the program, the participants were trained to discriminate changes in the breast that should be eliminated from secondary examinations (clearly benign lesions or very small lesions that are likely to be benign, e.g., normal variations such as a lactating breast and the interposition of fat, surgically enlarged breasts, cysts, and typical fibroadenoma). The results of this study were obtained from participants of a 2-day training program, who had acquired the same knowledge about the assessment of lesions. The results of this study indicate that technologists have an ability to read both still images and videos comparable to that of physicians and are sufficiently capable of conducting primary screening.

Among physicians of various specialties, gynecologists did poorly in the tests. This may be partly because they were slightly older than other physicians: the average ages of gynecologists and of physicians in other specialties were 50 and 44 years, respectively. In addition, breast cancer screening follows the principle that secondary examinations should not be indicated for changes detected by breast ultrasonography that are unlikely to be cancer, which differs from the guidelines for uterine cancer screening. Gynecologists may not have been accustomed to this stricter elimination policy. There was no difference between technologists working at a screening facility and those working at a hospital. This may suggest that technologists at a screening facility are also learning to categorize and identify diseases as well as acquiring skills for detecting lesions, but it may also have been due to the difficulty in discriminating clearly between screening and hospital technologists, because some hospital technologists work in screening departments.

In terms of the number of patients the subjects had examined, those who had examined <100 patients did poorly in the tests. This suggests that learning in a training program alone is not sufficient and that clinical experience is important for improving screening ability. It also suggests that an examiner who has examined fewer than 100 patients should perform screening ultrasound under the supervision of experienced examiners.

The results of the subjects aged 50 years or above were poor. Fatigue caused by having to watch the computer screen for a long time and the degree of familiarity with computer operation may have been related to this outcome, and considerations such as extending the testing time and allowing the taking of breaks may be necessary for this age group.

Conclusion

We analyzed the results of tests using images after a breast ultrasonography training program. The test results were comparable between physicians and technologists and were poorer for participants with less clinical experience.