October 2011, Volume 197, Number 4

Health Care Policy and Quality

Original Research

Error Rates in Breast Imaging Reports: Comparison of Automatic Speech Recognition and Dictation Transcription

Affiliations:
1 Research and Development Division, Joint Department of Medical Imaging, Mount Sinai Hospital, University Health Network, Women’s College Hospital, Toronto, ON, Canada.

2 Gattuso Rapid Diagnostic Breast Centre, Princess Margaret Hospital, University Health Network, Toronto, ON, Canada.

3 Department of Biostatistics, Joint Department of Medical Imaging, Princess Margaret Hospital, University Health Network, Toronto, ON, Canada.

4 Department of Informatics, Joint Department of Medical Imaging, Mount Sinai Hospital, University Health Network, Women’s College Hospital, Toronto, ON, Canada.

5 Breast Imaging Division, Department of Medical Imaging, University of Toronto, Princess Margaret Hospital, 610 University Ave, Rm 3-922, Toronto, ON, Canada M5G 2M9.

Citation: American Journal of Roentgenology. 2011;197:923–927. DOI: 10.2214/AJR.11.6691

ABSTRACT

OBJECTIVE. The purpose of this study was to compare the error rates in breast imaging reports generated with automated speech recognition (ASR) technology as opposed to conventional dictation transcription.

MATERIALS AND METHODS. Breast imaging reports reviewed from January 2009 to April 2010 during multidisciplinary tumor board meetings at two hospitals were scrutinized for minor and major errors.

RESULTS. Of 615 reports obtained, 308 were generated with ASR and 307 with conventional dictation transcription. At least one major error was found in 23% of ASR reports, as opposed to 4% of conventional dictation transcription reports (p < 0.01). Major errors were most common in breast MRI reports (35% of ASR and 7% of conventional reports); the lowest error rates occurred in reports of interventional procedures (13% of ASR and 4% of conventional reports) and mammography reports (15% of ASR and no conventional reports) (p < 0.01). The error rates did not differ substantially between reports generated by staff radiologists and trainees or between reports generated by speakers whose first language was English and those whose first language was not English. After adjustment for academic rank, native language, and imaging modality, reports generated with ASR were 8 times as likely as conventional dictation transcription reports to contain major errors (p < 0.01).

CONCLUSION. Reports generated with ASR are associated with higher error rates than reports generated with conventional dictation transcription. The imaging modality used is a predictor of the occurrence of reporting errors. In contrast, the native language and academic rank of the speaker do not have a significant influence on error rate.

Keywords: breast imaging report, transcription, voice recognition

The use of automated speech recognition (ASR) software to generate reports in radiology is not new. In 1987, a voice recognition system with a 1000-word lexicon achieved word recognition accuracy greater than 95% [1]. Today, many medical imaging departments have adopted or are in the process of adopting ASR as an alternative to conventional dictation transcription. It has been suggested [2–4] that implementation of ASR in hospitals may help to reduce report production time and result in cost savings compared with conventional dictation transcription.

Despite the advantages of ASR technology, a high error rate has been observed in reports generated with this software [5]. A study of the frequency and spectrum of errors in thoracic oncology radiology reports generated with ASR technology [6] found an error rate of approximately 22%, higher than would be expected. However, little is known about the application of this technology to breast imaging reports. Our aim was to determine the frequency and spectrum of dictation errors in verified breast imaging reports generated with ASR compared with verified reports generated with conventional dictation transcription.

Materials and Methods

Institutional ethical approval was obtained for this study. We considered retrospective review of breast imaging reports the optimal assessment method for this study because it would exclude bias introduced by the knowledge that the reports would be scrutinized for errors.

Data Selection

Our joint department of medical imaging serves five major university-affiliated hospitals. The breast imaging division is active in three of these five teaching hospitals, and more than 250 patients are seen daily within the division. Breast imaging reports from January 2009 to April 2010 for patients whose cases were discussed at weekly multidisciplinary tumor board rounds at two hospitals were included in this study. The data collected included reports generated before implementation of a voice recognition system at one hospital and reports generated after implementation of a voice recognition system at the other. The training period during which voice recognition end users became comfortable with the new technology was not assessed in the study. A total of 14 staff members, 8 fellows, and 11 residents (third-year radiology residents or postgraduate fourth-year residents) were included in the analysis.

TABLE 1: Distribution of Breast-Imaging Modality, Speaker Experience, and Number of Reports per Radiologist by Report Method

Imaging reports included in the study were of mammography, breast ultrasound examinations, breast MRI, interventional breast procedures (ultrasound-guided large-core and fine-needle biopsies, preoperative needle localizations, MRI-guided core biopsy, stereotactic core biopsy, ductography, and clip placements), consultation reports, and specimen radiographs. Combined mammography and ultrasound reports were defined as a single report issued for two modalities. Most of the breast imaging reports were of diagnostic examinations of patients with suspected or proven breast cancer. Overall, the reports included in this study were of a higher complexity than reports of screening examinations, which are usually generated with report templates.

The voice recognition software used was SpeechMagic (version 6.1, service pack 2, Nuance). ASR reports were verified and signed by the author as they were generated. If the speaker was a fellow or resident, a staff member was responsible for reviewing the case before dictation of the report. Dictation was completed with a handheld speech microphone (ProPlus LFH5276, Philips Healthcare).

Conventional dictation transcription was undertaken using the E-RIS transcription system, version 1.44 (Merge Technology). Transcription was completed by transcriptionists experienced in breast imaging reporting. Once transcribed, reports were sent to the original speaker for electronic amendment and verification.

All reports dictated by attending radiologists or trainees were reviewed on the radiology information system at an electronic PACS workstation, corrected for errors, and verified, making these reports immediately available on the hospital clinical information system. The speaker assumed complete responsibility for report production, including correcting typographic errors generated by the voice recognition software or the transcriptionist.

Data Collection

Reports were reviewed for the presence of errors, and the errors were classified into 12 types (word omission, word substitution, nonsense phrase, wrong word, punctuation error, incorrect measurement, missing or added “no,” added word, incorrect verb tense, plural error, spelling mistake, incomplete phrase). Reports containing an error were reviewed independently by two observers—an undergraduate medical student and a radiologist with 17 years of experience in breast imaging, who determined the severity of the error: major or minor.

Errors that affected understanding of the report were considered major. For example, “There is no definite evidence of suspicious enhancement in either breast after the duodenum administration” was reported instead of “There is no definite evidence of suspicious enhancement in either breast after the gadolinium administration.” Errors affecting patient care were also considered major and were often caused by an incorrect unit of measure (millimeter/centimeter) or a missing or added “no,” such as “mammographic signs of malignancy” instead of “no mammographic signs of malignancy.” Errors that did not affect report understanding or patient care were labeled minor. When the two observers disagreed on the importance of an error, the errors were reevaluated case by case until consensus was reached.

The section of the report (clinical information, comparison examinations, findings, impression) in which the error was located, the academic rank of the speaker (faculty or staff member, fellow, resident), and the speaker’s native language (English or not) were recorded. Unstructured reports (free text) were excluded from the study.
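The recorded fields lend themselves to a simple per-error record. The following is a minimal sketch in Python of how each detected error could be coded for the analyses described in the next subsection; it is illustrative only, and all class and field names are hypothetical rather than part of the study:

```python
from dataclasses import dataclass
from enum import Enum

class ErrorType(Enum):
    """The 12 error types defined in this study."""
    WORD_OMISSION = "word omission"
    WORD_SUBSTITUTION = "word substitution"
    NONSENSE_PHRASE = "nonsense phrase"
    WRONG_WORD = "wrong word"
    PUNCTUATION = "punctuation error"
    INCORRECT_MEASUREMENT = "incorrect measurement"
    MISSING_OR_ADDED_NO = 'missing or added "no"'
    ADDED_WORD = "added word"
    INCORRECT_VERB_TENSE = "incorrect verb tense"
    PLURAL = "plural error"
    SPELLING = "spelling mistake"
    INCOMPLETE_PHRASE = "incomplete phrase"

class Severity(Enum):
    MINOR = "minor"  # does not affect report understanding or patient care
    MAJOR = "major"  # affects understanding of the report or patient care

@dataclass
class ReportError:
    report_id: str        # hypothetical report identifier
    method: str           # "ASR" or "conventional transcription"
    section: str          # clinical information, comparison, findings, impression
    modality: str         # mammography, ultrasound, MRI, interventional, ...
    speaker_rank: str     # staff, fellow, or resident
    native_english: bool  # whether the speaker's first language is English
    error_type: ErrorType
    severity: Severity    # assigned by consensus of the two observers
```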

Data Analysis

Descriptive statistics on the reports generated were summarized, and differences in imaging modality, academic rank, and native language were compared between ASR and conventional dictation transcription reports. The total number of reports with errors was calculated along with the total number of errors. Errors identified were classified as either minor or major. Results were recorded at both the error level (error type, error location within the report) and the report level (imaging modality, dictator). Error rate was defined as the total number of reports with errors divided by the total number of reports. Error rates were compared for different sections of the report, by imaging modality of the generated report, between native and nonnative English speakers, and by the speaker’s academic rank. Error rate comparisons between ASR and conventional dictation transcription reports were performed with logistic regression with generalized estimating equations to account for within-speaker clustering of reports [7]. In addition, multivariable logistic regression analysis with generalized estimating equations was used to determine independent predictors of the occurrence of errors in reports. All statistical tests were two-sided, and p < 0.05 was considered significant. All statistical analyses were performed with statistical software (SAS version 9.2, SAS Institute).
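For illustration, the clustered comparison could be set up as follows. This is a minimal sketch using the GEE implementation in Python’s statsmodels rather than the SAS software actually used in the study; the input file and column names are hypothetical:

```python
# Sketch of the error-rate comparison, assuming one row per report with
# hypothetical columns; the study itself used SAS version 9.2.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

reports = pd.read_csv("breast_reports.csv")  # hypothetical per-report table

# Logistic regression fit with generalized estimating equations (GEE) to
# account for within-speaker clustering of reports [7]; the multivariable
# model adjusts for academic rank, native language, and imaging modality.
model = smf.gee(
    "has_major_error ~ method + rank + native_english + modality",
    groups="speaker_id",                      # cluster: reports by the same speaker
    data=reports,
    family=sm.families.Binomial(),            # binary outcome: any major error
    cov_struct=sm.cov_struct.Exchangeable(),  # working correlation structure
)
result = model.fit()

# Exponentiated coefficients are adjusted odds ratios, e.g., ASR versus
# conventional transcription (reported as 8.39 for major errors).
print(result.summary())
print(np.exp(result.params))
```

Refitting the same model with a minor-error indicator as the outcome yields the corresponding adjusted odds ratio for minor errors.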

Results

We scrutinized 615 reports for errors: 308 reports generated with ASR (data from the hospital at which ASR had been used for 2 years) and 307 reports generated with conventional dictation transcription (data from the hospital that continued to rely on transcriptionists for report generation). A total of 33 speakers generated the 615 reports; 11 speakers used both ASR and conventional dictation transcription. Table 1 summarizes the study sample. The distributions of modality, speaker academic rank, and speaker native language all differed significantly between the two groups (p < 0.0001). More ASR reports than conventional dictation transcription reports were dictated by faculty members (88% vs 69%) and by speakers with English as their first language (33% vs 16%).

TABLE 2: Frequency of Major Errors in Breast Imaging Reports by Report Method

Among the 308 reports generated with ASR, 159 reports (52%) contained at least one error compared with 68 of the 307 reports (22%) generated with conventional dictation transcription (p < 0.01). Reports generated with ASR were also more likely than conventional reports to contain at least one major error (23% vs 4%, p < 0.01).

Error rates differed significantly by modality in reports generated with ASR. Major errors found with ASR were most common in MRI reports (35% of ASR, 7% of conventional reports) and combined mammography and ultrasound reports. The lowest error rates were found in reports of interventional procedures (13% of ASR, 4% of conventional reports) and mammography (15% of ASR, no conventional reports) (p < 0.01). No substantial difference in major error rates was seen between reports generated by staff and reports generated by residents or fellows using either ASR or conventional dictation transcription. Minor errors were more common in ASR reports dictated by staff than in those dictated by residents or fellows (37% vs 22%, p < 0.01). Error rates did not differ significantly between speakers whose first language was English and those whose first language was not English, for reports generated with either ASR or conventional dictation transcription.

For both ASR and conventional dictation transcription, errors of both severities were most commonly located in the findings section. For ASR, 57% of minor errors and 50% of major errors were in the findings section; for conventional dictation transcription, 78% of minor errors and 83% of major errors were in the findings section.

A total of 230 errors were found in 159 ASR reports. The most common error types were punctuation error (49 instances, 21% of total ASR errors), added word (46 instances, 20%), word omission (43 instances, 19%), and word substitution (39 instances, 17%). A total of 77 errors were found in 68 conventional dictation transcription reports. The most common error types were word substitution (15 instances, 19% of total conventional report errors), punctuation error (14 instances, 18%), word omission (13 instances, 17%), and added word (11 instances, 14%). The error types and frequencies are shown in Tables 2 (major errors) and 3 (minor errors). The results of the multivariable analysis to determine independent predictors of the occurrence of errors in reports are summarized in Table 4. After adjustment for the speaker’s academic rank and native language and the imaging modality, reports generated with ASR were more than twice as likely as reports generated with conventional dictation transcription to contain minor errors (adjusted odds ratio, 2.24; p < 0.01) and more than 8 times as likely to contain major errors (adjusted odds ratio, 8.39; p < 0.01).

The academic rank and native language of the speakers were found not to be independent predictors of the occurrence of major errors. By contrast, modality was found to be an independent predictor of the occurrence of major errors (p < 0.01). MRI and combined mammography and ultrasound reports were more likely to contain a major error than were reports of mammography alone (odds ratios, 4.4 and 3.4, respectively).

Discussion

Our data showed that breast imaging reports generated with ASR are 8 times as likely as reports generated with conventional dictation transcription to contain major errors, after adjustment for the native language and academic rank of the speaker and the breast imaging modality. Twenty-three percent of the ASR-generated reports reviewed in this study contained at least one error that could have affected understanding of the report or altered patient care. Quint et al. [6] found a similar prevalence of major errors (22%) in thoracic oncology reports generated with ASR. Other studies have also shown a higher prevalence of errors in reports generated with ASR than in those generated with conventional dictation transcription [8, 9].

Attempts have been made to explain the higher frequency of errors in ASR reports than in conventional dictation transcription reports. It has been suggested that reviewing reports between 6 and 24 hours after dictation may be helpful in detecting errors that can be missed when reports are verified immediately, as with ASR [6]. With ASR, in which the words appear as the radiologist dictates, the report is edited immediately. Because of the recency effect, the radiologist perceives that what he or she sees is what was dictated, increasing the risk of missing errors such as misspelled words and word omissions.

Lack of awareness of the high prevalence of errors in ASR reporting may also contribute to the higher error rates seen in ASR reports. Under pressure to decrease reporting times, radiologists unaware of the high frequency of errors associated with ASR may only superficially edit reports before signing them. In addition, background noise and the dictating environment can contribute to the higher error rates in ASR reports. It has been noted [8] that errors in ASR reports are more common when reports are generated in noisy settings. By contrast, with conventional dictation transcription, the background environment does not seem to affect the error rate [10, 11]. We did not measure background noise because it was beyond the scope of this project, and both groups of reports were generated in comparable environments.

At the report level, most errors were in the findings section. This result is not surprising because the findings section of reports is usually the most elaborate, having more complex and less standardized content.

The types of errors were comparable in reports generated with ASR and conventional dictation transcription (Tables 2 and 3). The most common type of error in reports generated with ASR was addition of a word to a sentence or phrase. This finding is not surprising because ASR technology relies on recognition of phonemes and strings of words rather than one word at a time, which can lead to automatic addition of a preposition after a specific word [12]. Ideally, spelling mistakes should not be found in reports generated with ASR because the software uses only words retrieved from its dictionary. The spelling mistakes found in the ASR reports therefore result from the manual editing performed by the verifier. Word substitution, word omission, and punctuation errors were common in both ASR and conventional dictation transcription reports.
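The mechanism can be made concrete with a toy example. The sketch below is an illustration only, not the actual recognition engine: a simple bigram language model with made-up probabilities shows why a decoder can prefer a hypothesis that inserts a preposition:

```python
# Toy bigram language model with invented log probabilities; a real ASR
# decoder combines such a model with acoustic scores over phoneme strings.
bigram_logprob = {
    ("evidence", "of"): -0.5,           # "evidence of" is very common
    ("of", "enhancement"): -1.2,
    ("evidence", "enhancement"): -4.0,  # rare without the preposition
}

def sequence_score(words):
    """Sum of bigram log probabilities; higher means more likely."""
    return sum(bigram_logprob.get(pair, -6.0)  # floor for unseen bigrams
               for pair in zip(words, words[1:]))

# The decoder favors the higher-scoring hypothesis, so "of" may be
# inserted even when the speaker did not clearly say it.
print(sequence_score(["evidence", "of", "enhancement"]))  # -1.7
print(sequence_score(["evidence", "enhancement"]))        # -4.0
```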

TABLE 3: Frequency of Minor Errors in Breast Imaging Reports by Report Method

In the analysis of predictors of error outcomes, the imaging modality that was the subject of the report was found to be an independent predictor of error rate. MRI reports were 4.4 times as likely as mammography reports to contain major errors (p < 0.01). Reports of ultrasound alone and of ultrasound combined with mammography were more than twice as likely as mammography reports to contain major errors (p < 0.01). MRI is often performed in complex cases, and combined mammography and ultrasound reports tend to be longer than single-modality reports, so a higher incidence of errors would be expected. In addition, mammography reports tend to follow a more rigorous and standardized structure in our department. BI-RADS, initially developed by the American College of Radiology [13] to standardize reporting in mammography, has been successfully adopted by our radiologists and has helped improve the quality of our mammography reports. Although our department uses BI-RADS for MRI and breast ultrasound reports as well, it has not been as widely and successfully adopted for those modalities.

In this study, the native language and academic rank of the speaker were not predictors of the occurrence of errors in radiology reports generated with ASR or with conventional dictation transcription. The previous literature on this point is mixed: one study showed that the native language of the speaker is a strong predictor of mistakes [8], whereas another showed no significant difference in error rate between native and nonnative English speakers [6].

Even though most errors detected in our study did not alter patient care because all cases had been discussed in multidisciplinary team meetings, the presence of major errors tends to make reports confusing and difficult to read. More important, the clinicians involved in any one patient’s care may not be limited to those affiliated with the academic teaching hospital at which imaging was performed. For a given breast report (e.g., mammography), many clinicians may view the report (e.g., general practitioner, surgeon, medical oncologist, nurse practitioner). Some clinicians may have the opportunity to review the original images and the report with the reporting radiologist, but many others may not have this opportunity, and decisions regarding patient care may be based solely on the imaging report.

TABLE 4: Results of Multivariable Analysis to Determine Independent Predictors of Occurrence of Errors

A radiology report is in many respects the single most important factor by which radiologists are judged by their clinical colleagues [14]. The abundance of grammatical errors found in this study can be perceived as reflecting a lack of professionalism and carelessness on the part of the reporting radiologist.

Our study had several limitations. ASR technology had been used for only 24 months at our institution, whereas conventional dictation transcription had been used for a decade. This factor might have contributed to the considerable difference in error rates. It will be interesting to conduct a similar study after ASR has been in use for many years at our institution to compare the error rates of ASR and conventional dictation transcription. Another limitation was that the systems were tested only in complex breast imaging scenarios. The reports reviewed often had multiple imaging findings, and standardized templates were not regularly used. For normal findings and routine breast screening reports, an entire report can be generated from a template. Thus, the number of possible errors in this study might have been artificially elevated, possibly biasing the results against speech recognition. Finally, this study was conducted in an academic medical environment with several levels of trainees, so it is difficult to extrapolate the findings to private practice groups.

Conclusion

Complex breast imaging reports generated with ASR were associated with higher error rates than reports generated with conventional dictation transcription. The native language and academic rank of the speaker did not have a strong influence on error rate. By contrast, the imaging modality, such as MRI, was found to be a predictor of major errors in final reports. Careful editing of reports generated with ASR is crucial to minimizing error rates in breast imaging reports.

References
1. Robbins AH, Horowitz DM, Srinivasan MK, et al. Speech-controlled generation of radiology reports. Radiology 1987; 164:569–573
2. Gale B, Safriel Y, Lukban A, Kalowitz J, Fleischer J, Gordon D. Radiology report production times: voice recognition vs. transcription. Radiol Manage 2001; 23:18–22
3. Sferrella SM. Success with voice recognition. Radiol Manage 2003; 25:42–49
4. Marquez LO. Improving medical imaging report turnaround times. Radiol Manage 2005; 27:34–37
5. Kanal KM, Hangiandreou NJ, Sykes AM, et al. Initial evaluation of a continuous speech recognition program for radiology. J Digit Imaging 2001; 14:30–37
6. Quint LE, Quint DJ, Myles JD. Frequency and spectrum of errors in final radiology reports generated with automatic speech recognition technology. J Am Coll Radiol 2008; 5:1196–1199
7. Zeger SL, Liang KY. Longitudinal data analysis for discrete and continuous outcomes. Biometrics 1986; 42:121–130
8. McGurk S, Brauer K, Macfarlane TV, Duncan KA. The effect of voice recognition software on comparative error rates in radiology reports. Br J Radiol 2008; 81:767–770
9. Rana DS, Hurst G, Shepstone L, Pilling J, Cockburn J, Crawford M. Voice recognition for radiology reporting: is it good enough? Clin Radiol 2005; 60:1205–1212
10. White KS. Speech recognition implementation in radiology. Pediatr Radiol 2005; 35:841–846
11. Krishnaraj A, Lee JK, Laws SA, Crawford TJ. Voice recognition software: effect on radiology report turnaround time at an academic medical center. AJR 2010; 195:194–197
12. White GM. Speech recognition: a tutorial overview. Computer 1976; 9:40–53
13. D’Orsi CJ, Mendelson EB, Ikeda DM, et al. Breast Imaging Reporting and Data System: ACR BI-RADS—breast imaging atlas. Reston, VA: American College of Radiology, 2003
14. Reiner BI, Knight N, Siegel EL. Radiology reporting, past, present, and future: the radiologist’s perspective. J Am Coll Radiol 2007; 4:313–319

Address correspondence to A. M. Scaranelo.
