Original Research
Department of Radiology, University of Florida Health Center, PO Box 100374, Gainesville, FL 32610.
Received November 1, 2004;
accepted after revision December 8, 2004.
C. L. Sistrom was funded by the General Electric/Association of University
Radiologists Research Fellowship from July 2000 through June 2003.
Abstract
MATERIALS AND METHODS. A Web-based testing mechanism was used to present radiology reports to each of 16 senior medical students and record their answers to 10 multiple choice questions about specific medical content for each of 12 cases. Subjects were randomly assigned to view the reports in either free text or structured format. In addition to number of answers correct for each case, we recorded the time taken for each case and an efficiency score (correctly answered questions per minute). These three outcomes were tested for differences on report format using multifactorial analysis of variance. A postexperimental questionnaire and a mediated focus group elicited subject preference as to radiology report format.
RESULTS. There were no significant differences in the three outcomes (score, time, and efficiency) between the free text and structured format conditions. The power of the experiment was sufficient to detect small differences in these outcomes by format. Subjects strongly and consistently expressed a preference for the structured version.
CONCLUSION. We assert that free text and itemized (structured) forms of radiology reports are equally efficient and accurate for transmitting case-specific interpretative content to reviewers of the document.
Introduction
The distinction between reporting into structure and reading structure is by no means purely academic. Current technology allows a clinical document to be produced in one format and displayed in an entirely different way. For example, one vendor of a structured computerized reporting product (eDictation) uses a sophisticated interface to allow radiologists to choose relevant findings from a large and highly structured menu of possibilities. At the same time, the software produces phrases and sentences that look just like a typical narrative report and it is this document that is made available for referring physicians to review. The assumption is that clinicians will be more comfortable with radiology reports that look like what they are used to reading, that is, narrative free text. Our research was designed to obtain empiric data about this question. We sought to examine the clinical utility of radiology reports in terms of information transfer to the reader. Specifically, what are the effects of report format (independent of content) on the efficiency with which medical personnel can read them and obtain information needed for patient care? Our working hypothesis was that consistently formatted (structured) reports would be easier to read and comprehend resulting in greater efficiency for the task of answering content-specific questions.
Materials and Methods
For each report, we generated a series of 10 multiple-choice items, each having a question stem and three to five options. Each item was designed to have a single option that was unambiguously correct based only on the content of the report to which it referred. The reports and candidate questions were administered to three senior medical students using a Web-based testing system (described below) that allowed them to refer back to the report text as needed. These students did not participate as subjects in the subsequent experiment. We also gave printed versions of the questions and associated reports to two faculty radiologists and two senior radiology residents and asked them to evaluate the items for clarity and consistency. Using feedback from all of these evaluators, we corrected the wording of one of the question stems and three of the options (all distracter items). We also eliminated one option (also a distracter).
The original reports were in narrative format with variable use of headings for indication, comparison, examination details, and findings. All of the original reports had a labeled impression section. These 12 reports along with the 10 questions pertinent to each one formed the free text condition of the experiment. A report structure shell was designed for each of the three types of studies. These consisted of the components typically found in narrative radiology reports with additional headings in the findings section. For abdominal CT and sonography, these basically were anatomic. For head CT, we combined anatomic and functional headings. The templates for all three examination types are listed in Appendix 1.
For each case, we parsed the free text report into the appropriate structured template. This was done so that all the original content was exactly and completely replicated in the structured version. We were careful to leave basic sentence structure and word choice intact. We duplicated or eliminated words or phrases as needed to maintain proper syntax in the structured version. For example, the free text might contain "The pancreas, spleen, and kidneys are unremarkable." The word "unremarkable" would be placed after the pancreas, spleen, and kidneys headings in the structured version. If a heading in the structured template did not have any relevant content from the free text version, it was left in the report with no text after it. Appendix 2 is an example of a free text abdominal CT report and Appendix 3 contains the structured version. With 12 unique cases, each having two versions (free text and structured format), there were 24 total cases.
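The parsing described above was done by hand, but the resulting layout can be illustrated with a minimal Python sketch. The heading names and the `render_structured` helper are hypothetical and chosen only for illustration; the actual templates are those in Appendix 1.

```python
# Hypothetical subset of headings; the real templates are listed in Appendix 1.
ABDOMINAL_CT_HEADINGS = ["LIVER", "PANCREAS", "SPLEEN", "KIDNEYS"]


def render_structured(headings, findings):
    """Return one 'HEADING: text' line per heading.

    Headings with no content in the free text report are kept in the
    output with no text after them, as described in the paragraph above.
    """
    lines = []
    for heading in headings:
        text = findings.get(heading, "")
        lines.append(f"{heading}: {text}".rstrip())
    return "\n".join(lines)


# "The pancreas, spleen, and kidneys are unremarkable." becomes three entries.
findings = {
    "PANCREAS": "unremarkable",
    "SPLEEN": "unremarkable",
    "KIDNEYS": "unremarkable",
}
report = render_structured(ABDOMINAL_CT_HEADINGS, findings)
print(report)
```

In this sketch the LIVER heading is printed with nothing after it because the example finding does not mention the liver.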
Experimental Details
Our College of Medicine uses a locally developed Web-based testing system
for all examinations administered to students. It has been in use for more
than 7 years and by the time our medical students have reached their fourth
year, they have taken at least 20 tests using it. Since we used the same
system to administer our experimental tests, all of the subjects were quite
familiar with its function and appearance. This eliminated any need to train
subjects before their participation and reduced any potential variability in
performance based on differential learning of the testing procedure
itself.
The experimental testing software was designed to present each case as a set of three Web pages joined into a common frame set. The first page gave a brief description of a clinical presentation derived from the clinical history and the stated reason for the examination in question. A button (labeled GO TO THE REPORT) on the first page caused a second page containing the report text to be displayed. Buttons above and below the report text (labeled GO TO QUESTIONS) served to advance to the third page containing all 10 questions relating to the report. The question page also had buttons (labeled GO BACK TO THE REPORT) at the top and bottom that caused redisplay of the report text while keeping the question page active in the background, thus preserving its state. Subjects could switch back and forth between the report and questions as often as needed and their previous answers would remain intact. Figures 1A, 1B, and 1C depict the three frames making up a single test case.
The Web pages all had JavaScript code (Sun Microsystems) embedded in them to record a time stamp (at 1/10 sec precision) for every button click and item selection as the subject navigated through the case and answered questions. A submit button on the question page served to record the answers and the time-stamped navigation data. Submission was not accepted until all 10 questions had been answered. It is important to note that the design of the testing mechanism caused the entire set of pages comprising a single case to be transferred from Web server to the local computer when the subject was ready to start. All navigation and recording of time stamps was accomplished locally and required no further traffic between the testing computer and the server. Thus, there was no possibility that timing during a single case might be confounded by variations in network speed. All testing was performed on a single workstation located in a quiet room. This computer had a single 18-inch flat panel color monitor. Relevant software included Windows 2000 professional and Microsoft Internet Explorer. It was connected to our hospital's internal network.
All of the subjects of the experiment were senior medical students at our institution. They were recruited by means of fliers posted in the medical school teaching complex. The research (and the flier) had prior approval by our local IRB. Subjects were required to have taken and passed their medicine and surgery clinical clerkships before participating in the research. For incentive, a $100 account was opened for each participant at the College of Medicine bookstore.
Each of the subjects was assigned all 12 cases to do during a single experimental session. Six of these cases were presented with the free text version of the report and the remaining six with the structured version of the report. The assignment of which of the 12 cases was presented to each subject as free text versus structured was randomized, with one restriction: We balanced the number of times each case was presented in free text versus structured format across the entire experiment. The order in which subjects took their 12 cases was randomized, with one restriction: We balanced the assignment so that each case would be presented an equal number of times in each quartile of the case order (1-3, 4-6, 7-9, 10-12).
Our initial power calculations called for eight subjects. This allowed a symmetric design in both the format and case order factors. We performed a second repetition of the experiment with an additional eight subjects, resulting in a total of 192 sets of case/subject responses for analysis. During the second repetition, the same randomization scheme was used with the only difference being that the assignment of the free text or structured format version of the report was reversed. Testing of all 16 subjects (5 women and 11 men) was completed in the 2002-2003 academic year. There were no dropouts or technical failures during the experiment. Each subject completed all 12 cases in the assigned sequence during a single session, and none reported any problems with the testing system.
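One way to obtain a format assignment with the balance described above is sketched below in Python. This is an illustrative construction, not the procedure we actually used: it shuffles subject and case labels and then alternates formats in a checkerboard pattern, and it omits the balancing of case order across quartiles. The function name `assign_formats` and its parameters are assumptions made for the example.

```python
import random


def assign_formats(n_subjects=8, n_cases=12, seed=0, reverse=False):
    """Return {(subject, case): format} with balanced format counts.

    Each subject receives half of the cases in each format, and each case
    is shown in each format to half of the subjects. Setting reverse=True
    flips every assignment, as in the second repetition of the experiment.
    """
    rng = random.Random(seed)
    subjects = list(range(n_subjects))
    cases = list(range(n_cases))
    rng.shuffle(subjects)  # randomize which subject gets which row
    rng.shuffle(cases)     # randomize which case gets which column
    plan = {}
    for i, subject in enumerate(subjects):
        for j, case in enumerate(cases):
            free_text = (i + j) % 2 == 0  # checkerboard keeps both margins balanced
            if reverse:
                free_text = not free_text
            plan[(subject, case)] = "free text" if free_text else "structured"
    return plan


first_repetition = assign_formats()
second_repetition = assign_formats(reverse=True)
```

With eight subjects and 12 cases per repetition, the two plans together cover the 192 case/subject combinations analyzed.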
Following completion of the testing, 15 of the 16 subjects met with the principal investigator in a "debriefing" focus group. Before this meeting, subjects had been told only that the purpose of the experiment was to assess their ability to extract information from reports. At the beginning of the meeting, subjects were given a brief questionnaire to fill out. This asked several general questions about radiology report format and content. Next, a brief paragraph defined free text and structure followed by an example of each format (similar to Tables 1 and 2). Next, a set of eight Likert-scaled items asked for their preference concerning various functional aspects of reading radiology reports (accuracy, speed, certainty, items not mentioned, positive findings, negative findings, and general preference) based on their experience with the test cases. The preference scores were anchored as follows: 1 = prefer free text, 5 = no preference, 10 = prefer structure. A mediated discussion was then conducted to elicit opinions about radiology report structure and content. Note that subjects were not given any feedback, either after they participated or during the debriefing meeting, concerning individual or aggregate performance in answering questions for the cases.
Statistical Analysis
Each subject's participation generated a set of 12 experimental results.
These consisted of answers to the 10 questions for a case and the time-stamped
navigation data. We scored the subject's answers against a key to obtain the
number correct. The start time was subtracted from the final submission time
to obtain number of seconds taken to do each case. An efficiency score was
then calculated for each case by dividing the number of questions answered
correctly by the number of seconds taken to finish the entire case. This
result was multiplied by 60 to give the number of correctly answered questions
per minute. The time-stamped navigation activity records were processed to
obtain two outcomes for each case. The number of times the subject moved back
to view the report from the question page was tabulated. This outcome could
take any positive integer and will be called report views. The number of
answer selections made during each case was counted as well. This outcome was
at least 10 and was higher when subjects changed their minds about one or more
answers. Thus, there were five outcomes analyzed: number of questions correct,
time in seconds taken to complete the case, efficiency score, report views,
and answer selections.
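The derivation of the per-case outcomes can be sketched in Python as follows. The event names (`start`, `back_to_report`, `select_answer`, `submit`) and the `summarize_case` helper are hypothetical; the actual format of the time-stamped navigation log is not reproduced here, and the example log is shortened for readability (a real case has at least 10 answer selections).

```python
def summarize_case(events, n_correct):
    """Compute time, efficiency, report views, and answer selections.

    events: list of (timestamp_in_seconds, action) tuples in time order,
            beginning with the start of the case and ending with submission.
    n_correct: number of questions answered correctly for the case.
    """
    start_time = events[0][0]
    submit_time = events[-1][0]
    seconds = submit_time - start_time
    return {
        "seconds": seconds,
        # correctly answered questions per minute
        "efficiency": n_correct / seconds * 60,
        "report_views": sum(1 for _, action in events if action == "back_to_report"),
        "answer_selections": sum(1 for _, action in events if action == "select_answer"),
    }


# Abbreviated, made-up log for a single case.
example_events = [
    (0.0, "start"),
    (45.0, "go_to_questions"),
    (60.0, "select_answer"),
    (80.0, "back_to_report"),
    (95.0, "select_answer"),
    (300.0, "submit"),
]
summary = summarize_case(example_events, n_correct=8)
```

For this made-up case, eight correct answers in 300 sec correspond to 8 / 300 × 60 = 1.6 correctly answered questions per minute.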
We used the Statistical Analysis System (SAS Version 9 for Windows, SAS Institute) for all data manipulation and statistical calculation. For all tests of significance, we set p = 0.05 as the cutoff and used two-tailed alternate hypotheses. The analysis was done on the basis of a balanced incomplete block design. The factors included examination type (3 levels), report format (2 levels), and individual cases (4 levels per type). Thus, there were 24 factor level combinations (treatments), a block size of 12 (cases per subject), and eight blocks (subjects). The experiment was replicated twice with two groups of eight subjects for a total of 16 subjects and 192 experimental units. Summary statistics for the five outcomes were generated, including mean, median, mode, SD, frequency distribution plot, and normal probability plot.
We performed analysis of variance with a general linear model procedure
(SAS PROC GLM) to test for differences in each of the five outcomes (percent
of questions correct, time taken, efficiency, report views, and answer
selections) jointly related to our independent variables. We used the same
linear model for each outcome. Fixed effects included report format, case
type, and the order that the case was presented to the subject. Random effects
included case identity nested within case type, subject identity, and report
format crossed with subject identity. The model was specified as follows:
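$$
Y = \beta_0 + \beta_1\,\mathrm{Format} + \beta_2\,\mathrm{CaseType} + \beta_3\,\mathrm{Order} + \beta_4\,\mathrm{Case(CaseType)} + \beta_5\,\mathrm{Subject} + \beta_6\,(\mathrm{Format} \times \mathrm{Subject}) + \varepsilon
$$
where Y is the outcome being modeled, the coefficients are numbered in the order in which the effects are listed above, and ε is the residual error.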
Standard F statistics using the type 3 sums of squares and appropriate error terms were used to test the coefficients (β1 to β6) against the null hypothesis of no effect (βi = 0). The Duncan procedure was used to perform multiple comparisons of mean number of report views by case type [3]. Since none of the fixed effects was significant in any of the other models, no additional multiple comparisons were performed. The two outcomes relating to subjects' test-taking habits (report views and answer selections) were correlated within subjects, formats, case types, and overall to obtain Pearson correlation coefficients [4].
Because our results showed equivalence on the main variable of interest
(report format) we performed post hoc power analysis. The sample size was
initially set with eight subjects each looking at 12 cases for a total of 96
experimental units. We were able to double the planned sample size because
many students responded to the request for participation and running the
experiments was quite easy due to all subjects' familiarity with the testing
system. Our analysis was based on paired testing between free text and
structured format with α = 0.05 and β = 0.10 (90% power). We used
the root mean square error from the analysis of variance output as the
estimate of sigma for each outcome.
For the postexperimental survey, Likert-scaled items from the debriefing questionnaires were summarized by calculating the median value, the 10th percentile, and the 90th percentile. The general preference items were enumerated and percentages calculated. The qualitative content of the debriefing focus group was summarized from a transcription of a tape recording made during the session.
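The summary of a single Likert-scaled item can be sketched in Python with the standard library. The responses below are invented for illustration and are not the study data; `statistics.quantiles` with `n=10` returns the nine decile cut points, of which the first and last are the 10th and 90th percentiles under its default interpolation method.

```python
import statistics

# Made-up responses from 15 subjects on the 1-10 preference scale
# (1 = prefer free text, 10 = prefer structure); not the study data.
responses = [10, 9, 10, 8, 10, 7, 10, 9, 10, 10, 5, 10, 9, 8, 10]

deciles = statistics.quantiles(responses, n=10)  # nine cut points
likert_summary = {
    "median": statistics.median(responses),
    "p10": deciles[0],   # 10th percentile
    "p90": deciles[-1],  # 90th percentile
}
print(likert_summary)
```

Other percentile conventions (for example, those used by different statistical packages) can give slightly different values for small samples such as this one.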
Results
The number of correct responses (score) ranged from two to 10 with a mean of 8.35 and SD of 1.52. The distribution was negatively skewed with median and mode both being nine. None of the fixed effects (format p = 0.35, order p = 0.27, and case type p = 0.92) were significant and there was no interaction between format and subject effects on the score. The majority of the variance (60%) was partitioned between subjects and between cases within case type and R squared for the model was 0.58. The analysis of variance results are reproduced in Table 1.
The time taken to complete the cases ranged from 30 to 707 sec with mean of 351 and SD of 108. The distribution was not skewed with a median of 341 but there was some kurtosis. The single observation at 30 sec was a distinct outlier with the next lowest value being 102 sec (first percentile). None of the fixed effects were significant (format p = 0.41, order p = 0.99, and case type p = 0.22) and there was no interaction between format and subject effects on the time. The majority of the variance (62%) was partitioned between subjects and between cases within case type and R squared for the model was 0.71. The analysis of variance results are reproduced in Table 2.
The efficiency (number of correctly answered questions per minute) ranged from 0.52 to 4.7 with mean of 1.56 and SD of 0.56. The distribution was nearly normal except that the positive tail was longer. This was due to the same outlier (30 sec to complete the case) mentioned above. None of the fixed effects were significant (format p = 0.92, order p = 0.60, and case type p = 0.48) and there was no interaction between format and subject effects on the efficiency. Just under half of the variance (46%) was partitioned between subjects and between cases within case type and R squared for the model was 0.57. The analysis of variance results are reproduced in Table 3.
The number of times that any answer was selected for each case (answer selections) ranged from 10 (obligate floor value) to 19 with a mean of 11.5, SD of 1.68, median of 11, and mode of 10. The distribution was, as expected, not normal and looked more like a Poisson type with a mean of 11.5. We elected to proceed with analysis of variance despite the violation of normality because this outcome was considered to be secondary and the results were relatively uninteresting. The only significant effect was between subjects. Report format and case type had the same mean number across all levels.
The number of times subjects went back to look at the report text (report views) while answering questions ranged from two to 32 with a mean of 12, median of 11, and SD of 6.8. The distribution was near normal, allowing for the discrete nature of the outcome. The analysis of variance showed no difference in the mean number by report type (structured = 13.4, free text = 12.3, p = 0.93). Again, there was considerable variance between subjects (p < 0.0001) with a minimum of 4.7 times per case ranging up to 25 times per case. Interestingly, there was a significant difference between the types of case (p = 0.0016) even though there was no significant difference between the individual cases (p = 0.11). The mean number of times the report was consulted for abdominal CT (13.1) was essentially the same as for abdominal sonography (12.9). However, for head CT, subjects went back to look at the report an average of 11 times. We tested for interaction between case type and subject effects and found none. Therefore, the tendency to look back at head CT reports less frequently than sonography and abdominal CT reports was shared by all subjects.
Across the entire sample of 192 cases, the correlation between report views and answer selections was weakly positive with a Pearson coefficient of 0.22 (p = 0.002). When we examined the relationship between report views and answer selections for each subject, only three of the 16 had significant correlations. These were all positive (0.58, 0.64, 0.75) and probably accounted for the aggregate correlation. The correlations between report views and answer selections stratified by case type and report format were all weakly positive (0.18 to 0.28). The one exception was the head CT cases, where there was no correlation between numbers of report views and answer selections.
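For reference, the Pearson coefficient used in these comparisons can be computed as in the following Python sketch. The paired values are invented for illustration and are not the study data; the analysis itself was run in SAS.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up report views and answer selections for five cases.
report_views = [4, 10, 12, 7, 20]
answer_selections = [10, 11, 12, 10, 14]
r = pearson_r(report_views, answer_selections)
```

A coefficient near 0 indicates no linear relationship, and values near +1 or -1 indicate a strong positive or negative linear relationship.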
The head CT cases seemed to elicit a somewhat different pattern of report viewing unrelated to the number of times answers were selected. Considering that the head CT structured format was functionally organized rather than strictly by anatomy, we wanted to be sure there was no interaction between the type of case and our main independent variable of interest, the report format. When we added this interaction term to the analysis of variance models for score, time, and efficiency, we were reassured to find that it was not significant for any of the outcomes. Thus, the finding that report format had no effect on the outcomes is conclusive and holds across case type.
For the post hoc power analysis, the root mean square errors (sigma) were
as follows: score = 1.32, time (seconds) = 65, and efficiency = 0.43. As
described above, we set α = 0.05 and power to 90%. For score, we could
have detected a difference of about 0.5 in the mean of correctly answered
questions. The observed mean scores were 8.28 for free text and 8.43 for
structured format, with a difference of 0.15. For time to complete each case,
we could have detected a difference of about 30 sec. The observed mean times
were 355 sec for free text and 347 sec for structured format with a difference
of 8.4 sec. Finally, for efficiency of answering questions, we could have
detected a difference of about 0.2 questions per minute. The observed
efficiency for free text was 1.57 questions per minute and for structured
format, it was 1.55 questions per minute for a difference of 0.02 questions
per minute. For all three of the main outcomes of interest, the observed
effect of report format was far less (approaching an order of magnitude) than
the difference our experiment was powered to detect.
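The order of magnitude of these detectable differences can be checked with a simple normal approximation, sketched below in Python. This is not the SAS calculation we ran: it assumes a two-sided comparison of two groups of 96 observations each (one group per format), and the function name `minimal_detectable_difference` is introduced only for this example.

```python
from math import sqrt
from statistics import NormalDist


def minimal_detectable_difference(sigma, n_per_group, alpha=0.05, power=0.90):
    """Smallest mean difference detectable under a two-group normal approximation.

    sigma: estimated SD of the outcome (here, the root mean square error).
    n_per_group: number of observations in each format condition (assumed 96).
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = standard_normal.inv_cdf(power)          # quantile for the desired power
    return (z_alpha + z_power) * sigma * sqrt(2 / n_per_group)


for outcome, sigma in [("score", 1.32), ("time (sec)", 65), ("efficiency", 0.43)]:
    print(outcome, round(minimal_detectable_difference(sigma, 96), 2))
```

Under these assumptions the approximation gives roughly 0.62 for score, 30 sec for time, and 0.20 questions per minute for efficiency. The time and efficiency values agree with those reported above, and the score value is somewhat larger than the reported figure of about 0.5, which suggests the exact procedure differed in some detail from this sketch.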
Qualitative Results
The debriefing meeting was attended by 15 of 16 subjects. One of the male
students was serving in a clinical clerkship at another institution. One of
the general questions asked for preference about the report organization. One
option to this question was "like a laboratory report." By this we
meant standardized headings in the body with results organized under these
headings. The choices and number of responses were as follows: like a
laboratory report (11/15 = 73%), like a newspaper story (1/15 = 7%), and in
the current (unstructured) format (3/15 = 20%). We also asked how they would
prefer to have uncertainty expressed. The choices and responses were in words
(8/15 = 53%), as a semiquantitative scale (3/15 = 20%), and in quantitative
terms (2/15 = 13%). The final general question asked about how subjects would
respond to an explicitly worded recommendation in a radiology report. The
responses were as follows: the subject would be compelled to follow the
recommendation (2/15 = 13%), they might be compelled to follow the
recommendation (10/15 = 67%), and they would not feel so compelled (3/15 =
20%).
The responses to the Likert-scaled questions about preference between free text (1) and structured format (10) are summarized in Table 4 with median, mode, and range for each one. Clearly, the subjects strongly tended to prefer the structured format for the seven separately articulated domains as well as overall. During the mediated discussion, this perception was reinforced. At least five subjects clearly expressed the opinion that they would like to see all radiology reports formatted in a manner similar to our structured condition (as in Appendix 3). One participant mentioned that the headings should be consistent across all instances of a report type. He said "it might be confusing if some people include biliary system under liver and some people included it under gallbladder, for example." Others felt that the order of headings should be altered dynamically so that the abnormal findings would be at the top of the report. One potential drawback to the structured format was expressed as follows: "The overall gestalt of the examination might be lost in structure whereas with free text a sense of severity and more acuity can be conveyed." A corollary opinion that was expressed quite forcefully and frequently was that reports should still have a clearly labeled impression section. The concept of organizing a report like a newspaper story was linked to the need for an impression. Like a news story, subjects wanted to see reports have an equivalent of the lead paragraph in which findings are synthesized and condensed to allow rapid review and to highlight the diagnostic impression.
Discussion
We assert that there is no effect of report format on speed, accuracy, or efficiency for our subject population reading the types of reports we presented to them. By extension, we suggest that the same phenomenon may pertain to practicing physicians. If this is true, designers of reporting systems may not need to work so hard to produce old-style narrative documents out of structured elements. At the same time, it would seem that free text reports are not as difficult to read for content as some believe. The choice of report format and structure may be made on the basis of referring physician preference and considering the effect of reporting into structure by radiologists. However, we advise those seeking to fundamentally change the way in which radiology reports are created and displayed to proceed with caution in consideration of the following. In a recent article, Ash et al. [5] discussed unintended consequences of overemphasizing structured information entry in health care informatics. They cite evidence from studies in cognitive psychology and sociology that in a shared context, concise, unconstrained, free text communication is the most effective for coordinating work around a complex task [5, 6]. Overly structured data can lead to loss of cognitive focus by clinicians, both during input and review. This can cause clinicians to experience a loss of overview about the case at hand when they have to attend to data contained in many different fields, sometimes on different screens within an interface [7, 8]. Furthermore, the act of writing or dictating in narrative form may be integral to the cognitive processing of the case [9]. Our finding of dissonance between subjects' preferences concerning report format and their actual performance reading them for comprehension confirms that the cognitive issues are complex. In our opinion, tried and true methods of authoring and displaying radiology reports should not be abandoned without considering the consequences.
Our follow-up session and the questionnaires completed by the subjects shed light on reader preference for report format. They all strongly and consistently preferred the structured version to the free text. This preference for structured format was consistent across all seven domains that we asked about with modal values on the 10-point Likert scale all being 10 (prefer structure). Also, the corollary question about general report organization resulted in 73% preferring a "laboratory report" format over the alternatives. The opinions of our subjects are entirely consistent with other workers' findings with respect to physician preferences about radiology reports. There is a large body of published research detailing the opinion of referring physicians regarding the content and format of radiology reports [10-15]. The terminology differs somewhat but attributes consistently endorsed by consumers (readers) of radiology reports include complete, itemized, and structured. Another element that is commonly preferred by referring clinicians is that the report should contain a complete listing of pertinent negative findings. In aggregate, these opinions seem to militate for a report organization and format like the "laboratory report" option preferred by our subjects.
Selection of senior medical students to serve as subjects proved to be quite successful. Interest and enthusiasm were such that we were able to double the sample size with little effort. Even after the study had closed, numerous students asked to participate. We found the experimental paradigm to be very acceptable to subjects and quite easy to administer. These factors should allow us to easily extend and expand the experiments, using additional cohorts of senior medical students, to address limitations described below.
Perhaps the most important limitation in our study has to do with generalizing our results from senior medical students answering content-specific questions to practicing physicians using radiology reports for clinical decision making. We narrowed the focus of the research to evaluate readability of the documents containing radiology interpretations with respect to the format alone. Medical students' subsequent experiences during training and practice certainly do lead to differences in many skills and habits. However, we argue that the simple ability to read a passage of text and comprehend its content is already well established by the senior year of medical school.
A difficult design consideration was whether to test subjects with both the free text and structured version of each case. This would have provided even greater power to detect the effect of format on outcomes by virtue of having a directly paired comparison. We think that our choice of a balanced block design, incomplete in the format factor, was valid for two reasons. First, the planned and achieved power of the chosen design allowed us to detect differences between free text and structure that were far less than what we considered to be practically relevant. Second, having subjects see cases twice would have introduced methodologically difficult problems with memory effects.
Another issue is that our subjects had no time constraints or other pressures placed on them during testing. We plan on adding features to the experimental paradigm that will stress a subject's short-term memory of the material. The structured versions of the reports we used had phrasing and syntax identical to that found in the original narrative versions. In practice, the language and construction of interpretative statements would likely be rather different in structured reports. Readers may be more (or less) able to rapidly comprehend medical content presented in structured format using "telegraphic" constructions such as "LIVER: Negative."
To address limitations described above and extend the scope of our inferences, we plan on at least three extensions to the experiment using the same cases, questions, and randomization scheme. First, we will remove the button on the question page that allows going back to review the report. Subjects will know this and that they must answer all 10 questions after one reading. There will be no time limit for either reading the report or answering the questions. Second, we will place a time constraint on how long the report is visible before switching to the question page. Again, subjects will know that they cannot go back to review the report while answering questions. Third, we will enable either structured or free text formats to be viewed at the discretion of the subject while they go through a case. The tracking code will record which version(s) they look at and for how long. This will allow us to determine if subjects develop and actually act on a preference for one format or the other as they move through the cases.
Further research will involve psychometric evaluation of the questions themselves. Once the three additional experiments detailed above have been completed, we will have a large number (64) of answers to each of 120 different questions about radiology report content. This should allow us to use standard techniques to assess item difficulty, reliability, various correlations, and discriminatory power. These results will be interesting in their own right by revealing what kinds of questions are challenging for readers to answer. This might guide radiologists in explicitly including phraseology in their reports to address these difficulties. Types of questions that exhibit high levels of variance in the answers given or are poorly correlated with subjects' overall scores will also be of interest. Given this knowledge about the types of questions that are most reliable and discriminatory, we can redesign the cases and questions to optimize power to detect subtle differences in reader performance. Such information about question content may guide other researchers in their own experiments about readability of medical documents.
To our knowledge, this work is the first experimental evaluation of radiology reports whose primary outcomes are quantitative measures of information transfer to readers of the documents. Based on the results described above, we assert that there is no difference in information transfer efficiency between free text (narrative style) report format and structured (itemized) reports having the same content. Despite the fact that they performed no better with the structured versions, our subjects clearly preferred the structured format to free text.