1 Department of Radiology Research, University of Arizona, Tucson, AZ.
2 Department of Radiology, Dartmouth-Hitchcock Medical Center, One Medical
Center Dr., Lebanon, NH.
3 Department of Radiology, University of Washington, Box 354755, 4245 Roosevelt
Way, NE, Seattle, WA 98105.
Received May 7, 2007;
accepted after revision August 7, 2007.
Address correspondence to F. S. Chew
(fchew{at}u.washington.edu).
Abstract
Keywords: prospective reader studies, receiver operating characteristic, ROC analysis, self-assessment
Solution to Question 1
All of these options refer to theories regarding the way humans process
information and render decisions. How people make decisions under conditions
of uncertainty is the domain of signal detection theory. Signal detection theory
deals specifically with detecting a signal or target (e.g., a tumor) in a
background of noise (e.g., chest anatomy), and it considers both perception and
decision-making [1]. Signal detection theory is
the underlying basis for ROC analysis. The best response is Option B.
The stages by which people encode, store, and retrieve information are described by
information processing theory. It specifies four types of knowledge: general
versus specific (useful in many tasks or specific ones), declarative (facts),
procedural (how to), and conditional (when and how). Option A is not the best
response. The perceptually based theory that attempts to explain the way
people identify objects, make decisions regarding their similarity, and make
preference judgments is called General Recognition Theory. It has more to do
with decision classification processes that occur after a target (lesion) has
been detected, and not with the detection of the target itself. Option C is
not the best response. How people make current decisions based on the outcome
of previous decisions may be described by Bayesian decision theory. Bayesian
decision theory is a specific type of decision theory that uses some prior
distribution of data in the computation of the present decision, often taking
utilities into account. This theory does not address perception. Option D is
not the best response. The transmission of information and ways to reduce the
inherent uncertainty in information refers to information systems theory.
Information systems theory proposes a three-stage process to eliminate
uncertainty through enactment (do something with the information), selection
(decide which information to keep and which to ignore) and retention (what
needs to be remembered). This theory does not address perception. Option E is
not the best response.
Solution to Question 2
An investigator wishes to determine whether radiologists can accurately
detect calcification in pulmonary nodules with chest radiography. ROC analysis
is used to assess decision accuracy in diagnostic detection tasks
[2]. In a typical ROC study,
observers are presented with a set of images, half containing a target of
interest (e.g., a tumor) and half without any target (e.g., normal chest
image). Ideally, the set should contain a mix of subtle to moderately subtle
targets. The images are randomized and each observer views the set in a given
condition being tested (e.g., soft-copy display with and without edge
enhancement applied). The observer is asked to search each image and report
whether a lesion or target is present or absent. Observers then report their
confidence in that decision using either a discrete (5- or 6-point) or
continuous (0–100) scale. The confidence ratings are then used to
generate the ROC curve, and the area under the curve can be calculated using
standard methods. Option C is the best response. To measure speed, the
investigator would use a stopwatch and compare the times using a t test.
Option A is not the best response. To compare the risk in a group with the
risk in the general population, one would use relative risk analysis
techniques. Option B is not the best response. To study the impact of a drug
on patient survival, one would use survival analysis techniques. Option D is
not the best response. To examine tool use, such as a mouse versus trackball,
one would rely on human factors analysis techniques such as a
time–motion study. Option E is not the best response.
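The rating procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method: it assumes a hypothetical 5-point scale on which higher ratings mean greater confidence that a target is present, builds the empirical ROC points by sweeping a threshold across the scale, and estimates the area with the trapezoidal rule (one of several standard methods; fitting a smooth curve is another). The function names and the ten ratings are invented for illustration.

```python
def roc_points(ratings, truth):
    """Empirical ROC points (false-positive fraction, true-positive fraction)."""
    pos = [r for r, t in zip(ratings, truth) if t == 1]  # target-present images
    neg = [r for r, t in zip(ratings, truth) if t == 0]  # target-absent images
    points = [(0.0, 0.0)]
    # Sweep the decision threshold from the strictest rating to the most lenient.
    for threshold in sorted(set(ratings), reverse=True):
        tpf = sum(r >= threshold for r in pos) / len(pos)
        fpf = sum(r >= threshold for r in neg) / len(neg)
        points.append((fpf, tpf))
    return points


def trapezoid_area(points):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical data: 5 target-present and 5 target-absent images,
# rated 1 (definitely absent) to 5 (definitely present).
truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
ratings = [5, 4, 4, 3, 2, 1, 2, 2, 3, 4]

points = roc_points(ratings, truth)
area = trapezoid_area(points)
print(points)
print(round(area, 2))  # 0.78 for these invented ratings
```

Each distinct rating acts as one decision threshold, so a 5-point scale yields at most five operating points plus the origin.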
Solution to Question 3
The ROC curve is not supposed to dip below the chance line (D) as curve C
does. Option C is the best response. The chance line defines an
observer who is essentially guessing about the status of an image and is,
therefore, operating at chance (would call half the images normal and half
abnormal). The area under the chance line by definition is 0.50. Ideally, an
observer who is qualified to carry out the experimental task and who
understood the reporting instructions should perform better than chance and
thus should have a resulting area under the curve higher than 0.50 (where 1.0
is perfect performance). Proper ROC methods have been developed to prevent
fitted curves from crossing the chance line [3]. Curves
A and B do not cross the chance line. Curve E falls below the chance line,
typically indicating that the observer was not following the reporting
instructions correctly (e.g., if the reporting scale used 1 = present,
definite confidence; and 6 = absent, definite confidence, but the reader used
1 = absent, definite confidence; and 6 = present, definite confidence).
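The effect of a reversed rating scale can be illustrated with a short sketch. It assumes the 6-point scale described above (1 = present, definite confidence; 6 = absent, definite confidence) and a hypothetical reader who used the scale backwards. The area is computed here as the fraction of abnormal-normal image pairs that are ranked correctly, with ties counted as one half, which equals the trapezoidal area under the empirical ROC curve. The function name and the data are invented for illustration.

```python
def area_under_roc(scores, truth):
    """Fraction of (abnormal, normal) pairs ranked correctly; ties count 0.5.

    Higher scores must mean greater confidence that the target is present.
    """
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


truth = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical reader who treated 6 as "present" instead of 1.
reported = [6, 5, 5, 3, 1, 2, 4, 2]

# Analysis that trusts the instructions (1 = present): score = 7 - rating.
as_instructed = area_under_roc([7 - r for r in reported], truth)
# Analysis after recognizing the reversal: the rating itself is the score.
corrected = area_under_roc(reported, truth)

print(as_instructed, corrected)  # 0.0625 0.9375
```

The two areas sum to 1.0, which is one reason an area well below 0.50 usually suggests a scale reversal rather than genuinely poor detection.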
Solution to Question 4
The closer an ROC curve is to the upper left corner, the better performance
is as generally measured by the area under the ROC curve
(Az) [4].
Option A is the best response. In ROC space, the area above the chance
line (D) equals 0.50. The chance line defines an observer who is essentially
guessing about the status of an image and is therefore operating at chance
(would call half the images normal and half abnormal). The area under the
chance line by definition is 0.50. Ideally, an observer who is qualified to
carry out the experimental task and who understood the reporting instructions
should perform better than chance and thus should have a resulting area under
the curve higher than 0.50 (where 1.0 is perfect performance). Perfect
performance (Az = 1.0) would be indicated by an ROC curve
that follows precisely the left and upper lines. Curve B represents
performance intermediate between A and C. Curve E represents performance below
chance, typically indicating that the observer was not following the reporting
instructions correctly.
Solution to Question 5
If a disease is relatively rare, occurring in only 5% of patients, then a
clinician who calls all the cases negative will have an accuracy of 95%, which
is misleading [4]. Option A
is the best response. Option B is incorrect because you can derive
sensitivity and specificity from accuracy. Option C is incorrect because
accuracy is simply reported as a percentage and does not require calculus.
Option D is incorrect because the number of cases in a test set does not by
itself affect accuracy, although using too many cases in one setting could
lead to reader fatigue, which could decrease accuracy. Knowledge of disease
prevalence (or prior probability) in general can affect the decision criteria
of a clinician. For example, coccidioidomycosis (valley fever) is caused by an
organism found in the soil in the southwestern United States and affects
primarily the lungs (nodules are observed). A clinician in New England who
sees a patient with nonspecific nodules is unlikely to consider
coccidioidomycosis and may therefore misread the case if the patient did not
mention having just returned from a vacation in Arizona.
However, a clinician in Arizona, seeing the same nodular manifestations, would
be more likely to correctly consider coccidioidomycosis as the diagnosis.
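The arithmetic behind the 5% prevalence example at the start of this solution can be checked with a small sketch. The sample of 1,000 patients and the function name are assumptions chosen for illustration; only the 5% prevalence and the all-negative reading strategy come from the text.

```python
def summarize(tp, fn, fp, tn):
    """Accuracy, sensitivity, and specificity from a 2 x 2 decision table."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical sample: 1,000 patients, 5% prevalence (50 diseased, 950 healthy).
# A clinician who calls every case negative detects none of the 50.
all_negative = summarize(tp=0, fn=50, fp=0, tn=950)
print(all_negative)
# {'accuracy': 0.95, 'sensitivity': 0.0, 'specificity': 1.0}
```

The 95% accuracy hides a sensitivity of zero: every diseased patient is missed.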
Solution to Question 6
In satisfaction of search, observers do not report additional findings on
images when they have found something suggested by the original search task
[5]. For example, if the main
task is searching for nodules in chest images, the presence of a rib fracture
often goes unreported once the nodule is detected. Option B is the best
response. Option A is not the best response because many patients do have
multiple lesions per examination. Option C is not the best response because
there are ROC techniques (e.g., free-response ROC, alternative free-response
ROC) designed specifically to account for multiple lesions. More than one
lesion does not tend to increase the false-positive rate, so Option D is not
the best response. Option E is not the best response because residents are
trained from the beginning to search for and detect multiple lesions per
case.
Solution to Question 7
Statistical power is the probability that one can reject the null
hypothesis (there is no difference between conditions being compared) when it
is indeed false (there is a true difference between conditions). Typically,
one wants a power of about 0.80, meaning that the probability that one can
reject the null hypothesis as a result of the study is 80%
[6]. Statistical power is
affected significantly by sample size—the greater the sample size, the
more power one typically has. Option D is the best response.
Replicating reading volume (option A) is impractical, as is
replicating prevalence (option B); in screening mammography, for example, one
may have one abnormal case per 1,000. Although using too many cases may tire the
readers and impair their performance (option C), this is not the main reason
to consider the sample size; rest breaks can always be incorporated into the
protocol to address reader fatigue. Having clear written instructions should
avoid observer confusion (option E) regarding the task no matter how many
cases are used.
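The relationship between sample size and power can be illustrated with a generic sketch. It uses a normal approximation for a two-sided comparison of two group means, which is a simplifying assumption: multireader, multicase ROC studies need specialized sample-size methods that also account for reader and case variability. The function name, the effect size of 0.5 standard deviations, and the sample sizes are hypothetical.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z test.

    effect_size is the true difference in means divided by the common
    standard deviation; n_per_group is the sample size in each condition.
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Power rises as the sample grows while the true effect stays fixed.
for n in (16, 32, 64, 128):
    print(n, round(approx_power(0.5, n), 2))
```

Under these assumptions, power climbs steadily with sample size and reaches roughly 0.80 at about 64 observations per condition.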
Solution to Question 8
An independent assessment of truth using other types of clinical data is
always preferred when these sources of data are available
[7]. Option E is the best
response. All of the other choices rely on a single observer to decide
truth, and that observer may be biased or simply incorrect. If other clinical
data are not available to serve as the reference standard, the next best
option is typically a panel of experienced clinicians.
Solution to Question 9
Only option D, the correct response, actually records an objective
measure of reader efficiency—time required to render a decision
[8]. All of the other methods
rely on the subjective opinion of individual observers about perceived image
appearance and do not assess reading efficiency.
Solution to Question 10
A conservative reader typically adopts a relatively high decision threshold
(must have a lot of evidence to report an abnormality as present), resulting
in fewer positive decisions (both true positive and false positive) than a
more liberal reader [9].
Option B is the best response. A liberal reader has a lower decision
threshold and thus is characterized by option A. Because the decision
threshold affects both true- and false-positive decisions the same way (when
one goes up the other does also), options C, D, and E rarely occur in
experimental settings.
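The link between the decision threshold and the two kinds of positive decisions can be sketched with the equal-variance Gaussian model that is commonly used in signal detection theory. That model is an assumption made here for illustration, as are the function name, the separation of 1.5 standard deviations between normal and abnormal cases, and the two threshold values.

```python
from statistics import NormalDist


def operating_point(threshold, d_prime=1.5):
    """(false-positive fraction, true-positive fraction) at one threshold."""
    normal_cases = NormalDist(0.0, 1.0)        # evidence when no lesion is present
    abnormal_cases = NormalDist(d_prime, 1.0)  # evidence when a lesion is present
    # A case is called positive when its evidence exceeds the threshold.
    fpf = 1 - normal_cases.cdf(threshold)
    tpf = 1 - abnormal_cases.cdf(threshold)
    return fpf, tpf


liberal = operating_point(threshold=0.25)       # needs little evidence
conservative = operating_point(threshold=1.25)  # needs a lot of evidence
print("liberal", liberal)
print("conservative", conservative)
```

Raising the threshold lowers both fractions together, so the conservative reader sits lower and further left on the same ROC curve than the liberal reader.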
Solution to Question 11
The best response is E. The Fryback and Thornbury model is based on
the effectiveness of the technology, viewed from six different perspectives or
levels [10]. The levels are
technical, diagnostic, diagnostic thinking, therapeutic, patient outcome, and
societal. A diagnostic test is considered technically effective if its result
is accurate and precise in a physical sense
[11]. Diagnostic efficacy
concerns the extent to which the results of a diagnostic test agree with
patients' actual states of health. Diagnostic-thinking efficacy is difficult
to measure but is the extent to which a diagnostic test affects physicians'
subjective estimates of disease likelihood. Therapeutic efficacy addresses the
question of how, and by how much, a particular diagnostic test changes the
way in which patients are treated. Patient-outcome efficacy refers to whether a
patient's health is demonstrably improved by use of the test. Societal
efficacy merges private and public considerations (e.g.,
cost/benefit/effectiveness) to assess diagnostic tests within the context of
the social endeavor. This framework is valuable in today's clinical
environment because it acknowledges that it is no longer sufficient to simply
demonstrate that a new technology can better depict anatomy, function,
disease, etc., and thereby improve diagnostic accuracy. The decision whether
to adopt or forgo a new technology also depends on its cost, not only in the
monetary sense but also in the societal sense, and the outcomes affected by
the new technology. The Fryback and Thornbury model does not consider the
complexity of the technology or the rate of adoption of the technology.
Options A and C are not the best responses. The model considers the direct cost
and the cost-effectiveness of a technology only indirectly, in the context of
societal efficacy, and it is not based on these
considerations. Options B and D are not the best responses.