AJR 2000; 174:1241-1244
© American Roentgen Ray Society
A Simple Method for Obtaining Original Data from Published Graphs and Plots
Chris L. Sistrom and Patricia J. Mergo
Both authors: Department of Radiology, University of Florida College of
Medicine, P. O. Box 100374, Gainesville, FL 32610-0374.
Received August 3, 1999;
accepted after revision September 29, 1999.
Address correspondence to C. L. Sistrom.
Abstract
OBJECTIVE. To describe a method for deriving original data values
from scanned images of graphs and scatterplots published in the medical
literature.
CONCLUSION. The procedure is simple, reproducible, and relatively
error free (when performed carefully). This method is useful in converting
published graphic material into numeric data for various uses when the
original data are unavailable directly from the authors.
Introduction
There are many situations in which it is desirable to obtain the original
data values that were used to create graphs or scatterplots published in the
medical literature. These situations include meta-analysis, production of
slide presentations, preparation of material for computer-based teaching, and
writing textbooks or review articles. Having the original data values rather
than an image of the graph or plot is advantageous for several reasons. These
reasons include the ability to manipulate the data, combine the data with
other sources for meta-analysis, store the data more compactly, and gain
flexibility in reproduction method and style. Graphs or plots from different
sources can be reproduced in presentations and texts in a uniform format,
enhancing continuity and readability. Photographic methods are being replaced
by digital means of transmission and reproduction, and graphs or plots from
articles are increasingly being scanned into digital form. Our method converts
scanned images of plots or graphics into data values, which take much less
storage space than a high-resolution picture file. The only specialized piece
of software needed is NIH (National Institutes of Health) Image, which can be
downloaded free of charge from the Internet in both IBM-compatible
(www.scioncorp.com) and Macintosh-compatible (rsb.info.nih.gov/nihimage or
www.scioncorp.com) formats.
The most accurate and desirable means of obtaining original data is by
contacting the author of the article or chapter in question; however, this may
be impractical or impossible, especially if the material is somewhat dated. In
our experience, many authors cannot readily obtain the original data. If the
method we describe is used, readers should be aware of the possibility of
error between derived and original values of the data. As we show, with
careful attention to detail, the magnitude of such errors should be rather
small. Strict attention to relevant copyright laws and the general tenets of
professional courtesy is important in any use or reproduction of published
scientific material.
Scanning the Graph or Plot
Almost all consumer-grade flatbed scanners can produce images of sufficient
quality and resolution to yield accurate results. We used a CanoScan 600
scanner (Canon Computer Systems, Costa Mesa, CA). An original copy of the
journal or book is best, though a high-quality photocopy made with the page
perfectly flat on the copy glass can be used. The same care should be used in
positioning the material on the scanner. Additionally, the abscissa and
ordinate of the graph or plot must be parallel to the x-axis and
y-axis of the scanner. The scan is made in gray scale and the dots
per inch (DPI) are adjusted so that the resulting image will have at least
1000 pixels in each axis covering the graph or plot being analyzed. We used
600 DPI. Once an adequate scan of the graph or plot has been obtained, it is
saved in a tagged image file (.TIF) format (gray scale of 256 shades).
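The resolution requirement above can be checked with a little arithmetic before scanning. The following is a minimal sketch, not part of the original procedure; the function name min_dpi and the printed plot dimensions are hypothetical, and the only figure taken from the text is the 1000-pixel minimum per axis.

```python
import math

def min_dpi(plot_width_in, plot_height_in, min_pixels=1000):
    """Smallest whole-number DPI giving at least `min_pixels` along the shorter plot axis."""
    shortest = min(plot_width_in, plot_height_in)
    if shortest <= 0:
        raise ValueError("plot dimensions must be positive")
    return math.ceil(min_pixels / shortest)

# Hypothetical example: a plot printed 3.0 in wide by 2.5 in tall.
required = min_dpi(3.0, 2.5)
print(required)
```

Any scanner setting at or above the returned value should satisfy the criterion for a plot of that printed size.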
Recording Locations of Data Points and Axes
A copy of NIH Image (either the PC or Macintosh version) is required to
perform this part of the procedure. Installation is simply done by running the
provided setup program. In addition, the Microsoft DirectX (Microsoft,
Redmond, WA) drivers must be installed on your computer. Often, these are
already present and registered. If not, they can be obtained from the
Microsoft Web site (www.microsoft.com) and installed. It is best to have
selected a high-resolution setting in your graphics card software (1024
x 760 or more). Also, the mouse sensitivity, speed, and acceleration
should be set to low values to allow accurate and reproducible cursor
positioning. Figure 1 is a
representation of a scanned scatterplot with original and scanned coordinate
systems shown. The conversion of graph or plot points to pixel values is
performed with the following steps.

Fig. 1. Graph shows scanned plot with scan and plot coordinate systems
labeled. The .TXT file produced by NIH Image will contain numbers in scan
coordinate spaces (X and Y pixel values). Spreadsheet operations serve to
convert these into plot coordinates (Abs = abscissa, Ord = ordinate, Org =
origin, Max = maximum), which are original data values.
Start NIH Image. Open the .TIF file made when the graph or plot was scanned.
The image can be scaled to fit your screen by setting the checkbox under the
edit menu. Inverting the image (command found under the Options menu) is
useful because NIH Image places a single pixel black dot at each location
measured and this can only be seen with the original image inverted (white on
black). Under the Analyze menu, select Options, check X-Y Center, and check
Wand Auto-Measure. Then, under the Analyze menu, select Set Scale and Set
Units = Pixels. On the toolbar, select Wand (cross with a tiny circle in the
middle). Click once at the graph or plot origin (AbsOrg, OrdOrg in
Fig. 1). Click once at the
highest value of the abscissa and the lowest value of the ordinate (AbsMax,
OrdOrg in Fig. 1). Click once
at the lowest value of the abscissa and the highest value of the ordinate
(AbsOrg, OrdMax in Fig. 1).
Click once at every point along the graph or each plotted value (1, 2, 3,... i
in Fig. 1). The Info window
will show the coordinates of the points as they are registered and Count will
show the number of points stored. Under the File menu, select Export and check
Measurements. Set Save as File Type = Text. Specify a meaningful file name
(include .TXT at the end) and directory location for storage. Click Save. The
resulting text file may be examined in a text editor before closing NIH Image
to be sure it contains the numbers you obtained. The first two columns of
Table 1 show the first few sets
of raw pixel values contained in such a file.
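For readers who prefer a script to a spreadsheet, the exported measurements can also be read programmatically. The sketch below assumes the text file holds two whitespace- or tab-separated numbers per line (X pixel, then Y pixel) in the order the points were clicked; that layout, the function name read_measurements, and the sample numbers are assumptions for illustration, not output from NIH Image.

```python
def read_measurements(lines):
    """Parse text lines into a list of (x_pixel, y_pixel) tuples."""
    points = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # skip a non-numeric header row, if one is present
    return points

# Hypothetical file contents: origin, abscissa maximum, ordinate maximum, one data point.
sample = "120\t1000\n1080\t1000\n120\t100\n360\t730\n"
points = read_measurements(sample.splitlines())
origin, x_max, y_max = points[:3]   # the three calibration clicks
data_clicks = points[3:]            # every click after the axes
print(origin, x_max, y_max, data_clicks)
```

With a real file, the same function can be given an open file object, for example read_measurements(open("plot.txt")), where the file name is whatever was chosen at the Export step.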
Conversion of Raw Value to Original Data Values
This part of the process may be performed using any commercially available
spreadsheet program. We used Lotus 123 for Windows (Lotus Development,
Cambridge, MA) to perform the following steps.
Start your spreadsheet program. Import the .TXT file produced by NIH Image.
It should form two columns of numbers with the left-most column (usually A)
containing the raw pixel values representing the abscissa (x-axis)
and the right-sided column (usually B) containing the raw pixel values for the
ordinate (y-axis). Check that A1 and A3 are nearly equal. Check that
B1 and B2 are nearly equal. This is to ensure that the original was not tilted
or distorted during scanning. In the first row of the third column (usually C)
place the following formula:
(A1-value of A1)/[(value of A2-value of A1)/abscissa span]+lowest value of
the abscissa
In the first row of the fourth column (usually D) place the following
formula:
(value of B1-B1)/[(value of B1-value of B3)/ordinate span]+lowest value of
the ordinate
Copy C1 and paste it into the rest of the C column. Copy D1 and paste it
into the rest of the D column. Check the formulae in C2 through Ci to ensure
that the numbers are incremented correctly in the formula variable (A2 through
Ai). Check the formulae in D2 through Di to ensure that the numbers are
incremented correctly in the formula variable (B2 through Bi). Column C should
now contain the original abscissa values. Column D should now contain the
original ordinate values. Check that C1 equals abscissa origin, C2 equals
abscissa maximum, D1 equals ordinate origin, and D3 equals ordinate maximum.
If they do not, you have entered the formulae incorrectly and need to correct
them. Copy columns C and D and paste into any application desired. It may be
necessary to Export them as text values for Import into some applications
(such as your spreadsheet). This is because the Copy and Paste operation with
some combinations of spreadsheets and other applications actually transfers
the formulae rather than the calculated values.
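The two spreadsheet formulae can be expressed as a short function. This is a sketch under stated assumptions rather than a replacement for the spreadsheet: the name pixels_to_data and the sample pixel values are hypothetical, and the first three points are assumed to be the origin, abscissa maximum, and ordinate maximum clicks, in that order. The ordinate term is reversed because pixel Y values increase down the image.

```python
def pixels_to_data(points, abs_min, abs_max, ord_min, ord_max):
    """Convert (x_pixel, y_pixel) clicks to plot coordinates.

    points[0] = origin click, points[1] = abscissa maximum click,
    points[2] = ordinate maximum click, points[3:] = data clicks.
    """
    x0, y0 = points[0]
    x1 = points[1][0]
    y2 = points[2][1]
    x_scale = (x1 - x0) / (abs_max - abs_min)  # pixels per abscissa unit
    y_scale = (y0 - y2) / (ord_max - ord_min)  # pixels per ordinate unit
    return [((x - x0) / x_scale + abs_min,     # column C formula
             (y0 - y) / y_scale + ord_min)     # column D formula
            for x, y in points]

# Hypothetical clicks for axes running 20 to 100 and 0 to 300.
clicks = [(120.0, 1000.0), (1080.0, 1000.0), (120.0, 100.0), (360.0, 730.0)]
converted = pixels_to_data(clicks, 20.0, 100.0, 0.0, 300.0)
print(converted)
```

The first three converted rows serve the same purpose as the spreadsheet check: they should reproduce the abscissa origin, abscissa maximum, ordinate origin, and ordinate maximum.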
Table 1 lists raw pixel values,
formulae, and calculated values from sample data. In this example, the
abscissa went from 20 to 100 (span, 80) while the ordinate went from 0 to 300
(span, 300). Therefore, the calculated values for the abscissa were corrected
by adding 20 and no correction was made to ordinate values. Furthermore, A1
equals A3 and B1 equals B2, indicating a perfectly straight scan.
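The straightness check (A1 against A3, B1 against B2) can be automated in the same spirit. In this sketch the 2-pixel tolerance is an arbitrary assumption, as are the function name check_alignment and the sample coordinates; the tilt estimate is simply the angle of the line joining the origin click to the abscissa maximum click.

```python
import math

def check_alignment(origin, x_max, y_max, tolerance_px=2.0):
    """Return (is_straight, tilt_degrees) from the three calibration clicks."""
    x_drift = abs(origin[0] - y_max[0])  # A1 versus A3
    y_drift = abs(origin[1] - x_max[1])  # B1 versus B2
    tilt_deg = math.degrees(
        math.atan2(x_max[1] - origin[1], x_max[0] - origin[0]))
    is_straight = x_drift <= tolerance_px and y_drift <= tolerance_px
    return is_straight, tilt_deg

# Hypothetical clicks from a straight scan and from a visibly tilted one.
straight = check_alignment((120.0, 1000.0), (1080.0, 1000.0), (120.0, 100.0))
tilted = check_alignment((120.0, 1000.0), (1078.7, 1050.2), (167.1, 101.2))
print(straight, tilted)
```

A scan that fails the check is better repeated than corrected numerically.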
Assessment of Method Error
We applied the method described in this article to a scatterplot already
published [1]. This scatterplot
(Fig. 2) depicted the weight in
pounds versus the age in years for 227 patients being studied to determine
factors influencing the prevertebral soft-tissue thickness on cervical spine
radiographs. We reproduced the scatterplot
(Fig. 3) by graphing values for
age and weight derived from the scanned image by means of the steps already
described. Microsoft PowerPoint (Microsoft) was used to generate
Figure 3, whereas Harvard
Graphics for Windows (Software Publishing, Nashua, NH) was used to make the
original (Fig. 2). Note that
distribution of data is indistinguishable though the axis scaling, labeling,
and data point markers are different.

Fig. 2. Reproduction of image file produced by scanning original scatterplot
used to test our method. Weight of 227 patients was plotted against age in
study of prevertebral soft-tissue thickness on cervical spine radiographs.

Fig. 3. Scatterplot produced after importing derived data values for weight
and age into graphing program (PowerPoint; Microsoft, Redmond, WA). Note that
distribution of data points is visually indistinguishable from original
material in Figure 2.
We were able to quantitatively compare the results derived by our method
with the original data values that were still available to us. This comparison
followed the method developed by Bland and Altman
[2] for comparing measurements
made with two instruments of the same quantity. In their analyses, values
obtained with a reference instrument (ref) are compared with those
obtained with a test instrument (test). This comparison is done by
plotting ref - test on the ordinate and
(test + ref)/2 on the abscissa. Lines representing the mean of
ref - test and ±2 standard deviations are plotted as
well to show central tendency of error and the limits of agreement.
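The quantities in such a plot are straightforward to compute. The following is a minimal sketch of the difference, mean, and ±2 SD calculation using only the Python standard library; the function name bland_altman and the sample numbers are hypothetical and are not data from this study.

```python
from statistics import mean, stdev

def bland_altman(ref, test):
    """Return (means, diffs, bias, limits) for paired measurements.

    diffs  = ref - test, plotted on the ordinate
    means  = (test + ref) / 2, plotted on the abscissa
    bias   = mean of diffs
    limits = (bias - 2 SD, bias + 2 SD), the limits of agreement
    """
    if len(ref) != len(test) or len(ref) < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")
    diffs = [r - t for r, t in zip(ref, test)]
    means = [(r + t) / 2 for r, t in zip(ref, test)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return means, diffs, bias, (bias - 2 * sd, bias + 2 * sd)

# Hypothetical paired values.
ref_values = [150, 162, 175, 181]
test_values = [150, 161, 175, 182]
means, diffs, bias, limits = bland_altman(ref_values, test_values)
print(bias, limits)
```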
For purposes of this analysis, the original data for weight and age were
treated as if they represented reference values, and the derived values for
age and weight were treated in the same manner as test instrument results.
Rather than plotting the average of derived and original values on the
abscissa, we simply plotted the original values. When we generated error plots
for age and weight, we found no errors for age
(Fig. 4). Errors for weight
exceeded 1 lb (0.454 kg) for only a single data point. The limits of agreement
for weight were ±1 lb (0.454 kg)
(Fig. 5).

Fig. 4. Graph shows error terms for patient age (in years) derived by
conversion of scanned plot (Fig. 2) then compared with original data.
Difference between original and derived ages (orig - deriv) is plotted
against original age values.
Derived age equaled original age for every point.

Fig. 5. Graph shows error terms for subject weight (in pounds) derived by
conversion of scanned plot in Figure 2 then compared with original data.
Difference between original and derived weights (orig - deriv) is plotted
against original weight
values. Mean error (solid line) and limits of agreement (±2 SD,
dotted line) are shown.
To test the effect of improper scanning technique, we deliberately rotated
the original plot by 3° (clockwise) before making a second scan. All other
parameters were the same as before. We then performed the same steps to derive
the original data values and the same error analysis. When the derived values
for age and weight were plotted and visually compared with the original scan,
it was hard to distinguish any difference. However, quantitative analysis
revealed that the mean error for age was -2 years with limits of agreement of
-1 to -3 years. The mean error for weight was +6 lb (2.722 kg) and the limits
of agreement were 2-10.5 lb (0.907-4.763 kg). This result emphasizes the
importance of having the original material (or a high-quality copy) and
performing the scan with the plot or graph perfectly aligned with the scanner
x- and y-axes.
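Why a small tilt produces a systematic error can be illustrated with a toy calculation. The sketch below is not a reconstruction of our experiment: the pixel coordinates, the data point, the choice to rotate about the origin click, and the sign convention of the rotation are all assumptions. It rotates a hypothetical set of clicks by 3° and then applies the same conversion used for a straight scan.

```python
import math

def rotate(point, center, degrees):
    """Rotate a pixel coordinate about `center` by `degrees`."""
    theta = math.radians(degrees)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (center[0] + dx * math.cos(theta) - dy * math.sin(theta),
            center[1] + dx * math.sin(theta) + dy * math.cos(theta))

def to_data(p, origin, x_max, y_max, abs_range, ord_range):
    """Axis-aligned conversion of one pixel pair, as in the spreadsheet formulae."""
    x_scale = (x_max[0] - origin[0]) / (abs_range[1] - abs_range[0])
    y_scale = (origin[1] - y_max[1]) / (ord_range[1] - ord_range[0])
    return ((p[0] - origin[0]) / x_scale + abs_range[0],
            (origin[1] - p[1]) / y_scale + ord_range[0])

def derived_after_tilt(degrees):
    # Hypothetical straight-scan clicks; the data point is (60, 150)
    # on axes running 20 to 100 and 0 to 300.
    origin, x_max, y_max = (120.0, 1000.0), (1080.0, 1000.0), (120.0, 100.0)
    point = (600.0, 550.0)
    o, xm, ym, p = (rotate(q, origin, degrees)
                    for q in (origin, x_max, y_max, point))
    return to_data(p, o, xm, ym, (20.0, 100.0), (0.0, 300.0))

straight = derived_after_tilt(0.0)
tilted = derived_after_tilt(3.0)
print(straight, tilted)
```

In this toy case the derived abscissa value is pushed in one direction and the derived ordinate value in the other, and the size of each shift grows with the point's distance from the opposite axis, which is why the error shows up as a bias rather than as random scatter.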
Discussion
The method we describe for deriving original data values from published
graphs or plots was initially developed by one of the authors during
preparation of a textbook [3]
that included many such illustrations. In some cases, regressions or other
equations based on these data were listed in the text of the paper. When these
functions were recalculated from the derived values, the results were almost
always in very close numeric agreement with those published. Using the same
graphing program to redraw all the plots and graphs with the derived data
enabled the production of illustrations with a consistent appearance. Careful
visual comparison of these illustrations with the original material ensured
that the reproductions were accurate and complete. In some instances, data
from different sources could be combined into single graphs or plots, thus
visually showing relationships in ways that were not otherwise possible.
Permission to reproduce the figures was requested from the corresponding
author and journal publisher in all cases. We have completely detailed the
method in stepwise fashion with illustrative sample data
(Table 1) so that readers may
perform it themselves.
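The regression check mentioned above can be repeated on any derived data set. The following is a minimal sketch of an ordinary least-squares line fit; the function name fit_line and the sample values are hypothetical, and a published equation of another form (for example, a polynomial) would need a different fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical derived values that happen to lie on a straight line.
slope, intercept = fit_line([20.0, 40.0, 60.0, 80.0], [50.0, 90.0, 130.0, 170.0])
print(slope, intercept)
```

The fitted coefficients can then be compared with those printed in the source article as a rough test of how faithfully the data were recovered.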
References
1. Sistrom CL, Southall P, Peddada SD, Shaffer HH. Factors affecting
the thickness of cervical prevertebral soft tissue. Skeletal
Radiol 1993;22:167-172
2. Bland JM, Altman DG. Statistical methods for assessing agreement
between two methods of clinical measurement. Lancet
1986;1:307-310
3. Keats TE, Sistrom CL. Atlas of Radiologic
Measurement, 7th ed. Philadelphia: Mosby Year Book,
2000
