OBJECTIVE. The objective of our study was to evaluate repeatability and reproducibility of lung nodule volume measurements using volumetric nodule-sizing software.
MATERIALS AND METHODS. Fifty nodules, less than 20 mm in diameter, in 29 patients were scanned with 1.25-mm collimation using MDCT (time 1 = T1). During the same session, two additional scans, using identical technique, were obtained through each nodule (T2, T3). Three observers working independently then obtained volumetric measurements using a semiautomated volumetric nodule-sizing software package. Qualitative nodule characterization was also performed. The Bland-Altman method for assessing measurement agreement was used to calculate the 95% limits for agreement for nodule volumes at T1, T2, and T3.
RESULTS. Automated nodule segmentation was successful in 438 (97%) of 450 measurements. Forty-three nodules were available for final evaluation. Twenty-six nodules had well-defined edges, and 17 had irregular or spiculated margins. Seventeen were freestanding, 16 were juxtapleural, and 10 were juxtavascular in location. Average nodule volume was 345.5 mm3 (range, 49.3–1,434 mm3). The mean interobserver variability (repeatability) was 0.018% (SD = 0.73%), and the SD of the mean for the three contemporaneous scans (reproducibility) was 13.1% (confidence limits, ± 25.6%). SD and confidence limits narrowed as volumes increased.
CONCLUSION. Volumetric measurements show minimal interobserver variability (0.018%) but an interscan SD of the mean of 13.1% (confidence limits, ± 25.6%). Repeatability and reproducibility of volumetric measurements are better than those of linear measurements reported in the literature.
Pulmonary nodules are frequently found incidentally in patients undergoing chest CT for a variety of reasons and particularly in those screened for lung cancer. Because such nodules are usually benign, it is often clinically appropriate to follow nodule size with serial CT scans to determine stability. Determining change in nodule size is also critical in following tumor response to treatment. Direct volume measurements are theoretically preferable to diameter or perimeter measurements because nodules are seldom perfectly spherical and often have irregular or difficult-to-define margins. Several factors determine the reliability of measurements: the quality of the measuring tool, interobserver variability, the characteristics of the nodule, and the inherent variability of serial scanning of the same nodule [3–5]. Several published studies have reported excellent interobserver agreement (repeatability) and interscan agreement (reproducibility) for linear and volumetric measurements of artificial nodules in vitro, but in vivo studies have shown a much wider range of measurements [3–10].
With the introduction of multidetector scanners, thinner slices are more routinely available and better z-axis resolution renders nodule volume measurements more accurate. GE Healthcare has recently developed a software program (Advanced Lung Analysis [ALA]) that can semiautomatically determine nodule volume. This program segments and separates a nodule from adjacent structures and then sums the weighted voxels to determine volume. Before such a system is clinically useful, it must show repeatable and reproducible measurements in vivo. Measurement reproducibility is more critical than assessing absolute nodule volume because proportional change in volume over time is more clinically relevant than the absolute change in volume. If successful, nodule volume determination software should permit physicians to determine growth rates (doubling time) more accurately to assess the need for intervention and assess response to chemotherapy. It will also provide a method for determining doubling times of various primary tumors, metastatic nodules, and benign nodules—thereby improving our understanding of growth characteristics of many entities.
This study was designed to answer three basic questions: First, what is the measurement variability due to the interaction of the observer and measuring tool (i.e., what is the interobserver variability when three observers independently measure the volume of the same nodule on a single scan)? Second, what variability is inherent in the rescanning process (i.e., do volume measurements remain consistent for each of three scans done within 20 min)? Third, can this technique be used with nodules that span a wide range of morphologic types and in various lung backgrounds?
Materials and Methods
Study Population and Design
This study was approved by the institutional review board of the Medical College of Wisconsin. Patients consisted of two adult populations. In the first group, patients were scanned for the evaluation of known nodules. In the second group, unsuspected pulmonary nodules were detected on a scan obtained for another clinical purpose. No patients were recruited for this study alone. IV contrast material was administered, if clinically appropriate. All patients were scanned on multidetector scanners. Nodules measuring less than 20 mm in diameter were included because that size range includes most of the nodules detected on CT in clinical practice and the group most likely to be followed with serial scans. This convenience sample was chosen by one author to provide a variety of nodules in lungs with various types of background disease. She measured the diameters of the nodules with electronic calipers at the time of the initial scan (time 1 = T1). After the initial diagnostic scan (T1), the study was explained to the patient and informed consent was obtained for two repeat scans (T2, T3) through the small lung volumes containing the nodule.
The initial CT scan was obtained in a manner appropriate for the patient's medical condition. The patient remained supine throughout the study. After a 10- to 20-min pause, the nodule or nodules in question were rescanned twice on separate breath-holds (T2, T3). In eight patients, IV contrast material was given at the time of the initial diagnostic scan because it was clinically indicated. No additional contrast material was administered at T2 or T3. All CT scans were obtained on an 8- or 16-MDCT scanner (LightSpeed Ultra or LightSpeed 16, GE Healthcare) with 120 kVp, 200–400 mA, 0.5–0.8 sec, pitch of 1.35–1.375, and 5 mm (1.25 mm × 4 thickness) as a standard technique. Patients were instructed to “breathe deeply and hold your breath.” No attempt was made to standardize lung volumes. Scans were viewed at 1.25-mm section thickness and high-resolution (bone) algorithm. For each patient, the scanning parameters were identical at T1, T2, and T3.
Fifty nodules in 29 patients were scanned over a 5-month period. No patient was rescanned at more than two levels, and no patient had more than three nodules included in this study. Patient-identifying information was removed, and cases were randomized before measurements were made.
The CT data were transferred electronically to a workstation (Advantage Windows, GE Healthcare). One investigator chose the nodules to be evaluated and assigned each patient's three scans (T1, T2, T3) randomly into one of three groups (A, B, or C) for interpretation. Working independently, two experienced thoracic radiologists (7 and 30 years' experience) and an untrained biomedical engineering intern measured each nodule (T1, T2, T3) using the volumetric (3D) analysis software (Figs. 1A and 1B). If automated segmentation failed, it was attempted a second time. After two failures, a nodule was eliminated from the study. At least 2 weeks passed between analyses of images in groups A, B, and C.
At T1, both radiologists, using a checklist of agreed criteria, characterized nodule location (lobe), lung position (juxtapleural, juxtavascular, or interparenchymal), and lung background (normal, chronic obstructive pulmonary disease [COPD], fibrosis, or ground-glass opacification). They also rated nodule edge characteristics (smooth, lobular, irregular, or spiculated) and perceived nodule density (solid, totally calcified, or mixed). A third radiologist resolved differences in categorization between the two radiologists involved in scoring.
Interobserver agreement of nodule volume was determined on T1 readings (n = 150). Interscan agreement was evaluated on the basis of three independent reviewers evaluating three contemporaneous scans (T1, T2, T3) (n = 450).
Segmentation and volume calculation were automatic. To begin the segmentation and sizing routine, the user first places a cursor on any point in a nodule of interest in an axial view. The nodule bookmark location is optimized and repositioned automatically so that nodule volume computation always begins from the same seed point. After the seed point is repositioned, watershed segmentation is performed to separate nodule components from the lung parenchyma. Finally, a model-based shape analysis is performed to determine anatomic characteristics of various nodule types. This permits unique handling of the different anatomic presentations of lung nodules, including separation from adjacent chest wall or mediastinal structures and segmentation of vascular connections before nodule volume estimation. The nodule volume calculation uses a weighted sum of voxel volume on the border of the nodule based on a priori knowledge of the CT scanner point-spread function impact on nodule edges.
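The final volume step described above can be illustrated with a minimal sketch. The internals of the software are not given in this article, so the Python below is only an illustration under stated assumptions: each voxel carries a hypothetical weight between 0 and 1 (1 for interior voxels, a fraction for border voxels blurred by the point-spread function), and the volume is the summed weight multiplied by the voxel volume. The function name, the weights, and the voxel spacing are all illustrative and are not taken from the software.

```python
def weighted_nodule_volume(weights, spacing_mm):
    """Return a volume in mm^3 as sum(weights) * voxel volume.

    weights: iterable of per-voxel weights in [0, 1] (hypothetical values;
             1.0 = fully inside the nodule, fractional = border voxel).
    spacing_mm: (dx, dy, dz) voxel dimensions in millimetres.
    """
    dx, dy, dz = spacing_mm
    voxel_volume = dx * dy * dz
    total_weight = 0.0
    for w in weights:
        if not 0.0 <= w <= 1.0:
            raise ValueError("voxel weights must lie in [0, 1]")
        total_weight += w
    return total_weight * voxel_volume


# Illustrative input only: 100 interior voxels plus 40 border voxels at half
# weight, with an assumed 0.7-mm in-plane pixel size and 1.25-mm sections.
example_weights = [1.0] * 100 + [0.5] * 40
example_volume = weighted_nodule_volume(example_weights, (0.7, 0.7, 1.25))
print(f"{example_volume:.1f} mm^3")
```

Counting border voxels at full weight would overestimate the volume of small nodules, where a large share of the voxels lie on the edge; fractional weights are one simple way to express that correction.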
The Bland and Altman statistical method for assessing measurement agreement was used to determine both the interobserver and the interscan variability in estimating the volume of a given nodule. The method is used to generate a 95% confidence interval (CI) for the variability within measurements and is ideally suited for cases in which the truth is unknown. The variability in both the interobserver and interscan measurements is determined in a similar fashion. For the case of interobserver variability, we first compute an estimated volume that is the mean of the measurements performed by multiple reviewers for each nodule at scan instance T1. The interobserver variability measure is the percentage difference between the individual measurements and the estimated volume. This compensates for the size range of the nodules. The reported interobserver variability is the SD of the interobserver variability measure. To compute the interscan variability, the estimated volume of a nodule at a scanning time (T1, T2, or T3) is the mean of all three reviewers' measurements at all scanning times. The interscan variability measure is the percentage difference between the average volume measurement at each scanning time and the estimated volume. Similarly, the reported interscan variability is the SD of the interscan variability measure.
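The two variability measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in the study: the input volumes are invented, the sample SD (n - 1 denominator) is an assumption because the text does not say which form was used, and the 1.96 multiplier for the 95% limits assumes approximately normal percentage differences. That multiplier roughly matches the ratio between the SDs and confidence limits reported in this article.

```python
from statistics import mean, stdev


def percent_differences(values, reference):
    """Percentage difference of each value from a reference volume."""
    return [100.0 * (v - reference) / reference for v in values]


def interobserver_sd(t1_measurements):
    """SD of percentage differences between each observer and the T1 mean.

    t1_measurements: one list per nodule, holding one volume per observer.
    """
    diffs = []
    for volumes in t1_measurements:
        diffs.extend(percent_differences(volumes, mean(volumes)))
    return stdev(diffs)  # sample SD: an assumption, see lead-in


def interscan_sd(measurements):
    """SD of percentage differences between each scan mean and the nodule mean.

    measurements: one entry per nodule; each entry is a list of scans
    (T1, T2, T3), and each scan is a list of observer volumes.
    """
    diffs = []
    for scans in measurements:
        all_volumes = [v for scan in scans for v in scan]
        estimated = mean(all_volumes)            # mean of all nine readings
        scan_means = [mean(scan) for scan in scans]
        diffs.extend(percent_differences(scan_means, estimated))
    return stdev(diffs)


def limits_95(sd):
    """Half-width of the 95% limits, assuming normality (1.96 * SD)."""
    return 1.96 * sd


# Invented volumes (mm^3) for two nodules, three scans, three observers.
example = [
    [[100.0, 101.0, 99.0], [110.0, 111.0, 109.0], [90.0, 91.0, 89.0]],
    [[200.0, 200.0, 200.0], [220.0, 220.0, 220.0], [180.0, 180.0, 180.0]],
]
t1_only = [scans[0] for scans in example]

print(f"interobserver SD: {interobserver_sd(t1_only):.2f}%")
print(f"interscan SD:     {interscan_sd(example):.2f}%")
print(f"95% limits:       +/- {limits_95(interscan_sd(example)):.1f}%")
```

Expressing every difference as a percentage of the nodule's own mean volume is what allows nodules of very different sizes to be pooled into a single SD.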
Fifty nodules, ranging in diameter from 4 to 19 mm, were chosen. The volume of each nodule was measured nine times (i.e., at T1, T2, and T3 by three observers). Six nodules (12%) were excluded from analysis because of a failure on at least one of the nine automated volumetric measurements. Segmentation failed because of proximity to a pulmonary vessel in four nodules or to a pleural surface in two nodules (Fig. 2). Of the 450 automated volumetric observations, 438 (97%) were successful. One ground-glass nodule had edges that were extremely difficult to define. The calculated volume ranged from 21 to 99 mm3, and volumes varied by 55% over the nine observations (Fig. 3). This outlier was dropped, allowing the use of normal statistics for analysis of the remaining 43 nodules.
The population of 43 patients included 27 women and 16 men. The average age was 60.2 years (range, 35–84 years). The CT characteristics and locations of the nodules are listed in Tables 1 and 2. Seventeen nodules were freestanding in the parenchyma, 16 were juxtapleural, and 10 were juxtavascular.
TABLE 1: Characteristics of Pulmonary Nodules in Final Evaluation
TABLE 2: SDs of the Means and Confidence Limits for Subgroups of Nodules
Column: SD (%) of the Mean (± 95% confidence limits). Rows: diameter < 6 mm and volume < 114 mm3; diameter 6 to < 9 mm; diameter ≥ 9 mm and volume < 1,560 mm3; smooth or lobulated; irregular or spiculated; calcified or contrast-enhanced; calcium detected or contrast material used.
The average nodule volume was 345.5 mm3 (SD = 361.1 mm3) with a range of 49.3–1,434 mm3. The distribution of nodule volumes is displayed in Figure 4 along with equivalent diameters estimated from the calculated volumes as though the nodules were spherical. Thirty-five (81%) of the 43 nodules were 10.5 mm or smaller.
The mean interobserver variability (repeatability) for the 43 nodules was 0.018% (SD = 0.73%; range = -3.3% to 7.6%). The interscan variability (reproducibility) gave an SD of the mean of 13.1% (95% confidence limits, ± 25.6%). Eight nodules were studied after IV contrast administration. From T1 to T2, there was a mean 2.0% decrease in volume in these eight nodules and from T1 to T3, a 7.2% decrease in volume, well within the confidence limits.
The sample was then subdivided to compare subgroups of nodules. No significant differences were found between these relatively small subgroups (Table 2). The SD of the mean for 27 smooth and lobulated nodules was 13.5% (95% confidence limits, ± 26.4%) and for 17 irregular and spiculated nodules was 12.8% (95% confidence limits, ± 25.2%). The SD of the mean for eight calcified nodules was 11.6% (95% confidence limits, ± 22.7%); for eight contrast-enhanced nodules, 16.6% (95% confidence limits, ± 32.7%); and for the remaining 28 nodules, 12.6% (95% confidence limits, ± 24.8%). Note that one nodule was both calcified and contrast-enhanced. When unenhanced noncalcified nodules were compared with contrast-enhanced or calcified nodules or with contrast-enhanced calcified nodules, the SD of the mean was 12.6% (95% confidence limits, ± 24.8%) and 14.5% (95% confidence limits, ± 28.4%), respectively. We compared 30 nodules surrounded by normal lung tissue with 10 found in diffusely abnormal lung background and found an SD of 13.1% (95% confidence limits, ± 25.6%) and 13.5% (95% confidence limits, ± 27.1%), respectively.
The sample was subdivided into nodules according to effective diameter, which is the diameter of a sphere with the measured nodule volume. The effective diameter ranges were chosen to evenly distribute the 43 nodules over three size categories for statistical purposes: < 6 mm (n = 13), 6 to < 9 mm (n = 16), and 9–19 mm (n = 14). The SD of the mean was 14.1% (± 27.67%), 16.0% (± 31.4%), and 9.1% (± 17.9%), respectively. Note that for larger nodules (≥ 9 mm) the standard deviation of the mean was lower and the CI narrowed. Figure 5 depicts the 95% confidence limits for nodules above a given volume, illustrating that confidence limits narrow as nodule volume increases.
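The effective diameter defined above follows from the volume of a sphere, V = πd³/6, so d = (6V/π)^(1/3). The short Python sketch below applies that conversion and the three size categories; the function names are illustrative, and the category labels simply restate the cutoffs given in the text.

```python
import math


def effective_diameter_mm(volume_mm3):
    """Diameter of a sphere that has the given volume."""
    return (6.0 * volume_mm3 / math.pi) ** (1.0 / 3.0)


def sphere_volume_mm3(diameter_mm):
    """Volume of a sphere with the given diameter."""
    return math.pi * diameter_mm ** 3 / 6.0


def size_category(volume_mm3):
    """Assign a nodule to one of the three effective-diameter groups."""
    d = effective_diameter_mm(volume_mm3)
    if d < 6.0:
        return "< 6 mm"
    if d < 9.0:
        return "6 to < 9 mm"
    return ">= 9 mm"


# Smallest, average, and largest nodule volumes reported in this study.
for volume in (49.3, 345.5, 1434.0):
    d = effective_diameter_mm(volume)
    print(f"{volume:7.1f} mm^3 -> {d:.1f} mm ({size_category(volume)})")
```

Because diameter grows only with the cube root of volume, the roughly 29-fold volume range in this sample corresponds to only about a threefold range in effective diameter.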
Pulmonary nodules are a common finding on CT of the chest, and their characterization is a common problem. Other than calcification, morphologic characteristics are of little help in differentiating benign from malignant nodules. Nodules that are 10 mm or greater usually undergo intensive evaluation (e.g., contrast-enhanced CT, PET, or biopsy), whereas pulmonary nodules smaller than 10 mm in diameter are frequently followed with serial CT scans to determine stability.
In daily practice, linear measurements in the axial plane are commonly used to assess size changes. The World Health Organization uses a bidimensional cross product to follow nodule size (the largest diameter × the perpendicular length), whereas the Response Evaluation Criteria in Solid Tumors (RECIST) protocol uses the largest dimension [13, 14]. If nodules were perfectly spherical, a change in diameter would accurately reflect overall changes in volume. However, nodules are frequently lobular, so judgments based on long- and short-axis measurements are necessarily subjective. In addition, nodule margins may be spiculated or indistinct, making it difficult to define borders precisely.
If important clinical decisions are to be made based on serial scans, the measurement tools must be observer-independent and reproducible from scan to scan. A 10% (or 1 mm) overestimate of the diameter of a 10-mm nodule could result in a perceived 33% change in volume, a potentially clinically significant error.
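For a spherical nodule, volume scales with the cube of the diameter, so the effect of a diameter error on the apparent volume can be checked directly. The Python sketch below is illustrative only; the function name is hypothetical, and real nodules are seldom perfect spheres.

```python
def volume_change_percent(diameter_error_fraction):
    """Percent change in sphere volume caused by a relative diameter error.

    A sphere's volume is proportional to d**3, so scaling the diameter by
    (1 + e) scales the volume by (1 + e)**3.
    """
    return ((1.0 + diameter_error_fraction) ** 3 - 1.0) * 100.0


# A 10-mm nodule measured as 11 mm (+10%) or as 9 mm (-10%).
overestimate = volume_change_percent(0.10)
underestimate = volume_change_percent(-0.10)
print(f"+10% diameter -> {overestimate:+.1f}% volume")
print(f"-10% diameter -> {underestimate:+.1f}% volume")
```

The asymmetry between the two directions (about +33% versus about -27%) is a consequence of the cubic relationship and means that overestimates of diameter inflate volume more than equal underestimates deflate it.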
The literature on accuracy of measurements of CT-detected nodules is limited and difficult to summarize because methodology varies from study to study and results are expressed differently by different authors.
For example, Wormanns et al. assessed agreement between linear measurements of pulmonary nodules (diameter, 2–40 mm). Using hard-copy images, the reviewers had to classify a nodule as smaller than 5 mm, 5–10 mm, or larger than 10 mm. They found an interobserver correlation of 0.91 and 0.89 for nodules reconstructed at 3 and 5 mm, respectively. For the study conducted by Revel et al., three radiologists measured 54 solid nodules on a PACS workstation three times in one session. Those authors concluded that to ensure a true increase in volume on serial scans (95% CI) when measurements were made by the same radiologist, the nodule diameter would have to increase by approximately 1.6 mm. With more than one reviewer, the nodule would have to increase by 1.7 mm to diagnose growth with certainty. Erasmus et al. showed considerable intraobserver variability and even more interobserver variability (five radiologists, two readings) in both one-dimensional and 2D measurements in 33 patients with lung tumors (diameter, 1.8–8 cm). Intraobserver variability would have led to misclassification of growth in 9.5% of unidimensional measurements and 21% of bidimensional measurements. Misclassification due to interobserver variability was 30% and 43%, respectively.
Schwartz et al. studied three types of tumors measured with handheld calipers, electronic calipers, and automated perimeter contour detection. Hand calipers and electronic calipers had an interobserver coefficient of variation of 0.19 and 0.17, respectively. Using automated perimeter contour detection, the coefficient of variation fell to 0.09.
These limitations of linear measurements and the nonspherical shape of most nodules have led investigators to focus on computerized volumetric methods. Yankelevitz et al. introduced semiautomated nodule volume calculation software and assessed it in a phantom using both spherical and deformable silicone nodules (diameter, 3.9–11 mm). Phantom nodule volumes could be measured accurately to within ± 3%. Two recent studies have focused specifically on the problems of repeatability and reproducibility of volumetric measurement in vivo. Wormanns et al. obtained two whole-lung CT scans within 10 min on 10 patients with multiple nodules. Using 50 nodules (diameter, 2–20 mm), they assessed intra- and interobserver agreement between two reviewers and then interscan agreement. Automated volumetric software was used for all measurements. Intraobserver variability was 0.5% (95% CI, 0.2–1.6%) and interobserver variability was 0.5% (95% CI, -3.0% to 1.4%). Interscan variability increased to approximately ± 20%.
Revel et al. reported on 54 solid nodules evaluated three times by three reviewers during the same session using the same software that we used in our experiment. Segmentation was successful in 96%. Intraobserver variability ranged from 2.4% to 3.1%. Interobserver agreement was perfect in 35 patients (67%). The small sample size of the remaining nodules limited further statistical analysis.
Variables that may affect the accuracy of volume measurements have also been studied. Ko et al. showed, in a phantom study of nodules (< 5 mm diameter), that higher precision computerized volumetric measurements could be obtained with the use of a high-frequency algorithm and diagnostic CT technique (120 mAs) rather than low amperage. Ground-glass nodules and small size were associated with increased measurement variability. In a phantom study, Winer-Muram et al. showed that nodule volume was overestimated more on thick-section CT images than on thin-section CT images.
The design of our study incorporated the use of a high-frequency algorithm, thin-section images (1.25 mm), and a diagnostic CT technique. The technique was held constant on three consecutive scans. These parameters were designed to minimize variability of measurement related to the scan technique itself. One variable that was not held constant was lung volume, which could have been controlled more precisely by the use of a spirometer. Instead, all CT scans were obtained with instructions for deep inspiration, which is consistent with actual clinical practice.
The first question posed by our study investigated interobserver repeatability. Variation among three reviewers was extremely low (0.018%) for nodules between approximately 4 and 20 mm in diameter. Thus, for a given nodule, regardless of shape and edge characteristics, three observers obtained almost identical volumes. This finding is similar to those of the in vitro studies of Yankelevitz et al. and Ko et al. and the in vivo studies of Wormanns et al. and Revel et al.
The second question examines variations in measurement when the same nodule is scanned three times over 20 min. The SD of the mean was 13.1% (CI, ± 25.6%). This is better than literature reports for linear measurements on the same scan but still high for clinical work. Even for nodules over 9 mm in diameter, the SD of the mean was 9.1% and confidence limits were ± 17.9%.
This study departs from previous work, except the recent work of Wormanns et al., in its assessment of the same nodule scanned more than once. Because there is negligible interobserver variability, the nodule itself and its margins with surrounding lung parenchyma must vary from scan to scan. It is hypothesized that a number of changes may occur between serial scans that may affect results of both manual linear and automated volume measurements. Physiologic changes such as lung volume, phase of cardiac cycle, microatelectasis, or patient position on the table may occur, and technical changes such as slice registration and selection may lead to varying amounts of volume averaging. Volume averaging in our study was minimized by using 1.25-mm axial images. Boll et al., using cardiac gating, recently showed that small nodules near the heart show as much as 34% volume change during the cardiac cycle.
This study was designed to assess all types of nodules rather than one specific type. The ability to analyze the subgroups of nodules is severely limited. Only a small percentage of nodules were of ground-glass or mixed attenuation. These are clinically considered to be the nodules most suspicious for cancer. These nodules are probably more difficult to measure reproducibly than solid nodules. We did not have a sufficiently large sample of such nodules to examine this question. The inclusion of eight completely calcified nodules also represents a limitation. Dense or complete calcification, except in certain sarcomas, is considered a benign characteristic, one not requiring measurement. It is possible that the automated volumes are more reliable in calcified nodules, which have sharp edges; thus, our study results appear more favorable than if only noncalcified nodules had been evaluated. However, the SD of the mean for our eight calcified nodules was 11.6% (± 22.7%) versus 12.9% (± 25.2%) for the noncalcified unenhanced group.
Does IV contrast material change the density of the nodule and its volumetric measurement? This is a frequent clinical scenario because the initial scan is often obtained with contrast enhancement and the follow-up, without contrast enhancement. Eight contrast-enhanced nodules were included in this study. The volume between the first and third scans decreased 7.2%, well within the confidence limits. The SD of the mean of the eight contrast-enhanced nodules was 16.6% (CI, ± 32.7%) versus 12.6% (± 24.8%) for the noncalcified unenhanced group. The role of IV contrast enhancement requires further study.
Segmentation failed in six (12%) of 50 nodules. In four nodules in which segmentation failed, a vessel was included in the automated segmentation. This is easily visible to the operator and could be electronically cropped. In our study, we chose not to allow the operator to alter the bounding box defining the nodule and its immediate surroundings before the volumes were calculated. Thus, some failed segmentations might have been rescued with minor intervention.
Results of this study suggest that the overall variability of volume measurements is considerably less with the automated software than with manual measurements. This is an important result and has immediate implications. In clinical practice, it suggests that high-quality volumetric measurements, where available, are less variable than linear measurements and should improve assessment of nodule stability. However, caution is still required in applying this tool because the overall variability between scans in vivo is still substantial, with an SD of the mean of 13.1% and wide confidence limits (± 25.6%). Because one observer in our study was a nonphysician graduate student, a trained radiologist is not required to produce consistent measurements. Nonetheless, six nodules did not segment properly and a vague ground-glass nodule gave unreliable results. Volume measurements must be overseen by a trained observer.
We thank Sylvia Bartz for her secretarial assistance. For their technical support, we are grateful to Beth Heckel and Saad Sirohey of GE Healthcare and Maureen Levenhagen and Mary Thielke of the Medical College of Wisconsin.
Supported by a grant from GE Healthcare, Milwaukee, WI.
Address correspondence to L. R. Goodman.
Swensen SJ, Jett JR, Sloan JA, et al. Screening for lung cancer with low-dose spiral computed tomography. Am J Respir Crit Care Med 2002; 165:508-513
Therasse P, Arbuck SG, Eisenhauer EA, et al. New guidelines to evaluate the response to treatment in solid tumors. European Organization for Research and Treatment of Cancer, National Cancer Institute of the United States, National Cancer Institute of Canada. J Natl Cancer Inst 2000; 92:205-216
Erasmus JJ, Gladish GW, Broemeling L, et al. Interobserver and intraobserver variability in measurement of non-small-cell carcinoma lung lesions: implications for assessment of tumor response. J Clin Oncol 2003; 21:2574-2582