Original Research
Multispecialty
February 22, 2023

Technical Adequacy of Fully Automated Artificial Intelligence Body Composition Tools: Assessment in a Heterogeneous Sample of External CT Examinations

Abstract

Please see the Editorial Comment by Robert D. Boutin discussing this article.
BACKGROUND. Clinically usable artificial intelligence (AI) tools analyzing imaging studies should be robust to expected variations in study parameters.
OBJECTIVE. The purposes of this study were to assess the technical adequacy of a set of automated AI abdominal CT body composition tools in a heterogeneous sample of external CT examinations performed outside of the authors' hospital system and to explore possible causes of tool failure.
METHODS. This retrospective study included 8949 patients (4256 men, 4693 women; mean age, 55.5 ± 15.9 years) who underwent 11,699 abdominal CT examinations performed at 777 unique external institutions with 83 unique scanner models from six manufacturers with images subsequently transferred to the local PACS for clinical purposes. Three independent automated AI tools were deployed to assess body composition (bone attenuation, amount and attenuation of muscle, amount of visceral and subcutaneous fat). One axial series per examination was evaluated. Technical adequacy was defined as tool output values within empirically derived reference ranges. Failures (i.e., tool output outside of reference range) were reviewed to identify possible causes.
RESULTS. All three tools were technically adequate in 11,431 of 11,699 (97.7%) examinations. At least one tool failed in 268 (2.3%) of the examinations. Individual adequacy rates were 97.8% for the bone tool, 99.1% for the muscle tool, and 98.9% for the fat tool. A single type of image processing error (anisometry error, due to incorrect DICOM header voxel dimension information) accounted for 81 of 92 (88.0%) examinations in which all three tools failed, and all three tools failed whenever this error occurred. Anisometry error was the most common specific cause of failure of all tools (bone, 31.6%; muscle, 81.0%; fat, 62.8%). A total of 79 of 81 (97.5%) anisometry errors occurred on scanners from a single manufacturer; 80 of 81 (98.8%) occurred on the same scanner model. No cause of failure was identified for 59.4% of failures of the bone tool, 16.0% of failures of the muscle tool, or 34.9% of failures of the fat tool.
CONCLUSION. The automated AI body composition tools had high technical adequacy rates in a heterogeneous sample of external CT examinations, supporting the generalizability of the tools and their potential for broad use.
CLINICAL IMPACT. Certain causes of AI tool failure related to technical factors may be largely preventable through use of proper acquisition and reconstruction protocols.

Highlights

Key Finding
Three fully automated AI tools for measuring body composition (vertebral bone, body wall musculature, and visceral and subcutaneous abdominal fat) had technical adequacy rates of 97.8–99.1% in a sample of 11,699 external abdominal CT examinations, performed at 777 unique external institutions with 83 unique scanner models from six manufacturers.
Importance
The results of this study support the potential of automated AI body composition tools to generalize to a diverse array of external CT examinations.
The development and deployment of artificial intelligence (AI) tools for use in clinical medicine in general—and radiology in particular—has increased in recent years [1, 2]. Development of AI tools in radiology has focused primarily on analysis of imaging studies for automated detection of a particular finding (e.g., pneumothorax or intracranial hemorrhage) or automated measurement of some aspect of an imaging study (e.g., body composition or aortic calcium) to reduce errors and increase radiologist efficiency [3].
In prior work, a set of automated AI tools was developed that can be deployed on CT examinations of the abdomen to assess various aspects of body composition, including vertebral body attenuation (hereafter, bone tool) [4–8]; amount and attenuation of back, paraspinous, and anterior abdominal muscle (muscle tool) [9, 10]; and amount of visceral and subcutaneous fat (fat tool) [11–13]. These body composition parameters have been found to correlate with health outcomes, including osteoporosis [14–16] and fragility fractures [17, 18], metabolic syndrome [19], cardiovascular events [20–23], and overall mortality [22, 24]. Given that over 70 million CT examinations are performed annually in the United States [25], these tools have tremendous potential for use in opportunistic health screening of abdominal CT examinations performed for nearly any indication [26].
Because of the way in which deep learning AI tools for analysis of imaging studies are developed and validated, a primary concern is generalizability of tool performance to imaging studies performed on equipment types, at locations, or with protocols that differ from those on which the AI tools were initially developed and validated [27–29]. An additional barrier to adoption of AI tools is opacity, real or perceived, regarding how these algorithms function (i.e., unexplainability of AI) [2, 3, 30]. Patients' imaging studies are occasionally transferred by a variety of mechanisms from the health system where the examination is originally performed to a different health system where the patient may subsequently seek care. These imaging studies, which are external (also called outside) with respect to the health system receiving the examinations, afford an opportunity to address issues related to the generalizability of AI algorithms in medical imaging. Accordingly, the aims of this study were to assess the technical adequacy of a set of automated AI abdominal CT body composition tools in a heterogeneous sample of external CT examinations performed outside the authors' hospital system and to explore possible causes of failure of AI tools.

Methods

Patient Sample

This retrospective cross-sectional study was conducted at a single academic institution. The study protocol was HIPAA-compliant and approved by the institutional review board. The requirement for written informed consent was waived.
An electronic search was conducted of the institutional PACS from January 1, 2000, through December 31, 2020, for CT examinations of the abdomen performed on adult (age ≥ 18 years) patients at facilities outside of the authors' primary hospital system. All patients had presented to the authors' institution for clinical care. All CT examinations were initially performed at facilities external to the authors' institution, and the images were later transferred to the authors' local institutional PACS. The images were transferred before initiation of this study; no images were transferred specifically for this study. The search yielded 12,666 unique CT examinations performed on 9535 unique patients. Examinations were eligible for analysis whether performed with or without IV contrast material. For CT examinations with more than one axial series, a single axial series was randomly selected for inclusion in the analysis.
A total of 35 series were excluded because they were not obtained with the patient in the supine position (e.g., prone or decubitus positioning was used). An additional 932 series were excluded because they did not contain contiguous images spanning from the superior endplate of the L1 vertebral body through the inferior endplate of L3. If the initial series that was randomly selected for an examination was excluded, a potential replacement series from that examination was not sought. After these exclusions, the final study sample comprised 11,699 series from 8949 patients (4256 men, 4693 women; mean ± SD age, 55.5 ± 15.9 years; median, 56.6 years; range, 18–98 years). The general characteristics of these examinations, as extracted from DICOM header data, are summarized in Table 1. The examinations were performed at 777 outside institutions comprising 83 scanner models from six manufacturers (GE Healthcare, Marconi Medical Systems, Philips Healthcare, Shimadzu, Siemens Healthineers, Toshiba Medical). Figure 1 shows the flow of patient selection.
TABLE 1: Characteristics of 11,699 CT Series
Characteristic                      Value
Unique examinations                 11,699
Unique patients                     8949
Unique external institutions        777
Unique scanner manufacturers        6
Unique scanner models               83
Unique reconstruction kernels       66
Tube potential
  Unique values                     8
  Range (kVp)                       90–140
Tube current
  Unique values                     686
  Range (mA)                        26–801
Slice thickness
  Unique values                     39
  Range (mm)                        0.5–10.0
FOV
  Unique values                     412
  Range (mm)                        250–700

Note—Except for ranges of measurements, values are counts.

Fig. 1 —Chart shows flow of patient selection. AI = artificial intelligence.

Description of Body Composition Tools

The AI body composition tools evaluated in this study have been previously described and evaluated in earlier validation studies [4–6, 9–11]. The code for these algorithms was not publicly available at the time of this writing. The three AI tools assessed in this study (bone, muscle, fat) function independently of one another; adequacy or failure of one tool is not dependent on the performance of the others. These body composition tools were developed to analyze axial images from CT examinations of the abdomen performed on adults (age ≥ 18 years) in supine position. The three tools are packaged into a single fully automated container for processing of an inputted CT series. For a given series, the tools initially perform preprocessing to normalize the slice thickness to 3 mm (with 3 mm spacing) and to correct for offsets in CT attenuation values. The tools then apply a body part regression function to determine whether the series has adequate anatomic coverage (at least from the L1 to L3 levels) and to provide slice positions for relevant anatomic landmarks [31, 32].
The bone tool places an ROI in the trabecular bone of the vertebral body on a single axial slice at a selected level and in the three contiguous slices above and below the selected level (total of seven contiguous slices). The tool reports the median attenuation in Hounsfield units at the selected level, and the minimum of the median attenuation across the seven contiguous slices is measured and used to provide higher sensitivity if desired [4]. For the current study, the L1 level was selected, and only the median attenuation at the selected level was used. The muscle tool automatically detects and segments abdominal, paraspinous, and back musculature on a single slice at the level of the L3 vertebral body and calculates total muscle cross-sectional area and mean muscle attenuation in Hounsfield units. Similarly, the fat tool automatically detects and segments visceral and subcutaneous fat in a single slice at the level of the L3 vertebral body and calculates cross-sectional area for both types of fat and cross-sectional area of total abdominal fat. All tools provide default values of −10,000 if detecting a segmentation failure. As part of the output, each AI algorithm generates a segmentation map, which a radiologist may review for accuracy. Figure 2 shows examples of the segmentation maps for the three tools.
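The two summary values reported by the bone tool can be illustrated with a minimal sketch. It assumes that the ROI voxel values for the seven contiguous slices have already been extracted as lists of Hounsfield units; the function name and input layout are hypothetical and are not taken from the tools' code, which was not public.

```python
from statistics import median


def bone_attenuation_summary(roi_hu_by_slice):
    """Summarize trabecular ROI attenuation over seven contiguous slices.

    roi_hu_by_slice: list of 7 lists of HU values, ordered superior to
    inferior, with the selected level (e.g., L1) at index 3. This layout
    is an assumption made for illustration only.
    """
    if len(roi_hu_by_slice) != 7:
        raise ValueError("expected ROI values for 7 contiguous slices")
    # One median per slice.
    medians = [median(values) for values in roi_hu_by_slice]
    return {
        # Value used in the current study: median at the selected level.
        "selected_level_median_hu": medians[3],
        # Optional, more sensitive value: minimum of the seven medians.
        "min_of_medians_hu": min(medians),
    }


# Made-up ROI values, for demonstration only.
example_rois = [
    [150, 160, 170], [140, 150, 160], [130, 140, 150], [120, 130, 140],
    [110, 120, 130], [100, 110, 120], [90, 100, 110],
]
summary = bone_attenuation_summary(example_rois)
```

In this made-up example the median at the selected level is 130 HU and the minimum of the seven medians is 100 HU.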
Fig. 2 —Examples of fully automated artificial intelligence body composition tool segmentation. Red indicates skeletal muscle; green, trabecular bone; yellow, visceral fat; blue, subcutaneous fat.
A and B, 48-year-old man who underwent abdominopelvic CT on GE Healthcare scanner at outside institution. Images show segmentations at L1 (A) and L3 (B) levels.
C and D, 65-year-old woman who underwent abdominopelvic CT on Marconi Medical Systems scanner at outside institution other than that for A and B. Images show segmentations at L1 (C) and L3 (D) levels.

Assessment of Body Composition Tools

The automated body composition tools were applied to all series included in the final study sample. A tool was deemed technically adequate in its analysis of a series if it returned a value within the reference range for the given parameter, as listed in Table 2. These reference ranges were empirically derived on the basis of ranges observed in prior studies evaluating the performance of these tools that had large sample sizes (bone, 11,035 examinations [4]; muscle, 9310 examinations [10]; fat, 8852 examinations [11]). A tool was deemed not technically adequate in its analysis of a series if it returned a value outside of the reference range for the given parameter, whether or not the tool had detected a segmentation failure. The technical adequacy rate was calculated for each tool. For tools that provided multiple measured values (e.g., muscle area and muscle attenuation for the muscle tool; visceral abdominal fat area, subcutaneous fat area, and total abdominal fat area for the fat tool), all values outputted by the tool were required to be within their respective reference range for the tool to be considered technically adequate in the tested series.
TABLE 2: Empirically Derived Reference Ranges for Artificial Intelligence Tools
Tool                                        Minimum    Maximum
Bone tool [4]
  Bone attenuation (HU)                     −50        1200
Muscle tool [10]
  Muscle attenuation (HU)                   −50        200
  Muscle area (cm2)                         25         500
Fat tool [11]
  Visceral abdominal fat area (cm2)         0          1200
  Subcutaneous abdominal fat area (cm2)     0.1        1000
  Total abdominal fat area (cm2)            0.1        1500

Note—All tools provide default values of −10,000 if a segmentation failure is detected.
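The adequacy rule described above can be written as a short range check. The following is a minimal sketch rather than the study's implementation: the dictionary keys and function name are hypothetical, and treating the range limits as inclusive is an assumption. The ranges and the default failure value of −10,000 are those given in Table 2.

```python
# Default value returned by a tool when it detects a segmentation failure.
SEGMENTATION_FAILURE = -10_000

# Reference ranges from Table 2 as (minimum, maximum).
# Inclusive bounds are an assumption made for this sketch.
REFERENCE_RANGES = {
    "bone": {"bone_attenuation_hu": (-50, 1200)},
    "muscle": {
        "muscle_attenuation_hu": (-50, 200),
        "muscle_area_cm2": (25, 500),
    },
    "fat": {
        "visceral_fat_area_cm2": (0, 1200),
        "subcutaneous_fat_area_cm2": (0.1, 1000),
        "total_fat_area_cm2": (0.1, 1500),
    },
}


def is_technically_adequate(tool, outputs):
    """Return True only if every output of the tool is within its range.

    A missing output is treated like a detected segmentation failure.
    The failure default lies outside every range, so it always fails.
    """
    for name, (low, high) in REFERENCE_RANGES[tool].items():
        value = outputs.get(name, SEGMENTATION_FAILURE)
        if not (low <= value <= high):
            return False
    return True
```

For example, a muscle tool output with a mean attenuation of −58.8 HU fails this check whatever the muscle area, because every value must be within range. Manual review can still overturn such a result, as described in the next paragraph.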

For all series in which at least one tool failed, two fellowship-trained abdominal radiologists (B.D.P. with 5 and P.J.P. with 24 years of posttraining experience) reviewed all available information (CT images, DICOM header information, tool outputs) in consensus in a qualitative manner to identify possible causes of failure. If the segmentation was deemed correct at this manual review even though the tool had outputted a value outside of the reference range, the tool's performance in that series was reclassified as technically adequate. An additional targeted assessment was performed to explore the relationship between tool failure and classification of a series as a derived multiplanar reformatted (MPR) axial series as opposed to a series reconstructed from an originally acquired axial dataset.
For each AI tool, a subset of 200 series in which the tool was deemed technically adequate (i.e., all outputted values were within the reference range) was randomly selected for manual review. These series, along with the segmentations performed by the tool, were reviewed independently by two reviewers (B.D.P., one of the previously noted investigators, and A.M.S., an undergraduate student who received training in evaluation of the tool outputs before study initiation). The reviewers qualitatively recorded the presence of significant segmentation errors that may have occurred despite the tool's having output values within the reference range, thus indicating a false observation of technically adequate tool performance. Discrepancies were reconciled by a third reviewer (P.J.P., one of the previously noted reviewers).

Statistical Analysis

Data were summarized descriptively. All calculations were performed with Microsoft Excel software (version 2022, Build 14931.20764).

Results

Technical Adequacy Rates

After manual review of all series in which at least one of the three automated body composition tools outputted a value outside of the reference range, no such case was reclassified as technically adequate for the bone or fat tool. One such case was reclassified as technically adequate for the muscle tool (value outside of reference range attributed to extreme sarcopenia) (Fig. 3). After this reclassification, all three automated body composition tools were technically adequate in 11,431 of 11,699 (97.7%) series. Individually, the bone tool was technically adequate in 11,443 of 11,699 (97.8%) series, the muscle tool in 11,599 of 11,699 (99.1%) series, and the fat tool in 11,570 of 11,699 (98.9%) series. At least one tool failed in 268 of 11,699 (2.3%) series. Of these, one tool failed in 143 of 268 (53.4%) series, two tools failed in 33 of 268 (12.3%) series, and all three tools failed in 92 of 268 (34.3%) series.
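The counts in the preceding paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative. It recomputes the adequacy rates and confirms that the per-tool failure counts agree with the breakdown of series by number of failed tools.

```python
TOTAL_SERIES = 11_699

# Series in which each tool was technically adequate (from the text).
adequate_by_tool = {"bone": 11_443, "muscle": 11_599, "fat": 11_570}

# Series grouped by how many of the three tools failed (from the text).
series_by_failed_tool_count = {1: 143, 2: 33, 3: 92}


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


rates = {tool: percent(n, TOTAL_SERIES) for tool, n in adequate_by_tool.items()}

# Series with at least one failure, and series with no failure.
series_with_failure = sum(series_by_failed_tool_count.values())
all_adequate = TOTAL_SERIES - series_with_failure

# Total tool failures counted two ways: per tool, and weighting each
# series by the number of tools that failed in it.
failures_per_tool = sum(TOTAL_SERIES - n for n in adequate_by_tool.values())
failures_by_overlap = sum(k * n for k, n in series_by_failed_tool_count.items())

print(rates, series_with_failure, all_adequate)
print(failures_per_tool == failures_by_overlap)
```

Both counting methods give 485 tool failures (256 + 100 + 129 = 143 + 2 × 33 + 3 × 92), so the reported figures are mutually consistent.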
Fig. 3 —52-year-old man who underwent abdominopelvic CT at outside institution. Axial images without (left) and with (right) segmentation overlay show L3 level. Red indicates skeletal muscle; green, trabecular bone; yellow, visceral fat; blue, subcutaneous fat. Muscle tool returned mean muscle attenuation of −58.8 HU, outside of reference range; thus, muscle tool was deemed technical failure. At manual review, tool was observed to have correctly segmented paraspinous musculature, such that value was accurate despite being outside of reference range. Outlier value was attributed to presence of extreme sarcopenia. This case was reclassified as technically adequate. No other instance of technical failure of any tool in any series was reclassified.

Causes of Artificial Intelligence Tool Failure

Table 3 shows the causes of failure identified for each of the three AI tools. Among the series in which all three AI tools failed, incorrect voxel dimensions in the DICOM header resulting in anisometric pixels after preprocessing accounted for failure of all three tools in 81 of 92 (88.0%) series. Specifically, during preprocessing, the incorrect DICOM header information caused the resampling procedure to generate a distorted image volume with a horizontally or vertically stretched appearance that was then inputted to the measurement tools. Figure 4 shows examples of this error. This anisometry error was the most common specific cause of failure of all three tools (bone tool, 81/256 [31.6%]; muscle tool, 81/100 [81.0%]; fat tool, 81/129 [62.8%]) and accounted for 81 of 268 (30.2%) series in which at least one tool failed. All three AI tools failed in all series in which anisometry error was present.
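The way incorrect header voxel dimensions can lead to a stretched volume can be sketched with a simplified resampling calculation. This is an illustration under stated assumptions, not the tools' actual preprocessing: the function name is hypothetical, the 1.0-mm in-plane target spacing is invented for the example (only the 3-mm slice spacing comes from the text), and the image dimensions and spacings are made up.

```python
def resampled_shape(shape, spacing_mm, target_spacing_mm):
    """Number of samples per axis after resampling to a target spacing.

    shape, spacing_mm, and target_spacing_mm are (slices, rows, columns).
    The physical extent along each axis is taken as count * spacing, so a
    wrong spacing in the header changes the output size along that axis.
    """
    return tuple(
        max(1, round(n * s / t))
        for n, s, t in zip(shape, spacing_mm, target_spacing_mm)
    )


# Hypothetical series: 100 slices of 512 x 512 pixels.
shape = (100, 512, 512)
target = (3.0, 1.0, 1.0)  # 3-mm slices (from the text); in-plane value assumed

# Correct header: square 0.8-mm pixels.
correct = resampled_shape(shape, (1.5, 0.8, 0.8), target)

# Incorrect header: column spacing wrongly recorded as 1.6 mm.
distorted = resampled_shape(shape, (1.5, 0.8, 1.6), target)

print(correct)    # square in-plane grid
print(distorted)  # about twice as wide as tall: horizontally stretched
```

Under these assumptions the same pixel data is resampled to a grid roughly twice as wide as it should be, which matches the horizontally stretched appearance described above. An error in the row spacing would produce the vertically stretched variant.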
TABLE 3: Causes of Failure of Individual Artificial Intelligence Tools
Cause of Failure                          No.     %
Bone tool (n = 256)
  No cause identified                     152     59.4
  Anisometry error                        81      31.6
  Metallic hardware (streak artifact)     13      5.1
  Severe scoliosis                        4       1.6
  Incomplete or narrow FOV^a              3       1.2
  Vacuum disk                             2       0.8
  Vertebroplasty                          1       0.4
Muscle tool (n = 100)
  Anisometry error                        81      81.0
  No cause identified                     16      16.0
  Incomplete or narrow FOV^a              2       2.0
  Severe scoliosis                        1       1.0
Fat tool (n = 129)
  Anisometry error                        81      62.8
  No cause identified                     45      34.9
  Incomplete or narrow FOV^a              2       1.6
  Metallic hardware (streak artifact)     1       0.8

Note—At least one tool failed in a total of 268 series, there being overlap between tools. One tool failed in 143 series, two tools failed in 33 series, and all three tools failed in 92 series. Percentages exceed 100 owing to rounding.

^a Includes truncation artifact.
Fig. 4 —Examples of anisometry error leading to failure of artificial intelligence (AI) tool. Both series were originally reconstructed into axial plane by use of multiplanar reformatted data rather than from original axial acquisition, causing erroneous voxel dimension information in DICOM header. All three AI tools (bone, fat, muscle) failed in each series. This type of error accounted for 88.0% (81/92) of series in which all tools failed but was not seen in any series in which any tool was technically adequate.
A, 65-year-old man who underwent abdominopelvic CT at outside institution. Image shows horizontally stretched variation of anisometry error.
B, 49-year-old man who underwent abdominopelvic CT at outside institution. Image shows vertically stretched variation of anisometry error.
Other identified causes of failure of at least one tool were as follows: streak artifact generated by metallic hardware (bone tool, 13/256 [5.1%]; fat tool, 1/129 [0.8%]), severe scoliosis (bone tool, 4/256 [1.6%]; muscle tool, 1/100 [1.0%]), and incomplete or narrow FOV (including truncation artifact) (bone tool, 3/256 [1.2%]; muscle tool, 2/100 [2.0%]; fat tool, 2/129 [1.6%]). Additional causes of failure unique to the bone tool were vacuum disk phenomenon in two series and vertebroplasty in one series. Figure 5 shows examples of identified causes of failure. No cause of failure was identified in 152 of 256 (59.4%) series for the bone tool, 16 of 100 (16.0%) for the muscle tool, and 45 of 129 (34.9%) for the fat tool. In addition, no cause of failure of any tool that failed was identified in 161 of 268 (60.1%) series in which at least one tool failed. Figure 6 shows examples of tool failure without an identified cause.
Fig. 5 —Examples of specific causes of failure of artificial intelligence tool. Red indicates skeletal muscle; green, trabecular bone; yellow, visceral fat; blue, subcutaneous fat. Liver (beige) and spleen (orange) were also segmented but not included in analysis.
A, 78-year-old woman who underwent abdominopelvic CT at outside institution. Axial images without (left) and with (right) segmentation overlay show L1 level. Bone tool returned L1 vertebral body bone attenuation of −146 HU, outside of reference range; thus, bone tool was deemed technical failure. Failure was attributed to volume averaging of vacuum phenomenon within slice.
B, 64-year-old woman who underwent abdominopelvic CT at outside institution. Axial images without (left) and with (right) segmentation overlay show L1 level. Bone tool returned vertebral body bone attenuation of −10,000 HU (default value for segmentation failure detected by tool), outside of reference range; thus, bone tool was deemed technical failure. Failure was attributed to presence of spinal fusion hardware.
Fig. 6 —Examples of failure of artificial intelligence tool in which no cause of failure was identified. Red indicates skeletal muscle; green, trabecular bone; yellow, visceral fat; blue, subcutaneous fat. Liver (beige) and spleen (orange) were also segmented but not included in analysis. No cause of failure was identified in either series.
A, 71-year-old woman who underwent abdominopelvic CT at outside institution. Axial images without (left) and with (right) segmentation overlay show L3 level. Muscle tool returned muscle area of 17.1 cm2, outside of reference range; thus, muscle tool was deemed technical failure.
B, 49-year-old woman who underwent abdominopelvic CT at outside institution. Axial images without (left) and with (right) segmentation overlay show L1 level. Bone tool returned vertebral body attenuation of −10,000 HU (default value for segmentation failure detected by tool), outside of reference range; thus, bone tool was deemed technical failure.

Anisometry Errors and Multiplanar Reformatted Series

The 81 series in which all three tools failed owing to anisometry error were performed at 20 of 777 (2.6%) unique external sites; two sites accounted for 45 of 81 (55.6%) anisometry errors. In addition, 79 of 81 (97.5%) anisometry errors occurred on scanners from a single manufacturer; 80 of 81 (98.8%) occurred on the same model of CT scanner.
A total of 71 of 81 (87.7%) anisometry errors occurred in derived MPR axial series (as opposed to series reconstructed from an originally acquired axial dataset). In the entire sample, 242 of 11,699 (2.1%) series were derived MPR series, and 81 of the 242 (33.5%) were affected by anisometry errors. Among series that were not derived MPR, the technical adequacy rates were 98.4% (11,275/11,457) for the bone tool, 99.8% (11,430/11,457) for the muscle tool, and 99.5% (11,401/11,457) for the fat tool. All three tools were technically adequate in 11,263 of 11,457 (98.3%) series.

Review of Random Subset of Series With Technically Adequate Tool Performance

Among series in which the performance of the tool was deemed technically adequate on the basis of outputted values within reference ranges, 200 of 11,443 for the bone tool, 200 of 11,599 for the muscle tool, and 200 of 11,570 for the fat tool were randomly selected for manual review to assess for erroneous segmentation. The initial two observers had the following disagreements: bone tool, disagreements on three series, no segmentation deemed erroneous by the third observer; muscle tool, seven series, one segmentation deemed erroneous by the third observer; fat tool, 13 series, five segmentations deemed erroneous by the third observer. True technical adequacy was thus confirmed in 200 of 200 (100.0%) series for the bone tool, 199 of 200 (99.5%) series for the muscle tool, and 195 of 200 (97.5%) series for the fat tool. In the single series with erroneous segmentation by the muscle tool, the tool incompletely segmented the anterior abdominal wall musculature of a patient with high muscle mass and low body fat (Fig. 7A). In all five series with erroneous segmentation by the fat tool, significant truncation artifact was present (Fig. 7B).
Fig. 7 —Examples of incorrect classifications as technically adequate. Red indicates skeletal muscle; green, trabecular bone; yellow, visceral fat; blue, subcutaneous fat.
A, 28-year-old man who underwent abdominopelvic CT at outside institution. Axial images with (left) and without (right) segmentation overlay show L3 level. Muscle tool returned L3 muscle area of 120.0 cm2, within reference range; thus, muscle tool output was deemed technically adequate. At manual review of segmentation, muscle tool was found to have failed to segment substantial amount of anterior abdominal wall musculature, attributed to high muscle mass and low body fat.
B, 50-year-old woman who underwent abdominopelvic CT at outside institution. Axial images with (left) and without (right) segmentation overlay show L3 level. Fat tool returned subcutaneous fat area of 325.2 cm2 and visceral fat area of 569.8 cm2, within reference range; thus, fat tool output was deemed technically adequate. At manual review of segmentation, fat tool was found to have overmeasured visceral fat area and undermeasured subcutaneous fat area owing to significant truncation artifact.

Discussion

In this study, to evaluate the technical adequacy of automated AI body composition tools and to identify possible causes of failure, the tools were tested on a large dataset of external CT examinations. Individually, the bone, muscle, and fat tools were each technically adequate—defined as outputting values within their respective empirically derived reference ranges—in over 97% of outside studies. This performance compares favorably with results in previously published work on internal datasets [33, 34].
All three tools failed in 34.3% of series in which at least one tool failed. Given that these tools function independently and are influenced by information in different parts of CT images, this observation indicates a global technical issue as the cause of failure in these series. Indeed, 88.0% of global failures (i.e., failure of all three tools) were due to anisometry error associated with pixel stretching during image preprocessing. This phenomenon in turn was attributable to incorrect voxel dimension information in DICOM headers. Most anisometry errors occurred in derived MPR series as opposed to series of originally acquired axial images. If all such derived MPR series (2.1% of all series) were excluded—regardless of tool success or failure—the technical adequacy rate would range from 98.4% to 99.8% across the three tools. Thus, the straightforward step of checking to ensure that the target series is not a derived MPR series would markedly mitigate a major cause of failure in application of these tools.
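A pre-screening step of the kind suggested above could be sketched as follows. This is a hedged illustration, not the method used in this study, which does not state how derived MPR series were identified. It assumes that the DICOM Image Type attribute (0008,0008) has already been read into a list of strings, that its first value is ORIGINAL or DERIVED as the standard defines, and that reformats carry a token such as MPR or REFORMATTED in a later value. That last convention varies by vendor and is an assumption here, and the function name is hypothetical.

```python
# Tokens that may mark a reformatted series in Image Type values 3 and later.
# Which tokens are actually used is vendor dependent (assumption).
REFORMAT_TOKENS = {"MPR", "REFORMATTED"}


def looks_like_derived_mpr(image_type):
    """Heuristic screen for derived multiplanar reformatted series.

    image_type: list of strings from DICOM Image Type (0008,0008),
    e.g. ["ORIGINAL", "PRIMARY", "AXIAL"].
    Returns True when the first value is DERIVED and a later value
    contains a reformat token.
    """
    values = [str(v).strip().upper() for v in image_type]
    if not values:
        return False
    is_derived = values[0] == "DERIVED"
    mentions_reformat = any(v in REFORMAT_TOKENS for v in values[2:])
    return is_derived and mentions_reformat


def split_series(series_image_types):
    """Partition {series_id: image_type} into (to_analyze, to_flag) ID lists."""
    to_analyze, to_flag = [], []
    for series_id, image_type in series_image_types.items():
        target = to_flag if looks_like_derived_mpr(image_type) else to_analyze
        target.append(series_id)
    return to_analyze, to_flag
```

Because the anisometry errors themselves arose from incorrect header information, a header-based screen of this kind is best treated as a first filter, with flagged series routed to manual review rather than silently discarded.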
A range of causes of AI tool failure other than anisometry error were also identified. These included streak artifact due to metallic hardware, severe scoliosis, vacuum disk phenomenon, vertebroplasty, and truncation artifact. These factors caused tool failure in small fractions of series, indicating overall robustness of the tools. Nonetheless, for most series in which at least one tool failed, no cause of failure was identified for any tool.
Truncation artifact was problematic not only in instances of tool failure but also in instances of false assessments of technical adequacy. Truncation artifact has been previously described as a cause of AI tool segmentation failure [34]. Truncation artifact was potentially observed in the context of an incomplete or narrow FOV. Possible causes of incomplete or narrow FOV include large body habitus and targeted examination indication leading to intentional selection of narrow coverage. The body composition tools would not be expected to be successful in analyzing such series with incomplete anatomic coverage. Considering the impact of this issue, future work should explore approaches to training the tools to handle truncation artifact.
AI tools are increasingly used in clinical medicine [1, 2]. In radiology, such tools are often designed to aid the radiologist in detection of clinically relevant findings [3]. In contrast, the tools evaluated in this study automatically determine body composition metrics in abdominopelvic CT examinations. Manual determination of these metrics would be overly cumbersome and time-consuming in clinical practice. The AI tools can be applied to essentially any CT examination performed for any clinical indication. These body composition metrics—bone attenuation, muscle volume and attenuation, and fat volume and distribution—have correlations with highly clinically significant health outcome measures at the population level [14–24]. Consequently, these measures could be opportunistically used to divert patients into appropriate screening, prevention, or surveillance programs, impacting overall population health and health care cost savings.
One challenge in the development of deep learning AI tools is generalizability of a given tool to datasets outside of the training and validation datasets used to develop the tool [13]. In radiology, a specific example of this dilemma arises when AI tools are applied to imaging examinations performed at external institutions with the images later transmitted to the home institution. Within a single institution or health system, standard imaging protocols are generally developed to ensure quality and consistency, and equipment is commonly acquired from a single preferred vendor. However, among disparate institutions and health systems, there is wide variation in equipment and in specific examination parameters and protocols. These differences may have unforeseen effects on the performance of AI tools that did not incorporate examinations performed with extrinsic parameters into training and validation datasets, highlighting the importance of the findings of this study.
Elucidating the causes of failure of deep learning AI tools can be challenging, as the process that the algorithm uses to reach its conclusion can be opaque. This challenge may be augmented in analysis of very large datasets, as manual review of failures can be time-consuming. However, understanding causes of failure is critical not only for improving performance but also for increasing radiologists' trust in these tools [30]. A benefit of identifying technical causes of failure—for example, anisometry errors—is that through the use of proper image acquisition and reconstruction protocols, these errors may be largely preventable. If the errors are not prevented, understanding of the technical causes of failure should at least allow identification of studies not meeting certain technical standards so that they can be omitted from AI analysis.
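As an illustration of how series with inconsistent voxel geometry might be identified before AI analysis, the following minimal Python sketch compares a declared through-plane voxel dimension with the slice spacing implied by the per-slice position and orientation values in the DICOM header. It is only a sketch under stated assumptions: the function names, the 0.1-mm tolerance, and the choice of this particular check are hypothetical, and the study does not state that this check would have detected the anisometry errors observed. Header values are passed in as plain numbers (as would be read from ImagePositionPatient and ImageOrientationPatient), so no DICOM library is required.

```python
from typing import List, Sequence, Tuple


def slice_normal(orientation: Sequence[float]) -> Tuple[float, float, float]:
    """Normal of the image plane: cross product of the row and column
    direction cosines stored in ImageOrientationPatient (six values).
    The result has unit length when the two cosine vectors are orthonormal."""
    rx, ry, rz, cx, cy, cz = orientation
    return (ry * cz - rz * cy, rz * cx - rx * cz, rx * cy - ry * cx)


def measured_slice_gaps(
    positions: Sequence[Sequence[float]], orientation: Sequence[float]
) -> List[float]:
    """Gaps (mm) between neighboring slices, measured along the slice normal."""
    normal = slice_normal(orientation)
    # Project each ImagePositionPatient onto the normal, then sort so that
    # the result does not depend on the order in which slices were stored.
    offsets = sorted(sum(p * n for p, n in zip(pos, normal)) for pos in positions)
    return [b - a for a, b in zip(offsets, offsets[1:])]


def geometry_is_consistent(
    positions: Sequence[Sequence[float]],
    orientation: Sequence[float],
    declared_spacing_mm: float,
    tolerance_mm: float = 0.1,  # hypothetical tolerance, not from the study
) -> bool:
    """True when every measured gap matches the declared spacing."""
    if len(positions) < 2:
        raise ValueError("at least two slices are needed to measure spacing")
    gaps = measured_slice_gaps(positions, orientation)
    return all(abs(gap - declared_spacing_mm) <= tolerance_mm for gap in gaps)


# Example: three axial slices that are 3 mm apart.
axial = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0)
slices = [(0.0, 0.0, 0.0), (0.0, 0.0, 3.0), (0.0, 0.0, 6.0)]
consistent = geometry_is_consistent(slices, axial, declared_spacing_mm=3.0)
inconsistent = geometry_is_consistent(slices, axial, declared_spacing_mm=1.0)
```

In this example the first call reports consistent geometry and the second does not, because the header would declare 1-mm spacing for slices that are actually 3 mm apart. A series failing such a check could be excluded from AI analysis or flagged for review.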
There were limitations to this study. To evaluate the technical adequacy of the tools, the analysis was conducted with an empirically derived set of reference ranges for each tool based on prior work. Although all instances of failure and a subset of instances of technical adequacy for each tool were manually reviewed, manual review of all series and tool outputs was not practical because of the large sample size (> 11,000 series). The challenges of working with large datasets in the context of machine learning studies are well documented [35]. There likely were additional undetected failures among cases that were not manually reviewed, as supported by the small number of segmentation failures identified in the manual review of a random subset of series in which the tools returned results within the reference ranges. Thus, the reported technical adequacy rates likely represent an upper bound for each tool. In addition, although the results suggest that these tools were generalizable to a wide array of outside CT examinations, generalizability was not explicitly explored with respect to patient demographic factors. This represents an important subject for future research, as body composition metrics have been found to vary significantly by age, race, and sex [33]. Also, owing to the opaque nature of how convolutional neural network AI tools, such as the ones evaluated in this study, make decisions, determination of likely causes of AI tool failure remains speculative. Finally, only technical adequacy was formally assessed. The ability to use the tools to predict and influence clinical outcomes based on external studies was not evaluated.
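The reference-range definition of technical adequacy discussed above can be sketched in a few lines of Python. This is an illustrative sketch only: the metric names and numeric ranges below are placeholders invented for the example and are not the reference ranges used in this study, and treating a missing output as a failure is likewise an assumption.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[float, float]

# PLACEHOLDER ranges for illustration only; NOT the study's reference ranges.
PLACEHOLDER_RANGES: Dict[str, Range] = {
    "bone_attenuation": (0.0, 500.0),
    "muscle_area": (20.0, 400.0),
    "visceral_fat_area": (1.0, 900.0),
}


def failed_metrics(
    outputs: Dict[str, Optional[float]], ranges: Dict[str, Range]
) -> List[str]:
    """Names of metrics whose output is missing or outside its range."""
    failed = []
    for name, (low, high) in ranges.items():
        value = outputs.get(name)
        # Assumption: a tool that returns no value is counted as a failure.
        if value is None or not (low <= value <= high):
            failed.append(name)
    return failed


def adequacy_rate(
    all_outputs: List[Dict[str, Optional[float]]], ranges: Dict[str, Range]
) -> float:
    """Fraction of series in which every metric falls within its range."""
    if not all_outputs:
        raise ValueError("no series to evaluate")
    adequate = sum(1 for outputs in all_outputs if not failed_metrics(outputs, ranges))
    return adequate / len(all_outputs)


# Example with two made-up series: the second has an out-of-range bone value.
series = [
    {"bone_attenuation": 150.0, "muscle_area": 120.0, "visceral_fat_area": 200.0},
    {"bone_attenuation": -40.0, "muscle_area": 110.0, "visceral_fat_area": 180.0},
]
flagged = failed_metrics(series[1], PLACEHOLDER_RANGES)
rate = adequacy_rate(series, PLACEHOLDER_RANGES)
```

A check of this kind flags only outputs that are numerically implausible. A segmentation error that still yields a value within the range passes unnoticed, which is why adequacy rates computed this way are best read as an upper bound.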
In conclusion, the automated AI body composition tools evaluated in this study had high technical adequacy rates of over 97% in a large and diverse array of external CT examinations. These findings support the potential for applying the tools to abdominopelvic CT datasets obtained across health systems. Causes of failure included technical factors—which may be largely preventable through proper image acquisition and reconstruction protocols—and factors inherent to the patient that are more challenging to control. Explainability and an understanding of causes of failure can help build trust in AI tools and increase acceptance among radiologists and other physicians.

Footnotes

Provenance and review: Not solicited; externally peer reviewed.
Peer reviewers: Maria Pilar Aparisi Gómez, Auckland City Hospital; Jingyu Zhong, Shanghai Jiao Tong University School of Medicine; additional individuals who chose not to disclose their identities.

References

1.
Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019; 25:44–56
2.
Deo RC. Machine learning in medicine. Circulation 2015; 132:1920–1930
3.
Hosny A, Parmar C, Quackenbush J, Schwartz LH, Aerts HJWL. Artificial intelligence in radiology. Nat Rev Cancer 2018; 18:500–510
4.
Pickhardt PJ, Nguyen T, Perez AA, et al. Improved CT-based osteoporosis assessment with a fully automated deep learning tool. Radiol Artif Intell 2022; 4:e220042
5.
Pickhardt PJ, Graffy PM, Zea R, et al. Automated abdominal CT imaging biomarkers for opportunistic prediction of future major osteoporotic fractures in asymptomatic adults. Radiology 2020; 297:64–72
6.
Pickhardt PJ, Lee SJ, Liu J, et al. Population-based opportunistic osteoporosis screening: validation of a fully automated CT tool for assessing longitudinal BMD changes. Br J Radiol 2019; 92:20180726
7.
Garner HW, Paturzo MM, Gaudier G, Pickhardt PJ, Wessell DE. Variation in attenuation in L1 trabecular bone at different tube voltages: caution is warranted when screening for osteoporosis with the use of opportunistic CT. AJR 2017; 208:165–170
8.
Summers RM, Baecher N, Yao J, et al. Feasibility of simultaneous computed tomographic colonography and fully automated bone mineral densitometry in a single examination. J Comput Assist Tomogr 2011; 35:212–216
9.
Burns JE, Yao J, Chalhoub D, Chen JJ, Summers RM. A machine learning algorithm to estimate sarcopenia on abdominal CT. Acad Radiol 2020; 27:311–320
10.
Graffy PM, Liu J, Pickhardt PJ, Burns JE, Yao J, Summers RM. Deep learning-based muscle segmentation and quantification at abdominal CT: application to a longitudinal adult screening cohort for sarcopenia assessment. Br J Radiol 2019; 92:20190327
11.
Lee SJ, Liu J, Yao J, Kanarek A, Summers RM, Pickhardt PJ. Fully automated segmentation and quantification of visceral and subcutaneous fat at abdominal CT: application to a longitudinal adult screening cohort. Br J Radiol 2018; 91:20170968
12.
Liu J, Pattanaik S, Yao J, et al. Associations among pericolonic fat, visceral fat, and colorectal polyps on CT colonography. Obesity (Silver Spring) 2015; 23:408–414
13.
Summers RM, Liu J, Sussman DL, et al. Association between visceral adiposity and colorectal polyps on CT colonography. AJR 2012; 199:48–57
14.
Buckens CF, Dijkhuis G, de Keizer B, Verhaar HJ, de Jong PA. Opportunistic screening for osteoporosis on routine computed tomography? An external validation study. Eur Radiol 2015; 25:2074–2079
15.
Jang S, Graffy PM, Ziemlewicz TJ, Lee SJ, Summers RM, Pickhardt PJ. Opportunistic osteoporosis screening at routine abdominal and thoracic CT: normative L1 trabecular attenuation values in more than 20 000 adults. Radiology 2019; 291:360–367
16.
Pickhardt PJ, Pooler BD, Lauder T, del Rio AM, Bruce RJ, Binkley N. Opportunistic screening for osteoporosis using abdominal computed tomography scans obtained for other indications. Ann Intern Med 2013; 158:588–595
17.
Carberry GA, Pooler BD, Binkley N, Lauder TB, Bruce RJ, Pickhardt PJ. Unreported vertebral body compression fractures at abdominal multidetector CT. Radiology 2013; 268:120–126
18.
Lee SJ, Binkley N, Lubner MG, Bruce RJ, Ziemlewicz TJ, Pickhardt PJ. Opportunistic screening for osteoporosis using the sagittal reconstruction from routine abdominal CT for combined assessment of vertebral fractures and density. Osteoporos Int 2016; 27:1131–1136
19.
Pickhardt PJ, Graffy PM, Zea R, et al. Utilizing fully automated abdominal CT–based biomarkers for opportunistic screening for metabolic syndrome in adults without symptoms. AJR 2021; 216:85–92
20.
Graffy PM, Summers RM, Perez AA, Sandfort V, Zea R, Pickhardt PJ. Automated assessment of longitudinal biomarker changes at abdominal CT: correlation with subsequent cardiovascular events in an asymptomatic adult screening cohort. Abdom Radiol (NY) 2021; 46:2976–2984
21.
O'Connor SD, Graffy PM, Zea R, Pickhardt PJ. Does nonenhanced CT-based quantification of abdominal aortic calcification outperform the Framingham risk score in predicting cardiovascular events in asymptomatic adults? Radiology 2019; 290:108–115
22.
Pickhardt PJ, Graffy PM, Zea R, et al. Automated CT biomarkers for opportunistic prediction of future cardiovascular events and mortality in an asymptomatic screening population: a retrospective cohort study. Lancet Digit Health 2020; 2:e192–e200
23.
Pooler BD, Kim DH, Pickhardt PJ. Potentially important extracolonic findings at screening CT colonography: incidence and outcomes data from a clinical screening program. AJR 2016; 206:313–318
24.
Anker SD, Morley JE, von Haehling S. Welcome to the ICD-10 code for sarcopenia. J Cachexia Sarcopenia Muscle 2016; 7:512–514
25.
Brenner DJ. Slowing the increase in the population dose resulting from CT scans. Radiat Res 2010; 174:809–815
26.
Pickhardt PJ. Value-added opportunistic CT screening: state of the art. Radiology 2022; 303:241–254
27.
Eche T, Schwartz LH, Mokrane FZ, Dercle L. Toward generalizability in the deployment of artificial intelligence in radiology: role of computation stress testing to overcome underspecification. Radiol Artif Intell 2021; 3:e210097
28.
Ho SY, Phua K, Wong L, Bin Goh WW. Extensions of the external validation for checking learned model interpretability and generalizability. Patterns (N Y) 2020; 1:100129
29.
Mongan J, Moy L, Kahn CE Jr. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell 2020; 2:e200029
30.
Dietvorst BJ, Simmons JP, Massey C. Algorithm aversion: people erroneously avoid algorithms after seeing them err. J Exp Psychol Gen 2015; 144:114–126
31.
Yan K, Lu L, Summers RM. Unsupervised body part regression via spatially self-ordering convolutional neural networks. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, 2018:1022–1025
32.
Yan K, Wang X, Lu L, et al. Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018:9261–9270
33.
Magudia K, Bridge CP, Bay CP, et al. Population-scale CT-based body composition analysis of a large outpatient population using deep learning to derive age-, sex-, and race-specific reference curves. Radiology 2021; 298:319–329
34.
Rigiroli F, Zhang D, Molinger J, et al. Automated versus manual analysis of body composition measures on computed tomography in patients with bladder cancer. Eur J Radiol 2022; 154:110413
35.
Magudia K, Bridge CP, Andriole KP, Rosenthal MH. The trials and tribulations of assembling large medical imaging datasets for machine learning applications. J Digit Imaging 2021; 34:1424–1429

Information & Authors

Information

Published In

American Journal of Roentgenology
Pages: 1 - 9
PubMed: 37095663

History

Submitted: November 8, 2022
Revision requested: November 21, 2022
Revision received: January 24, 2023
Accepted: February 8, 2023
Version of record online: February 22, 2023

Keywords

  1. artificial intelligence
  2. automated
  3. body composition
  4. CT
  5. deep learning

Authors

Affiliations

B. Dustin Pooler, MD [email protected]
Department of Radiology, University of Wisconsin School of Medicine and Public Health, E3/311 Clinical Science Center, 600 Highland Ave, Madison, WI 53792-3252.
John W. Garrett, PhD
Department of Radiology, University of Wisconsin School of Medicine and Public Health, E3/311 Clinical Science Center, 600 Highland Ave, Madison, WI 53792-3252.
Andrew M. Southard
Department of Radiology, University of Wisconsin School of Medicine and Public Health, E3/311 Clinical Science Center, 600 Highland Ave, Madison, WI 53792-3252.
Ronald M. Summers, MD, PhD
Department of Radiology and Imaging Sciences, Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, NIH Clinical Center, Bethesda, MD.
Perry J. Pickhardt, MD
Department of Radiology, University of Wisconsin School of Medicine and Public Health, E3/311 Clinical Science Center, 600 Highland Ave, Madison, WI 53792-3252.

Notes

Address correspondence to B. D. Pooler ([email protected]).
Version of record: Apr 26, 2023
R. M. Summers and P. J. Pickhardt contributed equally to this work.
P. J. Pickhardt is an advisor to Bracco, GE Healthcare, and Nano-X. R. M. Summers receives royalties from iCAD, ScanMed, PingAn, Philips Healthcare, and Translation Holdings, and his laboratory has received research funding through a cooperative research and development agreement with PingAn. The remaining authors declare that there are no other disclosures relevant to the subject matter of this article.

Funding Information

Supported in part by the Intramural Research Program of the NIH Clinical Center.