In prior work, a set of automated AI tools was developed that can be deployed on CT examinations of the abdomen to assess various aspects of body composition, including vertebral body attenuation (hereafter, bone tool) [4–8]; amount and attenuation of back, paraspinous, and anterior abdominal muscle (muscle tool) [9, 10]; and amount of visceral and subcutaneous fat (fat tool) [11–13]. These body composition parameters have been found to correlate with health outcomes, including osteoporosis [14–16] and fragility fractures [17, 18], metabolic syndrome [19], cardiovascular events [20–23], and overall mortality [22, 24]. Given that over 70 million CT examinations are performed annually in the United States [25], these tools have tremendous potential for use in opportunistic health screening of abdominal CT examinations performed for nearly any indication [26].
Owing to the way in which deep learning AI tools for analysis of imaging studies are developed and validated, a primary concern is the generalizability of tool performance to imaging studies performed on equipment types, at locations, or with protocols that differ from those on which the AI tools were initially developed and validated [27–29]. An additional barrier to adoption of AI tools is opacity, real or perceived, regarding how these algorithms function (i.e., the unexplainability of AI) [2, 3, 30]. Patients' imaging studies are occasionally transferred, by a variety of mechanisms, from the health system where the examination was originally performed to a different health system where the patient subsequently seeks care. These imaging studies, which are external (also called outside imaging studies) with respect to the health system receiving them, afford an opportunity to address issues related to the generalizability of AI algorithms in medical imaging. Accordingly, the aims of this study were to assess the technical adequacy of a set of automated AI abdominal CT body composition tools in a heterogeneous sample of external CT examinations performed outside the authors' hospital system and to explore possible causes of AI tool failure.
Methods
Patient Sample
This retrospective cross-sectional study was conducted at a single academic institution. The study protocol was HIPAA-compliant and approved by the institutional review board. The requirement for written informed consent was waived.
An electronic search of the institutional PACS was conducted for CT examinations of the abdomen performed from January 1, 2000, through December 31, 2020, on adult (age ≥ 18 years) patients at facilities outside of the authors' primary hospital system. All patients had presented to the authors' institution for clinical care, and the images had been transferred to the authors' local institutional PACS before initiation of this study; no images were transferred specifically for this study. The search yielded 12,666 unique CT examinations performed on 9535 unique patients. Examinations were eligible for analysis whether performed with or without IV contrast material. For CT examinations with more than one axial series, a single axial series was randomly selected for inclusion in the analysis.
A total of 35 series were excluded because they were not obtained with the patient in the supine position (e.g., prone or decubitus positioning was used). An additional 932 series were excluded because they did not contain contiguous images spanning from the superior endplate of the L1 vertebral body through the inferior endplate of L3. If the series initially selected at random for an examination was excluded, a replacement series from that examination was not sought. After these exclusions, the final study sample comprised 11,699 series from 8949 patients (4256 men, 4693 women; mean ± SD age, 55.5 ± 15.9 years; median, 56.6 years; range, 18–98 years). The general characteristics of these examinations, as extracted from DICOM header data, are summarized in Table 1. The examinations were performed at 777 outside institutions on 83 scanner models from six manufacturers (GE Healthcare, Marconi Medical Systems, Philips Healthcare, Shimadzu, Siemens Healthineers, Toshiba Medical).
Figure 1 shows the flow of patient selection.
Description of Body Composition Tools
The AI body composition tools evaluated in this study have been previously described and evaluated in earlier validation studies [4–6, 9–11]. The code for these algorithms was not publicly available at the time of this writing. The three AI tools assessed in this study (bone, muscle, fat) function independently of one another; adequacy or failure of one tool does not depend on the performance of the others. These body composition tools were developed to analyze axial images from CT examinations of the abdomen performed on adults (age ≥ 18 years) in the supine position. The three tools are packaged into a single fully automated container for processing of an input CT series. For a given series, the tools first perform preprocessing to normalize the slice thickness to 3 mm (with 3-mm spacing) and to correct for offsets in CT attenuation values. The tools then apply a body part regression function to determine whether the series has adequate anatomic coverage (at least from the L1 to L3 levels) and to provide slice positions for relevant anatomic landmarks [31, 32].
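To make this preprocessing step concrete, the following is a minimal sketch (not the authors' actual implementation, which is not public) of resampling a CT volume to a uniform 3-mm slice spacing and checking anatomic coverage. The loader and predict_landmarks() helper stand in for the body part regression model of references [31, 32] and are hypothetical.

```python
# Sketch of series preprocessing: resample to 3-mm slice spacing and
# verify L1-L3 coverage. Helper names are illustrative, not the tools' API.
import numpy as np
from scipy.ndimage import zoom

TARGET_SPACING_MM = 3.0

def normalize_slice_spacing(volume: np.ndarray, z_spacing_mm: float) -> np.ndarray:
    """Linearly resample a (slices, rows, cols) volume along the slice axis."""
    factor = z_spacing_mm / TARGET_SPACING_MM
    return zoom(volume, (factor, 1.0, 1.0), order=1)  # in-plane grid unchanged

def has_adequate_coverage(landmarks: dict) -> bool:
    """Require that the series spans at least the L1 through L3 levels."""
    return "L1" in landmarks and "L3" in landmarks

# volume = load_series(path)                   # hypothetical loader
# volume = normalize_slice_spacing(volume, z_spacing_mm=5.0)
# landmarks = predict_landmarks(volume)        # hypothetical regression model [31, 32]
# adequate = has_adequate_coverage(landmarks)
```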
The bone tool places an ROI in the trabecular bone of the vertebral body on a single axial slice at a selected level and in the three contiguous slices above and below that level (seven contiguous slices in total). The tool reports the median attenuation (in Hounsfield units) at the selected level; the minimum of the median attenuations across the seven contiguous slices is also measured and can be used when higher sensitivity is desired [4]. For the current study, the L1 level was selected, and only the median attenuation at the selected level was used. The muscle tool automatically detects and segments abdominal, paraspinous, and back musculature on a single slice at the level of the L3 vertebral body and calculates total muscle cross-sectional area and mean muscle attenuation in Hounsfield units. Similarly, the fat tool automatically detects and segments visceral and subcutaneous fat on a single slice at the level of the L3 vertebral body and calculates the cross-sectional area of each fat type as well as the cross-sectional area of total abdominal fat. All tools return a default value of −10,000 when a segmentation failure is detected. As part of the output, each AI algorithm generates a segmentation map, which a radiologist may review for accuracy.
Figure 2 shows examples of the segmentation maps for the three tools.
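As an illustration of the bone tool's reported metrics, the sketch below computes the median trabecular attenuation at the selected level and the minimum of the medians across the seven contiguous slices, returning the −10,000 sentinel on segmentation failure. The mask inputs are assumed to come from an upstream segmentation model; all names are hypothetical.

```python
# Illustrative bone metrics: median HU at the selected (center) level and
# the minimum of the medians across seven contiguous slices.
import numpy as np

SEGMENTATION_FAILURE = -10000  # sentinel value described in the text

def bone_attenuation(slices: list, roi_masks: list):
    """slices: seven axial HU arrays; roi_masks: matching boolean trabecular ROIs."""
    if len(slices) != 7 or any(mask.sum() == 0 for mask in roi_masks):
        return SEGMENTATION_FAILURE, SEGMENTATION_FAILURE
    medians = [float(np.median(img[mask])) for img, mask in zip(slices, roi_masks)]
    return medians[3], min(medians)  # index 3 is the selected center slice
```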
Assessment of Body Composition Tools
The automated body composition tools were applied to all series included in the final study sample. A tool was deemed technically adequate in its analysis of a series if it returned a value within the reference range for the given parameter, as listed in Table 2. These reference ranges were empirically derived on the basis of ranges observed in prior large studies evaluating the performance of these tools (bone, 11,035 examinations [4]; muscle, 9310 examinations [10]; fat, 8852 examinations [11]). A tool was deemed not technically adequate in its analysis of a series if it returned a value outside of the reference range for the given parameter, whether or not the tool had detected a segmentation failure. The technical adequacy rate was calculated for each tool. For tools that provided multiple measured values (e.g., muscle area and muscle attenuation for the muscle tool; visceral abdominal fat area, subcutaneous fat area, and total abdominal fat area for the fat tool), all values output by the tool were required to be within their respective reference ranges for the tool to be considered technically adequate for the tested series.
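A minimal sketch of this adequacy rule follows; the reference bounds shown are placeholders rather than the actual Table 2 values. Note that the −10,000 segmentation failure sentinel falls outside any plausible reference range, so flagged failures are automatically classified as not technically adequate, consistent with the rule above.

```python
# Sketch of the technical adequacy rule: every value output by a tool must
# fall within its reference range. Bounds below are placeholders (see Table 2).
REFERENCE_RANGES = {
    "muscle_area_cm2": (50.0, 500.0),         # placeholder bounds
    "muscle_attenuation_hu": (-50.0, 100.0),  # placeholder bounds
}

def is_technically_adequate(outputs: dict) -> bool:
    """A tool is adequate only if all of its output values are in range."""
    return all(lo <= outputs[name] <= hi
               for name, (lo, hi) in REFERENCE_RANGES.items())
```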
For all series in which at least one tool failed, two fellowship-trained abdominal radiologists (B.D.P. with 5 and P.J.P. with 24 years of posttraining experience) reviewed all available information (CT images, DICOM header information, tool outputs) in consensus in a qualitative manner to identify possible causes of failure. If, on this manual review, the segmentation was deemed to have been correct despite the tool's having output a value outside of the reference range, the series was reclassified as technically adequate. An additional targeted assessment was performed to explore the relationship between tool failure and classification of a series as a derived multiplanar reformatted (MPR) axial series, as opposed to a series reconstructed from an originally acquired axial dataset.
For each AI tool, a subset of 200 series in which the tool was deemed technically adequate (i.e., all output values were within the reference range) was randomly selected for manual review. These series, along with the segmentations performed by the tool, were reviewed independently by two reviewers (B.D.P., one of the previously noted investigators, and A.M.S., an undergraduate student who received training in evaluation of the tool outputs before study initiation). The reviewers qualitatively recorded the presence of significant segmentation errors that occurred despite the tool's output values being within the reference range; such errors would represent false classification of the tool's performance as technically adequate. Discrepancies were reconciled by a third reviewer (P.J.P., one of the previously noted reviewers).
Statistical Analysis
Data were summarized descriptively. All calculations were performed with Microsoft Excel software (version 2022, Build 14931.20764).
Discussion
In this study, to evaluate the technical adequacy of automated AI body composition tools and to identify possible causes of failure, the tools were tested on a large dataset of external CT examinations. Individually, the bone, muscle, and fat tools were each technically adequate (defined as outputting values within their respective empirically derived reference ranges) in over 97% of outside studies. This performance compares favorably with results in previously published work on internal datasets [33, 34].
All three tools failed in 34.3% of series in which at least one tool failed. Given that these tools function independently and draw on information from different parts of the CT images, this observation indicates a global technical issue as the cause of failure in these series. Indeed, 88.0% of global failures (i.e., failure of all three tools) were due to anisometry error associated with pixel stretching during image preprocessing. This phenomenon was in turn attributable to incorrect voxel dimension information in DICOM headers. Most anisometry errors occurred in derived MPR series as opposed to series of originally acquired axial images. If all such derived MPR series (2.1% of all series) were excluded, regardless of tool success or failure, the technical adequacy rate would range from 98.4% to 99.8% across the three tools. Thus, the straightforward step of checking that the target series is not a derived MPR series would markedly mitigate a major cause of failure in application of these tools; a minimal sketch of such a check follows.
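The sketch below inspects standard DICOM attributes with pydicom: a derived MPR series is typically flagged in the multivalued ImageType attribute. This is an illustrative screen under that assumption, not the study's protocol.

```python
# Screen out derived MPR axial series before running the tools by
# inspecting the DICOM ImageType attribute (header only, no pixel data).
import pydicom

def is_derived_mpr(path: str) -> bool:
    ds = pydicom.dcmread(path, stop_before_pixels=True)
    image_type = [str(v).upper() for v in ds.get("ImageType", [])]
    return "DERIVED" in image_type and any("MPR" in v for v in image_type)
```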
Causes of AI tool failure other than anisometry error were also identified. These included streak artifact due to metallic hardware, severe scoliosis, vacuum disk phenomenon, vertebroplasty, and truncation artifact. Each of these factors caused tool failure in only a small fraction of series, indicating overall robustness of the tools. Nonetheless, for most series in which at least one tool failed, no cause of failure could be identified for any tool.
Truncation artifact was problematic not only in instances of tool failure but also in instances of false assessment of technical adequacy. Truncation artifact has previously been described as a cause of AI tool segmentation failure [34]. In this study, truncation artifact was observed in the context of an incomplete or narrow FOV. Possible causes of an incomplete or narrow FOV include large body habitus and a targeted examination indication leading to intentional selection of narrow coverage. The body composition tools would not be expected to successfully analyze series with such incomplete anatomic coverage. Considering the impact of this issue, future work should explore approaches to training the tools to handle truncation artifact.
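Although the tools themselves do not flag truncation, a simple heuristic could screen for it upstream: if tissue-density voxels reach the border of the reconstructed image, part of the body likely extends beyond the FOV. This sketch and its threshold are illustrative assumptions, not part of the evaluated tools.

```python
# Heuristic truncation screen: tissue touching the image border suggests
# the body extends beyond the reconstructed FOV. Threshold is illustrative.
import numpy as np

def touches_fov_edge(slice_hu: np.ndarray, tissue_threshold: float = -500.0) -> bool:
    """Return True if tissue-density pixels reach any border of the axial slice."""
    body = slice_hu > tissue_threshold
    return bool(body[0, :].any() or body[-1, :].any()
                or body[:, 0].any() or body[:, -1].any())
```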
AI tools are increasingly used in clinical medicine [1, 2]. In radiology, such tools are often designed to aid the radiologist in detection of clinically relevant findings [3]. In contrast, the tools evaluated in this study automatically determine body composition metrics on abdominopelvic CT examinations. Manual determination of these metrics would be overly cumbersome and time-consuming in clinical practice. The AI tools can be applied to essentially any CT examination performed for any clinical indication. These body composition metrics (bone attenuation, muscle volume and attenuation, and fat volume and distribution) correlate with highly clinically significant health outcome measures at the population level [14–24]. Consequently, these measures could be used opportunistically to direct patients into appropriate screening, prevention, or surveillance programs, thereby improving overall population health and reducing health care costs.
One challenge in the development of deep learning AI tools is the generalizability of a given tool to datasets outside of the training and validation datasets used to develop it [1–3]. In radiology, a specific example of this dilemma arises when AI tools are applied to imaging examinations performed at external institutions, with the images later transmitted to the home institution. Within a single institution or health system, standard imaging protocols are generally developed to ensure quality and consistency, and equipment is commonly acquired from a single preferred vendor. Among disparate institutions and health systems, however, there is wide variation in equipment and in specific examination parameters and protocols. These differences may have unforeseen effects on the performance of AI tools whose training and validation datasets did not incorporate examinations performed with such varying parameters, highlighting the importance of the findings of this study.
Elucidating the causes of failure of deep learning AI tools can be challenging, as the process by which an algorithm reaches its conclusion can be opaque. This challenge may be compounded in the analysis of very large datasets, as manual review of failures is time-consuming. However, understanding causes of failure is critical not only for improving performance but also for increasing radiologists' trust in these tools [30]. A benefit of identifying technical causes of failure (e.g., anisometry errors) is that, through the use of proper image acquisition and reconstruction protocols, these errors may be largely preventable. If the errors are not prevented, an understanding of their technical causes should at least allow identification of studies not meeting certain technical standards so that they can be omitted from AI analysis.
There were limitations to this study. To evaluate the technical adequacy of the tools, the analysis was conducted with an empirically derived set of reference ranges for each tool based on prior work. Although all instances of failure and a subset of instances of technical adequacy for each tool were manually reviewed, manual review of all series and tool outputs was not practical because of the large sample size (> 11,000 series). The challenges of working with large datasets in the context of machine learning studies are well documented [35]. There were likely additional undetected failures among cases that were not manually reviewed, as suggested by the small number of segmentation failures identified in the manual review of a random subset of series in which the tools returned results within the reference ranges. Thus, the reported technical adequacy rates likely represent an upper bound for each tool. In addition, although the results suggest that these tools are generalizable to a wide array of outside CT examinations, generalizability was not explicitly explored with respect to patient demographic factors. This represents an important subject for future research, as body composition metrics have been found to vary significantly by age, race, and sex [33]. Also, owing to the opaque manner in which convolutional neural network AI tools, such as the ones evaluated in this study, reach their outputs, determination of the likely causes of AI tool failure remains speculative. Finally, only technical adequacy was formally assessed; the ability of the tools to predict and influence clinical outcomes on the basis of external studies was not evaluated.
In conclusion, the automated AI body composition tools evaluated in this study had high technical adequacy rates of over 97% in a large and diverse array of external CT examinations. These findings support the potential for applying the tools to abdominopelvic CT datasets obtained across health systems. Causes of failure included technical factors, which may be largely preventable through proper image acquisition and reconstruction protocols, as well as patient-inherent factors that are more challenging to control. Explainability and an understanding of causes of failure can help build trust in AI tools and increase acceptance among radiologists and other physicians.