Original Research
Genitourinary Imaging
October 14, 2020

Data Augmentation and Transfer Learning to Improve Generalizability of an Automated Prostate Segmentation Model

Abstract

OBJECTIVE. Deep learning applications in radiology often suffer from overfitting, limiting generalization to external centers. The objective of this study was to develop a high-quality prostate segmentation model capable of maintaining a high degree of performance across multiple independent datasets using transfer learning and data augmentation.
MATERIALS AND METHODS. A retrospective cohort of 648 patients who underwent prostate MRI between February 2015 and November 2018 at a single center was used for training and validation. A deep learning approach combining 2D and 3D architectures and incorporating transfer learning was used for training. A data augmentation strategy specific to the deformations, intensity alterations, and image quality variations seen on radiology images was used. Six independent datasets, five of which were from outside centers, were used for testing, which was conducted with and without fine-tuning of the original model. The Dice similarity coefficient was used to evaluate model performance.
RESULTS. When prostate segmentation models utilizing transfer learning were applied to the internal validation cohort, the mean Dice similarity coefficient was 93.1 for whole prostate and 89.0 for transition zone segmentations. When the models were applied to multiple test set cohorts, the improvement in performance achieved using data augmentation alone was 2.2% for the whole prostate models and 3.0% for the transition zone segmentation models. However, the best test-set results were obtained with models fine-tuned on test center data with mean Dice similarity coefficients of 91.5 for whole prostate segmentation and 89.7 for transition zone segmentation.
CONCLUSION. Transfer learning allowed for the development of a high-performing prostate segmentation model, and data augmentation and fine-tuning approaches improved performance of a prostate segmentation model when applied to datasets from external centers.
MRI of the prostate has proven useful for improving diagnostic accuracy [1, 2], staging [1], and treatment planning [2] for patients known to have or suspected of having prostate cancer. Over the past decade, increased utilization of prostate MRI for detecting prostate cancer and for targeting lesions with MRI-ultrasound fusion–guided biopsies has driven demand for prostate segmentation techniques, because registration systems depend on prostate gland renderings. Manual prostate segmentation is a time-consuming task that is subject to inter- and intraobserver variability [3]. Substantial interest in semiautomated or fully automated prostate segmentation systems has resulted in multiple academic studies [4–7] and organized challenges [8]. In the past 5 years, deep learning approaches have become the preferred method of automating prostate segmentation [4]. Although some methods have achieved excellent results compared with manual segmentation [3–5], most current models are trained on a limited number of samples and lack validation datasets from external institutions [4–6]. In addition, variability in MRI vendors, scanning protocols, magnetic field strength, endorectal coils, and patient populations requires methods that allow models trained at a single center to maintain a high degree of accuracy when applied to external centers. The objective of the present study was to improve generalization of prostate segmentation models by use of a combination of transfer learning, robust data augmentation, and a fine-tuning strategy to refine a model trained at one center for use at multiple external centers.

Materials and Methods

Study Population

The training data were derived from a retrospective review of a total of 659 patients who were enrolled in institutional review board–approved protocols studying the role of MRI in prostate cancer diagnosis. There were two distinct cohorts. The first cohort (cohort 1) consisted of all patients who underwent prostate MRI between February 2018 and November 2018 at one institution as part of protocol 18-C-0017, which began in February 2018. The second cohort (cohort 2) included all patients who underwent surgical resection for prostate cancer as part of a separate tissue collection protocol (protocol 16-C-0010) between February 2015 and November 2018. February 2015 was the date that utilization of the Prostate Imaging Reporting and Data System version 2 scoring system began at this institution. Informed consent was obtained from all subjects.
Men who had been receiving androgen deprivation therapy before undergoing prostate MRI (n = 11) were excluded from the study, resulting in a final cohort of 648 patients, with 365 patients included in cohort 1 and 283 patients in cohort 2. MR images of cohorts 1 and 2 were obtained using a 3-T MRI system (Achieva, Philips Healthcare). T2-weighted turbo spin-echo images were obtained in the axial, sagittal, and coronal planes; however, only the T2-weighted axial series was used for this study. For patients who underwent sequential imaging, only the initial MRI examination was conducted with an endorectal coil (eCoil, Medrad), which was filled with 40 mL of perfluorinated fluid (Galden, Solvay) and tuned to 127.8 MHz.
Six independent datasets were used as separate test sets, for a total of 406 patients (Table 1). One test set comprised an independent cohort from the training institution and consisted of cases not used in the training or validation datasets. The five remaining test sets, from unaffiliated institutions, were considered unseen-domain test sets. These datasets represented a diverse array of geographic locations, MRI vendors, and acquisition protocols (Table 1).
TABLE 1: Comparison of Image Acquisition Parameters
| Dataset | No. of Patients | MRI System Vendor(s) (a) | Field Strength (b) | Median x/y Resolution (mm) | Slice Thickness or Gap (mm) | TR (ms) | TE (ms) | FOV (mm²) | No. of T2-Weighted Slices Acquired |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training dataset | | | | | | | | | |
| Cohort 1 | 365 | Philips Healthcare (100) | 3 T (100) | 0.27 | 3 | 4434 | 120 | 140 × 140 | 26 |
| Cohort 2 | 283 | Philips Healthcare (100) | 3 T (100) | 0.27 | 3 | 4434 | 120 | 140 × 140 | 26 |
| Testing dataset | | | | | | | | | |
| Independent cohort | 166 | Philips Healthcare (100) | 3 T (100) | 0.27 | 3 | 4434 | 120 | 140 × 140 | 26 |
| External center 1 | 42 | Siemens Healthcare (90) and GE Healthcare (10) | 3 T (98) and 1.5 T (2) | 0.57 | 3 | 3460 | 137 | 200 × 200 | 28 |
| External center 2 | 75 | Siemens Healthcare (100) | 3 T (100) | 0.56 | 3 | 3730 | 121 | NA | 28 |
| External center 3 | 10 | Philips Healthcare (100) | 3 T (100) | 0.54 | 3 | 7203 | 160 | 180 × 180 | 26 |
| External center 4 | 55 | Philips Healthcare (100) | 3 T (100) | 0.46 | 3 | 4726 | 80 | 200 × 200 | 24 |
| External center 5 | 58 | GE Healthcare (100) | 3 T (100) | 0.43 | 3 | 3662 | 105 | 220 × 220 | 35 |

Note—x/y Resolution = in-plane pixel resolution of multiparametric MR images, NA = not available.
(a) Data in parentheses denote percentage of MRI systems.
(b) Data in parentheses denote percentage of patients imaged at 3 T vs 1.5 T.

Ground Truth Segmentation

Before model training, ground truth segmentation labels were generated by a single radiologist with more than 10 years of experience in interpreting prostate MR images. A simplified zone-based segmentation strategy was used (Fig. 1). The entire prostate was segmented, excluding the seminal vesicles. A separate transition zone segmentation was performed that incorporated both the transition zone and the anterior fibromuscular stroma. The central zone was not included in the transition zone segmentation but instead was included in the whole prostate segmentation. All segmentations were performed using a research segmentation tool (pseg, iCAD Inc.) that generates machine learning–based estimates of the whole prostate and transition zone contours, which were then manually refined.
Fig. 1 —Segmentation strategy in 63-year-old man with prostate tumor.
A and B, Fully anatomic segmentation of T2-weighted MR images in axial (A) and sagittal (B) viewing planes includes delineation of whole prostate (green area), peripheral zone (yellow area), transition zone (red area), central zone (purple area), anterior fibromuscular stroma (dark blue area), and prostatic urethra (light blue area). Fully anatomic segmentation is very time consuming, and specific boundaries (particularly transition zone and anterior fibromuscular stroma) are not well defined.
C and D, Same patient as in A and B. Modified segmentation approach shows two views of same segmentation on T2-weighted MR images in axial (C) and sagittal (D) viewing planes. Boundaries of entire prostate are delineated (green area). Transition zone is then delineated to encompass both anterior fibromuscular stroma and true transition zone (red area).

Convolutional Neural Network Architecture and Data Augmentation

The 3D anisotropic hybrid network used for training is shown in Figure 2 and has been previously described elsewhere [7]. This is a 2D-3D hybrid network in which both the 2D and 3D portions feature an encoder and decoder. The model was implemented in an open-source machine learning platform (TensorFlow, version 1.10, Google Brain Team) in a framework customized to support multiple graphics processing units. A soft Dice similarity coefficient (DSC) loss was used as the loss function. The data augmentation strategy used is known as deep stacked transformation (DST) [10]. The data augmentation transforms fall into three categories: spatial transforms (rotation, scaling, and deformation), alterations in image appearance (brightness, contrast, and intensity), and alterations in factors related to image quality (sharpness, blurriness, and noise). In a prior study, models trained with DST augmentation were found to outperform those trained with standard augmentation techniques (e.g., random cropping only) [10].
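The sketch below illustrates how one transform from each of the three DST categories can be composed for a 3D T2-weighted volume. It is a minimal Python/NumPy illustration under assumed parameter ranges, not the authors' implementation; for segmentation training, the same spatial transforms would also have to be applied to the label volume with nearest-neighbor interpolation.

```python
import numpy as np
from scipy import ndimage


def _match_shape(vol, shape):
    """Center-crop or zero-pad a volume to a target shape."""
    out = np.zeros(shape, dtype=vol.dtype)
    src, dst = [], []
    for have, want in zip(vol.shape, shape):
        if have >= want:
            start = (have - want) // 2
            src.append(slice(start, start + want))
            dst.append(slice(0, want))
        else:
            start = (want - have) // 2
            src.append(slice(0, have))
            dst.append(slice(start, start + have))
    out[tuple(dst)] = vol[tuple(src)]
    return out


def dst_style_augment(volume, rng=None):
    """Apply one random transform from each DST category to a 3D volume
    (z, y, x) whose intensities are already normalized to [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    vol = volume.astype(np.float32)

    # 1) Spatial: small in-plane rotation and isotropic scaling (assumed ranges).
    angle = rng.uniform(-10.0, 10.0)  # degrees
    vol = ndimage.rotate(vol, angle, axes=(1, 2), reshape=False, order=1, mode="nearest")
    scale = rng.uniform(0.9, 1.1)
    vol = _match_shape(ndimage.zoom(vol, scale, order=1), volume.shape)

    # 2) Appearance: contrast scaling, brightness shift, gamma adjustment.
    vol = vol * rng.uniform(0.9, 1.1) + rng.uniform(-0.05, 0.05)
    vol = np.clip(vol, 0.0, 1.0) ** rng.uniform(0.8, 1.2)

    # 3) Image quality: random blur or unsharp-mask sharpening, then Gaussian noise.
    blurred = ndimage.gaussian_filter(vol, sigma=rng.uniform(0.0, 1.0))
    vol = blurred if rng.random() < 0.5 else np.clip(2.0 * vol - blurred, 0.0, 1.0)
    vol = vol + rng.normal(0.0, rng.uniform(0.0, 0.03), size=vol.shape)

    return np.clip(vol, 0.0, 1.0).astype(np.float32)
```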
Fig. 2 —Transfer learning using anisotropic hybrid network (AH-Net) architecture. AH-Net architecture (as described in [7]) features convolutional neural network (U-Net) backbone in 2D and 3D.
A, Schematic shows 2D encoder portion, which includes residual neural network (ResNet-50) weights initialized with ResNet-50 pretrained on large photographic image collections (ImageNet [11]). In this way, transfer learning is natively used with this architecture. MC-GCN = multichannel global convolutional network, RGB = red, green, blue, GCN = global convolutional network, Conv. = convolutional, x = input image, y = output segmentation.
B, Schematic shows 3D portion of this architecture (AH-Net), which allows full volumetric segmentation as output. AH-Downsample, AH-ResNet, and AH-Decoder are modules of AH-Net. MaxPool = maximum pooling to reduce input data dimensionality, Conv. = convolutional, x = input image, y = output segmentation.
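As an illustration of this native form of transfer learning, the snippet below instantiates a 2D ResNet-50 feature extractor initialized with ImageNet weights, analogous to the pretrained encoder in the 2D portion of AH-Net. The input patch size and the replication of the single-channel MR slice to three channels are assumptions for the sketch, not details taken from the study.

```python
import tensorflow as tf

# Minimal sketch of ImageNet-based transfer learning for a 2D encoder.
# The 256 x 256 patch size is assumed; single-channel T2-weighted slices
# would be replicated to 3 channels to match the pretrained RGB filters.
encoder = tf.keras.applications.ResNet50(
    include_top=False,          # drop the ImageNet classification head
    weights="imagenet",         # initialize convolutional weights from ImageNet
    input_shape=(256, 256, 3),
)
# In AH-Net, features from such a pretrained 2D backbone are transferred
# into the 3D decoder path to produce full volumetric segmentations.
```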

Training and Fine-Tuning

Two prostate segmentation models were trained on the combined data from cohorts 1 and 2, one with DST augmentation and one without (Fig. 3). Before training with the anisotropic hybrid network architecture, all training images were resampled to an isotropic resolution of 1.0 × 1.0 × 1.0 mm, and intensity values were normalized to [0, 1]. In the combined training dataset, 80% of patients were randomly selected as the training set, and the remaining 20% were assigned to the validation set. Each model was trained for 1500 epochs on a DGX-1 Volta server (NVIDIA), and the best model was saved.
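A minimal sketch of this preprocessing and patient-level split is shown below, assuming images are supplied as NumPy volumes with known voxel spacing; the function names and the fixed random seed are illustrative, not taken from the study.

```python
import numpy as np
from scipy import ndimage


def preprocess(volume, spacing_zyx):
    """Resample a volume to 1.0-mm isotropic voxels (linear interpolation)
    and rescale its intensities to the range [0, 1]."""
    factors = [s / 1.0 for s in spacing_zyx]  # current spacing / target spacing of 1.0 mm
    vol = ndimage.zoom(volume.astype(np.float32), factors, order=1)
    vmin, vmax = float(vol.min()), float(vol.max())
    return (vol - vmin) / (vmax - vmin + 1e-8)


def split_cases(case_ids, train_frac=0.8, seed=42):
    """Randomly assign 80% of patients to training and 20% to validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(list(case_ids))
    n_train = int(round(train_frac * len(shuffled)))
    return list(shuffled[:n_train]), list(shuffled[n_train:])
```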
Fig. 3 —Schematic shows training, validation, and test strategy that incorporates anisotropic hybrid network (AH-Net) architecture, which natively incorporates transfer learning, data augmentation, and fine-tuning approaches. In step 1, training on 80% of internal data was performed using AH-Net architecture. In steps 2 and 3, two models, one with data augmentation (red outline) and one without data augmentation (no outline), were trained. Both models were then applied to 20% validation dataset not from internal center (i.e., not included in training dataset) as well as six independent datasets. Model trained with augmentation (red outline) was then retrained with lower learning rate (fine-tuned; step 4) on each of six independent datasets, and each fine-tuned model was applied to test set from each center (step 5) to obtain results (step 6). Numbers in circles denote sequential steps in strategy. U-Net = convolutional neural network, NIH = National Institutes of Health.
The model trained on data from cohorts 1 and 2 was then retrained using a 10-fold decrease in the learning rate (a process known as fine-tuning) on data from the independent internal dataset as well as on data from each of the five external sites. For this fine-tuning stage, a different training, validation, and testing split was used than in the original training on cohorts 1 and 2. For each new independent dataset, 40% of the data were used for retraining (fine-tuning) the model, 10% were used as a validation dataset to adjust the fine-tuning training parameters, and the remaining 50% were used as a test dataset to evaluate performance on data not used during retraining or parameter tuning. Fine-tuning was performed on each dataset independently, and the training, validation, and testing splits are shown in Table 2. The entire pipeline was built on an open-source domain-optimized application framework (nvcr.io/nvidia/clara-train-sdk:v2.0, Clara Train SDK, NVIDIA) [11]. Domain-specific transfer learning in Clara Train SDK allows segmentation of 3D CT and MR images and training or fine-tuning of models, with eventual export to TensorRT (version 5.0.2-1+cuda10.0, NVIDIA)–based inference using Python (version 3.5.2, Python Software Foundation) wrappers.
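The Clara Train SDK drives this workflow through its own configuration files; the generic sketch below only illustrates the underlying idea of fine-tuning a pretrained segmentation model with a 10-fold lower learning rate and a soft Dice loss. The base learning rate, checkpoint file name, and dataset objects are assumptions, not values from the study.

```python
import tensorflow as tf

BASE_LR = 1e-4                   # assumed learning rate used for the original training
FINE_TUNE_LR = BASE_LR / 10.0    # 10-fold decrease in learning rate for fine-tuning


def soft_dice_loss(y_true, y_pred, eps=1e-6):
    """Soft Dice loss over a batch of binary segmentation volumes."""
    axes = tuple(range(1, len(y_pred.shape)))  # all axes except the batch axis
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    denom = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    return 1.0 - tf.reduce_mean((2.0 * intersection + eps) / (denom + eps))


# Hypothetical entry point: load the model trained on cohorts 1 and 2, then
# continue training on the 40% fine-tuning split from a single external center.
model = tf.keras.models.load_model("ahnet_prostate.h5", compile=False)
model.compile(optimizer=tf.keras.optimizers.Adam(FINE_TUNE_LR), loss=soft_dice_loss)
# model.fit(center_train_ds, validation_data=center_val_ds, epochs=...)
```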
TABLE 2: Whole Prostate Segmentation Test Set Results
| Cohort or Center | Total No. of Samples | No. of Training Samples | No. of Validation Samples | No. of Testing Samples | DSC Without Augmentation, Mean (Range) | DSC With DST, Mean (Range) | DSC With Fine-Tuning, Mean (Range) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Independent cohort | 166 | 66 | 17 | 83 | 91.0 (65.5–95.5) | 90.9 (65.8–95.5) | 91.2 (64.9–95.3) |
| External center 1 | 42 | 17 | 4 | 21 | 84.7 (62.5–92.1) | 90.4 (77.6–93.9) | 90.9 (83.4–94.7) |
| External center 2 | 75 | 30 | 8 | 37 | 89.8 (74.6–94.2) | 91.6 (84.3–94.8) | 92.0 (86.5–94.6) |
| External center 3 | 10 | 4 | 1 | 5 | 76.2 (58.0–89.4) | 86.3 (81.7–91.3) | 89.6 (85.5–93.3) |
| External center 4 | 55 | 22 | 6 | 27 | 89.3 (68.6–94.6) | 92.6 (77.7–95.4) | 92.9 (83.1–95.7) |
| External center 5 | 58 | 23 | 6 | 29 | 86.4 (64.1–93.4) | 90.6 (87.8–93.6) | 91.4 (88.7–93.8) |
| All | 406 | 162 | 42 | 202 | 88.8 (58.0–95.5) | 91.0 (65.8–95.5) | 91.5 (64.9–95.7) |

Note—DSC = Dice similarity coefficient, DST = deep stacked transformation. DSC values are for the whole prostate test sets.

Model performance was evaluated using the DSC, defined as follows:

DSC = 2 |S_DL ∩ S_m| / (|S_DL| + |S_m|),

where S_DL is the segmentation produced by the deep learning model and S_m is the manual segmentation. The DSC can range between 0 (no overlap) and 1 (perfect overlap). Additional metrics, including volumetric similarity, Hausdorff distance, mean surface distance, and standard surface distance, were used to further evaluate the similarity of the ground truth segmentations and the model-generated segmentations [9].
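A minimal NumPy sketch of this DSC computation for binary masks (the function name is illustrative):

```python
import numpy as np


def dice_similarity(seg_dl, seg_m):
    """DSC between a deep learning segmentation and the manual segmentation,
    both given as binary masks of the same shape."""
    seg_dl = np.asarray(seg_dl, dtype=bool)
    seg_m = np.asarray(seg_m, dtype=bool)
    total = seg_dl.sum() + seg_m.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect overlap
    return 2.0 * np.logical_and(seg_dl, seg_m).sum() / total
```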

Results

The characteristics of the patients in the training cohorts are presented in Table 3. A total of 140 (21%) of the patients in cohorts 1 and 2 underwent scanning with an endorectal coil; the remaining patients were scanned with a surface coil only. A wide variety of prostate sizes and lesions was represented. Image acquisition parameters are presented in Table 1.
TABLE 3: Demographic and Clinical Data for Cohort 1 and Cohort 2
| Demographic and Clinical Data | Cohort 1 (n = 365) | Cohort 2 (n = 283) |
| --- | --- | --- |
| Age (y), mean (range) | 68 (18–89) | 66 (46–81) |
| Weight (kg), mean (range) | 85 (30–146) | 88 (44–139) |
| Whole prostate size (cm³), mean (range) | 70 (16–265) | 44 (10–255) |
| Transition zone size (cm³), mean (range) | 45 (3–222) | 20 (4–279) |
| Highest PI-RADS score, no. (%) of patients | | |
|  1 | 80 (22) | 4 (1) |
|  2 | 38 (10) | 7 (2) |
|  3 | 96 (26) | 13 (5) |
|  4 | 102 (28) | 124 (44) |
|  5 | 49 (13) | 135 (48) |

Note—PI-RADS = Prostate Imaging Reporting and Data System.

The DSC for the validation cohort for the initial model trained on cohorts 1 and 2 was 93.1 for whole prostate segmentation and 89.0 for transition zone segmentation. The results for the internal test set cohort as well as the multiple unseen-domain test set cohorts are listed in Tables 2 and 4. Across the test sets, the mean DSC was lowest for the model trained without DST augmentation (88.8 for whole prostate segmentation and 85.1 for transition zone segmentation). For whole prostate segmentation, the DSC was lowest (76.2) for images from the center with the fewest data (center 3). The model trained with DST data augmentation increased the overall DSC to 91.0 for whole prostate segmentation and 88.1 for transition zone segmentation, reflecting improvements of 2.2% and 3.0%, respectively. Fine-tuning the model on each center's data before evaluation further improved performance, resulting in mean DSCs of 91.5 for whole prostate segmentation and 89.7 for transition zone segmentation. An example of a challenging segmentation case is presented in Figure 4. The segmentation results consistently improved with DST data augmentation and fine-tuning regardless of the similarity metric used (Table 5).
TABLE 4: Transition Zone Segmentation Test Set Results
| Cohort or Center | Total No. of Samples | No. of Training Samples | No. of Validation Samples | No. of Testing Samples | DSC Without Augmentation, Mean (Range) | DSC With DST, Mean (Range) | DSC With Fine-Tuning, Mean (Range) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Independent cohort | 166 | 66 | 17 | 83 | 88.4 (49.3–96.1) | 88.7 (52.0–95.9) | 89.4 (57.3–96.0) |
| External center 1 | 42 | 17 | 4 | 21 | 82.9 (63.0–90.3) | 86.9 (69.9–93.3) | 88.5 (77.1–95.0) |
| External center 2 | 75 | 30 | 8 | 37 | 84.9 (22.9–93.3) | 89.2 (83.7–94.5) | 90.7 (85.6–94.9) |
| External center 3 | 10 | 4 | 1 | 5 | 65.2 (46.0–78.6) | 73.5 (66.8–83.7) | 87.4 (82.0–91.9) |
| External center 4 | 55 | 22 | 6 | 27 | 84.2 (58.7–92.2) | 90.5 (75.3–95.3) | 92.0 (80.9–95.7) |
| External center 5 | 58 | 23 | 6 | 29 | 81.8 (61.0–90.3) | 86.2 (68.2–93.2) | 88.1 (74.7–94.0) |
| All | 406 | 162 | 42 | 202 | 85.1 (22.9–96.1) | 88.1 (52.0–95.9) | 89.7 (57.3–96.0) |

Note—DSC = Dice similarity coefficient, DST = deep stacked transformation. DSC values are for the transition zone test sets.

Fig. 4 —62-year-old man with prostate-specific antigen level of 31.9 ng/mL and Gleason 4+3 prostate cancer.
A, Axial T2-weighted MR image shows challenging segmentation case involving large prostate tumor.
B–D, Axial T2-weighted MR images show ground truth segmentation (B); segmentation from model trained without augmentation (C), which produced Dice similarity coefficient of 76; and segmentation from model trained with deep stacked transformation augmentation (D), which produced Dice similarity coefficient of 89.
TABLE 5: Additional Performance Metrics for Whole Prostate and Transition Zone Segmentation
| Metric | Without Augmentation | With Augmentation | With Fine-Tuning |
| --- | --- | --- | --- |
| Volume similarity | | | |
|  Whole prostate | | | |
|   Center 1 | 0.20 (−0.06 to 0.68) | 0.00 (−0.12 to 0.17) | 0.08 (−0.03 to 0.18) |
|   Center 2 | 0.07 (−0.11 to 0.49) | 0.01 (−0.17 to 0.30) | 0.03 (−0.10 to 0.24) |
|   Center 3 | 0.37 (0.09–0.76) | 0.04 (−0.04 to 0.14) | 0.02 (−0.05 to 0.07) |
|   Center 4 | 0.13 (−0.04 to 0.63) | 0.03 (−0.04 to 0.44) | 0.04 (−0.06 to 0.33) |
|   Center 5 | 0.18 (−0.10 to 0.70) | 0.04 (−0.15 to 0.17) | 0.01 (−0.18 to 0.12) |
|   Independent | 0.02 (−0.17 to 0.34) | 0.00 (−0.20 to 0.26) | −0.02 (−0.20 to 0.10) |
|  Transition zone | | | |
|   Center 1 | 0.15 (−0.17 to 0.74) | 0.04 (−0.17 to 0.59) | 0.09 (−0.12 to 0.42) |
|   Center 2 | 0.16 (−0.22 to 1.53) | 0.06 (−0.23 to 0.25) | 0.01 (−0.24 to 0.20) |
|   Center 3 | 0.62 (0.38–1.04) | 0.42 (0.26–0.53) | 0.13 (0.03–0.29) |
|   Center 4 | 0.19 (−0.35 to 0.76) | 0.01 (−0.39 to 0.17) | −0.01 (−0.31 to 0.09) |
|   Center 5 | 0.15 (−0.32 to 0.73) | −0.08 (−0.58 to 0.24) | −0.01 (−0.40 to 0.23) |
|   Independent | 0.01 (−0.51 to 0.38) | −0.02 (−0.47 to 0.28) | −0.04 (−0.49 to 0.26) |
| Hausdorff distance | | | |
|  Whole prostate | | | |
|   Center 1 | 7.86 (4.00–18.79) | 5.80 (3.61–13.93) | 5.43 (3.00–10.25) |
|   Center 2 | 6.79 (3.16–16.09) | 5.96 (2.83–14.28) | 5.37 (3.00–10.82) |
|   Center 3 | 10.59 (6.00–16.00) | 10.15 (4.12–18.11) | 6.01 (4.12–7.81) |
|   Center 4 | 6.97 (3.16–16.09) | 5.45 (3.16–9.00) | 5.47 (3.00–9.16) |
|   Center 5 | 7.28 (3.16–15.81) | 6.03 (3.32–9.43) | 5.64 (4.00–10.49) |
|   Independent | 5.40 (3.00–20.25) | 5.39 (2.45–10.95) | 5.35 (2.83–11.79) |
|  Transition zone | | | |
|   Center 1 | 6.80 (3.16–14.04) | 5.19 (3.00–12.00) | 4.71 (3.00–11.00) |
|   Center 2 | 7.37 (3.61–24.45) | 6.09 (3.16–17.55) | 6.06 (3.16–13.19) |
|   Center 3 | 11.53 (7.28–16.28) | 9.59 (5.83–12.00) | 4.81 (3.32–6.08) |
|   Center 4 | 8.56 (3.16–25.50) | 5.58 (3.16–11.04) | 4.86 (3.00–11.87) |
|   Center 5 | 8.18 (4.00–18.44) | 6.70 (3.16–14.00) | 6.04 (3.16–12.00) |
|   Independent | 5.12 (2.24–12.21) | 5.00 (2.00–12.04) | 4.85 (2.24–12.21) |
| Mean surface distance | | | |
|  Whole prostate | | | |
|   Center 1 | 1.69 (0.77–3.97) | 1.02 (0.73–2.55) | 0.94 (0.58–1.82) |
|   Center 2 | 1.25 (0.63–4.31) | 0.97 (0.50–2.19) | 0.90 (0.50–1.68) |
|   Center 3 | 2.49 (1.09–4.61) | 1.49 (0.87–2.42) | 1.05 (0.69–1.33) |
|   Center 4 | 1.36 (0.53–3.79) | 0.86 (0.56–2.67) | 0.82 (0.58–1.97) |
|   Center 5 | 1.50 (0.64–5.26) | 1.02 (0.63–1.46) | 0.92 (0.52–1.46) |
|   Independent | 0.95 (0.45–3.37) | 0.96 (0.44–3.35) | 0.92 (0.47–3.46) |
|  Transition zone | | | |
|   Center 1 | 1.49 (0.68–4.66) | 1.07 (0.59–3.72) | 0.92 (0.54–2.82) |
|   Center 2 | 1.54 (0.68–7.93) | 1.07 (0.59–2.51) | 0.91 (0.55–1.72) |
|   Center 3 | 2.92 (1.91–5.08) | 2.18 (1.45–2.62) | 0.97 (0.66–1.41) |
|   Center 4 | 1.66 (0.69–5.23) | 0.91 (0.51–1.94) | 0.73 (0.53–1.27) |
|   Center 5 | 1.66 (0.83–4.76) | 1.22 (0.69–2.54) | 1.03 (0.63–1.85) |
|   Independent | 0.94 (0.40–3.34) | 0.91 (0.42–3.21) | 0.85 (0.42–2.72) |
| Standard surface distance | | | |
|  Whole prostate | | | |
|   Center 1 | 1.73 (0.77–4.64) | 1.04 (0.68–2.58) | 0.97 (0.68–1.72) |
|   Center 2 | 1.29 (0.69–4.11) | 1.01 (0.59–2.38) | 0.94 (0.61–1.69) |
|   Center 3 | 2.35 (1.08–3.96) | 1.58 (0.85–2.97) | 1.11 (0.72–1.50) |
|   Center 4 | 1.46 (0.58–4.62) | 0.90 (0.64–1.95) | 0.91 (0.63–1.54) |
|   Center 5 | 1.44 (0.67–4.98) | 1.05 (0.68–1.67) | 0.96 (0.67–1.76) |
|   Independent | 0.99 (0.58–3.19) | 0.97 (0.56–2.02) | 0.93 (0.57–1.96) |
|  Transition zone | | | |
|   Center 1 | 1.48 (0.67–3.50) | 0.97 (0.64–2.79) | 0.90 (0.57–2.88) |
|   Center 2 | 1.47 (0.70–5.90) | 1.1 (0.64–3.40) | 1.02 (0.62–1.96) |
|   Center 3 | 2.70 (1.53–4.42) | 2.13 (1.19–2.48) | 0.94 (0.64–1.42) |
|   Center 4 | 1.72 (0.67–5.65) | 0.97 (0.59–3.05) | 0.78 (0.59–1.28) |
|   Center 5 | 1.56 (0.81–4.60) | 1.20 (0.65–2.60) | 1.06 (0.64–2.22) |
|   Independent | 0.96 (0.54–2.41) | 0.93 (0.54–2.41) | 0.89 (0.55–2.52) |

Note—Values are mean (range).

Discussion

Significant advances in computer vision have resulted from the combination of novel computational approaches and datasets containing millions of images [11]. Because of the domain-specific knowledge required to curate large datasets in radiology, creating large datasets with expert-level annotation is challenging and subject to substantial variation among annotators [3]. In the present study, we describe a method that combined transfer learning, data augmentation, and fine-tuning to develop a state-of-the-art segmentation model that generalized to multiple institutions despite substantial variation in MRI data quality arising from a wide range of technical and patient-specific factors.
Two techniques that have been applied successfully in computer vision on natural-world images were used in the training of the model: data augmentation [12] and transfer learning [13]. Data augmentation is the process by which images are altered slightly during training so that the model learns the features most important for segmentation rather than image-specific features [14]. The data augmentation strategy used in this study was developed to address specific sources of variation in 3D anisotropic images, such as those found in MRI [10]. Data augmentation improved the generalizability of the models to unseen domains, and the resulting performance is among the best reported in the literature [5, 8]. However, the true value of the present study is that it shows a method of generalizing a model trained at one center to multiple independent centers through the use of data augmentation and fine-tuning [15].
It should be noted that two types of transfer learning were used. First, the anisotropic hybrid network architecture [7] uses transfer learning natively: the encoder of the 2D portion of the network is a residual neural network [16] backbone pretrained on a large dataset of natural-world images known as ImageNet [11]. The second type of transfer learning was the fine-tuning approach. There are many sources of variation in multiparametric MRI, including multiple vendors, image acquisition parameters, and patient populations, and these factors have been shown to increase variation in manual prostate segmentation [17]. Tailoring a neural network to each institution allows a network that works well at one center to be adapted by another center. In our case, this fine-tuning approach, when combined with data augmentation, proved to be the best approach for preserving the accuracy of our prostate segmentation model.
The limitations of the present study include its retrospective design and its relatively small dataset for deep learning; by comparison, ImageNet, one of the most widely used datasets in computer vision, contains more than 14 million images. However, it is difficult to accumulate such massive numbers of well-annotated medical images because expert curation is required and is difficult to achieve consistently. The techniques of transfer learning, data augmentation, and fine-tuning are designed to help overcome these limitations.
The U.S. Food and Drug Administration recently released a white paper on the approval process for models developed using deep learning approaches [18]. Although there has been progress in the development of models capable of being updated flexibly, the models approved to date have been frozen after training and validation. A critical step in determining the safety of such models is showing that they are robust to new populations. The data augmentation and transfer learning (fine-tuning) approaches described in the present study may help bridge the gap between the research setting and the clinic [19].

Conclusion

Transfer learning allowed the development of high-performing whole prostate and transition zone segmentation models, and data augmentation and fine-tuning approaches improved performance of a prostate segmentation model when applied to external centers. Our segmentation tool needs to be validated in prospective multicenter studies.

References

1.
Mehralivand S, Shih JH, Harmon S, et al. A grading system for the assessment of risk of extraprostatic extension of prostate cancer at multiparametric MRI. Radiology 2019; 290:709–719
2.
Hricak H, Wang L, Wei DC, et al. The role of preoperative endorectal magnetic resonance imaging in the decision regarding whether to preserve or resect neurovascular bundles during radical retropubic prostatectomy. Cancer 2004; 100:2655–2663
3.
Gardner SJ, Wen N, Kim J, et al. Contouring variability of human- and deformable-generated contours in radiotherapy for prostate cancer. Phys Med Biol 2015; 60:4429–4447
4.
Cheng R, Roth HR, Lay N, et al. Automatic magnetic resonance prostate segmentation by deep learning with holistically nested networks. J Med Imaging (Bellingham) 2017; 4:041302
5.
Clark T, Zhang J, Baig S, Wong A, Haider MA, Khalvati F. Fully automated segmentation of prostate whole gland and transition zone in diffusion-weighted MRI using convolutional neural networks. J Med Imaging (Bellingham) 2017; 4:041307
6.
Zhu Q, Du B, Turkbey B, Choyke PL, Yan P. Deeply-supervised CNN for prostate segmentation. arXiv website. arxiv.org/abs/1703.07523. Revised March 22, 2017. Accessed August 24, 2019
7.
Liu S, Xu D, Zhou SK, et al. 3D anisotropic hybrid network: transferring convolutional features from 2D images to 3D anisotropic volumes. arXiv website. arxiv.org/abs/1711.08580. Published November 23, 2017. Accessed August 24, 2019
8.
Litjens G, Toth R, van de Ven W, et al. Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Med Image Anal 2014; 18:359–373
9.
Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imaging 2015; 15:29
10.
Zhang L, Wang X, Yang D, et al. When unseen domain generalization is unnecessary? Rethinking data augmentation. arXiv website. arxiv.org/abs/1906.03347. Published June 7, 2019. Accessed September 7, 2019
11.
Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM 2017; 60:84–90
12.
Perez L, Wang J. The effectiveness of data augmentation in image classification using deep learning. arXiv website. Published December 13, 2017. Accessed September 23, 2019
13.
Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big Data 2016; 3:9
14.
Hussain Z, Gimenez F, Yi D, Rubin D. Differential data augmentation techniques for medical imaging classification tasks. AMIA Annu Symp Proc 2018; 2017:979–984
15.
Zhu Q, Du B, Turkbey B, Choyke P, Yan P. Exploiting interslice correlation for MRI prostate image segmentation, from recursive neural networks aspect. Complexity 2018; 2018:1–10
16.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. arXiv website. arxiv.org/abs/1512.03385. Published December 10, 2015. Accessed August 24, 2019
17.
Nyholm T, Jonsson J, Söderström K, et al. Variability in prostate and seminal vesicle delineations defined on magnetic resonance images, a multi-observer, -center and -sequence study. Radiat Oncol 2013; 8:126
18.
U.S. Food and Drug Administration (FDA). Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD). U.S. FDA website. www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device. Updated 2019. Accessed September 4, 2019
19.
Hudson J, Khazragui HF. Into the valley of death: research to innovation. Drug Discov Today 2013; 18:610–613

Information & Authors

Published In

American Journal of Roentgenology
Pages: 1403 - 1410
PubMed: 33052737

History

Submitted: September 23, 2019
Accepted: February 18, 2020
First published: October 14, 2020

Keywords

  1. artificial intelligence
  2. prostate MRI
  3. segmentation

Authors

Affiliations

Thomas H. Sanford
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 10, Rm B3B85, Bethesda MD 20892.
Ling Zhang
NVIDIA Corporation, Bethesda, MD.
Stephanie A. Harmon
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 10, Rm B3B85, Bethesda MD 20892.
Clinical Research Directorate, Frederick National Laboratory for Cancer Research, Frederick, MD.
Jonathan Sackett
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 10, Rm B3B85, Bethesda MD 20892.
Dong Yang
NVIDIA Corporation, Bethesda, MD.
Holger Roth
NVIDIA Corporation, Bethesda, MD.
Ziyue Xu
NVIDIA Corporation, Bethesda, MD.
Deepak Kesani
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 10, Rm B3B85, Bethesda MD 20892.
Sherif Mehralivand
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 10, Rm B3B85, Bethesda MD 20892.
Ronaldo H. Baroni
Diagnostic Imaging Department, Albert Einstein Hospital, Sao Paulo, Brazil.
Tristan Barrett
University of Cambridge School of Clinical Medicine, Cambridge, United Kingdom.
Rossano Girometti
Department of Radiology, University of Udine, Udine, Italy.
Aytekin Oto
Department of Radiology, University of Chicago, Chicago, IL.
Andrei S. Purysko
Department of Radiology, Cleveland Clinic, Cleveland, OH.
Sheng Xu
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 10, Rm B3B85, Bethesda MD 20892.
Peter A. Pinto
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 10, Rm B3B85, Bethesda MD 20892.
Daguang Xu
NVIDIA Corporation, Bethesda, MD.
Bradford J. Wood
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 10, Rm B3B85, Bethesda MD 20892.
Peter L. Choyke
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 10, Rm B3B85, Bethesda MD 20892.
Baris Turkbey
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 10, Rm B3B85, Bethesda MD 20892.

Notes

Address correspondence to B. Turkbey ([email protected]).
The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.

Funding Information

Supported in whole or in part by federal funds from the National Cancer Institute, National Institutes of Health (contract HHSN261200800001E) and supported in part by the Intramural Research Program of the National Institutes of Health. NVIDIA Corporation contributed computational resources in the form of a DGX-2 computer.
