Lung cancer is the most common type of malignancy worldwide, accounting for 12.9% of new cancer diagnoses and 19.4% of cancer-related deaths [1]. Non–small cell lung cancer (NSCLC) accounts for approximately 85% of newly diagnosed cases of lung cancer [2]. According to the current American College of Chest Physicians and European Society of Thoracic Surgeons guidelines, staging should be performed with 18F-FDG PET/CT, because patients with nodal disease may not benefit from upfront surgery and may need other treatments [3–5]. Researchers have suggested a correlation between standardized uptake value (SUV) and disease extent, response to therapy, and prognosis [3, 6–9].
CT has limited overall sensitivity (55%) for lymph node staging and is outperformed by the 80–90% sensitivity of PET/CT. The negative predictive value (NPV) of PET/CT for nodal disease in peripheral T1 tumors (≤ 3 cm) is quite high (92–94%), and invasive nodal staging can be omitted when PET/CT findings are negative in these cases [4]. The NPV of PET/CT of larger or centrally located tumors is somewhat lower (89% for T2 tumors), and PET/CT is insufficient to obviate invasive lymph node sampling to confirm or exclude nodal metastases in patients with no metabolically active nodes [3, 4].
There is a clinical need for new, robust noninvasive imaging parameters to better predict lymph node status. If such imaging parameters are confirmed, they may affect the initial workup of patients with NSCLC [10–14]. Furthermore, greater knowledge of the metastatic potential of a tumor may help tailor the imaging surveillance algorithms used after initial therapy to the specific tumor risk. More frequent surveillance can be provided to patients at higher risk of disease dissemination.
Methods of machine learning have been introduced for medical imaging analysis, and there has been a marked shift toward deep learning methods, especially the use of multilayered convolutional neural networks (CNNs) [15–19]. With these methods, attempts are made to analyze images to achieve various tasks, such as classification, detection, registration, and segmentation. The main limitation of all deep learning methods, including CNNs, is the need for large datasets to train the machine learning algorithms to improve their performance [18, 19]. Attempts to predict mediastinal nodal metastases by means of machine learning have met with little success. To our knowledge, no studies have been performed to analyze PET images by use of CNNs to predict the nodal and distant metastatic potential of newly diagnosed NSCLC.
The purpose of our study was to determine the utility of CNNs in predicting the presence of lymph node metastases and in predicting the systemic metastatic potential of newly diagnosed NSCLC by analyzing features of the primary tumor with PET.
Materials and Methods
This retrospective single-institution (Princess Margaret Cancer Centre) proof-of-concept study received approval from the local ethics committee, and the requirement for informed consent was waived. The study included all consecutively registered patients who underwent PET/CT between October 2013 and October 2015 for suspected lung malignancy.
The inclusion criteria were as follows: histologically proven NSCLC with the primary tumor larger than 1.5 cm in greatest diameter on CT images (solid or semisolid); invasive lymph node sampling (endobronchial ultrasound [EBUS] or surgery) within 1 month of PET/CT or 6-month imaging follow-up with no evidence of developing nodal metastases; and clinical follow-up more than 6 months from PET/CT. The exclusion criteria were as follows: prior lung or chest malignancy; prior chemotherapy or chest radiotherapy; no histologic confirmation of primary tumor; no lymph node sampling or clinical and imaging follow-up lasting at least 6 months. All T category designations (T1–T4) were included.
Data Abstraction
Data collected included patient demographic data (age, sex), tumor histologic type, and TNM stage at diagnosis. Clinical notes and subsequent imaging follow-up during the trial period, including all available CT, MRI, and PET/CT examinations, were collected to assess for developing nodal and distant metastases.
PET Protocol
All PET/CT scans were obtained with a dedicated in-line PET/CT scanner in 3D mode (Siemens Biograph mCT 40, Siemens Healthcare). At least 6-hour fasting was required before the examination. Data acquisition was performed 60–70 minutes after IV injection of approximately 5 MBq/kg body weight FDG (up to 550 MBq). An initial helical CT scan from skull base to the upper thighs was obtained with the following parameters: 120 kVp; 40–105 mAs; scan width, 5.0 mm; reconstructed section thickness, 2.0-mm overlap. After CT completion, PET scans of the same area were acquired for 3 minutes per bed position with five to seven bed positions per patient.
Tumor Segmentation
Tumor segmentation was performed on attenuation-corrected PET images in 3D by one reader, and segmentation was verified by a second reader, both nuclear medicine fellowship-trained radiologists. Segmentation was performed with local image features extraction software (version 3.40, LIFEx, LIFEx Soft) by one of two methods [20]. The first was semiautomatic segmentation with a fixed threshold of maximum SUV (SUVmax) 3.0 followed by a manual correction to remove adjacent FDG-avid nontumor structures (e.g., heart, vessels) (Fig. 1). In the second method, for masses with low FDG avidity (SUVmax < 3.0), manual segmentation was performed. All segmented tumors, regardless of SUVmax, were transformed to a 3D radiotherapy structure set containing only the PET data on the segmented tumor. These data were used for CNN training and validation.
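As a rough illustration of the fixed-threshold step only, the sketch below marks the voxels of a small nested-list volume whose SUV is at or above 3.0. It is not the LIFEx implementation: the names `threshold_mask` and `count_voxels`, the toy values, and the inclusive cutoff are assumptions, and the manual correction for adjacent FDG-avid structures is not modeled.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [slice][row][column]


def threshold_mask(suv: Volume, cutoff: float = 3.0) -> List[List[List[bool]]]:
    """Mark voxels whose SUV is at or above the cutoff (inclusive by assumption)."""
    return [[[value >= cutoff for value in row] for row in plane] for plane in suv]


def count_voxels(mask: List[List[List[bool]]]) -> int:
    """Count the voxels selected by the mask."""
    return sum(flag for plane in mask for row in plane for flag in row)


# Toy 2 x 2 x 2 SUV volume (made-up values, not study data).
toy_suv = [
    [[0.5, 1.2], [3.4, 2.9]],
    [[4.1, 0.8], [3.0, 1.1]],
]
toy_mask = threshold_mask(toy_suv)
selected = count_voxels(toy_mask)  # number of voxels at or above 3.0
```

In practice a threshold mask of this kind would still need the manual editing described above, because a fixed cutoff cannot distinguish tumor from neighboring physiologic uptake.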
Standard of Reference
T category was designated by the treating physician and diagnosed according to the findings at initial staging CT and PET/CT. At diagnosis, all included patients were required to have an N category designation, which was determined from tissue samples obtained from lymph nodes by one or more of mediastinoscopy; EBUS-guided biopsy of hilar or mediastinal nodes (N1–N2); imaging-guided biopsy of a scalene or supraclavicular node (N3); or at surgery. In cases in which data from both presurgical lymph node sampling and surgical nodal staging were available, N category was assigned according to surgical staging.
If a patient had no nodal metastases at initial imaging and had not undergone invasive lymph node sampling, 6-month imaging follow-up was used to confirm negative nodal status. During the surveillance period, patients were assessed for development of nodal metastases at either imaging or subsequent lymph node biopsy and according to the oncologist's designation during follow-up. Only patients who had undergone invasive lymph node sampling were included in the CNN analysis.
Initial M category was assigned at presentation according to findings at staging PET/CT, with tissue diagnosis if available. A patient was considered to have M1 disease at the end of the follow-up period if distant metastases developed at any point during the surveillance period according to any imaging modality (CT, FDG PET/CT), and M1 category was assigned by the treating oncologist.
Image Analysis and Supervised Machine Learning
For each tumor, labels were added to differentiate the subgroups: no nodal metastases (N0) versus any nodal metastases (N1–N3) and no distant metastases (M0) versus upfront distant metastases (upfront M1) versus metastases at end of follow-up. These were assigned according to the standard of reference and yielded three binary datasets: N0 versus N1–N3, M0 versus upfront M1, and M0 versus M1 at end of follow-up.
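The labeling scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the `Patient` record, its field names, and the dataset keys are hypothetical, and the sketch assumes that a patient with upfront M1 disease is also recorded as M1 at the end of follow-up. Nodal labels are created only for patients with invasive nodal sampling, consistent with the standard of reference.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Patient:
    patient_id: str
    n_category: int              # 0 to 3, from the standard of reference
    m1_upfront: bool             # distant metastases at presentation
    m1_end_of_follow_up: bool    # distant metastases by the end of follow-up
    invasive_nodal_sampling: bool


def build_binary_datasets(patients: List[Patient]) -> Dict[str, Dict[str, int]]:
    """Map each dataset name to {patient_id: binary label}."""
    datasets: Dict[str, Dict[str, int]] = {
        "N0_vs_N1-3": {},
        "M0_vs_upfront_M1": {},
        "M0_vs_M1_end": {},
    }
    for p in patients:
        # Nodal labels only for patients with invasive nodal sampling.
        if p.invasive_nodal_sampling:
            datasets["N0_vs_N1-3"][p.patient_id] = int(p.n_category > 0)
        datasets["M0_vs_upfront_M1"][p.patient_id] = int(p.m1_upfront)
        datasets["M0_vs_M1_end"][p.patient_id] = int(p.m1_end_of_follow_up)
    return datasets


# Three made-up patients (not study data).
toy_patients = [
    Patient("A", 0, False, False, True),
    Patient("B", 2, False, True, True),
    Patient("C", 0, True, True, False),
]
toy_datasets = build_binary_datasets(toy_patients)
```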
The input to the supervised machine learning (SML) framework was sets of segmented PET images in the 3D radiotherapy structure set format, which were augmented by means of an adaptation of the method proposed by Ciompi et al. [21]. This method entails rotation of image data in three planes and intersecting the triplets generated with the PET image data in various spatial transformations. Data samples were then used as input as triplets of 2D PET images.
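To make the idea of a triplet of 2D images concrete, the sketch below extracts three mutually orthogonal planes through one voxel of a 3D volume. It is a deliberately simplified illustration, not the method of Ciompi et al. or the adaptation used here: the rotations and other spatial transformations are omitted, and the function name, axis order, and toy volume are assumptions.

```python
from typing import List, Tuple

Plane = List[List[float]]
Volume = List[Plane]  # indexed as volume[z][y][x]


def orthogonal_triplet(volume: Volume, z: int, y: int, x: int) -> Tuple[Plane, Plane, Plane]:
    """Return the axial, coronal, and sagittal planes passing through voxel (z, y, x)."""
    depth, height = len(volume), len(volume[0])
    axial = [row[:] for row in volume[z]]                  # rows = y, columns = x
    coronal = [volume[k][y][:] for k in range(depth)]      # rows = z, columns = x
    sagittal = [[volume[k][j][x] for j in range(height)]   # rows = z, columns = y
                for k in range(depth)]
    return axial, coronal, sagittal


# Toy 3 x 3 x 3 volume whose values encode their own coordinates (100*z + 10*y + x).
toy_volume = [[[100 * z + 10 * y + x for x in range(3)] for y in range(3)] for z in range(3)]
axial, coronal, sagittal = orthogonal_triplet(toy_volume, z=1, y=1, x=1)
```

All three planes share the chosen voxel, which is what ties the members of a triplet to the same location in the tumor.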
The SML framework was built on the DenseNet deep learning CNN architecture [22]. DenseNet was chosen because in this architecture, each transition layer has access to all the preceding feature maps in its block and therefore to the collective information of the network. This is meant to reduce parameter redundancy. Each of the three augmented datasets (N0 versus N1–N3, M0 versus upfront M1, M0 versus M1 at end of follow-up) was separately partitioned fivefold into two subsets: 80% for training and 20% for validation, with no overlap between patient studies assigned to the two sets. The means and SDs of the training data partition were computed for each of the images in the orthogonal triplet. Each image of an orthogonal triplet was centered by the training data mean and rescaled by the training SD before being input into the SML network.
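The partitioning and normalization steps can be sketched as follows. This is a hedged illustration rather than the study's pipeline: the function names, the fixed random seed, the use of the population SD, and the guard against a zero SD are all assumptions. The split is made on patient identifiers so that augmented samples from one patient never appear in both the training and validation sets.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def patient_level_splits(patient_ids: Sequence[str], n_folds: int = 5,
                         seed: int = 0) -> List[Tuple[List[str], List[str]]]:
    """Return one (training_ids, validation_ids) pair per fold, split by patient."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        validation = folds[k]
        training = [pid for j, fold in enumerate(folds) if j != k for pid in fold]
        splits.append((training, validation))
    return splits


def standardize(train_values: Sequence[float], values: Sequence[float]) -> List[float]:
    """Center by the training mean and rescale by the training SD."""
    mean = statistics.mean(train_values)
    sd = statistics.pstdev(train_values) or 1.0  # zero-SD guard is an assumption
    return [(v - mean) / sd for v in values]


# Ten made-up patient identifiers give an 80%/20% split in each of five folds.
toy_ids = [f"P{i:02d}" for i in range(10)]
toy_splits = patient_level_splits(toy_ids)
toy_scaled = standardize([1.0, 3.0], [1.0, 2.0, 3.0])
```

Computing the mean and SD from the training partition alone, and then applying them to the validation data, keeps information about the validation set out of the preprocessing.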
For each of the three separate datasets (N0 versus N1–N3, M0 versus upfront M1, M0 versus M1 at end of follow-up), a sample input into the SML framework consisted of a set of three triplets of image regions corresponding to the same primary tumor. The output of this SML framework was a predicted binary classification in the form of a probability that a tumor belongs to one of the two classes for one of the labeled datasets. Parameters specifying the DenseNet architecture are shown in Table 1.
Statistical Analysis
Continuous parameters are presented as median and range. The sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) statistics were computed for the validation runs by use of the designated CNN data analysis. These were calculated separately for prediction of N category, M category at presentation, and M category at the end of the follow-up period. Analysis of predicting M category at the end of treatment also included calculation of positive predictive value (PPV) and NPV. The results of CNN analyses were averaged over the five validation partition runs and are presented as mean and SD. These include AUC; F1 score, a measure of classification accuracy that ranges from 0 to 1 (with 1 indicating perfect precision and recall); and MCC, a measure of binary classification quality.
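For reference, the reported statistics follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function names and the example counts are hypothetical, and whether the SD across validation runs was the sample or the population SD is not stated in the text, so the sample SD used here is an assumption.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        # MCC is conventionally set to 0 when its denominator is zero.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


def mean_and_sd(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample SD across validation runs (sample SD is an assumption)."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical counts for one validation run (not study data).
toy_metrics = binary_metrics(tp=40, fp=10, tn=40, fn=10)
toy_summary = mean_and_sd([0.6, 0.8])
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it more informative when one class is much less common than the other.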
Results
Patients and Tumors
Among 2811 patients who underwent PET/CT for lung nodules and suspected or biopsy-proven lung malignancy between October 2013 and October 2015, 264 (144 men [54.5%], 120 women [45.5%]; median age, 67 years; range, 30–92 years) had confirmed NSCLC and complete datasets, which included all imaging, histologic, and clinical data, and participated in clinical follow-up for at least 6 months. The median follow-up period was 25.2 months (range, 6–43 months).
Among 264 tumors, 178 (67.4%) were adenocarcinoma, 64 (24.2%) were squamous cell carcinoma, and 22 (8.4%) were other subtypes, including mixed and large cell carcinoma. Low FDG avidity (SUVmax < 3.0) was found in 45 (17.0%) tumors, which were manually contoured during tumor segmentation. All other tumors (219/264 [83.0%]) were segmented by means of the designated semiautomatic method.
Nodal workup included histopathologic sampling for 223 of the 264 patients (84.5%) (71 by EBUS, nine by mediastinoscopy, 140 at surgery, three by neck node biopsy). For 41 (15.5%) patients, follow-up was the standard of reference for N category designation (peripheral T1 tumor, 27; upfront metastases or palliative status at presentation owing to general condition, nine; patient refusal of invasive nodal sampling, five). T category at initial diagnosis is detailed in Table 2. N and M categories at presentation and at the end of follow-up are shown in Table 3.
N0 Versus N1–N3
For the CNN nodal analysis, only patients who had undergone invasive lymph node assessment were included, and this group consisted of 223 of the 264 patients (84.5%). Of these, 152 (68.2%) had full surgical nodal staging. For 61 of the 152 (40.1%) patients, surgical staging followed prior presurgical EBUS nodal staging. In 3 of the 61 patients (5%) the category was increased from N0 to N1 during surgery. No other patients had a change in N category after surgery.
No nodal metastasis (N0) was found in 189 of the 223 (84.8%) patients at diagnosis. Most of them (110/131 [84.0%]) had surgical confirmation of the negative nodal status. During the follow-up period, 38 of 189 (20.1%) patients with N0 at diagnosis were found to have nodal metastases. Nodal metastases (N1–N3) were found in 92 of 223 (41.3%) patients at the end of the follow-up period. Among these 92 patients, 30 (32.6%) underwent surgical staging, and 21 (22.8%) had a percutaneous biopsy result confirming a metastatic supraclavicular lymph node (N3).
For correct binary prediction of N category (N0, N1–N3) among the entire cohort, mean accuracy was 0.80 ± 0.17. Mean sensitivity and specificity were 0.74 ± 0.32 and 0.84 ± 0.16. Mean AUC was 0.80 ± 0.01; F1 was 0.75 ± 0.22 with MCC of 0.59 ± 0.36 (Fig. 2 and Table 4).
M0 Versus M1
The analysis of M0 versus M1 included all 264 patients. Of these, 223 (84.5%) had disease designated M0 at initial presentation, and 162 (61.4%) had disease designated M0 at the end of the follow-up period (median, 25.2 months; range, 6–43 months). For prediction of upfront M category at diagnosis, mean accuracy was 0.82 ± 0.04. Mean sensitivity and specificity were 0.24 ± 0.11 and 0.91 ± 0.02. Mean AUC was 0.71 ± 0.11; F1 was 0.27 ± 0.11 with MCC of 0.17 ± 0.14 (Fig. 3 and Table 4). For prediction of M category at the end of treatment, mean accuracy was 0.63 ± 0.05. Mean sensitivity and specificity were 0.45 ± 0.08 and 0.79 ± 0.06. Mean AUC was 0.65 ± 0.05; F1 was 0.53 ± 0.05 with MCC of 0.26 ± 0.1. PPV was 0.52 ± 0.09, and NPV was 0.69 ± 0.08 (Fig. 4 and Table 4).
Discussion
In this study, CNN analysis of primary PET images of previously untreated NSCLC resulted in correct prediction of N category with moderate sensitivity and specificity of 74% ± 32% and 84% ± 16%. These values are comparable to, although slightly lower than, those reported by other authors, who used various methods, including simple metrics, such as SUV, and more advanced methods of texture analysis and machine learning. Cho et al. [23] analyzed the ratio of mediastinal lymph node SUV to that of the primary tumor and reported sensitivity of 87.5% and specificity of 85.9% for predicting metastatic lymph nodes. When analyzing tumor texture on CT images as a predictor of nodal metastases, Andersen et al. [24] found higher specificity (97%) and lower sensitivity (53%) than we did when predicting whether mediastinal lymph nodes were benign or malignant. In a study combining both PET and CT data on mediastinal lymph nodes, Lee et al. [25] found sensitivity of 88.3% and specificity of 82.6% in assessing nodal metastases. Neither Andersen et al. nor Lee et al. analyzed the primary tumor; both assessed lymph nodes only. This may have contributed to their higher sensitivity in predicting nodal metastases.
The sensitivity of our CNN for predicting distant metastatic potential was low in both analysis of upfront metastasis (sensitivity, 24%) and analysis of metastatic potential at the end of the surveillance period (sensitivity, 45%). Although the specificity was high (91% in the upfront M1 group, 79% in the end of follow-up group), the PPV and NPV for predicting M category at the end of follow-up were low, 54.5% and 68.6%. Given these results, it would be difficult to create a new prognostication method and to tailor a patient-specific surveillance algorithm using the current data analysis.
In predicting distant metastatic potential, other published studies have shown variable results but generally had limited success rates. Wu et al. [26] found a concordance index of 0.71 in predicting distant metastases, which outperformed both SUVmax and tumor volume (concordance indexes, 0.67 and 0.64). Zhou et al. [27] found similar results with classification accuracy of 72.84%. Coroller et al. [28], however, found a specific CT radiomic signature that provided a concordance index of 0.61 for predicting distant metastases. The lower sensitivity in our study could be explained by the small number of patients with distant metastatic disease (15.5% at presentation, 38.6% by the end of the follow-up period). Future studies including larger patient cohorts with longer follow-up periods, which will likely increase the number of patients in whom metastases develop, may improve the diagnostic performance of CNN for predicting distant metastases.
Other studies, like the current one, have applied different analytical methods to the challenge of predicting lymph node and distant metastatic potential. These studies all seem to have reached an accuracy plateau, with a limit to the achievable sensitivity and specificity. For lymph node staging, this plateau does not significantly improve on the known accuracy of human interpretation, and in their current state these methods cannot obviate invasive lymph node sampling as part of staging of NSCLC. There are several possible reasons for this plateau in diagnostic accuracy. Some studies, including the current one, analyze data on all patients with NSCLC as one group, whereas others also include patients with SCLC. Because NSCLC is a group of malignancies, including several histologic types that may have different imaging features, future studies should explore whether analyzing different histologic types separately (e.g., separating adenocarcinoma from squamous cell carcinoma) could decrease data heterogeneity and subsequently improve the overall performance of CNNs.
Another possible factor limiting the accuracy of CNNs is the use of data from a single imaging modality (PET or CT), a method used in an attempt to simplify and standardize data extraction. The various machine learning techniques analyzed, including CNN, have not yet yielded practice-changing prognostication. Work to be done in this area includes assessment of dual-stream CNN and analysis of data obtained from both PET and CT as input streams together [19].
Limitations
Our study had several limitations. First, we analyzed all NSCLC subtypes as a group (mainly adenocarcinoma and squamous cell carcinoma), even though this is a heterogeneous group that likely has different imaging features. Second, the small size of our cohort, especially the small number of patients with distant metastases, may have limited our results, because our study required data augmentation techniques to enable machine learning training. Furthermore, the cohort was not large enough to analyze each histologic subtype separately. Third, using a single-stream CNN may have resulted in lower sensitivity and specificity than use of a dual-stream CNN.
Fourth, we analyzed distant metastatic potential in patients at the end of variable follow-up periods. These data yielded overall worse results compared with those of analysis of data on patients with upfront distant metastatic disease, which might have been due to the effect of various treatments administered to these patients. Analyzing more homogeneous patient populations with more uniform datasets may yield better results.
Fifth, our data analysis did not take into account the initial T category, which might have affected our results. Moreover, we did not include a multivariate analysis to take into account other tumor parameters, such as grade and genetic mutation status, which may have improved the accuracy and NPV of the CNN. However, it is likely that this type of analysis would require a much larger patient cohort. Improved methods should be explored in future studies, most likely in a multicenter design, which would provide this large patient cohort. Finally, the use of segmented tumors as input data for the CNN might have affected the results. The exact magnitude of this impact and whether it improves or worsens the performance of the CNN are as yet unknown [29].
Conclusion
Our study showed that using a CNN to analyze segmented primary tumors with PET in patients with previously untreated NSCLC can yield a reasonably good prediction of N category, although this prediction does not significantly differ from clinical assessment of nodal status at PET/CT. Even though the CNN had limited sensitivity for determining the distant metastatic potential of tumors, it had fairly high specificity. Our accuracy rates are similar to those in previous studies in which various classic and advanced methods were used. The use of current tools and limited-size patient databases appears to have produced an accuracy plateau, which may require application of CNNs to more uniform cohorts or require rethinking of the way advanced image analysis is used. To move CNNs from the research setting to clinical practice, this accuracy plateau must be overcome.