Crohn disease is a chronic inflammatory bowel disease that can cause a wide variety of symptoms. Several scoring systems that can grade disease activity are already well established in the management of luminal Crohn disease [1]. The Crohn disease endoscopic index of severity (CDEIS), histopathologic grading according to Borley et al. [2], and imaging scores such as that for perianal Crohn disease are increasingly used [3]. However, there is no universally accepted grading of Crohn disease activity, and we are left with a very important clinical problem: Can we grade disease activity with any method, and more importantly, how can we predict the outcome of medical therapy?
Recently, two groups have each developed a quantitative scoring system for Crohn disease activity: the MR index of activity and the Crohn disease MRI index (CDMI) score [13, 15]. The MR index of activity has a reported high correlation with the CDEIS (r = 0.80; p < 0.001), whereas the CDMI score has a high correlation with a histopathology score (the estimated acute inflammation score) (Kendall τ b = 0.48; p = 0.002). Therefore, either scoring system could be considered for assessing Crohn disease activity, but their reproducibility needs to be evaluated before wider clinical implementation. Furthermore, to our knowledge, no study has compared the accuracy of these two scoring systems in an external patient cohort.
The primary aim of this study was to assess the reproducibility of MRI features and scoring systems in patients with Crohn disease. The secondary aim was to correlate these scoring systems with the CDEIS in an external patient cohort.
Materials and Methods
Patients
Data from 33 consecutive patients with Crohn disease proven at histopathologic analysis were analyzed. These patients had taken part in a prospective single-center study comparing dynamic contrast-enhanced (DCE) MR enterography to ileocolonoscopy with CDEIS. The indication for ileocolonoscopy was clinical suspicion of relapsing Crohn disease. Exclusion criteria were age younger than 18 years, contraindications for MRI (including pacemakers, metallic implants, severe claustrophobia, and pregnancy), technical failure of a sequence, incomplete reference standard (CDEIS), and a negative diagnosis for Crohn disease. All patients had been recruited between February 2009 and November 2010 for assessment of Crohn disease activity. Furthermore, for all patients, ileocolonoscopy had been performed within a month of the MR enterography. The results of that study have been published previously [
16].
Patient exclusion criteria for this study were non-diagnostic MR enterography image quality (i.e., the study was not of sufficient quality to determine disease activity, if present) as determined by one or more of the readers and an incomplete MR enterography scan protocol (i.e., not including T2-weighted single-shot fast spin-echo [FSE], T2-weighted fat-saturated single-shot FSE, or 3D T1-weighted contrast-enhanced sequences, which are mandatory for calculating the scoring systems). Per-segment exclusion criteria were resected bowel segments and insufficient distention or visibility (< 20% of the bowel adequately distended and visible) of a bowel segment, as determined by one of the readers.
The Crohn disease activity index (CDAI) score [17] and C-reactive protein levels were assessed in all patients. A CDAI score greater than 150 or a C-reactive protein level greater than 8 mg/L was considered to indicate active disease.
For the previous study, ethical permission was obtained from the hospital medical ethics committee, and written informed consent was obtained from all patients. For the current study, informed consent was waived by the hospital medical ethics committee.
Finally, three of the 33 patients in the dataset of the prior study were excluded. Two observers assessed the quality of one MR enterography study (one patient) as nondiagnostic, and for two other MR enterography examinations, no T2-weighted fat saturation sequence was available (two patients). Thus, 30 patients (median age, 32 years; age range, 19–72 years; 21 women and nine men) were evaluated. The CDAI values showed that 47% (
n = 14) of the patients had active disease. The C-reactive protein values showed that 50% (
n = 15) of the patients had active disease. Baseline characteristics are shown in
Table 1.
These 30 patients had 148 eligible segments (in two patients, only four segments were eligible after right hemicolectomy). Five segments, all rectal, were excluded because of insufficient visibility, so 143 evaluable segments were radiologically scored by the four observers.
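The segment counts above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch: the variable names are illustrative, and the figure of five segments per patient is taken from the segment definition used in this study.

```python
# Bookkeeping sketch for the segment counts reported in the text.
# Variable names are illustrative, not taken from the study.
PATIENTS = 30
SEGMENTS_PER_PATIENT = 5  # terminal ileum, right colon, transverse colon, left colon, rectum

# Two patients had only four eligible segments after right hemicolectomy.
segments_lost_to_resection = 2
# Five rectal segments were excluded because of insufficient visibility.
excluded_rectal_segments = 5

eligible_segments = PATIENTS * SEGMENTS_PER_PATIENT - segments_lost_to_resection
evaluable_segments = eligible_segments - excluded_rectal_segments
```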
MR Enterography Protocol
The protocol of the study has been published previously [16]. Patients fasted for 4 hours before the examination and drank 1600 mL of mannitol (2.5%; Osmitrol, Baxter) solution 1 hour before the scan. Supine images were acquired using a 3-T MRI unit (Intera, Philips Healthcare) with a 16-channel torso phased-array body coil. Axial and coronal T2-weighted single-shot FSE sequences with and without fat saturation were acquired, followed by a coronal 3D T1-weighted spoiled gradient-echo sequence with fat saturation. After these series, 20 mg of butylscopolamine bromide (Buscopan, Boehringer Ingelheim) was administered IV, and a DCE-MRI sequence was obtained. Ten seconds after the start of the dynamic sequence, 0.1 mL/kg bodyweight of gadobutrol (1.0 mmol/mL; Gadovist, Bayer Schering Pharma) was injected IV as a bolus (5 mL/s) through a 20-gauge IV catheter using an automated injection pump (Mallinckrodt Optistar, Liebel-Flarsheim). Injection of contrast medium was immediately followed by a bolus of 15 or 20 mL of saline (5 mL/s), depending on the length of the contrast injection tube. The duration of the DCE-MRI sequence was 6 minutes. After these series, a second dose of 20 mg of butylscopolamine bromide was administered IV. Thereafter, contrast-enhanced axial and coronal 3D T1-weighted spoiled gradient-echo sequences with fat saturation were performed. All sequences except the DCE-MRI sequence were used for image analysis.
Observers
Four readers from two tertiary centers in different countries with 18 years (700 MR enterography studies), 17 years (1100 MR enterography studies), 4 years (170 MR enterography studies), and 1 year (160 MR enterography studies) of experience in reading abdominal MRI evaluated the MRI scans using the axial and coronal T2-weighted single-shot FSE with and without fat saturation, coronal unenhanced, and axial and coronal contrast-enhanced 3D T1-weighted spoiled gradient-echo sequences (
Table 2). All readers used a PACS (Impax 5.0, AGFA Healthcare, Agfa-Gevaert) workstation. All readers were unaware of the findings at the initial reading and the findings from ileocolonoscopy but were aware of patients' surgical history. The small bowel and the colon were divided into five segments: terminal ileum, right colon (cecum plus ascending colon), transverse colon, left colon (descending colon plus sigmoid), and rectum, so there could be a direct segment comparison between MRI and the CDEIS.
MRI Features
Seventeen different MRI features (
Table 3) were evaluated by all readers. Features were selected according to the MRI features described in the literature and used by most abdominal radiologists as identified in an international inventory, together with those used in the two published scoring systems [
13,
15,
18]. The most affected part of the segment was chosen for scoring.
The following MRI features were used to calculate the MR index of activity: mural thickness in millimeters, relative contrast enhancement, and the presence of edema and ulcers. These features have been proven to be significantly correlated to the CDEIS. The MR index of activity was calculated using the following formula: (1.5 × wall thickness in millimeters) + (0.02 × relative contrast enhancement) + (5 × edema) + (10 × ulceration) [
13].
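As an illustration, the formula above can be written as a small function. This is a minimal sketch: the function and parameter names are illustrative, and edema and ulceration are assumed to be coded as presence flags (1 if present, 0 if absent).

```python
def mr_index_of_activity(wall_thickness_mm, relative_contrast_enhancement,
                         edema, ulceration):
    """Segmental MR index of activity from the four features in the formula.

    edema and ulceration are presence flags (True/False or 1/0); this
    coding is an assumption of the sketch.
    """
    return (
        1.5 * wall_thickness_mm
        + 0.02 * relative_contrast_enhancement
        + 5 * int(bool(edema))
        + 10 * int(bool(ulceration))
    )


# Hypothetical segment: 6-mm wall, relative contrast enhancement of 100,
# edema present, no ulcers: 9 + 2 + 5 + 0.
example_score = mr_index_of_activity(6.0, 100.0, edema=True, ulceration=False)
```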
For the overall CDMI score, the following four features—mural thickness, mural T2 signal, perimural T2 signal, and mural T1 enhancement—were scored on a scale of 0 to 3, resulting in a maximum score of 12 [
15]. These four features were selected because they were found to be significantly correlated with disease activity according to an endoscopic biopsy acute inflammatory score. In addition, the sum of the scores for mural thickness, mural T2 signal, perimural T2 signal, and contrast enhancement showed the highest accuracy [
15]. Furthermore, the following features were assessed: abscess, comb sign, enlarged (> 1 cm) lymph nodes, fistulas, lymph node enhancement, pattern of mural enhancement, pseudopolyps, and total length of the disease in each segment.
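The overall CDMI score described above, a sum of four grades of 0 to 3, can be sketched as follows. The dictionary keys and the function name are illustrative and do not come from the original scoring article.

```python
# Illustrative feature keys for the four CDMI components named in the text.
CDMI_FEATURES = (
    "mural_thickness",
    "mural_t2_signal",
    "perimural_t2_signal",
    "mural_t1_enhancement",
)


def cdmi_score(grades):
    """Sum the four ordinal grades (each 0-3) into an overall score of 0-12."""
    total = 0
    for feature in CDMI_FEATURES:
        grade = grades[feature]
        if grade not in (0, 1, 2, 3):
            raise ValueError(f"{feature} must be graded 0, 1, 2, or 3, got {grade!r}")
        total += grade
    return total


# Hypothetical segment with moderate wall thickening and marked enhancement.
example_cdmi = cdmi_score({
    "mural_thickness": 2,
    "mural_t2_signal": 1,
    "perimural_t2_signal": 0,
    "mural_t1_enhancement": 3,
})
```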
The MRI features with regard to lymph nodes were scored per patient; all other features were assessed per segment. The readers used the same method as described in detail in the articles about the MR index of activity and the CDMI score [
13,
To calculate the relative contrast enhancement, we used the formula described by Rimola et al. [13]: relative contrast enhancement = [(WSI contrast-enhanced – WSI unenhanced) / WSI unenhanced] × 100 × (SD noise unenhanced / SD noise contrast-enhanced). Here, WSI is the wall signal intensity, SD noise unenhanced corresponds to the average of three SDs of the signal intensity measured outside of the body before gadolinium-based contrast agent injection, and SD noise contrast-enhanced corresponds to the SD of the same noise after gadolinium-based contrast agent administration [
19]. Ulcerations (defined as deep depressions in the mucosal surface of a thickened segment) and the short-axis diameters of enlarged lymph nodes were assessed on contrast-enhanced 3D T1-weighted spoiled gradient-echo images with fat saturation.
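The relative contrast enhancement formula can be sketched in code as follows. The function name, parameter names, and example values are illustrative; the signal intensities and noise SDs are assumed to have been measured already from regions of interest.

```python
from statistics import mean


def relative_contrast_enhancement(wsi_unenhanced, wsi_enhanced,
                                  sd_noise_unenhanced, sd_noise_enhanced):
    """Percentage increase in wall signal intensity, scaled by the noise ratio."""
    if wsi_unenhanced == 0 or sd_noise_enhanced == 0:
        raise ValueError("unenhanced WSI and contrast-enhanced noise SD must be nonzero")
    percent_increase = (wsi_enhanced - wsi_unenhanced) / wsi_unenhanced * 100
    return percent_increase * (sd_noise_unenhanced / sd_noise_enhanced)


# Hypothetical measurements: the unenhanced noise SD is the average of
# three SDs measured outside the body, as described in the text.
sd_pre = mean([4.0, 5.0, 6.0])
rce = relative_contrast_enhancement(
    wsi_unenhanced=200.0,
    wsi_enhanced=500.0,
    sd_noise_unenhanced=sd_pre,
    sd_noise_enhanced=5.0,
)
```

With equal noise SDs before and after contrast, the result reduces to the plain percentage increase in wall signal intensity.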
Four of the MRI features are common to the MR index of activity and CDMI score but were assessed using different definitions of abnormality according to the particular scoring system. Specifically, mural thickness was measured using an ordinal score (0–3) and as a continuous variable in millimeters using calipers on either single-shot FSE or spoiled gradient-echo sequences. T1 contrast enhancement was measured using an ordinal score (0–3) and using relative contrast enhancement [19]. Lymph nodes were assessed using an ordinal (0–3) score and a binomial score (yes/no). T2 signal was measured using an ordinal score (0–3) and a binomial score (edema; yes/no).
Reference Standard
Colonoscopy was performed after standard bowel preparation by either a gastroenterologist or a senior resident in gastroenterology under direct supervision of a gastroenterologist, using a standard colonoscope. The performing endoscopist was aware of the patient's history but was blinded to the MR enterography results. Segments were excluded from the analysis if they could not be scored during ileocolonoscopy. One of two gastroenterologists experienced in endoscopy of inflammatory bowel disease assessed the CDEIS [
1]. A segmental CDEIS was calculated using the variables of deep ulceration (no = 0, yes = 12), superficial ulceration (no = 0, yes = 6), surface involved by disease (0–10), and ulcerated surface (0–10) for each of five bowel segments (terminal ileum, right colon, transverse colon, left colon, and rectum). The reference standard has been described in detail elsewhere [
16]. In six patients, the terminal ileum could not be assessed during ileocolonoscopy because of a stenosis; therefore, 137 segments were correlated to the segmental CDEIS scores.
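A segmental CDEIS built from the four variables above might be computed as in the following sketch. The text lists the variables and their value ranges but does not state how they are combined; simple summation is assumed here, and the function and parameter names are illustrative.

```python
def segmental_cdeis(deep_ulceration, superficial_ulceration,
                    surface_involved, ulcerated_surface):
    """Segmental CDEIS, assuming the four variables are simply summed.

    deep_ulceration and superficial_ulceration are presence flags;
    surface_involved and ulcerated_surface are scores from 0 to 10.
    """
    for name, value in (("surface_involved", surface_involved),
                        ("ulcerated_surface", ulcerated_surface)):
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must lie between 0 and 10, got {value!r}")
    return (
        (12 if deep_ulceration else 0)
        + (6 if superficial_ulceration else 0)
        + surface_involved
        + ulcerated_surface
    )


# Hypothetical segment: superficial ulcers only, 4/10 of the surface
# involved by disease, 2/10 of the surface ulcerated.
example_cdeis = segmental_cdeis(False, True, 4, 2)
```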
The median time between colonoscopy and MR enterography was 7 days (interquartile range, 5–14 days). MRI and colonoscopy were not performed on the same day. The median CDEIS was 4.3 (interquartile range, 1.6–5.8).
Data Analyses
Several multirater analyses were performed for all features individually and for the overall MR index of activity and CDMI score to assess the interobserver agreement. For all ordinal data, a weighted kappa coefficient was calculated for each pair of raters and then pooled. For the binomial data, a multirater kappa coefficient was used, which was also calculated for each pair of raters and pooled. For continuous data, a multirater intraclass correlation coefficient was determined.
In addition, the scores of the two most experienced abdominal radiologists were analyzed post hoc to evaluate whether experience would positively influence the reproducibility values. Both kappa and intraclass correlation coefficient values were interpreted as follows: 0–0.20, poor; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, good; and 0.81–1.00, excellent [20].
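The pairwise weighted kappa analysis and the interpretation scale can be sketched as follows. Several choices here are assumptions, because the text does not specify them: linear disagreement weights, pooling by the unweighted mean of the pairwise values, and labeling negative values as poor. The function names are illustrative, and the statistical software used in the study may compute these quantities differently.

```python
from itertools import combinations
from statistics import mean


def weighted_kappa(ratings_a, ratings_b, n_categories):
    """Linearly weighted Cohen kappa for two raters using grades 0..n_categories-1."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("both raters must score the same, non-empty set of items")
    if n_categories < 2:
        raise ValueError("at least two categories are required")
    k = n_categories
    # Joint proportions: observed[i][j] is the share of items graded i by A and j by B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # linear disagreement weight (assumed)
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return 1.0 - observed_disagreement / expected_disagreement


def pooled_pairwise_kappa(ratings_by_reader, n_categories):
    """Mean of the weighted kappa over all reader pairs (pooling method assumed)."""
    pairs = combinations(ratings_by_reader, 2)
    return mean([weighted_kappa(a, b, n_categories) for a, b in pairs])


def interpret_agreement(value):
    """Map a kappa or intraclass correlation value to the labels used in the text."""
    if value <= 0.20:
        return "poor"  # negative values are also labeled poor (assumption)
    if value <= 0.40:
        return "fair"
    if value <= 0.60:
        return "moderate"
    if value <= 0.80:
        return "good"
    return "excellent"
```

With two categories (yes/no features), the linear weights reduce to the unweighted Cohen kappa, so the same function covers the binomial features under these assumptions.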
For an overall correlation, we first calculated the means of the MR index of activity scores and CDMI scores per segment across all four observers and correlated these values with the CDEIS scores. Because the segmental scores were interpreted as continuous variables, the Spearman correlation was used to correlate the segmental CDEIS scores with the segmental MR index of activity and segmental CDMI scores. Correlation coefficient values were interpreted as follows: 0.0, not correlated; 0.2, weakly correlated; 0.5, moderately correlated; 0.8, strongly correlated; and 1.0, perfectly correlated. Statistical analysis was performed using Excel 2003 (Microsoft) and PASW Statistics software (version 19, SPSS).
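The two steps of the correlation analysis, averaging the readers' scores per segment and computing a Spearman correlation with the segmental CDEIS, can be sketched in plain Python. The function names are illustrative, and the implementation (tie-averaged ranks followed by a Pearson correlation of the ranks) is the textbook definition rather than the routine of the statistical software used in the study.

```python
from statistics import mean


def mean_score_per_segment(scores_by_reader):
    """Average the readers' scores per segment.

    scores_by_reader holds one list per reader, aligned by segment.
    """
    return [mean(segment_scores) for segment_scores in zip(*scores_by_reader)]


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x, mean_y = mean(rank_x), mean(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return covariance / (var_x * var_y) ** 0.5
```

A typical use would pass the four readers' segmental scores to `mean_score_per_segment` and then correlate the result with the list of segmental CDEIS values using `spearman_rho`.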
Discussion
This study shows variable reproducibility of many individual MRI features advocated in the assessment of Crohn disease activity. Four features (wall thickness in millimeters, the presence of edema [yes/no], enhancement pattern [0–3], and length of the disease in each segment [0–3]) had good reproducibility, whereas extramural MRI features such as perimural T2 signal, comb sign, and lymph nodes showed only fair reproducibility. When individual features were combined into the two scoring systems proposed in the literature (MR index of activity and CDMI), interobserver agreement was good across the four readers.
Our study has several strengths. Four readers from two international expert centers assessed a large number of bowel segments using different features and two scoring systems. In addition, four specific MRI features were measured in two different ways within the two scoring systems [13, 15], and we applied both definitions to determine which method is more reproducible. The MRI scoring systems were correlated per segment with the CDEIS, which is a more objective activity index than clinical and biochemical parameters [21]. Overall, the recently developed MRI scoring systems showed good-to-excellent reproducibility and moderate correlation with the CDEIS.
Several, mainly extramural, MRI features showed only fair reproducibility, whereas some authors have reported higher interobserver agreement [9, 13, 22] than we found in our study. Conversely, the variable interobserver agreement in our study is more in concordance with other data [12, 23–25]. One explanation for this discrepancy might lie in the severity of the disease of the included patients. Severe disease is easier to diagnose than mild disease, because in the former, the MRI features are more pronounced [4]. Importantly, mural thickness, T1 contrast enhancement, and T2 wall signal indicating edema are considered important MRI features of activity [18]. These features are common to both the MR index of activity and the CDMI score, and, reassuringly, all showed moderate-to-good interobserver agreement.
Although the aforementioned important MRI features may be considered the basic elements of both systems, they are defined in different ways. In the CDMI score, only qualitative variables are used, whereas in the MR index of activity, predominantly quantitative data are extracted. Using quantitative data as in the MR index of activity score might lead to a more precise grading of disease activity, although it is more time consuming, which potentially limits the use of this score in the clinical setting. An example is the measurement of the relative contrast enhancement, where region of interest measurements are used. Region of interest–based measurements are known to have poor interobserver agreement [24]. Accordingly, our study showed lower reproducibility for relative contrast enhancement (0.42) than for grading T1 enhancement from 0 to 3 (0.57).
In addition to contrast enhancement, three other MRI features are defined in two different ways in the literature. The measurement of edema is essential in the management of Crohn disease to differentiate intestinal inflammation from fibrosis. Our study showed higher reproducibility when edema was measured binomially (yes/no) rather than ordinally (0–3). Mural thickness measured in millimeters is not only more objective than a qualitative grade but also more reproducible. The two lymph node measurements, as described in the development of the MR index of activity [13] and the CDMI score [15], showed similarly fair interobserver agreement. These findings clarify how features can be most reproducibly measured and might result in more consistent use in the future.
It is generally assumed that any new radiologic technique such as MR enterography has an associated learning curve for accurate interpretation. We therefore investigated whether experience might have influenced the reproducibility values. We had two experienced (700 or more MR enterography studies) and two less-experienced (170 or fewer MR enterography studies) readers. Only five of the tested MRI features showed improved reproducibility values when assessed by the experienced readers. Interobserver agreement for enlarged lymph nodes (> 1 cm), lymph nodes (size and number; 0–3), comb sign, mural thickness (0–3), and mural thickness measured in millimeters increased from fair to moderate, from moderate to good, or from good to excellent. This could be because the less-experienced readers were not used to assessing these features.
One could argue that some MRI features would have shown higher reproducibility if scored by experienced observers only. However, restricting the analysis to the experienced observers produced only a small increase in kappa or intraclass correlation coefficient values, and for only a few MRI features. This is in accordance with the findings of a previous study in which the reproducibility of bowel-wall gadolinium enhancement measurements was determined [24].
To our knowledge, our study is the first to compare the reproducibility of multiple MRI features and two scoring systems and to describe the interobserver variability of similar MRI features measured in different ways. Although certain individual features (e.g., perimural T2 signal, ulcerations, and relative contrast enhancement) showed only fair interobserver agreement, importantly, when combined in both the CDMI score and the MR index of activity, the results showed good reproducibility.
The correlation with the CDEIS in our study, which is lower than that reported by Rimola et al. [13] for the MR index of activity, might be explained by the different study protocols. We used a less-extensive method of contrast agent administration than was used to develop the MR index of activity, where warm water was retrogradely instilled into the colon. In addition, our study cohort primarily comprised patients with mild disease activity, whereas the MR index of activity was developed in a cohort including patients with more-severe disease activity. This may explain the lower correlation with the CDEIS and the lower detection rate of ulcerations in our series than in the original article about the MR index of activity [13]. On the other hand, the correlation with endoscopic activity is in concordance with previous research [26, 27]. Furthermore, our protocol contained late phase IV contrast-enhanced series, which may have affected the evaluation of the contrast-enhanced series.
A number of limitations have to be acknowledged. The CDEIS is not a perfect reference standard because it assesses the mucosa only and gives little information on the transmural and extramural disease extent. However, endoscopy remains the reference standard for Crohn disease activity. We chose MR enterography as the contrast agent administration technique, because it is the most commonly used technique for bowel distention in patients with Crohn disease and is better accepted than MR enteroclysis [28]. Neither MR enterography nor MR enteroclysis is aimed at optimal colonic distention, although colonic distention will be obtained to a variable extent. In our study, sufficient colonic distention and visibility were achieved in all but five patients, in whom the rectum was inadequately visible. The MR index of activity was developed in a cohort using both MR enterography and rectal fluid administration. This difference in bowel preparation may, at least in part, explain the different correlation between the MR index of activity and CDEIS in this study as compared with the studies by the Barcelona group that introduced this score [
13,
29]. Recent articles have reported that motility can be changed in affected small-bowel locations in Crohn disease [
30,
31]. However, our protocol did not contain cine MR motility series and, therefore, we could not study the scoring system developed by Girometti et al. [
32].
Like ulcerations, abscesses, fistulas, and pseudopolyps were rarely seen in our data. This is in line with daily clinical experience in our tertiary referral centers and reflects the patient spectrum in our institutions. Analysis of a group of patients with greater disease severity is needed to accurately determine the interobserver variability of these features.
We did not perform an intraobserver analysis, because intraobserver agreement is generally higher than interobserver agreement, which is intuitive because one would expect an observer to agree more with himself or herself than with another reader. Another methodologic limitation might be that we used only MRI examinations obtained at 3 T, but we do not expect substantial differences in evaluation of the features, MR index of activity score, and CDMI score compared with 1.5 T. Indeed, one study has reported that 3 T is as accurate as 1.5 T in the assessment of Crohn disease [
33].
In summary, some commonly used MRI features have good reproducibility among four readers. Two recently developed scoring systems, the CDMI and MR index of activity scores, have good reproducibility and correlate moderately with the CDEIS. Additional research in a larger cohort of patients, including all disease stages and with more than one reference standard, has to be performed before a global, accurate MRI scoring system can be implemented in clinical trials and daily clinical practice.