Fundamentals of Clinical Research for Radiologists
1 Technology Assessment Unit, Royal Victoria Hospital, Montreal, QC H3A 1A1,
Canada.
2 Department of Epidemiology and Biostatistics, McGill University, 1020 Pine
Ave. W, Montreal QC H3A 1A2, Canada.
3 Department of Diagnostic Radiology, Montreal General Hospital, McGill
University Health Centre, 1650 Cedar Ave., Montreal QC H3G 1A4, Canada.
4 Department of Oncology, Synarc, 575 Market St., San Francisco CA, 94105.
Received November 17, 2004;
accepted after revision November 23, 2004.
Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and
Caroline Reinhold.
Introduction
In a hypothetical study evaluating the use of MRI for the assessment of myocardial viability, researchers were interested in characterizing the nature of the relation between myocardial infarct volume and ejection fraction. Their objective was to answer questions such as: Is there any relation between infarct volume and ejection fraction? What is the strength of this relation? Does ejection fraction increase or decrease with increasing myocardial infarct volume? By how much would we expect the ejection fraction to change when the myocardial infarct volume increases by 1 mL? Can we predict a patient's ejection fraction when given his or her myocardial infarct volume? How accurate is this prediction?
Questions such as these arise in situations in which more than one variable has been measured on each patient (or observational unit) in a sample, and the relationship between the different variables is of interest. This module covers some of the most commonly used statistical tools to answer such questions: correlation coefficients and regression models. We will cover methods for studying the relation between two variables that may be both continuous, both dichotomous (i.e., having only two values), or a mix (one dichotomous and the other continuous). We will also cover situations in which we wish to study the relation between more than two variables.
To illustrate the methods in this tutorial we have used hypothetical examples that are all inspired by studies appearing in radiology research journals. Some of the concepts covered in this tutorial assume knowledge of earlier articles in this series, to which the reader is encouraged to refer [1-4].
Pearson's correlation coefficient can range from a minimum value of -1 to a maximum value of 1. Figures 2A, 2B, 2C, 2D, 2E, and 2F illustrate the value of rP in various prototypical situations. A value of rP = 1 is obtained when an increase in X is always associated with an increase in Y and the points in the scatterplot between X and Y can be joined to form a perfect straight line (Fig. 2A). A value of rP = -1 is indicative of a perfect negative linear relation between X and Y (Fig. 2B). As the strength of the linear relation between X and Y diminishes, the value of rP approaches 0 (Figs. 2C and 2D). A correlation coefficient of 0 indicates that there is no linear relation between the two variables. For the hypothetical data in Figures 1A and 1B, we find that rP is -0.91, suggesting a fairly strong negative relation between myocardial infarct volume and ejection fraction. The interested reader is referred to the table at the end of the appendix for a more detailed explanation of how to calculate the correlation coefficient.
Figures 2E and 2F illustrate two situations in which there is a perfect, though nonlinear, relation between X and Y. In Figure 2E, an increase in X is always accompanied by an increase in Y. Here, rP is quite high (0.92), although not equal to 1. In Figure 2F we have a U-shaped relation between the variables, with both low and high values of X being associated with high values of Y. Here rP is close to 0, suggesting only a weak relation between X and Y. These plots serve to illustrate that a value of rP close to 0 does not rule out the possibility of a strong nonlinear relationship between the variables.
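The contrast between a straight-line relation and a U-shaped relation can be checked numerically. The following is a minimal sketch in Python using made-up data, not the data behind Figure 2; the function name `pearson_r` and the data values are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, symmetric about 0
x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * xi + 1 for xi in x]   # perfect straight line
u_shaped = [xi ** 2 for xi in x]    # perfect U-shaped relation

r_linear = pearson_r(x, linear)
r_u_shaped = pearson_r(x, u_shaped)
print(r_linear, r_u_shaped)
```

In this sketch the U-shaped data are completely determined by X, yet the coefficient is essentially 0, which is why a scatterplot should accompany any reported correlation.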
Interpreting Pearson's correlation coefficient: A few things need to be kept in mind when interpreting a correlation coefficient:
Assumptions used in calculating Pearson's correlation coefficient: Some important things need to be kept in mind before calculating rP. First, it is based on the assumption that both X and Y are measured on an interval scale. When we say myocardial infarct volume has been measured on an interval scale, we mean that a myocardial infarct volume of 4 mL is twice as large as a myocardial infarct volume of 2 mL. This would not have been true if it were measured by an ordinal variable having values 1 (small), 2 (medium), and 3 (large) because we cannot say that a patient rated as "medium" has twice the myocardial infarct volume of a patient rated as "small." Second, both X and Y are assumed to follow a normal probability distribution [2]. This assumption allows us to perform hypothesis tests and construct confidence intervals for rP, as we will see.
Inference for Pearson's correlation coefficient: The sample correlation coefficient, rP, is a statistic whose value changes depending on the sample collected. It is only an estimate of the population correlation coefficient, ρP, that we would have obtained if it were possible to observe the entire population of patients (or study units) from which the sample was collected. When reporting the sample correlation coefficient, we also need to report some measure of our uncertainty in the knowledge of the population correlation coefficient. This uncertainty may be expressed in terms of a p value or a confidence interval [3]. Confidence intervals are preferred to p values because they provide more information regarding the parameter estimated. An earlier article in this series explains in detail the distinction between confidence intervals and p values [3]. However, p values are still frequently reported in the medical literature, so we cover methods for their calculation and interpretation here.
p value: A p value measures how compatible the observed data are with a null hypothesis of the form H0: ρP = ρ0, where ρ0 is a predetermined value of the correlation coefficient of interest. In our example on myocardial infarct volume and ejection fraction, we can set ρ0 = 0 to test the hypothesis of "no association between the two variables." When the p value is very low (typically < 0.05 or 0.01) we reject the null hypothesis. Details on how to calculate the p value are provided for the interested reader in Appendix 1. We find that the p value for our example is very small (<< 0.001). In other words, the probability that we would have observed a correlation at least as strong as rP = -0.91, when in fact the true correlation between myocardial infarct volume and ejection fraction was ρP = 0, is very small, much less than 0.0001. Therefore, we reject the null hypothesis H0: ρP = 0 and conclude that there is an association between myocardial infarct volume and ejection fraction.
Confidence interval: The hypothesis testing approach limits us to a single hypothesis, which is often artificially set up. Rather than simply concluding that the population correlation coefficient is not 0, we might want to say a little more about the strength of the correlation. A confidence interval is more informative in that it gives us the range of possible values of ρP that are compatible with the observed value of the correlation coefficient. Details of the calculation of the confidence interval are given in Appendix 1. The 95% confidence interval for the correlation coefficient between myocardial infarct volume and ejection fraction is (-0.96 to -0.81). If our hypothetical study were repeated several times and a confidence interval calculated each time, then 95% of the confidence intervals would capture the true value of ρP. However, we cannot say if the interval obtained from our sample is one of the 95% that capture the true value of ρP (see [3] for more details on how to interpret a confidence interval). The 95% confidence interval may also be interpreted as the range of values of the null hypothesis (ρ0) that cannot be rejected at the 1 - 0.95 = 0.05 level of significance. The fact that our 95% confidence interval does not include 0 means that the null hypothesis of ρ0 = 0 would be rejected, which is the same conclusion we reached earlier using the p value. A better approach would be to compare the confidence interval with a predetermined range of values indicative of no relation between the variables. For example, let us say that a correlation coefficient in the range from -0.1 to 0.1 is in practice indicative of no relation between myocardial infarct volume and ejection fraction. Then the fact that our confidence interval clearly lies outside this region leads us to conclude there is a strong, negative relation between myocardial infarct volume and ejection fraction.
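A confidence interval of this kind can be sketched in a few lines of Python. The example below uses the Fisher z transformation, a common approach that is assumed here and may differ in detail from the method in Appendix 1. The function names `fisher_ci` and `excludes_region` and the sample size of 30 are hypothetical; the sample size of the study is not stated in the text, so the interval printed is not a reproduction of the one reported above.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)                       # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)             # approximate SE on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

def excludes_region(interval, low=-0.1, high=0.1):
    """True if the interval lies entirely outside the 'no relation' range."""
    lower, upper = interval
    return upper < low or lower > high

# r = -0.91 is taken from the text; n = 30 is an assumed sample size
interval = fisher_ci(-0.91, 30)
print(interval, excludes_region(interval))
```

Comparing the interval with a prespecified range such as -0.1 to 0.1, rather than with the single value 0, mirrors the approach recommended in the paragraph above.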
Partial Correlation
It is possible that the observed correlation between two variables (X and Y) may be due in part to a third variable (Z) that is related to both of these variables. When this third confounding variable is also observed, we may be interested in estimating the correlation between X and Y after eliminating the effect of their correlation with Z. For example, in a study of liver lesion characterization using three diagnostic tests (sonography, CT, and MRI), the Pearson's correlation coefficients between the accuracies of the different diagnostic tests were as shown in the following equations:
![]() |
![]() |
![]() |
Clearly, all three methods are correlated with each other. What is the
correlation between the diagnostic performance of sonography and MRI alone,
after eliminating the effect of the correlation that both have with CT? To
estimate this, we can calculate a partial correlation coefficient. The partial
correlation between X and Y after having eliminated the
effect of a third variable Z is given by:
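$$r_{XY.Z} = \frac{r_P(X,Y) - r_P(X,Z)\, r_P(Y,Z)}{\sqrt{\left[1 - r_P(X,Z)^2\right]\left[1 - r_P(Y,Z)^2\right]}}$$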
If Z is not a confounding variable, one or both of rP (X,Z) and rP (Y,Z) would be 0 or very small. In such a situation, the partial correlation between X and Y (rXY.Z) would be similar to the Pearson's correlation coefficient between them (rP [X,Y]).
The partial correlation coefficient between performance in sonography and
MRI in our example is shown in these equations (where US =
sonography):
![]() |
Thus, after eliminating the contribution of CT, we find that the strong relation between sonography and MRI vanishes. Moreover, it appears that the direction of the relation changes as well, suggesting that after removing the contribution of CT, lesions that are accurately diagnosed with sonography in fact are poorly diagnosed with MRI and vice versa.
This concept can be extended to calculate the partial correlation between two variables after adjusting for the effect of two or more variables. Multiple regression, which is discussed later in this article, can be used for the same purpose and is more straightforward to perform using commonly available statistical software packages.
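The effect of adjusting for a third variable can be illustrated with the standard first-order partial correlation formula, in which the product of the two correlations with Z is subtracted from the correlation between X and Y and the result is rescaled. The sketch below is in Python; the function name `partial_correlation` and the three input correlations are hypothetical and are not the values from the liver lesion example.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between X and Y after removing the effect of Z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical pairwise correlations: X and Y are both strongly related to Z
adjusted = partial_correlation(r_xy=0.6, r_xz=0.8, r_yz=0.8)

# If Z is unrelated to X and Y, the adjustment changes nothing
unadjusted = partial_correlation(r_xy=0.5, r_xz=0.0, r_yz=0.0)
print(adjusted, unadjusted)
```

With these assumed inputs, a moderately strong positive correlation becomes slightly negative after adjustment, the same kind of reversal described for sonography and MRI.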
Spearman's Rank Correlation
Spearman's rank correlation, which we denote by rS, is another statistic used for measuring the correlation between a pair of variables. It is called a nonparametric measure and is preferred when assumptions required for calculating Pearson's correlation coefficient are violated, that is, when X and/or Y are not measured on an interval scale, or when X and/or Y do not follow a normal probability distribution. To calculate Spearman's correlation coefficient, we need to assign a rank to the individual values of X and Y; that is, we sort each of X and Y in increasing order and assign them ranks so that the smallest observation has a rank of 1 and the largest observation has a rank of N. The expression for Spearman's correlation coefficient is similar to that for Pearson's correlation coefficient, except that xi and yi are replaced by rank(xi) and rank(yi) as follows:
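$$r_S = \frac{\sum_{i=1}^{N} \left[\mathrm{rank}(x_i) - \overline{\mathrm{rank}(x)}\right]\left[\mathrm{rank}(y_i) - \overline{\mathrm{rank}(y)}\right]}{\sqrt{\sum_{i=1}^{N} \left[\mathrm{rank}(x_i) - \overline{\mathrm{rank}(x)}\right]^2 \sum_{i=1}^{N} \left[\mathrm{rank}(y_i) - \overline{\mathrm{rank}(y)}\right]^2}}$$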
Spearman's correlation coefficient ranges between -1 and 1, with these extreme values indicating a perfect negative or positive relationship, respectively, between X and Y. It takes the value 0 when there is no relation between the variables (Figs. 2A, 2B, 2C, and 2D). An advantage of Spearman's correlation coefficient over Pearson's correlation coefficient is that it can be used to evaluate a nonlinear relation between variables when the direction of the relationship does not change. In Figure 2E, where Y continuously increases with X, we see that the perfect nonlinear relationship between the variables is captured by Spearman's correlation coefficient, although not by Pearson's correlation coefficient. However, like rP, rS is inappropriate for measuring the strength of a nonlinear relationship that both increases and decreases, such as the U-shaped relation in Figure 2F.
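The difference between the two coefficients for a monotone but nonlinear relation can be sketched by applying Pearson's formula to the ranks. The Python example below uses made-up data without ties (tied values would need averaged ranks, which this sketch does not handle); the function names are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks from 1 (smallest) to N (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_r(x, y):
    """Spearman's rank correlation: Pearson's formula applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: Y always increases with X, but not along a straight line
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(xi) for xi in x]

r_p = pearson_r(x, y)
r_s = spearman_r(x, y)
print(r_p, r_s)
```

Because ranking discards the spacing between values, rS reaches 1 for any strictly increasing relation, whereas rP reaches 1 only for a straight line.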
Regression is a broad area to which this article provides but a brief introduction. Greater detail on estimation and inference for linear and logistic regression is covered in introductory biostatistics textbooks [7-9]. More complex topics, such as regression model diagnostics, variable selection, and logistic regression for ordinal variables, are covered in greater depth in advanced textbooks [10-13].
Simple Linear Regression
Like Pearson's correlation coefficient, simple linear regression is also
used to characterize linear relationships between variables. It is
distinguished from multiple variable linear regression (discussed later) in
that it involves only two variables, the outcome or dependent variable and the
predictor or independent variable. The standard form of the simple linear
regression equation is as follows:
$$Y = \alpha + \beta X + \varepsilon$$
where X and Y are the observed values of the predictor and the outcome variables, respectively. The parameters α and β are called the intercept and the slope, respectively. For a given value of X, the predicted value of Y is α + βX. The term ε, the residual (or error), is the difference between the observed value of Y and the predicted value of Y. The intercept and slope parameters are estimated with the aim of reducing this difference. The estimated values of the intercept and slope are denoted by a and b, respectively. An important assumption of the linear regression model is that the residuals follow a normal distribution with mean 0 and a variance σ², which remains constant for all values of X. These assumptions imply that for a given value of X, the error in predicting the outcome is 0 on the average. Moreover, the magnitude of the error is not associated with X.
$$\text{Predicted ejection fraction (\%)} = 70 - 3.6 \times \text{myocardial infarct volume (mL)}$$
(see the solid line in Fig. 4).
The intercept of the regression model is equal to the predicted value of the outcome when the predictor variable is 0. This parameter is of interest only in those situations in which 0 lies within the plausible range of X values. Figure 4 shows that when the myocardial infarct volume is 0 mL, the ejection fraction is predicted to be equal to the intercept, or 70%. The slope of the regression model is the change in the outcome corresponding to a unit change in the predictor variable. A slope of 0 indicates that no linear relation exists between the predictor and outcome variables. From Figure 4, we see that when the myocardial infarct volume increases by 1 mL, the predicted value of the ejection fraction changes by an amount equal to the slope, -3.6; that is, it decreases by 3.6 percentage points.
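The interpretation of the intercept and slope can be made concrete with a small Python sketch that uses the two estimates quoted above (an intercept of 70% and a slope of -3.6 per mL). The function name `predict_ef` is hypothetical, and the infarct volumes passed to it are arbitrary illustrative values.

```python
INTERCEPT = 70.0   # predicted ejection fraction (%) at an infarct volume of 0 mL
SLOPE = -3.6       # change in predicted ejection fraction per 1 mL increase

def predict_ef(volume_ml):
    """Predicted ejection fraction (%) for a given infarct volume (mL)."""
    return INTERCEPT + SLOPE * volume_ml

at_zero = predict_ef(0)                      # equals the intercept
at_five = predict_ef(5)                      # 70 - 3.6 * 5
per_ml = predict_ef(3) - predict_ef(2)       # equals the slope
print(at_zero, at_five, per_ml)
```

Predictions of this kind are reasonable only within the range of infarct volumes actually observed in the sample; the straight line need not hold outside that range.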
Selecting the "best-fitting" line: We need an objective criterion to help us estimate α and β so that we have a best-fitting straight line. As explained earlier, we would like to use the regression equation to predict the outcome variable using the predictor variable. Clearly, we would like to do so in a way that minimizes the error in prediction (i.e., results in the lowest possible residual), εi, for each patient. We use a criterion that minimizes the sum of the squared residual terms:
$$\sum_{i=1}^{N} \varepsilon_i^2 = \sum_{i=1}^{N} \left(y_i - \alpha - \beta x_i\right)^2$$
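This is known as the method of least squares. The expressions for the estimated values of the intercept and the slope obtained using the method of least squares are as follows:
$$b = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$
where
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$$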
(See the table in Appendix 2 for an illustrative example of how to calculate a and b for a smaller sample of five patients. Notice that much of the calculation involves the terms already used in the calculation of Pearson's correlation coefficient.) In addition to a and b, we also obtain an estimate for the SE (i.e., square root of the variance) of the residuals, which we denote by s:
$$s = \sqrt{\frac{\sum_{i=1}^{N} \left(y_i - a - b x_i\right)^2}{N - 2}}$$
For our example, the SE of the residuals is given by s = 3.53. This tells us that the average error in predicting the ejection fraction by the myocardial infarct volume is about 3.53%. This error is quite small when compared with the range of ejection fraction values (roughly 40-70%), suggesting that our regression equation has a good predictive ability on average.
The residual SE, s, can be used to obtain estimates of the SEs of a and b and of the predicted value of the outcome variable using the formulae given in Appendix 2. These SEs can be used to perform inferences for these parameters via hypothesis tests or confidence intervals. In our example we find that the confidence interval for the slope of the regression line is (-4.3% to -2.9%). Because this interval does not include 0, we can conclude that there is an association between myocardial infarct volume and ejection fraction.
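The least squares calculations can be sketched in Python for a small sample. The five data points below are invented for illustration and are not the five patients in the Appendix 2 table; the function name `least_squares` is likewise an assumption. The residual SE is computed with N - 2 in the denominator, the usual choice for a model with two estimated parameters.

```python
import math

def least_squares(x, y):
    """Return the intercept a, slope b, and residual SE s of a fitted line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                    # slope
    a = mean_y - b * mean_x          # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return a, b, s

# Hypothetical infarct volumes (mL) and ejection fractions (%)
volume = [2, 4, 6, 8, 10]
ejection_fraction = [63, 55, 50, 40, 34]

a, b, s = least_squares(volume, ejection_fraction)
print(a, b, s)
```

The sums of squares and cross-products computed inside the function are the same quantities that appear in Pearson's correlation coefficient, which is why the two calculations share so much work.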
Model diagnostics: After having obtained the intercept and slope of a regression model, we need to verify whether the basic assumptions on which the model was built were satisfied. We need to evaluate whether the residuals follow a normal probability distribution, whether the variance of the residuals is constant for all values of X, and whether the relation between Y and X is linear. All of these assumptions can be verified using the following simple plots of the residuals.
Normal probability plot: A normal probability plot is used to verify whether the residuals follow a normal probability distribution. Most standard statistical software packages can be used to produce this plot. Figure 5A illustrates the ideal situation, in which the residuals do indeed follow a normal distribution and we observe a straight line along the diagonal of the plot. Any departure of the residuals from a normal distribution will show up as a deviation from this straight line. Figure 5B illustrates a case in which the residuals are skewed to the right and we observe a curved line below the diagonal. A possible corrective measure for this problem is to model the natural logarithm of the outcome instead of the outcome itself.
Scatterplot of residuals versus X: Figures 6A, 6B, and 6C are prototype scatterplots of the residuals versus the predictor variable, X. In Figure 6A, we have the ideal situation, in which the model is appropriate. The residuals are randomly scattered about the value of 0 for the entire range of X. Furthermore, the residuals fall in a horizontal band of equal width for the entire range of X, meaning that they have a constant variance. In Figure 6B, we have a situation in which the residuals indicate that the relation between outcome and predictor is nonlinear. We find that values of X that are close to its minimum or maximum are associated with positive residuals, whereas values of X in the middle of its range are associated with negative residuals. The parabolic relation between the residuals and X in this plot suggests that Y is in fact a quadratic function of X; that is, Y is a function of both X and X². In Figure 6C, we see an increase in the magnitude of the residuals with increasing X. This tells us that our assumption of a constant variance has been violated. As a result, the prediction of the outcome is better for lower values of X than for higher values.
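The residual pattern described for a nonlinear relation can be reproduced with a short Python sketch: a straight line is fitted to data that are in fact quadratic, and the signs of the residuals are inspected. The data are artificial and the function name `fit_line` is an illustrative assumption.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Artificial data in which Y is a quadratic function of X
x = list(range(9))
y = [(xi - 4) ** 2 for xi in x]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Residual signs across the range of X, from smallest to largest X
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)
```

Positive residuals at both ends of the range of X and negative residuals in the middle are the numerical counterpart of the parabolic pattern in a residual plot, and suggest adding an X² term to the model.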
Multiple Variable Linear Regression
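$$\text{GFR} = \alpha + \beta_1(\text{age}) + \beta_2(\text{weight}) + \beta_3(\text{sex}) + \beta_4\left(\frac{1}{\text{serum creatinine}}\right) + \varepsilon$$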
As in the case of the simple linear regression model, the unknown parameters α, β1, β2, β3, and β4 are estimated with the objective of minimizing the sum of the squared residuals (i.e., the sum of the squared differences between the observed GFR values for each patient and the predicted values according to the regression model). We do not present the expressions for calculating the different coefficients and their confidence intervals because these are cumbersome, requiring knowledge of matrix theory. Moreover, most widely available statistical software programs can calculate these quantities. We focus instead on the interpretation of the model.
Table 1 presents the results from a hypothetical study relating the GFR to the predictor variables mentioned here among 100 patients with ages ranging from 40 to 60 years, weight ranging from 40 to 100 kg, and serum creatinine levels between 180 and 200 µmol/L. The intercept is the predicted value of the outcome in the event that all predictor variables are equal to 0. This quantity is of interest only when it is possible for all predictor variables in the model to be simultaneously equal to 0. In the example in Table 1, the intercept is not of interest because the values age = 0, weight = 0, and 1 / serum creatinine = 0 are not possible. The regression coefficients (estimates of the β parameters) corresponding to continuous predictors are interpreted as the change in the outcome variable for a unit change in the predictor variable while the remaining predictor variables are held constant. This means that among a group of patients with a common weight, sex, and serum creatinine, an increase of 1 year in a patient's age is associated with a decrease in the GFR of 0.06 mL/min.
Ordinal and nominal predictor variables: When including nominal predictors (e.g., variables such as sex or country of origin that have no natural ordering) or ordinal predictors (e.g., age measured in 5-year categories) in a regression model, we need to create what are called "dummy variables" or "indicator variables." To do this, we identify one of the categories of the predictor as a reference category. In the case of ordinal variables, the reference category is typically the lowest category. For example, if age is a three-category ordinal variable having values 61-65 years, 66-70 years, and 71-75 years, the 61-65 year category could be selected as the reference. In the case of nominal variables, where there is no clear ordering of the categories, any category may be arbitrarily selected as the reference. Once the reference category has been determined, we create indicator variables corresponding to each of the remaining categories of the predictor. Each indicator variable takes the value of 1 if a patient is in the category to which it corresponds or 0 otherwise. Because three categories were defined for the variable age, this means we need to create two indicator variables: one would take the value 1 for patients in the 66-70 year category, and the second would take the value 1 for patients in the 71-75 year category. Both indicator variables are added to the regression model as predictors.
In the example for GFR, the only noncontinuous predictor is sex. The category "male" was regarded as the reference category. Thus, the variable "sex" is an indicator for the female sex. It takes the value 1 if the patient is female and 0 if the patient is male. The regression coefficient corresponding to sex tells us that after adjusting for the effect of other predictor variables, female patients have a GFR that is 2.60 mL/min lower than that of male patients.
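The construction of indicator variables for the three-category age example can be sketched in Python as follows. The category labels come from the text; the function name `indicator_variables` and its arguments are hypothetical.

```python
AGE_CATEGORIES = ["61-65", "66-70", "71-75"]   # ordered categories (years)
REFERENCE = "61-65"                            # lowest category as reference

def indicator_variables(category, categories=AGE_CATEGORIES, reference=REFERENCE):
    """Return one 0/1 indicator for each non-reference category."""
    if category not in categories:
        raise ValueError(f"unknown category: {category}")
    return [1 if category == level else 0
            for level in categories if level != reference]

# Three hypothetical patients, one in each age category
for age_group in AGE_CATEGORIES:
    print(age_group, indicator_variables(age_group))
```

A patient in the reference category has all indicators equal to 0, so each regression coefficient for an indicator compares its category with the reference category.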
Inference for regression coefficients: Along with regression coefficients, we can report confidence intervals that give an idea of the uncertainty in estimating them. If the confidence interval corresponding to a predictor variable does not include 0, we conclude that it is statistically significant. Alternatively, we could perform a hypothesis test based on the t distribution and report a p value that tells us the probability of observing an estimate at least as far from 0 as ours if the true value of the regression coefficient is 0. If the p value is smaller than a predetermined level of significance (typically 0.05 or 0.01), we reject the null hypothesis that the regression coefficient is equal to 0. If there are k parameters in a model, the p value is obtained from the tables of the t distribution with N - k degrees of freedom (df), where N is the sample size and k is the number of estimated regression parameters, including the intercept. In our example, we can deduce from the 95% confidence intervals that the regression coefficients corresponding to both 1 / serum creatinine and sex are significantly different from 0, and those corresponding to age and weight are not. A similar conclusion is obtained on the basis of the p values.
Model fit: The R² statistic introduced earlier can also be used to evaluate model fit for multiple variable linear regression models. The R² statistic is defined as the proportion of the variance in the outcome variable explained by the regression model. It ranges between 0% and 100%, with values closer to 100% indicating a better model fit. In our example for predicting GFR from age, weight, sex, and serum creatinine level, the R² statistic was quite low, meaning that the information obtained explained only 21% of the observed variation in GFR. A low value of R² is not unusual in real-life applications.
Model selection: When we have several candidate predictor variables, we are often faced with the challenge of choosing between different models that are based on different predictors. Besides assessing the fit of a model, the R² statistic may also be used to compare two different models for the same outcome. Table 2 lists R² values for different candidate multiple regression models with GFR as the outcome. The model with the highest value of R² (that is, the model that best explains the observed variation in GFR) is the model with all four predictor variables included simultaneously. In interpreting these results, it must be noted that the R² statistic is influenced by the number of predictor variables in the model. Notice that in Table 2 the R² statistic increases with every additional predictor added to the model. Thus, when comparing two models, the R² statistic may simply favor the model with the greater number of predictors.
Besides the R² statistic, several other criteria have been proposed for model selection. One such criterion is the Bayesian information criterion (BIC). This criterion assesses model fit while simultaneously applying a penalty for every additional predictor added. Our interest is not in the actual value of the BIC for a given model, but rather the difference in the BIC between two models. The lower the BIC, the better the fit of the model. From Table 2, we see that according to the BIC, adding age to the model worsens the model fit. Although criteria such as the R² and BIC may be used to assess model fit, the choice of which predictor variables go into a model depends also on their clinical relevance, their impact on the magnitude of regression coefficients associated with the remaining predictors, and their statistical significance.
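The two criteria can be sketched in Python for a comparison between a model with no predictor (intercept only) and a simple linear regression model. The text does not give a formula for the BIC, so the sketch assumes one common form for least squares models with normally distributed residuals, N ln(RSS / N) + k ln(N), where RSS is the residual sum of squares and k is the number of estimated parameters. The data and all function names are hypothetical and unrelated to Table 2.

```python
import math

def fit_line(x, y):
    """Least squares intercept and slope for a straight line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope

def residual_ss(y, fitted):
    """Sum of squared differences between observed and fitted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def r_squared(y, fitted):
    """Proportion of the variation in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - residual_ss(y, fitted) / total_ss

def bic(y, fitted, n_params):
    """Assumed BIC form: N * ln(RSS / N) + k * ln(N); lower is better."""
    n = len(y)
    return n * math.log(residual_ss(y, fitted) / n) + n_params * math.log(n)

# Hypothetical predictor and outcome values
x = [2, 4, 6, 8, 10]
y = [63, 55, 50, 40, 34]

fitted_null = [sum(y) / len(y)] * len(y)        # intercept-only model
a, b = fit_line(x, y)
fitted_line = [a + b * xi for xi in x]          # simple linear regression

r2_null, r2_line = r_squared(y, fitted_null), r_squared(y, fitted_line)
bic_null = bic(y, fitted_null, n_params=1)
bic_line = bic(y, fitted_line, n_params=2)
print(r2_null, r2_line, bic_null, bic_line)
```

Only the difference between the two BIC values is meaningful here; the penalty term k ln(N) is what keeps the criterion from automatically favoring the model with more predictors.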
Model validation: An important way to evaluate a model is to use it to predict the outcome in a data set that is independent of the one used to fit the regression model. This step is referred to as "model validation." Repeating the study to collect new data may not always be a feasible option because of the cost and time involved. Instead, if we have a sufficiently large sample, we may choose to split the data set into two parts: a model-building or training data set that is used to estimate the regression coefficients, and a validation data set. This is known as cross-validation [11]. The model-building data set needs to be sufficiently large to obtain the required precision in estimating the regression coefficients. If this is not possible with half the data, the model-building data set may be larger than the validation data set.
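The split-sample procedure can be sketched in Python as follows. The data are simulated under assumed values (a true intercept of 70, a slope of -3.6, and a residual SD of 3.5, with 40 patients), and the function names, seeds, and the use of the root mean squared prediction error as the summary are all illustrative choices rather than part of the original example.

```python
import math
import random

def fit_line(x, y):
    """Least squares intercept and slope for a straight line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope

def split_sample(n, train_fraction, seed):
    """Randomly split the indices 0..n-1 into training and validation sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n))
    return indices[:cut], indices[cut:]

# Simulated data under assumed parameter values
rng = random.Random(1)
volume = [rng.uniform(0, 10) for _ in range(40)]
ef = [70 - 3.6 * v + rng.gauss(0, 3.5) for v in volume]

train_idx, valid_idx = split_sample(len(volume), train_fraction=0.5, seed=2)

# Estimate the coefficients on the model-building (training) data only
a, b = fit_line([volume[i] for i in train_idx], [ef[i] for i in train_idx])

# Evaluate the predictions on the validation data
errors = [ef[i] - (a + b * volume[i]) for i in valid_idx]
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
print(a, b, rmse)
```

Raising `train_fraction` above 0.5 corresponds to the situation described above in which the model-building data set is made larger than the validation data set.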
Confounding and effect modification: A multiple linear regression model allows us to study the relation between a primary predictor, X (e.g., the experimental treatment), and the outcome, Y, while adjusting for the effect of one or more secondary predictor variables (e.g., the patient's demographic characteristics). For illustration, we will consider only one secondary predictor, Z, but the concepts discussed here can be extended to the case of more than one secondary predictor. A variable Z is said to be a confounder if it is associated with both X and Y. The true relation between Y and X is not determined by Z. However, not including Z in the regression model results in an incorrect estimate of magnitude or direction of the regression coefficient of X. A variable Z is said to be an effect modifier if it affects the magnitude of the association between Y and X. To determine if Z is an effect modifier, we must add both Z and the product XZ to the regression model between Y and X. It is possible for a variable to be both a confounder and an effect modifier.
![]() |
where the predictor "sex" is an indicator variable for female sex.
Figure 7C illustrates the case when sex is an effect modifier of the relation between weight and bone density; that is, the strength of the association between weight and bone density is modified by the variable sex. This means the regression lines between bone density and weight among men and women have different slopes (see Fig. 7D). In our hypothetical example, bone density increases more rapidly with weight among women than among men. We can evaluate whether sex is an effect modifier using a single multiple variable regression model that includes weight, sex, and their product as predictors, as follows:
![]() |
From this single equation we can determine the different associations between bone density and weight among men and women. By setting sex = 0 in this equation, we find that the regression coefficient associated with weight is 0.2, the same as was obtained by fitting a separate linear regression model among men. Similarly, when setting sex = 1 in the equation, we find that the regression coefficient associated with weight is 0.2 - 0.15 = 0.05 mass/volume units, which is the same as the regression coefficient obtained when fitting the model among women alone. If the regression coefficient corresponding to the product term is significantly different from 0, we conclude that there is an interaction between weight and sex.
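The way a product term yields a separate weight coefficient for each sex can be sketched in Python. The sketch assumes the two values implied by the paragraph above, a weight coefficient of 0.2 and a product-term coefficient of -0.15; the intercept and the coefficient for sex are not needed for this calculation. The function name `weight_slope` is hypothetical.

```python
B_WEIGHT = 0.2          # coefficient for weight (applies when sex = 0)
B_WEIGHT_X_SEX = -0.15  # coefficient for the product of weight and sex

def weight_slope(sex):
    """Coefficient for weight at a given value of the sex indicator (0 or 1)."""
    return B_WEIGHT + B_WEIGHT_X_SEX * sex

slope_when_sex_is_0 = weight_slope(0)
slope_when_sex_is_1 = weight_slope(1)
print(slope_when_sex_is_0, slope_when_sex_is_1)
```

If the product-term coefficient were 0, the two slopes would coincide and the regression lines for the two groups would be parallel.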
Logistic Regression
Logistic regression, like linear regression, can be used to relate a single outcome variable to one or more predictor variables. However, the outcome variable is dichotomous, having only two values (e.g., success or failure of an experimental treatment, survival or death at the end of a 10-year follow-up). One value of the dichotomous outcome variable must be designated as the outcome of interest, for example, success when the outcome has the values success or failure, or death if the outcome has the values death or survival. The odds of the outcome of interest are given by the ratio of the probability of observing the outcome of interest to the probability of not observing it: probability of success / probability of failure, or probability of death / probability of survival. The logistic regression equation relates the logarithm of the odds of the outcome to the predictor variables.
In a hypothetical study, logistic regression was used to predict extremely high breast density on mammography using information on a woman's parity (i.e., number of children), body mass index (BMI), and age. Extremely high breast density was defined as a dichotomous variable taking the value 1 when a woman's breast density was greater than or equal to 75%, and taking the value 0 when a woman's breast density was less than 75%. The resulting multiple logistic regression equation had the following form:
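$$\ln\left(\frac{P(\text{EHBD} = 1)}{P(\text{EHBD} = 0)}\right) = \alpha + \beta_1(\text{nulliparous}) + \beta_2(\text{BMI}) + \beta_3(\text{age})$$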
where ln is the logarithm to the natural base e and EHBD is extremely high breast density.
The predictor variables in a logistic regression equation may be continuous, nominal, or ordinal. As in the case of multiple linear regression, nominal and ordinal predictor variables are entered into the equation as indicator variables. In the logistic regression equation for extremely high breast density, BMI and age are both continuous variables, and nulliparous is an indicator that the woman is nulliparous.
The best estimates for the unknown parameters α, β1, β2, and β3 may be
obtained by a statistical method known as maximum likelihood. This method
helps us identify the most likely value of the true parameters given the
observed data and under the assumption that the number of patients with the
outcome of interest follows a binomial distribution [2].
The relation between each predictor variable and the outcome in a logistic regression model is expressed in terms of an odds ratio (for more about odds ratios see the article by Blackmore and Cummings [4] in this series). When the predictor variable is ordinal or nominal, the odds ratio is a comparison between each indicator variable and the reference category. An odds ratio of 1 indicates there is no difference in the odds of the outcome of interest between the category associated with the indicator variable and the reference category. An odds ratio greater (less) than 1 indicates the outcome of interest is more (less) likely in the category associated with the indicator variable than in the reference category. Results for the extremely high breast density example are given in Table 3. The odds ratio of 5.53 corresponding to nulliparous tells us that the odds of extremely high breast density are (5.53 - 1) × 100 = 453% greater among women who are nulliparous compared with those who are not. For a continuous predictor variable, the odds ratio gives the relative increase (or decrease) in the odds of the outcome for a change of one unit of the predictor variable. For example, in Table 3, the odds ratio of 0.85 corresponding to BMI means that for a unit increase in the BMI, a woman's odds of extremely high breast density decrease by (1 - 0.85) × 100 = 15%. The odds ratios for all predictor variables are obtained by exponentiating the corresponding regression coefficients.
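The arithmetic linking regression coefficients, odds ratios, and percentage changes in the odds can be sketched as follows. The two odds ratios are the ones quoted in the text from Table 3; the coefficients are back-calculated from them here, so they are derived values rather than figures reported in the article, and the function name is illustrative.

```python
import math


def percent_change_in_odds(odds_ratio: float) -> float:
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


# Odds ratios quoted in the text from Table 3.
odds_ratios = {"nulliparous": 5.53, "BMI (per unit)": 0.85}

for name, or_value in odds_ratios.items():
    # Coefficient implied by the odds ratio, since OR = exp(coefficient).
    beta = math.log(or_value)
    assert math.isclose(math.exp(beta), or_value)
    print(f"{name}: coefficient = {beta:.3f}, OR = {or_value:.2f}, "
          f"change in odds = {percent_change_in_odds(or_value):+.0f}%")
```

A negative coefficient corresponds to an odds ratio below 1 and therefore to a decrease in the odds, as for BMI in this example.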
We can test whether each regression coefficient is different from 0 using a
chi-square test with 1 df. By comparing the chi-square p values in
Table 3 with the traditional significance level of α = 0.05, we conclude
that the predictors nulliparous and BMI are statistically significantly
associated with an extremely high breast density. Alternatively, we can report
a confidence interval for the odds ratio. If the confidence interval does not
include 1, then the predictor is considered statistically significant. If the
confidence interval includes 1, as in the case of the predictor age in
Table 3, we conclude that it is
not significantly associated with the outcome.
As in linear regression, a logistic regression model can also be used to determine whether a particular predictor variable is a confounder or effect modifier. The fit of a logistic regression model may be assessed using the BIC or a statistic similar to the R² statistic.
General Concepts for Sample Size Calculation
Whatever the parameter of interest, certain concepts remain common to the
exercise of sample size calculation.
First, the sample size calculation requires a guess value for the parameter of interest (e.g., correlation coefficient or the slope of a regression model) and parameters of its probability distribution (e.g., SE of the slope). This is rather paradoxical because the goal of the study is to find out more about this parameter. However, some reasonable range of guess values for the parameter can usually be found from the literature.
Second, identify a clinically meaningful range of values for this parameter.
Sample Size for Pearson's Correlation Coefficient
Assume we want to perform a study the goal of which is to measure the
correlation between ratings of two experienced radiologists on a series of
mammograms. Based on an earlier pilot study, our guess value for the
correlation coefficient is ρP = 0.85. A sufficiently
high correlation is deemed to be in the order of 0.8-0.9. Any value less than
this is considered poor correlation. Ideally, we would like our research study
to unequivocally determine whether the true correlation between the reviewers
is sufficiently high. This means we would like our sample size to be large
enough to ensure that the confidence interval lies entirely within or below
the range 0.8-0.9; that is, the half-width of the confidence interval (or
precision of our estimate) should be a maximum of 0.85 - 0.8 = 0.9 - 0.85 =
0.05. The calculation of the confidence interval requires the transformation
of the correlation coefficient, ρP, into
ZP = 0.5 ln[(1 + ρP)/(1 - ρP)]
(see Appendix 1). Therefore, we need to determine the maximum permissible
value of the confidence interval half-width on the transformed scale. To do
this, we transform both the guess value of the correlation coefficient and the
lower end of the confidence interval and calculate their difference. The
maximum permissible half-width of the transformed confidence interval, w, is
given by
w = 0.5 ln[(1 + 0.85)/(1 - 0.85)] - 0.5 ln[(1 + 0.80)/(1 - 0.80)] = 1.2562 - 1.0986 = 0.1576.
The sample size required to obtain a 100(1 - α)% confidence interval is
then calculated as
N = (Z1-α/2 / w)² + 3,
where Z1-α/2 is the (1 - α/2) quantile of the
standard normal distribution. Thus, to obtain a 95% confidence interval for
our study, we would need a sample size of approximately
N = (1.96/0.1576)² + 3 ≈ 158.
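The calculation described in this section can be sketched in Python using only the standard library. This is a minimal illustration under two assumptions: that the SD of the transformed coefficient is 1/√(N - 3), as given in Appendix 1, and that the result is rounded up to the next whole patient. The function names are our own.

```python
import math
from statistics import NormalDist


def fisher_z(rho: float) -> float:
    """Transform a correlation coefficient: 0.5 * ln[(1 + rho) / (1 - rho)]."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def sample_size_for_correlation(rho_guess: float, rho_limit: float,
                                confidence: float = 0.95) -> int:
    """Sample size so that the confidence interval around rho_guess
    does not extend past rho_limit (both on the correlation scale)."""
    # Maximum permissible half-width on the transformed scale.
    half_width = abs(fisher_z(rho_guess) - fisher_z(rho_limit))
    # (1 - alpha/2) quantile of the standard normal distribution.
    z_quantile = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    # Half-width = z_quantile / sqrt(N - 3), solved for N and rounded up.
    return math.ceil((z_quantile / half_width) ** 2 + 3)


n_required = sample_size_for_correlation(rho_guess=0.85, rho_limit=0.80)
print(f"Required sample size: {n_required}")
```

The required sample size grows quickly as the permissible half-width shrinks, because the half-width enters the formula as a squared reciprocal.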
Sample Size for the Slope of a Simple Linear Regression Model
Sample size calculation for the simple linear regression model typically
focuses on determining whether the slope is different from 0. The required
sample size can be obtained using the same approach as that given in this
article for the correlation coefficient, by exploiting the fact that a slope
of 0 in a simple linear regression equation is equivalent to a correlation of
0 between the predictor and outcome variables. Suppose we plan to study the
relation between renal length as measured by sonography (predictor) and GFR
(outcome) via simple linear regression. Suppose also that a smaller pilot
study of the relation between these variables had reported a correlation
coefficient of 0.3 (-0.2 to 0.8). To conclusively show a relation between the
two variables, we would like the confidence interval to lie within 0.1-0.5
(i.e., to eliminate 0). The required sample size can be calculated using the
methods described earlier for Pearson's correlation coefficient.
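Applied to the renal length example, the same approach might look like the sketch below. It assumes that the binding constraint is whichever end of the target range 0.1-0.5 lies closer to the guess value of 0.3 on the transformed scale; this choice, and the function name, are ours rather than the article's.

```python
import math
from statistics import NormalDist


def required_n(rho_guess: float, lower: float, upper: float,
               confidence: float = 0.95) -> int:
    """Sample size so that the confidence interval around rho_guess
    stays inside (lower, upper) on the correlation scale."""
    # math.atanh(r) equals 0.5 * ln[(1 + r) / (1 - r)].
    z_guess = math.atanh(rho_guess)
    # Use the smaller of the two distances on the transformed scale.
    half_width = min(z_guess - math.atanh(lower), math.atanh(upper) - z_guess)
    z_quantile = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil((z_quantile / half_width) ** 2 + 3)


n_renal = required_n(rho_guess=0.3, lower=0.1, upper=0.5)
print(f"Required sample size: {n_renal}")
```

Under these assumptions the lower limit of 0.1 is the binding one, which matches the use of the lower end of the interval in the correlation example.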
APPENDIX 1. Inference for Pearson's Correlation Coefficient (rP)
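To test a hypothesis about the population correlation coefficient ρP, the sample correlation coefficient rP is first transformed into
ZP = 0.5 ln[(1 + rP)/(1 - rP)],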
where ln is the natural logarithm. This transformation is required
because even though X and Y may follow a normal
distribution, rP does not. However, ZP
is known to follow a normal distribution with a standard deviation
σZ = 1/√(n - 3),
making the calculation of the p value and confidence intervals easier. The remaining steps involved in calculating a p value are explained in the box below.
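Compute the test statistic z = (ZP - Z0)/σZ, where Z0 is the transformed value of ρ0, the correlation coefficient under the null hypothesis.
The rule for estimating the p value depends on the alternative hypothesis HA as follows, where Z denotes a variable following the standard normal distribution (see [3] for more on hypothesis testing):
When HA: ρP < ρ0, the p value is P(Z ≤ z).
When HA: ρP > ρ0, the p value is P(Z ≥ z).
When HA: ρP ≠ ρ0, the p value is P(|Z| ≥ |z|).
The p value is calculated by comparing the test statistic with the tables of the normal distribution. Typically, if the p value is less than a predetermined level of significance, such as 0.05 or 0.01, the null hypothesis is rejected in favor of the alternative.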
Recall that in our example of myocardial infarct volume and ejection
fraction, the correlation coefficient for the entire sample of n = 30
patients was rP = -0.91. To estimate the evidence in favor
of the hypothesis "there is no relation between myocardial infarct
volume and ejection fraction" (that is, H0: ρP = 0), we begin by
calculating the test statistic. First transform rP into
ZP = 0.5 ln[(1 - 0.91)/(1 + 0.91)] = -1.53.
Then transform ρ0 into
Z0 = 0.5 ln[(1 + 0)/(1 - 0)] = 0.
Finally, calculate the SD of ZP as
σZ = 1/√(30 - 3) = 0.19.
Using these three quantities, the test statistic can now be calculated as
z = (ZP - Z0)/σZ = (-1.53 - 0)/0.19 = -8.05.
The evidence in favor of the null hypothesis against an alternative hypothesis
of "there is a relation between myocardial infarct volume and ejection
fraction" (that is, HA: ρP ≠ 0) is equal to P(|Z| ≥ |-8.05|). This
is the probability that a variable following a standard normal distribution is
less than -8.05 or greater than 8.05. From the normal distribution tables, we
find that this probability is less than 0.0001. See module 10 in this series
[2] for an explanation of how
to use the tables of the normal distribution.
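The steps of this test can be sketched in Python as follows. The function names are illustrative, and the two-sided p value is obtained from the complementary error function instead of printed tables. Because the code carries unrounded intermediate values, the statistic comes out near -7.94 instead of the -8.05 obtained from the rounded values above; the conclusion is unchanged.

```python
import math


def fisher_z(r: float) -> float:
    """Transform a correlation coefficient: 0.5 * ln[(1 + r) / (1 - r)]."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def correlation_z_test(r: float, n: int, rho_0: float = 0.0):
    """Test statistic and two-sided p value for H0: rho = rho_0."""
    sigma_z = 1.0 / math.sqrt(n - 3)          # SD of the transformed coefficient
    z = (fisher_z(r) - fisher_z(rho_0)) / sigma_z
    # P(|Z| >= |z|) for a standard normal Z, written with erfc for accuracy
    # in the far tails.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


z_stat, p_value = correlation_z_test(r=-0.91, n=30)
print(f"z = {z_stat:.2f}, two-sided p = {p_value:.2e}")
```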
Confidence interval
As in the case of the p value, to construct a confidence interval
for ρP we first need to transform
rP into ZP. The upper
(uZ) and lower (lZ) limits of
the 100(1 - α)% confidence interval on the transformed scale are given by
(lZ = ZP - Z1-α/2 σZ, uZ = ZP + Z1-α/2 σZ), where
σZ is the previously defined SD of
ZP, and Z1-α/2 is the
(1 - α/2) quantile of the standard normal distribution. The
latter is the point below which the area under the normal distribution curve
is equal to 1 - α/2. We then retransform these limits to obtain the
100(1 - α)% confidence interval for ρP as (l
= [exp(2lZ) - 1]/[exp(2lZ) + 1],
u = [exp(2uZ) - 1]/[exp(2uZ)
+ 1]). In our example of myocardial infarct volume and ejection fraction, we
can use the previously calculated values of ZP
and σZ to obtain a 95% confidence interval on the
transformed scale as (lZ = -1.53 - 1.96[0.19],
uZ = -1.53 + 1.96[0.19]) = (-1.90 to -1.16). The value
Z1-α/2 = 1.96 is obtained from the normal
distribution table. On retransformation, we obtain the limits of the 95%
confidence interval for ρP as
l = [exp(2[-1.90]) - 1]/[exp(2[-1.90]) + 1] = -0.96,
u = [exp(2[-1.16]) - 1]/[exp(2[-1.16]) + 1] = -0.82.
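The confidence interval calculation can be sketched in Python as follows, using the example values rP = -0.91 and n = 30. The function name is illustrative, and the normal quantile is computed instead of being read from a table.

```python
import math
from statistics import NormalDist


def correlation_confidence_interval(r: float, n: int, confidence: float = 0.95):
    """Confidence interval for a correlation coefficient via the Z transformation."""
    z_p = 0.5 * math.log((1.0 + r) / (1.0 - r))   # transformed coefficient
    sigma_z = 1.0 / math.sqrt(n - 3)              # SD on the transformed scale
    z_quantile = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)

    # Limits on the transformed scale.
    l_z = z_p - z_quantile * sigma_z
    u_z = z_p + z_quantile * sigma_z

    # Retransform each limit: [exp(2z) - 1] / [exp(2z) + 1].
    def back(z: float) -> float:
        return (math.exp(2.0 * z) - 1.0) / (math.exp(2.0 * z) + 1.0)

    return back(l_z), back(u_z)


lower, upper = correlation_confidence_interval(r=-0.91, n=30)
print(f"95% confidence interval: ({lower:.2f} to {upper:.2f})")
```

Note that the interval is not symmetric around -0.91 on the original scale, which is a consequence of the retransformation.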
APPENDIX 2. Inference for the Simple Linear Regression Model
Typically, we are more interested in the slope than in the intercept. A
natural null hypothesis of interest is H0: β = 0. The SE of the
slope in our example is given by sb = 0.32. See the table
in this appendix for an illustration of how to calculate
sb in a smaller sample of five patients. Note the results
there are slightly different from those in this section because they are based
on a different sample. Using the formula in the box, the test statistic can be
calculated as
tb = (b - 0)/sb = -11.38.
As in the case of the correlation coefficient, the p value that we
report depends on the direction of the alternative hypothesis. If the
alternative hypothesis was HA: β ≠ 0, then the p
value is given by P(|tN-2| ≥ |tb|); that
is, the probability that the standard t distribution with N
- 2 = 28 degrees of freedom (df) takes values less than or equal to
-|tb| = -11.38 or greater than or equal to
|tb| = 11.38. (Recall N = our
sample size of 30. See [3] for more
details on the t distribution.) Looking up the t
distribution tables corresponding to N - 2 = 30 - 2 = 28 df,
we find that this probability is less than 0.001. Because this probability is
much less than the traditional significance levels of 0.05 or 0.01, we reject
the null hypothesis and conclude that there is a relation between ejection
fraction and myocardial infarct volume.
Alternatively, we could construct a 95% confidence interval for the slope.
As mentioned previously, this is more informative than simply reporting
whether we did or did not reject a single null hypothesis. The term
"t1-α/2, N-2" in the formula above denotes the
1 - α/2 quantile of the t distribution with 28 df (i.e.,
the point on the standard t distribution below which there is a
1 - α/2 probability). For a 95% confidence interval, we have α =
1 - 0.95 = 0.05. The value of t1-α/2, N-2 = t0.975,28 = 2.05. For
our example, we have already calculated b = -3.6% and
sb = 0.32. Thus, the 95% confidence interval is given by
b ± t0.975,28 sb = -3.6 ± 2.05(0.32) = (-4.26% to -2.94%).
This interval gives us an idea of the range of values of the slope that is compatible with the data and cannot be rejected by a hypothesis test. Because the interval does not include 0, we can conclude that there is a negative relation between ejection fraction and myocardial infarct volume.
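The confidence interval for the slope can be reproduced with a short Python sketch. It takes the tabulated quantile t0.975,28 = 2.05 from the text as an input, because the Python standard library has no t distribution; the function names are illustrative. Note that the rounded values b = -3.6 and sb = 0.32 give a test statistic of about -11.25, so the -11.38 reported above was presumably computed from unrounded estimates.

```python
def t_statistic(b: float, se_b: float, beta_0: float = 0.0) -> float:
    """Test statistic for H0: slope = beta_0."""
    return (b - beta_0) / se_b


def confidence_interval(estimate: float, se: float, t_quantile: float):
    """Interval of the form estimate +/- t_quantile * se."""
    margin = t_quantile * se
    return estimate - margin, estimate + margin


# Values quoted in the text for the ejection fraction example.
b = -3.6          # estimated slope (% ejection fraction per mL of infarct volume)
se_b = 0.32       # SE of the slope
T_975_28 = 2.05   # 0.975 quantile of the t distribution with 28 df, from tables

lower, upper = confidence_interval(b, se_b, T_975_28)
print(f"t statistic from rounded values: {t_statistic(b, se_b):.2f}")
print(f"95% confidence interval for the slope: ({lower:.2f} to {upper:.2f})")
```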
For a given value of myocardial infarct volume, our simple linear regression model may also be used to predict the ejection fraction for an average patient or to predict the ejection fraction for an individual patient. The SEs for the predicted mean ejection fraction and for an individual's ejection fraction are as follows:
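SE for predicted mean outcome at x
sM,x = s √[1/N + (x - x̄)²/Σ(xi - x̄)²]
SE for predicted individual outcome at x
sI,x = s √[1 + 1/N + (x - x̄)²/Σ(xi - x̄)²]
where s is the SD of the residuals, x̄ is the mean of the predictor variable in the sample, and the sum is taken over all N patients.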
Notice that these two SEs are very similar except for the fact that an
additional 1 appears in the term under the square root for the SE of the
predicted outcome for an individual. This causes the SE of the predicted
outcome for a single individual to always be greater than the SE of the
predicted outcome for an average individual. This is because of the additional
variance of the individual outcomes above the average outcome. In our example,
sM,2 = 1.14, and sI,2 = 3.71. The predicted value of the outcome
when the predictor is equal to x is denoted by ŷx. The predicted average
ejection fraction corresponding to a myocardial infarct volume of 2 mL
(denoted by ŷ2) can be
calculated using the regression equation as 70 - 3.6(2) = 62.8%. The
expression for a 100(1 - α)% confidence interval for the average ejection
fraction is
ŷx ± t1-α/2, N-2 sM,x.
Recall that we had determined from the tables of the t
distribution that t0.975,28 is 2.05. Thus, the 95%
confidence interval for the predicted mean ejection fraction when myocardial
infarct volume = 2 mL is given by
ŷ2 ± t0.975,28 sM,2 = 62.8 ± 2.05(1.14) = (60.5% to 65.1%).
The confidence interval for an individual's ejection fraction when
myocardial infarct volume is 2 mL is obtained by replacing the SE in
this expression by sI,x; that is, by
ŷ2 ± t0.975,28 sI,2 = 62.8 ± 2.05(3.71) = (55.2% to 70.4%).
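The two intervals can be compared side by side with the following Python sketch. It uses the regression equation and the SEs quoted in the text (sM,2 = 1.14 and sI,2 = 3.71) together with the tabulated quantile t0.975,28 = 2.05; the function names are illustrative.

```python
def predicted_ejection_fraction(infarct_volume_ml: float) -> float:
    """Regression equation from the example: 70 - 3.6 * infarct volume (mL)."""
    return 70.0 - 3.6 * infarct_volume_ml


def interval(center: float, se: float, t_quantile: float):
    """Interval of the form center +/- t_quantile * se."""
    margin = t_quantile * se
    return center - margin, center + margin


T_975_28 = 2.05       # 0.975 quantile of the t distribution with 28 df, from tables
SE_MEAN = 1.14        # SE of the predicted mean outcome at 2 mL
SE_INDIVIDUAL = 3.71  # SE of the predicted individual outcome at 2 mL

y_hat = predicted_ejection_fraction(2.0)
mean_ci = interval(y_hat, SE_MEAN, T_975_28)
individual_ci = interval(y_hat, SE_INDIVIDUAL, T_975_28)

print(f"Predicted ejection fraction at 2 mL: {y_hat:.1f}%")
print(f"95% interval for the mean: ({mean_ci[0]:.1f}% to {mean_ci[1]:.1f}%)")
print(f"95% interval for an individual: "
      f"({individual_ci[0]:.1f}% to {individual_ci[1]:.1f}%)")
```

The interval for an individual patient is more than three times as wide as the interval for the mean, reflecting the additional patient-to-patient variability around the regression line.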