
Chapter 5-21.  Regression Post Tests

 

In this chapter, we will discuss statistical comparisons made after one or more regression models have been fitted.  These are called post tests, and can be used to combine regression coefficients in the same model, compare coefficients across models, and compare model fit across models.
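As a preview, the three Stata tools used throughout this chapter are the test command (Wald tests on coefficients), suest (combining estimates from separate models), and lrtest (likelihood ratio tests of model fit).  A minimal sketch of each, using hypothetical variables y, x1, x2, and a two-level group variable:

* post tests previewed with hypothetical variables (each is
* demonstrated on real data later in this chapter)
regress y x1 x2
test x1 = x2                       // Wald test within one model

regress y x1 if group==1
estimates store m1
regress y x1 if group==2
estimates store m2
suest m1 m2                        // combine the two fitted models
test [m1_mean]x1 = [m2_mean]x1     // Wald test across models

regress y x1
estimates store reduced
regress y x1 x2
estimates store full
lrtest full reduced                // likelihood ratio test of model fit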

 

Simultaneous Comparison of Groups

 

We did this in Chapter 6, and will cover it quickly here.  Opening the rmr dataset in Stata,

 

File

Open

Find the directory where you copied the course CD

Change to the subdirectory datasets & do-files

Single click on  rmr.dta    

                    Open

 

use "C:Documents and Settingsu0032770.SRVRDesktop

Biostats & Epi With Statadatasets & do-files

rmr.dta", clear

 

*    which must be all on one line, or use:

cd "C:Documents and Settingsu0032770.SRVRDesktop"

cd "Biostats & Epi With Statadatasets & do-files"

use rmr, clear

 

This dataset comes from Nawata et al. (2004) [on course CD].  The data were taken from the authors' Figure 1, a scatterplot, and so only approximate the actual values used by the authors.

 

File rmr.dta Codebook

group   urinary excretion of albumin group (U-Alb)

a = U-Alb < 30 mg/d

b = 30 mg/d ≤ U-Alb ≤ 300 mg/d

c = 300 mg/d < U-Alb

lbm      lean body mass (kg)

rmr      resting metabolic rate (kJ/h/m2)

 

The group variable is a string variable with values “a”, “b”, and “c”.

Creating indicator variables and checking our work,

 



 

gen groupa = cond(group=="a",1,0)

gen groupb = cond(group=="b",1,0)

gen groupc = cond(group=="c",1,0)

tab group groupa

tab group groupb

tab group groupc

 

 

           |        groupa
     group |         0          1 |     Total
-----------+----------------------+----------
         a |         0         12 |        12
         b |        10          0 |        10
         c |        12          0 |        12
-----------+----------------------+----------
     Total |        22         12 |        34


           |        groupb
     group |         0          1 |     Total
-----------+----------------------+----------
         a |        12          0 |        12
         b |         0         10 |        10
         c |        12          0 |        12
-----------+----------------------+----------
     Total |        24         10 |        34


           |        groupc
     group |         0          1 |     Total
-----------+----------------------+----------
         a |        12          0 |        12
         b |        10          0 |        10
         c |         0         12 |        12
-----------+----------------------+----------
     Total |        22         12 |        34

 

We see that the indicator variables were correctly set up.
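As an aside, the indicators can also be handled automatically with factor variables; a sketch, assuming Stata 11 or later syntax:

* equivalent model without hand-made indicators: encode makes a
* numeric copy of the string variable (this same encode command
* appears again before the anova command later in this chapter),
* and i.group2 tells regress to create the indicators itself
encode group, generate(group2)
regress rmr i.group2 lbm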

 

We want to test the hypothesis that the three group means are the same:

H0: μa = μb = μc

Even better, we want to do this adjusting for lbm, which is a potential confounding variable:

H0: μa = μb = μc , where each μ is a group mean of rmr adjusted for lbm

Fitting the linear regression model, with group a as the referent,

 

Statistics

Linear models and related

Linear regression

Dependent variable: rmr

Independent variables: groupb groupc lbm

OK

 

regress rmr groupb groupc lbm

 

      Source |       SS       df       MS              Number of obs =      34
-------------+------------------------------           F(  3,    30) =    9.18
       Model |  8798.55426     3  2932.85142           Prob > F      =  0.0002
    Residual |  9588.38691    30  319.612897           R-squared     =  0.4785
-------------+------------------------------           Adj R-squared =  0.4264
       Total |  18386.9412    33  557.180036           Root MSE      =  17.878

------------------------------------------------------------------------------
         rmr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      groupb |  -3.410606   7.654802    -0.45   0.659     -19.0438    12.22258
      groupc |   29.91667   7.305324     4.10   0.000      14.9972    44.83613
         lbm |   .5454562   .3431357     1.59   0.122    -.1553203    1.246233
       _cons |   112.2196   15.87451     7.07   0.000     79.79954    144.6397
------------------------------------------------------------------------------
 

We see that there is a significant difference between group c and group a while controlling for lbm.  The adjusted means implied by these coefficients are:

 

Group a = _cons = 112.2196

Group b = _cons + groupb = 112.2196 - 3.410606 = 108.80899

Group c = _cons + groupc = 112.2196 + 29.91667 = 142.13627
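Each of these combinations, along with a confidence interval, can also be requested with the lincom post-estimation command; a quick check, assuming the regress command above is still the active estimation:

* adjusted group means (at lbm = 0) as linear combinations
lincom _cons            // group a
lincom _cons + groupb   // group b
lincom _cons + groupc   // group c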

 

We can get these adjusted group means, setting lbm = 0, using

 

Statistics

Postestimation

Adjusted means and proportions

Main tab:  Compute and display predictions for each level of variables:  group

Variables to be set to their overall mean value:  <leave blank this time>

Variables to be set to a specified value:  Variable: lbm    Value: 0

Options tab:  Confidence interval or prediction intervals

OK

 

adjust lbm=0, by(group) xb ci

 


-------------------------------------------------------------------------------
     Dependent variable: rmr     Command: regress
     Variables left as is: groupa, groupb, groupc
     Covariate set to value: lbm = 0
-------------------------------------------------------------------------------

----------------------------------------------
     group |         xb          lb          ub
----------+-----------------------------------
         a |     112.22    [79.7995     144.64]
         b |    108.809    [76.0153    141.603]
         c |    142.136    [109.108    175.165]
----------+-----------------------------------
     Key:  xb         =  Linear Prediction
           [lb , ub]  =  [95% Confidence Interval]
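In Stata 11 or later, the adjust command is superseded by margins; a sketch of the equivalent request, assuming the factor-variable model from the aside above has been fit:

* same adjusted means via margins (Stata 11 or later),
* after fitting:  regress rmr i.group2 lbm
margins group2, at(lbm=0)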

 

 

To get a single p value for the group comparison, if all we want to report is that there is a significant difference somewhere among the groups, we use what we know about how the coefficients combine,

 

Group a = _cons = 112.2196

Group b = _cons + groupb = 112.2196 - 3.410606 = 108.80899

Group c = _cons + groupc = 112.2196 + 29.91667 = 142.13627

 

and then construct the following post-estimation Wald test,

 

test _cons = _cons+groupb = _cons+groupc

 

( 1) - groupb = 0

( 2) - groupc = 0

 

F(  2,    30) =   12.10

Prob > F =    0.0001

 

We can now report the three groups are not equal after adjusting for lbm (p<.001).
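Since the constraints reduce to groupb = 0 and groupc = 0, the same joint test can be requested more directly; a sketch, again assuming the regression above is the active estimation:

* equivalent joint Wald test of the two indicator coefficients
test groupb groupc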

 

What we just tested is the group main effect in an analysis of covariance,

encode group, generate(group2) // create numeric group variable

anova rmr group2 lbm, continuous(lbm)

 

Number of obs =      34     R-squared     =  0.4785

Root MSE      = 17.8777     Adj R-squared =  0.4264

 

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  8798.55426     3  2932.85142       9.18     0.0002
             |
      group2 |   7734.4293     2  3867.21465      12.10     0.0001
         lbm |  807.629752     1  807.629752       2.53     0.1224
             |
    Residual |  9588.38691    30  319.612897
  -----------+----------------------------------------------------
       Total |  18386.9412    33  557.180036

 

 

 

In the authors' statistical methods section, they stated they used analysis of variance (ANOVA) and analysis of covariance (ANCOVA).

 

Although there are special routines in Stata to fit them more easily, analysis of variance (ANOVA) is just a linear regression with categorical predictor variables, and analysis of covariance (ANCOVA) is just a linear regression with both categorical variables and continuous variables.  In ANOVA terminology, categorical variables are called factors and continuous variables are called covariates.
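To see the equivalence directly, the sketch below fits the same ANCOVA both ways, assuming Stata 11 or later factor-variable syntax and the numeric group2 variable created by encode above; the group F statistics match.

* ANCOVA two ways: anova command vs. plain linear regression
anova rmr group2 c.lbm       // factor group2, covariate lbm
regress rmr i.group2 c.lbm   // the identical model as a regression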

 

 


Likelihood Ratio Test For Model Fit

 

We did this in Chapter 24, and will cover it again quickly here.  Using the 11.2.Isoproterenol.dta dataset provided with the Dupont (2002, p.338) textbook, described as,

 

“Lang et al. (1995) studied the effect of isoproterenol, a β-adrenergic agonist, on forearm blood flow in a group of 22 normotensive men. Nine of the study subjects were black and 13 were white.  Each subject’s blood flow was measured at baseline and then at escalating doses of isoproterenol.”

 

Reading the data in,

 

use "C:Documents and Settingsu0032770.SRVRDesktop

Biostats & Epi With Statadatasets & do-files

11.2.Isoproterenol.dta", clear

 

*    which must be all on one line, or use:

cd "C:Documents and Settingsu0032770.SRVRDesktop"

cd "Biostats & Epi With Statadatasets & do-files"

use 11.2.Isoproterenol, clear

 

 

To get a p value to test whether a random slope is needed (in other words, whether the random slope improves the goodness of fit), we compare the log likelihoods between a model without it and a model with it included,

 

reshape long fbf , i(id) j(dose)

*

xtmixed fbf dose || id: , mle cov(unstructured)

estimates store modelA // store model estimates

xtmixed fbf dose || id: dose , mle cov(unstructured)

estimates store modelB  // store model with added term estimates

lrtest modelB modelA

display "LR Chi-square = " -2*(-476.95216 –(-437.38499))

 

 

 

. xtmixed fbf dose || id: , mle cov(unstructured)
Note: single-variable random-effects specification; covariance structure set to identity

Log likelihood = -476.95216                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
         fbf |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   .0352574   .0027581    12.78   0.000     .0298517    .0406631
       _cons |   4.966772   1.245969     3.99   0.000     2.524718    7.408827
------------------------------------------------------------------------------
 

. xtmixed fbf dose || id: dose , mle cov(unstructured)

Log likelihood = -437.38499                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
         fbf |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   .0356267   .0049976     7.13   0.000     .0258315    .0454219
       _cons |   4.929394   .6680961     7.38   0.000      3.61995    6.238838
------------------------------------------------------------------------------
 

. lrtest modelB modelA

 

Likelihood-ratio test                                  LR chi2(2)  =     79.13

(Assumption: modelA nested in modelB)                  Prob > chi2 =    0.0000

 

. display "LR Chi-square = " -2*(-476.95216 -(-437.38499))

LR Chi-square = 79.13434

 

 

The likelihood ratio test statistic (LR chi2 = 79.13, p < 0.0001) informs us that adding the random slope improved the fit.  The display command was not needed, but was included just to show that this statistic is -2 × the difference of the two model log likelihoods.

 

You could use this likelihood ratio test to get a p value for any variable in the model.  It is generally only slightly more powerful than the Wald test p value shown in the regression table, frequently changing the p value only in the second or third decimal place.
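For example, a likelihood ratio test for the dose fixed effect could be set up the same way; a sketch, assuming the reshaped data are still in memory (both models must be fit by maximum likelihood for lrtest to apply):

* LR test of the fixed effect of dose
xtmixed fbf || id: , mle        // reduced model, dose omitted
estimates store reduced
xtmixed fbf dose || id: , mle   // full model with dose
estimates store full
lrtest full reduced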

 

 

 

Comparison of Regression Coefficients Within The Same Model

 

The regression coefficients within the same model can be compared using a post test.  This might be useful, for example, if you wanted to compare the coefficients for different study sites.

 

First, read in some hypothetical data in the do-file editor,

 

clear

input y x1 x2 site

5 7 6 1

6 3 4 1

7 6 8 1

8 7 6 2

7 5 7 2

8 6 7 2

7 5 8 3

4 3 4 3

6 5 5 3

end

list , sepby(site)

 

     +--------------------+
     | y   x1   x2   site |
     |--------------------|
  1. | 5    7    6      1 |
  2. | 6    3    4      1 |
  3. | 7    6    8      1 |
     |--------------------|
  4. | 8    7    6      2 |
  5. | 7    5    7      2 |
  6. | 8    6    7      2 |
     |--------------------|
  7. | 7    5    8      3 |
  8. | 4    3    4      3 |
  9. | 6    5    5      3 |
     +--------------------+

     

For this illustration, we will not concern ourselves with overfitting (too few subjects for the number of predictors in the model).

Although it is not necessary, if you include all three site indicators with the "nocons" (no constant) option, then each coefficient is the mean value of its study site.  Fitting such a linear regression model,

tab site , gen(site) // create site indicator variables
regress y site1 site2 site3, nocons

     

     


------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       site1 |          6   .6382847     9.40   0.000     4.438174    7.561826
       site2 |   7.666667   .6382847    12.01   0.000      6.10484    9.228493
       site3 |   5.666667   .6382847     8.88   0.000      4.10484    7.228493
------------------------------------------------------------------------------
     

Some possible site comparisons that might be of interest are:

test site1 = site2 = site3  // no difference among the sites
*
test site1 = site2 // pairwise site comparisons
test site1 = site3
test site2 = site3

     

. test site1 = site2 = site3  // no difference among the sites

 ( 1)  site1 - site2 = 0
 ( 2)  site1 - site3 = 0

       F(  2,     6) =    2.82
            Prob > F =    0.1371

. *
. test site1 = site2 // pairwise site comparisons

 ( 1)  site1 - site2 = 0

       F(  1,     6) =    3.41
            Prob > F =    0.1144

. test site1 = site3

 ( 1)  site1 - site3 = 0

       F(  1,     6) =    0.14
            Prob > F =    0.7246

. test site2 = site3

 ( 1)  site2 - site3 = 0

       F(  1,     6) =    4.91
            Prob > F =    0.0686

     

The two-tailed p values are the Prob > F values shown above.

     

The p values will be slightly different from the same pairwise t tests, but either approach is valid.  An advantage of the approach shown here, however, is that you can test the pairwise differences while controlling for confounding variables, something which cannot be done with t tests, as sketched below.
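For instance, a sketch of an adjusted pairwise comparison, treating x1 in these hypothetical data as a stand-in confounder:

* pairwise site comparison adjusted for a covariate
regress y site1 site2 site3 x1, nocons
test site1 = site2   // site difference adjusted for x1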

     

Protocol Suggestion

The association between y and study site will be estimated using linear regression.  To determine if the study sites differ, and which ones, the regression coefficients for specific study sites will be compared using a post-estimation Wald test (Judge et al, 1985, 20-28).

The Judge citation is given in the Stata-11 Base Reference, p.1919, which you can refer to using the Stata help menu (PDF Documentation).


Comparison of Regression Coefficients From Separate Models

     

The regression coefficients computed from different models can be compared using a post test.  This might be useful, for example, if you wanted to compare the coefficients for different subgroups.

First, read in some hypothetical data in the do-file editor,

     

clear
input y x1 x2 site
5 7 6 1
6 3 4 1
7 6 8 1
8 7 6 1
7 5 7 1
8 6 7 2
7 5 8 2
4 3 4 2
6 5 5 2
8 4 3 2
end
list , sepby(site)

     

     +--------------------+
     | y   x1   x2   site |
     |--------------------|
  1. | 5    7    6      1 |
  2. | 6    3    4      1 |
  3. | 7    6    8      1 |
  4. | 8    7    6      1 |
  5. | 7    5    7      1 |
     |--------------------|
  6. | 8    6    7      2 |
  7. | 7    5    8      2 |
  8. | 4    3    4      2 |
  9. | 6    5    5      2 |
 10. | 8    4    3      2 |
     +--------------------+

     

We are interested in determining if the association between y and x1 is greater in Site 1 than in Site 2.

This can be done using,

regress y x1 if site==1 // fit 1st model
estimates store site1   // store estimates of 1st model
regress y x1 if site==2 // fit 2nd model
estimates store site2   // store estimates of 2nd model
suest site1 site2       // combine estimation results for testing
test [site1_mean]x1 = [site2_mean]x1  // test for equality of coef

     

     

. regress y x1 if site==1 // fit 1st model

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .1071429    .388504     0.28   0.801     -1.12925    1.343536
       _cons |          6   2.251983     2.66   0.076    -1.166816    13.16682
------------------------------------------------------------------------------

. estimates store site1   // store estimates of 1st model

. regress y x1 if site==2 // fit 2nd model

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |          1   .6201737     1.61   0.205    -.9736694    2.973669
       _cons |          2   2.922065     0.68   0.543    -7.299314    11.29931
------------------------------------------------------------------------------

. estimates store site2   // store estimates of 2nd model

. suest site1 site2       // combine estimation results for testing

Simultaneous results for site1, site2

                                                  Number of obs   =         10

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
site1_mean   |
          x1 |   .1071429   .2955498     0.36   0.717    -.4721241    .6864098
       _cons |          6   1.266053     4.74   0.000     3.518581    8.481419
-------------+----------------------------------------------------------------
site1_lnvar  |
       _cons |   .5250103   .3748764     1.40   0.161     -.209734    1.259755
-------------+----------------------------------------------------------------
site2_mean   |
          x1 |          1   .4134491     2.42   0.016     .1896546    1.810345
       _cons |          2   2.301987     0.87   0.385    -2.511812    6.511812
-------------+----------------------------------------------------------------
site2_lnvar  |
       _cons |   .6931472   .3944053     1.76   0.079     -.079873    1.466167
------------------------------------------------------------------------------

. test [site1_mean]x1 = [site2_mean]x1  // test for equality

 ( 1)  [site1_mean]x1 - [site2_mean]x1 = 0

           chi2(  1) =    3.09
         Prob > chi2 =    0.0789

     

In this example, the regression coefficient 0.1071429 for Site 1 was not quite significantly different from the regression coefficient 1 for Site 2 (p = 0.079).

The command suest stands for seemingly unrelated estimation.  It simply combines information across models for use with other commands, such as test.
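An alternative is to test the same slope difference as a site × x1 interaction in a single model; a sketch, assuming Stata 11 or later factor-variable syntax (unlike suest, this version assumes a common residual variance across sites):

* slope difference tested as an interaction in one model
regress y c.x1##i.site   // x1, site, and their interaction
test 2.site#c.x1         // Wald test of the slope difference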

     

Protocol Suggestion

The association between y and x1 will be estimated using linear regression, separately for the two study sites.  To determine if the association is stronger in one study site than the other, the regression coefficients estimated in the two models will be compared using a post-estimation Wald test (Judge et al, 1985, 20-28).

The Judge citation is given in the Stata-11 Base Reference, p.1919, which you can refer to using the Stata help menu (PDF Documentation).


Comparison of Two Correlation Coefficients

     

It is sometimes of interest to compare two correlation coefficients.  We can use the fact that when the outcome and predictor variables are first transformed into standardized scores, or z scores, the regression coefficient is identically the Pearson correlation coefficient.  The regression coefficients computed from the two standardized models can then be compared using a post test.

     

First, read in some hypothetical data in the do-file editor,

     

clear
input y x1 x2 site
5 7 6 1
6 3 4 1
7 6 8 1
8 7 6 1
7 5 7 1
8 6 7 2
7 5 8 2
4 3 4 2
6 5 5 2
8 4 3 2
end
list , sepby(site)

     

     +--------------------+
     | y   x1   x2   site |
     |--------------------|
  1. | 5    7    6      1 |
  2. | 6    3    4      1 |
  3. | 7    6    8      1 |
  4. | 8    7    6      1 |
  5. | 7    5    7      1 |
     |--------------------|
  6. | 8    6    7      2 |
  7. | 7    5    8      2 |
  8. | 4    3    4      2 |
  9. | 6    5    5      2 |
 10. | 8    4    3      2 |
     +--------------------+

Obtaining Pearson correlation coefficients,

     

pwcorr y x1 x2 , sig

             |        y       x1       x2
-------------+---------------------------
           y |   1.0000
             |
             |
          x1 |   0.3635   1.0000
             |   0.3018
             |
          x2 |   0.2914   0.6217   1.0000
             |   0.4140   0.0550

     

The correlation between x1 and y is r = 0.36 and between x2 and y is r = 0.29.  In this example, we are interested in demonstrating a difference between these two correlations, suggesting one variable is a better predictor than the other.

We will compare the two correlation coefficients using a post test following two linear regression models.  If, however, the two variables, x1 and x2, have different underlying scales, it does not make sense to compare the regression coefficients directly.  First, the variables must be converted to the same underlying scale, using standardized scores, or z scores.

     

egen zy=std(y)
egen zx1=std(x1)
egen zx2=std(x2)
sum

     

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           y |        10         6.6    1.349897          4          8
          x1 |        10         5.1    1.449138          3          7
          x2 |        10         5.8     1.75119          3          8
        site |        10         1.5    .5270463          1          2
          zy |        10   -1.19e-08           1  -1.926073   1.037116
-------------+--------------------------------------------------------
         zx1 |        10   -3.73e-09           1  -1.449138   1.311125
         zx2 |        10   -1.49e-09           1  -1.598913   1.256289

     

We see that the three z scores each have a mean of 0 and a standard deviation of 1, thus all are on the same underlying scale, which is how many standard deviation units each score is from its mean.

     

regress zy zx1      // compute 1st Pearson correlation
estimates store A   // store estimates of 1st model
regress zy zx2      // compute 2nd Pearson correlation
estimates store B   // store estimates of 2nd model
suest A B           // combine estimation results for testing
test [A_mean]zx1 = [B_mean]zx2  // test for equality

     

     

. regress zy zx1      // compute 1st Pearson correlation

------------------------------------------------------------------------------
          zy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         zx1 |    .363519   .3293657     1.10   0.302    -.3959996    1.123038
       _cons |  -1.06e-08   .3124637    -0.00   1.000    -.7205426    .7205426
------------------------------------------------------------------------------

. estimates store A   // store estimates of 1st model

. regress zy zx2      // compute 2nd Pearson correlation

------------------------------------------------------------------------------
          zy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         zx2 |   .2914172   .3382078     0.86   0.414    -.4884913    1.071326
       _cons |  -1.15e-08   .3208521    -0.00   1.000    -.7398862    .7398862
------------------------------------------------------------------------------

. estimates store B   // store estimates of 2nd model

. suest A B           // combine estimation results for testing

Simultaneous results for A, B

                                                  Number of obs   =         10

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
A_mean       |
         zx1 |    .363519   .3824863     0.95   0.342    -.3861403    1.113178
       _cons |  -1.06e-08   .2945936    -0.00   1.000    -.5773929    .5773928
-------------+----------------------------------------------------------------
A_lnvar      |
       _cons |  -.0239488   .3292348    -0.07   0.942    -.6692372    .6213395
-------------+----------------------------------------------------------------
B_mean       |
         zx2 |   .2914172   .3512637     0.83   0.407    -.3970469    .9798814
       _cons |  -1.15e-08   .3025022    -0.00   1.000    -.5928935    .5928935
-------------+----------------------------------------------------------------
B_lnvar      |
       _cons |   .0290349   .3150404     0.09   0.927     -.588433    .6465027
------------------------------------------------------------------------------

. test [A_mean]zx1 = [B_mean]zx2  // test for equality

 ( 1)  [A_mean]zx1 - [B_mean]zx2 = 0

           chi2(  1) =    0.06
         Prob > chi2 =    0.8130

     

We notice that the regression coefficients are now identically the Pearson correlation coefficients obtained above with pwcorr.

We see that the correlation between y and x1 (r = 0.36) is not significantly different from the correlation between y and x2 (r = 0.29) (p = 0.81).  So, although x1 appears to be the better predictor, based on an observed larger correlation coefficient, there was not sufficient information in this sample to demonstrate that statistically.

     

Protocol Suggestion

To test the difference in correlation of two variables, each correlated with a common third variable, we will use a regression model approach.  When the outcome and predictor variable are both converted to standardized scores, or z scores, the regression coefficient is identically the Pearson correlation coefficient.  These standardized regression coefficients, or Pearson correlation coefficients, will be compared using a linear regression post-estimation Wald test (Judge et al, 1985, 20-28).

The Judge citation is given in the Stata-11 Base Reference, p.1919, which you can refer to using the Stata help menu (PDF Documentation).


References

     

Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: a Simple Introduction to the Analysis of Complex Data. Cambridge, Cambridge University Press.

Gregoire AJP, Kumar R, Everitt BS, et al. (1996). Transdermal oestrogen for the treatment of severe post-natal depression. The Lancet 347:930-934.

Judge GG, Griffiths WE, Hill RC, Lütkepohl H, Lee T-C. (1985). The Theory and Practice of Econometrics. 2nd ed, New York, Wiley.

Nawata K, Sohmiya M, Kawaguchi M, et al. (2004). Increased resting metabolic rate in patients with type 2 diabetes mellitus accompanied by advanced diabetic nephropathy. Metabolism 53(11) Nov: 1395-1398.

Rabe-Hesketh S, Everitt B. (2003). A Handbook of Statistical Analyses using Stata. 3rd ed. New York, Chapman & Hall/CRC.