Chapter 5-21. Regression Post Tests
In this chapter, we will discuss statistical comparisons made after one or more regression models have been fitted. These are called post tests, and can be used to combine regression coefficients in the same model, compare coefficients across models, and compare model fit across models.
Simultaneous Comparison of Groups
We did this in Chapter 6, and will cover it quickly here. Opening the rmr dataset in Stata,
File
 Open
  Find the directory where you copied the course CD
  Change to the subdirectory datasets & do-files
  Single click on rmr.dta
  Open

use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\rmr.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop"
cd "Biostats & Epi With Stata\datasets & do-files"
use rmr, clear
This dataset comes from Nawata et al (2004)[on course CD]. The data were taken from the authors’ Figure 1, a scatterplot, and so only approximate the actual values used by the authors.
File rmr.dta codebook

group   urinary excretion of albumin group (U-Alb)
          a = U-Alb < 30 mg/d
          b = 30 mg/d ≤ U-Alb ≤ 300 mg/d
          c = 300 mg/d < U-Alb
lbm     lean body mass (kg)
rmr     resting metabolic rate (kJ/h/m2)
The group variable is a string variable with values “a”, “b”, and “c”.
Creating indicator variables and checking our work,
gen groupa = cond(group=="a",1,0)
gen groupb = cond(group=="b",1,0)
gen groupc = cond(group=="c",1,0)
tab group groupa
tab group groupb
tab group groupc
| groupa
group | 0 1 | Total
-----------+----------------------+----------
a | 0 12 | 12
b | 10 0 | 10
c | 12 0 | 12
-----------+----------------------+----------
Total | 22 12 | 34
| groupb
group | 0 1 | Total
-----------+----------------------+----------
a | 12 0 | 12
b | 0 10 | 10
c | 12 0 | 12
-----------+----------------------+----------
Total | 24 10 | 34
| groupc
group | 0 1 | Total
-----------+----------------------+----------
a | 12 0 | 12
b | 10 0 | 10
c | 0 12 | 12
-----------+----------------------+----------
Total | 22 12 | 34
We see that the indicator variables were correctly set up.
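As an aside, the tabulate command's generate() option creates the same indicator variables in one step. A minimal sketch, where Stata assigns the names g1, g2, g3 in the sort order of the group values, so the rename commands simply match the names used above:

tab group , gen(g)   // creates indicators g1, g2, g3 for groups a, b, c
rename g1 groupa
rename g2 groupb
rename g3 groupc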
We want to test the hypothesis that the three group means are the same:

    H0: μa = μb = μc

Even better, we want to do this adjusting for lbm, which is a potential confounding variable:

    H0: μa = μb = μc , where each μ is the group mean of rmr adjusted for lbm
Fitting the linear regression model, with group a as the referent,
Statistics
 Linear models and related
  Linear regression
   Dependent variable: rmr
   Independent variables: groupb groupc lbm
   OK

regress rmr groupb groupc lbm
Source | SS df MS Number of obs = 34
-------------+------------------------------ F( 3, 30) = 9.18
Model | 8798.55426 3 2932.85142 Prob > F = 0.0002
Residual | 9588.38691 30 319.612897 R-squared = 0.4785
-------------+------------------------------ Adj R-squared = 0.4264
Total | 18386.9412 33 557.180036 Root MSE = 17.878
rmr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
groupb | -3.410606 7.654802 -0.45 0.659 -19.0438 12.22258
groupc | 29.91667 7.305324 4.10 0.000 14.9972 44.83613
lbm | .5454562 .3431357 1.59 0.122 -.1553203 1.246233
_cons | 112.2196 15.87451 7.07 0.000 79.79954 144.6397
We see that there is a significant difference between group c and group a, while controlling for lbm. The adjusted means implied by these coefficients, with lbm set to 0, are:
Group a = _cons = 112.2196
Group b = _cons + groupb = 112.2196 – 3.410606 = 108.80899
Group c = _cons + groupc = 112.2196 + 29.91667 = 142.13627
We can get these adjusted group means directly, again setting lbm = 0, using
Statistics
 Postestimation
  Adjusted means and proportions
   Main tab: Compute and display predictions for each level of variables: group
             Variables to be set to their overall mean value: <leave blank this time>
             Variables to be set to a specified value: Variable: lbm  Value: 0
   Options tab: Confidence interval or prediction intervals
   OK

adjust lbm=0, by(group) xb ci
Dependent variable: rmr Command: regress
Variables left as is: groupa, groupb, groupc
Covariate set to value: lbm = 0
group | xb lb ub
----------+-----------------------------------
a | 112.22 [79.7995 144.64]
b | 108.809 [76.0153 141.603]
c | 142.136 [109.108 175.165]
Key: xb = Linear Prediction
[lb , ub] = [95% Confidence Interval]
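The same adjusted means and confidence intervals can also be obtained one at a time with the lincom post-estimation command, following the regress command:

lincom _cons + groupb   // adjusted mean for group b, with 95% CI
lincom _cons + groupc   // adjusted mean for group c, with 95% CI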
To get a single p value for the group comparison, if all we wanted to say was that there was a significant difference among the groups, we use what we know about combining the coefficients,
Group a = _cons = 112.2196
Group b = _cons + groupb = 112.2196 – 3.410606 = 108.80899
Group c = _cons + groupc = 112.2196 + 29.91667 = 142.13627
and then construct the following post-estimation Wald test,
test _cons = _cons+groupb = _cons+groupc
( 1) - groupb = 0
( 2) - groupc = 0
F( 2, 30) = 12.10
Prob > F = 0.0001
We can now report the three groups are not equal after adjusting for lbm (p<.001).
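Notice in the constraint listing that the intercept cancels out of each equality, leaving a joint test that both indicator coefficients are zero. The same 2 degree of freedom test can therefore be requested more simply:

test groupb groupc   // joint test that both indicator coefficients = 0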
What we just tested is the group main effect in an analysis of covariance,
encode group, generate(group2) // create numeric group variable
anova rmr group2 lbm, continuous(lbm)
Number of obs = 34 R-squared = 0.4785
Root MSE = 17.8777 Adj R-squared = 0.4264
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 8798.55426 3 2932.85142 9.18 0.0002
|
group2 | 7734.4293 2 3867.21465 12.10 0.0001
lbm | 807.629752 1 807.629752 2.53 0.1224
|
Residual | 9588.38691 30 319.612897
-----------+----------------------------------------------------
Total | 18386.9412 33 557.180036
In the authors' statistical methods section, they stated that they used analysis of variance (ANOVA) and analysis of covariance (ANCOVA).
Although there are special routines in Stata to fit them more easily, analysis of variance (ANOVA) is just a linear regression with categorical predictor variables, and analysis of covariance (ANCOVA) is just a linear regression with both categorical variables and continuous variables. In ANOVA terminology, categorical variables are called factors and continuous variables are called covariates.
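As an aside, in Stata 11 or later the same ANCOVA can be fitted without hand-made indicator variables by using factor-variable notation. A minimal sketch, using the numeric group2 variable created with encode above:

regress rmr i.group2 lbm   // i.group2 expands into indicator variables
testparm i.group2          // 2 df Wald test of the group main effect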
Likelihood Ratio Test For Model Fit
We did this in Chapter 24, and will cover it again quickly here. Using the 11.2.Isoproterenol.dta dataset provided with the Dupont (2002, p.338) textbook, described as,
“Lang et al. (1995) studied the effect of isoproterenol, a β-adrenergic agonist, on forearm blood flow in a group of 22 normotensive men. Nine of the study subjects were black and 13 were white. Each subject’s blood flow was measured at baseline and then at escalating doses of isoproterenol.”
Reading the data in,
use "C:Documents and Settingsu0032770.SRVRDesktop
Biostats & Epi With Statadatasets & do-files 11.2.Isoproterenol.dta", clear
* which must be all on one line, or use: cd "C:Documents and Settingsu0032770.SRVRDesktop" cd "Biostats & Epi With Statadatasets & do-files" use 11.2.Isoproterenol, clear |
To get a p value to test if a random slope is needed (in other words, the random slope improves the goodness of fit), we compare the log likelihoods between a model without it and a model with it included,
reshape long fbf , i(id) j(dose)
*
xtmixed fbf dose || id: , mle cov(unstructured)
estimates store modelA // store model estimates
xtmixed fbf dose || id: dose , mle cov(unstructured)
estimates store modelB // store estimates of model with added term
lrtest modelB modelA
display "LR Chi-square = " -2*(-476.95216 -(-437.38499))
. xtmixed fbf dose || id: , mle cov(unstructured)
Note: single-variable random-effects specification; covariance structure set to identity
Log likelihood = -476.95216 Prob > chi2 = 0.0000
fbf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dose | .0352574 .0027581 12.78 0.000 .0298517 .0406631
_cons | 4.966772 1.245969 3.99 0.000 2.524718 7.408827
. xtmixed fbf dose || id: dose , mle cov(unstructured)
Log likelihood = -437.38499 Prob > chi2 = 0.0000
fbf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dose | .0356267 .0049976 7.13 0.000 .0258315 .0454219
_cons | 4.929394 .6680961 7.38 0.000 3.61995 6.238838
. lrtest modelB modelA
Likelihood-ratio test LR chi2(2) = 79.13
(Assumption: modelA nested in modelB) Prob > chi2 = 0.0000
. display "LR Chi-square = " -2*(-476.95216 -(-437.38499))
LR Chi-square = 79.13434
The LR test statistic = 79.13, p < 0.0001, informs us that adding the random slopes improved the fit. The display command was not needed, but was used just to show that this statistic is -2 × the difference of the two model log likelihoods.
You could use this likelihood ratio test to get a p value for any variable in the model. It is only slightly more powerful than the Wald test p value shown in the regression table, frequently changing the p value only in the second or third decimal place.
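For example, a sketch of the same device applied to the fixed effect of dose, keeping the random slope in both models (the model names noDose and withDose are arbitrary):

xtmixed fbf || id: dose , mle cov(unstructured)       // model without the fixed effect of dose
estimates store noDose
xtmixed fbf dose || id: dose , mle cov(unstructured)  // model with it included
estimates store withDose
lrtest withDose noDose                                // 1 df LR test for the fixed effect of dose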
Comparison of Regression Coefficients Within The Same Model
The regression coefficients within the same model can be compared using a post test. This might be useful, for example, if you wanted to compare the coefficients for different study sites.
First, read in some hypothetical data in the do-file editor,
clear
input y x1 x2 site
5 7 6 1
6 3 4 1
7 6 8 1
8 7 6 2
7 5 7 2
8 6 7 2
7 5 8 3
4 3 4 3
6 5 5 3
end
list , sepby(site)
     +--------------------+
     | y   x1   x2   site |
     |--------------------|
  1. | 5    7    6      1 |
  2. | 6    3    4      1 |
  3. | 7    6    8      1 |
     |--------------------|
  4. | 8    7    6      2 |
  5. | 7    5    7      2 |
  6. | 8    6    7      2 |
     |--------------------|
  7. | 7    5    8      3 |
  8. | 4    3    4      3 |
  9. | 6    5    5      3 |
     +--------------------+
For this illustration, we will not concern ourselves with overfitting (too few subjects for the number of predictors in the model).
Although it is not necessary, if you add the “nocons” option, then the coefficient for each site indicator is that site's mean value of y. Fitting such a linear regression model,
tab site , gen(site) // create site indicator variables
regress y site1 site2 site3, nocons
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
site1 | 6 .6382847 9.40 0.000 4.438174 7.561826
site2 | 7.666667 .6382847 12.01 0.000 6.10484 9.228493
site3 | 5.666667 .6382847 8.88 0.000 4.10484 7.228493
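As a check, these coefficients are simply the raw site means of y, which can be verified with:

bysort site: summarize y   // site means 6, 7.67, and 5.67 match the coefficients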
Some possible site comparisons that might be of interest are:
test site1 = site2 = site3 // no difference among the sites
*
test site1 = site2 // pairwise site comparisons
test site1 = site3
test site2 = site3
. test site1 = site2 = site3 // no difference among the sites
( 1) site1 - site2 = 0
( 2) site1 - site3 = 0
F( 2, 6) = 2.82
Prob > F = 0.1371
. *
. test site1 = site2 // pairwise site comparisons
( 1) site1 - site2 = 0
F( 1, 6) = 3.41
Prob > F = 0.1144
. test site1 = site3
( 1) site1 - site3 = 0
F( 1, 6) = 0.14
Prob > F = 0.7246
. test site2 = site3
( 1) site2 - site3 = 0
F( 1, 6) = 4.91
Prob > F = 0.0686
The two-tailed p values are the Prob > F values shown in the output.
The p values will be slightly different from those of the corresponding pairwise t tests, but either approach is valid. An advantage of the approach shown here, however, is that you can test the pairwise differences while controlling for confounding variables, something which cannot be done with t tests.
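For example, a sketch of a covariate-adjusted pairwise comparison using the same hypothetical data, treating x1 as the confounder:

regress y site1 site2 site3 x1 , nocons  // site coefficients now adjusted for x1
test site1 = site2                       // pairwise site comparison adjusted for x1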
Protocol Suggestion
The association between y and study site will be estimated using linear regression. To determine if the study sites differ, and which ones, the regression coefficients for specific study sites will be compared using a post-estimation Wald test (Judge et al, 1985, 20-28).
The Judge citation is given in the Stata-11 Base Reference, p.1919, which you can refer to using the Stata help menu (PDF Documentation).
Comparison of Regression Coefficients From Separate Models
The regression coefficients computed from different models can be compared using a post test. This might be useful, for example, if you wanted to compare the coefficients for different subgroups.
First, read in some hypothetical data in the do-file editor,
clear
input y x1 x2 site
5 7 6 1
6 3 4 1
7 6 8 1
8 7 6 1
7 5 7 1
8 6 7 2
7 5 8 2
4 3 4 2
6 5 5 2
8 4 3 2
end
list , sepby(site)
     +--------------------+
     | y   x1   x2   site |
     |--------------------|
  1. | 5    7    6      1 |
  2. | 6    3    4      1 |
  3. | 7    6    8      1 |
  4. | 8    7    6      1 |
  5. | 7    5    7      1 |
     |--------------------|
  6. | 8    6    7      2 |
  7. | 7    5    8      2 |
  8. | 4    3    4      2 |
  9. | 6    5    5      2 |
 10. | 8    4    3      2 |
     +--------------------+
We are interested in determining if the association between y and x1 is greater in Site 1 than in Site 2.
This can be done using,
regress y x1 if site==1 // fit 1st model
estimates store site1 // store estimates of 1st model
regress y x1 if site==2 // fit 2nd model
estimates store site2 // store estimates of 2nd model
suest site1 site2 // combine estimation results for testing
test [site1_mean]x1 = [site2_mean]x1 // test for equality of coef
. regress y x1 if site==1 // fit 1st model
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 | .1071429 .388504 0.28 0.801 -1.12925 1.343536
_cons | 6 2.251983 2.66 0.076 -1.166816 13.16682
. estimates store site1 // store estimates of 1st model

. regress y x1 if site==2 // fit 2nd model
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 | 1 .6201737 1.61 0.205 -.9736694 2.973669
_cons | 2 2.922065 0.68 0.543 -7.299314 11.29931
. estimates store site2 // store estimates of 2nd model
. suest site1 site2 // combine estimation results for testing
Simultaneous results for site1, site2
Number of obs = 10
| Robust
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
site1_mean |
x1 | .1071429 .2955498 0.36 0.717 -.4721241 .6864098
_cons | 6 1.266053 4.74 0.000 3.518581 8.481419
-------------+----------------------------------------------------------------
site1_lnvar |
_cons | .5250103 .3748764 1.40 0.161 -.209734 1.259755
-------------+----------------------------------------------------------------
site2_mean |
x1 | 1 .4134491 2.42 0.016 .1896546 1.810345
_cons | 2 2.301987 0.87 0.385 -2.511812 6.511812
-------------+----------------------------------------------------------------
site2_lnvar |
_cons | .6931472 .3944053 1.76 0.079 -.079873 1.466167
. test [site1_mean]x1 = [site2_mean]x1 // test for equality
( 1) [site1_mean]x1 - [site2_mean]x1 = 0
chi2( 1) = 3.09
Prob > chi2 = 0.0789
In this example, the regression coefficient 0.1071429 for Site 1 was not quite significantly different from the regression coefficient 1 for Site 2 (p = 0.079).

The command suest is an abbreviation for seemingly unrelated estimation in Stata. It simply combines estimation results across models for use with other post-estimation commands, such as test.
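As an aside, the same question can be asked within a single model that includes a site-by-x1 interaction term. A minimal sketch, where the variable names site2i and x1site2 are made up for illustration; the p value will not exactly match the suest result, which is based on a robust variance estimate:

gen site2i = (site==2)       // indicator for Site 2
gen x1site2 = x1*site2i      // site-by-x1 interaction term
regress y x1 site2i x1site2  // single model for both sites
test x1site2                 // H0: the x1 slope is the same in both sites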
Protocol Suggestion
The association between y and x1 will be estimated using linear regression, separately for the two study sites. To determine if the association is stronger in one study site than the other, the regression coefficient estimated in the two models will be compared using a post-estimation Wald test (Judge et al, 1985, 20-28).
The Judge citation is given in the Stata-11 Base Reference, p.1919, which you can refer to using the Stata help menu (PDF Documentation).
Comparison of Two Correlation Coefficients
It is sometimes of interest to compare two correlation coefficients. We can use the fact that when the outcome and predictor variables are first transformed into standardized scores, or z scores, the regression coefficient is identical to the Pearson correlation coefficient. The regression coefficients computed from different models can then be compared using a post test.
First, read in some hypothetical data in the do-file editor,
clear
input y x1 x2 site
5 7 6 1
6 3 4 1
7 6 8 1
8 7 6 1
7 5 7 1
8 6 7 2
7 5 8 2
4 3 4 2
6 5 5 2
8 4 3 2
end
list , sepby(site)
     +--------------------+
     | y   x1   x2   site |
     |--------------------|
  1. | 5    7    6      1 |
  2. | 6    3    4      1 |
  3. | 7    6    8      1 |
  4. | 8    7    6      1 |
  5. | 7    5    7      1 |
     |--------------------|
  6. | 8    6    7      2 |
  7. | 7    5    8      2 |
  8. | 4    3    4      2 |
  9. | 6    5    5      2 |
 10. | 8    4    3      2 |
     +--------------------+
Obtaining Pearson correlation coefficients,
pwcorr y x1 x2 , sig
| y x1 x2
-------------+---------------------------
y | 1.0000
|
|
x1 | 0.3635 1.0000
| 0.3018
|
x2 | 0.2914 0.6217 1.0000
| 0.4140 0.0550
The correlation between x1 and y is r = 0.36 and between x2 and y is r = 0.29. In this example, we are interested in demonstrating a difference between these two correlations, suggesting one variable is a better predictor than the other.
We will compare the two correlation coefficients using a post test following two linear regression models. If, however, the two variables, x1 and x2, have different underlying scales, it does not make sense to compare the regression coefficients directly. First, the variables must be converted to the same underlying scale, using standardized scores, or z scores.
egen zy=std(y)
egen zx1=std(x1)
egen zx2=std(x2)
sum
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
y | 10 6.6 1.349897 4 8
x1 | 10 5.1 1.449138 3 7
x2 | 10 5.8 1.75119 3 8
site | 10 1.5 .5270463 1 2
zy | 10 -1.19e-08 1 -1.926073 1.037116
-------------+--------------------------------------------------------
zx1 | 10 -3.73e-09 1 -1.449138 1.311125
zx2 | 10 -1.49e-09 1 -1.598913 1.256289
We see that the three z scores each have a mean of 0 and a standard deviation of 1, thus all are on the same underlying scale, which is how many standard deviation units each score is from its mean.
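As an aside, egen's std() function simply subtracts the variable's mean and divides by its standard deviation, which can be verified by hand (the variable name zx1_byhand is chosen here just for the check):

quietly summarize x1                     // leaves r(mean) and r(sd) behind
gen zx1_byhand = (x1 - r(mean))/r(sd)    // identical to egen zx1 = std(x1)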
regress zy zx1 // compute 1st Pearson correlation
estimates store A // store estimates of 1st model
regress zy zx2 // compute 2nd Pearson correlation
estimates store B // store estimates of 2nd model
suest A B // combine estimation results for testing
test [A_mean]zx1 = [B_mean]zx2 // test for equality
. regress zy zx1 // compute 1st Pearson correlation
zy | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
zx1 | .363519 .3293657 1.10 0.302 -.3959996 1.123038
_cons | -1.06e-08 .3124637 -0.00 1.000 -.7205426 .7205426
. estimates store A // store estimates of 1st model
. regress zy zx2 // compute 2nd Pearson correlation
zy | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
zx2 | .2914172 .3382078 0.86 0.414 -.4884913 1.071326
_cons | -1.15e-08 .3208521 -0.00 1.000 -.7398862 .7398862
. estimates store B // store estimates of 2nd model
. suest A B // combine estimation results for testing
Simultaneous results for A, B
Number of obs = 10
| Robust
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
A_mean |
zx1 | .363519 .3824863 0.95 0.342 -.3861403 1.113178
_cons | -1.06e-08 .2945936 -0.00 1.000 -.5773929 .5773928
-------------+----------------------------------------------------------------
A_lnvar |
_cons | -.0239488 .3292348 -0.07 0.942 -.6692372 .6213395
-------------+----------------------------------------------------------------
B_mean |
zx2 | .2914172 .3512637 0.83 0.407 -.3970469 .9798814
_cons | -1.15e-08 .3025022 -0.00 1.000 -.5928935 .5928935
-------------+----------------------------------------------------------------
B_lnvar |
_cons | .0290349 .3150404 0.09 0.927 -.588433 .6465027
. test [A_mean]zx1 = [B_mean]zx2 // test for equality
( 1) [A_mean]zx1 - [B_mean]zx2 = 0
chi2( 1) = 0.06
Prob > chi2 = 0.8130
We notice that the regression coefficients are now identical to the Pearson correlation coefficients obtained above with pwcorr.

We see that the correlation between y and x1 (r = 0.36) is not significantly different from the correlation between y and x2 (r = 0.29) (p = 0.81). So, although x1 appears to be the better predictor, based on an observed larger correlation coefficient, there was not sufficient information in this sample to demonstrate that statistically.
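If an estimate of the difference, along with a confidence interval, is wanted in addition to the p value, the lincom command can be used after suest:

lincom [A_mean]zx1 - [B_mean]zx2   // difference between the two correlations, with 95% CI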
Protocol Suggestion
To test the difference in correlation of two variables, each correlated with a common third variable, we will use a regression model approach. When the outcome and predictor variable are both converted to standardized scores, or z scores, the regression coefficient is identical to the Pearson correlation coefficient. These standardized regression coefficients, or Pearson correlation coefficients, will be compared using a linear regression post-estimation Wald test (Judge et al, 1985, 20-28).
The Judge citation is given in the Stata-11 Base Reference, p.1919, which you can refer to using the Stata help menu (PDF Documentation).
References
Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. Cambridge: Cambridge University Press.

Gregoire AJP, Kumar R, Everitt BS, et al. (1996). Transdermal oestrogen for the treatment of severe post-natal depression. The Lancet 347:930-934.

Judge GG, Griffiths WE, Hill RC, Lütkepohl H, Lee T-C. (1985). The Theory and Practice of Econometrics. 2nd ed. New York: Wiley.

Nawata K, Sohmiya M, Kawaguchi M, et al. (2004). Increased resting metabolic rate in patients with type 2 diabetes mellitus accompanied by advanced diabetic nephropathy. Metabolism 53(11):1395-1398.

Rabe-Hesketh S, Everitt B. (2003). A Handbook of Statistical Analyses Using Stata. 3rd ed. New York: Chapman & Hall/CRC.