Chapter 5-21. Regression Post Tests
In this chapter, we will discuss statistical comparisons made after one or more regression models have been fitted. These are called post tests, and can be used to combine regression coefficients in the same model, compare coefficients across models, and compare model fit across models.
Simultaneous Comparison of Groups
We did this in Chapter 6, and will cover it quickly here. Opening the rmr dataset in Stata,
File
 Open
  Find the directory where you copied the course CD
  Change to the subdirectory datasets & do-files
  Single click on rmr.dta
  Open

use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\rmr.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop"
cd "Biostats & Epi With Stata\datasets & do-files"
use rmr, clear
This dataset comes from Nawata et al (2004)[on course CD]. The data were taken from the authors’ Figure 1, a scatterplot, and so only approximate the actual values used by the authors.
File rmr.dta codebook

group   urinary excretion of albumin group (U-Alb)
          a = U-Alb < 30 mg/d
          b = 30 mg/d ≤ U-Alb ≤ 300 mg/d
          c = 300 mg/d < U-Alb
lbm     lean body mass (kg)
rmr     resting metabolic rate (kJ/h/m2)
The group variable is a string variable with values “a”, “b”, and “c”.
Creating indicator variables and checking our work,
gen groupa = cond(group=="a",1,0)
gen groupb = cond(group=="b",1,0)
gen groupc = cond(group=="c",1,0)
tab group groupa
tab group groupb
tab group groupc
| groupa
group | 0 1 | Total
-----------+----------------------+----------
a | 0 12 | 12
b | 10 0 | 10
c | 12 0 | 12
-----------+----------------------+----------
Total | 22 12 | 34
| groupb
group | 0 1 | Total
-----------+----------------------+----------
a | 12 0 | 12
b | 0 10 | 10
c | 12 0 | 12
-----------+----------------------+----------
Total | 24 10 | 34
| groupc
group | 0 1 | Total
-----------+----------------------+----------
a | 12 0 | 12
b | 10 0 | 10
c | 0 12 | 12
-----------+----------------------+----------
Total | 22 12 | 34
We see that the indicator variables were correctly set up.
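As an aside, the tabulate command's generate() option creates the same indicator variables in one step. A minimal sketch, where Stata assigns the names g1, g2, g3 in the sort order of the group values, so the rename commands simply match the names used above:

tab group , gen(g)   // creates indicators g1, g2, g3 for groups a, b, c
rename g1 groupa
rename g2 groupb
rename g3 groupc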
We want to test the hypothesis that the three group means are the same:

    H0: μa = μb = μc

Even better, we want to do this adjusting for lbm, which is a potential confounding variable:

    H0: μa = μb = μc , where each μ is the group mean of rmr adjusted for lbm
Fitting the linear regression model, with group a as the referent,
Statistics
 Linear models and related
  Linear regression
   Dependent variable: rmr
   Independent variables: groupb groupc lbm
   OK

regress rmr groupb groupc lbm
Source | SS df MS Number of obs = 34
-------------+------------------------------ F( 3, 30) = 9.18
Model | 8798.55426 3 2932.85142 Prob > F = 0.0002
Residual | 9588.38691 30 319.612897 R-squared = 0.4785
-------------+------------------------------ Adj R-squared = 0.4264
Total | 18386.9412 33 557.180036 Root MSE = 17.878
rmr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
groupb | -3.410606 7.654802 -0.45 0.659 -19.0438 12.22258
groupc | 29.91667 7.305324 4.10 0.000 14.9972 44.83613
lbm | .5454562 .3431357 1.59 0.122 -.1553203 1.246233
_cons | 112.2196 15.87451 7.07 0.000 79.79954 144.6397
We see that there is a significant difference between group c and group a, while controlling for lbm. The adjusted means implied by these coefficients, with lbm set to 0, are:
Group a = _cons = 112.2196
Group b = _cons + groupb = 112.2196 – 3.410606 = 108.80899
Group c = _cons + groupc = 112.2196 + 29.91667 = 142.13627
We can get these adjusted group means directly, again setting lbm = 0, using
Statistics
 Postestimation
  Adjusted means and proportions
   Main tab: Compute and display predictions for each level of variables: group
             Variables to be set to their overall mean value: <leave blank this time>
             Variables to be set to a specified value: Variable: lbm  Value: 0
   Options tab: Confidence interval or prediction intervals
   OK

adjust lbm=0, by(group) xb ci
Dependent variable: rmr Command: regress
Variables left as is: groupa, groupb, groupc
Covariate set to value: lbm = 0
group | xb lb ub
----------+-----------------------------------
a | 112.22 [79.7995 144.64]
b | 108.809 [76.0153 141.603]
c | 142.136 [109.108 175.165]
Key: xb = Linear Prediction
[lb , ub] = [95% Confidence Interval]
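The same adjusted means and confidence intervals can also be obtained one at a time with the lincom post-estimation command, following the regress command:

lincom _cons + groupb   // adjusted mean for group b, with 95% CI
lincom _cons + groupc   // adjusted mean for group c, with 95% CI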
To get a single p value for the group comparison, if all we wanted to say was that there was a significant difference among the groups, we use what we know about combining the coefficients,
Group a = _cons = 112.2196
Group b = _cons + groupb = 112.2196 – 3.410606 = 108.80899
Group c = _cons + groupc = 112.2196 + 29.91667 = 142.13627
and then construct the following post-estimation Wald test,
test _cons = _cons+groupb = _cons+groupc
( 1) - groupb = 0
( 2) - groupc = 0
F( 2, 30) = 12.10
Prob > F = 0.0001
We can now report the three groups are not equal after adjusting for lbm (p<.001).
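Notice in the constraint listing that the intercept cancels out of each equality, leaving a joint test that both indicator coefficients are zero. The same 2 degree of freedom test can therefore be requested more simply:

test groupb groupc   // joint test that both indicator coefficients = 0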
What we just tested is the group main effect in an analysis of covariance,
encode group, generate(group2) // create numeric group variable
anova rmr group2 lbm, continuous(lbm)
Number of obs = 34 R-squared = 0.4785
Root MSE = 17.8777 Adj R-squared = 0.4264
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 8798.55426 3 2932.85142 9.18 0.0002
|
group2 | 7734.4293 2 3867.21465 12.10 0.0001
lbm | 807.629752 1 807.629752 2.53 0.1224
|
Residual | 9588.38691 30 319.612897
-----------+----------------------------------------------------
Total | 18386.9412 33 557.180036
In the authors' statistical methods section, they stated that they used analysis of variance (ANOVA) and analysis of covariance (ANCOVA).
Although there are special routines in Stata to fit them more easily, analysis of variance (ANOVA) is just a linear regression with categorical predictor variables, and analysis of covariance (ANCOVA) is just a linear regression with both categorical variables and continuous variables. In ANOVA terminology, categorical variables are called factors and continuous variables are called covariates.
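As an aside, in Stata 11 or later the same ANCOVA can be fitted without hand-made indicator variables by using factor-variable notation. A minimal sketch, using the numeric group2 variable created with encode above:

regress rmr i.group2 lbm   // i.group2 expands into indicator variables
testparm i.group2          // 2 df Wald test of the group main effect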
Likelihood Ratio Test For Model Fit
We did this in Chapter 24, and will cover it again quickly here. Using the 11.2.Isoproterenol.dta dataset provided with the Dupont (2002, p.338) textbook, described as,
“Lang et al. (1995) studied the effect of isoproterenol, a β-adrenergic agonist, on forearm blood flow in a group of 22 normotensive men. Nine of the study subjects were black and 13 were white. Each subject’s blood flow was measured at baseline and then at escalating doses of isoproterenol.”
Reading the data in,
use "C:Documents and Settingsu0032770.SRVRDesktop
Biostats & Epi With Statadatasets & do-files 11.2.Isoproterenol.dta", clear
* which must be all on one line, or use: cd "C:Documents and Settingsu0032770.SRVRDesktop" cd "Biostats & Epi With Statadatasets & do-files" use 11.2.Isoproterenol, clear |
To get a p value to test if a random slope is needed (in other words, the random slope improves the goodness of fit), we compare the log likelihoods between a model without it and a model with it included,
reshape long fbf , i(id) j(dose)
*
xtmixed fbf dose || id: , mle cov(unstructured)
estimates store modelA // store model estimates
xtmixed fbf dose || id: dose , mle cov(unstructured)
estimates store modelB // store estimates of model with added term
lrtest modelB modelA
display "LR Chi-square = " -2*(-476.95216 -(-437.38499))
. xtmixed fbf dose || id: , mle cov(unstructured)
Note: single-variable random-effects specification; covariance structure set to identity
Log likelihood = -476.95216 Prob > chi2 = 0.0000
fbf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dose | .0352574 .0027581 12.78 0.000 .0298517 .0406631
_cons | 4.966772 1.245969 3.99 0.000 2.524718 7.408827
. xtmixed fbf dose || id: dose , mle cov(unstructured)
Log likelihood = -437.38499 Prob > chi2 = 0.0000
fbf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
dose | .0356267 .0049976 7.13 0.000 .0258315 .0454219
_cons | 4.929394 .6680961 7.38 0.000 3.61995 6.238838
. lrtest modelB modelA
Likelihood-ratio test LR chi2(2) = 79.13
(Assumption: modelA nested in modelB) Prob > chi2 = 0.0000
. display "LR Chi-square = " -2*(-476.95216 -(-437.38499))
LR Chi-square = 79.13434
The LR test statistic = 79.13, p < 0.0001, informs us that adding the random slopes improved the fit. The display command was not needed, but was used just to show that this statistic is -2 × the difference of the two model log likelihoods.
You could use this likelihood ratio test to get a p value for any variable in the model. It is only slightly more powerful than the Wald test p value shown in the regression table, frequently changing the p value only in the second or third decimal place.
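For example, a sketch of the same device applied to the fixed effect of dose, keeping the random slope in both models (the model names noDose and withDose are arbitrary):

xtmixed fbf || id: dose , mle cov(unstructured)       // model without the fixed effect of dose
estimates store noDose
xtmixed fbf dose || id: dose , mle cov(unstructured)  // model with it included
estimates store withDose
lrtest withDose noDose                                // 1 df LR test for the fixed effect of dose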
Comparison of Regression Coefficients Within The Same Model
The regression coefficients within the same model can be compared using a post test. This might be useful, for example, if you wanted to compare the coefficients for different study sites.
First, read in some hypothetical data in the do-file editor,
clear
input y x1 x2 site
5 7 6 1
6 3 4 1
7 6 8 1
8 7 6 2
7 5 7 2
8 6 7 2
7 5 8 3
4 3 4 3
6 5 5 3
end
list , sepby(site)
     +--------------------+
     | y   x1   x2   site |
     |--------------------|
  1. | 5    7    6      1 |
  2. | 6    3    4      1 |
  3. | 7    6    8      1 |
     |--------------------|
  4. | 8    7    6      2 |
  5. | 7    5    7      2 |
  6. | 8    6    7      2 |
     |--------------------|
  7. | 7    5    8      3 |
  8. | 4    3    4      3 |
  9. | 6    5    5      3 |
     +--------------------+
For this illustration, we will not concern ourselves with overfitting (too few subjects for the number of predictors in the model).
Although it is not necessary, if you add the “nocons” option, then the coefficient for each site indicator is that site's mean value of y. Fitting such a linear regression model,
tab site , gen(site) // create site indicator variables
regress y site1 site2 site3, nocons
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
site1 | 6 .6382847 9.40 0.000 4.438174 7.561826
site2 | 7.666667 .6382847 12.01 0.000 6.10484 9.228493
site3 | 5.666667 .6382847 8.88 0.000 4.10484 7.228493
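As a check, these coefficients are simply the raw site means of y, which can be verified with:

bysort site: summarize y   // site means 6, 7.67, and 5.67 match the coefficients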
Some possible site comparisons that might be of interest are:
test site1 = site2 = site3 // no difference among the sites
*
test site1 = site2 // pairwise site comparisons
test site1 = site3
test site2 = site3
. test site1 = site2 = site3 // no difference among the sites
( 1) site1 - site2 = 0
( 2) site1 - site3 = 0
F( 2, 6) = 2.82
Prob > F = 0.1371
. *
. test site1 = site2 // pairwise site comparisons
( 1) site1 - site2 = 0
F( 1, 6) = 3.41
Prob > F = 0.1144
. test site1 = site3
( 1) site1 - site3 = 0
F( 1, 6) = 0.14
Prob > F = 0.7246
. test site2 = site3
( 1) site2 - site3 = 0
F( 1, 6) = 4.91
Prob > F = 0.0686
The two-tailed p values are the Prob > F values shown in the output.
The p values will be slightly different from those of the corresponding pairwise t tests, but either approach is valid. An advantage of the approach shown here, however, is that you can test the pairwise differences while controlling for confounding variables, something which cannot be done with t tests.
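For example, a sketch of a covariate-adjusted pairwise comparison using the same hypothetical data, treating x1 as the confounder:

regress y site1 site2 site3 x1 , nocons  // site coefficients now adjusted for x1
test site1 = site2                       // pairwise site comparison adjusted for x1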
Protocol Suggestion
The association between y and study site will be estimated using linear regression. To determine if the study sites differ, and which ones, the regression coefficients for specific study sites will be compared using a post-estimation Wald test (Judge et al, 1985, 20-28).
The Judge citation is given in the Stata-11 Base Reference, p.1919, which you can refer to using the Stata help menu (PDF Documentation).
Comparison of Regression Coefficients From Separate Models
The regression coefficients computed from different models can be compared using a post test. This might be useful, for example, if you wanted to compare the coefficients for different subgroups.
First, read in some hypothetical data in the do-file editor,
clear
input y x1 x2 site
5 7 6 1
6 3 4 1
7 6 8 1
8 7 6 1
7 5 7 1
8 6 7 2
7 5 8 2
4 3 4 2
6 5 5 2
8 4 3 2
end
list , sepby(site)
     +--------------------+
     | y   x1   x2   site |
     |--------------------|
  1. | 5    7    6      1 |
  2. | 6    3    4      1 |
  3. | 7    6    8      1 |
  4. | 8    7    6      1 |
  5. | 7    5    7      1 |
     |--------------------|
  6. | 8    6    7      2 |
  7. | 7    5    8      2 |
  8. | 4    3    4      2 |
  9. | 6    5    5      2 |
 10. | 8    4    3      2 |
     +--------------------+
We are interested in determining if the association between y and x1 is greater in Site 1 than in Site 2.
This can be done using,
regress y x1 if site==1 // fit 1st model
estimates store site1 // store estimates of 1st model
regress y x1 if site==2 // fit 2nd model
estimates store site2 // store estimates of 2nd model
suest site1 site2 // combine estimation results for testing
test [site1_mean]x1 = [site2_mean]x1 // test for equality of coef
. regress y x1 if site==1 // fit 1st model
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 | .1071429 .388504 0.28 0.801 -1.12925 1.343536
_cons | 6 2.251983 2.66 0.076 -1.166816 13.16682
. estimates store site1 // store estimates of 1st model

. regress y x1 if site==2 // fit 2nd model
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 | 1 .6201737 1.61 0.205 -.9736694 2.973669
_cons | 2 2.922065 0.68 0.543 -7.299314 11.29931
. estimates store site2 // store estimates of 2nd model
. suest site1 site2 // combine estimation results for testing
Simultaneous results for site1, site2
Number of obs = 10
| Robust
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
site1_mean |
x1 | .1071429 .2955498 0.36 0.717 -.4721241 .6864098
_cons | 6 1.266053 4.74 0.000 3.518581 8.481419
-------------+----------------------------------------------------------------
site1_lnvar |
_cons | .5250103 .3748764 1.40 0.161 -.209734 1.259755
-------------+----------------------------------------------------------------
site2_mean |
x1 | 1 .4134491 2.42 0.016 .1896546 1.810345
_cons | 2 2.301987 0.87 0.385 -2.511812 6.511812
-------------+----------------------------------------------------------------
site2_lnvar |
_cons | .6931472 .3944053 1.76 0.079 -.079873 1.466167
. test [site1_mean]x1 = [site2_mean]x1 // test for equality
( 1) [site1_mean]x1 - [site2_mean]x1 = 0
chi2( 1) = 3.09
Prob > chi2 = 0.0789
In this example, the regression coefficient 0.1071429 for Site 1 was not quite significantly different from the regression coefficient 1 for Site 2 (p = 0.079).

The command suest is an abbreviation for seemingly unrelated estimation in Stata. It simply combines estimation results across models for use with other post-estimation commands, such as test.
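As an aside, the same question can be asked within a single model that includes a site-by-x1 interaction term. A minimal sketch, where the variable names site2i and x1site2 are made up for illustration; the p value will not exactly match the suest result, which is based on a robust variance estimate:

gen site2i = (site==2)       // indicator for Site 2
gen x1site2 = x1*site2i      // site-by-x1 interaction term
regress y x1 site2i x1site2  // single model for both sites
test x1site2                 // H0: the x1 slope is the same in both sites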
Protocol Suggestion
The association between y and x1 will be estimated using linear regression, separately for the two study sites. To determine if the association is stronger in one study site than the other, the regression coefficient estimated in the two models will be compared using a post-estimation Wald test (Judge et al, 1985, 20-28).
The Judge citation is given in the Stata-11 Base Reference, p.1919, which you can refer to using the Stata help menu (PDF Documentation).
Comparison of Two Correlation Coefficients
It is sometimes of interest to compare two correlation coefficients. We can use the fact that when the outcome and predictor variables are first transformed into standardized scores, or z scores, the regression coefficient is identical to the Pearson correlation coefficient. The regression coefficients computed from different models can then be compared using a post test.
First, read in some hypothetical data in the do-file editor,
clear
input y x1 x2 site
5 7 6 1
6 3 4 1
7 6 8 1
8 7 6 1
7 5 7 1
8 6 7 2
7 5 8 2
4 3 4 2
6 5 5 2
8 4 3 2
end
list , sepby(site)
     +--------------------+
     | y   x1   x2   site |
     |--------------------|
  1. | 5    7    6      1 |
  2. | 6    3    4      1 |
  3. | 7    6    8      1 |
  4. | 8    7    6      1 |
  5. | 7    5    7      1 |
     |--------------------|
  6. | 8    6    7      2 |
  7. | 7    5    8      2 |
  8. | 4    3    4      2 |
  9. | 6    5    5      2 |
 10. | 8    4    3      2 |
     +--------------------+
Obtaining Pearson correlation coefficients,
pwcorr y x1 x2 , sig
| y x1 x2
-------------+---------------------------
y | 1.0000
|
|
x1 | 0.3635 1.0000
| 0.3018
|
x2 | 0.2914 0.6217 1.0000
| 0.4140 0.0550
The correlation between x1 and y is r = 0.36 and between x2 and y is r = 0.29. In this example, we are interested in demonstrating a difference between these two correlations, suggesting one variable is a better predictor than the other.
We will compare the two correlation coefficients using a post test following two linear regression models. If, however, the two variables, x1 and x2, have different underlying scales, it does not make sense to compare the regression coefficients directly. First, the variables must be converted to the same underlying scale, using standardized scores, or z scores.
egen zy=std(y)
egen zx1=std(x1)
egen zx2=std(x2)
sum
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
y | 10 6.6 1.349897 4 8
x1 | 10 5.1 1.449138 3 7
x2 | 10 5.8 1.75119 3 8
site | 10 1.5 .5270463 1 2
zy | 10 -1.19e-08 1 -1.926073 1.037116
-------------+--------------------------------------------------------
zx1 | 10 -3.73e-09 1 -1.449138 1.311125
zx2 | 10 -1.49e-09 1 -1.598913 1.256289
We see that the three z scores each have a mean of 0 and a standard deviation of 1, thus all are on the same underlying scale, which is how many standard deviation units each score is from its mean.
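As an aside, egen's std() function simply subtracts the variable's mean and divides by its standard deviation, which can be verified by hand (the variable name zx1_byhand is chosen here just for the check):

quietly summarize x1                     // leaves r(mean) and r(sd) behind
gen zx1_byhand = (x1 - r(mean))/r(sd)    // identical to egen zx1 = std(x1)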
regress zy zx1 // compute 1st Pearson correlation
estimates store A // store estimates of 1st model
regress zy zx2 // compute 2nd Pearson correlation
estimates store B // store estimates of 2nd model
suest A B // combine estimation results for testing
test [A_mean]zx1 = [B_mean]zx2 // test for equality
. regress zy zx1 // compute 1st Pearson correlation
zy | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
zx1 | .363519 .3293657 1.10 0.302 -.3959996 1.123038
_cons | -1.06e-08 .3124637 -0.00 1.000 -.7205426 .7205426
. estimates store A // store estimates of 1st model
. regress zy zx2 // compute 2nd Pearson correlation
zy | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
zx2 | .2914172 .3382078 0.86 0.414 -.4884913 1.071326
_cons | -1.15e-08 .3208521 -0.00 1.000 -.7398862 .7398862
. estimates store B // store estimates of 2nd model
. suest A B // combine estimation results for testing
Simultaneous results for A, B
Number of obs = 10
| Robust
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
A_mean |
zx1 | .363519 .3824863 0.95 0.342 -.3861403 1.113178
_cons | -1.06e-08 .2945936 -0.00 1.000 -.5773929 .5773928
-------------+----------------------------------------------------------------
A_lnvar |
_cons | -.0239488 .3292348 -0.07 0.942 -.6692372 .6213395
-------------+----------------------------------------------------------------
B_mean |
zx2 | .2914172 .3512637 0.83 0.407 -.3970469 .9798814
_cons | -1.15e-08 .3025022 -0.00 1.000 -.5928935 .5928935
-------------+----------------------------------------------------------------
B_lnvar |
_cons | .0290349 .3150404 0.09 0.927 -.588433 .6465027
. test [A_mean]zx1 = [B_mean]zx2 // test for equality
( 1) [A_mean]zx1 - [B_mean]zx2 = 0
chi2( 1) = 0.06
Prob > chi2 = 0.8130
We notice that the regression coefficients are now identical to the Pearson correlation coefficients obtained above with pwcorr.

We see that the correlation between y and x1 (r = 0.36) is not significantly different from the correlation between y and x2 (r = 0.29) (p = 0.81). So, although x1 appears to be the better predictor, based on an observed larger correlation coefficient, there was not sufficient information in this sample to demonstrate that statistically.
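If an estimate of the difference, along with a confidence interval, is wanted in addition to the p value, the lincom command can be used after suest:

lincom [A_mean]zx1 - [B_mean]zx2   // difference between the two correlations, with 95% CI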
Protocol Suggestion
To test the difference in correlation of two variables, each correlated with a common third variable, we will use a regression model approach. When the outcome and predictor variable are both converted to standardized scores, or z scores, the regression coefficient is identical to the Pearson correlation coefficient. These standardized regression coefficients, or Pearson correlation coefficients, will be compared using a linear regression post-estimation Wald test (Judge et al, 1985, 20-28).
The Judge citation is given in the Stata-11 Base Reference, p.1919, which you can refer to using the Stata help menu (PDF Documentation).
References
Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. Cambridge: Cambridge University Press.

Gregoire AJP, Kumar R, Everitt BS, et al. (1996). Transdermal oestrogen for the treatment of severe post-natal depression. The Lancet 347:930-934.

Judge GG, Griffiths WE, Hill RC, Lütkepohl H, Lee T-C. (1985). The Theory and Practice of Econometrics. 2nd ed. New York: Wiley.

Nawata K, Sohmiya M, Kawaguchi M, et al. (2004). Increased resting metabolic rate in patients with type 2 diabetes mellitus accompanied by advanced diabetic nephropathy. Metabolism 53(11):1395-1398.

Rabe-Hesketh S, Everitt B. (2003). A Handbook of Statistical Analyses Using Stata. 3rd ed. New York: Chapman & Hall/CRC.