66 Cards in this Set

  • Front
  • Back
NB: When you've transformed a variable for regression ...
Make sure you remember the units are transformed, too!
First step when you're not sure if the data set needs to be transformed to be linear
'center' the independent variables (predictors): i.e., subtract the mean of x from each x, so the transformed var is the original's distance from the mean

just wanna bring observations closer to zero!
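A minimal Stata sketch of centering (hypothetical variable x; summarize stores the sample mean in r(mean)):

* subtract the sample mean from each observation
. summarize x
. generate cx = x - r(mean)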
Second step when you're not sure if data needs to be transformed
look at the X-Y scatter -- is there curvature? if so, take a transform (maybe log or exp) of Y.... if the result is linear: E(log(Y)) = Beta0 + Beta1(X)

can derive the meaning of the betas like above
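A quick Stata sketch of this check (hypothetical y and x):

* log-transform Y and re-examine the scatter for linearity
. generate logy = ln(y)
. scatter logy x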
how confidence bands change when converting to and from transformation
'Least serious' regression violation vis-à-vis transformation considerations
Non-normality; it also usually happens to be remedied automatically once the other violations are addressed
Violations of linearity
1) Analytically (Table 6.1 in book; try logit if response is a probability etc.)

2) Numerically (look at scatter & try what looks best!)
Transformations other than the logarithmic (mostly about Box-Cox)
Ladder of powers: exponentiating Y by -1, -0.5, 0.5, 1, 2

Box-Cox transform: attached;

NB: If after running boxcox, lambda = 0 can't be rejected, 'simply' use the log transform (i.e. Y-prime = log(Y));

if lambda = -1 can't be rejected, use Y-prime = 1/Y (reciprocal);

if lambda = 1 can't be rejected, do nothing! Y-prime = Y!

After running boxcox, you can either use the suggested lambda from the Box-Cox fit (last part of the printout) or follow the guidelines above if that last bit didn't reject all the hypotheses
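A minimal Stata sketch (hypothetical y and x; boxcox prints the estimated lambda along with tests of lambda = -1, 0, 1 at the end of its output):

* estimate the Box-Cox lambda and inspect the hypothesis tests
. boxcox y x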
How do we spot non-linearities?
Scatterplots, partial plots, augmented CPR plots ('acprplot' in Stata); LOOK FOR CURVED PATTERNS IN THESE PLOTS

acprplot is the augmented component-plus-residual (partial residual) plot; need to do it one predictor at a time if you have multiple predictors

the ', lowess' option fits a curved (lowess) line along with the fitted line
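A minimal Stata sketch (hypothetical y, x1, x2):

* augmented component-plus-residual plot for one predictor,
* with a lowess smooth added to reveal curvature
. regress y x1 x2
. acprplot x1, lowess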
Encountering heteroscedastic errors (POWER TRANSFORM)
In applications, when increasing variance over a predictor var is encountered, the st.dev of the residuals increases as the predictor increases, so we hypothesize that st.dev(error) = k*x,

leading to the following transform: Y' = Y/X with new predictor X' = 1/X (dividing the whole model through by X makes the error term, error/X, have constant st.dev k)

generate ty = y/x
generate tx = 1/x

and re-scatter

Can try log too to see if it helps, but it didn't help the way this did in the example!

Next thing we did: regress y x (x^2);

it worked OK, but it's better to use Y/X and 1/X because that has fewer predictors than log(Y) ~ X + X^2
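A minimal Stata sketch of the full transform-and-refit (hypothetical y and x):

* divide through by x so the error variance becomes constant
. generate ty = y/x
. generate tx = 1/x
. scatter ty tx
. regress ty tx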
Transform options
square root, reciprocal (^(-1)), reciprocal of square root!, log, log(sqrt(y))!

~ AVOID UNNECESSARY TRANSFORMATIONS ~
Box-Cox before and after
Weighted Least Squares (WLS) method
another way to correct for heteroscedasticity, by assigning a weight to each observation's term in the SSE

. regress y x [w=1/x^2]
(analytic weights assumed)
(sum of wgt is 1.0470e-04)

useful when it looks like a predictor is influencing the magnitude of the variance spread; we 'standardize' observations by dividing by their own variance

"each observation is weighted by the estimate of its standard deviation"

Computing the weights from the data: how do we estimate these within-region variances?
Answer: find the MSE for each region!

. regress y x1 x2 x3
. predict res, resid
. generate r2 = res^2
. egen r = mean(r2), by(region)

r represents the MSE for each region

Weighted Least Squares (WLS) divides each term in the SSE by r
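A minimal Stata sketch of the final WLS fit using those estimated weights (hypothetical y, x1-x3, region; analytic weights implement the inverse-variance weighting):

* weight each observation by the inverse of its region's MSE
. regress y x1 x2 x3 [aweight=1/r]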
what's good...
Compared with LS, the R^2, F-statistic, and MSE of WLS are much better! Much better t-statistics, too.
NB Regarding WLS
The WLS returns the residuals as y - y-hat, but our fitted y-hat is obtained by WLS. To check the residuals, we need to obtain the weighted residuals.
autocorrelation
if observations follow a natural order ---> have patterns, the errors are correlated, and that correlation is called 'autocorrelation'
Common autocorrelation patterns
Common causes of autocorrelation
1. same observations measured in adjacent periods (stock market today and tomorrow)
2. spatially close observations (temp in NC and VA)
3. another important predictor omitted
How autocorrelation and heteroscedasticity really endanger Ordinary Least Squares regression
1. OLS estimates are unbiased but not efficient (no longer minimum-variance among unbiased estimators)
2. sigma^2 and the s.e. of the betas may be way over- or under-estimated. Positive correlation -> under-estimation of s.e. -> increased false positives

==> C.I., OTHER TESTS INVALID
Time series & Runs
Time series (ordered) data -> index plot (standardized residual vs time)

IF errors are correlated, the plot will display clusters of + or - residuals called 'runs'; a sequence plot shows runs

under H0: no autocorrelation, the expected number of runs / variance of the statistic is as attached

'runtest' in Stata tests if the observed # of runs is different from expected; "thresh(0) just means 0 is the threshold or fulcrum for +/- whereas the median is the default threshold"
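A minimal Stata sketch (hypothetical y and x, observations already in time order):

* runs test on the signs of the OLS residuals, split at zero
. regress y x
. predict res, resid
. runtest res, thresh(0)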
Durbin-Watson test
for detecting serial correlation; based on the assumption that successive errors are correlated as in the attached model, where e_t is the t-th OLS residual. H0: rho = 0 => errors are uncorrelated
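A minimal Stata sketch (hypothetical time variable t; the statistic d is roughly 2(1 - rho-hat), so d near 2 means little estimated serial correlation):

* declare the time series, then compute the Durbin-Watson d
. tsset t
. regress y x
. estat dwatson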
Approximations and tips on Durbin-Watson statistic
when d is near 2, there isn't much autocorrelation in the error, but if d is far from 2, it suggests autocorrelation is present
How close is 'close to 2' for D-W statistic?
for positive autocorrelations (d is below 2):

d < dL => reject null (that there is NO positive autocorrelation)

d > dU => we fail to reject the null

dL < d < dU => test is inconclusive

dL and dU are in tables, vary by n, p

*FOR NEGATIVE AUTOCORRELATION, "WORK WITH (4-d) and use the same procedure as above"
You can remove autocorrelation by transformation!
Cochrane-Orcutt transformation is one way to remove autocorrelation:

'prais' command in Stata;

the 'two' (twostep) option stops the procedure after the first estimate of rho,

OR you can let the iteration run until convergence

we generate the residual from e_t and e_(t-1), generating the latter from the lagged variable "L.r2"
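A minimal Stata sketch (hypothetical y, x, and time variable t; 'two' in the card abbreviates the twostep option):

* Cochrane-Orcutt estimation, stopping after the first
* estimate of rho instead of iterating to convergence
. tsset t
. prais y x, corc twostep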
Remember that one cause of autocorrelation is...
artificial: the omission of another predictor variable
Limitations of DW statistic
When TWO LEVELS OF AUTOCORRELATION are PRESENT, the DW statistic only looks at the overall autocorrelation and can't detect the higher level of autocorrelation. Thus, a DW statistic near 2 does not necessarily indicate no autocorrelation.

i.e., an omitted season in the ski rental data; regressing with season will fix the autocorrelation, which is different in ski season and warm season
Multicollinearity
The symptom: p-values for the betas in the regression are insignificant, but the F-statistic is signaling significance!

==> multicollinearity, which is when some of the 'independent' predictors are correlated

multicollinearity almost always exists, but the concern is about severity
If all predictors in a model are independent...
we call the predictors 'orthogonal'
Multicollinearity is not...
an error! It comes from a lack of info in the data set... what if X3 can be approximately expressed as a combo of X1 and X2? etc.
Collinearity
when one variable can be expressed completely as a function of other variables; "most serious case"; HAVE TO DROP A VAR FROM THE MODEL!

Stata autodetects perfect collinearity and makes the decision about which var to drop
What if we ignore multicollinearity?
If mild, major consequence is that CIs are wider than usual ("more conservative tests"), but if multicollinearity is too severe CIs are TOO big and Beta estimates are uselessly haphazard
If we add a predictor and C.I. blows up in size / changes other estimates drastically, then
the new predictor is highly correlated with the old one(s)

we need the assumption of independence b/c we assume each beta represents the rate of change of Xn on Y WITH ALL OTHER VARS HELD CONSTANT
When is multicollinearity "serious" and how do we detect it?
If F-test is significant and ALL individual t-tests are INsignificant, multicollinearity is likely

if fewer t-tests are insignificant, probably just means that some variables are non-important

another sign is if a predictor has a coefficient w/ a sign that is opposite of sense/context/expectation
BEWARE: multicollinearity and pairwise combos
Multicollinearity can arise out of any linear combo of the predictor vars, so it is impossible to screen for with just pairwise scatterplots.
tolerance
tolerance for predictor Xj measures the fraction of unexplained variance, 1 - (R^2)_j, in Xj after adjustment for (i.e., regression on) the other variables
VIF
variance inflation factor; 1/tolerance

VIF > 10 suggests multicollinearity
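A minimal Stata sketch (hypothetical y and x1-x3; the output lists each predictor's VIF and 1/VIF = tolerance):

* variance inflation factors after an OLS fit
. regress y x1 x2 x3
. estat vif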
Once we detect multicollinearity...
First: if Xj is collinear with the other variables => Xj is basically redundant

So:
1) can OMIT the predictor (~ fitting the model with Beta_j = 0)
2) can CENTER variables near their mean before constructing powers and interaction terms... this reduces the collinearity of quadratic/interaction terms with their (uncentered) parent variables
3) get more data / study up on the mechanism
4) CONSTRAINED REGRESSION
Constrained regression
realize that there's a conceptual reason that your residual plot looks like this (i.e., a distinctive change in error correlation around a point): DO A SEPARATE ANALYSIS! here, around x = 12

Next, make sure R^2 is high and the predicted Betas all make sense conceptually... if the concept-check turns up something unreasonable, it's very likely highly correlated

check with pwcorr; VIF --> if high, run separate regressions on each var that has a high VIF

HERE IT IS: instead of omitting one regressor, we keep both of them in a 'constrained' context
How constrained regression works
Constrained regression avoids collinearity by using the same parameter for two predictors. The relationship of the two coefficients is predefined by the constraint function.
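A minimal Stata sketch (hypothetical y, x1, x2; the equal-coefficients constraint is just one example of a predefined relationship):

* define a linear constraint, then fit the constrained regression
. constraint 1 x1 = x2
. cnsreg y x1 x2, constraints(1)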
Variable Selection
How do we pick between a bunch of OK models?

Various methods: F-test (if nested), adjusted R^2, Mallows' Cp, likelihood-based criteria like AIC and BIC. Gets annoying though! n variables --> 2^n possible models
How to be efficient in var selection?
What is the model being used for? Exploration (building simple -> complicated, considering variables' conceptual inclusion), predictive (already know what's up, not thinking about vars much), explanatory (parsimonious variable choosing here, worried about confounding, i.e., correlation vs. causation)

ULTIMATELY we want some of all => minimize bias^2 + variance == Mean Square Error

the model should be easy to interpret, and watch out for confounders!
Stepwise regression
USE W/ CAUTION!
Stepwise regression: forward selection
4. With the new model, return to step 2

The probability-to-enter option, pe, was set to .99 so that all of the variables would enter, and their order of entry depends on their significance.

The procedure is repeated until adding any of the remaining variables would give the added variable a p-value > pe.
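A minimal Stata sketch of forward selection (hypothetical y and x1-x3; pe(.99) mirrors the card, letting every variable enter so you can watch the order of entry):

* forward selection: a variable enters while its p-value < pe
. stepwise, pe(.99): regress y x1 x2 x3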
AIC and BIC
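For reference, the standard definitions in the document's notation (L = maximized likelihood, k = number of estimated parameters, n = sample size; smaller is better for both):

AIC = 2k - 2*log(L)
BIC = k*log(n) - 2*log(L)

In Stata, both are reported by 'estat ic' after fitting a model.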
Stepwise regression: Backward selection
The probability-to-remove option, pr, was set to .01 so that all of the variables except the last one would be removed, and their order of removal also depends on their significance.

The procedure is repeated until the p-values of the variables in the model are all smaller than pr.

(In another run, the probability-to-remove option, pr, was set to .33 to correspond to a t-statistic of 1.0.)
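A minimal Stata sketch of backward selection (hypothetical y and x1-x3):

* backward selection: a variable is removed while its p-value > pr
. stepwise, pr(.01): regress y x1 x2 x3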
Warning about stepwise regression:
Completely meaningless results can happen. Unless you can use theory and common sense to justify the resulting models, don't rely on this method.

NB: pr should be > pe
Robustness
find a relationship that holds for most of the data and isn't excessively influenced by a small # of deviant data points

the mean is NOT a robust estimate of center, but the median is
Least Median of Squares regression
Robust version of Ordinary Least Squares regression (which minimizes the mean of the squared residuals): LMS minimizes the MEDIAN of the squared residuals instead
Robust regression
another alternative to Least Squares regression that deals better with data contamination like outliers or overly influential observations; 'rreg' in Stata

adjusts the weights of the data so that observations with Cook's D > 1 are excluded from the robust analysis

doesn't assume normality, allows for outliers
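A minimal Stata sketch (hypothetical y, x1, x2; genwt() saves the final case weights so you can see which observations were downweighted):

* robust regression, saving the final weights in w
. rreg y x1 x2, genwt(w)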
robust regression vs OLS
In OLS regression, all cases have a weight of 1. Hence, the more cases in the robust regression that have a weight close to one, the closer the results of the OLS and robust regressions.

Try robust; if it's close to OLS, go back to the OLS model and treat the robust regression as a (negative) check for outliers
Balancing weight in robust regression
Poisson regression
used to model count variables as the outcome (can't be negative), aka 'log-linear'

N.B. The 'i.' before a categorical var indicates that it is a factor variable (i.e., a categorical variable) and that it should be included in the model as a series of indicator variables.
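A minimal Stata sketch (hypothetical count outcome y, continuous x, categorical cat):

* Poisson regression with a factor variable, then goodness of fit
. poisson y x i.cat
. estat gof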
Big assumption of Poisson
recall mean and variance of Poisson are the same => BIG assumption: conditional on predictors, est. mean and est. var are the same
estat gof
goodness-of-fit chi-square; it is not a test of the model coefficients but a test of the model form: does the Poisson model form fit our data? A large p-value indicates good fit (i.e., H0 is that the model fits)
What if the chi-square goodness-of-fit test indicates a bad fit?
Try to figure out if there are missing predictors, if the linearity assumption holds, and/or if the conditional mean and conditional variance are very different
In count situation, what if conditional mean and variance aren't the same?
Can try to fit negative binomial
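A minimal Stata sketch (same hypothetical variables as above; the negative binomial relaxes the variance = mean restriction via an overdispersion parameter):

* negative binomial regression for overdispersed counts
. nbreg y x i.cat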
Poisson (for counts) regression summary
Poisson model is characterized by the very strong assumption that the conditional variance of the outcome variable equals the conditional mean

if bad fit, first check that the model is "appropriately specified", i.e., no omitted variables or bad functional forms

if the model looks bona fide, the assumption that conditional var = conditional mean should be checked

Poisson regression is estimated with MLE, * requires large sample *
Logistic regression
for binary situations (binomial); Y ~ Bin(n, pi)

logit(pi) = log(pi / (1 - pi)) is called the 'log odds', and pi / (1 - pi) is called the 'odds'
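A minimal Stata sketch (hypothetical binary outcome y, predictors x1 and x2):

* logit reports coefficients on the log-odds scale;
* logistic fits the same model but reports odds ratios
. logit y x1 x2
. logistic y x1 x2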
Generalized linear models and the 'link' function
Commonly used link functions
identity link: f(t) = t; if Y ~ Normal => OLS regression

log link: f(t) = log(t); if Y ~ Poisson => Poisson regression

logit link: f(t) = log(t / (1 - t)); if Y ~ Binomial => logistic regression

Cf. command 'glm'
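A minimal Stata sketch showing how 'glm' expresses each special case through its family and link (hypothetical y and x):

* the same three models, written as GLMs
. glm y x, family(gaussian) link(identity)
. glm y x, family(poisson) link(log)
. glm y x, family(binomial) link(logit)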
Interpreting logistic regression results
Relationship between 'odds' and Betas (logistic regression output)
Multiple logistic regression
How do we interpret the Beta-hats here? Analogous to multiple linear regression: Beta-hat_j is interpreted as the partial log-odds ratio.

What Beta-hat_j is, in words: when comparing the odds of two groups that differ by only one unit of Xj, Beta-hat_j is the log of the odds ratio between those two groups.
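The corollary in the document's notation (a standard fact, worth one extra line):

exp(Beta-hat_j) = the odds ratio for a one-unit increase in Xj, holding the other predictors constant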
Multiple logistic regression if interaction present
Logistic regression for grouped data
mind that in both cases you want fitted values between 0 and 1 (because they estimate probabilities)
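A minimal Stata sketch of the grouped case (hypothetical s = number of successes per group, n = group size, predictor x):

* binomial GLM where each row summarizes n Bernoulli trials
. glm s x, family(binomial n) link(logit)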
Logistic reg for grouped data vs ordinary logistic reg