30 Cards in this Set

Influence Analysis (definition and 4 types)
One data point may have a significantly larger impact on the obtained statistics than the other data points; a single data point may drastically increase or decrease the regression coefficient, t value, F ratio, or other statistics (that one data point is more influential than the others). 4 types: leverage, Cook's D, DFBETA, standardized DFBETA
Leverage
Examines the influence of x scores on the obtained statistics.
All other things being equal, the more an x score deviates from the mean of the x scores, the larger its leverage.
The maximum leverage score is 1.
Higher leverage = more impact on the obtained statistics.
Shortcoming: leverage is only concerned with x scores; in other words, it is only concerned with the IV.
It only allows relative comparisons; it tells you that a point had an impact, not how much.
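A quick sketch with made-up x scores (not the course data set), using the simple-regression leverage formula h = 1/N + (xi − mean of x)2 / SSx, the same bracket that appears later in the SRESID equation:

```python
import numpy as np

# Hypothetical x scores; leverage grows as a score moves away from the mean of x,
# and the maximum possible leverage is 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])   # 10.0 sits far from the mean
N = len(x)
ss_x = np.sum((x - x.mean()) ** 2)
leverage = 1.0 / N + (x - x.mean()) ** 2 / ss_x

print(leverage)        # the last point has by far the largest leverage
print(leverage.sum())  # leverages sum to the number of estimated parameters (2 here)
```

Note the values only support relative comparisons: the extreme point clearly has more leverage, but the number alone does not say how much it changed the results.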
Cook's D
Examines the impact of x scores, y scores, or both on the obtained statistics; there is no maximum score.
Cook's D can be above 1, but just as with leverage you are only making a relative comparison.
Although both Cook's D and leverage tell us that a data point has an impact, neither says how much the impact is.
Problem: you have to calculate a Cook's D for every single data point.
Leverage and Cook's D are only measures of relative influence; they don't tell us how much a data point is influencing the results.
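Because Cook's D has no fixed cutoff, a sketch like the following (made-up data; the formula is the standard one combining each point's residual with its leverage) simply ranks the points by influence:

```python
import numpy as np

# Hypothetical data; the standard formula is D_i = (e_i^2 / (p*MSE)) * (h_i / (1-h_i)^2),
# where h_i is the point's leverage and p the number of estimated parameters (a and b).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 12.0])   # the last point is suspect

N, p = len(x), 2
X = np.column_stack([np.ones(N), x])            # design matrix with intercept
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
mse = resid @ resid / (N - p)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages (hat-matrix diagonal)
cooks_d = (resid ** 2 / (p * mse)) * (h / (1 - h) ** 2)

print(cooks_d)   # relative comparison only: the largest value is the most influential
```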
DFBETA
Indicates changes to the y-intercept and to the regression coefficient after deleting a single data point.
First calculate the regression coefficients with all of the data and then without the one data point, to see if there is a difference.
All data: y' = 5.05 + .75x
Deleted x=5, y=10: y' = 5.19 + .68x
Y-intercept (old minus new) = 5.05 − 5.19 = −.14 = DFBETAa (difference between the old a (y-intercept) and the new)
Reg. coef. = .75 − .68 = .07 = DFBETAb (difference between the old and new b (slope/reg. coef.))
OLD minus NEW (new = the fit after deleting the data point). Problem with plain DFBETA: it is difficult to know how large a change is, because the values of the y-intercept and regression coefficient are influenced by the scale of measurement in the study (seconds vs. minutes).
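The delete-one-point procedure can be sketched directly; the data below are hypothetical (the card's own coefficients come from a data set not shown here):

```python
import numpy as np

def fit_line(x, y):
    """Return (a, b) for y' = a + b*x via least squares."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Made-up data; the last point is the one we suspect and delete.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.9, 4.2, 4.8, 10.0])

a_all, b_all = fit_line(x, y)
keep = np.arange(len(x)) != 4        # delete the suspect data point
a_del, b_del = fit_line(x[keep], y[keep])

dfbeta_a = a_all - a_del             # OLD minus NEW for the y-intercept
dfbeta_b = b_all - b_del             # OLD minus NEW for the slope
print(dfbeta_a, dfbeta_b)
```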
Standardized DFBETA
Standardizing the DFBETAs is the way to compare across different scales of measurement.

Indicates changes to the y-intercept and to the regression coefficient, expressed in standard-error units, after deleting a single data point.
Simple Linear Regression (how many IVs/DVs; what is the equation; what are the symbols for correlation and variance)
One IV, one DV → linear regression. Prediction uses the equation without the e.
y = a + bx + e
y = score on the DV (y score)
a = y-intercept
b = slope/regression coefficient/beta weight
x = score on the IV (x score)
e = error or residual
1 IV, 1 DV
r = the correlation between the one IV and the one DV
r2 = proportion of variance accounted for
Multiple Linear Regression (# of IVs/DVs; correlation/variance)
Two or more IVs, one DV → multiple regression (2+ DVs is multivariate). Prediction uses the equation without the e.
y = a + b1x1 + b2x2 + b3x3 + … + bkxk + e
y = score on the DV (y score)
a = y-intercept
e = error or residual
b1 = regression coefficient of the first IV
x1 = score on the first IV
b2 = regression coefficient of the second IV
x2 = score on the second IV
b3 = regression coefficient of the third IV
x3 = score on the third IV
bk = regression coefficient of the kth IV
xk = score on the kth IV
2 or more IVs, 1 DV
R = the correlation of all of the IVs together with the DV (total correlation)
R2 = proportion of variance accounted for
Prediction
Multiple Regression equation maximizes prediction by optimally weighting each IV via the slope/regression coefficient/beta weight
It allows us to combine information from 2 or more IVs in the best manner possible
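A minimal sketch of that optimal weighting with hypothetical data (x1, x2, and y are made up): least squares picks a, b1, and b2 so that squared prediction error is minimized, which is why the residuals it leaves behind average to zero.

```python
import numpy as np

# Hypothetical two-IV example: the fitted equation y' = a + b1*x1 + b2*x2
# weights each IV so that squared prediction error is as small as possible.
x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
x2 = np.array([1.0, 1.0, 2.0, 3.0, 5.0, 8.0])
y  = np.array([5.0, 9.0, 14.0, 18.0, 25.0, 33.0])

X = np.column_stack([np.ones(len(y)), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef

y_pred = a + b1 * x1 + b2 * x2   # prediction uses the equation without e
residual = y - y_pred            # e = actual minus predicted
print(a, b1, b2)
```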
Proportion of Explained Variance
A Multiple Regression Equation (MRE) can be used to determine the total variance accounted for by all of the IVs (R2y.12)
Testing for Statistical Significance
The MRE can be used to test the statistical significance of the overall R2.
It can also be used to determine whether each individual IV is significant.
Relative Importance
The MRE can be used to determine the relative importance of each IV in explaining the DV; in other words, it can tell us which IVs are more important.
b
Unstandardized beta weight: use with raw scores and when the research is applied. It is unstandardized and therefore influenced by the scale of measurement (if one study measured inches and another measured feet, the inches study would yield values 12 times larger). So the size of the beta weight is not meaningful; only its significance level is. One cannot draw conclusions about relative importance from unstandardized beta weights: just because one beta weight is twice as large as another doesn't mean it is twice as important. You also can't compare beta weights across studies, because they may use different scales of measurement.
Fancy B
Standardized beta weight: use with standardized scores and when the research is theoretical. When dealing with standardized scores, the mean of the z scores is 0 and the y-intercept is 0.
For standardized scores:
y' = a + b1x1 + b2x2 (raw scores) becomes
zy' = B1z1 + B2z2
zy' = predicted standard score of y
B1 = standardized beta weight of x1
z1 = standardized score of x1
Used to compare across studies.
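One way to see the Fancy B weights (hypothetical data, not the course data set): convert every variable to z scores and refit; the intercept collapses to 0 and the remaining slopes are the standardized beta weights.

```python
import numpy as np

def zscore(v):
    """Standardize a variable: mean 0, SD 1."""
    return (v - v.mean()) / v.std(ddof=1)

# Made-up data; refitting the regression on z scores yields the standardized
# beta weights, and the y-intercept drops to 0.
x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
x2 = np.array([1.0, 1.0, 2.0, 3.0, 5.0, 8.0])
y  = np.array([5.0, 9.0, 14.0, 18.0, 25.0, 33.0])

Z = np.column_stack([np.ones(len(y)), zscore(x1), zscore(x2)])
coef, *_ = np.linalg.lstsq(Z, zscore(y), rcond=None)
intercept, B1, B2 = coef
print(intercept, B1, B2)   # intercept is 0 (within rounding) on standardized scores
```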
Testing R2
Use the F equation: F = MSreg / MSres = (SSreg / dfreg) / (SSres / dfres)

dfreg = # of IVs (k)
dfres = n − k − 1

It is testing whether the combination of all the IVs is related to the one DV, i.e., whether all the IVs together explain significant variance in the DV. If F (for R2) is significant, it means that the IVs as a set account for a significant portion of variance in the one DV.
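The F ratio can be sketched as a tiny helper. The SS values below are the ones used in the R2 worked example; the sample size is not stated on that card, so n = 20 is an assumed, illustrative value:

```python
def f_for_r2(ss_reg, ss_total, n, k):
    """F = MS_reg / MS_res = (SS_reg/df_reg) / (SS_res/df_res)."""
    ss_res = ss_total - ss_reg
    df_reg = k             # number of IVs
    df_res = n - k - 1
    return (ss_reg / df_reg) / (ss_res / df_res)

# SS_reg = 101.60 and SS_total = 140.55 come from the cards; n = 20 is assumed.
F = f_for_r2(101.60, 140.55, n=20, k=2)
print(round(F, 2))
```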
Squared Multiple Correlation Coefficient
Indicates the total variance of the DV accounted for by ALL of the IVs.

R2 = SSreg / SStotal

SSreg = b1∑x1y + b2∑x2y = .7046(95.05) + (.519)(58.50) = 101.60
SStotal = 140.55

R2 = SSreg / SStotal = 101.60 / 140.55 = .72

An R2 of .72 means that these 2 IVs explain 72% of the variance in the DV.

R will always range between 0.0 and 1.0. Even with a negative correlation between two variables, you still explain a positive percent of variance. Think of it as two circles: if there is a correlation, there is some overlap between the circles, and the overlapping area is positive regardless of the sign of the correlation. With multiple IVs, even if every correlation were negative, the explained variance would still be positive, because squaring a negative r gives a positive r2. Little r can range from −1 to +1; the sign of a correlation only shows direction, while the squared correlation shows the total percent of variance explained.
Testing Regression coefficient
When testing the regression coefficients, we test whether one regression coefficient is significant after controlling for all other IVs.

(Venn diagram, not shown: the DV circle overlaps both IV circles, and the two IV circles also overlap each other.)
The r between an IV and the DV covers that IV's unique overlap with the DV plus the region all three circles share; the b between an IV and the DV covers only that IV's unique overlap; R2 covers both unique overlaps plus the shared region.

The partial slope of IV1 corresponds to IV1's unique overlap with the DV.
The partial slope of IV2 corresponds to IV2's unique overlap with the DV.
R2 = IV1's unique overlap + the shared region + IV2's unique overlap.
Because the shared overlap is removed from each slope, these are called partial regression coefficients.

The more the IVs are correlated with each other, the more difficult it is to get a significant regression coefficient.
Tests of R2 versus tests of b
R2 = do all the IVs together significantly explain variance in the one DV
b = does one IV significantly explain variance in the one DV after controlling for all other IVs
It is entirely possible to obtain a significant R2 yet not obtain a significant regression coefficient, but you can't have a nonsignificant R2 when each regression coefficient is significant.

Testing b is the same as testing B.
Confidence Intervals
Ex.: it usually takes Amanda 45 minutes to get to school, but sometimes 40 and sometimes 60; she is 95 percent sure it will take her anywhere between 40 and 60 minutes.

C.I. = b ± (tcritical)(Sb)
b = beta weight/regression coefficient/slope
tcritical = critical value of t using a 2-tailed test
Sb = standard error of b (gives the expected range, like the 40 to 60 in Amanda's example)
C.I. = .7046 ± (2.11)(.1752) = .3349 and 1.0743 (95% confident)
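Plugging the card's numbers into the formula (pure arithmetic, no assumptions beyond the values given):

```python
# Reproducing the card's 95% confidence interval: C.I. = b ± t_critical * S_b.
b = 0.7046        # regression coefficient
t_crit = 2.11     # two-tailed critical value of t (from the card)
s_b = 0.1752      # standard error of b

lower = b - t_crit * s_b
upper = b + t_crit * s_b
print(round(lower, 4), round(upper, 4))  # 0.3349 and 1.0743
```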
Violations of Assumptions
1. Measurement Error
a. Measurement errors in the DV
Do not bias the regression coefficient in either direction (they don't consistently increase or decrease it), but they do increase error, which leads to less accurate conclusions.
b. Measurement errors in the IV
Do bias the regression coefficient: they underestimate the true regression coefficient. This underestimation is due to the unreliability of the measurement; the more unreliability, the more underestimation.
If your measure has an alpha of .85, it is not perfect; the more unreliability there is, the more a correction for it will increase the value.
2. Specification Errors – "your model is wrong"
Involve the components and/or the relationships in a proposed theoretical model.
These are: omitting important variables, including unimportant variables, or theorizing that the relationship is linear when it is really curvilinear. (Omitting an important variable would be leaving out smoking as a cause of lung cancer; including an unimportant one would be including blue eyes as a cause of lung cancer.)
Regression Model vs. Correlational Model
Regression Model – x values are fixed. If you originally used a 5-point Likert scale, you need to use the same scale in a replication.
Correlational Model – x values can be random; they don't have to be the same. You can have a 5-point scale and then a 7-point scale. It allows any analysis between 2 variables without the scale-of-measurement restriction.
Outliers (definition; are they good or bad; types)
Definition – something that lies or is situated away from, or is classed differently from, a main or related body. In stats, an outlier is a data point that is markedly different from the general trend in the data. Most people think outliers are bad because they drag down the regression coefficient, but the truth is that outliers can increase or decrease the correlation; they are bad because they result in an inaccurate conclusion, an inaccurate estimate of the true relationship between two variables.
1. Outliers can increase the correlation
Outliers can make it appear that there is more of a linear relationship than there is. If a scatterplot looks like a circle and you add an extreme outlier, the cloud takes on a more oval shape, which means a stronger apparent linear relationship and a higher correlation.
2. Outliers can decrease the correlation
If the data are in a fairly linear (oval) shape and there is an extreme outlier, it may make the cloud more circular, which leads to an inaccurate estimate of the relationship between the 2 variables.
Methodological and behavioral
Methodological Outliers
Due to the researcher's mistakes. These include data entry errors (typing 110 instead of 11), instrument malfunction, inconsistent administrations, and measurement errors.
Behavioral Outliers
Outliers due to the actual performance of the subjects. (If you test the effects of alcohol on baseball hitting and randomly choose people, but one of them is on the Xavier baseball team, then even really drunk he may hit better than the average person; he would be an outlier.) Sometimes the outliers are more interesting: how can someone hit 19 out of 20 baseballs while intoxicated? You may want to modify your model to incorporate studying the outlier.
3 ways of detecting outliers
Standardized Residuals - ZRESID
Studentized Residuals - SRESIDs
Studentized Deleted Residuals - SDRESID
ZRESID
ZRESID = residual / standard deviation of the residuals (computed from all the data points, including the alleged outlier)

The residual is the difference between what we predicted and what we actually got: residual = y − y' = 10 − 8.80, so our residual = 1.20.
ZRESID = 1.20 / 2.446 (from the previous section) = .4906
If |ZRESID| ≥ 2, the point is an outlier; if |ZRESID| < 2, it is not.
Conceptually you would do this for every data point and flag any with |ZRESID| > 2, but in practice you just check the suspicious ones, because it is unrealistic to run it for each data point.
ZRESID adjusts the numerator, not the denominator: it assumes all the residuals have the same variance, so the denominator is computed across all data points (all values of x). The problems are that you must calculate it separately for each suspected outlier, and that it assumes equal residual variance everywhere.
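The arithmetic above as a sketch (all values from the card):

```python
# ZRESID divides a point's residual by the standard deviation of ALL the
# residuals; the card's numbers are residual = 1.20 and SD = 2.446.
residual = 10 - 8.80        # y minus y' for the suspect point
sd_residuals = 2.446        # SD of all residuals (from the previous section)

zresid = residual / sd_residuals
print(round(zresid, 4))     # 0.4906
is_outlier = abs(zresid) >= 2
print(is_outlier)           # well under the |2| cutoff, so not an outlier
```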
SRESID
Studentized residuals look at one value of x at a time (e.g., x = 5).
It makes an adjustment to what we just did:
Sei = Sy.x √(1 − [(1/N) + ((xi − mean of X)2 / SSx)])
Sei = std. deviation of the residuals at a particular value of x
Sy.x = std. deviation of the residuals (all of the data points)
xi = x score of the specific data point (the suspected outlier)
N = sample size
mean of X = mean of the x scores

Sei = 2.446 √(1 − [(1/20) + ((5 − 3.0)2 / 40)]) = 2.2551
SRESID = residual / Sei = 1.20 / 2.2551 = .5321
When x = 5 the denominator is always 2.2551, but for another value of x you would have to compute a different denominator, whereas in ZRESID the denominator is always the same. SRESID is a more sensitive, more precise measure.
To determine if a value is an outlier:
If |SRESID| is greater than the critical value of t, it is an outlier. In this case our N is 20 and our df is 18, so the critical value is 2.101.
Problem: again, the alleged outlier is included in the denominator, and having it in there makes it more difficult to detect the outlier.
SDRESID
A shortcoming of both SRESIDs and ZRESIDs is that the denominator includes the suspected outlier. Including the outlier makes it harder to detect: the outlier adds variance, the sum of squares increases, and you are less likely to flag it.
Ex.: if Sei = 3, then 1.20/3 = .40; if Sei = 2, then 1.20/2 = .60 — a smaller denominator makes the residual stand out more.
So we base the standard deviation on 19 data points (instead of 20) and then make the same adjustment.
If we delete that data point and compute the standard deviation of the residuals from the 19 remaining points, the value is no longer 2.446; instead it is 2.497.
Everything else is the exact same equation as the SRESID; you still use 20 for N.
Sei = 2.497 √(1 − [(1/20) + ((5 − 3.0)2 / 40)]) = 2.302
SDRESID = residual / Sei = 1.20 / 2.302 = .5213
You still compare against the critical value of t to decide whether this is an outlier (df = n − k − 1; look up t at .05, and if the absolute value exceeds the critical value you have an outlier).
This one is the most sensitive because it does not include the outlier.
You use the same look-up steps for SDRESID as for SRESID.
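Both computations in one sketch (all numbers from the cards: residual 1.20, N = 20, xi = 5, mean of x = 3.0, SSx = 40, and residual SDs of 2.446 with the suspect point and 2.497 without it):

```python
import math

# SRESID and SDRESID divide the same residual by a standard error adjusted for
# the point's position in x; SDRESID additionally recomputes the residual SD
# with the suspect point deleted.
residual = 1.20
n, x_i, x_bar, ss_x = 20, 5.0, 3.0, 40.0
adjust = math.sqrt(1 - (1 / n + (x_i - x_bar) ** 2 / ss_x))

se_with = 2.446 * adjust      # SD of residuals from all 20 points -> 2.2551
se_without = 2.497 * adjust   # SD recomputed from the 19 remaining points -> 2.302

sresid = residual / se_with
sdresid = residual / se_without
print(round(sresid, 4), round(sdresid, 4))  # 0.5321 and 0.5213
```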
Summing the detecting outliers
ZRESID = .4906 (residual over the SD of all data points; outlier if the value is greater than |2|)
SRESID = .5321 (residual at each level of x (here x = 5); outlier determined by t critical)
SDRESID = .5213 (deletes the data point under scrutiny; like SRESID but excludes the potential outlier; best, because it is most likely to identify a data point as an outlier when it really is one; outlier determined by t critical)
Random Information
No relationship between x1 and x2:
rx1y = .20 and rx2y = .10
When the two IVs are uncorrelated, the squared correlations simply add: R2 = r2x1y + r2x2y (= .04 + .01 = .05).

When x1 and x2 are related (their r does not equal 0):

(Venn diagram, not shown: the x1 and x2 circles overlap each other as well as the y circle.)
r2x1y = where x1 and y overlap
r2x2y = where x2 and y overlap
Both include the region where x1, x2, and y all overlap, but you should only count that overlap once, so
R2 < r2x1y + r2x2y
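A quick check of the uncorrelated case, using deterministic made-up data chosen so that x1 and x2 are exactly uncorrelated; R2 then equals the sum of the two squared correlations with the DV:

```python
import numpy as np

def corr(a, b):
    """Pearson correlation between two variables."""
    return np.corrcoef(a, b)[0, 1]

# x1 and x2 are orthogonal by construction (their correlation is exactly 0).
x1 = np.array([1.0, 1.0, -1.0, -1.0])
x2 = np.array([1.0, -1.0, 1.0, -1.0])
y  = np.array([3.0, 1.0, 0.5, -0.5])

r1, r2 = corr(x1, y), corr(x2, y)

X = np.column_stack([np.ones(4), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coef
r_squared = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(round(r_squared, 6), round(r1 ** 2 + r2 ** 2, 6))  # the two values match
```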
Notation
A note regarding Notation
2 variables = R2x1x2 or R2y.12 → 1 DV, 2 IVs (the first 2)
5 variables = R2x1x2x3x4x5 or R2y.12345 → 1 DV, 5 IVs
Only the 1st and 3rd IVs = R2y.13 → 1 DV, first IV and third IV