78 Cards in this Set
- Front
- Back
Normal Distribution
|
3 properties:
1. Symmetrical
2. Unimodal (mean, median, and mode are all in the same place, at the center of the distribution)
3. Asymptotic (the upper and lower tails of the distribution never touch the baseline)
Sometimes referred to as a Gaussian distribution. |
|
Population Distribution
|
the distribution of scores in a population
ex: distribution of IQ scores for everyone in a country |
|
Distribution of a Sample
|
the distribution of scores in a sample of a given size
ex: IQ scores of the students in class, as a sample of the TAMU population |
|
Sampling Distribution
|
the distribution of some statistic in all possible samples of a given size
ex: mean, or a slope coefficient |
|
Central Limit Theorem
|
The CLT states that, as sample size (n) becomes large: a) the sampling distribution of the mean becomes approximately normal, regardless of the shape of the variable’s frequency distribution; b) the sampling distribution is centered on the variable’s population mean; and c) the standard deviation of this sampling distribution, called its standard error, approaches the variable’s standard deviation in the population divided by the square root of the sample size.
|
|
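The three parts of the CLT can be checked with a small simulation. This is only a sketch: the skewed (exponential) population, the seed, and the sample sizes are assumptions chosen for illustration, not part of the card.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical, strongly skewed population (exponential, mean and SD near 1).
population = [random.expovariate(1.0) for _ in range(50_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

def sample_means(n, draws=2_000):
    """Means of `draws` random samples of size n, drawn with replacement."""
    return [statistics.fmean(random.choices(population, k=n)) for _ in range(draws)]

for n in (5, 30, 100):
    means = sample_means(n)
    observed_se = statistics.stdev(means)   # SD of the sampling distribution
    predicted_se = pop_sd / n ** 0.5        # population SD / sqrt(n)
    print(n, round(statistics.fmean(means), 3),
          round(observed_se, 3), round(predicted_se, 3))
```

As n grows, the mean of the sample means should stay near the population mean while the observed standard error tracks the population SD divided by sqrt(n).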
SSE
|
sum of the squares of the errors
SSE = Sum(Yi - Y-hati)^2 [each prediction error is squared before summing; this is not the square of the TPE] |
|
R-squared
|
coefficient of determination
= ESS/TSS. Indicates the explanatory power of the regression model. It records the proportion of variation in the dependent variable that is explained or accounted for by the independent variable(s). Varies from 0 to +1. |
|
TSS
|
Total Sum of Squared Deviations
Sum(Yi - Y-bar)^2. The TSS indicates the total variation of the dependent variable that we want to explain. The total variation may be divided into two parts: the part accounted for or explained by the regression equation (ESS); and the part that the regression equation cannot account for, i.e., the residual part (RSS). |
|
ESS
|
Explained sum of squared deviations
Sum(Y-hati - Y-bar)^2 |
|
RSS
|
error or residual (unexplained) sum of squared deviations
Sum(Yi - Y-hati)^2 |
|
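The decomposition TSS = ESS + RSS and the ratio R-squared = ESS/TSS can be verified on a tiny data set. A minimal sketch, assuming a bivariate OLS fit; the five data points are invented for illustration.

```python
# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope (b) and intercept (a) for the bivariate case.
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual variation
r_squared = ess / tss

print(tss, ess, rss, r_squared)
```

Up to floating-point rounding, ESS and RSS add up to TSS, so R-squared can equivalently be computed as 1 - RSS/TSS.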
r (correlation coefficient)
|
type of standardized version of the slope, and does not depend on units of measurement.
The Pearson correlation coefficient, r, is sort of a standardized version of the slope. It is a type of slope for which the value, unlike that of b, does not depend on the units of measurement. The correlation is the value the slope would assume if the measurement units for the two variables are such that their standard deviations are equal. |
|
properties of r
|
1. r is only valid when a straight line is a reasonable model for the relationship. It measures the strength of the linear association between X and Y.
2. Unlike the slope b, the value of r must fall between -1.0 and +1.0.
3. r has the same sign as b. Because r equals the slope b multiplied by the ratio of two (positive) standard deviations, the sign is preserved.
4. r = +/- 1.0 when all the sample points fall exactly on the prediction line. These values correspond to perfect positive and negative linear associations.
5. The larger the absolute value of r, the stronger the degree of linear association.
6. The value of r does not depend on the variables’ units of measurement. In the earlier example of murder and poverty rates among the 50 states, r = .63, irrespective of whether Y is measured as murders per 1,000,000 population or as murders per 100,000 population.
7. Unlike the slope b, r treats the two variables symmetrically. The prediction equation using Y to predict X has the same correlation as the equation using X to predict Y. If the murder rate is used to predict the poverty rate, rather than, as in the example, using the poverty rate to predict the murder rate, r = .63 in both (but the b’s would most likely not be the same). |
|
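Properties 3, 6, and 7 follow from writing r as the slope times the ratio of standard deviations, r = b * (sx / sy). The sketch below uses invented data and hypothetical helper names (`slope`, `r_from_slope`) to show that the two slopes differ while r does not.

```python
import statistics

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def slope(predictor, response):
    """OLS slope b for predicting `response` from `predictor`."""
    p_bar = statistics.fmean(predictor)
    r_bar = statistics.fmean(response)
    num = sum((p - p_bar) * (r - r_bar) for p, r in zip(predictor, response))
    den = sum((p - p_bar) ** 2 for p in predictor)
    return num / den

def r_from_slope(predictor, response):
    """r = b * (SD of predictor / SD of response)."""
    return (slope(predictor, response)
            * statistics.stdev(predictor) / statistics.stdev(response))

r_yx = r_from_slope(x, y)              # X used to predict Y
r_xy = r_from_slope(y, x)              # Y used to predict X
y_rescaled = [10 * v for v in y]       # same Y in different units
r_rescaled = r_from_slope(x, y_rescaled)

print(slope(x, y), slope(y, x))        # the two slopes differ
print(r_yx, r_xy, r_rescaled)          # the three correlations agree
```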
t-test
|
= b / SEb
We want a t-value with a low probability under H0, meaning that it is unlikely that the sample came from a population where H0 is true. In large samples, an absolute t-value above about 1.96 corresponds to a two-tailed probability below 0.05. |
|
Adjusted R-squared
|
R2(adj.) is an adjustment to the value of R2 in order to obtain a less biased estimate of the R2 coefficient. Statisticians have shown that the value of R2 for any sample drawn from a larger population tends to be slightly biased upwards, that is, it tends to overestimate the value of R2 for the population from which it was drawn. R2 for the population is known as Ρ2, where Ρ is not the letter P but the upper-case Greek letter rho.
R2 is slightly biased upwards because the sample data fall closer to the sample prediction equation than to the true population regression equation. This bias is greater if n (the sample size) is small and/or if K (the number of predictors, i.e., independent variables) is large. R2(adj) approaches R2 in value as n increases. If R2 is quite low, the value of R2 (adj) may become negative. |
|
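The card does not state a formula. A commonly used version, assumed here, is R2(adj) = 1 - (1 - R2)(n - 1)/(n - K - 1). The sketch shows the behaviours described above: a larger penalty for small n, convergence to R2 as n grows, and a negative value when R2 is low.

```python
def adjusted_r_squared(r2, n, k):
    """Common adjustment (an assumption, not taken from the card):
    1 - (1 - R^2) * (n - 1) / (n - k - 1), with n cases and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

small_n = adjusted_r_squared(0.30, n=20, k=3)     # small sample: noticeable drop
large_n = adjusted_r_squared(0.30, n=2000, k=3)   # large sample: close to R^2
low_r2 = adjusted_r_squared(0.05, n=20, k=3)      # low R^2: can go negative

print(small_n, large_n, low_r2)
```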
Multiple Regression Coefficients
|
The interpretation of the intercept a is easy and is an extension of the bivariate case; a is the value of Y when each independent variable is zero.
The interpretation of b requires more attention. The value of the slope bk equals the average change in Y associated with a one unit change in Xk, when the other independent variable(s) are held constant. Therefore, the slope in the multivariate case is sometimes referred to as a partial slope or as a partial regression coefficient. |
|
Multiple Regression & Spurious Relationships
|
A major contribution of multiple regression is that it enables us to test to see if a previous bivariate relationship may be spurious, that is, if the previous bivariate relationship is not a real one but is caused by the fact that the two variables in the bivariate equation are both caused by a third variable not included.
Consider the demonstrated bivariate association between height and mathematics scores. Taller children perform better on math than shorter children. But if you run a multivariate regression with math scores as Y and both height and age as X variables, the slope of height on math scores usually becomes insignificant. |
|
F-test
|
1. The F-test allows us to evaluate the complete regression; it is a global test; it tests the H0 that all the regression coefficients in the real population are zero. The significance level of the F-test also gives us the significance of the R2. If you wish to determine if a specific b coefficient is significant, use the t-test for that coefficient.
2. We can also use the F-test to compare two regression models, one of which is more complex than the other. |
|
Tolerance
|
The R2 produced by regressing a particular X variable on the other X variables may then be subtracted from 1. This value is known as the “tolerance” (or the independent variation) of the X variable. If the R2 of an X variable regressed on the other X variables is .29, this means that the X variable has a tolerance of .71, that is, 71% of the variation in that particular X variable is independent of the other X variables in the model. The higher the tolerance of an X variable, the less the presence of a problematic amount of collinearity. Watch out for low tolerances, say < .40 or <.35.
|
|
VIF
|
Variance inflation factor
VIF = reciprocal of the tolerance (1/tolerance). The larger the VIF (the smaller the “tolerance”), the greater the multicollinearity in the model that is caused by the particular X variable. The square root of the VIF of a particular X variable tells you how much larger the standard error for the b coefficient of that X variable is, compared with what it would be if that X variable were uncorrelated with the other X variables in the regression equation. |
|
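The tolerance and VIF arithmetic can be traced with the .29 example from the Tolerance card. A minimal sketch; the function names are hypothetical.

```python
import math

def tolerance(r2_on_other_x):
    """1 minus the R^2 from regressing one X variable on the other X variables."""
    return 1 - r2_on_other_x

def vif(tol):
    """Variance inflation factor: the reciprocal of the tolerance."""
    return 1 / tol

tol = tolerance(0.29)            # 0.71 of the variation is independent
v = vif(tol)                     # about 1.41
se_inflation = math.sqrt(v)      # standard error is about 1.19 times larger

print(tol, v, se_inflation)
```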
Regression Assumptions
|
I.1. There is no specification error.
I.1a. The relationship between Xi and Yi is linear.
I.1b. No relevant independent (X) variables have been excluded.
I.1c. No irrelevant independent (X) variables have been included.
I.2a. The Y variable is quantitative, continuous, and unbounded; the X variables are quantitative or dichotomous; all variables are measured without error.
I.2b. All X variables have nonzero variance, that is, each independent variable has some variation in value.
I.2c. There is no perfect collinearity (i.e., there is no exact linear relationship) between two or more of the X variables.
I.2d. For each observation, the expected value of the error term is zero.
I.2e. The variance of the error term is constant for all values of Xi. Another way of saying this is that the error term is homoscedastic if, across each set of values for the k independent variables, the variance is constant at a value sigma^2.
I.2f. The error terms are uncorrelated, that is, there is no autocorrelation.
I.2g. Each independent variable is uncorrelated with the error term.
I.2h. The error term, Ei, is normally distributed. |
|
BLUE
|
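when OLS is most efficient
Best Linear Unbiased Estimator: when the regression assumptions hold, OLS has the smallest variance among linear unbiased estimators |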
|
Normality Assumption
|
The normality assumption does not mean that all the variables in the regression equation must be normally distributed. The only “variable” that is assumed to have a normal distribution is the error term, which is something we can’t observe directly.
But again, nonnormal e distributions often result from badly skewed Y and/or X distributions. So here are some ways of appraising whether the X and Y variables have normal distributions. Compare the mean and median; in a normal distribution they are the same. Also look at kurtosis and skewness values (which equal 3 and 0, respectively, in a normal distribution). Also look at graphs of the distribution of the variables in the model. |
|
Tukey's Ladder
|
If you have a non-normal distribution:
Powers greater than 1 shift weight to the upper tail of the distribution and thereby reduce negative skew. The higher the power, the stronger this effect. Powers less than 1 pull in the upper tail and may therefore reduce positive skew. The lower the power, the stronger this effect. Natural logs have the effect of bringing in the positive outliers. The higher values are compressed, pulling in the upper tail of the distribution. |
|
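The effect of moving down the ladder can be seen by computing skewness before and after a transformation. This is a sketch on invented, positively skewed data; the moment-based skewness formula (mean of cubed z-scores) is an assumption, since the card does not name one.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores (population SD)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

# Hypothetical positively skewed data with a long upper tail.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]

skew_raw = skewness(raw)
skew_sqrt = skewness([math.sqrt(v) for v in raw])   # power 1/2
skew_log = skewness([math.log(v) for v in raw])     # natural log

print(skew_raw, skew_sqrt, skew_log)
```

For these values the square root reduces the positive skew and the natural log reduces it further, matching the card's ordering of the transformations by strength.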
Standard Error
|
The standard error is the standard deviation of the sampling distribution.
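SEx-bar = SD / sqrt(N), that is, the standard deviation divided by the square root of the sample size. The symbol for a standard error is a lowercase sigma subscripted with the statistic. A small standard error indicates little sample-to-sample variation, so that most b's and a's are close to B (beta) and A (alpha). Large SEs indicate the opposite. |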
|
TPE (total prediction error)
|
= Sum (Yi - Y-hati)
= Sum (difference between observed & predicted) |
|
Slope (b)
|
*average* change in Y associated with one unit change in X
|
|
Intercept (a)
|
indicates the point where the regression line "intercepts" with the Y-axis
|
|
Covariance
(sxy) |
A useful summary statistic is known as the covariance (i.e., co-vary), and is designated as sxy, the covariance of X and Y. For each observation, we subtract the mean of Y, or Y-bar, from its Y value, and we subtract the mean of X, or X-bar, from its X value. We multiply the two differences together and sum the products over all the observations; we then divide this sum by the number of observations minus 1.
= [Sum(Xi-X-bar)(Yi-Y-bar)] / (N-1) But all that the covariance statistic is really useful or valuable for is its indication of the sign (+ or -) of a relationship. It tells us nothing about the strength of a relationship. The covariance statistic produces a raw number which has no theoretical upper bound. |
|
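The steps above translate directly into code. A minimal sketch on invented data; it also shows why the raw covariance says little about strength, since changing the units of Y changes its size.

```python
def covariance(xs, ys):
    """Sample covariance: sum of cross-products of deviations, over N - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = covariance(x, y)                            # positive: X and Y rise together
s_xy_rescaled = covariance(x, [100 * v for v in y])  # same data, Y in new units

print(s_xy, s_xy_rescaled)
```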
Type I Error
|
rejecting the null hypothesis when in fact it is true.
|
|
Type II Error
|
failing to reject H0 when it is in fact false
|
|
Partial Correlation
|
the correlation coefficient between two variables with the other variables held constant (the Pearson r does not control for other variables)
|
|
Interaction
|
When an X variable's effect depends on the values of other X variables.
The most common approach to modeling interaction introduces cross-product terms of the explanatory variables into the multiple regression model. Ex: the higher the SES, the weaker the effect of life events on mental impairment. |
|
homo/hetero-scedasticity
|
The variance of the error term is constant for all values of Xi. This is the assumption of homoscedasticity.
The word homoscedastic is from the Greek word skedastos, which means able to be scattered, which itself is from skedannunai, to scatter. The word literally means having equal scatter or variation; having equal variances. The opposite of homoscedasticity (which is what we strive for) is heteroscedasticity (which is what we want to avoid). This is an important regression assumption. It assumes that the spread of the errors of prediction is not related to the values of the independent variable. Violating this assumption will not introduce bias into the OLS estimate of the slope, but will bias the estimate of the standard error of the slope. |
|
Resistant vs Robust Estimators
|
An estimate is resistant if its value is not much affected by small changes in sample data.
A robust estimator performs well even when there are small violations of assumptions about the underlying population (e.g., an error term that is not really Gaussian). |
|
Probability
|
Probability is the likelihood that a given event will occur.
In gambling, the term probability takes the form of a specific mathematical expression; it is the frequency of a given outcome divided by the total number of all possible outcomes. |
|
Odds
|
The likelihood of a given event occurring, compared to the likelihood of the same event not occurring.
|
|
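The Probability and Odds cards are linked by a simple conversion, odds = p / (1 - p). A short sketch, with hypothetical function names, showing both directions:

```python
def odds_from_probability(p):
    """Odds of an event: probability it occurs over probability it does not."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return p / (1 - p)

def probability_from_odds(odds):
    """Inverse conversion: odds / (1 + odds)."""
    return odds / (1 + odds)

even = odds_from_probability(0.5)      # even odds
likely = odds_from_probability(0.8)    # about 4 to 1 in favour
back = probability_from_odds(likely)   # recovers the probability

print(even, likely, back)
```

Unlike a probability, odds have no upper bound: they exceed 1 whenever the event is more likely to occur than not.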
Ordinal Variable
|
variable that is categorical and ordered; "poor" "good" "excellent"
very liberal, slightly liberal, moderate, etc. |
|
Nominal Variable
|
a variable that is categorical but not ordered
|
|
Censoring
|
In event history analysis, censoring exists when incomplete information is available about the duration of the risk period because the observation period is limited.
A case is right censored if, by the end of the observation period, the event has not occurred to the case. A subject can also be left censored if data are not available at the beginning of the risk period. |
|
crude death rate
|
the total number of deaths per year per 1000 people; sum of the weighted ASDRs
CDR = (Deaths / Midyear Population) * 1000. Crude because not all of the population is at equal risk of death; risk varies by many characteristics. |
|
age-specific death rate (ASDR)
|
This refers to the total number of deaths per year per 1000 people of a given age
Age-specific death rates, and not crude death rates, should be used to compare the mortality experiences of countries with known differences in age composition.
Ex: U.S. & Venezuela. The U.S. is an “older” country than Venezuela, that is, the U.S. has proportionately more older people than Venezuela. In contrast, Venezuela is a much “younger” country than the U.S. Because younger people die at lower rates than do older people, many (but not all) “young” countries have lower CDRs than “old” countries.
nMx = (deaths to persons aged x to x+n / mid-year population aged x to x+n) * 1000, where n is the width of the age group and x is the initial year of the age group.
Not a crude rate because it takes into account differential mortality by age group. |
|
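The claim that the CDR is a weighted sum of ASDRs, and that age structure alone can move it, can be checked numerically. The sketch uses two invented populations with identical ASDRs; the age groups, rates, and counts are hypothetical.

```python
# Hypothetical ASDRs (deaths per 1000), identical in both populations.
asdr = {"0-14": 2.0, "15-64": 4.0, "65+": 50.0}

# Hypothetical midyear populations with different age structures.
young_pop = {"0-14": 400_000, "15-64": 550_000, "65+": 50_000}
old_pop = {"0-14": 200_000, "15-64": 650_000, "65+": 150_000}

def crude_death_rate(pop, rates):
    """CDR = total deaths / total midyear population * 1000."""
    deaths = sum(pop[g] * rates[g] / 1000 for g in pop)
    return deaths / sum(pop.values()) * 1000

def weighted_asdr(pop, rates):
    """Sum of ASDRs weighted by each age group's share of the population."""
    total = sum(pop.values())
    return sum(pop[g] / total * rates[g] for g in pop)

cdr_young = crude_death_rate(young_pop, asdr)
cdr_old = crude_death_rate(old_pop, asdr)

print(cdr_young, cdr_old)
```

The older population shows nearly twice the crude rate even though every age group dies at the same rate in both, which is why crude rates mislead in such comparisons.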
Menarche
|
Refers to the age at which a female experiences her first menstrual cycle.
Typically occurs in early teens, though sometimes younger due to hormones in food and improved nutrition |
|
age-specific fertility rate (ASFR)
|
The annual number of births to women in a particular age group per 1000 women in that age group
Data obtained from a civil registration system or censuses. Formula: ASFR = (births to women aged x to x+n / midyear population of women aged x to x+n) * 1000 |
|
American Community Survey (ACS)
|
An annual survey conducted by the Census Bureau (CB) that has been instituted to replace the long form of the census
Takes a representative sample of the American people on typical census topics: economic, housing, demographic, and social variables. Important because of the yearly update. |
|
crude birth rate
|
CBR = (Births/Mid Year Population)*1000
Crude because the denominator includes the total population, not just the population at risk. Problematic because age structure can have substantial effects on crude rates; e.g., a developing-country population with many young people can have a high CBR. Standardization techniques are needed to refine comparisons. |
|
Current Population Survey
|
A sample survey conducted monthly by the Census Bureau (CB). Designed to represent the civilian non-institutionalized population, it obtains a wide range of socioeconomic and demographic data such as employment, unemployment, earnings, hours of work, age, sex, race, occupation, and industry.
|
|
Definitions of Death
(underlying cause vs. pattern of failure) |
Underlying cause- definition of death in a life table entails that every death is represented in just one d column so that the table is mutually exclusive and additive
Pattern of failure- the number of persons who leave the population for each type of chronic disease includes everyone who had that disease listed on their death certificate |
|
Demographic transition
|
Population shift from high fertility and mortality to low fertility and mortality
|
|
Diffusion Effect (in Fertility Transition)
|
Attempt to identify a mechanism that leads to the cumulative adoption of some behavior by more and more individuals even while their social position and the resources associated with them remain largely unchanged
Fertility declines take place under a wide variety of economic and mortality conditions and there is a tendency to be influenced by ethnic, linguistic, and religious boundaries |
|
Fecundity
|
The physiological capacity of a woman, man, couple, or group to reproduce
Infecund persons are also described as sterile. Women are most fecund during their 20s. For females, fecundity ranges from 0 to 30 children. Bongaarts's maximum fecundity is about 15 children per woman; this is the theoretical maximum if women engaged in natural fertility from age 12 to age 50. |
|
Fertility
|
Actual birth performance
One’s fertility is limited by one’s fecundity and is usually far below it |
|
General Fertility Rate
|
Number of births in a given year divided by midyear female population of childbearing age x 1000
Improves on the CBR because it includes only the population at risk. Masks differences in the rate of childbearing at different ages throughout the reproductive years. |
|
Gross Reproduction Rate
|
The sum of the ASFRs that include only live female births in the numerators
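Used to determine whether the population will grow, replace itself, or decline. Formula: GRR = sum of the ASFRs computed from live female births only (multiplied by the width of the age groups when the ASFRs are for grouped ages). Interpretation: the number of daughters expected to be born alive to a hypothetical cohort of 1000 women. The GRR measures daughters per woman; the TFR measures children. |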
|
Infant Mortality Rate
|
Number of deaths to children born alive from birth to exact age one year/ number of live births x 1000
Equal to NMR + PNMR (the neonatal plus the post-neonatal mortality rate). Best known and most widely available measure of mortality in early life. Key indicator of demographic development and health conditions in different countries. Problems: migration may affect the numerator but not the denominator, and some of the deaths are to children born at the end of the previous year (still under age 1). |
|
IPUMS
|
Integrated Public Use Microdata Series
Consists of microdata samples from US and international census records. Records are converted and made available to researchers through a web system. Based at the Minnesota Population Center. Provides consistent variable names, coding schemes, and documentation across all samples. |
|
Life Expectancy
|
Average number of years remaining to a group of persons who reached a given age
|
|
Life Span
|
Maximum age that humans as a species could reach under optimum conditions
The longest verified life was that of Jeanne Calment of France (122 years, 5 months); a claimed age of 128 for Tuti Yusupova of Uzbekistan has not been independently verified. Almost entirely biological. |
|
Life Table
|
A statistical model composed of a combination of age specific mortality rates for a given population
Single decrement: has only one way of leaving, namely mortality. Unabridged: mortality information for single years of life. Most are abridged: information by age group. |
|
Life Table Transient State vs. Absorbing State
|
Conventional life tables concern 2 states (life and death); multistate tables concern multiple states
Multistate models allow for movements between life (an active state) and death (an absorbing state), but also for possible movements among various types of active states. Absorbing states permit only entries. Transient states allow both entries and exits. |
|
Longevity
|
The ability to resist death
Has both biological and social components and varies according to these characteristics |
|
Multiple Decrement Life Table
|
An extension of the standard life table, takes into account multiple transitions between states
(Such as more than one cause of death or way to leave a population) |
|
National Survey of Family Growth
|
An ongoing series of sample surveys designed to provide current information about childbearing, contraception and related aspects of maternal and child health for the US
Used by the US Dept of Health and Human Services to plan health services and health ed programs |
|
Natural Fertility
|
The level of fertility in a population in which deliberate control of childbearing is not practiced
Characteristic of most populations prior to the onset of the demographic transition. Achieved by the Hutterites. |
|
Net Reproduction Rate
|
Average number of daughters born per woman (or per 1000 women) by the end of the childbearing years, for women subject to the ASBRs (age-specific birth rates) and survival rates of a given year
NRR = 1: exact replacement; NRR < 1: below replacement; NRR > 1: above replacement |
|
Replacement Fertility
|
Level of fertility needed for a population to replace itself
Average of 2.1 babies per woman |
|
Reproduction
|
Production of female births
|
|
Sex Ratio at Birth (SRB)
|
Usually defined as the number of boys born for every 100 girls
SRB = (# male births / # female births) x 100. Most societies have SRBs between 104 and 106. China has a high SRB (120). |
|
Taeuber Paradox
|
Attributed to Conrad Taeuber by Keyfitz
Essentially states that the elimination of a particular death risk (e.g., cancer) has little impact on cohort life expectancy because it in effect exposes the people who would have died to a whole new set of risks |
|
Total Fertility Rate (TFR)
|
Average number of children a hypothetical cohort of 1000 women would have if they survived their childbearing years and experienced a given year's age-specific fertility rates at each age
|
|
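The ASFR, TFR, and GRR cards fit together as one calculation. The sketch below uses invented counts for seven five-year age groups. Deriving the GRR from an assumed sex ratio at birth of 105, rather than from counts of female births, is a simplification chosen for illustration.

```python
# Hypothetical data: 100,000 women in each five-year age group (15-19 ... 45-49).
age_groups = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]
women = {g: 100_000 for g in age_groups}
births = dict(zip(age_groups, [4_000, 10_000, 11_000, 8_000, 4_000, 1_000, 100]))

WIDTH = 5      # years in each age group
SRB = 105      # assumed sex ratio at birth (boys per 100 girls)

# ASFR: births per 1000 women in each age group.
asfr = {g: births[g] / women[g] * 1000 for g in age_groups}

# TFR per woman: each woman spends WIDTH years exposed to each group's rate.
tfr = WIDTH * sum(asfr.values()) / 1000

# GRR per woman: same sum, counting daughters only (approximated via the SRB).
share_female = 100 / (100 + SRB)
grr = tfr * share_female

print(asfr)
print(tfr, grr)
```

Multiplying either result by 1000 gives the figure for a hypothetical cohort of 1000 women, which is how the cards state it.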
Family Limitation & Birth Control
|
FL depends on the number of children already born and refers specifically to behavior designed to stop childbearing altogether.
BIRTH CONTROL (BC) encompasses both behavior intended to stop births and deliberate attempts to space births, and may also apply to behavior outside of the family context, that is, in "illegitimate" relations. This distinction is theoretically important because it was the introduction and spread of stopping behavior (i.e., FL), and not spacing per se (which would involve BC), that was the key to the onset of the Demographic Transition (DT). In other words, the DT resulted from the shift from NF to FL. |
|
Singulate Mean Age at Marriage (SMAM)
|
Mean age at first marriage for a cohort of women or men who marry by age 50
Computed from information on current marital status in a single census or survey. Essentially the mean number of years lived in the single state, as implied by a schedule of age-specific percentages single. Analogous to the TFR because it is cross-sectional. |
|
Stable Population
|
A model or hypothetical population closed to migration with an unchanging relative age composition and a constant rate of change in its total size
Results from conditions of constant fertility and mortality rates over an extended period. Contrasting age distributions will evolve to identical stable age distributions if the fertility and mortality levels are the same. |
|
Stationary Population (LT interpretation)
|
In a life table, the nLx entry may be interpreted as either 1) the number of person years lived during the age interval x to x +n by the lx individuals alive at the beginning of the interval or 2) the stationary population within an interval- requires the assumption that 100,000 individuals are born each year
After a significant number of years and with no migration, a stationary population results. After more than 100 years, the 100,000 entering the population each year at birth would be exactly balanced by 100,000 dying at all ages. |
|
Spatial mobility (3 types)
|
1. local movement, i.e., short-distance change of residence within the same community;
2. internal migration, i.e., change of residence from one community to another while remaining within the same national boundaries;
3. international migration, i.e., change of residence from one nation to another.
All spatial mobility involves "permanent" changes of residence. The term "migration" is usually reserved for those changes of residence that involve a complete change and readjustment of the community affiliations of the mover. Thus local movement (#1 above) should not be referred to as migration. A "residential move" is a permanent change of residence. A "migration" is a permanent change of residence involving the crossing of a political boundary. |
|
Net Migration
|
In-migration minus out-migration in a given area over a given period of time
Rate is net migration/mid year pop x k (k = 1000 or 100) |
|
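A short worked example of the rate, taking net migration as in-migrants minus out-migrants (the usual definition). The counts and k = 1000 are hypothetical.

```python
def net_migration(in_migrants, out_migrants):
    """Net migration: arrivals minus departures over the period."""
    return in_migrants - out_migrants

def net_migration_rate(in_migrants, out_migrants, midyear_pop, k=1000):
    """Net migration per k persons in the midyear population."""
    return net_migration(in_migrants, out_migrants) / midyear_pop * k

# Hypothetical area: 12,000 arrivals, 9,000 departures, 600,000 residents.
net = net_migration(12_000, 9_000)
rate = net_migration_rate(12_000, 9_000, 600_000)

print(net, rate)
```

A negative result means the area lost more residents through migration than it gained.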
Whipple’s Measure of Age Heaping
|
A popular way to determine if age heaping is having an effect
Ranges from 0 (ages ending in 0 or 5 are not reported at all) through 100 (no preference for 0 or 5) to 500 (only ages ending in 0 or 5 are reported). Values less than 105 are considered accurate. |
|
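The card gives the range of the index but not its formula. The sketch below assumes the conventional version: the population reporting ages 25, 30, ..., 60, divided by one fifth of the population reporting ages 23 through 62, times 100. The three test populations are invented to hit the reference values 100, 500, and 0.

```python
def whipple_index(pop_by_age):
    """Whipple's index (conventional form, assumed here).
    `pop_by_age` maps single years of age to counts."""
    heaped = sum(pop_by_age.get(age, 0) for age in range(25, 61, 5))  # 25, 30, ..., 60
    total = sum(pop_by_age.get(age, 0) for age in range(23, 63))      # ages 23 to 62
    return heaped / total * 5 * 100

# Hypothetical populations illustrating the reference points on the scale.
no_heaping = {age: 1_000 for age in range(23, 63)}
full_heaping = {age: (1_000 if age % 5 == 0 else 0) for age in range(23, 63)}
avoidance = {age: (0 if age % 5 == 0 else 1_000) for age in range(23, 63)}

print(whipple_index(no_heaping),
      whipple_index(full_heaping),
      whipple_index(avoidance))
```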
Age Heaping
|
The practice of reporting years of life so that the terminal digit reflects cultural preference or ease of reporting (such as 0 or 5)
|
|
Parity
|
the number of live births born to a woman
|