78 Cards in this Set

  • Front
  • Back
Normal Distribution
3 properties:
1. Symmetrical
2. Unimodal (mean, median and mode are all in the same place at the center of the distribution)
3. Asymptotic (upper and lower tails of the distribution never touch the baseline)

Sometimes referred to as a Gaussian distribution
Population Distribution
the distribution of scores in a population

ex: distribution of IQ scores for everyone in a country
Distribution of a Sample
the distribution of scores in a sample of a given size

ex: IQ scores of the students in class, as a sample of the TAMU population
Sampling Distribution
the distribution of some statistic in all possible samples of a given size

ex: mean, or a slope coefficient
Central Limit Theorem
The CLT states that, as sample size (n) becomes large: a) the sampling distribution of the mean becomes approximately normal, regardless of the shape of the variable’s frequency distribution; b) the sampling distribution is centered on the variable’s population mean; and c) the standard deviation of this sampling distribution, called its standard error, is the variable’s standard deviation in the population divided by the square root of the sample size.
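
The three claims can be checked with a small simulation. This is only a sketch: the exponential population (mean 1, standard deviation 1), the sample size of 50, and the 4000 replications are assumptions chosen for illustration, not part of the card.

```python
import random
import statistics

def sample_means(draw, n, reps, seed=0):
    """Return `reps` sample means, each computed from n draws of draw(rng)."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]

# Hypothetical skewed population: exponential with mean 1 and standard deviation 1.
n = 50
means = sample_means(lambda rng: rng.expovariate(1.0), n=n, reps=4000)

center = statistics.fmean(means)      # claim b: close to the population mean (1)
std_error = statistics.stdev(means)   # claim c: close to 1 / sqrt(n)
expected_se = 1 / n ** 0.5

print(round(center, 3), round(std_error, 3), round(expected_se, 3))
```

Even though the population is strongly skewed, the sample means cluster around 1 with a spread near sigma / sqrt(n); a histogram of `means` would look roughly bell-shaped.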
SSE
sum of the squares of the errors

SSE = Sum[(Y_i - Y-hat_i)^2]

[the squared counterpart of the TPE: each prediction error is squared before summing]
R-squared
coefficient of determination

=ESS/TSS

Indicates the explanatory power of the regression model. It records the proportion of variation in the dependent variable that is explained or accounted for by the independent variable.
Ranges from 0 to +1.
TSS
Total Sum of Squared Deviations

TSS = Sum[(Y_i - Y-bar)^2]

The TSS indicates the total variation of the dependent variable that we want to explain. The total variation may be divided into two parts: the part accounted for or explained by the regression equation (ESS); and that part that the regression equation cannot account for, i.e., the residual part (RSS).
ESS
Explained sum of squared deviations

ESS = Sum[(Y-hat_i - Y-bar)^2]
RSS
Error or residual (unexplained) sum of squared deviations

RSS = Sum[(Y_i - Y-hat_i)^2]
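
A minimal worked sketch of how TPE, RSS (SSE), ESS, TSS and R-squared fit together. The five data points are hypothetical, picked so the arithmetic is easy to follow, and the slope and intercept use the standard bivariate OLS formulas, which these cards do not spell out.

```python
# Hypothetical data, chosen only so the numbers come out cleanly.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Standard bivariate OLS estimates (an assumption: not stated on the cards).
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

tpe = sum(yi - yh for yi, yh in zip(y, y_hat))         # positive and negative errors cancel
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained part (same as SSE)
ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained part
tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation in Y
r_squared = ess / tss

print(round(rss, 4), round(ess, 4), round(tss, 4), round(r_squared, 4))
```

For these numbers the unexplained and explained parts come to about 2.4 and 3.6, which add up to the total of 6, so R-squared is about 0.6. The raw errors sum to zero, which is why they are squared before summing.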
r (correlation coefficient)
A standardized version of the slope that does not depend on the units of measurement.

The Pearson correlation coefficient, r, is sort of a standardized version of the slope. It is a type of slope for which the value, unlike that of b, does not depend on the units of measurement. The correlation is the value the slope would assume if the measurement units for the two variables are such that their standard deviations are equal.
properties of r
1. r is only valid when a straight line is a reasonable model for the relationship. It measures the strength of the linear association between X and Y.
 
2. Unlike the slope b, the value of r must fall between the values of -1.0 and +1.0.
 
3. r has the same sign as b. Because r equals the slope b multiplied by the ratio of two (positive) standard deviations (sx / sy), the sign is preserved.
 
4. r = +/- 1.0 when all the sample points fall exactly on the prediction line. These would correspond to perfect positive and negative linear associations.  
5. The larger the absolute value of r, the stronger the degree of linear association.
 
6. The value of r does not depend on the variables’ units of measurement. In the earlier example of murder and poverty rates among the 50 states, r = .63, irrespective of whether Y is measured as murders per 1,000,000 population, or as murders per 100,000 population.
 
7. Unlike the slope b, r treats the two variables symmetrically. The prediction equation using Y to predict X has the same correlation as the equation using X to predict Y. If the murder rate is used to predict the poverty rate, rather than, as in the example, using the poverty rate to predict the murder rate, r = .63 in both cases (but the b's would most likely not be the same).
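
Properties 3, 6 and 7 can be illustrated with a short sketch that computes r as the slope times the ratio of standard deviations. The data and the helper names (`slope`, `pearson_r`) are hypothetical.

```python
import statistics

def slope(x, y):
    """OLS slope b for predicting y from x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

def pearson_r(x, y):
    """r = b * (sx / sy): the slope rescaled by the two standard deviations."""
    return slope(x, y) * statistics.stdev(x) / statistics.stdev(y)

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_rescaled = [10 * v for v in y]   # same variable measured in different units

r_xy = pearson_r(x, y)             # X predicting Y
r_yx = pearson_r(y, x)             # Y predicting X: same r
b_xy = slope(x, y)
b_yx = slope(y, x)                 # different slope
b_rescaled = slope(x, y_rescaled)  # slope changes with the units
r_rescaled = pearson_r(x, y_rescaled)  # r does not

print(round(r_xy, 4), round(r_yx, 4), b_xy, b_yx, round(b_rescaled, 4), round(r_rescaled, 4))
```

The two slopes differ depending on which variable is the predictor, and the slope changes tenfold when Y is rescaled, while r is identical in all three cases and shares the sign of b.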
t-test
t = b / SEb

We want a t-value that has a low probability (p-value), meaning that it is unlikely that the sample came from a population where H0 is true.

For a two-tailed test in a large sample, an absolute t-value above 1.96 corresponds to a probability below 0.05.
Adjusted R-squared
R2(adj.) is an adjustment to the value of R2 that gives a less biased estimate of the population R2. Statisticians have shown that the value of R2 for any sample drawn from a larger population tends to be slightly biased upwards, that is, it tends to overestimate the value of R2 for the population from which it was drawn. R2 for the population is known as P2 (rho squared), where P is not the letter P but the upper-case Greek letter rho.
 
R2 is slightly biased upwards because the sample data fall closer to the sample prediction equation than to the true population regression equation. This bias is greater if n (the sample size) is small and/or if K (the number of predictors, i.e., independent variables) is large.

R2(adj) approaches R2 in value as n increases.
If R2 is quite low, the value of R2 (adj) may become negative.
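
The behavior described above can be seen with the usual adjustment formula, R2(adj) = 1 - (1 - R2)(n - 1)/(n - K - 1). The card does not state the formula, so treat it as an assumption about which adjustment is meant; the function name and example values are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R2 for n observations and k predictors (needs n > k + 1)."""
    if n <= k + 1:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

small_sample = adjusted_r_squared(0.6, n=5, k=1)        # noticeably below 0.6
large_sample = adjusted_r_squared(0.6, n=100_000, k=1)  # almost exactly 0.6
low_r2 = adjusted_r_squared(0.02, n=20, k=3)            # negative

print(round(small_sample, 4), round(large_sample, 4), round(low_r2, 4))
```

The correction shrinks as n grows, grows as K grows, and can push a very low R2 below zero.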
Multiple Regression Coefficients
The interpretation of the intercept a is easy and is an extension of the bivariate case; a is the value of Y when each independent variable is zero.
The interpretation of b requires more attention. The value of the slope bk equals the average change in Y associated with a one unit change in Xk, when the other independent variable(s) are held constant.
Therefore, the slope in the multivariate case is sometimes referred to as a partial slope or as a partial regression coefficient.
Multiple Regression & Spurious Relationships
A major contribution of multiple regression is that it enables us to test to see if a previous bivariate relationship may be spurious, that is, if the previous bivariate relationship is not a real one but is caused by the fact that the two variables in the bivariate equation are both caused by a third variable not included.

Consider the demonstrated bivariate association between height and mathematics scores. Taller children perform better on math than shorter children.

But if you run a multivariate regression with math scores as Y and both height and age as X variables, the slope of height on math scores usually becomes insignificant.
F-test
1. The F-test allows us to evaluate the complete regression; it is a global test; it tests the H0 that all the regression coefficients in the real population are zero. The significance level of the F-test also gives us the significance of the R2. If you wish to determine if a specific b coefficient is significant, use the t-test for that coefficient.

2. We can also use the F-test to compare two regression models, one of which is more complex than the other.
Tolerance
The R2 produced by regressing a particular X variable on the other X variables may then be subtracted from 1. This value is known as the “tolerance” (or the independent variation) of the X variable. If the R2 of an X variable regressed on the other X variables is .29, this means that the X variable has a tolerance of .71, that is, 71% of the variation in that particular X variable is independent of the other X variables in the model. The higher the tolerance of an X variable, the less the presence of a problematic amount of collinearity. Watch out for low tolerances, say < .40 or <.35.
VIF
Variance inflation factor

VIF = reciprocal of the tolerance (1/tolerance)

The larger the VIF (the smaller the “tolerance”), the greater the multicollinearity in the model that is caused by the particular X variable.

The square root of the VIF of a particular X variable tells you how much larger the standard error for the b coefficient of that X variable is, compared with what it would be if that X variable were uncorrelated with the other X variables in the regression equation.
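
A small sketch of the tolerance and VIF arithmetic, using the R2 of .29 from the Tolerance card. The function names are hypothetical; the input is the R2 from regressing one X variable on the other X variables.

```python
import math

def tolerance(r2_on_other_x):
    """Share of an X variable's variation that is independent of the other X variables."""
    if not 0 <= r2_on_other_x < 1:
        raise ValueError("R2 must be in [0, 1); R2 = 1 means perfect collinearity")
    return 1 - r2_on_other_x

def vif(r2_on_other_x):
    """Variance inflation factor: the reciprocal of the tolerance."""
    return 1 / tolerance(r2_on_other_x)

def se_inflation(r2_on_other_x):
    """How many times larger the standard error of b is than with uncorrelated X variables."""
    return math.sqrt(vif(r2_on_other_x))

print(round(tolerance(0.29), 2), round(vif(0.29), 3), round(se_inflation(0.29), 3))
```

With a tolerance of .71 the VIF is about 1.41, so the standard error of that coefficient is roughly 19% larger than it would be with no collinearity.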
Regression Assumptions
I.1. There is no specification error.
I.1a. The relationship between Xi and Yi is linear.
I.1b. No relevant independent (X) variables have been excluded.
I.1c. No irrelevant independent (X) variables have been included.
I.2a. The Y variable is quantitative, continuous and unbounded; the X variables are quantitative or dichotomous; all variables are measured without error.
I.2b. All X variables have nonzero variance, that is, each independent variable has some variation in value.
I.2c. There is not perfect collinearity (i.e., there is no exact linear relationship) between two or more of the X variables.
I.2d. For each observation, the expected value of the error term is zero.
I.2e. The variance of the error term is constant for all values of Xi. Another way of saying this is that the error term is homoscedastic if, across each set of values for the k independent variables, the variance is constant at a value sigma^2.
I.2f. The error terms are uncorrelated, that is, there is no autocorrelation.
I.2g. Each independent variable is uncorrelated with the error term.
I.2h. The error term, Ei, is normally distributed.
BLUE
Best Linear Unbiased Estimator
When the regression assumptions hold, OLS is BLUE: the most efficient (lowest-variance) of the linear unbiased estimators
Normality Assumption
The normality assumption does not mean that all the variables in the regression equation must be normally distributed. The only “variable” that is assumed to have a normal distribution is the error term, which is something we can’t observe directly.

But again, nonnormal e distributions often result from badly skewed Y and/or X distributions. So here are some ways of appraising whether the X and Y variables have normal distributions.

Compare the mean and median; in a normal distribution they are the same. Also look at the kurtosis and skewness values (which equal 3 and 0, respectively, in a normal distribution). Also look at graphs of the distributions of the variables in the model.
Tukey's Ladder
If you have a non-normal distribution:

Powers greater than 1 shift weight to the upper tail of the distribution and thereby reduce negative skew. The higher the power, the stronger this effect.

Powers less than 1 pull in the upper tail and may therefore reduce positive skew. The lower the power, the stronger this effect.

Natural logs have the effect of bringing in the positive outliers. The higher values are compressed, pulling in the upper tail of the distribution.
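
A sketch of how moving down the ladder pulls in the upper tail. The data are hypothetical, built to be symmetric on the log scale, and skewness is measured with the simple moment formula (an assumption: statistical packages may apply small-sample corrections).

```python
import math

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution, positive for a long upper tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed variable: symmetric in logs, long upper tail in raw units.
raw = [math.exp(v) for v in (-2, -1, -1, 0, 0, 0, 1, 1, 2)]

skew_raw = skewness(raw)
skew_sqrt = skewness([v ** 0.5 for v in raw])    # power 1/2: mild pull on the upper tail
skew_log = skewness([math.log(v) for v in raw])  # natural log: stronger pull

print(round(skew_raw, 2), round(skew_sqrt, 2), round(skew_log, 2))
```

The square root reduces the positive skew and the log removes it entirely for this constructed example; real data will rarely come out exactly symmetric.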
Standard Error
The standard error is the standard deviation of the sampling distribution.

SEx-bar = SD / sqrt(N)

Symbol for the SE of the mean = lowercase sigma with the subscript x-bar

A small standard error indicates little sample-to-sample variation, so that most b's and a's are close to B (beta) and A (alpha). Large SE indicate the opposite.
TPE (total prediction error)
= Sum(Y_i - Y-hat_i)
= Sum (difference between observed & predicted)
Slope (b)
*average* change in Y associated with one unit change in X
Intercept (a)
indicates the point where the regression line "intercepts" with the Y-axis
Covariance
(sxy)
A useful summary statistic is the covariance (i.e., how two variables co-vary), designated sxy, the covariance of X and Y. For each observation, we subtract the mean of Y, or Y-bar, from its Y value, and we subtract the mean of X, or X-bar, from its X value. We multiply the two differences together, sum the products over all the observations, and then divide this sum by the number of observations minus 1.

sxy = Sum[(X_i - X-bar)(Y_i - Y-bar)] / (N - 1)

But all that the covariance statistic is really useful or valuable for is its indication of the sign (+ or -) of a relationship. It tells us nothing about the strength of a relationship. The covariance statistic produces a raw number which has no theoretical upper bound.
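
A short sketch of the covariance calculation, showing why its size says nothing about strength: rescaling one variable changes the covariance but not the relationship. The data are hypothetical.

```python
def covariance(x, y):
    """Sample covariance: sum of cross-products of deviations, divided by N - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = covariance(x, y)
s_xy_rescaled = covariance(x, [10 * v for v in y])  # same relationship, different units
s_xy_reversed = covariance(x, [-v for v in y])      # sign flips with the direction

print(s_xy, s_xy_rescaled, s_xy_reversed)
```

The covariance grows tenfold when Y is multiplied by 10 even though the association is no stronger; only the sign is interpretable on its own.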
Type I Error
rejecting the null hypothesis when in fact it is true.
Type II Error
failing to reject H0 when it is in fact false
Partial Correlation
The correlation coefficient between two variables with the other variables held constant (the Pearson r does not hold other variables constant)
Interaction
When an X variable's effect depends on the values of other X variables.

Most common approach for modeling interaction introduces cross-product terms of the explanatory variables into the multiple regression model.

Ex: the more SES, the less the effect of life events on mental impairment.
homo/hetero-scedasticity
The variance of the error term is constant for all values of Xi. This is the assumption of homoscedasticity.

The word homoscedastic is from the Greek word skedastos, which means, able to be scattered, which itself is from skedannunai, to scatter. The word literally means having equal scatter or variation; having equal variances.

The opposite of homoscedasticity (which is what we strive for) is heteroscedasticity (which is what we want to avoid).

This is a serious and important assumption (regression). It assumes that the errors of prediction are not related to the values of the independent variable. Violating this assumption will not introduce bias into the OLS estimate of the slope, but will bias the estimate of the standard error of the slope.
Resistant vs Robust Estimators
An estimate is resistant if its value is not much affected by small changes in sample data.

A robust estimator performs well even when there are small violations of assumptions about the underlying population (an error term that is not really Gaussian).
Probability
Probability is the likelihood that a given event will occur.

In gambling, the term probability takes the form of a specific mathematical expression; it is the frequency of a given outcome divided by the total number of all possible outcomes.
Odds
The likelihood of a given event occurring, compared to the likelihood of the same event not occurring.
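
The difference between probability and odds is easiest to see side by side. A minimal sketch using exact fractions; the die-rolling example and function names are hypothetical.

```python
from fractions import Fraction

def probability(favorable, total):
    """Frequency of the outcome divided by the number of all possible outcomes."""
    return Fraction(favorable, total)

def odds(p):
    """Probability the event occurs divided by the probability it does not."""
    if p == 1:
        raise ValueError("odds are undefined when the event is certain")
    return p / (1 - p)

def probability_from_odds(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

p_six = probability(1, 6)   # rolling a six with a fair die
odds_six = odds(p_six)      # 1 to 5

print(p_six, odds_six, probability_from_odds(odds_six))
```

A probability of 1/6 corresponds to odds of 1/5 (1 to 5), and a probability of 1/2 corresponds to odds of 1 (even odds).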
Ordinal Variable
variable that is categorical and ordered; "poor" "good" "excellent"

very liberal, slightly liberal, moderate, etc.
Nominal Variable
a variable that is categorical but not ordered
Censoring
In event history analysis, censoring exists when incomplete information is available about the duration of the risk period because of a limited observation period

A case is right censored if, by the end of the observation period, the event has not occurred to the case

A subject can also be left censored if data are not available at the beginning of the risk period
crude death rate
the total number of deaths per year per 1000 people; sum of the weighted ASDRs

CDR = (Deaths/Midyear Population) *1000

Crude because not everyone in the population is at equal risk of death; risk varies by many characteristics
age-specific death rate (ASDR)
This refers to the total number of deaths per year per 1000 people of a given age

Age-specific death rates, and not crude death rates, should be used to compare the mortality experiences of countries with known differences in age composition.

Ex: U.S. & Venezuela
The U.S. is an “older” country than Venezuela, that is, the U.S. has more older people proportionately than Venezuela. In contrast, Venezuela is a much “younger” country than the U.S. Because younger people die at lower rates than do older people, many (but not all) “young” countries have lower CDRs than “old” countries.

nMx = (deaths to persons aged x to x+n / mid-year population aged x to x+n) * 1000
where n is the width of the age group and x is the initial year of the age group

Not a crude rate because it takes into account differential mortality by age group
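
A sketch of why the CDR depends on age composition: it is the sum of the ASDRs weighted by each age group's share of the population. The three age groups and all counts are hypothetical, not data for any real country.

```python
def asdr(deaths, midyear_pop):
    """Age-specific death rate per 1000 for one age group."""
    return deaths / midyear_pop * 1000

def cdr(deaths_by_age, pop_by_age):
    """Crude death rate per 1000: all deaths over the whole midyear population."""
    return sum(deaths_by_age) / sum(pop_by_age) * 1000

def cdr_from_asdrs(asdrs, pop_by_age):
    """The same CDR, written as the population-weighted sum of the ASDRs."""
    total = sum(pop_by_age)
    return sum(rate * pop / total for rate, pop in zip(asdrs, pop_by_age))

# Hypothetical "older" country: ages 0-14, 15-64, 65+.
older_pop = [30_000, 60_000, 10_000]
older_deaths = [60, 300, 500]
rates = [asdr(d, p) for d, p in zip(older_deaths, older_pop)]

# Hypothetical "younger" country with identical ASDRs but a younger age structure.
younger_pop = [50_000, 45_000, 5_000]

cdr_older = cdr(older_deaths, older_pop)
cdr_younger = cdr_from_asdrs(rates, younger_pop)

print(rates, round(cdr_older, 2), round(cdr_younger, 2))
```

Both populations face exactly the same death rates at every age, yet the younger one has the lower CDR, which is why ASDRs are the better basis for comparison.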
Menarche
Refers to the age at which a female experiences her first menstrual cycle.

Typically occurs in early teens, though sometimes younger due to hormones in food and improved nutrition
age-specific fertility rate (ASFR)
The annual number of births to women in a particular age group per 1000 women in that age group

Data obtained from civil registration systems or censuses

Formula: ASFR = (births to women in the age group / midyear number of women in that age group) * 1000
American Community Survey (ACS)
An annual survey conducted by the Census Bureau (CB) that was instituted to replace the long form of the census

Takes a representative sample of the American people on typical census topics: economics, housing, demographic and social variables

Important because of the yearly update
crude birth rate
CBR = (Births/Mid Year Population)*1000

Crude because the denominator includes the total population, not just the population at risk

Problematic because age structure can have substantial effects on crude rates

E.g., a developing country with many young people can have a high CBR

Need to use standardization techniques to refine comparisons
Current Population Survey
A sample survey conducted monthly by the Census Bureau (CB)

Designed to represent the civilian non-institutionalized population; obtains a wide range of socioeconomic and demographic data such as employment, unemployment, earnings, hours of work, age, sex, race, occupation and industry
Definitions of Death
(underlying cause vs. pattern of failure)
Underlying cause: in a life table, every death is represented in just one d column (one cause), so that the causes are mutually exclusive and additive

Pattern of failure: the number of persons who leave the population for each type of chronic disease includes everyone who had that disease listed on their death certificate
Demographic transition
Population shift from high fertility and mortality to low fertility and mortality
Diffusion Effect (in Fertility Transition)
Attempt to identify a mechanism that leads to the cumulative adoption of some behavior by more and more individuals even while their social position and the resources associated with them remain largely unchanged

Fertility declines take place under a wide variety of economic and mortality conditions and there is a tendency to be influenced by ethnic, linguistic, and religious boundaries
Fecundity
The physiological capacity of a woman, man, couple, or group to reproduce

Infecund persons are also described as sterile

Women are most fecund during their 20s

For females, fecundity ranges from 0 to 30 children

Bongaarts's maximum fecundity is about 15 children per woman. This is the theoretical maximum if women engaged in natural fertility from age 12 to age 50
Fertility
Actual birth performance

One’s fertility is limited by one’s fecundity and is usually far below it
General Fertility Rate
Number of births in a given year divided by midyear female population of childbearing age x 1000

Improves on the CBR because the denominator includes only the population at risk

Masks differences in rate of childbearing for different ages throughout the reproductive years
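
A minimal sketch contrasting the CBR and the GFR for the same set of births. All counts are hypothetical, and the "childbearing age" group is assumed here to be women aged 15 to 49.

```python
def crude_birth_rate(births, midyear_pop):
    """Births per 1000 total population."""
    return births / midyear_pop * 1000

def general_fertility_rate(births, midyear_women_childbearing_age):
    """Births per 1000 women of childbearing age (assumed here to be ages 15-49)."""
    return births / midyear_women_childbearing_age * 1000

# Hypothetical population.
births = 2_000
midyear_pop = 100_000
women_15_49 = 25_000

cbr = crude_birth_rate(births, midyear_pop)
gfr = general_fertility_rate(births, women_15_49)

print(cbr, gfr)
```

The GFR is several times the CBR because its denominator drops everyone who is not at risk of giving birth.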
Gross Reproduction Rate
The sum of the ASFRs that include only live female births in the numerators

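Used to determine whether the population will grow, replace itself, or decline
Formula: GRR = n x Sum(ASFRs computed from female live births only), where n is the width of each age group in years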

Interpretation: The number of daughters expected to be born alive to a hypothetical cohort of 1000 women

GRR measures daughters per woman, TFR measures children
Infant Mortality Rate
Number of deaths to children born alive from birth to exact age one year/ number of live births x 1000

Equal to NMR + PNMR (neonatal mortality rate plus post-neonatal mortality rate)

Best known and most widely available measure of mortality in early life

Key indicator of demographic development and health conditions in different countries

Problems: Migration may affect the numerator but not the denominator and some deaths include children born at the end of last year (still under age 1)
IPUMS
Integrated Public Use Microdata Series

Consists of microdata samples from US and International census records

Records are converted and made available to researchers through a web system

Based out of Minnesota Population Center

Provides consistent variable names, coding schemes, and documentation across all samples
Life Expectancy
Average number of years remaining to a group of persons who reached a given age
Life Span
Maximum age that humans as a species could reach under optimum conditions

The longest verified life was that of Jeanne Calment of France (122y 5m); Tuti Yusupova of Uzbekistan has been claimed to be older (128), but that claim has not been verified

Almost entirely biological
Life Table
A statistical model composed of a combination of age specific mortality rates for a given population

Single decrement: has only one way of leaving (mortality)

Unabridged: mortality information for single years of age

Most are abridged: information by age group
Life Table Transient State vs. Absorbing State
Conventional life tables concern 2 states (life and death); multiple decrement and multistate tables concern multiple states

Multistate models allow for movements between life (active state) and death (absorbing state) but also for possible movements among various types of active states

Absorbing states only permit entries

Transient states allow entries and exits
Longevity
The ability to resist death

Has both biological and social components and varies according to these characteristics
Multiple Decrement Life Table
An extension of the standard life table, takes into account multiple transitions between states

(Such as more than one cause of death or way to leave a population)
National Survey of Family Growth
An ongoing series of sample surveys designed to provide current information about childbearing, contraception and related aspects of maternal and child health for the US

Used by the US Dept of Health and Human Services to plan health services and health ed programs
Natural Fertility
The level of fertility in a population in which deliberate control of childbearing is not practiced

Characteristic of most populations prior to the onset of the demographic transition

Exemplified by the Hutterites
Net Reproduction Rate
Average number of daughters born per woman (or per 1000 women) by the end of her childbearing years, given that she is subject to the ASFRs and survival rates of a given year

NRR = 1: exact replacement
NRR < 1: below replacement
NRR > 1: above replacement
Replacement Fertility
Level of fertility needed for a population to replace itself

Average of 2.1 babies per woman
Reproduction
Production of female births
Sex Ratio at Birth (SRB)
Usually defined as the number of boys born for every 100 girls

SRB = (# male births/# female births) x 100

Most societies have SRBs between 104 and 106

China has a high (120) SRB
Taeuber Paradox
Attributed to Conrad Taeuber by Keyfitz

Essentially states that the elimination of a particular death risk (e.g., cancer) has little impact on cohort life expectancy because it in effect exposes the people who would have died to a whole new set of risks
Total Fertility Rate (TFR)
Average number of children a hypothetical cohort of 1000 women would have if they all survived their childbearing years and experienced the current ASFRs at each age
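
A sketch of how the TFR (and an approximate GRR) can be built up from ASFRs. The ASFR schedule is hypothetical, five-year age groups are assumed, and the GRR here is only an approximation that applies one sex ratio at birth to every age group rather than using female births directly.

```python
def total_fertility_rate(asfrs_per_1000, width=5):
    """Children per woman: sum the ASFRs and multiply by the years spent in each age group."""
    return width * sum(asfrs_per_1000) / 1000

def approximate_grr(tfr, sex_ratio_at_birth=105):
    """Daughters per woman, assuming the same share of female births at every age."""
    share_female = 100 / (100 + sex_ratio_at_birth)
    return tfr * share_female

# Hypothetical ASFRs per 1000 women for ages 15-19, 20-24, ..., 45-49.
asfrs = [40, 100, 110, 80, 40, 10, 0]

tfr = total_fertility_rate(asfrs)   # per woman
tfr_per_1000 = tfr * 1000           # per hypothetical cohort of 1000 women
grr = approximate_grr(tfr)

print(tfr, tfr_per_1000, round(grr, 3))
```

Because slightly more boys than girls are born, the GRR is a little under half of the TFR.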
Family Limitation & Birth Control
Family limitation (FL) depends on the number of children already born and refers specifically to behavior designed to stop childbearing altogether.

Birth control (BC) encompasses both behavior intended to stop births and deliberate attempts to space births, and may also apply to behavior outside of the family context, that is, in "illegitimate" relations.

This distinction is theoretically important because it was the introduction and spread of stopping behavior (i.e., FL), and not spacing per se (which would involve BC), that was the key to the onset of the Demographic Transition (DT). In other words, the DT resulted from the shift from natural fertility (NF) to FL.
Singulate Mean Age at Marriage (SMAM)
Mean age at first marriage for a cohort of women or men who marry by age 50

Computed from information on current marital status in a single census or survey

Essentially the mean number of years lived in the single state, as implied by a schedule of age-specific percentages single

Analogous to the TFR because both are cross-sectional (synthetic cohort) measures
Stable Population
A model or hypothetical population closed to migration with an unchanging relative age composition and a constant rate of change in its total size

Results from conditions of constant fertility and mortality rates over an extended period

Contrasting age distributions will evolve to identical stable age distributions if the fertility and mortality levels are the same
Stationary Population (LT interpretation)
In a life table, the nLx entry may be interpreted as either 1) the number of person-years lived during the age interval x to x+n by the lx individuals alive at the beginning of the interval, or 2) the stationary population within the interval, which requires the assumption that 100,000 individuals are born each year

After a significant number of years and no migration, a stationary population results

After more than 100 years, the 100,000 entering the population each year at birth would be exactly balanced by 100,000 dying at all ages
Spatial mobility (3 types)
1. local movement, i.e., short-distance change of residence within the same community;

2. internal migration, i.e., change of residence from one community to another while remaining within the same national boundaries;

3. international migration, i.e., change of residence from one nation to another.

All spatial mobility involves "permanent" changes of residence.

The term "migration" is usually reserved for those changes of residence that involve a complete change and readjustment of the community affiliations of the mover. Thus local movement (#1 above) should not be referred to as migration.

A "residential move" is a permanent change of residence.

A "migration" is a permanent change of residence involving the crossing of a political boundary.
Net Migration
In-migration minus out-migration in a given area over a given period of time

Rate is (net migration / midyear population) x k (k = 1000 or 100)
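
A minimal sketch of the net migration calculation, with net migration taken as in-migrants minus out-migrants. All counts are hypothetical.

```python
def net_migration(in_migrants, out_migrants):
    """In-migrants minus out-migrants; negative when more people leave than arrive."""
    return in_migrants - out_migrants

def net_migration_rate(in_migrants, out_migrants, midyear_pop, k=1000):
    """Net migration per k people in the midyear population."""
    return net_migration(in_migrants, out_migrants) / midyear_pop * k

# Hypothetical area and year.
gain = net_migration_rate(5_000, 3_000, 200_000)   # net gain per 1000
loss = net_migration_rate(3_000, 5_000, 200_000)   # net loss per 1000

print(gain, loss)
```

A positive rate means the area gained population through migration and a negative rate means it lost population.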
Whipple’s Measure of Age Heaping
A popular way to determine if age heaping is having an effect

Ranges from 0 (ages ending in 0 or 5 are not reported at all) through 100 (no preference for 0 or 5) to 500 (only ages ending in 0 or 5 are reported)

Values less than 105 are considered highly accurate
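
A sketch of the index using the conventional formulation, which the card does not spell out and is therefore an assumption here: among reported ages 23 through 62, the count at ages ending in 0 or 5 is compared with one fifth of the total count. The function name and input layout are hypothetical.

```python
def whipple_index(count_by_age):
    """Whipple's index from a dict {single year of age: number of people reporting it}."""
    ages = range(23, 63)  # ages 23 through 62, the conventional range
    total = sum(count_by_age.get(age, 0) for age in ages)
    if total == 0:
        raise ValueError("no one reported an age between 23 and 62")
    heaped = sum(count_by_age.get(age, 0) for age in ages if age % 5 == 0)
    return heaped / (total / 5) * 100

# Hypothetical age distributions.
no_preference = {age: 100 for age in range(100)}                # every age equally common
only_round = {age: 100 for age in range(0, 100, 5)}             # only ages ending in 0 or 5
never_round = {age: 100 for age in range(100) if age % 5 != 0}  # 0 and 5 never reported

print(whipple_index(no_preference), whipple_index(only_round), whipple_index(never_round))
```

The three cases reproduce the anchor values of 100, 500 and 0 from the card.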
Age Heaping
The practice of reporting years of life so that the terminal digit reflects cultural preference or ease of reporting (such as 0 or 5)
Parity
the number of live births a woman has had