81 Cards in this Set
- Front
- Back
The four parts of statistics
|
Statistics is the study of data analysis: defining the problem, collecting the data, analyzing and summarizing the data, and drawing inferences from the data.
|
|
Distribution
|
A list of the possible values of a variable together with how often each value occurs
|
|
Types of data
|
Quantitative and categorical
|
|
Left Skewed
|
The tail is on the left; the mean is typically less than the median.
|
|
Right Skewed
|
The tail is on the right; the mean is typically greater than the median.
|
|
Histogram and boxplot
|
A right skewed histogram will match with a boxplot that has a long right whisker
|
|
Outliers
|
A data point that is quite a bit removed from the rest of the data
|
|
Mean
|
A measure of center. The "balance point". Computed by adding numbers and dividing by how many there are
|
|
Median
|
A measure of center that cuts the data in half. Order the data and find the middle observation. Outliers have little or no effect on the median.
|
|
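A minimal sketch of how the mean and the median respond to an outlier. The data set is made up for illustration and is not part of the card set.

```python
from statistics import mean, median

# Hypothetical data, chosen only for illustration.
scores = [2, 3, 4, 5, 6]
with_outlier = scores + [40]  # same data plus one extreme value

mean_before, median_before = mean(scores), median(scores)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

In this example the mean moves from 4 to 10, while the median only moves from 4 to 4.5.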
Five number summary
|
Minimum, first quartile (Q1), median, third quartile (Q3), and maximum. Each quarter of the data holds 25 percent of the observations.
|
|
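One possible way to compute a five-number summary. The function name `five_number_summary` is hypothetical, and the quartile rule used here (median of each half, leaving out the overall median when n is odd) is only one of several conventions; software packages may give slightly different quartiles.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumes at least two observations. Q1 and Q3 are the medians of
    the lower and upper halves of the sorted data; when n is odd the
    overall median is excluded from both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

# Hypothetical data for illustration.
summary = five_number_summary([1, 3, 4, 6, 7, 8, 10, 12, 15])
print(summary)
```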
Standard Deviation
|
Measures the variability of the data about the mean; in a sense, it is like the average distance of the data from the mean.
|
|
Effects of outliers on standard deviation.
|
Outliers affect the mean, and deviations from the mean are used to compute the standard deviation, so outliers can inflate the standard deviation.
|
|
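A short sketch of the standard deviation computed by hand, with and without an outlier. It assumes the sample version of the formula (dividing by n - 1), which the cards do not state explicitly; the helper name `sample_sd` and the data are hypothetical.

```python
from math import sqrt
from statistics import stdev

def sample_sd(data):
    """Sample standard deviation: square root of the summed squared
    deviations from the mean, divided by n - 1 (an assumed convention)."""
    n = len(data)
    xbar = sum(data) / n
    return sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

data = [10, 12, 11, 13, 12]       # hypothetical values
data_with_outlier = data + [30]   # same values plus one outlier

sd_clean = sample_sd(data)
sd_outlier = sample_sd(data_with_outlier)

print(sd_clean, sd_outlier)
print(stdev(data))  # library result for comparison with sd_clean
```

The single outlier makes the standard deviation several times larger, even though the other five values are unchanged.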
When to use five number summary or mean and standard deviation.
|
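Use the five-number summary in the presence of outliers or strong skew; use the mean and standard deviation when the distribution is roughly symmetric with no outliers.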
|
|
When can a normal distribution be used to model a data set?
|
The Normal distribution can be used whenever the shape of the histogram of the data resembles the normal curve.
|
|
How do you obtain a proportion or a probability for a value from a normal curve?
|
First, convert the value to a z-score and look the z-score up in Table A. The table gives the proportion below that z-score, so for a "greater than" proportion, subtract the table value from one.
|
|
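The table lookup can be sketched in code, with the standard normal CDF standing in for Table A. The mean, standard deviation, and score below are hypothetical, as is the helper name `z_score`.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

std_normal = NormalDist()  # mean 0, standard deviation 1

# Hypothetical example: a score of 86 when the mean is 68 and the SD is 10.
z = z_score(86, mu=68, sigma=10)
p_below = std_normal.cdf(z)   # "less than" proportion (the table value)
p_above = 1 - p_below         # "greater than" proportion

print(z, round(p_below, 4), round(p_above, 4))
```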
68-95-99.7 rule.
|
For a normal distribution, about 68% of the observations are within one standard deviation of the mean, 95% within two, and 99.7% within three.
|
|
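A quick check of the rule against the standard normal curve. The dictionary name `within` is just an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Proportion of a normal distribution within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(k, round(proportion, 4))
```

The computed proportions are close to, but not exactly, 68%, 95%, and 99.7%; the rule is a rounded approximation.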
Z-score
|
The z-score tells how many standard deviations a value is from the mean. If a test score has a z-score of 1.8, the score is 1.8 standard deviations above the mean.
|
|
Explanatory Variable
|
The variable we manipulate or use to explain changes in the response (x); in an experiment, it is the variable being tested.
|
|
Response Variable
|
The variable we want to predict (y). The explanatory variable (x) is the variable we use to do the predicting.
|
|
Correlation
|
Correlation is a measure of the strength and direction of the linear relationship between the x and y variables. The symbol for the correlation coefficient is r.
|
|
Correlation (r)
|
r is always between -1 and 1. Values close to zero indicate little or no linear relationship. Values close to -1 show a strong negative relationship; values close to +1 show a strong positive one. r has no unit of measure. r only measures the strength of linear relationships.
|
|
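A sketch of r computed from its definition, on three made-up data sets. The function name `correlation` is hypothetical. The last example shows a perfect but curved relationship for which r is zero, since r only picks up linear association.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data sets for illustration.
r_pos = correlation([1, 2, 3, 4], [2, 4, 6, 8])            # perfect positive line
r_neg = correlation([1, 2, 3, 4], [8, 6, 4, 2])            # perfect negative line
r_curve = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x squared

# Rescaling x (a change of units) leaves r unchanged.
r_rescaled = correlation([100, 200, 300, 400], [2, 4, 6, 8])

print(r_pos, r_neg, r_curve, r_rescaled)
```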
Least squares regression line
|
The line obtained by minimizing the sum of the squared residuals.
|
|
Purpose of regression equations.
|
Regression equations are used to model relationships between quantitative variables and also for prediction.
|
|
Residual
|
Observed y minus predicted y
|
|
Slope in a least squares regression line.
|
The slope tells us the average increase (or decrease, if negative) in y for each one-unit increase in x.
|
|
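The least squares line, its slope, and the residuals can be sketched as follows. The function name `least_squares` and the data are hypothetical; the formulas used are the usual closed-form least squares solution for a single explanatory variable.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

intercept, slope = least_squares(xs, ys)
predicted = [intercept + slope * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed minus predicted

print(intercept, slope)
print(residuals)
```

For this data the slope is about 0.6, so y increases by roughly 0.6 on average for each one-unit increase in x. The residuals of a least squares line with an intercept sum to zero, up to rounding.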
Roles of x and y in regression
|
x is the explanatory variable and y is the response variable.
|
|
r-squared
|
Tells the percentage of total variation in the y's that can be explained by the x's
|
|
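A sketch of r-squared as the share of the variation in y that the fitted line accounts for, computed as 1 minus (residual sum of squares divided by total sum of squares). The function name `r_squared` and the data are hypothetical.

```python
def r_squared(xs, ys):
    """Fraction of the total variation in y explained by the least squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - my) ** 2 for y in ys)                                   # total
    return 1 - sse / sst

# Hypothetical data for illustration.
r2 = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(r2)
```

Here about 60% of the variation in the y's is explained by the x's.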
Residual Plots
|
What to look for: 1) uniform scatter (no problems); 2) outliers (check normality); 3) a megaphone shape (equal variance violated); 4) a smile or frown (nonlinearity).
|
|
Necessary Conditions for testing Significance using slope
|
Normality, constant variance,
independence, linearity |
|
Tests on slope
|
Ho: β = 0 versus Ha: β ≠ 0. 1. Ho: no linear relationship exists.
2. Ha: a significant linear relationship exists. |
|
s in regression output
|
The s in regression output measures the standard deviation of the observed y's about the regression line.
|
|
Extrapolation
|
Using an x value outside the range of the observed x's used to obtain the regression equation to predict a y value.
|
|
CI vs PI
|
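A prediction interval (PI) for an individual response is wider than a confidence interval (CI) for the mean response at the same x.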
|
|
Lurking Variable
|
Affects the relationship between the response variable and explanatory variable but is not part of the study. They are dangerous because they can suggest relationships that do not really exist.
|
|
Marginal Distribution
|
Row total (or column total) divided by table total.
|
|
Conditional distribution
|
Cell count divided by the row total.
* If the conditional distributions are equal, we say the variables are NOT related. |
|
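Marginal and conditional distributions for a small two-way table can be sketched like this. The table, its labels, and the variable names are all made up for illustration.

```python
# Hypothetical two-way table of counts: rows are groups, columns are outcomes.
table = {
    "A": {"yes": 30, "no": 20},
    "B": {"yes": 15, "no": 35},
}

table_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of the row variable: row total / table total.
marginal = {name: sum(row.values()) / table_total for name, row in table.items()}

# Conditional distribution within each row: cell count / row total.
conditional = {
    name: {col: count / sum(row.values()) for col, count in row.items()}
    for name, row in table.items()
}

print(marginal)
print(conditional)
```

The conditional distributions differ between the two rows (60% "yes" in A versus 30% in B), which suggests the two variables are related in this made-up table.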
Voluntary response sample
|
Samples obtained by having respondents call in or write in voluntarily
*biased, not probability sample |
|
Convenience Sample
|
Researchers contact subjects that are convenient to contact.
*biased, not probability sample |
|
Population of interest vs. sample
|
Population: the entire group of interest.
Sample: the subgroup of individuals from the population about which the researcher actually obtains information. |
|
Response Variable
|
The observation recorded (measured) on each individual.
|
|
Bias
|
The amount by which sample results systematically differ from what they should be. Bias can be reduced by taking probability samples, using careful wording on survey questions, etc.
|
|
SRS
|
Simple random sample: sampling at random from the entire population.
|
|
Stratified sampling
|
Sampling from within groups of a population, or sampling from within different populations.
|
|
Multistage
|
First sampling groups and then sampling from within those groups.
|
|
SRS & Stratified & Multistage
|
All are probability samples, where every member of the population has a known non-zero chance of being selected. An SRS gives each possible sample of size n an equal chance of being selected, and a stratified sample gives each member of a stratum an equal chance of being selected within its stratum.
|
|
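The three designs can be sketched with Python's `random` module. The function names, the population of ID numbers, and the strata and group labels are hypothetical; real studies would use an actual sampling frame.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every possible sample of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take a separate SRS within each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def multistage_sample(groups, n_groups, n_per_group, rng):
    """First sample groups, then sample individuals within the chosen groups."""
    chosen = rng.sample(sorted(groups), n_groups)
    return {name: rng.sample(groups[name], n_per_group) for name in chosen}

rng = random.Random(0)  # fixed seed so the illustration is repeatable

population = list(range(1, 101))  # hypothetical ID numbers 1..100
strata = {"low": list(range(1, 51)), "high": list(range(51, 101))}
groups = {f"group{g}": list(range(g * 10 + 1, g * 10 + 11)) for g in range(10)}

srs = simple_random_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)
multistage = multistage_sample(groups, 3, 4, rng)

print(srs)
print(stratified)
print(multistage)
```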
Cautions of surveying
|
Undercoverage, non-response, response bias, wording of questions
|
|
Methods to reduce bias
|
Using probability samples, avoiding undercoverage, reducing non-response, and avoiding poor wording on questions.
|
|
Observational Study
|
Studies where information is gathered on the population but no treatment is imposed on the subjects.
|
|
Sample Surveys
|
Sample surveys are observational studies, not experiments.
|
|
Advantage of experiment over observational study.
|
You can establish causation
|
|
Control group
|
Those that receive the placebo
|
|
Completely randomized experiment
|
An experiment where all experimental units are allocated at random among the treatment groups
|
|
Replication
|
having more than one experimental unit per treatment
|
|
Double blind
|
Neither the subjects nor the diagnostician know who is receiving the treatment or the placebo.
|
|
Matched Pairs
|
Taking two measurements on each individual
|
|
Completely random
|
Two completely separate groups. In an experiment, if individuals are randomly allocated to treatments, you have a completely randomized design. If instead each individual receives both treatments and the order in which the treatments are applied is randomized, you have a matched pairs design.
|
|
Sampling Distribution
|
A list of the possible values of the statistic together with the probabilities of each value. A collection of all values of the statistic from all possible samples.
|
|
Increase precision
|
by increasing sample size
|
|
x-bar
|
sample mean
|
|
P hat
|
sample proportion
|
|
When is P hat normal?
|
When np is greater than 10 and n(1 - p) is greater than 10.
|
|
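A small sketch of the check. The function names and the default threshold argument are illustrative. The standard deviation formula sqrt(p(1 - p)/n) is the usual one for a sample proportion and is not stated on the card itself.

```python
from math import sqrt

def phat_is_approximately_normal(n, p, threshold=10):
    """Rule of thumb from the card: np and n(1 - p) both greater than 10."""
    return n * p > threshold and n * (1 - p) > threshold

def phat_mean_and_sd(n, p):
    """Mean and standard deviation of the sampling distribution of p-hat
    (standard formula, assumed here rather than taken from the cards)."""
    return p, sqrt(p * (1 - p) / n)

# Hypothetical sample sizes and population proportions.
ok_large = phat_is_approximately_normal(100, 0.5)   # np = 50, n(1 - p) = 50
ok_small = phat_is_approximately_normal(20, 0.1)    # np = 2, too small
mean_phat, sd_phat = phat_mean_and_sd(100, 0.5)

print(ok_large, ok_small, mean_phat, sd_phat)
```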
Inferential Statistics
|
Using information from a sample to draw inferences about a population.
|
|
Two most used types of inferential statistics
|
Confidence interval and test of hypothesis
|
|
Confidence Interval
|
Gives a range of plausible values for the parameter being estimated.
|
|
What does 95% confidence mean?
|
The procedure provides intervals containing the parameter value for 95% of all samples. Confidence is in the procedure not the interval.
|
|
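The idea that confidence belongs to the procedure can be illustrated with a small simulation. This sketch assumes a normal population with a known standard deviation and uses a z interval for the mean; the function name `coverage`, the parameter values, and the seed are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(true_mu=50.0, sigma=10.0, n=25, trials=2000, level=0.95, seed=1):
    """Fraction of simulated samples whose z interval contains the true mean.

    Assumes a normal population with known sigma (a simplifying assumption).
    """
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    margin = z_star * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - margin <= true_mu <= xbar + margin:
            hits += 1
    return hits / trials

observed_coverage = coverage()
print(observed_coverage)
```

Any single interval either contains the true mean or it does not; it is the long-run fraction of intervals that should be close to 95%.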
p-value
|
The probability of getting a test statistic as extreme or more extreme than the value actually observed if the null hypothesis were true.
|
|
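A sketch of a two-sided p-value for a test statistic that follows the standard normal distribution when the null hypothesis is true. That distributional assumption, and the function name `two_sided_p_value`, are illustrative; t or chi-square statistics would need their own distributions.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a z statistic at least as extreme as the one observed,
    in either direction, assuming the null hypothesis is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_at_196 = two_sided_p_value(1.96)   # close to 0.05
p_at_zero = two_sided_p_value(0.0)   # no evidence against the null at all

print(round(p_at_196, 4), p_at_zero)
```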
alpha
|
The level of significance, probability of rejecting the null hypothesis when it is true or the largest risk the researcher is willing to take in rejecting a true null hypothesis.
|
|
multiple analyses
|
Two or more tests of significance performed. This inflates the overall type 1 error rate.
|
|
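A rough illustration of the inflation, assuming the tests are independent and every null hypothesis is true (a simplification; the cards do not specify this). The function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(alpha, k):
    """Chance of at least one false rejection across k independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** k

rates = {k: familywise_error_rate(0.05, k) for k in (1, 5, 10, 20)}

for k, rate in rates.items():
    print(k, round(rate, 4))
```

With ten independent tests at alpha = 0.05, the chance of at least one type 1 error is already about 40%.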
Power
|
The probability of rejecting a false null hypothesis (1 - beta). Increasing alpha decreases beta and increases power. Increasing sample size decreases beta and increases power.
|
|
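Both effects can be checked numerically for one simple case: a one-sided z test of Ho: mu = mu0 against Ha: mu > mu0 with known sigma. The choice of test, the function name `power_one_sided_z`, and the numbers are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mu1, sigma, n, alpha):
    """Power of a one-sided z test (Ha: mu > mu0) when the true mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)    # rejection cutoff for the z statistic
    shift = (mu1 - mu0) * sqrt(n) / sigma     # how far the true mean moves the statistic
    return 1 - std_normal.cdf(z_crit - shift)

# Hypothetical scenario: mu0 = 100, true mean 105, sigma = 15.
base = power_one_sided_z(100, 105, 15, n=25, alpha=0.05)
larger_alpha = power_one_sided_z(100, 105, 15, n=25, alpha=0.10)
larger_n = power_one_sided_z(100, 105, 15, n=100, alpha=0.05)

print(round(base, 3), round(larger_alpha, 3), round(larger_n, 3))
```

Beta is 1 minus each of these values, so a larger alpha or a larger sample size lowers beta in this example.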
T distribution vs. Normal distribution
|
The t distribution is more spread out than the normal (heavier tails); it approaches the normal as the degrees of freedom increase.
|
|
First rule in data analysis
|
plot the data
|
|
When are t procedures robust
|
When there are no outliers.
|
|
Chi-Square Hypothesis
|
Ho: no relationship
Ha: relationship |
|
Expected Count
|
Column total times row total, divided by table total
|
|
When is using Chi-square appropriate
|
Expected counts are greater than 5.
|
|
DF for two way table
|
(r-1)(c-1), where r is the number of rows and c is the number of columns
|
|
What does Chi-Square test?
|
Measures the amount of discrepancy between the observed counts and the expected counts where expected counts are computed assuming the null hypothesis of no relationship is true.
|
|
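Expected counts, the chi-square statistic, and the degrees of freedom for a two-way table can be sketched together. The function name `chi_square` and the observed counts are hypothetical; turning the statistic into a p-value would need a chi-square table or a statistics library, which is not shown here.

```python
def chi_square(observed):
    """Return (expected counts, chi-square statistic, degrees of freedom)
    for a two-way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    # Expected count = row total * column total / table total.
    expected = [[rt * ct / total for ct in col_totals] for rt in row_totals]

    # Sum of (observed - expected)^2 / expected over every cell.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, df

# Hypothetical 2 x 2 table of observed counts.
observed = [[30, 20], [15, 35]]
expected, statistic, df = chi_square(observed)

print(expected)
print(round(statistic, 4), df)
```

All four expected counts in this example are greater than 5, so the usual condition for using the chi-square test is met.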
ANOVA Hypothesis
|
Ho: All means equal
Ha: at least one mean differs |
|
When do you use ANOVA
|
When the explanatory variable is categorical and the response variable is quantitative OR the problem says you are comparing three or more means.
|
|
When do you use Chi-square
|
When both variables are categorical.
|
|
What symbols to use for means
|
mu (μ) for means, p for proportions, beta (β) for slope in regression.
|