71 Cards in this Set
- Front
- Back
variables
|
characteristics recorded about each individual
quantitative: values are numbers (with units) that measure an amount
categorical: values name the category a case belongs to |
|
case
|
individual about whom or which we have data
|
|
frequency table
|
organizing counts
|
|
relative frequency table
|
like a frequency table, but gives percentages (or proportions) instead of counts
|
|
distribution
|
names the possible categories and tells how frequently each occurs
|
|
area principle
|
the area occupied by a part of the graph should correspond to the magnitude of the value it represents
|
|
bar chart
|
displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison
|
|
pie chart
|
shows the whole group of cases as a circle, sliced into pieces whose sizes are proportional to the fraction of the whole in each category
|
|
contingency table
|
used when looking at two categorical variables together. Shows how the individuals are distributed along each variable, contingent on the value of the other variable
|
|
marginal distribution
|
the distribution of one of the variables alone in a contingency table (the row or column totals)
|
|
conditional distribution
|
a distribution of one variable for only those individuals satisfying some condition on another variable
|
|
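The contingency table, marginal distribution, and conditional distribution cards above can be illustrated with a small sketch. The cases below are invented for illustration, and the helper names `marginal` and `conditional` are hypothetical, not taken from the card set.

```python
from collections import Counter

# Hypothetical cases: (ticket class, survived) pairs, invented for illustration.
cases = [
    ("first", "yes"), ("first", "yes"), ("first", "no"),
    ("second", "yes"), ("second", "no"), ("second", "no"),
    ("third", "no"), ("third", "no"), ("third", "yes"), ("third", "no"),
]

# Contingency table: count of cases for each combination of the two variables.
table = Counter(cases)

def marginal(cases, index):
    """Relative frequencies of one variable, ignoring the other."""
    counts = Counter(case[index] for case in cases)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

def conditional(cases, given_index, given_value, index):
    """Distribution of one variable among cases meeting a condition on the other."""
    subset = [case for case in cases if case[given_index] == given_value]
    return marginal(subset, index)

survival_overall = marginal(cases, 1)                # marginal distribution
survival_first = conditional(cases, 0, "first", 1)   # conditional on class == "first"
```

If the conditional distributions were the same for every class, the two variables would be independent in the sense of the card below; in this made-up data they differ.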
independent
|
in a contingency table when the distribution of one variable is the same for all categories of another
|
|
segmented bar chart
|
displays the conditional distribution of a categorical variable within each category of another variable. Each bar represents a "whole" and is divided proportionally into separate segments
|
|
Simpson's paradox
|
when averages are taken across different groups, they can appear to contradict the overall averages
|
|
distribution
|
the bins and the counts in each bin give the distribution for a quantitative variable
|
|
histogram
|
shows the distribution as the heights of bars
|
|
relative frequency histogram
|
displays the percentage of cases in each bin instead of the count
|
|
stem-and-leaf displays
|
contains all the information found in a histogram, satisfies the area principle, and shows the distribution. Preserves the individual data values
|
|
dotplot
|
a simple display that places a dot for each case in the data
|
|
Describing distribution of a Histogram
|
unimodal: one main peak
bimodal: two peaks
multimodal: three or more peaks
uniform: all bars are approximately the same height
symmetric: fold along the middle and the sides will match
tails: thinner ends of a distribution
skewed: if one tail stretches out much farther than the other, the histogram is skewed to the side of the longer tail
outliers: stragglers that stand off away from the body of the distribution |
|
timeplot
|
shows data that change over time
|
|
center of distribution
|
median: the middle value that divides the histogram into two equal areas
position of the median in the sorted data: (n+1)/2, where n is the number of values (if n is even, average the two middle values) |
|
spread
|
a numerical summary of how much the data values vary; always reported along with the center of the distribution
|
|
range
|
max-min
|
|
quartiles
|
split the data at the median and find the medians of those two halves
|
|
interquartile range/IQR
|
upper quartile-lower quartile
|
|
percentiles
|
the lower and upper quartiles are aka the 25th and the 75th percentiles of the data
|
|
the five-number-summary
|
Minimum
Q1 (lower quartile)
Median
Q3 (upper quartile)
Maximum |
|
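The median, quartile, IQR, and five-number-summary cards above can be sketched in a few lines. This is a minimal sketch that assumes the "split the data at the median" rule and leaves the middle value out of both halves when n is odd; textbooks and software differ on that detail. The function name `five_number_summary` is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return min, Q1, median, Q3, max using the split-at-the-median rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # values below the median position
    upper = data[half + n % 2:]   # values above it; skips the middle value when n is odd
    return {
        "min": data[0],
        "q1": median(lower),
        "median": median(data),
        "q3": median(upper),
        "max": data[-1],
    }

summary = five_number_summary([7, 1, 3, 9, 5, 11, 13, 15])
iqr = summary["q3"] - summary["q1"]            # upper quartile - lower quartile
data_range = summary["max"] - summary["min"]   # max - min
```

For these eight values the quartiles are 4 and 12, the median is 8, the IQR is 8, and the range is 14.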
mean
|
add up the numbers and divide by n
the point at which the histogram would balance
for skewed data, use the median instead |
|
deviation
|
how far the data value is from the mean
|
|
standard deviation
|
1) find the mean (ybar)
2) subtract the mean from each value to get the deviations
3) square each deviation
4) add the squared deviations and divide by n-1
5) take the square root of this number |
|
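The five steps on the standard deviation card can be followed one line at a time. This is a minimal sketch with made-up data; the function name `sample_standard_deviation` is illustrative.

```python
from math import sqrt

def sample_standard_deviation(values):
    n = len(values)
    mean = sum(values) / n                     # 1) find the mean (ybar)
    deviations = [y - mean for y in values]    # 2) subtract the mean from each value
    squared = [d ** 2 for d in deviations]     # 3) square each deviation
    variance = sum(squared) / (n - 1)          # 4) add them up and divide by n - 1
    return sqrt(variance)                      # 5) take the square root

s = sample_standard_deviation([2, 4, 4, 4, 5, 5, 7, 9])
```

Here the mean is 5 and the squared deviations add up to 32, so s is the square root of 32/7, roughly 2.14. The standard library's `statistics.stdev` uses the same n - 1 divisor.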
z-scores
|
how many standard deviations something is from the mean
z = (y - ybar)/s, where s is the standard deviation |
|
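As a quick illustration of the z-score formula, the sketch below standardizes a small made-up list of values. The name `z_scores` is illustrative, and `statistics.stdev` is assumed as the standard deviation s (it divides by n - 1).

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize each value: subtract the mean, divide by the standard deviation."""
    ybar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(y - ybar) / s for y in values]

scores = z_scores([60, 70, 80, 90, 100])
```

Standardizing shifts the mean to 0 and rescales the standard deviation to 1, so the value equal to the mean (80) gets a z-score of 0.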
normal model
|
appropriate for distributions whose shapes are unimodal and roughly symmetric
the standard Normal model (for z-scores) has mean = 0 and SD = 1 |
|
standardized value
|
found by subtracting the mean and dividing by the standard deviation
|
|
68-95-99.7 Rule
|
In a Normal model, 68% of the values fall within one standard deviation of the mean, 95% fall within two standard deviations of the mean, and 99.7% fall within three standard deviations of the mean
|
|
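The 68-95-99.7 Rule can be checked against the standard Normal model with Python's built-in `statistics.NormalDist`. This is a small sketch; the helper name `fraction_within` is illustrative.

```python
from statistics import NormalDist

# Standard Normal model: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within = {k: fraction_within(k) for k in (1, 2, 3)}
```

The three fractions come out near 0.683, 0.954, and 0.997, which the rule rounds to 68%, 95%, and 99.7%. The same `cdf` call gives a Normal percentile: `standard_normal.cdf(z)` is the fraction of values at or below a given z-score.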
normal percentile
|
gives the percentage of values in a standard normal distribution found at that z-score or below
|
|
normal probability plot
|
if the data are roughly Normal, the points on the plot fall close to a diagonal straight line
|
|
scatterplot
|
shows the relationship between two quantitative variables measured on the same cases
|
|
direction of scatterplots
|
negative: runs from upper left to lower right
positive: runs from lower left to upper right |
|
response and explanatory variable
|
explanatory is on the x-axis
response is on the y-axis |
|
correlation
|
strength of the linear association between two quantitative variables
between -1 and 1 |
|
how to find the correlation coefficient
|
1) compute (x - xbar)(y - ybar) for each pair and add up the products
2) divide this sum by the product (n-1) × s_x × s_y, where s_x and s_y are the standard deviations of x and y |
|
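The two steps for the correlation coefficient translate directly into code. This is a minimal sketch with made-up data; the function name `correlation` is illustrative, and `statistics.stdev` is assumed for the two standard deviations.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    # Step 1: (x - xbar)(y - ybar) for each pair, added up.
    sum_of_products = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    # Step 2: divide by (n - 1) times the two standard deviations.
    return sum_of_products / ((n - 1) * stdev(xs) * stdev(ys))

r = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

For this data the products add up to 6 and r is about 0.77, a fairly strong positive association. Points lying exactly on a rising line give r = 1, and on a falling line r = -1.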
predicted value
|
the value of y-hat found for each x-value in the data. Found by substituting the x-value in the regression equation.
|
|
residual
|
the difference between the observed value and its associated predicted value
y - y-hat= residual |
|
linear model
|
y-hat = b0 + b1x
y-hat is the predicted value
b0 is the y-intercept
b1 is the slope |
|
R-squared
|
the square of the correlation between y and x
the fraction of the variability of y accounted for by the least squares regression on x
indicates how successful the regression is in linearly relating y to x |
|
least squares
|
specifies the unique line that minimizes the variance of the residuals or, equivalently, the sum of the squared residuals
|
|
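The linear model, predicted value, residual, R-squared, and least squares cards fit together as in the sketch below. It assumes the usual closed-form least squares solution for one explanatory variable (slope = sum of cross-products / sum of squared x deviations), which the cards themselves do not spell out; the data and the name `least_squares_line` are made up for illustration.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (b0, b1) for the line y-hat = b0 + b1*x that minimizes squared residuals."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = ybar - b1 * xbar     # intercept: the line passes through (xbar, ybar)
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(xs, ys)

predicted = [b0 + b1 * x for x in xs]                       # y-hat for each x
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed - predicted

# R-squared: fraction of the variability in y accounted for by the line.
ybar = mean(ys)
r_squared = 1 - sum(e ** 2 for e in residuals) / sum((y - ybar) ** 2 for y in ys)
```

For this data the slope is about 0.6, the intercept about 2.2, and R-squared about 0.6. The residuals add up to zero (up to rounding), which is a property of the least squares line.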
good linear model?
|
1) check that the scatterplot looks linear
2) calculate r; an r close to 1 or -1 indicates a strong linear association
3) look at the residual plot and check that it shows no pattern |
|
leverage
|
data points whose x-values are far from the mean of x exert leverage on a linear model. High-leverage points pull the line close to them, so they can have a large effect on the line
|
|
Influential point
|
a point that, if omitted from the data, would give a very different model. High-leverage points are often influential
|
|
random
|
an event is random if we know what outcomes could happen, but not which particular values did or will happen
|
|
simulation
|
models random events by using random numbers to specify event outcomes with relative frequencies that correspond to the real-world relative frequencies we are trying to model
|
|
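A simulation in the sense of the card above can be sketched with Python's random number generator. The example is hypothetical: it estimates how often two fair dice sum to 7, and the function name `simulate_two_dice` and the seed are illustrative choices.

```python
import random

def simulate_two_dice(trials, seed=0):
    """Estimate the relative frequency of rolling a total of 7 with two fair dice."""
    rng = random.Random(seed)  # seeded so the run can be repeated
    hits = 0
    for _ in range(trials):
        # Each random integer from 1 to 6 stands in for one die.
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total == 7:
            hits += 1
    return hits / trials

estimate = simulate_two_dice(100_000)
```

With many trials the estimate should land close to the true relative frequency of 1/6; with few trials it can be noticeably off.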
Census Sample
|
survey of the whole population
|
|
SRS
|
simple random sample: a sample of size n in which every set of n elements in the population has an equal chance of selection
|
|
Convenience Sample
|
consists of individuals who are conveniently available. Usually biased
|
|
Systematic Sampling
|
selects every nth individual from the sampling frame, starting from a randomly chosen individual
|
|
Stratified Sample
|
divides the population into homogeneous groups (strata), then takes an SRS within each stratum
|
|
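The SRS and stratified sample cards can be compared side by side in a short sketch. The population below is invented, and the helper names `simple_random_sample` and `stratified_sample` are illustrative; `random.Random.sample` draws without replacement.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every set of n individuals is equally likely to be chosen."""
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Group individuals into strata, then take an SRS within each stratum."""
    strata = {}
    for individual in population:
        strata.setdefault(stratum_of(individual), []).append(individual)
    return {
        name: simple_random_sample(members, n_per_stratum, rng)
        for name, members in strata.items()
    }

rng = random.Random(42)  # seeded so the run can be repeated

# Hypothetical population: 60 freshmen and 40 seniors.
population = [("student%d" % i, "freshman" if i < 60 else "senior") for i in range(100)]

srs = simple_random_sample(population, 10, rng)
by_year = stratified_sample(population, lambda person: person[1], 5, rng)
```

The SRS may happen to contain mostly freshmen or mostly seniors, while the stratified sample always contains five of each.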
Cluster Sampling
|
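divides the population into heterogeneous groups (clusters), each resembling the population, then takes an SRS of the clusters and samples everyone in the chosen clusters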
|
|
Multistage Sampling
|
sampling done in stages, combining several sampling methods
|
|
Forms of Bias
|
1) Response Bias: wording of the question
2) Non-Response Bias: you are present, but chose not to respond
3) Voluntary Response: responding because you have a strong feeling
4) Undercoverage: you are not present and hence not represented in the sample |
|
Observational Study
|
no manipulation of factors has been employed
|
|
Retrospective Study
|
subjects are selected and then their previous conditions or behaviors are determined
|
|
Prospective Study
|
observational study in which subjects are followed to observe future outcomes
|
|
experiment
|
manipulates factor levels to create treatments and compare the responses of the subject groups across treatment levels
|
|
Principles of Experimental Design
|
1) Control: hold constant the sources of variation we know may have an effect, but are not the factors being studied
2) Randomize
3) Replicate
4) Block: reduces variation (not required) |
|
Statistically Significant
|
an observed difference is too large for us to believe that it is likely to have occurred naturally
|
|
Control Group
|
group given the placebo or baseline treatment
|
|
single-blind
|
either those who could influence the results or those who evaluate the results do not know which treatment was assigned (one group is blinded, but not both)
|
|
double-blind
|
when everyone is blinded: both those who could influence the results and those who evaluate them
|
|
block
|
when groups of experimental units are similar, we gather them into blocks. This isolates the variability due to differences between blocks so that we may see the differences caused by the treatments more clearly
|
|
confounded
|
the levels of one factor are associated with the levels of another factor, so the effects of the two factors cannot be separated
|