212 Cards in this Set

  • Front
  • Back

prediction intervals are always wider

than corresponding confidence intervals

a megaphone in a residual scatter plot indicates

unequal variation

slope is

the average change in y for every one-unit increase in x

r squared is

percent of variation in y explained by x

a confidence interval

estimates the mean for a group

a megaphone in a residual scatter plot indicates

unequal variation

suppose the p-value was found to be less than 0.0001 using a significance level of 0.01; what should we conclude?

there is a significant association between gender and whether or not they have a piercing

what are the hypotheses?

H0: B equals zero, i.e., no linear association




Ha: B differs from zero, i.e., there is a linear association

confidence is for groups and prediction is for individuals; each is a type of

interval

confidence interval

for groups, and narrower

prediction interval is

for an individual, and wider

for linear regression the degrees of freedom are

n minus 2

p is the population proportion and is

number of successes in a population divided by the population size

p hat is sample proportion and is

the number of successes in a sample divided by the sample size

p hat is used to estimate

p

the theoretical sampling distribution of p hat is

the distribution of all p hats from all possible samples of the same size, n, from the same population

the sampling distribution of p hat is

the standard deviation is the square root of p(1-p)/n




the mean is equal to p




is approximately normal if n is large




and along those lines, if np is greater than or equal to 10 and n(1-p) is greater than or equal to 10, then n is considered large


also np is the number of successes




and n(1-p) is the number of failures







the sampling distribution of p hat part two is

values of p are bounded between 0 and 1




values of p close to these boundaries require larger sample sizes for the sampling distribution to be approximately normal




values of p close to 0.5 give sampling distributions that are closer to normal than very large or very small values of p
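A small Python sketch of the two cards above (not part of the original deck; p = 0.3 and n = 100 are made-up values):

```python
import math

# Hypothetical values (not from the deck): population proportion and sample size
p = 0.3
n = 100

mean_p_hat = p                          # mean of the sampling distribution of p hat
sd_p_hat = math.sqrt(p * (1 - p) / n)   # standard deviation: sqrt(p(1-p)/n)

# Normality check: expected successes np and failures n(1-p) must both be at least 10
approx_normal = n * p >= 10 and n * (1 - p) >= 10

print(f"mean = {mean_p_hat}, sd = {sd_p_hat:.4f}, approximately normal: {approx_normal}")
```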

test of significance

conditions are: simple random sample, and the sampling distribution is approximately normal if np0 is greater than or equal to 10 and n(1-p0) is greater than or equal to 10 (check successes and failures)




Hypotheses


H0: p equals p0


Ha: p is greater than p0, p is less than p0, or p differs from p0






and the test statistic formula is the one that starts with p hat minus p0: z = (p hat − p0) divided by the square root of p0(1 − p0)/n
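Here is a hedged Python sketch of that one-proportion z test (hypothetical numbers, not from the deck):

```python
import math
from scipy.stats import norm

# Hypothetical data: 58 successes in a sample of 100, testing H0: p = 0.5 vs. Ha: p != 0.5
n, successes, p0 = 100, 58, 0.5
p_hat = successes / n

# Conditions: n*p0 >= 10 and n*(1 - p0) >= 10
assert n * p0 >= 10 and n * (1 - p0) >= 10

# Test statistic: z = (p hat - p0) / sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Two-sided p-value: area in both tails beyond |z|
p_value = 2 * norm.sf(abs(z))
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```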

means are for

quantitative

what is the symbol for the difference between two population means?

it is mu 1 minus mu 2 (μ1 − μ2)

what parameter is estimated with a 99 percent confidence interval for the difference in proportions?

that is p1 minus p2

what is the main purpose of replication in a designed experiment ?

estimate chance variation

what is not a valid sample for collecting data for inference?

a convenience sample

μ (mu) is

the population mean and mean of the sampling distribution of x bar

x bar is

the sample mean

sigma (σ) is

the population standard deviation

s is

the standard deviation of a sample; it measures spread in the sample

s over the square root of n is

standard error of x bar, and estimates the standard deviation of the sampling distribution

t star times s over the square root of n is

the margin of error for x bar

p is the

population proportion and mean of the sampling distribution of p hat

p hat is

the sample proportion; I need to know how to calculate it (number of successes in the sample divided by the sample size)

the square root of p hat (1 − p hat) over n is

how to calculate the standard error of p hat

z star times the square root of p hat (1 − p hat) over n is

the margin of error for estimating p
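A quick Python sketch of the standard error and margin of error for p hat (made-up counts, not from the deck):

```python
import math
from scipy.stats import norm

# Hypothetical sample: 120 successes out of 400
n, successes = 400, 120
p_hat = successes / n                      # sample proportion

se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error of p hat
z_star = norm.ppf(0.975)                   # z* for 95% confidence (about 1.96)
me = z_star * se                           # margin of error for estimating p

print(f"p hat = {p_hat}, SE = {se:.4f}, 95% CI = ({p_hat - me:.3f}, {p_hat + me:.3f})")
```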

mu (μ) is

the population mean and mean of the sampling distribution of x bar

B is the

probability of a type 2 error

sigma (σ) is the

population standard deviation

s over the square root of n is

the standard error of x bar

x bar 1 minus x bar 2 is

the difference between two sample means

mu 1 minus mu 2 (μ1 − μ2) is

the difference between two population means

sigma (σ) over the square root of n is

the standard deviation of the sampling distribution of x bar

d with a bar (d̄) is

the mean of the sample of differences

x bar is the

sample mean

the square root of (s1 squared over n1 plus s2 squared over n2) is

the standard error of x bar 1 minus x bar 2

S is the

sample standard deviation

S d is the

standard deviation of the sample of differences

a or alpha is

the level of significance and probability of a type one error

t star times s over the square root of n is

the margin of error for a one-sample t confidence interval for μ (mu)

a is the

the level of significance, or the probability of a type one error, which is the probability of rejecting a true null hypothesis

a type one error is

the probability of rejecting a true null hypothesis

a is

true population y-intercept in a regression equation

B is the

probability of a type 2 error ( probability of failing to reject a false null hypothesis )

a type 2 error is

the probability of failing to reject a false null hypothesis

B is

the true population slope in a regression equation

a - estimated (sample) y-intercept in a

regression equation

b - estimated sample

slope in a regression equation

C is

the level of confidence

U is the

mean of a population or distribution

U is the

mean of the sampling distribution of x bar

μ1 − μ2 is the

difference between the means of two populations

n is the

sample size

p is

the proportion or percentage of a population

p is the

mean of the sampling distribution of p hat

p hat is the

proportion or percentage of a sample

p 1 minus p 2 is

the difference between the proportions of two populations

p hat 1 minus p hat 2 is the

difference between two sample proportions

r is

the sample correlation coefficient

s squared is

the sample variance

s is

the sample standard deviation

s over the square root of n is

the standard error of x bar and it estimates the standard deviation of the sampling distribution of x bar

Σ (which looks like a capital E) is the

summation symbol

sigma (σ) is the

standard deviation of a population or distribution

sigma squared is

the variance of a population or distribution

sigma over the square root of n is

the standard deviation of the sampling distribution of x bar

x bar is the

sample mean

x bar 1 minus x bar 2 is the

difference between two sample means

X is the

explanatory variable in regression analysis

Y is the

response variable in regression analysis

y hat is the

predicted y

categorical variable

use a bar chart for this

quantitative variable

like numbers, use a histogram for this

what is a parameter ?

it is about the population

what is a statistic

it is about the sample

what is the whole thing with the parameter and the statistic ?

we use a statistic to make inferences about a parameter

s is

the standard deviation; it measures spread (the typical distance of points from the mean) and goes up with outliers

what is a population ?

it is the group you took your sample from

what is a lurking variable ?

it is a variable that affects the outcome but is not included in the study

the five number summary is

min, Q1, median, Q3, and max






also remember that 75 percent of observations are less than Q3, and 25 percent are greater than Q3






also remember that if the mean is less than the median then the distribution is left skewed, and if the mean is greater than the median (to the right of it) then it is right skewed

if mean is less than or to the left of the median then

the distribution is left skewed

if mean is to the right of the median or greater than the median then

the distribution is right skewed

if two variables are correlated then

we can use one to predict the other, but correlation still does not imply causation

the correlation coefficient of r needs to

describe a linear relationship; it has no units, so changing units does nothing; an r value always falls between -1 and 1; and remember that an r close to -1 or 1 indicates a strong relationship

r squared is

the percent of variation in y explained (accounted for) by x

how to predict using a regression equation ?

plug the x value into the regression equation (y hat = a + bx) to get the predicted y

how to interpret slope?

as x goes up by one, the predicted y goes up by the slope

the least squares regression line is

the line that minimizes the sum of squared residuals

what are the different types of studies ?

matched pairs


simple random study


stratified sample


and completely randomized experiment




and remember




that only experiments determine causation




so a convenience sample is not an experiment (an experiment is one where a treatment is applied), and you cannot determine causation with that kind of study

a matched pairs study is

A matched pairs design is a special case of a randomized block design. It can be used when the experiment has only two treatment conditions; and subjects can be grouped into pairs, based on some blocking variable. Then, within each pair, subjects are randomly assigned to different treatments.

a randomized block design is

With a randomized block design, the experimenter divides subjects into subgroups called blocks, such that the variability within blocks is less than the variability between blocks. Then, subjects within each block are randomly assigned to treatment conditions. Compared to a completely randomized design, this design reduces variability within treatment conditions and potential confounding, producing a better estimate of treatment effects. The table below shows a randomized block design for a hypothetical medical experiment.

Gender    Placebo    Vaccine
Male      250        250
Female    250        250

Subjects are assigned to blocks, based on gender. Then, within each block, subjects are randomly assigned to treatments (either a placebo or a cold vaccine). For this design, 250 men get the placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women get the vaccine.

a completely randomized design is

A completely randomized design is probably the simplest experimental design, in terms of data analysis and convenience. With this design, subjects are randomly assigned to treatments. A completely randomized design layout for a hypothetical medical experiment is shown below.

Placebo    Vaccine
500        500

In this design, the experimenter randomly assigned subjects to one of two treatment conditions. They received a placebo or they received a cold vaccine. The same number of subjects (500) are assigned to each treatment condition (although this is not required). The dependent variable is the number of colds reported in each treatment condition. If the vaccine is effective, subjects in the "vaccine" condition should report significantly fewer colds than subjects in the "placebo" condition.

a simple random study is

Simple random sampling refers to any sampling method that has the following properties: the population consists of N objects; the sample consists of n objects; and all possible samples of n objects are equally likely to occur. An important benefit of simple random sampling is that it allows researchers to use statistical methods to analyze sample results. For example, given a simple random sample, researchers can use statistical methods to define a confidence interval around a sample mean. Statistical analysis is not appropriate when non-random sampling methods are used. There are many ways to obtain a simple random sample. One way would be the lottery method: each of the N population members is assigned a unique number; the numbers are placed in a bowl and thoroughly mixed; then a blind-folded researcher selects n numbers. Population members having the selected numbers are included in the sample.
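Not part of the original deck: a tiny Python sketch of the lottery method, using a made-up population of 500 ID numbers.

```python
import random

# Hypothetical population: ID numbers 1..500
population = list(range(1, 501))

# random.sample mimics the lottery method: every possible sample of size n is equally likely
n = 25
srs = random.sample(population, n)
print(srs)
```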

a stratified sample is

Stratified sampling refers to a type of sampling method. With stratified sampling, the researcher divides the population into separate groups, called strata. Then, a probability sample (often a simple random sample) is drawn from each group. Stratified sampling has several advantages over simple random sampling. For example, using stratified sampling, it may be possible to reduce the sample size required to achieve a given precision. Or it may be possible to increase the precision with the same sample size.

random sampling allows for

inference

what is statistical inference ?

is the process of deducing properties of an underlying distribution by analysis of data. Inferential statistical analysis infers properties about a population: this includes testing hypotheses and deriving estimates. The population is assumed to be larger than the observed data set; in other words, the observed data is assumed to be sampled from a larger population.

what does it mean to be statistically significant?

The likelihood that a result or relationship is caused by something other than mere random chance. Statistical hypothesis testing is traditionally employed to determine if a result is statistically significant or not. This provides a "p-value" representing the probability that random chance could explain the result.

z is

the number of standard deviations away from the mean

how to calculate probabilities from the z table

follow this process




1. calculate z = (x − μ) / σ (for a sample mean, use the x bar version in the later card)






2. look up z on the z table and be sure the areas match






3. there are 3 types




greater than: look up the z as is and do nothing else




less than: always look up the negative z, then do nothing else




for a difference or in-between case: look up the negative z, then multiply that area by two




4. use the 68, 95, 99.7 rule or work "backwards"
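A hedged Python sketch of that lookup process, using scipy's normal distribution instead of a printed z table (the value, mean, and standard deviation are made up):

```python
from scipy.stats import norm

# Hypothetical values: x = 130, mu = 100, sigma = 15
z = (130 - 100) / 15                 # step 1: z = (x - mu) / sigma, here 2.0

p_less = norm.cdf(z)                 # "less than": area to the left of z
p_greater = norm.sf(z)               # "greater than": area to the right of z
p_two_sided = 2 * norm.cdf(-abs(z))  # "differs": look up the negative z, then double it

print(f"z = {z}, P(less) = {p_less:.4f}, P(greater) = {p_greater:.4f}, two-sided = {p_two_sided:.4f}")
```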

sampling distributions are

the distribution of all possible sample statistics

for the sampling distribution of x bar

the mean is always equal to μ (mu)




the standard deviation is always sigma over the square root of n, and it goes down as n (sample size) goes up




- and it is normal if either (1) the population is normal or (2) the sample is large, meaning n is greater than or equal to 30

for calculating the probability from the sampling distribution of x bar

1. calculate z = (x bar − μ) / (σ / √n)






2. look up z on the z table and be sure the areas match (for example, you may need one minus the table area)




3. note: the normal table must apply (the population starts normal or the sample size is large)
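A short Python sketch of that calculation (hypothetical numbers, not from the deck):

```python
import math
from scipy.stats import norm

# Hypothetical values: mu = 100, sigma = 15, n = 36, observed sample mean x bar = 104
mu, sigma, n, x_bar = 100, 15, 36, 104

# Step 1: z = (x bar - mu) / (sigma / sqrt(n))
z = (x_bar - mu) / (sigma / math.sqrt(n))

# Step 2: probability that the sample mean exceeds 104 (area to the right of z)
p = norm.sf(z)
print(f"z = {z:.2f}, P(x bar > {x_bar}) = {p:.4f}")
```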

the central limit theorem allows us to

calculate probabilities from the normal table without knowing the population distribution

what is the central limit theorem?

The central limit theorem and the law of large numbers are the two fundamental theorems of probability. Roughly, the central limit theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution. The importance of the central limit theorem is hard to overstate; indeed it is the reason that many statistical procedures work.

what is the central limit theorem?

The central limit theorem states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal, if the sample size is large enough. How large is "large enough"? The answer depends on two factors. Requirements for accuracy: the more closely the sampling distribution needs to resemble a normal distribution, the more sample points will be required. The shape of the underlying population: the more closely the original population resembles a normal distribution, the fewer sample points will be required. In practice, some statisticians say that a sample size of 30 is large enough when the population distribution is roughly bell-shaped. Others recommend a sample size of at least 40. But if the original population is distinctly not normal (e.g., is badly skewed, has multiple peaks, and/or has outliers), researchers like the sample size to be even larger.

the sampling distribution of p hat is

the mean always equals p




the standard deviation is the square root of p(1- p) over n and this is always the case






and it is normal if




n times p is greater or equal to 10




and




n (1-p) is greater or equal to 10




they have to meet both of these tests to be normal




so it gets more normal as n increases






the definition is: the distribution of all sample proportions from all possible samples

the sampling distribution of p hat definition is

the distribution of all sample proportions from all samples

for a one sample t test, write these hypotheses

H0: μ = μ0 vs. Ha: μ >, <, or ≠ μ0

Checks are

for means: no outliers




for proportions: n times p0 is greater than or equal to 10 and n(1 − p0) is greater than or equal to 10; it must pass both of these checks

What are the errors in hypothesis testing ?

type one: rejecting the null when it is true; the probability is alpha (α)




type two: accepting (failing to reject) the null when it is false; the probability is beta (β)




Power is the probability of rejecting a false null, and it equals 1 − β





a type one error is

A Type I error occurs when the researcher rejects a null hypothesis when it is true. The probability of committing a Type I error is called the significance level, and is often denoted by α.

a type two error is

A Type II error occurs when the researcher accepts a null hypothesis that is false. The probability of committing a Type II error is called Beta, and is often denoted by β. The probability of not committing a Type II error is called the Power of the test.

Power is

A Type II error occurs when the researcher accepts a null hypothesis that is false. The probability of committing a Type II error is called Beta, and is often denoted by β. The power of a test is equal to 1 minus Beta. It indicates the probability of not making a Type II error.

Confidence Intervals

Determine a range of reasonable values for the parameter and remember the parameter has to do with the population




general formula is estimate ± ME (won't ask to calculate 2-sample confidence intervals; also be able to build one given the margin of error ME)



confidence intervals

the margin of error (determines the confidence interval width) goes down as n goes up, and goes up as the confidence level goes up




one-sample or matched-pairs formula for a mean or proportion




using confidence intervals to do a hypothesis test, for example: is 0 in the interval?
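A minimal Python sketch of the estimate ± ME formula for a one-sample t interval (the data values are made up):

```python
import math
import statistics
from scipy.stats import t

# Hypothetical sample
data = [12.1, 11.8, 12.6, 12.0, 11.5, 12.3, 12.2, 11.9]
n = len(data)
x_bar = statistics.mean(data)
s = statistics.stdev(data)        # sample standard deviation

# estimate +/- ME, where ME = t* times s / sqrt(n)
t_star = t.ppf(0.975, df=n - 1)   # t* for 95% confidence, df = n - 1
me = t_star * s / math.sqrt(n)

print(f"x bar = {x_bar:.3f}, ME = {me:.3f}, 95% CI = ({x_bar - me:.3f}, {x_bar + me:.3f})")
```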

what is a confidence interval ?

Statisticians use a confidence interval to express the degree of uncertainty associated with a sample statistic. A confidence interval is an interval estimate combined with a probability statement. For example, suppose a statistician conducted a survey and computed an interval estimate, based on survey data. The statistician might use a confidence level to describe uncertainty associated with the interval estimate. He/she might describe the interval estimate as a "95% confidence interval". This means that if we used the same sampling method to select different samples and computed an interval estimate for each sample, we would expect the true population parameter to fall within the interval estimates 95% of the time. Confidence intervals are preferred to point estimates and to interval estimates, because only confidence intervals indicate (a) the precision of the estimate and (b) the uncertainty of the estimate.

what is a confidence level ?

In survey sampling, different samples can be randomly selected from the same population, and each sample can often produce a different confidence interval. Some confidence intervals include the true population parameter; others do not. A confidence level refers to the percentage of all possible samples that can be expected to include the true population parameter. For example, suppose all possible samples were selected from the same population, and a confidence interval were computed for each sample. A 95% confidence level implies that 95% of the confidence intervals would include the true population parameter.
what is a confidence level ?
A confidence level refers to the percentage of all possible samples that can be expected to include the true population parameter. For example, suppose all possible samples were selected from the same population, and a confidence interval were computed for each sample. A 95% confidence level implies that 95% of the confidence intervals would include the true population parameter.

what is confounding?

Confounding occurs when the experimental controls do not allow the experimenter to reasonably eliminate plausible alternative explanations for an observed relationship between independent and dependent variables. Consider this example. A drug manufacturer tests a new cold medicine with 200 volunteer subjects - 100 men and 100 women. The men receive the drug, and the women do not. At the end of the test period, the men report fewer colds. Because treatment and gender are completely confounded, we cannot tell whether the fewer colds are due to the drug or to gender.

Calculating test statistics (z or t)

matched pairs, only look at the differences!




so do t = (x bar − 0) / (s / √n), computed on the differences






for one sample test do




t = (x bar − μ0) / (s / √n)







Be able to calculate sample sizes given a margin of error so do

n equals (z star over m) squared times p star times (1 − p star); use p star = 0.5
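A one-line check of that sample-size formula in Python (95% confidence and a 0.03 margin of error are assumed, not from the deck):

```python
import math
from scipy.stats import norm

m = 0.03                      # desired margin of error (hypothetical)
z_star = norm.ppf(0.975)      # about 1.96 for 95% confidence
p_star = 0.5                  # conservative guess that maximizes p*(1 - p*)

# n = (z* / m)^2 * p*(1 - p*), rounded up to the next whole person
n = math.ceil((z_star / m) ** 2 * p_star * (1 - p_star))
print(n)  # about 1068
```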

P-values

be able to look them up on the t table, and pay attention to whether the test is one- or two-sided




definition: the probability of observing a result as extreme or more extreme than the one observed, assuming the null is true




if the p-value is less than alpha, then the result is "statistically significant"




and statistically significant does not mean practically significant

what is practically significant ?

Practical significance looks at whether the difference is large enough to be of value in a practical sense.

what is a p-value?

A P-value measures the strength of evidence in support of a null hypothesis. Suppose the test statistic in a hypothesis test is equal to S. The P-value is the probability of observing a test statistic as extreme as S, assuming the null hypothesis is true. If the P-value is less than the significance level, we reject the null hypothesis.

what is practical significance ?

Practical significance is about whether we should care/whether the effect is useful in an applied context. An effect could be statistically significant, but that doesn't in itself mean that it's a good idea to spend money/time/resources on pursuing it in the real world. The truth is that in most situations (at least in psychology), the null hypothesis is never true. Two groups will almost never be *exactly* the same if you were to test thousands or millions of people. That doesn't mean that every difference is interesting. This is usually associated with effect size measures (e.g. Cohen's d, which has criteria for 'small', 'medium' and 'large' effects), but generally will also need to take into account the context of the particular study (e.g. clinical research will have different expectations than personality psychology in terms of what kind of effects can be expected).

if the p-value is low then

reject H0 and accept Ha!

if the p-value is high then

fail to reject H0; there is insufficient evidence to accept Ha

two sample t tests for means

H0: μ1 = μ2 vs. Ha: μ1 >, <, or ≠ μ2, or




H0: μ1 − μ2 = 0 vs. Ha: μ1 − μ2 >, <, or ≠ 0




if the p-value is low, reject H0
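A hedged sketch of a two-sample t test in Python (made-up data, Welch's version, which does not assume equal variances):

```python
from scipy import stats

# Hypothetical samples
group1 = [23, 25, 28, 30, 27, 26, 24, 29]
group2 = [20, 22, 25, 24, 21, 23, 22, 26]

# Test H0: mu1 = mu2 vs. Ha: mu1 != mu2
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
# If the p-value is below alpha, reject H0
```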

Two sample proportions

- compare two sample proportions, and it's a z test!






Two-sample hypotheses:




H0: p1 − p2 = 0 vs. Ha: p1 − p2 >, <, or ≠ 0
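A small Python sketch of the two-proportion z test using a pooled proportion (hypothetical counts, not from the deck):

```python
import math
from scipy.stats import norm

# Hypothetical counts: x1, x2 successes out of n1, n2
x1, n1 = 45, 100
x2, n2 = 30, 100
p1_hat, p2_hat = x1 / n1, x2 / n2

# Pooled proportion under H0: p1 - p2 = 0
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p1_hat - p2_hat) / se
p_value = 2 * norm.sf(abs(z))   # two-sided alternative
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```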





What is a Chi Square distribution?

Suppose we conduct the following statistical experiment. We select a random sample of size n from a normal population having a standard deviation equal to σ. We find that the standard deviation in our sample is equal to s. Given these data, we can compute a statistic, called chi-square, using the following equation: Χ² = [ ( n − 1 ) * s² ] / σ². If we repeated this experiment an infinite number of times, we could obtain a sampling distribution for the chi-square statistic. The chi-square distribution is defined by the following probability density function: Y = Y0 * ( Χ² )^( v/2 − 1 ) * e^( −Χ²/2 ), where Y0 is a constant that depends on the number of degrees of freedom, Χ² is the chi-square statistic, v = n − 1 is the number of degrees of freedom, and e is a constant equal to the base of the natural logarithm system (approximately 2.71828). Y0 is defined so that the area under the chi-square curve is equal to one.

What is a Chi Square statistic

Suppose we select a random sample from a normal population. The chi-square statistic can be computed using the following equation: Χ² = [ ( n − 1 ) * s² ] / σ², where n is the sample size, σ is the population standard deviation, s is the sample standard deviation, and Χ² is the chi-square statistic.
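For example (made-up numbers, not from the deck), the statistic can be computed directly:

```python
# Hypothetical values: sample size n, sample sd s, population sd sigma
n, s, sigma = 15, 3.1, 2.5

# Chi-square statistic: X^2 = (n - 1) * s^2 / sigma^2
chi_square = (n - 1) * s ** 2 / sigma ** 2
print(round(chi_square, 2))   # about 21.53
```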

Chi square tests (hint: data displayed as table)

use it to determine if there is a relationship between two CATEGORICAL variables!




H0: no relationship (no association); Ha: there is a relationship (an association)




- how to calculate expected counts: (row total times column total) divided by the table total - it's also on your formula sheet, MY formula sheet!




Checks: randomization, and all expected counts (ECs) greater than or equal to 5
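A hedged Python sketch of a chi-square test of independence on a made-up two-way table (not from the deck):

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = gender, columns = piercing / no piercing
observed = [[30, 70],
            [55, 45]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p-value = {p_value:.4f}")
print(expected)   # check: all expected counts should be 5 or more
```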

Anova

H0: μ1 = μ2 = μ3 = ... vs. Ha: at least one mean is different






Be able to pick off the p-value from a table
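A minimal ANOVA sketch in Python (three made-up groups, not from the deck):

```python
from scipy.stats import f_oneway

# Hypothetical samples from three groups
g1 = [5.1, 4.9, 5.4, 5.0, 5.2]
g2 = [5.6, 5.8, 5.5, 5.9, 5.7]
g3 = [5.0, 5.3, 5.1, 5.2, 4.8]

# One-way ANOVA of H0: mu1 = mu2 = mu3 vs. Ha: at least one mean differs
f_stat, p_value = f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")
```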







Regression

We do a regression analysis to look at the relationship between two quantitative variables




regression minimizes the sum of squared residuals






know how to look at computer output for tests of hypotheses and pick off the appropriate p-values; always look at the SLOPE!
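A small Python sketch of reading the slope and its p-value from regression output (the x and y values are made up):

```python
from scipy.stats import linregress

# Hypothetical (x, y) data
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

result = linregress(x, y)
# result.pvalue is the p-value for testing H0: slope = 0 (the one to pick off the output)
print(f"slope = {result.slope:.3f}, r^2 = {result.rvalue ** 2:.3f}, p-value = {result.pvalue:.2e}")
```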





a one sample t test is used when

A one-sample t-test is used to test whether a population mean is significantly different from some hypothesized value.
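A quick sketch of that test in Python (the data and the hypothesized mean of 50 are made up):

```python
from scipy.stats import ttest_1samp

# Hypothetical sample, testing H0: mu = 50 vs. Ha: mu != 50
data = [52, 48, 55, 51, 49, 53, 54, 50, 47, 56]
t_stat, p_value = ttest_1samp(data, popmean=50)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```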

a one sample z test is used when

A one-sample z-test is used to test whether a population parameter is significantly different from some hypothesized value. Here is how to use the test. Define hypotheses: there are three sets of null and alternative hypotheses, and each makes a statement about how the true population mean μ is related to some hypothesized value M (the symbol ≠ means "not equal to").

estimation is

In statistics, estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample. Often, we use sample statistics (e.g., mean, proportion) to estimate population parameters (e.g., mean, proportion).

a margin of error is

The margin of error expresses the maximum expected difference between the true population parameter and a sample estimate of that parameter. To be meaningful, the margin of error should be qualified by a probability statement (often expressed in the form of a confidence level). For example, a pollster might report that 50% of voters will choose the Democratic candidate. To indicate the quality of the survey result, the pollster might add that the margin of error is ±5%, with a confidence level of 90%. This means that if the survey were repeated many times with different samples, the true percentage of Democratic voters would fall within the margin of error 90% of the time.

an observational study is

Like experiments, observational studies attempt to understand cause-and-effect relationships. However, unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives. For example, a sample survey does not apply a treatment to survey respondents. The researcher only observes survey responses. Therefore, a sample survey is an example of an observational study.

methods of data collection

Census: a census is a study that obtains data from every member of a population. In most studies, a census is not practical, because of the cost and/or time required.
Sample survey: a sample survey is a study that obtains data from a subset of a population, in order to estimate population attributes.
Experiment: an experiment is a controlled study in which the researcher attempts to understand cause-and-effect relationships. The study is "controlled" in the sense that the researcher controls (1) how subjects are assigned to groups and (2) which treatments each group receives. In the analysis phase, the researcher compares group scores on some dependent variable. Based on the analysis, the researcher draws a conclusion about whether the treatment (independent variable) had a causal effect on the dependent variable.
Observational study: like experiments, observational studies attempt to understand cause-and-effect relationships. However, unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.

a census is

A census is a study that obtains data from every member of a population. In most studies, a census is not practical, because of the cost and/or time required.

a sample survey is

A sample survey is a study that obtains data from a subset of a population, in order to estimate population attributes.

an experiment is

An experiment is a controlled study in which the researcher attempts to understand cause-and-effect relationships. The study is "controlled" in the sense that the researcher controls (1) how subjects are assigned to groups and (2) which treatments each group receives.



In the analysis phase, the researcher compares group scores on some dependent variable. Based on the analysis, the researcher draws a conclusion about whether the treatment ( independent variable) had a causal effect on the dependent variable.

an observational study is

Like experiments, observational studies attempt to understand cause-and-effect relationships. However, unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.

pros and cons of data collection

Resources: when the population is large, a sample survey has a big resource advantage over a census. A well-designed sample survey can provide very precise estimates of population parameters - quicker, cheaper, and with less manpower than a census.
Generalizability: generalizability refers to the appropriateness of applying findings from a study to a larger population. Generalizability requires random selection. If participants in a study are randomly selected from a larger population, it is appropriate to generalize study results to the larger population; if not, it is not appropriate to generalize. Observational studies do not feature random selection, so generalizing from the results of an observational study to a larger population can be a problem.
Causal inference: cause-and-effect relationships can be teased out when subjects are randomly assigned to groups. Therefore, experiments, which allow the researcher to control assignment of subjects to treatment groups, are the best method for investigating causal relationships.

Regression continued

H0: slope B = 0 vs. Ha: B differs from 0; or H0: no relationship vs. Ha: there is a relationship; or H0: X cannot be used to predict Y vs. Ha: X can be used to predict Y




Prediction intervals estimate predictions for One individual




Prediction intervals are WIDER than confidence intervals!

IT'S ALL ABOUT

REGRESSION!

what is distribution ?

The distribution of a variable is a description of the relative numbers of times each possible outcome will occur in a number of trials. The function describing the probability that a given value will occur is called the probability density function (abbreviated PDF), and the function describing the cumulative probability that a given value or any value smaller than it will occur is called the distribution function (or cumulative distribution function, abbreviated CDF). Formally, a distribution can be defined as a normalized measure.

Explanatory vs Response variables

The response variable is the focus of a question in a study or experiment. An explanatory variable is one that explains changes in that variable. It can be anything that might affect the response variable. Let's say you're trying to figure out whether chemo or anti-estrogen treatment is the better procedure for breast cancer patients. The question is: which procedure prolongs life more? And so survival time is the response variable. The type of therapy given is the explanatory variable; it may or may not affect the response variable. In this example, we have only one explanatory variable: type of treatment. In real life you would have several more explanatory variables, including: age, health, weight and other lifestyle factors.

what is distribution ?

The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. When a distribution of categorical data is organized, you see the number or percentage of individuals in each group. When a distribution of numerical data is organized, they’re often ordered from smallest to largest, broken into reasonably sized groups (if appropriate), and then put into graphs and charts to examine the shape, center, and amount of variability in the data.

what is a histogram

A histogram is made up of columns plotted on a graph. Here is how to read a histogram: the columns are positioned over a label that represents a quantitative variable, and the height of the column indicates the size of the group defined by the column label. (The example histogram, not reproduced here, showed per capita income for five age groups.)

what is a boxplot

A boxplot, sometimes called a box and whisker plot, is a type of graph used to display patterns of quantitative data. A boxplot splits the data set into quartiles. The body of the boxplot consists of a "box" (hence, the name), which goes from the first quartile (Q1) to the third quartile (Q3). Within the box, a vertical line is drawn at Q2, the median of the data set. Two horizontal lines, called whiskers, extend from the front and back of the box. The front whisker goes from Q1 to the smallest non-outlier in the data set, and the back whisker goes from Q3 to the largest non-outlier. If the data set includes one or more outliers, they are plotted separately as points on the chart. (In the example boxplot, not reproduced here, two outliers precede the first whisker and three outliers follow the second whisker.)

what is a stem plot

A stemplot is used to display quantitative data, generally from small data sets (50 or fewer observations). In a stemplot, the entries on the left are called stems, and the entries on the right are called leaves. The stems might be tens or hundreds, but they could be other units - millions, thousands, ones, tenths, etc. Some stemplots include a key to help the user interpret the display correctly; for example, a key might indicate that a stem of 110 with a leaf of 7 represents an IQ score of 117. (The example stemplot, not reproduced here, showed IQ scores for 30 sixth graders, with stems from 80 through 150 and the stems and leaves explicitly labeled for educational purposes. In the real world, however, stemplots usually do not include explicit labels for the stems and leaves.)

a statistical experiment is

All statistical experiments have three things in common: the experiment can have more than one possible outcome; each possible outcome can be specified in advance; and the outcome of the experiment depends on chance. A coin toss has all the attributes of a statistical experiment. There is more than one possible outcome. We can specify each possible outcome in advance - heads or tails. And there is an element of chance. We cannot know the outcome until we actually flip the coin.

symmetry is

Symmetry is an attribute used to describe the shape of a data distribution. When it is graphed, a symmetric distribution can be divided at the center so that each half is a mirror image of the other. A non-symmetric distribution cannot.

variance is

The variance is a numerical value used to indicate how widely individuals in a group vary. If individual observations vary greatly from the group mean, the variance is big; and vice versa. It is important to distinguish between the variance of a population and the variance of a sample. They have different notation, and they are computed differently. The variance of a population is denoted by σ²; and the variance of a sample, by s². The variance of a population is defined by the following formula: σ² = Σ ( Xi − X )² / N, where σ² is the population variance, X is the population mean, Xi is the ith element from the population, and N is the number of elements in the population. The variance of a sample is defined by a slightly different formula: s² = Σ ( xi − x )² / ( n − 1 ), where s² is the sample variance, x is the sample mean, xi is the ith element from the sample, and n is the number of elements in the sample.

replication is

In an experiment, replication refers to the practice of assigning each treatment to many experimental subjects. In general, the more subjects in each treatment condition, the lower the variability of the dependent measures.

robust

are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal. Robust statistical methods have been developed for many common problems, such as estimating location, scale and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from parametric distributions. For example, robust methods work well for mixtures of two normal distributions with different standard-deviations; under this model, non-robust methods like a t-test work badly.

what is the purpose of replication?

the repetition of an experimental condition so that the variability associated with the phenomenon can be estimated. ASTM, in standard E1847, defines replication as "the repetition of the set of all the treatment combinations to be compared in an experiment. Each of the repetitions is called a replicate." Replication is not the same as repeated measurements of the same item: they are dealt with differently in statistical experimental design and data analysis. For proper sampling, a process or batch of products should be in reasonable statistical control; inherent random variation is present but variation due to assignable (special) causes is not. Evaluation or testing of a single item does not allow for item-to-item variation and may not represent the batch or process. Replication is needed to account for this variation among items and treatments.

what is response bias?

In survey sampling, response bias refers to the bias that results from problems in the measurement process. Some examples of response bias are given below. Leading questions: the wording of the question may be loaded in some way to unduly favor one response over another. For example, a satisfaction survey may ask the respondent to indicate whether she is satisfied, dissatisfied, or very dissatisfied. By giving the respondent one response option to express satisfaction and two response options to express dissatisfaction, this survey question is biased toward getting a dissatisfied response. Social desirability: most people like to present themselves in a favorable light, so they will be reluctant to admit to unsavory attitudes or illegal activities in a survey, particularly if survey results are not confidential. Instead, their responses may be biased toward what they believe is socially desirable.

what is a mean

A mean score is an average score, often denoted by X. It is the sum of individual scores divided by the number of individuals. Thus, if you have a set of N numbers ( X1, X2, X3, . . . XN ), the mean of those numbers would be defined as: X = ( X1 + X2 + X3 + . . . + XN ) / N = [ Σ Xi ] / N. For example, the mean of the numbers 1, 2, and 3 would be (1 + 2 + 3)/3 or 2. Note: the mean score of a random variable (also called the expected value) is defined somewhat differently. See expected value.

what is a median

The median is a simple measure of central tendency. To find the median, we arrange the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values.Thus, in a sample of four families, we might want to compute the median annual income. Suppose the incomes are $30,000 for the first family; $50,000, for the second; $90,000, for the third; and $110,000, for the fourth. The two middle values are $50,000 and $90,000. Therefore, the median annual income is ($50,000 + $90,000)/2 or $70,000.

what is multi stage sampling

With multistage sampling, we select a sample by using combinations of different sampling methods. For example, in Stage 1, we might use cluster sampling to choose clusters from a population. Then, in Stage 2, we might use simple random sampling to select a subset of elements from each cluster for the final sample.

stratified sampling is

Stratified sampling refers to a type of sampling method. With stratified sampling, the researcher divides the population into separate groups, called strata. Then, a probability sample (often a simple random sample) is drawn from each group. Stratified sampling has several advantages over simple random sampling. For example, using stratified sampling, it may be possible to reduce the sample size required to achieve a given precision. Or it may be possible to increase the precision with the same sample size.

skewness is

When they are displayed graphically, some distributions of data have many more observations on one side of the graph than the other. Distributions with fewer observations on the right (toward higher values) are said to be skewed right; and distributions with fewer observations on the left (toward lower values) are said to be skewed left.

significance level is

A Type I error occurs when the researcher rejects a null hypothesis when it is true. The probability of committing a Type I error is called the significance level, and is often denoted by α.

a sample design is

A sample design is made up of two elements. Sampling method: sampling method refers to the rules and procedures by which some elements of the population are included in the sample. Some common sampling methods are simple random sampling, stratified sampling, and cluster sampling. Estimator: the estimation process for calculating sample statistics is called the estimator. Different sampling methods may use different estimators. For example, the formula for computing a mean score with a simple random sample is different from the formula for computing a mean score with a stratified sample. Similarly, the formula for the standard error may vary from one sampling method to the next. The "best" sample design depends on survey objectives and on survey resources. For example, a researcher might select the most economical design that provides a desired level of precision. Or, if the budget is limited, a researcher might choose the design that provides the greatest precision without going over budget.

a randomized block design is

With a randomized block design, the experimenter divides subjects into subgroups called blocks, such that the variability within blocks is less than the variability between blocks. Then, subjects within each block are randomly assigned to treatment conditions. Compared to a completely randomized design, this design reduces variability within treatment conditions and potential confounding, producing a better estimate of treatment effects. The table below shows a randomized block design for a hypothetical medical experiment.

Gender    Placebo    Vaccine
Male      250        250
Female    250        250

Subjects are assigned to blocks, based on gender. Then, within each block, subjects are randomly assigned to treatments (either a placebo or a cold vaccine). For this design, 250 men get the placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women get the vaccine. It is known that men and women are physiologically different and react differently to medication. This design ensures that each treatment condition has an equal proportion of men and women. As a result, differences between treatment conditions cannot be attributed to gender. This randomized block design removes gender as a potential source of variability and as a potential confounding variable.

a range is

The range is a simple measure of variation in a set of random variables. It is the difference between the biggest and smallest random variable. Range = maximum value − minimum value. Therefore, the range of the four random variables {3, 5, 5, 7} would be 7 − 3 or 4.

a ratio level is

The ratio scale of measurement is a type of measurement scale. It is characterized by equal intervals between scale units and a minimum scale value of zero. The weight of an object would be an example of a ratio scale. Units along the weight scale are equal to one another, and the minimum value is zero. Weight scales have a minimum value of zero because objects at rest can be weightless, but they cannot have negative weight.

68, 95, 99.7 rule is

In statistics, the so-called 68–95–99.7 rule is a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution with a width of one, two and three standard deviations, respectively; more accurately, 68.27%, 95.45% and 99.73% of the values lie within one, two and three standard deviations of the mean, respectively. In mathematical notation, these facts can be expressed as follows, where x is an observation from a normally distributed random variable, μ is the mean of the distribution, and σ is its standard deviation:
68, 95, 99.7 rule is
About 68% of values fall within one standard deviation of the mean. About 95% of the values fall within two standard deviations from the mean. Almost all of the values - about 99.7% - fall within three standard deviations from the mean. These facts are what is called the 68-95-99.7 rule, sometimes called the Empirical Rule. That's because the rule originally came from observations (empirical means "based on observation").

the empirical rules states that

The empirical rule states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean. The empirical rule can be broken down into three parts: 68% of data falls within the first standard deviation from the mean; 95% falls within two standard deviations; and 99.7% falls within three standard deviations. The rule is also called the 68-95-99.7 Rule or the Three Sigma Rule.

empirical rule continued

When applying the Empirical Rule to a data set the following conditions are true: approximately 68% of the data falls within one standard deviation of the mean (between the mean minus one standard deviation and the mean plus one standard deviation), written μ ± 1σ; approximately 95% of the data falls within two standard deviations of the mean, written μ ± 2σ; and approximately 99.7% of the data falls within three standard deviations of the mean, written μ ± 3σ.

standard deviation is

The standard deviation is a numerical value used to indicate how widely individuals in a group vary. If individual observations vary greatly from the group mean, the standard deviation is big; and vice versa. It is important to distinguish between the standard deviation of a population and the standard deviation of a sample. They have different notation, and they are computed differently. The standard deviation of a population is denoted by σ and the standard deviation of a sample by s. The standard deviation of a population is defined by the following formula: σ = sqrt[ Σ ( Xi − X )² / N ], where σ is the population standard deviation, X is the population mean, Xi is the ith element from the population, and N is the number of elements in the population. The standard deviation of a sample is defined by a slightly different formula: s = sqrt[ Σ ( xi − x )² / ( n − 1 ) ], where s is the sample standard deviation, x is the sample mean, xi is the ith element from the sample, and n is the number of elements in the sample. And finally, the standard deviation is equal to the square root of the variance.
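A tiny Python check of the population-versus-sample distinction (made-up data, not from the deck):

```python
import statistics

# Hypothetical data
data = [4, 8, 6, 5, 3, 7]

sigma = statistics.pstdev(data)   # population sd: divide by N
s = statistics.stdev(data)        # sample sd: divide by n - 1
print(f"population sd = {sigma:.3f}, sample sd = {s:.3f}")
```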

standard error is

The standard error is a measure of the variability of a statistic. It is an estimate of the standard deviation of a sampling distribution. The standard error depends on three factors: N, the number of observations in the population; n, the number of observations in the sample; and the way that the random sample is chosen. If the population size is much larger than the sample size, then the sampling distribution has roughly the same standard error, whether we sample with or without replacement. On the other hand, if the sample represents a significant fraction (say, 1/20) of the population size, the standard error will be noticeably smaller, when we sample without replacement.

the standard normal distribution is

The standard normal distribution is a special case of the normal distribution. It is the distribution that occurs when a normal random variable has a mean of zero and a standard deviation of one. The normal random variable of a standard normal distribution is called a standard score or a z score. Every normal random variable X can be transformed into a z score via the following equation: z = (X − μ) / σ, where X is a normal random variable, μ is the mean, and σ is the standard deviation.

percent also means

probability

z score

A z-score (aka, a standard score) indicates how many standard deviations an element is from the mean. A z-score can be calculated from the following formula: z = (X − μ) / σ, where z is the z-score, X is the value of the element, μ is the population mean, and σ is the standard deviation. Here is how to interpret z-scores. A z-score less than 0 represents an element less than the mean. A z-score greater than 0 represents an element greater than the mean. A z-score equal to 0 represents an element equal to the mean. A z-score equal to 1 represents an element that is 1 standard deviation greater than the mean; a z-score equal to 2, 2 standard deviations greater than the mean; etc. A z-score equal to -1 represents an element that is 1 standard deviation less than the mean; a z-score equal to -2, 2 standard deviations less than the mean; etc. If the number of elements in the set is large, about 68% of the elements have a z-score between -1 and 1; about 95% have a z-score between -2 and 2; and about 99% have a z-score between -3 and 3.

association does not imply causation

is a phrase used in statistics to emphasize that a correlation between two variables does not imply that one causes the other. Many statistical tests calculate correlation between variables. A few go further, using correlation as a basis for testing a hypothesis of a true causal relationship; examples are the Granger causality test and convergent cross mapping.

association

a relationship between two variables; remember that association does not imply causation

the alternative hypothesis is

The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample observations are influenced by some non-random cause.

the null hypothesis is

The null hypothesis, denoted by H0, is usually the hypothesis that sample observations result purely from chance.

the null and the alternative

For example, suppose we wanted to determine whether a coin was fair and balanced. A null hypothesis might be that half the flips would result in Heads and half in Tails. The alternative hypothesis might be that the number of Heads and Tails would be very different. Symbolically, these hypotheses would be expressed as H0: p = 0.5 and Ha: p ≠ 0.5. Suppose we flipped the coin 50 times, resulting in 40 Heads and 10 Tails. Given this result, we would be inclined to reject the null hypothesis. That is, we would conclude that the coin was probably not fair and balanced.

for the Null

we start off assuming that it is true, and we are trying to prove the alternative is true.

the law of large number is

One can think about the probability of an event in terms of its long-run relative frequency. The relative frequency of an event is the number of times an event occurs, divided by the total number of trials: P(A) = ( Frequency of Event A ) / ( Number of Trials ). For example, a merchant notices one day that 5 out of 50 visitors to her store make a purchase. The next day, 20 out of 50 visitors make a purchase. The two relative frequencies (5/50 or 0.10 and 20/50 or 0.40) differ. However, summing results over many visitors, she might find that the probability that a visitor makes a purchase gets closer and closer to 0.20. A plot of the relative frequency as the number of trials (in this case, the number of visitors) increases would show the relative frequency converging toward a stable value (0.20), which can be interpreted as the probability that a visitor to the store will make a purchase. The idea that the relative frequency of an event will converge on the probability of the event, as the number of trials increases, is called the law of large numbers.

probability sampling is

Statisticians distinguish between two broad categories of sampling. Probability sampling: with probability sampling, every element of the population has a known probability of being included in the sample. Non-probability sampling: with non-probability sampling, we cannot specify the probability that each element will be included in the sample. Probability samples allow us to make probability statements about sample statistics. We can estimate the extent to which a sample statistic is likely to differ from a population parameter.

a convenience sample is

A convenience sample is one of the main types of non-probability sampling methods. A convenience sample is made up of people who are easy to reach. Consider the following example. A pollster interviews shoppers at a local mall. If the mall was chosen because it was a convenient site from which to solicit survey participants and/or because it was close to the pollster's home or business, this would be a convenience sample.

a voluntary sample is

A voluntary sample is one of the main types of non-probability sampling methods. A voluntary sample is made up of people who self-select into the survey. Often, these folks have a strong interest in the main topic of the survey. Suppose, for example, that a news show asks viewers to participate in an on-line poll. This would be a voluntary sample. The sample is chosen by the viewers, not by the survey administrator.

a completely randomized design is

A completely randomized design is probably the simplest experimental design, in terms of data analysis and convenience. With this design, subjects are randomly assigned to treatments. A completely randomized design layout for a hypothetical medical experiment is shown below.

Placebo    Vaccine
500        500

In this design, the experimenter randomly assigned subjects to one of two treatment conditions. They received a placebo or they received a cold vaccine. The same number of subjects (500) are assigned to each treatment condition (although this is not required). The dependent variable is the number of colds reported in each treatment condition. If the vaccine is effective, subjects in the "vaccine" condition should report significantly fewer colds than subjects in the "placebo" condition. A completely randomized design relies on randomization to control for the effects of extraneous variables. The experimenter assumes that, on average, extraneous factors will affect treatment conditions equally; so any significant differences between conditions can fairly be attributed to the independent variable.

randomization

Randomization refers to the practice of using chance methods (random number tables, flipping a coin, etc.) to assign subjects to treatments. In this way, the potential effects of lurking variables are distributed at chance levels (hopefully roughly evenly) across treatment conditions.

lurking variables are

A well-designed experiment includes design features that allow researchers to eliminate extraneous variables as an explanation for the observed relationship between the independent variable(s) and the dependent variable. These extraneous variables are called lurking variables.

a randomized block design is

With a randomized block design, the experimenter divides subjects into subgroups called blocks, such that the variability within blocks is less than the variability between blocks. Then, subjects within each block are randomly assigned to treatment conditions. Compared to a completely randomized design, this design reduces variability within treatment conditions and potential confounding, producing a better estimate of treatment effects. The table below shows a randomized block design for a hypothetical medical experiment.

Gender    Placebo   Vaccine
Male      250       250
Female    250       250

Subjects are assigned to blocks based on gender. Then, within each block, subjects are randomly assigned to treatments (either a placebo or a cold vaccine). For this design, 250 men get the placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women get the vaccine. It is known that men and women are physiologically different and react differently to medication. This design ensures that each treatment condition has an equal proportion of men and women. As a result, differences between treatment conditions cannot be attributed to gender. This randomized block design removes gender as a potential source of variability and as a potential confounding variable.
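
A hedged Python sketch of the blocking-then-randomizing idea, mirroring the 250/250 table above (subject names and block sizes are hypothetical):

```python
import random

random.seed(4)
# Hypothetical subjects, already grouped into blocks by gender.
blocks = {
    "male": [f"male_{i}" for i in range(1, 501)],
    "female": [f"female_{i}" for i in range(1, 501)],
}

assignments = {}
for gender, block in blocks.items():
    random.shuffle(block)                 # randomize within each block
    half = len(block) // 2
    for subject in block[:half]:
        assignments[subject] = "placebo"
    for subject in block[half:]:
        assignments[subject] = "vaccine"

# Each treatment now contains 250 men and 250 women, so gender cannot
# confound the placebo-versus-vaccine comparison.
```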

control is

In an experiment, a control group is a baseline group that receives no treatment or a neutral treatment. To assess treatment effects, the experimenter compares results in the treatment group to results in the control group.

double blinding is

In an experiment, if subjects in the control group know that they are receiving a placebo, the placebo effect will be reduced or eliminated, and the placebo will not serve its intended control purpose. Blinding is the practice of not telling subjects whether they are receiving a placebo. In this way, subjects in the control and treatment groups experience the placebo effect equally. Often, knowledge of which groups receive placebos is also kept from analysts who evaluate the experiment. This practice is called double blinding. It prevents the analysts from "spilling the beans" to subjects through subtle cues, and it assures that their evaluation is not tainted by awareness of actual treatment conditions.

Bivariate Data is

Statistical data are often classified according to the number of variables being studied. Univariate data: when we conduct a study that looks at only one variable, we say that we are working with univariate data. Suppose, for example, that we conducted a survey to estimate the average weight of high school students; since we are only working with one variable (weight), we would be working with univariate data. Bivariate data: when we conduct a study that examines the relationship between two variables, we are working with bivariate data. Suppose we conducted a study to see if there were a relationship between the height and weight of high school students; since we are working with two variables (height and weight), we would be working with bivariate data.

non response bias is

Sometimes, in survey sampling, individuals chosen for the sample are unwilling or unable to participate in the survey. Nonresponse bias is the bias that results when respondents differ in meaningful ways from nonrespondents. Nonresponse is often a problem with mail surveys, where the response rate can be very low.

response bias is

In survey sampling, response bias refers to the bias that results from problems in the measurement process. Some examples of response bias are given below. Leading questions: the wording of the question may be loaded in some way to unduly favor one response over another. For example, a satisfaction survey may ask the respondent to indicate whether she is satisfied, dissatisfied, or very dissatisfied; by giving the respondent one response option to express satisfaction and two response options to express dissatisfaction, this survey question is biased toward getting a dissatisfied response. Social desirability: most people like to present themselves in a favorable light, so they will be reluctant to admit to unsavory attitudes or illegal activities in a survey, particularly if survey results are not confidential. Instead, their responses may be biased toward what they believe is socially desirable.

interviewer bias is

Interviewer bias is the distortion of responses in a personal or telephone interview that results from differential reactions to the social style and personality of interviewers, or to their presentation of particular questions. The use of fixed-wording questions is one method of reducing interviewer bias. Anthropological research and case studies are also affected by the problem, which is exacerbated by the self-fulfilling prophecy when the researcher is also the interviewer.

under coverage is

In survey sampling, undercoverage is a type of selection bias. It occurs when some members of the population are inadequately represented in the sample. A classic example of undercoverage is the Literary Digest voter survey, which predicted that Alfred Landon would beat Franklin Roosevelt in the 1936 presidential election. The survey sample suffered from undercoverage of low-income voters, who tended to be Democrats. Undercoverage is often a problem with convenience samples.

know why we need randomness, and that it is

so we can do inference, and inference is






Statistical inference means drawing conclusions based on data. There are many contexts in which inference is desirable, and there are many approaches to performing inference.










Inference is the process of deducing properties of an underlying distribution by analysis of data. Inferential statistical analysis infers properties about a population: this includes testing hypotheses and deriving estimates. The population is assumed to be larger than the observed data set; in other words, the observed data are assumed to be sampled from a larger population. Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and does not assume that the data came from a larger population.

the four-step process for using confidence intervals is

One. State: what is the question that requires estimating a parameter?






Two. Plan: choose the correct procedure, like a one-sample z confidence interval for a proportion (a code sketch follows this card),


and choose a level of confidence






Three. Solve: plot the data (for quantitative variables only) and carry out the calculations






Four. Conclude by stating




the level of confidence






and the confidence interval




and the parameter of interest in the context of the problem
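
A minimal Python sketch of the Solve step for a one-sample z confidence interval for a proportion; the counts below are made up for illustration:

```python
import math
from statistics import NormalDist

# Hypothetical data: 120 successes in a simple random sample of n = 400.
successes, n = 120, 400
p_hat = successes / n

confidence = 0.95
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value z*

margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{confidence:.0%} confidence interval for p: "
      f"({p_hat - margin:.3f}, {p_hat + margin:.3f})")
```

The margin of error here is z* times sqrt(p hat (1 - p hat) / n), matching the standard deviation formula for the sampling distribution of p hat.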



The test of significance four-step process

State: what is the question that requires a statistical test?






Plan: choose the correct test procedure, like a matched pairs t test (a code sketch follows this card)






Solve




and conclude




and that means comparing the p-value to alpha and deciding whether to reject or fail to reject H0




and concluding in the context of the problem
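
A minimal Python sketch of the Solve and Conclude steps for a matched pairs t test, assuming SciPy is available; the before/after measurements are made up:

```python
from scipy import stats   # assumes SciPy is installed

# Hypothetical matched pairs: a measurement before and after treatment
# on the same eight subjects.
before = [12.1, 11.4, 13.0, 10.8, 12.6, 11.9, 13.3, 12.0]
after  = [11.2, 11.0, 12.1, 10.5, 12.0, 11.3, 12.4, 11.6]

alpha = 0.05
t_stat, p_value = stats.ttest_rel(before, after)   # matched pairs t test

# Conclude: compare the p-value to alpha, then state the decision in context.
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```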