175 Cards in this Set
- Front
- Back
Approach
|
Suggested way to look at and organize a problem so it can be solved.
**There is usually more than one way... |
|
Bias
|
Occurs when the results of the sample are not representative of the population
|
|
3 Sources of Bias
|
1. Sampling Bias
2. Non-response Bias
3. Response Bias |
|
Blinding
|
Refers to non-disclosure of a treatment an experimental unit is receiving.
|
|
2 Types of Blinding
|
1. Single Blinding
2. Double Blinding |
|
Single Blind
|
Experiment in which the experimental unit (or subject) does not know which treatment (s)he is receiving.
|
|
Double Blind
|
Study/Experiment in which neither the experimental unit nor the researcher administering the treatment knows which treatment the subject is receiving.
|
|
Case Control Studies
|
Retrospective studies that require individuals to look back in time, or researchers to look at existing records.
|
|
Closed Question
|
Question for which the respondent must choose from a list of predetermined responses - i.e., multiple choice
|
|
Cluster Sample
|
Sample obtained by selecting all individuals within a randomly selected collection or group of individuals.
(i.e., All Online students = population; each online class = cluster; obtain a simple random sample of clusters; survey all students within the selected clusters.) |
|
Cohort Studies
|
Identifies a group of individuals to participate (the cohort), who are then observed over a length of time (sometimes very long). Characteristics are recorded; some individuals will be exposed to certain factors (not intentionally) and others will not. At the end of the study, the value of the response variable is recorded for each individual. (i.e., Framingham Heart Study)
|
|
Completely Randomized Design
|
Simplest type of experiment. Design in which each experimental unit is randomly assigned to a treatment.
(i.e., field fertilizer example) |
|
Continuous Variable
|
A quantitative variable that has an infinite number of possible values that are not countable.
(If you measure to get the value of a quantitative variable, it is continuous.) |
|
Confounding
|
Occurs when the effects of two or more explanatory variables are not separated. Therefore, any relation that may exist between an explanatory variable and the response variable may be due to some other variable(s) not accounted for in the study.
** Major problem with observational studies, often the cause is a "lurking variable". |
|
Control Group
|
Serves as a baseline treatment that can be used to compare to other treatments.
|
|
Convenience Sampling
|
Sample in which the individuals are easily obtained and not based on randomness.
* self-selected - individuals decided to participate (voluntary response), i.e., phone-in polling, internet surveys **Unreliable results because sampling is not random. |
|
Cross-Sectional Studies
|
Observational studies that collect information about individuals at a specific point in time - or over a very short period of time.
|
|
Data
|
Fact or Proposition used to draw a conclusion or make a decision. Can be numerical or non-numerical. (List of observed values for a variable)
i.e., gender is a variable - the observations of Male/Female are data. |
|
Designed Experiment
|
The researcher assigns individuals to groups, intentionally changes the value of an explanatory variable, and records the value of the response variable for each group.
|
|
Steps in Designing an Experiment
|
1. Identify the problem to be solved (be explicit).
2. Determine the factors that affect the response variable (consult a field expert).
3. Determine the number of experimental units.
4. Determine the level of each factor: a) Control - fix the factors at one predetermined level, or set them at predetermined levels; b) Randomize - randomly assign experimental units to the various treatment groups so the effects of factors that cannot be controlled are minimized.
5. Conduct the experiment: a) Experimental units are randomly assigned to the treatments. Replication occurs when each treatment is applied to more than one experimental unit. b) Collect and process the data; measure the value of the response variable for each replication.
6. Test the claim (inferential statistics). |
|
Discrete Variable
|
A quantitative variable that has either a finite number of possible values, or a countable number of possible values. (If you count to get the value of a quantitative variable, it is discrete)
|
|
Experiment
|
A controlled study conducted to determine the effect that varying one or more explanatory variables (factors) has on a response variable.
|
|
Experimental Unit
|
Person, object, or some other well defined item upon which a treatment is applied. (Often referred to as a Subject)
|
|
Explanatory Variable
|
The variable whose effect on the response variable is being studied; in an experiment, the factor that is intentionally varied.
|
|
Functional Status
|
The ability to conduct day-to-day activities.
|
|
Individual
|
Person or object that is a member of the population to be studied.
|
|
Interval Level of Measurement
(Quantitative Variable) |
Has the properties of the ordinal level, and the differences in the values of the variable have meaning. Arithmetic operations can be performed (addition & subtraction).
(i.e., Temperature - arithmetic operations can be performed, but ratios do not represent meaningful results.) |
|
Lurking Variable
|
an explanatory variable that was not considered in a study, but that affects the value of the response variable in the study.
* Typically related to explanatory variables considered in a study. |
|
Matched-Pairs Design
|
Experimental design in which the experimental units are paired up. The pairs are matched so that they are somehow related; there are only 2 levels of treatment.
|
|
Nominal Level of Measurement
(Qualitative Variable) |
Values of the variable - name, label or categorized - does not allow for the values to be arranged in a ranked or specific order.
(i.e. gender) |
|
Non-response Bias
|
Exists when individuals selected to be in the sample who do not respond to the survey have different opinions from those who do participate.
|
|
Non-Sampling Errors
|
Non-response bias, response bias, data entry errors, undercoverage. Can also be present in a Census.
** The errors that result from obtaining and recording the information collected. |
|
Observational Study
|
Measures the value of the response variable without attempting to influence the value of either the response or explanatory variables.
**Observes the behavior without trying to influence the outcome. |
|
Types of Observational Studies
|
1. Cross-Sectional: collect information about individuals - usually short periods of time.
2. Case-Control: collect information about individuals - completed retrospectively.
3. Cohort: individuals studied for longer periods of time - completed prospectively. |
|
Open Question
|
A question for which the respondent is free to choose his or her response: Open line answer.
|
|
Ordinal Level of Measurement
(Qualitative Variables) |
Has the properties of the nominal level and the naming scheme allows for the values to be arranged in a specific order.
(i.e., Letter Grades) Can be ranked, but differences have no meaning. |
|
Parameter
|
Numerical summary of a population
|
|
Placebo
|
An innocuous medication (such as sugar tablets) that looks, tastes, and smells like the experimental medication.
|
|
Population
|
Entire group of individuals to be studied.
|
|
Qualitative Data
|
Observations corresponding to a qualitative variable
|
|
Qualitative Variables
*(Categorical) |
Allow for classification of individuals based on some attribute or characteristic.
|
|
Quantitative Data
|
Observations corresponding to a quantitative variable
*(Discrete/Continuous) |
|
Quantitative Variables
|
Provide numerical measures of individuals. Arithmetic operations - addition and subtraction - can be performed on the values and will provide meaningful results.
|
|
Random Sampling
|
The process of using chance to select individuals from a population to be included in a sample.
|
|
Ratio Level of Measurement
(Quantitative Variables) |
Has the properties of the interval level and the ratios have meaning. Arithmetic operations can be completed - Multiplication and Division.
|
|
Response Bias
|
Exists when the answers on a survey do not reflect the true feelings of the respondent.
(i.e., Interviewer Error, Misrepresented Answers, Wording of Questions) |
|
Reasons for Response Bias
|
1. Interviewer Error: untrained interviewers.
2. Misrepresented Answers: responses that misrepresent facts; flat-out lies.
3. Wording of Questions: unbalanced? vague?
4. Ordering of Questions or Words: questions should be rearranged and asked again.
5. Type of Question: open (free to choose response) vs. closed (multiple choice).
6. Data Entry Error: imperative to perform accuracy checks! |
|
Reliability
|
Represents the ability of different measurements of the same individual to yield the same results.
|
|
Response Variable
|
The variable that is measured in a study - the outcome, or end result; its value depends on the explanatory variable.
|
|
Sample
|
Subset of the population being studied
|
|
Sampling Bias
|
Exists when the technique used to obtain the individuals in the sample tends to favor one part of the population over another.
|
|
Sampling Error
|
The error that results from using a sample to estimate information about a population; occurs because a sample gives incomplete information about population (cannot reveal all)
**Error that results from using a subset of a population to describe characteristics of the population |
|
Sampling With Replacement
|
A certain number of individuals is selected from the population; questions/surveys are sent to the individuals in the sample. Individuals' names are left in the population and could possibly be chosen again.
|
|
Sampling Without Replacement
|
A certain number of individuals is selected from the population; questions/surveys are sent to the individuals in the sample. Those individuals' names are removed and cannot be chosen again.
|
|
Seed
|
In a random number generator, provides an initial point for the generator to start creating random numbers
(dictates the random numbers that are generated) |
|
Simple Random Sampling
|
A sample of size "n" from a population of size "N" is obtained such that every possible sample of size "n" has an equally likely chance of occurring.
|
|
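The seed and simple-random-sampling cards above can be sketched with Python's standard library; the population of 100 IDs and the seed value are illustrative:

```python
import random

# Hypothetical population: 100 ID numbers (values are illustrative)
population = list(range(1, 101))

# The seed provides the generator's starting point, so the same
# "random" sample can be reproduced
random.seed(42)

# Simple random sample of size n = 10, drawn without replacement:
# every possible sample of size 10 is equally likely
sample = random.sample(population, k=10)

print(len(sample))       # 10
print(len(set(sample)))  # 10 -- no individual appears twice
```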
Goal of Sampling
|
Obtain as much information as possible about the population at the least cost.
|
|
Statistic
|
Numerical summary of a sample
|
|
Statistics
|
The science of collecting, organizing, summarizing and analyzing information to draw conclusions or answer questions.
In addition, statistics is about providing a measure of confidence in any conclusions. |
|
Descriptive Statistics
|
Consists of organizing and summarizing data. Describes data through numerical summaries, tables, and graphs.
|
|
Inferential Statistics
|
Uses methods that take a result from a sample, extend it to the population and measure the reliability of the result.
|
|
Process of Statistics
|
1) Identify the research objective (determine the detailed questions).
2) Collect the data needed to answer those questions - important to use appropriate data-collection processes.
3) Describe the data - descriptive statistics allows the researcher to obtain an overview of the data.
4) Perform inference: apply the appropriate techniques to extend the results obtained from the sample to the population, and report a level of reliability. |
|
Stratified Sample
|
Separate population into non-overlapping groups (strata) and then obtaining a simple random sample from each stratum. The individuals within each stratum should be homogenous (similar) in some way.
|
|
Systematic Sampling
|
Obtained by selecting every kth individual from the population. The first individual selected corresponds to a random number between 1 and k.
Formula: k = N/n; random start = p. Sample consists of: p, p + k, p + 2k, …, p + (n − 1)k |
|
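A minimal sketch of systematic sampling, with an illustrative population size N = 1000 and sample size n = 50:

```python
import random

N, n = 1000, 50           # population and sample sizes (illustrative)
k = N // n                # select every k-th individual: k = 20
random.seed(1)
p = random.randint(1, k)  # random starting point between 1 and k

# Sample consists of p, p + k, p + 2k, ..., p + (n - 1)k
indices = [p + i * k for i in range(n)]

print(len(indices))      # 50
print(indices[-1] <= N)  # True -- the last selection stays inside the population
```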
Treatment
|
Any combination of the values of factors used in an experiment
|
|
Validity
|
Represents how close to the true value the measurement is.
|
|
Undercoverage
|
Occurs when the proportion of one segment of the population is lower in a sample than it is in the population.
(can be caused by incomplete/incorrect frame, or not representative of population) |
|
Variables
|
Characteristics of the individuals within the population
|
|
Bar Graph
|
Constructed by labeling each category of data on either the horizontal or vertical axis and the frequency or relative frequency of the category on the other axis. Rectangles of equal width are drawn for each category. The height of each rectangle represents the category's frequency/relative frequency.
|
|
Bell-Shaped Distribution
|
(Symmetric Distribution)
The highest frequency occurs in the middle and frequencies tail off to the left & right |
|
Classes
|
Categories of data; Categories by which data are grouped.
|
|
Class Width
|
The difference between consecutive lower class limits.
i.e., classes 25-34, 35-44: 35 − 25 = 10, so the class width is 10 |
|
Guidelines for Determining the lower Class Limit of the First Class and Class Width
|
Choose the lower class limit of the first class: the smallest observation in the data set, or a convenient number slightly lower than the smallest observation.
Determine the class width: *Decide on the number of classes (generally between 5 & 20). *Compute (largest data value − smallest data value) ÷ number of classes, and round this value up to a convenient number. |
|
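The class-width computation above can be sketched directly; the ten data values are illustrative:

```python
import math

data = [25, 31, 47, 52, 38, 44, 60, 29, 55, 33]  # illustrative observations
num_classes = 5                                   # generally between 5 and 20

# (largest data value - smallest data value) / number of classes,
# rounded UP to a convenient number
class_width = math.ceil((max(data) - min(data)) / num_classes)

# Lower class limit of the first class: the smallest observation
lower_limits = [min(data) + i * class_width for i in range(num_classes)]

print(class_width)   # 7
print(lower_limits)  # [25, 32, 39, 46, 53]
```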
Deceptive Graphs
|
Purposely intended to create an incorrect impression.
|
|
Dot Plot
|
Drawn by placing each observation horizontally in increasing order and placing a dot above the observation each time it occurs.
Limited in usefulness but can be used to quickly visualize data. |
|
Frequency Distribution
|
Lists each category of data and the number of occurrences for each category of data
|
|
Guidelines for Constructing Good Graphs
|
**Title and label the graph's axes clearly; provide explanations if needed.
Include: - Units of measurement - Data source (when appropriate) ** Avoid distortion. Never lie about the data! |
|
Histogram
|
Constructed by drawing rectangles for each class of data. The height is the frequency or relative frequency of the class. The width of each rectangle is the same, and the rectangles touch!
|
|
Lower Class Limit
|
The Lowest (smallest) value of a class
i.e., 25-34: lower class limit = 25 |
|
Misleading Graphs
|
Graphs that unintentionally create an incorrect impression.
Most common: * Manipulation of scale * Misplaced origin (scale not starting at 0) |
|
Pareto Chart
|
Bar graph whose bars are drawn in decreasing order of frequency/relative frequency
*Helps prioritize categories for decision making purposes - QA, HR, Marketing |
|
Relative Frequency
|
The proportion (percentage) of observations within a category and is found using the following formula:
= Frequency/sum of all frequencies |
|
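Relative frequencies follow the formula on the card above; the survey responses are illustrative:

```python
from collections import Counter

responses = ["A", "B", "A", "C", "A", "B", "B", "A"]  # illustrative data

freq = Counter(responses)   # frequency distribution
total = sum(freq.values())  # sum of all frequencies

# Relative frequency = frequency / sum of all frequencies
rel_freq = {category: f / total for category, f in freq.items()}

print(rel_freq["A"])           # 0.5 -- 4 of the 8 observations
print(sum(rel_freq.values()))  # 1.0 -- relative frequencies always sum to 1
```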
Open Ended
|
The first class has no lower class limit; or the last class has no upper class limit.
|
|
Choosing Bar Graphs, Pie Graphs or Pareto Graphs
|
Pie Charts: should be used for showing the division of all possible values of a qualitative variable into its parts. (Not useful for comparing two specific values of the qualitative variable.)
* Bar Graphs: useful when comparing the different parts, but not the parts being compared to the whole. * Pareto Charts: useful when comparing parts to the whole. |
|
Pie Charts
|
Circle divided into sectors, each sector represents a category of data. Area of sector is proportional to frequency of the category.
* Typically used to present relative frequency of qualitative data. * Data is usually nominal, but can also be used for ordinal. |
|
Relative Frequency Distribution
|
Lists each category of data together with the relative frequency
|
|
Side-by-Side Bar Graphs
|
Bar graph that compares 2 data sets. Should be constructed using relative frequency, because different sample/population sizes make comparisons using frequency difficult or misleading.
|
|
Skewed Left Distribution
|
The tail to the left of the peak is longer than the tail to the right of the peak.
|
|
Skewed Right Distribution
|
The tail to the right of the peak is longer than the tail to the left.
|
|
Stem and Leaf Plot
|
another way to represent quantitative data graphically. Use the digits to the left of the right-most digit to form the stem. Each right-most digit forms a leaf. i.e., 147; 14 = Stem, 7 = Leaf
|
|
Construction of a Stem and Leaf Plot
|
Step 1: The stem of a data value consists of the digits to the left of the right-most digit. The leaf is the right-most digit.
Step 2: Write the stems in a vertical column in increasing order. Draw a vertical line to the right of the stems.
Step 3: Write each leaf corresponding to its stem to the right of the vertical line.
Step 4: Within each stem, rearrange the leaves in ascending order; title the plot and provide a legend. i.e., Legend: 5|5 represents 5.5% |
|
Time-Series Data
|
The value of a variable is measured at different points in time.
i.e., the closing price of Cisco Systems each month for the past 12 years. |
|
Time-Series Plot
|
Obtained by plotting the time in which a variable is measured on the horizontal axis and the corresponding value of the variable on the vertical axis.
|
|
Uniform Distribution
|
(Symmetric Distribution)
Shape of distribution - frequency of each value of the variable is evenly spread out across the values of the variable. |
|
Upper Class Limit
|
the largest (highest) value within the class
i.e., 25 - 34 Upper class limit = 34 |
|
Describe the Distribution
|
Describe its shape - skewed left, skewed right, symmetric; its center - mean or median; and its spread - standard deviation or IQR.
|
|
Relationship Between the Mean, Median and Distribution Shape
|
Skewed Left - Mean is substantially smaller than the Median.
Symmetric - Mean is roughly equal to Median. Skewed Right - Mean is substantially larger than Median. |
|
Sample Standard Deviation
|
Obtained by taking the square root of the sample variance.
s = √s² |
|
Population Standard Deviation
|
Obtained by taking the square root of the population variance.
σ = √σ² |
|
Arithmetic Mean
|
*Quantitative Data Only
Computed by determining the sum of all the values of the variables in the data set and dividing by the number of observations. |
|
Sample Arithmetic Mean
|
Computed by using sample data.
x̄ = (ΣXᵢ) / n |
|
Median
|
The value that lies in the middle of the data when arranged in ascending order.
M = median. Odd # of obs: the middle observation, in position (n + 1)/2. Even # of obs: the mean of the middle two obs - i.e., for 1, 2, 3, 4: (2 + 3)/2 = 2.5 |
|
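The odd/even cases on the median card can be sketched as a small function:

```python
def median(values):
    """Middle value once the data are arranged in ascending order."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                    # odd n: observation (n + 1)/2
    return (data[mid - 1] + data[mid]) / 2  # even n: mean of the middle two

print(median([3, 1, 2]))     # 2
print(median([1, 2, 3, 4]))  # 2.5
```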
Resistant
|
A numerical summary of data where extreme values (relative to data) do not affect its value substantially.
|
|
Mode
|
The most frequent observation of the variable that occurs in the data set.
(Tally the number of observations for each value in the data set. The data value that occurs most often = mode) * The only measure of central tendency that can be used for qualitative data. |
|
Bimodal
|
Data set that has two modes
|
|
Multimodal
|
Data set that has 3 or more values that occur with the highest frequency.
|
|
Circumstances for Measure of Central Tendency
|
Mean:
Population: μ = (ΣXᵢ)/N; Sample: x̄ = (ΣXᵢ)/n. The center of gravity. Use for quantitative data when the frequency distribution is roughly symmetric.
Median: arrange the data in ascending order and divide the data set in half (odd # of obs: position (n + 1)/2; even # of obs: mean of the middle two). Divides the data 50%/50%. Use for quantitative data when the frequency distribution is skewed left or right.
Mode: tally the data to determine the most frequent observation. Use when the most frequent observation is the desired measure of central tendency, or when the data are qualitative. |
|
Weighted Mean
|
Certain data values have a higher importance, or "weight," associated with them. Found by multiplying each value of the variable by its corresponding weight, summing the products, and dividing the result by the sum of the weights.
|
|
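A sketch of the weighted mean, using illustrative course grades weighted by credit hours:

```python
# (grade value, credit-hour weight) pairs -- illustrative
grades = [(4.0, 3), (3.0, 4), (2.0, 1)]

# Multiply each value by its weight, sum the products,
# and divide by the sum of the weights
weighted_mean = sum(x * w for x, w in grades) / sum(w for _, w in grades)

print(weighted_mean)  # (12 + 12 + 2) / 8 = 3.25
```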
Quartiles
|
Divides the data set into fourths.
Step 1: Arrange the data in ascending order.
Step 2: Determine the median (2nd quartile, Q2).
Step 3: Determine the 1st & 3rd quartiles by dividing the data set into two halves, then dividing each half in half: the 1st quartile (Q1) is the median of the bottom half; the 3rd quartile (Q3) is the median of the top half. |
|
Interquartile Range - IQR
|
The range of the middle 50% of the observations in a data set.
IQR = Q3 - Q1 The more spread a data set has, the higher the IQR will be. |
|
Outliers
|
Extreme observations.
*Origins must be investigated *Can distort the Mean and standard deviation |
|
Checking for Outliers Using Quartiles
|
1. Determine the 1st and 3rd quartiles of the data.
2. Compute the IQR.
3. Determine the "fences" - they serve as cutoff points for determining outliers. Lower fence: Q1 − 1.5(IQR); Upper fence: Q3 + 1.5(IQR).
4. If a data value is less than the lower fence or greater than the upper fence, it is considered an outlier. |
|
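A sketch of the fence check using `statistics.quantiles` (its "inclusive" method may give slightly different quartiles than the textbook split-the-halves convention; the data values are illustrative):

```python
import statistics

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]  # illustrative, sorted

# n=4 cut points give Q1, Q2 (the median), and Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [7, 15]
```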
Five Number Summary
|
Min, Q1, M (Q2), Q3, Max
Resistant to extreme values. Measures the spread of the data by determining the difference between the 25th and 75th percentiles (the IQR). |
|
Boxplot
|
1. Determine the lower and upper fences: IQR = Q3 − Q1; lower fence: Q1 − 1.5(IQR); upper fence: Q3 + 1.5(IQR).
2. Draw vertical lines at Q1, M, and Q3. Enclose these lines in a box.
3. Label the upper and lower fences.
4. Draw a line (whisker) from Q1 to the smallest data value that is larger than the lower fence, and a line from Q3 to the largest data value that is smaller than the upper fence.
5. Any data values less than the lower fence or greater than the upper fence are outliers, marked with an asterisk. |
|
Dispersion
|
The degree to which the data are spread out
(describes a distribution) |
|
Range
|
The range (R) of a variable is the difference between the largest data value and the smallest data value.
* Uses only 2 values from data set * Affected by extreme values - not resistant |
|
Deviation about the Mean
|
Measures how far, on average, each observation is from the mean. The further an observation is from the mean, the larger the absolute value of its deviation.
*Sum of all deviations about the mean must = 0 |
|
Population Variance
|
The sum of the squared deviations about the population mean divided by the number of observations in the population, (N).
*It is the mean of the squared deviations about the population mean. Population variance: σ². Formula: σ² = Σ(Xᵢ − μ)² / N. Note: ΣXᵢ² = square each observation, then sum the squared values; (ΣXᵢ)² = sum all observations, then square the sum. |
|
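The population-variance formula, computed directly on an illustrative population of five values:

```python
import math

population = [2, 4, 4, 4, 6]  # illustrative population, N = 5
N = len(population)
mu = sum(population) / N      # population mean

# sigma^2 = sum((x_i - mu)^2) / N -- the mean squared deviation about mu
sigma_sq = sum((x - mu) ** 2 for x in population) / N
sigma = math.sqrt(sigma_sq)   # population standard deviation

print(sigma_sq)  # 1.6
# A sample variance would instead divide by n - 1 (degrees of freedom)
```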
Biased
|
When a statistic consistently overestimates or underestimates the parameter it is estimating.
|
|
Sample Variance
|
**Always remember to use n − 1 in the denominator to keep from underestimating the variance.
Dividing by the smaller number (n − 1) produces a slightly larger variance, closer to the actual population variance. |
|
Degrees of Freedom
|
n − 1; the first n − 1 observations have the freedom to be whatever value.
*We have n − 1 degrees of freedom in the computation of s² because an unknown parameter, μ, is estimated with x̄. For each parameter estimated, we lose 1 degree of freedom. |
|
Standard Deviation
|
Used in conjunction with the mean to numerically describe distributions that are bell-shaped and symmetric. The mean measures the center, the St Dev measures the spread
*Loosely described as the typical deviation from the mean. The larger the st dev, the more dispersed the distribution. (Units of measure must be the same!) |
|
Empirical Rule
|
If a distribution is roughly bell shaped:
* Approximately 68% of the data will lie within 1 standard deviation of the mean - i.e., between μ − 1σ and μ + 1σ.
* Approximately 95% of the data will lie within 2 standard deviations of the mean - between μ − 2σ and μ + 2σ.
* Approximately 99.7% of the data will lie within 3 standard deviations of the mean - between μ − 3σ and μ + 3σ.
*The rule can also be used with sample data, with x̄ in place of μ and s in place of σ. |
|
Chebyshev's Inequality
|
For any data set, regardless of the shape of the distribution, at least
(1 − 1/k²) · 100% of the observations will lie within k standard deviations of the mean, where k is any number greater than 1 - i.e., at least (1 − 1/k²) · 100% of the data lie between μ − kσ and μ + kσ for k > 1.
*Can be used for sample data too. |
|
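Chebyshev's bound as a one-line function:

```python
def chebyshev_bound(k):
    """At least (1 - 1/k^2) * 100% of observations lie within k
    standard deviations of the mean, for any k > 1."""
    return (1 - 1 / k ** 2) * 100

print(chebyshev_bound(2))  # 75.0 -- at least 75% within 2 standard deviations
print(chebyshev_bound(3))  # about 88.9% within 3 standard deviations
```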
Correlation Coefficient
|
A measure of the strength and direction of the linear relation between 2 quantitative variables
ρ = population, r = sample |
|
Linear Correlation Coefficient Properties
|
1. Always between −1 and 1, inclusive: −1 ≤ r ≤ 1.
2. If r = +1, a perfect positive linear relation exists between the two variables.
3. If r = −1, a perfect negative linear relation exists between the two variables.
4. The closer r is to +1, the stronger the evidence of positive association.
5. The closer r is to −1, the stronger the evidence of negative association.
6. If r is close to zero, little or no evidence exists of a *linear* relation - that does not imply no relation, just no linear relation.
7. The linear correlation coefficient is a unitless measure of association.
8. The correlation is not resistant. |
|
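A from-scratch sketch of the sample correlation coefficient r; the toy data sets exercise properties 2 and 3 above:

```python
import math

def correlation(xs, ys):
    """Sample linear correlation coefficient r, always in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0  -- perfect positive linear relation
print(correlation([1, 2, 3], [6, 4, 2]))  # -1.0 -- perfect negative linear relation
```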
Confounding
|
Any relation that may exist between two variables may be due to some other variable not accounted for.
|
|
Good Fit
|
The line drawn appears to describe the relation between the two variables well.
*Use the slope formula to find the equation of the line: m = (Y₂ − Y₁)/(X₂ − X₁) = slope |
|
Residual
|
The difference between the observed value of y and the predicted value of y - the "error," or residual.
(The difference between the data point and the fitted line.) |
|
Least-Squares Regression Line
|
The line that minimizes the sum of the squared errors (residuals) - the line that minimizes the sum of the squared distances between the observed values of y and those predicted by the line, ŷ (y-hat).
Minimize Σ(residuals)² |
|
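A minimal least-squares fit on illustrative data; it finds the slope and intercept that minimize the sum of squared residuals:

```python
def least_squares(xs, ys):
    """Slope and intercept of the line minimizing sum((y - y_hat)^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx  # the fitted line passes through (x-bar, y-bar)
    return slope, intercept

xs, ys = [1, 2, 3, 4], [2, 4, 5, 8]  # illustrative data
b1, b0 = least_squares(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(b1, b0)                          # 1.9 0.0
print(round(abs(sum(residuals)), 10))  # 0.0 -- residuals sum to zero
```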
Interpretation of Y Intercept
|
First, ask two questions:
1. Is 0 a reasonable value for the explanatory variable?
2. Do any observations near x = 0 exist?
If the answer to either question is no, no interpretation of the y-intercept is given.
Second: do not use the regression model to make predictions "outside the scope" of the model - values of the explanatory variable much larger or much smaller than those observed. *We cannot be certain of the behavior of the data where we have no observations. |
|
Predictions - No Linear Relation
|
If the linear correlation coefficient indicates no linear relation between explanatory and response variables, then use the mean value of the response variable as the predicted value.
|
|
Coefficient of Determination
|
Measures the proportion of total variation in the response variable that is explained by the least-squares regression line. It is a number between 0 & 1, inclusive: 0 ≤ R² ≤ 1.
If R² = 1, the least-squares regression line explains 100% of the variation in the response variable. R² = r² |
|
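R² can be computed as 1 − (unexplained variation / total variation), which agrees with r² for a least-squares fit; the data values are illustrative:

```python
xs, ys = [1, 2, 3, 4, 5], [2, 3, 5, 4, 6]  # illustrative data
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least-squares fit
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
preds = [intercept + slope * x for x in xs]

ss_residual = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained variation
ss_total = sum((y - my) ** 2 for y in ys)                   # total variation

r_squared = 1 - ss_residual / ss_total  # proportion of variation explained
print(round(r_squared, 2))              # 0.81
```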
Unexplained Deviation
|
The difference between the observed value of the response variable (y) and the predicted value of the response variable, ŷ (y-hat): y − ŷ
|
|
Explained Deviation
|
The deviation between the predicted value of the response variable, ŷ (y-hat), and the mean value of the response variable, ȳ (y-bar).
|
|
Total Deviation
|
The deviation between the observed value of the response variable, y, and the mean value of the response variable, ȳ.
|
|
Measure of Central Tendency
|
Numerically describes the average or typical data value: Mean, Median or Mode
|