175 Cards in this Set

  • Front
  • Back
Approach
Suggested way to look at and organize a problem so it can be solved.
**There is usually more than one way...
Bias
Exists when the results of the sample are not representative of the population.
3 Sources of Bias
1. Sampling Bias
2. Non-response Bias
3. Response Bias
Blinding
Refers to non-disclosure of a treatment an experimental unit is receiving.
2 Types of Blinding
1. Single Blinding
2. Double Blinding
Single Blind
Experiment in which the experimental unit (or subject) does not know which treatment he or she is receiving.
Double Blind
Study/Experiment in which neither the experimental unit nor the researcher in contact with the experimental unit knows which treatment the experimental unit is receiving.
Case Control Studies
Retrospective studies that require individuals to look back in time, or researchers to look at existing records.
Closed Question
Question for which the respondent must choose from a list of predetermined responses - i.e., multiple choice
Cluster Sample
Sample obtained by selecting all individuals within a randomly selected collection or group of individuals.
(e.g., All online students = population; each online class = cluster; obtain a simple random sample of clusters; survey all students within each selected cluster.)
Cohort Studies
Identifies a group of individuals to participate (the cohort), who are then observed over a length of time (sometimes very long). Characteristics are recorded; some individuals are exposed to certain factors (not intentionally) and others are not. At the end of the study, the value of the response variable is recorded for each individual. (e.g., Framingham Heart Study)
Completely Randomized Design
Simplest type of experiment. Design in which each experimental unit is randomly assigned to a treatment.
Continuous Variable
A quantitative variable that has an infinite number of possible values that are not countable.

(Quantitative variables can be discrete or continuous.)
Confounding
Occurs when the effects of two or more explanatory variables are not separated. Therefore, any relation that may exist between an explanatory variable and the response variable may be due to some other variable(s) not accounted for in the study.
** Major problem with observational studies, often the cause is a "lurking variable".
Control Group
Serves as a baseline treatment that can be used to compare to other treatments.
Convenience Sampling
Sample in which the individuals are easily obtained and not based on randomness.
* self-selected - individuals decided to participate (voluntary response), i.e., phone-in polling, internet surveys
**Unreliable results because sampling is not random.
Cross-Sectional Studies
Observational studies that collect information about individuals at a specific point in time - or over a very short period of time.
Data
Fact or Proposition used to draw a conclusion or make a decision. Can be numerical or non-numerical. (List of observed values for a variable)
i.e., gender is a variable - the observations of Male/Female are data.
Designed Experiment
Researcher assigns the individuals in a study to a certain group, intentionally changes the value of an explanatory variable, and then records the value of the response variable for each group.
Steps in Designing an Experiment
1. Identify the problem to be solved (explicit)
2. Determine factors that affect the response variable (field expert)
3. Determine the number of experimental units.
4. Determine the level for each factor.
a) Control: 1) fix the factor at one predetermined level; or 2) set it at predetermined levels
b) Randomize - randomize exp units to various treatment groups so the effects of factors that can't be controlled are minimized.
5. Conduct the experiment:
a) Exp units are randomly assigned to the treatments. Replication occurs when each tx is applied to more than 1 exp unit.
b) Collect and process data. Measure value of the response variable for each replication
6. Test the claim - inferential statistics.
Discrete Variable
A quantitative variable that has either a finite number of possible values, or a countable number of possible values. (If you count to get the value of a quantitative variable, it is discrete)
Experiment
A controlled study conducted to determine the effect that varying one or more explanatory variables (factors) has on a response variable.
Experimental Unit
Person, object, or some other well defined item upon which a treatment is applied. (Often referred to as a Subject)
Explanatory Variable
Variable (factor) whose value is used to explain or predict the value of the response variable; in a designed experiment, the researcher intentionally changes its value.
Functional Status
The ability to conduct day-to-day activities.
Individual
Person or object that is a member of the population to be studied.
Interval Level of Measurement

(Quantitative Variable)
Has the properties of the ordinal level, and the differences in the values of the variable have meaning. Addition and subtraction can be performed on the values.
(e.g., Temperature: addition and subtraction are meaningful, but ratios do not represent meaningful results.)
Lurking Variable
An explanatory variable that was not considered in a study, but that affects the value of the response variable in the study.
* Typically related to explanatory variables considered in a study.
Matched-Pairs Design
Experimental design in which the experimental units are paired up. The pairs are matched so that they are somehow related, and there are only 2 levels of treatment.
Nominal Level of Measurement

(Qualitative Variable)
Values of the variable name, label, or categorize; the naming scheme does not allow for the values to be arranged in a ranked or specific order.
(i.e. gender)
Non-response Bias
Exists when individuals selected to be in sample - who do not respond to survey - have a different opinion than those who do participate.
Non-Sampling Errors
Non-response bias, response bias, data entry errors, under coverage. Can also be present in Census.
** The errors that result from obtaining and recording the information collected.
Observational Study
Measures the value of the response variable without attempting to influence the value of either the response or explanatory variables.
**Observes the behavior without trying to influence the outcome.
Types of Observational Studies
1. Cross-Sectional - collect information about individuals - usually short periods of time
2. Case-Control - collect information about individuals - completed retrospectively
3. Cohort - individuals studied for longer periods of time - completed prospectively
Open Question
A question for which the respondent is free to choose his or her response: Open line answer.
Ordinal Level of Measurement

(Qualitative Variables)
Has the properties of the nominal level and the naming scheme allows for the values to be arranged in a specific order.
(i.e., Letter Grades)

Can be ranked, but differences have no meaning.
Parameter
Numerical summary of a population
Placebo
An innocuous medication (such as a sugar tablet) that looks, tastes and smells like the experimental medication.
Population
Entire group of individuals to be studied.
Qualitative Data
Observations corresponding to a qualitative variable
Qualitative Variables

*(Categorical)
Allow for classification of individuals based on some attribute or characteristic.
Quantitative Data
Observations corresponding to a quantitative variable

*(Discrete/Continuous)
Quantitative Variables
Provide numerical measures of individuals. Arithmetic operations - addition and subtraction - can be performed on the values and will provide meaningful results.
Random Sampling
The process of using chance to select individuals from a population to be included in a sample.
Ratio Level of Measurement

(Quantitative Variables)
Has the properties of the interval level and the ratios have meaning. Arithmetic operations can be completed - Multiplication and Division.
Response Bias
Exists when the answers on a survey do not reflect the true feelings of the respondent.
(i.e., Interviewer Error, Misrepresented Answers, Wording of Questions)
Reasons for Response Bias
1. Interviewer Error - untrained interviewers
2. Misrepresented Answers: responses that misrepresent facts, flat-out lies
3. Wording of questions: unbalanced? vague?
4. Ordering of questions or words: questions should be rearranged and asked again.
5. Type of question: open-free to choose response, closed - multiple choice
6. Data Entry Error: Imperative to perform accuracy checks!
Reliability
Represents the ability of different measurements of the same individual to yield the same results.
Response Variable
The variable that is measured as the end result of a study; its value may be explained by the explanatory variables. Often a quantitative variable measured at the interval or ratio level.
Sample
Subset of the population being studied
Sampling Bias
Technique used to obtain the individuals to be in the sample tends to favor one part of the population over another.
Sampling Error
The error that results from using a sample to estimate information about a population; occurs because a sample gives incomplete information about population (cannot reveal all)
**Error that results from using a subset of a population to describe characteristics of the population
Sampling With Replacement
A certain number of individuals is selected from the population, and questions/surveys are sent to the individuals in the sample. A selected individual's name is left in the population and could possibly be chosen again.
Sampling Without Replacement
A certain number of individuals is selected from the population, and questions/surveys are sent to the individuals in the sample. A selected individual's name is removed from the population and cannot be chosen again.
Seed
In a random number generator, provides an initial point for the generator to start creating random numbers
(dictates the random numbers that are generated)
Simple Random Sampling
A sample of size n from a population of size N is obtained through simple random sampling if every possible sample of size n has an equally likely chance of occurring.
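A minimal Python sketch of the Seed, Simple Random Sampling and With/Without Replacement cards. The frame of 30 numbered individuals, the helper names and the seed value are illustrative assumptions; `random.Random`, `sample` and `choices` are standard-library calls.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Select n individuals without replacement; each size-n sample is equally likely."""
    rng = random.Random(seed)  # the seed fixes where the generator starts
    return rng.sample(population, n)

def sample_with_replacement(population, n, seed=None):
    """Select n individuals with replacement; the same individual may appear twice."""
    rng = random.Random(seed)
    return rng.choices(population, k=n)

# Hypothetical frame: individuals numbered 1 through 30.
frame = list(range(1, 31))

first = simple_random_sample(frame, 5, seed=42)
second = simple_random_sample(frame, 5, seed=42)
print(first == second)        # True: the same seed reproduces the same sample
print(len(set(first)) == 5)   # True: no individual is chosen twice
```

Changing the seed (or leaving it as None) will generally give a different sample.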
Goal of Sampling
Obtain as much information as possible about population at the least amount of cost.
Statistic
Numerical summary of a sample
Statistics
The science of collecting, organizing, summarizing and analyzing information to draw conclusions or answer questions.

In addition, statistics is about providing a measure of confidence in any conclusions.
Descriptive Statistics
Consists of organizing and summarizing data. Describes data through numerical summaries, tables and graphs.
Inferential Statistics
Uses methods that take a result from a sample, extend it to the population and measure the reliability of the result.
Process of Statistics
1) Identify the research objective (determine the detailed questions)
2) Collect the data needed to answer those questions - important to use appropriate data collection processes.
3) Describe the data - descriptive statistics allows the researcher to obtain an overview of the data.
4) Perform Inference: apply the appropriate techniques to extend the results obtained from sample to population and report a level of reliability
Stratified Sample
Obtained by separating the population into non-overlapping groups (strata) and then obtaining a simple random sample from each stratum. The individuals within each stratum should be homogeneous (similar) in some way.
Systematic Sampling
Obtained by selecting every Kth individual from the population. The 1st individual selected corresponds to a random number between 1 & K.

Formula:

K = N/n (rounded down to the nearest whole number)

Random # = P (between 1 & K)
Sample consists of: P, P + K, P + 2K, …, P + (n - 1)K
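The selection rule on this card can be sketched in Python as follows. The function name `systematic_sample` and the values N = 100, n = 10 are assumptions for illustration; the sketch returns positions in the frame, not the individuals themselves.

```python
import random

def systematic_sample(N, n, seed=None):
    """Return the positions of a systematic sample of size n from N individuals."""
    k = N // n                   # K = N/n, rounded down to a whole number
    rng = random.Random(seed)
    p = rng.randint(1, k)        # random start P between 1 and K, inclusive
    # n positions in total: P, P + K, P + 2K, ..., P + (n - 1)K
    return [p + i * k for i in range(n)]

positions = systematic_sample(N=100, n=10, seed=7)
print(positions)
```

Because the list stops after n terms, the last position never exceeds N.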
Completely Randomized Design
Experimental Design in which each experimental unit is randomly assigned to a treatment.

(i.e., field fertilizer example)
Treatment
Any combination of the values of factors used in an experiment
Validity
Represents how close to the true value the measurement is.
Undercoverage
Occurs when the proportion of one segment of the population is lower in a sample, than it is in the population.
(can be caused by incomplete/incorrect frame, or not representative of population)
Variables
Characteristics of the individuals within the population
Bar Graph
Constructed by labeling each category of data on either the horizontal or vertical axis and the frequency or relative frequency of the category on the other axis. Rectangles of equal width are drawn for each category. The height of each rectangle represents the category's frequency/relative frequency.
Bell-Shaped Distribution
(Symmetric Distribution)
The highest frequency occurs in the middle and frequencies tail off to the left & right
Classes
Categories of data; Categories by which data are grouped.
Class Width
The difference between consecutive lower class limits.
i.e., 25-34, 35-44
35 - 25 = 10
10 = Class Width
Guidelines for Determining the lower Class Limit of the First Class and Class Width
Choose the lower Class Limit of the First Class by choosing the smallest observation in the data set or a convenient number slightly lower than the smallest observation.

Determine the class Width:
*Decide on the number of classes (generally between 5 & 20)
*Determine the class width by computing: largest data value - smallest data value and divide by number of classes.
- Round this value up to a convenient number
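A rough Python sketch of these guidelines, assuming whole-number data and taking "a convenient number" to mean the next whole number above the raw quotient. The data values and function names are hypothetical.

```python
import math

def class_limits(data, num_classes):
    """Return (width, [(lower, upper), ...]) for whole-number data."""
    smallest, largest = min(data), max(data)
    raw_width = (largest - smallest) / num_classes
    width = math.floor(raw_width) + 1      # round up to the next whole number
    lowers = [smallest + i * width for i in range(num_classes)]
    return width, [(lo, lo + width - 1) for lo in lowers]

def frequency_distribution(data, num_classes):
    """Count how many observations fall in each class."""
    width, limits = class_limits(data, num_classes)
    counts = [0] * num_classes
    for x in data:
        counts[(x - min(data)) // width] += 1
    return dict(zip(limits, counts))

ages = [25, 27, 31, 38, 42, 44, 47, 53, 69, 74]   # hypothetical data
width, limits = class_limits(ages, 5)
print(width)    # 10
print(limits)   # [(25, 34), (35, 44), (45, 54), (55, 64), (65, 74)]
print(frequency_distribution(ages, 5))
```

Here (74 - 25) / 5 = 9.8, which rounds up to a class width of 10, matching the 25-34, 35-44 example on the Class Width card.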
Deceptive Graphs
Purposely intended to create an incorrect impression.
Dot Plot
Drawn by placing each observation horizontally in increasing order and placing a dot above the observation each time it occurs.
Limited in usefulness but can be used to quickly visualize data.
Frequency Distribution
Lists each category of data and the number of occurrences for each category of data
Guidelines for Constructing Good Graphs
**Title and label the graphic axes clearly, provide explanations if needed.
Include: - Units of measurement
- Data source (when appropriate)
** Avoid distortion. Never lie about the data!
Histogram
Constructed by drawing rectangles for each class of data. The height is the frequency or relative frequency of the class. The width of each rectangle is the same, and the rectangles touch!
Lower Class Limit
The Lowest (smallest) value of a class
i.e., 25-34
Lower class limit = 25
Misleading Graphs
Graphs that unintentionally create an incorrect impression.
Most common:
* Manipulation of scale
* Misplaced origin (of scale)
Pareto Chart
Bar graph whose bars are drawn in decreasing order of frequency/relative frequency

*Helps prioritize categories for decision making purposes - QA, HR, Marketing
Relative Frequency
The proportion (percentage) of observations within a category, found using the following formula:
Relative frequency = Frequency / Sum of all frequencies
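A small Python sketch of a frequency and relative frequency distribution, listed in decreasing order of frequency as a Pareto chart would draw them. The blood-type observations and the function name are made up for illustration.

```python
from collections import Counter

def relative_frequency_distribution(observations):
    """Return (category, frequency, relative frequency) rows, most frequent first."""
    counts = Counter(observations)
    total = sum(counts.values())
    # most_common() lists categories in decreasing order of frequency (Pareto order)
    return [(category, count, count / total)
            for category, count in counts.most_common()]

blood_types = ["O", "A", "O", "B", "O", "A", "O", "AB", "A", "O"]  # hypothetical data
for category, frequency, relative in relative_frequency_distribution(blood_types):
    print(category, frequency, relative)
```

The relative frequencies always sum to 1 (100%).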
Open Ended
The first class has no lower class limit; or the last class has no upper class limit.
Choosing Bar Graphs, Pie Graphs or Pareto Graphs
Pie Charts - Should be used for showing the division of all possible values of a qualitative variable into its parts. (Not useful for comparing 2 specific values of the qualitative variable.)
* Bar Graphs: Useful when comparing the different parts, but not the parts being compared to the whole.
* Pareto Graphs: Useful when comparing parts to the whole.
Pie Charts
Circle divided into sectors, each sector represents a category of data. Area of sector is proportional to frequency of the category.
* Typically used to present relative frequency of qualitative data.
* Data is usually nominal, but can also be used for ordinal.
Relative Frequency Distribution
Lists each category of data together with the relative frequency
Side-by-Side Bar Graphs
Bar graph that compares 2 data sets. Should be completed using relative frequency because different sample size/population size makes comparisons using frequency difficult or misleading.
Skewed Left Distribution
The tail to the left of the peak is longer than the tail to the right of the peak.
Skewed Right Distribution
The tail to the right of the peak is longer than the tail to the left.
Stem and Leaf Plot
Another way to represent quantitative data graphically. Use the digits to the left of the right-most digit to form the stem. Each right-most digit forms a leaf. i.e., 147; 14 = Stem, 7 = Leaf
Construction of a Stem and Leaf Plot
Step 1: The stem of a data value will consist of the digits to the left of the right-most digit. The leaf will be the right-most digit.
Step 2: Write the stems in a vertical column in increasing order. Draw a vertical line to the right of the stems.
Step 3: Write each leaf corresponding to the stems to the right of the vertical line.
Step 4: Within each stem, rearrange the leaves in ascending order, title the plot and provide a legend.
i.e., Legend: 5|5 represents 5.5%
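The construction steps can be sketched in Python as below, assuming non-negative whole-number data. The data values and function names are hypothetical, and stems with no leaves are simply left out of this sketch.

```python
def stem_and_leaf(values):
    """Map each stem to its leaves in ascending order (non-negative whole numbers)."""
    plot = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # the right-most digit is the leaf
        plot.setdefault(stem, []).append(leaf)
    return plot

def format_plot(plot):
    """Draw one line per stem with a vertical bar between stem and leaves."""
    lines = [f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}"
             for stem, leaves in plot.items()]
    return "\n".join(lines)

data = [147, 152, 149, 141, 158, 152, 163]   # hypothetical data
plot = stem_and_leaf(data)
print(format_plot(plot))
print("Legend: 14 | 7 represents 147")
```

Sorting the data first means the stems and the leaves within each stem both come out in ascending order.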
Time-Series Data
The value of a variable is measured at different points in time.
i.e., the closing price of Cisco Systems each month for the past 12 years.
Time-Series Plot
Obtained by plotting the time in which a variable is measured on the horizontal axis and the corresponding value of the variable on the vertical axis.
Uniform Distribution
(Symmetric Distribution)
Shape of distribution - frequency of each value of the variable is evenly spread out across the values of the variable.
Upper Class Limit
The largest (highest) value within the class
i.e., 25 - 34
Upper class limit = 34
Formula for Class Width
Determine class width by computing:
largest data value - smallest data value
divided by number of classes.
Round results up to convenient number (rounding may result in fewer classes)
Guidelines for Constructing Good Graphs
* Title and label the graphic axes clearly, provide explanations if needed
Include: -Units of measurement
-Data source (when appropriate)
*Avoid distortion. Never lie about the data
Histogram
Constructed by drawing rectangles for each class of data. The height is the frequency or relative frequency of the class. The width of each rectangle is the same, and the rectangles touch.
Lower Class Limit
The lowest (smallest) value of a class
Misleading Graphs
Graphs that unintentionally create an incorrect impression.
Most Common:
Manipulation of scale
Misplaced origin (start at 0)
Pareto Chart
Bar graph whose bars are drawn in decreasing order of frequency/relative frequency.
* Helps prioritize categories for decision making purposes - QA, HR, Marketing.
Relative Frequency
The proportion (%) of observations within a category and is found using the following formula:
RF = Frequency/Sum of all Frequencies
Open Ended
The first class has no lower class limit or the last class has no upper class limit.
Choosing Bar charts, Pie Charts or Pareto Charts
Pie Charts: Should be used for showing the division of all possible values of a qualitative variable into its parts. (Not useful for comparing two specific values of the qualitative variable)
Bar Graphs: Useful when comparing the different parts, but not the parts being compared to the whole.
Pareto Charts: Useful when comparing parts to the whole.
Pie Charts
Circle divided into sectors, each sector represents a category of data. Area of sector is proportional to frequency of the category.
* Typically used to present relative frequency of qualitative data
* Data is usually nominal but can also be used for ordinal.
Relative Frequency Distribution
Lists each category of data together with the relative frequency
Side-By-Side Bar Graphs
Bar graph that compares two data sets. Should be completed using relative frequency because different sample sizes and/or population sizes make comparisons using frequency difficult or misleading.
Skewed Left Distribution
The tail to the left of the peak is longer than the tail to the right.
Skewed Right
The tail to the right of the peak is longer than the tail to the left.
Stem and Leaf Plot
Another way to represent quantitative data graphically. Use the digits to the left of the right-most digit to form the stem. Each rightmost digit forms a leaf.
i.e., 147; 14 = stem, 7 = leaf
Time - Series Data
The value of a variable is measured at different points in time.
i.e., the closing price of Cisco Systems stock each month for the past 12 years.
Time Series Plot
Obtained by plotting the time in which a variable is measured on the horizontal axis and the corresponding value of the variable on the vertical axis - join by a line to subsequent plotted values.
Uniform Distribution
Symmetrical distribution.
Shape of distribution: frequency of each value of the variable is evenly spread out across the values of the variable.
Upper Class Limit
The largest (highest) value within the class.
Describe the Distribution
Describe its shape - skewed left, skewed right, symmetric; its center - mean or median; and its spread - std deviation or IQR
Relationship Between the Mean, Median and Distribution Shape
Skewed Left - Mean is substantially smaller than the Median.
Symmetric - Mean is roughly equal to Median.
Skewed Right - Mean is substantially larger than Median.
Sample Standard Deviation
Obtained by taking the square root of the sample variance.

s = √(s²)
Population Standard Deviation
Obtained by taking the square root of the population variance.

σ = √(σ²)
Arithmetic Mean
*Quantitative Data Only
Computed by determining the sum of all the values of the variables in the data set and dividing by the number of observations.
Sample Arithmetic Mean
Computed by using sample data.
x̄ = (Σxᵢ) / n
Median
The value that lies in the middle of the data when arranged in ascending order.
M = median
Odd # of obs: the median is the observation in position (n + 1) / 2
Even # of obs: the median is the mean of the two middle obs; e.g., for 1, 2, 3, 4:
(2 + 3)/2 = 2.5
Resistant
A numerical summary of data where extreme values (relative to data) do not affect its value substantially.
Mode
The most frequent observation of the variable that occurs in the data set.
(Tally the number of observations for each value in the data set. The data value that occurs most often = mode)
* The only measure of central tendency that can be used for qualitative data.
Bimodal
Data set that has two modes
Multimodal
Data set that has 3 or more values that occur with the highest frequency.
Circumstances for Measure of Central Tendency
Mean:
Computation: Population - μ = (Σxᵢ) / N; Sample - x̄ = (Σxᵢ) / n
Interpretation: Center of gravity
When to use: Quantitative data and the frequency distribution is roughly symmetric.
Median:
Computation: Arrange data in ascending order and divide the data set in half.
Odd # of obs: the median is the observation in position (n + 1) / 2
Even # of obs: the median is the mean of the two middle obs; e.g., for 1, 2, 3, 4: (2 + 3)/2 = 2.5
Interpretation: Divides the data 50%/50% (half below, half above)
When to use: Quantitative data and the frequency distribution is skewed left or right.
Mode:
Computation: Tally data to determine the most frequent observation.
Interpretation: Most frequent observation
When to use: When the most frequent observation is the desired measure of central tendency, or the data are qualitative.
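A short Python illustration of the three measures using the standard-library `statistics` module (`multimode` needs Python 3.8 or later). The data sets are hypothetical.

```python
from statistics import mean, median, multimode

symmetric = [1, 2, 3, 4]
skewed_right = [2, 3, 3, 4, 40]           # hypothetical data with one extreme value

print(median(symmetric))                  # 2.5, the mean of the two middle values
print(mean(skewed_right))                 # 10.4, pulled toward the extreme value
print(median(skewed_right))               # 3, resistant to the extreme value
print(multimode(skewed_right))            # [3]
print(multimode(["red", "blue", "red"]))  # ['red'], the mode works for qualitative data
print(multimode([1, 1, 2, 2, 3]))         # [1, 2], a bimodal data set
```

In the skewed-right data the mean is well above the median, which is why the median is the preferred measure of center for skewed distributions.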
Sample Standard Deviation
s = √(s²)
Weighted Mean
Used when certain data values have a higher importance, or "weight", associated with them. Found by multiplying each value of the variable by its corresponding weight, summing the products, and dividing the result by the sum of the weights.
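As a worked example, a grade point average is a weighted mean in which credit hours act as the weights. The grades, credits and function name below are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

grade_points = [4.0, 3.0, 2.0]   # hypothetical course grades
credit_hours = [3, 4, 3]         # hypothetical weights

# (4.0*3 + 3.0*4 + 2.0*3) / (3 + 4 + 3) = 30 / 10
print(weighted_mean(grade_points, credit_hours))   # 3.0
```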
Quartiles
Divides data set into fourths.
Step 1: Arrange data in ascending order
Step 2: Determine the Median (2nd Quartile or Q2)
Step 3: Determine the 1st & 3rd quartiles by dividing the data set into 2 halves. Then divide each half into halves. The 1st quartile will be the median of bottom half, the 3rd quartile is the median of the top half.
Interquartile Range - IQR
The range of the middle 50% of the observations in a data set.
IQR = Q3 - Q1

The more spread a data set has, the higher the IQR will be.
Outliers
Extreme observations.
*Origins must be investigated
*Can distort the Mean and standard deviation
Checking for Outliers Using Quartiles
1. Determine the 1st and 3rd quartiles of the data.
2. Compute IQR
3. Determine the "fences" - serve as cut off points for determining outliers.
Lower Fence: Q1 - 1.5(IQR)
Upper Fence: Q3 + 1.5(IQR)
4. If a data value is less than the lower fence or greater than the upper fence, it's considered an outlier.
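The four steps can be sketched in Python as follows. One assumption to note: when the number of observations is odd, this sketch leaves the median out of both halves before finding Q1 and Q3; software packages may use other quartile conventions and give slightly different values. The data and function names are hypothetical.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median of the bottom and top halves."""
    ordered = sorted(data)
    if len(ordered) < 2:
        raise ValueError("need at least two observations")
    half = len(ordered) // 2
    q1 = median(ordered[:half])      # median of the bottom half
    q2 = median(ordered)             # the median itself
    q3 = median(ordered[-half:])     # median of the top half
    return q1, q2, q3

def find_outliers(data):
    """Return (lower fence, upper fence, list of outliers)."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

data = [1, 3, 5, 7, 9, 11, 13, 15, 40]   # hypothetical data
print(quartiles(data))       # (4.0, 9, 14.0)
print(find_outliers(data))   # (-11.0, 29.0, [40])
```

Here IQR = 14 - 4 = 10, so the fences sit 15 units below Q1 and above Q3, and only 40 falls outside them.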
Five Number Summary
Min Q1 M(Q2) Q3 Max
resistant to extreme values. Measures the spread of data by determining the difference between the 25% & 75%.
Boxplot
1. Determine the lower and upper fences.
IQR = Q3 - Q1
Lower fence: Q1 - 1.5(IQR)
Upper fence: Q3 + 1.5(IQR)
2. Draw vertical lines at Q1, M and Q3. Enclose these lines in a box.
3. Label the upper and lower fences
4. Draw a line from Q1 to the smallest data value that's larger than the lower fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence. (whiskers)
5. Any data values < lower fence or > upper fence are outliers and marked with an asterisk.
Dispersion
The degree to which the data are spread out
(describes a distribution)
Range
(R) of a variable is the difference between the largest data value and the smallest data value.
* Uses only 2 values from data set
* Affected by extreme values - not resistant
Deviation about the Mean
How far each observation is from the mean: xᵢ - μ for a population, xᵢ - x̄ for a sample. The further the observation is from the mean, the larger the absolute value of the deviation.
*Sum of all deviations about the mean must = 0
Population Variance
The sum of the squared deviations about the population mean divided by the number of observations in the population, (N).
*It is the mean of the squared deviations about the population mean.
Pop Var: σ²
Formula: σ² = Σ(xᵢ - μ)² / N
Note: Σxᵢ² = Square each observation, then sum the squared values.
(Σxᵢ)² = Sum all obs, then square the sum.
Biased
When a statistic consistently overestimates or underestimates a parameter (for example, the population variance).
Sample Variance
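The sum of the squared deviations about the sample mean divided by n - 1: s² = Σ(xᵢ - x̄)² / (n - 1)
**Always remember to use n - 1 in the denominator to keep from underestimating the population variance.
Dividing by the smaller number n - 1 (instead of n) gives a larger variance that is closer to the actual population variance.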
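A small Python comparison of the two denominators, N for a population and n - 1 for a sample, on one hypothetical data set. The helper names are illustrative; `pvariance` and `variance` from the standard library are imported only for cross-checking.

```python
import math
from statistics import pvariance, variance

def population_variance(values):
    """Mean of the squared deviations about the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]              # hypothetical data, mean = 5
print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # 2.0
print(sample_variance(data))                 # 32/7, about 4.571
```

The squared deviations sum to 32, so dividing by 8 gives 4 while dividing by 7 gives the slightly larger sample variance.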
Degrees of Freedom
n - 1; because the first n - 1 observations have freedom to be whatever value.
*We have n - 1 degrees of freedom in the computation of s² because an unknown parameter, μ, is estimated with x̄. For each parameter estimated, we lose 1 degree of freedom.
Standard Deviation
Used in conjunction with the mean to numerically describe distributions that are bell-shaped and symmetric. The mean measures the center, the St Dev measures the spread
*Loosely described as the typical deviation from the mean. The larger the st dev, the more dispersed the distribution. (Units of measure must be the same!)
Empirical Rule
If a distribution is roughly bell shaped:
* Approximately 68% of the data will lie within 1 standard deviation of the mean; that is, between μ - 1σ and μ + 1σ.
* Approximately 95% of the data will lie within 2 standard deviations of the mean; that is, between μ - 2σ and μ + 2σ.
* Approximately 99.7% of the data will lie within 3 standard deviations of the mean; that is, between μ - 3σ and μ + 3σ.
*Can also use this rule based on sample data, with x̄ used in place of μ and s used in place of σ.
Chebyshev's Inequality
For any data set, regardless of the shape of the distribution (skewed or symmetric), at least (1 - 1/k²)·100% of the observations will lie within k standard deviations of the mean, where k is any number greater than 1. That is, at least (1 - 1/k²)·100% of the data will lie between μ - kσ and μ + kσ for k > 1.
*Can be used for sample data too.
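A quick Python check of the inequality on one hypothetical data set. The function names are illustrative, and the data are treated as a population (dividing by N).

```python
def chebyshev_minimum_percent(k):
    """Minimum percent of data within k standard deviations of the mean, k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k**2) * 100

def percent_within(data, k):
    """Actual percent of observations between mu - k*sigma and mu + k*sigma."""
    n = len(data)
    mu = sum(data) / n
    sigma = (sum((x - mu) ** 2 for x in data) / n) ** 0.5
    inside = [x for x in data if mu - k * sigma <= x <= mu + k * sigma]
    return 100 * len(inside) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical data: mu = 5, sigma = 2
print(chebyshev_minimum_percent(2))   # 75.0
print(percent_within(data, 2))        # 100.0
```

The guaranteed minimum (75% for k = 2) is lower than the Empirical Rule's approximate 95% because Chebyshev's Inequality makes no assumption about the shape of the distribution.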
Correlation Coefficient
A measure of the strength and direction of the linear relation between 2 quantitative variables
ρ = Population, r = sample
Linear Correlation Coefficient Properties
1. Always between -1 and 1, inclusive: -1 ≤ r ≤ 1
2. If r = +1, then a perfect positive linear relation exists between the two variables
3. If r = -1, then a perfect negative linear relation exists between the two variables
4. The closer r is to +1, the stronger is the evidence of positive association
5. The closer r is to -1, the stronger is the evidence of negative association.
6. If r is close to zero (0), little to no evidence exists of a "linear" relation - does not imply no relation, just no linear relation.
7. The linear correlation coefficient is a unitless measure of association
8. The correlation is not resistant.
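The cards do not give a formula for r; the Python sketch below assumes the usual sample formula, the sum of the products of the z-scores of x and y divided by n - 1. The data are hypothetical.

```python
import math

def correlation(xs, ys):
    """Sample linear correlation coefficient r (assumed z-score formula)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    z_products = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                     for x, y in zip(xs, ys))
    return z_products / (n - 1)

xs = [1, 2, 3, 4, 5]                                    # hypothetical data
print(round(correlation(xs, [2, 4, 6, 8, 10]), 4))      # 1.0, perfect positive relation
print(round(correlation(xs, [10, 8, 6, 4, 2]), 4))      # -1.0, perfect negative relation
print(round(correlation(xs, [2, 4, 5, 4, 5]), 4))       # 0.7746
```

Rescaling either variable (for example, changing units) leaves r unchanged, which is what property 7 means by a unitless measure.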
Confounding
Any relation that may exist between two variables may be due to some other variable not accounted for.
Good Fit
The line drawn appears to describe the relation between the two variables well.
*Use the point-slope form to find the equation of the line through two points: slope m = (y₂ - y₁) / (x₂ - x₁), then y - y₁ = m(x - x₁)
Residual
The difference between the observed value of y and the predicted value of y = the "error" or residual.
(The difference between an observed data value and the value predicted by the line.)
Least-Squares Regression Line
The line that minimizes the sum of the squared errors (or residuals). It is the line that minimizes the sum of the squared distances between the observed values of y and those predicted by the line, ŷ (y-hat).
Minimize Σ residuals²
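A minimal Python sketch of fitting the line and computing residuals. The cards do not state the slope and intercept formulas; this sketch assumes the standard ones, slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and intercept = ȳ - slope·x̄. The data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (observed y - predicted y) squared."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
slope, intercept = least_squares_line(xs, ys)
print(round(slope, 4), round(intercept, 4))                         # 0.6 2.2
print(round(sum_squared_residuals(xs, ys, slope, intercept), 4))    # 2.4
print(round(sum_squared_residuals(xs, ys, 0.7, 2.0), 4))            # 2.55, a worse line
```

Any other line, such as the one with slope 0.7 and intercept 2.0, gives a larger sum of squared residuals.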
Interpretation of Y Intercept
First: 2 Questions:
1. Is 0 a reasonable value for explanatory variable?
2. Do any observations near X = 0 exist?
If answer is no to either question, no interpretation of Y intercept is given.
Second: Should not use regression model to make predictions "Outside the Scope" of the model - values of the explanatory variable that are much larger or much smaller than those observed.
* Cannot be certain of the behavior of data for which we have no observations.
Predictions - No Linear Relation
If the linear correlation coefficient indicates no linear relation between explanatory and response variables, then use the mean value of the response variable as the predicted value.
Coefficient of Determination
Measures the proportion of total variation in the response variable that is explained by the least-squares regression line. It is a number between 0 and 1, inclusive: 0 ≤ R² ≤ 1
If R² = 1, the least-squares regression line explains 100% of the variation in the response variable. For the least-squares regression line with one explanatory variable, R² = r², the square of the linear correlation coefficient.
Unexplained Deviation
The difference between the observed value of the response variable (y) and the predicted value of the response variable ŷ (y-hat): y - ŷ
Explained Deviation
The difference between the predicted value of the response variable ŷ (y-hat) and the mean value of the response variable ȳ (y-bar): ŷ - ȳ
Total Deviation
The difference between the observed value of the response variable y and the mean value of the response variable ȳ (y-bar): y - ȳ. Total deviation = unexplained deviation + explained deviation.
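A Python sketch tying the three deviations to the coefficient of determination. It assumes the standard least-squares slope and intercept formulas and computes R² as explained variation divided by total variation; the data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept (standard formulas, assumed here)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def deviation_table(xs, ys):
    """Rows of (unexplained, explained, total) deviation for each observation."""
    slope, intercept = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    rows = []
    for x, y in zip(xs, ys):
        y_hat = slope * x + intercept
        rows.append((y - y_hat, y_hat - y_bar, y - y_bar))
    return rows

def r_squared(xs, ys):
    """Explained variation divided by total variation."""
    rows = deviation_table(xs, ys)
    explained = sum(e ** 2 for _, e, _ in rows)
    total = sum(t ** 2 for _, _, t in rows)
    return explained / total

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
for unexplained, explained, total in deviation_table(xs, ys):
    print(round(unexplained, 2), round(explained, 2), round(total, 2))
print(round(r_squared(xs, ys), 4))   # 0.6
```

For each observation the unexplained and explained deviations add up to the total deviation, and here the line explains 60% of the variation in y.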
Measure of Central Tendency
Numerically describes the average or typical data value: Mean, Median or Mode