Probability For Data Science

by gordonmacmillan, Jan. 2017

Subjects: Data science, probability, a/b testing, hypothesis testing, bayes theorem

24 Cards in this Set

Front
Back

	What is a p-value?	The p-value is the probability, assuming the null hypothesis is true, of observing a result equal to or "more extreme" than the one actually observed, for example a test statistic at least as large as abs(t), where t is a standardized value calculated from sample data. In regression parameter calculations, a small p-value indicates that such a strong association between the predictor and the response variable would be unlikely to arise by chance alone.
	When is p-value used?	The p-value is widely used in statistical hypothesis testing, specifically in null hypothesis significance testing. In this method, as part of experimental design, before performing the experiment, one first chooses a model (the null hypothesis) and a threshold value for p, called the significance level of the test, traditionally 5% or 1% and denoted as α. If the p-value is less than or equal to the chosen significance level (α), the test suggests that the observed data is inconsistent with the null hypothesis, so the null hypothesis is rejected. However, rejecting the null hypothesis does not prove that the alternative hypothesis is true. When the p-value is calculated correctly, this test guarantees that the Type I error rate is at most α. For typical analysis, using the standard α = 0.05 cutoff, the null hypothesis is rejected when p <= .05 and not rejected when p > .05.
	What is a hypothesis test?	A hypothesis test is a statistical test that is used to determine whether there is enough evidence in a sample of data to infer that a certain condition is true for the entire population. A hypothesis test examines two opposing hypotheses about a population: the null hypothesis and the alternative hypothesis.
	Give an example of a hypothesis test.	For example, let’s say we’d like to test the assumption that men are taller than women at the Galvanize campuses.
	Provide necessary steps to arrive at an answer with a hypothesis test.	Steps: 1) State the null and alternate hypotheses. 2) Calculate the t (or z) statistic. 3) Using the t value and the degrees of freedom, find the corresponding probability p. 4) Compare p to the desired Type I error rate, alpha. 5) State the conclusion.
	What is the null hypothesis?	In the height example, the null hypothesis (typically written H0) is that the average heights of men and women are equal, while the alternative hypothesis (H1 or Ha) is that men are taller than women.
	What is the t statistic?	The t-value quantifies how big the difference in the means is compared to the variation in the data. It’s basically the difference in the means expressed in multiples of the standard error.
	How does p value relate to the tails of the test?	The alternative hypothesis determines whether a single-tailed probability (as in this case, for "greater than") or a two-tailed probability (for "means not equal") is needed to get the right value of p.
	What is Alpha?	Alpha is the significance level, the accepted Type I error rate. For a 95% confidence level, alpha is 0.05: a two-sided test places 0.025 in each tail, while a one-sided test places the full 0.05 in a single tail. If p <= alpha, the null hypothesis can be rejected; otherwise it can’t.
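
The hypothesis-testing cards above (steps, test statistic, tails, alpha) can be tied together in a short sketch. It uses a z statistic with a normal approximation rather than a t statistic, because the Python standard library has no t distribution; for samples this small a t-test would usually be the better choice. The function name `two_sample_z_test` and the height values are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def two_sample_z_test(a, b, alpha=0.05):
    """One-sided test of H1: mean(a) > mean(b), using a normal approximation."""
    # Step 1: H0 is mean(a) == mean(b); H1 is mean(a) > mean(b).
    # Step 2: test statistic = difference in means / standard error.
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    # Step 3: one-tailed p-value, P(Z >= z) under H0.
    p = 1 - NormalDist().cdf(z)
    # Steps 4 and 5: compare p to alpha and report whether H0 is rejected.
    return z, p, p <= alpha

# Hypothetical heights in inches (invented numbers, not real measurements).
men = [70, 72, 68, 71, 69, 73, 70, 74, 69, 71]
women = [64, 66, 63, 65, 67, 62, 66, 64, 65, 63]

z, p, reject = two_sample_z_test(men, women)
print(f"z = {z:.2f}, p = {p:.3g}, reject H0: {reject}")
```

For a two-tailed alternative ("means not equal"), the p-value would instead be `2 * (1 - NormalDist().cdf(abs(z)))`.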
	What is statistical power?	Statistical power is the likelihood that a study will detect an effect when an effect exists (or, that a false null hypothesis will be rejected). Statistical power is inversely related to the probability of making a Type II error (β). Power = 1 – β. As statistical power increases the probability of making a Type II error (a false negative) decreases. Though there is no formal standard for power, (1-β) = 0.8 is often used.
	What affects statistical power?	Generally statistical power depends on the desired significance criterion, the magnitude of the effect of interest in the population (effect size), and the sample size used to detect the effect. Increasing the sample size is often the easiest way to increase the power. Many formulae exist for determining the correct sample size for a desired statistical power, depending on the application.
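
As a rough illustration of how sample size drives power, the sketch below approximates the power of a one-sided two-sample z-test. It assumes equal group sizes, a known common standard deviation, and an effect size given as a standardized mean difference (Cohen's d); the function name `power_two_sample` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test.

    effect_size is the standardized mean difference (Cohen's d).
    """
    std_normal = NormalDist()
    # Critical value: H0 is rejected when the z statistic exceeds this.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under H1 the z statistic is centered at d * sqrt(n / 2).
    shift = effect_size * sqrt(n_per_group / 2)
    # Power = P(z statistic > z_crit | H1) = 1 - beta.
    return 1 - std_normal.cdf(z_crit - shift)

# Power grows with sample size for a fixed effect size and alpha.
for n in (20, 50, 100):
    print(n, round(power_two_sample(0.5, n), 3))
```

With an effect size of zero the function returns alpha, which matches the definition of the Type I error rate.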
	Give examples where the median is a better measure of centrality than the mean.	The median is a better measure than the mean when data are skewed. The median is a better measure in cases where outliers shift the centrality of the mean.
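
A tiny sketch of the outlier effect, using invented salary figures: one extreme value pulls the mean far from the bulk of the data while the median stays put.

```python
from statistics import mean, median

# Hypothetical annual salaries in thousands; the last value is an outlier.
salaries = [42, 45, 47, 50, 52, 55, 400]

print("mean:", round(mean(salaries), 1))   # pulled upward by the outlier
print("median:", median(salaries))         # the middle observation
```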
	What is A/B testing?	A/B testing is two-sample hypothesis testing with randomized exposure of test subjects to two variants, A and B, where A is the control and B is the variation.
	Give a couple of examples where A/B testing might be used.	In website design, A/B testing can determine whether changes to the website changed (hopefully increased) the click-through rate. In marketing materials, A/B testing could be used where a letter to a prospective customer ends with either “Sale ends March 8, use discount code M8 online to claim” or “Sale ends soon, use code SL to claim”.
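
One common way to analyze the click-through example is a two-proportion z-test; the sketch below is one such approach under a normal approximation, not the only valid analysis. The function name `ab_test_ctr` and the click and view counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def ab_test_ctr(clicks_a, views_a, clicks_b, views_b):
    """Two-sided two-proportion z-test comparing click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Under H0 both variants share one rate, so pool the data to estimate it.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / se
    # Two-tailed p-value: a change in either direction counts as extreme.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical traffic: control A gets 200/5000 clicks, variation B gets 260/5000.
z, p_value = ab_test_ctr(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```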
	What type of distribution models the period of time between events occurring at an average rate (assuming each event occurs independently of the last event)?	The exponential distribution.
	Give a couple real life examples that would be modeled by the exponential distribution.	For example, the probability that you’ll get a phone call in the next hour, given that you get a call on average once every 4 hours; or the probability that someone will pass you at a street corner in the next five minutes, given that on average 10 people walk by that street corner every hour.
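
Both examples reduce to the exponential CDF, P(T <= t) = 1 - exp(-rate * t), once the rate and the time window use the same unit. A minimal sketch (the function name `prob_event_within` is hypothetical):

```python
from math import exp

def prob_event_within(rate, duration):
    """P(at least one event within `duration`) for exponential waiting times.

    `rate` is the average number of events per unit time, and `duration`
    must be expressed in the same time unit.
    """
    return 1 - exp(-rate * duration)

# One call every 4 hours on average: rate = 1/4 per hour, window = 1 hour.
p_call = prob_event_within(1 / 4, 1)

# 10 passers-by per hour on average: window = 5 minutes = 5/60 hour.
p_passer = prob_event_within(10, 5 / 60)

print(round(p_call, 3), round(p_passer, 3))
```

These work out to roughly 0.22 for the phone call and 0.57 for the street corner.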
	What are common methods for dealing with missing data?	Common methods for dealing with unknown or missing values include: (1) Removing entire observations that contain one or more unknown values. Advantage: easy. Disadvantage: loses data and decreases power. (2) Filling in unknown values with the average of the existing values (mean imputation). Advantage: easy. Disadvantage: diminishes the utility of correlations that use the imputed variable. (3) k-nearest-neighbors imputation: use the values of the nearest neighboring observations to fill in missing data points. Advantage: more representative. Disadvantage: more computationally expensive.
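
The first two methods are simple enough to sketch in plain Python. The records, field names, and helper names below are all invented for illustration; k-nearest-neighbors imputation is omitted because it needs a distance metric and more code than fits here.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 31, "income": 52.0},
    {"age": 45, "income": None},
    {"age": 28, "income": 48.0},
    {"age": 39, "income": 61.0},
]

def drop_incomplete(records):
    """Remove every record that has at least one missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def impute_mean(records, key):
    """Replace missing values of `key` with the mean of the observed values."""
    fill = mean(r[key] for r in records if r[key] is not None)
    return [{**r, key: fill if r[key] is None else r[key]} for r in records]

complete = drop_incomplete(rows)
imputed = impute_mean(rows, "income")
print(len(complete), imputed[1]["income"])
```

Dropping rows shrinks the dataset, while mean imputation keeps every row but reduces the variance of the imputed column.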
	What is the Central Limit Theorem?	The CLT states that the arithmetic mean of a sufficiently large number of independent, identically distributed random variables with finite variance will be approximately normally distributed, regardless of the underlying distribution. I.e., the sampling distribution of the sample mean is approximately normal.
	Where is the Central Limit theorem often employed in data science?	The CLT is used in hypothesis testing and confidence intervals. The process of bootstrapping is like the process of sampling for the CLT but can be used to generate statistics other than the mean.
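
Both ideas can be seen in a small simulation. The first part draws repeated samples from a skewed Exponential(1) distribution and looks at the spread of their means; the second part is a percentile bootstrap for the median, one of several bootstrap variants. Sample sizes, the seed, and the helper name `bootstrap_ci` are arbitrary choices for this sketch.

```python
import random
from math import sqrt
from statistics import mean, median, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

# CLT: means of n draws from a skewed Exponential(1) distribution.
n, trials = 50, 2000
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(trials)
]
# Theory predicts a center near 1 and a spread near 1 / sqrt(n).
print(round(mean(sample_means), 3), round(stdev(sample_means), 3), round(1 / sqrt(n), 3))

def bootstrap_ci(data, stat, n_resamples=2000, level=0.95):
    """Percentile bootstrap interval for an arbitrary statistic."""
    # Resample with replacement, recompute the statistic, and sort the results.
    stats = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lo = stats[int((1 - level) / 2 * n_resamples)]
    hi = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

data = [rng.expovariate(1.0) for _ in range(100)]
lo, hi = bootstrap_ci(data, median)
print(round(lo, 3), round(median(data), 3), round(hi, 3))
```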
	Contrast Frequentist and Bayesian statistics.	In a Frequentist framework, there are true fixed parameters that describe a population. In a Bayesian framework, distributions are associated with parameters that describe the population.
	Explain how determining the average height of adult women in the US would be approached by these two paradigms.	Frequentist: there is one true answer to this (e.g. 5’ 7”). Probability and confidence intervals enter the picture when this population is sampled. Bayesian: the analyst starts with a “prior” distribution that describes their present state of knowledge (say, that the average height of women follows a normal distribution centered on 5’ 6” with a standard deviation of 6”). Then the Bayesian collects data, which is used to update the prior distribution into a new distribution, the posterior distribution. Statements about probability and intervals (credible intervals) reference this posterior distribution.
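
The Bayesian update can be made concrete with the standard normal-normal conjugate formula, which applies only under extra assumptions not stated on the card: the data are normally distributed with a known standard deviation. The prior (66 inches, standard deviation 6) comes from the card; the sample size, sample mean, and data standard deviation are invented, and `update_normal_prior` is a hypothetical name.

```python
from math import sqrt

def update_normal_prior(prior_mean, prior_sd, data_mean, data_sd, n):
    """Posterior for a normal mean with a normal prior and known data sd."""
    # Precision is 1 / variance; precisions add when combining information.
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / data_sd ** 2
    post_precision = prior_precision + data_precision
    # Posterior mean is a precision-weighted average of prior and data means.
    post_mean = (
        prior_mean * prior_precision + data_mean * data_precision
    ) / post_precision
    return post_mean, sqrt(1 / post_precision)

# Prior from the card: centered on 5'6" (66 in) with sd 6 in.
# Hypothetical data: 25 women, sample mean 64 in, assumed known sd 3 in.
post_mean, post_sd = update_normal_prior(66, 6, data_mean=64, data_sd=3, n=25)
print(round(post_mean, 2), round(post_sd, 2))
```

With 25 observations the posterior sits almost on the sample mean, showing how data quickly outweighs a wide prior.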
	Explain Bayes rule in words and then write out its formula.	In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event based on prior knowledge of conditions that might be related to the event: P(A|B) = P(B|A) * P(A) / P(B)
	What are the terms of Bayes formula?	P(A) and P(B) are the probabilities of events A and B without regard to each other. P(B|A) is the probability of event B occurring given that event A occurred. P(A|B) is the probability of event A occurring given that event B occurred. It’s what we are trying to find.
	What are the terms called in Bayes formula?	P(B|A) is the “likelihood.” P(A) is the “prior.” P(B) is the “normalizing constant,” sometimes referred to as the “evidence.” P(A|B) is the “posterior.”
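
A short worked example of the formula, using an invented diagnostic-test scenario: A is "has the condition" and B is "tests positive". P(B) is expanded with the law of total probability, which requires P(B|not A) as an extra input. The function name `bayes_posterior` and all numbers are hypothetical.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A|B) from P(A), P(B|A), and P(B|not A)."""
    # Evidence P(B) = P(B|A) * P(A) + P(B|not A) * P(not A).
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    # Bayes' rule: posterior = likelihood * prior / evidence.
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positives.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))
```

Even with a fairly accurate test, the posterior here is only about 0.16 because the prior is so small.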
