41 Cards in this Set


Downfalls of running AB test too long

Missed opportunities: you keep running the less optimal model, the lower-converting landing page, etc., after the results have already converged.

Downfalls of running AB testing too short

Lose out on seasonality of the data. Weekend vs Weekday users, holidays, peak season vs slow season, etc.

General AB testing framework

- Define null and alt hypotheses


- Define a significance level (alpha)


- Collect data


- Compute test statistic


- Compute p-value
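The steps above can be sketched end to end for a two-proportion comparison; the conversion counts here are hypothetical:

```python
from math import sqrt, erf

# Hypothetical data: conversions out of visitors for variants A and B.
conv_a, n_a = 200, 1000   # 20% conversion
conv_b, n_b = 250, 1000   # 25% conversion

# 1. H0: p_a == p_b, H1: p_a != p_b
# 2. Significance level
alpha = 0.05

# 3. Sample proportions and the pooled proportion under H0
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)

# 4. z statistic for two proportions
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# 5. Two-sided p-value from the standard normal CDF
phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
p_value = 2 * (1 - phi(abs(z)))

reject_null = p_value < alpha
```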

How would you test between 2 means vs 2 proportions?

- 2 proportions => z-test


- 2 means => t-test


- More specifically, I tend to use Welch's t-test so I have one less assumption to make about the two samples: that their variances are equal
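In scipy, Welch's version is just `equal_var=False` on `ttest_ind`; the samples here are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical samples: revenue per user under two variants,
# deliberately given different variances.
group_a = rng.normal(loc=10.0, scale=2.0, size=500)
group_b = rng.normal(loc=10.5, scale=4.0, size=500)

# equal_var=False gives Welch's t-test, which does not assume
# the two samples share a common variance.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
```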

How do you deal with multiple comparisons for AB testing?

Use the Bonferroni correction: account for the number of p-values you are generating from the multiple comparisons and divide the significance level by the number of comparisons.
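For example, with hypothetical p-values from five simultaneous comparisons:

```python
# Hypothetical: 5 simultaneous comparisons and their raw p-values.
alpha = 0.05
p_values = [0.004, 0.020, 0.030, 0.009, 0.450]

# Bonferroni: test each p-value against alpha / m.
m = len(p_values)
adjusted_alpha = alpha / m   # 0.05 / 5 = 0.01
significant = [p < adjusted_alpha for p in p_values]
```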

What is a confounding factor and how do you deal with them?

A confounding factor is a hidden variable that influences both the independent variable and the dependent variable you are trying to measure. Generally you can average out confounding factors by conducting randomized experiments rather than relying on observations.

Explain Precision and Recall

- Precision is the ratio of true positives to all predicted positives: TP / (TP + FP)


- Recall is the ratio of true positives to all actual positives: TP / (TP + FN)
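From hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)   # of everything predicted positive, how much was right
recall = tp / (tp + fn)      # of everything actually positive, how much was found
```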

How does Feature Importance work in Random Forests?

- For a RF Classifier, a feature's importance is the average information gain from the parent node to the (weighted) average of its child nodes across all splits on that feature, averaged over all the trees


- For a RF Regressor, the same calculation uses the reduction in RSS from parent to child nodes
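A quick sklearn sketch (synthetic data, hypothetical parameters) of pulling the impurity-based importances out of a fitted forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features="sqrt" is the decorrelation step: each split only
# considers a random subset of the features.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X, y)

# Impurity-based importances, one per feature; they sum to 1.
importances = rf.feature_importances_
```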

When do you stop building trees in a Random Forest?

I've heard differing opinions:


- Build as many as computationally makes sense


- Plot the number of trees against MSE / accuracy and find the "elbow" of the graph: the point where additional trees begin to only marginally affect the MSE / accuracy of the model

How do you prevent overfitting in Decision Trees and Random Forests?

- Pruning in Decision Trees


- Decorrelate the trees using only a random subset of the features for each tree, and average away the variance by using lots of trees in your Random Forest

What is Information Gain?

The reduction in entropy from the parent node to the weighted average of the child nodes.

What is Entropy?

The measure of randomness of a distribution.

What is Gini Impurity?

The probability of mislabeling a data point by labeling data points at random based on the probability distribution of the data.
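A small sketch of computing Gini impurity from a node's labels:

```python
from collections import Counter

def gini_impurity(labels):
    """Probability of mislabeling a random point if labels are drawn
    from the node's empirical class distribution."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0; a 50/50 binary node has impurity 0.5.
```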

Explain Decision Trees

- At each node, you search over the features and candidate split points and split the data on the optimal one.


- For a Classifier you maximize information gain at each split; for a Regressor you minimize RSS


- The splits are made at thresholds for continuous variables and at different categories for categorical variables

Explain Random Forests

- Is a bagging method


- Create a bunch of decision trees to take a majority vote (classifier) or average the predictions with (regressor)


- RF differentiates itself from bagging methods by decorrelating the trees by using a random subset of features for each tree

What is Deviance

**Fill Later**

Explain Boosting

- Create a bunch of decision trees dependent on each other


- In particular, fits to the imperfections of the previous tree


- In gradient boosting, each tree is fit to the negative gradient of the loss function evaluated at the previous trees' predictions

Talk through how you would tune your boosting model.

- With decision-tree-based models, I grid search the max_depth, min_samples_leaf, and max_features


- The key is how to handle shrinkage


- The learning_rate and n_estimators are related, so I tweak learning_rate while holding n_estimators relatively high (~3000)
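The tuning described above might look like this in sklearn; the dataset, grid values, and a deliberately reduced `n_estimators` (kept small here so the sketch runs quickly, rather than the ~3000 you would hold in practice) are all hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Grid over tree-shape parameters and the shrinkage (learning_rate),
# while n_estimators stays fixed and relatively high.
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.01, 0.1],
}
gbm = GradientBoostingClassifier(n_estimators=200, random_state=0)
search = GridSearchCV(gbm, param_grid, cv=3)
search.fit(X, y)
```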

In Gradient Boosted Trees, how are the trees related to one another?

Each tree is related to the previous tree. For GBM in particular, the tree is fitted to the residuals of the previous tree.

Difference in XGBoost to Gradient Boosting?

- Faster and more efficient


- Bins column values into percentiles and considers only those binned points as candidate splits, instead of every distinct value


- Handles sparsity (NA values) and mixed data types (continuous vs categorical)

Partial Dependency Plots

- Used to interpret the effect of individual variables in black-box-type machine learning models


- Shows how the response changes across different values of the variable


- Averages out the response over the other variables


- Plot the variable against the partial dependence

Describe the Gradient of the Loss Function

The derivative of the loss function with respect to the model's predictions. For squared error loss, the negative gradient is simply the residual: the actual value minus the predicted value from the tree.

Bayes' Rule

P(A|B) = P(B|A) * P(A) / P(B)

What is the big assumption made by Naive Bayes?

That in P(x1, x2, x3 | C), the underlying CONDITIONAL probabilities of x1, x2, and x3 are independent of one another.

Euclidean Distance (general)

- compute the distance (dist)


- map it to a 0-to-1 similarity with:


- sim = 1 / (1 + dist)

Euclidean Distance (formula)

- dist(a, b) = ||a - b|| = sqrt[ sum_i (a_i - b_i)**2 ]


- sim(a, b) = 1 / (1 + dist(a, b))
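As code, on hypothetical vectors:

```python
from math import sqrt

def euclidean_similarity(a, b):
    # Euclidean distance, then map [0, inf) distance to (0, 1] similarity.
    dist = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)
```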

Cosine Similarity (general)

Generate the angle between two vectors

Cosine Similarity (formula)

- cos(a, b) = dot(a, b) / (||a|| * ||b||)


- sim(a, b) = 0.5 + 0.5 * cos(a, b)
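As code:

```python
from math import sqrt

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    cos = dot / (norm_a * norm_b)
    return 0.5 + 0.5 * cos   # rescale cosine from [-1, 1] to [0, 1]
```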

Pearson Correlation (general)

- Similarity of two vectors


- Measures how far the values are from the mean

Pearson Correlation (formula)

- corr(a, b) = cov(a, b) / (std(a) * std(b))


- sim(a, b) = 0.5 + 0.5 * corr(a, b)

Jaccard Similarity

- Similarity of the items in two sets


- sim(a, b) = |intersection(a, b)| / |union(a, b)|
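As code:

```python
def jaccard_similarity(a, b):
    # Ratio of shared items to all items across the two sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```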

Similarity Matrix

- m x m matrix filled with values from 0 to 1


- Compute the matrix of similarities between items or users

Making Predictions from a Recommender

rating(u, i) = sum_j [ sim(i, j) * rating(u, j) ] / sum_j sim(i, j)


- for the set of items j rated by user u


- items i and j
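A minimal sketch of that weighted average; the similarity and rating dicts are hypothetical:

```python
def predict_rating(sims, ratings):
    """Predict user u's rating of target item i as a similarity-weighted
    average of u's known ratings.

    sims    -- {item_j: sim(i, j)} similarities to the target item i
    ratings -- {item_j: rating(u, j)} for items the user has rated
    """
    num = sum(sims[j] * ratings[j] for j in ratings if j in sims)
    den = sum(sims[j] for j in ratings if j in sims)
    return num / den
```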

Speeding up Predictions for Recommenders

Only compute the similarity for the top N most similar items to the item being predicted

Validating a Recommender System

- A/B test for an uptick in conversion


- Predict ratings for the test set of users and compute the RMSE for the predicted vs actual values

Problem Validating Recommender using RMSE?

It weighs how far off your predictions were for all items, whereas we often only care about the predictions for the top n items.

Recommenders: Precision at n

Take the predicted top n ratings and compute the fraction of those n that are relevant


- Relevant being "watched" or "converted"
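A sketch, where `recommended` is an ordered list of predicted items and `relevant` is the set the user actually watched or converted on (both names hypothetical):

```python
def precision_at_n(recommended, relevant, n):
    # Fraction of the top-n recommendations that turned out relevant.
    top_n = recommended[:n]
    return sum(1 for item in top_n if item in relevant) / n
```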

Recommenders: Recall at n

Take the predicted top n ratings and compute the fraction of all relevant items that appear in the top n.

Recommenders: Fixing data sparsity for a user?

Matrix factorization to fill out the matrix: SVD or NMF commonly

How to Validate your Matrix Factorization?

RMSE for known values in the matrix

Gist of Alternating Least Squares for Recommender Systems

Your ratings matrix can be decomposed into two matrices U and V. Update U (holding V fixed), calculate the RMSE for the updated recommender, then update V (holding U fixed) and calculate the RMSE again. Repeat this process until the RMSE doesn't improve.
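A minimal dense-matrix sketch of the alternating updates in numpy; the ratings matrix and rank are hypothetical, and a real implementation would fit only the observed entries:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense ratings matrix: 6 users x 5 items, rank k = 2 factors.
R = rng.random((6, 5)) * 4 + 1
k = 2
U = rng.random((6, k))
V = rng.random((5, k))

rmse0 = np.sqrt(np.mean((R - U @ V.T) ** 2))  # RMSE at initialization

for _ in range(50):
    # Hold V fixed and solve least squares for U, then swap roles.
    U = np.linalg.lstsq(V, R.T, rcond=None)[0].T
    V = np.linalg.lstsq(U, R, rcond=None)[0].T
    rmse = np.sqrt(np.mean((R - U @ V.T) ** 2))
```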