55 Cards in this Set
Steps in data mining: |
1. Collect data. 2. Data mining. 3. Good stuff. |
|
What is data? |
a collection of objects and their features.
|
|
Types of data |
1. Discrete features 2. Numerical features (continuous features) |
|
Give ways that data might not be clean |
1. Noise (e.g., distortion on a phone line) 2. Outliers (e.g., instrument error or a technical problem) 3. Missing values 4. Duplicated data |
|
What is the best way to clean data? |
ML is the best way to detect/fix these problems. |
|
What is an object in ML+DM? |
Objects are assumed to be IID: independent and identically distributed. The order of the objects doesn't matter, and new objects will behave like the existing objects. The IID assumption implies that our conclusions will probably also apply to new data. E.g., an assignment question should be similar to an exam question. |
|
How much data do we need? |
We need more data points than categories. |
|
What is feature aggregation? |
combining features to form new features. |
|
What is feature selection? |
Removing features that are not relevant to the task. |
|
What is feature transformation? |
Mathematical transformations: square, exponentiate, or take the logarithm. |
|
What is discretization? |
A feature transformation that turns continuous features into discrete ones. (A related transformation, scaling, converts variables to comparable scales.) |
|
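A minimal sketch of both transformations with NumPy (the feature values and bin edges below are made up for illustration):

```python
import numpy as np

x = np.array([1.2, 7.5, 3.3, 9.8])   # a continuous feature (made-up values)

# Discretization: replace each value with the index of its bin.
bins = np.array([2.5, 5.0, 7.5])     # hypothetical bin edges
x_discrete = np.digitize(x, bins)    # e.g., 1.2 -> bin 0, 9.8 -> bin 3

# Scaling: standardize to zero mean and unit variance so features are comparable.
x_scaled = (x - x.mean()) / x.std()
```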
How do you 'look' at features and high-dimensional objects? |
- Summary statistics. - Visualization. - ML + DM (later in the course). |
|
Types of summary statistics? |
1. Discrete: for a discrete variable, the simple matching coefficient; for binary variables, the Jaccard coefficient. 2. Continuous: measures of location, measures of spread, and measures of the relationship between continuous variables. |
|
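The two discrete coefficients are easy to state in code. A sketch for binary feature vectors (the vectors are made up):

```python
import numpy as np

a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 0, 0, 1, 1])

# Simple matching coefficient: fraction of positions where the two vectors agree.
smc = np.mean(a == b)                                                # 3/5 here

# Jaccard coefficient: of the positions where at least one vector is 1,
# the fraction where both are 1 (shared zeros are ignored).
jaccard = ((a == 1) & (b == 1)).sum() / ((a == 1) | (b == 1)).sum()  # 2/4 here
```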
Types of visualization |
1. Basic plot: f(x) = 2x+1 (2-D graph) 2. Histogram 3. Box plot 4. Scatterplot (look at the distribution of 2 features) 5. Scatterplot array (plot multiple variables in an array) 6. Map coloring 7. Contour plot |
|
What is the IID assumption? |
The IID assumption implies that our conclusions will probably also apply to new data. |
|
What are quantiles? |
Values such that a given fraction of the data lies below them (e.g., the median is the 0.5 quantile). |
|
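A quick check of that definition with NumPy (made-up data):

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])

median = np.quantile(x, 0.5)   # the 0.5 quantile is the median
q90 = np.quantile(x, 0.9)      # ~90% of the data lies below this value
```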
What is a decision stump? |
A decision tree with one rule. |
|
Key idea of decision stump |
There are many possible stumps (rules) to choose from, and more than one may be optimal. We need a score, such as classification accuracy, to pick the 'best' rule. |
|
Classification accuracy |
"If we use this rule, how many objects do we label correctly?" |
|
Running time of finding the best rule for a decision stump |
O(ndt), where n = number of objects (examples), d = number of features, t = number of candidate thresholds to check. |
|
How do we speed up this process? |
Limit the thresholds to check: restrict thresholds to the range of each feature, or to values of the features that occur in the data. The latter gives at most n thresholds per feature, so O(ndt) becomes O(n^2 d). With sorting, O(n^2 d) improves to O(nd log n). |
|
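A sketch of the sorted-threshold trick for a single feature with binary labels (function and variable names are my own, not from the course):

```python
import numpy as np

def best_stump_one_feature(x, y):
    # Best threshold on feature x for binary labels y in {0, 1}, via sorting.
    # Thresholds are restricted to values occurring in the data, and running
    # counts let each candidate be scored in O(1), so one feature costs
    # O(n log n) instead of O(n^2).
    order = np.argsort(x)
    x_s, y_s = x[order], y[order]
    n = len(y)
    ones_below = np.cumsum(y_s)           # number of 1-labels among the first i points

    best_acc, best_thr = 0.0, None
    for i in range(1, n):
        if x_s[i] == x_s[i - 1]:
            continue                      # equal values can't be separated
        # Predict the majority class on each side of the split.
        b1 = ones_below[i - 1]            # ones below the threshold
        b0 = i - b1                       # zeros below the threshold
        a1 = ones_below[-1] - b1          # ones at/above the threshold
        a0 = (n - i) - a1                 # zeros at/above the threshold
        acc = (max(b0, b1) + max(a0, a1)) / n
        if acc > best_acc:
            best_acc, best_thr = acc, (x_s[i] + x_s[i - 1]) / 2
    return best_thr, best_acc
```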
What does supervised learning do? |
Uses data to build a program that outputs labels from input features. |
|
What does a decision tree do? |
Makes a decision via a sequence of simple rules. |
|
What is overfitting? |
When a model has lower accuracy on new data than on the training data. |
|
The idea of supervised learning |
Typical supervised learning steps: 1. Build a model based on training data X and y. 2. The model makes predictions yhat on new data Xtest. You win if yhat is similar to the true unseen ytest. |
|
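A sketch of those steps with scikit-learn's fit/predict interface (the random dataset here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # n-by-d feature matrix (made up)
y = rng.integers(0, 2, size=100)     # n labels (made up)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=2).fit(Xtrain, ytrain)  # step 1
yhat = model.predict(Xtest)                                      # step 2
print(accuracy_score(ytest, yhat))   # you 'win' if yhat is similar to ytest
```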
Golden rule of machine learning |
You cannot use the test data to build the model. Minimizing test error is not the goal, but it is a measure of how well we do on new data. |
|
What is the decision tree's idea? |
1. Start at the root node. 2. Branch using a splitting rule. 3. a) If at a leaf node, return its label. b) Otherwise, go to 2. |
|
What is the idea of a decision stump? |
A decision tree with one rule. Go through all objects, and find out which class is more likely given the rule (classification accuracy). To learn the 'best' stump, find the 'best' rule. |
|
What is the fundamental trade-off? |
1. How small you can make the training error, vs. 2. How well the training error approximates the test error. |
|
Tell me about validation error |
If used once, it's an unbiased approximation of the test error. You can overfit if you use the validation error too many times. With decision trees, you only evaluate a few depths, so the validation error likely approximates the test error. |
|
What can we do with cross-validation? |
You can use CV to select the decision tree depth. You can also use it to decide whether to split: don't split if the CV error doesn't improve. Or fit a deep decision tree and use CV to prune: remove leaf nodes that don't improve the CV error. |
|
What does cross-validation do? |
Cross-validation allows using more of the dataset to estimate the validation error. |
|
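A sketch of using cross-validation to choose the depth (the candidate depths are arbitrary; names are my own):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def pick_depth(X, y, depths=(1, 2, 4, 8)):
    # Score each depth by mean accuracy over 5 folds, so every training
    # example is used for validation exactly once.
    scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d),
                                 X, y, cv=5).mean()
              for d in depths}
    return max(scores, key=scores.get)
```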
What is the golden rule? |
Even though what we care about is test error, YOU CANNOT USE THE TEST DATA TO BUILD THE MODEL. Minimizing test error is not the goal, but it is a measure of how well we do on new data. |
|
What do we mean by 'maximizing' over data? |
"I built 10 models based on half of the data you gave me. One of them classified the other half of the data with 98% accuracy." Maximizing over these validation errors gives a biased approximation of the test error. But they only maximized over 10 models, so the bias is probably small, and they probably know about the golden rule. |
|
Explain naïve Bayes and its uses. |
Naïve Bayes is a probabilistic classifier based on Bayes' rule. It's 'naïve' because it makes a strong independence assumption, but it tends to work well with bag-of-words features. |
|
What are generative classifiers? |
They build a model of the probability of seeing the features, so they need to know the probability of the features given the class. E.g., for spam filtering, you need one model that knows what spam messages look like and a second model that knows what non-spam messages look like. Naïve Bayes additionally assumes the variables in xi are independent of each other given yi. |
|
Again, what is the assumption of naïve Bayes? *** |
Given yi, all features xi are conditionally independent of each other. |
|
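A minimal Bernoulli (bag-of-words) naive Bayes sketch under exactly that assumption. Variable names are my own, and alpha is a Laplace-smoothing count (see the maximum-likelihood card below):

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    # X: n-by-d binary bag-of-words matrix; y: n labels in {0, 1}.
    p_y = np.array([(y == c).mean() for c in (0, 1)])
    # Estimate p(x_j = 1 | y = c) independently per feature: the 'naive' part.
    p_xy = np.array([(X[y == c].sum(axis=0) + alpha) /
                     ((y == c).sum() + 2 * alpha) for c in (0, 1)])
    return p_y, p_xy

def predict(X, p_y, p_xy):
    # Sum log-probabilities instead of multiplying many tiny numbers.
    log_post = (np.log(p_y) + X @ np.log(p_xy).T
                + (1 - X) @ np.log(1 - p_xy).T)
    return np.argmax(log_post, axis=1)
```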
What's the problem with the maximum likelihood estimate? |
A single zero probability can zero out the whole product, e.g., p(seeing 'lac' | spam) = 0. |
|
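The standard fix is Laplace smoothing: add a pseudo-count beta to every count so no estimated probability is exactly zero. A one-function sketch (not course code):

```python
def smoothed_prob(count, total, beta=1.0, num_values=2):
    # The MLE would be count / total, which can be exactly 0 and zero out
    # the whole product. Pretend we saw each possible value beta extra times.
    return (count + beta) / (total + beta * num_values)
```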
What is a parametric method? |
There is a fixed number of 'parameters' in the model (e.g., number of rules). As you get more data, you can estimate them more accurately, but at some point more data doesn't help because the model is too simple. |
|
What is a non-parametric method? |
The number of parameters grows with the number of training examples, so the model gets more complicated as you get more data. E.g., a decision tree whose depth grows with the number of examples. |
|
What is the problem with KNN? |
KNN does not know that labels should be invariant to transformations such as translation, rotation, and changes in size. Solutions: 1. The hard/slow way is to modify your distance function. 2. The easy/fast way is to just add transformed data during training. |
|
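A sketch of the easy/fast way for image data, adding one-pixel translations of every training image (the function name and the one-pixel choice are my own):

```python
import numpy as np

def augment_with_translations(X_images, y):
    # X_images: n-by-h-by-w array of images. Add copies shifted one pixel in
    # each of the four directions so KNN sees translated versions at training
    # time. (np.roll wraps pixels around the border: fine as a rough augmentation.)
    shifted = [np.roll(X_images, shift, axis=ax)
               for ax in (1, 2) for shift in (-1, 1)]
    X_aug = np.concatenate([X_images] + shifted)
    y_aug = np.concatenate([y] * (len(shifted) + 1))  # labels are unchanged
    return X_aug, y_aug
```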
Do we have any classifiers that are accurate and run in real time? |
Decision trees and naïve Bayes are fast, but often not very accurate. KNN is often accurate, but not very fast. Solution: deployed systems use an ensemble method called random forests. |
|
What are ensemble methods? |
Classifiers that take classifiers as input. E.g.: a) Boosting: take a simple classifier that underfits and improve its training error. b) Averaging: take a complex classifier that overfits and improve its test error. |
|
2 types of averaging (ensemble method)? |
1. Model averaging: take the mode of the predictions (or the average if probabilistic). 2. Stacking: fit another classifier that uses the predictions as features. |
|
Two key ingredients in random forests? |
1. Bootstrap sampling: randomize each tree's training set by sampling with replacement (same number of examples, but some examples repeated and others left out). 2. Random splits: only consider a random subset of the features when searching for the best rule (*** do not consider all features). |
|
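A sketch of the bootstrap-sampling ingredient (the random-split part would live inside the tree-fitting code):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    # Draw n indices with replacement: same size as the original data,
    # but some examples repeat and others are left out.
    n = len(y)
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]

# Usage sketch: fit each tree on its own bootstrap sample, then predict
# with the mode of the trees' predictions (model averaging).
```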
Explain the difference between supervised learning and unsupervised learning. *** |
Supervised learning: we have features xi and class labels yi, and write a program that produces yi from xi. Unsupervised learning: we only have xi values, with no explicit target labels, and want to do 'something' with them. |
|
What is clustering? |
Input: a set of objects described by features xi. Output: an assignment of objects to 'groups'. Unlike classification, we are not given the groups; the algorithm must discover them (i.e., groups that we find by ourselves). |
|
Explain the K-means algorithm: |
1. Assign each xi to its closest mean. 2. Update the means based on the assignments. 3. Repeat until convergence. |
|
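A compact NumPy sketch of those three steps (no handling of empty clusters; names are my own):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # random initial means
    for _ in range(n_iters):
        # Step 1: assign each point to its closest mean.
        dists = np.linalg.norm(X[:, None] - means[None], axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: update each mean to the average of its assigned points.
        new_means = np.array([X[assign == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_means, means):
            break                                          # converged
        means = new_means
    return means, assign
```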
Problem with K-means? |
It may converge to a sub-optimal local solution (i.e., we may not recover the true k clusters), and the attributes have to be numerical. For the first problem: use k-means++ initialization. |
|
Explain the k-means++ steps |
1. Choose a random data point as the first mean. 2. Compute the distance of every point to the closest mean. 3. Sample the next mean proportional to these distances squared. |
|
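The same steps in NumPy (a sketch; the full k-means++ algorithm then runs ordinary k-means from these means):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    means = [X[rng.integers(len(X))]]        # first mean: a random data point
    for _ in range(k - 1):
        # Squared distance from every point to its closest mean so far.
        d2 = np.min(np.linalg.norm(X[:, None] - np.array(means)[None],
                                   axis=2), axis=1) ** 2
        # Sample the next mean proportional to these squared distances,
        # so far-away points are the most likely new means.
        means.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(means)
```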
What is density-based clustering? Parametric or non-parametric? |
A non-parametric clustering method! Clusters are defined by connected dense regions, so the model becomes more complicated the more data we have. Data points in non-dense regions are not assigned to a cluster. |
|
Give the parameters of density-based clustering (DBSCAN) |
1. Radius: the maximum distance between points for them to be considered 'close'. 2. MinPoints: the number of 'close' points needed to define a cluster. |
|
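A usage sketch with scikit-learn's implementation (the eps and min_samples values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))    # made-up 2-D data

# eps plays the role of Radius, min_samples the role of MinPoints.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
# Points in non-dense regions get label -1: they belong to no cluster.
```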
Pros and cons of DBSCAN |
Pros: useful for finding non-convex connected clusters. Cons: assigning new points to clusters is expensive, and in high dimensions you need a lot of points to 'fill' the space. Runtime: O(n^2). |
|
Param vs. non-param |
Fixed-depth decision tree: the 'size' of the model is O(1) in terms of n (parametric). K-nearest neighbours: the 'size' of the model is O(n) in terms of n (non-parametric). |
|
Explain the concept of region-based pruning: |
A technique for finding the closest objects among a huge number of objects: *** you don't have to compute distances between points in distant regions. |
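One standard realization of this idea is a k-d tree, which partitions space into regions so queries can skip distant ones entirely. A sketch with SciPy (the data is made up):

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.default_rng(0).normal(size=(10000, 3))   # made-up points

tree = cKDTree(X)                  # region-based index over the points
dist, idx = tree.query(X[0], k=5)  # 5 nearest neighbours without a full scan
```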