Steps in Data mining:

1. Collect data


2. Data mining


3. Good Stuff

What is data?

a collection of objects and their features.

Types of Data

1. Discrete features


2. Numerical features (continuous features)

Give ways that data might not be clean

1. Noise (phone)


2. Outliers (instrument error; technical problem)


3. Missing values


4. Duplicated data

What is the best way to clean data?

ML is the best way to detect/fix these.

What is an object in ML+DM?

Objects are assumed to be IID: independent and identically distributed.

- the order of the objects doesn't matter

- new objects will behave like the existing objects.

The IID assumption implies that our conclusions will probably also apply to new data.

eg: an assignment question should be similar to an exam question.

How much data do we need?

We need more data points than there are categories.

What is feature aggregation?

combining features to form new features.

What is feature selection?

Removing features that are not relevant to the task.

What is feature transformation?

Mathematical transformations of a feature: square it, exponentiate it, or take its logarithm.
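
A minimal NumPy sketch of these transformations (the feature values are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])  # hypothetical skewed feature

x_squared = x ** 2         # emphasize large values
x_exp = np.exp(x / 1000)   # exponentiate (scaled down to avoid overflow)
x_log = np.log(x)          # compress a heavy right tail
print(x_log)               # [0.  2.303  4.605  6.908] (rounded)
```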

What are discretization and scaling?

Two common feature transformations:

– Discretization: turn continuous features into discrete ones.

– Scaling: convert variables to comparable scales.
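
A small sketch of both transformations; the ages and bin edges are assumptions for illustration:

```python
import numpy as np

ages = np.array([5.0, 17.0, 23.0, 41.0, 68.0])  # hypothetical continuous feature

# Discretization: continuous ages -> discrete bins (edges are assumptions).
bin_edges = [0, 18, 65, 120]
age_group = np.digitize(ages, bin_edges)        # -> [1 1 2 2 3]

# Scaling: standardize to mean 0 and standard deviation 1, so features
# measured in different units become comparable.
ages_scaled = (ages - ages.mean()) / ages.std()
```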

How do you 'look' at features and high-dimensional objects?

– Summary statistics.

– Visualization.

– ML + DM (later in course).



Types of summary statistics?

1. Discrete summary statistics (for a discrete variable):

- Simple matching coefficient.

- Jaccard coefficient (for binary variables).

2. Continuous summary statistics:

- Measures of location (e.g., mean, median).

- Measures of spread (e.g., range, variance).

- Measures between continuous variables (e.g., correlation).

Types of visualization

1. Basic plot: f(x) = 2x+1 (2-D graph)

2. Histogram

3. Box plot

4. Scatterplot (look at the distribution of 2 features)

5. Scatterplot array (can plot multiple variables in an array)

6. Map coloring

7. Contour plot

What is the IID assumption?

The IID assumption implies that our conclusions will probably also apply to new data.

What are quantiles?

Values such that a given fraction of the data lies below them (e.g., the median is the 0.5 quantile).

What is a decision stump?

A decision tree with one rule.

Key idea of decision stump

There are many possible rules for building this tree, and there may be more than one optimal solution.

To choose among rules, we need a score: classification accuracy.

classification accuracy

"If we use this rule, how many objects do we label correctly?"
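
A tiny sketch of scoring one rule by classification accuracy (the feature values, labels, and threshold are all made up):

```python
import numpy as np

x = np.array([22, 35, 41, 18, 50])   # hypothetical single feature
y = np.array([0, 1, 1, 0, 1])        # true labels

yhat = (x > 30).astype(int)          # the rule: predict 1 when x > 30
accuracy = np.mean(yhat == y)        # fraction of objects labeled correctly
print(accuracy)                      # 1.0 on this toy data
```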

Running time of finding the best rule for a decision stump

O(ndt), where:

n = number of objects (examples)

d = number of features

t = number of candidate thresholds to check



How to speed up this process?

Limit the thresholds to check:

- Restrict the thresholds to the range of each feature.

- Restrict thresholds to values of the features that appear in the data.

We then have at most n thresholds per feature (one per example):

O(ndt) -> O(n^2 d)

With sorting,

O(n^2 d) -> O(nd log n)
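
A sketch of this idea on a single feature: candidate thresholds come from the sorted data values. (This simple version still does O(n) work per threshold; the O(n log n) variant sweeps the sorted values while maintaining running class counts.)

```python
import numpy as np

def best_stump_1d(x, y):
    """Best threshold on one feature, scored by classification accuracy.

    Thresholds are restricted to values appearing in the data,
    so there are at most n candidates for this feature.
    """
    best_t, best_acc = None, -1.0
    for t in np.sort(np.unique(x)):
        over = x > t
        # Label each side of the split with its majority class (0/1 labels).
        lab_over = y[over].mean() > 0.5 if over.any() else 0
        lab_under = y[~over].mean() > 0.5 if (~over).any() else 0
        acc = np.mean(np.where(over, lab_over, lab_under) == y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

x = np.array([3, 7, 12, 20, 25])
y = np.array([0, 0, 1, 1, 1])
print(best_stump_1d(x, y))  # (7, 1.0): the rule "x > 7" labels everything correctly
```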

What does supervised learning do?

Using data to build a program that outputs labels from input features.

what does a decision tree do?

Making a decision via a sequence of simple rules.

What is overfitting?

Lower accuracy on new data than on the training data.

the idea of supervised learning

Typical supervised learning steps:

1. Build model based on training data X and y.

2. Model makes predictions 'yhat' on new data Xtest.

• You win if yhat is similar to the true unseen ytest.



Golden rule of machine learning

Minimizing test error is not the goal, but it is a measure of how well we do on new data.

What is the decision tree's idea?

1. Start at the root node.

2. Branch using the splitting rule.

3. Leaf nodes are labeled:

a) If leaf, return label. b) Otherwise, go to 2.
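
A minimal sketch of that loop over a hand-built tree; the node format (dicts holding either a rule or a 'label') is just an assumption for illustration:

```python
# Hypothetical two-node tree: internal nodes hold a rule, leaves hold a label.
tree = {"feature": 0, "threshold": 0.5,
        "yes": {"label": "sick"},
        "no": {"label": "not sick"}}

def predict(node, x):
    while "label" not in node:                           # b) not a leaf: keep going
        go_yes = x[node["feature"]] > node["threshold"]  # 2) splitting rule
        node = node["yes"] if go_yes else node["no"]
    return node["label"]                                 # a) leaf: return label

print(predict(tree, [0.7]))  # 'sick'
```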

What is the idea of Decision Stump?

A decision tree with one rule.

Go through all objects, and find out which class is more likely given the rule (classification accuracy).

To learn the 'best' stump, find the 'best' rule.

What is the fundamental trade-off?

1. How small you can make the training error.

vs.

2. How well the training error approximates the test error.

Tell me about validation error

If used once:

– unbiased approximation of test error.

You can overfit if you use the validation error too many times.

With decision trees, you only evaluate a few depths, so the validation error likely approximates the test error.





What can we do with Cross-Validation?

You can use CV to select decision tree depth.

• You can also use it to decide whether to split:

– Don't split if CV error doesn't improve.

• Or fit a deep decision tree and use CV to prune:

– Remove leaf nodes that don't improve CV error.
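
A sketch of the depth-selection use with scikit-learn (assuming scikit-learn is available; the data here is synthetic):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # synthetic labels

# Score each candidate depth by 5-fold cross-validation accuracy.
scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d), X, y, cv=5).mean()
          for d in (1, 2, 4, 8)}
best_depth = max(scores, key=scores.get)
```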

What does Cross-Validation do?

Cross-validation allows using more of the dataset to estimate validation error.

What is the Golden Rule?

Even though what we care about is test error:

– YOU CANNOT USE THE TEST DATA TO BUILD THE MODEL.

• Minimizing test error is not the goal, but it is a measure of how well we do on new data.

What do we mean by 'maximizing' over data?

"I built 10 models based on half of the data you gave me."

"One of them classified the other half of the data with 98% accuracy."

Maximizing over these errors is a biased approximation of test error.

– But they only maximized it over 10 models, so the bias is probably small.

– They probably know about the golden rule.

Explain naïve Bayes and its uses.

Naïve Bayes is a probabilistic classifier based on Bayes' rule.

– It's 'naïve' because it makes a strong independence assumption.

– But it tends to work well with bag-of-words features.

What are generative classifiers?

They build a probability of seeing the features.

– A generative classifier needs to know the probability of the features, given the class.

– You need one model that knows what spam messages look like.

– You need a second model that knows what non-spam messages look like.

Naïve Bayes assumes that the variables in x_i are independent of each other given y_i.

Again, what is the assumption of Naïve Bayes?***

Given y_i, all features in x_i are conditionally independent of each other.

what's the problem with Maximum Likelihood estimate?

A single zero probability can mess up the whole thing: if a word never appears with a class in training, the product of probabilities for any message containing it becomes zero.

eg. p(seeing 'lac' | spam) = 0
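
The standard fix is Laplace smoothing: add pseudo-counts so no estimated probability is exactly zero. A sketch for one binary word feature, with made-up counts:

```python
# Hypothetical counts: the word never appeared in the 50 training spam messages.
count_word_in_spam = 0
count_spam = 50

p_mle = count_word_in_spam / count_spam                  # 0.0 -> zeroes the product
p_laplace = (count_word_in_spam + 1) / (count_spam + 2)  # add-one smoothing (binary)
print(p_mle, p_laplace)                                  # 0.0 0.0192...
```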

What is a parametric method?

There are a fixed number of “parameters” in the model (e.g., number of rules).




As you get more data, you can estimate them more accurately.




• But at some point, more data doesn’t help because model is too simple.

What is a non-parametric method?

Number of parameters grows with the number of training examples.


– Model gets more complicated as you get more data.


– E.g., decision tree whose depth grows with the number of examples.

What is the problem with KNN?

KNN does not know that labels should be invariant to transformations such as translation (or rotation, or changes in size).

Solutions:

1. The hard/slow way is to modify your distance function.

2. The easy/fast way is to just add transformed data during training.
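
A sketch of the easy/fast way on tiny 1-D 'images' (the arrays are made up; np.roll stands in for a small translation):

```python
import numpy as np

X = np.array([[0, 1, 1, 0],
              [1, 1, 0, 0]])   # hypothetical 1-D 'images'
y = np.array([0, 1])

# Add shifted copies of each training example with the same label,
# so KNN sees that a translated image means the same thing.
shifts = [-1, 1]
X_aug = np.vstack([X] + [np.roll(X, s, axis=1) for s in shifts])
y_aug = np.concatenate([y] * (1 + len(shifts)))
```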

Do we have any classifiers that are accurate and run in real time?

– Decision trees and naïve Bayes are fast, but often not very accurate.

– KNN is often accurate, but not very fast.

Solution: deployed systems use an ensemble method called random forests.



What are ensemble methods?

Classifiers that take classifiers as input.

Eg:

1. Boosting: take a simple classifier that underfits, and improve its training error.

2. Averaging: take a complex classifier that overfits, and improve its test error.

2 types of averaging (ensemble method)?

Model averaging:

– Take the mode of the predictions (or the average if probabilistic).

Stacking:

– Fit another classifier that uses the predictions as features.

Two key ingredients in random forests?

- Bootstrap sampling: randomize the incoming inputs (same number of inputs, but some inputs might be repeated).

– Random splits: only consider a random subset of the features when searching for a split (*** do not consider all features when searching for the optimal rule).
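
A sketch of one bootstrap sample (n indices drawn with replacement):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                              # number of training examples
idx = rng.integers(0, n, size=n)   # sample n indices WITH replacement
# Some indices repeat and some never appear; each tree in the forest
# gets its own bootstrap sample like this one.
print(idx)
```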

Explain the difference between supervised learning and unsupervised learning.***

Supervised learning:

– We have features x_i and class labels y_i.

– Write a program that produces y_i from x_i.

• Unsupervised learning:

– We only have x_i values, but no explicit target labels.

– You want to do 'something' with them.

What is clustering?

– Input: set of objects described by features x_i.

– Output: an assignment of objects to 'groups'.

Unlike classification, we are not given the 'groups'.

– The algorithm must discover the groups (i.e., groups that we find by ourselves).

Explain K-means algorithm:

– Start with k initial 'means'.

– Assign each x_i to its closest mean.

– Update the means based on the assignment.

– Repeat until convergence.
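
A compact NumPy sketch of the loop (a real implementation would also handle clusters that lose all their points):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # initial means
    for _ in range(iters):
        # Assign each x_i to its closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update each mean to the average of its assigned points.
        new_means = np.array([X[assign == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_means, means):                 # converged
            break
        means = new_means
    return means, assign

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
means, assign = kmeans(X, k=2)
```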

Problem with K-means?

It may converge to a sub-optimal local solution

(i.e., we may not get the true k clusters).

- Attributes have to be numerical.

Solution: use k-means++.

Explain k-means++ steps

– Choose a random data point as the first mean.

– Compute the distance of every point to the closest mean.

– Sample the next mean proportional to these distances squared.
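
The same three steps as a NumPy sketch (returning k initial means for k-means to start from):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Choose a random data point as the first mean.
    means = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance of every point to its closest mean so far.
        d = np.min([np.linalg.norm(X - m, axis=1) for m in means], axis=0)
        # Sample the next mean proportional to these distances squared.
        means.append(X[rng.choice(len(X), p=d**2 / np.sum(d**2))])
    return np.array(means)
```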

What is density-based clustering? param or non-param?

A non-parametric clustering method!

- Clusters are defined by connected dense regions.

• Clusters become more complicated the more data we have.

– Data points in non-dense regions are not assigned a cluster.

Give the parameters of density-based clustering (DBSCAN)

– Radius (eps): the maximum distance between points for them to be considered 'close'.

– MinPoints: the number of 'close' points needed to define a cluster.
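
These map directly onto scikit-learn's DBSCAN as eps and min_samples (a sketch on made-up points, assuming scikit-learn is available):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3],   # dense region -> one cluster
              [5, 5], [5.1, 5.2], [5.2, 4.9],   # dense region -> another cluster
              [10, 0]])                         # isolated point -> no cluster
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # [0 0 0 1 1 1 -1]; -1 marks points not assigned a cluster
```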

Pros and cons of DBSCAN

Cons:

- Assigning new points to clusters is expensive.

• In high dimensions, you need a lot of points to 'fill' the space.

Runtime: O(n^2)

Pros:

- Useful for finding non-convex connected clusters.

Param vs non-param

– Fixed-depth decision tree: ‘size’ of the model is O(1) in terms of ‘n’.


– K-nearest neighbours: ‘size’ of the model is O(n) in terms of ‘n’

Explain the concept of region-based pruning:

It is for finding the closest objects among a huge number of objects.

*** We don't have to compute distances between points in distant regions.
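
One standard realization of this idea is a k-d tree, which recursively splits space into regions; SciPy's cKDTree prunes regions that cannot contain the nearest neighbour (a sketch, assuming SciPy is available):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((100_000, 2))         # a huge number of objects

tree = cKDTree(points)                    # builds the regions once
dist, idx = tree.query([0.5, 0.5], k=3)   # 3 closest points; distant regions skipped
```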