55 Cards in this Set
Steps in data mining: |
1. Collect data. 2. Data mining. 3. Good stuff. |
|
What is data? |
a collection of objects and their features.
|
|
Types of data |
1. Discrete features 2. Numerical features (continuous features) |
|
Give ways that data might not be clean |
1. Noise (e.g., distortion on a phone line) 2. Outliers (e.g., instrument error or a technical problem) 3. Missing values 4. Duplicated data |
|
What is the best way to clean data? |
ML is the best way to detect/fix these problems. |
|
What is an object in ML+DM? |
Objects are assumed to be IID: independent and identically distributed. The order of the objects doesn't matter, and new objects will behave like the existing objects. The IID assumption implies that our conclusions will probably also apply to new data. E.g., an assignment question should be similar to an exam question. |
|
How much data do we need? |
We need more data points than categories. |
|
What is feature aggregation? |
combining features to form new features. |
|
What is feature selection? |
Removing features that are not relevant to the task. |
|
What is feature transformation? |
Mathematical transformations: square, exponentiate, or take the logarithm. |
|
What is discretization? |
A feature transformation that turns continuous features into discrete ones. (A related transformation, scaling, converts variables to comparable scales.) |
|
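A minimal sketch of both transformations with NumPy (the feature values and bin edges below are made up for illustration):

```python
import numpy as np

x = np.array([1.2, 7.5, 3.3, 9.8])   # a continuous feature (made-up values)

# Discretization: replace each value with the index of its bin.
bins = np.array([2.5, 5.0, 7.5])     # hypothetical bin edges
x_discrete = np.digitize(x, bins)    # e.g., 1.2 -> bin 0, 9.8 -> bin 3

# Scaling: standardize to zero mean and unit variance so features are comparable.
x_scaled = (x - x.mean()) / x.std()
```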
How do you 'look' at features and high-dimensional objects? |
- Summary statistics. - Visualization. - ML + DM (later in the course). |
|
Types of summary statistics? |
1. Discrete: for a discrete variable, the simple matching coefficient; for binary variables, the Jaccard coefficient. 2. Continuous: measures of location, measures of spread, and measures of the relationship between continuous variables. |
|
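The two discrete coefficients are easy to state in code. A sketch for binary feature vectors (the vectors are made up):

```python
import numpy as np

a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 0, 0, 1, 1])

# Simple matching coefficient: fraction of positions where the two vectors agree.
smc = np.mean(a == b)                                                # 3/5 here

# Jaccard coefficient: of the positions where at least one vector is 1,
# the fraction where both are 1 (shared zeros are ignored).
jaccard = ((a == 1) & (b == 1)).sum() / ((a == 1) | (b == 1)).sum()  # 2/4 here
```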
Types of visualization |
1. Basic plot: f(x) = 2x+1 (2-D graph) 2. Histogram 3. Box plot 4. Scatterplot (look at the distribution of 2 features) 5. Scatterplot array (plot multiple variables in an array) 6. Map coloring 7. Contour plot |
|
What is the IID assumption? |
The IID assumption implies that our conclusions will probably also apply to new data. |
|
What are quantiles? |
Values such that a given fraction of the data lies below them (e.g., the median is the 0.5 quantile). |
|
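A quick check of that definition with NumPy (made-up data):

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])

median = np.quantile(x, 0.5)   # the 0.5 quantile is the median
q90 = np.quantile(x, 0.9)      # ~90% of the data lies below this value
```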
What is a decision stump? |
A decision tree with one rule. |
|
Key idea of decision stump |
There are many possible stumps (rules) to choose from, and more than one may be optimal. We need a score, such as classification accuracy, to pick the 'best' rule. |
|
Classification accuracy |
"If we use this rule, how many objects do we label correctly?" |
|
Running time of finding the best rule for a decision stump |
O(ndt), where n = number of objects (examples), d = number of features, t = number of candidate thresholds to check. |
|
How do we speed up this process? |
Limit the thresholds to check: restrict thresholds to the range of each feature, or to values of the features that occur in the data. The latter gives at most n thresholds per feature, so O(ndt) becomes O(n^2 d). With sorting, O(n^2 d) improves to O(nd log n). |
|
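A sketch of the sorted-threshold trick for a single feature with binary labels (function and variable names are my own, not from the course):

```python
import numpy as np

def best_stump_one_feature(x, y):
    # Best threshold on feature x for binary labels y in {0, 1}, via sorting.
    # Thresholds are restricted to values occurring in the data, and running
    # counts let each candidate be scored in O(1), so one feature costs
    # O(n log n) instead of O(n^2).
    order = np.argsort(x)
    x_s, y_s = x[order], y[order]
    n = len(y)
    ones_below = np.cumsum(y_s)           # number of 1-labels among the first i points

    best_acc, best_thr = 0.0, None
    for i in range(1, n):
        if x_s[i] == x_s[i - 1]:
            continue                      # equal values can't be separated
        # Predict the majority class on each side of the split.
        b1 = ones_below[i - 1]            # ones below the threshold
        b0 = i - b1                       # zeros below the threshold
        a1 = ones_below[-1] - b1          # ones at/above the threshold
        a0 = (n - i) - a1                 # zeros at/above the threshold
        acc = (max(b0, b1) + max(a0, a1)) / n
        if acc > best_acc:
            best_acc, best_thr = acc, (x_s[i] + x_s[i - 1]) / 2
    return best_thr, best_acc
```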
What does supervised learning do? |
Uses data to build a program that outputs labels from input features. |
|
What does a decision tree do? |
Makes a decision via a sequence of simple rules. |
|
What is overfitting? |
When a model has lower accuracy on new data than on the training data. |
|
The idea of supervised learning |
Typical supervised learning steps: 1. Build a model based on training data X and y. 2. The model makes predictions yhat on new data Xtest. You win if yhat is similar to the true unseen ytest. |
|
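A sketch of those steps with scikit-learn's fit/predict interface (the random dataset here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # n-by-d feature matrix (made up)
y = rng.integers(0, 2, size=100)     # n labels (made up)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=2).fit(Xtrain, ytrain)  # step 1
yhat = model.predict(Xtest)                                      # step 2
print(accuracy_score(ytest, yhat))   # you 'win' if yhat is similar to ytest
```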
Golden rule of machine learning |
You cannot use the test data to build the model. Minimizing test error is not the goal, but it is a measure of how well we do on new data. |
|
What is the decision tree's idea? |
1. Start at the root node. 2. Branch using a splitting rule. 3. a) If at a leaf node, return its label. b) Otherwise, go to 2. |
|
What is the idea of a decision stump? |
A decision tree with one rule. Go through all objects, and find out which class is more likely given the rule (classification accuracy). To learn the 'best' stump, find the 'best' rule. |
|
What is the fundamental trade-off? |
1. How small you can make the training error, vs. 2. How well the training error approximates the test error. |
|
Tell me about validation error |
If used once, it's an unbiased approximation of the test error. You can overfit if you use the validation error too many times. With decision trees, you only evaluate a few depths, so the validation error likely approximates the test error. |
|
What can we do with cross-validation? |
You can use CV to select the decision tree depth. You can also use it to decide whether to split: don't split if the CV error doesn't improve. Or fit a deep decision tree and use CV to prune: remove leaf nodes that don't improve the CV error. |
|
What does cross-validation do? |
Cross-validation allows using more of the dataset to estimate the validation error. |
|
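A sketch of using cross-validation to choose the depth (the candidate depths are arbitrary; names are my own):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def pick_depth(X, y, depths=(1, 2, 4, 8)):
    # Score each depth by mean accuracy over 5 folds, so every training
    # example is used for validation exactly once.
    scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d),
                                 X, y, cv=5).mean()
              for d in depths}
    return max(scores, key=scores.get)
```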
What is the golden rule? |
Even though what we care about is test error, YOU CANNOT USE THE TEST DATA TO BUILD THE MODEL. Minimizing test error is not the goal, but it is a measure of how well we do on new data. |
|
What do we mean by 'maximizing' over data? |
"I built 10 models based on half of the data you gave me. One of them classified the other half of the data with 98% accuracy." Maximizing over these validation errors gives a biased approximation of the test error. But they only maximized over 10 models, so the bias is probably small, and they probably know about the golden rule. |
|
Explain naïve Bayes and its uses. |
Naïve Bayes is a probabilistic classifier based on Bayes' rule. It's 'naïve' because it makes a strong independence assumption, but it tends to work well with bag-of-words features. |
|
What are generative classifiers? |
They build a model of the probability of seeing the features, so they need to know the probability of the features given the class. E.g., for spam filtering, you need one model that knows what spam messages look like and a second model that knows what non-spam messages look like. Naïve Bayes additionally assumes the variables in xi are independent of each other given yi. |
|
Again, what is the assumption of naïve Bayes? *** |
Given yi, all features xi are conditionally independent of each other. |
|
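A minimal Bernoulli (bag-of-words) naive Bayes sketch under exactly that assumption. Variable names are my own, and alpha is a Laplace-smoothing count (see the maximum-likelihood card below):

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    # X: n-by-d binary bag-of-words matrix; y: n labels in {0, 1}.
    p_y = np.array([(y == c).mean() for c in (0, 1)])
    # Estimate p(x_j = 1 | y = c) independently per feature: the 'naive' part.
    p_xy = np.array([(X[y == c].sum(axis=0) + alpha) /
                     ((y == c).sum() + 2 * alpha) for c in (0, 1)])
    return p_y, p_xy

def predict(X, p_y, p_xy):
    # Sum log-probabilities instead of multiplying many tiny numbers.
    log_post = (np.log(p_y) + X @ np.log(p_xy).T
                + (1 - X) @ np.log(1 - p_xy).T)
    return np.argmax(log_post, axis=1)
```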
What's the problem with the maximum likelihood estimate? |
A single zero probability can zero out the whole product, e.g., p(seeing 'lac' | spam) = 0. |
|
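The standard fix is Laplace smoothing: add a pseudo-count beta to every count so no estimated probability is exactly zero. A one-function sketch (not course code):

```python
def smoothed_prob(count, total, beta=1.0, num_values=2):
    # The MLE would be count / total, which can be exactly 0 and zero out
    # the whole product. Pretend we saw each possible value beta extra times.
    return (count + beta) / (total + beta * num_values)
```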
What is a parametric method? |
There is a fixed number of 'parameters' in the model (e.g., number of rules). As you get more data, you can estimate them more accurately, but at some point more data doesn't help because the model is too simple. |
|
What is a non-parametric method? |
The number of parameters grows with the number of training examples, so the model gets more complicated as you get more data. E.g., a decision tree whose depth grows with the number of examples. |
|
What is the problem with KNN? |
KNN does not know that labels should be invariant to transformations such as translation, rotation, and changes in size. Solutions: 1. The hard/slow way is to modify your distance function. 2. The easy/fast way is to just add transformed data during training. |
|
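A sketch of the easy/fast way for image data, adding one-pixel translations of every training image (the function name and the one-pixel choice are my own):

```python
import numpy as np

def augment_with_translations(X_images, y):
    # X_images: n-by-h-by-w array of images. Add copies shifted one pixel in
    # each of the four directions so KNN sees translated versions at training
    # time. (np.roll wraps pixels around the border: fine as a rough augmentation.)
    shifted = [np.roll(X_images, shift, axis=ax)
               for ax in (1, 2) for shift in (-1, 1)]
    X_aug = np.concatenate([X_images] + shifted)
    y_aug = np.concatenate([y] * (len(shifted) + 1))  # labels are unchanged
    return X_aug, y_aug
```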
Do we have any classifiers that are accurate and run in real time? |
Decision trees and naïve Bayes are fast, but often not very accurate. KNN is often accurate, but not very fast. Solution: deployed systems use an ensemble method called random forests. |
|
What are ensemble methods? |
Classifiers that take classifiers as input. E.g.: a) Boosting: take a simple classifier that underfits and improve its training error. b) Averaging: take a complex classifier that overfits and improve its test error. |
|
2 types of averaging (ensemble method)? |
1. Model averaging: take the mode of the predictions (or the average if probabilistic). 2. Stacking: fit another classifier that uses the predictions as features. |
|
Two key ingredients in random forests? |
1. Bootstrap sampling: randomize each tree's training set by sampling with replacement (same number of examples, but some examples repeated and others left out). 2. Random splits: only consider a random subset of the features when searching for the best rule (*** do not consider all features). |
|
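A sketch of the bootstrap-sampling ingredient (the random-split part would live inside the tree-fitting code):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    # Draw n indices with replacement: same size as the original data,
    # but some examples repeat and others are left out.
    n = len(y)
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]

# Usage sketch: fit each tree on its own bootstrap sample, then predict
# with the mode of the trees' predictions (model averaging).
```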
Explain the difference between supervised learning and unsupervised learning. *** |
Supervised learning: we have features xi and class labels yi, and write a program that produces yi from xi. Unsupervised learning: we only have xi values, with no explicit target labels, and want to do 'something' with them. |
|
What is clustering? |
Input: a set of objects described by features xi. Output: an assignment of objects to 'groups'. Unlike classification, we are not given the groups; the algorithm must discover them (i.e., groups that we find by ourselves). |
|
Explain the K-means algorithm: |
1. Assign each xi to its closest mean. 2. Update the means based on the assignments. 3. Repeat until convergence. |
|
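A compact NumPy sketch of those three steps (no handling of empty clusters; names are my own):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # random initial means
    for _ in range(n_iters):
        # Step 1: assign each point to its closest mean.
        dists = np.linalg.norm(X[:, None] - means[None], axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: update each mean to the average of its assigned points.
        new_means = np.array([X[assign == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_means, means):
            break                                          # converged
        means = new_means
    return means, assign
```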
Problem with K-means? |
It may converge to a sub-optimal local solution (i.e., we may not recover the true k clusters), and the attributes have to be numerical. For the first problem: use k-means++ initialization. |
|
Explain the k-means++ steps |
1. Choose a random data point as the first mean. 2. Compute the distance of every point to the closest mean. 3. Sample the next mean proportional to these distances squared. |
|
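The same steps in NumPy (a sketch; the full k-means++ algorithm then runs ordinary k-means from these means):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    means = [X[rng.integers(len(X))]]        # first mean: a random data point
    for _ in range(k - 1):
        # Squared distance from every point to its closest mean so far.
        d2 = np.min(np.linalg.norm(X[:, None] - np.array(means)[None],
                                   axis=2), axis=1) ** 2
        # Sample the next mean proportional to these squared distances,
        # so far-away points are the most likely new means.
        means.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(means)
```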
What is density-based clustering? Parametric or non-parametric? |
A non-parametric clustering method! Clusters are defined by connected dense regions, so the model becomes more complicated the more data we have. Data points in non-dense regions are not assigned to a cluster. |
|
Give the parameters of density-based clustering (DBSCAN) |
1. Radius: the maximum distance between points for them to be considered 'close'. 2. MinPoints: the number of 'close' points needed to define a cluster. |
|
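A usage sketch with scikit-learn's implementation (the eps and min_samples values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))    # made-up 2-D data

# eps plays the role of Radius, min_samples the role of MinPoints.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
# Points in non-dense regions get label -1: they belong to no cluster.
```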
Pros and cons of DBSCAN |
Pros: useful for finding non-convex connected clusters. Cons: assigning new points to clusters is expensive, and in high dimensions you need a lot of points to 'fill' the space. Runtime: O(n^2). |
|
Param vs. non-param |
Fixed-depth decision tree: the 'size' of the model is O(1) in terms of n (parametric). K-nearest neighbours: the 'size' of the model is O(n) in terms of n (non-parametric). |
|
Explain the concept of region-based pruning: |
A technique for finding the closest objects among a huge number of objects: *** you don't have to compute distances between points in distant regions. |
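One standard realization of this idea is a k-d tree, which partitions space into regions so queries can skip distant ones entirely. A sketch with SciPy (the data is made up):

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.default_rng(0).normal(size=(10000, 3))   # made-up points

tree = cKDTree(X)                  # region-based index over the points
dist, idx = tree.query(X[0], k=5)  # 5 nearest neighbours without a full scan
```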