57 Cards in this Set
- Front
- Back
Pattern Space Pruning constraints |
1. Anti-monotonic 2. Monotonic 3. Succinct 4. Convertible |
|
Data Space Pruning constraints |
1. Data succinct 2. Data anti-monotonic |
|
Pattern space 1. Anti-Monotonicity |
A constraint C is anti-monotone if, whenever a pattern satisfies C, all of its sub-patterns do so too. In other words, anti-monotonicity: if an itemset S violates the constraint, so does any of its supersets. Example: sum <= v, range <= x, min support count |
|
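The anti-monotone property above is exactly what makes level-wise pruning safe: once an itemset violates the constraint, no superset needs to be examined. A minimal sketch in Python, assuming hypothetical non-negative item values and the constraint sum(S) <= v:

```python
# Hypothetical item values (non-negative), so sum(S) <= v is anti-monotone:
# once an itemset violates it, every superset does too and can be pruned.
prices = {"a": 10, "b": 20, "c": 35, "d": 50}

def sets_with_sum_at_most(v):
    """Level-wise enumeration; only itemsets that satisfy C are extended."""
    results = []
    frontier = {frozenset([i]) for i in prices}
    while frontier:
        satisfying = {s for s in frontier
                      if sum(prices[i] for i in s) <= v}
        results.extend(satisfying)
        # anti-monotone pruning: violating sets are never extended
        frontier = {s | {i} for s in satisfying for i in prices if i not in s}
    return results
```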
Pattern space 2. Monotonicity |
If an itemset S satisfies the constraint, so does any of its superset. Example: sum >= v, min <= v, range >= x |
|
Pattern Space 3. Succinctness |
Given A, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A, i.e., S contains a subset belonging to A
Example: min <= v |
|
Data Space 1. Anti-monotonicity |
A constraint c is data anti-monotone if, whenever a transaction t in the p-projected database cannot satisfy c for a pattern p, t's projection on any superset of p cannot satisfy c either. Example: sum >= v, min <= v, range >= x |
|
Convertible constraints Example: avg(x) >= 25, items ordered value-descending - ___________; value-ascending - ___________ |
Convert constraints into anti-monotone or monotone by properly ordering items. Value-descending order: anti-monotone; value-ascending order: monotone |
|
Strongly convertible |
can convert the constraints to both anti-monotone and monotone |
|
Types of data in clustering |
1. Interval-scaled variables 2. Binary variables 3. Nominal, ordinal, and ratio variables 4. Variables of mixed types |
|
Categories of clustering approaches |
1. Partitioning algorithms 2. Hierarchy algorithms 3. Density-based methods 4. Grid-based methods 5. Model-based methods |
|
Partitioning Algorithms heuristic methods: ______ and __________ |
k means and k medoids |
|
Pros and Cons of K-means 1. Efficiency? O(__) 2. Terminate at a ___________ 3. Applicable only when ___________ 4. Need to specify the ___________ 5. unable to handle ___________ 6. unsuitable to discover ___________ |
1. Relatively efficient: O(tkn), where n is the # of objects, k the # of clusters, and t the # of iterations; normally k, t << n 2. local optimum 3. a mean is defined 4. number of clusters k in advance 5. noisy data and outliers 6. clusters with non-convex shapes |
|
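The O(tkn) cost and the local-optimum termination on this card can both be seen in a minimal k-means sketch. This is an illustrative simplification: it initializes centers from the first k points, whereas real implementations seed randomly or with k-means++:

```python
def kmeans(points, k, iters=100):
    """Naive k-means on tuples: t rounds of assign + recompute, O(t*k*n)."""
    centers = points[:k]  # simplified deterministic init (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # recompute each center as the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:  # no change: terminated at a local optimum
            break
        centers = new
    return centers, clusters
```

Note how termination is "centers stopped moving", not "global optimum reached", matching point 2 on the card.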
K-medoid methods: PAM ___________ 1. ___________ choose k objects as the initial medoids 2. until no change, do |
Partitioning Around Medoids 1. arbitrarily 2. (a) assign each remaining object to the cluster with the nearest medoid (b) randomly select a non-medoid object o', compute the total cost S of swapping a medoid o with o' (c) if S < 0, swap o with o' to form the new set of k medoids |
|
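The swap test in step 2 can be sketched as follows. This is a hypothetical single-pass variant that tries every medoid/non-medoid swap and keeps the cheapest improving one (i.e. S < 0), rather than the full iterate-until-no-change loop:

```python
def total_cost(points, medoids, dist):
    """Sum of each object's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_swap_step(points, medoids, dist):
    """One PAM pass: evaluate every (medoid, non-medoid) swap,
    keep the best one whose cost change S is negative."""
    best = list(medoids)
    best_cost = total_cost(points, medoids, dist)
    for m in medoids:
        for o in points:
            if o in medoids:
                continue
            trial = [o if x == m else x for x in medoids]
            c = total_cost(points, trial, dist)
            if c < best_cost:  # S = c - best_cost < 0
                best, best_cost = trial, c
    return best, best_cost
```

Calling this repeatedly until the medoid set stops changing reproduces the "until no change" loop on the card.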
Pros and Cons of PAM 1. PAM is ____ robust than k-means in the presence of noise and outliers 2. PAM is efficient for ___________ data sets but does not scale well for ___________ data sets |
1. more 2. small, large |
|
CLARA ( ___________ ) 1. Draw multiple ___________ of the data set and apply ___________ on each sample 2. perform ___________ than PAM in ___________ data sets 3. Efficiency depends on the ___________ |
Clustering LARge Applications 1. samples, PAM 2. better, large 3. sample size |
|
AGNES (___________) 1. initially, each object is a ___________ 2. step by step cluster ___________, until all objects form a ___________ 3. ___________ approach 4. the similarity between two clusters is measured by the similarity of the ___________ |
Agglomerative Nesting 1. cluster 2. merging, cluster 3. single-link 4. closest pair of data points belonging to different clusters |
|
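A minimal single-link AGNES sketch on 1-D points. The `stop_at` parameter is a convenience assumption here; textbook AGNES keeps merging until all objects form one cluster:

```python
def single_link_agnes(points, stop_at=1):
    """Agglomerative: start with singleton clusters, repeatedly merge the
    two clusters whose closest pair of points is nearest (single-link)."""
    clusters = [[p] for p in points]
    while len(clusters) > stop_at:
        # single-link distance: min over all cross-cluster point pairs
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(abs(a - b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # merge the closest pair of clusters
    return clusters
```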
DIANA (___________) 1. initially ___________ 2. step by step ___________ until ___________ |
DIvisive ANAlysis 1. all objects are in one cluster 2. splitting clusters, until each cluster contains only one object |
|
Challenges of Hierarchical clustering methods 1. Hard to choose ___________ 2. Do not ___________ well, O(__) 3. Data can't fit in memory 4. Integrating other techniques _________, _________, _________, _________ |
1. merge/split points 2. scale, O(n^2) 4. BIRCH, CURE, CHAMELEON, ROCK |
|
BIRCH ___________ 1. ___________ Tree, a hierarchical data structure summarizing object info |
Balanced Iterative Reducing and Clustering using Hierarchies 1. CF (Clustering Feature) |
|
CF components N LS SS |
N: number of data points LS: linear sum of the points (sum of x) SS: square sum of the points (sum of x^2) |
|
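The (N, LS, SS) triple is additive, which is why BIRCH can merge cluster summaries without ever rescanning the raw data. A 1-D sketch:

```python
def cf(points):
    """Clustering Feature of a set of 1-D points: (N, LS, SS)."""
    return (len(points), sum(points), sum(x * x for x in points))

def cf_merge(a, b):
    """CF additivity: the CF of a merged cluster is the component-wise sum,
    so two summaries can be combined without touching the data."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def centroid(cf_triple):
    """Cluster centroid recovered from the summary alone: LS / N."""
    n, ls, _ = cf_triple
    return ls / n
```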
Pros and Cons of BIRCH 1. ___________ scalability 2. can handle only ___________ data 3. sensitive to the ___________ |
1. linear scalability (single scan) 2. numeric 3. order of the data records |
|
Drawbacks of square error based methods |
1. one representative per cluster: good only for convex-shaped clusters of similar size and density 2. number-of-clusters parameter k: good only if k can be reasonably estimated |
|
Drawbacks of distance based methods
1. hard to find cluster with ___________ 2. hard to specify the ___________ 3. Heuristic: a cluster must be ___________ |
1. irregular shapes
2. number of clusters 3. dense |
|
EPS: ___________ MinPts: ___________ NEps(p): ___________ |
Eps: maximum radius of the neighborhood MinPts: minimum number of points in an Eps-neighborhood of that point NEps(p): {q | dist(p,q) <= Eps} |
|
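These two definitions translate directly to code. A 1-D sketch (the eps and min_pts values in the test are purely illustrative):

```python
def eps_neighborhood(p, points, eps):
    """NEps(p) = {q | dist(p, q) <= Eps}; note p itself is included."""
    return [q for q in points if abs(q - p) <= eps]

def is_core(p, points, eps, min_pts):
    """p is a core object iff |NEps(p)| >= MinPts."""
    return len(eps_neighborhood(p, points, eps)) >= min_pts
```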
Core Object p |
p is a core object iff |NEps(p)| >= MinPts |
|
Directly Density Reachable |
A point q is directly density-reachable from p iff q is in NEps(p) and p is a core object |
|
Density Reachable |
If p1 -> p2, p2 -> p3, ..., pn-1 -> pn are each directly density-reachable, then pn is density-reachable from p1
|
|
Density connected |
Points p and q are density-connected if there is a point o such that both p and q are density-reachable from o |
|
DBSCAN |
Density-Based Spatial Clustering of Applications with Noise; discovers clusters and labels outliers as noise 1. Arbitrarily select a point p 2. Retrieve all points density-reachable from p wrt Eps and MinPts 3. If p is a core point, a cluster is formed 4. If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database 5. Continue the process until all of the points have been processed |
|
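The five steps on this card can be sketched as a small 1-D DBSCAN. Cluster ids and the use of -1 for noise are conventions assumed here, not part of the algorithm's definition:

```python
def dbscan(points, eps, min_pts):
    """Sketch of DBSCAN on distinct 1-D points.
    Returns {point: cluster_id}; noise points get id -1."""
    labels = {}
    cluster_id = 0

    def neighbors(p):
        return [q for q in points if abs(q - p) <= eps]

    for p in points:                       # step 1: pick an unvisited point
        if p in labels:
            continue
        seeds = neighbors(p)               # step 2: density-reachable set
        if len(seeds) < min_pts:           # step 4: border/noise for now
            labels[p] = -1
            continue
        labels[p] = cluster_id             # step 3: core point starts a cluster
        queue = [q for q in seeds if q != p]
        while queue:                       # expand via density-reachability
            q = queue.pop()
            if labels.get(q, -1) == -1:    # unlabeled, or noise turned border
                labels[q] = cluster_id
                nq = neighbors(q)
                if len(nq) >= min_pts:     # q is core: keep expanding
                    queue.extend(r for r in nq if r not in labels)
        cluster_id += 1
    return labels                          # step 5: all points processed
```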
Problems of DBSCAN 1. Clusters may have different __________ 2. Clusters may be in __________ 3. Sensitive to __________ |
1. densities 2. hierarchies 3. Parameters |
|
OPTICS (__________), a cluster ordering method 1. Produces a special __________ of the database wrt its __________ 2. Good for both __________ and __________ cluster analysis 3. Can be represented __________ 4. Complexity |
Ordering Points To Identify the Clustering Structure 1. ordering, density-based clustering structure 2. automatic and interactive 3. graphically or using visualization techniques 4. O(N log N) |
|
Core distance of an object p
|
The smallest Eps value such that the Eps-neighborhood of p contains at least MinPts objects |
|
Grid-based clustering methods 1. Using ________________ data structure 2. Use dense grid cells to form clusters Several interesting methods 1. ___________ 2. ___________ |
1. multi-resolution grid Methods: 1. STING 2. CLIQUE |
|
STING: ___________ 1. The spatial area is divided into ___________ 2. There are several levels of cells corresponding to different levels of ___________ |
A Statistical Information Grid Approach 1. rectangular cells 2. resolution |
|
Pros and Cons of STING 1. ___________ independent, easy to ___________, ___________ update 2. O(_) 3. All the cluster boundaries are either ___________ or ___________, no diagonal boundary is detected |
1. query, parallelize, incremental 2. O(K), where K is the number of grid cells at the lowest level 3. horizontal or vertical |
|
CLIQUE ___________ 1. Automatically identifying subspaces of ___________ data that allow better clustering than original space 2. both ___________ based and ___________ based |
CLustering In QUEst 1. high-dimensional 2. density-based and grid-based |
|
pros and cons of CLIQUE 1. it automatically finds subspaces of the ___________ 2. it is insensitive to the ___________ 3. it scales ___________ with the size of input 4. the accuracy of the result may be ___________ due to ___________ |
1. highest dimensionality such that high-density clusters exist in those subspaces 2. order of records in input and does not presume some canonical data distribution 3. linearly 4. degraded, the simplicity of the method |
|
Index-based algorithm: identify ___________ 1. O(__), more scalable with dimensionality than ___________ 2. Building the right index is very costly |
distance-based outliers 1. O(kN^2), depth-based methods |
|
Classification classifies data based on the ________ set |
training |
|
Classification is ________ learning. The training data are accompanied by labels indicating the class of the observation. |
Supervised |
|
Clustering is ________ learning. The class labels of training data are ________ |
unsupervised unknown |
|
Major Classification Models 1. ________ 2. ________ 3. ________ 4. ________ 5. ________ |
1. Classification by decision tree induction 2. Bayesian classification 3. neural networks 4. Support Vector Machines 5. Classification based on associations |
|
Evaluating classification methods 1. ________ 2. ________ 3.________ 4.________ 5. ________ 6. ________ |
1. Predictive accuracy 2. Speed 3. Robustness 4. scalability 5. interpretability 6. goodness of rules |
|
Natural Bias in the Information Gain Measure ________ |
Favors attributes with many values |
|
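The bias can be demonstrated numerically: an ID-like attribute with a unique value per record always achieves the maximum possible gain, even though it generalizes to nothing. A sketch with made-up data:

```python
from math import log2

def entropy(labels):
    """H(labels) = -sum p(c) * log2 p(c) over the classes present."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(values, labels):
    """Gain = H(labels) - weighted H(labels | attribute value)."""
    n = len(labels)
    rem = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        rem += len(sub) / n * entropy(sub)
    return entropy(labels) - rem
```

An attribute taking values [1, 2, 3, 4] (one per record, like a record ID) drives the remainder to zero and so scores the full entropy as gain, while a genuinely uninformative two-valued attribute can score zero.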
Overfitting in classification: ________ Two approaches to avoid overfitting: 1. _______________ 2. _______________ |
Too many branches, some may reflect anomalies due to noise or outliers. 1. Prepruning: halt tree construction early 2. Postpruning: remove branches from a "fully grown" tree, yielding a sequence of progressively pruned trees |
|
Approaches to determine the final tree size 1. ________ 2. ________ 3. ________ 4. ________ |
1. Separate training and testing sets 2. Use cross-validation 3. Use all the data for training, then apply a statistical test to estimate whether expanding or pruning a node may improve the entire distribution 4. Minimum description length |
|
Minimum description length - Select the model with the ________ that ________ of ________ and ________ |
1. shortest effective description 2. that minimizes the sum of 3. the length of an effective description of the model and 4. the length of an effective description of the data when encoded with the help of the model |
|
Why Bayesian Classification
1. ___________ 2. ___________ 3. ___________ 4. ___________ |
1. Probabilistic learning: use probabilities for hypothesis
2. Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct 3. Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities 4. Standard: provides a standard of optimal decision making against which other methods can be measured |
|
Pros and Cons of Naive Bayesian Classifier Advantages: 1. ___________ 2. ___________ Disadvantages: 1. ___________ |
1. easy to implement 2. good results obtained in most cases 1. the assumption of class conditional independence, which loses accuracy |
|
Bayesian networks allow a subset of the variables to be conditionally ___________ |
independent
|
|
SVM 1. ___________ the margin is good because it implies that only ___________ matter |
1. Maximizing 2. support vectors |
|
___________ are currently among the best performers for a number of classification tasks ranging from text to genomic data |
SVM, Support Vector Machines |
|
CAR ___________ The objective of classification by association is to generate the complete set of CARs that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. |
Class association rules |
|
Contributions of CAR 1. ___________ 2. ___________ 3. ___________ |
1. New way to build accurate classifiers 2. Makes association rule mining techniques applicable to classification tasks 3. Solves a number of important problems with existing classification systems, including: the understandability problem, discovery of interesting or useful rules, disk vs. memory |
|
support = (___________/|D|) confidence = (___________/___________) |
rulesupCount rulesupCount/condsupCount |
|
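The two formulas can be checked on a toy labelled transaction set (the data below is made up; `cond <= items` is Python's subset test):

```python
def car_support_confidence(transactions, cond, cls):
    """For a class association rule cond -> cls over (items, label) pairs:
    support    = rulesupCount / |D|
    confidence = rulesupCount / condsupCount"""
    d = len(transactions)
    condsup = sum(1 for items, _ in transactions if cond <= items)
    rulesup = sum(1 for items, label in transactions
                  if cond <= items and label == cls)
    return rulesup / d, rulesup / condsup
```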
given two rules a and b, define a > b if 1. ___________ 2. ___________ 3. ___________ |
1. The confidence of a is greater than that of b 2. Same confidence but the support of a is greater than that of b 3. Both the confidences and supports are the same, but a is generated earlier than b |
|
M1 algorithm: to classify by association The algorithm satisfies two conditions: 1. ___________ 2. ___________ Cons 1. ___________ |
1. Each training case is covered by the rule with the highest precedence among the rules that can cover the case 2. Every rule in C correctly classifies at least one remaining training case when it is chosen. It is simple but inefficient, especially when the database is not resident in main memory |