57 Cards in this Set
- Front
- Back
Pattern Space Pruning constraints |
1. Anti-monotonic 2. Monotonic 3. Succinct 4. Convertible |
|
Data Space Pruning constraints |
1. Data succinct 2. Data anti-monotonic |
|
Pattern space 1. Anti-Monotonicity |
A constraint C is anti-monotone if, whenever a pattern satisfies C, all of its sub-patterns do so too. In other words, anti-monotonicity: if an itemset S violates the constraint, so does any of its supersets. Example: sum <= v, range <= x, min support count |
|
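The anti-monotone property above is exactly what makes level-wise pruning safe: once an itemset violates the constraint, no superset needs to be examined. A minimal sketch in Python, assuming hypothetical non-negative item values and the constraint sum(S) <= v:

```python
# Hypothetical item values (non-negative), so sum(S) <= v is anti-monotone:
# once an itemset violates it, every superset does too and can be pruned.
prices = {"a": 10, "b": 20, "c": 35, "d": 50}

def sets_with_sum_at_most(v):
    """Level-wise enumeration; only itemsets that satisfy C are extended."""
    results = []
    frontier = {frozenset([i]) for i in prices}
    while frontier:
        satisfying = {s for s in frontier
                      if sum(prices[i] for i in s) <= v}
        results.extend(satisfying)
        # anti-monotone pruning: violating sets are never extended
        frontier = {s | {i} for s in satisfying for i in prices if i not in s}
    return results
```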
Pattern space 2. Monotonicity |
If an itemset S satisfies the constraint, so does any of its superset. Example: sum >= v, min <= v, range >= x |
|
Pattern Space 3. Succinctness |
Given A, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A, i.e., S contains a subset belonging to A
Example: min <= v |
|
Data Space 1. Anti-monotonicity |
A constraint c is data anti-monotone if, whenever a transaction t in the p-projected database cannot satisfy c for a pattern p, t's projection on any superset of p cannot satisfy c either. Example: sum >= v, min <= v, range >= x |
|
Convertible constraints Example: avg(x) >= 25, items ordered value-descending - ___________; value-ascending - ___________ |
Convert constraints into anti-monotone or monotone by properly ordering items. Value-descending order: anti-monotone; value-ascending order: monotone |
|
Strongly convertible |
can convert the constraints to both anti-monotone and monotone |
|
Types of data in clustering |
1. Interval-scaled variables 2. Binary variables 3. Nominal, ordinal, and ratio variables 4. Variables of mixed types |
|
Categories of clustering approaches |
1. Partitioning algorithms 2. Hierarchy algorithms 3. Density-based methods 4. Grid-based methods 5. Model-based methods |
|
Partitioning Algorithms heuristic methods: ______ and __________ |
k means and k medoids |
|
Pros and Cons of K-means 1. Efficiency? O(__) 2. Terminate at a ___________ 3. Applicable only when ___________ 4. Need to specify the ___________ 5. unable to handle ___________ 6. unsuitable to discover ___________ |
1. Relatively efficient: O(tkn), where n is the # of objects, k the # of clusters, and t the # of iterations; normally k, t << n 2. local optimum 3. a mean is defined 4. number of clusters k in advance 5. noisy data and outliers 6. clusters with non-convex shapes |
|
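The O(tkn) cost and the local-optimum termination on this card can both be seen in a minimal k-means sketch. This is an illustrative simplification: it initializes centers from the first k points, whereas real implementations seed randomly or with k-means++:

```python
def kmeans(points, k, iters=100):
    """Naive k-means on tuples: t rounds of assign + recompute, O(t*k*n)."""
    centers = points[:k]  # simplified deterministic init (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # recompute each center as the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:  # no change: terminated at a local optimum
            break
        centers = new
    return centers, clusters
```

Note how termination is "centers stopped moving", not "global optimum reached", matching point 2 on the card.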
K-medoid methods: PAM ___________ 1. ___________ choose k objects as the initial medoids 2. until no change, do |
Partitioning Around Medoids 1. arbitrarily 2. (a) assign each remaining object to the cluster with the nearest medoid (b) randomly select a non-medoid object o', compute the total cost S of swapping a medoid o with o' (c) if S < 0, swap o with o' to form the new set of k medoids |
|
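The swap test in step 2 can be sketched as follows. This is a hypothetical single-pass variant that tries every medoid/non-medoid swap and keeps the cheapest improving one (i.e. S < 0), rather than the full iterate-until-no-change loop:

```python
def total_cost(points, medoids, dist):
    """Sum of each object's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_swap_step(points, medoids, dist):
    """One PAM pass: evaluate every (medoid, non-medoid) swap,
    keep the best one whose cost change S is negative."""
    best = list(medoids)
    best_cost = total_cost(points, medoids, dist)
    for m in medoids:
        for o in points:
            if o in medoids:
                continue
            trial = [o if x == m else x for x in medoids]
            c = total_cost(points, trial, dist)
            if c < best_cost:  # S = c - best_cost < 0
                best, best_cost = trial, c
    return best, best_cost
```

Calling this repeatedly until the medoid set stops changing reproduces the "until no change" loop on the card.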
Pros and Cons of PAM 1. PAM is ____ robust than k-means in the presence of noise and outliers 2. PAM is efficient for ___________ data sets but does not scale well for ___________ data sets |
1. more 2. small, large |
|
CLARA ( ___________ ) 1. Draw multiple ___________ of the data set and apply ___________ on each sample 2. perform ___________ than PAM in ___________ data sets 3. Efficiency depends on the ___________ |
Clustering LARge Applications 1. samples, PAM 2. better, large 3. sample size |
|
AGNES (___________) 1. initially, each object is a ___________ 2. step by step cluster ___________, until all objects form a ___________ 3. ___________ approach 4. the similarity between two clusters is measured by the similarity of the ___________ |
Agglomerative Nesting 1. cluster 2. merging, cluster 3. single-link 4. closest pair of data points belonging to different clusters |
|
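A minimal single-link AGNES sketch on 1-D points. The `stop_at` parameter is a convenience assumption here; textbook AGNES keeps merging until all objects form one cluster:

```python
def single_link_agnes(points, stop_at=1):
    """Agglomerative: start with singleton clusters, repeatedly merge the
    two clusters whose closest pair of points is nearest (single-link)."""
    clusters = [[p] for p in points]
    while len(clusters) > stop_at:
        # single-link distance: min over all cross-cluster point pairs
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(abs(a - b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # merge the closest pair of clusters
    return clusters
```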
DIANA (___________) 1. initially ___________ 2. step by step ___________ until ___________ |
DIvisive ANAlysis 1. all objects are in one cluster 2. splitting clusters, until each cluster contains only one object |
|
Challenges of Hierarchical clustering methods 1. Hard to choose ___________ 2. Do not ___________ well, O(__) 3. Data can't fit in memory 4. Integrating other techniques _________, _________, _________, _________ |
1. merge/split points 2. scale, O(n^2) 4. BIRCH, CURE, CHAMELEON, ROCK |
|
BIRCH ___________ 1. ___________ Tree, a hierarchical data structure summarizing object info |
Balanced Iterative Reducing and Clustering using Hierarchies 1. CF (Clustering Feature) |
|
CF components N LS SS |
N: number of data points LS: linear sum of the points (sum of x) SS: square sum of the points (sum of x^2) |
|
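The (N, LS, SS) triple is additive, which is why BIRCH can merge cluster summaries without ever rescanning the raw data. A 1-D sketch:

```python
def cf(points):
    """Clustering Feature of a set of 1-D points: (N, LS, SS)."""
    return (len(points), sum(points), sum(x * x for x in points))

def cf_merge(a, b):
    """CF additivity: the CF of a merged cluster is the component-wise sum,
    so two summaries can be combined without touching the data."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def centroid(cf_triple):
    """Cluster centroid recovered from the summary alone: LS / N."""
    n, ls, _ = cf_triple
    return ls / n
```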
Pros and Cons of BIRCH 1. ___________ scalability 2. can handle only ___________ data 3. sensitive to the ___________ |
1. linear scalability (single scan) 2. numeric 3. order of the data records |
|
Drawbacks of square error based methods |
1. one representative per cluster: good only for convex-shaped clusters of similar size and density 2. number-of-clusters parameter k: good only if k can be reasonably estimated |
|
Drawbacks of distance based methods
1. hard to find cluster with ___________ 2. hard to specify the ___________ 3. Heuristic: a cluster must be ___________ |
1. irregular shapes
2. number of clusters 3. dense |
|
EPS: ___________ MinPts: ___________ NEps(p): ___________ |
Eps: maximum radius of the neighborhood MinPts: minimum number of points in an Eps-neighborhood of that point NEps(p): {q | dist(p,q) <= Eps} |
|
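These two definitions translate directly to code. A 1-D sketch (the eps and min_pts values in the test are purely illustrative):

```python
def eps_neighborhood(p, points, eps):
    """NEps(p) = {q | dist(p, q) <= Eps}; note p itself is included."""
    return [q for q in points if abs(q - p) <= eps]

def is_core(p, points, eps, min_pts):
    """p is a core object iff |NEps(p)| >= MinPts."""
    return len(eps_neighborhood(p, points, eps)) >= min_pts
```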
Core Object p |
p is a core object iff |NEps(p)| >= MinPts |
|
Directly Density Reachable |
A point q is directly density-reachable from p iff q is in NEps(p) and p is a core object |
|
Density Reachable |
If p1 -> p2, p2 -> p3, ..., pn-1 -> pn are each directly density-reachable, then pn is density-reachable from p1
|
|
Density connected |
Points p and q are density-connected if there is a point o such that both p and q are density-reachable from o |
|
DBSCAN |
Density-Based Spatial Clustering of Applications with Noise; discovers clusters and labels outliers as noise 1. Arbitrarily select a point p 2. Retrieve all points density-reachable from p wrt Eps and MinPts 3. If p is a core point, a cluster is formed 4. If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database 5. Continue the process until all of the points have been processed |
|
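The five steps on this card can be sketched as a small 1-D DBSCAN. Cluster ids and the use of -1 for noise are conventions assumed here, not part of the algorithm's definition:

```python
def dbscan(points, eps, min_pts):
    """Sketch of DBSCAN on distinct 1-D points.
    Returns {point: cluster_id}; noise points get id -1."""
    labels = {}
    cluster_id = 0

    def neighbors(p):
        return [q for q in points if abs(q - p) <= eps]

    for p in points:                       # step 1: pick an unvisited point
        if p in labels:
            continue
        seeds = neighbors(p)               # step 2: density-reachable set
        if len(seeds) < min_pts:           # step 4: border/noise for now
            labels[p] = -1
            continue
        labels[p] = cluster_id             # step 3: core point starts a cluster
        queue = [q for q in seeds if q != p]
        while queue:                       # expand via density-reachability
            q = queue.pop()
            if labels.get(q, -1) == -1:    # unlabeled, or noise turned border
                labels[q] = cluster_id
                nq = neighbors(q)
                if len(nq) >= min_pts:     # q is core: keep expanding
                    queue.extend(r for r in nq if r not in labels)
        cluster_id += 1
    return labels                          # step 5: all points processed
```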
Problems of DBSCAN 1. Clusters may have different __________ 2. Clusters may be in __________ 3. Sensitive to __________ |
1. densities 2. hierarchies 3. Parameters |
|
OPTICS (__________), a cluster ordering method 1. Produces a special __________ of the database wrt its __________ 2. Good for both __________ and __________ cluster analysis 3. Can be represented __________ 4. Complexity |
Ordering Points To Identify the Clustering Structure 1. ordering, density-based clustering structure 2. automatic and interactive 3. graphically or using visualization techniques 4. O(N log N) |
|
Core distance of an object p
|
The smallest Eps value such that the Eps-neighborhood of p contains at least MinPts objects |
|
Grid-based clustering methods 1. Using ________________ data structure 2. Use dense grid cells to form clusters Several interesting methods 1. ___________ 2. ___________ |
1. multi-resolution grid Methods: 1. STING 2. CLIQUE |
|
STING: ___________ 1. The spatial area is divided into ___________ 2. There are several levels of cells corresponding to different levels of ___________ |
A Statistical Information Grid Approach 1. rectangular cells 2. resolution |
|
Pros and Cons of STING 1. ___________ independent, easy to ___________, ___________ update 2. O(_) 3. All the cluster boundaries are either ___________ or ___________, no diagonal boundary is detected |
1. query, parallelize, incremental 2. O(K), where K is the number of grid cells at the lowest level 3. horizontal or vertical |
|
CLIQUE ___________ 1. Automatically identifying subspaces of ___________ data that allow better clustering than original space 2. both ___________ based and ___________ based |
CLustering In QUEst 1. high-dimensional 2. density-based and grid-based |
|
pros and cons of CLIQUE 1. it automatically finds subspaces of the ___________ 2. it is insensitive to the ___________ 3. it scales ___________ with the size of input 4. the accuracy of the result may be ___________ due to ___________ |
1. highest dimensionality such that high-density clusters exist in those subspaces 2. order of records in input and does not presume some canonical data distribution 3. linearly 4. degraded, the simplicity of the method |
|
Index-based algorithm: identify ___________ 1. O(__), more scalable with dimensionality than ___________ 2. Building the right index is very costly |
distance-based outliers 1. O(kN^2), depth-based methods |
|
Classification classifies data based on the ________ set |
training |
|
Classification is ________ learning. The training data are accompanied by labels indicating the class of the observation. |
Supervised |
|
Clustering is ________ learning. The class labels of training data are ________ |
unsupervised unknown |
|
Major Classification Models 1. ________ 2. ________ 3. ________ 4. ________ 5. ________ |
1. Classification by decision tree induction 2. Bayesian classification 3. neural networks 4. Support Vector Machines 5. Classification based on associations |
|
Evaluating classification methods 1. ________ 2. ________ 3.________ 4.________ 5. ________ 6. ________ |
1. Predictive accuracy 2. Speed 3. Robustness 4. scalability 5. interpretability 6. goodness of rules |
|
Natural Bias in the Information Gain Measure ________ |
Favors attributes with many values |
|
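The bias can be demonstrated numerically: an ID-like attribute with a unique value per record always achieves the maximum possible gain, even though it generalizes to nothing. A sketch with made-up data:

```python
from math import log2

def entropy(labels):
    """H(labels) = -sum p(c) * log2 p(c) over the classes present."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(values, labels):
    """Gain = H(labels) - weighted H(labels | attribute value)."""
    n = len(labels)
    rem = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        rem += len(sub) / n * entropy(sub)
    return entropy(labels) - rem
```

An attribute taking values [1, 2, 3, 4] (one per record, like a record ID) drives the remainder to zero and so scores the full entropy as gain, while a genuinely uninformative two-valued attribute can score zero.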
Overfitting in classification: ________ Two approaches to avoid overfitting: 1. _______________ 2. _______________ |
Too many branches, some may reflect anomalies due to noise or outliers. 1. Prepruning: halt tree construction early 2. Postpruning: remove branches from a "fully grown" tree, yielding a sequence of progressively pruned trees |
|
Approaches to determine the final tree size 1. ________ 2. ________ 3. ________ 4. ________ |
1. Separate training and testing sets 2. Use cross-validation 3. Use all the data for training, then apply a statistical test to estimate whether expanding or pruning a node may improve the entire distribution 4. Minimum description length |
|
Minimum description length - Select the model with the ________ that ________ of ________ and ________ |
1. shortest effective description 2. that minimizes the sum of 3. the length of an effective description of the model and 4. the length of an effective description of the data when encoded with the help of the model |
|
Why Bayesian Classification
1. ___________ 2. ___________ 3. ___________ 4. ___________ |
1. Probabilistic learning: use probabilities for hypothesis
2. Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct 3. Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities 4. Standard: provides a standard of optimal decision making against which other methods can be measured |
|
Pros and Cons of Naive Bayesian Classifier Advantages: 1. ___________ 2. ___________ Disadvantages: 1. ___________ |
1. easy to implement 2. good results obtained in most cases 1. the assumption of class conditional independence, which loses accuracy |
|
Bayesian networks allow a subset of the variables to be conditionally ___________ |
independent
|
|
SVM 1. ___________ the margin is good because it implies that only ___________ matter |
1. Maximizing 2. support vectors |
|
___________ are currently among the best performers for a number of classification tasks ranging from text to genomic data |
SVM, Support Vector Machines |
|
CAR ___________ The objective of classification by association is to generate the complete set of CARs that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. |
Class association rules |
|
Contributions of CAR 1. ___________ 2. ___________ 3. ___________ |
1. New way to build accurate classifiers 2. Makes association rule mining techniques applicable to classification tasks 3. Solves a number of important problems with existing classification systems, including: the understandability problem, discovery of interesting or useful rules, disk vs. memory |
|
support = (___________/|D|) confidence = (___________/___________) |
rulesupCount rulesupCount/condsupCount |
|
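The two formulas can be checked on a toy labelled transaction set (the data below is made up; `cond <= items` is Python's subset test):

```python
def car_support_confidence(transactions, cond, cls):
    """For a class association rule cond -> cls over (items, label) pairs:
    support    = rulesupCount / |D|
    confidence = rulesupCount / condsupCount"""
    d = len(transactions)
    condsup = sum(1 for items, _ in transactions if cond <= items)
    rulesup = sum(1 for items, label in transactions
                  if cond <= items and label == cls)
    return rulesup / d, rulesup / condsup
```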
given two rules a and b, define a > b if 1. ___________ 2. ___________ 3. ___________ |
1. The confidence of a is greater than that of b 2. Same confidence but the support of a is greater than that of b 3. Both the confidences and supports are the same, but a is generated earlier than b |
|
M1 algorithm: to classify by association The algorithm satisfies two conditions: 1. ___________ 2. ___________ Cons 1. ___________ |
1. Each training case is covered by the rule with the highest precedence among the rules that can cover the case 2. Every rule in C correctly classifies at least one remaining training case when it is chosen. It is simple but inefficient, especially when the database is not resident in main memory |