
57 Cards in this Set

  • Front
  • Back

Pattern Space Pruning constraints

1. Anti-monotonic


2. Monotonic


3. Succinct


4. Convertible

Data Space Pruning constraints

1. Data succinct


2. Data anti-monotonic

Pattern space


1. Anti-Monotonicity

A constraint C is anti-monotone if, whenever a pattern satisfies C, all of its sub-patterns do so too. Equivalently: if an itemset S violates the constraint, so does any of its supersets.




Examples: sum(S) <= v, range(S) <= x, support(S) >= min_sup
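A minimal Python sketch (not part of the deck; the item prices and the bound v are hypothetical) of how an anti-monotone constraint such as sum(S) <= v prunes an Apriori-style level-wise search: a violating itemset is never extended, because all of its supersets must violate too.

```python
prices = {"a": 10, "b": 20, "c": 30, "d": 40}   # hypothetical item values
v = 50                                           # anti-monotone constraint: sum(S) <= v

def satisfies(itemset):
    return sum(prices[i] for i in itemset) <= v

# Level-wise enumeration: only itemsets that still satisfy the constraint survive.
frontier = [frozenset([i]) for i in prices if satisfies([i])]
level = 1
while frontier:
    print(level, sorted(tuple(sorted(s)) for s in frontier))
    # If an itemset violates sum(S) <= v, every superset does too,
    # so violating itemsets are never extended.
    candidates = {s | {i} for s in frontier for i in prices if i not in s}
    frontier = [s for s in candidates if satisfies(s)]
    level += 1
```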

Pattern space


2. Monotonicity

If an itemset S satisfies the constraint, so does any of its supersets.




Example: sum >= v, min <= v, range >= x

Pattern Space


3. Succinctness

Given A, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A, i.e., S contains a subset belonging to A.



Example: min <= v


Data Space


1. Anti-monotonicity

A constraint c is data anti-monotone if, for a pattern p, whenever c cannot be satisfied by a transaction t in the p-projected database, it cannot be satisfied by t's projection on any superset of p either.




Example: sum >= v, min <= v, range >= x

Convertible constraints


Example: avg(x) >= 25, with items ordered in value-descending order: ___________; ascending order: ___________

Convert constraints into anti-monotone or monotone by properly ordering items


Descending order: anti-monotone


Ascending order: monotone

Strongly convertible

the constraint can be converted to both anti-monotone and monotone, depending on the item ordering

Types of data in clustering

1. Interval-scaled variables


2. Binary variables


3. Nominal, ordinal, and ratio variables


4. Variables of mixed types

Categories of clustering approaches

1. Partitioning algorithms


2. Hierarchy algorithms


3. Density-based methods


4. Grid-based methods


5. Model-based methods

Partitioning Algorithms heuristic methods: ______ and __________

k-means and k-medoids

Pros and Cons of K-means


1. Efficiency? O(__)


2. Terminate at a ___________


3. Applicable only when ___________


4. Need to specify the ___________


5. unable to handle ___________


6. unsuitable to discover ___________

1. Relatively efficient: O(tkn), where n is the # of objects, k is the # of clusters, and t is the # of iterations; normally k, t << n


2. Local optimum


3. Mean is defined


4. number of clusters


5. noisy data and outliers


6. non-convex clusters
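A minimal k-means sketch (assumed, not from the deck) that makes the listed properties concrete: each iteration costs O(kn), the loop stops at a local optimum, and a mean must be defined for the data.

```python
import random

def kmeans(points, k, max_iters=100):
    centers = random.sample(points, k)              # arbitrary initial centers
    for _ in range(max_iters):                      # t iterations -> O(tkn) total
        clusters = [[] for _ in range(k)]
        for p in points:                            # n points x k centers per pass
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Recompute each center as the cluster mean (a mean must be defined).
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[j]
               for j, c in enumerate(clusters)]
        if new == centers:                          # converged: a local optimum only
            break
        centers = new
    return centers, clusters

# e.g. kmeans([(0, 0), (0, 1), (5, 5), (6, 5)], k=2)
```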

K-medoid methods: PAM ___________


1. ___________ choose k objects as the initial medoids


2. until no change, do

Partitioning around Medoids


arbitrarily


reassign each object to the cluster of its nearest medoid


randomly select a non-medoid object o', compute the total cost S of swapping medoid o with o'


if S < 0, then swap o with o' to form the new set of k medoids
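A sketch (assumed) of PAM's swap step; for clarity it checks every medoid/non-medoid pair rather than sampling one at random, but the acceptance rule is the same: apply the swap whenever its cost change S is negative.

```python
def total_cost(points, medoids, dist):
    # Cost = sum of each point's distance to its nearest medoid.
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_swap(points, medoids, dist):
    base = total_cost(points, medoids, dist)
    for o in medoids:                          # current medoid o
        for o2 in points:                      # candidate non-medoid o'
            if o2 in medoids:
                continue
            trial = [o2 if m == o else m for m in medoids]
            S = total_cost(points, trial, dist) - base   # cost change of the swap
            if S < 0:                          # improvement: accept the new medoid set
                return trial, True
    return medoids, False                      # no improving swap: converged
```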

Pros and Cons of PAM


1. PAM is ____ robust than k-means in the presence of noise and outliers


2. PAM is efficient for ___________ data sets but does not scale well to ___________ data sets

1. more


2. small, large

CLARA ( ___________ )


1. Draw multiple ___________ of the data set and apply ___________ on each sample


2. performs ___________ than PAM on ___________ data sets


3. Efficiency depends on the ___________

Clustering LARge Applications


1. samples, PAM


2. better, large


3. sample size

AGNES (___________)


1. initially, each object is a ___________


2. step by step cluster ___________, until all objects form a ___________


3. ___________ approach


4. the similarity between two clusters is measured by the similarity of the ___________

Agglomerative Nesting


1. cluster


2. merging, cluster


3. Single-link approach


4. closest pair of data points belonging to different clusters
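A small sketch (assumed) of AGNES with the single-link measure: the similarity of two clusters is the distance of the closest pair of points drawn from different clusters, and clusters are merged step by step until one remains.

```python
def single_link(c1, c2, dist):
    # Distance of the closest pair of points belonging to different clusters.
    return min(dist(p, q) for p in c1 for q in c2)

def agnes(points, dist):
    clusters = [[p] for p in points]          # initially, each object is a cluster
    merges = []
    while len(clusters) > 1:                  # merge until all objects form one cluster
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]], dist))
        merges.append((list(clusters[i]), list(clusters[j])))
        clusters[i] += clusters[j]
        del clusters[j]
    return merges                             # the merge history (a dendrogram)
```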

DIANA (___________)


1. initially ___________


2. step by step ___________ until ___________

DIvisive ANAlysis


1. all objects are in one cluster


2. splitting clusters, until each cluster contains only one object

Challenges of Hierarchical clustering methods


1. Hard to choose ___________


2. Do not ___________ well, O(__)


3. Data can't fit in memory


4. Integrating other techniques _________, _________, _________, _________

1. merge/split points


2. scale (n^2)


4. BIRCH, CURE, CHAMELEON, ROCK

BIRCH ___________


1. ___________ Tree, a hierarchical data structure summarizing object info

Balanced Iterative Reducing and Clustering using Hierarchies


1. CF (Clustering Feature)

CF components


N


LS


SS

number of data points


sum of x


sum of x^2
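A sketch (assumed, 1-D data for brevity) of a BIRCH clustering feature. The triple (N, LS, SS) is additive, and a cluster's centroid and radius can be computed from it alone, which is what lets the CF Tree summarize objects compactly.

```python
class CF:
    """BIRCH clustering feature: N, LS = sum of x, SS = sum of x^2."""
    def __init__(self, n=0, ls=0.0, ss=0.0):
        self.n, self.ls, self.ss = n, ls, ss

    def add(self, x):                      # absorb one data point
        self.n += 1; self.ls += x; self.ss += x * x

    def merge(self, other):                # CF additivity: CF1 + CF2
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):                      # sqrt(SS/N - (LS/N)^2)
        return (self.ss / self.n - (self.ls / self.n) ** 2) ** 0.5
```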

Pros and Cons of BIRCH


1. ___________ scalability


2. can handle only ___________ data


3. sensitive to the ___________

Linear scalability, single scan


numeric data


order of the data records

Drawbacks of square error based methods

1. one representative per cluster


good only for convex-shaped clusters of similar size and density


2. the number-of-clusters parameter k


good only if k can be reasonably estimated

Drawbacks of distance based methods

1. hard to find cluster with ___________


2. hard to specify the ___________


3. Heuristic: a cluster must be ___________


1. irregular shapes

2. number of clusters


3. dense



Eps: ___________


MinPts: ___________


NEps(p): ___________

Eps: Maximum radius of the neighborhood


MinPts: Minimum number of points in an Eps-neighborhood of that point


NEps(p): {q | dist(p,q) <= Eps}

Core Object p

p is a core object iff |NEps(p)| >= MinPts

Directly Density Reachable

point q is directly density-reachable from p iff q is in NEps(p) and p is a core object

Density Reachable

If p1 -> p2, p2 -> p3, ..., pn-1 -> pn are each directly density-reachable steps, then pn is density-reachable from p1


Density connected

If points p and q are both density-reachable from some object o, then p and q are density-connected

DBSCAN

finds density-based clusters; points left in no cluster are outliers (noise)


1. Arbitrarily select a point p


2. retrieve all points density-reachable from p wrt Eps and MinPts


3. If p is a core point, a cluster is formed


4. if p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database


5. Continue the process until all of the points have been processed
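A compact sketch (assumed) implementing the steps above; it uses brute-force O(n^2) neighborhood queries for clarity rather than a spatial index.

```python
def region_query(points, p, eps, dist):
    # N_Eps(p): indexes of all points within distance eps of point p.
    return [q for q in range(len(points)) if dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts, dist):
    labels, cid = {}, 0                       # point index -> cluster id (-1 = noise)
    for p in range(len(points)):              # 1. select a point p
        if p in labels:
            continue
        neighbors = region_query(points, p, eps, dist)
        if len(neighbors) < min_pts:          # 4. border point or noise: move on
            labels[p] = -1
            continue
        cid += 1                              # 3. p is a core point: start a cluster
        labels[p] = cid
        seeds = [q for q in neighbors if q != p]
        while seeds:                          # 2. collect all density-reachable points
            q = seeds.pop()
            if labels.get(q) == -1:
                labels[q] = cid               # noise turns out to be a border point
            if q in labels:
                continue
            labels[q] = cid
            q_nb = region_query(points, q, eps, dist)
            if len(q_nb) >= min_pts:          # q is also a core point: keep expanding
                seeds.extend(q_nb)
    return labels                             # 5. all points processed
```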

Problems of DBSCAN


1. Clusters may have different __________


2. Clusters may be in __________


3. Sensitive to __________

1. densities


2. hierarchies


3. Parameters

OPTICS (__________), a cluster ordering method


1. Produces a special __________ of the database wrt its __________


2. Good for both __________ and __________ cluster analysis


3. Can be represented __________


4. Complexity

Ordering Points To Identify the Clustering Structure


1. density based clustering structure


2. automatic and interactive cluster analysis


3. graphically or using visualization techniques


4. O(N*logN)

Core distance of an object p

The smallest radius eps' such that the eps'-neighborhood of p contains at least MinPts objects
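A small sketch (assumed) of the core-distance: the distance from p to its MinPts-th nearest neighbor, defined only if p is a core object within Eps.

```python
def core_distance(points, p, eps, min_pts, dist):
    d = sorted(dist(points[p], q) for q in points)   # includes p itself at distance 0
    if len(d) >= min_pts and d[min_pts - 1] <= eps:  # p is a core object wrt Eps
        return d[min_pts - 1]                        # MinPts-th smallest distance
    return None                                      # undefined otherwise
```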

Grid-based clustering methods


1. Using ________________ data structure


2. Use dense grid cells to form clusters




Several interesting methods


1. ___________


2. ___________

1. multi-resolution grid


2. dense grid cells




1. STING


2. CLIQUE

STING: ___________


1. The spatial area is divided into ___________


2. There are several levels of cells corresponding to different levels of ___________

A Statistical Information Grid Approach


rectangular cells


resolution

Pros and Cons of STING


1. ___________ independent, easy to ___________, ___________ update


2. O(_)


3. All the cluster boundaries are either ___________ or ___________, no diagonal boundary is detected

1. Query, parallelize, incremental


2. K, where K is the number of grid cells at the lowest level


3. horizontal or vertical

CLIQUE ___________


1. Automatically identifying subspaces of ___________ data that allow better clustering than original space


2. both ___________ based and ___________ based

CLustering In QUEst


1. high dimensional data space


2. density based and grid based

Pros and Cons of CLIQUE


1. it automatically finds subspaces of the ___________


2. it is insensitive to the ___________


3. it scales ___________ with the size of input


4. the accuracy of the result may be ___________ due to ___________

1. highest dimensionality such that the high density clusters exist in those subspaces


2. order of input records, and does not presume any canonical data distribution


3. linearly


4. degraded, simplicity

Index-based algorithm: identify ___________


1. O(__), more scalable with dimensionality than ___________


2. Building the right index is very costly

O(kN^2)


depth based

Classification classifies data based on the ________ set

training

Classification is ________ learning. The training data are accompanied by labels indicating the class of the observation.

Supervised

Clustering is ________ learning. The class labels of training data are ________

unsupervised


unknown

Major Classification Models


1. ________


2. ________


3. ________


4. ________


5. ________

1. Classification by decision tree induction


2. Bayesian classification


3. Neural networks


4. Support Vector Machines


5. Classification based on associations

Evaluating classification methods


1. ________


2. ________


3. ________


4. ________


5. ________


6. ________

1. Predictive accuracy


2. Speed


3. Robustness


4. Scalability


5. Interpretability


6. Goodness of rules

Natural Bias in the Information Gain Measure


________

Favors attributes with many values
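A worked sketch (assumed toy data) of this bias: an ID-like attribute with a unique value per record achieves the maximum possible gain, equal to the full class entropy, even though it generalizes terribly.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    n = len(rows)
    remainder = sum(
        cnt / n * entropy([l for r, l in zip(rows, labels) if r[attr] == v])
        for v, cnt in Counter(r[attr] for r in rows).items())
    return entropy(labels) - remainder

rows = [{"id": i, "outlook": o} for i, o in enumerate("ssrrsr")]
labels = ["yes", "no", "yes", "yes", "no", "yes"]
print(info_gain(rows, "outlook", labels))  # ~0.46: a moderate gain
print(info_gain(rows, "id", labels))       # ~0.92 = entropy(labels): maximal gain
```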

Overfitting in classification: ________


Two approaches to avoid overfitting:


1. _______________


2. _______________

Too many branches, some may reflect anomalies due to noise or outliers.


1. Prepruning: Halt tree construction early


2. Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees

Approaches to determine the final tree size


1. ________


2. ________


3. ________


4. ________

1. Separate training and testing sets


2. Use cross validation


3. use all the data for training, then apply a statistical test to estimate whether expanding or pruning a node may improve the entire distribution


4. Minimum description length

Minimum description length


- Select the model with the ________ that ________ of ________ and ________

1. shortest effective description


2. minimizes the sum of


3. length of an effective description of the model and


4. the length of an effective description of the data when encoded with the help of the model

Why Bayesian Classification

1. ___________


2. ___________


3. ___________


4. ___________



1. Probabilistic learning: use probabilities for hypotheses

2. Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct


3. Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities


4. Standard: even when computationally intractable, Bayesian methods can provide a standard of optimal decision making against which other methods can be measured


Pros and Cons of Naive Bayesian Classifier


Advantages:


1. ___________


2. ___________


Disadvantages:


1. ___________



1. Easy to implement


2. Good results obtained




1. The class-conditional independence assumption often does not hold in practice, which loses accuracy
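A minimal categorical naive Bayes sketch (assumed): it scores P(C) * prod_i P(x_i | C), which is exactly where the class-conditional independence assumption enters; a crude add-one smoothing guards against unseen values.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    prior = Counter(labels)                        # class counts -> P(C)
    cond = defaultdict(Counter)                    # (class, attr) -> value counts
    for row, c in zip(rows, labels):
        for attr, val in row.items():
            cond[(c, attr)][val] += 1
    return prior, cond

def predict(row, prior, cond, n):
    def score(c):                                  # P(C) * prod P(x_i | C)
        p = prior[c] / n
        for attr, val in row.items():              # add-one smoothing (crude)
            p *= (cond[(c, attr)][val] + 1) / (prior[c] + len(cond[(c, attr)]) + 1)
        return p
    return max(prior, key=score)

rows = [{"outlook": "s"}, {"outlook": "s"}, {"outlook": "r"}]
labels = ["no", "no", "yes"]
print(predict({"outlook": "r"}, *train(rows, labels), n=len(rows)))  # -> "yes"
```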

Bayesian networks


Allows a subset of the variables conditionally ___________

independent


SVM


1. ___________ the margin is good because it implies that only ___________ matter

Maximizing


Support vectors

___________ are currently among the best performers for a number of classification tasks ranging from text to genomic data

SVM, Support Vector Machines

CAR ___________


The objective of classification by association is to generate the complete set of CARs that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints.

Class association rules

Contributions of CAR


1. ___________


2. ___________


3. ___________

1. New way to build accurate classifiers


2. makes association rule mining techniques applicable to classification tasks


3. Solves a number of important problems with existing classification systems, including:


understandability problem


discovery of interesting or useful rules


disk vs memory

Support = (___________ / |D|)


Confidence = (___________ / ___________)

rulesupCount


rulesupCount/condsupCount
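A sketch (assumed representation: each transaction is an itemset plus a class label) computing both quantities for a CAR cond -> cls directly from their definitions.

```python
def support_confidence(D, cond, cls):
    condsup = sum(1 for items, label in D if cond <= items)   # condsupCount
    rulesup = sum(1 for items, label in D
                  if cond <= items and label == cls)          # rulesupCount
    support = rulesup / len(D)                                # rulesupCount / |D|
    confidence = rulesup / condsup if condsup else 0.0        # rulesup / condsup
    return support, confidence

D = [({"a", "b"}, "pos"), ({"a"}, "neg"), ({"a", "b"}, "pos")]  # toy database
print(support_confidence(D, {"a", "b"}, "pos"))                 # -> (0.666..., 1.0)
```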

Given two rules a and b, define a > b if


1. ___________


2. ___________


3. ___________

1. The confidence of a is greater than that of b


2. Same confidence but the support of a is greater than that of b


3. Both the confidences and supports are the same, but a is generated earlier than b
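This total order can be written directly as a sort key; a sketch (assumed tuple representation) where each rule carries its confidence, support, and generation index.

```python
def rank_rules(rules):
    # rules: list of (confidence, support, generation_index) tuples.
    # Higher confidence first; ties -> higher support; ties -> generated earlier.
    return sorted(rules, key=lambda r: (-r[0], -r[1], r[2]))

rules = [(0.9, 0.1, 2), (0.9, 0.2, 3), (0.8, 0.5, 1)]
print(rank_rules(rules))  # [(0.9, 0.2, 3), (0.9, 0.1, 2), (0.8, 0.5, 1)]
```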

M1 algorithm: to classify by association


The algorithm satisfies two conditions:


1. ___________


2. ___________




Cons


1. ___________

1. Each training case is covered by the rule with the highest precedence among the rules that can cover the case.


2. Every rule in C correctly classifies at least one remaining training case when it is chosen




It's simple but inefficient, especially when the database is not resident in main memory