42 Cards in this Set

  • Front
  • Back
What are the three types of dependence when Dependence(X, Y) is equal to, greater than, or less than 1?
If Dependence(X,Y) = 1, X and Y are independent.
If Dependence(X,Y) > 1, Y is positively dependent on X.
If Dependence(X,Y) < 1, Y is negatively dependent on X (−Y is positively dependent on X).
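A minimal sketch of how this dependence measure could be estimated from co-occurrence counts, assuming the usual definition Dependence(X, Y) = P(X ∧ Y) / (P(X) · P(Y)) (equivalently P(Y | X) / P(Y)); the function and variable names are illustrative:

```python
def dependence(n_total, n_x, n_y, n_xy):
    """Dependence(X, Y) = P(X and Y) / (P(X) * P(Y)), estimated from counts."""
    p_x, p_y, p_xy = n_x / n_total, n_y / n_total, n_xy / n_total
    return p_xy / (p_x * p_y)

# Example: in 1000 transactions X appears 200 times, Y 300 times, both together 90 times.
d = dependence(1000, 200, 300, 90)   # 0.09 / (0.2 * 0.3) = 1.5 -> Y positively dependent on X
```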
Give an example of a rule exception, or surprising pattern.
While bird(x) => flies(x), an exception is:
bird(x), penguin(x) => ¬flies(x)
What does CPIR(X|Y) tell us?
When CPIR(X|Y) = 0, X and Y are independent.
When it is 1, they are perfectly correlated.
When it is -1, they are perfectly negatively correlated.
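For reference, one common formulation of the conditional-probability increment ratio (a sketch; sign conventions vary slightly between papers):

```latex
\mathrm{CPIR}(X \mid Y) =
\begin{cases}
\dfrac{P(X \mid Y) - P(X)}{1 - P(X)} & \text{if } P(X \mid Y) \ge P(X),\\[2ex]
\dfrac{P(X \mid Y) - P(X)}{P(X)} & \text{otherwise,}
\end{cases}
```

so the ratio is 0 when X and Y are independent, 1 when P(X | Y) = 1, and −1 when P(X | Y) = 0.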
What are Bayesian Networks?
A graphical model that encodes probabilistic relationships among variables of interest.
Compare the Bayesian and classical approaches to probability (any one point)
The Bayesian approach asks for P(H | D), the probability of the hypothesis given the observed data; the classical approach instead asks how probable the data are under a fixed hypothesis, P(D | H).
Mention at least 1 Advantage of Bayesian Networks
Combine domain knowledge and data
What are two major reasons merging large databases becomes a difficult problem?
The databases are heterogeneous
The identifiers or strings differ in how they are represented within each DB
What are the three main steps of the Sorted Neighborhood Method?
Creation of key(s)
Sorting records on this key
Merge/Purge records
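A compact sketch of those three steps, assuming toy record fields and a fixed sliding-window size; the key construction and match rule here are illustrative placeholders, not the exact rules from any particular paper:

```python
def make_key(rec):
    # Step 1: build a sort key from selected fields (illustrative choice of fields).
    return (rec["last"][:4] + rec["first"][:1] + rec["zip"]).upper()

def sorted_neighborhood(records, window=5, is_match=lambda a, b: make_key(a) == make_key(b)):
    # Step 2: sort the records on the key.
    records = sorted(records, key=make_key)
    # Step 3: merge/purge -- compare each record only with the window-1 records before it.
    pairs = []
    for i, rec in enumerate(records):
        for other in records[max(0, i - window + 1):i]:
            if is_match(other, rec):
                pairs.append((other, rec))
    return pairs
```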
What is edit distance used for?
A method used to quantify how dissimilar two strings are, by counting the minimum number of edit operations (insertions, deletions, and substitutions) required to transform one string into the other.
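A minimal dynamic-programming sketch of (Levenshtein) edit distance with unit-cost insert, delete, and substitute operations:

```python
def edit_distance(a: str, b: str) -> int:
    # prev[j] holds the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```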
Name four challenges that modern algorithms have to overcome today.
Use only a fixed amount of main memory.
Incorporate new examples as they become available
Operate continuously and indefinitely
Never lose potentially valuable information
List the input requirements of the HT-Algorithm, and state what output is generated.
Inputs:
S : sequence of examples
X : set of discrete attributes
G(.) : split evaluation function
δ : desired probability of choosing the wrong attribute at any given node
Output:
HT : A decision tree (Hoeffding Tree)
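The quantity the algorithm uses to decide, with confidence 1 − δ, whether the best split attribute can already be chosen after n examples is the Hoeffding bound; a small sketch, where R is the range of the split evaluation function G:

```python
import math

def hoeffding_bound(R: float, delta: float, n: int) -> float:
    """epsilon such that the true mean of G differs from the observed mean by at most
    epsilon with probability 1 - delta, after n independent observations."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

# If the observed G of the best attribute exceeds the runner-up by more than epsilon,
# split the leaf on the best attribute.
eps = hoeffding_bound(R=math.log2(2), delta=1e-7, n=5000)
```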
How is memory management handled differently in a VFDT than a Hoeffding Tree?
VFDT’s memory use is dominated by the memory required to keep counts for all growing leaves. When available memory is exhausted, VFDT temporarily deactivates its least promising leaves (reactivating them later if they become more promising), a memory-management mechanism the basic Hoeffding Tree algorithm does not include.
What are the components of a FP-tree?
One root
A set of item prefix subtrees as the children of the root
A frequent-item header table
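An illustrative sketch of those three components as plain Python classes (the names are hypothetical, not taken from the FP-growth paper's pseudocode):

```python
class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item          # None for the root
        self.count = 0            # frequency count along this prefix path
        self.parent = parent
        self.children = {}        # item -> FPNode (the item-prefix subtrees)
        self.node_link = None     # next node in the tree carrying the same item

class FPTree:
    def __init__(self):
        self.root = FPNode()      # 1) one root
        self.header = {}          # 3) frequent-item header table: item -> first node in its chain

    def insert(self, sorted_frequent_items):   # 2) grows the item-prefix subtrees
        node = self.root
        for item in sorted_frequent_items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                child.node_link = self.header.get(item)
                self.header[item] = child
            child.count += 1
            node = child
```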
How do you calculate the frequent patterns containing a_i in a path P?
Only consider the prefix sub-path of node a_i in P.
The frequency count of every node in that sub-path is the same as that of node a_i.
Find all the combinations.
Compare the efficiency of the mining operations in FP-growth with those in Apriori.
FP-growth's mining operations consist mainly of:
prefix count adjustment
counting
pattern fragment concatenation
These are much less costly than Apriori's:
generating a very large number of candidate patterns
testing each of them
What are the two major fraud detection categories, how do they differ, and under which does DC-1 fall?
Pre-Call Methods
Involve validating the phone or its user when a call is placed.
Post-Call Methods – DC-1 falls here
Analyze call data on each account to determine whether cloning fraud has occurred.
Why do fraud detection methods need to be adaptive?
Bandits change their behavior, so patterns of fraud are dynamic
Levels of fraud vary month to month
The cost of missing fraud or of handling false alarms changes between inter-carrier contracts
What are the two steps of profiling monitors, and what are the two main monitor templates?
Profiling Step: measure an account's normal activity and save summary statistics
Use Step: process usage for an account-day to produce a numerical output describing how abnormal the activity was on that account-day
Threshold and Standard Deviation monitors.
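A toy sketch of the two monitor templates, assuming an account's profile is just summary statistics of its daily minutes of use; the field names and scoring rules are illustrative:

```python
import statistics

def profile(daily_usage_minutes):
    # Profiling step: summarize an account's normal activity over its profiling period.
    return {"mean": statistics.mean(daily_usage_minutes),
            "std": statistics.pstdev(daily_usage_minutes),
            "max": max(daily_usage_minutes)}

def threshold_monitor(prof, todays_minutes):
    # Use step: 1 if today's usage exceeds the account's historical maximum, else 0.
    return 1 if todays_minutes > prof["max"] else 0

def std_dev_monitor(prof, todays_minutes):
    # Use step: how many standard deviations today's usage lies above the account's mean.
    if prof["std"] == 0:
        return 0.0
    return max(0.0, (todays_minutes - prof["mean"]) / prof["std"])
```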
Disadvantages and problems of PageRank?
Rank Sinks: Occur when pages get caught in infinite link cycles.

Spider Traps: A group of pages is a spider trap if there are no links from within the group to outside the group.

Dangling Links: A page contains a dangling link if the hypertext points to a page with no outgoing links.

Dead Ends: are simply pages with no outgoing links.
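A short power-iteration sketch of PageRank with a damping factor, the standard remedy for rank sinks and spider traps; dangling pages and dead ends are handled here by spreading their rank uniformly. The graph representation is illustrative:

```python
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}        # random-jump term escapes sinks/traps
        for p, outs in links.items():
            if not outs:                               # dead end / dangling page:
                for q in pages:                        # distribute its rank uniformly
                    new[q] += d * rank[p] / n
            else:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

print(pagerank({"A": ["B"], "B": ["A", "C"], "C": []}))
```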
What Makes Ranking Optimization Hard?
Link Spamming

Keyword Spamming

Page hijacking and URL redirection

Intentionally inaccurate or misleading anchor text

Accurately targeting people's expectations
Specify the difference between paired t-test and simple binomial test in comparing two algorithms.
Paired t-test:
determines whether a significant difference between the two algorithms exists

Binomial test:
compares the percentage of times ‘algorithm A > algorithm B’ versus ‘A < B’, throwing out the ties
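A small illustration of the two tests on per-trial accuracy scores for two algorithms, using SciPy (≥ 1.7 for binomtest); the data here is made up:

```python
from scipy import stats

a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.79]  # algorithm A
b = [0.78, 0.80, 0.81, 0.77, 0.80, 0.76, 0.79, 0.82, 0.78, 0.77]  # algorithm B

# Paired t-test: is the mean paired difference between A and B significantly non-zero?
t_stat, p_paired = stats.ttest_rel(a, b)

# Binomial (sign) test: out of the non-tied trials, how often is A > B?
wins = sum(x > y for x, y in zip(a, b))
non_ties = sum(x != y for x, y in zip(a, b))
p_binom = stats.binomtest(wins, non_ties, p=0.5).pvalue
```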
Why should we apply Bonferroni Adjustment to comparing classifiers?
With multiple tests, a multiplicity effect occurs if we use the same significance level for each individual test as for the whole family of tests. The Bonferroni adjustment therefore imposes a more stringent significance level on each individual experiment.
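Concretely, with m tests and a desired family-wise level α, the per-test level becomes:

```latex
\alpha_{\text{per test}} = \frac{\alpha}{m},
\qquad \text{e.g. } \alpha = 0.05,\ m = 10 \ \Rightarrow\ \alpha_{\text{per test}} = 0.005 .
```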
Why is finding robust rules significant?
Real world databases tend to be dynamic

Changing information – updates and deletions, rather than just additions – could invalidate current rules

Continually checking and updating rules may incur high maintenance costs, especially for large databases

Robustness measures how likely the knowledge found will be consistent after changes to the database
Compare and Contrast Robustness estimations with support and predictive accuracy
Robustness is the probability that an entire database state is consistent with a rule, while support is the probability that a specific data instance satisfies a rule.

Predictive Accuracy is the probability that knowledge is consistent with randomly selected unseen data. This is significant in closed world databases, where data tends to be dynamic.
Why are transaction classes mutually exclusive when calculating robustness?
Transaction classes are mutually exclusive so that no transaction class covers another because for any two classes of transactions t_a and t_b, if t_a covers t_b, then Pr(t_a⋀t_b)=Pr(t_a) and it is redundant to consider t_b.
Compare and contrast FUP2 and DELI
Both algorithms are used in Association Analysis;
Goal: DELI decides when to update the association rules while FUP2 provides an efficient way of updating them;
Technique: DELI scans a small portion of the database (sample) and approximates the large itemsets whereas FUP2 scans the whole database and returns the large itemsets exactly;
DELI saves machine resources and time.
What does DELI stand for?
Difference Estimation for Large Itemsets
Difference between Apriori and FUP2
Apriori scans the whole database to find association rules, and does not use old data mining results;
For most itemsets, FUP2 scans only the updated part of the database and takes advantage of the old association analysis results.
Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both of these are used in the Apriori algorithms studied here, because the algorithms are looking for different sequential patterns made up of association rules.
What is the major difference between the two algorithms CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support, and careless with respect to maximality. (The minimum support is checked for each sequence on each run, but maximal sequences must be checked for later.)

CountSome (AprioriSome) is careful with respect to maximality, but careless with respect to minimum support. (Non-maximal sequences are pruned out during runtime, but the minimum support is not tested at all values of k.)
Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation allows each record to be looked up in constant time, reducing the run time.
What is the main difference between polynomial, radial basis function, and neural network learning machines? State that difference for the neural network learning machine.
The kernel function. For the (two-layer) neural network learning machine it is the sigmoid kernel:
K(x, x_i) = S[v(x · x_i) + c]
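A sketch of the three kernels, making explicit that the kernel function is the only piece that changes; gamma, degree, v, and c are illustrative hyperparameters, and tanh stands in for the sigmoid S:

```python
import numpy as np

def polynomial_kernel(x, xi, degree=3, c=1.0):
    return (np.dot(x, xi) + c) ** degree

def rbf_kernel(x, xi, gamma=0.5):
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xi)) ** 2))

def neural_network_kernel(x, xi, v=1.0, c=0.0):
    # K(x, xi) = S[v (x . xi) + c], with S a sigmoid (tanh here)
    return np.tanh(v * np.dot(x, xi) + c)
```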
What is empirical data modeling? Give a summary of the main concept and its components
Empirical data modeling is the induction of observations to build up a model. Then the model is used to deduce responses of an unobserved system.
What must the empirical risk R_emp(α) do over the set of loss functions?
It must converge to the actual risk R(α).
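For reference, the two risks in the usual empirical-risk-minimization notation (a sketch, with L the loss function and F(x, y) the unknown joint distribution):

```latex
R_{\mathrm{emp}}(\alpha) = \frac{1}{\ell}\sum_{i=1}^{\ell} L\bigl(y_i, f(x_i, \alpha)\bigr)
\quad\longrightarrow\quad
R(\alpha) = \int L\bigl(y, f(x, \alpha)\bigr)\, dF(x, y).
```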
What are the two aspects of Text Mining when applied to customer complaints?
Knowledge Discovery: Discovering a common customer complaint in a large collection of documents containing customer feedback.
Information Distillation: Filtering future complaints into pre-defined categories
How does the procedure for text mining differ from the procedure for data mining?
Adds feature extraction phase
Infeasible for humans to select features manually
The feature vectors are, in general, highly dimensional and sparse
What are some examples of unstructured textual collections used in Text Mining?
Customer letters
Email correspondence
Phone transcripts
Technical documentation
Patents
Why is the frequency of subgraphs a good function to evaluate candidate patterns? How could it be better?
Subgraph frequency is anti-monotone (monotonically non-increasing as the pattern grows), meaning supergraphs are never more frequent than their subgraphs. This is a desirable property because, combined with a minimum support threshold, it prunes the search space as subgraph patterns get bigger.
However, frequency does not always imply significance – another metric must be used to evaluate the candidates generated by a graph miner for significance.
How is a string representation of a tree useful in graph mining? What requirements does it place on the graph?
A string representation of a tree is useful because string comparisons are worst-case O(n) and can be easily optimized. However, it requires that a tree be rooted and ordered, because otherwise the string comparison operator would not be valid.
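An illustrative sketch of one such string encoding for a rooted, ordered tree: a preorder walk that emits each node's label and a special backtrack symbol when returning to the parent. The exact encoding details vary between tree-mining papers:

```python
def encode(label, children):
    """Return the preorder string encoding of a rooted, ordered tree.
    `children` is an ordered list of (label, children) subtrees."""
    parts = [str(label)]
    for child_label, child_children in children:
        parts.append(encode(child_label, child_children))
        parts.append("-1")          # backtrack marker: done with this subtree
    return " ".join(parts)

#        A
#       / \
#      B   C
#      |
#      D
tree = ("A", [("B", [("D", [])]), ("C", [])])
print(encode(*tree))   # "A B D -1 -1 C -1"
```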
Of the following Web mining paradigms:
Information Retrieval
Information Extraction
Which does a traditional Web search engine (google.com, bing.com, etc.) attempt to accomplish? Briefly support your answer.
Information Retrieval: the search engine attempts to provide a list of documents ranked by their relevance to the search query.
State one common problem hampering accurate Web usage mining. Briefly support your answer.
Users connecting to a Web site through a proxy server,
Users (or their ISPs) relying on Web data caching,
will result in decreased server-log accuracy. Accurate server logs are required for accurate Web usage mining.
What is the phrase associated with the most popular method for Web content mining algorithms to generate document features from unstructured documents?
"Bag of Words" representation