42 Cards in this Set

  • Front
  • Back
What are the three types of dependence when Dependence(X, Y) is equal to, greater than, or less than 1?
If Dependence(X,Y) = 1, X and Y are independent.
If Dependence(X,Y) > 1, Y is positively dependent on X.
If Dependence(X,Y) < 1, Y is negatively dependent on X (−Y is positively dependent on X).
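A minimal sketch of how this dependence measure could be estimated from co-occurrence counts, assuming the usual definition Dependence(X, Y) = P(X ∧ Y) / (P(X) · P(Y)) (equivalently P(Y | X) / P(Y)); the function and variable names are illustrative:

```python
def dependence(n_total, n_x, n_y, n_xy):
    """Dependence(X, Y) = P(X and Y) / (P(X) * P(Y)), estimated from counts."""
    p_x, p_y, p_xy = n_x / n_total, n_y / n_total, n_xy / n_total
    return p_xy / (p_x * p_y)

# Example: in 1000 transactions X appears 200 times, Y 300 times, both together 90 times.
d = dependence(1000, 200, 300, 90)   # 0.09 / (0.2 * 0.3) = 1.5 -> Y positively dependent on X
```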
Give an example of a rule exception, or surprising pattern.
While bird(x) => flies(x), an exception is:
bird(x), penguin(x) => ¬flies(x)
What does CPIR(X|Y) tell us?
When CPIR(X|Y) = 0, X and Y are independent.
When it is 1, they are perfectly correlated.
When it is -1, they are perfectly negatively correlated.
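For reference, one common formulation of the conditional-probability increment ratio (a sketch; sign conventions vary slightly between papers):

```latex
\mathrm{CPIR}(X \mid Y) =
\begin{cases}
\dfrac{P(X \mid Y) - P(X)}{1 - P(X)} & \text{if } P(X \mid Y) \ge P(X),\\[2ex]
\dfrac{P(X \mid Y) - P(X)}{P(X)} & \text{otherwise,}
\end{cases}
```

so the ratio is 0 when X and Y are independent, 1 when P(X | Y) = 1, and −1 when P(X | Y) = 0.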
What are Bayesian Networks?
A graphical model that encodes probabilistic relationships among variables of interest.
Compare the Bayesian and classical approaches to probability (any one point)
The Bayesian approach asks for P(H | D), the probability of the hypothesis given the observed data; the classical approach instead asks how probable the data are under a fixed hypothesis, P(D | H).
Mention at least 1 Advantage of Bayesian Networks
Combine domain knowledge and data
What are two major reasons merging large databases becomes a difficult problem?
The databases are heterogeneous
The identifiers or strings differ in how they are represented within each DB
What are the three main steps of the Sorted Neighborhood Method?
Creation of key(s)
Sorting records on this key
Merge/Purge records
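A compact sketch of those three steps, assuming toy record fields and a fixed sliding-window size; the key construction and match rule here are illustrative placeholders, not the exact rules from any particular paper:

```python
def make_key(rec):
    # Step 1: build a sort key from selected fields (illustrative choice of fields).
    return (rec["last"][:4] + rec["first"][:1] + rec["zip"]).upper()

def sorted_neighborhood(records, window=5, is_match=lambda a, b: make_key(a) == make_key(b)):
    # Step 2: sort the records on the key.
    records = sorted(records, key=make_key)
    # Step 3: merge/purge -- compare each record only with the window-1 records before it.
    pairs = []
    for i, rec in enumerate(records):
        for other in records[max(0, i - window + 1):i]:
            if is_match(other, rec):
                pairs.append((other, rec))
    return pairs
```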
What is edit distance used for?
A method used to quantify how dissimilar two strings are, by counting the minimum number of edit operations (insertions, deletions, and substitutions) required to transform one string into the other.
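A minimal dynamic-programming sketch of (Levenshtein) edit distance with unit-cost insert, delete, and substitute operations:

```python
def edit_distance(a: str, b: str) -> int:
    # prev[j] holds the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```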
Name four challenges that modern algorithms have to overcome today.
Use only a fixed amount of main memory.
Incorporate new examples as they become available
Operate continuously and indefinitely
Never lose potentially valuable information
List the input requirements of the HT-Algorithm, and state what output is generated.
Inputs:
S : sequence of examples
X : set of discrete attributes
G(.) : split evaluation function
δ : desired probability of choosing the wrong attribute at any given node
Output:
HT : A decision tree (Hoeffding Tree)
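The quantity the algorithm uses to decide, with confidence 1 − δ, whether the best split attribute can already be chosen after n examples is the Hoeffding bound; a small sketch, where R is the range of the split evaluation function G:

```python
import math

def hoeffding_bound(R: float, delta: float, n: int) -> float:
    """epsilon such that the true mean of G differs from the observed mean by at most
    epsilon with probability 1 - delta, after n independent observations."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

# If the observed G of the best attribute exceeds the runner-up by more than epsilon,
# split the leaf on the best attribute.
eps = hoeffding_bound(R=math.log2(2), delta=1e-7, n=5000)
```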
How is memory management handled differently in a VFDT than a Hoeffding Tree?
VFDT’s memory use is dominated by the memory required to keep counts for all growing leaves. When available memory is exhausted, VFDT temporarily deactivates its least promising leaves (reactivating them later if they become more promising), a memory-management mechanism the basic Hoeffding Tree algorithm does not include.
What are the components of a FP-tree?
One root
A set of item prefix subtrees as the children of the root
A frequent-item header table
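An illustrative sketch of those three components as plain Python classes (the names are hypothetical, not taken from the FP-growth paper's pseudocode):

```python
class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item          # None for the root
        self.count = 0            # frequency count along this prefix path
        self.parent = parent
        self.children = {}        # item -> FPNode (the item-prefix subtrees)
        self.node_link = None     # next node in the tree carrying the same item

class FPTree:
    def __init__(self):
        self.root = FPNode()      # 1) one root
        self.header = {}          # 3) frequent-item header table: item -> first node in its chain

    def insert(self, sorted_frequent_items):   # 2) grows the item-prefix subtrees
        node = self.root
        for item in sorted_frequent_items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                child.node_link = self.header.get(item)
                self.header[item] = child
            child.count += 1
            node = child
```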
How do you calculate the frequent patterns containing a_i in a path P?
Only consider the prefix sub-path of node a_i in P.
The frequency count of every node in that sub-path is the same as that of node a_i.
Find all the combinations.
Compare the efficiency of the mining operations in FP-growth with those in Apriori.
FP-growth's mining operations consist mainly of:
prefix count adjustment
counting
pattern fragment concatenation
These are much less costly than Apriori's:
generating a very large number of candidate patterns
testing each of them
What are the two major fraud detection categories, how do they differ, and under which does DC-1 fall?
Pre-Call Methods
Involve validating the phone or its user when a call is placed.
Post-Call Methods – DC-1 falls here
Analyze call data on each account to determine whether cloning fraud has occurred.
Why do fraud detection methods need to be adaptive?
Bandits change their behavior, so patterns of fraud are dynamic
Levels of fraud vary month to month
The cost of missing fraud or of handling false alarms changes between inter-carrier contracts
What are the two steps of profiling monitors, and what are the two main monitor templates?
Profiling Step: measure an account's normal activity and save summary statistics
Use Step: process usage for an account-day to produce a numerical output describing how abnormal the activity was on that account-day
Threshold and Standard Deviation monitors.
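A toy sketch of the two monitor templates, assuming an account's profile is just summary statistics of its daily minutes of use; the field names and scoring rules are illustrative:

```python
import statistics

def profile(daily_usage_minutes):
    # Profiling step: summarize an account's normal activity over its profiling period.
    return {"mean": statistics.mean(daily_usage_minutes),
            "std": statistics.pstdev(daily_usage_minutes),
            "max": max(daily_usage_minutes)}

def threshold_monitor(prof, todays_minutes):
    # Use step: 1 if today's usage exceeds the account's historical maximum, else 0.
    return 1 if todays_minutes > prof["max"] else 0

def std_dev_monitor(prof, todays_minutes):
    # Use step: how many standard deviations today's usage lies above the account's mean.
    if prof["std"] == 0:
        return 0.0
    return max(0.0, (todays_minutes - prof["mean"]) / prof["std"])
```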
Disadvantages and problems of PageRank?
Rank Sinks: Occur when pages get caught in infinite link cycles.

Spider Traps: A group of pages is a spider trap if there are no links from within the group to outside the group.

Dangling Links: A page contains a dangling link if the hypertext points to a page with no outgoing links.

Dead Ends: are simply pages with no outgoing links.
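A short power-iteration sketch of PageRank with a damping factor, the standard remedy for rank sinks and spider traps; dangling pages and dead ends are handled here by spreading their rank uniformly. The graph representation is illustrative:

```python
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}        # random-jump term escapes sinks/traps
        for p, outs in links.items():
            if not outs:                               # dead end / dangling page:
                for q in pages:                        # distribute its rank uniformly
                    new[q] += d * rank[p] / n
            else:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

print(pagerank({"A": ["B"], "B": ["A", "C"], "C": []}))
```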
What Makes Ranking Optimization Hard?
Link Spamming

Keyword Spamming

Page hijacking and URL redirection

Intentionally inaccurate or misleading anchor text

Accurately targeting people's expectations
Specify the difference between paired t-test and simple binomial test in comparing two algorithms.
Paired t-test:
determines whether a significant difference between the two algorithms exists

Binomial test:
compares the percentage of times ‘algorithm A > algorithm B’ versus ‘A < B’, throwing out the ties
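A small illustration of the two tests on per-trial accuracy scores for two algorithms, using SciPy (≥ 1.7 for binomtest); the data here is made up:

```python
from scipy import stats

a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.79]  # algorithm A
b = [0.78, 0.80, 0.81, 0.77, 0.80, 0.76, 0.79, 0.82, 0.78, 0.77]  # algorithm B

# Paired t-test: is the mean paired difference between A and B significantly non-zero?
t_stat, p_paired = stats.ttest_rel(a, b)

# Binomial (sign) test: out of the non-tied trials, how often is A > B?
wins = sum(x > y for x, y in zip(a, b))
non_ties = sum(x != y for x, y in zip(a, b))
p_binom = stats.binomtest(wins, non_ties, p=0.5).pvalue
```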
Why should we apply Bonferroni Adjustment to comparing classifiers?
With multiple tests, a multiplicity effect occurs if we use the same significance level for each individual test as for the whole family of tests. The Bonferroni adjustment therefore imposes a more stringent significance level on each individual experiment.
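Concretely, with m tests and a desired family-wise level α, the per-test level becomes:

```latex
\alpha_{\text{per test}} = \frac{\alpha}{m},
\qquad \text{e.g. } \alpha = 0.05,\ m = 10 \ \Rightarrow\ \alpha_{\text{per test}} = 0.005 .
```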
Why is finding robust rules significant?
Real world databases tend to be dynamic

Changing information – updates and deletions, rather than just additions – could invalidate current rules

Continually checking and updating rules may incur high maintenance costs, especially for large databases

Robustness measures how likely the knowledge found will be consistent after changes to the database
Compare and Contrast Robustness estimations with support and predictive accuracy
Robustness is the probability that an entire database state is consistent with a rule, while support is the probability that a specific data instance satisfies a rule.

Predictive Accuracy is the probability that knowledge is consistent with randomly selected unseen data. This is significant in closed world databases, where data tends to be dynamic.
Why are transaction classes mutually exclusive when calculating robustness?
Transaction classes are mutually exclusive so that no transaction class covers another because for any two classes of transactions t_a and t_b, if t_a covers t_b, then Pr(t_a⋀t_b)=Pr(t_a) and it is redundant to consider t_b.
Compare and contrast FUP2 and DELI
Both algorithms are used in Association Analysis;
Goal: DELI decides when to update the association rules while FUP2 provides an efficient way of updating them;
Technique: DELI scans a small portion of the database (sample) and approximates the large itemsets whereas FUP2 scans the whole database and returns the large itemsets exactly;
DELI saves machine resources and time.
What does DELI stand for?
Difference Estimation for Large Itemsets
Difference between Apriori and FUP2
Apriori scans the whole database to find association rules, and does not use old data mining results;
For most itemsets, FUP2 scans only the updated part of the database and takes advantage of the old association analysis results.
Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms?
Association rules refer to intra-transaction patterns, while sequential patterns refer to inter-transaction patterns. Both of these are used in the Apriori algorithms studied here, because the algorithms are looking for different sequential patterns made up of association rules.
What is the major difference between the two algorithms CountSome and CountAll?
CountAll (AprioriAll) is careful with respect to minimum support, and careless with respect to maximality. (The minimum support is checked for each sequence on each run, but maximal sequences must be checked for later.)

CountSome (AprioriSome) is careful with respect to maximality, but careless with respect to minimum support. (Non-maximal sequences are pruned out during runtime, but the minimum support is not tested at all values of k.)
Why is the Transformation stage of these pattern mining algorithms so important to their speed?
The transformation allows each record to be looked up in constant time, reducing the run time.
What is the main difference between polynomial, radial basis function, and neural network learning machines? State that difference for the neural network learning machine.
The kernel function. For the (two-layer) neural network learning machine it is the sigmoid kernel:
K(x, x_i) = S[v(x · x_i) + c]
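A sketch of the three kernels, making explicit that the kernel function is the only piece that changes; gamma, degree, v, and c are illustrative hyperparameters, and tanh stands in for the sigmoid S:

```python
import numpy as np

def polynomial_kernel(x, xi, degree=3, c=1.0):
    return (np.dot(x, xi) + c) ** degree

def rbf_kernel(x, xi, gamma=0.5):
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xi)) ** 2))

def neural_network_kernel(x, xi, v=1.0, c=0.0):
    # K(x, xi) = S[v (x . xi) + c], with S a sigmoid (tanh here)
    return np.tanh(v * np.dot(x, xi) + c)
```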
What is empirical data modeling? Give a summary of the main concept and its components
Empirical data modeling is the induction of observations to build up a model. Then the model is used to deduce responses of an unobserved system.
What must the empirical risk R_emp(α) do over the set of loss functions?
It must converge to the actual risk R(α).
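For reference, the two risks in the usual empirical-risk-minimization notation (a sketch, with L the loss function and F(x, y) the unknown joint distribution):

```latex
R_{\mathrm{emp}}(\alpha) = \frac{1}{\ell}\sum_{i=1}^{\ell} L\bigl(y_i, f(x_i, \alpha)\bigr)
\quad\longrightarrow\quad
R(\alpha) = \int L\bigl(y, f(x, \alpha)\bigr)\, dF(x, y).
```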
What are the two aspects of Text Mining when applied to customer complaints?
Knowledge Discovery: Discovering a common customer complaint in a large collection of documents containing customer feedback.
Information Distillation: Filtering future complaints into pre-defined categories
How does the procedure for text mining differ from the procedure for data mining?
Adds feature extraction phase
Infeasible for humans to select features manually
The feature vectors are, in general, highly dimensional and sparse
What are some examples of unstructured textual collections used in Text Mining?
Customer letters
Email correspondence
Phone transcripts
Technical documentation
Patents
Why is the frequency of subgraphs a good function to evaluate candidate patterns? How could it be better?
Subgraph frequency is anti-monotone (monotonically non-increasing as the pattern grows), meaning supergraphs are never more frequent than their subgraphs. This is a desirable property because, combined with a minimum support threshold, it prunes the search space as subgraph patterns get bigger.
However, frequency does not always imply significance – another metric must be used to evaluate the candidates generated by a graph miner for significance.
How is a string representation of a tree useful in graph mining? What requirements does it place on the graph?
A string representation of a tree is useful because string comparisons are worst-case O(n) and can be easily optimized. However, it requires that a tree be rooted and ordered, because otherwise the string comparison operator would not be valid.
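An illustrative sketch of one such string encoding for a rooted, ordered tree: a preorder walk that emits each node's label and a special backtrack symbol when returning to the parent. The exact encoding details vary between tree-mining papers:

```python
def encode(label, children):
    """Return the preorder string encoding of a rooted, ordered tree.
    `children` is an ordered list of (label, children) subtrees."""
    parts = [str(label)]
    for child_label, child_children in children:
        parts.append(encode(child_label, child_children))
        parts.append("-1")          # backtrack marker: done with this subtree
    return " ".join(parts)

#        A
#       / \
#      B   C
#      |
#      D
tree = ("A", [("B", [("D", [])]), ("C", [])])
print(encode(*tree))   # "A B D -1 -1 C -1"
```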
Of the following Web mining paradigms:
Information Retrieval
Information Extraction
Which does a traditional Web search engine (google.com, bing.com, etc.) attempt to accomplish? Briefly support your answer.
Information Retrieval: the search engine attempts to provide a list of documents ranked by their relevance to the search query.
State one common problem hampering accurate Web usage mining. Briefly support your answer.
Users connecting to a Web site through a proxy server,
Users (or their ISPs) relying on Web data caching,
will result in decreased server-log accuracy. Accurate server logs are required for accurate Web usage mining.
What is the phrase associated with the most popular method for Web content mining algorithms to generate document features from unstructured documents?
"Bag of Words" representation