160 Cards in this Set

  • Front
  • Back

Cluster Objective


homogeneity within a segment is maximized: low average pairwise distance among its objects


heterogeneity between segments is maximized: large distances between objects of different clusters



No response variable



A distance measure is a way to quantify similarity

Cluster Applications

Understanding a customer population


Text Analytics


Fraud detection


Cluster Approaches

Hierarchical:


Agglomerative


Divisive



Non-Hierarchical:


Exclusive


Non-exclusive

Properties of Distance Measures

Always a real number



non-negativity


symmetry


identity


triangle inequality: the shortest path between two objects is the direct one


Distance Measures

Euclidean


Manhattan (City-Block)



Cosine similarity: angle between vectors (see the sketch below)
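
A minimal sketch of the three distance notions on this card, assuming Python with NumPy; the two example vectors are made up.

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 0.0, 3.0])

    euclidean = np.linalg.norm(a - b)          # straight-line distance
    manhattan = np.abs(a - b).sum()            # city-block distance
    cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based

    print(euclidean, manhattan, cosine)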

Distance Measures for non-numeric

Translate into dummy variables


Hamming distance


Edit distance: how many operations are needed to transform one text into another


Graphs, Time Series

General Notion of similarity/distance

effort to transform one object into another


transformation specific costs


Hierarchical Clustering

Iterative Algorithm:


form cluster solution


loop



Divisive:


Top-down: start with all elements in one cluster, then split up


Decreases heterogeneity in a cluster



Agglomerative:


Bottom-Up


Start with every case as its own cluster, then merge clusters


Increases heterogeneity in a cluster

Dendrogram

Tree-type graph depicting the progress of hierarchical clustering



height of line depicts to which extent intra-cluster heterogeneity increases when merged



sharp increase suggests cut



outlier detection possible

How to measure distances between clusters?

Single linkage: closest objects


only small dissimilarities needed


chaining



Complete: most dissimilar objects


violates closeness property



Compromises:



Average: mean of all pairwise


Centroid: distance between centroid


Ward's: marginal increase in variance when merging



When clusters well separated: similar results
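
A short sketch of agglomerative clustering with Ward's linkage and a dendrogram cut, assuming Python with SciPy, scikit-learn (only for toy data) and matplotlib; the data and the three-cluster cut are illustrative.

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=40, centers=3, random_state=0)  # toy data

    Z = linkage(X, method="ward")   # bottom-up merge history (Ward's criterion)
    dendrogram(Z)                   # long vertical lines suggest where to cut
    plt.show()

    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(labels)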

Non-Hierarchical Clustering

Combinatorial


Centroid


Distribution based


Density based


Self-Organising maps

Exclusive Clustering

Every case assigned to one cluster


k-means

K-Means clustering

Define number of clusters: K


Difficult to know: use domain knowledge or a heuristic approach (e.g., the elbow strategy)



Randomly guess centroids/centers


Assign cases to nearest centroid


Update centroids


Repeat until cluster assignment stops changing
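
A minimal NumPy sketch of the k-means loop described on this card (random initial centroids, assign, update, repeat); K and the toy data are assumptions, and the empty-cluster edge case is ignored.

    import numpy as np
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=200, centers=3, random_state=1)  # toy data
    K = 3
    rng = np.random.default_rng(1)
    centroids = X[rng.choice(len(X), size=K, replace=False)]     # random initial centers

    for _ in range(100):
        # assign each case to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each centroid as the mean of its assigned cases
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):                # assignments stopped changing
            break
        centroids = new_centroids

    print(centroids)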



Non-exclusive clustering

Soft-cluster assignment:


cluster membership probabilities rather than the hard assignments of k-means


related: Gaussian Mixture Models



More robust towards outliers


Better suited for overlapping distributions

Gaussian Mixture Model Clustering

Choose K


Algorithm models data


Each mixture component influences every case


Strong influence if case is close to mean vector, small otherwise



Compute association to mixture components


Recompute parameters based on responsibilities
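
A sketch of soft assignment with a Gaussian mixture, assuming Python with scikit-learn; K = 2 and the toy data are illustrative, and the EM iterations happen inside fit().

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=2, random_state=0)  # toy data

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    resp = gmm.predict_proba(X)   # "responsibilities": soft cluster membership per case
    print(gmm.means_)             # estimated mean vectors of the mixture components
    print(resp[:5])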

Comparison Hierarchical / Non-Hierarchical

Hierarchical:


Clusters follow from the dendrogram


arbitrary distance measures: flexibility, validity?


Limited scalability: computationally expensive


Choose linkage measure



Non-Hierarchical:


number of clusters set by user


less flexibility in terms of distance measure, geared for numeric data


Fairly scalable: k-means converges fast


Association rule mining

Detect frequently occurring patterns between different items in a transaction:


- purchased together


- co-occur in a text


- chosen together



Discover rules: If A then B


Sequence analysis: time dependency

Association rule terminology

Antecedent


Consequent


whereby antecedent and consequent are disjoint item sets



2^N-1 combinations of sets: ignore empty set

Association rule measures

Support: #(X ∪ Y) / # of all transactions


Threshold: minsup, frequent item set



Confidence: fraction of transactions containing X that also contain Y:



#(X ∪ Y) / #X


Measure of strength of association


conditional probability - no causal link



Lift: Interestingness of a rule


<1: substitution


>1: complementarity



support(X ∪ Y) / (support(X) × support(Y))
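
A small pure-Python sketch of the three measures for a rule X → Y; the toy transactions are made up.

    transactions = [
        {"bread", "butter"}, {"bread", "butter", "milk"},
        {"milk"}, {"bread"}, {"bread", "butter", "beer"},
    ]
    X, Y = {"bread"}, {"butter"}
    n = len(transactions)

    sup_xy = sum(1 for t in transactions if X | Y <= t) / n   # support(X u Y)
    sup_x  = sum(1 for t in transactions if X <= t) / n
    sup_y  = sum(1 for t in transactions if Y <= t) / n

    confidence = sup_xy / sup_x        # conditional probability P(Y | X), no causal claim
    lift = sup_xy / (sup_x * sup_y)    # >1 suggests complements, <1 substitutes
    print(sup_xy, confidence, lift)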


Association Rule Applications

Reveal dependencies among items



Market basket


Next best offer


Product bundling


Store and shelf layout


Product catalog


Recommender system


Web Usage


Mining Association Rules

Only strong meaningful rules of interest


Strategy:


support > minsup >> the item set must be sufficiently frequent


confidence > minconf



Brute-force approach: every item combination could be a frequent set; computing support for every candidate set is very complex (2^N − 1 sets)

Strategies to find frequent item sets

Reduce candidates M


M=2^N-1



Reduce transactions N


Pooling, DHP



Reduce comparisons nM


Efficient data structure


Avoid comparing every candidate to every transaction

Apriori Algorithm

Property:


downward closure


if an item set is frequent, all of its subsets are frequent


if an item set is infrequent, all of its supersets are also infrequent



Functioning: minsup, minconf


two step approach: exploit downward closure, find candidate sets with K items


Discover rules > minconf



Issue: Efficiency
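
A compact sketch of the level-wise Apriori search for frequent item sets, with downward closure used to prune candidates; minsup and the toy transactions are assumptions.

    from itertools import combinations

    transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    minsup = 0.6
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]

    k = 1
    while frequent[-1]:
        # candidate sets with k+1 items, built from frequent k-item sets
        candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
        # downward closure: keep only candidates whose k-item subsets are all frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[-1] for s in combinations(c, k))}
        frequent.append({c for c in candidates if support(c) >= minsup})
        k += 1

    for level in frequent[:-1]:
        print([sorted(s) for s in level])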

Issues with Association Rule Mining

How to set thresholds


Irrelevant / trivial rules


Useful rules often have only moderate support


Interesting patterns overlooked

Association Rules and Supervised Learning

Automatic modelling:


rules have one consequent: the response variable


only few antecedents


sort by confidence


top rules provide insights and hypotheses to test



Variable selection:


mimic a correlation-based filter


Identify association rules among categorical variables



Develop rules

Sequential Pattern Mining

Maximal sequence among all sequences



Order of time in sequence is important



Association rules: same point in time, intra-transaction


Sequential rules: different points in time, inter-transaction



Apriori principle: if a sequence with k elements is infrequent, sequences with k+1 elements are also infrequent

Branches of Web Mining

Structure Mining: knowledge from hyperlinks (structure of the web), key technology in search engines, discover communities of users; link structure does not fit a relational table



Usage Mining: discovery of access patterns from logs, many mining algos, pre-processing of clickstream data in logs to produce data for mining



Content Mining: web page content; classify/cluster pages by topic; similar to data mining; discover patterns in pages; extract information for many purposes (reviews, descriptions, etc.)


Web Structure Mining

Extract patterns from hyperlinks that connect pages


Examine structure of a single page



Web as a directed graph

Crawlers

Procedure:


download page


extract urls


repeat for unread urls



Breadth-first search: explore links layer by layer, well-connected pages



Depth-first search: follow first link, subsequent pages



Dynamics: examine page update frequency, re-visit periodically

Search engine

PageRank + other factors


Important if others link to it


Query independent relevance scores


Inbound links cast a vote


vote is weighted by sending pages' quality


Recursive calculation



HITS (Hyperlink-Induced Topic Search)
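
A minimal power-iteration sketch of the recursive "votes weighted by sender quality" idea behind PageRank; the toy link graph and the damping factor of 0.85 are assumptions.

    # toy directed link graph: page -> pages it links to
    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    pages = list(links)
    d = 0.85                       # damping factor (assumption)
    pr = {p: 1 / len(pages) for p in pages}

    for _ in range(50):            # recursive calculation until scores settle
        new = {}
        for p in pages:
            # inbound links cast votes, weighted by the sending page's own score
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / len(pages) + d * incoming
        pr = new

    print(pr)                      # query-independent relevance scores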

Web Usage Mining

Patterns in clickstreams and transactions: interactions with a web resource



Sources: logs, meta data, domain knowledge, client side scripting, cookies



Types: Usage, Content, Structure, User

Web Usage Mining Process

Data collection / preparation, pattern discovery and analysis



Preparation: create DB and fill with data


Discovery: typical behavior statistics


Analysis: aggregate, input to engines and reports


Data Preparation Tasks

Pageview identification


User identification


Sessionization


Path Completion


Data Integration

Pageview identification

Single access: multiple entries (images, html)


Remove redundant entries


Remove: Well-behaved robots / stealth robots

User Identification

Not in logs:


cookies, IP, time, referrer, user agent


Sessionization

Segment user activity into sessions:


authentication


clickstream heuristics



Episode: subset of session, related page views,


classification of pageviews

Path completion

Server logs do not capture the whole navigation path (caching)



Knowledge of site structure, referrer information

Data Integration

Sessionized clickstream:


integrate transaction data, demographics, products, commerce data

User-Pageview-Matrix

If the order of views is not relevant: clustering and association rule mining


Term‐Pageview Matrix

Cluster Pages according to a topic

Content-Enhanced Matrix

Combine the term and user pageview matrices: better description of what users look for

Web Usage Patterns

Web controlling: session and visitor analysis, OLAP tools


Predictive modelling based on transactions



Cross selling, Improve site layout


Visitor segmentation


Recommender Systems and Collaborative Filtering

Purpose: help with information overload



Setting: users and items with features



Task: learn a utility function from data, predict the value of item i to user u, and recommend



Rating prediction


Item prediction (what user likely to buy)



Content based recommendation: similar to past item


Collaborative filtering: people with similar taste bought in the past



Hybrid approaches: association rule, nearest neighbour
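
A small NumPy sketch of user-based collaborative filtering on a made-up user × item rating matrix: find users with similar taste, then predict a missing rating as a similarity-weighted average of their ratings.

    import numpy as np

    # rows = users, columns = items; 0 = not yet rated (toy data)
    R = np.array([[5, 4, 0, 1],
                  [4, 5, 1, 0],
                  [1, 0, 5, 4],
                  [0, 1, 4, 5]], dtype=float)

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    target_user, target_item = 0, 2            # predict R[0, 2]
    sims = np.array([cosine(R[target_user], R[u]) for u in range(len(R))])
    sims[target_user] = 0                      # exclude the user themself

    rated = R[:, target_item] > 0              # users who rated the target item
    pred = np.sum(sims[rated] * R[rated, target_item]) / np.sum(np.abs(sims[rated]))
    print(round(pred, 2))                      # predicted rating of item 2 for user 0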

Web Content Mining

Extract useful information


text mining part of this and vice versa

Text analytics

Tools and techniques to model and structure information



information sits not in databases but in blogs etc. >> unstructured

Challenges in Text analytics

Homonymy: same word, different meaning


Synonymy: different words, same meaning


Polysemy: same word, slightly different meanings


Hyponymy: hierarchy of concepts


Misspellings

Text analytics process

Text > Gather data > corpus > NLP > bag of words > term matrix > classification > knowledge

NLP

Part-of-speech tagging: resolve ambiguity


Tokenization: split the character stream into tokens


Filtering: dictionaries


Stemming

Document Term Matrix

Describes the frequency of terms occurring in a corpus


rows: documents, columns: terms, term frequencies in the cells

Inverse Document Frequency

Ratio of numbers of docs in corpus and number of docs in which a term occurs



Large values: the term occurs in few documents and is therefore more distinctive


Term‐Frequency‐Inverse Document Frequency

High TF: relevance at document level


High IDF: distinctiveness at corpus level


TF × IDF matrix is commonly used
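
A plain-Python sketch of one TF × IDF weight, using the ratio-of-documents definition above; the corpus and term are made up, and real toolkits usually apply smoothed variants.

    import math

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs",
    ]
    term, doc = "cat", corpus[0].split()

    tf = doc.count(term) / len(doc)                    # relevance at document level
    df = sum(1 for d in corpus if term in d.split())   # documents containing the term
    idf = math.log(len(corpus) / df)                   # distinctiveness at corpus level

    print(tf * idf)                                    # TF-IDF weight of "cat" in document 1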

N-Grams

Use multiple words to create features



Bigrams, Trigrams, N-Grams



treated the same way as a single word

Social Media Definition

Group of internet-based applications built on the ideological and technological foundations of Web 2.0



Classified along Social Presence / Media Richness and Self-Presentation / Self-Disclosure (table from low to high)

Graphs

Pair G = (V,E)



Vertices = Nodes


Edges = Links


Edges are two-element subsets of V (each edge connects two vertices)



Node: certain object of interest, edge: connect and describe relationship


Graph Structures

Telephone calls, spread of diseases, collaboration networks, topology, social media



Types:



Directed: one way


Undirected: both ways



Bipartite: two sets, linking to each other


Unipartite: one set



Weighted: weight assigned to every edge


Ego-centered: a node's one-hop neighborhood


relevant with high homophily

Graph metrics - distance

Shortest path length: two nodes, shortest path between them (fewest hops)



Eccentricity: a node's longest shortest path to any other node



Diameter: max eccentricity


Radius: min eccentricity


Periphery: nodes whose eccentricity equals the diameter

Graph metrics - centrality

Degree centrality: number of edges converging to a node



Closeness centrality: average distance of a node to all other nodes, between 1 and 0, 1= close, 0= far away



Betweenness centrality: number of shortest paths passing through a node



Eigenvector centrality: measure influence of a node in a network. More important when connected to other important nodes
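
A sketch of the four centrality measures (plus the distance metrics from the previous card) on a toy graph, assuming Python with the networkx package; the edge list is illustrative.

    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

    print(nx.degree_centrality(G))       # edges converging on each node (normalised)
    print(nx.closeness_centrality(G))    # towards 1 = close to everyone, towards 0 = far away
    print(nx.betweenness_centrality(G))  # share of shortest paths passing through a node
    print(nx.eigenvector_centrality(G))  # important if connected to other important nodes

    # distance metrics from the previous card
    print(nx.eccentricity(G), nx.diameter(G), nx.radius(G), nx.periphery(G))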


Graph Neighbor Classification

Probability that a node does something given the classes of direct neighborhood nodes



Problem: data not iid


Cannot split in test and training as graph is interconnected


Huge scale



Graph Neighbor Classification Procedure

Local model: only node-specific characteristics



Relational model: connections in the network to do the inference



Collective inference: how unknown nodes are estimated together influencing each other



Markov property: class membership depends only on the classes of the one-hop neighborhood



Homophily assumption: nodes tend to connect to nodes with similar characteristics



Extension with probabilistic RNC:


all nodes carry class probabilities; an unknown node's estimate is the average of its neighbors' probabilities


Relational Logistic Regression (Graph)

Using node specific and network specific characteristics



Mind correlation of both > stepwise regression



Featurization: features might measure behaviour of neighbouring nodes or local characteristics


Collective Inferencing

Given a network, a local model and a relational model, a way to infer the classes/probabilities of unknown nodes


Takes into account: nodes influence each other



Gibbs sampling

Initialise with local classifier


Sample class value of each node


Generate random ordering


Apply relational learner


Sample again


Repeat iterations counting time each class is assigned


Normalize to give final class probability

Facebook Bidding

Eye-catching spot most expensive


Price based on magnitude and popularity of audience


Virtual auction decides which advertisement is placed where


Competitive market on website

Link Prediction

Friends, Admirers, Idols, Neutrals, Enemies



Homophily: connect to similar people


Similarity: demographics, behaviours, interest, brands



Link prediction analytics enables more efficient targeting:


 Customer churn
 Customer acquisition


Tasks in Data Preprocessing

Feature Selection


Sampling


Missing Values


Discretization


Encoding


Standardizing Data


Outliers


Target Definition

Data Preprocessing Process

Selection


Cleaning


Transformation


Reduction

Cleanining

Noisy Data


Missing Values:


keep, delete, replace


Outliers

Imputation Methods

Replace with mean / modal value


Regression / Tree based imputation


Nearest Neighbor

Outliers Types

Valid


Entry error


Difficult to handle: univariate, multivariate


Detect vs. treatment

Outlier treatment

Univariate: Boxplot, Histogram



IQR: Q3-Q1



Mild outlier: 1.5-3 × IQR away from Q1/Q3


Extreme outlier: > 3 × IQR away from Q1/Q3



z-scores: outliers are more than 3 standard deviations away from the mean



Multi: Clustering, Regression, RF
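
A NumPy sketch of the two univariate rules above (IQR fences and z-scores) on a made-up data vector with one planted outlier.

    import numpy as np

    x = np.array([9, 10, 11] * 6 + [10, 50], dtype=float)    # toy data, 50 is the outlier

    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mild    = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)    # beyond 1.5 x IQR
    extreme = (x < q1 - 3.0 * iqr) | (x > q3 + 3.0 * iqr)    # beyond 3 x IQR

    z = (x - x.mean()) / x.std()
    z_outlier = np.abs(z) > 3                                # more than 3 std from the mean

    print(x[mild], x[extreme], x[z_outlier])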



Data Transformation

Continuous:



Scaling: distance computations are biased by different value ranges - scaling often useful


skewed value distribution - log transformation



Discretization:


avoid negative impact of outliers, increase comprehensibility, capture non-linear effects, loss of information



Unsupervised binning (EF, EW)


Supervised binning (Entropy, decision trees)



Categorical: Dummy


do not encode as numbers for regression >> meaningless, artificial distances


N-1 dummy variables - explosion of dimensionality



Binary: better -1 and 1 when using neural networks



WOE: ln(fraction of bad in category / fraction of good in category)



IV: measure of predictive power, assess variable suitability; 0.3+: strong, 0.1-0.3: medium, 0.02-0.1: weak
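
A small sketch of WOE and IV for one categorical variable, following the ln(fraction of bad / fraction of good) convention on this card; the counts are made up and sign conventions differ across tools.

    import math

    # made-up counts of (bad, good) outcomes per category of one variable
    counts = {"A": (20, 180), "B": (50, 150), "C": (30, 70)}
    total_bad = sum(b for b, g in counts.values())
    total_good = sum(g for b, g in counts.values())

    iv = 0.0
    for cat, (bad, good) in counts.items():
        frac_bad, frac_good = bad / total_bad, good / total_good
        woe = math.log(frac_bad / frac_good)            # weight of evidence for this category
        iv += (frac_bad - frac_good) * woe              # information value contribution
        print(cat, round(woe, 3))

    print("IV:", round(iv, 3))   # ~0.02-0.1 weak, 0.1-0.3 medium, 0.3+ strong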

Data Reduction

Horizontal:


Sampling of cases


Reduce computational cost


class imbalance



Vertical reduction:


variable selection


feature extraction


curse of dimensionality


computational cost

Random sampling

Increase speed of modelling


Higher Variance


Less accuracy


Use all data

Advanced Sampling

Stratified to keep Good/Bad ratio constant


Classification model geared towards majority group



Undersampling: random deletion of cases, reduce training time, discard useful info



Oversampling: random duplicate minority class, no info loss, increased training time, beats undersampling



SMOTE: creates artificial minority class cases
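
A minimal NumPy sketch of the SMOTE idea: interpolate between a minority-class case and one of its nearest minority-class neighbours; k and the toy data are assumptions, and real libraries such as imbalanced-learn add more machinery.

    import numpy as np

    rng = np.random.default_rng(0)
    minority = rng.normal(loc=0.0, scale=1.0, size=(10, 2))  # toy minority-class cases
    k = 3

    def smote_samples(X, n_new, k):
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(X))                          # pick a random minority case
            d = np.linalg.norm(X - X[i], axis=1)
            neighbours = np.argsort(d)[1:k + 1]               # its k nearest minority neighbours
            j = rng.choice(neighbours)
            gap = rng.random()
            synthetic.append(X[i] + gap * (X[j] - X[i]))      # interpolate between the two cases
        return np.array(synthetic)

    print(smote_samples(minority, n_new=5, k=k))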

Variable Selection

Reduce:


easier to understand, cost of collection, accuracy



Find best subset: multivariate, combinatorial problem

Wrapper / Filter

Wrapper:


better performance, multivariate, higher comp. cost, auxiliary validation data to assess variable subset



iteratively model while manipulating set of variables, forward selection, backward elimination



Filter:


often less effective, only an indicator, model agnostic, univariate, lower comp. cost



Use statistical indicator, correlation, Fisher score, IV, one variable at a time,



often univariate:


redundancy, interaction


Random Forest Variable Importance

OOB variable importance:


importance scores like z-values (or probabilities)



Gini variable importance:


specific to random Forest


not readily interpretable



Both often agree

Feature Extraction

Turn existing variables into new ones



Linear / nonlinear projection


Clever transformation



Highly effective but costly



PCA, Genetic algorithm


Business Analytics used for

data


information technology


statistical analysis


QM



to help managers gain improved insight into operations and make better decisions



unlock valuable information hidden in data


value proposition: help firms to achieve strategic objectives



ERP: record data


BI: inform managerial decision making


BA: may even replace manager



Explanatory Approach: explore business value of BA


Design approach: new methods for BA

Predecessors of BA

Management Information Systems (MIS): reporting


Executive Information Systems (EIS): specific needs of senior management, visual


Decision Support Systems (DSS): formal models and algorithms, operational management


Management Support Systems (MSS): all of the above

Scope of BA

Descriptive Analysis:


past and present


Predictive Analysis:


past and predict future (e.g. demand)


Prescriptive:


not only anticipate future but make recommendations (e.g. prices)

Define business intelligence

Concepts and methods to improve business decision making by using fact-based support systems


What is a database?

Collection of data that exists over a long period of time



managed by DBMS:


create and manage data, allow it to safely persist over a long period of time

Data storage types

Volatile Storage: Cache, Memory


Nonvolatile Storage: Disk / File-System (DBMS), Tertiary Storage (Tapes, DVDs)

What does a DBMS do?

Create new DB


Store large amounts of data


Query data


Modify Data


Efficient access


Durability


Access control: isolation, atomicity, security req.

Relational model

Based on set theory and relational algebra, two-dimensional tables with rows and columns


Interaction with SQL

Relational DB

Collection of items organised as a set which can be accessed


Data values can be numeric, strings, etc


Tables can be joined to form new tables



Advantages:



Time-proven,


ACID:


atomicity


consistency


isolation: transactions give the same results as if executed serially


durability: once committed, it remains so

Transactions

Group of DB operations


ACID

DB architectures

Two tier:


client-server



Three-tier:


Web server, Application Server (business logic), Database server (DBMS)



Redundancy possible

Cloud computing

Network-based computing over the internet



provides data storage, uses the internet as transport, serves many clients at the same time



set of integrated and networked hardware



hide the complexity of underlying infrastructure



on demand service



pay for use, elastic / scaling



available for general public



cloud storage: accessed remotely, only temporarily cached on the local computer

Challenges of RDBMS

Diversity


Connectivity


Data Size

NoSQL movement

not using a SQL language


no schema to follow, no constraints on new data


open source, large clusters

Information Integration

combining information contained in different and heterogeneous databases into one large repository



in companies, e.g., each division has its own database



different DBMS, different terms and things, legacy applications



necessary to build a structure on top of existing DBs

Data Warehouse vs. Data Federation

Data Warehouse: information from many databases is copied periodically, with appropriate translation, into one central DB



Data Federation: with the help of a mediator (middleware), support an integrated model of the data in various databases while translating between that model and the actual models used by each database



Storage remain distributed

What is a Data Warehouse?

A database maintained separately from operational databases


supports decision making through solid data consolidation and historic data analysis



Warehousing: constructing and using warehouses

DW: subject oriented

major subjects: customer, product etc



simple and concise view, excluding data that is not useful



data for decision makers, not daily transactions

DW: integration

Integration of multiple, heterogeneous sources



Data cleaning and conversion applied



Ensure quality and consistency in naming conventions, structures, measures among different sources

DW: time variant

larger time horizon than operational systems



every data item is associated with time

DW: non-volatile

physically separate



operational updates do not occur in the DW: only loading and reading of data

What is OLAP?

Online Analytical Processing



Live Data


Analytical: create reports, multidimensional, time based

OLTP and OLAP

OLTP:


traditional relational DBMS


day-to-day operations



OLAP:


DW, analysis and decision making



Content: current+detailed vs. historic + consolidated


Database design: complex + application vs. star + subject


Perspective: current + local vs. evolutionary + integrated


Access patterns: update vs. read only with complex queries



DBMS: tuned for OLTP, access, index, control


DW: tuned for OLAP, complex OLAP queries, multidimensional view, consolidation



Different functions: decision support requires historic data



Consolidation: aggregation, summarization



Different sources inconsistent data representation



Enterprise DW: entire organization


Data Mart: a subset for specific, selected groups of users; independent vs. dependent data marts

Star Schema

Dimensional data model for RD:



fewer tables compared to normalized model



fact tables: important business measures, sales, revenues etc


core of OLAP cube, what is the right level of granularity



dimension tables: potentially relevant query dimensions: place, time, product, customer


linked to fact tables through keys, include aggregation levels



more redundancy, higher storage requirements, fewer tables, better performance

Snowflake Schema

Extension of Star


Normalisation of dimension tables:


have one table per hierarchy level


more tables


more join operations


lower performance,


less redundancy,


lower storage requirements

Types of Reporting

Standard - fixed frequency


Event-driven


Ad-hoc



Engines:


Graphical tools


Performance Indicators



BI Portals:


visualisation and data access


reduce search cost


all information in one place

OLAP cube

user centric


multi dimensional free navigation



MOLAP:


high performance, fast navigation, use proprietary database, scalability can be a problem, low performance when updating



ROLAP:


create cube on the fly, easy to implement, lower performance, better scalability, high performance when updating



HOLAP: combines both, relational for all data, in addition, multi dimensional store for critical data

Data Mining

BI front end technology


Formal method to preprocess data set


Analyse large data set


Discover patterns


Automated

KDD: Selection

Documentation of available data stores


Review quality, granularity


Selection of candidate data

KDD: Pre-processing

Conversion to standard analysis format


Exploratory data analysis


Aggregation

KDD: Transformation

Handling missing values


Data reduction


Encoding and projection

KDD: Data Mining

Select a data mining model, algo, develop and estimate a model

KDD: interpretation/evaluation

Validity, Reliability and Originality of patterns



Start over

Classification Algorithms

Linear vs. non-linear


Parametric vs. non-parametric


Ensemble: multiple classifier systems


Individual vs. ensemble

Parametric Classifiers

Parametric:


Functional Form


Logistic Regression


Naive Bayes



Non-Parametric:


Decision Tree


Prototype



Semi-Parametric:


Neural Network, SVM

No free lunch theorem

No algorithm is always better in every application



merits:


ease of use


comprehensibility


accuracy


scalability

Prototype Methods

Implicit partitioning of the attribute space into regions assigned to specific classes


-No model estimation


-Only use available data


Mislabeling is a big problem


- Use k nearest neighbors


- Measure distance


Identify k nearest neighbors


Every neighbor votes for class membership

Tree Based Terminology

Root node: all data


Splitter node: question


Internal node: answer, subset


Leaf node: not further split, class prediction

Tree Based classifier


Recursive partitioning approach


find best split


split according to the variable and create a branch for each value


recurse for each subsample


stop when no improvement possible



decisions:


splitting rule


stopping rule

Splitting Rules

objective: increase purity of data


avoid splitting with too few examples



indicators of node impurity:


maximized: all classes equally probable


minimized: all belong to the same class



entropy, Gini



problem with gain:


favors attributes with many distinct values
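
A small sketch of the two impurity indicators (entropy and Gini) for a node's class counts; the counts are illustrative.

    import math

    def entropy(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    print(entropy([5, 5]), gini([5, 5]))    # maximal: classes equally probable
    print(entropy([10, 0]), gini([10, 0]))  # minimal: all cases in the same class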

Stopping Rule

Otherwise: one example per leaf


Model captures noise


overfitting


prune to avoid

Pruning

Pre-pruning:


do not fully grow


minimal gain below threshold


how to select parameters



Post-pruning:


fully grow-tree


cut branches that do not add much

Naïve Bayes Classifier


Models conditional probability of class membership given attribute values


Simplifying assumption:


attributes are conditionally independent



Classify an application into the group with maximal posterior probability



Collect data


Estimate conditional probability per class


Predict class with higher probability



Directed Graph


Arrows indicate class membership causes attribute values with certain probability



Features: Fast and memory efficient


robust towards irrelevant attributes



Generalises easily to more than two classes



Logistic Regression

Direct approach


Problems with linear regression:


violation of assumptions, equal variance, predictions >1, <0, non-normal



Maximum Likelihood



logistic regression creates linear decision boundary

Curse of Dimensionality

easier to separate cases in high dimensional spaces


high dimensionality increase sparseness of attribute spaces


distances become indistinguishable


boundary difficult if most areas are empty


even relatively simple classifiers may overfit when dimensionality is high


amount of data needs to increase exponentially with dimensionality



working with large number of attributes causes problems

Overfitting

Models derive from training sample


the more complex the model, the more the classification error on the training sample approaches zero


new cases not contained in sample


problem: part of what is fitted is noise



perfect fit on training data - remedies: cross-validation, regularisation


Regularization

revises how models are fit


not error minimisation only but:


low training error AND low model complexity



Measure of model complexity

Bias-Variance Tradeoff

Bias: too simple to capture trend


Variance: zero error but unstable, re-estimation leads to new model



hurt accuracy:


simple models, high bias, low variance



complex models: low bias, high variance



Introduce bias to reduce variance, penalize complexity



Complex models overfit, high variance



Regression indicators: large coefficients are associated with unstable models: high dimensionality,


multicollinearity

Dimensions of Model Performance

Accuracy


Justifiability


Comprehensibility


Scalability


Robustness


Calibration


Scalability

Time resources:


training time


prediction time



Consumption of memory resources


Sensitivity to parameters


Parallelization

Robustness

Real-world data noisy


Missing values


wrong data


irrelevant data


Change over time


Concept drift

Comprehensibility

Understand why? Transparency, Blackbox


Difficult to measure

Justifiability

Agrees with prior beliefs and business rules


Drives model acceptance


Difficult to measure

Calibration

Feature of probability prediction:


Overconfidence


Under-confidence


> human decision making


Prediction model poorly calibrated



Translate classifier scores to class membership probability - distorted for some classifiers

Accuracy

How well the model predicts unseen observations


Simulate real-life application


Indicator to measure predictive accuracy:


compare predicted to actual responses

Simulation of Model Application

Resubstitution:


same data for model estimation and assessment


overfitting problem



Split-Sample:


random split in training and test set


easy to implement, fast, approximates error


inefficient, not all data used, high variance



N-fold cross-validation:


N partitions: one partition for validation and the remaining ones for model creation; assess performance and repeat - average the performance estimates


resource consumption


increase robustness - less variance


all examples used for model building
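
A scikit-learn sketch contrasting a split-sample estimate with N-fold cross-validation; the data, classifier and N = 5 are assumptions.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=500, random_state=0)   # toy data
    clf = LogisticRegression(max_iter=1000)

    # split-sample: one random training/test partition
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    print(clf.fit(X_tr, y_tr).score(X_te, y_te))

    # 5-fold cross-validation: average over five train/validate partitions
    print(cross_val_score(clf, X, y, cv=5).mean())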

Confusion Matrix

Specificity: correctly classified bad


Sensitivity: correctly classified good


imbalance problem



cut off value - threshold metrics, compare continuous prediction to a cut-off

Brier Score

Mean squared error of probability predictions; captures accuracy and calibration

Types of accuracy indicators

Threshold


Probability


Ranking

ROC curve

generalisation of confusion matrix


across all cut-offs


compare different classifiers


higher the better


AUC

probability that a randomly chosen bad risk is correctly ranked higher than a randomly chosen good risk
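
A scikit-learn sketch of the ROC curve across all cut-offs and the resulting AUC; the data and model are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)   # toy data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per cut-off
    print(roc_auc_score(y_te, scores))               # area under the ROC curve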

Accuracy Indicator Selection

Different types of predictions:


discrete class predictions


continuous confidence scores


class probability estimates



incompatible between indicators

Neural Network


Finding the connection weights in the network


Non-linear, non-convex optimisation problem


non-deterministic behavior


Iterative training algorithm


different networks given random weight initialization


Classic back propagation


Parameter: learning rate: local and global minima



Large coefficient weights indicate model complexity


complex models are prone to overfit


weight penalty

Tuning

Meta-parameter tuning


adapt the classifier to a given data set


set by analyst:


number of layers, nodes


output layer function


hidden layer activation function


regularisation coefficient

Grid search

Three step approach


define candidate settings for each meta-parameter, then test them



Magnify resolution in promising areas
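
A scikit-learn sketch of the grid idea: define candidate settings per meta-parameter, evaluate every combination with cross-validation, then refine around the best cell; the classifier and grid values are assumptions.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)   # toy data

    grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]}  # candidate settings
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5).fit(X, y)

    print(search.best_params_, search.best_score_)   # best grid cell; magnify resolution around it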

Learning curve analysis

Studies sensitivity of a classifier with respect to sample size



three step approach:


draw small sample


estimate and assess model


increase sample size and repeat



typically, degressive trend, marginal utility of more data decreases with size



can speed things up: find the size where classifier training stabilises


Ensemble Models

Multi-step modelling approach:


develop base models


aggregate their predictions

Why forecast combination increases accuracy

Different views on the same data


Linear and non linear models


Forecast combination gathers information


Asking many experts



Bias variance trade off (combination)


Strength diversity trade off


ensemble margin

Strength-Diversity Trade-Off

Base model strength:


accuracy


Base model diversity:


the extent to which the models' predictions (dis)agree


diversity as forecast correlation



Do not combine identical models



Not possible to be very strong and diverse at the same time - conflict of strength and diversity

Diversity

Class probabilities and class membership: different outputs



develop diversity measure


study behaviour of different measures


no consensus on this idea

Strength Diversity Plot

Scatterplot: each point is a pair



x: average strength (PCC)


y: Correlation

Ensemble Margin

Factor explains ensemble success


+1 for each correct base-model prediction


The larger the margin, the better


-1 (penalty) for each wrong base-model prediction


Weighing possible

Categories of Ensemble Algorithms

Homogeneous



Heterogeneous



Without pruning


With Pruning

Homogeneous Ensemble Classifier

Base model using same algorithm


Inject diversity through training data:


random drawing of cases/variables

Heterogeneous Ensemble Classifiers

Different algorithms


Inject diversity algorithmically


different algos, different meta parameter


Also called multiple-classifier system

Ensemble without pruning

All base models into ensemble

Ensemble with pruning

Candidate base models


Optimize ensemble composition using some search strategy


Put selected base models into ensemble

Homogeneous Algorithms

Bagging


Random Forest


Boosting

Bagging

Work with any classifier


works best with unstable classifiers



bootstrap sampling with replacement


sample includes some cases multiple times


roughly 37% of cases do not appear at all


OOB examples:


facilitate assessing model and variable importance


then: averaging



Bias not altered


Variance reduced



meta parameters: based on algorithm


can use any


how many bootstrap samples


size of bootstrap sample


scalable and generic

Random Forest

Bagging with random subspace


Injects diversity


Bootstrap sample, in addition:


to choose the attribute to split on, draw a random sample of attributes and find the best split among them



increased diversity


random subspace limits access to attributes



tree based tend to overfit, pruning helps


by definition no bias, but high variance


bootstrapping reduces variance



meta parameter: size of random sample of attributes, forest size, size of bootstrap sample



often accurate and easy to scale, estimate variable importance
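
A scikit-learn sketch of a random forest with an out-of-bag score and (Gini) variable importances; the data and meta-parameters are assumptions.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=200,       # forest size (number of bootstrap samples / trees)
        max_features="sqrt",    # random subspace: attributes sampled at each split
        oob_score=True,         # assess on the ~37% out-of-bag cases
        random_state=0,
    ).fit(X, y)

    print(rf.oob_score_)             # OOB accuracy estimate
    print(rf.feature_importances_)   # Gini variable importance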

Boosting

Simple classifier / weak learners


easy to find them


find many


combine many easy classifiers




Check which cases it gets right
second classifier that corrects the errors


Check errors of the ensemble
third classifier that corrects the errors


repeat



weight depends on accuracy



reduce bias and variance


meta-parameters: type of algorithm, often shallow decision trees


embody the idea of a weak classifier


how many iterations


maybe more depending on implementation



vulnerable to overfitting


Multiple Classifier Systems

Simply take the average of preferred classification algorithms



but: different algorithms have different outputs


common scale



stacking: predictions of the base classifiers are correlated, so use a robust 2nd-level classifier (classical regression suffers from multicollinearity)


modelling gets very complex, tuning?



ensemble selection:


build base classifier, select single best, iteratively add one additional and average, continue as long as ensemble accuracy increases


weighting and selecting a base model multiple times are possible



Highly accurate predictions


scalable



relatively complex


resource intensive


model management costly