Use LEFT and RIGHT arrow keys to navigate between flashcards;
Use UP and DOWN arrow keys to flip the card;
H to show hint;
A reads text to speech;
160 Cards in this Set
- Front
- Back
Cluster Objective |
homogeneity within the segment is maximized: average of pairwise distance heterogeneity between segments is maximized: distance between some objects of different clusters
No response variable
Distance measure a way to quantify similarity |
|
Cluster Applications |
Understanding a customer population Text Analytics Fraud detection
|
|
Cluster Approaches |
Hierachical: Agglomerative Divisive
Non-Hierachical: Exclusive Non-exclusive |
|
Properties of Distance Measures |
Always a real number
non-negativiy symmetry identity shortest path to object is straight line
|
|
Distance Measures |
Euclidian Manhatten (City-Block)
Angle between vectors |
|
Distance Measures for non-numeric |
Translate into dummy Hamming distance How many operations to transform a text Graphs, Time Series |
|
General Notion of similarity/distance |
effort to transform one object into another transformation specific costs
|
|
Hierarchical Clustering |
Iterative Algorithm: form cluster solution loop
Divisive: Top-Down, all elements on cluster, split up Decreases heterogeneity in a cluster
Agglomerative: Bottom-Up Every case a cluster, merge clusters Increases heterogeneity in a cluster |
|
Dendogram |
Tree-type graph to depict development of hierarchical clustering
height of line depicts to which extent intra-cluster heterogeneity increases when merged
sharp increase suggests cut
outlier detection possible |
|
How to measure distances between clusters? |
Single linkage: closest objects only small dissimilarities needed chaining
Complete: most dissimilar objects violates closeness property
Compromises:
Average: mean of all pairwise Centroid: distance between centroid Wards: marginal increase in variance when merging
When clusters well separated: similar results |
|
Non-Hierarchical Clustering |
Combinatorial Centroid Distribution based Density based Self-Organising maps |
|
Exclusive Clustering |
Every case assigned to one cluster k-means |
|
K-Means clustering |
Define number of clusters: K Difficult to know: domain knowledge, heuristic approach. Elbow finding strategy
Randomly guess centroids/centers Assign cases to nearest centroid Update centroids Repeat until cluster assignment stops changing
|
|
Non-exclusive clustering |
Soft-cluster assignment: cluster membership probability - like k-Means related: Gaussian Mixture Models
More robust towards outliers Better suited for overlapping distributions |
|
Gaussian Mixture Model Clustering |
Chose K Algorithm models data Each mixture component influences every case Strong influence if case is close to mean vector, small otherwise
Compute association to mixture components Recompute parameters based on responsibilities |
|
Comparison Hierarchical / Non-Hierarchical |
Hierarchical: Cluster follow from dendogram arbitrary distance measures: flexibility, validity? Limited scalability: computational extensie Choose linkage measure
Non-Hierarchical: number of clusters set by user less flexibility in terms of distance measure, geared for numeric data Fairly scalable: k-means converges fast |
|
Association rule mining |
Detect frequent occurring patterns between different items in a transaction: - purchased together - co-occur in a text - chosen together
Discover rules: If A then B Sequence analysis: time dependency |
|
Association rule terminology |
Antecedent Consequent Whereby Item sets ((dis)joint)
2^N-1 combinations of sets: ignore empty set |
|
Association rule measures |
Support: # (XuY) / #all transactions Threshold: minsup, frequent item set
Confidence: Fractions of all transactions:
#(XuY) / #Y Measure of strength of association conditional probability - no causal link
Lift: Interestingness of a rule <1: substitution >1: complementarity
# XuY / X*Y
|
|
Association Rule Applications |
Reveal dependency among clusters
Market basket Next best offer Product bundling Store and shelf layout Product catalog Recommender system Web Usage |
|
Mining Association Rules |
Only strong meaningful rules of interest Strategy: support > minsup >> must be sufficient frequent confidence > minconf
Brute-Force approach: every time could be a set, create support for every set > very complex 2^N-1 |
|
Strategies to find frequent item sets |
Reduce candidates M M=2^N-1
Reduce transactions N Pooling, DHP
Reduce comparisons nM Efficient data structure No compare every candidate to every transaction |
|
Apriori Algorithm |
Property: downward closure if something is frequent, all subsets frequent if item infrequent, all superset also infrequent
Functioning: minsup, minconf two step approach: exploit downward closure, find candidate sets with K items Discover rules > minconf
Issue: Efficiency |
|
Issues with Association Rule Mining |
How to set thresholds Irrelevant / trivial rules Useful only moderate supports Interesting patterns overlooked |
|
Association Rules and Supervised Learning |
Automatic modelling: rules have one consequent, response variable only few antecedents sort by confidence top rules insight for test
Variable selection: mimic correction based filter Identify association rules among categorical variables
Develop rules |
|
Sequential Pattern Mining |
Maximal sequence among all sequences
Order of time in sequence is important
Ass rules: same time intratransaction Seq rules: different times intertransaction
Apriori Algorithm : if k infrequent, k+1 also infrequent |
|
Branches of Web Mining |
Structure Mining: knowledge from hyperlinks (structure of web), key technology in search engines, discover communities of users, no link structure to relations table
Usage Mining: discovery of access patterns from logs, many mining algos, pre-processing of clickstream data in logs to produce data for mining
Content Mining: web page content, classify cluster pages to topics, similar to data mining, discover patterns in pages, extract infos for many purposes, reviews, descriptions etc
|
|
Web Structure Mining |
Extract patterns from hyperlinks that connect pages Examine structure of a single page
Web as a directed graph |
|
Crawlers |
Procedure: download page extract urls repeat for unread urls
Breadth-first search: explore links layer by layer, well-connected pages
Depth-first search: follow first link, subsequent pages
Dynamics: examine page update frequency, re-visit periodically |
|
Search engine |
Page ranking + others Important if others link to it Query independent relevance scores Inbound links cast a vote vote is weighted by sending pages' quality Recursive calculation
HITS (Hypertext, induced, topic search) |
|
Web Usage Mining |
Patters in clickstreams and transcations - interactions with web resource
Sources: logs, meta data, domain knowledge, client side scripting, cookies
Types: Usage, Content, Structure, User |
|
Web Usage Mining Process |
Data collection / preparation, pattern discovery and analysis
Preparation: create DB and fill with data Discovery: typical behavior statistics Analysis: aggregate, input to engines and reports |
|
Data Preparation Tasks |
Pageview identification User identification Sessionization Path Completion Data Integration |
|
Pageview identification |
Single access: multiple entries (images, html) Remove redundant entries Remove: Well-behaved robots / stealth robots |
|
User Identification |
Not in logs: cookies, IP time ,referrer, user agent
|
|
Sessionization |
Segment user activity into sessions: authentication clickstream heuristics
Episode: subset of session, related page views, classification of pageviews |
|
Path completion |
Server log not capture whole navigation path: cache
Knowledge of site structure, referrer information |
|
Data Integration |
Sessionized clickstream: integrate transaction data, demographics, products, commerce data |
|
User-Pageview-Matrix |
If order of views not relevant: clustering and asco rule mining |
|
Term‐Pageview Matrix |
Cluster Pages according to a topic |
|
Content-Enhanced Matrix |
Combine Term and User page view matrix, better description of what users look for |
|
Web Usage Patterns |
Web controlling: session and visitor analysis, OLAP tools Predictive modelling based on transactions
Cross selling, Improve site layout Visitor segmenation |
|
Recommender Systems and Collaborative Filtering |
Purpose: help with information overload
Setting: users and items with features
Task: utility function from data, predict value of item i to user u and recommend
Rating prediction Item prediction (what user likely to buy)
Content based recommendation: similar to past item Collaborative filtering: people with similar taste bought in the past
Hybrid approaches: association rule, nearest neighbour |
|
Web Content Mining |
Extract useful information text mining part of this and vice versa |
|
Text analytics |
Tools and techniques model and structure information
not in databases but in blogs etc>> unstructured |
|
Challenges in Text analytics |
Homomy: same word, diff meaning Synonomy: diff word, same meaning Polysemy: same wird, slightly different meaning Hyponomy: hiearchy concept Misspellings |
|
Text analytics process |
Text > Gather data > corpus > NLP > bag of words > Term matrix > classifcation > knowledge |
|
NLP |
Part of speed tagging: resolve ambiguity Tokenization: split characters in tokens Filtering: dictionaries Stemming |
|
Document Term Matrix |
Describes frequency of terms in a corpus in cells |
|
Inverse Document Frequency |
Ratio of numbers of docs in corpus and number of docs in which a term occurs
Large values: better |
|
Term‐Frequency‐Inverse Document Frequency |
High TF: relevance at document level High IDT: distinctiveness at corpus level TF x IDF matrix < commonly used |
|
N-Grams |
Use multiple words to create features
Bigrams, Trigrams, N-Grams
treated the same way as a single word |
|
Social Media Definition |
Group of internet based application build on ideological and technological foundation of web 2.0
Social Presence / Media Richness and Self-Presentation/ / Self Disclosure Table
Low > High |
|
Graphs |
Pair G = (V,E)
Vertices = Nodes Edges = Links Edges are two element subset of V ( each E connects to Vs)
Node: certain object of interest, edge: connect and describe relationship
|
|
Graph Structures |
Telephone Calls, spread of deseases, collaboration networks, topology, social media
Types:
Directed: one way Undirected: both ways
Bipartite: two sets, linking to each other Unipartite: one set
Weighted: weight assigned to every edge Ego-centered: nodes one-hop neighborhood relevant with high homophily |
|
Graph metrics - distance |
Shortest path length: two nodes, shortest path between them (fewest hops)
Eccentricity: longest shortest path for any node
Diameter: max eccentricity Radius: min eccentricity Periphery: nodes that have eccentricity to diameter |
|
Graph metrics - centrality |
Degree centrality: number of edges converging to a node
Closeness centrality: average distance of a node to all other nodes, between 1 and 0, 1= close, 0= far away
Between centrality: number of shorts paths passing through a node
Eigenvector centrality: measure influence of a node in a network. More important when connected to other important nodes |
|
Graph Neighbor Classification |
Probability that a node does something given the classes of direct neighborhood nodes
Problem: data not iid Cannot split in test and training as graph is interconnected Huge scale
|
|
Graph Neighbor Classification Procedure |
Local model: only node-spefific characteristics
Relational model: connections in the network to do the inference
Collective inference: how unknown nodes are estimated together influencing each other
Markov property: class membership only depends in the class of one-hop neighborhood
Homophily assumption: node tend to connect to nodes with similar characteristics
Extension with probabilistic RNC: All nodes have probabilities, unknown class average of probabilities |
|
Relational Logistic Regression (Graph) |
Using node specific and network specific characteristics
Mind correlation of both > stepwise regression
Featurization: features might measure behaviour of neighbouring nodes or local characteristics |
|
Collective Inferencing |
Given a network based on a local and relational model, way to inference class/probabilities of unknown nodes Takes into account: nodes influence each other
|
|
Gibbs sampling |
Initialise with local classifier Sample class value of each node Generate random ordering Apply relational learner Sample again Repeat iterations counting time each class is assigned Normalize to give final class probability |
|
Facebook Bidding |
Eye-catching spot most expensive Price based on magnitude and popularity of audience Virtual auction decides which advertisement is placed where Competitive market on website |
|
Link Prediction |
Friends, Admirers, Idols, Neutrals, Enemies
Homophily: connect to similar people Similarity: demographics, behaviours, interest, brands
link prediction analytics enables to target in a more efficient way. Customer churn
|
|
Tasks in Data Preprocessing |
Feature Selection Sampling Missing Values Discretization Encoding Standardizing Data Outliers Target Definition |
|
Data Preprocessing Process |
Selection Cleaning Transformation Reduction |
|
Cleanining |
Noisy Data Missing Values: keep, delete, replace Outliers |
|
Imputation Methods |
Replace with mean / modal value Regression / Tree based imputation Nearest Neighbor |
|
Outliers Types |
Valid Entry error Difficult to handle: univariate, multivariate Detect vs. treatment |
|
Outlier treatment |
Univariate: Boxplot, Histogram
IQR: Q3-Q1
1.5 - 3 IQR: mild outlier Extreme > 3 IQR away from Q1/Q3
z scores: outliers away 3 std from mean
Multi: Clustering, Regression, RF
|
|
Data Transformation |
Continuous:
Scaling: distance computations biased: different value ranges - often useful skewed value distribution - log transfomation
Discretization: avoid negative impact of outliers, increase comprehensibility, capture non-linear effects, loss of information
Unsupervised binning (EF, EW) Supervised binning (Entropy, decision trees)
Categorical: Dummy not encode as numbers for regression> meaningless, artificial distances N-1 dummy variables - explosion of dimensionality
Binary: better -1 and 1 when using neural networks
WOE: ln(fraction of bad in category / fraction of good in category)
IV: measure of predictive power, assess suitability , 0.3+: strong, 0.1-03: med, 0.02-0.1: weak |
|
Data Reduction |
Horizontal: Sampling of cases Reduce computational cost class imbalance
Vertical reduction: variable selection feature extraction cure of dimensionality computational cost |
|
Random sampling |
Increase speed of modelling Higher Variance Less accuracy Use all data |
|
Advanced Sampling |
Stratified to keep Good/Bad ratio constant Classification model geared towards majority group
Undersampling: random deletion of cases, reduce training time, discard useful info
Oversampling: random duplicate minority class, no info loss, increased training time, beats undersampling
SMOTE: creates artificial minority class cases |
|
Variable Selection |
Reduce: easier to understand, cost of collection, accuracy
Find best subset: multivariate, combinatorial problem |
|
Wrapper / Filter |
Wrapper: better performance, multivariate, higher comp. cost, auxiliary validation data to assess variable subset
iteratively model while manipulating set of variables, forward selection, backward elimination
Filter: often less effective, only an indicator, model agnostic, univariate, lower comp. cost
Use statistical indicator, correlation, Fisher score, IV, one variable at a time,
often univariate: redundancy, interaction |
|
Random Forest Variable Importance |
OOB variable importance: importance scores like z-values (or probabilities)
Gini variable importance: specific to random Forest not readily interpretable
Agree often both |
|
Feature Extraction |
Turn variables into new
Linear / nonlinear projection Clever transformation
Highly effective but costly
PCA, Genetic algorithm |
|
Business Analytics used for |
data information technology statistical analysis QM
to help managers improved insight about operations and maker better decisions
unlock valuable information hidden in data value proposition: help firms to achieve strategic objectives
ERP: record data BI: inform managerial decision making BA: may even replace manager
Explanatory Approach: explore business value of BA Design approach: new methods for BA |
|
Predecessors of BA |
Man. MIS: reporting Exec. EIS: specific needs of senior management, visual Dec. DSS: formal model and algos, operational management Man. MSS: all of the above |
|
Scope of BA |
Descriptive Analysis: past and present Predictive Analysis: past and predict future (e.g. demand) Prescriptive: not only anticipate future but make recommendations (e.g. prices) |
|
Define business intelligence |
Concepts and methods to improve business decision making by using fact-based support systems
|
|
What is a database? |
Collection of data exist of a long period of time
managed by DBMS: create and manage data, allow to safely persist over a long period of time |
|
Data storage types |
Volatile Storage: Cache, Memory Nonvolatile Storage: Disk, File-Syste, Tertiary Storage DBMS, Tapes, DVDs |
|
What does a DBMS do? |
Create new DB Store large amounts of data Query data Modify Data Efficient access Durability Access control: isolation, atomicity, security req. |
|
Relational model |
Based on set theory and relational algebra, two-dimensional tables with rows and columns Interaction with SQL |
|
Relational DB |
Collection of items organised as a set which can be accessed Data values can be numeric, strings, etc Tables can join and morph new tables
Advantages:
Time-proven, ACID: atomicity consistency isolation: same results irrespective of serial execution durability: once committed, it remains so |
|
Transactions |
Group of DB operations ACID |
|
DB architectures |
Two tier: client-server
Three-tier: Web server, Application Server (business logic), Database server (DBMS)
Redundancy possible |
|
Cloud computing |
Network-based computing over the internet
provides data storage, use internet as transport, many client at same time
set of integrated and networked hardware
hide the complexity of underlying infrastructure
on demand service
pay for use, elastic / scaling
available for general public
cloud storage: access remotely only temporarily cached in computer |
|
Challenges of RDBMS |
Diversity Connectivity Data Size |
|
NoSQL movement |
not using a SQL language no schema to follow, no constraint to new data open source, large clusters |
|
Information Integration |
combining information contained in different and heterogenous databases into one large repository
in companies, each division own database e.g.
different DBMS, different terms and things, legacy applications
necessary to build a structure on top of existing DBs |
|
Data Warehouse vs. Data Federation |
Information from many databases is copied periodically with appropriate translation, central DB
Data Federation: help of a mediator, middleware, support integrated model of data of various databases, while translating between model and actual models used by each database
Storage remain distributed |
|
What is a Data Warehouse? |
database is maintained separately of operational databases support decision making by solid consolidation and historic data analysis
Warehousing: constructing and using warehouses |
|
DW: subject oriented |
major subjects: customer, product etc
simple and concise view, excluding unuseful data
data for decision makers, not daily transactions |
|
DW: integration |
Integration of multiple, heterogenous sources
Data cleaning and conversion applied
Ensure quality and consistency in naming conventions, structures, measures among different sources |
|
DW: time variant |
larger time horizon of operational system
every data item is associated with time |
|
DW: non-volatile |
physically separate
operational not in DW: only loading and reading of data |
|
What is OLAP? |
Online Analytical Processing
Live Data Analytical: create reports, multidimensional, time based |
|
OLTP and OLAP |
OLTP: traditional relational DBMS day-to-day operations
OLAP: DW, analysis and decision making
Content: current+detailed vs. historic + consolidated Database design: complex + application vs. star + subject Perspective: current + local vs. evolutionary + integrated Access patterns: update vs. read only with complex queries
DBMS: tuned for OLTP, access, index, control DW: tuned for OLAP, complex OLAP queries, multidimensional view, consolidation
Different functions: decision support requires historic data
Consolidation: aggregation, summarization
Different sources inconsistent data representation
Enterprise DW: entire organization Data Mart. subset, specific groups of users, confined to specific, selected groups, independent vs. dependent data marts |
|
Star Schema |
Dimensional data model for RD:
fewer tables compared to normalized model
fact tables: important business measures, sales, revenues etc core of OLAP cube, what is the right level of granularity
dimension tables: potentially relevant query dimensions: place, time, product, customer linked to fact tables through keys, include aggregation levels
more redundancy, higher storage requirements, fewer tables, better performance |
|
Snowflake Schema |
Extension of Star Normalisation of dimension tables: have on table per hierarchy more tables more join operations less performance, less redundancy, less storage requirements |
|
Types of Reporting |
Standard - fixed frequency Event - driven Ad-hoc
Engines: Graphical tools Performance Indicators
BI Portals: visualisation and data access reduce search cost all information in one place |
|
OLAP cube |
user centric multi dimensional free navigation
MOLAP: high performance, fast navigation, use proprietary database, scalability can be a problem, low performance when updating
ROLAP: create cube on the fly, easy to implement, lower performance, better scalability, high performance when updating
HOLAP: combines both, relational for all data, in addition, multi dimensional store for critical data |
|
Data Mining |
BI front end technology Formal method to preprocess data set Analyse large data set Discover patterns Automated |
|
KDD: Selection |
Documentation of available data stores Review quality, granularity Selection of candidate data |
|
KDD: Pre-processing |
Conversion to standard analysis format Exploratory data analysis Aggregation |
|
KDD: Transformation |
Handling missing values Data reduction Encoding and projection |
|
KDD: Data Mining |
Select a data mining model, algo, develop and estimate a model |
|
KDD: interpretation/evaluation |
Validity, Reliability and Originality of patterns
Start over |
|
Classification Algorithms |
Non ( Linear) Non (Paramtetric) Ensemble - multiple classifier systems Individual - ensemble |
|
Parametric Classifiers |
Parametric: Functional Form Logistic Regression Naiv Bayes
Non-Parametric: Decision Tree Protoype
Semi-Parametric: Neutral Network, SVM |
|
No free lunch theorem |
No algorithm always better in any application
merits: ease of use comprehensibility accuracy scalability |
|
Prototype Methods |
Implicit partitioning of the attribute space into regions assigned to specific classes -No model estimation -Only use available data Mislabeling is a big problem - Use k nearest neighbors - Measure distance Identify k nearest neighbors Every neighbor votes for class membership |
|
Tree Based Terminology |
Root node: all data Splitter node: question Internal node: answer, subset Leaf node: not further split, class prediction |
|
Tree Based classifier |
Recursive partitioning approach find best split split according to the variable and create a branch for each value recurse for each subsample stop when no improvement possible
decisions: splitting rule stopping rule |
|
Splitting Rules |
objective: increase purity of data avoid splitting with too few examples
indicators of node impurity: maximized: all classes equally probable minimized: all belong to the same class
entropy, giny
problem with gain: favors attributes with many distinct values |
|
Stopping Rule |
Otherwise: one example per leaf Models with noise overfitting prune to avoid |
|
Pruning |
Pre-pruning: do not fully grow minimal gain below threshold how to select parameters
Post-pruning: fully grow-tree cut branches that do not add much |
|
Naïve Bayes Classifier |
Models conditional probability of class membership given attribute values Simplifying assumption: attributes are conditionally independent
Classify an application into the group with maximal posterior probability
Collect data Estimate conditional probability per class Predict class with higher probability
Directed Graph Arrows indicate class membership causes attribute values with certain probability
Features: Fast and memory efficient robust towards irrelevant attributes
Generalises easily to more than two classes
|
|
Logistic Regression |
Direct approach Problems with linear regression: violation of assumptions, equal variance, predictions >1, <0, non-normal
Maximum Likelihood
logistic regression creates linear decision boundary |
|
Curse of Dimensionality |
easier to separate cases in high dimensional spaces high dimensionality increase sparseness of attribute spaces distances become indistinguishable boundary difficult if most areas are empty even relatively simple classifies may overfit when dimensionality is high amount of data needs to increase exponentially with dimensionality
working with large number of attributes causes problems |
|
Overfitting |
Models derive from training sample more complex, the more the classification error of training sample approaches zero new cases not contained in sample problem! but is is noise
perfect fit for training - remedy: cross validation - regularisation
|
|
Regularization |
revises practices to models not error minimisation only but: low training error AND low model complexity
Measure of model complexity |
|
Bias-Variance Tradeoff |
Bias: too simple to capture trend Variance: zero error but unstable, re-estimation leads to new model
hurt accuracy: simple models, high bias, low variance
complex models: low bias, high variance
Introduce bias to reduce variance, penalize complexity
Complex models overfit, high variance
Regression indicators: large coefficient associated with unstable models: high dimensionality multicollinearity |
|
Dimensions of Model Performance |
Accuracy Justifiability Comprehensibility Scaleability Robustness Calibration
|
|
Scalability |
Time resources: training time prediction time
Consumption of memory resources Sensitivity to parameters Parallelization |
|
Robustness |
Real-world data noisy Missing values wrong data irrelevant data Change over time Concept drift |
|
Comprehensibility |
Understand why? Transparency, Blackbox Difficult to measure |
|
Justifiability |
Agrees with prior beliefs and business rules Drives model acceptance Difficult to measure |
|
Calibration |
Feature of probability prediction: Overconfidence Under-confidence > human decision making Prediction model poorly calibrated
Translate classifier scores to class membership probability - distorted for some classifiers |
|
Accuracy |
How predicts unseen observations Simulate real-life application Indicator to measure predictive accuracy: compare predicted to actual responses |
|
Simulation of Model Application |
Resubstitution: same data for model estimation and assessment overfitting problem
Split-Sample: random split in training and test set easy to implement, fast, approximates error inefficient, not all data used, high variance
N-fold cross-validation: N partitions, model validation and remaining for model creation, assess performance and repeat - average performance estimates resource consumption increase robustness - less variance all examples used for model building |
|
Confusion Matrix |
Specificity: correctly classified bad Sensitivity: correctly classified good imbalance problem
cut off value - threshold metrics, compare continuous prediction to a cut-off |
|
Brier Score |
Mean Square Error, predicts accuracy and calibration |
|
Types of accuracy indicators |
Threshold Probability Ranking |
|
ROC curve |
generalisation of confusion matrix across all cut-offs compare different classifiers higher the better
|
|
AUC |
probability that a randomly chosen bad risk is correctly ranked higher than a randomly chosen good risk |
|
Accuracy Indicator Selection |
Different types of predictions: discrete class predictions continuous confidence scores class probability estimates
incompatible between indicators |
|
Neural Network |
Finding the connection weights in the network Non-liner, non-convex optimisation problem non-deterministic behavior Iterative training algorithm different networks given random weight initalization Classic back propagation Parameter: learning rate: local and global minima
Large coefficient weights indicate model complexity complex models prone to overfite weight penalty |
|
Tuning |
Meta-parameter tuning adapt the classifier to a given data set set by analyst: number of layers, nodes output layer function hidden layer activation function regularisation coefficient |
|
Grid search |
Three step approach define candiate setting, test, for each meta parameter
Magnify resolution in promising areas |
|
Learning curve analysis |
Studies sensitivity of a classifier with respect to sample size
three step approach: draw small sample estimate and assess model increase sample size and repeat
typically, degressive trend, marginal utility of more data decreases with size
can speed things up: find size wehre classifier training stabilises
|
|
Ensemble Models |
Multi-step modelling approach: develop a base model aggregate their predictions |
|
Why forecasting increases accuracy |
Different vies on same data Linear and non linear models Forecast combination gathers information Asking my experts
Bias variance trade off (combination) Strength diversity trade off ensemble margin |
|
Strength-Diversity Trade-Off |
Base model strength: accuracy Base model diversity: how the predictions of models agree diversity as forecast correlation
Do not combine identical models
Not possible to be very strong and diverse at the same time - conflict of strength and diversity |
|
Diversity |
Class probabilities and class membership: different outputs
develop diversity measure study behaviour of different measures no consensus on this idea |
|
Strength Diversity Plot |
Scatterplot: each point is a pair
x: average strength (PCC) y: Correlation |
|
Ensemble Margin |
Factor explains ensemble success +1 The larger the better Penalty for each wrong prediction -1 Weighing possible |
|
Categories of Ensemble Algorithms |
Homogenous
Heterogenous
Without pruning With Pruning |
|
Homogeneous Ensemble Classifier |
Base model using same algorithm Inject diversity through training data: random drawing of cases/variables |
|
Heterogenous Ensemble classifiers |
Different algorithms Inject diversity algorithmically different algos, different meta parameter Also called multiple-classifier system |
|
Ensemble without pruning |
All base models into ensemble |
|
Ensemble with pruning |
Candidate base models Optimize ensemble composition using some search strategy Put selected base models into ensemble |
|
Homogenous Algorithm |
Bagging Random Forest Boosting |
|
Bagging |
Work with any classifier best use unstable
bootstrap sampling with replacement sample includes some cases multiple times 37% do not appear at all OOB examples: facilitate assessing model and variable importance then: averaging
Bias not altered Variance reduced
meta parameters: based on algorithm can use any how many bootstrap samples size of bootstrap sample scaleable and generic |
|
Random Forest |
Bagging with random subspace Injects diversity Bootstrap sample, in addition: chose attribute to split data, draw random sample of attributes, find best split from attributes within sample
increased diversity random subspace limits access to attributes
tree based tend to overfit, pruning helps be definition no bias, but high variance bootstrapping reduces variance
meta parameter: size of random sample of attributes, forest size, size of bootstrap sample
often accurate and easy to scale, estimate variable importance |
|
Boosting |
Simple classifier / weak learners easy to find them find many combine many easy classifiers
Check which cases it gets right Check errors of the ensemble repeat
weight depends on accuracy
reduce bias and variance meta-parameters: type of algorithm, often shallow decision trees embody idea of week classifier how many iterations maybe more depending on implementation
vulnerable to overfitting
|
|
Multiple Classifier Systems |
Simply take the average of preferred classification algorithms
but: different algorithms have different outputs common scale
stacking: predictions of base classifiers are correlated, use robust 2nd level classifier (classical suffer from multicollinearity) modelling gets very complex, tuning?
ensemble selection: build base classifier, select single best, iteratively add one additional and average, continue as long as ensemble accuracy increases weighing and multiple times possible
Highly accurate predictions scalable
relative complex resource intensive model management costly |