Cluster Objective

homogeneity within a segment is maximized: small average pairwise distance

heterogeneity between segments is maximized: distance between some objects of different clusters

No response variable

Distance measure: a way to quantify similarity

Cluster Applications

Understanding a customer population

Text Analytics

Fraud detection

Cluster Approaches







Properties of Distance Measures

Always a real number




shortest path to object is straight line

Distance Measures


Manhattan (City-Block)

Angle between vectors
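
A minimal sketch of the measures listed above. The Euclidean distance is an assumption added for comparison (it is not named in these notes); the function names are illustrative only.

```python
import math

def euclidean(a, b):
    # Straight-line distance (assumed here as the usual baseline measure)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors (1 = same direction)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```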

Distance Measures for non-numeric

Translate into dummy

Hamming distance

How many operations to transform a text

Graphs, Time Series
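
A small sketch of two measures for non-numeric data. It assumes that "how many operations to transform a text" refers to the edit (Levenshtein) distance with unit costs for insert, delete and substitute; that reading is an assumption.

```python
def hamming(a, b):
    # Number of positions at which two equal-length sequences differ
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(s, t):
    # Levenshtein distance via dynamic programming, one row at a time
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

print(hamming("karolin", "kathrin"))       # 3
print(edit_distance("kitten", "sitting"))  # 3
```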

General Notion of similarity/distance

effort to transform one object into another

transformation specific costs

Hierarchical Clustering

Iterative Algorithm:

form cluster solution



Divisive (top-down): all elements start in one cluster, which is then split up

Decreases heterogeneity in a cluster



Agglomerative (bottom-up): every case starts as its own cluster, then clusters are merged

Increases heterogeneity in a cluster


Dendrogram: tree-type graph to depict development of hierarchical clustering

height of line depicts to which extent intra-cluster heterogeneity increases when merged

sharp increase suggests cut

outlier detection possible

How to measure distances between clusters?

Single linkage: closest objects

only small dissimilarities needed


Complete: most dissimilar objects

violates closeness property


Average: mean of all pairwise

Centroid: distance between centroid

Ward's: marginal increase in variance when merging

When clusters well separated: similar results

Non-Hierarchical Clustering



Distribution based

Density based

Self-Organising maps

Exclusive Clustering

Every case assigned to one cluster


K-Means clustering

Define number of clusters: K

Difficult to know: domain knowledge, heuristic approach. Elbow finding strategy

Randomly guess centroids/centers

Assign cases to nearest centroid

Update centroids

Repeat until cluster assignment stops changing
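
The steps above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: squared Euclidean distance, the seed, and the choice of initial centroids among the data points are assumptions.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial guess
    assignment = None
    for _ in range(max_iter):
        # Assign every case to its nearest centroid
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:  # assignments stopped changing
            break
        assignment = new_assignment
        # Update each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Different seeds can give different label numbers for the same grouping, so compare groupings rather than raw labels.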

Non-exclusive clustering

Soft-cluster assignment:

cluster membership probability - like k-Means

related: Gaussian Mixture Models

More robust towards outliers

Better suited for overlapping distributions

Gaussian Mixture Model Clustering

Choose K

Algorithm models data

Each mixture component influences every case

Strong influence if case is close to mean vector, small otherwise

Compute association to mixture components

Recompute parameters based on responsibilities

Comparison Hierarchical / Non-Hierarchical


Clusters follow from dendrogram

arbitrary distance measures: flexibility, validity?

Limited scalability: computationally expensive

Choose linkage measure


number of clusters set by user

less flexibility in terms of distance measure, geared for numeric data

Fairly scalable: k-means converges fast

Association rule mining

Detect frequently occurring patterns between different items in a transaction:

- purchased together

- co-occur in a text

- chosen together

Discover rules: If A then B

Sequence analysis: time dependency

Association rule terminology




Item sets ((dis)joint)

2^N-1 combinations of sets: ignore empty set

Association rule measures

Support: # (XuY) / #all transactions

Threshold: minsup, frequent item set

Confidence of rule X -> Y: fraction of transactions containing X that also contain Y:

#(XuY) / #X

Measure of strength of association

conditional probability - no causal link

Lift: Interestingness of a rule

<1: substitution

>1: complementarity

support(XuY) / (support(X) * support(Y))
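
A worked sketch of the three measures for a rule X -> Y. The transactions are made-up illustration data, and the function names are hypothetical.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    # Share of all transactions that contain every item of the set
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    # Estimated P(Y | X): support of X and Y together, relative to support of X
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    # Confidence relative to the baseline frequency of Y
    return confidence(x, y, transactions) / support(y, transactions)

rule = ({"diapers"}, {"beer"})
print(support(rule[0] | rule[1], transactions))
print(confidence(*rule, transactions))
print(lift(*rule, transactions))  # above 1 suggests complementarity
```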

Association Rule Applications

Reveal dependency among clusters

Market basket

Next best offer

Product bundling

Store and shelf layout

Product catalog

Recommender system

Web Usage

Mining Association Rules

Only strong meaningful rules of interest


support > minsup >> must be sufficiently frequent

confidence > minconf

Brute-Force approach: every item combination could be a set, compute support for every set > very complex: 2^N-1 candidates

Strategies to find frequent item sets

Reduce candidates M


Reduce transactions N

Pooling, DHP

Reduce comparisons nM

Efficient data structure

No need to compare every candidate to every transaction

Apriori Algorithm


downward closure

if something is frequent, all subsets frequent

if an item set is infrequent, all its supersets are also infrequent

Functioning: minsup, minconf

two step approach: exploit downward closure, find candidate sets with K items

Discover rules > minconf

Issue: Efficiency
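
A compact sketch of the first step (finding frequent item sets with downward closure). Rule generation against minconf is left out, and the data and names are illustrative assumptions.

```python
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # candidate 1-item sets
    frequent = {}
    k = 1
    while level:
        # Keep only candidates that reach the minimum support
        current = {c: s for c, s in ((c, sup(c)) for c in level) if s >= minsup}
        frequent.update(current)
        k += 1
        # Join frequent sets into k-item candidates ...
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # ... and prune any candidate with an infrequent (k-1)-subset
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_sets = apriori(baskets, minsup=0.6)
for itemset, s in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```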

Issues with Association Rule Mining

How to set thresholds

Irrelevant / trivial rules

Useful only moderate supports

Interesting patterns overlooked

Association Rules and Supervised Learning

Automatic modelling:

rules have one consequent, response variable

only few antecedents

sort by confidence

top rules insight for test

Variable selection:

mimic correlation-based filter

Identify association rules among categorical variables

Develop rules

Sequential Pattern Mining

Maximal sequence among all sequences

Order of time in sequence is important

Association rules: same time, intra-transaction

Sequence rules: different times, inter-transaction

Apriori principle: if a sequence of length k is infrequent, every length k+1 sequence containing it is also infrequent

Branches of Web Mining

Structure Mining: knowledge from hyperlinks (structure of web), key technology in search engines, discover communities of users, no link structure to relations table

Usage Mining: discovery of access patterns from logs, many mining algos, pre-processing of clickstream data in logs to produce data for mining

Content Mining: web page content, classify cluster pages to topics, similar to data mining, discover patterns in pages, extract infos for many purposes, reviews, descriptions etc

Web Structure Mining

Extract patterns from hyperlinks that connect pages

Examine structure of a single page

Web as a directed graph



download page

extract urls

repeat for unread urls

Breadth-first search: explore links layer by layer, well-connected pages

Depth-first search: follow first link, subsequent pages

Dynamics: examine page update frequency, re-visit periodically

Search engine

Page ranking + others

Important if others link to it

Query independent relevance scores

Inbound links cast a vote

vote is weighted by sending pages' quality

Recursive calculation

HITS (Hyperlink-Induced Topic Search)
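
The recursive vote calculation for page ranking can be sketched as a simple iteration. The damping factor of 0.85, the fixed iteration count and the toy link graph are assumptions, not values from these notes.

```python
def pagerank(links, damping=0.85, iterations=50):
    # links: {page: [pages it links to]}; every target must also be a key
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its (weighted) vote among its outbound links
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Page without outbound links: spread its vote evenly
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```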

Web Usage Mining

Patterns in clickstreams and transactions - interactions with web resource

Sources: logs, meta data, domain knowledge, client side scripting, cookies

Types: Usage, Content, Structure, User

Web Usage Mining Process

Data collection / preparation, pattern discovery and analysis

Preparation: create DB and fill with data

Discovery: typical behavior statistics

Analysis: aggregate, input to engines and reports

Data Preparation Tasks

Pageview identification

User identification


Path Completion

Data Integration

Pageview identification

Single access: multiple entries (images, html)

Remove redundant entries

Remove: Well-behaved robots / stealth robots

User Identification

Not in logs:

cookies, IP, time, referrer, user agent


Segment user activity into sessions:


clickstream heuristics

Episode: subset of session, related page views,

classification of pageviews

Path completion

Server log does not capture the whole navigation path: cache

Knowledge of site structure, referrer information

Data Integration

Sessionized clickstream:

integrate transaction data, demographics, products, commerce data


If order of views not relevant: clustering and association rule mining

Term‐Pageview Matrix

Cluster Pages according to a topic

Content-Enhanced Matrix

Combine Term and User page view matrix, better description of what users look for

Web Usage Patterns

Web controlling: session and visitor analysis, OLAP tools

Predictive modelling based on transactions

Cross selling, Improve site layout

Visitor segmentation

Recommender Systems and Collaborative Filtering

Purpose: help with information overload

Setting: users and items with features

Task: utility function from data, predict value of item i to user u and recommend

Rating prediction

Item prediction (what user likely to buy)

Content based recommendation: similar to past item

Collaborative filtering: people with similar taste bought in the past

Hybrid approaches: association rule, nearest neighbour

Web Content Mining

Extract useful information

text mining part of this and vice versa

Text analytics

Tools and techniques model and structure information

not in databases but in blogs etc. >> unstructured

Challenges in Text analytics

Homonymy: same word, different meaning

Synonymy: different word, same meaning

Polysemy: same word, slightly different meaning

Hyponymy: hierarchy of concepts


Text analytics process

Text > Gather data > corpus > NLP > bag of words > Term matrix > classification > knowledge


Part-of-speech tagging: resolve ambiguity

Tokenization: split characters in tokens

Filtering: dictionaries


Document Term Matrix

Describes frequency of terms in a corpus

in cells

Inverse Document Frequency

Ratio of numbers of docs in corpus and number of docs in which a term occurs

Large values: better

Term‐Frequency‐Inverse Document Frequency

High TF: relevance at document level

High IDF: distinctiveness at corpus level

TF x IDF matrix < commonly used
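
A minimal sketch of a TF x IDF matrix. It assumes whitespace tokenization, raw term counts as TF, and the common log-scaled IDF, log(N / document frequency); the notes describe IDF only as a ratio, so the logarithm is an assumption.

```python
import math
from collections import Counter

def tf_idf(docs):
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    matrix = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within the document
        matrix.append({term: tf[term] * idf[term] for term in tf})
    return matrix, idf

docs = ["the cat sat", "the dog sat", "the cat chased the dog"]
matrix, idf = tf_idf(docs)
print(matrix[2])
```

A term that occurs in every document ("the") gets weight 0, while a term unique to one document ("chased") gets the highest IDF.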


Use multiple words to create features

Bigrams, Trigrams, N-Grams

treated the same way as a single word

Social Media Definition

Group of internet-based applications built on the ideological and technological foundations of Web 2.0

Social Presence / Media Richness and Self-Presentation/ / Self Disclosure Table

Low > High


Pair G = (V,E)

Vertices = Nodes

Edges = Links

Edges are two-element subsets of V (each edge connects two vertices)

Node: certain object of interest, edge: connect and describe relationship

Graph Structures

Telephone calls, spread of diseases, collaboration networks, topology, social media


Directed: one way

Undirected: both ways

Bipartite: two sets, linking to each other

Unipartite: one set

Weighted: weight assigned to every edge

Ego-centered: nodes one-hop neighborhood

relevant with high homophily

Graph metrics - distance

Shortest path length: two nodes, shortest path between them (fewest hops)

Eccentricity: longest shortest path from a node to any other node

Diameter: max eccentricity

Radius: min eccentricity

Periphery: nodes whose eccentricity equals the diameter
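
These distance metrics can be sketched with breadth-first search. The sketch assumes an unweighted, undirected, connected graph stored as an adjacency dictionary; the example graph is made up.

```python
from collections import deque

def shortest_path_lengths(graph, source):
    # Breadth-first search: fewest hops from source to every reachable node
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def eccentricities(graph):
    # Longest shortest path starting at each node
    return {v: max(shortest_path_lengths(graph, v).values()) for v in graph}

# Path graph A - B - C - D
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
ecc = eccentricities(graph)
diameter = max(ecc.values())
radius = min(ecc.values())
periphery = [v for v in graph if ecc[v] == diameter]
print(ecc, diameter, radius, periphery)
```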

Graph metrics - centrality

Degree centrality: number of edges converging to a node

Closeness centrality: inverse of the average distance of a node to all other nodes, between 0 and 1, 1 = close, 0 = far away

Betweenness centrality: number of shortest paths passing through a node

Eigenvector centrality: measure influence of a node in a network. More important when connected to other important nodes

Graph Neighbor Classification

Probability that a node does something given the classes of direct neighborhood nodes

Problem: data not iid

Cannot split in test and training as graph is interconnected

Huge scale

Graph Neighbor Classification Procedure

Local model: only node-specific characteristics

Relational model: connections in the network to do the inference

Collective inference: how unknown nodes are estimated together influencing each other

Markov property: class membership only depends on the classes of the one-hop neighborhood

Homophily assumption: nodes tend to connect to nodes with similar characteristics

Extension with probabilistic RNC:

All nodes have probabilities, unknown class average of probabilities

Relational Logistic Regression (Graph)

Using node specific and network specific characteristics

Mind correlation of both > stepwise regression

Featurization: features might measure behaviour of neighbouring nodes or local characteristics

Collective Inferencing

Given a network based on a local and relational model, a way to infer class/probabilities of unknown nodes

Takes into account: nodes influence each other

Gibbs sampling

Initialise with local classifier

Sample class value of each node

Generate random ordering

Apply relational learner

Sample again

Repeat iterations, counting the number of times each class is assigned

Normalize to give final class probability

Facebook Bidding

Eye-catching spot most expensive

Price based on magnitude and popularity of audience

Virtual auction decides which advertisement is placed where

Competitive market on website

Link Prediction

Friends, Admirers, Idols, Neutrals, Enemies

Homophily: connect to similar people

Similarity: demographics, behaviours, interest, brands

Link prediction analytics enables more efficient targeting.

 Customer churn
 Customer acquisition

Tasks in Data Preprocessing

Feature Selection


Missing Values



Standardizing Data


Target Definition

Data Preprocessing Process






Noisy Data

Missing Values:

keep, delete, replace


Imputation Methods

Replace with mean / modal value

Regression / Tree based imputation

Nearest Neighbor

Outliers Types


Entry error

Difficult to handle: univariate, multivariate

Detect vs. treatment

Outlier treatment

Univariate: Boxplot, Histogram

IQR: Q3-Q1

1.5 - 3 IQR: mild outlier

Extreme > 3 IQR away from Q1/Q3

z scores: outliers are more than 3 standard deviations away from the mean

Multi: Clustering, Regression, RF
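
A sketch of the two univariate rules above. The quartile convention (`method="inclusive"`), the use of the population standard deviation, and the function names are assumptions; other quartile definitions give slightly different fences.

```python
import statistics

def classify_iqr(values):
    # Label each value by its distance beyond Q1/Q3, measured in IQR units
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for v in values:
        distance = max(q1 - v, v - q3, 0)
        if distance > 3 * iqr:
            labels.append("extreme")
        elif distance > 1.5 * iqr:
            labels.append("mild")
        else:
            labels.append("normal")
    return labels

def zscore_outliers(values, threshold=3.0):
    # Values more than `threshold` standard deviations from the mean
    # (assumes the values are not all identical)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

data = [10, 12, 13, 14, 15, 16, 17, 25, 40]
iqr_labels = classify_iqr(data)
print(list(zip(data, iqr_labels)))
print(zscore_outliers([0] * 20 + [100]))
```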

Data Transformation


Scaling: distance computations biased: different value ranges - often useful

skewed value distribution - log transformation


avoid negative impact of outliers, increase comprehensibility, capture non-linear effects, loss of information

Unsupervised binning (EF, EW)

Supervised binning (Entropy, decision trees)

Categorical: Dummy

not encode as numbers for regression> meaningless, artificial distances

N-1 dummy variables - explosion of dimensionality

Binary: better -1 and 1 when using neural networks

WOE: ln(fraction of bad in category / fraction of good in category)

IV: measure of predictive power, assess suitability, 0.3+: strong, 0.1-0.3: medium, 0.02-0.1: weak
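
A sketch of WOE and IV for one categorical variable, following the WOE definition above (bad over good). It assumes "fraction of bad in category" means the share of all bads that fall into the category, and that IV is the sum of (share of bad - share of good) * WOE over categories. The counts are invented and must be non-zero.

```python
import math

def woe_iv(counts):
    # counts: {category: (number_of_goods, number_of_bads)}
    total_good = sum(g for g, _ in counts.values())
    total_bad = sum(b for _, b in counts.values())
    woe, iv = {}, 0.0
    for category, (good, bad) in counts.items():
        share_good = good / total_good
        share_bad = bad / total_bad
        woe[category] = math.log(share_bad / share_good)
        iv += (share_bad - share_good) * woe[category]
    return woe, iv

age_counts = {"young": (100, 50), "middle": (300, 30), "old": (100, 20)}
woe, iv = woe_iv(age_counts)
print(woe)
print(iv)
```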

Data Reduction


Sampling of cases

Reduce computational cost

class imbalance

Vertical reduction:

variable selection

feature extraction

curse of dimensionality

computational cost

Random sampling

Increase speed of modelling

Higher Variance

Less accuracy

Use all data

Advanced Sampling

Stratified to keep Good/Bad ratio constant

Classification model geared towards majority group

Undersampling: random deletion of cases, reduce training time, discard useful info

Oversampling: random duplicate minority class, no info loss, increased training time, beats undersampling

SMOTE: creates artificial minority class cases

Variable Selection


easier to understand, cost of collection, accuracy

Find best subset: multivariate, combinatorial problem

Wrapper / Filter


better performance, multivariate, higher comp. cost, auxiliary validation data to assess variable subset

iteratively model while manipulating set of variables, forward selection, backward elimination


often less effective, only an indicator, model agnostic, univariate, lower comp. cost

Use statistical indicator, correlation, Fisher score, IV, one variable at a time,

often univariate:

redundancy, interaction

Random Forest Variable Importance

OOB variable importance:

importance scores like z-values (or probabilities)

Gini variable importance:

specific to random Forest

not readily interpretable

Both often agree

Feature Extraction

Turn variables into new

Linear / nonlinear projection

Clever transformation

Highly effective but costly

PCA, Genetic algorithm

Business Analytics used for


information technology

statistical analysis


to help managers gain improved insight about operations and make better decisions

unlock valuable information hidden in data

value proposition: help firms to achieve strategic objectives

ERP: record data

BI: inform managerial decision making

BA: may even replace manager

Explanatory Approach: explore business value of BA

Design approach: new methods for BA

Predecessors of BA

Man. MIS: reporting

Exec. EIS: specific needs of senior management, visual

Dec. DSS: formal model and algos, operational management

Man. MSS: all of the above

Scope of BA

Descriptive Analysis:

past and present

Predictive Analysis:

past and predict future (e.g. demand)


Prescriptive Analysis: not only anticipates the future but also makes recommendations (e.g. prices)

Define business intelligence

Concepts and methods to improve business decision making by using fact-based support systems

What is a database?

Collection of data that exists over a long period of time

managed by DBMS:

create and manage data, allow to safely persist over a long period of time

Data storage types

Volatile Storage: Cache, Memory

Nonvolatile Storage: Disk, File System, Tertiary Storage, DBMS, Tapes, DVDs

What does a DBMS do?

Create new DB

Store large amounts of data

Query data

Modify Data

Efficient access


Access control: isolation, atomicity, security req.

Relational model

Based on set theory and relational algebra, two-dimensional tables with rows and columns

Interaction with SQL

Relational DB

Collection of items organised as a set which can be accessed

Data values can be numeric, strings, etc

Tables can be joined to form new tables






isolation: concurrent execution gives the same results as serial execution

durability: once committed, it remains so


Group of DB operations


DB architectures

Two tier:



Web server, Application Server (business logic), Database server (DBMS)

Redundancy possible

Cloud computing

Network-based computing over the internet

provides data storage, uses internet as transport, many clients at the same time

set of integrated and networked hardware

hide the complexity of underlying infrastructure

on demand service

pay for use, elastic / scaling

available for general public

cloud storage: access remotely only temporarily cached in computer

Challenges of RDBMS



Data Size

NoSQL movement

not using a SQL language

no schema to follow, no constraint to new data

open source, large clusters

Information Integration

combining information contained in different and heterogeneous databases into one large repository

e.g. in companies, each division has its own database

different DBMS, different terms and things, legacy applications

necessary to build a structure on top of existing DBs

Data Warehouse vs. Data Federation

Information from many databases is copied periodically with appropriate translation, central DB

Data Federation: help of a mediator, middleware, support integrated model of data of various databases, while translating between model and actual models used by each database

Storage remain distributed

What is a Data Warehouse?

database is maintained separately of operational databases

support decision making by solid consolidation and historic data analysis

Warehousing: constructing and using warehouses

DW: subject oriented

major subjects: customer, product etc

simple and concise view, excluding data that is not useful

data for decision makers, not daily transactions

DW: integration

Integration of multiple, heterogeneous sources

Data cleaning and conversion applied

Ensure quality and consistency in naming conventions, structures, measures among different sources

DW: time variant

longer time horizon than operational systems

every data item is associated with time

DW: non-volatile

physically separate

no operational updates in DW: only loading and reading of data

What is OLAP?

Online Analytical Processing

Live Data

Analytical: create reports, multidimensional, time based



OLTP: traditional relational DBMS

day-to-day operations


OLAP: DW, analysis and decision making

Content: current+detailed vs. historic + consolidated

Database design: complex + application vs. star + subject

Perspective: current + local vs. evolutionary + integrated

Access patterns: update vs. read only with complex queries

DBMS: tuned for OLTP, access, index, control

DW: tuned for OLAP, complex OLAP queries, multidimensional view, consolidation

Different functions: decision support requires historic data

Consolidation: aggregation, summarization

Different sources inconsistent data representation

Enterprise DW: entire organization

Data Mart: subset, specific groups of users, confined to specific, selected groups, independent vs. dependent data marts

Star Schema

Dimensional data model for RD:

fewer tables compared to normalized model

fact tables: important business measures, sales, revenues etc

core of OLAP cube, what is the right level of granularity

dimension tables: potentially relevant query dimensions: place, time, product, customer

linked to fact tables through keys, include aggregation levels

more redundancy, higher storage requirements, fewer tables, better performance

Snowflake Schema

Extension of Star

Normalisation of dimension tables:

have one table per hierarchy level

more tables

more join operations

less performance,

less redundancy,

less storage requirements

Types of Reporting

Standard - fixed frequency

Event - driven



Graphical tools

Performance Indicators

BI Portals:

visualisation and data access

reduce search cost

all information in one place

OLAP cube

user centric

multi dimensional free navigation


MOLAP: high performance, fast navigation, uses proprietary database, scalability can be a problem, low performance when updating


ROLAP: create cube on the fly, easy to implement, lower performance, better scalability, high performance when updating

HOLAP: combines both, relational for all data, in addition, multi dimensional store for critical data

Data Mining

BI front end technology

Formal method to preprocess data set

Analyse large data set

Discover patterns


KDD: Selection

Documentation of available data stores

Review quality, granularity

Selection of candidate data

KDD: Pre-processing

Conversion to standard analysis format

Exploratory data analysis


KDD: Transformation

Handling missing values

Data reduction

Encoding and projection

KDD: Data Mining

Select a data mining model, algo, develop and estimate a model

KDD: interpretation/evaluation

Validity, Reliability and Originality of patterns

Start over

Classification Algorithms

(Non-)Linear

(Non-)Parametric

Ensemble - multiple classifier systems

Individual - ensemble

Parametric Classifiers


Functional Form

Logistic Regression

Naive Bayes


Decision Tree



Neural Network, SVM

No free lunch theorem

No algorithm is always better in every application


ease of use




Prototype Methods

Implicit partitioning of the attribute space into regions assigned to specific classes

-No model estimation

-Only use available data

Mislabeling is a big problem

- Use k nearest neighbors

- Measure distance

Identify k nearest neighbors

 Every neighbor votes for class
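
A minimal k-nearest-neighbour sketch of the steps above. Euclidean distance, k = 3, unweighted majority voting and the toy data are assumptions.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, class_label) pairs
    # 1. Measure the distance from the query to every stored case
    # 2. Keep the k nearest cases
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    # 3. Every neighbour votes for its class; the majority wins
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [
    ((1, 1), "good"), ((1, 2), "good"), ((2, 1), "good"),
    ((8, 8), "bad"), ((8, 9), "bad"), ((9, 8), "bad"),
]
print(knn_predict(train, (1.5, 1.5)))  # good
print(knn_predict(train, (8.5, 8.5)))  # bad
```

No model is estimated: all work happens at prediction time, which is why mislabeled training cases directly distort the votes.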


Tree Based Terminology

Root node: all data

Splitter node: question

Internal node: answer, subset

Leaf node: not further split, class prediction

Tree Based classifier

Recursive partitioning approach

find best split

split according to the variable and create a branch for each value

recurse for each subsample

stop when no improvement possible


splitting rule

stopping rule

Splitting Rules

objective: increase purity of data

avoid splitting with too few examples

indicators of node impurity:

maximized: all classes equally probable

minimized: all belong to the same class

entropy, Gini

problem with gain:

favors attributes with many distinct values
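
A short sketch of the two impurity indicators and of information gain for a candidate split. Function names and the toy labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    # 0 when all cases share one class, maximal when classes are equally likely
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # 1 minus the sum of squared class shares
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child nodes
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["good", "good", "bad", "bad"]
print(entropy(parent))  # 1.0
print(gini(parent))     # 0.5
print(information_gain(parent, [["good", "good"], ["bad", "bad"]]))  # 1.0
```

Splitting on an attribute with one distinct value per case would always yield pure children, which illustrates why plain gain favors attributes with many distinct values.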

Stopping Rule

Otherwise: one example per leaf

Models with noise


prune to avoid



do not fully grow

minimal gain below threshold

how to select parameters


fully grow-tree

cut branches that do not add much

Naïve Bayes Classifier

Models conditional probability of class membership given attribute values

Simplifying assumption:

attributes are conditionally independent

Classify an application into the group with maximal posterior probability

Collect data

Estimate conditional probability per class

Predict class with highest probability

Directed Graph

Arrows indicate class membership causes attribute values with certain probability

Features: Fast and memory efficient

robust towards irrelevant attributes

Generalises easily to more than two classes

Logistic Regression

Direct approach

Problems with linear regression:

violation of assumptions, equal variance, predictions >1, <0, non-normal

Maximum Likelihood

logistic regression creates linear decision boundary

Curse of Dimensionality

easier to separate cases in high dimensional spaces

high dimensionality increases sparseness of attribute spaces

distances become indistinguishable

boundary difficult if most areas are empty

even relatively simple classifiers may overfit when dimensionality is high

amount of data needs to increase exponentially with dimensionality

working with large number of attributes causes problems


Models derive from training sample

the more complex the model, the more the classification error on the training sample approaches zero

new cases not contained in sample

problem: part of what the model fitted is noise

perfect fit for training - remedy: cross validation - regularisation


revises practices to models

not error minimisation only but:

low training error AND low model complexity

Measure of model complexity

Bias-Variance Tradeoff

Bias: too simple to capture trend

Variance: zero error but unstable, re-estimation leads to new model

hurt accuracy:

simple models, high bias, low variance

complex models: low bias, high variance

Introduce bias to reduce variance, penalize complexity

Complex models overfit, high variance

Regression indicators: large coefficient associated with unstable models: high dimensionality


Dimensions of Model Performance








Time resources:

training time

prediction time

Consumption of memory resources

Sensitivity to parameters



Real-world data noisy

Missing values

wrong data

irrelevant data

Change over time

Concept drift


Understand why? Transparency, Blackbox

Difficult to measure


Agrees with prior beliefs and business rules

Drives model acceptance

Difficult to measure


Feature of probability prediction:



> human decision making

Prediction model poorly calibrated

Translate classifier scores to class membership probability - distorted for some classifiers


How well the model predicts unseen observations

Simulate real-life application

Indicator to measure predictive accuracy:

compare predicted to actual responses

Simulation of Model Application


same data for model estimation and assessment

overfitting problem


random split in training and test set

easy to implement, fast, approximates error

inefficient, not all data used, high variance

N-fold cross-validation:

N partitions: one for model validation and the remaining for model creation, assess performance and repeat for each partition, then average the performance estimates

resource consumption

increase robustness - less variance

all examples used for model building
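
A sketch of how N-fold cross-validation splits the data. It only produces index sets; model estimation and assessment are left to the caller. The shuffling seed and the function name are assumptions.

```python
import random

def k_fold_indices(n_cases, n_folds, seed=0):
    # Shuffle case indices once, then deal them into n_folds partitions
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # One partition validates, all remaining partitions build the model
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_cases=10, n_folds=5))
for train, test in splits:
    print(sorted(test), "validated by a model built on", sorted(train))
```

Averaging the performance over the five validation partitions gives the cross-validated estimate.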

Confusion Matrix

Specificity: correctly classified bad

Sensitivity: correctly classified good

imbalance problem

cut off value - threshold metrics, compare continuous prediction to a cut-off
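
A sketch of the threshold metrics above. Following the notes, "good" is treated as the positive class, so sensitivity is the share of correctly classified goods and specificity the share of correctly classified bads. The scores are assumed to be predicted probabilities of "good"; the data are made up.

```python
def confusion_metrics(actual, scores, cutoff=0.5, positive="good"):
    tp = fp = tn = fn = 0
    for label, score in zip(actual, scores):
        predicted_positive = score >= cutoff  # compare score to the cut-off
        if label == positive:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

actual = ["good", "good", "good", "bad", "bad"]
scores = [0.9, 0.8, 0.4, 0.3, 0.6]
metrics = confusion_metrics(actual, scores)
print(metrics)
```

Moving the cut-off trades sensitivity against specificity, which is what the ROC curve traces across all cut-offs.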

Brier Score

Mean squared error of the predicted probabilities, captures accuracy and calibration
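
As a small illustration, the Brier score is the mean squared difference between the predicted probability and the 0/1 outcome (lower is better). The function name is illustrative.

```python
def brier_score(actual, predicted_prob):
    # actual: 0/1 outcomes; predicted_prob: predicted probability of outcome 1
    return sum((p - y) ** 2 for y, p in zip(actual, predicted_prob)) / len(actual)

print(brier_score([1, 0], [1.0, 0.0]))  # 0.0
print(brier_score([1, 0], [0.5, 0.5]))  # 0.25
```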

Types of accuracy indicators




ROC curve

generalisation of confusion matrix

across all cut-offs

compare different classifiers

higher the better


probability that a randomly chosen bad risk is correctly ranked higher than a randomly chosen good risk
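
The ranking interpretation above can be computed directly by comparing every bad/good pair. This sketch assumes label 1 = bad risk, higher score = riskier, and counts ties as half a win; it is quadratic in the sample size, so it suits small examples only.

```python
def auc(actual, scores):
    bad = [s for y, s in zip(actual, scores) if y == 1]
    good = [s for y, s in zip(actual, scores) if y == 0]
    # Share of (bad, good) pairs in which the bad case gets the higher score
    wins = sum(1.0 if b > g else 0.5 if b == g else 0.0
               for b in bad for g in good)
    return wins / (len(bad) * len(good))

print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1]))  # 0.75
```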

Accuracy Indicator Selection

Different types of predictions:

discrete class predictions

continuous confidence scores

class probability estimates

incompatible between indicators

Neural Network

Finding the connection weights in the network

Non-linear, non-convex optimisation problem

non-deterministic behavior

Iterative training algorithm

different networks given random weight initialization

Classic back propagation

Parameter: learning rate: local and global minima

Large coefficient weights indicate model complexity

complex models prone to overfit

weight penalty


Meta-parameter tuning

adapt the classifier to a given data set

set by analyst:

number of layers, nodes

output layer function

hidden layer activation function

regularisation coefficient

Grid search

Three step approach

define candidate settings, test, for each meta parameter

Magnify resolution in promising areas

Learning curve analysis

Studies sensitivity of a classifier with respect to sample size

three step approach:

draw small sample

estimate and assess model

increase sample size and repeat

typically, degressive trend, marginal utility of more data decreases with size

can speed things up: find size where classifier training stabilises

Ensemble Models

Multi-step modelling approach:

develop a base model

aggregate their predictions

Why forecast combination increases accuracy

Different views on same data

Linear and non linear models

Forecast combination gathers information

Asking many experts

Bias variance trade off (combination)

Strength diversity trade off

ensemble margin

Strength-Diversity Trade-Off

Base model strength:


Base model diversity:

how the predictions of models agree

diversity as forecast correlation

Do not combine identical models

Not possible to be very strong and diverse at the same time - conflict of strength and diversity


Class probabilities and class membership: different outputs

develop diversity measure

study behaviour of different measures

no consensus on this idea

Strength Diversity Plot

Scatterplot: each point is a pair

x: average strength (PCC)

y: Correlation

Ensemble Margin

Factor explains ensemble success


The larger the better

Penalty for each wrong prediction


Weighing possible

Categories of Ensemble Algorithms



Without pruning

With Pruning

Homogeneous Ensemble Classifier

Base model using same algorithm

Inject diversity through training data:

random drawing of cases/variables

Heterogeneous Ensemble classifiers

Different algorithms

Inject diversity algorithmically

different algos, different meta parameter

Also called multiple-classifier system

Ensemble without pruning

All base models into ensemble

Ensemble with pruning

Candidate base models

Optimize ensemble composition using some search strategy

Put selected base models into ensemble

Homogeneous Algorithm


Random Forest



Work with any classifier

best use unstable

bootstrap sampling with replacement

sample includes some cases multiple times

about 37% of cases do not appear at all

OOB examples:

facilitate assessing model and variable importance

then: averaging

Bias not altered

Variance reduced

meta parameters: based on algorithm

can use any

how many bootstrap samples

size of bootstrap sample

scalable and generic
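
The share of cases left out of a bootstrap sample (the OOB examples) can be checked with a quick simulation. For n draws with replacement from n cases, the chance that a given case is never drawn is (1 - 1/n)^n, which approaches exp(-1), roughly 0.368. The sample size and seed below are arbitrary assumptions.

```python
import math
import random

def oob_fraction(n, seed=0):
    # Draw n case indices with replacement and measure the share never drawn
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n

n = 10_000
print(oob_fraction(n))   # simulated share of out-of-bag cases
print((1 - 1 / n) ** n)  # exact probability for one case
print(math.exp(-1))      # limit for large n
```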

Random Forest

Bagging with random subspace

Injects diversity

Bootstrap sample, in addition:

to choose the attribute to split the data: draw a random sample of attributes, find best split among the attributes within the sample

increased diversity

random subspace limits access to attributes

tree based tend to overfit, pruning helps

by definition no bias, but high variance

bootstrapping reduces variance

meta parameter: size of random sample of attributes, forest size, size of bootstrap sample

often accurate and easy to scale, estimate variable importance


Simple classifier / weak learners

easy to find them

find many

combine many easy classifiers

Check which cases it gets right
second classifier that corrects the errors

Check errors of the ensemble
third classifier that corrects the errors


weight depends on accuracy

reduce bias and variance

meta-parameters: type of algorithm, often shallow decision trees

embody idea of weak classifier

how many iterations

maybe more depending on implementation

vulnerable to overfitting

Multiple Classifier Systems

Simply take the average of preferred classification algorithms

but: different algorithms have different outputs

common scale

stacking: predictions of base classifiers are correlated, use robust 2nd level classifier (classical suffer from multicollinearity)

modelling gets very complex, tuning?

ensemble selection:

build base classifier, select single best, iteratively add one additional and average, continue as long as ensemble accuracy increases

weighing and multiple times possible

Highly accurate predictions


relatively complex

resource intensive

model management costly