160 Cards in this Set

  • Front
  • Back

Cluster Objective


homogeneity within a segment is maximized: low average pairwise distance among its objects


heterogeneity between segments is maximized: large distances between objects of different clusters



No response variable



A distance measure is a way to quantify similarity

Cluster Applications

Understanding a customer population


Text Analytics


Fraud detection


Cluster Approaches

Hierarchical:


Agglomerative


Divisive



Non-Hierarchical:


Exclusive


Non-exclusive

Properties of Distance Measures

Always a real number



non-negativity


symmetry


identity


triangle inequality: the shortest path between two objects is the direct one


Distance Measures

Euclidean


Manhattan (City-Block)



Cosine similarity: angle between vectors (see the sketch below)
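
A minimal sketch of the three distance notions on this card, assuming Python with NumPy; the two example vectors are made up.

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 0.0, 3.0])

    euclidean = np.linalg.norm(a - b)          # straight-line distance
    manhattan = np.abs(a - b).sum()            # city-block distance
    cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based

    print(euclidean, manhattan, cosine)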

Distance Measures for non-numeric

Translate into dummy variables


Hamming distance


Edit distance: how many operations are needed to transform one text into another


Graphs, Time Series

General Notion of similarity/distance

effort to transform one object into another


transformation specific costs


Hierarchical Clustering

Iterative Algorithm:


form cluster solution


loop



Divisive:


Top-down: start with all elements in one cluster, then split up


Decreases heterogeneity in a cluster



Agglomerative:


Bottom-Up


Start with every case as its own cluster, then merge clusters


Increases heterogeneity in a cluster

Dendrogram

Tree-type graph depicting the progress of hierarchical clustering



height of line depicts to which extent intra-cluster heterogeneity increases when merged



sharp increase suggests cut



outlier detection possible

How to measure distances between clusters?

Single linkage: closest objects


only small dissimilarities needed


chaining



Complete: most dissimilar objects


violates closeness property



Compromises:



Average: mean of all pairwise


Centroid: distance between centroid


Ward's: marginal increase in variance when merging



When clusters well separated: similar results
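
A short sketch of agglomerative clustering with Ward's linkage and a dendrogram cut, assuming Python with SciPy, scikit-learn (only for toy data) and matplotlib; the data and the three-cluster cut are illustrative.

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=40, centers=3, random_state=0)  # toy data

    Z = linkage(X, method="ward")   # bottom-up merge history (Ward's criterion)
    dendrogram(Z)                   # long vertical lines suggest where to cut
    plt.show()

    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(labels)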

Non-Hierarchical Clustering

Combinatorial


Centroid


Distribution based


Density based


Self-Organising maps

Exclusive Clustering

Every case assigned to one cluster


k-means

K-Means clustering

Define number of clusters: K


Difficult to know: use domain knowledge or a heuristic approach (e.g., the elbow strategy)



Randomly guess centroids/centers


Assign cases to nearest centroid


Update centroids


Repeat until cluster assignment stops changing
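
A minimal NumPy sketch of the k-means loop described on this card (random initial centroids, assign, update, repeat); K and the toy data are assumptions, and the empty-cluster edge case is ignored.

    import numpy as np
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=200, centers=3, random_state=1)  # toy data
    K = 3
    rng = np.random.default_rng(1)
    centroids = X[rng.choice(len(X), size=K, replace=False)]     # random initial centers

    for _ in range(100):
        # assign each case to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each centroid as the mean of its assigned cases
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):                # assignments stopped changing
            break
        centroids = new_centroids

    print(centroids)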



Non-exclusive clustering

Soft-cluster assignment:


cluster membership probabilities rather than the hard assignments of k-means


related: Gaussian Mixture Models



More robust towards outliers


Better suited for overlapping distributions

Gaussian Mixture Model Clustering

Choose K


Algorithm models data


Each mixture component influences every case


Strong influence if case is close to mean vector, small otherwise



Compute association to mixture components


Recompute parameters based on responsibilities
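
A sketch of soft assignment with a Gaussian mixture, assuming Python with scikit-learn; K = 2 and the toy data are illustrative, and the EM iterations happen inside fit().

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=2, random_state=0)  # toy data

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    resp = gmm.predict_proba(X)   # "responsibilities": soft cluster membership per case
    print(gmm.means_)             # estimated mean vectors of the mixture components
    print(resp[:5])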

Comparison Hierarchical / Non-Hierarchical

Hierarchical:


Clusters follow from the dendrogram


arbitrary distance measures: flexibility, validity?


Limited scalability: computationally expensive


Choose linkage measure



Non-Hierarchical:


number of clusters set by user


less flexibility in terms of distance measure, geared for numeric data


Fairly scalable: k-means converges fast


Association rule mining

Detect frequently occurring patterns between different items in a transaction:


- purchased together


- co-occur in a text


- chosen together



Discover rules: If A then B


Sequence analysis: time dependency

Association rule terminology

Antecedent


Consequent


whereby antecedent and consequent are disjoint item sets



2^N-1 combinations of sets: ignore empty set

Association rule measures

Support: #(X ∪ Y) / # of all transactions


Threshold: minsup, frequent item set



Confidence: fraction of transactions containing X that also contain Y:



#(X ∪ Y) / #X


Measure of strength of association


conditional probability - no causal link



Lift: Interestingness of a rule


<1: substitution


>1: complementarity



support(X ∪ Y) / (support(X) × support(Y))
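
A small pure-Python sketch of the three measures for a rule X → Y; the toy transactions are made up.

    transactions = [
        {"bread", "butter"}, {"bread", "butter", "milk"},
        {"milk"}, {"bread"}, {"bread", "butter", "beer"},
    ]
    X, Y = {"bread"}, {"butter"}
    n = len(transactions)

    sup_xy = sum(1 for t in transactions if X | Y <= t) / n   # support(X u Y)
    sup_x  = sum(1 for t in transactions if X <= t) / n
    sup_y  = sum(1 for t in transactions if Y <= t) / n

    confidence = sup_xy / sup_x        # conditional probability P(Y | X), no causal claim
    lift = sup_xy / (sup_x * sup_y)    # >1 suggests complements, <1 substitutes
    print(sup_xy, confidence, lift)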


Association Rule Applications

Reveal dependencies among items



Market basket


Next best offer


Product bundling


Store and shelf layout


Product catalog


Recommender system


Web Usage


Mining Association Rules

Only strong meaningful rules of interest


Strategy:


support > minsup >> the item set must be sufficiently frequent


confidence > minconf



Brute-force approach: every item combination could be a frequent set; computing support for every candidate set is very complex (2^N − 1 sets)

Strategies to find frequent item sets

Reduce candidates M


M=2^N-1



Reduce transactions N


Pooling, DHP



Reduce comparisons nM


Efficient data structure


Avoid comparing every candidate to every transaction

Apriori Algorithm

Property:


downward closure


if an item set is frequent, all of its subsets are frequent


if an item set is infrequent, all of its supersets are also infrequent



Functioning: minsup, minconf


two step approach: exploit downward closure, find candidate sets with K items


Discover rules > minconf



Issue: Efficiency
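
A compact sketch of the level-wise Apriori search for frequent item sets, with downward closure used to prune candidates; minsup and the toy transactions are assumptions.

    from itertools import combinations

    transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    minsup = 0.6
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]

    k = 1
    while frequent[-1]:
        # candidate sets with k+1 items, built from frequent k-item sets
        candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
        # downward closure: keep only candidates whose k-item subsets are all frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[-1] for s in combinations(c, k))}
        frequent.append({c for c in candidates if support(c) >= minsup})
        k += 1

    for level in frequent[:-1]:
        print([sorted(s) for s in level])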

Issues with Association Rule Mining

How to set thresholds


Irrelevant / trivial rules


Useful rules often have only moderate support


Interesting patterns overlooked

Association Rules and Supervised Learning

Automatic modelling:


rules have one consequent: the response variable


only few antecedents


sort by confidence


top rules provide insights and hypotheses to test



Variable selection:


mimic a correlation-based filter


Identify association rules among categorical variables



Develop rules

Sequential Pattern Mining

Maximal sequence among all sequences



Order of time in sequence is important



Association rules: same point in time, intra-transaction


Sequential rules: different points in time, inter-transaction



Apriori principle: if a sequence with k elements is infrequent, sequences with k+1 elements are also infrequent

Branches of Web Mining

Structure Mining: knowledge from hyperlinks (structure of the web), key technology in search engines, discover communities of users; link structure does not fit a relational table



Usage Mining: discovery of access patterns from logs, many mining algos, pre-processing of clickstream data in logs to produce data for mining



Content Mining: web page content; classify/cluster pages by topic; similar to data mining; discover patterns in pages; extract information for many purposes (reviews, descriptions, etc.)


Web Structure Mining

Extract patterns from hyperlinks that connect pages


Examine structure of a single page



Web as a directed graph

Crawlers

Procedure:


download page


extract urls


repeat for unread urls



Breadth-first search: explore links layer by layer, well-connected pages



Depth-first search: follow first link, subsequent pages



Dynamics: examine page update frequency, re-visit periodically

Search engine

PageRank + other factors


Important if others link to it


Query independent relevance scores


Inbound links cast a vote


vote is weighted by sending pages' quality


Recursive calculation



HITS (Hyperlink-Induced Topic Search)
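
A minimal power-iteration sketch of the recursive "votes weighted by sender quality" idea behind PageRank; the toy link graph and the damping factor of 0.85 are assumptions.

    # toy directed link graph: page -> pages it links to
    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    pages = list(links)
    d = 0.85                       # damping factor (assumption)
    pr = {p: 1 / len(pages) for p in pages}

    for _ in range(50):            # recursive calculation until scores settle
        new = {}
        for p in pages:
            # inbound links cast votes, weighted by the sending page's own score
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / len(pages) + d * incoming
        pr = new

    print(pr)                      # query-independent relevance scores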

Web Usage Mining

Patterns in clickstreams and transactions: interactions with a web resource



Sources: logs, meta data, domain knowledge, client side scripting, cookies



Types: Usage, Content, Structure, User

Web Usage Mining Process

Data collection / preparation, pattern discovery and analysis



Preparation: create DB and fill with data


Discovery: typical behavior statistics


Analysis: aggregate, input to engines and reports


Data Preparation Tasks

Pageview identification


User identification


Sessionization


Path Completion


Data Integration

Pageview identification

Single access: multiple entries (images, html)


Remove redundant entries


Remove: Well-behaved robots / stealth robots

User Identification

Not in logs:


cookies, IP, time, referrer, user agent


Sessionization

Segment user activity into sessions:


authentication


clickstream heuristics



Episode: subset of session, related page views,


classification of pageviews

Path completion

Server logs do not capture the whole navigation path (caching)



Knowledge of site structure, referrer information

Data Integration

Sessionized clickstream:


integrate transaction data, demographics, products, commerce data

User-Pageview-Matrix

If the order of views is not relevant: clustering and association rule mining


Term‐Pageview Matrix

Cluster Pages according to a topic

Content-Enhanced Matrix

Combine the term and user pageview matrices: better description of what users look for

Web Usage Patterns

Web controlling: session and visitor analysis, OLAP tools


Predictive modelling based on transactions



Cross selling, Improve site layout


Visitor segmentation


Recommender Systems and Collaborative Filtering

Purpose: help with information overload



Setting: users and items with features



Task: learn a utility function from data, predict the value of item i to user u, and recommend



Rating prediction


Item prediction (what user likely to buy)



Content based recommendation: similar to past item


Collaborative filtering: people with similar taste bought in the past



Hybrid approaches: association rule, nearest neighbour
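
A small NumPy sketch of user-based collaborative filtering on a made-up user × item rating matrix: find users with similar taste, then predict a missing rating as a similarity-weighted average of their ratings.

    import numpy as np

    # rows = users, columns = items; 0 = not yet rated (toy data)
    R = np.array([[5, 4, 0, 1],
                  [4, 5, 1, 0],
                  [1, 0, 5, 4],
                  [0, 1, 4, 5]], dtype=float)

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    target_user, target_item = 0, 2            # predict R[0, 2]
    sims = np.array([cosine(R[target_user], R[u]) for u in range(len(R))])
    sims[target_user] = 0                      # exclude the user themself

    rated = R[:, target_item] > 0              # users who rated the target item
    pred = np.sum(sims[rated] * R[rated, target_item]) / np.sum(np.abs(sims[rated]))
    print(round(pred, 2))                      # predicted rating of item 2 for user 0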

Web Content Mining

Extract useful information


text mining part of this and vice versa

Text analytics

Tools and techniques to model and structure information



information sits not in databases but in blogs etc. >> unstructured

Challenges in Text analytics

Homonymy: same word, different meaning


Synonymy: different words, same meaning


Polysemy: same word, slightly different meanings


Hyponymy: hierarchy of concepts


Misspellings

Text analytics process

Text > Gather data > corpus > NLP > bag of words > term matrix > classification > knowledge

NLP

Part-of-speech tagging: resolve ambiguity


Tokenization: split the character stream into tokens


Filtering: dictionaries


Stemming

Document Term Matrix

Describes the frequency of terms occurring in a corpus


rows: documents, columns: terms, term frequencies in the cells

Inverse Document Frequency

Ratio of numbers of docs in corpus and number of docs in which a term occurs



Large values: the term occurs in few documents and is therefore more distinctive


Term‐Frequency‐Inverse Document Frequency

High TF: relevance at document level


High IDF: distinctiveness at corpus level


TF × IDF matrix is commonly used
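
A plain-Python sketch of one TF × IDF weight, using the ratio-of-documents definition above; the corpus and term are made up, and real toolkits usually apply smoothed variants.

    import math

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs",
    ]
    term, doc = "cat", corpus[0].split()

    tf = doc.count(term) / len(doc)                    # relevance at document level
    df = sum(1 for d in corpus if term in d.split())   # documents containing the term
    idf = math.log(len(corpus) / df)                   # distinctiveness at corpus level

    print(tf * idf)                                    # TF-IDF weight of "cat" in document 1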

N-Grams

Use multiple words to create features



Bigrams, Trigrams, N-Grams



treated the same way as a single word

Social Media Definition

Group of internet-based applications built on the ideological and technological foundations of Web 2.0



Classified along Social Presence / Media Richness and Self-Presentation / Self-Disclosure (table from low to high)

Graphs

Pair G = (V,E)



Vertices = Nodes


Edges = Links


Edges are two-element subsets of V (each edge connects two vertices)



Node: certain object of interest, edge: connect and describe relationship


Graph Structures

Telephone calls, spread of diseases, collaboration networks, topology, social media



Types:



Directed: one way


Undirected: both ways



Bipartite: two sets, linking to each other


Unipartite: one set



Weighted: weight assigned to every edge


Ego-centered: a node's one-hop neighborhood


relevant with high homophily

Graph metrics - distance

Shortest path length: two nodes, shortest path between them (fewest hops)



Eccentricity: a node's longest shortest path to any other node



Diameter: max eccentricity


Radius: min eccentricity


Periphery: nodes whose eccentricity equals the diameter

Graph metrics - centrality

Degree centrality: number of edges converging to a node



Closeness centrality: average distance of a node to all other nodes, between 1 and 0, 1= close, 0= far away



Betweenness centrality: number of shortest paths passing through a node



Eigenvector centrality: measure influence of a node in a network. More important when connected to other important nodes
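
A sketch of the four centrality measures (plus the distance metrics from the previous card) on a toy graph, assuming Python with the networkx package; the edge list is illustrative.

    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

    print(nx.degree_centrality(G))       # edges converging on each node (normalised)
    print(nx.closeness_centrality(G))    # towards 1 = close to everyone, towards 0 = far away
    print(nx.betweenness_centrality(G))  # share of shortest paths passing through a node
    print(nx.eigenvector_centrality(G))  # important if connected to other important nodes

    # distance metrics from the previous card
    print(nx.eccentricity(G), nx.diameter(G), nx.radius(G), nx.periphery(G))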


Graph Neighbor Classification

Probability that a node does something given the classes of direct neighborhood nodes



Problem: data not iid


Cannot split in test and training as graph is interconnected


Huge scale



Graph Neighbor Classification Procedure

Local model: only node-specific characteristics



Relational model: connections in the network to do the inference



Collective inference: how unknown nodes are estimated together influencing each other



Markov property: class membership depends only on the classes of the one-hop neighborhood



Homophily assumption: nodes tend to connect to nodes with similar characteristics



Extension with probabilistic RNC:


all nodes carry class probabilities; an unknown node's estimate is the average of its neighbors' probabilities


Relational Logistic Regression (Graph)

Using node specific and network specific characteristics



Mind correlation of both > stepwise regression



Featurization: features might measure behaviour of neighbouring nodes or local characteristics


Collective Inferencing

Given a network, a local model and a relational model, a way to infer the classes/probabilities of unknown nodes


Takes into account: nodes influence each other



Gibbs sampling

Initialise with local classifier


Sample class value of each node


Generate random ordering


Apply relational learner


Sample again


Repeat iterations counting time each class is assigned


Normalize to give final class probability

Facebook Bidding

Eye-catching spot most expensive


Price based on magnitude and popularity of audience


Virtual auction decides which advertisement is placed where


Competitive market on website

Link Prediction

Friends, Admirers, Idols, Neutrals, Enemies



Homophily: connect to similar people


Similarity: demographics, behaviours, interest, brands



Link prediction analytics enables more efficient targeting:


 Customer churn
 Customer acquisition


Tasks in Data Preprocessing

Feature Selection


Sampling


Missing Values


Discretization


Encoding


Standardizing Data


Outliers


Target Definition

Data Preprocessing Process

Selection


Cleaning


Transformation


Reduction

Cleanining

Noisy Data


Missing Values:


keep, delete, replace


Outliers

Imputation Methods

Replace with mean / modal value


Regression / Tree based imputation


Nearest Neighbor

Outliers Types

Valid


Entry error


Difficult to handle: univariate, multivariate


Detect vs. treatment

Outlier treatment

Univariate: Boxplot, Histogram



IQR: Q3-Q1



Mild outlier: 1.5-3 × IQR away from Q1/Q3


Extreme outlier: > 3 × IQR away from Q1/Q3



z-scores: outliers are more than 3 standard deviations away from the mean



Multi: Clustering, Regression, RF
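
A NumPy sketch of the two univariate rules above (IQR fences and z-scores) on a made-up data vector with one planted outlier.

    import numpy as np

    x = np.array([9, 10, 11] * 6 + [10, 50], dtype=float)    # toy data, 50 is the outlier

    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mild    = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)    # beyond 1.5 x IQR
    extreme = (x < q1 - 3.0 * iqr) | (x > q3 + 3.0 * iqr)    # beyond 3 x IQR

    z = (x - x.mean()) / x.std()
    z_outlier = np.abs(z) > 3                                # more than 3 std from the mean

    print(x[mild], x[extreme], x[z_outlier])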



Data Transformation

Continuous:



Scaling: distance computations are biased by different value ranges - scaling often useful


skewed value distribution - log transformation



Discretization:


avoid negative impact of outliers, increase comprehensibility, capture non-linear effects, loss of information



Unsupervised binning (EF, EW)


Supervised binning (Entropy, decision trees)



Categorical: Dummy


do not encode as numbers for regression >> meaningless, artificial distances


N-1 dummy variables - explosion of dimensionality



Binary: better -1 and 1 when using neural networks



WOE: ln(fraction of bad in category / fraction of good in category)



IV: measure of predictive power, assess variable suitability; 0.3+: strong, 0.1-0.3: medium, 0.02-0.1: weak
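
A small sketch of WOE and IV for one categorical variable, following the ln(fraction of bad / fraction of good) convention on this card; the counts are made up and sign conventions differ across tools.

    import math

    # made-up counts of (bad, good) outcomes per category of one variable
    counts = {"A": (20, 180), "B": (50, 150), "C": (30, 70)}
    total_bad = sum(b for b, g in counts.values())
    total_good = sum(g for b, g in counts.values())

    iv = 0.0
    for cat, (bad, good) in counts.items():
        frac_bad, frac_good = bad / total_bad, good / total_good
        woe = math.log(frac_bad / frac_good)            # weight of evidence for this category
        iv += (frac_bad - frac_good) * woe              # information value contribution
        print(cat, round(woe, 3))

    print("IV:", round(iv, 3))   # ~0.02-0.1 weak, 0.1-0.3 medium, 0.3+ strong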

Data Reduction

Horizontal:


Sampling of cases


Reduce computational cost


class imbalance



Vertical reduction:


variable selection


feature extraction


curse of dimensionality


computational cost

Random sampling

Increase speed of modelling


Higher Variance


Less accuracy


Use all data

Advanced Sampling

Stratified to keep Good/Bad ratio constant


Classification model geared towards majority group



Undersampling: random deletion of cases, reduce training time, discard useful info



Oversampling: random duplicate minority class, no info loss, increased training time, beats undersampling



SMOTE: creates artificial minority class cases
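
A minimal NumPy sketch of the SMOTE idea: interpolate between a minority-class case and one of its nearest minority-class neighbours; k and the toy data are assumptions, and real libraries such as imbalanced-learn add more machinery.

    import numpy as np

    rng = np.random.default_rng(0)
    minority = rng.normal(loc=0.0, scale=1.0, size=(10, 2))  # toy minority-class cases
    k = 3

    def smote_samples(X, n_new, k):
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(X))                          # pick a random minority case
            d = np.linalg.norm(X - X[i], axis=1)
            neighbours = np.argsort(d)[1:k + 1]               # its k nearest minority neighbours
            j = rng.choice(neighbours)
            gap = rng.random()
            synthetic.append(X[i] + gap * (X[j] - X[i]))      # interpolate between the two cases
        return np.array(synthetic)

    print(smote_samples(minority, n_new=5, k=k))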

Variable Selection

Reduce:


easier to understand, cost of collection, accuracy



Find best subset: multivariate, combinatorial problem

Wrapper / Filter

Wrapper:


better performance, multivariate, higher comp. cost, auxiliary validation data to assess variable subset



iteratively model while manipulating set of variables, forward selection, backward elimination



Filter:


often less effective, only an indicator, model agnostic, univariate, lower comp. cost



Use statistical indicator, correlation, Fisher score, IV, one variable at a time,



often univariate:


redundancy, interaction


Random Forest Variable Importance

OOB variable importance:


importance scores like z-values (or probabilities)



Gini variable importance:


specific to random Forest


not readily interpretable



Both often agree

Feature Extraction

Turn existing variables into new ones



Linear / nonlinear projection


Clever transformation



Highly effective but costly



PCA, Genetic algorithm


Business Analytics used for

data


information technology


statistical analysis


QM



to help managers gain improved insight into operations and make better decisions



unlock valuable information hidden in data


value proposition: help firms to achieve strategic objectives



ERP: record data


BI: inform managerial decision making


BA: may even replace manager



Explanatory Approach: explore business value of BA


Design approach: new methods for BA

Predecessors of BA

Management Information Systems (MIS): reporting


Executive Information Systems (EIS): specific needs of senior management, visual


Decision Support Systems (DSS): formal models and algorithms, operational management


Management Support Systems (MSS): all of the above

Scope of BA

Descriptive Analysis:


past and present


Predictive Analysis:


past and predict future (e.g. demand)


Prescriptive:


not only anticipate future but make recommendations (e.g. prices)

Define business intelligence

Concepts and methods to improve business decision making by using fact-based support systems


What is a database?

Collection of data that exists over a long period of time



managed by DBMS:


create and manage data, allow it to safely persist over a long period of time

Data storage types

Volatile Storage: Cache, Memory


Nonvolatile Storage: Disk / File-System (DBMS), Tertiary Storage (Tapes, DVDs)

What does a DBMS do?

Create new DB


Store large amounts of data


Query data


Modify Data


Efficient access


Durability


Access control: isolation, atomicity, security req.

Relational model

Based on set theory and relational algebra, two-dimensional tables with rows and columns


Interaction with SQL

Relational DB

Collection of items organised as a set which can be accessed


Data values can be numeric, strings, etc


Tables can be joined to form new tables



Advantages:



Time-proven,


ACID:


atomicity


consistency


isolation: transactions give the same results as if executed serially


durability: once committed, it remains so

Transactions

Group of DB operations


ACID

DB architectures

Two tier:


client-server



Three-tier:


Web server, Application Server (business logic), Database server (DBMS)



Redundancy possible

Cloud computing

Network-based computing over the internet



provides data storage, uses the internet as transport, serves many clients at the same time



set of integrated and networked hardware



hide the complexity of underlying infrastructure



on demand service



pay for use, elastic / scaling



available for general public



cloud storage: accessed remotely, only temporarily cached on the local computer

Challenges of RDBMS

Diversity


Connectivity


Data Size

NoSQL movement

not using a SQL language


no schema to follow, no constraints on new data


open source, large clusters

Information Integration

combining information contained in different and heterogeneous databases into one large repository



in companies, e.g., each division has its own database



different DBMS, different terms and things, legacy applications



necessary to build a structure on top of existing DBs

Data Warehouse vs. Data Federation

Data Warehouse: information from many databases is copied periodically, with appropriate translation, into one central DB



Data Federation: with the help of a mediator (middleware), support an integrated model of the data in various databases while translating between that model and the actual models used by each database



Storage remain distributed

What is a Data Warehouse?

A database maintained separately from operational databases


supports decision making through solid data consolidation and historic data analysis



Warehousing: constructing and using warehouses

DW: subject oriented

major subjects: customer, product etc



simple and concise view, excluding data that is not useful



data for decision makers, not daily transactions

DW: integration

Integration of multiple, heterogeneous sources



Data cleaning and conversion applied



Ensure quality and consistency in naming conventions, structures, measures among different sources

DW: time variant

larger time horizon than operational systems



every data item is associated with time

DW: non-volatile

physically separate



operational updates do not occur in the DW: only loading and reading of data

What is OLAP?

Online Analytical Processing



Live Data


Analytical: create reports, multidimensional, time based

OLTP and OLAP

OLTP:


traditional relational DBMS


day-to-day operations



OLAP:


DW, analysis and decision making



Content: current+detailed vs. historic + consolidated


Database design: complex + application vs. star + subject


Perspective: current + local vs. evolutionary + integrated


Access patterns: update vs. read only with complex queries



DBMS: tuned for OLTP, access, index, control


DW: tuned for OLAP, complex OLAP queries, multidimensional view, consolidation



Different functions: decision support requires historic data



Consolidation: aggregation, summarization



Different sources inconsistent data representation



Enterprise DW: entire organization


Data Mart: a subset for specific, selected groups of users; independent vs. dependent data marts

Star Schema

Dimensional data model for RD:



fewer tables compared to normalized model



fact tables: important business measures, sales, revenues etc


core of OLAP cube, what is the right level of granularity



dimension tables: potentially relevant query dimensions: place, time, product, customer


linked to fact tables through keys, include aggregation levels



more redundancy, higher storage requirements, fewer tables, better performance

Snowflake Schema

Extension of Star


Normalisation of dimension tables:


have one table per hierarchy level


more tables


more join operations


lower performance,


less redundancy,


lower storage requirements

Types of Reporting

Standard - fixed frequency


Event-driven


Ad-hoc



Engines:


Graphical tools


Performance Indicators



BI Portals:


visualisation and data access


reduce search cost


all information in one place

OLAP cube

user centric


multi dimensional free navigation



MOLAP:


high performance, fast navigation, use proprietary database, scalability can be a problem, low performance when updating



ROLAP:


create cube on the fly, easy to implement, lower performance, better scalability, high performance when updating



HOLAP: combines both, relational for all data, in addition, multi dimensional store for critical data

Data Mining

BI front end technology


Formal method to preprocess data set


Analyse large data set


Discover patterns


Automated

KDD: Selection

Documentation of available data stores


Review quality, granularity


Selection of candidate data

KDD: Pre-processing

Conversion to standard analysis format


Exploratory data analysis


Aggregation

KDD: Transformation

Handling missing values


Data reduction


Encoding and projection

KDD: Data Mining

Select a data mining model, algo, develop and estimate a model

KDD: interpretation/evaluation

Validity, Reliability and Originality of patterns



Start over

Classification Algorithms

Linear vs. non-linear


Parametric vs. non-parametric


Ensemble: multiple classifier systems


Individual vs. ensemble

Parametric Classifiers

Parametric:


Functional Form


Logistic Regression


Naive Bayes



Non-Parametric:


Decision Tree


Prototype



Semi-Parametric:


Neural Network, SVM

No free lunch theorem

No algorithm is always better in every application



merits:


ease of use


comprehensibility


accuracy


scalability

Prototype Methods

Implicit partitioning of the attribute space into regions assigned to specific classes


-No model estimation


-Only use available data


Mislabeling is a big problem


- Use k nearest neighbors


- Measure distance


Identify k nearest neighbors


Every neighbor votes for class membership

Tree Based Terminology

Root node: all data


Splitter node: question


Internal node: answer, subset


Leaf node: not further split, class prediction

Tree Based classifier


Recursive partitioning approach


find best split


split according to the variable and create a branch for each value


recurse for each subsample


stop when no improvement possible



decisions:


splitting rule


stopping rule

Splitting Rules

objective: increase purity of data


avoid splitting with too few examples



indicators of node impurity:


maximized: all classes equally probable


minimized: all belong to the same class



entropy, Gini



problem with gain:


favors attributes with many distinct values
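
A small sketch of the two impurity indicators (entropy and Gini) for a node's class counts; the counts are illustrative.

    import math

    def entropy(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    print(entropy([5, 5]), gini([5, 5]))    # maximal: classes equally probable
    print(entropy([10, 0]), gini([10, 0]))  # minimal: all cases in the same class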

Stopping Rule

Otherwise: one example per leaf


Model captures noise


overfitting


prune to avoid

Pruning

Pre-pruning:


do not fully grow


minimal gain below threshold


how to select parameters



Post-pruning:


fully grow-tree


cut branches that do not add much

Naïve Bayes Classifier


Models conditional probability of class membership given attribute values


Simplifying assumption:


attributes are conditionally independent



Classify an application into the group with maximal posterior probability



Collect data


Estimate conditional probability per class


Predict class with higher probability



Directed Graph


Arrows indicate class membership causes attribute values with certain probability



Features: Fast and memory efficient


robust towards irrelevant attributes



Generalises easily to more than two classes



Logistic Regression

Direct approach


Problems with linear regression:


violation of assumptions, equal variance, predictions >1, <0, non-normal



Maximum Likelihood



logistic regression creates linear decision boundary

Curse of Dimensionality

easier to separate cases in high dimensional spaces


high dimensionality increase sparseness of attribute spaces


distances become indistinguishable


boundary difficult if most areas are empty


even relatively simple classifiers may overfit when dimensionality is high


amount of data needs to increase exponentially with dimensionality



working with large number of attributes causes problems

Overfitting

Models derive from training sample


the more complex the model, the more the classification error on the training sample approaches zero


new cases not contained in sample


problem: part of what is fitted is noise



perfect fit on training data - remedies: cross-validation, regularisation


Regularization

revises how models are fit


not error minimisation only but:


low training error AND low model complexity



Measure of model complexity

Bias-Variance Tradeoff

Bias: too simple to capture trend


Variance: zero error but unstable, re-estimation leads to new model



hurt accuracy:


simple models, high bias, low variance



complex models: low bias, high variance



Introduce bias to reduce variance, penalize complexity



Complex models overfit, high variance



Regression indicators: large coefficients are associated with unstable models: high dimensionality,


multicollinearity

Dimensions of Model Performance

Accuracy


Justifiability


Comprehensibility


Scalability


Robustness


Calibration


Scalability

Time resources:


training time


prediction time



Consumption of memory resources


Sensitivity to parameters


Parallelization

Robustness

Real-world data noisy


Missing values


wrong data


irrelevant data


Change over time


Concept drift

Comprehensibility

Understand why? Transparency, Blackbox


Difficult to measure

Justifiability

Agrees with prior beliefs and business rules


Drives model acceptance


Difficult to measure

Calibration

Feature of probability prediction:


Overconfidence


Under-confidence


> human decision making


Prediction model poorly calibrated



Translate classifier scores to class membership probability - distorted for some classifiers

Accuracy

How well the model predicts unseen observations


Simulate real-life application


Indicator to measure predictive accuracy:


compare predicted to actual responses

Simulation of Model Application

Resubstitution:


same data for model estimation and assessment


overfitting problem



Split-Sample:


random split in training and test set


easy to implement, fast, approximates error


inefficient, not all data used, high variance



N-fold cross-validation:


N partitions: one partition for validation and the remaining ones for model creation; assess performance and repeat - average the performance estimates


resource consumption


increase robustness - less variance


all examples used for model building
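
A scikit-learn sketch contrasting a split-sample estimate with N-fold cross-validation; the data, classifier and N = 5 are assumptions.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=500, random_state=0)   # toy data
    clf = LogisticRegression(max_iter=1000)

    # split-sample: one random training/test partition
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    print(clf.fit(X_tr, y_tr).score(X_te, y_te))

    # 5-fold cross-validation: average over five train/validate partitions
    print(cross_val_score(clf, X, y, cv=5).mean())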

Confusion Matrix

Specificity: correctly classified bad


Sensitivity: correctly classified good


imbalance problem



cut off value - threshold metrics, compare continuous prediction to a cut-off

Brier Score

Mean squared error of probability predictions; captures accuracy and calibration

Types of accuracy indicators

Threshold


Probability


Ranking

ROC curve

generalisation of confusion matrix


across all cut-offs


compare different classifiers


higher the better


AUC

probability that a randomly chosen bad risk is correctly ranked higher than a randomly chosen good risk
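
A scikit-learn sketch of the ROC curve across all cut-offs and the resulting AUC; the data and model are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)   # toy data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per cut-off
    print(roc_auc_score(y_te, scores))               # area under the ROC curve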

Accuracy Indicator Selection

Different types of predictions:


discrete class predictions


continuous confidence scores


class probability estimates



incompatible between indicators

Neural Network


Finding the connection weights in the network


Non-linear, non-convex optimisation problem


non-deterministic behavior


Iterative training algorithm


different networks given random weight initialization


Classic back propagation


Parameter: learning rate: local and global minima



Large coefficient weights indicate model complexity


complex models are prone to overfit


weight penalty

Tuning

Meta-parameter tuning


adapt the classifier to a given data set


set by analyst:


number of layers, nodes


output layer function


hidden layer activation function


regularisation coefficient

Grid search

Three step approach


define candidate settings for each meta-parameter, then test them



Magnify resolution in promising areas
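
A scikit-learn sketch of the grid idea: define candidate settings per meta-parameter, evaluate every combination with cross-validation, then refine around the best cell; the classifier and grid values are assumptions.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)   # toy data

    grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]}  # candidate settings
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5).fit(X, y)

    print(search.best_params_, search.best_score_)   # best grid cell; magnify resolution around it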

Learning curve analysis

Studies sensitivity of a classifier with respect to sample size



three step approach:


draw small sample


estimate and assess model


increase sample size and repeat



typically, degressive trend, marginal utility of more data decreases with size



can speed things up: find the size where classifier training stabilises


Ensemble Models

Multi-step modelling approach:


develop base models


aggregate their predictions

Why forecast combination increases accuracy

Different views on the same data


Linear and non linear models


Forecast combination gathers information


Asking many experts



Bias variance trade off (combination)


Strength diversity trade off


ensemble margin

Strength-Diversity Trade-Off

Base model strength:


accuracy


Base model diversity:


the extent to which the models' predictions (dis)agree


diversity as forecast correlation



Do not combine identical models



Not possible to be very strong and diverse at the same time - conflict of strength and diversity

Diversity

Class probabilities and class membership: different outputs



develop diversity measure


study behaviour of different measures


no consensus on this idea

Strength Diversity Plot

Scatterplot: each point is a pair



x: average strength (PCC)


y: Correlation

Ensemble Margin

Factor explains ensemble success


+1 for each correct base-model prediction


The larger the margin, the better


-1 (penalty) for each wrong base-model prediction


Weighing possible

Categories of Ensemble Algorithms

Homogeneous



Heterogeneous



Without pruning


With Pruning

Homogeneous Ensemble Classifier

Base model using same algorithm


Inject diversity through training data:


random drawing of cases/variables

Heterogeneous Ensemble Classifiers

Different algorithms


Inject diversity algorithmically


different algos, different meta parameter


Also called multiple-classifier system

Ensemble without pruning

All base models into ensemble

Ensemble with pruning

Candidate base models


Optimize ensemble composition using some search strategy


Put selected base models into ensemble

Homogeneous Algorithms

Bagging


Random Forest


Boosting

Bagging

Work with any classifier


works best with unstable classifiers



bootstrap sampling with replacement


sample includes some cases multiple times


roughly 37% of cases do not appear at all


OOB examples:


facilitate assessing model and variable importance


then: averaging



Bias not altered


Variance reduced



meta parameters: based on algorithm


can use any


how many bootstrap samples


size of bootstrap sample


scalable and generic

Random Forest

Bagging with random subspace


Injects diversity


Bootstrap sample, in addition:


to choose the attribute to split on, draw a random sample of attributes and find the best split among them



increased diversity


random subspace limits access to attributes



tree based tend to overfit, pruning helps


by definition no bias, but high variance


bootstrapping reduces variance



meta parameter: size of random sample of attributes, forest size, size of bootstrap sample



often accurate and easy to scale, estimate variable importance
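
A scikit-learn sketch of a random forest with an out-of-bag score and (Gini) variable importances; the data and meta-parameters are assumptions.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=200,       # forest size (number of bootstrap samples / trees)
        max_features="sqrt",    # random subspace: attributes sampled at each split
        oob_score=True,         # assess on the ~37% out-of-bag cases
        random_state=0,
    ).fit(X, y)

    print(rf.oob_score_)             # OOB accuracy estimate
    print(rf.feature_importances_)   # Gini variable importance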

Boosting

Simple classifier / weak learners


easy to find them


find many


combine many easy classifiers




Check which cases it gets right
second classifier that corrects the errors


Check errors of the ensemble
third classifier that corrects the errors


repeat



weight depends on accuracy



reduce bias and variance


meta-parameters: type of algorithm, often shallow decision trees


embody the idea of a weak classifier


how many iterations


maybe more depending on implementation



vulnerable to overfitting


Multiple Classifier Systems

Simply take the average of preferred classification algorithms



but: different algorithms have different outputs


common scale



stacking: predictions of the base classifiers are correlated, so use a robust 2nd-level classifier (classical regression suffers from multicollinearity)


modelling gets very complex, tuning?



ensemble selection:


build base classifier, select single best, iteratively add one additional and average, continue as long as ensemble accuracy increases


weighting and selecting a base model multiple times are possible



Highly accurate predictions


scalable



relatively complex


resource intensive


model management costly