30 Cards in this Set

  • Front
  • Back
Data Mining
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases
Data
- Categorical
    Nominal
    Ordinal
- Numerical
    Interval
    Ratio
Types of Data Mining Patterns
1. Association

2. Prediction (supervised)


3. Cluster


4. Sequential



Types of data mining
1. Discovery driven

2. Hypothesis driven

Data Mining Process
1. CRISP-DM (Cross-Industry Standard Process)

2. SEMMA (Sample, Explore, Modify, Model and Assess)


3. KDD (Knowledge Discovery in Databases)

CRISP-DM
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation (!)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment

SEMMA (Sample, Explore, Modify, Model and Assess)
Classification
- Part of the machine-learning family
- Employs supervised learning
- Learns from past data, classifies new data
- The output variable is categorical (nominal or ordinal)

Assessment Methods for Classification
- Predictive accuracy (hit rate)
- Speed (model building; predicting)
- Robustness
- Scalability
- Interpretability (transparency; ease of understanding)

Accuracy of Classification Models
Confusion matrix
Accuracy of Classification Models
True Positive Rate - Accuracy of Classification Models
True Negative Rate - Accuracy of Classification Models
Recall - Accuracy of Classification Models
Precision - Accuracy of Classification Models
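The definitions on the backs of these cards did not survive extraction. As a reminder only, here is a minimal sketch of how these measures are commonly computed from the four confusion-matrix counts. The function name and the example counts are illustrative assumptions, not taken from the card set.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return common accuracy measures from confusion-matrix counts.

    tp/fp/tn/fn are the counts of true positives, false positives,
    true negatives, and false negatives (hypothetical inputs).
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,          # overall hit rate
        "true_positive_rate": tp / (tp + fn),   # share of actual positives found
        "true_negative_rate": tn / (tn + fp),   # share of actual negatives found
        "precision": tp / (tp + fp),            # share of predicted positives that are right
        "recall": tp / (tp + fn),               # same quantity as the true positive rate
    }


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that recall and true positive rate are the same quantity under these standard definitions.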
Estimation Methodologies for Classification
- Simple Split: Split the data into 2 mutually exclusive sets, training ~70% and testing ~30%
- For ANN: the data is split into three subsets: training ~60%, validation ~20%, testing ~20%
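A minimal sketch of both splits, assuming the records are shuffled before being cut. The helper name `split_data` and the fixed seed are illustrative choices, not part of the card set.

```python
import random


def split_data(records, fractions, seed=0):
    """Shuffle records and cut them into mutually exclusive subsets.

    fractions lists the approximate share of each subset, e.g. [0.7, 0.3].
    The last subset receives whatever remains after the earlier cuts.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for fraction in fractions[:-1]:
        end = start + int(round(len(shuffled) * fraction))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])
    return parts


# Simple split: ~70% training, ~30% testing.
train, test = split_data(range(100), [0.7, 0.3])

# Three-way split often used for neural networks: ~60/20/20.
ann_train, ann_validation, ann_test = split_data(range(100), [0.6, 0.2, 0.2])
```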

Estimation Methodologies for Classification
k-Fold Cross Validation (rotation estimation)
Leave-one-out
Bootstrapping
Jackknifing
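A minimal sketch of how k-fold cross validation rotates the test fold, assuming no shuffling and contiguous folds. The generator name `k_fold_indices` is hypothetical. Leave-one-out is the special case where k equals the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))          # 5-fold on 10 samples
leave_one_out = list(k_fold_indices(4, 4))   # k == n_samples
```

Each sample lands in a test fold exactly once, so every record is used for both training and testing across the k rounds.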

Classification Techniques
Decision tree analysis
Statistical analysis
Neural networks
Support vector machines
Case-based reasoning
Bayesian classifiers
Genetic algorithms
Rough sets

Decision Trees
Recursively divides a training set until each division consists of examples from one class
DT algorithms vary in their splitting criteria, pruning criteria, and stopping criteria
ID3, C4.5, C5; CART; CHAID; M5
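As an illustration of one splitting criterion, here is a minimal sketch of information gain, the entropy-based measure usually associated with ID3. The cards do not say which criterion each algorithm uses, so treat the pairing, the function names, and the toy data as assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Toy training set: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rainy", "windy": True},
    {"outlook": "rainy", "windy": False},
]
labels = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
```

A tree builder would split on the attribute with the highest gain and recurse on each partition until a stopping criterion is met.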

Cluster Analysis for Data Mining
Used for automatic identification of natural groupings of things
Part of the machine-learning family
Employs unsupervised learning
“Learns the clusters of things” from past data, then assigns new instances
There is no output variable
Also known as segmentation

Applications of Cluster Analysis
- Natural groupings of customers
- Identify rules for assigning new cases to classes for targeting/diagnostic purposes
- Provide characterization, definition, labeling of populations
- Decrease the size and complexity of problems for other data mining methods
- Identify outliers in a specific domain

Clustering methods
- Statistical methods
    such as k-means, k-modes, and so on
- Neural networks
    adaptive resonance theory (ART), self-organizing map (SOM)
- Fuzzy logic
    e.g., fuzzy c-means algorithm
- Genetic algorithms
- Divisive versus agglomerative methods
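To make the k-means entry concrete, here is a minimal one-dimensional sketch. The starting centroids, the fixed iteration count, and the function name are illustrative assumptions; real implementations work on multi-dimensional data and test for convergence.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


centroids, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centroids=[0.0, 5.0])
```

There is no output variable here: the groupings come only from distances between the values themselves.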

Association Rule Mining
- A very popular DM method in business
- Finds interesting relationships (affinities) between variables (items or events)
- Part of the machine-learning family
- Employs unsupervised learning
- There is no output variable
- Also known as market basket analysis
- Used as an example to describe DM to non-data-mining people, such as the “relationship between diapers & beer!”
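The cards do not name the measures used to judge whether a relationship is "interesting". A common choice is support, confidence, and lift, sketched below for a diapers-and-beer rule. The baskets are made up for illustration and the function name is hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(a <= t for t in transactions)           # baskets containing the antecedent
    count_c = sum(c <= t for t in transactions)           # baskets containing the consequent
    count_both = sum((a | c) <= t for t in transactions)  # baskets containing both
    support = count_both / n
    confidence = count_both / count_a
    lift = confidence / (count_c / n)
    return support, confidence, lift


# Made-up market baskets, for illustration only.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "milk"},
    {"beer"},
    {"milk", "bread"},
]
support, confidence, lift = rule_metrics(baskets, {"diapers"}, {"beer"})
```

A lift above 1 suggests the two items appear together more often than they would if they were independent.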

Association Rule Results for Business Use
- Put the items next to each other for ease of finding
- Promote items as a package: do not put one on sale if the others are on sale
- Place items far apart from each other so that the customer has to walk the aisles to search for them, and by doing so potentially see and buy other items


- On an e-commerce site, promote “Customers

Applications of Association Rule Mining
In business: cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration
In medicine: relationships between symptoms and illnesses

Algorithms for Association
- Apriori
- Eclat
- FP-Growth
- Derivatives and hybrids of the three
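For orientation, here is a minimal sketch of the level-wise idea behind Apriori: count small itemsets first, then only extend those that meet a minimum support. It is a simplified teaching version under assumed inputs, not an optimized or reference implementation, and the baskets are made up.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that are frequent at this level.
        level = {c: s for c, s in ((c, support(c)) for c in current) if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Made-up baskets, for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "bread"},
]
frequent = apriori(baskets, min_support=0.5)
```

The prune step is what keeps the search tractable: an itemset cannot be frequent if any of its subsets is infrequent.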

Artificial Neural Networks for DM
- Artificial neural networks (ANN or NN) are a brain metaphor for information processing
- a.k.a. Neural Computing
- Very good at capturing highly complex non-linear functions!
- Many uses: prediction (regression, classification), clustering/segmentation
- Many application areas: finance, medicine, marketing, manufacturing, service operations, information systems

Elements/Concepts of ANN
- Processing element (PE)
- Information processing
- Network structure
- Learning parameters
- ANN software
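A minimal sketch of a single processing element: it forms a weighted sum of its inputs and passes the result through a transfer function. The sigmoid transfer function, the weights, and the function name are assumptions made for illustration; the cards do not specify them.

```python
from math import exp


def processing_element(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a sigmoid transfer function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + exp(-weighted_sum))


# Illustrative inputs and weights whose weighted sum is zero.
output = processing_element([1.0, 0.5], [0.4, -0.8])
```

In a full network, many such elements are arranged in layers (the network structure), and the weights are adjusted during training according to the learning parameters.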

Data Mining myths
Data mining...
- provides instant solutions/predictions.
- is not yet viable for business applications.
- requires a separate, dedicated database.
- can only be done by those with advanced degrees.
- is only for large firms that have lots of customer data.
- is another name for good-old statistics.

Common DM Blunders
1. Selecting the wrong problem for DM
2. Ignoring what your sponsor thinks DM is and what it really can do
3. Not leaving sufficient time for data acquisition, selection, and preparation
4. Looking only at aggregated results and not at individual records/predictions
5. Being sloppy about keeping track of the data mining procedure and results
6. Ignoring suspicious (good or bad) findings and quickly moving on
7. Running algorithms repeatedly and blindly, without thinking about the next stage
8. Naively believing everything you are told about the data
9. Naively believing everything you are told about your own DM analysis
10. Measuring results differently from the way your sponsor measures them