70 Cards in this Set
- Front
- Back
What does the term data mining mean?
|
Data mining enables you to deduce hidden knowledge by examining, or training on, the data. The knowledge you find is expressed as patterns and rules.
|
|
What is the unit of examination called?
|
Case, which can be interpreted as one appearance of an entity or a row in a table
|
|
What does a data mining model store?
|
Information about the variables you use, the algorithms you implement on the data, the parameters of the selected algorithms, and (after training is complete) the extracted knowledge
|
|
What are the two main classes of techniques in data mining?
|
Directed approach and undirected approach
|
|
What are the four parts of the Data Mining Life Cycle?
|
1. Identifying the business problem
2. Using data mining techniques to transform the data into actionable information
3. Acting on the information
4. Measuring the results
|
|
What are the four stages in the transform phase of the Data Mining Life Cycle?
|
Prepare the data
Create the models
Examine and evaluate the models
Deploy the selected model
|
|
What are the two most important factors in the success of a data mining project?
|
Data preparation and understanding
|
|
What are four kinds of variables you can use to measure values?
|
Categorical or nominal attributes
Ranks
Intervals
True numeric variables
|
|
What is the difference between a simple and a complex case?
|
Complex cases have nested tables in one or more columns. These need to be flattened or normalized into standard rowsets, where you perform joins between parent and child tables
|
|
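A minimal sketch of what flattening a nested table means: each parent case is joined to its child rows so that every output row is a plain, flat record. The table contents, column names, and the `flatten_cases` helper are hypothetical and serve only as an illustration.

```python
def flatten_cases(parents, children, key):
    """Join parent rows to child rows on `key`, yielding one flat row per child."""
    rows = []
    for parent in parents:
        for child in children:
            if child[key] == parent[key]:
                # Merge parent and child columns into a single flat record.
                rows.append({**parent, **child})
    return rows

# Hypothetical parent (customers) and child (purchases) tables.
customers = [{"customer_id": 1, "region": "north"},
             {"customer_id": 2, "region": "south"}]
purchases = [{"customer_id": 1, "product": "bike"},
             {"customer_id": 1, "product": "helmet"},
             {"customer_id": 2, "product": "gloves"}]

flat_rows = flatten_cases(customers, purchases, "customer_id")
```

Each customer now appears once per purchase, which is the standard rowset shape that tools without nested-table support expect.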
What two kinds of data do you need to decide how to handle?
|
Outliers and missing data
|
|
What is the key to properly preparing the training set and the test set?
|
Statistically split the data randomly (you can use Row Sampling and Percentage Sampling Transformations in SSIS)
|
|
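A random percentage split can be sketched in a few lines. This is an illustration of the idea only, not how the SSIS transformations are implemented; the `split_cases` helper, the 30 percent test fraction, and the seed are assumptions.

```python
import random

def split_cases(cases, test_fraction=0.3, seed=42):
    """Randomly assign cases to a training set and a test set."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(cases)         # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # First n_test shuffled cases form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

training, test = split_cases(range(100), test_fraction=0.3)
```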
How do you verify that your data split is random?
|
Compare the first four moments (mean, standard deviation, skewness, and kurtosis) of the variables in the training set and the test set; they should be similar
|
|
What is kurtosis?
|
Kurtosis measures the peakedness of a probability distribution, showing whether the distribution is narrow and high around the center or lower and flatter close to the center
|
|
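The four moments used to compare a training set with a test set can be computed directly. The sketch below uses population formulas and reports excess kurtosis (kurtosis minus 3, so a normal distribution scores 0); both choices are assumptions, since the cards do not say which variant to use.

```python
import math

def four_moments(values):
    """Return (mean, standard deviation, skewness, excess kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n   # population variance
    std = math.sqrt(variance)
    skewness = sum((v - mean) ** 3 for v in values) / n / std ** 3
    # Subtracting 3 gives excess kurtosis: positive means a sharper peak
    # than the normal distribution, negative means a flatter one.
    kurtosis = sum((v - mean) ** 4 for v in values) / n / std ** 4 - 3
    return mean, std, skewness, kurtosis

mean, std, skewness, kurtosis = four_moments([1, 2, 3, 4, 5])
```

Running the function on the same variable in both sets and comparing the results gives a quick check that the split did not skew either set.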
What do you use to create your models?
|
The Analysis Services Project template in BIDS. You define the data source and DSV objects in the same way you create them for UDM dimensions and cubes
|
|
What is a data mining structure?
|
A data structure that defines the domain from which you build your mining models. It specifies the source data through a DSV, the columns, and the partitioning into training and test sets. It can contain multiple mining models
|
|
How many models should you make?
|
Multiple. Evaluate them all, see if they agree, and then deploy the one that works the best.
|
|
What are the four options in the mining model for defining the use of columns?
|
Input
Predictable
Input and predictable
Ignored
|
|
What are the nine data mining algorithms included in SSAS?
|
Association Rules
Clustering
Decision Trees
Linear Regression
Logistic Regression
Naive Bayes
Neural Network
Sequence Clustering
Time Series
|
|
What do you need to use in order to analyze texts such as articles in magazines?
|
Text mining, which is not part of SSAS. Instead, use the two SSIS transformations for text mining: Term Extraction and Term Lookup
|
|
Which algorithm do advanced e-mail SPAM filters use?
|
Naive Bayes
|
|
What is the Association Rules algorithm used for?
|
Market basket analysis. Used to find cross-selling opportunities
|
|
What is the Clustering Algorithm used for?
|
Groups cases from a dataset into clusters of similar characteristics. Used for grouping customers for a CRM application. Also used for searching for anomalies in data, as in fraud detection
|
|
What is the Decision Trees Algorithm used for?
|
The most popular data mining algorithm. Easy to understand. Used to predict discrete and continuous variables. A tree that predicts continuous variables is a regression tree
|
|
What is the Linear Regression algorithm?
|
Predicts continuous variables using a single multiple linear regression formula. It is a regression tree with no splits.
|
|
What is the Logistic Regression algorithm?
|
A Logistic Regression algorithm is a Neural Network without any hidden layers.
|
|
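A network with no hidden layers reduces to a weighted sum of the inputs passed through a sigmoid, which is the logistic regression formula. The sketch below shows only the prediction step; the weights and bias are made-up values, not the output of any training run.

```python
import math

def logistic_predict(inputs, weights, bias):
    """Probability of the positive state: sigmoid of a weighted sum of inputs."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: the inputs feed the output node directly,
# with no hidden layer in between.
p_neutral = logistic_predict([0.0, 0.0], weights=[1.5, -0.5], bias=0.0)
p_positive = logistic_predict([2.0, 1.0], weights=[1.5, -0.5], bias=0.0)
```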
What is the Naive Bayes algorithm?
|
Calculates probabilities for each possible state of the input attribute. Fast and a good starting point. Doesn't support continuous attributes
|
|
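The idea of counting attribute states per target state can be sketched for discrete attributes as follows. This is a generic textbook Naive Bayes with add-one smoothing, not the SSAS implementation; the sample cases and column names are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(cases, target):
    """Count target states and, per target state, the states of each input."""
    class_counts = Counter(case[target] for case in cases)
    # value_counts[attribute][target_state][attribute_state] -> count
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for case in cases:
        for attr, state in case.items():
            if attr != target:
                value_counts[attr][case[target]][state] += 1
    return class_counts, value_counts

def predict(model, case):
    """Return the normalized probability of each target state for one case."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total                      # prior probability
        for attr, state in case.items():
            states = {s for c in value_counts[attr].values() for s in c}
            # Add-one smoothing keeps unseen combinations from zeroing the score.
            score *= (value_counts[attr][cls][state] + 1) / (n_cls + len(states))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

cases = [
    {"commute": "bike", "region": "north", "buyer": "yes"},
    {"commute": "bike", "region": "south", "buyer": "yes"},
    {"commute": "car", "region": "north", "buyer": "no"},
    {"commute": "car", "region": "south", "buyer": "no"},
    {"commute": "bike", "region": "north", "buyer": "yes"},
]
model = train_naive_bayes(cases, target="buyer")
probs = predict(model, {"commute": "bike", "region": "north"})
```

Training is a single counting pass over the data, which is why the algorithm is fast and works well as a first model.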
What is the Neural Network algorithm?
|
Searches for nonlinear functional dependencies. Harder to interpret than linear algorithms such as decision trees and not often used for business.
|
|
What is the sequence clustering algorithm?
|
Searches for clusters based on a model rather than similarity of cases. Builds Markov chains with combinations of all possible states and assigns probabilities of moving from one state to another. Used for analyzing web sites.
|
|
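The Markov chain part of the answer above, estimating the probability of moving from one state to the next, can be sketched from raw sequences. The clustering step itself is not shown, and the page names in the sample clickstreams are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next state | current state) from observed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair every state with the state that immediately follows it.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

clickstreams = [
    ["home", "products", "cart"],
    ["home", "products", "home"],
    ["home", "cart"],
]
chain = transition_probabilities(clickstreams)
```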
What is the time series algorithm?
|
Created for forecasting continuous variables using the ART (Auto-Regression Trees) and ARIMA (Auto-Regressive Integrated Moving Average) algorithms
|
|
What three main tools are included in BIDS for creating mining models?
|
Data Mining Wizard
Data Mining Designer
Data Mining Viewers
|
|
What three things do you do with the Data Mining Wizard?
|
Define the DSV and the tables and columns from the DSV that you want to use
Add an initial model to the structure
Partition the data into training and test sets
|
|
What five tasks can you perform in the Data Mining Designer?
|
Modify the mining structure
Add additional mining models to the structure
Process the structure and browse the models using Data Mining Viewers
Check the accuracy of the models using a lift chart and other techniques
Create DMX prediction queries using your models
|
|
What are the two discretization methods in SSAS 2008?
|
EqualAreas and Clusters
|
|
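An equal-areas style discretization picks bucket boundaries so that each bucket holds roughly the same number of cases. The sketch below is a plain equal-frequency split written for illustration; it is an assumption that this matches the intent of EqualAreas, and it does not reproduce how SSAS computes its buckets.

```python
import bisect

def equal_areas_boundaries(values, bucket_count):
    """Pick boundaries so each bucket receives about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bucket_count] for i in range(1, bucket_count)]

def bucket_index(value, boundaries):
    """Return the 0-based bucket a value falls into."""
    return bisect.bisect_right(boundaries, value)

ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
boundaries = equal_areas_boundaries(ages, bucket_count=3)
buckets = [bucket_index(age, boundaries) for age in ages]
```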
What are the different components that make up the SQL Server BI Suite?
|
SSAS cubes
SSAS data mining
SSRS
SSIS
|
|
What are three ways to prepare training sets and test sets?
|
Use the Data Mining Wizard and Data Mining Designer in BIDS to specify the percentage of the holdout data for the test set
Use the TABLESAMPLE option of the T-SQL SELECT statement
Use the SSIS Row Sampling Transformation and Percentage Sampling Transformation
|
|
What are the five supported content types for columns?
|
Discrete
Continuous
Discretized
Ordered
Cyclical
|
|
What are the three key column types?
|
Key (the primary key)
Key Sequence
Key Time
|
|
What algorithm did Microsoft develop for clickstream analysis?
|
Sequence Clustering
|
|
What is your case table and case-level columns when you mine an OLAP cube?
|
A dimension is the case table and any measure group or fact table connected with the selected dimension can be used as a case-level column
|
|
Why can't you use a mining model as a dimension in the same cube in which you used it as the source for the model?
|
You would get a circular reference and never stop processing
|
|
Which algorithm would you use to find the best way to arrange products on shelves in a retail store?
|
Association Rules
|
|
What does a lift chart show?
|
Compares the performance of models when predicting a single value
Or shows the quality of global predictions
|
|
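The numbers behind a lift chart for a single value can be sketched as follows: rank the cases by predicted probability, then measure what share of the actual target cases is captured in the top portion of the ranking. The scored cases are made up, and the helper assumes at least one target case exists.

```python
def lift_curve(scored_cases, steps=5):
    """scored_cases: (predicted_probability, is_target) pairs.

    Returns (fraction of population contacted, fraction of targets captured)."""
    ranked = sorted(scored_cases, key=lambda c: c[0], reverse=True)
    total_targets = sum(1 for _, hit in ranked if hit)
    points = []
    for step in range(1, steps + 1):
        cutoff = round(len(ranked) * step / steps)
        captured = sum(1 for _, hit in ranked[:cutoff] if hit)
        points.append((step / steps, captured / total_targets))
    return points

scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False),
          (0.4, False), (0.3, False), (0.2, True), (0.1, False), (0.05, False)]
points = lift_curve(scored, steps=5)
captured = [share for _, share in points]
```

A random model would capture targets in proportion to the population contacted (20 percent at the first step), so capturing 50 percent there indicates lift.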
What does a classification matrix show?
|
Compares the actual values to the predicted values
|
|
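A classification matrix is a count of (actual, predicted) pairs. The sketch below builds one from two hypothetical lists of labels and derives overall accuracy from the diagonal.

```python
from collections import Counter

def classification_matrix(actual, predicted):
    """Count how often each actual value was predicted as each value."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Share of cases where the predicted value equals the actual value."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())

actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]
matrix = classification_matrix(actual, predicted)
overall = accuracy(matrix)
```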
What are the five settings you can define for cross-validation?
|
Fold Count (how many partitions are created in the training data)
Max Cases
Target Attribute
Target State
Target Threshold (minimum accuracy needed for a prediction to be counted as correct)
|
|
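The role of the fold count can be illustrated with a generic k-fold partition: the data is cut into that many folds, and each fold takes one turn as the held-out set while the others are used for training. This is a sketch of the general technique, not of the SSAS cross-validation report, and the helper names are hypothetical.

```python
def make_folds(cases, fold_count):
    """Distribute cases round-robin into fold_count partitions."""
    folds = [[] for _ in range(fold_count)]
    for i, case in enumerate(cases):
        folds[i % fold_count].append(case)
    return folds

def cross_validation_rounds(cases, fold_count):
    """Yield (training, held_out) pairs, holding out one fold per round."""
    folds = make_folds(cases, fold_count)
    for held_out in range(fold_count):
        training = [c for i, fold in enumerate(folds) if i != held_out
                    for c in fold]
        yield training, folds[held_out]

rounds = list(cross_validation_rounds(list(range(10)), fold_count=5))
```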
On a real data mining project, which two tasks will take most of the time?
|
Data preparation and then validation of predictive models
|
|
What are the three measures that give you information about the quality of the rules that the Association Rules algorithm finds?
|
Support (how many times the items were found together)
Probability (built by direction: A-->B, not B-->A)
Importance (score of the rule, how correlated the items are)
|
|
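The three measures can be computed by hand for a rule A-->B over a small set of transactions. Support and probability follow the definitions on the card. Importance is computed here as log10 of P(B given A) divided by P(B given not A); treat that formula as an assumption about how the score is defined. The transactions are made up, and the helper assumes both probabilities are nonzero.

```python
import math

def rule_measures(transactions, antecedent, consequent):
    """Return (support, probability, importance) for antecedent --> consequent."""
    with_a = [t for t in transactions if antecedent in t]
    without_a = [t for t in transactions if antecedent not in t]
    # Support: number of transactions that contain both items.
    support = sum(1 for t in with_a if consequent in t)
    # Probability: how often B appears when A is present (direction matters).
    probability = support / len(with_a)
    # Assumed importance formula: log10 of P(B|A) / P(B|not A).
    p_b_without_a = sum(1 for t in without_a if consequent in t) / len(without_a)
    importance = math.log10(probability / p_b_without_a)
    return support, probability, importance

transactions = [{"bike", "helmet"}, {"bike", "helmet"}, {"bike"},
                {"helmet", "gloves"}, {"gloves"}, {"gloves"}]
support, probability, importance = rule_measures(transactions, "bike", "helmet")
```

An importance above zero means B is more likely when A is present than when it is absent.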
What two parameters can be used to control the creation of historical models?
|
HISTORICAL_MODEL_COUNT (number of historical models built)
HISTORICAL_MODEL_GAP (number of time slices between historical models)
|
|
What are the two kinds of DMX statements?
|
DDL (Data Definition Language)
DML (Data Manipulation Language)
|
|
What are eight DMX DDL statements?
|
CREATE MINING STRUCTURE
ALTER MINING STRUCTURE
CREATE MINING MODEL
EXPORT
IMPORT
SELECT INTO
DROP MINING MODEL
DROP MINING STRUCTURE
|
|
What are four DMX DML statements?
|
INSERT INTO (which trains the model)
SELECT
UPDATE
DELETE
|
|
Do dataset tables support nested tables?
|
No
|
|
What are the three types of charts you can use to evaluate predictive models?
|
Lift chart for global statistics
Lift chart for a single value
Profit chart
|
|
How do you evaluate a Time Series model?
|
You can make historical predictions to evaluate a time series model
|
|
How do you evaluate a clustering model?
|
You should evaluate clustering models from a business perspective
|
|
Using DMX, can you add a mining model to an existing structure so that you can share the structure with other models?
|
Yes. You can use the DMX ALTER MINING STRUCTURE statement to add a mining model to an existing structure so that the structure is shared with other models
|
|
Can you use DMX to drill through to the sample cases you used for training a mining model?
|
Yes, you can use the DMX SELECT FROM <model>.CASES syntax to drill through to the sample cases you used to train a mining model
|
|
What are the four SSAS general data mining properties?
|
AllowSessionMiningModels
AllowAdHocOpenRowsetQueries
AllowedProvidersInOpenRowset
MaxConcurrentPredictionQueries
|
|
If you want to let applications use the SSAS data mining features, which data mining property do you need to set as "true"?
|
AllowSessionMiningModels
|
|
What are your four options for impersonation information in a data source?
|
Use a specific username and password
Use the service account
Use the credentials of the current user
Inherit (impersonates current users)
|
|
What permission must a user have to connect to an SSAS database through SSMS or BIDS?
|
Read Definition permission for a complete SSAS database
|
|
What is a data mining structure?
|
A blueprint of the database schema that is shared by all mining models inside the structure
|
|
What is the default CacheMode property and what does it allow?
|
The default CacheMode property is set to KeepTrainingCases, which caches the data mining model training data so that users can issue drill-through queries to see the source data. You can set it to ClearAfterProcessing to avoid keeping large data volumes in the cache
|
|
What is another phrase for training the model?
|
Model processing
|
|
What are the four steps to processing a mining structure?
|
Save changes in BIDS
On the Mining Structure tab, click the Process the Mining Structure button
In the Process dialog box, select the desired processing option, then click Run
Watch the Process Progress dialog box
|
|
Can you use SQL Server logins for SSAS authentication?
|
No. SSAS supports Windows authentication only
|
|
Do end users need the Process permission on a mining structure?
|
No
|
|
As an administrator, how would you prevent usage of the clustering data mining algorithm?
|
Use the Analysis Services Properties dialog box in SSMS
|
|
What processing option deletes the training data in a mining structure without affecting its mining models?
|
Use the Process Clear Structure option to purge the structure data without affecting the models inside the structure
|
|
Can an SSRS report use a mining model as its source?
|
Yes
|
|
How do you browse mining models?
|
The DMX language. You can also use the Prediction Query Builder in SSMS and BIDS to create prediction DMX queries
|