100 Cards in this Set
- Front
- Back
OLAP definition
|
An approach to quickly answer multi-dimensional analytical queries.
|
|
OLAP data
|
Organized hierarchically and stored in cubes instead of tables
|
|
OLAP function
|
Slicing and dicing of data
|
|
OLAP definition
|
A category of applications and technologies for collecting, managing, processing, and presenting multidimensional data for analysis and management purposes.
|
|
What is BI?
|
Computer-based methods for identifying and extracting useful information from business data. It encompasses OLAP, relational reporting, and data mining.
|
|
Excel can represent multi-dimensional data. T or F?
|
True.
|
|
What feature in Excel provides OLAP capabilities?
|
Pivot Tables
|
|
Measure
|
A summarized numerical value that you use to monitor how well the business is doing.
e.g., Units sold, revenue, defects, number of people who responded to an ad. |
|
Aggregation techniques used when presenting measures
|
Sum, Average, Max, Min
|
|
How do you expand a measure (e.g., spread total sales across a time interval, region, product, or salesperson)?
|
Add a dimension (e.g., time, region, product, or salesperson)
|
|
Dimensions
|
The different characteristics by which the measure values may be presented to the user
|
|
Cube
|
A measure and its associated dimensions.
A subset of highly interrelated data that is organized to allow users to combine any attributes in a cube (e.g., stores, products, customers, suppliers) with any metrics/measures in the cube (sales, profit, units, age) to create various views. |
|
Data cube
|
A two-dimensional, three-dimensional, or higher dimensional object in which each element of the data represents a measure of interest.
|
|
slice
|
A two-dimensional view that is a subset of highly interrelated data in a multi-dimensional cube.
|
|
Fact Table
|
Stores the detailed values for measures in un-normalized form
|
|
Dimension Table
|
A table that houses the names and attributes of the different characteristics by which the measure values may be presented to the user. It is linked by foreign keys to a fact table. Dimension tables contain classification and aggregate information about the central fact table rows.
|
|
A ______________ is a column in a dimension table.
|
attribute
|
|
Star Schema
|
The simplest DW design: all dimension tables are directly related to the fact table by foreign keys. It is denormalized and takes more space.
|
|
grain
|
The highest level of detail that is supported in a data warehouse.
|
|
Function of a dimension table
|
Defines how data will be sliced and diced
|
|
Snowflake schema
|
A DW design where dimension tables are layered and not all directly related to the fact table. It is normalized, requiring more complicated queries and more processing time for the joins.
|
|
Data Mining
|
The process through which previously undiscovered patterns in data are identified leading to knowledge discovery
|
|
Techniques data mining uses to extract and identify new knowledge remaining untapped in large databases
|
statistical, mathematical, artificial intelligence and machine learning techniques
|
|
Types of new knowledge that data mining creates
|
rules, affinities, correlations, trends or prediction models
|
|
Data mining extracts data from ___________________ data sources.
|
disparate
|
|
Sensitivity analysis
|
the study of how the uncertainty in the output of a mathematical model or system (numerical or otherwise) can be apportioned to different sources of uncertainty in its inputs.
|
|
Data
|
refers to a collection of facts, usually obtained as the result of experience, observations, or experiments.
|
|
Patterns DM tries to identify
|
Classification
Clustering
Association/Sequence discovery
Prediction/Forecasting |
|
Classification
|
HISTORICAL BEHAVIOR TO PREDICT - Analyzing the historical behavior of groups of entities to predict the future behavior of a new entity from its similarity to those groups. It is the most frequently used data mining method for real world problems
|
|
Another name for Classification
|
Supervised induction
|
|
Common classification tools
|
Neural networks, decision trees, logistic regression, and discriminant analysis, plus emerging tools such as rough sets, support vector machines, and genetic algorithms
|
|
How Classification works
|
learns patterns from past data in order to place new instances (with unknown labels) into their respective groups
|
|
Examples of classification data mining method
|
weather prediction, credit approval, store location, targeted marketing, and fraud detection.
|
|
Supervised learning
|
A method of training artificial neural networks in which sample cases are shown to the networks as input, and the weights are adjusted to minimize the error in the outputs.
|
|
The approach/algorithm used in the classification data mining method
|
decision trees
|
|
Which data mining methods employ supervised learning methods.
|
Classification and Prediction/Forecasting
|
|
Which data mining methods employ unsupervised learning methods.
|
Clustering and Association/Sequence discovery
|
|
Classification method objective
|
Analyzing the historical behavior of groups of entities to predict the future behavior of a new entity from its similarity to those groups.
|
|
Classification
|
This induced model consists of generalizations over the records of a training dataset, which help distinguish pre-defined classes.
|
|
Clustering
|
PARTITIONING - A natural partitioning of data into groups of entities with similar characteristics
|
|
Uses of Clustering methods
|
grouping students according to grades, perform market segmentation,
|
|
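The "grouping students according to grades" example above can be sketched with a minimal k-means-style clustering routine. This is an illustrative sketch only, on 1-D data with hand-picked starting centers; `kmeans_1d` is a hypothetical helper name, not from the course material.

```python
def kmeans_1d(points, centers, iters=10):
    """A minimal k-means sketch on 1-D data: assign each point to its
    nearest center, then move each center to the mean of its group."""
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            groups[nearest].append(p)
        # Recompute each center; keep the old center if its group is empty
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

grades = [55, 58, 60, 85, 88, 90]
print(kmeans_1d(grades, centers=[50, 100]))  # two natural grade clusters
```

No labels are supplied, which is what makes clustering an unsupervised method: the partition emerges from the data alone.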
Association/Sequence discovery
|
SIMULTANEOUS RELATIONSHIPS TIME ORDER - establishing relationships among items that occur together or in a time order (basket analysis).
|
|
Association/Sequence discovery
|
a popular and well-researched technique for discovering interesting relationships among variables in large databases.
|
|
The approach/algorithm used in the Association/Sequence discovery data mining method
|
Apriori
|
|
Define Decision Tree Attributes
|
The input variables that may have an impact on the classification of different patterns.
|
|
Predictive accuracy
|
The model’s ability to correctly predict the class label of new or previously unseen data. It is the percentage of test dataset samples correctly classified by the model.
|
|
Speed
|
The computational costs involved in generating and using the model, where faster is deemed to be better.
|
|
Robustness
|
The model’s ability to make reasonably accurate predictions, given noisy data or data with missing and erroneous values.
|
|
Scalability
|
The ability to construct a prediction model efficiently given a rather large amount of data.
|
|
Interpretability
|
The level of understanding and insight provided by the model (e.g., how and/or what the model concludes on certain predictions)
|
|
Gini Index
|
Used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable
|
|
Gini index
|
Measures the homogeneity/diversity of data in a sample set.
|
|
Gini index=0
|
Gini index rating if the data is homogeneous
|
|
Gini index >0
|
Gini index rating which indicates diversity in the data
|
|
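The Gini index definitions above can be sketched in a few lines: the index is 1 minus the sum of squared class proportions, so a homogeneous sample scores 0 and any mix scores above 0. The function name `gini_index` is illustrative.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of a sample of class labels:
    0 for a homogeneous sample, >0 when classes are mixed."""
    n = len(labels)
    counts = Counter(labels)
    # 1 minus the sum of squared class proportions
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_index(["yes", "yes", "yes"]))       # homogeneous -> 0.0
print(gini_index(["yes", "yes", "no", "no"]))  # maximally mixed -> 0.5
```

In decision-tree induction, the attribute whose split yields the purest (lowest-Gini) branches is chosen at each node.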
The Apriori algorithm
|
The most commonly used algorithm to discover association rules.
|
|
What is a data warehouse?
|
A subject-oriented, integrated, time-variant, nonvolatile collection of data, produced to support decision making; it is also a repository of current and historical data, usually structured in a form ready for analytical processing.
|
|
What are the major components of the data warehousing process?
|
Transaction data systems
ETL Process - Extract, Transform, Load
Data Warehouse - comprehensive database
Data Marts
Middleware/Analytical Tools - SQL, cubes
BI Applications (Visualization) - OLAP, Dashboard, Web |
|
What Data marts provide
|
different views of the data warehouse
|
|
Neural computing
|
a pattern-recognition methodology for machine learning
|
|
Artificial Neural Network (ANN)
|
The resulting model from pattern-recognition methodology for machine learning
|
|
What have neural networks been used for?
|
Used in many business applications for pattern recognition, forecasting, prediction and classification. It is the key component of any data mining tool.
|
|
Connection weights
|
The key element of an ANN, they express the relative strength of the input data (always a single attribute), crucial in that they store learned patterns of information.
|
|
Summation Function
|
computes the weighted sums of all the input elements.
|
|
Transfer function (e.g., Sigmoid function)
|
A popular and useful non-linear function, it is an S-shaped transfer function in the range of 0 to 1. It transforms the summed inputs to a node into the response out from the node.
|
|
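The summation and sigmoid transfer function cards above combine into a single artificial neuron, sketched below. The helper name `neuron_output` and the sample weights are illustrative, not from the course material.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """One artificial neuron: summation function followed by a sigmoid transfer."""
    # Summation function: weighted sum of all input elements, plus a bias term
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid transfer function: S-shaped, squashes the sum into (0, 1)
    return 1.0 / (1.0 + math.exp(-s))

print(neuron_output([1.0, 0.5], [0.2, -0.4]))  # weighted sum 0.0 -> sigmoid 0.5
```

During training, a learning algorithm adjusts the weights so the output moves toward the desired target.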
Supervised learning
|
Sets of input are iteratively presented to the neural network and compared to the desired output
|
|
Learning algorithm
|
Determines how the neural interconnection weights are corrected based on differences between the actual and desired output for a member of the training set.
|
|
Unsupervised learning
|
The network learns a pattern through repeated exposures, it is not compared to a target answer, self-organizing or clustering
|
|
Learning Rate (alpha)
|
A parameter in neural networks; it affects the speed at which the ANN arrives at the solution; it determines the portion of the existing discrepancy that must be offset.
|
|
Momentum
|
A parameter in back-propagation neural networks, it slows, smoothens and stabilizes the learning process; reduces over-correcting of weights.
|
|
Back propagation learning
|
The best-known learning algorithm in neural computing where the learning is done by comparing computed outputs to desired outputs of training cases.
|
|
Text mining
|
The semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources.
|
|
corpus
|
A large and structured set of texts (now usually stored and processed electronically) prepared for the purpose of conducting knowledge discovery.
|
|
Term
|
A term is a single word or a multiword phrase extracted directly from the corpus of a specific domain by means of natural language processing methods.
|
|
Concepts
|
The underlying meaning; features generated from a collection of documents by a categorization methodology. Compared to terms, they are a higher level of abstraction.
|
|
Stemming
|
The process of reducing inflected words to their base or root form.
|
|
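A toy illustration of stemming, assuming a naive suffix-stripping rule; real stemmers (e.g., Porter's algorithm) apply ordered rule sets with many more conditions. The name `naive_stem` and the suffix list are illustrative.

```python
def naive_stem(word, suffixes=("ing", "edly", "ed", "es", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

# Inflected forms of the same root collapse to one stem
print([naive_stem(w) for w in ["mining", "mined", "mines", "mine"]])
```

Note the last case: "mine" is left untouched, so naive rules do not always map every inflected form to the same stem, which is why production stemmers are more elaborate.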
stop words (noise words)
|
Words that are filtered out prior to or after processing of natural language text
|
|
polysemes
|
homonyms, syntactically identical words (same spelling) with different meanings
|
|
Tokenizing
|
The process of breaking a stream of text into words, phrases, symbols, or other meaningful elements (tokens), which become input for further processing. A token is a categorized block of text in a sentence, categorized according to the function it performs.
|
|
Index word
|
Any word appearing in two or more documents that is not a stop word
|
|
Term-By-Document Matrix (Occurrence matrix)
|
A representation of the frequency-based relationship between the terms and documents in tabular format. Terms are listed in rows, documents in columns, and the frequency listed in cells
|
|
Term Frequency-Inverse Document Frequency (TF-IDF)
|
A statistical measure to evaluate how important a word is to a document in a collection
|
|
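The term-by-document and TF-IDF cards above can be illustrated with a short sketch: term frequency measures how prominent a word is within one document, and inverse document frequency discounts words that appear in many documents. The function name `tf_idf` and the sample corpus are illustrative.

```python
import math

# A tiny corpus: each document is a list of tokens
docs = [["data", "mining", "text"], ["text", "mining"], ["data", "warehouse"]]

def tf_idf(term, doc, docs):
    """TF: share of the document's tokens that are this term;
    IDF: log of (total documents / documents containing the term)."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)
    return tf * math.log(len(docs) / df)

print(tf_idf("warehouse", docs[2], docs))  # rare term -> higher weight
print(tf_idf("mining", docs[1], docs))     # common term -> lower weight
```

In a term-by-document matrix, each cell would hold such a weight (or a raw frequency), with terms in rows and documents in columns.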
Natural Language Processing (NLP)
|
An important component of text mining, it is a sub-field of AI and computational linguistics.
|
|
Singular Value Decomposition (SVD)
|
A matrix operation in linear algebra, that splits a given matrix of data into three parts.
It is a dimensionality reduction method, used to transform a Term-by-Document matrix to a manageable size; similar to Principal Component Analysis |
|
Principal component analysis
|
A mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables
|
|
The first principal component
|
In PCA, the component that accounts for as much of the variability in the data as possible
|
|
Each succeeding principal component
|
In PCA, the component that accounts for as much of the remaining variability as possible
|
|
The text mining process
|
1. Establish the Corpus - collect and organize
2. Create the Term-Document Matrix - introduce structure to the Corpus, reduce dimensionality, SVD
3. Extract Knowledge - discover patterns from the TD Matrix: classification, clustering, association, trend analysis |
|
Commercial Software Text mining tools
|
SPSS PASW Text Miner
SAS Enterprise Miner
Statistica Data Miner
ClearForest |
|
Free text mining software tools
|
RapidMiner
GATE
Spy-EM |
|
Web mining
|
the process of discovering intrinsic relationships from Web data (textual, linkage, or usage)
|
|
The main areas of Web mining
|
Content Mining
Structure Mining
Usage Mining |
|
Web Content Mining
|
Uses unstructured textual content of the Web pages as a data source
|
|
Web Structure Mining
|
Uses URL links contained in the web pages as a data source
|
|
Web Usage Mining
|
Uses the detailed description of a Web site's visits (click streams) as a data source
|
|
Web Content and Structure mining tool
|
Data collection via Web crawlers
|
|
Authoritative pages
|
Links included on a web page can help to infer "authority", like citations in journal articles. There are differences: web links may be paid ads, they may exclude commercial rivals, and they may not be descriptive.
|
|
Hubs
|
One or more web pages that provide a collection of links to authoritative pages, they provide links to a collection of prominent sites on a specific topic of interest.
|
|
Hyperlink-Induced Topic Search algorithm (HITS)
|
The most popular known and referenced algorithm to calculate hubs and authorities. It is a link analysis algorithm that rates web pages using the hyperlink information contained within them.
|
|
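The hub and authority scores that HITS computes can be sketched with a short power-iteration loop: each round, authority scores are recomputed from the hubs that point at a page, hub scores from the authorities a page points at, and both are normalized (here by their sum, for simplicity; the original algorithm normalizes by the sum of squares). The function name `hits` and the toy link graph are illustrative.

```python
def hits(links, iters=20):
    """Minimal HITS sketch. `links` maps each page to the pages it links to."""
    pages = set(links) | {p for tgts in links.values() for p in tgts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority score: sum of hub scores of pages linking to it
        auth = {p: sum(hub[q] for q, tgts in links.items() if p in tgts) for p in pages}
        norm = sum(auth.values()) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages it links to
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(hub.values()) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# A links to B and C; D links to C: C should emerge as the top authority.
hub, auth = hits({"A": ["B", "C"], "D": ["C"]})
print(max(auth, key=auth.get))  # -> C
```

The page linked to by the most (and best) hubs ends up with the highest authority score, matching the hub/authority cards above.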
Web usage mining tools and methods
|
Data stored in server access logs, referrer logs, agent logs, and client-side cookies
User characteristics and usage profiles
Metadata, such as page attributes, content attributes, and usage data |
|
ETL steps are performed by ________________ in SQL Server
|
The Integration Services tool (SSIS)
|