100 Cards in this Set

  • Front
  • Back
OLAP definition
An approach to quickly answer multi-dimensional analytical queries.
OLAP data
Organized hierarchically and stored in cubes instead of tables
OLAP function
Slicing and dicing of data
OLAP definition
A category of applications and technologies for collecting, managing, processing, and presenting multidimensional data for analysis and management purposes.
What is BI?
Computer-based methods for identifying and extracting useful information from business data. It encompasses OLAP, relational reporting, and data mining.
Excel can represent multi-dimension data. T or F?
True.
What function on Excel provide OLAP capabilities?
Pivot Tables
Measure
A summarized numerical value that you use to monitor how well the business is doing.
e.g., Units sold, revenue, defects, number of people who responded to an ad.
Aggregation techniques used when presenting measures
Sum, Average, Max, Min
How do you expand a measure (e.g., spread total sales across a time interval, region, product, or salesperson)?
Add a dimension.
Dimensions
The different characteristics by which the measure values may be presented to the user
Cube
A measure and its associated dimensions.
A subset of highly interrelated data that is organized to allow users to combine any attributes in a cube (e.g., stores, products, customers, suppliers) with any metrics/measures in the cube (sales, profit, units, age) to create various views.
Data cube
A two-dimensional, three-dimensional, or higher dimensional object in which each element of the data represents a measure of interest.
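The cube and slice ideas above can be sketched in plain Python, representing a cube as a dict keyed by dimension tuples (the data here is invented for illustration):

```python
# A data "cube": measure = units sold; dimensions = product, region, quarter.
# (Hypothetical example data.)
cube = {
    ("widget", "east", "Q1"): 120,
    ("widget", "west", "Q1"): 80,
    ("gadget", "east", "Q1"): 50,
    ("widget", "east", "Q2"): 130,
}

def slice_cube(cube, region):
    """A 'slice': fix one dimension (region) and return the 2-D view."""
    return {(p, q): v for (p, r, q), v in cube.items() if r == region}

east = slice_cube(cube, "east")
total_east = sum(east.values())  # aggregate the measure over the slice
```

Fixing one dimension yields the two-dimensional "slice" defined a few cards below; summing over the slice is the aggregation step.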
slice
A two-dimensional view that is a subset of highly interrelated data in a multi-dimensional cube.
Fact Table
Stores the detailed values for measures in un-normalized form
Dimension Table
A table that houses the name and attributes of the different characteristics by which the measure values may be presented to the user; it is linked by foreign keys to a fact table. Dimension tables contain classification and aggregate information about the central fact table rows.
A ______________ is a column in a dimension table.
attribute
Star Schema
The simplest DW design: all dimension tables are directly related to the fact table by foreign keys. It is denormalized and takes more space.
grain
The highest level of detail that is supported in a data warehouse.
Function of a dimension table
Defines how data will be sliced and diced
Snowflake schema
A DW design where dimension tables are layered and not all directly related to the fact table. It is normalized, requiring more complicated queries and more processing time for the joins.
Data Mining
The process through which previously undiscovered patterns in data are identified leading to knowledge discovery
Techniques data mining uses to extract and identify new knowledge remaining untapped in large databases
statistical, mathematical, artificial intelligence and machine learning techniques
Types of new knowledge that data mining creates
rules, affinities, correlations, trends or prediction models
Data mining extracts data from ___________________ data sources.
disparate data sources
Sensitivity analysis
the study of how the uncertainty in the output of a mathematical model or system (numerical or otherwise) can be apportioned to different sources of uncertainty in its inputs.
Data
refers to a collection of facts, usually obtained as the result of experience, observations, or experiments.
Patterns DM tries to identify
Classification
Clustering
Association/Sequencing discovery
Prediction/Forecasting
Classification
HISTORICAL BEHAVIOR TO PREDICT - Analyzing the historical behavior of groups of entities to predict the future behavior of a new entity from its similarity to those groups. It is the most frequently used data mining method for real world problems
Another name for Classification
Supervised induction
Common classification tools
neural networks, decision trees, logistic regression, and discriminant analysis, as well as emerging tools such as rough sets, support vector machines, and genetic algorithms
How Classification works
learns patterns from past data in order to place new instances (with unknown labels) into their respective groups
Examples of classification data mining method
weather prediction, credit approval, store location, targeted marketing, and fraud detection.
Supervised learning
A method of training artificial neural networks in which sample cases are shown to the networks as input, and the weights are adjusted to minimize the error in the outputs.
The approach/algorithm used in the classification data mining method
decision trees
Which data mining methods employ supervised learning methods?
Classification and Prediction/Forecasting
Which data mining methods employ unsupervised learning methods?
Clustering and Association/Sequence discovery
Classification method objective
Analyzing the historical behavior of groups of entities to predict the future behavior of a new entity from its similarity to those groups.
Classification
This induced model consists of generalizations over the records of a training dataset, which help distinguish pre-defined classes.
Clustering
PARTITIONING - A natural partitioning of data into groups of entities with similar characteristics
Uses of Clustering methods
grouping students according to grades, performing market segmentation.
Association/Sequence discovery
SIMULTANEOUS RELATIONSHIPS TIME ORDER - establishing relationships among items that occur together or in a time order (basket analysis).
Association/Sequence discovery
a popular and well-researched technique for discovering interesting relationships among variables in large databases.
The approach/algorithm used in the Association/Sequence discovery data mining method
Apriori
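A minimal sketch of the Apriori idea (count candidate itemsets level by level, keeping only those meeting a minimum support), not a production implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: return all itemsets meeting min_support."""
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join surviving k-itemsets into (k+1)-itemset candidates.
        keys = list(survivors)
        candidates = {a | b for a, b in combinations(keys, 2)
                      if len(a | b) == len(a) + 1}
    return frequent

baskets = [{"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "eggs"}]
freq = apriori(baskets, min_support=2)
```

With these hypothetical baskets, {milk, bread} and {milk, eggs} survive at support 2, while the full triple does not.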
Define Decision Tree Attributes
The input variables that may have an impact on the classification of different patterns.
Predictive accuracy
The model’s ability to correctly predict the class label of new or previously unseen data. It is the percentage of test dataset samples correctly classified by the model.
Speed
The computational costs involved in generating and using the model, where faster is deemed to be better.
Robustness
The model’s ability to make reasonably accurate predictions, given noisy data or data with missing and erroneous values.
Scalability
The ability to construct a prediction model efficiently given a rather large amount of data.
Interpretability
The level of understanding and insight provided by the model (e.g., how and/or what the model concludes on certain predictions).
Gini Index
Used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable
Gini index
Measures the homogeneity/diversity of data in a sample set.
Gini index=0
Gini index rating if the data is homogeneous
Gini index >0
Gini index rating which indicates diversity in the data
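The two ratings above follow directly from the usual Gini formula, 1 minus the sum of squared class proportions:

```python
def gini_index(counts):
    """Gini index of a node given per-class counts:
    1 - sum of squared class proportions.
    0 => perfectly homogeneous; larger values => more diversity."""
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

pure = gini_index([10, 0])   # all samples in one class -> 0
mixed = gini_index([5, 5])   # 50/50 two-class split -> 0.5
```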
The Apriori algorithm
The most commonly used algorithm to discover association rules.
What is a data warehouse?
A subject-oriented, integrated, time-variant, nonvolatile collection of data, produced to support decision making; it is also a repository of current and historical data, usually structured in a form ready for analytical processing.
What are the major components of the data warehousing process?
Transaction data systems
ETL Process - Extract, Transform, Load
Data Warehouse -comprehensive database
Data Marts -
Middleware/Analytical Tools- SQL, cubes
BI Applications (Visualization)- OLAP, Dashboard, Web
What Data marts provide
different views of the data warehouse
Neural computing
a pattern-recognition methodology for machine learning
Artificial Neural Network (ANN)
The resulting model from pattern-recognition methodology for machine learning
What have neural networks been used for?
Used in many business applications for pattern recognition, forecasting, prediction and classification. It is the key component of any data mining tool.
Connection weights
The key element of an ANN, they express the relative strength of the input data (always a single attribute), crucial in that they store learned patterns of information.
Summation Function
computes the weighted sums of all the input elements.
Transfer function (e.g., Sigmoid function)
A popular and useful non-linear function, it is an S-shaped transfer function in the range of 0 to 1. It is applied to the summed inputs of a node to define the response out from the node.
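The summation and sigmoid transfer functions from the two cards above can be written in a few lines:

```python
import math

def summation(inputs, weights):
    """Summation function: weighted sum of a node's inputs."""
    return sum(x * w for x, w in zip(inputs, weights))

def sigmoid(s):
    """S-shaped transfer function mapping any real value into (0, 1)."""
    return 1 / (1 + math.exp(-s))

# A node's output: transfer function applied to the weighted input sum.
y = sigmoid(summation([1.0, 0.5], [0.8, -0.2]))
```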
Supervised learning
Sets of inputs are iteratively presented to the neural network, and its outputs are compared to the desired outputs.
Learning algorithm
Determines how the neural interconnection weights are corrected due to differences between the actual and desired output for a member of the training set.
Unsupervised learning
The network learns a pattern through repeated exposures, it is not compared to a target answer, self-organizing or clustering
Learning Rate (alpha)
A parameter in neural networks; it affects the speed at which the ANN arrives at the solution; it determines the portion of the existing discrepancy that must be offset.
Momentum
A parameter in back-propagation neural networks, it slows, smoothens and stabilizes the learning process; reduces over-correcting of weights.
Back propagation learning
The best-known learning algorithm in neural computing where the learning is done by comparing computed outputs to desired outputs of training cases.
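The compare-and-correct loop described above can be sketched for a single sigmoid neuron; this is the essence of back-propagation without hidden layers, using an invented two-sample training set:

```python
import math

def sigmoid(s):
    return 1 / (1 + math.exp(-s))

# (inputs, desired output) pairs -- hypothetical training data.
samples = [([0.0, 1.0], 0.0), ([1.0, 0.0], 1.0)]
weights, alpha = [0.1, 0.1], 0.5  # alpha is the learning rate

for _ in range(2000):
    for inputs, target in samples:
        out = sigmoid(sum(x * w for x, w in zip(inputs, weights)))
        error = target - out  # desired output minus computed output
        # Correction: learning rate * error * sigmoid gradient * input
        for i, x in enumerate(inputs):
            weights[i] += alpha * error * out * (1 - out) * x

final = [sigmoid(sum(x * w for x, w in zip(inp, weights)))
         for inp, _ in samples]
```

After training, the computed outputs approach the desired 0 and 1.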
Text mining
The semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources.
corpus
A large and structured set of texts (now usually stored and processed electronically) prepared for the purpose of conducting knowledge discovery.
Term
A term is a single word or a multiword phrase extracted directly from the corpus of a specific domain by means of natural language processing methods.
Concepts
The underlying meaning; features generated from a collection of documents by a categorization methodology. Compared to terms, they are a higher level of abstraction.
Stemming
The process of reducing inflected words to their base or root form.
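A naive suffix-stripping stemmer illustrates the idea; real stemmers such as Porter's apply many more rules and exceptions:

```python
def crude_stem(word):
    """Illustrative suffix stripper; not a real stemming algorithm."""
    for suffix in ("ingly", "edly", "ing", "ed", "es", "s"):
        # Only strip when a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```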
stop words (noise words)
Words that are filtered out prior to or after processing of natural language text
polysemes
homonyms, syntactically identical words (same spelling) with different meanings
Tokenizing
The process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements (tokens), which become input for further processing. A token is a categorized block of text, classified according to the function it performs in a sentence.
Index word
Any word appearing in 2 or more documents and is not a stop word
Term-By-Document Matrix (Occurrence matrix)
A representation of the frequency-based relationship between the terms and documents in tabular format. Terms are listed in rows, documents in columns, and the frequency listed in cells
Term Frequency-Inverse Document Frequency (TF-IDF)
A statistical measure to evaluate how important a word is to a document in a collection
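One common TF-IDF variant (raw term frequency times log inverse document frequency) can be computed directly from a small hypothetical corpus:

```python
import math
from collections import Counter

docs = [
    ["olap", "cube", "slice"],
    ["data", "mining", "cube"],
    ["text", "mining", "corpus"],
]

def tf_idf(term, doc, docs):
    """TF-IDF sketch: raw term frequency * log(N / document frequency).
    (One of several common weighting variants.)"""
    tf = Counter(doc)[term]
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)
```

A term appearing in only one document ("olap") scores higher than one spread across documents ("cube"), matching the intuition on the card.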
Natural Language Processing (NLP)
An important component of text mining, it is a sub-field of AI and computational linguistics.
Singular Value Decomposition (SVD)
A matrix operation in linear algebra, that splits a given matrix of data into three parts.

It is a dimensionality reduction method, used to transform a Term-by-Document matrix to a manageable size; similar to Principal Component Analysis
Principal component analysis
A mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables
The first principal component
In PCA, the component that accounts for as much of the variability in the data as possible.
Each succeeding component in PCA accounts for what?
As much of the remaining variability as possible.
The text mining process
1. Establish the Corpus - collect and organize
2. Create the Term-Document Matrix - introduce structure to the Corpus, reduce dimensionality, SVD
3. Extract knowledge - discover patterns from the TD Matrix, classification, clustering, association, trend analysis
Commercial Software Text mining tools
SPSS PASW Text Miner
SAS Enterprise Miner
Statistica Data Miner
ClearForest
Free text mining software tools
RapidMiner
GATE
Spy-EM
Web mining
the process of discovering intrinsic relationships from Web data (textual, linkage, or usage)
The main areas of Web mining
Content Mining
Structure mining
Usage Mining
Web Content Mining
Uses unstructured textual content of the Web pages as a data source
Web Structure Mining
Uses URL links contained in the web pages as a data source
Web Usage Mining
Uses the detailed description of a Web site's visits (click streams) as a data source
Web Content and Structure mining tool
Data collection via Web crawlers
Authoritative pages
Links included on a web page can help to infer "authority," like citations in journal articles. There are differences: web links may be paid ads, they may exclude commercial rivals, and they may not be descriptive.
Hubs
One or more web pages that provide a collection of links to authoritative pages, they provide links to a collection of prominent sites on a specific topic of interest.
Hyperlink-Induced Topic Search algorithm (HITS)
The most popular known and referenced algorithm to calculate hubs and authorities. It is a link analysis algorithm that rates web pages using the hyperlink information contained within them.
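The mutual reinforcement between hubs and authorities can be sketched as a simple iteration (a toy version of HITS on a hypothetical three-page web):

```python
def hits(links, iterations=50):
    """Minimal HITS sketch. links maps page -> set of pages it links to.
    Returns (hub, authority) score dicts, normalized each round."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of pages that link to you.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # Hub: sum of authority scores of the pages you link to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical tiny web: A and B both link to C.
hub, auth = hits({"A": {"C"}, "B": {"C"}})
```

C, which everyone links to, emerges as the authority; A and B, which point at it, are the hubs.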
Web usage mining tools and methods
data stored in server access logs, referrer logs, agent logs, and client-side cookies
user characteristics and usage profiles
metadata, such as page attributes, content attributes, and usage data
ETL steps are performed by ________________ in SQL Server.
The Integration Services tool (SSIS).