Data Mining Techniques
Chapter 1
Introduction to Data Mining
Anantaporn Hanskunatai

Contents
• Why data mining?
• What is data mining?
• Knowledge Discovery in Databases Process
• What kind of data for mining?
• Data mining functionality
• Data mining applications
Why Data Mining?
• Data explosion problem
• Major sources of abundant data
  – Business: Web, e-commerce, transactions, stocks, …
  – Science: Remote sensing, bioinformatics
  – Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!

Why Data Mining?
• Solution: Data warehousing and data mining
  – Data warehousing and on-line analytical processing
  – Data mining: extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
What is Data Mining?
• Data mining
  – Extraction of interesting information or patterns from data in large databases
• Alternative names and their “inside stories”:
  – Data mining: a misnomer?
  – Knowledge Discovery in Databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What is Data Mining?
• What is not data mining?
  – Simple search and query processing
  – Expert systems (deductive) or small ML/statistical programs
Evolution of Database Technology
• 1960s and earlier: Data Collection
  – Primitive File Processing
• 1970s: Database Management System
  – Network and Relational Database Management System
  – Data Modeling Tools
  – Query Language
• 1980s to present: Advanced Database Management System
  – Advanced Data Model
  – Object-oriented Database Management System
• 1990s to present: Decision Support System
  – Data Warehouse
  – Data Mining
• 1990s to present: Web-based Database System
  – XML-based Database System
  – Web Mining

Data, Information and Knowledge
[Figure: Data, Information and Knowledge; recoverable labels: Data, Processing, validity]
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Retrieval, and High Performance Computing]

Knowledge Discovery (KDD) Process
• View from typical database systems and data warehousing communities
[Figure: Databases → Data Cleaning, Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Knowledge Discovery (KDD) Process
• View from machine learning and statistics
Input Data → Data PreProcessing → Data Mining → PostProcessing
  – Data PreProcessing: Data integration, Normalization, Feature selection, Dimension reduction
  – Data Mining: Pattern discovery, Association & correlation, Classification, Clustering, Outlier analysis, …
  – PostProcessing: Pattern evaluation, Pattern selection, Pattern interpretation, Pattern visualization

Data Mining in Business Intelligence
[Figure: layered pyramid; potential to support business decisions increases toward the top]
Layers, from top to bottom:
  – Decision Making
  – Data Presentation: Visualization Techniques
  – Data Mining: Information Discovery
  – Data Exploration: Statistical Summary, Querying, and Reporting
  – Data Preprocessing/Integration, Data Warehouses
  – Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
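Normalization is one of the preprocessing tasks listed in the machine learning view of the KDD process. A minimal sketch of one common variant, min-max scaling to the range [0, 1], is given below. The slides do not say which normalization method is meant, so the choice of min-max scaling, the function name `min_max_normalize`, and the sample values are all assumptions made for illustration.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute values (e.g. ages) before mining.
ages = [20, 35, 50]
print(min_max_normalize(ages))
```

After scaling, attributes measured in different units contribute on a comparable scale to distance-based methods such as clustering.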
Characteristics of Knowledge
• Nontrivial
• Valid
• Novel / Previously unknown
• Potentially useful
• Interesting
• Understandable
What Kind of Data for Mining?
• Relational databases
• Transactional databases
• Data warehouses
• Advanced DB and information repositories
  – Object-oriented databases
  – Object-relational databases
  – Spatial databases
  – Time-series data and temporal databases
  – Text databases
  – Multimedia databases
  – WWW
What Kind of Data for Mining?
• Relational databases
  – A collection of tables, each of which is assigned a unique name
  – Each table consists of a set of attributes (columns or fields) and a large set of tuples (rows or records)
  – Each tuple represents an object identified by a unique key and described by a set of attribute values

What Kind of Data for Mining?
• Transactional databases
  – Each record represents a transaction
  – A transaction includes a unique transaction identity number (trans_ID) and a list of items making up the transaction

trans_ID | List of item_ID
T100 | I1, I3, I8, I16
T200 | I2, I8
… | …
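A transactional database of this shape can be sketched as a mapping from each trans_ID to its list of item_IDs. The snippet below is only an illustration, not a prescribed storage format: it holds the two transactions shown on the slide, and the names `transactions` and `transactions_containing` are assumptions introduced here.

```python
# The two transactions shown on the slide: trans_ID -> list of item_IDs.
transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I2", "I8"],
}

def transactions_containing(db, item_id):
    """Return the trans_IDs whose item list includes item_id, sorted by ID."""
    return sorted(tid for tid, items in db.items() if item_id in items)

print(transactions_containing(transactions, "I8"))  # ['T100', 'T200']
print(transactions_containing(transactions, "I3"))  # ['T100']
```

Queries of this kind ("which transactions contain item X?") are the building block for counting how often items occur together.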
What Kind of Data for Mining?
• Data warehouses
  – A repository of information collected from multiple sources, stored under a unified schema
[Figure: Typical framework of data warehouse. Data sources in Toronto, Tokyo, New York, and Chicago pass through Clean, Integrate, Transform, Load, and Refresh into the Data Warehouse, which clients reach through query and analysis tools]

The Problems of Multiple Sources
• Schema Differences
• Naming Differences
• Data Type Differences
• Value Differences
• Semantic Differences
• Missing Values
The Problems of Multiple Sources
The AutoX company has 1,000 branches around the world. Each branch has its own set of databases.
• Schema Differences
  – Branch A: Cars(serialNo, model, color, autoTrans, cdPlayer, …)
  – Branch B: Autos(serial, model, color), Options(serial, option)
• Naming Differences
  – Branch A: Table name → Cars
  – Branch B: Table name → Autos
• Data Type Differences
  – Branch A: serialNo → integer
  – Branch B: serial → string

The Problems of Multiple Sources
• Value Differences
  – Branch A: color → “black”
  – Branch B: color → “BL”
• Semantic Differences
  – Branch A: Autos → cars
  – Branch B: Autos → cars and 4x4 W
• Missing Values
  – Branch A: model → Civic DX, LX or EX
  – Branch B: model → Civic
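The AutoX differences can be made concrete with a small sketch that maps one record from each branch onto a single unified record. This is an illustration under stated assumptions, not AutoX's actual integration logic: the unified field names, the `COLOR_CODES` lookup, the treatment of `autoTrans` and `cdPlayer` as true/false flags, and the sample serial numbers are all hypothetical.

```python
# Hypothetical lookup that resolves value differences ("BL" vs "black").
COLOR_CODES = {"BL": "black"}

def from_branch_a(car):
    """Map a Branch A Cars row (a dict) to the unified record."""
    # Schema difference: Branch A stores options as columns (assumed boolean flags).
    options = sorted(opt for opt in ("autoTrans", "cdPlayer") if car.get(opt))
    return {
        "serial": str(car["serialNo"]),  # data type difference: integer -> string
        "model": car["model"],
        "color": car["color"],
        "options": options,
    }

def from_branch_b(auto, option_rows):
    """Map a Branch B Autos row plus its Options rows to the unified record."""
    # Schema difference: Branch B stores options in a separate Options table.
    options = sorted(o["option"] for o in option_rows if o["serial"] == auto["serial"])
    return {
        "serial": auto["serial"],
        "model": auto["model"],
        "color": COLOR_CODES.get(auto["color"], auto["color"]),
        "options": options,
    }

# Hypothetical sample rows.
a = from_branch_a({"serialNo": 1001, "model": "Civic DX", "color": "black",
                   "autoTrans": True, "cdPlayer": False})
b = from_branch_b({"serial": "B-77", "model": "Civic", "color": "BL"},
                  [{"serial": "B-77", "option": "autoTrans"}])
print(a)
print(b)
```

The sketch handles schema, naming, data type, and value differences. The missing trim level in Branch B ("Civic" rather than "Civic DX") cannot be recovered by a mapping alone, which is why missing values are listed as a separate problem.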
What Kind of Data for Mining?
• Advanced DB and information repositories
  – Object-oriented databases
    • Based on the object-oriented programming paradigm
  – Object-relational databases
    • Extend relational databases with rich data types for handling complex data
  – Spatial databases
    • Geographic (map) databases
    • VLSI chip design databases
    • Medical and satellite image databases

What Kind of Data for Mining?
• Advanced DB and information repositories
  – Time-series data and temporal databases
    • Store time-related data
    • A time-series database stores sequences of values that change with time
    • A temporal database stores relational data that include time-related attributes
  – Text databases
    • Contain word descriptions for objects such as product specifications, error or bug reports, warning messages, summary reports
  – Multimedia databases
    • Store image, audio, and video data
  – WWW
    • Provides rich, world-wide, on-line information services
Data Mining Strategies
[Figure: Data Mining Strategies split into two branches: Predictive or Supervised Learning (Classification, Estimation/Regression) and Descriptive or Unsupervised Learning (Clustering, Association)]

Data Mining Strategies
• Supervised Learning
  – Builds classification models from data instances of known origin, that is, instances whose classes are predefined
  – Builds models that use input attributes (independent variables) to predict output or dependent variables (class label)
• Unsupervised Learning
  – Discovers natural groupings or concept structures in data
  – Has no dependent variable to guide the learning process
  – Instead, builds knowledge structures by using some measure of cluster quality to group instances into classes
Supervised Learning Models
Model | Purpose | Output
Classification | Determine a value for an unknown output attribute (deal with current behavior) | Categorical (discrete)
Estimation | Determine a value for an unknown output attribute (deal with current behavior) | Numerical (continuous)

Data Mining Functionality
• Association Analysis
  – Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data
  – Widely used for market basket analysis
  – Single-dimensional association rules
    • contains(T, “computer”) → contains(T, “software”) [support = 2%, confidence = 75%]
  – Multidimensional association rule
    • age(X, “20..29”) and income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
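The support and confidence figures attached to each rule follow the usual definitions: for a rule A → B, support is the fraction of all transactions that contain both A and B, and confidence is the fraction of transactions containing A that also contain B. A minimal sketch of that arithmetic is shown below. The five baskets are made-up toy data, so the resulting numbers differ from the 2% and 75% quoted on the slide, and the function names are assumptions.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))

def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent together."""
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing antecedent that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / count_containing(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer"},
]

# Rule: contains(T, "computer") -> contains(T, "software")
print(support(baskets, {"computer"}, {"software"}))     # 3 of 5 baskets
print(confidence(baskets, {"computer"}, {"software"}))  # 3 of the 4 computer baskets
```

`confidence` assumes the antecedent occurs in at least one transaction; otherwise the division is undefined.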
Data Mining Functionality
• Classification and Prediction
  – Construct models (functions) based on some training examples
  – Describe and distinguish classes or concepts for future prediction
  – Predict some unknown class labels
  – Typical methods
    • Decision trees, naïve Bayesian, neural networks, rule-based classification

Data Mining Functionality
• Classification and Prediction
[Figure: three example model forms: a set of classification rules, a decision tree, and a neural network]
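The idea of constructing a model from training examples and then predicting unknown class labels can be sketched with the smallest possible decision tree: a single threshold test on one numeric attribute (often called a decision stump). This is a simplified stand-in chosen for brevity, not one of the models pictured on the slide, and the age and buys_pc training data are hypothetical.

```python
def train_stump(examples):
    """Learn a one-level decision tree on a single numeric attribute.

    examples: list of (value, label) pairs with exactly two distinct labels
    and at least two distinct values.
    Returns (threshold, label_if_le, label_if_gt) with the fewest training errors.
    """
    labels = sorted({label for _, label in examples})
    values = sorted({value for value, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    best = None
    for threshold in candidates:
        # Try both ways of assigning the two labels to the two branches.
        for low, high in (labels, labels[::-1]):
            errors = sum(
                1 for value, label in examples
                if (low if value <= threshold else high) != label
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    return best[1:]

def predict(model, value):
    """Apply the learned threshold test to a new value."""
    threshold, low, high = model
    return low if value <= threshold else high

# Hypothetical training examples: (age, buys_pc).
training = [(22, "yes"), (25, "yes"), (28, "yes"), (41, "no"), (47, "no"), (52, "no")]
model = train_stump(training)
print(model)
print(predict(model, 30), predict(model, 45))
```

Here the class label (buys_pc) is known for every training example, which is what makes this supervised learning.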
Data Mining Functionality
• Cluster Analysis
  – Analyze data without consulting a known class label
  – Maximize intra-class similarity and minimize inter-class similarity

Data Mining Functionality
• Outlier Analysis
  – Outlier: a data object that does not comply with the general behavior of the data
  – Noise or exception? One person’s garbage could be another person’s treasure
  – Used in fraud detection and rare events analysis
  – Methods: clustering or regression analysis
• Trend and Evolution Analysis
  – Describes and models regularities or trends for objects whose behavior changes over time
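Grouping data without class labels so that members of a group are similar to each other and dissimilar to other groups can be sketched with k-means on one-dimensional data. The slides do not name a specific clustering algorithm; k-means is used here only as one common choice, and the data points, initial centers, and function name are assumptions.

```python
def kmeans_1d(points, centers, iterations=10):
    """Very small k-means for 1-D data; centers are the caller's initial guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

No class label is consulted at any point; the groups emerge only from distances between points.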
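An outlier, in the sense of an object that does not comply with the general behavior of the data, can be illustrated with a simple statistical check: flag values that lie more than a chosen number of standard deviations from the mean. Note that this is a swapped-in technique for brevity; the slides name clustering and regression analysis as methods. The threshold of 2.0 and the transaction amounts are hypothetical.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(find_outliers(amounts))
```

Whether the flagged value is noise to discard or the fraud case being sought depends on the application, which is the point of the "garbage or treasure" remark.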
Data Mining Applications
What can I do with this information?
• Education
• Commercial
  – Bank: Fraud detection
  – Retail: Market Basket Analysis
• Financial
• Sport
• Entertainment
• Social Network
• Lifestyle
• Public Utility
• Medical

Market Basket Analysis
  – Plan advertising strategies
  – Catalog design
  – Design different store layouts
  – Plan which items to put on sale at reduced prices
Lifestyle

Financial
• Gold price
• Stock market
http://slideplayer.us/slide/763105/

Medical
• Disease diagnosis
• Effectiveness of treatments