Data Mining Techniques
Chapter 1: Introduction to Data Mining
Anantaporn Hanskunatai

Contents
• Why data mining?
• What is data mining?
• Knowledge Discovery in Databases Process
• What kind of data for mining?
• Data mining functionality
• Data mining applications

Why Data Mining?
• Data explosion problem
• Major sources of abundant data
  – Business: Web, e-commerce, transactions, stocks, …
  – Science: remote sensing, bioinformatics
  – Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• Solution: data warehousing and data mining
  – Data warehousing and on-line analytical processing
  – Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

What is Data Mining?
• Data mining
  – Extraction of interesting information or patterns from data in large databases
• Alternative names and their “inside stories”:
  – Data mining: a misnomer?
  – Knowledge Discovery in Databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
  – Simple search and query processing
  – Expert systems (deductive) or small ML/statistical programs

Evolution of Database Technology
• 1960s and earlier: Data Collection
  – Primitive file processing
• 1970s: Database Management Systems
  – Network and relational database management systems
  – Data modeling tools
  – Query language processing
• 1980s to present: Advanced Database Management Systems
  – Advanced data models
  – Object-oriented database management systems
• 1990s to present: Decision Support Systems
  – Data warehouses
  – Data mining
• 1990s to present: Web-based Database Systems
  – XML-based database systems
  – Web mining

Data, Information, and Knowledge
(Figure; the only labels that survive are "Data" and "validity".)

Data Mining: Confluence of Multiple Disciplines
• Data mining draws on:
  – Database technology
  – Statistics
  – Machine learning
  – Visualization
  – Information retrieval
  – High performance computing

Knowledge Discovery (KDD) Process
• View from typical database systems and data warehousing communities
  – Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

Knowledge Discovery (KDD) Process
• View from machine learning and statistics
  – Input Data → Data Pre-Processing → Data Mining → Post-Processing → Pattern
  – Data pre-processing: data integration, normalization, feature selection, dimension reduction
  – Data mining: pattern discovery, association and correlation, classification, clustering, outlier analysis, …
  – Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

Data Mining in Business Intelligence
• Layers, from the top (greatest potential to support business decisions) to the bottom:
  – Decision Making
  – Data Presentation: visualization techniques
  – Data Mining: information discovery
  – Data Exploration: statistical summary, querying, and reporting
  – Data Preprocessing/Integration, Data Warehouses
  – Data Sources: paper, files, Web documents, scientific experiments, database systems
• Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
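The database-oriented view of the KDD process above can be sketched as a chain of small functions. This is only a toy illustration: the function names, the store records, and the frequency-count "mining" step are assumptions made for the example, not part of the slides.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge several sources into one collection."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def select(records, fields):
    """Selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def mine(records, field):
    """Data mining (toy stand-in): count how often each value occurs."""
    return Counter(r[field] for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep values seen at least min_count times."""
    return {value: n for value, n in patterns.items() if n >= min_count}

# Hypothetical source data from two stores.
store_a = [{"customer": "c1", "item": "milk"}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c3", "item": "milk"}, {"customer": "c4", "item": "bread"}]

warehouse = integrate(clean(store_a), clean(store_b))
task_data = select(warehouse, ["item"])
patterns = evaluate(mine(task_data, "item"), min_count=2)
print(patterns)  # {'milk': 2}
```

Each stage takes the output of the previous one, which mirrors the arrows in the process diagram; a real pipeline would replace the counting step with an actual mining algorithm.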
Characteristics of Knowledge
• Nontrivial
• Valid
• Novel / previously unknown
• Potentially useful
• Interesting
• Understandable

What Kind of Data for Mining?
• Relational databases
• Transactional databases
• Data warehouses
• Advanced DB and information repositories
  – Object-oriented databases
  – Object-relational databases
  – Spatial databases
  – Time-series data and temporal databases
  – Text databases
  – Multimedia databases
  – WWW

What Kind of Data for Mining?
• Relational databases
  – A collection of tables
  – Each table is assigned a unique name
  – Each table consists of a set of attributes (columns or fields) and a large set of tuples (rows or records)
  – Each tuple represents an object identified by a unique key and described by a set of attribute values
• Transactional databases
  – Each record represents a transaction
  – A transaction includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction

    trans_ID   List of item_ID
    T100       I1, I3, I8, I16
    T200       I2, I8
    …          …

What Kind of Data for Mining?
• Data warehouses
  – A repository of information collected from multiple sources, stored under a unified schema
(Figure: typical framework of a data warehouse. Data sources in Chicago, New York, Toronto, and Tokyo pass through Clean, Integrate, Transform, Load, and Refresh into the Data Warehouse, which clients reach through query and analysis tools.)

The Problems of Multiple Sources
• Schema Differences
• Naming Differences
• Data Type Differences
• Value Differences
• Semantic Differences
• Missing Values

The Problems of Multiple Sources
The AutoX company has 1,000 branches around the world. Each branch has its own set of databases.
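As a rough illustration of what loading branch data "under a unified schema" can involve, the sketch below converts rows from two branches into one common record shape. The table and column names follow the Branch A and Branch B examples used in this chapter; the sample rows, the color-code table, and the choice of unified schema are assumptions made for the example.

```python
# Hypothetical mapping from branch-specific color codes to one vocabulary.
COLOR_CODES = {"BL": "black"}

def from_branch_a(row):
    """Branch A stores Cars(serialNo, model, color) with an integer serial."""
    return {
        "serial": str(row["serialNo"]),  # unify the data type: integer -> string
        "model": row["model"],
        "color": row["color"],
    }

def from_branch_b(row):
    """Branch B stores Autos(serial, model, color) with coded color values."""
    return {
        "serial": row["serial"],
        "model": row["model"],
        "color": COLOR_CODES.get(row["color"], row["color"]),  # unify the values
    }

# Made-up sample rows, one per branch.
branch_a_cars = [{"serialNo": 1001, "model": "Civic DX", "color": "black"}]
branch_b_autos = [{"serial": "B-77", "model": "Civic", "color": "BL"}]

warehouse = [from_branch_a(r) for r in branch_a_cars] + [
    from_branch_b(r) for r in branch_b_autos
]
for row in warehouse:
    print(row)
```

Naming, data type, and value differences are handled by simple per-branch converters here. Semantic differences and missing values (for example, Branch B not recording the Civic trim level) cannot be repaired by renaming alone and usually need a decision by someone who knows the data.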
The Problems of Multiple Sources
• Schema Differences
  – Branch A: Cars(serialNo, model, color, autoTrans, cdPlayer, …)
  – Branch B: Autos(serial, model, color), Options(serial, option)
• Naming Differences
  – Branch A: table name Cars
  – Branch B: table name Autos
• Data Type Differences
  – Branch A: serialNo is an integer
  – Branch B: serial is a string
• Value Differences
  – Branch A: color “black”
  – Branch B: color “BL”
• Semantic Differences
  – Branch A: Autos means cars
  – Branch B: Autos means cars and 4x4 W
• Missing Values
  – Branch A: model Civic DX, LX or EX
  – Branch B: model Civic

What Kind of Data for Mining?
• Advanced DB and information repositories
  – Object-oriented databases
    • Based on the object-oriented programming paradigm
  – Object-relational databases
    • Extend relational databases with a rich set of data types for handling complex data
  – Spatial databases
    • Geographic (map) databases
    • VLSI chip design databases
    • Medical and satellite image databases
  – Time-series data and temporal databases
    • Store time-related data
    • A time-series database stores sequences of values that change with time
    • A temporal database stores relational data that include time-related attributes
  – Text databases
    • Contain word descriptions for objects, such as product specifications, error or bug reports, warning messages, and summary reports
  – Multimedia databases
    • Store image, audio, and video data
  – WWW
    • Provides rich, world-wide, on-line information services

Data Mining Strategies
• Predictive or Supervised Learning
  – The process of building classification models from data instances of known origin, that is, learning with predefined classes
  – Builds models that use input attributes (independent variables) to predict output or dependent variables (class labels)
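To make "using input attributes to predict a class label" concrete, here is a minimal sketch of supervised learning. It uses a 1-nearest-neighbor rule, chosen only because it is short; the slides do not prescribe this method. The attributes (age, income in thousands), the labels, and all values are made up.

```python
import math

# Training instances of known origin: (input attributes, class label).
training = [
    ((25, 20), "buys_pc"),
    ((28, 25), "buys_pc"),
    ((52, 90), "no_pc"),
    ((60, 70), "no_pc"),
]

def predict(x):
    """Return the class label of the training instance closest to x."""
    nearest = min(training, key=lambda pair: math.dist(pair[0], x))
    return nearest[1]

print(predict((27, 22)))  # buys_pc
print(predict((55, 80)))  # no_pc
```

The class labels are fixed in advance by the training data, which is what distinguishes this setting from the unsupervised case, where no dependent variable guides the learning.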
• Descriptive or Unsupervised Learning
  – Discovers natural groupings or concept structures in data
  – Works without a dependent variable to guide the learning process
  – Instead, it builds knowledge structures by using some measure of cluster quality to group instances into classes
• Strategy overview: predictive (supervised) learning covers Classification and Estimation/Regression; descriptive (unsupervised) learning covers Clustering and Association

Supervised Learning Models
• Classification
  – Purpose: determine a value for an unknown output attribute (deals with current behavior)
  – Output: categorical (discrete)
• Estimation
  – Purpose: determine a value for an unknown output attribute (deals with current behavior)
  – Output: numerical (continuous)

Data Mining Functionality
• Association Analysis
  – Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data
  – Widely used for market basket analysis
  – Single-dimensional association rule
    • contains(T, “computer”) → contains(T, “software”) [support = 2%, confidence = 75%]
  – Multidimensional association rule
    • age(X, “20..29”) and income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]

Data Mining Functionality
• Classification and Prediction
  – Construct models (functions) based on some training examples
  – Describe and distinguish classes or concepts for future prediction
  – Predict some unknown class labels
  – Typical methods
    • Decision trees, naïve Bayesian classification, neural networks, rule-based classification
(Figures: a neural network model, a decision tree model, and a set of classification rules.)

Data Mining Functionality
• Cluster Analysis
  – Analyzes data without consulting a known class label
  – Maximizes the intra-class similarity and minimizes the interclass similarity
• Outlier Analysis
  – Outlier: a data object that does not comply with the general behavior of the data
  – Noise or exception? One person’s garbage could be another person’s treasure
  – Used in fraud detection and rare events analysis
  – Methods: clustering or regression analysis
• Trend and Evolution Analysis
  – Describes and models regularities or trends for objects whose behavior changes over time

Data Mining Applications
What can I do with this information?
• Education
• Commercial
  – Bank: fraud detection
  – Retail: market basket analysis
• Financial
• Sport
• Entertainment
• Social Network
• Lifestyle
• Public Utility
• Medical

Market Basket Analysis
  – Plan advertising strategies
  – Catalog design
  – Design different store layouts
  – Plan which items to put on sale at reduced prices

Lifestyle

Financial
• Gold price
• Stock market
http://slideplayer.us/slide/763105/

Medical
• Disease diagnosis
• Effectiveness of treatments
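For market basket analysis, the support and confidence figures attached to an association rule can be computed directly from transaction data. The sketch below does this for the rule computer → software. The four transactions are made up, so the resulting numbers differ from the percentages quoted on the slides, and the function names are illustrative only.

```python
# Made-up transactional data: trans_ID -> set of items in the basket.
transactions = {
    "T100": {"computer", "software", "printer"},
    "T200": {"computer", "software"},
    "T300": {"computer"},
    "T400": {"printer"},
}

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"computer", "software"})
rule_confidence = confidence({"computer"}, {"software"})
print(f"support={rule_support:.0%}, confidence={rule_confidence:.0%}")
# support=50%, confidence=67%
```

Two of the four baskets contain both items (support 50%), and two of the three baskets with a computer also contain software (confidence 67%). A retailer would typically keep only rules that clear minimum thresholds on both measures before using them for layout or promotion decisions.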