Data Privacy: Protecting Data and Ensuring Compliance

The Changing Role of the Enterprise Data Warehouse in the Era of Big Data Analytics:
A Look at Data Integration and Data Quality
尹寒柏 Bob Yin
Senior Product Specialist
Agenda
• About Informatica
• The Changing Role of the Enterprise Data Warehouse in the Era of Big Data Analytics
• Informatica's Solutions
• Informatica BDE Demo
• Q&A
Informatica
The #1 Independent Leader in Data Integration
• Founded: 1993
• 2013 Revenue: $948.2 million
• 7-year Annual CAGR: 17% per year
• Employees: 3,210+
• Partners: 450+
  • Major SI, ISV, OEM and On-Demand Leaders
• Customers: Over 5,000
  • Customers in 82 Countries
  • Direct Presence in 28 Countries
• #1 in Customer Loyalty Rankings (7 Years in a Row)
[Chart: yearly bar chart, 2006 to 2013; vertical axis from $0 to $900]
#1 Magic Quadrant for Data Integration Tools
Gartner 24 July 2014 ID:G00261678
2015/10/1
The Forrester Wave™: Hybrid2 Integration, Q1 2014
Data Integration Hairball
[Figure: Best Buy - Application Diagram V4 (DRAFT), November 10, 1999, page 1 of 2. Prepared by Michelle Mills. The diagram shows a dense web of interfaces among the retailer's applications: sales, inventory, pricing, HR, finance, and third-party systems. Interfaces to and from the Data Warehouse are not displayed on this diagram.]
Legend: Mainframe apps - Blue; PC/NT apps - Green; Unix apps - Yellow; 3rd party interface - Orange. Line colors have no special meaning; they are there to help make the diagram easier to read.
For more information, see the database containing information about each application: Application V4.mdb
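What Is Big Data?
• The "big" in big data is actually not its most noteworthy characteristic
• Big data comes in many different formats
  • Structured
  • Semi-structured
  • Unstructured
  • Raw data
• In some cases it looks completely different from the uniform numeric values and coded text that we have stored in data warehouses over the past 30 years
• Much big data cannot be analyzed with any SQL-like tool
• Big data is a paradigm shift in how we think about data assets, where we collect them, how we analyze them, and how we turn the resulting insights into profit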
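Data Is Also an Asset on the Balance Sheet
• Enterprises increasingly recognize that data itself is an asset, to be treated like the traditional assets, such as equipment and land, that have appeared on the balance sheet since the manufacturing era
• There are several ways to determine the value of a data asset, including:
  • The cost of producing the data
  • The cost of replacing the data if it is lost
  • The revenue or profit opportunity the data provides
  • The revenue or profit lost if the data falls into competitors' hands
  • The legal exposure to fines and lawsuits if the data is disclosed to the wrong parties
• More important than the data itself, enterprises have shown that insight into data can be turned into profit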
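The Data Warehouse Proved the Value of Data-Driven Insight
• Until recently, data warehouses focused on historical transaction data
• Over the past decade, data warehousing has gone through three dramatic shifts
  • First, low-latency operational data was brought into the data warehouse alongside the existing historical information
  • Second, and deepening throughout the decade, came the collection of customer behavior data
  • Third came the extraction of product preferences and customer sentiment from social media, and in particular the huge volumes of machine-generated unstructured data produced by the new business paradigms of the .com companies. Unstructured data is not a recent discovery; rather, the analysis of unstructured data has only recently become mainstream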
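Demystifying Big Data Analytics
Big Data Analytics Use Cases
• Loan risk analysis and insurance policy underwriting
• Customer churn analysis
• Search ranking
• Ad tracking
• Location and proximity tracking
• Causal factor discovery
• Social CRM
• Document similarity testing
• Genomic analysis
• Discovery of customer cohort groups
• In-flight aircraft status
• Smart utility meters
• Building sensors
• Satellite image comparison
• CAT scan comparison
• Financial account fraud detection and prevention
• Computer system intrusion detection and intervention
• Online game gesture tracking
• Big science, including atom smashers, weather analysis, and space probe telemetry feeds
• "Data bag" exploration
These workloads are clearly very different from what traditional relational database systems were built for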
Two architectures have emerged to address big data analytics
Big Data Is a Gold Mine
Big Data Project
80% of the work in big data projects is data integration and data quality
Data Integration
The Challenge
Each era lists: technology; number of sources; users; business value
• 1960s-1970s: Mainframe (OS/360); 10^2 sources; Few Employees; Back Office Automation
• 1980s: Client-Server; 10^4 sources; Many Employees; Front Office Productivity
• 1990s: Web; 10^6 sources; Customers/Consumers; E-Commerce
• 2007: Cloud; 10^7 sources; Business Ecosystems; Line-of-Business Self-Service
• 2011: Social; 10^9 sources; Communities & Society; Social Engagement
• 2014: Internet of Things; 10^11 sources; Devices & Machines; Real-Time Optimization
Solution: Enterprise Data Integration
The right data at the right time in the right way
ACCESS → DISCOVER* → CLEANSE* → INTEGRATE → DELIVER
• Universal Data Access
• Multi-Modal Data Provisioning
• Business-IT Collaboration
• Cost-Effective Scalability
Any Data Source
Any Latency: Batch, Real-time, Change Capture
Any Delivery Mechanism: Physical Target, Publish as Web Service, Publish to Message Bus, Virtual View
* requires Informatica Data Quality
PowerCenter Big Data Edition
Version shown on slide: 9.6
Capabilities:
• No-Code Productivity
• Business-IT Collaboration
• Unified Administration
• Universal Data Access
• High-Speed Data Ingestion and Extraction
• ETL on Hadoop
• Profiling on Hadoop
• Complex Data Parsing on Hadoop
• Entity Extraction and Data Classification on Hadoop
• The Vibe™ virtual data machine
Big Transaction Data
• Online Transaction Processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server, …
• Online Analytical Processing (OLAP) & DW Appliances: Teradata, Redbrick, EssBase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DataAllegro, Asterdata, Vertica, Paraccel …
Big Interaction Data
• Social Media & Web Data: Facebook, Twitter, Linkedin, Youtube, …; Web applications, Blogs, Discussion forums, Communities, Partner portals, …
• Cloud: Salesforce.com, Concur, Google App Engine, Amazon, …
• Other Interaction Data: Clickstream, image/Text, Scientific, Genomic/pharma, Medical, Medical/Device, Sensors/meters, RFID tags, CDR/mobile, …
Big Data Processing
PowerExchange Connectors
✔ = Accessible in Real-time and/or via Change Data Capture (CDC)
Enterprise Applications, Software as a Service (SaaS):
• JDE EnterpriseOne, JDE World, Lotus Notes, Oracle E-Business Suite ✔, PeopleSoft Enterprise, Salesforce (salesforce.com) ✔, SAP NetWeaver ✔, SAP NetWeaver BI ✔, SAS, Siebel, Netsuite, Microsoft Dynamics
Databases and Data Warehouses:
• Adabas for UNIX, Windows; C-ISAM; DB2 for LUW ✔; Essbase; EMC/Greenplum; Informix Dynamic Server; Netezza Performance Server; ODBC; Oracle ✔; SQL Server ✔; Sybase; Teradata
Messaging Systems:
• JMS ✔, MSMQ ✔, TIBCO ✔, webMethods Broker ✔, WebSphere MQ ✔
Technology Standards:
• Email (POP, IMAP), HTTP(S) ✔, LDAP ✔, Web Services ✔, XML
Mainframe:
• Adabas for z/OS ✔; Datacom ✔; DB2 for z/OS, z/Linux ✔; IDMS ✔; IMS DB ✔; Oracle for z/Linux ✔; Teradata; WebSphere MQ for z/Linux ✔; VSAM ✔
Big Data:
• Asterdata, Greenplum, Vertica, ParAccel, Microsoft PDW, Kognitio
Social:
• Facebook, Twitter, LinkedIn, DataSift, Kapow
Hadoop:
• MongoDB, HDFS, HIVE, HBASE
Cloud of Connectors
Real-Time Data Collection and Streaming
Sources:
• Web Servers, Operations Monitors, rsyslog, SLF4J, etc.
• Handhelds, Smart Meters, etc.
• Discrete Data Messages
• Internet of Things, Sensor Data
Transport:
• Nodes connected over the Ultra Messaging Bus (Publish / Subscribe)
• Zookeeper
• Management and Monitoring
Targets:
• HDFS, HBase
• Real Time Analysis, Complex Event Processing
• NoSQL Databases: Cassandra, Riak, MongoDB
Leverage High Performance Messaging Infrastructure: publish with Ultra Messaging for global distribution without additional staging or landing.
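The publish/subscribe pattern on this slide can be illustrated with a toy in-process bus. This is only a sketch of the pattern, not the Ultra Messaging API: the `MessageBus` class, the topic name, and the two list-based sinks standing in for HDFS and a CEP engine are all hypothetical.

```python
from collections import defaultdict


class MessageBus:
    """Toy in-process publish/subscribe bus (illustrative only)."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Hand the message straight to every subscriber of the topic;
        # nothing is staged or landed in between.
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


bus = MessageBus()
hdfs_sink, cep_sink = [], []  # hypothetical stand-ins for two targets
bus.subscribe("sensor", hdfs_sink.append)
bus.subscribe("sensor", cep_sink.append)
delivered = bus.publish("sensor", {"meter_id": 7, "kwh": 1.25})
```

One published message reaches both targets, which is the fan-out the diagram draws from sources on one side to targets on the other.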
Unstructured Supported Data Formats
Unstructured and semi-structured:
• Microsoft Word, Microsoft Excel, PDF, PowerPoint, Star Office, Word Perfect, ASCII reports, HTML, EBCDIC, Undocumented binaries, Flat files, RPG, ANSI
• HL7, SWIFT, AL3, HIPAA, EDI–X12, EDI-Fact, FIX, NACHA, ASTM, Cargo IMP, COBOL, PL1, UCS, WINS, VICS
XML:
• ACORD XML, LegalXML, IFX, cXML, ebXML, HL7 V3.0
Unstructured Print Streams:
• AFP, Post Script
Unstructured Data
A parser contains a script which uses ins[…] to parse through an unstructured document.
A marker is a unique anchor that defines an area in the document.
Content highlights the data that you wish to capture.
The XSD describes the content that you will be capturing.
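The marker/content idea can be sketched in a few lines of Python. This is an illustration of the concept only, not the Informatica parser: the function name, the sample document, and the marker strings are invented, and the keys of the `markers` dictionary play the role the slide gives to the XSD (naming the fields to be captured).

```python
import re


def parse_by_markers(document, markers):
    """Return {field: content}. Each marker is a literal anchor; the content
    is the text between the marker and the end of that line."""
    record = {}
    for field, marker in markers.items():
        match = re.search(re.escape(marker) + r"[ \t]*(.*)", document)
        record[field] = match.group(1).strip() if match else None
    return record


# Hypothetical unstructured input and marker definitions.
doc = """INVOICE
Customer: ACME Ltd
Total due: 125.00 USD
"""
markers = {"customer": "Customer:", "total": "Total due:"}
record = parse_by_markers(doc, markers)
```

A field whose marker is not found comes back as `None`, so a missing anchor is visible in the output rather than silently skipped.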
Shanghai Customs Supervision Operations: External Web Data Scraping
Using Taobao data as an example
What is Data Quality?
Let’s walk through an example
Input record:
Mr Frank Reagan
12 Richmond Hill Rd
Staten Island, NY

Steps shown: Profile, Parse, Standardize, Correct, Enhance, Match, Consolidate, leading to a Golden Record

Profile / Parse
Fields: First Name, Last Name, AddressL1, AddressL2, City, County, Post Code
Values: Frank; Reagan; 12 Richmond Hill Road; Staten Island; NY; 10341
[The slide also overlays fragments of a second address: George's Quay House, 43 Townsend Street, Dublin]

Correct (powered by [logo]) / Standardize
First Name: Francis
Mid Name: Joseph
Last Name: Reagan
AddressL1: 12 Richmond Hill Road
AddressL2:
City: Staten Island
State: NY
Zip Code: 10341
Phone: (212) 423 49 0866
Email: [email protected]

Match: candidate record from the CRM System
F.J. Ragan
NEW YORK
[email protected]
(212) 423 490866

Consolidate / Enhance: Golden Record
Salutation: Commissioner
First Name: Francis
Mid Name: Joseph
Last Name: Reagan
AddressL1: 12 Richmond Hill Road
AddressL2: Staten Island
City: New York
Zip code: 10341
ZIP+4: 5963
Latitude: 40.588
Longitude: -74.167
Phone: (212) 423 49 0866
Email: [email protected]
C_Category: White Collar families
C_Group: Affluent Families
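The first two steps of this walkthrough, Parse and Standardize, can be sketched as plain functions. This is a minimal illustration that assumes a fixed three-line input layout; the title list and the abbreviation table are invented for the example, and the Correct and Enhance steps are not modeled because they depend on reference data.

```python
# Assumed lookup tables, just large enough for the slide's example.
TITLES = {"mr", "mrs", "ms", "dr"}
STREET_ABBREVIATIONS = {"Rd": "Road", "St": "Street", "Ave": "Avenue"}


def parse_contact(lines):
    """Split a three-line contact (name, street, 'city, state') into fields."""
    name_line, street_line, city_line = lines
    name_tokens = [t for t in name_line.split()
                   if t.rstrip(".").lower() not in TITLES]
    city, state = [part.strip() for part in city_line.split(",", 1)]
    return {
        "first_name": name_tokens[0],
        "last_name": name_tokens[-1],
        "address_l1": street_line.strip(),
        "city": city,
        "state": state,
    }


def standardize(record):
    """Expand known street abbreviations in the first address line."""
    words = [STREET_ABBREVIATIONS.get(w.rstrip("."), w)
             for w in record["address_l1"].split()]
    return {**record, "address_l1": " ".join(words)}


raw = ["Mr Frank Reagan", "12 Richmond Hill Rd", "Staten Island, NY"]
clean = standardize(parse_contact(raw))
```

Keeping each step as a separate function mirrors the slide's pipeline: every stage takes a record and returns a cleaner one.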
Developer – Eclipse based
Comprehensive transformations
Re-usable rules
Mid-Stream data preview
Data Quality
• Transformations
Profiling – Discover the data
Column & Rule Profiling
Value & Pattern Freqs
Drill Down Results
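Column profiling with value and pattern frequencies can be approximated as follows. This is a rough sketch of the idea, not the product's profiler: the pattern notation (9 for a digit, A for a letter) and the sample phone column are assumptions made for the example.

```python
from collections import Counter
import re


def pattern_of(value):
    """Map every digit to 9 and every ASCII letter to A; keep punctuation."""
    out = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", out)


def profile_column(values):
    """Basic column profile: counts, nulls, distinct values, frequencies."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "value_freq": Counter(non_null),
        "pattern_freq": Counter(pattern_of(v) for v in non_null),
    }


# Hypothetical phone column with mixed formats and one empty value.
phones = ["02-25770257", "(02)25770257", "0936936887", "", "02-25770257"]
report = profile_column(phones)
```

The pattern frequencies are what make format drift visible: three distinct shapes in one column suggest the values need standardizing before they can be matched.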
Data Standardization (Standardizer)
Phone numbers:
• (02)25770257 => 02-25770257
• 0492734503 => 049-2734503
• 083-655203 => 0836-55203
• 0936936887 => 0936-936887
• 082668351 => 0826-68351
Company name variants (all referring to TSMC):
• 台積電
• 台灣積體電路股份有限公司
• 台灣積體電路(股)有限公司
• 台灣積體電路製造股份有限公司
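The phone-number rewrites on this slide are consistent with a prefix-based rule: strip everything except digits, then split after the area code. The sketch below assumes exactly that. The `AREA_CODES` table is deliberately incomplete and covers only the prefixes seen in the examples, and the rule for mobile numbers (10 digits starting with 09, split after the fourth digit) is inferred from the single mobile example.

```python
import re

# Assumed, incomplete prefix table: only the area codes in the slide's examples.
AREA_CODES = ["0836", "0826", "049", "02"]


def standardize_phone(raw):
    """Rewrite a phone number as '<prefix>-<rest>' using longest-prefix match."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("09") and len(digits) == 10:
        return digits[:4] + "-" + digits[4:]  # mobile: 4-digit prefix
    for code in sorted(AREA_CODES, key=len, reverse=True):
        if digits.startswith(code):
            return code + "-" + digits[len(code):]
    return digits  # unknown prefix: digits only, no separator
```

Trying longer codes first matters: `083-655203` has to resolve to `0836`, not to a shorter prefix that happens to match.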
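Data Parsing (Parser)
Legacy data: one free-text customer field (客戶資料). In these values, 使用人 means "user", 靠行司機 means "affiliated driver", and 先生 means "Mr."
• 新安國際有限公司 使用人:周凱
• 全欣交通股份有限公司靠行司機:張大
• 詹一中先生
• 陳春 使用人:陳裕.許麗
New data: the same values split into tokens (Token1 | Token2 | Token3)
• 新安國際有限公司 | 周凱
• 全欣交通股份有限公司 | 張大
• 詹一中
• 陳春 | 陳裕 | 許麗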
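A rule-based split that reproduces the four examples above might look like this. It is a sketch under the assumption that only the role markers and separators seen on the slide occur; the function name and the rule set are invented, and a production parser would need a much larger dictionary.

```python
import re

# Role markers followed by a fullwidth or ASCII colon, plus name separators.
SPLIT_PATTERN = re.compile(r"\s*(?:使用人|靠行司機)\s*[::]\s*|[.、]")
HONORIFICS = ("先生", "小姐")


def parse_customer(raw):
    """Split a free-text customer field into company/person tokens."""
    tokens = []
    for part in SPLIT_PATTERN.split(raw.strip()):
        part = part.strip()
        for honorific in HONORIFICS:
            if part.endswith(honorific):
                part = part[: -len(honorific)]  # drop trailing Mr./Ms.
        if part:
            tokens.append(part)
    return tokens
```

The colon class accepts both `:` and `:` because the legacy values on the slide mix the two.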
Standardizer & Parser Applied to Chinese Addresses
Probabilistic Parsing and Labelling
• Probabilistic approach using Natural Language Processing (NLP)
• Support for statistical models to predict relations between words
• Able to correctly label ambiguous terms that can have more than 1 meaning
• Reduce mapping complexity
• The Stanford Natural Language Processing Group
Example
• Input: 'BROADCASTING HOUSE ATTN: HILARY THOMAS, ROOM G12 ,ACCOUNT SERVICES ENGINEERING'
• Output: (highlights show terms used during configuration / model training)
  • LOC: BROADCASTING HOUSE
  • NOISE: ATTN
  • PERS: HILARY THOMAS
  • LOC: ROOM G12
  • ORG: ACCOUNT SERVICES ENGINEERING
Probabilistic Model
Data Matching (Match)
Legacy data 1 (ID number | Name)
A123456789 | 王門騫
A223456789 | 陳東壁
A112345678 | 闕劍明
A212345678 | 管廷興
Legacy data 2 (ID number | Name)
A123456789 | 王門鶱
A223456789 | 陳東墾
A112345678 | 關劍明
A212345678 | 管延興
Match result (Legacy code1 | Legacy code2 | Match Score)
王門騫 | 王門鶱 | 0.666666687
陳東壁 | 陳東墾 | 0.666666687
闕劍明 | 關劍明 | 0.666666687
鐘宜偉 | 鍾怡緯 | 0.00000000
Multilingual example: one name in four scripts
Toyotomi Hideyoshi
豊臣秀吉
トヨトミヒデヨシ
とよとみひでよし
Address variants to be matched
上本町207 シャトー上本町303
シャトー上本町303 兵庫県 小野市
上本町207 上本町303
シャトー上本町33 兵庫県 野市
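The slide does not say how the match score is computed. One simple measure that is consistent with the values shown (two of three characters agreeing gives about 0.667, none agreeing gives 0) is the fraction of positions at which the two strings carry the same character. The sketch below uses that measure as an assumption; it is not the product's algorithm.

```python
def match_score(a, b):
    """Fraction of character positions at which a and b agree.

    Assumed measure for illustration: positional agreement divided by
    the length of the longer string.
    """
    if not a and not b:
        return 1.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))
```

The last pair on the slide scores 0 under this measure even though the names are homophones, which is why exact character comparison is not enough on its own for identity matching.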
Informatica Identity Resolution
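• Identity resolution is difficult and complex; a national ID number cannot be the only matching criterion
  • Multiple identity documents
  • Homophones (鐘宜偉 vs 鍾怡緯)
  • Word order (Mark Douglas or Douglas Mark)
  • Redundant words (Mr./Ms./Title)
  • Transliteration and romanization errors (Yin/Ying)
  • Cross-language matching (Simplified/Traditional Chinese, Chinese/English)
  • Name + address + phone + company or organization
• IIR advantages
  • Support for pronunciation in multiple languages
  • Flexible matching methods
  • Extensibility
• DQ IR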
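Fuzzy Matching
Homophones, synonyms, multiple languages
Traditional/Simplified Chinese
Simplified Chinese/English
Simplified Chinese/English (Cantonese)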
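Consolidation (Consolidate)
Use fuzzy matching to flag similar data.
Use a subset of the data for exact matching (e.g. last name, first name, contact information, address information).
Customer sample:
• Marketing system: Primary Key 929992; First Name 寶強; Last Name 王
• Call center: Primary Key AKK-111; First Name 保強; Last Name 王
• Integrated management system: Primary Key 098388188; Full Name Wang Bao Qiang
Identity Resolution
Create a unique ID and select the appropriate attributes for the golden record:
• Unique ID 111; First Name 寶強; Last Name 王
Create a cross-reference index that points to the primary keys and system identifiers of the source systems:
Unique ID | Source System | System ID
111 | CRM | 929992
111 | ERP | AKK-111
111 | Legacy | 098388188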
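The golden record and cross-reference index can be sketched with plain dictionaries. This assumes the three source records have already been matched to the same person; the function name, the field names, and the "first system in the list wins" survivorship rule are illustrative choices, not the product's behavior.

```python
def consolidate(unique_id, records, survivorship_order):
    """Build a golden record and a cross-reference index.

    records: dicts with 'system', 'key', and attribute fields, all assumed
    to describe the same entity. For each attribute, the first system in
    survivorship_order that has a value supplies it.
    """
    by_system = {r["system"]: r for r in records}
    golden = {"unique_id": unique_id}
    for system in survivorship_order:
        for field, value in by_system.get(system, {}).items():
            if field not in ("system", "key") and value:
                golden.setdefault(field, value)
    xref = [(unique_id, r["system"], r["key"]) for r in records]
    return golden, xref


records = [
    {"system": "CRM", "key": "929992", "first_name": "寶強", "last_name": "王"},
    {"system": "ERP", "key": "AKK-111", "first_name": "保強", "last_name": "王"},
    {"system": "Legacy", "key": "098388188", "full_name": "Wang Bao Qiang"},
]
golden, xref = consolidate(111, records, ["CRM", "ERP", "Legacy"])
```

The cross-reference rows keep every source key, so each original system can still be traced from the single unique ID.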
Informatica Platform
[Diagram: Informatica product portfolio. Components shown:]
Data Loader; PowerExchange; ActiveVOS; Data Replication; Data Synchronization; Identity Resolution; Cloud Integration; Cloud MDM; Cloud Extend; Cloud Test Data Management; Data Quality Assessment; Dynamic Data Masking; Heiler - PIM; VIBE Data Streaming; Data Quality; Data Integration Hub; Data Archive; Pro-Active Monitoring; Registry Edition; AE, RTE, PE; Business Glossary; Data Exchange; Data Subset; RulePoint; Multi-domain Hub; Streaming Edition; Data Replication; AddressDoctor; Data Transformation; Data Privacy; Real-Time Alert Manager; Data Director; Queuing Edition; PowerCenter; Persistent Edition
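Data Governance Blueprint for the Big Data Platform
Center: Data governance platform
Areas: Security, optimization and disaster recovery; Accuracy; Data integration; Timeliness
Capabilities and the component codes shown with them:
• Data discovery & data quality: IDQ
• Test data management: ILM/TDM
• Data and identity search and resolution: IIR
• Live archiving and system optimization: ILM/DataArchive
• Master data and customer data management: MDM
• Complex event processing: CEP
• Data integration: PWC
• Real-time change data capture: PWX
• Microsecond-level messaging data integration: EAI
• Unstructured and semi-structured data collection: B2B
• Cloud data integration: CLOUD
• MM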
Map Once. Deploy Anywhere.
ON PREMISE
CLOUD
HADOOP
3rd PARTY APPLICATIONS
Q&A