The Changing Role of the Enterprise Data Warehouse in the Era of Big Data Analytics: A Look at Data Integration and Data Quality
尹寒柏 (Bob Yin), Senior Product Specialist

Agenda
• About Informatica
• The changing role of the enterprise data warehouse in the era of big data analytics
• Informatica's solutions
• Informatica BDE demo
• Q&A

Informatica: The #1 Independent Leader in Data Integration
• Founded: 1993
• 2013 Revenue: $948.2 million
• 7-year Annual CAGR: 17% per year
• Employees: 3,210+
• Partners: 450+
  • Major SI, ISV, OEM and On-Demand Leaders
• Customers: Over 5,000
  • Customers in 82 Countries
• Direct Presence in 28 Countries
• #1 in Customer Loyalty Rankings (7 Years in a Row)
[Chart: yearly bars for 2006 to 2013, vertical axis from $0 to $900]

#1
• Magic Quadrant for Data Integration Tools, Gartner, 24 July 2014, ID:G00261678
• The Forrester Wave™: Hybrid2 Integration, Q1 2014
2015/10/1

Data Integration Hairball
[Diagram: "Best Buy - Application Diagram V4", November 10, 1999, DRAFT, page 1 of 2, prepared by Michelle Mills. A dense web of point-to-point interfaces between applications, for example S04 - Sales Posting, I07 Purchase Order, G02 - General Ledger and E01-EDI. Legend: mainframe apps are blue, PC/NT apps green, Unix apps yellow, 3rd party interfaces orange. Line colors have no special meaning; they only make the diagram easier to read. Interfaces to and from the Data Warehouse are not displayed on this diagram. For more information, see the database containing information about each application: Application V4.mdb.]

What Is Big Data?
• The "big" in big data is actually not its most interesting characteristic
• Big data comes in many different formats
  • Structured
  • Semi-structured
  • Unstructured
  • Raw data
• In some cases it looks nothing like the uniform text of numbers and codes that we have stored in data warehouses for the past 30 years
• Much big data cannot be analyzed with any SQL-like tool
• Big data is a paradigm shift in how we view data assets, where we collect them, how we analyze them, and how we turn the resulting insights into profit

Data Is Also an Asset on the Balance Sheet
• Enterprises increasingly recognize that data itself is an asset, and that it should be treated like the traditional assets, such as equipment and land, that have always appeared on the balance sheet since the manufacturing era
• There are several ways to determine the value of a data asset, including
  • The cost of producing the data
  • The cost of replacing the data if it is lost
  • The revenue or profit opportunity the data provides
  • The revenue or profit lost if the data falls into the hands of competitors
  • The legal exposure to fines and lawsuits if the data is disclosed to the wrong parties
• More important than the data itself, enterprises have shown that insight into data can be turned into profit

Data Warehouses Prove the Value of Data-Driven Insight
• Until recently, data warehouses focused on historical transaction data
• In the last ten years, three dramatic shifts have taken place in data warehousing
  • Low-latency operational data is brought into the data warehouse alongside the existing historical information
  • The second major shift, which has kept deepening throughout the decade, is the collection of customer behavior data
  • The third shift is the extraction of product preferences and customer sentiment from social media, and especially the huge volume of machine-generated unstructured data produced by the new business paradigms of .com companies. This does not mean unstructured data is a recent discovery; it means the analysis of unstructured data has only recently become mainstream

Demystifying Big Data Analytics
Big data analytics use cases
• Loan risk analysis and insurance policy underwriting
• Customer churn analysis
• Search ranking
• Ad tracking
• Location and proximity tracking
• Causal factor discovery
• Social CRM
• Document similarity testing
• Genomics analysis
• Discovery of customer cohort groups
• In-flight aircraft status
• Smart utility meters
• Building sensors
• Satellite image comparison
• CAT scan comparison
• Financial account fraud detection and prevention
• Computer system hacking detection and intervention
• Online game gesture tracking
• Big science, including atom smashers, weather analysis, and space probe telemetry feeds
• "Data bag" exploration
These are certainly very different from what traditional relational database systems handle.

Two architectures have emerged to address big data analytics

Big data is a gold mine

Big Data Project
80% of the work in big data projects is data integration
and data quality

Data Integration: The Challenge
• 1960s-1970s, MAINFRAME (OS/360): users are few employees; business value is back office automation; 10^2 sources
• 1980s, CLIENT-SERVER: users are many employees; business value is front office productivity; 10^4 sources
• 1990s, WEB: users are customers/consumers; business value is e-commerce; 10^6 sources
• 2007, CLOUD: users are business ecosystems; business value is line-of-business self-service; 10^7 sources
• 2011, SOCIAL: users are communities & society; business value is social engagement; 10^9 sources
• 2014, INTERNET OF THINGS: users are devices & machines; business value is real-time optimization; 10^11 sources

Solution: Enterprise Data Integration
The right data at the right time in the right way
ACCESS, DISCOVER*, CLEANSE*, INTEGRATE, DELIVER
• Universal Data Access
• Multi-Modal Data Provisioning
• Business-IT Collaboration
• Cost-Effective Scalability
• Any Data Source
• Any Latency: Batch, Real-time, Change Capture
• Any Delivery Mechanism: Physical Target, Publish as Web Service, Publish to Message Bus, Virtual View
* requires Informatica Data Quality

PowerCenter Big Data Edition
PowerCenter 9.6, the Vibe™ virtual data machine
• Capabilities: No-Code Productivity; Business-IT Collaboration; Unified Administration; Universal Data Access; High-Speed Data Ingestion and Extraction; ETL on Hadoop; Complex Data Parsing on Hadoop; Entity Extraction and Data Classification on Hadoop; Profiling on Hadoop
• Big Transaction Data
  • Online Transaction Processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server, …
  • Online Analytical Processing (OLAP) & DW Appliances: Teradata, Redbrick, EssBase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DataAllegro, Asterdata, Vertica, Paraccel, …
• Big Interaction Data
  • Social Media & Web Data: Facebook, Twitter, Linkedin, Youtube, …; Web applications, Blogs, Discussion forums, Communities, Partner portals, …
  • Other Interaction Data: Clickstream, image/Text, Scientific, Genomic/pharma, Medical, Medical/Device, Sensors/meters, RFID tags, CDR/mobile, …
  • Cloud: Salesforce.com, Concur, Google App Engine, Amazon, …
• Big Data Processing

PowerExchange Connectors
✔ = Accessible in Real-time and/or via Change Data Capture (CDC)
• Enterprise Applications, Software as a Service (SaaS): JDE EnterpriseOne; JDE World; Lotus Notes; Oracle E-Business Suite ✔; PeopleSoft Enterprise; Salesforce (salesforce.com) ✔; SAP NetWeaver ✔; SAP NetWeaver BI ✔; SAS; Siebel; Netsuite; Microsoft Dynamics
• Databases and Data Warehouses: Adabas for UNIX, Windows; C-ISAM; DB2 for LUW ✔; Essbase; EMC/Greenplum; Informix Dynamic Server; Netezza Performance Server; ODBC; Oracle ✔; SQL Server ✔; Sybase; Teradata
• Messaging Systems: JMS ✔; MSMQ ✔; TIBCO ✔; webMethods Broker ✔; WebSphere MQ ✔
• Technology Standards: Email (POP, IMAP); HTTP(S) ✔; LDAP ✔; Web Services ✔; XML
• Mainframe: Adabas for z/OS ✔; Datacom ✔; DB2 for z/OS, z/Linux ✔; IDMS ✔; IMS DB ✔; Oracle for z/Linux ✔; Teradata; WebSphere MQ for z/Linux ✔; VSAM ✔
• Big Data, Social, Hadoop: Asterdata, Greenplum, Vertica, ParAccel, Microsoft PDW, Kognitio; Facebook, Twitter, LinkedIn, DataSift, Kapow; MongoDB, HDFS, HIVE, HBASE
Cloud of Connectors

Real-Time Data Collection and Streaming
• Sources: Web Servers, Operations Monitors, rsyslog, SLF4J, etc.; Handhelds, Smart Meters, etc.; Internet of Things, Sensor Data; Discrete Data Messages
• Transport: Nodes on the Ultra Messaging Bus (Publish / Subscribe), with Zookeeper and Management and Monitoring
• Targets: HDFS, HBase; Real Time Analysis, Complex Event Processing; NoSQL Databases: Cassandra, Riak, MongoDB
• Leverage High Performance Messaging Infrastructure: publish with Ultra Messaging for global distribution without additional staging or landing

Supported Data Formats
• Categories: Unstructured, Semi-structured, XML
• Formats listed: Microsoft Word, Microsoft Excel, PDF, PowerPoint, Star Office, Word Perfect, ASCII reports, HTML, EBCDIC, Undocumented binaries, Flat files, RPG, ANSI, HL7, SWIFT, AL3, HIPAA, EDI-X12, EDI-Fact, FIX, NACHA, ASTM, Cargo IMP, COBOL, PL1, UCS, WINS, VICS, ACORD XML, LegalXML, IFX, cXML, ebXML, HL7 V3.0
• Unstructured print streams: AFP, Post Script

Unstructured Data
• A parser contains a script which uses ins to parse through an unstructured document.
• A marker is a unique anchor that defines an area in the document.
• Content highlights the data that you wish to capture.
• The XSD describes the content that you will be capturing.

Shanghai Customs supervision operations: scraping external web data, using Taobao data as an example

What is Data Quality? Let's walk through an example
• Input: Mr Frank Reagan, 12 Richmond Hill Rd, Staten Island, NY
• Steps: Profile, Parse, Correct, Standardize, Match, Consolidate, Enhance
• Parse: fields First Name, Last Name, AddressL1, AddressL2, City, County, Post Code; values Frank, Reagan, 12 Richmond Hill Road, Staten Island, NY, 10341 (the slide overlays a second sample address: George's Quay House, 43 Townsend Street, Dublin)
• Correct and Standardize: First Name: Francis; Mid Name: Joseph; Last Name: Reagan; AddressL1: 12 Richmond Hill Road; AddressL2: (empty); City: Staten Island; State: NY; Zip Code: 10341; Phone: (212) 423 49 0866; Email: [email protected]
• Match against the CRM System record: F.J. Ragan, NEW YORK, [email protected], (212) 423 490866
• Consolidate and Enhance into the Golden Record: Salutation: Commissioner; First Name: Francis; Mid Name: Joseph; Last Name: Reagan; AddressL1: 12 Richmond Hill Road; AddressL2: (empty); City: Staten Island, New York; Zip code: 10341; ZIP+4: 5963; Latitude: 40.588; Longitude: -74.167; Phone: (212) 423 49 0866; Email: [email protected]; C_Category: White Collar families; C_Group: Affluent Families

Developer
• Eclipse based
• Comprehensive transformations
• Re-usable rules
• Mid-Stream data preview

Data Quality
• Transformations

Profiling: Discover the data
• Column & Rule Profiling
• Value & Pattern Freqs
• Drill Down Results

Data Standardization (Standardizer)
• (02)25770257 => 02-25770257
• 0492734503 => 049-2734503
• 083-655203 => 0836-55203
• 0936936887 => 0936-936887
• 082668351 => 0826-68351
• Variants of one company name: 台積電, 台灣積體電路股份有限公司, 台灣積體電路(股)有限公司, 台灣積體電路製造股份有限公司

Data Parsing (Parser)
Customer data (客戶資料): legacy data on the left, new data split into Token1, Token2, Token3 on the right
• 新安國際有限公司 使用人:周凱 => 新安國際有限公司, 周凱 (使用人 means "user")
• 全欣交通股份有限公司靠行司機:張大 => 全欣交通股份有限公司, 張大 (靠行司機 means "affiliated driver")
• 詹一中先生 陳春 => 詹一中, 陳春 (先生 means "Mr.")
• 使用人:陳裕.許麗 => 陳裕, 許麗

Standardizer & Parser applied to Chinese addresses

Probabilistic Parsing and Labelling
• Probabilistic approach using Natural Language Processing (NLP)
• Support for statistical models to predict relations between words
• Able to correctly label ambiguous terms that can have more than 1 meaning
• Reduce mapping complexity
• The Stanford Natural Language Processing Group
Example
• Input: 'BROADCASTING HOUSE ATTN: HILARY THOMAS, ROOM G12 ,ACCOUNT SERVICES ENGINEERING'
• Output (highlights show terms used during configuration / model training):
  • LOC: BROADCASTING HOUSE
  • NOISE: ATTN
  • PERS: HILARY THOMAS
  • LOC: ROOM G12
  • ORG: ACCOUNT SERVICES ENGINEERING
Probabilistic Model

Data Association Matching (Match)
Legacy data 1 and Legacy data 2, keyed by national ID number (身份證號), with the name (姓名) recorded in each
• A123456789: 王門騫 in Legacy data 1, 王門鶱 in Legacy data 2
• A223456789: 陳東壁 in Legacy data 1, 陳東墾 in Legacy data 2
• A112345678: 闕劍明 in Legacy data 1, 關劍明 in Legacy data 2
• A212345678: 管廷興 in Legacy data 1, 管延興 in Legacy data 2
Match scores (Legacy code1, Legacy code2, Match Score)
• 王門騫, 王門鶱: 0.666666687
• 陳東壁, 陳東墾: 0.666666687
• 闕劍明, 關劍明: 0.666666687
• 鐘宜偉, 鍾怡緯: 0.00000000
Japanese examples
• One name in four scripts: Toyotomi Hideyoshi, 豊臣秀吉, トヨトミヒデヨシ, とよとみひでよし
• Address values, first group: 上本町207, シャトー上本町303, シャトー上本町303, 兵庫県 小野市
• Address values, second group: 上本町207, 上本町303, シャトー上本町33, 兵庫県 野市

Informatica Identity Resolution
• Identifying a person is difficult and complex, and the national ID number cannot be the only criterion
  • Multiple identity documents
  • Homophones (鐘宜偉 vs 鍾怡緯)
  • Word order (Mark Douglas or Douglas Mark)
  • Redundant words (Mr., Ms., titles)
  • Translation and romanization errors (Yin/Ying)
  • Cross-language matching (Simplified/Traditional Chinese, Chinese/English)
  • Name + address + phone + company or organization
• IIR advantages
  • Support for pronunciation in multiple languages
  • Flexible matching methods
  • Extensibility
• DQ, IR

Fuzzy Matching
• Homophones
• Synonyms
• Multiple languages
• Traditional/Simplified Chinese
• Simplified Chinese/English
• Simplified Chinese/English (Cantonese)

Consolidation Management (Consolidate)
Customer example: Identity Resolution
• Use fuzzy matching to flag similar data. Use subsets of the data for exact matching (e.g. last name, first name, contact information, address information)
• Source records
  • Marketing system: Primary Key 929992, First Name 寶強, Last Name 王
  • Call center: Primary Key AKK-111, First Name 保強, Last Name 王
  • Integrated management system: Primary Key 098388188, Full Name Wang Bao Qiang
• Create a unique ID and choose the appropriate attributes for the golden record
  • Unique ID 111, First Name 寶強, Last Name 王
• Create a cross-reference index that points to each original system's primary key and system identifier (Unique ID, Source System, System ID)
  • 111, CRM, 929992
  • 111, ERP, AKK-111
  • 111, Legacy, 098388188

Informatica Platform
Data Loader; PowerExchange; ActiveVOS; Data Replication; Data Synchronization; Identity Resolution; Cloud Integration; Cloud MDM; Cloud Extend; Cloud Test Data Management; Data Quality Assessment; Dynamic Data Masking; Heiler - PIM; VIBE Data Streaming; Data Quality; Data Integration Hub; Data Archive; Pro-Active Monitoring; Registry Edition; AE, RTE, PE; Business Glossary; Data Exchange; Data Subset; RulePoint; Multi-domain Hub; Streaming Edition; Data Replication; AddressDoctor; Data Transformation; Data Privacy; Real-Time Alert Manager; Data Director; Queuing Edition; PowerCenter; Persistent Edition

Data Governance Blueprint for the Big Data Platform
Data governance platform
• Accuracy
  • Data profiling & data quality: IDQ
  • Data and identity search and resolution: IIR
  • Master data and customer data management: MDM
• Security, optimization and disaster recovery
  • Test data management: ILM/TDM
  • Live backup and system optimization: ILM/DataArchive
• Timeliness
  • Complex event processing: CEP
  • Real-time change data capture: PWX
  • Microsecond-level messaging data integration: EAI
• Data integration: PWC, MM
  • Unstructured and semi-structured data collection: B2B
  • Cloud data integration: CLOUD

Map Once. Deploy Anywhere.
ON PREMISE, CLOUD, HADOOP, 3rd PARTY APPLICATIONS

Q&A
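The Standardizer and Match examples from the slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Informatica's implementation: `AREA_CODES` is a hypothetical prefix table holding just the area codes that appear in the slide's phone examples, and `match_score` assumes the score is one minus the character edit distance divided by the longer name's length, which reproduces the slide's values (about 0.667 when one of three characters differs, 0 when all three differ).

```python
import re

# Assumption: hypothetical prefix table holding only the area codes
# that appear in the slide's phone examples, not a complete list.
AREA_CODES = ("0836", "0826", "049", "02")


def standardize_phone(raw: str) -> str:
    """Rewrite a phone number as <prefix>-<rest>, as on the Standardizer slide."""
    digits = re.sub(r"\D", "", raw)  # drop parentheses, hyphens, spaces
    if re.fullmatch(r"09\d{8}", digits):  # assumed mobile pattern, 4-digit prefix
        return f"{digits[:4]}-{digits[4:]}"
    # Try longer prefixes first so the most specific area code wins.
    for code in sorted(AREA_CODES, key=len, reverse=True):
        if digits.startswith(code):
            return f"{code}-{digits[len(code):]}"
    return digits  # unknown prefix: return the bare digits


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match_score(a: str, b: str) -> float:
    """Assumed scoring: 1 - edit distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    for raw in ["(02)25770257", "0492734503", "083-655203", "0936936887", "082668351"]:
        print(raw, "=>", standardize_phone(raw))
    for a, b in [("王門騫", "王門鶱"), ("鐘宜偉", "鍾怡緯")]:
        print(a, b, round(match_score(a, b), 3))
```

Under this assumed scoring the homophone pair 鐘宜偉 and 鍾怡緯 scores 0, because no character is shared. Plain character comparison therefore misses it, and homophones are one of the cases the Identity Resolution slide lists.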