Semistructured Data and Mediation Prof. Letizia Tanca Dipartimento di Elettronica e Informazione Politecnico di Milano SEMISTRUCTURED DATA FOR THESE DATA THERE IS SOME FORM OF STRUCTURE, BUT IT IS NOT AS – PRESCRIPTIVE – REGULAR – COMPLETE AS IN TRADITIONAL DBMSs EXAMPLES – WEB DATA – XML DATA – BUT ALSO DATA DERIVED FROM THE INTEGRATION OF HETEROGENEOUS DATASOURCES 1 AN EXAMPLE OF SEMISTRUCTURED DATA a page produced from a database 2 AN EXAMPLE OF SEMISTRUCTURED DATA 3 AN EXAMPLE OF SEMISTRUCTURED DATA GRAPH-BASED REPRESENTATION: THE IRREGULAR DATA STRUCTURE APPEARS MORE CLEARLY PROFESSOR NAME AGE 37 NAME LAST NAME KING TEACHES PROFESSOR TEACHES COURSE NAME ABSTRACT DATABASES OBJECTS AGE YEARS NAME? 45 CONCRETE VALUES 4 A SIMPLE XML DOCUMENT WITH ITS GRAPH BASED REPRESENTATION <producer> <mn-name>Mercury</mn-name> <year>1999</year> PRODUCER <model> <mo-name>Sable LT</mo-name> mn-name <front-rating>3.84</front-rating> year <side-rating>2.14</side-rating> <rank>9</rank> </model> ……… Mercury 1999 </producer> ….. MODEL rank mo-name Sable LT front-rating side-rating 9 3.84 2.14 5 INFORMATION SEARCH IN SEMISTRUCTURED DATABASES • WE WOULD LIKE TO: – INTEGRATE – QUERY – COMPARE DATA WITH DIFFERENT STRUCTURES ALSO WITH SEMISTRUCTURED DATA, JUST AS IF THEY WERE ALL STRUCTURED 6 DYNAMIC INTEGRATION OF SEMISTRUCTURED DATABASES AN OVERALL DATA REPRESENTATION SHOULD BE PROGRESSIVELY BUILT, AS WE DISCOVER AND EXPLORE NEW INFORMATION SOURCES 7 SEMISTRUCTURED DATA MODELS • BASED ON – TEXT – TREES – GRAPHS • LABELED NODES • LABELED ARCS • BOTH • THEY ARE ALL DIFFERENT AND DO NOT LEND THEMSELVES TO EASY INTEGRATION 8 Recall: MEDIATORS The term mediation includes: • the processing needed to make the interfaces work • the knowledge structures that drive the transformations needed to transform data to information • any intermediate storage that is needed (Wiederhold) 9 TSIMMIS APPLICATION 1 (LOREL) (OEM) WRAPPER 1 DATASOURCE 1 (RDBMS) MEDIATOR 1 APPLICATION 2 (LOREL) (OEM) WRAPPER 2 DATASOURCE 2 (LOTUS NOTES) MEDIATOR 2 WRAPPER 3 DATASOURCE 3 (WWW) 10 Mediator-based approach IN TSIMMIS: • UNIQUE, GRAPH-‐BASED DATA MODEL • DATA MODEL MANAGED BY THE MEDIATOR • WRAPPERS FOR THE MODEL-‐TO-‐MODEL TRANSLATIONS 11 OEM (Object Exchange Model) (TSIMMIS) • Graph-based • Does not represent the schema • Directly represents data : self-descriptive <temp-in-farenheit,int,80> 12 Object structure <Object-id,label,type,value> Nested structure 13 OEM (Object Exchange Model) (TSIMMIS) 14 Typical complications when integrating • Each mediator is specialized into a certain domain (e.g. weather forecast), thus • Each mediator must know domain metadata , which convey the data semanWcs • On-‐line duplicate recogniWon and removal (no designer to solve conflicts at design Wme here) 15 Query formulation “Find books authored by Aho” select library.book.title where library.book.author = “Aho” from library (if more than one root is available) OK, but if this query must be produced at run-time, how does the user (or the system, if a transformation has to be applied) know that: – A node library exists, which contains nodes book, which in turn contain fields author and title – TSIMMIS uses the Dataguide: a-posteriori schema, progressively built while exploring the data sources 16 TSIMMIS’s language is LOREL • Lightweight Object REpository Language • Object-based • Similar to OQL, with some modifications appropriate for semistructured data select library.book.title where library.book.author = “Aho” from library 17 WRAPPERS (translators) • Convert queries into queries/commands which are understandable for the specific data source – they can extend the query possibilities of a data source • Convert query results from the source’s format to a format which is understandable for the application 18 WRAPPERS Wrapper HTML page BookTitle Author Editor The HTML Sourcebook J. Graham ... Computer Networks A. Tannenbaum ... Database Systems R. Elmasri, S. Navathe ... Data on the Web S. Abiteboul, P.Buneman, D.Suciu … database table(s) (or XML docs) 19 Example: estraction of information from HTML docs • Information extraction – Source Format: plain text with HTML tags (no semantics) – Target Format: relational table (possibly nested, NF2) or XML (we add structure, i.e. semantics ) • Wrapper – Software module which performs an extraction step – Intuition: use extraction rules which exploit the marking tags 20 A complex extraction process 20-30KB IN HTML >10 attributes with nesting 21 Problems • Web sites change very frequently • A layout change may affect the extracWon rules • Human-‐based maintenance of an ad-‐hoc wrapper is very expensive • Be]er: automa&c wrapper genera&on 22 Automa'c wrapper genera'on... • We can only use them when pages are regular to some extent • OK when: – Many pages sharing the same structure – e.g. pages are dynamically generated from a DB à data intensive web sites 23 Online library Books Name AuthorsTitle Year Editions Details Price Details 20$ First … Year Price #e1 #b1 #a1 John Smith #e2 First Edition, 2000 Paperback 30$ 1998 Second 20$ #b1 … #b2 This book … 1998 #a2 Paul Jones #e3 … … #e4 #b3 1999 30$ #e5 #b4 1993 30$ #e6 #b4 Id An undergraduate … Author Title #b1 #a1 Database Primer … #b2 #a1 Computer Systems XML at Work HTML and Scripts #e7 A comprehensive … An useful HTML … #b3 #a2 XML at Work #b4 #a2 HTML and Script JavaScripts #b5 … Book Name Computer Systems Books Paul Jones Description Id Id Database Primer John Smith Editions #a2 …… … A must in … Java Script …… 1995 40$ Second Edition, Hard Cover First Edition, Paperback First … 2000 1995 30$ First … null 40$ 1999 45$ Second Description #b5 First Edition, 2000 Paperback 50$ 1999 null 30$ … … … … … null … Second …Edition, Hard Cover 1993 30$ 1999 45$ 2000 50$ … … … null … … … Source Dataset HTML Pages 24 The ROAD RUNNER project • Page Class – It is the collec'on of all pages generated by the same script from a common dataset • Schema DerivaWon – Given a set of HTML sample pages, belonging to the same class, find the underlying dataset structure (database schema) à SoluWon: Wrapper Generator • Underlying dataset structure • ExtracWon rules 25 B A Wrapper Output C D F E G First Edition, 1998 … This book … John Second 2000 Smith Edition, … An Computer First Edition, undergraduate 1995 Systems … … A First Edition, <HTML><BODY><TABLE> comprehensive XML at Work <HTML><BODY><TABLE> 1999 … <TR> <TR> … <TD><FONT>books.com<FONT></TD> <TD><FONT>books.com<FONT></TD> null 1993 <TD><A>John Smith</A></TD> <TD><A>Paul Jones</A></TD> HTML and An useful HTML Paul Jones Second </TR> </TR> Scripts … Edition, … 1999 <TR> <TR> <TD><FONT>Database Primer<FONT></TD> <TD><FONT>XML at Work<FONT></TD> JavaScripts A must in … Edition, null Paperback</B></TD>, 2000 <TD><B>First Edition, Paperback</B></TD>, <TD><B>First … … … … … … … Database Primer Input HTML Sources Wrapper = Grammar <HTML><BODY><TABLE> <TR> <TD><FONT>books.com<FONT></TD> <TD><A> #PCDATA</A></TD></TR> (<TR> ( <TD><FONT> #PCDATA <FONT></TD> ( <TD><B> #PCDATA </B> )? </TD> ( <TR><TD><FONT> #PCDATA <FONT> <B> #PCDATA </B> </TD></TR> )+ )+… H 20$ 30$ 40$ 30$ 30$ 45$ 50$ … Target Schema SET ( TUPLE (A : #PCDATA; B : SET ( TUPLE ( C : #PCDATA; D : #PCDATA; E : SET ( TUPLE ( F: #PCDATA; G: #PCDATA; 26 H: #PCDATA))) Model Management approach (Atzeni, Bernstein, others…) • Given two different data models (e.g. OO and relational, or XML and OO, etc.) (when datasources are at least semistructured) • Define general mappings from one model into another, which allow to – Map SQL schema to XML schema – Map data source to data warehouse – Map OO classes to data source tables, … • To this end, one possibility is to use a metamodel OBSERVATION: constructs in the various models are similar • Only few categories of basic constructs (see Hull & King 1986): – – – – – – – Lexical: set of printable values (domain) Abstract (entity, class, …) Aggregation: a construction based on (subsets of) Cartesian products (relationship, table) Function (attribute, property) Hierarchies … • A metamodel uses these basic constructs to create the constructs of the model we need to represent 28 METAMODEL • A METAMODEL IS AN ABSTRACT MODEL FOR THE SPECIFICATION OF CONCRETE MODELS • TWO TYPES OF METAMODELS: 1. GENERAL ENTITIES WHOSE SPECIALIZATIONS BECOME OBJECTS IN THE TARGET MODEL, E.G. GSMM 2. ENTITIES DESCRIBING THE OBJECTS OF THE TARGET MODEL, E.G.GEOGRAPHIC DATA FILES (GDF) 29 Using a metamodel for integrated data representation • TRANSLATION OF DIFFERENT MODELS INTO A UNIQUE FORMALISM • EASY A-PRIORI COMPARISON BETWEEN THE DIFFERENT MODEL’S FEATURES • AUTOMATIC TRANSLATION DICTATED BY THE REPRESENTATION RULES OF THE CONCRETE MODEL INTO THE METAMODEL 30 Data IntegraWon based on a Meta-‐model Q (In the metamodel language) Q1 (In DS1 language) Meta-‐model-‐based schema Qn Q2 Q3 (In DS2 language) (In DS3 language) (In DSn language) DSn DS1 DS2 DS3 … 31 GSMM (General Semistructured Meta-‐Model) • GRAPH-‐BASED • DOES NOT REPRESENT THE SCHEMA: SELF-‐DESCRIPTIVE AS TSIMMIS • GSSM REPLACES THE CONCEPT OF SCHEMA WITH THAT OF CONSTRAINT • INTRODUCTION OF FLEXIBILITY IN DATA REPRESENTATION 32 An XML document… ….. PRODUCERS <producers> <producer> <mn-name>Mercury</mn-name> <year>1999</year> <model> <mo-name>Sable LT</mo-name> <front-rating>3.84</frontrating> <side-rating>2.14</side-rating> <rank>9</rank> </model> …… </producer> … </producers> ….. PRODUCER mn-name year MODEL Mercury 1999 rank mo-name Sable LT front-rating side-rating 9 3.84 2.14 33 Its representation in GSMM ….. <PRODUCERS,root,0,⊥> sub-element of sub-element of ….. <PRODUCER,element,1, ⊥> sub-element of <mn-name,element,1, Mercury> sub-element of <MODEL,element,3, ⊥> <year,element,2, 1999> sub-element of <mo-name,element,1, SableLT> sub-element of sub-element of <rank,element,4, 9> <front-rating,element,2, 3.84> sub-element of <side-rating,element,3, 2.14> 34 APPLICATION OF THE GSMM METAMODEL • THE METAMODEL ALLOWS THE CONSTRUCTION OF A GENERIC GRAPH WHICH REPRESENTS AN INSTANCE OF A CONCRETE MODEL • PLUS CONSTRAINTS (also graphs): represented by means of – HIGH LEVEL, OR META- CONSTRAINTS – these dictate the sintactic rules of the object data model – LOW LEVEL CONSTRAINTS – as usual, these dictate the application domain semantics • IT IS A METAMODEL OF THE FIRST KIND 35 ModelGen (Atzeni et al.) A model management operator: • given two data models M1 and M2, and a schema S1 of M1 (the source schema and model) • generate a schema S2 of M2 (the target schema and model), corresponding to S1 • and, for each database D1 over S1, generate an equivalent database D2 over S2 36 Geographic Data Files (GDF) • GDF is a standard, used for describing roadmaps – CITY STREET REGISTER – STREET SIGN ARCHIVE – TRAFFIC CONTROL SYSTEMS – CAR NAVIGATOR SYSTEMS (GPS) – ….. • IT IS A METAMODEL OF THE SECOND KIND 37 GDF STRUCTURE • DATASET OF 82 ASCII CHARACTERS RECORDS – ENTITIES – ATTRIBUTES – RELATIONS • 11 INTEREST THEMES – – – – – – – ROADS AND FERRIES BRIDGES AND TUNNELS RAILROADS RIVERS PUBLIC TRANSPORTATION ADMINISTRATION AREAS …….. • 3 DESCRIPTION LEVELS FOR EACH THEME 38 GDF • LEVEL 0 – TOPOLOGY – A PLANAR GRAPH: • POINT • ARC • NODE • POLYGON – EACH ELEMENT IS UNIQUELY IDENTIFIED BY AN ID – TOPOLOGICAL RULES GOVERN INTERNAL STRUCTURE AND RELATIONSHIPS AMONG ELEMENTS 39 GDF LEVELS • LEVEL 1 – ELEMENTARY ENTITIES THIS IS THE BASIS FOR THE CITY STREET REGISTER – ROAD ELEMENT – JUNCTION – TRAFFIC AREA • LEVEL 2 – COMPLEX ENTITIES THIS IS THE BASIS FOR THE GEOGRAPHIC INFORMATION SYSTEMS – STREET – INTERSECTION 40 GDF LEVELS LEVEL 2 CL203 CS102 CS103 L524 C627 C621 L811 L812 C624 L813 C628 L815 C630 C625 L814 C629 C626 L816 C631 L821 LEVEL 0 C622 L820 C623 L817 S870 LEVEL 1 L819 L322 S869 L818 L525 S868 L321 S867 CL235 CITY MAP CL234 VIALE ABRUZZI 41 SUMMARY Letizia Tanca DB design and integration 42 Bibliography on Data IntegraWon 43
© Copyright 2024 ExpyDoc