Dynamic data integration, wrappers and mediators

Semistructured Data
and Mediation
Prof. Letizia Tanca
Dipartimento di Elettronica e Informazione
Politecnico di Milano
SEMISTRUCTURED DATA
FOR THESE DATA THERE IS SOME FORM OF STRUCTURE, BUT IT IS NOT AS –  PRESCRIPTIVE –  REGULAR –  COMPLETE AS IN TRADITIONAL DBMSs EXAMPLES –  WEB DATA –  XML DATA –  BUT ALSO DATA DERIVED FROM THE INTEGRATION OF HETEROGENEOUS DATASOURCES 1
AN EXAMPLE OF
SEMISTRUCTURED DATA
a page produced from a database
2
AN EXAMPLE OF
SEMISTRUCTURED DATA
3
AN EXAMPLE OF
SEMISTRUCTURED DATA
GRAPH-BASED REPRESENTATION: THE IRREGULAR DATA
STRUCTURE APPEARS MORE CLEARLY
PROFESSOR
NAME
AGE
37
NAME
LAST NAME
KING
TEACHES
PROFESSOR
TEACHES
COURSE
NAME
ABSTRACT DATABASES
OBJECTS
AGE
YEARS
NAME?
45
CONCRETE
VALUES
4
A SIMPLE XML DOCUMENT WITH ITS
GRAPH BASED REPRESENTATION
<producer>
<mn-name>Mercury</mn-name>
<year>1999</year>
PRODUCER
<model>
<mo-name>Sable LT</mo-name>
mn-name
<front-rating>3.84</front-rating>
year
<side-rating>2.14</side-rating>
<rank>9</rank>
</model>
………
Mercury
1999
</producer>
…..
MODEL
rank
mo-name
Sable LT
front-rating side-rating
9
3.84
2.14
5
INFORMATION SEARCH IN
SEMISTRUCTURED DATABASES
•  WE WOULD LIKE TO: –  INTEGRATE –  QUERY –  COMPARE DATA WITH DIFFERENT STRUCTURES ALSO WITH SEMISTRUCTURED DATA, JUST AS IF THEY WERE ALL STRUCTURED 6
DYNAMIC INTEGRATION OF
SEMISTRUCTURED DATABASES
AN OVERALL DATA REPRESENTATION SHOULD BE PROGRESSIVELY BUILT, AS WE DISCOVER AND EXPLORE NEW INFORMATION SOURCES 7
SEMISTRUCTURED
DATA MODELS
•  BASED ON –  TEXT –  TREES –  GRAPHS •  LABELED NODES •  LABELED ARCS •  BOTH •  THEY ARE ALL DIFFERENT AND DO NOT LEND THEMSELVES TO EASY INTEGRATION 8
Recall: MEDIATORS The term mediation includes:
•  the processing needed to make
the interfaces work
•  the knowledge structures that
drive the transformations needed
to transform data to information
•  any intermediate storage that is
needed (Wiederhold)
9
TSIMMIS
APPLICATION 1
(LOREL)
(OEM)
WRAPPER 1
DATASOURCE 1
(RDBMS)
MEDIATOR 1
APPLICATION 2
(LOREL)
(OEM)
WRAPPER 2
DATASOURCE 2
(LOTUS NOTES)
MEDIATOR 2
WRAPPER 3
DATASOURCE 3
(WWW)
10
Mediator-based approach
IN TSIMMIS: •  UNIQUE, GRAPH-­‐BASED DATA MODEL •  DATA MODEL MANAGED BY THE MEDIATOR •  WRAPPERS FOR THE MODEL-­‐TO-­‐MODEL TRANSLATIONS 11
OEM (Object Exchange Model)
(TSIMMIS)
•  Graph-based
•  Does not represent the schema
•  Directly represents data : self-descriptive
<temp-in-farenheit,int,80>
12
Object structure
<Object-id,label,type,value>
Nested structure
13
OEM (Object Exchange Model)
(TSIMMIS)
14
Typical complications when
integrating
•  Each mediator is specialized into a certain domain (e.g. weather forecast), thus •  Each mediator must know domain metadata , which convey the data semanWcs •  On-­‐line duplicate recogniWon and removal (no designer to solve conflicts at design Wme here) 15
Query formulation
“Find books authored by Aho”
select library.book.title
where library.book.author = “Aho”
from library (if more than one root is available)
OK, but if this query must be produced at run-time, how
does the user (or the system, if a transformation has to
be applied) know that:
–  A node library exists, which contains nodes book, which in turn
contain fields author and title
–  TSIMMIS uses the Dataguide: a-posteriori schema,
progressively built while exploring the data sources
16
TSIMMIS’s language is LOREL
•  Lightweight Object REpository Language
•  Object-based
•  Similar to OQL, with some modifications
appropriate for semistructured data
select library.book.title
where library.book.author = “Aho”
from library
17
WRAPPERS (translators) •  Convert queries into queries/commands
which are understandable for the specific
data source
–  they can extend the query possibilities of a data
source
•  Convert query results from the source’s
format to a format which is understandable
for the application
18
WRAPPERS Wrapper
HTML page
BookTitle
Author
Editor
The HTML
Sourcebook
J. Graham
...
Computer
Networks
A. Tannenbaum
...
Database
Systems
R. Elmasri,
S. Navathe
...
Data on the
Web
S. Abiteboul,
P.Buneman,
D.Suciu
…
database table(s)
(or XML docs)
19
Example:
estraction of information from
HTML docs
•  Information extraction
–  Source Format: plain text with HTML tags (no
semantics)
–  Target Format: relational table (possibly nested, NF2)
or XML (we add structure, i.e. semantics )
•  Wrapper
–  Software module which performs an extraction step
–  Intuition: use extraction rules which exploit the marking
tags
20
A complex extraction process
20-30KB IN HTML
>10 attributes
with nesting
21
Problems
•  Web sites change very frequently •  A layout change may affect the extracWon rules •  Human-­‐based maintenance of an ad-­‐hoc wrapper is very expensive •  Be]er: automa&c wrapper genera&on 22
Automa'c wrapper genera'on... •  We can only use them when pages are regular to some extent •  OK when: –  Many pages sharing the same structure –  e.g. pages are dynamically generated from a DB à data intensive web sites 23
Online library
Books
Name
AuthorsTitle
Year
Editions
Details
Price
Details
20$
First …
Year Price
#e1
#b1
#a1
John Smith
#e2
First Edition, 2000
Paperback 30$ 1998 Second
20$
#b1
…
#b2
This book …
1998
#a2
Paul Jones
#e3
…
…
#e4
#b3
1999
30$
#e5
#b4
1993
30$
#e6
#b4
Id
An undergraduate …
Author
Title
#b1
#a1
Database Primer
…
#b2
#a1
Computer Systems
XML at Work
HTML and Scripts
#e7
A comprehensive
…
An useful HTML …
#b3
#a2
XML at Work
#b4
#a2
HTML and Script
JavaScripts
#b5
…
Book
Name
Computer Systems
Books
Paul Jones
Description
Id
Id
Database Primer
John Smith
Editions
#a2
…… …
A must in …
Java Script
……
1995
40$
Second Edition, Hard Cover
First Edition, Paperback
First …
2000
1995
30$
First …
null
40$
1999
45$
Second
Description
#b5
First Edition, 2000
Paperback 50$ 1999 null
30$
… …
…
…
…
null
…
Second
…Edition, Hard Cover
1993
30$
1999
45$
2000
50$
…
…
…
null
…
…
…
Source Dataset
HTML Pages
24
The ROAD RUNNER project
•  Page Class –  It is the collec'on of all pages generated by the same script from a common dataset •  Schema DerivaWon –  Given a set of HTML sample pages, belonging to the same class, find the underlying dataset structure (database schema) à SoluWon: Wrapper Generator •  Underlying dataset structure •  ExtracWon rules 25
B
A
Wrapper Output
C
D
F
E
G
First Edition,
1998
…
This book …
John
Second
2000
Smith
Edition, …
An
Computer
First Edition,
undergraduate
1995
Systems
…
…
A
First Edition,
<HTML><BODY><TABLE>
comprehensive
XML at Work <HTML><BODY><TABLE>
1999
…
<TR>
<TR>
…
<TD><FONT>books.com<FONT></TD>
<TD><FONT>books.com<FONT></TD>
null
1993
<TD><A>John Smith</A></TD>
<TD><A>Paul
Jones</A></TD>
HTML and
An useful HTML
Paul Jones
Second
</TR>
</TR>
Scripts
…
Edition, …
1999
<TR>
<TR>
<TD><FONT>Database Primer<FONT></TD>
<TD><FONT>XML at Work<FONT></TD>
JavaScripts
A must in … Edition,
null Paperback</B></TD>,
2000
<TD><B>First Edition, Paperback</B></TD>,
<TD><B>First
…
…
…
…
…
…
…
Database
Primer
Input HTML Sources
Wrapper = Grammar
<HTML><BODY><TABLE>
<TR>
<TD><FONT>books.com<FONT></TD>
<TD><A> #PCDATA</A></TD></TR>
(<TR>
( <TD><FONT> #PCDATA <FONT></TD>
( <TD><B> #PCDATA </B> )? </TD>
( <TR><TD><FONT> #PCDATA <FONT>
<B> #PCDATA </B> </TD></TR> )+
)+…
H
20$
30$
40$
30$
30$
45$
50$
…
Target Schema
SET (
TUPLE (A : #PCDATA;
B : SET (
TUPLE ( C : #PCDATA;
D : #PCDATA;
E : SET (
TUPLE ( F: #PCDATA;
G: #PCDATA;
26
H: #PCDATA)))
Model Management approach
(Atzeni, Bernstein, others…)
•  Given two different data models (e.g. OO and
relational, or XML and OO, etc.) (when datasources
are at least semistructured)
•  Define general mappings from one model into
another, which allow to
–  Map SQL schema to XML schema
–  Map data source to data warehouse
–  Map OO classes to data source tables, …
•  To this end, one possibility is to use a metamodel
OBSERVATION:
constructs in the various models are similar
•  Only few categories of basic constructs (see Hull &
King 1986):
– 
– 
– 
– 
– 
– 
– 
Lexical: set of printable values (domain)
Abstract (entity, class, …)
Aggregation: a construction based on (subsets of)
Cartesian products (relationship, table)
Function (attribute, property)
Hierarchies
…
•  A metamodel uses these basic constructs to create the
constructs of the model we need to represent
28
METAMODEL
•  A METAMODEL IS AN ABSTRACT MODEL FOR THE SPECIFICATION OF CONCRETE MODELS •  TWO TYPES OF METAMODELS: 1.  GENERAL ENTITIES WHOSE SPECIALIZATIONS BECOME OBJECTS IN THE TARGET MODEL, E.G. GSMM 2.  ENTITIES DESCRIBING THE OBJECTS OF THE TARGET MODEL, E.G.GEOGRAPHIC DATA FILES (GDF) 29
Using a metamodel
for integrated data representation
•  TRANSLATION OF DIFFERENT MODELS INTO A
UNIQUE FORMALISM
•  EASY A-PRIORI COMPARISON BETWEEN THE
DIFFERENT MODEL’S FEATURES
•  AUTOMATIC TRANSLATION DICTATED BY THE
REPRESENTATION RULES OF THE CONCRETE
MODEL INTO THE METAMODEL
30
Data IntegraWon based on a Meta-­‐model Q
(In the
metamodel
language)
Q1
(In DS1
language)
Meta-­‐model-­‐based schema Qn
Q2
Q3
(In DS2
language)
(In DS3
language)
(In DSn
language)
DSn DS1 DS2 DS3 …
31
GSMM (General Semistructured Meta-­‐Model) •  GRAPH-­‐BASED •  DOES NOT REPRESENT THE SCHEMA: SELF-­‐DESCRIPTIVE AS TSIMMIS •  GSSM REPLACES THE CONCEPT OF SCHEMA WITH THAT OF CONSTRAINT •  INTRODUCTION OF FLEXIBILITY IN DATA REPRESENTATION 32
An XML document…
…..
PRODUCERS
<producers>
<producer>
<mn-name>Mercury</mn-name>
<year>1999</year>
<model>
<mo-name>Sable LT</mo-name>
<front-rating>3.84</frontrating>
<side-rating>2.14</side-rating>
<rank>9</rank>
</model>
……
</producer>
…
</producers>
…..
PRODUCER
mn-name
year
MODEL
Mercury
1999
rank
mo-name
Sable LT
front-rating side-rating
9
3.84
2.14
33
Its representation in GSMM
…..
<PRODUCERS,root,0,⊥>
sub-element of
sub-element of
…..
<PRODUCER,element,1, ⊥>
sub-element of
<mn-name,element,1, Mercury> sub-element of
<MODEL,element,3, ⊥>
<year,element,2, 1999>
sub-element of
<mo-name,element,1, SableLT>
sub-element of
sub-element of
<rank,element,4, 9>
<front-rating,element,2, 3.84>
sub-element of
<side-rating,element,3, 2.14>
34
APPLICATION OF THE GSMM
METAMODEL
•  THE METAMODEL ALLOWS THE CONSTRUCTION OF A
GENERIC GRAPH WHICH REPRESENTS AN INSTANCE OF A
CONCRETE MODEL
•  PLUS CONSTRAINTS (also
graphs):
represented by means of
–  HIGH LEVEL, OR META- CONSTRAINTS – these dictate the
sintactic rules of the object data model
–  LOW LEVEL CONSTRAINTS – as usual, these dictate the
application domain semantics
•  IT IS A METAMODEL OF THE FIRST KIND
35
ModelGen (Atzeni et al.)
A model management operator:
•  given two data models M1 and M2, and a
schema S1 of M1 (the source schema and
model)
•  generate a schema S2 of M2 (the target
schema and model), corresponding to S1
•  and, for each database D1 over S1, generate an
equivalent database D2 over S2
36
Geographic Data Files (GDF)
•  GDF is a standard, used for describing
roadmaps
–  CITY STREET REGISTER
–  STREET SIGN ARCHIVE
–  TRAFFIC CONTROL SYSTEMS
–  CAR NAVIGATOR SYSTEMS (GPS)
–  …..
•  IT IS A METAMODEL OF THE SECOND
KIND
37
GDF STRUCTURE
•  DATASET OF 82 ASCII CHARACTERS RECORDS
–  ENTITIES
–  ATTRIBUTES
–  RELATIONS
•  11 INTEREST THEMES
– 
– 
– 
– 
– 
– 
– 
ROADS AND FERRIES
BRIDGES AND TUNNELS
RAILROADS
RIVERS
PUBLIC TRANSPORTATION
ADMINISTRATION AREAS
……..
•  3 DESCRIPTION LEVELS FOR EACH THEME
38
GDF
•  LEVEL 0 – TOPOLOGY
–  A PLANAR GRAPH:
•  POINT
•  ARC
•  NODE
•  POLYGON
–  EACH ELEMENT IS UNIQUELY IDENTIFIED
BY AN ID
–  TOPOLOGICAL RULES GOVERN INTERNAL
STRUCTURE AND RELATIONSHIPS AMONG
ELEMENTS
39
GDF LEVELS
•  LEVEL 1 – ELEMENTARY ENTITIES
THIS IS THE BASIS FOR THE CITY STREET
REGISTER
–  ROAD ELEMENT
–  JUNCTION
–  TRAFFIC AREA
•  LEVEL 2 – COMPLEX ENTITIES
THIS IS THE BASIS FOR THE GEOGRAPHIC
INFORMATION SYSTEMS
–  STREET
–  INTERSECTION
40
GDF LEVELS
LEVEL 2
CL203
CS102
CS103
L524
C627
C621
L811
L812
C624 L813 C628 L815 C630
C625 L814 C629
C626
L816 C631
L821
LEVEL 0
C622
L820
C623
L817
S870
LEVEL 1
L819
L322
S869
L818
L525
S868 L321
S867
CL235
CITY MAP
CL234
VIALE ABRUZZI
41
SUMMARY
Letizia Tanca
DB design and integration 42
Bibliography on Data IntegraWon 43