Markup Languages

Yaakov J. Stein
Chief Scientist
RAD Data Communications
Stein Markup 1.1
What do I do?
I digest, edit and produce documents







business letters
email
meeting summaries
proposals
reports
requirement specifications
project plans




web pages
research articles
review articles
books
What do others do?
Pretty much the same
US corporations produce >100 billion documents per year
90% of a modern institution’s information is in documents
>50% of a typical corporation’s effort involves documents
That’s why word processing software
was expected to bring efficiency increases
But it didn’t!
Word processing?
PROs


makes nicer looking documents
expedites document sharing during creation
CONs



typically 30% of effort goes to formatting and reformatting
doesn’t increase information accessibility
doesn’t facilitate information mining
Databases?
The natural alternative to documents is databases
PROs


increase information accessibility
facilitate information mining
CONs


not human readable
format inflexible
The solution
What we really want is to write unconstrained text
but to have information retrieval as well!

Method 1 Automatic text analysis
AI program analyzes text
Recognizes document structure, sentence syntax
Performs gisting, facilitates information mining
A complete solution is equivalent to passing the Turing test

Method 2 Manual markup
Document author responsible for marking
Clarifies document structure
Enables automated retrieval of selected information
Suggests presentation format
Why is text analysis hard?
The man cried FIRE!
The man cried FIRE the gun!
The man cried FIRE the gun maker!
Are MLs computer languages?
There are many different types of computer languages:
procedural languages
for (n=0;n<10;n++)
if (n>5) printf("markup languages are fun!\n");
graphic languages
newpath
0 0 moveto 0 1 lineto 1 1 lineto 1 0 lineto
closepath fill
database languages
SELECT book FROM biblio WHERE subject='DSP' AND author='STEIN';
logical languages
useful(dsp). useful(hardware). fun(dsp). fun(web).
interesting(X) :- useful(X), fun(X).
?- interesting(X).
They are!
Markup languages do not instruct the computer directly,
as procedural languages do;
rather, they instruct it indirectly,
as logical languages do
They do this by using:
elements (tags): <BOOK>, <TITLE>, <AUTHOR>
attributes: SUBJECT="dsp", FORMAT="short"
entities: &standard-disclaimer;
text: This is a great book!
<BOOK SUBJECT="dsp">
<TITLE FORMAT="short">DSP-CSP</TITLE>
<AUTHOR>J. Stein</AUTHOR>
This is a great book!
&standard-disclaimer;
</BOOK>
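The four ingredients can be seen by parsing the sample. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree. The DOCTYPE that declares the standard-disclaimer entity, and the disclaimer wording, are assumptions added so the sample is self-contained; the slide does not say how the entity is defined.

```python
import xml.etree.ElementTree as ET

# Assumption: the entity is declared in an internal DTD subset and its
# replacement text is invented here purely for illustration.
DOC = """<!DOCTYPE BOOK [
  <!ENTITY standard-disclaimer "Opinions expressed here are personal.">
]>
<BOOK SUBJECT="dsp">
  <TITLE FORMAT="short">DSP-CSP</TITLE>
  <AUTHOR>J. Stein</AUTHOR>
  This is a great book!
  &standard-disclaimer;
</BOOK>"""


def describe(doc):
    """Return (list of (tag, attributes), all text with entities expanded)."""
    root = ET.fromstring(doc)
    elements = [(elem.tag, dict(elem.attrib)) for elem in root.iter()]
    # Collapse runs of whitespace so the text reads as one line.
    text = " ".join("".join(root.itertext()).split())
    return elements, text


elements, text = describe(DOC)
for tag, attributes in elements:
    print(tag, attributes)
print(text)
```

The parser hands back elements and attributes as structure, while the entity reference is replaced by its declared text before the program ever sees it.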
Some markup element functions

Structural
– Clarifies document structure
– Delineates document parts

Descriptive (informative)
– Indicates
– Facilitates information retrieval

Presentational (display)
– Presents information in nice format
– Helps human readability

Referential (links, applications)
– Provides hypertext links
– Launches applications
Structural Markup
<HEADING>September 1, 2000</HEADING>
<GREETING>Dear Prof. Stein, </GREETING>
<BODY>
I would like to tell you how much I enjoyed reading your new text
“Digital Signal Processing, A Computer Science Perspective”.
I hope we will be able to meet at the next conference.
</BODY>
<SIGNATURE>
Sincerely,
Dee Espy
</SIGNATURE>
Descriptive Markup
<DATE>September 1, 2000</DATE>
Dear <PERSON>Prof. Stein,</PERSON>
I would like to tell you how much I enjoyed reading your new text
<BOOK>
“Digital Signal Processing, A Computer Science Perspective”.
</BOOK>
I hope we will be able to meet at the next <EVENT>conference.</EVENT>
Sincerely,
<PERSON>Dee Espy</PERSON>
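Descriptive markup is what makes automated retrieval possible. The following is a minimal Python sketch of that idea, using the standard library's xml.etree.ElementTree. The LETTER root element is an assumption (XML needs a single root and the slide shows only the letter body), and find_all is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: a <LETTER> root element wraps the letter shown on the slide.
LETTER = """<LETTER>
<DATE>September 1, 2000</DATE>
Dear <PERSON>Prof. Stein,</PERSON>
I would like to tell you how much I enjoyed reading your new text
<BOOK>
"Digital Signal Processing, A Computer Science Perspective".
</BOOK>
I hope we will be able to meet at the next <EVENT>conference.</EVENT>
Sincerely,
<PERSON>Dee Espy</PERSON>
</LETTER>"""


def find_all(doc, tag):
    """Return the text of every element named `tag`.

    Surrounding whitespace and trailing '.' or ',' are trimmed, because the
    sample places sentence punctuation inside the tags.
    """
    root = ET.fromstring(doc)
    return [(elem.text or "").strip().rstrip(".,") for elem in root.iter(tag)]


print(find_all(LETTER, "PERSON"))
print(find_all(LETTER, "DATE"))
print(find_all(LETTER, "EVENT"))
```

No text analysis is needed: the program asks for PERSON or DATE by name, which is exactly what flat text search cannot do.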
Presentational Markup
<RIGHT-JUSTIFY>September 1, 2000</RIGHT-JUSTIFY>
<BOLD>Dear Prof. Stein,</BOLD>
I would like to tell you how much I enjoyed reading your new text
<UNDERLINE>
“Digital Signal Processing, A Computer Science Perspective”.
</UNDERLINE>
I hope we will be able to meet at the next
<BLINK>conference.</BLINK>
Sincerely,
<IMAGE SRC="deesignature.jpg" ALIGN="left">
<FONT FACE="Times-Roman">Dee Espy</FONT>
Referential Markup
<today xlink:form="simple" href="date" actuate="auto">
Dear Prof. Stein,
I would like to tell you how much I enjoyed reading your new text
<A HREF="www.amazon.com/exec/obidos/ASIN/04712954">
“Digital Signal Processing, A Computer Science Perspective”.
</A>
I hope we will be able to meet at the next
<A HREF="conference">conference.</A>
Sincerely,
<IMAGE SRC="dee-signature.jpg" ALIGN="left">
<A HREF="mailto:[email protected]">Dee Espy</A>
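Because links are marked up explicitly, a program can collect them without understanding the surrounding prose. The following is a minimal Python sketch using the standard library's html.parser; LinkCollector and collect_links are hypothetical names, and the sample text is shortened from the letter above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the HREF target of every <A> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def collect_links(markup):
    """Return all link targets found in `markup`, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


SAMPLE = """I enjoyed reading
<A HREF="www.amazon.com/exec/obidos/ASIN/04712954">your new text</A>.
I hope we can meet at the next <A HREF="conference">conference.</A>"""

print(collect_links(SAMPLE))
```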
Generalized Markup Language
William Tunnicliffe, Stanley Rice [1960s]
(independently) invent idea of structural markup language

Problem: need different ML for each type of document
(letter, report, article, book, etc.)

Charles Goldfarb, Edward Mosher, Raymond Lorie (IBM) [1973]
invent Generalized Markup Language (GML)
Solution: use metalanguage
Document Type Definition (DTD) defines tags
IBM marked up 90% of its documents with GML
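The idea that a document type declares which tags exist and where they may appear can be sketched as follows. This is a toy model in Python, not DTD syntax and not a real validator: the LETTER_TYPE table, the LETTER root and the violations function are all hypothetical, and the element names are borrowed from the structural markup example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document type for letters: each element maps to the set of
# child elements it may contain. A real DTD also constrains order,
# repetition and attributes; this toy model does not.
LETTER_TYPE = {
    "LETTER": {"HEADING", "GREETING", "BODY", "SIGNATURE"},
    "HEADING": set(),
    "GREETING": set(),
    "BODY": set(),
    "SIGNATURE": set(),
}


def violations(doc, doc_type, root_tag):
    """Return messages describing where `doc` departs from `doc_type`."""
    root = ET.fromstring(doc)
    problems = []
    if root.tag != root_tag:
        problems.append(f"root is {root.tag}, expected {root_tag}")
    for parent in root.iter():
        allowed = doc_type.get(parent.tag)
        if allowed is None:
            problems.append(f"undeclared element {parent.tag}")
            continue
        for child in parent:
            if child.tag not in allowed:
                problems.append(f"{child.tag} not allowed inside {parent.tag}")
    return problems


GOOD = "<LETTER><HEADING>September 1, 2000</HEADING><BODY>text</BODY></LETTER>"
BAD = "<LETTER><BODY><BLINK>hi</BLINK></BODY></LETTER>"

print(violations(GOOD, LETTER_TYPE, "LETTER"))
print(violations(BAD, LETTER_TYPE, "LETTER"))
```

Swapping in a different table gives a different document type with the same checking code, which is the sense in which a metalanguage replaces one markup language per kind of document.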
With GML structure is evident
Library
  Novels
  Journals
  Textbooks
    Algebraic zoology
    Botanical history
    Computer poetry
    DSP
      DSP-CSP
      DSP just for fun
    Elementary QED
DSP-CSP
  Title
    Full: Digital Signal Processing: a Computer Science Perspective
    Short: DSPCSP
  Author
    Name: Jonathan (Y) Stein
    Association: RAD Data Comm.
  Publication
    Publisher: John Wiley
    Year: 2000
    Location: New York
    ISBN: 04712954
Standard Generalized Markup Language
Problems with GML:
– No validating parser
– Not portable (between computer systems)
Solution:
SGML
ANSI [1978]
ISO/IEC 8879 [1986] (Intl Org for Standardization / Intl Electrotechnical Commission)
JTC1/SC34/WG1 (WG 1 of SubCommittee 34 of Joint Technical Committee 1)
For presentation:
Document Style Semantics and Specification Language (DSSSL)
SGML - cont.
If SGML is so good why doesn’t anyone use it?

Complexity
– base standard >500 pages
– SGML is a metalanguage
– writing DTD is complex programming
– marked up text is hard to read
– DSSSL adds to complexity
Inflexibility - requires absolute conformity
– assumes only one correct way to mark up
– constrains author to dictated structure
– not good at capturing author’s structure
HyperText Markup Language
CERN (particle physics institute in Switzerland) was an early Internet adopter
– Used extensively for collaboration (articles have long author lists)
– Major problems with format incompatibility
  – only straight ASCII worked reliably
Tim Berners-Lee (computer specialist) defined requirements
– simplicity (couldn’t expect physicists to use SGML)
– freedom (didn’t need validation, let browser ignore bad markup)
– needed hypertext links (including to documents over Internet)
– presentational markup (papers must look nice - authors used to TeX)
Solution: HTML - a specific application of SGML (not a metalanguage)
HTML versions
HTML 1.0 (1989) Berners-Lee original CERN version
hypertext, images, head+body structure, presentational markup
HTML 2.0 (1994) IETF standard - RFC 1866
added lists, forms, etc.
HTML 3.2 (1997) W3C recommendation (incorporates Netscape extensions)
added tables, applets, super/sub-scripts
HTML 4.0 (1997) W3C recommendation (and similar ISO/IEC 15445)
minimizes presentational markup
XHTML 1.0 (2000) present W3C recommendation
reformulates HTML in XML
HTML document structure
<HTML>
<HEAD>
global definitions such as
<TITLE>Web page title</TITLE>
</HEAD>
<BODY>
marked-up text
</BODY>
</HTML>
Some HTML (body) elements









<H1>Level 1 Heading</H1>
<H2>Level 2 Heading</H2>
<H3>Level 3 Heading</H3>
<EM> emphasized </EM>
<P> Paragraph </P>
<A HREF=url>link</A>
<UL>
<LI> item 1 </LI>
<LI> item 2 </LI>
</UL>
<OL>
<LI> item 1 </LI>
<LI> item 2 </LI>
</OL>
<IMG SRC=url>
Rendered result:
Level 1 Heading
Level 2 Heading
Level 3 Heading
emphasized
Paragraph
link
• item 1
• item 2
1. item 1
2. item 2
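How a browser turns list markup into bullets and numbers can be sketched in a few lines. The following Python example, built on the standard library's html.parser, is a toy plain-text renderer and not how any real browser works; TextRenderer and render are hypothetical names, and "*" stands in for the bullet glyph.

```python
from html.parser import HTMLParser


class TextRenderer(HTMLParser):
    """Render a small subset of HTML (headings, UL, OL, LI) to text lines."""

    def __init__(self):
        super().__init__()
        self.lines = []
        # One entry per open list: None for <UL>, a running counter for <OL>.
        self.list_stack = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.list_stack.append(None)
        elif tag == "ol":
            self.list_stack.append(0)
        elif tag == "li" and self.list_stack:
            if self.list_stack[-1] is None:
                self.prefix = "* "
            else:
                self.list_stack[-1] += 1
                self.prefix = f"{self.list_stack[-1]}. "
        # Any other tag is ignored; its text content is still shown.

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.list_stack:
            self.list_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append(self.prefix + text)
            self.prefix = ""


def render(markup):
    """Return the rendered lines for `markup`."""
    parser = TextRenderer()
    parser.feed(markup)
    parser.close()
    return parser.lines


SAMPLE = """<H1>Level 1 Heading</H1>
<UL><LI> item 1 </LI><LI> item 2 </LI></UL>
<OL><LI> item 1 </LI><LI> item 2 </LI></OL>"""

print("\n".join(render(SAMPLE)))
```

The item numbers are never written by the author; the renderer generates them from the OL structure.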
Problems with HTML
Presentational aspects have predominated
<B> bold text </B>
<BLINK> blinking text </BLINK>
<FONT COLOR="red"> red text </FONT>
Practically no descriptive markup
Search engines are reduced to flat text search
Search by topic only through keywords or portals
Not extensible
Can’t add new tags
Unknown tags ignored
Links are relatively simple
Usually user action is required (except IMG)
Only full document (with offset) linkable
Link management is a logistical nightmare
Not everything is HTML
Due to HTML limitations other tools are also used:

Multimedia extensions
– (dynamic) gif, jpg, …
– streaming audio

Common Gateway Interface
– generate HTML on-the-fly
– Perl, C, …



Server Push - Server Pull
JavaScript
Java
eXtensible Markup Language

Simplified (best parts of) SGML (subset of features)

Flexible content management tool

W3C recommendation(s)

Extensible - can add new elements (even without DTD)

Easy to create special purpose languages (with DTD/SCHEMA)

Includes HTML-like hypertext links
– and extensions (XLINK, XPOINTER)

The future of the web!
XML - an Example
<?xml version="1.0" standalone="yes"?>
<bibliography>
<book isbn="04712954">
<title>Digital Signal Processing: a Computer Science Perspective</title>
<author>Jonathan (Y) Stein</author>
<publisher>John Wiley and Sons</publisher>
</book>
<article>
<title>False Alarm Reduction for ASR and OCR</title>
<author>Yaakov Stein</author>
<proceedings>Tenth AICVNN Symposium</proceedings>
<pages>195-200</pages>
</article>
...
</bibliography>
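A bibliography marked up this way can be queried like a small database. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree; titles and isbn_of are hypothetical helper names, and the sample omits the elided entries.

```python
import xml.etree.ElementTree as ET

BIBLIOGRAPHY = """<?xml version="1.0" standalone="yes"?>
<bibliography>
  <book isbn="04712954">
    <title>Digital Signal Processing: a Computer Science Perspective</title>
    <author>Jonathan (Y) Stein</author>
    <publisher>John Wiley and Sons</publisher>
  </book>
  <article>
    <title>False Alarm Reduction for ASR and OCR</title>
    <author>Yaakov Stein</author>
    <proceedings>Tenth AICVNN Symposium</proceedings>
    <pages>195-200</pages>
  </article>
</bibliography>"""


def titles(doc, kind):
    """Return the titles of all entries of the given kind ('book' or 'article')."""
    root = ET.fromstring(doc)
    return [entry.findtext("title") for entry in root.findall(kind)]


def isbn_of(doc, title_fragment):
    """Return the isbn attribute of the first book whose title contains the fragment."""
    root = ET.fromstring(doc)
    for book in root.findall("book"):
        if title_fragment in book.findtext("title", default=""):
            return book.get("isbn")
    return None


print(titles(BIBLIOGRAPHY, "article"))
print(isbn_of(BIBLIOGRAPHY, "Signal Processing"))
```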
What can we do with an XML file?






Check if well-formed
Check if valid (against DTD or schema)
Display “as-is” in browser
Parse in special-purpose program (SAX, DOM)
Process (XSL) to XML, HTML, etc.
Display after processing
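Two of these steps can be sketched with the Python standard library: a well-formedness check (any parse error means the file is not well-formed) and event-driven parsing in the SAX style. The function and class names are hypothetical. Validation against a DTD or schema and XSL processing are not shown, since they need tools beyond the standard library.

```python
import xml.sax
import xml.etree.ElementTree as ET


def is_well_formed(doc):
    """Return True if `doc` parses as XML, False otherwise."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


class ElementCounter(xml.sax.ContentHandler):
    """SAX handler that counts how often each element name starts."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(doc):
    """Stream through `doc` and return a dict of element name -> count."""
    handler = ElementCounter()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.counts


SAMPLE = (
    "<bibliography>"
    "<book><title>A</title></book>"
    "<article><title>B</title></article>"
    "</bibliography>"
)

print(is_well_formed('<book isbn="04712954"></book>'))
print(is_well_formed("<book isbn=04712954></book>"))
print(count_elements(SAMPLE))
```

The second check fails because XML, unlike HTML, requires attribute values to be quoted.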
Wireless Markup Language
Markup language element of Wireless Application Protocol
WAP Forum (1997)
– Ericsson, Motorola, Nokia, Unwired Planet (phone.com)
– bring Internet to cellular phone users
– re-use fundamental Internet concepts (TCP/IP, HTTP, HTML, JavaScript)
but adapted to lower bandwidth
smaller screen
limited input facilities
limited computational resources
– applications scale across transport options (GSM, TDMA, CDMA, 3G)
and device types (mobile phones, personal assistants)
WML Philosophy
Defined using XML
Transported in compressed binary (for bandwidth reduction)
Applications are modeled as decks of cards
Features:
Actions (OK, navigation, help) can be performed
Hyperlinks (like in HTML)
String variables
Timers
WBMP images (black and white)
Select boxes, forms (for input)
WMLScript (like JavaScript)
WML structure
<?xml version="1.0"?>
<!DOCTYPE wml …>
<wml>
<card>
<p>
text
</p>
<p>
text
</p>
</card>
<card>
...
</card>
</wml>
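Since WML is defined using XML, a deck can be read with any XML parser. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree. The card ids and paragraph texts are invented for illustration, the DOCTYPE line is left out of the sample because its contents are elided above, and cards is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: card ids and texts are illustrative; the DOCTYPE is omitted.
DECK = """<?xml version="1.0"?>
<wml>
  <card id="first">
    <p>Hello</p>
    <p>World</p>
  </card>
  <card id="second">
    <p>Goodbye</p>
  </card>
</wml>"""


def cards(deck):
    """Return a dict mapping each card id to the list of its paragraph texts."""
    root = ET.fromstring(deck)
    return {
        card.get("id"): [p.text for p in card.findall("p")]
        for card in root.findall("card")
    }


for card_id, paragraphs in cards(DECK).items():
    print(card_id, paragraphs)
```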
Some WML elements











<p> </p>                                 text
<a href=...> </a>                        hyperlink (anchor)
<do> </do>                               action
<go href=.../>                           goto wml page
<timer>                                  trigger event (units = tenths of a second)
<input/>                                 input user text
<prev/>                                  return to previous page
$(…)                                     value of variable
<img src=… />                            display image
<postfield name=… value=…/>              set variable
<select> <option> <option> </select>     select box
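The $(…) notation can be illustrated with a small substitution routine. The following Python sketch handles only the plain $(name) form; other WML variable forms are not covered, and replacing an unset variable with an empty string is a choice made for this sketch. The name substitute is hypothetical.

```python
import re

# Matches $(name) where name looks like an identifier.
VARIABLE = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def substitute(text, variables):
    """Replace each $(name) in `text` with its value from `variables`.

    Variables that are not set are replaced by an empty string.
    """
    return VARIABLE.sub(lambda match: variables.get(match.group(1), ""), text)


print(substitute("Hello $(user), you have $(count) messages",
                 {"user": "Dee", "count": "3"}))
```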
Some more markup languages
















VML = Vector (graphics) Markup Language
VoiceXML
SSML = Speech Synthesis Markup Language
CPML = Call Policy Markup Language
DSML = Directory Services Markup Language
MathML = Mathematical Markup Language
CML = Chemical Markup Language
AML = Astronomical Markup Language
LegalXML
BSML = Bioinformatic Sequence Markup Language
GedML = Genealogical Data Markup Language
FinXML = Financial market Markup Language
ChessML
SDML = Signed Document Markup Language
RELML = Real Estate Listing Markup Language
etc. etc. etc. ...
Examples

HTML
– html examples

XML
– xml-file xsl-file xml

VML
– vml-file

WML (get M3gate emulator)
– wml examples