Dynamic Glyph Generation

Based on a variable-length encoding scheme
Yap Cheah Shen
eForth Technology.
Glyph & Typesetting Workshop
Kyoto, 29 Nov 2003
Outline of Presentation

Morpheme: Latin vs. Han
Latin text encoding
Missing characters in Chinese text
Solution
Implementation details
• Glyph decomposition database
• Topological conversion of strokes
• Automatic frame calculation
Integrating into an existing OS
Other issues
Morpheme: Latin vs. Han

A morpheme is the smallest meaningful unit in a language.
For Latin text, it is the word.
For Chinese text, it is the Hanzi (Kanji).
Because morphemes represent real-world ideas, they keep changing over time.
Morphemes form an open set.
Latin Text Encoding

The letters of the alphabet form a fixed set of symbols.
All words can be represented as sequences of letters.
Letters are therefore the ideal encoding units for Latin text, e.g., ASCII.
There is no “missing word” encoding problem.
Missing Characters in Chinese Text

Not all existing Hanzi are encoded.
Hanzi form an open set: theoretically, historically, and practically.
Existing encoding schemes rest on wrong assumptions and designs.
The result is an unending loop: assign code points, update the OS, ship a new font, ship a new input method table.
The industry is happy; users suffer.
Solution-1

Use parts (components) as the encoding unit.
日月金木水火土人心手口女艹疒犭
Most characters can be represented by a finite set of basic parts.
Strokes are used to construct rarely used parts (thousands of parts appear only once or twice).
Solution-2

A closed set of basic parts and strokes serves as the encoding unit.
3 joining operators: horizontal, vertical, and enclosing.
1 shielding operator: hides a stroke.
Prefix notation: allows recursive composition.
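
A minimal sketch of how prefix notation permits recursive composition. It assumes the three joining operators are written with the Unicode ideographic description characters ⿰ (horizontal), ⿱ (vertical), and ⿴ (enclosing). The shielding operator is left out, and the function name `parse` and the tuple-based tree are illustrative choices, not part of the system described here.

```python
# Sketch only: operator symbols are assumed to be the Unicode ideographic
# description characters; the nested-tuple tree is a hypothetical format.
JOIN_OPERATORS = {"⿰": "horizontal", "⿱": "vertical", "⿴": "enclosing"}


def parse(expr):
    """Parse a prefix glyph expression into nested (operator, first, second) tuples."""

    def walk(pos):
        if pos >= len(expr):
            raise ValueError("expression ended before all operands were read")
        symbol = expr[pos]
        if symbol in JOIN_OPERATORS:
            # A joining operator is followed by exactly two operands,
            # each of which may itself be a full expression.
            first, pos = walk(pos + 1)
            second, pos = walk(pos)
            return (JOIN_OPERATORS[symbol], first, second), pos
        # Anything else is treated as a basic part or stroke.
        return symbol, pos + 1

    tree, end = walk(0)
    if end != len(expr):
        raise ValueError("trailing symbols after a complete expression")
    return tree
```

For instance, `parse("⿱⿰木目心")` yields a vertical join whose first operand is itself a horizontal join of 木 and 目, with 心 as the second operand.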
Solution-3

An ordinary CJK fixed-length encoding scheme uses a numeric value as the character code.
• Input method table
  - Converts input keystrokes to character codes.
• Static font file
  - Glyph data is pre-designed.
  - Glyph data is accessed by character code.
• Text file
  - A sequence of character codes.
Solution-4

Additional features of a variable-length encoding CJK environment:
• Input
  - Characters can be sorted and filtered by parts.
  - Compatible with any existing input method.
• Display
  - The font file stores commonly used characters and parts.
  - Glyphs are generated on the fly from a glyph descriptive sequence.
• Storage and data exchange
  - Compatible with Unicode.
  - Ideographic description sequences.
Dynamic Glyph Generator

Input:
• Various types of variable-length descriptive character code sequences
  - 構字式 (character construction expressions) of Academia Sinica
  - 組字式 (character composition expressions) of CBETA
  - Unicode ideographic description characters
Output: display & print
• TrueType-compatible outline
• Rasterized bitmap
• Macromedia Flash, SVG

The task: a layout problem, fitting a 1-dimensional sequence into a 2-dimensional square.
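
The layout task can be sketched as a recursive subdivision of the unit square. This is only an illustration under stated assumptions: operators are the Unicode ideographic description characters, every join splits its box into equal halves, the enclosed part gets a centred box half the size of its frame, and y grows downward. The function name `layout` and the box format `(x, y, width, height)` are hypothetical.

```python
def layout(expr, box=(0.0, 0.0, 1.0, 1.0)):
    """Return [(part, (x, y, w, h)), ...] for a prefix glyph expression.

    Sketch only: equal splits and the inset used for enclosure are
    placeholder assumptions, not the proportions a real generator would use.
    """
    placed = []

    def place(pos, box):
        if pos >= len(expr):
            raise ValueError("expression ended before all operands were read")
        x, y, w, h = box
        symbol = expr[pos]
        if symbol == "⿰":    # horizontal: left half, then right half
            first, second = (x, y, w / 2, h), (x + w / 2, y, w / 2, h)
        elif symbol == "⿱":  # vertical: top half, then bottom half
            first, second = (x, y, w, h / 2), (x, y + h / 2, w, h / 2)
        elif symbol == "⿴":  # enclosing: outer keeps the box, inner is inset
            first, second = box, (x + w / 4, y + h / 4, w / 2, h / 2)
        else:                 # a basic part occupies the whole box
            placed.append((symbol, box))
            return pos + 1
        pos = place(pos + 1, first)
        return place(pos, second)

    if place(0, box) != len(expr):
        raise ValueError("trailing symbols after a complete expression")
    return placed
```

Calling `layout("⿱⿰木目心")` places 木 and 目 side by side in the top half of the square and 心 across the bottom half.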
Implementation-1

The system consists of 3 major parts:
Glyph decomposition database
• Courtesy of Prof. Hsieh of Academia Sinica, Taiwan. http://www.sinica.edu.tw/~cdp/
Outlines of strokes and components
• Beijing ZhongYi Co., a professional outline font vendor. http://www.zhongyicts.com.cn/
The eForth system: putting everything together through hardware-software co-engineering.
Implementation-2

Glyph decomposition database
• Covers all CJK glyphs defined by Unicode 4.0, 71,000+ in total.
• 549 basic parts; stroke sequences are preserved.
• 3,996 parts in total.
• Total part frequency: 165,122
• Accumulated frequency:
  - Top 50: 51,389 = 31%
  - Top 200: 87,381 = 53%
  - Top 1000: 129,393 = 78%
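
The accumulated-frequency percentages follow directly from the counts on this slide. The short check below recomputes them; the variable and function names are illustrative only.

```python
# Figures taken from the slide: total part occurrences and the
# accumulated occurrences of the N most frequent parts.
TOTAL_PART_FREQUENCY = 165122
ACCUMULATED = {50: 51389, 200: 87381, 1000: 129393}


def coverage(top_n):
    """Fraction of all part occurrences covered by the top_n most frequent parts."""
    return ACCUMULATED[top_n] / TOTAL_PART_FREQUENCY


for top_n in ACCUMULATED:
    print(f"Top {top_n}: {coverage(top_n):.0%}")
```

Read this way, about a quarter of the 3,996 parts (the top 1000) accounts for 78% of all part occurrences.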
Implementation-3

Strokes are described as an outline with a skeletal line.
Both the outline and the skeletal line are quadratic Bézier curves.
Outline points are recalculated according to the scaled skeletal line.
Result:
• Stroke data is highly reusable
• Stroke weights are adjustable
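
One way to picture the recalculation, as a sketch under assumptions rather than the actual method: store each outline control point as an offset from a matching skeleton control point, scale only the skeleton, then re-attach the offsets multiplied by a weight factor. The names `quad_bezier`, `rebuild_outline`, and the offset representation are hypothetical.

```python
def quad_bezier(p0, p1, p2, t):
    """Point at parameter t (0..1) on the quadratic Bezier curve p0, p1, p2."""
    u = 1.0 - t
    x = u * u * p0[0] + 2 * u * t * p1[0] + t * t * p2[0]
    y = u * u * p0[1] + 2 * u * t * p1[1] + t * t * p2[1]
    return (x, y)


def rebuild_outline(skeleton, offsets, sx, sy, weight=1.0):
    """Recompute outline control points from a scaled skeleton.

    skeleton: skeleton control points [(x, y), ...]
    offsets:  assumed per-point vectors from skeleton to outline [(dx, dy), ...]
    sx, sy:   scale factors applied to the skeleton only
    weight:   multiplier for the offsets, i.e. the stroke thickness
    """
    scaled = [(x * sx, y * sy) for x, y in skeleton]
    return [
        (x + weight * dx, y + weight * dy)
        for (x, y), (dx, dy) in zip(scaled, offsets)
    ]
```

Because the offsets are not scaled with the skeleton, a stroke stretched to fit a wider frame keeps its thickness, and changing `weight` alone makes the same stroke data bolder or lighter.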
Implementation-4

Automatic frame calculation
• An algorithm estimates the complexity of each part to decide that part's proportion in the resulting glyph.
• 漁: 氵 25%, 魚 70%, roughly.
• 觀: 雚 55%, 見 40%, roughly.

Result:
• Clear glyph descriptive expressions
• Search-engine friendly
• Human readable
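
The slides do not give the complexity estimate used for frame calculation. The sketch below is a stand-in heuristic, not that algorithm: it assumes stroke count as a proxy for complexity and dampens it with an exponent so that simple parts are not squeezed too thin. The function name `split_ratio` and the default exponent are hypothetical, and the output is not meant to reproduce the rough percentages quoted above.

```python
def split_ratio(stroke_counts, exponent=0.5):
    """Share of the frame given to each part, from its stroke count.

    Sketch only: stroke count stands in for complexity, and exponent < 1
    keeps low-stroke parts from becoming too narrow.
    """
    weights = [count ** exponent for count in stroke_counts]
    total = sum(weights)
    return [weight / total for weight in weights]


# Example: 氵 has 3 strokes and 魚 has 11, so 魚 receives the larger share.
narrow, wide = split_ratio([3, 11])
```

A real implementation would presumably also reserve a gap between the parts, which may be why the quoted shares sum to slightly less than 100%.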
Integrating into existing OS/GUI

String manipulation library
• Number of characters
  - Count -1 for each operator and +1 for each character.
• Character width
Graphics subsystem
• Drawing a text line (e.g. ExtTextOut)
Text handling widgets
• Awareness of glyph expressions for caret, selection, and delete/backspace.
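
The counting rule works because each joining operator merges two operands into one glyph, so every operator reduces the visible character count by one. A minimal sketch, assuming the operators are the Unicode ideographic description characters ⿰, ⿱, and ⿴; the shielding operator is not handled, and the name `logical_length` is illustrative.

```python
# Assumed operator set: the three binary joining operators only.
JOIN_OPERATORS = frozenset("⿰⿱⿴")


def logical_length(text):
    """Number of displayed characters in text that may contain glyph expressions."""
    # +1 for every ordinary character or part, -1 for every joining operator.
    return sum(-1 if ch in JOIN_OPERATORS else 1 for ch in text)
```

For example, `"⿱⿰木目心"` holds five code points but counts as one character (3 parts minus 2 operators), which is the value a caret or selection routine would need.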
Other Issues

Quality of the glyph
• Trade-off with space: more part outlines give better quality.
Speed of generation
• Not a problem on an IBM PC, since glyph generation is rare.
• For handheld devices, hardware acceleration is recommended.
Examples

⿱  Vertical combination
⿰  Horizontal combination
⿴  Enclosing
-  Hide

盟 = ⿱明皿 or ⿱⿰日月皿
李世民: 民-5 (hide the 5th stroke)
玄燁: 玄-5
丘-4 = U+20009
Thank You