Language Diversity on the Internet

Language and the Internet
Assessing Linguistic Bias
Measuring the Information Society
WSIS, Tunis, November 15, 2005
John C. Paolillo, Indiana University
Overview
• Sources of Linguistic Bias
• Linguistic Bias: examples
– Text Communication
– Internet Host Names
– Web Programming
• Global Linguistic Diversity
– Who bears the costs?
• Conclusions
Sources of Linguistic Bias
(Friedman and Nissenbaum 1997)
• Pre-existing
– originate from outside the technical system
• National, trans-national and institutional policies
• Technology companies
• Technical
– are built into the technical system itself
• Developers’ language backgrounds, national origins
• Legacy standards, “backward” compatibility
• Emergent
– arise in specific contexts of use of a technical system
• Economics of technology industry (marketing, monopoly power,
unstable markets, etc.)
• Rapid technologization
Text Communication
• Requires an encoding and its support
– Assign code numbers to script characters
• ASCII (American English)
• ISO-8859-1 (European Languages)
• Unicode (most languages, but support is uneven)
– Support means many things
• Fonts, rendering, sorting, spell-checking etc.
• Computer-Mediated Communication
– Web pages, Email, chat, etc.
– Language use is not uniform in these modes
• Multilinguals tend to favor different languages for specific
purposes
• Represents both technical and emergent biases
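As a rough illustration of the encoding point above, the following sketch checks which of the three encodings named on this slide can represent a sample word from several languages. The sample words and the helper names (encodable, coverage) are assumptions made for illustration and are not taken from the talk. Encodability is only the first layer of support; fonts, rendering, sorting and spell-checking are separate matters.

```python
# Sample words are illustrative assumptions, not data from the talk.
SAMPLES = {
    "English": "language",
    "French": "français",
    "Russian": "язык",
    "Hindi": "भाषा",
}

# Python codec names for the three encodings listed on the slide.
ENCODINGS = ["ascii", "iso-8859-1", "utf-8"]


def encodable(text, encoding):
    """Return True if every character of text has a code in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def coverage(samples=SAMPLES, encodings=ENCODINGS):
    """Map each language to a dict of encoding name -> encodable or not."""
    return {
        lang: {enc: encodable(word, enc) for enc in encodings}
        for lang, word in samples.items()
    }


if __name__ == "__main__":
    for lang, row in coverage().items():
        print(lang, row)
```

Under these assumptions, only the English sample fits in ASCII, the French sample additionally fits in ISO-8859-1, and the Russian and Hindi samples need a Unicode encoding such as UTF-8.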
Unicode Status: Examples
(rows are grouped on the slide by level of support: good, poor, none)

Language        Unicode  Browser      Script    Pop.
Chinese         yes      good         Chinese   1,240M
English         yes      good         Roman     400M
French          yes      good         Roman     81M
German          yes      good         Roman     82M
Spanish         yes      good         Roman     358M
Finnish         yes      good         Roman     5M
Russian         yes      good (late)  Cyrillic  132M
Arabic          yes      good (late)  Arabic    247M
Hindi           yes      poor         Indic     213M
Sinhala         yes      none         Indic     15M
S. Azerbaijani  no       none         Arabic    26M
Internet Host Names
• The Domain Name System
– Uses a 30-year-old 7-bit ASCII standard
• Now supports Punycode (an ASCII-compatible encoding of Unicode names)
• Imposes a maximum name length
– Run by ICANN under US Dept of Commerce
contract
• More concerned with trademark protection
• Host/domain naming is widely abused (e.g. the .tv domain)
• Names provided by the DNS are not that useful
• An example of emergent bias
– Technical origin
– Economic and political forces amplify and sustain it
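To make the ASCII constraint concrete, here is a minimal sketch using the "idna" codec from Python's standard library, which implements the older IDNA 2003 rules on top of Punycode. The host name "bücher.example" and the helper names are illustrative assumptions; the 63-octet label limit is the one defined for the DNS in RFC 1035.

```python
# Per-label length limit in the DNS (RFC 1035).
MAX_LABEL_OCTETS = 63


def to_ascii_hostname(name):
    """Convert a Unicode host name to its ASCII form.

    Uses Python's built-in 'idna' codec (IDNA 2003), which applies
    Punycode to each non-ASCII label and adds the 'xn--' prefix.
    """
    return name.encode("idna").decode("ascii")


def label_lengths(name):
    """Length of each dot-separated label after conversion to ASCII."""
    return [len(label) for label in to_ascii_hostname(name).split(".")]


if __name__ == "__main__":
    host = "bücher.example"  # hypothetical host name
    print(to_ascii_hostname(host))
    print(label_lengths(host), "limit per label:", MAX_LABEL_OCTETS)
```

In this sketch the six-letter label "bücher" becomes a 13-character ASCII label, so names written in non-ASCII scripts reach the length limit sooner than plain ASCII names do.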
Web Programming and Unicode
• Markup & web scripting languages
– Unicode is standard
– Browser support, fonts, etc. lag behind
– Databases and development environments tend to
lack proper Unicode support
– End-user oriented, not programmer oriented
• All of the most important technologies are Open-Source
software (FLOSS)
– User extensible/modifiable
– Language localization of these is possible but rare
Linguistic Bias in Web
Programming
• English is the source language for most
programming & markup languages
– Keywords
– Operator-argument order
– Programming constructs, etc.
• Programming as a linguistic act
– Complex concepts are rendered into text
– Different languages have different ways of doing this
• Emergent language biases
Linguistic Properties of
Programming
• LISP
– Predicates precede their arguments
• Like Arabic, Celtic, Hebrew, etc.
(defun fact (x)(if (<= x 0) 1 (* x (fact (- x 1)))))
• Postscript
– Predicates follow their arguments
• Like Farsi, Hindi, Japanese, Tamil, Turkish, etc.
/factorial { dup 1 gt { dup 1 sub factorial mul } if } def
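The contrast between the two orders can also be sketched outside LISP and Postscript. The following Python example evaluates the same arithmetic expression, x * (x - 1) with x = 5, written once in predicate-first (prefix) order and once in predicate-last (postfix) order. The token lists and function names are assumptions made for illustration, and only binary operators are handled.

```python
import operator

# Binary operators supported by this sketch.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_postfix(tokens):
    """Evaluate tokens in which each operator follows its two arguments."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(tok)
    return stack.pop()


def eval_prefix(tokens):
    """Evaluate tokens in which each operator precedes its two arguments."""
    stack = []
    # Scanning right to left lets a stack handle prefix order as well.
    for tok in reversed(tokens):
        if tok in OPS:
            left = stack.pop()
            right = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(tok)
    return stack.pop()


if __name__ == "__main__":
    print(eval_prefix(["*", 5, "-", 5, 1]))   # LISP-like order
    print(eval_postfix([5, 5, 1, "-", "*"]))  # Postscript-like order
```

Both calls compute the same value; only the position of the operator relative to its arguments differs.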
The Linguistic Digital Divide
• Language issues go beyond content
– WSIS repeatedly re-affirms principles of
• Transparency
• Self-determination
• Open access to participation for all parties
These principles cannot be guaranteed unless speakers of
different languages can manipulate all aspects of IT use in a way
that is native-like
• The linguistic divide has broader consequences
– Costs are borne in
• Education: large for non-English-speaking people
• Technical development: small, in comparison
(there is a trade-off)
Language Diversity
Who bears the costs?
[Figure: Distribution of language groups by size. x-axis: Population (in thousands); y-axis: Number of groups. Source data: www.ethnologue.com]
A typical language group has around 10-50 thousand people
80% of language groups have fewer than 100 thousand members
[Figure: Cumulative proportion of world's population. x-axis: Language group population (millions); y-axis: Cumulative proportion. Source data: www.ethnologue.com]
90% of the world’s population belongs to a language
group with at least 1 million people (416 groups)
Many languages with hundreds of millions of speakers
lack adequate support
[Figure: Worldwide Linguistic Diversity by Region. Regions: N America, USA, E Asia, W Asia, SC Asia, S America, Africa, Europe, SE Asia, Oceania]
[Figure: Per-Country Linguistic Diversity by Region. Same regions. Source data: www.ethnologue.com]
Conclusions
• Linguistic Bias is manifest in many ways
– Technical biases are sometimes overt
– Emergent biases can be subtle
• All potential sources of bias need to be
examined and questioned if we are to uphold
principles affirmed by WSIS
• Without this effort, the linguistic digital divide
will simply amplify existing disparities in
wealth and power
Language Diversity
On The Internet
[Figure: Estimated populations of Internet users (millions), 1996-2005, by language: English, Chinese, Japanese, Spanish, German, Korean, French, Italian, Portuguese, Scandinavian, Dutch, Other. Source: Global Reach]
Linguistic Diversity
Based on Entropy:
Diversity = $-2 \sum_i p_i \ln p_i$, where $p_i$ is the proportion of individuals in language category $i$
Diversity is the long-run per-individual
average variance in language category
(similar to log-likelihood)
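A minimal sketch of the index defined above, assuming the input is a list of category proportions that sum to 1. The function name is an assumption; categories with zero proportion are skipped, following the usual entropy convention that 0 ln 0 is taken as 0.

```python
import math


def diversity(proportions):
    """Entropy-based diversity index: -2 * sum(p_i * ln(p_i)).

    proportions: share of individuals in each language category,
    expected to sum to 1. Zero shares contribute nothing.
    """
    return -2 * sum(p * math.log(p) for p in proportions if p > 0)


if __name__ == "__main__":
    print(diversity([1.0]))        # a single language category
    print(diversity([0.5, 0.5]))   # two equally sized categories
    print(diversity([0.25] * 4))   # four equally sized categories
```

With n equally sized categories the index equals 2 ln n, so it is zero when everyone falls into one category and grows as the population spreads more evenly over more categories.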
[Figure: Linguistic Diversity of Internet Users, 1996-2005. y-axis: Diversity Index; series: minimum, maximum]
Languages on the Web (O’Neill, Lavoie and Bennett, 2003)

Language    Share
English     72%
German      7%
French      3%
Japanese    3%
Spanish     3%
Chinese     2%
Italian     2%
Portuguese  2%
Dutch       1%
Finnish     1%
Russian     1%
Swedish     1%
other       2%
[Figure: Internet hosts (millions), 1995-01 to 2003-01. Source: www.isc.org/ds]
[Figure: Host growth by region (millions), 1995-2002. Regions: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa, Other. Source: www.isc.org/ds]
[Figure: Users per host by region, 1998-2001. Source: www.isc.org/ds, ITU]
[Figure: Proportion of random .com hosts, by country: United States, Canada, Netherlands, Australia, Unknown, United Kingdom, Hong Kong, Israel]
[Figure: Proportion of random .net hosts, by country: United States, Australia, Netherlands, Unknown, Canada, Germany, Japan]
[Figure: User growth by region (millions), 1998-2001. Source: ITU]
[Figure: Proportion of Internet hosts by region, 1995-2002. Source: www.isc.org/ds]
[Figure: Host growth by region (millions), 1995-2002, with the vertical axis limited to 1 million hosts. Source: www.isc.org/ds]
[Figure: Internet hosts per thousand inhabitants by region, 1995-2002. Source: www.isc.org/ds, UNPD]
[Figure: Users per thousand inhabitants by region, 1998-2001. Source: ITU, UNPD]
[Figure: Internet users per host by region, 1998-2001. Source: www.isc.org/ds, ITU]
1067 Random Hosts (all domains)