Language and the Internet
Assessing Linguistic Bias
Measuring the Information Society
WSIS, Tunis, November 15, 2005
John C. Paolillo, Indiana University

Overview
• Sources of Linguistic Bias
• Linguistic Bias: examples
  – Text Communication
  – Internet Host Names
  – Web Programming
• Global Linguistic Diversity
  – Who bears the costs?
• Conclusions

Sources of Linguistic Bias (Friedman and Nissenbaum 1997)
• Pre-existing
  – originate from outside the technical system
    • National, trans-national and institutional policies
    • Technology companies
• Technical
  – are built into the technical system itself
    • Developers’ language backgrounds, national origins
    • Legacy standards, “backward” compatibility
• Emergent
  – arise in specific contexts of use of a technical system
    • Economics of technology industry (marketing, monopoly power, unstable markets, etc.)
    • Rapid technologization

Text Communication
• Requires an encoding and its support
  – Assign code numbers to script characters
    • ASCII (American English)
    • ISO-8859-1 (Western European languages)
    • Unicode (most languages, but support is uneven)
  – Support means many things
    • Fonts, rendering, sorting, spell-checking, etc.
• Computer-Mediated Communication
  – Web pages, email, chat, etc.
  – Language use is not uniform in these modes
    • Multilinguals tend to favor different languages for specific purposes
• Represents both technical and emergent biases

Unicode Status: Examples
Legend: Good support / Poor support / No support

Language        Unicode  Browser      Script    Pop.
Chinese         yes      good         Chinese   1,240M
English         yes      good         Roman     400M
French          yes      good         Roman     81M
German          yes      good         Roman     82M
Spanish         yes      good         Roman     358M
Finnish         yes      good         Roman     5M
Russian         yes      good (late)  Cyrillic  132M
Arabic          yes      good (late)  Arabic    247M
Hindi           yes      poor         Indic     213M
Sinhala         yes      none         Indic     15M
S. Azerbaijani  no       none         Arabic    26M

Internet Host Names
• The Domain Name System
  – Uses a 30-year-old 7-bit ASCII standard
    • Now supports Punycode (an ASCII-compatible encoding of Unicode names)
    • Imposes a maximum name length
  – Run by ICANN under US Dept of Commerce contract
    • More concerned with trademark protection
• Host/domain naming is widely abused (e.g. the .tv domain)
• Names provided by the DNS are not that useful
• An example of emergent bias
  – Technical origin
  – Economic and political forces amplify and sustain it

Web Programming and Unicode
• Markup & web scripting languages
  – Unicode is standard
  – Browser support, fonts, etc. lag behind
  – Databases and development environments tend to lack proper Unicode support
  – End-user oriented, not programmer oriented
• All of the most important technologies are Open-Source software (FLOSS)
  – User extensible/modifiable
  – Language localization of these is possible but rare

Linguistic Bias in Web Programming
• English is the source language for most programming & markup languages
  – Keywords
  – Operator-argument order
  – Programming constructs, etc.
• Programming as a linguistic act
  – Complex concepts are rendered into text
  – Different languages have different ways of doing this
• Emergent language biases

Linguistic Properties of Programming
• LISP
  – Predicates precede their arguments
    • Like Arabic, Celtic, Hebrew, etc.
  (defun fact (x) (if (<= x 0) 1 (* x (fact (- x 1)))))
• Postscript
  – Predicates follow their arguments
    • Like Farsi, Hindi, Japanese, Tamil, Turkish, etc.
  /factorial { dup 1 gt { dup 1 sub factorial mul } if } def

The Linguistic Digital Divide
• Language issues go beyond content
  – WSIS repeatedly re-affirms principles of
    • Transparency
    • Self-determination
    • Open access to participation for all parties
  These principles cannot be guaranteed unless speakers of different languages can manipulate all aspects of IT use in a way that is native-like
• The linguistic divide has broader consequences
  – Costs are borne in
    • Education: great for non-English-speaking people
    • Technical development: small in comparison (there is a trade-off)

Language Diversity
Who bears the costs?

[Figure: Distribution of language groups by size. Vertical axis: number of groups (0 to 1400). Horizontal axis: population in thousands, logarithmic scale (0.0001 to 1,000,000). Source data: www.ethnologue.com]
A typical language group has around 10-50 thousand people
80% of language groups have fewer than 100 thousand members

[Figure: Cumulative proportion of world's population. Vertical axis: cumulative proportion (0 to 1). Horizontal axis: language group population in millions, logarithmic scale (1000 down to 0.0001). Source data: www.ethnologue.com]
90% of the world’s population belongs to a language group with at least 1 million people (416 groups)
Many languages with hundreds of millions of speakers lack adequate support

[Figure: Worldwide Linguistic Diversity by Region. Regions: N America, USA, E Asia, W Asia, SC Asia, S America, Africa, Europe, SE Asia, Oceania]
[Figure: Per-Country Linguistic Diversity by Region. Regions: N America, USA, Africa, E Asia, W Asia, Oceania, SC Asia, SE Asia, Europe, S America. Source data: www.ethnologue.com]

Conclusions
• Linguistic Bias is manifest in many ways
  – Technical biases are sometimes overt
  – Emergent biases can be subtle
• All potential sources of bias need to be examined and questioned if we are to uphold principles affirmed by WSIS
• Without this effort, the linguistic digital divide will simply amplify existing disparities in wealth and power

Language Diversity On The Internet
Estimated populations of Internet users
(millions), 1996 to 2005, logarithmic scale (1 to 1000). Series: English, Chinese, Japanese, Spanish, German, Korean, French, Italian, Portuguese, Scandinavian, Dutch, Other. Source: Global Reach.

Linguistic Diversity
Based on Entropy: $\text{Diversity} = -2 \sum_i p_i \ln p_i$
Diversity is the long-run per-individual average variance in language category (similar to log-likelihood)

[Figure: Linguistic Diversity of Internet Users. Vertical axis: Diversity Index (0 to 7). Horizontal axis: 1996 to 2005. Series: minimum, maximum]

Languages on the Web (O’Neill, Lavoie and Bennett, 2003)
English     72%
German      7%
French      3%
Japanese    3%
Spanish     3%
Chinese     2%
Italian     2%
Portuguese  2%
Dutch       1%
Russian     1%
Finnish     1%
Swedish     1%
other       2%

Regions shown in the by-region charts that follow: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa (the host charts also include Other).

[Figure: Internet Hosts. Vertical axis: millions (0 to 200). Horizontal axis: 1995-01 to 2003-01. Source: www.isc.org/ds]
[Figure: Host growth by region (millions). Logarithmic scale (0.001 to 100), 1995 to 2002. Source: www.isc.org/ds]
[Figure: Users per host by region. Logarithmic scale (0.1 to 100), 1998 to 2001. Sources: www.isc.org/ds, ITU]
[Figure: Proportion of random .com hosts. Categories: United States, Canada, Netherlands, Australia, Unknown, United Kingdom, Hongkong, Israel]
[Figure: Proportion of random .net hosts. Categories: United States, Australia, Netherlands, Unknown, Canada, Germany, Japan]
[Figure: User growth by region (millions). Logarithmic scale (1 to 1000), 1998 to 2001. Source: ITU]
[Figure: Proportion of Internet hosts by region. Vertical axis: 0 to 0.6, 1995 to 2002. Source: www.isc.org/ds]
[Figure: Host growth by region (millions). Logarithmic scale (0.001 to 1), 1995 to 2002. Source: www.isc.org/ds]
[Figure: Internet hosts per thousand inhabitants by region. Logarithmic scale (0.001 to 1000), 1995 to 2002. Sources: www.isc.org/ds, UNPD]
[Figure: Users per thousand inhabitants by region. Logarithmic scale (1 to 1000), 1998 to 2001. Sources: ITU, UNPD]
[Figure: Internet users per host by region. Linear scale (0 to 100), 1998 to 2001. Sources: www.isc.org/ds, ITU]
[Figure: 1067 Random Hosts (all domains)]
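
Note on the Text Communication and Internet Host Names slides: the following is a minimal Python sketch, not part of the original talk, of what "an encoding and its support" means in practice. It checks which of the three encodings named on the slide can represent a sample word, then converts a host name to its ASCII-compatible (Punycode-based) form. The helper name `encodable`, the sample words, and the host `bücher.example` are illustrative assumptions; the codecs are the ones built into Python, and its `idna` codec follows the older IDNA 2003 rules.

```python
def encodable(text, encoding):
    """Return True if every character of `text` has a code in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative sample words (assumed for this sketch, not from the slides).
samples = {
    "English": "language",
    "French": "français",
    "Russian": "язык",
    "Hindi": "भाषा",
}

for name, word in samples.items():
    support = {enc: encodable(word, enc) for enc in ("ascii", "iso-8859-1", "utf-8")}
    print(name, support)

# The DNS itself carries only ASCII labels, so a non-ASCII name is sent
# in an ASCII-compatible form whose labels start with "xn--".
host = "bücher.example"  # hypothetical host name
ascii_host = host.encode("idna")
print(ascii_host)                 # b'xn--bcher-kva.example'
print(ascii_host.decode("idna"))  # back to the Unicode form
```

Only the English word survives all three encodings, and the Russian and Hindi words need Unicode. The host-name conversion also shows why Punycode is a workaround layered on the ASCII-only DNS rather than native Unicode support.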
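
Note on the Linguistic Properties of Programming slide: the contrast between predicate-first (LISP) and predicate-last (Postscript) order can be made concrete with two tiny evaluators. This is a hedged sketch under simplifying assumptions: tokens are separated by spaces, operands are integers, and every operator takes exactly two arguments. The function names `eval_prefix` and `eval_postfix` are hypothetical.

```python
import operator

# Binary operators only; an assumption made to keep the sketch short.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_postfix(tokens):
    """Evaluate predicate-last order (Postscript style), left to right."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    return stack.pop()


def eval_prefix(tokens):
    """Evaluate predicate-first order (LISP style without parentheses)."""
    stack = []
    # Scanning right to left lets the same stack technique work for prefix.
    for tok in reversed(tokens):
        if tok in OPS:
            left = stack.pop()
            right = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    return stack.pop()


# The same computation, 4 * (5 - 1), written in both orders.
print(eval_prefix("* 4 - 5 1".split()))   # 16
print(eval_postfix("4 5 1 - *".split()))  # 16
```

Both orders express the same computation. Only the position of the operator relative to its arguments differs, which is the property the slide compares with word order in natural languages.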
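
Note on the Linguistic Diversity slide: the entropy-based index can be computed directly from category shares. The sketch below assumes that p_i is the proportion of individuals (or pages) in language category i and that the shares are normalized to sum to 1. The function name `diversity` is hypothetical. The input shares are the percentages from the Languages on the Web slide, with "other" treated as a single category, which understates the true diversity.

```python
import math


def diversity(shares):
    """Entropy-based diversity: -2 * sum(p_i * ln p_i) over non-empty categories."""
    shares = [s for s in shares if s > 0]
    total = sum(shares)
    if total <= 0:
        raise ValueError("at least one category must have a positive share")
    return -2.0 * sum((s / total) * math.log(s / total) for s in shares)


# Percentages from the "Languages on the Web" slide
# (O'Neill, Lavoie and Bennett, 2003).
web_shares = {
    "English": 72, "German": 7, "French": 3, "Japanese": 3, "Spanish": 3,
    "Chinese": 2, "Italian": 2, "Portuguese": 2, "Dutch": 1, "Russian": 1,
    "Finnish": 1, "Swedish": 1, "other": 2,
}

print(round(diversity(web_shares.values()), 2))

# Reference points: one category gives 0, and n equal categories give 2 * ln(n).
print(diversity([1]))
print(diversity([1, 1, 1, 1]), 2 * math.log(4))
```

Under these assumptions the index is zero when everyone falls into one category and grows as the population spreads more evenly over more categories, so a single dominant language keeps the value low even when many languages are present.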