
DIACOLLO: ON THE TRAIL OF DIACHRONIC COLLOCATIONS
Bryan Jurish
Berlin-Brandenburg Academy of Sciences and Humanities
ABSTRACT
DiaCollo is a new software tool for the efficient
extraction, comparison, and interactive visualization of collocations from a diachronic text
corpus. Unlike conventional collocation
extractors, DiaCollo is suitable for the extraction
and analysis of diachronic collocation data, i.e. collocate pairs whose association strength depends
on the date of their occurrence. By tracking
changes in a word’s typical collocates over time,
DiaCollo can help to provide a clearer picture
of diachronic changes in the word’s usage, especially those related to semantic shift or discourse environment.
THE SITUATION
Diachronic Text Corpora
• heterogeneous text collections
  ◦ especially with respect to date of origin
• increasing number available, e.g.
  ◦ Deutsches Textarchiv (DTA) [4]
  ◦ Historical American English (COHA) [2]
• even putatively “synchronic” corpora have a nontrivial temporal extension [8]
EXAMPLE 1: Krise (“CRISIS”) IN THE WEEKLY DIE ZEIT (1946–2014)
• existing methods [1, 3, 7] implicitly assume corpus homogeneity
[Figure: DiaCollo collocate profile for Krise. Labels: Schmidt, Bonn, NATO, Europa, Polen, Sozialdemokratische_Partei_Deutschlands, AEG_Hausgeräte_Gmbh. Date ticks: 1980–2010. Score scale: 0.0–5.5.]
• Russian annexation of Crimea
• Greek government-debt crisis
Drawbacks
• sparse data requires larger corpora
• computationally expensive
• large index size
IMPLEMENTATION
Interfaces
• Perl API & command-line utilities
• RESTful web-service plugin + GUI
Features
• scalable even in a high-load environment
  ◦ no persistent server process required
  ◦ index access via file I/O or mmap() syscall
• supports both unary and “diff” profiles
• full DDC query support via ddc back-end
Output & Visualization
• TSV, JSON, HTML, Highcharts, d3-cloud, ...
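The feature list mentions index access via file I/O or the mmap() syscall, with no persistent server process. The following is a minimal sketch of that general idea in Python, not of DiaCollo's own (Perl) implementation. The record layout, the file name, and the functions `write_index`, `lookup`, and `demo` are all assumptions made for illustration; the actual index format is not described here.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout (an assumption, not DiaCollo's real format):
# little-endian unsigned 32-bit term ID followed by unsigned 32-bit frequency.
RECORD = struct.Struct("<II")


def write_index(path, freqs):
    """Write (term_id, frequency) pairs sorted by term_id as fixed-width records."""
    with open(path, "wb") as fh:
        for term_id in sorted(freqs):
            fh.write(RECORD.pack(term_id, freqs[term_id]))


def lookup(path, term_id):
    """Binary-search the memory-mapped index; return the frequency or None."""
    with open(path, "rb") as fh:
        if os.fstat(fh.fileno()).st_size == 0:
            return None  # an empty file cannot be memory-mapped
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            lo, hi = 0, len(mm) // RECORD.size
            while lo < hi:
                mid = (lo + hi) // 2
                key, freq = RECORD.unpack_from(mm, mid * RECORD.size)
                if key == term_id:
                    return freq
                if key < term_id:
                    lo = mid + 1
                else:
                    hi = mid
    return None


def demo():
    """Build a tiny index in a temporary directory and query it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "terms.idx")
        write_index(path, {7: 120, 3: 45, 11: 9})
        return lookup(path, 7), lookup(path, 4)
```

Because each lookup opens, maps, and closes the file on its own, any short-lived process can answer a query, which is one way to do without a persistent server.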
Contact:
[email protected]
http://clarin.bbaw.de
http://kaskade.dwds.de/diacollo
[Figure: DiaCollo collocate profile for Krise. Labels: Freie_Demokratische_Partei, European_Union, Italien, Syrien, Ukraine, Merkel, Spanien, Europa, Griechenland, Krim. Date ticks: 1970–2010. Score scale: 0.0–5.5.]
  ◦ speculation regarding Italy & Spain
  ◦ bailout terms re-negotiated with EU Troika
• German FDP loses Bundestag presence
EXAMPLE 2: 400 YEARS OF POTABLES (1600–1999)
Remarks
• DDC back-end + GermaNet [5, 6] expansion
p fine-grained search for beverages in object po-
DiaCollo Profile
"(Getränk|gn-sub WITH $p=NN)=2 (trinken WITH $p=/VV[IP]/)" #FMIN 1
[Figure: Score (log Dice) over time for the collocates Alkohol, Bier, Branntwein, Kaffee, Milch, Schnaps, Sekt, Tee, Wasser, Wein. Score axis: -2.5 to 10. Date ticks: 1700, 1800, 1900.]
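The score axis is labelled log Dice, but the poster does not spell out the formula. The sketch below assumes the commonly used definition, logDice = 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency of the pair and f_x, f_y are the frequencies of its members. The function name `log_dice` is hypothetical, and DiaCollo's exact computation may differ.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """Commonly used logDice score: 14 + log2(2*f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of the pair; f_x, f_y: marginal frequencies.
    Returns -inf when the pair never co-occurs.
    """
    if f_xy <= 0:
        return float("-inf")
    return 14.0 + math.log2(2.0 * f_xy / (f_x + f_y))
```

Under this definition the score peaks at 14 when the two items always occur together and turns negative only for very rare pairings.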
DIACHRONIC PROFILING
Advantages
• full support for diachronic axis
• variable query-level granularity
• flexible attribute selection
• NATO Pershing-II missiles in western Europe
2010–2014
• civil wars in Ukraine & Syria
Idea
• represent terms as attribute n-tuples
  ◦ including document date!
• partition term vocabulary on-the-fly
  ◦ user-specified epochs
• collect epoch profiles into final result-set
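The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not DiaCollo's implementation: epochs are fixed-width year ranges, a co-occurrence is counted once per unit (e.g. sentence), and raw counts stand in for a real association score. The names `epoch_of` and `diachronic_profile` are hypothetical, and the toy input is invented.

```python
from collections import Counter, defaultdict


def epoch_of(year, width):
    """Map a year to the first year of its epoch (e.g. 1987 -> 1980 for width 10)."""
    return year - year % width


def diachronic_profile(units, target, width=10, top_k=3):
    """Collect per-epoch collocate counts for `target`.

    `units` is an iterable of (year, [lemma, ...]) pairs, one per co-occurrence
    unit (e.g. a sentence); the year plays the role of the date attribute.
    Returns {epoch: [(collocate, count), ...]} with the top_k collocates per epoch.
    """
    counts = defaultdict(Counter)
    for year, lemmas in units:
        if target not in lemmas:
            continue
        epoch = epoch_of(year, width)
        for lemma in set(lemmas):
            if lemma != target:
                counts[epoch][lemma] += 1
    return {
        epoch: sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        for epoch, c in sorted(counts.items())
    }


# Toy input invented for illustration; not real corpus data.
units = [
    (1982, ["Krise", "NATO", "Bonn"]),
    (1985, ["Krise", "NATO"]),
    (2012, ["Krise", "Griechenland"]),
    (2013, ["Wetter", "Berlin"]),
]
profile = diachronic_profile(units, "Krise")
```

Keeping the date as an ordinary attribute means the epoch width can be chosen per query rather than fixed when the index is built.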
• Solidarność & martial law in Poland
• collapse of Helmut Schmidt (SPD) coalition
• AEG sells consumer electronics division
  ◦ subsequent takeover by Daimler-Benz AG
“You shall know a word by the company it keeps” – J. R. Firth
Collocation Profiling
• find “significant” collocates of a target term
  ◦ rank candidates by association score
  ◦ filter out “chance” co-occurrences
  ◦ statistical methods require large sample
1980–1989
• Soviet war in Afghanistan
sition of verb trinken (“to drink”)
Observations
• staples ∼ constants, e.g.
  ◦ Bier, Milch, Wasser (“beer, milk, water”)
• 1650–1750: Tee, Kaffee (“tea, coffee”) appear
• 1800–1900: Schnaps displaces Branntwein
• 1850–1900: Alkohol (“alcohol”) as a beverage
EXAMPLE 3: GENDER BIAS (1600–1900)
• comparison profile: Mann (“man”) vs. Frau (“woman”)
  ◦ node size indicates absolute association score difference
[Figures: comparison profiles for the slices 1750–1774 and 1850–1874.]
• fixed & formulaic expressions very prominent
  ◦ gnädige Frau (“milady”) → masculine: gnädiger Herr
  ◦ Frau X geborene Y (“born”) → birth- vs. married surname
  ◦ der gemeine Mann (“common”) → masculine generic
• historical corpus data can reveal persistent cultural biases
  ◦ Mann ↝ berühmt, ehrlich, gelehrt, ... (“famous, honest, learned, ...”)
  ◦ Frau ↝ lieb, schön, verwitwet, ... (“dear, beautiful, widowed, ...”)
• differences grow less pronounced in late 18th & 19th centuries
  ◦ political discourse: deutsch, eigen, frei (“German, own, free”)
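A comparison (“diff”) profile such as Mann vs. Frau can be sketched as a ranking by absolute score difference. The Python below assumes that each profile is a plain mapping from collocate to score, that a collocate missing from one profile scores 0.0 there, and that the difference is a simple subtraction. The function `diff_profile` is hypothetical and the scores are invented; DiaCollo's own comparison may be defined differently.

```python
def diff_profile(scores_a, scores_b, top_k=5):
    """Compare two collocate->score mappings.

    Returns (collocate, score_a - score_b) pairs ranked by absolute difference,
    treating a collocate missing from one profile as having score 0.0 there
    (an assumption made for this sketch).
    """
    collocates = set(scores_a) | set(scores_b)
    diffs = {w: scores_a.get(w, 0.0) - scores_b.get(w, 0.0) for w in collocates}
    ranked = sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:top_k]


# Toy scores invented for illustration; not taken from the corpus.
mann = {"berühmt": 6.0, "ehrlich": 5.0, "lieb": 2.0}
frau = {"lieb": 5.5, "schön": 6.5, "ehrlich": 4.0}
ranked = diff_profile(mann, frau)
```

The sign of each difference shows which target the collocate prefers, while its magnitude corresponds to what the poster maps to node size.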
REFERENCES
[1] K. W. Church and P. Hanks. Word association norms,
mutual information, and lexicography. Computational
Linguistics, 16(1):22–29, 1990.
[2] M. Davies. Expanding horizons in historical linguistics
with the 400-million word Corpus of Historical American English. Corpora, 7(2):121–157, 2012.
[3] S. Evert. The Statistics of Word Cooccurrences: Word Pairs
and Collocations. PhD thesis, IMS Stuttgart, 2005.
[4] A. Geyken, S. Haaf, B. Jurish, M. Schulz, J. Steinmann, C. Thomas, and F. Wiegand. Das deutsche
Textarchiv: Vom historischen Korpus zum aktiven
Archiv. In S. Schomburg, C. Leggewie, H. Lobin, and
C. Puschmann, editors, Digitale Wissenschaft. Stand und
Entwicklung digital vernetzter Forschung in Deutschland,
pages 157–161, 2011.
[5] B. Hamp and H. Feldweg. GermaNet – a lexical-semantic
net for German. In Proceedings of the ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, 1997.
[6] V. Henrich and E. Hinrichs. GernEdiT – the GermaNet
editing tool. In Proceedings LREC 2010, pages 2228–2235,
2010.
[7] A. Kilgarriff and D. Tugwell. Sketching words. In M.-H. Corréard, editor, Lexicography and Natural Language
Processing: A Festschrift in Honour of B. T. S. Atkins, EURALEX, pages 125–137, 2002.
[8] J. Scharloth, D. Eugster, and N. Bubenhofer. Das
Wuchern der Rhizome. Linguistische Diskursanalyse
und Data-driven Turn. In D. Busse and W. Teubert,
editors, Linguistische Diskursanalyse. Neue Perspektiven,
pages 345–380. VS Verlag, Wiesbaden, 2013.