DiaCollo: On the Trail of Diachronic Collocations

Bryan Jurish
Berlin-Brandenburg Academy of Sciences and Humanities

Contact: [email protected]
http://clarin.bbaw.de
http://kaskade.dwds.de/diacollo

ABSTRACT

DiaCollo is a new software tool for the efficient extraction, comparison, and interactive visualization of collocations from a diachronic text corpus. Unlike conventional collocation extractors, DiaCollo is suitable for extraction and analysis of diachronic collocation data: collocate pairs whose association strength depends on the date of their occurrence. By tracking changes in a word's typical collocates over time, DiaCollo can help to provide a clearer picture of diachronic changes in the word's usage, especially those related to semantic shift or discourse environment.

THE SITUATION

Diachronic Text Corpora
• heterogeneous text collections
  ◦ especially with respect to date of origin
• increasing number available, e.g.
  ◦ Deutsches Textarchiv (DTA) [4]
  ◦ Corpus of Historical American English (COHA) [2]
• even putatively "synchronic" corpora have a nontrivial temporal extension [8]

Collocation Profiling
"You shall know a word by the company it keeps" – J. R. Firth
• find "significant" collocates of a target term
  ◦ rank candidates by association score
  ◦ filter out "chance" co-occurrences
  ◦ statistical methods require large sample
• existing methods [1, 3, 7] implicitly assume corpus homogeneity

DIACHRONIC PROFILING

Idea
• represent terms as attribute n-tuples
  ◦ including document date!
• partition term vocabulary on-the-fly
  ◦ user-specified epochs
• collect epoch profiles into final result-set

Advantages
• full support for diachronic axis
• variable query-level granularity
• flexible attribute selection

Drawbacks
• sparse data requires larger corpora
• computationally expensive
• large index size

IMPLEMENTATION

Interfaces
• Perl API & command-line utilities
• RESTful web-service plugin + GUI

Features
• scalable even in a high-load environment
  ◦ no persistent server process required
  ◦ index access via file I/O or mmap() syscall
• supports both unary and "diff" profiles
• full DDC query support via ddc back-end

Output & Visualization
• TSV, JSON, HTML, Highcharts, d3-cloud, ...

EXAMPLE 1: Krise ("crisis") IN THE WEEKLY DIE ZEIT (1946–2014)

[Figure: DiaCollo visualization of the collocates of Krise for the epochs 1980–1989 and 2010–2014; score scale 0.0 to 6.0; time axis 1950–2010.]

1980–1989
Collocates shown: Schmidt, Bonn, NATO, Europa, Polen, Berlin, Sowjetunion, Afghanistan, Sozialdemokratische_Partei_Deutschlands, AEG_Hausgeräte_Gmbh
• Soviet war in Afghanistan
• NATO Pershing-II missiles in western Europe
• Solidarność & martial law in Poland
• collapse of Helmut Schmidt (SPD) coalition
• AEG sells consumer electronics division
  ◦ subsequent takeover by Daimler-Benz AG

2010–2014
Collocates shown: Freie_Demokratische_Partei, European_Union, Italien, Syrien, Ukraine, Merkel, Spanien, Europa, Griechenland, Krim
• civil wars in Ukraine & Syria
• Russian annexation of Crimea
• Greek government-debt crisis
  ◦ speculation regarding Italy & Spain
  ◦ bailout terms re-negotiated with EU Troika
• German FDP loses Bundestag presence

EXAMPLE 2: 400 YEARS OF POTABLES (1600–1999)

Remarks
• DDC back-end + GermaNet [5, 6] expansion
• fine-grained search for beverages in object position of verb trinken ("to drink")

DiaCollo Profile
"(Getränk|gn-sub WITH $p=NN)=2 (trinken WITH $p=/VV[IP]/)" #FMIN 1

[Figure: line chart; x-axis: Date (slice), 1600–1900; y-axis: Score (log Dice), -2.5 to 10; series: Alkohol, Bier, Branntwein, Kaffee, Milch, Schnaps, Sekt, Tee, Wasser, Wein.]

Observations
• staples ∼ constants, e.g.
  ◦ Bier, Milch, Wasser ("beer, milk, water")
• 1650–1750: Tee, Kaffee ("tea, coffee") appear
• 1800–1900: Schnaps displaces Branntwein
• 1850–1900: Alkohol ("alcohol") as a beverage

EXAMPLE 3: GENDER BIAS (1600–1900)

[Figure: comparison profiles for the epochs 1750–1774 and 1850–1874.]

• comparison profile: Mann ("man") vs. Frau ("woman")
  ◦ node size indicates absolute association score difference
• fixed & formulaic expressions very prominent
  ◦ gnädige Frau ("milady") → masculine: gnädiger Herr
  ◦ Frau X geborene Y ("born") → birth- vs. married surname
  ◦ der gemeine Mann ("common") → masculine generic
• historical corpus data can reveal persistent cultural biases
  ◦ Mann ⇝ berühmt, ehrlich, gelehrt, ... ("famous, honest, learned, ...")
  ◦ Frau ⇝ lieb, schön, verwitwet, ... ("dear, beautiful, widowed, ...")
• differences grow less pronounced in late 18th & 19th centuries
  ◦ political discourse: deutsch, eigen, frei ("German, own, free")

REFERENCES

[1] K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.
[2] M. Davies. Expanding horizons in historical linguistics with the 400-million word Corpus of Historical American English. Corpora, 7(2):121–157, 2012.
[3] S. Evert. The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD thesis, IMS Stuttgart, 2005.
[4] A. Geyken, S. Haaf, B. Jurish, M. Schulz, J. Steinmann, C. Thomas, and F. Wiegand. Das deutsche Textarchiv: Vom historischen Korpus zum aktiven Archiv. In S. Schomburg, C. Leggewie, H. Lobin, and C. Puschmann, editors, Digitale Wissenschaft. Stand und Entwicklung digital vernetzter Forschung in Deutschland, pages 157–161, 2011.
[5] B. Hamp and H. Feldweg. GermaNet – a lexical-semantic net for German. In Proceedings of the ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, 1997.
[6] V. Henrich and E. Hinrichs. GernEdiT – the GermaNet editing tool. In Proceedings LREC 2010, pages 2228–2235, 2010.
[7] A. Kilgarriff and D. Tugwell. Sketching words. In M.H. Corréard, editor, Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins, EURALEX, pages 125–137, 2002.
[8] J. Scharloth, D. Eugster, and N. Bubenhofer. Das Wuchern der Rhizome. Linguistische Diskursanalyse und Data-driven Turn. In D. Busse and W. Teubert, editors, Linguistische Diskursanalyse. Neue Perspektiven, pages 345–380. VS Verlag, Wiesbaden, 2013.
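
SKETCH: EPOCH-WISE PROFILING

The diachronic profiling idea (terms as attribute tuples that include the document date, a vocabulary partitioned into user-specified epochs, one profile per epoch) can be illustrated with a minimal Python sketch. This is not DiaCollo's Perl implementation: the function names, the in-memory counting, and the toy co-occurrence records are all hypothetical. The scoring uses the commonly cited logDice definition, 14 + log2(2·f(x,y) / (f(x) + f(y))), which is assumed here to correspond to the "Score (log Dice)" axis of Example 2. The "diff" function is likewise an assumption: it only compares collocates present in both profiles.

```python
from collections import Counter, defaultdict
from math import log2


def epoch_of(year, slice_size):
    """Map a year to the first year of its epoch (slice)."""
    return year - year % slice_size


def log_dice(f_xy, f_x, f_y):
    """Commonly cited logDice association score (assumed, see lead-in)."""
    return 14 + log2(2 * f_xy / (f_x + f_y))


def diachronic_profile(records, target, slice_size=10):
    """Return {epoch: {collocate: score}} for one target word.

    records: iterable of (year, word1, word2) co-occurrence tuples,
    a toy stand-in for a real corpus index (hypothetical format).
    """
    pair_freq = defaultdict(Counter)  # epoch -> collocate -> joint frequency
    word_freq = defaultdict(Counter)  # epoch -> word -> marginal frequency
    for year, w1, w2 in records:
        epoch = epoch_of(year, slice_size)
        word_freq[epoch][w1] += 1
        word_freq[epoch][w2] += 1
        if w1 == target:
            pair_freq[epoch][w2] += 1
        elif w2 == target:
            pair_freq[epoch][w1] += 1
    return {
        epoch: {
            c: log_dice(f, word_freq[epoch][target], word_freq[epoch][c])
            for c, f in pairs.items()
        }
        for epoch, pairs in pair_freq.items()
    }


def kbest(scores, k):
    """Top-k collocates by descending score, ties broken alphabetically."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def diff_profile(scores_a, scores_b):
    """Absolute score difference for collocates shared by both profiles."""
    shared = scores_a.keys() & scores_b.keys()
    return {c: abs(scores_a[c] - scores_b[c]) for c in shared}


# Invented toy records, NOT real corpus counts.
toy = [
    (1982, "Krise", "Polen"), (1983, "Krise", "Polen"),
    (1984, "Krise", "Europa"), (1985, "Krise", "NATO"),
    (1986, "Bonn", "NATO"),
    (2011, "Krise", "Europa"), (2012, "Krise", "Europa"),
    (2012, "Krise", "Griechenland"), (2013, "Krise", "Griechenland"),
    (2014, "Krise", "Ukraine"), (2014, "Merkel", "Ukraine"),
]

profile = diachronic_profile(toy, "Krise", slice_size=10)
diff = diff_profile(profile[1980], profile[2010])

for epoch in sorted(profile):
    print(epoch, kbest(profile[epoch], 2))
```

Changing slice_size regroups the same records into coarser or finer epochs without re-reading the data, which is the point of partitioning on-the-fly. A real index would also need frequency thresholds to cope with the sparse-data drawback noted above.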