Using Corpora in Language Learning

Using Corpora in Linguistics
Introduction to WordSmith Tools for Beginners
Íde O’Sullivan
Regional Writing Centre
www.ul.ie/rwc
Corpus Linguistics

McEnery and Wilson (2001:1) describe
corpus linguistics as “the study of
language based on examples of ‘real life’
language use”.
McEnery, T. and Wilson, A. (2001) Corpus Linguistics
(2nd edition). Edinburgh: Edinburgh University Press.
Corpus: Definition

“A corpus is [the name given to] a set of
texts which has been put together for
some purpose, usually (though not
necessarily), in computer-readable form”
(Wray, Trott & Bloomer, 1998:213).
Wray, A., Trott, K. & Bloomer, A. (1998)
Projects in Linguistics: A Practical Guide to
Researching Language. London, New York:
Arnold.
Corpus: Definition

“a corpus typically implies a finite body of
text, sampled to be maximally
representative of a particular variety of a
language, and which can be stored and
manipulated using a computer” McEnery
and Wilson (2001:73).

Corpus ≠ Archive
Concordancing: Definition

“A concordance, in its simplest form, is an
alphabetical listing of the words in a text,
given together with the contexts in which
they appear”.
Catherine Ball, Concordances & Corpora: Tutorial:
http://www.georgetown.edu/faculty/ballc/corpora/tutorial.html
Concordancing: Definition

“A concordance is a list of examples of a
particular word, part of a word or combination of
words, in its contexts drawn from a text corpus.
The search word is sometimes also referred to as
a keyword. The most common way of displaying
a concordance is by a series of lines with the
keyword in context (KWIC)”.
Kettemann, B. (1995) “Concordancing in stylistics
teaching”, in Grosser, W., Hogg, J. and Hubmeyer, K.
(eds), Style: Literary and Non-Literary. Contemporary
Trends in Cultural Stylistics. New York: The Edwin Mellen
Press: 307-318.
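
The KWIC display described in the definition above can be sketched in a few lines of Python. This is only an illustration and not how WordSmith Tools works internally; the function name kwic, the width parameter and the sample text are assumptions made for the example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`."""
    # Collapse whitespace so line breaks in the source do not split a context.
    text = " ".join(text.split())
    lines = []
    # \b restricts matches to whole words; IGNORECASE also finds capitalised forms.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

# Hypothetical sample text, used only to demonstrate the layout.
sample = ("A corpus is a set of texts. The corpus can be stored on a computer. "
          "A small corpus is still useful.")

for line in kwic(sample, "corpus", width=20):
    print(line)
```

Because the keyword always lands in the same column, repeated patterns to its left and right become easy to spot by eye.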
Software to Analyse Corpora

“Concordancing software enables you to
discover patterns that exist in natural
language by grouping text in such a way that
they are clearly visible […] The real value of
the concordancer lies in this question of
visibility” (Tribble & Jones, 1997:3).
Tribble, C. and Jones, G. (1997)
Concordances in the Classroom: Using
Corpora in Language Education. Houston TX:
Athelstan.
Using Corpora in Language
Learning and Teaching
Organisation of the CD
• This CD contains a collection of small genre-specific
academic and journalistic corpora in
English, French, Gaeilge, German and
Spanish.
• For each language there are two small genre-specific
corpora: a journalistic corpus
(100,000 words) and an academic corpus
(50,000 words). The journalistic corpora are
divided into four subcorpora: current affairs,
editorials, reviews and sport. The academic
corpora are divided into two subcorpora:
theses and articles.
Using Corpora in Language
Learning and Teaching
Organisation of the CD
Languages: English, French, Gaeilge, German, Spanish

Academic Corpus (50,000 words)
    Theses: 25,000
    Articles: 25,000

Journalistic Corpus (100,000 words)
    Current Affairs: 44,000 (paper 1, paper 2)
    Editorial: 22,000 (paper 1, paper 2)
    Reviews: 12,000 (paper 1, paper 2)
    Sport: 22,000 (paper 1, paper 2)
Sources of Journalistic Corpora
English: Irish Examiner, Irish Independent, Irish Times
French: Le Monde, L’Humanité
Gaeilge: Beo, Foinse, Lá
German: Die Süddeutsche Zeitung, Die Frankfurter Allgemeine Zeitung
Spanish: La Vanguardia, El Periódico
Sources of Academic Corpora

Articles and theses written by native speakers

Subject Areas:
Literature,
Cultural Studies,
Translation Studies, Education,
Applied Linguistics, Sociolinguistics,
Corpus Linguistics, Media Studies,
Language Pedagogy, Teacher Training,
Discourse Analysis, Politics,
Research Methodology,
Second Language Acquisition,
History of Language
WordSmith Tools

Wordlists
• Frequency
• Alphabetical order
• Statistical information
Keywords
Concord
• Collocations
• Clusters
• Patterns
• Plot
• Source text
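
To make the three wordlist views concrete (frequency, alphabetical order, statistical information), here is a minimal Python sketch. It is not WordSmith's implementation: the tokenisation rule, the function names and the choice of statistics (tokens, types, type/token ratio) are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of letters (accented letters included).
    return re.findall(r"[^\W\d_]+", text.lower())

def wordlist(text):
    """Build frequency-ordered and alphabetical wordlists plus basic statistics."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    by_frequency = counts.most_common()      # most frequent word first
    alphabetical = sorted(counts.items())    # A to Z
    stats = {
        "tokens": len(tokens),               # running words
        "types": len(counts),                # distinct words
        "type_token_ratio": len(counts) / len(tokens) if tokens else 0.0,
    }
    return by_frequency, alphabetical, stats

# Hypothetical sample text standing in for a corpus file.
sample = "The corpus is small. The corpus is useful."
by_frequency, alphabetical, stats = wordlist(sample)
print(by_frequency)
print(alphabetical)
print(stats)
```

Running the same function over two corpora and comparing the top of each frequency list gives a first, rough impression of how the genres differ.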
WordSmith Tools

Concord
• Sorting data
• Concord expansion option
• Concordance with multiple views
• Settings
• Wildcards
• Advanced searching
• Close texts
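
As an illustration of wildcard searching, the sketch below translates a simple wildcard pattern into a regular expression. It assumes the common convention that * stands for any run of word characters and ? for exactly one; WordSmith's own wildcard syntax may differ, so check its help file. The function names are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a simple wildcard pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")        # any run of word characters (may be empty)
        elif ch == "?":
            parts.append(r"\w")         # exactly one word character
        else:
            parts.append(re.escape(ch)) # literal character
    # \b on both sides keeps each match to a single whole word.
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)

def search(text, pattern):
    """Return every word in `text` that matches the wildcard pattern."""
    return [m.group() for m in wildcard_to_regex(pattern).finditer(text)]

# Hypothetical sample sentence.
text = "Concordances and concordancing help learners; a concordancer shows context."
print(search(text, "concordanc*"))
print(search(text, "conte?t"))
```

A search such as concordanc* retrieves all forms that share a stem, which is useful when the search word has several inflected or derived forms.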
Worksheet

• Run individual wordlists for the Academic
Corpus and the Journalistic Corpus. Compare
and contrast your findings to draw
conclusions about each genre.
• Run concordance lists for a chosen aspect of
the language:
    • Do any collocational patterns emerge from
    this evidence?
    • What are the most common clusters including
    the search word(s)?
    • Identify the most common uses of the word.
    • Are there exceptions to these uses?
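
The collocation and cluster questions in the worksheet can also be explored outside WordSmith. The following Python sketch counts the words that occur within a small window of a search word (collocates) and the word sequences that contain it (clusters). The window size, cluster length, function names and sample text are all assumptions chosen for illustration; WordSmith's own settings and counts may differ.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of letters (accented letters included).
    return re.findall(r"[^\W\d_]+", text.lower())

def collocates(tokens, node, span=2):
    """Count words occurring within `span` words to the left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

def clusters(tokens, node, size=3):
    """Count every sequence of `size` consecutive words that contains `node`."""
    counts = Counter()
    for i in range(len(tokens) - size + 1):
        gram = tokens[i:i + size]
        if node in gram:
            counts[" ".join(gram)] += 1
    return counts

# Hypothetical sample text standing in for a corpus.
text = ("The results of the study show that the results of the survey "
        "differ from the results of the interviews.")
tokens = tokenize(text)
print(collocates(tokens, "results").most_common(3))
print(clusters(tokens, "results").most_common(3))
```

Raw counts like these favour very frequent words such as "the"; they are a starting point for reading the concordance lines, not a substitute for it.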
Resources

• WordSmith Tools:
http://www.lexically.net/wordsmith/
• MonoConc and ParaConc:
http://www.athel.com/mono.html
Online Resources

• Tim Johns Data-driven Learning Page:
http://www.eisu.bham.ac.uk/johnstf/timconc.htm
• Mike Barlow:
http://www.athel.com/corpus.html
• Other resources:
http://www.ul.ie/~appliedlanguages/LI4113_C&C_websites.doc

Online Concordancing
• Hong Kong Virtual Language Centre:
http://www.edict.com.hk/concordance/default.htm
• The Compleat Lexical Tutor (Lextutor):
http://www.lextutor.ca/
• French Learner Language Oral Corpus (flloc):
http://www.flloc.soton.ac.uk/

Resources
Freeware Concordancers
• ConcApp:
http://www.edict.com.hk/pub/concapp/
• Create your own corpus (disposable corpus)
• Issues of copyright
• Issues of reliability
Resources

• British National Corpus (corpus demo):
http://info.ox.ac.uk/bnc/
• Cobuild Bank of English (wordbanks online):
http://www.cobuild.collins.co.uk/
• Corpus Concordance Sampler:
http://www.collins.co.uk/Corpus/CorpusSearch.aspx
• Limerick Corpus of Irish-English (L-CIE):
http://www.ul.ie/~lcie/