Description of the Corpus

Description of the Czech Lexical Phonological Corpus
[last updated 19/1/2015]
1 Phonological Analysis
The Czech Lexical Phonological Corpus (henceforth: CzPhLexCorp) consists primarily of
lexemes and their phonological representations. The lexemes have been phonologically transcribed in accordance with the phonological analysis of Modern Czech presented in the book Phonotactics of Czech by Aleš Bičan (Peter Lang, 2013), which provides further details.
A phonological description accounts for two types of properties of linguistic data: segmental phonetic properties and suprasegmental phonetic properties. A phonological theory
provides phonological models referring to sets of these properties. Three phonological models
are relevant here: phoneme, phonotagm, para-phonotactic feature.
• The phoneme has three equivalent definitions: a) an unordered set of distinctive features; b) a minimal phonotactic entity; c) a set of allophones. It accounts for segmental phonetic properties of speech sounds. See 1.1.
• The phonotagm is defined as a self-contained (autonomous, complete) combination of phonemes. One of the phonemes is the nuclear element; the others are peripheral or non-nuclear (and usually optional) elements within the combination. Phonotagms roughly account for segmental phonetic properties of syllables. See 1.2.
• The para-phonotactic features are additional phonological properties which either distinguish one phonological form from another (e.g. tones in tone languages) or determine the groupment of phonotagms (e.g. accent group). They roughly account for suprasegmental phonetic properties of utterances. See 1.3.
1.1 Phonemes of Czech
According to the function within phonotagms, three classes of phonemes are recognized for
Czech: vowels, consonants and semiconsonants.
1.1.1 Vowels
Vowels are phonemes which are always nuclear; the classification of the Czech vowels according to distinctive features together with the range of their allophones is given in table 1.
The diphthongs are interpreted as single phonemes (see Bičan 2013: 37–40).
             Short        Long         Diphthongal
Front                                  /ë/ [ɛu̯]
  high       /i/ [ɪ]      /ī/ [iː]
  mid        /e/ [ɛ]      /ē/ [ɛː]
Central      /a/ [a]      /ā/ [aː]     /ä/ [au̯]
Back                                   /ö/ [ou̯]
  high       /u/ [u]      /ū/ [uː]
  mid        /o/ [o]      /ō/ [oː]
Table 1: Vowels of Czech with their allophones
1.1.2 Consonants
The consonants are phonemes which are always peripheral (non-nuclear); the classification of
the consonants of Czech according to distinctive features together with the range of their allophones is given in table 2.
             Labial                Alveolar                      Palatal               Velar
             v’less     v’ed       v’less         v’ed           v’less     v’ed       v’less     v’ed
Occlusive    /p/        /b/        /t/            /d/            /ť/        /ď/        /k/        /g/
             [p]        [b]        [t] [c]        [d] [ɟ]        [c]        [ɟ]        [k]        [ɡ]
Fricative    /f/        /v/        /s/            /z/            /š/        /ž/        /x/        /h/
             [f]        [v]        [s]            [z]            [ʃ]        [ʒ]        [x]        [ɦ]
Nasal        /m/                   /n/                           /ň/
             [m] [ɱ]               [n] [ɲ] [ŋ]                   [ɲ]
Outside the proportional system: /j/ [j] and /ř/ [r̝] [r̝̊]
Table 2: Consonants of Czech in the context of relevance and their allophones
1.1.2.1 Affricates
The affricates [t͜s], [d͜z], [t͜ʃ] and [d͜ʒ] are interpreted as combinations of two phonemes (see
Bičan 2013: 33–7 for the defense of this analysis). Their interpretation is shown in table 3.
They are subject to neutralization (see 1.1.4).
             In contexts of relevance                   In contexts of neutralization
Affricate    Phonological      Example                  Phonological      Example
             interpretation                             interpretation
[t͜s]         /Ts/              /TseSta/ cesta           /TS/              /peTS/ pec, /klaTSki/ klacky
[d͜z]         /Tz/              /Tzinkaťi/ dzinkati      /TS/              /leTSKdo/ leckdo
[t͜ʃ]         /Tš/              /TšaS/ čas               /TŠ/              /mīTŠ/ míč, /poTŠti/ počty
[d͜ʒ]         /Tž/              /Tžem/ džem              /TŠ/              /lēTŠba/
Table 3: Interpretation of the affricates
The affricates are distinguished from the sequences of pre-alveolar stops (realized with an
explosion) and pre-alveolar or post-alveolar fricatives [t.s], [d.z], [t.ʃ], [d.ʒ] (cf. práce × prát
se, počít × podšít). The stop-fricative sequences are in this case interpreted as signals of
boundaries because this pronunciation is invariably found across grammatical boundaries only
(see 1.3.2).
1.1.3 Semiconsonants
Semiconsonants are phonemes which can be both nuclear entities (like vowels) and peripheral
entities (like consonants). Two such phonemes are recognized for Czech: /r/ realized as syllabic [r̩] and non-syllabic [r], and /l/ realized as syllabic [l̩] and non-syllabic [l].
1.1.3.1 Notation of nuclear /r/ and /l/
For the sake of convenience, the nuclear semiconsonants are transcribed /R/, /L/ in CzPhLexCorp (e.g. /pLnī/ plný).
1.1.3.2 Syllabic [m̩]
Syllabic [m̩], sometimes pronounced in sedm, osm and similar words, is interpreted as a realization of the sequence /um/ because it is in free variation with [um], hence /sedum/ and
/osum/.
1.1.4 Neutralization and archi-phonemes
Neutralization is the contextual irrelevancy of a difference between two or more phonemes which
is relevant (distinguishes the meaning) in other contexts (= contexts of relevance). The phonemes occurring in contexts of neutralization are archi-phonemes, defined as the intersection
of two or more phonemes (i.e. the phonemes between which the neutralization takes place).
Two neutralizations are recognized for Czech: neutralization of voicing, and neutralization
of the place of articulation for nasals.
1.1.4.1 Neutralization of voicing
The neutralization of voicing is contextual irrelevancy of the difference between the voiceless
and voiced consonants. It results in voicing archi-phonemes; their classification together
with the range of realization is given in table 4.
             Labial         Alveolar          Palatal        Velar
Occlusive    /P/            /T/               /Ť/            /K/
             [p] [b]        [t] [d]           [c] [ɟ]        [k] [ɡ]
Fricative    /F/            /S/               /Š/            /X/
             [f] [v]        [s] [z]           [ʃ] [ʒ]        [x] [ɦ]
Nasal        /m/            /n/               /ň/
             [m] [ɱ]        [n] [ɲ] [ŋ]       [ɲ]
Outside the proportional system: /j/ [j] and /ř/ [r̝] [r̝̊]
Table 4: Consonants of Czech in the context of neutralization of voicing with their allophones
Neutralization of voicing takes place:
1) At the end of phonological words [1]: /peS/ pes, /beS-ūTšelnī/ bezúčelný;
2) Before a voiceless or voiced consonant (with the exception of /v/): /StāT/ stát, /Sdar/ zdar (cf. /svāT/ svát × /zvāT/ zvát);
3) Before a voicing archi-phoneme: /FSpřīmiT/ vzpřímit, /xePSkī/ chebský;
4) Before /ř/ if the latter is followed by a voiceless or voiced consonant (with the exception of /v/): /Třťina/ třtina, /XřbeT/ hřbet (cf. /břve/ Břve);
5) Before /ř/ if the latter occurs at the end of a phonological word: /pePř/ pepř, /dovniTř/ dovnitř.
The voicing archi-phonemes are phonologically neither voiceless nor voiced. Their phonetic/realizational voicing is completely predictable from the context they occur in and is thus
non-phonological.
[1] See 1.3.2 on the notion of the phonological word.
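The five contexts of voicing neutralization listed above can be sketched as a right-to-left pass over a single phonological word. The Python sketch below is only an illustration: the function name, the input convention (a string of phoneme symbols from table 2, one character per phoneme, without boundary marks) and the procedure itself are assumptions made for this example, not the method by which the corpus was transcribed.

```python
import unicodedata


def nfc(text):
    # Normalize so that every phoneme symbol is a single character.
    return unicodedata.normalize("NFC", text)


# Voiceless/voiced pairs (table 2) mapped to their archi-phonemes (table 4).
ARCHI = {nfc(k): nfc(v) for k, v in {
    "p": "P", "b": "P", "t": "T", "d": "T", "ť": "Ť", "ď": "Ť",
    "k": "K", "g": "K", "f": "F", "v": "F", "s": "S", "z": "S",
    "š": "Š", "ž": "Š", "x": "X", "h": "X"}.items()}
PAIRED_EXCEPT_V = set(ARCHI) - {"v"}                 # contexts 2 and 4
TRIGGERS = PAIRED_EXCEPT_V | set(ARCHI.values())     # contexts 2 and 3
R_HACEK = nfc("ř")


def neutralize_voicing(word):
    """Hypothetical helper: apply contexts 1-5 to one phonological word."""
    out = list(nfc(word))
    for i in range(len(out) - 1, -1, -1):            # right to left
        if out[i] not in ARCHI:
            continue
        rest = out[i + 1:]
        if rest[:1] == [R_HACEK]:
            after = rest[1:]
            # context 5 (/ř/ is word-final) or context 4 (/ř/ + consonant except /v/)
            neutral = not after or after[0] in PAIRED_EXCEPT_V
        else:
            # context 1 (word-final), 2 (consonant except /v/), 3 (archi-phoneme)
            neutral = not rest or rest[0] in TRIGGERS
        if neutral:
            out[i] = ARCHI[out[i]]
    return "".join(out)
```

Under these assumptions, `neutralize_voicing("vzpřīmit")` yields the form /FSpřīmiT/ given under context 3, and `neutralize_voicing("břve")` leaves the word unchanged, in line with the counter-example under context 4.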
1.1.4.2 Neutralization of the place of articulation of nasals
The neutralization of the place of articulation of nasals is the contextual irrelevancy of the difference between /m/, /n/ and /ň/. It results in the nasal archi-phoneme /M/, which is always realized as
[m]. The neutralization is a consequence of the fact that no nasal other than [m] is found in
certain contexts, which makes the place of articulation completely predictable there.
Neutralization of the place of articulation of nasals takes place:
1) Phonotagm-initially before any consonant: /Mdlo/ mdlo, /MhöřiT/ mhouřit;
2) Phonotagm-initially before any semiconsonant (either nuclear or non-nuclear): /sMlöva/ smlouva, /zMRznöT/ zmrznout;
3) Across phonotagms in between a consonant and a nuclear semiconsonant: /posMRtnī/ posmrtný, /odMLTšeT/ odmlčet.
1.2 Phonotagms
Phonotagms are self-contained combinations of phonemes. In Czech a phonotagm is usually
realized as a single syllable except for some special cases:
• /Stārl/, /karl/, /tirl/ are all single phonotagms (the final /l/ is not nuclear; see Bičan 2013: 140ff.)
1.3 Para-phonotactic features
In Czech para-phonotactic features determine the groupment of phonotagms. Two such
groupments are recognized for Czech: phonological word and accent group.
1.3.1 Accent group
The accent group is a group of phonotagms gathered together by features of accent. Accent
groups are determined by internal coherence and melodic contour (Palková 2013). The same
sequence of phonotagms (syllables) may correspond to different accent groups, which are then
distinguished by the melodic contour (e.g. od ní mají × odnímají).
In CzPhLexCorp the concept of accent group is in accordance with the theory of Zdena
Palková, and the groupment of phonotagms into accent groups follows her rules formulated
and applied to automatic speech synthesis of Czech (Palková 2004).
1.3.2 Phonological word
The phonological word is recognized as a constituent of accent groups in the case of occurrence of the glottal stop and other boundary-marking phenomena. The notion is important in
order to account for the differences such as the following ones:
[potʔokɛm] pod okem       [potokɛm] potokem
[fʔaktɛx] v aktech        [faktɛx] faktech
[pot.ʃiːt] podšít         [pot͜ʃiːt] počít
[vrtɛx] v rtech           [vr̩tɛx] vrtech
All of the examples correspond to single accent groups, but the left-hand examples are interpreted as corresponding to two phonological words.
For more details see Bičan (2014).
1.3.3 Notation
Although the corpus is phonological, and accent groups and phonological words are phonological units, several types of boundary-signaling symbols are used there so as to easily determine to which grammatical units the accent groups and phonological words correspond. See
table 5 for the notation used.
Symbol   Marks the boundary of   Corresponds grammatically to the boundary of   Example
+        accent group            word or stem in compound words                 /hoďiT+sebö/ hodit sebou; /TšeSko+slovenSkī/ česko-slovenský
=        phonological word       word                                           /nuďiT=se/ nudit se
-        phonological word       morpheme                                       /poT-ūředňīK/ podúředník
_        orthographical word     word                                           /pēTsi_se/ péci se
.        phonotagm               –                                              /po.lēF.ka/ polévka
Table 5: Boundary-signaling marks used in the corpus
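The marks in table 5 lend themselves to simple string processing. The following Python sketch shows one possible way of stripping the marks and of splitting a phonological representation into accent groups and phonological words. The function names are hypothetical, the input is assumed to be given without the surrounding slashes, and it is assumed that an accent-group boundary (+) also closes a phonological word.

```python
import re

BOUNDARY_MARKS = "+=-_."


def strip_boundaries(phrep):
    """Remove all boundary-signaling marks from a phonological representation."""
    return "".join(ch for ch in phrep if ch not in BOUNDARY_MARKS)


def accent_groups(phrep):
    """Split at the accent-group mark '+'."""
    return phrep.split("+")


def phonological_words(phrep):
    """Split at '+', '=' and '-' (assumption: '+' also ends a phonological word)."""
    return [part for part in re.split(r"[+=\-]", phrep) if part]
```

For instance, `phonological_words("poT-ūředňīK")` returns the two parts `poT` and `ūředňīK`, while `accent_groups` leaves the same form as a single group.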
2 Structure of the CzPhLexCorp
2.1 Format
CzPhLexCorp is available in the comma-separated values format (.csv). This format allows
for storing tabular data in plain-text form. It can be opened and edited in applications designed for editing .csv files or imported into applications such as Microsoft Excel, Microsoft Access
etc. [2] The data are separated by the separator “;”, i.e. the semicolon. Once opened or imported,
the data will be displayed as a table.
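Besides spreadsheet applications, the file can be read programmatically. The sketch below uses Python's standard csv module with the semicolon as delimiter. The sample rows are made up for illustration from forms quoted in this description and show only three of the columns; when reading a real file, the file name and its encoding would have to be supplied by the user (UTF-8 is an assumption, not something stated here).

```python
import csv
import io


def read_corpus(stream):
    """Read semicolon-separated rows into dictionaries keyed by the heading row."""
    return list(csv.DictReader(stream, delimiter=";"))


# Illustrative sample, not actual corpus data.
sample = io.StringIO("Ortho;PhRep;Syllab\nmatka;maTka;maT.ka\npes;peS;peS\n")
rows = read_corpus(sample)
for row in rows:
    print(row["Ortho"], row["PhRep"], row["Syllab"])

# For a file on disk (hypothetical name, assumed encoding):
# with open("corpus.csv", newline="", encoding="utf-8") as f:
#     rows = read_corpus(f)
```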
2.2 Heading
The first row of the table is the heading. Table 6 shows a division of the heading into five
sections. The sections are reproduced in tables 7–11. The other rows contain the data. See also
Bičan (ms. 1).
Orthographic form   Phonological representation   Phonological properties   Parts of speech   Occurrence in dictionaries
Table 6: Sections of the heading
[2] See http://office.microsoft.com/en-001/excel-help/import-or-export-text-txt-or-csv-files-HP010342598.aspx for an explanation of how to import .csv files into MS Excel.
Ortho
Table 7: Columns in the section Orthographic form
PhRep Syllab
Table 8: Columns in the section Phonological representation
Length Phtagms CVStr Place Manner Voicing Horiz Vertic Quant
Table 9: Columns in the section Phonological properties
PoS
Table 10: Columns in the section Parts of speech
SSČ SSJČ PSJČ SN1 SN2 CSN SPr SVaz FSČ ASCS VSČ
Table 11: Columns in the section Occurrence in dictionaries
2.3 Columns
The content of the respective columns is described below. Standard editors such as MS Excel
allow searching and filtering the data according to criteria specified for each column. [3]
2.3.1 Column Ortho
This column lists lexical items in their standard orthographic form.
2.3.2 Column PhRep
This column provides the phonological representation of the lexical items, i.e. the phonological form (PhF) of the item. The transcription follows the analysis outlined in section 1.
2.3.3 Column Syllab
This column provides the syllabification of the lexical items. Syllabification rules are stored
in external files and can be freely replaced with alternative rules. At the moment, CzPhLexCorp follows the rules proposed by Kučera – Monroe (1968).
2.3.4 Column Length
This column contains numbers corresponding to the length of a PhF. The length equals the
number of phonemes within the PhF. Boundary-signaling symbols are not counted.
2.3.5 Column Phtagms
This column contains numbers corresponding to the number of phonotagms within a PhF. It
equals the number of nuclear phonemes within the PhF.
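The values of columns Length and Phtagms follow directly from the PhF. A minimal Python sketch, assuming that every phoneme is written with exactly one character (as in tables 1, 2 and 4 and the notation /R/, /L/ for nuclear semiconsonants); the function names are illustrative only:

```python
import unicodedata


def nfc(text):
    # Normalize so that letters with diacritics are single characters.
    return unicodedata.normalize("NFC", text)


VOWELS = set(nfc("iīeēëaāäuūoōö"))
NUCLEAR_SEMICONSONANTS = set("RL")
BOUNDARY_MARKS = set("+=-_.")


def phonemes(phf):
    """Split a PhF into phoneme symbols, skipping boundary-signaling symbols."""
    return [ch for ch in nfc(phf) if ch not in BOUNDARY_MARKS]


def length(phf):
    """Number of phonemes within the PhF."""
    return len(phonemes(phf))


def phtagms(phf):
    """Number of nuclear phonemes, i.e. the number of phonotagms."""
    return sum(1 for ph in phonemes(phf)
               if ph in VOWELS or ph in NUCLEAR_SEMICONSONANTS)
```

With these definitions, `length("peS")` is 3 and `phtagms("poT=uTšitel")` is 4.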
[3] See http://office.microsoft.com/en-001/excel-help/filter-data-in-a-range-or-table-HP010073941.aspx on filtering data in MS Excel.
2.3.6 Column CVStr
This column reproduces the structure of a PhF according to the membership of its constituent
phonemes in the classes of peripheral entities and nuclear entities. The symbols used are given
in table 12. Boundary-signaling symbols are not included in the representation.
Symbol   Stands for
C        consonants and non-nuclear semiconsonants
V        vowels
W        nuclear semiconsonants
Table 12: Symbols used in column CVStr
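The mapping of table 12 can be sketched in a few lines of Python. The function name is hypothetical, and every symbol that is neither a vowel nor /R/ or /L/ is simply treated as a peripheral phoneme, which is an assumption of this sketch:

```python
import unicodedata

VOWELS = set(unicodedata.normalize("NFC", "iīeēëaāäuūoōö"))
BOUNDARY_MARKS = set("+=-_.")


def cv_structure(phf):
    """Return the C/V/W structure of a PhF (boundary symbols are skipped)."""
    result = []
    for ph in unicodedata.normalize("NFC", phf):
        if ph in BOUNDARY_MARKS:
            continue
        if ph in VOWELS:
            result.append("V")
        elif ph in "RL":          # nuclear semiconsonants
            result.append("W")
        else:                     # consonants and non-nuclear semiconsonants
            result.append("C")
    return "".join(result)
```

For example, `cv_structure("pLnī")` gives `CWCV`.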
2.3.7 Column Place
This column reproduces the structure of a PhF according to the place of articulation of constituent consonants (see table 2). The symbols used are given in table 13. The category
“place” is not relevant for vowels and nuclear semiconsonants, and they are thus transcribed
with minuscule letters.
Symbol   Stands for
L        labials
A        alveolars
P        palatals
K        velars
I        isolated consonants and semiconsonants (/M/, /j/, /ř/, /r/, /l/)
v        vowels
w        nuclear semiconsonants
Table 13: Symbols used in column Place
2.3.8 Column Manner
This column reproduces the structure of a PhF according to the manner of articulation of constituent consonants (see table 2). The symbols used are given in table 14. The category “manner” is not relevant for vowels and nuclear semiconsonants, and they are thus transcribed with
minuscule letters.
Symbol   Stands for
O        occlusives
F        fricatives
N        nasals
R        sonants (/j/, /ř/, /r/, /l/)
v        vowels
w        nuclear semiconsonants
Table 14: Symbols used in column Manner
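Columns Place and Manner can be derived from a PhF with two lookup tables. The Python sketch below fills the tables from the classifications in tables 2 and 4 together with tables 13 and 14; the names are hypothetical, and counting the archi-phoneme /M/ among the nasals in column Manner is an assumption of this sketch.

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


def build_table(classes):
    """Map every phoneme symbol to the symbol of the class it belongs to."""
    return {ph: symbol for symbol, members in classes.items() for ph in nfc(members)}


NUCLEAR = {"v": "iīeēëaāäuūoōö", "w": "RL"}

PLACE = build_table({"L": "pbfvmPF", "A": "tdsznTS", "P": "ťďšžňŤŠ",
                     "K": "kgxhKX", "I": "Mjřrl", **NUCLEAR})
MANNER = build_table({"O": "pbtdťďkgPTŤK", "F": "fvszšžxhFSŠX",
                      "N": "mnňM", "R": "jřrl", **NUCLEAR})


def encode(phf, table):
    """Encode a PhF symbol by symbol; boundary-signaling symbols are skipped."""
    return "".join(table[ph] for ph in nfc(phf) if ph not in "+=-_.")
```

Here `encode("maTka", PLACE)` gives `LvAKv` and `encode("maTka", MANNER)` gives `NvOOv`. A symbol missing from the tables raises a KeyError, which makes unexpected input easy to spot.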
2.3.9 Column Voicing
This column reproduces the structure of a PhF according to the voicing of constituent consonants (see table 2). The symbols used are given in table 15. The category “voicing” is not relevant for vowels and nuclear semiconsonants, and they are thus transcribed with minuscule
letters.
Symbol   Stands for
U        voiceless
Z        voiced
X        indifferent (nasals and voicing archi-phonemes)
v        vowels
w        nuclear semiconsonants
Table 15: Symbols used in column Voicing
2.3.10 Column Horiz
This column reproduces the structure of a PhF according to the horizontal axis of constituent
vowels (see table 1). The symbols used are given in table 16. The category “horizontal axis”
is not relevant for consonants and semiconsonants, and they are thus transcribed with minuscule letters.
Symbol   Stands for
Q        front
E        central
B        back
c        consonants and non-nuclear semiconsonants
w        nuclear semiconsonants
Table 16: Symbols used in column Horiz
2.3.11 Column Vertic
This column reproduces the structure of a PhF according to the vertical axis of constituent
vowels (see table 1). The symbols used are given in table 17. The category “vertical axis” is
not relevant for consonants and semiconsonants, and they are thus transcribed with minuscule
letters. The category is also not relevant for non-high and non-mid vowels; they are transcribed with a minuscule letter, too.
Symbol   Stands for
H        high
M        mid
v        non-high and non-mid vowels (/a/, /ā/, /ä/, /ë/, /ö/)
c        consonants and non-nuclear semiconsonants
w        nuclear semiconsonants
Table 17: Symbols used in column Vertic
2.3.12 Column Quant
This column reproduces the structure of a PhF according to the quantity (length) of constituent vowels (see table 1). The symbols used are given in table 18. The category “quantity” is
not relevant for consonants and semiconsonants, and they are thus transcribed with minuscule
letters.
Symbol   Stands for
S        short
G        long
D        diphthongal
c        consonants and non-nuclear semiconsonants
w        nuclear semiconsonants
Table 18: Symbols used in column Quant
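The three vowel-related columns (Horiz, Vertic, Quant) can be sketched with the same kind of lookup. In the Python sketch below, the names are hypothetical and the diphthongs /ë/, /ä/, /ö/ are taken to be front, central and back respectively, which is how table 1 is read here; the values for Vertic follow the list given in table 17.

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


def build_table(classes):
    return {ph: symbol for symbol, members in classes.items() for ph in nfc(members)}


HORIZ = build_table({"Q": "iīeēë", "E": "aāä", "B": "uūoōö"})
VERTIC = build_table({"H": "iīuū", "M": "eēoō", "v": "aāäëö"})
QUANT = build_table({"S": "ieauo", "G": "īēāūō", "D": "ëäö"})


def encode_vowels(phf, table):
    """Encode vowels by the given table, nuclear semiconsonants as w, the rest as c."""
    result = []
    for ph in nfc(phf):
        if ph in "+=-_.":        # boundary-signaling symbols are skipped
            continue
        if ph in "RL":
            result.append("w")
        else:
            result.append(table.get(ph, "c"))
    return "".join(result)
```

For example, `encode_vowels("sebö", HORIZ)` gives `cQcB`, `encode_vowels("sebö", VERTIC)` gives `cMcv`, and `encode_vowels("sebö", QUANT)` gives `cScD`.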
2.3.13 Column PoS
This column provides information about the part of speech of a given entry. Each digit corresponds to a particular part of speech; see table 19.
Symbol   Stands for
1        nouns
2        adjectives
3        pronouns
4        numerals
5        verbs
6        adverbs
7        prepositions
8        conjunctions
9        particles
0        onomatopoeia
Table 19: Symbols used in column PoS
2.3.14 Columns for dictionaries
The remaining columns mark whether the lexical item is recorded in the dictionaries and
databases of Czech. The zero (0) means the item is not recorded; the digit 1 means that it is
recorded. The list of the dictionaries currently included in the corpus is given in table 20.
Abbreviation   Stands for
SSČ            Slovník spisovné češtiny (4th edition, 2005)
SSJČ           Slovník spisovného jazyka českého (2nd edition, 1989)
PSJČ           Příruční slovník jazyka českého (1935–1957)
CSN            Co v slovnících nenajdete (Novinky v současné slovní zásobě) (1994)
SN1            Nová slova v češtině. Slovník neologizmů 1 (1998)
SN2            Nová slova v češtině. Slovník neologizmů 2 (2004)
SPr            Slovesa pro praxi. Valenční slovník nejčastějších českých sloves (1997)
SVaz           Slovník slovesných, substantivních a adjektivních vazeb a spojení (2005)
FSČ            Frekvenční slovník češtiny (2004)
ASCS           Akademický slovník cizích slov A-Ž (1995)
VSČ            Výslovnost spisovné češtiny (1978)
Table 20: Dictionaries included in the main Corpus
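The codes of tables 19 and 20 can be decoded with two small helpers. This Python sketch assumes that a row is available as a dictionary keyed by the column names, that column PoS may hold one or more digits, and that the dictionary columns hold the strings "0" and "1"; the helper names and the sample row are hypothetical.

```python
POS_LABELS = {
    "1": "noun", "2": "adjective", "3": "pronoun", "4": "numeral",
    "5": "verb", "6": "adverb", "7": "preposition", "8": "conjunction",
    "9": "particle", "0": "onomatopoeia",
}
DICTIONARY_COLUMNS = ["SSČ", "SSJČ", "PSJČ", "SN1", "SN2", "CSN",
                      "SPr", "SVaz", "FSČ", "ASCS", "VSČ"]


def parts_of_speech(row):
    """Decode every digit found in column PoS (assumption: one digit per part of speech)."""
    return [POS_LABELS[digit] for digit in row["PoS"]]


def recorded_in(row):
    """List the dictionaries whose column holds the value 1."""
    return [name for name in DICTIONARY_COLUMNS if row.get(name) == "1"]


# Made-up row for illustration only; the values are not taken from the corpus.
example = {"Ortho": "matka", "PoS": "1", "SSČ": "1", "SSJČ": "1", "SN1": "0"}
print(parts_of_speech(example), recorded_in(example))
```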
3 Sub-corpora
3.1 Municipalities and their parts
This corpus consists of names of the Czech municipalities and their parts (as of 31/12/2013).
It is structured similarly to the main corpus. See table 21 for its heading.
Ortho PhRep Syllab Length Phtagms CVStr Place Manner Voicing Horiz Vertic Quant Type
Table 21: The heading of the Municipalities sub-corpus
The content of all columns except for Type is the same as that of the main corpus.
Column Type marks whether the item in column Ortho is a name of a municipality (M)
or of its part (P).
3.2 Given names and their hypocorisms
This corpus consists of a list of the 785 most common Czech given names and their hypocorisms (in total 5,724 entries). It is structured similarly to the main corpus. See table 22 for its
heading.
Ortho PhRep Syllab Length Phtagms CVStr Place Manner Voicing Horiz Vertic Quant Type Sex
Table 22: The heading of the Given Names sub-corpus
The content of all columns except for Type and Sex is the same as that of the main corpus.
Column Type marks whether the item in column Ortho is a basic form of a given name
(B) or a hypocorism (H).
Column Sex marks whether it is a name of a male (M) or a female (F).
3.3 Czech botanical names
This corpus consists of a list of 2,549 Czech botanical names. It is structured similarly to the
main corpus. See table 23 for its heading. The content of all columns except for Type is the same as that of the
main corpus. Column Type contains only the value “botanical”.
Ortho PhRep Syllab Length Phtagms CVStr Place Manner Voicing Horiz Vertic Quant Type
Table 23: The heading of the Botanical Names sub-corpus
3.4 Czech zoological names
This corpus consists of a list of 30,469 Czech zoological names. It is still being updated and enlarged. It is structured similarly to the main corpus. See table 24 for its heading.
Ortho PhRep Syllab Length Phtagms CVStr Place Manner Voicing Horiz Vertic Quant Type
Table 24: The heading of the Zoological Names sub-corpus
The content of all columns except for Type is the same as that of the main corpus. Column
Type provides information about the zoological class of a given name. See table 25 for a list
of the classes included. The corpus is going to be supplemented with names of other classes.
Type      Stands for
FCC       fungi, corals, ctenophores
mammals   mammals
fish      fish
Table 25: Zoological names included in the Zoological Names corpus
4 Evaluation files
This section describes the content of the evaluation files and lists abbreviations used therein.
4.1 Abbreviations used
Table 26 provides a list of the entities used in evaluation files, i.e. the units for which the evaluations (statistics etc.) are computed.
Type        Stands for
Pgm/DiGto   phonotagm in phonological words as tokens
Pgm/DiGty   phonotagm in phonological words as types
DiGto       phonological word (diaereme group) as a token
DiGty       phonological word (diaereme group) as a type
AcGto       accent group as a token
AcGty       accent group as a type
Wto         orthographical word as a token
Wty         orthographical word as a type
Table 26: Types of entities in evaluation files
Table 27 provides a list of abbreviations used for various types of phonemes.
Type   Stands for
Voc    vowels
Con    consonants
Sem    semiconsonants
NuS    nuclear semiconsonants
PeS    peripheral (non-nuclear) semiconsonants
NucP   nuclear phonemes (vowels and nuclear semiconsonants)
PerP   peripheral phonemes (consonants and non-nuclear semiconsonants)
Table 27: Types of phonemes in evaluation files
Table 28 provides a list of abbreviations used for various types of phoneme combinations.
Type        Stands for
PerComb     peripheral combinations, i.e. combinations of peripheral phonemes
PreComb     pre-nuclear peripheral combinations
PostComb    post-nuclear peripheral combinations
InterComb   inter-nuclear peripheral combinations
PhoPho      combinations of two phonemes
CC          combinations of two peripheral phonemes
CV          combinations of a peripheral phoneme and a nuclear phoneme
VC          combinations of a nuclear phoneme and a peripheral phoneme
VV          combinations of two nuclear phonemes
Table 28: Types of phoneme combinations in evaluation files
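The two-phoneme types CC, CV, VC and VV from table 28 can be illustrated with a short Python sketch that lists every pair of adjacent phonemes in a form and labels it. The function name is hypothetical, and the input is assumed to be a single phonological word with one character per phoneme.

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


# Nuclear phonemes (NucP): vowels and nuclear semiconsonants.
NUCLEAR = set(nfc("iīeēëaāäuūoōö")) | {"R", "L"}


def two_phoneme_combinations(dig):
    """Return (combination, type) for every pair of adjacent phonemes."""
    phonemes = [ph for ph in nfc(dig) if ph != "."]   # ignore phonotagm marks
    result = []
    for first, second in zip(phonemes, phonemes[1:]):
        kind = ("V" if first in NUCLEAR else "C") + ("V" if second in NUCLEAR else "C")
        result.append((first + second, kind))
    return result
```

For /xaoS/ this yields the pairs xa (CV), ao (VV) and oS (VC).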
Table 29 provides a list of other abbreviations used.
Type   Stands for
Occ    absolute occurrence
Rat    ratio of occurrence, i.e. relative occurrence
Table 29: Other abbreviations used in evaluation files
Absolute occurrence is counted as the total occurrence of an entity within a given corpus.
Relative occurrence (= ratio of occurrence) is counted as absolute occurrence of an entity divided by total occurrence of all entities.
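The two measures can be sketched in Python as follows. The function name is illustrative, and the input is assumed to be a plain list with one item per occurrence of an entity (for example one CV pattern per phonological word).

```python
from collections import Counter


def occurrence_table(entities):
    """Return {entity: (Occ, Rat)} for a list of observed entities."""
    counts = Counter(entities)
    total = sum(counts.values())          # total occurrence of all entities
    return {entity: (occ, occ / total) for entity, occ in counts.items()}


table = occurrence_table(["CV", "CVC", "CV", "CV"])
for entity, (occ, rat) in table.items():
    print(entity, occ, rat)
```

In this made-up sample the pattern CV has an absolute occurrence of 3 and a relative occurrence of 0.75.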
4.2 Contents of the evaluation folder
The evaluation folder contains a number of files, each of which provides a different output of
the evaluation.
4.2.1 Main folder
The following files are included in the main folder. Except for overview.txt, all are .csv files
and contain tabular data. In all of them the first row contains a heading and the first column
contains a list of entities for which the values are counted. See above for the abbreviations
used.
cc-pgm.csv
A list of combinations of two PerP attested at any position within Pgm and their absolute and
relative occurrence.
e.g. /tr/ in /koS.tra/
cc.csv
A list of combinations of two PerP attested at any position within DiG and their absolute and
relative occurrence.
e.g. /vl/ in /vlāda/
cv-pattern-acg.csv
A list of consonantal-vocalic patterns for AcG and their absolute and relative occurrence.
C = a PerP
V = a NucP
e.g. /nuďiT=se/ has the pattern CVCVCCV
cv-pattern-dig.csv
A list of consonantal-vocalic patterns for DiG and their absolute and relative occurrence.
e.g. /maTka/ has the pattern CVCCV
cv-pattern-pgm.csv
A list of consonantal-vocalic patterns for Pgm and their absolute and relative occurrence.
e.g. /maT/ in /maT.ka/ has the pattern CVC
cv-pattern-w.csv
A list of consonantal-vocalic patterns for W and their absolute and relative occurrence.
e.g. /se/ in /nuďiT=se/ has the pattern CV
cv-pgm.csv
A list of combinations of a PerP + a NucP attested within Pgm and their absolute and relative
occurrence.
e.g. /ma/ in /maT.ka/
cv.csv
A list of combinations of a PerP + a NucP attested at any position within DiG and their absolute and relative occurrence.
e.g. /da/ in /nuda/
dig-nucp-env-class-occ-ty.csv
Absolute occurrence of classes of NucP in various environments. The environments are listed
in table 30.
Type    Stands for
#       at the end of DiG
V       before a PerP
-       before a - boundary
CV      before a PerP followed by NucP
C#      at the end of DiG before a single PerP
CombV   before a combination of PerP followed by a NucP
Comb#   at the end of DiG before a combination of PerP
Table 30: Environments of phoneme occurrence
dig-nucp-env-class-rat-ty.csv
Relative occurrence of classes of NucP in various environments (see table 30).
dig-nucp-env-ph-occ-ty.csv
Absolute occurrence of NucP in various environments (see table 30).
dig-nucp-env-ph-rat-ty.csv
Relative occurrence of NucP in various environments (see table 30).
dig2-seq-class-occ-ty.csv
Absolute occurrence of classes of NucP in various phonotagms attested within DiG.
M = non-final phonotagm
F = final phonotagm
digit = position of the phonotagm from the beginning (e.g. M1 = first medial phonotagm)
dig2-seq-class-rat-ty.csv
Relative occurrence of classes of NucP in various phonotagms attested within DiG; see dig2-seq-class-occ-ty.csv for abbreviations.
dig2-seq-ph-occ-ty.csv
Absolute occurrence of NucP in various phonotagms attested within DiG; see dig2-seq-class-occ-ty.csv for abbreviations.
dig2-seq-ph-rat-ty.csv
Relative occurrence of NucP in various phonotagms attested within DiG; see dig2-seq-class-occ-ty.csv for abbreviations.
inter.csv
A list of inter-nuclear combinations attested within DiG and their absolute and relative occurrence.
e.g. /St/ in /paSta/
length-acg.csv
Distribution of phonemic length in AcG and the absolute and relative occurrence of AcG of a
given phonemic length. Phonemic length = number of phonemes in a unit.
length-dig.csv
Distribution of phonemic length in DiG and the absolute and relative occurrence of DiG of a
given phonemic length.
e.g. /peS/ has the phonemic length of 3
length-pgm.csv
Distribution of phonemic length in Pgm and the absolute and relative occurrence of Pgm of a
given phonemic length.
e.g. /ma/ in /ma.so/ has the phonemic length of 2
length-w.csv
Distribution of phonemic length in W and the absolute and relative occurrence of W of a
given phonemic length.
e.g. /se/ in /nuďiT=se/ has the phonemic length of 2
overview.txt
An overview and summary of the results. It contains additional results sorted into several categories.
ph-classes.csv
Frequency of phoneme classes. The classes are defined according to distinctive features of
phonemes.
phonemes.csv
Frequency of individual phonemes and their absolute and relative occurrence. Symbol “lL”
stands for the phoneme /l/ and symbol “rR” for the phoneme /r/. In the corpora, symbols “l”
and “r” are used for non-nuclear /l/ and /r/, and symbols “L” and “R” for nuclear /l/ and /r/.
phonotagm-acg.csv
Distribution of AcG according to the number of Pgm and the absolute and relative occurrence
of AcG of a given number of Pgm.
e.g. /poT=uTšitel/ contains 4 Pgm
phonotagm-dig.csv
Distribution of DiG according to the number of Pgm and the absolute and relative occurrence
of DiG of a given number of Pgm.
e.g. /uTšitel/ contains 3 Pgm
phopho-pgm.csv
A list of combinations of any two phonemes attested at any position within Pgm and their
absolute and relative occurrence.
e.g. /sl/ in /slo.vo/
phopho.csv
A list of combinations of any two phonemes attested at any position within DiG and their absolute and relative occurrence.
e.g. /ov/ in /slovo/
post-pgm.csv
A list of post-nuclear combinations attested within Pgm and their absolute and relative occurrence.
e.g. /ST/ in /pro.paST/
post.csv
A list of post-nuclear combinations attested within DiG and their absolute and relative occurrence.
e.g. /rKT/ in /infarKT/
pre-pgm.csv
A list of pre-nuclear combinations attested within Pgm and their absolute and relative occurrence.
e.g. /pr/ in /pro.paST/
pre.csv
A list of pre-nuclear combinations attested within DiG and their absolute and relative occurrence.
e.g. /Ml/ in /MloK/
vc-pgm.csv
A list of combinations of a NucP + a PerP attested within Pgm and their absolute and relative
occurrence.
e.g. /aS/ in /pro.paST/
vc.csv
A list of combinations of a NucP + a PerP attested within DiG and their absolute and relative
occurrence.
e.g. /oS/ in /živoST/
vv.csv
A list of combinations of two NucP attested within DiG and their absolute and relative occurrence.
e.g. /ao/ in /xaoS/
4.2.2 Subfolder quantity-pattern-dig
This subfolder contains files with quantity patterns for phonological words (DiG) of different
number of Pgm (the digit = the number of Pgm).
S = short vowel
G = long vowel
D = diphthongal vowel
W = nuclear semiconsonant
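A quantity pattern of this kind can be sketched in Python as the sequence of symbols for the nuclear phonemes of a phonological word. It is an assumption of this sketch that a pattern consists of exactly one symbol per Pgm, in order of occurrence; the function name is hypothetical.

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


QUANTITY = {
    **dict.fromkeys(nfc("ieauo"), "S"),   # short vowels
    **dict.fromkeys(nfc("īēāūō"), "G"),   # long vowels
    **dict.fromkeys(nfc("ëäö"), "D"),     # diphthongal vowels
    "R": "W", "L": "W",                   # nuclear semiconsonants
}


def quantity_pattern(dig):
    """Return one symbol per nuclear phoneme; peripheral phonemes are skipped."""
    return "".join(QUANTITY[ph] for ph in nfc(dig) if ph in QUANTITY)
```

Under this assumption, /uTšitel/ has the pattern SSS and /pLnī/ the pattern WG.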
5 References
Bičan, Aleš. 2013. Phonotactics of Czech. Peter Lang.
Bičan, Aleš. 2014. “K pojmu fonologické slovo v češtině”. Sophia Slavica (eds. Vít Boček –
Bohumil Vykypěl), 13–23. Brno: Tribun EU.
Bičan, Aleš. ms. 1. “Fonologický lexikální korpus a jeho analýza”. Submitted. Draft available
here: <http://www.ujc.cas.cz/phword>.
Kučera, Henry – Monroe, George. 1968. A Comparative Quantitative Phonology of Russian,
Czech, and German. Elsevier.
Palková, Zdena. 2004. “The set of phonetic rules as a basis for the prosodic component of an
autonomous TTS synthesis in Czech”. Phonetica Pragensia X.33–46.
Palková, Zdena. 2013. “Prozodické vlastnosti češtiny ve vztahu k mezislovnímu sandhi”.
Studie k moderní mluvnici češtiny 5, K české fonetice a fonologii, 106–118. Univerzita
Palackého v Olomouci.