Benutzer:WritingLikeHell/Vorbereitung

The Karl Eberhards Corpus of spontaneously spoken southern German in dialogues[1] (KEC for short) is a corpus of spontaneously spoken German recorded between 2014 and 2016 at the Department of General Linguistics of the Eberhard Karls Universität Tübingen. The corpus is hosted at the BAS CLARIN repository and contains forty one-hour acoustic recordings of dialogues between two friends on various topics. The recordings were made in two isolated recording chambers. The corpus contains manual annotations of word boundaries as well as forced-aligned segment and morphological annotations. It also contains electromagnetic articulography (EMA) recordings for thirty speakers. Annotations are provided as TextGrid files for the speech analysis software Praat.

Contents

The KEC contains a total of 79 hours of recorded speech with 450,311 words (23,265 distinct word types). Speakers were allowed to choose the topics themselves so that the discussion would be fluent and natural. As a result, the corpus contains vocabulary from a wide range of topics.

Electromagnetic Articulography

In addition to the acoustic recordings, the corpus contains EMA recordings for 20 speakers with a duration of thirty minutes. The EMA recordings contain 51,762 words (5,364 distinct word types). EMA sensors were placed at the following locations: tongue back, tongue mid, tongue tip, upper teeth, lower teeth, upper lip, lower lip, left lip edge and jaw. Apart from the jaw and LL sensor, all sensors were attached along the midsagittal plane. In addition, three reference sensors were placed at the nasion and at the left and right mastoids.

Frequency distributions of words

Corpora of spoken language make it possible to estimate the frequency distributions of words in a given language. The following table lists the twenty most common words in the corpus[1], together with their relative frequencies.

Word Relative Frequency
ja 0.043
und 0.038
ich 0.026
so 0.024
das 0.022
die 0.020
dann 0.016
auch 0.015
da 0.013
aber 0.012
also 0.011
der 0.011
halt 0.011
ist 0.011
nicht 0.010
du 0.009
war 0.008
was 0.009
hat 0.007
'ne 0.007
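
Relative frequencies like those in the table can be estimated by dividing each word's count by the total number of word tokens. The following is a minimal sketch of that calculation, not the procedure used by the corpus authors. The function name `relative_frequencies` is hypothetical, the toy sample is invented for illustration and is not corpus data, and splitting on whitespace is an assumed, simplified tokenization.

```python
from collections import Counter


def relative_frequencies(tokens, top_n=20):
    """Return the top_n most common tokens paired with their relative frequency.

    Relative frequency = count of the word / total number of tokens.
    (Hypothetical helper for illustration; not part of the KEC tooling.)
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # An empty token list has no frequencies to report.
        return []
    return [(word, count / total) for word, count in counts.most_common(top_n)]


# Toy sample (invented, NOT taken from the corpus); whitespace tokenization
# is an assumption made only to keep the sketch short.
sample = "ja und ich ja so ja und das".split()
for word, freq in relative_frequencies(sample, top_n=2):
    print(f"{word}\t{freq:.3f}")
```

In practice, the token list would be read from the word tier of the corpus annotations rather than from a hand-written string.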

See also

Speech corpus

TIMIT

Buckeye Corpus

External links

The Karl Eberhards Corpus of spontaneously spoken southern German in dialogues - audio and articulatory recordings

References

  1. Arnold, D. and Tomaschek, F.: The Karl Eberhards Corpus of spontaneously spoken Southern German in dialogues - audio and articulatory recordings. In: Draxler, C.; Kleber, F. (eds.): Tagungsband der 12. Tagung Phonetik und Phonologie im deutschsprachigen Raum. Ludwig-Maximilians-Universität München, 2016, pp. 9–11, urn:nbn:de:bvb:19-epub-29405-2.