===============================================================================
                  README - NCHLT Speech Corpus pre-release
===============================================================================

Version: clean v0.1 pre-release

IMPORTANT: Note that this is a pre-release of the National Centre for Human
Language Technologies (NCHLT) Speech Corpus. The final version of the corpus
will be made available at http://rma.nwu.ac.za/. Any queries with regard to
updated versions can be directed to rma@nwu.ac.za.

This corpus is shared under the Creative Commons Attribution 3.0 license. For
more information see license.txt in this directory.

When using this corpus, please cite:
- C. van Heerden, M.H. Davel and E. Barnard, "The semi-automated creation of
  stratified speech corpora", Pattern Recognition Association of South Africa
  (PRASA), Johannesburg, December 2013.

===============================================================================
Additional documentation:
- N.J. de Vries, M.H. Davel, J. Badenhorst, W.D. Basson, F. de Wet, E. Barnard
  and A. de Waal, "A smartphone-based ASR data collection tool for
  under-resourced languages", Speech Communication, Volume 56, January 2014,
  pp. 119-131.
- C. van Heerden, M.H. Davel and E. Barnard, "Medium-Vocabulary Speech
  Recognition for Under-Resourced Languages", 3rd International Workshop on
  Spoken Language Technologies for Under-resourced Languages (SLTU 2012),
  Cape Town, South Africa, May 2012, pp. 146-151.

===============================================================================
LANGUAGE INFORMATION
===============================================================================

A separate speech corpus exists for each of South Africa's eleven official
languages.
Each language is represented by its three-character ISO 639-3:2007 language
code, as listed below:

---------------------------------
| Language code | Language name |
---------------------------------
| afr           | Afrikaans     |
| eng           | English       |
| nbl           | isiNdebele    |
| xho           | isiXhosa      |
| zul           | isiZulu       |
| nso           | Sepedi        |
| tsn           | Setswana      |
| sot           | Sesotho       |
| ssw           | Siswati       |
| ven           | Tshivenda     |
| tso           | Xitsonga      |
---------------------------------

===============================================================================
DIRECTORY AND DATA FORMAT DESCRIPTION
===============================================================================

The corpus for each language consists of audio and transcriptions in the
following directory structure:

nchlt_<lang>
  |_ audio
  |    |_ <spk_id>
  |         |_ *.wav
  |_ transcriptions
       |_ nchlt_<lang>.trn.xml
       |_ nchlt_<lang>.tst.xml

where <lang> and <spk_id> are used as placeholders for the language and
speaker identifiers, respectively.

Transcriptions are provided in XML (UTF-8) format and are subdivided into
separate train and test suites ("nchlt_<lang>.trn.xml" and
"nchlt_<lang>.tst.xml", respectively). Each XML entry contains the orthographic
transcription of the specific utterance as well as additional metadata: the
age and gender of the speaker, the utterance duration and the md5sum of the
audio file.

Individual recordings are encoded as mono 16-bit signed integer PCM sampled at
16 kHz. Recordings are grouped into one directory per speaker, named by a
unique speaker identifier (<spk_id>). Speaker identifiers for the test suite
audio correspond to the integer values 500-507.
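
The format description above can be checked with a short script. The following
Python sketch is illustrative only and is not part of the corpus tooling. It
assumes that speaker directory names parse as plain integers and that the
md5sum stored in the XML metadata is the MD5 digest of the complete .wav file;
neither assumption is confirmed by this README. The function names are
hypothetical.

```python
import hashlib
import wave

# Test-suite speakers are 500-507 inclusive, as stated in this README.
TEST_SPEAKER_IDS = range(500, 508)


def is_test_speaker(speaker_id):
    """Return True if the speaker identifier is an integer in 500-507.

    Assumption: identifiers are plain (possibly zero-padded) integers.
    """
    try:
        return int(speaker_id) in TEST_SPEAKER_IDS
    except ValueError:
        return False


def check_wav_format(path):
    """Return True if the file is mono, 16-bit (2 bytes/sample), 16 kHz PCM."""
    with wave.open(str(path), "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getsampwidth() == 2
            and wav.getframerate() == 16000
        )


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks.

    Assumption: this is comparable to the md5sum field in the XML metadata.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A typical use would be to walk the audio directory of one language, call
check_wav_format() on every *.wav file, and use is_test_speaker() on each
speaker directory name to confirm that it belongs to the expected suite.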

===============================================================================
CORPUS STATISTICS
===============================================================================

Language      speakers   female/male   duration (hh:mm)
Afrikaans     210        107/103       56:22
English       210        100/110       56:26
isiNdebele    148        78/70         56:14
isiXhosa      209        106/103       56:15
isiZulu       210        98/112        56:15
Sepedi        210        100/110       56:20
Sesotho       210        113/97        56:19
Setswana      210        109/101       56:20
Siswati       197        96/101        56:15
Tshivenda     208        83/125        56:17
Xitsonga      198        95/103        56:16

===============================================================================