Paired Data (Segmented): The paired data was manually segmented into ~5-minute segments containing only transcribed audio. This was done by listening to the 15-minute audio files and selecting the portion that started with the first transcribed sentence and ended with the last transcribed sentence. As a result, some extracted segments still contain telephone interviews or clips of audio referring to the news item being addressed. These add ‘noise’, since they are not transcribed and are not studio-quality audio. The transcriptions are also not verbatim: they are the scripts prepared for the news reader, and the readers do not always follow what is written exactly and do make mistakes. Another inconsistency between audio and text arises from the text normalization that was performed. The original text data contained numbers written as digits, which needed to be converted to words and were not always converted correctly (e.g. 2017 can be read as “two thousand and seventeen” or “twenty seventeen”). Converting numbers is not a trivial process, and more work is needed to address this issue, since numbers have a major presence in news and incorrect expansions introduce unneeded errors into the training data. This data set has been compiled into a corpus format containing the segmented audio files together with the normalized transcriptions, which have been aligned at sentence/utterance level and packaged into XML files. No other processing was performed on the data (i.e. no confidence scoring, pronunciation dictionaries, or language models were produced).
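The number-expansion ambiguity described above can be illustrated with a minimal sketch. It assumes the third-party num2words package, which was not necessarily the converter used for this corpus; the function name expand_numbers and the as_year flag are illustrative only.

```python
# Sketch of the digit-to-words ambiguity (e.g. 2017), assuming the
# num2words package; the corpus itself may have used a different converter.
import re
from num2words import num2words

def expand_numbers(text: str, as_year: bool = False) -> str:
    """Replace each digit sequence in the text with a spoken form."""
    def repl(match: re.Match) -> str:
        n = int(match.group())
        # to="year" yields readings such as "twenty seventeen", while the
        # default cardinal form gives "two thousand and seventeen".
        return num2words(n, to="year" if as_year else "cardinal")
    return re.sub(r"\d+", repl, text)

print(expand_numbers("The budget for 2017 was announced."))
# -> The budget for two thousand and seventeen was announced.
print(expand_numbers("The budget for 2017 was announced.", as_year=True))
# -> The budget for twenty seventeen was announced.
```

Because the correct reading depends on context (years, amounts, ordinals), a single rule applied uniformly will mismatch what the news reader actually said in some utterances, which is the source of the errors noted above.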
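The XML schema used for the corpus is not reproduced here; the sketch below is a hypothetical example of how one segment's sentence/utterance-level alignment might be packaged, using Python's standard xml.etree.ElementTree module. The element and attribute names (segment, utterance, start, end) and the file names are assumptions for illustration, not the corpus' actual format.

```python
# Hypothetical packaging of one audio segment and its aligned, normalized
# utterances into an XML file; schema and names are illustrative assumptions.
import xml.etree.ElementTree as ET

def build_segment_xml(audio_file: str, utterances: list[dict]) -> ET.ElementTree:
    """Build an XML tree pairing an audio segment with its aligned utterances."""
    root = ET.Element("segment", attrib={"audio": audio_file})
    for i, utt in enumerate(utterances):
        node = ET.SubElement(root, "utterance", attrib={
            "id": f"utt{i:04d}",
            "start": f"{utt['start']:.2f}",   # seconds into the segment
            "end": f"{utt['end']:.2f}",
        })
        node.text = utt["text"]               # normalized transcription
    return ET.ElementTree(root)

tree = build_segment_xml("news_seg_example.wav", [
    {"start": 0.00, "end": 4.35, "text": "good evening and welcome to the news"},
    {"start": 4.35, "end": 9.80, "text": "the budget for twenty seventeen was announced today"},
])
ET.indent(tree)                               # pretty-print (Python 3.9+)
tree.write("news_seg_example.xml", encoding="utf-8", xml_declaration=True)
```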