Transcription System using Automatic Speech Recognition (ASR) for the Japanese Parliament (Diet) Tatsuya Kawahara (Kyoto University, Japan) Brief Biography • • • • 1995 1995 1995-96 2003- • 2003-06 • 2006- Ph. D. (Information Science), Kyoto Univ. Associate Professor, Kyoto Univ. Visiting Researcher, Bell Labs., USA Professor, Kyoto Univ. IEEE SPS Speech TC member Technical Consultant, The House of Representative, Japan Published 150~ papers in automatic speech recognition (ASR) and its applications Web http://www.ar.media.kyoto-u.ac.jp/~kawahara/ Contents 1. Review of ASR technology 2. ASR system for the Japanese Diet 3. Next-generation transcription system of the Japanese Diet Trend of ASR Informal Phone conversation Classroom lectures style Formal Spontaneous speech Parliament Formal presentation Reading/ Re-speaking one Business meetings Broadcast news Number of speakers multiple Review of ASR technology (1/2) • Broadcast News [world-wide] – Professional anchors, mostly reading manuscripts – Accuracy over 90% • Public speaking, oral presentations [Japan] – Ordinary people making fluent speech – Accuracy ~80% (close-talking mic.) • Classroom lectures [world-wide] – More informal speaking – Accuracy ~60% (pin mic.) Review of ASR technology (2/2) • Telephone conversations [US] – Ordinary people, speaking casually – Accuracy 60%85% • Business meetings [Europe/US] – Ordinary people, speaking less formally – Accuracy 70% (close mic.), 60% (distant mic.) • Parliamentary meetings [Europe/Japan] – Politicians speaking formally – EU: plenary sessions: 90% – Japan: committee meetings: 85% Deployment of ASR in Parliaments & Courts • Some countries – Steno-mask & Voice writing – Re-speaking Commercial dictation software • Some local autonomies in Japan – Direct recognition of politicians’ speech • Japanese Courts – ASR for efficient retrieval from recorded sessions • Japanese Parliaments (=Diet) – to introduce ASR; direct recognition of politicians’ speech – Mostly in committee meetings …interactive, spontaneous, sometimes excited Language-specific Issues in Japanese • Need to convert kana (phonetic symbol) to kanji • Conversion ambiguous many homonym (ex.) KAWAHARA (カワハラ) → 河原 (not 川原) – Very hard to type-in real-time – Only limited stenographers using special keyboards can • Difference in verbatim-style and transcript-style (ex.) おききしたいのですが ききたい(のです) – Re-speaking is not so simple – need to rephrase in many cases ASR Architecture Signal processing X Depend on input condition P(X/W) Recognition Engine (decoder) P(W/X) ∝ P(W) P(W)・P(X/W) output: W=argmax P(W/X) Acoustic model P(X/P) Dictionary P(P/W) Language model P(W) Depend on application /a, i, u, e, o…/ 京都 ky o: t o 京都+の+天気 Current Status of ASR • Problems unsolved – Spontaneous/conversational speech – Noisy environments • Including distant microphones • Solutions ad-hoc – Collect large-scale “matched” data (corpus) • Same acoustic environment, speakers (10hours~) • Cover same topics, vocabulary (~M words) – Prepare dedicated acoustic & language models • Huge cost in development & maintenance Contents 1. Review of ASR technology 2. ASR system for the Japanese Diet 3. Next-generation transcription system of the Japanese Diet ASR Research in Kyoto Univ. • Since 1960s, one of the pioneers • Development of free software Julius • Research in spontaneous speech recognition – – – – 1999200120042003- Oral presentations TV discussions Classroom lectures Parliamentary meetings Free ASR Software: Julius • Developed since 1997 in Kyoto-U & other sites • Open-source multi-platform (Linux, Mac, Windows, iPhone) • Open architecture – Independent from acoustic & language models Ported to many languages Ported to many applications (telephony, robot…) • Standard model for Japanese • Widely-used research platform http://julius.sourceforge.jp Corpus of Parliamentary Meetings • Cover all major committees and plenary sessions • 200 hours, 2.4M words • Faithful transcripts of utterances including fillers, which are aligned with official minutes {えー}それでは少し、今{そのー}最初に大臣からも、{そのー} 貯蓄から投資へという流れの中に{ま}資するんじゃないだろ うかとかいうような話もありましたけれども、{だけど/だけれど も}、{まあ}あなたが言うと本当にうそらしくなる{んで/ので}{で すね、えー}もう少し{ですね、あのー}これは{あー}財務大臣 に{えー}お尋ねをしたいんです{が}。 {ま}その{あの}見通しはどうかということでありますけれども、 これについては、{あのー}委員御承知の{その}「改革と展望」 の中で{ですね}、我々の今{あのー}予測可能な範囲で{えー} 見通せるものについてはかなりはっきりと書かせていただい て(い)るつもりでございます。 ASR modules oriented for Spontaneous Speech Signal processing X Corpus P(X/W) Recognition Engine (decoder) P(W/X) ∝ P(W) P(W)・P(X/W) Acoustic model Dictionary Language model Cover poor articulation Cover pronunciation variations Cover disfluencies & colloquial expressions Innovative techniques ASR Performance • Accuracy – Word accuracy 85% (Character accuracy 87%) • Plenary sessions 90% • Committee meetings 80~87% – 90% seems almost perfect – No commercial software can achieve!! • Real-time factor 1-3 – Latency in 10 min. Related Techniques • Noise suppression & dereverberation – Not serious once matched training data available • Speaker change detection – Preferred – Current technology level seems not sufficient • Auto-edit – Filler removal easy – Colloquial expression replacement non-trivial – Period insertion still research stage Contents 1. Review of ASR technology 2. ASR system for the Japanese Diet 3. Next-generation transcription system of the Japanese Diet The House of Representatives in Japan • 2005: terminated recruiting stenographers • 2006: investigated ASR technology for the new transcription system • 2007: developed a prototype system and made preliminary evaluations • 2008: system design • 2009: system implementation • 2010: trial and deployment ASR system: Kyoto Univ. model integrated to NTT engine Signal processing X P(X/W) Recognition Engine (decoder) Acoustic model P(X/P) Dictionary P(P/W) P(W/X) ∝ P(W) P(W)・P(X/W) Language model P(W) NTT Corp. Kyoto Univ. House /a, i, u, e, o…/ 京都 ky o: t o 京都+の+天気 Issues in Post-Editor • For efficient correction of ASR errors and cleaning transcript into document-style • Easy reference to original speech (+video) – by time, by utterance, by character (cursor) – Can speed up & down speech-replay • Word-processor interface (screen editor); not line editor – to concentrate on making correct sentences – Serious misunderstanding between system developers and stenographers!! System Evaluation (@Kyoto) edit time (min) • Subjects:18 students • Post-editing ASR outputs is more efficient than typing from scratch, regardless of the accuracy Those hard for ASR are also hard for human Type from scratch Post-edit ASR output 10 9 8 7 6 5 4 3 50 55 60 65 70 75 80 ASR accuracy 85 90 95 System Evaluation (@Kyoto) • Subjective evaluation correlates with ASR accuracy • Threshold in 75% to have ASR preferred 7 Usability score of ASR 6 5 4 3 2 1 50 55 60 65 70 75 80 85 ASR accuracy 90 95 System Evaluation (@House) • Subjects: 8 stenographers • System: proto-type • ASR-based system reduced the edit time, compared with current short-hand system – 78 min. 68 min. (for 5 min. segment) • Threshold in ASR accuracy of 80% – 75% degradation in edit time; a half say negative in using ASR Side effect of ASR-based system • Everything (text/speech/video) digitized and hyper-linked Efficient search & retrieval • Less burden? may work on longer segments?? • Significantly less special training needed compared with current short-hand system Conclusions • ASR of parliamentary meetings is feasible, given a large collection of data – ~100 hour speech – ~1G word text (minutes) – Accuracy 85-90% • Effective post-processing is still under investigation • Automatic translation research is also ongoing
© Copyright 2025 ExpyDoc