Human–machine collaboration in transcription
Abstract
As automatic speech recognition (ASR) has improved, it has become a viable tool for content transcription. Before ASR was applied to this task, content transcription was achieved through human effort alone. Despite these improvements, ASR performance remains imperfect, especially in more challenging conditions (eg multiple speakers, noise, nonstandard accents). Given this, a promising way forward is a human-in-the-loop (HIL) approach. This contribution describes our work with HIL ASR on the transcription task. Traditionally, ASR performance has been measured using word error rate (WER). This measure may not capture the full set of errors that a speech-to-text (STT) pipeline designed for transcription can make, such as those involving capitalisation, punctuation and inverse text normalisation (ITN). Consequently, improved WER does not always lead to increased productivity, and including ASR in an HIL workflow may even reduce productivity if the first draft contains too many errors. Rev.com provides a convenient laboratory for exploring these questions. Originally, the company provided transcriptions of audio and video content produced solely by humans (known as Revvers). More recently, ASR was introduced in an HIL workflow in which Revvers post-edit an ASR first draft. We provide an analysis of the interaction between metrics of ASR accuracy and the productivity of our 72,000+ Revvers, who transcribe more than 15,000 hours of media every week. To do this, we use two measures of transcriptionist productivity: transcriber real-time factor (RTF) and words per minute (WPM). Through our work, we hope to focus attention on the human productivity and quality of experience (QoE) aspects of improvements in ASR and related technologies. Given the broad scope of content transcription applications and the still elusive objective of perfect machine performance, keeping the human in the loop, in both practice and mind, is critical. This paper provides an overview of human and machine transcription and Rev’s marketplace, followed by an analysis of the relationship between ASR accuracy and transcriptionist productivity, and concludes with suggestions for future work.
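To make the metrics named above concrete, here is a minimal, illustrative sketch in Python (ours, not drawn from the full article, which is available only to subscribers); the function names and sample figures are hypothetical. WER is word-level edit distance divided by the number of reference words; transcriber RTF is time spent transcribing divided by media duration; WPM is words produced per minute of transcriber effort.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def transcriber_rtf(work_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: transcriber effort relative to media length (lower is faster)."""
    return work_seconds / audio_seconds

def words_per_minute(word_count: int, work_seconds: float) -> float:
    """Throughput: words finalised per minute of transcriber effort."""
    return word_count / (work_seconds / 60)

if __name__ == "__main__":
    # One deletion against six reference words: WER = 1/6.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    # One hour of work on 20 minutes of audio: RTF = 3.0.
    print(transcriber_rtf(work_seconds=3600, audio_seconds=1200))
    # 1,800 words in that hour: WPM = 30.0.
    print(words_per_minute(word_count=1800, work_seconds=3600))

Note that a WER computed on normalised text, as above, is blind to the capitalisation, punctuation and ITN errors the abstract mentions (eg 'twenty five dollars' vs '$25'), which is one reason improved WER alone may not predict transcriber productivity.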
The full article is available to subscribers to the journal.
Authors' Biographies
Corey Miller studied linguistics and Romance languages at Harvard and completed a PhD in linguistics at the University of Pennsylvania. Since then, he has worked across a range of speech and language technologies, focusing principally on speech synthesis and recognition, with a deep emphasis on language and linguistic issues. Prior to joining Rev, Corey worked at institutions including Nuance, MITRE and the University of Maryland. Corey is an Adjunct Lecturer at Georgetown University, where he offers a course on speech technology.
Miguel Jetté is Vice President of Artificial Intelligence at Rev. He leads Rev’s speech research and development team and has over 20 years’ experience in speech recognition and machine learning. Before joining Rev, Miguel was a speech scientist at VoiceBox and Nuance Communications, where he created state-of-the-art speech models across major industries in multiple languages. Miguel holds a Master’s in mathematics and statistics from McGill University.
Dan Kokotov is CTO at CNaught, Inc., building the easy button for carbon offsets. Previously, Dan was Vice President of Engineering at Rev, whose vision is to unlock the full potential of speech for everyone, everywhere, by marrying best-in-class artificial intelligence with the world’s best human transcription force. Prior to joining Rev, Dan worked on software for cancer research and infectious disease tracking.