Imagine playing a memory game and a word game simultaneously. Now imagine doing it while driving, having a conversation and following a GPS, and you will get some idea of the mental gymnastics involved in live captioning. Before I became a live respeaker, I had no idea of the cognitive challenges of producing live captions.

A respeaker listens to the original audio of a live program in real time and ‘respeaks’ it, including punctuation, into speech recognition software, which transforms the speech into captions displayed on screen with minimal delay. A respeaker holds the previous sentence in memory while listening to the next one, analyses and summarises what’s just been said, identifies the punctuation required and any changes of speaker, paraphrases where necessary, and then speaks it all out again to produce the captions that appear on screen.

On top of this, the respeaker has to monitor the output of the speech recognition software, identifying and correcting any misrecognitions before the wrong words reach the screen, and make sure the technology is running properly and the captions are being transmitted. The respeaker also has to move the captions around the screen in real time to avoid obscuring speakers’ mouths, graphics, supers and anything else that is important for the viewer to see. All the while, the ultimate goal is to make sure the captions are clear, accurate and accessible for the viewer, with as little delay as possible and without distracting too much from what’s happening on screen. Unsurprisingly, this puts quite a demand on the brain.

The cognitive system that allows us to hold and retrieve information over short periods is called working memory. Evidence suggests that working memory can only handle a limited amount of input at any one time, and too much input at once creates excessive ‘cognitive load’. This overloads the working memory and reduces our ability to pay enough attention to any of the many things we’re trying to deal with simultaneously. This is one of the main reasons, for example, that texting while driving is so dangerous.

The component of working memory thought to hold audio in short-term memory is called the phonological loop. It allows us to store and retrieve what we have just heard, a vital part of the process of live captioning. Studies have also shown that the amount of auditory information you can store is reduced when you’re talking at the same time, which adds to the challenge for live respeakers!

Of course, like driving, the more practice you’ve had, the more you stop consciously thinking about the mechanics and can go on a sort of cognitive autopilot. This leaves room in your working memory to deal with the other things happening at the same time. Think back to when you first started learning to drive, when the simplest of tasks like changing lanes safely seemed impossibly difficult. But with practice, your brain uses less effort to perform these basic functions, allowing you to concentrate more on everything else. This also applies to respeaking; practice and experience allow us to manage the cognitive load and pay closer attention to all the various parts of the process.

Live captioners rely heavily on the executive function component of working memory. This is what enables the rapid switching of attention and the analysis, synthesis and chunking of the information stored in short-term memory. Some things are naturally easier to respeak than others. A clear, evenly paced documentary, for example, is easier to caption than a four-person panel show where the panellists interrupt and talk over each other with lots of slang and inside jokes. This is mostly because of the reduced cognitive load – less effort is devoted to simply keeping up with the pace of the audio and switching between multiple speakers, leaving more capacity for the executive function to analyse, paraphrase and focus on other details.

The other common method of producing live captions is stenocaptioning, where trained stenographers use a special shorthand keyboard to type out the dialogue in real time. A stenocaptioner can spell out a whole syllable or word with a single stroke of the keys, and can produce captions faster than a respeaker – sometimes over 300 words per minute.

While stenocaptioners don’t have to focus on speaking at the same time, they have to remember the thousands of different key combinations required to produce the different words they have in their dictionaries, which is an entirely different type of cognitive challenge. Translating the audio into the shorthand on their keyboard and back out again as captions is almost like translating to another language. What’s more, while respeakers work in pairs, usually taking turns live captioning every 15 minutes, stenocaptioners can go for over three hours by themselves. As a respeaker, I’m in awe of what stenocaptioners can do.

Next time you’re watching something live with captions, spare a thought for the cognitive challenge of live captioning!

Written by Rohan Williams, Captioner




