From Your Mouth to Your Screen, Transcribing Takes the Next Step

This article is part of our continuing Fast Forward series, which examines technological, economic, social and cultural shifts that happen as businesses evolve.

LOS ALTOS, Calif. — Sam Liang longs for his mother and wishes he could recapture the things she told him when he was in high school.

“I really miss her,” he said of his mother, who died in 2001. “Those were precious lifetime moments.”

Mr. Liang, the chief executive and a co-founder of Otter.ai, a Silicon Valley start-up, has set out to do something about that. His company offers a service that automatically transcribes speech with accuracy high enough that it is gaining popularity with journalists, students, podcasters and corporate workers.

Improvements in software have made automatic speech transcription possible. Trained on vast quantities of human speech, neural networks can recognize spoken language with accuracy rates that, in the best circumstances, approach 95 percent. Coupled with the plunging cost of data storage, that makes it possible to use human language in ways that were unthinkable just a few years ago.

Mr. Liang, a Stanford-educated electrical engineer who was a member of the original team that designed Google Maps, said that data compression had made it possible to capture the speech conversation of a person’s entire life in just two terabytes of information — compact enough to fit on storage devices that cost less than $50.
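
A rough calculation bears that out. Below is a minimal sketch in Python, assuming a compressed speech codec at about 16 kilobits per second and three hours of conversation a day; both figures are illustrative assumptions, not Mr. Liang’s:

```python
# Back-of-the-envelope estimate of lifetime speech storage.
# The bit rate and hours per day are illustrative assumptions,
# not figures from Otter.ai.

BITRATE_BPS = 16_000   # roughly 16 kbit/s, typical of a compressed speech codec
HOURS_PER_DAY = 3      # assumed time spent in conversation each day
YEARS = 80             # assumed lifespan

seconds_of_speech = YEARS * 365 * HOURS_PER_DAY * 3600
total_bytes = seconds_of_speech * BITRATE_BPS / 8

# Prints about 0.63 TB; a richer codec or more hours per day
# pushes the total toward the two terabytes cited above.
print(f"{total_bytes / 1e12:.2f} TB")
```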

[Video: Mr. Liang demonstrates the Otter.ai automatic speech transcription service. Credit: Jim Wilson/The New York Times]

The rapid improvement in speech recognition technology, which over the past decade has given rise to virtual assistants such as Apple’s Siri, Amazon’s Alexa, Google Assistant and Microsoft’s Cortana, is spilling into new areas that are beginning to have a significant impact on the workplace.

These consumer speech portals have already raised extensive new privacy concerns. “Computers have a much greater ability to organize, access and evaluate human communications than do people,” said Marc Rotenberg, the president and executive director of the Electronic Privacy Information Center in Washington. In 2015, the group filed a complaint with the Federal Trade Commission against Samsung, arguing that the capture and storage of conversations by its smart TVs posed a new threat to privacy. Speech transcription potentially pushes traditional privacy concerns into new arenas, both at home and at work, he said.

It will almost certainly pose new privacy questions for corporations. Mr. Liang said that companies were interested in capturing all of the conversations of employees, including what goes on around the water cooler.

“This is the power of this new knowledge base for the enterprise,” he said. “They recognize that people spend so many hours every day in meetings, they want to understand how the ideas move around and how people actually talk to each other.”

The rapid advances in the automated transcription market over the past year show striking near-term potential across a growing array of applications. This fall, for example, students at the University of California, Los Angeles, who require assistance with note taking, such as those who are hearing-impaired, are being equipped with the Otter.ai service. The system is designed to replace the current process, in which other students take notes during class and then share them.

In May, when the former first lady, Michelle Obama, visited campus as part of a student signing day celebration, deaf students were given access to an instantaneous transcription of her speech generated by the service.

Zoom, the maker of a web-based video conferencing system, offers a transcription option powered by Otter.ai that makes it possible to capture a transcript of a business meeting in real time and store and search it online. Otter.ai and other companies also offer the ability to separate and then label different speakers within a single transcript.
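
To make the speaker-separation feature concrete, here is a minimal sketch of how a diarized transcript might be represented and rendered. The data structure is hypothetical, for illustration only; it is not Otter.ai’s or Zoom’s actual API:

```python
# Sketch of a diarized transcript: each segment carries a speaker label
# assigned by the diarization step, and consecutive segments from the
# same speaker are collapsed into one labeled turn for display.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str  # label from the diarization step, e.g. "Speaker 1"
    start: float  # seconds into the recording
    end: float
    text: str

def render_transcript(segments: list[Segment]) -> str:
    turns: list[tuple[str, str]] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1][0] == seg.speaker:
            turns[-1] = (seg.speaker, turns[-1][1] + " " + seg.text)
        else:
            turns.append((seg.speaker, seg.text))
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)

print(render_transcript([
    Segment("Speaker 1", 0.0, 2.1, "Let's review the quarterly numbers."),
    Segment("Speaker 1", 2.1, 3.0, "Starting with revenue."),
    Segment("Speaker 2", 3.2, 4.5, "Revenue was up eight percent."),
]))
```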

Rev, which began in 2010 using temporary workers to offer transcription for $1 a minute, now also offers an automated speech transcription service for 10 cents a minute. As a result, transcription is pushing into a variety of new areas, including captioning for YouTube channels and corporate training videos, and focus-group transcripts for market research firms.

The Rev system lets customers choose between higher accuracy and a quicker turnaround at lower cost, said Jason Chicola, the company’s founder and chief executive. Increasingly, his customers correct machine-generated transcripts rather than commissioning transcription from scratch. He said that while Rev had 40,000 human transcribers, he did not believe that automated transcription would decimate his work force. “Humans and machines will work together for the foreseeable future,” he said.

Nevertheless, speech technologies are having an undeniable impact on the structure of corporations.

“We have chatbots that are running live in production, and they are deflecting a lot of service cases,” said Richard Socher, the chief scientist at Salesforce, a cloud-based software company. “In large service organizations, with thousands of people, if you can automate 5 percent of password reset requests, it’s a big impact on that organization.”

In the medical field, automated transcription is changing the way doctors take notes. In recent years, electronic health record systems have become part of the routine office visit, and doctors have been criticized for looking at their screens and typing rather than maintaining eye contact with patients. Now, several health start-ups offer transcription services that capture text, and potentially video, in the examining room and use a remote human transcriber, or scribe, to edit the automated text and produce a “structured” set of notes from the patient visit.
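
In practice, a “structured” note often follows a standard layout such as SOAP (Subjective, Objective, Assessment, Plan). The sketch below shows the general shape of such a pipeline; the types and workflow are illustrative assumptions, not a description of any one vendor’s system:

```python
# Sketch of a scribe-assisted documentation pipeline: an automated draft
# from the transcript is corrected by a human scribe into a structured
# SOAP note. Illustrative only; not any particular company's system.
from dataclasses import dataclass, replace

@dataclass
class SoapNote:
    subjective: str = ""  # what the patient reports
    objective: str = ""   # exam findings, vitals
    assessment: str = ""  # the doctor's diagnosis
    plan: str = ""        # treatment and follow-up

def draft_from_transcript(transcript: str) -> SoapNote:
    # Placeholder for the automated step; a real system would route
    # sentences from the transcript into the proper sections.
    return SoapNote(subjective=transcript)

def scribe_review(draft: SoapNote, **corrections: str) -> SoapNote:
    # The human scribe corrects and completes the machine draft.
    return replace(draft, **corrections)

note = scribe_review(
    draft_from_transcript("Patient reports a sore throat for three days."),
    assessment="Acute pharyngitis",
    plan="Rest, fluids, follow-up in one week",
)
print(note)
```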

One of the companies, Robin Healthcare, based in Berkeley, Calif., records office visits and transcribes them with an automated speech recognition system; the transcripts are then annotated by a staff of human “scribes” who work in the United States, according to Noah Auerhahn, the company’s chief executive. Most of the scribes are pre-med students who listen to the doctor’s conversation and produce a finished record within two hours of the patient’s visit. The Robin Healthcare system is being used at the University of California, San Francisco, and at Duke University.

A competitor, DeepScribe, also based in Berkeley, takes a more automated approach to generating electronic health records. The firm uses several speech engines from large technology companies like Google and IBM to transcribe the conversation, then creates a summary of the examination that is checked by a human. By relying more heavily on automation, DeepScribe can offer a less expensive service, said Akilesh Bapu, the company’s chief executive.

In the past, human speech transcription was largely limited to the legal and medical fields. This year, the cost of automated transcription has collapsed as rival start-ups compete for a rapidly growing market. Companies such as Otter.ai and Descript, a rival San Francisco-based start-up created by the Groupon founder Andrew Mason, are giving away basic transcription services and charging for subscriptions that offer enhanced features.

An example of this new functionality is a web-based service that Descript announced in September, intended to let podcasters edit audio and video just as they would edit text in a word processor. In the past, audio and video editing required special skills and software. Now, Descript hopes to open them to a more general audience, Mr. Mason said.

“Automatic transcription was becoming accurate enough and cheap enough that it was actually kind of usable,” he said. “We thought, gosh, wouldn’t it be cool to just build an audio editor that works like a word processor. We floated this idea by some of our producer friends, and they were all like, ‘Well, duh, yeah, we had that idea 20 years ago, when are you guys going to do that?’”
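
The idea underneath such an editor is straightforward: a transcript aligned to the recording gives every word start and end timestamps, so deleting words from the text tells the editor which spans of audio to cut. Here is a minimal sketch of that mapping, with hypothetical types; it illustrates the technique, not Descript’s implementation:

```python
# Sketch of transcript-driven audio editing: each word carries timestamps,
# so deleting word indices from the text yields the audio spans to keep.
# Hypothetical types for illustration; not Descript's implementation.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

def spans_to_keep(words: list[Word], deleted: set[int]) -> list[tuple[float, float]]:
    keep: list[tuple[float, float]] = []
    for i, word in enumerate(words):
        if i in deleted:
            continue
        if keep and abs(keep[-1][1] - word.start) < 0.05:
            keep[-1] = (keep[-1][0], word.end)  # merge adjacent spans
        else:
            keep.append((word.start, word.end))
    return keep

words = [Word("We", 0.0, 0.2), Word("um", 0.2, 0.5),
         Word("shipped", 0.5, 0.9), Word("it", 0.9, 1.1)]
print(spans_to_keep(words, deleted={1}))  # cutting "um" -> [(0.0, 0.2), (0.5, 1.1)]
```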

Speech scientists emphasize that while automated transcription systems have improved significantly, they are still far from perfect. The 95 percent accuracy figure is attainable only under the best circumstances; an accent, a poorly positioned microphone or background noise can cause accuracy to fall.
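
Those accuracy figures are typically one minus the word error rate: the word-level edit distance between what was said and what the machine produced, divided by the number of words actually spoken. A minimal implementation:

```python
# Word error rate (WER): the metric behind claims like "95 percent
# accurate." Computed as the Levenshtein distance between the reference
# and hypothesis word sequences, divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

# One wrong word in twelve: WER of about 0.08, or "92 percent accurate."
print(wer("the meeting will start at nine tomorrow morning in the main room",
          "the meeting will start at night tomorrow morning in the main room"))
```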

The hope for the future is another speech technology, natural language processing, which tries to capture the meaning of words and sentences and could eventually raise computer accuracy to human levels. For now, though, natural language processing remains one of the most challenging frontiers in artificial intelligence.

Christopher Manning, a Stanford University computer scientist who specializes in natural language processing, addressed the issue during a recent speech in San Jose, Calif.

“There is still so much that computers can’t do that humans do effortlessly that I’m absolutely certain that I won’t have to find a new field before I retire,” he said.
