Papercup Raises £8M for AI Voice Translation

Papercup, a UK-based artificial intelligence company that translates voices into other languages, has raised £8 million in a new funding round. Its technology is already gaining traction in the video and television sectors.
LocalGlobe and Sands Capital Ventures led the round, with participation from Sky, GMG Ventures, Entrepreneur First (EF), and BDMI. Papercup plans to use the funds to advance its machine learning research and to expand its “human-in-the-loop” quality assurance system, which improves and customizes the quality of its AI-translated videos.
The company’s existing investors include William Tunstall-Pedoe, the founder of Evi Technologies (later acquired by Amazon for the development of Alexa), and Zoubin Ghahramani, formerly chief scientist and VP of AI at Uber, who now holds a leadership position at Google Brain.
Established in 2017 by Jesse Shemen and Jiameng Gao during their participation in EF’s company builder program, Papercup is developing an AI and machine learning platform designed to translate not only the words but also the nuances and expressiveness of a person’s voice into other languages. The startup asserts that its voice translations are “indistinguishable” from natural human speech, and uniquely aims to preserve the characteristics of the original speaker’s vocal style.
The technology is currently used by video producers including Sky News and Discovery, by popular YouTube channels such as Yoga with Adriene, and by independent content creators. It is presented as a faster and cheaper alternative to traditional, fully human dubbing.
According to Papercup co-founder and CEO Shemen, “A vast majority of the world’s video and audio content remains limited to a single language.” He cites billions of hours of videos on platforms like YouTube, millions of podcast episodes, thousands of courses on Skillshare and Coursera, and extensive content libraries on Netflix. “Content owners are actively seeking international expansion, but a simple and affordable solution for translation beyond subtitles remains elusive.”
While “deep pocketed studios” can pay for high-quality dubbing with professional voice actors and recording facilities, the cost is prohibitive for most content creators. Even well-funded studios are often limited in the number of languages they can support.
Shemen explains, “This leaves the majority of content owners – approximately 99% – unable to reach international audiences beyond subtitles.” He emphasizes that Papercup aims to address this gap, stating, “Our goal is to produce translated voices that closely resemble the original speaker’s voice.”
To achieve this, Shemen outlines four key areas of focus. First is the creation of “natural sounding” voices, prioritizing clarity and a human-like quality. The second challenge involves replicating the emotion and pacing of the original speaker. Third is capturing the unique qualities of an individual’s voice. Finally, the translated audio must be precisely synchronized with the video.
Shemen elaborates, “We initially concentrated on developing voices that sound as natural and human-like as possible, achieving significant advancements in quality through focused technological refinement. We now have one of the leading Spanish speech synthesis systems currently available.”
“Our current efforts are directed towards improving the retention and transfer of the original emotion and expressiveness from the speaker across different languages, while simultaneously investigating the elements that constitute high-quality dubbing.”
The next significant hurdle, and perhaps the most complex, is “speaker adaptation,” which involves accurately capturing the distinctiveness of a person’s voice. “This represents the final stage of adaptation,” notes Shemen, “but it was also one of our earliest research breakthroughs. While we have models capable of achieving this, we are currently prioritizing emotion and expressiveness.”
Despite its reliance on artificial intelligence, Papercup also incorporates a “human-in-the-loop” process for quality control. This involves making corrections and adjustments to the translated audio, addressing any errors in speech recognition or machine translation, refining audio timing, and ensuring accurate emotional delivery and pacing.
The extent of human intervention varies with the content type and the content owner’s requirements, that is, how realistic and polished the result needs to be. Shemen points out that “good enough” is often sufficient for a large volume of content.
Regarding the technology’s origins, Shemen credits co-founder and CTO Jiameng Gao, describing him as “exceptionally intelligent and deeply fascinated by speech processing.” Gao holds two master’s degrees from the University of Cambridge – in machine learning and speech language technology – and his thesis focused on speaker-adaptive speech processing. It was during his time at Cambridge that he recognized the potential for a technology like Papercup.
“When we began collaborating at Entrepreneur First in late 2017, we developed initial prototype systems that demonstrated the feasibility of this technology, despite the lack of any existing precedent,” says Shemen. “Early discussions revealed overwhelming demand for what we were building, and the primary challenge became creating a system suitable for a production environment.”