Cartesia AI: Run AI Anywhere with Efficiency

The Rising Costs of AI Development and Operation

The cost of developing and running artificial intelligence systems keeps climbing. By some estimates, OpenAI’s AI operations could cost as much as $7 billion this year.

Furthermore, the CEO of Anthropic has indicated that AI models requiring over $10 billion in investment could be introduced in the near future.

The Search for Cost-Effective AI Solutions

Consequently, a significant effort is underway to identify methods for reducing the costs associated with AI. This pursuit involves a variety of approaches.

Some research is dedicated to refining the architectures of existing AI models – essentially, optimizing the underlying structure and components that govern their functionality.

Exploring Novel AI Architectures

Other researchers are concentrating on the creation of entirely new architectures, believing these offer a greater potential for affordable scalability.

Karan Goel is a proponent of this latter approach. At Cartesia, the startup he co-founded, Goel is developing state space models (SSMs).

SSMs represent a more recent and notably efficient model architecture. They are designed to process substantial volumes of data – including text and images – concurrently.

The Importance of New Model Architectures

“The development of novel model architectures is essential for constructing AI models that are genuinely practical and impactful,” Goel explained to TechCrunch.

“Given the highly competitive landscape within the AI industry, encompassing both commercial entities and open-source projects, achieving superior model performance is paramount to success.”

Foundations in Academia

Before Cartesia, Goel was a PhD candidate in Stanford University’s Artificial Intelligence Laboratory, where he worked under computer scientist Christopher Ré, among others. While at Stanford, Goel met Albert Gu, another PhD candidate, and the two sketched out the foundation for what would become the Structured State Space Model (SSM).

Goel went on to take part-time jobs at Snorkel AI and Salesforce, while Gu became an assistant professor at Carnegie Mellon University. Both kept working on SSMs, publishing several significant research papers on the architecture’s capabilities.

In 2023, Gu and Goel, along with former Stanford colleagues Arjun Desai and Brandon Yang, founded Cartesia to turn their academic research into commercially viable applications.

Cartesia’s team includes Ré, and the company is a key contributor to numerous variations of Mamba, currently considered one of the most widely adopted SSMs. Gu, in collaboration with Tri Dao, a professor at Princeton University, initiated Mamba as an open-source research initiative last December, and ongoing development continues through subsequent releases.

Cartesia’s work extends beyond Mamba: the company also trains its own proprietary SSMs. Like all SSMs, Cartesia’s models give AI systems something like a working memory, which makes them faster and potentially more efficient in how they use computing resources.

SSMs and Transformers: A Comparative Overview

The vast majority of contemporary artificial intelligence applications, ranging from ChatGPT to Sora, rely on models built upon the transformer architecture. A key component of how a transformer functions is the addition of data to a “hidden state” as it processes information, effectively allowing it to retain a record of what has been analyzed.

For example, consider a model analyzing a novel; the values within the hidden state would represent the words contained within that novel.

The hidden state is a significant contributor to the capabilities of transformers. However, it also introduces limitations in terms of efficiency.

Even to generate a single response relating to a previously processed book, a transformer must examine its entire hidden state – a process that demands computational resources comparable to rereading the entire text.
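To make that cost concrete, here is a toy Python sketch rather than any production model's code. It assumes the “hidden state” described above behaves like a cache that gains one entry per token, and that every new step scans all of it. The names `attend` and `run_transformer_like` are hypothetical, and real transformers add learned projections, multiple heads and many layers.

```python
import math


def attend(query, cache):
    """Softmax-weighted average over every cached vector (toy, single head)."""
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in cache]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(query)
    return [
        sum(w * vec[i] for w, vec in zip(weights, cache)) / total
        for i in range(dim)
    ]


def run_transformer_like(tokens):
    cache = []           # assumption: the "hidden state" grows by one entry per token
    reads_per_step = []  # how many cached entries each step had to scan
    outputs = []
    for tok in tokens:
        cache.append(tok)
        outputs.append(attend(tok, cache))
        reads_per_step.append(len(cache))
    return outputs, reads_per_step


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, reads = run_transformer_like(tokens)
print(reads)  # [1, 2, 3, 4]
```

The number of entries scanned rises with every token, so a question about a long book touches a record as long as the book itself.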

How State Space Models (SSMs) Differ

State Space Models (SSMs) operate on a fundamentally different principle. Instead of retaining all prior data, SSMs compress every preceding data point into a compact summary of everything they have seen so far.

As new data is received, the model’s internal “state” is updated, and the majority of the older data is discarded.

This approach allows SSMs to process extensive datasets with greater efficiency.
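The update-and-discard loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cartesia's or Mamba's implementation: the state is a small fixed-size list, each entry decays by a constant factor, and the function name `run_ssm` and the parameters `decay`, `in_weights` and `out_weights` are hypothetical. Real SSMs learn these parameters and use far larger states.

```python
def run_ssm(inputs, decay, in_weights, out_weights):
    """Toy linear state space recurrence over a stream of scalar inputs."""
    state = [0.0] * len(decay)  # fixed size, regardless of input length
    outputs = []
    for x in inputs:
        # Fold the new input into the summary; older inputs survive
        # only through the decayed values already held in the state.
        state = [a * h + b * x for a, h, b in zip(decay, state, in_weights)]
        outputs.append(sum(c * h for c, h in zip(out_weights, state)))
    return outputs, state


decay = [0.5, 0.9]        # hypothetical per-entry decay factors
in_weights = [1.0, 1.0]   # how strongly each entry absorbs a new input
out_weights = [1.0, 0.0]  # read the output from the first entry only

outputs, final_state = run_ssm([1.0, 0.0, 0.0], decay, in_weights, out_weights)
print(outputs)           # [1.0, 0.5, 0.25]
print(len(final_state))  # 2
```

However long the input stream, the state keeps the same two entries, so each step costs the same amount of work. The trade-off is that the first input fades with every update instead of being kept verbatim.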

Performance and Cost Implications

As a result of this architectural difference, SSMs can handle large volumes of data while outperforming transformers on certain data generation tasks.

Given the increasing costs associated with inference, this represents a compelling advantage.

Here's a summary of the key differences:

Transformers: Maintain a full “hidden state” requiring extensive computational resources for recall.
SSMs: Compress data into a summarized “state,” discarding older information for efficiency.

SSMs offer a potentially more scalable and cost-effective solution for certain AI applications, particularly those dealing with large datasets.

Ethical Considerations Surrounding AI Development

Cartesia operates much like a collaborative research lab, developing State Space Models (SSMs) both in-house and in partnership with outside organizations. Its newest model, Sonic, is an SSM that can clone a person’s voice or generate entirely new voices, with adjustable tone and cadence.

Goel claims that Sonic, available through both an API and a web-based interface, is the fastest model in its class. He stated that “Sonic exemplifies the strengths of SSMs when processing extensive data, such as audio, while simultaneously upholding the highest standards of stability and precision.”

Navigating Ethical Challenges

Although Cartesia has shipped products quickly, it has run into many of the same ethical dilemmas as other AI model developers.

At least some of Cartesia’s SSMs were trained on The Pile, a publicly available dataset known to include unlicensed copyrighted books. AI companies commonly argue that the fair-use doctrine shields them from copyright infringement claims. That has not prevented legal action: authors are currently suing companies including Meta and Microsoft for allegedly training models on The Pile.

Furthermore, Cartesia’s Sonic voice cloning tool currently lacks robust safeguards. Recently, a clone of Vice President Kamala Harris’ voice was successfully created using publicly available campaign speeches. The only requirement is that users check a box confirming they will abide by the startup’s Terms of Service.

Cartesia is not necessarily worse in this respect than other voice cloning tools on the market. However, given reports of voice clones bypassing bank security checks, the optics are poor.

Addressing Concerns and Future Steps

Goel did not confirm whether Cartesia has ceased training models using The Pile. He did, however, address the moderation concerns, informing TechCrunch that Cartesia employs both “automated and manual review” processes and is “developing systems for voice verification and watermarking.”

“Dedicated teams are consistently evaluating aspects such as technical performance, potential misuse, and inherent biases,” Goel explained. “We are also forging partnerships with independent auditors to provide impartial verification of our models’ safety and reliability. We understand that this is a continuous process requiring ongoing improvement.”

Following the publication of this report, a public relations representative for Cartesia communicated via email, stating that the company is “no longer training models on The Pile.”

Emerging Business Venture

According to Goel, “thousands” of customers pay for access to the Sonic API, which is Cartesia’s primary revenue stream. Automated calling application Goodcall is among these clients. The Cartesia API is free for up to 100,000 characters of text-to-speech conversion, while the most comprehensive plan is priced at $299 monthly for 8 million characters. (Cartesia also provides an enterprise-level service featuring dedicated support and customizable usage limits.)
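For a rough sense of scale, the two published figures can be turned into a per-character rate. The short Python calculation below uses only the numbers quoted above; the constant names are hypothetical, and it ignores any intermediate tiers or overage pricing, which the article does not describe.

```python
FREE_CHARS = 100_000        # free allowance quoted above
TOP_PLAN_USD = 299          # monthly price of the most comprehensive plan
TOP_PLAN_CHARS = 8_000_000  # characters included in that plan

# Effective price per million characters on the top plan.
usd_per_million = TOP_PLAN_USD / (TOP_PLAN_CHARS / 1_000_000)

# How many times larger the top plan's allowance is than the free one.
allowance_ratio = TOP_PLAN_CHARS // FREE_CHARS

print(usd_per_million)  # 37.375
print(allowance_ratio)  # 80
```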

Cartesia’s standard practice involves utilizing customer data to enhance its product offerings, a common policy but one that may raise concerns among users prioritizing data privacy. Goel clarifies that users have the option to opt out of this data usage, and that tailored data retention policies are available for larger organizations.

These data practices don’t appear to be hurting Cartesia’s business, at least while the company maintains a technological edge. Bob Summers, CEO of Goodcall, says he chose Sonic because it was the only voice generation model with a latency of less than 90 milliseconds.

Summers further stated that “[It] demonstrated performance four times superior to the next best available option.”

Currently, Sonic is being implemented in areas such as gaming and voice dubbing, but Goel believes its potential is far from fully realized.

His long-term objective is to develop models that can run on any device and can understand and generate various data types, including text, images, and videos, almost instantly. As a first step towards this goal, Cartesia launched a beta version of Sonic On-Device this summer, optimized to run on phones and other mobile devices for applications like real-time language translation.

In conjunction with Sonic On-Device, Cartesia released Edge, a software toolkit designed to optimize State Space Models (SSMs) for diverse hardware configurations, and Rene, a streamlined language model.

“Our overarching ambition is to become the leading provider of multimodal foundation models for all devices,” Goel explained. “Our future development plans include the creation of multimodal AI models, aiming to deliver real-time intelligence capable of processing extensive contextual information.”

Realizing this vision will require Cartesia to demonstrate the value of its architecture to prospective clients, despite the learning curve involved. Maintaining a competitive advantage over other companies exploring alternatives to the transformer architecture will also be crucial.

Companies like Zyphra, Mistral, and AI21 Labs are actively training hybrid models based on the Mamba architecture. Additionally, Liquid AI, under the leadership of robotics expert Daniela Rus, is developing its own architecture.

Goel maintains that Cartesia, with its team of 26 employees, is well-positioned for success, bolstered by recent financial investment. The company recently secured $22 million in funding led by Index Ventures, increasing Cartesia’s total funding to $27 million.

Shardul Shah, a partner at Index Ventures, envisions Cartesia’s technology powering applications in areas such as customer service, sales and marketing, robotics, and security.

“By challenging the conventional reliance on transformer-based architectures, Cartesia has unlocked innovative methods for building AI applications that are real-time, cost-effective, and scalable,” Shah commented. “The market demands faster, more efficient models that can function anywhere – from data centers to individual devices. Cartesia’s technology is uniquely suited to meet this demand and drive the next generation of AI advancements.”

A* Capital, Conviction, General Catalyst, Lightspeed, and SV Angel also participated in the latest funding round for Cartesia, which is based in San Francisco.
