OpenAI Sora Trained on Game Content? Legal Concerns Arise

December 11, 2024

Sora's Training Data: Potential Use of Game Content and Twitch Streams

The precise data sources utilized by OpenAI to train its innovative video-generating AI, Sora, remain undisclosed. However, evidence suggests that Twitch streams and video game walkthroughs may have been incorporated into the training dataset.

Sora's Capabilities and Initial Observations

Launched on Monday, Sora can generate videos up to 20 seconds long from either a text prompt or an image, in a range of aspect ratios and resolutions.

OpenAI previously indicated the model’s training included Minecraft videos. This prompted an investigation into whether other video game playthroughs were also utilized during the development process.

Evidence of Game-Inspired Content Generation

Sora exhibits a capacity to create videos reminiscent of classic games. For instance, it can generate footage resembling a Super Mario Bros. clone, albeit with some visual distortions.

Furthermore, the AI can produce gameplay footage that draws inspiration from popular first-person shooter titles like Call of Duty and Counter-Strike.

The model is also capable of generating clips in the style of a ’90s Teenage Mutant Ninja Turtles arcade fighter.

Recognition of Streaming Culture

Sora appears to understand the visual conventions of Twitch streams, suggesting exposure to such content during its training. The AI can generate screenshots that accurately capture the general layout and elements of a typical stream.

Notably, one generated screenshot bears a striking resemblance to Raúl Álvarez Genes, known as Auronplay on Twitch, even replicating the tattoo on his forearm.

Similarities to Popular Streamers

Beyond Auronplay, Sora has also generated a video depicting a character bearing a resemblance to Imane Anys, widely recognized as Pokimane, though with some artistic modifications.

Circumventing Filters and Identifying Training Data

OpenAI has implemented filters to prevent the generation of content featuring trademarked characters, but tests suggest that game-related content is nonetheless present within Sora’s training data.

Creative prompting, such as requesting an “Italian plumber game,” can yield results despite the filters. Direct requests for specific titles, like “Mortal Kombat 1 gameplay,” are typically blocked.

OpenAI's Data Sourcing Practices

OpenAI has remained relatively opaque regarding its data acquisition methods. In a recent interview, the company’s former CTO, Mira Murati, did not deny the possibility of utilizing content from platforms like YouTube, Instagram, and Facebook.

The technical specifications for Sora confirm the use of “publicly available” data, alongside licensed content from stock media providers such as Shutterstock.

Potential Legal Ramifications

The inclusion of game content in Sora’s training dataset could raise legal concerns, particularly if OpenAI develops more interactive applications based on the AI.

Joshua Weigensberg, an IP attorney, explained to TechCrunch that training AI models on unlicensed footage carries significant risks. The process of training generally involves copying data, and video game playthroughs often contain copyrighted material.

Probabilistic Models in Generative AI

AI models such as Sora operate on probabilistic principles. Trained on vast datasets, they identify recurring patterns and use them to make predictions, for instance, that a person biting into a hamburger will leave a bite mark.

This capability is advantageous, enabling models to simulate a degree of understanding regarding how the world functions through observation. However, it also presents a significant vulnerability. Specifically crafted prompts can cause models – frequently trained on publicly accessible web data – to generate outputs that closely resemble their original training materials.

This tendency has understandably caused concern among creators whose intellectual property has been incorporated into training datasets without their consent. Consequently, an increasing number are pursuing legal recourse to address these issues.

Microsoft and OpenAI are currently being sued over allegations that their AI tools can reproduce licensed code. Midjourney, Runway, and Stability AI, three prominent companies in the AI art space, are defendants in a case brought by artists who accuse them of copyright infringement. Major record labels have also sued Udio and Suno, two startups specializing in AI-driven music generation, alleging copyright infringement.

A common defense employed by many AI companies revolves around the concept of fair use, arguing that their models produce transformative works rather than plagiarized content. Suno, for example, posits that broad-based training is comparable to a musician developing their own songs after immersing themselves in a particular genre.

However, game content introduces unique legal considerations, according to Evan Everist, a copyright law specialist at Dorsey & Whitney.

“Recordings of gameplay sessions often involve multiple layers of copyright protection,” Everist explained to TechCrunch via email. “These include the game’s content, owned by the developer, and the unique video created by the player documenting their experience.”

He further elaborated that certain games may introduce a third layer of rights related to user-generated content within the game itself. For instance, a playthrough video of a custom map in Epic’s Fortnite could involve three copyright holders: (1) Epic Games, (2) the player utilizing the map, and (3) the map’s original creator.

“If courts determine that training AI models constitutes copyright infringement, each of these rights holders could potentially pursue legal action or become a source for licensing agreements,” Everist stated. “Developers training AI on such videos face exponentially increasing risk exposure.”

Weigensberg highlighted that games contain numerous elements subject to copyright protection, such as unique textures, which a court might consider in an intellectual property dispute. “Unless proper licensing is secured for these works,” he noted, “training on them could be deemed an infringement.”

TechCrunch contacted several game studios and publishers for comment, including Epic, Microsoft (owner of Minecraft), Ubisoft, Nintendo, Roblox, and CD Projekt Red, the developer of Cyberpunk. Responses were limited, and none offered an official statement for publication.

A representative from CD Projekt Red indicated they were unable to participate in an interview at this time. EA communicated to TechCrunch that they had no comment to provide.

Potential Legal Challenges for AI Companies

Even so, AI companies could prevail in these legal battles. Courts might determine that generative AI possesses a “highly transformative purpose,” mirroring a previous ruling concerning Google’s digital archiving of books.

Previously, a court permitted Google to copy millions of books for its Google Books project, a digital archive. Authors and publishers contested this, asserting that online reproduction of their intellectual property constituted infringement.

Jesse Saivar, chair of Greenberg Glusker’s IP and digital media and technology groups, explained to TechCrunch that the core issues surrounding copyright infringement by AI models remain unresolved. Key questions include whether copyrighted material is copied during training, if this copying constitutes infringement, and whether it negatively impacts the market for the original work.

Furthermore, it must be determined if copyright holders of the training materials can demonstrate actual harm or injury. A favorable ruling for AI companies, however, wouldn’t automatically protect users from legal claims.

Should a generative model reproduce copyrighted material, individuals publishing or incorporating that material into other projects could still be held accountable for intellectual property infringement, as noted by Weigensberg. The generation of recognizable IP assets remains a significant concern.

Certain AI companies offer indemnity clauses to mitigate these risks, but these clauses often have limitations. For instance, OpenAI’s indemnity applies solely to corporate clients, excluding individual users.

Beyond copyright, other legal risks exist, such as potential violations of trademark rights, as Weigensberg points out. The generated output could include branded assets, like recognizable game characters, creating trademark concerns.

The development of world models introduces further complexity. These models, of which OpenAI considers Sora an example, could be used to generate video games in real time. If these “synthetic” games closely resemble the content used for training, legal issues could arise.

Avery Williams, an IP trial lawyer at McKool Smith, stated that training an AI platform on elements like voices, movements, characters, songs, dialogue, and artwork from a video game constitutes copyright infringement, just as it would if those elements were used in other contexts.

The fair use arguments central to lawsuits against generative AI companies will similarly impact the video game industry and other creative markets.
