OpenAI’s Sora: The devil is in the ‘details of the data’


For OpenAI CTO Mira Murati, an exclusive Wall Street Journal interview yesterday with personal tech columnist Joanna Stern seemed like a slam-dunk. The clips from OpenAI’s Sora text-to-video model, which the company demoed last month and which Murati said could be publicly available in a few months, were “good enough to freak us out” but also adorable or benign enough to make us smile. That bull in a china shop that didn’t break anything! Awww.

But the interview hit the rim and bounced wildly at about 4:24, when Stern asked Murati what data was used to train Sora. Murati’s answer: “We used publicly available and licensed data.” But while she later confirmed that OpenAI used Shutterstock content (as part of the companies’ six-year training data agreement announced in July 2023), she struggled with Stern’s pointed questions about whether Sora was trained on YouTube, Facebook or Instagram videos.

‘I’m not going to go into the details of the data’

When asked about YouTube, Murati scrunched up her face and said “I’m actually not sure about that.” As for Facebook and Instagram? She rambled at first, saying that if the videos were publicly available, there “might be” such content in the training data, though she was “not sure, not confident” about it, before finally shutting the question down by saying “I’m just not going to go into the details of the data that was used — but it was publicly available or licensed data.”

I’m pretty sure few public relations folks considered the interview a PR masterpiece. And there was no chance Murati would have provided details anyway, not with the copyright-related lawsuits OpenAI is facing right now, including the highest-profile one filed by the New York Times.

But whether or not you believe OpenAI used YouTube videos to train Sora (keep in mind, The Information reported in June 2023 that OpenAI had “secretly used data from the site to train some of its artificial intelligence models”), for many the devil really is in the details of the data. Generative AI copyright battles have been brewing for over a year, and many stakeholders, from authors, photographers and artists to lawyers, politicians, regulators and enterprise companies, want to know what data trained Sora and other models — and to examine whether that data really was publicly available, properly licensed and so on.


This is not simply an issue for OpenAI

The issue of training data is not simply a matter of copyright, either. It’s also a matter of trust and transparency. If OpenAI did train on YouTube or other videos that were “publicly available,” for instance — what does it mean if the “public” did not know that? And even if it was legally permissible, does the public understand?

It is not simply an issue for OpenAI, either. Which company is definitely using publicly shared YouTube videos to train its video models? Surely Google, which owns YouTube. And which company is definitely using publicly shared Facebook and Instagram images and videos to train its models? Meta, which owns Facebook and Instagram, has confirmed that it is doing exactly that. Again — perfectly legal, perhaps. But when Terms of Service agreements change quietly — something the FTC recently issued a warning about — is the public really aware?

Finally, it is not just an issue for the leading AI companies and their closed models. The issue of training data is a foundational generative AI issue that in August 2023 I said could face a reckoning — not just in US courts, but in the court of public opinion.

As I said in that piece, “until recently, few outside the AI community had deeply considered how the hundreds of datasets that enabled LLMs to process vast amounts of data and generate text or image output — a practice that arguably began with the release of ImageNet in 2009 by Fei-Fei Li, an assistant professor at Princeton University — would impact many of those whose creative work was included in the datasets.”

The commercial future of human data

Data collection, of course, has a long history — mostly for marketing and advertising. That has always been, at least in theory, about some kind of give and take (though obviously data brokers and online platforms have turned this into a privacy-exploding zillion-dollar business). You give a company your data and, in return, you’ll get more personalized advertising, a better customer experience, etc. You don’t pay for Facebook, but in exchange you share your data and marketers can surface ads in your feed.

There simply isn’t that same direct exchange, even in theory, when it comes to the training data behind massive generative AI models, data that was not provided voluntarily. In fact, many feel it’s the polar opposite — that generative AI models have “stolen” their work, threaten their jobs, or do little of note other than produce deepfakes and content ‘slop.’

Many experts have explained to me that there is a very important place for well-curated and documented training datasets that make models better, and many of those folks believe that massive corpora of publicly available data are fair game — but this is usually meant for research purposes, as researchers work to understand how models behave in an ecosystem that is becoming more and more closed and secretive.

But as the public becomes more educated about it, will people accept the fact that the YouTube videos they post, the Instagram Reels they share and the Facebook posts they set to “public” have already been used to train commercial models making big bank for Big Tech? Will the magic of Sora be significantly diminished if they know the model was trained on SpongeBob videos and a billion publicly available birthday party clips?

Maybe not. Maybe it will all feel less icky over time. Maybe OpenAI and others don’t care that much about “public” opinion as they push to reach whatever they believe “AGI” is. Maybe it’s more about winning over developers and enterprise companies that use their non-consumer options. Maybe they believe — and maybe they’re right — that consumers have long thrown up their hands around issues of true data privacy.

But the devil remains in the details of the data. Companies like OpenAI, Google and Meta might have the advantage in the short-term, but in the long run, I wonder if today’s issues around AI training data could wind up being a devil’s bargain.
