For all the intelligence that we like to ascribe to ChatGPT, the chatbot was essentially homeschooled. Its creator OpenAI trained it on the vast, imperfect glory of the public internet — one reason why ChatGPT makes so many embarrassing mistakes. A lawyer who recently used the chatbot to write his court brief realized he’d blundered when it cited six nonexistent cases. How can ChatGPT get more accurate? Send it to college by training it on better-quality data.
That raises the tantalizing possibility of a new revenue stream for publishers and any other company that owns valuable, accurate text that could be used to train language models. It will be expensive for OpenAI, but it could reinforce the dominance of Sam Altman’s company, along with Google, Meta Platforms and the handful of other large firms that make so-called foundation models. They may become the few that can afford to pay for AI’s higher education.
OpenAI has kept its training data for GPT-4 a secret. But for previous versions it used an online corpus of thousands of self-published books, many of them skewed toward romance and vampire fiction. Academics have found that many popular books that found their way online, like the Harry Potter series, likely feature in GPT-4 too. That has led to chatter among book publishers about whether their prodigious archives could serve as the next training ground — if AI companies are willing to pay.
What better professors for ChatGPT than academic books and journals, with their concentrated expertise in business, medicine, economics and more?
For months, scuttlebutt in the AI field has been that a large chunk of GPT-4’s training data came from Reddit. Then last month, the popular internet forum said it would start charging companies to access its trove of conversations. That got some book publishers wondering if they might be able to do the same for their past work, according to Dan Conway, chief executive officer of the UK Publishers Association. “This is a very live conversation,” he says. “Part of the conversation that needs to happen is how does licensing for content work.”
This isn’t just wishful thinking, because OpenAI may have to start looking beyond the public internet to teach the next iteration of ChatGPT. The online datasets it was trained on have always held fairly reliable data. But now that ChatGPT is a public sensation, those datasets face being spammed with junk data aimed at skewing a chatbot’s results — in the same way SEO spam skews Google results. OpenAI may well need to look further afield and start paying for its next round of training.
The company isn’t the only potential buyer. Others that want to fashion their own language models now want more data too. Investment banks in particular, which want to help their clients do smarter investment research, have been building sophisticated chatbots and training them on data from companies in the insurance, freight, telecommunications and retail industries, according to Brad Schneider, the CEO of Nomad, an online marketplace for data.
Virtually no one outside of the big tech firms like OpenAI and Google is actually building the underlying language models from scratch, but many companies are buying access to those models, like GPT-4, and then tweaking them with specialist data for their own purposes. (Disclosure: Bloomberg has announced its own language model for finance, which will likely compete with OpenAI’s GPT-4.)
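That “tweaking” step usually begins with packaging a company’s proprietary text into the training format a model provider expects. A minimal sketch of that preparation, assuming a JSON Lines prompt/completion layout of the kind fine-tuning tooling has commonly accepted — the records and field names here are hypothetical illustrations, not any company’s actual data or a specific provider’s required schema:

```python
import json

# Hypothetical domain records an insurer might license out for fine-tuning.
records = [
    {"prompt": "Summarize the claim: Water damage from burst pipe, $4,200 repair.",
     "completion": "Burst-pipe water damage claim; estimated repair cost $4,200."},
    {"prompt": "Summarize the claim: Rear-end collision, no injuries, $1,800 bodywork.",
     "completion": "Minor rear-end collision claim; bodywork estimate $1,800."},
]

def to_jsonl(rows):
    """Serialize training records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in rows)

jsonl = to_jsonl(records)
print(len(jsonl.splitlines()))  # one line per training example
```

The file produced this way is what gets uploaded to a provider’s fine-tuning endpoint; the base model itself is never touched, which is why the heavy cost of building foundation models stays with the handful of large firms.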
Schneider says that three months ago, virtually no one was buying data to train language models in this way. Now those transactions make up about 15 percent of the total volume on his platform, with prices ranging from tens of thousands to millions of dollars. Companies with unique data that’s in high demand — such as data that can help an AI tool do software programming — tend to be in a stronger selling position, Schneider adds.
In one sense, this all points to a thriving market for data. In a year or two, we could see an array of insurance firms, banks and medical companies buying and selling data to build specialized alternatives to ChatGPT.
But this market could move in a darker direction too — one dominated by incumbent technology firms. That’ll depend on whether OpenAI and Google build language models that can do anything for anyone — a kind of Swiss Army knife version of ChatGPT with expertise on an array of subjects. General-purpose bots, in other words, could supplant the niche bots, and if data prices climb too high, those niche bots would also become harder to build in the first place.
The larger tech firms “are always going to be able to spend more on compute [and data] than we can,” says Keith Peiris, co-founder and CEO of Tome, an AI tool for generating stories. “Odds are they will win because of capital, not necessarily because of innovation.”
That has been the story of Big Tech for years, and it’s unlikely to change now.
© 2023 Bloomberg LP