Can generative AI only improve from here? It's complicated

I asked machine learning experts about the future of AI and LLMs. Their answers made me realize I was focusing on the wrong questions.

David Priest
Amazon Employee
Published Apr 18, 2024
Computers will only get better. Our hardware is continually becoming more efficient, our storage more expansive, our software more powerful. Cars will only get better, too: more fuel efficient, safer, smarter. The same is true with almost all technology; if we set aside business-driven enshittification for the moment, technological improvement in the 21st century seems to be about as dependable as death and taxes.
But what about generative AI? Will it only improve? I’ve heard countless experts say so, as though it’s all but guaranteed. But the answer doesn’t seem so clear to me. Generative AI isn’t just software plus hardware. It’s also data. And while the hardware (think of data storage and computation cost) will almost certainly improve, and the software (think of semantic mapping) will almost certainly improve, the data (think of all the written language ingested by LLMs) will almost certainly not.
So will generative AI only get better from here? Probably... but it’s more complicated than you might think.

Hardware + Software + Data

Let’s get more specific: are the software and hardware undergirding AI really improving? We constantly read, after all, about the massive cost of generative AI. But those costs are falling fast. Transformer models, introduced in 2017, use GPUs far more efficiently than the previous generation of recurrent architectures (RNNs, LSTMs, and the encoder-decoder models built from them), because they can process all the words in a sequence in parallel rather than one at a time, even if we’re rapidly increasing how many GPUs we’re using for these models. Look at cloud services in general: the tech industry generally improves performance over time, lowering prices relative to that performance in the process, and there’s little reason to think generative AI will be any different.
Customer-facing costs will almost certainly go down, too. Amazon Bedrock, for example, recently cut costs for people testing and developing on it by as much as 75%. That was a material improvement based simply on optimizing the service offering.
The same pattern holds true for the software, meaning the mechanisms that take raw data (billions of words) and use it to answer your random queries on ChatGPT. Just look at the recent improvements in language generation: ten years ago, tools were available that would take a word and start spinning out a sentence by guessing each next word based simply on the previous word (or perhaps the previous few words). The result was a stream-of-consciousness meandering that could be intriguing, but wasn’t particularly useful.
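To make that concrete, here’s a minimal sketch of how those older tools worked: a tiny Markov-chain generator that picks each next word by looking only at the single word before it. The corpus and everything else here is invented purely for illustration.

```python
import random
from collections import defaultdict

# Count, for each word in a toy corpus, which words have followed it.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
next_words = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    next_words[prev].append(nxt)

def generate(start, length=8):
    """Spin out a sentence by repeatedly sampling a next word,
    conditioned on nothing but the previous word."""
    words = [start]
    for _ in range(length):
        candidates = next_words.get(words[-1])
        if not candidates:  # dead end: this word never appeared mid-corpus
            break
        words.append(random.choice(candidates))
    return " ".join(words)

print(generate("the"))  # e.g. "the dog sat on the mat and the rug"
```

Because each choice sees only a single word of context, the output wanders exactly as described above.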
What followed was years of work by many smart people to capture the context of language more effectively, and the resulting emergence of self-attention: a concept that allows models to pay, well, attention not simply to the last word, but to many words together. (As an example, self-attention allows a model to more reliably make sense of the pronouns in a sentence like “Jake lent Bob his jacket because he was cold” by paying attention to the whole sentence at once.)
An illustration of a man looking at himself in the mirror
One way to think about self-attention is that it is a "mirror" for LLMs to understand their context.
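For the technically curious, here is a compressed sketch of the computation at the heart of self-attention (scaled dot-product attention), using random stand-in embeddings and weights. Real models learn these weight matrices and run many attention “heads” in parallel; this toy keeps only the core idea.

```python
import numpy as np

# Toy scaled dot-product self-attention: every word (a row of X) weighs
# how relevant every other word is to interpreting it.
rng = np.random.default_rng(0)
seq_len, d = 5, 8                       # 5 words, 8-dimensional embeddings
X = rng.normal(size=(seq_len, d))       # stand-in word embeddings

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv        # queries, keys, values

scores = Q @ K.T / np.sqrt(d)           # pairwise relevance scores
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
output = weights @ V                    # context-aware representation per word

print(weights.round(2))                 # row i: how much word i attends to each word
```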
Jump forward a few more years, to 2017, and a bunch of Google engineers realize that attention isn’t just useful in conjunction with other tools, like recurrence and convolutions, but on its own. They publish one of the most important recent papers on machine learning, aptly titled, “Attention Is All You Need.”
And within a few years, we land at GPT-3 and other decoder-only models that string words together in exactly the same way as those old, meandering tools. They don’t guess the next word based only on the last word generated, though. Instead, they guess based on the hidden context of all the words written so far, creating incredibly cogent, essay-length responses to questions (even if they sometimes hallucinate along the way).
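The generation loop itself is simple to sketch. The `next_token_probs` function below is a stand-in that fabricates a distribution just so the loop runs; in a real decoder-only model, that call would be a full transformer pass over the entire context.

```python
import random

vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probs(context):
    """Stand-in for a trained decoder-only model. A real model would run a
    transformer over the ENTIRE context; here we just fake a distribution
    (seeded by the context) so the loop is runnable."""
    random.seed(" ".join(context))
    weights = [random.random() for _ in vocab]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt, max_new_tokens=6):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)          # conditioned on all tokens so far
        tokens.append(max(zip(probs, vocab))[1])  # greedy: pick the likeliest token
    return " ".join(tokens)

print(generate("the cat"))
```

The structural difference from the bigram sketch is just what goes into the conditioning: one word there, the entire context here.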
If you only tracked with 50% of that, don’t worry. The important fact to understand for our purposes is this: we are still very much learning about how to create effective large language models, and that means our ability to leverage our massive datasets is only improving over time.

The Data Problem

Hardware and software may be improving, but the story of the data itself isn’t quite as clear. Let’s pretend that our dataset is the internet: that includes most of what’s been written (or at least published) in human history, thousands of years of written words. A year from now, we will have exactly one more year of data. That’s not insignificant; we are creating a massive amount of data online every year, so one year of written-language data in the 2020s is far larger than a year’s worth from even a hundred years ago, let alone a thousand years ago.
Okay, we’re creating a lot of data, so our dataset (the internet) should be getting bigger — and that means better, right? Well... maybe not. After all, popular generative AI tools like ChatGPT are now live and being used widely. That means the input data is quickly being tainted by outputs from previous iterations of generative AI. (This isn’t intentional, to be clear; it’s just incidental to everyone suddenly publishing AI-generated content all over the internet.)
This sort of corruption matters for a few reasons. First, it contributes to what one group of machine learning experts, in a research paper published last year, called “model collapse.” Essentially, when a model approximates the distribution of a large dataset, some of that data is inevitably lost, starting with the rare events in the distribution’s tails, meaning the approximated distribution changes. Feed the outputs back in as training data, and a problem emerges: the more models approximate based upon their own approximations, the further their outputs diverge from the original dataset.
A screenshot depicting how real data can be tainted over time by model-generated data online.
A screenshot from "The Curse of Recursion: Training on Generated Data Makes Models Forget"
Not only will this data corruption lead to model collapse, but outright errors in model outputs will also propagate across generations of generative AI tools, further degrading their performance over time.
In other words, generative AI ingesting its own outputs is very bad for its reliability.
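You can watch a cartoon version of model collapse happen in a few lines of code: fit the world’s simplest “model” (a Gaussian) to some data, sample a new dataset from that model, and repeat, so each generation trains only on the previous generation’s outputs.

```python
import numpy as np

# Toy model-collapse simulation: each generation is trained (fit) on the
# previous generation's sampled outputs instead of the original data.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=200)   # the "real", human-made data

for generation in range(10):
    mu, sigma = data.mean(), data.std()           # "train": estimate the distribution
    data = rng.normal(mu, sigma, size=200)        # next gen learns from model outputs
    print(f"gen {generation}: mean={mu:+.3f}  std={sigma:.3f}")
```

Run it for enough generations and the estimates wander; the rare values in the tails of the original data are the first to disappear, which is the same dynamic the paper describes for full-scale language models.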

What are the solutions?

Researchers aren’t ignorant of these data challenges. For extra perspective, I reached out to Mark Kon, a professor of mathematics and statistics at Boston University, and an expert on machine learning and neural networks.
“The danger is not apparent yet,” he said. “You’re looking ahead a few years, where... AI-based content becomes the majority of internet content. And the question is, will this lead to an iterated corruption of information?”
Kon said the answer to that question is yes, as long as the spread of AI-generated content remains unchecked. In essence, the so-called “dead internet theory” would in that case become a reality. But Kon says there are solutions.
“It’s going to be... a ‘better mousetrap, better mouse’ situation," he said, "in that human safeguards against the iteration of AI generated content will iteratively be defeated by stronger and more powerful AI content, which will lead to better human safeguards, and so on.”
We’ll need to build features, explained Kon, to “establish the authenticity of a piece of information based on verifiable human authorship.” Plenty of AI tools, like Amazon Titan, are already making impressive efforts to watermark AI-generated images (better mousetraps). Of course, as we use artificial intelligence more and more, the line between human-created content and AI-generated content will become blurrier (better mice).
A mousetrap with cheese on it
"I think we're going to need a bigger mousetrap"
My colleague Mike Chambers, an AI Specialist and Developer Advocate at AWS, was more hopeful about LLM improvements when I spoke to him. He pointed out that we’ve already taken steps to improve model performance by “tidying” existing datasets, which suggests we need to clean our data at least as much as we need to grow it. Generative AI itself might even be able to assist with this cleaning process.
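What does “tidying” look like in practice? At its simplest, something like the sketch below: cheap heuristic quality filters plus deduplication. The specific rules and thresholds here are invented for illustration, not anyone’s production recipe.

```python
# A cartoon of dataset "tidying": heuristic quality filters plus exact
# deduplication. Real pipelines are far more sophisticated.
def looks_clean(doc: str) -> bool:
    words = doc.split()
    if len(words) < 20:                                 # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:              # highly repetitive boilerplate
        return False
    if sum(c.isalpha() for c in doc) / len(doc) < 0.6:  # mostly markup/noise
        return False
    return True

def tidy(corpus: list[str]) -> list[str]:
    seen, kept = set(), []
    for doc in corpus:
        key = " ".join(doc.lower().split())   # normalize for exact dedup
        if key not in seen and looks_clean(doc):
            seen.add(key)
            kept.append(doc)
    return kept
```

Chambers’s point about generative AI assisting with cleaning would slot in here, too: a model can score documents for coherence, or flag likely machine-generated text, far more subtly than rules like these.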
Likewise, he said active research into alignment and fine-tuning, along with regular discoveries in prompting techniques, is improving how we leverage existing data, even if we don’t always have great visibility into said research. (Research and development for such high-value, cutting-edge technology remains highly secretive.)
“Will generative AI only improve?” asked Chambers. “Absolutely yes! As an industry, we are floundering around in the dark near the very beginning of this technology. We are just waiting on the next small discovery, the next step change discovery, or for someone to invent a torch.”
Both Kon and Chambers are getting at the fact that there are different kinds of “better” when it comes to data; it doesn’t always come down to “more.”
I was reminded of a recent conversation I had with a scientist who likened our current data situation to the pre-war steel problem. Essentially, after nuclear bombs were detonated in the 1940s and 1950s, fallout raised the level of radionuclides in the atmosphere, and because steel production draws in atmospheric air, newly produced steel became slightly radioactive, making it less suitable for a variety of purposes. That raised the value of pre-war “low-background” steel (and, more specifically, pre-war steel shielded from the effects of radiation by, for example, sitting in a shipwreck at the bottom of the sea), which was crucial to the production of Geiger counters and other sensitive scientific instruments.
Perhaps the fallout of ChatGPT is too universal to be contained, and saved copies of the internet pre-2020 will be the new pre-war steel — with a level of purity we won’t ever achieve again. And as Chambers argued, our task is to clean that data and optimize our use of it. Or as Kon argued, perhaps tools to filter out AI-generated content will help preserve the value of newer online data as it is created.
To me, either option feels useful. LLMs really do seem almost certain to improve, but how exactly they improve remains to be seen — and no matter the answer, that improvement will require significant investment on our part.

The Bigger Picture of Data

Talking with experts like Kon and Chambers made me realize I might have been asking the wrong question in the first place. Will generative AI get better? Sure. But there’s a bigger question about data here.
Tidying older data and labeling newer data are important strategies, but they mostly apply to our same old categories of data: text, images, and videos. The future of AI will likely also involve finding new types of data to ingest. I’ve written about the perception problem of generative AI before: that everything an LLM generates is a hallucination insofar as it is an expression of a reality the LLM cannot itself perceive. Well, one solution to the data problem is giving models more means of perception.
Think about the transition from paper maps to satellite-enabled navigation systems. We went from millennia of cartographic practice to something radically new: camera-equipped satellites that could map the earth, and positioning satellites whose signals could pinpoint our exact locations on it, in a way never before achieved. Within a few years of GPS navigation, cars with 360-degree cameras strapped to their roofs began to roam the world, capturing street views. Digital photography in its various implementations created a completely new category of data for maps to use, from the clouds to the streets.
A smart phone with a virtual map floating above it
New categories of data have radically reshaped how we think about maps and navigation.
Generative AI is looking for that new category of data.
In much the same way scientists have discovered how to more effectively “understand” and leverage the data of written language through vector embeddings (a quick sketch of the idea follows below), I believe we will begin to “understand” and leverage new categories of data, be they spatial, visual, or even socio-cultural. It might sound like science fiction, but the apparatus is already in development: both new devices for gathering data and new methods of gathering data from existing devices are on the way. In other words, the camera-equipped satellites and cars of the AI age aren’t so far away from launching.
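If “vector embedding” is unfamiliar, the idea is that once data is mapped to vectors, meaning becomes geometry, and similar things sit close together. The hand-picked toy vectors below stand in for what a real embedding model would produce.

```python
import numpy as np

embeddings = {                        # hand-picked stand-in vectors
    "river bank":   np.array([0.9, 0.1, 0.2]),
    "savings bank": np.array([0.1, 0.9, 0.3]),
    "stream shore": np.array([0.8, 0.2, 0.1]),
}

def cosine(a, b):
    """Similarity as geometry: 1.0 means pointing the same direction."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = embeddings["river bank"]
for phrase, vec in embeddings.items():
    print(f"{phrase:>12}: {cosine(query, vec):.2f}")
# "stream shore" scores far higher than "savings bank", despite sharing no words.
```

Nothing in that geometry cares whether the vectors came from text; that’s what makes the same machinery promising for entirely new categories of data.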
Generative AI will improve because of our efforts to better leverage or preserve existing data. But in the long run, it will get more powerful because it will begin to access whole new categories of data. After all, the best source for understanding reality isn’t the internet; it’s reality itself.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
