The AI Boom Is About to Hit a Data Wall
Big models built on bad data are reaching their breaking point. The winners in the next AI race will be those who master quality over quantity.
Artificial intelligence developers are running headlong toward an invisible data wall, and the impending collision will force a rethink of how AI gets built. Current AI models are ravenous, gulping down vast amounts of data to feed their increasingly powerful algorithms. But they’ll need a change of diet soon enough, because developers can’t keep filling their plates.
What publicly available data exists has largely been used already. As early as 2023, the original ChatGPT had been trained on essentially the entire public internet as it existed at the time. New data isn’t being created fast enough to fuel AI’s seemingly insatiable appetite, and to make matters worse, the biggest content creators are no longer playing ball. They don’t think it’s fair that AI companies help themselves to everything they publish, and they’re beginning to make a stand.
A case in point: Reddit recently took Anthropic to court over claims that its chatbot Claude was trained on more than 100,000 of its users’ posts. That was just the latest in a long string of lawsuits over “fair use” and whether such wholesale scraping of intellectual property should be allowed.
Anthropic’s willingness to take that risk, which came after Reddit very publicly announced licensing deals with Google and OpenAI, underscores the increasingly desperate situation for AI developers, said Dr. Max Li, founder and Chief Executive of the decentralized data cloud OORT. He explained there’s a growing sense of urgency around the impending data wall, because of its potentially negative impact on AI innovation.
“There will be major consequences if we don’t fix this, with a handful of bigger companies dominating because they have access to higher quality data,” he said. “Smaller companies will become more reliant on synthetic data, but this increases the risks of model hallucinations and reduced reliability.”
Why Less Is More
Awareness of the importance and value of data is growing, and if anything that’s making data harder to come by. Enterprises have started guarding their information more jealously than ever, and they’re far less willing to share it. Moreover, what freely available data remains has become heavily polluted, because much of the internet is now populated with AI-generated content that you definitely don’t want training your models.
Tal Melenboim, founder of Data+, a startup that’s focused on data collection and labeling for AI training, said AI-generated content is the worst thing in the world for AI models, because it leads to a documented phenomenon known as “model collapse”. “It’s bad quality data that causes errors and inaccuracies to be amplified and bias to be reinforced,” he said.
Rather than trying to get its hands on as much data as possible, Melenboim says the industry needs to embrace the idea that less is more, so long as the data is of the highest quality. The long-standing assumption in the AI industry has been that the more data you feed into a model, the better it becomes. But experiments with “small language models” trained on smaller, domain-specific datasets are undermining that theory.
“Clean, high quality and domain-specific data helps to provide clearer and more relevant signals to learn from while reducing bias and overfitting,” Melenboim said. “If the data is free from errors and biases, the model will make more accurate predictions. It generalizes to new, previously unseen data, making models more performant in real-world scenarios, and you don’t need tons of it.”
Data Quality Matters
This idea of doing more with less data could help AI companies clear the data wall when they reach it, but only if they’re prepared. Rather than spending money on ever-larger datasets, they should invest in their data infrastructure, to guarantee their data’s quality and stop it from degrading over time.
Melenboim thinks the industry can take its lead from DevOps, where version control processes maintain software quality. Developers carefully track every change to their codebase as it evolves, so they can see how each update affects their applications, he said. If something goes wrong, they can trace how it happened and fix it quickly.
“Teams need to create and implement frameworks that verify data from its source and throughout its lifecycle, and implement audit trails to make it transparent, so they can see where the data was sourced and how it has changed over time,” Melenboim advised.
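Melenboim doesn’t name a specific tool, but the idea of verifying data from its source and throughout its lifecycle can be sketched as a content-addressed audit trail: hash each record, and log where it came from and what was done to it every time it changes. The class, field names and actions below are illustrative assumptions, not any vendor’s actual framework.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(record: dict) -> str:
    """Content hash of a data record, so any later change is detectable."""
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

class AuditTrail:
    """Append-only log of where a record came from and how it changed."""

    def __init__(self):
        self.entries = []

    def log(self, record: dict, source: str, action: str) -> str:
        digest = fingerprint(record)
        self.entries.append({
            "hash": digest,
            "source": source,
            "action": action,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return digest

    def history(self) -> list:
        return list(self.entries)

trail = AuditTrail()
rec = {"text": "example training sample", "label": "positive"}
h1 = trail.log(rec, source="vendor-feed", action="ingested")

rec["label"] = "negative"   # a later edit to the record
h2 = trail.log(rec, source="curation-team", action="relabeled")

assert h1 != h2             # the change is visible in the trail
```

Because the hash covers the record’s content, any silent modification shows up as a mismatch against the last logged fingerprint, which is the transparency property Melenboim is asking for.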
Dr. Li said the best way to do this is with decentralized data, which is highly transparent by virtue of its open nature. In OORT’s case, it employs a layered system for data verification: new datasets are validated by the community through a peer-review system that filters out low-quality submissions. Then there’s an infrastructure layer, where OORT’s patented Proof-of-Honest mechanism takes a carrot-and-stick approach, with nodes incentivized to operate transparently and honestly, and punished for malicious behavior.
Finally, there’s an algorithmic layer, Dr. Li said. It uses AI models to verify quality and detect anomalies, duplicates and inconsistencies in the datasets, monitoring them in real time as they evolve. “By combining these methods, we create a verification ecosystem that ensures quality data is consistently rewarded and poor data naturally phases out,” he said.
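OORT’s actual algorithmic layer isn’t described in detail here, so the following is only a toy illustration of the idea behind it: screen a batch of text samples by dropping exact duplicates and flagging statistical outliers. The function name and the 2-sigma length threshold are illustrative assumptions, not OORT’s algorithm.

```python
import hashlib
from statistics import mean, stdev

def screen_dataset(samples):
    """Toy quality screen: drop exact duplicates, then flag length outliers."""
    seen, unique = set(), []
    for text in samples:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:                  # exact-duplicate filter
            seen.add(digest)
            unique.append(text)

    lengths = [len(t) for t in unique]
    mu = mean(lengths) if lengths else 0.0
    sigma = stdev(lengths) if len(lengths) > 1 else 0.0
    # Flag anything more than 2 standard deviations from the mean length,
    # a deliberately crude stand-in for real anomaly detection.
    anomalies = [t for t in unique if sigma and abs(len(t) - mu) > 2 * sigma]
    return unique, anomalies

batch = ["sample %d" % i for i in range(10)] + ["sample 0", "x" * 500]
unique, anomalies = screen_dataset(batch)   # duplicate dropped, outlier flagged
```

A production system would use semantic deduplication and learned quality scores rather than raw length, but the shape is the same: cheap automated checks that run continuously as datasets evolve.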
Quality Isn’t So Costly
Another powerful argument for high-quality data is cost efficiency: it doesn’t always have to be expensive.
In decentralized scenarios, contributors are rewarded based on the quality and value their data provides. It can be thought of as a performance-based system, Dr. Li explained. The more an AI model is used, the greater the rewards for the creators of the data it’s trained on. It’s a workable alternative to paying upfront to access licensed datasets. “By carefully designing reward structures and verifying the datasets, developers can encourage high-quality contributions while keeping the economics sustainable,” he stated.
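The exact reward mechanics aren’t spelled out, but the pay-for-performance idea can be sketched as splitting each period’s reward pool among contributors in proportion to a quality-weighted usage score. Every name, field and number below is an illustrative assumption, not OORT’s actual formula.

```python
def distribute_rewards(pool, contributions):
    """Split a reward pool in proportion to quality_score * usage,
    so contributors are paid for performance rather than upfront."""
    weights = {name: c["quality_score"] * c["usage"]
               for name, c in contributions.items()}
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in contributions}
    return {name: pool * w / total for name, w in weights.items()}

contributors = {
    "alice": {"quality_score": 0.9, "usage": 1000},  # high quality, heavy use
    "bob":   {"quality_score": 0.4, "usage": 1000},  # same use, lower quality
}
payouts = distribute_rewards(130.0, contributors)    # alice earns more than bob
```

The key economic property is that payouts scale with how much a trained model actually uses and benefits from a contributor’s data, which is the alternative to upfront licensing fees that Dr. Li describes.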
Data can be had cheaply in other ways, too. Some of the best-quality data is owned by private companies, which may be interested in creating data marketplaces that facilitate the exchange of clean, licensed and verified datasets at fair prices. Melenboim said it’s in enterprises’ interest to sell their data, as the revenue can be used to buy datasets from other companies. Alternatively, enterprises may start forming cooperatives in which they access each other’s data for free under contractually agreed terms. When data is sensitive, it can be anonymized before being shared.
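As a minimal sketch of that last step, sensitive fields can be replaced with salted hashes before a dataset leaves the building. Strictly speaking this is pseudonymization rather than full anonymization (re-identification risks remain), and the field names and salt below are illustrative assumptions.

```python
import hashlib

SENSITIVE_FIELDS = {"name", "email"}   # illustrative choice of fields

def pseudonymize(record, salt):
    """Replace sensitive fields with salted hashes: records stay joinable
    across datasets, but the raw values are no longer exposed."""
    out = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8"))
            out[key] = digest.hexdigest()[:16]
        else:
            out[key] = value
    return out

raw = {"name": "Jane Doe", "email": "jane@example.com", "purchase": "laptop"}
shared = pseudonymize(raw, salt="org-secret")   # shareable copy
```

Because the same salt always maps the same value to the same token, two cooperating enterprises can still join their records on the hashed fields without either side seeing the raw identities.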
Melenboim agreed that startups might be disadvantaged in these scenarios because they have lower budgets and less data of their own to share, but said they still have other options. For instance, they can turn to open datasets or partner with academic institutions and use their data in return for sharing their technology.
“Startups also have the option of providing incentives for their users to share their own data,” Melenboim said. “This is how companies like Google and Facebook operate, and it clearly works well. Give your users free access to something in return for their data, and they’ll be only too happy to sign up.”
