The Iteration of Large Models May Face a Pause
In an era where artificial intelligence (AI) develops at an unprecedented pace, a looming crisis is becoming increasingly evident—data scarcity. Data serves as the essential lifeblood for AI models, influencing not only their capabilities but also the breadth of their potential applications. A recent study by Epoch AI warns that by 2028, the volume of data required to train AI models may reach the scale of the entire corpus of public online text. This revelation signals a troubling trend: within the next few years, AI models may exhaust the pool of high-quality data essential for their training and performance.
While computational power has surged, enabling large models to process vast amounts of data, the supply of high-quality data that is relevant to specific contexts has failed to keep pace. It’s not that data is entirely depleted; rather, the challenge lies in obtaining data that meets qualitative standards. The increasing scale of these AI models introduces diminishing returns—the added value gained from incorporating additional data decreases as model size grows. As organizations strive to enhance model performance, the demand for high-quality, targeted datasets becomes more stringent.
The search for adequate internet data raises a pivotal question: Is the web running dry? The appetite for data among large models is enormous. Consider GPT-4, for instance; while OpenAI has not disclosed its size, it is widely reported to have on the order of a trillion parameters, necessitating an enormous dataset to train adequately.
A professional from a data intelligence center explained that the primary sources for large model training data can be categorized into three types. The first includes publicly available internet data gathered through web scraping or APIs, encompassing websites, social media platforms, forums, academic literature, and open datasets. The second category consists of internal corporate data, reflecting user interactions, transaction histories, and product usage logs—these are particularly valuable for models tailored to specific industries. Lastly, third-party data providers offer meticulously curated data, specialized for certain sectors.
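The first category above, public internet data gathered by scraping, typically requires stripping raw HTML down to usable text before it can enter a training corpus. As a minimal illustrative sketch (not any particular company's pipeline), the standard-library `html.parser` module can extract visible text while discarding scripts and styles:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth counter for non-visible elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello world.</p><script>x=1</script></body></html>")
print(extract_text(page))  # prints "Hello world."
```

Real pipelines add many further steps (deduplication, language detection, quality filtering), which is precisely where the costs discussed later in the article accumulate.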
However, despite the massive volume of data generated daily on the internet, the availability of high-quality datasets remains largely limited. The rapid rate of data generation falls well short of meeting the insatiable demand from expansive AI models.
OpenAI's former chief scientist, Ilya Sutskever, highlighted this issue succinctly, stating, “We have only one internet,” and emphasized that data growth is decelerating. As the critical resource fueling AI innovation begins to dwindle, the implications are profound.

The same data professional noted, “The claim that internet data is nearing exhaustion is misleading; rather, we have reached a peak in the availability of high-quality data. The proliferation of misinformation and redundant content on social media, alongside biased statements found online and even data generated by AI itself, severely impairs data quality. Low-quality datasets do not provide effective training resources; they can mislead model judgments and degrade performance.”
As an example, he referenced a previous instance in which an AI model identified itself as Gemini, a different company's model; the episode was amusing, but it also suggests how significantly AI-generated content may already be polluting the internet's data resources.
Liang Bin, founder and CEO of Baiyou Technology, shared insights into the market dynamics. In 2023, he said, entities of all kinds investing in large models were scrambling to acquire data, often without a full understanding of its quality. By 2024, customers had come to prefer purchasing data that adheres to strict standards; for example, when buying images, they now specify the size and what needs to be depicted. This evolution indicates a growing recognition of the importance of high-quality data.
“For the latter two categories of data sources, acquiring relevant data is increasingly challenging,” the data expert confessed. “As the use of AI models proliferates, data owners are becoming more protective, enforcing stricter rules regarding the utilization of their content.”
Liu Xingliang, a member of the Ministry of Industry and Information Technology's Expert Committee on Information and Communication Economics, elaborated on the restrictions posed by privacy and security regulations. The global emphasis on data privacy has escalated, leading many companies and platforms to hesitate in sharing large volumes of user data.
Additionally, the high costs associated with obtaining quality data pose significant obstacles for firms. Current efforts by major model developers to clean data require substantial financial resources, making the endeavor challenging. “There exists a plethora of noise within raw data, and the costs of cleanup and labeling, especially in specialized fields like healthcare and law, are exorbitant,” Liu noted. “Simultaneously, legal questions around data copyright hinder access to high-value datasets (such as literary works and scientific articles).”
Industry sentiment suggests that the delayed release of GPT-5 underscores the urgency of addressing these data bottlenecks, which complicate the training process. However, top-tier companies like OpenAI and Google remain optimistic, asserting that AI has not encountered insurmountable barriers. They believe advancements in data sourcing, improved model inference capabilities, and the utilization of synthetic data will propel continued progress in AI development.
The growing realization of potential data exhaustion has prompted companies to confront this challenge squarely and actively seek viable solutions. This involves tapping into existing data potentials, leveraging synthetic datasets, establishing data-sharing platforms, enhancing data governance, and exploring innovative sources of data. For example, OpenAI has created a foundational team tasked with investigating ways to address training data shortages and refine scale parameter applications to maintain model advancements.
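Synthetic data, one of the mitigation strategies mentioned above, generally means expanding a small set of verified seed facts into many machine-generated training examples. The following is a minimal hypothetical sketch of template-based generation; the facts, templates, and function names are illustrative, not drawn from any vendor's actual pipeline:

```python
import random

random.seed(0)  # reproducible sampling for the illustration

# Hypothetical seed facts; in practice these would come from a curated corpus.
facts = [("Paris", "France"), ("Tokyo", "Japan"), ("Ottawa", "Canada")]

# Surface-form templates multiply each fact into varied training lines.
templates = [
    "Q: What is the capital of {country}? A: {city}",
    "Q: {city} is the capital of which country? A: {country}",
]

def make_synthetic_examples(n):
    """Expand the seed facts into n templated question-answer lines."""
    out = []
    for _ in range(n):
        city, country = random.choice(facts)
        out.append(random.choice(templates).format(city=city, country=country))
    return out

for line in make_synthetic_examples(4):
    print(line)
```

Template expansion like this is the simplest form; frontier labs more often use one model to generate and filter training data for another, which raises the quality-contamination concerns the article discusses.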
A staff member from the data intelligence center remarked, “Currently, large models are frequently reducing their prices—this is driven not only by cost considerations but also by a desire to attract more user data. By enticing users with low-cost or even free access to models, companies gather more data to optimize performance, which in turn fosters a cycle of attracting users and enhancing model results.”
Many professional observers agree that in an environment of constrained data resources, fostering collaboration and data-sharing across various institutions and industries represents an effective strategy to combat data scarcity. Data-sharing platforms can enable companies and research organizations to consolidate and share their data resources, promoting increased interconnectivity.
Renowned economist and expert committee member Pan Helin suggested, “An immediate solution would involve collaboration between AI firms and internet platform companies to develop leading AI models. Internet platforms have abundant computational power, financial resources, and vast data.”
Chinese Academy of Sciences Academician Mei Hong illustrated the necessity for integration: “For instance, the data generated by different transportation systems—buses, taxis, subways—are managed by independent information systems, creating isolated data islands. For seamless sharing and integration of these datasets, interoperability among systems must be realized. Cost and efficiency constraints imply that each organization cannot undertake this independently. Therefore, the establishment of a new data-centric infrastructure is crucial, aiming to fundamentally support the interconnectivity of data online. Essentially, this infrastructure expands and extends the internet technology ecosystem.”
“We need to encourage the development of open data platforms across industries or research fields, and simultaneously formulate reasonable data-sharing and usage regulations to ensure compliance,” Liu asserted. “The concept of a ‘data drought’ feels more akin to an issue of data acquisition and usage efficiency rather than an outright lack of data. Privacy and security regulations indeed impose rigorous demands for data circulation but also incentivize innovation in technologies and business models. Ultimately, the AI sector must find a balanced approach between data acquisition efficiency, technological advancements, and adherence to regulations.”