The Iteration of Large Models May Face a Pause


January 16, 2025

In an era where artificial intelligence (AI) develops at an unprecedented pace, a looming crisis is becoming increasingly evident: data scarcity. Data is the lifeblood of AI models, influencing not only their capabilities but also the breadth of their potential applications. A recent study by Epoch AI warns that by 2028, the volume of data required to train AI models may reach the scale of the entire corpus of public online text. This signals a troubling trend: within the next few years, AI models may exhaust the pool of high-quality data essential for their training and performance.

While computational power has surged, enabling large models to process vast amounts of data, the supply of high-quality, context-relevant data has failed to keep pace. It is not that data is entirely depleted; rather, the challenge lies in obtaining data that meets qualitative standards.

The increasing scale of these AI models introduces diminishing returns: the added value gained from incorporating additional data decreases as model size grows. As organizations strive to enhance model performance, the demand for high-quality, targeted datasets becomes more stringent.

The search for adequate internet data raises a pivotal question: is the web running dry? The appetite of large models for data is enormous. Consider GPT-4, for instance: it is reported to have parameters numbering in the trillions, necessitating an enormous dataset to train adequately.

A professional from a data intelligence center explained that the primary sources of large model training data fall into three categories. The first is publicly available internet data gathered through web scraping or APIs, encompassing websites, social media platforms, forums, academic literature, and open datasets. The second is internal corporate data, reflecting user interactions, transaction histories, and product usage logs; these are particularly valuable for models tailored to specific industries.

Lastly, third-party data providers offer meticulously curated data, specialized for certain sectors.
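As a rough illustration of the first category, public web data is typically gathered by fetching pages and stripping them down to visible text. The sketch below uses only Python's standard-library HTML parser; the sample HTML and tag choices are invented for illustration, and a real pipeline would fetch pages (where permitted) rather than use an inline string.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from a page, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.in_skip = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_skip = False

    def handle_data(self, data):
        if not self.in_skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# A real crawler would obtain this HTML via an HTTP fetch;
# a small inline sample stands in here.
sample = "<html><body><script>var x=1;</script><p>High-quality text.</p></body></html>"
print(extract_text(sample))  # High-quality text.
```

Harvesting at scale then adds politeness (robots.txt, rate limits) and per-source licensing checks, which is precisely where the access restrictions discussed later in the article bite.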

However, despite the massive volume of data generated daily on the internet, the availability of high-quality datasets remains limited. The rate of data generation falls well short of the insatiable demand from large AI models.

OpenAI's former chief scientist, Ilya Sutskever, put the issue succinctly, stating, “We have only one internet,” and emphasized that data growth is decelerating. As the critical resource fueling AI innovation begins to dwindle, the implications are profound.

The same data professional noted, “The claim that internet data is nearing exhaustion is misleading; rather, we have reached a peak in the availability of high-quality data. The proliferation of misinformation and redundant content on social media, alongside biased statements found online and even data generated by AI itself, severely impairs data quality. Low-quality datasets do not provide effective training resources; they can mislead model judgments and degrade performance.”
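The quality problems described above (duplicates, noise, junk fragments) are commonly attacked with simple filters before training. A minimal sketch, with the word-count threshold and exact-hash deduplication chosen arbitrarily for illustration (production pipelines use far more elaborate heuristics, such as fuzzy deduplication and classifier-based quality scoring):

```python
import hashlib

def clean_corpus(docs, min_words=5):
    """Drop exact duplicates and very short fragments from a list of documents."""
    seen = set()
    kept = []
    for doc in docs:
        text = " ".join(doc.split())           # normalize whitespace
        if len(text.split()) < min_words:      # too short to be useful
            continue
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest in seen:                     # exact (case-insensitive) duplicate
            continue
        seen.add(digest)
        kept.append(text)
    return kept

docs = [
    "The model was trained on curated data.",
    "the model was trained on curated data.",   # duplicate, differs only by case
    "Buy now!!!",                               # too short
    "Synthetic text must be filtered as carefully as human text.",
]
print(clean_corpus(docs))  # keeps 2 of the 4 documents
```

Even this toy filter shows why cleanup is expensive at web scale: every document must be normalized, hashed, and compared, and the hard cases (near-duplicates, subtly wrong content) require much costlier methods.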

As an example, he cited an earlier incident in which one AI model claimed to be Gemini, a different model entirely: amusing, but indicative of how polluted the internet's resources may already be.

Liang Bin, founder and CEO of Baiyou Technology, shared insights into the market dynamics: in 2023, entities of all kinds investing in large models were scrambling to acquire data, often without a full understanding of its quality. By 2024, customers preferred purchasing data that adheres to strict standards; for example, when buying images, they specify the size and what needs to be depicted. This evolution indicates a growing recognition of the importance of high-quality data.

“For the latter categories of data sources, acquiring relevant data is increasingly challenging,” the data expert conceded. “As the use of AI models proliferates, data owners are becoming more protective, enforcing stricter rules regarding the utilization of their content.”

Liu Xingliang, a member of the Ministry of Industry and Information Technology's Expert Committee on Information and Communication Economics, elaborated on the restrictions posed by privacy and security regulations. The global emphasis on data privacy has escalated, leading many companies and platforms to hesitate to share large volumes of user data.

Additionally, the high cost of obtaining quality data poses a significant obstacle for firms. Current efforts by major model developers to clean data require substantial financial resources, making the endeavor challenging. “There exists a plethora of noise within raw data, and the costs of cleanup and labeling, especially in specialized fields like healthcare and law, are exorbitant,” Liu noted.

“Simultaneously, legal questions around data copyright hinder access to high-value datasets (such as literary works and scientific articles).”

Industry sentiment suggests that the delayed release of GPT-5 underscores the urgency of addressing these data bottlenecks, which complicate the training process. However, top-tier companies like OpenAI and Google remain optimistic, asserting that AI has not encountered insurmountable barriers. They believe advances in data sourcing, improved model inference capabilities, and the use of synthetic data will propel continued progress in AI development.

The growing realization of potential data exhaustion has prompted companies to confront this challenge squarely and actively seek viable solutions. These include tapping the potential of existing data, leveraging synthetic datasets, establishing data-sharing platforms, enhancing data governance, and exploring innovative sources of data.

For example, OpenAI has created a foundational team tasked with investigating ways to address training data shortages and refine scale parameter applications to maintain model advancements.
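Synthetic data, one of the remedies mentioned above, is in practice generated by large models themselves. As a toy stand-in for that process, a template-based generator conveys the basic idea of expanding a small seed into many labeled training examples; the templates, slot values, and labels below are entirely invented for illustration.

```python
# Hypothetical templates; a real pipeline would prompt a large model instead.
TEMPLATES = [
    "The {product} arrived {quality}.",
    "I found the {product} to be {quality}.",
]
SLOTS = {
    "product": ["camera", "keyboard"],
    # (slot value, sentiment label) pairs
    "quality": [("quickly", "positive"), ("damaged", "negative")],
}

def generate_examples():
    """Expand every template against every slot combination, yielding (text, label)."""
    examples = []
    for template in TEMPLATES:
        for product in SLOTS["product"]:
            for quality, label in SLOTS["quality"]:
                examples.append(
                    (template.format(product=product, quality=quality), label)
                )
    return examples

data = generate_examples()
print(len(data))  # 2 templates x 2 products x 2 qualities = 8 examples
```

The combinatorial expansion is the appeal of synthetic data, but it is also the risk the article's experts warn about: if the seed material is narrow or model-generated text feeds back into training, volume grows while diversity and quality do not.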

A staff member from the data intelligence center remarked, “Currently, large models are frequently cutting their prices. This is driven not only by cost considerations but also by a desire to attract more user data. By enticing users with low-cost or even free access, companies gather more data to optimize performance, which in turn fosters a cycle of attracting users and enhancing model results.”

Many professional observers agree that, in an environment of constrained data resources, fostering collaboration and data sharing across institutions and industries is an effective strategy against data scarcity. Data-sharing platforms can enable companies and research organizations to pool and share their data resources, promoting interconnectivity.

Renowned economist and expert committee member Pan Helin suggested, “An immediate solution would involve collaboration between AI firms and internet platform companies to develop leading AI models. Internet platforms have abundant computational power, financial resources, and vast data.”

Chinese Academy of Sciences Academician Mei Hong illustrated the necessity of integration: “For instance, the data generated by different transportation systems, such as buses, taxis, and subways, are managed by independent information systems, creating isolated data islands. Seamless sharing and integration of these datasets requires interoperability among systems. Cost and efficiency constraints mean that no single organization can undertake this independently. Therefore, the establishment of a new data-centric infrastructure is crucial, aiming to fundamentally support the interconnectivity of data online. Essentially, this infrastructure expands and extends the internet technology ecosystem.”

“We need to encourage the development of open data platforms across industries or research fields, and simultaneously formulate reasonable data-sharing and usage regulations to ensure compliance,” Liu asserted.
