Google recycles the web for its AI, even when it is told no

The current antitrust trial in the United States is shining a light on Google’s practices for training its artificial intelligence models. At the center of the questioning: the use of content published online, even when its authors have explicitly refused to let it be used for this purpose.

On May 3, during a hearing before a federal court in Washington, Eli Collins, a vice-president of product at Google DeepMind, confirmed a technical point with major implications: the teams in charge of Search at Google can train their artificial intelligence products, such as “AI Overviews”, on web content whose publishers have requested exclusion from the training process. The opt-out filter, expressed through the robots.txt file, only applies to the models developed by DeepMind. It does not govern the uses made by other divisions of the group, in particular the one in charge of the search engine.

A binary choice for content publishers

This internal distinction creates a singular situation. Publishers have a technical way to signal their refusal to have their content used to train AI models: the robots.txt file, widely used since the early days of the web to control indexing by search engines. However, Google has indicated that excluding content from the AI training done for Search is only possible if that content is also excluded from indexing in the search engine.

In other words, publishers must choose: accept that their content contributes to the training of Google’s AI products, or give up their visibility in the search engine. A difficult trade-off, as the consequences are heavy for actors who depend on search referrals for their traffic and their income.
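
To make the mechanism concrete, here is a minimal, hypothetical robots.txt sketch of what such an opt-out attempt might look like. It relies on the Google-Extended token that Google documents publicly for AI-training opt-outs and on the Googlebot token for search crawling; neither token is named in the testimony reported here, and the exact scope of each filter is precisely what the trial is probing.

    # Hypothetical robots.txt: block AI training while staying indexed in Search.
    # A sketch for illustration, not a description of Google's internal filters.

    # Google-Extended is the token Google documents for opting out of
    # Gemini / Vertex AI training on this site's content.
    User-agent: Google-Extended
    Disallow: /

    # Googlebot remains free to crawl and index the site for Search.
    User-agent: Googlebot
    Allow: /

According to the testimony reported at the hearing, a directive of this kind does not cover what the Search division does with the content Googlebot collects, for example to feed features such as AI Overviews; the only way to escape that use would be to block Googlebot itself, and thus to disappear from the results.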

Massive data volumes, partially filtered

An internal document presented at the hearing reveals that in August 2024, Google reportedly removed 80 billion “tokens” (units of text), half of a total of 160 billion, after applying its exclusion filters. This operation marks an attempt to take the publishers’ preferences into account, but constitutes only a partial filter. The same document also mentions the use of data from search sessions, YouTube videos and other interactions with Google services to improve the models.

This often behavioral data is not subject to the same control mechanisms as web content. It gives Google a structural advantage in building internal datasets for training its AI.

A continuous improvement loop

The responses generated by AI in search results – at the top of the page, before conventional links – are arousing increasing concern. Several website publishers believe that these responses reduce the number of clicks sent to their pages, in favor of information summarized directly in the search interface. This phenomenon not only accentuates dependence on the platform, but also reduces the economic prospects of content producers.

At the same time, the AI models integrated into Search benefit from massive exposure and are continuously improved by users’ interactions with Google services. This loop of data collection, generation of responses, capture of attention and further training gradually improves the quality of the services offered, and significantly strengthens Google’s position.

A competitive and legal dimension

The current trial, brought by the US Department of Justice, seeks to determine whether Google’s practices in search and artificial intelligence violate antitrust laws. Among the remedies mentioned: banning the contracts through which Google becomes the default search engine, or the divestiture of its Chrome browser. The authorities also want to impose restrictions on how data collected via Search can be used to train AI models.

During the hearing, Diana Aguilar, representing the DOJ, cited an internal document in which the DeepMind CEO, Demis Hassabis, mentioned the possibility of training a model on the search engine’s ranking data in order to measure the improvement obtained.

Regulation that remains vague

The case illustrates the complexity of governance mechanisms around AI, and the difficulty regulators have keeping up with the rapid evolution of practices. While robots.txt remains a useful tool for governing indexing, it appears insufficient in the face of the integration of AI into search interfaces.

The debate opened by this trial goes beyond the case of Google alone. It questions the way technology companies can build and exploit competitive advantages from resources that partly belong to the public domain or are produced by third parties. It also raises the question of effective regulation, capable of distinguishing legitimate uses of AI from practices that can reinforce a dominant position.