Can we really train an LLM on personal data?

The rise of large language models (LLMs) such as GPT, Llama or Mistral marks a turning point in the automation of language processing. Their ability to understand, summarize or generate text is unprecedented. But their effectiveness rests on a massive and often opaque training phase, which raises a fundamental question: can personal data be used in Europe to train an LLM?

A strict legal framework, often misunderstood

Article 4 of the GDPR defines personal data as any information relating to an identified or identifiable natural person. Training a model on emails, CVs, Slack messages or HR documents containing names, addresses, identifiers or health data therefore constitutes processing of personal data.

This processing triggers several obligations:

  • Have a legal basis (legitimate interest, consent, performance of a contract, legal obligation or vital interest).
  • Inform the data subjects of the purpose of the processing.
  • Minimize the data used (principle of proportionality).
  • Guarantee the security of the processing and prevent any leak or re-identification.
  • Allow the exercise of the right to erasure.

However, in the case of LLMs, most of these obligations become extremely difficult to meet once the data has been ingested into the model.

Learning logic: a regulatory black box

Unlike a search engine, an LLM does not index content: it encodes statistical representations across billions of parameters. In principle, this prevents personal data from being identified or extracted directly from the model. However, empirical tests have shown that:

  • Certain prompts can extract identifying data from the model.
  • LLMs can output telephone numbers or names that were present in their training corpus.
  • It is almost impossible to "untrain" a model without retraining it entirely.

The principle of privacy by design is therefore difficult to apply if these precautions were not built in before training.

Concrete cases and associated risks

🟠 Case 1: A company's internal data

A company wants to fine-tune an LLM on its internal exchanges (support tickets, contracts, emails). If it has not obtained the explicit consent of the employees or customers concerned, it violates the GDPR. Even with on-premise hosting, the use remains unlawful without a clear legal basis.

🟠 Case 2: Public data on the web

Many training corpora incorporate data from Wikipedia, GitHub, Stack Overflow or Common Crawl. However, the mere fact that information is public does not strip it of its personal character. For example, a forum user's pseudonym can be linked back to a real person.

🔴 Case 3: Open source pre-trained models

If a company uses an already-trained open source model (e.g. Llama, Falcon), it inherits the legal risk if unlawfully collected personal data has been incorporated into it.

What technical and operational solutions exist?

1. Data pre-processing (filtering, anonymization)

Before any training, it is imperative to identify and clean personal data. This involves:

  • Detecting named entities (NER).
  • Deleting or pseudonymizing sensitive elements.
  • Checking the origin and legitimacy of the corpus.

Limitation: anonymization is rarely perfect. Cross-referencing with other data can allow re-identification.
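As an illustration, here is a minimal pre-processing sketch in Python. It assumes spaCy and its small English model are available; the entity labels and placeholder scheme are illustrative, not a complete PII taxonomy.

```python
# Minimal pre-processing sketch: detect named entities with spaCy and replace
# them with neutral placeholders before the corpus reaches the training
# pipeline. Assumes the "en_core_web_sm" model is installed; the label set and
# placeholder scheme are illustrative, not a complete PII taxonomy.
import spacy

nlp = spacy.load("en_core_web_sm")

# Entity types chosen for masking in this illustration.
SENSITIVE_LABELS = {"PERSON", "ORG", "GPE", "DATE"}

def pseudonymize(text: str) -> str:
    """Replace detected sensitive entities with placeholders like <PERSON>."""
    doc = nlp(text)
    pieces, last = [], 0
    for ent in doc.ents:
        if ent.label_ in SENSITIVE_LABELS:
            pieces.append(text[last:ent.start_char])
            pieces.append(f"<{ent.label_}>")
            last = ent.end_char
    pieces.append(text[last:])
    return "".join(pieces)

# Names, places and dates detected by the model are replaced before training.
print(pseudonymize("Marie Dupont joined the Lyon office on 3 May 2021."))
```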

2. Use of alternative methods: RAG (Retrieval-Augmented Generation)

The model is not trained on the sensitive data; instead, it accesses it dynamically via an external document base.

Example: an LLM has not learned the content of a contract, but it can access it at prompt time via an internal search engine. The information remains localized and can be modified or deleted at any time.
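A minimal sketch of this pattern, with a naive keyword retriever and a placeholder `call_llm` function standing in for whatever inference endpoint is actually used (both are assumptions for illustration):

```python
# Minimal RAG sketch: documents stay in an external store and are retrieved at
# query time, so the model never memorizes them. The keyword retriever and the
# call_llm placeholder are assumptions for illustration, not a real pipeline.

DOCUMENTS = {
    "contract_42": "Maintenance contract between Acme and Globex, renewed each January.",
    "policy_hr": "Internal HR policy on remote work, updated in 2024.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    def score(text: str) -> int:
        return sum(1 for word in query.lower().split() if word in text.lower())
    return sorted(DOCUMENTS.values(), key=score, reverse=True)[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for the actual inference call (API or local model)."""
    return f"[model answer based on a {len(prompt)}-character prompt]"

def answer(question: str) -> str:
    context = "\n---\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

print(answer("When is the Acme maintenance contract renewed?"))
```

Because the documents live outside the model, removing an entry from the store is enough to honor an erasure request, with no retraining.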

3. Controlled fine-tuning and sovereign hosting

When internal training is necessary, it must:

  • Be carried out in a controlled environment (on-premise or trusted cloud).
  • Rely on documented processing, with an established legal basis.
  • Be entered in a GDPR-specific processing register.
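As an illustration, such a register can be kept close to the training pipeline itself. The following sketch uses hypothetical field names and values, not an official Article 30 template.

```python
# Illustrative sketch of a processing-register entry (GDPR Art. 30) kept
# alongside a fine-tuning run. Field names and values are assumptions for
# illustration, not an official template.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ProcessingRecord:
    purpose: str                     # why the training is performed
    legal_basis: str                 # e.g. "legitimate interest"
    data_categories: list[str]       # what the corpus contains
    retention: str                   # how long corpus and model are kept
    recipients: list[str]            # who can access the model and data
    security_measures: list[str]     # hosting, encryption, access control
    created_on: date = field(default_factory=date.today)

record = ProcessingRecord(
    purpose="Fine-tune an internal support assistant on pseudonymized tickets",
    legal_basis="legitimate interest (balancing test documented)",
    data_categories=["support tickets (pseudonymized)", "product documentation"],
    retention="training corpus deleted after the run; model reviewed yearly",
    recipients=["internal support team"],
    security_measures=["on-premise GPU cluster", "role-based access", "audit logs"],
)
print(record)
```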

Governance and responsibilities

🔸 Who is responsible?

  • The data controller (often the user company).
  • The model supplier (publisher or integrator).
  • The processor that provides infrastructure or cloud services.

🔸 Risks involved:

  • Financial penalties (up to 4% of worldwide turnover).
  • Civil litigation (in the event of harm to privacy).
  • Loss of trust or reputational damage.

What next? Litigation in the making

In the United States, class actions are emerging against OpenAI, Google and Meta. The plaintiffs denounce training without consent on their data (creative works, photos, journalistic content). In Europe, data protection authorities (CNIL, EDPS) are closely examining LLM processing practices, particularly in the context of the Data Governance Act and the AI Act.


Summary

  • Training on HR data. Main risk: violation of the right to be forgotten. Recommended solution: anonymization, or RAG with controlled access.
  • Use of public web data. Main risk: possible re-identification. Recommended solution: prior consent or exclusion of the data.
  • Open source pre-trained model. Main risk: encapsulation of unknown personal data. Recommended solution: corpus audit and hallucination tests.
  • User requests (prompts). Main risk: involuntary disclosure of sensitive data. Recommended solution: prompt filtering, audited logs, monitoring.
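The last item above deserves a brief illustration: a minimal prompt-filtering sketch whose regex patterns and log configuration are illustrative assumptions, not a complete PII detector.

```python
# Minimal prompt-filtering sketch: redact obvious personal data from user
# prompts before they reach the model, and keep an audit trail of redactions.
# The regex patterns and the log configuration are illustrative assumptions.
import logging
import re

logging.basicConfig(filename="prompt_audit.log", level=logging.INFO)

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d .-]{8,}\d"),
}

def filter_prompt(prompt: str) -> str:
    """Redact detected PII and log the event (never the value itself)."""
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            logging.info("redacted %s from a user prompt", label)
            prompt = pattern.sub(f"<{label}>", prompt)
    return prompt

print(filter_prompt("Call me on +33 6 12 34 56 78 or write to jane@corp.example"))
```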

In conclusion

Training an LLM on personal data is not prohibited, but it is strictly regulated. Companies must now treat LLMs not as magic boxes but as high-risk data processing. Without robust governance, dataset audits and infrastructure control, the use of these models could expose them to major sanctions.