17 July 2024 Part 19: Language models with and without personal data

We have explained the technical aspects of a large language model in part 17 of our blog. But what conclusions can we draw from this in terms of data protection? Does an LLM contain personal data within the meaning of the GDPR and the Data Protection Act? A proper answer to this question must take into account the intrinsic nature of such AI models as well as the needs of the data subjects. This part 19 of our AI blog series will provide an answer.

Executive Summary:

Whether or not personal data is contained in a large language model (and whether such a model produces such data) must be assessed from the perspective of those who formulate the input and those who have access to the output. Only if it is "reasonably likely" that these persons will formulate an input that leads to the generation of texts containing personal data can it be assumed that a language model contains such data (provided, of course, that the persons with access to the output can also reasonably identify the data subjects). Whether this is reasonably likely depends not only on the objective possibility that such input will be written, but also on the interest of these persons in doing so. This follows from the so-called relative approach and the further case law on the GDPR and the Swiss DPA, whereby the input in the case of a language model is one of the means that make it possible to identify the data subject to whom the factual knowledge of a language model relates, given that this knowledge can only be accessed via output and every output necessarily requires an input to which the model can react. For one user, a language model will not contain or even produce any personal data; for another, the same model will do so. In addition, it should be noted that not all personal data generated by the model is based on the factual knowledge stored in the model. It can also be merely a random result because the model does not have more probable content available for the output. In such cases, the output may contain personal data, but the data is not in the model. This applies even if the data in question has been "seen" by the model during training. This is because the model does not "memorise" most of this data as such, but only information seen more often. Although there are other ways in science to infer the content of a model, for example by analysing the "weights" it contains, these are irrelevant for answering the question of whether and which personal data it contains, because according to the relative approach it only depends on how the respective user of the model uses it – and this user normally accesses it via a prompt (i.e. input). The different phases in the "life" and use of a large language model must therefore also be regarded as different processing activities for which there are different controllers. Those who have trained the model are not automatically responsible for data protection violations by its users – and vice versa. From a data protection perspective, an LLM thus only contains the few personal data that have been memorised during training due to their frequent occurrence in the training data. These data are stored only in pseudonymised form and are accessible only to those who have the key. The key, or means of identification, is the prompt. Whether it is reasonably likely that it will be used has to be decided for each specific application.

Here you can download a chart that illustrates how large language models can store and retrieve personal data.

At the very end of this blog there is a box with further legal explanations and a commentary on the theory that large language models per se do not contain any personal data.

At first glance, the answer to the question in the lead seems clear: it must be a clear "yes", especially as we all know that we can also use a language model to generate statements about natural persons – without the names of these persons being mentioned in the prompt. If you ask ChatGPT or Copilot who the Federal Data Protection Commissioner (i.e. the head of the Swiss data protection supervisory authority) is, you will find out that his name is Adrian Lobsiger – and you will learn a lot more, even that he was a public prosecutor and judge in the Swiss Canton of Appenzell Ausserrhoden (both of which are wrong, but we will discuss how to deal with such errors in another blog post).

This is why most data protection professionals and authorities take it for granted that language models contain personal data without asking any further questions. No distinction is made as to whether a language model has merely picked up the identifying elements in the output from the prompt ("Write me a love letter for Amelia Smith from Manchester, my first love from kindergarten"), whether the chatbot has the personal data from an Internet query or whether it really has been sourced from within the language model. No distinction is made as to whether they are completely random results or originate from factual "knowledge" contained in the language model.

The answer to this question is by no means trivial. As far as we can see, this question has also barely been discussed to date. So far, the Danish Data Protection Authority and the Hamburg Data Protection Commissioner are of the opinion that LLMs per se do not contain any personal data; at the same time as the publication by the Hamburg Data Protection Authority, an essay by Flemming Moos was also published arguing the same (see the box at the end of this blog, explaining why we disagree with their analysis and conclusion). Conversely, the blanket assumption that an LLM as such is always subject to data protection – including the right of access and the right to be forgotten – is not correct, either, because it does not take essential elements of such models into account. For example, it would mean that anyone who stores a model on their system (which many people do without even knowing) automatically becomes a controller with all the consequences. Once again, the truth (and reasonable solution) lies somewhere in the middle. 

Delimitation of responsibility required

First of all, the question of whether a large language model is subject to data protection is basically the wrong one. Data protection obligations apply to persons, bodies or processing activities, but not objects. So, if we ask ourselves whether the EU General Data Protection Regulation (GDPR) or the Swiss Data Protection Act (DPA) applies to an LLM, we have to ask which party this question refers to. The answer may be different depending on whether it is answered from the perspective of the person who trains the LLM, the person who keeps it stored on their computer, the person who uses it indirectly by accessing a service such as ChatGPT or Copilot, or the person who makes it available as a service in the cloud, as hyperscalers do for their customers.

We have noticed that the distinction in responsibility between the creator of an LLM and the various forms of its users, which was already being advocated last year, is becoming increasingly established in practice. This also makes sense: the training of an LLM is a different processing activity in relation to personal data than the use of an LLM to generate output in which personal data can be found. If data protection has been violated when creating an LLM (e.g. by using unauthorised sources), the person who uses the LLM and whose output does not contain personal data that relates to such violation cannot be held responsible for the violations that have occurred in an earlier and separate processing activity. Also, an LLM generally does not, as such, store the personal data it has been shown during training (more on this below). If you want to know more about the technical aspects of an LLM, you should read our blog post no. 17.

This does not mean that there are no circumstances in which a downstream entity that uses such an illegally trained LLM can be held responsible for the upstream controller's breach of the law. There are cases where this could happen, but they presuppose, at least in terms of data protection, that the earlier breach is somehow continued in the form of personal data that is being processed by the downstream entity at issue. The reason is that data protection obligations are linked to personal data, and dependent on its existence. If such data no longer exists in this form, the responsibility under data protection law ends, as well. The doctrine of the "fruit of the poisonous tree" does not apply in data protection beyond the specific "poisoned" personal data.

Independent processing activities

For each processing activity in the life cycle of a language model, it must therefore be determined independently who is considered the controller and who is not, or who may be the processor. The consequence of this is that a person who trains an LLM correctly in terms of data protection law cannot, in principle, be held responsible merely because of such training for infringements that a separate controller later commits with it – for example, because erroneous output is used without verifying its accuracy or because the model is used to generate unlawful output. The only link between the two that could result in liability of the former for the latter would be the act of disclosing the LLM to such a subsequent controller.

A distinction must also be made between the different users in the same way. When companies A, B and C access the LLM stored at a hyperscaler D, there are three different processing activities that occur. Company A is not responsible for what B and C do with the same model, and D will generally not be responsible for these processing activities either, as it will usually only act as their processor. D is ultimately only responsible for the storage of the model on its servers – provided that the model contains data from the perspective of D, which is by no means clear.

As the following diagram illustrates, there are several distinct areas of responsibility in practice: One is the creation of the language model, others are making it available in a cloud (as Microsoft, AWS, Google and OpenAI do, for example), offering AI services to the general public (such as ChatGPT, Copilot or DeepL), various forms of deployment within companies (with models used locally but also access to those in the cloud or via AI services) and the development of AI solutions:

In particular, companies that basically only use AI for their own purposes will be happy to hear that most of the legal disputes that language models have led to so far primarily concern those who train such models or offer them as SaaS services on the open market that are accessible to a wide audience. The demarcation between the creation and the use of an LLM, in the sense of a legal "firewall" shielding both sides, is becoming increasingly clear. This is good news insofar as it reduces the practical risks for the mere users of AI. Although we expect to see court rulings finding that the creation of an LLM has infringed third-party rights, and fines issued to some of the big names in the AI market for violations of data protection, we predict that these developments will not fundamentally alter the way these technologies are used and spread.

Definition – the relative approach

This brings us to the core question of this discussion: Does a large language model contain personal data? We will answer this question with regard to large language models that are based on the Transformer technique, as described in our blog post no. 17, which you should read to aid your understanding of the following considerations.

To answer the question, we have to dissect the definition of personal data as per the GDPR and the DPA. We have to be careful not to rely on the understanding of the term that has been propagated by many European data protection authorities in the past, as it is overbroad and not supported by current law: Neither is singularisation sufficient for identification (it is only one element of identification), nor is it sufficient that an identification can be made by any person and not only by those with access to the data (also known as the "absolute" or "objective" approach).

Rather, we rely on the interpretation of the ECJ for EU law, namely in "Breyer" (C-582/14) and most recently in "VIN" (C-319/22), and the Swiss Federal Court for the DPA, as in "Logistep" (BGE 136 II 508): Whether a piece of information is considered personal data can only be assessed from the perspective of those with access to the information at issue. If the data subjects to whom the information relates can be identified by these parties with access, then the information is personal data (for them), otherwise it is not. This is also known as the "relative" approach. Whether the controller itself can identify the data subjects is not essential; anyone who discloses information that is not personal data for them, but is for the recipients, is nevertheless disclosing personal data to a third party and must therefore comply with data protection with regard to such processing activity (this is also how the reference to third parties in para. 45 of the decision C-319/22 is to be understood).

The persons are only identifiable if the necessary means are "reasonably likely to be used", to cite Recital 26 of the GDPR. As a certain degree of probability is therefore required, a theoretical possibility of identification (as wrongly assumed by various EU data protection authorities in the "Google Analytics" decisions) is not sufficient. The Swiss Federal Court already established this years earlier for the DPA by defining two components (BGE 136 II 508, E. 3.2 ff.): It is not enough that identification is objectively possible with a certain amount of effort (objective component). The interest in making the effort required for identification also has to be considered and, thus, sufficient (subjective component). 

The interest in identification

At first glance, this Swiss view does not appear to be entirely the same as laid out in the recent "VIN" decision of the ECJ. According to the latter, it is sufficient that the means for identification for the party with access to the data are "reasonably likely to be used" (C-319/22, para. 45, with the German version saying "reasonably … could be used") and that such party can be regarded as someone "who reasonably has means enabling (identification)" (para. 46), but not whether the party actually uses these means. However, since an objective standard applies to the "reasonableness" test that has to be applied, the difference between the Swiss and the ECJ approach will be small to irrelevant in practice.

Even if the ECJ did not define what it means by "reasonably" in "VIN", it is clear from common language and expertise that the interest in identification will be decisive, as it is precisely this interest that determines the effort a party makes for this purpose, at least from an objective point of view (as required by the GDPR). Anyone who has no interest in identifying a data subject will – reasonably – not make any effort to do so, as there is no benefit to justify the effort. This does not rule out the possibility that an organisation could nevertheless obtain data to identify a data subject. However, if it takes effective measures against this happening, for example by deleting such identifying information or only processing it in such a way that it cannot be used for identification, this must be taken into account when considering which means of identification it "reasonably" has at its disposal. It makes a difference whether someone has information in a form that can be used for identification or whether someone has such information, but cannot practically use it for identification. To describe the situation with an example: If the information for identification is somewhere in a large, disorganised pile of documents, but the owner has no idea where, they will only search through the whole pile if their interest in the information is large, and only then do they "reasonably" have the means to identify the data subject. So even under the GDPR, we are in line with the formula of the Swiss Federal Court, according to which it all depends not only on the means available, but also on the interest in obtaining or using them.

The "motivated intruder test" of the UK supervisory authority ICO also relies on the interest in identification when answering whether data has been truly anonymized, i.e. no longer constitutes personal data for anyone: It asks whether a "motivated attacker" would be able to identify the individuals without prior knowledge using legal means and average skills. While it presupposes interest in obtaining and using these resources the extent to which such interest exists must be determined for each individual case. The criteria for doing this includes, among other things, the attractiveness of the data. And where people with additional knowledge or special skills typically have no interest in using these means, this can also be taken into account according to the ICO.

The definition of personal data is therefore basically the same under the GDPR and the DPA. As a result, the question of whether certain information constitutes personal data cannot be answered in absolute and general terms. First, it can only be assessed in relation to those parties that have access to the information; second, it must be determined whether they have relevant means of identification available to them, which in turn depends on their interest in identification, as this interest determines the effort they will objectively make to obtain and use such means.

Application to large language models

These findings on the concept of personal data can be applied to large language models and provide corresponding answers.

As a first step, it should not be contentious that anyone looking inside a large language model will not see any personal data. Such a model contains a dictionary of tokens (i.e. words or parts of words) and first and foremost many numbers. So, if you get your hands on a file with an LLM and don't know how to make it work and have no interest in doing so, you have no personal data (which doesn't mean that you are out of scope of data protection law if you were to give the file to someone who has this knowledge and interest).

Although there are other ways of determining the "knowledge" content of an LLM other than simply using it (see, for example, the research results from Anthropic here), this is not yet relevant to the practical use of an LLM, which is why we will not look any deeper into this here. Anyone who does not use such methods to look inside a language model when using an LLM does not have to be judged on the basis of the findings that would be possible with it when it comes to the question of whether their application is subject to data protection. 

According to the relative approach, whether a large language model contains personal data must be assessed solely from the perspective of the user and those who have access to the output. This is because only they create the input and can identify the data subjects in the output generated from it. Therefore, only the way in which they use the language model in question can be relevant. This will typically be via a prompt, i.e. the formulation of the input. 

If they cannot reasonably be expected to provide input that generates output relating to specific data subjects, or if those with access to the output do not reasonably have the means to identify those data subjects, then the use of the language model in their hands will not result in personal data and the data protection requirements will not apply in this case. Whether the model would be able to produce personal data under other circumstances is then irrelevant.

However, even if personal data is to be expected in the output (and therefore the handling of the output is subject to data protection), this does not say anything about whether such data is also contained in the model itself. This aspect needs further consideration.

Factual knowledge through associations seen in the training material

To begin with, a machine does not have to contain personal data in order to generate it. A computer programme that randomly invents combinations of first and last names and outputs a text with an invented date as a birthday corresponding to these names generates personal data, at least if the recipients assume that the information actually relates to a natural person. However, if the recipients know that the person does not exist, there is no data subject – and without a data subject, there is no personal data.

The situation is similar with language models, except that they are not programmed by humans and are much more complex. The output that an LLM generates is also not part of the model itself, but only the result of its use – just as the random generator just mentioned does not contain the finished texts that it outputs, but creates them anew each time from a stock of non-personalised building blocks, based on an understanding of how language works and which terms fit together and how. An LLM by definition provides knowledge about language, not about general facts. Nevertheless, knowledge about how sentences are created and how words and sentences on certain topics typically sound or how they fit together inevitably also includes associated factual knowledge. It is more or less diffusely contained in the LLM as associations and affinities. If it is triggered with the right input, it can manifest itself in the output if the probabilities are right. 

However, we must be aware that an LLM is not a database in which facts are stored that can be easily retrieved as long as the correct query command is given. An LLM is basically a text generation machine that attempts to "autocomplete" a certain input and uses empirical values from training to do so. If you ask a common LLM when Donald Trump was born, the answer will be "On 14 June 1946". If you ask who was born on 14 June 1946, the answer will be "Donald Trump". On the other hand, if you ask who was born in June 1946, you will sometimes get Trump as the answer, sometimes other people who were not born in June 1946 (but had their birthday then), sometimes others who were actually born in June 1946, depending on the model. The models have seen the date 14 June 1946, the word "birthday" or "born" and the name Donald Trump associated with each other very frequently during training and therefore assume that this pairing of words in a text is most likely to match, i.e. they have a certain affinity with each other and therefore form the most likely response to the prompt. The models do not "know" more than this. In particular, they do not contain a table with the birthdays of celebrities that can be looked up for the response. Only the right combination of input and the numbers in the model (as well as certain other parameters) produces the output. Each of these elements on their own will not produce anything close.
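This mechanism can be observed directly, because current model APIs can return the log-probabilities of the tokens they generate. The following minimal sketch is our own illustration and not part of the experiments described in this article; it assumes the OpenAI Python SDK, a configured API key and access to the "gpt-4o" model:

```python
# Our own minimal sketch (assumptions: OpenAI Python SDK installed, API key
# configured, "gpt-4o" available): show the probability distribution behind
# the tokens of an answer about Donald Trump's date of birth.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "When was Donald Trump born? Answer with the date only."}],
    temperature=0.0,
    max_tokens=12,
    logprobs=True,      # return log-probabilities of the generated tokens
    top_logprobs=5,     # plus the five most likely alternatives per position
)

print(resp.choices[0].message.content)
for pos in resp.choices[0].logprobs.content:
    alternatives = ", ".join(
        f"{alt.token!r} ({alt.logprob:.2f})" for alt in pos.top_logprobs
    )
    print(f"chosen {pos.token!r} – alternatives: {alternatives}")
```

For a strongly memorised association such as Donald Trump's date of birth, the chosen tokens will clearly dominate the listed alternatives; for weakly associated details, the distribution is much flatter.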

A language model contains personal data ...

The practical difficulty for the user of an LLM (and thus also for data protection) is that they generally do not know which parts of the output represent factual knowledge and which parts are invented, because an LLM itself also does not distinguish between fact and fiction, but only between more or less probable reactions to the respective input.

Whether the output generated by a system has very low probabilities (and is thus likely a "hallucination"), whether a part of the output it generates has been an obvious choice for the LLM due to the training material (and therefore a "good" output is generated from its point of view), but it is factually incorrect (because the training material was already incorrect), or whether the system recognises that it cannot give a reliable answer and therefore does not generate a factual answer, depends on its initial training and its other "programming" (such as the so-called alignment or the system prompt).

Today, there are ways of visualising the associations and affinities that are preserved in an LLM (see the research work by Anthropic cited above - they refer to "features" of a model). However, an LLM cannot know whether its answer is factually correct because it does not "think" in such categories. The fact that it nevertheless often formulates its answer as if it were proclaiming established facts is one of the main causes of the problems we have today with the use of language models. We are very easily deceived by self-confident formulations that sound as if they come from a human being.

However, even if language models are not technically designed to store factual knowledge about individual, identifiable natural persons and therefore personal data, this can still occur, as shown in the example of Donald Trump. Even the association between Trump, his birthday and 14 June 1946 will not be a 100% probability in the respective models, but ultimately this is not necessary for it to be the most probable choice for the output. It is clear to everyone that this information, and therefore personal data, is somehow "inside" the models in question. For this reason, the statement that the large language models do not contain any personal data is not correct in this generalised form.

... but not any data it saw during training

Conversely, it is also incorrect to assume that the large language models contain all the personal data that they have "seen" during training. This is already impossible in terms of quantity alone: GPT-3, for example, requires 350 gigabytes of storage space, but it was fed around 45 terabytes of data during training. AI researchers like to describe the training of language models as "compressing" the training content so that it fits into the model. From a data protection perspective, however, this can mislead people into assuming that the original information content and therefore the personal reference is retained. This is normally not the case; in the case of GPT-3, a compression by a factor of around 128 took place when looking at it from a purely mathematical point of view, whereby the focus was on the preservation of linguistic knowledge, not factual knowledge.

Usually, only information (e.g. references between words that have a common meaning) that occurs frequently in the same way or is specific enough becomes a fixed association or affinity and "survives" this compression process. In other words: The compression is very, very "lossy", i.e. most of the details are lost. However, even science does not know exactly how much a language model remembers. Even if a piece of information survives this training process, it is not clear in what form, and it is also not clear whether the content is correct – be it because of errors in the training material or because the association or affinity was formed incorrectly due to other circumstances. Instead of distinguishing between fact and fiction, the model provides for more or less strong associations and affinities between the individual elements of language, which results in a probability distribution when the model is asked to calculate the next word for a sentence that has been started.

The above makes it clear why, from a data protection point of view, it is not really decisive for the assessment of a language model whether certain personal data occurred in the training material. If anything, this would only be relevant if it can be shown that the information in question has survived in the model as a relevant association or affinity, which depends above all on the frequency of its occurrence. Conversely, the fact that certain personal data did not occur in the training material does not mean that the training process did not, by coincidence, create person-related associations or affinities in the model – for example, an apparent connection between a name, the word "birthday" and a certain date. If a person is always mentioned together with another person in public texts, this can lead to a reciprocal association of attributes of the two persons: the profession, title or place of residence of one person is attributed to the other and vice versa.

The literature repeatedly refers to studies and methods (see for LLMs e.g. here and the related blog post here) that make it possible to determine whether certain information – including personal data – has been used to train a model (usually referred to as "membership inference attacks"). It is emphasised that these attack methods pose a risk to data protection because they can be used to extract training content. This overlooks the fact that in the models in question, the training content can be found in the output even without an "attack" if the input is suitable, because the model has "seen" it sufficiently often during training; this is the phenomenon of "memorisation", i.e. the model remembers a particular piece of content seen during training, such as Donald Trump's date of birth. In terms of data protection law, the corresponding personal data is therefore contained in the model anyway if corresponding inputs are to be expected. The issue here is therefore the confidentiality of the training content and not the data protection compliance of a model in general. If the training content is not confidential (or does not have to remain anonymous), the possibility of such attacks is irrelevant. We will explain what this means for the training or fine-tuning of models in a separate blog post.
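For readers who want to see the basic idea behind such methods, the following much simplified sketch illustrates a loss-based memorisation heuristic: text to which a model assigns an unusually low perplexity is more likely to have been seen (often) during training. This is our own illustration, not one of the attack methods cited above, and it uses a small open model via the Hugging Face transformers library as a stand-in:

```python
# Our own, much simplified sketch of a loss-based memorisation heuristic
# (assumptions: Hugging Face transformers and torch installed; "gpt2" is
# used as a small open stand-in for the models discussed in the text).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # average NLL per token
    return math.exp(out.loss.item())

candidate = "Donald Trump was born on June 14, 1946."
control   = "Donald Trump was born on June 17, 1946."  # same form, altered date
print(perplexity(candidate), perplexity(control))
# A markedly lower perplexity for the candidate suggests that the association
# was reinforced during training; similar values suggest that it was not.
```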

A model does not simply contain what it generates

Following these explanations, we now want to answer the question of whether a large language model itself also contains all the personal data it generates in the output.

The answer is "not necessarily". Take the birthday of another public figure, namely the data protection activist Max Schrems. His organisation has filed a data protection complaint against OpenAI because its "ChatGPT" service incorrectly displayed the date of birth of a prominent person (Max Schrems?). Indeed: If the OpenAI model "GPT-4o" from OpenAI is given the input "When is Max Schrems' birthday?", it with a temperature of 0.0 states the 9 October 1987, with 0.5 the 14 October 1987 and with 1.0 the 3 October 1987 (all tests done in German).

These outputs show that the model seems to "know" the year and month of birth of "Max Schrems" (we leave it open here whether this is correct), but has to "guess" the day. The variations with regard to the particular day indicate that the model does not have an association between "Max Schrems" and a particular day that is significantly stronger than the association with other days. The temperature, as a parameter, affects the probability distribution of the possible answers (here concerning the days in October 1987) particularly in cases where the probabilities are rather similar, i.e. if no number is particularly probable. In practice, such "guessed" information can also be identified by slightly varying the input. For example, the question "What is Max Schrems' date of birth?" in GPT-4o (in German) results in 11 October 1987, i.e. yet another date, at the same temperature.
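This kind of experiment can be reproduced with a few lines of code. The following sketch is our own illustration and makes several assumptions (the OpenAI Python SDK, access to the "gpt-4o" model, and an English prompt instead of the German one used in our tests):

```python
# Our own sketch of the temperature experiment (assumptions: OpenAI Python
# SDK, API key configured, "gpt-4o" available; English prompt instead of the
# German one used in our tests).
from openai import OpenAI

client = OpenAI()

for temperature in (0.0, 0.5, 1.0):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "When is Max Schrems' birthday? Answer with the date only."}],
        temperature=temperature,
        max_tokens=20,
    )
    print(temperature, resp.choices[0].message.content.strip())
# If the day of the month changes with the temperature (or with slight
# rewordings of the prompt), this points to guessing rather than to a stored
# association in the model.
```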

If the model lacks such a clear association, we cannot conclude that the information in question is contained in the model. It appears rather that it is not contained in it, just as the random birthday and name generator described above does not contain the personal data it generates. The fact that the model nevertheless produces a statement as if it knew what it is talking about is not the result of knowledge, but of how a language model is designed to act (we will discuss separately what this means from a data protection point of view). So while the building blocks for generating a date for "Donald Trump" in response to the question about his birthday are obviously so close together within the model that they regularly occur in the output, there is no such building block for the day near "Max Schrems", which is why the model has to extend its search to the wider environment of "Max Schrems" and his birthday in order to find a number when generating an output, figuratively speaking.

This highlights a weakness of many current applications of language models: they are designed to always provide an answer, even if they have to provide answers that are not very probable. In such cases, they fill the gaps in their "knowledge" to a certain extent by inventing the corresponding information. Although the information is not completely plucked out of thin air, it does not have a reliable basis. Depending on the application, such behaviour may be desirable or at least unproblematic, but it may pose a problem where facts are expected or assumed to exist.

One possible solution to this problem is the use of so-called confidence thresholds, i.e. the systems are programmed in such a way that they either only produce an answer if they are sufficiently certain of it or they indicate how certain they are of the individual statements. In the case of deterministic AI – i.e. systems that specialise in recognising or classifying certain things – such values are commonly used. In the field of generative AI, however, this is not yet very common. In our view, it should be used more often. For example, a chatbot can be programmed so that it only provides an answer if it is relatively certain of it. It is, however, not clear how high the probability must be for something to be considered a (supposed) fact instead of (supposed) fiction.
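To illustrate what such a threshold could look like for a generative system, here is a minimal sketch of our own. Using the average token probability as a confidence proxy and the threshold value of 0.8 are assumptions made for illustration, not an established standard:

```python
# Our own sketch of a confidence threshold for a generative answer
# (assumptions: average token probability as confidence proxy, threshold of
# 0.8, OpenAI Python SDK, "gpt-4o" available).
import math
from openai import OpenAI

client = OpenAI()

def answer_if_confident(question: str, threshold: float = 0.8) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=0.0,
        max_tokens=30,
        logprobs=True,
    )
    choice = resp.choices[0]
    token_probs = [math.exp(t.logprob) for t in choice.logprobs.content]
    confidence = sum(token_probs) / len(token_probs)  # average token probability
    if confidence < threshold:
        return f"I am not sufficiently certain to answer this (confidence {confidence:.2f})."
    return f"{choice.message.content} (confidence {confidence:.2f})"

print(answer_if_confident("When was Donald Trump born?"))
print(answer_if_confident("When was Max Schrems born?"))
```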

Moreover, this probability is also only in relation to the training material: If Max Schrems' date of birth from our example appears incorrectly in the training data with sufficient frequency, the LLM will produce it with a high degree of certainty and still be wrong. Of course, this does not change the data protection qualification of such information being personal data; incorrect information can be personal data, too (we will discuss whether such information is actually considered incorrect under data protection law in another blog post).

There is a further complication: If the model is not asked for the birthday of "Max Schrems", but "Maximilian Schrems", GPT-4o (in the German version) will with a relatively high probability output 11 October 1987. With this long form name, the association with the "11" is much stronger than with its short form, and it can be concluded that this date is indeed contained in the model. This may be because the training content with Mr Schrems' long name contained the date or otherwise made a special reference to the number "11" in the context of "birthday". The model cannot easily distinguish between these things. For the model, the names "Max Schrems" and "Maximilian Schrems" are strongly associated with each other, but with regard to the date of birth they lead to different results. It will therefore be important to work with such differences in both the right of access and the right of rectification.

Finally, there is one more point to consider: Of course, not every part of the output that a model generates with a high probability in connection with the mention of a specific person can be considered as personal data stored in the model about this person. The high probability or confidence must relate to information triggered by the occurrence of the person concerned (i.e. by their identifiers such as name, telephone number or e-mail address). Only then is the high probability an indication that factual information and a specific natural person (or his or her identifiers) are sufficiently close or associated with each other to be considered personal data in the model. If a model has the task of generating a children's story in which Donald Trump plays the main role, only limited information in the output will have an association with him within the model, even if the model produces the output with a high degree of confidence and even if the entire story appears to be Trump's personal data. In order to determine such a personal association within the model, the model will need to be "triggered" in a specific manner to eliminate the impact of other elements in the output as much as possible. We will address this topic in more detail in our blog post on the right of access in connection with LLMs.

The relative approach to the use of language models

The analysis of whether a language model contains personal data is not yet complete with the above. It must also take into account how likely it is that there will be any input at all that is capable of retrieving personal data from the model. This results from the relative approach and the fact that the input is one of the "means" that is necessary to identify any personal data of a person contained in the model. If there is a vault full of personal data, but no one with access to the vault has the key, there is no personal data for these entities or they have no data protection obligations when handling the vault (apart, possibly, from the obligation not to make it accessible to any unauthorised person with a key). In a large language model, the user's input assumes the function of this key, as shown. Consequently, no personal data is available in relation to the language model as long as no input can reasonably be expected that is capable of extracting the personal data contained in the model in the form of output.

At this point, we can make use of our above finding that, in order to answer the question of whether such input is "reasonably likely" (and also the question of the possibility of identifying the data subject), it is not only the objective ability of the users of the model that counts, but also their subjective interest in formulating such input. This makes a big difference in practice: objectively speaking, every user of a large language model can in principle create any input – all they have to do is type it in. In practice, however, they will only do so if there is sufficient interest in doing so.

Therefore, anyone who uses a language model in a controlled environment with prompts which, according to general judgement, are unlikely to lead to the generation of personal data, because the user has no interest in such prompts (e.g. because this is not suitable for the use case) does not have to worry about data protection with regard to either the language model or the output. This would not change even if, by chance, an input nevertheless led to an output that could be regarded as personal data because, firstly, the user of the model would not readily recognise it as such (as in the case of the random birthday generator) and, secondly, only the use of means of identification (including the prompt in this case) that are reasonably likely to lead to personal data are relevant. This presupposes a certain degree of predictability, which is not the case here.

The situation will be similar for a company that uses language models to generate content with personal data, but this data does not originate from the model itself, but from the input, be it from the user's prompt or other data sources consulted as part of an application with Retrieval Augmented Generation (RAG). In these cases, the language model is only used to process language and it is not the purpose of its use to supplement the output with factual knowledge or personal data from the model.

Conversely, a provider such as OpenAI, which makes its LLM available to any number of people, must also expect a correspondingly broad variety of prompts and must therefore assume that a correspondingly broad range of personal data will be generated by the model. It is as yet unclear whether the provider must expect any conceivable prompt or only certain prompts. Taking into account that the means of identifying a person must be "reasonably likely" to be used, this should not include every single prompt, even in the case of widely used chatbots such as "ChatGPT" or "Copilot". A provider of such a chatbot will have to assume that its users will ask the chatbot about public figures. However, if we take the "reasonably likely" standard seriously, they will not have to assume that users will use it to query factual knowledge about any other person. Consequently, they could take the position that their models do not contain any personal data about these other people – until it is shown that such prompts do occur, of course. For such providers, the above findings therefore only provide limited relief.

Max Schrems would be such a public figure. So, if he complains that his date of birth is incorrectly included in ChatGPT, the model would indeed have to be checked for this accusation. As shown for GPT-4o, this would presumably reveal that the date of birth as such is not actually included in the model because there is no clear association in the model, which is why it invents a day of the month when asked about his birthdate. It appears that factual knowledge only exists with regard to the month and year of birth.

As mentioned, the output that the chatbot would generate in response to the question about Max Schrems' date of birth must be separated from this. According to the above, it will in any event be personal data because such a question is to be expected with a public chatbot such as "ChatGPT" and "Copilot", even if the operator of the chatbot would not ask it himself. It is sufficient that its users do so and that its system discloses the corresponding data to them. We will discuss in a separate blog post whether this is actually to be regarded as incorrect under data protection law if Mr Schrems' date of birth is incorrect.

To sum up: It is possible that an output contains personal data, but the underlying language model does not.

Conclusions for large language models

In order to determine whether and which personal data a language model contains pursuant to the GDPR and Swiss DPA, in summary, the following steps must be considered:

  1. In the specific case, who formulates the input for the large language model used and who has access to its output?
  2. Is it "reasonably likely" that (i) these persons will use the language model with inputs that cause the model to generate outputs with personal data that do not originate from the input itself, and (ii) those with access to such outputs will be able to identify the data subjects concerned. When determining whether it is reasonably likely, the interests of these persons in creating such input and identifying the data subjects has to be considered, not only the objective ability to do so.
  3. To the extent that the model generates such outputs with personal data about a particular data subject, do these vary as the input about that individual and the temperature is varied, as well, or do they always remain the same in substance? In the latter case, this indicates that they are actually included in the model with a strong association to the data subject and that the data in question are not merely "gap fillers" due to a lack of better alternatives. In those cases, they are presumably contained in the model as personal data, and are not only included in the output by sheer coincidence or for reasons unrelated to the data subject.
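The following sketch illustrates, purely as an example, how step 3 could be operationalised; the model name, the prompt variants and the 80% consistency threshold are assumptions made for illustration only:

```python
# Our own sketch of the consistency check in step 3 (assumptions: OpenAI
# Python SDK, "gpt-4o" available; the prompt variants and the 80% threshold
# are chosen for illustration only).
from collections import Counter
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt + " Answer with the date only."}],
        temperature=temperature,
        max_tokens=20,
    )
    return resp.choices[0].message.content.strip()

variants = [
    "When is Donald Trump's birthday?",
    "What is Donald Trump's date of birth?",
    "On which day was Donald Trump born?",
]

answers = [ask(p, t) for p in variants for t in (0.0, 0.5, 1.0)]
most_common, count = Counter(answers).most_common(1)[0]

if count / len(answers) >= 0.8:
    print(f"Stable answer '{most_common}' – likely stored as an association in the model.")
else:
    print("Answers vary – the detail is probably filled in rather than contained in the model.")
```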

Since the probabilities in steps 2 and 3 apply cumulatively, the overall probability of the presence of personal data of certain individuals in the model itself is normally correspondingly low. It results from the multiplication of the two individual probabilities. The above test steps can be used, for example, for a data protection risk assessment if a company is planning to use a language model and wants to demonstrate that any personal data that may be contained in a model does not pose a risk to the company because it will not be retrieved.

Conversely, in order to assert data protection claims against a language model for personal data contained therein, a data subject or supervisory authority will have to show that in the particular case the intended audience of the language model (step 1) is reasonably likely to use an input which, from the user's point of view, generates personal data about them in the output (step 2) and that the information in dispute is not invented, but contained in the model in a manner related to the person (step 3).  

In our next blog posts, we will explain the impact this has on the question of how data subjects can access the personal data processed about them, how this relates to the accuracy of an output and how the right of rectification can be implemented.

David Rosenthal

PS. Many thanks to Imanol Schlag from the ETH AI Center for reviewing the technical aspects of the article. 

A few words on the thesis "LLMs do not contain any personal data"

Almost at the same time as our blog post, the Hamburg Data Protection Authority published a discussion paper in which it comes to the conclusion that large language models per se do not contain any personal data and therefore no data subject rights apply in this regard. At the same time, Flemming Moos published an article in Computer & Recht that takes the same view. The Danish Data Protection Authority has already adopted a similar position. Marvin Bartels describes a somewhat differentiated view in an article.

It is pleasing to see such contributions to the discussion of this topic, and that they also deal with the technical peculiarities of large language models. However, the position advocated by the Hamburg authority falls short in various respects and is not convincing:

  1. The authority's paper focusses on individual training documents instead of the overall information stored in an LLM. It argues that it is not possible to extract individual training documents (with personal data) from the LLM because an LLM only consists of aggregated information about them. Therefore, in their view, there could be no personal data in the model. This overlooks the fact that personal information can be included in the aggregated information if it occurs frequently enough in the training documents. It is true (as we also write) that this is not the case for most information, but it does happen. The birthday of Donald Trump discussed in our blog is an example; it can be extracted directly from the LLM via the API of GPT-4o (without using ChatGPT or Copilot). This memorisation of personal data (not individual training data) presupposes that a certain piece of information is present in a large number of training data, which is particularly the case for public figures or those who publish a lot themselves. They are therefore primarily affected.
  2. The thesis only deals with the individual tokens instead of looking at the entire function of the model. It therefore misses the wood for the trees. An LLM establishes relationships not only via vectors in the embedding space, but also via non-linear functions (neural networks) and functions such as the attention mechanism. In this way, information is stored across numerous levels. The authority's paper only looks at one. For a layperson, the arguments in favour of the counter-thesis may sound convincing, but they fall far short of the mark (we recommend reading our blog post no. 17 which explains what goes on in an LLM).
  3. It is argued that it is not possible to extract individual information about natural persons from a single token and its embedding. This is of course correct, but not the point: The fact that information cannot be extracted from individual elements, but only results from the combination of the parts, and the fact that we today do not understand exactly where and how it is stored in an LLM or how the model retrieves it, does not mean that the information does not exist. We humans also have knowledge, even if we don't understand exactly where and how individual pieces of information are stored in our brain. Just because something is a black box does not mean that it cannot store personal data. Let's again take Donald Trump's birthday as proof. GPT-4o provides this (undoubtedly personal) information reliably when asked; no special tricks are required.
  4. The need for a technical process to make stored information readable for us humans does not rule out the existence of such information either, even if it is only stored on a meta-level in the stored data. The numbers stored in a JPEG image file with a scanned CV of a person are an example. The file itself only contains the data of the pixels. Nobody will be able to recognise personal data in this data. For us to read the text and thus access the personal data, the file needs to be opened by an image viewer software that will convert the data into pixels and use the pixels to generate an image. This is also the case with a language model, with the difference that putting the information together as a kind of key requires the prompt and the process is much more complex. The counter-thesis does not take this meta-level into account either.
  5. It is argued that an LLM does not contain any personal data because it is not a reproduction of training data, but rather an LLM generates new content. This, however, cannot be the decisive point. To give an example, a graph database must also compile the results that it delivers in response to a query from a wide variety of parts because it has stored information in the form of interlinked pieces of information. The query response is newly created based on pieces of information that have been stored in the database individually. We do not see why such knowledge could not possibly be personal data just because what is fed into the database is not produced in the same form when queried.
  6. The view is taken that just because the information stored in an LLM is only more or less likely to relate to a specific individual, it cannot be personal data. This is not convincing. If the probability is sufficiently high that the combination of a certain input ("When is Donald Trump's birthday?") and the information contained in the model results in the output "Donald Trump was born on 14 June 1946", which is the case here even with varying input and temperature, then this information must be contained in the model. Where else should it come from? The precise degree of probability at which information becomes sufficiently certain is a discussion that is of limited relevance here. In the case of GPT-4o, it is sufficient that "Donald Trump" and "birthday" are inextricably linked to 14 June 1946. This is also the case everywhere else in data protection (and the rest of life): If an employer monitors the activities of its employees' devices using their IP addresses, everyone would consider this to be personal data of the very employees, even if it cannot be ruled out that an employee's computer is exceptionally used by another person. Just because it is only most likely that the log entries with the IP address relate to the employee does not remove their personal reference. Even the standard of proof for facts in court does not require absolute certainty. Higher standards cannot – and do not – apply in data protection.
  7. The argument that a particular piece of information is not personal data if it is not one hundred per cent and absolutely clear to everybody to whom it refers is likewise not valid, at least in the present context. This argument attempts to justify the position that in the case of an LLM it is never completely clear to which person the information refers. However, if information is assigned to a specific person by the recipient of the information, this is sufficient from a data protection perspective. The fact that there may be other persons (e.g. with the same name) does not change this. However, it shows that, as we have argued above, it is important who uses an LLM and who has access to the output, and to understand what they will consider the output to be. The recipient's "horizon" is crucial in data protection: This is why the date of birth of Donald Trump quoted above will, in people's minds, refer to the US politician with this name – and not to the oncologist of the same name, who also exists in the U.S. If a language model tells us with clarity when Donald Trump's birthday is, then we assign this information to the US politician, even if we cannot be absolutely certain that this information actually relates to him. It thus becomes his personal data. If we see a person on a surveillance video and we are of the opinion that it is a specific person, then we process the personal data of this very person, even if we are wrong.
  8. We also have the impression that there are misunderstandings about what belongs to the model like GPT-4o and what belongs to the applications like ChatGPT that use a model. All the examples we have mentioned do not come from ChatGPT, but are generated directly by the GPT-4o model of OpenAI; the API is still part of the model according to the EU AI Act. The model is not only the collection of model parameters (such as the weights or the parameters of the neural networks), but also includes the system that combines them with the prompt and processes it to generate an output. ChatGPT is not needed for this. To say that the personal data is stored in ChatGPT and not in the model is therefore simply wrong. ChatGPT only controls and processes the input and the output generated by the model.
  9. The various statements focus on the various methods of privacy attacks on LLMs to extract personal data as a supposed counterargument to their theses and argue that these are not relevant because they are too costly or too complicated. This also misses the point, as no privacy attack is required to extract personal data from the model. A simple prompt is sufficient, as shown and as everyone knows from experience. The arguments are also irrelevant because the privacy attacks (as we have also explained) are aimed at making personal data from the training visible. This is only a problem in the case of training (including fine-tuning) with confidential data, but that is not what we are talking about here. We are talking about whether personal data is represented in the model itself, regardless of how it was created. As we write, it is conceivable, for example, that a training process creates personal data in an LLM that is incorrect and as such did not appear in the training documents. It therefore makes no difference at all how complex such attacks are or whether they are legal. They are not the decisive factor. The only decisive factor is whether a prompt occurs that retrieves the personal data from the model. This question can only be answered from the perspective of the person who has access to the model, uses it or provides it. For example, the same model will not contain any personal data in use cases A, B and C, but it will in use case D, because corresponding prompts are to be expected there (and only there). This differentiation results directly from the relative approach, which the ECJ confirmed again in "VIN". The opinions do not make this distinction either, although it is essential in view of the case law.
  10. Ultimately, above thesis only considers whether the content of an LLM identifies a specific person, but does not examine whether they can be identified by additional means. The Hamburg paper essentially argues that you can't see to whom the numbers and tokens relate, and because an identifier is missing, there cannot be any personal data. This is once again too short-sighted: If personalised information has been seen enough times in training, it can be made readable with the right prompt. The prompt thus becomes a means of identification. In a figurative sense, it is the template that is placed on the model and ensures that the corresponding tokens "light up" in the correct order. This results in information suddenly becoming recognisable in what otherwise is a jumble of numbers, and this insight can then be transferred to the output. If the triggered associations are strong enough, they will always lead to the same result, even with differently formulated prompts for the same question. This means that information that is not already contained in the prompt can be retrieved from the model with its help – such as Donald Trump's date of birth, for example. We have also developed an experimental tool that can be used to test for such strong associations with the GPT models. We will present it in our article on the right to access with regard to LLMs. To say that there is nothing in an LLM that at least functionally corresponds to an identifier is too short-sighted for these reasons. The nature of an LLM means that the identifier is not simply a number or a word, but ultimately a very complex pattern that results from a combination of numerical associations that an LLM stores. We humans may not be able to recognise it, but that may not matter if the model does it for us, if we provide it with the necessary trigger with the prompt and it then makes the relevant information readable for us by copying it into the output. If we want to do justice to the technology of LLMs, we have to go beyond our established ideas – for example on the question of what an "identifier" is – and apply them to the new world. There is nothing to suggest that this identifier cannot be a multidimensional pattern that we ourselves may not yet fully understand. This is also not inconsistent with case law. The GDPR does not even necessarily require an identifier to establish a reference and provide information. In Article 4(1), it expressly states that the identifier is only one of several possibilities for this assignment of a piece of information to a data subject. Only the assignment is essential. This is another reason why the counter-thesis falls short.

To summarise, these explanations show that the statement that a language model can, as a matter of principle, never contain any personal data within the meaning of the GDPR is not tenable. Instead, in our view, a differentiated approach is required which, as shown in our blog post, takes into account the respective use case, the group of people who use an LLM and, above all, the prompts that are used, because they are the means by which information in the model can become personal data – and identify the data subjects. It is therefore also relevant how likely it is that such prompts will be used in a particular use case. We also come to the conclusion that in many, if not most, applications, the language models used do not contain any personal data. However, denying the possibility across the board does the matter no favours. We simply cannot ignore the fact that GPT-4o, for example, "knows" when Donald Trump was born and tells us so. The most we can say is that nobody would reasonably ask the model about it – at least in most use cases. And this is the point we make. More on this in our article above.

In the case of the GDPR, this can be formulated as follows:

  • When a model is trained, personal data that occurs frequently enough in different parts of the training data is stored in the model as aggregated information. It can be regarded as pseudonymised data within the meaning of Art. 4 No. 4 GDPR, as it is not possible to establish a personal reference without additional information.
  • As long as a model is stored or otherwise processed and it cannot reasonably be expected that the model will be put into operation and supplied with a prompt that retrieves this information from the model, no personal data is available. It is the same as with an encrypted file for which the key is not available.
  • However, if such a prompt is entered, it represents the means by which the information contained in the model about a person concerned is activated, i.e. becomes visible, and thus identifies the person to whom this information relates. This information becomes readable for us when the model generates an output based on it. The primary challenge is to recognise where a real reference exists, as the model will also generate an output without a reference - but with invented or less probable data.
  • If a controller must reasonably expect such prompts in a use case, the LLM will contain corresponding personal data in their case. They must therefore endeavour to comply with the requirements of the GDPR when using the LLM - including responding to any requests for information.  

See also our illustration here.

Another critical comment on the Hamburg paper has been published by David Vasella in his blog datenrecht.ch. He concludes: "Overall, however, the explanations appear to be result-orientated. In any case, such serenity is rarely seen with authorities – perhaps the fear that LLMs could otherwise be de facto banned was also the inspiration for the theses."

July 18/22, 2024

July 23, 2024: See the technical considerations and experiment of Christian Bennefeld here (in German).

This article is part of a series on the responsible use of AI in companies:

We support you with all legal and ethical issues relating to the use of artificial intelligence. We don't just talk about AI, we also use it ourselves. You can find more of our resources and publications on this topic here.
