Case study: a federal agency collecting health data

In this section, we present the current situation in Switzerland, where a central federal agency, namely the Swiss Federal Statistical Office, collects health data, especially from hospitals, from all over Switzerland in order to compute statistical outcomes of different sorts. This means that many data-generating partners must send their sensitive data to the central agency in a secure and reliable way. 

All hospitals in Switzerland (more than 400) are connected to this system. Some of them send information directly to the Federal Office while in some cantons (Swiss Federal States) the data are gathered in Cantonal Offices that check the data and forward them to the Federal Office.  

The concepts presented in [BFS 1997, Jaquet-Chiffelle & Jeanneret 2001] have been used successfully for several years to protect the medical data during the transmission phase as well as in the databases of the central agency. Note that data protection is strictly regulated in Switzerland; identifying data can therefore be neither transmitted nor stored without protection measures. In the following, we present the case of Switzerland in a summarized and slightly simplified way. 

Consider the situation where a hospital wants to transmit its data records to the Federal Statistical Office. The data consist of tuples. Each tuple contains, on the one hand, identifying parts (such as last name, first name, date of birth, etc.) and, on the other hand, treatment data, which are usually not identifying.

Step 1: Unified representation of identifying data 

The identifying data are first processed to remove “noise” introduced by inconsistent spelling, which is a frequent cause of duplicate entries in databases. One source is plain spelling errors; another is names that can legitimately be written in different ways. Consider for example the German umlauts: the name “Müller” is often written as “Mueller”.  

Phonetic algorithms such as Soundex are frequently used to reduce this source of ambiguity. They compress names to a kind of “spoken normal form”: typically, double letters are reduced, silent letters are eliminated, etc.

The identifying data are then combined into a normal form. For example, John Doe, born 1.1.2000, male, is represented by 

01012000MDOEJOHN 

assuming that the algorithm did not change the last name and first name. With Soundex, by contrast, names of arbitrary length are replaced by codes of fixed length. 
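Step 1 can be sketched as follows. This is a minimal illustration, not the procedure actually used by the Federal Statistical Office: the function names, the umlaut transliteration table and the field order (taken from the John Doe example) are assumptions, and the Soundex variant shown is the common American one.

```python
import unicodedata
from datetime import date

# Assumed transliteration of German umlauts (applied after upper-casing).
_UMLAUTS = str.maketrans({"Ä": "AE", "Ö": "OE", "Ü": "UE"})

# Standard American Soundex consonant classes.
_SOUNDEX = {c: d for d, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items() for c in letters}


def normalize_name(name):
    """Upper-case, transliterate umlauts, drop accents and non-letters."""
    upper = unicodedata.normalize("NFC", name).upper().translate(_UMLAUTS)
    decomposed = unicodedata.normalize("NFKD", upper)
    return "".join(c for c in decomposed if "A" <= c <= "Z")


def soundex(name):
    """Fixed-length phonetic code: first letter plus three digits."""
    letters = normalize_name(name)
    if not letters:
        return ""
    result = letters[0]
    prev = _SOUNDEX.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":          # H and W do not separate equal codes
            continue
        code = _SOUNDEX.get(ch, "")
        if code and code != prev:
            result += code
        prev = code             # a vowel resets prev to ""
    return (result + "000")[:4]


def unified_record(last, first, birth, sex):
    """Concatenate the fields as DDMMYYYY + sex + last name + first name."""
    return f"{birth:%d%m%Y}{sex.upper()}{normalize_name(last)}{normalize_name(first)}"


print(unified_record("Doe", "John", date(2000, 1, 1), "m"))  # 01012000MDOEJOHN
print(normalize_name("Müller") == normalize_name("Mueller"))  # True
```

Here the names are only normalized, which matches the example above; replacing `normalize_name` by `soundex` in `unified_record` would give fixed-length name parts instead.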

Step 2: Generating linking code 

The linking code is generated from the unified representation using a one-way hash-function (e.g., SHA-1). The result is a bit-string of a predefined length determined by the algorithm (160 bits in the case of SHA-1). This linking code is then used as an identifier for the patient, in other words as a digital “fingerprint”. Note that anyone in possession of the personal data of some person can create this linking code; there is no secret in the generation of the linking code itself. Hence the linking code cannot be used as is for data processing.  
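A minimal sketch of this step, using Python's standard `hashlib`. The choice of SHA-256 and the hexadecimal output are assumptions made for illustration; the text only requires a one-way hash-function, and any other algorithm offered by `hashlib` could be substituted.

```python
import hashlib


def linking_code(unified):
    """Hash the unified representation into a fixed-length linking code.

    No key is involved: anyone who knows the personal data can
    recompute the same value.
    """
    return hashlib.sha256(unified.encode("ascii")).hexdigest()


code_a = linking_code("01012000MDOEJOHN")   # computed at hospital A
code_b = linking_code("01012000MDOEJOHN")   # computed at hospital B
print(code_a == code_b)                     # True: same patient, same code
print(len(code_a) * 4)                      # 256 (bits)
```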

On the other hand, given a linking code, one cannot compute the original data because of the one-way property of the underlying hash-function. The only remaining possibility is to mount a so-called dictionary attack, which clearly is still feasible: find the personal information of every citizen in Switzerland, compute the linking code for each of them and compare. Alternatively, one can simply generate all possible combinations of last names, first names, dates of birth and gender. 
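The dictionary attack can be illustrated with a few lines of Python. The candidate name lists and date range below are hypothetical and deliberately tiny, and SHA-256 is again an assumed stand-in for the hash-function; a real attacker would enumerate a full population register or all plausible combinations.

```python
import hashlib
from datetime import date, timedelta
from itertools import product


def linking_code(unified):
    # Unkeyed hash of the unified representation (assumed: SHA-256).
    return hashlib.sha256(unified.encode("ascii")).hexdigest()


def dictionary_attack(target, last_names, first_names, start, end):
    """Try every combination of birth date, sex and names until one matches."""
    day = start
    while day <= end:
        for sex, last, first in product("MF", last_names, first_names):
            candidate = f"{day:%d%m%Y}{sex}{last}{first}"
            if linking_code(candidate) == target:
                return candidate
        day += timedelta(days=1)
    return None


intercepted = linking_code("01012000MDOEJOHN")
found = dictionary_attack(
    intercepted,
    last_names=["MUELLER", "DOE"],
    first_names=["ANNA", "JOHN"],
    start=date(1999, 12, 30),
    end=date(2000, 1, 5),
)
print(found)  # 01012000MDOEJOHN
```

The search space is small enough to enumerate because every input field is low-entropy, which is why the unprotected linking code must not be stored or transmitted as is.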

The goal of this procedure is to create a new pseudonymous ID for each individual. Clearly, due to the nature of the hash-function, there is no guaranteed one-to-one correspondence between individuals and linking codes as, with a very small probability (see above), different strings are mapped to the same linking code by a hash-function. Yet as this probability is very small, the influence on any resulting well-founded statistical outcome is not significant.
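To give an order of magnitude for this probability, the usual birthday approximation can be applied. The figures are illustrative assumptions: n = 10^7 distinct unified representations (a round number, not taken from the source) and a hash output of k = 160 bits, with the hash-function modelled as a uniformly random mapping.

```latex
P_{\text{collision}}
  \approx \frac{n(n-1)}{2 \cdot 2^{k}}
  = \frac{n(n-1)}{2^{k+1}}
  \approx \frac{(10^{7})^{2}}{2^{161}}
  \approx 3.4 \times 10^{-35}
```

Under these assumptions, the chance that any two different strings in the whole data set share a linking code is negligible for statistical purposes.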

Step 3: Encrypted linking code for transmission 

For the transmission of the data from the hospital to the Swiss Federal Statistical Office, the linking codes are encrypted. In principle, it is sufficient to encrypt the linking code and not the medical data, since an attacker could then link the medical data only to the encrypted linking code, and hence not to an individual. Yet the medical data should be protected during transmission as well, so that attackers cannot use this information as identifying data in special cases (see footnote 3). 

For the encryption, a hybrid approach is used: because a large quantity of data is transmitted, the data are encrypted with a symmetric algorithm using a so-called secret session-key. A session-key is typically generated individually for each session, with its validity restricted to an hour or to a specific day. After the session has finished, the key is destroyed and another one is generated. Clearly, this session-key must also be transferred to the Federal Statistical Office (cf. next step). 

Step 4: Encrypted secret key for transmission 

The secret session-keys are encrypted with public-key cryptography using the known public-key of the Swiss Federal Statistical Office. Remember that public-key algorithms allow anyone knowing the public-key to encrypt the message, but only the party possessing the private-key to decrypt the message. This transmission is only necessary when a new session key (see step 3) has been created, i.e., at the beginning of a new session. The public-key of the Swiss Federal Statistical Office must be available to all parties through a channel that guarantees the authenticity of the key.  

Step 5: Generating uniform linking code 

At the Swiss Federal Statistical Office, the secret session-key is first decrypted (using the private-key), and the recovered session-key is then used to decrypt the linking codes. The linking codes themselves are then re-encrypted using a symmetric algorithm in order to break the link to the original linking codes and to prevent dictionary attacks. The result of this encryption is called a uniform linking code. The secret-key used for this encryption should be shared between different entities using standard secret sharing algorithms. The original linking codes are destroyed.  
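The keyed transformation and the sharing of the secret key can be sketched with the Python standard library. Two simplifications are assumed here and are not taken from the source: HMAC-SHA256 stands in for the symmetric encryption algorithm (it is keyed and deterministic, but unlike encryption it cannot be inverted, so the later change of key by re-encryption would not be possible with it), and the key is split with a simple XOR scheme in which all shares are needed for recovery. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def uniform_linking_code(linking_code, secret_key):
    """Keyed, deterministic transformation of a linking code.

    Stand-in (assumption): HMAC-SHA256 instead of symmetric encryption.
    """
    return hmac.new(secret_key, linking_code, hashlib.sha256).hexdigest()


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def split_key(key, n):
    """Split key into n shares; all n are required to rebuild it."""
    shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
    last = key
    for share in shares:
        last = xor_bytes(last, share)
    return shares + [last]


def combine_shares(shares):
    """XOR all shares together to recover the key."""
    key = bytes(len(shares[0]))
    for share in shares:
        key = xor_bytes(key, share)
    return key


secret_key = secrets.token_bytes(32)
shares = split_key(secret_key, 3)            # e.g. held by three entities
assert combine_shares(shares) == secret_key

code = hashlib.sha256(b"01012000MDOEJOHN").digest()   # linking code from Step 2
print(uniform_linking_code(code, secret_key))
```

Without the secret key, recomputing linking codes for the whole population no longer helps an attacker, because the second transformation cannot be reproduced.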

The process described using these five steps has several properties: 

  1. The process is simple and needs only standard cryptographic procedures 

  2. The linking codes are generated using a standard and simple procedure which is computationally feasible at a hospital.  

  3. For the same patient at different hospitals, the same linking code is generated. Hence the Federal Statistical Office will be able to “follow” the patient. 

  4. During the transmission phase, linking codes are encrypted, hence an attacker would have to break a standard symmetric encryption algorithm to access the linking codes.  

  5. The uniform linking codes are anonymous because they cannot be connected to the linking codes anymore. There is no dictionary attack possible. Even when knowing the personal data of the whole population, one cannot link a uniform linking code to a person, as the uniform linking code itself can only be generated using the secret key. 

  6. The security at the Federal Statistical Office depends on the protection of the Secret Key. 

Note that uniform linking codes are in general destroyed after 10 years [BFS 1997]: re-encryption takes place according to a predefined scheme in order to change the Secret Key used to generate the uniform linking codes. 

While the presented procedures are implemented in practice, an extended framework was presented in [Baumann et al. 2005], introducing a recovery authority which can be separated from the central data collector. All tasks regarding the generation of universal linking codes (online) as well as possible recoveries (offline only) are done by this new party, allowing the central data collector to focus only on the collection of data. This additional party requires more complex protocols, hence the simplicity is partially lost while additional properties are gained. Their theoretical approach, while interesting, is not always adapted to the practical constraints inherent in the wide implementation of such a system, in particular in Switzerland.  

In the light of the central question of this deliverable, namely “Is the identity knowledge in the government growing through the development of e-government” (cf. Chapter 1), the present approach is relevant: by using non-identifying pseudonyms, it tries to minimise the possibility, even for the government, of expanding its knowledge about the citizen, provided that the secret key at the federal office is protected. 

 
