
D7.2: Descriptive analysis and inventory of profiling practices


 

Appendix

 

A Forensic Profiling in the field of RFID and Biometrics 

(Zeno Geradts, NFI) 

 

A.1 RFID technologies 

 

Radio Frequency Identification (RFID) tags are expected to be used increasingly in retail products and in travel identity cards. Wal-Mart in the USA, Metro in Germany and Tesco in the United Kingdom currently require RFID tags on products, and the Department of Defense in the United States requires tags on products from its suppliers. Tracking and tracing are the most important reasons for this requirement; during the war in Iraq, for example, this was convenient for transporting and tracking equipment. Losing or misplacing products, parts or equipment results in higher costs. 

 

It is expected that in 2008 more than 20 billion RFID tags will be in use, most of them following the EPC (Electronic Product Code) standard. To read those tags, an estimated 100,000 EPC readers will be sold. The price is expected to drop to a few euro-cents per tag.

 

RFID-technology 

 

An RFID system consists of several components: 

  1. One or more tags (transponders), each consisting of a microchip and an antenna 

  2. One or more readers and writers, including the RF modules 

  3. Application software connected to the reader/writer 

 

RFID tags exist in two forms: 

  1. Active (with their own power supply)  

  2. Passive (drawing power from the signal of the reader) 

 

Active RFID tags are somewhat larger and more expensive, but they can be read at a greater distance. Passive tags are cheap and can be used in consumer articles; they can also be embedded in labels. 

 

Furthermore, RFID tags can be read-only or read-write. Read-write tags are used with an encrypted identification number, so that only authorised users can change the information. 

 

Retailers are requiring their suppliers to use the Electronic Product Code (EPC). This standard was developed by MIT’s Auto-ID Center and is managed by EPCglobal, a non-profit joint venture between two organisations: EAN (European Article Numbering) International and the Uniform Code Council. The second generation of the UHF EPC standard was ratified in December 2004, and in January 2005 the standard was submitted to ISO.  

 

For RFID, the relevant ISO standards concern: 

  1. Technology ISO 18000  

  2. Data content (ISO 15418, 15434, 15459, 24721, 15961 and 15962) 

  3. Device conformance test and performance (ISO 18046 and 18047) 

  4. Application Standards (ISO 10374, 18185, 11785) 

 

A.1.1 Security of RFID 

 

In July an article in Forbes presented a hacker’s guide to RFID, claiming that it would be easy to hack a tag and to change the price information stored on it. The article also raised privacy concerns for consumers. The cheap tags, however, are read-only and cannot be altered easily, and even with a write-once tag the price is usually not stored on the product, since only a serial number of the product is used. It is nevertheless an aspect that needs attention.

 



 

Example of RFID-labels from http://www.bluhmsysteme.com/rfid-etiketten.htm 

 

A.1.2 Forensic aspects of RFID and profiling 

 

RFID tags can also be used in forensic science to track a person via the RFID tags in the products he or she carries. For example: a packet of cigarettes carries a unique RFID tag number, and a person steals this packet from a store in Amsterdam. Theoretically, given European databases of stolen goods, this person could be arrested in Brussels when the RFID tags he or she carries are scanned. The same applies, of course, to the previously mentioned passport with a chip. The key elements are RFID tags in combination with databases; if the data are stored properly, they can be used as evidence. 

 

The use of standards also makes it easier for forensic science to develop tools that read the information in a forensically sound way. 

 

A.1.3 RFID and privacy 

 

EPCglobal is also concerned about privacy aspects. For this reason it has implemented certain privacy guidelines:

Guidelines

1. Consumer Notice  

Consumers will be given clear notice of the presence of EPC on products or their packaging. This notice will be given through the use of an EPC logo or identifier on the products or packaging.  

2. Consumer Choice  

Consumers will be informed of the choices that are available to discard or remove or in the future disable EPC tags from the products they acquire. It is anticipated that for most products, the EPC tags would be part of disposable packaging or would be otherwise discardable. EPCglobal, among other supporters of the technology, is committed to finding additional efficient, cost effective and reliable alternatives to further enable customer choice.  

3. Consumer Education  

Consumers will have the opportunity easily to obtain accurate information about EPC and its applications, as well as information about advances in the technology. Companies using EPC tags at the consumer level will cooperate in appropriate ways to familiarise consumers with the EPC logo and to help consumers understand the technology and its benefits. EPCglobal would also act as a forum for both companies and consumers to learn of and address any uses of EPC technology in a manner inconsistent with these Guidelines.  

4. Record Use, Retention and Security  

The Electronic Product Code does not contain, collect or store any personally identifiable information. As with conventional barcode technology, data which is associated with EPC will be collected, used, maintained, stored and protected by the EPCglobal member companies in compliance with applicable laws. Companies will publish, in compliance with all applicable laws, information on their policies regarding the retention, use and protection of any personally identifiable information associated with EPC use.  

If these guidelines are properly applied, and the consumer can easily discard the EPC tag from a product, the example above with the packet of cigarettes is no longer feasible, provided the thief discards the RFID tag. 

 

A.2 Biometric Devices 

 

The market for biometric devices is expanding rapidly, since they are convenient to use and users cannot forget their biometric properties. Some examples of biometric properties used in commercial systems: 

  1. Face 

  2. Fingerprint 

  3. Handscanner 

  4. Iris 

  5. Voice 

  6. Handwriting 

In forensic laboratories, other biometric measurements are also used for comparison: 

  1. DNA 

  2. Ear prints 

  3. Lip prints 

 

In many countries, DNA databases are built containing profiles of persons who have committed a serious crime and DNA found at crime scenes, as well as databases of employees of the forensic science laboratories, the latter to detect possible contamination of evidence. The laws of many countries forbid the use of these data for purposes other than those for which they are intended: for serious crimes, by court order, or as specified by law. 

 

In a world where no privacy laws exist, it would theoretically be possible to build databases of fingerprints, iris scans, voice, handwriting and face images, and to combine these databases with one another. If the databases are structured correctly, and there are no errors in names or in the links made, they can be used to search by profiling for a person or a group of persons, which might lead to more crimes being solved and perhaps to stronger evidence. 

 

One problem, however, is that one-to-many comparison, for example with faces, has a high error rate. A good example is given at http://www.frvt.org/FERET/default.htm, where a methodology has been used to evaluate different commercial packages against a standard database. Typical results are:

  1. 71.5 % true accept rate @ 0.01 % false accept rate 

  2. 90.3 % true accept rate @ 1.0 % false accept rate 

 

Another test by NIST on fingerprints (http://fpvte.nist.gov/) gives better results. Using single good-quality fingerprints, we see:

  1. 99.4 % true accept rate @ 0.01 % false accept rate  

  2. 99.9 % true accept rate @ 1.0 % false accept rate 

 

When multiple fingerprints and face images (or even 3D images) become available, the situation improves. In these tests, poor-quality images are not used; in practice, when databases are filled with real images, such images will also be included and results will deteriorate.  

 


Figure 6: Example of mass market 3D-face scanner 

 

 

 

 

 

B. Legal grounds for customer loyalty programs 

(Barbara Körffer, Martin Meints, ICPP) 

 

Based on European legislation, the German Federal Data Protection Act (BDSG) contains several sections relevant to customer loyalty programs. They are elaborated in detail in the study “Kundenbindungssysteme und Datenschutz” (“Customer loyalty systems and data protection”), commissioned by the “Verbraucherzentrale Bundesverband e.V. (vzbv)” and carried out by ICPP in December 2003.

 

Usually customer loyalty programs employ customer account cards. From a data protection perspective, the contract between the customer and either the discount-granting company directly or an operator of the account card system can be interpreted as two distinct parts: 

 

  1. discount related part: rewarding customers’ loyalty by granting discounts 

  2. collection of additional data: collecting and processing additional personal data for different purposes, such as:  

    1. advertising 

    2. market research 

 

For the first part of the contract, Article 28 paragraph 1 no. 1 BDSG is relevant. This paragraph limits the use of personal data to the purpose of the discount-related part of the contract. In detail, this means that the customer’s contract partner is only allowed to store and process the following data: 

 

  1. Name 

  2. Address 

  3. Year of birth

  4. One further contact detail (phone, email, etc.)

  5. Time and place of card deployment 

  6. Price of purchased goods / services and discount amount 

  7. Data related to the purchased goods only if they are necessary for computation of the discount amount. 

 

The second part of the contract, concerning the collection of additional data, can be based on Article 28 paragraph 1 no. 2 BDSG. This regulation allows the use of data insofar as no overriding legitimate interests of the customer are affected. This covers the name, year of birth and address. 

For further data a written consent of the customer is required. Requirements for the consent are regulated in Article 4a BDSG.  

 

In addition, for data processing in general the following legal regulations apply: 

 

  1. Article 4 paragraph 3 (active information of the customer)  

  2. Article 34 (provision of information to the customer) 

  3. Article 35 paragraph 1 (right to get data corrected) 

  4. Article 35 paragraph 2 (right to get data deleted) 

 

A consequence of Article 4a BDSG in particular is that a customer opting for the further use of his/her personal data has to know the details and consequences of the collection and processing of those data. In addition, the decision is not considered to be freely made if participation in the loyalty program depends on agreeing to the collection and processing of additional personal data. To obtain a legally effective declaration of consent, the customer has to be informed about the following: 

 

  1. Who is responsible for the data and the data processing? 

  2. Which categories of data will be processed? 

  3. For what purpose are they processed?  

  4. How is the processing done (phases and data flows)? 

  5. Whom are data transferred to? 

  6. Information on voluntary decision of consent and the consequences of a refusal 

  7. Information on the ability to revoke the consent at any time 

 

In the aforementioned study, 16 federal and several regional customer loyalty programs were investigated for their compliance with the BDSG. All programs investigated showed defects, some minor, some major.  

Examples include: 

 

  1. More data than necessary for correct implementation of the discount were collected without consent of the customer 

  2. Participation in the discount program depended on consent to further usage of personal data: no freedom of choice given 

  3. The exclusion of personal data from the declaration of consent was not implemented in a privacy-compliant way. An opt-in or opt-out possibility was not implemented in the contract; the customer had to cancel that part of the contract he didn’t want to agree with. 

  4. In rare cases, conditions of entry and privacy policies were sent to the customer only after getting his/her consent – this means no consent at all because the customers could not know the conditions in advance 

  5. Information was missing that consent is voluntary  

  6. Consequences of a refusal of the declaration of consent were not pointed out 

  7. Planned processing of personal data was not sufficiently described – therefore declaration of consent was not completely based on a free decision 

 

In some cases, such defects were remedied by the operators of the loyalty programs.  

 

The LN-Card, a customer loyalty program for readers of a local newspaper, the “Lübecker Nachrichten” (LN), was granted the privacy seal of the “Unabhängiges Landeszentrum für Datenschutz Schleswig-Holstein” (Independent Centre for Privacy Protection Schleswig-Holstein, ICPP).

C. Biometrics profiling 

(Angelos Yannopoulos and Vasiliki Andronikou, ICCS) 

 

C.1 Physical Biometrics Profiling : Face Recognition 

 

Face recognition forms part of the general object recognition problem, which seeks to differentiate objects that vary only in minor ways from each other, and is one of the most challenging computer vision problems. Recognition is achieved through the analysis of certain facial features, such as the upper sections of the eye sockets or the area around the cheekbones, and their comparison to stored templates. This technology works for both verification and identification. The design, development and performance of such a system face many difficulties owing to vulnerability to variations in the operating environment, complex or moving backgrounds, complex foregrounds and occlusion; in general, the complexity of the system increases further when real-time operation is required. The performance of the system can also degrade because of the age of the template. As in every biometric system, identification of individuals is possible only for those individuals whose images have been enrolled into the system’s database. Thus, the first step in the construction of an individual’s profile is the enrolment of the person’s image (of reasonable quality) into the system’s watch list.

 

A challenging fact which spurs efforts to overcome the technological limitations of such a system and yet raises great concern regarding privacy is that, generally and compared with other biometrics, there is no reasonable expectation of privacy with regard to physical characteristics that are constantly exposed to the public, such as facial features or voice. Nevertheless, the people entering a public space where video surveillance is used should be notified, so that they can make an informed choice of whether or not to subject themselves to surveillance. This is the case because the professional analysis of stored facial images or other biometrics (by means of profiling techniques) can reveal certain diseases that a layman in face to face contact could not have recognised. 

A simple scenario of the use of such a system could be the tracking of a wanted person by the authorities through the individual’s activities: as people go about their daily tasks, surveillance cameras could be capturing their faces and transmitting these images to be compared to the watch list, which contains the people for whom the authorities have issued an arrest warrant. Such a system would be able to track an individual’s movements, actions and transactions in real time and combine this information with data from the person’s past, stored in databases to which there is access.  

When Super Bowl XXXV was about to take place in 2001, fears of a terrorist attack led to the need for increased security measures. The system finally used relied on facial recognition: surveillance cameras continuously scanned spectators’ faces and captured images that were processed in real time in order to produce a “faceprint” (a set of measurements of facial features). This “faceprint” was then compared against each record in a computerised database of well-known criminals and suspects of terrorism and was classified as either matching or differing from each of them. An alert to the police was to be triggered in case of a match. 

Nevertheless, this “super surveillance” could very easily be abused. For instance, what if the watch list does not only include people wanted by the authorities? Should the authorities be allowed to add a particular individual to the watch list simply to track their moves, or is the approval of a judge required? Another scenario involves capturing images of all the faces appearing in the videos recorded by the cameras, the only technical restriction being the required volume of storage. This information could then be kept for future use as a record on the particular person. Moreover, increased police attention on previously convicted persons identified in public areas by the system would make their rehabilitation harder.  

Questions arise such as who should be entered into the watch list, whose approval (and in which cases) is essential for such an entry, how long these records are maintained, and whether they should be accompanied by information such as when each person’s image was entered and at whose request. It is thus obvious that abuse can, to some extent, be curbed through official procedure by establishing legal measures controlling the administration, management, use and maintenance of such systems, whereas citizen committees and public boards could monitor government use of the system. Existing measures taken to handle related privacy issues emerging from other types of surveillance technology could, however, also be taken into account. 

 

 

C.2 Behavioural Biometrics Profiling: Keystroke Dynamics 

 

We will now briefly address the issue of whether and how behavioural biometrics can be exploited in a malicious manner. 


A simple example of a behavioural biometric that can be abused in a way that is extremely difficult to track is the measurement of keystroke dynamics. The classic application is to protect passwords: a password system logs the timing taken by a user to enter his/her password, as well as the password itself, and then uses a pattern recognition system to classify a newly entered password as matching or differing from the logged timing patterns. Since computer users constantly use the keyboard, any application that can reliably measure the timing of a user’s typing can also try to perform identification or verification of the user.
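As a purely illustrative sketch (in Python; the template values and the acceptance threshold are made-up numbers, not part of any real system), the comparison step could look like this:

# Illustrative sketch only: compares the typing rhythm of a newly entered
# password against a stored template of inter-key intervals. The threshold
# is a hypothetical value that would have to be tuned on real data.
from statistics import mean

def intervals(timestamps):
    # Convert absolute key-press timestamps (in seconds) into inter-key intervals.
    return [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]

def matches_template(new_timestamps, template_intervals, threshold=0.08):
    # Return True if the new typing rhythm is close enough to the template.
    new_intervals = intervals(new_timestamps)
    if len(new_intervals) != len(template_intervals):
        return False  # different number of keystrokes: not the same password
    deviation = mean(abs(a - b) for a, b in zip(new_intervals, template_intervals))
    return deviation <= threshold

# Template learned from earlier logins; attempt recorded as absolute timestamps.
template = [0.21, 0.18, 0.35, 0.22]                 # seconds between successive keys
attempt = [0.00, 0.23, 0.42, 0.75, 0.98]            # five key presses
print(matches_template(attempt, template))          # True: rhythm close to the template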


Given the basic idea, it is quite easy to engage in various flights of fantasy that involve behavioural biometric systems violating the privacy of computer users by inexorably tracking their identity no matter how they try to achieve anonymity. For instance, one might consider a chat programme which measures users’ typing patterns and identifies them regardless of whether they use multiple accounts, identify themselves with pseudonyms otherwise unrelated to their personal data, use different computers such as those available at internet cafes each time they log on, and so on. Or, perhaps, imagine a browser that monitors how a user moves the mouse and can thus always identify its users.


Clearly, the ability of such malicious software to create a centralised collection of identification data needs to be discussed when considering such scenarios, not to mention the possibility of making the measurement at all: a trivial example is that we need not worry that a web server might realistically make such measurements, as the mouse is obviously handled on the client side by the browser and hence the server of a web page never receives useful timing data about mouse movements – nor, in fact, about typing, since the client sends batches of data at relatively large time intervals, e.g. when a user hits ‘enter’). However, scenarios will arise when such measurements are possible.


The ability to make measurements does not necessarily imply an ability to extract meaningful conclusions. Pattern recognition systems often fail at tasks where there is an unconstrained ability to make measurements, because of difficulties in the task of recognition itself. Obviously, pattern recognition systems do perform well in many cases, so it is important to study what kinds of problems they can resolve. Whether recognition is possible depends on intrinsic characteristics of the task, and also on the data available.


We are currently considering a variety of possible behavioural biometrics and how such data can be abused for the task of identifying an unknown user, as opposed to verifying an identity already claimed by a user, which is generally relatively easy, given adequate data. This is still ongoing work, but we can summarise some results that are emerging. There are a number of different parameters of such a malicious effort, and if all of them are handled well, the task is solvable. We tabulate these parameters and table how their combinations affect the possibility of malicious activity.

 


D. Profiling of web-users

(E. Benoist, VIP) 

 

There are a lot of ways to gather information concerning web users. Sometimes the user is indeed aware that he is providing information to others, but sometimes he does not want to give any information and it is simply stolen.  

 

D.1 Log files analysis 

 

Ever since the web has existed, there have been log files. This means that a server (such as Apache HTTPd or Microsoft Internet Information Server, IIS) stores a lot of information about each request it handles in one or more files, which are then often used to compute statistics for a site, amongst other tasks. 

Such a file is always generated by the server. It can either be stored for future use or be deleted if it is only used for debugging purposes. Log files contain a lot of information, and the user is usually unaware of this. Here is an example of an entry in an Apache server log file, corresponding to a request originating from the IP address 62.2.135.157, with the page f.html as referrer. 

62.2.135.157 - - [23/Dec/2002:00:29:12 +0100]  

"GET /Icons/70x1.gif HTTP/1.1" 200 54  

"http://www.isbiel.ch/A/f.html"  

"Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/0" 

This entry contains the following information: IP address, Date of the request, HTTP Request (the first line), the return value (200 means OK, 404 document not found), the size of the information, the URL-referrer (the previously visited page), and the user agent (which browser or spider sent the request).
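As an illustration, the following Python sketch parses such a "combined" log entry with a regular expression (the field layout follows the example above; real log configurations may differ):

# Sketch: extracting the fields listed above from an Apache combined-format log entry.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

entry = ('62.2.135.157 - - [23/Dec/2002:00:29:12 +0100] '
         '"GET /Icons/70x1.gif HTTP/1.1" 200 54 '
         '"http://www.isbiel.ch/A/f.html" '
         '"Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/0"')

match = LOG_PATTERN.match(entry)
if match:
    # Prints ip, date, request, status, size, referrer and user_agent.
    for name, value in match.groupdict().items():
        print(f"{name:10s} {value}")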

 

D.2 Session tracking 

 

Since the Internet protocol for the web (HTTP) does not include session handling, developers have been forced to simulate this feature. In many circumstances, it is very important to follow the session of a user to see that different requests belong together. This is useful for instance in recognising whether a user has been granted access to the server with a username and password. It is also used to gather the elements of a virtual basket, such that the user can visit a web site and keep track of what s/he has bought. 

 

One method for tracking a user's session is to send a cookie. Cookies are small pieces of information exchanged in the HTTP headers, i.e. the part of a request or response that is normally not displayed by the browser. The server sends this information together with a validity period, and as long as the cookie is still valid, it will be resent to the server each time the client sends a request to the same host.  

Sessions can also be tracked for users who do not accept cookies: a session ID can be inserted into all the URLs of the requests for pages on the server. Such a method can be detected by the user. There is, however, another, more indirect method: since DHCP servers rarely change the IP address of a client during a session, one can expect all requests from the same user to carry the same IP address. This makes it possible to treat all requests originating from the same client as one session.  


 

D.3 User Tracking 

 

Session tracking can already be used against privacy, but one can go further. If the user can be linked to all his sessions, a lot of information can be gained from this "total history" of a user’s visits. 

So-called user tracking can also be achieved using cookies, but in this case the validity must be set to something like one month or one year. Such a cookie contains a user ID that is stored in a database on the server, so that the server can recognise the client the next time he visits the site. This is done, for example, at www.amazon.com, where the user receives advertisements and special offers corresponding to all the pages he has visited and all the items he has bought so far.
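A minimal sketch of this mechanism (written with the Python Flask framework purely for illustration; the cookie name, its lifetime and the in-memory storage are arbitrary choices for the example):

# Illustrative sketch: assigning a persistent user-ID cookie so that a returning
# visitor can be recognised and linked to earlier visits.
import uuid
from flask import Flask, request, make_response

app = Flask(__name__)
visit_log = {}  # user_id -> list of visited paths (a real site would use a database)

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def page(path):
    user_id = request.cookies.get("uid")
    if user_id is None:
        user_id = uuid.uuid4().hex               # first visit: issue a new user ID
    visit_log.setdefault(user_id, []).append("/" + path)

    resp = make_response(f"visitor {user_id[:8]}: "
                         f"{len(visit_log[user_id])} pages seen so far")
    # A max_age of one year turns the cookie into a persistent, cross-session identifier.
    resp.set_cookie("uid", user_id, max_age=365 * 24 * 3600)
    return resp

if __name__ == "__main__":
    app.run()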

 

Again, such a tool may be dangerous for privacy, for instance if the user is not only known by his/her ID, but also as a human being with a name, an address and a credit card number. These pieces of information, once entered by the user, may then "propagate", while the client may have expected that the site would have "forgotten" them. 

D.4 Multi-server user tracking 

 

The next level in user profiling deals with the profiling over multiple hosts. Since cookies can only be sent to the site that created them, it is not possible to have the same cookie for more than one web site. That means that the user has different user IDs on different sites. The same person may be seen as ID=1234 on site1, ID=5674 on site2 and ID=007 on site3. 

A solution to this problem is to have a central site that serves all the pages. This is clearly not usable in general, since it would not be efficient at all. Yet there is a simple efficient solution. We just have to insert a request for the same web site in all the pages of the different sites. This is possible, and indeed often done, using a small image. When a web page is downloaded, the browser usually asks for all images that are contained in the page in order to display them in the same page. Usually, images are located on the same web site and are just static files. The idea however is to reference inside the web page an image that is stored on another server. This reference points indeed to a program that generates an image (for example using PHP) which is sent as the answer to the request. Such a program can use its own cookie and this cookie will be the same for all the web sites, since the image comes from the same server. 


Such an image must not disturb the design of the web page. There are two solutions: it can be a banner that contains an advertisement and centralises all the information, or it can be a hidden pixel, i.e. an image with a height and a width of one pixel. If the pixel has the same colour as the page that contains it, the user will never see it. 
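A sketch of such a tracking-pixel server (again using Flask purely for illustration; the host name tracker.example and the cookie name "tid" are invented for the example):

# Every participating page embeds a reference such as
#   <img src="http://tracker.example/pixel.gif?site=site1&page=/products" width="1" height="1">
# Because the image is always fetched from tracker.example, the same cookie comes
# back regardless of which site the surrounding page belongs to.
import uuid
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/pixel.gif")
def pixel():
    tracking_id = request.cookies.get("tid") or uuid.uuid4().hex
    # The tracker now knows who (tid), where (site and page) and when.
    print("seen:", tracking_id, request.args.get("site"), request.args.get("page"))
    # A real tracker would return a transparent 1x1 GIF; an empty image body is
    # enough to illustrate the cookie mechanism here.
    resp = make_response(b"", 200)
    resp.headers["Content-Type"] = "image/gif"
    resp.set_cookie("tid", tracking_id, max_age=365 * 24 * 3600)
    return resp

if __name__ == "__main__":
    app.run()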

 

D.5 Protection measures 

 

Disable Cookies 

The protection can be applied at various levels. First, one can simply disable cookies in the browser, which means that the browser will never accept or send any cookie. Such protection is quite effective, since sessions cannot easily be tracked and it is not possible to reference a user as a recognised client. Yet a lot of web sites require cookies to handle sessions, and without enabling cookies users are unable to use many web sites that might be interesting. 

Moreover, it is still possible to retrace a user's session using only the IP address (which does not change during a session). 

 

Disable persistent cookies 

One can allow cookies to be used for one session only: they can be prevented from being stored on the hard disk and thus from being used as a user ID across many visits. This costs little, since the only permanent cookies of real interest to the user are those storing preferences such as language settings. 

 

Disable third parties cookies 

This prevents information from passing from one site to another. It is a good idea, since such use of information always has negative connotations for the user; we are not aware of any valuable application that relies on this feature. 

 

Define a policy for the use of cookies 

Some web sites declare how they use data collected about users. Based on such information, it is possible to decide whether or not to supply information to this site. 

 

 

E. Mathematical Tools for Data Mining  

(Emmanuel Benoist and Bernhard Anrig, VIP) 

 

The CRISP-DM process model contains six phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment. We will focus here on the two machine-dependent steps: Modelling and Evaluation. 

Modelling can be done with a very large set of algorithms and a great many heuristics. We present hereafter two rather simple modelling techniques: first regression analysis and then a simple algorithm for constructing decision trees. 

 

Since there exists no unique modelling technique that would work for any problem, we have to evaluate the accuracy of an algorithm and its configuration. This is done during the evaluation step. As an example, we present the cross-validation protocol for testing the quality of a method on a given set of data.  

E.1 Regression Analysis 

 

Suppose we have one set of data containing two variables (x and y) that seem to be correlated. We can test if the x’s and the y’s are linearly correlated with the following formulas. We try to build a formula for computing the y’s knowing the x’s, y=f(x). The same type of strategy works also if y is dependent on many different parameters (for instance y = 5.93 x + 1.87 z - 0.98 t). 

Suppose we have a set of data containing for each record two parameters x and y. When we plot a graphical representation of the two parameters, they seem to form a line. We can have the intuition that the two parameters are linearly correlated. That is what we can compute.  

Suppose we have a group of children aged 6, and we want to find an equation modelling the relation between size (in cm) and weight (in kg): we want to compute the weight of a child knowing his or her size. 

Child number    Size (x) in cm    Weight (y) in kg 

1               121               25 
2               123               22 
3               108               19 
4               118               24 
5               111               19 
6               109               18 
7               114               20 
8               103               15 
9               110               20 
10              115               21 

 


We first plot this set of points in a graphical representation. We remark that the points almost form a line. We will try to model the value of y as a linear function of x, which can be noted y = a·x + b. The problem is now to find the values of the constants a and b.

 

We want to find a and b such that the line is a good approximation of the data. This is done by minimising the differences between the observed values of y and the corresponding computed values a·xi + b. To prevent negative and positive differences from compensating each other, the sum of the squares of the differences, S(a,b) = Σi (yi − (a·xi + b))², is minimised. 

This sum is minimal (setting the derivatives with respect to a and b to zero) for the values of a and b given by the following formulas: 

a = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² 

and 

b = ȳ − a·x̄ 

where x̄ and ȳ denote the mean values of the xi and the yi. In our example the mean value for x (x̄) is 113.2 and for y (ȳ) it is 20.3. The resulting coefficients are therefore:

a = 0.42 and b = -27.38  

This results in the equation: 

weight = 0.42 height - 27.38 

We can hence predict that a six-year-old child with a height of 125 cm will probably weigh 25.27 kg. 
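As a small sketch of this computation (plain Python; the data are the ten size/weight pairs from the table above):

# Least-squares fit of weight = a * size + b for the ten children above.
sizes = [121, 123, 108, 118, 111, 109, 114, 103, 110, 115]     # cm
weights = [25, 22, 19, 24, 19, 18, 20, 15, 20, 21]             # kg

n = len(sizes)
mean_x = sum(sizes) / n            # 113.2
mean_y = sum(weights) / n          # 20.3

a = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, weights)) \
    / sum((x - mean_x) ** 2 for x in sizes)
b = mean_y - a * mean_x

print(f"a = {a:.2f}, b = {b:.2f}")                        # a = 0.42, b = -27.38
print(f"weight predicted at 125 cm: {a * 125 + b:.2f}")   # about 25.27 kg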

 

Such a method can easily be extended for attributes that are linear functions of many parameters. The mathematical concepts remain the same, the equations just have to be written in a matrix way. 

 


Nevertheless, this method is only applicable to numeric attributes that behave linearly. The equations have to be redesigned if the relation is not linear but, for instance, quadratic (y = a·x² + b·x + c).

 

More about linear (and non linear) regression can be found in any book on statistics. 

However, such a method cannot be applied to symbolic attributes, like city, profession, education level, or ability to pay back a loan. Moreover, some numeric attributes can also not be used directly in such a method: the zip code, for instance, is normally difficult to correlate with other values as a number, whereas it can be very informative with respect to the academic level or the average income. 

 

E.2 Constructing Decision Trees 

 

The algorithm presented hereafter is known as ID3. Together with many improvements, it forms the basis of the widely used system C4.5, whose newer release is called C5.0. This is not the place to present all the possibilities offered by such systems; we will only focus on the main mechanism for building decision trees, the information gain in a divide-and-conquer algorithm.  

The algorithm we present here can only deal with Boolean attributes. One of the more complex steps in data mining is to find a way to discretise continuous variables. Transforming an arbitrary attribute into a set of Boolean attributes can be done using algorithms (such as clustering), but is often done using ad hoc heuristics that need to be configured carefully to work well.  

Suppose we want to build a decision tree. The algorithm works top-down, seeking at each stage an attribute to split on that best separates classes, and then recursively processing the subsets resulting from the split individually (a divide and conquer algorithm).  

Suppose we have the following table: 

Number of the person, Earns more than EUR15000(A), Married (B), University degree(C), Had problems to pay back a loan(D) 

 

Person   A   B   C   D 

1        T   T   F   F 
2        T   F   T   F 
3        F   F   T   T 
4        F   T   T   T 
5        F   F   F   T 
6        F   T   F   T 
7        F   F   F   F 
8        T   F   F   T 
9        T   T   F   T 
10       T   F   F   F 
11       T   T   T   F 
12       T   F   T   F 
13       F   T   F   T 
14       T   T   T   F 
15       F   F   F   T 

 

Consider that we have four attributes, "Earns more than EUR 15000" (A), "is Married" (B), "Received a University degree" (C) and "Had problems to pay back a loan" (D), each of which is Boolean, and that we want to predict D. D is called the class. There are three possibilities for the attribute to be used at the root of our decision tree, i.e. A, B, or C. We examine the three cases. If we choose A, then for A=true we have 8 cases, of which 25 % have D=true and 75 % have D=false (noted [2,6]); for A=false we have 7 cases, which makes [6,1]. 

We have to measure the information contained in such a tree (or in any subtree). This is done using its purity, which is the weighted mean of the purity of its leaves. The purity of a leaf is measured using its entropy. The formula is based on the probabilities pi of the class values (D in our example): entropy([p1,p2,p3,…,pn]) = -p1 log(p1) - p2 log(p2) - … - pn log(pn), where the logarithm is taken to base 2. In our case the entropy of the node [2,6] is -0.25*log(0.25)-0.75*log(0.75)=0.811 and the entropy of the node [6,1] is -0.86*log(0.86)-0.14*log(0.14)=0.592. Hence the value for this tree is the weighted mean of the two entropies, which makes:  

(0.811*8+0.592*7)/15=0.709 

We do the same for all the possible attributes at the root of our tree. If we take B as the root, we have two leaves in our tree: [4,3] (for B=true) and [4,4] (for B=false). The information needed to order this tree is the weighted mean of the entropies, which is 0.993. The last case is when the root is C: we have two leaves, [2,4] and [6,3], which need an information of 0.918 to be sorted. The higher the value, the less information is gained by using the respective attribute as root of the decision tree. Hence we see, for example, that the information whether someone is married or not (attribute B) does not influence very much (at least in our dataset) the ability to pay back a loan. 

So we select the attribute whose tree has the smallest entropy (i.e. where the purity of the nodes is best). In our case, this means that the root of our tree is A. The same process is then applied recursively to the created leaves, which are expanded into subtrees independently. The tree grows until it has only pure leaves or no further attribute can be used. There also exist heuristics to deal with almost pure leaves and to stop developing the branches, but these are beyond the scope of this report; a small sketch of the entropy computation used above is given below. 
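The following sketch (plain Python, using the fifteen records from the table) reproduces the weighted-entropy computation used above to choose the root attribute:

# ID3-style choice of the root attribute: compute, for each attribute, the
# weighted mean entropy of the class D after splitting on that attribute.
from math import log2

# The 15 records from the table, as tuples (A, B, C, D); D is the class.
records = [
    (True, True, False, False), (True, False, True, False),
    (False, False, True, True), (False, True, True, True),
    (False, False, False, True), (False, True, False, True),
    (False, False, False, False), (True, False, False, True),
    (True, True, False, True), (True, False, False, False),
    (True, True, True, False), (True, False, True, False),
    (False, True, False, True), (True, True, True, False),
    (False, False, False, True),
]

def entropy(counts):
    # Entropy (base 2) of a leaf given the class counts, e.g. [2, 6] -> 0.811.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_entropy(attribute_index):
    # Weighted mean entropy of D over the two leaves created by the split.
    total = len(records)
    value = 0.0
    for outcome in (True, False):
        subset = [r for r in records if r[attribute_index] == outcome]
        counts = [sum(1 for r in subset if r[3]), sum(1 for r in subset if not r[3])]
        value += len(subset) / total * entropy(counts)
    return value

for name, index in (("A", 0), ("B", 1), ("C", 2)):
    print(name, round(split_entropy(index), 3))
# A yields the smallest value, so A is selected as the root of the tree.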

 

E.3 Neural Networks 

 

The neural network domain is an involved, complex and evolving one, and detailing its mathematical derivation and implementation is out of the scope of this document. See http://richardbowles.tripod.com/neural/neural.htm for more information. 

The main difficulty with neural networks is the interpretation of the results. Data Mining techniques provide a way to gain information about the data. The knowledge discovered can be understood and used. But neural networks work as black boxes. There is no way to understand how a pattern has been learnt, what sort of deduction has been done. The translation into a humanly understandable language is not possible. Moreover, in order to “learn” the way to answer a new question, a neural network needs a huge training set, which is probably not available.  

 

E.4 Cross-Validation 

 

Usually the amount of data available for processing is quite limited, so one has to develop methods for testing the quality of the results obtained with only this small amount of data. One well-known method for this purpose is called "cross-validation". It consists mainly in choosing a number of folds into which the data are partitioned. Say, for example, that we use ten folds and therefore split our data into ten partitions p1 to p10 of approximately equal size. Then nine of them, say p1 to p9, are put together and used for training (the Modelling step of the CRISP-DM process) and the last one, p10, is used for testing (the Evaluation step of CRISP-DM). This is repeated exactly ten times, so that each partition pi is used once as the test set while all the others (p1, p2, …, p10 but not pi) are used for training. In this case we speak of ten-fold cross-validation. 

Usually together with cross-validation, one uses stratified folds of data, that is one wants to make sure that in each of the folds the classes are represented approximately in the same proportion as in the union of all the ten sets. 

The overall error rate of such a cross-validation is then computed as the average of the ten error estimates.  

Note that depending on the application, one has to introduce different costs for different errors depending on their impact. So for example classifying an A as a B might cause much more damage than classifying a B as an A. 
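A generic sketch of this procedure (plain Python; the train and predict functions are placeholders for whatever modelling technique is being evaluated, and stratification is omitted for brevity):

# Ten-fold cross-validation for an arbitrary classifier.
import random

def cross_validate(data, labels, train, predict, folds=10, seed=0):
    # Returns the average error rate over `folds` train/test splits.
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    partitions = [indices[i::folds] for i in range(folds)]

    error_rates = []
    for i, test_idx in enumerate(partitions):
        train_idx = [j for k, part in enumerate(partitions) if k != i for j in part]
        model = train([data[j] for j in train_idx], [labels[j] for j in train_idx])
        errors = sum(predict(model, data[j]) != labels[j] for j in test_idx)
        error_rates.append(errors / len(test_idx))
    return sum(error_rates) / folds

# Example with a trivial "predict the majority class" learner, just to show the interface.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

data = [[i] for i in range(50)]
labels = [i % 2 for i in range(50)]
print(cross_validate(data, labels, train_majority, predict_majority))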

 

E.5 Products for data mining 

 

A lot of products are available on the market for dealing with data mining problems: C4.5, C5.0, See5, Fair Isaac, Insightful Miner / S-Plus, Mineset (PurpleInsight), SAS Enterprise Miner, SPSS Clementine, Statsoft Statistica, ThinkAnalytics, etc. 

 

Weka is a freely downloadable open-source framework intended for testing data mining algorithms.

 

 

 

 

 

F. On Algorithms  

(Jean-Paul van Bendegem, VUB) 

 

In this section of the Appendix we will run through each one of the five topics mentioned in 3.2.3 under the heading “algorithms”. The main purpose is to provide some examples that clarify the concise description in the main text of this report.  

F.1. The choice of language in which to express the procedure can have a tremendous effect on the "success" of the algorithm in terms of efficiency (computer space and time required). As an example, consider a (formal) language that accepts modus ponens as a logical rule – A and "If A, then B", therefore B – and suppose that we are dealing with a numerical problem, say checking whether a property P holds for a particular number, say 100. The information we have is that 0 has the property P and that, for all numbers n, if n has the property, then so does n+1. If one wants to formally prove that 100 does indeed have the property, there is only one option: start with 0 and apply modus ponens over and over again to go from 0 to 1, from 1 to 2, …, from 99 to 100; thus the rule is applied 100 times.

However, suppose that we make the language a bit more expressive. Suppose that we allow ourselves to “contract” repeated steps. To give a concrete example: on the basis of the language given, it is possible to give a general proof (i.e., not restricted to particular numbers) that if the property holds for n, so it holds for 2n, the double of n. In that case one gets a “speed-up” effect. Starting from 1, one runs through the powers of 2, i.e., 2, 4, 8, 16, 32, 64, and, because the next step gets us beyond 100, we will have to make 36 one-step moves, starting from 64, to reach 100. This means that we now need 6 + 36 = 42 steps to complete the same process. (Note incidentally this curious phenomenon, well-known to mathematicians, that proving a general case is often easier than proving a particular case, although, of course, the general case is far more informative than a special case. The implication is straightforward: there is no simple relationship between the complexity of a (logical-mathematical) proof and the scope of the content of the statement proven). 

It seems a straightforward extension of the above idea to make languages as expressive as possible. However, this is not advisable without qualification, since expressivity entails complexity issues and problems. A classic example in the area of formal logic is the choice between first-order and second-order logic. In rough terms, if we are talking about objects and their properties, first-order logic only allows quantification over objects ("For all things x, so-and-so"), whereas second-order logic allows quantification over objects and over properties ("There is a property P such that for all things x, so-and-so"). Second-order logic is obviously far more expressive, but, to give just one example, it has no complete proof theory, meaning, informally, that no satisfactory definition of what constitutes a proof can capture all its valid statements.

(George S. BOOLOS, John P. BURGESS & Richard C. JEFFREY: Computability and Logic, Fourth Edition, Cambridge: Cambridge University Press, 2003, is still considered to be the best introduction).

F.2 The choice of support or carrier for the algorithm. In the logical-mathematical literature the standard conception of a computing device is still the concept of a Turing machine T. T consists of a tape and a reading-writing machine. The tape itself is divided into squares, and each square may contain a symbol taken from a (usually minimal) alphabet, viz. 0 and 1. The machine itself is described in terms of (a finite number of) states s1, s2, …, sn and (a finite number of) movements, usually moving left (L) or moving right (R). Thus a typical program instruction would be <0, si, sj, 1>, i.e. if T reads a 0 on the tape and is in state si, then it changes into state sj and writes 1 on the tape. Often a diagrammatic approach is preferred (this obviously bears a connection to the choice of language!) to represent a basic action of T, i.e., the execution of one instruction; a small programmatic sketch of such a step is given below.
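As an illustration (a plain Python sketch; the particular program, which overwrites 0s with 1s, is invented purely to show the mechanism):

# One basic action of a Turing machine: look up <read_symbol, current_state>,
# write a symbol, change state and move the head. Repeating this gives a run.
from collections import defaultdict

# program: (current_state, read_symbol) -> (new_state, write_symbol, move)
program = {
    ("s1", 0): ("s1", 1, "R"),   # reading 0 in state s1: write 1, move right
    ("s1", 1): ("s1", 1, "R"),   # reading 1: leave it and move right
}

def run(program, tape, state="s1", head=0, max_steps=100):
    tape = defaultdict(lambda: "B", enumerate(tape))   # unwritten squares are blank ("B")
    for _ in range(max_steps):
        key = (state, tape[head])
        if key not in program:                         # no matching instruction: halt
            break
        state, tape[head], move = program[key]
        head += 1 if move == "R" else -1
    return [tape[i] for i in sorted(tape)]

print(run(program, [0, 1, 0, 0]))   # [1, 1, 1, 1, 'B']: all 0s overwritten, halt on the blank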


Amazing as it sounds, nearly the whole of mathematics can be translated in terms of such machines. It is an extremely powerful tool because of its simplicity and its scope. It allowed Alan Turing to show that there cannot exist a machine which, for any other machine and any given input, decides whether that other machine will or will not stop on that input. Thus there is no universal mechanism or algorithm to decide whether arbitrary problems do or do not have solutions.

However, at the same time, it is clear that a Turing machine T is quite independent of any support whatsoever; it is typically a "paper" computer, and the physical basis, i.e., how T is implemented, is of no importance (or, at least, is assumed to be). More specifically, no issues of (underlying) causality have to be dealt with. Recent advances in computer technology are rapidly changing this picture:

  1. Quantum computing requires the preparation of quantum mechanical systems in very specific states to make the computation possible; in this case the physical system is of primordial importance and the limits of the quantum mechanical system co-determine the computational limits. (Related to this topic is the discussion whether quantum mechanical Turing machines can do “more” than classical Turing machines - still an open question at this moment), 

  2. Organic computers require a biological support and here too, the inner workings of, say, a cell need to be well understood in order to get a grip on the computation itself. Biochemical processes are involved, such as DNA-RNA interactions, that determine what the computational possibilities are. Hence the support is primary. 

This raises the interesting discussion whether or not one can consider the underlying physical system itself as the algorithm and not necessarily a symbolic representation of that system (comparable to the phenomenon that time is defined presently, not by any artificial man-made object but by a “natural” object, viz. pulsars). It relates to the John Lucas-Roger Penrose discussion whether the human brain can be considered to be a classical Turing machine, or to be seen as such, or none of these.

F.3 Program verification is a separate issue that is often ignored in actual practice: it concerns the question whether the program or algorithm really does what one expects it to do. Of course, in simple cases there is not much of a problem, but as soon as complexity reaches a certain level, it becomes a difficult task to prove that the program is correct. Usually the proof verifying the program or the algorithm is more complex than the algorithm itself (often measured in terms of length, i.e. how many symbols are needed to write down the algorithm or program and the verification). An example to illustrate this point:

Consider a “simple” program for division for positive numbers, x and y: 

((r := x; q := 0); while y ≤ r do (r := r − y; q := 1 + q))

Note that the program not only produces q, the quotient, but at the same time r, the remainder. In order to prove that this program does indeed do what we claim it should do, a formal language to talk about programs is necessary. As with topic 1, the choice of language matters; in this case we will need statements such as:

  1. P{Q}R     (informally: “given the input described by P, after execution of Q, R results”),

  2. x := f    (informally: “replace x everywhere by f”).

In addition we will need axioms as we want to prove that the program is correct:

  1. P0{x := f}P, where P0 is P with x replaced everywhere by f,

  2. If P{Q}R and “If R, then S” is provably the case, then P{Q}S            (D1)

  3. If P{Q}R and “If S, then P” is provably the case, then S{Q}R            (D2)

  4. If P{Q1}R1 and R1{Q2}R, then P{Q1;Q2}R,

  5. If (P and B){S}P then P{while B do S}(not-B and P)                (D3)

From these axioms and rules a formal proof of correctness of the above program can be constructed; its length, compared with that of the one-line program, illustrates the complexity issue (the two lemmas used in the proof are merely arithmetical truths that are assumed to be correct). The example is taken from one of the early great "classics" in program verification, viz. C.A.R. Hoare's 1969 paper "An Axiomatic Basis for Computer Programming".
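As an informal complement (not the Hoare-style formal proof itself), the division program transcribed into Python, with the loop invariant x = y·q + r checked at run time:

# The division program above in Python. The assert statements check the loop
# invariant x == y*q + r dynamically; this illustrates what the formal proof
# establishes once and for all, but is not a substitute for it.
def divide(x, y):
    # Integer division of non-negative x by positive y: returns quotient and remainder.
    assert x >= 0 and y > 0
    r, q = x, 0
    assert x == y * q + r            # invariant holds before the loop
    while y <= r:
        r, q = r - y, 1 + q
        assert x == y * q + r        # invariant is preserved by every iteration
    # on exit: the invariant holds and not (y <= r), i.e. x == y*q + r with 0 <= r < y
    return q, r

print(divide(17, 5))   # (3, 2): 17 == 5*3 + 2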


(Note that this topic is of importance to society: there have been court trials to decide whether a (commercially available) program was indeed sufficiently verified or not, as the program malfunctioned and a "guilty" party had to be identified.)

 

F.4 Conflicting data and inconsistencies. To illustrate the problem the title refers to, consider a database containing data expressed in terms of statements or properties, represented by p, q, r, … Reasoning with these data involves the use of rules of the form A ← B, meaning that, if B (often referred to as the body of the clause) is the case, then A (often referred to as the head of the clause) will result. Consider now the following rather simple case:

  1. q ← p

  2. r ← p

  3. (s and t) ← r

  4. not-p ← s

  5. p

This set of clauses is inconsistent: the last clause states that p is a fact, but then q is a fact and r is a fact, but then s and t are facts as well, hence also s, and therefore also not-p. How to deal with this situation? It is definitely the case that, because of the size of actual databases and because of their permanent, often "real-time" updating, inconsistencies will and do result. The question then becomes: what to do?

The most common strategy underlying a number of proposals to deal with inconsistencies is to look at maximally consistent subsets. Thus in the above example, it is obvious that by dropping the clause "p", the problem is solved, although not much can be said now, except that not-p is the case. Therefore, it seems more interesting to delete the fourth clause, because then we can assume (though not necessarily) p, q, r, s and t to be the case. This raises the important issue: how to detect the maximally consistent subsets? Perhaps not entirely unexpectedly, this problem, although in principle solvable in the finite case, is of (at least) NP-complexity (see the next topic for a definition), in short, a "hard" problem.
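A small brute-force sketch of the consistency check behind this idea (plain Python; feasible only for a handful of propositional symbols, which is exactly why the general problem is hard):

# Each clause is encoded as a function of a truth assignment; a set of clauses
# is consistent if at least one assignment satisfies all of them simultaneously.
from itertools import product

clauses = {
    "q <- p":         lambda v: (not v["p"]) or v["q"],
    "r <- p":         lambda v: (not v["p"]) or v["r"],
    "(s and t) <- r": lambda v: (not v["r"]) or (v["s"] and v["t"]),
    "not-p <- s":     lambda v: (not v["s"]) or (not v["p"]),
    "p":              lambda v: v["p"],
}

def consistent(selected):
    symbols = ["p", "q", "r", "s", "t"]
    for values in product([True, False], repeat=len(symbols)):
        v = dict(zip(symbols, values))
        if all(clauses[name](v) for name in selected):
            return True
    return False

print(consistent(clauses))                                    # False: the full set clashes
print(consistent([c for c in clauses if c != "not-p <- s"]))  # True: dropping clause 4 restores consistency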

An alternative is to work with preferences, e.g., defined on the clauses. If we prefer not-p over p, for whatever reasons, then it is obvious that in the example given, we will drop the last clause, because that is compatible with accepting not-p. However, other problems appear: what if the preferences are such that no clause can be satisfied? Then one needs to rearrange the preferences. Of course, one might think about trying out all preferential possibilities, but that problem again is NP-hard. 

A next step – but that, to our knowledge has not really been made in computer science today – is to analyse the contradiction itself, to reason with it, and, on the basis of that analysis, to make a choice one way or another. This requires a move that both logicians and computer scientists are not all that willing to make: to abandon or, at least, to adapt classical logic. Without going into too much detail, here is the crux of the matter. In classical logic a contradiction is a disaster, because of the classically valid argument “If both p and not-p are the case, then anything, q, is the case”, the infamous “ex contradictione sequitur quodlibet”. To reject a logical rule, one must find a counterexample to it, in this case, one must find a situation wherein both p and not-p are the case, and, at the same time, not anything is the case, i.e., q can be false. The latter part is easy, enough things are not the case, but the first part requires a curious move: to accept that there are situations wherein a statement and its negation can be both true. However, the reward is phenomenal: one can now reason with contradictions.

F.5 Complexity and randomness is perhaps the most intriguing topic. Looking at problems that are solvable, i.e., for which there is a program that runs in finite time, requires a finite amount of space and produces an answer one way or another, it soon became apparent that some problems are easy to solve and some difficult to solve. The imprecise notions "easy" and "difficult" equally soon became translated into formal terms. Any problem is characterised by some parameters k, l, m, n, … If a problem is solvable in time or space expressible as a polynomial of some parameter of the problem, then we have a P-problem. E.g., given n books, sorting these books alphabetically is a P-problem (take the first book, which requires one step; take the second book and see if it comes before or after the first book, which involves two steps; take the third book and see if it comes to the left, to the right or in between the two books already sorted; in short, one needs 1 + 2 + 3 + … + n steps, or n(n+1)/2 steps, and that is a polynomial). Other problems require an exponential amount of space and time to solve (putting together a schedule for train traffic is such a problem). A special subgroup consists of those problems for which an exponential solution is known while it is not known whether a P-solution exists, and for which, in addition, it is always possible to make a guess for the solution, a guess that can be checked in P-time. Such problems are NP-problems (see the above topic). In fact, at the present moment, there is a quite complicated hierarchy of complexity measures.

However, even more intriguing is the fact that, given a program and given sets of data, there will always be structures or patterns present in the data that the program will not recognise as such. In other words, to the program the structure will appear random, although structure is present. The core of the argument is this: if a given set of data has a structure or contains some pattern, then it must be possible to describe that pattern, and, obviously, this description must be shorter than a simple enumeration of all the data. What we are doing, in a sense, is compressing the data. If this compression is to be reliable, then we must be able to prove so (compare with topic 3). If we could prove such compressions in all possible cases, then any data set could be compressed, and that runs counter to the idea that there are genuinely random sets. There will therefore always be data sets that contain patterns that will not be detected and hence will appear random.

Excellent guides in these matters are the works of Gregory J. CHAITIN: Information, Randomness and Incompleteness. Papers on Algorithmic Information Theory. London: World Scientific, 1990; Algorithmic Information Theory. Cambridge: Cambridge University Press, 1992; The Unknowable. Heidelberg: Springer Verlag, 1999. Here one finds the formal counterparts of the informal arguments presented above.

 

G Using user’s Profiling and Artificial Agents for Stimulating the Knowledge Exchange Process in Virtual Communities 

(Thierry Nabeth, Albert A. Angehrn and Pradeep Kumar Mittal, CALT - INSEAD)

 

In this section, we are going to examine how artificial agents relying on personal behavioural user profiling can be used to stimulate the participation in the knowledge exchange process in virtual communities. 

G.1 What are Virtual communities? 

G.1.1 Defining the virtual community concept

Howard Rheingold, the person who popularised the term “virtual communities”, has defined this concept as “social aggregations that emerge from the Net when enough people carry on those public discussions long enough, with sufficient human feeling, to form webs of personal relationships in cyberspace”. In a later literature review aiming at understanding why people join virtual communities, Ridings and Gefen (2004) found that the concept of virtual community has been characterised more generally as (1) people with shared interests or goals for whom electronic communication is a primary form of interaction, (2) groups of people who meet regularly to discuss a subject of interest to all members, and (3) groups of people brought together by shared interests or a geographical bond. We ourselves will define a virtual community as an “electronically supported” social structure holding together a set of people sharing a common culture, set of interests, values, goals or norms.

In the last few years, virtual community has emerged as one of the more important actionable paradigms for supporting the circulation of tacit knowledge in our digital societies. 

G.1.2 Virtual community environments 

In this context, virtual community environments are the techno-sociological infrastructures that have been specially elaborated to support these virtual communities. Virtual community environments include all the systems, such as forums or bulletin boards, that provide explicitly shared, dedicated spaces for supporting the discussions of communities or groups of people. The communication in these spaces can be asynchronous (bulletin boards) or synchronous in real time (chatrooms). People interact with others mainly by posting messages in (public or restricted) shared spaces, but can also sometimes communicate directly and more privately with one another. The control of what can be posted in the public spaces (specified in an explicit or implicit code of conduct) can be enforced by a moderator or by social regulation mechanisms (social pressure, norms, etc.). Recent forms of virtual community systems are enhanced with technical mechanisms and features aiming at better supporting and stimulating the social process, via the visualisation of people's activities (the line of research dubbed social translucence) or the provision of matching mechanisms that help groups to form or relationships to be established.

G.2 The challenges of participation 

G.2.1 Virtual communities: the participation challenge

One of the main challenges facing designers and operators who want to build successful virtual communities is the establishment of a sustainable dynamic of participation amongst their members. Indeed, the essential value of a virtual community resides in the activities of its members and, in particular, is strongly correlated with their willingness to spend time, to interact with others in conversations, or to provide knowledge assets. The participation of the members of a virtual community in this knowledge exchange process is not spontaneous, but is motivated by a number of elements such as: direct rewards, increased reputation, internal satisfaction (altruism and efficacy), or reciprocity.

Understanding the “mechanics” of participation in knowledge exchange in virtual communities or in groups has been the subject of much research in different fields such as knowledge management and organisation, CSCW, complexity, social computing, sociology and communication, and psycho-sociology, to name a few.

G.2.2 Some directions for addressing the participation challenge 

From these theories, we can try to derive a set of principles that could be used to address this participation challenge. One of these principles could consist in establishing, for the members of these communities, a climate of trust, a sense of community, and a feeling that their actions are recognised. Other principles could be derived from social exchange theory, which proposes an economic model of rational choice to explain participation. More concretely, a virtual community designer could set up mechanisms supporting the factors that make people participate, such as: (1) anticipated reciprocity; (2) expected gain in reputation and influence on others; (3) altruism and perception of efficacy; (4) direct reward. Cognition and psycho-sociology also offer theories of influence that could be applied to “orient” people in their desire to contribute to a knowledge exchange process. For instance, to stimulate participation one could try to exploit some of Robert Cialdini's six principles of influence: (1) reciprocity (the felt obligation to “reimburse”); (2) social validation (social conformance); (3) commitment / consistency (the tendency to act in a similar way as in the past); (4) friendship / liking; (5) scarcity; (6) authority. Finally, an interpretation of the laws of Metcalfe and of Reed on the importance of critical mass effects could lead to putting more effort into reaching a critical mass, and into making the perceived value of the virtual community more visible to the user.

G.3 Using Behavioural profiling and Intelligent Agents to Stimulate People Participation 

G.3.1 Using intelligent agents to stimulate people participation 

One of the most interesting approaches found to stimulate social participation in digital community platforms consists in implementing mechanisms that contribute to making the activities of their members visible. This approach has attracted considerable attention in the research community via the concept of social translucence.

Another promising approach, which we would like to present in this section, is to start from the idea of behaviour-based cognitive profiling suggested by Kinshuk and Lin (2004) for the design of adaptive learning communities, and to apply it to stimulating members' participation in the knowledge exchange in virtual communities. 

The main principle of this approach consists in the use of artificial agents that are aware of the behavioural profile of the members, and that intervene proactively using this information to stimulate members’ participation. This approach relies on two components: (1) the automatic construction (using a set of heuristics) of a behavioural profile of each member related to his knowledge exchange activity; (2) the generation of agent interventions that are the most likely to stimulate the participation of a particular member. The selection of the most effective interventions is based on the behavioural characteristics of the member. 

The construction of this profile results from the observation of the actions of the user and the application of a set of heuristics that help determine the participatory profile. The actions that are captured and that contribute to determining the participation profile include events such as: entering digital spaces, posting files, posting messages on bulletin boards, replying to messages, and so forth. The behavioural categories into which a particular user can be classified include the level of involvement (is he often present?) and the nature of his contributions (is he only a lurker? Is he a contributor of knowledge assets? Does he participate in the discussions? Does he initiate discussions? etc.). Examples of heuristic rules include: a user who has not connected to the system in the last month can be considered inactive; a user who posts in discussions at least once a week is committed to exchanging his knowledge; a user who has posted at least one document in the last three months is an active knowledge contributor. 
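
A minimal sketch of how such heuristic rules might be encoded is given below; the thresholds, category labels and the ActivityLog structure are illustrative assumptions, not a description of the actual system.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ActivityLog:
    # Hypothetical summary of the captured events for one member.
    last_login: datetime
    posts_last_week: int            # messages posted in discussions
    documents_last_3_months: int    # knowledge assets uploaded

def participation_profile(log: ActivityLog, now: datetime) -> set:
    """Apply simple heuristic rules to classify a member's participation."""
    profile = set()
    if now - log.last_login > timedelta(days=30):
        profile.add("inactive")                 # no connection in the last month
    if log.posts_last_week >= 1:
        profile.add("discussion contributor")   # posts at least once a week
    if log.documents_last_3_months >= 1:
        profile.add("knowledge contributor")    # posted a document recently
    if not profile:
        profile.add("lurker")                   # present but not contributing
    return profile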

 

These agent interventions aim at different objectives: (1) to create awareness: for instance, the agent may inform the member about a knowledge exchange practice he may not be aware of; (2) to raise interest: for instance, the agent may present a knowledge exchange practice from an angle that makes it particularly meaningful and useful to this member; (3) to stimulate action: for instance, the agent may suggest an action and encourage the member to try it; (4) to reinforce the practice: for instance, the agent may congratulate the member and provide him with feedback demonstrating the value and the impact of his contributions. 

G.3.2 The importance of the personal behavioural profile 

The effectiveness of these different agent interventions depends greatly on the exploitation of the user's personal information, and in particular on the level and nature of the user's participation. 

For instance, Everett Rogers' (1995) theory of innovation diffusion states that people do not adopt a new attitude straightaway but go through a series of phases of adoption (awareness, interest, trial, and adoption). The agent therefore needs to know about the current participatory level of the user (for instance, is the member already familiar with some of the knowledge exchange practices, or is he totally unaware of them?) when selecting the most effective intervention, in order to avoid proposing something that is too crude, or on the contrary too sophisticated, for this particular user. For instance, it would be useless to invite a member of a community to share knowledge with others if this member has shown very little readiness in the past to participate in an interaction. On the other hand, it may be useful simply to inform this member of the benefits people get from interacting more with others.

Similarly, this theory of innovation diffusion distinguishes different categories of people (from innovators to laggards) with regard to their attitude towards innovation. For instance, innovators are principally driven in their actions by curiosity, whereas late adopters are very sensitive to social pressure and will change their practice when they realise that they are isolated in their behaviour. The intervention that is likely to have the most effect on an innovator will be one that emphasises novelty, whereas the intervention that will have the most effect on a late adopter will be one that emphasises social conformance (“everybody does it that way”). 
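
The following minimal sketch illustrates how an agent might select an intervention from such profile information; the phase names, adopter categories and intervention texts are illustrative assumptions inspired by the theories cited above, not an actual implementation.

# Hypothetical mapping from a member's adoption phase and adopter category
# to the type of agent intervention, loosely following the ideas above.
def select_intervention(phase: str, adopter_type: str) -> str:
    if phase == "awareness":
        # Create awareness first: explain the benefits of exchanging knowledge.
        return "Inform the member of the benefits others gain from sharing knowledge."
    if phase == "interest":
        if adopter_type == "innovator":
            # Innovators respond to novelty.
            return "Highlight a novel knowledge-sharing practice to explore."
        # Late adopters respond to social conformance.
        return "Point out that most peers in the community already share documents."
    if phase == "trial":
        # Stimulate one small, concrete action.
        return "Suggest replying to one open question in the forum."
    # Adoption phase: reinforce the practice.
    return "Congratulate the member and show the impact of their contributions."

print(select_intervention("interest", "late adopter"))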

We can imagine many other uses of the behavioural profile information (level and nature of participation, personality traits, cognitive style or attention state) of a member for stimulating participation in a knowledge exchange. In particular, it could be an interesting exercise to try to derive, from the different theories addressing the participation challenge that we have mentioned previously (knowledge management, psycho-sociology, sociology, etc.), a set of agent interventions that would make use of this information to increase the level of participation in a community. For instance, some interventions would be directed at increasing the climate of trust, the sense of belonging to the community, or the feeling of being recognised. Other interventions could even function more surreptitiously, relying on the psycho-sociological theories of influence and playing on factors such as social imitation, commitment and reciprocity to increase their impact.

 

G.4 Many promises, but still a long way to go 

 

The approach that we have presented above looks very promising, since it provides an outlook on radically new categories of applications that personal behavioural profiling can make possible. 

However, beyond the ethical questions raised by the manipulation and exploitation of behavioural information, the empirical validation of these approaches still needs to be done, even if some preliminary work in this direction has already begun.

 

H The Profiling Game Riezlern

(Claudia Diaz, COSIC) 

 

H.1 Introduction: The Game 

This document describes a profiling competition that took place in January 2005 as part of the programme of the First FIDIS PhD Event. The competition was organised as follows: fifteen PhD students, divided into three groups of five people, had to collect as much information as possible about three target subjects, using for this purpose information publicly available on the Internet. The participants had a few hours to complete the task.

The purpose of the game was to use this simple experiment to get an idea of the amount of data available on the Internet, as well as to explore approaches to collecting and combining these data. The participants were given freedom to decide how to proceed with the collection and combination of data, as long as only publicly available data was collected. In order to be able to verify the correctness of the collected data, the target subjects were among the participants (one per group), and each group had to collect data on the two target subjects who belonged to the other groups. As a disclaimer, it is worth noting that the specificity of the profiled subjects (German PhD students with active Internet lives) does not (yet) make these results generalisable to the information available on an average citizen. 

At the end of the game, the groups had to present the profiling information of the target subjects they had studied. Each piece of data had to be linked to the web address or resource where it had been obtained. Groups obtained points for the data that was found exclusively by them. When a group had mistakenly added to the profile information that was not related to the subject (e.g., data belonging to someone else with the same or a similar name), points were lost. 

In the following sections, we describe the tools used to collect the information, the type of information that was found, the strategies used to build the profile, the main issues addressed by the discussion that took place after the game and the conclusions we have extracted from this experiment. 

 

H.2 Tools to collect data 

 

The most heavily used tools for the collection of data were search engines. They were used to get the basic information, which was later complemented with other search tools. The participants reported having used the following tools: 

  1. Google was used extensively. The keywords searched were initially the name and surname. Later on, as more information on the subject became known, other keywords such as the town or email address were used in order to refine the search or obtain additional information (a minimal, hypothetical sketch of this iterative refinement is given after this list). 

  2. Other search engines, such as Altavista provided complementary information. Search engines for a specific type of content, such as images.google.com or groups.google.com also provided complementary data (like pictures) on the profiled subjects. Google News was used to find newsgroup posts from student times.

  3. National and local phone books. As the nationality of the three profiled subjects was German, the German phone book proved useful, providing the home address, the phone number and sometimes even the profession of the subjects. Searches for people in the same town with the same surname led to the discovery of relatives. 

  4. Home pages proved to be a rich and reliable source of information; they typically provide a CV that describes many of the activities of the subject and the places and institutions in which he or she has worked. Moreover, they usually contain some indication of the preferences and interests of the subject, as well as a publication list. Once the subjects were linked to other people, the home pages of co-authors, friends, colleagues and relatives provided additional information. 

  5. Archived sources such as Google's cache or www.archive.org provided information that was no longer available on the original site, either because it had been removed or because the website was no longer active.

  6. Various messenger accounts were used to obtain information; for example ICQ, Skype.com, MS Messenger and Jabber. Orkut and Open BC Membership provided information on the social network and the interests of the subject.

  7. OpenBusinessClub provided information about membership in organisations. 

  8. Key servers provided the PGP key and information on the social network of the subjects (by looking at the people who had cross-signed their key with the subject). 

  9. School pages, once the schools the subject had attended became known (typically mentioned in the CV). 

  10. Domain owner searches (www.denic.de) and searches for home pages in web domains that contain the name or surname of the subject. These provide additional home pages the subject may own or maintain.

  11. Citeseer, DBLP and other publication search engines provided the publications of the subject and links to co-authors and related researchers. 

  12. Newspapers, mainly local newspapers from the home town of the subject, provided information on their public activities. Some groups also searched for events that were reported on the birth date of the subject. 

  13. Newsgroups on topics of interest to the subjects provided information on the technical activities and knowledge of the subjects, as well as relationships. 

  14. Well-known stores, such as Amazon.com, were looked into in order to find information on the books, films, or other articles the subject buys on the Internet.
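
As a purely illustrative aside, the iterative keyword refinement described in item 1 could be expressed along the following lines; only query strings are constructed (no search-engine interface is assumed), and the function name and example identifiers are hypothetical.

def build_queries(name: str, surname: str, known_identifiers: list) -> list:
    """Build successively narrower search queries from identifiers already found."""
    base = f'"{name} {surname}"'
    queries = [base]
    for identifier in known_identifiers:
        # Each newly discovered identifier (town, email address, employer, ...)
        # yields a more specific query.
        queries.append(f"{base} {identifier}")
    return queries

# Example with entirely fictitious identifiers discovered during a search:
print(build_queries("Jane", "Doe", ["Exampletown", "jane.doe@example.org"]))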

 

Some participants proposed a list of additional tools that could be used to further enhance the profiling, but which were not used due to the time limit of the exercise. Additional data gathering tools include: 

  1. Buying the financial record from Schufa. 

  2. Buying the consumer profile from Informa. 

  3. Running a password-guessing tool against the web form logins of Amazon and eBay with the subject's e-mail addresses, to harvest the business histories these sites keep. 

  4. Inquiring at the subject's high school, university and employer about certificates, graduation, etc. 

  5. Checking the many find-your-classmates web pages. 

  6. Writing him/her an email in order to get a reply revealing the IP address and the mail client. 

  7. Trying to sniff the communication of the subject in order to monitor his or her Internet activity. 

 

H.3 Results of the search 

 

The first goal of the participants was to find personal homepages. These were easily found by searching for the names in Google, as they appeared at the top of a list of a few hundred hits. The home pages were used to collect a basic set of information such as: 

  1. Work place, position, professional activities, work location 

  2. Contact information: town, address, phone, fax, email address 

  3. CV, which typically contains educational and professional history of the subject (name of previous companies, schools, universities, etc.) 

  4. Publication list, which provides information on the research interests and co-authors 

  5. Picture (a picture of the subject is often available) 

 

From this basic information, and using the tools listed in the previous section, the participants were able to obtain for one or more of the profiled individuals additional information such as: 

  1. Personal phone, address, alternative email addresses 

  2. Languages spoken 

  3. Political affiliation, and participation in political activities or discussions 

  4. Religious affiliation, and participation in religious groups 

  5. Participation in public awareness campaigns in the local town 

  6. Multiple alternative email addresses that served as pseudonyms in certain domains. These email addresses could have been further investigated in order to find additional information related to the subject under the pseudonym 

  7. Pictures of the subject and pictures taken by the subject 

  8. Military membership and grade 

  9. Date and place of birth 

  10. Name, address and phone numbers of relatives 

  11. Sports, hobbies and miscellaneous interests (photography, philosophy, etc.) 

  12. Social network: names of the co-authors, colleagues, friends (including the name and picture of an Internet fan) and relatives of the subjects were easily found 

  13. Places the subject has been to and dates of the stays 

  14. Personal diaries full of personal details (in some cases travel diaries) revealed daily patterns of activity 

  15. Opinions posted in a variety of Newsgroups 

  16. An image of the (physical) signature 

The amount of misleading data (data unrelated to the subject) found was minimal. However, it is expected that searches on subjects with more common names than those profiled in this particular game would generate larger amounts of misleading information (or would require more effort in order to separate the information that refers to the subject of interest from that belonging to other, unrelated subjects).

 

H.4 Building a profile 

 

The limited duration of the game, and the large amounts of data to be gathered, left little time for sophisticated profiling of the subjects. Most groups presented the collected data classified as belonging to distinct areas of identity. This organisation of the data led to some preliminary conclusions on meta-data, which is not directly available but is inferred from the combination of several sources of data. 

The subjects were naturally implementing identity management mechanisms, as the information they provided varied according to the topic and environment of the hosting web site. For example, one of the subjects had provided an informal picture for his student website, and a formal one for his profile on a business website. However, all these data could be linked together with simple searches on the Internet (as had been made obvious by the first part of the exercise). The distinct profiles that could be built on each of the profiled subjects were:

  1. The educational and professional profiles were easily available for all of the subjects. Detailed and complete information was provided on the personal homepages. This is consistent with the idea that it is in the interest of the subject to have a publicly available professional identity, because it is useful for advancing his or her professional career. We could even say that a public professional profile is today required in the research field in order to conduct daily working activities.

  2. Several aspects of the personalised profile were also easily available. In particular, there was extensive information related to hobbies, sports and other interests (some somehow related to the professional activity, like participation in Newsgroups on topics related to their work, and some totally unrelated, like photography or philosophy). Much of this information was made available through the homepages (or pages linked to them), and other information was easily linkable to the subject, as it usually contained his or her name and other verifiable information. From this observation, we could say that users often like to make these details public, possibly as a means of being reachable by people who share similar interests. More sensitive information, like sexual orientation, religious beliefs or attitudes towards drugs, could also be extracted from the available data.

  3. Unfortunately, there was no psychologist among the participants. However, a draft psychological, behavioural and ideological profile could be built, based on the opinions expressed (and the websites on which they were posted), language use, pictures (face, expression, clothing, background), topics discussed, and personal experiences described in Newsgroups, posts and online diaries (which proved to be a very rich source of information for building a psychological profile). Some of the groups selected some personal details for their presentations in order to give a feeling for the degree of detail of the data that can be found and linked to a person.

  4. Most of the groups presented names of people who appeared to be related to the studied subjects. Participants highlighted that, had more time been available, it would have been easy to build the social network of the subject. Names and addresses of relatives were found in the phonebooks; names of colleagues and co-authors were available on the homepages; some groups tried to google the address or phone number to look for other people living with the subject; membership accounts provided social contacts of the subjects; descriptions of trips or events often included names of other people; and key servers provided the trusted relationships with other PGP users.

  5. One of the aspects that proved harder to find with the available tools was the consumer profile. Some groups could draw conclusions on items of interest to the subject from the description of personal interests or experiences, as well as from the available pictures. However, a more detailed search may reveal the consumption patterns of users who have an account with eBay, Amazon or other e-commerce websites. Moreover, it is worth noting that this information is not available to the ordinary Internet user, but it is available to the companies with which the subject has made transactions (and to all those other companies that buy data from, or share information with, the original company). In this sense, the data is in fact available to those who are interested in exploiting it in order to build valuable consumer profiles.

  6. No financial information was available through search engines. More sophisticated techniques (possibly involving social engineering or hacking activities) may be required in order to access this sort of information. However, the financial status of the subjects could be roughly estimated from the data. Similarly, no health-related information was found. Nevertheless, assumptions could be made about the general health condition and the lifestyle of the subject by examining the available information (for example, someone involved in demanding sports activities could be assumed to be in healthy condition).

 

The combination of information related to a subject, belonging to different contexts, opened the possibility of sophisticated profiling. When the many different identities of the profiled individuals were put together (a combination of professional, personal and leisure activities), the result was a rather detailed and complete picture of the subject. Even though this was only outlined by the participating groups, the large amount of data gathered (which could not be thoroughly analysed) suggested that building a sophisticated profile, covering the most important aspects of a person's life, was possible.

 

H.5 Discussion 

 

A short discussion followed the presentations of the profiles. One of the issues that was raised was the very question of the meaning of profiling. While some argued that putting together facts about a subject which come from different sources is effectively building a profile, others argued that the gathering of information should be understood as data mining and that profiling consists in inferring meta-data from the raw data (i.e., interpreting the available data in order to classify the subject according to abstract models or categories).

The freedom given in the definition of the game, “build a profile as complete as possible”, led to different interpretations. While some groups looked for as many details as possible, others focussed on more general information that could give a “big picture” of the subject. Similarly, not every group used the same search tools. Some focussed on the social network, others on the professional and educational profiles, some on the social activities, and some on the subjects' writings. This led to another relevant question: what is relevant to building a profile? In order to answer this question, we need to know the purpose of building the profile. Clearly, a company wanting to sell a product is interested in profiles which provide information on how likely a customer is to buy a certain product, while a potential employer will be interested in the educational and professional profile.

Given that the profiled individuals were among the participants, we could ask the question of whether they were really aware of the amount of information related to them that was publicly available on the Internet. They acknowledged that, although much of the information had been made public by themselves, they were not aware of the existence (or linkability to their public identity) of certain data (in some cases, such as contact details of relatives, the data could seem scary).

What are the potential threats a person could face due to the exposure of personal information? It is clear that an adversary wanting to harm the reputation or even the physical life of a subject could use the Internet as a source of valuable information. On another level, companies can use this information in order to sell their products or services to the subject, organisations seeking to extend their membership (e.g., religious sects) may use this information in order to target individuals who are more likely to be converted, and public powers may also profile their citizens in order to exercise more control over the population (e.g., anti-system individuals could be easily monitored and controlled). 

Why do people provide information about themselves? The main advantage of, and motivation for, making personal data publicly available is that having a persistent identity on the Internet is a necessary condition for developing professional, personal, commercial and leisure activities. Offering a public profile enhances accessibility for other people who share the same interests. Keeping a public profile is also a tool for building reputation (e.g., providing the CV and the list of publications). A public identity is, in short, necessary for developing enhanced professional and personal activities, as well as a tool to extend the social network beyond traditional territorial borders. 

 

H.6 Conclusions 

 

Here we summarise some of the most important conclusions extracted from the experiment: 

  1. For individuals who develop activities on the Internet, large amounts of information linkable to their identity are available. This information includes professional, educational, personal and psychological data, as well as enough information to build (at least) a partial social network. 

  2. Sensitive data such as financial, consumer and health information, religious and political beliefs or sexual orientation are harder to find, but they can either be found with more sophisticated search tools or be inferred from the available information. 

  3. Internet users have a clear interest in making their educational and professional profile public in order to widen their professional opportunities. Regarding the personalised profile, some aspects are also made public, possibly to share information or establish communication with people who share similar interests. 

  4. There are certain aspects of the personalised profile that the subjects have no interest in making easily linkable to their professional profile. Yet this information is easily linkable with simple search tools (due to the fact that it shares an identifier, e.g., the name, with the professional profile). 

  5. The profiled subjects found most disturbing the public availability of data they had not introduced to the Internet themselves and which referred to their offline lives (for example, the home phone number or address), mainly data available in public indexes such as phonebooks. Subjects also seemed to be surprised by the possibilities of constructing a social network, which ranged from names and locations of relatives to an extensive network of social and professional relationships which could hardly be denied.

  6. Subjects were performing some identity management when interacting with different parties and in distinct contexts. Yet many of these partial identities could be traced and linked to a natural individual for whom extensive information on his work, life, interests and location was available. 

  7. People (including technically educated people) do not seem to be fully aware of the information about themselves that is publicly available on the Internet and is easy to find and link. When they realise the degree of public exposure to which they are subject, there is often a feeling of vulnerability. 

  8. The exposure of personal data may constitute a security problem. On the other hand, it is used as a tool to build reputation, enhance the professional and personal network, and share information with people with similar interests. 

  9. Once information has entered the Internet, it cannot be removed. Web pages which are no longer available can still be retrieved from archives and caches. Therefore, once data has been introduced to the Internet, it is no longer under the control of its owner. 

  10. Human intelligence was behind the search strategies and profiling methods. Although automated searches could be helpful in the collection of data, complex decisions need to be taken in order to discriminate relevant from irrelevant (and even misleading) information. This high cost in time and human resources for building just one profile may discourage massive sophisticated profiling. 

  11. The intervention of human intelligence may explain the minimal amount of misleading information that was presented as part of the profiles. 

  12. Social networks, on the other hand, are easier to build in an automatic way. A program could easily search for links in web pages and for appearances of identifiers such as names, email addresses or locations in order to construct a social network (a minimal, hypothetical sketch of this idea is given after this list). 

  13. The results provided by the search engines were much better for language-dependent content. Regarding the language issue, it also seems that people (at least Germans), even if they have a good knowledge of English, tend to write in their mother tongue. 

  14. Finally, there is clearly a need for identity management tools that allow for pseudonymous, unlinkable management of information that belongs to separate contexts in order to empower the user in the management of his or her data and identity. 
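
As a purely illustrative aside, the automated social-network construction mentioned in conclusion 12 could be sketched along the following lines; the function, its inputs and the simple co-occurrence heuristic are hypothetical assumptions, and page fetching, identifier extraction and disambiguation are deliberately left out.

from collections import defaultdict
from itertools import combinations

def co_occurrence_graph(pages: dict, identifiers: list) -> dict:
    """pages: URL -> page text (already downloaded); identifiers: known names/emails.
    Returns a weighted edge list: pairs of identifiers that co-occur on a page."""
    edges = defaultdict(int)
    for text in pages.values():
        found = [i for i in identifiers if i.lower() in text.lower()]
        for pair in combinations(sorted(found), 2):
            edges[pair] += 1   # each co-occurrence strengthens the link
    return edges

# Example with entirely fictitious data:
pages = {"http://example.org/a": "Jane Doe co-authored a paper with John Roe."}
print(co_occurrence_graph(pages, ["Jane Doe", "John Roe", "Max Mustermann"]))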

 

 

 
