Resources
Deliverables.
Identity of Identity.
Interoperability.
Profiling.
D7.2: Descriptive analysis and inventory of profiling practices.
D7.3: Report on Actual and Possible Profiling Techniques in the Field of Ambient Intelligence.
D7.4: Implications of profiling practices on democracy.
D7.6 Workshop on AmI, Profiling and RFID.
D7.7: RFID, Profiling, and AmI.
D7.8: Workshop on Ambient Law.
D7.9: A Vision of Ambient Law.
D7.10: Multidisciplinary literature selection, with Wiki discussion forum on Profiling, AmI, RFID, Biometrics and Identity.
D7.11: Kick-off Workshop on biometric behavioural profiling and Transparency Enhancing Technologies.
Forensic Implications.
HighTechID.
Privacy and legal-social content.
Mobility and Identity.
Other.
IDIS Journal.
FIDIS Interactive.
Press & Events.
In-House Journal.
Identity in a Networked World.
D7.2: Descriptive analysis and inventory of profiling practices
Appendix
A Forensic Profiling in the field of RFID and Biometrics
(Zeno Geradts, NFI)
A.1 RFID technologies
Radio Frequency Identification (RFID) tags are expected to be used increasinly in products at retailers and in travel identity cards. Currently the Wal-Mart in the USA, Metro in Germany and Tesco in the United Kingdom are requiring RFID-tags on the products. Also the Department of Defense in the United States is requiring tags on products from suppliers. Tracking and tracing are the most important reasons for this requirement and in the war in Iraq this was convenient for the transportation of equipment and tracking. Losing or misplacing products, parts or equipment will result in higher costs.
It is expected that in 2008 more than 20 billon RFID-tags are used. Most of them will use the EPC-standard (Electronic Product Code). For reading those tags it is expected that 100.000 EPC-readers will be sold. It is expected that the tags will drop to several euro-cents per tag.
RFID-technology
A RFID-system consists of different components :
One or more tags (transponders), consist of a microchip and an antenna
One or more reader and writers including the RF-modules
Application software connected to the reader/writer
RFID-tags exist in two forms :
Active (with a power supply )
Passive (need power from the signal of the reader)
Active RFID-tags are somewhat larger and more expensive, however they can be used at a greater distance. Passive tags can be used in consumer articles and are cheap. They can also be used in labels.
Furthermore, RFID-tags can be read-only and read-write. The read-write tags are used with an encrypted identification number, where only authorised users can change the information.
The retailers are requiring their suppliers, to use the Electronic Product Code (EPC). This standard has been developed by MIT’s Auto-ID center, and is managed by EPCglobal, which is a non-profit joint venture between two organisations: EAN (European Article Numbering) International and the Uniform Code Council. The second generation of UHF EPC-standard has been ratified in December 2004. In January 2005 this standard is submitted to ISO.
For RFID, the ISO standards concern :
Technology ISO 18000
Data content (ISO 15418, 15434, 15459, 24721, 15961 and 15962)
Device conformance test and performance (ISO 18046 and 18047)
Application Standards (ISO 10374, 18185, 11785)
A.1.1 Security of RFID
In July an article in Forbes presented a hacker’s guide to RFID. It would be easy to hack a tag, and change price information on a tag. There were also privacy concerns for the consumers. The cheap tags are just readable tags, and they cannot be altered easily. Also when using a write-once tag, often the price is not included in the product, since they use a serial number for the product. It is however an aspect that needs attention.
Example of RFID-labels from http://www.bluhmsysteme.com/rfid-etiketten.htm
A.1.2 Forensic aspects of RFID and profiling
RFID tags can be used also in forensic science to track a person using on the RFID-tags one carries in products. For example: there is a unique RFID-tag number on a packet of cigarettes and a person steals this packet in Amsterdam from a store. Theoretically it is possible that within European databases of stolen goods, the person can be arrested in Brussels when they scan the RFID-tags of this person. The same applies, of course, to the previously mentioned passport with a chip. The key elements are RFIDs, in combination with databases and if stored properly, it can be used as evidence.
The use of standards makes it also easier for forensic science to develop tools to read the information in a proper forensic way.
A.1.3 RFID and privacy
Also EPC is worried about privacy aspects. For this reason they have implemented certain privacy guidelines:
Guidelines
1. Consumer Notice
Consumers will be given clear notice of the presence of EPC on products or their packaging. This notice will be given through the use of an EPC logo or identifier on the products or packaging.
2. Consumer Choice
Consumers will be informed of the choices that are available to discard or remove or in the future disable EPC tags from the products they acquire. It is anticipated that for most products, the EPC tags would be part of disposable packaging or would be otherwise discardable. EPCglobal, among other supporters of the technology, is committed to finding additional efficient, cost effective and reliable alternatives to further enable customer choice.
3. Consumer Education
Consumers will have the opportunity easily to obtain accurate information about EPC and its applications, as well as information about advances in the technology. Companies using EPC tags at the consumer level will cooperate in appropriate ways to familiarise consumers with the EPC logo and to help consumers understand the technology and its benefits. EPCglobal would also act as a forum for both companies and consumers to learn of and address any uses of EPC technology in a manner inconsistent with these Guidelines.
4. Record Use, Retention and Security
The Electronic Product Code does not contain, collect or store any personally identifiable information. As with conventional barcode technology, data which is associated with EPC will be collected, used, maintained, stored and protected by the EPCglobal member companies in compliance with applicable laws. Companies will publish, in compliance with all applicable laws, information on their policies regarding the retention, use and protection of any personally identifiable information associated with EPC use.
If they are properly used, and if the consumer can discard the EPC-tag easily from a product, this means that the example above with the package of cigarettes is not feasible, if the thief discards the RFID-tag.
A.2 Biometric Devices
The market of biometric devices is expanding rapidly, since it is convenient to use, and users can not forget their biometric properties. Some examples of biometric properties which are used in commercial systems:
Face
Fingerprint
Handscanner
Iris
Voice
Handwriting
In forensic laboratories also other means of biometric measurement are used for comparison:
DNA
Ear prints
Lip prints
In many countries, DNA databases are built of persons who committed a serious crime and DNA found at the scene of crime, as well as, to determine possible contamination of evidence, databases of employees of the forensic science laboratories. The laws in many countries forbid use of these data for purposes other then that for which they are meant: for serious crimes, by court order or if specified by the law.
In a world where no privacy laws exist, theoretically it would be possible to collect databases of fingerprints, iris scans, voice, hand writing and face. These databases can be combined with one another. If the databases are structured correctly, and there are no errors in names or in the links made, these databases can be used to search by profiling for a person or a group of persons, perhaps leading to more crimes might be solved, and perhaps stronger evidence.
One problem is however that the one-to-many comparison, for example with faces, returns a high error rate. A good example is given at http://www.frvt.org/FERET/default.htm where a methodology has been used to examine different commercial packages and to compare them with a standard database.
71.5 % true accept rate @ 0.01 % false accept rate
90.3 % true accept rate @ 1.0 % false accept rate
Another test by NIST on fingerprints, gives better results http://fpvte.nist.gov/ . We can see here by using single good quality fingerprints
99.4 true accept rate @ 0.01 % false accept rate
99.9 true accept rate @ 1.0 % false accept rate
When multiple fingerprints and face images (or even 3D-images) become available, the situation improves. In these systems, poor quality images are not used. In practice when filling databases with real images these will also be included, and results will deteriorate.
Figure 6: Example of mass market 3D-face scanner
B. Legal grounds for customer loyalty programs
(Barbara Körffer, Martin Meints, ICPP)
Based on European legislation, the German Federal Data Protection Act (BDSG) has several sections relating to customer loyalty programs. They are elaborated in detail in the study “Kundenbindungssysteme und Datenschutz”, commissioned by the “Verbraucherzentrale Bundesverband e.V. (vzbv)” and carried out by ICPP in December 2003.
Usually customer loyalty programs employ customer account cards. From a data protection perspective, the contract between the customer and either the discount-granting company directly or an operator of the account card system can be interpreted as two distinct parts:
discount related part: rewarding customers’ loyalty by granting discounts
collection of additional data: collecting and processing additional personal data for different purposes, such as:
advertising
market research
For the first part of the contract, Article 28 paragraph 1 no. 1 BDSG is relevant. This paragraph limits the use of personal data to the purpose of the discount related part of the contract. In detail, this means that the contract partner of the customer only is allowed to store and process the following data:
Name
Address
Year of birth
One further address information (phone, email etc.)
Time and place of card deployment
Price of purchased goods / services and discount amount
Data related to the purchased goods only if they are necessary for computation of the discount amount.
The second part of the contract for collection of additional data can be based on the legal foundation of Article 28 paragraph 1 no. 2 BDSG. This regulation allows the use of data as far as overriding legitimate interests of the customer are not concerned. This includes the name, year of birth and address.
For further data a written consent of the customer is required. Requirements for the consent are regulated in Article 4a BDSG.
In addition, for data processing in general the following legal regulations apply:
Article 4 paragraph 3 (active information of the customer)
Article 34 (provision of information to the customer)
Article 35 paragraph 1 (right to get data corrected)
Article 35 paragraph 2 (right to get data deleted)
Consequences especially of Article 4a BDSG are that the customer opting for the further use of his/her personal data has to know details and consequences of the collection and processing of the data. In addition, the decision is not considered to be freely entered into if the participation in the loyalty program depends on an agreement for collection and processing of additional personal data. To gain a legally effective declaration of consent, the customer has to be informed about the following facts:
Who is responsible for the data and the data processing?
Which categories of data will be processed?
For what purpose are they processed?
How is the processing done (phases and data flows)?
Whom are data transferred to?
Information on voluntary decision of consent and the consequences of a refusal
Information on the ability to revoke the consent at any time
In the aforementioned study, 16 federal and several regional customer loyalty programs were investigated for their compliance with the BDSG. All programs investigated showed defects, some minor, some major.
Examples include:
More data than necessary for correct implementation of the discount were collected without consent of the customer
Participation in the discount program depended on consent to further usage of personal data: no freedom of choice given
The exclusion of personal data from the declaration of consent was not implemented in a privacy-compliant way. An opt-in or opt-out possibility was not implemented in the contract; the customer had to cancel that part of the contract he didn’t want to agree with.
In rare cases, conditions of entry and privacy policies were sent to the customer only after getting his/her consent – this means no consent at all because the customers could not know the conditions in advance
Information was missing that consent is voluntary
Consequences of a refusal of the declaration of consent were not pointed out
Planned processing of personal data was not sufficiently described – therefore declaration of consent was not completely based on a free decision
In some cases, such defects were remedied by the operators of the loyalty programs.
For the LN-Card, a customer loyalty program for the readers of a local newspaper, the “Lübecker Nachrichten” (LN), the privacy seal of “Unabhängiges Landeszentrum für Datenschutz Schleswig-Holstein“ (ICPP) was granted.
C. Biometrics profiling
(Angelos Yannopoulos and Vasiliki Andronikou, ICCS)
C.1 Physical Biometrics Profiling : Face Recognition
Face recognition forms part of the general object recognition problem that seeks to differentiate objects which only vary in minor ways from each other and is one of the most challenging computer vision problems. Recognition is achieved through the analysis of certain facial features, such as the upper sections of the eye sockets, or the area around the cheekbones, and their comparison to stored templates. This technology works for both verification and identification. The design, the development and the performance of such a system face many difficulties owing to vulnerability to variations in the operating environment, complex or moving backgrounds, complex foregrounds, occlusion, and in general where the complexity of the system increases in view of real-time operation. The performance of the system can further degrade because of the age of the template. As in every biometric system, in a face recognition system identification of individuals is possible only for those individuals whose images have been enrolled into the system’s database. Thus, the first step for the construction of an individual’s profile is the enrolment of the person’s image (of reasonable quality) into the system’s watch list.
A challenging fact which spurs efforts to overcome the technological limitations of such a system and yet raises great concern regarding privacy is that, generally and compared with other biometrics, there is no reasonable expectation of privacy with regard to physical characteristics that are constantly exposed to the public, such as facial features or voice. Nevertheless, the people entering a public space where video surveillance is used should be notified, so that they can make an informed choice of whether or not to subject themselves to surveillance. This is the case because the professional analysis of stored facial images or other biometrics (by means of profiling techniques) can reveal certain diseases that a layman in face to face contact could not have recognised.
A simple scenario of the use of such a system could be the tracking of a wanted person by the authorities through the individual’s activities; as people are go about their daily tasks, surveillance cameras could be capturing their faces and transmitting these images in order to be compared to the watch list, that contains the people for whom the authorities have issued an arrest warrant. Such a system would be able to track an individual’s movements, actions and transactions in real-time and combine this information with respective data from the person’s past, stored in databases to which there is access.
When the Super Bowl XXXV was about to take place in 2001, fears of a terrorist attack led to the need for increased security measures. The final system used relied on facial recognition, which involved surveillance cameras continuously scanning spectators’ faces and capturing images of their faces processed in real-time in order to produce a “faceprint” (set of measurements of facial features). This “faceprint” was then compared against each record in a computerised database of well-known criminals and suspects of terrorism and was classified as either matching or differing from each of them. An alert to the police was to be triggered in case of a match.
Nevertheless this “super surveillance” could be very easily abused. For instance, what if the watch list does not only include people wanted by the authorities. Should the authorities be allowed to add to the watch list a particular individual simply to track their moves or is the approval of a judge required? Another scenario involves the capturing of images of all the faces included in the videos captured by the cameras, the only technical restriction of which being the required volume of storage. This information could be then maintained for future use, as a record for the particular person. Moreover, increased police attention on previously convicted persons identified in public areas by the system would make their rehabilitation harder.
Questions arise such as who should be entered into the watch list, whose approval (and in which cases) is essential for such an entry, how long are these records maintained, whether they should be accompanied by information, such as when the entrance of each person’s image took place and at whose request. It is thus obvious that abuse can be eliminated though official procedure to some extent by establishing legal measures controlling the administration, management, use and maintenance of such systems, whereas citizen committees and public boards could monitor government use of the system. However, existing measures taken to handle related privacy issues emerging from other types of surveillance technology could be taken into account.
C.2 Behavioural Biometrics Profiling: Keystroke Dynamics
We will now briefly address the issue of whether and how behavioural biometrics can be exploited in a malicious manner.
A simple example of a behavioural biometric that can be abused in a way that is extremely difficult to track is the measurement of keystroke dynamics. The classic application is to protect passwords: a password system logs the timing taken by a user to enter his/her password, as well as the password itself, and then uses a pattern recognition system to classify a newly entered password as matching or differing from the logged timing patterns. Since computer users constantly use the keyboard, any application that can reliably measure the timing of a user’s typing can also try to perform identification or verification of the user.
Given the basic idea, it is quite easy to engage in various flights of fantasy that involve behavioural biometric systems violating the privacy of computer users by inexorably tracking their identity no matter how they try to achieve anonymity. For instance, one might consider a chat programme which measures users’ typing patterns and identifies them regardless of whether they use multiple accounts, identify themselves with pseudonyms otherwise unrelated to their personal data, use different computers such as those available at internet cafes each time they log on, and so on. Or, perhaps, imagine a browser that monitors how a user moves the mouse and can thus always identify its users.
Clearly, the ability of such malicious software to create a centralised collection of identification data needs to be discussed when considering such scenarios, not to mention the possibility of making the measurement at all: a trivial example is that we need not worry that a web server might realistically make such measurements, as the mouse is obviously handled on the client side by the browser and hence the server of a web page never receives useful timing data about mouse movements – nor, in fact, about typing, since the client sends batches of data at relatively large time intervals, e.g. when a user hits ‘enter’). However, scenarios will arise when such measurements are possible.
The ability to make measurements does not necessarily imply an ability to extract meaningful conclusions. Pattern recognition systems often fail at tasks where there is an unconstrained ability to make measurements, because of difficulties in the task of recognition itself. Obviously, pattern recognition systems do perform well in many cases, so it is important to study what kinds of problems they can resolve. Whether recognition is possible depends on intrinsic characteristics of the task, and also on the data available.
We are currently considering a variety of possible behavioural biometrics and how such data can be abused for the task of identifying an unknown user, as opposed to verifying an identity already claimed by a user, which is generally relatively easy, given adequate data. This is still ongoing work, but we can summarise some results that are emerging. There are a number of different parameters of such a malicious effort, and if all of them are handled well, the task is solvable. We tabulate these parameters and table how their combinations affect the possibility of malicious activity.
D. Profiling of web-users
(E. Benoist, VIP)
There are a lot of ways to gather information concerning web users. Sometimes the user is indeed aware that he is providing information to others, but sometimes he does not want to give any information and it is simply stolen.
D.1 Log files analysis
Since ever the web existed, there have been log files. This means that a server (such as Apache Httpd or Microsoft Internet Information Server IIS) stores a lot of information concerning each of the requests they deal with in a file (or several) which is then often used to compute statistics for a site, amongst other tasks.
Such a file is always generated by the server. It can either be stored for future use or otherwise be deleted if it is only used for debugging purposes. Log files contain a lot of information, and the user is usually unaware of this. Here is an example of an entry in an apache server log file, corresponding to a request for the page f.html originating from the IP address 62.2.135.157.
62.2.135.157 - - [23/Dec/2002:00:29:12 +0100]
"GET /Icons/70x1.gif HTTP/1.1" 200 54
"http://www.isbiel.ch/A/f.html"
"Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/0"
This entry contains the following information: IP address, Date of the request, HTTP Request (the first line), the return value (200 means OK, 404 document not found), the size of the information, the URL-referrer (the previously visited page), and the user agent (which browser or spider sent the request).
D.2 Session tracking
Since the Internet protocol for the web (HTTP) does not include session handling, developers have been forced to simulate this feature. In many circumstances, it is very important to follow the session of a user to see that different requests belong together. This is useful for instance in recognising whether a user has been granted access to the server with a username and password. It is also used to gather the elements of a virtual basket, such that the user can visit a web site and keep track of what s/he has bought.
One method for tracking a session of a user is to send a cookie. Cookies are small pieces of information that are added inside the HTTP header, i.e. the first part of the source code of a web page, usually not visualised by a browser. The server sends this information with a statement of its validity, and as long as the cookie is still valid, it will be resent to the server each time the client sends a request to the same host.
Sessions can also be tracked for people that do not accept cookies. It is possible to insert a session ID in all the URLs of the requests for pages on the server. Such a method can be detected by the user. But there exists another more complicated method. Since DHCP servers rarely dynamically change the IP address of a client during a session, one can expect that all the requests occurring from the same user have the same IP address. This offers us the possibility of considering as a session all the requests that originated from the same client.
D.3 User Tracking
Session tracking is useful for attacking privacy, but one can go further. If the user can be linked to all his sessions, a lot of information can be gained by this "total history" of a user’s visits.
So-called user tracking can be achieved using cookies. But with such cookies, the validity must be set to one month or one year. Such a cookie contains a user ID that is stored in a database on the server such that the server can recognize the client the next time he visits the site. This is being done for example at www.amazon.com where the user receives advertisement and special offers corresponding to all the pages he has visited and all the item he has bought so far.
Again, such a tool may be dangerous for privacy. For instance if the user is not only known by his/her ID, but as a human being with a name, an address, and a credit card number. Then these pieces of information once entered by the user may "propagate", while the client may have expected that the site would have “forgotten” them.
D.4 Multi-server user tracking
The next level in user profiling deals with the profiling over multiple hosts. Since cookies can only be sent to the site that created them, it is not possible to have the same cookie for more than one web site. That means that the user has different user IDs on different sites. The same person may be seen as ID=1234 on site1, ID=5674 on site2 and ID=007 on site3.
A solution to this problem is to have a central site that serves all the pages. This is clearly not usable in general, since it would not be efficient at all. Yet there is a simple efficient solution. We just have to insert a request for the same web site in all the pages of the different sites. This is possible, and indeed often done, using a small image. When a web page is downloaded, the browser usually asks for all images that are contained in the page in order to display them in the same page. Usually, images are located on the same web site and are just static files. The idea however is to reference inside the web page an image that is stored on another server. This reference points indeed to a program that generates an image (for example using PHP) which is sent as the answer to the request. Such a program can use its own cookie and this cookie will be the same for all the web sites, since the image comes from the same server.
Such an image must not harm the design of the web page. There are two solutions: first it can be a banner that contains an advertisement and can centralise all the information. But it can also be a hidden pixel. That means an image with a height and a width of 1 pixel. If the pixel has the same colour as the pages that contain it, the user will never see it.
D.5 Protection measures
Disable Cookies
The protection can be on various levels. First one can simply disable the cookies in the browser. This means that the browser will never accept or send any cookie. Such protection is quite efficient, since one can not easily treat sessions and it is not possible to reference a user as a recognized client. Yet, a lot of web sites require cookies to handle sessions, and without enabling cookies users are unable to use many web sites that might be interesting.
Moreover, it is possible, using IP address, to retrace the session of a user using only the IP address (which does not change inside a session).
Disable persistent cookies
One can allow cookies to be used for one session only: it is possible to prevent them being stored on the hard disk and thus used as user ID for many transactions. This is quite harmless, since the only permanent cookies that are interesting concern language specifications.
Disable third parties cookies
This is used to prevent the information passing from one site to another. This is a good idea, since such information use always has negative connotations for the user. We cannot find a valuable program that uses this feature.
Define a policy for the use of cookies
Some web sites declare how they use data collected about users. Based on such information, it is possible to decide whether or not to supply information to this site.
E. Mathematical Tools for Data Mining
(Emmanuel Benoist and Bernhard Anrig, VIP)
The CRISP-DM process model contains 6 phases: Business Understanding, Data Understanding, Data Preparation, Modelling Evaluation and Deployment. We will focus here on the two steps depending on the machine: Modelling and Evaluation.
Modeling can be done with a very large set of algorithms and tons of heuristics. We present hereafter two rather simple modelisation technics, first the regression analysis and then a simple algorithm for constructing decision trees.
Since there exists no unique modelling technique that would work for any problem, we have to evaluate the accuracy of an algorithm and its configuration. This is done during the evaluation step. As an example, we present the cross-validation protocol for testing the quality of a method on a given set of data.
E.1 Regression Analysis
Suppose we have one set of data containing two variables (x and y) that seem to be correlated. We can test if the x’s and the y’s are linearly correlated with the following formulas. We try to build a formula for computing the y’s knowing the x’s, y=f(x). The same type of strategy works also if y is dependent on many different parameters (for instance y = 5.93 x + 1.87 z - 0.98 t).
Suppose we have a set of data containing for each record two parameters x and y. When we plot a graphical representation of the two parameters, they seem to form a line. We can have the intuition that the two parameters are linearly correlated. That is what we can compute.
Suppose we have some children with the age of 6, and we want to find an equation for modelizing the relation between size (in cm) and weight (in kg). We want to compute the weight knowing the size of a child.
Child number | Size (x) | Weight (y) |
1 | 121 | 25 |
2 | 123 | 22 |
3 | 108 | 19 |
4 | 118 | 24 |
5 | 111 | 19 |
6 | 109 | 18 |
7 | 114 | 20 |
8 | 103 | 15 |
9 | 110 | 20 |
10 | 115 | 21 |
![]()
We first plot this set of points in a graphical representation. We remark that the points almost form a line. We will try to model the value of y as a linear function of x. This can be noted . The problem is now to find the values of the constants a and b.
We want to find a and b such that the line is a good approximation of the values. This will be done by minimizing the differences of the very value of y and its respective computed value. To prevent that negative and positive differences could compensate, the sum of the squares of the differences is minimized.
This sum is minimal for two values of a and b (using derivation and the 0 of the derivate) to be obtained with the following formulas:
and
![]()
In our example the mean value for x ( ) is 113.2, for y ( ) is 20.3. The results for the coefficients are therefore:
a = 0.42 and b = -27.38
This results in the equation:
weight = 0.42 height - 27.38
We can hence predict that a child of age 6 which has height 125cm will probably weight 25.27kg.
Such a method can easily be extended for attributes that are linear functions of many parameters. The mathematical concepts remain the same, the equations just have to be written in a matrix way.
![]()
Nevertheless this method is only available for numeric attributes that behave linearly. The equations have to be redesigned if the relation is not linear but quadratic ( ).
More about linear (and non linear) regression can be found in any book on statistics.
However, such a method can not be applied to symbolic attributes, like city, profession, education grade, or ability to pay back a loan. Nevertheless, some numeric attributes can also not be used in such a method: the Zip code is normally difficult to be correlated with other values whereas it can be very informative regarding to the academic level or the average income for instance.
E.2 Constructing Decision Trees
The algorithm presented hereafter is known as ID3. Together with a lot of improvements, it has formed the widely used system C4.5 and its new release is called C5.0. Nevertheless, this is not the place to present all the possibilities offered by such systems. We will only focus on the main mechanism for building decision trees, the gain of information in a divide and conquer algorithm.
The algorithm we present here can only deal with Boolean attributes. One of the more complex steps in data mining is to find a way to discretise continuous variables. Transforming any attribute in a set of Boolean attributes can be done using algorithms (such as clustering), but is often made using ad hoc heuristics, that need to be precisely configured to work fine.
Suppose we want to build a decision tree. The algorithm works top-down, seeking at each stage an attribute to split on that best separates classes, and then recursively processing the subsets resulting from the split individually (a divide and conquer algorithm).
Suppose we have the following table:
Number of the person, Earns more than EUR15000(A), Married (B), University degree(C), Had problems to pay back a loan(D)
Number of the person | Earns more than EUR15000(A) | Married (B) | University degree(C) | Had problems to pay back a loan(D) |
1 | T | T | F | F |
2 | T | F | T | F |
3 | F | F | T | T |
4 | F | T | T | T |
5 | F | F | F | T |
6 | F | T | F | T |
7 | F | F | F | F |
8 | T | F | F | T |
9 | T | T | F | T |
10 | T | F | F | F |
11 | T | T | T | F |
12 | T | F | T | F |
13 | F | T | F | T |
14 | T | T | T | F |
15 | F | F | F | T |
Consider that we have four attributes "Earns more than EUR15000" (A), "is Married" (B), "Received a University degree" (C) and "Had problems to pay back a loan" (D), each of which being Boolean, and we want to predict D. D is called the class. There are three possibilities for the attribute to be used at the root of our decision tree, i.e. A, B, or C. We examine the three cases. If we choose A, for A=true, we have 8 cases and there is 25% D=true and 75 % D=false (noted [2,6]) for A=false we have 7 cases which makes [5,1].
We have to measure the information contained in such a tree (or of any subtree). This will be done using its purity, which is the weighted mean of the purity of the leaves. The purity of a leaf is measured using its entropy. The formula is based on the probabilities pi of the class value (D in our example). The formula for the entropy is entropy ([p1,p2,p3,…,pn])=-p1 log(p1)-p2 log(p2) - … -pn log(pn). In our case the entropy of the node [2,6] is -0.25*log(0.25)-0.75*log(0.75)=0.811 and the entropy of the node [6,1] is -0.86*log(0.86)-0.14*log(0.14)=0.592. Hence the value for this tree is the weighted mean of the two entropies, which makes:
(0.811*8+0.592*7)/15=0.709
We do the same for all the possible classes for the root of our tree. This means that if we take B as the root we have two leaves in our tree: [4,3] (for B=True) and [4,4] (for B=False). The information needed to order this tree is the weighted mean of the entropies and is 0.993. The last case is when the root is C, we have two leaves [1,4] and [6,3] this needs an information of 0.848 to be sorted. The higher the value is, the less information is gained by using the respective node as root for the decision tree. Hence we see that for example using the information whether someone is married or not (attribute B) does not influence very much (at least in our dataset) the ability of paying back a loan.
So we select the one attribute whose tree has the smallest entropy (where the purity of the nodes is the best). For our case, this means that the root for our tree is A. The same process described so far is then recursively done with the created leaves that are expanded into subtrees independently. The tree grows until it has only pure leaves, or no other attribute can be used. There exists also heuristics to deal with almost pure leaves and to stop developing the branches, but this process is beyond the scope of this paper.
E.3 Neural Networks
The neural network domain is an involved, complex and evolving one, and detailing its mathematical derivation and implementation is out of the scope of this document. See http://richardbowles.tripod.com/neural/neural.htm for more information.
The main difficulty with neural networks is the interpretation of the results. Data Mining techniques provide a way to gain information about the data. The knowledge discovered can be understood and used. But neural networks work as black boxes. There is no way to understand how a pattern has been learnt, what sort of deduction has been done. The translation into a humanly understandable language is not possible. Moreover, in order to “learn” the way to answer a new question, a neural network needs a huge training set, which is probably not available.
E.4 Cross-Validation
Usually the amount of data which is available for processing is quite restricted, hence one has to develop concepts for testing the quality of the results gained so far with only the small amount of data. One well-known concept for this purpose is called "cross-validation". It consists mainly in choosing a number of folds in which you partition your data. So for example say we will use ten folds and therefore split our data into ten partitions p1 to p10 of approximately equal size. Then, nine of them, say p1 to p9, are put together and used for the training (the Modelling step of the CRISP-DM process) and the last one, p10, for testing the Evaluation step of the CRISP-DM. This is then repeated exactly ten times so that each partition (pi) is once used as a set for training while all others (p1, p2, … p10 but not pi) are then used for testing. In this case we speak of ten-fold cross-validation.
Usually together with cross-validation, one uses stratified folds of data, that is one wants to make sure that in each of the folds the classes are represented approximately in the same proportion as in the union of all the ten sets.
The overall error rate of such a cross-validation is then computed as the average of the ten error estimates.
Note that depending on the application, one has to introduce different costs for different errors depending on their impact. So for example classifying an A as a B might cause much more damage than classifying a B as an A.
E.5 Products for data mining
A lot of products are available on the market for dealing with data mining problems: C4.5, C5.0, See5, Fair Isaac, Insightful Miner/ S-Plus, Mineset (PurpleInsight), SAS Enterprise Miner, SPSS Clementine, Statsoft Statistica ThinkAnalytics etc.
Weka is an open source framework intended to test algorithms for data mining that can be freely downloaded.
F. On Algorithms
(Jean-Paul van Bendegem, VUB)
In this section of the Appendix we will run through each one of the five topics mentioned in 3.2.3 under the heading “algorithms”. The main purpose is to provide some examples that clarify the concise description in the main text of this report.
F.1. The choice of language in which to express the procedure can have a tremendous effect on the “success” of the algorithm in terms of efficiency (computer space and time required). As an example consider a (formal) language that accepts as a logical rule modus ponens – A and “If A, then B”, therefore B – and suppose that we are dealing with a numerical problem, say, we are checking whether a property P holds for a particular number, say 100. The information we have is that 0 has the property P and that, for all numbers n, if n has the property, then so does n+1. If one wants to formally prove that 100 does indeed have the property, there is only one option: start with 0 and apply modus ponens over and over again to go from 0 to 1, from 1 to 2, …, from 99 to 100, thus the rule is applied a 100 times.
However, suppose that we make the language a bit more expressive. Suppose that we allow ourselves to “contract” repeated steps. To give a concrete example: on the basis of the language given, it is possible to give a general proof (i.e., not restricted to particular numbers) that if the property holds for n, so it holds for 2n, the double of n. In that case one gets a “speed-up” effect. Starting from 1, one runs through the powers of 2, i.e., 2, 4, 8, 16, 32, 64, and, because the next step gets us beyond 100, we will have to make 36 one-step moves, starting from 64, to reach 100. This means that we now need 6 + 36 = 42 steps to complete the same process. (Note incidentally this curious phenomenon, well-known to mathematicians, that proving a general case is often easier than proving a particular case, although, of course, the general case is far more informative than a special case. The implication is straightforward: there is no simple relationship between the complexity of a (logical-mathematical) proof and the scope of the content of the statement proven).
It seems a straightforward extension of the above idea to make languages as expressive as possible. However, this is not the case, since expressivity entails complexity issues and problems. A classic example in the area of formal logic is the problem of the choice between first-order and second-order logic. In rough terms, if we are talking about objects and their properties, first-order logic only allows quantification over objects (“For all things x, so-and-so”), whereas second-order logic allows quantification over objects and properties (“There is a property P, such that for all things x, so-and-so”). Second-order logic is obviously far more expressive, but, to give just one example, there is no proof theory, meaning, informally, that a satisfactory definition of what constitutes a proof cannot be produced.
(George S. BOOLOS, John P. BURGESS & Richard C. JEFFREY: Computability and Logic, Fourth Edition, Cambridge: Cambridge University Press, 2003, is still considered to be the best introduction).
F.2 The choice of support or carrier for the algorithm. In the logical-mathematical literature the standard conception of a computing device is still the concept of a Turing machine T. T consists of a tape and a reading-writing machine. The tape itself is divided in squares and each square may contain a symbol taken from a (usually minimal) alphabet, viz. 0 and 1. The machine itself is described in terms of (a finite number of) states s1, s2, …, sn and (a finite number of) movements, usually moving left (L), moving right (R). Thus a typical program instruction would be <0, si, sj, 1>, i.e. if T reads a 0 on the tape and is in state si, then it changes into a state sj, and writes 1 on the tape. Often a diagrammatic approach is preferred (this obviously bares a connection to the choice of language!) to represent a basic action of T, i.e., the execution of one instruction:
Amazing as it sounds, but nearly the whole of mathematics can be translated in terms of such machines. It is an extremely powerful tool, because of its simplicity and its scope. It allowed Alan Turing to show that a machine that would be able for any other machine to decide whether that machine will or will not stop when given a specific input, cannot exist. Thus there is no universal mechanism or algorithm to decide whether arbitrary problems have or have not solutions.
However, at the same time, it is clear that a Turing machine T is quite independent of any support whatsoever, it is typically a “paper” computer; the physical basis, i.e., how T is implemented, is of no importance (or, at least, assumed to be). More specifically, no issues of (underlying) causality have to be dealt with. Recent advances in computer technology are rapidly changing this picture:
Quantum computing requires the preparation of quantum mechanical systems in very specific states to make the computation possible; in this case the physical system is of primordial importance and the limits of the quantum mechanical system co-determine the computational limits. (Related to this topic is the discussion whether quantum mechanical Turing machines can do “more” than classical Turing machines - still an open question at this moment),
Organic computers require a biological support and here too, the inner workings of, say, a cell need to be well understood in order to get a grip on the computation itself. Biochemical processes are involved, such as DNA-RNA interactions, that determine what the computational possibilities are. Hence the support is primary.
This raises the interesting discussion whether or not one can consider the underlying physical system itself as the algorithm and not necessarily a symbolic representation of that system (comparable to the phenomenon that time is defined presently, not by any artificial man-made object but by a “natural” object, viz. pulsars). It relates to the John Lucas-Roger Penrose discussion whether the human brain can be considered to be a classical Turing machine, or to be seen as such, or none of these.
F.3 Program verification is a separate issue is that is often ignored in actual practice: it concerns the question whether the program or algorithm really does what one expects it to do. Of course, in simple cases there is not much of a problem, but, as soon as complexity reaches a certain level, it becomes a difficult task to prove that the program is correct. Usually the proof verifying the program or the algorithm is more complex than the algorithm itself (often measured in terms of length, how many symbols are needed to write down the algorithm or program and the verification). An example to illustrate this point:
Consider a “simple” program for division for positive numbers, x and y:
((r := x; q := 0); while y r do (r := r – y; q := 1 + q))
Note that the program not only produces q, the quotient, but at the same time r, the remainder. In order to prove that this program does indeed do what we claim it should do, a formal language to talk about programs is necessary. Related to topic 1, a language is required, in this case, we will need statements such as:
P{Q}R (informally: “given the input described by P, after execution of Q, R results”),
x := f (informally: “replace x everywhere by f”).
In addition we will need axioms as we want to prove that the program is correct:
P0{x := f}P, where P is P0 with x replaced everywhere by f,
If P{Q}R and “If R, then S” is provably the case, then P{Q}S (D1)
If P{Q}R and “If S, then P” is provably the case, then S{Q}R (D2)
If P{Q1}R1 and R1{Q2}R, then P{Q1;Q2}R,
If (P and B){S}P then P{while B do S}(not-B and P) (D3)
This is the formal proof of correctness of the above program, illustrating the complexity issue (the two lemmas are merely arithmetical truths that are assumed to be correct; the example is taken from one of the early great “classics” in program verification, viz. C.A.R. Hoare’s 1969
(Note that this topic is of importance to society: there have been court trials to decide whether a (commercially available) program was indeed sufficiently verified or not, as the program malfunctioned and a “guilty” party had to be identified.
F.4 Conflicting data and inconsistencies. To illustrate the problem the title is referring to, consider a database containing data, expressed in terms of statements or properties, represented by p, q, r, … Reasoning with these data involves the use of rules in the format of A B, meaning that, if B (often referred to as the head of the clause) is the case, then A (often referred to as the body of the clause) will result. Consider now the following rather simple case:
q p
r p
(s and t) r
not-p s
p
This set of conditions is inconsistent, since the last clause states that p is a fact, but then q is a fact, r is a fact, but then s and t is fact as well, and hence also s, but then also not-p. How to deal with this situation? It is definitely the case that, because of the size of actual databases and because of the permanent, often “real-time” updating of databases, inconsistencies will and do result. The question then becomes: what to do?
The most common strategy underlying a number of proposals to deal with inconsistencies is to look at maximally consistent subsets. Thus in the above example, it is obvious that by dropping the clause “p “, the problem is solved, although not much can be said now, except that not-p is the case. Therefore, it seems more interesting to delete the fourth clause, because then we can assume (though not necessarily) p, q, r, s and t to be the case. This raises the important issue: how to detect the maximally consistent subsets? Perhaps not entirely unexpectedly, this problem, although in principle solvable in the finite case, is of (at least) NP-complexity (see the next topic for a definition), in short, a “hard” problem.
An alternative is to work with preferences, e.g., defined on the clauses. If we prefer not-p over p, for whatever reasons, then it is obvious that in the example given, we will drop the last clause, because that is compatible with accepting not-p. However, other problems appear: what if the preferences are such that no clause can be satisfied? Then one needs to rearrange the preferences. Of course, one might think about trying out all preferential possibilities, but that problem again is NP-hard.
A next step – but that, to our knowledge has not really been made in computer science today – is to analyse the contradiction itself, to reason with it, and, on the basis of that analysis, to make a choice one way or another. This requires a move that both logicians and computer scientists are not all that willing to make: to abandon or, at least, to adapt classical logic. Without going into too much detail, here is the crux of the matter. In classical logic a contradiction is a disaster, because of the classically valid argument “If both p and not-p are the case, then anything, q, is the case”, the infamous “ex contradictione sequitur quodlibet”. To reject a logical rule, one must find a counterexample to it, in this case, one must find a situation wherein both p and not-p are the case, and, at the same time, not anything is the case, i.e., q can be false. The latter part is easy, enough things are not the case, but the first part requires a curious move: to accept that there are situations wherein a statement and its negation can be both true. However, the reward is phenomenal: one can now reason with contradictions.
F.5 Complexity and randomness is perhaps the most intriguing topic. Looking at problems that are solvable, i.e., there is a program that will execute the program in a finite time, requiring a finite amount of space, and produce an answer, one way or another, it soon became apparent that some problems are easy to solve, some difficult to solve. The imprecise notions “easy” and “difficult” equally soon became translated in formal terms. Any problem is characterised by some parameters k, l, m, n, … If a problem is solvable in time or space expressible as a polynomial of some parameter of the problem, then we have a P-problem. E.g., given n books, to sort these books alphabetically is a P-problem. (Take the first book, that requires one step, take the second book, see if it comes before or after the first book, this involves two steps, take the third book, see if it come to left, to the right or in the middle of the two books already sorted, in short, one needs 1 + 2 + 3 + … + n steps, or, n(n+1)/2 steps, and that is a polynomial). Other problems require an exponential amount of space and time to solve the problem (putting together a schedule for train traffic is such a problem). A special subgroup are those problems for which an exponential solution is known, at the same time it is not known whether a P-solution exists, and, in addition, it is always possible to make a guess for the solution, a guess that can be checked in P-time. Such problems are NP-problems (see the above topic). In fact, at the present moment, there is a quite complicated hierarchy of complexity measures.
However, even more intriguing, is the fact that, given a program, given sets of data, there will always be structures or patterns, present in the data, that the program will not recognise as such. In other words, to the program the structure will appear random, although there is structure present. The core of the argument is this: if a given set of data has a structure or contains some pattern, then it must be possible to describe that pattern, and, obviously, this description must be shorter than a simple enumeration of all the data. What we are doing in a sense is to compress the data. If this compression is to be reliable, then we must able to prove so (compare with topic 3). If we succeeded in proving the compressions in all possible cases, then any data set can be compressed and that runs counter to the idea that there are genuinely random sets. There will therefore always be data sets that contain patterns that will not be detected and hence will appear as random.
Excellent guides in these matters are the works of Gregory J. CHAITIN: Information, Randomness and Incompleteness. Papers on Algorithmic Information Theory. London: World Scientific, 1990; Algorithmic Information Theory. Cambridge: Cambridge University Press, 1992; The Unknowable. Heidelberg: Springer Verlag, 1999. Here one finds the formal counterparts of the informal arguments presented above.
G Using user’s Profiling and Artificial Agents for Stimulating the Knowledge Exchange Process in Virtual Communities
(Thierry Nabeth, Albert A. Angehrn and Pradeep Kumar Mittal, CALT - INSEAD)
In this section, we are going to examine how artificial agents relying on personal behavioural user profiling can be used to stimulate the participation in the knowledge exchange process in virtual communities.
G.1 What are Virtual communities?
G.1.1 Defining the virtual community concept
Howard Rheingold, the person who popularised the term "virtual communities”, has defined this concept as “social aggregations that emerge from the Net when enough people carry on those public discussions long enough, with sufficient human feeling, to form webs of personal relationships in cyberspace". In a later review of literature aiming at understanding why people are joining virtual communities, Ridings and Gefen (2004) have found that the concept of virtual community has been characterised more generally as (1) people with shared interests or goals for whom electronic communication is a primary form of interaction, (2) groups of people who meet regularly to discuss a subject of interest to all members, (3) and groups of people brought together by shared interests or a geographical bond. We will define ourselves a virtual community as an “electronically-supported” social structure holding together a set of people sharing a common culture, set of interests, values, goals or norms.
In the last few years, virtual community has emerged as one of the more important actionable paradigms for supporting the circulation of tacit knowledge in our digital societies.
G.1.2 Virtual community environments
In this context, virtual community environments are the techno-sociological infrastructures that have been specially elaborated to support these virtual communities. Virtual community environments include all the systems, such as forums or bulletin boards, that provide explicitly shared dedicated spaces for supporting the discussions of communities or groups of people. The communication in these spaces can be asynchronous (bulletin board) or real-time synchronous (chatrooms). People interact mainly with others by posting messages in (public or restricted) shared spaces, but can also sometimes communicate directly and more privately with one another. The control of what can be posted in the public spaces (specified in an explicit or implicit code of conduct) can be enforced by a moderator or by some social regulation mechanisms (social pressure, norms, etc.). Recent forms of virtual community systems are enhanced with technical mechanisms and features aiming at better supporting and stimulating the social process, via the visualisation of people activities (with the whole line of research dubbed as social transluscence) or the provision of matching mechanisms helping the forming of groups or the establishment of relationships.
G.2 The challenges of participation
G.2.1 Virtual communities: the participation challenge
One of the main challenges facing designers and operators desiring to build successful virtual communities, is the establishment of a sustainable dynamic of participation amongst its members. Indeed, the essential value of a virtual community resides in the activities of its members and in particular is strongly correlated to their willingness to spend time, to interact with others in conversations, or to provide knowledge assets. The participation of the members of a virtual community in this knowledge exchange process is indeed not spontaneous, but is motivated by a certain number of elements such as: direct rewards, increased reputation, internal satisfaction (altruism and efficacy), or reciprocity.
Understanding the “mechanics” of the functioning of the participation in a knowledge exchange in virtual communities or in groups has been the subject of numerous researches in different fields such as: knowledge management and organisation ; CSCW ; complexity; social computing; sociology and communication, or psycho-sociology , to name a few.
G.2.2 Some directions for addressing the participation challenge
From these theories, we could try to derive a set of principles that could be used to address this participation challenge. One of these principles could consist in working on the establishment for the members of these communities a climate of trust, a sense of community, and a feeling of recognition for the actions of their members. Other principles could be derived from the Social exchange theory, which proposes an economic model of rational choice to explain participation. More concretely, a virtual community designer could set up mechanisms supporting the factors that make people participate, such as: (1) anticipated reciprocity; (2) expected gain in reputation and influence on others; (3) altruism and perception of efficacy; (4) direct reward. Cognition and psycho-sociology also have some theories on influence that could be applied to “orient” people in their desire to contribute to a knowledge exchange process. For instance to stimulate participation one could try to exploit some of the six principles of influence of Robert Cialdini of: (1) reciprocity (felt obligation to “reimburse”); (2) social validation (social conformance); (3) commitment /











