
D7.2: Descriptive analysis and inventory of profiling practices


 

3. Descriptive analysis of profiling

3.1. Introduction 

 

Automated profiling (whether group or personalised) takes place in the process of: 

 

  1. recording data (taking note of them in a computable manner) 

  2. storing data (in a way that makes them accessible, aggregated in a certain way) 

  3. tracking data (recording and storing over a period of time, linking data to the same data subject)  

  4. identifying patterns and trends in the data (by running algorithms through the data base) and  

  5. monitoring data (checking whether new data fit the pattern or produce outliers).  
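These five steps can be sketched in a few lines of Python; the record fields, subjects and numbers below are invented purely for illustration:

```python
# Minimal sketch of the five profiling steps; all names are illustrative.
from collections import defaultdict

# 1. Recording: events are translated into computable records.
events = [
    {"subject": "s1", "action": "purchase", "amount": 120},
    {"subject": "s1", "action": "purchase", "amount": 130},
    {"subject": "s2", "action": "purchase", "amount": 15},
]

# 2./3. Storing and tracking: records are kept over time, linked per subject.
store = defaultdict(list)
for e in events:
    store[e["subject"]].append(e)

# 4. Identifying patterns: e.g. the average purchase amount per subject.
pattern = {s: sum(e["amount"] for e in recs) / len(recs)
           for s, recs in store.items()}

# 5. Monitoring: does a new record fit the pattern, or is it an outlier?
def is_outlier(subject, amount, tolerance=0.5):
    """Flag a new amount deviating more than `tolerance` from the mean."""
    mean = pattern[subject]
    return abs(amount - mean) > tolerance * mean

print(pattern["s1"])          # 125.0
print(is_outlier("s1", 500))  # True
```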

 

Only facts that are recorded as data ‘count’ as such. This means that even before the process of profiling starts, a translation takes place, converting events, situations and actions into what some like to call ‘raw data’. Precisely because these ‘raw data’ are not the facts (events, situations or actions) they represent, real life can cause serious problems if the translation into computable data is, or becomes, inadequate: input that was originally adequate can prove inadequate because the facts to which it relates have changed. This problem cannot be solved by building up databases in terms of a consistent knowledge representation (ontology). Consistent databases avoid the problem of subsequent translation between different knowledge representation systems that hold the same type of facts in different formats. However, although such a course of action can make data and databases more interoperable, it does not guarantee the actual adequacy of the translation. This problem can only be solved when data are not conflated with the facts they represent, since only then is room created to redefine the facts and/or to recognise that the facts do not (any longer, or not yet) fit the framework of computable data.  

 

In par. 3.2 we will discuss the construction of group profiles, in par. 3.3 we will discuss the construction of personalised profiles. 

 

3.2 Group profiling 

3.2.1 Introduction 

In this paragraph we will analyse the construction of group profiles. As discussed above, a group profile is a set of correlated data that identify and represent a group/category/cluster. To understand in what ways the information society has changed the construction of profiles, as compared with traditional social science research, we shall explore the way profiling based on data mining works. First we will introduce the evolving de facto industry standard CRISP-DM, which will be compared to the semiotic analysis of knowledge discovery in databases (KDD) by Canhoto and Backhouse. In par. 3.2.3 the process of data mining, which is central to group profiling, will be explored in more detail; in par. 3.2.4 the difference between distributive and non-distributive profiles will be described; and in par. 3.2.5 some differences will be discussed between profiles as knowledge constructs and the knowledge produced by means of traditional social science research.

 

3.2.2 KDD (Knowledge Discovery in Databases)  

 

3.2.2.1. Introduction 

Generally speaking, the construction of group profiles in the developing information society is based on computerised searches in large databases containing massive amounts of (often personal) data. This ‘knowledge discovery in databases’ (KDD) can be described as ‘the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data’, and can incorporate a powerful tool known as ‘data mining’. While the term data mining originates from the database research community, the term KDD comes from the artificial intelligence and machine learning community. Sometimes the term ‘data mining’ is used in a broad sense to refer to the overall process of analysing data to discover previously unsuspected relationships that provide the database owners with interesting or valuable information. More correctly, the term KDD is used to refer to this overall process, including the interpretation of the emerging results, while the term ‘data mining’ is used to refer specifically to the step of discovering the patterns and trends in the data.

 

3.2.2.2 Modelling the profiling process – towards a de facto standard 

Although several models have been proposed, the de facto standard is the Cross-Industry Standard Process for Data Mining (CRISP-DM), a non-proprietary and freely available set of guidelines and a methodology developed to help guide the overall process. The methodology was created in conjunction with practitioners and vendors to supply checklists, guidelines, tasks, and objectives for every stage of the process.

 

The CRISP-DM model focuses on six key phases of the overall process, shown in Figure 1. The order of the phases is not strict, in the sense that the results of one phase may show that more effort is required in a previous phase; however, the general links between each phase are shown. The surrounding circle shows that the process itself is in fact a continuous process.  


Figure 1: The phases of the CRISP-DM process model (from the CRISP-DM Process Guide and User Manual)

 

Each of the phases can be described as follows (adapted from the CRISP-DM Process Guide and User Manual):

 

Business understanding 

This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a problem definition and a preliminary plan designed to achieve the objectives. 

 

Data understanding 

The data understanding phase starts with an initial data collection and proceeds with activities in order to become familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information. 

 

Data preparation 

The data preparation phase covers all activities for constructing the final dataset (data that will be fed into the modelling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed many times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools. 

 

 

 

Modelling (Data mining) 

In this phase, various modelling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same problem type. Some techniques have specific requirements for the form of data. Therefore, stepping back to the data preparation phase is often necessary. 

 

Evaluation 

Before proceeding to final deployment of the model, it is important thoroughly to evaluate the model and review the steps executed to construct the model in order to be certain it properly achieves the objectives. A key objective is to determine if there is some important issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the results should be reached. 

 

Deployment 

Creation of the model is generally not the end of the project. Even if the purpose of the model is simply to increase knowledge of the data, the knowledge gained will need to be organised and presented in a way that can be utilised.  

 

Notably, each of these phases has a hierarchical structure, incorporating four further layers of abstraction; see Figure 2, which expands the simple model into a full guide for implementing a given application. However, a detailed analysis of these steps is beyond the scope of this document.

 


Figure 2: Hierarchical structure of the CRISP-DM methodology

 

 

 

3.2.2.3 Semiotic model of KDD 

Canhoto and Backhouse have developed a model based on semiotics and cognitive psychology. They apply this model to profiling within financial institutions that are obligated by law to report transactions suspected to involve money laundering. The profiling they investigate thus produces profiles that may lead to suspicious transaction reports (STRs). An elaboration of this case study can be found in par. 5.3.1 below. Their model distinguishes six levels to analyse the process of KDD:

  1. Step 1: data collection; first level: physical 

  2. Step 2: data preparation; second level: empirical 

  3. Step 3: data mining; third level: syntactical 

  4. Step 4: interpretation; fourth level: semantics 

  5. Step 5: Determine actions; fifth level: pragmatics 

  6. Step 6: The application context; sixth level: social situatedness 

 

 

Step 1: data collection; first level: physical 

The first level is the physical. In the case of STRs this refers to the collection of data that will serve as input for the data mining process that should produce the profiles on which STRs are based.  

 

Collection of data can take place in a deliberate way, by asking people for information, or in less explicit – even illegal – ways. While off-line shopping with cash requires no identification or exchange of personal data, online purchases usually demand some input of personal data that in most cases will be stored and often sold as part of a database. Also, surfing behaviour can be observed and stored by (il)legal spyware that is placed invisibly on websites or local computers without explicit or implicit consent. For example, as one visits certain websites, ‘cookies’ are placed on one’s computer to allow monitoring of movement around that specific site in order to improve the user experience. In more nefarious applications, data on all of one’s on-line behaviour can be remotely stored, processed and used at a later moment. Notably, these cookies or other software are not necessarily related to the website visited; see also section 3.3.2.

So, whilst off-line transactions or other behaviour are often observed without leaving much of a trace, on-line transactions or surfing-habits often produce traces in the form of sets of data that are stored and processed and may be retrieved for entirely different purposes (even if this is illegal). The use of CCTV, embedded sensors, RFID-technologies and the emergence of Ambient Intelligence (AmI) could ultimately make off-line behaviour recordable and traceable in similar fashion.  

What is important at this point is that however massive the amount of data in a database, incorrect and/or incomplete data may impact the construction of profiles by producing false negatives and false positives. This indicates the importance of the first two phases, described in the CRISP-DM model: business understanding and data understanding. 

 

Step 2: data preparation; second level: empirical 

The second level is the empirical. This refers to the processing of the collected data such that aggregated data are available at client level. Collecting and organising data in a consistent and useful way is commonly called warehousing. In the case of anti-money laundering profiling technologies, the purpose is to identify suspicious transactions and, to do this, different transactions have to be linked to the same client, which will enable the identification of certain patterns of behaviour that give rise to suspicion of money laundering.

 

As data are collected they may not be of any use until aggregated. To increase linkability, personal data may be aggregated on the basis of residence, income, life-style, employment, medical history, etc. The preliminary step in this process is the linking of personal data to each other as referring to the same person (even if the identity of this person is not known). In the case of anonymity this will not be possible, and in the case of pseudonymity it will be possible only for data linked to the same pseudonym. This means that attempts to develop identity management devices that enable users to remain anonymous or to use pseudonyms will affect the possibilities for constructing profiles. 
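The linking of transactions at client level can be sketched as follows; the client identifiers, amounts and the reporting threshold below are invented for illustration:

```python
# Sketch: aggregating transactions at client level ("warehousing").
# Field names and figures are illustrative; a pseudonym works like a
# client id, but only links the transactions made under that pseudonym.
from collections import defaultdict

transactions = [
    {"client": "c1", "amount": 9500},
    {"client": "c1", "amount": 9800},
    {"client": "c2", "amount": 40},
    {"client": None, "amount": 9900},   # anonymous: cannot be linked
]

per_client = defaultdict(list)
for t in transactions:
    if t["client"] is not None:         # anonymity blocks aggregation
        per_client[t["client"]].append(t["amount"])

# Aggregated view used for pattern detection, e.g. repeated amounts
# just under a (hypothetical) reporting threshold.
THRESHOLD = 10000
suspicious = {c: amounts for c, amounts in per_client.items()
              if len([a for a in amounts if a > 0.9 * THRESHOLD]) >= 2}
print(sorted(suspicious))  # ['c1']
```

Only linked transactions can exhibit such a pattern; the anonymous record contributes nothing to any client-level profile.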

 

The empirical level of data preparation concerns the phase of data preparation in the CRISP-DM model. At the same time it presumes an effective business and data understanding, which means that the professionals working with the model may often revert back to these phases, to enhance the quality of data preparation.  

 

Step 3: data mining; third level: syntactical 

The third level is syntactic. An analyst will go through the data to find useful patterns (profiles), or to check if an existing profile fits with the aggregated data. This is the phase of modelling or data mining in the CRISP-DM model. Data mining is focused on the automated discovery of patterns in data or sets of data. The simplest pattern to be found is a linear correlation, for example:

A occurs in 93% of the cases that B occurs.

 

In data mining communities the term correlation is used only to refer to linear correlations. In many cases non-linear or curvilinear correlations will emerge during the data mining process, made visible in graphs that show non-linear patterns (curves), which can be translated into non-linear functions between the relevant variables. It should be remembered that a correlation does not imply a causal relation. In some cases the relationship itself may be meaningless because the apparent causal relation depends on one or more other factors. Given such spurious correlations, misinterpretation is not uncommon. Data mining techniques will be extensively dealt with in par. 3.2.3, as data mining is the core of the profiling process.
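A minimal sketch of both kinds of simple pattern, using invented data: the conditional frequency ‘A occurs in x% of the cases that B occurs’, and a linear (Pearson) correlation coefficient:

```python
# Sketch: the simplest mined "pattern" as a conditional frequency
# P(A | B), plus a linear (Pearson) correlation between two variables.
# The records are invented for illustration.

records = [
    {"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 0, "B": 1},
    {"A": 1, "B": 1}, {"A": 0, "B": 0}, {"A": 1, "B": 0},
]

b_cases = [r for r in records if r["B"] == 1]
freq = sum(r["A"] for r in b_cases) / len(b_cases)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

corr = pearson([r["A"] for r in records], [r["B"] for r in records])
print(round(freq, 2))  # 0.75 -> "A occurs in 75% of the cases that B occurs"
print(round(corr, 2))  # 0.25
# Note: a high value of either figure still says nothing about causation.
```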

 

After patterns of correlated data have been discovered, the adequacy of the profile has to be tested by checking whether it indeed detects transactions that raise suspicion of money laundering. This concerns the phase of evaluation in the CRISP-DM model. The problem of automated anti-money laundering profiling technologies is that they usually over-report, necessitating human resources alongside the automated profile to check its results by hand. Changing the profile, however, may result in too many false negatives, which increases the risk to financial institutions of being fined for not detecting suspicious transactions. 

 

Step 4: interpretation; fourth level: semantics 

The fourth level concerns semantics, which means putting the patterns that are identified as an indication of suspicious transactions into context. This can, for instance, be the legal context that determines which transactions are to be reported (in fact, thresholds are specified above which transactions have to be reported). Failing to disclose suspicions has been an imprisonable offence in the UK since 2002, so the legislation offers an incentive to over-report for fear of being sanctioned. 

 

Although it is possible to discover patterns and trends between data, the question is what these patterns actually mean. Even genetic profiles (linking specific genes to specific diseases, for instance) are often based on correlations without any insight into the causal chains that could be involved. Data mining is able to uncover patterns and trends without necessarily revealing the causal relationships behind them. This makes it a powerful tool, perhaps with more potential than the on-line analytical processing (OLAP) approach, discussed in par. 3.2.3.1.

 

Custers claims that ‘when a pattern in data is interesting and certain enough for a user, according to the user’s criteria, this is called knowledge’. He defines patterns as ‘interesting when they are novel (which depends on the user’s knowledge), useful (which depends on the user’s goal), and nontrivial to compute (which depends on the user’s means of discovering patterns, such as the available data and the available people and/or technology to process the data)’. The pattern’s certainty, according to Custers, depends on the integrity of the data and the size of the sample. To decide on the interest and certainty of the correlated data (the profile), the data user will have to evaluate or interpret the profile; this again refers to the evaluative phase of the CRISP-DM model.

 

Step 5: Determine actions; fifth level: pragmatics 

The fifth level concerns pragmatics. At this point Canhoto and Backhouse integrate the theory of cognitive prototyping, which claims that experience leads to the construction of categories that function as a kind of prototype. The importance of such prototypes regarding a client’s profile and anti-money laundering cannot be overestimated. On the one hand, if the prototype is adequate, it will facilitate the identification of STRs; on the other hand, if it is too static or ill-suited, it will hamper identification (producing false positives and false negatives). Canhoto and Backhouse stress the role of professional experience both for the construction of algorithms at the syntactic level and for the interpretation of the profiles that are produced, which may call for further action (reporting a transaction as an STR or not). In terms of the CRISP-DM model we again find ourselves in the evaluation phase, which may demand looping back to earlier phases, as discussed while explaining the model.  

 

After the results of the operations have been scrutinised, the resulting profiles will form the basis for certain actions. Often profiles will be used for selection/access purposes: determining for instance employment opportunities, health risks, insurance risks, targeted advertising, categorisation as potential terrorist or criminal. In terms of the CRISP-DM model we are now in the deployment phase.  

 

Step 6: The application context; sixth level: social situatedness 

The social level concerns the expectations or social norms that influence and determine the actions of the data controller, e.g. what are the implications of different cultural settings within the EU for the way data mining and profiling are practised? Canhoto and Backhouse discuss varying perceptions of corruption in different EU member states, related to different ideas about what is considered legitimate, and they refer to different attitudes to banking secrecy. All these informal social norms influence the extent to which certain transactions are interpreted as suspicious. This social level is of real importance, because it puts profiling technologies in context. For FIDIS it is vital to take a broader view on profiling than just the perspective of technology, as technology alone says little about the actual state of the art in the European Information Society.  

 

3.2.3 Data mining  

 

In this section the key step in KDD, data mining, will be further explored. The exploration of the techniques involved that we propose to undertake in this section can be demanding for readers who have not been initiated into computer science. However, since data mining is a crucial element of the process of profiling, we will attempt some initial clarifications of the techniques involved (further elaboration will take place in the Appendix). As these techniques determine the outcome of the process of profile construction, we advocate some interdisciplinary understanding as a precondition for evaluating the implications of the application of profiles.  

 

Data mining can be defined as data processing that uses sophisticated data search capabilities and statistical algorithms to discover patterns and correlations in large pre-existing databases; a way to discover new meaning in data. Data mining is a multidisciplinary field with strong quantitative roots. Its techniques have been developed mostly by the artificial intelligence community (e.g., machine learning and pattern recognition) and the mathematical community (e.g., statistics and uncertainty processing). In general, there are two approaches to this data mining phase: top-down and bottom-up analysis.

 

3.2.3.1 Top-down analysis 

Data mining techniques may proceed from a deductive process that looks for confirmation of, or indeed departures from, accepted patterns or models of behaviour. In this sense, data mining is done in order to test hypotheses, to fit models to a dataset, or to identify behaviours that depart from the norm. Hence, the goal is to monitor behaviour. This type of data mining comes close to traditional research in the social sciences, because it starts with a hypothesis that is then tested. This approach can be defined as On-Line Analytical Processing (OLAP), a particular decision support tool, since the database is essentially used simply to test the accuracy of a number of hypothetical patterns and relationships (usually manually generated). The approach becomes ineffective when the number of variables in the data is too large, making it too difficult or time-consuming to find a good hypothesis, let alone be sure that a better explanation does not exist.
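A sketch of this hypothesis-testing style of analysis, with invented client records and an invented, manually formed hypothesis:

```python
# Sketch of the top-down (OLAP-style) approach: a manually formed
# hypothesis is simply checked against the database. All data invented.

clients = [
    {"region": "12101", "income": 4000, "bought_insurance": False},
    {"region": "12101", "income": 7000, "bought_insurance": False},
    {"region": "33000", "income": 6000, "bought_insurance": True},
    {"region": "33000", "income": 8000, "bought_insurance": True},
    {"region": "33000", "income": 3000, "bought_insurance": False},
]

def support(hypothesis, data):
    """Fraction of records for which the hypothesis holds."""
    return sum(1 for row in data if hypothesis(row)) / len(data)

# Analyst's hypothesis: "clients with income >= 5000 buy insurance".
hyp = lambda row: (row["income"] < 5000) or row["bought_insurance"]
print(support(hyp, clients))  # 0.8
```

The analyst, not the machine, generates the hypothesis; the database only measures how well it fits, which is why the approach breaks down when the number of candidate hypotheses explodes.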

 

3.2.3.2 Bottom-up analysis: directed and undirected  

This approach differs from OLAP because, rather than verifying hypothetical patterns, it manipulates the data itself to uncover such patterns automatically. Data mining in this form is essentially an inductive process. Bottom-up analysis aims to generate hypotheses that can explain observed behaviour or predict future behaviour. As mentioned before, one of the interesting characteristics of profiling is the fact that the data user is not really interested in explaining behaviour in terms of causes or reasons, but only in the predictive significance of correlations. In general terms, bottom-up analysis can be performed with the support of a domain expert who will suggest which fields / attributes / features are the most informative; this is referred to as the directed bottom-up approach. When the search proceeds without such guidance, it is referred to as the undirected bottom-up approach. It should be clear that directed bottom-up analysis starts from some intuitive or reasoned hypothesis as to possible correlations. In practice, most data mining proceeds to a degree as directed bottom-up analysis (to prevent the proliferation of spurious correlations).
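The difference between the undirected and the directed variant can be sketched as follows; the attributes and records are invented for illustration:

```python
# Sketch of bottom-up analysis: undirected search tries every attribute
# pair; a domain expert directs the search to a few informative fields.
from itertools import combinations

rows = [
    {"smoker": 1, "sporty": 0, "claims": 1, "postcode_ends_odd": 1},
    {"smoker": 1, "sporty": 0, "claims": 1, "postcode_ends_odd": 0},
    {"smoker": 0, "sporty": 1, "claims": 0, "postcode_ends_odd": 1},
    {"smoker": 0, "sporty": 1, "claims": 0, "postcode_ends_odd": 0},
]

def co_occurrence(a, b):
    """Fraction of rows where attributes a and b have the same value."""
    return sum(1 for r in rows if r[a] == r[b]) / len(rows)

# Undirected: every pair is tried, spurious hits included.
undirected = {pair: co_occurrence(*pair)
              for pair in combinations(rows[0], 2)}

# Directed: the expert restricts the search to fields deemed informative.
expert_fields = ["smoker", "claims"]
directed = {pair: co_occurrence(*pair)
            for pair in combinations(expert_fields, 2)}

print(directed[("smoker", "claims")])  # 1.0
```

With only four attributes the undirected search already tests six pairs; with hundreds of attributes the combinatorial explosion, and the flood of spurious correlations, is what makes expert direction attractive.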

 

In any automated KDD process, the basic tool used in data mining is either an algorithm or a heuristic.  

 

3.2.3.3 Algorithms and Heuristics

It seems easy enough to define what an algorithm is. Given an initial state (a set of data, a description, a physical system, …), an algorithm is a procedure to transform the initial state into a desired end-state with certainty (or, if not, at least as close as possible to certainty). Usually the procedure has to be communicable or implementable and therefore an additional element of the definition is that an algorithm is finitely expressible (although consulting an oracle, say Delphi, if absolutely reliable, should be considered an algorithm, even when the oracle is not capable of explaining its powers).

The most important properties of algorithms relevant to a discussion about data mining and related topics are to the philosopher-mathematician’s mind the following: 

  1. The choice of language wherein to express the procedure can have a tremendous effect on the “success” of the algorithm (as is well known in computer science, hence the immense variety of computer languages) because it influences strongly, among other things, the complexity of the algorithm,

  2. The choice of support or carrier for the algorithm. We tend to focus on computer programs as typical instantiations of algorithms, but if one agrees that kitchen recipes also count as algorithms, matters become more complex, for, in a kitchen setting, causality relations do play a part. A computer is, in that sense, a rather atypical object for its causality structure is deemed non-existent.

  3. All too often ignored is the problem of how one can know that the algorithm does what it is designed to do, i.e., program verification. Usually one runs into problems because the verification is much more complex than the program itself. A glance at the Journal for Automated Reasoning will reveal the scale of such problems. Hence the term “desired” in the definition of algorithm is highly problematic: do we have any guarantee that we will recognise the end-state as the “desired” state?

  4. Likewise, although the situation is improving, not enough attention is given to the problem of inconsistency: what to do if the algorithm is internally inconsistent, where the algorithm contains contradictory instructions: “if A is the case, do B”, together with “if A is the case, do not-B” (this, of course, being the bottom case, easy to identify). Additional algorithms are needed to repair algorithms. Standard procedure is often to introduce preferences, but then the question is what the preferences are based on. If, e.g., an Amazon client changes his taste, should the program continue to suggest items according to the client’s old taste or not?

  5. Perhaps less well-known are the intrinsic limitations to finding patterns. Any algorithm, if finitely expressible, will have limits to the complexity it can handle (cf. the work of Gregory Chaitin). This means that although patterns might be present in the initial data, the program will not identify them, as they will appear “random” for the program.

 

In short, all of the above features centre on one basic concept: complexity and how to deal with it. Essentially, an algorithm is a series of (usually) mathematical operations that are performed on a set of data. Notably, each step in the algorithm is ‘blind’ (it needs no additional information), each step follows the previous step ‘blindly’ (no further information is required to determine the next step) and a final result from the process is guaranteed after a finite number of steps. If any of these rules are broken, then the process is referred to as a heuristic, i.e. additional, usually expert, information is employed to decide some part of the process. A heuristic is a form of the directed bottom-up approach.
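The contrast can be illustrated with a small sketch, where the expert filter stands in for the additional expert information that turns an algorithm into a heuristic (the data and the filtering rule are invented):

```python
# Sketch: an algorithm (blind, finite, guaranteed result) versus a
# heuristic, where expert knowledge steers part of the process.

data = [3, 1, 4, 1, 5, 9, 2, 6]

# Algorithm: every step is fixed in advance; termination is guaranteed.
def mean(values):
    total = 0
    for v in values:            # each step is 'blind': no extra input
        total += v
    return total / len(values)

# Heuristic: an expert-supplied rule decides part of the process, here
# which values even count (e.g. "ignore readings above 8 as noise").
def mean_heuristic(values, expert_filter=lambda v: v <= 8):
    kept = [v for v in values if expert_filter(v)]
    return mean(kept)

print(mean(data))            # 3.875
print(mean_heuristic(data))  # about 3.14 (the 9 is filtered out)
```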

 

A wide range of data mining techniques exists based on algorithms and heuristics, however, here we shall focus on a selection of these to highlight some of the key differences in their approach. For further elaboration see Appendix, sections E and F.  

 

3.2.3.4 Symbolic approach 

The result of data mining process is some sort of classification of the data. In the simplest case, the patterns or trends can be represented by a set of rules. The rules can be of the form: 

 

“people having bought the 9th symphony are likely to buy the 8th”

or 

“if the user lives in the region defined by the postal code 12101 or earns less than 5000€ per year, then s/he should not be sent a brochure for insurance” 

 

Note that in the first case the rule aims to describe, while in the second case the rule aims to prescribe. Essentially, a rule is composed of a set of properties that can be true or false, called an antecedent, with the result termed a consequent. Rules that prescribe can be represented by the use of a ‘Decision Tree’. A decision tree representing the brochure for insurance rules above is shown in Figure 3.


Figure 3: A simple Decision Tree

 

The hard rules used in such trees are not ideal: for example, someone with an income of €4999 will be excluded in the above example, while someone earning just €1 more will not. Also, such trees can become very complex, although heuristics can be applied to ‘prune’ the tree. Various algorithms can be used for building decision trees, including CHAID (Chi-squared Automatic Interaction Detection), CART (Classification And Regression Trees), QUEST, and C5.0 (an inductive algorithm descended from ID3 and C4.5).
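The prescriptive insurance rule above can be sketched as a plain function, with each condition corresponding to a node of the tree (the decision logic follows the example rule in the text; the function name is ours):

```python
# Sketch of the insurance-brochure decision tree from the text as a
# plain function: each `if` is a node, each return a leaf (consequent).

def send_brochure(postal_code: str, income: int) -> bool:
    """Prescriptive rule: exclude postal code 12101 and incomes < 5000."""
    if postal_code == "12101":
        return False
    if income < 5000:
        return False
    return True

print(send_brochure("12101", 60000))  # False
print(send_brochure("33000", 4999))   # False (a 'hard' threshold)
print(send_brochure("33000", 5000))   # True
```

The second call shows the hard-threshold problem discussed above: €4999 and €5000 receive different treatment, although nothing meaningful distinguishes them.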

 

Rule induction is a further technique which produces a set of independent rules that are unlikely to fit a tree structure and may not take every case into account.  

 

3.2.3.5 Connectionist approach 

This approach is characterised by the knowledge being contained in weights and nodes. It utilises techniques such as neural networks to fit non-linear classification boundaries to data. Such methods offer a potential benefit over simple rule-based approaches, as they offer a way of efficiently modelling large and complex problems in which there may be hundreds of variables with many interactions between them. Neural networks utilise a training set of data which is used to configure the structure of the network during a learning phase. Following this phase, the network is able to classify new data based on the training set. Care must be taken, however, to avoid ‘overfitting’ the training data, since the network is flexible enough to learn the specifics of the training data rather than generalising from it. Often a validation phase, utilising separate data, is used to monitor overfitting and halt training if overfitting is identified. Notably, the training phase can take a prohibitive amount of time, and the resultant networks are not easily interpreted; that is, there is no explicit reasoning behind the results a neural network may produce.
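A minimal sketch of this training/validation scheme, using a single logistic unit instead of a full network (all data, parameters and thresholds are invented for illustration):

```python
# A single logistic unit trained by gradient descent, with a separate
# validation set used to halt training once validation loss stops
# improving, guarding against overfitting the training data.
import math
import random

random.seed(0)

def make(n):
    """Toy data: label is 1 when x1 + x2 > 1."""
    return [((x1, x2), 1 if x1 + x2 > 1 else 0)
            for x1, x2 in ((random.random(), random.random())
                           for _ in range(n))]

train, valid = make(80), make(20)
w1 = w2 = b = 0.0

def predict(x1, x2):
    return 1 / (1 + math.exp(-(w1 * x1 + w2 * x2 + b)))

def loss(data):
    return -sum(y * math.log(predict(*x))
                + (1 - y) * math.log(1 - predict(*x))
                for x, y in data) / len(data)

best = float("inf")
for epoch in range(300):
    for (x1, x2), y in train:          # one small weight update per example
        err = predict(x1, x2) - y
        w1 -= 0.1 * err * x1
        w2 -= 0.1 * err * x2
        b -= 0.1 * err
    v = loss(valid)
    if v > best - 1e-4:                # validation loss no longer improves
        break                          # -> halt training (early stopping)
    best = v

accuracy = sum((predict(*x) > 0.5) == bool(y) for x, y in valid) / len(valid)
print(accuracy)
```

The learned knowledge sits entirely in `w1`, `w2` and `b`; inspecting those numbers gives no explicit reasoning for any individual classification, which is the interpretability problem noted above.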

 

3.2.4 Distributive and non-distributive profiles  

(A. Vedder, TILT) 

 

In the years to come, group profiling through data mining will become a powerful set of techniques of ever-growing importance. Applying these techniques results in generalisations about groups of persons, rather than about individuals. Regarding these generalisations, we must distinguish between distributive group profiles and non-distributive group profiles.

Distributive profiles assign certain properties to a data or information subject, consisting of a group of persons however defined, in such a way that these properties are actually and unconditionally manifested by all members of that group. Distributive generalisations and profiles are phrased in the form of down-to-earth, matter-of-fact statements. Non-distributive profiles are framed in terms of probabilities, averages and medians, significant deviancies from other groups, etc. They are based on comparisons of members of the group with each other and/or on comparisons of one particular group with other groups. Non-distributive profiles are, therefore, significantly different from distributive profiles. If every member of a certain group has a chance of 30% of dying before the age of 30, the profile describing the group in terms of this chance is distributive. However, if members of a certain group have an average chance of 30% of dying before the age of 30, the profile describing this average is non-distributive. The properties in non-distributive generalisations apply to individuals as members of the reference group, whereas these individuals taken as separate individuals need not in reality exhibit these properties. For instance, an applicant may be refused life insurance on the basis of a non-distributive generalisation of certain health risks of the group (e.g. defined by a postal code) to which he happens to belong, whereas he or she is a clear exception to the average risks of his or her group. In all such cases, the individual is primarily judged and treated on the basis of belonging to a group or category of persons and not on his or her own merits and characteristics.
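The distinction can be sketched as a simple check: a group property is distributive only if every member actually exhibits it (the group members and risk figures below are invented):

```python
# Sketch: a group property is distributive when every member actually
# has it, non-distributive when it only holds in aggregate (e.g. a mean).

group = [
    {"name": "p1", "risk": 0.30},
    {"name": "p2", "risk": 0.30},
    {"name": "p3", "risk": 0.30},
]

other_group = [
    {"name": "q1", "risk": 0.10},
    {"name": "q2", "risk": 0.50},
]

def is_distributive(members, predicate):
    """Distributive: the property holds unconditionally for all members."""
    return all(predicate(m) for m in members)

has_30pct_risk = lambda m: m["risk"] == 0.30
mean_risk = sum(m["risk"] for m in other_group) / len(other_group)

print(is_distributive(group, has_30pct_risk))        # True: distributive
print(mean_risk)                                     # 0.3, and yet ...
print(is_distributive(other_group, has_30pct_risk))  # False: only an average
```

Both groups carry the same 30% figure in their profile, but only for the first group does the figure describe each individual member.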

Distributive generalisations and profiles amount to infringements of (individual) privacy, because the properties of the group are automatically properties of all individual members of that group. Of course, in order to count as an infringement of privacy, additional conditions apply, e.g., that the individuals involved can easily be identified through a combination with other information available to the recipient or through spontaneous recognition. In the case of non-distributive profiles, the profile remains attached to the data subject as constituted by a group. Because the properties included in the generalisation do not apply to individual members of the group in any straightforward sense, it is very hard to understand how they could be infringements of privacy. The information contained in the profile envisages individuals as members of groups; it does not envisage the individuals as such. Supposing for the sake of argument that the profile has been produced in a methodically sound and reliable way, it only tells us some “truth” about individual members of those groups in a very qualified, conditional manner. This means that the information in non-distributive profiles cannot be traced back to individual persons. Therefore, privacy rules, as they are traditionally conceived, do not apply.

This, however, does not mean that non-distributive profiles are morally and legally indifferent. Non-distributive profiles can be problematic from the viewpoint of fairness, distributive justice, equality and non-discrimination.  

Some practical problems stand in the way of simple solutions. These have to do with the non-transparency of group profiling: the groups that are the data subjects of non-distributive profiles can often only be identified by those who defined them for a special purpose, not by those who belong to the group involved, and the use of specific sensitive profiles can be hidden or masked by connecting them to profiles that refer to trivial properties. Finally, one must be well aware that the normal instruments for data protection, which hinge on control over the data by the individuals involved, are not applicable: in the case of group profiles, other individuals will be affected when the profile changes because specific individuals opt out or change input data. The locus of control should not lie only with the individual that changed her data, because the group as a whole is affected. However, neither can the affected groups be the locus of control, because such groups are mostly nothing more than sets of individuals who would be randomly brought together, were it not for the one characteristic that they share.  

 

3.2.5 Automated profiling and traditional social science research methodologies 

 

Traditional social science usually starts with a hypothesis that is then tested by researching a sample of a population. This testing is done by means of surveys and/or participatory observation and/or in-depth interviews. The cost of testing is such that much attention is given to the preparation of both hypothesis and testing. The research aims at explaining situations or phenomena in terms of causes and/or reasons (motives). 

 

Profiling and data mining work from a different perspective, and in a different setting. First of all, the data that are researched have already been recorded in databases; their retrieval does not depend on the memory of witnesses. Since the sixties this has been the case for much social science research, in so far as it collected and recorded data by means of surveys, usually based on statements made by ‘data subjects’. These statements were the input for databases. Five important differences should, however, be pointed out.

 

  1. First, data mining often does not depend, or depends only in part, on data explicitly given by data subjects; instead data are recorded without explicit consent – or even knowledge – of the data subject (in real time by video cameras; online tracking of web-users; offline tracking of supermarket customers or banking clients, etc.). This also means the data do not indicate what people say about themselves but represent what they do.

 

  2. Second, the scope of the ‘sample’ that can be mined is enormous. The results are not extrapolated but taken to cover the entire field (and – hopefully – scrutinised for spurious correlations or low-quality data-input).

 

  3. Third, the low cost of setting up and searching an entire database, in comparison with the databases filled with the results of surveys and interviews, makes possible repeated and even continuous searching. Often data are being collected just because it is possible to do so, without a clear idea (yet) of when or where they will be used. This allows the data controller and the data user to make data subjects seemingly transparent – while the whole process of data collection, storage and processing is usually hardly visible.

 

  4. Fourth, data mining is often used to reveal patterns and trends instead of just testing hypotheses. The techniques employed for validation of the results emerging on interrogation of data originate in the discipline of statistics. However, strictly speaking, data mining differs from traditional statistical methods because data mining relies on the use of software to interrogate the data, whereas in statistics the interrogation of the data is done by the researcher who makes all the decisions at each step of the inquiry.

 

  5. Fifth, the results that emerge are not taken as proof of causal or motivational links. In fact, the purpose of data mining is not so much the construction of true knowledge for its own sake, but assessment of risks and opportunities in the future on the basis of patterns in past behaviour. The meaning of the correlations is not sought but created, by acting on them. Even in the case of genetic profiling, the correlations between genotype and phenotype are used to promote genetic testing, without any knowledge of the causal links between the two.

 

Evidently, the phrasing of the questions – and the algorithms used to locate correlations – influence the findings. Data users or even data subjects may think that because a computer did the job, it must be right. This is not the case. As mentioned earlier, intuition and professional experience play a crucial role, which also impacts the interoperability of profiling technologies within and between organisations and national jurisdictions.  

 

 

3.3 Personalised profiling  

3.3.1. Introduction  

In this section, we will analyse personalised profiling, which differs from group profiling in that it focuses on the identification and representation of an individual data subject. As discussed in chapter 2, a personalised profile is a set of correlated data that identifies and represents a single person. If data mining techniques are used to construct personalised profiles, only those data are searched that concern one specific data subject, for instance the DNA structure of a specific suspect or the data of one specific web user. In the following paragraphs we will discuss user modelling (and user-adaptive applications) and biometrics as two types of personalised profiling.  

 

 

3.3.2 User modelling and user adaptive applications 

3.3.2.1 Introduction 

(Thierry Nabeth, INSEAD; Simone van der Hof, TILT) 

 

Compared to group profiling, which is mostly based on stochastic approaches (KDD, machine learning or data-mining, see par. 3.2.3), user modelling is mainly characterised by knowledge-based, cognitive and more people-centric approaches (knowledge representation, user modelling, reasoning …). Personalised profiling in the sense of user modelling is principally concerned with the discovery of the individual characteristics of a particular user (rather than the characteristics of an abstracted user, as in the case of group profiling) and it covers all the approaches that can be used to help in the construction of the user model of a particular user. 

User modelling can be used in applications and services that need to be informed about the user’s characteristics in order to provide a personalised (or user-adaptive) interaction. Examples of such applications include personalised e-learning systems, which can take into account the previous experience of the user or her learning style to select the learning material that is best adapted to that user; information retrieval systems, which can use information on the user’s preferences and interests to filter information; adaptive AmI applications, which may be able to take into account the disabilities of a user and automatically select the most appropriate mode of interaction; or e-commerce applications, which can use information to suggest to customers products that they are more likely to buy. 

It is important to note that few of these adaptive applications have reached the commercial stage, although Amazon, the online bookstore, has used personalisation to enhance the user’s experience. Many of these applications still only exist in the form of prototypes in computer labs. Personalised and adaptive systems nevertheless represent an important strand of research for the design of intelligent computer-based applications. In association with user modelling, this research aims at enhancing the quality of the interaction, and therefore its effectiveness, by taking into account the specificity of the user, such as her cognitive style or competence, as well as her context of activity, for instance the current tasks in which she is engaged or the organisational context. This can be achieved by:

  1. filtering out irrelevant information (reducing cognitive load) and delivering relevant information at the right time (just in time);  

  2. choosing a form of delivery that maximises its impact on this user (taking into account the cognitive style of the user); or  

  3. proposing highly contextualised help (the system is aware of the task in which the user is currently engaged).  

Research on adaptive systems has been conducted for applications in a number of domains such as e-learning, e-commerce or knowledge management.

 

Personalised (user-adaptive) applications generally rely on a user model that represents the characteristics of the user, such as name, preferences, location, etc. (all the personal information that can potentially be used to personalise the interaction), and each user is individually represented in the information system by a specific instance of this model. The technologies that are used to represent this user model are often proprietary, although some standardisation efforts can be observed in certain application domains (for instance in e-learning and Human Resources; see FIDIS deliverable 2.3 on models) so as to facilitate systems interoperability and reuse. Of particular interest is the use of ontologies for user modelling, given their capacity to represent and manipulate complex user models.

 

The acquisition of user information (i.e. the construction of the profile associated with each user), which is a critical element for the effectiveness of personalised applications, often represents an important challenge. Indeed, obtaining this information directly by asking the user has many limitations, because it is very inconvenient for the end-user and not very reliable. 

Different options exist to build (or acquire) this personal information: 

  1. The direct input of personal information by the end user (via electronic forms), as just mentioned.

  2. The extraction from databases. In this case, the personal data originates from existing databases.

  3. The capture of the user’s activities. The different actions and transactions of the user are recorded to be used later for building the user profile.

  4. The inference of this information from other user information. The value of some attributes of the user can be calculated (by an algorithm) or inferred (by intelligent reasoning using heuristics or other means) from the value of other attributes (acquired using the other methods).

  5. The use of data mining techniques. This latter method refers to the approach described previously.

 

As indicated previously, the direct entry of this information by the end user (for example coordinates or preferences) is the simplest method, but it is an option that can only be used in moderation: people quickly get bored if they have to enter or update too much information, resulting in poor quality profiles (incomplete or obsolete). People may also be afraid of being asked to disclose too much of their personal information, or simply dislike divulging information about themselves (because of the frustration it may provoke in some cases). People can also make mistakes unintentionally (originating from simple errors or cognitive bias), or lie in order to fool the system, to gain some advantage or to protect their privacy. Finally, in some cases the required frequency of updates would be too high, or too disruptive to the tasks being performed, to be left to the users; for instance, in the case of mobile AmI applications, asking users to input their current location would be considered too inconvenient.

Extraction from databases (governmental, enterprise resource planning, training systems, or others) obviously depends on the existence of these databases (only partial user information is stored in them), but also on the permission one has to access these databases and exploit their content. Often one of the main barriers to using databases to obtain user information is privacy, which places limits on their use (purpose limitation, cross-matching of data, etc.).

The extraction of personal information from the capture of user activity relies on recording the actions of the users. Examples of processes that record such information include e-commerce systems (such as Amazon) and loyalty programmes that capture the history of the different transactions associated with each customer, or virtual community systems that can capture the history of activities of the different members (such as age in the community and number of postings). This activity is recorded in databases or in various log files.

Some user information is profiled via inferences performed on other existing information (such as that obtained by the previous methods). In this case, the inferred values result from different methods of calculation, such as algorithms, heuristics or rules. Examples of information that can be profiled in this way include the risk assessment of a customer by a banker, or the automatic determination of some of a user’s preferences. For instance, Crabtree and Soltysiak (1998) use this method to extract users’ interests automatically, in an unobtrusive manner, by monitoring the various office automation systems that they use.

Group profiling, already described in a previous chapter, can also be used to help determine the value of some personal user attributes. This approach relies on more global analysis and on data mining or machine learning techniques. Pohl (1997), for example, has investigated an approach that uses machine learning techniques to help in the construction of behaviour-oriented user models. Shearin and Lieberman (2001) have used case-based reasoning methods to learn about user preferences in the domain of rental property by observing the user’s criticisms of apartment features.

 

After personalised profiles have been constructed, the aim will be to tailor products and services to the wishes and needs of individual customers, based on the results obtained during the data analysis stage. This can be done, for instance, by supplying customers with tailor-made information, such as news, weather and sports reports, or by sending them advertisements specifically tailored for them.

 

Various techniques can be used to make offers to existing and potential customers. For instance, in the case of recommendation systems, a distinction is often made between content-based filtering techniques and collaborative filtering. Sometimes manual rules are distinguished as well. When existing profiles indicate how a certain user values certain products or services, content-based filtering can be used to predict how this user will value new, yet similar, products or services. Depending on the value the user is predicted to place on these products or services, they can then be offered. The most important drawback of this method is that new products or services that do not fit within a customer’s current profile are filtered out. Potentially, a situation of overspecialisation can arise.
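A minimal sketch of content-based filtering, under the assumption that products and user profiles are represented as feature vectors (the item names and genre weights below are invented for illustration):

```python
from math import sqrt

# Hypothetical feature vectors (e.g. genre weights) for new products, and a
# user profile aggregated from items the user has already valued highly.
new_items = {
    "new_thriller": [1.0, 0.2, 0.0],
    "new_cookbook": [0.0, 0.1, 1.0],
}
user_profile = [0.9, 0.3, 0.1]  # this user has mostly valued thrillers

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Predict how the user will value each new item and offer the best match.
scores = {name: cosine(vec, user_profile) for name, vec in new_items.items()}
offer = max(scores, key=scores.get)
print(offer)  # the thriller is offered; the dissimilar cookbook is filtered out
```

The overspecialisation drawback is visible here: the cookbook, however good, will never be offered as long as it does not resemble the current profile.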

 

Collaborative filtering offers a potential solution to this problem. The idea is that if two users have the same interests and the first one is interested in product x, then this product can also be offered to the second user. An example of this approach is the ‘customers who bought’ feature of Amazon.com. A key characteristic of collaborative filtering is that users have to rate the products offered to them; therefore, some input from customers is required. If this information is available, ‘nearest neighbour’ algorithms can be applied to try to detect overlapping interests between users. Collaborative filtering also has some drawbacks, one of which is scalability: as the number of users, products and services increases, the use of ‘nearest neighbour’ algorithms becomes more and more laborious. A second problem is that a large number of products or services, combined with a reluctance on the part of users to rate them, can lead to a situation in which making offers becomes difficult, since products that have not yet been rated by users are not used in new offerings. Finally, collaborative filtering does not take into account the content of products and services, since only the value that users place on them matters. Various techniques exist with which these drawbacks can, at least partially, be addressed.
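The nearest-neighbour idea can be sketched as follows (the users, products and ratings are invented; real systems use far larger rating matrices and more sophisticated similarity measures):

```python
# Hypothetical ratings: for each user, the value placed on each product.
ratings = {
    "alice": {"x": 5, "y": 4},
    "bob":   {"x": 5, "y": 4, "z": 5},
    "carol": {"x": 1, "y": 2, "z": 1},
}

def similarity(u, v):
    """Higher is more alike: negative mean absolute rating difference
    over the products both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return float("-inf")
    return -sum(abs(ratings[u][p] - ratings[v][p]) for p in common) / len(common)

def recommend(user):
    # Find the nearest neighbour, then offer products that the neighbour
    # has rated but the target user has not yet seen.
    neighbour = max((v for v in ratings if v != user),
                    key=lambda v: similarity(user, v))
    return [p for p in ratings[neighbour] if p not in ratings[user]]

print(recommend("alice"))  # bob rates like alice, so product "z" is offered
```

Note that the content of product "z" plays no role in the offer; only the ratings of the neighbour do, which is exactly the limitation described above.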

 

Hereunder follows a discussion of two examples of user modelling, concerning web users and virtual community environments. 

 

3.3.2.2 Profiling of web-users  

(Emmanuel Benoist, VIP) 

 

Since the creation of the Internet, people have tried to get to know the rather anonymous visitors of web sites better. At first, they used log files, which trace all the requests a server has received, to obtain statistics on the visitors. The evolution of the Internet then made it possible to follow the movements of a single visitor on a web site, using the technology of cookies. A cookie is a small piece of information sent by the server to the client. Cookies are usually used to store a session ID. Such IDs are often used to grant access to a web site: the user with session ID 1234 has given a valid pair of username and password and can therefore access all subsequent pages without retyping this information. Session IDs are also often used to create a virtual basket: the user browses the web site in no particular order, and the site uses the session ID to keep track of all the purchased goods. 

Such cookies are also often used by web sites for statistical and profiling purposes. Even without knowing it, users are tracked, and web sites not only know what they have bought, they also know which pages they have visited. 

Cookies can also be used to monitor the behaviour of a user across more than one visit. This is useful for remembering the preferences of a user (e.g. preferred language or preferred default page). Such cookies are used to link together sessions that belong to the same person, or at least the same virtual person. Cookies are only sent back to the originating server: a client may be known as userID=1234 on website1 and as userID=5678 on website2. It may be very useful to merge the information coming from these two sources. In order to do this, a third web site is needed to merge the information. The two original web sites insert on all their pages images such as: 

<img src="http://site3.com/images/blank.gif?userid=5678&website=website2">

The image is often a 1x1 pixel that cannot be seen. It allows the creation of a third party cookie that can be linked to both user IDs. Using this information it is possible to construct a network of all the web sites visited by a given (virtual) person. 
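The linking mechanism can be sketched in a few lines (a simplified model of the third party's bookkeeping; the cookie value and user IDs are illustrative):

```python
# Each request for the invisible image reaches the third-party server with
# (a) the third-party cookie identifying the browser and (b) the site-local
# userid passed in the image URL's query string.
linked_ids = {}  # third-party cookie value -> {site: local userid}

def pixel_request(third_party_cookie, site, local_userid):
    linked_ids.setdefault(third_party_cookie, {})[site] = local_userid

# The same browser (third-party cookie "tp-42") loads the image on both sites.
pixel_request("tp-42", "website1", "1234")
pixel_request("tp-42", "website2", "5678")

print(linked_ids["tp-42"])  # both site-local identities now point to one browser
```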

The user can take countermeasures to prevent such abuses, but unfortunately the degree of awareness of the common consumer concerning privacy hazards on the Internet is almost zero. Users can deactivate all cookies; unfortunately, this blocks a lot of web sites and does not prevent sites that really want to follow a session from doing so, since workarounds exist (userIDs in the URLs, IP tracking or even a “fake” DNS entry for each user). The user should nevertheless prevent cookies from persisting once a session finishes. It is possible to accept only session cookies, which protects the user from being followed over a long period. Third-party cookies cannot be used for anything of substantive interest to the user; they should therefore always be blocked. 

Recently, these techniques have also been used in e-mails. Since e-mails can be written in HTML, it is possible to include images (visible, like banners, or invisible, like a hidden pixel). Some marketers mark these images with a userID; an e-mail address is validated once it has requested the included image.  

See further elaboration in the Appendix, section D. 

 

3.3.2.3 Profiling of users in virtual communities environments 

(Thierry Nabeth INSEAD)

 

In this section, we are going to examine how behavioural personal user profiling and artificial agents can be used to stimulate the knowledge exchange process in virtual communities. A more elaborated presentation of this approach, as well as a set of references, is provided in the Appendix, section G: “Using user’s Profiling and Artificial Agents for Stimulating the Knowledge Exchange Process in Virtual Communities”. 

 

Virtual communities: the participation challenge 

One of the main challenges facing designers and operators desiring to build successful virtual communities is the establishment of a sustainable dynamic of participation amongst their members. Indeed, the essential value of a virtual community resides in the activities of its members and in particular is strongly correlated with their willingness to spend time, to interact with others in conversations, or to make knowledge available. The participation of the members of a virtual community in this knowledge exchange process is indeed not spontaneous, but is motivated by a certain number of elements and factors such as: expectation of reward (direct reward, increased reputation), personal satisfaction (altruism, efficacy, friendship), obligations originating in the desire to reciprocate, social imitation, commitment and consistency, etc. 

These “mechanics” of the dynamics of knowledge exchange in communities and groups have been the object of much research in different fields of study, such as knowledge management, computer supported collaborative work (CSCW), complexity and sociology, to name but a few. This research has tried, but never totally succeeded, to understand them in order to derive principles that would allow sustainable and effective knowledge-sharing virtual communities to be created quasi-deterministically. 

 

Using agents aware of the users’ behavioural characteristics to stimulate participation 

In this paragraph, we would like to present an approach in which personal behavioural profiling represents a central element for the creation of an intelligent application that could be used for stimulating participation in virtual community environments. 

The main principle of this approach consists in the use of artificial agents that are aware of the behavioural profile of the members, and that intervene proactively using this information to stimulate member participation. In effect, this approach relies on two components: (1) the automatic construction (using a set of heuristics) of a behavioural profile of each member related to his knowledge exchange activity; and (2) the generation of the agent interventions that are most likely to stimulate the participation of a particular member. The selection of the most effective interventions is based on the behavioural characteristics of the member. 

The construction of this profile results from the observation of the actions of the user and the application of a set of heuristics helping to determine the participatory profile. The different actions that are captured and that intervene in the determination of the participation profile include events such as entering digital spaces, posting files, posting messages on bulletin boards, answering messages, etc. The different behavioural patterns into which a particular user can be categorised include the level of involvement (is he often present?) and the nature of his contributions (Is he only a lurker? Is he a contributor of knowledge assets? Does he participate in the discussions? Does he initiate discussions? etc.). Examples of heuristic rules include: a user that has not connected to the system in the last month can be considered inactive; a user that posts in discussions at least once a week is committed to exchanging his knowledge; a user that has posted at least one document in the last three months is an active knowledge contributor. 
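Such heuristic rules lend themselves to a direct encoding; the sketch below (with thresholds matching the example rules just given, and invented label names) assigns behavioural labels from simple activity counters:

```python
# Hypothetical encoding of the heuristic rules described above.
def participation_profile(days_since_last_login, posts_per_week,
                          docs_last_three_months):
    labels = []
    if days_since_last_login > 30:          # no connection in the last month
        labels.append("inactive")
    if posts_per_week >= 1:                 # posts in discussions weekly
        labels.append("committed discussant")
    if docs_last_three_months >= 1:         # posted a document recently
        labels.append("active knowledge contributor")
    return labels or ["lurker"]

print(participation_profile(2, 3, 1))
# ['committed discussant', 'active knowledge contributor']
print(participation_profile(45, 0, 0))  # ['inactive']
```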

 

The importance of personal behavioural profiling

The effectiveness of these different agent interventions depends greatly on taking into account the behavioural characteristics of the user, since this allows an agent to select the intervention that is likely to have the most impact on the user. Intervening in a way that ignores the current nature and level of participation would certainly lead to a poor result. For instance, it would be pointless to invite a member of a community to share knowledge assets with others if this member has shown very little readiness in the past to participate in an interaction. On the other hand, it may be useful simply to inform this same member of the benefit people get from interacting more with others. Similarly, understanding the member in terms of his collaboration style (is this member more a “network” person?) or his current attitudinal state (is he busy?) can help to avoid the selection of an intervention that would be considered a nuisance by the member. 

In the approach previously described, personal behavioural profiling can be used as an essential component in the design of a radically new category of (more intelligent) applications. In particular, the success of this category of applications relies heavily on our capacity to observe the user and to extract a relevant behavioural profile. 

 

3.3.3 Biometric profiling 

(Angelos Yannopoulos and Vasiliki Andronikou, ICCS) 

 

In recent years biometric profiling has become a heavily researched and very broadly applied technology. Examples include the retinal scans of older science fiction movies, which can now be applied at surprisingly low cost in a multitude of realistic scenarios, or the ‘fingerprint mouse’ that can identify a computer user in a way similar to what was once ‘high-tech’ police technology. Biometrics can be divided into two major categories: physiological (or passive) and behavioural (or active) biometrics. The former refer to fixed or stable human characteristics and individual attributes such as face image, fingerprint, hand geometry, iris pattern and others, whereas behavioural biometrics are based on measurements of characteristics represented by the skills, actions or functions performed by an individual at a specific time for a specific reason: for example, a person’s signature or keystroke dynamics. Behavioural biometrics are less common than physiological biometrics, but they are still often used, and they may be harder to regulate and to manage because of their fleeting nature.

 

A fully designed system might be presented as a simple recognition task where little additional variety could be expected: for instance, measure the timing of a user’s typing and match the statistics of a remote logon against statistics collected at registration time in order to achieve verification. In fact, however, behavioural biometric profiling can be considered along a number of (fairly) independent “axes”: sets of activities or requirements whose combination shapes the overall application being considered.

  1. Type of measurement being made 

    1. Typing patterns 

    2. Mouse movements 

    3. Web navigation 

  2. Access level in order to measure the person’s activity 

    1. Application based on custom hardware 

    2. Full software access to a person’s computer 

    3. Limited software access to a person’s computer 

    4. Observation of the behaviour of a person’s computer as an indication of the behaviour of the person him/herself 

  3. Identification task 

    1. Identification of unknown person 

    2. Verification that a person is who he/she claims to be 

  4. Technical method used 

    1. Standard statistic methods for pattern recognition 

    2. AI pattern recognition techniques (e.g. neural networks) 

    3. Knowledge technologies combined with a pattern recognition method 

 

Choosing the trait to measure may seem to be the main inspiration behind the method, while the rest is just a matter of technical implementation; but this is not so. For instance, if attempting to verify user identity remotely (e.g. at a server when a user has logged on), the kinds of patterns that can be measured are different: the client-side application almost always buffers data and sends packets at specific points during an interaction (e.g. when a user presses a “submit” button, rather than sending each character as the user types). Thus, an analysis of the access level available for making measurements may be the prime driver of innovation, requiring the definition of a reasonably coherent composite behavioural trait that both makes sense to try to recognise and can be measured correctly. For instance, we could imagine measuring web navigation patterns, if using hypertext with little information per page, many links and rapid traversal of the information space by the user through a fast connection. Clearly, an important issue is to ensure that adequate data can be collected both for initial profiling (training) and for subsequent recognition. 
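For the keystroke-timing case mentioned above, a verification step could be sketched as follows (a crude statistical test with invented timings and an invented tolerance; real systems use far richer features and classifiers):

```python
from statistics import mean, stdev

# Enrolment: summarise the inter-keystroke intervals (in ms) collected
# at registration time into a simple statistical template.
def enrol(intervals_ms):
    return {"mean": mean(intervals_ms), "stdev": stdev(intervals_ms)}

# Verification: accept the logon sample if its mean interval lies within
# `tolerance` enrolment standard deviations of the enrolled mean.
def verify(template, sample_ms, tolerance=2.0):
    z = abs(mean(sample_ms) - template["mean"]) / template["stdev"]
    return z <= tolerance

template = enrol([120, 135, 128, 140, 125, 133])
print(verify(template, [122, 130, 138, 127]))  # similar rhythm -> True
print(verify(template, [60, 65, 58, 70]))      # much faster typist -> False
```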

 

There is also a huge difference between the tasks of identification and verification. In the latter case, a person claims to have a certain identity, for which data has already been collected; comparing newly collected data to the specific profile for the given user is a relatively easy task. In comparison, compiling a database that describes a multitude of people and then attempting to find which of these is best matched by an unqualified measurement is a much harder task. In this latter case, the data collected must be far greater in amount and highly accurate. This again reflects on the kinds of applications that are possible. We might be able to recognise a totally unknown user, e.g. on a corporate or university network where we have unlimited access and can monitor the user’s typing continually throughout the session. However, we cannot identify a totally unknown person just from a login sequence (e.g. if the person is using a pseudonym and we want to match against a database of “real” people). 

 

Finally, the technical tools used in each case need to be customised to the kind of behaviour addressed. If we take the measurements simply to be series of numbers, any range of pattern recognition methods can be tried. However, the raw measurements are not necessarily real “behavioural” measurements: for example, knowing the on-screen location of the mouse pointer for each millisecond during a user’s session is not a direct reflection of behaviour; it might, instead, be necessary to model the application domain and estimate indices from the raw measurements, such as jerkiness of the motion, speed of reaction to various stimuli, etc. Of course, this example would require a high degree of access to the person’s computer, while a similar knowledge-based example might be constructed without requiring such access, e.g. by finding elements of real behavioural patterns of web browsing before using pattern recognition to identify a full browsing session as originating from one user or another. 

Face recognition is a common example of a physiological biometric; it dates back to the 1960s and is beginning to be applied in a variety of domains, predominantly security. This technology can be applied in a non-intrusive manner: it allows for human identification in a passive way, without the person’s knowledge or cooperation, since a person’s face is easily captured by video technology. Its applications can be found in many areas, such as commerce (videophone, teleconference, entertainment, film processing, etc.), industry, security and law enforcement. Typically, facial recognition compares a person’s image with a stored template, either in real time or off-line, for either identification or verification purposes. However, face recognition can also assist in the construction of a profile of an individual’s movements, which can be used for security purposes with the risk, however, of invading the individual’s privacy. What is more, the data collected by this procedure could be combined with other personal information (such as the person’s ID) to enrich the person’s constructed profile and provide a broader and deeper view of the person’s private life.

In the Appendix, section C, face recognition and key-stroke dynamics will be further elaborated as examples of biometric profiling.  

 

3.4 Profiling and interoperability 

 

Personalised profiling involves the linkability of different data to one and the same subject. Insofar as this subject can be identified, extensive personalised profiles may be constructed that affect the privacy of the data subject. Several tools have been suggested to limit the linkability of data, especially so-called identity management systems (IMS), which enable a user to control access to his or her data. We refer to FIDIS deliverable 3.1 for an overview of such systems and the devices designed for these purposes. In this paragraph we briefly investigate the problems related to the interoperability of such IMS, an issue that will be further elaborated in FIDIS deliverable 4.2.  

 

Interoperability of identity management systems is for the moment an idea rather than a reality. Data contained within an information system that serve to identify, authenticate or merely verify a person’s identity are bound to that system in a number of ways. In technical terms, there are many problems in exchanging data across systems that spring from format issues, protocol issues and issues of technical standards. The many proprietary systems that currently constitute the range of offerings in the IMS field were not designed to share identity information, and there are as yet few common standards. Some headway is being made in the e-government area in the shape of the Lisbon Agenda and eEurope 2005, but it is early days yet.
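The format problem can be made concrete with a small sketch. The attribute names below are hypothetical stand-ins for two proprietary identity schemas; the point is that each exchange between systems requires an explicit mapping, and any attribute the target schema does not know about is silently lost.

```python
from typing import Dict

# Hypothetical mapping from one system's attribute names to another's.
# Real IMS products each use their own, often incompatible, vocabularies.
FIELD_MAP: Dict[str, str] = {
    "givenName": "first_name",
    "sn": "last_name",
    "mail": "email",
}

def translate_record(record: Dict[str, str],
                     field_map: Dict[str, str] = FIELD_MAP) -> Dict[str, str]:
    """Translate an identity record from a source schema to a target
    schema; attributes without a mapping are dropped, not carried over."""
    return {field_map[k]: v for k, v in record.items() if k in field_map}
```

A pair of such mappings is needed for every pair of systems unless all parties agree on a common intermediate standard, which is exactly what is largely missing today.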

Above the technical level lies the legal and policy level, which enshrines the checks and balances written into operational systems by different legislatures and regulators. In Europe, the right to privacy is set out in Article 8 of the European Convention on Human Rights and Fundamental Freedoms, and owners and operators of systems that process personal data must have regard to data protection principles before releasing key data to third-party systems.  

On the social and business fronts, and at the application level itself, while there may be some easy wins in the e-government and e-health areas as far as interoperability of identity goes, it is difficult to see the way forward in e-commerce, where merchants, card issuers and acquirers are driven by strong competitive urges. We have already seen how bank cards, both credit and debit, are used as make-do identity cards on many occasions, and yet users are forced to amass walletfuls of plastic even though the identity data on each card are largely the same for a given consumer. 

Some expect that once such barriers to interoperation are overcome, there will be real benefits in being able to share profiles across databases. In an increasingly technological world, profiling seems the only feasible road to making the best use of masses of data to hone products and services for specific types of users, consumers or patients. It would be advantageous to be able to narrow down packages of medical or educational benefits to the precise community that needs them, say for remedial help or for pre-emptive therapy or treatment. But overcoming the barriers will take some effort, if PKI experience is anything to go by, and, as will be further developed in subsequent deliverables, sharing profiles across databases raises a number of security and privacy issues, considering also the data protection framework, which for instance prohibits the use of personal data for purposes other than those specified at the moment of collection.
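The purpose-limitation principle mentioned above lends itself to a simple sketch. The example is a hypothetical illustration of the legal rule, not an implementation of any particular data protection framework: each data item records the purposes specified at collection, and any use for another purpose is refused.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class PersonalData:
    subject: str
    value: str
    # Purposes specified at the moment of collection.
    collection_purposes: Set[str] = field(default_factory=set)

def may_use(data: PersonalData, purpose: str) -> bool:
    """Purpose limitation: personal data may only be used for purposes
    that were specified when the data were collected."""
    return purpose in data.collection_purposes
```

In a profile-sharing scenario, such a check would have to travel with the data across databases, which is part of what makes interoperable profiling legally difficult.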

 

 
