D2.3: Models

Acquisition of the Person’s Information (Profiling)

Profiling of a given user is the process of obtaining the values associated with the different attributes that constitute the user model (note: refer to the work conducted in WP7, especially D7.2, for a more detailed study of profiling).

The different means that can be employed to get the information associated with the different attributes include:

The direct entering of the personal information by the end-user
The extraction of this information by querying existing data sources (such as databases) and/or its capture during different processes (such as the recording of a transaction)
The calculation from existing attributes (simple algorithms or expert systems)
The extraction via the mining of information

Of course, one should be aware that the quality of the information differs, and depends largely on the means that have been employed to get this information. For instance, one can easily imagine that certain information originating from a governmental database is much more reliable than the same information found in a personal web page which the user has entered by her- or himself. In a similar way, we can assume that information that has been calculated or inferred is more prone to error than information that has been entered directly (although the phenomenon of obsolescence of information can mitigate this assertion).

Similarly, the control of this information is strongly correlated with the means that have been employed to collect it. For instance, information that is present in web pages is totally controlled by the users themselves, whereas information that is present in a governmental database is principally controlled by a third party (the government). In the latter case, legislation and the possibility for the end user to make corrections can help to share part of this control. Where the information has been extracted by data mining procedures, it is totally controlled by third parties, and there is almost no possibility of intervention by the end-users (who may simply be totally unaware that their personal information is being exploited).
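The relationship between an attribute value, the means by which it was acquired and the party that controls it can be sketched as a small data structure. The Python below is only an illustration: the names `AcquisitionMeans`, `AttributeValue` and `UserModel`, the controller labels and the sample values are all assumptions made for this example and are not defined in this deliverable.

```python
from dataclasses import dataclass, field
from enum import Enum


class AcquisitionMeans(Enum):
    """The four means of acquisition listed above."""
    DIRECT_ENTRY = "entered directly by the end-user"
    EXTRACTED = "extracted from data sources or processes"
    CALCULATED = "calculated from existing attributes"
    MINED = "extracted via mining"


@dataclass
class AttributeValue:
    value: object
    means: AcquisitionMeans
    controller: str  # hypothetical label for who mainly controls the value


@dataclass
class UserModel:
    attributes: dict = field(default_factory=dict)

    def set(self, name, value, means, controller):
        """Record a value together with how it was obtained and who controls it."""
        self.attributes[name] = AttributeValue(value, means, controller)

    def acquired_by(self, means):
        """Names of the attributes acquired by the given means."""
        return sorted(n for n, a in self.attributes.items() if a.means is means)


# Invented sample values, for illustration only.
model = UserModel()
model.set("name", "Alice", AcquisitionMeans.DIRECT_ENTRY, "user")
model.set("tax_bracket", "B", AcquisitionMeans.EXTRACTED, "third party")
model.set("lifestyle", "urban", AcquisitionMeans.MINED, "third party")
```

Keeping the means of acquisition next to each value makes it possible to reason about reliability and control attribute by attribute, rather than for the user model as a whole.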

Direct entering of the personal data

Direct entering of personal data consists of mechanisms in which the end users are able to explicitly enter their personal information. A typical example is an online system in which the users have to describe themselves by specifying their name, addresses, preferences and other characteristics.

The Type 3 IMS (individual function), presented previously, represents a typical category of systems that employ this method.

The personal data directly entered is largely under the control of the user, who is able to modify this information at any time.

This mode of collection of the personal data appears to better preserve the privacy of the user than the centralised solutions, although it is not without a certain number of limitations. Firstly, this information may not be very reliable or up-to-date, since it relies on the willingness of the users to enter this information and to be honest. Equally, the users may even involuntarily introduce some errors that originate from an incorrect perception of reality (such as rationalisation). Secondly, entering and updating this information can be considered too time-consuming by users, who may not be ready to spend the effort.

The typical attributes that can be captured in this way include name, addresses, pseudonyms, short descriptions (such as a picture) and basic preferences.

The extraction from data sources and from processes

In this case, the values associated with the attributes originate from two different sources: (1) databases; and (2) processes.

In the first case, the databases may be governmental (such as police or tax), human resources databases (enterprise resource planning and knowledge management systems holding, for example, payroll or training information) or health file databases (managed by hospitals or by social security units).

In the second case, the data originates from a series of processes that capture the data (which is then stored in databases). Examples of such processes include e-commerce systems (such as Amazon) and loyalty programs that can capture the history of the transactions associated with each customer, or virtual community systems that can capture the history of the activities of the different members (such as age in the community, and number of postings).
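As an illustration of capture via a process, the following Python sketch derives the two community attributes mentioned above (age in the community and number of postings) from an activity log. The log format, the function name `community_attributes` and the sample members are assumptions made for this example.

```python
from collections import Counter
from datetime import date


def community_attributes(events, joined, today):
    """Derive per-member attributes from a log of (member, action) events."""
    postings = Counter(member for member, action in events if action == "post")
    return {
        member: {
            "days_in_community": (today - joined_on).days,
            "number_of_postings": postings[member],
        }
        for member, joined_on in joined.items()
    }


# Invented activity log and join dates, for illustration only.
events = [("ann", "post"), ("bob", "login"), ("ann", "post"), ("bob", "post")]
joined = {"ann": date(2005, 1, 1), "bob": date(2005, 1, 21)}
profile = community_attributes(events, joined, today=date(2005, 1, 31))
```

The members take no explicit action to populate these attributes; the values are a by-product of using the system.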

The Type 1 IMS (organisational function), presented previously, represents a typical category of systems that employ this method, although it can also be used in the Type 3 IMS (individual function).

The personal data that is present in databases or captured via a set of processes is mostly outside the user’s control (the possibilities of correction by the end user are often limited). This data is also often heavily regulated by legislation specifying the type of data that can be represented and the possible uses of this data, including the combining of databases.

Even if this mode of collection of personal data appears to be more intrusive to people’s privacy, it is not without some advantages, even for the people themselves. First, the data captured via this means can be considered much more reliable, since it directly reflects the activities of people, and not only the perception of these activities. Second, because this data collection is automatic, it can be considered less demanding for the end-users.

The attributes that can be recorded in this way include characteristics that have a certain level of permanence, while other categories of a person’s information can include all the transactions (commercial or not) in which the person has been engaged.

Data calculated and inferred from other attributes

In this case, unknown values associated with particular attributes are calculated from other attributes (typically the ones that have been obtained by the previous two methods). This category is relatively similar to the category previously described; however, it differs in the level of sophistication of the systems that make use of it. Notably, this method is more frequently used in Type 3 IMS (individual function) applications, which use it to provide some level of adaptability (for instance in e-learning systems or e-commerce systems).

The reliability of these calculated attributes is generally lower than that of non-calculated attributes. For instance, in Amazon the assertion “a customer that has bought a book about children is interested in children and is likely to buy other books about children” is only correct on average, since the customer may have bought this book only once, as a present for somebody else.
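The limited reliability of such an inference can be made concrete with a toy rule. In the Python sketch below, the rule "bought a book on a topic, therefore interested in that topic" is checked against a handful of invented customers; the data, the field names and the function `infer_interest` are assumptions made for this illustration and do not describe any real system.

```python
def infer_interest(purchases, topic):
    """Naive inference: having bought a book on a topic implies interest in it."""
    return topic in purchases


# Invented customers; "interested" is the (normally unknown) ground truth.
customers = [
    {"purchases": {"children"}, "interested": True},
    {"purchases": {"children", "cooking"}, "interested": True},
    {"purchases": {"children"}, "interested": False},  # bought once, as a gift
    {"purchases": {"gardening"}, "interested": False},
]

correct = sum(
    infer_interest(c["purchases"], "children") == c["interested"]
    for c in customers
)
accuracy = correct / len(customers)
```

On this sample the rule is right for three customers out of four: it holds on average, but it is wrong for the individual who bought the book as a present.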

The level of control over these calculated attributes is often limited by the simplicity of the algorithm used and by the way it was configured for the calculation. Thus, people who read the value of these attributes usually have, at best, only a vague idea of the underlying principles that have been used. For instance, a calculated attribute could be a level of risk that a bank calculates for a particular client, which results from a combination of the values of attributes such as the gross salary of the person, the assets (such as real estate) that the person may own, the person’s family status, the postal code of the person’s place of residence, or even the person’s ethnic origin. Another application is found on certain e-commerce websites, where the preferences of a customer are determined automatically.
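A calculated attribute of this kind can be sketched as a simple weighted combination. The Python below is a deliberately simplistic illustration: the function `risk_level`, the choice of three inputs, the weights and the caps are all invented for this example and do not reflect any real bank's scoring method.

```python
def risk_level(gross_salary, real_estate_value, dependants):
    """Toy risk score in [0, 1]; higher means riskier. All weights are invented."""
    score = 1.0
    score -= 0.5 * min(gross_salary / 100_000, 1.0)      # higher salary lowers risk
    score -= 0.4 * min(real_estate_value / 500_000, 1.0)  # owned assets lower risk
    score += 0.05 * min(dependants, 2)                    # family status raises it slightly
    return round(max(0.0, min(score, 1.0)), 2)


# The person reading the result sees only a number, not the weights behind it.
client_risk = risk_level(gross_salary=50_000, real_estate_value=0, dependants=2)
```

Even in this small case, the value alone gives no indication of which inputs were used or how they were weighted, which is the point made above about the reader's vague idea of the underlying principles.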

Data extracted via mining the information

The extraction of values via data mining techniques could appear similar to the previous calculation methods. They differ, however, in that the algorithms are applied globally to the data of (very large) groups of people, and not to the data set that is associated with a single person. The algorithms used are also of a more statistical and probabilistic nature, and often rely on the use of heuristics. Finally, these algorithms may also be used to help the creation process of the user model itself, and in particular to help determine the set of attributes required to “summarise” the problem (for instance, in a banking application, an algorithm may determine that knowledge of the age and of the postal code represents sufficient information to discriminate a reliable customer from an unreliable one, with a limited risk of error).
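The idea of letting an algorithm determine which attributes suffice to discriminate between two groups can be sketched with a one-attribute threshold rule searched over a whole population. The Python below is a minimal illustration, not the method of any particular system: the function `best_rule`, the attribute names and the five sample customers are assumptions made for this example, and real data mining would use far larger data sets and more robust statistics.

```python
def best_rule(customers, attributes):
    """Find the single attribute and threshold that best separate the groups.

    Returns (attribute, threshold, number_of_errors).
    """
    best = None
    for attr in attributes:
        for threshold in sorted({c[attr] for c in customers}):
            # Rule: predict "reliable" when the attribute is at or above the threshold.
            errors = sum((c[attr] >= threshold) != c["reliable"] for c in customers)
            # The opposite rule (below the threshold means reliable) errs on the rest.
            errors = min(errors, len(customers) - errors)
            if best is None or errors < best[2]:
                best = (attr, threshold, errors)
    return best


# Invented population, for illustration only.
customers = [
    {"age": 22, "household_size": 3, "reliable": False},
    {"age": 25, "household_size": 1, "reliable": False},
    {"age": 41, "household_size": 2, "reliable": True},
    {"age": 53, "household_size": 1, "reliable": True},
    {"age": 37, "household_size": 4, "reliable": True},
]

rule = best_rule(customers, ["age", "household_size"])
```

On this sample the search selects age with a threshold of 37 and no errors. The rule is learnt from the group as a whole and is then applied to individuals, which is what distinguishes it from the per-person calculations of the previous category.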

Type 2 IMS (profiling function), presented previously, represents a typical category of systems that employ this method.

The types of attributes that are extracted via mining typically include people-related categories such as social categories or lifestyles. These attributes can be considered to be more abstract and less directly associated with the individuals.

At a more micro level, these attributes can represent user characteristics and behaviours that can be automatically extracted from the use of information systems. For instance, in the context of an e-commerce system, such attributes can reflect reliability characteristics (likelihood of fraud) and, in the context of a virtual community, can reflect the level of participation (such as the activity of the people in SourceForge.net).