IOS Press Ontology-based User Proﬁle Learning from ...¬ned 0 (0) 1 1 IOS Press Ontology-based...

Undefined 0 (0) 1 1IOS Press

Ontology-based User Profile Learning fromheterogeneous Web Resources in a Big DataContextHoppe Anett a, Roxin Ana a, Nicolle Christophe a

a CheckSem Research Group, Laboratoire Electronique, Informatique et Image (LE2I), Université de Bourgogne,BP 47870, 21078 Dijon, France, E-mail: [email protected]

Abstract. With the emergence of real-time distribution of online advertising space (“real-time bidding“), user profiling fromtraces left by online navigation reaches a new importance. The ability to distinguish user interests based on implicit informationas it is contained in navigation logs, enables online advertisers to target customers without interfering with their activities. Currenttechniques apply traditional methods as statistics and machine learning, but also suffer from their limitations. As an answer, theMindMinings research project aims to develop and evaluate a semantic-based profiling system for improvement purpose.

Keywords: Ontology, Big Data, User Profiling, Data Analysis

1. Introduction

The online world has become one of the most impor-tant advertising panels. But, in contrast to traditionalmedia, it does not only allow advertisers to put theirmessage out there: The spectrum of the underlyingtechnology enables to target a single user individually.The Web makes, indeed, a big difference: Whereas no-body may know which pages of the morning news-paper are actually read, similar traces are constantlylogged and stored when customers surf the internet.The exact pages visited, at what time, for how long,with what device – all those information are automat-ically available to page owners. Furthermore, cookies,little files left on the hard disk of a page visitor allowweb servers to recognize him/her on the next visit andto connect the stored information over time.

There is a fundamental research effort to use thesetraces for the improvement of services. Personalisationis a central concept to research ambitions in e.g. in-formation retrieval, personal information managementand recommendation. Indeed, the problem task that isthe focus of the article at hand can be seen as a spe-cialised recommendation task: instead of articles or

products, the system facilitates the choice of appropri-ate advertising contents. However, the features chosenand processed are centred around those relevant forthe application and may differ from those very similartasks.

Driven by these similarities, digital advertising em-braces today a mixture of traditional techniques forpattern discovery within a multitude of data. Statisticsand machine learning serve to automatically identifycustomer groups with shared interests and the respec-tive appropriate content. However, so far, no computersystem has been developed that perfectly understandsa user’s background, interests and predicts future in-tends.

The MindMinings project explores a next step toperformance improvement by introducing the ontology-based paradigm to an exemplary online advertisingsystem. By storing all information within a specialisedontology, it will be possible to integrate a multitude ofsources.

The ontology model is being developed in close col-laboration with a company delivering analysis solu-tions for digital advertising. It is thus designed to cap-ture the specialised knowledge of domain experts. As

0000-0000/0-1900/$00.00 c© 0 – IOS Press and the authors. All rights reserved

2 Ontology-based User Profiling in a Big Data Environment

a base for all analysis, it contains information abouta user’s consulted web resources and, thanks to en-hanced analysis, their content and its interpretation.The usage of standardised ontology languages allowsfurthermore the integration with existing knowledgeresources of the Web of Data – a rich set of taxonomiesand dictionaries that enhance specialised informationwith common knowledge (e.g. using WordNet [17],DBPedia [1], or Freebase[3]). The integration of allthese information builds up a thorough picture of theuser’s behaviour and context. The usage of a very fastand lightweight RDF database, coupled OWL 2 andSWRL support [6] allows to use ontological inferenceto deduce formerly unknown facts that stem from thecombination of the above-named knowledge sources.

The article at hand aims to give detailed informationabout the project context, its keylocks and solutions,as considered and implemented. Section 2 gives anoverview of the different goals defined for the project,followed by a survey of related work (Section 3). Sec-tion 4 offers thorough description of our system de-sign, the developed ontology and how its functional-ities are used for audience segmentation. The articleconcludes with a summary and an outlook on futurework.

2. Task description

The final goal of the MindMinings project is todesign a profiling system that uses the navigationtraces every user leaves when visiting pages to de-duce viable information for digital advertising. As adistinction from similar ambitions, the system willbe fully ontology-based, as opposed to traditionalstatistics- and machine learning-based implementa-tions that dominate today’s market.

The CheckSem research team unites researchers ondifferent stages of their careers whose research inter-ests centre around the way semantic web technolo-gies may be applied, adapted and extended for partic-ular applications. To ensure a correct vision on the do-main on digital advertising, a collaboration was initi-ated with one of the French online advertising marketactors.

The company offers specialised data analysis to on-line publishers, editors and advertisers, based on theuser data that are available to those parties. From theside of online editors, these information comprise de-tails about the users’ navigation on the websites – whatpages have been requested, at what time, with what

device. With the help of cookies, the visits of a cer-tain machine can be linked over time to constitute amore exhaustive usage profile. In certain cases, whenthe user deliberately registered to the page, additionalinformation might available that have been explicitlyrequested on the creation of the user profile – thesemay comprise basic demographic information, such asage and gender, but also more complex statements asinformation about topics of interest and purchase in-tentions.

Even though the nature of the data available to theindustrial partner are more varied, the MindMiningssystem will mainly rely its most stable parts; that is,the users’ navigation logs. Information stemming fromCustomer Relationship Management (CRM) can beused as a measure for validation and evaluation butis, usually, only scarcely available. So the system hasto be able to automatically process the newly enter-ing navigation logs and extract the viable information.The logs arrive in a standardised format and can be di-rectly parsed. Contained are basic information aboutthe users’ page visits: time stamps, the user agent (asummary about the hardware and software used duringthe visit), and the exact URL of the displayed contents.

Given this context, the task at hand comes with threekey issues:

1. Resource analysis: The analysis of the web re-sources viewed by the user are a crucial step inconstructing the user profile. Their content has tobe unambiguously analysed in order to be trans-formed into a machine-interpretable representa-tion.

2. User profiling: Based on all available informa-tion, a unique profile for each user has to be con-structed. The background knowledge about rela-tions between user characteristics has to be ex-tended with knowledge that can be deduced fromthe user’s specific trace.

3. Big data scale: Cookies assemble the eventtraces for the users of each partner site. Thisleads to a high volume of heterogeneous data thathave to be interpreted. Every solution thus has tobe efficient and scalable.

Whereas the domains user profiling and resourceanalysis have been studied for decades, Big Data is adomain that emerged during the last years, mainly mo-tivated by the immense amount of data produced byweb applications. Indeed, interpreting incoming eventsalong with the dynamic creation of the user profile andits exploitation for content suggestion have to scale to

Ontology-based User Profiling in a Big Data Environment 3

the number of partner sites and the number of usersthat surf those pages every day. Thus, the profiling taskat hand qualifies in all criteria defined by the IEEE1 aschracteristics of Big Data, notably:

1. Volume – The amount of data assembled by thecompany each month reaches about 150 millionuser events per month for an average partner site,in sum 2,4 billion events each month, leading toa high volume of data to analyse.

2. Variety – Our analysis includes web data in allits formats and orientations, leading to a highlyvariable spectrum.

3. Velocity – User events arrive for analysis in real-time, meaning a high frequency during main ac-tivity hours.

4. Value – As the analysis is performed on real-lifedata, generated by the activity of users, it bearsinformation about their interests and habits thatis commercially exploitable.

5. Veracity – In fast arriving, variably shaped dataambiguities and erroneous entries are not avoid-able. The trustworthiness of the source and un-certainty are to be considered in the analysis.

User clicks appear in a high number and also in highfrequency. Starting from the numbers indicated above,on an average day, all current clients would produce anumber of 80 million data instances scattered over dayperiods with different degrees of activity. Even thoughdata arrives in form of uniform user events, the webresources referenced therein expose a high level of va-riety and may include different types of media (arti-cles, blog entries,pictures, videos, etc.) and basicallyall combinations of topics. The value of the informa-tion included, however, is a better understanding of theuser behaviour and deduce interests. However, in sucha stream, erroneous events will certainly happen. Thisrefers to all click data that are reported, but that are notpertinent for the summary of the user’s interests.

Using an ontology to realise the information integra-tion brings manifold advantages. An ontology facili-tates the explicit and unambiguous structuring of com-mon knowledge – it allows the clear definition of do-main knowledge and thus its reuse throughout appli-cations. The explicit representation makes it possibleto analyse the domain knowledge, to draw further con-clusions or challenge its implications. In a hierarchi-cal structure, meta-knowledge, domain knowledge and

1http://www.ieee.org

operational knowledge may be separated, allowing areuse of generic information while keeping the opera-tional information limited to their case of application.

3. Related Work

Due to the multitude of research domains touchedby the MindMinings project, the body of related re-search is vast. It goes beyond scope of this article to re-view all literature that proposed solutions for resourceand user profiling. The subsequent sections will, how-ever, take a thorough look at works that use ontology-based paradigms for these two tasks. Finally, a shortpassage sums up efforts to integrate information fromdifferent sources based on ontologies.

3.1. Resource Profiling

Algorithms for website qualification comprise tech-niques for the pre-processing of resources, that extractand normalise the features for each resource; the actualapplication of classification algorithms and the repre-sentation and storage of results.

The widely used technique for the featuring of tex-tual resources is the discovery of keyword terms. Al-gorithms mostly use statistical considerations to deter-mine the terms within the resource that best summariseits content. Techniques reach from the removal of fre-quent, non-valuable stop words to enhanced discoveryof underlying topic dimensions (Latent Dirichlet Al-location [2]). A simple, but powerful approach that isprobably the most used measure for word relevance to-day is the TF-IDF measure [23]: a mathematical mea-sure that facilitates the discovery of those terms thatdistinguish a certain document from the rest of the doc-ument corpus. Semantic Web resources have been usedin this working step as a reference – a large number ofworks report the success of the inclusion of WordNet[17], DBPedia [1] and specific domain ontologies tothe pre-processing of textual resources. [20,21] reviewtechniques for Named Entity Recognition and namestructured knowledge sources as a supportive elementfor the task, besides unstructured sources as corpora.[7] uses a combination of a semantic topic attributedto a term and his syntactic role within the sentence toboost performance for Part-of-Speech tagging. In [18],the inclusion of a Semantic Web reference resourcehas proven to boost efficiency. The authors comparethe traditional statistical LSI approach (Latent Seman-tic Indexing [8]) for topic discovery with a WordNet-


based alternative. They proved the latter to achieve re-sults as good as the traditional approach, but in consid-erably lower computation time.

Furthermore, Semantic Web resources have beenapplied for both, feature reduction and query exten-sion. Both techniques tackle the problem of featuresparseness - given the high number of words in natu-ral language, one text might only feature a small num-ber of the words that are contained in the dictionaryof all documents. Besides, the existence of synonymsin natural language might lead to texts that treat thesame topic, but show no overlaps in their used vocab-ulary. In consequence, a regular matching algorithmwould not find them to be similar. Anyhow, the back-ground knowledge about semantic relations that is con-tained in a semantic dictionary as WordNet [17], canbe used to either (a) cluster words that are closely re-lated and thus reduce the feature space, or (b) extendeach keyword vector by the synonyms of the terms al-ready included and thus facilitate a matching betweendocuments that use disjoint, but synonymic sets ofwords. However, approaches concentrate around im-provements of the traditional keyword based approach.The quality of topic terms benefits from disambigua-tion or extension by synonymic terms. The semanticrelations of the concepts are thus used to ameliorate atraditional paradigm, but then omitted for further ex-ploitation.

3.2. User Profiling

User profiling is a problem task that has been tack-led from many directions, the interest in personalisa-tion has been expressed in several domains. The fo-cus will be set on publications that propose ontology-based profiling solutions. The following passages fea-ture a review of meta-models for ontology-based pro-filing (Section 3.2.1), and resources that facilitate tomodel complex user interests (Section 3.2.2).

3.2.1. Models for ontology-based profilingOne of the main paradigms that reigns the devel-

opment of semantic web resources, is to reuse exist-ing repositories as much if possible to avoid duplicatework and redundancy within knowledge resources. Astudy of literature revealed two main propositions ofprofile ontology models: the General User Model On-tology (GUMO) [10], and the later publications byGolemati et al. [9].

GUMO: GUMO is an OWL-based representation ofuser properties and influenced by the already existingspecifications provided by UserML, SUMO and theUbisWorld ontology. It adopts UserML’s division ofthe user profile dimensions in three basic classes: aux-iliary, predicate and range. Given, one wants to charac-terise the user’s interest in fantasy literature, this mightbe done using the auxiliary “hasInterest“, the predicate“fantasy literature“ and the range “low“, “medium“ or“high“. Expressing a statement about knowledge, onemight use the auxiliary “hasKnowledge“, a predicatethat contains the area of expertise that shall be ex-pressed and a range of the statements “poor“, “aver-age“, “good“ and “excellent“. GUMO identifies about1000 groups of auxiliaries, predicates and ranges. Any-how, one may notice that actually everything can bethe predicate for auxiliaries like “hasInterest“ or “has-Knowledge“, leading to a problem if the ontology de-sign is not realised in a modular fashion [10]. In thecase of GUMO, the basic design solution is to limitthe ontology to the basic user model dimensions and toleave the modelisation of the general world knowledgeopen. Thus, one may plug-in an application-specificdomain ontology to fill in the slots or integrate a gen-eral purpose ontology such as SUMO [22] or Ubis-World 2. Hence, the missing definition of interest con-cepts becomes a feature as it enables the simple adap-tation to a certain application domain. The basic ver-sion as presented in [10] identifies the following set ofuser model auxiliaries hasKnowledge, hasInterest, has-Believe, hasPlan, hasProperty, hasGoal, hasPlan, has-Regularity, hasLocation. The listing provided does notaim for completeness and leaves it open to each de-signer of application-specific adaptations to extend it.

Golemati et al.: Golemati et al. presented theirgeneric user model in 2007. Their identification ofre-occurring core concepts in user profiles is basedon a thorough literature review. The identified top-level dimensions can be found in Table 1. The cen-tral concept of the ontology is the “Person” classthat contains all the characteristics of the user pro-file. These information comprise basic facts such as aname or a date of birth; but also instances of the on-tology classes as the description of the user’s contacts.All classes capturing more complex notions provideslots the respective characteristics of the user’s lifeand a field to represent the time period of actuality forthe information. The classes “Interest“, “Preference“,

2www.ubisworld.org


“Ability“, “Characteristic“ and “Thing“ contain threeslots: “type“, “name“ and “score“ (or “value“ in thecase of “Thing“). “Thing“ has two sub-classes: “Liv-ing Thing“ and “Non Living Thing“, as adopted fromthe WorldNet ontology [miller1995]. For the class “In-terest“ an additional field “interest type“ has been in-cluded to enable the modelisation of interest classesand their hierarchy.

Remark: It has to be noticed that none of the aboveexamples seems to have an active support commu-nity. The links given in the papers are without excep-tion unavailable. However, a rdf- and a owl-versionof the approach by Golemati at al. can be found athttp://www.sdbs.uop.gr/?q=node/273.

UMMO: In a publication from 2005, Yudelson atal. propose a meta-model for user profiling, the UserModel Meta-Ontology (UMMO) that aims to specifymain streams of the domain and categorise approaches.It serves thus not to structure the terms appearingwithin the various profiling approaches but to cate-gorise the approaches themselves based on the used in-put data, data structures and algorithms. The keytermshave been extracted from various domain publicationsand extended and structured based on expert opinions.

3.2.2. Modelling user interestsBoth above-named approaches for user modelling

leave open slots for the integration of application-dependent concepts for complex characteristics, suchas interests, knowledge, etc. Whereas GUMO [10]leaves the decision for a suitable resource up to theontology designer (but giving a recommendation forSUMO and the UbisWorld ontology), the approachproposed by Golemati et al.[9] includes sub-categoriesadopted from the WorldNet ontology [17] as a basefor extension. Independently from the notions of thosebasic profile ontologies, several methods provide so-lutions for the semantic enrichment of identified userinterest keywords. The options comprise (a) the rela-tion of the terms with the clearly defined concepts ina reference ontology (mostly used for disambiguationof keywords), (b) the construction of a profile ontol-ogy based on extracted parts of the reference ontologyand (c) the estimation of relations between the termswithout using a reference.

Use of general purpose ontologies: The usage of se-mantic web resources as a reference has been also ap-plied in user profiling approaches. Often, referencesto existing knowledge bases are used to disambiguateterms (using WordNet: [25,16,26,4], using ODP: [13]),

extend the feature space with their synonyms [25] orrelate them to a upper level concept of interest [15].

Extraction of ontology snippets: More recent ap-proaches do not limit themselves to the mere refer-ence of Semantic Web references, but opt to integratethe additional information that they contain to the userprofile. The above described approaches use, for ex-ample, the synset relation between terms in WordNetto alter the keyword sets for a user/resource. Currentmethods, however, aim to use all semantic relationssurrounding the concept in question. This demandstechniques that discover the relevance of the relationsand concepts that are connected to a profile term. Cale-gari&Pasi [5] suggest to use the basic profile termsfrom a user’s collection as seed terms to then col-lect additional information from a reference ontology.Therefore, the concepts included in the keywords areidentified and their neighbouring concepts integratedto the profile, depending on the distance to the origi-nal concept and the relation types that exist betweenthem. [15] proposed the Spreading Activation algo-rithm for the use of semi-automatic ontology exten-sion, [11] shows an application in the context of per-sonal information management. In a later work, Kati-fori outlines the similarities between the spreading ac-tivation used for weight computing on ontologies andcurrent theories about human memory – and suggestsits usefulness when designing user-centred informa-tion systems [12]. According to him, a model of con-cept integration and weight adaptation that mimics thecapabilities of the human brain will be closest to thepersonal perception of a user and thus, best able to pre-dict his/her needs. Even though the task-centred pro-filing incorporated in PIM applications is not exactlyequivalent to interest profiling, the approach might bea promising starting point for adaptation.

Generation of knowledge from user’s resources: Theontology used for the representation of a user’s inter-est does not need to be connected to a reference re-source. Even though the extraction of relations from anexisting knowledge repository seems to be a straight-forward approach, the properties of the underlying re-source collection might suggest the usage of alterna-tive techniques. Given a set of texts that treat a veryspecific domain of interest, one might not find the nec-essary content in a general purpose ontology as Word-Net. The construction of a specialised domain ontol-ogy is a costly process and might not be possible due tofinancial or timely constraints. In such a case, methodsfor automatic relation discovery may help to construct


Class Name Class Description

Person Basic User Information like name, date of birth, e-mail

Characteristic General user characteristics, like eye color, height, weight, etc.

Ability User abilities and disabilities, both mental and physical

Living Conditions Information relevant to the user’s place of residence and house type.

Contact Other persons, with whom the person is related, including relatives, friends, co-workers.

Preference User preferences, for example ”loves cats“, ”likes blue color” or “dislikes classical music”

Interest User hobby or work-related interests. For example, “interested in sports“, “interested in cooking”

Activity User activities, hobby or work related. For example, “collects stamps“ or “investigates the 4th Crusade“

Education User education issues, including for example university diplomas and languages

Profession The user’s profession

Table 1Upper-level classes of the User Profile Ontology according to [9]

an ontological structure nevertheless. One example forsuch a automatically generated profile ontology wasproposed by the working group around Y. Li and N.Zhong [28,14]. The generated concept relations rely onco-occurrence measures as applied to the user’s doc-ument collection. From the user’s text collection, themost relevant terms are extracted using basic text min-ing techniques. Using pattern recognition techniques,the discovered terms are grouped and then relation-ships between the elements estimated using associa-tion rule mining. The result is a personalised domainontology of the user’s interest terms. The underlyingalgorithms used comprise probabilistic classifiers andHopfield networks [28], fuzzy relational algebra [19](the latter integrating expert knowledge by letting themattribute semantics to the discovered relations)[5].

4. Our approach

The composition of a comprehensive user profile is acomplex task, even if aimed at a limited domain as dig-ital advertising. An intelligent subdivision of tasks isthus indispensable. The main division has already beenindicated in the above passages: (a) web resource pro-filing, (b) user profiling, (c) aggregation and inference.Even though there have been system propositions inliterature that treat individual tasks of this list, only fewapproaches have been found that try an integration ofinformation in form of an ontology. None of these hasbeen specialised for an application domain and, moreimportantly, tested in an industrial environment.

The collaboration with a company that actually of-fers profiling services to actors in digital advertisingoffers a multitude of possibilities for the validation ofour approach. The integration of the domain knowl-edge of the partner is one important point, but a main

advantage is surely the possibility to expose the devel-opments to comparative testing and ameliorate the al-gorithms accordingly.

4.1. General presentation

The core task of the MindMinings project is to de-sign a profiling system that integrates all working stepsand data in one structure, the ontology. To ensure theapplicability of the approach to a real-world applica-tion, all developments have been made in close collab-oration with our industrial partner.

The constraints imposed when using the system ina real-world context have important influence on thesystem design. User events from web platforms ar-rive in a very high frequency, the system must thusintegrate new information in a very fast way – andadapt the respective deductions concerning the adver-tisement to place for each single user. Even though up-to-date triple stores are designed to come up to suchimperatives, the performance requirement do not onlyinclude the reaction time of the ontology, but also thesemantic analysis process.

To speed up the profiling process that happens fol-lowing the user activity on the web pages, the semanticresource profiling got decoupled. The company part-ners up with a limited number of contractual clientsthat may hold a huge, but still limited, number of web-sites. On this limited set, the semantic analysis may bedone before the actual user activity happens, as soonas a new contract is concluded or a new page enters theensemble. The idea is to store the already semanticallyanalysed and aggregated information about each webpage in the triple store-published ontology and to buildup the connections for each user on his/her activity pe-riod.


Fig. 1. Overview of the workflow within the MindMinings profiling system

Figure 1 shows an overview of the building blocksof the MindMinings system. The left shows the en-tering web resources, that are subjected to the asyn-chronous semantic analysis. The resulting informationis used to populate the ontology and will therefore becombined in the synchronous profiling process.

Starting from the log files obtained from a user’sbrowsing behaviour (on the right side of the picture),information are pulled together:

1. facts that can be directly extracted from the usagelogs are fed to the ontology (time stamps, useragent etc.);

2. analysis results from the company’s machinelearning toolkits are brought into ontologicalform and included;

3. semantic information about the web page linkscontained in the log are pulled from the storageand linked at run-time.

The combination of all those elements with expert do-main knowledge (modelled into the ontology by meansof constraints and SWRL rules) about attractive seg-ments and their composition, allows to deduce for eachuser which of the segments he belongs to and to whatdegree.

4.2. Ontology: Entity Presentation

Our ontology comprises several classes, all of thembeing sub-classes of owl:thing. The following sec-tions present these entities and their relations, a graph-ical overview can be found in Figure 2.

4.2.1. Context entitiesThis category groups the entities that are defined

by our industrial partner’s commercial ecosystem. Thecompany concludes a contract with online publishersto provide enhanced analysis of their usage logs. Thus,the content that has to be included in the analysis pro-cess is determined by the partner, the domains he ownsand the web pages that are reachable below this do-mains. Each partner has a different amount informa-tion appertaining to him, possibly extended by collab-orations with other actors on the market. Hence, in-formation about the partner and his possible coalitionswith other players are crucial to determine the factsthat are visible or have to be hidden in the analysis pro-cess. Hence, the following entities have been includedin the ontology:

Partner: The partner is a society that has signed acontract with our industrial partner for the treatment oftheir data. Each partner is identified by a WID, a ”webidentification”. All domain and web pages that belongto a partner will have this ID attached.

Domain: Referring to the official Domain NameSystem (DNS), the domain in our context means thestring that results from the combination of second-level domain and top-level domain. All web pagesand sub-domains subordinated to the domain willbe related to it. In an example: the URL http://lentrepise.lexpress.fr/index.html re-fers to the entry page of the enterprise section of theFrench journal “L’express“. The domain in this casewould be “lexpress.fr“ (including the top-level do-main “.fr“ and the second-level domain “lexpress“).


Fig. 2. Graphical overview of the MindMinings ontology

“lentreprise“ identifies the sub-domain that is ad-dressed, ”index.html” identifies the specific page todisplay.

Webpage: The class “Webpage“ envelops all con-tent pages that are found for a certain (sub-)domain.Every parsed web page will constitute an individualwithin the ontological structure and relate with otherentities that qualify its content. It is identified by itsUnified Resource Identifier (URI) that is included in adata property connected to it. The web page is an en-tity of the company context, but at the same momenta base element for data treatment and analysis. Hence,one may consider it the binding element of those threecontextual areas.

4.2.2. Data entitiesThe entities in this section are issued from our part-

ner’s internal data treatments. For that, they provideadditional contextual information to the web pages ex-tracted from the navigation logs. However, their ori-gin are basic analytic steps (as for example comput-ing the duration of a page view from the time stampsaccompanying it), as opposed to information deducedfrom enhanced statistics or the application of machinelearning techniques.

Hit: The hit comprises all information about a singlepage view of a user. That is, whenever a page is re-quested from the server, this is logged as one hit. In-cluded in the class are all information related to thatentity – the time stamp, the user agent, etc. In the vo-cabulary of our industrial partner, the information en-veloped by the class equals the informational value ofthe “enriched hit“, as we aim to store all information

that can be deduced from the basic entities: user loca-tion (country, region, town), time spent on the websiteetc.

Session: A session is a sequence of hits, grouped bythe fact that the distance between the time stamp of onepage view and the subsequent page does not exceedthirty minutes.

4.2.3. Analysis entitiesThe content analysis process demands the integra-

tion of some classes to capture the elements usedwithin. Those mainly comprise the features used to de-scribe the content of a web page, keywords and uni-verses.

Keyword: A keyword is a basic term that describesone concept contained in a web page. The Keywordclass will be used to capture the keywords found inthe web pages and to handle their disambiguation us-ing external knowledge sources. As such they allowthe integration of external URIs that link to DBPedia,WordNet or the like. The instances of the Keywordclass constitute the binding element between a webpage individual and the categories (”universes”) it be-longs to. Given the above definition, we have createdthe containsKeyword-object property which re-lates a Webpage to a Keyword. This ojbect prop-erty is asymmetric and irreflexive. Its inverse objectproperty is appearsInPage. As mentioned above, akeyword belongs to a universe of keywords. Thereforewe have created the belongsToUniverse objectproperty which is also asymmetric and irreflexive. Itsinverse object property is comprisesKeyword. Wehave also defined object properties for modelling se-


mantic relations between keywords, such as antonymsand synonyms.

Universe: The term “universe“ stems from our part-ner’s internal vocabulary and refers to a certain contentcategory and the keywords that are related to it. Thus,every universe will carry the name of the category itdepicts, and bear close relations to the keywords thatare associated with the respective content domain.

4.2.4. Profile entitiesThe final goal of all computing efforts is the seman-

tically enhanced profile representation for every web-site visitor.

Profile: ”Profile” is the main class linking the at-tributes making up the user profile. This comprisesthe elements stemming from the content analysis ofthe web pages, by linking it with the universes thatwere discovered therein; but also attributes that maybe deduced from those content attributes or stem fromthe analyses’ done via statistics and machine learningtechniques at the company. In consequence the pro-file class contains two sub-classes that group the ele-ments into socio-demographic attributes (such as age,location etc.) and behavioural attributes (such as thebrowser or the affinity to certain brands). For the mo-ment, each of those sub-classes is divided in a numberof groups signifying the commercially interesting divi-sion of the attribute. For example, the age value is cur-rently identified by choosing to link a profile with oneof the individuals “Age 15-24“, “Age 25-34“, “Age 35-49“, “Age 50-64“, “Age above65“, “Age Child“, “AgePre-Teenager“, or “Age Teenager”. The partitions havebeen chosen based on the formats currently used by theindustrial partner.

Segment: One of the key features of the ontology isits capability to automatically determine the attribu-tion of an individual to a certain class. Using the affil-iation of a user individual to certain of the above de-scribed groups, more complex notions can be speci-fied. The segment class captures exactly those morecomplex profile entities that may be constructed usingprofile features (“a female person living in a householdwith children“ belongs to the segment “mother“), con-tent features (“a person reading 90% of the times onpages that treat sports-related topics is a sports-fan“)or a combination of both. The individuals assigned toa class of type ”segment” are those that comply withthe constraints or rules that were imposed to define thesegment.

4.2.5. Constructional entitiesAn additional class was added to the ontology for

mere internal treatment. The concept Weight wasadded to contain numerical values for each objectproperty that may apply to a certain degree. For a morerealistic modelling, it will be convenient to have thepossibility to weight the relations between certain con-cepts. For example, a web page may treat a certainset of topics, but each of them to a different coverage.Hence, we want to model that a certain universe is rep-resented to e.g. 40% in a page and include this notionto the evaluation. As mentioned before, a given userprofile will belong to several universes of interest, anda given keyword will appear in several different webpages. We wanted to be able to quantify this degreeof belonging. We therefore added the Weight con-cept in our ontology. The instances of this class carrya data type property that contains a numerical valuequantifying the weight of the relation. As a direct con-sequence, the relation between the concept Webpageand the concept Universe, named hasUniverse, isspecified as being a concatenation of two other rela-tions: hasWebpageWeight, relating the concepts ofWebpage and Weight and weightHasUniversethat then concludes the relation to the respective Uni-verse concept:

hasWebpageWeight ◦ weightHasUniverse

SubPropertyOf hasUniverse(1)

where hasWebpageWeight is forced to connecta Webpagewith a Weight; weightHasUniverseto connect a Weight with a Universe. The like hasbeen done for the relation between a keyword and auniverse (quantifying how much a keyword is actuallyassociated to a certain category), between a profile anda universe (quantifying how much importance the uni-verse in question has for the description of the profile).

4.3. Ontology: Segment definitions

There are two ways to define definitions within thecurrent ontology: using class constraints and compos-ing SWRL rules atop classes.

4.3.1. Using OWL restrictionsOWL restrictions are constraints that may be defined

on relationships between entities. By their insertion,we define an anonymous class of all individuals fulfill-ing the constraints, but omit to include this class ex-plicitly in our class hierarchy. This is useful if we want


to distinguish an unnamed set of entities that has nofurther semantic value for the contents of the ontologyand, particularly, if we want the inference engine ofour framework to find the affiliations of the individualsautomatically.

4.3.2. Quantifier restrictionsA quantifier restriction puts a constraint on a rela-

tionship a given individual takes part in. It consists ofthree parts:

1. a quantifier (either the existential quantifier someor the universal quantifier only

2. the property that is concerned by the restriction3. a filler that refers to a class

Using those two types of constraints, one can ex-press that at least one kind of this relationship has toexist (“some“) or that exclusively this type of relation-ship may exist.

owl:someValuesFrom: We use the ”some”-relationwhen describing that a certain user shows an interestin certain topics. The corresponding class descriptionwould be: “hasVisited some Webpage USports“, where“hasVisited“ is the relation that connects a user with aweb page that appears in his navigation log and “Web-page USports“ the class capturing all web pages thattreat sports-related topics. By using the ”some” rela-tion, it is stated that the user has to have visited at leastone page that belongs to those that are related to theuniverse “Sports“, but there is no limit imposed con-cerning all the other topics that the same web pagemight be related to.

owl:allValuesFrom: We have chosen to apply theuniversal restriction “only“ in cases when one wantsto single out a very specific group of individuals. Forexample, the following restriction allows identifyingonly the users that are interested in soccer, but in noother sports: ”hasVisited only Webpage USports Soc-cer“.

owl:hasValue: The owl:hasValue restriction is verysimilar to the owl:someValueFrom constraint, only thatthe filler is not a class, but a single individual. It thusenables to define an anonymous class of individualsthat are connected to one specified individual in thestorage. One may note that following the open worldassumption, there is no implied constraint concerningthe other relations the individuals take part in – theymay contribute to other properties, or even the sameproperty but leading to another class of individuals.This kind of constraint becomes useful when targeting

specific brackets of the user profile as those are mod-elled as individuals. For example, to single out the in-dividuals that belong to the age group over 65, it willbe enough to define the anonymous class by the restric-tion ”hasAge value Age Over65”.

4.3.3. Cardinality restrictionsCardinality restrictions allow to establish constraints

on the number of relations of a certain type the indi-viduals may take part in. They come in three flavours:minimum cardinality, maximum cardinality and car-dinality. In compliance with their name, they enableto define a minimal/maximal number of relationshipsthat an individual may have or even set an exact num-ber. As an example, we may specify that a persondoes only count as being interested in sports, if havingseen at least five pages that treat sports-related topics:”hasVisited min 5 Webpage USports”.

4.3.4. Using SWRL rulesAs presented in section 4.2.5, the weight concept

allows us to put a measurement of importance onthe relations within the ontology. In consequence, weare able to not only express binary relations (“motherAND some web pages that talk about sports“ means“SportyMom“), but insert a new level of expressive-ness by allowing quantification: ”to a certainty of 0.8 amother AND more than 90% of pages treat topics re-lated to sports“ means “SportyMom“. To actually usethe weight notation for reasoning, however, the usageof SWRL rules becomes indispensable. The followingexample illustrates the definition of a strong relationbetween a Universe and a Webpage:

Universe(?u),Webpage(?wp),

f loat[>= 0.8, <= 65](?value),

hasUniverseWeight(?w, ?u),

hasWebpageWeight(?wp, ?w),

hasWeight(?w, ?value)

→ hasStrongRelation(?wp, ?u)

(2)

Specifying that a universe u and a web page wp that areconnected by a concatenation of properties hasWeb-pageWeight and hasUniverseWeight, connecting themwith a weight w with float value that lies between 0.65and 0.8, it is assumed that wp and u have a strong re-lation to another.


Fig. 3. Minimal version of the ontology used for examples

4.3.5. Practical illustrationThe following paragraphs and figures illustrate a

prototype application and show some exemplified im-plementations using the theoretical foundations ofabove. For the sake of clarity, a minimal version ofthe user profile ontology was adapted, as visualisedin Fig. 3. We mainly omitted the context entities andthose data entities that do not directly contribute to theanalysis process. However, for the classes included, allspecifications obtained during the collaborations havebeen kept to their maximum. This values particularlyfor the profile class that contains a rich number of sub-classes and their definitions.

The industrial partner aims for marketing-orienteduser segmentation. Depending on the commercial part-ner in question, those segments may vary – a conditionthat favours the usage of an ontology as flexible under-lying data structure. Segments are defined by the com-mercial division of our industrial partner and are inte-grated in the ontology to allow a segmentation of usersemploying inference. The construction of new restric-tions and rules will be supported by an intuitive inter-face that is to be developed in the later stages of theproject. Hence, the integration of new segments willbe an easy-to-handle process even for personnel notfamiliar with ontology construction and maintenance.For the moment, individuals belonging each segmenthave been constructed manually to show the workingsof the inference engine. However, please note that theontology has also been published on a Stardog triplestore[6] that works in combination with several webservices developed earlier in the project. Thus, it ispossible to automatically populate the ontology pub-lished on the triple store by handing over a naviga-tion log as initial data structure. The navigation infor-mation contained in the logs are parsed automaticallyand the respective elements (keywords, webpages, ses-sions, hits, etc.) inserted into the ontology according tothe features that were identified by the analysis. Fur-

Fig. 4. The class hierarchy after the creation of the concept "mother"

thermore, it is contemplated to integrate the resultsstemming from our partner’s machine learning internalapplication as additional source of information. Therest of this section describes how constraints can beused to deduce two segments as they may appear inour industrial partner’s definitions. Therein, we chooseone basic example that combines attributes using ma-chine learning techniques, and one advanced examplethat combines those attributes with information stem-ming from website content analysis.

Basic example For the first example, we chose theexample of the segment ”mother”, designating a per-son that belongs to a certain age segment, lives in ahousehold with child presence and was identified asbeing female. All those attributes belong to the ba-sic profile proposition defined by our industrial partnerand can today be determined using machine learningtechniques. In the ontology, those findings are com-bined with the evidence stemming from content analy-sis and in consequence ascertained or corrected.

The segment “mother“is created as a sub-class ofthe “segment“ concept, the class which assembles allconcepts that are combinations of base attributes to ahigher abstraction level. Figure 4 shows the createdsegment in Protégé’s class hierarchy, . Of course, thesoftware will not automatically know what the seg-ment “mother“ depicts in our model and a definitionthat suits the application context has to be entered man-ually: A value restriction is used for the class Gender(to choose only the individuals that are female), com-bined with a helper class Parent that was defined be-fore as a person of a certain age segment, that livesin a household with child presence. In consequence,the class description changes as shown in Figure 5 –Protégé does not only include the definition that weprovided (“mother“ being a entity that is a member of“parent“ and having a certain gender, in the first lineof the information area) but also assigns all definitions


Fig. 5. Changes appearing after having added the definition of the new segment

Fig. 6. Mother individual with attributes inferred by the reasoner

that have been entered for the segment parent as im-plicit super-classes (in the third line of the informationarea).

For a test, the individual ”user1“ was included tothe database and equipped with all the properties of aninstance of the class Mother. After running the rea-soner, the individual is correctly attributed to the re-spective class – depicted in Figure 6.

Advanced example The same procedure can be re-alised using attributes not only from the basic pro-file categories, but also topic information extracted di-rectly from web pages. For the example, again, we con-struct entities by hand, the user as well as the web page,to exemplify the internal process. However, all thoseindividuals will be discovered automatically from theanalysis steps and the ontology populated accordingly.We create a new segment called ”SportyMom”, desig-nating an individual that is a mother (defined by de-mographic attributes) and has visited some web pagestreating topics from the category “Sports“ as follows:

Segment_Mother and (hasV isited min 1

(hasUniverse some Universe_Sports))(3)

Furthermore, we create a web page-individual thatbelongs to the ”Sports” category and define for User1that she has visited that web page. By the definition ofthe class, User1 is now not only a mother, but also a“SportyMom“ – a classification that is verified whenrunning the reasoner (e.g. the class attribution for thecreated indivdual, to be seen in Figure 7).

Fig. 7. Changed attributes for the individual "User1" after runningthe reasoner

5. Conclusion and Future Work

In the last eighteen months since the begin of theMindMinings project, the first steps were made to cre-ate a thorough, ontology-based user profiling systemfor the digital advertising domain. Two main achieve-ments were realised in this period:

1. the design of a customised ontology based on thedomain knowledge of an industrial partner,

2. the development of a prototype system using cur-rent techniques for term extraction, informationintegration and inference.

The prototype relies on the first version of the on-tology that was stored on a triple store. Initial per-formance test came up to the maximum expected re-sponse times, but can certainly be tweaked when usingadapted methodology.

The MindMinings ontology is our first contributionto the body of research centring around user profiling.Even though two propositions have been made to stan-dardise profiling processes, none of the projects seemsto be still actively used by a community. Further-more, the entities defined therein do completely fulfilour needs. However, to come up to the semantic webparadigm of resource reuse, one of the next steps willbe to connect the concepts used in the MindMiningsontology with their counterparts in the Linked OpenData cloud notably by using owl:sameAs predicatesto ensure semantic disambiguation of keyword con-cepts. This will also open access to elements of com-


mon knowledge for our system as it may use the re-lationships defined within those common knowledgeresources for further inference.

The prototype system has been implemented usingwell-established techniques – as for example the tf-idfmeasure for the identification of relevant index terms,vector-based distance computation for websites anduniverses. The next step will be to introduce seman-tic distance computation for key terms previously dis-ambiguated (by relating them them to the appropri-ate, clearly defined concept in the Linked Open Datacloud). This customised semantic distance measurewill rely on context semantics rather than a mappingof keywords. Furthermore, the basic ontology will berefined and extended to make full use of the weightingconcept that allows gradual statements over the affilia-tions of keywords and web resources to universes, andthus, the degree of a user’s interest in a certain topic orproduct. Adequate frameworks for the propagation ofthis vague information over the ontology will be eval-uated.

The main source of information is composed oftextual resources linked to the history of the profileduser. Text analysis has been widely studied throughdifferent domains, such as Information Retrieval andNatural Language Processing. Efforts have been madewith focus on the extraction of meaning from text andits transformation to machine-interpretable structure.However, the toolkits available often centre around theevaluation of English resources and offer only lim-ited of no support for other languages. As most of theclients of the partner company are French, the adapta-tion of state-of-the-art approaches to French languagewill pose one of the mayor keylocks. Furthermore, wewill explore extensions of their functionality that servetheir customisation for the problem task, focussing onthe extraction of information that are significant formarket-relevant customer segmentation.

The analysis of natural user behaviour implies an-other important challenge: human beings seldom actupon logical, crisp principles with clear boundariesand linear reasoning. The profiling system will thushave to handle changing interests, contradictory be-haviour or unclear implications. A possible mathemat-ical framework to handle vague and uncertain informa-tion could be fuzzy logic, introduced by L.A. Zadeh in1965 [27]. The inclusion of fuzzy logic to ontologicalstructures has a rather short history, with a first com-prehensive work having appeared in 2006 [24]. How-ever, first computing schemes and applications havebeen presented therein.

All developments will be done with a close eye onthe usefulness for the application domain and in searchfor the best performing alternative concerning the in-dustrial constraints.

References

[1] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann,Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus fora web of open data. In The semantic web, pages 722–735.Springer, 2007.

[2] David M Blei, Andrew Y Ng, and Michael I Jordan. Latentdirichlet allocation. the Journal of machine Learning research,3:993–1022, 2003.

[3] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge,and Jamie Taylor. Freebase: a collaboratively created graphdatabase for structuring human knowledge. In Proceedings ofthe 2008 ACM SIGMOD international conference on Manage-ment of data, pages 1247–1250. ACM, 2008.

[4] Silvia Calegari and Gabriella Pasi. Definition of user profilesbased on the yago ontology. In IIR, CEUR Workshop Proceed-ings, CEUR-WS. org, volume 704, 2011.

[5] Silvia Calegari and Gabriella Pasi. Personal ontologies: Gener-ation of user profiles based on the yago ontology. InformationProcessing & Management, 2012.

[6] LLC Clark & Parsia. Stardog triple store.http://www.stardog.com/.

[7] William M Darling, Michael J Paul, and Fei Song. Unsu-pervised part-of-speech tagging in noisy and esoteric domainswith a syntactic-semantic bayesian hmm. EACL 2012, 2012.

[8] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, andR. Harshman. Indexing by latent semantic analysis. Journal ofthe American society for information science, 41(6):391–407,1990.

[9] Maria Golemati, Akrivi Katifori, Costas Vassilakis, GeorgeLepouras, and Constantin Halatsis. Creating an ontology forthe user profile: Method and applications. In Proceedings ofthe First RCIS Conference, pages 407–412, 2007.

[10] Dominik Heckmann, Tim Schwartz, Boris Brandherm,Michael Schmitz, and Margeritta von Wilamowitz-Moellendorff. Gumo–the general user model ontology. InUser modeling 2005, pages 428–432. Springer, 2005.

[11] Akrivi Katifori, Costas Vassilakis, and Alan Dix. Using spread-ing activation through ontologies to support personal informa-tion management. Proc. of Common Sense Knowledge andGoal-Oriented Interfaces, 2008.

[12] Akrivi Katifori, Costas Vassilakis, and Alan Dix. Ontologiesand the brain: Using spreading activation through ontologiesto support personal interaction. Cognitive Systems Research,11(1):25–41, 2010.

[13] Georgia Koutrika and Yannis Ioannidis. A unified user profileframework for query disambiguation and personalization. InProceedings of workshop on new technologies for personalizedinformation access, pages 44–53, 2005.

[14] Yuefeng Li and Ning Zhong. Mining ontology for auto-matically acquiring web user information needs. Knowledgeand Data Engineering, IEEE Transactions on, 18(4):554–568,2006.


[15] Fang Liu, Clement Yu, and Weiyi Meng. Personalized websearch by mapping user queries to categories. In Proceedingsof the eleventh international conference on Information andknowledge management, pages 558–565. ACM, 2002.

[16] Pasquale Lops, Marco Degemmis, and Giovanni Semeraro.Improving social filtering techniques through wordnet-baseduser profiles. In User Modeling 2007, pages 268–277.Springer, 2007.

[17] George A. Miller et al. Wordnet: a lexical database for english.Communications of the ACM, 38(11):39–41, 1995.

[18] Pavel Moravec, Michal Kolovrat, and Vaclav Snasel. Lsi vs.wordnet ontology in dimension reduction for information re-trieval. Databases, Texts, page 18, 2004.

[19] Ph Mylonas, David Vallet, Pablo Castells, Miriam Fernández,and Yannis Avrithis. Personalized information retrieval basedon context and ontological knowledge. The Knowledge Engi-neering Review, 23(01):73–100, 2008.

[20] R. Navigli. Word sense disambiguation: A survey. ACM Com-puting Surveys (CSUR), 41(2):10, 2009.

[21] R. Navigli. A quick tour of word sense disambiguation, in-duction and related approaches. SOFSEM 2012: Theory andPractice of Computer Science, pages 115–129, 2012.

[22] Ian Niles and Adam Pease. Towards a standard upper ontology.

In Proceedings of the international conference on Formal On-tology in Information Systems-Volume 2001, pages 2–9. ACM,2001.

[23] Gerard Salton and Michael J McGill. Introduction to moderninformation retrieval. 1986.

[24] Elie Sanchez. Fuzzy Logic and the Semantic Web, volume 1.Elsevier Science, 2006.

[25] Giovanni Semeraro, Marco Degemmis, Pasquale Lops, andPierpaolo Basile. Combining learning and word sense dis-ambiguation for intelligent user profiling. In Proceedings ofthe 20th international joint conference on Artifical intelligence,pages 2856–2861. Morgan Kaufmann Publishers Inc., 2007.

[26] Giovanni Semeraro, Pasquale Lops, and Marco Degemmis.Wordnet-based user profiles for neighborhood formation in hy-brid recommender systems. In Hybrid Intelligent Systems,2005. HIS’05. Fifth International Conference on, pages 6–pp.IEEE, 2005.

[27] Lotfi A. Zadeh. Fuzzy sets, information and cotrol, 8 (3): 338-353, 1965.

[28] Ning Zhong. Representation and construction of ontologiesfor web intelligence. International Journal of Foundations ofComputer Science, 13(04):555–570, 2002.

IOS Press Ontology-based User Proﬁle Learning from ...¬ned 0 (0) 1 1 IOS Press Ontology-based...

Documents

Transcript of IOS Press Ontology-based User Proﬁle Learning from ...¬ned 0 (0) 1 1 IOS Press Ontology-based...