A Framework for Geographical Interestingness Scoping and its Application to Web Advertising

Ruth Huang Miller1, Chun Sheng Chen2, Christoph F. Eick2, and Abraham Bagherjeiran3

University of Detroit Mercy Department of Mathematics and

Computer Science, Detroit, MI 48221

[email protected]

University of Houston Department of Computer Science

Houston, TX 77204-301

(lyons19, [email protected])

Yahoo! Labs Santa Clara, CA

USA

[email protected]

ABSTRACT

Predicting whether a particular user clicks on a particular Ad is of critical importance for Internet advertising. Associations between an Internet Ad’s click-through rate (CTR) and demographic data may be very weak at the global level but strong at the regional level. Identifying regions with strong associations of a continuous performance attribute with other factors is important for geo-targeted advertising. In this paper, we propose a novel framework for interestingness scoping that identifies such regions by iteratively merging zip code areas, maximizing a plug-in, user-defined interestingness function. We applied our framework to a geo-spatial data set that combines data from a major Ad network with demographic data from the 2000 Census. In the experimental results, our framework found strong global associations between CTR and educational level, and regions of strong positive and negative associations between CTR and income that do not exist globally. Given that the behavior of the population is the driving force determining CTR, it follows that regional variations in population demographics would account for differences in CTR in those regions. These results can, in turn, be utilized by the Ad provider to improve its geo-targeted advertising in those regions.

Categories and Subject Descriptors

I.5.3 [Pattern Recognition]: Clustering – algorithms

General Terms

Algorithms, Performance, Economics.

Keywords

Contextual Advertising, Behavioral Targeting, Spatial Data Mining

1. INTRODUCTION

Online advertising has become a very important component of our economy. Online Ad revenues totaled $17 billion in 2007, $22 billion in 2008, and $24 billion in 2009, and online advertising is growing significantly in 2010; current estimates suggest a 20% growth rate in revenues for 2010 [1]. Predicting whether a particular user clicks on a particular Ad is therefore of critical importance for the success of online advertising companies.

Ad networks typically utilize a pay-per-click model for generating revenues for displayed Ads; consequently, more clicks mean more revenue for the Ad network. When a user navigates to a webpage that serves Ads, the event of viewing the webpage is a page impression. The viewing of a specific advertisement on a webpage is an Ad impression. For every page impression, the system retrieves a set of candidate Ads from an Ad index/database, which is often run by an Ad network. Ad networks use a click model to rank the set of selected Ads based upon their click probabilities. The click probability of an Ad is determined by a number of factors, including the degree of context matching between the Ad and the page, user profile information, and other aggregated geo-spatial and economic data. After the Ads are ranked, they are selected for display in Ad slates, where an Ad slate is a set of Ads that are grouped together in the same region of the page; there can be more than one Ad slate on a page. The primary objective of the Ad network is to increase the click through rate (CTR), which is the ratio of the number of Ad clicks to the number of Ad impressions. In contrast, the primary objective of the advertiser is to increase the conversion rate, which is the ratio of the number of Ad conversions to the number of Ad impressions, where an Ad conversion means that the user, after clicking on the Ad, has taken some action that benefits the advertiser.
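The CTR definition and the ranking step described above can be illustrated with a minimal Python sketch. The `AdStats` record, its field names, and the `rank_ads` helper are hypothetical names chosen for illustration; the click probabilities are assumed to come from some click model that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class AdStats:
    ad_id: str                # hypothetical identifier
    impressions: int          # number of Ad impressions
    clicks: int               # number of Ad clicks
    click_probability: float  # assumed output of a click model


def click_through_rate(clicks: int, impressions: int) -> float:
    """CTR = Ad clicks / Ad impressions (0.0 if the Ad was never shown)."""
    if impressions == 0:
        return 0.0
    return clicks / impressions


def rank_ads(candidates: list[AdStats], slate_size: int) -> list[AdStats]:
    """Order candidate Ads by descending click probability and keep one slate."""
    ranked = sorted(candidates, key=lambda ad: ad.click_probability, reverse=True)
    return ranked[:slate_size]
```

Returning 0.0 for an Ad without impressions is a convention of this sketch, not something the text prescribes.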

It has been pointed out in the geographical literature that global, whole map statistics seldom provide useful insight and that most relationships in spatial datasets are geographically regional, rather than global [28]. Goodchild observed [29] that “efforts toward a scientific geography …had always assumed … that principles discovered in one part of the world should also be true in other parts. Idiographic geography had been disparaged as descriptive journalism, yielding nothing universal.” In general, regional knowledge has a scope that is the area on a map where this knowledge is valid; if it were valid on the whole map, it would be global knowledge. The internet advertising community is already aware of the importance of regional knowledge in that it uses geo-targeting, delivering different content to internet users based on their location.

Unfortunately, current data mining techniques are ill-suited for discovering regional knowledge for Internet advertising. The goal of this paper is to alleviate this problem. It centers on a particular aspect of the user-Ad-click prediction problem: how does the geo-location of a user impact the probability that this user clicks on a particular Ad? In general, the paper presents computational frameworks that mine spatial patterns of clicking and other internet behavior. In particular, a generic spatial data mining framework and algorithms are presented and evaluated that find interestingness hotspots with respect to a performance attribute; interestingness hotspots are contiguous spatial regions that show interesting associations of the performance attribute with other factors. As we will demonstrate in the paper’s case studies, the regional knowledge obtained in this way can, in turn, be utilized by Ad providers to improve their geo-targeted advertising in those regions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Conference’10, Month 1–2, 2010, City, State, Country. Copyright 2010 ACM 1-58113-000-0/00/0010…$10.00.

The main contributions of the paper include:

1. It proposes a generic framework to find spatial interestingness hotspots for a performance attribute with respect to a particular context.

2. Novel algorithms are introduced that compute interestingness hotspots.

3. The presented framework and algorithms are evaluated in challenging case studies in which we mine for interesting regional associations of CTR with demographic information.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 presents our framework for identifying geographical interestingness hotspots, proposes different types of interestingness scoping algorithms, and defines various interestingness functions that capture different aspects of the performance attribute. Section 4 presents experimental evaluation results. Section 5 briefly discusses how to use these results to better target users. Section 6 presents our conclusions.

2. RELATED WORK

2.1 Hot Spot Discovery in Spatial Statistics

In [13], the detection of hot spots using a variable resolution approach was investigated in order to minimize the effects of spatial superposition. The definition of hotspots was extended in [21] using circular zones for multiple variables. In [17, 22], a popular method to find hot spots in spatial datasets relying on the G* Statistic was proposed. The G* Statistic detects local pockets of spatial association. The value of G* depends upon an a priori given scale of the pockets and is calculated for each object individually. Visualizing the results of G* calculations graphically reveals hot spots and cold spots. However, it should be noted that such aggregates are not formally defined clusters, as the G* approach has no built-in clustering capabilities.

2.2 Spatial Co-location pattern discovery

Shekhar et al. discuss several interesting approaches to mine co-location patterns, which are subsets of Boolean spatial features whose instances are frequently located together in close proximity [23, 25, 26]. Huang et al. proposed co-location mining involving rare events [18]. In [19], Huang and Zhang explored the relations between clustering and co-location mining. Instead of clustering spatial objects, the features of spatial objects are clustered using a proximity function that is designed to find co-locations. However, it should be stressed that all the approaches mentioned above are restricted to categorical datasets and center on finding global co-location patterns, whose scope is the whole dataset. Localized association rule mining [11] takes a similar approach to ours, but it discovers association rules that hold in local clustered basket data, and thus is limited to non-spatial basket datasets.

2.3 Finding associations between continuous variables

Most of the approaches to mine association rules in datasets containing continuous attributes use discretization. In [24], numerical attributes are discretized and then adjacent partitions are combined as necessary. This leads to information loss and can generate spurious rules. Aumann et al. [12] introduce numerical association rules that support statistical predicates for continuous attributes, such as variance, and algorithms that mine such rules. In [14], rank correlation is used to mine associations between numerical attributes. Basically, continuous attributes are transformed to ordinal attributes, and a method is proposed to find sets of numerical attributes with high attribute value associations. Achtert [10] and Jaroszewicz [20] propose different methods for deriving equations describing relationships between continuous variables in datasets.

2.4 Data Mining and Machine Learning

There are many data mining and machine learning open source projects available. The majority of them are implemented in Java for platform independence (e.g. Weka [13] and RapidMiner [11]) or in C/C++ for computational efficiency (e.g. malibu [31] and Orange [30]). Weka is a collection of machine learning algorithms for data mining tasks. It is one of the most popular open source software tool bundles and currently contains more than 150 algorithms. Developers can either add an algorithm in Weka or integrate Weka’s classes into another system. RapidMiner provides a GUI for representing experiments in a tree-like structure called the operator tree and uses a multi-layered data view which represents data in memory similar to a database system view. Orange is a framework for machine learning and data mining that is implemented in C++ and Python. A graphical user interface is provided through a set of building blocks called widgets. Different open source tools have different goals. For example, RapidMiner focuses on fulfilling needs of end users and provides a GUI for the users. Unlike end-users, researchers are more interested in code validation, libraries/tools that aid algorithm construction, ease of implementing new algorithms, template structures, and code reusability. However, these approaches all focus on finding frequent global patterns that characterize the complete dataset.

[Figure 1 diagram labels: Performance Variable; Context; Interestingness Function; Interestingness Scoping Tool; Interesting Regions and their Rewards; Post Processing & Visualization Tool; Interestingness Visualization on Map]

Figure 1. Spatial Interestingness Scoping for a Continuous Performance Variable

Global statistics seldom provide useful insight, and most relationships in spatial datasets are geographically regional. Our approach centers on discovering regions and regional co-location patterns whose scope is a subset of the whole dataset. Our framework, developed by the UH Data Mining and Machine Learning Group [16] and based upon Cougar^2 [15], is a novel approach to mining co-location patterns with respect to continuous variables in spatial datasets. The framework operates in the continuous domain and views regional co-location mining as a clustering problem in which an externally given fitness function has to be maximized. Cougar^2 provides tools not only to facilitate development but also to enhance experiment configuration. So far, nine region discovery algorithms (four representative-based, three agglomerative, one divisive, and one density-based) have been designed and implemented in our past work. CLEVER [6, 8] is a representative-based clustering algorithm that uses representatives and randomized hill climbing. MOSAIC [5, 9] is an agglomerative clustering algorithm that greedily merges neighboring clusters to maximize a fitness function. SCMRG [2, 4, 7] is a divisive grid-based clustering algorithm that uses multi-resolution grids to identify promising regions. Finally, SCDE [9] is a density-based clustering algorithm implemented in Cougar^2 that does not rely on any plug-in fitness function, but operates using density functions instead. In addition to the clustering algorithms, many fitness functions are implemented in Cougar^2; they serve different knowledge discovery tasks, such as co-location mining [3, 8], hotspot discovery [2], and regional correlation [5], and are reusable across clustering algorithms.

However, to date there is little work on applying these data mining techniques to multiple geo-spatial data sets to identify co-location patterns that could help derive generalized regional user profiles for geo-targeted advertising in geo-spatial/regional hotspots.

3. A FRAMEWORK FOR IDENTIFYING GEOGRAPHICAL INTERESTINGNESS HOTSPOTS

Identifying geo-spatial location in the US has been made possible by the public availability of IP-address Geo-location databases that associate each IP address with a global location; these databases are highly accurate and updated monthly. While geo-spatial location is a much more difficult problem in other parts of the world, techniques (mostly proprietary) have been developed by the major Ad networks to alleviate this problem and significantly improve accuracy. Given that it is feasible, and relatively easy, to identify the geo-spatial location of a user (at least to the user’s ISP location), the question then becomes one of identifying spatial interestingness with respect to a performance variable/metric. Addressing this question will be the focus of the remainder of this section.

3.1 A Framework for Spatial Interestingness Scoping of a Performance Variable in Different Contexts

Because our focus is geo-targeting, our goal is to mine spatial patterns of a performance measure in different contexts in a predefined space. Spatial patterns have an interestingness scope which consists of spatial regions of interestingness; regions are contiguous areas in the predefined space. A performance measure estimates quantities of a continuous, non-spatial attribute of interest for the success of a business entity, such as average click-through rate (CTRavg), number of items sold (related to conversion rate), or average profits (based upon actual income less the actual cost for the advertisement). Contexts define dimensions and types of associations with respect to performance variables that are analyzed. For example, we might be interested in finding interesting associations between click-through rate (the performance variable) and income (the context under which the performance variable is analyzed) that are assessed by evaluating correlations between the two variables. Interestingness of associations is captured in interestingness functions that assign a reward to a particular region based on its interestingness.

More formally, we assume a spatial dataset O is given in which objects o∈O are characterized by:

• The performance attribute p

• A set of spatial attributes S

• A set of binary attributes B

• A set of continuous attributes C

B∪C defines the contexts under which the performance attribute p is analyzed. Moreover, we assume a spatial neighboring relationship N is given

N⊆O×O

that describes which objects belonging to O are neighboring. N is usually computed using the spatial attributes S of the objects in O.

Finally, we assume that we have an interestingness measure

i: 2^O → {0}∪ℜ+

that assesses the interestingness of subsets of the objects in O by associating rewards with particular contexts. Moreover, we assume an interestingness threshold θ is given that defines which patterns are interesting. In general, i measures the interestingness of the performance variable p, usually in some context¹ X⊆B∪C.

¹ Occasionally, we just analyze the performance attribute itself, in which case X=∅.

The goal of this research is to develop frameworks and algorithms that find interestingness hotspots H⊆O; H is an interestingness hotspot with respect to i if the following 2 conditions are met:

(1) i(H) ≥ θ

(2) H is contiguous with respect to N; that is, for each pair of objects (o,v) with o,v∈H, there has to be a path from o to v that traverses neighboring objects (with respect to N) belonging to H.

In summary, interestingness hotspots H are contiguous regions in space that are interesting (i(H) ≥ θ). Moreover, we call H a global interestingness hotspot if i(O)≥θ; otherwise, we call H a regional interestingness hotspot with respect to O.
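The two hotspot conditions can be checked mechanically. The following Python sketch is one possible illustration, assuming that the neighboring relationship N is stored as a symmetric dictionary mapping each object to the set of its neighbors; the function names and this representation are assumptions of the sketch, not part of the framework.

```python
from collections import deque


def is_contiguous(region, neighbors):
    """Condition (2): every pair of objects in `region` is connected by a
    path that only traverses neighboring objects belonging to `region`."""
    region = set(region)
    if not region:
        return False
    start = next(iter(region))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search restricted to the region
        current = queue.popleft()
        for other in neighbors.get(current, ()):
            if other in region and other not in seen:
                seen.add(other)
                queue.append(other)
    return seen == region


def is_hotspot(region, neighbors, interestingness, theta):
    """Conditions (1) and (2): i(H) >= theta and H is contiguous w.r.t. N."""
    return interestingness(region) >= theta and is_contiguous(region, neighbors)
```

Any callable that maps a set of objects to a non-negative reward can be passed as `interestingness`.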

In general, in this paper we will introduce algorithms that solve 3 different problems:

1. Given O, N, a set of interestingness functions I={i1,…,ik}, and θ, compute global interestingness hotspots for i∈I.

2. Given O, N, i, and θ, compute regional interestingness hotspots H with respect to i.

3. Given O, N, a set of interestingness functions I={i1,…,ik}, and θ, compute regional interestingness hotspots with respect to i∈I.

In summary, in this paper, we introduce computational techniques that compute interestingness hotspots with respect to particular associations of a performance measure with a particular context. Interestingness hotspots are contiguous areas in space for which the interestingness function i assigns a reward w≥θ, indicating “news-worthy” regional associations between the performance measure and the context under which it is analyzed. Figures 4 through 9 depict such regions.

3.2 Interestingness Measures

In this section, specific interestingness measures i will be introduced, some of which will be used later in the CTR case studies that will be discussed in Section 4. The simplest interestingness measure we can think of is one that directly uses the value of the performance attribute p, which we call ip in the following:

Let H⊆O

$$i_{p}(H) = \frac{\sum_{h \in H} h.p}{|H|}$$

where |H| denotes the cardinality of H.

Another interestingness measure ip|Y analyzes the performance variable within a set of binary contexts Y⊆B:

$$i_{p|Y}(H) = \frac{\sum_{h \in H \,\wedge\, \forall y \in Y:\ h.y = \mathrm{true}} h.p}{\left|\{h \mid h \in H \wedge \forall y \in Y:\ h.y = \mathrm{true}\}\right|}$$

Whereas ip takes an average of the performance attribute for all objects in H, ip|Y restricts the average computation to objects in H for which all binary attributes in Y are true. Moreover, when mining with respect to ip|Y we frequently require an additional support threshold s. If this approach is chosen, H is considered to be an interestingness hotspot with respect to ip|Y only if the following three conditions are met:

(1) i(H) ≥ θ

(2) H is contiguous with respect to N; that is, for

each pair of objects (o,v) with o,v∈H, there has to be a path from o to v that traverses neighboring objects (with respect to N) belonging to H.

(3) |{h| h∈H∧∀y∈Y h.y=true}|≥s
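A minimal Python sketch of the two averaging measures and the support condition follows. It assumes, purely for illustration, that each object is a dictionary with the performance attribute under the key "p" and one boolean entry per binary attribute; returning 0.0 when no object satisfies Y is also an assumption of this sketch.

```python
def i_p(region):
    """Average of the performance attribute p over all objects in H."""
    return sum(h["p"] for h in region) / len(region)


def matching(region, Y):
    """Objects of H for which every binary attribute in Y is true."""
    return [h for h in region if all(h[y] for y in Y)]


def i_p_given_Y(region, Y):
    """Average of p restricted to the objects of H that satisfy all of Y.
    Returns 0.0 if no object qualifies (an assumption of this sketch)."""
    selected = matching(region, Y)
    if not selected:
        return 0.0
    return sum(h["p"] for h in selected) / len(selected)


def has_support(region, Y, s):
    """Condition (3): at least s objects of H satisfy all of Y."""
    return len(matching(region, Y)) >= s
```

With Y empty, `i_p_given_Y` reduces to `i_p`, which matches the case X=∅ in which only the performance attribute itself is analyzed.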

Other interestingness measures can be defined that evaluate associations of the performance variable p with a continuous variable c∈C. For example, we could be interested in finding regions with interesting correlations between CTR and income, that is, correlations whose absolute value is above 0.7. To accomplish this, we define the interestingness function icorr|c⊗p as follows:

Let H⊆O, let c∈C be a continuous variable, and let corr(H, p, c) denote the correlation between the performance attribute p and the continuous attribute c over the objects in H (such as the correlation between CTR and income in H in the example above); then:

icorr|c⊗p (H) := |corr(H, p, c)|

icorr|c⊗p rewards strong positive or negative correlations more than correlations that are close to 0. Moreover, in general, different interestingness hotspots can still be ordered by the reward they receive; for example, a hotspot with correlation -0.9 would be considered more interesting than another hotspot with correlation 0.8, because the former region receives a reward of 0.9 whereas the latter region obtains a reward of 0.8 from icorr|c⊗p. Finally, regions H whose interestingness is below θ are considered to be outliers and will not be included in the results of the interestingness scoping framework we propose.
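The correlation-based measure can be sketched in Python as shown below. The text does not say which correlation coefficient corr(H, p, c) denotes; this sketch assumes the Pearson coefficient, represents objects as dictionaries with the performance attribute under "p", and returns 0.0 when one of the attributes is constant in the region. All three choices are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant attribute yields no reward
    return cov / math.sqrt(var_x * var_y)


def i_corr(region, c):
    """Reward |corr(H, p, c)| for the continuous attribute named c."""
    return abs(pearson([h["p"] for h in region], [h[c] for h in region]))
```

Because the reward is the absolute value, a region with correlation -0.9 ranks above one with correlation 0.8, as in the example above.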

3.3 Algorithms that Compute Interestingness Hotspots

We categorize algorithms that compute interestingness hotspots into the following 3 groups:

1. Sampling algorithms, in particular circular region sampling algorithms that randomly create candidate regions by wrapping a circle of radius r around a point p and then employ a post-processing algorithm that selects interestingness hotspots from the regions generated in this way.

2. Agglomerative Growing Algorithms that start with a region R={o} containing a single object o∈O and search for hotspots by adding objects in the neighborhood of o to R.

3. Spatial Clustering Algorithms that partition O into contiguous, disjoint regions H1,…, Hk, maximizing the sum of the rewards that i assigns to the regions.

The circle sampling algorithms restrict themselves to circular regions and intelligently select the point p and the radius r to be explored next by using reinforcement learning. The learner uses feedback from past experience, namely the quality of hotspots created earlier and the potential overlap of newly created regions with regions created in the past, while trying to strike the right balance between exploration and exploitation.

Two groups of agglomerative growing algorithms can be distinguished; both create interestingness hotspots Go agglomeratively by adding neighboring objects to {o}:

a. Algorithms that try to obtain interestingness hotspots that maximize the size |Go| of the obtained region.

b. Algorithms that maximize the interestingness i(Go) of the obtained region.

A problem that both types of algorithms face is that they create many overlapping interestingness hotspots, which need to be post-processed to obtain a final set of hotspots. Post-processing algorithms that can be used for this purpose have been proposed in [27].

The third group of algorithms we propose for interestingness scoping are spatial clustering algorithms that partition the space of O into interesting, contiguous regions R={H1,…,Hk} that maximize the sum of rewards obtained from i for the regions in R; more specifically:

Given:

A dataset O

A distance function d defined on instances of O

An interestingness function i(H)

Objective:

Find H1,…,Hk ⊆O such that:

1. Hi∩Hj=∅ if i≠j

2. R={H1,…,Hk} maximizes Reward(R)= Σj i(Hj)

3. All hotspots Hi∈R are contiguous

4. H1∪…∪Hk ⊆ O (not all objects in O need to belong to an interesting region)
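The objective above can be made concrete with a small Python sketch that scores candidate region sets. It covers only the disjointness constraint (1) and the reward sum (2); contiguity would be checked separately. The helper names, and the idea of comparing a handful of candidate region sets rather than searching the full space, are assumptions of this illustration and do not reproduce CLEVER, MOSAIC, or SCMRG.

```python
from itertools import combinations


def are_disjoint(regions):
    """Constraint 1: no object belongs to two different regions."""
    return all(set(a).isdisjoint(b) for a, b in combinations(regions, 2))


def total_reward(regions, interestingness):
    """Reward(R) = sum of i(Hj) over all regions Hj in R."""
    return sum(interestingness(H) for H in regions)


def best_region_set(candidates, interestingness):
    """Among candidate region sets that satisfy constraint 1,
    return the one with the highest total reward (None if none is valid)."""
    valid = [R for R in candidates if are_disjoint(R)]
    if not valid:
        return None
    return max(valid, key=lambda R: total_reward(R, interestingness))
```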

So far, nine region discovery algorithms (four representative-based, three agglomerative, one divisive, and one density-based) have been designed and implemented in our past work [2, 3, 4]. Some of the spatial clustering algorithms used in our past research, in particular CLEVER [8], MOSAIC [9], and SCMRG [2, 4], will be tailored and adapted for the task of interestingness scoping.

3.3.1 Hotspot Discovery Algorithm

The algorithm for finding regions based upon zip codes had to be refined, considering that zip codes are non-uniform polygons that often do not approximate a circle or regular polygon. Our algorithm finds regions where the local interestingness is above a defined threshold. In this paper, the interestingness is defined by the correlation between the CTR and the census data in a region. Since the finest granularity of the CTR and the census data is the zip code level, given an initial region the algorithm walks through all the neighboring zip codes that share a boundary with the initial region. A neighboring zip code is merged into the initial region if the interestingness after merging is above the threshold. Because the merging step changes the scope of the initial region, the list of neighboring zip codes of the new region must be updated. The algorithm stops once it has walked through all the zip codes in the neighbor list and none of them can be merged.

Because it is not possible to compute the correlation between the CTR and the census data for only one zip code, the initial region must consist of multiple contiguous zip codes. We randomly generated an initial region of n zip codes (where n varies as a parameter of the algorithm) as the algorithm’s starting point. The algorithm was run many times with different initial regions. Hotspots/coldspots are reported after multiple runs of the algorithm. Figure 2 shows the pseudo code of the algorithm.
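The growing procedure described above can be sketched in Python as follows. This is one reading of the description, not the authors' implementation: zip code adjacency is assumed to be given as a dictionary from zip code to the set of adjacent zip codes, the interestingness function F is assumed to accept a set of zip codes, and the neighbor list is rebuilt from scratch after every merge.

```python
def region_neighbors(region, neighbors):
    """Zip codes adjacent to the region but not part of it."""
    adjacent = set()
    for z in region:
        adjacent |= neighbors.get(z, set())
    return adjacent - region


def grow_hotspots(F, initial_regions, neighbors, t):
    """Grow each initial region by merging adjacent zip codes while the
    interestingness F of the merged region stays above the threshold t."""
    hotspots = []
    for seed in initial_regions:
        region = set(seed)
        if F(region) <= t:
            continue  # the initial region itself is not interesting
        frontier = region_neighbors(region, neighbors)
        while frontier:
            candidate = frontier.pop()
            if F(region | {candidate}) > t:
                region.add(candidate)
                # merging changes the boundary, so rebuild the neighbor list
                frontier = region_neighbors(region, neighbors)
        hotspots.append(region)
    return hotspots
```

The loop ends once every zip code in the current neighbor list has been tried without a merge, which matches the stopping rule stated above. Because candidates are tried in arbitrary order, different runs may yield different regions when several merges are admissible.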

3.4 Collocation Interestingness Functions for Associations of a Performance Variable with Point Objects of a Particular Type

Let PO={po1,…,pon} be a set of point objects. Let η(p,lo,la) be a function that associates a click-through rate with a particular point in space (lo,la). To identify collocation patterns for a specific point object ps, we wrap an almost circular polygon of radius r (e.g. we could choose r as half of the average of the 10-nearest neighbor distances of the objects in PO) around each point object ps in PO, and compute the number of objects of PO in the circle circle(ps). We then compute the average click-through rate for circle(ps) relying on polygon-area weighted interpolation. The result is a dataset containing tuples of the following form:

(longitude, latitude, #objects, CTR)

Now, using the techniques we introduced before, we can mine hotspots with respect to the performance variable #objects*CTR, or we could analyze the correlation between the 3rd and 4th variables.
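The counting step for the circles around point objects can be illustrated with a short Python sketch. It makes several simplifying assumptions that are not in the text: points are treated as planar coordinates with Euclidean distance, a true circle replaces the almost circular polygon, the center point counts as lying in its own circle, and the radius heuristic averages the distances from every point to its k nearest other points. The area-weighted CTR interpolation is not shown.

```python
import math


def count_in_circle(center, points, r):
    """Number of point objects within distance r of `center`
    (the center itself is counted if it is one of the points)."""
    return sum(1 for q in points if math.dist(center, q) <= r)


def default_radius(points, k=10):
    """Half of the average distance from each point to its k nearest
    other points (k is capped at len(points) - 1)."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    k = min(k, len(points) - 1)
    total = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += sum(dists[:k])
    return 0.5 * total / (k * len(points))
```

For longitude/latitude data, a geodesic distance would be a more faithful choice than `math.dist`.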

Alternatively, we could also use the CLT-zip-codes as our reference frame and compute the number of objects in PO in each zip-code polygon and obtain tuples:

(zip-code, CTR, PO-density)

with PO-density= #PO-Objects-in-zip / area(zip-code). We can then define interestingness hotspots as follows:

i(r) :=

IF CTR > avg(CTR) AND PO-Density > avg(PO-Density)

THEN (CTR - avg(CTR))*(PO-Density - avg(PO-Density))

ELSE 0

Input: an interestingness function F, a list of n initial zip regions zlist, interestingness threshold t

    Set HotspotList := empty
    Set NeighborList := empty
    For each region z in zlist {
        If (F(z) > t) {
            Find neighbor zip codes of z and add to the NeighborList;
            While (size of NeighborList > 0) {
                Remove one zip code M from NeighborList;
                If (F(M+z) > t)
                    Merge M to z and update the NeighborList
            }
            Add z to HotspotList;
        }
    }
    Return HotspotList;

Figure 2. Hotspot Discovery Algorithm
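The density-based interestingness function i(r) defined in Section 3.4 translates directly into code. The sketch below is a minimal Python rendering; the function and parameter names are chosen for illustration, and the averages are assumed to be computed beforehand over all zip code regions.

```python
def po_density(num_objects, area):
    """PO-density = number of PO objects in the zip code / area of the zip code."""
    return num_objects / area


def density_interestingness(ctr, density, avg_ctr, avg_density):
    """Reward regions whose CTR and PO-density are both above average;
    every other region receives a reward of 0."""
    if ctr > avg_ctr and density > avg_density:
        return (ctr - avg_ctr) * (density - avg_density)
    return 0.0
```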

3.5 Zip Code Regions

In the zip code data, the first digit, 0-9, designates the general area of the country, as shown in Table 1, with numbers starting lower in the east and increasing as you move west. For example, 0 covers Maine while 9 covers California. The next two digits in a zip code refer to one of the 455 Sectional Center Facilities (SCFs) in the US. Figure 3 shows how the zip codes are organized throughout the United States.
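The zip code structure described above can be captured in a few lines of Python. The helper names are hypothetical; zip codes are handled as strings so that leading zeros are preserved.

```python
def _check(zip_code: str) -> None:
    """Reject anything that is not a 5-digit zip code string."""
    if len(zip_code) != 5 or not (zip_code.isascii() and zip_code.isdigit()):
        raise ValueError(f"not a 5-digit zip code: {zip_code!r}")


def region_digit(zip_code: str) -> int:
    """First digit: the general area of the country (see Table 1)."""
    _check(zip_code)
    return int(zip_code[0])


def two_digit_region(zip_code: str) -> str:
    """First two digits, as used for the 2 digit zip code regions."""
    _check(zip_code)
    return zip_code[:2]


def scf_prefix(zip_code: str) -> str:
    """First three digits: region digit plus the two SCF digits."""
    _check(zip_code)
    return zip_code[:3]
```

For example, `region_digit("48221")` yields 4 and `region_digit("77204")` yields 7, the Table 1 regions that contain Michigan and Texas.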

4. EXPERIMENTAL EVALUATION: MINING GEO-SPATIAL PATTERNS OF CLICKING BEHAVIOR

This section describes the results of our offline data mining experiments.

4.1 Data Sets

Data from two separate sources were combined to obtain the data sets used in the experiments. The source for the first set of data is the Yahoo! click log data for the top 5 Yahoo! domains with rank 1 Ad data. For this data, we set a reasonable minimum threshold of 1000 impressions and 100 clicks for each zip code. The source for the second set of data is the US Census 2000². Thus, the combined dataset contains records with the set of fields shown in Table 2, including the 5 digit zip code, the performance variables number of impressions, number of clicks, and CTR, and other attributes such as per capita income, race, and education level. The original dataset contained 38,000 zip codes, but after applying the thresholds, the dataset was trimmed to 13,869 zip codes.

² The US Census 2000 data is publicly available and can be found at http://www.census.gov/main/www/cen2000.html

Table 1. Regions Defined by 1st Digit of Zip Code

1st digit of zip code | US Region
0 | Connecticut (CT), Massachusetts (MA), Maine (ME), New Hampshire (NH), New Jersey (NJ), Puerto Rico (PR), Rhode Island (RI), Vermont (VT), Virgin Islands (VI)
1 | Delaware (DE), New York (NY), Pennsylvania (PA)
2 | District of Columbia (DC), Maryland (MD), North Carolina (NC), South Carolina (SC), Virginia (VA), West Virginia (WV)
3 | Alabama (AL), Florida (FL), Georgia (GA), Mississippi (MS), Tennessee (TN)
4 | Indiana (IN), Kentucky (KY), Michigan (MI), Ohio (OH)
5 | Iowa (IA), Minnesota (MN), Montana (MT), North Dakota (ND), South Dakota (SD), Wisconsin (WI)
6 | Illinois (IL), Kansas (KS), Missouri (MO), Nebraska (NE)
7 | Arkansas (AR), Louisiana (LA), Oklahoma (OK), Texas (TX)
8 | Arizona (AZ), Colorado (CO), Idaho (ID), New Mexico (NM), Nevada (NV), Utah (UT), Wyoming (WY)
9 | Alaska (AK), American Samoa (AS), California (CA), Guam (GU), Hawaii (HI), Oregon (OR), Washington (WA)

Table 2. Fields in Combined Dataset used in Experiments

5 Digit Zip Code
Number of Impressions
Number of Clicks
CTR
Total Population
Total Population who are White
Total Population who are African American
Total Population who are American Indian
Total Population who are Asian
Total Population who are Hawaiian or Pacific Islander
Total Population who are Some other Race
Total Population who are 2 or more Races
Percent of Total Population who are White
Percent of Total Population who are African American
Percent of Total Population who are American Indian
Percent of Total Population who are Asian
Percent of Total Population who are Hawaiian or Pacific Islander
Percent of Total Population who are Some other Race
Percent of Total Population who are 2 or more Races
Per Capita Income
Percent of Total Population with Education up to 12th grade
Percent of Total Population with Education up to Bachelors Degree
Percent of Total Population with Education up to Masters Degree
Percent of Total Population with Education up to Ph.D. or Professional Degree
Percent of Total Population with Education higher than Masters Degree

Figure 3. Zip Code Zones for United States

In this dataset, the race field contains one of the following: White, Black, American Indian, Asian, Hawaiian and Pacific Islander, Some Other Race, and 2 or More Races. The data contains both the number of each race in the zip code, as well as the percentage of total population in the zip code of each race. The education field contains one of the following: Up to 12th Grade, Some College, BS, MS, Grad and Professional, BS and Beyond (which encompasses the last three categories). This data is limited to those who are 25 years of age or older, and is given as a percentage population in each category for the zip code. The age data provides the median age, median male age, and median female age for each zip code.
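The thresholding step from Section 4.1 can be sketched in Python as shown below. The record layout (dictionaries with the keys "zip", "impressions", and "clicks") is hypothetical, and treating the thresholds as inclusive lower bounds is an assumption of this sketch.

```python
def filter_zip_records(records, min_impressions=1000, min_clicks=100):
    """Keep zip codes that reach both thresholds and attach their CTR."""
    kept = []
    for rec in records:
        if rec["impressions"] >= min_impressions and rec["clicks"] >= min_clicks:
            # CTR = clicks / impressions; impressions is nonzero here
            kept.append({**rec, "ctr": rec["clicks"] / rec["impressions"]})
    return kept
```

The surviving records would then be joined with the census attributes on the zip code field before any interestingness function is applied.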

4.2 Experimental Results

Three sets of experimental analyses were performed using the general framework that we presented in Section 3. First, global analysis was performed for all the US 5 digit zip codes with respect to the performance variables: income, race, and education. Second, regional analysis was performed, looking at regions defined by the first two digits of the zip codes with respect to the same set of the performance variables. Third, regional analysis was performed based on our hotspot identification algorithm, MOSAIC, with respect to the performance variables.

4.2.1 Global Analysis

In the first set of experiments, the framework was applied to the global data set, without regard to regional or local interestingness, to see whether any correlations exist between the performance variables (impressions, clicks, and CTR) and the attributes (income, race, and education). For income and race, there were no interesting correlations with the number of Ad impressions, the number of clicks, or CTR. However, the interestingness score exceeded the threshold for education level. As shown in Table 3, there is a positive correlation, globally, with the percentage of the population with an education level up to 12th grade. There is also a negative correlation with the percentage of the population with an education level of Bachelor's or higher. This suggests that more highly educated users are less likely to click on Ads.

Our agglomerative region discovery algorithm identified hotspots and coldspots for correlation between education level and CTR. The agglomerative algorithm was initialized with 2000 randomly selected zip codes. Figure 4 shows the resulting hotspot map, and figure 5 shows the resulting coldspot map.
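The agglomerative merging step can be illustrated with a minimal Python sketch. It assumes that the interestingness of a region is the absolute Pearson correlation between one demographic attribute and CTR, weighted by region size, and that neighbor relations between zip codes are supplied as a symmetric adjacency map. Both choices, and all function and variable names, are assumptions made for illustration; this is not the MOSAIC algorithm [9] itself.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def region_correlation(region, data):
    """Signed correlation between the attribute and CTR inside one region.

    data maps a zip code to a pair (attribute value, CTR).
    """
    zips = sorted(region)
    return pearson([data[z][0] for z in zips], [data[z][1] for z in zips])


def reward(region, data):
    """Assumed plug-in interestingness: |correlation| weighted by region size."""
    return abs(region_correlation(region, data)) * len(region)


def adjacent(a, b, neighbors):
    """Two regions touch if a zip code in one neighbors a zip code in the other.

    neighbors is assumed to be a symmetric adjacency map.
    """
    return any(n in b for z in a for n in neighbors.get(z, ()))


def merge_regions(seeds, neighbors, data):
    """Greedily merge the adjacent pair with the largest reward gain.

    Stops when no merge increases the total reward.
    """
    regions = [frozenset(s) for s in seeds]
    while True:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(regions, 2):
            if not adjacent(a, b, neighbors):
                continue
            gain = reward(a | b, data) - reward(a, data) - reward(b, data)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            return regions
        a, b = best_pair
        regions = [r for r in regions if r not in (a, b)] + [a | b]
```

Under these assumptions, a merged region could then be labeled a hotspot or a coldspot by comparing `region_correlation` against a threshold such as the 0.7 and -0.7 values used in the captions of Figures 4 and 5.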

4.2.2 Regional Analysis

In this experiment, we selected our region boundaries based on

Table 3. Global Interestingness Score between Demographic Features and CTR

Performance variable: Education Level          Interestingness Score
Per Capita Income                              -0.0047
% pop > 25 yrs up to 12th Grade                 0.1460
% pop > 25 yrs up to BS                        -0.1259
% pop > 25 yrs up to MS                        -0.1213
% pop > 25 yrs up to Ph.D. or Prof. Degree     -0.0972
% pop > 25 yrs with Education Level > BS       -0.1168
% pop > 25 yrs with some College Education     -0.0678

Table 4. Regional Interestingness Score between Per Capita Income and CTR for 2 Digit Zip Code Regions

Region (2 Digit Zip Code)    Interestingness Score
41                            0.389
69                            0.307
38                            0.306
86                            0.284
87                            0.278
63                            0.270
80                            0.269
75                            0.256
17                            0.256
…                             …
79                           -0.204
51                           -0.250

Figure 4. Agglomerative Algorithm Hotspots for Correlation between Bachelor’s degree and CTR > 0.7

Figure 5. Agglomerative Algorithm Coldspots for Correlation between Bachelor’s degree and CTR < -0.7

the first 2 digits of the zip codes. These regions are loosely associated with state boundaries, as shown in Figure 3. The attributes income and race have hotspots within some of these regions. As shown in Table 4, there are 15 regions that have a positive correlation between CTR and per capita income (ranging from 0.25 to 0.39). There are also 2 regions that have a negative correlation between income and CTR (-0.204 and -0.250, as listed in Table 4). This indicates that for the remaining 82 regions, there was little correlation between per capita income and CTR. In Figure 6, we used our GUI tool to create a hotspot map of the US. From the colored image we can see that the regions with interesting scores are in the inland areas. For comparison, we also plotted the hotspots based on the 3-digit zip code regions for income and CTR (ranging from 0.50 to 0.95); the image is shown in Figure 7. As expected, the two images are similar. The interesting associations between income and CTR seem to be located away from the two coasts and away from areas with a high cost of living. A high cost of living could be an indicator of lower disposable income, which in turn leads to less overall spending and thus affects user interest in Ads.

Our agglomerative region discovery algorithm also identified hotspots and coldspots for the correlation between income and CTR. The agglomerative algorithm was initialized with 2000 randomly selected zip codes.

Figure 8 shows the resulting hotspot map, and Figure 9 shows the resulting coldspot map. For hotspots, the correlation threshold is set to 0.7; for coldspots, it is set to -0.7. These images show that our agglomerative algorithm was able to generate hotspots and coldspots with much higher correlations than the regions defined by zip code prefixes, and thus the discovered regions are more precise and significant.
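The prefix-based regional analysis can be sketched in a few lines of Python. The sketch assumes that the regional interestingness score is the Pearson correlation between per capita income and CTR over the zip codes sharing a prefix. The function names, the record layout, the minimum region size, and the default threshold of 0.25 are illustrative assumptions rather than the settings used in the experiments.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def prefix_region_scores(records, digits=2, min_size=3):
    """Score each zip code prefix region.

    records: iterable of (zip_code, per_capita_income, ctr) tuples,
    with zip_code given as a 5-character string (hypothetical layout).
    Regions with fewer than min_size zip codes are skipped.
    """
    groups = defaultdict(list)
    for zip_code, income, ctr in records:
        groups[zip_code[:digits]].append((income, ctr))
    scores = {}
    for prefix, rows in groups.items():
        if len(rows) < min_size:
            continue
        incomes, ctrs = zip(*rows)
        scores[prefix] = pearson(incomes, ctrs)
    return scores


def label_regions(scores, threshold=0.25):
    """Split regions into hotspots (score >= threshold) and coldspots (score <= -threshold)."""
    hot = sorted(p for p, s in scores.items() if s >= threshold)
    cold = sorted(p for p, s in scores.items() if s <= -threshold)
    return hot, cold
```

Calling `prefix_region_scores(records, digits=3)` would give the 3-digit variant used for comparison in Figure 7, under the same assumptions.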

5. Generalized User Profiling

We observed that user interest in an Ad varies by region. This may be due to a number of factors, including demographics and user behavior that is common within one set of regions but not in others. For example, users in Florida might behave differently than users in Alaska or in the Appalachian Mountains. Income seems to play a more important role in some regions than in others (i.e., its correlation with CTR is higher in those regions).

Geo-targeted advertising is already used by a number of Ad networks. Another method for predicting interest by a user in an Ad impression is to find Ads with characteristics that closely match the user’s interests or behavior, as is the case with behavioral targeting [34, 35, 36, 37]. Behavioral targeting attempts to develop user profiles that can predict user behavior, and therefore user interest, in particular Ads. Behavioral targeting’s goal is the delivery of selected ads to targeted users

Figure 6. Image of Hotspots for Income and CTR Based on 2 Digit Zip Code Regions

Figure 7. Image Hotspots for Income and CTR Based on 3 Digit Zip Code Regions

Figure 8. Agglomerative Algorithm Hotspots for Correlation between Income and CTR

Figure 9. Agglomerative Algorithm Coldspots for Correlation between Income and CTR

based on information collected on each individual user’s web search and browsing behaviors, including the pages they have visited and the web searches they have conducted. Behavioral targeting is generally used by Ad networks to improve the effectiveness of Ads (leading to higher CTRs) by matching each Ad to the users most likely to be interested in it, and each user to the most relevant Ads. This is achieved by segmenting users according to behavioral characteristics and then ranking segments for advertisements.

Developing user profiles for individual users is a daunting task, given that, in general, users surfing the web are anonymous. Moreover, behavioral targeting approaches do not work when a user’s historical activities are not available (e.g., the user is new to the system). Geo-targeting does not have this limitation, because the user’s approximate position can generally be derived from the IP address. Thus, demographic and socio-economic data from the US Census can be used to develop a regional profile for the general user within specific regions. When combined with Ad network click log data, this data gives a set of characteristics for an average user for each region, down to the level of a zip code. This regional knowledge about the average user can be used to derive a set of probable interests and behaviors for the users in that region. Thus, the US Census data allows the modeling of a typical user’s characteristics, or profile, through geo-spatial data for each region or locale. This, in turn, can be utilized by Ad networks to better geo-target Ads for this typical user. These behaviors can be understood using our interestingness scoping framework.
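As a rough illustration of how such a regional profile might be assembled, the following Python sketch aggregates click-log records per zip code, computes CTR as clicks divided by impressions, and attaches the census attributes for that zip code. The record layout, the attribute names, and the function name are hypothetical; the actual click-log and Census schemas are not described here.

```python
from collections import defaultdict


def regional_profiles(click_log, census):
    """Build one profile per zip code from click-log and census data.

    click_log: iterable of (zip_code, impressions, clicks) records.
    census: dict mapping zip_code to a dict of demographic attributes.
    Zip codes without census data or without impressions are skipped.
    """
    totals = defaultdict(lambda: [0, 0])  # zip_code -> [impressions, clicks]
    for zip_code, impressions, clicks in click_log:
        totals[zip_code][0] += impressions
        totals[zip_code][1] += clicks

    profiles = {}
    for zip_code, (impressions, clicks) in totals.items():
        if zip_code not in census or impressions == 0:
            continue
        profile = dict(census[zip_code])  # copy so the census data is not mutated
        profile["impressions"] = impressions
        profile["clicks"] = clicks
        profile["ctr"] = clicks / impressions
        profiles[zip_code] = profile
    return profiles
```

The resulting per-zip-code profiles are the kind of input that a region discovery step could consume, pairing each demographic attribute with the observed CTR.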

6. CONCLUSIONS

This paper has described the results of efforts to extract generalized user profile characteristics from demographic, economic, and geospatial information found in an Ad network’s click logs and in Census 2000 data, in order to identify new patterns of user behavior, with the goal of improving geo-targeted advertising and thereby CTR and conversion. Associations between CTR and demographic data may be very weak at the global level but strong at the regional level.

We presented our novel framework for interestingness scoping to identify hotspot/coldspot regions by iteratively merging zip code areas, maximizing a plug-in, user-defined interestingness function. We identified regions with strong associations of the performance attribute CTR with other factors.

In the experimental results, we identified global correlations between educational level and CTR, indicating that users with higher levels of education were less inclined to click on Ads. The experimental results also showed that our framework found regions of strong positive and negative associations between CTR and income that do not exist globally.

Given that the behavior of the population is the driving force determining CTR, it follows that those regional variations in population demographics would account for differences with respect to CTR in those regions. These results can, in turn, be utilized by the Ad provider to improve its geo-targeted advertising in those regions.

7. REFERENCES

[1] http://investor.google.com/earnings.html

[2] Ding, W., Eick, C. F., Wan, J., and Yuan, X. 2006. A Framework for Regional Association Rule Mining in Spatial Datasets. Proceedings of the IEEE International Conference on Data Mining (Hong Kong, China, 2006). ICDM’06. IEEE, 851-856.

[3] Ding, W., Jiamthapthaksin, R., Parmar, R., Jiang, D., Stepinski, T., and Eick, C. F. 2008. Towards Region Discovery in Spatial Data Sets. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (Osaka, Japan, May 2008). PAKDD’08.

[4] C. F. Eick, B. Vaezian, D. Jiang, and J. Wang. “Discovering of Interesting Regions in Spatial Data Sets Using Supervised Clustering.” Proceedings of the 10th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD’06), Berlin, Germany, 2006.

[5] Celepcikay, O. U., Eick, C. F., and Ordonez, C. 2009. Regional Pattern Discovery in Geo-Referenced Data using PCA.

[6] Celepcikay, O. U., and Eick, C. F. 2009. REG^2: A Regional Regression Framework for Geo-Referenced Datasets.

[7] C. F. Eick, N. Zeidat, and Z. Zhou. “Supervised Clustering – Algorithms and Benefits.” Proceedings of the International Conference on tools with AI (ICTAI’04), Boca Raton, FL, 2004.

[8] C. F. Eick, W. Ding, T. F. Stepinski, J. P. Nicot. “Finding Regional Co-location Patterns for Sets of Continuous Variables in Spatial Datasets.” Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS’08), Irvine, CA, 2008.

[9] Choo, J., Jiamthapthaksin, R., Chen, C. S., Celepcikay, O. U., Giusti, C., and Eick, C. F. 2007. MOSAIC: A proximity graph approach for agglomerative clustering. Proceedings of the 9th International Conference on Data Warehousing and Knowledge Discovery (Germany, 2007). DAWAK 2007.

[10] Achtert, E., Bohm, C., Kriegel, H., Kroger, P., and Zimek, A. 2006. Deriving quantitative models for correlation clusters. Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (Philadelphia, PA, August 2006). SIGKDD 2006. ACM, 4-13.

[11] Aggarwal, C. C., Procopiuc, C. M., and Yu, P. S. 2002. Finding localized associations in market basket data. IEEE Transactions on Knowledge and Data Engineering, 14, 51-62.

[12] Aumann, Y., and Lindell, Y. 1999. A Statistical Theory For Quantitative Association Rules. In Proc. of the 5th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining. KDD '99. ACM, New York, NY, 261-270.

[13] Brimicombe, A.J.: Cluster Detection in Point Event Data Having Tendency Towards Spatially Repetitive Events, 8th International Conference on GeoComputation, Ann Arbor, Michigan (2005).

[14] Calders, T., Goethals, B., and Jaroszewicz, S. 2006. Mining Rank-Correlated Sets of Numerical Attributes. In Proc. of the 12th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining. KDD '06. ACM, New York, NY, 96-105.

[15] Cougar^2: Data Mining and Machine Learning Framework, https://cougarsquared.dev.java.net/.

[16] Data Mining and Machine Learning Group, University of Houston, http://www.tlc2.uh.edu/dmmlg.

[17] Getis, A., and Ord, J. K. 1996. Local Spatial Statistics: an Overview. In Spatial analysis: modeling in a GIS environment, Cambridge, GeoInformation International. (Cambridge, 1996), 261–277.

[18] Huang, Y., Pei, J., and Xiong, H. 2006. Mining Co-Location Patterns with Rare Events from Spatial Data Sets. Geoinformatica 10 (3), 239-260.

[19] Huang, Y., and Zhang, P. 2006. On the Relationships between Clustering and Spatial Co-location Pattern Mining. In Proc. of the 18th IEEE Intl. Conf. on Tools with Artificial intelligence. ICTAI. IEEE Computer Society, Washington, DC, 513-522.

[20] Jaroszewicz, S. 2008. Minimum Variance Associations—Discovering Relationships in Numerical Data. In Proc. of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, (Osaka, Japan, May 2008). PAKDD ’08.

[21] Kulldorff, M. 2001. Prospective Time Periodic Geographical Disease Surveillance Using a Scan Statistic. Journal of the Royal Statistical Society Series A, 164, 6-72.

[22] Ord, J. K., and Getis, A. 1995. Local Spatial Autocorrelation Statistics: Distributional Issues and an Application. Geographical Analysis, 27(4), 286-306.

[23] Shekhar, S., and Huang, Y. 2001. Discovering Spatial Colocation Patterns: A Summary of Results. In Proc. of the 7th Intl. Symp. on Advances in Spatial and Temporal Databases, Springer-Verlag, London, 236-256.

[24] Srikant, R., and Agrawal, R. 1996. Mining Quantitative Association Rules in Large Relational Tables. SIGMOD Rec. 25(2), 1-12.

[25] Xiong, H., Shekhar, S., Huang, Y., Kumar, V., Ma, X., and Yoo, J. S. 2004. A Framework for Discovering Co-location Patterns in Data Sets with Extended Spatial Objects. In Proc. of SIAM Intl. Conf. on Data Mining (SDM).

[26] Yoo, J.S., and Shekhar, S. 2006. A Join-less Approach for Mining Spatial Co-location Patterns. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18.

[27] R. Jiamthapthaksin, C. F. Eick, and V. Rinsurongkawong, An Architecture and Algorithms for Multi-Run Clustering, in Proc. Computational Intelligence Symposium on Data Mining (CIDM), Nashville, Tennessee, April 2009.

[28] S. Openshaw. Geographical data mining: Key design issues. In GeoComputation, 1999.

[29] M. F. Goodchild. Twenty years of progress: GIScience in 2010. Journal of Spatial Information Science, Number 1 (2010), pp. 3–20.

[30] J. Demsar, B. Zupan, G. Leban, and T. Curk, “Orange: From Experimental Machine Learning to Interactive Data Mining”, in Proc. of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2004 (Demonstration paper).

[31] R. E. Langlois, and H. Lu, “Intelligible Machine Learning with Malibu”, in Proc. of the 30th IEEE International Conference on Engineering in Medicine and Biology Society, 2008

[32] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.

[33] X

[34] Ratnaparkhi, A. 2010. Finding predictive queries for behavioral targeting. ADKDD. 07/2010, Washington, D.C., (2010).

[35] Yan, J., Liu, N., Wang, G., Zhang, W., Jiang, Y., and Chen, Z. 2009. How much can Behavioral Targeting Help Online Advertising? WWW 2009. Madrid, Spain.

[36] Chen, Y., Pavlov, D., and Canny, J. 2009. Large-scale behavioral targeting. Proceedings of 15th ACM International Conference on Knowledge Discovery and Data Mining. ACM SIGKDD 2009. ACM, New York, NY, 209-218.

[37] Wu, X., Yan, J., Liu, N., Yan, S., Chen, Y., and Chen, Z. 2009. Probabilistic latent semantic user segmentation for behavioral targeted advertising. Proceedings of the 3rd International Workshop on Data Mining and Audience Intelligence for Advertising. ADKDD 2009. ACM, New York, NY, 10-17.
