IP Geolocation with a Crowd-sourcing Broadband Performance Tool

Yeonhee Lee
Dept. of Wired and Wireless Communication Intra Research
Electronics and Telecommunications Research Institute, Korea
[email protected]

Heasook Park
Dept. of Wired and Wireless Communication Intra Research
Electronics and Telecommunications Research Institute, Korea
[email protected]

Youngseok Lee
Dept. of Computer Engineering
Chungnam National University, Korea
[email protected]

ABSTRACT
In this paper, we propose an IP geolocation DB creation method based on crowd-sourcing Internet broadband performance measurements tagged with locations, and we present an IP geolocation DB based on 7 years of Internet broadband performance data in Korea. Compared with commercial IP geolocation DBs, our crowd-sourcing IP geolocation DB shows increased accuracy with fine-grained granularity. We confirm that the low accuracy of commercial IP geolocation DBs mainly results from selecting a single representative location for a large IP block from the Whois registry DB, parsing city names in a naive way, and resolving the wrong geolocation coordinates. We also found that the geographic location of IP blocks has changed continuously but has stabilized. Although our IP geolocation DB is limited to Korea, the 32 million broadband performance test records over 7 years provide wide coverage as well as fine-grained accuracy.

1. INTRODUCTION
The geographic location of IP addresses is important and useful for targeting advertising, localizing content distribution, and tracking illegal users committing cyber-attacks or crimes. Recently, the accuracy of commercial IP geolocation DBs has been verified against ground truth at country- and city-level granularity. Many free or commercial IP geolocation DBs, such as MaxMind, IPligence, IP2Location, GeoBytes, NetAcuity, and Akamai Edgescape, provide high country-level accuracy but low city-level accuracy [1,2].
Thus, detailed location information at the level of administrative districts within large metropolitan cities like Seoul, Tokyo, Los Angeles, or New York is not available.

IP geolocation DBs usually depend on Whois registry DBs or BGP routing tables. With the Whois DB, the city- or district-level accuracy of IP geolocation cannot be guaranteed because the Whois DB contains only the headquarters address of the ISP for all of its IP blocks. In addition, commercial IP geolocation DBs focus on only a few countries (e.g., the U.S.), so their accuracy in other countries, such as Korea, is arguably low [1]. In BGP routing, many IP blocks assigned to ISPs are aggregated, and thus it is difficult to identify local IP blocks without cooperation from a broadband ISP.

To overcome the weaknesses of the BGP routing table and the Whois service, many active measurement approaches have been proposed for building IP geolocation DBs. For instance, DIMES [3] collects ping or traceroute probing results with location information from many participants across the world, but its accuracy does not reach district-level resolution. Though the active measurement approach is effective, the population using the service must be large enough to be statistically meaningful. In addition, many participants must constantly run the active measurements to keep the IP geolocation DB up-to-date.

One active approach that can solve the scalability issue uses a crowd-sourcing Internet performance measurement website, such as Ookla's Speedtest.net. In Korea, there are several Internet broadband performance test websites, such as Speedtest, Benchbee, NIA Speed, and the ISPs' own sites. When running a test, users provide their broadband subscription information, such as the ISP name, service category, and location. Therefore, we can easily probe the location of IP addresses contributed by many broadband subscribers.
However, building an IP geolocation DB from crowd-sourcing measurement data has two innate problems. First, even though the tests are continuously launched by many volunteer participants, it is hard to fully cover the entire IP address space with the tests. Second, the voluntarily provided address information may not be reliable.

Our first contribution is a new IP geolocation DB creation method based on crowd-sourcing Internet broadband performance measurements tagged with the location. Our method builds the geolocation DB at the IP block level and applies the majority rule to decide the location of each IP address block, with a continuous DB update mechanism. To build a reliable IP geolocation DB based on crowd-sourcing data, we study a reasonable threshold value for applying the majority rule. We examine the IP address and record distribution within three different grains of IP prefix subnets (/24, /25, and /26), and the coincidence level of the users' selections at two regional levels (district and province) for each of the three subnet grains. We show that the /26 IP prefix subnet has the highest coincidence regardless of the region level, and that a district-level IP geolocation DB with an 80 % majority threshold is the better option for achieving a highly accurate DB, even though it incurs a certain loss of samples.

ACM SIGCOMM Computer Communication Review 13 Volume 46, Number 1, January 2016



Our second contribution is to present an IP geolocation DB (see footnote 1) built from 32 million Internet broadband performance test records collected over 7 years in Korea. Our IP geolocation DB provides two-tier mapping with fine-grained IP block precision (/24, /25, and /26 IP prefixes): the province-level DB maps IP blocks to 16 provinces or metropolitan cities; the district-level DB associates IP blocks with 233 small cities or districts within a metropolitan city. The most precise mapping, the district-level location at /26 prefix precision, has never been offered by a commercial IP geolocation DB. Though we cannot directly prove the accuracy against ground truth, we show through cross-checking with commercial IP geolocation DBs that the crowd-sourcing IP geolocation DB provides highly accurate geolocation information in Korea.

Our third contribution is to identify the causes of the inaccuracy of current commercial IP geolocation DBs. We examine how MaxMind builds its IP geolocation DB, and demonstrate that the faults of the MaxMind DB are mainly caused not only by its high dependency on the Whois registry but also by an incorrect city name parsing mechanism and a mistaken geographic location resolution method. Akamai also provides a geolocation service, but it guarantees only country-level accuracy in Korea, based on the location of its content distribution servers.

In addition, we examine the dynamics of IP block allocation by ISPs and its granularity. We observed that the geographic location of IP blocks has stabilized over 7 years. The /24 IP prefix subnet was the one mainly adopted by ISPs, even in district-level areas, but the /26 IP prefix subnet was still used. Although our crowd-sourcing IP geolocation DB is limited to Korea, we believe that this approach can contribute to building an accurate IP geolocation DB thanks to the constant participation of many users.

2. RELATED WORK
Many commercial IP geolocation services, such as MaxMind, IPligence, IP2Location, GeoBytes, and NetAcuity, do not unveil how they build and update their DBs. Recently, Poese et al. [1] and Shavitt and Zilberman [2] studied the accuracy of commercial IP geolocation DBs. In [1], the authors found that the city-level accuracy of commercial geolocation DBs is unreliable by assessing the PoP-level accuracy with a European ISP's ground truth. In [2], the authors evaluated 7 commercial geolocation DBs with PoP-level ground truth and reported that most geolocation DBs do not achieve the acclaimed city-level accuracy. Siwpersad et al. [4] examined the city-level accuracy of the MaxMind IP geolocation DB and the Hexasoft IP2Location DB.

To enhance the low accuracy of IP geolocation DBs, many active measurement methods have been proposed. Gueye et al. [5] looked into the imprecision of the MaxMind IP geolocation DB by mapping a single IP address of an IP block to a location, and proposed an active measurement method. In [6], Gueye et al. used a triangulation-like location estimation method with delay constraints. Yoshida et al. [7] also used an end-to-end delay measurement method to build a PoP-level topology with 13 cities in Japan. Though the active measurement approach improves the accuracy of IP geolocation, this method is difficult to deploy to many users for a long period.

Footnote 1: The crowd-sourcing geoIP project website is http://geoip.cs-cnu.org.

[Figure 1 omitted. Panels: (a) Internet registry, (b) NIA Speed data. Panel (a) plots the number of /24 prefix IP blocks (K) and the number of subscribers (M) against date (2006.01 to 2012.01). Panel (b) plots the cumulative count (M) against date (2006.01 to 2012.01), with series for the number of records (M) and the number of visible IPs (K).]

Figure 1: (a) Cumulative number of allocated /24 IP prefix blocks and the number of broadband subscribers by KISA [13] and (b) the cumulative number of visible IPs and /26 IP prefix blocks from 2006 to 2012 from the NIA Speed test data.

In general, commercial IP geolocation DBs offer moderate or low accuracy at city-level granularity [8,9]. Moreover, they do not even provide district-level resolution. For example, MaxMind announced that the city-level accuracy of its IP geolocation DB in Korea was only 22 % as of Jan. 2015 [10]. Akamai Edgescape claims a global country-level SLA (Service-Level Agreement) [11].

3. CROWD-SOURCING IP GEOLOCATION

3.1 Raw data
In Korea, the high-speed Internet broadband subscription ratio is high. According to the OECD [12], Korea is among the most highly connected countries, with one of the highest broadband penetration rates (other examples include Hong Kong, Japan, Norway, and Sweden). KISA [13] is the agency responsible for IP address allocation in Korea; it maintained 420 K /24 IP prefix blocks as of Oct. 2012 (Fig. 1(a)). The broadband population in Korea reached 18 million as of 2013, up from 14 million in 2006, which corresponds to 97.5 % of households having broadband access [12].

In Korea, the National Information Society Agency (NIA) operates the Speed website [14], which has provided public broadband performance tests for Korean broadband clients since 2006. The test results are sent to a server with human-provided metadata including city and district names (see Table 1). Raw data totaled 32 million test records with 9 million unique IP addresses, 825,260 /26 IP prefixes, and 290,189 /24 IP prefixes at the end of 2012 (Fig. 1(b)). On average, 1.3 million unique IP addresses per year participated, except in 2008, the system maintenance period. The NIA launched a speed test service for mobile devices in 2013. Breaking down the test data according to the IP prefix subnets (/24, /25, and /26), approximately 20 % of subnets contain a single unique IP address (Fig. 2(a)), whereas, in terms of records, the ratio of subnets with a single record is only about 13 % (Fig. 2(b)).
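The per-subnet breakdown described above can be illustrated with a small sketch. It is not the authors' tooling: the function name `single_ip_ratio` and the sample addresses (taken from documentation ranges) are hypothetical, and the sketch only assumes that each test record carries an IPv4 address.

```python
import ipaddress
from collections import defaultdict

def single_ip_ratio(ips, prefix_len):
    """Fraction of observed subnets that contain exactly one unique IP.

    ips: iterable of IPv4 address strings (hypothetical input format).
    prefix_len: subnet grain to group by, e.g. 24, 25, or 26.
    """
    members = defaultdict(set)
    for ip in ips:
        # strict=False masks the host bits, giving the enclosing subnet.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        members[net].add(ip)
    return sum(len(m) == 1 for m in members.values()) / len(members)

# Illustrative addresses only (RFC 5737 documentation ranges).
sample = ["192.0.2.1", "192.0.2.2", "192.0.2.200", "198.51.100.7"]
ratio_24 = single_ip_ratio(sample, 24)
ratio_26 = single_ip_ratio(sample, 26)
```

Finer grains split the same addresses over more subnets, so the share of single-IP subnets can only stay equal or grow as the prefix gets longer.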

Table 1: Properties of test data collected from the NIA Speed site.

Type          Properties                                         Source type
performance   datetime, download speed, upload speed, latency    tool-generated
meta info     ISP, service name, location                        human-profiled


[Figure 2 omitted. Panels: (a) Records, (b) Unique IPs. Panel (a) plots the CDF against the number of records within a subnet; panel (b) plots the CDF against the number of IPs within a subnet. Both x-axes are logarithmic (1 to 500), with one curve each for the /26, /25, and /24 prefixes.]

Figure 2: The number of records and unique IP distribution per /24, /25, and /26 IP prefix subnet.

[Figure 4 omitted. Panels: (a) Province, (b) District.]

Figure 4: The number of IP blocks with the different threshold of majority portion.

3.2 Build method

3.2.1 IP block-level geolocation by majority rule
Building an IP geolocation DB from crowd-sourcing measurement data has innate problems: the IP address space is hard to cover fully with the tests, and the voluntarily provided address information may not be reliable. According to the KISA IP registry [13], the IP addresses assigned to Korean ISPs correspond to a group of /24 IP prefix blocks. The study in [15] reported that 99 % of IP prefixes in the /24-/31 range announced by a single stub AS are assigned to the same location. Based on this observation, we propose an IP block-level geolocation DB build method based on the majority rule, which decides the location of an IP block according to the majority's selection.

Figure 3 summarizes the procedure for building the IP geolocation DB. As Fig. 2 shows, a large portion of the IP prefix blocks in the raw data have only a small number of test records. To prevent a small number of users who run massive numbers of tests from dominating a bin, we assume that an IP address is a single user. We merge test records by IP address and decide a single location for each IP address by applying the majority rule to its reported locations. Then, we regroup the merged entries by IP prefix subnet and filter out IP prefix blocks with fewer than two unique IP addresses within the subnet, because two is the minimum number to which a majority vote can be applied. Finally, we decide the location of each IP block by applying the majority vote within the subnet with the threshold value.
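The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record format `(ip, location)`, the function names `majority` and `build_block_db`, and the tie-breaking behavior (the first-seen location wins a tie at the per-IP stage) are all hypothetical choices that the paper does not specify.

```python
import ipaddress
from collections import Counter, defaultdict

def majority(labels, threshold=0.0):
    """Return the most common label if its share reaches threshold, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= threshold else None

def build_block_db(records, prefix_len=26, threshold=0.8, min_ips=2):
    """records: iterable of (ip_string, location) pairs (assumed format)."""
    # Step 1: merge records per IP (one IP = one user) and pick one location.
    per_ip = defaultdict(list)
    for ip, loc in records:
        per_ip[ip].append(loc)
    ip_loc = {ip: majority(locs) for ip, locs in per_ip.items()}

    # Step 2: regroup the per-IP locations by prefix subnet.
    per_block = defaultdict(list)
    for ip, loc in ip_loc.items():
        block = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        per_block[str(block)].append(loc)

    # Step 3: drop blocks with too few unique IPs, then apply the
    # thresholded majority vote within each remaining block.
    db = {}
    for block, locs in per_block.items():
        if len(locs) < min_ips:
            continue
        winner = majority(locs, threshold)
        if winner is not None:
            db[block] = winner
    return db
```

Merging per IP first means that a single address submitting hundreds of tests still contributes exactly one vote to its block.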

To determine a reasonable threshold value for the majority vote, we examine the agreement of the majority users' location selections in Fig. 4, which shows a trade-off between the quality and the size of the sample space. The X-axis represents the majority portion threshold for the location selection, and the Y-axis denotes the CCDF of the number of IP prefix subnets that satisfy the majority threshold. We exclude records for IP blocks with fewer than two unique IP addresses in the IP prefix subnet. We observe that the sample space is largest with /26 subnetting, in both the province-level DB (Fig. 4(a)) and the district-level DB (Fig. 4(b)), regardless of the threshold value of the majority portion (see footnote 2). This shows that we can build fine-grained IP block-location DBs through /26 subnetting with a small data loss.
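The trade-off between threshold and sample-space size can be explored with a sketch like the one below. The names `majority_share` and `surviving_fraction` and the input layout (a mapping from block to the list of per-IP locations) are assumptions made for illustration; the sketch also assumes that at least one block has two or more unique IPs.

```python
from collections import Counter

def majority_share(locations):
    """Share of the most frequently selected location within one block."""
    return Counter(locations).most_common(1)[0][1] / len(locations)

def surviving_fraction(blocks, thresholds):
    """Fraction of eligible blocks whose majority share meets each threshold.

    blocks: {block_id: [location chosen for each unique IP]} (assumed layout).
    Blocks with fewer than two unique IPs are excluded, as in the text.
    """
    shares = [majority_share(locs) for locs in blocks.values() if len(locs) >= 2]
    return {th: sum(s >= th for s in shares) / len(shares) for th in thresholds}

# Hypothetical blocks: unanimous, split, 4-of-5 agreement, and a single-IP block.
demo = {
    "a": ["X", "X"],
    "b": ["X", "Y"],
    "c": ["X", "X", "X", "Y", "X"],
    "d": ["X"],
}
curve = surviving_fraction(demo, [0.5, 0.8, 1.0])
```

Raising the threshold can only shrink the surviving set, which is the quality-versus-size trade-off that Fig. 4 plots.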

Finally, we set the threshold of the majority portion for the location selection to 80 %. Figure 5 shows an example of applying the 80 % majority rule to three different-level subnet groups (/24, /25, and /26 IP prefixes). We presume that requiring two unique IPs with a majority threshold of 80 % can efficiently offset both the implicit errors caused by users' misinformed location selection and the errors in the location estimation for the subnet block, because all selected locations within a block must be unanimous to conform to the 80 % threshold if the block contains a small number of unique IPs (e.g., 2, 3, or 4). Further, the probability that the same location is selected out of 233 district-level geolocations by chance is low. With the threshold of 80 % and a minimum of two unique IPs, we obtain 82 % of the province-level sample space and 60 % of the district-level IP blocks.
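The unanimity argument is simple arithmetic, shown here as a short sketch. The function name `min_agreeing` is hypothetical, and the chance-agreement figure assumes, purely for illustration, that two mistaken users pick among the 233 districts independently and uniformly, which the paper does not claim.

```python
def min_agreeing(n_ips, threshold_pct=80):
    """Smallest number of agreeing IPs that reaches the threshold (integer ceiling)."""
    return -(-threshold_pct * n_ips // 100)

# For blocks of 2, 3, or 4 unique IPs, an 80 % threshold demands unanimity.
small_blocks = {n: min_agreeing(n) for n in (2, 3, 4, 5, 10)}

# Illustrative assumption: uniform, independent random picks among 233 districts.
chance_two_random_picks_agree = 1 / 233
```

From five unique IPs upward a single dissenting vote is tolerated (4 of 5), so the rule is strictest exactly where the evidence is thinnest.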

Figure 5: An example of applying the majority rule of 80 % regarding three different-level subnet groups (/24, /25, and /26 IP prefix).

3.2.2 Continuous updates
Our crowd-sourcing IP geolocation method can update the IP geolocation DB with newly probed test data from users. To keep track of the location changes of IP blocks, we adopt a time-window approach and calculate the valid IP geolocation information by applying the majority rule at each time epoch t. We define a cumulative IP block-location set G_t at time t as follows:

\[
G_0 = M_0, \quad t = 0
\]
\[
G_t = \bigl( (G_{t-1} - G_{t-s} - (G_{t-1} \cap M_t)) \cup M_t \bigr), \quad t > 0, \tag{1}
\]

where M_t denotes a group of IP block-location samples that satisfy the majority rule during the time window, and s is a stale window size used to keep the IP geolocations up-to-date, which is a multiple of the time period t. With a 1-year time window and a 5-year stale window size, we capture snapshots of the geographic location of IP blocks for each year (G2006 ∼ G2012).
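One way to read Eq. (1) operationally is sketched below. This is an interpretation, not the authors' code: it assumes that entries are keyed by IP block, that a new sample in M_t replaces any older entry for the same block, and that an entry last confirmed s or more epochs ago is evicted. The function name `update_geodb` and the `(location, last_epoch)` bookkeeping are hypothetical.

```python
def update_geodb(g_prev, m_t, t, s):
    """One epoch of a time-windowed update with stale-entry eviction.

    g_prev: {block: (location, last_epoch)} -- the previous snapshot.
    m_t:    {block: location} -- samples passing the majority rule at epoch t.
    t, s:   current epoch and stale window size, in the same unit (e.g. years).
    """
    # Keep old entries that are neither re-measured now nor stale.
    g_t = {
        block: (loc, seen)
        for block, (loc, seen) in g_prev.items()
        if block not in m_t and t - seen < s
    }
    # Fresh samples always win and are stamped with the current epoch.
    g_t.update({block: (loc, t) for block, loc in m_t.items()})
    return g_t

# Hypothetical run with a 1-year epoch and a 5-year stale window.
g = {}
g = update_geodb(g, {"a": "X", "b": "Y"}, 2006, 5)
g = update_geodb(g, {"b": "Z"}, 2007, 5)
for year in (2008, 2009, 2010, 2011):
    g = update_geodb(g, {}, year, 5)
```

In this run block "a" is last confirmed in 2006 and drops out in 2011, while block "b" moves from "Y" to "Z" in 2007 and survives because it is only four epochs old.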

Footnote 2: Korea consists of 16 province-level regions (7 metropolitan cities and 9 provinces). Each province-level region has several small cities or districts, which we call district-level regions. In Korea, a small city (not a metropolitan city) should have a population of more than 150,000, and a city with more than 500,000 consists of more than 2 districts. In total, out of 233 district-level regions, 69 districts belong to the 7 metropolitan cities and 164 to the 9 provinces.


Figure 3: Process for extracting district-level IP geolocation from crowd-sourcing performance test data.

4. EVALUATION

4.1 Coverage
Out of 7 snapshots (G2006 ∼ G2012), we choose the G2012 DB, which has the largest sample space, covering the province-level resolution of 16 provinces or metropolitan cities and the district-level resolution of 233 small cities in provinces or district areas in metropolitan cities. Further, it can be compared more fairly with current commercial DBs because it is the most up-to-date. Overall, the G2012 DB can provide fine-grained nation-wide IP geolocation mapping in Korea with a large number of IP blocks (619 K). We use the /26 IP prefix as the default precision to build the G2012 DB because the IP blocks allocated to national broadband ISPs are usually divided into multiple regional IP subblocks.

Figure 6 shows the portion of samples per region compared with the number of subscribers in April 2007 reported by [13]. The capital area (b and I) is dominant in both the subscriber population and the number of samples. The number of samples per regional geolocation in our data set is almost proportional to the population of subscribers, as shown in Fig. 6. Figure 7 shows the shares of samples by two ISP types, six national ISPs and regional ISPs (e.g., enterprises, research and educational organizations, and regional cable service providers).

[Figure 6 omitted. Bar chart of the normalized portion (0.0 to 0.4) per region (labeled a, b, c, d, E, F, G, H, I, J, K, l, m, n, o, p), with two series: portion of subscribers and portion of samples.]

Figure 6: The number of samples per region compared to the number of subscribers (April 2007).

Figure 7: Shares of ISPs for the matched IP blocks.

We examine the subnet length distribution of the MaxMind and IP2Location DBs in Fig. 8. Although the MaxMind database has a wide distribution of subnet lengths, ranging from 8 to 32, over the entire dataset, for Korea it has a very limited subnet length range, and subnets more specific than the /25 IP prefix never appear. Even though IP2Location has more finely chopped IP blocks, down to the /32 IP prefix, the portion finer than the /24 IP prefix is only about 0.1 % of its IP address space.

[Figure 8 omitted. CDF (0.0 to 1.0) against subnet length (0 to 30, with /24 marked), with series: Block allocation-KR (2014), MaxMind-KR, IP2Location-KR, and BGP-KR (2012.10).]

Figure 8: Cumulative distribution function of IP block regarding IP subnet size for GeoIP databases.

Table 2 compares the coverage of our latest DB, G2012, with that of MaxMind and IP2Location. Although the two commercial DBs seem to cover a more extensive IP address space than G2012, their number of IP blocks is far smaller than that of G2012. MaxMind has many locations with geographic coordinates, but most are not matched to a city name and are highly concentrated in the capital, which will be discussed in Section 4.2. Though IP2Location has a small number of locations, most locations are exactly matched to real city names with coordinates. MaxMind and IP2Location claim 22 % and 80 % accuracy in exact city-level mapping, respectively. In contrast, locations in G2012 are evenly distributed at district-level granularity, which is not offered by the other DBs.

Table 2: The number of IP blocks and location entries of IP geolocation DBs in Korea.

DB        # of blks   IPs/blks   locations   coords
G2012     619 K       64         233         231
MaxMind   24 K        7,387      1,543       313
IP2Loc    74 K        3,021      130         130


[Figure 9 omitted. Panels: (a) MaxMind-KR, (b) IP2Location-KR. Each panel plots the CDF (0.0 to 1.0) against the distance difference (0 to 500 km), with legend entries G2012_d26, G2012_d24, G2012_m26, and G2012_m24.]

Figure 9: Distance difference distribution of G2012 DBs (G2012_p24, G2012_p26, G2012_d24, and G2012_d26) compared to two commercial DBs.

4.2 Accuracy

4.2.1 IP block size
Since no ground truth about Korean IP geolocation data is available, we examine the accuracy of G2012 by comparing the distance difference with other DBs. Figure 5 shows that the /26 IP prefix DB can achieve better accuracy because of its fine granularity. In Fig. 9, the distance difference between the commercial DBs and the G2012 DBs is compared: G2012_p24 with the province-level /24 prefix precision; G2012_p26 with the province-level /26 prefix precision; G2012_d24 with the district-level /24 prefix precision; and G2012_d26 with the district-level /26 prefix precision. We divided the IP blocks of the commercial DBs into the same size as the G2012 DBs for fair comparison and selected the IP blocks common to the three DBs, which amount to 3,783 IP blocks for the /26 prefix precision and 1,725 for the /24. We used the Google Geocoding API to map districts, cities, or provinces in the G2012 DB to geographic longitude and latitude [16].
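Dividing coarser commercial blocks into the same size as the G2012 blocks, and then selecting the common blocks, might look like the sketch below. The DB layout (`{cidr: location}`) and the function names `split_block`, `expand_db`, and `common_blocks` are assumptions made for illustration; the paper does not describe its tooling.

```python
import ipaddress

def split_block(cidr, new_prefix=26):
    """Split a block into subnets of length new_prefix (left as-is if already finer)."""
    net = ipaddress.ip_network(cidr)
    if net.prefixlen >= new_prefix:
        return [str(net)]
    return [str(sub) for sub in net.subnets(new_prefix=new_prefix)]

def expand_db(db, new_prefix=26):
    """Re-key a {cidr: location} DB so that every block has the target size."""
    return {
        sub: loc
        for cidr, loc in db.items()
        for sub in split_block(cidr, new_prefix)
    }

def common_blocks(*dbs):
    """Blocks present in every DB, i.e. the comparable sample."""
    keys = set(dbs[0])
    for db in dbs[1:]:
        keys &= set(db)
    return keys

# Hypothetical entries using documentation address ranges.
commercial = expand_db({"192.0.2.0/24": "Seoul"})
crowd = {"192.0.2.64/26": "District-X", "198.51.100.0/26": "District-Y"}
shared = common_blocks(commercial, crowd)
```

Each /24 expands into four /26 blocks that inherit the parent's location, so a coarse commercial entry is compared against up to four crowd-sourced entries.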

First, we can infer that G2012_d26 is the most representative DB, because the G2012_d26 DB, with district-level resolution and /26 prefix precision, shows the minimal distance difference distribution. However, looking into the details of G2012_d26, the percentages of blocks with a distance difference shorter than 50 km (see footnote 3) are only 55 % against MaxMind and 53 % against IP2Location. Since Korea is a small country with a high population density around the capital, Seoul, and 6 metropolitan cities, the 50-km distance is too large to identify a specific city (see footnote 4). As for the IP blocks with a short distance difference, below 20 km, the percentages are 38 % against MaxMind and 37 % against IP2Location, which are not enough for the location information to be considered accurate.
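The distance difference between two geocoded locations, and the share of blocks under a cutoff such as 50 km or 20 km, can be computed as in the sketch below. The paper does not state which distance formula it used; the haversine great-circle distance with a mean Earth radius of 6371 km is an assumption here, as are the function names and the approximate city coordinates.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def share_within(distances_km, limit_km):
    """Fraction of blocks whose distance difference is below limit_km."""
    return sum(d < limit_km for d in distances_km) / len(distances_km)

# Approximate, illustrative coordinates for Seoul and Busan.
seoul_busan_km = haversine_km(37.5665, 126.9780, 35.1796, 129.0756)

# Hypothetical per-block distance differences in km.
share_50km = share_within([10.0, 40.0, 60.0, 200.0], 50.0)
```

With a helper like `share_within`, the 50 km and 20 km figures quoted above correspond to two points on the same CDF curve in Fig. 9.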

4.2.2 Regional Breakdown
To explain the reason for the large distance gap, we break down the distance distribution of G2012_d26 into province-level regions in Fig. 10(a), where the IP blocks are grouped by province and sorted in ascending order of distance from the capital. Each bar is the average geographic distance difference of one of the two commercial geoIP DBs compared to G2012_d26. As shown in Fig. 10(a), the geographic distance differences of both commercial DBs become large when the region is far from Seoul. This can be explained with Fig. 10(b), which shows the geographic distribution of IP blocks for each geoIP DB. Here, most records of the commercial DBs (82 % in MaxMind and 87 % in IP2Location) are concentrated on locations within 50 km of the capital, which means that most IP blocks point to Seoul and its satellite towns. Considering the population ratio of the capital area in Korea (about 49 %; see footnote 5), the extreme skew of the commercial DBs toward the capital area is problematic. In contrast, G2012 has a relatively fair distribution of the number of IP blocks across regions.

Footnote 3: MaxMind sets the city-level accuracy to 50 km.
Footnote 4: The longest diameter of Seoul is 36.8 km. Seoul has 25 administrative divisions. Several satellite cities around Seoul are located within 40 km.

[Figure 10 omitted. Panels: (a) Distance difference, (b) Geographic distribution. Panel (a) is a bar chart of the distance difference (0 to 400 km) per province-level region, ordered by distance from the capital (10, 30, 112, 189, 266, 307 km), for MaxMind-KR and IP2Location-KR. Panel (b) plots the CDF against the distance from the capital (0 to 500 km) for G2012_d26, MaxMind-KR, and IP2Location-KR.]

Figure 10: Regional distance difference of commercial DBs compared to G2012_d26 and geographic distribution of IP blocks.

[Figure 11 omitted. Panels: (a) Capital, (b) Non-capital region. Each panel plots the CDF (0.0 to 1.0) against the distance difference (0 to 500 km) for MaxMind-KR and IP2Location-KR.]

Figure 11: Distance difference of commercial geoIP DBs against G2012_d26: capital vs. non-capital regions.

We classify the IP block samples into two groups, capital and non-capital locations, and examine the distance difference against G2012_d26 in Fig. 11. For the capital (Fig. 11(a)), the percentage of blocks with a distance difference of less than 50 km is only 49 %, while for the non-capital regions (Fig. 11(b)), it is 86 %. This implies that many IP blocks in the commercial DBs are given an area in the capital as their location by default. Conversely, most IP blocks for non-capital areas are explicitly given their locations, which leads to better accuracy and high regional coincidence with our G2012 DB.

4.2.3 ISP BreakdownNext, we investigate the distance difference according to

the ISPs. Based on the majority rule for Internet providerselection, we identified the ISP for each /26 IP prefix subnetblock. In Fig. 12, the IP blocks of two regional cable serviceproviders (C&M and TBroad) show a comparably short dis-tance distribution. Especially, the IP blocks of a nationalcable service provider (CJ Hellovision) have the largest dis-tance difference pattern. We presume that the large distancedifference in the national ISPs comes from the fact that theIP blocks of the national ISPs registered at the Whois DB

5 http://www.index.go.kr/potal/main/EachDtlPageDetail.do?idx_cd=1007


Figure 12: Distance difference of commercial geoIP DBs against G2012 d26 by ISPs: four national ISPs (KT Telecom, LG Telecom, SK broadband, and CJ Hellovision) and two regional ISPs (C&M and TBroad). Panels: (a) MaxMind-KR, (b) IP2Location-KR. x-axis: distance difference (km), 0 to 500; y-axis: CDF, 0.0 to 1.0. Legend: KT Telecom, LG Telecom, SK broadband, CJ Hellovision, C&M, TBroad, etc.

are usually collapsed into a single headquarters address in Seoul. Conversely, we presume that the high agreement for the regional cable service providers indirectly validates the high accuracy of G2012.
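The majority rule used to label each /26 block with an ISP can be sketched as follows. This is an illustration only: the record layout (a block identifier paired with an ISP name) and the tie-breaking behavior are assumptions, since the text does not describe them.

```python
from collections import Counter, defaultdict


def majority_isp(records):
    """Map each /26 block to the ISP seen most often in its test records.

    `records` is an iterable of (block, isp) pairs; both fields are
    hypothetical stand-ins for columns of a measurement record.
    """
    votes = defaultdict(Counter)
    for block, isp in records:
        votes[block][isp] += 1
    # On a tie, most_common(1) returns the ISP encountered first
    # (an assumed tie-break, not taken from the paper).
    return {block: counts.most_common(1)[0][0] for block, counts in votes.items()}
```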

5. DISCUSSION

5.1 Defects of commercial DB

The MaxMind DB has two main tables, GeoIP Blocks and GeoIP Location (Fig. 13). Each key in GeoIP Blocks refers to an entry in GeoIP Location. The GeoIP Location table contains real-world address names with their geographic coordinates, where we frequently observe wrong address names such as a town, district, street, or building instead of a city name. In addition, we often find typos in the city names that are identical to those in the APNIC Whois registry. The geographic coordinates in the GeoIP Location table, which are resolved by city name, are often erroneous because they are derived from an incorrect or improperly parsed address name.

Figure 13: Method for building an IP geolocation DB in MaxMind.
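To illustrate why one bad entry in the location table affects every block that refers to it, the two-table layout can be sketched as below. This is a simplified model, not MaxMind's actual schema or API: the dictionary layout, the `lookup` helper, the documentation-range prefixes, and the approximate coordinates are all hypothetical.

```python
from ipaddress import ip_address, ip_network

# Hypothetical "blocks" table: IP block -> location id.
BLOCKS = {
    ip_network("192.0.2.0/24"): 1,
    ip_network("192.0.2.128/25"): 2,
    ip_network("198.51.100.0/24"): 2,
}

# Hypothetical "location" table: location id -> (name, lat, lon).
LOCATIONS = {
    1: ("Seoul", 37.57, 126.98),
    2: ("Yeosu", 34.76, 127.66),
}


def lookup(ip, blocks=BLOCKS, locations=LOCATIONS):
    """Return the location entry of the most specific block containing ip."""
    addr = ip_address(ip)
    matches = [net for net in blocks if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return locations[blocks[best]]
```

Because blocks only store a location id, a wrong name or coordinate in a single location entry is returned for every block that points to that id.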

During the creation of the MaxMind DB, several possible defects can trigger IP geolocation mapping errors. First, MaxMind depends on the Whois registry or official web sites when it gathers location information for an IP block, which yields an IP block resolution too coarse to pinpoint fine-grained local IP block allocation across regions. For example, MaxMind derived only one location name for all military bases widely dispersed in Korea. Second, due to the lack of a full understanding of the Korean mail address and language system, the name of a small town, a road/street, a building, or a university that resembles a city name is mistakenly chosen as the city name, which results in incorrect geographic coordinates. Last, even with a properly parsed city name, IP blocks are often mapped to incorrect geographic coordinates when the address is resolved to coordinates. For instance, MaxMind has the correct city name Yeosu for an IP block, but it points to the geographic coordinates of the street Yeosul, located in a different province.

Akamai claims a global country-level SLA for Akamai Edgescape [11]. Akamai operates the Akamai EdgePlatform, a network of 170,000 secure servers deployed in 102 countries, and provides Edgescape geolocation services that enable customers to build targeted services based on the IP intelligence collected on the Akamai platform. However, its SLA covers only the country level, not the city level. In practice, when we queried Edgescape from several different non-capital regions and ISP networks, it returned only the capital city name with the country code for Korea. The reason is that Edgescape knows only the locations where EdgePlatform servers reside around the world, and most servers in Korea are placed near the capital.

5.2 IP geolocation dynamics

We examine the dynamics of the ISPs’ IP block allocation policy with the time-window approach. Figure 14(a) traces the share of IP blocks for the IP geolocation DBs. In G2007, 23 % of the IP blocks were newly observed and 34 % were relocated, but both shares fell to 2 % in G2012. We explore the moving distance of relocated IP blocks in Fig. 14(b). To focus on geolocation changes without skew, we select 8,338 /26 IP prefix blocks observed at every time epoch since 2006. Between G2006 and G2007, the average moving distance was 147 km, but it fell to 93 km between G2011 and G2012. Then, to characterize the relocations, we filter out the IP blocks that were not relocated and compare the frequency of intra-province relocations with that of inter-province relocations. In Figure 15, we observe that the number of intra-province relocations (between districts of the same province) rose from 49 % of the number of inter-province relocations in 2007 to 95 % in 2012, which may explain the gradual decline in moving distance in Fig. 14(b).
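The bookkeeping behind Figs. 14(a) and 15 can be sketched as a comparison of two consecutive snapshots. This is a rough illustration under stated assumptions: each snapshot is modeled as a dictionary from a /26 block to a (province, district) pair, "newly observed" is taken to mean absent from the previous snapshot, and the function names are hypothetical.

```python
from collections import Counter


def classify_blocks(prev, curr):
    """Percentage of new, relocated, and stationary blocks in `curr`.

    `prev` and `curr` map a /26 block to a (province, district) pair.
    Assumption: a block is "new" if it is absent from `prev`.
    """
    counts = Counter()
    for block, loc in curr.items():
        if block not in prev:
            counts["new"] += 1
        elif prev[block] != loc:
            counts["relocated"] += 1
        else:
            counts["stationary"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: 100.0 * n / total for kind, n in counts.items()}


def intra_to_inter_ratio(prev, curr):
    """Intra-province relocations as a percentage of inter-province ones."""
    intra = inter = 0
    for block, (province, district) in curr.items():
        old = prev.get(block)
        if old is None or old == (province, district):
            continue  # not observed before, or not relocated
        if old[0] == province:
            intra += 1  # moved between districts of the same province
        else:
            inter += 1  # moved to a different province
    return 100.0 * intra / inter if inter else None
```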

To sum up, IP geolocation dynamics have subsided over 7 years, both in the number of relocated blocks and in moving distance, and relocations between provinces have decreased remarkably. One reason for this trend is that ISPs periodically reorganized their broadband IP address spaces due to the rapid penetration of high-speed broadband service in Korea since 2006. In addition, in the early years, the demand for IP block allocation from new subscribers or new broadband services (e.g., high-speed DSL and FTTH) triggered IP address relocation in the ISPs’ access networks.

5.3 Granularity of IP block allocation

Now, we ask what granularity ISPs have chosen when allocating IP blocks to regions. In Section 4.2, we observed that the district-level /26 IP prefix precision has the highest accuracy. To answer the question, we extract the intact /24 IP prefix block areas fully covered by the /26 IP prefix blocks of G2012 d26, because our IP geolocation DBs do not cover the entire IP address space of


Figure 14: Comparison of the share of IP blocks and relocation of continuously observed IP blocks over 7 years. Panels: (a) Share of IP blocks (%), G2007 to G2012, with series for newly observed, relocated, and stationary IP blocks; (b) Moving distance of relocated IP blocks, shown as average moving distance (km) per consecutive-year pair ('06~'07 onward).

Figure 15: The ratio of intra-province relocation against inter-province’s between two consecutive years. x-axis: region labels a to p; y-axis: ratio (%), 0 to 100. Legend: 2006~2007, 2011~2012.

Korea. We count the number of /26 IP prefix blocks per region within each /24 IP prefix subnet and summarize the portions of subnetting by IP subnet size in Table 3. We consider an IP block to be allocated in /26 IP prefix granularity if the group count value is one; if it is four, the IP block is assigned in /24 IP prefix granularity. In the case of three, we assume that one-third is a /26 and the rest form a /25 IP prefix subnet. In Table 3, at the province level, about 5 % of IP blocks are allocated in a granularity finer than a /24 IP prefix subnet, whereas at the district level, about 13 % are. The portions of IP allocation by precise IP prefix subnet granularity (/24, /25, and /26) at two different area levels were not presented in [15]. To conclude, /24 IP prefix subnetting was mainly chosen regardless of the area level (province or district), but /26 IP prefix subnetting was still used even in district-level allocation.

Table 3: The number of IP blocks and location entries of IP geolocation DBs in Korea.

            /24        /25       /26
Province    180,036    6,804     2,756
(%)         (95 %)     (4 %)     (1 %)
District    164,412    17,620    7,564
(%)         (87 %)     (9 %)     (4 %)
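The counting rule of Section 5.3 can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the input layout (a /26 CIDR string mapped to a region name) and the function name are hypothetical, a group count of two is assumed to mean one /25 (the text does not state this case), and the sketch does not check whether the /26 blocks of a group are adjacent.

```python
from collections import Counter, defaultdict
from ipaddress import ip_network


def granularity_tally(block_regions):
    """Tally /26 blocks by inferred allocation granularity.

    `block_regions` maps a /26 CIDR string to a region name
    (province or district, depending on the level examined).
    Counts are expressed in units of /26 blocks.
    """
    by_parent = defaultdict(list)
    for cidr, region in block_regions.items():
        parent = ip_network(cidr).supernet(new_prefix=24)
        by_parent[parent].append(region)

    tally = Counter()
    for regions in by_parent.values():
        if len(regions) != 4:
            continue  # keep only intact /24s covered by all four /26 blocks
        for count in Counter(regions).values():
            if count == 4:
                tally["/24"] += 4
            elif count == 3:
                tally["/26"] += 1  # one-third treated as a /26
                tally["/25"] += 2  # the remaining two treated as a /25
            elif count == 2:
                tally["/25"] += 2  # assumption: not specified in the text
            else:
                tally["/26"] += 1
    return tally
```

Running the function once with province names and once with district names as regions would produce the two rows of a table shaped like Table 3.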

6. CONCLUSION

In this paper, we proposed a crowd-sourcing IP geolocation method and presented its accuracy compared with MaxMind and IP2Location. Without the ground truth, we validated the accuracy of our DB indirectly by comparing it with two commercial DBs. We showed that the accuracy of commercial DBs is low because they depend on the Whois DB, a poor parsing scheme for city names, and a faulty geolocation resolution method. In addition, we found that the geographic location of IP blocks has gradually changed over 7 years. Though our IP geolocation DB is limited to Korea, it provides wide coverage as well as fine-grained accuracy thanks to 32 million broadband performance test records over 7 years. We believe that the crowd-sourcing broadband measurement approach can contribute to a high-resolution IP geolocation DB owing to the continuous participation of many subscribers across the nation.

Acknowledgment

This research was partly supported by the ICT R&D program of MSIP/IITP [KI001810044556] and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology [NRF-2013R1A1A2007326]. Youngseok Lee is the corresponding author.

7. REFERENCES

[1] I. Poese, S. Uhlig, M. A. Kaafar, B. Donnet, and B. Gueye, “IP geolocation databases: unreliable?,” ACM SIGCOMM Comput. Commun. Rev., vol. 41, no. 2, pp. 53-56, April 2011.

[2] Y. Shavitt and N. Zilberman, “A Geolocation Databases Study,” IEEE JSAC, vol. 29, no. 10, pp. 2044-2056, Dec. 2011.

[3] Y. Shavitt and E. Shir, “DIMES: let the internet measure itself,” ACM SIGCOMM Computer Communication Review, vol. 35, no. 5, Oct. 2005.

[4] S. S. Siwpersad, B. Gueye, and S. Uhlig, “Assessing the geographic resolution of exhaustive tabulation for geolocating internet hosts,” Passive and Active Measurement, vol. 4979, pp. 11-20, 2008.

[5] B. Gueye, S. Uhlig, and S. Fdida, “Investigating the imprecision of IP block-based geolocation,” PAM’07, 2007.

[6] B. Gueye, A. Ziviani, M. Crovella, and S. Fdida, “Constraint-based geolocation of internet hosts,” IEEE/ACM Trans. Netw., vol. 14, no. 6, 2006.

[7] K. Yoshida, Y. Kikuchi, M. Yamamoto, Y. Fujii, K. Nagami, I. Nakagawa, and H. Esaki, “Inferring PoP-level ISP topology through end-to-end delay measurement,” PAM, vol. 5448, pp. 35-44, 2009.

[8] Hexsoft Development, IP2Location, http://www.ip2location.com

[9] http://lite.ip2location.com/edition-comparison

[10] MaxMind LLC, “GeoIP City Accuracy for Selected Countries,” http://www.maxmind.com/app/city_accuracy

[11] http://www.akamai.com/html/solutions/edge-computing.html


[12] OECD broadband portal, http://www.oecd.org/sti/broadband/oecdbroadbandportal.htm

[13] KISA ISIS, http://isis.kisa.or.kr

[14] NIA Speed site, http://speed.nia.or.kr/

[15] M. Freedman, M. Vutukuru, N. Feamster, and H. Balakrishnan, “Geographic locality of IP prefixes,” in Proc. ACM/SIGCOMM IMC, Berkeley, CA, USA, Oct. 2005.

[16] Google, Geolocation API, http://code.google.com/apis/gears/api_geolocation.html
