AD-HOC GEOREFERENCING OF WEB-PAGES USING STREET-NAME PREFIX TREES
Andrei Tabarcea, Ville Hautamäki, Pasi Fränti
University of Eastern Finland
http://cs.joensuu.fi/mopsi/


INTRODUCTION

• Our goal is to find services and points of interest close to the user’s location
• We call this “location-based search”
• We try to find location information in web-pages

AD-HOC GEOREFERENCING

• The problem is how to extract and validate location data from free-form text
• Most web pages don’t contain explicit georeferencing (e.g. geo-tags)
• A postal address is the most common form of location data found
• Our goal is to give geographical coordinates to services mentioned in web-pages
• We call this method ad-hoc georeferencing

<HTML><HEAD profile="http://geotags.com/geo">
<META name="geo.position" content="62.35;29.44">
<META name="geo.region" content="FI">
<META name="geo.placename" content="Joensuu">
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="http://www.joensuu.fi/tkt/sivutyyli.css" type="text/css">
<TITLE>Pages of Pasi Fränti</TITLE></HEAD>
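For the minority of pages that do carry explicit geo-tags, reading them is straightforward. The following is a minimal sketch using Python's standard `html.parser`; the class and function names (`GeoTagParser`, `extract_geo_position`) are illustrative assumptions and not part of MOPSI.

```python
from html.parser import HTMLParser


class GeoTagParser(HTMLParser):
    """Collects <meta name="geo.*" content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.geo = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name.startswith("geo.") and attr.get("content"):
            self.geo[name] = attr["content"]


def extract_geo_position(html):
    """Return (latitude, longitude) from geo.position, or None if absent."""
    parser = GeoTagParser()
    parser.feed(html)
    position = parser.geo.get("geo.position")
    if position is None:
        return None
    try:
        lat, lon = (float(part) for part in position.split(";"))
    except ValueError:
        return None  # malformed content, e.g. wrong number of fields
    return lat, lon


page = (
    '<HTML><HEAD profile="http://geotags.com/geo">'
    '<META name="geo.position" content="62.35;29.44">'
    '<META name="geo.region" content="FI">'
    '<META name="geo.placename" content="Joensuu">'
    "<TITLE>Pages of Pasi Fränti</TITLE></HEAD>"
)
print(extract_geo_position(page))  # (62.35, 29.44)
```

Pages without such tags return `None`, which is the case ad-hoc georeferencing is meant to handle.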

MOPSI LOCATION-BASED SEARCH

MOPSI = Mobiilit paikkatietosovellukset ja Internet (Mobile location-based applications and Internet)

Available at http://cs.joensuu.fi/mopsi/

Main focus areas:
• Mobile search engine
• How to collect & present location-based data
• Other location-related topics

MOBILE SEARCH ENGINE

– How can you find services:
   – Asking directions
   – Advertisements
   – Wandering around
   – Yellow pages
   – Internet
– Query consists of:
   – Keyword
   – Location

MOBILE SEARCH ENGINE STRUCTURE

[Figure: search engine architecture. Components: mobile application, web user interface, core server software, geocoded street-name database. Data-flow labels: keyword, address, coordinates, search results.]

Search engine consists of:
• User interface
• Core server software
• Geocoded street-name database

CORE SERVER SOFTWARE

[Figure: core server software. Components: relevant municipalities detector, page parser, georeferencing module (address and description detector, address validator), geocoded database. Data-flow labels: keyword, coordinates, municipalities list, <keyword, municipality> query, result links, word list, addresses, results list, sorted results list.]


OUR SOLUTION

• A rule-based solution that detects address-based locations using a gazetteer and street-name prefix trees created from the gazetteer
• We compare this approach against:
   – a method that doesn’t require a gazetteer (a heuristic method that assumes that the street-name has a certain structure)
   – a method that also uses data structures created from the gazetteer in the form of street-name arrays

StreetNameDetection(words)
{
    i = 0;
    WHILE i < count(words) DO
    {
        IF words[i] = street name THEN
        {
            Search for street number, postal code and other address elements near words[i].
            IF address elements found THEN
            {
                Create address block
                Get coordinates using Geocoded Database
                IF coordinates found THEN
                    Add address block to address list
            }
        }
        i = i + 1;
    }
}
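The pseudocode above can be sketched in runnable form as follows. This is only an illustration under strong assumptions: the gazetteer is a tiny hypothetical dictionary with placeholder coordinates, street names are single words, and the house number and a five-digit postal code are searched in a small window after the street name. The actual rules used by the authors are not given in the slides.

```python
import re

# Hypothetical toy gazetteer: (street name, house number) -> (lat, lon).
# The coordinates are placeholder values, not real geocoding results.
GAZETTEER = {
    ("kauppakatu", "12"): (62.60, 29.76),
    ("torikatu", "3"): (62.60, 29.75),
}
STREET_NAMES = {street for street, _number in GAZETTEER}

HOUSE_NUMBER = re.compile(r"\d{1,4}[a-z]?", re.IGNORECASE)
POSTAL_CODE = re.compile(r"\d{5}")  # assumed five-digit postal codes


def detect_addresses(words, window=3):
    """Scan the word list and return address blocks validated by the gazetteer."""
    addresses = []
    i = 0
    while i < len(words):
        street = words[i].lower()
        if street in STREET_NAMES:
            # Look for other address elements near words[i].
            nearby = words[i + 1 : i + 1 + window]
            number = next((w for w in nearby if HOUSE_NUMBER.fullmatch(w)), None)
            postal_code = next((w for w in nearby if POSTAL_CODE.fullmatch(w)), None)
            if number is not None:
                # Validate the candidate: keep it only if it can be geocoded.
                coordinates = GAZETTEER.get((street, number.lower()))
                if coordinates is not None:
                    addresses.append(
                        {
                            "street": words[i],
                            "number": number,
                            "postal_code": postal_code,
                            "coordinates": coordinates,
                        }
                    )
        i += 1
    return addresses


text = "Pizzeria at Kauppakatu 12, 80100 Joensuu. Old shop: Torikatu 99."
words = re.findall(r"\w+", text)
for address in detect_addresses(words):
    print(address)
```

In this sample only the Kauppakatu 12 block is kept; Torikatu 99 is detected as a candidate but rejected because the toy gazetteer has no coordinates for it.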

STREET-ADDRESS DETECTION

• We use a rule-based pattern matching algorithm
• The detection of street-names is the starting point of the algorithm
• An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipal names)
• Address block candidates are validated using the gazetteer

STREET-NAME DETECTION

• Street-name detection is the starting point of the address detection
• Heuristic and brute-force methods are compared against our prefix tree solution
• Our application uses a commercial gazetteer for Finland and, for Singapore, street data from the free map project OpenStreetMap

Gazetteer statistics                  Finland    Singapore
Number of municipalities              410        1
Total number of street names          92 572     573
Number of streets per municipality    474        573
Average street name length            11.6       6.1
Total size (MB)                       2 982      0.18

PREFIX TREES

• Introduced by Fredkin (1960)
• The prefix tree (or trie) is a fast ordered tree data structure used for retrieval
• Root is associated with an empty string
• All descendants of a node share the string associated with that node as a common prefix
• Some nodes can have associated values (usually they mark the end of a word)
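The properties listed above can be made concrete with a small character-level trie. This is a generic textbook sketch, not the authors' implementation; the names `TrieNode`, `Trie`, `contains` and `has_prefix` are chosen here for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # maps one character to the child node
        self.value = None   # set only on nodes that end a stored word


class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root stands for the empty string

    def insert(self, word, value=True):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value  # mark the end of the word

    def _find(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.value is not None

    def has_prefix(self, prefix):
        return self._find(prefix) is not None


trie = Trie()
for word in ["tea", "ten", "to"]:
    trie.insert(word)

print(trie.contains("ten"))    # True
print(trie.contains("te"))     # False: "te" is only a prefix
print(trie.has_prefix("te"))   # True: shared by "tea" and "ten"
```

A lookup costs at most one step per character of the query, independent of how many words are stored, which is why the structure suits checking many words against a large name list.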

STREET-NAME PREFIX TREES

• Our solution is to detect street-names using prefix trees constructed from the gazetteer
• A street-name prefix tree is built for each municipality used in the search
• The user’s location and area of interest are known, therefore the prefix trees can be limited to the relevant municipalities

Prefix tree statistics              Finland    Singapore
Maximum tree depth                  34         14
Average tree depth                  12.7       7.4
Average tree width                  105        167
Average number of nodes per tree    2338       2335
Total size (MB)                     74.4       0.18
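One way to picture per-municipality street-name trees is the sketch below. It assumes character-level trees stored as nested dictionaries, case-insensitive matching, and a longest-match scan that may span several words; the street lists are tiny illustrative samples, and the function names are hypothetical rather than taken from MOPSI.

```python
END = "$end"  # marker key: a street name ends at this node


def build_tree(street_names):
    """Build a character-level prefix tree from a list of street names."""
    root = {}
    for name in street_names:
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[END] = name
    return root


def longest_street_at(tree, words, i):
    """Return (street name, words consumed) for the longest match at words[i]."""
    node, best = tree, None
    for j in range(i, len(words)):
        # Words after the first are joined to the match with a single space.
        text = words[j].lower() if j == i else " " + words[j].lower()
        for ch in text:
            node = node.get(ch)
            if node is None:
                return best
        if END in node:
            best = (node[END], j - i + 1)
    return best


def detect_street_names(trees, municipalities, words):
    """Search only the trees of the municipalities relevant to the query."""
    found = []
    for municipality in municipalities:
        tree = trees.get(municipality)
        if tree is None:
            continue
        i = 0
        while i < len(words):
            match = longest_street_at(tree, words, i)
            if match:
                name, consumed = match
                found.append((municipality, name, i))
                i += consumed
            else:
                i += 1
    return found


# Illustrative sample data only, not the gazetteers used in the experiments.
trees = {
    "Joensuu": build_tree(["Kauppakatu", "Torikatu"]),
    "Singapore": build_tree(["Orchard Road", "Orchard Boulevard", "Bras Basah Road"]),
}
words = "Hotel on Orchard Road 442 near Bras Basah Road".split()
print(detect_street_names(trees, ["Singapore"], words))
# [('Singapore', 'Orchard Road', 2), ('Singapore', 'Bras Basah Road', 6)]
```

Because a non-matching word is usually rejected after one or two characters, most of the page text is skipped cheaply, and restricting the search to the relevant municipalities keeps each tree small.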

OTHER SOLUTIONS

• Heuristic solution
   – Relies on regular expression matching
   – Street names usually have similar endings or similar prefixes
   – A gazetteer is not needed (except for validation)
   – Can be fast but not precise
• Brute-force solution
   – Every word should be checked if it exists in the gazetteer
   – An optimized solution is used (gazetteer is locally limited and preloaded into arrays)
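The two baselines can be contrasted in a few lines. This sketch assumes a handful of common Finnish street-name endings (-katu, -tie, -kuja, -polku) for the heuristic and a small preloaded list for the brute-force check; the actual regular expressions and array layout used in the experiments are not given in the slides, and "Keksittykatu" is a made-up name used to show a false candidate.

```python
import re

# Heuristic: any word with a typical street-name ending is a candidate.
STREET_SUFFIX = re.compile(r"\b\w+(?:katu|tie|kuja|polku)\b", re.IGNORECASE)


def heuristic_candidates(text):
    """Return words that merely look like street names (no gazetteer needed)."""
    return STREET_SUFFIX.findall(text)


def brute_force_matches(words, street_array):
    """Check every word against the preloaded street-name array (linear scan)."""
    return [w for w in words if w.lower() in street_array]


# Toy stand-in for a locally limited gazetteer preloaded into an array.
street_array = ["kauppakatu", "koskikatu", "siltakatu"]

text = "Restaurant at Kauppakatu 12, formerly at Keksittykatu 5"
words = re.findall(r"\w+", text)

print(heuristic_candidates(text))                # ['Kauppakatu', 'Keksittykatu']
print(brute_force_matches(words, street_array))  # ['Kauppakatu']
```

The heuristic accepts anything with the right shape, so its candidates still need validation, while the brute-force check is exact but compares every word of the page against every entry in the array.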

EXPERIMENTS

• 10 urban locations (blue) and 10 rural locations (orange) were used for testing
• Testing was done using the MOPSI prototype for Finland and Singapore
• Both commercial and non-commercial keywords were used:

Commercial: hotel, restaurant, pizzeria, cinema, car repair
Non-commercial: hospital, museum, police station, swimming hall, church

RESULTS

• Average processing times for every solution were calculated

• The prefix tree solution proved to be on average 57% faster and 10% more accurate than the heuristic solution and 10 times faster than the brute-force solution

• The resulting solution improves the speed and quality of web-page georeferencing

Method         Time (s)    Standard deviation    Validated addresses

Rural municipalities
Brute-Force    3.01        2.43                  3.7
Heuristic      1.54        1.15                  2.5
Prefix Tree    0.51        0.35                  3.7

Urban municipalities
Brute-Force    10.18       7.11                  19.8
Heuristic      1.70        1.24                  18.6
Prefix Tree    0.87        0.85                  19.8

Total
Brute-Force    6.59        6.40                  11.8
Heuristic      1.62        1.20                  10.5
Prefix Tree    0.69        0.68                  11.8
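The headline figures can be checked against the "Total" rows of the table. The short calculation below assumes that "57% faster" means the relative reduction in average processing time and that accuracy is compared through the average number of validated addresses; the slides do not state the exact formulas, so this is one plausible reading.

```python
# "Total" rows of the results table: (average time in seconds, validated addresses).
totals = {
    "Brute-Force": (6.59, 11.8),
    "Heuristic": (1.62, 10.5),
    "Prefix Tree": (0.69, 11.8),
}

pt_time, pt_valid = totals["Prefix Tree"]
h_time, h_valid = totals["Heuristic"]
bf_time, _bf_valid = totals["Brute-Force"]

time_saving = (h_time - pt_time) / h_time      # reduction relative to the heuristic
speedup_bf = bf_time / pt_time                 # speed-up factor over brute force
extra_valid = (pt_valid - h_valid) / h_valid   # additional validated addresses

print(f"Time saved vs heuristic: {time_saving:.0%}")           # 57%
print(f"Speed-up vs brute force: {speedup_bf:.1f}x")           # 9.6x
print(f"More validated addresses vs heuristic: {extra_valid:.0%}")  # 12%
```

Under these assumptions the table reproduces the 57% figure, gives a speed-up of a little under ten times over brute force, and a gain of roughly 11 to 12% in validated addresses depending on the baseline used, in line with the rounded "10 times" and "10%" quoted above.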

OPEN PROBLEMS

• Support approximate matching to cope with misspellings
• Improve flexibility of the address detection algorithm
• Implement a way to learn rules automatically from a hand-tagged example corpus

http://cs.joensuu.fi/mopsi

Thank you!