Information Extraction on Real Estate Rental Classifieds Eddy Hartanto Ryohei Takahashi.
Date posted: 21-Dec-2015
Information Extraction on Real Estate Rental Classifieds
Eddy Hartanto
Ryohei Takahashi
Overview
We want to extract 10 fields:
- Security deposit
- Square footage
- Number of bathrooms
- Contact person’s name
- Contact phone number
- Nearby landmarks
- Cost of parking
- Date available
- Building style / architecture
- Number of units in building

These fields can’t easily be served by keyword search.
Approach
- Hand-labeled test set as the basis for computing precision and recall
- Pattern-matching approach with Rapier
- Statistical approach using HMMs with different structures
Demo …
Hidden Markov Models
- We consider three different HMM structures
- We train one HMM per field
- Words in postings are the output symbols of the HMM
- Hexagons represent target states, which emit the relevant words for that field
Training Data
- We use a randomly selected set of 110 postings as the training data
- We manually label which words in each posting are relevant to each of the 10 fields
HMM Structure #1
- A single prefix state and a single suffix state
- Prefixes and suffixes can be of arbitrary length
HMM Structure #2
Varying numbers of prefix, suffix, and target states
HMM Structure #3
- Varying numbers of prefix, suffix, and target states
- Prefixes and suffixes are fixed in length
Cross-Validation
- We use cross-validation to find the optimal number of prefix, suffix, and target states
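As a rough illustration of how cross-validation can pick the state counts, here is a minimal sketch. The slides do not state the number of folds or the scoring metric, so `k=5`, the helper names `k_fold_splits` and `select_state_counts`, and the caller-supplied `train_and_score` callback (which would train an HMM with the given state counts and return, say, an F-measure on the held-out postings) are all assumptions made for this example.

```python
import itertools


def k_fold_splits(items, k):
    """Yield (train, held_out) pairs for k-fold cross-validation."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, held_out


def select_state_counts(postings, candidates, train_and_score, k=5):
    """Return the (n_prefix, n_target, n_suffix) tuple with the best mean score.

    train_and_score(train, held_out, config) is a hypothetical callback that
    trains a model with the given state counts and returns a score where
    higher is better.
    """
    best_config, best_score = None, float("-inf")
    for config in candidates:
        scores = [
            train_and_score(train, held_out, config)
            for train, held_out in k_fold_splits(postings, k)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score


# Candidate grid: 1 to 3 states of each kind (an arbitrary range for illustration).
candidate_grid = list(itertools.product(range(1, 4), repeat=3))
```

With 110 labeled postings and five folds, each candidate structure would be trained on 88 postings and scored on the remaining 22, five times over.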
Preventing Underflow
- Postings are hundreds of words long
- Forward and backward probabilities become incredibly small, which causes underflow
- To avoid underflow, we normalize the forward probabilities, using

$$\hat{\alpha}_t(i) = P(q_t = S_i \mid O_1, \ldots, O_t, \lambda)$$

instead of

$$\alpha_t(i) = P(O_1, \ldots, O_t, q_t = S_i \mid \lambda)$$
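The normalization can be sketched as a scaled forward pass. This is an illustrative version, not the authors' code: the function name `forward_normalized`, the list-of-lists model layout (`pi`, `A`, `B`), and the toy parameters are assumptions. At each step the forward values are rescaled to sum to 1, so they represent the probability of each state given the observations so far, and the log-likelihood is accumulated from the scaling factors.

```python
import math


def forward_normalized(pi, A, B, obs):
    """Scaled forward pass for a discrete HMM.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[i][k] : probability that state i emits symbol k
    obs     : list of observed symbol indices

    Returns (alpha_hat, log_likelihood), where alpha_hat[t][i] is the
    probability of state i at time t given the observations up to time t.
    """
    n = len(pi)
    alpha_hat = []
    log_likelihood = 0.0
    prev = None
    for t, o in enumerate(obs):
        if t == 0:
            unnorm = [pi[i] * B[i][o] for i in range(n)]
        else:
            unnorm = [
                B[i][o] * sum(prev[j] * A[j][i] for j in range(n))
                for i in range(n)
            ]
        scale = sum(unnorm)  # probability of this observation given the earlier ones
        if scale == 0.0:
            raise ValueError("observation sequence has zero probability under the model")
        prev = [u / scale for u in unnorm]
        alpha_hat.append(prev)
        log_likelihood += math.log(scale)
    return alpha_hat, log_likelihood


# Toy two-state model and a 2000-symbol sequence (values chosen only for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
alpha_hat, log_likelihood = forward_normalized(pi, A, B, [0, 1] * 1000)
```

On a sequence this long the unnormalized forward probabilities would fall below the smallest representable double, while the scaled values stay in a safe range and the likelihood is still recoverable in log form.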
Smoothing
- We perform add-one smoothing for the emission probabilities:

$$b_i^*(k) = \frac{\sum_{t=1,\, O_t = v_k}^{T} \gamma_t(i) + 1}{\sum_{t=1}^{T} \gamma_t(i) + M}$$
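A minimal sketch of add-one smoothing for the emission probabilities follows. It assumes the usual reading of the symbols: `gamma[t][i]` is the posterior probability of being in state i at time t, and `num_symbols` (M) is the number of distinct output symbols. The function name `smoothed_emissions` and the data layout are illustrative choices, not taken from the slides.

```python
def smoothed_emissions(gamma, obs, num_symbols):
    """Add-one smoothed emission estimates.

    gamma[t][i] : posterior probability of state i at time t
    obs[t]      : index of the symbol observed at time t
    num_symbols : vocabulary size M

    Returns b, where b[i][k] = (expected count of symbol k in state i + 1)
                               / (expected time in state i + M).
    """
    num_states = len(gamma[0])
    b = []
    for i in range(num_states):
        expected_counts = [0.0] * num_symbols
        for g, o in zip(gamma, obs):
            expected_counts[o] += g[i]
        total = sum(g[i] for g in gamma)
        b.append([(c + 1.0) / (total + num_symbols) for c in expected_counts])
    return b


# Tiny example: three time steps, two states, a vocabulary of three symbols.
gamma = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
obs = [0, 0, 1]
b = smoothed_emissions(gamma, obs, num_symbols=3)
```

Every symbol receives a nonzero probability in every state, including words that never appeared in the training postings for that state, and each row of `b` still sums to 1.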
Rapier
- Rapier automatically learns rules to extract fields from training examples
- We use the same 110 training postings as for the HMMs
Data Preparation
- Sentence Splitter (Cognitive Computation Group at UIUC, http://l2r.cs.uiuc.edu/~cogcomp/tools.php): puts one sentence on each line
- Stanford Tagger (Stanford NLP Group, http://nlp.stanford.edu/software/tagger.shtml): tags each word with its part of speech
- We then manually create a template file for each of the files, with the information for the 10 fields filled in
Test Data
- We use a randomly selected set of 100 postings as the test data
- We manually label these 100 postings with the fields
Rapier Results
- We use Rapier’s “test2” program to evaluate performance on the labeled postings
- Training set: precision 0.990099, recall 0.408998, F-measure 0.578871
- Test set: precision 0.747126, recall 0.151869, F-measure 0.252427
Another run at Rapier

Overall: precision 0.847, recall 0.201, F-measure 0.324

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 10 | 10 | 1 | 0.417 | 0.588 |
| no_bathrooms | 58 | 28 | 25 | 0.893 | 0.431 | 0.581 |
| contact_person | 40 | 28 | 24 | 0.857 | 0.6 | 0.706 |
| contact_phone | 93 | 2 | 1 | 0.5 | 0.011 | 0.021 |
| nearby_landmarks | 76 | 8 | 5 | 0.625 | 0.066 | 0.119 |
| parking_cost | 4 | 0 | 0 | 0 | 0 | 0 |
| date_available | 21 | 1 | 0 | 0 | 0 | 0 |
| building_style | 6 | 4 | 4 | 1 | 0.667 | 0.8 |
| no_units | 14 | 4 | 3 | 0.75 | 0.214 | 0.333 |
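The precision, recall, and F-measure figures for this Rapier run can be reproduced from the three counts per field. The sketch below is illustrative; the function name `precision_recall_f` is an assumption, and a field with nothing retrieved is scored as 0, which matches how the results are reported. Pooling the counts across all fields reproduces the overall figures for this run.

```python
def precision_recall_f(correct, retrieved, correct_and_retrieved):
    """Compute (precision, recall, F-measure) from raw counts.

    correct               : number of labeled (gold) instances of the field
    retrieved             : number of instances the extractor returned
    correct_and_retrieved : number of returned instances that were right
    """
    precision = correct_and_retrieved / retrieved if retrieved else 0.0
    recall = correct_and_retrieved / correct if correct else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Counts from the "Another run at Rapier" results:
# (Correct, Retrieved, Correct & Retrieved)
rapier_counts = {
    "security_deposit": (23, 0, 0),
    "square_footage": (24, 10, 10),
    "no_bathrooms": (58, 28, 25),
    "contact_person": (40, 28, 24),
    "contact_phone": (93, 2, 1),
    "nearby_landmarks": (76, 8, 5),
    "parking_cost": (4, 0, 0),
    "date_available": (21, 1, 0),
    "building_style": (6, 4, 4),
    "no_units": (14, 4, 3),
}

per_field = {name: precision_recall_f(*c) for name, c in rapier_counts.items()}

# Overall scores pool the counts over all fields before dividing.
totals = [sum(column) for column in zip(*rapier_counts.values())]
overall = precision_recall_f(*totals)
```

The pooled totals are 359 labeled instances, 85 retrieved, and 72 both correct and retrieved, so the overall precision is 72/85 and the overall recall is 72/359.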
HMM Structure #1

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 0 | 0 | 0 | 0 | 0 |
| no_bathrooms | 58 | 100 | 17 | 0.17 | 0.293 | 0.215 |
| contact_person | 40 | 100 | 0 | 0 | 0 | 0 |
| contact_phone | 93 | 41 | 26 | 0.634 | 0.28 | 0.388 |
| nearby_landmarks | 76 | 0 | 0 | 0 | 0 | 0 |
| parking_cost | 4 | 59 | 0 | 0 | 0 | 0 |
| date_available | 21 | 100 | 0 | 0 | 0 | 0 |
| building_style | 6 | 100 | 2 | 0.02 | 0.333 | 0.038 |
| no_units | 14 | 0 | 0 | 0 | 0 | 0 |

Overall: precision 0.09, recall 0.125, F-measure 0.105
HMM Structure #2

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 9 | 8 | 0.889 | 0.333 | 0.485 |
| no_bathrooms | 58 | 0 | 0 | 0 | 0 | 0 |
| contact_person | 40 | 0 | 0 | 0 | 0 | 0 |
| contact_phone | 93 | 100 | 7 | 0.07 | 0.075 | 0.073 |
| nearby_landmarks | 76 | 0 | 0 | 0 | 0 | 0 |
| parking_cost | 4 | 0 | 0 | 0 | 0 | 0 |
| date_available | 21 | 0 | 0 | 0 | 0 | 0 |
| building_style | 6 | 3 | 0 | 0 | 0 | 0 |
| no_units | 14 | 0 | 0 | 0 | 0 | 0 |

Overall: precision 0.134, recall 0.042, F-measure 0.064
HMM Structure #3

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 9 | 8 | 0.889 | 0.333 | 0.485 |
| no_bathrooms | 58 | 100 | 37 | 0.37 | 0.638 | 0.468 |
| contact_person | 40 | 100 | 4 | 0.04 | 0.1 | 0.057 |
| contact_phone | 93 | 100 | 6 | 0.06 | 0.065 | 0.062 |
| nearby_landmarks | 76 | 100 | 7 | 0.07 | 0.092 | 0.08 |
| parking_cost | 4 | 0 | 0 | 0 | 0 | 0 |
| date_available | 21 | 4 | 0 | 0 | 0 | 0 |
| building_style | 6 | 31 | 1 | 0.032 | 0.167 | 0.054 |
| no_units | 14 | 0 | 0 | 0 | 0 | 0 |

Overall: precision 0.142, recall 0.175, F-measure 0.157
Insights
- Relatively good performance with Rapier
- Poor performance with the HMMs, due to a lack of training data (only 0.67%, or 100 postings sampled randomly from 15000), while the test data is 10%, or 1500 postings sampled from the same 15000
- Automatic spelling correction is limited, even when enhanced with California town, city, and county names and with first names
- A more advanced ontology would help, as WordNet is somewhat limited: it should recognize entities such as SJSU, Albertson, and street names
Question & Answer