1

ISSPA 2007, 12 January

N-Gram and Local Context Analysis for Persian Text Retrieval

Abolfazl AleAhmad, Parsia Hakimian, Farzad Mahdikhani
School of Electrical and Computer Engineering, University of Tehran

Farhad Oroumchian
University of Wollongong in Dubai

2 University of Tehran - Database Research Group

Outline

The Persian Language

Used Methods
• Pivoted normalization
• N-Gram approach
• Local Context Analysis

The test collections

Our experiment and the results

Conclusion

4 University of Tehran - Database Research Group

The Persian Language

It is spoken in countries such as Iran, Tajikistan and Afghanistan.

It is written in an Arabic-like script of 32 characters, written continuously from right to left.

Its morphological analyzers need to deal with many word forms that are not actually Farsi, such as Arabic plural forms.

Example
• The word “کافر” (singular) becomes “کفار” (plural)
• The word “عادت” has two plural forms in Farsi:
  – Farsi form: “عادت‌ها”
  – Arabic form: “عادات”

So N-grams are a solution.

5 University of Tehran - Database Research Group

Our Study

We investigated the vector space model on the Persian language using:

• unstemmed single terms
• N-grams
• Local Context Analysis

The experiments used the HAMSHAHRI collection, which contains 160,000+ news articles.

7 University of Tehran - Database Research Group

Vector Space Model

List of weights that produced the best results
(tf, qtf: term frequency in the document and in the query; N: number of documents; n: number of documents containing the term; s: slope; N.U.W: number of unique words)

Name       Weighting
tf.idf     tf*log(N/n) / (sqrt(Σtf^2) * sqrt(Σqtf^2))
lnc.ltc    (1+log(tf))*(1+log(qtf))*log((1+N)/n) / (sqrt(Σtf^2) * sqrt(Σqtf^2))
nxx.bpx    (0.5+0.5*tf/max tf) + log((N-n)/n)
tfc.nfc    tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / (sqrt(Σtf^2) * sqrt(Σqtf^2))
tfc.nfx1   tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / sqrt(Σ(tf*log(N/n))^2)
tfc.nfx2   tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / sqrt(Σtf^2)
Lnu.ltu    ((1+log(tf))*(1+log(qtf))*log((1+N)/n)) / ((1+log(average tf)) * ((1-s) + s*N.U.W/average N.U.W)^2)

Best: Lnu.ltu

8 University of Tehran - Database Research Group

Problem with Document length normalization

Normalization is supposed to remove the effect of differences in document length.

Under cosine normalization, shorter documents get higher weights even though they are less likely to be relevant.

[Figure: average probability of relevance/retrieval (y-axis) versus average of median bin length (x-axis)]

9 University of Tehran - Database Research Group

Lnu.ltu weighting scheme

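A good weighting scheme proposed by Amit Singhal et al. and tested on TREC collections

Based on reducing the gap between the probability of relevance and the probability of retrieval

$$\text{Lnu} = \frac{\dfrac{1+\log(tf)}{1+\log(\text{average } tf)}}{(1-slope)\cdot pivot + slope\cdot N.U.T}$$

$$\text{ltu} = \frac{\left(1+\log(tf)\right)\cdot\log(N/n)}{(1-slope)\cdot pivot + slope\cdot N.U.T}$$

where N.U.T is the number of unique terms (in the document for Lnu, in the query for ltu).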
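A minimal Python sketch of the Lnu and ltu weights with pivoted unique normalization may make the formulas on this slide easier to follow. The function names, the token-list input and the `doc_freq` dictionary are illustrative assumptions, not part of the original system; the default slope of 0.25 and the choice of pivot (average number of unique terms per document) follow the configuration reported later in the deck.

```python
import math
from collections import Counter


def pivoted_unique_norm(num_unique_terms: int, pivot: float, slope: float = 0.25) -> float:
    """Pivoted unique normalization: (1 - slope) * pivot + slope * (# unique terms)."""
    return (1.0 - slope) * pivot + slope * num_unique_terms


def lnu_weights(doc_tokens: list[str], pivot: float, slope: float = 0.25) -> dict[str, float]:
    """Lnu weight of every term in one document (no idf on the document side)."""
    tf = Counter(doc_tokens)
    if not tf:
        return {}
    average_tf = sum(tf.values()) / len(tf)
    norm = pivoted_unique_norm(len(tf), pivot, slope)
    return {
        term: ((1.0 + math.log(freq)) / (1.0 + math.log(average_tf))) / norm
        for term, freq in tf.items()
    }


def ltu_weights(query_tokens: list[str], doc_freq: dict[str, int], num_docs: int,
                pivot: float, slope: float = 0.25) -> dict[str, float]:
    """ltu weight of every query term; doc_freq[t] is the number of documents containing t."""
    tf = Counter(query_tokens)
    norm = pivoted_unique_norm(len(tf), pivot, slope)
    return {
        term: (1.0 + math.log(freq)) * math.log(num_docs / doc_freq[term]) / norm
        for term, freq in tf.items()
    }


# Hypothetical toy collection; the pivot is the average number of unique terms.
docs = [["gas", "price", "gas"], ["bird", "migration"]]
pivot = sum(len(set(d)) for d in docs) / len(docs)
doc_weights = lnu_weights(docs[0], pivot)
```

A document whose number of unique terms equals the pivot is normalized by exactly the pivot; longer documents are divided by a larger factor and shorter ones by a smaller factor, and the slope controls how strongly.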

10 University of Tehran - Database Research Group

Pivoted Normalization

[Figure: probability versus document length, and final normalization factor versus old normalization factor]

Source: A. Singhal, et al. “Pivoted Document Length Normalization”

12 University of Tehran - Database Research Group

N-Gram Approach

N-grams are strings of length n.

In this approach the whole text is treated as a stream of characters and is broken down into substrings of length n.

It is remarkably resistant to textual errors (e.g. OCR errors) and needs no linguistic knowledge.

Example: for n = 4, “مخابرات” yields

مخاب، خابر، ابرا، برات، رات
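The sliding-window split described on this slide can be sketched in a few lines of Python. The function name `char_ngrams` is hypothetical, and the sketch assumes the text is already a Unicode string; it emits only full-length windows, so a trailing shorter fragment such as the last one in the slide's example is not produced.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of `text`, in reading order."""
    if n <= 0:
        raise ValueError("n must be positive")
    # One window starts at each position that still leaves n characters.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# The slide's example word, split into 4-grams.
grams = char_ngrams("مخابرات", 4)
```

Because Python strings are indexed by code point, the same function works unchanged for Persian and Latin text; a string shorter than n simply yields an empty list.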

14 University of Tehran - Database Research Group

Word Mismatch Problem

Automatic query expansion is a good solution for the word mismatch problem in IR:

Local Analysis
• + Expansion based on high-ranking documents
• - Needs an extra search
• - Some queries may retrieve few relevant documents

Global Analysis
• + It has robust average performance
• - Expensive in terms of disk space and CPU
• - Individual queries can be significantly degraded

15 University of Tehran - Database Research Group

Local Context Analysis

Local Context Analysis (LCA) is an automatic query expansion method that combines global analysis (use of context and phrase structure) with local feedback (top-ranked documents).

LCA is fully automated: no information needs to be collected from the user other than the initial query.

+ It is computationally practical

- But it requires an extra search to retrieve the top-ranked documents

16 University of Tehran - Database Research Group

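Local Context Analysis (Cont.)

LCA has three main steps:

1. Run the user's query, break the top N retrieved documents into passages and rank the passages again.

2. Calculate the similarity of each concept in the top-ranked passages with the entire original query using the similarity function:

$$sim(q, c) = \prod_{k_i \in q} \left( \delta + \frac{\log\left(f(c, k_i) \cdot idf_c\right)}{\log n} \right)^{idf_i}$$

where f(c, k_i) measures the co-occurrence of concept c and query term k_i in the top-ranked passages, n is the number of top-ranked passages, and δ is a small constant that keeps the product from vanishing.

3. Add the top M ranked concepts to the original query and rerun the initial retrieval method with the expanded query.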
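Step 2 above can be sketched in Python as follows. This is only an illustration under several assumptions that are not in the slides: concepts are treated as single tokens, each passage is a `Counter` of token frequencies, the idf values are supplied by the caller in a dictionary, δ defaults to 0.1, and the log term is clamped at 0 when the co-occurrence is too small for the logarithm to be positive. The names `lca_similarity` and `top_concepts` are hypothetical.

```python
import math
from collections import Counter


def lca_similarity(query_terms, concept, passages, idf, delta=0.1):
    """Product over query terms of (delta + log(f * idf_c) / log n) ** idf_i."""
    n = len(passages)
    if n < 2:
        raise ValueError("need at least two passages so that log(n) > 0")
    score = 1.0
    for k in query_terms:
        # f(c, k_i): co-occurrence of the concept and the query term,
        # summed over the top-ranked passages (Counter gives 0 if absent).
        f = sum(p[concept] * p[k] for p in passages)
        x = f * idf[concept]
        # Assumption: treat the log term as 0 when x <= 1 (including f == 0),
        # so that the base of the power never becomes negative.
        ratio = math.log(x) / math.log(n) if x > 1.0 else 0.0
        score *= (delta + ratio) ** idf[k]
    return score


def top_concepts(query_terms, passages, idf, m, delta=0.1):
    """Return the m candidate concepts most similar to the whole query."""
    candidates = sorted({t for p in passages for t in p} - set(query_terms))
    ranked = sorted(
        candidates,
        key=lambda c: lca_similarity(query_terms, c, passages, idf, delta),
        reverse=True,
    )
    return ranked[:m]


# Hypothetical top-ranked passages for the query "gas price".
passages = [
    Counter(["gas", "price", "fuel"]),
    Counter(["gas", "fuel", "fuel"]),
    Counter(["bird", "migration"]),
]
idf = {t: 1.0 for p in passages for t in p}
expansion = top_concepts(["gas", "price"], passages, idf, m=1)
```

Because the score is a product over all query terms, a concept that co-occurs with only some of the query terms is penalized by the δ factor for each missing term rather than being eliminated outright.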

18 University of Tehran - Database Research Group

Test Collections

1. Qavanin Collection
Documents: Iranian law collection
• 177089 passages
• 41 queries and relevance judgments

2. Hamshahri Collection
Documents: 600+ MB of news from the Hamshahri newspaper
• 160000+ news articles
• 60 queries and relevance judgments

3. BijanKhan Tagged Collection
Documents: 100+ MB from different sources
• A tag set of 41 tags
• 2590000+ tagged words

19 University of Tehran - Database Research Group

Hamshahri Collection

We used HAMSHAHRI, a test collection for Persian text prepared and distributed by the DBRG (IR team) of the University of Tehran.

The 3rd version:

– contains 160000+ distinct textual news articles in Farsi

– has 60 queries and relevance judgments for the top 20 relevant documents for each query

20 University of Tehran - Database Research Group

Some examples of Queries

Women's rights law: قانون حقوق زنان

Pollution in the Persian Gulf: آلودگی خلیج فارس

Bird migration: کوچ پرندگان

Increase in gasoline price: افزایش قیمت بنزین

Iranian Greco-Roman wrestling: کشتی فرنگی ایران

22 University of Tehran - Database Research Group

Term-based vector space model

A. Singhal, et al. in their paper “Pivoted Document Length Normalization” reported that the following two configurations have the best performance:

Slope=0.25 and using pivoted unique normalization (P.U.N.).

• Pivot = average no. of unique terms in a document

Slope=0.75 and using pivoted cosine normalization (P.C.N.).

• Pivot = average cosine factor for 1+log(tf)

23 University of Tehran - Database Research Group

Our experiment results

Comparison of vector space model with slope=0.25 and slope=0.75

[Figure: precision versus recall for Lnu.ltu with slope 0.25 using P.U.N. and Lnu.ltu with slope 0.75 using P.C.N.]

24 University of Tehran - Database Research Group

Our experiment results

Comparison of vector space model and LCA:

In LCA we used Lnu.ltu (slope=0.25 and P.U.N.)

[Figure: precision versus recall for LCA and Lnu.ltu with slope 0.25 using P.U.N.]

25 University of Tehran - Database Research Group

N-Gram Experiments

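Next, we assessed the N-gram based vector space model for N = 3, 4, 5 on the HAMSHAHRI collection.

In addition to Lnu.ltu we assessed atc.atc, in which both the query and the documents are weighted as follows:

$$\text{atc} = \left(0.5 + 0.5\,\frac{tf}{\max tf}\right)\cdot\ln\frac{N}{n}\cdot\frac{1}{\sqrt{\sum_i w_i^2}}$$

where w_i is the weight of term i before normalization.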
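As a rough illustration of the atc weighting (augmented term frequency, idf, cosine normalization), the following Python sketch computes the weights for one document or query. The function name and the `doc_freq` dictionary (number of documents containing each term) are assumptions made for the example.

```python
import math
from collections import Counter


def atc_weights(tokens: list[str], doc_freq: dict[str, int], num_docs: int) -> dict[str, float]:
    """atc weights: (0.5 + 0.5 * tf / max tf) * ln(N / n), cosine-normalized."""
    tf = Counter(tokens)
    if not tf:
        return {}
    max_tf = max(tf.values())
    # Unnormalized weights w_i: augmented tf times idf.
    raw = {
        term: (0.5 + 0.5 * freq / max_tf) * math.log(num_docs / doc_freq[term])
        for term, freq in tf.items()
    }
    # Cosine normalization: divide by the Euclidean length of the raw vector.
    length = math.sqrt(sum(w * w for w in raw.values()))
    if length == 0.0:
        return raw
    return {term: w / length for term, w in raw.items()}


# Hypothetical statistics: 100 documents, "a" occurs in 10 of them, "b" in 2.
weights = atc_weights(["a", "a", "b"], {"a": 10, "b": 2}, 100)
```

Since the same scheme is applied to queries and documents, the inner product of two such vectors is their cosine similarity. The same function can weight N-grams instead of words if the token list holds N-grams.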

26 University of Tehran - Database Research Group

N-Gram experiment results

N-grams using atc.atc and Lnu.ltu (slope=0.25) weighting schemes

[Figure: precision versus recall for term-based, 3-gram, 4-gram and 5-gram indexing under the Lnu.ltu and atc.atc weighting schemes]

27 University of Tehran - Database Research Group

Previous Works: Comparison of Vector Space System with FuFaIR

They used the first version of the HAMSHAHRI collection (300+ MB), which has 30 queries, in their experiments.

In the vector space model the slope was set to 0.75 and the pivot to 13.36.

28 University of Tehran - Database Research Group

Comparison of vector space systems with BM25

Precision at document cutoff:

System            P@5    P@10   P@15   P@20
vector-Lnu.ltu    0.91   0.83   0.76   0.74
vector-tfc.nfx2   0.66   0.62   0.54   0.59
vector-lnc.ltc    0.63   0.60   0.58   0.55
BM25              0.77   0.71   0.68   0.66

29 University of Tehran - Database Research Group

Experiments on Qavanin Collection

Comparison of Best Vector Space With Best N-grams

[Figure: precision at P@5, P@10, P@15 and P@20 for vector-Lnu.ltu, 3gram-Lnu.ltu, 4gram-Lnu.ltu, paice-nxx.bpx and 4gram-BM25]

Source: F. Oroumchian, F. Mazhar Garamaleki. “An Evaluation of Retrieval Performance Using Farsi Text”. First Eurasia Conference on Advances in Information and Communication Technology, Tehran, Iran, October 2002.

30 University of Tehran - Database Research Group

Our experiment best results

Experiments using atc.atc and Lnu.ltu (slope=0.25) weighting schemes

[Figure: precision versus recall for Lnu.ltu term-based, Lnu.ltu 4-gram-based and LCA]

31 University of Tehran - Database Research Group

Results Analysis (N-Gram)

As shown, the 4-gram based vector space model with the Lnu.ltu weighting scheme performs better than FuFaIR and the other vector space models:

This contradicts the performance of N-grams in English.

The rationale is that the roots of most Farsi words are about 4 characters long.

Our results are more reliable than those of previous work because we used a larger collection with more queries.

32 University of Tehran - Database Research Group

Results Analysis (LCA)

Local Context Analysis only marginally improved the results over the Lnu.ltu method.

The Lnu.ltu weighting method performs very well on the Farsi language.

The LCA parameters should be tuned for the HAMSHAHRI collection.

33 University of Tehran - Database Research Group

Thanks. Questions?

http://ece.ut.ac.ir/dbrg