Surveys of Some Critical Issues in Chinese Indexing

Transcript of Surveys of Some Critical Issues in Chinese Indexing

Page 1: Surveys of Some Critical Issues in Chinese Indexing

Surveys of Some Critical Issues in Chinese Indexing

Chinese Document Indexing and Word Segmentation

• Speaker: Reuy-Lung, Hsiao
• Date: Wed, Dec 22

Page 2: Surveys of Some Critical Issues in Chinese Indexing

Roadmap

1. An overview of Web Information Retrieval systems architecture
2. Automatic indexing overview
3. Questions of Chinese document indexing
4. Typical approaches to index Chinese document sets
5. Chinese word segmentation mechanisms
6. Segmentation algorithms
7. Discussion and Conclusion
8. References

Page 3: Surveys of Some Critical Issues in Chinese Indexing

System Overview

[Architecture diagram: Information Discovery feeds documents into Indexing, which builds the Index Database; a user Request goes through Query Formulation and Similarity Measurement (Ranking) against the Index Database, and the Result Document Set is returned as the Response. Chinese Document Indexing is the component this survey focuses on.]

Page 4: Surveys of Some Critical Issues in Chinese Indexing

Automatic Indexing Overview

1. An automatic indexing mechanism extracts the features (terms or keywords) of a given document.
2. The indexing process may contain the following steps:
   (1) Morphological & lexical analysis: stemming -> stop list -> weighting -> thesaurus construction
   (2) Syntactic & semantic analysis: part-of-speech tagging -> information extraction -> concept extraction
3. Weighting plays an important role in retrieval effectiveness.
   (1) Typical term weighting mechanism: TFxIDF.
   (2) Typical effectiveness measurements: recall, precision.
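A toy sketch of the morphological/lexical steps (tokenize, stem, drop stop words, weight by raw frequency); the suffix-stripping stemmer and the tiny stop list are stand-ins for illustration only:

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a", "and", "is"}          # tiny illustrative stop list

def naive_stem(word: str) -> str:
    """Crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text: str) -> Counter:
    """Morphological & lexical analysis: stem, drop stop words, count frequencies."""
    tokens = (naive_stem(w) for w in text.lower().split())
    return Counter(t for t in tokens if t not in STOP_WORDS)

print(index_terms("Indexing of the documents and weighting of the indexing terms"))
# Counter({'index': 2, 'document': 1, 'weight': 1, 'term': 1})
```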

Page 5: Surveys of Some Critical Issues in Chinese Indexing

Automatic Indexing Overview

4. TFxIDF

   w_ij = tf_ij × log( N / df_j )

   where tf_ij is the frequency of term j in document i, df_j is the number of documents containing term j, and N is the total number of documents.

5. Recall / Precision

   Recall = (# of retrieved relevant documents) / (# of relevant documents)

   Precision = (# of retrieved relevant documents) / (# of retrieved documents)

   Drawing the relevance line and the retrieval line over the collection gives four regions: A (relevant but not retrieved), B (relevant and retrieved), C (retrieved but not relevant), and D (neither). Then

   Recall = B / (A + B)        Precision = B / (B + C)
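Both formulas translate directly into code. A minimal sketch, assuming natural-log TFxIDF and made-up term counts and document sets:

```python
import math

def tfidf(tf: int, df: int, n_docs: int) -> float:
    """w_ij = tf_ij * log(N / df_j): raw term frequency scaled by term rarity."""
    return tf * math.log(n_docs / df)

def recall_precision(retrieved: set, relevant: set) -> tuple[float, float]:
    """Recall = B / (A + B), Precision = B / (B + C), where B is the overlap."""
    b = len(retrieved & relevant)            # region B: retrieved and relevant
    return b / len(relevant), b / len(retrieved)

# A term occurring 3 times in a document and in 100 of 10,000 documents overall.
print(tfidf(tf=3, df=100, n_docs=10_000))                 # ~13.8

# Retrieved documents {1, 2, 3, 4}; relevant documents {2, 3, 5, 6, 7}.
print(recall_precision({1, 2, 3, 4}, {2, 3, 5, 6, 7}))    # (0.4, 0.5)
```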

Page 6: Surveys of Some Critical Issues in Chinese Indexing

Questions of Chinese Document Indexing

1. Words, rather than characters, should be the smallest indexing unit.
   • More specific to the concepts
   • Less index space required
2. A comprehensive lexicon is needed.
3. Chinese text has no delimiters to mark word boundaries, whereas English words are separated by spaces and punctuation. For example:

   中文句子沒有明顯的分隔符號  (a Chinese sentence has no obvious separators)

Page 7: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text

1. N-gram indexing
   • Typically uses N = 1, 2, 3
   • Produces a large index file
2. Statistical indexing
   • Typically uses mutual information for word correlation
3. Word-based indexing
   • Rule-based approach
   • Statistical approach
   • Hybrid approach

Page 8: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (N-gram Indexing)

• N-gram indexing terms produced from the same text string

sentence:  C1 C2 C3 C4 C5 C6
unigrams:  C1, C2, C3, C4, C5, C6
bigrams:   C1C2, C2C3, C3C4, C4C5, C5C6
trigrams:  C1C2C3, C2C3C4, C3C4C5, C4C5C6
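A minimal sketch of how these overlapping index terms can be generated from a character string; the six-character string 中文資訊檢索 is an arbitrary stand-in for C1...C6:

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of order n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

for n in (1, 2, 3):
    print(n, char_ngrams("中文資訊檢索", n))
# 1 ['中', '文', '資', '訊', '檢', '索']
# 2 ['中文', '文資', '資訊', '訊檢', '檢索']
# 3 ['中文資', '文資訊', '資訊檢', '訊檢索']
```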

• N-gram index size for TREC-5 Chinese collection

n-gram     # distinct n-grams    # of n-grams
unigram    6,236                 64,611,662
bigram     1,393,488             54,362,319
trigram    8,119,574             49,886,331

Page 9: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Statistical Indexing)

• Mutual information I(x,y) between two events x and y is defined as

  I(x,y) = log2 [ P(x,y) / ( P(x) P(y) ) ]

• If the two events occur independently, P(x,y) is close to P(x)P(y), so I(x,y) is close to zero.
• If the two events are strongly related, P(x,y) is much larger than P(x)P(y), so I(x,y) is large.
• The probabilities are derived from statistical counts. With f(·) a frequency count and N the corpus size:

  P(C1,C2) = P(C1) P(C2|C1) = ( f(C1) / N ) × ( f(C1C2) / f(C1) ) = f(C1C2) / N

  I(C1,C2) = log2 [ N f(C1C2) / ( f(C1) f(C2) ) ] = log2 N + log2 [ f(C1C2) / ( f(C1) f(C2) ) ]
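The frequency form of the formula is one line of code; the counts below are made up (one character seen 500 times, another 400 times, the bigram 80 times, in a corpus of 100,000 characters):

```python
import math

def mutual_information(f_c1: int, f_c2: int, f_c1c2: int, n: int) -> float:
    """I(C1,C2) = log2( N * f(C1C2) / (f(C1) * f(C2)) )."""
    return math.log2(n * f_c1c2 / (f_c1 * f_c2))

print(mutual_information(500, 400, 80, n=100_000))   # log2(40) ~ 5.32
```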

Page 10: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Statistical Indexing)

• Statistical Indexing Algorithm

1. Compute the mutual information values for all adjacent bigrams.
2. Treat the bigram with the largest mutual information value as a word and remove it from the text.
3. Repeat step 2 on each remaining short phrase until every phrase consists of one or two characters (a sketch of this loop follows below).
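One way to read these three steps as code (a sketch, not the author's implementation): precomputed I(C1,C2) values are passed in as a dictionary, fragments of one or two characters are kept as words, and the output lists words in extraction order rather than sentence order. With the MI values from the 連戰新的競選宣言 table on the next slide, it extracts 連戰, 競選, 宣言, 新的 in exactly the order shown there.

```python
def greedy_mi_segment(phrase: str, mi: dict[str, float]) -> list[str]:
    """Repeatedly take the adjacent bigram with the highest mutual information as a
    word, split the remaining text around it, and stop once every fragment has at
    most two characters."""
    words, fragments = [], [phrase]
    while fragments:
        frag = fragments.pop()
        if len(frag) <= 2:
            words.append(frag)
            continue
        # position of the best adjacent bigram inside this fragment
        i = max(range(len(frag) - 1),
                key=lambda j: mi.get(frag[j:j + 2], float("-inf")))
        words.append(frag[i:i + 2])
        # whatever is left on either side becomes a shorter phrase to process
        fragments += [part for part in (frag[:i], frag[i + 2:]) if part]
    return words

mi = {"連戰": 5.12, "戰新": -7.13, "新的": 0.72, "的競": 1.77,
      "競選": 5.11, "選宣": 1.54, "宣言": 4.14}
print(greedy_mi_segment("連戰新的競選宣言", mi))   # ['連戰', '競選', '宣言', '新的']
```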

• The following statistics are based on text collections from the China Times, dated 12/19/99, 12/20/99, and 12/21/99.
• 621,079 characters in total; 3,827 distinct characters per day on average.
• Comparison among the above indexing methods. (result)

Page 11: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Statistical Indexing)

Example phrase: 連戰新的競選宣言

bigram    f(C1)    f(C2)    f(C1C2)    I(C1,C2)
連戰      543      517      76         5.12
戰新      517      1498     0          -7.13
新的      1498     16187    80         0.72
的競      16187    223      34         1.77
競選      223      1028     61         5.11
選宣      1028     259      2          1.54
宣言      259      305      8          4.14

Step    action         remaining phrase
1       remove 連戰     □□新的競選宣言
2       remove 競選     □□新的□□宣言
3       remove 宣言     □□新的□□□□
4       remove 新的     (done)

other example

Page 12: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Word-based Indexing)

1. Rule-based approach
   • Use a dictionary (lexicon) to match words.
   • Concept: a correct segmentation result should consist of legitimate words. For example, 中國文學 could be segmented as
     (1) 中國 / 文學
     (2) 中國 / 文 / 學
     (3) 中 / 國文 / 學
     (4) 中 / 國 / 文學
     (5) 中 / 國 / 文 / 學
     and we choose (1) as the result.
   • Out-of-vocabulary problem.
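Forward maximum matching (the Max(f) method in the comparison tables near the end) is one common way to realize this dictionary-matching idea. A minimal sketch, assuming a tiny illustrative lexicon; unmatched characters fall out as single-character tokens, which is exactly where the out-of-vocabulary problem shows up:

```python
def forward_max_match(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """At each position, greedily take the longest lexicon word; fall back to a
    single character when nothing matches (a possible out-of-vocabulary item)."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

lexicon = {"中國", "文學", "國文"}
print(forward_max_match("中國文學", lexicon))   # ['中國', '文學'], candidate (1) above
```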

Page 13: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Word-based Indexing)

2. Statistical approach
   • Relies on statistical information such as word and character (co-)occurrence frequencies in the training data.
   • Concept: given a sentence, the best segmentation is the sequence of potential words Si for which the product Π_i P(Si) is the highest.
   • Supervised / unsupervised learning.
   • Requires a large amount of training data to be accurate.
   • Sparse-data problem.

Page 14: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Segmentation Algo.)

• Hybrid Segmentation Algorithm, by Jian-Yun Nie and Martin Brisebois, SIGIR '96
• Uses a lexicon and statistical information to segment words, with morphological heuristic rules to augment lexicon coverage. (Note: supervised learning.)
• Terminology:
  • Background knowledge: words contained in the dictionary, with a default probability (p)
  • Foreground knowledge: statistical information
  • Heuristic rules: two kinds of rules are included
    • Nominal pre-determiner structures such as 這一年 (this year), 一百本 (one hundred copies), 每一天 (every day)
    • Affix structures such as 小朋友 (小 + 朋友, "child"), 大眾化 (大眾 + 化, "popularized")

Page 15: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Segmentation Algo.)

• Algorithm:
  • Combine both kinds of knowledge: if statistical information is available, use it; otherwise the background knowledge is taken into account.
  • Each character in the input string is associated with all the candidate words starting from that character, together with their probabilities.
  • The candidate words are combined to cover the input string, and the word sequence with the highest probability is chosen as the result (see the sketch after the example below).

• Example: 大會決議和議程項目 (Result)

  Single-character candidates: 大 (0.016), 會 (0.029), 決 (0.00108), 議 (0.0005), 和 (0.945), 議 (0.0005), 程 (0.0005), 項 (0.0005), 目 (0.0024)
  Multi-character candidates:  大會 (1.0), 決議 (0.956), 議和 (0.001), 和議 (0.001), 議程 (1.0), 項目 (0.936)

  Chosen segmentation: 大會 決議 和 議程 項目
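A sketch of that final step: combine candidate words to cover the input and keep the sequence whose probability product is the highest. The candidate probabilities are the ones listed above; the exhaustive search over split points is a simplification of whatever lattice search the paper actually uses.

```python
from functools import lru_cache

# Candidate words and single characters with the probabilities from the example.
prob = {"大會": 1.0, "決議": 0.956, "議和": 0.001, "和議": 0.001,
        "議程": 1.0, "項目": 0.936,
        "大": 0.016, "會": 0.029, "決": 0.00108, "議": 0.0005,
        "和": 0.945, "程": 0.0005, "項": 0.0005, "目": 0.0024}

def best_cover(text: str) -> tuple[float, list[str]]:
    """Return the highest-probability candidate-word sequence covering `text`."""
    @lru_cache(maxsize=None)
    def solve(i: int) -> tuple[float, tuple[str, ...]]:
        if i == len(text):
            return 1.0, ()
        best = (0.0, ())
        for j in range(i + 1, len(text) + 1):
            word = text[i:j]
            if word in prob:
                p_rest, rest = solve(j)
                best = max(best, (prob[word] * p_rest, (word,) + rest))
        return best
    p, words = solve(0)
    return p, list(words)

print(best_cover("大會決議和議程項目"))
# (0.8456..., ['大會', '決議', '和', '議程', '項目'])
```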

Page 16: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Segmentation Algo.)

• Unsupervised Segmentation Algorithm, by Xiaoqiang Luo and Salim Roukos, ACL '96
• A pure statistical learning model that uses no dictionary. It divides the training set into two parts, randomly segments part one, and segments part two using the model built from part one.
• The previously constructed language model is used in each iteration.
• A Viterbi-like algorithm is used to build the LM.
• Concept:

  Let a sentence S = C1 C2 ... Cn-1 Cn, where Ci (1 ≤ i ≤ n) is a Chinese character. To segment the sentence into words is to group these characters into words, i.e.

Page 17: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Segmentation Algo.)

  S = C_1 C_2 ... C_{n-1} C_n
    = (C_1 ... C_{x_1})(C_{x_1+1} ... C_{x_2}) ... (C_{x_{m-1}+1} ... C_{x_m})
    = W_1 W_2 ... W_m

  where x_k is the index of the last character of the k-th word W_k, i.e. W_k = C_{x_{k-1}+1} ... C_{x_k} (k = 1..m), with x_0 = 0 and x_m = n.

• A segmentation of the sentence S can be uniquely represented by the integer sequence x_1, ..., x_m, so we denote the set of all possible segmentations by

  G(S) = { (x_1, ..., x_m) | 1 ≤ x_1 < ... < x_m = n, m ≤ n }

  and assign a score to a segmentation g(S) = (x_1, ..., x_m) ∈ G(S) by

Page 18: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Segmentation Algo.)

  L(g(S)) = log P_g(W_1 ... W_m) = Σ_{i=1}^{m} log P_g(W_i | h_i)

  where W_j = C_{x_{j-1}+1} ... C_{x_j} (j = 1..m) and h_i is the history W_1 ... W_{i-1}; here a trigram model with h_i = W_{i-2} W_{i-1} is adopted.

• Among all possible segmentations, we pick the one g* with the highest score as our result. That is,

  g* = arg max_{g ∈ G(S)} L(g(S)) = arg max_{g ∈ G(S)} log P_g(W_1 ... W_m)

• Let L(k) be the maximum accumulated score over the first k characters. L(k) is defined for k = 1..n, with L(0) = 0 and L(g*) = L(n).

Page 19: Surveys of Some Critical Issues in Chinese Indexing

Approaches to indexing Chinese Text (Segmentation Algo.)

• Given { L(i) | 0 ≤ i ≤ k-1 }, L(k) can be computed recursively as follows:

  L(k) = max_{0 ≤ i ≤ k-1} [ L(i) + log P(C_{i+1} ... C_k | h_i) ]

  p(k) = arg max_{0 ≤ i ≤ k-1} [ L(i) + log P(C_{i+1} ... C_k | h_i) ]

  so that C_{p(k)+1} ... C_k is the last word of the optimal segmentation of the first k characters.

• For example, for a six-character sentence C1 C2 C3 C4 C5 C6:

  k:     1  2  3  4  5  6
  p(k):  0  1  1  3  3  4

  Following the back-pointers p(k) from k = 6, the optimal segmentation of the sentence is (C1)(C2 C3)(C4)(C5 C6) (see the sketch below).
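A sketch of the L(k)/p(k) recursion with back-pointer recovery. The language model here is a stand-in: each known word gets a fixed log score and the history h_i is ignored, so it is a unigram simplification of the trigram model in the slides; the sample words and scores are made up, chosen only so that the result reproduces the segmentation (C1)(C2C3)(C4)(C5C6), with the characters written as a..f.

```python
def viterbi_segment(chars: str, log_p) -> list[str]:
    """L(k) = max over 0 <= i < k of [ L(i) + log P(C_{i+1}..C_k) ]; p(k) stores the
    best split point, and following p(k) back from k = n yields the words."""
    n = len(chars)
    L = [0.0] + [float("-inf")] * n        # L(0) = 0
    p = [0] * (n + 1)                      # back-pointers
    for k in range(1, n + 1):
        for i in range(k):
            score = L[i] + log_p(chars[i:k])
            if score > L[k]:
                L[k], p[k] = score, i
    words, k = [], n
    while k > 0:                           # recover the optimal segmentation
        words.append(chars[p[k]:k])
        k = p[k]
    return list(reversed(words))

# Assumed toy scores: in-vocabulary words get a mild penalty, anything else -10.
word_logp = {"a": -1.0, "bc": -1.5, "d": -1.0, "ef": -1.5}
print(viterbi_segment("abcdef", lambda w: word_logp.get(w, -10.0)))
# ['a', 'bc', 'd', 'ef']  i.e. (C1)(C2C3)(C4)(C5C6)
```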

Page 20: Surveys of Some Critical Issues in Chinese Indexing

Discussion and Conclusion

1. Since most Chinese words consist of two characters, bigram and statistical indexing outperform the other methods, even the dictionary-based methods. (According to New Advances in Computers and Natural Language Processing in China, Liu, Information Science (for China), '87: 5% of words are unigrams, 75% are bigrams, 14% are trigrams, and 6% are words of four or more characters.)
2. Character-based indexing is not suited to Chinese text retrieval, for the reasons below:
   • Character-based approaches lead to a great deal of incorrect matching between queries and documents, because characters combine quite freely.

Page 21: Surveys of Some Critical Issues in Chinese Indexing

Discussion and Conclusion

• A complex concept must always be expressed by a fixed character string in both the documents and the query.
• In character-based approaches, every character is dealt with in the same way.
• Character-based approaches do not allow us to easily incorporate linguistic knowledge into the searching process.

3. Word-based indexing is the first step toward concept-based indexing/retrieval, to avoid another information explosion.

Page 22: Surveys of Some Critical Issues in Chinese Indexing

Reference

1. A Statistical Method for Finding Word Boundaries in Chinese Text - Richard Sproat and Chilin Shih, CPOCOL '90
2. On Chinese Text Retrieval - Jian-Yun Nie, Martin Brisebois, SIGIR '96
3. An Iterative Algorithm to Build Chinese Language Models - Xiaoqiang Luo, Salim Roukos, ACL '96
4. Chinese Text Retrieval Without Using a Dictionary - Aitao Chen, Jianzhang He, SIGIR '97
5. A Tagging-Based First-Order Markov Model Approach to Automatic Word Identification for Chinese Sentences - T.B.Y. Lai, M.S. Sun, COLING '98
6. Chinese Indexing Using Mutual Information - Christopher C., Asian Digital Library Workshop '98
7. A New Statistical Formula for Chinese Text Segmentation Incorporating Contextual Information - Yubin Dai, Teck Ee Loh, SIGIR '99
8. Discovering Chinese Words from Unsegmented Text - Xianping Ge, Wanda Pratt, SIGIR '99

Page 23: Surveys of Some Critical Issues in Chinese Indexing

#   Index file   Indexing terms       Segmentation/indexing method    Dictionary, stop-list
1   Unigram      Unigrams             Unigram                         None
2   Bigram       Bigrams              Bigram                          Stop-list only
3   Trigram      Trigrams             Trigram                         Stop-list only
4   MI           Bigrams, unigrams    Mutual information              Stop-list only
5   Max(f)       Words, phrases       Maximum matching (forward)      Both
6   Max(b)       Words, phrases       Maximum matching (backward)     Both
7   Min(f)       Words, phrases       Minimum matching (forward)      Both
8   Min(b)       Words, phrases       Minimum matching (backward)     Both

recall unigram bigram trigram mi Max(f) Max(b) Min(f) Min(b)

0.00 0.7751 0.7504 0.6962 0.7696 0.8000 0.7966 0.7404 0.7265

0.10 0.5609 0.6241 0.5006 0.6500 0.6465 0.6414 0.5543 0.5611

0.20 0.4076 0.5243 0.3600 0.5355 0.5283 0.5028 0.4336 0.4432

0.30 0.3400 0.4778 0.2932 0.4705 0.4308 0.4518 0.3595 0.3734

0.40 0.2904 0.4375 0.2546 0.4324 0.3841 0.4085 0.3049 0.3245

0.50 0.2486 0.3864 0.2153 0.3872 0.3455 0.3671 0.2569 0.2903

0.60 0.2050 0.3295 0.1815 0.3346 0.2947 0.3131 0.2216 0.2351

0.70 0.1576 0.2749 0.1586 0.2843 0.2439 0.2678 0.1657 0.1912

0.80 0.0982 0.2173 0.1142 0.2353 0.1891 0.2017 0.1221 0.1217

0.90 0.0300 0.1241 0.0581 0.1378 0.1051 0.1105 0.0819 0.0778

1.00 0.0031 0.0108 0.0091 0.0208 0.0282 0.0341 0.0197 0.0118

Average precision:            0.2609   0.3677   0.2405   0.3744   0.3558    0.3465   0.2738   0.2862
Relative to Max(f) baseline:  -26.67%  +3.34%   -32.40%  +5.23%   baseline  -2.61%   -23.04%  -19.56%

Page 24: Surveys of Some Critical Issues in Chinese Indexing

[Precision-recall curves (precision 0 to 0.7 vs. recall 0 to 1) comparing the Dictionary, Statistical, and Hybrid segmentation methods.]

• Corpus: 1270 KB
• Training set: 1,247; test set: 272
• 90 words on average, 160 characters per document
• Segmentation accuracy is around 91%
• A stop-list is used, containing words such as 的 (possessive particle), 並 (and), 除非 (unless), 此外 (besides), ...

<Back>

Page 25: Surveys of Some Critical Issues in Chinese Indexing

Example: 宋楚瑜興票案愈演愈烈 (849967/3998)

bigram    f(C1)    f(C2)    f(C1C2)    I(C1,C2)
宋楚      1103     800      665        6.15
楚瑜      800      673      665        6.64
瑜興      673      498      1          0.62
興票      498      687      191        5.85
票案      687      1061     66         4.03
案愈      1061     107      1          1.70
愈演      107      355      2          3.49
演愈      355      107      2          3.49
愈烈      107      118      2          4.58

Result: 宋 楚瑜 興票 案 愈演 愈烈

<back>

Example: 宋楚瑜興票案愈演愈烈 (1865718/4513)

bigram    f(C1)    f(C2)    f(C1C2)    I(C1,C2)
宋楚      2820     2065     1649       6.27
楚瑜      2065     1703     1678       6.79
瑜興      1703     1310     4          1.21
興票      1310     1945     383        5.64
票案      1945     2891     90         3.40
案愈      2891     345      2          1.32
愈演      345      1085     4          2.99
演愈      1085     345      4          2.99
愈烈      345      360      4          4.10

Result: 宋 楚瑜 興票 案 愈演 愈烈