Text segmentation

Text segmentation Amany AlKhayat


Transcript of Text segmentation

Page 1: Text segmentation

Text segmentation

Amany AlKhayat

Page 2: Text segmentation

• Before any real processing is done, text needs to be segmented at least into linguistic units such as words, punctuation, numbers.

• This process is called tokenization and segmented units are called word tokens.

• Ex: In addition, she was there.

• After segmentation: In addition , she was there .
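
The segmentation step above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used in the slides: the pattern `TOKEN_RE` and the function name `tokenize` are assumptions chosen for this example, and the pattern treats every run of word characters as one token and every other non-space symbol as a token of its own.

```python
import re

# Assumed pattern for illustration: a run of word characters (words, numbers),
# or a single symbol that is neither a word character nor white space.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = tokenize("In addition, she was there.")
print(" ".join(tokens))  # In addition , she was there .
```

A real tokenizer would also need rules for abbreviations, hyphenated words and numbers with internal punctuation, which this sketch ignores.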

Page 3: Text segmentation

Tokenization

• Tokenization and sentence splitting can be described as ‘low-level’ segmentation, which is performed at the initial stage of text processing. These tasks are usually handled by regular expressions written in Perl or any other programming language.

Page 4: Text segmentation

Tokenization II

• High-level text segmentation, or intra-sentential segmentation, involves segmenting linguistic groups within a sentence, such as named entities and noun groups.

• Inter-sentential segmentation involves grouping of sentences and paragraphs into discourse topics which are also called text tiles.

Page 5: Text segmentation

Word segmentation

• A word can occur many times in a text; each occurrence is a word token.

• Word types are the distinct words of the vocabulary.

• Ex.: Shakespeare’s works include more than 800,000 word tokens but only about 31,000 word types.
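
The difference between tokens and types can be made concrete with a short count. The sketch below is an illustration only: the helper name `count_tokens_and_types` and the very rough word pattern (lower-cased letters and apostrophes) are assumptions, not part of the original slides.

```python
import re
from collections import Counter

def count_tokens_and_types(text):
    """Return (number of word tokens, number of word types) in text."""
    # Assumed, very rough notion of a word: letters and apostrophes, lower-cased.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)            # one entry per distinct word (type)
    return len(words), len(counts)

n_tokens, n_types = count_tokens_and_types("To be, or not to be, that is the question.")
print(n_tokens, n_types)  # 10 8
```

Here "to" and "be" each occur twice, so the ten tokens collapse into eight types.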

Page 6: Text segmentation

Tokenizing sentences

• Tokenizing sentences by inserting white space is tedious. Moreover, once the spaces have been added, the text cannot be restored to its original form.

• SGML or XML markup is a cleaner strategy for tokenization, because the marked-up text can easily be reverted to the original.

• Ex.: <w c=w>it</w> <w c=w>is</w> <w c=w>here</w><w c=p>.</w>
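
A minimal sketch of the markup strategy, assuming the same `w` element and `c` attribute as in the example (written here with quoted attribute values, as XML requires). The function names `mark_up` and `strip_markup` are hypothetical. Because the white space between tokens is left untouched, removing the tags gives back the original text.

```python
import re
from xml.sax.saxutils import escape, unescape

def mark_up(text):
    """Wrap every token in a <w> element; c="w" for words, c="p" for punctuation."""
    def wrap(match):
        token = match.group(0)
        cls = "w" if re.match(r"\w", token) else "p"
        return '<w c="%s">%s</w>' % (cls, escape(token))
    # White space between tokens is not matched, so it is copied unchanged.
    return re.sub(r"\w+|[^\w\s]", wrap, text)

def strip_markup(marked):
    """Remove the <w> tags again to recover the original text."""
    return unescape(re.sub(r"</?w[^>]*>", "", marked))

marked = mark_up("it is here.")
print(marked)
print(strip_markup(marked))  # it is here.
```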

Page 7: Text segmentation

Sentence segmentation

• Important for many text processing applications: syntactic parsing, information extraction, text alignment, machine translation, etc.

Page 8: Text segmentation

• Accurate splitting, known as sentence boundary disambiguation (SBD), requires analysis of the local context around periods and other punctuation marks.

• Compare:

• He stopped to see Dr. White.

• He stopped at Meadows Dr. White Falcon was still open.

• Which period is sentence-internal and which one is sentence-terminal?

Page 9: Text segmentation

Simplest algorithm for sentence boundary disambiguation

• ‘Period - space - capital letter’

• It marks as sentence boundaries all periods, exclamation marks and question marks that are followed by a space and a capital letter.

• Regex: [.?!][ ()”]+[A-Z]
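
The rule can be tried out directly in Python. The sketch below uses the regex from the slide, with one assumption: a straight double quote is added to the character class next to the curly one. The function name `split_sentences` is hypothetical. The example also shows the weakness of the rule: it cannot tell the abbreviation "Dr." from a sentence end.

```python
import re

# Regex from the slide; the straight double quote is an added assumption.
BOUNDARY_RE = re.compile(r'[.?!][ ()"”]+[A-Z]')

def split_sentences(text):
    """Split text wherever the 'period - space - capital letter' rule fires."""
    sentences, start = [], 0
    for match in BOUNDARY_RE.finditer(text):
        end = match.start() + 1        # keep the punctuation mark with its sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("She was there. In addition, he came."))
# ['She was there.', 'In addition, he came.']
print(split_sentences("He stopped to see Dr. White. He left."))
# ['He stopped to see Dr.', 'White.', 'He left.']
```

The second call splits after "Dr.", which is exactly the kind of error that analysis of the local context is meant to avoid.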

Page 10: Text segmentation

Part of speech tagging

• Criteria:

• 1- syntactic distribution

• 2- syntactic function

• 3- morphological and syntactic classes that different parts of speech can be assigned to.

Page 11: Text segmentation

Applications

• Preprocessors

• Large tagged text corpora (see Mark Davies Corpus)

• Info technology apps: text indexing and retrieval (nouns and adjectives are better candidates for good indexing than adverbs, verbs and pronouns)

Page 12: Text segmentation

Parsing

• See the Stanford University parser online (http://nlp.stanford.edu:8080/parser/index.jsp)

• Using grammar to assign syntactic analysis to a string of words.

• Shallow parsing: partitioning the input into chunks and identifying the headword of each chunk.
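
Shallow parsing can be illustrated with a very small noun-group chunker. Everything here is an assumption made for the example: the input is already tagged by hand, the tag names (`DET`, `ADJ`, `NOUN`, `VERB`) are invented for the sketch, and the headword is simply taken to be the last noun of each chunk.

```python
# Assumed tag set for this sketch: tags that may occur inside a noun group.
NOUN_GROUP_TAGS = {"DET", "ADJ", "NOUN"}

def noun_chunks(tagged):
    """Return (chunk_words, headword) for each run of noun-group tags containing a noun."""
    chunks, run = [], []
    for word, tag in tagged + [("", "END")]:   # sentinel flushes the final run
        if tag in NOUN_GROUP_TAGS:
            run.append((word, tag))
            continue
        nouns = [w for w, t in run if t == "NOUN"]
        if nouns:                               # head = last noun in the run
            chunks.append(([w for w, _ in run], nouns[-1]))
        run = []
    return chunks

tagged = [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"),
          ("chased", "VERB"), ("a", "DET"), ("cat", "NOUN")]
print(noun_chunks(tagged))
# [(['the', 'old', 'dog'], 'dog'), (['a', 'cat'], 'cat')]
```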

Page 14: Text segmentation

CFP: context-free parsing

• Context-free grammars are important in linguistics for describing the structure of sentences and words in natural language, and in computer science for describing the structure of programming languages and other formal languages. (wikipedia)
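
As a rough illustration of how a context-free grammar assigns a syntactic analysis to a string of words, here is a tiny top-down parser. The grammar `GRAMMAR` is a made-up toy for this sketch (it has no left recursion, which this simple strategy could not handle), and the function names are hypothetical.

```python
# Toy grammar, invented for illustration: each symbol maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Pro"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N": [["dog"], ["cat"]],
    "Pro": [["she"]],
    "V": [["saw"], ["slept"]],
}

def parse(symbol, words, pos):
    """Yield (tree, next_pos) for every way symbol can derive words[pos:next_pos]."""
    if symbol not in GRAMMAR:                   # terminal: must match the next word
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, words, pos, [])

def expand(symbol, rhs, words, pos, children):
    """Match the symbols of one right-hand side from left to right."""
    if not rhs:
        yield (symbol, children), pos
        return
    for child, nxt in parse(rhs[0], words, pos):
        yield from expand(symbol, rhs[1:], words, nxt, children + [child])

def parse_sentence(sentence):
    """Return the first parse tree covering the whole sentence, or None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None

print(parse_sentence("She saw the dog"))
print(parse_sentence("dog the saw"))  # None
```

Trees are nested `(symbol, children)` tuples; a string that the grammar cannot derive yields `None`.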

Page 15: Text segmentation

Thank you