Processing of Large Document Collections 1 Helena Ahonen-Myka University of Helsinki.

Processing of Large Document Collections 1

Helena Ahonen-MykaUniversity of Helsinki

Organization of the courseClasses: 17.9., 22.10., 23.10., 26.11.

lectures (Helena Ahonen-Myka): 10-12,13-15 exercise sessions (Lili Aunimo): 15-17 required presence: 75%

Exercises are given (and returned) each week required: 75%

Exam: 4.12. at 16-20, AuditorioPoints: Exam 30 pts, exercises 30 pts

Schedule

17.9. Character sets, preprocessing of text, text categorization

22.10. Text summarization23.10. Text compression26.11. … to be announced…

self-study: basic transformations for text data, using linguistic tools, etc.

In this part...

Character setspreprocessing of texttext categorization

1. Character sets

Abstract character vs. its graphical representation

abstract characters are grouped into alphabets each alphabet forms the basis of the

written form of a certain language or a set of languages

Character sets

For instance for English:

uppercase letters A-Zlowercase letters a-zpunctuation marksdigits 0-9common symbols: +, =

ideographic symbols of Chinese and Japanese

phonetic letters of Western languages

Character sets

To represent text digitally, we need a mapping between (abstract) characters and values stored digitally (integers)

this mapping is a character setthe domain of the character set is

called a character repertoire (= the alphabet for which the mapping is defined)

Character sets

For each character in the character repertoire, the character set defines a code value in the set of code points

in English: 26 letters in both lower- and uppercase ten digits + some punctuation marks

in Russian: cyrillic lettersboth could use the same set of code points (if

not a bilingual document) in Japanese: could be over 6000 characters

Character sets

The mere existence of a character set supports operations like editing and searching of text

usually character sets have some structure e.g. integers within a small range all lower-case (resp. upper-case) letters

have code values that are consecutive integers (simplifies sorting etc.)

Character sets: standars

Character sets can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs,...)

early standards were designed for English only, or for a small group of languages at a time

Character sets: standards

ASCIIISO-8859 (e.g. ISO Latin1)UnicodeUTF-8, UTF-16

American Standard Code for Information Interchange

A seven bit code -> 128 code pointsactually 95 printable characters only

code points 0-31 and 128 are assigned to control characters (mostly outdated)

ISO 646 (1972) version of ASCII incorporated several national variants (accented letters and currency symbols)

With 7 bits, the set of code points is too small for anything else than American English

solution: 8 bits brings more code points (256) ASCII character repertoire is mapped to the

values 0-127 additional symbols are mapped to other

values

Extended ASCII

Problem: different manufacturers each developed

their own 8-bit extensions to ASCIIdifferent character repertoires ->

translation between them is not always possible

also 256 code values is not enough to represent all the alphabets -> different variants for different languages

ISO 8859

Standardization of 8-bit character sets In the 80´s: multipart standard ISO 8859 was

produceddefines a collection of 8-bit character sets,

each designed for a group of languages the first part: ISO 8859-1 (ISO Latin1)

covers most Western European languages 0-127: identical to ASCII, 128-159 (mostly)

unused, 96 code values for accented letters and symbols

Unicode

256 is not enough code points for ideographically represented languages

(Chinese, Japanese…) for simultaneous use of several languages

solution: more than one byte for each code value

a 16-bit character set has 65,536 code points

Unicode

16-bit character set, e.g. 65,536 code points

not sufficient for all the characters required for Chinese, Japanese, and Korean scripts in distinct positions CJK-consolidation: characters of these

scripts are given the same value if they look the same

Unicode

Code values for all the characters used to write contemporary ’major’ languages also the classical forms of some languages Latin, Greek, Cyrillic, Armenian, Hebrew,

Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan

Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts

Unicode

punctuation marks technical and mathematical symbols arrows dingbats (pointing hands, stars, …) both accented letters and separate diacritical

marks (accents, tildes…) are included, with a mechanism for building composite characterscan also create problems: two characters that look

the same may have different code values->normalization may be necessary

Unicode

Code values for nearly 39,000 symbols are provided

some part is reserved for an expansion method (see later)

6,400 code points are reserved for private use they will never be assigned to any

character by the standard, so they will not conflict with the standard

Unicode: encodings

Encoding is a mapping that transforms a code value into a sequence of bytes for storage and transmission

identity mapping for a 8-bit code? it may be necessary to encode 8-bit characters as

sequences of 7-bit (ASCII) characters e.g. Quoted-Printable (QP)

code values 128-255 as a sequence of 3 bytes1: ASCII code for ’=’, 2 & 3: hexadecimal digits of the

value233 -> E9 -> =E9

Unicode: encodings

UTF-8 ASCII code values are likely to be more

common in most text than any other valuesin UTF-9 encoding ASCII characters are sent

themselves (high-order bit 0)other characters (two bytes) are encoded

using up to six bytes (high-order bit is set to 1)

Unicode: encodings

UTF-16: expansion method two 16-bit values are combined to a 32-

bit value -> a million characters available

2. Preprocessing of text

Text cannot be directly interpreted by the many document processing applications

an indexing procedure is needed mapping of a text into a compact

representation of its contentwhich are the meaningful units of text?how these units should be combined?

usually not ”important”

Vector model

A document is usually represented as a vector of term weights

the vector has as many dimensions as there are terms (or features) in the whole collection of documents

the weight represents how much the term contributes to the semantics of the document

Vector model

Different approaches: different ways to understand what a

term is different ways to compute term weights

Words typical choice set of words, bag of words

phrases syntactical phrases statistical phrases usefulness not yet known?

Part of the text is not considered as terms very common words (function words):

articles, prepositions, conjunctions

numeralsthese words are pruned

stopword listother preprocessing possible

stemming, base words

Weights of terms

Weights usually range between 0 and 1binary weights may be used

1 denotes presence, 0 absence of the term in the document

often the tfidf function is used higher weight, if the term occurs often in the

document lower weight, if the term occurs in many

documents

Structure

Either the full text of the document or selected parts of it are indexed

e.g. in a patent categorization application title, abstract, the first 20 lines of the

summary, and the section containing the claims of novelty of the described invention

some parts may be considered more important e.g. higher weight for the terms in the title

Dimensionality reductionMany algorithms cannot handle high

dimensionality of the term space (= large number of terms)

usually dimensionality reduction is applieddimensionality reduction also reduces

overfitting classifier that overfits the training data is good at

re-classifying the training data but worse at classifying previously unseen data

Dimensionality reduction

Local dimensionality reduction for each category, a reduced set of terms

is chosen for classification that category hence, different subsets are used when

working with different categoriesglobal dimensionality reduction

a reduced set of terms is chosen for the classification under all categories

Dimensionality reduction

Dimensionality reduction by term selection the terms of the reduced term set are a

subset of the original term setDimensionality reduction by term

extraction the terms are not the same type of the terms

in the original term set, but are obtained by combinations and transformations of the original ones

Dimensionality reduction by term selection

Goal: select terms that, when used for document indexing, yields the highest effectiveness in the given application

wrapper approach the reduced set of terms is found iteratively and

tested with the application

filtering approach keep the terms that receive the highest score

according to a function that measures the ”importance” of the term for the task

Many functions available document frequency: keep the high

frequency termsstopwords have been already removed50% of the words occur only once in the

document collection e.g. remove all terms occurring in at most 3

documents

Information-theoretic term selection functions, e.g. chi-square information gain mutual information odds ratio relevancy score

Dimensionality reduction by term extraction

Term extraction attempts to generate, from the original term set, a set of ”synthetic” terms that maximize effectiveness

due to polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation

Dimensionality reduction by term extraction

Term clustering tries to group words with a high degree of pairwise

semantic relatedness groups (or their centroids) may be used as

dimensions

latent semantic indexing compresses document vector into vectors of a

lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence

3. Text categorization

Text classification, topic classification/spotting/detection

problem setting: assume: a predefined set of categories,

a set of documents label each document with one (or more)

Processing of Large Document Collections 1 Helena Ahonen-Myka University of Helsinki.

Documents

Transcript of Processing of Large Document Collections 1 Helena Ahonen-Myka University of Helsinki.

Processing of large document collections Part 4 (Applications of text categorization, boosting, text summarization) Helena Ahonen-Myka Spring 2006.

Processing of large document collections Part 2 (Text categorization, term selection) Helena Ahonen-Myka Spring 2005.

CFund1-G Slideshow - Myka Ylanan

FINAL REPORT MOTS Costing Exercise By Myka Reinsch

Processing of structured documents Spring 2003, Part 1 Helena Ahonen-Myka.

EFEU / FLEXe Ahonen Tero energy efficiency opportunities at system levels

Myka d Booklet

Processing of large document collections Part 6 (Text summarization: discourse- based approaches) Helena Ahonen-Myka Spring 2006.

Heather and myka

Processing of structured documents Spring 2003, Part 8 Helena Ahonen-Myka.

Ahonen, A., M. Kokkala and H. Weckman. Burning Characteristics of ...

Twitter: @tomiahonen Copyright © Tomi T Ahonen 2010 Grand Convergence The 5 Trillion Dollar Contest Keynote Presentation Tomi T Ahonen,

Staple Foods Sector Diagnostic Country Report Roehlano Briones Ivory Myka Galang.

Admissibility Package for Tape all Wrapped Up M ark A. Ahonen.

XML for E-commerce IV Helena Ahonen-Myka. In this part... n Some solutions for delivering dynamic content n Example of using XML.

XML for E-commerce II Helena Ahonen-Myka. XML processing model n XML processor is used to read XML documents and provide access to their content and structure.

Processing of large document collections Part 9 (Information extraction: learning extraction patterns) Helena Ahonen-Myka Spring 2006.

Myka, K. K., Hawkins, M., Dillingham, M., Savery, N ... · Myka, K. K., Hawkins, M., Dillingham, M., Savery, N., & McGlynn, P. (2016). Inhibiting translation elongation can aid genome

Sedmé masmédium: Nový svět komunikací - Tomi T Ahonen

Active and reactive power Renewables Summer Course 17.7.2014 Eetu Ahonen 1.