
CHAPTER 1

INTRODUCTION

Nowadays, a large quantity of data is being accumulated, and there is usually a huge gap between the stored data and the knowledge that can be derived from it. This transition does not occur automatically; this is where Data Mining comes into the picture. In Exploratory Data Analysis, some initial knowledge about the data is assumed, whereas Data Mining helps to develop a more in-depth understanding of the data. Extracting knowledge from massive data is one of the most desired capabilities of Data Mining. Manual data analysis has been practised for a long time, but it becomes a bottleneck when the data is large. Rapidly developing computer science and engineering techniques and methodologies generate new demands to mine complex types of data. A number of Data Mining techniques (such as association, clustering and classification) have been developed to mine this vast amount of data. Previous studies of Data Mining [67] mostly focused on structured data, such as relational, transactional and data warehouse data. However, in reality, a substantial portion of the available information is stored in text databases (or document databases), which consist of large collections of documents from various sources, such as news articles, books, digital libraries and Web pages. Text databases are growing rapidly due to the increasing amount of information available in electronic form, such as electronic publications, e-mail, CD-ROMs, and the World Wide Web (which can itself be viewed as a huge, interconnected, dynamic text database).

Data stored in most text databases is semi-structured, in that it is neither completely unstructured nor completely structured. For example, a document may contain a few structured fields, such as title, authors, publication date, length and category, but also largely unstructured text components, such as the abstract and contents. There has been a great deal of work on the modeling and implementation of semi-structured data in recent database research. Information Retrieval techniques, such as text indexing, have been developed to handle unstructured documents, but traditional Information Retrieval techniques become inadequate for the increasingly vast amounts of text data. Typically, only a small fraction of the many available documents will be relevant to a given individual or user. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare different documents, rank the importance and relevance of documents, or find patterns and trends across multiple documents. Thus, Text Mining has become an increasingly popular and essential theme in Data Mining.

Text Mining, also known as knowledge discovery from text or document information mining, refers to the process of extracting interesting patterns from very large text corpora for the purpose of discovering knowledge [129]. It is an interdisciplinary field involving

Information Retrieval, Text Understanding, Information Extraction, Clustering,

Categorization, Topic Tracking, Concept Linkage, Computational Linguistics,

Visualization, Database Technology, Machine Learning, and Data Mining [120].

Text Mining tools and applications aim to capture the relationships within the data. They

can be roughly organized into two groups. One group focuses on document exploration

functions to organize documents based on their content and provide an environment for a

user to navigate and browse in a document or concept space. It includes Clustering,

Visualization, and Navigation. The other group focuses on text analysis functions to

analyze the content of the documents and discover relationships between concepts or

entities described in the documents. They are mainly based on natural language processing

techniques, including Information Retrieval, Information Extraction, Text Categorization,

and Summarization [128], [129].

Content-based text selection techniques have been extensively evaluated in the context of

Information Retrieval. Every approach to text selection has four basic components (a small illustrative sketch follows the list below):

Some technique for representing the documents

Some technique for representing the information need (i.e., profile construction)

Some way of comparing the profiles with the document representations

Some way of using the results of that comparison
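As one minimal sketch of these four components (the documents, the profile text and the tokenizer below are hypothetical, and this is not the representation developed later in this thesis), documents and an information need can be represented as term-frequency vectors, compared with cosine similarity, and ranked by that comparison:

    import math
    import re
    from collections import Counter

    def vectorize(text):
        # Components 1 and 2: represent a document or a profile as a term-frequency vector.
        return Counter(re.findall(r"[a-z]+", text.lower()))

    def cosine(u, v):
        # Component 3: compare the profile vector with a document vector.
        dot = sum(u[t] * v[t] for t in u)
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # Hypothetical document collection and information need (profile).
    docs = {
        "d1": "data mining extracts patterns from large databases",
        "d2": "information retrieval ranks documents for a user query",
        "d3": "clustering groups similar text documents together",
    }
    profile = "retrieval of relevant text documents for a query"

    # Component 4: use the comparison results, here by ranking the documents.
    ranked = sorted(docs, key=lambda d: cosine(vectorize(profile), vectorize(docs[d])), reverse=True)
    print(ranked)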

This thesis is about Text Mining for Information Retrieval. Our main interest is in how

techniques and tools of Text Mining can be used as an exploration tool. The search engine is the most well-known Information Retrieval tool. The application of Text Mining to Information Retrieval improves the precision of IR systems and reduces the number of documents that a single query returns. In this thesis, we propose a method of ranking the

Web text pages based on statistical heuristics. Also an efficient and global method of


clustering is proposed to group similar documents. To speed up the process of text document retrieval, a new method of storing the inverted index file is proposed using the range-partition feature of Oracle; the space requirement in Random Access Memory is reduced considerably by storing the inverted file on secondary storage and bringing only the required portion into main memory. Apart from this, we have also carried out work in the adjacent field of Text Document Summarization to generate a

single-document extract summary which can be used to cluster similar documents (Web

text documents) using the proposed clustering method.
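The general idea of keeping the inverted file on secondary storage and bringing only the required portion into main memory can be illustrated by the simplified sketch below. It is not the Oracle range-partitioning scheme proposed in this thesis; the toy collection, file name and layout are hypothetical.

    import json

    # Hypothetical toy collection: document id -> text.
    docs = {1: "data mining of text data", 2: "text document retrieval", 3: "mining web data"}

    # Build the inverted index: term -> list of document ids.
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, []).append(doc_id)

    # Store the postings on secondary storage, one record per term, and keep only
    # a small term -> (offset, length) dictionary in main memory.
    offsets = {}
    with open("inverted_index.bin", "wb") as f:
        for term, plist in index.items():
            data = (json.dumps(plist) + "\n").encode("utf-8")
            offsets[term] = (f.tell(), len(data))
            f.write(data)

    def postings(term):
        # Bring only the required portion of the inverted file into main memory.
        if term not in offsets:
            return []
        start, length = offsets[term]
        with open("inverted_index.bin", "rb") as f:
            f.seek(start)
            return json.loads(f.read(length))

    # Answer a query by loading just the postings lists of its terms.
    print(set(postings("text")) & set(postings("data")))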

1.1 Data Mining – Concepts and Techniques

Database technology has evolved from primitive file processing to the development of

database management systems with query and transaction processing. Due to the explosive

growth in data collected from applications including business and management,

government administration, scientific and engineering, and environmental control, there is

an increasing demand for efficient and effective data analysis and data understanding

tools. Data warehouse systems provide some data analysis capabilities which include data

cleaning, data integration and OLAP (On-Line Analytical Processing). These analysis

techniques provide functionalities such as summarization, consolidation and aggregation,

as well as the ability to view information from different angles. Although OLAP tools support

multidimensional analysis and decision making, additional data analysis tools are required

for in-depth analysis, such as data classification, clustering, and the characterization of

data changes over time. The widening gap between data and information calls for a

systematic development of Data Mining tools which will turn "data tombs" into "golden nuggets" of knowledge. Data Mining tools perform data analysis which may uncover

important data patterns, contributing greatly to business strategies, knowledge bases, and

scientific and medical research.

Data Mining has many definitions:

Definition 1: Data Mining, also referred to as knowledge discovery in databases, is a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints and regularities) from data in databases [106].


Definition 2: "Data Mining is the process of sorting through large amounts of data and

picking out relevant information. It is usually used by business intelligence organizations,

and financial analysts, but is increasingly being used in the sciences to extract information

from the enormous data sets generated by modern experimental and observational

methods" [38].

The ultimate goal of Data Mining is prediction; predictive Data Mining is the most common type of Data Mining and the one with the most direct business applications.

The process of Data Mining consists of three stages, outlined below; a minimal sketch of the full pipeline follows the list.

a) Exploration - This is the first stage of the Data Mining process. It performs data preparation, which includes data cleaning, data transformation and the selection of subsets of records. If the data set is large and contains a large number of variables ("fields"), preliminary feature selection operations are also performed to reduce the number of variables to a manageable range using statistical methods. Depending on the nature of the analytic problem, the exploration stage may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods to identify the most relevant variables and determine the complexity and/or general nature of the models. The information produced in this stage is then used in the next stage.

b) Model building and validation - In this stage, various models are considered and the best model is chosen based on its predictive performance (i.e., the model that explains the variability in question and produces stable results across samples). A variety of techniques have been developed that apply different models to the same data set and compare their performance in order to choose the best one. These techniques, often called "competitive evaluation of models", are considered the core of predictive Data Mining and include Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.

c) Deployment - In this stage, the best model selected in the previous stage is applied to new data in order to generate predictions or estimates of the expected outcome.
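A minimal sketch of the three stages on a small synthetic data set is given below. It assumes the scikit-learn library; the features, candidate models and thresholds are hypothetical and only illustrate the flow from exploration through competitive model evaluation to deployment.

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # a) Exploration: prepare the data and drop near-constant variables ("fields").
    X = rng.normal(size=(200, 6))
    X[:, 5] = 0.0                                # a useless, constant field
    y = (X[:, 0] + X[:, 1] > 0).astype(int)      # hypothetical target variable
    X = VarianceThreshold(threshold=1e-3).fit_transform(X)

    # b) Model building and validation: compare candidate models on the same data
    #    ("competitive evaluation") and keep the best performer.
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(max_depth=3),
    }
    scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in candidates.items()}
    best_name = max(scores, key=scores.get)
    best_model = candidates[best_name].fit(X, y)

    # c) Deployment: apply the chosen model to new, unseen records.
    new_records = rng.normal(size=(3, X.shape[1]))
    print(best_name, scores[best_name], best_model.predict(new_records))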


Data Mining involves an integration of techniques from multiple disciplines such as

database technology, statistics, machine learning, high performance computing, pattern

recognition, neural networks, data visualization, Information Retrieval, image and signal

processing, and spatial data analysis. By performing Data Mining, interesting knowledge,

regularities, or high-level information can be extracted from databases and viewed or

browsed from different angles. The discovered knowledge can be applied to decision

making, process control, information management, query processing, and so on.

Therefore, Data Mining is considered as one of the most important frontiers in database

systems and one of the most promising, new database applications in the information

industry [49], [67].

In the following Section 1.1.1 we discuss the kinds of data repositories on which mining can be performed, and in Section 1.1.2 the different Data Mining techniques. After this, we also briefly explain the general architecture of Data Mining.

1.1.1 Nature of Data

In this section, we discuss different data stores [32], [67] on which mining can be

performed. In principle, Data Mining should be applicable to any kind of information

repository. This includes relational databases, data warehouses, transactional databases,

object-oriented and object-relational databases, spatial databases, time-series databases,

text databases, multimedia databases and the World-Wide Web. The challenges and

techniques of mining may differ for each of the repository systems. A brief introduction

to each of the major data repository systems listed above is given below:

a) Relational databases

A relational database is a collection of tables. Each table has a unique name, consists of a set of attributes (columns or fields), and usually stores a large number of tuples (records or rows). An object in a relational table is represented by a tuple, is identified by a unique key (primary key) and is described by a set of attribute

values. Data Mining applied to relational databases allows searching for trends or data

patterns. For example, Data Mining systems may detect deviations, such as items whose

sales are far from those expected in comparison with the previous year. It may also


analyze customer data to predict the credit risk of new customers based on their income,

age, and previous credit information.

Relational databases are one of the most popularly available and rich information

repositories for Data Mining.

b) Data warehouses

A data warehouse is a repository of information collected from multiple sources, stored

under a unified schema, and which usually resides at a single site. Data warehouses are

constructed via a process of data cleansing, data transformation, data integration, data

loading, and periodic data refreshing. Although data warehouse tools help support data

analysis, additional tools for Data Mining are required to allow more in depth and

automated analysis.

c) Transactional databases

In general, a transactional database consists of a file where each record represents a

transaction. A transaction typically includes a unique transaction identity number (trans

ID), and a list of the items making up the transaction (such as items purchased in a store).

The transactional database may have additional tables associated with it, which contain other information regarding the transactions; for example, a sales transactional database may record the transaction date, the customer ID number, the salesperson ID number, the name of the branch at which the sale occurred, and so on. A regular data retrieval system fails to answer queries like "Which items sold well together?" However, Data Mining systems for transactional data can identify relationships between transactions; for example, they can detect the sets of items that are frequently sold together. Based on the observation that microwave-proof containers are commonly purchased together with microwaves, an offer of an expensive set of microwave-proof containers can be made to customers buying selected models of microwave, in the hope of selling more of the expensive containers.


d) Object-Oriented Databases

Object–Oriented databases are based on the object-oriented programming paradigm,

where, each entity is an object. Data and code relating to an object are encapsulated into a

single unit. Each object has associated with it the following:

A set of variables that describe the object.

A set of messages that the object can use to communicate with other objects or rest

of the database system.

A set of methods, where each method holds the code to implement a message.

Objects that share common set of properties can be grouped into an object class. Each

object is an instance of its class. Object classes can be organized into class/subclass

hierarchies. Such a class inheritance feature benefits information sharing.

e) Object-Relational Databases

Object-relational model extends the basic relational data model by adding the power to

handle complex data types, class hierarchies, and object inheritance.

Data Mining techniques provide the methods to handle complex object structures, complex

data types, class and subclass hierarchies, property inheritance, and methods and

procedures.

f) Spatial Databases

Spatial databases contain spatial-related information. Such databases include geographic

(map) databases, VLSI chip design databases, and medical and satellite image databases.

Spatial databases may be represented in raster format (n-dimensional bit maps or pixel maps) or in vector format (roads, bridges, buildings, lakes).

Spatial data cubes may be constructed to organize data into multidimensional structures

and hierarchies, on which OLAP operations (such as drill-down and roll-up) can be

performed. Spatial Data Mining includes spatial data description, classification,

association, clustering, and spatial trend and outlier analysis.


g) Temporal databases and Time-Series Databases

Temporal databases and time-series databases both store time-related data. A temporal

database stores relational data having time-related attributes which may involve several

timestamps, each having different semantics. A time-series database stores sequence of

values that change with time, such as data collected regarding the stock exchange.

Data mining techniques can be used to find the characteristics of object evolution or the

trend of changes for objects in the database. Such information can be useful in decision

making and strategy planning.

h) Text Databases

Text databases are databases that contain word descriptions (sentences, paragraphs) for

objects (summary reports, error messages, documents). Text databases may be highly unstructured (e.g., some Web pages on the World Wide Web), semi-structured (e.g., e-mail messages and HTML/XML Web pages), or relatively structured (e.g., library databases). Data Mining on text

databases may uncover general descriptions of object classes, as well as keyword or

content associations, and the clustering behavior of text objects. To handle this, standard

Data Mining methods need to be integrated with Information Retrieval techniques and the

construction or use of hierarchies especially for text data (such as dictionaries and

thesauruses), as well as discipline-oriented term classification systems (such as in

chemistry, medicine, law or economics).

i) Multimedia Databases

Multimedia databases store image, audio, video data, sequence data, hypertext data

containing text, text markups, and linkages. For multimedia database mining, storage and

search techniques need to be integrated with standard Data Mining methods to handle the

issues like content-based retrieval and similarity search, generalization and

multidimensional analysis, classification and prediction analysis, and mining associations

in multimedia data.


j) World Wide Web

The World Wide Web serves as a huge, widely distributed, global information service

center for news, advertisements, financial management, education, and many other

information services. Web contains a rich and dynamic collection of hyperlink information

and Web page access and usage information, providing rich sources for Data Mining. It

involves mining Web linkage structures, Web contents, and Web access patterns to

identify authoritative pages, automatic classification of Web documents, building a

multilayered Web information base and Weblog.

1.1.2 Data Mining Techniques

There have been many advances in Data Mining research and development, and many Data Mining techniques have been devised. Some of these techniques are briefly discussed below:

a) Decision trees

A decision tree is a simple inductive learning structure. Given an instance of an object or

situation, which is specified by a set of properties, the tree returns a "yes" or "no" decision

about that instance. Decision tree learning is a common method used in Data Mining. Each

interior node corresponds to a variable; an arc to a child represents a possible value of that

variable. A leaf represents a possible value of target variable given the values of the

variables represented by the path from the root. A tree can be "learned" by splitting the

source set into subsets based on an attribute value test. This process is repeated on each

derived subset in a recursive manner. In Data Mining, trees can also be described as the

combination of mathematical and computing techniques to aid the description,

categorization and generalization of a given set of data [40].

Decision trees are built by true Data Mining algorithms and help users understand the data through classification, producing results that are highly descriptive. A decision tree process generates the rules followed in a process. For example, a lender at a bank goes through

a set of rules when approving a loan. Based on the loan data a bank has, the outcomes of

the loans (default or paid), and limits of acceptable levels of default, the decision tree can


set up the guidelines for the lending institution. These decision trees are very similar to the

first decision support (or expert) systems.
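A small sketch of such a lending decision tree is shown below, assuming the scikit-learn library; the loan records, attribute names and outcomes are entirely hypothetical.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical historical loans: [income in thousands, existing debt in thousands].
    X = [[20, 15], [25, 18], [30, 2], [45, 25], [50, 5], [60, 30], [80, 10], [90, 40]]
    y = ["default", "default", "paid", "default", "paid", "default", "paid", "paid"]

    # Learn the tree by recursively splitting the records on attribute-value tests.
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # The tree can be read back as the lending rules it has induced ...
    print(export_text(tree, feature_names=["income", "debt"]))

    # ... and applied to a new applicant.
    print(tree.predict([[55, 8]]))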

b) Memory-Based Reasoning (MBR)

MBR [109] uses known instances of a model to predict unknown instances. It maintains a

record of the characteristics of known records in a training dataset. When a new record arrives for evaluation, the algorithm finds the neighbors in the training dataset that are most similar to it and uses their characteristics for prediction and classification. The MBR technique has two key components: the distance function and the combination function. The distance function calculates the distance between the new record and the records in the training dataset; the results obtained determine the neighbors of the new incoming record. In the next step, the combination function combines the results from the neighbors to determine the final answer.

Hence, for solving a Data Mining problem using MBR, three critical issues considered are:

Selecting the most suitable historical records to form the training or base dataset

Establishing the best way to compose the historical record

Determining the two essential functions, namely, the distance function and the combination function
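The two key components described above can be made concrete with the following minimal k-nearest-neighbour style sketch; the training records are hypothetical and no weighting or optimisation is shown.

    import math
    from collections import Counter

    # Hypothetical training dataset: (features, class label).
    training = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"), ((5.0, 8.0), "B"), ((6.0, 9.0), "B")]

    def distance(u, v):
        # Distance function: Euclidean distance between two records.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def combine(neighbors):
        # Combination function: majority vote over the neighbours' labels.
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    def classify(new_record, k=3):
        # Rank the training records by distance to the new record ...
        neighbors = sorted(training, key=lambda item: distance(item[0], new_record))[:k]
        # ... and combine the neighbours' answers into the final answer.
        return combine(neighbors)

    print(classify((1.2, 1.9)))   # expected to be classified as "A"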

c) Genetic Algorithms

Genetic algorithms [109] apply the "survival of the fittest" principle to Data Mining. They use an iterative process of selection, cross-over and mutation operators to evolve successive generations of models. At each iteration, every model competes with the others by inheriting traits from previous generations, until only the most predictive model survives.
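A toy sketch of this selection, cross-over and mutation loop is given below. It evolves bit strings towards an arbitrary fitness function and only illustrates the principle; it is not a model-selection genetic algorithm.

    import random

    random.seed(0)
    LENGTH = 12

    def fitness(individual):
        # Toy fitness: number of 1-bits (stands in for a model's predictive score).
        return sum(individual)

    def crossover(a, b):
        point = random.randrange(1, LENGTH)
        return a[:point] + b[point:]

    def mutate(individual, rate=0.05):
        return [bit ^ 1 if random.random() < rate else bit for bit in individual]

    # Initial generation of candidate "models".
    population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(20)]

    for generation in range(30):
        # Selection: the fittest half survives and breeds the next generation.
        population.sort(key=fitness, reverse=True)
        parents = population[:10]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(10)]
        population = parents + children

    print(max(fitness(ind) for ind in population))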

d) Neural Networks

Neural Networks are analytic techniques modeled after the (hypothesized) processes of

learning in the cognitive system and the neurological functions of the brain and capable of

predicting new observations (on specific variables) from other observations (on the same

or other variables) after executing a process of so-called learning from existing data.


These algorithms are effective when the data is shapeless and lacks any apparent pattern.

The basic unit of an artificial neural network is the node, modeled after the neuron in the brain. The other basic structure is the link, which corresponds to the connection between neurons in

the brain. Neural networks [109] mimic the human brain by learning from a training

dataset and applying the learning to generalize patterns for classification and prediction.

Neural networks have high tolerance to noisy data and have the ability to classify patterns

on which they have not been trained. In addition, several algorithms have recently been

developed for the extraction of rules from trained neural networks. These factors

contribute towards the usefulness of neural networks for classification in Data Mining

[67].
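As a minimal illustration of a node that learns from a training dataset and then generalizes to an unseen pattern, the sketch below trains a single artificial neuron (a perceptron) on a hypothetical two-class problem; the networks used in practice are, of course, far larger.

    # A single node (neuron): weighted sum of inputs followed by a threshold.
    def predict(weights, bias, x):
        return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

    # Hypothetical training dataset: two numeric features and a 0/1 class label.
    training = [((2.0, 1.0), 1), ((1.5, 2.0), 1), ((-1.0, -0.5), 0), ((-2.0, -1.5), 0)]

    weights, bias, rate = [0.0, 0.0], 0.0, 0.1
    for _ in range(20):                       # "learning" passes over the training data
        for x, target in training:
            error = target - predict(weights, bias, x)
            # Links (weights) are adjusted in proportion to the error they caused.
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error

    # Apply the learned weights to a pattern the node has not been trained on.
    print(predict(weights, bias, (1.0, 0.5)))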

e) Link Analysis

The Link Analysis technique mines relationships between data items and discovers knowledge from them. This type of capability can be used in a variety of advanced artificial intelligence applications like

logistics planning, event probability prediction, and intelligence gathering. For example, in

a sale transaction at a supermarket, many items bought together in one trip are all linked

together. Some technologies included in this category can also perform machine learning

and reasoning functions.

Depending upon the types of knowledge discovery, link analysis techniques have three

types of applications [109]:

Association Discovery. Associations are affinities between items. Association

discovery algorithms find combinations where the presence of one item suggests the

presence of another.

Sequential Pattern Discovery. These algorithms discover patterns where one set of

items follows another specific set. Time plays a role in these patterns. When records are selected for analysis, including date and time as data items enables the discovery of sequential patterns.

Similar Time Sequence Discovery. This technique depends on the availability of time

sequences. In the previous technique, the results indicate sequential events over time.

This technique, however, finds a sequence of events and then comes up with other similar

sequences of events.
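Of these, association discovery is the simplest to illustrate. The sketch below counts co-occurrences over a handful of hypothetical supermarket transactions and reports item pairs whose support and confidence exceed arbitrary thresholds.

    from itertools import combinations

    # Hypothetical transactions: each is the set of items bought in one trip.
    transactions = [
        {"bread", "milk", "butter"},
        {"bread", "butter"},
        {"milk", "coffee"},
        {"bread", "milk", "butter", "coffee"},
        {"bread", "butter", "jam"},
    ]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Report rules "a -> b" where the pair occurs often enough (support) and the
    # presence of a strongly suggests the presence of b (confidence).
    items = sorted(set().union(*transactions))
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support >= 0.4:
            confidence = pair_support / support({a})
            if confidence >= 0.7:
                print(f"{a} -> {b} (support={pair_support:.2f}, confidence={confidence:.2f})")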


f) Clustering and Nearest Neighbor

Clustering [91] seeks to identify a finite set of abstract categories that describe the data by

determining natural affinities in the data set based upon a pre-defined distance or similarity

measure. Clustering can employ categories of different types (e.g., a flat partition, a

hierarchy of increasingly fine-grained partitions, or a set of possibly overlapping clusters).

Clustering can proceed by agglomeration, where instances are initially merged to form

small clusters and small clusters are merged to form larger ones; or by successive division

of larger clusters into smaller ones. Some clustering algorithms produce explicit cluster

descriptions; others produce only implicit descriptions. Different methods [8], [12], [21],

[52] are available to generate the clusters containing similar documents.

The Nearest Neighbor algorithm supports clustering and classification by matching cases to each other or to an exemplar specified by a domain expert. A simple example of a nearest neighbor method is as follows: given a set X = {x1, x2, x3, ..., xn} of vectors, each composed of binary-valued features, for each pair (xi, xj) create a vector v by comparing the values of each corresponding feature of xi and xj, entering 1 for each feature whose values match and 0 otherwise. Then sum the entries of v to compute the degree of match; the pairs (xi, xj) with the largest sums are the nearest neighbors. In more complex nearest neighbor methods, features can be weighted to reflect their degree of importance. Domain expertise is needed to select salient features, compute weights for those features, and select a distance or similarity measure. Nearest neighbor approaches have been used for text classification.
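The binary-feature matching described above can be written out directly as the short sketch below; the vectors are hypothetical and feature weighting is omitted.

    from itertools import combinations

    # Hypothetical set X of binary feature vectors.
    X = {
        "x1": [1, 0, 1, 1, 0],
        "x2": [1, 0, 1, 0, 0],
        "x3": [0, 1, 0, 0, 1],
        "x4": [1, 0, 1, 1, 1],
    }

    def degree_of_match(xi, xj):
        # v has a 1 for every feature whose values match and 0 otherwise;
        # the sum of v is the degree of match for the pair.
        v = [1 if a == b else 0 for a, b in zip(xi, xj)]
        return sum(v)

    # The pair with the largest degree of match are the nearest neighbours.
    scores = {(i, j): degree_of_match(X[i], X[j]) for i, j in combinations(X, 2)}
    best = max(scores, key=scores.get)
    print(best, scores[best])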

g) Rule Induction

Rule induction [62] is an important machine learning technique used for Data Mining. It expresses the regularities hidden in data in terms of rules. Usually, rules are expressions of the form

if (attribute-1, value-1) & (attribute-2, value-2) & ... & (attribute-n, value-n)
then (decision, value).

Some rule induction systems induce more complex rules, in which values of attributes may

be expressed by negation of some values or by a value subset of the attribute domain. Data

from which rules are derived are usually presented in the form of a table, in which cases (or examples) label the rows and the columns are labeled with the attributes and a decision.
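For illustration, a rule of this form can be represented and checked against the rows of such a table as in the sketch below; the attributes, values and the rule itself are hypothetical, and no induction algorithm is shown.

    # A table of cases: rows are examples, columns are attributes plus a decision.
    cases = [
        {"temperature": "high",   "headache": "yes", "flu": "yes"},
        {"temperature": "high",   "headache": "no",  "flu": "yes"},
        {"temperature": "normal", "headache": "yes", "flu": "no"},
        {"temperature": "normal", "headache": "no",  "flu": "no"},
    ]

    # Rule: if (temperature, high) & (headache, yes) then (flu, yes).
    rule = {"conditions": [("temperature", "high"), ("headache", "yes")],
            "decision": ("flu", "yes")}

    def rule_fires(rule, case):
        return all(case[attribute] == value for attribute, value in rule["conditions"])

    # How many cases does the rule cover, and is it consistent with their decision?
    covered = [c for c in cases if rule_fires(rule, c)]
    attribute, value = rule["decision"]
    print(len(covered), all(c[attribute] == value for c in covered))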

1.1.3 Data Mining Applications

There is a wide variety of applications benefiting from Data Mining. The technology

encompasses a rich collection of proven techniques that cover a wide range of applications

in both the commercial and noncommercial realms. In some cases, multiple techniques are

used, back to back, to greater advantage.

Listed below are few major applications of Data Mining [67], [109]:

a) Applications in the business area:

Data Mining technology has widespread applications in the commercial arena. Most of the

tools target the commercial sector. A few examples of Data Mining in the business area

are outlined as follows:

Customer Segmentation - Businesses use Data Mining to understand their

customers. Cluster detection algorithms discover clusters of customers sharing the

same characteristics.

Market Basket Analysis - Link analysis algorithms uncover affinities between

products that are bought together. Other businesses such as upscale auction houses

use these algorithms to find customers to whom they can sell higher-value items.

Risk Management - Insurance companies and mortgage businesses use Data

Mining to uncover risks associated with potential customers.

Fraud Detection - Credit card companies use Data Mining to discover abnormal

spending patterns of customers. Such patterns can expose fraudulent use of the

cards.

Delinquency Tracking - Loan companies use the technology to track customers

who are likely to default on repayments.

Demand Prediction - Retail and other businesses use Data Mining to match

demand and supply trends to forecast demand for specific products.


b) Applications in the Telecommunications Industry

The telecommunication industry has quickly evolved from offering local and long distance

telephone services to providing many other comprehensive communication services,

including fax, pager, cellular phone, Internet messenger, images, e-mail, computer and

Web data transmission, and other data traffic. With the deregulation of the

telecommunication industry in many countries and the development of new

communication technologies, the telecommunication market is rapidly expanding and

highly competitive. This creates a great demand for Data Mining in order to help

understand the business involved, identify telecommunication patterns, catch fraudulent

activities, make better use of resources, and improve the quality of service.

A few examples of Data Mining in the telecommunication industry are outlined as follows:

Multidimensional analysis of telecommunication data – Telecommunication data is

intrinsically multidimensional with dimensions such as calling-time, duration,

location of caller, and type of call. The multidimensional analysis of such data can

be used to identify and compare the data traffic, system work load, resource usage,

user group behavior, profit, and so on. OLAP and visualization tools are used to

consolidate telecommunication data.

Fraudulent pattern analysis and the identification of unusual patterns – It is

important to identify potentially fraudulent entry to customer accounts and their

atypical usage patterns, detect attempts to gain fraudulent entry to customer

accounts, switch and route congestion patterns, and periodic calls from automatic

dial-out equipment that have been improperly programmed. Many of these types

of patterns can be discovered by multidimensional analysis, cluster analysis, and

outlier analysis.

Use of visualization tools in telecommunication data analysis – Tools for OLAP visualization, linkage visualization, association visualization and clustering have been useful for telecommunication data analysis.


c) Applications in Banking and Finance

The banking and finance industry is fertile ground for Data Mining. Banks and financial

institutions generate large volumes of detailed transaction data. Fraud detection, risk

assessment of potential customers, trend analysis, and direct marketing are the primary

Data Mining applications at banks.

In the financial area, requirements for forecasting dominate. Forecasting stock prices and commodity prices with a high level of accuracy can mean large profits. Neural

network algorithms are used in forecasting, options and bond trading, portfolio

management, and in mergers and acquisitions.

A few examples of Data Mining in the banking and finance are outlined as follows:

Loan payment prediction and customer credit policy analysis - Loan payment

prediction and customer credit analysis are critical to the business of a bank. Many

factors can strongly or weakly influence loan payment performance and customer

credit rating. Data Mining methods, such as attribute selection and attribute

relevance ranking, may help identify important factors and eliminate irrelevant

ones.

Classification and clustering of customers for targeted marketing - Classification

and clustering methods can be used for customer group identification and targeted

marketing. Customers with similar behaviors regarding loan payments may be

identified by multidimensional clustering techniques. These can help identify

customer groups, associate a new customer with an appropriate customer group,

and facilitate targeted marketing.

Detection of money laundering and other financial crimes - To detect money

laundering and other financial crimes, it is important to integrate information from

multiple databases (like bank transaction databases, and federal or state crime

history databases), as long as they are potentially related to the study. Multiple data

analysis tools can then be used to detect unusual patterns, such as large amounts of

cash flow at certain periods, by certain groups of customers. Useful tools include

data visualization tools (to display transaction activities using graphs by time and

by groups of customers), linkage analysis tools (to identify links among different

customers and activities), classification tools (to filter unrelated attributes and rank

the highly related ones), clustering tools (to group different cases), outlier analysis

tools (to detect unusual amounts of fund transfers or other activities), and


sequential pattern analysis tools (to characterize unusual access sequences). These

tools may identify important relationships and patterns of activities and help

investigators focus on suspicious cases for further detailed examination.

d) Applications in Biomedical and DNA data Analysis

The past decade has seen an explosive growth in genomics, proteomics, functional

genomics, and biomedical research. Examples range from the identification and

comparative analysis of the genomes of human and other species (by discovering

sequencing patterns, gene functions, and evolution paths) to the investigation of genetic

networks and protein pathways, and the development of new pharmaceuticals and

advances in cancer therapies. Biological Data Mining has become an essential part of a

new research field called bioinformatics.

A few examples of Data Mining in biological data analysis are outlined as follows:

Semantic integration of heterogeneous, distributed genomic and proteomic

databases - Genomic and proteomic data sets are often generated at different labs

and by different methods. They are distributed, heterogeneous, and of a wide

variety. The semantic integration of such data is essential to the cross-site analysis

of biological data. Also, it is important to find correct linkages between research

literature and their associated biological entities. Such integration and linkage

analysis would facilitate the systematic and coordinated analysis of genome and

biological data. Data cleaning, data integration, reference reconciliation,

classification, and clustering methods will facilitate the integration of biological

data and the construction of data warehouses for biological data analysis.

Visualization tools in genetic data analysis - Visualization and visual Data Mining

play an important role in biological data analysis. Alignments among genomic or

proteomic sequences and the interactions among complex biological structures are

most effectively presented in graphic forms, transformed into various kinds of

easy-to-understand visual displays. Such visually appealing structures and patterns

facilitate pattern understanding, knowledge discovery, and interactive data

exploration.

Association and path analysis - Most diseases are not triggered by a single gene but

by a combination of genes acting together. Recently, many studies have focused on

the comparison of one gene to another. Association analysis methods can be used


to help determine the kinds of genes that are likely to co-occur in target samples.

Such analysis would facilitate the discovery of groups of genes and the study of

interactions and relationships between them. While a group of genes may

contribute to a disease process, different genes may become active at different

stages of the disease. If the sequence of genetic activities across the different stages

of disease development can be identified, it may be possible to develop

pharmaceutical interventions that target the different stages separately, therefore

achieving more effective treatment of the disease. Such path analysis is expected to

play an important role in genetic studies.

e) Applications in the Retail Industry

The retail industry is a major application area for Data Mining, since it collects huge

amounts of data on sales, customer shopping history, goods transportation, consumption,

and service. The quantity of data collected continues to expand rapidly, especially due to

the increasing ease, availability, and popularity of business conducted on the Web, or e-

commerce. Retail Data Mining can help identify customer buying behaviors, discover

customer shopping patterns and trends, improve the quality of customer service, achieve

better customer retention and satisfaction, enhance goods consumption ratios, design more

effective goods transportation and distribution policies, and reduce the cost of business.

A few examples of Data Mining in the retail industry are outlined as follows:

Customer Segmentation - Direct marketing involves targeting campaigns and

promotions to specific customer segments. Cluster detection and other predictive

Data Mining algorithms provide customer segmentation. Customer segmentation

tools discover clusters and predict success rates for direct marketing campaigns. At

the backend, Data Mining tools for customer segmentation can be integrated with

the data warehouse for data selection and extraction. At the front end, these tools

work well with standard presentation software.

Market basket analysis - Retail industry promotions necessarily require knowledge

of which products to promote and in what combinations. Retailers use link analysis

algorithms to find affinities among products that usually sell together. Based on the

affinity grouping, retailers can plan their special sale items and also the arrangement

of products on the shelves.


Inventory Management - Inventory for a retailer encompasses thousands of

products. Inventory turnover and management are significant concerns for these

businesses. Retailers use Data Mining for inventory management.

Sales forecasting – Retail sales are subject to strong seasonal fluctuations. Holidays

and weekends also make a difference. Therefore, sales forecasting is critical for the

industry. The retailers turn to the predictive algorithms of Data Mining technology

for sales forecasting.

1.1.4 Data Mining – Interdisciplinary domain

Data Mining involves an integration of techniques from multiple disciplines such as

database technology, statistics, machine learning, data visualization, Information Retrieval,

image and signal processing, and spatial data analysis. Emphasis is on efficient and

scalable Data Mining techniques for large databases. By performing Data Mining,

interesting knowledge, regularities, or high-level information can be extracted from

databases and viewed or browsed from different angles. The discovered knowledge can be

applied to decision making, process control, information management, query processing,

and so on. Therefore, Data Mining is considered as one of the most important frontiers in

database systems and one of the most promising interdisciplinary developments in the

information industry. Few of the different disciplines which can be integrated with Data

Mining are briefly discussed below:

a) Statistics

Data Mining in Statistics deals with finding useful patterns in data sets. This includes

hypothesis testing and parameter estimation. Hypothesis testing is a part of inferential statistics: it starts with an initial premise (called the Null Hypothesis), and the collected data is then tested against this premise. If the hypothesis is supported by the data to a certain degree, the Null Hypothesis is accepted; otherwise it is rejected. Parameter estimation deals with finding parameters, such as means and standard deviations, that describe the distribution of a given sample of data points.
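Both ideas can be sketched on a hypothetical sample as follows; the t-test assumes the SciPy library, and the sample values and the null-hypothesis mean are made up.

    import statistics
    from scipy import stats

    sample = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8, 5.3, 5.1]

    # Parameter estimation: describe the distribution of the sample.
    mean = statistics.mean(sample)
    std = statistics.stdev(sample)

    # Hypothesis testing: null hypothesis "the population mean is 5.0".
    t_statistic, p_value = stats.ttest_1samp(sample, popmean=5.0)

    # A large p-value means the data give no reason to reject the null hypothesis.
    print(mean, std, p_value)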

Finding optimal strategies for Data Collection is another issue with Statistics. Methods

need to be developed, which would efficiently search large databases to find representative


sample data points. Different Data Mining techniques have to be utilized for evolving data as opposed to static data.

Model Estimation of the data samples is also an important aspect of Statistics. Samples can

have different model distributions, leading to development of different algorithms for

them. Applicability of algorithms, hence, becomes a major issue in this case.

b) Relational Databases

An RDBMS (Relational Database Management System) stores the data in tables and can quickly search for the requested data by applying different query languages. Query optimization plays an important role in an RDBMS and deals with finding the best possible method for processing the given queries, in terms of the time taken for the processing task and the reliability of the query responses.

The database component of Data Mining provides fast and reliable access to data.

c) Artificial Intelligence

The goal of Artificial Intelligence is to perceive information from the environment using intelligent agents and to automate, via logical reasoning, the task of finding a set of actions that achieves a predetermined goal.

Search techniques are employed to map a set of perceptions into a single action or a set of actions. The search techniques can be divided into uninformed (e.g., uniform-cost) search and informed search.

Heuristics are used in informed search methods to find an optimal set of actions to achieve

the desired goals.

Knowledge representation methods are used to describe the relationships between

different objects in the environment. Examples for these are First Order Logic and

production rules. Some of these knowledge representation methods can also be used to describe the semantics of the knowledge, e.g., frames and semantic networks.

Knowledge acquisition, maintenance and application are other branches of Artificial

Intelligence, which are highly related with Databases and also with Data Mining.


d) Machine Learning

Machine Learning focuses on complex representations and search methods for specialized

data-intensive problems. Different machine learning methods utilize the specific prior

knowledge associated with the collected data. Such methods are generally more dependent

on the domain of the data.

Machine learning is often used in the context of Data Mining, to denote the application of

generic model-fitting or classification algorithms for predictive Data Mining. Unlike

traditional statistical data analysis, which is usually concerned with the estimation of

population parameters by statistical inference, the emphasis in Data Mining (and machine

learning) is usually on the accuracy of prediction (predicted classification), regardless of

whether or not the "models" or techniques that are used to generate the prediction are interpretable or open to simple explanation. Good examples of this type of technique often

applied to predictive Data Mining are neural networks or meta-learning techniques such as

boosting, etc. These methods usually involve the fitting of very complex "generic" models

that are not related to any reasoning or theoretical understanding of underlying causal

processes; instead, these techniques can be shown to generate accurate predictions or

classification in cross validation samples.

e) Visualization

Visualization is used to gain visual insights into the structure of the data. Users can

interactively explore the data, e.g. zoom in/out, rotate for images, or display some specific

detailed information for some attributes. Visualization is abundantly used as a pre- and

post-processing tool for KDD.

The various branches of Visualization are:

Displaying summarized properties of relevant data.

Exploring various relationships between variables (attributes).

Investigating large databases and conveying huge amounts of information.

Analyzing data from geographic or spatial domains.


f) Information Retrieval

Information Retrieval (IR) is the area of study concerned with searching for documents,

for information within documents, and for metadata about documents, as well as that of

searching relational databases and the World Wide Web. In response to various challenges

of providing information access, the field of Information Retrieval evolved to give

principled approaches to searching various forms of content. The field began with

scientific publications and library records, but soon spread to other forms of content,

particularly those of information professionals, such as journalists, lawyers, and doctors.

Much of the scientific research on Information Retrieval has occurred in these contexts,

and much of the continued practice of Information Retrieval deals with providing access to

unstructured information in various corporate and governmental domains. Many

universities and public libraries use IR systems to provide access to books, journals and

other documents. Web search engines are the most visible IR applications.

g) Online Analytical Processing

OLAP allows users to browse data following logical questions about the data. OLAP

generally includes the ability to drill down into data, moving from highly summarized

views of data into more detailed views. This is generally achieved by moving along

hierarchies of data. For example, if one were analyzing populations, one could start with

the most populous continent, and then drill down to the most populous country, then to the

state level, then to the city level, then to the neighborhood level. OLAP also includes

browsing up hierarchies (drill up), across different dimensions of data (drill across), and

many other advanced techniques for browsing data, such as automatic time variation when

drilling up or down time hierarchies. OLAP is by far the most widely implemented and used technique. It is also generally the most intuitive and easy to use.
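The drill-down idea can be sketched on a small, hypothetical population table using the pandas library; each step moves from a highly summarized view to a more detailed one along the geographic hierarchy.

    import pandas as pd

    # Hypothetical population data at the lowest level of the hierarchy.
    data = pd.DataFrame({
        "continent":  ["Asia", "Asia", "Asia", "Europe", "Europe"],
        "country":    ["India", "India", "China", "Germany", "France"],
        "state":      ["Karnataka", "Maharashtra", "Guangdong", "Bavaria", "Ile-de-France"],
        "population": [61, 112, 126, 13, 12],   # in millions (illustrative figures only)
    })

    # Highly summarized view: population per continent.
    print(data.groupby("continent")["population"].sum())

    # Drill down: within one continent, population per country.
    print(data[data["continent"] == "Asia"].groupby("country")["population"].sum())

    # Drill down further: within one country, population per state.
    print(data[data["country"] == "India"].groupby("state")["population"].sum())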

1.1.5 Architecture of Data Mining

Broadly, the Data Mining architecture [95] has three layers: a Database Layer, with sub-layers to prepare and store data and metadata; a Data Mining Application Layer, which uses algorithms to process the data and stores the results in the database; and a Front-End Layer, which facilitates parameter settings for the Data Mining application and visualization of the results in an interpretable form.


a) Database Layer

The Database Layer can be hosted on an RDBMS, on a file system, or on a mixture of the two; for example, data from source systems may initially be staged on a file system and then loaded into an RDBMS. The Database Layer may consist of various sub-layers, and the data in these sub-layers interfaces with multiple systems based on the activities in which it participates. The following diagram represents the various sub-layers in the Database Layer.

i. The Metadata layer is the most commonly and frequently used layer. It forms the backbone of the data in the entire Data Mining architecture and holds information about data sources, transformation algorithms, cleansing rules and the Data Mining results.

Figure 1.1: Generic 3-layer architecture for Data Mining

Figure 1.2: Metadata Layer

ii. The Data Layer comprises the Staging Area, the Prepared/Processed Data and the Data Mining Results. The Staging Area is used for temporarily holding the data sourced from the various source systems. It can be held in any form, e.g., flat files or tables in an RDBMS.

This data is transformed, cleansed, consolidated and loaded into a structured schema during the Data Preparation process. The prepared data is used as input data for Data Mining. The base data may undergo summarization or derivation, based on the business case, before it is presented to the Data Mining Application.

iii. The Data Mining output can be captured in the Data Mining Results layer so that it

can be made available to the users for visualization and analysis.

b) Data Mining Application Layer

The Data Mining Application Layer has two primary components, as shown in Figure 1.3:

Figure 1.3: Data Mining Application Layer

i. Data Manager Layer

It manages the data in the Database Layer and controls the data flow for Data Mining purposes. It provides the following functionalities:

Manage Data Sets - The Data Manager layer helps to classify the input data into

multiple sets so that they can be utilized during various stages of the Data Mining

task such as for building the Data Mining Model, Final Testing and Deployment

tasks. Also it classifies the results of the Data Mining task, which might be utilized

for further processing.



Input Data Flow – Data Manager layer provides transformation routines to extract

the data from the database in the required specific format (like itemized data for

Associations) for the Data Mining task. Also, it controls the flow of data as per the

Data Mining task requirements i.e. row by row or bulk load.

Output Data Flow - The Data Manager layer manages the results generated by the Data Mining task and delivers them to target systems (the Front End or other systems like CRM) in the required data format and according to the data flow specifications.

The Data Manager layer needs to be portable depending on the database from which

data has to be extracted and the Data Mining tool.

ii. Data Mining Tools / Algorithms

This is the heart of the complete Data Mining architecture. Numerous tools are available in the market, such as SAS, SPSS, Teradata Miner and IBM Intelligent Miner, to facilitate the application of algorithms to the input data. These Data Mining tools perform different tasks using various techniques and algorithms, depending upon the business requirements, to analyze the data and generate the results.

c) Front End Layer

The Front End is the user interface layer. It provides the following prime functionalities:

i. Administration

Administration screens for the Data Mining tasks are usually provided as a part of the

products / tools. These are utilized to administer the following primary tasks:

Data flow processes (e.g. Extracts, Loads)

Data Mining routines

Error reporting and correction

User security settings

ii. Input Parameter Settings

During the Data Mining Model build, iterations are inevitable. These iterations are

needed to fine-tune the model by changing various parameters involved in the model.

For executing a Data Mining task, the user needs to provide respective input parameters,


then observe the effect on the results and change the parameters if needed based on the

interpretation and understanding of the results. This facility is provided in the Front End

Layer.

iii. Data Mining Results / Visualization

The results of a Data Mining task sometimes need formatting and conversion into a user-understandable form before they are reported to the user. The Front End caters to the predefined formats of the output files generated by the respective Data Mining technique, giving the user the flexibility to view and analyze the results of Data Mining. A reporting utility performs the task of displaying reports, charts and smart reports (e.g., clusters, trees and networks).

1.2 Text Mining – Characteristics and domains of applications

Data Mining is typically concerned with the detection of patterns in numeric data, but very

often important (e.g., critical to business) information is stored in the form of text. Unlike

numeric data, text is often amorphous, and difficult to deal with. Text Mining generally

consists of the analysis of (multiple) text documents by extracting key phrases, concepts,

etc. and the preparation of the text processed in that manner for further analyses with

numeric Data Mining techniques (e.g., to determine co-occurrences of concepts, key

phrases, names, addresses, product names, etc.).

1.2.1 Representation of text documents

In Text Mining study, a document is generally used as the basic unit of analysis. A

document is a sequence of words and punctuation, following the grammatical rules of the

language, containing any relevant segment of text, and can be of any length. It can be a paper, an essay, a book, a web page, an email, etc., depending on the type of analysis being performed and the goals of the researcher. In some cases, a document may

contain only a chapter, a single paragraph, or even a single sentence. The fundamental unit

of text is a word. A term is usually a word, but it can also be a word-pair or phrase. In this

thesis, we will use term and word interchangeably. Words are comprised of characters, and


are the basic units from which meaning is constructed. By combining a word with

grammatical structure, a sentence is made. Sentences are the basic unit of action in text,

containing information about the action of some subject. Paragraphs are the fundamental

unit of composition and contain a related series of ideas or actions. As the length of text

increases, additional structural forms become relevant, often including sections, chapters,

entire documents, and finally, a corpus of documents. A corpus is a collection of

documents, and a lexicon is the set of all unique words in the corpus [120].

In Text Mining studies, a sentence is regarded simply as a set of words, or a "bag of words", and the order of words can be changed without impacting the outcome of the analysis. The syntactical structure of a sentence or paragraph is intentionally ignored in order to handle the text efficiently. The bag-of-words concept is also referred to as exchangeability in the generative language model [84].

1.2.2 Text Mining Techniques

Text Mining is an interdisciplinary field that utilizes techniques from the general field

of Data Mining and additionally, combines methodologies from various other areas

such as Information Extraction, Information Retrieval, Computational Linguistics,

Categorization, Clustering, Summarization, Topic Tracking and Concept Linkage [46],

[50], [120]. In the following sections, we will discuss each of these technologies and the

role that they play in Text Mining.

a) Information Extraction

Information extraction (IE) [71] is a process of automatically extracting structured

information from unstructured and/or semi-structured machine-readable documents,

processing human language texts by means of NLP. The final output of the extraction

process is some type of database obtained by looking for predefined sequences in text, a

process called pattern matching [64].


Tasks performed by IE systems include:

Term analysis, which identifies the terms appearing in a document. This is

especially useful for documents that contain many complex multi-word terms, such

as scientific research papers.

Named-entity recognition, which identifies the names appearing in a document,

such as names of people or organizations. Some systems are also able to recognize

dates and expressions of time, quantities and associated units, percentages, and so

on.

Fact extraction, which identifies and extracts complex facts from documents. Such

facts could be relationships between entities or events.

IE transforms a corpus of textual documents into a more structured database; the database constructed by an IE module can then be provided to the KDD module for further mining of knowledge, as illustrated in Figure 1.4.

b) Information Retrieval

Retrieval of text-based information also termed Information Retrieval (IR) has become a

topic of great interest with the advent of text search engines on the Internet. Text is

considered to be composed of two fundamental units, namely the document (book, journal

paper, chapters, sections, paragraphs, Web pages, computer source code, and so forth) and

the term (a word, word-pair, or phrase within a document). Traditionally in IR, text queries and documents are both represented in a unified manner, as sets of terms, so that distances between queries and documents can be computed, thus providing a framework within which simple text retrieval algorithms can be implemented directly.

Figure 1.4: Overview of IE-based Text Mining framework

c) Computational Linguistics/ Natural Language Processing

Natural Language Processing is a theoretically motivated range of computational

techniques for analyzing and representing naturally occurring texts at one or more levels of

linguistic analysis for the purpose of achieving human-like language processing for a

range of tasks or applications. The goal of Natural Language Processing (NLP) is to design

and build a computer system that will analyze, understand, and generate natural human-

languages. Applications of NLP include machine translation of one human-language text

to another; generation of human-language text such as fiction, manuals, and general

descriptions; interfacing to other systems such as databases and robotic systems thus

enabling the use of human-language type commands and queries; and understanding

human-language text to provide a summary or to draw conclusions.

An NLP system performs the following tasks:

Parse a sentence to determine its syntax.

Determine the semantic meaning of a sentence.

Analyze the context of the text to determine its true meaning so that it can be compared with other text.

The role of NLP in Text Mining is to provide the systems in the information extraction

phase with linguistic data that they need to perform their task. Often this is done by

annotating documents with information like sentence boundaries, part-of-speech tags,

parsing results, which can then be read by the information extraction tools.

d) Categorization

Categorization is the process of recognizing, differentiating and understanding the ideas

and objects in order to group them into categories for a specific purpose. Ideally, a category

illuminates a relationship between the subjects and objects of knowledge. Categorization is

fundamental in language, prediction, inference, decision making and in all kinds of

environmental interaction.


There are many categorization theories and techniques. In a broader historical view,

however, three general approaches to categorization may be identified as:

Classical categorization – According to the classical view, categories should be clearly defined, mutually exclusive and collectively exhaustive, with every entity belonging to one, and only one, of the proposed categories.

Conceptual clustering – A modern variation of the classical approach, in which classes (clusters of entities) are generated by first formulating their conceptual descriptions and then classifying the entities according to these descriptions. Conceptual clustering is closely related to fuzzy set theory, in which objects may belong to one or more groups, in varying degrees of fitness.

Prototype theory - Categorization can also be viewed as the process of grouping

things based on prototypes. Categorization based on prototypes is the basis for

human development, and relies on learning about the world via embodiment.

e) Topic Tracking

A topic tracking [64] system works by keeping user profiles and, based on the documents

the user views, predicts other documents of interest to the user. Yahoo offers a free topic

tracking tool (www.alerts.yahoo.com) that allows users to choose keywords and notifies

them when news relating to those topics becomes available.

Topic tracking technology, however, has limitations. For example, if a user sets up an alert for "Text Mining", s/he may receive several news stories on mining for minerals and very few that are actually about Text Mining. Some of the better Text Mining tools let users select particular categories of interest, or the software can even automatically infer the user's interests based on his/her reading history and click-through information. Keyword extraction has become a basis of several Text Mining applications such as search engines,

text categorization, summarization, and topic detection. In [99], Nelken et al. proposed a

disambiguation system that separates the on-topic occurrences and filters them from the

potential multitude of references to unrelated entities.


f) Clustering

Clustering [3] is a technique in which objects with logically similar properties are physically

placed together in one class of objects and a single access to the disk makes the entire class

available. There are many clustering methods available, and each of them may give a

different grouping of a dataset. The choice of a particular method will depend on the type

of output desired, the known performance of method with particular types of data, the

hardware and software facilities available and the size of the dataset. In general, clustering

methods may be divided into two categories based on the cluster structure which they

produce. The non-hierarchical methods divide a dataset of N objects into M clusters, with

or without overlap. These methods are divided into partitioning methods, in which the

classes are mutually exclusive, and the less common clumping methods, in which overlap

is allowed. Each object is a member of the cluster with which it is most similar; however

the threshold of similarity has to be defined. The hierarchical methods produce a set of

nested clusters in which each pair of objects or clusters is progressively nested in a larger

cluster until only one cluster remains. The hierarchical methods can be further divided into

agglomerative and divisive methods. In agglomerative methods, the hierarchy is built up in a series of N-1 agglomerations, or fusions, of pairs of objects, beginning with the unclustered dataset. The less common divisive methods begin with all objects in a single cluster and, at each of N-1 steps, divide a cluster into two smaller clusters, until each object resides in its own cluster.
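A minimal sketch of the agglomerative (bottom-up) case is shown below: it starts with each document vector in its own cluster and repeatedly fuses the two most similar clusters until the requested number of clusters remains. The use of cosine similarity, single linkage and a fixed target number of clusters are illustrative assumptions, not choices made elsewhere in this thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def agglomerative(vectors, num_clusters):
    """Single-linkage agglomerative clustering: start with one cluster per object
    and repeatedly fuse the two most similar clusters (at most N-1 fusions)."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        best, pair = -1.0, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])   # fuse the most similar pair
        del clusters[b]
    return clusters

if __name__ == "__main__":
    vecs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 1], [0, 0.8, 1.2]]
    print(agglomerative(vecs, 2))   # [[0, 1], [2, 3]]
```

A divisive method would run in the opposite direction, starting from a single cluster containing all objects and splitting it step by step.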

g) Concept Linkage

Concept linkage [46] identifies related documents based on concepts commonly shared between them. The primary goal of concept linkage is to provide browsing for

information rather than searching for it as in IR. For example, a Text Mining software

solution may easily identify a link between topics X and Y, and Y and Z. Concept linkage

is a valuable concept in Text Mining which could also detect a potential link between X

and Z, something that a human researcher has not come across because of the large

volume of information s/he would have to sort through to make the connection. Concept linkage is beneficial for identifying links between diseases and treatments. In the near future,

Text Mining tools with concept linkage capabilities will be beneficial in the biomedical


field helping researchers to discover new treatments by associating treatments that have

been used in related fields.

h) Information Visualization

Visual Text Mining [46], or information visualization, puts large textual sources in a

visual hierarchy or map and provides browsing capabilities, in addition to simple

searching, e.g., Informatik V's DocMiner. The user can interact with the document map

by zooming, scaling, and creating sub-maps. The government can use information

visualization to identify terrorist networks or to find information about crimes that may

have been previously thought unconnected. It could provide them with a map of

possible relationships between suspicious activities so that they can investigate

connections that they would not have come up with on their own. Text Mining with

Information visualization has been shown to be useful in academic areas, where it can

allow an author to easily identify and explore papers in which s/he is referenced. It is

useful to users, allowing them to narrow down a broad range of documents and explore

related topics.

i) Summarization

A summary is a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s). 'Text' here includes multimedia documents, on-line documents, hypertexts, etc. Many types of summary have been identified, including indicative summaries (which give an idea of what the text is about without giving any content) and informative ones (which do provide a shortened version of the content). Extracts are summaries created by reusing portions (words, sentences, etc.) of the input text verbatim, while abstracts are created by re-generating the extracted content. A generic summary is not related to a specific topic, while a query-based summary discusses the topic mentioned in the given query. A summary can also be created for a single document or for multiple documents.


1.2.3 Domains of applications of Text Mining

Regarded as the next wave of knowledge discovery, Text Mining has a very high

commercial value. It is an emerging technology for analyzing large collections of

unstructured documents for the purposes of extracting interesting and non-trivial patterns

or knowledge.

Text Mining applications can be broadly organized into two groups [129]:

Document exploration tools – They organize documents based on their text content

and provide an environment for a user to navigate and browse in a document or

concept space. A popular approach is to perform clustering on the documents based on

their similarities in content and present the groups or clusters of the documents in

certain graphical representation.

Document analysis tools – They analyze the text content of the documents and

discover the relationships between concepts or entities described in the documents.

They are mainly based on natural language processing techniques, including text

analysis, text categorization, information extraction, and summarization.

There are many possible application domains based on Text Mining technology [42], [64],

[130]. We briefly mention a few below:

Customer profile analysis, e.g., mining incoming emails for customers' complaint

and feedback.

Patent analysis, e.g., analyzing patent databases for major technology players,

trends, and opportunities.

Information dissemination, e.g., organizing and summarizing trade news and

reports for personalized information services.

Company resource planning, e.g., mining a company's reports and correspondences

for activities, status, and problems reported.

Security issues, e.g., analyzing plain text sources such as Internet news. It also involves the study of text encryption.

Open-ended survey responses, e.g., analyzing a certain set of words or terms that

are commonly used by respondents to describe the pros and cons of a product or

service (under investigation), suggesting common misconceptions or confusion

regarding the items in the study.


Text classification, e.g., automatically filtering out most undesirable "junk email"

based on certain terms or words that are not likely to appear in legitimate

messages.

Competitive Intelligence, e.g., enabling companies to organize and modify the

company strategies according to present market demands and the opportunities

based on the information the company collects about itself, the market and its competitors, and to manage the enormous amount of data that has to be analyzed for planning.

Customer Relationship Management (CRM), e.g., rerouting specific requests

automatically to the appropriate service or supplying immediate answers to the

most frequently asked questions.

Multilingual Applications of Natural Language Processing, e.g., identifying and

analyzing web pages published in different languages.

Technology watch, e.g., identifying the relevant Science & Technology literature, and extracting the required information from it efficiently.

Text summarization, e.g., creating a condensed version of a document or a

document collection (multi-document summarization) that should contain its most

important topics.

Bio-entity recognition, e.g., identifying and classifying technical terms in the domain of molecular biology corresponding to instances of concepts that are of

interest to biologists. Examples of such entities include the names of proteins,

genes and their locations of activity such as cells or organism names.

Organize repositories of document-related meta-information, e.g., automatic text

categorization methods [107] are used to create structured metadata used for

searching and retrieving relevant documents based on a query.

Gain insights about trends, relations between people/places/organizations, e.g.,

aggregating and comparing information extracted automatically from documents of a certain type, such as incoming mail, customer letters, newswires and so on.


1.2.4 Architecture of a Text Mining system

A Text Mining system takes as input a collection of documents and then preprocesses

each document by checking its format and character sets [136]. Next, these preprocessed

documents go through a text analysis phase, sometimes repeating the techniques, until the

required information is extracted. Three text analysis techniques are shown in Figure 1.5,

but many other combinations of techniques could be used depending on the goals of the

organization. The resulting extracted information can be input to a management

information system, yielding an abundant amount of knowledge for the user of that

system. Figure 1.6 explores the detailed processing steps followed in Text Mining System.

Figure 1.5: An example of a Text Mining System

Figure 1.6: Text Mining Process

The different steps of the Text Mining process, as shown above, are briefly discussed below:

a) Document files of different formats like PDF files, txt files or flat files are

collected from different sources such as online chat, SMS, emails, message boards,

newsgroups, blogs, wikis and web pages. This unstructured dataset of documents is

pre-processed to perform the following three tasks:

Tokenize the file into individual tokens using space as the delimiter.

Remove the stop words, which do not convey any meaning.

Use the Porter stemmer algorithm to reduce words with a common root to their stem.

b) Feature Generation and Feature Selection activities are performed on these retrieved and preprocessed documents to represent the unstructured text documents in a more structured, spreadsheet-like format. Feature Selection algorithms help to identify the important features; finding an optimal feature set requires an exhaustive search of all subsets of features of the chosen cardinality, which is impractical when a large number of features is available, so supervised learning algorithms search for a satisfactory set of features instead of an optimal one.

c) After the appropriate selection of features, the Text Mining techniques are applied for applications like Information Retrieval, Information Extraction, Summarization and Topic Discovery as part of the knowledge discovery process.

Figure 1.6 also depicts the management information system in which the resulting knowledge is stored and retrieved.
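To make steps a) and b) above concrete, the following minimal Python sketch tokenizes raw documents, removes stop words, and keeps only terms whose document frequency reaches a threshold, a simple document-frequency stand-in for feature selection. The stop-word list, the punctuation handling and the min_df threshold are illustrative assumptions, not the exact choices used later in this thesis; stemming is omitted here for brevity.

```python
from collections import Counter

# illustrative stop-word list; real systems use much larger lists
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "or", "in", "on", "to", "from"}

def preprocess(text):
    """Tokenize on whitespace, strip surrounding punctuation, lowercase, drop stop words."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]

def build_term_matrix(docs, min_df=2):
    """Represent each document as a term-count vector, keeping only terms that occur
    in at least `min_df` documents -- a simple document-frequency feature selection."""
    counts = [Counter(preprocess(d)) for d in docs]
    df = Counter(term for c in counts for term in set(c))   # document frequency
    vocab = sorted(t for t, f in df.items() if f >= min_df)
    matrix = [[c.get(t, 0) for t in vocab] for c in counts]
    return vocab, matrix

if __name__ == "__main__":
    docs = ["Text mining extracts knowledge from text documents.",
            "Clustering groups similar text documents.",
            "Summarization condenses documents."]
    vocab, matrix = build_term_matrix(docs)
    print(vocab)    # ['documents', 'text']
    print(matrix)   # [[1, 2], [1, 1], [1, 0]]
```

The resulting document-term matrix is the structured input on which the Text Mining techniques of step c) can then operate.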

1.3 Information Retrieval – Basic Concepts, Models and

Techniques

In this section, we first briefly discuss techniques for Information Retrieval, such as the extraction of index terms and retrieval models. Finally, we describe the different Information Retrieval evaluation techniques and the framework of Information Retrieval.


1.3.1 Introduction

Information Retrieval is a field at the intersection of information science and computer

science. The term was coined by Mooers in 1951, who advocated that IR can be applied to

the "intellectual aspects" of description of information and systems for its searching [97].

It concerns itself with the indexing and retrieval of information from heterogeneous and

mostly-textual information resources. In other words, Information Retrieval is defined as

"The study of systems for indexing, searching, and recalling data, particularly text or other

unstructured forms."

As shown in figure 1.7, an Information Retrieval [72] process begins when a user enters a

query into the system. Queries are formal statements of information needs, for example

search strings in web search engines. In Information Retrieval a query does not uniquely

identify a single object in the collection. An object is an entity that is represented by

information in a database. Depending on the application the data objects may be, for

example, text documents, images, audio, mind maps or videos. Often the documents

themselves are not kept or stored directly in the IR system, but are instead represented in

the system by document surrogates or metadata. User queries are matched against the

database information. Several objects may match the query, perhaps with different degrees

of relevancy. Most IR systems compute a numeric score of how well each object in the database matches the query, and rank the objects according to this value. The top ranking

objects are then shown to the user. The process may then be iterated if the user wishes to

refine the query.


1.3.2 Document Preprocessing

The aim of text preprocessing is to transform each document into a sequence of features

that will be used in subsequent steps. The input documents go through text segmentation,

punctuation removal, conversion of upper to lower case, and stopword removal [114]. In

this section we outline the preprocessing steps to represent the text document as a term

vector.

Figure 1.7: A general model of Information Retrieval

1.3.2.1 Lexical Analysis of the text

Lexical analysis is the process of converting the character stream of the text document into a stream of words, which are later treated as index terms. The spaces between the words are treated as word separators. There are certain issues which must be considered while identifying the words in the text document.

Punctuation marks like the dot (.), comma (,), hyphen (-) and apostrophe (') are removed during the lexical analysis process. But there are certain situations in which the removal of punctuation marks has a negative impact on retrieval performance. For example, the word "T.V." contains punctuation marks as an integral part, and removing them to give "TV" does not affect the analysis process, while if the punctuation marks are removed from the word "a.m.", the resulting word "am" has a totally different meaning. In this case, the dot marks should not be removed.

During the lexical analysis process, the case of words is usually ignored, and two words with the same sequence of letters, like "AMERICA" and "America", are treated as equal. But sometimes this may produce misleading results, as with the words "US" and "us".

Words containing digits are often not treated as index terms. But sometimes they may contain important information, like "512 B.C.". So alphanumeric words should not be ignored, while words containing only digits can be discarded as index terms.

The lexical analysis process therefore requires suitable exceptions to be considered along with the general rules, in order to handle the issues discussed above and minimize document preprocessing errors.

1.3.2.2 Elimination of Stopwords

Stopwords [1], [14] are regarded as 'functional words' which do not carry meaning in

natural language and so they are ignored when identifying the index terms. Elimination of

stopwords contributes to reducing the size of the indexing structure considerably.

The stopword list normally includes articles (like a, the), prepositions (like on, in), and conjunctions (like but, and). Words which occur too frequently in the document collection can also be included in the stopword list, as such words do not help to discriminate between the documents. For example, the "Reuters-21578" collection contains Reuters newswire stories, and so the word "Reuter" appears in each document and hence cannot be used as an index term to differentiate among the documents.


1.3.2.3 Stemming

Stemming [29], [125] is the process for reducing inflected (or sometimes derived) terms to

their stem, base or root form - generally a written word form. A stem is the portion of a

word which is left after the removal of its affixes (i.e., prefixes and suffixes). In affix removal, a suffix is stripped from the word. For example, the strings "collected", "collection" and "collecting" are based on the base term "collect". Stemming reduces the size of the indexing structure by reducing the number of distinct index terms, and improves retrieval performance by reducing the variants of the same root word to a common concept. Many suffix removal algorithms are known, such as the Lovins algorithm, Krovetz stemming, the Paice/Husk algorithm and the Porter algorithm [25].
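As a minimal illustration of suffix removal, the sketch below strips a handful of common English suffixes from a word. It is far cruder than the Lovins, Krovetz, Paice/Husk or Porter algorithms cited above; the suffix list and the minimum stem length are assumptions chosen purely for this example.

```python
# a few suffixes checked longest-first; real stemmers use far richer rule sets
SUFFIXES = ("ations", "ation", "ion", "ing", "ed", "es", "s")

def crude_stem(word, min_stem=4):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

if __name__ == "__main__":
    for w in ("collected", "collection", "collecting", "collects"):
        print(w, "->", crude_stem(w))   # all four map to "collect"
```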

1.3.2.4 Term Frequency and Weighting

The relevance of each term in the document is estimated using the term frequency

information to generate weights for all the terms in a document [1], [96], [150], [119].

Different methods are used to calculate the weight of a term:

a) Term Frequency (TF) Weighting

The weight of the jth term in the ith document using TF weighting is represented as

wij = tfij

where tfij is the frequency of occurrence of the jth term in the ith document.

TF weighting, however, does not consider the frequency of the term throughout all the documents in the document corpus.

b) Term Frequency (TF) × Inverse Document Frequency (IDF) Weighting

The TF×IDF weighting approach weights the frequency of a term in a document with a factor that discounts its importance if it appears in most of the documents, as in this case the term is assumed to have little discriminating power. The weight of the jth term in the ith document is represented as

wij = tfij · log ( n / nj )     (1.1)

where

n is the total number of documents in the document pool

nj is the number of documents in the pool containing the jth term, nj ≤ n

c) TF×IDF Weighting with Length Normalization

In this approach, to account for documents of different lengths, each document vector is normalized so that it is of unit length. Here, the weight of the jth term in the ith document is represented as

w_{ij} = \frac{tf_{ij} \cdot \log(n/n_j)}{\sqrt{\sum_{k=1}^{m} \left( tf_{ik} \cdot \log(n/n_k) \right)^{2}}}     (1.2)

where m is the total number of unique terms appearing in the ith document.

The TF×IDF weighting approach [116] weighs the frequency of a term in a document based on two observations:

• The relevance of a term to the topic of a document is proportional to the number of times it appears in the document.

• If a term appears in a large number of documents in the document set, then it cannot be used to discriminate between the different documents.
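A small Python sketch of this weighting scheme is given below: it computes wij = tfij · log(n/nj) as in equation (1.1) and then length-normalizes each document vector as in equation (1.2). The toy corpus and the use of the natural logarithm are assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocab, list of length-normalized
    tf-idf vectors) following equations (1.1) and (1.2)."""
    n = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter(t for c in counts for t in set(c))   # n_j: docs containing term j
    vocab = sorted(df)
    vectors = []
    for c in counts:
        w = [c[t] * math.log(n / df[t]) if c[t] else 0.0 for t in vocab]  # eq. (1.1)
        norm = math.sqrt(sum(x * x for x in w))                            # eq. (1.2)
        vectors.append([x / norm for x in w] if norm else w)
    return vocab, vectors

if __name__ == "__main__":
    docs = [["text", "mining", "text"], ["data", "mining"], ["text", "retrieval"]]
    vocab, vecs = tfidf_vectors(docs)
    print(vocab)
    for v in vecs:
        print([round(x, 3) for x in v])
```

Note that a term appearing in every document receives a weight of zero, reflecting its lack of discriminating power.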

1.3.3 Retrieval models

For Information Retrieval [72] to be efficient, the documents are typically transformed

into a suitable representation. There are several representation models. These

representation models are categorized according to two dimensions: the mathematical

basis and the properties of the model.


a) Mathematical basis

Based on mathematical basis, the representation models are classified as:

i. Set-theoretic models represent documents as sets of words or phrases. Similarities

are usually derived from set-theoretic operations on those sets. Common Set-

theoretic models include:

Standard Boolean model

Extended Boolean model

Fuzzy retrieval

ii. Algebraic models represent documents and queries as vectors, matrices, or tuples.

The similarity of the query vector and document vector is represented as a scalar

value. Some common algebraic models are:

Vector space model

Generalized vector space model

(Enhanced) Topic-based Vector Space Model

Extended Boolean model

Latent semantic indexing

iii. Probabilistic models treat the process of document retrieval as a probabilistic

inference. Similarities are computed as probabilities that a document is relevant for

a given query. Probabilistic theorems like the Bayes' theorem are often used in

these models. Common Probabilistic models are:

Binary Independence Model

Probabilistic relevance model, on which the Okapi (BM25) relevance function is based

Uncertain inference

Language models

Divergence-from-randomness model

Latent Dirichlet allocation

iv. Feature-based retrieval models view documents as vectors of values of feature

functions (or just features) and seek the best way to combine these features into a

single relevance score, typically by learning to rank methods. Feature functions are


arbitrary functions of document and query, and as such can easily incorporate

almost any other retrieval model as just yet another feature.

b) Properties of the model

Based on properties of the model, the representation models are classified as:

Models without term-interdependencies treat different terms/words as independent.

This fact is usually represented in vector space models by the orthogonality

assumption of term vectors or in probabilistic models by an independency

assumption for term variables.

Models with immanent term interdependencies allow a representation of

interdependencies between terms. However the degree of the interdependency

between two terms is defined by the model itself. It is usually directly or indirectly

derived (e.g. by dimensional reduction) from the co-occurrence of those terms in

the whole set of documents.

Models with transcendent term interdependencies allow a representation of

interdependencies between terms, but they do not allege how the interdependency

between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example, a human or a sophisticated algorithm).

In this section, we discuss the following three Information Retrieval mathematical models,

one from each category:

Standard Boolean Model

Vector Space Model

Binary Independence Probabilistic Model

1.3.3.1 Standard Boolean Model

The Boolean model of Information Retrieval is a classical Information Retrieval model. It

is based on Boolean Logic and classical Set theory in that both the documents to be

searched and the user's query are conceived as sets of terms. Retrieval is based on whether

or not the documents contain the query terms. Given a finite set


T = {t1, t2, ..., tj, ..., tm} (1.3)

of elements called index terms (e.g. words or expressions - which may be stemmed -

describing or characterizing documents, such as keywords given for an article), a finite set

D = {D1, ..., Di, ..., Dn}, (1.4)

of elements called documents, where each Di is an element of the power set of T.

Given a Boolean expression - in a normal form - Q called a query as follows:

Q = (Wi OR Wk OR ...) AND ... AND (Wj OR Ws OR ...), (1.5)

with Wi=ti, Wk=tk, Wj=tj, Ws=ts, or Wi=NON ti, Wk=NON tk, Wj=NON tj, Ws=NON ts

where ti means that the term ti is present in document Di, whereas NON ti means that it is

not.

Equivalently, Q can be given in a disjunctive normal form, too. An operation called

retrieval, consisting of two steps, is defined as follows:

1. The sets Sj of documents that do or do not contain the term tj (depending on whether Wj=tj or Wj=NON tj) are obtained:

Sj = {Di|Wj element of Di} (1.6)

2. Those documents are retrieved in response to Q which are the result of the

corresponding set operations, i.e., the answer to Q is as follows:

UNION ( INTERSECTION Sj)
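A minimal sketch of this two-step retrieval, using Python sets for the sets Sj and assuming the query has already been parsed into conjunctive normal form (a list of OR-clauses whose literals may be negated), is shown below.

```python
def boolean_retrieve(docs, query_cnf):
    """docs: dict mapping document id -> set of index terms.
    query_cnf: list of OR-clauses; each clause is a list of (term, negated) literals.
    Returns the ids of documents satisfying every clause (cf. eqs. 1.5 and 1.6)."""
    all_ids = set(docs)
    answer = set(all_ids)
    for clause in query_cnf:                      # clauses are ANDed (intersection)
        clause_docs = set()
        for term, negated in clause:              # literals are ORed (union)
            s_j = {d for d, terms in docs.items() if term in terms}
            clause_docs |= (all_ids - s_j) if negated else s_j
        answer &= clause_docs
    return answer

if __name__ == "__main__":
    docs = {1: {"text", "mining"}, 2: {"data", "mining"}, 3: {"text", "retrieval"}}
    # (text OR data) AND (NOT retrieval)
    query = [[("text", False), ("data", False)], [("retrieval", True)]]
    print(boolean_retrieve(docs, query))   # {1, 2}
```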

The main advantages of the Boolean model are:

i. Clean Formalism

ii. Easy to implement

iii. Intuitive concept


On the other hand, the disadvantages of the Boolean model are:

i. Exact matching may retrieve too few or too many documents

ii. Difficult to rank output, some documents are more important than others

iii. Hard to translate a query into a Boolean expression

iv. All terms are equally weighted

v. More like data retrieval than Information Retrieval

1.3.3.2 Vector Space Model

The vector space model (or term vector model) [66], [118], [121] is an algebraic model used for information filtering, Information Retrieval, indexing and relevancy rankings. It

represents natural language documents (or any objects, in general) in a formal manner

through the use of vectors (of identifiers, such as, for example, index terms) in a multi-

dimensional linear space.

Documents and queries are represented as vectors.

dj = (w1,j,w2,j,...,wt,j) (1.7)

q = (w1,q,w2,q,...,wt,q) (1.8)

Each dimension corresponds to a separate term. If a term occurs in the document, its value

in the vector is non-zero. Several different ways of computing these values, also known as

(term) weights, have been developed. One of the best known schemes is tf-idf weighting.

The definition of term depends on the application. Typically terms are single words,

keywords, or longer phrases. If the words are chosen to be the terms, the dimensionality of

the vector is the number of words in the vocabulary (the number of distinct words

occurring in the corpus).

Vector operations can be used to compare documents with queries.


Figure 1.8: Vector Space Model

Relevance rankings of documents in a keyword search can be calculated, using the

assumptions of document similarities theory, by comparing the deviation of angles

between each document vector and the original query vector where the query is

represented as same kind of vector as the documents.

In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself:

\cos\theta = \frac{d_2 \cdot q}{\|d_2\|\,\|q\|}

where d2 · q is the dot product of the document vector d2 and the query vector q (ref. figure 1.8), ||d2|| is the norm of vector d2, and ||q|| is the norm of vector q. The norm of a vector is calculated as:

\|d\| = \sqrt{\sum_{i=1}^{t} d_i^{2}}

A cosine value of zero means that the query and document vector are orthogonal and have

no match (i.e. the query term does not exist in the document being considered).
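The ranking described above can be sketched as follows: documents and the query are represented as vectors over the same vocabulary (for example, the tf-idf vectors computed earlier), and documents are ordered by the cosine of the angle between their vector and the query vector. The plain-list vectors and the tiny vocabulary are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, doc_vecs):
    """Return document indices sorted by decreasing cosine similarity to the query."""
    scores = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for s, i in sorted(scores, reverse=True)]

if __name__ == "__main__":
    # vocabulary: ["data", "mining", "text"]
    docs = [[0, 1, 2], [1, 1, 0], [0, 0, 1]]
    query = [0, 1, 1]           # "text mining"
    print(rank(query, docs))    # [0, 2, 1] -- doc 0 matches both query terms
```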


The vector space model has the following advantages over the Standard Boolean model:

i. Simple model based on linear algebra

ii. Term weights not binary

iii. Allows computing a continuous degree of similarity between queries and

documents

iv. Allows ranking documents according to their possible relevance

v. Allows partial matching

The vector space model has the following limitations:

i. Long documents are poorly represented because they have poor similarity values (a

small scalar product and a large dimensionality)

ii. Search keywords must precisely match document terms; word substrings might

result in a "false positive match"

iii. Semantic sensitivity; documents with similar context but different term vocabulary

won't be associated, resulting in a "false negative match".

iv. The order in which the terms appear in the document is lost in the vector space

representation.

v. Assumes terms are statistically independent

vi. Weighting is intuitive but not very formal

1.3.3.3 Probabilistic Model (Binary Independence Model)

The Binary Independence Model (BIM) is a probabilistic Information Retrieval technique

that makes some simple assumptions to make the estimation of document/query similarity

probability feasible.

The "binary" in BIM is to be taken in the sense of "Yes or No', the representation is an

ordered set of Boolean variables. The representation of a documents or query is a vector

with one Boolean element for each term under consideration. More specifically, a

document is represented by a vector d = (x1, ..., xm) where xt=1 if term t is present in the

document d and xt=0 if it's not. Many documents can have the same vector representation

with this simplification. Queries are represented in a similar way. "Independence" signifies

that terms in the document are considered independently from each other and no


association between terms is modeled. This assumption is very limiting, but it has been

shown that it gives good enough results for many situations. This independence is the

"naive" assumption of a Naive Bayes classifier, where properties that imply each other are

nonetheless treated as independent for the sake of simplicity. This assumption allows the

representation to be treated as an instance of a Vector space model by considering each

term as a value of 0 or 1 along a dimension orthogonal to the dimensions used for the other

terms.

The probability P(R|d,q) that a document is relevant derives from the probability of relevance of the term vector of that document, P(R|x,q). Using Bayes' rule we get:

P(R=1|x,q) = \frac{P(x|R=1,q)\, P(R=1|q)}{P(x|q)}, \qquad P(R=0|x,q) = \frac{P(x|R=0,q)\, P(R=0|q)}{P(x|q)}

where P(x|R=1,q) and P(x|R=0,q) are the probabilities of retrieving a relevant or non-relevant document, respectively, when that document's representation is x. The exact probabilities cannot be known beforehand, so estimates from statistics about the collection of documents must be used.

P(R=1|q) and P(R=0|q) indicate the prior probability of retrieving a relevant or non-relevant document, respectively, for a query q. If, for instance, we knew the percentage of

relevant documents in the collection, then we could use it to estimate these probabilities.

Since a document is either relevant or non-relevant to a query, we have:

P(R = 1 | x,q) + P(R = 0 | x,q) = 1 (1.12)
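Because of the independence assumption, the odds P(R=1|x,q)/P(R=0|x,q) factorize over the terms, which is what makes the model tractable. The sketch below scores a document by the resulting log-odds over the query terms, assuming that per-term estimates pt = P(xt=1|R=1) and ut = P(xt=1|R=0) are already available (in practice they are estimated from collection and relevance statistics); it illustrates the principle rather than any particular implementation.

```python
import math

def bim_log_odds(doc_terms, query_terms, p, u):
    """Log-odds score of a document under the Binary Independence Model.
    doc_terms: set of terms present in the document (the binary vector x).
    p[t] = P(x_t = 1 | R=1), u[t] = P(x_t = 1 | R=0): assumed given estimates.
    Terms are treated independently, so the log-odds is a sum over query terms."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            score += math.log(p[t] / u[t])
        else:
            score += math.log((1.0 - p[t]) / (1.0 - u[t]))
    return score

if __name__ == "__main__":
    # toy estimates: "retrieval" is far more likely in relevant documents
    p = {"text": 0.6, "retrieval": 0.8}
    u = {"text": 0.4, "retrieval": 0.1}
    query = ["text", "retrieval"]
    for doc in ({"text", "retrieval", "model"}, {"text", "mining"}):
        print(sorted(doc), round(bim_log_odds(doc, query, p, u), 3))
```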

Roelleke and Wang [113] investigate the probabilistic relational implementations of BIM

under the use of probabilistic relational algebra for the integration of Information Retrieval

in databases. This work arises as a result of the interest shown by Cui and Potok [36] in

applying the knowledge of probabilistic models for Information Retrieval in structured

data, to investigate the problem of ranking the answers to a database query when many

tuples are returned.


1.3.4 Similarity measures

Many Data Mining and particularly Text Mining techniques, like clustering and classification, are based on similarity measures between objects. Measures of similarity may be obtained

from vectors of measurements or characteristics describing each object (document). Given

a particular vector-space representation, distance between documents can be defined as a

well-defined function of distance between their document vectors. Two commonly used

similarity measures are Euclidean Distance and Cosine distance [70], [115], [116], [121].

1.3.4.1 Euclidean Distance

Euclidean distance examines the square root of the sum of squared differences between the vectors of a pair of text documents.

In n dimensions, the Euclidean distance between two document vectors di and dk is computed as:

Ed_{ik} = \sqrt{\sum_{j=1}^{c} \left( d_{ij} - d_{kj} \right)^{2}}, \quad i \neq k     (1.13)

where dij (or dkj) is the weight of the jth term in document di (or dk), and c is the sum of the numbers of unique terms in di (a) and dk (b), c = a + b.

A disadvantage of Euclidean distance is that it is influenced by terms that have larger frequency values in the documents. Normalizing the distance to a value between 0 and 1 is therefore used to prevent this.

1.3.4.2 Cosine Distance

Cosine similarity is a measure of the similarity between two vectors of m dimensions, obtained by finding the cosine of the angle between them; it is often used to compare documents in Text Mining. For

text matching, the document vectors di and dk are the tf-idf vectors of the documents,

which give a high weight to terms that have a high frequency occurrence in the document

and low frequency occurrence in whole document corpus.


Cosine distance between two documents di and dk is defined as:

Cosine_{ik} = \frac{\sum_{j=1}^{a+b} d_{ij}\, d_{kj}}{\sqrt{\sum_{j=1}^{a} d_{ij}^{2}}\; \sqrt{\sum_{j=1}^{b} d_{kj}^{2}}}     (1.14)

where dij (or dkj) is the weight of the jth term in document di (or dk).

The cosine measure gives values between 0 and 1, with documents that contain similar

words tending to have a similarity close to 1, and documents with few words in common

tending to have a similarity close to 0.

Unlike Euclidean distance, cosine distance ensures that only words shared between

compared documents are considered (the weight of a word is zero if it does not appear in a

document). Cosine distance has been found historically to be quite effective in practical IR

experiments, where queries are also expressed using the same term-based representation as

that used for documents.
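The two measures can be contrasted directly on a pair of term-weight vectors defined over a common vocabulary, as in the sketch below; the vectors are illustrative stand-ins for the tf-idf vectors used in equations (1.13) and (1.14).

```python
import math

def euclidean_distance(di, dk):
    """Eq. (1.13): square root of the sum of squared term-weight differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(di, dk)))

def cosine_similarity(di, dk):
    """Eq. (1.14): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(di, dk))
    norm = math.sqrt(sum(a * a for a in di)) * math.sqrt(sum(b * b for b in dk))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    # same term proportions, very different vector lengths
    d1 = [1.0, 2.0, 0.0]
    d2 = [10.0, 20.0, 0.0]
    print(euclidean_distance(d1, d2))   # large: dominated by the magnitudes
    print(cosine_similarity(d1, d2))    # 1.0: identical direction, maximal similarity
```

The example makes the point of the preceding paragraphs explicit: the Euclidean distance is driven by raw term weights, whereas the cosine measure depends only on the direction of the vectors.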

1.3.5 Retrieval Evaluation

Many different measures for evaluating the performance of Information Retrieval systems

have been proposed. Cleverdon et al. [35] listed six criteria that could be used to evaluate

an Information Retrieval system: (1) coverage, (2) time lag, (3) recall, (4) precision, (5)

presentation and (6) user effort. Of these criteria, recall and precision have most frequently

been applied in measuring Information Retrieval. The measures require a collection of

documents and a query. All the measures rely on assessing the relevance of every document to a particular query.

In the following discussion, we explain the popular retrieval evaluation measures: recall,

precision, F-measure.


1.3.5.1 Recall

Recall [72], [112] is the fraction of documents that are relevant to the query that are

successfully retrieved.

In binary classification, recall is called sensitivity. Recall can also be defined as the

probability that a relevant document is retrieved by the query.

It is trivial to achieve 100% recall by returning all documents in response to any query.

Therefore, recall alone is not enough, and the effectiveness of the IR system is also assessed with respect to the number of non-relevant documents retrieved, by computing the precision.

1.3.5.2 Precision

Precision [72], [112] is the fraction of the documents retrieved that are relevant to the

user's information need.

In binary classification, precision is analogous to positive predictive value. Precision takes

all retrieved documents into account. It can also be evaluated at a given cut-off rank,

considering only the topmost results returned by the system. This measure is called

precision at n or P@n.

1.3.5.3 F-measure

The F-measure is the weighted harmonic mean [133], [142] of precision and recall.

The traditional F-measure or balanced F-score is defined as:

F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

This is also known as the F1 measure, because recall and precision are evenly weighted.


Two other commonly used F measures are the F2 measure, which weights recall twice as

much as precision, and the F0.5 measure, which weights precision twice as much as recall.

The general formula of the F-measure for non-negative real β is:

F_\beta = \frac{(1+\beta^{2}) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^{2} \cdot \mathrm{precision} + \mathrm{recall}}

Fβ measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as to precision.
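All three measures can be computed directly from the set of retrieved documents and the set of relevant documents, as the following sketch shows; the document identifiers are arbitrary.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Compute precision, recall and the F-beta measure from two sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                     # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

if __name__ == "__main__":
    retrieved = {1, 2, 3, 4}
    relevant = {2, 4, 5}
    p, r, f1 = precision_recall_f(retrieved, relevant)          # balanced F1
    _, _, f2 = precision_recall_f(retrieved, relevant, beta=2)  # recall weighted higher
    print(round(p, 3), round(r, 3), round(f1, 3), round(f2, 3))
```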

Two cluster quality measures [70], used to evaluate how representative the obtained clusters are of the "true" classes, are discussed below:

1.3.5.4 Purity

It is a measure of the extent to which a cluster contains samples of a single class [39], [69]. The purity of the ith cluster can be computed as:

Purity_i = \max_j \; p_{ij}, \qquad p_{ij} = \frac{n_{ij}}{n_i}

where

p_ij is the probability that a member of the ith cluster belongs to the jth class,

n_ij is the total number of documents from the jth class assigned to the ith cluster,

n_i is the total number of documents in the ith cluster.

The total purity can be computed as:

Purity = \sum_{i} \frac{n_i}{n} \, Purity_i

where

n_i is the total number of documents in the ith cluster,

n is the total number of documents in the document pool.


1.3.5.5 Entropy

Entropy [15], [70], [121] measures the degree to which each cluster consists of samples of a single category class. The entropy of each cluster i can be computed as

Entropy_i = -\sum_{j=1}^{Q} p_{ij} \log \left( p_{ij} \right)

where Entropy_i denotes the entropy of the ith cluster, p_ij is the probability that a member of the ith cluster belongs to the jth class, and Q is the total number of category classes.

The total entropy of the cluster set can be computed by weighting the entropy of each cluster by its size:

Entropy = \sum_{i=1}^{k} \frac{n_i}{n} \, Entropy_i

where k denotes the total number of clusters. The lower the total entropy, the better the clustering performance is.
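Given the true class label of each document and a cluster assignment, both measures can be computed as in the sketch below; the labels and the cluster assignment are illustrative.

```python
import math
from collections import Counter

def purity_and_entropy(clusters):
    """clusters: dict mapping cluster id -> list of true class labels of its members.
    Returns (total purity, total entropy), both weighted by cluster size."""
    n = sum(len(members) for members in clusters.values())
    total_purity, total_entropy = 0.0, 0.0
    for members in clusters.values():
        n_i = len(members)
        probs = [count / n_i for count in Counter(members).values()]   # p_ij
        purity_i = max(probs)
        entropy_i = -sum(p * math.log(p) for p in probs)
        total_purity += (n_i / n) * purity_i
        total_entropy += (n_i / n) * entropy_i
    return total_purity, total_entropy

if __name__ == "__main__":
    clusters = {0: ["sport", "sport", "politics"], 1: ["politics", "politics"]}
    purity, entropy = purity_and_entropy(clusters)
    print(round(purity, 3), round(entropy, 3))   # higher purity / lower entropy is better
```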

1.4 Thesis Outline

The thesis is organized chapter-wise as follows:

Chapter 1: This chapter is devoted to an introduction to Data Mining, Text Mining and Information Retrieval. Different techniques, application areas and architectures of Data Mining and Text Mining are discussed in the chapter. The chapter also outlines basic concepts, models and techniques of Information Retrieval, such as the extraction of index terms and retrieval models. At the end of the chapter, different Information Retrieval evaluation techniques and the framework of Information Retrieval are explained.

Chapter 2: In this chapter, related work on document indexing, the hyperlink structure of web pages, clustering and text document summarization is discussed. Based on the literature survey on each topic, the problems and challenges identified in existing tools and techniques are discussed in brief, providing the basis for the work to be carried out.


Chapter 3: This chapter discusses the method for Quick Text Retrieval Algorithm

Supporting Synonyms Based on Fuzzy Logic. Different compression algorithms (like Elias

Gamma code, Elias Delta code, Fibonacci Code) are explained in the chapter and the

concept of Fuzzy Information Retrieval is studied along with Suffix tree clustering.

Chapter 4: This chapter is about Web Page Ranking Based on Text Content of Linked

Pages. In this chapter, different link analysis ranking algorithms (like HITS, pSALSA, SALSA, HubAvg, AThresh, HThresh, FThresh and BFS) are discussed, and the rank scores of Web text pages obtained through these link analysis ranking algorithms are compared to those of the proposed approach.

Chapter 5: This chapter discusses the problem of Automatic generation of initial value K

to apply K-means method for Text Documents Clustering. In the chapter, methods and

limitations of different clustering techniques like K-means, Bisecting K-means, HFTC

(Hierarchical Document clustering using Frequent Itemsets), Hybrid PSO+K-means,

Global K-means are discussed. Later, a new method of text document clustering is

proposed to overcome the limitations of existing clustering methods discussed in the

chapter.

Chapter 6: This chapter is about Document Summarization based on Sentence Ranking

Using Vector Space Model. In this chapter, different summarization tools like Copernic,

SweSum, Extractor, MSWord AutoSummarizer, Intelligent, Brevity, Pertinence are

studied. Based on the analysis of these tools, a new method of summarization is proposed

and the summaries obtained on applying these tools to DUC-2002 dataset are then

compared using ROUGE-1.5.5 toolkit.

Chapter 7: This is the last chapter of the thesis, in which the conclusion and future scope are discussed.
