An Innovative Approach in Text Mining
(1) R.Santhanalakshmi, Research Scholar, Dept. of MCA, Computer Center, Madurai Kamaraj University, Madurai.
(2) Dr. K.Alagarsamy, Associate Professor, Dept. of MCA, Computer Center, Madurai Kamaraj University, Madurai.
Abstract:
Text mining covers classification and predictive modelling; approaches based on bootstrapping techniques re-use a source data set for a specific application, which helps avoid information overload and redundancy. The resulting classification and prediction output is compact compared with the original data source.
Text mining is a common approach used to examine text and data in order to draw conclusions about the structure of, and relationships between, the sets of information contained in the original set, or to approximate some expected values. In this paper we retrieve bovine disease information from the internet using k-means clustering and principal component analysis.
Keywords: Bovine Diseases, K-Means Clustering, Principal Component Analysis.
I. Introduction:
1.1 Bovine Diseases:
Bovine diseases are the common diseases of the cattle sector, with a variety of forms and a large number of symptoms. Here we discuss some forms. BVDV is one of the common causes of infectious abortion. It is also associated with a wide range of diseases, from infertility to pneumonia, diarrhoea and poor growth. BVDV is normally the major viral cause of disease in cattle and belongs to the family of pestiviruses. Other diseases associated with pestiviruses include classical swine fever and border disease in sheep. Although pestiviruses infect cloven-hoofed stock only, BVDV has been found in pigs and sheep. Because BVDV causes such a wide range of disease, it is rarely possible to diagnose it on clinical signs alone. Testing the blood for antibodies and virus is the best method of diagnosis. A paired blood sample for antibodies is useful for pneumonia, diarrhoea and infertility: if the first sample is taken when the animal is ill and the second two to three weeks later, a rise in antibodies suggests that there was an active infection.
BVD is a viral disease of cattle
caused by a pestivirus. It has many
different manifestations in a herd,
depending on the herd’s immune and
reproductive status. Transient diarrhoea,
mixed respiratory infection, infertility or
abortion and mucosal disease are the
most common clinical signs of the
disease and can be seen simultaneously
in a herd. Due to its varied manifestations and subclinical nature in many herds, the significance of the disease was not understood until recently, when diagnostic methods improved.

Bovine herpesvirus 1 (BHV-1) is a virus of the family Herpesviridae that causes several diseases in cattle worldwide, including rhinotracheitis,
R.Santhanalakshmi,Dr.K.Alagarsamy, Int. J. Comp. Tech. Appl., Vol 2 (1), 193-198
193
ISSN: 2229-6093
vaginitis, balanoposthitis, abortion, conjunctivitis and enteritis. BHV-1 is also a contributing factor in shipping fever. Bovine leukemia virus (BLV) is a bovine virus closely related to HTLV-I, a human tumour virus. BLV is a retrovirus that integrates a DNA intermediate as a provirus into the DNA of the B-lymphocytes of blood and milk. It contains an oncogene coding for a protein called Tax.
1.2 K-Means Clustering:
In statistics and machine
learning, k-means clustering [4] is a
method of cluster analysis which aims to
partition n observations into k clusters in
which each observation belongs to the
cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data, and both employ an iterative refinement approach.
Procedure:
- As a first step in the cluster analysis, the user decides the number of clusters 'k'. This parameter can take any integer value with a lower bound of 1 and an upper bound equal to the total number of samples.
- The algorithm is initiated by creating 'k' different clusters; the given sample set is first randomly distributed among these 'k' clusters.
- Next, the distance between each sample within a given cluster and its cluster centroid is calculated.
- Each sample is then moved to the cluster whose centroid is at the shortest distance from it.
The K-Means algorithm is repeated a
number of times to obtain an optimal
clustering solution, every time starting
with a random set of initial clusters.
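The procedure above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the sample points are invented, and for determinism the initial means are taken as the first k samples rather than at random.

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm: assign each sample to the nearest mean, then
    recompute every mean, until no mean changes."""
    # Initial guesses for the means (first k samples, for determinism;
    # a random choice is used in practice)
    means = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest mean by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, means[i])))
            clusters[nearest].append(p)
        # Update step: replace each mean with its cluster's centroid
        new_means = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]
        if new_means == means:        # converged: no mean changed
            break
        means = new_means
    return means, clusters

# Invented 2-D sample set with two obvious groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
means, clusters = kmeans(points, k=2)
```

With well-separated groups such as these, the iteration settles on the two natural centroids after a few passes.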
1.3 Principal Component Analysis:
The main basis of PCA-based
dimension reduction is that PCA picks
up the dimensions with the largest
variances. Mathematically, this is
equivalent to finding the best low rank
approximation of the data via the
singular value decomposition. However,
this noise reduction property alone is
inadequate to explain the effectiveness
of PCA.
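The link between PCA and the best low-rank approximation via the SVD can be illustrated with NumPy; the data matrix below is invented, with most of its variance along a single direction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented example: 100 samples in 3-D whose variance lies mostly
# along the direction (3, 2, 0.5), plus a little noise
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 2.0, 0.5]]) \
    + 0.05 * rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)                  # centre: PCA assumes zero-mean data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X1 = (U[:, :1] * s[:1]) @ Vt[:1, :]      # best rank-1 approximation of Xc

# Fraction of total variance captured by the first principal axis
explained = s[0] ** 2 / np.sum(s ** 2)
```

Because the first singular value dominates, the rank-1 reconstruction reproduces the centred data almost exactly, which is precisely the "largest variance" property the text describes.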
PCA is a basic method of social network mining with applications to ranking and clustering, and it can further be deployed in marketing, in user segmentation, by selecting communities with desired or undesired properties. In particular, the friends list of a blog can be used for social filtering, that is, reading the posts that friends write or have recently read.
Principal Component Analysis is similar to the HITS ranking algorithm; in fact, the hub and authority rankings are defined by the first left and right singular vectors, and the use of higher dimensions has already been suggested and analyzed in detail. Several authors use HITS for measuring authority in mailing lists or blogs, the latter work observing a strong correlation between HITS score and degree, indicating that the first principal axis will contain no high-level information but will simply order by number of friends. We demonstrate that HITS-style ranking can be used, but with special care, due to the Tightly Knit Community (TKC) effect, which results in communities that are small on a global level grabbing the first principal axes. The TKC problem in the HITS algorithm was identified earlier, but the proposed algorithmic solution turns out to merely compute in- and out-degrees. In contrast, we keep PCA as the underlying matrix method and filter the relevant high-level structural information by removing tightly knit communities.
II. Proposed Method (SAN Method):
In our method we combine k-means clustering and principal component analysis for effective clustering and an optimized solution. When searching for information on the internet, we must obtain the information we actually require; otherwise the search becomes null and void. Every clustering method has its own strategy and importance. We cannot say that a single clustering mechanism is enough for every kind of search, nor can we ensure that every clustering method provides the same result for the same key term. For this reason we combined the two clustering techniques, giving a new and innovative way to optimize searching over a large database, the internet, and so on. Both techniques are related to clustering: K-means clustering groups the source data into certain groups, called clusters, based on a distance measure, while principal component analysis focuses on dimension reduction based on mathematical models.
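A minimal sketch of the combined idea, with NumPy standing in for the actual system: cluster the data with a plain Lloyd-style K-means, then hand the data to a PCA-based dimension reduction. The toy feature matrix and the seeding of the initial means are invented for illustration; this is not the paper's implementation.

```python
import numpy as np

def kmeans(X, init_means, iters=50):
    """Plain Lloyd iteration starting from the given initial means."""
    means = init_means.copy()
    for _ in range(iters):
        # Distance from every row of X to every current mean
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else means[i] for i in range(len(means))])
    return labels, means

def pca_reduce(X, n_components):
    """Project the zero-centred rows of X onto the top principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Invented toy feature matrix standing in for crawled document-term vectors:
# two well-separated groups of 20 rows each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 5)),
               rng.normal(3.0, 0.3, (20, 5))])

labels, means = kmeans(X, init_means=X[[0, -1]])  # seed with two far-apart rows
reduced = pca_reduce(X, n_components=2)           # hand the data to PCA
```

K-means supplies the grouping and PCA supplies the dimension reduction; the combined output is a compact, clustered representation of the source data.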
Our domain information relates to bovine diseases, a very specific domain, rather than a search over all domains. Even with a specific domain, we must search throughout the internet when online, or otherwise through a large database. In our earlier work we used a modified HITS algorithm for searching, and in another we used a stemming algorithm with hierarchical clustering. Here we combine K-means and principal component analysis and evaluate the results. Our research ends with a comparison of all these approaches to determine which technique works best for this task.
A bovine-disease keyword is given as the search element, and using that keyword we first form the initial clusters. For example, suppose we have n sample feature vectors bv1, bv2, ..., bvn, all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. To calculate the distances involved we use the Euclidean distance formula, a standard and simple way to find the distance between two elements. If the clusters are well separated, we can use a minimum-distance classifier to separate them: we can say that x is in cluster i if the distance x - mi is the minimum of all the k distances. This suggests the following procedure for finding the k means:
Make initial guesses for the means m1, m2, ..., mk.
Until there are no changes in any mean:
    o Use the estimated means to classify the samples into clusters.
    o For i from 1 to k:
        Replace mi with the mean of all of the samples for cluster i.
    o end_for
end_until
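The minimum-distance rule above (x belongs to cluster i when the Euclidean distance to mi is smallest) can be sketched as follows; the cluster means and query points are invented:

```python
import math

def euclidean(p, q):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_distance_classify(x, means):
    """Assign x to the cluster i whose mean m_i is nearest to x."""
    return min(range(len(means)), key=lambda i: euclidean(x, means[i]))

# Invented cluster means and a query point
means = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
print(min_distance_classify((4.2, 4.8), means))   # nearest mean is (5.0, 5.0): prints 1
```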
In addition, to improve the K-means algorithm while forming the clusters, we include stemming: the brute-force approach and suffix stripping. Brute-force stemmers maintain a lookup table containing relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection; if one is found, the word is replaced by its associated root form. Suffix-stripping algorithms do not use a lookup table; instead, a typically smaller list of rules is stored, which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include: Rule 1) if the word ends in 'ed', remove the 'ed'; Rule 2) if the word ends in 'ing', remove the 'ing'; Rule 3) if the word ends in 'ly', remove the 'ly'. In this way some groups of clusters are formed at the final stage, but we cannot assume these are the final optimized result, so we analyze the clusters further by passing them to principal component analysis.
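The two stemming approaches can be sketched as follows; the lookup table and the rule list (Rules 1-3 above) are tiny invented examples, not a complete stemmer:

```python
# Brute-force stemming: a lookup table mapping inflected forms to roots
LOOKUP = {"diseases": "disease", "infected": "infect", "symptoms": "symptom"}

# Suffix stripping: ordered rules as in Rules 1-3 above
SUFFIX_RULES = ["ed", "ing", "ly"]

def stem(word):
    """Try the lookup table first; fall back to suffix stripping."""
    if word in LOOKUP:
        return LOOKUP[word]
    for suffix in SUFFIX_RULES:
        # Only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(stem("infected"))   # found in the lookup table -> "infect"
print(stem("testing"))    # suffix rule strips "ing" -> "test"
print(stem("quickly"))    # suffix rule strips "ly" -> "quick"
```

The minimum-stem-length guard illustrates why suffix stripping needs care: without it, short words such as "red" would be mangled by Rule 1.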
Principal component analysis is a
mathematical procedure that uses an
orthogonal transformation to convert a
set of observations of possibly correlated
variables into a set of values of
uncorrelated variables called principal
components.
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on.
Define a data matrix X^T with zero sample mean, where each of the n rows represents a different repetition of the experiment, and each of the m columns gives a particular kind of datum. The singular value decomposition of X is X = W Σ V^T, where the m × m matrix W is the matrix of eigenvectors of X X^T, the matrix Σ is an m × n rectangular diagonal matrix with nonnegative real numbers on the diagonal, and the matrix V is n × n.
The PCA transformation that preserves dimensionality is then given by:

    Y^T = X^T W = V Σ^T
V is not uniquely defined in the usual case when m < n − 1, but Y will usually still be uniquely defined. Since W is an orthogonal matrix, each row of Y^T is simply a rotation of the corresponding row of X^T. The first column of Y^T is made up of the scores of the cases with respect to the first principal component; the next column has the scores with respect to the second principal component, and so on.
If we want a reduced-dimensionality representation, we can project X down into the reduced space defined by only the first L singular vectors, W_L:

    Y_L^T = X^T W_L = V_L Σ_L
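Under the conventions above (X = W Σ V^T, with W the eigenvectors of X X^T), the reduced scores Y_L^T = X^T W_L can be checked with NumPy on an invented data matrix:

```python
import numpy as np

# Invented data matrix X (m variables x n samples) with zero sample mean,
# following the paper's convention X = W Sigma V^T
rng = np.random.default_rng(0)
m, n, L = 3, 50, 2
X = rng.normal(size=(m, n))
X = X - X.mean(axis=1, keepdims=True)       # zero mean across samples

W, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = W diag(s) V^T
Y_L_T = X.T @ W[:, :L]    # n x L score matrix: Y_L^T = X^T W_L
```

Each column of the score matrix is a singular value times the corresponding right singular vector, which is the V_L Σ_L form of the reduced representation.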
The matrix W of singular vectors of X is equivalently the matrix W of eigenvectors of the matrix of observed covariances C = X X^T:

    X X^T = W Σ Σ^T W^T
Given a set of points in Euclidean space,
the first principal component
corresponds to a line that passes through
the multidimensional mean and
minimizes the sum of squares of the
distances of the points from the line. The
second principal component corresponds
to the same concept after all correlation
with the first principal component has
been subtracted out from the points. The
singular values (in Σ) are the square
roots of the eigenvalues of the matrix
XX^T. Each eigenvalue is proportional to the portion of the variance that is associated with its eigenvector. The
sum of all the eigenvalues is equal to the
sum of the squared distances of the
points from their multidimensional
mean. PCA essentially rotates the set of
points around their mean in order to
align with the principal components.
This moves as much of the variance as
possible into the first few dimensions.
The values in the remaining dimensions,
therefore, tend to be small and may be
dropped with minimal loss of
information. Finally we will get the
reduced cluster as the output of our
query.
III. Result Analysis:
Simulation is carried out in MATLAB. As an example, take the query: Symptoms of Bovine leukemia. First we see the outcome of K-means clustering in Fig. 1 and Fig. 2.

[Fig. 1 and Fig. 2: K-means clustering output; recoverable terms from the figures include Feeding, Gouge, Dehorning, Rhinotracheitis, Ataxia, Palpation, Provirus, Lymphocytes, Mononucleosis, B-cell leukemia, Colostrum and BLV.]

The K-means cluster output is given to the principal component analysis, which generates the variance matrix and reduces it in further steps; finally we obtain the result of the query. The comparison analysis gives the performance evaluation of the combined approach against each method alone:

Sample size | SAN method | K-means | PCA
2750        | 0.91       | 0.87    | 0.82
4550        | 0.81       | 0.75    | 0.69
7700        | 0.84       | 0.65    | 0.62
10100       | 0.89       | 0.73    | 0.65
As the result analysis depicts, the SAN method's performance is higher than that of the other methods. As the data set size increases, the performance ratio decreases for K-means and principal component analysis alone.
IV. Conclusion:
In this paper we provided an effective method for information retrieval on bovine diseases. The SAN method gives an optimal solution compared with principal component analysis and K-means alone. In our earlier work we focused on enhancing Medline & PubMed search using a modified HITS algorithm, and we also experimented with stemming algorithms. We conclude that, among all these methods, the SAN method gave the most effective solution for bovine disease searching.
V. References:
[1] Lada A. Adamic and Natalie Glance.
The political blogosphere and the 2004
u.s. election: divided they blog. In
LinkKDD ’05: Proceedings of the 3rd
international workshop on Link
discovery, pages 36–43, New York, NY,
USA, 2005. ACM.
[2] Pedro Domingos and Matt
Richardson. Mining the network value of
customers. In KDD ’01: Proceedings of
the seventh ACM SIGKDD international
conference on Knowledge discovery and
data mining, pages 57–66, New York,
NY, USA, 2001. ACM.
[3] Lars Backstrom, Dan Huttenlocher,
Jon Kleinberg, and Xiangyang Lan.
Group formation in large social
networks: membership, growth, and
evolution. In KDD ’06:
Proceedings of the 12th ACM SIGKDD
international conference on Knowledge
discovery and data mining, pages 44–54,
New York, NY, USA, 2006. ACM
Press.
[4] D Cheng, R Kannan, S Vempala, and
G Wang. On a recursive spectral
algorithm for clustering from pairwise
similarities. Technical report, MIT LCS
Technical Report MIT-LCS-TR-906,
2003.
[5] Matthew Hurst, Matthew Siegler,
and Natalie Glance. On estimating the
geographic distribution of social media.
In Proceedings Int. Conf. on Weblogs
and Social Media (ICWSM-2007), 2007.
[6] M. Newman. Detecting community
structure in networks. The European
Physical Journal B - Condensed Matter,
38(2):321–330, March 2004.
[7] Jun Zhang, Mark S. Ackerman, and
Lada Adamic. Expertise networks in
online communities: structure and
algorithms. In WWW ’07: Proceedings
of the 16th international conference on
World Wide Web, pages 221–230, New
York, NY, USA, 2007. ACM Press.