An Innovative Approach in Text Mining
(1) R.Santhanalakshmi, Research Scholar, Dept. of MCA, Computer Center, Madurai Kamaraj University, Madurai.
(2) Dr. K.Alagarsamy, Associate Professor, Dept. of MCA, Computer Center, Madurai Kamaraj University, Madurai.
Abstract:
Text mining covers classification and predictive modelling; approaches based on bootstrapping techniques re-use a source data set for a specific application, which helps avoid information overload and redundancy. The resulting classification and prediction output is compact compared with the original data source.
Text mining is a common approach used to examine text and data in order to draw conclusions about the structure of, and relationships between, the sets of information contained in the original set, or to approximate some expected values. In this paper we retrieve bovine disease information from the internet using k-means clustering and principal component analysis.
Keywords: Bovine Diseases, K-Means Clustering, Principal Component Analysis.
I. Introduction:
1.1 Bovine Diseases:
Bovine diseases are the common diseases of the cattle sector, with a variety of forms and a large number of symptoms. Here we discuss some forms. BVDV is one of the common causes of infectious abortion. It is also associated with a wide range of diseases, from infertility to pneumonia, diarrhoea and poor growth. BVDV is normally the major viral cause of disease in cattle and belongs to the family of pestiviruses. Other diseases associated with pestiviruses include classical swine fever and border disease in sheep. Although pestiviruses infect cloven-hoofed stock only, BVDV has been found in pigs and sheep. Because BVDV causes such a wide range of disease, it is rarely possible to diagnose it on clinical signs alone. Testing the blood for antibodies and virus is the best method of diagnosis. A paired blood sample for antibodies is useful for pneumonia, diarrhoea and infertility: if the first sample is taken when the animal is ill and the second two to three weeks later, a rise in antibodies suggests that there was an active infection.
BVD is a viral disease of cattle
caused by a pestivirus. It has many
different manifestations in a herd,
depending on the herd’s immune and
reproductive status. Transient diarrhoea,
mixed respiratory infection, infertility or
abortion and mucosal disease are the
most common clinical signs of the
disease and can be seen simultaneously
in a herd. Due to its varied manifestations and subclinical nature in many herds, the significance of the disease was not understood until recently, when diagnostic methods improved.

Bovine herpesvirus 1 (BHV-1) is a virus of the family Herpesviridae that causes several diseases in cattle worldwide, including rhinotracheitis,
R.Santhanalakshmi,Dr.K.Alagarsamy, Int. J. Comp. Tech. Appl., Vol 2 (1), 193-198
193
ISSN: 2229-6093
vaginitis, balanoposthitis, abortion, conjunctivitis and enteritis. BHV-1 is also a contributing factor in shipping fever. Bovine leukemia virus (BLV) is a bovine virus closely related to HTLV-I, a human tumour virus. BLV is a retrovirus that integrates a DNA intermediate as a provirus into the DNA of the B-lymphocytes of blood and milk. It contains an oncogene coding for a protein called Tax.
1.2 K-Means Clustering:
In statistics and machine
learning, k-means clustering [4] is a
method of cluster analysis which aims to
partition n observations into k clusters in
which each observation belongs to the
cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data, and both employ an iterative refinement approach.
Procedure:
- As a first step in the cluster analysis, the user decides the number of clusters 'k'. This parameter can take any integer value with a lower bound of 1 and an upper bound equal to the total number of samples.
- The algorithm is initiated by creating 'k' different clusters; the given sample set is first randomly distributed among these 'k' clusters.
- Next, the distance between each sample within a given cluster and its cluster centroid is calculated.
- Each sample is then moved to the cluster whose centroid is at the shortest distance from it.
The K-Means algorithm is repeated a
number of times to obtain an optimal
clustering solution, every time starting
with a random set of initial clusters.
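The procedure above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the sample points are invented, and for determinism the initial means are taken as the first k samples rather than at random.

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm: assign each sample to the nearest mean, then
    recompute every mean, until no mean changes."""
    # Initial guesses for the means (first k samples, for determinism;
    # a random choice is used in practice)
    means = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest mean by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, means[i])))
            clusters[nearest].append(p)
        # Update step: replace each mean with its cluster's centroid
        new_means = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]
        if new_means == means:        # converged: no mean changed
            break
        means = new_means
    return means, clusters

# Invented 2-D sample set with two obvious groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
means, clusters = kmeans(points, k=2)
```

With well-separated groups such as these, the iteration settles on the two natural centroids after a few passes.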
1.3 Principal Component Analysis:
The main basis of PCA-based
dimension reduction is that PCA picks
up the dimensions with the largest
variances. Mathematically, this is
equivalent to finding the best low rank
approximation of the data via the
singular value decomposition. However,
this noise reduction property alone is
inadequate to explain the effectiveness
of PCA.
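The link between PCA and the best low-rank approximation via the SVD can be illustrated with NumPy; the data matrix below is invented, with most of its variance along a single direction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented example: 100 samples in 3-D whose variance lies mostly
# along the direction (3, 2, 0.5), plus a little noise
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 2.0, 0.5]]) \
    + 0.05 * rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)                  # centre: PCA assumes zero-mean data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X1 = (U[:, :1] * s[:1]) @ Vt[:1, :]      # best rank-1 approximation of Xc

# Fraction of total variance captured by the first principal axis
explained = s[0] ** 2 / np.sum(s ** 2)
```

Because the first singular value dominates, the rank-1 reconstruction reproduces the centred data almost exactly, which is precisely the "largest variance" property the text describes.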
PCA is a basic method of social network mining with applications to ranking and clustering, and it can further be deployed in marketing, in user segmentation, by selecting communities with desired or undesired properties. In particular, the friends list of a blog can be used for social filtering, that is, reading the posts that friends write or have recently read.
Principal Component Analysis is similar to the HITS ranking algorithm; in fact, the hub and authority rankings are defined by the first left and right singular vectors, and the use of higher dimensions has already been suggested and analyzed in detail. Several authors use HITS for measuring authority in mailing lists or blogs, the latter work observing a strong correlation between HITS score and degree, indicating that the first principal axis will contain no high-level information but will simply order by number of friends. We demonstrate that HITS-style ranking can be used, but with special care, due to the Tightly Knit Community (TKC) effect, which results in communities that are small on a global level grabbing the first principal axes. The TKC problem in the HITS algorithm was identified earlier, but the proposed algorithmic solution turns out to merely compute in- and out-degrees. In contrast, we keep PCA as the underlying matrix method and filter the relevant high-level structural information by removing tightly knit communities.
II. Proposed Method (SAN Method):
In our method we combine k-means clustering and principal component analysis for effective clustering and an optimized solution. When searching for information on the internet, we must obtain the information we actually require; otherwise the search becomes null and void. Every clustering method has its own strategy and importance. We cannot say that a single clustering mechanism is enough for every kind of search, nor can we ensure that every clustering method provides the same result for the same key term. For this reason we combined the two clustering techniques, giving a new and innovative way to optimize searching over a large database, the internet, and so on. Both techniques are related to clustering: K-means clustering groups the source data into certain groups, called clusters, based on a distance measure, while principal component analysis focuses on dimension reduction based on mathematical models.
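A minimal sketch of the combined idea, with NumPy standing in for the actual system: cluster the data with a plain Lloyd-style K-means, then hand the data to a PCA-based dimension reduction. The toy feature matrix and the seeding of the initial means are invented for illustration; this is not the paper's implementation.

```python
import numpy as np

def kmeans(X, init_means, iters=50):
    """Plain Lloyd iteration starting from the given initial means."""
    means = init_means.copy()
    for _ in range(iters):
        # Distance from every row of X to every current mean
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else means[i] for i in range(len(means))])
    return labels, means

def pca_reduce(X, n_components):
    """Project the zero-centred rows of X onto the top principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Invented toy feature matrix standing in for crawled document-term vectors:
# two well-separated groups of 20 rows each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 5)),
               rng.normal(3.0, 0.3, (20, 5))])

labels, means = kmeans(X, init_means=X[[0, -1]])  # seed with two far-apart rows
reduced = pca_reduce(X, n_components=2)           # hand the data to PCA
```

K-means supplies the grouping and PCA supplies the dimension reduction; the combined output is a compact, clustered representation of the source data.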
Our domain information relates to bovine diseases, a very specific domain, rather than a search over all domains. Even with a specific domain, we must search throughout the internet when online, or otherwise through a large database. In our earlier work we used a modified HITS algorithm for searching, and in another we used a stemming algorithm with hierarchical clustering. Here we combine K-means and principal component analysis and evaluate the results. Our research ends with a comparison of all these approaches to determine which technique works best for this task.
A bovine-disease keyword is given as the search element, and using that keyword we first form the initial clusters. For example, suppose we have n sample feature vectors bv1, bv2, ..., bvn, all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. To calculate the distances involved we use the Euclidean distance formula, a standard and simple way to find the distance between two elements. If the clusters are well separated, we can use a minimum-distance classifier to separate them: we can say that x is in cluster i if the distance x - mi is the minimum of all the k distances. This suggests the following procedure for finding the k means:
Make initial guesses for the means m1, m2, ..., mk.
Until there are no changes in any mean:
    o Use the estimated means to classify the samples into clusters.
    o For i from 1 to k:
        Replace mi with the mean of all of the samples for cluster i.
    o end_for
end_until
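The minimum-distance rule above (x belongs to cluster i when the Euclidean distance to mi is smallest) can be sketched as follows; the cluster means and query points are invented:

```python
import math

def euclidean(p, q):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_distance_classify(x, means):
    """Assign x to the cluster i whose mean m_i is nearest to x."""
    return min(range(len(means)), key=lambda i: euclidean(x, means[i]))

# Invented cluster means and a query point
means = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
print(min_distance_classify((4.2, 4.8), means))   # nearest mean is (5.0, 5.0): prints 1
```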
In addition, to improve the K-means algorithm while forming the clusters, we include stemming: the brute-force approach and suffix stripping. Brute-force stemmers maintain a lookup table containing relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection; if one is found, the word is replaced by its associated root form. Suffix-stripping algorithms do not use a lookup table; instead, a typically smaller list of rules is stored, which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include: Rule 1) if the word ends in 'ed', remove the 'ed'; Rule 2) if the word ends in 'ing', remove the 'ing'; Rule 3) if the word ends in 'ly', remove the 'ly'. In this way some groups of clusters are formed at the final stage, but we cannot assume these are the final optimized result, so we analyze the clusters further by passing them to principal component analysis.
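The two stemming approaches can be sketched as follows; the lookup table and the rule list (Rules 1-3 above) are tiny invented examples, not a complete stemmer:

```python
# Brute-force stemming: a lookup table mapping inflected forms to roots
LOOKUP = {"diseases": "disease", "infected": "infect", "symptoms": "symptom"}

# Suffix stripping: ordered rules as in Rules 1-3 above
SUFFIX_RULES = ["ed", "ing", "ly"]

def stem(word):
    """Try the lookup table first; fall back to suffix stripping."""
    if word in LOOKUP:
        return LOOKUP[word]
    for suffix in SUFFIX_RULES:
        # Only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(stem("infected"))   # found in the lookup table -> "infect"
print(stem("testing"))    # suffix rule strips "ing" -> "test"
print(stem("quickly"))    # suffix rule strips "ly" -> "quick"
```

The minimum-stem-length guard illustrates why suffix stripping needs care: without it, short words such as "red" would be mangled by Rule 1.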
Principal component analysis is a
mathematical procedure that uses an
orthogonal transformation to convert a
set of observations of possibly correlated
variables into a set of values of
uncorrelated variables called principal
components.
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on.
Define a data matrix X^T with zero sample mean, where each of the n rows represents a different repetition of the experiment, and each of the m columns gives a particular kind of datum. The singular value decomposition of X is X = W Σ V^T, where the m × m matrix W is the matrix of eigenvectors of X X^T, the matrix Σ is an m × n rectangular diagonal matrix with nonnegative real numbers on the diagonal, and the matrix V is n × n.
The PCA transformation that preserves dimensionality is then given by:

    Y^T = X^T W = V Σ^T
V is not uniquely defined in the usual case when m < n − 1, but Y will usually still be uniquely defined. Since W is an orthogonal matrix, each row of Y^T is simply a rotation of the corresponding row of X^T. The first column of Y^T is made up of the scores of the cases with respect to the first principal component; the next column has the scores with respect to the second principal component, and so on.
If we want a reduced-dimensionality representation, we can project X down into the reduced space defined by only the first L singular vectors, W_L:

    Y_L^T = X^T W_L = V_L Σ_L
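Under the conventions above (X = W Σ V^T, with W the eigenvectors of X X^T), the reduced scores Y_L^T = X^T W_L can be checked with NumPy on an invented data matrix:

```python
import numpy as np

# Invented data matrix X (m variables x n samples) with zero sample mean,
# following the paper's convention X = W Sigma V^T
rng = np.random.default_rng(0)
m, n, L = 3, 50, 2
X = rng.normal(size=(m, n))
X = X - X.mean(axis=1, keepdims=True)       # zero mean across samples

W, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = W diag(s) V^T
Y_L_T = X.T @ W[:, :L]    # n x L score matrix: Y_L^T = X^T W_L
```

Each column of the score matrix is a singular value times the corresponding right singular vector, which is the V_L Σ_L form of the reduced representation.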
The matrix W of singular vectors of X is equivalently the matrix W of eigenvectors of the matrix of observed covariances C = X X^T:

    X X^T = W Σ Σ^T W^T
Given a set of points in Euclidean space,
the first principal component
corresponds to a line that passes through
the multidimensional mean and
minimizes the sum of squares of the
distances of the points from the line. The
second principal component corresponds
to the same concept after all correlation
with the first principal component has
been subtracted out from the points. The
singular values (in Σ) are the square
roots of the eigenvalues of the matrix
XX^T. Each eigenvalue is proportional to the portion of the variance that is associated with its eigenvector. The
sum of all the eigenvalues is equal to the
sum of the squared distances of the
points from their multidimensional
mean. PCA essentially rotates the set of
points around their mean in order to
align with the principal components.
This moves as much of the variance as
possible into the first few dimensions.
The values in the remaining dimensions,
therefore, tend to be small and may be
dropped with minimal loss of
information. Finally we will get the
reduced cluster as the output of our
query.
III. Result Analysis:
Simulation is carried out in MATLAB. As an example, take the query: Symptoms of Bovine leukemia. First we see the outcome of K-means clustering in Fig. 1 and Fig. 2.

[Fig. 1 and Fig. 2: K-means clustering output; recoverable terms from the figures include Feeding, Gouge, Dehorning, Rhinotracheitis, Ataxia, Palpation, Provirus, Lymphocytes, Mononucleosis, B-cell leukemia, Colostrum and BLV.]

The K-means cluster output is given to the principal component analysis, which generates the variance matrix and reduces it in further steps; finally we obtain the result of the query. The comparison analysis gives the performance evaluation of the combined approach against each method alone:

Sample size | SAN method | K-means | PCA
2750        | 0.91       | 0.87    | 0.82
4550        | 0.81       | 0.75    | 0.69
7700        | 0.84       | 0.65    | 0.62
10100       | 0.89       | 0.73    | 0.65
As the result analysis depicts, the SAN method's performance is higher than that of the other methods. As the data set size increases, the performance ratio decreases for K-means and principal component analysis alone.
IV. Conclusion:
In this paper we provided an effective method for information retrieval on bovine diseases. The SAN method gives an optimal solution compared with principal component analysis and K-means alone. In our earlier work we focused on enhancing Medline & PubMed search using a modified HITS algorithm, and we also experimented with stemming algorithms. We conclude that, among all these methods, the SAN method gave the most effective solution for bovine disease searching.
V. References:
[1] Lada A. Adamic and Natalie Glance.
The political blogosphere and the 2004
u.s. election: divided they blog. In
LinkKDD ’05: Proceedings of the 3rd
international workshop on Link
discovery, pages 36–43, New York, NY,
USA, 2005. ACM.
[2] Pedro Domingos and Matt
Richardson. Mining the network value of
customers. In KDD ’01: Proceedings of
the seventh ACM SIGKDD international
conference on Knowledge discovery and
data mining, pages 57–66, New York,
NY, USA, 2001. ACM.
[3] Lars Backstrom, Dan Huttenlocher,
Jon Kleinberg, and Xiangyang Lan.
Group formation in large social
networks: membership, growth, and
evolution. In KDD ’06:
Proceedings of the 12th ACM SIGKDD
international conference on Knowledge
discovery and data mining, pages 44–54,
New York, NY, USA, 2006. ACM
Press.
[4] D Cheng, R Kannan, S Vempala, and
G Wang. On a recursive spectral
algorithm for clustering from pairwise
similarities. Technical report, MIT LCS
Technical Report MIT-LCS-TR-906,
2003.
[5] Matthew Hurst, Matthew Siegler,
and Natalie Glance. On estimating the
geographic distribution of social media.
In Proceedings Int. Conf. on Weblogs
and Social Media (ICWSM-2007), 2007.
[6] M. Newman. Detecting community
structure in networks. The European
Physical Journal B - Condensed Matter,
38(2):321–330, March 2004.
[7] Jun Zhang, Mark S. Ackerman, and
Lada Adamic. Expertise networks in
online communities: structure and
algorithms. In WWW ’07: Proceedings
of the 16th international conference on
World Wide Web, pages 221–230, New
York, NY, USA, 2007. ACM Press.