Information Organization & Retrieval | Automatic Classification Schemes
-
Upload
passy-hearst -
Category
Documents
-
view
213 -
download
0
description
Transcript of Information Organization & Retrieval | Automatic Classification Schemes
Pasquale J. Festa INF384C – Organizing and Providing Access to Information Prof. Efron 4.5.2007
Automatic Classification Assignment
1. Rule: Classify all articles with 4 or more Authors as medical.
When looking at the histogram it is clear to see that medical and non-medical articles
start to mix in the 4 to 5 authors range. While my rule will allow a number of non-medical articles
to creep into the medical class, this is necessary due to the fact that one can not have fractions of
an author (despite the fact that the histogram makes it seem as if you can, though this is due to the
fact that the numbers we see are [1] made up and [2] most likely averages).
When approximating where documents lie in the histogram, it appears that a small
proportion of articles that are medical lie just below the 5 author mark. If I were to make 5 my cut
off point, I would then be misclassifying medical documents as non-medical documents. By
making my cut off point 4, I am, instead, misclassifying a number of non-medical documents as
medical documents. As my job is to index medical documents, I feel that the lesser of the two
evils would be to misclassify non-medical documents as medical documents as having non-
applicable information would do less harm to my company than missing much needed medical
information may.
2. Medical documents: Classified correctly 100% (100 out of 100)
Non-Medical documents: Classified correctly 83% (83 out of 100)
Overall Percentage of correct classification: 91.5% (183 out of 200)
3. If we assume that the distribution of data we have is an unbiased estimator of unforeseen
data then we well be assuming that every set of 200 new documents will follow this same pattern.
However, as we see there is a section where documents overlap (the 4 to 6 author range), our
model runs the possibility of adding too many non-medical documents to the medical class. In the
next group of 200, 17 more would be misclassified and now we would have 34 non-medical
documents in our medical class. With 34 of the 234 documents in the medical class being (in
actuality) non-medical documents we start to find ourselves running into trouble. Despite our
model being completely accurate in terms of classifying medical documents, we find it adding
unnecessary information to our document set. Our new class would end up containing 34 non-
medical documents which would make up 14.5% of the total number of documents put in the
medical class.
4.
I positioned my line such that the entire group of non-medical articles would be
completely separated from the entire group of medical articles. By bisecting the data this way, I
created a firm boundary between the two classes so that no one document from one group falls
into the other group’s territory. By classifying the data this way we have 100% accuracy in terms
of classification with our sample set.
5. Percentage of data correctly classified: 100%
6. In this particular case adding more dimensions to the data set made our classification
problem easier to solve and allowed us a greater degree of accuracy. However, this is not always
the case. First and foremost, when adding a dimension it is imperative that the dimension be
pertinent to classification. If the dimension one adds does not allow us any way of significantly
differentiating between the two classes it will do us no good. For example, if 50% of 100 men
have blue eyes and 50% have brown eyes and 50% of 100 women have blue eyes and 50% have
brown eyes, the dimension of eye color would not do much to help us differentiate between men
and women as the two would completely overlap and there would be no substantial split.
In addition, while adding more dimensions is helpful in creating classification models it
does create the added burden of data collection. As stated in class, for each column we create in a
data table we will need to at least double or triple the number of rows we have. To build a sample
size that would give us a decent level of accuracy in regards to out statistical analysis, it would be
necessary to collect a vast amount of data. Sometimes, however, adding this burden of data
collection can kill any attempts of creating a model before it begins as the data amount could be
too large for efficient computation or the needed quantity of data may just not be able to be found.