Information Organization & Retrieval | Automatic Classification Schemes

Pasquale J. Festa INF384C – Organizing and Providing Access to Information Prof. Efron 4.5.2007

Automatic Classification Assignment

1. Rule: Classify all articles with 4 or more Authors as medical.

Looking at the histogram, it is clear that medical and non-medical articles start to mix in the 4 to 5 author range. While my rule will allow a number of non-medical articles to creep into the medical class, this is necessary because one cannot have a fraction of an author (the histogram makes it seem as if one can, but that is because the numbers we see are [1] made up and [2] most likely averages).

When approximating where documents lie in the histogram, it appears that a small proportion of medical articles lie just below the 5 author mark. If I were to make 5 my cutoff point, I would be misclassifying those medical documents as non-medical. By making my cutoff point 4, I am instead misclassifying a number of non-medical documents as medical. As my job is to index medical documents, I feel that the lesser of the two evils is to misclassify non-medical documents as medical, since carrying non-applicable information would do less harm to my company than missing much-needed medical information.
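
As a minimal sketch, the rule above could be written as a one-line threshold test. The cutoff of 4 comes from rule 1; the function name, the constant name, and the sample author counts are hypothetical and are not taken from the assignment's data.

```python
# Cutoff taken from rule 1: 4 or more authors means "medical".
AUTHOR_CUTOFF = 4


def classify(num_authors: int) -> str:
    """Label an article by its author count (hypothetical helper)."""
    return "medical" if num_authors >= AUTHOR_CUTOFF else "non-medical"


# Made-up author counts, used only to show the rule in action.
for n in (1, 3, 4, 5, 8):
    print(n, "authors ->", classify(n))
```

Lowering the cutoff trades missed medical articles for extra non-medical ones, which is the trade-off argued for above.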

2. Medical documents: Classified correctly 100% (100 out of 100)

Non-Medical documents: Classified correctly 83% (83 out of 100)

Overall Percentage of correct classification: 91.5% (183 out of 200)
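
The percentages above can be checked with a few lines of arithmetic. This is only a sketch; the variable and function names are my own, while the counts are the ones reported in item 2.

```python
# Counts reported in item 2.
medical_correct, medical_total = 100, 100
non_medical_correct, non_medical_total = 83, 100


def pct(correct: int, total: int) -> float:
    """Percentage of correctly classified documents."""
    return 100.0 * correct / total


medical_pct = pct(medical_correct, medical_total)
non_medical_pct = pct(non_medical_correct, non_medical_total)
overall_pct = pct(medical_correct + non_medical_correct,
                  medical_total + non_medical_total)

print(medical_pct, non_medical_pct, overall_pct)
```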

3. If we assume that the distribution of data we have is an unbiased estimator of unforeseen data, then we will be assuming that every set of 200 new documents will follow this same pattern. However, because there is a section where documents overlap (the 4 to 6 author range), our model runs the risk of adding too many non-medical documents to the medical class. In the next group of 200, 17 more would be misclassified, and we would then have 34 non-medical documents in our medical class. With 34 of the 234 documents in the medical class actually being non-medical, we start to run into trouble. Despite our model being completely accurate in classifying medical documents, it adds unnecessary information to our document set: those 34 non-medical documents would make up 14.5% of all documents placed in the medical class.
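
The projection above can be reproduced with a short calculation. This sketch assumes, as the paragraph does, that every batch of 200 documents repeats the sample exactly (100 medical documents plus 17 misclassified non-medical ones); the function and parameter names are hypothetical.

```python
def medical_class_contents(batches: int,
                           medical_per_batch: int = 100,
                           false_positives_per_batch: int = 17):
    """Return (non-medical count, class size, non-medical share in %)
    for the medical class after the given number of 200-document batches."""
    wrong = batches * false_positives_per_batch
    in_class = batches * (medical_per_batch + false_positives_per_batch)
    return wrong, in_class, 100.0 * wrong / in_class


wrong, in_class, share = medical_class_contents(2)
print(wrong, in_class, round(share, 1))
```

Under this assumption the count of non-medical documents grows with every batch, while their share of the medical class stays at 17 out of every 117.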

4.

I positioned my line so that the entire group of non-medical articles is completely separated from the entire group of medical articles. By dividing the data this way, I created a firm boundary between the two classes so that no document from one group falls into the other group’s territory. Classifying the data this way gives 100% accuracy on our sample set.
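
To illustrate what a separating line does, here is a rough sketch of a linear boundary in two dimensions. Everything in it is an assumption made for illustration: the second feature is unnamed and hypothetical, the weights and offset are invented, and the six sample points are made up rather than read from the assignment's plot.

```python
# Hypothetical line: num_authors + second_feature = 8.
WEIGHTS = (1.0, 1.0)  # invented weights for (num_authors, second_feature)
OFFSET = -8.0         # invented offset


def side(point):
    """Classify a point by which side of the line it falls on."""
    score = WEIGHTS[0] * point[0] + WEIGHTS[1] * point[1] + OFFSET
    return "medical" if score > 0 else "non-medical"


# Made-up (num_authors, second_feature) points with their true labels.
sample = [
    ((2, 3), "non-medical"),
    ((4, 2), "non-medical"),
    ((5, 1), "non-medical"),
    ((4, 6), "medical"),
    ((6, 4), "medical"),
    ((9, 5), "medical"),
]

accuracy = sum(side(p) == label for p, label in sample) / len(sample)
print(accuracy)
```

Note that two of the made-up points have 4 authors but different labels; the author count alone could not separate them, while the line in two dimensions can.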

5. Percentage of data correctly classified: 100%

6. In this particular case, adding more dimensions to the data set made our classification problem easier to solve and allowed us a greater degree of accuracy. However, this is not always the case. First and foremost, when adding a dimension it is imperative that the dimension be pertinent to classification. If the dimension we add gives us no way of significantly differentiating between the two classes, it will do us no good. For example, if 50% of 100 men have blue eyes and 50% have brown eyes, and 50% of 100 women have blue eyes and 50% have brown eyes, the dimension of eye color would do little to help us differentiate between men and women, as the two groups would completely overlap and there would be no substantial split.
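
The eye color example can be made concrete with a small calculation. This sketch uses the counts from the example and asks how well the best possible rule based on eye color alone could do; the function name and data layout are my own.

```python
# Counts from the example: 100 men and 100 women, each split evenly.
counts = {
    ("man", "blue"): 50, ("man", "brown"): 50,
    ("woman", "blue"): 50, ("woman", "brown"): 50,
}


def best_single_feature_accuracy(counts):
    """Accuracy of the best rule that predicts the class from eye color
    alone, i.e. guessing the majority class for each eye color."""
    total = sum(counts.values())
    colors = {color for _, color in counts}
    correct = sum(
        max(n for (_, c), n in counts.items() if c == color)
        for color in colors
    )
    return correct / total


print(best_single_feature_accuracy(counts))
```

An accuracy of one half is no better than guessing, which is what it means for the dimension not to be pertinent to classification.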

In addition, while adding more dimensions is helpful in creating classification models, it does create the added burden of data collection. As stated in class, for each column we create in a data table we will need to at least double or triple the number of rows we have. To build a sample size that would give us a decent level of accuracy in our statistical analysis, it would be necessary to collect a vast amount of data. Sometimes this burden of data collection can kill an attempt at creating a model before it begins, either because the amount of data would be too large for efficient computation or because the needed quantity of data simply cannot be found.
