Data Mining

Assignment #1:

Use two different search engines to find two different recent stories involving data mining. For each story, describe the role of “data mining” in your own words. Your description should be at least 200-250 words long; it can be longer but not shorter.

Here is an example. Searching Google News for data mining on August 9, 2010 gave as one of the results:

IBM, Mayo Clinic Take Next Data Mining Step: DB2-based system contains records on 4.4M patients

This story describes how Mayo Clinic uses IBM tools to analyze the data on its patients in order to develop individualized care. ... A doctor treating a patient for cancer could use data mining to find the results of treatments given to similar patients and use this information to find the best course of treatment for a new patient.

Assignment #2:

This assignment will give you an opportunity to become familiar with the use of the WEKA workbench to invoke several different machine learning schemes. Lecture 2b explains more about the Weka system and the weather example. You will use the latest stable version. You will be able to use both the graphical interface (Explorer) and command line interface (CLI). See http://www.cs.waikato.ac.nz/~ml/weka/ for Weka documentation.

Use the following learning schemes with their default settings to analyze the weather data (in weather.arff): (1) select the “Explorer” option; (2) click on the “Classify” tab; (3) for the test options, first choose "Use training set", and then choose "Percentage Split" with the default 66% split. Report each model's percent error rate for each test option.

• ZeroR (majority class)

• OneR

• Naive Bayes Simple

• J4.8
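As a reference point for the error rates you report, here is a minimal Python sketch of what ZeroR does: it ignores every attribute and always predicts the majority class. The function name `zero_r_error` is illustrative (it is not part of Weka), and the 9 "yes" / 5 "no" label counts assume the standard 14-instance weather data.

```python
from collections import Counter


def zero_r_error(labels):
    """Percent error on `labels` when always predicting the most common label."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return 100.0 * (len(labels) - majority_count) / len(labels)


# Assumed class distribution of the standard weather data: 9 "yes", 5 "no".
play = ["yes"] * 9 + ["no"] * 5
print(f"ZeroR training-set error: {zero_r_error(play):.1f}%")
```

A classifier that cannot beat this baseline on held-out data has probably not learned anything useful from the attributes.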

Answer the following questions:

Which of these classifiers are you more likely to trust when determining whether to play? Why?

What can you say about the accuracy measured on the training set compared with the accuracy measured when a separate percentage of the data is held out for testing?

Assignment #3:

Take the file genes-leukemia.csv and convert it to a Weka file genes-a.arff. You can convert the file either with a text editor (the brute-force way) or by finding a Weka command that converts a .csv file to .arff (the smart way). Perform the following tasks and submit your answers to the questions.

• The Target field is CLASS. Use J48 on genes-leukemia with the "Use training set" option.

• Use genes-leukemia.arff to create two subsets:

genes-leukemia-train.arff, with the first 38 samples (s1 ... s38) of the data

genes-leukemia-test.arff, with the remaining 34 samples (s39 ... s72)

• Train J48 on genes-leukemia-train.arff and specify "Use training set" as the test option.

• What decision tree do you get? What is its accuracy?

• Now specify genes-leukemia-test.arff as the test set.

• What decision tree do you get, and how does its accuracy compare to the one in the previous question?

• Now remove the field “Source” from the classifier (uncheck the box next to Source, and click on Apply Filter in the top menu) and repeat the previous steps.

• What do you observe? Does the accuracy on the test set improve and, if so, why do you think it does?

• Which classifier gives the highest accuracy on the test set?
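The train/test split described above can also be scripted instead of done by hand. The following is a minimal Python sketch, not a Weka feature: `split_arff` is a hypothetical helper that copies the ARFF header into both outputs and divides the data rows at a given position. It assumes the rows are already in sample order (s1 ... s72) and that each data row sits on one line.

```python
def split_arff(arff_text, n_train=38):
    """Return (train_text, test_text): same header, first n_train rows vs. the rest."""
    header, data = [], []
    in_data = False
    for line in arff_text.splitlines():
        if in_data:
            # Skip blank lines and ARFF comments (lines starting with %).
            if line.strip() and not line.lstrip().startswith("%"):
                data.append(line)
        else:
            header.append(line)
            if line.strip().lower().startswith("@data"):
                in_data = True
    head = "\n".join(header) + "\n"
    train = head + "\n".join(data[:n_train]) + "\n"
    test = head + "\n".join(data[n_train:]) + "\n"
    return train, test


# Usage sketch (file names taken from the assignment text):
# with open("genes-leukemia.arff") as f:
#     train_text, test_text = split_arff(f.read(), n_train=38)
# with open("genes-leukemia-train.arff", "w") as f:
#     f.write(train_text)
# with open("genes-leukemia-test.arff", "w") as f:
#     f.write(test_text)
```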

Assignment #4:

Note: This assignment uses the decision tree tool J4.8 in Weka to predict treatment outcomes.

1. Start with the genes-leukemia.csv dataset used in Assignment #3.

2. As the target (the field to be predicted), use the field TREATMENT_RESPONSE, which has the values Success, Failure, or "?" (missing).

Step 1: Examine the records where TREATMENT_RESPONSE is non-missing.

1. How many such records are there?

2. Can you describe these records using other sample fields (e.g. Year from XXXX to YYYY, or Gender = X, etc.)?

3. Why is it not correct to build predictive models for TREATMENT_RESPONSE using records where it is missing?

Step 2: Select only the records with non-missing TREATMENT_RESPONSE. Keep SNUM (sample number) but remove sample fields that are all the same or missing. Call the reduced dataset genes-reduced.csv.

4. Which sample fields should you keep?
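The selection in Step 2 can be sketched in Python as follows. This is only one possible approach: `reduce_records` is a hypothetical helper, rows are assumed to be dictionaries (for example from `csv.DictReader`), and missing values are assumed to be written as "?" or left empty. You still need to decide which columns count as sample fields and pass their names in.

```python
MISSING = {"?", ""}


def reduce_records(rows, sample_fields, target="TREATMENT_RESPONSE",
                   always_keep=("SNUM",)):
    """Keep rows with a known target; drop sample fields that carry no information."""
    kept_rows = [r for r in rows if r[target] not in MISSING]
    drop = set()
    for field in sample_fields:
        if field in always_keep or field == target:
            continue
        # Distinct non-missing values of this field among the kept rows.
        observed = {r[field] for r in kept_rows if r[field] not in MISSING}
        if len(observed) <= 1:  # all the same, or all missing
            drop.add(field)
    return [{k: v for k, v in r.items() if k not in drop} for r in kept_rows]


# Tiny made-up example; the values are illustrative, not from the real dataset.
rows = [
    {"SNUM": "1", "Gender": "F", "Year": "1990", "TREATMENT_RESPONSE": "Success", "g1": "10"},
    {"SNUM": "2", "Gender": "F", "Year": "?", "TREATMENT_RESPONSE": "Failure", "g1": "12"},
    {"SNUM": "3", "Gender": "M", "Year": "1991", "TREATMENT_RESPONSE": "?", "g1": "9"},
]
reduced = reduce_records(rows, sample_fields=["SNUM", "Gender", "Year"])
print(sorted(reduced[0].keys()))
```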

Assignment #5

Predict disease classes using genetic microarray data

Data

Gene data is in genes-in-rows format, comma-separated values. Download the dm-assign5.zip file and unzip to extract 3 files:

• pp5i_train.gr.csv (training data, 1.7 MB)

• pp5i_train_class.txt (training data classes)

• pp5i_test.gr.csv (test data, 0.6 MB)

Instructions

Training data: file pp5i_train.gr.csv, with 7070 genes (no Affy controls) for 69 samples. A separate file pp5i_train_class.txt has classes for each sample, in the order corresponding to the order of samples in pp5i_train.gr.csv. There are 5 classes, labeled EPD, JPA, MED, MGL, RHB.

Test data: file pp5i_test.gr.csv, with 23 unlabelled samples and the same genes. You can assume that the class distribution is similar.

Your goal is to learn the best model from the training data and use it to predict the label (class) for each sample in test data. You will also need to write a short paper describing your effort.

Randomization experiments showed that one can get about 10-12 correct answers (out of 23) with random guessing.

The final grade will be a combination of effort (40%), presentation (30%), and the accuracy of prediction. Below are suggested steps for doing this experiment, but you can vary and improve on the suggested approach, as long as you produce a prediction for the test set and describe your results.

Important Hints

Be sure that you don’t use the sample number as one of the predictors. Training data is ordered by class, so sample number will appear to be a good predictor on cross-validation, but it will not work on the test data.

One of the MED samples in the training data is very likely misclassified (by a human). So the best result you can expect to get on cross-validation is one error (on a MED sample) out of 69. However, this should not affect your accuracy on the test set (all labels there are correct).

You can make all the runs from the Weka GUI, but if you can learn a UNIX shell, you can run these repeated experiments much more easily from the shell. (Caution: Weka cross-validation uses a random number seed that is different in the GUI and in the shell, so cross-validation results may differ slightly if you call Weka from the shell rather than from the Weka Explorer.)

You can complete the project using only simple steps, but the more advanced steps will give you higher accuracy.

The following steps suggest one way of finding the best model -- you are welcome to make improvements, where you think appropriate.

Step 1: Data Cleaning

Threshold both train and test data to a minimum value of 20, maximum of 16,000.
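Thresholding here means clipping: values below the minimum are raised to it and values above the maximum are lowered to it. A minimal Python sketch (the function name `threshold` is illustrative):

```python
def threshold(values, lo=20, hi=16000):
    """Clip each expression value into the closed range [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]


# Made-up expression values for one gene across four samples.
print(threshold([-5, 20, 150.5, 20000]))
```

Apply the same limits to both the train and the test file so that the two stay comparable.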

Step 2: Selecting top genes by class

• Remove from the train data genes with fold differences across samples of less than 2.

• For each class, generate subsets with the top 2, 4, 6, 8, 10, 12, 15, 20, 25, and 30 genes with the highest T-value.

Optional: for each class, select top genes using the highest absolute T-value (i.e. also include genes with a high negative T-value).

• For each N = 2, 4, 6, 8, 10, 12, 15, 20, 25, 30, combine the top genes for each class into one file (removing duplicates, if any) and call the resulting file pp5i_train.topN.gr.csv.

• Add the class as the last column, remove the sample number, transpose each file to "genes-in-columns" format, and convert it to arff.
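The gene selection in Step 2 can be sketched in Python as shown below. Two definitions here are assumptions, since the steps above do not spell them out: fold difference is taken as the maximum divided by the minimum value of a gene across samples (after thresholding, so all values are positive), and the T-value compares samples in the class against all other samples using the unequal-variance form (mean1 - mean2) / sqrt(var1/n1 + var2/n2). All function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def fold_difference(values):
    """Assumed definition: max / min across samples (values must be positive)."""
    return max(values) / min(values)


def t_value(in_class, rest):
    """Assumed definition: unequal-variance T statistic, class vs. all other samples.

    Each group needs at least two samples for the variance to be defined.
    """
    denom = sqrt(variance(in_class) / len(in_class) + variance(rest) / len(rest))
    if denom == 0:
        return 0.0
    return (mean(in_class) - mean(rest)) / denom


def top_genes(expr, labels, target, n, min_fold=2.0, use_abs=False):
    """Names of the n genes with the highest T-value for class `target`.

    expr maps gene name -> list of values, one per sample, aligned with labels.
    """
    scored = []
    for gene, values in expr.items():
        if fold_difference(values) < min_fold:
            continue  # gene varies too little across samples
        inside = [v for v, c in zip(values, labels) if c == target]
        outside = [v for v, c in zip(values, labels) if c != target]
        t = t_value(inside, outside)
        scored.append((abs(t) if use_abs else t, gene))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [gene for _, gene in scored[:n]]


# Made-up data: 4 samples, 3 genes.
labels = ["MED", "MED", "EPD", "EPD"]
expr = {"g1": [100, 120, 20, 30], "g2": [50, 55, 52, 51], "g3": [20, 25, 200, 220]}
print(top_genes(expr, labels, "MED", 1), top_genes(expr, labels, "EPD", 1))
```

To build a topN file, call `top_genes` once per class and take the union of the returned names, which removes duplicates.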

Step 3: Find the best classifier/best gene set combination

Use the following Weka classifiers:

• NaiveBayes

• J48

• IB1

• IBk (for each value of K=2, 3, 4)

• One more Weka classifier of your choice that can work with multiclass data.

a. For each classifier, using default settings, measure classifier accuracy on the training data with cross-validation, using the previously generated files with the top N = 2, 4, 6, 8, 10, 12, 15, 20, 25, 30 genes. For IBk, test accuracy with K = 2, 3, and 4.

b. Select the model and the gene set with the lowest cross-validation error.

Optional: once you have found the gene set with the lowest cross-validation error, you can vary 1-2 additional relevant parameters for each classifier to see if the accuracy improves. E.g. for J4.8, you can vary reducedErrorPruning and binarySplits.

c. Use the gene names from the best train gene set and extract the data corresponding to these genes from the test set.

d. Convert the test set to genes-in-columns format.

e. Add a Class column with "?" values as the last column.
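Steps d and e, plus the conversion to ARFF, can be sketched in Python as follows. The input layout is an assumption: a header row followed by one row per gene, with the gene name in the first cell and one value per sample after it. Both function names are hypothetical, and the ARFF writer assumes that gene names contain no spaces or other characters that would need quoting.

```python
def genes_in_columns(rows, classes=None):
    """Transpose genes-in-rows data and append a Class column.

    rows[0] is the header row; each later row is [gene_name, value_1, ..., value_n].
    If classes is None, every sample gets "?" (unknown), as for the test set.
    """
    gene_names = [row[0] for row in rows[1:]]
    n_samples = len(rows[0]) - 1
    if classes is None:
        classes = ["?"] * n_samples
    table = [gene_names + ["Class"]]
    for j in range(1, n_samples + 1):
        table.append([row[j] for row in rows[1:]] + [classes[j - 1]])
    return table


def to_arff(relation, table, class_values=("MED", "MGL", "RHB", "EPD", "JPA")):
    """Serialize a genes-in-columns table as ARFF text with a fixed Class declaration."""
    header, *data = table
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in header[:-1]]
    lines.append("@attribute Class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in data]
    return "\n".join(lines) + "\n"


# Made-up test data: 2 genes, 2 samples; the header cell names are illustrative.
rows = [["SNO", "1", "2"], ["g1", "20", "30"], ["g2", "40", "50"]]
print(to_arff("demo_test", genes_in_columns(rows)))
```

Passing the same `class_values` when writing the train and the test file keeps the Class declaration identical in both.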

Step 4: Generate predictions for the test set

You should now have the best train file, call it pp5i_train.bestN.csv (with 69 samples and bestN genes, for whatever bestN you found), and a corresponding test file, call it pp5i_test.bestN.csv, with the same genes and 23 test samples. The train file will have all Class values, while the Class column in the test file will have only "?".

a. Convert the test file to arff format (you should already have the .arff for the train file from Step 3).

Important: In Weka, the variable declarations should be exactly the same for the test and train files. To achieve that, change the Class entry in the pp5i_test.bestN.arff header section to be the same as in the train file, i.e.

@attribute Class {MED,MGL,RHB,EPD,JPA}

b. Use the best train file and the matching test file and generate predictions for the test file class.

If you are using the GUI, then

• Select the best train set under the Preprocess tab

• Click on the Supplied test set option under the Classify tab and specify the matching test set

• Specify the appropriate classifier parameters, if any.

• Click on Start to run the classifier. Because the classes are unknown ("?") in the test set, the confusion matrix will show all zeros.

• Right-click on the model name in the result list panel (see figure) and select Visualize classifier errors from the submenu

• From the visualization screen, select Save and Weka will save the test file and predictions in arff format.

• Extract from it a file with Instance_number and predictedClass columns and write them to a file -predictions.csv

• You should have predictions for 23 instances, with instance number ranging from 0 to 22.
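Pulling the two columns out of the saved file can be sketched in Python as follows. `extract_predictions` is a hypothetical helper; it takes the attribute names Instance_number and predictedClass from the step above and assumes a simple ARFF layout (one `@attribute` per line, attribute names without spaces, no quoted commas in the data rows). Check the header of your saved file before relying on it.

```python
def extract_predictions(arff_text, id_attr="Instance_number",
                        pred_attr="predictedClass"):
    """Return a list of (instance_number, predicted_class) string pairs."""
    names, rows, in_data = [], [], False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            rows.append([cell.strip() for cell in line.split(",")])
        elif lower.startswith("@attribute"):
            # Second whitespace-separated token is the attribute name.
            names.append(line.split(None, 2)[1].strip("'\""))
        elif lower.startswith("@data"):
            in_data = True
    i, p = names.index(id_attr), names.index(pred_attr)
    return [(row[i], row[p]) for row in rows]


# Made-up example of a saved file with two instances.
saved = """@relation demo_predicted
@attribute Instance_number numeric
@attribute g1 numeric
@attribute predictedClass {MED,EPD}
@attribute Class {MED,EPD}
@data
0,20,MED,?
1,35,EPD,?
"""
for number, predicted in extract_predictions(saved):
    print(f"{number},{predicted}")
```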

If you are using a shell, then you can generate predictions using, e.g.:

java weka.classifiers.Classifier -t train_data.arff -T test.data.arff -p 0
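If you prefer to script the runs, the command line above can be generated for each classifier. The Python sketch below only builds the argument lists; it does not run Weka. The fully qualified class names are assumptions based on the Weka 3 package layout (IB1 in particular may be missing from newer releases), so check them against your installed version. It also assumes weka.jar is already on your CLASSPATH.

```python
# Assumed Weka 3 class names plus scheme-specific options; verify locally.
CLASSIFIERS = {
    "NaiveBayes": ["weka.classifiers.bayes.NaiveBayes"],
    "J48": ["weka.classifiers.trees.J48"],
    "IB1": ["weka.classifiers.lazy.IB1"],
    "IBk-2": ["weka.classifiers.lazy.IBk", "-K", "2"],
    "IBk-3": ["weka.classifiers.lazy.IBk", "-K", "3"],
    "IBk-4": ["weka.classifiers.lazy.IBk", "-K", "4"],
}


def prediction_command(scheme, train_arff, test_arff):
    """Argument list that trains on train_arff and prints predictions for test_arff."""
    return ["java", *CLASSIFIERS[scheme],
            "-t", train_arff, "-T", test_arff, "-p", "0"]


for scheme in CLASSIFIERS:
    cmd = prediction_command(scheme, "pp5i_train.bestN.arff", "pp5i_test.bestN.arff")
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute it, or the printed lines can be pasted into a shell.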

Step 5: Write a short paper describing your effort.

Document each step.

For each classifier used, give a paragraph describing this classifier.

Give a graph showing error rate versus number of genes.

Describe which classifier and which number of genes you have selected.

Comment on the relative strengths and weaknesses of the classifiers you used for this type of data.

Your paper can be of any length in any format that you want.
