COMP 393, Fall 2007, Assignment 5

(rough draft due Thursday, November 8; final version due Thursday, November 15)

 

1.     Creating a new Weka classifier. (50 points.)

a.      In class, we will go through the procedure needed to compile Weka within Eclipse and to create a new classifier that is a clone of an existing one.  Before continuing, verify that you are able to achieve this: you should be able to start the Weka Explorer and use the new classifier you have created.  There is nothing to submit for this part of the assignment.

b.     Create a new Weka classifier that implements the selective naïve Bayes classifier described on pages 295-296 of Witten and Frank.  Most of the code you need is in the provided SelectiveNaiveBayes.java file.  It should be possible to complete the implementation by inserting code only at the two locations marked “TODO” in this file.  Please read all of the provided private methods in the file carefully, as some of them will be extremely useful to you.  Submit your final SelectiveNaiveBayes.java file, and its output on the following two data files that are provided with Weka: contact-lenses.arff and segment-challenge.arff.
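As a reminder of the idea behind the selective naïve Bayes classifier, here is a rough, Weka-independent sketch in Python of naïve Bayes combined with greedy forward selection of attributes scored by training-set accuracy.  All names and the toy data below are illustrative only; they are not part of the provided SelectiveNaiveBayes.java, whose structure you must follow instead.

```python
from collections import Counter, defaultdict

def nb_train(rows, labels, attrs):
    # Count class priors and per-class value counts, restricted to `attrs`.
    priors = Counter(labels)
    cond = {a: defaultdict(Counter) for a in attrs}
    for row, y in zip(rows, labels):
        for a in attrs:
            cond[a][y][row[a]] += 1
    return priors, cond

def nb_predict(model, row, attrs):
    priors, cond = model
    n = sum(priors.values())
    best, best_p = None, -1.0
    for y, c in priors.items():
        p = c / n
        for a in attrs:
            # Laplace smoothing over the observed values of attribute a.
            n_vals = len({v for cc in cond[a].values() for v in cc})
            p *= (cond[a][y][row[a]] + 1) / (c + n_vals)
        if p > best_p:
            best, best_p = y, p
    return best

def training_accuracy(rows, labels, attrs):
    model = nb_train(rows, labels, attrs)
    hits = sum(nb_predict(model, r, attrs) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def selective_nb(rows, labels, all_attrs):
    # Greedy forward selection: repeatedly add the attribute that most
    # improves training-set accuracy; stop when no addition helps.
    chosen = []
    best_acc = training_accuracy(rows, labels, chosen)
    while True:
        candidate = None
        for a in all_attrs:
            if a in chosen:
                continue
            acc = training_accuracy(rows, labels, chosen + [a])
            if acc > best_acc:
                best_acc, candidate = acc, a
        if candidate is None:
            return chosen, best_acc
        chosen.append(candidate)

# Toy data: attribute 'a' determines the class, 'b' is pure noise.
rows = [{'a': 'x', 'b': 'p'}, {'a': 'x', 'b': 'q'},
        {'a': 'y', 'b': 'p'}, {'a': 'y', 'b': 'q'}]
labels = [0, 0, 1, 1]
```

On this toy data the selection should keep only 'a', since 'b' never improves training accuracy.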

 

2.     Feature selection. (10 points.) All parts of this question should be answered by applying suitable Weka tools to the segment-challenge.arff data file provided with Weka.  You should use the default settings for J4.8 decision trees and Naïve Bayes classifiers whenever they are required. You should also use default settings for the feature selection techniques, except when the question specifically requires a different setting. If it helps to make your results more compact or readable, you can describe attributes using their numbers in the arff file rather than their full names (e.g. “3” rather than “region-pixel-count”).

a.      What are the best three attributes as measured by information gain applied to each attribute separately?

b.     What are the best three attributes as measured by Weka’s CfsSubsetEval evaluator when using the BestFirst search technique?

c.     Which attributes are actually used by a J4.8 decision tree trained on this arff file? 

d.     What is the best subset of attributes as determined by a J4.8 wrapper and the greedy search method?

e.      What is the best subset of exactly 3 attributes as determined by a J4.8 wrapper and the greedy search method?

f.       What is the best subset of exactly 2 attributes as determined by a J4.8 wrapper and the greedy search method?

g.     For each subset of attributes you identified in (a)-(f), record the error rate (as determined by 10-fold cross-validation) for J4.8, and the number of leaves in the resulting decision tree.  Summarize all these results in a suitable table or chart.  Write a few sentences commenting on the results. (For example, if we restrict attention to the subsets of size 3 that you tested, did any perform significantly better or worse than the others, and if so, why?  How do the other results compare to the subsets of size 3?  Are they what you expect?)

h.     Repeat (d), (e) and (f) using a Naïve Bayes wrapper instead of J4.8.  Record the error rate of each, and also record the error rate of a Naïve Bayes classifier applied to all attributes, and to the 3 attributes from (a).  Again, summarize your results in a table or chart.  Write a few sentences commenting on these results, noting any similarities or differences from those in (g).  Also compare your results with the output of your selective Naïve Bayes classifier in Question 1(b).
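For part (a), recall how per-attribute information gain is computed: the entropy of the class distribution minus the expected entropy after splitting on the attribute.  The sketch below (illustrative Python, not Weka code; Weka's InfoGainAttributeEval does this for you, with discretization of numeric attributes) shows the calculation on nominal toy data.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Class entropy minus the expected entropy after splitting on attr.
    split = defaultdict(list)
    for row, y in zip(rows, labels):
        split[row[attr]].append(y)
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

def top_k_by_gain(rows, labels, attrs, k=3):
    # Rank attributes independently by gain, as in question 2(a).
    return sorted(attrs, key=lambda a: info_gain(rows, labels, a),
                  reverse=True)[:k]

# Toy data: 'a' is perfectly informative, 'b' carries no information.
rows = [{'a': 'x', 'b': 'p'}, {'a': 'x', 'b': 'q'},
        {'a': 'y', 'b': 'p'}, {'a': 'y', 'b': 'q'}]
labels = [0, 0, 1, 1]
```

Here 'a' has gain 1 bit and 'b' has gain 0, so a top-1 ranking returns 'a'.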

 

3.     Discretizing attributes.  (10 points.) All parts of this question should be answered by applying suitable Weka tools to the segment-challenge.arff data file provided with Weka.  You should use the default settings for J4.8 decision trees and Naïve Bayes classifiers whenever they are required, and default settings for discretization except where the question requires otherwise.

a.      Investigate whether standard discretization (with bins of equal width) improves or hampers the performance of J4.8 and Naïve Bayes on this arff file.  Try several different values for the number of bins used, and present your results using a suitable table or chart.  Write a few sentences commenting on and analyzing the results.

b.     Repeat (a) using equal-frequency binning (frequency-equalized histograms).  Discuss whether or not histogram equalization appears to deliver any benefits in these experiments, giving clear evidence for any conclusions.
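To make the difference between the two discretization schemes concrete, here is a minimal illustrative sketch (not Weka's Discretize filter, whose options you should use for the actual experiments): equal-width binning splits the value range into k intervals of equal width, while equal-frequency binning puts (nearly) the same number of instances into each bin.

```python
def equal_width_bins(values, k):
    # Split the range [min, max] into k intervals of equal width.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0   # avoid division by zero for constant data
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    # Sort, then cut into k runs of (nearly) equal size.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

values = [1, 2, 3, 4, 100]   # one outlier
```

Note how the outlier dominates equal-width binning (it claims a bin by itself, squeezing everything else into bin 0), while equal-frequency binning still splits the bulk of the data.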

 

4.     PCA.  (10 points.)

a.      Use the Weka toolkit to apply PCA to the file pcatest.arff.  Retain as many dimensions as seem sensible for this data, and save the result in a new arff file.  Report how many dimensions you retained, and your reasons for doing so.  Also report the sizes of the original arff file and your new file.

b.     Apply a J4.8 decision tree to the original pcatest.arff file and your new file.  Write a few sentences analyzing the size and performance of the two trees.
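As background for deciding how many dimensions to retain, the following sketch (illustrative Python with NumPy, not Weka's PrincipalComponents filter; the 0.95 threshold is just an example) centres the data, eigendecomposes the covariance matrix, and keeps the smallest number of components whose cumulative proportion of variance reaches a threshold.

```python
import numpy as np

def pca_reduce(X, var_threshold=0.95):
    # Centre the data, eigendecompose the covariance matrix, and keep the
    # smallest number of components whose cumulative proportion of variance
    # reaches var_threshold.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(vals)[::-1]              # largest variance first
    vals, vecs = vals[order], vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(ratio, var_threshold) + 1)
    return Xc @ vecs[:, :k], ratio, k

# Toy data lying exactly on a line: one component carries all the variance.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
```

For this toy data the first component alone reaches the threshold, so only one dimension is retained.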

 

5.     Bagging and Boosting. (10 points.) Use the Weka toolkit to estimate the error rate of (i) the REPTree classifier and (ii) the naïve Bayes classifier on the soybean.arff file that is provided with Weka.  Also compute the error rates of both classifiers when using (a) bagging, and (b) boosting. Summarize your results in a table, and write a few sentences commenting on them.
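As a reminder of what bagging does (Weka's Bagging and AdaBoostM1 implementations are what you should actually use; the base learner and names below are purely illustrative), the following sketch trains each model on a bootstrap resample of the training data and combines predictions by majority vote.

```python
import random
from collections import Counter, defaultdict

def stump_train(rows, labels, attr):
    # One-level "stump": predict the majority class for each value of attr.
    table = defaultdict(Counter)
    for row, y in zip(rows, labels):
        table[row[attr]][y] += 1
    default = Counter(labels).most_common(1)[0][0]
    return {v: c.most_common(1)[0][0] for v, c in table.items()}, default

def stump_predict(model, row, attr):
    table, default = model
    return table.get(row[attr], default)

def bagged_predict(rows, labels, attr, test_row, n_models=25, seed=0):
    # Bagging: train each model on a bootstrap resample (sampling with
    # replacement), then combine the predictions by majority vote.
    rng = random.Random(seed)
    n = len(rows)
    votes = Counter()
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        model = stump_train([rows[i] for i in idx],
                            [labels[i] for i in idx], attr)
        votes[stump_predict(model, test_row, attr)] += 1
    return votes.most_common(1)[0][0]

# Toy data: attribute 'a' determines the class exactly.
rows = [{'a': 'x'}] * 4 + [{'a': 'y'}] * 4
labels = [0] * 4 + [1] * 4
```

Boosting differs in that each successive model is trained on a reweighted version of the data that emphasizes the instances the previous models got wrong, and the final vote is weighted by each model's accuracy.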


Originality: (10 points.) If you perform interesting additional work related to any question or questions, you will be awarded points for originality.  Please mark clearly any parts of your assignment you consider to be worthy of originality points.

 

Coding policy: for this assignment, you may not copy any code from anywhere except the Weka toolkit and your own earlier Assignments.