Important note, added 9/6: I am providing a framework of code in order to make this assignment easier. I strongly recommend that you use it, although it is not compulsory to do so. All you need to do is fill in the NaiveBayes class in NaiveBayes.java. Good luck!
Let me know if you have any problems compiling or running this code. A command line like “java Classifier OneR weather.nominal.arff” should work right out of the box.
1. Extend your program from Assignment 1 in two ways. First, it should now work on both the nominal weather data (weather.nominal.arff) and the “contact lens data” (contact-lenses.arff). Second, it should be capable of running both the 1R algorithm and the naïve Bayes classifier on these data sets. Your program should take the algorithm (i.e. 1R or naïve Bayes) and data file name as arguments, and output for each instance in the training data at least the following: (i) the estimated class probabilities and (ii) the actual classification. Finally, the program should output the percentage of errors made on the training set. (A sketch of the naïve Bayes computation appears after this list.)
2. Write a report containing:
a. the output of your program for the four possible combinations of inputs (two files and two algorithms).
b. 3-4 sentences describing any interesting or difficult design decisions you made in writing the program. (You can write more if you have interesting things to say, but don't ramble -- be concise) [added 9/6/2007]
c. 5-10 sentences analyzing the results. (Hint: compare the effects of the different data sets and the different algorithms.) (You can write more if you have interesting things to say, but don't ramble -- be concise) [added 9/6/2007]
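To make the naïve Bayes requirement concrete, here is a minimal sketch of how the counting and probability estimation might be organized for nominal attributes. This is illustrative only: the class and method names (NaiveBayesSketch, train, distributionFor) are my own, not the framework's API, and I have assumed that instances are encoded as arrays of integer value indices and that Laplace (“add one”) smoothing is used to avoid zero probabilities. Adapt the idea to fit the NaiveBayes class in the provided framework.

    import java.util.Arrays;

    /* Minimal naive Bayes sketch for nominal attributes (hypothetical names). */
    public class NaiveBayesSketch {
        private final int numClasses;
        private final int[] attrValueCounts;   // number of distinct values per attribute
        private final int[] classCounts;       // how often each class label occurs
        private final int[][][] condCounts;    // [attribute][value][class] co-occurrence counts
        private int numInstances;

        public NaiveBayesSketch(int numClasses, int[] attrValueCounts) {
            this.numClasses = numClasses;
            this.attrValueCounts = attrValueCounts;
            this.classCounts = new int[numClasses];
            this.condCounts = new int[attrValueCounts.length][][];
            for (int a = 0; a < attrValueCounts.length; a++)
                condCounts[a] = new int[attrValueCounts[a]][numClasses];
        }

        /* Training is just counting: one pass over the instances. */
        public void train(int[][] instances, int[] labels) {
            for (int i = 0; i < instances.length; i++) {
                classCounts[labels[i]]++;
                numInstances++;
                for (int a = 0; a < instances[i].length; a++)
                    condCounts[a][instances[i][a]][labels[i]]++;
            }
        }

        /* Normalized class probabilities P(class | instance), assuming the
           attributes are conditionally independent given the class. */
        public double[] distributionFor(int[] instance) {
            double[] probs = new double[numClasses];
            for (int c = 0; c < numClasses; c++) {
                // smoothed prior P(c)
                probs[c] = (classCounts[c] + 1.0) / (numInstances + numClasses);
                for (int a = 0; a < instance.length; a++)
                    // smoothed likelihood P(value | c)
                    probs[c] *= (condCounts[a][instance[a]][c] + 1.0)
                              / (classCounts[c] + attrValueCounts[a]);
            }
            double sum = Arrays.stream(probs).sum();
            for (int c = 0; c < numClasses; c++)
                probs[c] /= sum;   // normalize so the probabilities sum to 1
            return probs;
        }
    }

The distributionFor output satisfies requirement (i) above; the predicted class is simply the index of the largest probability, and comparing it with the actual class label across all training instances gives the error percentage.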
Coding policy: you may not copy any code from anywhere (including the textbook, the Internet, or any other source) for Assignment 2a, except that you may use your code from Assignment 1, and you may also use any or all of my own solution to Assignment 1 (but you must of course clearly indicate which parts you copied and/or altered) [added 9/3/2007], and you may also use any or all of the code framework for Assignment 2 provided above. [added 9/6/2007]
Assignment 2b has been eliminated; we will probably implement decision trees in a later assignment instead. [9/6/2007]
1. Extend your program from Assignment 2a so that it works on the same two data sets with a third algorithm: decision trees. For the splitting criterion in the decision tree, you have two options: either (i) use “reduction in uncertainty” (i.e. reduction in entropy, as defined in the lecture on this topic -- also called "gain" or "information gain" in the textbook) or (ii) if you are feeling adventurous, read up on the “gain ratio” described in the textbook and use that as your splitting criterion. Your program must be able to print out some sensible representation of the decision tree it builds, as well as doing the same things as in Assignment 2a (output class probabilities and actual classification for each training instance, and the percentage of errors on the entire training set). (A sketch of the entropy and information-gain computation appears after this list.)
2. Write a report containing:
a. the output of your program for the two data sets when using a decision tree classifier.
b. 3-4 sentences describing any interesting or difficult design decisions you made in writing the decision tree classifier.
c. 3-4 sentences analyzing the results.
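If you choose option (i), the core of the splitting criterion is the entropy of a class distribution and the information gain of a candidate split. Here is a minimal sketch; the helper names (entropy, informationGain) are my own, not part of the provided framework, and I have assumed class frequencies are passed in as plain count arrays.

    /* Entropy and information-gain helpers for nominal splits (hypothetical names). */
    public class InfoGainSketch {

        /* Entropy in bits: H = -sum_k p_k log2 p_k, where p_k = counts[k] / total. */
        static double entropy(int[] classCounts) {
            int total = 0;
            for (int c : classCounts) total += c;
            if (total == 0) return 0.0;
            double h = 0.0;
            for (int c : classCounts)
                if (c > 0) {
                    double p = (double) c / total;
                    h -= p * (Math.log(p) / Math.log(2));
                }
            return h;
        }

        /* gain = H(parent) - sum_v (n_v / n) * H(child_v), where childCounts[v][k]
           counts class k among the instances with attribute value v. */
        static double informationGain(int[] parentCounts, int[][] childCounts) {
            int n = 0;
            for (int c : parentCounts) n += c;
            double remainder = 0.0;
            for (int[] child : childCounts) {
                int nv = 0;
                for (int c : child) nv += c;
                remainder += ((double) nv / n) * entropy(child);
            }
            return entropy(parentCounts) - remainder;
        }
    }

For instance, on the standard 14-instance weather data (9 yes, 5 no), splitting on outlook partitions the classes as sunny = {2 yes, 3 no}, overcast = {4 yes, 0 no}, rainy = {3 yes, 2 no}, so informationGain(new int[]{9, 5}, new int[][]{{2, 3}, {4, 0}, {3, 2}}) should come out to roughly 0.247 bits, matching the worked example in the textbook. At each node, split on the attribute with the largest gain, recurse on each branch, and stop when a node is pure or no attributes remain.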
Coding policy: you may not copy any code from anywhere (including the textbook, the Internet, or any other source) for Assignment 2b, except that you may use your code from earlier assignments, and you may also use any or all of my own solution to Assignment 1 (but you must of course clearly indicate which parts you copied and/or altered).