Important note, added 9/6: I am providing a framework of code in order to make this assignment easier. I strongly recommend that you use it, although it is not compulsory to do so. All you need to do is fill in the NaiveBayes class in NaiveBayes.java. Good luck!
Let me know if you have any problems compiling or running this code. A command line like “java Classifier OneR weather.nominal.arff” should work right out of the box.
1. Extend your program from Assignment 1 in two ways. First, it should now work on both the nominal weather data (weather.nominal.arff) and the “contact lens data” (contact-lenses.arff). Second, it should be capable of running both the 1R algorithm and the naïve Bayes classifier on these data sets. Your program should take the algorithm (i.e. 1R or naïve Bayes) and data file name as arguments, and output for each instance in the training data at least the following: (i) the estimated class probabilities and (ii) the actual classification. Finally, the program should output the percentage of errors made on the training set. (A sketch of the naïve Bayes computation appears after this list.)
2. Write a report containing:
a. the output of your program for the four possible combinations of inputs (two files and two algorithms).
b. 3-4 sentences describing any interesting or difficult design decisions you made in writing the program. (You can write more if you have interesting things to say, but don't ramble -- be concise) [added 9/6/2007]
c. 5-10 sentences analyzing the results. (Hint: compare the effects of the different data sets and the different algorithms.) (You can write more if you have interesting things to say, but don't ramble -- be concise) [added 9/6/2007]
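To make the naïve Bayes requirement concrete, here is a minimal sketch of how the counting and probability estimation might be organized for nominal attributes. This is illustrative only: the class and method names (NaiveBayesSketch, train, distributionFor) are my own, not the framework's API, and I have assumed that instances are encoded as arrays of integer value indices and that Laplace (“add one”) smoothing is used to avoid zero probabilities. Adapt the idea to fit the NaiveBayes class in the provided framework.

    import java.util.Arrays;

    /* Minimal naive Bayes sketch for nominal attributes (hypothetical names). */
    public class NaiveBayesSketch {
        private final int numClasses;
        private final int[] attrValueCounts;   // number of distinct values per attribute
        private final int[] classCounts;       // how often each class label occurs
        private final int[][][] condCounts;    // [attribute][value][class] co-occurrence counts
        private int numInstances;

        public NaiveBayesSketch(int numClasses, int[] attrValueCounts) {
            this.numClasses = numClasses;
            this.attrValueCounts = attrValueCounts;
            this.classCounts = new int[numClasses];
            this.condCounts = new int[attrValueCounts.length][][];
            for (int a = 0; a < attrValueCounts.length; a++)
                condCounts[a] = new int[attrValueCounts[a]][numClasses];
        }

        /* Training is just counting: one pass over the instances. */
        public void train(int[][] instances, int[] labels) {
            for (int i = 0; i < instances.length; i++) {
                classCounts[labels[i]]++;
                numInstances++;
                for (int a = 0; a < instances[i].length; a++)
                    condCounts[a][instances[i][a]][labels[i]]++;
            }
        }

        /* Normalized class probabilities P(class | instance), assuming the
           attributes are conditionally independent given the class. */
        public double[] distributionFor(int[] instance) {
            double[] probs = new double[numClasses];
            for (int c = 0; c < numClasses; c++) {
                // smoothed prior P(c)
                probs[c] = (classCounts[c] + 1.0) / (numInstances + numClasses);
                for (int a = 0; a < instance.length; a++)
                    // smoothed likelihood P(value | c)
                    probs[c] *= (condCounts[a][instance[a]][c] + 1.0)
                              / (classCounts[c] + attrValueCounts[a]);
            }
            double sum = Arrays.stream(probs).sum();
            for (int c = 0; c < numClasses; c++)
                probs[c] /= sum;   // normalize so the probabilities sum to 1
            return probs;
        }
    }

The distributionFor output satisfies requirement (i) above; the predicted class is simply the index of the largest probability, and comparing it with the actual class label across all training instances gives the error percentage.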
Coding policy: you may not copy any code from anywhere (including the textbook, the Internet, or any other source) for Assignment 2a, except that you may use your code from Assignment 1, and you may also use any or all of my own solution to Assignment 1 (but you must of course clearly indicate which parts you copied and/or altered) [added 9/3/2007], and you may also use any or all of the code framework for Assignment 2 provided above. [added 9/6/2007]
Assignment 2b has been eliminated; we will probably implement decision trees in a later assignment instead. [9/6/2007]
1. Extend your program from Assignment 2a so that it works on the same two data sets with a third algorithm: decision trees. For the splitting criterion in the decision tree, you have two options: either (i) use “reduction in uncertainty” (i.e. reduction in entropy, as defined in the lecture on this topic -- also called "gain" or "information gain" in the textbook) or (ii) if you are feeling adventurous, read up on the “gain ratio” described in the textbook and use that as your splitting criterion. Your program must be able to print out some sensible representation of the decision tree it builds, as well as doing the same things as in Assignment 2a (output class probabilities and actual classification for each training instance, and the percentage of errors on the entire training set). (A sketch of the entropy and information-gain computation appears after this list.)
2. Write a report containing:
a. the output of your program for the two data sets when using a decision tree classifier.
b. 3-4 sentences describing any interesting or difficult design decisions you made in writing the decision tree classifier.
c. 3-4 sentences analyzing the results.
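If you choose option (i), the core of the splitting criterion is the entropy of a class distribution and the information gain of a candidate split. Here is a minimal sketch; the helper names (entropy, informationGain) are my own, not part of the provided framework, and I have assumed class frequencies are passed in as plain count arrays.

    /* Entropy and information-gain helpers for nominal splits (hypothetical names). */
    public class InfoGainSketch {

        /* Entropy in bits: H = -sum_k p_k log2 p_k, where p_k = counts[k] / total. */
        static double entropy(int[] classCounts) {
            int total = 0;
            for (int c : classCounts) total += c;
            if (total == 0) return 0.0;
            double h = 0.0;
            for (int c : classCounts)
                if (c > 0) {
                    double p = (double) c / total;
                    h -= p * (Math.log(p) / Math.log(2));
                }
            return h;
        }

        /* gain = H(parent) - sum_v (n_v / n) * H(child_v), where childCounts[v][k]
           counts class k among the instances with attribute value v. */
        static double informationGain(int[] parentCounts, int[][] childCounts) {
            int n = 0;
            for (int c : parentCounts) n += c;
            double remainder = 0.0;
            for (int[] child : childCounts) {
                int nv = 0;
                for (int c : child) nv += c;
                remainder += ((double) nv / n) * entropy(child);
            }
            return entropy(parentCounts) - remainder;
        }
    }

For instance, on the standard 14-instance weather data (9 yes, 5 no), splitting on outlook partitions the classes as sunny = {2 yes, 3 no}, overcast = {4 yes, 0 no}, rainy = {3 yes, 2 no}, so informationGain(new int[]{9, 5}, new int[][]{{2, 3}, {4, 0}, {3, 2}}) should come out to roughly 0.247 bits, matching the worked example in the textbook. At each node, split on the attribute with the largest gain, recurse on each branch, and stop when a node is pure or no attributes remain.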
Coding policy: you may not copy any code from anywhere (including the textbook, the Internet, or any other source) for Assignment 2b, except that you may use your code from earlier assignments, and you may also use any or all of my own solution to Assignment 1 (but you must of course clearly indicate which parts you copied and/or altered).