COMP 393, Fall 2007, Assignment 4

(assigned Friday, September 28; rough draft due Monday, October 8; final version due Thursday, October 18)

 

1.     Unpruned decision trees: Write a program that builds a decision tree without pruning. Your program should work on the same two datasets as Assignments 2 and 3 (nominal weather and contact lenses); please submit your program's output on these datasets. The split criterion can be the entropy measure explained in class (a self-contained sketch of it appears after the list below) or any other suitable measure of your choice. I am providing a code framework to assist with this part of the assignment. I strongly recommend that you use the framework, but you may of course ignore some or all of it if you wish (for example, you could replace my DecisionStump with your own from Assignment 3). If you are using the provided framework, please note the following points carefully; they will probably make your job much easier:

·        It should be possible to complete the assignment by inserting code only at the 13 locations of the string “TODO” (5 in DecisionTree.java and 8 in DecisionTreeNode.java). Nothing else needs to be changed.

·        Full JavaDoc documentation is provided with the framework.

·        Unlike the frameworks provided for previous assignments, this code follows standard Java conventions much more closely (e.g. no public fields). Please try to follow these conventions in your own code as well.
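For concreteness, here is a minimal, self-contained sketch of the entropy split criterion. It is illustrative only and is not part of the framework: the class and method names are invented, and the example counts come from splitting the nominal weather data on its "outlook" attribute.

    /** Illustrative entropy / information-gain computation (not part of the framework). */
    public class EntropyExample {

        /** Entropy, in bits, of the class distribution given by classCounts. */
        static double entropy(int[] classCounts) {
            int total = 0;
            for (int c : classCounts) total += c;
            double h = 0.0;
            for (int c : classCounts) {
                if (c == 0) continue;                // treat 0 * log 0 as 0
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);  // log base 2
            }
            return h;
        }

        /** Weighted average entropy of the branches produced by a split. */
        static double splitEntropy(int[][] branchCounts) {
            int total = 0;
            for (int[] b : branchCounts)
                for (int c : b) total += c;
            double h = 0.0;
            for (int[] b : branchCounts) {
                int n = 0;
                for (int c : b) n += c;
                h += ((double) n / total) * entropy(b);
            }
            return h;
        }

        public static void main(String[] args) {
            // Splitting the nominal weather data on "outlook":
            // sunny = [2 yes, 3 no], overcast = [4 yes, 0 no], rainy = [3 yes, 2 no]
            int[][] outlook = { {2, 3}, {4, 0}, {3, 2} };
            int[] root = { 9, 5 };                   // 9 yes, 5 no overall
            System.out.println("info gain for outlook = "
                    + (entropy(root) - splitEntropy(outlook)));
        }
    }

In your tree-growing code, the idea is to evaluate this quantity for every candidate attribute at a node and split on the attribute with the largest information gain (equivalently, the smallest weighted entropy).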

 

2.     Pruning decision trees: On page 192 of Witten and Frank, the authors state:

“… the simple [pruned] decision tree… actually performs better than the more complex [unpruned] one.”

Use the Weka toolkit to investigate this claim. Do you agree that pruned decision trees generally perform better than unpruned ones? Present results on at least two different datasets as evidence for your answer. (The "Experimenter" interface to Weka could be useful here, so consider working through Sections 12.1-12.4 of Witten and Frank; a scripted alternative is sketched below.) Write a paragraph listing the advantages of pruned trees over unpruned ones, and discuss which of these advantages are the most significant.
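If you prefer scripting to the GUI, the same comparison can be run through Weka's Java API. The following is a rough sketch, not a required method: it assumes weka.jar is on your classpath, and the ARFF file name is just a placeholder for whatever dataset you choose.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class PruningComparison {
        public static void main(String[] args) throws Exception {
            // Load a dataset (file name is a placeholder).
            Instances data = new Instances(
                    new BufferedReader(new FileReader("weather.nominal.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // Evaluate J48 with and without pruning under identical folds.
            for (boolean unpruned : new boolean[] { false, true }) {
                J48 tree = new J48();
                tree.setUnpruned(unpruned);          // corresponds to J48's -U option
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(tree, data, 10, new Random(1));
                System.out.println((unpruned ? "unpruned" : "pruned")
                        + ": " + eval.pctCorrect() + "% correct (10-fold CV)");
            }
        }
    }

Running this over several datasets, and checking whether the accuracy differences are significant (the Experimenter does this for you), gives the kind of evidence the question asks for.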

 

3.     ROC and precision-recall curves:

a.      Use Weka to generate (i) an ROC curve and (ii) a precision-recall curve for the naïve Bayes classifier applied to the labor negotiations data. Instructions for generating and saving such curves in Weka are available at http://weka.sourceforge.net/wiki/index.php/ROC_curves. (Read the whole page carefully and decide which of the methods is best suited to your purposes; a programmatic route to the curve data is also sketched after part b.) Submit graphs of your curves as the solution to this question. The graphs should be either clearly labeled or clearly described in accompanying text.

b.     Suppose that false positives are 10 times more costly than false negatives in this scenario (i.e. it is 10 times more costly to cause a strike by incorrectly predicting that a negotiation will be successful than it is to concede a few extra labor conditions after incorrectly predicting that a negotiation will fail). What is the optimal threshold for P(good) above which the naïve Bayes classifier should output "good"? As usual, you should explain your answer in a clear, scientific style; the sketch below can serve as an empirical sanity check on your derivation.
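The sketch below covers both parts: it extracts the ROC and precision-recall points for part (a), and for part (b) scans the same curve for the threshold with the smallest expected cost under the 10:1 ratio. It is an illustration, not the required method: the file name is a placeholder, and it assumes the class attribute of the labor data has a value named "good".

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.evaluation.ThresholdCurve;
    import weka.core.Instance;
    import weka.core.Instances;

    public class LaborCurves {
        public static void main(String[] args) throws Exception {
            // Load the labor negotiations data (file name is a placeholder).
            Instances data = new Instances(
                    new BufferedReader(new FileReader("labor.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // Collect cross-validated predictions from naive Bayes.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

            // Build the threshold curve, treating "good" as the positive class.
            int goodIndex = data.classAttribute().indexOfValue("good");
            Instances curve =
                    new ThresholdCurve().getCurve(eval.predictions(), goodIndex);

            // Part (a): print (FPR, TPR) and (recall, precision) pairs for plotting.
            int fpr  = curve.attribute(ThresholdCurve.FP_RATE_NAME).index();
            int tpr  = curve.attribute(ThresholdCurve.TP_RATE_NAME).index();
            int rec  = curve.attribute(ThresholdCurve.RECALL_NAME).index();
            int prec = curve.attribute(ThresholdCurve.PRECISION_NAME).index();
            System.out.println("FPR\tTPR\tRecall\tPrecision");
            for (int i = 0; i < curve.numInstances(); i++) {
                Instance row = curve.instance(i);
                System.out.println(row.value(fpr) + "\t" + row.value(tpr)
                        + "\t" + row.value(rec) + "\t" + row.value(prec));
            }

            // Part (b) sanity check: each row also records raw error counts and
            // the probability threshold at that operating point, so we can find
            // the threshold minimizing 10 * (false positives) + 1 * (false negatives).
            int fp = curve.attribute(ThresholdCurve.FALSE_POS_NAME).index();
            int fn = curve.attribute(ThresholdCurve.FALSE_NEG_NAME).index();
            int th = curve.attribute(ThresholdCurve.THRESHOLD_NAME).index();
            double bestCost = Double.MAX_VALUE, bestThreshold = 0.0;
            for (int i = 0; i < curve.numInstances(); i++) {
                Instance row = curve.instance(i);
                double cost = 10.0 * row.value(fp) + 1.0 * row.value(fn);
                if (cost < bestCost) {
                    bestCost = cost;
                    bestThreshold = row.value(th);
                }
            }
            System.out.println("empirically best threshold for P(good): "
                    + bestThreshold);
        }
    }

The empirical minimum should agree, up to the granularity of this small dataset, with the threshold you derive analytically from the expected-cost argument.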

 

Coding policy: you may not copy code from anywhere (including the textbook, the Weka toolkit, the Internet, or any other source) for Question 1 of Assignment 4, except that you may use your code from earlier assignments and the framework and illustrative sketches provided above. If your solutions to Questions 2 or 3 involve writing code, feel free to cut and paste code from Weka, the web, or anywhere else (but note that it is possible to complete these questions without writing any code, by using the interactive interfaces to Weka).