COMP 393, Fall 2007, Assignment 3

(assigned Thursday, September 13; due Thursday, September 27)

 

To assist with part 1 of this assignment, I am providing a code framework very similar to that for Assignment 2.

 

1.     Write a program that computes only the first level of a decision tree. One-level decision trees are sometimes called “decision stumps”. Your decision stump should work on the same two datasets as Assignment 2 (nominal weather, and contact lenses). The split criterion employed by your decision stump can be the entropy measure explained in class, or alternatively you may read up on and implement something else (for example, the “gain ratio” in Witten, p. 104, or the “Gini index” or “misclassification error” on p. 177 of Alpaydin). An illustrative sketch of entropy-based split selection appears after this list.

2.     Use the Weka toolkit to apply the k-nearest-neighbor algorithm to the “iris” data (“iris.arff” in the Weka “data” directory). Plot the error rate vs. k for k = 1, 2, …, 10, using (i) 10-fold and (ii) 2-fold cross-validation. Write one or two paragraphs discussing the results. (Hints: what value of k is best? Do you expect the same answer for (i) and (ii)? Compute the mean and standard deviation of the error rates for (i) and (ii), and give reasons for any differences.) A sketch of how the evaluation might be scripted appears after this list.

3.     The documentation for Weka’s IB1 classifier is incomplete: it doesn’t specify what distance function is used to compute the nearest neighbor. By examining the source code, give a complete description of this distance function. (Use clear, scientific language, and be sure to deal with both types of data, nominal and numeric.) A purely illustrative sketch of one possible mixed-attribute distance appears after this list.
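
For part 1, the following is a minimal sketch of entropy-based split selection for a decision stump over nominal attributes. The class name, method names, and array-based data representation are all hypothetical (this is not the provided framework); adapt the idea to whatever instance and attribute representation the framework gives you.

    import java.util.HashMap;
    import java.util.Map;

    public class StumpSketch {

        // Entropy of a class distribution: -sum_c p_c * log2(p_c).
        static double entropy(Map<String, Integer> classCounts, int total) {
            double h = 0.0;
            for (Integer count : classCounts.values()) {
                if (count == 0) continue;
                double p = (double) count / total;
                h -= p * Math.log(p) / Math.log(2);
            }
            return h;
        }

        // data[i][a] is the value of nominal attribute a for instance i;
        // labels[i] is the class label of instance i.
        // Returns the index of the attribute with the highest information gain,
        // i.e. the attribute the stump should split on.
        static int bestAttribute(String[][] data, String[] labels, int numAttributes) {
            int n = labels.length;

            // Entropy of the whole training set, before any split.
            Map<String, Integer> overall = new HashMap<String, Integer>();
            for (String label : labels) {
                Integer c = overall.get(label);
                overall.put(label, c == null ? 1 : c + 1);
            }
            double baseEntropy = entropy(overall, n);

            int best = -1;
            double bestGain = -1.0;
            for (int a = 0; a < numAttributes; a++) {
                // Class counts within each branch (attribute value) of the stump.
                Map<String, Map<String, Integer>> byValue =
                        new HashMap<String, Map<String, Integer>>();
                Map<String, Integer> sizeByValue = new HashMap<String, Integer>();
                for (int i = 0; i < n; i++) {
                    String v = data[i][a];
                    Map<String, Integer> counts = byValue.get(v);
                    if (counts == null) {
                        counts = new HashMap<String, Integer>();
                        byValue.put(v, counts);
                    }
                    Integer c = counts.get(labels[i]);
                    counts.put(labels[i], c == null ? 1 : c + 1);
                    Integer s = sizeByValue.get(v);
                    sizeByValue.put(v, s == null ? 1 : s + 1);
                }
                // Weighted average entropy of the branches after splitting on a.
                double splitEntropy = 0.0;
                for (String v : byValue.keySet()) {
                    int size = sizeByValue.get(v);
                    splitEntropy += (double) size / n * entropy(byValue.get(v), size);
                }
                double gain = baseEntropy - splitEntropy;
                if (gain > bestGain) {
                    bestGain = gain;
                    best = a;
                }
            }
            return best;
        }
    }

Once the best attribute is chosen, the stump predicts, for each value of that attribute, the majority class among the training instances having that value.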
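
For part 2, here is a hedged sketch, using the weka.classifiers.Evaluation API, of how the error rates could be collected programmatically rather than through the Explorer GUI. The file path and random seed are assumptions; check the class and method names against the Weka version you are using.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;
    import weka.core.Instances;

    public class IrisKnnSketch {
        public static void main(String[] args) throws Exception {
            // Assumed path: the "data" directory of your Weka installation.
            Instances data = new Instances(
                    new BufferedReader(new FileReader("data/iris.arff")));
            data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

            int[] foldSettings = {10, 2};
            for (int folds : foldSettings) {
                System.out.println(folds + "-fold cross-validation:");
                for (int k = 1; k <= 10; k++) {
                    IBk knn = new IBk();
                    knn.setKNN(k);                          // number of neighbors
                    Evaluation eval = new Evaluation(data);
                    eval.crossValidateModel(knn, data, folds, new Random(1));
                    System.out.printf("  k = %2d   error rate = %.4f%n",
                            k, eval.errorRate());
                }
            }
        }
    }

The same numbers can be obtained from the command line (e.g. java weka.classifiers.lazy.IBk -t data/iris.arff -x 10 -K 3, where -x sets the number of folds and -K the number of neighbors) or interactively in the Weka Explorer.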
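
For part 3, the sketch below shows one common way of defining a distance over mixed nominal/numeric data: range-normalized differences for numeric attributes and a 0/1 difference for nominal ones. It is purely illustrative and is not claimed to be IB1’s actual distance function; your answer must describe what the IB1 source code actually does, including its treatment of missing values.

    public class MixedDistanceSketch {

        // numeric[a] marks whether attribute a is numeric; min[a] and max[a]
        // give the observed range of numeric attribute a in the training data.
        static double distance(Object[] x, Object[] y,
                               boolean[] numeric, double[] min, double[] max) {
            double sumSq = 0.0;
            for (int a = 0; a < x.length; a++) {
                double diff;
                if (numeric[a]) {
                    // Scale numeric differences to [0, 1] using the training range.
                    double range = max[a] - min[a];
                    double xv = ((Number) x[a]).doubleValue();
                    double yv = ((Number) y[a]).doubleValue();
                    diff = (range == 0) ? 0.0 : (xv - yv) / range;
                } else {
                    // Nominal attributes: 0 if the values match, 1 otherwise.
                    diff = x[a].equals(y[a]) ? 0.0 : 1.0;
                }
                sumSq += diff * diff;
            }
            return Math.sqrt(sumSq);
        }
    }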

 

Coding policy: you may not copy any code from anywhere (including the textbook, the Internet, or any other source) for Assignment 3. The only exceptions are your own code from earlier assignments and the framework provided above.

 

Suggestions for originality points: implement one of the alternative split criteria mentioned in part 1, carry out a more detailed investigation of nearest-neighbor algorithms (e.g. repeat part 2 on a different dataset), or provide further comments on IB1’s source code beyond the description requested in part 3.