Instructions for Detecting Plagiarism lab

Structure of the lab

The lab has two main goals: (i) to practice writing a design for some software before writing any code; (ii) to experience iteration of the design -- that is, changes to the design that result when one attempts to implement the original design. The sequence of steps for the lab is as follows:
  1. Read this entire list of instructions: That's what you're doing now.
  2. Read the background information: This is provided in the section below called "Description of the problem to be solved."
  3. Study the code framework provided: There is no need to absorb every detail; just make sure you are aware of which parts of the problem have already been solved for you.
  4. Write Design Document 1: Write a design document (we'll call it "Design Document 1") explaining how you plan to solve the problem.
  5. Implement Design Document 1
  6. Write Design Document 2: The main purpose of this document is to explain a new design for your solution to the problem, incorporating both changes you have implemented in the code you submitted, and further changes you would recommend if efficiency were a significant consideration. Divide your document into two sections, one for the changes you implemented and another for the ones you recommend in future.
  7. Optional: (No extra credit -- for fun only -- but if anyone attempts it, we will discuss the results in class.) Implement some or all of your recommendations for improving efficiency. Can you turn this into a genuinely effective piece of software that could be used for detecting plagiarism in a very large corpus of documents?

Description of the problem to be solved

The objective is to detect plagiarism in a corpus of text documents. In this lab, we adopt a very restrictive definition of plagiarism, based on the concept of n-grams. An n-gram is a sequence of n consecutive words in a document. For example, if a document consists of the text "the quick brown fox jumps over the lazy dog" then the set of all 3-grams in the document is: "the quick brown", "quick brown fox", "brown fox jumps", ..., "over the lazy", "the lazy dog". Two documents will be defined as suspiciously similar if they have too many n-grams in common.

Specifically, the inputs to the algorithm will be:

Your program should have a main method in the class PlagiarismFinder, which accepts the above three inputs as command line parameters in the given order. For example, if the documents are in a folder called /somewhere/here/sm_doc_set, the following command line runs the program with n=6 and T=100:
java PlagiarismFinder /somewhere/here/sm_doc_set 6 100

The output of the algorithm will be a list of all pairs of suspiciously similar documents. For each pair, the number of n-grams that appeared in both documents should also be displayed. Thus, a typical output will be:

530 abf0704.txt abf70402.txt
675 abf0704.txt edo26.txt
179 abf70402.txt edo26.txt
423 bef1121.txt edo14.txt
285 catchmeifyoucan.txt ecu201.txt
311 catchmeifyoucan.txt hal10.txt
384 catchmeifyoucan.txt tyc12.txt
1622 jrf1109.txt sra31.txt
In fact, the above output is precisely what should be produced when the above command line (with n=6 and T=100) is run on the small document set provided below. Your program may produce the lines of output in a different order, but the output must otherwise be identical.

A code framework is provided to help you get started. The assignment can be completed successfully by adding code only to the PlagiarismFinder class, and only in the places indicated by the comment "// ADD CODE HERE". However, you are welcome to edit the NGramGenerator class and the PlagiarismFinder class as much as you wish, and you can also create new classes if desired.

Note that there are many details of exactly how to compute n-grams that have been left unspecified in the above description (e.g. how to treat capitalization, whitespace, and non-alphabetic characters). These details are not discussed further here, because the provided NGramGenerator class computes the n-grams for you -- please read the code if you're interested in the details, but if you edit this part of the code, its functionality should remain the same.

Three different data sets are provided: small, medium, and large. As noted above, your solution should work on the small and medium data sets within less than 10 seconds (for typical values of the parameters, say n=6 and T=100), but it need not work on the large data set. Each data set contains a file, called catchmeifyoucan.txt, which is definitely plagiarized -- but there may be other examples of plagiarism in each data set.

Grading

Design Document 1 40%
Design Document 2 40%
Code 20%

The design documents will be graded on clarity, appropriate level of detail, and quality of writing. The code will be graded on correctness only, by running some reference tests.

Acknowledgment: This lab is based very closely on Baker Franke's Catching Plagiarists lab, which is made available as part of the the Stanford Nifty Assignments repository.