Instructions for Detecting Plagiarism lab

Structure of the lab

The lab has two main goals: (i) to practice writing a design for some software before writing any code; (ii) to experience iteration of the design -- that is, changes to the design that result when one attempts to implement the original design. The sequence of steps for the lab is as follows:

Read this entire list of instructions: That's what you're doing now.
Read the background information: This is provided in the section below called "Description of the problem to be solved."
Study the code framework provided: There is no need to absorb every detail; just make sure you are aware of which parts of the problem have already been solved for you.
Write Design Document 1: Write a design document (we'll call it "Design Document 1") explaining how you plan to solve the problem.
- You must submit Design Document 1 to Moodle before you write any code, and certainly before the end of class at 5pm. You may browse the code provided -- see below -- but you may not edit it.
- It is expected that you will take no more than one hour to write and submit Design Document 1; the deadline for submission is the end of class (5pm). Re-submissions will not be permitted: the first submission is your final submission. Late submissions will be penalized by 1% for every minute of lateness after 5pm.
- The expected length of Design Document 1 is 0.5-2 pages.
- The document must be written in English, although you may also use pseudocode and diagrams if you wish. We'll be studying more details of how to structure design documents later in the semester, but for now, just use common sense. The document must describe, clearly and succinctly, the main elements of your solution, in such a way that another computer programmer could use your document to implement your solution. Your document should mention any important data structures you plan to use (e.g. ArrayList, HashMap, arrays, Set), describing what data they will contain.
- Any common file format (e.g. PDF, Microsoft Word, OpenOffice, plain text) is acceptable, and any reasonable formatting conventions may be used.
Implement Design Document 1
- Submit your code by the due date for this assignment (i.e. the start of the next class). Code should be submitted to Moodle, as a single zip file of .java files only.
- You should make an honest attempt to implement the design in your document, but if you discover problems with that design, then it is fine to incorporate changes to the design in your code. Indeed, incorporating such changes is one of the primary goals of the lab. You should make a note of any design changes, since these will need to be described in Design Document 2 (see below).
- Don't worry too much about efficiency; for this lab, correctness is more important. However, woefully inefficient solutions will be penalized a little. Your solution should produce correct output for both the small and medium data sets provided below within less than 10 seconds on the lab machines. Your code need not work efficiently on the big data set.
Write Design Document 2: The main purpose of this document is to explain a new design for your solution to the problem, incorporating both changes you have implemented in the code you submitted, and further changes you would recommend if efficiency were a significant consideration. Divide your document into two sections, one for the changes you implemented and another for the ones you recommend in future.
- Submit your document to Moodle by the due date for this assignment.
- Design Document 2 should be 1-2 pages in length. In the very unlikely event that your original design was near-perfect and you have few changes to discuss or recommend, you should instead discuss potential designs for more challenging versions of the problem. (e.g. How would you detect near-duplicate documents on the entire web? How would you detect near-duplicate images?)
- In terms of style, use the same guidelines as for Design Document 1.
Optional: (No extra credit -- for fun only -- but if anyone attempts it, we will discuss the results in class.) Implement some or all of your recommendations for improving efficiency. Can you turn this into a genuinely effective piece of software that could be used for detecting plagiarism in a very large corpus of documents?

Description of the problem to be solved

The objective is to detect plagiarism in a corpus of text documents. In this lab, we adopt a very restrictive definition of plagiarism, based on the concept of n-grams. An n-gram is a sequence of n consecutive words in a document. For example, if a document consists of the text "the quick brown fox jumps over the lazy dog" then the set of all 3-grams in the document is: "the quick brown", "quick brown fox", "brown fox jumps", ..., "over the lazy", "the lazy dog". Two documents will be defined as suspiciously similar if they have too many n-grams in common.

Specifically, the inputs to the algorithm will be:

A directory containing some documents, each of which consists of ASCII text. The filename of each document to be analyzed is guaranteed to end in ".txt".
the phrase length n: the length of the n-grams to be analyzed
the in-common threshold T: if two different documents contain T or more of the same n-grams, the documents will be considered suspiciously similar.

Your program should have a main method in the class PlagiarismFinder, which accepts the above three inputs as command line parameters in the given order. For example, if the documents are in a folder called /somewhere/here/sm_doc_set, the following command line runs the program with n=6 and T=100:

java PlagiarismFinder /somewhere/here/sm_doc_set 6 100

The output of the algorithm will be a list of all pairs of suspiciously similar documents. For each pair, the number of n-grams that appeared in both documents should also be displayed. Thus, a typical output will be:

530 abf0704.txt abf70402.txt
675 abf0704.txt edo26.txt
179 abf70402.txt edo26.txt
423 bef1121.txt edo14.txt
285 catchmeifyoucan.txt ecu201.txt
311 catchmeifyoucan.txt hal10.txt
384 catchmeifyoucan.txt tyc12.txt
1622 jrf1109.txt sra31.txt

In fact, the above output is precisely what should be produced when the above command line (with n=6 and T=100) is run on the small document set provided below. Your program may produce the lines of output in a different order, but the output must otherwise be identical.

A code framework is provided to help you get started. The assignment can be completed successfully by adding code only to the PlagiarismFinder class, and only in the places indicated by the comment "// ADD CODE HERE". However, you are welcome to edit the NGramGenerator class and the PlagiarismFinder class as much as you wish, and you can also create new classes if desired.

Note that there are many details of exactly how to compute n-grams that have been left unspecified in the above description (e.g. how to treat capitalization, whitespace, and non-alphabetic characters). These details are not discussed further here, because the provided NGramGenerator class computes the n-grams for you -- please read the code if you're interested in the details, but if you edit this part of the code, its functionality should remain the same.

Three different data sets are provided: small, medium, and large. As noted above, your solution should work on the small and medium data sets within less than 10 seconds (for typical values of the parameters, say n=6 and T=100), but it need not work on the large data set. Each data set contains a file, called catchmeifyoucan.txt, which is definitely plagiarized -- but there may be other examples of plagiarism in each data set.

Grading

Design Document 1	40%
Design Document 2	40%
Code	20%

The design documents will be graded on clarity, appropriate level of detail, and quality of writing. The code will be graded on correctness only, by running some reference tests.

Acknowledgment: This lab is based very closely on Baker Franke's Catching Plagiarists lab, which is made available as part of the the Stanford Nifty Assignments repository.