I don't like it when people cheat on exams. I don't like people who steal other people's (intellectual) property.
Because of that I did not hesitate for too long, when I was given the opportunity to write a plagiarism detection system as part of my bachelor thesis.
The result of this work is now available on GitHub: https://github.com/yas4891/MUTEX
Background
Software Plagiarism has been rampant among students at my university (http://www.unibw.de) . My analysis in preparation for this project showed that the quota of ripped-off solutions handed in by students was generally around 25 percent and could get as high as 60 percent. Of course, this all depends on the difficulty of the assigned task. The more difficult the task is, the more students will try to cheat.
Plagiarism does not help anyone. The students who cheat don't put in the necessary effort to understand and solve the problem at hand. They subsequently do not learn programming and often fail exams. The professors get annoyed by seeing the same source code over and over again. It is essentially a loose-loose situation.
The Idea
To keep students away from software plagiarism and to improve the level of coding competency among students, it was decided to create a self-written IDE to be used by the students. The feature list includes functional and non-functional tests as well as plagiarism detection. The idea is to automatize the procedure of handing in coding assignments as much as reasonably possible, while at the same time making it harder for students to cheat the system. The new system should rigorously enforce software development best practices (through both functional and non-functional testing) and force every student to hand in his very own solution to the problem.
The student will use the IDE to develop their solution. When she thinks that she's solved the assigned task, she hits a button in the IDE. The source code is then send to a central server, where it is tested and checked for plagiarism. The plagiarism detection is done against the set of previously gathered source codes for the same task. This means that the first student to hand in a correct solution will automatically be judged as having presented a unique source code. The second student's solution is tested against the first solution. The third solution against the second and first solutions and so on.
If the source code does not pass both the tests and plagiarism detection, the student has an unlimited number of retries to rewrite the program.
My job in this over-arching project was to implement a system to reliably detect software plagiarism. Because of the iterative nature of the process layed out for the students – very similar to the way Test Driven Development is performed – the plagiarism detection system has to be fast enough to support this approach.
In absence of a better idea, the plagiarism detection system was named "MUTEX (Munich University Theft EXclusion)".
Requirements
The following requirements were layed out for the plagiarism detection system:
- comparison of a new source code against a set of 20 source codes should take no longer than 10 seconds
- programming language of the source codes: C (with an eye on possible extensions for C++, Java and VHDL)
- runnable as a CGI script withhin a custom written webserver
- operating system: Windows 7 x64
Detecting source code plagiarism
Now, lets have a look at how you actually find matching parts between two source files. The implementation language for this project is C# with .NET Framework 4.0 – in case you were wondering.
The problem of building a plagiarism detection system can be broken down into roughly two chunks – parsing the source code into tokens and comparing token sequences for matching sub sequences.
In this post I will focus solely on finding matching sub sequences shared between two token streams.
Definitions
I will use words and notations, that need some more explanation. So here it goes:
| sequence |
an ordered list of elements – e.g. strings or token streams. Notation: [h,e,l,l,o] or "hello" |
| subsequence |
a part of a sequence |
| Token |
a single element of a sequence. Tokens can be marked as already matched |
| Match |
an identical subsequence that is present in both sequences. Matches are not permanent and only existing for the length of one iteration of the algorithm |
| Tile |
a permanent and unique match. After the algorithm has finished only tiles and unmarked portions of the sequences remain |
| MinimumMatchLength / MML |
configurable parameter of the Greedy-String-Tiling-algorithm. Defines the minimum length of a match to be considered by the algorithm (and when calculating the similarity) |
| similarity |
the ratio of identical tokens between two sequences given in percent.
similarity = (sum(length of all matches) / length of shorter sequence) |
| maximum similarity |
given a new sequence NEW and a set of old sequences OLD_SET: the highest calculated similarity between NEW and any one element of OLD_SET |
| local confusion |
describes an alteration to the token stream that causes the algorithm to miss a otherwise valid match. Inserting a big enough number of locally confusing alterations into a token stream will make it look like valid, unique source code to the algorithm |
Greedy-String-Tiling
The Greedy-String-Tiling algorithm was originally devised by Michael J. Wise at the University of Sydney in 1993. If you like reading the works of people that easily outsmart you as much as I do, read his original report here.
The algorithm works on two sequences A and B. It consists of a total of five loops and can be broken into two phases. Phase 1 is used to find the longest matches in the remaining, yet unmarked portions of the two sequences; phase 2 marks the tokens of the longest matches found, thereby converting the match into a tile. Those two phases are repeated over and over again until no match longer than the MML is found.
Before Phase 1 we set LastMatchLength to MinimumMatchLength(MML) – so that if no longer match is found during this iteration the algorithm will terminate – and create a list to store the matches found in this iteration. In Phase 1 we iterate over all unmarked tokens in A and compare it with all the unmarked tokens in B. When it finds a match, it uses a third loop to extend the match as far possible (hence "greedy"). The loop takes the subsequent tokens in A and B as long as they match and are unmarked. After extending the match to its maximum, the algorithm checks whether the match is longer than or of equal length as the matches previously found during this iteration. If it is longer than the previous matches the list is reset and the new match is added as the only element into this list. If the match is of equal length it is simple appended to the list.
After phase 1 of each iteration we have a list that contains the matches with maximum length in the remaining parts of the sequence. With each iteration the length of the matches found decreases, eventually terminating the algorithm.
In phase 2 we iterate over all found matches, mark all the tokens in each match as used and store the match in the tiles collection (promoting the match to tile).
The relevant code for this simple GST implementation can be found in GSTLibrary.tile.GSTAlgorithm.
///</p>
<p> </p>
<p><summary> /// performs exactly one iteration of the algorithm /// </summary> public override void DoOneRun() { if(Finished) throw new GSTException("algorithm is finished"); var watch = Stopwatch.StartNew(); var listMatches = new List<tile<t>>(); LastMaximumMatch = MinimumMatchLength; // for every token in A that is unmarked foreach(var tA in ListA.Where(t => !t.Marked)) { var tokA = tA; // CLOSURE int indA = ListA.IndexOf(tokA); // for every token in B that is unmarked and matches tokA foreach(var tB in ListB.Where(t => !t.Marked).Where(tokA.EqualsTokenValue)) { //counter++; var tokB = tB; // CLOSURE int indB = ListB.IndexOf(tokB); int matchLength = CalculateMatchLength(indA, indB); if(matchLength >= LastMaximumMatch) { var tile = AbstractGSTAlgorithm.CreateTile(ListA, indA, indB, matchLength); AddTileToMatchList(listMatches, matchLength, tile); } } // foreach in B } // foreach in A foreach (var tile in listMatches) { MarkTileAsMatched(tile); ListTiles.Add(tile); } TilesMatchedInLastRun = listMatches; //Console.WriteLine("one run({1}) took {0} ms", watch.ElapsedMilliseconds, counter); }
runtime complexity of the Greedy-String-Tiling algorithm
The complexity of the algorithm is mainly determined by phase 1 and its three loops. The worst-case complexity is O(n3), best-case complexity is O(n2) – where n is the sequence length of the shorter sequence. The experienced complexity tends to be near to O(n3).
From my experience you can expect this simple implementation of the Greedy-String-Tiling algorithm to exceed runtimes of one second (exactly 1596 ms) at sequence lengths of about 900 tokens. With 1700 tokens you will see runtimes of about 11 seconds (11395 ms) on modern hardware (1.88X tokens, 7.3X runtime compared to 900 tokens). At about 2900 tokens the runtime will exceed 60 seconds (63909 ms; 3.2X tokens, 40.04X runtime).
As most of the source codes handed in by the students tend to be shorter than 900 tokens, the performance seems to be reasonable for our use case.
Conclusion
The Greedy-String-Tiling algorithm is a great heuristic to find similarities shared between two sequences. The implementation shown here is not optimized in any way, although there exists an optimization that reduces the average complexity to near O(n). This optimization will be discussed in one of the upcoming posts on this topic. The series will continue with posts on how to hide plagiarism and the creation of a special grammar for MUTEX to improve the detection rate.