First, please note that I am interested in how something like this works; I do not intend to build it for a client, as I’m sure open source implementations already exist.

How do the algorithms that detect plagiarism in uploaded text work? Do they use regex to send all words to an index, strip out common words like ‘the’, ‘a’, etc., and then count how many words different essays share? Do they then have a magic number of identical words that flags an essay as a possible duplicate? Do they use levenshtein()?

My language of choice is PHP.


I’m not thinking of checking for plagiarism globally, but rather within, say, 30 uploaded essays from one class, in case students have worked together on a strictly one-person assignment.
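To make the question concrete, here is a rough sketch of the naive word-overlap idea I have in mind. It is not a claim about how real plagiarism detectors work. The stop-word list, the function names (`essay_words`, `word_overlap`, `flag_similar_essays`), the 0.5 threshold, and the choice of Jaccard similarity as the score are all my own assumptions, and the tokenizer assumes plain English text.

```php
<?php
// Hypothetical stop-word list for illustration; a real one would be much longer.
const STOP_WORDS = ['the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'it'];

/** Lowercase the text, split it into words, and drop stop words and repeats. */
function essay_words(string $text): array
{
    // Assumes English text: only runs of a-z count as words.
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    $words = array_diff($matches[0], STOP_WORDS);
    return array_values(array_unique($words));
}

/** Jaccard similarity of two word sets: shared words / all distinct words. */
function word_overlap(array $a, array $b): float
{
    $union = count(array_unique(array_merge($a, $b)));
    if ($union === 0) {
        return 0.0; // two empty essays: treat as no overlap
    }
    return count(array_intersect($a, $b)) / $union;
}

/**
 * Compare every pair of essays in a class.
 * $essays maps a student name to the essay text.
 * Returns [nameA, nameB, score] for each pair scoring at or above $threshold.
 */
function flag_similar_essays(array $essays, float $threshold = 0.5): array
{
    $wordSets = array_map('essay_words', $essays); // keys are preserved
    $names = array_keys($essays);
    $flagged = [];
    for ($i = 0; $i < count($names); $i++) {
        for ($j = $i + 1; $j < count($names); $j++) {
            $score = word_overlap($wordSets[$names[$i]], $wordSets[$names[$j]]);
            if ($score >= $threshold) {
                $flagged[] = [$names[$i], $names[$j], $score];
            }
        }
    }
    return $flagged;
}

// Example: three short "essays" from one class.
$flagged = flag_similar_essays([
    'alice' => 'The cat sat on the mat',
    'bob'   => 'A cat sat on a mat',
    'carol' => 'Quantum physics is hard',
]);
```

For 30 essays this is only 435 pairwise comparisons, so the brute-force loop seems affordable. What I can’t tell is whether a bag-of-words score like this is meaningful at all, since it ignores word order and would presumably miss paraphrasing.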

Here is an online site that claims to do so:

