Plagiarism detection

Plagiarism detection refers to the process of locating instances of plagiarism within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others. Most cases of plagiarism are found in academia, where documents are typically essays or reports. However, plagiarism can be found in virtually any field, including scientific papers, art designs, and source code.

Detection can be either manual or computer-assisted. Manual detection requires substantial effort and excellent memory, and is impractical in cases where too many documents must be compared, or original documents are not available for comparison. Computer-assisted detection allows vast collections of documents to be compared to each other, making successful detection much more likely.

Use of search engines

An Internet search engine can be used to look up keywords or key sentences from a suspect document on the World Wide Web. This method can be highly effective when used on small and characteristic fragments, for instance a poem or a poetic translation. Although it can easily detect blatant cases, it is less effective when the plagiarizer has mixed multiple small fragments from different sources, and it will not return any relevant results if the search engine has not indexed the original source or sources. In addition, considerable effort is required to investigate each suspected case.

Plagiarism detection systems

A plagiarism detection system compares suspect documents to a large collection (corpus) of other documents and attempts to match parts of the suspect document to parts of those in the corpus. As with search engines, no plagiarism can be detected unless the corpus contains the documents from which the suspect has copied.
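
The matching step described above can be illustrated with a minimal Python sketch. It is not the method of any particular system: it assumes documents are plain strings, splits them into overlapping word n-grams ("shingles"), and scores each corpus document by the fraction of the suspect's shingles it contains. The names `shingles`, `containment`, and `rank_corpus`, as well as the choice of n = 3, are hypothetical and chosen only for illustration.

```python
import re


def shingles(text, n=3):
    """Return the set of lowercased word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(suspect, source, n=3):
    """Fraction of the suspect's shingles that also occur in source."""
    suspect_shingles = shingles(suspect, n)
    if not suspect_shingles:
        return 0.0
    shared = suspect_shingles & shingles(source, n)
    return len(shared) / len(suspect_shingles)


def rank_corpus(suspect, corpus, n=3):
    """Score every corpus document; return (doc_id, score), best match first.

    corpus is assumed to be a dict mapping a document id to its text.
    """
    scores = [(doc_id, containment(suspect, text, n))
              for doc_id, text in corpus.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    corpus = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "an entirely unrelated sentence about compilers",
    }
    suspect = "The quick brown fox jumps over a sleeping cat"
    for doc_id, score in rank_corpus(suspect, corpus):
        print(doc_id, round(score, 2))
```

A source that is missing from `corpus` can never be matched, which is the limitation noted above. A real system would also need an index rather than a linear scan over every document.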

Academic text-document plagiarism

The general design of academic plagiarism detection systems geared toward text documents involves a number of factors:

Most large-scale plagiarism detection systems use large internal databases (in addition to other resources) that grow with each additional document submitted for analysis. However, some consider this feature a violation of students' copyright.

Academic text-document plagiarism systems

Systems for detecting plagiarism in academic texts include:

Free (On-line)

* Plagiarismdetect
* Urkund
* SafeAssign
* Scanmyessay
* Plagiarism-detector
* Plagiarismscanner

Academic program plagiarism

Plagiarism in computer source code is also frequent, and requires different tools from those used for text documents. Significant research has been dedicated to academic source-code plagiarism.

A distinctive aspect of source-code plagiarism is that there are no essay mills of the kind found in traditional plagiarism. Since most programming assignments expect students to write programs with very specific requirements, it is very difficult to find existing programs that meet them. And since integrating external code is often harder than writing it from scratch, most plagiarizing students choose to copy from their peers.

According to Roy and Cordy, source-code similarity detection algorithms can be classified as based on one of the following:
* Strings - look for exact textual matches of segments, for instance five-word runs. Fast, but can be confused by renaming identifiers.
* Tokens - as with strings, but using a lexer to convert the program into tokens first. This discards whitespace, comments, and identifier names, making the system more robust to simple text replacements. Most academic plagiarism detection systems work at this level, using different algorithms to measure the similarity between token sequences.
* Parse Trees - build and compare parse trees. This allows higher-level similarities to be detected. For instance, tree comparison can normalize conditional statements, and detect equivalent constructs as similar to each other.
* Program Dependency Graphs (PDGs) - a PDG captures the actual flow of control in a program, and allows much higher-level equivalences to be located, at a greater expense in complexity and calculation time.
* Metrics - metrics capture 'scores' of code segments according to certain criteria; for instance, "the number of loops and conditionals", or "the number of different variables used". Metrics are simple to calculate and can be compared quickly, but can also lead to false positives: two fragments with the same scores on a set of metrics may do entirely different things.
* Hybrid approaches - for instance, parse trees + suffix trees can combine the detection capability of parse trees with the speed afforded by suffix trees, a type of string-matching data structure.
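
The difference between the string-based and token-based levels can be sketched in Python. This is an illustrative sketch only, not the algorithm of any system named in this article: it assumes the programs being compared are themselves Python source, uses the standard `tokenize` module as the lexer, and measures similarity as the Jaccard overlap of 4-grams. The function names, the placeholder tokens `ID`, `NUM`, and `STR`, and the value n = 4 are all hypothetical choices.

```python
import io
import keyword
import tokenize

# Token types that carry only comments or layout, not program content.
_SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalized_tokens(source):
    """Lex Python source, dropping comments and whitespace and
    replacing identifiers and literals with fixed placeholders."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME:
            # Keywords are kept; every other name becomes "ID".
            tokens.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")
        elif tok.type == tokenize.STRING:
            tokens.append("STR")
        else:
            tokens.append(tok.string)
    return tokens


def ngrams(sequence, n=4):
    """Set of all runs of n consecutive items in sequence."""
    return {tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def string_similarity(src_a, src_b, n=4):
    """String level: compare runs of n whitespace-separated words."""
    return jaccard(ngrams(src_a.split(), n), ngrams(src_b.split(), n))


def token_similarity(src_a, src_b, n=4):
    """Token level: compare runs of n normalized lexer tokens."""
    return jaccard(ngrams(normalized_tokens(src_a), n),
                   ngrams(normalized_tokens(src_b), n))


ORIGINAL = (
    "def total(values):\n"
    "    result = 0\n"
    "    for v in values:\n"
    "        result += v\n"
    "    return result\n"
)
# Same program with every identifier renamed and a comment added.
RENAMED = (
    "def add_all(items):  # sum\n"
    "    acc = 0\n"
    "    for x in items:\n"
    "        acc += x\n"
    "    return acc\n"
)

if __name__ == "__main__":
    print("string level:", string_similarity(ORIGINAL, RENAMED))
    print("token level: ", token_similarity(ORIGINAL, RENAMED))
```

On this pair the string-level score collapses because every word run contains a renamed identifier, while the token-level score is unaffected by the renaming. The same normalization also makes unrelated short programs look more alike, which is one reason practical systems tune the run length.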

The classification above was developed for code refactoring rather than for academic plagiarism detection (an important goal of refactoring is to avoid duplicate code, referred to as code clones in the literature). The approaches are effective against different levels of similarity: low-level similarity refers to identical text, while high-level similarity can be due to similar specifications. In an academic setting, where all students are expected to code to the same specifications, functionally equivalent code (with high-level similarity) is entirely expected, and only low-level similarity is considered proof of cheating.

Academic program plagiarism systems

MOSS and JPlag can be used free of charge, but both require registration and the software remains proprietary. Personal systems are normal desktop applications, and most of them are both free of charge and released as open-source software.

On-line [free]

* JPlag
* MOSS

Personal [free]

* AC
* Sherlock
* YAP
* SIM
* SID
* Plaggie


* Carrol, J. (2002). "A handbook for detecting plagiarism in higher education". Oxford: The Oxford Centre for Staff and Learning Development, Oxford Brookes University. (96 p.).

See also

* Copyscape
* Locality sensitive hashing
* Nearest neighbor search
* Kolmogorov complexity (compression) - used to estimate similarity between token sequences in several systems


External links

* Déjà Vu: a Database of Duplicate Citations in the Scientific Literature

Wikimedia Foundation. 2010.
