Diff

In computing, diff is a file comparison utility that outputs the differences between two files, or the changes made to a current file by comparing it to a former version of the same file. Diff displays the changes made per line for text files. Modern implementations also support binary files. [MacKenzie et al. "Binary Files and Forcing Text Comparison" in "Comparing and Merging Files with GNU Diff and Patch". Downloaded 28 April 2007. http://www.gnu.org/software/diffutils/manual/html_node/Binary.html] The output is called a "diff", or a patch, since the output can be applied with the Unix program patch. The output of similar file comparison utilities is also called a "diff". Like the use of the word "grep" for describing the act of searching, the word "diff" is used in jargon as a verb for calculating any difference.

History

The diff utility was developed in the early 1970s on the Unix operating system, which was emerging from AT&T Bell Labs in Murray Hill, New Jersey. The final version, first shipped with the 5th Edition of Unix in 1974, was entirely written by Douglas McIlroy. This research was published in a 1976 paper co-written with James W. Hunt, who developed an initial prototype of diff. [James W. Hunt and M. Douglas McIlroy, "An Algorithm for Differential File Comparison", Computing Science Technical Report 41, Bell Laboratories, June 1976. http://www.cs.dartmouth.edu/~doug/]

McIlroy's work was preceded and influenced by Steve Johnson's comparison program on GECOS and Mike Lesk's proof program. Proof also originated on Unix and, like diff, produced line-by-line changes and even used angle brackets (">" and "<") for presenting line insertions and deletions in the program's output. The heuristics used in these early applications were, however, deemed unreliable. The potential usefulness of a diff tool provoked McIlroy into researching and designing a more robust tool that could be used in a variety of tasks but still perform well within the processing and size limitations of the PDP-11's hardware. His approach to the problem also resulted from collaboration with individuals at Bell Labs including Alfred Aho, Elliot Pinson, Jeffrey Ullman, and Harold S. Stone.

In the context of Unix, the use of the ed line editor provided diff with the natural ability to create machine-usable "edit scripts". These edit scripts, when saved to a file, can be used by ed, along with the original file, to reconstitute the modified file in its entirety. This greatly reduced the secondary storage necessary to maintain multiple versions of a file. McIlroy considered writing a post-processor for diff where a variety of output formats could be designed and implemented, but he found it more frugal and simpler to have diff be responsible for generating the syntax and reverse-order input accepted by the ed command. In 1985, Larry Wall composed a separate utility, patch, that generalized and extended the ability to modify files with diff output. Modes in Emacs also allow for converting the format of patches and even editing patches interactively.

In diff's early years, common uses included comparing changes in the source of software code and markup for technical documents, verifying program debugging output, comparing filesystem listings and analyzing computer assembly code. The output targeted for ed was motivated by the need to store a sequence of modifications made to a file in compressed form. The Source Code Control System (SCCS) and its ability to archive revisions emerged in the late 1970s as a consequence of storing edit scripts from diff.

Project Xanadu is a conceptual predecessor of diff. It was a hypertext project first conceived in 1960 that was to include a version tracking system necessary for its "transpointing windows" feature. The feature subsumed file differences in the expansive term "transclusion", where a document has included in it parts of other documents or revisions.

Algorithm

The operation of diff is based on solving the longest common subsequence problem.

In this problem, you have two sequences of items:

a b c d f g h j q z

a b c d e f g i j k r x y z

and you want to find the longest sequence of items that is present in both original sequences in the same order. That is, you want to find a new sequence which can be obtained from the first sequence by deleting some items, and from the second sequence by deleting other items. You also want this sequence to be as long as possible. In this case it is

a b c d f g j z

From the longest common subsequence it's only a small step to get diff-like output:

e   h i   q k r x y
+   - +   - + + + +
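The step from a longest common subsequence to diff-like output can be sketched in Python. This is a minimal illustration using the textbook dynamic-programming table, not the algorithm that diff itself implements; the function names lcs and edit_marks are hypothetical and chosen only for this example.

```python
def lcs(a, b):
    """Return one longest common subsequence of the sequences a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the length of the LCS of a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the subsequence.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def edit_marks(a, b):
    """Return (sign, item) pairs: '-' for items only in a, '+' for items only in b."""
    marks, i, j = [], 0, 0
    for item in lcs(a, b):
        while a[i] != item:          # skipped in a: deleted
            marks.append(("-", a[i]))
            i += 1
        while b[j] != item:          # skipped in b: added
            marks.append(("+", b[j]))
            j += 1
        i += 1
        j += 1
    marks += [("-", x) for x in a[i:]]
    marks += [("+", x) for x in b[j:]]
    return marks


first = "a b c d f g h j q z".split()
second = "a b c d e f g i j k r x y z".split()
print(" ".join(lcs(first, second)))
print(" ".join(sign + item for sign, item in edit_marks(first, second)))
```

The table needs time and memory proportional to the product of the two lengths, which is one reason practical tools rely on more refined algorithms.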

Usage

diff is invoked from the command line with the names of two files: diff "original" "new". The output of the command represents the changes required to make the "original" file become the "new" file.

If "original" and "new" are directories, then diff will be run on each file that exists in both directories. An option, -r, will descend any matching subdirectories to compare files between directories.

All of the examples in this article use the following two files, "original" and "new".

"original":

    This part of the
    document has stayed the
    same from version to
    version. It shouldn't
    be shown if it doesn't
    change. Otherwise, that
    would not be helping to
    compress the size of the
    changes.

    This paragraph contains
    text that is outdated.
    It will be deleted in the
    near future.

    It is important to spell
    check this dokument. On
    the other hand, a
    misspelled word isn't
    the end of the world.
    Nothing in the rest of
    this paragraph needs to
    be changed. Things can
    be added after it.

"new":

    This is an important
    notice! It should
    therefore be located at
    the beginning of this
    document!

    This part of the
    document has stayed the
    same from version to
    version. It shouldn't
    be shown if it doesn't
    change. Otherwise, that
    would not be helping to
    compress anything.

    It is important to spell
    check this document. On
    the other hand, a
    misspelled word isn't
    the end of the world.
    Nothing in the rest of
    this paragraph needs to
    be changed. Things can
    be added after it.

    This paragraph contains
    important new additions
    to this document.

The command diff original new produces the following "normal diff output":

    0a1,6
    > This is an important
    > notice! It should
    > therefore be located at
    > the beginning of this
    > document!
    >
    8,14c14
    < compress the size of the
    < changes.
    <
    < This paragraph contains
    < text that is outdated.
    < It will be deleted in the
    < near future.
    ---
    > compress anything.
    17c17
    < check this dokument. On
    ---
    > check this document. On
    24a25,28
    >
    > This paragraph contains
    > important new additions
    > to this document.

In this traditional output format, a stands for "added", d for "deleted" and c for "changed". Line numbers of the original file appear before a/d/c and those of the modified file appear after. Angle brackets appear at the beginning of lines that are added, deleted or changed. Addition lines are added to the original file to appear in the new file. Deletion lines are deleted from the original file to be missing in the new file.

By default, lines common to both files are not shown. Lines that have moved will show up as added at their new location and as deleted at their old location. [David MacKenzie, Paul Eggert, and Richard Stallman, "Comparing and Merging Files with GNU Diff and Patch", 1997, ISBN 0-9541617-5-0. http://www.gnu.org/software/diffutils/manual/]
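The change commands of the normal format, such as 8,14c14, can be taken apart mechanically. The following Python sketch shows one way to read them; the name parse_change_command and the tuple it returns are assumptions made for this illustration and are not part of any diff tool.

```python
import re

# One or two line numbers, an action letter, then one or two line numbers.
CHANGE_COMMAND = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def parse_change_command(line):
    """Split a normal-format change command into
    (old_first, old_last, action, new_first, new_last)."""
    match = CHANGE_COMMAND.match(line)
    if match is None:
        raise ValueError(f"not a change command: {line!r}")
    old_first, old_last, action, new_first, new_last = match.groups()
    old_first, new_first = int(old_first), int(new_first)
    # A single number stands for a one-line range.
    return (
        old_first,
        int(old_last or old_first),
        action,
        new_first,
        int(new_last or new_first),
    )


for command in ("0a1,6", "8,14c14", "17c17", "24a25,28"):
    print(command, parse_change_command(command))
```

For an "a" command, the number before the letter names the line of the original file after which the new lines are placed, which is why 0 is a valid value there.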

Variations

Most diff implementations remain outwardly unchanged since 1975. The modifications include improvements to the core algorithm, the addition of useful features to the command, and the design of new output formats. The basic algorithm is described in the papers "An O(ND) Difference Algorithm and its Variations" by Eugene W. Myers [E. Myers, "An O(ND) Difference Algorithm and Its Variations", Algorithmica 1 (2), 1986, pp. 251–266. http://citeseer.ist.psu.edu/myers86ond.html] and in "A File Comparison Program" by Webb Miller and Myers. [Webb Miller and Eugene W. Myers, "A File Comparison Program", Software: Practice and Experience 15 (11), 1985, pp. 1025–1040. doi:10.1002/spe.4380151102] The algorithm was independently discovered and described in "Algorithms for Approximate String Matching", by E. Ukkonen. [E. Ukkonen, "Algorithms for Approximate String Matching", Information and Control 64, 1985, pp. 100–118. doi:10.1016/S0019-9958(85)80046-2] The first editions of the diff program were designed for line comparisons of text files expecting the newline character to delimit lines. By the 1980s, support for binary files resulted in a shift in the application's design and implementation.

Edit script

An edit script can still be generated by modern versions of diff with the -e option. The resulting edit script for this example is as follows:

    24a

    This paragraph contains
    important new additions
    to this document.
    .
    17c
    check this document. On
    .
    8,14c
    compress anything.
    .
    0a
    This is an important
    notice! It should
    therefore be located at
    the beginning of this
    document!

    .
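To illustrate how such a script rebuilds the new file, here is a minimal Python sketch of an interpreter for the three commands diff emits (a, c and d). It is an assumption-laden toy, not a replacement for ed: apply_ed_script is a hypothetical name, and the sketch ignores the special handling ed needs when a text line consists of a single period.

```python
import re

COMMAND = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")


def apply_ed_script(lines, script):
    """Apply an ed-style script (a, c and d commands only) to a list of lines."""
    result = list(lines)
    pos = 0
    while pos < len(script):
        match = COMMAND.match(script[pos])
        if match is None:
            raise ValueError(f"unsupported command: {script[pos]!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)
        action = match.group(3)
        pos += 1
        text = []
        if action in "ac":
            # Input text runs up to a line holding a single period.
            while script[pos] != ".":
                text.append(script[pos])
                pos += 1
            pos += 1  # skip the terminating "."
        if action == "a":
            result[start:start] = text       # insert after line `start`
        else:
            result[start - 1:end] = text     # replace or delete lines start..end
    return result


old = ["one", "two", "three", "four"]
script = ["4a", "five", ".", "2,3c", "TWO", ".", "0a", "zero", "."]
print(apply_ed_script(old, script))
```

Because the commands arrive from the bottom of the file upward, each one can use line numbers from the original file: nothing above the current position has been shifted yet.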

Context format

The Berkeley distribution of Unix made a point of adding the "context format" (-C) and the ability to recurse on filesystem directory structures (-r), adding those features in 2.8 BSD, released in July 1981. The context format of diff introduced at Berkeley helped with distributing patches for source code that may have been changed minimally.

In the context format, any changed lines are shown alongside unchanged lines before and after. The inclusion of any number of unchanged lines provides a "context" to the patch. The context consists of lines that have not changed between the two files, so it can be used as a reference to locate the hunk's place in a modified file and to find the intended location of a change regardless of whether the line numbers still correspond. The context format is more readable for humans, is more reliable when a patch is applied, and is accepted as input by the patch program. This intelligent behavior isn't possible with the traditional diff output.

The number of unchanged lines shown above and below a change "hunk" can be defined by the user, even zero, but three lines is typically the default. If the context of unchanged lines in a hunk overlaps with an adjacent hunk, then diff will avoid duplicating the unchanged lines and merge the hunks into a single hunk.
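The merging rule can be sketched in a few lines of Python. The function hunk_ranges below is a hypothetical helper written for this illustration; it assumes each change is given as a (first, last) range of original line numbers and that hunks are merged when their context overlaps or touches, which may differ in detail from what a particular diff implementation does.

```python
def hunk_ranges(changes, context, total_lines):
    """Expand each (first, last) changed range by `context` lines
    and merge the ranges whose context overlaps or touches."""
    hunks = []
    for first, last in sorted(changes):
        start = max(1, first - context)
        end = min(total_lines, last + context)
        if hunks and start <= hunks[-1][1] + 1:
            # The context reaches the previous hunk: extend that hunk.
            hunks[-1] = (hunks[-1][0], max(hunks[-1][1], end))
        else:
            hunks.append((start, end))
    return hunks


# Two changes in a 24-line file: lines 8-14 and line 17.
print(hunk_ranges([(8, 14), (17, 17)], context=3, total_lines=24))
print(hunk_ranges([(8, 14), (17, 17)], context=0, total_lines=24))
```

With three lines of context the two changes share lines 15 and 16 and collapse into a single hunk covering lines 5 to 20; with no context they stay separate.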

A "!" represents a change between lines that correspond in the two files. A "+" represents the addition of a line, a "-" the deletion of a line, and a blank space an unchanged line. At the beginning of the patch is the file information, including the full path and a time stamp. At the beginning of each hunk are the line numbers that apply for the corresponding change in the files. A number range appearing between sets of three asterisks applies to the original file, while a range between sets of three dashes applies to the new file. Each range gives the first and last line numbers that the hunk covers in the respective file.

The command diff -c original new produces the following output:

    *** /path/to/original "timestamp"
    --- /path/to/new "timestamp"
    ***************
    *** 1,3 ****
    --- 1,9 ----
    + This is an important
    + notice! It should
    + therefore be located at
    + the beginning of this
    + document!
    +
      This part of the
      document has stayed the
      same from version to
    ***************
    *** 5,20 ****
      be shown if it doesn't
      change. Otherwise, that
      would not be helping to
    ! compress the size of the
    ! changes.
    !
    ! This paragraph contains
    ! text that is outdated.
    ! It will be deleted in the
    ! near future.

      It is important to spell
    ! check this dokument. On
      the other hand, a
      misspelled word isn't
      the end of the world.
    --- 11,20 ----
      be shown if it doesn't
      change. Otherwise, that
      would not be helping to
    ! compress anything.

      It is important to spell
    ! check this document. On
      the other hand, a
      misspelled word isn't
      the end of the world.
    ***************
    *** 22,24 ****
    --- 22,28 ----
      this paragraph needs to
      be changed. Things can
      be added after it.
    +
    + This paragraph contains
    + important new additions
    + to this document.
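Output in this style can also be produced programmatically. As a small illustration, Python's standard difflib module offers context_diff; the three-line inputs and the file names below are made up for the example, and difflib's choice of hunks is not guaranteed to match that of the diff utility on larger inputs.

```python
import difflib

old_lines = ["one\n", "two\n", "three\n"]
new_lines = ["one\n", "2\n", "three\n"]

# n defaults to 3 lines of context, mirroring the usual default of diff -c.
context_lines = list(
    difflib.context_diff(old_lines, new_lines, fromfile="original", tofile="new")
)
print("".join(context_lines), end="")
```

The changed line appears twice, once in the original-file part and once in the new-file part of the hunk, each time marked with "!".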

Unified format

The "unified format" (or "unidiff") inherits the technical improvements made by the context format, but produces a smaller diff with old and new text presented immediately adjacent. Unified format is usually invoked using the "-u" command line option. This output is often used as input to the patch program. Many projects specifically request that "diffs" be submitted in the unified format, making unified diff format the most common format for exchange between software developers.

Unified context diffs were originally developed by Wayne Davison in August 1990 (in unidiff which appeared in Volume 14 of comp.sources.misc). Richard Stallman added unified diff support to the GNU Project's diff utility one month later, and the feature debuted in GNU diff 1.15, released in January 1991. GNU diff has since generalized the context format to allow arbitrary formatting of diffs.

The format starts with the same two-line header as the context format, except that the original file is preceded by "---" and the new file is preceded by "+++". Following this are one or more change hunks (chunks) that contain the line differences in the file. The unchanged, contextual lines are preceded by a space character, addition lines are preceded by a plus sign, and deletion lines are preceded by a minus sign.

A chunk begins with range information and is immediately followed with the line additions, line deletions, and any number of the contextual lines. The range information is surrounded by double-at signs and combines onto a single line what appears on two lines for the context format (see above). The format of the range information line is as follows:

@@ -R +R @@

The chunk range information contains two chunk ranges. The one preceded by a minus symbol is the range for the chunk of the original file, and the range for the new file is preceded by a plus symbol. Each chunk range, "R", is of the format "l,s" where "l" is the starting line number and "s" is the number of lines the change hunk applies to for each respective file. In many versions of GNU diff, "R" can omit the comma and trailing value "s", in which case "s" defaults to 1. Note that the only really interesting value is the "l" line number of the first range; all the other values can be computed from the diff.

The chunk size for the original file should be the sum of all contextual and deletion (including changed) chunk lines. The chunk size for the new file should be the sum of all contextual and addition (including changed) chunk lines. If the chunk size information does not correspond with the number of lines in the hunk, then the diff could be considered invalid and be rejected.
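The range line and the size check can be illustrated with a short Python sketch. The names parse_hunk_header and hunk_is_consistent are hypothetical, and the sketch assumes that blank context lines still carry their leading space, which some mail and editing tools strip.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_size, new_start, new_size) for an '@@ -R +R @@' line."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_size, new_start, new_size = match.groups()
    # A missing ",s" part means a size of 1.
    return (
        int(old_start),
        int(old_size) if old_size is not None else 1,
        int(new_start),
        int(new_size) if new_size is not None else 1,
    )


def hunk_is_consistent(header, body):
    """Check that the sizes in the header match the lines of the hunk body."""
    _, old_size, _, new_size = parse_hunk_header(header)
    old_count = sum(1 for line in body if line[:1] in (" ", "-"))
    new_count = sum(1 for line in body if line[:1] in (" ", "+"))
    return (old_count, new_count) == (old_size, new_size)


body = [
    " this paragraph needs to",
    " be changed. Things can",
    " be added after it.",
    "+",
    "+This paragraph contains",
    "+important new additions",
    "+to this document.",
]
print(parse_hunk_header("@@ -22,3 +22,7 @@"))
print(hunk_is_consistent("@@ -22,3 +22,7 @@", body))
```

Context lines count toward both sizes, which is why a hunk with three context lines and four additions is described as 3 lines in the original and 7 in the new file.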

If a line is modified, it is represented as a deletion and an addition. Since the lines from the original and the new file appear in the same hunk, such changes appear adjacent to one another. [Guido van Rossum, "Unified Diff Format", June 14, 2006. http://www.artima.com/weblogs/viewpost.jsp?thread=164293] An occurrence of this in the example below is:

-check this dokument. On
+check this document. On
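The same deletion-plus-addition pairing can be reproduced with Python's standard difflib module, shown here only as an illustration. The three input lines are taken from the example files; the file names passed to unified_diff are arbitrary labels, and difflib may choose hunks differently from the diff utility on larger inputs.

```python
import difflib

before = [
    "It is important to spell\n",
    "check this dokument. On\n",
    "the other hand, a\n",
]
after = [
    "It is important to spell\n",
    "check this document. On\n",
    "the other hand, a\n",
]

unified_lines = list(
    difflib.unified_diff(before, after, fromfile="original", tofile="new")
)
print("".join(unified_lines), end="")
```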

The command diff -u original new produces the following output:

    --- original "timestamp"
    +++ new "timestamp"
    @@ -1,3 +1,9 @@
    +This is an important
    +notice! It should
    +therefore be located at
    +the beginning of this
    +document!
    +
     This part of the
     document has stayed the
     same from version to
    @@ -5,16 +11,10 @@
     be shown if it doesn't
     change. Otherwise, that
     would not be helping to
    -compress the size of the
    -changes.
    -
    -This paragraph contains
    -text that is outdated.
    -It will be deleted in the
    -near future.
    +compress anything.
     
     It is important to spell
    -check this dokument. On
    +check this document. On
     the other hand, a
     misspelled word isn't
     the end of the world.
    @@ -22,3 +22,7 @@
     this paragraph needs to
     be changed. Things can
     be added after it.
    +
    +This paragraph contains
    +important new additions
    +to this document.

There are some modifications and extensions to the diff formats that are used and understood by certain programs and in certain contexts. For example, some revision control systems, such as Subversion, specify a version number, "working copy", or any other comment instead of a timestamp in the diff's header section. Some tools allow diffs for several different files to be merged into one, using a header for each modified file that may look something like this:

Index: path/to/file.cpp
===================================================================

As a special case, unified diff expects to work with files that end in a newline. If either file does not, unified diff will emit the special line

\ No newline at end of file

after the modifications. The patch program should be aware of this.

Others

The postprocessors sdiff and diffmk render side-by-side diff listings and apply change marks to printed documents, respectively. Both were developed elsewhere in Bell Labs in or before 1981.

Diff3 compares one file against two other files. It was originally developed by Paul Jensen to reconcile changes made by two persons editing a common source. It is also used internally by many revision control systems.

GNU diff and diff3 are included in the diffutils package with other diff and patch related utilities. Emacs has Ediff for showing the changes a patch would provide in a user interface that combines interactive editing and merging capabilities for patch files.

[http://www.gnu.org/software/wdiff/wdiff.html Wdiff] makes it easy to see the words or phrases that changed in a text document, especially in the presence of word-wrapping or different column widths. [http://hpux.cs.utah.edu/hppd/hpux/Text/spiff-1.0 Spiff] goes yet further, ignoring floating point differences under a tunable precision and ignoring irrelevancies in program files such as whitespace and comment formatting. [http://code.google.com/p/daisydiff/ Daisy Diff] diffs HTML documents and reconstructs the layout and style information. A number of tools for XML diffing and patching have been published, too, for instance by Microsoft and IBM's alphaworks.

See also

* Comparison of file comparison tools
* comm
* cmp
* Delta encoding
* History of software configuration management
* Kompare
* Levenshtein distance
* Longest common subsequence problem
* Meld (software)
* Microsoft File Compare
* Revision Control System
* rsync
* Software configuration management
* tkdiff
* List of Unix programs
* diff3

References

* [http://doi.acm.org/10.1145/359460.359467]
* A generic implementation of the Myers SES/LCS algorithm with the Hirschberg linear space refinement [http://www.ioplex.com/~miallen/libmba/dl/src/diff.c (C source code)]

External links

* [http://www.gnu.org/software/diffutils/diffutils.html GNU Diff utilities] - Made available by the Free Software Foundation. Free documentation. Free source code.
* [http://kdiff3.sourceforge.net/ KDIFF3] - Another GUI diff-like tool
* [http://gnuwin32.sourceforge.net/packages/diffutils.htm DiffUtils for Windows] - Part of GnuWin32
* [http://www.iconv.com/diff.htm Online interface to the "diff" program]
* [http://search.cpan.org/~tyemq/Algorithm-Diff-1.1902/lib/Algorithm/Diff.pm Algorithm::Diff] - A diff library implemented in Perl
* [http://www.incava.org/projects/java-diff/ java-diff] - A diff library implemented in Java
* JavaScript diff algorithms: [http://ejohn.org/projects/javascript-diff-algorithm/ jsdiff], [http://xindiff.cvs.sourceforge.net/*checkout*/xindiff/XinDiff/POC/XinDiff.html XinDiff], [http://code.google.com/p/google-diff-match-patch/ google-diff-match-patch]
* [http://www.mathertel.de/Diff/ diff algorithm in C#] - Source code of "An O(ND) Difference Algorithm and its Variations" in C#
* [http://code.google.com/p/daisydiff/ DaisyDiff] - HTML differ
* [http://winmerge.org/ Winmerge] - GUI diff-like tool
* [http://www.silver-island.com/apps/FlexDiff/ Adobe Flex Diff] - Diff app implemented in Adobe Flex
* [http://meld.sourceforge.net/ Meld] - Gnome GUI diff tool


Wikimedia Foundation. 2010.
