3 min read
How does unix diff algorithm work

whats is diff in UNIX?

diff is a data comparison tool that calculates and displays the differences between two files. diff is line-oriented rather than character-oriented, it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other

according to the web:

” The basic algorithm is described in An O(ND) Difference Algorithm and its Variations, Eugene W. Myers, Algorithmica Vol. 1 No. 2, 1986, pp. 251-266; and in A File Comparison Program, Webb Miller and Eugene W. Myers, Software “Practice and Experience Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in Algorithms for Approximate String Matching, E. Ukkonen, `Information and Control Vol. 64, 1985, pp. 100-118.”

Simple Implementation of Diff

here is a simple implementation of diff to find the difference of 2 string ( or can be extended for 2 files)

Firstly Define a function to fill up the table of the longest common subsequence:

lcs_fill is based on the longest common subsequence algorithm of dynamic programming to get the longest common part between two strings

Algo for filling Table

  • Iterate throughout the table for max of x and y and fillup the LCS[0][y] LCS[x][0] = 0 i.e. first row and column of table
  • if word match: set table current element = one more that diagonally previous element
  • else: maximum of horizontally and vertically previous element


/* lcs_fill to fill up the lcs table */
lcs_fill (char* s1, char* s2,
          int   x,  int   y)
    for (int i =  0; i <= x; i++) {
        for (int j = 0; j <= y; j++) {

            /* Starting row and col are filled with 0 */
            if (i == 0 || j == 0)
                LCS[i][j] = 0;

            /* if current character of s1 and s2 matches 
             * then LCS index == 1 + diagonally previous value */
            else if (s1[i-1] == s2[j-1])
                LCS[i][j] = LCS[i-1][j-1] + 1;
            /* if not matches, then it will be max of above or previous index*/
                LCS[i][j] = max(LCS[i-1][j],LCS[i][j-1]);


Algo for diff:

  • iterate throughout the strings from last and check if the last element is same
  • if same: print
  • else: check if the element of second is present in first
  • if yes: print with + (means that is added)
  • else: check if an element of first is present in the second
  • if yes: print with – (means that element is removed)


/* diff to display the difference */
diff (char* s1, char* s2,
      int   x,  int   y)
    /* if last char of s1 == s2 */
    if ( x > 0 && y > 0
        && s1[x-1] == s2[y-1])
        printf(" %c",s1[x-1]);

    /* if char of *s2 not in *s1 */
    else if (y > 0 &&
        (x == 0 || LCS[x][y-1] >= LCS[x-1][y]))
        printf(" +%c",s2[y-1]);

    /* if char of *s1 not in *s2 */
    else if (x > 0 &&
        (y == 0 || LCS[x][y-10] < LCS[x-1][y]))
        printf(" -%c",s1[x-1]);

Complete code is available at: Click Here


Manjeet Singh

My Code Never Has Bugs, Just Random Features


Leave a Reply