How does unix diff algorithm work2 min read

How does unix diff algorithm work

whats is diff in UNIX?

diff is a data comparison tool that calculates and displays the differences between two files. diff is line-oriented rather than character-oriented, it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other

according to the web:

” The basic algorithm is described in “An O(ND) Difference Algorithm and its Variations”, Eugene W. Myers, ‘Algorithmica’ Vol. 1 No. 2, 1986, pp. 251-266; and in “A File Comparison Program”, Webb Miller and Eugene W. Myers, ‘Software –Practice and Experience’ Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in “Algorithms for Approximate String Matching”, E. Ukkonen, `Information and Control’ Vol. 64, 1985, pp. 100-118.”

Simple Implementation of Diff

here is a simple implementation of diff to find the difference of 2 string ( or can be extended for 2 files)

Firstly Define a function to fill up the table of the longest common subsequence:

lcs_fill is based on the longest common subsequence algorithm of dynamic programming to get the longest common part between two strings

Algo for filling Table

  • Iterate throughout the table for max of x and y and fillup the LCS[0][y] LCS[x][0] = 0 i.e. first row and column of table
  • if word match: set table current element = one more that diagonally previous element
  • else: maximum of horizontally and vertically previous element

Code:

/* lcs_fill to fill up the lcs table */
void
lcs_fill (char* s1, char* s2,
          int   x,  int   y)
{
    for (int i =  0; i <= x; i++) {
        for (int j = 0; j <= y; j++) {

            /* Starting row and col are filled with 0 */
            if (i == 0 || j == 0)
                LCS[i][j] = 0;

            /* if current character of s1 and s2 matches 
             * then LCS index == 1 + diagonally previous value */
            else if (s1[i-1] == s2[j-1])
                LCS[i][j] = LCS[i-1][j-1] + 1;
            
            /* if not matches, then it will be max of above or previous index*/
            else
                LCS[i][j] = max(LCS[i-1][j],LCS[i][j-1]);

        }   
    }
}

Algo for diff:

  • iterate throughout the strings from last and check if the last element is same
  • if same: print
  • else: check if the element of second is present in first
  • if yes: print with + (means that is added)
  • else: check if an element of first is present in the second
  • if yes: print with – (means that element is removed)

Code:

/* diff to display the difference */
void
diff (char* s1, char* s2,
      int   x,  int   y)
{
    /* if last char of s1 == s2 */
    if ( x > 0 && y > 0
        && s1[x-1] == s2[y-1])
    {
        diff(s1,s2,x-1,y-1);
        printf(" %c",s1[x-1]);
    }

    /* if char of *s2 not in *s1 */
    else if (y > 0 &&
        (x == 0 || LCS[x][y-1] >= LCS[x-1][y]))
    {
        diff(s1,s2,x,y-1);
        printf(" +%c",s2[y-1]);
    }

    /* if char of *s1 not in *s2 */
    else if (x > 0 &&
        (y == 0 || LCS[x][y-10] < LCS[x-1][y]))
    {
        diff(s1,s2,x-1,y);
        printf(" -%c",s1[x-1]);
    }
}

Complete code is available at: Click Here

Bookmark(0)

Add a Comment

Your email address will not be published. Required fields are marked *