Random Forest in Machine Learning Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by…

# How does unix diff algorithm work2 min read

# whats is diff in UNIX?

**diff** is a data comparison tool that calculates and displays the differences between two files. diff is line-oriented rather than character-oriented, it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other

according to the web:

” The basic algorithm is described in “An O(ND) Difference Algorithm and its Variations”, Eugene W. Myers, ‘Algorithmica’ Vol. 1 No. 2, 1986, pp. 251-266; and in “A File Comparison Program”, Webb Miller and Eugene W. Myers, ‘Software –Practice and Experience’ Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in “Algorithms for Approximate String Matching”, E. Ukkonen, `Information and Control’ Vol. 64, 1985, pp. 100-118.”

**Simple Implementation of Diff**

here is a simple implementation of diff to find the difference of 2 string ( or can be extended for 2 files)

Firstly Define a function to fill up the table of the longest common subsequence:

** lcs_fill** is based on the longest common subsequence algorithm of dynamic programming to get the longest common part between two strings

**Algo for filling Table**

- Iterate throughout the table for max of x and y and fillup the LCS[0][y] LCS[x][0] = 0 i.e. first row and column of table
- if word match: set table current element = one more that diagonally previous element
- else: maximum of horizontally and vertically previous element

**Code:**

/* lcs_fill to fill up the lcs table */ void lcs_fill (char* s1, char* s2, int x, int y) { for (int i = 0; i <= x; i++) { for (int j = 0; j <= y; j++) { /* Starting row and col are filled with 0 */ if (i == 0 || j == 0) LCS[i][j] = 0; /* if current character of s1 and s2 matches * then LCS index == 1 + diagonally previous value */ else if (s1[i-1] == s2[j-1]) LCS[i][j] = LCS[i-1][j-1] + 1; /* if not matches, then it will be max of above or previous index*/ else LCS[i][j] = max(LCS[i-1][j],LCS[i][j-1]); } } }

**Algo for diff:**

- iterate throughout the strings from last and check if the last element is same
- if same: print
- else: check if the element of second is present in first
- if yes: print with + (means that is added)
- else: check if an element of first is present in the second
- if yes: print with – (means that element is removed)

**Code:**

/* diff to display the difference */ void diff (char* s1, char* s2, int x, int y) { /* if last char of s1 == s2 */ if ( x > 0 && y > 0 && s1[x-1] == s2[y-1]) { diff(s1,s2,x-1,y-1); printf(" %c",s1[x-1]); } /* if char of *s2 not in *s1 */ else if (y > 0 && (x == 0 || LCS[x][y-1] >= LCS[x-1][y])) { diff(s1,s2,x,y-1); printf(" +%c",s2[y-1]); } /* if char of *s1 not in *s2 */ else if (x > 0 && (y == 0 || LCS[x][y-10] < LCS[x-1][y])) { diff(s1,s2,x-1,y); printf(" -%c",s1[x-1]); } }

Complete code is available at: Click Here