Edit Distance
Reference:
Edit Distance
Minimum Edit Distance (最小編輯距離),是用來計算兩個字串的相似程度。
可應用於拼字校正、或是計算兩個DNA序列的相似程度。
- The minimum edit distance between two strings
- Is the Minimum number of editing operations
- Insertion (i)
- Deletion (d)
- Substitution (s)
- Needed to transform one into the other
Example1:
I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s i s
- If each operation has cost of 1
- Distance between these is 5
- If substitution cost 2
- Distance between them is 5
Example2: Minimum Edit Distance
- Source String: SOT
- Target String : STOP
- If substitution cost 2
Please refer to the hyperlink WIKI: Levenshtein distance, you can get a table as below.
Description:
上圖的箭頭路徑是所有minimum edit distance的路徑,接下來會以紅色箭頭來做解說.
- Step1. "S" = "S" ("SOT")
- Step2. delete "O" ("ST")
- Step3. "T" = "T" ("ST")
- Step4. insert "O" ("STO")
- Step5. insert "P" ("STOP")
GeeksforGeeks Version:
Examples:
Input: str1 = "geek", str2 = "gesek"
Output: 1
We can convert str1 into str2 by inserting a 's'.
Input: str1 = "cat", str2 = "cut"
Output: 1
We can convert str1 into str2 by replacing 'a' with 'u'.
Input: str1 = "sunday", str2 = "saturday"
Output: 3
Last three and first characters are same. We basically
need to convert "un" to "atur". This can be done using
below three operations.
Replace 'n' with 'r', insert t, insert a
Recursive Solution:
int min(int x, int y, int z)
{
if(x<y)
return x<z? x:z;
else
return y<z? y:z;
}
int EditDistance(char *s1, int m, char *s2, int n)
{
if(m == 0) return n;
if(n == 0) return m;
if(s1[m-1] == s2[n-1])
return EditDistance(s1, m-1, s2, n-1);
return 1 + min(EditDistance(s1, m-1, s2, n),
EditDistance(s1, m, s2, n-1),
EditDistance(s1, m-1, s2, n-1));
}
Temporary Array Solution:
EX:
str1="sunday"
str2="saturday"
The following is Temporary Array:
0 1 2 3 4 5 6 7 8
1 0 1 2 3 4 5 6 7
2 1 1 2 2 3 4 5 6
3 2 2 2 3 3 4 5 6
4 3 3 3 3 4 3 4 5
5 4 3 4 4 4 4 3 4
6 5 4 4 5 5 5 4 3
int EditDistance(char *s1, int m, char *s2, int n)
{
int **dp;
int res, i, j;
// create temporary array by malloc
dp = (int **)malloc((m+1)*sizeof(int *));
if(dp == NULL)
return -1;
for(i=0; i<m+1; i++) {
dp[i] = (int*)malloc((n+1)*sizeof(int));
if(dp[i] == NULL)
return -1;
}
//caculate
for(i=0; i<m+1; i++) {
for(j=0; j<n+1; j++) {
if(i == 0)
dp[i][j] = j;
else if(j == 0)
dp[i][j] = i;
else if(s1[i-1] == s2[j-1])
dp[i][j] = dp[i-1][j-1];
else
dp[i][j] = 1 + min(dp[i-1][j],
dp[i][j-1],
dp[i-1][j-1]);
}
}
res = dp[m][n];
for(i=0; i<m+1; i++)
free(dp[i]);
free(dp);
return res;
}