Edit Distance

Reference:

  1. Stanford: Minimum Edit
    Distance
  2. 北京大学中文系: 最小编辑距离算法
  3. GeeksforGeeks: Edit Distance
  4. WIKI: Levenshtein distance

  5. 自然語言處理 -- Minimum Edit Distance


Edit Distance

Minimum Edit Distance (最小編輯距離),是用來計算兩個字串的相似程度。
可應用於拼字校正、或是計算兩個DNA序列的相似程度

  • The minimum edit distance between two strings
  • Is the Minimum number of editing operations
    1. Insertion (i)
    2. Deletion (d)
    3. Substitution (s)
  • Needed to transform one into the other

Example1:

I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s
  • If each operation has cost of 1
    • Distance between these is 5
  • If substitution cost 2
    • Distance between them is 5

Example2: Minimum Edit Distance

  • Source String: SOT
  • Target String : STOP
  • If substitution cost 2

Please refer to the hyperlink WIKI: Levenshtein distance, you can get a table as below.
Cowman

Description:

上圖的箭頭路徑是所有minimum edit distance的路徑,接下來會以紅色箭頭來做解說.

  • Step1. "S" = "S" ("SOT")
  • Step2. delete "O" ("ST")
  • Step3. "T" = "T" ("ST")
  • Step4. insert "O" ("STO")
  • Step5. insert "P" ("STOP")

GeeksforGeeks Version:

Examples:

Input:   str1 = "geek", str2 = "gesek"
Output:  1
We can convert str1 into str2 by inserting a 's'.

Input:   str1 = "cat", str2 = "cut"
Output:  1
We can convert str1 into str2 by replacing 'a' with 'u'.

Input:   str1 = "sunday", str2 = "saturday"
Output:  3
Last three and first characters are same.  We basically
need to convert "un" to "atur".  This can be done using
below three operations. 
Replace 'n' with 'r', insert t, insert a

Recursive Solution:

int min(int x, int y, int z)
{
    if(x<y)
        return x<z? x:z;
    else
        return y<z? y:z;
}


int EditDistance(char *s1, int m, char *s2, int n)
{
    if(m == 0)  return n;
    if(n == 0)  return m;

    if(s1[m-1] == s2[n-1])
        return EditDistance(s1, m-1, s2, n-1);

    return 1 + min(EditDistance(s1, m-1, s2, n),
                   EditDistance(s1, m, s2, n-1),
                   EditDistance(s1, m-1, s2, n-1));
}

Temporary Array Solution:

EX:

str1="sunday"
str2="saturday"

The following is Temporary Array:

0 1 2 3 4 5 6 7 8
1 0 1 2 3 4 5 6 7
2 1 1 2 2 3 4 5 6
3 2 2 2 3 3 4 5 6
4 3 3 3 3 4 3 4 5
5 4 3 4 4 4 4 3 4
6 5 4 4 5 5 5 4 3
int EditDistance(char *s1, int m, char *s2, int n)
{
    int **dp;
    int res, i, j;

    // create temporary array by malloc
    dp = (int **)malloc((m+1)*sizeof(int *));
    if(dp == NULL)
        return -1;
    for(i=0; i<m+1; i++) {
        dp[i] = (int*)malloc((n+1)*sizeof(int));
        if(dp[i] == NULL)
            return -1;
    }

    //caculate
    for(i=0; i<m+1; i++) {
        for(j=0; j<n+1; j++) {
            if(i == 0)
                dp[i][j] = j;
            else if(j == 0)
                dp[i][j] = i;      
            else if(s1[i-1] == s2[j-1])
                dp[i][j] = dp[i-1][j-1];
            else
                dp[i][j] = 1 + min(dp[i-1][j], 
                                   dp[i][j-1], 
                                   dp[i-1][j-1]);
        }
    }

    res = dp[m][n];

    for(i=0; i<m+1; i++) 
        free(dp[i]);
    free(dp);

    return res;
}

results matching ""

    No results matching ""