Essays/Levenshtein Distance

From J Wiki
Jump to navigation Jump to search

Introduction

The Levenshtein distance between two strings is the minimum number of edits (insertions, deletions, or single-character substitutions) needed to transform one string to the other. We derive a non-looping calculation of this distance.

A Looping Solution

First, a translation of the pseudo-code in the Wikipedia article into a looping function in J:

LevenshteinMatrix=: 4 : 0
 d=. (1+(#x),#y)$0
 for_i. i.1+#x do. d=. i (<i,0)}d end.  NB. deletion
 for_j. i.1+#y do. d=. j (<0,j)}d end.  NB. insertion
 for_j. i.#y do.
  for_i. i.#x do.
   if. (i{x) = j{y do.
    d=. d (<1+i,j)}~ (<i,j){d
   else.
    m=.      1 + (<i,1+j){d             NB. deletion
    m=. m <. 1 + (<(1+i),j){d           NB. insertion
    m=. m <. 1 + (<i,j){d               NB. substitution
    d=. m (<1+i,j)}d
 end. end. end.
)

The function returns the entire matrix rather than just the Levenshtein distance; the actual distance is the number in the lower right corner. It reproduces the two test cases in the Wikipedia article:

   'sitting' LevenshteinMatrix 'kitten'
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 2 1 2 3 4 5
3 3 2 1 2 3 4
4 4 3 2 1 2 3
5 5 4 3 2 2 3
6 6 5 4 3 3 2
7 7 6 5 4 4 3
   'Sunday' LevenshteinMatrix 'Saturday'
0 1 2 3 4 5 6 7 8
1 0 1 2 3 4 5 6 7
2 1 1 2 2 3 4 5 6
3 2 2 2 3 3 4 5 6
4 3 3 3 3 4 3 4 5
5 4 3 4 4 4 4 3 4
6 5 4 4 5 5 5 4 3

A Non-Looping Solution

A non-looping version obtains through a series of algebraic manipulations, replacing the inner loop with an insert (f/) and the outer loop with an application of the power operator (^:). The result of this function is the transpose of the matrix produced by LevenshteinMatrix.

lmy=: 4 : 'y,(0{x){(1+(1{x)<.{:y),2{x'

lmx=: 4 : 0
 b=. (_1+#y){x
 d=. {:y
 m=. (}.<.}:) d
 y , > lmy&.>/ (|.<"1 b,.m,.}:d),<#y
)

LM=: =/~ lmx^:(#@[) ,:@i.@>:@#@[

Testing against the looping version:

   'sitting' (|:@LM -: LevenshteinMatrix) 'kitten'
1
   'Sunday'  (|:@LM -: LevenshteinMatrix) 'Saturday'
1

Version Producing Only The Scalar Distance

By Henry Rich.

levdist=: 4 : 0 " 1
 'a b'=. (/: #&>)x;y
 z=. >: iz =. i.#b
 for_j. a do.
  z=. <./\&.(-&iz) (>: <. (j ~: b) + |.!.j_index) z
 end.
 {:z
)

Collected Definitions

LevenshteinMatrix=: 4 : 0
 d=. (1+(#x),#y)$0
 for_i. i.1+#x do. d=. i (<i,0)}d end.  NB. deletion
 for_j. i.1+#y do. d=. j (<0,j)}d end.  NB. insertion
 for_j. i.#y do.
  for_i. i.#x do.
   if. (i{x) = j{y do.
    d=. d (<1+i,j)}~ (<i,j){d
   else.
    m=.      1 + (<i,1+j){d             NB. deletion
    m=. m <. 1 + (<(1+i),j){d           NB. insertion
    m=. m <. 1 + (<i,j){d               NB. substitution
    d=. m (<1+i,j)}d
 end. end. end.
)

lmy=: 4 : 'y,(0{x){(1+(1{x)<.{:y),2{x'

lmx=: 4 : 0
 b=. (_1+#y){x
 d=. {:y
 m=. (}.<.}:) d
 y , > lmy&.>/ (|.<"1 b,.m,.}:d),<#y
)

LM=: =/~ lmx^:(#@[) ,:@i.@>:@#@[

levdist=: 4 : 0 " 1
 'a b'=. (/: #&>)x;y
 z=. >: iz =. i.#b
 for_j. a do.
  z=. <./\&.(-&iz) (>: <. (j ~: b) + |.!.j_index) z
 end.
 {:z
)



Contributed by Roger Hui.