Essays/Longest Common Subsequence

From J Wiki
Jump to: navigation, search

The longest common subsequence problem (LCS) is finding a longest sequence which is a subsequence of all sequences in a set of sequences. It is the basis of diff, the file comparison utility, and is used, for example, in matching DNA sequences.

This problem is inherently iterative and unsuitable for J. In practice, if a J application needs to call LCS on a non-trivial data set, it would be best to call an external compiled routine. However, for small data sets, say comparing two strings of 100 characters, LCS works fine in J, and is a key component of the script comparison utility.


A J Solution

Ewart Shaw posed and solved the problem in the forum, see thread longest common subsequence. His solution gives the length of the LCS:

   lcs=: [: >./@, [: (* * 1 + >./\@:(_1&|.)@:|:^:2)^:_ [: ,&0@:,.&0 =/

   'ABRACADABRA' lcs 'BARRACUDA'   NB. lcs is 'ARACDA' or 'BRACDA'
6
   'HHTHHTHHT' lcs 'THHTHTTHT'     NB. lcs is 'HHTHTHT'
7

The solution first makes an equals table (=/) on the strings, and pads the table with trailing zeros. For example, using strings with an LCS of 4:

   A=: 'RACADA' [ B=: 'ARRACUD'
   [M=: 0,.~0,~ A =/ B
0 1 1 0 0 0 0 0
1 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
1 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0
1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0

The core of the routine is the following definition, which is iterated until the result is stable, in this case, stopping at the 4th iteration. This contructs a table which has the maximum possible subsequence length at each position and which is then filtered by possible subsequences. The 4 in the final table indicates the longest subsequence length - this is the string 'RACD':

  lcs0=: * * 1 + >./\@:(_1&|.)@:|:^:2
   <"2 lcs0 ^:1 2 3 4 M
+---------------+---------------+---------------+---------------+
|0 1 1 0 0 0 0 0|0 1 1 0 0 0 0 0|0 1 1 0 0 0 0 0|0 1 1 0 0 0 0 0|
|1 0 0 2 0 0 0 0|1 0 0 2 0 0 0 0|1 0 0 2 0 0 0 0|1 0 0 2 0 0 0 0|
|0 0 0 0 2 0 0 0|0 0 0 0 3 0 0 0|0 0 0 0 3 0 0 0|0 0 0 0 3 0 0 0|
|1 0 0 2 0 0 0 0|1 0 0 2 0 0 0 0|1 0 0 2 0 0 0 0|1 0 0 2 0 0 0 0|
|0 0 0 0 0 0 2 0|0 0 0 0 0 0 3 0|0 0 0 0 0 0 4 0|0 0 0 0 0 0 4 0|
|1 0 0 2 0 0 0 0|1 0 0 2 0 0 0 0|1 0 0 2 0 0 0 0|1 0 0 2 0 0 0 0|
|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|
+---------------+---------------+---------------+---------------+

Construction of the maximum possible length at each position:

  (>./\@:|:^:2) each <"2 lcs0 ^:1 2 3 4 M
+---------------+---------------+---------------+---------------+
|0 1 1 1 1 1 1 1|0 1 1 1 1 1 1 1|0 1 1 1 1 1 1 1|0 1 1 1 1 1 1 1|
|1 1 1 2 2 2 2 2|1 1 1 2 2 2 2 2|1 1 1 2 2 2 2 2|1 1 1 2 2 2 2 2|
|1 1 1 2 2 2 2 2|1 1 1 2 3 3 3 3|1 1 1 2 3 3 3 3|1 1 1 2 3 3 3 3|
|1 1 1 2 2 2 2 2|1 1 1 2 3 3 3 3|1 1 1 2 3 3 3 3|1 1 1 2 3 3 3 3|
|1 1 1 2 2 2 2 2|1 1 1 2 3 3 3 3|1 1 1 2 3 3 4 4|1 1 1 2 3 3 4 4|
|1 1 1 2 2 2 2 2|1 1 1 2 3 3 3 3|1 1 1 2 3 3 4 4|1 1 1 2 3 3 4 4|
|1 1 1 2 2 2 2 2|1 1 1 2 3 3 3 3|1 1 1 2 3 3 4 4|1 1 1 2 3 3 4 4|
+---------------+---------------+---------------+---------------+


##    (1 + >./\@:(_1&|.)@:|:^:2) each <"2 lcs0 ^:0 1 2 3 M
## +---------------+---------------+---------------+---------------+
## |1 1 1 1 1 1 1 1|1 1 1 1 1 1 1 1|1 1 1 1 1 1 1 1|1 1 1 1 1 1 1 1|
## |1 1 2 2 2 2 2 2|1 1 2 2 2 2 2 2|1 1 2 2 2 2 2 2|1 1 2 2 2 2 2 2|
## |1 2 2 2 2 2 2 2|1 2 2 2 3 3 3 3|1 2 2 2 3 3 3 3|1 2 2 2 3 3 3 3|
## |1 2 2 2 2 2 2 2|1 2 2 2 3 3 3 3|1 2 2 2 3 4 4 4|1 2 2 2 3 4 4 4|
## |1 2 2 2 2 2 2 2|1 2 2 2 3 3 3 3|1 2 2 2 3 4 4 4|1 2 2 2 3 4 4 4|
## |1 2 2 2 2 2 2 2|1 2 2 2 3 3 3 3|1 2 2 2 3 4 4 4|1 2 2 2 3 4 4 5|
## |1 2 2 2 2 2 2 2|1 2 2 2 3 3 3 3|1 2 2 2 3 4 4 4|1 2 2 2 3 4 4 5|
## +---------------+---------------+---------------+---------------+
##

Backtracking

The final table generated by lcs indicates possible subsequences, and also requires an iterative procedure to extract them; start at the biggest integer and then backtrack through the table. Getting all subsequences this way is also unsuitable for J, however, the following procedure will pick out a single solution efficiently: for each positive integer, select the position closest to the origin (top left corner); in the event of a tie, pick the one with the smallest column:

   T=: lcs0 ^:3 M
   len=: >./ ,T
   rc=: 1 + }.@{.@/:~@(+/"1 ,. ]) ($T) #: I. len = ,T
   ndx=: <@I."1 msk=: (1+i.len) =/ , rc {. T
   pos=: (+/"1,.]) ; (<rc) #: each ndx
   pos=: /:~ ((# &> ndx) # i.len),.pos
   'ia ib'=: |: 2 }."1 (~:{."1 pos)#pos
   (ia{A) ; ib{B
+----+----+
|RACD|RACD|
+----+----+

Script Comparison

The J distribution includes a script comparison utility. This is not intended to replace professional diff and merge programs, but is very handy for quick comparisons on work in progress - what did I change today? Most J scripts are fairly short, so LCS would work fine on them, but an application may be built from several scripts, and can easily exceed what is possible in J. For example Plot has some 6,000 lines of code, and this is far too large for a J LCS.

It turns out that LCS still useful, as long as it is used sparingly. When doing script comparison on an application like Plot, most likely only a few source scripts will have been updated. Thus the final application will have almost all lines the same, with a few isolated changes. This leads to the following efficient algorithm:

The scripts are read in and converted to lists of integers. Each integer is an index into the list of all possible lines, so that line comparisons take place on integers.

The following 2 step algorithm is applied:

1. The first step is also a 2-step algorithm, repeated until there is no change, or there are no more lines left:

a) Matching heads and tails are removed from each script, leaving only lines that differ at the beginning and end.
b) Any lines not found in both scripts are removed - these will be changed lines.

2. If there are any lines left after step 1, these will necessarily differ at the beginning and end, and consist of lines that occur in both scripts. Very likely these are blocks of code that have simply been rearranged, for example, to put definitions in alphabetical order. Since step 1 no longer has any effect, LCS is now used to break the jam, where no more than 100 lines are used from each script. This not only breaks the jam, but is nearly always a perfect solution, since rarely would more than 100 lines of code be needed for a better LCS solution. After using LCS once, the iteration returns to step 1.



See also

J Forum: Longest common subsequence
WikiPedia: Longest common subsequence problem



Contributed by Chris Burke.