Scripts/Regular Expressions Substitution


Originally published at http://olegykj.sourceforge.net/ as regexs.ijs

Regular expressions extended for Perl/awk/sed-like substitution. Features an option to process executable replacements.

In the Unix tools sed and perl, a search pattern and its substitution can be described in a single operation. Often both refer to sub-patterns in order to manipulate string fragments. This makes common text transformations, such as reordering words or removing subparts, convenient.

Although the same result can be achieved programmatically with the existing regex operations, doing so requires additional low-level logic and familiarity with the many verbs of the J regex API. The proposed rxs verb provides a high-level operation that hides these J implementation details.
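To make the contrast concrete, here is a minimal sketch of the low-level route for a single, non-global substitution, using only verbs from the stock regex library. The names str, mat, all, w1, w2, rest and new are illustrative and not part of the script, and the sketch assumes rxmerge accepts a one-row match table in the same way the script itself passes one.

```j
NB. Sketch only: reorder two words with the stock regex verbs, without rxs.
require 'regex'

str=. 'Henry Rich xxx'
mat=. '(\w+) (\w+) (.*)' rxmatch str    NB. rows of offset,length: whole match, then each group
'all w1 w2 rest'=. mat rxfrom str        NB. unbox the matched substrings into names
new=. w2,', ',w1,' ',rest                NB. build the replacement text by hand
(,<new) (,:{.mat) rxmerge str            NB. splice it over the whole-match row only
```

Under these assumptions the last line should give the same text as the final rxs example on this page, at the cost of four steps instead of one.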

The rxs verb also features an e (execute) option, which evaluates the replacement as a J expression for each match and merges the results back into the string.
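As a small illustration of the e option, the sketch below doubles every number in a string. It assumes the script on this page has already been loaded so that rxs is defined. Because the e branch substitutes & with the quoted string form of the match, the replacement applies ". to turn that string into a number before multiplying.

```j
NB. Sketch only: assumes the script on this page is already loaded, so rxs is defined.
NB. g = replace every match, e = execute the replacement as a J sentence.
NB. & becomes the quoted match, e.g. 2*".'10', which evaluates to 20.
'/[0-9]+/2*".&/ge' rxs 'a 3 b 10'
```

If the script behaves as read here, the result should be a 6 b 20. A replacement that fails to evaluate is replaced by __ rather than raising an error, so a stray __ in the output usually points at a malformed expression.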

Used appropriately, the rxs verb covers most regex use cases and removes the need to call the low-level verbs directly.

NB. Regular expressions extended for Perl-like substitution
NB. Version 3 for j601+.
NB. Author Oleg Kobchenko. Originally http://olegykj.sourceforge.net/
NB. to do: \xHH

require'strings regex'

coclass 'jregex'

NB. =========================================================
NB.*rxmain v keep only the whole-match row of each match (drop the () subpattern rows)
rxmain=: ,:"1@:({."2)

NB. =========================================================
NB.*rxs v make Perl-like s/PAT/REPL/OPT substitution
NB. use:
NB.   '/PAT/REPL/OPT' rxs str
NB. PAT - the usual POSIX pattern used in J regex
NB. REPL - the POSIX sed-like replacement string
NB.   \1-\9   corresponding parens content
NB.   \0 or & whole match
NB.   \_      whole match in string representation (for 'e')
NB.   \t      TAB     \n      LF
NB.   \r      CR      \f      FF
NB.   \other  the character itself (e.g. \\ is a backslash)
NB. OPT - any of 'ige' for ignore case, global, execute
NB. see: examples

RBEGE=: <;._1' \n LF \r CR \t TAB \f FF'
RBEGX=: '\n';LF;'\r';CR;'\t';TAB;'\f';FF

rxs=: 4 : 0
  esc=. {.x
  'pat rpl opt'=. 3{. <;._1 x
  str=. tolower^:('i'e. opt) y
  pat=. tolower^:('i'e. opt) pat
  mat=. pat rxmatch`rxmatches@.('g'e. opt) str
  if. (0=#mat) +. _1=1{.,mat do. y return. end.
  subs=. ,:^:(2: > #@$) mat rxfrom y
  mat=. rxmain mat
  newr=. ''
  if. 'e' e. opt do.
    r=. rpl rplc '\\';esc;RBEGE
    for_i. i.#mat do.
      pairs=. '&';5!:5<'t' [ t=. >(<i,0){subs
      pairs=. pairs,'\_';'('&,@(,&')')@(5!:5) <'t' [ t=. i{subs
      for_j. i.{:$subs do.
        pairs=. pairs, ('\',":j);5!:5<'t' [ t=. >(<i,j){subs
      end.
      pairs=. pairs,'\';'';esc;'\'
      re=. r rplc pairs
      for_j. i.+/'e'E.opt do.
        re=. (,@":@:".) :: ('__'"_) re
      end.
      newr=. newr,<re
    end.
  else.
    r=. rpl rplc '\\';esc;RBEGX
    for_i. i.#mat do.
      pairs=. '&';>(<i,0){subs
      for_j. i.{:$subs do.
        pairs=. pairs, ('\',":j);>(<i,j){subs
      end.
      pairs=. pairs,'\';'';esc;'\'
      newr=. newr,<r rplc pairs
    end.
  end.
  newr mat rxmerge y
)

rxs_z_=: rxs_jregex_

Note 'Examples'  NB. run indented lines and compare results
«examples»
)


   str=. 'hello Mr John Dow hi miz Sarah Bernard hi mr none'
   '/(mr|miz) ([a-z]+) ([a-z]+) */\3, \2 (\1) -- was: \0\n' rxs str
hello Mr John Dow hi miz Sarah Bernard hi mr none
   '/(mr|miz) ([a-z]+) ([a-z]+) */\3, \2 (\1) -- was: \0\n/i' rxs str
hello Dow, John (Mr) -- was: Mr John Dow
hi miz Sarah Bernard hi mr none
   '/(mr|miz) ([a-z]+) ([a-z]+) */\3, \2 (\1) -- was: \0\n/ig' rxs str
hello Dow, John (Mr) -- was: Mr John Dow
hi Bernard, Sarah (miz) -- was: miz Sarah Bernard
hi mr none

   p1=. '!(mr|miz) (([a-z]+) )?([a-z]+) *'
   r1=. '!\4,s,(":#\4),s, \3, s,\1,s,'' used: '',(":+/a:~:\_),\n [ s=.''/'''
   o1=. '!gie'
   (p1,r1,o1) rxs str
hello Dow/3/John/Mr/ used: 5
hi Bernard/7/Sarah/miz/ used: 5
hi none/4//mr/ used: 3

   '/([^ ]+) ([^ ]+)/\2,''-'',\1/e' rxs 'q''123 z456'
z456-q'123

   '/([^ ]+) ([^ ]+)/\2,''-'',\1/ee' rxs '123 456'     NB. multiple /e
333

   '/(\w+) (\w+) (.*)/\2, \1 \3' rxs 'Henry Rich xxx'
Rich, Henry xxx


Contributed by Oleg Kobchenko