Guides/Regular Expressions/Basic Patterns

From J Wiki

Basic Patterns

The following is a basic description of regular expressions.

A regular expression pattern is a sequence of elements which matches successive portions of a character string. For example, simple letters are elements which match the same characters in the string. The asterisk indicates that the previous element should be matched 0 or more times. So the pattern abcd matches only the exact text abcd, while the pattern ab*cd matches the letter a followed by 0 or more occurrences of the letter b, followed by the letters cd.
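As a rough illustration of the ab*cd example, the sketch below uses Python's re module as a stand-in. It is not the engine J uses, and the candidate strings are made up for this example.

```python
import re

# The pattern from the text: 'a', then zero or more 'b', then 'cd'.
pattern = re.compile(r"ab*cd")

# Candidate strings, chosen only for illustration.
candidates = ["acd", "abcd", "abbbcd", "abd"]

# fullmatch succeeds only when the whole string fits the pattern.
results = {s: pattern.fullmatch(s) is not None for s in candidates}
print(results)
```

The first three strings fit the pattern (zero, one, and three occurrences of b); the last does not, because the c is missing.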

Characters

Non-special characters match exactly. Non-special characters are anything other than:

   [ ] ( ) { } $ ^ . * + ? | \

A special character is included as simple text by preceding it with a backslash.
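The effect of the backslash can be sketched as follows, again with Python's re module standing in for the J engine. The sample text is hypothetical.

```python
import re

text = "cost: 3.50 (approx) or 3x50"

# Unescaped, '.' is special and matches any character,
# so both "3.50" and "3x50" are found.
unescaped = re.findall(r"3.50", text)

# Preceded by a backslash, '.' is simple text and only a literal dot matches.
escaped = re.findall(r"3\.50", text)

# Parentheses are special as well; escaping them matches them literally.
parens = re.findall(r"\(approx\)", text)

print(unescaped, escaped, parens)
```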

Character sets

The special character . matches any single character (except the null character, 0{a.).

The special characters ^ and $ match the start and end of a line, respectively.

Sets of characters are defined by enclosing the list of characters in brackets:
[aeiou] matches a single vowel character

Ranges can also be included within the brackets:
[a-z] matches any lower case letter

Combinations of the above are acceptable:
[a-zA-Z13579] matches any lower case letter, upper case letter, or odd digit

Fixed sets (classes) of characters can be included in the list, as a name within bracket-colon pairs:
[#[:digit:]abc] matches the character #, a digit, or any of the letters a, b, or c

The character classes defined are:

   alnum  alphanumeric  alpha   alphabetic
   blank  tab+space     cntrl   control chars
   digit  digits        graph   printable (except space)
   lower  lowercase     print   printable
   punct  punctuation   space   whitespace
   upper  uppercase     xdigit  hex digits

If a set begins with ^, it matches any single character not in the set.
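The set forms above can be tried out with a short sketch. It uses Python's re module as a stand-in, which does not support the bracket-colon class names, so the range 0-9 is written where the text uses [:digit:]. The sample strings are made up.

```python
import re

text = "Route 66 at 7am"

# A plain set: each match is a single vowel.
vowels = re.findall(r"[aeiou]", text)

# Ranges and single characters combined in one set.
mixed = re.findall(r"[a-zA-Z13579]", text)

# A leading ^ inverts the set: anything that is not a letter.
not_letters = re.findall(r"[^a-zA-Z]", text)

# Stand-in for [#[:digit:]abc], with 0-9 in place of the digit class.
hash_digit_abc = re.findall(r"[#0-9abc]", "a#1z")

print(vowels, mixed, not_letters, hash_digit_abc)
```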

Subexpressions

A series of elements may be combined by enclosing them in parentheses. Subexpressions are affected by closures such as * just as simple characters are:
([a-z][0-9])* matches any number of occurrences of a letter followed by a digit

A search for a pattern returns a match for the overall pattern, and a separate match for each subexpression.

A \ followed by a digit, N, matches the same substring which occurred in the Nth subexpression:
([[:digit:]]+)#\1 matches one or more digits, followed by a #, followed by the same string of digits.
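A small sketch of subexpressions and back-references, using Python's re module as a stand-in for the J engine. The class [0-9] replaces [[:digit:]], which Python does not accept, and the input strings are hypothetical.

```python
import re

# One or more digits, a '#', then the same digits again.
pattern = re.compile(r"([0-9]+)#\1")

same = pattern.search("id 123#123 ok")   # digits repeat: a match
different = pattern.search("id 123#456")  # digits differ: no match

# A closure applied to a whole subexpression: letter-digit pairs.
pairs = re.compile(r"([a-z][0-9])*")
all_pairs = pairs.fullmatch("a1b2c3") is not None
dangling = pairs.fullmatch("a1b") is not None

print(same.group(0), same.group(1), different, all_pairs, dangling)
```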

Closures

A * following an element matches 0 or more occurrences of that element:
 [aeiou]* matches 0 or more vowels

A + following an element matches 1 or more occurrences of that element:
 [[:alpha:]]+ matches 1 or more alphabetic characters

A ? following an element matches 0 or 1 occurrences of that element:
 -?[[:digit:]]+ matches an optional hyphen, followed by 1 or more digits

An interval expression, {m,n}, following an element matches at least m, and no more than n, occurrences of that element:
 [[:digit:]]{3,5} matches 3, 4, or 5 digits
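The four closures can be compared side by side in a short sketch. Python's re module is used as a stand-in, with [a-zA-Z] and [0-9] written in place of the alpha and digit classes. The test strings are made up.

```python
import re

# '*' allows zero occurrences, so the empty string fits.
star_empty = re.fullmatch(r"[aeiou]*", "") is not None

# '+' needs at least one occurrence, so the empty string does not fit.
plus_empty = re.fullmatch(r"[a-zA-Z]+", "") is not None

# '?' makes the hyphen optional, but allows at most one.
optional = [re.fullmatch(r"-?[0-9]+", s) is not None
            for s in ["42", "-42", "--42"]]

# {3,5} accepts 3, 4, or 5 digits and nothing outside that range.
interval = [re.fullmatch(r"[0-9]{3,5}", s) is not None
            for s in ["12", "123", "12345", "123456"]]

print(star_empty, plus_empty, optional, interval)
```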

Alternation

Multiple regular expressions can be separated with a vertical bar | to match any of them:
 print|list|exit matches any of the strings print, list, and exit
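For illustration, the alternation above can be run against a made-up sentence, with Python's re module as a stand-in for the J engine.

```python
import re

command = re.compile(r"print|list|exit")

# Every occurrence of any of the three alternatives, in order of appearance.
found = command.findall("list the files, print them, then exit")

# A word that is none of the alternatives does not match.
quit_matches = command.fullmatch("quit") is not None

print(found, quit_matches)
```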

Matches

When searching for a pattern in a string, it is possible to find multiple substrings which match the pattern. The one that is returned is the one which starts earliest in the string. If more than one match starts at the same place, the longest one is returned.

Even once a particular match is located, it is possible for there to be multiple combinations of the contents of the subexpressions which make it up. As a rule, whenever possible the subexpressions which begin earlier in the string will be as long as possible.
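Two of these rules can be observed in a short sketch using Python's re module as a stand-in. The example deliberately avoids alternation, because Python picks the first alternative that succeeds, not the longest one, so it does not follow the longest-match rule described above in every case.

```python
import re

# Two runs of 'b' could match. The run that starts earliest is returned,
# even though a longer run appears later in the string.
first = re.search(r"b+", "abbcbbbb")
first_row = (first.start(), first.group())

# The three letters could be split between the two subexpressions in
# several ways. The earlier subexpression takes as much as it can.
split = re.search(r"(a*)(a*)", "aaa").groups()

print(first_row, split)
```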

The result of a match is a table which describes the match. The first row covers the whole match, and subsequent rows describe where the subexpressions in the pattern match in the string. Each row has two elements: the index of the first character of the match, and the length of the match. Any row for a subexpression which doesn't participate in the match is filled with _1 0.
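The shape of this table can be mimicked in Python as a rough sketch. The helper match_table below is hypothetical and is not part of J or of Python's re module. It builds (index, length) rows from a Python match object, writing -1 where J would show _1. Returning an empty list when nothing matches is a choice made for this sketch only.

```python
import re

def match_table(pattern, text):
    """Return (start, length) rows: whole match first, then each subexpression."""
    m = re.search(pattern, text)
    if m is None:
        return []  # sketch-only convention for "no match"
    rows = []
    for i in range(m.re.groups + 1):
        start = m.start(i)
        if start == -1:
            # This subexpression did not participate in the match.
            rows.append((-1, 0))
        else:
            rows.append((start, m.end(i) - start))
    return rows

# Both subexpressions participate.
both = match_table(r"([0-9]+)-([a-z]+)", "id 42-abc")

# Only the second alternative matches, so the first row pair is (-1, 0).
one_side = match_table(r"(a)|(b)", "xxb")

print(both, one_side)
```

In the first call the whole match "42-abc" starts at index 3 with length 6, and the two subexpressions cover "42" and "abc". In the second call the subexpression (a) takes no part in the match.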