Addons/xml/sax

From J Wiki

xml/sax - XML parser based on Expat library

SAX (Simple API for XML) parser addon. It provides both a flat API and an object-oriented, SAX-like interface. Binaries for Windows, Linux x86 and Darwin PPC are included.
Based on Expat 2.0.0, see http://expat.sourceforge.net/
See also: examples in test folder in SVN; change history.


Installation

Use JAL/Package Manager or download the xml_sax archive from JAL:j602/addons and extract it into the ~addons/xml/sax folder (or ~addons/xml for j504).

Listings and results of some examples can be found in the test folder.

SAX Usage


SAX (Simple API for XML) was originally a Java API coordinated by David Megginson; it follows the same event-driven processing model as expat. Because the document is streamed rather than held in memory as a tree, SAX processing has a tiny memory footprint and is typically faster than DOM. See http://www.saxproject.org/.

SAX parsing follows the push model, i.e. the API calls you. You provide the callback functions by overriding those of the base class (see the saxclass definition). These functions are then called as the events for the XML nodes occur.

A higher-level visitor design pattern can be obtained if you define verbs named after the elements of interest, with a common prefix, and call them from start/endElement. This is similar to wd calling event verbs.
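The dispatch idea can be tried out in plain J, without the parser. The following is only a sketch: the prefixes start_ and end_, the verb visit and the noun log are hypothetical names chosen for illustration, and the four calls at the end stand in for what startElement and endElement would do while parsing <item><title/></item>.

```j
NB. hypothetical handlers: prefix 'start_' or 'end_' plus the element name
log=: 0$a:
start_item=: 3 : 'log=: log,<''open item'''
end_item=: 3 : 'log=: log,<''close item'''

NB. x: prefix, y: element name
NB. call the verb named x,y if it exists; ignore all other elements
visit=: 4 : 0
  name=. x,y
  if. 3 = 4!:0 <name do. (name~) y end.
  i.0 0
)

NB. simulated events for <item><title/></item>
'start_' visit 'item'
'start_' visit 'title'      NB. no start_title verb: ignored
'end_' visit 'title'        NB. no end_title verb: ignored
'end_' visit 'item'
log                         NB. 'open item';'close item'
```

Element names that are not valid J names (for instance names containing a hyphen or a namespace colon) are simply skipped by the name-class test in this sketch.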

In your class you maintain the state and selectively process the events. The event for text between tags is called characters. It is demoed in the table and rss examples.

In the "rss" example, a simple stack of nested elements is maintained in the S list. The characters event then processes the text according to the current context.
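The stack technique can be sketched in plain J as follows. This is not the code of the rss example: only the name S is taken from the text above, while push, pop, context, chars and titles are hypothetical names, and the calls at the end simulate the parser events by hand.

```j
S=: 0$a:                               NB. stack of open element names
push=: 3 : 'S=: S,<y'                  NB. to be called on element start
pop=: 3 : 'S=: }:S'                    NB. to be called on element end
context=: 3 : '}. ; (''/'',])&.> S'    NB. current path, e.g. rss/channel

titles=: 0$a:
NB. text handler: keep only the text of item titles
chars=: 3 : 0
  if. 'rss/channel/item/title' -: context '' do. titles=: titles,<y end.
  i.0 0
)

NB. simulated events for
NB. <rss><channel><title>Feed</title><item><title>First post</title></item>...
push 'rss'
push 'channel'
push 'title'
chars 'Feed'                NB. channel title: context differs, ignored
pop ''
push 'item'
push 'title'
chars 'First post'          NB. item title: collected
pop ''
pop ''
titles                      NB. one box containing 'First post'
```

In a real parser class the push and pop calls would sit in the start and end element handlers, and S and titles would be locale variables of the class.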

The result of process is whatever endDocument returns, since endDocument is the last event called.

x2j - XML to J


x2j is an extension in the 'xml/sax' addon that lets you describe XML parsing compactly and declaratively.

Hopefully, this will lower the entry barrier for people who want to use XML with J.

Simple things are really simple; complex ones are, well, possible. To give you an idea, say you want to convert an HTML table to a J table.

  <table>
    <tr>
      <td>0 0</td><td>0 1</td><td>0 2</td><td>0 3</td>
    </tr><tr>
      <td>1 0</td><td>1 1</td><td>1 2</td><td>1 3</td>
    </tr><tr>
      <td>2 0</td><td>2 1</td><td>2 2</td><td>2 3</td>
    </tr>
  </table>

Here's all the code with definition and execution:

require 'xml/sax/x2j'            NB. Preamble
x2jclass 'px2j3'

'Items' x2jDefn                  NB. Definitions
  /   :=  table             : table=: 0 0$0
  tr  :=  table=: table,row : row=: ''
  td  :=  row=: row,<y
)

    process_px2j3_ TESTX2J3      NB. Processing
+---+---+---+---+
|0 0|0 1|0 2|0 3|
+---+---+---+---+
|1 0|1 1|1 2|1 3|
+---+---+---+---+
|2 0|2 1|2 2|2 3|
+---+---+---+---+

See the examples in the test folder for more, such as nested structures and attribute handling.

As illustrated above, the usage pattern consists of three parts:

  • Preamble loads the x2j library and inherits from the px2j class
  • Definitions in full or compact notation (see below)
  • Processing starts parsing, which will call respective definitions

Each definition consists of

  • Path part, which is matched against the current node. Currently, only the tail part of the path is matched; future versions may support more XPath expressions. The special root path '/' matches the top-level document.
  • Explicit Definition (J code) which is invoked when the current parsed node matches the path.
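To make tail matching concrete, here is a small stand-alone sketch in plain J. It is not the addon's implementation; it assumes that "tail part" means the components of the definition path must equal the last components of the current node path, and the verbs split and tailmatch are hypothetical names. The special root path '/' is not covered.

```j
NB. split a path such as 'table/tr/td' into boxed components
split=: <;._1@('/'&,)

NB. x: definition path, y: path of the current node
NB. result: 1 if the components of x are the last components of y
tailmatch=: 4 : 0
  d=. split x
  n=. split y
  ((#d) <: #n) *. d -: (-#d) {. n
)

'td'    tailmatch 'table/tr/td'     NB. 1
'tr/td' tailmatch 'table/tr/td'     NB. 1
'tr/td' tailmatch 'table/td'        NB. 0
```

Under this reading, a short path such as td matches any td element regardless of its ancestors, while tr/td narrows the match to td elements directly inside a tr.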

Ambivalent definitions (monadic and dyadic) are used for elements, where the

  • dyad is called at the start of an element, having
    • y with element name and
    • x with attributes, accessed with the verb atr, e.g. test=: atr'test' or 'one two'=:atr'one two'
  • monad is called at the end of an element, having
    • y with element name.

Monadic definitions are used for text nodes, where the

  • y contains the string value of the text
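The three kinds of calls can be combined as in the following sketch, which collects the href attribute and the text of each link. It relies only on the names shown on this page (x2jclass, x2jDefn, atr, process_); the class name plinks, the label 'Links', the nouns links, href, text and XML are hypothetical, and it is an assumption that atr with a single name returns the attribute value as a string and that process accepts the XML text as its argument.

```j
require 'xml/sax/x2j'
x2jclass 'plinks'                 NB. hypothetical class name

NB. '/' : start resets the table, end returns it
NB. 'a' : the dyad (start) reads the attribute, the monad (end) appends a row
NB. 'a' : the monad-only line is the text definition, collecting the link text
'Links' x2jDefn
  /  :=  links                    : links=: 0 2$a:
  a  :=  links=: links,href;text  : text=: '' [ href=: atr 'href'
  a  :=  text=: text,y
)

XML=: '<p><a href="one.html">One</a><a href="two.html">Two</a></p>'
process_plinks_ XML               NB. intended: one row per link, href;text
```

The start definition stores href and clears text in locale variables, so that the end definition of the same element can combine them into a row.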

In each definition category (element or text), if the same path is repeated, only the first occurrence is used.

Below is the summary of the full and compact definition notations.

Full Definitions

Element definitions are of the form

'path' x2jElm (3 : 0)
  end of element code
:
  start of element code
)

Text definitions are of the form

'path' x2jChar (3 : 0)
  text code
)

The same path can be used in an element and a text definition.

The definitions are regular J explicit definitions, and thus can be contracted to one-liners or be tacit, e.g.
   'path' x2jChar (3 : 'code')
or even reference named verbs, e.g.
   'path' x2jElm verb or
   'path' x2jElm (monad : dyad) .

x2jElm and x2jChar are conjunctions that build corresponding matching tables for processing, so no copula (=:) is used.
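As an illustration, the HTML table example from the introduction might be written with full definitions as sketched below. This is a hand translation following the forms above, not output of the addon; the class name ptable and the noun XML are hypothetical, and it is an assumption that process accepts the XML text as its argument.

```j
require 'xml/sax/x2j'
x2jclass 'ptable'                 NB. hypothetical class name

'/' x2jElm (3 : 0)
  table                           NB. end of document: the result
:
  table=: 0 0$0                   NB. start of document: empty table
)

'tr' x2jElm (3 : 0)
  table=: table,row               NB. end of row: append it to the table
:
  row=: ''                        NB. start of row: reset the row
)

'td' x2jChar (3 : 'row=: row,<y') NB. cell text: box it and add to the row

XML=: '<table><tr><td>0 0</td><td>0 1</td></tr><tr><td>1 0</td><td>1 1</td></tr></table>'
process_ptable_ XML               NB. intended: 2 by 2 table of boxed strings
```

The monad comes first and the dyad second, as in any ambivalent explicit definition, so the code for the end of an element is written above the code for its start.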

Compact Definitions

The compact definition is a convenience notation, which is translated into a series of full definitions internally.

Each line consists of a path, the definition token := (not a copula), and a one-line explicit definition in which the monad and dyad parts are separated by a spaced colon ( : ). Note that the arguments x and y must be written explicitly.

'Note' x2jDefn
  elm1 := monad code : dyad code NB. ambivalent - element
  elm1/elm2 := : dyad code NB. dyad-only - element
  elm1/elm3 := monad code NB. monad-only - text
)

A number of compact definitions can be freely mixed with full definitions.

Processing

x2j processing is similar to the SAX visitor pattern. As the XML is parsed, each node is visited in a depth-first manner. When the current node path matches one of the definitions, the corresponding code is invoked.

The passed parameters or attribute values can be assigned to locale variables with =: (not to local variables with =.), and then used in other definitions to construct higher-level items.

The result of the whole processing is the return value of the monad (end of element) definition for the document root (path '/'). All other results are ignored.
