JWebServer/HttpParser

From J Wiki
Jump to navigation Jump to search

So we can have J listen for HTTP requests, and respond. But for this to be meaningful, we'll also need to parse those requests.

Note: this is hand-typed copy, and might contain some inadvertent syntax errors. Also, the code here requires J version 6. To make it work with version 5, a number of nouns will have to be explicitly converted to constant verbs using "_

HTTP Format Overview

HTTP requests come in several forms.

The original (now obsolete) form was a single line of text, consisting of a method (a word like 'GET' or 'POST') and a URI (something like '/directory/name/filename.htm').

Since a number of people wanted something more complicated than that (but didn't want existing web pages to break) a more advanced form was developed. Almost no one uses all the potential features of this more complicated format, but good practice suggests that all web servers should be able to safely handle it.

In the new format, that line was re-defined as the first line of the request. Since existing web servers ignored everything after the first two words, a third word was added -- if it was present, this more complicated format was being used. To allow for further enhancements in the future, it was decided that this word should be 'HTTP/1.0'.

The following lines were organized in a "message header"/"message body" format, such as is used for internet email. These are CRLF delimited lines, with a blank line (CRLF,CRLF) seperating the header from the body. Everything after that blank line is the body.

The header lines consist of a name followed by a ':' followed by a value. Alternatively, if a line begins with a TAB it is a continuation of the line before it.

Each header line has its own rules for the format of its value (the specific rules depend on the name), but all header lines must fit this mechanism.

More recently, a new message format has been defined. These use 'HTTP/1.1' as the third word on the first line. These messages are parsed the same as HTTP/1.0 messages, but multiple messages may be sent on the same connection one after the other. In other words, while there are important distinctions between HTTP/1.0 and HTTP/1.1 messages, in a modular system those differences must be dealt with before we parse those messages.

J's Sequential Machine (;:)

J's ;: is pretty good at breaking a string of text into tokens. We can use that here to parse HTTP messages.

The left argument to ;: controls how the text will be parsed. This left argument allows for a number of complications, so it's worth reviewing the specifics needed for a specific application, to avoid getting lost.

We'll be using verbs of the form (F;S;M) ;: ]

F will usually be 0, which means ;: will produce a boxed list of strings.

S is a rank three matrix. We'll always be presenting S using a compressed (rank 2) state matrix S2, where S is 0 10#:S2. In S2 the first dimension represents "state" and the second dimension indicates "character class". The values in S2 will represent both the operation code (the lowest digit) and the next state (the higher digits).

M will always be a vector containing 256 elements (one for each byte in a.). The values in M are the "character class" that we need to index S (and S2).

Because it's sometimes convenient to work with modified variants of these machines, it's convenient to leave them as nouns, to be used as the left argument to ;:

Also, so that S can be expressed as clearly as possible, We will be using a special routine to build it:

states_z_=. 3 :0
 NB. convenience fn for building commented FSMs
 NB. 23.4 means next state is 23, with opcode 4
 NB. rows are states, columns are character classes
 NB. opcodes:
 NB.  .0  continue
 NB.  .1  begin            j=. i
 NB.  .2  emit, begin      j=. i [ ew(i,j,r,c)
 NB.  .3  emit, end        j=._1 [ ew(i,j,r,c)
 NB.  .4  extend, begin    j=. i [ ev(i,j,r,c)
 NB.  .5  extend, end      j=._1 [ ev(i,j,r,c)
 NB.  .6  stop
 NB. 'stop' finishes
 NB. 'end' prevents emit until next 'begin'
 NB. implicit extend at finish if j>_1
 NB.
 0 10 #: <. 10 * > -.&a: <@".;._2] 0 :0
)

Result strings start at a 'begin' operation, and end just before an 'emit' or 'extend' operation. But 'extend' operations are special -- several in a row count as a single, long 'emit' if (and only if) they occur in the same state -- successive 'extend' operations in the same state use the new value for end with the earlier value for begin.

Extend operations which occur in different state of the current unfinished word (extend operations which are listed on different rows of our "state matrix") behave just like an emit. They finish the existing word and start a new one. (The difference between the two extend/emit operations is that a begin would a new unfinished word while an end leaves the machine in an "error" or "skip" state.)

splitWord

Here's a simple sequential machine verb. This will be used on the first line of the http request and will split it into white-space delimited words:

splitWord=: 0;(states'');a.e.' ',CRLF,TAB
NB. other whitespace     # got
    1.1   0          NB. 0 whitespace
    1     0.3        NB. 1 other
)

This machine has two states. State 0 is the initial state, and it's the state we'll go into after seeing whitespace. State 1, the other state, is the state we'll go into when we see something that isn't whitespace.

This machine also has two character classes. Class 1 is used for white space characters. Class 0 is used for everything else. In the context of HTTP we didn't really need to classify TAB as a whitespace character (since it's irrelevant in the first line of an HTTP request) -- it was only included for emphasis, and to make the implementation better fit the comments.

From state 0 (our initial state), if we see a character with class 1 (not whitespace), we stay in state 0. So we can think of "state 0" as our "not whitespace" state.

From state 0, if we see a character with class 0 (whitespace), we move into state 1 and use the "start of word" opcode. So we can think of "state 1" as our "whitespace" state.

From state 1, if we see a character with class 0, we stay in state 1.

From state 1, if we see a character with class 1, we move into state 0 and use the 'end of word' opcode.

This is extremely simple, but it's worth spending a minute or so inspecting the state machine to understand these pieces. Also, the implementation would be slightly easier to described if class 1 was whitespace and class 0 was non-whitespace (because then class 1 and state 1 would have an obvious similarity which would reinforce their concepts being similar and likewise for class 0 and state 0) -- but this would make the code more complex.

Split-words.svg

splitHttp

Our next state machine will be more complicated. This state machine will split an http response into three parts: the first line, the message headers, and the message body.

For this state machine, we'll need three character classes, CR, LF, and everything else. CR will be class 0 (the first column of the compressed state matrix), LF will be class 1, and everything else will be class 2 (the last column).

Later on, we'll be storing these parts into variables, and since the contents of each variable shows up in its own set of states, I'll introduce the names for those variables here.

Also, note that I have to deal with a number of potential state transitions which are not discussed in the http specifications. I've used personal judgement here (for example: I treat bare LF like CRLF and shut down the sequential machine when CR is not followed by LF), but there's no particularly "right" state transitions for cases which do not follow the specs.

namesHttp=: ;:'HTTPctrl HTTPhead HTTPbody'
splitHttp=: 0;(states'');CRLF i.a.
NB. CR   LF   other    # in
    0    0    1.1  NB. 0 initial state
    2.2  2.2  1    NB. 1 first line (HTTPctrl)
    3    4    2    NB. 2 headers    (HTTPhead)
    0.6  4    0.6  NB. 3 CR
    5    6    2    NB. 4 LF
    0.6  6    0.6  NB. 5 CR after blank line
    7.2  7.2  7.2  NB. 6 LF after blank line
    7    7    7    NB. 7 body       (HTTPbody)
)

SplitHTTP.svg

splitHeaders

We're not done yet. We can split the response into first line, headers and body, but we still need to deal with the structure of the message headers. First, we need to break up the header into a sequence of items. Since individual items can be longer than one line, we can't use a simple splitter like we used for the first line. Or, rather, we could do that but we might run into trouble later with obscure browser features which provide multi-line headers.

HTTP header lines which begin with TAB are continuations of the header represented on the previous line. We can use the 'extend' operation here, but we'll need two copies of it so that different headers are not merged together.

We'll use four character classes here: 0 is CR, 1 is LF, 2 is TAB or space, and 3 is everything else.

splitHeaders=: (0;(states'');(-2&<)a.i.~CRLF,TAB,' ')
NB. CR   LF   whitespace other     #
    0    0    0.6        1.1  NB.  0 initial state
    2    3    1          1    NB.  1 header
    0.6  3    0.6        0.6  NB.  2 CR
    0.5  0.5  1          4.4  NB.  3 LF
    5    6    4          4    NB.  4 alt header
    0.6  6    0.6        0.6  NB.  5 alt CR
    0.5  0.5  4          1.4  NB.  6 alt LF
)

SplitHeaders.svg

splitNameValue and safeName

Once we've split the headers into items, we'll need to split each of the items into a name and a value. The protocol uses ':' to split the name from the value, and we can use a simple state machine to do the same. Here, character class 1 will be ':' and character class 0 will be everything else.

splitNameValue=: (0;(states'');a.e.':')
NB. other :        #  in
    1.1 1.1    NB. 0  initial state: start word
    1   2.3    NB. 1  name (waiting for ':')
    3.1 3.1    NB. 2  :
    3   3      NB. 3  value (continued)
)

Split-name-value.svg

We'll also want to use these values.

In previous exercises, we set up a socket object for each connection, and an http object (which inherited from it) for each http message. We can create a variable in that http object for each of these header items, and they'll automatically be cleaned up when the object is discarded. This is largely under control of the browser so for safety we'll have to impose a few rules.

One rule is that we don't want any '_' charaters in these names. If we let that happen, the browser could step on stuff in other objects and classes.

But let's look at an example header line to get the rest of the rules.

Content-Type: text/html

This name might instead be presented as

CONTENT-TYPE: text/html

Content-Type is something sent from the server to the browser, so this is a contrived example. When parsing an http request, we'll be dealing with headers sent from the browser. Nevertheless, this illustrates a significant point.

Basically, we want to follow some simple rule for upper/lower case in these names (like: a name is formed from a list of words where the first letter of each word is capitalized and the remainder is lower case). We also will need to discard any characters which aren't alphanumeric. We'll also want to prefix each of these names with HTTP so they can't step on anything within the http object.

We can use the 'split' sequential machine we used earlier when defining splitWord. Here, character class 0 (keep) is "anything alphanumeric", while class 1 (toss) is everything else.

alph=. a.e. (,toupper)'abcdefghijklmnopqrstuvwxyz0123456789'
safeName=: [: ; [: (toupper@{. , tolower@}.)&.> (0;split;-.alph) ;: ]

trimValue

For the most part, different values in the headers have different structures. However, it's always the case that these headers will have extra white space around these values, so it's reasonable to trim off any leading or trailing white space. For that, the following should suffice:

trimValue=: (1;(0 10#:11 0,:10 5);a.e.' ',CRLF,TAB)

This is essentially the same as splitWord, but using opcode 5 (extend sequence) rather than opcode 3 (end word), and using function code 1 (don't box the words) rather than function code 0 (box the words).

parseHttp

And we need to glue these pieces together

parseHttp=:3 :0
'HTTPctrl HTTPhead HTTPbody'=: 3{.splitHttp ;: y.
'HTTPmethod HTTPuri HTTPversion'=: 3{.splitWord ;: HTTPctrl
'names values'=. <"1|:nameValue&> splitHead ;: HTTPhead
(HTTPnames=:('HTTP',safeName)&.>names)=: trimValue&;:&.> values
HTTPuri
)

untidy notes that should be folded into the above...

HTTP Responses (which browsers and proxies have to parse) follow the same format as HTTP Requests (as outlined above), except for the first line. The first line of a response will be the HTTP version (for HTTP/1.0 or HTTP/1.1) followed by a space then a 3 digit status code, followed by a space then a somewhat arbitrary status message (essentially, just a comment for humans who might be looking at the response).

Also, we can fold trimValue into nameValue:

splitNv=: (0;(states'');(2*a.=':')+a.e.' ',CRLF,TAB)&;:
 1.1  0.6  0.6  NB. state 0: initial state
 1    0.6  2.3  NB. state 1: name
 3.1  2    3.1  NB. state 2: skipping whitespace after name
 3    2.5  3    NB. state 3: value
)

If we use splitNv instead of nameValue we can get just get rid of trimValue