3. Parsers, How Do They Work?

For convenience, let's reprint the successful parse example from the last page.

> K.parse(K.letter(), 'abcdef')
[
  {
    view: DataView { byteLength: 6, byteOffset: 0, buffer: [ArrayBuffer] },
    index: 1
  },
  { status: 'ok', value: 'a' }
]

We know by now that this pair of objects is the context followed by the result. Let's go ahead now and discuss what those actually are.

The context object

Here's the isolated context object from the successful parse result above.

{
  view: DataView { byteLength: 6, byteOffset: 0, buffer: [ArrayBuffer] },
  index: 1
}

Alright, that's fine, but the context is supposed to contain the input text itself. But where's the text? To answer that question, we're going to have to learn (just a little) about character encodings.

The most common way to represent characters in the modern age is an encoding called UTF-8. In this encoding, each character is one to four bytes long; the one-byte characters happen to have the same encodings as ASCII characters. This made UTF-8 popular, especially in the West: ASCII is already a subset of UTF-8, and with the multi-byte characters there's plenty of room to cover all of the other languages of the world (with plenty of room left over for emojis).

JavaScript comes from a time before UTF-8 was in wide use, and it represents its characters in a two-or-four-byte encoding.1 The merits of an encoding like this aren't in question; in some ways it's even better, because the less-variable length makes some things easier to do inside the language. But this encoding causes us some problems. In parsing, it's really important to have a solid concept of "character", and these four-byte characters are counted as two characters in most JavaScript operations. (For example, '🍔'.length === 2 is true even though the hamburger emoji is only one character.)
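To make the counting problem concrete, here is a small sketch using only standard JavaScript string operations (nothing from Kessel). It counts the same emoji by UTF-16 code units and by code points.

```javascript
// Counting "characters" in standard JavaScript.
const burger = '🍔'

// .length counts UTF-16 code units, so the surrogate pair counts as two.
const codeUnits = burger.length

// The string iterator walks code points, so spreading yields one element.
const codePoints = [...burger].length

// codePointAt reads the whole surrogate pair starting at index 0.
const scalar = burger.codePointAt(0).toString(16)

console.log(codeUnits, codePoints, scalar) // 2 1 1f354
```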

Alright, so what does this have to do with context? Basically, since there are some pitfalls to JavaScript string encoding, Kessel has to put in some safeguards, and since it already has to bother with encoding to that degree, it just chucks it all and uses UTF-8 encoding instead. That means it can't use JavaScript strings internally, so instead it uses arrays of UTF-8 encoded bytes.

When the context is created, the input text is converted into a UTF-8 byte array. That array is tucked away in the context object within a DataView, which is simply an object that makes it easier to access the byte array. You can see it if you look really close at that example context object again:

{
  view: DataView { byteLength: 6, byteOffset: 0, buffer: [ArrayBuffer] },
  index: 1
}

There it is: the buffer property of the view property in the context. An ArrayBuffer object is a byte array, and this one happens to hold UTF-8 encoded bytes. Let's look a little closer, since this view doesn't show us much about it.

> const [ctx1, res1] = K.parse(K.letter(), 'abcdef')
undefined
> ctx1.view.buffer
ArrayBuffer { [Uint8Contents]: <61 62 63 64 65 66>, byteLength: 6 }

There's the text! The contents of the array buffer are six numbers, which just happen to be (in hexadecimal) the UTF-8 byte values for the letters 'abcdef'. Mystery solved.2 3
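Here is a rough sketch of how a string could end up in a buffer like that one, using the standard TextEncoder. Whether Kessel builds its context this way internally is an assumption, and makeContext is a hypothetical name, not part of Kessel's API.

```javascript
// Hypothetical helper, not part of Kessel's API: builds a context-shaped
// object from a string using the standard TextEncoder.
function makeContext(text) {
  const bytes = new TextEncoder().encode(text) // Uint8Array of UTF-8 bytes
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  return { view, index: 0 }
}

const ctx = makeContext('abcdef')

// Read the bytes back out through the DataView, as hex strings.
const hex = []
for (let i = 0; i < ctx.view.byteLength; i++) {
  hex.push(ctx.view.getUint8(i).toString(16))
}
console.log(hex.join(' ')) // 61 62 63 64 65 66
```

The DataView is what gives byte-by-byte access; the ArrayBuffer underneath has no read methods of its own.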

As for the rest of the context — well, there's just the index. It just points at the byte that is the next to be read when the next parser is applied. The byte array itself doesn't change once the context is created, so rather than stripping bytes off the front (as in the simple example at the start of Chapter 1), this index is simply updated.

As a final note about context, let's see what happens to index in a parse failure.

> K.parse(K.letter(), '123456')
[
  {
    view: DataView { byteLength: 6, byteOffset: 0, buffer: [ArrayBuffer] },
    index: 0
  },
  { status: 'fail', errors: [ [Object] ] }
]

This time index does not move. When letter fails, it does not consume any input, so the next parse attempt will happen at the same place. index reflects this by remaining at 0 when letter fails. This happens with every parser in Kessel that is not a combinator — they are atomic, so when they fail, it's as though they never ran in the first place.4
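The index behavior on success and failure can be sketched with a stand-in parser. Both letterSketch and start are hypothetical names written for this illustration; Kessel's real letter parser may well be implemented differently.

```javascript
// Hypothetical stand-in for an ASCII letter parser; not Kessel's code.
function letterSketch(ctx) {
  const { view, index } = ctx
  if (index < view.byteLength) {
    const byte = view.getUint8(index)
    const isLetter =
      (byte >= 0x41 && byte <= 0x5a) || (byte >= 0x61 && byte <= 0x7a)
    if (isLetter) {
      // Success: same view, new context object with the index moved forward.
      return [
        { view, index: index + 1 },
        { status: 'ok', value: String.fromCharCode(byte) },
      ]
    }
  }
  // Failure: the context comes back untouched, so nothing was consumed.
  return [ctx, { status: 'fail', errors: [{ type: 'expected', label: 'a letter' }] }]
}

// Hypothetical helper to build a starting context from a string.
const start = text => {
  const bytes = new TextEncoder().encode(text)
  return {
    view: new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength),
    index: 0,
  }
}

const [okCtx, okRes] = letterSketch(start('abcdef'))
const [failCtx, failRes] = letterSketch(start('123456'))
console.log(okCtx.index, okRes.status) // 1 ok
console.log(failCtx.index, failRes.status) // 0 fail
```

Note that the byte array is never copied or trimmed in this sketch; only the small context object is replaced.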

The result object

Alright, that was the hard part. Results are easy. Let's isolate the result object of the passing letter parse from the top of this page:

{ status: 'ok', value: 'a' }

Doesn't get much more straightforward than that. The status is 'ok', which means that nothing failed. The value is 'a', and it's not coincidental that it's the same as the return value of the successful run invocation from the last chapter. And that's all there is to it.

So how does the failed case look?

{ status: 'fail', errors: [ [Object] ] }

This time, the status again tells us how it all worked out, in this case that the parse failed. With failure, there is no value property; in its place is errors, which is an array of objects describing errors that happened in the parse (there can be more than one, but there was not in this case). Here's a closer look at the object in that errors property.

> const [ctx2, res2] = K.parse(K.letter(), '123456')
undefined
> res2.errors[0]
{ type: 'expected', label: 'a letter' }

Each of the errors is just an object with a type and a label. These are used for making nice error messages. There are other types of errors that have different properties than this, but that's something we'll look at later in the guide.
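As an illustration of how objects like this could feed an error message, here is a minimal sketch. The describeExpected function and its message wording are invented for this example and are not Kessel's actual formatter or output.

```javascript
// Hypothetical formatter; the message wording here is invented for
// illustration and is not Kessel's actual output.
function describeExpected(errors) {
  const labels = errors
    .filter(e => e.type === 'expected')
    .map(e => e.label)
  if (labels.length === 0) return 'Unknown error'
  return 'Expected ' + labels.join(' or ')
}

const message = describeExpected([{ type: 'expected', label: 'a letter' }])
console.log(message) // Expected a letter
```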

Why parse?

Alright, so we've learned a lot about what goes on in the parser internals, but where does that get us in real life? Why wouldn't we just use run and not have to worry about all of these details?

Fact is, you may find that run is indeed best for you. But there are certainly reasons to use parse instead.

  1. You don't want an exception thrown on failure.
  2. You want to write your own error message formatter.
  3. Your parser is embedded within other code that will check for the parser's success or failure and act accordingly.
  4. You want the added detail for any number of other reasons.

If one of these reasons compels you to use parse, rest assured that there are some helper functions to let you avoid getting deep into the internals of a parser reply. Each of these functions takes the reply object that is returned by parse and picks out some part of it to return to you.

  • success returns the same thing as run does, except that it returns null on failure rather than throwing an exception.
  • failure does the opposite; it returns an error message on failure and returns null on success (again, no exception is thrown).
  • succeeded returns true if the parse was successful and false if it was not.
  • status returns 'ok', 'fail', or 'fatal' to tell how the parse went. (We'll talk about 'fatal' later in the guide.)
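To show the shape of what these helpers do, here are stand-ins written from the descriptions above. The names (statusOf, didSucceed, valueOrNull, messageOrNull) are hypothetical, the error text is a placeholder, and the replies are built by hand; none of this is Kessel's own code.

```javascript
// Stand-ins written from the descriptions above; these are not Kessel's
// implementations, and the error text is a placeholder.
const statusOf = ([, result]) => result.status
const didSucceed = reply => statusOf(reply) === 'ok'
const valueOrNull = reply => (didSucceed(reply) ? reply[1].value : null)
const messageOrNull = reply =>
  didSucceed(reply)
    ? null
    : reply[1].errors.map(e => `expected ${e.label}`).join(', ')

// Hand-built replies shaped like the ones shown earlier on this page
// (the view is omitted for brevity).
const okReply = [{ index: 1 }, { status: 'ok', value: 'a' }]
const failReply = [
  { index: 0 },
  { status: 'fail', errors: [{ type: 'expected', label: 'a letter' }] },
]

console.log(didSucceed(okReply), valueOrNull(okReply), messageOrNull(okReply))
// true a null
console.log(didSucceed(failReply), valueOrNull(failReply), messageOrNull(failReply))
// false null expected a letter
```

None of the stand-ins throws, which is the point of reason 1 in the list above: the caller decides what to do with a failure.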

Finally, there is formatErrors. This takes a failed reply and returns a detailed error message; it's the same error message that both run and failure use. Using formatErrors grants you access to more options, including the ability to send a custom formatting function in case you want something different out of the error messages. (See Formatter for information about what that function has to look like.)

So that's a pretty deep dive into parser internals, but at this point we've done nothing more than parse a single letter off the front of a string. Let's face it, that's not very useful. In the next chapter, we'll start to address that.


  1. JavaScript implementations can choose their encoding, as long as how it acts conforms to the spec. And that spec mandates something that is a weird amalgam of UCS-2 and UTF-16; it's not quite UCS-2 because it has four-byte characters through pairs of two-byte code units called surrogate pairs, but it isn't quite UTF-16 because it allows partial and reversed surrogate pairs.

  2. JavaScript would encode this same string as <61 00 62 00 63 00 64 00 65 00 66 00>.

  3. To show how UTF-8 works, here's the same sort of thing, except using the first six letters of the Russian (Cyrillic) alphabet (this uses uletter because letter only succeeds with ASCII letters):

    > const [ctxr, resr] = K.parse(K.uletter(), 'абвгде')
    undefined
    > ctxr.view.buffer
    ArrayBuffer {
      [Uint8Contents]: <d0 b0 d0 b1 d0 b2 d0 b3 d0 b4 d0 b5>,
      byteLength: 12
    }
    

    Twelve bytes to represent six characters; in UTF-8, Russian letters are two bytes long.

    A JavaScript string with the same content would be encoded <30 04 31 04 32 04 33 04 34 04 35 04>. Even though these particular characters are the same length in both encodings, the encoding values are entirely different. 

  4. Alright, one more thing about index: it is a byte index, not a character index. Let's see how it works in that Russian text again.

    > K.parse(K.uletter(), 'абвгде')
    [
      {
        view: DataView { byteLength: 12, byteOffset: 0, buffer: [ArrayBuffer] },
        index: 2
      },
      { status: 'ok', value: 'а' }
    ]
    

    This time index is incremented by 2, because Russian letters are two bytes long.
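    How far a byte index has to move for one character can be read off the first byte of that character's UTF-8 sequence. The sketch below shows the general technique; utf8Width is a hypothetical name, and this is not necessarily how Kessel computes it.

```javascript
// Number of bytes in a UTF-8 sequence, judged from its first byte.
// Assumes valid UTF-8 and that the byte is not a continuation byte.
function utf8Width(leadByte) {
  if (leadByte < 0x80) return 1 // 0xxxxxxx: ASCII
  if ((leadByte & 0xe0) === 0xc0) return 2 // 110xxxxx
  if ((leadByte & 0xf0) === 0xe0) return 3 // 1110xxxx
  return 4 // 11110xxx
}

const encoder = new TextEncoder()
console.log(utf8Width(encoder.encode('абвгде')[0])) // 2
console.log(utf8Width(encoder.encode('abcdef')[0])) // 1
console.log(utf8Width(encoder.encode('🍔')[0])) // 4
```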