A note about references: I wrote this in my text editor using GitHub-flavored markdown. I tried to maintain the formatting when putting the essay here, so you can search it with your browser's find function.
You can view the project here:
What is KDL?
KDL is a node-oriented document format that is designed to sit in the same "conceptual neighborhood" as XML.
This means that it models data as nested, named nodes while aiming to be friendlier to humans writing and reading configuration by hand.
The official spec explicitly frames KDL’s niche as overlapping with XML and notes it can be used both as a configuration language and as a data exchange/storage format.[^1]
The core idea is pretty straightforward: a KDL document is a sequence of nodes, and each node has a name, plus optional "entries" and/or a nested block of child nodes.
The spec describes a document as "zero or more Nodes… separated by newlines and whitespace," and emphasizes that nodes are the fundamental unit of structure.
From there, the design gets opinionated in a way that’s particularly convenient for parsing projects. Nodes can carry two kinds of payload that map cleanly to how many programmers already think about "structured lines":
Arguments (positional values), like CLI positional args
Properties (key/value pairs), like CLI flags/options
Arguments and properties can be interspersed, just as they can be on a command line, and are collectively treated as entries.
This is a key reason KDL tends to feel natural for configuration: it reads like a compact, structured command script rather than a sea of punctuation, as the snippet below illustrates.
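To make that concrete, here is an illustrative KDL snippet (the node names are made up for this essay, not taken from the spec), held in an F# string literal so every code block in the essay stays in one language:

```fsharp
// Illustrative KDL: "package" carries one argument ("my-pkg"), one
// property (version="1.2.3"), and a children block of nested nodes.
let sampleKdl =
    """
package "my-pkg" version="1.2.3" {
    dependencies {
        dep "kdl-fsharp" optional=#true
    }
}
"""
```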
A few details matter for implementors:
The current spec is for KDL 2.0.0, and the spec text identifies it as the latest stable version (with release date information included in the document).
KDL supports practical conveniences like node-level "slashdash" comments (`/- ...`) and line continuations using `\`, both highlighted in the official repo example and both shown in the snippet below.[^2]
The KDL site notes there are specs not just for KDL itself, but also for a KDL Query Language and a KDL Schema Language, which signals that KDL is meant to support more than "just config files."[^3]
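Both conveniences look like this in practice (another illustrative snippet held in an F# string; the node names are invented):

```fsharp
// Illustrative KDL: the slashdashed node is lexed and parsed but then
// discarded, and the trailing backslash continues the entries of
// "server" onto the next line.
let conveniencesKdl =
    """
/- disabled "this whole node is commented out"
server "primary" \
    port=8080 secure=#true
"""
```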
In short, KDL is a deliberately structured, node-centric format that’s pleasant for humans and very tractable for compiler-style tooling.
It gives you enough complexity to be interesting (strings, numbers, comments, children blocks, properties/args, continuations), without making writing a parser unmanageable.
What is Recursive Descent?
Recursive descent is a parsing strategy where you write a set of functions, typically one per grammar rule, and those functions call each other to recognize and build a structured representation of the input. It’s a top-down approach in which you start from the highest-level rule ("a document is a list of nodes…") and progressively descend into smaller constructs ("a node has a name… then entries… then an optional children block…").
Each rule is implemented as a function, and these functions are mutually recursive, each one recognizing a nonterminal (or token sequence) in the grammar.
This leaves authors with programs structured like the grammar itself.
Recursive descent is the most natural way to turn a language spec like KDL into readable code, especially in a functional-first, ML-style language like F#, where small, composable functions and explicit state passing (token stream + position) are idiomatic.
If a grammar is well-shaped, the resulting parser often reads like an executable version of the spec.
A typical recursive descent parser consumes a stream of tokens produced by a lexer. At a high level, the flow looks like:
Parse Document: repeatedly parses nodes until you reach EOF
Parse Node: parses a node name, then parses entries, then (optionally) parses a children block
Parse Entry: branches into argument vs property depending on token shape
Parse Value: parses strings, numbers, booleans, null, etc.
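A simplified, self-contained sketch of that flow in F# (toy token and node types invented here, no error recovery, and not the project's actual code) reads almost like the list above:

```fsharp
// A toy recursive-descent parser: one function per grammar rule,
// mutually recursive, consuming a token list top-down.
type SketchToken =
    | TIdent of string
    | TString of string
    | TEquals
    | TLBrace
    | TRBrace
    | TEof

type SketchNode =
    { Name: string
      Args: string list
      Props: (string * string) list
      Children: SketchNode list }

// "A document is a list of nodes": parse nodes until EOF or a closing brace.
let rec parseNodes tokens =
    match tokens with
    | [] | TEof :: _ | TRBrace :: _ -> [], tokens
    | _ ->
        let node, rest = parseNode tokens
        let siblings, rest' = parseNodes rest
        node :: siblings, rest'

// "A node has a name, then entries, then an optional children block."
and parseNode tokens =
    match tokens with
    | TIdent name :: rest ->
        let args, props, rest' = parseEntries rest
        let children, rest'' = parseChildren rest'
        { Name = name; Args = args; Props = props; Children = children }, rest''
    | _ -> failwith "expected a node name"

// Entries branch on token shape: ident '=' value is a property, a bare
// value is an argument; anything else ends the entry list.
and parseEntries tokens =
    match tokens with
    | TIdent key :: TEquals :: TString value :: rest ->
        let args, props, rest' = parseEntries rest
        args, (key, value) :: props, rest'
    | TString arg :: rest ->
        let args, props, rest' = parseEntries rest
        arg :: args, props, rest'
    | _ -> [], [], tokens

// An optional children block: '{' nodes '}'.
and parseChildren tokens =
    match tokens with
    | TLBrace :: rest ->
        let children, rest' = parseNodes rest
        match rest' with
        | TRBrace :: rest'' -> children, rest''
        | _ -> failwith "expected a closing brace"
    | _ -> [], tokens
```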
There are two major flavors of recursive descent that should be kept distinct:
Backtracking recursive descent: The parser tries an alternative, and if it fails, it rewinds and tries a different alternative. This is easy to write but can be inefficient in the worst case.
Predictive recursive descent: The parser chooses the correct alternative using limited lookahead (often 1 token). This is what many people mean by an LL(1) hand-written parser, as predictiveness is tied to the LL(k) class of grammars.[^4]
KDL’s syntax (especially things like "entries can be arguments or properties interspersed") encourages a style where you do small, local decisions based on the next token or two.
That leans toward predictive parsing, but you may still selectively backtrack in ambiguous-looking corners if you choose a simpler grammar representation.
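For example, the argument-vs-property decision can usually be made predictively; here is a hedged sketch with token names invented for illustration:

```fsharp
// Illustrative predictive decision: peek at the next token or two and
// commit to one alternative without rewinding.
type PeekToken =
    | PIdent of string
    | PString of string
    | PEquals

let classifyEntry tokens =
    match tokens with
    | PIdent _ :: PEquals :: _ -> "property"   // key=value shape ahead
    | PString _ :: _ -> "argument"             // a bare value
    | _ -> "not an entry"
```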
There’s also a closely related modern refinement called packrat parsing, which is often described as recursive descent + memoization. Packrat parsers are typically paired with Parsing Expression Grammars (PEGs) and use caching so that each parsing function at a given input position is evaluated at most once, making performance predictable (often linear time for PEGs).
Using recursive descent forced me to understand and confront the spec (what terminates a node? how do comments interact with newlines? what does "line continuation" actually do?), and to produce code that is explainable and readable from step to step.
Why FSharp?
F#, in addition to being my favorite programming language, was a good fit for parsing work because its core strengths align almost perfectly with what parsers do: represent tree-shaped data, decompose it safely, and make illegal states hard to express.
The best feature for this is algebraic data modeling via discriminated unions. Microsoft’s own guidance explicitly calls out discriminated unions as a natural way to represent tree data structures, especially recursive trees, which is exactly what an AST is.[^5]
When modeling syntax as a union like Node = | KdlNode of name * entries * children | ..., we get three benefits:
The AST becomes self-documenting and the "shape" of valid syntax is encoded directly in types.
Refactors become safer because when adding a new node kind, the compiler can force you to update all consumers.
We can prevent "half-parsed" structures from leaking into later phases.
Pattern matching aligns well with the overall flow of the program.
The official docs describe patterns as a way to compare data to logical structures and "decompose it into constituent parts," the operation we perform repeatedly when converting tokens → AST nodes and, later, AST nodes → semantic representations.[^6]
In practice, we write explicit cases and the logic stays close to the grammar.
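As a toy illustration (this union is invented for this paragraph, not the library's AST), the match reads like a grammar rule, and adding a case later makes the compiler flag every match that forgets it:

```fsharp
// Illustrative only: a miniature entry type and an exhaustive match.
type ToyEntry =
    | Arg of string
    | Prop of key: string * value: string

let render entry =
    match entry with
    | Arg value -> value
    | Prop (key, value) -> sprintf "%s=%s" key value
// If a new case were added to ToyEntry, the compiler would warn that this
// match is incomplete until the new case is handled.
```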
Active patterns are a way to define named partitions of input data that can then be used inside pattern matches.
That makes it natural to encode little recognizers like "is this token a valid bare identifier vs a quoted string?" or "does this lexeme represent an integer vs float?"
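A hedged sketch of that idea (the pattern names here are mine, not the library's):

```fsharp
open System

// Partial active patterns acting as small recognizers for lexemes.
let (|Integer|_|) (s: string) =
    match Int64.TryParse s with
    | true, n -> Some n
    | _ -> None

let (|Float|_|) (s: string) =
    match Double.TryParse s with
    | true, f -> Some f
    | _ -> None

// Ordering matters: Integer is tried before Float, so "42" is an integer.
let classifyNumber lexeme =
    match lexeme with
    | Integer n -> sprintf "integer %d" n
    | Float f -> sprintf "float %g" f
    | _ -> "not a number"
```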
Finally, F# has a pragmatic ecosystem for language tooling. The FsLexYacc project exists specifically to generate lexers and parsers for F#.
I decided to hand-write this parser, but the FsLexYacc runtime types and examples provided me with a solid reference model for token streams, position tracking, and lexer buffers (e.g., LexBuffer<'char>).[^7]
The result was a workflow where the compiler helped me keep the implementation honest: types define the AST, pattern matching defines the reduction from tokens to syntax, and the ecosystem can give me proven building blocks when needed without forcing me into a heavyweight compiler framework.
Meanwhile, F# itself is open-source and cross-platform on .NET, with great tooling. I wrote this on NixOS and Mac machines.
What is Lexing?
Lexing (or lexical analysis, scanning, tokenization) is the stage of the process where a raw stream of characters is converted to a stream of tokens, i.e. the meaningful "words" of a language.
A lexer should recognize categories like identifiers, strings, numbers, punctuation, and keywords to produce structured output that a parser can consume deterministically. The way I like to think about it is that parsing is about structure, but lexing is about letter and word boundaries.
Once you have tokens, the parser no longer needs to worry about low-level details like "what counts as a digit?" or "where does this string literal end?" It can focus on grammar rules like "a node consists of a name plus entries plus an optional children block."
This division of labor does a good job of reducing complexity by separating concerns.
Lexers are typically specified using a lexical grammar, and that grammar is often regular-expression shaped.
Lexical syntax is "usually a regular language," with rules often written as regular expressions describing possible lexemes (matched character sequences) for each token kind.[^8] Basically, lexing is where you decide what counts as IDENT("foo"), STRING("bar"), LBRACE, RBRACE, EQUALS, and so on.
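In F#, those decisions become the cases of a token union; a simplified sketch (the project's actual Token type has more cases than this):

```fsharp
// Simplified token union: each case is one "word" category the lexer can
// emit. The real type in the project carries more cases than shown here.
type Token =
    | Ident of string
    | String of string
    | Number of string
    | LBrace
    | RBrace
    | Equals
    | Newline
    | SlashDash
    | Eof
```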
Two practical issues I had to wrangle with are worth calling out:
Whitespace and comments are commonly recognized by the lexer but then discarded (i.e., they do not produce tokens) unless your downstream needs them.
Many languages treat newline as a terminator unless escaped. Line continuation is generally done in the lexer, where the backslash+newline may be discarded so the parser never sees a terminator. This matters because KDL has explicit concepts like nodes separated by newlines, and also supports conveniences like line continuations, as sketched below.
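A minimal sketch of that continuation rule (not the project's lexer, and ignoring the whitespace and comments the spec also allows after the backslash):

```fsharp
// If the scanner is between tokens and sees a backslash immediately
// followed by a newline, skip both so no terminator is emitted there.
let skipLineContinuation (source: string) (pos: int) : int =
    if pos + 1 < source.Length && source.[pos] = '\\' then
        match source.[pos + 1] with
        | '\n' -> pos + 2                                        // "\" + LF
        | '\r' when pos + 2 < source.Length && source.[pos + 2] = '\n' ->
            pos + 3                                              // "\" + CRLF
        | _ -> pos
    else
        pos
```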
My Implementation
My pipeline sends raw text through the Lexer module, which produces a strongly-typed token stream, and the Parser module consumes that stream to yield a Document (a list of nodes).
Staying close to the spec means the lexer must handle slashdash /- comments, type annotations like (uuid), keywords such as #true / #false, and the "line continuation by escaped newline" rule so the parser never has to worry about these invisible terminators.
Throughout the procedure, the mutable LexerState tracks Pos, Line, and Col, while the Token discriminated union captures the shapes the parser cares about (Ident, String, Number, LBrace, SlashDash, and so on).
Once characters are tokenized, the parser wraps them in a tiny TokenStream helper that provides constant-time lookahead and safe advancement.
The public entry point, Parser.parse, simply wires together the lexer, constructs the stream, and delegates the real work to parseDocument:
```fsharp
// KDLFSharp.Core/Parser.fs
let parse (source: string) : ParseResult<Document> =
    let nodes, errs =
        parseDocument
            { Tokens = Lexer.tokenize source |> Array.ofList
              Index = 0 }
    if errs.IsEmpty then Ok nodes else Error errs
```

From there, mutually recursive helpers keep the grammar readable. parseNodes iterates until it hits Eof/RBrace, parseEntries handles the "arguments and properties can be interleaved" rule, and parseOptChildren watches for LBrace to recurse.
A few guard rails keep the grammar honest: newlineContinuesEntries peeks past Newline tokens so arguments can spill onto the next line when the spec allows it, and skipOneEntry enforces slashdash semantics by skipping exactly the next entry (children block or otherwise) before resuming normal parsing. Every failure produces an AstError with as much context as we can recover, so callers get a Result for error handling.
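For reference, the lookahead machinery itself is tiny. The TokenStream record below matches the shape used in the parse snippet above; the peek/advance helpers are a sketch of the "constant-time lookahead and safe advancement" idea, with names chosen for illustration rather than taken from the module:

```fsharp
// Sketch of the stream helpers (assumes the lexer's Token union).
type TokenStream = { Tokens: Token[]; Index: int }

let peek (stream: TokenStream) : Token option =
    if stream.Index < stream.Tokens.Length
    then Some stream.Tokens.[stream.Index]
    else None

let advance (stream: TokenStream) : TokenStream =
    { stream with Index = stream.Index + 1 }
```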
What is an AST?
An abstract syntax tree (AST) is a tree-shaped data structure that represents the meaning of an input written in a formal language. Each node corresponds to a construct in the source (or document), and the tree captures how those constructs compose. The syntax is called "abstract" because the tree doesn't preserve every surface detail of the original text.
A parse tree (sometimes called a concrete syntax tree/CST) is tightly coupled to the grammar productions used to recognize the input. It often includes intermediate nonterminals and punctuation-level artifacts that are necessary for proving syntactic correctness but are noise for later stages. The parse tree shows how grammar productions derive the input, while the AST removes nodes that are unnecessary for later processing.
If you’re building tooling (formatters, linters, analyzers, transpilers, schema validators), the AST is the representation you actually want to hold onto: it is compact and semantic-ish, and it makes downstream passes simpler because you’ve already decided what structure counts.
In the context of a document language like KDL, an AST becomes the backbone for everything you might want to do after "it parses": normalizing formatting, enforcing a schema, computing derived values, evaluating queries, transforming KDL into JSON/YAML (features I plan to implement!), or generating diagnostics that point to the exact construct that’s malformed.
The AST is also the natural place to attach metadata (source spans, comments you chose to preserve, inferred types of values, etc.); ASTs are commonly annotated during later processing (contextual/semantic analysis) and are widely used in program analysis and transformation systems.
The AST is the contract between "text" and "meaning," and when it’s well-designed, everything else in a pipeline can be easier.
How I Implemented the AST
The library's AST layer mirrors how the KDL spec describes nodes where each node has an optional type annotation, a name, positional arguments, keyed properties, and (optionally) a block of child nodes. This mapped cleanly to a record type plus a handful of supporting unions, all defined in KDLFSharp.Core/AST.fs:
```fsharp
type Node =
    { TypeAnn: TypeAnnotation option
      Name: string
      Arguments: Value list
      Properties: Property list
      Children: Node list
      Span: SourceSpan option
      OriginalText: string option }

type Property = { Key: string; Value: Value }

type Document = Node list
```

Using a discriminated union for Value let me encode every literal form (strings, numbers, booleans, #null, and eventually inline child blocks) while leaving room for optional per-value type annotations the spec allows:
```fsharp
and Value =
    | String of string * TypeAnnotation option
    | Number of NumberLiteral * TypeAnnotation option
    | Boolean of bool * TypeAnnotation option
    | Null of TypeAnnotation option
    | NodeValue of Node list * TypeAnnotation option
```

The extra metadata (SourceSpan, OriginalText) lets higher layers emit precise diagnostics or round-trip formatting hints without cluttering the core model. When I need to craft nodes programmatically (for example, when converting to KDL), the Builder helpers offer small composable constructors (Builder.node, Builder.prop, Builder.str).
What is Parsing?
Parsing (or syntax/syntactic analysis) is the process of taking a sequence of symbols (the token stream produced by the lexer) and determining whether that sequence can be derived from a formal grammar, while simultaneously constructing a structural representation of the input.
In compiler terms, parsing sits squarely in the front end by turning tokens into a tree (a parse tree and/or an AST).
Parsing is where "flat" becomes "nested." Tokens are a linear stream. Documents and programs are inherently hierarchical so nodes contain children, expressions contain subexpressions, statements contain expressions, and blocks contain statements.
A parser’s job is to reconstruct that hierarchy reliably.
There are two big families of parsing strategies:
Top-down parsing, which starts from the highest-level rule and tries to match the input by expanding productions (recursive descent is the hand-written archetype).
Bottom-up parsing, which starts from the tokens and repeatedly reduces token sequences into higher-level constructs (LR-family parsers are the classic example).
These families come with different trade-offs. Hand-written top-down parsers are typically easier to read and explain, while bottom-up generators can handle a wider class of grammars mechanically. What matters for KDL is that parsing both validates the input and constructs the structural representation we care about. Basically, parsing is where correctness and usability meet: a technically correct parser that produces terrible error messages is still a bad parser.
In practice, parsing includes tracking positions, reporting expected tokens, recovering enough to continue (when appropriate), and making sure the output tree corresponds to the user’s mental model of the language.
Putting it all together
All of the individual pieces stay intentionally decoupled, but the workflow from "text file" to "typed document" is tiny enough to explain on a single page.
The CLI (and tests) call one function, Parser.parse, which in turn shells out to Lexer.tokenize and then threads the resulting tokens through our recursive-descent machinery. Because Parser.parse returns Result<Document, AstError list>, callers can immediately branch on success vs. failure. The CLI can (and does) format aggregated errors with spans, while library consumers can map them to whatever diagnostic type they prefer.
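A hypothetical caller, with the input text and the printing made up for illustration but the Ok/Error shape taken from the signature above:

```fsharp
// Branch on the Result returned by Parser.parse.
match Parser.parse "package \"my-pkg\" version=\"1.2.3\"" with
| Ok document ->
    printfn "parsed %d top-level node(s)" (List.length document)
| Error errors ->
    errors |> List.iter (fun err -> eprintfn "%A" err)
```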
The last mile was making the library pleasant to use, which is where the Builder helpers and discriminated unions come in.
Tests that round-trip sample snippets or fabricate small ASTs read like executable spec prose:
```fsharp
let sample =
    Builder.node
        "package"
        [ Builder.str "my-pkg" ]
        [ Builder.prop "version" (Builder.str "1.2.3") ]
        []
```

Because Document is just Node list, composing or transforming KDL becomes a matter of operating on ordinary F# collections. When we implement a JSON serializer, we'll simply map over nodes, inspect Arguments/Properties, and serialize. To enforce a schema, we walk the tree, collect issues into AstError values, and reuse the same reporting surface the parser uses.
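For instance, here is a sketch of the kind of tree walk those passes rely on, collecting every node name in a document (using the Node fields defined earlier; the helper name is mine):

```fsharp
// Walk the document recursively, children included.
let rec collectNames (nodes: Node list) : string list =
    nodes
    |> List.collect (fun node -> node.Name :: collectNames node.Children)
```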
Again, every stage mirrors the spec closely enough that debugging is more about reading the source grammar than reverse-engineering a framework.
Conclusion
The implementation leans on simple ideas, namely deterministic lexing, predictable recursive descent, and a type-safe AST, but the combination pays off.
KDL’s human-friendly surface (slashdash comments, line continuations, optional type annotations) is fully supported because the lexer shoulders the awkward bits up front, letting the parser read almost like a narration of the spec.
F#’s discriminated unions and pattern matching make it practical to encode every value form and to keep illegal states from leaking further down the pipeline, while Result keeps downstream consumers in charge of error handling.
Most importantly, the code stays readable. If someone wants to extend the implementation, they can start at the AST, follow the flow back through the parser, and land in the lexer without fighting hidden abstractions. I've kept the project intentionally small enough to teach from, but sturdy enough to power real tooling.
Special thanks to @lanayx.bsky.social for allowing me to contribute this year, and to you the reader. Happy Holidays!
References
[^1]: https://kdl.dev/spec/ "The KDL Document Language Spec"
[^2]: https://github.com/kdl-org/kdl "kdl-org/kdl: the kdl document language specifications"
[^3]: https://kdl.dev/ "The KDL Document Language"
[^4]: https://en.wikipedia.org/wiki/Recursive_descent_parser "Recursive descent parser"
[^5]: https://learn.microsoft.com/en-us/dotnet/fsharp/language-reference/discriminated-unions "Discriminated Unions - F#"
[^6]: https://learn.microsoft.com/en-us/dotnet/fsharp/language-reference/pattern-matching "Pattern Matching - F#"
[^7]: https://fsprojects.github.io/FsLexYacc/reference/fsharp-text-lexing-lexbuffer-1.html "LexBuffer<'char> (FsLexYacc)"
[^8]: https://en.wikipedia.org/wiki/Lexical_analysis "Lexical analysis"