Parsing Markdown is tricky. It is mostly about recognizing line-based block containers first, then interpreting character-level inline syntax second. This makes Markdown parsers less like typical programming-language parsers and more like layered document parsers.
In contrast, a programming language parser is about turning a flat token stream into a nested syntax tree according to a grammar.
The CommonMark spec explicitly describes a two-phase parsing model [1]:
1. first determine the document’s block structure, then
2. parse inline text inside paragraphs, headings, and other blocks.
Link reference definitions are collected during the block phase and used later during inline parsing.
MDZ
As a library, markdown z parses to an AST first, then exposes that same tree through an event iterator or the HTML renderer, depending on your use case.
Parsing Pipeline
At a glance, markup goes through this pipeline, starting from source bytes:
1. block parser splits source into Line records
2. block parser builds block nodes with raw text children
3. inline parser replaces raw text with inline nodes
4. callers use the AST, event iterator, or HTML renderer
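The first pipeline step can be sketched as a small line splitter. This is a minimal illustration, not mdz's actual code: `LineSpan` and `nextLine` are hypothetical names, and the sketch only assumes that a line record stores byte offsets into the source and that `\n`, `\r\n`, and a lone `\r` all end a line, as CommonMark specifies.

```zig
const std = @import("std");

// Hypothetical record, loosely modeled on the "Line records" in the pipeline.
const LineSpan = struct {
    start: usize, // offset of the first byte of the line
    content_end: usize, // offset just past the content, excluding the line ending
    end: usize, // offset just past the line ending
};

// Returns the line starting at `pos`, or null once the source is exhausted.
fn nextLine(source: []const u8, pos: usize) ?LineSpan {
    if (pos >= source.len) return null;
    var i = pos;
    while (i < source.len and source[i] != '\n' and source[i] != '\r') : (i += 1) {}
    var end = i;
    if (end < source.len) {
        // Treat "\r\n" as a single line ending.
        if (source[end] == '\r' and end + 1 < source.len and source[end + 1] == '\n') {
            end += 2;
        } else {
            end += 1;
        }
    }
    return LineSpan{ .start = pos, .content_end = i, .end = end };
}

pub fn main() void {
    const source = "# Title\r\n\r\nSome *text*\nlast line";
    var pos: usize = 0;
    while (nextLine(source, pos)) |line| {
        std.debug.print("[{d}..{d}) \"{s}\"\n", .{
            line.start,
            line.content_end,
            source[line.start..line.content_end],
        });
        pos = line.end;
    }
}
```

Keeping offsets rather than copies means later phases can slice the original source without allocating.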
Blocks
Blocks (headings, lists, block quotes, etc.) are line-oriented. We scan the document line by line and decide what each line begins, continues, or closes. Most of the complexity comes from precedence rules, which require us to handle:
- nested lists
- lazy continuation lines
- block quotes inside lists
- distinguishing indented code from list continuation
- deciding when a paragraph should be interrupted by another block
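Several of these cases depend on measuring indentation in visual columns rather than bytes. The sketch below shows one way to do that, assuming CommonMark's tab stop of 4; `Indent`, `measureIndent`, and `exceedsCodeIndent` are hypothetical helpers, and the check deliberately ignores blank lines and the rule that indented code cannot interrupt a paragraph.

```zig
const std = @import("std");

// CommonMark treats a tab as advancing to the next multiple of 4 columns.
const tab_stop: usize = 4;

// Hypothetical result type for a leading-whitespace scan.
const Indent = struct {
    columns: usize, // visual width of the leading whitespace
    bytes: usize, // number of bytes consumed
};

fn measureIndent(line: []const u8) Indent {
    var columns: usize = 0;
    var bytes: usize = 0;
    while (bytes < line.len) : (bytes += 1) {
        switch (line[bytes]) {
            ' ' => columns += 1,
            '\t' => columns += tab_stop - (columns % tab_stop),
            else => break, // first non-whitespace byte ends the scan
        }
    }
    return Indent{ .columns = columns, .bytes = bytes };
}

// A line is a candidate for indented code only if it is indented at least
// four columns beyond the content column of its enclosing container.
fn exceedsCodeIndent(line: []const u8, container_column: usize) bool {
    return measureIndent(line).columns >= container_column + 4;
}

pub fn main() void {
    const samples = [_][]const u8{ "    code", "  \tcode", "   text" };
    for (samples) |line| {
        const indent = measureIndent(line);
        std.debug.print("columns={d} bytes={d} code={}\n", .{
            indent.columns,
            indent.bytes,
            exceedsCodeIndent(line, 0),
        });
    }
}
```

With a list item whose content starts at column 2, the same six-space line that continues the item is also deep enough to open indented code inside it, which is why the container's column has to be part of the check.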
In mdz, block parsing produces Line structs, with a LineCursor tracking the byte index and visual column:
const Line = struct {
    start: usize,
    content_end: usize,
    end: usize,
    source_start: usize,
    source_content_end: usize,
    text: []const u8,
};

const LineCursor = struct {
    line_index: usize,
    byte_index: usize = 0,
    column: usize = 0,
};
References
Markdown supports reference-style links. During block parsing, the parser collects link reference definitions such as [id]: /url, so that a later reference like [Label][id] resolves to whatever URL or path the definition points to.
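Label lookup is looser than a byte comparison: CommonMark matches labels case-insensitively, ignores leading and trailing whitespace, and collapses internal whitespace runs. The sketch below illustrates that matching for ASCII only (the spec uses Unicode case folding). `LabelIter`, `labelsMatch`, `Definition`, and `resolve` are hypothetical names, and the linear scan stands in for whatever map a real implementation would use.

```zig
const std = @import("std");

// Iterates the normalized form of a label: trimmed, ASCII-lowercased, and
// with each internal whitespace run collapsed into a single space.
const LabelIter = struct {
    text: []const u8,
    pos: usize = 0,

    fn init(raw: []const u8) LabelIter {
        return .{ .text = std.mem.trim(u8, raw, &std.ascii.whitespace) };
    }

    fn next(self: *LabelIter) ?u8 {
        if (self.pos >= self.text.len) return null;
        const c = self.text[self.pos];
        if (std.ascii.isWhitespace(c)) {
            while (self.pos < self.text.len and std.ascii.isWhitespace(self.text[self.pos])) {
                self.pos += 1;
            }
            return ' ';
        }
        self.pos += 1;
        return std.ascii.toLower(c);
    }
};

fn labelsMatch(a: []const u8, b: []const u8) bool {
    var ia = LabelIter.init(a);
    var ib = LabelIter.init(b);
    while (true) {
        const ca = ia.next();
        const cb = ib.next();
        if (ca == null or cb == null) return ca == null and cb == null;
        if (ca.? != cb.?) return false;
    }
}

const Definition = struct { label: []const u8, destination: []const u8 };

// Returns the first matching definition; in CommonMark the first one wins.
fn resolve(defs: []const Definition, label: []const u8) ?[]const u8 {
    for (defs) |def| {
        if (labelsMatch(def.label, label)) return def.destination;
    }
    return null;
}

pub fn main() void {
    const defs = [_]Definition{
        .{ .label = "Zig Docs", .destination = "/docs" },
    };
    if (resolve(&defs, "  zig   docs ")) |dest| {
        std.debug.print("resolved to {s}\n", .{dest});
    }
}
```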
Inlines
Once blocks are known, we parse inline content inside headings, paragraphs, table cells, etc. Emphasis parsing is the trickiest part here because we have to track delimiters: whether a run of * or _ can open emphasis, close it, or both depends on the surrounding characters. The CommonMark spec handles this with a delimiter stack, which is separate from the container stack that tracks block structure.
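The open/close decision follows CommonMark's left-flanking and right-flanking rules. The sketch below applies them to a single delimiter run given the bytes on either side. It is ASCII-only (the spec is defined over Unicode whitespace and punctuation), and `Flanking`, `isPunct`, and `classify` are hypothetical names rather than mdz's API.

```zig
const std = @import("std");

const Flanking = struct { can_open: bool, can_close: bool };

// ASCII punctuation: printable, not alphanumeric, not a space.
fn isPunct(c: u8) bool {
    return c > ' ' and c < 0x7f and !std.ascii.isAlphanumeric(c);
}

// `before` and `after` are the bytes adjacent to the delimiter run.
// null means start or end of the text, which counts as whitespace.
fn classify(delim: u8, before: ?u8, after: ?u8) Flanking {
    const before_ws = if (before) |c| std.ascii.isWhitespace(c) else true;
    const after_ws = if (after) |c| std.ascii.isWhitespace(c) else true;
    const before_punct = if (before) |c| isPunct(c) else false;
    const after_punct = if (after) |c| isPunct(c) else false;

    // Left-flanking: not followed by whitespace, and either not followed by
    // punctuation or preceded by whitespace/punctuation. Right-flanking mirrors it.
    const left = !after_ws and (!after_punct or before_ws or before_punct);
    const right = !before_ws and (!before_punct or after_ws or after_punct);

    if (delim == '_') {
        // Underscores are stricter so that snake_case words stay literal.
        return .{
            .can_open = left and (!right or before_punct),
            .can_close = right and (!left or after_punct),
        };
    }
    return .{ .can_open = left, .can_close = right };
}

pub fn main() void {
    const intraword = classify('_', 'o', 'b'); // the "_" in "foo_bar"
    std.debug.print("open={} close={}\n", .{ intraword.can_open, intraword.can_close });
}
```

Each classified run would then be pushed onto the delimiter stack, and closers are paired with the nearest compatible opener.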
Container Stack
The container stack tracks open block structures: document, block quote, list, item, paragraph, code, and HTML block. On each line the parser first calls matchOpenContainers and tries to continue the existing stack before starting anything new. This is why block quote and list code feels inverted at first. The parser is not only asking “what starts on this line?” It first asks “which of the containers from the previous line are still open?” [2]
When a line no longer matches an open container, the parser shrinks the stack. New block recognizers then run under the deepest matched parent.
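The matching step can be sketched as a walk over the open containers from the outermost inward, with each container trying to consume its continuation prefix. This is a simplified illustration, not mdz's matchOpenContainers: the `Container` union, `Match`, and `matchOpenContainersSketch` are hypothetical, and tabs, blank lines, and lazy continuation are left out.

```zig
const std = @import("std");

const Container = union(enum) {
    document,
    block_quote,
    list_item: usize, // content indent of the item, in columns
};

// How many open containers the line continues, and where its content starts.
const Match = struct { matched: usize, offset: usize };

fn matchOpenContainersSketch(stack: []const Container, line: []const u8) Match {
    var offset: usize = 0;
    var matched: usize = 0;
    for (stack) |container| {
        switch (container) {
            .document => {}, // the document always continues
            .block_quote => {
                // Up to three spaces, then '>', then one optional space.
                var i = offset;
                while (i < line.len and line[i] == ' ' and i - offset < 3) : (i += 1) {}
                if (i >= line.len or line[i] != '>') break;
                i += 1;
                if (i < line.len and line[i] == ' ') i += 1;
                offset = i;
            },
            .list_item => |indent| {
                // The line must be indented to the item's content column.
                var i = offset;
                while (i < line.len and line[i] == ' ' and i - offset < indent) : (i += 1) {}
                if (i - offset < indent) break;
                offset = i;
            },
        }
        matched += 1;
    }
    return .{ .matched = matched, .offset = offset };
}

pub fn main() void {
    const stack = [_]Container{ .document, .block_quote, .{ .list_item = 2 } };
    const m = matchOpenContainersSketch(&stack, "> next paragraph");
    std.debug.print("matched {d} of {d} containers, content at byte {d}\n", .{
        m.matched,
        stack.len,
        m.offset,
    });
}
```

The caller would keep `stack[0..m.matched]` as the surviving containers and run the block recognizers on `line[m.offset..]`.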
Output
A parser commonly produces either an AST or an event stream.
An AST is good for transformations, linting, editor tooling, and custom renderers. It is especially powerful when you need to understand, inspect, modify, or relate parts of a document, because every node is addressable. [3]
Event streams are good for fast rendering because they avoid building a whole tree before output starts.
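The two outputs are closely related: a depth-first walk over a tree yields the same enter/exit sequence an event stream would carry. The sketch below shows that relationship with a toy tree. `Kind`, `Node`, `Event`, and `emit` are hypothetical and much smaller than a real AST, and the caller is assumed to supply a buffer with room for two events per node.

```zig
const std = @import("std");

const Kind = enum { document, paragraph, emphasis, text };

const Node = struct {
    kind: Kind,
    children: []const Node,
};

const Event = struct {
    phase: enum { enter, exit },
    kind: Kind,
};

// Depth-first walk: writes enter/exit events into `out` starting at `len`
// and returns the new length. `out` must hold two events per node.
fn emit(node: Node, out: []Event, len: usize) usize {
    var n = len;
    out[n] = .{ .phase = .enter, .kind = node.kind };
    n += 1;
    for (node.children) |child| {
        n = emit(child, out, n);
    }
    out[n] = .{ .phase = .exit, .kind = node.kind };
    n += 1;
    return n;
}

pub fn main() void {
    // Tree for a paragraph like: text *text*
    const tree = Node{ .kind = .document, .children = &[_]Node{
        .{ .kind = .paragraph, .children = &[_]Node{
            .{ .kind = .text, .children = &[_]Node{} },
            .{ .kind = .emphasis, .children = &[_]Node{
                .{ .kind = .text, .children = &[_]Node{} },
            } },
        } },
    } };

    var buf: [16]Event = undefined;
    const n = emit(tree, &buf, 0);
    for (buf[0..n]) |ev| {
        std.debug.print("{s} {s}\n", .{ @tagName(ev.phase), @tagName(ev.kind) });
    }
}
```

A renderer that consumes events in this order can write opening and closing tags without ever holding a reference to the tree itself.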