003. I wrote a tiny Markdown parser

I have been trying to build small implementations of cool software. This time I made a (CommonMark-style) Markdown parser in under 500 lines of Go.

Before I go into the implementation, I wanna discuss why I'm spending my time on something that libraries like Goldmark already do.

Why do this in the first place?

I think that whenever I create something from scratch, I understand it better than I would by just reading tutorials about it. Okay, Markdown isn't difficult by any stretch, but I'm still very interested in parsers and interpreters. I have my own toy language I've been writing based on Thorsten Ball's Writing an Interpreter/Compiler in Go books.

Another thing is: I wanted to see if I could build something I could plug in anywhere. One of my projects is rewriting Obsidian in my own way. The philosophy of the app is "file over app", which is incredible! It means that if Obsidian suddenly disappears, my notes still exist. However, something like Obsidian would need to rise in its place for my workflow to stay the same.

That means someone has to build it. Why can't I build something minimal that works for me? This is mostly a personal interest, not something I want to grow enough to "compete" with Obsidian. I hope they're still around for decades.

The last reason I built tinymd is that I've also been thinking about an interactive fiction framework that works with Markdown. Right now we have different tools like Twine, Inkle's Ink, and things like that. But since I write my stories in Markdown files, why not try to build something like this?

So eventually I want to build two things: an Obsidian Flavored Markdown extension and an Interactive Flavored Markdown extension to tinymd. Now that I understand how Markdown parsers generally work, I can think about those things.

The last, secret reason I did this: I want to, and I find it fun!

Let's get back on track.

How I built it

A parser does this: it takes some text or type of code and breaks it into whatever structure we need from it. In my case, I want to take a .md file and make my code understand that a # symbol means a header, that a paragraph of text means literally a paragraph of text, and that something between _ should be italicized.

Then I transform it into HTML. That way, I add it to web pages.

So to start my parser, I need it to read every line and "word" of my input and always know where it is. You can think of it like the computer's cursor on a blank page: it starts at line 0, position 0. If I write "hello" and put my cursor at the end of the word, then I'm at line 0, position 5.

But if we add a "#" before the word, when our cursor starts checking the document, it will know that this is more complex than just a word, it can mean a header block. That's why we need to store those things somewhere.

type Parser struct {
	lines []string
	pos int
}

But how do we tell our code that a header is, in fact, a header? We check it with our AST Node.

An AST node is a component of an Abstract Syntax Tree, which represents a construct in the source code, such as a variable, operator, or statement.

For example, "#" is the token, Header is the node and <h1></h1> is the final transformation. When our parser reaches a # token, it will call the piece of code to transform it into the header tag.

Some nodes can also have children. For example, ordered lists have list item children. So here's how we create our Node struct:

// Node represents a simple AST node
type Node struct {
	Type string // "p", "h1"-"h6", "em", "strong", "code", "pre", "ul", "ol", "li", "a", "img"
	Content string // text content or href for links, src for images
	Children []*Node // child nodes
	Attributes map[string]string // attributes like alt for images
}

func NewNode(nodeType, content string) *Node {
	return &Node{
		Type: nodeType,
		Content: content,
		Children: make([]*Node, 0),
		Attributes: make(map[string]string),
	}
} 

func (n *Node) AddChild(child *Node) {
	n.Children = append(n.Children, child)
}
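To see how these pieces fit together, here's a tiny standalone sketch that hand-builds the AST for a two-item bullet list. It includes minimal copies of the helpers above so it runs on its own:

```go
package main

import "fmt"

// Minimal copies of the article's Node helpers so this snippet runs standalone.
type Node struct {
	Type     string
	Content  string
	Children []*Node
}

func NewNode(nodeType, content string) *Node {
	return &Node{Type: nodeType, Content: content}
}

func (n *Node) AddChild(child *Node) {
	n.Children = append(n.Children, child)
}

func main() {
	// The AST for a two-item bullet list: ul -> li -> text.
	list := NewNode("ul", "")
	for _, item := range []string{"first", "second"} {
		li := NewNode("li", "")
		li.AddChild(NewNode("text", item))
		list.AddChild(li)
	}
	fmt.Println(list.Type, len(list.Children))        // ul 2
	fmt.Println(list.Children[0].Children[0].Content) // first
}
```

The renderer later walks exactly this kind of tree, turning the ul node into `<ul>` and each li child into `<li>`.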

Now we are going to start working on the actual parser functions. Since I'm talking a lot about headers, we'll start with them. Since I didn't create a lexer (the piece responsible for walking and peeking at the input and identifying tokens), I'm using RegEx to recognize what token I'm seeing at each position.

Not gonna lie, I actually hate RegEx and never properly learned it; I just hacked together what I needed with https://regex101.com/.

// parseHeading parses ATX headings (# Header)
func (p *Parser) parseHeading(line string) *Node {
	re := regexp.MustCompile(`^(#{1,6})\s+(.+)$`)
	matches := re.FindStringSubmatch(line)
	if matches == nil {
		return nil
	}

	level := len(matches[1])
	text := matches[2]
	
	heading := NewNode(fmt.Sprintf("h%d", level), "")
	heading.Children = p.parseInlines(text)
	
	p.pos++

	return heading
}

I'm gonna talk about parseInlines soon! First, let me break down what the RegEx (^(#{1,6})\s+(.+)$) actually does: ^ anchors the match at the start of the line, (#{1,6}) captures one to six # characters, \s+ requires at least one space, and (.+)$ captures everything else up to the end of the line.

So it matches anything that looks like this:

### This is a header until the end of the line

This is no longer a header.

So we have two capture groups: one holds the # characters that decide the header tag (hx), and the other holds the text. We create a new node from these and build the structure so we can render HTML later.
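As a quick sanity check, here's a standalone sketch of how those two capture groups come out of Go's regexp package. The headingParts helper is mine for illustration, not part of tinymd:

```go
package main

import (
	"fmt"
	"regexp"
)

// headingParts splits an ATX heading line into its level and text,
// mirroring what parseHeading does with the two capture groups.
func headingParts(line string) (int, string) {
	re := regexp.MustCompile(`^(#{1,6})\s+(.+)$`)
	matches := re.FindStringSubmatch(line)
	if matches == nil {
		return 0, ""
	}
	return len(matches[1]), matches[2]
}

func main() {
	level, text := headingParts("### This is a header until the end of the line")
	fmt.Println(level, text) // 3 This is a header until the end of the line

	// Seven hashes (or a missing space) doesn't match, so no heading.
	level, _ = headingParts("####### seven hashes is not a heading")
	fmt.Println(level) // 0
}
```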

Now we also need to parse the inline elements.

This is because we can have several things inside different tags. We can have italic text inside a paragraph, we can have images or links inside blocks. Or we can have inline code.

Since this one is kinda straightforward, let's add it now:

// parseInlines parses inline elements within text
func (p *Parser) parseInlines(text string) []*Node {
	var nodes []*Node

	// Simple regex-based inline parsing.
	// Order matters: code first (to avoid parsing emphasis inside code)
	patterns := []struct {
		re   *regexp.Regexp
		repl func([]string) []*Node
	}{
		// Inline code: `code`
		{
			regexp.MustCompile("`([^`]+)`"),
			func(matches []string) []*Node {
				return []*Node{NewNode("code", matches[1])}
			},
		},
		// Images: ![alt](url) - must come before links!
		{
			regexp.MustCompile(`!\[([^\]]*)\]\(([^)]+)\)`),
			func(matches []string) []*Node {
				img := NewNode("img", matches[2])   // src in Content
				img.Attributes["alt"] = matches[1] // alt text in Attributes
				return []*Node{img}
			},
		},
		// Links: [text](url)
		{
			regexp.MustCompile(`\[([^\]]+)\]\(([^)]+)\)`),
			func(matches []string) []*Node {
				link := NewNode("a", matches[2]) // href in Content
				link.AddChild(NewNode("text", matches[1]))
				return []*Node{link}
			},
		},
		// Strong: **text** or __text__
		{
			regexp.MustCompile(`\*\*([^*]+)\*\*|__([^_]+)__`),
			func(matches []string) []*Node {
				text := matches[1]
				if text == "" {
					text = matches[2]
				}
				strong := NewNode("strong", "")
				strong.AddChild(NewNode("text", text))
				return []*Node{strong}
			},
		},
		// Emphasis: *text* or _text_
		{
			regexp.MustCompile(`\*([^*]+)\*|_([^_]+)_`),
			func(matches []string) []*Node {
				text := matches[1]
				if text == "" {
					text = matches[2]
				}
				em := NewNode("em", "")
				em.AddChild(NewNode("text", text))
				return []*Node{em}
			},
		},
	}

	remaining := text

	for len(remaining) > 0 {
		found := false
		earliestPos := len(remaining)
		var earliestMatch []string
		var earliestRepl func([]string) []*Node

		// Find the earliest match to determine what we're dealing with
		for _, pattern := range patterns {
			if match := pattern.re.FindStringSubmatch(remaining); match != nil {
				pos := strings.Index(remaining, match[0])
				if pos < earliestPos {
					earliestPos = pos
					earliestMatch = match
					earliestRepl = pattern.repl
					found = true
				}
			}
		}

		if found {
			// Everything before the match is plain text
			if earliestPos > 0 {
				nodes = append(nodes, NewNode("text", remaining[:earliestPos]))
			}
			nodes = append(nodes, earliestRepl(earliestMatch)...)
			remaining = remaining[earliestPos+len(earliestMatch[0]):]
		} else {
			// No inline elements left, so the rest is plain text
			nodes = append(nodes, NewNode("text", remaining))
			break
		}
	}

	return nodes
}

What we are doing is just grabbing the structure that indicates the type of Node we are dealing with (image, link, bold, italic), and separating the token and the text so we can later render as HTML. The same thing we did with the headings.
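To make the "earliest match wins" idea concrete, here's a minimal standalone sketch of the same strategy; the input string and the two patterns are just made up for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// earliestMatch is a toy version of the "earliest match wins" loop in
// parseInlines: it runs every pattern against the input and reports the
// name and position of whichever one matches first.
func earliestMatch(input string) (string, int) {
	patterns := []struct {
		name string
		re   *regexp.Regexp
	}{
		{"code", regexp.MustCompile("`([^`]+)`")},
		{"em", regexp.MustCompile(`\*([^*]+)\*`)},
	}

	winner, earliest := "", len(input)
	for _, p := range patterns {
		// FindStringIndex returns the match position directly, a small
		// alternative to the strings.Index lookup used in parseInlines.
		if loc := p.re.FindStringIndex(input); loc != nil && loc[0] < earliest {
			earliest = loc[0]
			winner = p.name
		}
	}
	return winner, earliest
}

func main() {
	name, pos := earliestMatch("some *emphasis* and `code` mixed together")
	fmt.Println(name, pos) // em 5: the emphasis starts before the inline code
}
```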

Now we can parse paragraphs too! We basically check if the content is anything other than a paragraph, and if it isn't, we decide it is indeed a paragraph block!

// paragraph is our default block
func (p *Parser) parseParagraph() *Node {
	var content strings.Builder

	for p.pos < len(p.lines) {
		line := strings.TrimSpace(p.lines[p.pos])
		if line == "" {
			break
		}
		// Stop if we hit another block type, but always consume at least one
		// line so Parse can't loop forever on something like "#nospace" that
		// looks like a block start but isn't a valid one
		if content.Len() > 0 &&
			(strings.HasPrefix(line, "#") ||
				strings.HasPrefix(line, "```") ||
				regexp.MustCompile(`^[-*+]\s`).MatchString(line) ||
				regexp.MustCompile(`^\d+\.\s`).MatchString(line)) {
			break
		}

		if content.Len() > 0 {
			content.WriteString(" ")
		}
	
		content.WriteString(line)
		p.pos++
	}

	para := NewNode("p", "")
	para.Children = p.parseInlines(content.String())
	return para
}

Now let's do horizontal rules and blockquotes:

// parseHorizontalRule parses horizontal rules (---, ***, ___)
func (p *Parser) parseHorizontalRule(line string) *Node {
	// Check for horizontal rule patterns: 3+ dashes, asterisks, or underscores with optional spaces between them
	trimmed := strings.TrimSpace(line)
	if len(trimmed) < 3 {
		return nil
	}

	// Check for --- pattern (spaces between the dashes are allowed)
	if matched, _ := regexp.MatchString(`^(-\s*){3,}$`, trimmed); matched {
		p.pos++
		return NewNode("hr", "")
	}

	// Check for *** pattern
	if matched, _ := regexp.MatchString(`^(\*\s*){3,}$`, trimmed); matched {
		p.pos++
		return NewNode("hr", "")
	}

	// Check for ___ pattern
	if matched, _ := regexp.MatchString(`^(_\s*){3,}$`, trimmed); matched {
		p.pos++
		return NewNode("hr", "")
	}

	return nil
}

func (p *Parser) parseBlockquote() *Node {
	line := p.lines[p.pos]
	if !strings.HasPrefix(strings.TrimSpace(line), ">") {
		return nil
	}

	blockquote := NewNode("blockquote", "")
	var content strings.Builder

	// Collect all consecutive blockquote lines
	for p.pos < len(p.lines) {
		line := p.lines[p.pos]
		trimmed := strings.TrimSpace(line)

		if !strings.HasPrefix(trimmed, ">") {
			break
		}

		text := strings.TrimPrefix(trimmed, ">")
		if strings.HasPrefix(text, " ") {
			text = text[1:]
		}

		if content.Len() > 0 {
			content.WriteString("\n")
		}
		content.WriteString(text)
		p.pos++
	}

	// Parse the blockquote content as a paragraph
	if content.Len() > 0 {
		para := NewNode("p", "")
		para.Children = p.parseInlines(content.String())
		blockquote.AddChild(para)
	}

	return blockquote
}

There's not much to say here that we haven't already covered, so I'm gonna move on to lists.

We have bullet lists and ordered lists, so we need to check for both and decide how to parse them. Then we need to check all of their children and parse them too.

func (p *Parser) parseList() *Node {
	line := strings.TrimSpace(p.lines[p.pos])

	// Check for bullet list
	bulletRe := regexp.MustCompile(`^[-*+]\s+(.+)$`)
	orderedRe := regexp.MustCompile(`^\d+\.\s+(.+)$`)

	var listType string
	var itemRe *regexp.Regexp

	if bulletRe.MatchString(line) {
		listType = "ul"
		itemRe = bulletRe
	} else if orderedRe.MatchString(line) {
		listType = "ol"
		itemRe = orderedRe
	} else {
		return nil
	}

	list := NewNode(listType, "")

	// Parse list items
	for p.pos < len(p.lines) {
		line := strings.TrimSpace(p.lines[p.pos])
		if line == "" {
			p.pos++
			break
		}

		matches := itemRe.FindStringSubmatch(line)
		if matches == nil {
			break
		}

		item := NewNode("li", "")
		item.Children = p.parseInlines(matches[1])
		list.AddChild(item)

		p.pos++
	}

	return list
}

And finally, code blocks! These are also interesting and work similarly to inline code, but I added some extra info about the language. It doesn't change the HTML structure, but syntax-highlighting libraries can use this info to render different languages.

func (p *Parser) parseCodeBlock() *Node {
	line := p.lines[p.pos]
	if !strings.HasPrefix(line, "```") {
		return nil
	}

	lang := strings.TrimSpace(line[3:])
	p.pos++

	var content strings.Builder
	for p.pos < len(p.lines) {
		line := p.lines[p.pos]
		if strings.HasPrefix(line, "```") {
			p.pos++
			break
		}
		if content.Len() > 0 {
			content.WriteString("\n")
		}
		content.WriteString(line)
		p.pos++
	}

	pre := NewNode("pre", "")
	code := NewNode("code", content.String())
	if lang != "" {
		code.Type = "code-" + lang // Store language info
	}
	pre.AddChild(code)

	return pre
}

We have everything ready, but how do we parse a whole document? We need to go through it and check every line.

func NewParser(markdown string) *Parser {
	lines := strings.Split(strings.ReplaceAll(markdown, "\r\n", "\n"), "\n")
	return &Parser{lines: lines, pos: 0}
}

// We parse the markdown into an AST so then it gets transformed
func (p *Parser) Parse() *Node {
	doc := NewNode("doc", "")

	for p.pos < len(p.lines) {
		line := p.lines[p.pos]

		// Skip blank lines
		if strings.TrimSpace(line) == "" {
			p.pos++
			continue
		}

		// Try to parse different block types
		if node := p.parseHeading(line); node != nil {
			doc.AddChild(node)
		} else if node := p.parseHorizontalRule(line); node != nil {
			doc.AddChild(node)
		} else if node := p.parseCodeBlock(); node != nil {
			doc.AddChild(node)
		} else if node := p.parseBlockquote(); node != nil {
			doc.AddChild(node)
		} else if node := p.parseList(); node != nil {
			doc.AddChild(node)
		} else {
			// Default to paragraph
			doc.AddChild(p.parseParagraph())
		}
	}

	return doc
}

It's very simple and not a full-blown parser that understands prefixes and infixes, but Markdown is designed to be very simple, so it works here.

But there's a lot missing here. For example, sublists don't work with this code. There's no real implementation of any flavor of markdown. It will probably break a lot of documents still. But it exists and I'm happy with my tests.

This is the actual parser but it's not very useful if it can't generate HTML code.

So I built a simple renderer too.

// RenderHTML renders the AST to HTML
func RenderHTML(node *Node) string {
	switch node.Type {
	case "doc":
		var html strings.Builder
		for _, child := range node.Children {
			html.WriteString(RenderHTML(child))
		}
		return html.String()

	case "p":
		return "<p>" + renderChildren(node) + "</p>\n"

	case "h1", "h2", "h3", "h4", "h5", "h6":
		tag := node.Type
		return "<" + tag + ">" + renderChildren(node) + "</" + tag + ">\n"

	case "hr":
		return "<hr>\n"

	case "blockquote":
		return "<blockquote>\n" + renderChildren(node) + "</blockquote>\n"

	case "pre":
		return "<pre>" + renderChildren(node) + "</pre>\n"

	case "code":
		return "<code>" + escapeHTML(node.Content) + "</code>"

	case "ul":
		return "<ul>\n" + renderChildren(node) + "</ul>\n"

	case "ol":
		return "<ol>\n" + renderChildren(node) + "</ol>\n"

	case "li":
		return "<li>" + renderChildren(node) + "</li>\n"

	case "em":
		return "<em>" + renderChildren(node) + "</em>"

	case "strong":
		return "<strong>" + renderChildren(node) + "</strong>"

	case "a":
		return fmt.Sprintf(`<a href="%s">%s</a>`, escapeHTML(node.Content), renderChildren(node))

	case "img":
		alt := node.Attributes["alt"]
		return fmt.Sprintf(`<img src="%s" alt="%s">`, escapeHTML(node.Content), escapeHTML(alt))

	case "text":
		return escapeHTML(node.Content)

	default:
		// Handle code-* types (e.g., code-go, code-js)
		if strings.HasPrefix(node.Type, "code-") {
			lang := node.Type[5:]
			return fmt.Sprintf(`<code class="language-%s">%s</code>`, lang, escapeHTML(node.Content))
		}
		return renderChildren(node)
	}
}

func renderChildren(node *Node) string {
	var html strings.Builder
	for _, child := range node.Children {
		html.WriteString(RenderHTML(child))
	}
	return html.String()
}

// escapeHTML escapes HTML special characters
func escapeHTML(text string) string {
	text = strings.ReplaceAll(text, "&", "&amp;")
	text = strings.ReplaceAll(text, "<", "&lt;")
	text = strings.ReplaceAll(text, ">", "&gt;")
	text = strings.ReplaceAll(text, "\"", "&quot;")
	text = strings.ReplaceAll(text, "'", "&#39;")
	return text
}

So to use it we just need to wrap it into an API and call it for any type of document:

// Parse is the main API function, this is what we call
func Parse(markdown string) string {
	parser := NewParser(markdown)
	ast := parser.Parse()
	return RenderHTML(ast)
}

// Example usage
func main() {
	// Read markdown from file
	content, err := os.ReadFile("test.md")
	if err != nil {
		fmt.Printf("Error reading file: %v\n", err)
		return
	}

	html := Parse(string(content))
	fmt.Println(html)
}

Another thing I need to say is that the renderer doesn't add the tags that mark the start and end of an HTML document; it just prints the body content. So this might need to be tweaked.
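If you do want a complete page, a small hypothetical wrapper (not part of tinymd) could add the shell around the rendered body:

```go
package main

import "fmt"

// wrapDocument is a hypothetical helper, not part of tinymd: it wraps the
// renderer's body output in a minimal HTML document shell.
func wrapDocument(title, body string) string {
	return "<!DOCTYPE html>\n<html>\n<head><title>" + title +
		"</title></head>\n<body>\n" + body + "</body>\n</html>\n"
}

func main() {
	// The body string stands in for RenderHTML's output.
	fmt.Print(wrapDocument("My note", "<p>hello</p>\n"))
}
```

A real version would also escape the title and probably add a charset meta tag, but this is enough to open the output in a browser.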

Conclusion

So that was a fun exercise!

There's still a lot to improve and I'll start writing tests and benchmarks for it, but for now I'm proud of what I did in one afternoon.

I think it's possible to start building the flavor extensions and then start parsing my own Obsidian documents. After that, I'll start working on the app, but maybe Go isn't the best fit for this. Tauri (Rust) sounds promising as a desktop application framework, but I know zero things about Rust.

I'll get to it eventually.




Wanna discuss this article with me? You can send me a comment in the neocities profile with this post code as reference: 003