Building a basic Markdown parser.
2022-10-07 · 11 min read
As part of a ten-day challenge to work on building something cool, I'm writing a Markdown parser in JavaScript. I'm actually writing this on day three of the challenge, so I'll just skim over the last two days and start documenting the process together.
I'm currently relying on two resources to write the parser:
- Notes on making a Markdown parser from scratch in JS: No code (I took a look at the demo but not the source code itself), but it does a really good job of explaining concepts like the tokenizer → parser → HTML generation pipeline from a theoretical standpoint.
- Chapter 6 of Eloquent JavaScript, first edition: Yes, I know that the current edition is the third one. Yes, I have read half of that and still have yet to finish the other half. Actually, despite being a chapter on functional programming with JavaScript, this provides a lot of insight into writing a Markdown parser, since it explains how to write a basic parser that can parse headings, italics, and footnotes.
For this challenge, I want to write my own Markdown parser, which essentially means I want to be reliant on tutorials and Stack Overflow as little as possible when it comes to writing the Markdown parser. This means potentially convoluted and idiotic code, which I completely acknowledge. However, it also means that whatever code I'm writing is code that I can learn from. At the same time, I also want to be able to gain a deeper understanding of the terminology behind writing a parser, even if it's particularly basic. Hence, the two articles above.
The concept
Currently, all of the code behind the Markdown parser is in one file; the only dependency is highlight.js, for highlighting the code. This is what happens when the user types in input:
- The input gets passed to a function, parseMarkdown. This function tries to split up the input into chunks, such as a code block.
- These chunks are passed one by one into a function called processBlock, which is part of the tokenizer. (More info to be included later. Update: more info now included, check the final section of this post.) This tokenizer is responsible for converting, say, using the example before, a code block into an object, or a token, like { type: "codeBlock", content: ..., attributes: { lang } }. This tokenizer is recursive for certain types of content, such as italics.
- Once the tokens have been generated, they're parsed, which just means they're transformed into a form closer to HTML. In the case of the parser I've written, it just takes the tokens and creates a tree-like data structure using an object representation for HTML, where each tag is described as { name, content, attributes }.
- Finally, the tree that the parser generates is converted into HTML, once again using recursion. One issue here is that of escaping HTML, which highlight.js has warned me of already. I might take a closer look into it later.
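The last two steps above can be sketched roughly as follows. This is a minimal illustration rather than the parser's actual code: tokenToNode and renderNode are hypothetical names, only the token and tag shapes come from the list above, and HTML escaping is deliberately left out.

```javascript
// Hypothetical sketch; only the token and tag shapes come from the post.

// Parser step: turn one token into a { name, content, attributes } node.
function tokenToNode(token) {
  if (token.type === "codeBlock") {
    return {
      name: "pre",
      content: [token.content],
      attributes: { class: `language-${token.attributes.lang}` },
    };
  }
  // Assumed fallback: treat any other token as a paragraph.
  return { name: "p", content: [token.content], attributes: {} };
}

// Renderer step: walk the tree recursively. Strings are leaves.
// Note: no HTML escaping happens here, so this is not safe for raw user input.
function renderNode(node) {
  if (typeof node === "string") return node;
  const attrs = Object.entries(node.attributes)
    .map(([key, value]) => ` ${key}="${value}"`)
    .join("");
  const inner = node.content.map(renderNode).join("");
  return `<${node.name}${attrs}>${inner}</${node.name}>`;
}

const token = { type: "codeBlock", content: "let x = 1;", attributes: { lang: "js" } };
console.log(renderNode(tokenToNode(token)));
// <pre class="language-js">let x = 1;</pre>
```

Keeping content as an array of strings and child nodes is what lets the renderer recurse without caring how deep the nesting goes.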
Update
I've moved the code into three different files in order to separate the Markdown parser from the frontend logic I worked on after day 6. The files now are:
- index.js: The actual frontend logic, which includes various functions for localStorage so that files can be saved, as well as usage of highlight.js.
- parser.js: The actual Markdown parser. No external dependencies, just a dependency on utils.js.
- utils.js: The only function in this file is escapeHTML, which is used in both index.js and parser.js to escape user input. There are entire libraries dedicated to escaping HTML - this function simply escapes the most common HTML entities.
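For reference, a function like escapeHTML usually comes down to a small lookup table. The sketch below is an assumption about what such a helper might look like, not the contents of utils.js; the exact set of characters escaped there isn't stated.

```javascript
// Assumed set of "most common" characters; the real utils.js may differ.
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Replace each special character with its entity in a single pass,
// so an already-inserted "&amp;" is never escaped a second time.
function escapeHTML(text) {
  return String(text).replace(/[&<>"']/g, (char) => HTML_ENTITIES[char]);
}

console.log(escapeHTML('<img src="x" onerror="alert(1)">'));
// &lt;img src=&quot;x&quot; onerror=&quot;alert(1)&quot;&gt;
```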
Day 3 & 4
(Like I mentioned before, I'm jumping ahead since I started writing this on day 3.) Today, I figured out how to parse code blocks. After struggling with it for two hours, I finally deleted what I had been writing and rewrote it in a smarter way in ten minutes. Bugs, as they say, take up more time than programming. So does thinking through things.
Day 5
Before I got started with anything else, I discovered a little edge case while playing around with the parser: What if the user leaves a code block empty? When I tried it out, the text following the code block was included as part of the code block. Then I realized what was happening: the function that renders the actual HTML after being passed a tree from the parser was treating an empty string (aka empty content) as a signal to mark the tag as <tag name/>, meaning that a code block would be translated to <pre class="language-plaintext hljs"/> - which would be invalid. I fixed this by distinguishing between empty content, marked with "", and no content, marked with -1.
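A minimal sketch of that distinction, assuming a simplified renderer where content is a plain string. renderTag and NO_CONTENT are hypothetical names; only the "" versus -1 convention comes from the fix described above.

```javascript
// -1 means "this tag never has content" (self-closing);
// "" means "this tag has content, and it happens to be empty".
const NO_CONTENT = -1;

function renderTag({ name, content, attributes = {} }) {
  const attrs = Object.entries(attributes)
    .map(([key, value]) => ` ${key}="${value}"`)
    .join("");
  if (content === NO_CONTENT) return `<${name}${attrs}/>`;
  return `<${name}${attrs}>${content}</${name}>`;
}

console.log(renderTag({ name: "pre", content: "", attributes: { class: "language-plaintext hljs" } }));
// <pre class="language-plaintext hljs"></pre>
console.log(renderTag({ name: "hr", content: NO_CONTENT }));
// <hr/>
```

Comparing against the sentinel with === avoids the original bug, where a falsy empty string was mistaken for "no content".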
Another thing I wanted to work on was adding the usage of tabs to the input box. To do this, I built off the first answer to this Stack Overflow thread and added the ability to keep the current indentation, without having to press Tab each time you move on to a new line.
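Neither the Stack Overflow answer nor the editor's own code is shown here, so the following is only one way the behavior could work. handleEditorKey and its return shape are assumptions; it is written as a pure function so the logic can be read apart from the DOM.

```javascript
// Given the textarea's value, the selection range, and the pressed key,
// return the new value and cursor position, or null to keep default behavior.
function handleEditorKey(value, start, end, key) {
  if (key === "Tab") {
    // Insert a tab character instead of moving focus out of the textarea.
    return { value: value.slice(0, start) + "\t" + value.slice(end), cursor: start + 1 };
  }
  if (key === "Enter") {
    // Copy the leading whitespace of the current line onto the new line.
    const lineStart = value.lastIndexOf("\n", start - 1) + 1;
    const indent = value.slice(lineStart, start).match(/^[\t ]*/)[0];
    const inserted = "\n" + indent;
    return {
      value: value.slice(0, start) + inserted + value.slice(end),
      cursor: start + inserted.length,
    };
  }
  return null;
}

console.log(handleEditorKey("\tfoo", 4, 4, "Enter"));
// value is "\tfoo\n\t", cursor is 6
```

In the browser, a keydown listener would call this with the textarea's value, selectionStart, and selectionEnd, and call event.preventDefault() whenever the result is not null.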
Other than that, I worked on adding bold functionality.
Day 6
Planning to work on adding unordered/ordered list functionality, and potentially working on an edge case I found: the disappearance of newlines in code blocks.
After fixing the disappearance of newlines in code blocks, I worked on the unordered/ordered list functionality, which introduced a new issue. See, each list item was being rendered as its own list. So the HTML would end up looking something like this:
<ul>
  <li>Item one</li>
</ul>
<ul>
  <li>Item two</li>
</ul>
Which is obviously not great. So I wrote some janky code to fix this, in order to put together something functional. I know, I know. I'm still trying to think of a more elegant way to do this. To be honest, I'm not sure how to go about it. And that's okay! This is actually a perfect opportunity for me to read other people's code and implementations, which is what I was planning to do eventually anyway.
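One possible approach (not the janky fix mentioned above, which isn't shown) is a post-processing pass that merges neighboring lists of the same kind. The sketch assumes the { name, content, attributes } tag shape with content as an array; mergeAdjacentLists is a hypothetical name.

```javascript
// Merge runs of adjacent <ul> (or adjacent <ol>) nodes into a single list.
function mergeAdjacentLists(nodes) {
  const merged = [];
  for (const node of nodes) {
    const previous = merged[merged.length - 1];
    const isList = node.name === "ul" || node.name === "ol";
    if (isList && previous && previous.name === node.name) {
      // Same kind of list as the one just before: move the items over.
      previous.content.push(...node.content);
    } else if (isList) {
      // Copy the list so the input array is not mutated by later pushes.
      merged.push({ ...node, content: [...node.content] });
    } else {
      merged.push(node);
    }
  }
  return merged;
}

const li = (text) => ({ name: "li", content: [text], attributes: {} });
const nodes = [
  { name: "ul", content: [li("Item one")], attributes: {} },
  { name: "ul", content: [li("Item two")], attributes: {} },
];
console.log(mergeAdjacentLists(nodes).length); // 1
```

A paragraph or any other node between two lists breaks the run, so separate lists stay separate.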
Other than that, I wanted to work on some front-end stuff. (Aka, an excuse for not using my brain, which has been fried by SAT prep and Markdown logic over the last two days.) So I added a better design, along with localStorage! I also set up the files needed to write a TypeScript version of the parser (to learn some TypeScript).
Okay, fine, I got a little bit carried away. And I just realized that I forgot to add support for nested lists. Duh. Of course.
Day 7
I was super busy today... so I just worked on translating some of the code into TypeScript.
Day 8
Worked on link and image functionality today! I've been using a decent amount of regular expressions to help with some of the parsing (although some would argue that this is convoluted), and it's actually helping me remember the different regular expression constructs/characters, which is pretty nice. The article I wrote on rebuilding my blog is 100% able to be rendered in my Markdown parser, which is super cool! I'm sure that there are a couple of edge cases I haven't tested yet (I still haven't implemented nested lists, for example), but I'm coding in class, so I can't tackle those at the moment.
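To give a flavor of the regular-expression approach, here is a small sketch for inline links and images. The patterns and the renderLinksAndImages name are assumptions, not the parser's own code, and for brevity it emits HTML directly without escaping instead of producing tokens.

```javascript
// ![alt](url) for images, [text](url) for links. URLs may not contain spaces.
const IMAGE = /!\[([^\]]*)\]\(([^)\s]+)\)/g;
const LINK = /\[([^\]]+)\]\(([^)\s]+)\)/g;

function renderLinksAndImages(text) {
  // Images go first: otherwise the link pattern would match the
  // "[alt](url)" part of an image and leave a stray "!" behind.
  return text
    .replace(IMAGE, '<img src="$2" alt="$1">')
    .replace(LINK, '<a href="$2">$1</a>');
}

console.log(renderLinksAndImages("See [my blog](https://example.com) and ![logo](logo.png)"));
// See <a href="https://example.com">my blog</a> and <img src="logo.png" alt="logo">
```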
Day 9
I know I said I wouldn't be working on subscript, superscript, or strikethrough... but I decided to work on them today! Tweaking the splitBlock function (which does the actual splitting of blocks into different types of text) has been relatively easy, which is what parsing is supposed to do, really. One thing I've really been able to learn so far from writing this parser is abstraction and how to decide when to write a function that can apply to different values and when to write a function that does a specific thing.
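As an illustration of that kind of abstraction, the sketch below drives one generic function from a table of delimiters, so adding a new inline style is a one-line change. It is not the real splitBlock: splitInline, INLINE_RULES, and the ~~ / ~ / ^ delimiters are all assumptions.

```javascript
// Longer delimiters come first so "~~" is tried before "~".
const INLINE_RULES = [
  { delimiter: "~~", name: "del" },
  { delimiter: "~", name: "sub" },
  { delimiter: "^", name: "sup" },
];

// Split text into plain strings and { name, content } nodes, recursing
// into the text between a matching pair of delimiters.
function splitInline(text) {
  const parts = [];
  let plain = "";
  let i = 0;
  while (i < text.length) {
    const rule = INLINE_RULES.find((r) => text.startsWith(r.delimiter, i));
    const open = rule ? i + rule.delimiter.length : -1;
    const close = rule ? text.indexOf(rule.delimiter, open) : -1;
    if (rule && close > open) {
      if (plain) parts.push(plain);
      plain = "";
      parts.push({ name: rule.name, content: splitInline(text.slice(open, close)) });
      i = close + rule.delimiter.length;
    } else {
      // No rule applies (or no closing delimiter): keep the character as text.
      plain += text[i];
      i += 1;
    }
  }
  if (plain) parts.push(plain);
  return parts;
}

console.log(JSON.stringify(splitInline("H~2~O is ~~not~~ x^2^")));
// ["H",{"name":"sub","content":["2"]},"O is ",{"name":"del","content":["not"]}," x",{"name":"sup","content":["2"]}]
```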
What else did I work on? Well, I actually worked on adding a context menu to the file tree that I implemented on Day 5. The context menu provides a Delete feature currently, but I might add more to it.
Day 10
Not much done on the Markdown parser today. I cleaned up the code, finished writing this post, and also wrote a README on using this.
Final thoughts
Alright, we are past the final stretch for this! I can't believe I actually wrote a Markdown parser in ten days. I feel like this "x days" challenge was really useful in actually getting me to make progress on building a Markdown parser - in the past I would have worked on it for a few days, before leaving it to gather dust. Also, having others complete the ten-day challenge really held me accountable.
In terms of the code, I'm planning on forking the public repo that I created for this parser and linking an Express/AWS API or something so that I can use this Markdown parser for my own (private) writing. Also because I've been wanting to create a personal "dashboard" of sorts for a while - so this will be the perfect opportunity.
Currently, the TypeScript code for the parser is incomplete. This is stored inside src/, and I will be updating it once I've learned enough TypeScript to actually convert the code to TS. I've also really wanted to learn about unit testing, so this Markdown parser will also come in handy for that.
Last but not least, an explanation of how the parser works (from a bird's eye view but more detailed than this post, for sure) is located in the README of the repo.
And of course, a demo:
What I haven't done
It may be important to include what features I haven't added, for learning purposes. I won't be able to work on (at least during the ten-day challenge) the following features:
- Usage of backslashes (\) to escape special characters
- Tables
- Footnotes
- Definition lists
- Task lists
- Highlight
- Heading IDs
Continuing work
Okay, I know the 10 days already passed, but I decided to keep working on this Markdown editor because I wanted to turn it into a full-fledged tool for personal writing, as mentioned before. And also because I wanted to submit something to the Congressional App Challenge.
I ended up working on a bunch of stuff. One of the first things I did was create a context menu for files, which ended up including the ability to delete, rename, or download (as Markdown or HTML) a file. There were a couple of issues I ran into while implementing this - specifically, one that I remember as I'm writing this is referencing class methods inside an event listener. Something like this:
class Editor {
  // ...
  generateFiletree() {
    for (const file of this.files) {
      // ...
      const button = document.createElement("button");
      button.innerText = file;
      button.addEventListener("click", function () {
        // ? Referencing this, as in the class?
      });
    }
  }
}
Apparently, you can do something like let self = this; outside the event listener and reference self inside the event listener. One of the perks of reinventing the wheel: you become more knowledgeable in the programming language you reinvented the wheel in.
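An arrow function is another common way around this, since arrow functions don't get their own this. The sketch below is a simplified stand-in for the editor class: the open method, the opened array, and the createButton parameter are made up for illustration, and EventTarget replaces a real button so the snippet also runs outside the browser.

```javascript
class Editor {
  constructor(files) {
    this.files = files;
    this.opened = [];
  }

  open(file) {
    this.opened.push(file);
  }

  // createButton stands in for () => document.createElement("button").
  generateFiletree(createButton) {
    return this.files.map((file) => {
      const button = createButton();
      // Inside an arrow function, `this` is still the Editor instance.
      // With `function () {}` it would be the element the listener is on.
      button.addEventListener("click", () => this.open(file));
      return button;
    });
  }
}

const editor = new Editor(["notes.md", "todo.md"]);
const buttons = editor.generateFiletree(() => new EventTarget());
buttons[1].dispatchEvent(new Event("click"));
console.log(editor.opened); // the list now contains "todo.md"
```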
I also added an algorithm for generating heading IDs (which currently doesn't take into account ID collisions, which is definitely going to be part of the TODO), and task lists, which were relatively easy, considering that they're just unordered lists with checkboxes tacked on.
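The heading ID algorithm itself isn't shown, so here is a rough sketch of a typical slug-based version, with one possible way to handle the collision TODO by appending a counter. slugify and createIdGenerator are hypothetical names, and the rules (lowercase, ASCII letters and digits only, hyphens for spaces) are assumptions.

```javascript
// Turn heading text into a URL-friendly ID.
function slugify(heading) {
  return heading
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9\s-]/g, "") // drop punctuation and other symbols
    .replace(/\s+/g, "-");        // collapse whitespace runs into one hyphen
}

// Track how often each slug has been used and suffix the repeats.
// Limitation: "update-1" could still clash with a heading titled "Update 1".
function createIdGenerator() {
  const seen = new Map();
  return (heading) => {
    const base = slugify(heading) || "section";
    const count = seen.get(base) ?? 0;
    seen.set(base, count + 1);
    return count === 0 ? base : `${base}-${count}`;
  };
}

const nextId = createIdGenerator();
console.log(nextId("Day 3 & 4")); // day-3-4
console.log(nextId("Update"));    // update
console.log(nextId("Update"));    // update-1
```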
Once I was done with all of this, I began working on Markright. Obviously, I scaffolded an Express/AWS API and created a basic (but interactive!) landing page. In terms of the Markdown parser, I wanted to implement folders. Unfortunately, due to time constraints, I decided it was better to have single-level folders anyway. (But also because I'm one of those people who would make a folder a dozen levels deep if I were given the opportunity.) Thus, I replaced the filesystem with a folder filesystem where I could create a new folder and add files inside. Not much was changed - in terms of porting from localStorage to AWS, I simply uploaded the JSON content to a file inside an S3 bucket instead of saving it to localStorage.
Obviously, the way I implemented the filesystem was probably a bit convoluted. I'm definitely going to fix it and add a couple of features. For now, the obligatory demo:
That's it! There's still a bunch of tasks to get around to; since the list of features above is out of date, you can check out the GitHub repository for the most up-to-date list. Now that I've mentioned GitHub repositories, you can also check out the GitHub repository for Markright.
Hey! This is a post forever in progress. In other words, I'm probably going to tweak it once in a while, because things do change.