--- /dev/null
+Title: MultiMarkdown v6 Development Notes
+Author: Fletcher T. Penney
+Date: 2017-03-14
+LaTeX Config: tufte-handout
+Base Header Level: 3
+
+
+# Introduction #
+
+This document includes some notes on the development of MMD v6. Most of it
+will be interesting only to other developers or those needing to choose the
+absolute "best" MD implementation for their needs -- it is not required
+reading to understand how the software works.
+
+
+## Why a New Version? ##
+
+MultiMarkdown version 5 was released in November of 2015, but the codebase was
+essentially the same as that of v4 -- and that was released in beta in April
+of 2013. A few key things prompted work on a new version:
+
+* Accuracy -- MMD v4 and v5 were the most accurate versions yet, and a lot of
+effort went into finding and resolving various edge cases. However, it began
+to feel like a game of whack-a-mole where new bugs would creep in every time I
+fixed an old one. The PEG began to feel rather convoluted in spots, even
+though it did allow for a precise (if not always accurate) specification of
+the grammar.
+
+* Performance -- "Back in the day" [peg-markdown] was one of the fastest
+Markdown parsers around. MMD v3 was based on peg-markdown, and would leap-
+frog with it in terms of performance. Then [CommonMark] was released, which
+was a bit faster. Then a couple of years went by and CommonMark became *much*
+faster -- in one of my test suites, MMD v 5.4.0 takes about 25 times longer to
+process a long document than CommonMark 0.27.0.
+
+[peg-markdown]: https://github.com/jgm/peg-markdown
+[CommonMark]: http://commonmark.org/
+
+In the spring of 2016, I decided I wanted to rewrite MultiMarkdown from scratch,
+building the parser myself rather than relying on a pre-rolled solution. (I
+had been using [greg](https://github.com/ooc-lang/greg) to compile the PEG
+into parser code. It worked well overall, but lacked some features I needed,
+requiring a lot of workarounds.)
+
+
+# First Attempt #
+
+My first attempt started by hand-crafting a parser that scanned through the
+document a line at a time, deciding what to do with each line as it was
+encountered. I used regex parsers made with [re2c](http://re2c.org/index.html) to
+help classify each line, and then a separate parser layer to process groups of
+lines into blocks. Initially this approach worked well, and was really
+efficient. But I quickly began to code my way into a dead-end -- the strategy
+was not elegant enough to handle things like nested lists, etc.
+
+One thing that did turn out well from the first attempt, however, was an
+approach for handling `<emph>` and `<strong>` parsing. I've learned over the
+years that this can be one of the hardest parts of coding accurately for
+Markdown. There are many examples that are obvious to a person, but difficult
+to properly "explain" to a computer how to parse.
+
+No solution is perfect, but I developed an approach that seems to accurately
+handle a wide range of situations without a great deal of complexity:
+
+1. Scan the document for asterisks (`*`). Each one will be handled one at a
+time.
+
+2. Unlike brackets (`[` and `]`), an asterisk is "ambidextrous", in that it
+may be able to open a matched pair of asterisks, close a pair, or both. For
+example, in `foo *bar* foo`:
+
+ 1. The first asterisk can open a pair, but not close one.
+
+ 2. The second asterisk can close a pair, but not open one.
+
+3. So, once the asterisks have been identified, each has to be examined to
+determine whether it can open/close/both. The algorithm is not that complex,
+but I'll describe it in general terms. Check the code for more specifics.
+This approach seems to work, but might still need some slight tweaking. In
+the future, I'll codify this better in language rather than just in code.
+
+ 1. If there is whitespace to the left of an asterisk, it can't close.
+
+ 2. If there is whitespace or punctuation to the right it can't open.
+
+ 3. "Runs" of asterisks, e.g. `**bar` are treated as a unit in terms of
+ looking left/right.
+
+ 4. Asterisks inside a word are a bit trickier -- we look at the number of
+ asterisks before the word, the number in the current run, and the number
+ of asterisks after the word to determine which combinations, if any, are
+ permitted.
+
+4. Once all asterisks have been tagged as able to open/close/both, we proceed
+through them in order:
+
+ 1. When we encounter a tag that can close, we look to see if there is a
+ previous opener that has not been paired off. If so, pair the two and
+ remove the opener from the list of available asterisks.
+
+ 2. When we encounter an opener, add it to the stack of available openers.
+
+	3. When we encounter an asterisk that can do both, see if it can close an
+	existing opener. If not, then add it to the stack.
+
+5. After all tokens in the block have been paired, then we look for nesting
+pairs of asterisks in order to create `<emph>` and `<strong>` sets. For
+example, assume we have six asterisks wrapped around a word, three in front,
+and three after. The asterisks are indicated with numbers: `123foo456`. We
+proceed in the following manner:
+
+ 1. Based on the pairing algorithm above, these asterisks would be paired as
+ follows, with matching asterisks sharing numbers -- `123foo321`.
+
+ 2. Moving forwards, we come to asterisk "1". It is followed by an
+ asterisk, so we check to see if they should be grouped as a `<strong>`.
+ Since the "1" asterisks are wrapped immediately outside the "2" asterisks,
+ they are joined together. More than two pairs can't be joined, so we now
+ get the following -- `112foo211`, where the "11" represents the opening
+ and closing of a `<strong>`, and the "2" represents a `<emph>`.
+
+6. When matching a pair, any unclosed openers that are on the stack are
+removed, preventing pairs from "crossing" or "intersecting". Pairs can wrap
+around each other, e.g. `[(foo)]`, but not intersect like `[(foo])`. In the
+second case, the brackets would close, removing the `(` from the stack.
+
+7. This same approach is used for all tokens that are matched in pairs --
+`[foo]`, `(foo)`, `_foo_`, etc. There's slightly more to it, but once you
+figure out how to assign opening/closing ability, the rest is easy. By using
+a stack to track available openers, it can be performed efficiently. (A
+minimal sketch of this pairing pass follows the list.)
+
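+As a rough illustration of that pairing pass, here is a minimal C sketch. It
+is not the actual MMD 6 code -- the `Token` struct, the `CAN_OPEN`/`CAN_CLOSE`
+flags, and the classification are simplified stand-ins; only the whitespace
+rules are applied, and the "crossing" cleanup from step 6 is omitted:
+
+```c
+#include <ctype.h>
+#include <stdio.h>
+
+#define CAN_OPEN  1
+#define CAN_CLOSE 2
+
+typedef struct {
+	size_t pos;		/* offset of the '*' in the source */
+	int    flags;	/* CAN_OPEN, CAN_CLOSE, or both */
+	int    mate;	/* index of the paired asterisk, or -1 */
+} Token;
+
+int main(void) {
+	const char *text = "foo *bar* *baz* foo";
+	Token tok[64];
+	int n = 0;
+
+	/* 1. Find each asterisk and decide open/close ability from its neighbors. */
+	for (size_t i = 0; text[i]; i++) {
+		if (text[i] != '*')
+			continue;
+		int flags = 0;
+		if (text[i + 1] && !isspace((unsigned char) text[i + 1]))
+			flags |= CAN_OPEN;		/* no whitespace to the right */
+		if (i > 0 && !isspace((unsigned char) text[i - 1]))
+			flags |= CAN_CLOSE;		/* no whitespace to the left */
+		tok[n++] = (Token) { i, flags, -1 };
+	}
+
+	/* 2. Walk the asterisks in order, keeping a stack of unpaired openers. */
+	int stack[64];
+	int top = 0;
+	for (int i = 0; i < n; i++) {
+		if ((tok[i].flags & CAN_CLOSE) && top > 0) {
+			int opener = stack[--top];	/* pair with the most recent opener */
+			tok[opener].mate = i;
+			tok[i].mate = opener;
+		} else if (tok[i].flags & CAN_OPEN) {
+			stack[top++] = i;			/* remember it for a later closer */
+		}
+	}
+
+	for (int i = 0; i < n; i++)
+		printf("'*' at offset %zu pairs with asterisk %d\n", tok[i].pos, tok[i].mate);
+	return 0;
+}
+```
+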
+In my testing, this approach has worked quite well. It handles all the basic
+scenarios I've thrown at it, and all of the "basic" and "devious" edge cases I
+have thought of (some of these don't necessarily have a "right" answer -- but
+v6 gives consistent answers that seem as reasonable as any others to me).
+There are also three more edge cases I've come up with that can still stump
+it, and ironically they are handled correctly by most implementations. They
+just don't follow the rules above. I'll continue to work on this.
+
+In the end, I scrapped this effort, but kept the lessons learned in the token
+pairing algorithm.
+
+
+# Second Attempt #
+
+I tried again this past Fall. This time, I approached the problem with lots
+of reading. *Lots and lots* of reading -- tons of websites, computer science
+journal articles, PhD theses, etc. Learned a lot about lexers, and a lot
+about parsers, including hand-crafting vs using parser generators. In brief:
+
+1. I learned about the [Aho–Corasick algorithm], which is a great way to
+efficiently search a string for multiple target strings at once. I used this
+to create a custom lexer to identify tokens in a MultiMarkdown text document
+(e.g. `*`, `[ `, `{++`, etc.). I learned a lot, and had a good time working
+out the implementation. This code efficiently allowed me to break a string of
+text into the tokens that mattered for Markdown parsing.
+
+2. However, in a few instances I really needed some features of regular
+expressions to simplify more complex structures. After a quick bit of testing,
+using re2c to create a tokenizer was just as efficient, and allowed me to
+incorporate some regex functionality that simplified later parsing. I'll keep
+the Aho-Corasick stuff around, and will probably experiment more with it
+later. But I didn't need it for MMD now. `lexer.re` contains the source for
+the tokenizer.
+
+[Aho–Corasick algorithm]: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
+
+I looked long and hard for a way to simplify the parsing algorithm to try and
+"touch" each token only once. Ideally, the program could step through each
+token, and decide when to create a new block, when to pair things together,
+etc. But I'm not convinced it's possible. Since Markdown's grammar varies
+based on context, it seems to work best when handled in distinct phases:
+
+1. Tokenize the string to identify key sections of text. This includes line
+breaks, allowing the text to be examined one line at a time.
+
+2. Join series of lines together into blocks, such as paragraphs, code blocks,
+lists, etc.
+
+3. The tokens inside each block can then be paired together to create more
+complex syntax such as links, strong, emphasis, etc.
+
+To handle the block parsing, I started off using the Aho-Corasick code from
+my first attempt. I had actually implemented some basic regex functionality,
+and used that to group lines together to create blocks. But this quickly
+fell apart in the face of more complex structures such as recursive lists.
+After a lot of searching, and *tons* more reading, I ultimately decided to
+use a parser generator to handle the task of grouping lines into blocks.
+`parser.y` has the source for this, and it is processed by the
+[lemon](http://www.hwaci.com/sw/lemon/) parser generator to create the actual
+code.
+
+I chose to do this because hand-crafting the block parser would be complex.
+The end result would likely be difficult to read and understand, which would
+make it difficult to update later on. Using the parser generator allows me to
+write things out in a way that can more easily be understood by a person. In
+all likelihood, the performance is probably as good as anything I could do
+anyway, if not better.
+
+Because lemon generates a LALR(1) parser, it does require a bit of thinking
+ahead about how to structure the grammar. But so far, it has been able to
+handle everything I have thrown at it.
+
+
+# Optimization #
+
+One of my goals for MMD 6 was performance. So I've paid attention to speed
+along the way, and have tried to use a few tricks to keep things fast. Here
+are some of the things I've learned, in no particular order:
+
+
+## Memory Allocation ##
+
+When parsing a long document, a *lot* of token structures are created. Each
+one requires a small bit of memory to be allocated. In aggregate, that time
+added up and slowed down performance.
+
+After reading for a bit, I ended up coming up with an approach that uses
+larger chunks of memory. I allocate pools of memory in large slabs for
+smaller "objects". For example, I allocate memory for 1024 tokens at a
+time, and then dole that memory out as needed. When the slab is used up,
+a new one is allocated. This dramatically improved performance.
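+
+Roughly, the idea looks like the following sketch. This is illustrative only --
+the real pool code in MMD 6 differs, and the names here (`slab`, `token_new()`,
+and so on) are made up for the example:
+
+```c
+#include <stdlib.h>
+
+#define SLAB_SIZE 1024
+
+typedef struct token {
+	int    type;
+	size_t start;
+	size_t len;
+} token;
+
+typedef struct slab {
+	struct slab *next;			/* slabs are kept in a linked list */
+	size_t       used;			/* how many tokens have been handed out */
+	token        pool[SLAB_SIZE];
+} slab;
+
+static slab *head = NULL;
+
+token * token_new(void) {
+	if (head == NULL || head->used == SLAB_SIZE) {
+		/* current slab is used up -- allocate a fresh one */
+		slab *s = malloc(sizeof *s);
+		if (s == NULL)
+			return NULL;
+		s->next = head;
+		s->used = 0;
+		head = s;
+	}
+	return &head->pool[head->used++];	/* fields left uninitialized, as with malloc */
+}
+
+void token_pool_drain(void) {
+	/* all of the document's tokens are released in one pass at the end */
+	while (head) {
+		slab *next = head->next;
+		free(head);
+		head = next;
+	}
+}
+```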
+
+When pairing tokens, I originally created a new stack for each block. I then
+realized that an emptied stack didn't have any "leftover" cruft to interfere
+with re-use, so I just use one stack for the entire document. Again, a
+sizeable improvement in performance from allocating one object instead of
+many. When recursing to a deeper level, the stack just gets deeper, but
+earlier levels aren't modified.
+
+Speaking of tokens, I realized that the average document contains a lot of
+single spaces (there's one between every two words I have written, for
+example). The vast majority of the time, these single spaces have no effect
+on the output of Markdown documents. I changed my whitespace token search to
+only flag runs of 2 or more spaces, dramatically reducing the number of
+tokens. This gives the benefit of needing fewer memory allocations, and also
+reduces the number of tokens that need to be processed later on. The only
+downside is remembering to check for a single space character in the few
+instances where it matters.
+
+
+## Proper Input Buffering ##
+
+When I first began last spring, I was amazed to see how much time was being
+spent by MultiMarkdown simply reading the input file. Then I discovered it
+was because I was reading it one character at a time. I switched to using a
+buffered read approach and the time to read the file went to almost nothing. I
+experimented with different buffer sizes, but they did not seem to make a
+measurable difference.
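+
+The change amounts to something like this sketch -- a hypothetical `slurp()`
+helper rather than MMD's actual reader -- which pulls the file in with
+`fread()` in large chunks instead of one character at a time:
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#define CHUNK 4096
+
+char * slurp(FILE *f, size_t *out_len) {
+	size_t cap = CHUNK, len = 0, n;
+	char chunk[CHUNK];
+	char *buf = malloc(cap + 1);
+
+	if (buf == NULL)
+		return NULL;
+
+	/* read in big chunks, growing the buffer as needed */
+	while ((n = fread(chunk, 1, CHUNK, f)) > 0) {
+		if (len + n > cap) {
+			cap *= 2;
+			char *tmp = realloc(buf, cap + 1);
+			if (tmp == NULL) {
+				free(buf);
+				return NULL;
+			}
+			buf = tmp;
+		}
+		memcpy(buf + len, chunk, n);
+		len += n;
+	}
+
+	buf[len] = '\0';
+	if (out_len)
+		*out_len = len;
+	return buf;
+}
+```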
+
+
+## Output Buffering ##
+
+I experimented with different approaches to creating the output after parsing.
+I tried printing directly to `stdout`, and even played with different
+buffering settings. None of those seemed to work well, and all were slower
+than using the `d_string` approach (formerly called `GString` in MMD 5).
+
+
+## Fast Searches ##
+
+After getting basic Markdown functionality complete, I discovered during
+testing that the time required to parse a document grew exponentially as the
+document grew longer. Performance was on par with CommonMark for shorter
+documents, but fell increasingly behind in larger tests. Time profiling found
+that the culprit was searching for link definitions when they didn't exist.
+My first approach was to keep a stack of used link definitions, and to iterate
+through them when necessary. In long documents, this performs very poorly.
+After more research, I ended up using
+[uthash](http://troydhanson.github.io/uthash/). This allows me to search for
+a link (or footnote, etc.) by "name" rather than searching through an array.
+This allowed me to get MMD's performance back to O(n), taking roughly twice as
+much time to process a document that is twice as long.
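+
+The uthash usage looks roughly like the sketch below. The `link_def` struct
+and the function names are illustrative, not MMD 6's actual ones; it assumes
+`uthash.h` is available and that the label/url strings outlive the table:
+
+```c
+#include <stdlib.h>
+#include <string.h>
+#include "uthash.h"
+
+typedef struct link_def {
+	const char *label;			/* key: the link label */
+	const char *url;
+	UT_hash_handle hh;			/* makes this struct hashable by uthash */
+} link_def;
+
+static link_def *definitions = NULL;
+
+void store_definition(const char *label, const char *url) {
+	link_def *d = malloc(sizeof *d);
+	if (d == NULL)
+		return;
+	d->label = label;
+	d->url   = url;
+	HASH_ADD_KEYPTR(hh, definitions, d->label, strlen(d->label), d);
+}
+
+link_def * find_definition(const char *label) {
+	link_def *d = NULL;
+	HASH_FIND_STR(definitions, label, d);	/* hash lookup instead of a linear scan */
+	return d;
+}
+```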
+
+
+## Efficient Utility Functions ##
+
+It is frequently necessary when parsing Markdown to check what sort of
+character we are dealing with at a certain position -- a letter, whitespace,
+punctuation, etc. I created a lookup table for this via `char_lookup.c` and
+hard-coded it in `char.c`. These routines allow me to quickly, and
+consistently, classify any byte within a document. This saved a lot of
+programming time, and saved time tracking down bugs from handling things
+slightly differently under different circumstances. I also suspect it
+improved performance, but don't have the data to back it up.
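+
+The core idea is a 256-entry table indexed by byte value, so classifying a
+character is a single array access. The sketch below builds the table at
+runtime for brevity, whereas MMD 6 generates it ahead of time and hard-codes
+it in `char.c`; the category names here are illustrative:
+
+```c
+enum char_class {
+	CHAR_OTHER = 0,
+	CHAR_WHITESPACE,
+	CHAR_PUNCTUATION,
+	CHAR_ALPHA,
+	CHAR_DIGIT,
+	CHAR_LINE_ENDING
+};
+
+static unsigned char char_class_table[256];
+
+void init_char_table(void) {
+	for (int i = 0; i < 256; i++) {
+		if (i == '\n' || i == '\r')
+			char_class_table[i] = CHAR_LINE_ENDING;
+		else if (i == ' ' || i == '\t')
+			char_class_table[i] = CHAR_WHITESPACE;
+		else if ((i >= 'a' && i <= 'z') || (i >= 'A' && i <= 'Z'))
+			char_class_table[i] = CHAR_ALPHA;
+		else if (i >= '0' && i <= '9')
+			char_class_table[i] = CHAR_DIGIT;
+		else if (i > ' ' && i < 0x7f)
+			char_class_table[i] = CHAR_PUNCTUATION;
+		else
+			char_class_table[i] = CHAR_OTHER;
+	}
+}
+
+/* classification is then a single lookup, consistent everywhere it's used */
+int char_is_whitespace(unsigned char c) {
+	return char_class_table[c] == CHAR_WHITESPACE;
+}
+```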
+
+
+## Testing While Writing ##
+
+I developed several chunks of code in parallel while creating MMD 6. Most of
+it was written using a [test-driven development] approach, and the rest was
+still covered by extensive unit testing.
+
+[test-driven development]: https://en.wikipedia.org/wiki/Test-driven_development
+
+MMD isn't particularly amenable to this approach at the small level, but
+instead I relied more on integration testing with an ever-growing collection
+of text files and the corresponding HTML files in the MMD 6 test suite. This
+allowed me to ensure new features work properly and that old features aren't
+broken. At this time, there are 29 text files in the test suite, and many
+more to come.
+
+
+## Other Lessons ##
+
+Some things that didn't do me any good....
+
+I considered the differences between using `malloc` and `calloc` when
+initializing tokens. The time saved by using `malloc` was essentially offset
+by the time then required to initialize each token's fields to default null
+values, as compared to using `calloc`. Since `calloc` failed to help me out
+(I had thought that clearing a single slab in the object pool would be
+faster), I stuck with `malloc`, as it makes more sense to me in my workflow.
+
+I read a bit about [struct padding] and reordered some of my structs. It
+wasn't until later that I discovered the `-Wpadded` option, so it's not clear
+whether my changes actually accomplished anything. Since the structs were
+already being padded automatically, there was no noticeable performance
+change, and I didn't have the tools to measure whether I had improved memory
+usage at all. Not sure this would be worth the effort -- there is much lower
+hanging fruit available.
+
+[struct padding]: http://www.catb.org/esr/structure-packing/
+
+
+# Performance #
+
+Basic tests show that currently MMD 6 takes about 20-25% longer than
+CommonMark 0.27.0 to process long files (e.g. 0.2 MB). However, it is around
+5% *faster* than CommonMark when parsing a shorter file (27 kB), measured by
+parsing the same file 200 times over. These tests are performed using the
+Markdown [syntax page], modified to avoid the use of the Setext header at the
+top. The longer files tested are created by copying the same syntax page onto
+itself, thereby doubling the length of the file with each iteration.
+
+[syntax page]: https://daringfireball.net/projects/markdown/syntax
+
+The largest file I test is approximately 108 MB (4096 copies of the syntax
+page). On my machine (2012 Mac mini with 2.3 GHz Intel Core i7, 16 GB RAM),
+it takes approximately 4.4 seconds to parse with MMD 6 and 3.7 seconds with
+CommonMark. MMD 6 processes approximately 25 MB/s on this test file.
+CommonMark 0.27.0 gets about 29 MB/s on the same machine.
+
+There are some slight variations with the smaller test files (8-32 copies),
+but overall the performance of both programs (MMD 6 and CommonMark) is
+roughly linear as the test file gets bigger (double the file size and it
+takes twice as long to parse, aka O(n)).
+
+Out of curiosity, I ran the same tests on the original Markdown.pl by Gruber
+(v 1.0.2b8). It took approximately 178 seconds to parse 128 copies of the
+file (3.4 MB) and was demonstrating quadratic performance characteristics
+(double the file size and it takes 2^2 or 4 times longer to process, aka
+O(n^2)). I didn't bother running it on larger versions of the test file. For
+comparison, MMD 6 can process 128 copies in approximately 140 msec.
+
+Of note, the throughput speed drops when testing more complicated files
+containing more advanced MultiMarkdown features, though it still seems to
+maintain linear performance characteristics. A second test file is created by
+concatenating all of the test suite files (including the Markdown syntax
+file). In this case, MMD gets about 13 MB/s. CommonMark doesn't support
+these additional features, so testing it with that file is not relevant. I
+will work to see whether there are certain features in particular that are
+more challenging and see whether they can be reworked to improve performance.
+
+As above, I have done some high level optimization of the parse strategy, but
+I'm sure there's still a lot of room for further improvement to be made.
+Suggestions welcome!
+
+
+# Testing #
+
+## Test Suite ##
+
+The development of MMD v6 was heavily, but not absolutely, influenced by the
+philosophy of test-driven development. While coding, I made use of test
+suites to verify successful implementation of new features, to avoid
+regression problems when adding new features, and to identify known edge cases
+in need of proper handling.
+
+The test suite (located in `tests/MMD6Tests`) is a "living" collection of
+documents that will continue to be updated as new bugs and edge cases are
+identified. This allows proper integration testing of the entire application
+with every release.
+
+
+## Fuzz Testing ##
+
+I was not familiar with the concept of
+[Fuzz Testing](https://en.wikipedia.org/wiki/Fuzzing) until a user mentioned
+something about it to me a year or two ago. I had never used it before, but
+it seemed like a good idea. I implemented it in two ways.
+
+The first is that I created a simplified version of the line parser that
+simply accepts various combinations of line type identifiers to see if they
+would successfully parse. The line parser is responsible for taking a series
+of line types (e.g. plain text, indented line, etc.) and determining what sort
+of block they should become. The file `test/parser_text.y` is run through the
+`lemon` program, compiled (with or without the `-DNDEBUG` flag) and then run.
+It sequentially throws every combination of line types at the simplified line
+parser to make sure that it doesn't choke. When I first did this, I found
+several combinations of lines that did not pass.
+
+**NOTE**: This does not verify accurate parsing, simply that the parser does
+not crash when given an unexpected combination of lines.
+
+The second form of fuzz testing I have started using more recently: the
+[American fuzzy lop](http://lcamtuf.coredump.cx/afl/) program, which tries
+to find text input that crashes MMD. This works by taking sample input (e.g.
+files from the test suite), modifying it slightly, and trying the modified
+versions. Done over and over and over, this sometimes turns up interesting
+edge cases, and I have already found a few this way. Definitely a very
+useful tool!
+
+
+## Unit Testing ##
+
+Some of the original development was done with unit testing in other tools I
+developed, and that code formed the basis of a few parts of MMD. Otherwise,
+it was hard to see how to create very good unit tests for the development of
+MMD itself, so there is not much unit testing built into the code or used
+during development.
+
+
+
+[>MMD]: MultiMarkdown
+[>MD]: Markdown
+
+[CriticMarkup]: http://criticmarkup.com/
+
+[?PEG]: Parsing Expression Grammar <https://en.wikipedia.org/wiki/Parsing_expression_grammar>
+
+[?AST]: Abstract Syntax Tree <https://en.wikipedia.org/wiki/Abstract_syntax_tree>
+
-### Performance ###
-
-Basic tests show that MMD 6 currently takes about 20-25% longer than
-CommonMark 0.27.0 to process long files (e.g. 0.2 MB). However, it is around
-5% *faster* than CommonMark when parsing a shorter file (27 kB), measured by
-parsing the same file 200 times over. These tests use the Markdown
-[syntax page], modified to avoid the use of a Setext header at the top. The
-longer test files are created by repeatedly concatenating the syntax page
-onto itself, doubling the length of the file with each iteration.
-
-The largest file I test is approximately 108 MB (4096 copies of the syntax
-page). On my machine (2012 Mac mini with 2.3 GHz Intel Core i7, 16 GB RAM),
-it takes approximately 4.4 seconds to parse with MMD 6 and 3.7 seconds with
-CommonMark. MMD 6 processes approximately 25 MB/s on this test file.
-CommonMark 0.27.0 gets about 29 MB/s on the same machine.
-
-There are some slight variations with the smaller test files (8-32 copies),
-but overall the performance of both programs (MMD 6 and CommonMark) is
-roughly linear as the test file gets bigger (double the file size and it
-takes twice as long to parse, aka O(n)).
-
-Out of curiosity, I ran the same tests on the original Markdown.pl by Gruber
-(v 1.0.2b8). It took approximately 178 seconds to parse 128 copies of the
-file (3.4 MB) and was demonstrating quadratic performance characteristics
-(double the file size and it takes 2^2 or 4 times longer to process, aka
-O(n^2)). I didn't bother running it on larger versions of the test file. For
-comparison, MMD 6 can process 128 copies in approximately 140 msec.
-
-Of note, the throughput speed drops when testing more complicated files
-containing more advanced MultiMarkdown features, though it still seems to
-maintain linear performance characteristics. A second test file is created by
-concatenating all of the test suite files (including the Markdown syntax
-file). In this case, MMD gets about 13 MB/s. CommonMark doesn't support
-these additional features, so testing it with that file is not relevant. I
-will investigate whether certain features in particular are more costly, and
-whether they can be reworked to improve performance.
-
-As above, I have done some high-level optimization of the parse strategy, but
-I'm sure there is still plenty of room for further improvement. Suggestions
-welcome!
-
## License ##
--- /dev/null
+<?xml version="1.0" encoding="UTF-8"?>
+<office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
+ xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
+ xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
+ xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
+ xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
+ xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
+ xmlns:xlink="http://www.w3.org/1999/xlink"
+ xmlns:dc="http://purl.org/dc/elements/1.1/"
+ xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
+ xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
+ xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
+ xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0"
+ xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
+ xmlns:math="http://www.w3.org/1998/Math/MathML"
+ xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0"
+ xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0"
+ xmlns:config="urn:oasis:names:tc:opendocument:xmlns:config:1.0"
+ xmlns:ooo="http://openoffice.org/2004/office"
+ xmlns:ooow="http://openoffice.org/2004/writer"
+ xmlns:oooc="http://openoffice.org/2004/calc"
+ xmlns:dom="http://www.w3.org/2001/xml-events"
+ xmlns:xforms="http://www.w3.org/2002/xforms"
+ xmlns:xsd="http://www.w3.org/2001/XMLSchema"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xmlns:rpt="http://openoffice.org/2005/report"
+ xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2"
+ xmlns:xhtml="http://www.w3.org/1999/xhtml"
+ xmlns:grddl="http://www.w3.org/2003/g/data-view#"
+ xmlns:tableooo="http://openoffice.org/2009/table"
+ xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0"
+ xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0"
+ xmlns:css3t="http://www.w3.org/TR/css3-text/"
+ office:version="1.2"
+ grddl:transformation="http://docs.oasis-open.org/office/1.2/xslt/odf2rdf.xsl"
+ office:mimetype="application/vnd.oasis.opendocument.text">
+<office:font-face-decls>
+ <style:font-face style:name="Courier New" svg:font-family="'Courier New'"
+ style:font-adornments="Regular"
+ style:font-family-generic="modern"
+ style:font-pitch="fixed"/>
+</office:font-face-decls>
+<office:styles>
+<style:style style:name="Standard" style:family="paragraph" style:class="text">
+ <style:paragraph-properties fo:margin-top="0in" fo:margin-bottom="0.15in" fo:text-align="justify" style:justify-single-word="false"/>
+ </style:style>
+<style:style style:name="Preformatted_20_Text" style:display-name="Preformatted Text"
+ style:family="paragraph"
+ style:parent-style-name="Standard"
+ style:class="html">
+ <style:paragraph-properties fo:margin-top="0in" fo:margin-bottom="0in" fo:text-align="start"
+ style:justify-single-word="false"/>
+ <style:text-properties style:font-name="Courier New" fo:font-size="11pt"
+ style:font-name-asian="Courier New"
+ style:font-size-asian="11pt"
+ style:font-name-complex="Courier New"
+ style:font-size-complex="11pt"/>
+</style:style>
+<style:style style:name="Source_20_Text" style:display-name="Source Text"
+ style:family="text">
+ <style:text-properties style:font-name="Courier New" style:font-name-asian="Courier New"
+ style:font-name-complex="Courier New"
+ fo:font-size="11pt"/>
+</style:style>
+<style:style style:name="List" style:family="paragraph"
+ style:parent-style-name="Standard"
+ style:class="list">
+ <style:paragraph-properties fo:text-align="start" style:justify-single-word="false"/>
+ <style:text-properties style:font-size-asian="12pt"/>
+</style:style>
+<style:style style:name="Quotations" style:family="paragraph"
+ style:parent-style-name="Standard"
+ style:class="html">
+ <style:paragraph-properties fo:margin-left="0.3937in" fo:margin-right="0.3937in" fo:margin-top="0in"
+ fo:margin-bottom="0.1965in"
+ fo:text-align="justify" style:justify-single-word="false" fo:text-indent="0in"
+ style:auto-text-indent="false"/>
+</style:style>
+<style:style style:name="Table_20_Heading" style:display-name="Table Heading"
+ style:family="paragraph"
+ style:parent-style-name="Table_20_Contents"
+ style:class="extra">
+ <style:paragraph-properties fo:text-align="center" style:justify-single-word="false"
+ text:number-lines="false"
+ text:line-number="0"/>
+ <style:text-properties fo:font-weight="bold" style:font-weight-asian="bold"
+ style:font-weight-complex="bold"/>
+</style:style>
+<style:style style:name="Horizontal_20_Line" style:display-name="Horizontal Line"
+ style:family="paragraph"
+ style:parent-style-name="Standard"
+ style:class="html">
+ <style:paragraph-properties fo:margin-top="0in" fo:margin-bottom="0.1965in"
+ style:border-line-width-bottom="0.0008in 0.0138in 0.0008in"
+ fo:padding="0in"
+ fo:border-left="none"
+ fo:border-right="none"
+ fo:border-top="none"
+ fo:border-bottom="0.0154in double #808080"
+ text:number-lines="false"
+ text:line-number="0"
+ style:join-border="false"/>
+ <style:text-properties fo:font-size="6pt" style:font-size-asian="6pt" style:font-size-complex="6pt"/>
+</style:style>
+<style:style style:name="Footnote_20_anchor" style:display-name="Footnote anchor" style:family="text"> <style:text-properties style:text-position="super 58%"/> </style:style>
+<style:style style:name="TOC_Item" style:family="paragraph" style:parent-style-name="Standard">
+ <style:paragraph-properties>
+ <style:tab-stops>
+ <style:tab-stop style:position="6.7283in" style:type="right" style:leader-style="dotted" style:leader-text="."/>
+ </style:tab-stops>
+ </style:paragraph-properties>
+</style:style>
+ <text:notes-configuration text:note-class="footnote" text:default-style-name="Footnote" text:citation-style-name="Footnote_20_Symbol" text:citation-body-style-name="Footnote_20_anchor" text:master-page-name="Footnote" style:num-format="a" text:start-value="0" text:footnotes-position="page" text:start-numbering-at="page"/>
+ <text:notes-configuration text:note-class="endnote" text:default-style-name="Endnote" text:citation-style-name="Endnote_20_Symbol" text:citation-body-style-name="Endnote_20_anchor" text:master-page-name="Endnote" style:num-format="1" text:start-value="0"/>
+</office:styles>
+<office:automatic-styles> <style:style style:name="MMD-Italic" style:family="text">
+ <style:text-properties fo:font-style="italic" style:font-style-asian="italic"
+ style:font-style-complex="italic"/>
+ </style:style>
+ <style:style style:name="MMD-Bold" style:family="text">
+ <style:text-properties fo:font-weight="bold" style:font-weight-asian="bold"
+ style:font-weight-complex="bold"/>
+ </style:style>
+ <style:style style:name="MMD-Superscript" style:family="text">
+ <style:text-properties style:text-position="super 58%"/>
+ </style:style>
+ <style:style style:name="MMD-Subscript" style:family="text">
+ <style:text-properties style:text-position="sub 58%"/>
+ </style:style>
+ <style:style style:name="Strike" style:family="text">
+ <style:text-properties style:text-line-through-style="solid" />
+ </style:style>
+ <style:style style:name="Underline" style:family="text">
+ <style:text-properties style:text-underline-style="solid" style:text-underline-color="font-color"/>
+ </style:style>
+ <style:style style:name="Highlight" style:family="text">
+ <style:text-properties fo:background-color="#FFFF00" />
+ </style:style>
+ <style:style style:name="Comment" style:family="text">
+ <style:text-properties fo:color="#0000BB" />
+ </style:style>
+<style:style style:name="MMD-Table" style:family="paragraph" style:parent-style-name="Standard">
+ <style:paragraph-properties fo:margin-top="0in" fo:margin-bottom="0.05in"/>
+</style:style>
+<style:style style:name="MMD-Table-Center" style:family="paragraph" style:parent-style-name="MMD-Table">
+ <style:paragraph-properties fo:text-align="center" style:justify-single-word="false"/>
+</style:style>
+<style:style style:name="MMD-Table-Right" style:family="paragraph" style:parent-style-name="MMD-Table">
+ <style:paragraph-properties fo:text-align="right" style:justify-single-word="false"/>
+</style:style>
+<style:style style:name="P2" style:family="paragraph" style:parent-style-name="Standard"
+ style:list-style-name="L2">
+<style:paragraph-properties fo:text-align="start" style:justify-single-word="false"/>
+</style:style>
+<style:style style:name="fr1" style:family="graphic" style:parent-style-name="Frame">
+ <style:graphic-properties style:print-content="true" style:vertical-pos="top"
+ style:vertical-rel="baseline"
+ fo:padding="0in"
+ fo:border="none"
+ style:shadow="none"/>
+</style:style>
+<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard"
+ style:list-style-name="L1"/>
+<text:list-style style:name="L1">
+ <text:list-level-style-bullet text:level="1" text:style-name="Numbering_20_Symbols" style:num-suffix="." text:bullet-char="•">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="0.5in" fo:text-indent="-0.25in" fo:margin-left="0.5in"/>
+ </style:list-level-properties>
+ </text:list-level-style-bullet>
+ <text:list-level-style-bullet text:level="2" text:style-name="Numbering_20_Symbols" style:num-suffix="." text:bullet-char="◦">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="0.75in" fo:text-indent="-0.25in" fo:margin-left="0.75in"/>
+ </style:list-level-properties>
+ </text:list-level-style-bullet>
+ <text:list-level-style-bullet text:level="3" text:style-name="Numbering_20_Symbols" style:num-suffix="." text:bullet-char="▪">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="1in" fo:text-indent="-0.25in" fo:margin-left="1in"/>
+ </style:list-level-properties>
+ </text:list-level-style-bullet>
+ <text:list-level-style-number text:level="4" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="1.25in" fo:text-indent="-0.25in" fo:margin-left="1.25in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="5" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="1.5in" fo:text-indent="-0.25in" fo:margin-left="1.5in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="6" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="1.75in" fo:text-indent="-0.25in" fo:margin-left="1.75in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="7" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="2in" fo:text-indent="-0.25in" fo:margin-left="2in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="8" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="2.25in" fo:text-indent="-0.25in" fo:margin-left="2.25in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="9" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="2.5in" fo:text-indent="-0.25in" fo:margin-left="2.5in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="10" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="2.75in" fo:text-indent="-0.25in" fo:margin-left="2.75in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+</text:list-style>
+<text:list-style style:name="L2">
+ <text:list-level-style-number text:level="1" text:style-name="Standard" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="0.5in" fo:text-indent="-0.25in" fo:margin-left="0.5in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="2" text:style-name="Standard" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="0.75in" fo:text-indent="-0.25in" fo:margin-left="0.75in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="3" text:style-name="Standard" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="1in" fo:text-indent="-0.25in" fo:margin-left="1in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="4" text:style-name="Standard" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="1.25in" fo:text-indent="-0.25in" fo:margin-left="1.25in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="5" text:style-name="Standard" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="1.5in" fo:text-indent="-0.25in" fo:margin-left="1.5in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="6" text:style-name="Standard" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="1.75in" fo:text-indent="-0.25in" fo:margin-left="1.75in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="7" text:style-name="Standard" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="2in" fo:text-indent="-0.25in" fo:margin-left="2in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="8" text:style-name="Standard" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="2.25in" fo:text-indent="-0.25in" fo:margin-left="2.25in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="9" text:style-name="Standard" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="2.5in" fo:text-indent="-0.25in" fo:margin-left="2.5in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+ <text:list-level-style-number text:level="10" text:style-name="Standard" style:num-suffix="." style:num-format="1">
+ <style:list-level-properties text:list-level-position-and-space-mode="label-alignment">
+ <style:list-level-label-alignment text:label-followed-by="listtab" text:list-tab-stop-position="2.75in" fo:text-indent="-0.25in" fo:margin-left="2.75in"/>
+ </style:list-level-properties>
+ </text:list-level-style-number>
+</text:list-style>
+</office:automatic-styles>
+ <office:master-styles>
+ <style:master-page style:name="Endnote" >
+ <style:header><text:h text:outline-level="2">Bibliography</text:h></style:header></style:master-page>
+ <style:master-page style:name="Footnote" style:page-layout-name="pm2"/>
+ </office:master-styles>
+<office:meta>
+ <dc:title>Fuzz Testing</dc:title>
+</office:meta>
+<office:body>
+<office:text>
+<text:p text:style-name="Standard">Collection of test cases identified by <text:a xlink:type="simple" xlink:href="http://lcamtuf.coredump.cx/afl/">American fuzzy lop</text:a>.</text:p>
+
+<text:p text:style-name="Standard">>bar~~}</text:p>
+
+<text:list text:style-name="L1">
+<text:list-item>
+<text:p text:style-name="P1">
+list</text:p></text:list-item>
+
+<text:list-item>
+<text:p text:style-name="Standard">tems</text:p></text:list-item>
+
+</text:list>
+
+<text:h text:outline-level="2"><text:bookmark text:name="escaped"/>:Escapes </text:h>
+
+<text:p text:style-name="Standard">[>MM]: MultiMarkdown</text:p>
+
+<text:p text:style-name="Standard"><text:a xlink:type="simple" xlink:href="#ba\">foo1</text:a></text:p>
+
+<text:p text:style-name="Standard"><text:a xlink:type="simple" xlink:href="#bar">foo2</text:a></text:p>
+
+<text:p text:style-name="Standard">]: And aÄe footn.I<text:span text:style-name="MMD-Superscript">Thi1</text:span>. $</text:p>
+
+<text:p text:style-name="Standard">[MMD]:</text:p>
+
+<text:p text:style-name="Standard">f F O O (o o) f o o</text:p>
+
+<text:list text:style-name="L1">
+<text:list-item>
+<text:p text:style-name="P1">
+foo</text:p></text:list-item>
+
+<text:list-item>
+<text:p text:style-name="Standard">Â foo</text:p></text:list-item>
+
+</text:list>
+
+<text:p text:style-name="Standard">(This must be at end of file without trailing newline)</text:p>
+
+<text:h text:outline-level="1"><text:bookmark text:name="a"/>A</text:h>
+</office:text>
+</office:body>
+</office:document>
--- /dev/null
+<!DOCTYPE html>
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
+<head>
+ <meta charset="utf-8"/>
+ <title>Fuzz Testing</title>
+</head>
+<body>
+
+<p>Collection of test cases identified by <a href="http://lcamtuf.coredump.cx/afl/">American fuzzy lop</a>.</p>
+
+<p>>bar~~}</p>
+
+<ul>
+<li>list</li>
+<li>tems</li>
+</ul>
+
+<h2 id="escaped">:Escapes </h2>
+
+<p>[>MM]: MultiMarkdown</p>
+
+<p><a href="#ba\">foo1</a></p>
+
+<p><a href="#bar">foo2</a></p>
+
+<p>]: And aÄe footn.I<sup>Thi1</sup>. $</p>
+
+<p>[MMD]:</p>
+
+<p>f F O O (<abbr title="F O O">o o</abbr>) f <abbr title="F O O">o o</abbr></p>
+
+<ul>
+<li>foo</li>
+<li>Â foo</li>
+</ul>
+
+<p>(This must be at end of file without trailing newline)</p>
+
+<h1 id="a">A</h1>
+
+</body>
+</html>
+
--- /dev/null
+<p>Title: Fuzz Testing
+latex config: article</p>
+
+<p>Collection of test cases identified by <a href="http://lcamtuf.coredump.cx/afl/">American fuzzy lop</a>.</p>
+
+<p>û~~foo~>bar~~}</p>
+
+<ul>
+<li>list</li>
+<li>tems</li>
+</ul>
+
+<h2>:Escapes [escaped]</h2>
+
+<p>[>MM]: MultiMarkdown</p>
+
+<p>[?terí¢ıı[?term]: A term to be defined.</p>
+
+<p><a href="#ba\">foo1</a></p>
+
+<p><a href="#bar">foo2</a></p>
+
+<p>]: And aÄe footn.I^Thi1. $</p>
+
+<p>[MMD]:</p>
+
+<p>f o o f o o</p>
+
+<p>[> o o]: F O O</p>
+
+<ul>
+<li>foo</li>
+<li>Â foo</li>
+</ul>
+
+<p>(This must be at end of file without trailing newline)</p>
+
+<h1>A</h1>
--- /dev/null
+\input{mmd6-article-leader}
+\def\mytitle{Fuzz Testing}
+\newacronym{o o}{o o}{F O O}
+
+\input{mmd6-article-begin}
+
+Collection of test cases identified by \href{http://lcamtuf.coredump.cx/afl/}{American fuzzy lop}\footnote{\href{http://lcamtuf.coredump.cx/afl/}{http:\slash \slash lcamtuf.coredump.cx\slash afl\slash }}.
+
+>bar~~}
+
+\begin{itemize}
+\item{} list
+
+\item{} tems
+
+\end{itemize}
+
+\chapter{:Escapes }
+\label{escaped}
+
+[>MM]: MultiMarkdown
+
+foo1 (\autoref{ba\})
+
+foo2 (\autoref{bar})
+
+]: And aÄe footn.I\textsuperscript{Thi1}. \$
+
+[MMD]:
+
+f \gls{o o} f \gls{o o}
+
+\begin{itemize}
+\item{} foo
+
+\item{} Â foo
+
+\end{itemize}
+
+(This must be at end of file without trailing newline)
+
+\part{A}
+\label{a}
+
+\input{mmd6-article-footer}
+\end{document}
--- /dev/null
+Title: Fuzz Testing
+latex config: article
+
+Collection of test cases identified by [American fuzzy lop](http://lcamtuf.coredump.cx/afl/).
+
+û~~foo~>bar~~}
+
+* list
+* tems
+
+:Escapes [escaped]
+----------------
+
+[>MM\]: MultiMarkdown
+
+[?terí¢ıı[?term]: A term to be defined.
+
+[foo1]
+
+[foo2]
+
+[foo1]: #ba\
+[foo2]: #bar
+
+]: And aÄe footn.I^Thi1. \$
+
+[MMD]:
+
+f o o f o o
+
+[> o o]: F O O
+
+* foo
+* foo
+
+
+
+(This must be at end of file without trailing newline)
+
+# A
\ No newline at end of file