From 630db16db5ede5051e03d4cc963adfcd761bd879 Mon Sep 17 00:00:00 2001 From: "Fletcher T. Penney" Date: Wed, 15 Mar 2017 10:39:48 -0400 Subject: [PATCH] CHANGE: Update documentation and test suite --- DevelopmentNotes/DevelopmentNotes.txt | 431 ++++++++++++++++++ QuickStart.epub => QuickStart/QuickStart.epub | Bin QuickStart.fodt => QuickStart/QuickStart.fodt | 0 QuickStart.html => QuickStart/QuickStart.html | 0 QuickStart.pdf => QuickStart/QuickStart.pdf | Bin QuickStart.txt => QuickStart/QuickStart.txt | 7 +- README.md | 350 -------------- templates/README.md.in | 350 -------------- .../{MMD6Tests => Disabled}/What Is MMD.text | 0 tests/MMD6Tests/Fuzz.fodt | 322 +++++++++++++ tests/MMD6Tests/Fuzz.html | 43 ++ tests/MMD6Tests/Fuzz.htmlc | 38 ++ tests/MMD6Tests/Fuzz.tex | 46 ++ tests/MMD6Tests/Fuzz.text | 40 ++ 14 files changed, 923 insertions(+), 704 deletions(-) create mode 100644 DevelopmentNotes/DevelopmentNotes.txt rename QuickStart.epub => QuickStart/QuickStart.epub (100%) rename QuickStart.fodt => QuickStart/QuickStart.fodt (100%) rename QuickStart.html => QuickStart/QuickStart.html (100%) rename QuickStart.pdf => QuickStart/QuickStart.pdf (100%) rename QuickStart.txt => QuickStart/QuickStart.txt (98%) rename tests/{MMD6Tests => Disabled}/What Is MMD.text (100%) create mode 100644 tests/MMD6Tests/Fuzz.fodt create mode 100644 tests/MMD6Tests/Fuzz.html create mode 100644 tests/MMD6Tests/Fuzz.htmlc create mode 100644 tests/MMD6Tests/Fuzz.tex create mode 100644 tests/MMD6Tests/Fuzz.text diff --git a/DevelopmentNotes/DevelopmentNotes.txt b/DevelopmentNotes/DevelopmentNotes.txt new file mode 100644 index 0000000..0dbb37a --- /dev/null +++ b/DevelopmentNotes/DevelopmentNotes.txt @@ -0,0 +1,431 @@ +Title: MultiMarkdown v6 Development Notes +Author: Fletcher T. Penney +Date: 2017-03-14 +LaTeX Config: tufte-handout +Base Header Level: 3 + + +# Introduction # + +This document includes some notes on the development of MMD v6. 
Most of it +will be interesting only to other developers or those needing to choose the +absolute "best" MD implementation for their needs -- it is not required +reading to understand how the software works. + + +## Why a New Version? ## + +MultiMarkdown version 5 was released in November of 2015, but the codebase was +essentially the same as that of v4 -- and that was released in beta in April +of 2013. A few key things prompted work on a new version: + +* Accuracy -- MMD v4 and v5 were the most accurate versions yet, and a lot of +effort went into finding and resolving various edge cases. However, it began +to feel like a game of whack-a-mole where new bugs would creep in every time I +fixed an old one. The PEG began to feel rather convoluted in spots, even +though it did allow for a precise (if not always accurate) specification of +the grammar. + +* Performance -- "Back in the day" [peg-markdown] was one of the fastest +Markdown parsers around. MMD v3 was based on peg-markdown, and would +leapfrog with it in terms of performance. Then [CommonMark] was released, which +was a bit faster. Then a couple of years went by and CommonMark became *much* +faster -- in one of my test suites, MMD v5.4.0 takes about 25 times longer to +process a long document than CommonMark 0.27.0. + +[peg-markdown]: https://github.com/jgm/peg-markdown +[CommonMark]: http://commonmark.org/ + +In the spring of 2016, I decided I wanted to rewrite MultiMarkdown from scratch, +building the parser myself rather than relying on a pre-rolled solution. (I +had been using [greg](https://github.com/ooc-lang/greg) to compile the PEG +into parser code. It worked well overall, but lacked some features I needed, +requiring a lot of workarounds.) + + +# First Attempt # + +My first attempt started by hand-crafting a parser that scanned through the +document a line at a time, deciding what to do with each line as it was +encountered.
I used regex parsers made with [re2c](http://re2c.org/index.html) to +help classify each line, and then a separate parser layer to process groups of +lines into blocks. Initially this approach worked well, and was really +efficient. But I quickly began to code my way into a dead-end -- the strategy +was not elegant enough to handle things like nested lists, etc. + +One thing that did turn out well from the first attempt, however, was an +approach for handling `<strong>` and `<em>` parsing. I've learned over the +years that this can be one of the hardest parts of coding accurately for +Markdown. There are many examples that are obvious to a person, but difficult +to properly "explain" how to parse to a computer. + +No solution is perfect, but I developed an approach that seems to accurately +handle a wide range of situations without a great deal of complexity: + +1. Scan the document for asterisks (`*`). Each one will be handled one at a +time. + +2. Unlike brackets (`[` and `]`), an asterisk is "ambidextrous", in that it +may be able to open a matched pair of asterisks, close a pair, or both. For +example, in `foo *bar* foo`: + + 1. The first asterisk can open a pair, but not close one. + + 2. The second asterisk can close a pair, but not open one. + +3. So, once the asterisks have been identified, each has to be examined to +determine whether it can open/close/both. The algorithm is not that complex, +but I'll describe it in general terms. Check the code for more specifics. +This approach seems to work, but might still need some slight tweaking. In +the future, I'll codify this better in language rather than just in code. + + 1. If there is whitespace to the left of an asterisk, it can't close. + + 2. If there is whitespace or punctuation to the right, it can't open. + + 3. "Runs" of asterisks, e.g. `**bar`, are treated as a unit in terms of + looking left/right. + + 4. 
Asterisks inside a word are a bit trickier -- we look at the number of + asterisks before the word, the number in the current run, and the number + of asterisks after the word to determine which combinations, if any, are + permitted. + +4. Once all asterisks have been tagged as able to open/close/both, we proceed +through them in order: + + 1. When we encounter a tag that can close, we look to see if there is a + previous opener that has not been paired off. If so, pair the two and + remove the opener from the list of available asterisks. + + 2. When we encounter an opener, add it to the stack of available openers. + + 3. When we encounter an asterisk that can do both, see if it can close an + existing opener. If not, then add it to the stack. + +5. After all tokens in the block have been paired, we look for nesting +pairs of asterisks in order to create `<strong>` and `<em>` sets. For +example, assume we have six asterisks wrapped around a word, three in front, +and three after. The asterisks are indicated with numbers: `123foo456`. We +proceed in the following manner: + + 1. Based on the pairing algorithm above, these asterisks would be paired as + follows, with matching asterisks sharing numbers -- `123foo321`. + + 2. Moving forwards, we come to asterisk "1". It is followed by an + asterisk, so we check to see if they should be grouped as a `<strong>`. + Since the "1" asterisks are wrapped immediately outside the "2" asterisks, + they are joined together. More than two pairs can't be joined, so we now + get the following -- `112foo211`, where the "11" represents the opening + and closing of a `<strong>`, and the "2" represents an `<em>`. + +6. When matching a pair, any unclosed openers that are on the stack are +removed, preventing pairs from "crossing" or "intersecting". Pairs can wrap +around each other, e.g. `[(foo)]`, but not intersect like `[(foo])`. In the +second case, the brackets would close, removing the `(` from the stack. + +7. 
This same approach is used in all tokens that are matched in pairs -- +`[foo]`, `(foo)`, `_foo_`, etc. There's slightly more to it, but once you +figure out how to assign opening/closing ability, the rest is easy. By using +a stack to track available openers, it can be performed efficiently. + +In my testing, this approach has worked quite well. It handles all the basic +scenarios I've thrown at it, and all of the "basic" and "devious" edge cases I +have thought of (some of these don't necessarily have a "right" answer -- but +v6 gives consistent answers that seem as reasonable as any others to me). +There are also three more edge cases I've come up with that can still stump +it, and ironically they are handled correctly by most implementations. They +just don't follow the rules above. I'll continue to work on this. + +In the end, I scrapped this effort, but kept the lessons learned in the token +pairing algorithm. + + +# Second Attempt # + +I tried again this past fall. This time, I approached the problem with lots +of reading. *Lots and lots* of reading -- tons of websites, computer science +journal articles, PhD theses, etc. I learned a lot about lexers, and a lot +about parsers, including hand-crafting vs. using parser generators. In brief: + +1. I learned about the [Aho–Corasick algorithm], which is a great way to +efficiently search a string for multiple target strings at once. I used this +to create a custom lexer to identify tokens in a MultiMarkdown text document +(e.g. `*`, `[`, `{++`, etc.). I learned a lot, and had a good time working +out the implementation. This code allowed me to efficiently break a string of +text into the tokens that mattered for Markdown parsing. + +2. However, in a few instances I really needed some features of regular +expressions to simplify more complex structures. After a quick bit of testing, +using re2c to create a tokenizer was just as efficient, and allowed me to +incorporate some regex functionality that simplified later parsing.
I'll keep +the Aho-Corasick stuff around, and will probably experiment more with it +later, but I didn't need it for MMD for now. `lexer.re` contains the source for +the tokenizer. + +[Aho–Corasick algorithm]: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm + +I looked long and hard for a way to simplify the parsing algorithm to try and +"touch" each token only once. Ideally, the program could step through each +token, and decide when to create a new block, when to pair things together, +etc. But I'm not convinced it's possible. Since Markdown's grammar varies +based on context, it seems to work best when handled in distinct phases: + +1. Tokenize the string to identify key sections of text. This includes line +breaks, allowing the text to be examined one line at a time. + +2. Join series of lines together into blocks, such as paragraphs, code blocks, +lists, etc. + +3. The tokens inside each block can then be paired together to create more +complex syntax such as links, strong, emphasis, etc. + +To handle the block parsing, I started off using the [Aho-Corasick] code from +my first attempt. I had actually implemented some basic regex +functionality, and used that to group lines together to create blocks. But +this quickly fell apart in the face of more complex structures such as +recursive lists. After a lot of searching, and *tons* more reading, I +ultimately decided to use a parser generator to handle the task of grouping +lines into blocks. `parser.y` has the source for this, and it is processed by +the [lemon](http://www.hwaci.com/sw/lemon/) parser generator to create the +actual code. + +I chose to do this because hand-crafting the block parser would be complex. +The end result would likely be difficult to read and understand, which would +make it difficult to update later on. Using the parser generator allows me to +write things out in a way that can more easily be understood by a person.
In +all likelihood, the performance is as good as anything I could do +anyway, if not better. + +Because lemon is a LALR(1) parser, it does require a bit of thinking ahead +about how to create the grammar used. But so far, it has been able to handle +everything I have thrown at it. + + +# Optimization # + +One of my goals for MMD 6 was performance. So I've paid attention to speed +along the way, and have tried to use a few tricks to keep things fast. Here +are some things I've learned, in no particular order: + + +## Memory Allocation ## + +When parsing a long document, a *lot* of token structures are created. Each +one requires a small bit of memory to be allocated. In aggregate, that time +added up and slowed down performance. + +After reading for a bit, I ended up with an approach that uses +larger chunks of memory. I allocate pools of memory in large slabs for +smaller "objects". For example, I allocate memory for 1024 tokens at a +single time, and then dole that memory out as needed. When the slab is empty, +a new one is allocated. This dramatically improved performance. + +When pairing tokens, I originally created a new stack for each block. Then I +realized that an empty stack didn't have any "leftover" cruft to interfere +with re-use, so I just used one for the entire document. Again, a sizeable +improvement in performance from only allocating one object instead of many. +When recursing to a deeper level, the stack just gets deeper, but earlier +levels aren't modified. + +Speaking of tokens, I realized that the average document contains a lot of +single spaces (there's one between every two words I have written, for +example). The vast majority of the time, these single spaces have no effect +on the output of Markdown documents. I changed my whitespace token search to +only flag runs of 2 or more spaces, dramatically reducing the number of +tokens.
This gives the benefit of needing fewer memory allocations, and also +reduces the number of tokens that need to be processed later on. The only +downside is remembering to check for a single space character in a few +instances where it matters. + + +## Proper Input Buffering ## + +When I first began last spring, I was amazed to see how much time was being +spent by MultiMarkdown simply reading the input file. Then I discovered it +was because I was reading it one character at a time. I switched to using a +buffered read approach and the time to read the file went to almost nothing. I +experimented with different buffer sizes, but they did not seem to make a +measurable difference. + + +## Output Buffering ## + +I experimented with different approaches to creating the output after parsing. +I tried printing directly to `stdout`, and even played with different +buffering settings. None of those seemed to work well, and all were slower +than using the `d_string` approach (formerly called `GString` in MMD 5). + + +## Fast Searches ## + +After getting basic Markdown functionality complete, I discovered during +testing that the time required to parse a document grew quadratically as the +document grew longer. Performance was on par with CommonMark for shorter +documents, but fell increasingly behind in larger tests. Time profiling found +that the culprit was searching for link definitions when they didn't exist. +My first approach was to keep a stack of used link definitions, and to iterate +through them when necessary. In long documents, this performs very poorly. +After more research, I ended up using +[uthash](http://troydhanson.github.io/uthash/). This allows me to search for +a link (or footnote, etc.) by "name" rather than searching through an array. +This allowed me to get MMD's performance back to O(n), taking roughly twice as +much time to process a document that is twice as long.
+ + +## Efficient Utility Functions ## + +It is frequently necessary when parsing Markdown to check what sort of +character we are dealing with at a certain position -- a letter, whitespace, +punctuation, etc. I created a lookup table for this via `char_lookup.c` and +hard-coded it in `char.c`. These routines allow me to quickly and +consistently classify any byte within a document. This saved a lot of +programming time, and saved time tracking down bugs from handling things +slightly differently under different circumstances. I also suspect it +improved performance, but don't have the data to back that up. + + +## Testing While Writing ## + +I developed several chunks of code in parallel while creating MMD 6. The vast +majority of it was developed in a [test-driven development] approach; the rest +was created with extensive unit testing. + +[test-driven development]: https://en.wikipedia.org/wiki/Test-driven_development + +MMD isn't particularly amenable to this approach at the small level, but +instead I relied more on integration testing with an ever-growing collection +of text files and the corresponding HTML files in the MMD 6 test suite. This +allowed me to ensure new features work properly and that old features aren't +broken. At this time, there are 29 text files in the test suite, and many +more to come. + + +## Other Lessons ## + +Some things that didn't do me any good.... + +I considered differences between using `malloc` and `calloc` when initializing +tokens. The time saved by using `malloc` was basically exactly offset by the +initial time required to initialize the token to default null values as +compared to using `calloc`. When trying `calloc` failed to help me out +(I had thought that clearing a single slab in the object pool would be +faster), I stuck with `malloc`, as it makes more sense to me in my workflow. + +I read a bit about [struct padding] and reordered some of my structs.
It wasn't +until later that I discovered the `-Wpadded` option, and it's not clear +whether my changes actually changed anything. Since the structs were being +padded automatically, there was no noticeable performance change, and I didn't +have the tools to measure whether I could have improved memory usage at all. +Not sure this would be worth the effort -- much lower hanging fruit available. + +[struct padding]: http://www.catb.org/esr/structure-packing/ + + +# Performance # + +Basic tests show that currently MMD 6 takes about 20-25% longer than +CommonMark 0.27.0 to process long files (e.g. 0.2 MB). However, it is around +5% *faster* than CommonMark when parsing a shorter file (27 kB) (measured by +parsing the same file 200 times over). This test is performed using the +Markdown [syntax page], modified to avoid the use of the Setext header at the +top. The longer files tested are created by copying the same syntax page onto +itself, thereby doubling the length of the file with each iteration. + +The largest file I test is approximately 108 MB (4096 copies of the syntax +page). On my machine (2012 Mac mini with 2.3 GHz Intel Core i7, 16 GB RAM), +it takes approximately 4.4 seconds to parse with MMD 6 and 3.7 seconds with +CommonMark. MMD 6 processes approximately 25 MB/s on this test file. +CommonMark 0.27.0 gets about 29 MB/s on the same machine. + +There are some slight variations with the smaller test files (8-32 copies), +but overall the performance of both programs (MMD 6 and CommonMark) is +roughly linear as the test file gets bigger (double the file size and it takes +twice as long to parse, aka O(n)). + +Out of curiosity, I ran the same tests on the original Markdown.pl by Gruber +(v 1.0.2b8). It took approximately 178 seconds to parse 128 copies of the +file (3.4 MB) and was demonstrating quadratic performance characteristics +(double the file size and it takes 2^2 or 4 times longer to process, aka +O(n^2)).
I didn't bother running it on larger versions of the test file. For +comparison, MMD 6 can process 128 copies in approximately 140 msec. + +Of note, the throughput speed drops when testing more complicated files +containing more advanced MultiMarkdown features, though it still seems to +maintain linear performance characteristics. A second test file is created by +concatenating all of the test suite files (including the Markdown syntax +file). In this case, MMD gets about 13 MB/s. CommonMark doesn't support +these additional features, so testing it with that file is not relevant. I +will work to identify whether there are certain features in particular that +are more challenging, and whether they can be reworked to improve performance. + +As above, I have done some high-level optimization of the parse strategy, but +I'm sure there's still a lot of room for further improvement to be made. +Suggestions welcome! + + +# Testing # + +## Test Suite ## + +The development of MMD v6 was heavily, but not absolutely, influenced by the +philosophy of test-driven development. While coding, I made use of test +suites to verify successful implementation of new features, to avoid +regression problems when adding new features, and to identify known edge cases +in need of proper handling. + +The test suite (located in `tests/MMD6Tests`) is a "living" collection of +documents that will continue to be updated as new bugs and edge cases are +identified. This enables proper integration testing of the entire +application with every release. + + +## Fuzz Testing ## + +I was not familiar with the concept of +[Fuzz Testing](https://en.wikipedia.org/wiki/Fuzzing) until a user mentioned +something about it to me a year or two ago. I had never used it before, but +it seemed like a good idea. I implemented it in two ways.
+ +The first is that I created a simplified version of the line parser that +simply accepts various combinations of line type identifiers to see if they +would successfully parse. The line parser is responsible for taking a series +of line types (e.g. plain text, indented line, etc.) and determining what sort +of block they should become. The file `test/parser_text.y` is run through the +`lemon` program, compiled (with or without the `-DNDEBUG` flag), and then run. +It sequentially throws every combination of line types at the simplified line +parser to make sure that it doesn't choke. When I first did this, I found +several combinations of lines that did not pass. + +**NOTE**: This does not verify accurate parsing, simply that the parser does +not crash when given an unacceptable combination of lines. + +The second form of fuzz testing is one I have started using more recently: +the [American fuzzy lop](http://lcamtuf.coredump.cx/afl/) program, which tries +to find text input that crashes MMD. This works by taking sample input (e.g. +files from the test suite), modifying it slightly, and trying the modified +versions. Do this over and over and over, and some interesting edge cases are +sometimes identified. I have found several such edge cases this way. +Definitely a very useful tool! + + +## Unit Testing ## + +Some of the original development was done with unit testing in some other +tools I developed. This code formed the basis of a few parts of MMD. +Otherwise, it was hard to see how to create very good unit tests for +the development of MMD. So there is really not much unit testing built into +the code or used during the development.
+ + + +[>MMD]: MultiMarkdown +[>MD]: Markdown + +[CriticMarkup]: http://criticmarkup.com/ + +[?PEG]: Parsing Expression Grammar + +[?AST]: Abstract Syntax Tree + diff --git a/QuickStart.epub b/QuickStart/QuickStart.epub similarity index 100% rename from QuickStart.epub rename to QuickStart/QuickStart.epub diff --git a/QuickStart.fodt b/QuickStart/QuickStart.fodt similarity index 100% rename from QuickStart.fodt rename to QuickStart/QuickStart.fodt diff --git a/QuickStart.html b/QuickStart/QuickStart.html similarity index 100% rename from QuickStart.html rename to QuickStart/QuickStart.html diff --git a/QuickStart.pdf b/QuickStart/QuickStart.pdf similarity index 100% rename from QuickStart.pdf rename to QuickStart/QuickStart.pdf diff --git a/QuickStart.txt b/QuickStart/QuickStart.txt similarity index 98% rename from QuickStart.txt rename to QuickStart/QuickStart.txt index 491975e..8115935 100644 --- a/QuickStart.txt +++ b/QuickStart/QuickStart.txt @@ -202,10 +202,9 @@ older versions of the EPUB format, but other tools can convert to other document formats you need. Same goes for Amazon's ebook formats -- the [Calibre] program can also be used to interconvert between formats. -**NOTE: Because EPUB documents are binary files, MMD only creates them when -**run in batch mode (using the `-b\--batch` options). Otherwise, it simply -**outputs the HTML 5 file that would serve as the primary content for the -**EPUB. +**NOTE**: Because EPUB documents are binary files, MMD only creates them when +run in batch mode (using the `-b\--batch` options). Otherwise, it simply +outputs the HTML 5 file that would serve as the primary content for the EPUB. ## Fenced Code Blocks ## diff --git a/README.md b/README.md index 9c4863c..3ce8774 100644 --- a/README.md +++ b/README.md @@ -201,315 +201,6 @@ my experiences for those who are working on their own products. But first, some background... -### Why a New Version? 
### - -MultiMarkdown version 5 was released in November of 2015, but the codebase was -essentially the same as that of v4 -- and that was released in beta in April -of 2013. A few key things prompted work on a new version: - -* Accuracy -- MMD v4 and v5 were the most accurate versions yet, and a lot of -effort went into finding and resolving various edge cases. However, it began -to feel like a game of whack-a-mole where new bugs would creep in every time I -fixed an old one. The PEG began to feel rather convoluted in spots, even -though it did allow for a precise (if not always accurate) specification of -the grammar. - -* Performance -- "Back in the day" [peg-markdown] was one of the fastest -Markdown parsers around. MMD v3 was based on peg-markdown, and would leap- -frog with it in terms of performance. Then [CommonMark] was released, which -was a bit faster. Then a couple of years went by and CommonMark became *much* -faster -- in one of my test suites, MMD v 5.4.0 takes about 25 times longer to -process a long document than CommonMark 0.27.0. - -[peg-markdown]: https://github.com/jgm/peg-markdown -[CommonMark]: http://commonmark.org/ - -Last spring, I decided I wanted to rewrite MultiMarkdown from scratch, -building the parser myself rather than relying on a pre-rolled solution. (I -had been using [greg](https://github.com/ooc-lang/greg) to compile the PEG -into parser code. It worked well overall, but lacked some features I needed, -requiring a lot of workarounds.) - - -## First Attempt ## - -My first attempt started by hand-crafting a parser that scanned through the -document a line at a time, deciding what to do with each line as it found -them. I used regex parsers made with [re2c](http://re2c.org/index.html) to -help classify each line, and then a separate parser layer to process groups of -lines into blocks. Initially this approach worked well, and was really -efficient. 
But I quickly began to code my way into a dead-end -- the strategy -was not elegant enough to handle things like nested lists, etc. - -One thing that did turn out well from the first attempt, however, was an -approach for handling `` and `` parsing. I've learned over the -years that this can be one of the hardest parts of coding accurately for -Markdown. There are many examples that are obvious to a person, but difficult -to properly "explain" how to parse to a computer. - -No solution is perfect, but I developed an approach that seems to accurately -handle a wide range of situations without a great deal of complexity: - -1. Scan the documents for asterisks (`*`). Each one will be handled one at a -time. - -2. Unlike brackets (`[` and `]`), an asterisk is "ambidextrous", in that it -may be able to open a matched pair of asterisks, close a pair, or both. For -example, in `foo *bar* foo`: - - 1. The first asterisk can open a pair, but not close one. - - 2. The second asterisk can close a pair, but not open one. - -3. So, once the asterisks have been identified, each has to be examined to -determine whether it can open/close/both. The algorithm is not that complex, -but I'll describe it in general terms. Check the code for more specifics. -This approach seems to work, but might still need some slight tweaking. In -the future, I'll codify this better in language rather than just in code. - - 1. If there is whitespace to the left of an asterisk, it can't close. - - 2. If there is whitespace or punctuation to the right it can't open. - - 3. "Runs" of asterisks, e.g. `**bar` are treated as a unit in terms of - looking left/right. - - 4. Asterisks inside a word are a bit trickier -- we look at the number of - asterisks before the word, the number in the current run, and the number - of asterisks after the word to determine which combinations, if any, are - permitted. - -4. 
Once all asterisks have been tagged as able to open/close/both, we proceed -through them in order: - - 1. When we encounter a tag that can close, we look to see if there is a - previous opener that has not been paired off. If so, pair the two and - remove the opener from the list of available asterisks. - - 2. When we encounter an opener, add it to the stack of available openers. - - 3. When encounter an asterisk that can do both, see if it can close an - existing opener. If not, then add it to the stack. - -5. After all tokens in the block have been paired, then we look for nesting -pairs of asterisks in order to create `` and `` sets. For -example, assume we have six asterisks wrapped around a word, three in front, -and three after. The asterisks are indicated with numbers: `123foo456`. We -proceed in the following manner: - - 1. Based on the pairing algorithm above, these asterisks would be paired as - follows, with matching asterisks sharing numbers -- `123foo321`. - - 2. Moving forwards, we come to asterisk "1". It is followed by an - asterisk, so we check to see if they should be grouped as a ``. - Since the "1" asterisks are wrapped immediately outside the "2" asterisks, - they are joined together. More than two pairs can't be joined, so we now - get the following -- `112foo211`, where the "11" represents the opening - and closing of a ``, and the "2" represents a ``. - -6. When matching a pair, any unclosed openers that are on the stack are -removed, preventing pairs from "crossing" or "intersecting". Pairs can wrap -around each other, e.g. `[(foo)]`, but not intersect like `[(foo])`. In the -second case, the brackets would close, removing the `(` from the stack. - -7. This same approach is used in all tokens that are matched in pairs-- -`[foo]`, `(foo)`, `_foo_`, etc. There's slightly more to it, but once you -figure out how to assign opening/closing ability, the rest is easy. By using -a stack to track available openers, it can be performed efficiently. 
- -In my testing, this approach has worked quite well. It handles all the basic -scenarios I've thrown at it, and all of the "basic" and "devious" edge cases I -have thought of (some of these don't necessarily have a "right" answer -- but -v6 gives consistency answers that seem as reasonable as any others to me). -There are also three more edge cases I've come up can still stump it, and -ironically they are handled correctly by most implementations. They just -don't follow the rules above. I'll continue to work on this. - -In the end, I scrapped this effort, but kept the lessons learned in the token -pairing algorithm. - - -## Second Attempt ## - -I tried again this past Fall. This time, I approached the problem with lots -of reading. *Lots and lots* of reading -- tons of websites, computer science -journal articles, PhD theses, etc. Learned a lot about lexers, and a lot -about parsers, including hand-crafting vs using parser generators. In brief: - -1. I learned about the [Aho–Corasick algorithm], which is a great way to -efficiently search a string for multiple target strings at once. I used this -to create a custom lexer to identify tokens in a MultiMarkdown text document -(e.g. `*`, `[ `, `{++`, etc.). I learned a lot, and had a good time working -out the implementation. This code efficiently allowed me to break a string of -text into the tokens that mattered for Markdown parsing. - -2. However, in a few instances I really needed some features of regular -expressions to simplify more complex structures. After a quick bit of testing, -using re2c to create a tokenizer was just as efficient, and allowed me to -incorporate some regex functionality that simplified later parsing. I'll keep -the Aho-Corasick stuff around, and will probably experiment more with it -later. But I didn't need it for MMD now. `lexer.re` contains the source for -the tokenizer. 
-
-[Aho–Corasick algorithm]: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
-
-I looked long and hard for a way to simplify the parsing algorithm to try and
-"touch" each token only once. Ideally, the program could step through each
-token, and decide when to create a new block, when to pair things together,
-etc. But I'm not convinced it's possible. Since Markdown's grammar varies
-based on context, it seems to work best when handled in distinct phases:
-
-1. Tokenize the string to identify key sections of text. This includes line
-breaks, allowing the text to be examined one line at a time.
-
-2. Join series of lines together into blocks, such as paragraphs, code blocks,
-lists, etc.
-
-3. The tokens inside each block can then be paired together to create more
-complex syntax such as links, strong, emphasis, etc.
-
-To handle the block parsing, I started off using the Aho-Corasick code from
-my first attempt. I had actually implemented some basic regex functionality,
-and used that to group lines together to create blocks. But this quickly fell
-apart in the face of more complex structures such as recursive lists. After a
-lot of searching, and *tons* more reading, I ultimately decided to use a
-parser generator to handle the task of grouping lines into blocks. `parser.y`
-has the source for this, and it is processed by the
-[lemon](http://www.hwaci.com/sw/lemon/) parser generator to create the actual
-code.
-
-I chose to do this because hand-crafting the block parser would be complex.
-The end result would likely be difficult to read and understand, which would
-make it difficult to update later on. Using the parser generator allows me to
-write things out in a way that can more easily be understood by a person. In
-all likelihood, the performance is probably as good as anything I could do
-anyway, if not better.
-
-Because lemon is a LALR(1) parser, it does require a bit of thinking ahead
-about how to create the grammar used.
But so far, it has been able to handle
-everything I have thrown at it.
-
-
-## Optimization ##
-
-One of my goals for MMD 6 was performance. So I've paid attention to speed
-along the way, and have tried to use a few tricks to keep things fast. Here
-are some things I've learned along the way. In no particular order:
-
-
-### Memory Allocation ###
-
-When parsing a long document, a *lot* of token structures are created. Each
-one requires a small bit of memory to be allocated. In aggregate, that time
-added up and slowed down performance.
-
-After reading for a bit, I ended up coming up with an approach that uses
-larger chunks of memory. I allocate pools of memory in large slabs for
-smaller "objects". For example, I allocate memory for 1024 tokens at a
-time, and then dole that memory out as needed. When the slab is used up,
-a new one is allocated. This dramatically improved performance.
-
-When pairing tokens, I created a new stack for each block. I realized that an
-empty stack didn't have any "leftover" cruft to interfere with re-use, so I
-just used one for the entire document. Again, a sizeable improvement in
-performance from only allocating one object instead of many. When recursing
-to a deeper level, the stack just gets deeper, but earlier levels aren't
-modified.
-
-Speaking of tokens, I realized that the average document contains a lot of
-single spaces (there's one between every two words I have written, for
-example). The vast majority of the time, these single spaces have no effect
-on the output of Markdown documents. I changed my whitespace token search to
-only flag runs of 2 or more spaces, dramatically reducing the number of
-tokens. This gives the benefit of needing fewer memory allocations, and also
-reduces the number of tokens that need to be processed later on. The only
-downside is remembering to check for a single space character in a few
-instances where it matters.
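The slab idea can be sketched as follows. This is a simplified, self-contained illustration -- the slab size of 1024 comes from the description above, but the names and structure are invented for the example and are not MMD's actual pool code:

```c
#include <assert.h>
#include <stdlib.h>

#define SLAB_SIZE 1024

typedef struct { int type, start, len; } token;

typedef struct slab {
	token tokens[SLAB_SIZE];
	int used;            /* tokens handed out from this slab so far */
	struct slab *prev;   /* earlier, filled slabs (kept for freeing) */
} slab;

/* Hand out one token from the current slab, touching malloc only
 * once every SLAB_SIZE tokens instead of once per token. */
token *token_new(slab **pool) {
	if (*pool == NULL || (*pool)->used == SLAB_SIZE) {
		slab *s = malloc(sizeof(slab));
		s->used = 0;
		s->prev = *pool;
		*pool = s;
	}
	return &(*pool)->tokens[(*pool)->used++];
}

/* One free() per slab instead of one per token. */
void pool_free(slab *pool) {
	while (pool) {
		slab *prev = pool->prev;
		free(pool);
		pool = prev;
	}
}
```

Freeing is also cheap with this layout: the whole document's tokens disappear in a handful of `free()` calls.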
-
-
-### Proper Input Buffering ###
-
-When I first began last spring, I was amazed to see how much time was being
-spent by MultiMarkdown simply reading the input file. Then I discovered it
-was because I was reading it one character at a time. I switched to using a
-buffered read approach and the time to read the file went to almost nothing. I
-experimented with different buffer sizes, but they did not seem to make a
-measurable difference.
-
-
-### Output Buffering ###
-
-I experimented with different approaches to creating the output after parsing.
-I tried printing directly to `stdout`, and even played with different
-buffering settings. None of those seemed to work well, and all were slower
-than using the `d_string` approach (formerly called `GString` in MMD 5).
-
-
-### Fast Searches ###
-
-After getting basic Markdown functionality complete, I discovered during
-testing that the time required to parse a document grew quadratically as the
-document grew longer. Performance was on par with CommonMark for shorter
-documents, but fell increasingly behind in larger tests. Time profiling found
-that the culprit was searching for link definitions when they didn't exist.
-My first approach was to keep a stack of used link definitions, and to iterate
-through them when necessary. In long documents, this performs very poorly.
-More research led me to
-[uthash](http://troydhanson.github.io/uthash/). This allows me to search for
-a link (or footnote, etc.) by "name" rather than searching through an array.
-This allowed me to get MMD's performance back to O(n), taking roughly twice as
-much time to process a document that is twice as long.
-
-
-### Efficient Utility Functions ###
-
-It is frequently necessary when parsing Markdown to check what sort of
-character we are dealing with at a certain position -- a letter, whitespace,
-punctuation, etc. I created a lookup table for this via `char_lookup.c` and
-hard-coded it in `char.c`.
These routines allow me to quickly, and
-consistently, classify any byte within a document. This saved a lot of
-programming time, and saved time tracking down bugs from handling things
-slightly differently under different circumstances. I also suspect it
-improved performance, but don't have the data to back it up.
-
-
-### Testing While Writing ###
-
-I developed several chunks of code in parallel while creating MMD 6. The vast
-majority of it was developed in a [test-driven development] approach. The
-remaining code was still covered by extensive unit tests.
-
-[test-driven development]: https://en.wikipedia.org/wiki/Test-driven_development
-
-MMD isn't particularly amenable to this approach at the small level, but
-instead I relied more on integration testing with an ever-growing collection
-of text files and the corresponding HTML files in the MMD 6 test suite. This
-allowed me to ensure new features work properly and that old features aren't
-broken. At this time, there are 29 text files in the test suite, and many
-more to come.
-
-
-### Other Lessons ###
-
-Some things that didn't do me any good....
-
-I considered differences between using `malloc` and `calloc` when initializing
-tokens. The time saved by using `malloc` was almost exactly offset by the
-initial time required to initialize the token to default null values as
-compared to using `calloc`. When `calloc` failed to help me out (I had hoped
-that clearing a single slab in the object pool would be faster), I stuck with
-`malloc`, as it makes more sense to me in my workflow.
-
-I read a bit about [struct padding] and reordered some of my structs. It
-wasn't until later that I discovered the `-Wpadded` option, and it's not clear
-whether my changes accomplished anything. Since the structs were being padded
-automatically, there was no noticeable performance change, and I didn't have
-the tools to measure whether I could have improved memory usage at all.
Not
-sure this would be worth the effort -- much lower hanging fruit available.
-
-[struct padding]: http://www.catb.org/esr/structure-packing/
-
 ## Differences in MultiMarkdown Itself ##
@@ -627,47 +318,6 @@ There are a few at
 [syntax page]: https://daringfireball.net/projects/markdown/syntax
 
-### Performance ###
-
-Basic tests show that currently MMD 6 takes about 20-25% longer than
-CommonMark 0.27.0 to process long files (e.g. 0.2 MB). However, it is around
-5% *faster* than CommonMark when parsing a shorter file (27 kB) (measured by
-parsing the same file 200 times over). These tests use the Markdown
-[syntax page], modified to avoid the use of the Setext header at the top. The
-longer files tested are created by copying the same syntax page onto itself,
-thereby doubling the length of the file with each iteration.
-
-The largest file I test is approximately 108 MB (4096 copies of the syntax
-page). On my machine (2012 Mac mini with 2.3 GHz Intel Core i7, 16 GB RAM),
-it takes approximately 4.4 seconds to parse with MMD 6 and 3.7 seconds with
-CommonMark. MMD 6 processes approximately 25 MB/s on this test file.
-CommonMark 0.27.0 gets about 29 MB/s on the same machine.
-
-There are some slight variations with the smaller test files (8-32 copies),
-but overall the performance of both programs (MMD 6 and CommonMark) is
-roughly linear as the test file gets bigger (double the file size and it takes
-twice as long to parse, aka O(n)).
-
-Out of curiosity, I ran the same tests on the original Markdown.pl by Gruber
-(v 1.0.2b8). It took approximately 178 seconds to parse 128 copies of the
-file (3.4 MB) and was demonstrating quadratic performance characteristics
-(double the file size and it takes 2^2 or 4 times longer to process, aka
-O(n^2)). I didn't bother running it on larger versions of the test file. For
-comparison, MMD 6 can process 128 copies in approximately 140 msec.
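As a quick sanity check, the throughput figures above follow directly from the quoted sizes and times (108 MB / 4.4 s is roughly 24.5 MB/s; 108 MB / 3.7 s is roughly 29.2 MB/s):

```c
/* Throughput implied by the benchmark numbers quoted above
 * (size in MB, elapsed time in seconds). */
static double throughput_mb_s(double size_mb, double seconds) {
	return size_mb / seconds;
}
```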
- -Of note, the throughput speed drops when testing more complicated files -containing more advanced MultiMarkdown features, though it still seems to -maintain linear performance characteristics. A second test file is created by -concatenating all of the test suite files (including the Markdown syntax -file). In this case, MMD gets about 13 MB/s. CommonMark doesn't support -these additional features, so testing it with that file is not relevant. I -will work to see whether there are certain features in particular that are -more challenging and see whether they can be reworked to improve performance. - -As above, I have done some high level optimization of the parse strategy, but -I'm sure there's still a lot of room for further improvement to be made. -Suggestions welcome! - ## License ## diff --git a/templates/README.md.in b/templates/README.md.in index e505612..5e388de 100644 --- a/templates/README.md.in +++ b/templates/README.md.in @@ -201,315 +201,6 @@ my experiences for those who are working on their own products. But first, some background... -### Why a New Version? ### - -MultiMarkdown version 5 was released in November of 2015, but the codebase was -essentially the same as that of v4 -- and that was released in beta in April -of 2013. A few key things prompted work on a new version: - -* Accuracy -- MMD v4 and v5 were the most accurate versions yet, and a lot of -effort went into finding and resolving various edge cases. However, it began -to feel like a game of whack-a-mole where new bugs would creep in every time I -fixed an old one. The PEG began to feel rather convoluted in spots, even -though it did allow for a precise (if not always accurate) specification of -the grammar. - -* Performance -- "Back in the day" [peg-markdown] was one of the fastest -Markdown parsers around. MMD v3 was based on peg-markdown, and would leap- -frog with it in terms of performance. Then [CommonMark] was released, which -was a bit faster. 
Then a couple of years went by and CommonMark became *much*
-faster -- in one of my test suites, MMD v5.4.0 takes about 25 times longer to
-process a long document than CommonMark 0.27.0.
-
-[peg-markdown]: https://github.com/jgm/peg-markdown
-[CommonMark]: http://commonmark.org/
-
-Last spring, I decided I wanted to rewrite MultiMarkdown from scratch,
-building the parser myself rather than relying on a pre-rolled solution. (I
-had been using [greg](https://github.com/ooc-lang/greg) to compile the PEG
-into parser code. It worked well overall, but lacked some features I needed,
-requiring a lot of workarounds.)
-
-
-## First Attempt ##
-
-My first attempt started by hand-crafting a parser that scanned through the
-document a line at a time, deciding what to do with each line as it was
-encountered. I used regex parsers made with [re2c](http://re2c.org/index.html)
-to help classify each line, and then a separate parser layer to process groups
-of lines into blocks. Initially this approach worked well, and was really
-efficient. But I quickly began to code my way into a dead-end -- the strategy
-was not elegant enough to handle things like nested lists, etc.
-
-One thing that did turn out well from the first attempt, however, was an
-approach for handling `<strong>` and `<em>` parsing. I've learned over the
-years that this can be one of the hardest parts of coding accurately for
-Markdown. There are many examples that are obvious to a person, but difficult
-to properly "explain" how to parse to a computer.
-
-No solution is perfect, but I developed an approach that seems to accurately
-handle a wide range of situations without a great deal of complexity:
-
-1. Scan the document for asterisks (`*`). Each one will be handled one at a
-time.
-
-2. Unlike brackets (`[` and `]`), an asterisk is "ambidextrous", in that it
-may be able to open a matched pair of asterisks, close a pair, or both. For
-example, in `foo *bar* foo`:
-
-	1.
The first asterisk can open a pair, but not close one.
-
-	2. The second asterisk can close a pair, but not open one.
-
-3. So, once the asterisks have been identified, each has to be examined to
-determine whether it can open/close/both. The algorithm is not that complex,
-but I'll describe it in general terms. Check the code for more specifics.
-This approach seems to work, but might still need some slight tweaking. In
-the future, I'll codify this better in language rather than just in code.
-
-	1. If there is whitespace to the left of an asterisk, it can't close.
-
-	2. If there is whitespace or punctuation to the right, it can't open.
-
-	3. "Runs" of asterisks, e.g. `**bar`, are treated as a unit in terms of
-	looking left/right.
-
-	4. Asterisks inside a word are a bit trickier -- we look at the number of
-	asterisks before the word, the number in the current run, and the number
-	of asterisks after the word to determine which combinations, if any, are
-	permitted.
-
-4. Once all asterisks have been tagged as able to open/close/both, we proceed
-through them in order:
-
-	1. When we encounter a tag that can close, we look to see if there is a
-	previous opener that has not been paired off. If so, pair the two and
-	remove the opener from the list of available asterisks.
-
-	2. When we encounter an opener, add it to the stack of available openers.
-
-	3. When we encounter an asterisk that can do both, see if it can close an
-	existing opener. If not, then add it to the stack.
-
-5. After all tokens in the block have been paired, then we look for nesting
-pairs of asterisks in order to create `<strong>` and `<em>` sets. For
-example, assume we have six asterisks wrapped around a word, three in front,
-and three after. The asterisks are indicated with numbers: `123foo456`. We
-proceed in the following manner:
-
-	1. Based on the pairing algorithm above, these asterisks would be paired as
-	follows, with matching asterisks sharing numbers -- `123foo321`.
-
-	2.
Moving forwards, we come to asterisk "1". It is followed by an
-	asterisk, so we check to see if they should be grouped as a `<strong>`.
-	Since the "1" asterisks are wrapped immediately outside the "2" asterisks,
-	they are joined together. More than two pairs can't be joined, so we now
-	get the following -- `112foo211`, where the "11" represents the opening
-	and closing of a `<strong>`, and the "2" represents an `<em>`.
-
-6. When matching a pair, any unclosed openers that are on the stack are
-removed, preventing pairs from "crossing" or "intersecting". Pairs can wrap
-around each other, e.g. `[(foo)]`, but not intersect like `[(foo])`. In the
-second case, the brackets would close, removing the `(` from the stack.
-
-7. This same approach is used for all tokens that are matched in pairs --
-`[foo]`, `(foo)`, `_foo_`, etc. There's slightly more to it, but once you
-figure out how to assign opening/closing ability, the rest is easy. By using
-a stack to track available openers, it can be performed efficiently.
-
-In my testing, this approach has worked quite well. It handles all the basic
-scenarios I've thrown at it, and all of the "basic" and "devious" edge cases I
-have thought of (some of these don't necessarily have a "right" answer -- but
-v6 gives consistent answers that seem as reasonable as any others to me).
-There are also three more edge cases I've come up with that can still stump
-it, and ironically they are handled correctly by most implementations. They
-just don't follow the rules above. I'll continue to work on this.
-
-In the end, I scrapped this effort, but kept the lessons learned in the token
-pairing algorithm.
-
-
-## Second Attempt ##
-
-I tried again this past Fall. This time, I approached the problem with lots
-of reading. *Lots and lots* of reading -- tons of websites, computer science
-journal articles, PhD theses, etc. I learned a lot about lexers, and a lot
-about parsers, including hand-crafting vs using parser generators. In brief:
-
-1.
I learned about the [Aho–Corasick algorithm], which is a great way to
-efficiently search a string for multiple target strings at once. I used this
-to create a custom lexer to identify tokens in a MultiMarkdown text document
-(e.g. `*`, `[ `, `{++`, etc.). I learned a lot, and had a good time working
-out the implementation. This code allowed me to efficiently break a string of
-text into the tokens that mattered for Markdown parsing.
-
-2. However, in a few instances I really needed some features of regular
-expressions to simplify more complex structures. After a quick bit of testing,
-using re2c to create a tokenizer was just as efficient, and allowed me to
-incorporate some regex functionality that simplified later parsing. I'll keep
-the Aho-Corasick stuff around, and will probably experiment more with it
-later. But I didn't need it for MMD now. `lexer.re` contains the source for
-the tokenizer.
-
-[Aho–Corasick algorithm]: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
-
-I looked long and hard for a way to simplify the parsing algorithm to try and
-"touch" each token only once. Ideally, the program could step through each
-token, and decide when to create a new block, when to pair things together,
-etc. But I'm not convinced it's possible. Since Markdown's grammar varies
-based on context, it seems to work best when handled in distinct phases:
-
-1. Tokenize the string to identify key sections of text. This includes line
-breaks, allowing the text to be examined one line at a time.
-
-2. Join series of lines together into blocks, such as paragraphs, code blocks,
-lists, etc.
-
-3. The tokens inside each block can then be paired together to create more
-complex syntax such as links, strong, emphasis, etc.
-
-To handle the block parsing, I started off using the Aho-Corasick code from
-my first attempt. I had actually implemented some basic regex functionality,
-and used that to group lines together to create blocks.
But
-this quickly fell apart in the face of more complex structures such as
-recursive lists. After a lot of searching, and *tons* more reading, I
-ultimately decided to use a parser generator to handle the task of grouping
-lines into blocks. `parser.y` has the source for this, and it is processed by
-the [lemon](http://www.hwaci.com/sw/lemon/) parser generator to create the
-actual code.
-
-I chose to do this because hand-crafting the block parser would be complex.
-The end result would likely be difficult to read and understand, which would
-make it difficult to update later on. Using the parser generator allows me to
-write things out in a way that can more easily be understood by a person. In
-all likelihood, the performance is probably as good as anything I could do
-anyway, if not better.
-
-Because lemon is a LALR(1) parser, it does require a bit of thinking ahead
-about how to create the grammar used. But so far, it has been able to handle
-everything I have thrown at it.
-
-
-## Optimization ##
-
-One of my goals for MMD 6 was performance. So I've paid attention to speed
-along the way, and have tried to use a few tricks to keep things fast. Here
-are some things I've learned along the way. In no particular order:
-
-
-### Memory Allocation ###
-
-When parsing a long document, a *lot* of token structures are created. Each
-one requires a small bit of memory to be allocated. In aggregate, that time
-added up and slowed down performance.
-
-After reading for a bit, I ended up coming up with an approach that uses
-larger chunks of memory. I allocate pools of memory in large slabs for
-smaller "objects". For example, I allocate memory for 1024 tokens at a
-time, and then dole that memory out as needed. When the slab is used up,
-a new one is allocated. This dramatically improved performance.
-
-When pairing tokens, I created a new stack for each block.
I realized that an
-empty stack didn't have any "leftover" cruft to interfere with re-use, so I
-just used one for the entire document. Again, a sizeable improvement in
-performance from only allocating one object instead of many. When recursing
-to a deeper level, the stack just gets deeper, but earlier levels aren't
-modified.
-
-Speaking of tokens, I realized that the average document contains a lot of
-single spaces (there's one between every two words I have written, for
-example). The vast majority of the time, these single spaces have no effect
-on the output of Markdown documents. I changed my whitespace token search to
-only flag runs of 2 or more spaces, dramatically reducing the number of
-tokens. This gives the benefit of needing fewer memory allocations, and also
-reduces the number of tokens that need to be processed later on. The only
-downside is remembering to check for a single space character in a few
-instances where it matters.
-
-
-### Proper Input Buffering ###
-
-When I first began last spring, I was amazed to see how much time was being
-spent by MultiMarkdown simply reading the input file. Then I discovered it
-was because I was reading it one character at a time. I switched to using a
-buffered read approach and the time to read the file went to almost nothing. I
-experimented with different buffer sizes, but they did not seem to make a
-measurable difference.
-
-
-### Output Buffering ###
-
-I experimented with different approaches to creating the output after parsing.
-I tried printing directly to `stdout`, and even played with different
-buffering settings. None of those seemed to work well, and all were slower
-than using the `d_string` approach (formerly called `GString` in MMD 5).
-
-
-### Fast Searches ###
-
-After getting basic Markdown functionality complete, I discovered during
-testing that the time required to parse a document grew quadratically as the
-document grew longer.
Performance was on par with CommonMark for shorter
-documents, but fell increasingly behind in larger tests. Time profiling found
-that the culprit was searching for link definitions when they didn't exist.
-My first approach was to keep a stack of used link definitions, and to iterate
-through them when necessary. In long documents, this performs very poorly.
-More research led me to
-[uthash](http://troydhanson.github.io/uthash/). This allows me to search for
-a link (or footnote, etc.) by "name" rather than searching through an array.
-This allowed me to get MMD's performance back to O(n), taking roughly twice as
-much time to process a document that is twice as long.
-
-
-### Efficient Utility Functions ###
-
-It is frequently necessary when parsing Markdown to check what sort of
-character we are dealing with at a certain position -- a letter, whitespace,
-punctuation, etc. I created a lookup table for this via `char_lookup.c` and
-hard-coded it in `char.c`. These routines allow me to quickly, and
-consistently, classify any byte within a document. This saved a lot of
-programming time, and saved time tracking down bugs from handling things
-slightly differently under different circumstances. I also suspect it
-improved performance, but don't have the data to back it up.
-
-
-### Testing While Writing ###
-
-I developed several chunks of code in parallel while creating MMD 6. The vast
-majority of it was developed in a [test-driven development] approach. The
-remaining code was still covered by extensive unit tests.
-
-[test-driven development]: https://en.wikipedia.org/wiki/Test-driven_development
-
-MMD isn't particularly amenable to this approach at the small level, but
-instead I relied more on integration testing with an ever-growing collection
-of text files and the corresponding HTML files in the MMD 6 test suite. This
-allowed me to ensure new features work properly and that old features aren't
-broken.
At this time, there are 29 text files in the test suite, and many
-more to come.
-
-
-### Other Lessons ###
-
-Some things that didn't do me any good....
-
-I considered differences between using `malloc` and `calloc` when initializing
-tokens. The time saved by using `malloc` was almost exactly offset by the
-initial time required to initialize the token to default null values as
-compared to using `calloc`. When `calloc` failed to help me out (I had hoped
-that clearing a single slab in the object pool would be faster), I stuck with
-`malloc`, as it makes more sense to me in my workflow.
-
-I read a bit about [struct padding] and reordered some of my structs. It
-wasn't until later that I discovered the `-Wpadded` option, and it's not clear
-whether my changes accomplished anything. Since the structs were being padded
-automatically, there was no noticeable performance change, and I didn't have
-the tools to measure whether I could have improved memory usage at all. Not
-sure this would be worth the effort -- much lower hanging fruit available.
-
-[struct padding]: http://www.catb.org/esr/structure-packing/
-
 ## Differences in MultiMarkdown Itself ##
@@ -627,47 +318,6 @@ There are a few at
 [syntax page]: https://daringfireball.net/projects/markdown/syntax
 
-### Performance ###
-
-Basic tests show that currently MMD 6 takes about 20-25% longer than
-CommonMark 0.27.0 to process long files (e.g. 0.2 MB). However, it is around
-5% *faster* than CommonMark when parsing a shorter file (27 kB) (measured by
-parsing the same file 200 times over). These tests use the Markdown
-[syntax page], modified to avoid the use of the Setext header at the top. The
-longer files tested are created by copying the same syntax page onto itself,
-thereby doubling the length of the file with each iteration.
-
-The largest file I test is approximately 108 MB (4096 copies of the syntax
-page).
On my machine (2012 Mac mini with 2.3 GHz Intel Core i7, 16 GB RAM),
-it takes approximately 4.4 seconds to parse with MMD 6 and 3.7 seconds with
-CommonMark. MMD 6 processes approximately 25 MB/s on this test file.
-CommonMark 0.27.0 gets about 29 MB/s on the same machine.
-
-There are some slight variations with the smaller test files (8-32 copies),
-but overall the performance of both programs (MMD 6 and CommonMark) is
-roughly linear as the test file gets bigger (double the file size and it takes
-twice as long to parse, aka O(n)).
-
-Out of curiosity, I ran the same tests on the original Markdown.pl by Gruber
-(v 1.0.2b8). It took approximately 178 seconds to parse 128 copies of the
-file (3.4 MB) and was demonstrating quadratic performance characteristics
-(double the file size and it takes 2^2 or 4 times longer to process, aka
-O(n^2)). I didn't bother running it on larger versions of the test file. For
-comparison, MMD 6 can process 128 copies in approximately 140 msec.
-
-Of note, the throughput speed drops when testing more complicated files
-containing more advanced MultiMarkdown features, though it still seems to
-maintain linear performance characteristics. A second test file is created by
-concatenating all of the test suite files (including the Markdown syntax
-file). In this case, MMD gets about 13 MB/s. CommonMark doesn't support
-these additional features, so testing it with that file is not relevant. I
-will work to see whether there are certain features in particular that are
-more challenging and see whether they can be reworked to improve performance.
-
-As above, I have done some high level optimization of the parse strategy, but
-I'm sure there's still a lot of room for further improvement to be made.
-Suggestions welcome!
- ## License ## diff --git a/tests/MMD6Tests/What Is MMD.text b/tests/Disabled/What Is MMD.text similarity index 100% rename from tests/MMD6Tests/What Is MMD.text rename to tests/Disabled/What Is MMD.text diff --git a/tests/MMD6Tests/Fuzz.fodt b/tests/MMD6Tests/Fuzz.fodt new file mode 100644 index 0000000..b0be3a5 --- /dev/null +++ b/tests/MMD6Tests/Fuzz.fodt @@ -0,0 +1,322 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Bibliography + + + + Fuzz Testing + + + +Collection of test cases identified by American fuzzy lop. + +>bar~~} + + + + +list + + +tems + + + +:Escapes + +[>MM]: MultiMarkdown + +foo1 + +foo2 + +]: And aÄe footn.IThi1. $ + +[MMD]: + +f F O O (o o) f o o + + + + +foo + + + foo + + + +(This must be at end of file without trailing newline) + +A + + + diff --git a/tests/MMD6Tests/Fuzz.html b/tests/MMD6Tests/Fuzz.html new file mode 100644 index 0000000..b625f30 --- /dev/null +++ b/tests/MMD6Tests/Fuzz.html @@ -0,0 +1,43 @@ + + + + + Fuzz Testing + + + +

Collection of test cases identified by American fuzzy lop.

+ +

>bar~~}

+ +
    +
  • list
  • +
  • tems
  • +
+ +

:Escapes

+ +

[>MM]: MultiMarkdown

+ +

foo1

+ +

foo2

+ +

]: And aÄe footn.IThi1. $

+ +

[MMD]:

+ +

f F O O (o o) f o o

+ +
    +
  • foo
  • +
  • Â foo
  • +
+ +

(This must be at end of file without trailing newline)

+ +

A

+ + + + diff --git a/tests/MMD6Tests/Fuzz.htmlc b/tests/MMD6Tests/Fuzz.htmlc new file mode 100644 index 0000000..88d697b --- /dev/null +++ b/tests/MMD6Tests/Fuzz.htmlc @@ -0,0 +1,38 @@ +

Title: Fuzz Testing +latex config: article

+ +

Collection of test cases identified by American fuzzy lop.

+ +

û~~foo~>bar~~}

+ +
    +
  • list
  • +
  • tems
  • +
+ +

:Escapes [escaped]

+ +

[>MM]: MultiMarkdown

+ +

[?terí¢ıı[?term]: A term to be defined.

+ +

foo1

+ +

foo2

+ +

]: And aÄe footn.I^Thi1. $

+ +

[MMD]:

+ +

f o o f o o

+ +

[> o o]: F O O

+ +
    +
  • foo
  • +
  • Â foo
  • +
+ +

(This must be at end of file without trailing newline)

+ +

A

diff --git a/tests/MMD6Tests/Fuzz.tex b/tests/MMD6Tests/Fuzz.tex new file mode 100644 index 0000000..e0ba52d --- /dev/null +++ b/tests/MMD6Tests/Fuzz.tex @@ -0,0 +1,46 @@ +\input{mmd6-article-leader} +\def\mytitle{Fuzz Testing} +\newacronym{o o}{o o}{F O O} + +\input{mmd6-article-begin} + +Collection of test cases identified by \href{http://lcamtuf.coredump.cx/afl/}{American fuzzy lop}\footnote{\href{http://lcamtuf.coredump.cx/afl/}{http:\slash \slash lcamtuf.coredump.cx\slash afl\slash }}. + +>bar~~} + +\begin{itemize} +\item{} list + +\item{} tems + +\end{itemize} + +\chapter{:Escapes } +\label{escaped} + +[>MM]: MultiMarkdown + +foo1 (\autoref{ba\}) + +foo2 (\autoref{bar}) + +]: And aÄe footn.I\textsuperscript{Thi1}. \$ + +[MMD]: + +f \gls{o o} f \gls{o o} + +\begin{itemize} +\item{} foo + +\item{}  foo + +\end{itemize} + +(This must be at end of file without trailing newline) + +\part{A} +\label{a} + +\input{mmd6-article-footer} +\end{document} diff --git a/tests/MMD6Tests/Fuzz.text b/tests/MMD6Tests/Fuzz.text new file mode 100644 index 0000000..67e0fc3 --- /dev/null +++ b/tests/MMD6Tests/Fuzz.text @@ -0,0 +1,40 @@ +Title: Fuzz Testing +latex config: article + +Collection of test cases identified by [American fuzzy lop](http://lcamtuf.coredump.cx/afl/). + +û~~foo~>bar~~} + +* list +* tems + +:Escapes [escaped] +---------------- + +[>MM\]: MultiMarkdown + +[?terí¢ıı[?term]: A term to be defined. + +[foo1] + +[foo2] + +[foo1]: #ba\ +[foo2]: #bar + +]: And aÄe footn.I^Thi1. \$ + +[MMD]: + +f o o f o o + +[> o o]: F O O + +* foo +*   foo + + + +(This must be at end of file without trailing newline) + +# A \ No newline at end of file -- 2.40.0