From: Ulya Trofimovich
Date: Wed, 4 Nov 2015 14:15:04 +0000 (+0000)
Subject: Some explanations of the C++98 lexer example.
X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=58cc00bb24597a3ce0c474bbc7f9ac9ee419085a;p=re2c

Some explanations of the C++98 lexer example.
---

diff --git a/src/examples.rst b/src/examples.rst
index e0ce9162..3b04dc9c 100644
--- a/src/examples.rst
+++ b/src/examples.rst
@@ -470,12 +470,38 @@ Generate, compile and run:
 C++98 lexer
 -----------
 
+This is an example of a big real-world re2c program: a C++98 lexer.
+It conforms to the C++98 standard (except for a couple of hacks to simulate the preprocessor).
+All nontrivial lexemes (integers, floating-point constants, strings and character literals)
+are parsed (not only recognized): numeric literals are converted to numbers, strings are unescaped.
+Some additional checks described in the standard (e.g. overflows in integer literals) are also done.
+In fact, C++ is an easy language to lex: unlike many other languages, the lexer can proceed without feedback from the parser.
+
 `[07_c++98.re] `_
 
 .. include:: examples/07_c++98.re
     :code: cpp
     :number-lines:
 
+Notes:
+
+* The main lexer is used to lex all trivial lexemes (macros, whitespace, boolean literals, keywords, operators and punctuators, identifiers),
+    recognize numeric literals (which are further parsed by a bunch of auxiliary lexers),
+    and recognize the start of string and character literals (which are further recognized and parsed by an auxiliary lexer).
+    Numeric literals are thus lexed twice: this approach may be deemed inefficient,
+    but it takes much more effort to validate and parse them at once.
+    Besides, a real-world lexer would rather recognize ill-formed lexemes (e.g. overflowed integer literals),
+    report them and resume lexing.
+
+* The main lexer and the string lexer both use ``re2c:yyfill:enable = 1;``; the other lexers use ``re2c:yyfill:enable = 0;``.
+    This is very important: both the main lexer and the string lexer advance the input position to new (yet unseen) input characters,
+    so they must check for the end of input and call ``YYFILL``. In contrast, the other lexers only parse lexemes that
+    have already been recognized by the main lexer: these lexemes are guaranteed to be within buffer bounds
+    (they are guarded by ``in.tok`` on the left and ``in.lim`` on the right).
+
+* The hardest part is (unsurprisingly) floating-point literals.
+    They are just as hard to lex as to use. ``:)``
+
 Generate, compile and run:
 
 .. code-block:: bash