From: Ezio Melotti Date: Fri, 17 Feb 2012 23:46:04 +0000 (+0200) Subject: #14020: improve HTMLParser documentation. X-Git-Tag: v2.7.3rc1~45 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=c39b552603ef6a5f5c0e5a6d239a2f9021faffcc;p=python #14020: improve HTMLParser documentation. --- diff --git a/Doc/library/htmlparser.rst b/Doc/library/htmlparser.rst index adbcb38d9c..3aba74ef3f 100644 --- a/Doc/library/htmlparser.rst +++ b/Doc/library/htmlparser.rst @@ -22,7 +22,7 @@ -------------- -This module defines a class :class:`HTMLParser` which serves as the basis for +This module defines a class :class:`.HTMLParser` which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser in :mod:`sgmllib`. @@ -30,11 +30,12 @@ in :mod:`sgmllib`. .. class:: HTMLParser() - The :class:`HTMLParser` class is instantiated without arguments. + An :class:`.HTMLParser` instance is fed HTML data and calls handler methods + when start tags, end tags, text, comments, and other markup elements are + encountered. The user should subclass :class:`.HTMLParser` and override its + methods to implement the desired behavior. - An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags - begin and end. The :class:`HTMLParser` class is meant to be overridden by the - user to provide a desired behavior. + The :class:`.HTMLParser` class is instantiated without arguments. Unlike the parser in :mod:`htmllib`, this parser does not check that end tags match start tags or call the end-tag handler for elements which are closed @@ -42,22 +43,59 @@ in :mod:`sgmllib`. An exception is defined as well: - .. exception:: HTMLParseError - Exception raised by the :class:`HTMLParser` class when it encounters an error - while parsing. 
This exception provides three attributes: :attr:`msg` is a brief - message explaining the error, :attr:`lineno` is the number of the line on which - the broken construct was detected, and :attr:`offset` is the number of + :class:`.HTMLParser` is able to handle broken markup, but in some cases it + might raise this exception when it encounters an error while parsing. + This exception provides three attributes: :attr:`msg` is a brief + message explaining the error, :attr:`lineno` is the number of the line on + which the broken construct was detected, and :attr:`offset` is the number of characters into the line at which the construct starts. -:class:`HTMLParser` instances have the following methods: +Example HTML Parser Application +------------------------------- -.. method:: HTMLParser.reset() +As a basic example, below is a simple HTML parser that uses the +:class:`.HTMLParser` class to print out start tags, end tags and data +as they are encountered:: + + from HTMLParser import HTMLParser + + # create a subclass and override the handler methods + class MyHTMLParser(HTMLParser): + def handle_starttag(self, tag, attrs): + print "Encountered a start tag:", tag + def handle_endtag(self, tag): + print "Encountered an end tag :", tag + def handle_data(self, data): + print "Encountered some data :", data + + # instantiate the parser and fed it some HTML + parser = MyHTMLParser() + parser.feed('<html><head><title>Test</title></head>' + '<body><h1>Parse me!</h1></body></html>
') + +The output will then be:: + + Encountered a start tag: html + Encountered a start tag: head + Encountered a start tag: title + Encountered some data : Test + Encountered an end tag : title + Encountered an end tag : head + Encountered a start tag: body + Encountered a start tag: h1 + Encountered some data : Parse me! + Encountered an end tag : h1 + Encountered an end tag : body + Encountered an end tag : html - Reset the instance. Loses all unprocessed data. This is called implicitly at - instantiation time. + +:class:`.HTMLParser` Methods +---------------------------- + +:class:`.HTMLParser` instances have the following methods: .. method:: HTMLParser.feed(data) @@ -73,7 +111,13 @@ An exception is defined as well: Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call - the :class:`HTMLParser` base class method :meth:`close`. + the :class:`.HTMLParser` base class method :meth:`close`. + + +.. method:: HTMLParser.reset() + + Reset the instance. Loses all unprocessed data. This is called implicitly at + instantiation time. .. method:: HTMLParser.getpos() @@ -89,22 +133,34 @@ An exception is defined as well: attributes can be preserved, etc.). +The following methods are called when data or markup elements are encountered +and they are meant to be overridden in a subclass. The base class +implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`): + + .. method:: HTMLParser.handle_starttag(tag, attrs) - This method is called to handle the start of a tag. It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called to handle the start of a tag (e.g. ``
<div id="main">``). The *tag* argument is the name of the tag converted to lower case. The *attrs* argument is a list of ``(name, value)`` pairs containing the attributes found inside the tag's ``<>`` brackets. The *name* will be translated to lower case, and quotes in the *value* have been removed, and character and entity references - have been replaced. For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method would be called as - ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. + have been replaced. + + For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method + would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. .. versionchanged:: 2.6 - All entity references from :mod:`htmlentitydefs` are now replaced in the attribute - values. + All entity references from :mod:`htmlentitydefs` are now replaced in the + attribute values. + + +.. method:: HTMLParser.handle_endtag(tag) + + This method is called to handle the end tag of an element (e.g. ``</div>
``). + + The *tag* argument is the name of the tag converted to lower case. .. method:: HTMLParser.handle_startendtag(tag, attrs) @@ -115,94 +171,175 @@ An exception is defined as well: implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`. -.. method:: HTMLParser.handle_endtag(tag) - - This method is called to handle the end tag of an element. It is intended to be - overridden by a derived class; the base class implementation does nothing. The - *tag* argument is the name of the tag converted to lower case. - - .. method:: HTMLParser.handle_data(data) - This method is called to process arbitrary data (e.g. the content of - ``<script>...</script>`` and ``<style>...</style>``). It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called to process arbitrary data (e.g. text nodes and the + content of ``<script>...</script>`` and ``<style>...</style>``). -.. method:: HTMLParser.handle_charref(name) +.. method:: HTMLParser.handle_entityref(name) - This method is called to process a character reference of the form ``&#ref;``. - It is intended to be overridden by a derived class; the base class - implementation does nothing. + This method is called to process a named character reference of the form + ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference + (e.g. ``'gt'``). -.. method:: HTMLParser.handle_entityref(name) +.. method:: HTMLParser.handle_charref(name) - This method is called to process a general entity reference of the form - ``&name;`` where *name* is an general entity reference. It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called to process decimal and hexadecimal numeric character + references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal + equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``; + in this case the method will receive ``'62'`` or ``'x3E'``. .. 
method:: HTMLParser.handle_comment(data) - This method is called when a comment is encountered. The *comment* argument is - a string containing the text between the ``--`` and ``--`` delimiters, but not - the delimiters themselves. For example, the comment ``<!--text-->`` will cause - this method to be called with the argument ``'text'``. It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called when a comment is encountered (e.g. ``<!--comment-->``). + For example, the comment ``<!-- comment -->`` will cause this method to be + called with the argument ``' comment '``. -.. method:: HTMLParser.handle_decl(decl) + The content of Internet Explorer conditional comments (condcoms) will also be + sent to this method, so, for ``<!--[if IE 9]>IE-specific content<![endif]-->``, + this method will receive ``'[if IE 9]>IE-specific content<![endif]'``. - Method called when an SGML ``doctype`` declaration is read by the parser. - The *decl* parameter will be the entire contents of the declaration inside - the ``<!...>`` markup. It is intended to be overridden by a derived class; - the base class implementation does nothing. +.. method:: HTMLParser.handle_decl(decl) -.. method:: HTMLParser.unknown_decl(data) + This method is called to handle an HTML doctype declaration (e.g. + ``<!DOCTYPE html>``). - Method called when an unrecognized SGML declaration is read by the parser. - The *data* parameter will be the entire contents of the declaration inside - the ``<!...>`` markup. It is sometimes useful to be overridden by a - derived class; the base class implementation throws an :exc:`HTMLParseError`. + The *decl* parameter will be the entire contents of the declaration inside + the ``<!...>`` markup (e.g. ``'DOCTYPE html'``). .. method:: HTMLParser.handle_pi(data) - Method called when a processing instruction is encountered. The *data* - parameter will contain the entire processing instruction. For example, for the + This method is called when a processing instruction is encountered. The *data* + parameter will contain the entire processing instruction. For example, for the processing instruction ``<?proc color='red'>``, this method would be called as - ``handle_pi("proc color='red'")``. 
It is intended to be overridden by a derived - class; the base class implementation does nothing. + ``handle_pi("proc color='red'")``. .. note:: - The :class:`HTMLParser` class uses the SGML syntactic rules for processing + The :class:`.HTMLParser` class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing ``'?'`` will cause the ``'?'`` to be included in *data*. -.. _htmlparser-example: +.. method:: HTMLParser.unknown_decl(data) -Example HTML Parser Application -------------------------------- + This method is called when an unrecognized declaration is read by the parser. + + The *data* parameter will be the entire contents of the declaration inside + the ``<!...>`` markup. It is sometimes useful to be overridden by a + derived class. -As a basic example, below is a simple HTML parser that uses the -:class:`HTMLParser` class to print out start tags, end tags and data -as they are encountered:: + +.. _htmlparser-examples: + +Examples +-------- + +The following class implements a parser that will be used to illustrate more +examples:: from HTMLParser import HTMLParser + from htmlentitydefs import name2codepoint class MyHTMLParser(HTMLParser): def handle_starttag(self, tag, attrs): - print "Encountered a start tag:", tag + print "Start tag:", tag + for attr in attrs: + print " attr:", attr def handle_endtag(self, tag): - print "Encountered an end tag:", tag + print "End tag :", tag def handle_data(self, data): - print "Encountered some data:", data - + print "Data :", data + def handle_comment(self, data): + print "Comment :", data + def handle_entityref(self, name): + c = unichr(name2codepoint[name]) + print "Named ent:", c + def handle_charref(self, name): + if name.startswith('x'): + c = unichr(int(name[1:], 16)) + else: + c = unichr(int(name)) + print "Num ent :", c + def handle_decl(self, data): + print "Decl :", data parser = MyHTMLParser() - parser.feed('<html><head><title>Test</title></head>' - '<body><h1>Parse me!</h1></body></html>
') + +Parsing a doctype:: + + >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">') + Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" + +Parsing an element with a few attributes and a title:: + + >>> parser.feed('<img src="python-logo.png" alt="The Python logo">') + Start tag: img + attr: ('src', 'python-logo.png') + attr: ('alt', 'The Python logo') + >>> + >>> parser.feed('<h1>Python</h1>
') + Start tag: h1 + Data : Python + End tag : h1 + +The content of ``script`` and ``style`` elements is returned as is, without +further parsing:: + + >>> parser.feed('<style type="text/css">#python { color: green }</style>') + Start tag: style + attr: ('type', 'text/css') + Data : #python { color: green } + End tag : style + >>> + >>> parser.feed('<script type="text/javascript">alert("<strong>hello!</strong>");</script>') + Start tag: script + attr: ('type', 'text/javascript') + Data : alert("<strong>hello!</strong>"); + End tag : script + +Parsing comments:: + + >>> parser.feed('<!-- a comment -->' + ... '<!--[if IE 9]>IE-specific content<![endif]-->') + Comment : a comment + Comment : [if IE 9]>IE-specific content<![endif] + +Parsing named and numeric character references and converting them to the +correct char (note: these 3 references are all equivalent to ``'>'``):: + + >>> parser.feed('&gt;&#62;&#x3E;') + Named ent: > + Num ent : > + Num ent : > + +Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but +:meth:`~HTMLParser.handle_data` might be called more than once:: + + >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']: + ... parser.feed(chunk) + ... + Start tag: span + Data : buff + Data : ered + Data : text + End tag : span + +Parsing invalid HTML (e.g. unquoted attributes) also works:: + + >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>
') + Start tag: p + Start tag: a + attr: ('class', 'link') + attr: ('href', '#main') + Data : tag soup + End tag : p + End tag : a
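
The patch notes that feeding incomplete chunks works but that ``handle_data`` might be called more than once for a single run of text. The following is a minimal sketch, not part of this commit, of one way a subclass could join those pieces before using them. The names ``TextCollector`` and ``parse_chunks`` are hypothetical, and the fallback import of ``html.parser`` is an assumption made so the sketch also runs on Python 3, where the class lives under that name.

```python
try:
    from HTMLParser import HTMLParser    # Python 2 module documented in this patch
except ImportError:
    from html.parser import HTMLParser   # assumed Python 3 location of the class


class TextCollector(HTMLParser):
    """Hypothetical subclass: records events and joins split text runs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []    # (kind, value) tuples in document order
        self._buffer = []   # pieces of text seen since the last tag

    def _flush(self):
        # Emit any buffered text as a single 'data' event.
        if self._buffer:
            self.events.append(('data', ''.join(self._buffer)))
            self._buffer = []

    def handle_starttag(self, tag, attrs):
        self._flush()
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self._flush()
        self.events.append(('end', tag))

    def handle_data(self, data):
        # May be called several times for one text run when input is chunked.
        self._buffer.append(data)

    def close(self):
        # As the documentation requires, call the base class close() first.
        HTMLParser.close(self)
        self._flush()


def parse_chunks(chunks):
    """Feed each chunk in turn, then close the parser and return the events."""
    parser = TextCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events


events = parse_chunks(['<sp', 'an>buff', 'ered ', 'text</s', 'pan>'])
print(events)
```

With the chunked ``span`` input from the patch's own example, the three separate ``handle_data`` calls end up as one ``('data', 'buffered text')`` event between the start and end events. Buffering until the next tag or ``close()`` is only one possible design; a caller that needs text as early as possible would keep the unbuffered behaviour instead.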