From: Ezio Melotti Date: Fri, 17 Feb 2012 23:46:04 +0000 (+0200) Subject: #14020: improve HTMLParser documentation. X-Git-Tag: v2.7.3rc1~45 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=c39b552603ef6a5f5c0e5a6d239a2f9021faffcc;p=python #14020: improve HTMLParser documentation. --- diff --git a/Doc/library/htmlparser.rst b/Doc/library/htmlparser.rst index adbcb38d9c..3aba74ef3f 100644 --- a/Doc/library/htmlparser.rst +++ b/Doc/library/htmlparser.rst @@ -22,7 +22,7 @@ -------------- -This module defines a class :class:`HTMLParser` which serves as the basis for +This module defines a class :class:`.HTMLParser` which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser in :mod:`sgmllib`. @@ -30,11 +30,12 @@ in :mod:`sgmllib`. .. class:: HTMLParser() - The :class:`HTMLParser` class is instantiated without arguments. + An :class:`.HTMLParser` instance is fed HTML data and calls handler methods + when start tags, end tags, text, comments, and other markup elements are + encountered. The user should subclass :class:`.HTMLParser` and override its + methods to implement the desired behavior. - An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags - begin and end. The :class:`HTMLParser` class is meant to be overridden by the - user to provide a desired behavior. + The :class:`.HTMLParser` class is instantiated without arguments. Unlike the parser in :mod:`htmllib`, this parser does not check that end tags match start tags or call the end-tag handler for elements which are closed @@ -42,22 +43,59 @@ in :mod:`sgmllib`. An exception is defined as well: - .. exception:: HTMLParseError - Exception raised by the :class:`HTMLParser` class when it encounters an error - while parsing. 
This exception provides three attributes: :attr:`msg` is a brief - message explaining the error, :attr:`lineno` is the number of the line on which - the broken construct was detected, and :attr:`offset` is the number of + :class:`.HTMLParser` is able to handle broken markup, but in some cases it + might raise this exception when it encounters an error while parsing. + This exception provides three attributes: :attr:`msg` is a brief + message explaining the error, :attr:`lineno` is the number of the line on + which the broken construct was detected, and :attr:`offset` is the number of characters into the line at which the construct starts. -:class:`HTMLParser` instances have the following methods: +Example HTML Parser Application +------------------------------- -.. method:: HTMLParser.reset() +As a basic example, below is a simple HTML parser that uses the +:class:`.HTMLParser` class to print out start tags, end tags and data +as they are encountered:: + + from HTMLParser import HTMLParser + + # create a subclass and override the handler methods + class MyHTMLParser(HTMLParser): + def handle_starttag(self, tag, attrs): + print "Encountered a start tag:", tag + def handle_endtag(self, tag): + print "Encountered an end tag :", tag + def handle_data(self, data): + print "Encountered some data :", data + + # instantiate the parser and fed it some HTML + parser = MyHTMLParser() + parser.feed('<html><head><title>Test</title></head>' + '<body><h1>Parse me!</h1></body></html>') + +The output will then be:: + + Encountered a start tag: html + Encountered a start tag: head + Encountered a start tag: title + Encountered some data : Test + Encountered an end tag : title + Encountered an end tag : head + Encountered a start tag: body + Encountered a start tag: h1 + Encountered some data : Parse me! + Encountered an end tag : h1 + Encountered an end tag : body + Encountered an end tag : html - Reset the instance. Loses all unprocessed data. This is called implicitly at - instantiation time. + +:class:`.HTMLParser` Methods +---------------------------- + +:class:`.HTMLParser` instances have the following methods: .. method:: HTMLParser.feed(data) @@ -73,7 +111,13 @@ An exception is defined as well: Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call - the :class:`HTMLParser` base class method :meth:`close`. + the :class:`.HTMLParser` base class method :meth:`close`. + + +.. method:: HTMLParser.reset() + + Reset the instance. Loses all unprocessed data. This is called implicitly at + instantiation time. .. method:: HTMLParser.getpos() @@ -89,22 +133,34 @@ An exception is defined as well: attributes can be preserved, etc.). +The following methods are called when data or markup elements are encountered +and they are meant to be overridden in a subclass. The base class +implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`): + + .. method:: HTMLParser.handle_starttag(tag, attrs) - This method is called to handle the start of a tag. It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called to handle the start of a tag (e.g. ``

``). The *tag* argument is the name of the tag converted to lower case. The *attrs* argument is a list of ``(name, value)`` pairs containing the attributes found inside the tag's ``<>`` brackets. The *name* will be translated to lower case, and quotes in the *value* have been removed, and character and entity references - have been replaced. For instance, for the tag ````, this method would be called as - ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. + have been replaced. + + For instance, for the tag ````, this method + would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. .. versionchanged:: 2.6 - All entity references from :mod:`htmlentitydefs` are now replaced in the attribute - values. + All entity references from :mod:`htmlentitydefs` are now replaced in the + attribute values. + + +.. method:: HTMLParser.handle_endtag(tag) + + This method is called to handle the end tag of an element (e.g. ``

Parse me!

') + +Parsing a doctype:: + + >>> parser.feed('') + Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" + +Parsing an element with a few attributes and a title:: + + >>> parser.feed('

') + Start tag: img + attr: ('src', 'python-logo.png') + attr: ('alt', 'The Python logo') + >>> + >>> parser.feed('

Python

') + Start tag: h1 + Data : Python + End tag : h1 + +The content of ``script`` and ``style`` elements is returned as is, without +further parsing:: + + >>> parser.feed('') + Start tag: style + attr: ('type', 'text/css') + Data : #python { color: green } + End tag : style + >>> + >>> parser.feed('') + Start tag: script + attr: ('type', 'text/javascript') + Data : alert("hello!"); + End tag : script + +Parsing comments:: + + >>> parser.feed('' + ... '') + Comment : a comment + Comment : [if IE 9]>IE-specific content'``):: + + >>> parser.feed('>>>') + Named ent: > + Num ent : > + Num ent : > + +Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but +:meth:`~HTMLParser.handle_data` might be called more than once:: + + >>> for chunk in ['buff', 'ered ', 'text']: + ... parser.feed(chunk) + ... + Start tag: span + Data : buff + Data : ered + Data : text + End tag : span + +Parsing invalid HTML (e.g. unquoted attributes) also works:: + + >>> parser.feed('

tag soup

') + Start tag: p + Start tag: a + attr: ('class', 'link') + attr: ('href', '#main') + Data : tag soup + End tag : p + End tag : a
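
Note (not part of the commit): the patch documents that feeding incomplete chunks to ``feed()`` works but that ``handle_data()`` might be called more than once for a single run of text. One way a subclass could cope with that is to collect the pieces and join them when the end tag arrives. The sketch below is only an illustration under stated assumptions: ``TextCollector``, ``chunks`` and ``texts`` are hypothetical names, the fallback import is added so the snippet also runs where the module lives in ``html.parser``, and nested elements are not handled.

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module documented in this patch
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the same class


class TextCollector(HTMLParser):
    """Hypothetical helper: joins the data pieces seen inside each element."""

    def __init__(self):
        # Explicit base-class call; works with old-style classes too.
        HTMLParser.__init__(self)
        self.chunks = []   # data pieces received since the last start tag
        self.texts = []    # (tag, joined text) pairs, filled at each end tag

    def handle_starttag(self, tag, attrs):
        # Start a fresh buffer for the element that just opened.
        self.chunks = []

    def handle_data(self, data):
        # May be called several times for one run of text.
        self.chunks.append(data)

    def handle_endtag(self, tag):
        # Join whatever pieces arrived, however the input was split.
        self.texts.append((tag, ''.join(self.chunks)))
        self.chunks = []


collector = TextCollector()
for chunk in ['<span>buff', 'ered ', 'text</span>']:
    collector.feed(chunk)
collector.close()
```

After the loop, ``collector.texts`` holds one entry for the ``span`` element with the text joined back together, regardless of how the parser split the data between ``handle_data()`` calls.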