
Re: scalability and forgiving parsers

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sat, 21 Jan 2012 00:12:46 +0100
To: Larry Masinter <masinter@adobe.com>
Cc: Anne van Kesteren <annevk@opera.com>, "public-html-xml@w3.org" <public-html-xml@w3.org>
Message-ID: <l5njh7d6oditjrif6rnvseslgnoi4ouimj@hive.bjoern.hoehrmann.de>
* Larry Masinter wrote:
>Somehow I wonder if, by " the apparent perception that an HTML parser is
>somehow vastly more complex than its XML counterpart" (and a pointer to
>a Google+ thread) Anne was referring my comment on that thread about
>scalability. I kind of gave up trying to make my point in a thread in
>Google+, but I thought I would try here.

HTML is first and foremost a bad language, simply because if you have a
document with `<x><y>...` you cannot know whether the `<y>` is a child
or a sibling of `<x>`, or whether `<y>` is character data or a start-tag,
unless you recognize both elements. Which you might not, since elements
are added to the language all the time. Since documents do not convey
this on their own, HTML parsers require maintenance, and it would seem
clear that something that requires maintenance is vastly more complex
than something that does not.
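To make that concrete, here is a toy sketch (not a real HTML parser;
the element sets below are hypothetical and incomplete): how `<y>` is
placed after `<x>` depends entirely on what the parser already knows
about both names, so every new element is a potential parser change.

```python
# Toy illustration: the interpretation of `<x><y>` is not in the
# document, it is baked into the parser's knowledge of element names.
VOID = {"br", "img"}              # elements that cannot contain children
RAWTEXT = {"script", "style"}     # content is character data, not markup
AUTOCLOSE = {"p": {"p", "div"}}   # e.g. <p><div> implies </p> first

def place(open_tag, next_tag):
    """Decide how `next_tag` relates to the currently open `open_tag`."""
    if open_tag in RAWTEXT:
        return "character data"
    if open_tag in VOID or next_tag in AUTOCLOSE.get(open_tag, set()):
        return "sibling"
    return "child"
```

An unknown element falls through to "child" here; a real parser that
guessed wrong would build a different tree than one that knew the name.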

If you add that there are actually many more options to read `<x><y>`,
for instance, `<table><form>` might be read as `<form><table>` (I do
not know whether that is still the case, but that was HTML Tidy's
interpretation back when I actively maintained it), then something that
should be very simple, like understanding the basic document structure,
becomes very complicated. Back in the 1990s the implication of this was
that, at times, when you loaded an HTML document in some editor and
simply saved the document, it ended up broken. This very day there is a
thread where someone argues for being able to write `<table><content>`,
and someone else argues that might not be a good idea because it
requires changing how HTML parsers understand the basic structure of
HTML documents.

>The general principle is that the more you constrain exactly how an
>agent is to behave, the more you constrain the implementation style and
>methodology of implementations.  At least on first principles, every
>MUST in a specification -- if it means anything at all -- has to make
>some otherwise compliant implementations non-compliant.  If it doesn't,
>it isn't really normative.... 

If you cannot tell implementations apart based on some requirement then
the requirement does indeed carry no meaning, at least as far as black-
box considerations go, but beyond that your argument seems to be
mistaken. As I understand it, you start with "all Turing machines are
compliant", and if you add a requirement, "most" possible Turing
machines aren't compliant anymore. But that is not how people write
code: they start with the Turing machine that does nothing, and then
add stuff that makes it do something, and more constraints may very
well mean you have to do a lot less, with more options for how to do it.

With e-mail, for instance, you have a requirement that "lines" must not
exceed somewhere around 1000 bytes. That allows you to use constant
memory, which allows for more implementation options than if you had to
deal with arbitrarily long lines (unless you have some escape clause to
allow implementation-specific limits, in which case you would just be
turning a de-jure limit into a de-facto limit that is not documented in
the specification).
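A sketch of what the constraint buys you (assuming a byte stream and
the roughly-1000-octet limit from the mail RFCs; the function name and
exact figure here are mine): because no conforming line can exceed the
limit, the reader needs only one fixed-size buffer, however large the
input is.

```python
import io

MAX_LINE = 1000  # approximate limit from the mail RFCs

def read_lines(stream, limit=MAX_LINE):
    """Yield CRLF-terminated lines using a buffer of at most `limit` bytes."""
    buf = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            if buf:
                yield bytes(buf)  # final line without trailing CRLF
            return
        buf += b
        if buf.endswith(b"\r\n"):
            yield bytes(buf)
            buf.clear()
        elif len(buf) > limit:
            # A conforming sender never triggers this, which is exactly
            # why constant memory suffices for conforming input.
            raise ValueError("line exceeds the permitted length")

lines = list(read_lines(io.BytesIO(b"hello\r\nworld\r\n")))
```

Without the limit, the buffer must grow without bound (or the
implementation must invent its own undocumented limit, the de-facto
case described above).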

Solving the halting problem for arbitrary Turing machines is a bit
hard, but if the input machine must halt, which would add a must-
requirement, it is trivial to solve for valid input machines. In this
sense, the more assumptions you can make, the more options you have to
provide an answer, and requirements justify assumptions, meaning the
more requirements there are, the more assumptions you can make, and the
more options you have.

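The point can be made painfully literal (a deliberately silly sketch;
the function name is mine): once the requirement guarantees that every
valid input halts, the decider for valid inputs needs no analysis at
all.

```python
def halts(machine) -> bool:
    # Precondition (the added must-requirement): every valid input
    # machine halts. Under that assumption the answer for any valid
    # machine is immediate; no simulation or analysis is needed.
    return True
```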
(I will also note that arguing about complexity is not very useful
without some pre-established notion of what that means. As an example
consider <http://bjoern.hoehrmann.de/utf-8/decoder/dfa/>. Is that
decoder more complex than any other that is functionally equivalent? A
commonly used measure is branch complexity, and my decoder has
somewhere between zero and one branch on x86 systems (you can use
conditional moves to implement the one branch that is explicit there).
Virtually all decoders I am aware of that pre-date mine use more
branches; by that measure it is the least complex.

Instruction count, in as much as there is an objective measure for
that, is another. My decoder would win on that count as well. Lines of
code? My decoder wins again. Effort needed to manually prove it
mathematically correct? Test it exhaustively, automatically? Code cache
pressure on randomized input? Being able to read all of it without
scrolling? Porting it to other programming languages? Probably all a
win, yet nobody would really consider the code as a whole as not
complex, at the very least in terms of construction. So... is it more,
or less complex than others?)
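The table-driven idea behind such a decoder can be sketched as follows
(a simplified illustration, not the linked decoder or its tables: this
version only checks structural well-formedness and does not reject
overlong, surrogate, or out-of-range sequences, which the real decoder
handles via finer byte classes). The per-byte step is just two table
lookups; the only control flow is the loop itself.

```python
# Byte-class table: map each byte to a small class number once, up front.
CLASS = [0] * 256                                 # 0: ASCII
for b in range(0x80, 0xC0): CLASS[b] = 1          # continuation byte
for b in range(0xC2, 0xE0): CLASS[b] = 2          # 2-byte leader
for b in range(0xE0, 0xF0): CLASS[b] = 3          # 3-byte leader
for b in range(0xF0, 0xF5): CLASS[b] = 4          # 4-byte leader
for b in (0xC0, 0xC1): CLASS[b] = 5               # always invalid
for b in range(0xF5, 0x100): CLASS[b] = 5         # always invalid

ACCEPT, REJECT = 0, 1
# Transition table: TRANS[state][class] -> next state.
# States: 0 accept, 1 reject (sticky), 2/3/4 need one/two/three more
# continuation bytes.
TRANS = {
    0: [0, 1, 2, 3, 4, 1],
    1: [1, 1, 1, 1, 1, 1],
    2: [1, 0, 1, 1, 1, 1],
    3: [1, 2, 1, 1, 1, 1],
    4: [1, 3, 1, 1, 1, 1],
}

def validate(data: bytes) -> bool:
    """Structural UTF-8 check: two lookups per byte, no per-byte branch."""
    state = ACCEPT
    for b in data:
        state = TRANS[state][CLASS[b]]
    return state == ACCEPT
```

The branch-complexity argument above is about exactly this shape: all
the decisions live in the precomputed tables, not in per-byte branches.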
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Friday, 20 January 2012 23:13:04 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Friday, 20 January 2012 23:13:05 GMT