- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Sat, 21 Jan 2012 00:12:46 +0100
- To: Larry Masinter <masinter@adobe.com>
- Cc: Anne van Kesteren <annevk@opera.com>, "public-html-xml@w3.org" <public-html-xml@w3.org>
* Larry Masinter wrote:

> Somehow I wonder if, by "the apparent perception that an HTML parser is
> somehow vastly more complex than its XML counterpart" (and a pointer to
> a Google+ thread) Anne was referring to my comment on that thread about
> scalability. I kind of gave up trying to make my point in a thread in
> Google+, but I thought I would try here.

HTML is first and foremost a bad language, simply because if you have a document with `<x><y>...` you cannot know whether `<y>` is a child or a sibling of `<x>`, and whether `<y>` is character data or a start-tag, unless you recognize both elements. Which you might not, since elements are added to the language all the time. Since documents do not convey this on their own, HTML parsers require maintenance, and it would seem clear that something that requires maintenance is vastly more complex than something that does not.

If you add that there are actually many more options for reading `<x><y>` (for instance, `<table><form>` might be read as `<form><table>`; I do not know whether that is still the case, but that was HTML Tidy's interpretation back when I actively maintained it), then something that should be very simple, like understanding the basic document structure, becomes very complicated. Back in the 1990s the implication of this was that, at times, when you loaded an HTML document in some editor and simply saved the document, it ended up broken. This very day there is a thread where someone argues for being able to write `<table><content>`, and someone else argues that might not be a good idea because it requires changing how HTML parsers understand the basic structure of HTML documents.

> The general principle is that the more you constrain exactly how an
> agent is to behave, the more you constrain the implementation style and
> methodology of implementations. At least on first principles, every
> MUST in a specification -- if it means anything at all -- has to make
> some otherwise compliant implementations non-compliant. If it doesn't,
> it isn't really normative....

If you cannot tell implementations apart based on some requirement, then the requirement does indeed carry no meaning, at least as far as black-box considerations go, but beyond that your argument seems to be mistaken.

As I understand it, you start with "all Turing machines are compliant", and if you add a requirement, "most" possible Turing machines aren't compliant anymore. But that is not how people write code: they start with the Turing machine that does nothing, then add stuff that makes it do something, and more constraints may very well mean you have to do a lot less and have more options for how to do it.

With e-mail, for instance, you have a requirement that "lines" must not exceed somewhere around 1000 bytes. That allows you to use constant memory, which allows for more implementation options than if you had to deal with arbitrarily long lines (unless you have some escape clause to allow implementation-specific limits, in which case you would just be turning a de jure limit into a de facto limit that is not documented in the specification).

Solving the halting problem for arbitrary Turing machines is a bit hard, but if the input machine must halt, which would add a must-requirement, it becomes trivial to solve for valid input machines.
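A minimal sketch of the e-mail line-length point above (not from the original mail; the constant merely assumes the roughly-1000-byte limit): because every conforming line is bounded, a fixed buffer suffices and the reader runs in constant memory, however large the input stream is.

```c
#include <stdio.h>

/* Illustrative sketch only: the ~1000-byte limit is assumed from the text
   above, the exact constant is not taken from any specification here. */
#define MAX_LINE 1002                /* roughly 1000 bytes plus CR LF */

int main(void) {
    char line[MAX_LINE + 1];         /* constant memory, thanks to the limit */
    while (fgets(line, sizeof line, stdin) != NULL) {
        /* process one bounded line; a real mail reader would parse it */
        fputs(line, stdout);
    }
    return 0;
}
```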
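And a similarly minimal sketch of the halting point above, using a deliberately toy machine (just a counter, purely for illustration): once valid input machines are required to halt, the decision procedure is simply to run them.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy "machine" for illustration only: a counter that counts down to zero.
   If valid input machines are required to halt, deciding whether one halts
   is trivial: simulate it until it stops and answer "yes". */
typedef struct { unsigned long steps_left; } machine;

static bool step(machine *m) {   /* perform one transition; false = halted */
    if (m->steps_left == 0)
        return false;
    m->steps_left--;
    return true;
}

static bool halts(machine m) {
    while (step(&m))
        ;                        /* terminates for every valid input machine */
    return true;                 /* the must-requirement guarantees this */
}

int main(void) {
    machine m = { 1000000UL };
    printf("halts: %s\n", halts(m) ? "yes" : "no");
    return 0;
}
```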
In this sense, the more assumptions you can make, the more options you have to provide an answer, and requirements justify assumptions, meaning the more requirements there are, the more assumptions you can make, the more options you have.

(I will also note that arguing about complexity is not very useful without some pre-established notion of what that means. As an example, consider <http://bjoern.hoehrmann.de/utf-8/decoder/dfa/>. Is that decoder more complex than any other that is functionally equivalent? A commonly used measure is branch complexity, and my decoder has somewhere between zero and one branch on x86 systems (you can use conditional moves to implement the one branch that is explicit there). Virtually all decoders I am aware of that pre-date mine use more branches; by that measure it is the least complex. Instruction count, inasmuch as there is an objective measure for that, is another; my decoder would win on that count as well. Lines of code? My decoder wins again. Effort needed to manually prove it mathematically correct? Test it exhaustively automatically? Code cache pressure on randomized input? Being able to read all of it without scrolling? Porting it to other programming languages? Probably all a win, yet nobody would really consider the code as a whole "not complex", at the very least in terms of construction. So... is it more, or less, complex than others?)

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Friday, 20 January 2012 23:13:04 UTC