- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Fri, 14 Nov 2008 18:45:40 -0500
- To: Robert J Burns <rob@robburns.com>
- CC: public-html@w3.org
Robert J Burns wrote:
> I'm not really clear what your questions are directed at in my previous
> message.

They certainly are.

> Certainly, that would need to be part of a parsing the spec. The with
> SGML we had DTDs.

DTDs aren't a parsing spec. DTDs are a way to specify what markup is valid (and some details like inferring opening/closing tags). SGML + DTDs is closer to a parsing spec (mostly on the SGML side). But it's not really sufficiently well-defined to handle arbitrary byte streams.

> With HTML5 we have prose along with specific error
> handling for ill-formed/invalid markup.

Right. Though there's plenty of perfectly valid markup that requires behavior that looks suspiciously like error handling, as a result of the SGML legacy described above.

> What I'm suggesting is that this
> part of the HTML5 spec suffers from not having some specialized
> expertise applies to this.

Specialized expertise in what? Language design? Parser design? Parsing HTML? I think a good bit of HTML parsing expertise has been applied to writing this part of the spec. Unless by "this" you mean something other than "prose along with specific error handling".

> Ideally I think we could have a parsing
> specification that applied to HTML and SGML equally, but with the
> possibility of specifying error handling for other DTD specified SGML.
> Think of it as an SGML parser with a built-in HTML5 DTD.

That's not particularly compatible with the way HTML actually needs to be parsed.... And a DTD can't specify the behavior that's needed out of an HTML parser, I should note. If you're using "DTD" as a shorthand for "machine-readable format", there's no reason one couldn't create a machine-readable definition of the HTML5 state machine. I'm just not sure that's what you're looking for.

> Parsing only depends on the HTML language with respect to the schema
> handling.

It depends on the language because of the wide variety of tags that have to be handled in "weird" ways.

> Valid well-formed markup can be specified by a the language
> schema and leave error-handling specifications to the parsing algorithm.

I'm not sure what the first part of that sentence means, to be honest, but I agree with the second part. The parsing algorithm needs to be aware of the error handling, and hence of the HTML vocabulary and the various properties different HTML tags have in terms of parsing.

> Perhaps it would better to say this is the specification of the HTML
> vocabulary (elements, attributes, and content models) and DOM as
> opposed to the HTML 'language' and DOM.

OK. So this would basically be a list of elements, corresponding attributes, DOM interface, and the behavior of said DOM interfaces, without reference to where these elements come from or how they relate to each other other than that some elements may contain other elements in some cases?

>> Note that in practice parsing might need to depend on attribute
>> values....
>
> Could you give an example where parsing depends on attribute values?

Sure. Compare the DOM produced by browsers for:

  <table>
    <input type="text">
  </table>

To that for:

  <table>
    <input type="hidden">
  </table>

In practice, either you submit your form controls in DOM order and parse differently depending on the type attribute of the control or you parse the same way no matter what the type, but submit in an order that has nothing to do with DOM order.

> Still there's an independence. We can allow scripts to call the parser
> and we can have parsers produce scripts while still keeping the
> definition separate.

The definition of which?

> The point of my post (and what I read Roy Fielding saying) is that the
> current HTML5 specification's strength is in its web browser behavior
> specification.

"web browser behavior" includes the parsing algorithm. In fact, that's one of the most important parts of the current specification from Mozilla's point of view.

> The parsing algorithm and the HTML vocabulary parts of
> the spec suffer because we don't have spec editors who sufficiently
> understand those parts.

Uh... We have spec editors who understand the parsing algorithm far better than anyone else I can think of, since they've spent a good bit of time studying how browsers actually parse HTML. So I don't know where the "don't have spec editors who sufficiently understand those parts" meme comes from. It more or less looks like a passive-aggressive accusation of incompetence to me.

Cheers,
Boris
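
To make the <table>/<input> comparison in the message concrete, here is a minimal Python sketch of a tree builder that treats an <input> start tag seen directly inside <table> differently depending on its type attribute. It is loosely modelled on the "in table" rule in the current HTML Standard (a hidden input stays in the table, any other input is moved out in front of it), and it is only a toy: Node, insert_input_in_table_mode, parse_example and outline are hypothetical names invented for this illustration, not part of any browser or library.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, minimal stand-in for a DOM element."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def insert_input_in_table_mode(body: Node, table: Node, attrs: dict) -> Node:
    """Toy handling of an <input> start tag seen directly inside <table>.

    Assumption: only the <input> case is modelled. A real parser handles
    many other tags (tr, td, caption, ...) with their own rules.
    """
    node = Node("input", dict(attrs))
    if attrs.get("type", "").lower() == "hidden":
        # Hidden inputs are allowed to stay where the markup put them.
        table.children.append(node)
    else:
        # Everything else is moved out and placed just before the table.
        index = next(i for i, child in enumerate(body.children) if child is table)
        body.children.insert(index, node)
    return node


def parse_example(input_type: str) -> Node:
    """Build the tree for <table><input type=...></table> inside <body>."""
    body = Node("body")
    table = Node("table")
    body.children.append(table)
    insert_input_in_table_mode(body, table, {"type": input_type})
    return body


def outline(node: Node) -> str:
    """Render the tree as a compact string for comparison."""
    if not node.children:
        return node.name
    return f"{node.name}[{', '.join(outline(c) for c in node.children)}]"


if __name__ == "__main__":
    print(outline(parse_example("text")))
    print(outline(parse_example("hidden")))
```

The two calls build different trees from markup that differs only in an attribute value, so in this sketch the tree builder cannot be written without knowing something about the input element's type attribute.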
Received on Friday, 14 November 2008 23:46:27 UTC