- From: Robert J Burns <rob@robburns.com>
- Date: Fri, 14 Nov 2008 18:35:29 -0600
- To: Boris Zbarsky <bzbarsky@MIT.EDU>
- Cc: public-html@w3.org
Hi Boris,

On Nov 14, 2008, at 5:45 PM, Boris Zbarsky wrote:

> Robert J Burns wrote:
>> I'm not really clear what your questions are directed at in my
>> previous message.
>
> They certainly are.

They may be, but the seeming points of contention you raise are not contrary to anything I'm saying, so I'm not sure what the thrust of your intervention is here.

>> Certainly, that would need to be part of a parsing the spec. The
>> with SGML we had DTDs.
>
> DTDs aren't a parsing spec. DTDs are a way to specify what markup
> is valid (and some details like inferring opening/closing tags).
> SGML + DTDs is closer to a parsing spec (mostly on the SGML side).
> But it's not really sufficiently well-defined to handle arbitrary
> byte streams.

Agreed; that's why I'm saying a separate parsing spec could be made independent of the HTML5 vocabulary / DOM and Web browser behavior specs. What the HTML5 parsing algorithm adds is specifics of error-recovery related both to arbitrary general markup (like that which could be defined in a DTD) and to HTML-specific error-recovery (the fancy recovery I wrote about earlier), which could only be applied to HTML.

>> With HTML5 we have prose along with specific error handling for
>> ill-formed/invalid markup.
>
> Right. Though there's plenty of perfectly valid markup that
> requires behavior that looks suspiciously like error handling, as a
> result of the SGML legacy described above.

By this I assume you mean the handling of DTD-specified optional tag omissions. This is the part of a new parsing spec that I think could be applied to any schema-defined markup. Contrast this handling of pre-specified optional tag omission with the 'fancy' error recovery applied by browsers to HTML. This too needs to be in the parsing spec, but it's just a task independent of the handling of pre-specified optional tag omission.

>> What I'm suggesting is that this part of the HTML5 spec suffers
>> from not having some specialized expertise applies to this.
>
> Specialized expertise in what? Language design? Parser design?
> Parsing HTML? I think a good bit of HTML parsing expertise has been
> applied to writing this part of the spec. Unless by "this" you mean
> something other than "prose along with specific error handling".

Certainly the reverse-engineering for the parsing of HTML is in the current draft. The problem is that the draft lacks any forward-looking approach and leaves rendering engines (Mozilla among them) in a state that hinders language improvements in the future (like adding new non-void head elements or not even incorporating Gecko's in span insertion mode).

>> Ideally I think we could have a parsing specification that applied
>> to HTML and SGML equally, but with the possibility of specifying
>> error handling for other DTD specified SGML. Think of it as an SGML
>> parser with a built-in HTML5 DTD.
>
> That's not particularly compatible with the way HTML actually needs
> to be parsed.... And a DTD can't specify the behavior that's needed
> out of an HTML parser, I should note. If you're using "DTD" as a
> shorthand for "machine-readable format", there's no reason one
> couldn't create a machine-readable definition of the HTML5 state
> machine. I'm just not sure that's what you're looking for.

Again, there's no point of contention here. All I'm saying is that the parsing algorithm draft includes error-recovery that could be equally applied to non-HTML, DTD-defined content (and the algorithm applies error-recovery to HTML beyond what a DTD can define). It is that parsing algorithm which would be better as a separate spec (one that could incorporate improvements that allowed the HTML element and content model vocabulary to change).

>> Parsing only depends on the HTML language with respect to the
>> schema handling.
>
> It depends on the language because of the wide variety of tags that
> have to be handled in "weird" ways.

That again is the fancy error-recovery I spoke about.
However, as we advance the HTML5 error-recovery, there shouldn't be any need to add new fancy error-recovery (like moving input elements outside of tables).

>> Valid well-formed markup can be specified by a the language schema
>> and leave error-handling specifications to the parsing algorithm.
>
> I'm not sure what the first part of that sentence means, to be
> honest, but I agree with the second part. The parsing algorithm
> needs to be aware of the error handling, and hence of the HTML
> vocabulary and the various properties different HTML tags have in
> terms of parsing.

It only needs to be aware of the legacy HTML vocabulary and the error recovery applied to that. The introduction of new vocabulary to HTML need not be addressed in changes to error-recovery (there's no need for new fancy error recovery moving forward if every browser adheres to the parsing spec).

>> Perhaps it would better to say this is the specification of the
>> HTML vocabulary (elements, attributes, and content models) and DOM
>> as opposed to the HTML 'language' and DOM.
>
> OK. So this would basically be a list of elements, corresponding
> attributes, DOM interface, and the behavior of said DOM interfaces,
> without reference to where these elements come from or how they
> relate to each other other than that some elements may contain other
> elements in some cases?

No, it can be a complete specification of the vocabulary of HTML. What is left out is how that vocabulary gets serialized or how such serializations get parsed (except that the result of the parsing should be a valid HTML object).

>>> Note that in practice parsing might need to depend on attribute
>>> values....
>> Could you give an example where parsing depends on attribute values?
>
> Sure.
> Compare the DOM produced by browsers for:
>
>   <table>
>     <input type="text">
>   </table>
>
> To that for:
>
>   <table>
>     <input type="hidden">
>   </table>
>
> In practice, either you submit your form controls in DOM order and
> parse differently depending on the type attribute of the control or
> you parse the same way no matter what the type, but submit in an
> order that has nothing to do with DOM order.

Yes, but this is the fancy legacy error-recovery I was speaking about. This fancy stuff was introduced by browsers because they didn't have a predefined parsing algorithm and authors used invalid content models where they were trying to solve certain problems. There's no reason that the parsing algorithm going forward needs to introduce any new fancy error-recovery.

Just to clarify, this strangeness is due to some browsers introducing fancy error-recovery to repair poorly authored tables, while authors simultaneously took advantage of other browser error-recovery to hide input values inside tables. If browsers adhere to a newly specified parsing algorithm, we shouldn't need to introduce any new such error-recovery in the future.

>> Still there's an independence. We can allow scripts to call the
>> parser and we can have parsers produce scripts while still keeping
>> the definition separate.
>
> The definition of which?

The definition of both. In other words, we can specify HTML (and others specify JavaScript) in a way that allows authors to call a parser and parse a string or bytes into content without either HTML or JavaScript specifying the algorithm for that parsing (e.g., scripting could allow the parsing of a vCard into a microformats hCard, but hCard and HTML do not need to define the algorithm for that parsing). On the other hand, the parsing algorithm doesn't necessarily have to know anything about the document schema that it is going to de-serialize content into.
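To make the <table>/<input> example quoted above concrete, here is a minimal sketch of how a parser's tree construction can depend on an attribute value. It is a toy model, not a real HTML parser and not the spec text: the function names and the nested-list tree are hypothetical, and the rule it encodes (an input with type="hidden" stays inside the table, any other input is moved in front of it) is an assumption modeled on the behavior Boris describes.

```python
# Toy model, not a real HTML parser: it only decides where an <input>
# start tag found directly inside <table> ends up in the resulting tree.


def place_input_in_table(attrs):
    """Return 'inside table' or 'before table' for an <input> in <table>.

    Assumption: type="hidden" (compared case-insensitively) stays in the
    table; any other or missing type is moved out in front of the table.
    """
    type_value = attrs.get("type")
    if type_value is not None and type_value.lower() == "hidden":
        return "inside table"
    return "before table"


def build_tree(input_attrs):
    """Build a nested-list stand-in for the DOM of <table><input ...></table>."""
    body = ["body"]
    table = ["table"]
    input_node = ["input", dict(input_attrs)]
    if place_input_in_table(input_attrs) == "inside table":
        table.append(input_node)   # the input remains a child of the table
        body.append(table)
    else:
        body.append(input_node)    # the input becomes a preceding sibling
        body.append(table)
    return body
```

Calling build_tree({"type": "text"}) and build_tree({"type": "hidden"}) yields two differently shaped trees from markup that differs only in one attribute value, which is the dependency under discussion.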
>> The point of my post (and what I read Roy Fielding saying) is that
>> the current HTML5 specification's strength is in its web browser
>> behavior specification.
>
> "web browser behavior" includes the parsing algorithm. In fact,
> that's one of the most important parts of the current specification
> from Mozilla's point of view.

However, it's an easily compartmentalized piece of what a browser does (not only easily compartmentalized, but better for spec design and application design when it's abstracted in that way).

>> The parsing algorithm and the HTML vocabulary parts of the spec
>> suffer because we don't have spec editors who sufficiently
>> understand those parts.
>
> Uh... We have spec editors who understand the parsing algorithm far
> better than anyone else I can think of, since they've spent a good
> bit of time studying how browsers actually parse HTML. So I don't
> know where the "don't have spec editors who sufficiently understand
> those parts" meme comes from.

From many months of discussing these topics.

Take care,
Rob
Received on Saturday, 15 November 2008 00:36:08 UTC