- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Sat, 23 May 2009 17:49:19 +0100
- To: Julian Reschke <julian.reschke@gmx.de>
- CC: Sam Ruby <rubys@intertwingly.net>, Shane McCarron <shane@aptest.com>, RDFa Community <public-rdfa@w3.org>, "public-rdf-in-xhtml-tf.w3.org" <public-rdf-in-xhtml-tf@w3.org>, HTML WG <public-html@w3.org>
Minor correction: I wrote:

> In a HTML5 text/html serialisation with no scripting, you can only get
> the attribute "xml:lang" in no namespace.

which I think is wrong because of foreign content: you can write
<div xml:lang=a><svg xml:lang=b></svg></div>, which will result in one
attribute called "xml:lang" in no namespace on the div, and one called
"lang" in the XML namespace on the svg. (But you can't get both on the
same element, unless I'm wrong again.)

Julian Reschke wrote:
>>> Is it still underspecified once we require a valid HTML5 document as
>>> input?
>>
>> Probably not. But I wouldn't consider it acceptable to require a valid
>> document as input - people make mistakes all the time, and I want them
>> to get consistent (and hopefully predictable) RDF triples out of it
>> regardless of what implementation they use, so the specification has
>> to deal precisely with invalid input. See
>> http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2009May/0156.html
>> for an example of someone with precisely this kind of error.
>
> Understood; I just wanted to understand the scope of the problem.

Okay, sure. My original comment was in the context of there not
necessarily being a contiguous sequence of characters that corresponds
to a parsed element, and I think that's closely related to (perhaps the
same as?) the concept of streamability (basically the ability to output
SAX events without buffering elements).

The current streamability violations come from:

  * Content inside <table>/<tr> instead of inside <td>
  * Misnested <i>...<p>...</i>...</p>, <i>...<b>...</i>...</b>, etc
  * Head content (link, meta, etc) between </head> and <body>
  * Multiple <html> or <body> elements (their attributes get merged)
  * Content after </body>
  * (Can't think of any others)

Those are all parse errors, and a conforming parser is allowed to abort
when it sees a parse error, though many of them are quite common in the
wild.
In any other case, it seems like it ought to be theoretically possible
to find a substring of the document that corresponds to the content of
an element, though I may be missing some subtleties. (But current parser
implementations don't do that, and I don't think they would willingly do
so - they throw away the input stream and all they can do is
re-serialise the parsed output.)

>> By "DOM" I generally mean any kind of tree structure of elements and
>> attributes, either as an explicit data structure (DOM, XOM,
>> ElementTree) or implicit (SAX). Would any RDFa implementation *not*
>> parse the input HTML into that kind of structure and operate over the
>> elements and attributes as distinct objects? (e.g. would they just use
>> regular expressions over the input byte stream? That seems quite
>> infeasible to me...)
>
> Depends on the definition of "tree structure". I've been involved in
> code that just uses a tokenizer and specialized stack, and
> implementations like these will not do the re-arranging of elements the
> HTML5 spec specifies for some kinds of broken input.

If they abort when there are streamability violations, that's fine (and
is what the Validator.nu parser's unbuffered SAX output does) - the
stream of start/end element events will always be well-nested and will
encode a tree structure, and it would be possible to specify DOM-based
algorithms that could be easily mapped onto that non-DOM implementation.

If they don't abort and instead do some different kind of error
handling, then they're not a conforming HTML5 parser, and in that case
we've already failed at the goal of getting consistent behaviour.

>> [...]
>
> That's impossible, at least for now as RDFa-in-XHTML relies on
> XML-NS-wellformedness (so XMLNS:* would be recognized as namespace
> declaration, right?).

Hmm, maybe a better example of what I intended is:

  <div xmlns:t="test1:">
    <div xmlns:T="test2:">
      <span property="t:x T:y">Test</span>
    </div>
  </div>

which is well-formed XML and has a clear definition in RDFa-in-XHTML,
but the defined behaviour is impossible to reproduce in text/html
(because xmlns:t and xmlns:T (and XMLNS:T) are parsed identically by an
HTML parser and there's no way to distinguish them afterwards).

RDFa-in-text/html could:

  * Assume attributes are all treated as lowercase (breaking
    <div xmlns:T="..." property="T:..."> which works in XHTML);
  * Say CURIEs (in both XHTML and HTML) match prefixes
    case-insensitively (breaking compatibility with current
    implementations);
  * Change text/html parsing to preserve attribute case (breaking
    compatibility with current parsers);
  * Use some other prefix-binding mechanism (in both XHTML and HTML)
    like prefix="t=... T=..." instead of xmlns:t="..." (breaking current
    implementations and deployed content, but avoiding the mess of
    parsing differences between XHTML and HTML).

I can't think of any other solutions, so something is going to break no
matter what is chosen.

--
Philip Taylor
pjt47@cam.ac.uk
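
The xmlns:t / xmlns:T collision described in the message can be reproduced with a small sketch. It is an illustration only: Python's standard-library html.parser is not a conforming HTML5 parser and is used here merely because it also lowercases attribute names, while xml.dom.minidom stands in for an XML parser. The names SNIPPET, AttrNameCollector, html_attr_names and xml_attr_names are hypothetical helpers, not part of any RDFa implementation.

```python
from html.parser import HTMLParser
import xml.dom.minidom

# The nested-prefix example from the message, on one line.
SNIPPET = (
    '<div xmlns:t="test1:">'
    '<div xmlns:T="test2:">'
    '<span property="t:x T:y">Test</span>'
    '</div></div>'
)


class AttrNameCollector(HTMLParser):
    """Records every attribute name as the HTML parser reports it."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over attribute names already lowercased.
        self.names.extend(name for name, _value in attrs)


def html_attr_names(markup):
    """Attribute names seen when the markup is read as text/html."""
    collector = AttrNameCollector()
    collector.feed(markup)
    collector.close()
    return collector.names


def xml_attr_names(markup):
    """Attribute names seen when the same markup is read as XML."""
    doc = xml.dom.minidom.parseString(markup)
    names = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE:
            attrs = node.attributes
            names.extend(attrs.item(i).name for i in range(attrs.length))
        for child in node.childNodes:
            walk(child)

    walk(doc.documentElement)
    return names


if __name__ == "__main__":
    print("text/html:", html_attr_names(SNIPPET))
    print("XML:      ", xml_attr_names(SNIPPET))
```

Read as HTML, both declarations come out under the single name xmlns:t, so the two prefixes can no longer be told apart; read as XML, xmlns:t and xmlns:T remain distinct. The value of property="t:x T:y" keeps its case in both, which is why the CURIE prefix T has nothing left to match in text/html.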
Received on Saturday, 23 May 2009 16:50:07 UTC