- From: Shane McCarron <shane@aptest.com>
- Date: Thu, 14 May 2009 16:24:09 -0500
- To: Philip Taylor <pjt47@cam.ac.uk>
- CC: Sam Ruby <rubys@intertwingly.net>, RDFa Community <public-rdfa@w3.org>, "public-rdf-in-xhtml-tf.w3.org" <public-rdf-in-xhtml-tf@w3.org>, HTML WG <public-html@w3.org>
Philip Taylor wrote:
> Indeed, it would be good to have this defined with the level of
> precision that HTML 5 has, so we can be sure implementations will be
> able to agree on how to extract RDFa from text/html content.
>
> A few significant issues that I see in the current version:
>
> What is "the @xml:lang attribute"? Is it the attribute with local name
> "xml:lang" in no namespace (as would be produced by an HTML 5 parser
> (and by current HTML browser parser implementations))? or the
> attribute with local name "lang" in the namespace
> "http://www.w3.org/XML/1998/namespace" (as would be produced by an XML
> parser, and could be inserted in an HTML document via DOM APIs)? or
> both (in which case both could be specified on one element, in
> addition to "lang" in no namespace)?

Well - remember that the document you are looking at is written in the
context of HTML 4. In HTML 4 none of what you say above makes any
sense. Attributes are tokens - and the token "xml:lang" is what I was
talking about. In HTML 4 those attribute names are case-insensitive - I
need to add something about that to the draft. Thanks for the reminder!

> "If the object of a triple would be an XMLLiteral, and the input to
> the processor is not well-formed [XML]" - I don't understand what that
> means in an HTML context. Is it meant to mean something like "the
> bytes in the HTML file that correspond to the contents of the relevant
> element could be parsed as well-formed XML (modulo various namespace
> declaration issues)"? If so, that seems impossible to implement. The
> input to the RDFa processor will most likely be a DOM, possibly
> manipulated by the DOM APIs rather than coming straight from an HTML
> parser, so it may never have had a byte representation at all.

We have no presumption of how an RDFa processor is implemented. It
might be client side via a browser. It might be server side. It might
be part of an XML tool-chain. It doesn't really matter.
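To make the token-based reading concrete: since HTML 4 attribute names are case-insensitive tokens, a processor might resolve the language with a simple case-folded lookup. A minimal sketch (Python, and the helper name `get_language` is mine, not from the draft):

```python
from typing import Optional


# Minimal sketch (helper name mine): HTML 4 attribute names are
# case-insensitive tokens, so we compare lower-cased attribute names
# rather than namespace-qualified ones when looking for xml:lang / lang.
def get_language(attributes: dict) -> Optional[str]:
    """Return the value of xml:lang (preferred) or lang, matched
    case-insensitively; None if neither is present."""
    lowered = {name.lower(): value for name, value in attributes.items()}
    return lowered.get("xml:lang", lowered.get("lang"))
```

Under this model `XML:LANG`, `xml:Lang`, and `xml:lang` are all the same token, which is what the HTML 4 reading above implies.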
In this case, the document I wrote is a little too fuzzy because the
idea is not completely cooked yet. Here's the problem: RDFa permits the
creation of objects that are of type XMLLiteral. That datatype is
tightly defined, and as you can imagine it is expected to contain
well-formed XML. If a Conforming RDFa Processor were to generate
triples that contained data of type XMLLiteral, and that data were not
"well-formed" as defined in XML, then consumers of that data could
easily be very surprised!

> Even without scripting, there isn't always a contiguous sequence of
> bytes corresponding to the content of an element. E.g. if the HTML
> input is:
>
>   <table>
>     <tr some-attributes-to-say-this-element-outputs-an-XMLLiteral>
>       <td> This text goes inside the table </td>
>       This text gets parsed to *outside* the table
>       <td> This text goes inside the table </td>
>     </tr>
>   </table>
>
> then (according to the HTML 5 parsing algorithm, and implemented in
> (at least) Firefox) the content of the <tr> element includes the first
> and third lines of text, but not the second. How would you decide
> whether the content is well-formed XML?

Yeah - tricky. I think you need to take a step back and think about
goals rather than implementation strategies. The goal here is that all
implementations extract the same collection of triples from a given
document. There are a lot of ways to achieve that.

In the XHTML profile of RDFa we relied upon the XML parsing model.
Consequently, we are confident that we are being handed well-formed
content. If you do an implementation via the DOM, you can also be
confident of that, since by the time content gets into the DOM you can
assume the processor has done whatever magic was necessary and you have
a node tree that you could turn back into content that would be
well-formed. If you are writing your own parser that sifts through a
document character by character... well, you are going to have some
work ahead of you!

With regard to your example above....
if I had a DOM-based processor, I would have missed out on line 3, I
imagine. If I wrote my own I would have included it ('cause that is
well formed - the XML parser would have handed it to me).

In the XHTML profile we (sort of) address this in that we only tightly
constrain behavior for *valid* content. The content above is *invalid*
according to the XHTML+RDFa schema - so while the behavior of existing
implementations might be inconsistent, I personally won't get too
excited about it.

In the HTML profile of RDFa, things are much the same. We can attempt
to be very, very precise about how the parsing of the content should be
handled, or we can rely upon the parsing model spelled out by the
underlying specification (HTML 4 in this case). Now, I am sure you will
agree that HTML 4 does a pretty poor job of defining the parsing model,
but.... is it adequate for our needs in this instance? My belief is
that it is adequate - at least for the vast majority of the RDFa
processing rules - in particular because most implementors will rely
upon existing parsing libraries, and the problems associated with that
parsing have been largely sorted out over the years, even to the point
that they are being codified in the early draft HTML5 documents.

The only place I have a concern is with regard to creating XMLLiterals.
This is a very powerful aspect of RDFa, and I am loath to disable it in
the HTML profile if I don't have to. Instead, I would like to identify
a light-weight model that implementors can use. For example, we could
say that if an object is of type XMLLiteral, then its content is
escaped so that there is no markup (< to &lt; etc). This would mean
that it is "well-formed XML", and that it could be turned back into its
original source form, which is the goal of such content. However, it
would also mean that a consumer of such content would need to know this
and do the reverse transformation before using the content.
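As a concrete sketch of both halves of this discussion (Python; all helper names are mine, not from any draft): a processor could test whether candidate XMLLiteral content is well-formed XML, and, under the escaping model just described, escape markup so the literal is trivially well-formed and reversible.

```python
from xml.etree import ElementTree
from xml.sax.saxutils import escape, unescape


# Sketch only: these helpers illustrate (1) testing whether candidate
# XMLLiteral content is well-formed XML, and (2) the escaping model
# described above, where markup is escaped so the literal is trivially
# well-formed and a consumer reverses the transformation.
def is_well_formed_fragment(fragment: str) -> bool:
    """True if the fragment parses as well-formed XML content.

    A fragment may have several top-level nodes, so wrap it in a dummy
    root element before parsing. Undeclared namespace prefixes will
    (correctly) fail this test.
    """
    try:
        ElementTree.fromstring("<root>" + fragment + "</root>")
        return True
    except ElementTree.ParseError:
        return False


def encode_xmlliteral(source: str) -> str:
    """Escape markup ('<' to '&lt;', '&' to '&amp;', etc.) so the
    literal contains no markup at all."""
    return escape(source)


def decode_xmlliteral(literal: str) -> str:
    """The reverse transformation a consumer would apply before use."""
    return unescape(literal)
```

For example, `encode_xmlliteral("<td>a & b</td>")` yields `"&lt;td&gt;a &amp; b&lt;/td&gt;"`, which round-trips back through `decode_xmlliteral` - but only if the consumer knows the convention was applied.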
I don't know what the right answer is - maybe we can figure it out
together?

> For this to make sense in real HTML implementations, the definition
> should be in terms of the document layer rather than the byte layer.
> (The XMLLiteral should be an XML-fragment serialisation of the
> element, and some error handling (like ignoring the triple) would
> occur if it's impossible to serialise as XML, similar to the
> requirements in
> <http://www.whatwg.org/specs/web-apps/current-work/multipage/the-xhtml-syntax.html#serializing-xhtml-fragments>)

In HTML 5, where there is an XML serialisation method, that might make
sense. In HTML 4, however, we don't have that luxury. I suppose we
could say that the HTML 4 content is transformed into corresponding
XHTML 1.0 content... but there are no reliable serializers out there
that really do that.

> How are xmlns:* attributes meant to be processed? E.g. what is the
> expected output in the following cases:
>
>   <div xmlns:T="test:">
>     <span typeof="t:x" property="t:y">Test</span>
>   </div>
>
>   <div XMLNS:t="test:">
>     <span typeof="t:x" property="t:y">Test</span>
>   </div>
>
>   <div xmlns:T="test:">
>     <span typeof="T:x" property="T:y">Test</span>
>   </div>
>
>   <div xmlns:t="test:">
>     <div xmlns:t="">
>       <span typeof="t:x" property="t:y">Test</span>
>     </div>
>   </div>
>
>   <div xmlns:t="test1:" id="d">
>     <span typeof="t:x" property="t:y">Test</span>
>   </div>
>   <script>
>     document.getElementById('d').setAttributeNS(
>         'http://www.w3.org/2000/xmlns/', 'xmlns:t', 'test2:');
>     /* (now the element has two distinct attributes,
>        each in different namespaces) */
>   </script>

I had not thought about this much before. Attribute names in HTML /
SGML are case-insensitive. CURIE prefix names are of course NOT.
However, I can almost guarantee you that browser-based implementations
of the XHTML profile right now would fail to work correctly when faced
with CURIE prefixes that differ only in case. Interesting point - I am
going to test that later.
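One way a browser-based implementation could sidestep the case problem is to fold prefix names to lower case both when collecting declarations and when resolving a CURIE. A sketch (Python; helper names are mine, not from any spec text):

```python
# Sketch (helper names mine): fold CURIE prefix names to lower case
# when collecting xmlns:* declarations and again at lookup time, so
# prefixes are effectively case-insensitive, matching HTML / SGML
# attribute-name behavior.
def build_prefix_map(attributes: dict) -> dict:
    """Collect xmlns:* declarations, lower-casing each prefix name."""
    prefixes = {}
    for name, value in attributes.items():
        if name.lower().startswith("xmlns:"):
            prefixes[name.lower()[len("xmlns:"):]] = value
    return prefixes


def resolve_curie(curie: str, prefixes: dict) -> str:
    """Expand prefix:reference against the case-folded prefix map."""
    prefix, _, reference = curie.partition(":")
    return prefixes[prefix.lower()] + reference
```

With this model, `<div XMLNS:t="test:">` and `<div xmlns:T="test:">` declare the same prefix, and both `t:x` and `T:x` expand to `test:x` - i.e. several of the example cases above collapse into one.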
I think we would be wise to advise document authors not to define
prefixes that differ only in case. And in the HTML profile I think it
would be reasonable to require that prefix names are mapped to
lower-case during processing - or some other solution that gets us to
the point where a browser-based implementation that requests attribute
names from a DOM node can still work. My conclusion here is that prefix
names should be treated case-insensitively in the HTML profile. Do you
agree?

> Should the same processing rules be used for documents from both HTML
> and XHTML parsers, or would DOM-based implementations need to detect
> where the input came from and switch processing rules accordingly? If
> there is a difference, what happens if I adoptNode from an XHTML
> document into an HTML document, or vice versa?

Err... What's adoptNode? And how are these two documents getting
together? I mean, that's sort of out of scope of an HTML 4 profile for
RDFa. With regard to the first part of the question, I believe the same
processing rules can be used. I have an implementation that does it
now. So do lots of other people. My implementation is DOM based,
though, so that makes it relatively simple to have the same rules work.

Thanks for your comments!

-- 
Shane P. McCarron                         Phone: +1 763 786-8160 x120
Managing Director                           Fax: +1 763 786-8180
ApTest Minnesota                           Inet: shane@aptest.com
Received on Thursday, 14 May 2009 21:24:59 UTC