- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Fri, 15 May 2009 00:58:48 +0100
- To: Shane McCarron <shane@aptest.com>
- CC: Sam Ruby <rubys@intertwingly.net>, RDFa Community <public-rdfa@w3.org>, "public-rdf-in-xhtml-tf.w3.org" <public-rdf-in-xhtml-tf@w3.org>, HTML WG <public-html@w3.org>
Shane McCarron wrote:
> Philip Taylor wrote:
>> Indeed, it would be good have this defined with the level of precision
>> that HTML 5 has, so we can be sure implementations will be able to
>> agree on how to extract RDFa from text/html content.
>> [...]
> Well - remember that the document you are looking at is written in the
> context of HTML 4. In HTML 4 none of what you say above makes any
> sense. Attributes are tokens - and the token "xml:lang" is what I was
> talking about.

Yeah, I'm not sure what else you could do in the context of HTML 4. I'm approaching this from the context of HTML 5 - I think it would be valuable to define precisely the mapping from text/html to RDF triples, so that people can know what to expect when they run their content through any RDFa-aware tool, and it only seems to be feasible to define that in the HTML 5 context.

(This might be in addition to an HTML 4 extension like in your document, not necessarily a replacement, but I'm not personally interested in working with HTML 4. Maybe that means the "HTML 4 Profile for RDFa" thread is not the best place to discuss this, but better here than nowhere...)

>> [Stuff about XMLLiterals]
> We have no presumption of how an RDFa processor is implemented. It
> might be client side via a browser. It might be server side. It might
> be part of an XML tool-chain. It doesn't really matter.

Is there any implementation that is *not* based on some kind of abstract document model (like DOM or SAX or some custom tree structure, where documents are parsed into elements and attributes before any further processing)?

It seems to me that requiring the abstract document model to be re-serialised into well-formed XML (regardless of whether it originated from an XML parser, or from parsing HTML with missing quotes and unclosed <br>s, or from a DOM API, or anywhere else) would be the best way to ensure correctness (since the output will always be well-formed XML, by definition), functionality (since it would let you use XMLLiterals in text/html with few surprises or special cases), and practical implementability (since everyone should already have a tree of elements and attributes and be able to serialise it into XML). But that does rely on the concept of a document model, which only really exists in HTML 5 and not in HTML 4.

> I think you need to take a step back and think about
> goals rather than implementation strategies. The goal here is that all
> implementations extract the same collection of triples from a given
> document.

I like that goal :-). I don't want to limit things to a single implementation strategy (e.g. DOM) - but some people will use that implementation strategy, and if other implementations are required to extract the same collection of triples, then it seems sensible to define the requirements in a way that can be easily mapped onto that implementation strategy (and preferably onto others), rather than leaving a huge gap that implementers have to sort out themselves and could easily get wrong.

The DOM-based model used by the HTML 5 parsing algorithm can be easily mapped onto common implementation strategies (DOM, SAX, XOM, ElementTree, etc). The token-based model of HTML 4 can't (hence the crazy incompatibilities between HTML parsers, and the need for a huge amount of work in HTML 5 to define the mapping for the first time). So defining RDFa triple extraction based on HTML 5 seems much more likely to achieve the goal than defining it based on HTML 4, and therefore seems a more useful thing to work on.
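
As an illustration of the re-serialisation idea above, here is a minimal Python sketch. It is not the HTML 5 parsing algorithm: it uses the standard library's html.parser as a rough stand-in tree builder, and the names SloppyTreeBuilder, to_xml, the VOID set and the wrapping "div" root are all assumptions made for the example. It also assumes attribute names are legal XML names, which real text/html content does not guarantee.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: a small, incomplete set of elements that never have content.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class SloppyTreeBuilder(HTMLParser):
    """Builds an ElementTree from tag soup (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("div")  # assumed wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        attrib = {name: (value or "") for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def to_xml(html_text):
    """Parse sloppy HTML into a tree, then serialise the tree as XML."""
    builder = SloppyTreeBuilder()
    builder.feed(html_text)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")


if __name__ == "__main__":
    # Missing quotes, an unclosed <br>, and unclosed <p> and <b>.
    print(to_xml("<p class=note>a<br>b <b>c"))
```

Because the serialiser only ever sees a tree of elements and attributes, the missing quotes and unclosed tags in the input cannot leak into the output; that is the property the paragraph above relies on.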

>> <http://www.whatwg.org/specs/web-apps/current-work/multipage/the-xhtml-syntax.html#serializing-xhtml-fragments>)
>>
> In HTML 5, where there is an XML serialisation method, that might make
> sense. In HTML 4 however, we don't have that luxury.

Sounds like another benefit of defining RDFa-in-HTML based on HTML 5 instead of suffering the restrictions of HTML 4 :-)

> [...] in the HTML profile I think it
> would be reasonable to require that prefix names are mapped to
> lower-case during processing. Or some other solution that gets us to
> the point where a browser-based implementation that requests attribute
> names from a DOM node can still work. My conclusion here is that prefix
> names should be treated case-insensitively in the HTML profile. Do you
> agree?

HTML parsers (by which I mean HTML 5 and web browsers) don't preserve the case of element or attribute names. Anything processing the output from a parser will see everything as lowercase (or uppercase, depending on what API they use), so RDFa mustn't consider the case of attribute names to be significant.

When comparing prefixes in CURIEs, I suppose it could do a case-insensitive comparison, but that would be unnecessary complexity and annoyingly inconsistent with XHTML. rdfquery and http://www.w3.org/2006/07/SWD/RDFa/impl/js/ appear to treat the attribute name as lowercase, and then case-sensitively compare against the CURIE prefix.

>> Should the same processing rules be used for documents from both HTML
>> and XHTML parsers, or would DOM-based implementations need to detect
>> where the input came from and switch processing rules accordingly? If
>> there is a difference, what happens if I adoptNode from an XHTML
>> document into an HTML document, or vice versa?
> Err... What's adoptNode?

http://www.w3.org/TR/DOM-Level-3-Core/core.html#Document3-adoptNode

> And how are these two documents getting together?

I might have an HTML document (containing some RDFa), which uses XMLHttpRequest to download an XHTML fragment (also containing some RDFa) and inserts it into the current page, and then I might attempt to extract RDF triples from the page.

> I mean, that's sort of out of scope of an HTML 4 profile for RDFa.

It's out of scope for HTML 4, but it seems necessary for the goal that "all implementations extract the same collection of triples from a given document" if you include dynamic implementations. (And http://rdfa.info/wiki/Dynamic-content-parsing suggests people are interested in dynamic implementations.)

> With regard to the first part of the question, I believe the same
> processing rules can be used.

So I could use the "lang" attribute (instead of "xml:lang") in XHTML documents as well as HTML, because the same processing rules would be applied? (If so, it would be nice if the RDFa-in-XHTML specification agreed with that.)

--
Philip Taylor
pjt47@cam.ac.uk
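
The prefix handling discussed earlier in this message (attribute names seen as lowercase, then a case-sensitive comparison against the CURIE prefix) can be sketched roughly as follows. This is only an illustration in Python, not code from rdfquery or the W3C JavaScript implementation: collect_prefixes and resolve_curie are hypothetical names, the prefix map is flat with no element scoping, and the standard library's html.parser stands in for an HTML parser that lowercases attribute names.

```python
from html.parser import HTMLParser


class PrefixCollector(HTMLParser):
    """Collects xmlns:* declarations as the parser reports them (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}

    def handle_starttag(self, tag, attrs):
        # html.parser hands over attribute names already lowercased,
        # so "XMLNS:DC" and "xmlns:dc" both arrive as "xmlns:dc".
        for name, value in attrs:
            if name.startswith("xmlns:"):
                self.prefixes[name[len("xmlns:"):]] = value or ""


def collect_prefixes(html_text):
    collector = PrefixCollector()
    collector.feed(html_text)
    collector.close()
    return collector.prefixes


def resolve_curie(curie, prefixes):
    """Case-sensitive lookup of the CURIE prefix in the lowercased map."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        return None
    return prefixes[prefix] + reference


if __name__ == "__main__":
    markup = '<div XMLNS:DC="http://purl.org/dc/elements/1.1/" property="dc:title">x</div>'
    prefixes = collect_prefixes(markup)
    print(resolve_curie("dc:title", prefixes))
    print(resolve_curie("DC:title", prefixes))
```

Under these assumptions a CURIE written as "dc:title" resolves regardless of how the declaration was capitalised in the markup, while "DC:title" does not resolve at all, which matches the behaviour attributed above to the two existing implementations.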
Received on Thursday, 14 May 2009 23:59:37 UTC