Re: Request to publish HTML+RDFa (draft 3) as FPWD from Philip Taylor on 2009-09-17 (public-html@w3.org from September 2009)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Thu, 17 Sep 2009 17:17:21 +0100
To: Manu Sporny <msporny@digitalbazaar.com>
CC: HTMLWG WG <public-html@w3.org>, RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>
Message-ID: <4AB26111.2050206@cam.ac.uk>

Manu Sporny wrote:
> The 3rd draft of the HTML+RDFa specification has been released and is
> available here:
> 
> http://html5.digitalbazaar.com/specs/rdfa.html

I found a few things to comment on while reading through this. (I'm 
mostly ignoring any high-level issues about the design of the language, 
and just looking at how it's being specified.)

First, the more substantive issues:

"a tree-based model" -- is that tree-based model defined anywhere? (What 
data types does it consist of? e.g. is an attribute just a name string 
plus a value string, or is it a namespace URI plus a local name plus a 
value string? Does an element just have a list of attributes, or does it 
also have a separate list of namespace declarations? The questions seem 
important in determining how a DOM or Infoset or XOM tree or SAX stream 
etc maps onto the tree-based model.)

I'm not sure what the point of section 2.1 (Modifying the Input 
Document) is. Section 2 already says HTML5 defines how to get from a 
document to a DOM, and says it's obvious how to get from a DOM to RDFa's 
tree-based model, so the first paragraph of section 2.1 seems redundant. 
As an underlying concept throughout the HTML5 spec, implementations are 
free to do whatever they want as long as the output is exactly the same 
as what HTML5 specifies, so it's already true that an HTML+RDFa 
implementation could internally use e.g. SAX as long as the output is 
equal to what's specified, so the second paragraph of 2.1 seems unnecessary.

It might still be useful to explicitly state that underlying concept, 
e.g. "Note: Although HTML5 is specified in terms of a DOM, HTML+RDFa 
processors are free to use any implementation approach as long as their 
RDF output matches the output specified in this document." (I think 
that's more general than what's in 2.1, since it doesn't talk about 
details like HTML5 parser data structures - all that's important is the 
input and output. Also it avoids questions about what "a data structure 
equivalent to the HTML5 or XHTML5 DOM" really means (is a stream of SAX 
events an equivalent data structure? (is it even a data structure?)))

"There may be a link element contained in the head element that contains 
profile for the the rel attribute and http://www.w3.org/1999/xhtml/vocab 
for the href attribute." -- that's a broken definition, e.g. it doesn't 
seem to allow <link rel="PROFILE" href=...> or <link rel="profile next" 
href=...>. It also conflicts with section 5.2. This line should probably 
just be removed, since section 5.2 is enough to allow documents to use 
profile.
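
To illustrate why the quoted sentence is too narrow, here is a minimal 
Python sketch (not taken from the draft; both function names are 
hypothetical). It contrasts a literal reading of "contains profile for 
the rel attribute" with treating rel as a set of space-separated tokens 
compared ASCII case-insensitively, which is how HTML5 treats link types.

```python
import re
import string

# ASCII-only lowercasing, since link types are compared ASCII case-insensitively.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def naive_has_profile(rel):
    # Literal reading of the draft sentence: the attribute value is "profile".
    return rel == "profile"


def token_has_profile(rel):
    # Split on the HTML space characters (space, tab, LF, FF, CR),
    # then compare each token ASCII case-insensitively.
    tokens = [t for t in re.split(r"[ \t\n\f\r]+", rel) if t]
    return "profile" in {t.translate(_ASCII_LOWER) for t in tokens}


for value in ("profile", "PROFILE", "profile next", "next"):
    print(repr(value), naive_has_profile(value), token_has_profile(value))
```

Under the literal reading only the first value matches; under the 
token-based reading the first three do.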

"The lang attribute must be processed in the same manner as the xml:lang 
attribute is [...]" -- that is confusing since the xml:lang attribute 
(in HTML5 text/html) is not processed in the same manner as in XHTML. 
(For example, <div xml:lang="en">...</div> in text/html has no 
language). It would be clearer to replace this with something like 
"Where the XHTML+RDFa specification refers to the xml:lang attribute, 
the language of an element must instead be determined as in the section 
titled The lang and xml:lang attributes in the HTML5 specification."

"When generating literals of type XMLLiteral, the processor must ensure 
that the output XMLLiteral is a namespace well-formed XML fragment." -- 
I don't see why this requirement needs to be explicitly specified for 
HTML+RDFa, or described with such verbosity, given that XHTML+RDFa 
doesn't specify it explicitly. Any processor generating RDF triples must 
generate valid triples, which means XMLLiterals must have a lexical form 
that is exclusive canonical XML (hence namespace well-formed etc), and 
the RDFa spec does not need to repeat any of those requirements.

Given RDF's use of exclusive canonical XML, there is only a single valid 
serialisation of a given input tree. So I think there's no need for 
HTML+RDFa to discuss various ways of getting a value - it just needs to 
define what that single valid serialisation is.

So I think the whole section could simply require:

   "When generating literals of type XMLLiteral, the lexical form of the 
literal must be equal to the result of applying the [Coercing an HTML 
DOM into an infoset] rules to the child nodes of the current element, 
then serialising the resulting nodes to an octet stream with the 
[exclusive XML canonicalization method] (with comments, with empty 
InclusiveNamespaces PrefixList), then decoding the octet stream as UTF-8 
into a Unicode string."

(with some non-normative explanations of the implications, and examples, 
etc, but no other conformance requirements).

"Hyperlink" -- <link rel=profile> sounds more like an "External 
Resource", since it augments the current document.

http://whatwg.org/html5#linkTypes defines the link-type table to be 
non-normative. Is the link type table extension in HTML+RDFa meant to be 
non-normative or normative? If the former, the Hyperlink/External 
Resource thing needs to be specified in normative text and not just the 
table.

"For documents conforming to this specification, attributes with names 
that have the case insensitive prefix "xmlns:" are conforming in both 
HTML5 and XHTML5." -- is it intentional that <div XMLNS:foo="..."/> in 
XHTML will be conforming? Surely that markup would break any RDFa 
processors, because they don't do case-insensitive attribute lookups in 
XHTML, so it should not be permitted.

Also, attribute names in HTML5 are always lowercase (ignoring script 
modifications etc), because the concept of "attribute name" refers to 
the name in the DOM (not the bytes in the text/html syntax), and the 
parser converts names to lowercase. So only lowercase attribute names 
need to be made conforming.
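
The lowercasing can be seen with a small Python sketch. Note the 
assumption: Python's html.parser is not an HTML5-conforming parser and 
is used here only as a stand-in, because it also reports attribute names 
in lowercase. The class name is hypothetical.

```python
from html.parser import HTMLParser


class AttrNameCollector(HTMLParser):
    """Records attribute names exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the names.
        self.names.extend(name for name, _value in attrs)


collector = AttrNameCollector()
collector.feed('<div XMLNS:foo="http://example.org/" about="#me"></div>')
collector.close()
print(collector.names)
```

The uppercase XMLNS:foo in the source text is reported as xmlns:foo, so 
a rule written against attribute names only ever sees the lowercase 
form.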

Also, according to this, attributes like xmlns:="..." and xmlns:0="..." 
will be conforming in HTML5, but authors will be confused if they use 
such attributes (because they'll try to use the CURIE "0:foo" and it 
will be ignored because it's invalid), so they should be non-conforming 
to alert authors to their errors. Only attributes whose names match the 
PrefixedAttName production from XML Namespaces should be conforming.
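
A rough Python sketch of such a check follows. It is an approximation 
under a stated assumption: the real PrefixedAttName production is 
"xmlns:" followed by an NCName, and NCName permits many non-ASCII 
characters, whereas the pattern below covers only the ASCII subset. The 
function name is hypothetical.

```python
import re

# ASCII-only approximation of: PrefixedAttName ::= 'xmlns:' NCName
# NCName must start with a letter or underscore and must not contain a colon.
_PREFIXED_ATT_NAME = re.compile(r"xmlns:[A-Za-z_][A-Za-z0-9._-]*")


def is_prefixed_att_name(attr_name):
    # fullmatch ensures the whole attribute name fits the pattern.
    return _PREFIXED_ATT_NAME.fullmatch(attr_name) is not None


for name in ("xmlns:foo", "xmlns:", "xmlns:0", "xmlns:a:b"):
    print(name, is_prefixed_att_name(name))
```

With this check, xmlns:foo is accepted while xmlns:, xmlns:0 and 
xmlns:a:b are rejected, so a conformance checker could flag them.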



And some minor issues about wording etc:

"RDF in XHTML: Syntax and Processing" -- s/RDF/RDFa/ (in Abstract, and 
again in History).

"The latest stable version of the editor's draft of this specification 
is always available on [the W3C CVS server]. The [latest editor's 
working copy] (which may contain unfinished text in the process of being 
prepared) is also available." -- the first link is to an old version 
(July) that looks much less stable or complete than this version; the 
second link is a 404.

"By design, the possibility of [...] was squarely in the realm of 
possibility." -- seems tautological; maybe remove the "the possibility of".

"heeding the minor changes in this section" -- s/section/document/ (or 
specification or something).

"Section 5.5: Sequence, of the [XHTML+RDFa] specification defines [...]" 
-- remove unnecessary comma.

"The HTML5 and XHTML5 DOM, or equivalent data structure, should be used 
as input to the RDFa processing rules." -- s/should/must/ (I don't see 
any reason why someone ought to be allowed to violate this requirement, 
and still claim to be a conforming HTML+RDFa processor).

"element nesting issues in HTML documents may be corrected" -- 
s/may/can/ (or something similar) (use of normative RFC2119 keyword 
"may" in a non-normative section seems undesirable).

"Any mechanism that generates a data structure equivalent to the HTML5 
or XHTML5 DOM, such as the html5lib library" -- it seems weird for a 
specification to refer to a specific implementation. The reference 
doesn't even provide any information to the reader, unless they already 
know the details of html5lib tree builders.

"Any mechanism [...] may be used" -- s/may/can/ (same reason as above).

"a XML mode document" -- s/a/an/

"In future versions of RDFa, the value of the profile may trigger 
different processing rules in RDFa Processors." -- s/may/might/ (I don't 
think that's meant to be a normative conformance requirement).

"While it is specified that HTML5 must preserve these attributes in the 
DOM" -- s/must/will/ (I don't think that's meant to be a normative 
conformance requirement here, and it's confusing to use RFC2119 keywords 
when referring to the consequences of requirements in other specs).

-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Thursday, 17 September 2009 16:18:01 UTC