Re: FPWD Review Request: HTML+RDFa from Philip Taylor on 2009-09-05 (public-rdf-in-xhtml-tf@w3.org from September 2009)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Sat, 05 Sep 2009 13:00:31 +0100
To: Shane McCarron <shane@aptest.com>
CC: Mark Birbeck <mark.birbeck@webbackplane.com>, Manu Sporny <msporny@digitalbazaar.com>, HTML WG <public-html@w3.org>, RDFa Developers <public-rdf-in-xhtml-tf@w3.org>
Message-ID: <4AA252DF.7070808@cam.ac.uk>

Shane McCarron wrote:
> I would not object to providing examples of extraction algorithms as guidance.  
> We already do this for CURIEs somewhere...  But I do not think it is a good idea 
> to normatively define code.

I agree the spec shouldn't normatively define code. When I said it 
"needs to define the prefix mapping extraction algorithm in precise 
detail" I was thinking of something much more abstract than real code, 
though it should still be clear and unambiguous on all the relevant details.

Currently I don't see anything in the specs other than vague references 
to the Namespaces in XML spec ("Since CURIE mappings are created by 
authors via the XML namespace syntax [XMLNS] an RDFa processor MUST take 
into account the hierarchical nature of prefix declarations" in 
rdfa-syntax, "CURIE prefix mappings specified using xmlns: must be 
processed using the rules specified in the [Namespaces in XML] 
Recommendation" in HTML5+RDFa). I want the specs to be clearer about 
exactly which rules are applied and how they are adapted for non-XML 
content, because otherwise I can produce lots of test cases where I 
can't work out what the spec says the output must be. (I don't care how 
an implementation computes the output, I just want to know what the 
output is.)

> The processing model in the current RDFa Syntax 
> Recommendation is sufficiently precise for anyone to understand what must be 
> done in the face of both conforming and non-conforming input.  The edge 
> conditions people keep bringing up (what happens if xmlns:="" is defined, etc) 
> are all degenerate cases of the general case of prefix declaration that does not 
> match the syntax definition.  If it doesn't match the syntax definition, it is 
> illegal.

Which syntax definition? In http://www.w3.org/TR/rdfa-syntax/ I can only 
find a definition of the CURIE syntax, which is not relevant to the 
issue of handling xmlns:="...".

(In most cases the CURIE syntax restriction is sufficient - you can't 
have rel="0:test" (it will just be ignored) so it doesn't really matter 
how xmlns:0="..." was processed. But you can write rel=":test", so it 
matters how xmlns:="..." interacts with that. And you can write 
rel="ex:test" and xmlns:ex="" (empty value, illegal in Namespaces in XML 
1.0), so it matters how that is handled too.)
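
As a rough illustration of why only some of these cases are settled by 
the CURIE syntax alone, here is a minimal Python sketch that splits a 
value at the first colon and checks the prefix. The name split_curie and 
the ASCII-only NCName pattern are assumptions made for this example (the 
real NCName production allows many more characters), and the reference 
part is not validated at all.

```python
import re

# ASCII-only approximation of NCName (assumption: the real production in
# Namespaces in XML also permits many non-ASCII characters).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def split_curie(value):
    """Return (prefix, reference), or None if the prefix is unusable.

    prefix is None when the value has no colon, and "" for values such
    as ":test". The reference is passed through unchecked.
    """
    if ":" not in value:
        return (None, value)
    prefix, reference = value.split(":", 1)
    if prefix == "" or NCNAME.match(prefix):
        return (prefix, reference)
    return None


for value in ["0:test", ":test", "ex:test"]:
    print(value, "->", split_curie(value))
```

In this sketch "0:test" is rejected before any prefix mapping is 
consulted, while ":test" and "ex:test" both survive, so their result 
still depends on how xmlns:="..." and xmlns:ex="" were handled.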

Presumably http://www.w3.org/TR/REC-xml-names/#NT-PrefixedAttName is the 
relevant syntax definition for namespace prefix declarations, but 
rdfa-syntax doesn't explicitly refer to that. It's implicit when using 
RDFa in XHTML, because XHTML is based on top of xml-names and you'll get 
a well-formedness error if you try writing these invalid things, but 
that doesn't automatically apply when using HTML instead.

Should the non-syntactic xml-names constraints be required too? e.g. 
what triples should I get if I write the following HTML:

   <p xmlns:xml="http://example.org/" property="xml:test">Test</p>

   <p xmlns:xmlns="http://www.w3.org/2000/xmlns/" property="xmlns:test">Test</p>

   <p xmlns:ex="http://www.w3.org/2000/xmlns/" property="ex:test">Test</p>

(which all violate the Namespace Constraints in xml-names)? I presume 
these should all be ignored too, but implementers have not been doing 
that, so evidently it is not sufficiently obvious.
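
For reference, the Namespace Constraints in question are small enough to 
sketch directly. The following Python function is only an illustration 
of those constraints as stated in xml-names; the function name is 
hypothetical, and whether an RDFa-in-HTML processor should apply this 
check at all is exactly the open question.

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"
XMLNS_NS = "http://www.w3.org/2000/xmlns/"


def violates_namespace_constraints(prefix, uri):
    """Check one xmlns:prefix="uri" declaration against xml-names."""
    if prefix == "xmlns":
        # The xmlns prefix must not be declared at all.
        return True
    if prefix == "xml":
        # The xml prefix may only be bound to its own namespace name.
        return uri != XML_NS
    # No other prefix may be bound to either reserved namespace name.
    return uri in (XML_NS, XMLNS_NS)


# The three declarations from the examples above.
for prefix, uri in [("xml", "http://example.org/"),
                    ("xmlns", XMLNS_NS),
                    ("ex", XMLNS_NS)]:
    print(prefix, uri, violates_namespace_constraints(prefix, uri))
```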

(I've updated http://philip.html5.org/demos/rdfa/results.html with some 
of these cases, to show the output of current implementations. The 
pass/fail statuses are largely irrelevant and probably wrong, but the 
table shows the actual output of each implementation on mouse-over.)


> If it is illegal, it is ignored.  What more does one need in a 
> normative spec? 

For RDFa-in-HTML, I'd like it to explicitly state what "illegal" means, 
e.g. whether those Namespace Constraints should be applied in 
non-XML-based versions of HTML. It doesn't need to redefine things that 
are defined elsewhere, but it should explicitly refer to concepts like 
PrefixedAttName and Namespace Constraints that are being used by the 
RDFa-in-HTML processing model, because I don't think they are obvious 
otherwise.


For both RDFa-in-HTML and RDFa-in-XHTML, I'd also like it to state 
slightly more clearly what "ignored" means:

The "CURIE and URI Processing" section says "any value that is not a 
'curie' according to the definition in the section CURIE Syntax 
Definition MUST be ignored". The "Sequence" section refers to e.g. "the 
URI from @about, if present, obtained according to the section on CURIE 
and URI Processing", and I think it's clear it should be considered 
not-present if it's not a valid CURIE. So <span about="[bogus:bogus]" 
src="http://example.org/"> should ignore @about and use @src, and that's 
all okay. (Some implementations still get this wrong, though.)

But it also says "if @property is not present then the [skip element] 
flag is set to 'true'" - is an invalid CURIE meant to be considered 
not-present here too (even though there's no reference to the CURIE and 
URI Processing section)? i.e. should the output from:

     <p about="http://example.com/" rel="next">
       <span property="bogus:bogus">
         <span about="http://example.net/">Test</span>
       </span>
     </p>

include the triple '<http://example.com/> 
<http://www.w3.org/1999/xhtml/vocab#next> <http://example.net/>' or not? 
Implementations differ.

It also says "If the [current element] contains no @rel or @rev 
attribute" - is the attribute meant to be ignored (acting as if the 
element didn't have the attribute at all) if it contains only invalid 
CURIEs (or if it contains no values)? i.e. should the output from:

   <p xmlns:ex="http://example.org/" rel="bogus:bogus" property="ex:test" href="http://example.org/href">Test</p>

include the triple '<http://example.org/href> <http://example.org/test> 
"Test".' or '<> <http://example.org/test> "Test".'? Implementations 
again differ.
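
The two outputs correspond to two readings of "contains no @rel", which 
can be sketched as follows. This is a heavily simplified, hypothetical 
model of subject selection for a single element (it ignores @typeof, 
@rev, and the special handling of head and body), and the function and 
parameter names are invented for the illustration. It is not the 
processing model itself.

```python
def property_subject(attrs, parent_subject, rel_counts_as_present):
    """Pick the subject of a @property triple under one of two readings.

    rel_counts_as_present=True:  a @rel holding only unusable CURIEs
                                 still counts as present.
    rel_counts_as_present=False: such a @rel is treated as absent.
    """
    if rel_counts_as_present and "rel" in attrs:
        # With @rel present, only @about/@src set the subject;
        # @resource/@href would name the object instead.
        candidates = ("about", "src")
    else:
        candidates = ("about", "src", "resource", "href")
    for name in candidates:
        if name in attrs:
            return attrs[name]
    return parent_subject


attrs = {"rel": "bogus:bogus", "property": "ex:test",
         "href": "http://example.org/href"}
document = ""  # stands for the document itself, written <> above

print(repr(property_subject(attrs, document, True)))
print(repr(property_subject(attrs, document, False)))
```

Under the first reading the subject is the document (the '<>' triple); 
under the second it is http://example.org/href.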

The test suite should be extended to cover these cases, in order to 
detect these differences between implementations (because at least one 
must be buggy), if it doesn't already (I haven't checked). But I think 
the RDFa Syntax spec should also be updated to be clear about the 
expected behaviour, because I've tried to read it carefully and I'm 
still not confident enough to know what the output should be.


> I could come up with a nearly infinite collection of illegal declarations for 
> each of the attributes that are addressed in the RDFa Syntax specification.  
> However, they would all fall into the same class - illegal.  When you are doing 
> testing, you don't do "exhaustive" or even "thorough" testing of anything that 
> is sufficiently complex.  It is impossible.  Instead, you do "equivalence class 
> testing".  Identify a couple of use cases from each class of processing for a 
> given interface, test those, and trust that the other values in the class will 
> behave the same way.  For example, I would not test every single possible prefix 
> name when exercising a CURIE processing library.  It is not just impossible, it 
> is also uninteresting.  I would test some good ones and make sure they work.  I 
> would test some bad ones and make sure they are ignored.  Then I would move on. 

I would want to write tests that find bugs. There are lots of different 
classes of bugs when handling illegal input - you might forget to check 
the prefix is non-zero length, or forget to check it's an NCName, or 
forget to check the value is non-empty, or forget to check the value is 
not the xml or xmlns URI, or you might use the 4th Edition of XML 
instead of the 5th, etc. There are dozens of mistakes that people can 
(and apparently do) make when implementing this. Those mistakes are not 
all equivalent, so they should each be tested as separate equivalence 
classes, and it needs a lot more than a few tests of illegal input.
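
To make the point concrete, here is a small Python sketch with one 
representative input per bug class and a deliberately buggy checker. 
Everything in it is hypothetical: the class names, the ASCII-only NCName 
approximation, and the buggy_accepts function, which checks the prefix 
but forgets the value entirely.

```python
import re

# ASCII-only approximation of NCName (assumption, as a simplification).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")

# One representative illegal declaration (prefix, value) per bug class.
CLASSES = {
    "empty prefix": ("", "http://example.org/"),
    "non-NCName prefix": ("0", "http://example.org/"),
    "empty value": ("ex", ""),
    "reserved namespace URI": ("ex", "http://www.w3.org/2000/xmlns/"),
}


def buggy_accepts(prefix, uri):
    """Deliberately buggy: validates the prefix, never looks at uri."""
    return bool(NCNAME.match(prefix))


def undetected_classes(accepts):
    """Names of the classes whose illegal input is wrongly accepted."""
    return [name for name, (prefix, uri) in CLASSES.items()
            if accepts(prefix, uri)]


print(undetected_classes(buggy_accepts))
```

A test suite containing only the "non-NCName prefix" case would pass 
this implementation, even though it mishandles two other classes.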

(I agree that each class doesn't need to be tested exhaustively - e.g. a 
few non-NCName prefixes are enough to detect bugs if implementations 
aren't correctly checking for NCNames, and there's no need to test 
thousands of non-NCNames because that's very unlikely to find any more 
bugs. But I don't think anyone's ever proposed testing thousands of 
non-NCNames, so I presume that's not really what you're concerned about.)

-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Saturday, 5 September 2009 12:01:15 UTC