Re: FPWD Review Request: HTML+RDFa from Shane McCarron on 2009-09-08 (public-html@w3.org from September 2009)

From: Shane McCarron <shane@aptest.com>
Date: Tue, 08 Sep 2009 11:19:46 -0500
To: Philip Taylor <pjt47@cam.ac.uk>
CC: Mark Birbeck <mark.birbeck@webbackplane.com>, Manu Sporny <msporny@digitalbazaar.com>, HTML WG <public-html@w3.org>, RDFa Developers <public-rdf-in-xhtml-tf@w3.org>
Message-ID: <4AA68422.9050200@aptest.com>

Philip,

Thanks for taking the time to respond so thoroughly.  In general I agree 
with Mark that the RDFa Syntax Recommendation could have done a better 
job of tightening its relationship to the Namespaces in XML 
Recommendation.  I also agree that if there are pathological cases that 
are *important* they should be covered by a test suite.  What is 
"pathological" and what is "important" are obviously subjective, and 
test suite authors spend lots of time debating such things.

I have some detailed comments in line.

Philip Taylor wrote:
> Shane McCarron wrote:
>> I would not object to providing examples of extraction algorithms as 
>> guidance.  We already do this for CURIEs somewhere...  But I do not 
>> think it is a good idea to normatively define code.
>
> I agree the spec shouldn't normatively define code. When I said it 
> "needs to define the prefix mapping extraction algorithm in precise 
> detail" I was thinking of something much more abstract than real code, 
> though it should still be clear and unambiguous on all the relevant 
> details.
>
> Currently I don't see anything in the specs other than vague 
> references to the Namespaces in XML spec ("Since CURIE mappings are 
> created by authors via the XML namespace syntax [XMLNS] an RDFa 
> processor MUST take into account the hierarchical nature of prefix 
> declarations" in rdfa-syntax, "CURIE prefix mappings specified using 
> xmlns: must be processed using the rules specified in the [Namespaces 
> in XML] Recommendation" in HTML5+RDFa), and I want it to be clearer 
> about exactly which rules are applied and how they are adapted for 
> non-XML content, because otherwise I can produce lots of test cases 
> where I can't work out what the spec says the output must be. (I don't 
> care how an implementation computes the output, I just want to know 
> what the output is.)
Well... Hmm... My opinion differs from yours on this.  That reference, 
while in prose, is not vague at all.  It is a normative reference to a 
related W3C Recommendation that defines precisely the syntactic 
requirements for what is and is not a legal xmlns: attribute declaration 
(see, for example, section 3 - Declaring Namespaces).  As an implementor 
of RDFa Syntax, it is my responsibility to ensure I am either 1) using a 
library to parse my input that already knows about the requirements of 
the Namespaces in XML Recommendation, or 2) implement those requirements 
myself.  Either way, the requirements are clear (and yes, my 
implementation is somewhat broken).

Further, since the RDFa Syntax Recommendation is only concerned with 
the "syntax" of those prefix declarations, and has no semantic 
requirements beyond that for the use of XML Namespaces, it should be 
clear that the parts of the Namespaces in XML Recommendation that deal 
with how XML Namespaces affect the declaration of elements and 
attributes are irrelevant for an RDFa Syntax-conforming processor.

(Note - I would be very comfortable adding such language in the RDFa 
Syntax Errata document immediately.  I will bring it up at the next Task 
Force call.)

>
>> The processing model in the current RDFa Syntax Recommendation is 
>> sufficiently precise for anyone to understand what must be done in 
>> the face of both conforming and non-conforming input.  The edge 
>> conditions people keep bringing up (what happens if xmlns:="" is 
>> defined, etc) are all degenerate cases of the general case of prefix 
>> declaration that does not match the syntax definition.  If it doesn't 
>> match the syntax definition, it is illegal.
>
> Which syntax definition? In http://www.w3.org/TR/rdfa-syntax/ I can 
> only find a definition of the CURIE syntax, which is not relevant to 
> the issue of handling xmlns:="...".
True.  That's what the XML Namespaces Recommendation is for.  And it 
tightly defines the syntax.  Anything that does not conform to that 
syntax is not a legal CURIE prefix declaration, and therefore would be 
ignored.
>
> (In most cases the CURIE syntax restriction is sufficient - you can't 
> have rel="0:test" (it will just be ignored) so it doesn't really 
> matter how xmlns:0="..." was processed. But you can write rel=":test", 
> so it matters how xmlns:="..." interacts with that. And you can write 
> rel="ex:test" and xmlns:ex="" (empty value, illegal in Namespaces in 
> XML 1.0), so it matters how that is handled too.)
The XML Namespaces Recommendation clearly says what is illegal, 
including xmlns:="...".  The RDFa Syntax Recommendation clearly states 
that there is no way to define a local default CURIE prefix mapping, and 
that rel=":next" is interpreted in the context of the XHTML Vocabulary 
URI.  So no, I don't think there is any room for misinterpretation or 
difference among implementations here.
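
For illustration only, the syntactic part of this argument might be 
sketched in Python as follows.  This is a rough sketch and not text from 
either Recommendation: the function name and the dictionary input are 
hypothetical, and the NCName pattern is an ASCII-only simplification of 
the real production, which admits many more Unicode characters.

```python
import re

# Simplified, ASCII-only approximation of the NCName production.
# Assumption: the real production in Namespaces in XML is much wider.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def extract_prefix_mappings(attributes):
    """Return {prefix: uri} for the xmlns:* attributes that look legal.

    `attributes` is a hypothetical dict of attribute name -> value for
    one element.  Illegal declarations are ignored, not reported.
    """
    mappings = {}
    for name, value in attributes.items():
        if not name.startswith("xmlns:"):
            continue  # not a prefix declaration at all
        prefix = name[len("xmlns:"):]
        if not NCNAME.fullmatch(prefix):
            continue  # covers xmlns:="..." and xmlns:0="..."
        if value == "":
            continue  # empty value, illegal in Namespaces in XML 1.0
        mappings[prefix] = value
    return mappings
```

With this sketch, an element carrying xmlns:="...", xmlns:0="..." and 
xmlns:ex="" contributes no mappings at all, so a later rel="ex:test" 
has nothing to resolve against.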

>
>
> Presumably http://www.w3.org/TR/REC-xml-names/#NT-PrefixedAttName is 
> the relevant syntax definition for namespace prefix declarations, but 
> rdfa-syntax doesn't explicitly refer to that. It's implicit when using 
> RDFa in XHTML, because XHTML is based on top of xml-names and you'll 
> get a well-formedness error if you try writing these invalid things, 
> but that doesn't automatically apply when using HTML instead.
RDFa Syntax DOES explicitly, normatively incorporate the XML Namespaces 
Recommendation.  It also explicitly, normatively says that the XML 
Namespace *syntax* is what is used to declare CURIE prefix mappings.  I 
wouldn't mind explaining in an erratum that the syntax is defined at 
http://www.w3.org/TR/REC-xml-names/#NT-PrefixedAttName - would that help 
address your concerns?
>
> Should the non-syntactic xml-names constraints be required too? e.g. 
> what triples should I get if I write the following HTML:
>
>   <p xmlns:xml="http://example.org/" property="xml:test">Test</p>
>
>   <p xmlns:xmlns="http://www.w3.org/2000/xmlns/"
>      property="xmlns:test">Test</p>
>
>   <p xmlns:ex="http://www.w3.org/2000/xmlns/" property="ex:test">Test</p>
>
> (which all violate the Namespace Constraints in xml-names)? I presume 
> these should all be ignored too, but implementers have not been doing 
> that, so evidently it is not sufficiently obvious.
Such prefix declarations are illegal, and therefore MUST be ignored by a 
conforming RDFa Processor.  Do all processors do so today?  I doubt it.  
Could they?  Of course.  Should they?  Of course.  Would it break 
anything in the wild if they started doing so tomorrow?  No way.  These 
are good but pathological cases.  I would be happy to add test cases for 
them.  But in the end, whether we test for these cases or not in no way 
changes the definition of RDFa as it was published.  We have these 
constraints by normative reference already.
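
As a hedged sketch of what applying those Namespace Constraints to a 
single prefix declaration could look like (the function name is 
hypothetical, and the check assumes the prefix and value have already 
passed the syntactic tests):

```python
# Namespace names reserved by Namespaces in XML.
XML_NS = "http://www.w3.org/XML/1998/namespace"
XMLNS_NS = "http://www.w3.org/2000/xmlns/"


def violates_namespace_constraints(prefix, uri):
    """Return True if declaring `prefix` as `uri` breaks a constraint."""
    if prefix == "xmlns":
        return True  # the xmlns prefix must not be declared at all
    if prefix == "xml":
        return uri != XML_NS  # xml may only be bound to its own name
    # No other prefix may be bound to either reserved namespace name.
    return uri in (XML_NS, XMLNS_NS)
```

All three of the quoted examples violate a constraint under this 
sketch, so a processor following it would drop the declaration and 
leave the corresponding @property value without a usable prefix.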
>
>
>> If it is illegal, it is ignored.  What more does one need in a 
>> normative spec? 
>
> For RDFa-in-HTML, I'd like it to explicitly state what "illegal" 
> means, e.g. whether those Namespace Constraints should be applied in 
> non-XML-based versions of HTML. It doesn't need to redefine things 
> that are defined elsewhere, but it should explicitly refer to concepts 
> like PrefixedAttName and Namespace Constraints that are being used by 
> the RDFa-in-HTML processing model, because I don't think they are 
> obvious otherwise.
I agree that all the prefix syntax declaration constraints should apply 
to both the XHTML and HTML versions of RDFa.  I think they do already 
because of the normative inclusion of the XML Namespaces Recommendation, 
but if you think it would help clarify things I am happy to 1) add an 
erratum as I mentioned above, and 2) support adding some explicit text to 
the RDFa-in-HTML working draft.
>
>
> For both RDFa-in-HTML and RDFa-in-XHTML, I'd also like it to slightly 
> more clearly state what "ignored" means:
>
> The "CURIE and URI Processing" section says "any value that is not a 
> 'curie' according to the definition in the section CURIE Syntax 
> Definition MUST be ignored". The "Sequence" section refers to e.g. 
> "the URI from @about, if present, obtained according to the section on 
> CURIE and URI Processing", and I think it's clear it should be 
> considered not-present if it's not a valid CURIE. So <span 
> about="[bogus:bogus]" src="http://example.org/"> should ignore @about 
> and use @src, and that's all okay. (Some implementations still get 
> this wrong, though.)
I think your interpretation is the correct one, and I think there is a 
test case to this effect already.  If it were not ignored, then @about 
would be interpreted as @about="" and that would refer to the current 
document and supersede @src.  Michael or Manu, can you confirm there is 
already a test case for this?
>
> But it also says "if @property is not present then the [skip element] 
> flag is set to 'true'" - is an invalid CURIE meant to be considered 
> not-present here too (even though there's no reference to the CURIE 
> and URI Processing section)? i.e. should the output from:
>
>     <p about="http://example.com/" rel="next">
>       <span property="bogus:bogus">
>         <span about="http://example.net/">Test</span>
>       </span>
>     </p>
>
> include the triple '<http://example.com/> 
> <http://www.w3.org/1999/xhtml/vocab#next> <http://example.net/>' or 
> not? Implementations differ.
The rules are to be applied consistently.  If there are no legal values 
in an attribute declaration, an implementation MUST act as if that 
attribute declaration were not present at all.  Again, I believe there 
are test cases that do this now, and it surprises me that you say 
implementations differ on this.  In the case of @property, I would 
support adding errata to clarify that this behaves as @about behaves if 
that would satisfy your concern.
>
> It also says "If the [current element] contains no @rel or @rev 
> attribute" - is the attribute meant to be ignored (acting as if the 
> element didn't have the attribute at all) if it contains only invalid 
> CURIEs (or if it contains no values)? i.e. should the output from:
>
>   <p xmlns:ex="http://example.org/" rel="bogus:bogus"
>      property="ex:test" href="http://example.org/href">Test</p>
>
> include the triple '<http://example.org/href> 
> <http://example.org/test> "Test".' or '<> <http://example.org/test> 
> "Test".'? Implementations again differ.
As above - all illegal attribute interpretations should be consistent 
throughout.  @rel or @rev with no legal values MUST be treated as if the 
attribute were not present at all.
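
One way to picture that blanket rule is the following minimal Python 
sketch.  It is an illustration under stated assumptions, not the 
processing model itself: the function name, the `mappings` dict and the 
`reserved` parameter are hypothetical, reserved-word matching is 
simplified, and safe CURIEs such as about="[ex:test]" are not handled.

```python
XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"


def legal_values(raw, mappings, reserved=()):
    """Resolve an attribute's CURIE list; return None if nothing is legal.

    `raw` is the attribute value, or None when the attribute is missing.
    A result of None means "act as if the attribute were not present".
    """
    if raw is None:
        return None  # the attribute really is absent
    resolved = []
    for token in raw.split():
        prefix, colon, reference = token.partition(":")
        if not colon:
            if token in reserved:  # e.g. rel="next"
                resolved.append(XHTML_VOCAB + token)
        elif prefix == "":  # e.g. rel=":next"
            resolved.append(XHTML_VOCAB + reference)
        elif prefix in mappings:
            resolved.append(mappings[prefix] + reference)
        # Anything else is not a legal value and is silently dropped.
    return resolved or None
```

A caller would then test the result against None rather than testing 
whether the attribute exists, so rel="bogus:bogus", rel="" and a missing 
@rel all take the same branch.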
>
> The test suite should be extended to cover these cases, in order to 
> detect these differences between implementations (because at least one 
> must be buggy), if it doesn't already (I haven't checked). But I think 
> the RDFa Syntax spec should also be updated to be clear about the 
> expected behaviour, because I've tried to read it carefully and I'm 
> still not confident enough to know what the output should be.
Understood.  We will discuss this at a Task Force meeting and see if 
there is a way to introduce a blanket statement via the errata.  
However, again, I believe there is no conflict in the spec as written 
currently.  There is ALWAYS room for misinterpretation in every spec.  
We can tighten the language and attempt to make the language more 
consistent. 
>
>
>> I could come up with a nearly infinite collection of illegal 
>> declarations for each of the attributes that are addressed in the 
>> RDFa Syntax specification.  However, they would all fall into the 
>> same class - illegal.  When you are doing testing, you don't do 
>> "exhaustive" or even "thorough" testing of anything that is 
>> sufficiently complex.  It is impossible.  Instead, you do 
>> "equivalence class testing".  Identify a couple of use cases from 
>> each class of processing for a given interface, test those, and trust 
>> that the other values in the class will behave the same way.  For 
>> example, I would not test every single possible prefix name when 
>> exercising a CURIE processing library.  It is not just impossible, it 
>> is also uninteresting.  I would test some good ones and make sure 
>> they work.  I would test some bad ones and make sure they are 
>> ignored.  Then I would move on. 
>
> I would want to write tests that find bugs. There are lots of 
> different classes of bugs when handling illegal input - you might 
> forget to check the prefix is non-zero length, or forget to check it's 
> an NCName, or forget to check the value is non-empty, or forget to 
> check the value is not the xml or xmlns URI, or you might use the 4th 
> Edition of XML instead of the 5th, etc. There are dozens of mistakes 
> that people can (and apparently do) make when implementing this. Those 
> mistakes are not all equivalent, so they should each be tested as 
> separate equivalence classes, and it needs a lot more than a few tests 
> of illegal input.
>
> (I agree that each class doesn't need to be tested exhaustively - e.g. 
> a few non-NCName prefixes are enough to detect bugs if implementations 
> aren't correctly checking for NCNames, and there's no need to test 
> thousands of non-NCNames because that's very unlikely to find any more 
> bugs. But I don't think anyone's ever proposed testing thousands of 
> non-NCNames, so I presume that's not really what you're concerned about.)
>

No, I'm not.  Poor testing is my personal soap box.  Sorry if I came off 
as attacking your testing methodology.  In general, I believe it is 
important to always identify each equivalence class.  There are several 
in the case of the XML Namespace prefix syntax, and it is a good idea to 
exercise each of them.  There are several in the case of CURIE 
interpretation in attribute values, and those should be exercised as well. 
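
To make the idea concrete, here is a small, hedged Python sketch of an 
equivalence-class table for prefix declarations.  Everything in it is an 
assumption for illustration: the class labels, the sample values, and 
the `is_legal_declaration` checker (which uses an ASCII-only 
approximation of NCName) are not taken from the RDFa test suite.

```python
import re

NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")  # simplified, ASCII only
XML_NS = "http://www.w3.org/XML/1998/namespace"
XMLNS_NS = "http://www.w3.org/2000/xmlns/"
EX = "http://example.org/"


def is_legal_declaration(prefix, uri):
    """Hypothetical checker under test: is xmlns:prefix="uri" usable?"""
    if not NCNAME.fullmatch(prefix) or uri == "":
        return False
    if prefix == "xmlns":
        return False
    if prefix == "xml":
        return uri == XML_NS
    return uri not in (XML_NS, XMLNS_NS)


# One entry per equivalence class: expected result plus a couple of
# representatives, rather than an exhaustive list of inputs.
EQUIVALENCE_CLASSES = {
    "legal declaration": (True, [("ex", EX), ("foaf", "http://xmlns.com/foaf/0.1/")]),
    "zero-length prefix": (False, [("", EX)]),
    "prefix is not an NCName": (False, [("0", EX), ("a b", EX)]),
    "empty value": (False, [("ex", "")]),
    "reserved prefix misused": (False, [("xml", EX), ("xmlns", XMLNS_NS)]),
    "reserved URI on other prefix": (False, [("ex", XML_NS), ("ex", XMLNS_NS)]),
}


def run_classes():
    """Return (class label, prefix, uri) for every representative that fails."""
    failures = []
    for label, (expected, samples) in EQUIVALENCE_CLASSES.items():
        for prefix, uri in samples:
            if is_legal_declaration(prefix, uri) != expected:
                failures.append((label, prefix, uri))
    return failures
```

A reported failure in the field would then be filed under the class it 
belongs to (or prompt a new class) instead of becoming a one-off test.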

What I *personally* avoid is adding tests to make sure something no 
longer works wrong. Conformance testing is about ensuring all 
implementations work *right* in the presence of correct and incorrect 
usage. Failure or regression testing is about adding tests that exercise 
a reported failure. Once that reported failure is fixed, that test will 
never fail again.  Therefore, such tests check to make sure an 
implementation no longer works wrong.  It doesn't make it a bad test, 
but such tests are almost always exercising members of a class of input 
that SHOULD have been exercised by conformance testing in the first 
place.  Rather than add a hodge-podge of tests that touch on specific 
failure cases, I strive to define/update the related general equivalence 
class.  That way you are categorizing the test correctly and exercising 
the general feature, as opposed to the specific failure. 

But as I said, that's my personal soap box.  I have been standing on it, 
beating my breast and shouting, for 25 years.  For some reason, there 
are people who remain unconvinced.  :-P

Shane P. McCarron                          Phone: +1 763 786-8160 x120
Managing Director                            Fax: +1 763 786-8180
ApTest Minnesota                            Inet: shane@aptest.com
Received on Tuesday, 8 September 2009 16:20:51 UTC