Re: @rel syntax in RDFa (relevant to ISSUE-60 discussion), was: Using XMLNS in link/@rel from Henri Sivonen on 2009-02-27 (public-html@w3.org from February 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 27 Feb 2009 14:57:31 +0200
To: Julian Reschke <julian.reschke@gmx.de>, Mark Nottingham <mnot@mnot.net>
Cc: HTMLWG WG <public-html@w3.org>, "www-tag@w3.org WG" <www-tag@w3.org>, RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>, public-xhtml2@w3.org
Message-Id: <E22A02DE-DF12-43EF-9C9F-B0389BB37C17@iki.fi>

Mark Nottingham wrote:

> Creative Commons just released a new spec:
>  http://wiki.creativecommons.org/Ccplus
> that has markup in this form:
>  <a xmlns:cc="http://creativecommons.org/ns#"
> rel="cc:morePermissions" href="#agreement">below</a>
> (in HTML4, one assumes, since they don't specify XHTML, and this is
> what the vast majority of users will presume).

http://wiki.creativecommons.org/images/0/06/Ccplus-technical.pdf says  
"html". The syntax is not valid in any of HTML 2.0, HTML 3.2, HTML  
4.0, HTML 4.01 or HTML5 as currently drafted.

> However, it appears that they adopted this practice from RDFa;
>  http://www.w3.org/TR/rdfa-syntax/#relValues
> which, in turn, *does* rely upon XHTML.

Indeed, RDFa is not a REC over text/html.

> However, XHTML does *not*
> specify the @rel value as a QName (or CURIE, as RDFa assumes);
> http://www.w3.org/TR/2008/REC-xhtml-modularization-20081008/abstraction.html#dt_LinkTypes
>
> "Note that in a future version of this specification, the Working
> Group expects to evolve this type from a simple name to a Qualified
> Name (QName)."

In HTML5, as currently drafted, rel is a space character-separated
list of tokens that are compared ASCII-case-insensitively. It is
noteworthy that the tokens may look like URIs, although HTML5
processing itself ascribes no URI semantics to tokens that look like
URIs.
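
To make the token model concrete, here is a minimal sketch of how a consumer might split a rel value and compare tokens. The class and method names are hypothetical and not taken from any spec or parser; the sketch only assumes that the separators are the HTML space characters and that just A-Z are case-folded.

```java
import java.util.ArrayList;
import java.util.List;

final class RelTokens {
    // Split a rel value on the HTML space characters
    // (space, tab, LF, FF, CR), dropping empty pieces.
    static List<String> split(String rel) {
        List<String> tokens = new ArrayList<>();
        for (String piece : rel.split("[ \\t\\n\\f\\r]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // ASCII-only lowercasing: only A-Z are folded, nothing else.
    static String asciiLower(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 0x20) : c);
        }
        return sb.toString();
    }

    // True if the rel value contains the wanted token, compared
    // ASCII-case-insensitively. Tokens are opaque strings here:
    // "cc:morePermissions" or a URI-looking token gets no special
    // treatment.
    static boolean hasToken(String rel, String wanted) {
        String w = asciiLower(wanted);
        for (String token : split(rel)) {
            if (asciiLower(token).equals(w)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String rel = "Stylesheet http://example.org/rel/license cc:morePermissions";
        System.out.println(split(rel));
        System.out.println(hasToken(rel, "stylesheet"));
    }
}
```

Note that nothing in this processing looks at a colon or a prefix; a token is matched as a whole string or not at all.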

> So, that's an expectation, not a current specification.

It's not a current or drafted specification for text/html, either.

[...]
> A few observations and questions;
>
> 1) I'm more than happy to specify in the Link that in XHTML, a link
> rel value is indeed a QName, if XHTML chooses to take that position
> (although I believe a URI is a better fit than a QName here, as in
> most other places). Can we get a current reading from the XHTML world
> on this?

In XHTML5, as currently drafted, rel is a space character-separated  
list of tokens that are compared ASCII-case-insensitively. This  
matches current HTML 4.01 and XHTML 1.0 implementations.

> 2) However, it seems like RDFa is jumping the gun by assuming @rel is
> a CURIE right now. This is not promoting interoperability or shared
> architecture, because no XHTML processor that isn't aware of RDFa can
> properly identify these link relations.

I agree.

> My preference would be an
> erratum to RDFa removing this syntax, replacing them with a self-
> contained identifier (i.e. a URI). Thoughts?

More generally, I think it would make sense to issue an erratum that  
replaces all CURIEs in RDFa with the corresponding full URIs, since  
this would both
  1) Remove the reliance on attributes spelled "xmlns:foo" which are  
special in XML but not special in text/html (as text/html parsing is  
currently implemented out there and drafted in HTML5).
  2) Avoid introducing a novel prefix-based indirection mechanism with  
many of the same problems that Namespaces in XML have been observed to  
have over the last decade.

Examples of problems:
http://lists.xml.org/archives/xml-dev/200502/msg00306.html
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6475032
http://dev.ctor.org/soap4r/ticket/179
http://sourceforge.net/tracker/?func=detail&atid=454391&aid=924041&group_id=48863
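
To illustrate the indirection at stake, here is a simplified sketch of prefix-based expansion. It is not the full CURIE processing model, and the class and method names are hypothetical. The point is only that the meaning of a prefixed value depends on a mapping held elsewhere in the document, whereas a full URI is self-contained.

```java
import java.util.HashMap;
import java.util.Map;

final class CurieExpansion {
    // Expand "prefix:reference" using the prefix mappings in scope.
    // Returns null when there is no colon or the prefix is unbound,
    // i.e. when the value cannot be resolved without outside context.
    static String expand(String curie, Map<String, String> prefixesInScope) {
        int colon = curie.indexOf(':');
        if (colon < 0) {
            return null;
        }
        String base = prefixesInScope.get(curie.substring(0, colon));
        return base == null ? null : base + curie.substring(colon + 1);
    }

    public static void main(String[] args) {
        Map<String, String> inScope = new HashMap<>();
        inScope.put("cc", "http://creativecommons.org/ns#");

        // Resolvable only because the mapping happens to be in scope.
        System.out.println(expand("cc:morePermissions", inScope));

        // The same string with no mapping in scope (for example after
        // the element was copied elsewhere) cannot be resolved.
        System.out.println(expand("cc:morePermissions", new HashMap<>()));
    }
}
```

With full URIs in the attribute value, the lookup step and its failure mode disappear.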

> 3) CC's adoption of *proposed* XHTML conventions from RDFa into HTML4
> via CURIEs further muddies the waters; xmlns has no meaning whatsoever
> in HTML4, so they're promoting bad practice there by circumventing the
> specified Profile mechanism. I find this aspect of this the most
> concerning, and it needs clarification (more colourful words come to
> mind, but I'll leave it there for now).

I also find the use of xmlns:foo the most concerning aspect, but not
just because it has no special meaning in HTML 4.01 on the theoretical
level but also because of the practical repercussions for software
architecture.

I develop a text/html parser that implements the HTML5 parsing  
algorithm and targets five APIs for the application layer: JDK DOM  
Level 2, Java SAX2 in the namespace-aware mode, XOM, Web DOM (the one  
browsers expose via JS; targeted via Google Web Toolkit) and the  
internal content tree API of Gecko (nsINode/nsIContent; targeted via  
automated translation of the Java code into C++).

These are all namespace-aware APIs. (Note that DOM Level 1 and the DOM  
Level 1-ish Python minidom aren't namespace-aware and they are the  
APIs typically used to demonstrate RDFa interop.)

Gecko, WebKit and Presto use a namespace-aware DOM for both text/html  
and application/xhtml+xml. Thus, we can gain understanding of the  
implemented mapping of text/html into a namespace-aware representation  
from these implementations. Since attributes of the form xmlns:foo are  
not special in any way in HTML 4.01 (or 4.0, 3.2 or 2.0 for that  
matter), an attribute spelled "xmlns:foo" in text/html parses into  
["", "xmlns:foo"] as the [namespace, local] pair. (Note that the local  
name is not an XML 1.0 + Namespaces NCName.) For compatibility with  
the behavior of these existing browsers, HTML5, as drafted, specifies  
that "xmlns:foo" in text/html parses into ["", "xmlns:foo"].

Demo: http://hsivonen.iki.fi/test/moz/xmlns-dom.html

DOM Level 2 XML, on the other hand, represents an attribute spelled
"xmlns:foo" in application/xhtml+xml as
["http://www.w3.org/2000/xmlns/", "foo"].

Demo: http://hsivonen.iki.fi/test/moz/xmlns-dom.xhtml
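
For readers without a browser at hand, the XML side can also be observed with the DOM implementation bundled with the JDK. This is a sketch, assuming a namespace-aware DocumentBuilder; the class name, method name and the example.org URI are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XmlnsInDom {
    // Parse the XML with a namespace-aware parser and return the
    // [namespace, local] pair of the attribute with the given
    // qualified name on the root element.
    static String[] namespaceAndLocal(String xml, String attributeQName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Attr attr = doc.getDocumentElement().getAttributeNode(attributeQName);
            return new String[] { attr.getNamespaceURI(), attr.getLocalName() };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html xmlns='http://www.w3.org/1999/xhtml'"
                + " xmlns:foo='http://example.org/foo#'/>";
        String[] pair = namespaceAndLocal(xml, "xmlns:foo");
        System.out.println("[" + pair[0] + ", " + pair[1] + "]");
    }
}
```

The local name reported here is "foo", not "xmlns:foo", which is the difference from the text/html case described above.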

Furthermore, SAX2 in the namespace-aware mode and XOM do not represent  
what are spelled "xmlns:foo" in XML as attributes at all in the API.  
Instead, there's dedicated API surface for exposing namespace mappings  
to the application layer.
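
The SAX2 behavior can be sketched with the JDK's parser as follows. This assumes the default feature settings of a namespace-aware SAXParserFactory (in particular that the namespace-prefixes feature is left off); the class names and the example.org URI are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class XmlnsInSax {
    // Records what the parser reports for the root element.
    static final class Recorder extends DefaultHandler {
        final List<String> mappings = new ArrayList<>();
        int rootAttributeCount = -1;

        @Override
        public void startPrefixMapping(String prefix, String uri) {
            // Namespace declarations arrive here, not as attributes.
            mappings.add(prefix + "=" + uri);
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (rootAttributeCount < 0) {
                rootAttributeCount = atts.getLength();
            }
        }
    }

    static Recorder parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            Recorder recorder = new Recorder();
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), recorder);
            return recorder;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Recorder r = parse("<html xmlns='http://www.w3.org/1999/xhtml'"
                + " xmlns:foo='http://example.org/foo#'/>");
        System.out.println("attributes on root: " + r.rootAttributeCount);
        System.out.println("prefix mappings: " + r.mappings);
    }
}
```

Application code that wants the "foo" mapping therefore has to listen to the prefix mapping callbacks; looking through the attribute list will not find it.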

If we use the explicit mapping of DOM Level 3 to Infoset, the mapping
of XML onto Infoset and the mappings from XML into XOM or
namespace-aware SAX2, we have to conclude that when a DOM-oriented spec
talks about an attribute in the "http://www.w3.org/2000/xmlns/"
namespace, the concept maps to the namespace mapping API surface of
SAX2 and XOM. Conversely, when an attribute is not in the
"http://www.w3.org/2000/xmlns/" namespace according to a DOM-oriented
spec, it doesn't map to the namespace mapping API surface of XOM and
namespace-aware SAX2.

The above paragraph is relevant, because the dominant design of
text/html parsers for non-browser applications established by John
Cowan's TagSoup and adopted by HTML5 parsers is that they expose an XML
API so that the application-level code is written as if working with an
XML parser parsing an equivalent XHTML 1.0 or XHTML5 file (for HTML
4.01 and HTML5 respectively).

This design of sharing above-parser application-level code between  
text/html and application/xhtml+xml is also in use in Gecko, WebKit  
and (based on black-box guess) Presto.

The internal API of Gecko differs from the DOM slightly: The DOM has
three datums: namespace URI, qname (a.k.a. Level 1 node name) and local
name. Gecko's internal API also has three datums but slightly
different ones: namespace URI, *prefix* and local name. None of these
are string data types in Gecko. The namespace URI is interned into a
32-bit integer and prefix & local name are interned into a specific
interned string type that cannot be used directly where string types
can be used. It follows that for any natively implemented feature, it
would be highly undesirable to have to look 'inside' these values as
strings as opposed to merely comparing pointers or integers.
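
As an illustration of why interned names discourage looking inside the strings, here is a rough sketch of the general technique. It is not Gecko's code; every name in it is hypothetical. Namespace URIs become small integers and names become interned objects, so a match is an integer comparison plus a reference comparison.

```java
import java.util.HashMap;
import java.util.Map;

final class InternedNames {
    // An interned name: equal text always yields the same object,
    // so identity comparison (==) is enough.
    static final class Atom {
        final String text;
        private Atom(String text) { this.text = text; }
    }

    private final Map<String, Integer> namespaceIds = new HashMap<>();
    private final Map<String, Atom> atoms = new HashMap<>();

    // Map a namespace URI to a small integer ID, assigning a new
    // ID on first sight.
    int namespaceId(String uri) {
        Integer id = namespaceIds.get(uri);
        if (id == null) {
            id = namespaceIds.size();
            namespaceIds.put(uri, id);
        }
        return id;
    }

    // Return the unique Atom for the given text.
    Atom atom(String text) {
        Atom a = atoms.get(text);
        if (a == null) {
            a = new Atom(text);
            atoms.put(text, a);
        }
        return a;
    }

    // Cheap attribute-name match: no string inspection at all.
    static boolean matches(int nsA, Atom localA, int nsB, Atom localB) {
        return nsA == nsB && localA == localB;
    }

    public static void main(String[] args) {
        InternedNames names = new InternedNames();
        int none = names.namespaceId("");
        Atom rel = names.atom("rel");
        System.out.println(matches(none, rel,
                names.namespaceId(""), names.atom("rel")));
    }
}
```

A feature that needs to find every attribute whose name starts with "xmlns:" has no cheap path in such a design; it would have to unwrap each atom and scan its text.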

I'm not suggesting that there is any foreseeable native
implementation of RDFa-sensitive functionality in any Gecko-based
browser. However, I am suggesting that language design that would be
a bad match for established browser internals is architecturally
unsound design in case there's the slightest chance that the language
might one day be browser-sensitive.

Going back to the design of exposing text/html as if it were XML: As I  
pointed out earlier, xmlns:foo in text/html parses, in existing  
browsers and in the HTML5 parsing algorithm as drafted today, into a  
[namespace, local] pair where the local part is not an NCName. This  
characteristic alone (i.e. without even considering the part that is  
spelled "xmlns") is enough to render the [namespace, local] pair  
unrepresentable in XML 1.0 + Namespaces.

This poses the following problems:
  1) A local name that is not an NCName cannot be serialized as XML  
1.0 in such a way that parsing the resulting XML document with a  
namespace-aware parser round-trips the non-NCName local name properly.
  2) Namespace-wise strictly correct XML tree implementations throw if  
you try to set an attribute that can't be serialized as XML 1.0 +  
Namespaces. (A demo that makes XOM throw is included below my  
signature.)
  3) Even if the API contract of an XML API could be violated and a
local name that is impossible in XML 1.0 + Namespaces could be passed
through, this representation would be *different* from the way an XML
parser would expose an attribute spelled "xmlns:foo" through the same
API. Thus, the application-layer code would have to differ for
text/html and application/xhtml+xml.
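
Problem 2 can also be sketched without XOM, using the DOM implementation that ships with the JDK. This assumes that implementation enforces the DOM Level 2 namespace checks on createAttributeNS; the class and method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

final class NoNamespaceXmlnsFoo {
    // Try to create an attribute in no namespace whose name is
    // "xmlns:foo", i.e. the pair that text/html parsing yields.
    // Returns the DOMException code raised, or -1 if the DOM
    // accepted the attribute.
    static int tryCreate() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            doc.createAttributeNS(null, "xmlns:foo");
            return -1;
        } catch (DOMException e) {
            return e.code;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int code = tryCreate();
        System.out.println(code == DOMException.NAMESPACE_ERR
                ? "rejected with NAMESPACE_ERR"
                : "result code: " + code);
    }
}
```

A text/html parser targeting such an API has to either drop the attribute, rename it, or bypass the checks, and none of those gives application code the same view it would get from an XML parser.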

The options are thus:

  1) Letting the application-layer code differ for text/html and
application/xhtml+xml (provided that you can make the infrastructure
not throw). This would violate the DOM Consistency design principle
in HTML Design Principles. (For the general purpose of
application-layer code reuse, "DOM" here should be understood to mean
any API between the parser and application layers.) Experience with
the lang vs. xml:lang issue should show that going down this road
leads to divergent code paths in many places, which is bug-prone and
bad software architecture.

  2) Changing text/html parsing to parse "xmlns:foo" into
["http://www.w3.org/2000/xmlns/", "foo"]. This would be inconsistent
with the behavior of existing Gecko, WebKit and Presto releases.

  3) Changing RDFa not to use attributes spelled "xmlns:foo" in either  
text/html or application/xhtml+xml. (Failing to do this for  
application/xhtml+xml would still lead to the problem of different  
code paths in application-layer code.) This could be achieved with an  
erratum changing CURIEs to full URIs.

  4) Not using RDFa in text/html at all.

- -

Due to the above considerations, I think that a vocabulary that uses  
attributes spelled "xmlns:foo" on (X)HTML elements is in architectural  
error.

> P.S., I realise that this involves at least three additional
> communities, but the TAG seems like the logical place for the initial
> discussion and eventual coordination of this issue.

Since Steven already CCed two of those three and Julian forwarded your  
email to the third, I've CCed all three in addition to the TAG here.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

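import nu.xom.Attribute;
import nu.xom.Element;

public class XomTest {
     public static void main(String[] args) {
         Element elt = new Element("html", "http://www.w3.org/1999/xhtml");
         // XOM rejects this attribute and throws here, because an
         // attribute named "xmlns:foo" in no namespace cannot be
         // represented in XML 1.0 + Namespaces.
         elt.addAttribute(new Attribute("xmlns:foo", "bar"));
     }
}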
Received on Friday, 27 February 2009 12:58:24 UTC