Re: XMLLiteral handling in RDFa in HTML from Shelley Powers on 2009-05-27 (public-rdf-in-xhtml-tf@w3.org from May 2009)

From: Shelley Powers <shelley.just@gmail.com>
Date: Wed, 27 May 2009 09:15:13 -0500
To: Sam Ruby <rubys@intertwingly.net>
Cc: Manu Sporny <msporny@digitalbazaar.com>, Toby Inkster <tai@g5n.co.uk>, RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>, HTMLWG WG <public-html@w3.org>
Message-ID: <643cc0270905270715x5561941fp33c91079c10de520@mail.gmail.com>
On Wed, May 27, 2009 at 8:52 AM, Sam Ruby <rubys@intertwingly.net> wrote:

> Shelley Powers wrote:
>
>>
>>    It is worse than that.  If you consider only the set of valid, and
>>    well-formed XHTML 1.1 documents, it is the case that parsing all such
>>    documents as text/html will produce a DOM, but it is not the case that
>>    all such DOMs will be identical to the ones produced if the same
>>    sources were parsed as application/xhtml+xml.
>>
>>    More info: http://wiki.whatwg.org/wiki/HTML_vs._XHTML
>>
>>    Most of the differences deal with things like titles, textarea,
>>    scripts, and style elements.  Also, <![CDATA[...]]> ends up being
>>    treated as a comment in HTML.
>>
>>    The first observation is that even the Microdata proposal in the
>>    current HTML 5 specification doesn't meet the criteria specified
>>    above(*), as titles which contain the strings "&amp;" or "&lt;" will
>>    produce different triples when those documents are parsed as text/html
>>    vs application/xhtml+xml.
>>
>>    As an aside: for purposes of this discussion, I suggest adopting the
>>    approach of identifying content based on the MIME type.  My weblog, for
>>    example, is only XHTML when served to browsers that support
>>    application/xhtml+xml.  For all other browsers (e.g. Lynx, IE8), it is
>>    simply HTML.  I also suggest dropping version numbers when referring to
>>    HTML or XHTML.
>>
>> I don't think we can drop references to HTML5, as the work is still
>> ongoing and we can't be sure what impact future efforts on the document will
>> have on RDFa. To repeat, the property attribute came close to being redefined, and
>> that would have had seriously negative impacts on RDFa, regardless of
>> whether you used the XHTML or HTML version.
>>
>
> Apologies if I was unclear.  Some seem to have a notion that syntaxes can
> be different depending on what the enclosing format is.  I'm not convinced
> that that is a good idea when it comes to XHTML vs HTML.  I'm totally
> convinced that that is a bad idea when it comes to HTML4 vs HTML5.
>


And I agree with this.


> I believe that the rules for RDFa in HTML should be the same for documents
> with the HTML5 doctype as they are for the HTML4 doctype.
>

If HTML5 doesn't redefine attributes RDFa is dependent on, I think that's
fair. With an understanding that HTML5 is a work in progress, so we're not
completely sure what the end product, or DOM, will be.


>
>
>>     >> Sometimes an RDFa parser, dealing with HTML,
>>     >> will hit a situation where it needs to generate an XMLLiteral from
>>     >> non-wellformed HTML. In these situations, it seems to me that we have a
>>     >> choice of three potential "the parser MUST" actions, all of which are
>>     >> roughly consistent with RDFa in XHTML:
>>     >>
>>     >> 1. The parser MUST ignore this triple altogether. A simple solution, and
>>     >> it means that the HTML graph would be a subset of the XHTML graph. RDF
>>     >> vocabularies are generally defined so that if a graph G is true, then
>>     >> any graph H such that H is a subset of G is also true.
>>     >
>>     > The XHTML parser can't ignore the triple due to a parser error, or if it
>>     > corrects the parser error, shouldn't output the malformed XMLLiteral.
>>     >
>>     > The HTML5lib parser will never see that the XMLLiteral was malformed.
>>     >
>>     >> 2. The parser MUST add the triple to the graph as normal, but MUST NOT
>>     >> set the literal's datatype to XMLLiteral. They could either leave the
>>     >> literal as an untyped literal (that happened to have a lot of angled
>>     >> brackets in it) or perhaps set it to some HTMLLiteral datatype of our
>>     >> own concoction.
>>     >
>>     > This would be a problem because the XML-based parser implementations
>>     > would switch the datatype of the object to something like
>>     > XMLCharacterStream, while the html5lib parser would output an XMLLiteral.
>>     >
>>     > I don't believe that there is any such thing as a malformed XMLLiteral
>>     > in HTML5... is there? Can anybody think of an example of an invalid
>>     > XMLLiteral in an html5 parser?
>>
>>    <div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
>>      <span property="dc:description"><a$></span>
>>    </div>
>>
>>     >> 3. The parser MUST coerce the HTML fragment into a well-formed (but not
>>     >> necessarily valid) XHTML fragment. The HTML5 draft gives us decent
>>     >> algorithms for doing this.
>>     >
>>     > It does, but HTML5 has nothing to do with XHTML1.1 and XHTML2 - why
>>     > should we apply HTML5's parsing rules to XHTML1.1 and XHTML2 documents?
>>
>>    Browsers will apply HTML parsing rules to XHTML1.1 documents served as
>>    text/html.  This can affect the triples produced by jquery.rdfa.js.
>>
>>     > I don't think that this is something we can 'MUST' ourselves out of...
>>     > relaxing the conformance requirements to not include XMLLiterals seems
>>     > to be a mechanism that would:
>>     >
>>     > a) Allow variance in IF and HOW XMLLiterals are generated - which will
>>     > vary based on if a document is being parsed by a SAX-based XML parser in
>>     > XHTML1.1, or a DOM-based Javascript parser in HTML5.
>>     > b) Not automatically disqualify all DOM-based HTML5 implementations, or
>>     > non-raw-stream-based XHTML1.1 implementations.
>>     >
>>     > Although, even this approach bothers me quite a bit... as does getting
>>     > rid of XMLLiterals altogether.
>>
>>    There is a set of documents which will produce the same RDF triples
>>    independent of whether the document is processed as text/html vs
>>    application/xhtml+xml.
>>
>>    1) I suggest that the syntaxes for RDFa in application/xhtml+xml vs RDFa
>>    in text/html not be considered separately, but be developed together and
>>    with an eye towards maximizing the set mentioned above.
>>
>> There is already a recommended syntax for RDFa in XHTML. Whatever decision
>> is made going forward, that won't change.
>>
>
> It is possible for recommendations to be superseded.  I'm not suggesting
> that radical changes are required (nor, in fact, am I precluding that
> possibility); I'm just saying that instead of a separate RDFa in HTML
> specification, I think it would make sense for the successor to the current
> RDFa in XHTML document to cover documents served as application/xhtml+xml
> and documents served as text/html; particularly given that the current
> document purports to describe RDFa in XHTML despite the fact that most XHTML
> is served as text/html and therefore processed in a number of ways
> differently than what the current spec describes.
>
> But that's just my input, others may have different perspectives.
>


The RDFa group will have to decide which way to go on this. My biggest
concern is that we'll end up severely limiting RDFa because of issues with
in-page access with the DOM, an environment I don't consider to be
especially important. But that's my opinion.
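
As a rough illustration of the first two "parser MUST" options quoted above, here is a Python sketch. It is only a sketch: the function names, the policy strings, and the dummy wrapper element are made up for this example and are not defined by any RDFa document. Option 3 (coercion) is left out because it needs a real HTML5 parser. Well-formedness is approximated by wrapping the fragment in a root element and handing it to the standard library's XML parser, so a fragment that uses an undeclared namespace prefix would also count as malformed here.

```python
import xml.etree.ElementTree as ET

# Datatype URI for rdf:XMLLiteral.
XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML inside a dummy root element."""
    try:
        ET.fromstring(f"<wrapper>{fragment}</wrapper>")
        return True
    except ET.ParseError:
        return False


def literal_for(fragment: str, policy: str):
    """Apply hypothetical policy "drop" (option 1) or "untyped" (option 2).

    Returns (lexical_form, datatype), or None when the triple is dropped.
    """
    if is_well_formed(fragment):
        return (fragment, XML_LITERAL)
    if policy == "drop":        # option 1: ignore the triple altogether
        return None
    if policy == "untyped":     # option 2: keep the text, no XMLLiteral datatype
        return (fragment, None)
    raise ValueError(f"unknown policy: {policy}")
```

With Sam's `<a$>` example as the fragment, "drop" yields no triple and "untyped" yields a plain literal full of angle brackets.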



>
>> However, the RDFa group could follow the SVG working group's approach by
>> providing its own RDFa 'tiny' specification, as a next version, which will
>> then end up being a subset of a RDFa 'full' specification at a later time.
>>
>> The 'tiny' version could specifically address the limitations that an
>> HTML/DOM implementation imposes, while allowing folks to freely use the
>> original RDFa specification if they serve their pages up as XHTML. The next
>> 'full' version of RDFa can then build on the 'tiny' specification, adding
>> back in capabilities allowed through the use of XHTML, while ensuring
>> consistent RDF triples.
>>
>> I don't think it's as important to generate the "same" set of triples as to
>> ensure that people know what to expect when they serve their pages up as
>> HTML compared to XHTML, and that the triples they get with HTML will be a
>> subset of those they would get with XHTML.
>>
>
> I don't believe that there is a proper sub/superset relationship between
> HTML and XHTML w.r.t. RDFa; but given that RDFa in HTML is essentially
> undefined, anything is possible.  At the moment, I believe that it is
> entirely likely that an RDFa parser such as jquery.rdfa.js would pick up
> *more* triples in a document which varied the case of a number of elements
> if those documents were served as text/html vs the same documents served as
> application/xhtml+xml.
>

Again, we're talking about the DOM, not necessarily other means of access.

We're dealing with multiple options: well formed XHTML served as XML; well
formed XHTML served as HTML; not well formed XHTML served as HTML; HTML
served as HTML.

Well formed XHTML served either as HTML or XHTML can be processed the same
way by external applications.  However, the DOM generates differences. I'm
not sure I want to penalize the pool of people who serve up well formed
XHTML pages but aren't serving them as XML, just because of the limitations
of the DOM. Perhaps the problem is really with the DOM.
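
The case issue Sam mentions can be seen with a small Python sketch. This is an approximation: the standard library's html.parser stands in for a text/html parser (it is not an HTML5-conformant parser, but it does fold tag and attribute names to lower case as HTML parsing does), and ElementTree stands in for an XML parser. The sample markup and function names are invented for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Well-formed markup that uses upper-case element and attribute names.
SOURCE = '<div><SPAN PROPERTY="dc:title">Hello</SPAN></div>'


class PropertyCollector(HTMLParser):
    """Collect the values of every attribute named exactly 'property'."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already lower-cased tag and attribute names here.
        for name, value in attrs:
            if name == "property":
                self.properties.append(value)


def properties_as_html(source: str) -> list:
    collector = PropertyCollector()
    collector.feed(source)
    collector.close()
    return collector.properties


def properties_as_xml(source: str) -> list:
    # XML is case-sensitive, so PROPERTY is a different attribute from property.
    root = ET.fromstring(source)
    return [el.get("property") for el in root.iter()
            if el.get("property") is not None]
```

Parsed the HTML way, the sample yields one property value; parsed as XML, it yields none. So for this kind of document the text/html reading finds more, not fewer, candidate triples.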

Well, we know this state isn't going to change. But do we need a new
specification specifically to address limitations of the DOM? Or would
something like a best practices document work, to warn folks of differences?
I guess the RDFa folks will have to decide this one.


>
>>    2) (in the fullness of time) it would be helpful if there were a
>>    validator which identified documents which cause different triples to be
>>    produced.  If such a tool also identifies other conformance issues with
>>    the document, it would be helpful to have an option to turn the
>>    reporting of such issues off as with many documents this will obscure
>>    the set of errors that affect the production of triples.
>>
>> And I'm sure there will be, in the fullness of time.
>>
>>    3) Test cases should be produced with the goal of ensuring that parsers
>>    looking to produce RDF triples (whether it be from microformats,
>>    microdata, or RDFa) respect the MIME type of the document.
>>
>> I believe that test cases have already been added in this regard.
>>
>
> Got a URL handy?
>

I believe that Shane has been sending emails to the RDFa in XHTML TF mailing
list with new test cases, including one specifically addressing letter case.
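
For what "respect the MIME type" in point 3 might look like in a test harness, here is a minimal Python sketch. It assumes the harness has the Content-Type header value in hand; the function name and the "xml"/"html" return values are hypothetical, not taken from any existing test suite.

```python
from email.message import Message


def parse_mode(content_type_header: str) -> str:
    """Pick a parsing mode from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() drops parameters such as charset and lower-cases the type.
    mime = msg.get_content_type()
    if mime == "application/xhtml+xml":
        return "xml"
    if mime == "text/html":
        return "html"
    raise ValueError(f"unsupported media type: {mime}")
```

A test case would then feed the same source through the parser selected for each media type and compare the resulting triples.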


>
>
>>     > -- manu
>>
>>    - Sam Ruby
>>
>>    (*) I don't know whether this is an oversight, or even a problem, but
>>    when looking into the HTML5 draft, I couldn't find where itemprop
>>    attributes have any effect on the RDF triples produced.
>>
>> The section discussing how to generate RDF from microdata.
>>
>
> - Sam Ruby
>
>
Shelley
Received on Wednesday, 27 May 2009 14:15:53 UTC