Re: Support for XHTML5 from Robin Berjon on 2015-12-04 (public-scholarlyhtml@w3.org from December 2015)

From: Robin Berjon <robin@berjon.com>
Date: Fri, 4 Dec 2015 11:05:41 -0500
To: Sebastian Heath <sebastian.heath@gmail.com>, W3C <public-scholarlyhtml@w3.org>
Message-ID: <5661B9D5.505@berjon.com>
On 03/12/2015 19:05 , Sebastian Heath wrote:
> On Thu, Dec 3, 2015 at 3:20 PM, Robin Berjon <robin@berjon.com> wrote:
>       A) Processors must accept XHTML documents.
> 
> SHOULD better.

But do you need them parsed as XHTML, or are you only interested in
being allowed to keep your own documents in XHTML? If it's the latter,
you can just do so; the HTML parser will consume them.

>       B) Documents must use the XHTML syntax.
> 
>  "must use"? Gosh, of course not. Hard to read what I wrote in toto as
> suggesting this.

I was just exhausting the options so that we can focus on your use case.

>       C) SH must be compatible with the ecosystem of tools that consume
>     XHTML today.
> 
>     If the idea is (A), then that's by and large a given. Over HTTP use the
>     right media type and you'll be fine; in other contexts make sure your
>     XHTML is Appendix-C compliant (which is probably a good idea to start
>     with) and you'll be good too, even with processors that expect HTML.
> 
> 
>  By appendix C you mean "Processing Instructions and the XML
> Declaration" [1]?

That's C.1, but yeah. Generally I meant the set of rules people use in
XHTML so that their document is processed by an HTML parser correctly.
The /> part is useless nowadays, but things like closing your <script>
elements are needed.
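
To make the <script> point concrete, here is a minimal sketch of a lint
that flags self-closing syntax on elements that are not void in HTML
(an HTML parser would treat those as open tags). The helper name
risky_self_closing is hypothetical, and the check uses a regular
expression rather than a real parser, so treat it as a rough
approximation only.

```python
import re

# Void elements in HTML: the only ones an HTML parser treats as empty.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}

# Rough match for a self-closing tag such as <script src="x"/>.
# A regex is not a parser: attribute values containing "<" or ">"
# will confuse it.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9-]*)(?:\s[^<>]*)?/>")


def risky_self_closing(source):
    """Return the names of self-closed elements that are not void in HTML."""
    names = (m.group(1).lower() for m in SELF_CLOSING.finditer(source))
    return [name for name in names if name not in VOID_ELEMENTS]
```

For example, risky_self_closing('<script src="a.js"/><br/>') reports
the script element but not the br.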

> If yes, SH is based on HTML5, not earlier versions?

I would propose that SH be based on whatever version of HTML has currency.

> Appendix C less relevant in that world? But more than that, no, I don't
> keep to Appendix C. My goal is to write nice looking XML using the
> XHTML5 tag set in such a way that it will be stable as it passes through
> compliant tools. As in, constructs such as '<script> </script>' only
> lead to trouble if you rely on that specific sequence of characters in
> an xml environment. That's not to say anything many of us don't already
> know, just to offer a concrete example.

So there are potential issues with using XHTML if you want to go ahead
and rely on <script/> or on namespaces in a manner that does not map to
HTML directly.

So long as you're over HTTP, or in any setting that conveys a tight
coupling to media types, you'll be fine. Processors can handle that. But
in other systems you might need to flag that fact.

>  The above might be a segue to one of your points: that XHTML is a
> legacy format. I'm not sure that's right. I am not a W3 insider but I
> don't know of any formal declaration that XHTML is legacy. But am I
> wrong? Obviously, the conversation changes if there is such a policy in
> place.

The W3C is historically not the best at declaring things over. There is
a RESCND status to indicate that a Recommendation is rescinded, but in
over 20 years it has never been used.

There are some pretty good signs, though. The XHTML group was closed
five years ago. I think it's safe to say that the HTML WG has very
little interest in XHTML, if any at all. Polyglot was officially
abandoned after limping along for years. SVG and MathML now work with
the HTML syntax. I am not aware of any XHTML-related effort of any
magnitude over the past, well, quite a few years.

>  Certainly the XHTML concrete syntax of HTML 5 (I'll use xhtml5 going
> forward for convenience) is deployed by major websites (~"View source"
> on Apple's site [2], ).
> [2] http://apple.com (I take the presence of the xml namespace,
> presence of '/>', and use of 'itemscope' to indicate that this is
> xhtml5. It'd be nice if they set the media type. :-)

Actually, that's a common misconception. If you don't send your content
as application/xhtml+xml you are not sending XHTML, no matter what
syntax you use. The browser will not use the XML parser; it will use the
HTML parser (which in turn means you cannot use the parts of XHTML that
would not work in HTML).
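
As a simplified model of that rule (not what any particular browser
actually implements), the choice of parser can be sketched as a
function of the Content-Type header alone. The function name
parser_for and the three return labels are assumptions made for this
illustration.

```python
def parser_for(content_type_header):
    """Pick a parser from the media type only; the document syntax is ignored."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "html"
    if media_type in ("application/xml", "text/xml") or media_type.endswith("+xml"):
        return "xml"
    return "other"
```

So a page written in XHTML syntax but served as text/html still lands
on the "html" branch.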

Furthermore, the syntax is in fact not XHTML either:

$ curl -q http://www.apple.com/ | xmllint -
-:3: parser error : Specification mandate value for attribute itemscope
<head itemscope itemtype="http://schema.org/WebSite">
                ^
-:3: parser error : attributes construct error
<head itemscope itemtype="http://schema.org/WebSite">
                ^
-:3: parser error : Couldn't find end of Start Tag head line 3
<head itemscope itemtype="http://schema.org/WebSite">
                ^
-:63: parser error : Opening and ending tag mismatch: html line 2 and head
</head>
       ^
-:64: parser error : Extra content at the end of the document
<body class="page-home">
^

The itemscope attribute is from Microdata.
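
The same failure can be reproduced without the network. This is a
small sketch using only the Python standard library; SNIPPET is a
one-line stand-in for the real page, and the helper names are
hypothetical. A valueless attribute is rejected by an XML parser but
accepted by an HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Stand-in for line 3 of the page quoted above, closed so it is a full element.
SNIPPET = '<head itemscope itemtype="http://schema.org/WebSite"></head>'


def parses_as_xml(source):
    """True if the source is well-formed XML, False otherwise."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


class AttrCollector(HTMLParser):
    """Record the attributes seen on each start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.attrs[tag] = dict(attrs)


def html_attrs(source):
    """Parse with the (lenient) HTML parser and return attributes per tag."""
    collector = AttrCollector()
    collector.feed(source)
    collector.close()
    return collector.attrs
```

Here parses_as_xml(SNIPPET) is False, while html_attrs(SNIPPET) still
sees both attributes; writing itemscope="" instead would satisfy the
XML parser.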

I don't have numbers handy, but every time I have seen statistics on how
much of the Web is served as application/xhtml+xml, the share has been
vanishingly small.

>  With that point made (and maybe accepted??)... Robin, there's a sense
> that you're telling me my workflow is wrong. Almost, "Change your files.
> No big deal." All of the above - particularly the point that xhtml5 is
> part of the W3C ecosystem - is part of my response. More practically, in
> an xml workflow, practices such as unnecessarily adding closing tags for
> empty content tags can lead to real trouble. I've been there, I've lost
> those hours. Again '<script href=...></script>'.

I am not telling you your workflow is wrong, and I am not telling you to
change your files. You're an adult and obviously technically competent.
Perhaps I should try to flesh out several statements that are by and
large independent from one another so as not to cross the beams.

The most important one is:

I would strongly recommend that SH be defined at the vocabulary level
(ie. on a parsed DOM), not at the syntax level. In that sense, there is
no reason why you couldn't use XHTML. I *think* that is all that you
might be asking for, but I'm not sure. If that's the case then I think
we're good to go, and I suspect there is broader consensus too.

Technically, if you defined a way of conveying a DOM as S-expressions,
then SH could also be encoded that way. It may seem daft put that way but
there are cases in which it can be useful; for instance I commonly
manipulate our SH in an alternative JSON encoding.
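
To illustrate the idea of alternative encodings of one parsed tree,
here is a minimal sketch. The JSON shape (name, attrs, children) and
the S-expression layout are invented for this example; they are not
the encoding mentioned above, and the input is parsed as XML purely
for convenience.

```python
import json
import xml.etree.ElementTree as ET


def to_jsonable(el):
    """Turn an element tree into nested dicts, lists and strings."""
    node = {"name": el.tag, "attrs": dict(el.attrib), "children": []}
    if el.text:
        node["children"].append(el.text)
    for child in el:
        node["children"].append(to_jsonable(child))
        if child.tail:
            # Text following a child element belongs to the parent's content.
            node["children"].append(child.tail)
    return node


def to_sexpr(node):
    """Render the structure produced by to_jsonable as an S-expression."""
    if isinstance(node, str):
        return json.dumps(node)  # quoted, escaped string literal
    attrs = "".join(f' (@{k} {json.dumps(v)})' for k, v in node["attrs"].items())
    kids = "".join(" " + to_sexpr(child) for child in node["children"])
    return f'({node["name"]}{attrs}{kids})'
```

Both functions carry the same information as the tree they start
from; only the serialisation differs.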

Some less important points are:

Using an XML tool chain does not imply that HTML is a problem: you can
have an HTML parser in front of it and an HTML serialiser at the back of
it. For instance, the Nu HTML validator (which is the primary validator
today) makes extensive use of RelaxNG; but obviously it receives HTML as
input.
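
A rough sketch of the "HTML parser in front" half of that arrangement,
using only the Python standard library. Note the assumptions:
html.parser is a tokenizer without the error recovery of a real HTML
parser, so this only works on properly nested input, and the class and
function names are hypothetical. A production pipeline would put a
full HTML parser in this position.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}


class HTMLToTree(HTMLParser):
    """Feed HTML tokens into an ElementTree builder."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. itemscope) become empty strings.
        self.builder.start(tag, {k: "" if v is None else v for k, v in attrs})
        if tag in VOID_ELEMENTS:
            self.builder.end(tag)  # void elements never get an end tag

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def html_to_tree(source):
    """Parse well-nested HTML and return an ElementTree root element."""
    parser = HTMLToTree()
    parser.feed(source)
    parser.close()
    return parser.builder.close()
```

The returned element can then be handed to whatever XML tooling sits
behind it, for example html_to_tree(source).find("img").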

Seeking strict interoperability with EPUB3 (as per spec) is IMHO a bad
idea because it brings in baggage that is being phased out.

I am not at all an anti-XML fighter, in fact I have long worked with XML
and advocated cross-pollination with HTML. However, where XHTML is
concerned, I would advise anyone to transition to HTML so as to simplify
the processing, lower the chance of being bitten by small (and,
frankly, dumb) syntactical variations, and decrease reliance on the
always brittle practice of authoritative metadata. Having said that,
you're a responsible and competent grown-up; this is just my personal
advice :)

>  And then more generally, as a group, let's look to be open to the wide
> variety of practices that  exist in the real world. I am engaged in "the
> interoperable exchange of scholarly articles in a manner that is
> compatible with off-the-shelf browsers"[3] and I use xhtml5. Include me.
> I know that means recognizing that the world is messy and imperfect, but
> it is.

If indeed all you are asking for is that XHTML *may* be used, I think
that's not necessarily problematic. See below.

>  To the extent that the Vernacular site [3] represents a current state
> of the SH standard, I see that it reads in part: 
> 
>  "The document must be encoded in UTF-8, and transmitted with a media
> type of text/html. It must feature a DOCTYPE as its preamble."

Ah, so this is where it's coming from! Yes, I have already thought I
should drop this requirement.

People have seemed to get really angry about that. You're the first to
ask about the XHTML side, but several people have been really worked up
about text encoding!

The primary goal of that clause was to make a much stronger promise in
terms of interoperability and long-term archival than can be achieved
otherwise. Enabling flexibility in both syntax and encoding when there
are known interoperability issues with both is IMHO a problem. It's a
tractable problem today; I'm not sure how tractable something exotic
like XHTML in ISO-8859-15 will be some hundred years from now.

But I seem to be the only one worrying about that, so I don't mind
backing away from it if it means we can make progress on the rest.

-- 
• Robin Berjon - http://berjon.com/ - @robinberjon
• http://science.ai/ — intelligent science publishing
Received on Friday, 4 December 2015 16:06:11 UTC