- From: Robin Berjon <robin@berjon.com>
- Date: Fri, 4 Dec 2015 11:05:41 -0500
- To: Sebastian Heath <sebastian.heath@gmail.com>, W3C <public-scholarlyhtml@w3.org>
On 03/12/2015 19:05, Sebastian Heath wrote:
> On Thu, Dec 3, 2015 at 3:20 PM, Robin Berjon <robin@berjon.com> wrote:
>
>     A) Processors must accept XHTML documents.
>
> SHOULD better.

But parsed as XHTML, or are you only interested in being allowed to have your own documents in XHTML? If that's the case, you can just do so; the HTML parser will consume them.

>     B) Documents must use the XHTML syntax.
>
> "must use"? Gosh, of course not. Hard to read what I wrote in toto as
> suggesting this.

I was just exhausting the options so that we can focus on your use case.

>     C) SH must be compatible with the ecosystem of tools that consume
>     XHTML today.
>
>     If the idea is (A), then that's by and large a given. Over HTTP use the
>     right media type and you'll be fine; in other contexts make sure your
>     XHTML is Appendix-C compliant (which is probably a good idea to start
>     with) and you'll be good too, even with processors that expect HTML.
>
> By appendix C you mean "Processing Instructions and the XML
> Declaration" [1]?

That's C.1, but yeah. Generally I meant the set of rules people use in XHTML so that their document is processed correctly by an HTML parser. The /> part is useless nowadays, but things like closing your <script> elements are needed.

> If yes, SH is based on HTML5, not earlier versions?

I would propose that SH be based on whatever version of HTML has currency.

> Appendix C less relevant in that world? But more than that, no, I don't
> keep to Appendix C. My goal is to write nice looking XML using the
> XHTML5 tag set in such a way that it will be stable as it passes through
> compliant tools. As in, constructs such as '<script> </script>' only
> lead to trouble if you rely on that specific sequence of characters in
> an xml environment. That's not to say anything many of us don't already
> know, just to offer a concrete example.
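
To make the <script> case concrete, here is a minimal sketch in Python, assuming only the standard library's xml.etree.ElementTree; the file name app.js is a placeholder. It shows how a generic XML serialiser may collapse an empty <script> element into the self-closing form that an HTML parser does not treat as closed, along with two ways of keeping the explicit end tag. Other XML tools may behave differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment; "app.js" is a placeholder file name.
source = '<script src="app.js"></script>'
element = ET.fromstring(source)

# The default XML serialisation collapses the empty element to a
# self-closing tag, which is fine for XML but not for an HTML parser.
as_xml = ET.tostring(element, encoding="unicode")

# Two ways to keep the explicit end tag:
# 1. stay in XML but disable short empty elements,
# 2. serialise with the "html" method.
as_xml_long = ET.tostring(element, encoding="unicode", short_empty_elements=False)
as_html = ET.tostring(element, encoding="unicode", method="html")

print(as_xml)       # <script src="app.js" />
print(as_xml_long)  # <script src="app.js"></script>
print(as_html)      # <script src="app.js"></script>
```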

So there are potential issues with using XHTML if you want to go ahead and rely on <script/> or on namespaces in a manner that does not map to HTML directly. So long as you're over HTTP, or in any setting that conveys a tight coupling to media types, you'll be fine. Processors can handle that. But in other systems you might need to flag that fact.

> The above might be a segue to one of your points: that XHTML is a
> legacy format. I'm not sure that's right. I am not a W3 insider but I
> don't know of any formal declaration that XHTML is legacy. But am I
> wrong? Obviously, the conversation changes if there is such a policy in
> place.

The W3C is historically not the best at declaring things over. There is a RESCND status to indicate that a Recommendation is rescinded, but in over 20 years it has never been used.

There are some pretty good signs, though. The XHTML group was closed five years ago. I think it's safe to say that the HTML WG has very little interest in XHTML, if any at all. Polyglot was officially abandoned after limping along for years. SVG and MathML now work with the HTML syntax. I am not aware of any XHTML-related effort of any magnitude over the past, well, quite a few years.

> Certainly the XHTML concrete syntax of HTML 5 (I'll use xhtml5 going
> forward for convenience) is deployed by major websites (~"View source"
> on Apple's site [2]).
>
> [2] http://apple.com (I take the presence of the xml namespace,
> presence of '/>', and use of 'itemscope' to indicate that this is
> xhtml5. It'd be nice if they set the media type. :-)

Actually, that's a common misconception. If you don't send your content as application/xhtml+xml you are not sending XHTML, no matter what syntax you use. The browser will not use the XML parser; it will use the HTML parser (which in turn means you cannot use the parts of XHTML that would not work in HTML).

Furthermore, the syntax is in fact not XHTML either:

$ curl -q http://www.apple.com/ | xmllint -
-:3: parser error : Specification mandate value for attribute itemscope
<head itemscope itemtype="http://schema.org/WebSite">
^
-:3: parser error : attributes construct error
<head itemscope itemtype="http://schema.org/WebSite">
^
-:3: parser error : Couldn't find end of Start Tag head line 3
<head itemscope itemtype="http://schema.org/WebSite">
^
-:63: parser error : Opening and ending tag mismatch: html line 2 and head
</head>
^
-:64: parser error : Extra content at the end of the document
<body class="page-home">
^

The itemscope attribute is from Microdata.

I don't have numbers handy, but every time I have seen statistics on how much of the Web is served as application/xhtml+xml, the share has been vanishingly tiny.

> With that point made (and maybe accepted??)... Robin, there's a sense
> that you're telling me my workflow is wrong. Almost, "Change your files.
> No big deal." All of the above - particularly the point that xhtml5 is
> part of the W3C ecosystem - is part of my response. More practically, in
> an xml workflow, practices such as unnecessarily adding closing tags for
> empty content tags can lead to real trouble. I've been there, I've lost
> those hours. Again '<script href=...></script>'.

I am not telling you your workflow is wrong, and I am not telling you to change your files. You're an adult and obviously technically competent.

Perhaps I should try to flesh out several statements that are by and large independent from one another so as not to cross the beams.

The most important one is: I would strongly recommend that SH be defined at the vocabulary level (i.e. on a parsed DOM), not at the syntax level. In that sense, there is no reason why you couldn't use XHTML. I *think* that is all that you might be asking for, but I'm not sure. If that's the case then I think we're good to go; I suspect there is also broader consensus.
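
As a rough illustration of what "vocabulary level" means here, the sketch below (Python standard library only) reads the same small document once in HTML syntax and once in XHTML syntax and compares the resulting element names and attributes. The helper names (ElementCollector, from_html, from_xhtml) and both sample documents are made up for the example, html.parser is not a conforming HTML5 parser, and the flat list it builds is only a stand-in for a real DOM. It also shows an XML parser rejecting a valueless itemscope attribute, the same class of error xmllint reports.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


class ElementCollector(HTMLParser):
    """Collect (tag, attributes) pairs in document order from HTML syntax."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # A valueless attribute is reported as None; treat it as "".
        self.elements.append((tag, {name: value or "" for name, value in attrs}))


def from_html(text):
    collector = ElementCollector()
    collector.feed(text)
    collector.close()
    return collector.elements


def from_xhtml(text):
    root = ET.fromstring(text)
    # Strip the XHTML namespace so names compare with the HTML side.
    return [(el.tag.replace(XHTML_NS, ""), dict(el.attrib)) for el in root.iter()]


# Hypothetical sample documents that differ only in syntax.
html_doc = (
    '<html><head itemscope itemtype="http://schema.org/WebSite">'
    "<title>T</title></head><body></body></html>"
)
xhtml_doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head itemscope="" itemtype="http://schema.org/WebSite">'
    "<title>T</title></head><body></body></html>"
)

# The HTML-syntax document is not well-formed XML (valueless attribute).
try:
    ET.fromstring(html_doc)
    rejected_as_xml = False
except ET.ParseError:
    rejected_as_xml = True

same_vocabulary = from_html(html_doc) == from_xhtml(xhtml_doc)
print(rejected_as_xml, same_vocabulary)
```

A processor written against the parsed structure would not need to care which of the two syntaxes a document arrived in.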

Technically, if you defined a way of conveying a DOM as S-expressions, then SH could also be encoded that way. It may seem daft put that way, but there are cases in which it can be useful; for instance, I commonly manipulate our SH in an alternative JSON encoding.

Some less important points are:

Using an XML tool chain does not imply that HTML is a problem; you can have an HTML parser in front of it and an HTML serialiser at the back of it. For instance, the Nu HTML validator (which is the primary validator today) makes extensive use of RelaxNG, but obviously it receives HTML as input.

Seeking strict interoperability with EPUB3 (as per spec) is IMHO a bad idea because it brings in baggage that is being phased out.

I am not at all an anti-XML fighter; in fact I have long worked with XML and advocated cross-pollination with HTML. However, where XHTML is concerned, I would advise anyone to transition to HTML so as to simplify the processing, lower the chance of being bitten by small (and, frankly, dumb) syntactical variations, and decrease reliance on the always brittle practice of authoritative metadata.

Having said that, you're a responsible and competent grown-up; this is just my personal advice :)

> And then more generally, as a group, let's look to be open to the wide
> variety of practices that exist in the real world. I am engaged in "the
> interoperable exchange of scholarly articles in a manner that is
> compatible with off-the-shelf browsers"[3] and I use xhtml5. Include me.
> I know that means recognizing that the world is messy and imperfect, but
> it is.

If indeed all you are asking for is that XHTML *may* be used, I think that's not necessarily problematic. See below.

> To the extent that the Vernacular site [3] represents a current state
> of the SH standard, I see that it reads in part:
>
> "The document must be encoded in UTF-8, and transmitted with a media
> type of text/html. It must feature a DOCTYPE as its preamble."

Ah, so this is where it's coming from!

Yes, I have already thought I should drop this requirement. People have seemed to get really angry about that. You're the first to ask about the XHTML side, but several people have been really worked up about text encoding!

The primary goal of that clause was to make a much stronger promise in terms of interoperability and long-term archival than can be achieved otherwise. Enabling flexibility in both syntax and encoding when there are known interoperability issues with both is IMHO a problem. It's a tractable problem today; I'm not sure how tractable something exotic like XHTML in ISO-8859-15 will be some hundred years from now.

But I seem to be the only one worrying about that, so I don't mind backing away from it if it means we can make progress on the rest.

-- 
• Robin Berjon - http://berjon.com/ - @robinberjon
• http://science.ai/ - intelligent science publishing •
Received on Friday, 4 December 2015 16:06:11 UTC