Re: Comments on XHTML Media Types Note 20080827 from Simon Pieters on 2008-10-22 (public-xhtml2@w3.org from October 2008)

From: Simon Pieters <simonp@opera.com>
Date: Wed, 22 Oct 2008 19:15:18 +0200
To: "Shane McCarron" <shane@aptest.com>
Cc: "public-xhtml2@w3.org" <public-xhtml2@w3.org>
Message-ID: <op.ujfrjsu6idj3kv@hp-a0a83fcd39d2.oslo.opera.com>
On Wed, 22 Oct 2008 17:23:22 +0200, Shane McCarron <shane@aptest.com>  
wrote:

> Simon,
>
> Thanks very much for your thorough review of the draft XHTML Media Types  
> Note.  The XHTML 2 Working Group continues to make progress on this  
> document, and expects to update the published Note in the near future.   
> Many of your changes have been included in the current editors draft -  
> some notes on your comments are below.  A few of your comments raised  
> further questions.  I am going to split those out into a separate  
> thread. As always, you can find the current editors draft via  
> http://www.w3.org/MarkUp/Drafts#xhtmlmime
>
> Thanks again for your comments - they were a big help!

I'm glad it was helpful.


> Simon Pieters wrote:
>>
>> This abstract sucks. It shouldn't use RFC2119 terms. It shouldn't  
>> summarize the spec. It shouldn't give notes or advice about things. It  
>> shouldn't contain references or pointers.
>>
>> It should describe in abstract terms what the Note does and why it  
>> exists.
>>
>> e.g. "This Note contains advice about how to serve XHTML markup to  
>> different UAs and advice on how such markup should look in order to  
>> work as intended in common UAs when served with different media types."  
>> would be a better abstract. Better still would be to also explain why  
>> anyone would want to do so (instead of just using HTML or just XHTML).
>
> Changed.

Looks much better.


>>> 2. Terms and Definitions
>>>
>>> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",  
>>> "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this  
>>> document are to be interpreted as described in RFC 2119 [RFC2119].
>>
>> This document isn't normative. Why reference RFC2119 at all? I'd  
>> suggest to remove and use non-RFC2119 terms throughout to avoid  
>> confusion.
>
> Removed

Could the terms throughout the document be used like normal words (i.e.  
"should" rather than "<strong>SHOULD</strong>")?


>>> This section summarizes which Internet media type SHOULD be used for  
>>> which XHTML Family document for which purpose.
>>>
>>> A combination of these rules, in conjunction with a careful  
>>> examination of the HTTP Accept header, can be useful in determining  
>>> which media type to use when a document adheres to the guidelines in  
>>> Appendix A. Specifically:
>>>
>>>     1. if the Accept header explicitly contains application/xhtml+xml  
>>> deliver the document using that media type.
>> 3. Recommended Media Type Usage
>>
>> This is not appropriate since it doesn't consider the q parameter, nor  
>> does it consider wildcards. Consider:
>>
>>    Accept: text/html, application/xhtml+xml; q=0
>>
>> ...or:
>>
>>    Accept: application/*, text/*; q=0.5
>
> Changed.
>>
>>
>>>     2. Otherwise, if the Accept header contains text/html, deliver the  
>>> document using that media type.
>>>     3. Otherwise, deliver the document using media type text/html.
>>
>> Step 2 can be struck.
>>
>>
>>> In other words, requestors that advertise they support XHTML family  
>>> documents will receive the document in the XHTML media type, and all  
>>> other requestors will receive the document using the HTML media type.
>>
>> This is not appropriate when the UA Accepts neither (should give a 406).
>
> Expanded upon to try to clarify when this should be delivered.  However,  
> I don't think this note should be a comprehensive document on content  
> negotiation.

I agree -- but why, then, does it contain an algorithm that authors should  
use to implement content negotiation (which still doesn't handle e.g.  
Accept: text/html, application/xhtml+xml; q=0.5 correctly)?
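
To illustrate what handling the q parameter and wildcards would involve, here is a minimal sketch in Python. It is not the Note's algorithm: the function names are made up for this example, other media type parameters are ignored, and the tie-break in favour of text/html and the treatment of a missing Accept header are assumptions.

```python
def parse_accept(header):
    """Return a list of (media_range, q) pairs from an Accept header value."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when the parameter is absent
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q as "not acceptable"
        ranges.append((media_range, q))
    return ranges


def quality(media_type, ranges):
    """q value of the most specific range matching media_type (0.0 if none)."""
    main_type = media_type.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in ranges:
        if media_range == media_type:
            specificity = 2
        elif media_range == main_type + "/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def choose_media_type(accept_header):
    """Return the media type to serve, or None when a 406 is appropriate."""
    if accept_header is None:
        return "text/html"  # assumption: no Accept header, fall back to HTML
    ranges = parse_accept(accept_header)
    q_xhtml = quality("application/xhtml+xml", ranges)
    q_html = quality("text/html", ranges)
    if q_xhtml == 0.0 and q_html == 0.0:
        return None  # the UA accepts neither type
    # Assumption: on a tie, prefer text/html as the safer choice.
    return "application/xhtml+xml" if q_xhtml > q_html else "text/html"
```

With this sketch, the three Accept headers quoted in this thread yield text/html, application/xhtml+xml and text/html respectively, and a header such as "Accept: image/png" yields None (i.e. a 406).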


> I would prefer that we reference TAG findings or other relevant sources  
> for more details.
>
>>
>>
>>> When a document does NOT adhere to the guidelines, it SHOULD NOT be  
>>> delivered as media type text/html. If such documents need to be  
>>> delivered to requestors who do not explicitly support the XHTML  
>>> family, those documents should be transformed into valid HTML and then  
>>> delivered as such.
>>
>> Documents that *do* adhere to the guidelines aren't valid HTML. Why do  
>> documents that don't need to be transformed into valid HTML instead of,  
>> say, be transformed into XHTML that adheres to the guidelines?
>
> Interesting point.  I have added an issue to the agenda for this week to  
> discuss it.  Certainly you are correct that such a document would not  
> validate as HTML - its DOCTYPE is wrong for one thing.  I don't think  
> the goal of the compatibility guidelines has ever been that people  
> deliver valid HTML when falling back - rather that they deliver valid  
> XHTML that will "work" in current HTML user agents.
>
>>
>>
>>> Note: It is possible that in the future XHTML Modularization will  
>>> define rules for indicating which specific XHTML family members are  
>>> supported by a requestor (e.g., via the profile parameter of the media  
>>> type in the Accept header). Such rules, when used in conjunction with  
>>> the "quality" parameter of the media type could help a server  
>>> determine which of several versions of a document to deliver.
>>
>> Well we could start with getting the q parameter right... :-)
>>
>> In any case, why would it be useful to know if a UA claims to support a  
>> specific XHTML family member? What would you do with that information?
>
> Added some clarifying text.

I'm still curious though why it would be useful to customize the content  
for specific classes of user agents. Could you give an example? (Not  
necessarily in the Note -- I just want to know.)


>>> 3.1. 'text/html'
>>>
>>>
>>> "5.2.2 Specifying the character encoding" of the HTML 4 specification  
>>> [HTML4] also notes that "user agents must not assume any default value  
>>> for the "charset" parameter". Therefore, authors SHOULD NOT assume any  
>>> default value for an XHTML document served as 'text/html', and as  
>>> mentioned in [RFC2854], the use of an explicit charset parameter is  
>>> STRONGLY RECOMMENDED. When it is difficult to specify an explicit  
>>> charset parameter through a higher-level protocol (e.g., HTTP),  
>>> authors SHOULD include the XML declaration (e.g., <?xml version="1.0"  
>>> encoding="EUC-JP"?>) and a meta http-equiv statement (e.g. <meta  
>>> http-equiv="Content-Type" content="text/html; charset=EUC-JP" />). See  
>>> guideline 9 for details.
>>
>> This is giving the opposite advice from A.1, which says to omit the XML  
>> declaration and, as a consequence, use UTF-8 or UTF-16 when it is  
>> difficult to specify an explicit charset parameter through a  
>> higher-level protocol.
>>
>> Which advice is correct?
>
> Appendix A is updated.  Nice catch.
>>
>>
>>> 3.2. 'application/xhtml+xml'
>>>
>>> The 'application/xhtml+xml' media type [RFC3236] is the primary media  
>>> type for XHTML Family document types, and in particular it is suitable  
>>> for all XHTML Host Language document types. XHTML Family document  
>>> types suitable for this media type include [XHTML1], [XHTMLBasic],  
>>> [XHTML11] and [XHTML+MathML]. An XHTML Host Language document type  
>>> that adds elements and attributes from foreign namespaces MAY identify  
>>> its profile with the 'profile' optional parameter or other means such  
>>> as the "Content-features" MIME header described in RFC 2912 [RFC2912].  
>>> Each namespace SHOULD be explicitly identified through namespace  
>>> declaration [XMLNS]. This document does not preclude the registration  
>>> of its own media type for specific XHTML Host Language document type.
>>>
>>> In general, this media type is NOT suitable for XHTML Integration Set  
>>> document types. This document does not define which media type should  
>>> be used for XHTML Integration Set document types.
>>
>> Why mention XHTML Integration Set document types at all?
>
> Because someone will ask.

Fair enough.


>>> Generic XML processors might recognize it as just an XML document  
>>> which includes elements and attributes from the XHTML namespace (and  
>>> others), and may not have a priori knowledge what to do with such a  
>>> document beyond they can do for generic XML documents.
>>
>> I think "XML processors" isn't what is meant here. An XML processor  
>> alone wouldn't constitute a UA and by definition has no knowledge of  
>> XHTML.
>>
>> Assuming s/processors/UAs/, how is this different from generic XML UAs  
>> processing application/xhtml+xml? Why do authors need to know this?
> We changed the term - we are talking about User Agents that use XML  
> natively.

My questions remain, though.


>>> Authors SHOULD explicitly identify the XHTML namespace through the  
>>> namespace declaration when they serve an XHTML Family document as  
>>> 'application/xml' to facilitate the chance for reliable processing.
>>
>> Um. Isn't this always required? "facilitate the chance for reliable  
>> processing"? Is there a chance that it will fail? What is unreliable?  
>> If you don't include it, it won't be interpreted as XHTML; if you do,  
>> it will.
>
> Expanded.

What's the new text?
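
For reference, the point that the namespace declaration alone decides whether the markup is XHTML can be sketched with a namespace-aware XML parser. This uses Python's standard xml.etree.ElementTree; the helper name root_is_xhtml is made up for this example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def root_is_xhtml(document):
    """True if the root element is 'html' in the XHTML namespace."""
    # ElementTree reports namespaced element names as "{namespace}localname".
    return ET.fromstring(document).tag == "{%s}html" % XHTML_NS


with_ns = '<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>'
without_ns = "<html><body/></html>"

print(root_is_xhtml(with_ns))     # True: element is in the XHTML namespace
print(root_is_xhtml(without_ns))  # False: just an element named "html"
```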


>>> The XML stylesheet PI SHOULD be used to associate style sheets.
>>
>> Why?
>
> Expanded.

It now says:

    Authors should use the XML stylesheet PI when serving content using
    this media type, since a Generic XML User Agent is unlikely interpret
    the XHTML link elements or style elements correctly.

It's true -- but they won't interpret anything else "correctly" either,  
which will render the document useless even if the style sheets are  
applied. In fact, if the style sheets are applied, it will most likely be  
a complete mess, since most style sheets assume the default UA style sheet  
being in place. So it's not clear to me why it matters whether non-XHTML  
UAs will apply style sheets or not.

Also, it's not clear to me why this is specific to application/xml and  
doesn't apply to application/xhtml+xml. Surely the same holds true for  
generic XML UAs processing application/xhtml+xml.


>>> Whenever appropriate, 'application/xhtml+xml' SHOULD be used rather  
>>> than 'application/xml'.
>>
>> Why?
>
> Added some text.

It now says:

    Whenever appropriate, 'application/xhtml+xml' should be used rather
    than 'application/xml' so that the user agent can know, via the media
    type, the inherent semantics of the markup language.

I'm a bit confused about this paragraph. The user agent always knows the  
inherent semantics of XHTML (if it supports it). However, it cannot know  
that an application/xhtml+xml resource is in fact XHTML. It could be  
anything.


>>> As for character encoding issues, "3.2 Application/xml Registration"  
>>> of [RFC3023] says that "the use of the charset parameter is STRONGLY  
>>> RECOMMENDED", and also specifies a rule that "[i]f an application/xml  
>>> entity is received where the charset parameter is omitted, no  
>>> information is being provided about the charset by the MIME  
>>> Content-Type header". This means that conforming XML processors MUST  
>>> follow the requirements described in section 4.3.3 of [XML10].
>>>
>>> Therefore, while it is STRONGLY RECOMMENDED to specify an explicit  
>>> charset parameter through a higher-level protocol, authors SHOULD  
>>> include the XML declaration (e.g. <?xml version="1.0"  
>>> encoding="EUC-JP"?>). Note that a meta http-equiv statement will not  
>>> be recognized by XML processors, and while authors MAY include such a  
>>> statement a statement in an XHTML document served as 'application/xml'  
>>> it will not effect processing of the document since the higher level  
>>> protocol and the XML PI both take precedence.
>>
>> "Take precedence" makes it sound like the meta would do something when  
>> the higher level protocol doesn't say anything and the XML declaration  
>> is absent. It does not.
>>
>
> Fixed.

It now says:

    Note that a meta http-equiv statement will not be recognized by Generic
    XML User Agents, ...

This is also true for XHTML UAs, at least as far as character encoding is  
concerned.
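
A minimal sketch of this, using Python's standard xml.etree.ElementTree (backed by expat) as a stand-in for an XML processor. ISO-8859-1 replaces the Note's EUC-JP here because expat handles it natively; the document content and the helper name parses are made up for this example.

```python
import xml.etree.ElementTree as ET

# An XHTML document whose only encoding hint is a meta http-equiv element.
# The byte 0xE9 is "e with acute accent" in ISO-8859-1 and is invalid as UTF-8 here.
BODY = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />'
    b'</head><body><p>caf\xe9</p></body></html>'
)


def parses(document):
    """True if the XML parser accepts the byte string as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


with_declaration = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY
without_declaration = BODY

print(parses(with_declaration))     # True: the XML declaration sets the encoding
print(parses(without_declaration))  # False: the meta is ignored, UTF-8 is assumed
```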


>> Why is application/xhtml+xml "MAY" for XHTML Family (HTML 4 compatible)  
>> but "SHOULD" for other XHTML?
>
> Because for other XHTML, delivering as anything else doesn't make much  
> sense. There will be other namespaces included, or it will be otherwise  
> incompatible with HTML user agents so there is no reason for it to be  
> delivered as "text/html".

Right, but I didn't ask about text/html. Put another way, why is the  
application/xhtml+xml row not either "should" or "may" for all of XHTML?

If I recall correctly, an earlier paragraph said that authors "should" use  
text/html or application/xhtml+xml rather than application/xml or  
text/xml, so, to match, the table should say "should" for text/html and  
application/xhtml+xml for HTML 4 compatible (or change the prose to match  
the table).


>>> A.8. Fragment Identifiers
>>>
>>> DO use the id attribute to identify elements.
>>>
>>> DO ensure that the values used for the id attribute are limited to the  
>>> pattern [A-Za-z][A-Za-z0-9:_.-]*.
>>>
>>> DO NOT use the name attribute to identify elements, even in languages  
>>> that permit the use of name such as XHTML 1.0.
>>
>> Why not allow to use both?
>
> As explained in the rationale, it is redundant and unnecessary.

A lot of things are redundant and unnecessary. Is it the job of this Note  
to ban things that are redundant and unnecessary even if they aren't  
related to HTML vs. XHTML compat?


> Moreover, @name is not supported in XHTML Family markup languages other  
> than XHTML 1.0.

So?


>>> Rationale: In HTML 3.2 and earlier the name attribute on some elements  
>>> could be used to define an anchor, but HTML 4 introduced the id  
>>> attribute. In an XML dialect, only attributes with type ID are  
>>> permitted to be used as anchors, and the id attribute is defined to be  
>>> of type ID. Relying upon the id attribute as an anchor will work well  
>>> in modern HTML and XHTML-aware user agents.
>>
>>> A.15. Formfeed Character in HTML vs. XML
>>>
>>> DO NOT use the formfeed character (U+000C).
>>>
>>> Rationale: This character is recognized as white space in HTML 4, but  
>>> is NOT considered white space in XML.
>>
>> Where is it said that U+000C is whitespace in HTML 4?
>>
>> In the SGML declaration for HTML 4 I find:
>>
>>          FUNCTION
>>                   RE            13
>>                   RS            10
>>                   SPACE         32
>>                   TAB SEPCHAR    9
>>
>> ...which seems to suggest that only U+000A, U+000D, U+0020 and U+0009  
>> are whitespace.
>>
>>
>> Also, not only is it not considered whitespace in XML, it's not  
>> well-formed XML.
> The SGML declaration for HTML 4 is inconsistent with the HTML 4  
> recommendation,

It's *part* of the recommendation. I agree that HTML 4 is inconsistent  
with itself, though.


> which explicitly states that form feed is a whitespace character -  
> http://www.w3.org/TR/html401/struct/text.html#didx-white_space-1
>
> So we tell people to not use it at all - that way there is no  
> incompatibility risk.

Right. But it's already a forbidden character in XML so it seems a bit  
redundant...
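
A quick way to check this, sketched with Python's standard xml.etree.ElementTree (the helper name is_well_formed is made up for this example):

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the XML parser accepts the string as well-formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<p>a b</p>"))      # True
print(is_well_formed("<p>a\x0cb</p>"))   # False: U+000C is not an allowed XML 1.0 character
```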


Cheers,
-- 
Simon Pieters
Opera Software
Received on Wednesday, 22 October 2008 17:16:11 UTC