- From: Simon Pieters <simonp@opera.com>
- Date: Mon, 01 Sep 2008 19:21:02 +0200
- To: "public-xhtml2@w3.org" <public-xhtml2@w3.org>
Summary and general comments:

This Note is pretty verbose for what it says and I find it relatively confusing and contradictory at points. I think that spec overall is pretty poorly written. I also don't understand the motivation behind this Note. It gives references to "some user agents" but doesn't call out what they are, so it's hard to know if it is of any relevance. It gives some bogus advice and some wrong advice and some rationale is wrong or unrelated to the advice. Some important advice authors would need when working with both HTML and XHTML is missing.

Quoting http://www.w3.org/MarkUp/2008/ED-xhtmlmime-20080827/

> Abstract
>
> This document summarizes the current best practice for using various
> Internet media types when serving XHTML Family documents to relatively
> modern user agents - even those that do not yet support XHTML natively.
> In summary, 'application/xhtml+xml' SHOULD be used for XHTML Family
> documents, and the use of 'text/html' SHOULD be limited to
> HTML-compatible XHTML Family documents intended for delivery to user
> agents that do not explcitly state in their HTTP Accept header that they
> accept 'application/xhtml+xml'. The media types 'application/xml' and
> 'text/xml' MAY also be used, but whenever appropriate,
> 'application/xhtml+xml' or 'text/html' SHOULD be used rather than those
> generic XML media types.
>
> Note that, because of the lack of explicit support for XHTML (and XML in
> general) in some user agents, only very careful construction of
> documents can ensure their portability (see Appendix A). If you do not
> require the advanced features of XHTML Family markup languages (e.g.,
> XML DOM, XML Validation, extensibility via XHTML Modularization,
> semantic markup via XHTML+RDFa, Assistive Technology access via the
> XHTML Role and XHTML Access modules, etc.), you may want to consider
> using HTML 4.01 [HTML] in order to reduce the risk that content will not
> be portable to HTML user agents.
> Even in that case authors can help
> ensure their portability AND ease their eventual migration to the XHTML
> Family by ensuring their documents are valid [VALIDATOR] and by
> following the relevant guidelines in Appendix A.

This abstract sucks. It shouldn't use RFC2119 terms. It shouldn't summarize the spec. It shouldn't give notes or advice about things. It shouldn't contain references or pointers. It should describe in abstract terms what the Note does and why it exists. e.g. "This Note contains advice about how to serve XHTML markup to different UAs and advice on how such markup should look in order to work as intended in common UAs when served with different media types." would be a better abstract. Better still would be to also explain why anyone would want to do so (instead of just using HTML or just XHTML).

> 2. Terms and Definitions
>
> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
> "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
> document are to be interpreted as described in RFC 2119 [RFC2119].

This document isn't normative. Why reference RFC2119 at all? I'd suggest to remove and use non-RFC2119 terms throughout to avoid confusion.

> 3. Recommended Media Type Usage
>
> This section summarizes which Internet media type SHOULD be used for
> which XHTML Family document for which purpose.
>
> A combination of these rules, in conjunction with a careful examination
> of the HTTP Accept header, can be useful in determining which media type
> to use when a document adheres to the guidelines in Appendix A.
> Specifically:
>
> 1. if the Accept header explicitly contains application/xhtml+xml
> deliver the document using that media type.

This is not appropriate since it doesn't consider the q parameter, nor does it consider wildcards. Consider:

    Accept: text/html, application/xhtml+xml; q=0

...or:

    Accept: application/*, text/*; q=0.5

> 2. Otherwise, if the Accept header contains text/html, deliver the
> document using that media type.
> 3. Otherwise, deliver the document using media type text/html.

Step 2 can be struck.

> In other words, requestors that advertise they support XHTML family
> documents will receive the document in the XHTML media type, and all
> other requestors will receive the document using the HTML media type.

This is not appropriate when the UA Accepts neither (should give a 406).

> When a document does NOT adhere to the guidelines, it SHOULD NOT be
> delivered as media type text/html. If such documents need to be
> delivered to requestors who do not explicitly support the XHTML family,
> those documents should be transformed into valid HTML and then delivered
> as such.

Documents that *do* adhere to the guidelines aren't valid HTML. Why do documents that don't need to be transformed into valid HTML instead of, say, be transformed into XHTML that adheres to the guidelines?

> Note: It is possible that in the future XHTML Modularization will define
> rules for indicating which specific XHTML family members are supported
> by a requestor (e.g., via the profile parameter of the media type in the
> Accept header). Such rules, when used in conjunction with the "quality"
> parameter of the media type could help a server determine which of
> several versions of a document to deliver.

Well we could start with getting the q parameter right... :-)

In any case, why would it be useful to know if a UA claims to support a specific XHTML family member? What would you do with that information?

> 3.1. 'text/html'
>
> The 'text/html' media type [RFC2854] is primarily for HTML, not for
> XHTML. In general, this media type is NOT suitable for XHTML except when
> the XHTML is carefully constructed (see Appendix A. In particular,
> 'text/html' is NOT suitable for XHTML Family document types that add
> elements and attributes from foreign namespaces, such as XHTML+MathML
> [XHTML+MathML].
>
> XHTML documents served as 'text/html' will not be processed as XML
> [XML10], e.g. well-formedness errors may not be detected by user agents.
> Also be aware that HTML rules will be applied for DOM and style sheets
> (see guidelines 11 and 13).
>
> Authors should also be careful about character encoding issues. A
> typical misunderstanding is that since an XHTML document is an XML
> document, the character encoding of an XHTML document should be treated
> as UTF-8 or UTF-16 in the absence of an explicit character encoding
> information. This is NOT the case when an XHTML document is served as
> 'text/html'. "6. Charset default rules" of [RFC2854] notes as follows:
>
> The use of an explicit charset parameter is strongly recommended.
> While [MIME] specifies "The default character set, which must be assumed
> in the absence of a charset parameter, is US-ASCII." [HTTP] Section
> 3.7.1, defines that "media subtypes of the 'text' type are defined to
> have a default charset value of 'ISO-8859-1'". Section 19.3 of [HTTP]
> gives additional guidelines. Using an explicit charset parameter will
> help avoid confusion.
>
> Using an explicit charset parameter also takes into account that
> the overwhelming majority of deployed browsers are set to use something
> else than 'ISO-8859-1' as the default; the actual default is either a
> corporate character encoding or character encodings widely deployed in a
> certain national or regional community. For further considerations,
> please also see Section 5.2 of [HTML40].
>
> "5.2.2 Specifying the character encoding" of the HTML 4 specification
> [HTML4] also notes that "user agents must not assume any default value
> for the "charset" parameter". Therefore, authors SHOULD NOT assume any
> default value for an XHTML document served as 'text/html', and as
> mentioned in [RFC2854], the use of an explicit charset parameter is
> STRONGLY RECOMMENDED.
> When it is difficult to specify an explicit
> charset parameter through a higher-level protocol (e.g., HTTP), authors
> SHOULD include the XML declaration (e.g., <?xml version="1.0"
> encoding="EUC-JP"?>) and a meta http-equiv statement (e.g. <meta
> http-equiv="Content-Type" content="text/html; charset=EUC-JP" />). See
> guideline 9 for details.

This is giving the opposite advice from A.1, which says to omit the XML declaration and, as a consequence, use UTF-8 or UTF-16 when it is difficult to specify an explicit charset parameter through a higher-level protocol. Which advice is correct?

> 3.2. 'application/xhtml+xml'
>
> The 'application/xhtml+xml' media type [RFC3236] is the primary media
> type for XHTML Family document types, and in particular it is suitable
> for all XHTML Host Language document types. XHTML Family document types
> suitable for this media type include [XHTML1], [XHTMLBasic], [XHTML11]
> and [XHTML+MathML]. An XHTML Host Language document type that adds
> elements and attributes from foreign namespaces MAY identify its profile
> with the 'profile' optional parameter or other means such as the
> "Content-features" MIME header described in RFC 2912 [RFC2912]. Each
> namespace SHOULD be explicitly identified through namespace declaration
> [XMLNS]. This document does not preclude the registration of its own
> media type for specific XHTML Host Language document type.
>
> In general, this media type is NOT suitable for XHTML Integration Set
> document types. This document does not define which media type should be
> used for XHTML Integration Set document types.

Why mention XHTML Integration Set document types at all?

> 'application/xhtml+xml' SHOULD be used for serving XHTML documents to
> XHTML user agents (agents that explicitly indicate their support for
> this media type).
> Authors who wish to support both XHTML and HTML user
> agents MAY utilize content negotiation by serving carefully constructed
> XHTML documents both as 'text/html' and as 'application/xhtml+xml'.
> Alternately, authors may serve HTML versions of such documents as
> 'text/html' and XHTML versions as 'application/xhtml+xml'. Also note
> that it is not necessary for XHTML documents served as
> 'application/xhtml+xml' to follow the HTML4 Compatibility Guidelines.
>
> When serving an XHTML document with this media type, authors MAY include
> the XML stylesheet processing instruction [XMLstyle] to associate style
> sheets. This is not generally necessary when documents are to be
> processed by XHTML-aware user agents, but generic XML document
> processors may handle such processing instructions.
>
> As for character encoding issues, as mentioned in "6. Charset default
> rules" of [RFC3236], 'application/xhtml+xml' has the same considerations
> as 'application/xml'. See section 3.3 for details.

> 3.3. 'application/xml'
>
> The 'application/xml' media type [RFC3023] is a generic media type for
> XML documents, and the definition of 'application/xml' does not preclude
> serving XHTML documents as that media type. Any XHTML Family document
> MAY be served as 'application/xml'.
>
> However, authors should be aware that such a document may not always be
> processed as XHTML (e.g. hyperlinks may not be recognized), depending on
> user agents.

This sounds like XHTML UAs would be confused upon getting application/xml.

> Generic XML processors might recognize it as just an XML document which
> includes elements and attributes from the XHTML namespace (and others),
> and may not have a priori knowledge what to do with such a document
> beyond they can do for generic XML documents.

I think "XML processors" isn't what is meant here. An XML processor alone wouldn't constitute a UA and by definition has no knowledge of XHTML.
Assuming s/processors/UAs/, how is this different from generic XML UAs processing application/xhtml+xml? Why do authors need to know this?

> Authors SHOULD explicitly identify the XHTML namespace through the
> namespace declaration when they serve an XHTML Family document as
> 'application/xml' to facilitate the chance for reliable processing.

Um. Isn't this always required? "facilitate the chance for reliable processing"? Is there a chance that it will fail? What is unreliable? If you don't include it, it won't be interpreted as XHTML; if you do, it will.

> The XML stylesheet PI SHOULD be used to associate style sheets.

Why?

> Whenever appropriate, 'application/xhtml+xml' SHOULD be used rather than
> 'application/xml'.

Why?

> As for character encoding issues, "3.2 Application/xml Registration" of
> [RFC3023] says that "the use of the charset parameter is STRONGLY
> RECOMMENDED", and also specifies a rule that "[i]f an application/xml
> entity is received where the charset parameter is omitted, no
> information is being provided about the charset by the MIME Content-Type
> header". This means that conforming XML processors MUST follow the
> requirements described in section 4.3.3 of [XML10].
>
> Therefore, while it is STRONGLY RECOMMENDED to specify an explicit
> charset parameter through a higher-level protocol, authors SHOULD
> include the XML declaration (e.g. <?xml version="1.0"
> encoding="EUC-JP"?>). Note that a meta http-equiv statement will not be
> recognized by XML processors, and while authors MAY include such a
> statement a statement in an XHTML document served as 'application/xml'
> it will not effect processing of the document since the higher level
> protocol and the XML PI both take precedence.

"Take precedence" makes it sound like the meta would do something when the higher level protocol doesn't say anything and the XML declaration is absent. It does not.

> 3.4. 'text/xml'
>
> The 'text/xml' media type [RFC3023] is an another generic media type for
> XML documents, and the definition of 'text/xml' does not preclude
> serving XHTML documents as that media type, either. Any XHTML Family
> document MAY be served as 'text/xml'. The considerations for
> 'application/xml' also apply to 'text/xml'. Whenever appropriate,
> 'application/xhtml+xml' SHOULD be used rather than 'text/xml'.
>
> Authors should also be aware of the difference between 'application/xml'
> (and for that matter 'application/xhtml+xml' as well) and 'text/xml'
> with regard to the treatment of character encoding. According to "3.1
> Text/xml Registration" of [RFC3023], "if a text/xml entity is received
> with the charset parameter omitted, MIME processors and XML processors
> MUST use the default charset value of "us-ascii"[ASCII]". This default
> value is authoritative over the encoding information specified in the
> XML declaration, or the XML default encodings of UTF-8 and UTF-16 when
> no encoding declaration is supplied, so omitting the charset parameter
> of a 'text/xml' entity might cause an unexpected result. As mentioned in
> [RFC3023], the use of the charset parameter is STRONGLY RECOMMENDED.

> 3.5. Summary
>
> The following table summarizes recommendation to content authors for
> labeling XHTML documents. HTML 4 is also listed for comparison.
>
> Media types summary for serving XHTML documents
>
> Media type            | HTML 4   | XHTML Family        | XHTML Family | XHTML Family
>                       |          | (HTML 4 compatible) | (other)      | + Extensions
> ----------------------+----------+---------------------+--------------+-------------
> text/html             | SHOULD   | MAY                 | SHOULD NOT*  | SHOULD NOT
> application/xhtml+xml | MUST NOT | MAY                 | SHOULD       | SHOULD
> application/xml       | MUST NOT | MAY                 | MAY          | MAY
> text/xml              | MUST NOT | MAY                 | MAY          | MAY
>
> * However, see transformation.

Why is application/xhtml+xml "MAY" for XHTML Family (HTML 4 compatible) but "SHOULD" for other XHTML?

> Appendix A. Compatibility Guidelines
>
> This appendix summarizes design guidelines for authors who wish their
> XHTML documents to render on both XHTML-aware and modern HTML user
> agents. The purpose of providing these guidelines is to supply a simple
> collection that, if followed, will give reasonable, predictable results
> in modern user agents. Document authors should treat these as best
> practices that were considered correct at the time this document was
> published. Like all of this document, this Appendix is informative. It
> contains no absolute requirements, and should NEVER be used as the basis
> for creating conformance nor validation rules of any sort. Period.

Heh, the last part of this paragraph is pretty funny when one has the earlier RFC2119 reference in mind.

> For an example document that reflect the use of the guidelines from this
> section, see Appendix B.

> A.1. Processing Instructions and the XML Declaration
>
> DO NOT include XML processing instructions NOR the XML declaration.
>
> Rationale: Some HTML user agents render XML processing instructions.

Namely some crappy mobile browsers, AFAIK.

> Also, some user agents interpret the XML declaration to mean that the
> document is unrecognized XML rather than HTML.

Which ones? I don't know of any.

> Such user agents may not render the document as expected. For
> compatibility with these types of HTML browsers, you should avoid using
> processing instructions and XML declarations.

You forgot the most important rationale: it makes IE6 trigger quirks mode.

> Consequence: Remember, however, that when the XML declaration is not
> included in a document, AND the character encoding is not specified by a
> higher level protocol such as HTTP, the document can only use the
> default character encodings UTF-8 or UTF-16. See, however, guideline 9
> below.

> A.2. Elements that can never have content
>
> If an element has an EMPTY content model DO use the minimized tag syntax
> permitted by XML (e.g., <br />).
> DO NOT use the alternative syntax
> (e.g., <br></br>) allowed by XML, since this may be unsupported by HTML
> user agents.

What do you mean with "unsupported"? AFAIK, </br> is treated as <br> in HTML UAs and other end tags are ignored.

> Also, DO include a space before the trailing / and >.

Why? AFAIK, this is only a problem for NS4 when not using any attributes (<br/> would be treated as an element "br/" instead of "br"). Considering that NS4 is irrelevant, this advice could well be dropped.

> Empty elements in the XHTML family include: area, base, basefont, br,
> col, hr, img, input, isindex, link, meta, and param.
>
> Rationale: HTML user agents ignore the /> at the end of a tag, but
> without it they may incorrectly parse the tag or its attributes. HTML
> user agents also may not recognize the alternate syntax permitted by XML.

> A.3. Elements that have no content
>
> If an element permits content (e.g., the p element) but an instance of
> that element has no content (e.g., an empty paragraph), DO NOT use the
> "minimized" tag syntax (e.g., <p />).
>
> Rationale: HTML user agents may give uncertain results when using the
> the minimized syntax permitted by XML when an element has no content.

They give very certain results, AFAIK: they uniformly ignore the slash. Why would anyone want to use an empty paragraph anyway? Isn't <script> a better example?

> A.4. Embedded Style Sheets and Scripts
>
> DO use external style sheets if your style sheet uses < or & or ]]> or
> --. DO NOT use an internal stylesheet if the style rules contain any of
> the above characters.
> DO use external scripts if your script uses < or & or ]]> or --. DO NOT
> embed a script in a document if it contains any of these characters.

Why? If you use < and/or &, you could just use

    <script>//<![CDATA[ ... //]]></script>

or

    <style>/*<![CDATA[*/ ... /*]]>*/</style>

If you use ]]> you could just escape it as ]]\>. Why is -- a problem?
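
The CDATA wrapping suggested above can be checked on the XML side with a small sketch. It assumes Python's xml.etree.ElementTree as a stand-in for an XML UA's parser, and the script body and helper names are made up for illustration. It says nothing about how any particular browser's HTML parser treats the same markup.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical inline script that uses both "<" and "&".
SCRIPT_BODY = "if (1 < 2 && window.top) { document.title = 'ok'; }"


def make_doc(script_content: str) -> str:
    """Build a tiny XHTML document around the given <script> content."""
    return (
        '<html xmlns="%s"><head><title>t</title>'
        '<script type="text/javascript">%s</script>'
        "</head><body><p>x</p></body></html>" % (XHTML_NS, script_content)
    )


def is_well_formed(doc: str) -> bool:
    """Return True if doc parses as XML, False on a well-formedness error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


def script_text(doc: str) -> str:
    """Parse doc as XML and return the text content of its <script> element."""
    root = ET.fromstring(doc)
    script = root.find(".//{%s}script" % XHTML_NS)
    return script.text or ""


# Bare "<" and "&" inside the element are well-formedness errors in XML.
bare = make_doc(SCRIPT_BODY)

# The CDATA markers sit behind "//" so a script engine sees them as comments.
wrapped = make_doc("//<![CDATA[\n%s\n//]]>" % SCRIPT_BODY)

print(is_well_formed(bare))
print(is_well_formed(wrapped))
print(script_text(wrapped))
```

In this sketch the unwrapped document is rejected by the XML parser, while the wrapped one parses and the script text comes back unchanged between the two "//" comment lines.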

> Rationale: XML parsers are permitted to silently remove the contents of
> comments. Therefore, the historical practice of "hiding" scripts and
> style sheets within "comments" to make the documents backward compatible
> may not work as expected in XML-based user agents.

This is (1) bogus and (2) not a rationale for the advice.

(1) because even if the XML parser didn't remove the comment, the UA still wouldn't pass on the comment to the scripting engine. The UA would only pass on child nodes that are text nodes or CDATA nodes and skip any comments.

(2) this is a rationale for not using <script><!-- ... //--></script>

> @@@@Put a real example in here that works, and one that does not work@@@@

> A.5. Line Breaks within Attribute Values
>
> DO ensure that attribute values are on a single line and only use single
> whitespace characters. DO NOT use line breaks and multiple consecutive
> white space characters within attribute values.

I understand linebreaks but why only a single whitespace? Also it would be good to give advice about what to do if you actually want a linebreak or a tab in an attribute value (i.e. use character references).

> Rationale: These are handled inconsistently by user agents.

Or rather: XML requires whitespace to be normalized to spaces. Or is there inconsistency that I don't know about?

> A.7. The lang and xml:lang Attributes

Where's A.6?

> DO use both lang and xml:lang attributes when specifying the language of
> an element in markup languages that support the use of both.
>
> DO NOT use the only the lang attribute, even in languages that include
> it such as XHTML 1.0.
>
> Rationale: HTML 4 documents use the lang attribute to identify the
> language of an element. XML documents use the xml:lang attribute. CSS
> has a "lang" pseudo selector that automatically uses the appropriate
> attribute depending on the document type. Therefore, specifying both
> attributes ensures that single CSS selectors will work in both modes.

"document type" makes it sound like you're talking about the doctype, which has no effect.

> A.8. Fragment Identifiers
>
> DO use the id attribute to identify elements.
>
> DO ensure that the values used for the id attribute are limited to the
> pattern [A-Za-z][A-Za-z0-9:_.-]*.
>
> DO NOT use the name attribute to identify elements, even in languages
> that permit the use of name such as XHTML 1.0.

Why not allow to use both?

> Rationale: In HTML 3.2 and earlier the name attribute on some elements
> could be used to define an anchor, but HTML 4 introduced the id
> attribute. In an XML dialect, only attributes with type ID are permitted
> to be used as anchors, and the id attribute is defined to be of type ID.
> Relying upon the id attribute as an anchor will work well in modern HTML
> and XHTML-aware user agents.

> A.9. Character Encoding
>
> DO encode your document in UTF-8 or UTF-16. When delivering the document
> from a server, DO set the character encoding for a document via the
> charset parameter of the HTTP Content-Type header. When not delivering
> the document from a server, DO set the encoding via a "meta http-equiv"
> statement in the document (e.g., <meta http-equiv="Content-Type"
> content="text/html; charset=EUC-JP" />). However, note that doing so
> will explicitly bind the document to an a single content type.

Interesting to repeatedly give EUC-JP as example for authors to copy-and-paste when the advice is to use UTF-8 or UTF-16.

> Rationale: Since these guidelines already recommend that documents NOT
> contain the XML declaration, setting the encoding via the HTTP header is
> the only reliable mechanism compatible with HTML and XML user agents.
> When that mechanism is not available, the only portable fallback is the
> "meta http-equiv" statement.

> A.10. Boolean Attributes
>
> DO use the full form for boolean attributes, as required by XML (e.g.,
> disabled="disabled").
> Such attributes include: compact, nowrap, ismap,
> declare, noshade, checked, disabled, readonly, multiple, selected,
> noresize, and defer.

Isn't valid XHTML the baseline?

> Rationale: The compact form of these attributes is not well formed XML,
> and therefore invalid.

> A.11. Document Object Model and XHTML
>
> DO rely upon the HTML 4 DOM as defined in The Document Object Model
> level 1 Recommendation [DOM] for scripting. This means, in particular,
> that the names of elements and attributes will be returned (from
> functions that return such things) in upper case.
>
> Rationale: Using the HTML DOM will result in maximum portability of
> scripts, since the HTML DOM is supported in both HTML and XHTML
> documents in modern user agents.

Um. This seems bogus. They won't return uppercase in application/xhtml+xml. And why DOM Level 1?

> A.12. Using Ampersands
>
> DO ensure that when content or attribute values contain the reserved
> character & it is used in its escaped form &amp;.
>
> Rationale: If ampersands are not encoded, the characters after them up
> to the next semi-colon can be interpreted as the name of a entity by the
> user agent.

Or, more likely, the document isn't well-formed.

> @@@@add example@@@@

> A.13. Cascading Style Sheets (CSS) and XHTML
>
> DO use lower case element and attribute names in style sheets. DO create
> rules that include inferred elements (e.g., the tbody element in a
> table).
>
> Rationale: These simple rules will help increase the portability of CSS
> rules regardless of the media type the document is processed as.

Hmm. Including inferred elements seems like a way to be *in*compatible, since they aren't inferred in application/xhtml+xml.

> @@@@add examples@@@@

> A.14. Referencing Style Elements when serving as XML
>
> DO NOT use xml stylesheet declarations to identify style sheets.
>
> DO use the style or link elements to define stylesheets.
>
> Rationale: Since XML processing instructions may be rendered by some
> HTML user agents, using the standard XML stylesheet declaration
> mechanism may not work well. However, since XHTML user agents are
> required to process style and link elements and interpret stylesheets
> referenced from those elements, documents constructed to use them will
> work as expected.

Or more likely, XML processing instructions are dropped or parsed into comments in HTML UAs.

> A.15. Formfeed Character in HTML vs. XML
>
> DO NOT use the formfeed character (U+000C).
>
> Rationale: This character is recognized as white space in HTML 4, but is
> NOT considered white space in XML.

Where is it said that U+000C is whitespace in HTML 4? In the SGML declaration for HTML 4 I find:

    FUNCTION
        RE            13
        RS            10
        SPACE         32
        TAB SEPCHAR    9

...which seems to suggest that only U+000A, U+000D, U+0020 and U+0009 are whitespace.

Also, not only is it not considered whitespace in XML, it's not well-formed XML.

> A.16. The Named Character Reference &apos;
>
> DO use &#39; to specify an escaped apostrophe. DO NOT use &apos;.
>
> Rationale: The entity &apos; is not defined in HTML 4.

Makes sense.

I'm missing a number of guidelines. For instance:

Markup:

    Don't use the internal subset.

    Don't use CDATA sections (except in <script> and <style>).

    Don't use <noscript>.

    Don't use markup in <iframe>.

    Use explicit <tbody> if scripts or style sheets assume it's there.

    Don't use xml:base.

DOM:

    Don't use document.write() or document.writeln().

    Use createElementNS if supported and fall back to createElement. (createElement as specced doesn't match reality and there's no interop.)

    If you use application/xml, don't assume that the document will implement the HTMLDocument interface (e.g. don't use document.body). (It doesn't in Firefox.)

    When setting innerHTML, make sure the string is both well-formed and "HTML 4 compatible".

    If you don't use explicit <tbody>, make sure your script works both with and without tbody present.

CSS:

    Specify 'overflow' and 'background' on html instead of body.

    If you don't use explicit <tbody>, make sure your style sheet works both with and without tbody present.

> Appendix B. An Example Document
>
> The following is an example document that adopts the conventions
> described in Appendix A to ensure its portability among XHTML and HTML
> user agents.
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
> "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
> <head>
> <title>sample</title>
> <link href="style/style.css" rel="stylesheet" type="text/css" />
> </head>
>
> <body>
>
> <div id="main">
>
> <h1>heading</h1>
> <img src="http://www.w3.org/Icons/w3c_main" alt="W3C logo" /> <!--
> defined as an "EMPTY" element, do not use <img></img> or <img/> -->
> <p>Some material & some <!-- use escaped ampersand, & -->

Doesn't look very escaped to me.

> <br /> <!-- defined as an "EMPTY" element, do not use <br></br> or
> <br/> -->
> that should be split.</p>
> <p></p> <!-- NOT defined as an "EMPTY" element, just no content, so do
> not use <p/> nor <p /> -->
>
> <input type="reset" disabled="disabled" /> <!-- defined as an "EMPTY"
> element, do not use <hr></hr> nor <hr/> -->
>
> <hr /> <!-- defined as an "EMPTY" element, do not use <hr></hr> nor
> <hr/> -->
>
> </div>
>
> </body>
> </html>

Wow, you covered a whopping 3 guidelines in this example, one of which repeated 4 times! Or maybe more guidelines were covered without commenting them, such as A.1, A.4, A.5, A.7, A.8, A.10, and A.15.

-- 
Simon Pieters
Opera Software
Received on Monday, 1 September 2008 17:21:48 UTC