- From: Alex Milowski <alex@milowski.org>
- Date: Thu, 17 Jun 2010 14:54:06 +0100
- To: public-xml-processing-model-wg@w3.org
On Thu, Jun 17, 2010 at 2:38 PM, Norman Walsh <ndw@nwalsh.com> wrote:
> Consider the following document, served with a content type of
> "text/html" with no encoding declaration[1]:
>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> <title>Some Title</title>
> </head>
> <body>Some text with UTF-8 encoded, non-ASCII characters.</body>
> </html>
>
> Recall that the HTTP specification says that a "text/*" document
> served without an encoding is by default served as US-ASCII.

Actually, it defaults to "ISO-8859-1".

> 1. Is an XProc processor {forbidden|allowed|required} to notice the
> content-type meta header and correct the encoding?

No. We're processing XML. ;)

For the text/html media type, it would be a quality of implementation
feature to detect encodings set in a "meta" element.

> 2. Suppose it was sent by the server as "text/html; charset=US-ASCII",
> is the processor {forbidden|allowed|required} to notice the
> content-type meta header and correct the encoding in that case?

No. If you've explicitly given the character encoding on the Content-Type
header, then you should use that.

Here are the bits from the current HTML5 draft about this:

"If an HTML document does not start with a BOM, and if its encoding is
not explicitly given by Content-Type metadata, and the document is not
an iframe srcdoc document, then the character encoding used must be an
ASCII-compatible character encoding, and, in addition, if that encoding
isn't US-ASCII itself, then the encoding must be specified using a meta
element with a charset attribute or a meta element with an http-equiv
attribute in the Encoding declaration state."

> 3. Suppose the document is an HTML document, not an XHTML one. Are the
> rules different then? (I think all bets are off in that case,
> because the document must be passing through some tidy/tagsoup
> process; but what would we recommend?)

All of the above rules are for text/html.
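
As a rough illustration of the text/html precedence described above, here is a minimal Python sketch. It is only a sketch under stated assumptions, not something XProc or HTML5 mandates in this form: the function names are hypothetical, the meta scan is a crude regular expression rather than a real HTML prescan, the 1024-byte scan window is arbitrary, and checking the BOM after the Content-Type header is a choice made for this example.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only; not part of XProc or any library.
_CHARSET_PARAM = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if present."""
    match = _CHARSET_PARAM.search(content_type)
    return match.group(1).lower() if match else None


def sniff_html_encoding(content_type: str, body: bytes) -> str:
    """Pick an encoding for a text/html response body."""
    # 1. An explicit charset on the Content-Type header always wins.
    explicit = charset_from_content_type(content_type)
    if explicit:
        return explicit
    # 2. Assumption: a UTF-8 byte order mark is honoured next.
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # 3. Quality-of-implementation step: look for a meta element
    #    that declares a charset near the start of the document.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Fall back to the HTTP default for text/* mentioned above.
    return "iso-8859-1"
```

With the example document from question 1, a bare "text/html" header would lead this sketch to the meta element's utf-8, while "text/html; charset=US-ASCII" from question 2 would be taken at its word.
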
XHTML follows XML rules. If you don't send the proper charset parameter
on the Content-Type header, then you need to sniff for the encoding via
auto-detection and the XML PI.

--
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics
Received on Thursday, 17 June 2010 13:54:40 UTC