Re: Unknown text/* subtypes [i20]

On 12 Feb 2008, at 21:12, Roy T. Fielding wrote:

> The answer is that iso-8859-1 is still the most interoperable default
> *with* the addition of safe sniffing only when the charset is left
> unlabeled or when charset="iso-8859-1".  By safe sniffing, I mean
> specifically excluding any charset-switching in mid-content
> for which the text media type's delimiter set (e.g., <"':> in HTML)
> would be mapped to different octets than they are in US-ASCII.
> In other words, it is safe to sniff for charsets in the first ten
> or so characters, and also to switch to other US-ASCII supersets
> after reading something like the <meta http-equiv="content-type" ...>,
> but it is definitely unsafe to continue sniffing for charset changes
> after that point unless they are restricted to US-ASCII supersets.

ISO-8859-1 isn't actually the most interoperable default: a huge
number of documents (not just HTML documents, but also a large
number of feeds) rely on ISO-8859-1 being treated as windows-1252.
The best default is probably windows-1252 with sniffing (the exact
details of which inevitably depend on the format in question, as
what is used in HTML obviously isn't suitable for XML, for example).
I don't think it's worthwhile attempting to define what type of
sniffing (e.g., your "safe" sniffing) can be used, as it is very much
context-dependent (and if there's one thing we've learnt from this,
let it be that context is very important), and in some cases it may
be ideal to throw out what you have already parsed. However, in
defence of "safe" sniffing, HTML5 requires a partial US-ASCII superset
(to sniff the encoding from meta), and XML 1.0 implicitly requires a
superset of the encodings used for autodetection in Appendix F (when
there is no BOM).
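
To make the suggested default concrete, here is a rough sketch in
Python. It is only an illustration under assumptions of mine: the
function names, the alias set, and the choice to sniff nothing beyond
a BOM are hypothetical and not taken from any spec. Format-specific
sniffing (meta in HTML, the encoding declaration in XML) would slot in
where the BOM check is.

```python
import codecs

# Hypothetical alias set: labels that get decoded as windows-1252.
LATIN1_ALIASES = {"iso-8859-1", "iso8859-1", "latin1", "latin-1"}

# BOMs checked only when the content is unlabeled or labeled ISO-8859-1.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def pick_encoding(data, label=None):
    """Return (encoding name, number of leading BOM octets to skip)."""
    normalized = label.strip().lower() if label else None
    # An explicit label other than ISO-8859-1 is trusted as given.
    if normalized is not None and normalized not in LATIN1_ALIASES:
        return normalized, 0
    # Unlabeled or ISO-8859-1: sniff for a BOM first.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # Otherwise fall back to windows-1252 rather than true ISO-8859-1.
    return "windows-1252", 0


def decode(data, label=None):
    """Decode octets using the default-plus-sniffing rule sketched above."""
    name, skip = pick_encoding(data, label)
    # Octets undefined in the chosen encoding become U+FFFD.
    return data[skip:].decode(name, errors="replace")
```

The difference shows up with octets such as 0x93 and 0x94: decoded as
true ISO-8859-1 they are C1 control characters, whereas
decode(b"\x93hi\x94", "ISO-8859-1") yields curly quotation marks,
which is what authors of such documents almost always intended.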


--
Geoffrey Sneddon
<http://gsnedders.com/>

Received on Tuesday, 12 February 2008 23:23:38 UTC