Re: Encoding document approach [I18N-ACTION-117]

On Apr 25, 2012, at 4:12, Anne van Kesteren wrote:

> On Tue, 24 Apr 2012 21:27:22 +0200, Norbert Lindenberg <w3@norbertlindenberg.com> wrote:
>> I'm somewhat concerned about positioning the Encoding document [1] as a standard. I think it would be very helpful to describe the issues around encodings in web content and provide recommendations for handling the more commonly used encodings and use cases. The existing document has a lot of useful information in that direction. However, I don't think it's feasible to create a standard that completely prescribes the handling of all legacy encodings on the web - the swamp is just way too big.
> 
> I think I might be a bit more ambitious. I think it is feasible.
> 
> 
>> 1) The document seems to be based solely on observing the behavior of browsers. There are other user agents that access web content, such as search engines or (HTML) email processors. These operate under different constraints than browsers, including the lack of a user who could override incorrect encoding labels by selecting a different encoding. They're also more difficult to experiment with.
> 
> Indeed, so a proper standard should help them even more. I think this is largely analogous to other problems we have solved, such as HTML parsing. By defining how the majority of consumers of HTML actually consume HTML (and get them closer to each other in the process) the whole ecosystem benefits.

If you take their different constraints into consideration, then the standard may indeed help them.

>> 2) The document assumes a strict mapping from labels to encodings, and doesn't say where labels come from. This may cause readers to assume that labels are directly taken from the documents or transmission protocols. In reality, many documents on the web, and even more so in emails, are mislabeled, and so some user agents use encoding detection algorithms that interpret labels as just one of several hints. (As noted above, browsers let the user override the encoding).
> 
> This is actually not true. Only when labels are not recognized (i.e. when the "get an encoding" algorithm returns failure) do browsers resort to encoding sniffing. This is defined in detail in HTML. text/plain and HTML are the only places sniffing is required. Browsers do not resort to encoding sniffing for external scripts, style sheets, or XML.

I said "some user agents use encoding detection algorithms", not "some browsers". Search engines, for example. Has anybody verified with search engine developers whether the "Determining the character encoding" algorithm in HTML5 works for them?

>> 3) The document assumes that encodings are labeled with encoding names. In reality, some web sites rely on font encodings, sometimes with site-specific fonts, and so technologies such as the Padma extension [2] interpret font names as encoding identifiers.
> 
> I don't think this influences the architecture that is in place. You can use PUA/custom fonts and violate all kinds of standards, just like you can use display:none on your root element, but that does not mean encoders or decoders work any differently.

These are actually very different situations:

- Using the PUA doesn't violate any standard that I know of; it's just that the Unicode standard doesn't tell you how to interpret the characters, so you have to rely on a private agreement.

- Using display:none on the root element doesn't violate any standard that I know of, and in fact the CSS standard specifies how to interpret it, and the browsers with which I tested all implement it according to the standard; the only problem is that the resulting empty page is kinda meaningless.

- Using fonts that render Latin code points within an HTML document with Malayalam/Devanagari/Telugu glyphs, as these sites do, arguably violates the HTML and Unicode standards.

And still, even though font encodings arguably violate the standards, people who want to make the whole web searchable convert these font encodings to Unicode, as you can see by comparing the source of this original page with its copy in the Google cache:
http://www.manoramaonline.com/OnlinePortal/1086195281.htm
http://webcache.googleusercontent.com/search?q=cache:gUeYetcKTIkJ:www.manoramaonline.com/OnlinePortal/1086195281.htm+site:manoramaonline.com&cd=9&hl=en&ct=clnk&gl=us&client=safari
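For what it's worth, the conversion those tools perform is mechanically simple once you know which font the page uses. A Python sketch, with a deliberately invented two-entry table (real tables, such as Padma's, are per-font, much larger, and also have to deal with reordering of vowel signs):

    # Invented font-encoding table for illustration only: keys are the Latin
    # code points stored in the page, values are the characters the
    # site-specific font actually draws for them.
    FONT_TABLE = {
        ord("A"): "\u0D05",   # MALAYALAM LETTER A
        ord("B"): "\u0D06",   # MALAYALAM LETTER AA
    }

    def font_encoded_to_unicode(text):
        # Plain code-point translation; real converters also reorder prebase
        # vowel signs and handle conjuncts.
        return text.translate(FONT_TABLE)

    print(font_encoded_to_unicode("AB"))   # prints the two Malayalam letters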

>> 4) I doubt that the owners of user agents would accept the requirement "User agents must not support any other encodings or labels", which would make it impossible for them to interpret content that happens to be encoded in a different form.
> 
> If there is evidence that supporting an additional encoding is beneficial, I am sure the other user agents would be happy to support it too. The goal here is to foster interoperability and most definitely not step away from hard problems before we are even confronted with them.

So in the case of Indic font encodings, you'd have to either convince Google and Padma to stop interpreting them, or Opera and Apple to support them (where support could mean that you can search for and find Malayalam text in the page referenced above). I suspect the first two will tell you that they're glad their code works, and the other two that they have more important problems to work on.

>> 5) Similarly, I doubt that all owners of content will suddenly comply with the requirement "New content and formats must exclusively use the utf-8 encoding", and so user agents will not be able to rely on it. This should probably be aligned with HTML5 section 4.2.5.5, "Specifying the document's character encoding".
> 
> There is a difference in what content must do and what user agents must do as user agents must deal with the errors that content has. The requirement on content using utf-8 is because many APIs and new formats only work if you are using utf-8. If you use anything but utf-8 you are going to have a bad time. (Web Workers will not work, WebSocket will not work, XMLHttpRequest.send() will not work, application manifests will break, URL query parameters will be strangely encoded, form submission will have ambiguity with respect to whether &#...; was entered by the user or was an unrecognized code point, etc.)
> 
> A bug has been filed on HTML to have it make use of the Encoding Standard. If HTML cannot live with the requirements in it and they do not find the above argument persuasive enough the requirements will be reevaluated.

"Will not work" sounds pretty strong here. If an HTML page uses an encoding other than UTF-8, it seems at least the following situations can occur:

- No other format is used; it's just content.

- The other format works if the data for that format is encoded or decoded in UTF-8, independent of the encoding of the HTML page. I think that's the case for web workers, and for the string and form data cases of XMLHttpRequest.send().

- The other format works if the recipient of the data knows the encoding of the HTML page. That's the case for URL query parameters (see the sketch after this list).

- There's ambiguity, which the recipient may or may not be able to resolve. That's the &#...; case.

- The other format cannot be made to work. Any examples here? (I have to admit that I'm not familiar with all the technologies you mention.)
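To make the query-parameter and &#...; cases concrete, here's a small Python sketch (the field name and values are invented, and xmlcharrefreplace is used to mimic what browsers do on form submission):

    from urllib.parse import urlencode

    # URL query parameters: the same field is percent-encoded differently
    # depending on the page's encoding, so the recipient has to know that
    # encoding to get the text back.
    print(urlencode({"q": "café"}, encoding="utf-8"))           # q=caf%C3%A9
    print(urlencode({"q": "café"}, encoding="windows-1252"))    # q=caf%E9

    # The &#...; ambiguity: a character the page's encoding cannot represent
    # is submitted as a numeric character reference, which looks exactly like
    # a user who literally typed that string.
    print("日".encode("windows-1252", errors="xmlcharrefreplace"))   # b'&#26085;'
    print("&#26085;".encode("windows-1252"))                         # b'&#26085;' as well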

>> 6) The document generally uses the Windows extension for encodings that have been extended. For some encodings, especially Japanese and Chinese encodings, there are multiple incompatible extensions, so assuming the Windows extension may cause mojibake.
> 
> All browsers (including those on Mac) use the Windows extensions as Microsoft has been the dominant force in that area of the world for quite some time. Apart from big5 most browsers are pretty close to each other. They usually differ in a few code points, PUA exposure, labels that are supported, and error handling details.

You may be right that the Windows extensions have become the de facto standard on the web - I don't have enough data here.
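As a concrete illustration of how a Windows extension can differ from the standard encoding it is based on (using the Latin pair because it is the simplest case; the Japanese and Chinese pairs differ in many more positions), a quick Python check:

    # 0x80-0x9F are C1 control characters in iso-8859-1 proper, but curly
    # quotes, dashes, the euro sign, etc. in windows-1252.
    byte = b"\x93"
    print(hex(ord(byte.decode("iso-8859-1"))))    # 0x93, an invisible control character
    print(byte.decode("windows-1252"))            # “ (U+201C LEFT DOUBLE QUOTATION MARK)

Which is presumably part of why, as you note below, content labeled iso-8859-1 in practice relies on being decoded as windows-1252.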

>> Also, where a web application labels its pages with the name of a standard encoding (such as iso-8859-1), it may not be prepared to handle characters from the corresponding Windows encoding (here windows-1252).
> 
> As I mentioned before they *rely* on handling characters from the corresponding Windows encoding.
> 
> 
>> On Apr 22, 2012, at 13:40, Phillips, Addison wrote:
>>> 1. The document describes various character encoding schemes without placing, we feel, the correct emphasis on migrating from legacy encodings to Unicode. More attention should be paid to this and to leveraging CharMod [3].
> 
> I have tried putting emphasis by marking everything but utf-8 as legacy. Suggestions are more than welcome however.

I feel a bit uncomfortable seeing "legacy" and "UTF-16" together. I wouldn't recommend UTF-16 for transmission over networks, but it's one of the main Unicode encodings and is commonly used for in-memory processing.

>>> 2. The document proceeds from observations of how character encodings *appear* to be handled in various browsers/user-agents. Implementers may find this documentation useful, but several important user-agents are thought to be implemented in ways that are divergent from this document. We think that more direct information about character encoding conversion from implementers should be sought to form the description of various encoders/decoders.
> 
> The work is based to large extent on reverse engineering and when data is available picking the best alternative. (Though I so far have avoided specifying anything that maps to PUA.)
> 
> Getting more feedback from implementors would be great of course. I myself have a pretty good channel with Opera, and engineers from Mozilla have been filing bugs as well as making changes to Gecko. I have attempted to reach out to Chromium's expert, but have had no luck reaching him. Shawn Steele from Microsoft said he did not have resources to look at the standard and I have not approached Apple thus far.

Any representatives of search engines and email processors?

> Kind regards,
> 
> 
> -- 
> Anne van Kesteren
> http://annevankesteren.nl/
> 

Received on Saturday, 28 April 2012 03:08:46 UTC