
Re: draft-montenegro-httpbis-uri-encoding

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Mon, 24 Mar 2014 15:45:25 +0900
Message-ID: <532FD485.6030306@it.aoyama.ac.jp>
To: Nicolas Mailhot <nicolas.mailhot@laposte.net>, Mark Nottingham <mnot@mnot.net>
CC: HTTP Working Group <ietf-http-wg@w3.org>, Gabriel Montenegro <gabriel.montenegro@microsoft.com>
I have had a quick look at the proposal slides as presented at the last 
IETF, and commented on the member-i18n-core@w3.org mailing list (W3C 
member only, but some of you might be able to have a look at it).

Below are somewhat similar comments.

On 2014/03/21 16:43, Nicolas Mailhot wrote:

> Le Ven 21 mars 2014 05:36, Mark Nottingham a écrit :

>> I've heard hallway feedback about it that wonders if we just want to allow
>> one value ("UTF-8").

Distinguishing only between UTF-8 and non-UTF-8 is definitely worth 
considering. But I think it would be better to turn things around: use 
'Legacy' only to express that UTF-8 is NOT used:
    URI-Path-Encoding: Legacy
    URI-Query-Encoding: Legacy
Why is that the better choice? Because the Web is moving more and more 
towards UTF-8. Explicitly marking UTF-8 therefore means we would have to 
keep these headers around for a long time (*).
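As a sketch of how such a default-UTF-8 rule might play out on the receiving side (the header name is the one from the draft; treating 'Legacy' as Latin-1 is purely an assumption for illustration, since 'Legacy' by design names no specific encoding):

```python
# Hypothetical receiver logic: percent-decoded URI bytes are interpreted
# as UTF-8 unless the (proposed, not standardized) URI-Path-Encoding
# header says 'Legacy'.
from urllib.parse import unquote_to_bytes

def decode_path(raw_path: str, headers: dict) -> str:
    raw = unquote_to_bytes(raw_path)
    if headers.get("URI-Path-Encoding") == "Legacy":
        # 'Legacy' means "not UTF-8"; Latin-1 is just one plausible
        # fallback chosen here for the sake of the example.
        return raw.decode("iso-8859-1")
    return raw.decode("utf-8")

print(decode_path("/caf%C3%A9", {}))                            # /café
print(decode_path("/caf%E9", {"URI-Path-Encoding": "Legacy"}))  # /café
```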

Heuristics have also been mentioned in the followup discussion.
When it comes to heuristics for character encodings and UTF-8, there are 
essentially two levels:

1) UTF-8 byte pattern detection
As known since 1997 (see 
http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf), this is highly 
discriminating and reliable even for very short strings such as URI/IRI 
components, although not 100%. (The biggest chunk of false positives is 
double-encoding: data encoded in UTF-8, interpreted as e.g. Latin-1, and 
converted to UTF-8 a second time.)
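The byte-pattern check, and the double-encoding false positive, can be illustrated with a small sketch (a plain strict decode stands in for a real byte-pattern validator here):

```python
def looks_like_utf8(data: bytes) -> bool:
    # UTF-8's byte patterns are so constrained that a successful strict
    # decode is a strong (though not perfect) signal.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8 = "café".encode("utf-8")           # b'caf\xc3\xa9'
latin1 = "café".encode("iso-8859-1")    # b'caf\xe9'
# Double encoding: UTF-8 bytes misread as Latin-1, then encoded again.
double = utf8.decode("iso-8859-1").encode("utf-8")  # b'caf\xc3\x83\xc2\xa9'

print(looks_like_utf8(utf8))    # True
print(looks_like_utf8(latin1))  # False
print(looks_like_utf8(double))  # True (false positive: valid UTF-8, wrong text)
```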

2) Anything else
This has to make use of many different kinds of information, such as the 
set of possible/probable encodings, their byte patterns, linguistic 
information, and so on. Given the necessary data, it's not too difficult 
for longer texts (e.g. one paragraph and up), but mostly not feasible 
for shorter texts such as URIs.

The fact that 1) is quite reliable, i.e. that data that looks like UTF-8 
usually is UTF-8, seems to provide further support for labeling legacy 
rather than UTF-8. In fact, one proposal might be to label only data 
that looks like UTF-8 but IS NOT. The advantage would be that the header 
field is necessary in only very few cases. The disadvantage would be 
that the header field would be very rare, and therefore implementations 
wouldn't be tested seriously.
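A hedged sketch of that variant, assuming the sender knows the actual encoding out of band (the header name is taken from the draft; the function itself is hypothetical):

```python
def header_for(raw: bytes, actual_encoding: str) -> dict:
    # Emit the (proposed) header only for data that would pass a UTF-8
    # validity check despite not actually being UTF-8 -- the one case a
    # receiver's heuristic would otherwise get wrong.
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    if valid_utf8 and actual_encoding.lower() != "utf-8":
        return {"URI-Path-Encoding": "Legacy"}
    return {}  # no header needed: real UTF-8, or obviously legacy bytes

print(header_for("café".encode("utf-8"), "utf-8"))            # {}
print(header_for("café".encode("iso-8859-1"), "iso-8859-1"))  # {}
print(header_for("abc£".encode("utf-8"), "iso-8859-1"))
# {'URI-Path-Encoding': 'Legacy'}  (looks like UTF-8, but isn't)
```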

>> Beyond that, folks seem generally neutral-to-positive
>> about it, AFAICT.

The reason for this seems to be that everybody is happy when they see a 
solution to a character encoding problem. What the discussion on the 
list has shown is that it's not so clear where and when this proposal 
(in whatever form) helps, and where it doesn't help.

> The draft adds constraints to network nodes that try to do the right thing,
> and gives bad nodes a free ticket. So the incentive is to continue to
> behave non-deterministically.

There's definitely an incentive in that direction.

But if the proposal indeed has merit, and if we can get over the initial 
adoption chicken-and-egg problem (no servers support it so clients don't 
support it and vice versa), then there should also be incentives to do 
the right thing.

> I'd much rather have: anything that pretends to talk http/2 uses UTF-8
> (perhaps not even % encoded) by default, unless it specifies another
> encoding in one of the two optional headers. People who try to do anything
> other than utf-8 without specifying it get broken connections when
> something fails in the path; if they don't want to make the effort to be
> deterministic, that's their problem.

I agree that whatever we can do to speed up the (already strong) 
adoption of UTF-8 is a good thing.

> The spec can add a reserved keyword for unknown encoding (probably with
> the provision that network nodes MAY refuse connections with this keyword
> or any encoding value not specified in the spec).

That would be the above 'Legacy' value. If we go that route, it would 
probably be a good idea to register 'Legacy'.

> That avoids dragging two new headers on every connection, and makes bad
> encoding the problem not of people who rightly use utf-8 but of the
> actual bad-encoding perpetrators.

I definitely agree that, where possible, we should make those who don't 
follow the specs pay the cost.

Regards,   Martin.

(*) Essentially, we first need to make sure there's nothing other than 
UTF-8 on the Web, and then we have to change all the servers so that 
they don't assume that not getting one of these headers means the data 
is in a legacy encoding. This means two full server/content deployment 
cycles, which can easily mean 30 to 40 years altogether :-(.
Received on Monday, 24 March 2014 06:46:03 UTC
