Re: draft-montenegro-httpbis-uri-encoding

I have had a quick look at the proposal slides as presented at the last 
IETF, and commented on the member-i18n-core@w3.org mailing list (W3C 
member only, but some of you might be able to have a look at it).

Below are somewhat similar comments.

On 2014/03/21 16:43, Nicolas Mailhot wrote:

> On Fri, 21 March 2014 at 05:36, Mark Nottingham wrote:

>> I've heard hallway feedback about it that wonders if we just want to allow
>> one value ("UTF-8").

Distinguishing only between UTF-8 and non-UTF-8 is definitely worth 
considering. But I think it would be better to turn things around: only 
use 'Legacy' to express that UTF-8 is NOT used:
    URI-Path-Encoding: Legacy
    URI-Query-Encoding: Legacy
Why is that the better choice? Because the Web is moving more and more 
towards UTF-8. Explicitly marking UTF-8 therefore means we have to keep 
these headers for a long time (*).
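
For illustration (using the header names from the draft; the exact 
syntax is of course still up for discussion), a request whose path was 
percent-encoded from Latin-1 would then look something like:

    GET /caf%E9 HTTP/1.1
    Host: example.com
    URI-Path-Encoding: Legacy

whereas a request with a UTF-8 path (/caf%C3%A9) would carry no such 
header at all.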


Heuristics have also been mentioned in the follow-up discussion. When 
it comes to heuristics for character encodings and UTF-8, there are 
essentially two levels:

1) UTF-8 byte pattern detection
As known since 1997 (see 
http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf), this is highly 
discriminating and reliable even for very short strings such as URI/IRI 
components, although not 100%. (The biggest source of false positives 
is double encoding: data encoded in UTF-8, interpreted as e.g. Latin-1, 
and converted to UTF-8 a second time.)
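
As a minimal sketch of 1) (assuming all we need is a yes/no answer 
rather than the detailed statistics from the paper above), a strict 
UTF-8 decode already implements the byte pattern check, e.g. in Python:

    def looks_like_utf8(raw: bytes) -> bool:
        """True if the bytes are well-formed UTF-8.
        (Plain ASCII also passes, being a subset of UTF-8.)"""
        try:
            raw.decode('utf-8', errors='strict')
            return True
        except UnicodeDecodeError:
            return False

    # Latin-1 'e acute' (0xE9) is not valid UTF-8 ...
    assert not looks_like_utf8(b'caf\xe9')
    # ... but the UTF-8 encoding of the same string is.
    assert looks_like_utf8('café'.encode('utf-8'))
    # The double-encoding false positive mentioned above: UTF-8 bytes
    # misread as Latin-1 and re-encoded still decode cleanly.
    assert looks_like_utf8(
        'café'.encode('utf-8').decode('latin-1').encode('utf-8'))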

2) Anything else
This has to make use of many different kinds of information, such as the 
set of possible/probable encodings, their byte patterns, linguistic 
information, and so on. Given the necessary data, it's not too difficult 
for longer texts (e.g. one paragraph and up), but mostly not feasible 
for shorter texts such as URIs.
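
To illustrate 2): here is what a general-purpose detector makes of a 
URI-sized legacy string (a sketch using the third-party Python library 
chardet; the exact guess and confidence will vary):

    import chardet  # third-party package, e.g. 'pip install chardet'

    # On a handful of bytes there is almost no statistical evidence to
    # work with: 0xE9 is equally plausible in Latin-1, Latin-2,
    # Windows-1252, and many other legacy encodings.
    print(chardet.detect(b'caf\xe9'))
    # -> a dict like {'encoding': ..., 'confidence': ...}; with input
    #    this short, any guess is little better than a coin toss.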

That 1) is quite reliable, i.e. that data that looks like UTF-8 usually 
is UTF-8, seems to provide further support for labeling legacy 
encodings rather than UTF-8. In fact, one proposal might be to only 
label data that looks like UTF-8 but IS NOT. The advantage would be 
that the header field would be necessary in only very few cases. The 
disadvantage would be that the header field would be very rare, and 
therefore implementations wouldn't be tested seriously.
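
In code terms (a hypothetical sketch, reusing looks_like_utf8() from 
above), under that proposal a sender would emit the header only in the 
rare false-positive case:

    def needs_legacy_label(raw: bytes, actual_encoding: str) -> bool:
        # Only data that happens to look like UTF-8 without actually
        # being UTF-8 needs the 'Legacy' label; everything else is
        # classified correctly by byte pattern detection alone.
        return (actual_encoding.lower() != 'utf-8'
                and looks_like_utf8(raw))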


>> Beyond that, folks seem generally neutral-to-positive
>> about it, AFAICT.

The reason for this seems to be that everybody is happy when they see a 
solution to a character encoding problem. What the discussion on the 
list has shown is that it's not so clear where and when this proposal 
(in whatever form) helps, and where it doesn't help.

> The draft adds constraints to network nodes that try to do the right
> thing, and gives bad nodes a free ticket. So the incentive is to
> continue to behave non-deterministically.

There's definitely an incentive in that direction.

But if the proposal indeed has merit, and if we can get over the initial 
adoption chicken-and-egg problem (no servers support it so clients don't 
support it and vice versa), then there should also be incentives to do 
the right thing.

> I'd much rather have: anything that pretends to talk http/2 uses UTF-8
> (perhaps not even % encoded) by default, unless it specifies another
> encoding in one of the two optional headers. People who try to do
> anything other than utf-8 without specifying it get broken connections
> when something fails in the path; if they don't want to make the
> effort to be deterministic, that's their problem.

I agree that whatever we can do to speed up the (already strong) 
adoption of UTF-8 is a good thing.


> The spec can add a reserved keyword for unknown encoding (probably
> with the provision that network nodes MAY refuse connections with this
> keyword or any encoding value not specified in the spec).

That would be the above 'Legacy' value. If we go that route, it would 
probably be a good idea to register 'Legacy'.


> That avoids dragging two new headers along on every connection, and
> puts the cost of bad encodings not on the people who rightly use utf-8
> but on the actual bad-encoding perpetrators.

I definitely agree that, where possible, we should make those who don't 
follow the specs bear the cost.

Regards,   Martin.


(*) Essentially, we would first need to make sure there's nothing other 
than UTF-8 on the Web, and then we would have to change all the servers 
so that they no longer assume that not getting one of these headers 
means the data is in a legacy encoding. That means two full 
server/content deployment cycles, which can easily add up to 30 to 40 
years altogether :-(.
