Re: draft-montenegro-httpbis-uri-encoding from Nicolas Mailhot on 2014-03-21 (ietf-http-wg@w3.org from January to March 2014)

From: Nicolas Mailhot <nicolas.mailhot@laposte.net>
Date: Fri, 21 Mar 2014 14:24:02 +0100
To: "Bjoern Hoehrmann" <derhoermi@gmx.net>
Cc: "Nicolas Mailhot" <nicolas.mailhot@laposte.net>, "Julian Reschke" <julian.reschke@gmx.de>, "Mark Nottingham" <mnot@mnot.net>, "HTTP Working Group" <ietf-http-wg@w3.org>, "Gabriel Montenegro" <gabriel.montenegro@microsoft.com>
Message-ID: <cbe1b5830e180522ff5d2bf9c34e7a0c.squirrel@arekh.dyndns.org>

On Fri, 21 March 2014 13:54, Bjoern Hoehrmann wrote:
> * Nicolas Mailhot wrote:
>>Really, can't you read the abundant documentation that was written on the
>>massive FAIL duck typing is for encoding (for example, python-side)? Code
>>passing unit tests then failing right and left as soon as some new
>>encoding combo or text triggering encoding differences injected in the
>>system? Piles of piles of partial workarounds till there was complete loss
>>of understanding how they were all supposed to work in the first place?
>>
>>That's the last thing you want to reinvent on security equipments (and
>>you'll reinvent it because the amount of non-ASCII urls is small now but
>>will only grow with time).
>
> Julian asked for a concrete example use case. So far you have not given
> one. It might help to assume the rest of us understands the subject at
> hand at least as well as you do.

As I wrote the last time he asked the same question: on some of our
networks, access is controlled by regex-like checks on URLs, and not
knowing the encoding of the processed URLs makes this processing (and the
processing of security logs) unreliable.
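
To make the failure mode concrete, here is a minimal Python sketch. The
rule, the paths and the is_blocked() helper are all made up for
illustration and are not taken from any real appliance. It only shows that
the same regex check gives different answers depending on which encoding
the checker guesses for the percent-escaped bytes:

```python
import re
from urllib.parse import unquote_to_bytes

# Hypothetical access rule: block any path containing the word "café".
BLOCKED = re.compile("caf\u00e9")


def is_blocked(raw_path: str, guessed_encoding: str) -> bool:
    """Undo percent-escapes, decode with the guessed encoding, apply the rule."""
    raw_bytes = unquote_to_bytes(raw_path)
    # errors="replace" mimics a lenient checker that never rejects input.
    text = raw_bytes.decode(guessed_encoding, errors="replace")
    return BLOCKED.search(text) is not None


utf8_url = "/menu/caf%C3%A9"   # "é" sent as UTF-8 bytes
latin1_url = "/menu/caf%E9"    # "é" sent as ISO-8859-1 bytes

for url in (utf8_url, latin1_url):
    for enc in ("utf-8", "iso-8859-1"):
        print(url, enc, is_blocked(url, enc))
```

The rule only fires when the guess happens to match what the client
actually sent. With the wrong guess the same resource goes through
unchecked.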

We have already had several security incidents where carefully crafted URLs
triggered bugs in security equipment (so far using plain ASCII rather than
encoding tricks, but the writing is on the wall).

The first in-the-wild uses of Punycode have already triggered bugs in code
that assumed everything is ASCII. That is OK: we can fix this case because
it is clearly defined. We cannot fix every random encoding permutation
people can invent.
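
As a rough illustration of why the clearly defined case is fixable, the
Python sketch below uses the standard library's "idna" codec (which
implements the older IDNA 2003 rules) to turn an ASCII-compatible host name
back into Unicode before comparing it. The host name and the
to_unicode_host() helper are hypothetical:

```python
def to_unicode_host(ascii_host: str) -> str:
    """Decode an ASCII-compatible (xn--...) host name to its Unicode form."""
    return ascii_host.encode("ascii").decode("idna")


wire_host = "xn--caf-dma.example"   # what an ASCII-only checker sees
rule_host = "caf\u00e9.example"     # what a human wrote in the rule ("café")

# A naive string comparison misses the match; decoding first finds it.
print(wire_host == rule_host)
print(to_unicode_host(wire_host) == rule_host)
```

Because the mapping is specified, there is exactly one decoding to
implement and no guessing involved.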

Not knowing the encoding propagates encoding heuristics into every layer of
the software stack: from security appliances, to log handlers, to the apps
that process their output to inform users about their web usage, to the
stupid spreadsheet macros people use to simplify reporting/billing/quick
analysis tasks. The only sane way to limit the scope of the problem is to
convert everything to a single, universal, well-managed encoding at the
entry point. That is no different from what the Python people did, and no
different from database schema handling (if you don't use Unicode and UTC
in your databases by default today, you deserve the problems you get). Text
belongs in the decode-early problem space, not the lazy, last-mile,
decode-just-as-needed problem space.
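
A minimal sketch of what "convert at the entry point" could look like in
Python, assuming the encoding is actually known or declared when the URL
comes in. The normalize_path() name and the choice of NFC normalization
are my own assumptions for the example, not something taken from the
draft:

```python
import unicodedata
from urllib.parse import quote, unquote_to_bytes


def normalize_path(raw_path: str, declared_encoding: str) -> str:
    """Re-encode a percent-escaped path as UTF-8, or fail loudly.

    Simplification: every escape is undone, so an escaped "/" (%2F)
    comes out as a literal "/". A real gateway would need to keep
    reserved characters escaped.
    """
    raw_bytes = unquote_to_bytes(raw_path)
    # Strict decoding: bytes that do not fit the declared encoding raise
    # UnicodeDecodeError instead of being silently guessed at.
    text = raw_bytes.decode(declared_encoding)
    text = unicodedata.normalize("NFC", text)
    # quote() percent-encodes str input as UTF-8 by default.
    return quote(text, safe="/")


print(normalize_path("/menu/caf%E9", "iso-8859-1"))
print(normalize_path("/menu/caf%C3%A9", "utf-8"))
```

Both calls yield the same UTF-8 form, so everything downstream (regex
rules, logs, reports) handles one encoding and needs no heuristics.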

-- 
Nicolas Mailhot

Received on Friday, 21 March 2014 13:24:47 UTC