Re: [whatwg/url] Editorial: make everything use percent-encode sets (#518) from Matt Giuca on 2020-05-15 (public-webapps-github@w3.org from May 2020)

From: Matt Giuca <notifications@github.com>
Date: Thu, 14 May 2020 21:36:28 -0700
To: whatwg/url <url@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/url/pull/518/c629020741@github.com>

> It seems query state with [`ISO-2022-JP` encoding](https://encoding.spec.whatwg.org/#iso-2022-jp-encoder) will not work well after this update. For example `ISO-2022-JP` encodes U+00A5 to bytes: 0x1B 0x28 0x4A 0x5C 0x1B 0x28 0x42.
> 
> Before this update, result of encoding U+00A5 in query is: `%1B(J\%1B(B`
> After this update - result is: `%1B%28%4A%5C%1B%28%42`

True, I agree. That's because the old query state logic ran the encoder first, then checked each resulting byte against the query percent-encode set to decide whether to percent-encode it. The new "percent-encode after encoding" logic checks whether the code point is in the percent-encode set, then encodes it, then percent-encodes the resulting bytes unconditionally.
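To make the difference concrete, here is a rough Python sketch of the two orderings (not spec text). The function names and the `fake_iso2022jp_encode` helper are made up for illustration. The helper just returns the byte sequence quoted above for U+00A5 rather than running a real ISO-2022-JP encoder, and `in_query_set` is my reading of the query percent-encode set.

```python
# Bytes the comment above gives for U+00A5 under ISO-2022-JP.
YEN_BYTES = bytes([0x1B, 0x28, 0x4A, 0x5C, 0x1B, 0x28, 0x42])


def fake_iso2022jp_encode(code_point: str) -> bytes:
    """Hypothetical stand-in encoder: only knows U+00A5."""
    assert code_point == "\u00a5"
    return YEN_BYTES


def in_query_set(value: int) -> bool:
    """Assumed query percent-encode set: C0 controls, everything
    above U+007E, plus space, ", #, < and >."""
    return value < 0x20 or value > 0x7E or value in (0x20, 0x22, 0x23, 0x3C, 0x3E)


def old_query_state(code_point: str, encode) -> str:
    """Old logic: encode first, then test each byte against the set."""
    return "".join(
        f"%{b:02X}" if in_query_set(b) else chr(b) for b in encode(code_point)
    )


def new_percent_encode_after_encoding(code_point: str, encode) -> str:
    """New logic: test the code point, then percent-encode every byte."""
    if not in_query_set(ord(code_point)):
        return code_point
    return "".join(f"%{b:02X}" for b in encode(code_point))


print(old_query_state("\u00a5", fake_iso2022jp_encode))
print(new_percent_encode_after_encoding("\u00a5", fake_iso2022jp_encode))
```

The first call reproduces the "before" string from the report and the second reproduces the "after" string.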

Fixing this is a bit messy. I think we need to shuffle around the "percent-encode after encoding" algorithm:

* Delete 1. "If _codePoint_ is not in _percentEncodeSet_, then return _codePoint_."
* Replace 5. "For each _byte_ of _bytes_, percent-encode _byte_ and append the result to _output_." with "For each _byte_ of _bytes_, if _byte_ is an [ASCII byte](https://infra.spec.whatwg.org/#ascii-byte) and the code point whose value is _byte_ is not in _percentEncodeSet_, append the code point whose value is _byte_ to _output_. Otherwise, percent-encode _byte_ and append the result to _output_."
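A minimal Python sketch of the reshuffled algorithm, assuming the intent is to restore the old output: the up-front code point check is gone and the decision is made per byte. The names are illustrative, not spec text. `QUERY_SET_ASCII` lists only the ASCII members of the query percent-encode set as I understand it (non-ASCII bytes are always percent-encoded, so the rest of the set is never consulted), and the encoder is passed in as a plain callable.

```python
# ASCII members of the (assumed) query percent-encode set:
# C0 controls, U+007F, space, ", #, < and >.
QUERY_SET_ASCII = frozenset(
    [chr(c) for c in range(0x20)] + ["\x7f", " ", '"', "#", "<", ">"]
)


def percent_encode_after_encoding(code_point, percent_encode_set, encode):
    """Encode the code point, then decide byte by byte."""
    output = []
    for byte in encode(code_point):
        is_ascii_byte = byte <= 0x7F
        # chr(byte) is "the code point whose value is byte".
        if is_ascii_byte and chr(byte) not in percent_encode_set:
            output.append(chr(byte))
        else:
            output.append(f"%{byte:02X}")
    return "".join(output)


def yen_only_encoder(code_point):
    """Hypothetical stand-in returning the ISO-2022-JP bytes
    quoted in this thread for U+00A5."""
    assert code_point == "\u00a5"
    return bytes([0x1B, 0x28, 0x4A, 0x5C, 0x1B, 0x28, 0x42])


print(percent_encode_after_encoding("\u00a5", QUERY_SET_ASCII, yen_only_encoder))
```

With the stand-in encoder, this gives back the pre-update result for U+00A5: the two ESC bytes are percent-encoded and `(`, `J`, `\` and `B` are left as literal characters.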

This involves a fair amount of byte/code point gymnastics, which is unavoidable if we want to preserve the existing behaviour (which literally runs an encoder, then represents a selection of the resulting bytes using their ASCII equivalents). I'm using the "code point whose value is _byte_" language, borrowed from the old query state algorithm.

This change should only affect non-UTF-8 encodings. When using UTF-8, every ASCII code point encodes to the corresponding ASCII byte (so either way, it is percent-encoded exactly when it is in the percent-encode set), and every non-ASCII code point encodes to a sequence of non-ASCII bytes (so either way, all of those bytes are percent-encoded).
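The UTF-8 claim can be checked by brute force. The sketch below compares the two orderings (check the code point then encode everything, versus encode then check each byte) over every scalar value, using Python's built-in UTF-8 codec. The function names are made up, and `in_query_set` is my reading of the query percent-encode set, used here as a representative set.

```python
def in_query_set(value: int) -> bool:
    """Assumed query percent-encode set: C0 controls, everything
    above U+007E, plus space, ", #, < and >."""
    return value < 0x20 or value > 0x7E or value in (0x20, 0x22, 0x23, 0x3C, 0x3E)


def check_code_point_first(ch: str) -> str:
    """Test the code point, then percent-encode all of its bytes."""
    if not in_query_set(ord(ch)):
        return ch
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def check_each_byte(ch: str) -> str:
    """Encode, then keep only ASCII bytes that are outside the set."""
    return "".join(
        chr(b) if b <= 0x7F and not in_query_set(b) else f"%{b:02X}"
        for b in ch.encode("utf-8")
    )


def utf8_mismatches():
    """Return every scalar value where the two orderings disagree."""
    scalars = list(range(0xD800)) + list(range(0xE000, 0x110000))
    return [
        cp for cp in scalars
        if check_code_point_first(chr(cp)) != check_each_byte(chr(cp))
    ]


print(len(utf8_mismatches()))
```

Surrogates are skipped because they cannot be encoded as UTF-8 on their own. The full scan may take a second or two.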

You might note that this algorithm is in fact the only place where percent-encode sets, which are sets of code points, are used. It would make the above algorithm simpler if we converted all the percent-encode sets into sets of bytes (e.g., "U+0023 (#), U+003F (?)" becomes "0x23 (#), 0x3F (?)"), and then "the code point whose value is _byte_ is not in _percentEncodeSet_" becomes "_byte_ is not in _percentEncodeSet_". But I would rather not do this, since fundamentally the percent-encode sets should represent characters, not bytes.

https://github.com/whatwg/url/pull/518#issuecomment-629020741

Received on Friday, 15 May 2020 04:36:41 UTC