[whatwg/url] can't parse urls starting with xn-- (#438) from Jan Potoms on 2019-05-02 (public-webapps-github@w3.org from May 2019)

From: Jan Potoms <notifications@github.com>
Date: Thu, 02 May 2019 03:59:47 -0700
To: whatwg/url <url@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/url/issues/438@github.com>

I can't seem to parse URLs like `http://xn--abc.com`, even though they seem to work in browsers.
I've been digging through the code and specs a bit.

It looks like `tr46.toASCII` is what returns the error. Digging further, it looks like that function should implement this spec: https://www.unicode.org/reports/tr46/#Processing. But the spec seems to say:
> Even if an error occurs, the conversion of the string is performed as much as is possible.

It also says:

> If the label starts with “xn--”:
> Attempt to convert the rest of the label to Unicode according to Punycode [RFC3492]. If that conversion fails, record that there was an error, and continue with the next label. Otherwise replace the original label in the string by the results of the conversion.

The URL spec, on the other hand, seems to dictate (https://url.spec.whatwg.org/#idna):

> If result is a failure value, validation error, return failure.
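
To make the contrast between the two quoted behaviours concrete, here is a rough Python sketch. It is not the code of the `tr46` package and not a full TR46 implementation: it only performs the Punycode step from the quote above, using Python's built-in `punycode` codec, and skips the mapping and validity checks entirely, so it says nothing about whether any particular label is actually valid. The function names `to_unicode_best_effort` and `host_or_failure` are made up for illustration.

```python
def to_unicode_best_effort(domain):
    """Convert each label, recording errors but continuing (TR46-style wording).

    Assumption: only the Punycode step is modelled; TR46 mapping and
    validity checks are deliberately left out of this sketch.
    """
    labels = []
    had_error = False
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            try:
                # Stdlib Punycode (RFC 3492) decoder; strict mode raises
                # a UnicodeError subclass on malformed input.
                label = label[4:].encode("ascii").decode("punycode")
            except UnicodeError:
                # Record the error and keep the original label.
                had_error = True
        labels.append(label)
    return ".".join(labels), had_error


def host_or_failure(domain):
    """URL-spec-style wording: any recorded error means failure (None here)."""
    result, had_error = to_unicode_best_effort(domain)
    return None if had_error else result
```

With this sketch, `to_unicode_best_effort` still hands back a string when one label is malformed, while `host_or_failure` throws that partial result away, which is the difference I'm asking about.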

I feel like parsing these should be possible, though; tr46 seems quite ambiguous about which errors are recoverable and which are not.

I came across an example that renders and parses in the browser but seems to fail the parsing algorithm: http://xn--12cr4aua8bifvs3aljr6edb1al1vlg1a.blogspot.com (disclaimer: I am in no way connected to this URL or the content of the site, it just passed through our systems)

In any case, I'm not super experienced in reading these specs, so take the above with an appropriate grain of salt. It just seems strange to me that URLs can render in a browser but fail to parse according to the spec.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/url/issues/438

Received on Thursday, 2 May 2019 11:00:10 UTC