Re: [whatwg] URL: spec review - basic_parser from Sam Ruby on 2014-10-14 (public-whatwg-archive@w3.org from October 2014)

From: Sam Ruby <rubys@intertwingly.net>
Date: Tue, 14 Oct 2014 04:37:24 -0400
To: Anne van Kesteren <annevk@annevk.nl>
Cc: whatWG <whatwg@whatwg.org>
Message-ID: <543CE0C4.3030405@intertwingly.net>

On 10/14/2014 03:41 AM, Anne van Kesteren wrote:
> On Tue, Oct 14, 2014 at 1:05 AM, Sam Ruby <rubys@intertwingly.net> wrote:
>> 1) rows where the notes merely say "href" are cases where parse errors are
>> thrown and failure is returned.  The expected results are an object that
>> returns the original href, but empty values for all other properties.  I
>> don't see this behavior in the spec:
>>
>> https://url.spec.whatwg.org/#url-parsing
>
> That is what you get when e.g. using <a>. If you use new URL() the
> object would fail to construct so you cannot observe the other
> properties. I'm not sure why you think it doesn't follow from the
> specification. If you return failure, there's no URL returned, so why
> would the properties return something?

Given that I've found problems in the spec, my implementation, and the 
test data, I'm trying to guess at what the desired behavior is.  As one 
source for clues, I've looked at the now unmaintained library:

https://github.com/annevk/url/blob/master/url.js#L62

And, as noted above, its behavior is consistent with urltestdata.txt.

Given all of the above, would you suggest changing the spec or the 
expected test results?
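
To make the two behaviors under discussion concrete, here is a minimal
sketch.  It is not the spec's algorithm: basic_parse is a hypothetical
stand-in built on Python's urlsplit, and the names ParsedURL,
anchor_style, and constructor_style are invented for illustration.  The
point is only the difference between an object that keeps the original
href with empty properties and a constructor that refuses to build
anything.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class ParsedURL:
    href: str
    protocol: str = ""
    hostname: str = ""
    pathname: str = ""


def basic_parse(text: str) -> Optional[ParsedURL]:
    """Hypothetical stand-in for the parser: None means failure."""
    parts = urlsplit(text)
    if not parts.scheme:
        # Assumption for this sketch: no scheme and no base URL is a failure.
        return None
    return ParsedURL(
        href=text,
        protocol=parts.scheme + ":",
        hostname=parts.hostname or "",
        pathname=parts.path,
    )


def anchor_style(text: str) -> ParsedURL:
    """Like <a>: on failure, keep the original href, leave the rest empty."""
    result = basic_parse(text)
    return result if result is not None else ParsedURL(href=text)


def constructor_style(text: str) -> ParsedURL:
    """Like new URL(): on failure, no object exists to inspect."""
    result = basic_parse(text)
    if result is None:
        raise ValueError("URL parsing returned failure")
    return result
```

Under this reading, the "href" rows in the results correspond to what
anchor_style returns for a failing input, while constructor_style has
nothing to compare against.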

>> 2) rows that contain "href hostname" appear to be ones where the expected
>> results do not appear to be updated to include the host to IDNA mapping.
>
> Looking at the first of those
> http://intertwingly.net/stories/2014/10/13/urltest-results/eb3950fcc8
> it seems something might be broken here on your end.

Can you explain what you think is broken?  It isn't completely obvious, 
but the input string in that case contains U+200B, U+2060, U+FEFF:

http://www.fileformat.info/info/unicode/char/200B/index.htm
http://www.fileformat.info/info/unicode/char/2060/index.htm
http://www.fileformat.info/info/unicode/char/feff/index.htm

I'll also note that the results I produce are consistent with 
Presto/2.12.388.
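
Since these code points do not render, a small helper that lists them
can make such a test row easier to read.  This is a sketch only; the
sample string is hypothetical and is not the actual test input, and the
function name is invented here.

```python
import unicodedata


def invisible_code_points(text):
    """List format characters (general category Cf) found in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if unicodedata.category(ch) == "Cf"
    ]


# Hypothetical input containing the three code points mentioned above.
sample = "http://x\u200b\u2060\ufeffy.example/"
for code_point, name in invisible_code_points(sample):
    print(code_point, name)
```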

>> 3) rows that contain "href protocol hostname pathname" need further
>> investigation.  I suspect that these are based on my using a library to
>> normalize the IDNA mapping, and it "helpfully" cleans up other problems like
>> removing U+0000 characters from the input.
>
> E.g. for http://intertwingly.net/stories/2014/10/13/urltest-results/7a0e86d240
> per http://www.unicode.org/Public/idna/latest/IdnaMappingTable.txt
> U+FDD0 is disallowed meaning failure ought to be returned. What you
> have as outcome for "whatwg" does not match urltestdata.txt (including
> the version you are using).

Agreed.  As I indicated, I need to look further into the library that I 
am using.
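
One way to keep a "helpful" library from hiding such cases is to check
the hostname before handing it over.  The sketch below is an
assumption-laden pre-check, not the IDNA mapping table: it only flags
U+0000 and Unicode noncharacters (which include U+FDD0), and the
function names are invented for illustration.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def rejected_code_points(hostname):
    """Code points this sketch treats as fatal instead of silently removing."""
    return [
        f"U+{ord(ch):04X}"
        for ch in hostname
        if ord(ch) == 0 or is_noncharacter(ord(ch))
    ]


def checked_hostname(hostname):
    """Return the hostname unchanged, or None (failure) if it must be rejected.

    A real implementation would consult the full mapping table; this
    pre-check only covers the two cases discussed in this thread.
    """
    if rejected_code_points(hostname):
        return None
    return hostname
```

Running the check first means the library never sees input it would
otherwise clean up, so the outcome stays a failure.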

>> P.S.  I didn't update to the latest test data yet; but from what I can see
>> the changes wouldn't materially affect the results, so I am publishing now.
>
> It affects what happens for http://%30%78%63%30%2e%30%32%35%30.01%2e,
> http://192.168.0.257, and
> http://\uff10\uff38\uff43\uff10\uff0e\uff10\uff12\uff15\uff10\uff0e\uff10\uff11.

I do plan to update to the latest expected test results, but meanwhile I 
am still trying to determine places where these results aren't correct 
or current with the specification.
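
For reference, here is a rough sketch of why the three hosts Anne lists
are sensitive to the test data version.  It assumes IPv4 number rules
as I understand them from the URL Standard (hex, octal, and decimal
parts; only the last part may exceed 255).  NFKC normalization is used
as a crude stand-in for the IDNA mapping, and None conflates "failure"
with "not an IPv4 address", which a real parser would distinguish.  All
function names are hypothetical.

```python
import unicodedata
from urllib.parse import unquote


def normalize_host(raw):
    """Percent-decode, fold fullwidth forms via NFKC, and lowercase."""
    return unicodedata.normalize("NFKC", unquote(raw)).lower()


def parse_number(part):
    """Parse one dotted part as hex (0x), octal (leading 0), or decimal."""
    if part == "":
        return None
    radix = 10
    if part[:2] == "0x":
        part, radix = part[2:], 16
    elif len(part) > 1 and part[0] == "0":
        part, radix = part[1:], 8
    if part == "":
        return 0
    digits = "0123456789abcdef"[:radix]
    if any(ch not in digits for ch in part):
        return None
    return int(part, radix)


def parse_ipv4(host):
    """Return dotted-decimal text, or None if this sketch rejects the host."""
    parts = host.split(".")
    if parts[-1] == "":
        parts.pop()  # a single trailing dot is tolerated
    if not 1 <= len(parts) <= 4:
        return None
    numbers = [parse_number(p) for p in parts]
    if any(n is None for n in numbers):
        return None
    if any(n > 255 for n in numbers[:-1]):
        return None
    if numbers[-1] >= 256 ** (5 - len(numbers)):
        return None
    value = numbers[-1]
    for i, n in enumerate(numbers[:-1]):
        value += n * 256 ** (3 - i)
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


for raw in (
    "%30%78%63%30%2e%30%32%35%30.01%2e",
    "192.168.0.257",
    "\uff10\uff38\uff43\uff10\uff0e\uff10\uff12\uff15\uff10\uff0e\uff10\uff11",
):
    print(repr(raw), "->", parse_ipv4(normalize_host(raw)))
```

Under these assumptions the first and third hosts both reduce to
0xc0.0250.01 and then to 192.168.0.1, while 192.168.0.257 is rejected
because its last part does not fit in one byte.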

- Sam Ruby

Received on Tuesday, 14 October 2014 08:38:01 UTC