Re: URL Spec rewrite (bug 25946) and galimatias test results

On 10/28/14 5:14 PM, Santiago M. Mola wrote:
> Hi,
>
> 2014-10-25 2:36 GMT+02:00 Sam Ruby <rubys@intertwingly.net
> <mailto:rubys@intertwingly.net>>:
>
>
>     He suggested that I ask you for feedback on the following:
>
>     http://intertwingly.net/__projects/pegurl/url.html
>     <http://intertwingly.net/projects/pegurl/url.html>
>
>
> It's definitely a useful resource.
>
> I think that the parser defined in the current spec is easier to follow
> if you want to implement it as-is.

Having implemented it both ways, I respectfully disagree.  :-)

http://intertwingly.net/stories/2014/10/13/url_rb.html
https://github.com/rubys/url#readme

I'll also note that the spec isn't intended to be implemented as is, at 
least not for code intended to be run in production environments.

> Your approach gives a better idea about what should the parser do on a
> higher level. After implementing the parser following the current spec,
> I had a hard-time determining what should be the parsing output for some
> cases, These new diagrams would solve that problem.

I also had a hard time, and that's why I assert that the current spec 
isn't exactly easy to follow.  In fact, I will openly confess that I 
have looked into your implementation a number of times to see how the 
current spec should be interpreted, and attempted with my specification 
to make such areas clearer.

>     I also said that I would test galimatias for compatibility.  I've
>     posted the results here:
>
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/>
>
>   Thank you for taking the time to include Galimatias!
>
>     A few notes: it doesn't appear to me that galimatias reports any
>     recoverable parse errors (for example, including a tab or a linefeed
>     inside a path).

It does now.  :-)

http://intertwingly.net/projects/pegurl/urltest-results/

> Galimatias checks every defined error, both recoverable and fatal. It
> provides a customizable ErrorHandler interface. The core provides a
> DefaultErrorHandler that just ignores any recoverable error and
> StrictErrorHandler, which fails on any error. The user could implement a
> LoggingErrorHandler that logs recoverable parsing errors, a
> CollectorErrorHandler that collects every error for later
> analysis/validation, etc. I will add more error handlers to the core if
> common patterns of use emerge among the user community.

And I believe that any new recoverable errors you define should find 
their way back into the specification.  As near as I can tell, the 
following is an example of one:

http://intertwingly.net/projects/pegurl/urltest-results/bc6ea8bdf8

I say this because the error isn't defined here:
   https://url.spec.whatwg.org/#host-state

And the following only defines fatal errors (e.g. step 5);
   https://url.spec.whatwg.org/#concept-host-parser

But as I said, I have found the current specification to be hard to 
follow, so I may have missed something.

>     Also galimatias doesn't provide the interfaces that the URL Standard
>     defines, for example to get the portname - an interface that is
>     supposed to return null if the port matches the default port for the
>     scheme.
>
> Right. Galimatias does not implement the URLUtils interface. I have
> opened an issue to keep a reference for it:
> https://github.com/smola/galimatias/issues/44
>
> I'm still not sure I want to provide URLUtils interface as is. It's a
> browser-centric API that I don't find particularly useful outside the
> JavaScript-in-a-browser scope. Maybe it makes sense for standards
> validation code such as validator.nu <http://validator.nu>?

As a point of reference, node.js came to the opposite conclusion:

http://nodejs.org/api/url.html

>     Even with that accounted for, there still are a number of notable
>     results:
>
>     Null pointer exceptions, some examples:
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/bf8630587b
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/bf8630587b>,
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/4038fcfa6d
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/4038fcfa6d>,
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/275612041a
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/275612041a>,
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/2f33177681
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/2f33177681>,
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/e630bf59c6
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/e630bf59c6>,
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/a16d100f3
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/a16d100f37>
>
> Some of these seem caused because the call to url.host().toString() in
> your test case. In these cases, host is null. This was the intended
> behaviour.

Yes, all were my fault. :-)

>     Returning an empty string instead of null for fragment:
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/1b77231365
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/1b77231365>
>
> AFAIK this is consistent with the standard (fragment, not hash). You can
> change your testing class to:
>
> result.put("hash", (url.fragment()!=null && !url.fragment().isEmpty()) ? "#"+url.fragment() : "");

Here is what I am currently using:

https://github.com/rubys/url/blob/peg.js/evaluate/testgalimatias.java

>     Returning an empty string instead of null for query:
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/24f081633d
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/24f081633d>
>
> Again, this is the standard (query, not search). You can change your
> testing class to:
>
> result.put("search", (url.query()!=null && !url.query().isEmpty()) ? "?"+url.query() : "");

Again, see above.

>     ipv6 addresses not wrapped in []:
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/54f86d22f2
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/54f86d22f2>
>
> Right. IPv6 addresses are wrapped in [] when serialized as part of an
> URL, but they are not wrapped when printed as standalone entities. I'll
> fix it:
> https://github.com/smola/galimatias/issues/45

Actually, toHumanString() does what I need.

>     difference in case:
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/e40dedda84
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/e40dedda84>,
>     http://intertwingly.net/__stories/2014/10/24/urltest-__results/9a4e54b1c3
>     <http://intertwingly.net/stories/2014/10/24/urltest-results/9a4e54b1c3>
>
> Galimatias is biased towards URL normalization. It tries to minimize the
> creation of URLs that are equivalent according to the standard. A
> setting to disable this percent-encoding normalization behaviour will be
> provided if there is a real world use case for it.

The URL Standard is also biased towards URL normalization.  It is also 
very specific about the case of the output.  My point is that you are 
not conforming to the standard here.  Either your implementation should 
be updated, or the standard should be updated, or you should explicitly 
mark this as an area where you don't intend to comply with the standard.

>     If you want to review how I captured these results, the program I
>     used can be found here:
>
>     http://intertwingly.net/__stories/2014/10/24/urltest.__java
>     <http://intertwingly.net/stories/2014/10/24/urltest.java>
>
>     Please let me know if you identify any problems with that program,
>     and I will be glad to rerun the tests.
>
> I have added a test to check with
> http://intertwingly.net/stories/2014/10/05/urltestdata.json
>
> Do you have more up-to-date data?

Yes:

http://intertwingly.net/projects/pegurl/urltestdata.json

> I found Galimatias failed with this:
>
> http://intertwingly.net/stories/2014/10/24/urltest-results/bc6ea8bdf8
> {"input":"http://%30%78%63%30%2e%30%32%35%30.01%2e","base":"http://other.com/","scheme":"","username":"","password":null,"host":"","port":"","path":"","query":"","fragment":""}
>
> My latest code parses this URL as http://0xc0.0250.01./
> This seems in line with the standard, since it does not perform sanity
> checks for DNS rules. See: https://github.com/smola/galimatias/issues/26
> It also fails with this one:
> {"input":"http://192.168.0.257","base":"http://other.com/","scheme":"","username":"","password":null,"host":"","port":"","path":"","query":"","fragment":""}
>
> But 192.168.0.257 is a valid domain name at DNS-level and it's not
> forbidden by the URL standard.

I got Anne to fix the test data:

https://github.com/w3c/web-platform-tests/commit/fe078df8f3ab7d6c9df1b6d80b119e3216179a0e#diff-7d84adcee757bea416deedfb7007e090

And you can find a newer version of the test data in JSON format at the 
URL mentioned above.

> Apart from these two cases, Galimatias passes all test cases if
> query/search fragment/hash differences are considered.

Cool.  Per above, in one case the test case was identified as incorrect 
and has been updated.  I'll leave it up to you to decide whether you 
wish to Galimatias to be non-spec compliant in the other case, or to 
make the case that the spec should be changed.

> Please, let me know if there is any failure in the future or if you have
> any feedback.
>
> Thank you again!

I definitely will do.

> Best,
>
> Santiago

- Sam Ruby

Received on Wednesday, 29 October 2014 02:29:57 UTC