Fwd: Re: URL Spec rewrite (bug 25946) and galimatias test results

First attempt apparently didn't make it to the online archives.

- Sam Ruby

-------- Forwarded Message --------
Subject:  Re: URL Spec rewrite (bug 25946) and galimatias test results
Date:  Wed, 29 Oct 2014 01:14:18 +0100
From:  Santiago M. Mola <santi@mola.io>
To:  Sam Ruby <rubys@intertwingly.net>
CC:  Michael(tm) Smith <mike@w3.org>, www-archive@w3.org
<www-archive@w3.org>



Hi,

2014-10-25 2:36 GMT+02:00 Sam Ruby <rubys@intertwingly.net
<mailto:rubys@intertwingly.net>>:


     He suggested that I ask you for feedback on the following:

     http://intertwingly.net/__projects/pegurl/url.html
     <http://intertwingly.net/projects/pegurl/url.html>


It's definitely a useful resource.

I think that the parser defined in the current spec is easier to follow
if you want to implement it as-is.

Your approach gives a better idea about what should the parser do on a
higher level. After implementing the parser following the current spec,
I had a hard-time determining what should be the parsing output for some
cases, These new diagrams would solve that problem.



     I also said that I would test galimatias for compatibility.  I've
     posted the results here:

     http://intertwingly.net/__stories/2014/10/24/urltest-__results/
     <http://intertwingly.net/stories/2014/10/24/urltest-results/>


   Thank you for taking the time to include Galimatias!


     A few notes: it doesn't appear to me that galimatias reports any
     recoverable parse errors (for example, including a tab or a linefeed
     inside a path).


Galimatias checks every defined error, both recoverable and fatal. It
provides a customizable ErrorHandler interface. The core provides a
DefaultErrorHandler that just ignores any recoverable error and
StrictErrorHandler, which fails on any error. The user could implement a
LoggingErrorHandler that logs recoverable parsing errors, a
CollectorErrorHandler that collects every error for later
analysis/validation, etc. I will add more error handlers to the core if
common patterns of use emerge among the user community.

     Also galimatias doesn't provide the interfaces that the URL Standard
     defines, for example to get the portname - an interface that is
     supposed to return null if the port matches the default port for the
     scheme.


Right. Galimatias does not implement the URLUtils interface. I have
opened an issue to keep a reference for it:
https://github.com/smola/galimatias/issues/44

I'm still not sure I want to provide URLUtils interface as is. It's a
browser-centric API that I don't find particularly useful outside the
JavaScript-in-a-browser scope. Maybe it makes sense for standards
validation code such as validator.nu <http://validator.nu>?


     Even with that accounted for, there still are a number of notable
     results:

     Null pointer exceptions, some examples:
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/bf8630587b
 
<http://intertwingly.net/stories/2014/10/24/urltest-results/bf8630587b>,
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/4038fcfa6d
 
<http://intertwingly.net/stories/2014/10/24/urltest-results/4038fcfa6d>,
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/275612041a
 
<http://intertwingly.net/stories/2014/10/24/urltest-results/275612041a>,
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/2f33177681
 
<http://intertwingly.net/stories/2014/10/24/urltest-results/2f33177681>,
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/e630bf59c6
 
<http://intertwingly.net/stories/2014/10/24/urltest-results/e630bf59c6>,
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/a16d100f3
     <http://intertwingly.net/stories/2014/10/24/urltest-results/a16d100f37>


Some of these seem caused because the call to url.host().toString() in
your test case. In these cases, host is null. This was the intended
behaviour.

     Returning an empty string instead of null for fragment:
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/1b77231365
     <http://intertwingly.net/stories/2014/10/24/urltest-results/1b77231365>


AFAIK this is consistent with the standard (fragment, not hash). You can
change your testing class to:

result.put("hash", (url.fragment()!=null && !url.fragment().isEmpty()) ? 
"#"+url.fragment() : "");

     Returning an empty string instead of null for query:
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/24f081633d
     <http://intertwingly.net/stories/2014/10/24/urltest-results/24f081633d>


Again, this is the standard (query, not search). You can change your
testing class to:

result.put("search", (url.query()!=null && !url.query().isEmpty()) ? 
"?"+url.query() : "");

     ipv6 addresses not wrapped in []:
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/54f86d22f2
     <http://intertwingly.net/stories/2014/10/24/urltest-results/54f86d22f2>


Right. IPv6 addresses are wrapped in [] when serialized as part of an
URL, but they are not wrapped when printed as standalone entities. I'll
fix it:
https://github.com/smola/galimatias/issues/45



     difference in case:
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/e40dedda84
 
<http://intertwingly.net/stories/2014/10/24/urltest-results/e40dedda84>,
 
http://intertwingly.net/__stories/2014/10/24/urltest-__results/9a4e54b1c3
     <http://intertwingly.net/stories/2014/10/24/urltest-results/9a4e54b1c3>


Galimatias is biased towards URL normalization. It tries to minimize the
creation of URLs that are equivalent according to the standard. A
setting to disable this percent-encoding normalization behaviour will be
provided if there is a real world use case for it.

     If you want to review how I captured these results, the program I
     used can be found here:

     http://intertwingly.net/__stories/2014/10/24/urltest.__java
     <http://intertwingly.net/stories/2014/10/24/urltest.java>

     Please let me know if you identify any problems with that program,
     and I will be glad to rerun the tests.


I have added a test to check with
http://intertwingly.net/stories/2014/10/05/urltestdata.json

Do you have more up-to-date data?

I found Galimatias failed with this:

http://intertwingly.net/stories/2014/10/24/urltest-results/bc6ea8bdf8
{"input":"http://%30%78%63%30%2e%30%32%35%30.01%2e","base":"http://other.com/","scheme":"","username":"","password":null,"host":"","port":"","path":"","query":"","fragment":""}

My latest code parses this URL as http://0xc0.0250.01./
This seems in line with the standard, since it does not perform sanity
checks for DNS rules. See: https://github.com/smola/galimatias/issues/26
It also fails with this one:
{"input":"http://192.168.0.257","base":"http://other.com/","scheme":"","username":"","password":null,"host":"","port":"","path":"","query":"","fragment":""}

But 192.168.0.257 is a valid domain name at DNS-level and it's not
forbidden by the URL standard.

Apart from these two cases, Galimatias passes all test cases if
query/search fragment/hash differences are considered.

Please, let me know if there is any failure in the future or if you have
any feedback.

Thank you again!

Best,

Santiago

Received on Friday, 31 October 2014 08:08:39 UTC