Re: [url] Requests for Feedback (was Feedback from TPAC)

On 12/22/2014 09:08 AM, Julian Reschke wrote:
> On 2014-12-22 15:04, Sam Ruby wrote:
>> On 12/22/2014 08:50 AM, Julian Reschke wrote:
>>>>>
>>>>> Validity according to RFC 3986 can be mechanically checked; why do we
>>>>> need to "mark" something here?
>>>>
>>>> If there is a program I can use to mechanically check for RFC 3986
>>>> compliance and shows how a given URI is to be interpreted (scheme,
>>>> host, path, query, fragment, etc.), I'll gladly update my results.
>>>
>>> RFC 3986 has a regexp that's expected to parse valid URIs consistent
>>> with the ABNF; see
>>> <http://greenbytes.de/tech/webdav/rfc3986.html#rfc.section.B>.
>>
>> That is indeed a regular expression.  I'll even grant that it seems
>> likely to handle valid URIs correctly.  My concern is that it also
>> processes a large number of invalid URIs, for example:
>> "http://192.168.0.257"
>
> That is true; there'll be false positives; but that's still better than
> having no checks at all :-)
>
> That being said, I once mapped the normative ABNF to regexps and
> processed them in XSLT; see <http://greenbytes.de/tech/tc/uris/>; I can
> try to leverage that to create a proper regexp.

That does indeed look promising!

I'll note that you don't need to restrict yourself to regular
expressions.  What I'm looking for is a mechanical process that parses
each string against a given base URI and checks it for validity; at
least for valid inputs, it should produce the individual components.
Ideally, when an input is rejected, it would provide some hint as to why.
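
A minimal sketch of what such a process could look like, assuming Python's
standard library.  It resolves a reference against a base with
urllib.parse.urljoin, splits the result with the regular expression from
RFC 3986 Appendix B, and returns either the components or a hint.  The
function name check, the shape of the returned dictionary, and the
dotted-decimal rule (which simply mirrors the "http://192.168.0.257"
example above) are assumptions made for illustration; this is not a
complete RFC 3986 validator.

```python
import re
from urllib.parse import urljoin

# Component-splitting regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def check(reference, base):
    """Resolve reference against base; return components or a rejection hint."""
    try:
        resolved = urljoin(base, reference)
    except ValueError as err:
        # urllib raises ValueError for some malformed authorities.
        return {"valid": False, "components": None, "hint": str(err)}

    m = URI_RE.match(resolved)  # this pattern matches any string
    parts = {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

    # Rough host extraction: drop userinfo and port (no IPv6 handling).
    host = (parts["authority"] or "").rsplit("@", 1)[-1].split(":", 1)[0]

    # Illustrative rule only (an assumption of this sketch): reject
    # dotted-decimal hosts that contain an octet above 255.
    if re.fullmatch(r"\d+(\.\d+){3}", host):
        for octet in host.split("."):
            if int(octet) > 255:
                hint = "octet %s in host %s exceeds 255" % (octet, host)
                return {"valid": False, "components": None, "hint": hint}

    return {"valid": True, "components": parts, "hint": None}
```

Calling check("../baz?x=1#frag", "http://example.org/foo/bar") yields the
split components of the resolved URI, while
check("http://192.168.0.257", "http://example.org/") is rejected with a
hint naming the offending octet.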

My test source is here:

https://raw.githubusercontent.com/w3c/web-platform-tests/master/url/urltestdata.txt

For most of my evaluations, I convert this to JSON.  It could just as 
easily be converted to XML.  That XML could then be passed through 
xsltproc (or equivalent).  Ultimately, I'd need that output to be in 
JSON format, and the stylesheet could either produce that directly, or 
could produce XML that could be parsed.

As a quick and dirty demonstration, I used 
http://www.freeformatter.com/json-to-xml-converter.html to produce the 
following:

http://intertwingly.net/tmp/urltestdata.xml

That's just one possible way to represent this data in XML.

I'll note that when I started work on the URL Standard there also wasn't 
such a mechanical step, or even a JSON form of the test data.  I 
approached this problem incrementally and posted intermediate results.
The first of these can be found here:

http://intertwingly.net/blog/2014/10/21/pegurl-js

- Sam Ruby

Received on Monday, 22 December 2014 14:53:57 UTC