- From: Sam Ruby <rubys@intertwingly.net>
- Date: Mon, 22 Dec 2014 09:53:29 -0500
- To: Julian Reschke <julian.reschke@gmx.de>
- CC: "public-ietf-w3c@w3.org" <public-ietf-w3c@w3.org>
On 12/22/2014 09:08 AM, Julian Reschke wrote:
> On 2014-12-22 15:04, Sam Ruby wrote:
>> On 12/22/2014 08:50 AM, Julian Reschke wrote:
>>>>>
>>>>> Validity according to RFC 3986 can be mechanically checked; why do we
>>>>> need to "mark" something here?
>>>>
>>>> If there is a program I can use to mechanically check for RFC 3986
>>>> compliance and that shows how a given URI is to be interpreted
>>>> (scheme, host, path, query, fragment, etc.), I'll gladly update my
>>>> results.
>>>
>>> RFC 3986 has a regexp that's expected to parse valid URIs consistent
>>> with the ABNF; see
>>> <http://greenbytes.de/tech/webdav/rfc3986.html#rfc.section.B>.
>>
>> That is indeed a regular expression. I'll even grant that it seems
>> likely to handle valid URIs correctly. My concern is that it also
>> processes a large number of invalid URIs, for example:
>> "http://192.168.0.257"
>
> That is true; there'll be false positives; but that's still better than
> having no checks at all :-)
>
> That being said, I once mapped the normative ABNF to regexps and
> processed them in XSLT; see <http://greenbytes.de/tech/tc/uris/>; I can
> try to leverage that to create a proper regexp from that.

That does indeed look promising!

I'll note that you don't need to restrict yourself to only using regular expressions. What I'm looking for is a mechanical process that checks strings for validity when parsed against a given base URI and that, at least for valid inputs, produces the individual components. Ideally, for cases where inputs are rejected, it would provide some hint as to why.

My test source is here:

https://raw.githubusercontent.com/w3c/web-platform-tests/master/url/urltestdata.txt

For most of my evaluations, I convert this to JSON. It could just as easily be converted to XML. That XML could then be passed through xsltproc (or equivalent). Ultimately, I'd need that output to be in JSON format, and the stylesheet could either produce that directly, or could produce XML that could be parsed.
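
To make the shape of such a process concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a validator: the names `check`, `APPENDIX_B`, and `SCHEME` are hypothetical, reference resolution is delegated to the standard library's `urllib.parse.urljoin` (which is not claimed here to be a conforming RFC 3986 implementation), and the only rejection it performs is a scheme syntax check. The component split uses the regular expression from RFC 3986, Appendix B.

```python
import json
import re
from urllib.parse import urljoin

# Regular expression from RFC 3986, Appendix B. It splits a URI
# reference into components but does not validate them.
APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )  (RFC 3986, section 3.1)
SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*$")


def check(reference, base):
    """Resolve `reference` against `base`, then split the result.

    Hypothetical helper for illustration: returns a dict of components,
    or a dict carrying a "failure" hint when the single check performed
    here (scheme syntax) rejects the input.
    """
    resolved = urljoin(base, reference)  # assumption: stdlib resolution
    m = APPENDIX_B.match(resolved)       # every group is optional, so this always matches
    scheme = m.group(2)
    if scheme is None or not SCHEME.match(scheme):
        return {
            "input": reference,
            "base": base,
            "failure": "missing or malformed scheme",
        }
    return {
        "input": reference,
        "base": base,
        "scheme": scheme,
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


if __name__ == "__main__":
    result = check("http://192.168.0.257", "http://example.org/foo/bar")
    print(json.dumps(result, indent=2))
```

Run against "http://192.168.0.257", this sketch reports an authority of "192.168.0.257" without complaint, which is exactly the gap discussed above: the Appendix B expression separates components and leaves any checking of the host to some further step.
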
As a quick and dirty demonstration, I used http://www.freeformatter.com/json-to-xml-converter.html to produce the following:

http://intertwingly.net/tmp/urltestdata.xml

That's just one possible way to represent this data in XML.

I'll note that when I started work on the URL Standard there also wasn't such a mechanical step, or even a JSON form of the test data. I approached this problem incrementally and posted intermediate results. The first such can be found here:

http://intertwingly.net/blog/2014/10/21/pegurl-js

- Sam Ruby
Received on Monday, 22 December 2014 14:53:57 UTC