Validating URIs by regexp (was: Re: [url] Requests for Feedback (was Feedback from TPAC))

[I think we should redirect the technical discussion to 
apps-discuss@ietf.org, because the public-ietf-w3c@w3.org list is for 
(procedural) liaison issues. If you agree, just do so when you 
reply. Also, an occasional change of subject may be helpful.]

On 2014/12/23 00:23, Sam Ruby wrote:
> On 12/22/2014 10:16 AM, Bjoern Hoehrmann wrote:
>> * Sam Ruby wrote:
>>> On 12/22/2014 08:50 AM, Julian Reschke wrote:
>>>> RFC 3986 has a regexp that's expected to parse valid URIs consistent
>>>> with the ABNF; see
>>>> <http://greenbytes.de/tech/webdav/rfc3986.html#rfc.section.B>.

No, please look at what Appendix B says:

----
The following line is the regular expression for breaking-down a 
well-formed URI reference into its components.
----

So it *assumes* a well-formed URI; it doesn't check for one.
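
To make that concrete, here is a minimal Python sketch. The regular 
expression is the one from Appendix B of RFC 3986; the function name 
`split_uri` and the dictionary keys are just my own choices for 
illustration. Because every part of the expression is optional, it 
matches at the start of any string whatsoever, so a successful match 
says nothing about validity.

```python
import re

# Regular expression from RFC 3986, Appendix B (used unchanged).
APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(reference):
    """Split a URI reference into components (hypothetical helper).

    No validation happens here: every group is optional, so the
    match object is never None.
    """
    m = APPENDIX_B.match(reference)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

# An out-of-range octet is split without complaint:
print(split_uri("http://192.168.0.257")["authority"])  # 192.168.0.257

# So is a malformed percent-encoding:
print(split_uri("http://example.com/%xx")["path"])     # /%xx
```

Anything that wants to reject such input has to check the components 
against the ABNF separately.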

Mark's gist (https://gist.github.com/mnot/138549) looks like it should 
work, because it uses all the productions from RFC 3986 (which are 
neatly summarized in http://tools.ietf.org/html/rfc3986#appendix-A), 
although I haven't checked it in detail.


>>> That is indeed a regular expression.  I'll even grant that it seems
>>> likely to handle valid URIs correctly.  My concern is that it also
>>> processes a large number of invalid URIs, for example:
>>> "http://192.168.0.257"
>>
>> (That is a `URI` as far as RFC 3986's ABNF grammar is concerned and I am
>> not aware of prose requirements that make this example invalid. A better
>> example would be e.g. something that contains `%xx` literally; the regex
>> would likely accept the string while the grammar rejects it.)
>
> I base my assertion that this is not a valid URI based on the following
> quote from RFC 3986:
>
>        IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
>
>        dec-octet   = DIGIT                 ; 0-9
>                    / %x31-39 DIGIT         ; 10-99
>                    / "1" 2DIGIT            ; 100-199
>                    / "2" %x30-34 DIGIT     ; 200-249
>                    / "25" %x30-35          ; 250-255
>
> Do you come to a different conclusion?

No, of course not. The '7' in '257' doesn't match %x30-35.
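
For anybody who wants to check this mechanically, the quoted ABNF can 
be transcribed into a regular expression almost line by line. The 
following is a Python sketch under that assumption; the names 
`DEC_OCTET`, `IPV4ADDRESS`, and `is_ipv4address` are made up here. It 
tests only the IPv4address production, not the full host rule or a 
whole URI.

```python
import re

# dec-octet, with the alternatives listed from longest to shortest.
DEC_OCTET = (
    r"(?:25[0-5]"       # "25" %x30-35       ; 250-255
    r"|2[0-4][0-9]"     # "2" %x30-34 DIGIT  ; 200-249
    r"|1[0-9]{2}"       # "1" 2DIGIT         ; 100-199
    r"|[1-9][0-9]"      # %x31-39 DIGIT      ; 10-99
    r"|[0-9])"          # DIGIT              ; 0-9
)

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

def is_ipv4address(text):
    """True if the whole string matches the IPv4address production."""
    return IPV4ADDRESS.fullmatch(text) is not None

print(is_ipv4address("192.168.0.255"))  # True
print(is_ipv4address("192.168.0.257"))  # False: '7' is outside %x30-35
```

Using `fullmatch` matters: a plain `match` would accept a prefix such 
as "192.168.0.25" and silently ignore the trailing "7".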

Regards,   Martin.


P.S.: It might have helped to point out, when first giving the 
example, that 257 is greater than the allowed maximum of 255. While 
everybody who goes through the details will eventually find that out, 
even people who have done careful work in this area (e.g. Julian, 
Björn,...) may not immediately see it.

Received on Tuesday, 23 December 2014 07:00:25 UTC