Re: IRI regex quiz!

Bjoern Hoehrmann wrote:

>   http://bj%f6rn.example.org/

Ugh, indeed, <reg-name> doesn't limit itself to LDH, it can be
any octet (some of them must of course be percent-encoded).

And that's correct.  Probably 2616 says why you can't do that
for HTTP, but I found no ABNF for the 2616 <host>.  Maybe it's
in the prose.
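
For the record, a quick sketch of the 3986 <reg-name> rule as a
regex, transcribed by hand from the ABNF (unreserved, pct-encoded,
sub-delims).  The names REG_NAME and is_reg_name are made up for
this mail, and it checks syntax only, not what any registry would
accept:

```python
import re

# RFC 3986: reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = re.compile(
    r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]"   # unreserved and sub-delims
    r"|%[0-9A-Fa-f]{2})*"               # pct-encoded
)

def is_reg_name(host: str) -> bool:
    """True if host matches the generic reg-name syntax."""
    return REG_NAME.fullmatch(host) is not None
```

With that, bj%f6rn.example.org passes, because "%f6" is a valid
pct-encoded triplet, while a raw "<" or a raw non-ASCII character
does not.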

Another interesting case is http://what<ever.spammer.example

One browser (guess which) thinks that this is a good URL and
happily lets you click on it.  Decent anti-spam tools that play
by the rules see only http://what and ignore the URL.  And 3986
doesn't _explicitly_ say that "<" and a few other VCHARs are
never allowed in a URL.
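
What a rule-abiding scanner does here can be sketched as follows.
This is only my guess at the general approach, not the code of any
real tool: take the longest prefix made of characters that occur
somewhere in the 3986 grammar (unreserved, gen-delims, sub-delims,
"%") and stop at the first one that doesn't.  URI_CHARS and
strict_prefix are hypothetical names:

```python
import re

# Every character that appears anywhere in the RFC 3986 grammar:
# unreserved, gen-delims, sub-delims, and "%" for escapes.
URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")

def strict_prefix(text: str) -> str:
    """Longest leading run of text made of URI characters only."""
    match = URI_CHARS.match(text)
    return match.group(0) if match else ""

print(strict_prefix("http://what<ever.spammer.example"))
# prints: http://what
```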

>   http://example.org/~björn/

Obviously no URI.

 [http://bj%f6rn.example.org/]
> The former is not allowed per RFC 3986 and RFC 3987

It doesn't clearly say so; apparently it all depends on the
registry for the <reg-name>.  And DNS labels can contain any
octet, as some spammers found out.  Of course the pointers to
1034 3.5 and 1123 2.1 would result in some style of LDH rule,
and that kills the "%" in bj%f6rn.
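
One possible reading of "some style of LDH rule", as a sketch: each
label is 1 to 63 characters of letters, digits, and hyphens, with no
hyphen at either end (a leading digit is fine per 1123 2.1).  That
reading is my assumption, and LDH_LABEL and is_ldh_hostname are
names invented for this example:

```python
import re

# One label: letters/digits at both ends, hyphens only inside,
# at most 63 characters in total.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_ldh_hostname(host: str) -> bool:
    """True if every dot-separated label follows the LDH rule."""
    return all(LDH_LABEL.fullmatch(label) for label in host.split("."))
```

Here bj%f6rn.example.org fails on its first label, although the
same string is fine as a generic <reg-name>.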

But the LDH rules are also a bit vague today; some all-digit
labels exist.  RFC 3696 makes it clear: at least the <toplabel>
can't be all-digits, so the worst case could be 1-2-3 (no ALPHA).
OTOH 3696 is only informational and offers no ABNF.
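
Since 3696 offers no ABNF, here is how I'd sketch the <toplabel>
point in code.  It is an interpretation of the prose, nothing
normative, and toplabel_ok is a made-up name:

```python
import re

# Same LDH label shape as in 1034 3.5 / 1123 2.1.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def toplabel_ok(host: str) -> bool:
    """True if the rightmost label is LDH and not all digits."""
    top = host.rsplit(".", 1)[-1]
    return LDH_LABEL.fullmatch(top) is not None and not top.isdigit()
```

So example.123 is rejected, but example.1-2-3 still gets through
without a single ALPHA.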

 [http://example.org/~björn/]
> the latter is not allowed per RFC 2396, RFC 2616, RFC
> 3986, but allowed per ABNF and prose of RFC 3987

IIRC the 3987 ABNF is for the step when you have Unicode; your
Latin-1 o-umlaut won't match before you have U+00F6.  But it's
certainly okay here (in a Latin-1 text).
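
To illustrate that step: the Latin-1 octet 0xF6 has to be decoded
to the code point U+00F6 before the 3987 ABNF can be applied, and
U+00F6 then falls into the first <ucschar> range (%xA0-D7FF).  The
helper names below are hypothetical, and only that first range is
checked:

```python
def latin1_to_text(raw: bytes) -> str:
    """Decode Latin-1 octets; 0xF6 becomes the code point U+00F6."""
    return raw.decode("latin-1")

def in_low_ucschar(ch: str) -> bool:
    """True if ch is in the first RFC 3987 ucschar range, %xA0-D7FF."""
    return 0xA0 <= ord(ch) <= 0xD7FF

text = latin1_to_text(b"http://example.org/~bj\xf6rn/")
print(in_low_ucschar(text[22]))  # text[22] is the o-umlaut
# prints: True
```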

> except that RFC 3987 requires in the prose to meet the
> constraints in RFC 2616, e.g.

>   When stored or transmitted in digital representation,
>   bidirectional IRIs MUST be in full logical order and MUST
>   conform to the IRI syntax rules (which includes the rules
>   relevant to their scheme).

No "abs-path =" in 2616, and I'm unwilling to try the "interpret
1738 for 3986" stunt now; appendix D.2 in 3986 is a royal PITA.

Guessing: 3986 pchar is what I want, no o-umlaut, no surprise.

But we knew that, the o-umlaut is obviously no URI, it's an IRI.
If you translate it to a URI it would be ~bj%C3%B6rn, that's
a legal segment of a path in an http-URL.
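
A small sketch of that translation, assuming the 3986 <pchar> rule
for one path segment and UTF-8 as the encoding for the escapes.
SEGMENT, is_uri_segment, and iri_segment_to_uri are names invented
here; urllib.parse.quote is the stock Python function:

```python
import re
from urllib.parse import quote

# RFC 3986: pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
SEGMENT = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})*")

def is_uri_segment(segment: str) -> bool:
    """True if segment is a string of 3986 pchar."""
    return SEGMENT.fullmatch(segment) is not None

def iri_segment_to_uri(segment: str) -> str:
    """Percent-encode everything outside pchar as UTF-8 octets."""
    # "%" is kept safe so that existing escapes are not encoded twice.
    return quote(segment, safe="-._~!$&'()*+,;=:@%", encoding="utf-8")

print(iri_segment_to_uri("~bj\u00f6rn"))
# prints: ~bj%C3%B6rn
```

The raw segment fails is_uri_segment, the translated one passes.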

> I agree that at the moment http://example.org/%C0 is not
> illegal per any RFC though.

This might not be nonsense, at least for ftp:  The ftp servers
used to have legacy charsets.  When I start ftpd I'd get some
baroque pc-multilingual-850+euro.  I could also start it in a
windows-1252 session and let it create filenames where my file
system later crashes, but I digress.
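
To make the charset point concrete, a sketch that unescapes %C0
and tries a few decodings.  The charset list is only an example
(Python's cp850 stands in for the 850 family, without the euro
variant), and decode_guesses is a hypothetical helper:

```python
from urllib.parse import unquote_to_bytes

def decode_guesses(escaped, charsets=("utf-8", "cp1252", "cp850")):
    """Map each charset to the decoded text, or None if it fails."""
    raw = unquote_to_bytes(escaped)
    guesses = {}
    for name in charsets:
        try:
            guesses[name] = raw.decode(name)
        except UnicodeDecodeError:
            guesses[name] = None
    return guesses

guesses = decode_guesses("%C0")
```

The octet 0xC0 alone is not valid UTF-8, so that entry is None,
while windows-1252 reads it as U+00C0 and the 850 code page reads
it as something else again.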

                      Bye, Frank

Received on Monday, 23 January 2006 17:47:47 UTC