uri handling of hosts is too restrictive

On Thu, 2004-02-05 at 06:07, Martin Duerst wrote:
> At 18:34 04/02/04 -0800, Stephen Pollei wrote:
> >http://stephen_pollei.home.comcast.net/ gives Error: {W107} Bad URI
> > Host is not a well formed address!
> >
> >It's the underscore, however _'s are good host names according to rfc
> >2181 section 11 and rfc1123 sections 2.1 and 6.1.3.5
> >The problem is that rfc2396 section 3.2.2 is unduly restrictive.
> If you think RFC 2396 is overly restrictive, please raise this point
> on the mailing list uri@w3.org, where the next version of this spec
> is discussed.
Hello, I've run into a situation where a URI that is handled properly by
most software I've run across has generated a warning in an RDF
validation tool.
I believe the problem arose when the HTTP/1.0 spec directly referenced
an older, more restrictive RFC concerning host names. Later the HTTP/1.1
spec (RFC 2616, IIRC) passed the specification of what constitutes a
valid host name to RFC 2396. RFC 2396 still retains a more restricted
set of allowed characters, but doesn't specify length restrictions the
way the DNS RFCs do.

The DNS RFCs do specify that an application is allowed to define a
subset of the allowed names in its own specs. So RFC 2396's
restrictions are valid restrictions in that sense.

It does, however, restrict various things that would otherwise be OK.
This proposal doesn't address international domain names in Unicode; I
think RFC 3492 (Punycode) and related specs are good enough for that
purpose.

I propose that the characters !$*+,=^_{|}~ be added as valid characters.
"&%'`()[]:;/\<>@?# should probably not be added as valid.
" conflicts with quotation too much
& conflicts with SGML/XML entities too much
% is the escape char
'`();/\ might have way too much meaning elsewhere.
[]:/?# are used for IPv6, port number separation, URL component separation
<>@ are used too much in email addresses
Control characters and whitespace characters should not be allowed.
Characters 127 (ASCII) and above should not be allowed.
Of course one could allow all of the above and just require that they
be escaped. That would be the most liberal approach and might be
best. Hmmm... http://%2f%2e.org/ ;->
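
To make the proposed character set concrete, here is a minimal Python
sketch. The names BASE_CHARS, PROPOSED_EXTRA, EXCLUDED and
label_chars_ok are made up for illustration, and it assumes that
letters, digits and "-" are what a label may already contain. It only
checks characters, not position or length.

```python
import string

# Assumption: letters, digits and "-" are already allowed in a label.
BASE_CHARS = set(string.ascii_letters + string.digits + "-")
# Characters proposed above for addition.
PROPOSED_EXTRA = set("!$*+,=^_{|}~")
# Characters that should probably stay out (listed above).
EXCLUDED = set("\"&%'`()[]:;/\\<>@?#")


def label_chars_ok(label: str) -> bool:
    """True if a non-empty label uses only base or proposed characters.

    Control characters, whitespace and anything at 127 or above are
    rejected simply because they are in neither set.
    """
    allowed = BASE_CHARS | PROPOSED_EXTRA
    return bool(label) and all(ch in allowed for ch in label)
```

With this, label_chars_ok("stephen_pollei") is accepted while a label
containing "@" or "%" is rejected.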

I also think that the first character should be kept more
restrictive. Some DNS schemes use '_' as the first character for
special purposes, for example. This has the nice effect of also
disallowing http://www.**wow**.com/ . Too bad http://www.wow!!!.com/
would work! Maybe disallow at the beginning and at the end. Then
http://www.Jack+Jill.example.org/ could still work.
Anyway, these are just off-the-top-of-my-head comments. Feel free to rip
them to shreds.
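
A rough sketch of the "restrict the beginning and the end" variant,
again in Python. The function names label_ok and host_ok are
hypothetical, and it assumes the first and last character of each label
must be a letter or digit, with the extra characters allowed only in
between. Length limits are not checked.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)
# Assumption: "-" plus the proposed extras are allowed inside a label.
INNER = ALNUM | set("-!$*+,=^_{|}~")


def label_ok(label: str) -> bool:
    """Check one dot-separated label of a host name."""
    if not label:
        return False
    # First and last character stay restricted to letters and digits.
    if label[0] not in ALNUM or label[-1] not in ALNUM:
        return False
    return all(ch in INNER for ch in label)


def host_ok(host: str) -> bool:
    """Check every label of a host name such as 'www.example.org'."""
    return all(label_ok(label) for label in host.split("."))
```

Under these assumptions www.Jack+Jill.example.org and
stephen_pollei.home.comcast.net pass, while www.**wow**.com,
www.wow!!!.com and a leading-underscore name like _sip.example.org are
rejected.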

There should also maybe be a security note that DNS and the character
encodings are more liberal. With these characters allowed, something
like http://my${FOO}thing.example.org/ would be valid but might cause
trouble for shell scripts, for example. That security problem already
exists, though.
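
To illustrate the shell concern, a small Python sketch. It only builds
command strings and does not run them; the variable names are
illustrative. shlex.quote is the standard-library helper for quoting a
string for a POSIX shell.

```python
import shlex

host = "my${FOO}thing.example.org"
url = "http://%s/" % host

# Risky: inside double quotes a shell would expand ${FOO}.
unsafe_cmd = 'echo "%s"' % url

# Safer: shlex.quote wraps the URL in single quotes, so the shell
# passes ${FOO} through literally.
safe_cmd = "echo " + shlex.quote(url)
```

Here safe_cmd is echo 'http://my${FOO}thing.example.org/', whereas
unsafe_cmd leaves the expansion to whatever FOO happens to be in the
script's environment.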

-- 
http://dmoz.org/profiles/pollei.html
http://sourceforge.net/users/stephen_pollei/
http://slashdot.org/~joe_plastic/
http://stephen_pollei.home.comcast.net/
GPG Key fingerprint = EF6F 1486 EC27 B5E7 E6E1  3C01 910F 6BB5 4A7D 9677

Received on Thursday, 5 February 2004 16:01:48 UTC