parsing hostname -- implementation feedback

From: Graham Klyne <gk@ninebynine.org>
Date: Wed, 05 Mar 2003 17:41:38 +0000
Message-Id: <5.1.0.14.2.20030305171317.00bd1978@127.0.0.1>
To: "Roy T. Fielding" <fielding@apache.org>
Cc: <uri@w3.org>

Roy,

I have to say that the 'hostname' syntax as specified in RFC2396bis is a 
pain to parse accurately.  I think it's sufficiently difficult to get 
exactly right that it won't be correctly implemented as specified in many 
applications, which leaves me wondering if it really should be so fussily 
correct with respect to domain name usage.

(The reason I'm noticing this is that I've been using the URI parsing task 
to experiment with some programming tools and techniques that offer a more 
direct correspondence between specification and the source code.  If I were 
doing this as part of a real application, I would long ago have ignored the 
detailed syntax and done something very similar but much easier to implement.)

The problem is in the production for 'qualified'.  To determine whether an 
incoming ".abc" is a 'domainlabel' or a 'toplabel' requires a significant 
lookahead, to the following '.' (if present) and the character following 
that.  To determine if an incoming ".123" is valid can require an 
arbitrarily long lookahead (e.g. http://0.123.4.5.6.7.8.9.10.11.12.13.x/).

I think parsing precisely according to the syntax would be greatly 
simplified if the syntax were relaxed so that:

   qualified = *( "." domainlabel ) [ "." ]

i.e. drop the syntactic prohibition of URIs like this:

   http://www.example.123./foo

I appreciate this is not strictly correct, but I see no practical harm from 
defining the syntax in this way and asserting the form of the final domain 
label as an extra-syntactic constraint.  A (limited) few tests with my 
browser suggest that it does not syntactically prohibit numeric top-level 
domain labels, but simply reports that the domain cannot be found.
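
A minimal Python sketch of the relaxed approach follows. It is an illustration, not part of the draft. It assumes that a hostname is one 'domainlabel' followed by 'qualified', and that 'domainlabel' has the shape alphanum [ 0*61( alphanum / "-" ) alphanum ]. The names parse_relaxed and final_label_ok are hypothetical. The syntax check and the constraint on the final label are kept as two separate steps:

```python
import re

# Assumed label shape: alphanum [ 0*61( alphanum / "-" ) alphanum ]
DOMAINLABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"

# Relaxed syntax: domainlabel *( "." domainlabel ) [ "." ]
RELAXED_HOSTNAME = re.compile(rf"{DOMAINLABEL}(?:\.{DOMAINLABEL})*\.?")


def parse_relaxed(host):
    """Return the list of labels if host matches the relaxed syntax, else None."""
    if RELAXED_HOSTNAME.fullmatch(host) is None:
        return None
    # Drop the single optional trailing dot before splitting into labels.
    if host.endswith("."):
        host = host[:-1]
    return host.split(".")


def final_label_ok(labels):
    """Extra-syntactic constraint: the final label must begin with a letter."""
    return labels[-1][0].isalpha()
```

With this split, "www.example.123." passes the syntax step and is rejected only by final_label_ok, which matches the behaviour described above for the browser test.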

...

If you really want to keep the syntactic constraint in place, I suggest an 
alternative approach:

hostname  = qualified
qualified = numericlabel "." qualified /
             toplabel [ "." [qualified] ]

numericlabel = DIGIT [ 0*61( alphanum / "-" ) alphanum ]
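
The alternative grammar can be recognised in a single left-to-right pass, because each label is classified by its first character alone. The Python sketch below illustrates this under stated assumptions: 'toplabel' is taken to be ALPHA [ 0*61( alphanum / "-" ) alphanum ], mirroring 'numericlabel', and the names read_label and is_hostname are hypothetical.

```python
import string

ALPHA = set(string.ascii_letters)
DIGIT = set(string.digits)
ALPHANUM = ALPHA | DIGIT


def read_label(s, i):
    """Consume one label starting at s[i]; return the index just past it, or -1."""
    if i >= len(s) or s[i] not in ALPHANUM:
        return -1
    j = i + 1
    while j < len(s) and (s[j] in ALPHANUM or s[j] == "-"):
        j += 1
    # A label must end in an alphanum and be at most 63 characters long.
    if s[j - 1] == "-" or j - i > 63:
        return -1
    return j


def is_hostname(s):
    """Recognise: qualified = numericlabel "." qualified / toplabel [ "." [qualified] ]"""
    i = 0
    while True:
        j = read_label(s, i)
        if j < 0:
            return False
        numeric = s[i] in DIGIT  # first character decides numericlabel vs toplabel
        if j == len(s):
            return not numeric   # input ends here: the label must be a toplabel
        if s[j] != ".":
            return False
        j += 1
        if j == len(s):
            return not numeric   # trailing "." is allowed only after a toplabel
        i = j                    # otherwise parse the next 'qualified'
```

No decision here looks further ahead than the end of the current label and the "." that follows it, so the long example http://0.123.4.5.6.7.8.9.10.11.12.13.x/ needs no special handling.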

...

I think there's a typo in the syntax production for 'toplabel':

s/alpha/ALPHA/ ?

#g


-------------------
Graham Klyne
<GK@NineByNine.org>
PGP: 0FAA 69FF C083 000B A2E9  A131 01B9 1C7A DBCA CB5E
Received on Wednesday, 5 March 2003 18:05:15 UTC
