Re: parsing hostname -- implementation feedback

> The problem is in the production for 'qualified'.  To determine 
> whether an incoming ".abc" is a 'domainlabel' or a 'toplabel' requires 
> a significant lookahead, to the following '.' (if present) and the 
> character following that.  To determine if an incoming ".123" is valid 
> can require an arbitrarily long lookahead (e.g. 
> http://0.123.4.5.6.7.8.9.10.11.12.13.x/).
>
> I think parsing precisely according to the syntax would be greatly 
> simplified if the syntax were relaxed so that:
>
>   qualified = *( "." domainlabel ) [ "." ]
>
> i.e. drop the syntactic prohibition of URIs like this:
>
>   http://www.example.123./foo
>
> I appreciate this is not strictly correct, but I see no practical harm 
> from defining the syntax in this way and asserting the form of the 
> final domain label as an extra-syntactic constraint.  A (limited) few 
> tests with my browser suggest that it does not syntactically prohibit 
> numeric top-level domain labels, but simply reports that the domain 
> cannot be found.

Doing that would make the syntax ambiguous with regard to IPv4
addresses, which is why that syntax was added to the specification
in the first place.  Literal IP addresses are explicitly denoted
because applications are encouraged to convert them directly to
numeric IP rather than send everything to a DNS resolver.
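
To illustrate the distinction, here is a rough Python sketch (not text from
the specification). The name classify_host and the three regular expressions
are assumptions made for this example, paraphrasing the label shapes under
discussion; the octet range 0-255 is deliberately not checked.

```python
import re

# Assumed label shapes: a domainlabel starts and ends with an alphanumeric,
# a toplabel additionally has to start with a letter. Both are 1-63 chars.
DOMAINLABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")
TOPLABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")
# Four dotted groups of 1-3 digits; numeric range is not validated here.
IPV4 = re.compile(r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")


def classify_host(host: str) -> str:
    """Return 'ipv4', 'hostname', or 'invalid' for a host string."""
    if IPV4.fullmatch(host):
        return "ipv4"  # can be converted to a numeric address, no DNS lookup
    labels = host.split(".")
    if labels[-1] == "":
        labels.pop()  # allow a single trailing dot
    if not labels:
        return "invalid"
    if not all(DOMAINLABEL.fullmatch(label) for label in labels[:-1]):
        return "invalid"
    # The final label must be a toplabel, so it cannot be all-numeric.
    if TOPLABEL.fullmatch(labels[-1]):
        return "hostname"
    return "invalid"
```

Because the last label of a hostname must begin with a letter, a dotted
string of digits can only ever be an IPv4 literal. Relaxing that rule is
what would let "10.0.0.1" match both productions.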

> If you really want to keep the syntactic constraint in place, I 
> suggest an alternative approach:
>
> hostname  = qualified
> qualified = numericlabel "." qualified /
>             toplabel [ "." [qualified] ]
>
> numericlabel = DIGIT [ 0*61( alphanum / "-" ) alphanum ]

Well, that is harder for people to understand, but I agree that it is
better for LALR parsers.
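
A minimal Python sketch of why the alternative form is easier to parse: the
choice between the two branches of 'qualified' depends only on the first
character of the current label. The function names are hypothetical, and
'toplabel' is assumed to have the same shape as 'numericlabel' but beginning
with ALPHA, since its production is not quoted here.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)


def scan_label(s: str, i: int) -> int:
    """Return the index just past the label starting at i, or -1 if invalid."""
    j = i
    while j < len(s) and (s[j] in ALNUM or s[j] == "-"):
        j += 1
    # A label is 1-63 characters and may not start or end with "-".
    if j == i or j - i > 63 or s[i] == "-" or s[j - 1] == "-":
        return -1
    return j


def is_hostname(s: str) -> bool:
    """Recognize: qualified = numericlabel "." qualified
                            / toplabel [ "." [ qualified ] ]"""
    i = 0
    while True:
        if i >= len(s):
            return False  # a label is required at this position
        numeric = s[i] in string.digits  # the only decision point
        j = scan_label(s, i)
        if j < 0:
            return False
        if numeric:
            # numericlabel must be followed by "." and another qualified.
            if j >= len(s) or s[j] != ".":
                return False
            i = j + 1
            continue
        # toplabel: may end here, end after one trailing dot, or continue.
        if j == len(s):
            return True
        if s[j] != ".":
            return False
        if j + 1 == len(s):
            return True
        i = j + 1
```

The recognizer never looks past the current label, so the long numeric
prefix in the earlier example costs nothing extra, while a numeric final
label is still rejected.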

> ...
>
> I think there's a typo in the syntax production for 'toplabel':
>
> s/alpha/ALPHA/ ?

Yes, thanks for noting it.

....Roy

Received on Friday, 7 March 2003 19:10:11 UTC