Re: (ACTION) On a new grammar for IRIs

Thank you for this.  I think it's a useful step forward.

I wonder if we can make it a bit more positive in tone, and convey a
little less disapproval of the grammar in the RFC -- the text below does
say explicitly that the RFC's grammar does what it needs to do, but the
reader may (like me) have the impression that that was a hard-won
concession, and uttered with an inward reservation.

It troubles me a little, too, to suggest that ixml requires a particular
approach to writing grammars.  Different people may have different
requirements and desiderata for the grammars they write; I am skeptical
of any suggestion that ixml has requirements of the kind suggested
here. I think the things identified here as requirements of ixml are
things someone using an ixml grammar to parse IRIs might plausibly
desire, and they are plausible motivations for writing a grammar for
IRIs that deviates from the grammar that defines what is and what is not
an IRI.  But they are not the only things a user is allowed to want -- a
user might also want some proof that the grammar in use actually
recognizes the same language as the RFC -- that's a bit easier when the
grammar is visibly just a transliteration from the RFC into a slightly
different notation.  Or someone might want to see the structure assigned
to an IRI by the grammar in the RFC.  

I continue to regard the collection of samples as a good place to show
what can be done with ixml, and not such a good place to tell the
editors of the relevant RFCs how the grammar would be written by someone
who knew what they were about.

So I'll be happier with a proposal to replace the current ixml grammars
in the samples/URI directory, or add one or more additional ones, if the
README file can avoid condescending to the technical specifications of
URIs and IRIs.  But maybe that's a pipe dream.

Michael

Steven Pemberton <steven.pemberton@cwi.nl> writes:

> The standard syntax definition for IRIs is defined in RFC 3987,
> Internationalized Resource Identifiers (IRIs),
> https://datatracker.ietf.org/doc/html/rfc3987.
>
> The purpose of that definition is to define IRIs, and give a syntax
> that determines what a correct IRI looks like.
>
> The purpose of ixml in general is slightly different: on the one hand
> it is not always necessary to determine if the input is correct -- the
> assumption is often that the input is correct, and we only want to
> transform it. These are described as permissive grammars. For
> instance, you might define a date as having two digits for the day,
> rather than being specific that the digits are in a particular range.
>
> On the other hand, what ixml is really interested in is the structure
> of the underlying object, identifying the parts it is made up of,
> without intervening syntactical necessities. A date consists primarily
> of a day, a month, and a year, and not just of six digits.
>
> The RFC definition of IRIs while meeting (most of) its requirements
> for its purpose, doesn't meet the requirements for ixml:
>
> * Firstly, it is in places ambiguous, certain strings for correct
>   IRIs fulfilling the definition of an IRI in different ways; for
>   instance 192.168.0.1 is both an IPv4address and (incorrectly in
>   this author's view) an ireg-name.
>
> * Secondly, it doesn't expose the underlying structure in a suitable
>   way for ixml. For instance, an ireg-name according to the RFC is
>   just a string of the characters allowed in a hostname, without
>   exposing any underlying structure. The hostname www.w3.org is
>   according to the RFC just a string of 10 allowable characters.
>
> * Thirdly, it actually fails in correctly determining if a string is a
>   correct IRI by being too lax. Again, using ireg-name as an example,
>   it will happily accept a hostname "...", which is disallowed by
>   rfc3986, which says that a hostname "consists of a sequence of
>   domain labels separated by ".", each domain label starting and
>   ending with an alphanumeric character and possibly also containing
>   "-" characters." In other words, it may not have adjacent dots, nor
>   start or end with a dot.
>
> The revised ixml definition for IRI tries to rectify these problems:
> it is strict in what it accepts, it is unambiguous, and it reveals the
> underlying structure of the parts, which is what ixml is all about.
>
> Steven
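
For readers who want to try the hostname rule quoted in the third point
(labels separated by ".", each label starting and ending with an
alphanumeric character and possibly containing "-"), here is a minimal
sketch in Python. It is not taken from the RFC or from any of the ixml
grammars under discussion: the name split_hostname is hypothetical, and
it assumes ASCII letters and digits only, whereas a real ireg-name
allows a wider character set.

```python
import re

# Hypothetical helper for illustration; not part of any RFC or ixml grammar.
# Assumption: labels use ASCII letters, digits and "-" only.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
HOSTNAME = re.compile(rf"{LABEL}(?:\.{LABEL})*")


def split_hostname(text):
    """Return the list of domain labels, or None if text breaks the rule."""
    if HOSTNAME.fullmatch(text) is None:
        return None
    # The match guarantees no empty labels, so a plain split is safe.
    return text.split(".")
```

Under these assumptions the function exposes www.w3.org as three labels
and rejects "...". It also accepts 192.168.0.1 as four labels, which is
the overlap with IPv4address mentioned in the first point.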


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Tuesday, 5 March 2024 02:23:07 UTC