- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Mon, 04 Mar 2024 18:43:28 -0700
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: public-ixml@w3.org
Thank you for this. I think it's a useful step forward.

I wonder if we can make it a bit more positive in tone, and convey a little less disapproval of the grammar in the RFC -- the text below does say explicitly that the RFC's grammar does what it needs to do, but the reader may (like me) have the impression that that was a hard-won concession, and uttered with an inward reservation.

It troubles me a little, too, to suggest that ixml requires a particular approach to writing grammars. Different people may have different requirements and desiderata for the grammars they write; I am skeptical of any suggestion that ixml has requirements of the kind suggested here.

I think the things identified here as requirements of ixml are things someone using an ixml grammar to parse IRIs might plausibly desire, and they are plausible motivations for writing a grammar for IRIs that deviates from the grammar that defines what is and what is not an IRI. But they are not the only things a user is allowed to want -- a user might also want some proof that the grammar in use actually recognizes the same language as the RFC -- that's a bit easier when the grammar is visibly just a transliteration from the RFC into a slightly different notation. Or someone might want to see the structure assigned to an IRI by the grammar in the RFC.

I continue to regard the collection of samples as a good place to show what can be done with ixml, and not such a good place to tell the editors of the relevant RFCs how the grammar would be written by someone who knew what they were about.

So I'll be happier with a proposal to replace the current ixml grammars in the samples/URI directory, or add one or more additional ones, if the README file can avoid condescending to the technical specifications of URIs and IRIs. But maybe that's a pipe dream.

Michael

Steven Pemberton <steven.pemberton@cwi.nl> writes:

> The standard syntax definition for IRIs is defined in RFC 3987,
> Internationalized Resource Identifiers (IRIs),
> https://datatracker.ietf.org/doc/html/rfc3987.
>
> The purpose of that definition is to define IRIs, and give a syntax
> that determines what a correct IRI looks like.
>
> The purpose of ixml in general is slightly different: on the one hand
> it is not always necessary to determine if the input is correct -- the
> assumption is often that the input is correct, and we only want to
> transform it. These are described as permissive grammars. For
> instance, you might define a date as having two digits for the day,
> rather than being specific that the digits are in a particular range.
>
> On the other hand, what ixml is really interested in is the structure
> of the underlying object, identifying the parts it is made up of,
> without intervening syntactical necessities. A date consists primarily
> of a day, a month, and a year, and not just of six digits.
>
> The RFC definition of IRIs while meeting (most of) its requirements
> for its purpose, doesn't meet the requirements for ixml:
>
> * Firstly, it is in places ambiguous, certain strings for correct IRIs
>   fulfilling the definition of an IRI in different ways; for instance
>   192.168.0.1 is both an IPv4address and (incorrectly in this author's
>   view) an ireg-name.
>
> * Secondly, it doesn't expose the underlying structure in a suitable
>   way for ixml. For instance, an ireg-name according to the RFC is
>   just a string of the characters allowed in a hostname, without
>   exposing any underlying structure. The hostname www.w3.org is
>   according to the RFC just a string of 10 allowable characters.
>
> * Thirdly, it actually fails in correctly determining if a string is a
>   correct IRI by being too lax. Again, using ireg-name as an example,
>   it will happily accept a hostname "...", which is disallowed by
>   rfc3986, which says that a hostname "consists of a sequence of
>   domain labels separated by ".", each domain label starting and
>   ending with an alphanumeric character and possibly also containing
>   "-" characters." In other words, it may not have adjacent dots, nor
>   start or end with a dot.
>
> The revised ixml definition for IRI tries to rectify these problems:
> it is strict in what it accepts, it is unambiguous, and it reveals the
> underlying structure of the parts, which is what ixml is all about.
>
> Steven

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Tuesday, 5 March 2024 02:23:07 UTC