(ACTION) On a new grammar for IRIs

The standard syntax for IRIs is defined in RFC 3987, 
Internationalized Resource Identifiers (IRIs), 
https://datatracker.ietf.org/doc/html/rfc3987.

The purpose of that document is to define IRIs, and to give a syntax that 
determines what a correct IRI looks like.

The purpose of ixml in general is slightly different: on the one hand it is 
not always necessary to determine if the input is correct -- the assumption 
is often that the input is correct, and we only want to transform it. 
Grammars written on that assumption are described as permissive. For 
instance, you might define the day of a date as any two digits, rather 
than requiring that the digits fall in a particular range.

On the other hand, what ixml is really interested in is the structure of 
the underlying object, identifying the parts it is made up of, without 
intervening syntactical necessities. A date consists primarily of a day, a 
month, and a year, and not just of six digits.
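As a rough illustration of the date example (in Python rather than ixml, purely as a sketch), a permissive pattern can accept any two digits for each part and still expose the day, month and year as named parts. The dd/mm/yy layout and the names PERMISSIVE_DATE and parse_date are assumptions made for this example only.

```python
import re

# Permissive: each part is any two ASCII digits; no range checking is done.
# The dd/mm/yy layout is an assumption made for this illustration.
PERMISSIVE_DATE = re.compile(
    r"(?P<day>[0-9]{2})/(?P<month>[0-9]{2})/(?P<year>[0-9]{2})"
)

def parse_date(text):
    """Return the day, month and year parts, or None if the shape is wrong."""
    match = PERMISSIVE_DATE.fullmatch(text)
    return match.groupdict() if match else None

parsed = parse_date("20/02/24")        # a plausible date
out_of_range = parse_date("99/99/99")  # accepted too: the pattern is permissive
no_structure = parse_date("200224")    # six bare digits do not fit this layout
```

The point of the sketch is the result shape: the caller gets named parts back rather than a string of six digits, whether or not the values are in range.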

The RFC definition of IRIs, while meeting (most of) the requirements for 
its purpose, doesn't meet the requirements for ixml: 

* Firstly, it is ambiguous in places: certain strings that are correct IRIs 
fulfil the definition of an IRI in more than one way; for instance 
192.168.0.1 is both an IPv4address and (incorrectly in this author's view) 
an ireg-name.

* Secondly, it doesn't expose the underlying structure in a suitable way 
for ixml. For instance, an ireg-name according to the RFC is just a string 
of the characters allowed in a hostname, without exposing any underlying 
structure. The hostname www.w3.org is, according to the RFC, just a string 
of 10 allowable characters.

* Thirdly, it is too lax, and so fails to determine correctly whether a 
string is a correct IRI. Again using ireg-name as an example, it will 
happily accept the hostname "...", which is disallowed by RFC 3986, which says 
that a hostname "consists of a sequence of domain labels separated by ".", 
each domain label starting and ending with an alphanumeric character and 
possibly also containing "-" characters." In other words, it may not have 
adjacent dots, nor start or end with a dot.
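The three points above can be checked with a small sketch, again in Python rather than ixml. The patterns are hand translations of the ASCII subset of the RFC rules (ucschar and the other host forms are left out), and STRICT_HOST follows the prose description of domain labels quoted above; none of this is taken from the revised ixml grammar, and the names are hypothetical.

```python
import re

# ireg-name, ASCII subset only: unreserved, sub-delims, or pct-encoded.
IREG_NAME = re.compile(r"(?:[A-Za-z0-9._~!$&'()*+,;=-]|%[0-9A-Fa-f]{2})*")

# IPv4address: four dec-octets (0-255) separated by dots.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4_ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# Stricter hostname: labels separated by ".", each label starting and
# ending with an alphanumeric and possibly containing "-" in between.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
STRICT_HOST = re.compile(rf"{LABEL}(?:\.{LABEL})*")

def definitions_matched(host):
    """List which of the two loose definitions the whole string satisfies."""
    rules = {"IPv4address": IPV4_ADDRESS, "ireg-name": IREG_NAME}
    return [name for name, rule in rules.items() if rule.fullmatch(host)]

def hostname_labels(host):
    """Return the domain labels, or None if the strict pattern rejects host."""
    return host.split(".") if STRICT_HOST.fullmatch(host) else None

ambiguous = definitions_matched("192.168.0.1")  # satisfies both definitions
lax = definitions_matched("...")                # the loose rule accepts dots only
labels = hostname_labels("www.w3.org")          # structure exposed as labels
rejected = hostname_labels("...")               # the strict rule refuses it
```

Note that this STRICT_HOST still matches 192.168.0.1, so the sketch only demonstrates the ambiguity and the laxness; it does not attempt to resolve the overlap with IPv4address.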

The revised ixml definition for IRI tries to rectify these problems: it is 
strict in what it accepts, it is unambiguous, and it reveals the underlying 
structure of the parts, which is what ixml is all about.

Steven

Received on Tuesday, 20 February 2024 14:38:31 UTC