- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Tue, 20 Feb 2024 14:38:24 +0000
- To: ixml <public-ixml@w3.org>
The standard syntax for IRIs is defined in RFC 3987, Internationalized Resource Identifiers (IRIs), https://datatracker.ietf.org/doc/html/rfc3987. The purpose of that definition is to define IRIs, and to give a syntax that determines what a correct IRI looks like.

The purpose of ixml in general is slightly different. On the one hand, it is not always necessary to determine whether the input is correct -- the assumption is often that the input is correct, and we only want to transform it. Grammars written on that assumption are described as permissive grammars. For instance, you might define a date as having two digits for the day, rather than being specific that the digits are in a particular range.

On the other hand, what ixml is really interested in is the structure of the underlying object: identifying the parts it is made up of, without intervening syntactical necessities. A date consists primarily of a day, a month, and a year, and not just of six digits.

The RFC definition of IRIs, while meeting (most of) the requirements for its purpose, doesn't meet the requirements for ixml:

* Firstly, it is in places ambiguous: certain strings that are correct IRIs fulfil the definition of an IRI in more than one way. For instance, 192.168.0.1 is both an IPv4address and (incorrectly in this author's view) an ireg-name.

* Secondly, it doesn't expose the underlying structure in a way that is suitable for ixml. For instance, an ireg-name according to the RFC is just a string of the characters allowed in a hostname, with no further structure. The hostname www.w3.org is, according to the RFC, just a string of 10 allowable characters.

* Thirdly, it is too lax, and so fails to determine correctly whether a string is a correct IRI. Again using ireg-name as an example, it will happily accept the hostname "...", which is disallowed by RFC 3986, which says that a hostname "consists of a sequence of domain labels separated by ".", each domain label starting and ending with an alphanumeric character and possibly also containing "-" characters." In other words, it may not have adjacent dots, nor start or end with a dot.

The revised ixml definition for IRI tries to rectify these problems: it is strict in what it accepts, it is unambiguous, and it reveals the underlying structure of the parts, which is what ixml is all about.

Steven
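
As an illustration only, the hostname rule quoted above can be sketched in a few lines of Python. This is not the revised ixml grammar; the names LABEL, HOSTNAME and split_hostname are made up for the example, and the sketch assumes ASCII labels only, whereas IRIs also allow non-ASCII characters. It shows both points at once: being strict (rejecting "...") and exposing the labels as separate parts rather than one undifferentiated string.

```python
import re

# One domain label: starts and ends with an alphanumeric character,
# optionally with alphanumerics or "-" in between.
# Assumption for this sketch: ASCII only.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"

# A hostname: one or more labels separated by single dots,
# so no adjacent dots and no leading or trailing dot.
HOSTNAME = re.compile(rf"{LABEL}(?:\.{LABEL})*")


def split_hostname(text):
    """Return the list of labels if text is a strict hostname, else None."""
    if HOSTNAME.fullmatch(text) is None:
        return None
    return text.split(".")
```

With this sketch, split_hostname("www.w3.org") yields the three labels www, w3 and org, while "...", ".w3.org" and "w3.org." are all rejected. It does not address the first point above: a string such as 192.168.0.1 still satisfies this pattern as well as that of an IPv4 address.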
Received on Tuesday, 20 February 2024 14:38:31 UTC