Re: Range conformance from John Lumley on 2022-02-24 (public-ixml@w3.org from February 2022)

From: John Lumley <john@saxonica.com>
Date: Thu, 24 Feb 2022 15:46:56 +0000
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Cc: Steven Pemberton <steven.pemberton@cwi.nl>, public-ixml@w3.org
Message-Id: <FEEDED94-4235-4314-BA2A-CA3D4ED09FF0@saxonica.com>

In XPath, the expression
    45 to 2
is not considered an error (it evaluates to the empty sequence), though that case is simpler than a reversed range in a grammar.
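
To illustrate the XPath behaviour, here is a rough Python analogue. It is only a sketch: xpath_to is a made-up helper name, not an XPath or library function.

```python
def xpath_to(m: int, n: int) -> list:
    # Mimics XPath's "m to n": the integers m..n inclusive,
    # and an empty sequence (no error) when m > n.
    return list(range(m, n + 1))

print(xpath_to(2, 5))   # [2, 3, 4, 5]
print(xpath_to(45, 2))  # []
```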

> On 24 Feb 2022, at 15:42, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
> 
> 
> Steven Pemberton writes:
> 
>> The spec says:
>>    A range matches any character in the range from the start
>>    character to the end, inclusive, using the Unicode ordering
>> 
>> It doesn't require the start character to be earlier in the ordering
>> than the end character. This means that
>> 
>>    ["z"-"a"]
>> is the same as
>>    [ ]
>> 
>> Do we care? Should it be an error, or a warning?
> 
> I think a warning makes sense only if there is some meaning that can be
> attached to the range as specified. We can
> 
> (a) allow the arguments in either order and say that the range matches any
>  character between the two points specified, inclusive;
> 
> (b) say that it matches the starting code point, the ending code point,
>  and any characters with code points between the two (so
>  ["z"-"a"] is equivalent to ["za"]);
> 
> (c) invent some other meaning for ["z"-"a"]; or
> 
> (d) call it an error.
> 
> I don't know offhand what other regular expression notations that people
> may be familiar with do.  Looking it up, I find that XSD defines the meaning
> of a range this way:
> 
>    A ·character range· in the form s-e identifies the set of characters
>    with UCS code points greater than or equal to the code point of s,
>    but not greater than the code point of e.
> 
> This seems to fall in class (c).  Since no characters have UCS code
> points p with (p ≥ 122) ∧ (p ≤ 97), that means that in XSD regular
> expressions (and, I guess, XPath 3 regular expressions), [z-a] matches
> nothing and is thus equivalent to ixml [].
> 
> I am agnostic, but in the abstract I would lean towards (a) or (d).
> 
> Since a processor has to check the order either way, I don't think (a)
> imposes any new cost on the processor, and it does make ixml processors
> less fussy.  (And using Earley parsing pretty much says off the bat that
> high performance in parsing is not a goal for ixml.)
> 
> But unless we find some regular expression syntax that is reasonably
> widely used that uses interpretation (a), I think the principle of least
> surprise would lead us to (d).
> 
> -- 
> C. M. Sperberg-McQueen
> Black Mesa Technologies LLC
> http://blackmesatech.com
>
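
For what it's worth, here is a minimal Python sketch of how the interpretations discussed above differ for a reversed range such as ["z"-"a"]. The function name match_set and the mode labels are made up for illustration and are not part of ixml or XSD. The "xsd" mode follows the quoted XSD wording read literally.

```python
def match_set(start: str, end: str, mode: str) -> set:
    """Characters matched by the range start-end under a given interpretation.

    mode (hypothetical labels, following the options in the quoted message):
      "a"   - accept either order; match everything between the two points
      "b"   - both endpoints, plus code points p with start <= p <= end
      "xsd" - only code points p with start <= p <= end
      "d"   - a reversed range is an error
    """
    s, e = ord(start), ord(end)
    if mode == "a":
        lo, hi = min(s, e), max(s, e)
        return {chr(p) for p in range(lo, hi + 1)}
    # Empty when s > e, because range(s, e + 1) is then empty.
    between = {chr(p) for p in range(s, e + 1)}
    if mode == "b":
        return between | {start, end}
    if mode == "xsd":
        return between
    if mode == "d":
        if s > e:
            raise ValueError(f"reversed range: {start!r}-{end!r}")
        return between
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("a", "b", "xsd"):
    print(mode, sorted(match_set("z", "a", mode)))
```

Under this sketch, (a) gives all 26 lowercase letters, (b) gives just "a" and "z", the XSD reading gives the empty set, and (d) raises an error. For a range written in the usual order, all four modes agree.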

Received on Thursday, 24 February 2022 15:47:13 UTC