Re: IRIs

Thanks for this. A solid and thorough piece of work.

However, I think our problem is a little simpler; we don't need to parse  
the URI, only recognise if it is correct or not. This means we can greatly  
simplify the syntax.

As far as I can see, the basis of the regex needed is:

    IRI-reference  = (scheme ":" [hier] | [hier-nc])
                     [ "?" ipch-q* ] [ "#" ipch-f* ]

where the only difference between 'hier' and 'hier-nc' is that hier-nc must
not contain a colon before the first "/" character (if any).

(The only difference between the characters represented by 'ipch-q' and  
'ipch-f' is that ipch-q can contain characters from the private use areas.)
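
A minimal Python sketch of a recogniser built on that outline follows. It rests on assumptions that are not in the message: 'hier' is taken to be any run of ipchar and "/" characters, 'hier-nc' is the same with no ":" before the first "/", and IP literals such as "[::1]" are ignored. The character ranges are the 'ucschar' and 'iprivate' ranges of RFC 3987; all Python names are illustrative.

```python
import re

def _ranges(pairs):
    """Turn (low, high) code point pairs into the body of a regex character class."""
    return "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in pairs)

# 'ucschar' and 'iprivate' ranges as listed in RFC 3987.
UCSCHAR = _ranges(
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)
IPRIVATE = _ranges([(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)])

PCT = r"%[0-9A-Fa-f]{2}"                         # pct-encoded
BASE = r"A-Za-z0-9\-._~!$&'()*+,;=@" + UCSCHAR   # ipchar without ":" and pct-encoded

SCHEME = r"[A-Za-z][A-Za-z0-9+\-.]*"
# Assumption: 'hier' is any run of ipchar and "/".
HIER = rf"(?:[{BASE}:/]|{PCT})*"
# Assumption: 'hier-nc' allows no ":" before the first "/".
HIER_NC = rf"(?:[{BASE}]|{PCT})*(?:/{HIER})?"
# ipch-q additionally admits the private use areas; ipch-f does not.
IPCH_Q = rf"(?:[{BASE}{IPRIVATE}:/?]|{PCT})"
IPCH_F = rf"(?:[{BASE}:/?]|{PCT})"

IRI_REFERENCE = re.compile(
    rf"(?:{SCHEME}:{HIER}|{HIER_NC})(?:\?{IPCH_Q}*)?(?:#{IPCH_F}*)?"
)

def is_iri_reference(text):
    """Recognise (not parse) a string against the simplified grammar."""
    return IRI_REFERENCE.fullmatch(text) is not None
```

Under these assumptions `is_iri_reference("../a/b?x=1#top")` is accepted, while `is_iri_reference("1:2")` is rejected because of the colon rule for 'hier-nc'.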

As I see it, XForms needs two IRI types. For the case where a user is  
required to type in a full web-address:

    IRI            = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ]

and where data could hold either a full IRI or a relative IRI:

    IRI-reference  = (scheme ":" [hier] | [hier-nc])
                     [ "?" ipch-q* ] [ "#" ipch-f* ]

In passing, I note one great failing of the syntax in RFC 3987. While it
goes to great lengths to define an IPv4 address as exactly four numbers
between 0 and 255, separated by dots, and is similarly careful with IPv6
addresses, when it comes to what it calls an 'ireg-name' (representing
things such as www.w3.org), it allows any old rubbish through, including
things like "...", "}{" and "+-.._". If we adopt a type in XForms, I would
be inclined to use the definition used for email, which is one or more
'subdomains' separated by ".".
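
As an illustration of the stricter host check, here is a small Python sketch of "one or more subdomains separated by '.'". The rule for a single subdomain (letters and digits, with hyphens allowed only in the middle) is an assumption made for this example; it is not taken from the XForms email type or from RFC 3987, and the names are hypothetical.

```python
import re

# Assumed 'subdomain': Unicode letters/digits ([^\W_] is \w without "_"),
# with hyphens permitted only between the first and last character.
SUBDOMAIN = r"[^\W_](?:(?:[^\W_]|-)*[^\W_])?"

# One or more subdomains separated by ".".
HOST = re.compile(rf"{SUBDOMAIN}(?:\.{SUBDOMAIN})*")

def is_host(text):
    """Return True if text is a dot-separated list of subdomains."""
    return HOST.fullmatch(text) is not None

# The examples from the message that 'ireg-name' lets through:
for candidate in ["www.w3.org", "...", "}{", "+-.._"]:
    print(repr(candidate), is_host(candidate))
```

With this rule "www.w3.org" is accepted and the three rubbish strings are rejected.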

Steven

On Thu, 08 Mar 2018 18:56:28 +0100, C. M. Sperberg-McQueen  
<cmsmcq@blackmesatech.com> wrote:

> The tighter types defined in [1] may be what you want.
>
> [1]  
> https://www.w3.org/XML/Group/2004/06/exacturi/xsd-rfc-3986-uri-3986-iri.html
>
> Actually, though, since it appears the access privileges for /XML/Group
> have been made more restrictive, it's not clear that the document at [1]
> is available to anyone outside the Team anymore.  So I attach a copy,
> which I have munged to try to make it display plausibly from the
> lists.w3.org archives.
>
> The actual definitions of the types appear to be accessible in [2], [3],
> and [4].
>
> [2] http://www.w3.org/2011/04/XMLSchema/TypeLibrary-IRI-URI-driver.xsd
> [3] http://www.w3.org/2011/04/XMLSchema/TypeLibrary-URI-RFC3986.xsd
> [4] http://www.w3.org/2011/04/XMLSchema/TypeLibrary-IRI-RFC3987.xsd
>
> Note that these are marked as drafts, and contain text saying that the
> current version of the types is in the schema datatype library at [5],
> which is not true: since the draft note was never published by the Schema WG,
> the types were never added to the type library.  If XForms wants to use
> them, you should probably re-issue them by revising the schema documents
> and publishing them in an appropriate location.  (And if you want to
> publish [1] as a group document, that would probably be useful for
> those who need to understand how the schema documents are
> constructed.)
>
> Michael
>
>
> On Mar 8, 2018, at 7:06 AM, Steven Pemberton wrote:
>
>> But that said, the anyURI type is extremely liberal, accepting  
>> literally *any* string of characters. The only purpose of the type  
>> seems to be mandating transforming the characters into something  
>> acceptable to a URI when necessary.
>>
>> It would still be useful to have a type that validates according to  
>> http://www.ietf.org/rfc/rfc3987.txt.
>>
>> Steven
>>
>>
>> On Thu, 08 Mar 2018 13:46:33 +0100, Steven Pemberton  
>> <steven.pemberton@cwi.nl> wrote:
>>
>>> You are absolutely right, and I am absolutely wrong.
>>>
>>> What led me to the conclusion was writing the test suite for anyURI,  
>>> and IRIs showing up as invalid, and me then following the wrong link.
>>>
>>> So all is well, I can breathe a sigh of relief, and carry on with the  
>>> test suite.
>>>
>>> I'm happy that you are reading the XForms mailing list :-)
>>>
>>> Steven
>>>
>>> On Wed, 07 Mar 2018 19:37:20 +0100, C. M. Sperberg-McQueen  
>>> <cmsmcq@blackmesatech.com> wrote:
>>>
>>>>
>>>>> On Mar 7, 2018, at 10:39 AM, Steven Pemberton  
>>>>> <steven.pemberton@cwi.nl> wrote:
>>>>>
>>>>> The definition of anyURI doesn't allow IRIs, such as
>>>>>  https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en.
>>>>>
>>>>> Just as we added an iemail address type to match modern email  
>>>>> addresses, it seems to me that we ought to also add an anyIRI type  
>>>>> that accepts IRIs like the above.
>>>>
>>>> I am puzzled; what leads you to the conclusion that xsd:anyURI
>>>> does not accept IRIs?
>>>>
>>>> In XSD 1.0 [1], the value space is described as that of
>>>> RFC 2396, as modified by RFC 2732, and the lexical space
>>>> is described (roughly) as the set of strings, which after
>>>> escaping, turn into URIs as defined by those specs.  The
>>>> escaping in question is the then current algorithm for IRIs,
>>>> as published in the XLink spec. I believe that later revisions
>>>> of the concept of IRI changed the rules for whitespace, but
>>>> I don’t recall any other changes likely to be noticeable to
>>>> users of the datatype.  Certainly the intent of XSD 1.0
>>>> was to accept IRIs in the lexical space of the type anyURI.
>>>>
>>>> The spec says "This [the mapping from lexical space to value
>>>> space] means that a wide range of internationalized resource
>>>> identifiers can be specified when an anyURI is called for".
>>>>
>>>> In XSD 1.1 [2], the spec is a little more explicit, since the
>>>> IRI concept was a little more clearly developed by that time:
>>>> "anyURI represents an Internationalized Resource Identifier
>>>> Reference (IRI).  An anyURI value can be absolute or relative,
>>>> and may have an optional fragment identifier (i.e., it may be
>>>> an IRI Reference).  This type should be used when the value
>>>> fulfills the role of an IRI, as defined in [RFC 3987] or its
>>>> successor(s) in the IETF Standards Track."
>>>>
>>>> During the development of XSD 1.1 the WG responded to
>>>> inconsistencies in the 1.0 implementations of the anyURI
>>>> type (and, perhaps, to fears that future revisions of the RFCs
>>>> for URIs and IRIs would continue to change the set of legal
>>>> values) by seeking to simplify and future-proof the rules used
>>>> for checking schema-validity of IRIs.  For reasons I do not think
>>>> I can successfully reconstruct (at least, not without falling
>>>> into depression), it chose to do so by stating clearly that the
>>>> grammar rules specified by the relevant RFCs are effectively
>>>> only advisory, and that for purposes of schema validation,
>>>> any sequence of XML characters constitutes a value of the
>>>> type.
>>>>
>>>> So in XSD 1.1 it is doubly untrue to say that IRIs are not
>>>> accepted as lexical representations of xsd:anyURI:  not only
>>>> is it clearly stated that IRIs are to be accepted, but strings
>>>> that do not match the current definition of IRIs will *also*
>>>> be accepted as schema-valid.
>>>>
>>>> XForms needs its own IRI type only if stricter validation of the
>>>> grammar of URIs and IRIs is needed.
>>>>
>>>> If in fact stricter validation is needed, the XForms group may
>>>> wish to consider using the datatypes defined in “XSD datatypes
>>>> for strict validation of IRIs and URIs” [3].
>>>>
>>>> It would be very disappointing if the amount of work that went
>>>> into making xsd:anyURI accept IRIs turned out to be for
>>>> naught.
>>>>
>>>> [1] https://www.w3.org/TR/xmlschema-2/#anyURI
>>>> [2] https://www.w3.org/TR/xmlschema11-2/#anyURI
>>>> [3]  
>>>> https://www.w3.org/XML/Group/2004/06/exacturi/xsd-rfc-3986-uri-3986-iri.html
>>>>
>>>> N.B. I am unable to verify URI [3], since my access privileges
>>>> no longer seem sufficient to retrieve the document.  [3] was
>>>> prepared for publication as a WG note by the then XML Schema
>>>> WG but never published, since the WG ran out of resources and
>>>> time.  When the XML Core WG took over responsibility for
>>>> XSD, they decided they didn’t have the necessary resources, either.
>>>> I would be glad if the work were finally published.
>>>>
>>>> ********************************************
>>>> C. M. Sperberg-McQueen
>>>> Black Mesa Technologies LLC
>>>> cmsmcq@blackmesatech.com
>>>> http://www.blackmesatech.com
>>>> ********************************************
>>>>

Received on Sunday, 11 March 2018 19:52:19 UTC