- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Sun, 11 Mar 2018 20:51:44 +0100
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
- Cc: public-xformsusers@w3.org
Thanks for this. A solid and thorough piece of work.
However, I think our problem is a little simpler; we don't need to parse
the URI, only recognise if it is correct or not. This means we can greatly
simplify the syntax.
As far as I can see, the basis of the regex needed is:
IRI-reference = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ] [ "#"
ipch-f* ]
where the only difference between 'hier' and 'hier-nc' is that hier-nc may
have no colons before the first (if any) "/" character.
(The only difference between the characters represented by 'ipch-q' and
'ipch-f' is that ipch-q can contain characters from the private use areas.)
As I see it, XForms needs two IRI types. For the case where a user is
required to type in a full web-address:
IRI = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ]
and where data could hold either a full IRI or a relative IRI:
IRI-reference = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ] [ "#"
ipch-f* ]
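As a rough illustration of how the two types might be recognised with plain regular expressions, here is a minimal Python sketch. It is only one possible reading of the simplified syntax above: the names (is_iri, is_iri_reference, SEG and so on) are made up for the example, 'hier' is taken to be any run of path characters and slashes, the non-ASCII ranges only approximate ucschar and iprivate from RFC 3987, and bracketed IP literals such as [::1] are deliberately not handled.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986)
SCHEME = r"[A-Za-z][A-Za-z0-9+.\-]*"
PCT = r"%[0-9A-Fa-f]{2}"                 # percent-encoded octet
# ASCII characters allowed inside a segment, colon excluded on purpose.
SEG = r"A-Za-z0-9\-._~!$&'()*+,;=@"
# Assumption: coarse approximation of 'ucschar' and 'iprivate'.
UCS = r"\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF\U00010000-\U000EFFFD"
PRIVATE = r"\uE000-\uF8FF\U000F0000-\U0010FFFD"

NC = rf"(?:[{SEG}{UCS}]|{PCT})"          # one character, no colon
CH = rf"(?:[{SEG}{UCS}:]|{PCT})"         # one character, colon allowed

HIER = rf"(?:{CH}|/)*"                   # colons allowed anywhere
HIER_NC = rf"{NC}*(?:/(?:{CH}|/)*)?"     # no colon before the first "/"
QUERY = rf"(?:[{SEG}{UCS}{PRIVATE}:/?]|{PCT})*"   # ipch-q*: private use allowed
FRAGMENT = rf"(?:[{SEG}{UCS}:/?]|{PCT})*"         # ipch-f*: private use excluded

TAIL = rf"(?:\?{QUERY})?(?:#{FRAGMENT})?"
IRI = re.compile(rf"{SCHEME}:{HIER}{TAIL}")
IRI_REFERENCE = re.compile(rf"(?:{SCHEME}:{HIER}|{HIER_NC}){TAIL}")


def is_iri(text: str) -> bool:
    """True if text has a scheme and otherwise fits the simplified syntax."""
    return IRI.fullmatch(text) is not None


def is_iri_reference(text: str) -> bool:
    """True if text is a full IRI or a relative reference."""
    return IRI_REFERENCE.fullmatch(text) is not None


print(is_iri("https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en"))  # True
print(is_iri("../a/b"))                # False: no scheme
print(is_iri_reference("../a/b"))      # True
print(is_iri_reference("1:b"))         # False: colon before any "/"
```

Since nothing is captured or decomposed, the expressions only answer whether a string is acceptable, which is all that validation needs.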
In passing I note one great failing of the syntax in RFC 3987:
while it goes to great lengths to define an IP address as exactly 4
numbers, separated by dots, where the numbers are between 0 and 255, and
similarly for an IPv6 address, when it comes to what it calls an
'ireg-name' (representing things such as www.w3.org), it allows any old
rubbish through, including things like "...", "}{" and "+-.._". If we
adopt a type in XForms, I would be inclined to use the definition used for
email, which is one or more 'subdomains' separated by ".".
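To make the difference concrete, here is a small Python sketch of such a stricter host check. It assumes the RFC 5321 style of subdomain (letters and digits, with hyphens allowed only in the middle) and ASCII only; the names LABEL, HOST and is_host are invented for the example, and an internationalised variant would need a wider character class.

```python
import re

# One 'subdomain': starts and ends with a letter or digit,
# hyphens allowed only in between (assumption: RFC 5321 style, ASCII only).
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?"
# One or more subdomains separated by ".".
HOST = re.compile(rf"{LABEL}(?:\.{LABEL})*")


def is_host(name: str) -> bool:
    """True if name is a dot-separated list of well-formed subdomains."""
    return HOST.fullmatch(name) is not None


print(is_host("www.w3.org"))   # True
print(is_host("..."))          # False
print(is_host("}{"))           # False
print(is_host("+-.._"))        # False
```

All three of the rubbish examples above are rejected, while ordinary names still pass.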
Steven
On Thu, 08 Mar 2018 18:56:28 +0100, C. M. Sperberg-McQueen
<cmsmcq@blackmesatech.com> wrote:
> The tighter types defined in [1] may be what you want.
>
> [1]
> https://www.w3.org/XML/Group/2004/06/exacturi/xsd-rfc-3986-uri-3986-iri.html
>
> Actually, though, since it appears the access privileges for /XML/Group
> have been made more restrictive, it's not clear that the document at [1]
> is available to anyone outside the Team anymore. So I attach a copy,
> which I have munged to try to make it display plausibly from the
> lists.w3.org archives.
>
> The actual definitions of the types appear to be accessible in [2], [3],
> and [4].
>
> [2] http://www.w3.org/2011/04/XMLSchema/TypeLibrary-IRI-URI-driver.xsd
> [3] http://www.w3.org/2011/04/XMLSchema/TypeLibrary-URI-RFC3986.xsd
> [4] http://www.w3.org/2011/04/XMLSchema/TypeLibrary-IRI-RFC3987.xsd
>
> Note that these are marked as drafts, and contain text saying that the
> current version of the types is in the schema datatype library at [5],
> which is not true: since the draft note was never published by the
> Schema WG, the types were never added to the type library. If XForms
> wants to use them, you should probably re-issue them by revising the
> schema documents and publishing them in an appropriate location. (And
> if you want to publish [1] as a group document, that would probably be
> useful for those who need to understand how the schema documents are
> constructed.)
>
> Michael
>
>
> On Mar 8, 2018, at 7:06 AM, Steven Pemberton wrote:
>
>> But that said, the anyURI type is extremely liberal, accepting
>> literally *any* string of characters. The only purpose of the type
>> seems to be mandating transforming the characters into something
>> acceptable to a URI when necessary.
>>
>> It would still be useful to have a type that validates according to
>> http://www.ietf.org/rfc/rfc3987.txt.
>>
>> Steven
>>
>>
>> On Thu, 08 Mar 2018 13:46:33 +0100, Steven Pemberton
>> <steven.pemberton@cwi.nl> wrote:
>>
>>> You are absolutely right, and I am absolutely wrong.
>>>
>>> What led me to the conclusion was writing the test suite for anyURI,
>>> and IRIs showing up as invalid, and me then following the wrong link.
>>>
>>> So all is well, I can breathe a sigh of relief, and carry on with the
>>> test suite.
>>>
>>> I'm happy that you are reading the XForms mailing list :-)
>>>
>>> Steven
>>>
>>> On Wed, 07 Mar 2018 19:37:20 +0100, C. M. Sperberg-McQueen
>>> <cmsmcq@blackmesatech.com> wrote:
>>>
>>>>
>>>>> On Mar 7, 2018, at 10:39 AM, Steven Pemberton
>>>>> <steven.pemberton@cwi.nl> wrote:
>>>>>
>>>>> The definition of anyURI doesn't allow IRIs, such as
>>>>> https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en.
>>>>>
>>>>> Just as we added an iemail address type to match modern email
>>>>> addresses, it seems to me that we ought to also add an anyIRI type
>>>>> that accepts IRIs like the above.
>>>>
>>>> I am puzzled; what leads you to the conclusion that xsd:anyURI
>>>> does not accept IRIs?
>>>>
>>>> In XSD 1.0 [1], the value space is described as that of
>>>> RFC 2396, as modified by RFC 2732, and the lexical space
>>>> is described (roughly) as the set of strings, which after
>>>> escaping, turn into URIs as defined by those specs. The
>>>> escaping in question is the then current algorithm for IRIs,
>>>> as published in the XLink spec. I believe that later revisions
>>>> of the concept of IRI changed the rules for whitespace, but
>>>> I don’t recall any other changes likely to be noticeable to
>>>> users of the datatype. Certainly the intent of XSD 1.0
>>>> was to accept IRIs in the lexical space of the type anyURI.
>>>>
>>>> The spec says "This [the mapping from lexical space to value
>>>> space] means that a wide range of internationalized resource
>>>> identifiers can be specified when an anyURI is called for".
>>>>
>>>> In XSD 1.1 [2], the spec is a little more explicit, since the
>>>> IRI concept was a little more clearly developed by that time:
>>>> "anyURI represents an Internationalized Resource Identifier
>>>> Reference (IRI). An anyURI value can be absolute or relative,
>>>> and may have an optional fragment identifier (i.e., it may be
>>>> an IRI Reference). This type should be used when the value
>>>> fulfills the role of an IRI, as defined in [RFC 3987] or its
>>>> successor(s) in the IETF Standards Track."
>>>>
>>>> During the development of XSD 1.1 the WG responded to
>>>> inconsistencies in the 1.0 implementations of the anyURI
>>>> type (and, perhaps, to fears that future revisions of the RFCs
>>>> for URIs and IRIs would continue to change the set of legal
>>>> values) by seeking to simplify and future-proof the rules used
>>>> for checking schema-validity of IRIs. For reasons I do not think
>>>> I can successfully reconstruct (at least, not without falling
>>>> into depression), it chose to do so by stating clearly that the
>>>> grammar rules specified by the relevant RFCs are effectively
>>>> only advisory, and that for purposes of schema validation,
>>>> any sequence of XML characters constitutes a value of the
>>>> type.
>>>>
>>>> So in XSD 1.1 it is doubly untrue to say that IRIs are not
>>>> accepted as lexical representations of xsd:anyURI: not only
>>>> is it clearly stated that IRIs are to be accepted, but strings
>>>> that do not match the current definition of IRIs will *also*
>>>> be accepted as schema-valid.
>>>>
>>>> XForms needs its own IRI type only if stricter validation of the
>>>> grammar of URIs and IRIs is needed.
>>>>
>>>> If in fact stricter validation is needed, the XForms group may
>>>> wish to consider using the datatypes defined in “XSD datatypes
>>>> for strict validation of IRIs and URIs” [3].
>>>>
>>>> It would be very disappointing if the amount of work that went
>>>> into making xsd:anyURI accept IRIs turned out to be for
>>>> naught.
>>>>
>>>> [1] https://www.w3.org/TR/xmlschema-2/#anyURI
>>>> [2] https://www.w3.org/TR/xmlschema11-2/#anyURI
>>>> [3]
>>>> https://www.w3.org/XML/Group/2004/06/exacturi/xsd-rfc-3986-uri-3986-iri.html
>>>>
>>>> N.B. I am unable to verify URI [3], since my access privileges
>>>> no longer seem sufficient to retrieve the document. [3] was
>>>> prepared for publication as a WG note by the then XML Schema
>>>> WG but never published, since the WG ran out of resources and
>>>> time. When the XML Core WG took over responsibility for
>>>> XSD, they decided they didn’t have the necessary resources, either.
>>>> I would be glad if the work were finally published.
>>>>
>>>> ********************************************
>>>> C. M. Sperberg-McQueen
>>>> Black Mesa Technologies LLC
>>>> cmsmcq@blackmesatech.com
>>>> http://www.blackmesatech.com
>>>> ********************************************
>>>>
Received on Sunday, 11 March 2018 19:52:19 UTC