Re: Change definition of URL to normatively reference IRI specification using a well-defined interface from Larry Masinter on 2010-05-18 (public-html@w3.org from May 2010)

From: Larry Masinter <LMM@acm.org>
Date: Tue, 18 May 2010 12:00:14 -0700
To: <public-html@w3.org>, <public-iri@w3.org>
Message-ID: <003401caf6bc$5a1bb410$0e531c30$@org>
In the deluge of the last few months, I've gotten way behind,
so forgive me if I've missed some of this and am treading old
ground. I'm cc'ing both W3C HTML and IETF IRI mailing lists,
since the message addresses the interface between the two
specifications:

Re the HTML working group change proposal:
http://lists.w3.org/Archives/Public/public-html/2010Apr/0147.html
for W3C HTML working group Issue-56 
http://www.w3.org/html/wg/tracker/issues/56


I was asked for a comment:

> Larry, can you comment on the feasibility of the requested IRIbis
> changes? It would be really great to get this HTML WG issue settled,
> so that remaining IRI/URI/URL work can move into the IRIbis WG.

I think it is quite feasible to accomplish what the rationale asks for,
but I have comments and questions on the requirements:


# ISSUE-56
# ========
# 
# SUMMARY
# The HTML specification is changed slightly to reference the IRI 
# specification using a well-defined interface.

This is fine, although we need to agree on "well-defined", and I think
there might be some difference of opinion.

# RATIONALE
# To ensure a clean modular separation of the IRI and HTML specifications,
# an interface is needed. This allows the specifications to co-exist in a
# well-defined way without each specification needing to be continually
# updated as the other is fixed (for example, changing references to
# section numbers or step numbers).

Agree, this would be much better and quite feasible.

# DETAILS

# Update the IRI specification to define two algorithms:

I'm reluctant to define "algorithms" for interfaces when
"constraints on interface implementations" are
more appropriate. While it is important that scripting which
parses hypertext reference elements (whether you call them
IRIs or URLs) works reliably for existing web content, it
also seems reasonable to allow for variations between
implementations that don't matter in practice, especially
when currently deployed implementations actually vary in ways
that are visible (it's possible to construct tests where the
results don't match) but unimportant (e.g., parsing an
invalid string might give different results).

That is, in some cases, it may be reasonable to allow
implementations to "agree to disagree", because the disagreements
are irrelevant.

Allowing this is reasonable, because URL/IRI parsing is implemented
in so many systems, not just browsers, and making currently
conforming implementations non-conforming requires stronger
justification than just a first-principle preference for
precision. If interfaces are defined in terms of required
relationships between input and output, some of those
relationships may well be best specified as SHOULD rather
than MUST.

<  * parsing an address (relative or absolute): algorithm to obtain a 
<   failure/success condition (not the same as whether the input is 
<   valid or not, just whether it can be parsed), and the following 
<   components, from parsing an arbitrary string:
<    - <scheme> component
<    - <host> component
<    - <port> component
<    - <hostport> component
<    - <path> component
<    - <query> component
<    - <fragment> component
<    - <host-specific> component

There are some things about this requirement that I don't really
understand:

IRIs and IRIbis are defined in terms of "sequence of
characters", where each character is taken from the Unicode
repertoire.

The requirement is written in terms of "string", though,
and the mapping is unclear at the moment. Do all implementations
first translate input strings in their native character encodings
into Unicode before parsing? Are there implementations which
parse strings whose encoding is a charset that does not map fully
into Unicode (I vaguely recall iso-2022-jp as an example)?


What is the range of the "failure" condition? I.e., is it enough
merely to signal a "failure", or does the nature of the failure also
need to be available/discernible?
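
To make the two readings of "failure" concrete, here is a minimal
sketch. The names ParseFailure, parse_detailed and parse_bare are
hypothetical and come from neither specification, and the parsing rule
is a deliberately trivial placeholder (it ignores userinfo and IPv6
literals); only the shape of the return value matters here.

```python
from enum import Enum
from typing import Optional, Union


class ParseFailure(Enum):
    """Hypothetical failure kinds; neither spec defines such a list."""
    EMPTY_INPUT = "empty input"
    BAD_PORT = "port is not numeric"


def parse_detailed(text: str) -> Union[str, ParseFailure]:
    """Reading 2: the nature of the failure is discernible to the caller."""
    if not text:
        return ParseFailure.EMPTY_INPUT
    # Placeholder rule: take whatever sits between "//" and the next "/".
    authority = text.partition("//")[2].partition("/")[0]
    host, sep, port = authority.rpartition(":")
    if sep and port and not port.isdigit():
        return ParseFailure.BAD_PORT
    return text


def parse_bare(text: str) -> Optional[str]:
    """Reading 1: the interface only signals that parsing failed."""
    result = parse_detailed(text)
    return None if isinstance(result, ParseFailure) else result
```

Under the first reading, two implementations that fail for different
reasons are indistinguishable to the caller; under the second, the set
of failure kinds itself becomes part of the interface.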

Is it necessary/possible/allowed to distinguish between
"component empty" and "component not present"?
E.g., for http://example.com:/ vs http://example.com/
does the "port" component distinguish (is it allowed to
distinguish) between component empty and component not present?

Should <hostport> be part of this interface, given that
it can be reconstructed from <host> and <port>?
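
As an illustration of these two questions, here is a rough sketch
built on the generic component regular expression from RFC 3986,
Appendix B. The names Parts, split and hostport are hypothetical, and
the host/port split is a simplification (no userinfo, no IPv6
literals). It assumes "not present" is modelled as None and "empty" as
the empty string.

```python
import re
from typing import NamedTuple, Optional

# Generic component regex from RFC 3986, Appendix B.
_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


class Parts(NamedTuple):
    scheme: Optional[str]
    host: Optional[str]
    port: Optional[str]      # None = not present, "" = present but empty
    path: str
    query: Optional[str]     # same None / "" convention
    fragment: Optional[str]


def split(text: str) -> Parts:
    """Split a string into components; groups that did not match are None."""
    m = _URI_RE.match(text)  # every part is optional, so this always matches
    authority = m.group(4)
    host, port = authority, None
    if authority is not None and ":" in authority:
        host, _, port = authority.rpartition(":")
    return Parts(m.group(2), host, port, m.group(5), m.group(7), m.group(9))


def hostport(parts: Parts) -> Optional[str]:
    """Rebuild <hostport> from <host> and <port>."""
    if parts.host is None:
        return None
    return parts.host if parts.port is None else f"{parts.host}:{parts.port}"
```

With this modelling, split("http://example.com:/").port is "" while
split("http://example.com/").port is None. The sketch also suggests
that <hostport> can be rebuilt exactly from <host> and <port> only if
the interface keeps that distinction.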

In the case where the interface signals an error:
Must all of the results be defined? Must they return exactly the same 
results?  

I think making that a requirement needs some justification.


> * resolving an address A relative to a base address B with an encoding C:
>   algorithm for parsing an arbitrary string A and resolving it relative
>   to address B (which will have been resolved, but may be invalid), using
>   a specified character encoding C, and returning either success or
>   failure, and in the case of success, a string, with the following
>   conditions:

What is the range of the failure signal? I.e., 
   resolverelative(A, B, C) ==> failure

is the interfa

What does it mean for B to "have been resolved", or was this an aside?

I think this alludes to a presumption that, for the interface
resolverelative(A, B, C), the input B is in the range of a previous
resolverelative step. Or is it? Is there some assumption that B is
an absolute IRI without a fragment identifier? Or with one?

>    - the output of the algorithm must be idempotent even if the base
>      argument is changed (i.e. once resolved, resolving it again with
>      the same character encoding cannot change the result)

I think this is saying something like

   resolverelative(resolverelative(A, B1, C), B2, C)
    == resolverelative(A, B1, C)
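
One way to check a property of this shape is sketched below. It uses
Python's urllib.parse.urljoin purely as a stand-in for
resolverelative; that is an assumption for illustration, since urljoin
is not the algorithm under discussion and takes no encoding argument
(C is accepted and ignored). The helper name is_idempotent is
hypothetical.

```python
from urllib.parse import urljoin


def resolverelative(a: str, b: str, c: str = "utf-8") -> str:
    """Stand-in for the proposed interface; urljoin ignores the encoding c."""
    return urljoin(b, a)


def is_idempotent(a: str, b1: str, b2: str, c: str = "utf-8") -> bool:
    """Check resolverelative(resolverelative(A, B1, C), B2, C)
    == resolverelative(A, B1, C) for one choice of A, B1, B2."""
    once = resolverelative(a, b1, c)
    twice = resolverelative(once, b2, c)
    return twice == once
```

For example, is_idempotent("../c?q#f", "http://example.com/a/b",
"http://other.example/x/y") holds with this stand-in, because the
first resolution already yields an absolute address. Whether it must
hold when B1 is not absolute is part of what the interface definition
would need to say.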

I think the interface definition probably needs to be clearer
about the role of character encoding, unless these are meant to
be "sequence of octets" strings rather than "sequence of Unicode
characters" strings.

>    - resolving preserves errors, e.g. resolving "http://example.com##"
>      returns "http://example.com/##" not "http://example.com/#%C3".

I think calling this "preserves errors" might be confusing in the
context of an interface which allows "errors" to be signaled.

I'm not sure why this is a requirement; I think there's been some
discussion of this on the mailing list, but perhaps someone could
recap.


> Update the HTML spec to use these algorithms and reference the IRI spec
> that defines them.

I would express this as:

"Update the HTML spec to use the methods whose interfaces are
specified in the IRI spec."
Received on Tuesday, 18 May 2010 19:07:07 UTC