Re: Advice on making IRI document suitable for reference by HTML (and other specs)

On Dec 28, 2009, at 2:56 PM, Larry Masinter wrote:

>> This is still confusing IRIs with the arbitrary contents of an
>> href (or other) attribute.
> 
> To try to bring two things into alignment isn't "confusing"
> them. Why shouldn't the most widely deployed implementations
> be used as a guideline? 

Because they are two different things.  I don't want them aligned
any more than I want "child" and "adult" aligned into "person".
The distinction is necessary for some standards and some
implementations.  The fact that it doesn't matter for some
specific browser contexts does not imply that it doesn't matter
for everyone else.

>> The fact is that HTML5 (and others) needs a definition of reference
>> and the rules for converting a reference to an IRI or URI.
> 
> Yes. I'm willing to admit that it may be necessary to retain some
> elements as "preprocessing", although I'm not convinced.
> 
>> Trying to pretend that a reference is always an IRI is doomed
>> to fail -- you might as well obsolete the RFC and say that
>> an IRI is anyString.
> 
> I'm not trying to _pretend_ anything, I'm trying to make
> it so. And issuing a new version of an RFC *does* make
> the old version "obsolete". If the standard doesn't match
> what implementations do, obsoleting the standard and
> making a new version isn't bad.

The current standard describes the output of the algorithm, and
you want to redefine the same term to describe the input of the
algorithm.  Such a change is bizarre, just as bizarre as the
similarly nonsensical way that HTML5 redefines URL.

I told you before, the solution to this problem is to not use
the same term for both the input and output of the algorithm.
That is the whole point of differentiating references from
the final result: an interoperable identifier in absolute form.
Both IRI and URI define the final result, not the data-entry
input, because there was no uniformity in how one gets from
an arbitrary string to a URI.  We might be able to get that
kind of uniformity within a single implementation space,
such as HTML attributes, but it would be at best a proposal
that has not yet been implemented in practice.

> I'm not convinced that it is inappropriate to define
> a syntax which parses into components, and yet any
> string *has* a parse, and that validity is determined
> after the parse rather than before. (Especially since
> the restrictions on character ranges may be different
> from one parsed field to another.)

I already did that.  See RFC 3986, Appendix B:

http://tools.ietf.org/html/rfc3986#appendix-B
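
As a rough illustration of "any string has a parse", here is a minimal
Python sketch built on the regular expression from RFC 3986 Appendix B.
The function name split_reference, the dictionary keys, and the use of
re.DOTALL are assumptions made for this example; they are not part of
the RFC.

```python
import re

# Regular expression from RFC 3986, Appendix B. re.DOTALL is an addition
# for this sketch so that a fragment containing a line break is still
# consumed in full.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)


def split_reference(ref):
    """Split any string into the five generic components.

    Every part of the pattern is optional, so the match never fails.
    A value of None means the component is absent. No validity check
    is performed here; that would be a separate step after the parse.
    """
    m = URI_REFERENCE.match(ref)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


if __name__ == "__main__":
    # Example string taken from RFC 3986, Appendix B.
    print(split_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
    # A string with a space still parses; it all lands in the path.
    print(split_reference("a b"))
```

Because the parse always succeeds, per-component character checks can
be applied afterwards, which is the ordering discussed above.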

>> Thus making all current references to the standard wrong
>> and useless.
> 
> If current references to the IRI Proposed Standard don't
> match what implementations actually do, then perhaps they
> ARE _wrong_, and fixing the specifications to match the
> widely deployed and interoperable implementations is 
> actually the right thing to do.

Browser data entry forms (search boxes) are not implementations
of IRI.  HTML href is not an implementation of IRI.  The output
of a browser's reference parser, just before it sends an address
on the wire for HTTP, is an implementation of URI.

>> Julian is right.
> 
> I didn't read a specific position in Julian's post, but
> rather just pointing out there were some existing
> specifications that would have to be reworded if
> the "no internal spaces" restriction might be required
> for those applications.

What Julian meant, I think, is that other protocols currently
reference the term IRI expecting that the grammar disallows
spaces, in the same way that the HTTP protocol assumes that
a valid request target cannot contain a space.
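
To make the HTTP comparison concrete, here is a small Python sketch of
a strict request-line parser. The function parse_request_line is
hypothetical and is not taken from any real server; it only shows why
a grammar that uses single spaces as delimiters cannot tolerate a raw
space inside the request target.

```python
def parse_request_line(line):
    """Split an HTTP/1.x request line into (method, target, version).

    The three fields are separated by single spaces, so an unencoded
    space inside the target yields a fourth field and the line is
    rejected.
    """
    parts = line.split(" ")
    if len(parts) != 3:
        raise ValueError("malformed request line: %r" % line)
    method, target, version = parts
    return method, target, version


if __name__ == "__main__":
    print(parse_request_line("GET /index.html HTTP/1.1"))
    try:
        parse_request_line("GET /my file.html HTTP/1.1")
    except ValueError as err:
        print("rejected:", err)
```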

>> What you should be doing
>> is defining an algorithm from anyString to the current
>> definition of IRI, 
> 
> That's what 
> http://tools.ietf.org/html/draft-duerst-iri-bis-07#section-7.2 
> section 7.2 "Web Address processing" already attempts.
> Do you think it accomplishes that? 

No.  I cannot even conceive of implementing that since the
ABNF is invalid and the preprocessing steps occur after the
grammar is defined.  It makes no sense.  Why not just take
anyString, split it into separate references by whitespace
if that is how the context is defined, preprocess that string
to remove embedded linefeeds and transform disallowed into
allowed characters, and then apply the regular expression in
RFC 3986?
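
The steps just listed (split, preprocess, then apply the RFC 3986
Appendix B expression) can be sketched in Python as follows. This is
an illustration under stated assumptions, not a proposed algorithm:
the function name to_references, the space_separated flag, the choice
of which characters count as allowed, the trimming of surrounding
whitespace, and the use of UTF-8 percent-encoding as the transform are
all choices made for this example.

```python
import re
from urllib.parse import quote

# Regular expression from RFC 3986, Appendix B (re.DOTALL added here).
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)

# Assumption: leave RFC 3986 reserved and unreserved characters alone,
# and leave "%" alone so existing percent-escapes are not re-encoded.
SAFE = "-._~:/?#[]@!$&'()*+,;=%"


def to_references(any_string, space_separated):
    """Turn an arbitrary attribute string into parsed references."""
    # Step 1: split on whitespace only if the context is defined that way.
    if space_separated:
        candidates = any_string.split()
    else:
        candidates = [any_string.strip()]

    results = []
    for candidate in candidates:
        # Step 2a: remove embedded line breaks.
        cleaned = candidate.replace("\r", "").replace("\n", "")
        # Step 2b: transform disallowed characters into allowed ones
        # (here: percent-encode their UTF-8 bytes).
        encoded = quote(cleaned, safe=SAFE)
        # Step 3: apply the Appendix B regular expression.
        m = URI_REFERENCE.match(encoded)
        results.append({
            "reference": encoded,
            "scheme": m.group(2),
            "authority": m.group(4),
            "path": m.group(5),
            "query": m.group(7),
            "fragment": m.group(9),
        })
    return results


if __name__ == "__main__":
    # Single-reference context: the space is encoded, not a separator.
    print(to_references("http://example.com/a b\nc", False))
    # List context: the same kind of string yields two references.
    print(to_references("http://example.com/a tag", True))
```

Note that the grammar is applied last, to a string that has already
been cleaned, rather than being asked to describe the raw input.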

>> and then change HTML5 so that it uses
>> anyString (or whatever you want to call it) as the attribute
>> definition. 
> 
> That's what was intended by:
> http://lists.w3.org/Archives/Public/public-html/2009Nov/att-0670/iri-rewrite-draft.html
> Do you think this is the right direction, then?

I think it would be easier to simply define how to process
a Web reference (not an address yet) into a Web address in
the form of an IRI or URI.

> Some of those definitions are useful outside of the context
> of HTML; do you agree with moving some of them into the
> IRI-BIS document?

No.  Some of those definitions aren't even useful inside HTML5
because the attribute string has to be parsed for whitespace
issues based on the definition of that attribute -- there is
no single attribute parser algorithm for HTML.  Furthermore,
what do we do then for documents that are not Unicode based,
do not have references that are Unicode based, and will not
work with IRI conversion to UTF-8?  Should those be called
IRIs as well?

>> My suggested name is "Web reference". 
> 
> I used "Web address" rather than "Web reference", since
> that was the term used before.
> 
>> Just be
>> aware that some HTML5 attributes require a list of
>> space-separated references, whereas others require a
>> single reference that expects space to be auto-encoded
>> by the parser.
> 
> I looked through the HTML5 specification for any specific reference
> to WEBADDRESS or HTML5 section 2.5, and saw no such attributes;
> could you give an example of an HTML5 attribute which requires a
> list of space-separated references?

rel="", itemprop="", and potentially any attribute that consists
of an undefined set of space-separated tokens (token syntax is
only restricted to exclude space).

....Roy

Received on Tuesday, 29 December 2009 00:34:17 UTC