Re: Scope question

On Wed, May 5, 2010 at 5:09 PM, Roy T. Fielding <fielding@gbiv.com> wrote:
> On May 5, 2010, at 11:11 AM, Adam Barth wrote:
>> RFC 3986 Section 3.1 is helpful w.r.t. the casing of the scheme.
>> However, it's not as clear as it could be.  For example, it says:
>>
>> "documents that specify schemes must do so with lowercase letters"
>>
>> It's unclear whether that's a requirement for folks who produce
>> documents or for folks who consume documents.
>
> That is a requirement for IETF specifications of URI schemes.  It has
> nothing to do with processing.

Ah, I see.  That reading makes more sense.

>>  Later it says:
>>
>> "An implementation should accept uppercase letters as equivalent to
>> lowercase in scheme names"
>>
>> Leading me to believe the first requirement is for folks who produce
>> documents, assuming "implementation" above refers to document
>> consumers.
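
To make the consumer-side reading concrete, here is a minimal sketch of accepting uppercase scheme letters as equivalent to lowercase. The function name canonical_scheme is hypothetical; only the scheme grammar in the comment comes from RFC 3986 Section 3.1.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, Section 3.1)
_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*):")


def canonical_scheme(reference):
    """Return the lowercased scheme of a reference, or None if it has none.

    Hypothetical helper for illustration: a consumer that compares the
    result treats "HTTP:" and "http:" as the same scheme.
    """
    match = _SCHEME.match(reference)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    print(canonical_scheme("HTTP://example.com/"))
    print(canonical_scheme("//example.com/path"))
```

A relative reference such as "//example.com/path" has no scheme, so the sketch returns None rather than guessing one.
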
>
> RFC 3986 defines how to parse URIs (for recipients) and provides
> many rules for scheme-specific specs to define how to generate URIs
> of a given scheme (for producers) within the overall constraint of
> matching the URI syntax (the formal ABNF).
>
> A URI is the most constrained form of address for maximum
> interoperability across both machine and non-machine transports.
> It is like the postal addressing standard -- there exists one
> form that is known to be the most readable and efficient postal
> handling format of an address.  That does not prevent readers
> of an envelope from handling an unbounded number of additional
> addressing forms, with partial automation, and then relying
> on the postal carriers to interpret the nonstandard bits.
>
>> As I read the charter, we're not supposed to address issues in RFC
>> 3986, which might place this document out of scope depending on the
>> division of responsibilities between RFC 3986 and RFC 3987.
>
> Please understand that browsers almost never parse URI or IRI or
> anything in between.  Browsers have input strings that contain one
> or more references, usually in the document encoding, and so there
> is a sequence of context-specific and charset-specific and
> media-type-specific processing that occurs before you even get to
> the individual URI-reference or IRI-reference that are defined by
> 3986/3987.

Where are those rules defined (e.g., for HTML documents)?  I suspect
that's the layer that interests me at the moment.

> Some people have proposed that most of that pre-processing be added
> to the IRIbis spec, but I have seen no evidence to suggest that
> such pre-processing is even remotely standardizable (it seems to
> be different for every input context).  If you can demonstrate or
> get agreement on a single way to preprocess an input string, or at
> least a few named processes (like single-ref and multi-ref), then
> that would be useful.

It seems likely that this would be possible and valuable for at least
some widely used contexts (e.g., UTF-8-encoded HTML documents).

> It would have no effect on RFC 3986.  The only thing that would
> impact 3986 is if the allowed characters or major components
> changed in the wire syntax of the URI standard, which is simply
> not going to happen because that would break a majority of
> implementations (of which browsers make up less than 1%).
> As far as 3986 is concerned, your algorithm is in Appendix B.
> Note that the algorithm will work with any superset of ASCII.

I don't have an algorithm yet, but, according to my understanding of
your email, the algorithm in Appendix B appears to be a constraint on the
*output* of the media/context-specific transformation that interests
me.
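
For reference, a minimal sketch of what Appendix B amounts to. The regular expression is the one given in RFC 3986 Appendix B; the function name split_reference and the tuple layout are my own assumptions for illustration. Because the expression only looks at the ASCII delimiters ":", "/", "?" and "#", it also splits strings containing non-ASCII characters, which seems to be the "any superset of ASCII" point.

```python
import re

# Regular expression from RFC 3986, Appendix B.
# Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(reference):
    """Split a reference into (scheme, authority, path, query, fragment).

    Hypothetical helper for illustration. Components that are absent
    come back as None; the path is always a string, possibly empty.
    """
    match = _APPENDIX_B.match(reference)
    return (
        match.group(2),
        match.group(4),
        match.group(5),
        match.group(7),
        match.group(9),
    )


if __name__ == "__main__":
    # Example URI taken from RFC 3986, Appendix B.
    print(split_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

Note that this only splits the string into components. It does not check that each component uses the characters the ABNF allows, so it describes the shape the output of any context-specific pre-processing would need to have, not the pre-processing itself.
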

> IRI (3987) is more flexible because there are no wire implementations
> that depend on its constraints -- it could just as easily have
> been defined as an "any string" conversion/presentation process,
> which would have satisfied the scope you are looking for if there
> is sufficient agreement among implementations.

I didn't understand this paragraph, but I'm not sure it's essential to
our discussion.

Thanks,
Adam

Received on Thursday, 6 May 2010 00:32:58 UTC