
Re: Change definition of URL to normatively reference IRI specification using a well-defined interface

From: Mark Davis ☕ <mark@macchiato.com>
Date: Fri, 9 Apr 2010 10:41:37 -0700
Message-ID: <q2o30b660a21004091041t4b3ad4a7k444adfaf81c7acd0@mail.gmail.com>
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Ian Hickson <ian@hixie.ch>, Ted Hardie <ted.ietf@gmail.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, Maciej Stachowiak <mjs@apple.com>, Larry Masinter <LMM@acm.org>, Marc Blanchet <Marc.Blanchet@viagenie.ca>, Sam Ruby <rubys@intertwingly.net>, Paul Cotton <Paul.Cotton@microsoft.com>, Martin Duerst <duerst@w3.org>, Michel SUIGNARD <Michel@suignard.com>, public-html <public-html@w3.org>, "public-iri@w3.org" <public-iri@w3.org>
When you actually implement it, there are a few different kinds of
APIs that you would use, such as:

end = lookingAt(string, startPosition);
if there is an IRI starting at startPosition, return the end of it -
otherwise return an error.

<start, end> = scan(string, startPosition);
find the first instance of an IRI in a string at or after startPosition,
returning where it starts and ends.

The key is that if the Issue#1 specification can return the first error
point (as I outlined in the message), then one can design and write fast
code for the above (or other kinds of APIs). The reference code for
*testing* lookingAt would implement the algorithm in Issue#1 (as amended).
The reference code for *testing* scan would just call lookingAt in a loop,
starting at startPosition, returning if something is found, and otherwise
going to the next character. This would just be reference code; production
code can be much faster.
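
To make the shape of the two calls concrete, here is a minimal sketch in
Python. It is only an illustration: the regular expression is a hypothetical
stand-in for the Issue#1 algorithm (it is not that algorithm and accepts far
less than a real IRI parser would), the names looking_at and scan are my
spelling of the two APIs above, and None stands in for "return an error".

```python
import re
from typing import Optional, Tuple

# ASSUMPTION: a deliberately crude stand-in for the Issue#1 parse. It accepts
# a scheme, "://", and then any run of characters that are not whitespace,
# "<", ">" or a double quote. It is NOT the real IRI grammar.
_IRI_PREFIX = re.compile(r'[A-Za-z][A-Za-z0-9+.\-]*://[^\s<>"]+')


def looking_at(text: str, start: int) -> Optional[int]:
    """Return the end offset of the IRI beginning exactly at `start`,
    or None if no IRI begins there."""
    match = _IRI_PREFIX.match(text, start)
    return match.end() if match else None


def scan(text: str, start: int = 0) -> Optional[Tuple[int, int]]:
    """Reference-style scan: try looking_at at each position in turn and
    return (start, end) for the first hit at or after `start`."""
    for pos in range(start, len(text)):
        end = looking_at(text, pos)
        if end is not None:
            return pos, end
    return None


if __name__ == "__main__":
    sample = "see http://google.com/ /x"
    print(looking_at(sample, 0))  # no IRI begins at offset 0
    print(scan(sample))           # first IRI found in the text
```

The loop in scan is quadratic in the worst case, which is fine for checking
a faster implementation against but not something to ship.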


— Il meglio è l’inimico del bene —

On Fri, Apr 9, 2010 at 10:12, Julian Reschke <julian.reschke@gmx.de> wrote:

> On 09.04.2010 18:54, Mark Davis ☕ wrote:
>>  For Issue #1, I like the formulation. However, I'd like to see one
>> more piece of information (logically) returned: if the parse could not
>> continue to the end, then what was the last character successfully parsed.
>> That is, in "http://google.com/<space>/", it would return the offset
>> between the "m" and the space.
>> So why do this? It is because a very common problem is to find an IRI in
>> plain text, where the end is not known. This needs to be done in email,
>> word processors, HTML editors, and a host of other products. By having
>> an explicit specification that lets us know what the last character is,
>> one can then (logically) call the function again to determine whether
>> the segment up to the error point is a valid IRI.
> Hmm. Not convinced.
> 1) If you want to parse IRIs out of content, wouldn't you also need to
> consider *leading* non IRI characters?
> 2) What's wrong with just adding up the individual segments (plus
> delimiters)?
>  ...
> Best regards, Julian
Received on Friday, 9 April 2010 17:42:12 UTC
