Re: draft-wilde-text-fragment-01 (was: Including 'fragment identifier semantics' ...) from Chris Lilley on 2002-09-06 (uri@w3.org from September 2002)

From: Chris Lilley <chris@w3.org>
Date: Fri, 6 Sep 2002 22:33:15 +0200
To: gerald@w3.org, David Hopwood <david.hopwood@zetnet.co.uk>
CC: ietf-types@iana.org, uri@w3.org, www-tag@w3.org, net.dret@dret.net
Message-ID: <100120039500.20020906223315@w3.org>

On Thursday, September 5, 2002, 9:27:35 PM, David wrote:

DH> Dan Kohn wrote:
>> It would also be worth noting and/or commenting on this draft:
>> 
>> http://www.ietf.org/internet-drafts/draft-wilde-text-fragment-01

DH> Yuch. This is overcomplicated and not sufficiently useful to justify the
DH> risk of security flaws in parsing regular expressions. There's a good
DH> case for supporting a simple "#<line-number>" syntax for text/plain, but
DH> nothing more IMHO.

Depends what you want it for. For Ted Nelson-style standoff markup,
you would need character addressing and ranges, too. Matching on text
patterns is handy for making links more robust if the document is being
edited, and would be even more handy if combinations, such as "the
seven lines after 'Non-ASCII Characters in Regular Expressions'", were
possible. There are still problems, though, if a revision takes that
to eight lines, or a 'page break' intervenes as here.
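
To make the "lines after a match" combination concrete, here is a rough
sketch. It is not syntax from the draft: the function name
lines_after_match is hypothetical, and Python's re module stands in for
the POSIX regular expressions the draft refers to.

```python
import re


def lines_after_match(text: str, pattern: str, count: int) -> list:
    """Return up to `count` lines following the first line that matches `pattern`."""
    # Normalize CRLF and bare CR so that splitting does not depend on the platform.
    lines = re.sub(r"\r\n|\r", "\n", text).split("\n")
    for index, line in enumerate(lines):
        if re.search(pattern, line):
            return lines[index + 1 : index + 1 + count]
    return []  # no anchor line found, so the reference does not resolve


doc = "Intro\nNon-ASCII Characters in Regular Expressions\nfirst\nsecond\nthird\n"
print(lines_after_match(doc, re.escape("Non-ASCII Characters"), 2))
```

The fixed count is exactly the fragile part: if an edit grows the passage
by a line, the sketch silently returns one line too few.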

The draft does usefully distinguish between characters and bytes, and
counts the former, minus (normalized) line endings, and wisely stays
away from attempting to count words, sentences, or paragraphs (though
the latter would be tractable).
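
A small sketch of why the distinction matters, assuming "minus line
endings" means that CRLF, CR and LF are simply left out of the character
count (my reading, not a quotation from the draft). The function name
count_units is made up for illustration.

```python
import re


def count_units(data: bytes, encoding: str) -> dict:
    """Compare byte count, character count, and character count without line endings."""
    text = data.decode(encoding)
    # Drop every line ending, whichever of the three common forms it takes.
    without_endings = re.sub(r"\r\n|\r|\n", "", text)
    return {
        "bytes": len(data),
        "characters": len(text),
        "characters_without_line_endings": len(without_endings),
    }


# "na\u00efve" contains one non-ASCII character, which takes two bytes in UTF-8.
sample = "na\u00efve\r\ntext\n".encode("utf-8")
print(count_units(sample, "utf-8"))
```

For this sample the three numbers are 13, 12 and 9, so a byte offset, a
character offset and a line-ending-free character offset all land in
different places.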

There are some unresolved questions of detail. For example, what does
#char(6) point to in a text file containing "Hi World" encoded in
UTF-16: "o" or "W"? (Is the BOM two characters, or a
thing before the first character?)
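
The ambiguity can be sketched as follows. The three BOM policies and the
1-based counting are assumptions made for illustration; neither is taken
from the draft, and the names counted_units and nth_unit are hypothetical.

```python
BOM = "\ufeff"


def counted_units(data: bytes, bom_policy: str) -> list:
    """List the units a counter would step over under a given BOM policy."""
    # The utf-16-be codec keeps a leading U+FEFF instead of stripping it.
    text = data.decode("utf-16-be")
    has_bom = text.startswith(BOM)
    body = list(text[1:] if has_bom else text)
    if not has_bom or bom_policy == "skip":
        return body                      # BOM is "a thing before the first character"
    if bom_policy == "one":
        return [BOM] + body              # BOM counted as a single character
    if bom_policy == "two":
        return ["<FE>", "<FF>"] + body   # each BOM byte counted separately
    raise ValueError("unknown BOM policy: " + bom_policy)


def nth_unit(data: bytes, n: int, bom_policy: str) -> str:
    """Return the n-th counted unit, counting from 1 (an assumption)."""
    return counted_units(data, bom_policy)[n - 1]


sample = b"\xfe\xff" + "Hi World".encode("utf-16-be")
for policy in ("skip", "one", "two"):
    print(policy, repr(nth_unit(sample, 6, policy)))
```

Under these assumptions the sixth unit is "r" when the BOM is skipped,
"o" when it counts as one character, and "W" when it counts as two, which
is the kind of disagreement the draft would need to rule out.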

It also distinguishes between point and range references, which is
vital (particularly since fragments in HTML and, less commonly, in XML
are often incorrectly assumed to be point references).

It might be worth mentioning that the encoding of the text in a match
string might well be different to the encoding of the text document
containing the string to be matched.
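
A minimal illustration of the mismatch, assuming the resolver knows both
encodings and compares decoded characters rather than raw bytes. The
function name contains_match and the sample strings are invented for the
example.

```python
def contains_match(doc_bytes: bytes, doc_encoding: str,
                   pattern_bytes: bytes, pattern_encoding: str) -> bool:
    """Decode both sides to Unicode text before looking for the pattern."""
    document = doc_bytes.decode(doc_encoding)
    pattern = pattern_bytes.decode(pattern_encoding)
    return pattern in document


doc = "Un caf\u00e9 noir".encode("utf-8")      # document stored as UTF-8
pattern = "caf\u00e9".encode("iso-8859-1")     # match string arrives as ISO-8859-1

print(pattern in doc)                                        # raw byte comparison
print(contains_match(doc, "utf-8", pattern, "iso-8859-1"))   # character comparison
```

The raw byte comparison fails because the same character is one byte in
ISO-8859-1 and two in UTF-8, while the character comparison succeeds.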

The pointer to the (recently expired but still there)
http://www.ietf.org/internet-drafts/draft-borden-frag-00
was valuable.

text-scheme   =  ( char-scheme / line-scheme / regex-scheme )
should presumably be
text-scheme   =  ( char-scheme / line-scheme / match-scheme )

I thought the draft a little cautious regarding IRI although given the
content of
http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-09.txt
perhaps this is wise. I hope that
http://www.w3.org/International/2001/draft-masinter-url-i18n-08.txt
gets republished soon.

For non-normative reference to BRE, the following might be useful:

 An Introduction to Posix Regexps
 http://www.regexps.com/src/docs.d/hackerlab/html/introduction-to-regexps.html

 Regular Expressions
 http://www.opengroup.org/onlinepubs/7908799/xbd/re.html

 The latter includes a BRE grammar.

-- 
 Chris                            mailto:chris@w3.org

Received on Friday, 6 September 2002 16:33:38 UTC