Re: Floating Quotable Citations (FQC) from Robert Sanderson on 2013-02-21 (public-openannotation@w3.org from February 2013)

From: Robert Sanderson <azaroth42@gmail.com>
Date: Thu, 21 Feb 2013 15:46:32 -0700
To: David Cuenca <dacuetu@gmail.com>
Cc: public-openannotation@w3.org
Message-ID: <CABevsUHkokGA3Ga1WoRorQeWXNhaZoe6HVu52VkFUr7pSC19xw@mail.gmail.com>

>> However I'd encourage you to *not* try to do this as a URI Fragment,
>> as you would be competing with the official specifications of what a
>> Fragment component of HTML, plain text, XML and PDF resources means.
>> Within Media Wiki and other conforming implementations you can, of
>> course, use the query approach.
>
> The idea was that, if deemed appropriate, it could be converted into an
> official specification. That is for sure not up to me to decide, but it is
> a path that could be explored.

Right, and I'm not the person wearing the decision-making hat either,
but I would be *extremely* surprised if this were acceptable without a
major overhaul of the rules about URI fragments.  It only just works
in Media Fragments because there aren't any Fragments defined for the
existing media types.  I'm sure that Raphael could explain the
headaches there!


>> Some other issues off the top of my head:
>>
>> * It's hard to determine paragraphs, sentences and words.
>> -- Paragraphs could be <p>, or <div>, but they might not be.  Perhaps
>> just <br/><br/> is used to separate the paragraphs.
>> And that's just HTML, let alone other textual resources.
>> -- Sentences:  Mr. J. Smith of the U.S.A. took $1.45 from his pocket
>> ... and spent it.   1 sentence or 10?
>> -- Words: Word splitting is extremely hard in eastern languages.

> In the first case it is a matter of having accurate definitions of what a
> paragraph means - I admit there are some loose ends there.

Yes, but also detecting those paragraphs is necessary regardless of
any abstract definition of a paragraph.
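
To illustrate the detection problem, here is a minimal Python sketch.
The two HTML snippets and both helper functions are made up for
illustration and are not part of any specification or proposal. Each
rule finds the two paragraphs in one snippet and misses them in the
other.

```python
import re

# Two hypothetical snippets; both are meant to read as two paragraphs.
WITH_P = "<p>First paragraph.</p><p>Second paragraph.</p>"
WITH_BR = "First paragraph.<br/><br/>Second paragraph."

def count_by_p(markup):
    """Count paragraphs as the number of opening <p> tags."""
    return len(re.findall(r"<p\b", markup, flags=re.IGNORECASE))

def count_by_double_br(markup):
    """Count paragraphs as the chunks separated by two or more <br> tags."""
    chunks = re.split(r"(?:<br\s*/?>\s*){2,}", markup, flags=re.IGNORECASE)
    return len([c for c in chunks if c.strip()])

for name, markup in (("WITH_P", WITH_P), ("WITH_BR", WITH_BR)):
    print(name, count_by_p(markup), count_by_double_br(markup))
```

A paragraph index computed with one rule would therefore not be
portable to a document marked up the other way.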

> The issue with
> sentences could be averted by defining a sentence as a group of at least 2
> words and handling numerals properly.

See below for words, but even the above plus ellipsis handling would generate:

Mr. J.
Smith of the U.
S.A.
took $1.45 from his pocket ... and spent it.

Unless you have a dictionary of "words", or don't allow sentence
breaks without spaces, which would generate:

Mr. J.
Smith of the U.S.A.
took $1.45 from his pocket ... and spent it.

And that's just "." in well-written English.
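
The two splits above can be reproduced with a short Python sketch. It
is only one possible reading of the proposed rule: the function names,
the definition of a "word" as a run of \w characters, and the
require_space switch are all assumptions made for this illustration.

```python
import re

TEXT = "Mr. J. Smith of the U.S.A. took $1.45 from his pocket ... and spent it."

def is_break(text, i, require_space):
    """Decide whether the "." at index i may end a sentence."""
    prev = text[i - 1] if i > 0 else ""
    nxt = text[i + 1] if i + 1 < len(text) else ""
    if prev == "." or nxt == ".":
        return False  # part of an ellipsis
    if prev.isdigit() and nxt.isdigit():
        return False  # decimal point, as in 1.45
    if require_space and nxt and not nxt.isspace():
        return False  # optional rule: no break without a following space
    return True

def split_sentences(text, require_space=False):
    """Split at "." so that every sentence has at least 2 words."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch != "." or not is_break(text, i, require_space):
            continue
        candidate = text[start:i + 1]
        # Assumption: a "word" is any run of \w characters.
        if len(re.findall(r"\w+", candidate)) >= 2:
            sentences.append(candidate.strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

for require_space in (False, True):
    print("\n".join(split_sentences(TEXT, require_space)))
    print()
```

Neither variant returns the single sentence a human reader sees, and
both depend on choices (what a word is, what whitespace is) that a
specification would have to pin down.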

> Nevertheless in an extreme situation
> like that it would be much more sensible to use word counting instead. What
> do you mean by word splitting in eastern languages? The concept of using the
> unit "word"?

It's extremely complex to determine word boundaries in many Eastern
languages, as they aren't naturally written with spaces between words.
For example: http://www.mnogosearch.org/doc33/msearch-cjk.html
Or check out Ken Lunde's O'Reilly book: CJKV Information Processing.
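
As a small illustration in Python (the sample sentences and helper
names are invented for this sketch), whitespace splitting works
tolerably for English but returns the whole Chinese sentence as one
"word", while counting characters gives a different number again.
Neither is a word count a reader of Chinese would recognise.

```python
ENGLISH = "I like reading books"
CHINESE = "我喜欢读书"  # roughly "I like reading books", written without spaces

def words_by_whitespace(text):
    """Naive word splitting: break on runs of whitespace."""
    return text.split()

def words_by_character(text):
    """Equally naive alternative: treat every character as a word."""
    return [ch for ch in text if not ch.isspace()]

print(len(words_by_whitespace(ENGLISH)))  # four space-separated tokens
print(len(words_by_whitespace(CHINESE)))  # the whole sentence is one token
print(len(words_by_character(CHINESE)))   # one token per character
```

Getting real word boundaries needs a dictionary or a statistical
segmenter, and two such tools need not agree with each other, which
makes "word N" a fragile address.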


>> * We stuck with character counting, but even then it's tricky with
>> normalization routines.  &amp;  -- 1 character or 5?
>> You have the same issue with length as well.
>
> My aim was only to address the topic of parsed text, would that be an issue
> in that case?

So long as it's defined which normalizations have to happen (e.g.
XML DOM string comparison rules) we believe that it's okay. On the
other hand, after a conversation at the W3C ebook workshop last week,
that may be too strongly centered on web browsers as the user agent
and other systems may have a significant challenge.  I guess we'll see
with implementation experience :)
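
A rough Python sketch of why the normalization has to be fixed before
counting: the sample strings are invented, and html.unescape and
Unicode NFC/NFD are used here only as stand-ins for whatever rules a
specification would actually require.

```python
import html
import unicodedata

RAW = "Tom &amp; Jerry"          # markup as stored in the source
PARSED = html.unescape(RAW)      # "Tom & Jerry", as a parser would expose it

# The same entity counts as 5 characters in the source and 1 after parsing,
# so the offset of everything after it shifts by 4.
raw_offset = RAW.index("Jerry")
parsed_offset = PARSED.index("Jerry")
print(len(RAW), len(PARSED), raw_offset, parsed_offset)

# Unicode normalization changes lengths as well: a precomposed e-acute is
# one code point, while its decomposed form is "e" plus a combining accent.
COMPOSED = unicodedata.normalize("NFC", "caf\u00e9")
DECOMPOSED = unicodedata.normalize("NFD", COMPOSED)
print(len(COMPOSED), len(DECOMPOSED))
```

Two agents that disagree on either step will compute different
offsets and lengths for the same selection.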

The major issue is counting language-specific units such as paragraphs,
sentences and words.

Rob

Received on Thursday, 21 February 2013 22:47:00 UTC