Re: Floating Quotable Citations (FQC) from Karin Verspoor on 2013-02-22 (public-openannotation@w3.org from February 2013)

From: Karin Verspoor <Karin.Verspoor@nicta.com.au>
Date: Fri, 22 Feb 2013 03:58:47 +0000
To: David Cuenca <dacuetu@gmail.com>
CC: Robert Sanderson <azaroth42@gmail.com>, "<public-openannotation@w3.org>" <public-openannotation@w3.org>
Message-ID: <07D38B031F98104F9F3F36E41EA63CF206190D47@atp-exchmbx2>

Hi David,

I believe that computationally it is actually much simpler to count characters and move to a specific location in a long text file using simple offsets than it would be to (a) parse the text into paragraphs [using some agreed algorithm for determining paragraph boundaries], (b) count the paragraphs, (c) move to the appropriate paragraph, and (d) count characters within the paragraph.

So counting characters is both representationally and computationally simpler, and I don't think it matters much whether the text is short or long.
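
As a rough sketch of the difference (Python; both function names and the blank-line paragraph rule are assumptions made up for illustration, not part of any specification):

```python
import re

def by_offset(text, start, end):
    # A single slice: no knowledge of document structure is needed.
    return text[start:end]

def by_paragraph(text, para_index, start, end):
    # Assumption: paragraphs are separated by one or more blank lines.
    # Any other agreed rule would need its own parsing step here.
    paragraphs = re.split(r"\n\s*\n", text)
    return paragraphs[para_index][start:end]

doc = "First paragraph here.\n\nSecond paragraph follows.\n\nThird one."

print(by_offset(doc, 23, 29))        # characters 23..28 of the whole text
print(by_paragraph(doc, 1, 0, 6))    # first 6 characters of paragraph 1
```

Both calls select the same word, but the second only works if every implementation splits paragraphs in exactly the same way.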

Karin

On 22/02/2013, at 12:23 PM, David Cuenca wrote:

Dear Karin and Rob,
Thanks a lot for your insights on semantic units. In extreme cases of sentence splitting like the ones you point out there are only two ways out: either enforcing strong rules about how to perform the division (adding complexity) or rejecting them altogether (simplicity). In the first case those rules might be too difficult to apply for the random person wanting to cite a book; in the second case we are back to square one, with no way of properly identifying a text fragment that was not created with that aim.

I really like the Text Quote Selector; it is simple and avoids many of the word counting/sentence splitting problems. However, for large bodies of text it could be either too slow, or the input text necessary for an unambiguous query could become too long. If a paragraph delimiter could be chosen to make sure that the paragraph splitting is done in the right way, do you think that using paragraph fingerprinting to define broad boundaries plus the Text Quote Selector to look inside those boundaries could be a more efficient combination?
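
To make the ambiguity concern concrete, here is a minimal Python sketch of quote matching with optional surrounding context. The function `find_quote` and its parameters are hypothetical names chosen for illustration; the sketch only assumes that a quote is identified by its exact text plus an optional prefix and suffix.

```python
def find_quote(text, exact, prefix="", suffix=""):
    """Return the start offsets where `exact` occurs with the given context."""
    hits = []
    start = text.find(exact)
    while start != -1:
        end = start + len(exact)
        # Keep the match only if the surrounding text agrees with the context.
        if text[:start].endswith(prefix) and text[end:].startswith(suffix):
            hits.append(start)
        start = text.find(exact, start + 1)
    return hits

text = "the cat sat on the mat; the cat ran off"

print(find_quote(text, "the cat"))                 # two candidate matches
print(find_quote(text, "the cat", suffix=" ran"))  # context leaves one match
```

The longer the text, the more context may be needed before only one match remains. Running the same search over a narrower slice of the text would reduce that, provided the slice boundaries can be found reliably.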

For Eastern languages it would be better to keep using the text selectors that already exist, or define another kind of text identification. The link that Rob posted suggests to me, however, that it wouldn't be an easy task...

Glad to hear about XML DOM! If that really works it could be one less headache :)

On Thu, Feb 21, 2013 at 5:46 PM, Robert Sanderson <azaroth42@gmail.com> wrote:
>> However I'd encourage you to *not* try to do this as a URI Fragment,
>> as you would be competing with the official specifications of what a
>> Fragment component of HTML, plain text, XML and PDF resources means.
>> Within Media Wiki and other conforming implementations you can, of
>> course, use the query approach.
>
> The idea was that, if deemed appropriate, it could be converted into an
> official specification. That is for sure not up to me to decide, but it is
> a path that could be explored.

Right, and I'm not the person wearing the decision making hat either,
but I would be *extremely* surprised if this were acceptable without a
major overhaul of the rules about URI fragments.  It only just works
in Media Fragments because there aren't any Fragments defined for the
existing media types.  I'm sure that Raphael could explain the
headaches there!


>> Some other issues off the top of my head:
>>
>> * It's hard to determine paragraphs, sentences and words.
>> -- Paragraphs could be <p>, or <div>, but they might not be.  Perhaps
>> just <br/><br/> is used to separate the paragraphs.
>> And that's just HTML, let alone other textual resources.
>> -- Sentences:  Mr. J. Smith of the U.S.A. took $1.45 from his pocket
>> ... and spent it.   1 sentence or 10?
>> -- Words: Word splitting is extremely hard in eastern languages.

> In the first case it is a matter of having accurate definitions of what a
> paragraph means - I admit there are some loose ends there.

Yes, but also detecting those paragraphs is necessary regardless of
any abstract definition of a paragraph.

> The issue with
> sentences could be averted by defining a sentence as a group of at least 2
> words and handling numerals properly.

See below for words, but even the above plus ellipsis handling would generate:

Mr. J.
Smith of the U.
S.A.
took $1.45 from his pocket ... and spent it.

Unless you have a dictionary of "words", or don't allow sentence
breaks without spaces, which would generate:

Mr. J.
Smith of the U.S.A.
took $1.45 from his pocket ... and spent it.

And that's just "." in well written English.
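
A rough Python sketch of that second rule set (break only after a period followed by a space, skip ellipses, require at least 2 words per sentence). The function name and the regular expression are assumptions for illustration, not a proposed algorithm:

```python
import re

def naive_sentences(text):
    # Break on whitespace that follows a period, unless that period
    # is the last dot of an ellipsis ("...").
    pieces = re.split(r"(?<=\.)(?<!\.\.\.)\s+", text)

    # Enforce "a sentence is a group of at least 2 words" by gluing
    # one-word pieces onto whatever follows them.
    sentences = []
    buffer = ""
    for piece in pieces:
        buffer = f"{buffer} {piece}".strip()
        if len(buffer.split()) >= 2:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

text = "Mr. J. Smith of the U.S.A. took $1.45 from his pocket ... and spent it."
for sentence in naive_sentences(text):
    print(sentence)
```

Even with the numeral, ellipsis and two-word rules in place, this still reports three sentences where a human reader sees one.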

> Nevertheless in an extreme situation
> like that it would be much more sensible to use word counting instead. What
> do you mean by word splitting in eastern languages? The concept of using the
> unit "word"?

It's extremely complex to determine word boundaries in many Eastern
languages, as they don't naturally have spaces.
For example: http://www.mnogosearch.org/doc33/msearch-cjk.html
Or check out Ken Lunde's O'Reilly book: CJKV Information Processing.
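
A tiny Python illustration of the problem, assuming the usual whitespace-based definition of a "word" (the sample sentences are made up for this example):

```python
english = "I like reading books"
chinese = "我喜欢读书"  # roughly the same sentence, written without spaces

# Whitespace splitting finds word boundaries in the English text...
print(english.split())

# ...but returns the whole Chinese sentence as a single "word".
print(chinese.split())
```

Anything better than this needs a dictionary or a statistical segmenter, and two implementations are unlikely to agree on every boundary.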


>> * We stuck with character counting, but even then it's tricky with
>> normalization routines.  &amp;  -- 1 character or 5?
>> You have the same issue with length as well.
>
> My aim was only to address the topic of parsed text, would that be an issue
> in that case?

So long as it's defined as to what normalizations have to happen (eg
XML DOM string comparison rules) we believe that it's okay. On the
other hand, after a conversation at the W3C ebook workshop last week,
that may be too strongly centered on web browsers as the user agent
and other systems may have a significant challenge.  I guess we'll see
with implementation experience :)
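
As a small illustration of why the normalization has to be pinned down, here is a Python sketch using the standard library's `html.unescape`. The sample string is invented for this example, and entity decoding stands in for whatever normalization is eventually agreed:

```python
import html

source = "Q&amp;A about offsets"     # text as it appears in the markup
parsed = html.unescape(source)       # text after entity decoding

# The same entity is 5 characters before decoding and 1 after.
print(len("&amp;"), len(html.unescape("&amp;")))

# So the offset of the same word depends on which form you count in.
print(source.index("about"), parsed.index("about"))
```

An offset recorded against one form points at the wrong characters in the other, so both sides have to count in the same form.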

The major issue is counting language-specific units such as paragraphs,
sentences and words.

Rob



--
Etiamsi omnes, ego non


Received on Friday, 22 February 2013 04:01:42 UTC