
Re: Floating Quotable Citations (FQC)

From: David Cuenca <dacuetu@gmail.com>
Date: Thu, 21 Feb 2013 20:23:53 -0500
Message-ID: <CAJBSGSpWvW1AQKNWCtsWD2oh0mCR1dAxoBPDDZMRbtWp4Ooa7w@mail.gmail.com>
To: Robert Sanderson <azaroth42@gmail.com>
Cc: public-openannotation@w3.org
Dear Karin and Rob,
Thanks a lot for your insights on semantic units. In extreme cases of
sentence splitting like the ones you point out, there are only two ways out:
either enforcing strict rules about how to perform the division (adding
complexity) or rejecting them altogether (simplicity). In the first case
those rules might be too difficult to apply for the random person wanting
to cite a book; in the second case we are back to square one, with no way
of properly identifying a text fragment that was not created with that aim.

I really like the Text Quote Selector: it is simple and avoids many of the
word counting/sentence splitting problems. However, for large bodies of
text it might be too slow, or the input text needed for an unambiguous
query might become too long. If a paragraph delimiter could be chosen to
make sure that the paragraph splitting is done in the right way, do you
think that using paragraph fingerprinting to define broad boundaries, plus
the Text Quote Selector to look inside those boundaries, could be a more
efficient combination?
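
To make the question concrete, here is a minimal sketch of what I have in
mind. Everything in it is an assumption made for illustration: the
delimiter ("\n\n"), the fingerprint (a truncated SHA-1 of the
whitespace-normalized paragraph) and the function names are hypothetical,
not part of any specification. The exact/prefix/suffix arguments follow the
idea of the Text Quote Selector.

```python
import hashlib


def fingerprint(paragraph):
    """Hypothetical fingerprint: truncated SHA-1 of the normalized paragraph."""
    normalized = " ".join(paragraph.split())  # collapse runs of whitespace
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:8]


def find_quote(text, para_fp, exact, prefix="", suffix="", delimiter="\n\n"):
    """Return (start, end) offsets of `exact` in `text`, or None.

    Only the paragraph whose fingerprint equals `para_fp` is searched, so
    the prefix and suffix only need to be unambiguous inside that paragraph.
    """
    offset = 0
    for part in text.split(delimiter):
        if part.strip() and fingerprint(part) == para_fp:
            pos = part.find(prefix + exact + suffix)
            if pos != -1:
                start = offset + pos + len(prefix)
                return (start, start + len(exact))
        # Advance past this paragraph and the delimiter that followed it.
        offset += len(part) + len(delimiter)
    return None


text = "First paragraph about cats.\n\nSecond paragraph about cats and dogs."
fp = fingerprint("Second paragraph about cats and dogs.")
span = find_quote(text, fp, "cats", prefix="about ", suffix=" and")
```

The word "cats" appears in both paragraphs, but the fingerprint narrows the
search to the second one, so a short prefix and suffix are enough. It does
not solve the problem of agreeing on the delimiter in the first place.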

For Eastern languages it would be better to keep using the text selectors
that already exist, or to define another kind of text identification. The
link that Rob posted suggests to me, however, that it wouldn't be an easy

Glad to hear about XML DOM! If that really works it would be one less
headache :)

On Thu, Feb 21, 2013 at 5:46 PM, Robert Sanderson <azaroth42@gmail.com> wrote:

> >> However I'd encourage you to *not* try to do this as a URI Fragment,
> >> as you would be competing with the official specifications of what a
> >> Fragment component of HTML, plain text, XML and PDF resources means.
> >> Within Media Wiki and other conforming implementations you can, of
> >> course, use the query approach.
> >
> > The idea was that, if seen appropriate, it could be converted into an
> > official specification. That is for sure not up to me to decide, but it
> > is a path that could be explored.
> Right, and I'm not the person wearing the decision making hat either,
> but I would be *extremely* surprised if this were acceptable without a
> major overhaul of the rules about URI fragments.  It only just works
> in Media Fragments because there aren't any Fragments defined for the
> existing media types.  I'm sure that Raphael could explain the
> headaches there!
> >> Some other issues off the top of my head:
> >>
> >> * It's hard to determine paragraphs, sentences and words.
> >> -- Paragraphs could be <p>, or <div>, but they might not be.  Perhaps
> >> just <br/><br/> is used to separate the paragraphs.
> >> And that's just HTML, let alone other textual resources.
> >> -- Sentences:  Mr. J. Smith of the U.S.A. took $1.45 from his pocket
> >> ... and spent it.   1 sentence or 10?
> >> -- Words: Word splitting is extremely hard in eastern languages.
> > In the first case it is a matter of having accurate definitions of what a
> > paragraph means - I admit there are some loose ends there.
> Yes, but also detecting those paragraphs is necessary regardless of
> any abstract definition of a paragraph.
> > The issue with
> > sentences could be averted defining a sentence as a group of at least 2
> > words and handling numerals properly.
> See below for words, but even the above plus ellipsis handling would
> generate:
> Mr. J.
> Smith of the U.
> S.A.
> took $1.45 from his pocket ... and spent it.
> Unless you have a dictionary of "words", or don't allow sentence
> breaks without spaces, which would generate:
> Mr. J.
> Smith of the U.S.A.
> took $1.45 from his pocket ... and spent it.
> And that's just "." in well written English.
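
A rough sketch of the rule set that gives the second listing above: break
only after a period followed by whitespace, never after an ellipsis, and
merge pieces until each "sentence" has at least 2 words. The function name
and the regular expression are my own assumptions for illustration; this is
not a proposal for how the splitting should actually be specified.

```python
import re


def split_sentences(text, min_words=2):
    """Naive splitter: break after '.' + whitespace, except after '...'."""
    # (?<=\.)   the break must come right after a period
    # (?<!\.\.) ...but not if that period ends an ellipsis
    pieces = re.split(r"(?<!\.\.)(?<=\.)\s+", text.strip())
    sentences, buffer = [], []
    for piece in pieces:
        buffer.append(piece)
        # Emit only once the buffered pieces reach the minimum word count.
        if sum(len(p.split()) for p in buffer) >= min_words:
            sentences.append(" ".join(buffer))
            buffer = []
    if buffer:  # leftover piece that never reached min_words
        sentences.append(" ".join(buffer))
    return sentences


example = "Mr. J. Smith of the U.S.A. took $1.45 from his pocket ... and spent it."
result = split_sentences(example)
# result == ["Mr. J.",
#            "Smith of the U.S.A.",
#            "took $1.45 from his pocket ... and spent it."]
```

So even with those rules in place, one sentence still comes out as three.
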
> > Nevertheless in an extreme situation
> > like that it would be much more sensible to use word counting instead. What
> > do you mean by word splitting in eastern languages? The concept of using the
> > unit "word"?
> It's extremely complex to determine word boundaries in many Eastern
> languages, as they don't naturally have spaces.
> For example: http://www.mnogosearch.org/doc33/msearch-cjk.html
> Or check out Ken Lunde's O'Reilly book: CJKV Information Processing.
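
A small illustration of the point about spaces, assuming the most naive
definition of a "word" (whatever sits between whitespace). The sample
sentences and the helper name are mine, chosen only for this example.

```python
def whitespace_words(text):
    """Split on whitespace only, with no dictionary and no language model."""
    return text.split()


english = whitespace_words("I like reading books")
chinese = whitespace_words("我喜欢读书")  # roughly the same sentence in Chinese

english_count = len(english)
chinese_count = len(chinese)
```

The English sentence splits into 4 words, while the Chinese one comes back
as a single unbroken token, so word counting based on spaces alone breaks
down there.
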
> >> * We stuck with character counting, but even then it's tricky with
> >> normalization routines.  &amp;  -- 1 character or 5?
> >> You have the same issue with length as well.
> >
> > My aim was only to address the topic of parsed text, would that be an issue
> > in that case?
> So long as it's defined as to what normalizations have to happen (eg
> XML DOM string comparison rules) we believe that it's okay. On the
> other hand, after a conversation at the W3C ebook workshop last week,
> that may be too strongly centered on web browsers as the user agent
> and other systems may have a significant challenge.  I guess we'll see
> with implementation experience :)
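
To illustrate why the normalization has to be pinned down before counting
characters, here is a minimal sketch. It is not the XML DOM string
comparison rules mentioned above; it only assumes two steps picked for the
example (decoding HTML character references, then Unicode NFC), and the
function name is hypothetical.

```python
import html
import unicodedata


def normalized_length(raw, form="NFC"):
    """Length after decoding character references and Unicode normalization."""
    decoded = html.unescape(raw)  # "&amp;" becomes "&"
    return len(unicodedata.normalize(form, decoded))


raw_entity = "&amp;"
raw_length = len(raw_entity)                 # 5 characters as written in the source
parsed_length = normalized_length(raw_entity)  # 1 character once parsed

# "e" followed by a combining acute accent: 2 code points, 1 after NFC.
decomposed = "e\u0301"
decomposed_length = len(decomposed)
composed_length = normalized_length(decomposed)
```

Two agents that disagree on either step would compute different offsets and
lengths for the same selection.
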
> The major issue is counting language-specific units such as paragraphs,
> sentences and words.
> Rob

Etiamsi omnes, ego non
Received on Friday, 22 February 2013 01:24:42 UTC
