- From: Tom Morris <tfmorris@gmail.com>
- Date: Fri, 22 Feb 2013 00:00:47 -0500
- To: David Cuenca <dacuetu@gmail.com>
- Cc: Karin Verspoor <Karin.Verspoor@nicta.com.au>, Robert Sanderson <azaroth42@gmail.com>, "<public-openannotation@w3.org>" <public-openannotation@w3.org>
- Message-ID: <CAE9vqEF9opSA8nfhNabMH2hUJboUr0JUkDaeYXg=MQfDGkN6Zg@mail.gmail.com>
On Thu, Feb 21, 2013 at 11:25 PM, David Cuenca <dacuetu@gmail.com> wrote:
> Maybe as a clarification I should add a possible workflow:
> 1) a citation is added to wikipedia pointing to a specific text fragment of a book (the book doesn't exist in digital format)
> 2) years pass (5? 10? 30? 100?), the book goes into the public domain
> 3) the book is scanned and uploaded to Wikisource
> 4) some volunteers proofread the OCR, add formatting, split the text into chapters and create the digital version of the book
> 5) the original citation now points to the newly created digital text and it can be seen in its original context
>
> I hope that explains why the extra effort is needed.

I think you're trying to do too much. The standard in the paper world is a page number or range of page numbers (of a specific edition, I might add). The page boundaries are preserved in the scanning process and can certainly be preserved in the OCR and post-processing. The citation might not be letter-accurate, but it will be as accurate as the original paper-world citation.

One way people tighten the citation up is to quote a passage. In that case, the quote can be matched against the newly OCR'd text.

You're reaching for a goal that may not be achievable. I'd suggest simplifying.

Tom

> On Thu, Feb 21, 2013 at 11:12 PM, David Cuenca <dacuetu@gmail.com> wrote:
>
>> Hi Karin,
>> The problem we'd like to address is: to create a citation for a paper-based source and keep that citation unchanged in the digital version. Besides, it should be possible to: 1) reorganize the digital version at will without worrying about structuring the text, and 2) allow minor transcription errors.
>> With all that in mind, character counting appears to me as the worst solution possible, especially when creating the citation for the paper-based source...
>> David
>>
>> On Thu, Feb 21, 2013 at 10:58 PM, Karin Verspoor <Karin.Verspoor@nicta.com.au> wrote:
>>
>>> Hi David,
>>>
>>> I believe that computationally it is actually much simpler to count characters and move to a specific location in a long text file using simple offsets than it would be to (a) parse the text into paragraphs [using some agreed algorithm for determining paragraph boundaries], (b) count the paragraphs, (c) move to the appropriate paragraph and (d) count characters within the paragraph.
>>>
>>> So counting characters is both representationally simpler and computationally simpler, and I don't think it matters much whether the text is short or long.
>>>
>>> Karin
>>>
>>> On 22/02/2013, at 12:23 PM, David Cuenca wrote:
>>>
>>> Dear Karin and Rob,
>>> Thanks a lot for your insights on semantic units. In extreme cases of sentence splitting like the ones you point out there are only two ways out: either enforcing strong rules about how to perform the division (adding complexity) or rejecting them altogether (simplicity). In the first case those rules might be too difficult to apply for the random person wanting to cite a book; in the second case we are back to square one, with no way of properly identifying a text fragment that was not created with that aim.
>>>
>>> I really like the Text Quote Selector; it is simple and avoids many of the word counting/sentence splitting problems. However, it might be that for large bodies of text it could be either too slow, or that the input text necessary for an unambiguous query would turn out too long. If a paragraph delimiter could be chosen to make sure that the paragraph splitting is done in the right way, do you think that using paragraph fingerprinting to define broad boundaries, plus the Text Quote Selector to look inside those boundaries, could be a more efficient combination?
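[A minimal sketch of the quote-matching approach being discussed, in Python. The helper name and the fuzzy-matching fallback are illustrative assumptions, not anything defined by the Open Annotation specification: it resolves a TextQuoteSelector-style anchor (exact quote plus optional prefix/suffix context) against a newly digitised text, tolerating small OCR differences.]

    import difflib

    def resolve_quote(text, exact, prefix="", suffix=""):
        """Find a TextQuoteSelector-style anchor in `text`.

        Tries an exact match of prefix+exact+suffix first, then falls back
        to an approximate match so minor OCR/transcription errors do not
        break the citation.  Returns (start, end) offsets of `exact`, or None.
        """
        needle = prefix + exact + suffix
        pos = text.find(needle)
        if pos != -1:
            return pos + len(prefix), pos + len(prefix) + len(exact)

        # Fuzzy fallback: slide a window of the same length over the text
        # and keep the closest match above a similarity threshold.
        matcher = difflib.SequenceMatcher(autojunk=False)
        matcher.set_seq2(needle)
        best_start, best_ratio = None, 0.90
        for i in range(0, max(1, len(text) - len(needle) + 1)):
            matcher.set_seq1(text[i:i + len(needle)])
            if matcher.quick_ratio() < best_ratio:
                continue                  # cheap upper bound, skip early
            ratio = matcher.ratio()
            if ratio > best_ratio:
                best_start, best_ratio = i, ratio
        if best_start is None:
            return None
        # Offsets are approximate here: OCR noise may shift the quote
        # slightly inside the matched window.
        return best_start + len(prefix), best_start + len(prefix) + len(exact)

    # Example: an anchor created against the paper edition, resolved years
    # later against the proofread Wikisource transcription.
    ocr_text = "...Mr. J. Smith of the U.S.A. took $1.45 from his pocket and spent it..."
    print(resolve_quote(ocr_text, exact="took $1.45", prefix="U.S.A. ", suffix=" from"))
    # -> (30, 40)

[The prefix/suffix context is what keeps the anchor unambiguous without quoting a long passage, which speaks to the worry above about over-long queries.]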
>>> For Eastern languages it would be better to keep using the text selectors that already exist, or to define another kind of text identification. The link that Rob posted suggests to me, however, that it wouldn't be an easy task...
>>>
>>> Glad to hear about XML DOM! If that really works, it could be one headache less :)
>>>
>>> On Thu, Feb 21, 2013 at 5:46 PM, Robert Sanderson <azaroth42@gmail.com> wrote:
>>>
>>> >> However I'd encourage you to *not* try to do this as a URI Fragment, as you would be competing with the official specifications of what a Fragment component of HTML, plain text, XML and PDF resources means. Within MediaWiki and other conforming implementations you can, of course, use the query approach.
>>> >
>>> > The idea was that, if seen appropriate, it could be converted into an official specification. That is for sure not up to me to decide, but it is a path that could be explored.
>>>
>>> Right, and I'm not the person wearing the decision-making hat either, but I would be *extremely* surprised if this were acceptable without a major overhaul of the rules about URI fragments. It only just works in Media Fragments because there aren't any fragments defined for the existing media types. I'm sure that Raphael could explain the headaches there!
>>>
>>> >> Some other issues off the top of my head:
>>> >>
>>> >> * It's hard to determine paragraphs, sentences and words.
>>> >> -- Paragraphs could be <p>, or <div>, but they might not be. Perhaps just <br/><br/> is used to separate the paragraphs. And that's just HTML, let alone other textual resources.
>>> >> -- Sentences: Mr. J. Smith of the U.S.A. took $1.45 from his pocket ... and spent it. 1 sentence or 10?
>>> >> -- Words: Word splitting is extremely hard in eastern languages.
>>>
>>> > In the first case it is a matter of having accurate definitions of what a paragraph means - I admit there are some loose ends there.
>>>
>>> Yes, but detecting those paragraphs is also necessary regardless of any abstract definition of a paragraph.
>>>
>>> > The issue with sentences could be averted by defining a sentence as a group of at least 2 words and handling numerals properly.
>>>
>>> See below for words, but even the above plus ellipsis handling would generate:
>>>
>>> Mr. J.
>>> Smith of the U.
>>> S.A.
>>> took $1.45 from his pocket ... and spent it.
>>>
>>> Unless you have a dictionary of "words", or don't allow sentence breaks without spaces, which would generate:
>>>
>>> Mr. J.
>>> Smith of the U.S.A.
>>> took $1.45 from his pocket ... and spent it.
>>>
>>> And that's just "." in well-written English.
>>>
>>> > Nevertheless, in an extreme situation like that it would be much more sensible to use word counting instead. What do you mean by word splitting in eastern languages? The concept of using the unit "word"?
>>>
>>> It's extremely complex to determine word boundaries in many Eastern languages, as they don't naturally have spaces.
>>> For example: http://www.mnogosearch.org/doc33/msearch-cjk.html
>>> Or check out Ken Lunde's O'Reilly book: CJKV Information Processing.
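[To make the sentence-splitting example above concrete: a small Python sketch of a naive splitter that breaks after a full stop followed by whitespace and applies the "at least 2 words" rule suggested in the thread; the remaining heuristics are assumptions for illustration. A last line shows whitespace tokenisation failing on unsegmented CJK text.]

    import re

    SAMPLE = "Mr. J. Smith of the U.S.A. took $1.45 from his pocket ... and spent it."

    def naive_sentences(text):
        """Break after every '.' followed by whitespace, then merge
        fragments until each has at least two words (the suggested rule)."""
        pieces = re.split(r'(?<=\.)\s+', text)
        sentences, pending = [], ""
        for piece in pieces:
            pending = (pending + " " + piece).strip() if pending else piece
            if len(pending.split()) >= 2:
                sentences.append(pending)
                pending = ""
        if pending:
            sentences.append(pending)
        return sentences

    print(naive_sentences(SAMPLE))
    # ['Mr. J.', 'Smith of the U.S.A.', 'took $1.45 from his pocket ...', 'and spent it.']
    # Abbreviations and the ellipsis still fool the splitter, as described above.

    # Whitespace-based word splitting gives no usable boundaries for CJK text:
    print("情報処理技術".split())   # ['情報処理技術'] -- one "word", no spaces to split on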
>>>
>>> >> * We stuck with character counting, but even then it's tricky with normalization routines. &amp; -- 1 character or 5? You have the same issue with length as well.
>>> >
>>> > My aim was only to address the topic of parsed text, would that be an issue in that case?
>>>
>>> So long as it's defined what normalizations have to happen (e.g. XML DOM string comparison rules), we believe that it's okay. On the other hand, after a conversation at the W3C ebook workshop last week, that may be too strongly centered on web browsers as the user agent, and other systems may have a significant challenge. I guess we'll see with implementation experience :)
>>>
>>> The major issue is counting language-specific units such as paragraphs, sentences and words.
>>>
>>> Rob
>>>
>>> --
>>> Etiamsi omnes, ego non
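[As a footnote to the &amp; normalization point above, a tiny Python sketch (the example string is invented) showing why it matters whether character offsets are counted against the raw markup or against the parsed, normalized text.]

    import html

    source = "Mr. Smith &amp; Co. spent $1.45"   # raw markup: the entity is 5 characters
    parsed = html.unescape(source)               # "Mr. Smith & Co. spent $1.45": 1 character

    print(source.index("spent"))   # 20 -- offset counted against the raw source
    print(parsed.index("spent"))   # 16 -- offset counted against the parsed text

    # A character-offset selector is only interoperable if everyone agrees on
    # which of these two strings (and which normalization rules) the offsets
    # refer to -- e.g. the XML DOM string comparison rules mentioned above.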
Received on Friday, 22 February 2013 05:01:17 UTC