Re: Floating Quotable Citations (FQC) from David Cuenca on 2013-02-22 (public-openannotation@w3.org from February 2013)

From: David Cuenca <dacuetu@gmail.com>
Date: Thu, 21 Feb 2013 23:25:44 -0500
To: Karin Verspoor <Karin.Verspoor@nicta.com.au>
Cc: Robert Sanderson <azaroth42@gmail.com>, "<public-openannotation@w3.org>" <public-openannotation@w3.org>
Message-ID: <CAJBSGSqyN9XLSA9W+MQtwHWS9wtbP7tZ8nw59N3gWt92oyMgeA@mail.gmail.com>
Maybe as a clarification I should add a possible workflow:
1) a citation is added to Wikipedia pointing to a specific text fragment of
a book (the book doesn't exist in digital format)
2) years pass (5? 10? 30? 100?), the book enters the public domain
3) the book is scanned and uploaded to Wikisource
4) some volunteers proofread the OCR, add formatting, split the text into
chapters and create the digital version of the book
5) the original citation now points to the newly created digital text and
it can be seen in its original context

I hope that explains why the extra effort is needed.

On Thu, Feb 21, 2013 at 11:12 PM, David Cuenca <dacuetu@gmail.com> wrote:

> Hi Karin,
> The problem we'd like to address is how to create a citation for a
> paper-based source and keep that citation unchanged for the digital version.
> In addition, it should be possible to: 1) reorganize the digital version at
> will without worrying about how the text is structured, 2) tolerate minor
> transcription errors.
> With all that in mind, character counting appears to me to be the worst
> possible solution, especially when creating the citation from the paper-based
> source...
> David
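
As a rough illustration of requirement 2) above (tolerating minor
transcription errors), here is a minimal sketch that locates a quote by
similarity instead of by exact match. It uses Python's standard difflib
module. The function name find_fuzzy, the fixed-size sliding window and the
0.8 threshold are assumptions made for this example and are not part of any
proposal discussed in this thread.

```python
from difflib import SequenceMatcher


def find_fuzzy(quote, text, min_ratio=0.8):
    """Return (start, end, ratio) for the window of len(quote) characters
    in text that is most similar to quote, or None if nothing reaches
    min_ratio. Brute force: every window position is compared."""
    n = len(quote)
    best = None
    for start in range(0, max(1, len(text) - n + 1)):
        window = text[start:start + n]
        ratio = SequenceMatcher(None, quote, window).ratio()
        if best is None or ratio > best[2]:
            best = (start, start + n, ratio)
    if best is not None and best[2] >= min_ratio:
        return best
    return None


# The printed book says "quick"; the OCR produced "qu1ck".
ocr_text = "the qu1ck brown fox jumps over the lazy dog"
hit = find_fuzzy("quick brown fox", ocr_text)
if hit is not None:
    print(ocr_text[hit[0]:hit[1]])
```

The window has the same length as the quote, so this only copes with
substituted characters; dropped or inserted characters would need a more
flexible alignment. It is also slow on long texts, which is the cost concern
raised elsewhere in this thread.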
>
>
> On Thu, Feb 21, 2013 at 10:58 PM, Karin Verspoor <
> Karin.Verspoor@nicta.com.au> wrote:
>
>> Hi David,
>>
>> I believe that computationally it is actually much simpler to count
>> characters and move to a specific location in a long text file using simple
>> offsets than it would be to (a) parse the text into paragraphs [using some
>> agreed algorithm for determining paragraph boundaries] (b) count the
>> paragraphs (c) move to the appropriate paragraph and (d) count characters
>> within the paragraph.
>>
>> So counting characters is both representationally simpler, and
>> computationally simpler, and I don't think it matters much whether the text
>> is short or long.
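
The two addressing schemes compared above can be sketched side by side. This
is only an illustration: the function names and the blank-line paragraph
delimiter are assumptions for the example, and agreeing on that delimiter is
exactly the step (a) mentioned above.

```python
def by_offset(text, start, end):
    """Plain character offsets into the whole text."""
    return text[start:end]


def by_paragraph(text, para_index, start, end, delimiter="\n\n"):
    """Paragraph index plus character offsets inside that paragraph.
    The delimiter is an assumption; real documents may mark paragraphs
    differently, which changes every paragraph index."""
    paragraphs = text.split(delimiter)
    return paragraphs[para_index][start:end]


doc = "First paragraph.\n\nSecond paragraph here."
print(by_offset(doc, 18, 24))
print(by_paragraph(doc, 1, 0, 6))
```

Both calls select the same six characters, but the second one only works if
producer and consumer split the text into paragraphs in the same way.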
>>
>> Karin
>>
>> On 22/02/2013, at 12:23 PM, David Cuenca wrote:
>>
>> Dear Karin and Rob,
>> Thanks a lot for your insights on semantic units. In extreme cases of
>> sentence splitting like the ones you point out there are only two ways out:
>> either enforcing strong rules about how to perform the division (adding
>> complexity) or rejecting them altogether (simplicity). In the first case
>> those rules might be too difficult to apply for the random person wanting
>> to cite a book; in the second case we are back to square one, with no way
>> of properly identifying a text fragment that was not created with that aim.
>>
>> I really like the Text Quote Selector, it is simple and avoids many of
>> the word counting/sentence splitting problems. However, for large bodies
>> of text it might be too slow, or the input text needed for an unambiguous
>> query might be too long. If a paragraph delimiter could be chosen to make
>> sure that the paragraph splitting is done in the right way, do you think
>> that using paragraph fingerprinting to define broad boundaries plus the
>> Text Quote Selector to look inside those boundaries could be a more
>> efficient combination?
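
A minimal sketch of the combination asked about above: narrow the search to
one paragraph first, then look for the quote inside it. The quote lookup
follows the exact/prefix/suffix idea of the Text Quote Selector. The
fingerprint function here (lowercase initials of the first few words) is a
purely hypothetical stand-in chosen for the example, as are all function
names and the blank-line delimiter.

```python
def fingerprint(paragraph, n_words=4):
    """Hypothetical fingerprint: lowercase initials of the first words."""
    return "".join(w[0].lower() for w in paragraph.split()[:n_words])


def find_quote(text, exact, prefix="", suffix=""):
    """Return (start, end) of the first 'exact' that is directly preceded
    by 'prefix' and followed by 'suffix', or None if there is no match."""
    pos = text.find(prefix + exact + suffix)
    if pos < 0:
        return None
    start = pos + len(prefix)
    return (start, start + len(exact))


def find_in_paragraph(document, para_fp, exact, prefix="", suffix="",
                      delimiter="\n\n"):
    """Search only paragraphs whose fingerprint equals para_fp and return
    offsets relative to the whole document."""
    offset = 0
    for paragraph in document.split(delimiter):
        if fingerprint(paragraph) == para_fp:
            hit = find_quote(paragraph, exact, prefix, suffix)
            if hit is not None:
                return (offset + hit[0], offset + hit[1])
        offset += len(paragraph) + len(delimiter)
    return None
```

With the paragraph chosen first, a short prefix can be enough to
disambiguate a quote that would otherwise match earlier in the document.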
>>
>> For Eastern languages it would be better to keep using the text selectors
>> that already exist, or to define another kind of text identification. The
>> link that Rob posted suggests to me, however, that it wouldn't be an easy
>> task...
>>
>> Glad to hear about XML DOM! If that really works it could be one less
>> headache :)
>>
>> On Thu, Feb 21, 2013 at 5:46 PM, Robert Sanderson <azaroth42@gmail.com> wrote:
>> >> However I'd encourage you to *not* try to do this as a URI Fragment,
>> >> as you would be competing with the official specifications of what a
>> >> Fragment component of HTML, plain text, XML and PDF resources means.
>> >> Within Media Wiki and other conforming implementations you can, of
>> >> course, use the query approach.
>> >
>> > The idea was that, if seen as appropriate, it could be converted into an
>> > official specification. That is for sure not up to me to decide, but it
>> > is a path that could be explored.
>>
>> Right, and I'm not the person wearing the decision making hat either,
>> but I would be *extremely* surprised if this were acceptable without a
>> major overhaul of the rules about URI fragments.  It only just works
>> in Media Fragments because there aren't any Fragments defined for the
>> existing media types.  I'm sure that Raphael could explain the
>> headaches there!
>>
>>
>> >> Some other issues off the top of my head:
>> >>
>> >> * It's hard to determine paragraphs, sentences and words.
>> >> -- Paragraphs could be <p>, or <div>, but they might not be.  Perhaps
>> >> just <br/><br/> is used to separate the paragraphs.
>> >> And that's just HTML, let alone other textual resources.
>> >> -- Sentences:  Mr. J. Smith of the U.S.A. took $1.45 from his pocket
>> >> ... and spent it.   1 sentence or 10?
>> >> -- Words: Word splitting is extremely hard in eastern languages.
>>
>> > In the first case it is a matter of having accurate definitions of what a
>> > paragraph means - I admit there are some loose ends there.
>>
>> Yes, but also detecting those paragraphs is necessary regardless of
>> any abstract definition of a paragraph.
>>
>> > The issue with
>> > sentences could be averted by defining a sentence as a group of at least 2
>> > words and handling numerals properly.
>>
>> See below for words, but even the above plus ellipsis handling would
>> generate:
>>
>> Mr. J.
>> Smith of the U.
>> S.A.
>> took $1.45 from his pocket ... and spent it.
>>
>> Unless you have a dictionary of "words", or don't allow sentence
>> breaks without spaces, which would generate:
>>
>> Mr. J.
>> Smith of the U.S.A.
>> took $1.45 from his pocket ... and spent it.
>>
>> And that's just "." in well written English.
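
The two splits shown above can be reproduced with a small sketch. The rules
are the ones named in the discussion: a sentence needs at least 2 words, a
period between digits is a numeral, and a run of periods is an ellipsis. The
require_space flag adds the "no sentence breaks without spaces" variant. The
function name and the way words are counted (runs of letters and digits) are
assumptions made for this example.

```python
import re


def split_sentences(text, require_space=False):
    """Split at '.' using only the simple rules discussed in the thread."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch != ".":
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if prev_ch == "." or next_ch == ".":
            continue  # part of an ellipsis
        if prev_ch.isdigit() and next_ch.isdigit():
            continue  # decimal point in a numeral such as 1.45
        if require_space and next_ch not in ("", " "):
            continue  # variant: only break when a space (or the end) follows
        candidate = text[start:i + 1].strip()
        if len(re.findall(r"\w+", candidate)) < 2:
            continue  # a sentence must contain at least 2 words
        sentences.append(candidate)
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = ("Mr. J. Smith of the U.S.A. took $1.45 from his pocket "
           "... and spent it.")
for sentence in split_sentences(example):
    print(sentence)
```

Neither variant yields the single sentence a human reader would see, which
is the point being made above.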
>>
>> > Nevertheless in an extreme situation
>> > like that it would be much more sensible to use word counting instead. What
>> > do you mean by word splitting in eastern languages? The concept of using the
>> > unit "word"?
>>
>> It's extremely complex to determine word boundaries in many Eastern
>> languages, as they don't naturally have spaces.
>> For example: http://www.mnogosearch.org/doc33/msearch-cjk.html
>> Or check out Ken Lunde's O'Reilly book: CJKV Information Processing.
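
A tiny illustration of the point about spaces: whitespace splitting, the
simplest way to count words, finds word boundaries in English but returns a
Chinese sentence as one unbroken token. The function name and both sample
sentences are made up for this example.

```python
def naive_words(text):
    """Split on whitespace only; no dictionary, no language model."""
    return text.split()


print(len(naive_words("I like reading books")))
print(len(naive_words("我喜欢读书")))  # roughly "I like reading books"
```

Any word-based citation scheme would therefore need a real segmenter for
such languages, and two segmenters may disagree on where the words are.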
>>
>>
>> >> * We stuck with character counting, but even then it's tricky with
>> >> normalization routines.  &amp;  -- 1 character or 5?
>> >> You have the same issue with length as well.
>> >
>> > My aim was only to address the topic of parsed text, would that be an issue
>> > in that case?
>>
>> So long as it's defined as to what normalizations have to happen (eg
>> XML DOM string comparison rules) we believe that it's okay. On the
>> other hand, after a conversation at the W3C ebook workshop last week,
>> that may be too strongly centered on web browsers as the user agent
>> and other systems may have a significant challenge.  I guess we'll see
>> with implementation experience :)
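
The "&amp;" question can be made concrete with a short sketch that compares
offsets before and after entity resolution. It uses Python's standard
html.unescape as a stand-in for whatever normalization a real user agent
applies; the function name and the sample string are assumptions for this
example.

```python
import html


def offsets(raw, needle):
    """Return the offset of needle in the raw markup text and in the
    text after character references have been resolved."""
    parsed = html.unescape(raw)
    return raw.index(needle), parsed.index(needle)


raw = "Tom &amp; Jerry"
print(len("&amp;"), len(html.unescape("&amp;")))
print(offsets(raw, "Jerry"))
```

The same word sits at two different offsets depending on which form of the
text is counted, so a character-based selector is only stable if both sides
agree on the normalization.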
>>
>> The major issue is counting language specific units such as paragraphs,
>> sentences and words.
>>
>> Rob
>>
>>
>>
>> --
>> Etiamsi omnes, ego non
>>
>>
>>
>
>
>
> --
> Etiamsi omnes, ego non
>



-- 
Etiamsi omnes, ego non
Received on Friday, 22 February 2013 04:26:31 UTC