- From: Tom Morris <tfmorris@gmail.com>
- Date: Fri, 22 Feb 2013 00:00:47 -0500
- To: David Cuenca <dacuetu@gmail.com>
- Cc: Karin Verspoor <Karin.Verspoor@nicta.com.au>, Robert Sanderson <azaroth42@gmail.com>, "<public-openannotation@w3.org>" <public-openannotation@w3.org>
- Message-ID: <CAE9vqEF9opSA8nfhNabMH2hUJboUr0JUkDaeYXg=MQfDGkN6Zg@mail.gmail.com>
On Thu, Feb 21, 2013 at 11:25 PM, David Cuenca <dacuetu@gmail.com> wrote:
> Maybe as a clarification I should add a possible workflow:
> 1) a citation is added to wikipedia pointing to a specific text fragment of a book (the book doesn't exist in digital format)
> 2) years pass (5? 10? 30? 100?), the book goes into the public domain
> 3) the book is scanned and uploaded to Wikisource
> 4) some volunteers proofread the OCR, add formatting, split the text into chapters and create the digital version of the book
> 5) the original citation now points to the newly created digital text and it can be seen in its original context
>
> I hope that explains why the extra effort is needed.

I think you're trying to do too much. The standard in the paper world is a page number or range of page numbers (of a specific edition, I might add). The page boundaries are preserved in the scanning process and can certainly be preserved in the OCR and post-processing. The citation might not be letter-accurate, but it will be as accurate as the original paper-world citation.

One way people tighten the citation up is to quote a passage. In that case, the quote can be matched against the newly OCR'd text.

You're reaching for a goal that may not be achievable. I'd suggest simplifying.

Tom

> On Thu, Feb 21, 2013 at 11:12 PM, David Cuenca <dacuetu@gmail.com> wrote:
>
>> Hi Karin,
>> The problem we'd like to address is: to create a citation for a paper-based source and keep that citation unchanged in the digital version. Besides, it should be possible to: 1) reorganize the digital version at will without worrying about structuring the text, and 2) allow minor transcription errors.
>> With all that in mind, character counting appears to me as the worst solution possible, especially when creating the citation for the paper-based source...
>> David
>>
>> On Thu, Feb 21, 2013 at 10:58 PM, Karin Verspoor <Karin.Verspoor@nicta.com.au> wrote:
>>
>>> Hi David,
>>>
>>> I believe that computationally it is actually much simpler to count characters and move to a specific location in a long text file using simple offsets than it would be to (a) parse the text into paragraphs [using some agreed algorithm for determining paragraph boundaries], (b) count the paragraphs, (c) move to the appropriate paragraph and (d) count characters within the paragraph.
>>>
>>> So counting characters is both representationally simpler and computationally simpler, and I don't think it matters much whether the text is short or long.
>>>
>>> Karin
>>>
>>> On 22/02/2013, at 12:23 PM, David Cuenca wrote:
>>>
>>> Dear Karin and Rob,
>>> Thanks a lot for your insights on semantic units. In extreme cases of sentence splitting like the ones you point out there are only two ways out: either enforcing strong rules about how to perform the division (adding complexity) or rejecting them altogether (simplicity). In the first case those rules might be too difficult to apply for the random person wanting to cite a book; in the second case we are back to square one, with no way of properly identifying a text fragment that was not created with that aim.
>>>
>>> I really like the Text Quote Selector; it is simple and avoids many of the word counting/sentence splitting problems. However, it might be that for large bodies of text it could be either too slow, or that the input text necessary for an unambiguous query would turn out too long. If a paragraph delimiter could be chosen to make sure that the paragraph splitting is done in the right way, do you think that using paragraph fingerprinting to define broad boundaries, plus the Text Quote Selector to look inside those boundaries, could be a more efficient combination?
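[A minimal sketch of the quote-matching approach being discussed, in Python. The helper name and the fuzzy-matching fallback are illustrative assumptions, not anything defined by the Open Annotation specification: it resolves a TextQuoteSelector-style anchor (exact quote plus optional prefix/suffix context) against a newly digitised text, tolerating small OCR differences.]

    import difflib

    def resolve_quote(text, exact, prefix="", suffix=""):
        """Find a TextQuoteSelector-style anchor in `text`.

        Tries an exact match of prefix+exact+suffix first, then falls back
        to an approximate match so minor OCR/transcription errors do not
        break the citation.  Returns (start, end) offsets of `exact`, or None.
        """
        needle = prefix + exact + suffix
        pos = text.find(needle)
        if pos != -1:
            return pos + len(prefix), pos + len(prefix) + len(exact)

        # Fuzzy fallback: slide a window of the same length over the text
        # and keep the closest match above a similarity threshold.
        matcher = difflib.SequenceMatcher(autojunk=False)
        matcher.set_seq2(needle)
        best_start, best_ratio = None, 0.90
        for i in range(0, max(1, len(text) - len(needle) + 1)):
            matcher.set_seq1(text[i:i + len(needle)])
            if matcher.quick_ratio() < best_ratio:
                continue                  # cheap upper bound, skip early
            ratio = matcher.ratio()
            if ratio > best_ratio:
                best_start, best_ratio = i, ratio
        if best_start is None:
            return None
        # Offsets are approximate here: OCR noise may shift the quote
        # slightly inside the matched window.
        return best_start + len(prefix), best_start + len(prefix) + len(exact)

    # Example: an anchor created against the paper edition, resolved years
    # later against the proofread Wikisource transcription.
    ocr_text = "...Mr. J. Smith of the U.S.A. took $1.45 from his pocket and spent it..."
    print(resolve_quote(ocr_text, exact="took $1.45", prefix="U.S.A. ", suffix=" from"))
    # -> (30, 40)

[The prefix/suffix context is what keeps the anchor unambiguous without quoting a long passage, which speaks to the worry above about over-long queries.]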
>>> For Eastern languages it would be better to keep using the text selectors that already exist, or to define another kind of text identification. The link that Rob posted suggests to me, however, that it wouldn't be an easy task...
>>>
>>> Glad to hear about XML DOM! If that really works, it could be one headache less :)
>>>
>>> On Thu, Feb 21, 2013 at 5:46 PM, Robert Sanderson <azaroth42@gmail.com> wrote:
>>>
>>> >> However I'd encourage you to *not* try to do this as a URI Fragment, as you would be competing with the official specifications of what a Fragment component of HTML, plain text, XML and PDF resources means. Within MediaWiki and other conforming implementations you can, of course, use the query approach.
>>> >
>>> > The idea was that, if seen appropriate, it could be converted into an official specification. That is for sure not up to me to decide, but it is a path that could be explored.
>>>
>>> Right, and I'm not the person wearing the decision-making hat either, but I would be *extremely* surprised if this were acceptable without a major overhaul of the rules about URI fragments. It only just works in Media Fragments because there aren't any fragments defined for the existing media types. I'm sure that Raphael could explain the headaches there!
>>>
>>> >> Some other issues off the top of my head:
>>> >>
>>> >> * It's hard to determine paragraphs, sentences and words.
>>> >> -- Paragraphs could be <p>, or <div>, but they might not be. Perhaps just <br/><br/> is used to separate the paragraphs. And that's just HTML, let alone other textual resources.
>>> >> -- Sentences: Mr. J. Smith of the U.S.A. took $1.45 from his pocket ... and spent it. 1 sentence or 10?
>>> >> -- Words: Word splitting is extremely hard in eastern languages.
>>>
>>> > In the first case it is a matter of having accurate definitions of what a paragraph means - I admit there are some loose ends there.
>>>
>>> Yes, but detecting those paragraphs is also necessary regardless of any abstract definition of a paragraph.
>>>
>>> > The issue with sentences could be averted by defining a sentence as a group of at least 2 words and handling numerals properly.
>>>
>>> See below for words, but even the above plus ellipsis handling would generate:
>>>
>>> Mr. J.
>>> Smith of the U.
>>> S.A.
>>> took $1.45 from his pocket ... and spent it.
>>>
>>> Unless you have a dictionary of "words", or don't allow sentence breaks without spaces, which would generate:
>>>
>>> Mr. J.
>>> Smith of the U.S.A.
>>> took $1.45 from his pocket ... and spent it.
>>>
>>> And that's just "." in well-written English.
>>>
>>> > Nevertheless, in an extreme situation like that it would be much more sensible to use word counting instead. What do you mean by word splitting in eastern languages? The concept of using the unit "word"?
>>>
>>> It's extremely complex to determine word boundaries in many Eastern languages, as they don't naturally have spaces.
>>> For example: http://www.mnogosearch.org/doc33/msearch-cjk.html
>>> Or check out Ken Lunde's O'Reilly book: CJKV Information Processing.
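[To make the sentence-splitting example above concrete: a small Python sketch of a naive splitter that breaks after a full stop followed by whitespace and applies the "at least 2 words" rule suggested in the thread; the remaining heuristics are assumptions for illustration. A last line shows whitespace tokenisation failing on unsegmented CJK text.]

    import re

    SAMPLE = "Mr. J. Smith of the U.S.A. took $1.45 from his pocket ... and spent it."

    def naive_sentences(text):
        """Break after every '.' followed by whitespace, then merge
        fragments until each has at least two words (the suggested rule)."""
        pieces = re.split(r'(?<=\.)\s+', text)
        sentences, pending = [], ""
        for piece in pieces:
            pending = (pending + " " + piece).strip() if pending else piece
            if len(pending.split()) >= 2:
                sentences.append(pending)
                pending = ""
        if pending:
            sentences.append(pending)
        return sentences

    print(naive_sentences(SAMPLE))
    # ['Mr. J.', 'Smith of the U.S.A.', 'took $1.45 from his pocket ...', 'and spent it.']
    # Abbreviations and the ellipsis still fool the splitter, as described above.

    # Whitespace-based word splitting gives no usable boundaries for CJK text:
    print("情報処理技術".split())   # ['情報処理技術'] -- one "word", no spaces to split on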
>>>
>>> >> * We stuck with character counting, but even then it's tricky with normalization routines. &amp; -- 1 character or 5? You have the same issue with length as well.
>>> >
>>> > My aim was only to address the topic of parsed text, would that be an issue in that case?
>>>
>>> So long as it's defined what normalizations have to happen (e.g. XML DOM string comparison rules), we believe that it's okay. On the other hand, after a conversation at the W3C ebook workshop last week, that may be too strongly centered on web browsers as the user agent, and other systems may have a significant challenge. I guess we'll see with implementation experience :)
>>>
>>> The major issue is counting language-specific units such as paragraphs, sentences and words.
>>>
>>> Rob
>>>
>>> --
>>> Etiamsi omnes, ego non
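[As a footnote to the &amp; normalization point above, a tiny Python sketch (the example string is invented) showing why it matters whether character offsets are counted against the raw markup or against the parsed, normalized text.]

    import html

    source = "Mr. Smith &amp; Co. spent $1.45"   # raw markup: the entity is 5 characters
    parsed = html.unescape(source)               # "Mr. Smith & Co. spent $1.45": 1 character

    print(source.index("spent"))   # 20 -- offset counted against the raw source
    print(parsed.index("spent"))   # 16 -- offset counted against the parsed text

    # A character-offset selector is only interoperable if everyone agrees on
    # which of these two strings (and which normalization rules) the offsets
    # refer to -- e.g. the XML DOM string comparison rules mentioned above.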
Received on Friday, 22 February 2013 05:01:17 UTC