Re: Floating Quotable Citations (FQC)

On Fri, Feb 22, 2013 at 12:32 AM, David Cuenca <dacuetu@gmail.com> wrote:

> On Fri, Feb 22, 2013 at 12:00 AM, Tom Morris <tfmorris@gmail.com> wrote:
>
>>
>> I think you're trying to do too much.  The standard in the paper world is
>> a page number or range of page numbers (of a specific edition, I might
>> add).  The page boundaries are preserved in the scanning process and
>> certainly can be preserved in the OCR and post-processing.  The citation
>> might not be letter accurate, but it will be as accurate as the original
>> paper-world citation.
>>
>> One way people tighten the citation up is to quote a passage.  In that
>> case, the quote can be matched against the newly OCR'd text.
>>
>> You're reaching for a goal that may not be achievable.  I'd suggest
>> simplifying.
>>
>>
> Well, I suggested an ideal scenario that might be far-fetched. The
> immediate goal is to be able to generate quote queries.
> And I wish that page boundaries were always preserved, but if you take a
> look at the transcriptions by Project Gutenberg you will see that they
> aren't.
>

PG does all kinds of weird stuff.  They insisted on 7-bit ASCII for ages
after everyone else moved to ISO Latin-1.  They strip all edition
information, claiming that they are creating new editions (which means none
of the citations would be any good anyway since you can't match them up
with the correct edition).

If you look at the millions of PD books in the Internet Archive,
HathiTrust, Google Books, etc., you'll see that they certainly do include
page information. It's only the few thousand in the quirky Project
Gutenberg which don't (and even PG has that information at the beginning of
the process until they intentionally throw it away).
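
To make the quote-matching idea above concrete, here is a rough Python
sketch of looking up a quoted passage in page-segmented OCR text.  It is
only an illustration under stated assumptions: the function names, the
(page_number, text) input layout, and the 0.8 threshold are all
hypothetical, and difflib's similarity ratio is just one convenient
stand-in for whatever fuzzy matcher you would really use.

```python
import difflib
import re


def normalize(text):
    # Lowercase and keep only alphanumeric tokens, so punctuation and
    # line-break differences between the quote and the OCR are ignored.
    return re.findall(r"[a-z0-9]+", text.lower())


def find_quote_page(quote, pages, threshold=0.8):
    """Return (page_number, score) for the best match, or None.

    `pages` is assumed to be an iterable of (page_number, ocr_text)
    pairs; `threshold` is an arbitrary cut-off chosen for illustration.
    """
    q = normalize(quote)
    if not q:
        return None
    target = " ".join(q)
    best = None
    for page_number, page_text in pages:
        words = normalize(page_text)
        # Slide a window as long as the quote (in words) across the page
        # and score each window against the quote.
        for start in range(max(1, len(words) - len(q) + 1)):
            window = " ".join(words[start:start + len(q)])
            score = difflib.SequenceMatcher(None, target, window).ratio()
            if best is None or score > best[1]:
                best = (page_number, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


if __name__ == "__main__":
    pages = [
        (12, "It was the best of times, it was the worst of times"),
        (13, "Call me Ishmael. Some years ago, never mind how long precisely"),
    ]
    # The quote contains a typical OCR-style error ("1" for "l").
    print(find_quote_page("Call me Ishmae1, some years ago", pages))
```

This ignores quotes that straddle a page break, and it is quadratic-ish,
so it is nowhere near good enough for millions of books, but it shows why
a page number plus a short quote is usually enough to relocate a citation.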

Tom

Received on Friday, 22 February 2013 05:55:40 UTC