On Fri, Feb 22, 2013 at 12:00 AM, Tom Morris <tfmorris@gmail.com> wrote:
> I think you're trying to do too much. The standard in the paper world is
> a page number or range of page numbers (of a specific edition, I might
> add). The page boundaries are preserved in the scanning process and
> certainly can be preserved in the OCR and post-processing. The citation
> might not be letter accurate, but it will be as accurate as the original
> paper-world citation.
> One way people tighten the citation up is to quote a passage. In that
> case, the quote can be matched against the newly OCR'd text.
> You're reaching for a goal that may not be achievable. I'd suggest
> simplifying.
Well, I suggested an ideal scenario that might be far-fetched. The
immediate goal is to be able to generate quote queries.
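
To make the quote-query idea concrete, here is a minimal sketch in Python of matching a quoted passage against OCR'd text and reporting the page it most likely came from. It is only an illustration under assumptions: the function names, the page-number-to-text dictionary, and the 0.8 threshold are all made up for the example, and the fuzzy comparison uses the standard library's difflib rather than anything a real digitization pipeline is known to use.

```python
import difflib
import re


def normalize(text):
    """Lowercase and reduce every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def locate_quote(quote, pages, threshold=0.8):
    """Return (page_number, score) for the best match, or None.

    pages is a hypothetical dict mapping page number -> OCR text of that
    page. threshold is an assumed cut-off, not a tuned value.
    """
    quote_words = normalize(quote).split()
    if not quote_words:
        return None
    target = " ".join(quote_words)
    best = None
    for number, text in pages.items():
        words = normalize(text).split()
        # Slide a window as long as the quote (in words) over the page.
        last_start = max(1, len(words) - len(quote_words) + 1)
        for start in range(last_start):
            window = " ".join(words[start:start + len(quote_words)])
            score = difflib.SequenceMatcher(None, target, window).ratio()
            if best is None or score > best[1]:
                best = (number, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


# Made-up sample pages; "tirnes" imitates a typical OCR misreading.
pages = {
    1: "It was the best of tirnes, it was the worst of times",
    2: "Call me Ishmael. Some years ago",
}
match = locate_quote("it was the best of times", pages)
```

The sketch tolerates small OCR errors because it scores similarity instead of requiring an exact substring, but it does not handle a quote that straddles two pages, and the sliding window is too slow for a whole library.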
And I wish that page boundaries were always preserved, but if you take a
look at the transcriptions by Project Gutenberg you will see that they