- From: <bugzilla@jessica.w3.org>
- Date: Thu, 04 Nov 2010 23:41:33 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11211

--- Comment #4 from Aharon Lanin <aharon.lists.lanin@gmail.com> 2010-11-04 23:41:33 UTC ---
(In reply to comment #3)
> The right way to capture non-semantic line-breaking copied from another medium
> is <pre>, aka Preformatted.

I have no opinion on whether HTML is the right format for OCR output.

If we are on the subject, though, OCR from bidi text is devilishly hard. You have to run a visual-to-logical transformation (which is enough of a complication by itself). And you would have to guess which of the line breaks in the original text are actually line wraps, and which are paragraph breaks, since for the line wraps, you do indeed need line separators.

For example, let's say this is the original visual order in the printed RTL book:

    ali baba and the" SI YROTS EHT FO EMAN EHT
                                  ."40 thieves

The correct logical-order content would be:

    THE NAME OF THE STORY IS "ali baba and the[LINE SEPARATOR]
    40 thieves".[PARAGRAPH SEPARATOR]

If one used a [PARAGRAPH SEPARATOR] at the end of the first line, it would get displayed as:

    ali baba and the" SI YROTS EHT FO EMAN EHT
                                  ."thieves 40

> I would be interested in hearing other use cases as
> well, though.

I don't have anything. But I know that it will come and bite me on the behind one day.
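The "ali baba" example above can be reproduced with a small toy model. The sketch below is not the Unicode Bidirectional Algorithm. It assumes the usual notation from the example (UPPERCASE stands for RTL letters, lowercase for LTR letters), forces an RTL base direction for every paragraph (as `dir="rtl"` would), and implements only simplified stand-ins for the number rule (W7), the neutral rules (N1/N2), and per-line reordering (L2). All function names are hypothetical.

```python
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def classify(ch):
    """Toy classes: UPPERCASE = RTL letter, lowercase = LTR letter, digit = number."""
    if ch.isupper():
        return "R"
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "EN"
    return "N"  # spaces, punctuation and LS are all treated as neutrals here


def resolve_levels(para):
    """Embedding level per character for one paragraph (base direction: RTL)."""
    types = [classify(c) for c in para]

    # W7-like step: a number becomes L if the nearest preceding strong
    # character is L. The start of the paragraph counts as R.
    last_strong = "R"
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    # N1/N2-like step: a run of neutrals becomes L only when both neighbours
    # are L; otherwise it takes the base direction (R). Numbers and the
    # paragraph edges count as R.
    n, i = len(types), 0
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = "L" if i > 0 and types[i - 1] == "L" else "R"
        after = "L" if j < n and types[j] == "L" else "R"
        types[i:j] = ["L" if before == after == "L" else "R"] * (j - i)
        i = j

    # RTL text sits at level 1; LTR text and numbers sit at level 2.
    return [1 if t == "R" else 2 for t in types]


def reorder_line(chars, levels):
    """L2-like step: reverse each level-2 run, then reverse the whole line."""
    out, i = [], 0
    while i < len(chars):
        j = i
        while j < len(chars) and levels[j] == levels[i]:
            j += 1
        run = chars[i:j]
        out.extend(reversed(run) if levels[i] == 2 else run)
        i = j
    return "".join(reversed(out))


def display(text):
    """Visual lines (written left to right) for logical-order text."""
    lines = []
    for para in text.split(PS):
        if not para:
            continue
        # Levels are resolved over the whole paragraph, so context carries
        # across a line separator but not across a paragraph separator.
        levels = resolve_levels(para)
        start = 0
        for k in range(len(para) + 1):
            if k == len(para) or para[k] == LS:
                lines.append(reorder_line(para[start:k], levels[start:k]))
                start = k + 1
    return lines


first = 'THE NAME OF THE STORY IS "ali baba and the'
second = '40 thieves".'
print(display(first + LS + second + PS))
print(display(first + PS + second + PS))
```

With a line separator, "40" still sees the preceding "the" as its nearest strong character and stays inside the LTR run. With a paragraph separator, the second line starts a fresh RTL paragraph, so "40" is resolved on its own and ends up to the right of "thieves".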
Received on Thursday, 4 November 2010 23:41:35 UTC