[Bug 11211] Need a way to force a line wrap with the bidi semantics of LINE SEPARATOR when necessary. from bugzilla@jessica.w3.org on 2010-11-04 (public-html-bugzilla@w3.org from November 2010)

From: <bugzilla@jessica.w3.org>
Date: Thu, 04 Nov 2010 23:41:33 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1PE9Qz-00072i-UC@jessica.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=11211

--- Comment #4 from Aharon Lanin <aharon.lists.lanin@gmail.com> 2010-11-04 23:41:33 UTC ---
(In reply to comment #3)
> The right way to capture non-semantic line-breaking copied from another medium
> is <pre>, aka Preformatted.

I have no opinion on whether HTML is the right format for OCR output.

While we are on the subject, though, OCR of bidi text is devilishly hard. You
have to run a visual-to-logical transformation (which is complicated enough by
itself). You also have to guess which of the line breaks in the original text
are merely line wraps and which are paragraph breaks, since the line wraps do
indeed need line separators. For example, let's say this is the original
visual order in the printed RTL book (uppercase stands for RTL letters,
lowercase for LTR ones):

   ali baba and the" SI YROTS EHT FO EMAN EHT
                                 ."40 thieves

The correct logical-order content would be:

THE NAME OF THE STORY IS "ali baba and the[LINE SEPARATOR]
40 thieves".[PARAGRAPH SEPARATOR]

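If one used a [PARAGRAPH SEPARATOR] at the end of the first line instead, the
second line would become a bidi paragraph of its own, the number would no
longer be resolved together with the LTR text before it, and the result would
get displayed as: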

   ali baba and the" SI YROTS EHT FO EMAN EHT
                                 ."thieves 40

> I would be interested in hearing other use cases as
> well, though.

I don't have any others at hand. But I know that one will come and bite me on
the behind one day.


Received on Thursday, 4 November 2010 23:41:35 UTC