Re: [css-text][css-writing-modes] Line breaking around Emoji, Gaiji, U+FFFC, and text-combine-horizontal

On Mon, 26 Jan 2015 21:00:12 +0100, fantasai  
<fantasai.lists@inkedblade.net> wrote:

> On 01/26/2015 06:31 AM, Koji Ishii wrote:
>> Thank you for the great summary.
>>
>> On Mon, Jan 26, 2015 at 7:29 AM, fantasai  
>> <fantasai.lists@inkedblade.net> wrote:
>>>
>>> 1. text-combine-upright
>>> -----------------------
>>>
>>>     Result of text-combine-upright should break as ID, not as U+FFFC.
>>>     Current spec requires treating as actual contents for line-breaking.
>>>     So there is some misunderstanding of the text;
>>>     unclear whether there is an issue here to fix.
>>>
>>>     Proposal A: Leave spec as-is: TCY treated as its own text.
>>>     Proposal B: Make TCY always treated as ideographic character.
>>
>> Hm, the change was made in 2012[1]. I merely remember we discussed,
>> but don't remember why we changed.
>>
>> Though I lost that context, thinking now, I think B works the best.
>>
>>> 2. UAX#14 Rules for Atomic Inlines Problematic
>>> ----------------------------------------------
>>>
>>>     Changing the rule order for UAX#14 is a difficult tailoring.
>>>     Spec should just create a special rule for atomic inlines.
>>>
>>>     Proposal A: Change spec wording to fix this.
>>>     Proposal B: Change spec wording to fix issue #3.
>>>
>>>     Remaining Issue: Should U+FFFC match images?
>>
>> Not very clear the diff between A and B. Can you clarify?
>> [...]
>> Maybe we're talking the same? I couldn't read what you meant by your A  
>> and B.
>
> Sorry, I wasn't clear. I meant follow Proposal B for Issue #3, i.e.
>>>     Proposal B: Treat all images as ID.
>
>
>> This property is to opt-out the fix and bring back the behavior
>> we defined in the LC, so I think we need this in the Level 3.
>
> The ideal behavior is, I think, to treat as ID. I can't imagine
> anyone intentionally *wanting* the current behavior (ignoring
> nbsp etc.)
>
>
> FWIW, just checked Presto with some of your test cases (using
> comma, period, brackets, etc.), and it seems to treat images
> as ideographic. E.g. it keeps an image together with an
> immediately following close-bracket, comma, or period. This
> means it was Web-compatible enough for Presto, so maybe it's
> Web-compatible enough for everyone.
>
> I propose we treat TCY, U+FFFC, and images all as ID by default.
> What do you think?

I did some research in httparchive. Not breaking for nbsp around replaced
elements has the potential to put a lot of images or form controls on a
single line when they were expected to wrap, but this does not appear to be
common enough to be trivial to find when looking for it. Still, this is
something that has been reported as a bug for Opera. Not breaking for
other characters seems unlikely to break pages (any more than
implementing UAX14 for text in general would, at least).

Possibly LB19 can break pages where e.g. a 100% wide inline image is  
adjacent to inline heading text with quotes (like  
http://software.hixie.ch/utilities/js/live-dom-viewer/saved/3426 ).

Replace <...> with the things below to get the whole query. I limited the  
searches to ASCII characters (except nbsp).

SELECT page, COUNT(*) AS num
 FROM [httparchive:runs.2014_08_15_requests_body]
WHERE LOWER(mimeType) CONTAINS "html"
<...>
GROUP BY page
ORDER BY num
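
For anyone who wants to re-run these, here is a minimal Python sketch of
splicing one fragment into the template above. The function name and the
shortened example fragment are made up for illustration; only the template
text comes from this message.

```python
# Template copied from the message; "<...>" marks where a fragment goes.
TEMPLATE = """SELECT page, COUNT(*) AS num
 FROM [httparchive:runs.2014_08_15_requests_body]
WHERE LOWER(mimeType) CONTAINS "html"
<...>
GROUP BY page
ORDER BY num"""


def build_query(fragment: str) -> str:
    """Return the full query with the placeholder replaced by one fragment."""
    return TEMPLATE.replace("<...>", fragment.strip())


# Hypothetical, shortened fragment used only to show the substitution.
example = build_query(r"AND REGEXP_MATCH(LOWER(body), r'&nbsp;<img(\s|>)')")
print(example)
```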

== LB12 ==

# nbsp before replaced element
AND REGEXP_MATCH(LOWER(body),  
r'(&nbsp;|&#x0*a0;|&#0*160;)<(embed|iframe|video|canvas|object|applet|audio|img|input|button|meter|progress|select|textarea|keygen)(\s|>)')

8756 pages.

I loaded the first 50 in Opera 12 and didn't see anything obviously broken.
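
To make it easier to see what the pattern above does and does not catch,
here is a rough Python equivalent. It assumes REGEXP_MATCH looks for a
match anywhere in the body (so `re.search` is used), and the sample
strings are invented, not taken from httparchive.

```python
import re

# Element list copied from the query fragment above.
REPLACED = (r"embed|iframe|video|canvas|object|applet|audio|img|input|"
            r"button|meter|progress|select|textarea|keygen")
NBSP_BEFORE = re.compile(
    r"(&nbsp;|&#x0*a0;|&#0*160;)<(" + REPLACED + r")(\s|>)"
)


def nbsp_before_replaced(body: str) -> bool:
    # The query lowercases the body first, so do the same here.
    return NBSP_BEFORE.search(body.lower()) is not None


samples = [
    'text&nbsp;<img src="a.png">',   # named entity
    'text&#xA0;<INPUT type=text>',   # hex entity, uppercase markup
    'text&#0160;<select>',           # decimal entity with a leading zero
    'text <img src="a.png">',        # ordinary space
    'text&nbsp;<imgx>',              # not one of the listed elements
]
results = [nbsp_before_replaced(s) for s in samples]
print(results)  # [True, True, True, False, False]
```

A raw U+00A0 character in the markup is not matched; only the three
escaped forms are.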

# nbsp after replaced element
AND REGEXP_MATCH(LOWER(body),  
r'<(/?embed|/iframe|/video|/canvas|/object|/applet|/audio|img|input|/button|/meter|/progress|/select|/textarea|/?keygen)\s*/?>(&nbsp;|&#x0*a0;|&#0*160;)')

685 pages.

I loaded the first 50 in Opera 12 and found one page that is slightly
broken:

http://joboutlook.gov.au/ (the "search" buttons overflow in Presto but
wrap in other browsers).

# nbsp between two replaced elements
AND REGEXP_MATCH(LOWER(body),  
r'<(/?embed|/iframe|/video|/canvas|/object|/applet|/audio|img|input|/button|/meter|/progress|/select|/textarea|/?keygen)\s*/?>(&nbsp;|&#x0*a0;|&#0*160;)+<(embed|iframe|video|canvas|object|applet|audio|img|input|button|meter|progress|select|textarea|keygen)(\s|>)')

190 pages.

Of the first 50 I only found joboutlook.gov.au again.

== LB13 ==

# } ) ] ! ? , . / after replaced element, possibly spaces between
AND REGEXP_MATCH(LOWER(body),  
r'<(/?embed|/iframe|/video|/canvas|/object|/applet|/audio|img|input|/button|/meter|/progress|/select|/textarea|/?keygen)\s*/?>\s*[\}\)\]\!\?\,\.\/]')

167 pages.

I included \s* between, although UAX14 is inconsistent. It says "Do not  
break before ‘]’ or ‘!’ or ‘;’ or ‘/’, even after spaces." but then the  
grammar is "× CL", not "× SP* CL". Presto prevents breaking even with the  
space.
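
A small Python sketch of the LB13 pattern, again assuming search
semantics and using invented sample strings, shows the effect of the
optional whitespace:

```python
import re

# Tag alternation and punctuation class copied from the LB13 fragment.
AFTER_TAGS = (r"/?embed|/iframe|/video|/canvas|/object|/applet|/audio|img|"
              r"input|/button|/meter|/progress|/select|/textarea|/?keygen")
LB13 = re.compile(r"<(" + AFTER_TAGS + r")\s*/?>\s*[\}\)\]\!\?\,\.\/]")

samples = [
    "</button>)",            # end tag directly followed by ")"
    "<img> ,",               # space between, allowed by \s*
    "<img />.",              # self-closing syntax
    '<img src="a.png">.',    # start tag with attributes
    "<img>text",             # no closing punctuation
]
results = [LB13.search(s.lower()) is not None for s in samples]
print(results)  # [True, True, True, False, False]
```

Note that, as written, the "after replaced element" patterns only match
img and input start tags without attributes (`<img>` or `<img />`), since
nothing but whitespace and an optional slash is allowed before the `>`.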

== LB14 ==

# ( [ { before replaced element, possibly spaces between
AND REGEXP_MATCH(LOWER(body),  
r'[\(\[\{]\s*<(embed|iframe|video|canvas|object|applet|audio|img|input|button|meter|progress|select|textarea|keygen)(\s|>)')

127 pages.

Presto breaks when there is a space between.

e.g.

http://www.gigposters.com/
http://www.newsonews.com/

Note that the newsonews.com one is in quirks mode, with [<img>] in table
cells. WebKit/Blink prevent linebreaks around images in table cells in  
quirks mode. Gecko only prevents linebreaks around images in table cells  
in quirks mode *for the purpose of calculating the width of the table  
cell*, not when actually laying out. The good news is that the proposal is  
slightly closer to the behavior in quirks mode, so it is less likely to  
break such pages.

== LB19 ==

(Searching for " or ' is not useful because of strings in JS.)

# HTML-escaped QU characters before replaced element
AND REGEXP_MATCH(LOWER(body),  
r'&#(x0*(ab|bb|2018|2019|201[bcdf]|203[9a]|275[bcdef]|2760|2e0[0123456789abcd]|2e2[01]|1f67[678])|0*(171|187|8216|8217|8219|822[013]|8249|8250|1007[56789]|10080|1177[6789]|117[89]\d|1180[01234589]|12863[012]));?<(embed|iframe|video|canvas|object|applet|audio|img|input|button|meter|progress|select|textarea|keygen)(\s|>)')

7 pages.

# HTML-escaped QU characters after replaced element
AND REGEXP_MATCH(LOWER(body),  
r'<(/?embed|/iframe|/video|/canvas|/object|/applet|/audio|img|input|/button|/meter|/progress|/select|/textarea|/?keygen)\s*/?>&#(x0*(ab|bb|2018|2019|201[bcdf]|203[9a]|275[bcdef]|2760|2e0[0123456789abcd]|2e2[01]|1f67[678])|0*(171|187|8216|8217|8219|822[013]|8249|8250|1007[56789]|10080|1177[6789]|117[89]\d|1180[01234589]|12863[012]))([;<]|\s)')

2 pages.
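
The hex and decimal branches of the entity pattern are meant to name the
same characters. The following Python sketch (my own check, not part of
the original queries) expands both alternations into code points and
compares them. The 0x20000 upper bound is an assumption that is safe here
because every alternative is below it.

```python
import re

# Hex and decimal alternations copied from the two fragments above,
# without the leading "0*" padding.
HEX = re.compile(r"ab|bb|2018|2019|201[bcdf]|203[9a]|275[bcdef]|2760|"
                 r"2e0[0123456789abcd]|2e2[01]|1f67[678]")
DEC = re.compile(r"171|187|8216|8217|8219|822[013]|8249|8250|1007[56789]|"
                 r"10080|1177[6789]|117[89]\d|1180[01234589]|12863[012]")

LIMIT = 0x20000  # assumption: covers every alternative listed above
hex_set = {cp for cp in range(LIMIT) if HEX.fullmatch(format(cp, "x"))}
dec_set = {cp for cp in range(LIMIT) if DEC.fullmatch(str(cp))}

only_dec = sorted(dec_set - hex_set)
only_hex = sorted(hex_set - dec_set)
print([hex(cp) for cp in only_dec])
print([hex(cp) for cp in only_hex])
```

With the patterns as written, the decimal branch accepts U+2E0E through
U+2E1D in addition to everything the hex branch accepts, so the two are
not quite the same set. Given the small page counts this probably does
not change the conclusion, but it is worth knowing when reusing the
query.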

# Raw QU characters before replaced element, excluding " ' and astral characters
AND REGEXP_MATCH(LOWER(body),  
r'[«»‘’‛-“”‟‹›❛-❠⸀-⸁⸂⸃⸄⸅⸆-⸈⸉⸊⸋⸌⸍⸜⸝⸠⸡]<(embed|iframe|video|canvas|object|applet|audio|img|input|button|meter|progress|select|textarea|keygen)(\s|>)')

16 pages. (I don't know if this result is accurate, it is possible there  
are encoding issues.)

# Raw QU characters after replaced element, excluding " ' and astral characters
AND REGEXP_MATCH(LOWER(body),  
r'<(/?embed|/iframe|/video|/canvas|/object|/applet|/audio|img|input|/button|/meter|/progress|/select|/textarea|/?keygen)\s*/?>[«»‘’‛-“”‟‹›❛-❠⸀-⸁⸂⸃⸄⸅⸆-⸈⸉⸊⸋⸌⸍⸜⸝⸠⸡]')

0 pages. (Again, don't know if this is accurate.)

-- 
Simon Pieters
Opera Software

Received on Tuesday, 24 February 2015 12:17:59 UTC