[Bug 14360] New: Count Unicode "combining marks" together with "inter-element whitespace"

http://www.w3.org/Bugs/Public/show_bug.cgi?id=14360

           Summary: Count Unicode "combining marks" together with
                    "inter-element whitespace"
           Product: HTML WG
           Version: unspecified
          Platform: All
               URL: http://dev.w3.org/html5/spec/content-models.html#flow-
                    content-0
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: LC1 HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


SPEC SAYS:

]] As a general rule, elements whose content model allows **any flow content**
should have either at least one descendant text node that is not inter-element
whitespace, [[

PROPOSALS:
  1)  After the last comma above, add roughly this text:
       "and that also isn't a Unicode combining mark".
  2)  Also, in a parenthesis or side note, state that if an isolated
       combining mark is needed, then one should, in line with
       Unicode 6.0, combine it with U+00A0 no-break space.
  3)  Allow conformance checkers to warn if a combining mark,
       with or without U+0020, is the sole text node of an element
       "whose content model allows any flow content", as well as
       when, regardless of whether the element allows any flow content,
       it combines with or is placed adjacent to U+0020.
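
A minimal sketch of how a conformance checker might apply proposals 1 and 3
to the text of a single text node. The function names are hypothetical, and
the reading of proposal 1 (text made up only of HTML space characters and
combining marks does not count as content) is an assumption about the
intended wording, not spec text.

```python
import unicodedata

# HTML "space characters": space, tab, LF, FF, CR.
HTML_SPACE_CHARS = {" ", "\t", "\n", "\f", "\r"}


def is_combining_mark(ch: str) -> bool:
    """True for Unicode general categories Mn, Mc and Me."""
    return unicodedata.category(ch).startswith("M")


def is_inter_element_whitespace(text: str) -> bool:
    """Empty text, or text made up only of HTML space characters."""
    return all(ch in HTML_SPACE_CHARS for ch in text)


def counts_as_content(text: str) -> bool:
    """Proposal 1 (assumed reading): at least one character must be
    neither an HTML space character nor a combining mark."""
    return any(
        ch not in HTML_SPACE_CHARS and not is_combining_mark(ch)
        for ch in text
    )


def has_mark_on_space(text: str) -> bool:
    """Proposal 3 (second half): a combining mark directly follows U+0020."""
    return any(
        a == " " and is_combining_mark(b) for a, b in zip(text, text[1:])
    )
```

Under this sketch, U+0020 + U+0301 does not count as content, while
U+00A0 + U+0301 does, because U+00A0 is not an HTML space character.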

TEST CASE: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1167

PROBLEM DESCRIPTION: Bug 13502 resulted in a de facto permission to let text
runs begin with combining marks. While that should perhaps not be
completely forbidden, if an element "whose content model allows any
flow content" contains nothing but (inter-element) space + combining mark (or
even solely a combining mark), then there are several potential issues:

1)  White space collapsing means that the combining character doesn't really
     combine with the space character
2)  Combining marks that combine with nothing, or with a space, are hard to
     select with the mouse
3)  Visually, such marks may look as if they combine with something outside the
     element (see the third paragraph in the test case)
4)  When the first letter is a combining mark, the CSS *:first-letter{}
     selector may seem, to authors, not to work

UNICODE ARGUMENTS: In bug 13502, comment number 4, the question came up of how
to represent isolated combining marks
(http://www.w3.org/Bugs/Public/show_bug.cgi?id=13502#c4). However, the solution
mentioned there, to use U+0020, is no longer the recommended method, due to the
space character normalization rules of XML. Citing Unicode 6.0:

]]
7.9 Combining Marks
   [ snip ]
Marks as Spacing Characters. By convention, combining marks may be exhibited in
(apparent) isolation by applying them to U+00A0 no-break space. This approach
might be taken, for example, when referring to the diacritical mark itself as a
mark, rather than using it in its normal way in text. Prior to Version 4.1 of
the Unicode Standard, the standard also recommended the use of U+0020 space for
display of isolated combining marks. This is no longer recommended, however,
because of potential conflicts with the handling of sequences of U+0020 space
characters in such contexts as XML.
[[
   [ For RTL scripts, it is slightly more complicated - see section 7.9 of
Unicode 6.0.]
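
To illustrate why U+0020 is a fragile base for an isolated mark, here is a
small sketch. The collapse_whitespace function is a simplified, hypothetical
model of whitespace handling (runs of space characters become one U+0020,
and leading/trailing U+0020 is dropped); it is not the actual CSS or XML
algorithm.

```python
import re

# Simplified model (an assumption): only these space characters collapse.
_SPACE_RUN = re.compile(r"[ \t\n\f\r]+")


def collapse_whitespace(text: str) -> str:
    """Collapse runs of space characters, then drop leading/trailing U+0020."""
    return _SPACE_RUN.sub(" ", text).strip(" ")


ACUTE = "\u0301"             # COMBINING ACUTE ACCENT
on_space = " " + ACUTE       # pre-4.1 recommendation: U+0020 as the base
on_nbsp = "\u00a0" + ACUTE   # current recommendation: U+00A0 as the base

# The U+0020 base is stripped, leaving the mark with no base at all;
# the U+00A0 base is untouched by the collapsing step.
print([f"U+{ord(c):04X}" for c in collapse_whitespace(on_space)])
print([f"U+{ord(c):04X}" for c in collapse_whitespace(on_nbsp)])
```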

The justifications for somewhat aligning with inter-element whitespace, rather
than completely forbidding combining marks that combine with U+0020, are:
  1)  The same as for the permission to have empty elements: such content may
be used as a placeholder or template. E.g. a combining accent might be combined
with different letters via scripting.
  2)  Furthermore, Unicode contains "Spacing Clones of Diacritical Marks", most
of which "have compatibility decomposition mappings involving U+0020
space, but implementers should be cautious in making use of those decomposition
mappings because of the complications that can arise from replacing a spacing
character with a space + combining mark sequence". (The point is that, even if
Unicode warns against it, one can probably not completely forbid combining
marks combined with U+0020 when Unicode itself operates with normalization that
includes U+0020.)
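
The compatibility decompositions mentioned in point 2 can be inspected with
Python's standard unicodedata module. The helper name below is hypothetical;
the four characters are just a sample of the spacing clones.

```python
import unicodedata


def nfkd_code_points(ch: str) -> list[str]:
    """Return the NFKD (compatibility) decomposition as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKD", ch)]


# DIAERESIS, MACRON, ACUTE ACCENT, CEDILLA: spacing clones of diacritics.
for clone in "\u00a8\u00af\u00b4\u00b8":
    print(f"U+{ord(clone):04X}", unicodedata.name(clone), "->",
          " ".join(nfkd_code_points(clone)))
```

For example, U+00B4 ACUTE ACCENT decomposes to U+0020 followed by U+0301, so
NFKD normalization alone can produce a space + combining mark sequence.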


Received on Monday, 3 October 2011 03:29:19 UTC