Re: liaison for a Unicode ticket from Shervin Afshar on 2016-12-26 (public-tt@w3.org from December 2016)

From: Shervin Afshar <safshar@netflix.com>
Date: Mon, 26 Dec 2016 12:21:57 -0800
To: Pierre-Anthony Lemieux <pal@sandflow.com>
Cc: Thierry MICHEL <tmichel@w3.org>, Nigel Megitt <nigel.megitt@bbc.co.uk>, Richard Ishida <ishida@w3.org>, Mark Davis <mark@macchiato.com>, Steven R Loomis <srloomis@us.ibm.com>, "public-tt@w3.org" <public-tt@w3.org>
Message-ID: <CABEdNY+suyEcyJxGyzZ7Uz-dc7j8ejC5S75=i6nNznnyo5DGvA@mail.gmail.com>
Hi Pierre-Anthony,

I understand that the criteria were kept lax in order to cover a larger
number of cases, but we still need to scrutinize some of the questionable
additions. Another example is U+02BC (MODIFIER LETTER APOSTROPHE):
although it is relevant to the punctuation exemplar for one locale, its
presence in others might prove redundant and problematic in the grand
scheme of things. Another unlikely addition is U+0132 (LATIN CAPITAL
LIGATURE IJ), which is present only in Turkish; it is not clear why, since
the character is used in Dutch.
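
For anyone who wants to double-check the characters named in this thread,
here is a minimal sketch using only the Python standard library. It looks
up each character's Unicode name and reports whether it falls inside the
Latin Extended-A block (U+0100 to U+017F). The helper name
in_latin_extended_a is made up for illustration and is not part of any
CLDR tooling.

```python
import unicodedata

# Code points mentioned in this thread.
CODE_POINTS = [0x02BC, 0x0132, 0x017F]

def in_latin_extended_a(cp):
    """Hypothetical helper: True if cp lies in Latin Extended-A (U+0100..U+017F)."""
    return 0x0100 <= cp <= 0x017F

# One (code point label, Unicode name, in-block flag) row per character.
report = [
    (f"U+{cp:04X}", unicodedata.name(chr(cp)), in_latin_extended_a(cp))
    for cp in CODE_POINTS
]

for row in report:
    print(*row)
```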

One could certainly imagine a timed-text asset for Turkish containing some
strings of Dutch, and it might be reasoned that the Dutch exemplar set (or
a subset of it) should therefore be added to the Turkish exemplar sets as
auxiliary. What is actually needed, however, is more granular use of the
"xml:lang" attribute, following W3C i18n best practices[1], to distinguish
non-Turkish content present in a Turkish context:

<body region="subtitleArea">
  <div>
    <p xml:id="subtitle1" xml:lang="tr" begin="0.76s" end="3.45s">
      Dost kara günde belli olur.<br />
      <span xml:lang="nl">De ratten verlaten het zinkende schip.</span>
    </p>
    ...
  </div>
</body>
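
To illustrate how a consumer could take advantage of such markup, here is
a rough sketch (Python, standard library only) that walks a fragment like
the one above and pairs each text run with its effective language,
resolving "xml:lang" by inheritance from the nearest ancestor. The
function name text_runs is made up for illustration, and the sample omits
the TTML namespace declaration just as the fragment above does; a real
validator would then check each run against the exemplar set of its own
locale rather than the document-level one.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<body region="subtitleArea">
  <div>
    <p xml:id="subtitle1" xml:lang="tr" begin="0.76s" end="3.45s">
      Dost kara günde belli olur.<br />
      <span xml:lang="nl">De ratten verlaten het zinkende schip.</span>
    </p>
  </div>
</body>"""

def text_runs(element, inherited_lang=None):
    """Yield (language, text) pairs; xml:lang is inherited from ancestors."""
    lang = element.get(XML_LANG, inherited_lang)
    if element.text and element.text.strip():
        yield lang, element.text.strip()
    for child in element:
        yield from text_runs(child, lang)
        # Tail text follows the child but is in the parent's language.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()

runs = list(text_runs(ET.fromstring(SAMPLE)))
for lang, text in runs:
    print(lang, text)
```

With the sample above, the Turkish sentence is reported under "tr" and the
Dutch one under "nl", so each run can be matched against its own locale.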

Although the added complexity of adding characters across the board to
exemplar sets in order to maximize coverage might seem marginal, CLDR has,
as far as I am aware, been selective about such additions.

For now I will proceed with actions for adding the clear-cut cases (which I
mentioned in a previous email) to the relevant exemplar sets.

[1]: https://www.w3.org/International/questions/qa-when-xmllang.en#when_use

Best regards,
Shervin

On Wed, Dec 21, 2016 at 11:22 PM, Pierre-Anthony Lemieux <pal@sandflow.com>
wrote:

> Hi Shervin,
>
> Thanks for the update, and to the CLDR TC for considering the input.
>
> > In some other cases it's not very clear if the inclusion of specific
> > characters is justified or simply due to bad data (e.g. U+017F, LATIN
> > SMALL LETTER LONG S which is included in the set latnExtA provided in
> > [2]).
>
> I believe that the recommended sets erred on the side of caution, and
> were created to deliberately cast a wider, rather than narrower, net
> whenever possible. For instance, the recommended set for each of the
> "lv,lt,et,hr,cs,pl,sl,sk,tr" locales includes all of the Latin
> Extended-A block, instead of attempting to optimize each set at the
> risk of missing important characters -- the general assumption being
> that the incremental complexity of supporting all versus parts of the
> Latin Extended-A block would be marginal, e.g. implementations support
> all or none of the Latin Extended-A block.
>
> > I will update the thread and the ticket with the next steps when I get
> > to check for anomalies of that sort.
>
> Looking forward to your feedback.
>
> Best,
>
> -- Pierre
>
> On Mon, Dec 12, 2016 at 12:50 PM, Shervin Afshar <safshar@netflix.com>
> wrote:
> > Thanks for the new comparison report[1]. CLDR TC discussed this again
> > last week and looking at the report, it seems that in some cases the
> > issue can be addressed by adding characters to one of CLDR exemplar
> > categories for the respective locale; e.g. for Arabic, U+060D (Arabic
> > date separator) or for Hebrew, U+05C3 (Sof Pasuq). In some other cases
> > it's not very clear if the inclusion of specific characters is
> > justified or simply due to bad data (e.g. U+017F, LATIN SMALL LETTER
> > LONG S which is included in the set latnExtA provided in [2]).
> >
> > Therefore, a closer inspection of each set seems necessary. I will update
> > the thread and the ticket with the next steps when I get to check for
> > anomalies of that sort.
> >
> > [1]: http://www.sandflow.com/public/CLDR-report-20161204.txt
> > [2]: https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml
> >
> > Best regards,
> > Shervin
> >
> > On Fri, Dec 9, 2016 at 1:14 AM, Thierry MICHEL <tmichel@w3.org> wrote:
> >>
> >> Hello,
> >>
> >> The TTWG provided feedback on the
> >> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915>
> >>
> >> Looking forward to your review,
> >>
> >> Best regards,
> >> Thierry Michel
> >>
> >>
> >> Le 05/12/2016 à 01:41, Shervin Afshar a écrit :
> >>>
> >>> Hello,
> >>>
> >>> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915> was
> >>> discussed in last technical committee meeting. We think that this
> >>> use-case falls within the scope of CLDR project, but to effectively add
> >>> this data to benefit implementers and users, there are a few issues which
> >>> need to be addressed. Most of these questions are reflected in the
> >>> comment that Mark provided on the ticket (direct link
> >>> <http://unicode.org/cldr/trac/ticket/8915#comment:8>). To summarize, the
> >>> following items should be addressed and discussed:
> >>>
> >>> – Clarification on the intended usage of this data with regards to
> >>> section 7.2 and Appendix B of TTML-IMSC1; e.g. inclusion/exclusion
> >>> rationale, rationale for selection of "base" set;
> >>> – Comparison between sets in proposed draft data
> >>> <https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml>
> >>> and existing CLDR exemplar types (main, aux, punctuation) in various
> >>> locales;
> >>> – Plans for providing data for other locales.
> >>>
> >>> Best regards,
> >>> Shervin
> >>>
> >>>         ----- Original message -----
> >>>         From: r12a <ishida@w3.org <mailto:ishida@w3.org>>
> >>>         To: Mark Davis <mark@macchiato.com
> >>>         <mailto:mark@macchiato.com>>, Shervin Afshar
> >>>         <shervinafshar@gmail.com <mailto:shervinafshar@gmail.com>>,
> >>>         Steven R Loomis/Cupertino/IBM@IBMUS
> >>>         Cc: Thierry MICHEL <tmichel@w3.org <mailto:tmichel@w3.org>>,
> >>>         W3C Public TTWG <public-tt@w3.org <mailto:public-tt@w3.org>>
> >>>         Subject: Re: liaison for a Unicode ticket
> >>>         Date: Tue, Nov 8, 2016 3:43 AM
> >>>
> >>>         hi Mark, Shervin, Steve,
> >>>
> >>>         It has been thirteen months since there was movement on this
> >>>         query. Could one of you please contact Thierry and advise him
> >>>         on how/whether it's possible to move forward the request of
> >>>         the Timed Text WG?
> >>>
> >>>         thanks,
> >>>         ri
> >>>
> >>>
> >>>
> >>>         On 03/11/2016 17:38, Thierry MICHEL wrote:
> >>>         > Richard,
> >>>         >
> >>>         >
> >>>         > The TTWG has a Unicode ticket for adding the following "CLDR
> >>>         > supplemental data for subtitle and caption characters"
> >>>         >
> >>>         > The Unicode ticket is available at
> >>>         > http://unicode.org/cldr/trac/ticket/8915
> >>>         <http://unicode.org/cldr/trac/ticket/8915>
> >>>         >
> >>>         > There have been no further notes on this for 7 months since
> >>>         > IMSC1 was published as a Recommendation
> >>>         > (https://www.w3.org/TR/ttml-imsc1/
> >>>         <https://www.w3.org/TR/ttml-imsc1/>)
> >>>         >
> >>>         >
> >>>         > Could you please help the TTWG to liaise with Unicode to allow
> >>>         > moving forward?
> >>>         >
> >>>         > I guess Mark Davis is the liaison contact for Unicode.
> >>>         >
> >>>         > Thierry.
> >>>         >
> >>>
> >>>
> >
>
Received on Monday, 26 December 2016 20:23:06 UTC