Re: liaison for a Unicode ticket from Shervin Afshar on 2017-01-02 (public-tt@w3.org from January 2017)

From: Shervin Afshar <safshar@netflix.com>
Date: Mon, 2 Jan 2017 11:03:48 -0800
To: Pierre-Anthony Lemieux <pal@sandflow.com>
Cc: Thierry MICHEL <tmichel@w3.org>, Nigel Megitt <nigel.megitt@bbc.co.uk>, Richard Ishida <ishida@w3.org>, Mark Davis <mark@macchiato.com>, Steven R Loomis <srloomis@us.ibm.com>, "public-tt@w3.org" <public-tt@w3.org>
Message-ID: <CABEdNYLzJDug65a10DQvMOsVS6=YW_=J9gKu1B-oPi0p8cknLA@mail.gmail.com>
Hi Pierre-Anthony,

Thanks for the input.

> - how would an author creating subtitles/captions targeting the
> Turkish locale (and/or an implementer wishing to support the Turkish
> locale) know that Dutch characters can be present?

This of course depends on the implementation, and it is out of scope for
defining exemplar sets and the characters used in each locale. A similar
issue exists in copywriting for localizable content, and there are
established best practices agreed upon among industry practitioners, e.g.
using markup to denote the language of non-translatable content, high-level
tooling to add markup when content in one language is embedded in another, etc.

> - some characters, such as the musical note ("♩"), are used for
> subtitles/captions across many locales.
> For this reason, ticket #8915 [1] suggests:
> - adding an entirely new class ('subtitleCharacters') of sets to CLDR
> - defining a 'base' subtitle/captions exemplar set available across all
> locales


Agreed. My thinking is to first address those characters which are
clear-cut and obviously missing for a locale, then address some of the
anomalies previously mentioned in this thread. After narrowing things down
this way, we end up with a set of characters that are used across locales
for timed-text, and we can address those separately. My assumption is that
♩ would fall into this last category.
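
One possible reading of that last step, sketched under assumptions: treat the
characters shared by every locale as candidates for the cross-locale set and
keep the remainder per locale. The function name split_base_and_specific and
the locale codes and character data below are placeholders for illustration,
not actual CLDR data.

```python
def split_base_and_specific(chars_by_locale):
    """Return (base, specific): characters shared by every locale, and the rest per locale."""
    if not chars_by_locale:
        return set(), {}
    # Candidates for a cross-locale set: characters present in every locale.
    base = set.intersection(*(set(c) for c in chars_by_locale.values()))
    # Whatever is left stays with the individual locale.
    specific = {loc: set(c) - base for loc, c in chars_by_locale.items()}
    return base, specific


# Placeholder data for illustration only; not actual CLDR exemplar data.
observed = {
    "xx": set("ab♩"),
    "yy": set("bc♩"),
}
base, specific = split_base_and_specific(observed)
```

A strict intersection is only one choice; a threshold such as "present in
most locales" may be closer to what is needed in practice.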

Best regards,
Shervin

On Mon, Jan 2, 2017 at 9:09 AM, Pierre-Anthony Lemieux <pal@sandflow.com>
wrote:

> Hi Shervin,
>
> Thanks for the feedback and the illustrative example.
>
> > in actuality what is needed is a more granular usage of "xml:lang"
> > attribute according to W3C i18n best practices[1] to distinguish
> > non-Turkish content present in a Turkish context:
>
> Yes. This does not however address two other issues:
>
> - how would an author creating subtitles/captions targeting the
> Turkish locale (and/or an implementer wishing to support the Turkish
> locale) know that Dutch characters can be present?
>
> - some characters, such as the musical note ("♩"), are used for
> subtitles/captions across many locales.
>
> For this reason, ticket #8915 [1] suggests:
>
> - adding an entirely new class ('subtitleCharacters') of sets to CLDR
> -- instead of overloading the main and auxiliary sets, which, as you
> point out, cover only those exemplars that are used in the language of
> the locale (and are selected using xml:lang)
>
> - defining a 'base' subtitle/captions exemplar set available across all
> locales
>
> [1] http://unicode.org/cldr/trac/ticket/8915
>
> Perhaps these issues have been discussed before. Looking forward to
> your thoughts.
>
> > For now I will proceed with actions for adding the clear-cut cases
> > (which I mentioned in a previous email) to the relevant exemplar sets.
>
> Thanks. Seems very useful regardless :)
>
> Best,
>
> -- Pierre
>
>
> On Mon, Dec 26, 2016 at 12:21 PM, Shervin Afshar <safshar@netflix.com>
> wrote:
> > Hi Pierre-Anthony,
> >
> > I understand the lax criteria in order to cover a larger number of cases,
> > but we still need to scrutinize and investigate some of the questionable
> > additions; another example being U+02BC (MODIFIER LETTER APOSTROPHE),
> > which, though relevant to the punctuation exemplar for one locale, might
> > prove redundant and problematic in others in the grand scheme of things.
> > Another instance of such unlikely additions is U+0132 (LATIN CAPITAL
> > LIGATURE IJ), which is only present in Turkish, with no clear explanation
> > why (since it is used in Dutch).
> >
> > One could certainly imagine a timed-text asset for Turkish having some
> > strings of Dutch in it, and it might be reasoned that the Dutch exemplar
> > set (or a subset of it) should therefore be added to the Turkish exemplar
> > sets as auxiliary, but in actuality what is needed is a more granular
> > usage of the "xml:lang" attribute according to W3C i18n best practices[1]
> > to distinguish non-Turkish content present in a Turkish context:
> >
> > <body region="subtitleArea">
> >     <div>
> >       <p xml:id="subtitle1" xml:lang="tr" begin="0.76s" end="3.45s">
> >         Dost kara günde belli olur.<br />
> >         <span xml:lang="nl">De ratten verlaten het zinkende schip.</span>
> >       </p>
> > ...
> >    </div>
> > </body>
> >
> > Although the complexity of an all-around addition of characters to
> > exemplar sets to maximize coverage might seem marginal, as far as I'm
> > aware, CLDR has been selective about such additions.
> >
> > For now I will proceed with actions for adding the clear-cut cases (which I
> > mentioned in a previous email) to the relevant exemplar sets.
> >
> > [1]: https://www.w3.org/International/questions/qa-when-xmllang.en#when_use
> >
> > Best regards,
> > Shervin
> >
> > On Wed, Dec 21, 2016 at 11:22 PM, Pierre-Anthony Lemieux <pal@sandflow.com>
> > wrote:
> >>
> >> Hi Shervin,
> >>
> >> Thanks for the update, and to the CLDR TC for considering the input.
> >>
> >> > In some other cases it's not very clear if the inclusion of a specific
> >> > character is justified or simply due to bad data (e.g. U+017F, LATIN
> >> > SMALL LETTER LONG S, which is included in the set latnExtA provided
> >> > in [2]).
> >>
> >> I believe that the recommended sets erred on the side of caution, and
> >> were created to deliberately cast a wider, rather than narrower, net
> >> whenever possible. For instance, the recommended set for each of the
> >> "lv,lt,et,hr,cs,pl,sl,sk,tr" locales includes all of the Latin
> >> Extended-A block, instead of attempting to optimize each set at the
> >> risk of missing important characters -- the general assumption being
> >> that the incremental complexity of supporting all versus parts of the
> >> Latin Extended-A block would be marginal, e.g. implementations support
> >> all or none of the Latin Extended-A block.
> >>
> >> > I will update the thread and the ticket with the next steps when I get
> >> > to check for anomalies of that sort.
> >>
> >> Looking forward to your feedback.
> >>
> >> Best,
> >>
> >> -- Pierre
> >>
> >> On Mon, Dec 12, 2016 at 12:50 PM, Shervin Afshar <safshar@netflix.com>
> >> wrote:
> >> > Thanks for the new comparison report[1]. CLDR TC discussed this again
> >> > last week and, looking at the report, it seems that in some cases the
> >> > issue can be addressed by adding characters to one of the CLDR exemplar
> >> > categories for the respective locale; e.g. for Arabic, U+060D (Arabic
> >> > date separator) or for Hebrew, U+05C3 (Sof Pasuq). In some other cases
> >> > it's not very clear if the inclusion of a specific character is
> >> > justified or simply due to bad data (e.g. U+017F, LATIN SMALL LETTER
> >> > LONG S, which is included in the set latnExtA provided in [2]).
> >> >
> >> > Therefore, a closer inspection of each set seems necessary. I will
> >> > update the thread and the ticket with the next steps when I get to
> >> > check for anomalies of that sort.
> >> >
> >> > [1]: http://www.sandflow.com/public/CLDR-report-20161204.txt
> >> > [2]: https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml
> >> >
> >> > Best regards,
> >> > Shervin
> >> >
> >> > On Fri, Dec 9, 2016 at 1:14 AM, Thierry MICHEL <tmichel@w3.org>
> >> > wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> The TTWG provided feedback on the
> >> >> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915>
> >> >>
> >> >> Looking forward to your review,
> >> >>
> >> >> Best regards,
> >> >> Thierry Michel
> >> >>
> >> >>
> >> >> Le 05/12/2016 à 01:41, Shervin Afshar a écrit :
> >> >>>
> >> >>> Hello,
> >> >>>
> >> >>> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915> was
> >> >>> discussed in the last technical committee meeting. We think that this
> >> >>> use-case falls within the scope of the CLDR project, but to
> >> >>> effectively add this data to benefit implementers and users, there
> >> >>> are a few issues which need to be addressed. Most of these questions
> >> >>> are reflected in the comment that Mark provided on the ticket (direct
> >> >>> link <http://unicode.org/cldr/trac/ticket/8915#comment:8>). To
> >> >>> summarize, the following items should be addressed and discussed:
> >> >>>
> >> >>> – Clarification on the intended usage of this data with regards to
> >> >>> section 7.2 and Appendix B of TTML-IMSC1; e.g. inclusion/exclusion
> >> >>> rationale, rationale for selection of "base" set;
> >> >>> – Comparison between sets in proposed draft data
> >> >>> <https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml>
> >> >>> and existing CLDR exemplar types (main, aux, punctuation) in various
> >> >>> locales;
> >> >>> – Plans for providing data for other locales.
> >> >>>
> >> >>> Best regards,
> >> >>> Shervin
> >> >>>
> >> >>>         ----- Original message -----
> >> >>>         From: r12a <ishida@w3.org>
> >> >>>         To: Mark Davis <mark@macchiato.com>, Shervin Afshar
> >> >>>         <shervinafshar@gmail.com>,
> >> >>>         Steven R Loomis/Cupertino/IBM@IBMUS
> >> >>>         Cc: Thierry MICHEL <tmichel@w3.org>,
> >> >>>         W3C Public TTWG <public-tt@w3.org>
> >> >>>         Subject: Re: liaison for a Unicode ticket
> >> >>>         Date: Tue, Nov 8, 2016 3:43 AM
> >> >>>
> >> >>>         hi Mark, Shervin, Steve,
> >> >>>
> >> >>>         It has been thirteen months since there was movement on this
> >> >>>         query. Could one of you please contact Thierry and advise him
> >> >>>         on how/whether it's possible to move forward the request of
> >> >>>         the Timed Text WG?
> >> >>>
> >> >>>         thanks,
> >> >>>         ri
> >> >>>
> >> >>>
> >> >>>
> >> >>>         On 03/11/2016 17:38, Thierry MICHEL wrote:
> >> >>>         > Richard,
> >> >>>         >
> >> >>>         >
> >> >>>         > The TTWG has a Unicode ticket for adding the following "CLDR
> >> >>>         > supplemental data for subtitle and caption characters"
> >> >>>         >
> >> >>>         > The Unicode ticket is available at
> >> >>>         > http://unicode.org/cldr/trac/ticket/8915
> >> >>>         >
> >> >>>         > There have been no further notes on this for 7 months since
> >> >>>         > IMSC1 has been published as a Recommendation
> >> >>>         > (https://www.w3.org/TR/ttml-imsc1/)
> >> >>>         >
> >> >>>         >
> >> >>>         > Could you please help the TTWG to liaise with Unicode to
> >> >>>         > allow moving forward?
> >> >>>         >
> >> >>>         > I guess Mark Davis is the liaison contact for Unicode.
> >> >>>         >
> >> >>>         > Thierry.
> >> >>>         >
> >> >>>
> >> >>>
> >> >
> >
> >
>
Received on Monday, 2 January 2017 19:05:02 UTC