Re: liaison for a Unicode ticket

That's an accurate summary of the actions on my side with regard to CLDR
data. I have just one comment to add:


> - CLDR: address some anomalies which were previously mentioned in this
> thread
>

In such cases, TTWG assistance would be needed to clarify whether a character
is actually needed in a locale, or whether it (a) was included due to an error
in the collected data and can be ignored, or (b) was included intentionally to
expand the coverage.
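
For illustration only, the kind of triage this implies could be sketched
roughly as below. This is a minimal Python sketch, not CLDR tooling: the
function name characters_needing_review is made up, and the Turkish exemplar
subset is a hand-picked placeholder, not actual CLDR data. It takes the whole
Latin Extended-A block (U+0100..U+017F) as the recommended set, as described
earlier in this thread, and lists whatever falls outside the exemplar set so
that each leftover character can be judged an error or an intentional
addition.

```python
import unicodedata

# Latin Extended-A block, U+0100..U+017F. The thread notes that the
# recommended set for "tr" includes this whole block.
LATIN_EXTENDED_A = {chr(cp) for cp in range(0x0100, 0x0180)}

# ASSUMPTION: hand-picked placeholder (G/g with breve, dotted capital I,
# dotless i, S/s with cedilla). This is NOT real CLDR exemplar data.
TURKISH_EXEMPLAR_SUBSET = set("\u011e\u011f\u0130\u0131\u015e\u015f")


def characters_needing_review(recommended, exemplar):
    """Return (code point label, Unicode name) pairs for characters that are
    in the recommended set but absent from the locale's exemplar set."""
    leftovers = sorted(recommended - exemplar)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in leftovers
    ]


review_list = characters_needing_review(LATIN_EXTENDED_A, TURKISH_EXEMPLAR_SUBSET)
for label, name in review_list:
    print(label, name)
```

The printed list includes U+0132 LATIN CAPITAL LIGATURE IJ and U+017F LATIN
SMALL LETTER LONG S, the two code points questioned earlier in this thread.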

Best regards,
Shervin

On Mon, Jan 2, 2017 at 11:16 AM, Pierre-Anthony Lemieux <pal@sandflow.com>
wrote:

> Hi Shervin,
>
> Ok. It sounds like the overall proposal is as follows:
>
> - CLDR: address those characters which are clear-cut and obviously
> missing for a locale
> - CLDR: address some anomalies which were previously mentioned in this
> thread
> - CLDR: create a set of characters which are used across locales for
> timed-text, e.g. ♩
> - IMSC: recommend locales that are likely to be used in a given
> distribution region (e.g. timed text intended for the Turkish locale
> may include characters from locale X, Y, Z... which need to be
> signaled using xml:lang?)
>
> Did I get this right?
>
> Best,
>
> -- Pierre
>
> On Mon, Jan 2, 2017 at 11:03 AM, Shervin Afshar <safshar@netflix.com>
> wrote:
> > Hi Pierre-Anthony,
> >
> > Thanks for the input.
> >
> >> - how would an author creating subtitles/captions targeting the
> >> Turkish locale (and/or an implementer wishing to support the Turkish
> >> locale) know that Dutch characters can be present?
> >
> >
> > This of course depends on implementation and it is out of scope of defining
> > exemplar sets and used characters for each locale. A similar issue exists in
> > copy-writing for localizable content and there are established best
> > practices agreed-upon among the industry practitioners; e.g. using markup to
> > denote the language of non-translatable content, high-level tooling to add
> > markup when content for one language is embedded in another, etc.
> >
> >> - some characters, such as the musical note ("♩"), are used for
> >> subtitles/captions across many locales.
> >> For this reason, ticket #8915 [1] suggests:
> >> - adding an entirely new class ('subtitleCharacters') of sets to CLDR
> >> - defining a 'base' subtitle/captions exemplar set available across all
> >> locales
> >
> >
> > Agreed. My thinking is to first address those characters which are clear-cut
> > and obviously missing for a locale. Then address some anomalies which were
> > previously mentioned in this thread. After these narrowing-downs, we end up
> > with a set of characters which are used across locales for timed-text and
> > then we can address them separately. My assumption is that ♩ would fall in
> > this last category.
> >
> > Best regards,
> > Shervin
> >
> > On Mon, Jan 2, 2017 at 9:09 AM, Pierre-Anthony Lemieux <pal@sandflow.com>
> > wrote:
> >>
> >> Hi Shervin,
> >>
> >> Thanks for the feedback and the illustrative example.
> >>
> >> > in actuality what is needed is a more granular usage of "xml:lang"
> >> > attribute according to W3C i18n
> >> > best practices[1] to distinguish non-Turkish content present in a
> >> > Turkish context:
> >>
> >> Yes. This does not however address two other issues:
> >>
> >> - how would an author creating subtitles/captions targeting the
> >> Turkish locale (and/or an implementer wishing to support the Turkish
> >> locale) know that Dutch characters can be present?
> >>
> >> - some characters, such as the musical note ("♩"), are used for
> >> subtitles/captions across many locales.
> >>
> >> For this reason, ticket #8915 [1] suggests:
> >>
> >> - adding an entirely new class ('subtitleCharacters') of sets to CLDR
> >> -- instead of overloading the main and auxiliary sets, which, as you
> >> point out, cover only those exemplars that are used in the language of
> >> the locale (and are selected using xml:lang)
> >>
> >> - defining a 'base' subtitle/captions exemplar set available across all
> >> locales
> >>
> >> [1] http://unicode.org/cldr/trac/ticket/8915
> >>
> >> Perhaps these issues have been discussed before. Looking forward to
> >> your thoughts.
> >>
> >> > For now I will proceed with actions for adding the clear-cut cases
> >> > (which I mentioned in a previous email)
> >> > to the relevant exemplar sets.
> >>
> >> Thanks. Seems very useful regardless :)
> >>
> >> Best,
> >>
> >> -- Pierre
> >>
> >>
> >> On Mon, Dec 26, 2016 at 12:21 PM, Shervin Afshar <safshar@netflix.com>
> >> wrote:
> >> > Hi Pierre-Anthony,
> >> >
> >> > I understand the lax criteria in order to cover a larger number of cases,
> >> > but we still need to scrutinize and investigate some of the questionable
> >> > additions. Another example is U+02BC (MODIFIER LETTER APOSTROPHE):
> >> > although it is relevant to the punctuation exemplar for one locale, its
> >> > presence in others might prove redundant and problematic in the grand
> >> > scheme of things. Another instance of such unlikely additions is U+0132
> >> > (LATIN CAPITAL LIGATURE IJ), which is only present in Turkish, and it
> >> > cannot be explained why (since it's used in Dutch).
> >> >
> >> > One could certainly imagine a timed-text asset for Turkish to have some
> >> > strings of Dutch in it and it might be reasoned that therefore the Dutch
> >> > exemplar set (or a subset of it) should be added to the Turkish exemplar
> >> > sets as auxiliary, but in actuality what is needed is a more granular
> >> > usage of the "xml:lang" attribute according to W3C i18n best
> >> > practices[1] to distinguish non-Turkish content present in a Turkish
> >> > context:
> >> >
> >> > <body region="subtitleArea">
> >> >   <div>
> >> >     <p xml:id="subtitle1" xml:lang="tr" begin="0.76s" end="3.45s">
> >> >       Dost kara günde belli olur.<br />
> >> >       <span xml:lang="nl">De ratten verlaten het zinkende schip.</span>
> >> >     </p>
> >> >     ...
> >> >   </div>
> >> > </body>
> >> >
> >> > Although the complexity of an all-around addition of characters to
> >> > exemplar sets to maximize coverage might seem marginal, as far as I'm
> >> > aware, CLDR has been selective about such additions.
> >> >
> >> > For now I will proceed with actions for adding the clear-cut cases
> >> > (which I mentioned in a previous email) to the relevant exemplar sets.
> >> >
> >> > [1]:
> >> > https://www.w3.org/International/questions/qa-when-xmllang.en#when_use
> >> >
> >> > Best regards,
> >> > Shervin
> >> >
> >> > On Wed, Dec 21, 2016 at 11:22 PM, Pierre-Anthony Lemieux
> >> > <pal@sandflow.com>
> >> > wrote:
> >> >>
> >> >> Hi Shervin,
> >> >>
> >> >> Thanks for the update, and to the CLDR TC for considering the input.
> >> >>
> >> >> > In some other cases it's not very clear if the inclusion of a
> >> >> > specific character is justified or simply due to
> >> >> > bad data (e.g. U+017F, LATIN SMALL LETTER LONG S which is included in
> >> >> > the set latnExtA provided in [2]).
> >> >>
> >> >> I believe that the recommended sets erred on the side of caution, and
> >> >> were created to deliberately cast a wider, rather than narrower, net
> >> >> whenever possible. For instance, the recommended set for each of the
> >> >> "lv,lt,et,hr,cs,pl,sl,sk,tr" locales includes all of the Latin
> >> >> Extended-A block, instead of attempting to optimize each set at the
> >> >> risk of missing important characters -- the general assumption being
> >> >> that the incremental complexity of supporting all versus parts of the
> >> >> Latin Extended-A block would be marginal, e.g. implementations support
> >> >> all or none of the Latin Extended-A block.
> >> >>
> >> >> > I will update the thread and the ticket with the next steps when I
> >> >> > get to check for anomalies of that sort.
> >> >>
> >> >> Looking forward to your feedback.
> >> >>
> >> >> Best,
> >> >>
> >> >> -- Pierre
> >> >>
> >> >> On Mon, Dec 12, 2016 at 12:50 PM, Shervin Afshar <safshar@netflix.com>
> >> >> wrote:
> >> >> > Thanks for the new comparison report[1]. CLDR TC discussed this again
> >> >> > last week and looking at the report, it seems that in some cases the
> >> >> > issue can be addressed by adding characters to one of the CLDR
> >> >> > exemplar categories for the respective locale; e.g. for Arabic,
> >> >> > U+060D (Arabic date separator) or for Hebrew, U+05C3 (Sof Pasuq). In
> >> >> > some other cases it's not very clear if the inclusion of a specific
> >> >> > character is justified or simply due to bad data (e.g. U+017F, LATIN
> >> >> > SMALL LETTER LONG S which is included in the set latnExtA provided in
> >> >> > [2]).
> >> >> >
> >> >> > Therefore, a closer inspection of each set seems necessary. I will
> >> >> > update the thread and the ticket with the next steps when I get to
> >> >> > check for anomalies of that sort.
> >> >> >
> >> >> > [1]: http://www.sandflow.com/public/CLDR-report-20161204.txt
> >> >> > [2]:
> >> >> > https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml
> >> >> >
> >> >> > Best regards,
> >> >> > Shervin
> >> >> >
> >> >> > On Fri, Dec 9, 2016 at 1:14 AM, Thierry MICHEL <tmichel@w3.org>
> >> >> > wrote:
> >> >> >>
> >> >> >> Hello,
> >> >> >>
> >> >> >> The TTWG provided feedback on the
> >> >> >> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915>
> >> >> >>
> >> >> >> Looking forward to your review,
> >> >> >>
> >> >> >> Best regards,
> >> >> >> Thierry Michel
> >> >> >>
> >> >> >>
> >> >> >> Le 05/12/2016 à 01:41, Shervin Afshar a écrit :
> >> >> >>>
> >> >> >>> Hello,
> >> >> >>>
> >> >> >>> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915> was
> >> >> >>> discussed in the last technical committee meeting. We think that
> >> >> >>> this use-case falls within the scope of the CLDR project, but to
> >> >> >>> effectively add this data to benefit implementers and users, there
> >> >> >>> are a few issues which need to be addressed. Most of these
> >> >> >>> questions are reflected in the comment that Mark provided on the
> >> >> >>> ticket (direct link
> >> >> >>> <http://unicode.org/cldr/trac/ticket/8915#comment:8>). To
> >> >> >>> summarize, the following items should be addressed and discussed:
> >> >> >>>
> >> >> >>> – Clarification on the intended usage of this data with regards to
> >> >> >>> section 7.2 and Appendix B of TTML-IMSC1; e.g. inclusion/exclusion
> >> >> >>> rationale, rationale for selection of "base" set;
> >> >> >>> – Comparison between sets in proposed draft data
> >> >> >>> <https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml>
> >> >> >>> and existing CLDR exemplar types (main, aux, punctuation) in various
> >> >> >>> locales;
> >> >> >>> – Plans for providing data for other locales.
> >> >> >>>
> >> >> >>> Best regards,
> >> >> >>> Shervin
> >> >> >>>
> >> >> >>>         ----- Original message -----
> >> >> >>>         From: r12a <ishida@w3.org>
> >> >> >>>         To: Mark Davis <mark@macchiato.com>, Shervin Afshar
> >> >> >>>         <shervinafshar@gmail.com>,
> >> >> >>>         Steven R Loomis/Cupertino/IBM@IBMUS
> >> >> >>>         Cc: Thierry MICHEL <tmichel@w3.org>, W3C
> >> >> >>>         Public TTWG <public-tt@w3.org>
> >> >> >>>         Subject: Re: liaison for a Unicode ticket
> >> >> >>>         Date: Tue, Nov 8, 2016 3:43 AM
> >> >> >>>
> >> >> >>>         hi Mark, Shervin, Steve,
> >> >> >>>
> >> >> >>>         It has been thirteen months since there was movement on this
> >> >> >>>         query. Could one of you please contact Thierry and advise him
> >> >> >>>         on how/whether it's possible to move forward the request of
> >> >> >>>         the Timed Text WG?
> >> >> >>>
> >> >> >>>         thanks,
> >> >> >>>         ri
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>         On 03/11/2016 17:38, Thierry MICHEL wrote:
> >> >> >>>         > Richard,
> >> >> >>>         >
> >> >> >>>         >
> >> >> >>>         > The TTWG has a Unicode ticket for adding the following
> >> >> >>>         > "CLDR supplemental data for subtitle and caption characters"
> >> >> >>>         >
> >> >> >>>         > The Unicode ticket is available at
> >> >> >>>         > http://unicode.org/cldr/trac/ticket/8915
> >> >> >>>         >
> >> >> >>>         > There have been no further notes on this for 7 months since
> >> >> >>>         > IMSC1 was published as a Recommendation
> >> >> >>>         > (https://www.w3.org/TR/ttml-imsc1/)
> >> >> >>>         >
> >> >> >>>         >
> >> >> >>>         > Could you please help the TTWG to liaise with Unicode to
> >> >> >>>         > allow moving forward?
> >> >> >>>         >
> >> >> >>>         > I guess Mark Davis is the liaison contact for Unicode.
> >> >> >>>         >
> >> >> >>>         > Thierry.
> >> >> >>>         >
> >> >> >>>
> >> >> >>>
> >> >> >
> >> >
> >> >
> >
> >
>

Received on Monday, 2 January 2017 19:57:34 UTC