Re: liaison for a Unicode ticket from Pierre-Anthony Lemieux on 2017-01-02 (public-tt@w3.org from January 2017)

From: Pierre-Anthony Lemieux <pal@sandflow.com>
Date: Mon, 2 Jan 2017 11:16:39 -0800
To: Shervin Afshar <safshar@netflix.com>
Cc: Thierry MICHEL <tmichel@w3.org>, Nigel Megitt <nigel.megitt@bbc.co.uk>, Richard Ishida <ishida@w3.org>, Mark Davis <mark@macchiato.com>, Steven R Loomis <srloomis@us.ibm.com>, "public-tt@w3.org" <public-tt@w3.org>
Message-ID: <CAF_7JxCpP1ucCbv1xLWbmCWdyd7gRjcRoVE25w+bHq+Ue854OA@mail.gmail.com>

Hi Shervin,

Ok. It sounds like the overall proposal is as follows:

- CLDR: address those characters which are clear-cut and obviously
missing for a locale
- CLDR: address some anomalies which were previously mentioned in this thread
- CLDR: create a set of characters which are used across locales for
timed-text, e.g. ♩
- IMSC: recommend locales that are likely to be used in a given
distribution region (e.g. timed text intended for the Turkish locale
may include characters from locale X, Y, Z... which need to be
signaled using xml:lang?)

Did I get this right?
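
To make the last two items of the proposal above concrete, here is a
minimal Python sketch of how an implementation might combine a cross-locale
set with per-locale sets, keyed on the xml:lang in effect. All names and
character sets below are hypothetical and heavily abbreviated; none of them
are actual CLDR or IMSC data, and the Dutch sample sentence is made up for
illustration.

```python
import string
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical cross-locale "base" subtitle set (abbreviated, not CLDR data).
BASE_SUBTITLE_CHARS = set("♩.,!?-")

# Hypothetical per-locale sets (abbreviated, not CLDR data).
LOCALE_CHARS = {
    "tr": set("abcçdefgğhıijklmnoöprsştuüvyz"
              "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"),
    "nl": set(string.ascii_letters),
}


def allowed_chars(lang):
    """Base set plus the set for the given language, if one is known."""
    return BASE_SUBTITLE_CHARS | LOCALE_CHARS.get(lang, set())


def find_unexpected(element, inherited_lang=""):
    """Yield (lang, char) for characters outside the allowed set.

    xml:lang is resolved by inheritance from the nearest ancestor.
    """
    lang = element.get(XML_LANG, inherited_lang)
    allowed = allowed_chars(lang)
    for ch in element.text or "":
        if not ch.isspace() and ch not in allowed:
            yield lang, ch
    for child in element:
        yield from find_unexpected(child, lang)
        # Text following a child element is in the parent's language.
        for ch in child.tail or "":
            if not ch.isspace() and ch not in allowed:
                yield lang, ch


# Turkish paragraph with an embedded Dutch span and a musical note.
TAGGED = (
    '<p xml:lang="tr">Dost kara günde belli olur.<br />'
    '<span xml:lang="nl">Wij verlaten het schip.</span> ♩</p>'
)
# Same content, but the Dutch span is not signaled with xml:lang.
UNTAGGED = TAGGED.replace(' xml:lang="nl"', "")

if __name__ == "__main__":
    for label, doc in (("tagged", TAGGED), ("untagged", UNTAGGED)):
        problems = list(find_unexpected(ET.fromstring(doc)))
        print(label, problems)
```

With the span tagged as "nl", nothing is flagged; with the tag removed, the
"W" is reported as unexpected for "tr", while ♩ is accepted in both cases
because it comes from the base set.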

Best,

-- Pierre

On Mon, Jan 2, 2017 at 11:03 AM, Shervin Afshar <safshar@netflix.com> wrote:
> Hi Pierre-Anthony,
>
> Thanks for the input.
>
>> - how would an author creating subtitles/captions targeting the
>> Turkish locale (and/or an implementer wishing to support the Turkish
>> locale) know that Dutch characters can be present?
>
>
> This of course depends on the implementation, and it is out of scope for
> defining exemplar sets and used characters for each locale. A similar issue
> exists in copywriting for localizable content, and there are established best
> practices agreed upon among industry practitioners, e.g. using markup to
> denote the language of non-translatable content, high-level tooling to add
> markup when content for one language is embedded in another, etc.
>
>> - some characters, such as the musical note ("♩"), are used for
>> subtitles/captions across many locales.
>> For this reason, ticket #8915 [1] suggests:
>> - adding an entirely new class ('subtitleCharacters') of sets to CLDR
>> - defining a 'base' subtitle/captions exemplar set available across all
>> locales
>
>
> Agreed. My thinking is to first address those characters which are clear-cut
> and obviously missing for a locale, then address some anomalies which were
> previously mentioned in this thread. After this narrowing down, we end up
> with a set of characters which are used across locales for timed-text, and
> we can then address them separately. My assumption is that ♩ would fall in
> this last category.
>
> Best regards,
> Shervin
>
> On Mon, Jan 2, 2017 at 9:09 AM, Pierre-Anthony Lemieux <pal@sandflow.com>
> wrote:
>>
>> Hi Shervin,
>>
>> Thanks for the feedback and the illustrative example.
>>
>> > in actuality what is needed is a more granular usage of "xml:lang"
>> > attribute according to W3C i18n
>> > best practices[1] to distinguish non-Turkish content present in a
>> > Turkish context:
>>
>> Yes. This does not however address two other issues:
>>
>> - how would an author creating subtitles/captions targeting the
>> Turkish locale (and/or an implementer wishing to support the Turkish
>> locale) know that Dutch characters can be present?
>>
>> - some characters, such as the musical note ("♩"), are used for
>> subtitles/captions across many locales.
>>
>> For this reason, ticket #8915 [1] suggests:
>>
>> - adding an entirely new class ('subtitleCharacters') of sets to CLDR
>> -- instead of overloading the main and auxiliary sets, which, as you
>> point out, cover only those exemplars that are used in the language of
>> the locale (and are selected using xml:lang)
>>
>> - defining a 'base' subtitle/captions exemplar set available across all
>> locales
>>
>> [1] http://unicode.org/cldr/trac/ticket/8915
>>
>> Perhaps these issues have been discussed before. Looking forward to
>> your thoughts.
>>
>> > For now I will proceed with actions for adding the clear-cut cases
>> > (which I mentioned in a previous email)
>> > to the relevant exemplar sets.
>>
>> Thanks. Seems very useful regardless :)
>>
>> Best,
>>
>> -- Pierre
>>
>>
>> On Mon, Dec 26, 2016 at 12:21 PM, Shervin Afshar <safshar@netflix.com>
>> wrote:
>> > Hi Pierre-Anthony,
>> >
>> > I understand the lax criteria in order to cover a larger number of cases,
>> > but we still need to scrutinize and investigate some of the questionable
>> > additions. Another example is U+02BC (MODIFIER LETTER APOSTROPHE): though
>> > it is relevant to the punctuation exemplar for one locale, its presence
>> > in others might prove redundant and problematic in the grand scheme of
>> > things. Another instance of such unlikely additions is U+0132 (LATIN
>> > CAPITAL LIGATURE IJ), which is present only in Turkish, for no apparent
>> > reason (since it's used in Dutch).
>> >
>> > One could certainly imagine a timed-text asset for Turkish having some
>> > strings of Dutch in it, and it might be reasoned that the Dutch exemplar
>> > set (or a subset of it) should therefore be added to the Turkish exemplar
>> > sets as auxiliary. In actuality, what is needed is a more granular usage
>> > of the "xml:lang" attribute, according to W3C i18n best practices[1], to
>> > distinguish non-Turkish content present in a Turkish context:
>> >
>> > <body region="subtitleArea">
>> >   <div>
>> >     <p xml:id="subtitle1" xml:lang="tr" begin="0.76s" end="3.45s">
>> >       Dost kara günde belli olur.<br />
>> >       <span xml:lang="nl">De ratten verlaten het zinkende schip.</span>
>> >     </p>
>> >     ...
>> >   </div>
>> > </body>
>> >
>> > Although the complexity of an all-around addition of characters to
>> > exemplar sets to maximize coverage might seem marginal, as far as I'm
>> > aware, CLDR has been selective about such additions.
>> >
>> > For now I will proceed with actions for adding the clear-cut cases
>> > (which I mentioned in a previous email) to the relevant exemplar sets.
>> >
>> > [1]: https://www.w3.org/International/questions/qa-when-xmllang.en#when_use
>> >
>> > Best regards,
>> > Shervin
>> >
>> > On Wed, Dec 21, 2016 at 11:22 PM, Pierre-Anthony Lemieux
>> > <pal@sandflow.com>
>> > wrote:
>> >>
>> >> Hi Shervin,
>> >>
>> >> Thanks for the update, and to the CLDR TC for considering the input.
>> >>
>> >> > In some other cases it's not very clear if the inclusion of a specific
>> >> > character is justified or simply due to bad data (e.g. U+017F, LATIN
>> >> > SMALL LETTER LONG S, which is included in the set latnExtA provided in
>> >> > [2]).
>> >>
>> >> I believe that the recommended sets erred on the side of caution, and
>> >> were created to deliberately cast a wider, rather than narrower, net
>> >> whenever possible. For instance, the recommended set for each of the
>> >> "lv,lt,et,hr,cs,pl,sl,sk,tr" locales includes all of the Latin
>> >> Extended-A block, instead of attempting to optimize each set at the
>> >> risk of missing important characters -- the general assumption being
>> >> that the incremental complexity of supporting all versus parts of the
>> >> Latin Extended-A block would be marginal, e.g. implementations support
>> >> all or none of the Latin Extended-A block.
>> >>
>> >> > I will update the thread and the ticket with the next steps when I get
>> >> > to check for anomalies of that sort.
>> >>
>> >> Looking forward to your feedback.
>> >>
>> >> Best,
>> >>
>> >> -- Pierre
>> >>
>> >> On Mon, Dec 12, 2016 at 12:50 PM, Shervin Afshar <safshar@netflix.com>
>> >> wrote:
>> >> > Thanks for the new comparison report[1]. CLDR TC discussed this again
>> >> > last week and, looking at the report, it seems that in some cases the
>> >> > issue can be addressed by adding characters to one of the CLDR exemplar
>> >> > categories for the respective locale; e.g. for Arabic, U+060D (Arabic
>> >> > date separator) or for Hebrew, U+05C3 (Sof Pasuq). In some other cases
>> >> > it's not very clear if the inclusion of a specific character is
>> >> > justified or simply due to bad data (e.g. U+017F, LATIN SMALL LETTER
>> >> > LONG S, which is included in the set latnExtA provided in [2]).
>> >> >
>> >> > Therefore, a closer inspection of each set seems necessary. I will
>> >> > update the thread and the ticket with the next steps when I get to
>> >> > check for anomalies of that sort.
>> >> >
>> >> > [1]: http://www.sandflow.com/public/CLDR-report-20161204.txt
>> >> > [2]: https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml
>> >> >
>> >> > Best regards,
>> >> > Shervin
>> >> >
>> >> > On Fri, Dec 9, 2016 at 1:14 AM, Thierry MICHEL <tmichel@w3.org>
>> >> > wrote:
>> >> >>
>> >> >> Hello,
>> >> >>
>> >> >> The TTWG provided feedback on the
>> >> >> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915>
>> >> >>
>> >> >> Looking forward to your review,
>> >> >>
>> >> >> Best regards,
>> >> >> Thierry Michel
>> >> >>
>> >> >>
>> >> >> Le 05/12/2016 à 01:41, Shervin Afshar a écrit :
>> >> >>>
>> >> >>> Hello,
>> >> >>>
>> >> >>> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915> was
>> >> >>> discussed in the last technical committee meeting. We think that this
>> >> >>> use-case falls within the scope of the CLDR project, but to effectively
>> >> >>> add this data to benefit implementers and users, there are a few
>> >> >>> issues which need to be addressed. Most of these questions are
>> >> >>> reflected in the comment that Mark provided on the ticket (direct link
>> >> >>> <http://unicode.org/cldr/trac/ticket/8915#comment:8>). To summarize,
>> >> >>> the following items should be addressed and discussed:
>> >> >>>
>> >> >>> – Clarification on the intended usage of this data with regard to
>> >> >>> section 7.2 and Appendix B of TTML-IMSC1; e.g. inclusion/exclusion
>> >> >>> rationale, rationale for selection of "base" set;
>> >> >>> – Comparison between sets in proposed draft data
>> >> >>> <https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml>
>> >> >>> and existing CLDR exemplar types (main, aux, punctuation) in various
>> >> >>> locales;
>> >> >>> – Plans for providing data for other locales.
>> >> >>>
>> >> >>> Best regards,
>> >> >>> Shervin
>> >> >>>
>> >> >>>         ----- Original message -----
>> >> >>>         From: r12a <ishida@w3.org>
>> >> >>>         To: Mark Davis <mark@macchiato.com>, Shervin Afshar
>> >> >>>         <shervinafshar@gmail.com>,
>> >> >>>         Steven R Loomis/Cupertino/IBM@IBMUS
>> >> >>>         Cc: Thierry MICHEL <tmichel@w3.org>,
>> >> >>>         W3C Public TTWG <public-tt@w3.org>
>> >> >>>         Subject: Re: liaison for a Unicode ticket
>> >> >>>         Date: Tue, Nov 8, 2016 3:43 AM
>> >> >>>
>> >> >>>         hi Mark, Shervin, Steve,
>> >> >>>
>> >> >>>         It has been thirteen months since there was movement on this
>> >> >>>         query. Could one of you please contact Thierry and advise him
>> >> >>>         on how/whether it's possible to move forward the request of
>> >> >>>         the Timed Text WG?
>> >> >>>
>> >> >>>         thanks,
>> >> >>>         ri
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>         On 03/11/2016 17:38, Thierry MICHEL wrote:
>> >> >>>         > Richard,
>> >> >>>         >
>> >> >>>         >
>> >> >>>         > The TTWG has a Unicode ticket for adding the following "CLDR
>> >> >>>         > supplemental data for subtitle and caption characters"
>> >> >>>         >
>> >> >>>         > The Unicode ticket is available at
>> >> >>>         > http://unicode.org/cldr/trac/ticket/8915
>> >> >>>         >
>> >> >>>         > There have been no further notes on this for 7 months since
>> >> >>>         > IMSC1 has been published as a Recommendation
>> >> >>>         > (https://www.w3.org/TR/ttml-imsc1/)
>> >> >>>         >
>> >> >>>         >
>> >> >>>         > Could you please help the TTWG to liaise with Unicode to allow
>> >> >>>         > moving forward?
>> >> >>>         >
>> >> >>>         > I guess Mark Davis is the liaison contact for Unicode.
>> >> >>>         >
>> >> >>>         > Thierry.
>> >> >>>         >
>> >> >>>
>> >> >>>
>> >> >
>> >
>> >
>
>
Received on Monday, 2 January 2017 19:17:34 UTC