Re: liaison for a Unicode ticket from Pierre-Anthony Lemieux on 2017-01-02 (public-tt@w3.org from January 2017)

From: Pierre-Anthony Lemieux <pal@sandflow.com>
Date: Mon, 2 Jan 2017 09:09:48 -0800
To: Shervin Afshar <safshar@netflix.com>
Cc: Thierry MICHEL <tmichel@w3.org>, Nigel Megitt <nigel.megitt@bbc.co.uk>, Richard Ishida <ishida@w3.org>, Mark Davis <mark@macchiato.com>, Steven R Loomis <srloomis@us.ibm.com>, "public-tt@w3.org" <public-tt@w3.org>
Message-ID: <CAF_7JxAvfO2hva_72Wj7q8b=T3i3C_yqC8c9_-fe2NLNh8qhCw@mail.gmail.com>
Hi Shervin,

Thanks for the feedback and the illustrative example.

> in actuality what is needed is a more granular usage of "xml:lang" attribute according to W3C i18n
> best practices[1] to distinguish non-Turkish content present in a Turkish context:

Yes. This does not however address two other issues:

- how would an author creating subtitles/captions targeting the
Turkish locale (and/or an implementer wishing to support the Turkish
locale) know that Dutch characters can be present?

- some characters, such as the musical note ("♩"), are used for
subtitles/captions across many locales.

For this reason, ticket #8915 [1] suggests:

- adding an entirely new class ('subtitleCharacters') of sets to CLDR
-- instead of overloading the main and auxiliary sets, which, as you
point out, cover only those exemplars that are used in the language of
the locale (and are selected using xml:lang)

- defining a 'base' subtitle/captions exemplar set available across all locales

[1] http://unicode.org/cldr/trac/ticket/8915
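
To make the two points above concrete, here is a minimal sketch (an illustration only, not part of ticket #8915) of how a checker might combine xml:lang with a per-locale set plus a shared 'base' set. The names BASE_SET, LOCALE_SETS, text_runs and uncovered are hypothetical, and the character sets are made-up placeholders, not actual CLDR data.

```python
import string
import xml.etree.ElementTree as ET

# Attribute key that ElementTree uses for xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical 'base' subtitle/caption set shared by all locales (placeholder data).
BASE_SET = set(".,!?♩♪")

# Hypothetical per-locale sets (placeholder data, NOT the CLDR exemplar sets).
LOCALE_SETS = {
    "tr": set("abcçdefgğhıijklmnoöprsştuüvyz" "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"),
    "nl": set(string.ascii_letters) | set("ĳĲ"),
}


def text_runs(elem, inherited_lang=""):
    """Yield (lang, text) pairs, resolving xml:lang by inheritance."""
    lang = elem.get(XML_LANG, inherited_lang)
    if elem.text:
        yield lang, elem.text
    for child in elem:
        yield from text_runs(child, lang)
        if child.tail:
            # Tail text sits after the child, so it takes the parent's language.
            yield lang, child.tail


def uncovered(doc, locale_sets, base_set):
    """Map each language to the characters outside (locale set | base set)."""
    missing = {}
    for lang, text in text_runs(ET.fromstring(doc)):
        allowed = locale_sets.get(lang, set()) | base_set
        extra = {c for c in text if not c.isspace() and c not in allowed}
        if extra:
            missing.setdefault(lang, set()).update(extra)
    return missing


SAMPLE = """<body>
  <div>
    <p xml:lang="tr">Dost kara günde belli olur. ♩<br/>
      <span xml:lang="nl">De ratten verlaten het zinkende schip.</span>
    </p>
  </div>
</body>"""

if __name__ == "__main__":
    print(uncovered(SAMPLE, LOCALE_SETS, BASE_SET))
```

In this sketch the musical note is accepted in any language only because it sits in the shared base set, and Dutch text is checked against the Dutch set only when it is tagged with xml:lang="nl". Untagged Dutch text containing, say, "ĳ" would be reported against "tr".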

Perhaps these issues have been discussed before. Looking forward to
your thoughts.

> For now I will proceed with actions for adding the clear-cut cases (which I mentioned in a previous email)
> to the relevant exemplar sets.

Thanks. Seems very useful regardless :)

Best,

-- Pierre


On Mon, Dec 26, 2016 at 12:21 PM, Shervin Afshar <safshar@netflix.com> wrote:
> Hi Pierre-Anthony,
>
> I understand the lax criteria in order to cover a larger number of cases, but
> we still need to scrutinize and investigate some of the questionable
> additions. Another example is U+02BC (MODIFIER LETTER APOSTROPHE): although
> it is relevant to the punctuation exemplar for one locale, its presence in
> others might prove redundant and problematic in the grand scheme of things.
> Another such unlikely addition is U+0132 (LATIN CAPITAL LIGATURE IJ), which
> is only present in Turkish, with no apparent explanation (since it is used
> in Dutch).
>
> One could certainly imagine a timed-text asset for Turkish to have some
> strings of Dutch in it and it might be reasoned that therefore Dutch
> exemplar set (or a subset of it) should be added to Turkish exemplar sets as
> auxiliary, but in actuality what is needed is a more granular usage of
> "xml:lang" attribute according to W3C i18n best practices[1] to distinguish
> non-Turkish content present in a Turkish context:
>
> <body region="subtitleArea">
>     <div>
>       <p xml:id="subtitle1" xml:lang="tr" begin="0.76s" end="3.45s">
>         Dost kara günde belli olur.<br />
>         <span xml:lang="nl">De ratten verlaten het zinkende schip.</span>
>       </p>
> ...
>    </div>
> </body>
>
> Although the complexity of an all-around addition of characters to
> exemplar sets to maximize coverage might seem marginal, as far as I'm
> aware, CLDR has been selective about such additions.
>
> For now I will proceed with actions for adding the clear-cut cases (which I
> mentioned in a previous email) to the relevant exemplar sets.
>
> [1]: https://www.w3.org/International/questions/qa-when-xmllang.en#when_use
>
> Best regards,
> Shervin
>
> On Wed, Dec 21, 2016 at 11:22 PM, Pierre-Anthony Lemieux <pal@sandflow.com>
> wrote:
>>
>> Hi Shervin,
>>
>> Thanks for the update, and to the CLDR TC for considering the input.
>>
>> > In some other cases it's not very clear if the inclusion of a specific
>> > character is justified or simply due to
>> > bad data (e.g. U+017F, LATIN SMALL LETTER LONG S which is included in
>> > the set latnExtA provided in [2]).
>>
>> I believe that the recommended sets erred on the side of caution, and
>> were created to deliberately cast a wider, rather than narrower, net
>> whenever possible. For instance, the recommended set for each of the
>> "lv,lt,et,hr,cs,pl,sl,sk,tr" locales includes all of the Latin
>> Extended-A block, instead of attempting to optimize each set at the
>> risk of missing important characters -- the general assumption being
>> that the incremental complexity of supporting all versus parts of the
>> Latin Extended-A block would be marginal, e.g. implementations support
>> all or none of the Latin Extended-A block.
>>
>> > I will update the thread and the ticket with the next steps when I get
>> > to check for anomalies of that sort.
>>
>> Looking forward to your feedback.
>>
>> Best,
>>
>> -- Pierre
>>
>> On Mon, Dec 12, 2016 at 12:50 PM, Shervin Afshar <safshar@netflix.com>
>> wrote:
>> > Thanks for the new comparison report[1]. CLDR TC discussed this again last
>> > week and, looking at the report, it seems that in some cases the issue can
>> > be addressed by adding characters to one of the CLDR exemplar categories
>> > for the respective locale; e.g. for Arabic, U+060D (Arabic date separator)
>> > or for Hebrew, U+05C3 (Sof Pasuq). In some other cases it's not very clear
>> > if the inclusion of a specific character is justified or simply due to bad
>> > data (e.g. U+017F, LATIN SMALL LETTER LONG S which is included in the set
>> > latnExtA provided in [2]).
>> >
>> > Therefore, a closer inspection of each set seems necessary. I will update
>> > the thread and the ticket with the next steps when I get to check for
>> > anomalies of that sort.
>> >
>> > [1]: http://www.sandflow.com/public/CLDR-report-20161204.txt
>> > [2]: https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml
>> >
>> > Best regards,
>> > Shervin
>> >
>> > On Fri, Dec 9, 2016 at 1:14 AM, Thierry MICHEL <tmichel@w3.org> wrote:
>> >>
>> >> Hello,
>> >>
>> >> The TTWG provided feedback on the
>> >> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915>
>> >>
>> >> Looking forward to your review,
>> >>
>> >> Best regards,
>> >> Thierry Michel
>> >>
>> >>
>> >> On 05/12/2016 at 01:41, Shervin Afshar wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915> was
>> >>> discussed in the last technical committee meeting. We think that this
>> >>> use-case falls within the scope of the CLDR project, but to effectively
>> >>> add this data to benefit implementers and users, there are a few issues
>> >>> which need to be addressed. Most of these questions are reflected in the
>> >>> comment that Mark provided on the ticket (direct link
>> >>> <http://unicode.org/cldr/trac/ticket/8915#comment:8>). To summarize, the
>> >>> following items should be addressed and discussed:
>> >>>
>> >>> – Clarification on the intended usage of this data with regard to
>> >>> section 7.2 and Appendix B of TTML-IMSC1; e.g. inclusion/exclusion
>> >>> rationale, rationale for selection of "base" set;
>> >>> – Comparison between sets in proposed draft data
>> >>> <https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml>
>> >>> and existing CLDR exemplar types (main, aux, punctuation) in various
>> >>> locales;
>> >>> – Plans for providing data for other locales.
>> >>>
>> >>> Best regards,
>> >>> Shervin
>> >>>
>> >>>         ----- Original message -----
>> >>>         From: r12a <ishida@w3.org>
>> >>>         To: Mark Davis <mark@macchiato.com>, Shervin Afshar
>> >>>         <shervinafshar@gmail.com>,
>> >>>         Steven R Loomis/Cupertino/IBM@IBMUS
>> >>>         Cc: Thierry MICHEL <tmichel@w3.org>, W3C Public TTWG
>> >>>         <public-tt@w3.org>
>> >>>         Subject: Re: liaison for a Unicode ticket
>> >>>         Date: Tue, Nov 8, 2016 3:43 AM
>> >>>
>> >>>         hi Mark, Shervin, Steve,
>> >>>
>> >>>         It has been thirteen months since there was movement on this
>> >>>         query. Could one of you please contact Thierry and advise him
>> >>>         on how/whether it's possible to move forward the request of
>> >>>         the Timed Text WG?
>> >>>
>> >>>         thanks,
>> >>>         ri
>> >>>
>> >>>
>> >>>
>> >>>         On 03/11/2016 17:38, Thierry MICHEL wrote:
>> >>>         > Richard,
>> >>>         >
>> >>>         >
>> >>>         > The TTWG has a Unicode ticket for adding the following "CLDR
>> >>>         > supplemental data for subtitle and caption characters"
>> >>>         >
>> >>>         > The Unicode ticket is available at
>> >>>         > http://unicode.org/cldr/trac/ticket/8915
>> >>>         >
>> >>>         > There have been no further notes on this for 7 months since
>> >>>         > IMSC1 was published as a Recommendation
>> >>>         > (https://www.w3.org/TR/ttml-imsc1/)
>> >>>         >
>> >>>         >
>> >>>         > Could you please help the TTWG to liaise with Unicode to allow
>> >>>         > moving forward?
>> >>>         >
>> >>>         > I guess Mark Davis is the liaison contact for Unicode.
>> >>>         >
>> >>>         > Thierry.
>> >>>         >
>> >>>
>> >>>
>> >
>
>
Received on Monday, 2 January 2017 17:10:43 UTC