- From: Shervin Afshar <safshar@netflix.com>
- Date: Mon, 2 Jan 2017 11:03:48 -0800
- To: Pierre-Anthony Lemieux <pal@sandflow.com>
- Cc: Thierry MICHEL <tmichel@w3.org>, Nigel Megitt <nigel.megitt@bbc.co.uk>, Richard Ishida <ishida@w3.org>, Mark Davis <mark@macchiato.com>, Steven R Loomis <srloomis@us.ibm.com>, "public-tt@w3.org" <public-tt@w3.org>
- Message-ID: <CABEdNYLzJDug65a10DQvMOsVS6=YW_=J9gKu1B-oPi0p8cknLA@mail.gmail.com>
Hi Pierre-Anthony,

Thanks for the input.

> - how would an author creating subtitles/captions targeting the
> Turkish locale (and/or an implementer wishing to support the Turkish
> locale) know that Dutch characters can be present?

This of course depends on the implementation, and it is out of scope for defining exemplar sets and the characters used in each locale. A similar issue exists in copywriting for localizable content, and there are established best practices agreed upon among industry practitioners; e.g. using markup to denote the language of non-translatable content, high-level tooling to add markup when content in one language is embedded in another, etc.

> - some characters, such as the musical note ("♩"), are used for
> subtitles/captions across many locales.
>
> For this reason, ticket #8915 [1] suggests:
>
> - adding an entirely new class ('subtitleCharacters') of sets to CLDR
> - defining a 'base' subtitle/captions exemplar set available across all
> locales

Agreed. My thinking is to first address those characters which are clear-cut and obviously missing for a locale, then address the anomalies previously mentioned in this thread. After narrowing things down this way, we end up with a set of characters which are used across locales for timed text, and we can address those separately. My assumption is that ♩ would fall into this last category.

Best regards,
Shervin

On Mon, Jan 2, 2017 at 9:09 AM, Pierre-Anthony Lemieux <pal@sandflow.com> wrote:

> Hi Shervin,
>
> Thanks for the feedback and the illustrative example.
>
> > in actuality what is needed is a more granular usage of "xml:lang"
> > attribute according to W3C i18n best practices[1] to distinguish
> > non-Turkish content present in a Turkish context:
>
> Yes. This does not however address two other issues:
>
> - how would an author creating subtitles/captions targeting the
> Turkish locale (and/or an implementer wishing to support the Turkish
> locale) know that Dutch characters can be present?
>
> - some characters, such as the musical note ("♩"), are used for
> subtitles/captions across many locales.
>
> For this reason, ticket #8915 [1] suggests:
>
> - adding an entirely new class ('subtitleCharacters') of sets to CLDR
> -- instead of overloading the main and auxiliary sets, which, as you
> point out, cover only those exemplars that are used in the language of
> the locale (and are selected using xml:lang)
>
> - defining a 'base' subtitle/captions exemplar set available across all
> locales
>
> [1] http://unicode.org/cldr/trac/ticket/8915
>
> Perhaps these issues have been discussed before. Looking forward to
> your thoughts.
>
> > For now I will proceed with actions for adding the clear-cut cases
> > (which I mentioned in a previous email) to the relevant exemplar sets.
>
> Thanks. Seems very useful regardless :)
>
> Best,
>
> -- Pierre
>
> On Mon, Dec 26, 2016 at 12:21 PM, Shervin Afshar <safshar@netflix.com> wrote:
> > Hi Pierre-Anthony,
> >
> > I understand the lax criteria in order to cover larger number of cases, but
> > we still need to scrutinize and investigate some of the questionable
> > additions; another example being U+02BC (MODIFIER LETTER APOSTROPHE) which
> > though relevant to be added to the punctuation exemplar for one locale,
> > presence of it in others might prove redundant and problematic in the grand
> > scheme of things. Another instance of such unlikely additions is U+0132
> > (LATIN CAPITAL LIGATURE IJ) which is only present in Turkish, but can not be
> > explained why (since it's used in Dutch).
> >
> > One could certainly imagine a timed-text asset for Turkish to have some
> > strings of Dutch in it and it might be reasoned that therefore Dutch
> > exemplar set (or a subset of it) should be added to Turkish exemplar sets as
> > auxiliary, but in actuality what is needed is a more granular usage of
> > "xml:lang" attribute according to W3C i18n best practices[1] to distinguish
> > non-Turkish content present in a Turkish context:
> >
> > <body region="subtitleArea">
> >   <div>
> >     <p xml:id="subtitle1" xml:lang="tr" begin="0.76s" end="3.45s">
> >       Dost kara günde belli olur.<br />
> >       <span xml:lang="nl">De ratten verlaten het zinkende schip.</span>
> >     </p>
> >     ...
> >   </div>
> > </body>
> >
> > Although the complexity of an all-around additional of characters to
> > exemplar sets to maximize coverage might seem marginal, but as far as I'm
> > aware of, CLDR has been selective about such additions.
> >
> > For now I will proceed with actions for adding the clear-cut cases (which I
> > mentioned in a previous email) to the relevant exemplar sets.
> >
> > [1]: https://www.w3.org/International/questions/qa-when-xmllang.en#when_use
> >
> > Best regards,
> > Shervin
> >
> > On Wed, Dec 21, 2016 at 11:22 PM, Pierre-Anthony Lemieux <pal@sandflow.com> wrote:
> >>
> >> Hi Shervin,
> >>
> >> Thanks for the update, and to the CLDR TC for considering the input.
> >>
> >> > In some other cases it's not very clear if the inclusion of a specific
> >> > characters is justified or simply due to bad data (e.g. u+017F, LATIN
> >> > SMALL LETTER LONG S which is included in the set latnExtA provided in [2]).
> >>
> >> I believe that the recommended sets erred on the side of caution, and
> >> were created to deliberately cast a wider, rather than narrower, net
> >> whenever possible.
> >> For instance, the recommended set for each of the
> >> "lv,lt,et,hr,cs,pl,sl,sk,tr" locales includes all of the Latin
> >> Extended-A block, instead of attempting to optimize each sets at the
> >> risk of missing important characters -- the general assumption being
> >> that the incremental complexity of supporting all versus parts of the
> >> Latin Extended-A block would be marginal, e.g. implementations support
> >> all or none of the Latin Extended-A block.
> >>
> >> > I will update the thread and the ticket with the next steps when I get
> >> > to check for anomalies of that sort.
> >>
> >> Looking forward to your feedback.
> >>
> >> Best,
> >>
> >> -- Pierre
> >>
> >> On Mon, Dec 12, 2016 at 12:50 PM, Shervin Afshar <safshar@netflix.com> wrote:
> >> > Thanks for the new comparison report[1]. CLDR TC discussed this again last
> >> > week and looking at the report, it seems that in some cases the issue can be
> >> > addressed by adding characters to one of CLDR exemplar categories for the
> >> > respective locale; e.g. for Arabic, U+060D (Arabic date separator) or for
> >> > Hebrew, U+05C3 (Sof Pasuq). In some other cases it's not very clear if the
> >> > inclusion of a specific characters is justified or simply due to bad data
> >> > (e.g. u+017F, LATIN SMALL LETTER LONG S which is included in the set
> >> > latnExtA provided in [2]).
> >> >
> >> > Therefore, a closer inspection of each set seems necessary. I will update
> >> > the thread and the ticket with the next steps when I get to check for
> >> > anomalies of that sort.
> >> >
> >> > [1]: http://www.sandflow.com/public/CLDR-report-20161204.txt
> >> > [2]: https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml
> >> >
> >> > Best regards,
> >> > Shervin
> >> >
> >> > On Fri, Dec 9, 2016 at 1:14 AM, Thierry MICHEL <tmichel@w3.org> wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> The TTWG provided feedback on the
> >> >> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915>
> >> >>
> >> >> Looking forward to your review,
> >> >>
> >> >> Best regards,
> >> >> Thierry Michel
> >> >>
> >> >> On 05/12/2016 at 01:41, Shervin Afshar wrote:
> >> >>>
> >> >>> Hello,
> >> >>>
> >> >>> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915> was
> >> >>> discussed in last technical committee meeting. We think that this
> >> >>> use-case falls within the scope of CLDR project, but to effectively add
> >> >>> this data to benefit implementers and users, there are few issues which
> >> >>> need to be addressed. Most of these questions are reflected in the
> >> >>> comment that Mark provided on the ticket (direct link
> >> >>> <http://unicode.org/cldr/trac/ticket/8915#comment:8>). To summarize, the
> >> >>> following items should be addressed and discussed:
> >> >>>
> >> >>> – Clarification on the intended usage of this data with regards to
> >> >>> section 7.2 and Appendix B of TTML-IMSC1; e.g. inclusion/exclusion
> >> >>> rationale, rationale for selection of "base" set;
> >> >>> – Comparison between sets in proposed draft data
> >> >>> <https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml>
> >> >>> and existing CLDR exemplar types (main, aux, punctuation) in various
> >> >>> locales;
> >> >>> – Plans for providing data for other locales.
> >> >>>
> >> >>> Best regards,
> >> >>> Shervin
> >> >>>
> >> >>> ----- Original message -----
> >> >>> From: r12a <ishida@w3.org>
> >> >>> To: Mark Davis <mark@macchiato.com>,
> >> >>> Shervin Afshar <shervinafshar@gmail.com>,
> >> >>> Steven R Loomis/Cupertino/IBM@IBMUS
> >> >>> Cc: Thierry MICHEL <tmichel@w3.org>,
> >> >>> W3C Public TTWG <public-tt@w3.org>
> >> >>> Subject: Re: liaison for a Unicode ticket
> >> >>> Date: Tue, Nov 8, 2016 3:43 AM
> >> >>>
> >> >>> hi Mark, Shervin, Steve,
> >> >>>
> >> >>> It has been thirteen months since there was movement on this query.
> >> >>> Could one of you please contact Thierry and advise him on how/whether
> >> >>> it's possible to move forward the request of the Timed Text WG?
> >> >>>
> >> >>> thanks,
> >> >>> ri
> >> >>>
> >> >>> On 03/11/2016 17:38, Thierry MICHEL wrote:
> >> >>> > Richard,
> >> >>> >
> >> >>> > The TTWG as a Unicode ticket for adding the following "CLDR supplemental
> >> >>> > data for subtitle and caption characters"
> >> >>> >
> >> >>> > The Unicode ticket is available at
> >> >>> > http://unicode.org/cldr/trac/ticket/8915
> >> >>> >
> >> >>> > There has been no further notes on this for 7 months since
> >> >>> > IMSC1 has been published as a Recommendation
> >> >>> > (https://www.w3.org/TR/ttml-imsc1/)
> >> >>> >
> >> >>> > Could you please help the TTWG to lease with Unicode to allow moving
> >> >>> > forward ?
> >> >>> >
> >> >>> > I guess Mark Davis is the liaison contact for Unicode.
> >> >>> >
> >> >>> > Thierry.
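
The xml:lang approach discussed in this thread can be illustrated with a minimal sketch. It assumes the TTML fragment quoted in the thread (with the "..." placeholder dropped) and Python's standard xml.etree.ElementTree; the function name chars_by_lang is hypothetical, and nothing here is CLDR or TTML tooling. It only shows how xml:lang inheritance lets a checker attribute each character to the language it was tagged with, so Dutch characters need not be added to a Turkish exemplar set.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The TTML fragment from the thread, minus the "..." placeholder.
SAMPLE = """<body region="subtitleArea">
  <div>
    <p xml:id="subtitle1" xml:lang="tr" begin="0.76s" end="3.45s">
      Dost kara günde belli olur.<br />
      <span xml:lang="nl">De ratten verlaten het zinkende schip.</span>
    </p>
  </div>
</body>"""


def chars_by_lang(elem, inherited=None, out=None):
    """Collect non-whitespace characters per effective xml:lang (hypothetical helper)."""
    if out is None:
        out = defaultdict(set)
    # xml:lang is inherited from the nearest ancestor that declares it.
    lang = elem.get(XML_LANG, inherited)
    for ch in elem.text or "":
        if not ch.isspace():
            out[lang].add(ch)
    for child in elem:
        chars_by_lang(child, lang, out)
        # Text following a child (its tail) belongs to the parent element.
        for ch in child.tail or "":
            if not ch.isspace():
                out[lang].add(ch)
    return out


result = chars_by_lang(ET.fromstring(SAMPLE))
for lang, chars in sorted(result.items()):
    print(lang, "".join(sorted(chars)))
```

Each per-language set could then be compared against that language's own exemplar sets, which is left out here because it depends on the CLDR data in use.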
Received on Monday, 2 January 2017 19:05:02 UTC