- From: Pierre-Anthony Lemieux <pal@sandflow.com>
- Date: Mon, 2 Jan 2017 11:16:39 -0800
- To: Shervin Afshar <safshar@netflix.com>
- Cc: Thierry MICHEL <tmichel@w3.org>, Nigel Megitt <nigel.megitt@bbc.co.uk>, Richard Ishida <ishida@w3.org>, Mark Davis <mark@macchiato.com>, Steven R Loomis <srloomis@us.ibm.com>, "public-tt@w3.org" <public-tt@w3.org>
Hi Shervin,

Ok. It sounds like the overall proposal is as follows:

- CLDR: address those characters which are clear-cut and obviously missing for a locale
- CLDR: address some anomalies which were previously mentioned in this thread
- CLDR: create a set of characters which are used across locales for timed-text, e.g. ♩
- IMSC: recommend locales that are likely to be used in a given distribution region (e.g. timed text intended for the Turkish locale may include characters from locale X, Y, Z... which need to be signaled using xml:lang?)

Did I get this right?

Best,

-- Pierre

On Mon, Jan 2, 2017 at 11:03 AM, Shervin Afshar <safshar@netflix.com> wrote:
> Hi Pierre-Anthony,
>
> Thanks for the input.
>
>> - how would an author creating subtitles/captions targeting the
>> Turkish locale (and/or an implementer wishing to support the Turkish
>> locale) know that Dutch characters can be present?
>
> This of course depends on implementation and it is out of scope of defining
> exemplar sets and used characters for each locale. A similar issue exists in
> copy-writing for localizable content and there are established best
> practices agreed-upon among the industry practitioners; e.g. using markup to
> denote the language of non-translatable content, high-level tooling to add
> markup when content for one language is embedded in another, etc.
>
>> - some characters, such as the musical note ("♩"), are used for
>> subtitles/captions across many locales.
>> For this reason, ticket #8915 [1] suggests:
>> - adding an entirely new class ('subtitleCharacters') of sets to CLDR
>> - defining a 'base' subtitle/captions exemplar set available across all
>> locales
>
> Agreed. My thinking is to first address those characters which are clear-cut
> and obviously missing for a locale. Then address some anomalies which were
> previously mentioned in this thread.
> After these narrowing-downs, we end up
> with a set of characters which are used across locales for timed-text and
> then we can address them separately. My assumption is that ♩ would fall in
> this last category.
>
> Best regards,
> Shervin
>
> On Mon, Jan 2, 2017 at 9:09 AM, Pierre-Anthony Lemieux <pal@sandflow.com>
> wrote:
>>
>> Hi Shervin,
>>
>> Thanks for the feedback and the illustrative example.
>>
>> > in actuality what is needed is a more granular usage of "xml:lang"
>> > attribute according to W3C i18n
>> > best practices[1] to distinguish non-Turkish content present in a
>> > Turkish context:
>>
>> Yes. This does not however address two other issues:
>>
>> - how would an author creating subtitles/captions targeting the
>> Turkish locale (and/or an implementer wishing to support the Turkish
>> locale) know that Dutch characters can be present?
>>
>> - some characters, such as the musical note ("♩"), are used for
>> subtitles/captions across many locales.
>>
>> For this reason, ticket #8915 [1] suggests:
>>
>> - adding an entirely new class ('subtitleCharacters') of sets to CLDR
>> -- instead of overloading the main and auxiliary sets, which, as you
>> point out, cover only those exemplars that are used in the language of
>> the locale (and are selected using xml:lang)
>>
>> - defining a 'base' subtitle/captions exemplar set available across all
>> locales
>>
>> [1] http://unicode.org/cldr/trac/ticket/8915
>>
>> Perhaps these issues have been discussed before. Looking forward to
>> your thoughts.
>>
>> > For now I will proceed with actions for adding the clear-cut cases
>> > (which I mentioned in a previous email)
>> > to the relevant exemplar sets.
>>
>> Thanks.
>> Seems very useful regardless :)
>>
>> Best,
>>
>> -- Pierre
>>
>>
>> On Mon, Dec 26, 2016 at 12:21 PM, Shervin Afshar <safshar@netflix.com>
>> wrote:
>> > Hi Pierre-Anthony,
>> >
>> > I understand the lax criteria in order to cover larger number of cases, but
>> > we still need to scrutinize and investigate some of the questionable
>> > additions; another example being U+02BC (MODIFIER LETTER APOSTROPHE) which
>> > though relevant to be added to the punctuation exemplar for one locale,
>> > presence of it in others might prove redundant and problematic in the grand
>> > scheme of things. Another instance of such unlikely additions is U+0132
>> > (LATIN CAPITAL LIGATURE IJ) which is only present in Turkish, but can not be
>> > explained why (since it's used in Dutch).
>> >
>> > One could certainly imagine a timed-text asset for Turkish to have some
>> > strings of Dutch in it and it might be reasoned that therefore Dutch
>> > exemplar set (or a subset of it) should be added to Turkish exemplar sets as
>> > auxiliary, but in actuality what is needed is a more granular usage of
>> > "xml:lang" attribute according to W3C i18n best practices[1] to distinguish
>> > non-Turkish content present in a Turkish context:
>> >
>> > <body region="subtitleArea">
>> >   <div>
>> >     <p xml:id="subtitle1" xml:lang="tr" begin="0.76s" end="3.45s">
>> >       Dost kara günde belli olur.<br />
>> >       <span xml:lang="nl">De ratten verlaten het zinkende schip.</span>
>> >     </p>
>> >     ...
>> >   </div>
>> > </body>
>> >
>> > Although the complexity of an all-around additional of characters to
>> > exemplar sets to maximize coverage might seem marginal, but as far as I'm
>> > aware of, CLDR has been selective about such additions.
>> >
>> > For now I will proceed with actions for adding the clear-cut cases (which I
>> > mentioned in a previous email) to the relevant exemplar sets.
>> >
>> > [1]: https://www.w3.org/International/questions/qa-when-xmllang.en#when_use
>> >
>> > Best regards,
>> > Shervin
>> >
>> > On Wed, Dec 21, 2016 at 11:22 PM, Pierre-Anthony Lemieux
>> > <pal@sandflow.com> wrote:
>> >>
>> >> Hi Shervin,
>> >>
>> >> Thanks for the update, and to the CLDR TC for considering the input.
>> >>
>> >> > In some other cases it's not very clear if the inclusion of a specific
>> >> > characters is justified or simply due to
>> >> > bad data (e.g. u+017F, LATIN SMALL LETTER LONG S which is included in
>> >> > the set latnExtA provided in [2]).
>> >>
>> >> I believe that the recommended sets erred on the side of caution, and
>> >> were created to deliberately cast a wider, rather than narrower, net
>> >> whenever possible. For instance, the recommended set for each of the
>> >> "lv,lt,et,hr,cs,pl,sl,sk,tr" locales includes all of the Latin
>> >> Extended-A block, instead of attempting to optimize each sets at the
>> >> risk of missing important characters -- the general assumption being
>> >> that the incremental complexity of supporting all versus parts of the
>> >> Latin Extended-A block would be marginal, e.g. implementations support
>> >> all or none of the Latin Extended-A block.
>> >>
>> >> > I will update the thread and the ticket with the next steps when I get
>> >> > to check for anomalies of that sort.
>> >>
>> >> Looking forward to your feedback.
>> >>
>> >> Best,
>> >>
>> >> -- Pierre
>> >>
>> >> On Mon, Dec 12, 2016 at 12:50 PM, Shervin Afshar <safshar@netflix.com>
>> >> wrote:
>> >> > Thanks for the new comparison report[1]. CLDR TC discussed this again last
>> >> > week and looking at the report, it seems that in some cases the issue can be
>> >> > addressed by adding characters to one of CLDR exemplar categories for the
>> >> > respective locale; e.g. for Arabic, U+060D (Arabic date separator) or for
>> >> > Hebrew, U+05C3 (Sof Pasuq).
>> >> > In some other cases it's not very clear if the
>> >> > inclusion of a specific characters is justified or simply due to bad data
>> >> > (e.g. u+017F, LATIN SMALL LETTER LONG S which is included in the set
>> >> > latnExtA provided in [2]).
>> >> >
>> >> > Therefore, a closer inspection of each set seems necessary. I will update
>> >> > the thread and the ticket with the next steps when I get to check for
>> >> > anomalies of that sort.
>> >> >
>> >> > [1]: http://www.sandflow.com/public/CLDR-report-20161204.txt
>> >> > [2]: https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml
>> >> >
>> >> > Best regards,
>> >> > Shervin
>> >> >
>> >> > On Fri, Dec 9, 2016 at 1:14 AM, Thierry MICHEL <tmichel@w3.org> wrote:
>> >> >>
>> >> >> Hello,
>> >> >>
>> >> >> The TTWG provided feedback on the
>> >> >> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915>
>> >> >>
>> >> >> Looking forward to your review,
>> >> >>
>> >> >> Best regards,
>> >> >> Thierry Michel
>> >> >>
>> >> >>
>> >> >> On 05/12/2016 at 01:41, Shervin Afshar wrote:
>> >> >>>
>> >> >>> Hello,
>> >> >>>
>> >> >>> CLDR ticket #8915 <http://unicode.org/cldr/trac/ticket/8915> was
>> >> >>> discussed in last technical committee meeting. We think that this
>> >> >>> use-case falls within the scope of CLDR project, but to effectively add
>> >> >>> this data to benefit implementers and users, there are few issues which
>> >> >>> need to be addressed. Most of these questions are reflected in the
>> >> >>> comment that Mark provided on the ticket (direct link
>> >> >>> <http://unicode.org/cldr/trac/ticket/8915#comment:8>).
>> >> >>> To summarize, the
>> >> >>> following items should be addressed and discussed:
>> >> >>>
>> >> >>> – Clarification on the intended usage of this data with regards to
>> >> >>> section 7.2 and Appendix B of TTML-IMSC1; e.g. inclusion/exclusion
>> >> >>> rationale, rationale for selection of "base" set;
>> >> >>> – Comparison between sets in proposed draft data
>> >> >>> <https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml>
>> >> >>> and existing CLDR exemplar types (main, aux, punctuation) in various
>> >> >>> locales;
>> >> >>> – Plans for providing data for other locales.
>> >> >>>
>> >> >>> Best regards,
>> >> >>> Shervin
>> >> >>>
>> >> >>> ----- Original message -----
>> >> >>> From: r12a <ishida@w3.org <mailto:ishida@w3.org>>
>> >> >>> To: Mark Davis <mark@macchiato.com <mailto:mark@macchiato.com>>,
>> >> >>> Shervin Afshar <shervinafshar@gmail.com <mailto:shervinafshar@gmail.com>>,
>> >> >>> Steven R Loomis/Cupertino/IBM@IBMUS
>> >> >>> Cc: Thierry MICHEL <tmichel@w3.org <mailto:tmichel@w3.org>>,
>> >> >>> W3C Public TTWG <public-tt@w3.org <mailto:public-tt@w3.org>>
>> >> >>> Subject: Re: liaison for a Unicode ticket
>> >> >>> Date: Tue, Nov 8, 2016 3:43 AM
>> >> >>>
>> >> >>> hi Mark, Shervin, Steve,
>> >> >>>
>> >> >>> It has been thirteen months since there was movement on this query.
>> >> >>> Could one of you please contact Thierry and advise him on how/whether
>> >> >>> it's possible to move forward the request of the Timed Text WG?
>> >> >>>
>> >> >>> thanks,
>> >> >>> ri
>> >> >>>
>> >> >>> On 03/11/2016 17:38, Thierry MICHEL wrote:
>> >> >>> > Richard,
>> >> >>> >
>> >> >>> > The TTWG as a Unicode ticket for adding the following "CLDR supplemental
>> >> >>> > data for subtitle and caption characters"
>> >> >>> >
>> >> >>> > The Unicode ticket is available at
>> >> >>> > http://unicode.org/cldr/trac/ticket/8915
>> >> >>> > <http://unicode.org/cldr/trac/ticket/8915>
>> >> >>> >
>> >> >>> > There has been no further notes on this for 7 months since
>> >> >>> > IMSC1 has been published as a Recommendation
>> >> >>> > (https://www.w3.org/TR/ttml-imsc1/ <https://www.w3.org/TR/ttml-imsc1/>)
>> >> >>> >
>> >> >>> > Could you please help the TTWG to lease with Unicode to allow moving
>> >> >>> > forward ?
>> >> >>> >
>> >> >>> > I guess Mark Davis is the liaison contact for Unicode.
>> >> >>> >
>> >> >>> > Thierry.
Received on Monday, 2 January 2017 19:17:34 UTC