- From: gsergiu via GitHub <sysbot+gh@w3.org>
- Date: Tue, 24 May 2016 08:32:02 +0000
- To: public-annotation@w3.org
Dear all,

I would like to give a simple synthesis of the problem from the implementation point of view.

Facts:
· There are many web resources that use multiple languages (and of course we want everything to be annotatable).
· Many of these resources do not even use metadata or markup to advertise the languages used.

As the goal is to be able to annotate everything, we should even take into consideration the worst-case scenario, in which resources include texts in multiple languages but we don't know which languages are used. (This is not a rare situation: in Europeana there are 3,77 records for which we know that the metadata is in multiple languages, but we don't know which ones: http://www.europeana.eu/portal/search?f[LANGUAGE][]=mul )

Expected user behavior:
· I think that the majority of users would agree to add the language (or list of languages) used when creating annotations, mainly for retrieval purposes.
· I don't think there will be many users willing to mark every part of a text with the correctly identified language, but there will be use cases in which this is needed.
· Audio browsers might be nice and important, especially for blind people, but I doubt that they are able to correctly read texts in any language, and especially in old languages. (I'm not sure whether we have readers that are able to read Latin or Old German, for example, which are frequently used in Europeana resources. See http://www.europeana.eu/portal/record/92080/FCBC03581F63DA47F920E30CF3000212D7A476F1.html or this one: http://www.univie.ac.at/elib/index.php?title=Greg%C3%B4rje,_b%C3%A2best,_geistlich_vater,_wache_und_brich_abe_d%C3%AEnem_slaf_%28Bruder_Werner%29&redirect=no )

Analysis:
1. The metadata should be consistent with the resources (if 1 language is used, one should be available in the language property; if 10 are used, then 10 must/should be in the language field).
2. For the great majority of cases the text is perfectly fine as it is: it encourages the use of exactly 1 language where possible, while allowing multiple as well.
3. For the i18n problem of texts in multiple languages, it is clear that we don't have sufficient information to correctly apply NLP and TTS algorithms, so by removing information we only make it worse, not better. It is obvious that in this case there is no way to derive the required information from the language property (at least not with 100% confidence, as language detection algorithms might be applied). What we need is a way to express which parts of the text (selectors?) are written in which language (BCP?/RFC) and with which script (script codes are already included in RFC 5646: https://tools.ietf.org/html/rfc5646#ref-ISO15924 - http://unicode.org/iso15924/iso15924-codes.html ).

Proposed Approach:
1. In the general case, when we have only one language, the text-processing language can be derived from the existing "language" property, and the script code as well.
   a. Open question: do we really need text direction if we have the script code? Can't the text direction be derived from the script code (see https://github.com/w3c/web-annotation/issues/224 )?
2. For the correct representation of texts in multiple languages we need additional information, but I wouldn't advise using embedded markup, because we shouldn't break the functionality of the APIs because of the browsers' problems.
   a. As written above, I think that the best way is to have a special (robust?) selector for adding the missing i18n information. Just let the body have a clean representation, which is human and machine friendly (as opposed to browser friendly and human/machine unfriendly): the JSON representation should be JSON, and not HTML or other markup.

Br,
Sergiu

From: Ivan Herman [mailto:notifications@github.com]
Sent: Tuesday, 24 May 2016 08:58
To: w3c/web-annotation
Cc: Gordea Sergiu; Mention
Subject: Re: [w3c/web-annotation] exactly 0 or 1 language(s) (#213)

That is an acceptable compromise.

> On 23 May 2016, at 23:17, Rob Sanderson <notifications@github.com> wrote:
>
> My 2c:
>
> language: The Body or Target SHOULD have exactly 1 language associated with it, but MAY have 0 or more. If the resource contains content in a mixture of languages, and there is a particular language to use for text processing, then that language should be given in the processingLanguage property.
>
> processingLanguage: The language to use for text processing algorithms such as line breaking, hyphenation, which font to use, and similar. Each Body and Target MAY have 0 or exactly 1 processingLanguage. If this property is not present and the language property is given as a single language, then the client SHOULD use that language for processing requirements.
>
> Then if there's the case when there are multiple languages and there's a need to specify which one to use for text processing, there's somewhere to do it. However for the simple (and frequent) case of a single language, then the client knows it should use the language property rather than repeat it in both fields.
>
> Thoughts?
> -- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub <https://github.com/w3c/web-annotation/issues/213#issuecomment-221182575>

--
GitHub Notification of comment by gsergiu
Please view or discuss this issue at https://github.com/w3c/web-annotation/issues/213#issuecomment-221201494 using your GitHub account
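
For illustration only, the fallback rule in the quoted proposal (use processingLanguage if present, otherwise fall back to language when it names exactly one language) could be sketched on the client side roughly as follows. This is a minimal sketch, not part of the thread or of any specification: the function name resolve_processing_language and the sample bodies are hypothetical, and it assumes that "language" may appear in the JSON either as a single string or as a list of strings.

```python
from typing import Optional


def resolve_processing_language(resource: dict) -> Optional[str]:
    """Pick the language a client would use for text processing, or None.

    Hypothetical helper; assumes `resource` is a Body or Target parsed
    from JSON, with optional "processingLanguage" and "language" keys.
    """
    # An explicit processingLanguage always wins.
    explicit = resource.get("processingLanguage")
    if explicit:
        return explicit

    # Otherwise fall back to "language", but only if it is a single language.
    language = resource.get("language")
    if isinstance(language, str):
        return language
    if isinstance(language, (list, tuple)) and len(language) == 1:
        return language[0]

    # Zero or several languages and no hint: nothing can be derived.
    return None


# Hypothetical sample bodies, for illustration only.
single = {"type": "TextualBody", "value": "Guten Tag", "language": "de"}
mixed = {"type": "TextualBody", "value": "Salve! Guten Tag", "language": ["la", "de"]}
mixed_with_hint = {**mixed, "processingLanguage": "la"}

print(resolve_processing_language(single))
print(resolve_processing_language(mixed))
print(resolve_processing_language(mixed_with_hint))
```

In the mixed case without a hint the sketch returns None, which is exactly the situation described above where the client cannot derive the processing language from the language property alone.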
Received on Tuesday, 24 May 2016 08:32:10 UTC