Re: [web-annotation] exactly 0 or 1 language(s)

Dear all,

I would like to give a short summary of the problem from the
implementation point of view:

Facts:

·        There are many web resources that use multiple languages (and
of course we want everything to be annotatable)

·        Many of these resources do not even use metadata or markup to
advertise the languages they use

As the goal is to be able to annotate everything, we should also take
into consideration the worst-case scenario, in which resources include
texts in multiple languages but we don’t know which languages are
used.
(This is not a rare situation: in Europeana there are 3,77 records
for which we know that the metadata is in multiple languages, but we
don’t know which ones:
http://www.europeana.eu/portal/search?f[LANGUAGE][]=mul
)

Expected user behavior:

·        I think that the majority of users would agree to add the
language (or list of languages) used when creating annotations, mainly
for retrieval purposes

·        I don’t think there will be many users willing to mark all
parts of a text with the correctly identified language, but there
will be use cases in which this is needed

·        Audio browsers might be nice and important, especially for
blind people, but I doubt that they can correctly read texts in any
language, especially old languages. (I’m not sure we have readers
that can read Latin or old German, for example, both of which are
frequently used in Europeana resources.

See
http://www.europeana.eu/portal/record/92080/FCBC03581F63DA47F920E30CF3000212D7A476F1.html

Or this:
http://www.univie.ac.at/elib/index.php?title=Greg%C3%B4rje,_b%C3%A2best,_geistlich_vater,_wache_und_brich_abe_d%C3%AEnem_slaf_%28Bruder_Werner%29&redirect=no
)

Analysis:

1.      The metadata should be consistent with the resources (if 1
language is used, 1 should be given in the language property; if 10
are used, then 10 must/should be in the language field)

2.      For the great majority of cases the text is perfectly fine as
it is: it encourages the use of exactly 1 language if possible, but
allows multiple as well.

3.      For the i18n problem of texts in multiple languages, it is
clear that we don’t have sufficient information to correctly apply NLP
and TTS algorithms, so by removing information we only make it worse,
not better. It is obvious that in this case there is no way to derive
the required information from the language property (at least not
with 100% confidence, even if language detection algorithms are
applied). What we need is a way to express which parts of the text
(selectors?) are written in which language (BCP 47 / RFC 5646) and
with which script (the script is already included in RFC 5646:
https://tools.ietf.org/html/rfc5646#ref-ISO15924 -
http://unicode.org/iso15924/iso15924-codes.html )
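
To illustrate the point about the script being part of the language
tag, here is a minimal Python sketch that pulls the primary language
and the optional script subtag out of a tag. The names `LangTag` and
`split_language_tag` are made up for this example, and the parsing is
deliberately simplified: it assumes the common
`language[-extlang][-script][-region]` shape and does not handle
grandfathered or private-use tags. Note that many tags carry no script
subtag at all, in which case the sketch returns `None`.

```python
from typing import NamedTuple, Optional


class LangTag(NamedTuple):
    language: str           # primary language subtag, e.g. "de"
    script: Optional[str]   # ISO 15924 script subtag, e.g. "Latn", or None


def split_language_tag(tag: str) -> LangTag:
    """Return the primary language and optional script subtag of a tag.

    Simplified sketch: not a full RFC 5646 parser.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = None
    for subtag in subtags[1:]:
        is_alpha = subtag.isascii() and subtag.isalpha()
        if len(subtag) == 4 and is_alpha:
            # A four-letter alphabetic subtag in this position is the script.
            script = subtag.title()
            break
        if len(subtag) == 3 and is_alpha:
            # Three-letter alphabetic subtag: extended language, keep looking.
            continue
        # Region, variant, or extension singleton: no script can follow.
        break
    return LangTag(language, script)


print(split_language_tag("sr-Cyrl-RS"))  # LangTag(language='sr', script='Cyrl')
print(split_language_tag("de"))          # LangTag(language='de', script=None)
```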


Proposed Approach:

1.      In the general case, when we have only one language, the
text-processing language can be derived from the existing “language”
property, and the script code as well.

a.      Open question: do we really need text direction if we have
the script code? Can’t the text direction be derived from the script
code (https://github.com/w3c/web-annotation/issues/224)?

2.      For the correct representation of texts in multiple languages
we need additional information, but I wouldn’t advise embedded
markup, because we shouldn’t break the functionality of the APIs
because of the browsers’ problems.

a.      As written above, I think that the best way is to have a
special (robust?) selector for adding the missing i18n information.
Just let the body have a clean representation that is human and
machine friendly (as opposed to browser friendly and human/machine
unfriendly); the JSON representation should be JSON and not HTML or
other markup.
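
A rough Python sketch of how the approach above could look on the
client side. Everything here is an assumption made for illustration,
not part of the model: the `LanguageRange` class (plain character
offsets standing in for the "special selector"), the function names,
and the small script-to-direction table, which covers only a handful
of ISO 15924 codes. The fallback in `processing_language` follows the
rule in the proposal quoted further down: use the explicit value if
present, otherwise a single declared language.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

# Partial, illustrative table; a real client would need the full script list.
SCRIPT_DIRECTION = {
    "Arab": "rtl", "Hebr": "rtl", "Syrc": "rtl", "Thaa": "rtl",
    "Latn": "ltr", "Cyrl": "ltr", "Grek": "ltr",
}


@dataclass
class LanguageRange:
    start: int      # character offset into the body text, inclusive
    end: int        # character offset into the body text, exclusive
    language: str   # language tag for this part, e.g. "la"


def processing_language(languages: List[str],
                        explicit: Optional[str] = None) -> Optional[str]:
    """Explicit value wins; otherwise fall back to a single language."""
    if explicit is not None:
        return explicit
    if len(languages) == 1:
        return languages[0]
    return None  # zero or several languages: nothing can be derived


def direction_from_script(script: Optional[str]) -> Optional[str]:
    """Look up a direction for a script code; None if unknown or missing."""
    if script is None:
        return None
    return SCRIPT_DIRECTION.get(script)


def segments(text: str,
             ranges: List[LanguageRange]) -> Iterator[Tuple[str, str]]:
    """Yield (substring, language) pairs in text order."""
    for r in sorted(ranges, key=lambda r: r.start):
        yield text[r.start:r.end], r.language


body = "Salve! Guten Tag."   # the body text itself stays free of markup
ranges = [LanguageRange(0, 6, "la"), LanguageRange(7, 17, "de")]
for part, lang in segments(body, ranges):
    print(lang, repr(part))
```

The body stays a plain string, and the language information lives
next to it, so a client that ignores the ranges still gets clean text.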


Br,

Sergiu



From: Ivan Herman [mailto:notifications@github.com]
Sent: Tuesday, 24 May 2016 08:58
To: w3c/web-annotation
Cc: Gordea Sergiu; Mention
Subject: Re: [w3c/web-annotation] exactly 0 or 1 language(s) (#213)

That is an acceptable compromise.

> On 23 May 2016, at 23:17, Rob Sanderson <notifications@github.com> wrote:
>
> My 2c:
>
> language: The Body or Target SHOULD have exactly 1 language
> associated with it, but MAY have 0 or more. If the resource contains
> content in a mixture of languages, and there is a particular language
> to use for text processing, then that language should be given in the
> processingLanguage property.
>
> processingLanguage: The language to use for text processing
> algorithms such as line breaking, hyphenation, which font to use, and
> similar. Each Body and Target MAY have 0 or exactly 1
> processingLanguage. If this property is not present and the language
> property is given as a single language, then the client SHOULD use
> that language for processing requirements.
>
> Then if there's the case when there are multiple languages and
> there's a need to specify which one to use for text processing,
> there's somewhere to do it. However for the simple (and frequent) case
> of a single language, then the client knows it should use the
> language property rather than repeat it in both fields.
>
> Thoughts?
>





-- 
GitHub Notification of comment by gsergiu
Please view or discuss this issue at 
https://github.com/w3c/web-annotation/issues/213#issuecomment-221201494
 using your GitHub account

Received on Tuesday, 24 May 2016 08:32:10 UTC