- From: gsergiu via GitHub <sysbot+gh@w3.org>
- Date: Wed, 03 Aug 2016 09:54:37 +0000
- To: public-annotation@w3.org
I have also raised several complaints about processingLanguage, and I still have the feeling that it is not documented well enough in the standard. Concrete use cases are needed to understand what it means.

In past discussions, two types of scenarios came up:

1. The "correct" representation of texts with multiple languages (e.g. European, Arabic, Chinese, Hebrew...). In addition to this scenario, there was the concern about audio readers...
2. The search scenario, where the NLP components need to know which algorithms to use, as they are language specific.

Again, my feedback on the two scenario types:

1. I doubt that processingLanguage and textDirection can solve the (absolutely) correct representation of the text, simply because that requires exact identification of the text parts written in different languages.
2. For the indexing/search scenario, processingLanguage might be sufficient. Still, I'm not convinced that it should have a single value! It is OK for texts that are written more than 90% in one language, but absolutely not OK for texts that have a near 50-50 distribution!

Furthermore, it is not enough to have a definition of processingLanguage, which is in any case a little vague given that it is intended to serve two purposes at this stage (a dangerous approach). Who should set this property?

- This is for sure a property that will not be set by end users (they are likely to set the language property).
- Is the client application the part of the system in charge of setting this value when the annotation is created? Probably only in some exotic scenarios, as I don't expect NLP to be applied before the submit button is pressed.
- Is the server in charge of setting processingLanguage? Well... actually the server is the one that needs this value as input, in order to know how to tokenize, normalize and stem the text. Should an automatic language detection algorithm be used? Should the server simply advertise the language of the processing algorithms that were applied? If yes, why should the server be constrained to use only one processingLanguage?
- I think this is mainly a kind of client-server negotiation mechanism (the client should know which processing languages are supported by the server and choose one or more of them).

I think this is the first use case to be addressed in order to provide a clear definition and meaning of the field.

-- 
GitHub Notification of comment by gsergiu
Please view or discuss this issue at https://github.com/w3c/web-annotation/issues/335#issuecomment-237194817 using your GitHub account
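
As a rough illustration of the client-server negotiation idea raised in the comment above (this is not something the Web Annotation model or protocol defines): a minimal Python sketch, assuming the server publishes the language tags it can process and the client keeps the ones it also wants. The function name, the sample tags and the list-valued processingLanguage are all assumptions made for illustration only.

```python
def negotiate_processing_languages(client_preferred, server_supported):
    """Return the client's preferred tags that the server can process.

    The client's ordering is preserved; tags are compared case-insensitively.
    Hypothetical helper, not part of any Web Annotation API.
    """
    supported = {tag.lower() for tag in server_supported}
    return [tag for tag in client_preferred if tag.lower() in supported]


# Hypothetical values: what the server advertises and what the client wants.
server_supported = ["en", "de", "fr"]
client_preferred = ["de", "ro", "en"]

chosen = negotiate_processing_languages(client_preferred, server_supported)

# Assumption: processingLanguage carries a list here, to illustrate the
# multi-value suggestion for mixed-language texts made in the comment.
body = {
    "type": "TextualBody",
    "value": "Guten Morgen, good morning",
    "language": ["de", "en"],
    "processingLanguage": chosen,
}
```

With the sample values, "ro" is dropped because the hypothetical server does not advertise it, and the remaining tags keep the client's order of preference.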
Received on Wednesday, 3 August 2016 09:54:49 UTC