- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Tue, 20 Oct 2015 20:20:50 +0900
- To: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>, <www-international@w3.org>
Hello Sebastian,

There is already quite a bit about character counting/string length at
http://www.w3.org/TR/charmod/#sec-stringIndexing. But it just gives some
guidelines. The Encoding CR (not a Recommendation yet) deals with encoding
conversions, not with what you do once you have a single internal encoding.

Regards, Martin.

On 2015/10/20 19:02, Sebastian Hellmann wrote:
> Hi all,
> I am new, so sorry if I re-raise a topic.
>
> I was wondering whether the Encoding Recommendation would be the right
> place to tackle a string counting issue. Lots of programming languages
> and specifications have quite different implementations regarding string
> counting. I am sure you are aware of this. A particular example is this
> spec in Section 2.1.2: https://tools.ietf.org/html/rfc5147#section-2.1.2
> which specifies to count two code points as one, or PHP with
> strlen(utf8_decode("ä")) != strlen("ä")
>
> Could we include some definitions in the standard on how strings are
> counted and define a way to have offsets over these strings?
>
> Suggestion 1 (easy change): In the terminology section:
> A string is a sequence of code points.
> The /length/ of a string equals the number of contained code points.
>
> Suggestion 2:
> Define offsets similar to this image:
> http://persistence.uni-leipzig.org/nlp2rdf/specification/image/iso+24612-2012.png
>
> e.g. start with 0 and then count the gaps.
>
> I would have high hopes that some implementers would pick it up
> eventually. Such a definition would help immensely in the area of text
> annotation and might also be an issue for the Web Annotation Group.
>
> All the best,
> Sebastian
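
To make the counting discrepancy discussed in this thread concrete, here is a minimal Python sketch. It is only an illustration, not part of any specification: the function names `lengths` and `slice_by_gaps` are hypothetical, and it assumes that a "string" is a sequence of code points (as in Suggestion 1) and that offsets count the gaps between code points starting at 0 (as in Suggestion 2).

```python
def lengths(s: str) -> dict:
    """Return the length of s under three common counting schemes."""
    return {
        # Python's str is a sequence of code points, so len() counts those.
        "code_points": len(s),
        # Byte count of the UTF-8 form (what PHP's strlen() sees for UTF-8 input).
        "utf8_bytes": len(s.encode("utf-8")),
        # UTF-16 code units (what Java/JavaScript string lengths report);
        # "utf-16-le" is used so that no byte order mark is added.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
    }


def slice_by_gaps(s: str, begin: int, end: int) -> str:
    """Select the code points between gap offsets begin and end.

    Gap 0 is before the first code point and gap len(s) is after the last,
    which matches Python's own slicing convention.
    """
    if not 0 <= begin <= end <= len(s):
        raise ValueError("offsets must satisfy 0 <= begin <= end <= len(s)")
    return s[begin:end]
```

For the precomposed "ä" (U+00E4), `lengths` reports 1 code point but 2 UTF-8 bytes, which is the kind of mismatch the PHP example points at. The decomposed form (U+0061 followed by U+0308) counts as 2 code points, so even a code point based definition does not by itself match what a reader perceives as one character.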
Received on Tuesday, 20 October 2015 11:21:37 UTC