
Re: CR Feedback: String counting and offsets

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 20 Oct 2015 20:20:50 +0900
To: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>, <www-international@w3.org>
Message-ID: <56262392.2050300@it.aoyama.ac.jp>
Hello Sebastian,

There is already quite a bit about character counting/string length at 
http://www.w3.org/TR/charmod/#sec-stringIndexing. But it just gives some 
guidelines.

The Encoding CR (not a Recommendation yet) deals with encoding 
conversions, not with what you do once you have a single internal encoding.

Regards,   Martin.

On 2015/10/20 19:02, Sebastian Hellmann wrote:
> Hi all,
> I am new here, so apologies if I am re-raising an old topic.
>
> I was wondering whether the Encoding Recommendation would be the right
> place to tackle a string counting issue. Lots of programming languages
> and specifications have quite different implementations regarding string
> counting. I am sure you are aware of this. A particular example is this
> spec in Section 2.1.2: https://tools.ietf.org/html/rfc5147#section-2.1.2
> which specifies counting two code points as one, or PHP with
> strlen(utf8_decode("ä")) != strlen("ä")
>
> Could we include some definitions in the standard on how strings are
> counted and define a way to have offsets over these strings?
>
> Suggestion 1 (easy change): In the terminology section:
> A string is a sequence of code points.
> The /length/ of a string equals the number of contained code points.
>
> Suggestion 2:
> Define offsets similar to this image:
> http://persistence.uni-leipzig.org/nlp2rdf/specification/image/iso+24612-2012.png
>
> i.e. start with 0 and then count the gaps between code points.
>
> I would have high hopes that some implementers would pick it up
> eventually. Such a definition would help immensely in the area of text
> annotation and might also be an issue for the Web Annotation Group.
>
> All the best,
> Sebastian
>
>
>
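
To make the two suggestions in the quoted message concrete, here is a
minimal Python sketch. It is an illustration only, not text from any
W3C or IETF document. The function names are hypothetical, and it
assumes Python 3, where a str is already a sequence of code points.
It contrasts three ways of counting the same string and then applies
gap-based offsets that start at 0.

```python
def length_in_code_points(s: str) -> int:
    """Length as in Suggestion 1: the number of code points."""
    # A Python 3 str is a sequence of Unicode code points.
    return len(s)


def length_in_utf8_bytes(s: str) -> int:
    """Length in bytes of the UTF-8 encoded form."""
    return len(s.encode("utf-8"))


def length_in_utf16_units(s: str) -> int:
    """Length in 16-bit code units of the UTF-16 encoded form."""
    # utf-16-le emits no byte order mark, so bytes // 2 is the unit count.
    return len(s.encode("utf-16-le")) // 2


def slice_by_gaps(s: str, begin: int, end: int) -> str:
    """Return the code points lying between gap `begin` and gap `end`.

    Offsets name the gaps between code points, as in Suggestion 2:
    gap 0 is before the first code point, gap len(s) is after the last.
    """
    if not 0 <= begin <= end <= len(s):
        raise ValueError("offsets must satisfy 0 <= begin <= end <= length")
    return s[begin:end]


if __name__ == "__main__":
    samples = {
        "precomposed a-umlaut": "\u00e4",
        "a + combining diaeresis": "a\u0308",
        "emoji outside the BMP": "\U0001F600",
    }
    for label, text in samples.items():
        print(
            label,
            length_in_code_points(text),
            length_in_utf8_bytes(text),
            length_in_utf16_units(text),
        )
    # Gaps 1 and 3 enclose the second and third code points of "D\u00fcrst".
    print(slice_by_gaps("D\u00fcrst", 1, 3))
```

The byte count versus code point count for the precomposed "ä" (2 versus
1) mirrors the PHP strlen example in the quoted message. The decomposed
form counts as two code points, which shows that counting code points
alone still does not match what a reader perceives as one character.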
Received on Tuesday, 20 October 2015 11:21:37 UTC
