
CR Feedback: String counting and offsets

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Tue, 20 Oct 2015 12:02:15 +0200
To: www-international@w3.org
Message-ID: <56261127.7040408@informatik.uni-leipzig.de>
Hi all,
I am new here, so apologies if I am re-raising an old topic.

I was wondering whether the Encoding Recommendation would be the right 
place to tackle a string counting issue. Lots of programming languages 
and specifications count strings in quite different ways. I am sure you 
are aware of this. One example is Section 2.1.2 of this spec: 
https://tools.ietf.org/html/rfc5147#section-2.1.2
which specifies counting two code points as one. Another is PHP, where 
strlen(utf8_decode("ä")) != strlen("ä")
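
To make the discrepancy concrete, here is a minimal Python sketch that counts the same string in three units. The helper name count_units is made up for this illustration; the comments about PHP and JavaScript describe which unit those languages count in, not anything this code calls.

```python
def count_units(s: str) -> dict:
    """Count one string in three common units (hypothetical helper)."""
    return {
        # A Python str is a sequence of code points, so len() counts them.
        "code_points": len(s),
        # Bytes of the UTF-8 form; PHP's strlen() counts bytes.
        "utf8_bytes": len(s.encode("utf-8")),
        # 16-bit code units of the UTF-16 form; JavaScript's .length counts these.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
    }


a_umlaut = "\u00e4"        # "ä" as the single precomposed code point U+00E4
emoji = "\U0001F600"       # a code point outside the Basic Multilingual Plane

print(count_units(a_umlaut))  # {'code_points': 1, 'utf8_bytes': 2, 'utf16_code_units': 1}
print(count_units(emoji))     # {'code_points': 1, 'utf8_bytes': 4, 'utf16_code_units': 2}
```

Only the code point count is the same for both strings, which is the unit Suggestion 1 below proposes.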

Could we include some definitions in the standard on how strings are 
counted and define a way to have offsets over these strings?

Suggestion 1 (easy change): In the terminology section:
A string is a sequence of code points.
The /length/ of a string equals the number of contained code points.

Suggestion 2:
Define offsets similar to this image:
e.g. start with 0 and then count the gaps between code points.
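
One possible reading of the gap-counting idea, sketched in Python: offset 0 is the gap before the first code point, and offset n is the gap after the n-th code point. The function names substring and to_utf8_byte_offset are assumptions made for this sketch, not part of any specification.

```python
def substring(text: str, begin: int, end: int) -> str:
    """Return the code points between gap `begin` and gap `end`."""
    if not 0 <= begin <= end <= len(text):
        raise ValueError("offsets must satisfy 0 <= begin <= end <= length")
    # Python slices already index the gaps between code points.
    return text[begin:end]


def to_utf8_byte_offset(text: str, offset: int) -> int:
    """Convert a code point gap offset to a byte offset in the UTF-8 form."""
    if not 0 <= offset <= len(text):
        raise ValueError("offset must satisfy 0 <= offset <= length")
    return len(text[:offset].encode("utf-8"))


text = "B\u00e4r \U0001F600!"         # 6 code points: B, ä, r, space, emoji, !
print(substring(text, 0, 3))           # Bär
print(to_utf8_byte_offset(text, 3))    # 4  (1 + 2 + 1 bytes)
print(to_utf8_byte_offset(text, 5))    # 9  (4 + 1 space + 4 emoji bytes)
```

With gap offsets, an empty range (begin == end) marks a position rather than a character, and adjacent annotations can share a boundary without overlapping.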

I have high hopes that some implementers would pick it up 
eventually. Such a definition would help immensely in the area of text 
annotation and might also be relevant for the Web Annotation Group.

All the best,

Sebastian Hellmann
AKSW/KILT research group
Institute for Applied Informatics (InfAI) at University Leipzig
DBpedia Association
* Oct 31st, 2015: Deadline for Quality Management of Semantic Web 
Assets (Data, Services and Systems) 
Come to Germany as a PhD student: http://bis.informatik.uni-leipzig.de/csf
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
Received on Tuesday, 20 October 2015 10:02:48 UTC
