- From: Asmus Freytag (t) <asmus-inc@ix.netcom.com>
- Date: Sat, 07 Mar 2015 00:04:26 -0800
- To: Eric Prud'hommeaux <eric@w3.org>, Andrew Sullivan <ajs@anvilwalrusden.com>
- CC: public-ldp-comments@w3.org, cowan@ccil.org, Steven Atkin <atkin@us.ibm.com>, John Cowan <cowan@mercury.ccil.org>, www-international@w3.org
- Message-ID: <54FAB10A.6020502@ix.netcom.com>
On 3/6/2015 11:26 PM, Eric Prud'hommeaux wrote:
>
> On Mar 7, 2015 2:04 AM, "Andrew Sullivan" <ajs@anvilwalrusden.com> wrote:
> >
> > On Thu, Mar 05, 2015 at 10:38:01PM -0500, John Cowan wrote:
> > > No, since you ask. We use Unicode, but we don't require that every
> > > non-printing character be recognized as a delimiter.
> >
> > What I worry about is inconsistent handling of whitespace across
> > implementations. But anyway, I guess this isn't really the place to
> > fix that up, since it'd be all over XML anyway, right? (I guess I'm
> > just sensitive to this right now because the IETF tried to do clever
> > things with paring down Unicode to things we wanted, and it isn't
> > working quite as we'd hoped.)
>
> I suspect that whitespace is pretty consistently treated as the four
> control codes at this point. In 2006 I tried a more inclusive definition
> of whitespace in SPARQL but folks said "what the hell is this?
> Everybody knows that whitespace is four characters." Had things like
> non-breaking, zero-width, all-singing space stayed in SPARQL, parsers
> would have required multi-byte lexers and the interoperability of
> incomplete implementations would have suffered.
>
> The downside is that someone typing in some script with its own
> whitespace (does that exist?) must use ASCII space, but they have to
> anyways because all of the language keywords are in ASCII.
>

For programming languages, sticking to the basic set for syntax purposes makes a certain amount of sense. When you are dealing with text data, or free-form input, this approach can be unnecessarily limiting. All the markup languages have the issue that both language syntax and text content reside in the same "plain-text" file, leading to complicated rules about which whitespace characters are part of the text content and which are to be ignored for text purposes because they are syntax characters.
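
As a rough illustration of the distinction above, the following Python sketch splits the same string once on a four-character syntax set and once on the full Unicode White_Space set. It is not taken from any of the specifications under discussion. The names SYNTAX_WS, UNICODE_WS and split_on are made up for this example, and the four characters are assumed to be space, tab, CR and LF. UNICODE_WS lists the code points that carry the White_Space property in the Unicode Character Database. Note that zero-width space (U+200B) is not in that set; its effect on text is a matter for the segmentation rules, which this sketch does not attempt.

```python
# Assumption: the "four characters" are space, tab, CR and LF.
SYNTAX_WS = frozenset(" \t\r\n")

# Code points with the Unicode White_Space property (25 in total).
UNICODE_WS = frozenset(
    [chr(c) for c in range(0x0009, 0x000E)]      # TAB, LF, VT, FF, CR
    + [chr(c) for c in range(0x2000, 0x200B)]    # EN QUAD .. HAIR SPACE
    + list("\u0020\u0085\u00a0\u1680\u2028\u2029\u202f\u205f\u3000")
)


def split_on(text, delimiters):
    """Split text into tokens, treating every character in delimiters as a separator."""
    tokens, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


sample = "a\u00a0b c"  # 'a', NO-BREAK SPACE, 'b', SPACE, 'c'

# A syntax-level lexer leaves the no-break space inside the token ...
assert split_on(sample, SYNTAX_WS) == ["a\u00a0b", "c"]
# ... while a text-level splitter sees three separate words.
assert split_on(sample, UNICODE_WS) == ["a", "b", "c"]
```

The narrow set keeps the lexer simple, which is the trade-off described in the quoted message; the wide set is what text-oriented processing would need at a minimum.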
However, Andrew's point is well taken: it's important not to let the programmer's attitude infect those parts of whatever protocol is being designed that are concerned with representing full-text data. It had better be possible not only to represent all space characters (and zero-width characters), but also to have them act on the text in the way they are defined in Unicode when segmenting text for whatever purpose.

A./

>
> > A
> >
> > --
> > Andrew Sullivan
> > ajs@anvilwalrusden.com
>
Received on Saturday, 7 March 2015 08:04:49 UTC