DOMString Character Encoding [was RE: C++ binding] from Allen, Michael B (RSCH) on 2002-02-18 (www-dom@w3.org from January to March 2002)

From: Allen, Michael B (RSCH) <Michael_B_Allen@ml.com>
Date: Sun, 17 Feb 2002 19:15:22 -0500
To: "'Joseph Kesselman'" <keshlam@us.ibm.com>
cc: www-dom@w3c.org
Message-ID: <2D31030A810FD611973700306E0208F61997BC@ehope07.hew.us.ml.com>

> -----Original Message-----
> From:	Joseph Kesselman [SMTP:keshlam@us.ibm.com]
> 
> >     Why does the DOMString type needs to be defined at all?
> 
> The nature of a language binding is that it brings everything down to
> language-specific interfaces and types, so that code written against that
> binding will compile and run against all instances of that binding. If you
> don't nail down DOMString to _something_, you don't have a binding.
> 
	Sure, but I don't think you have to define what the DOMString *character
	encoding* is. DOMString could just be the standard string type for that
	language. In C this would be a pointer to 'char' (The encoding of the string
	object this pointer points to is the locale dependent character enocoding such
	as ISO-8859-5 or UTF-8 but my point is this shouldn't matter). In Java this
	would be the 'String' object (encoding is UTF-16 non-endian dependent with a
	BOM but again this doesn't matter). The reason it doesn't matter is because
	you will never be transferring strings between language bindings at runtime.
	That's what XML is for. So only the DOM serialization routines (probably to
	XML) need to worry about details like character encoding.

> The solution we've used in the past is for the binding to state what
> specific type DOMString is mapped to.
> 
	Specifying the type is one thing, but specifying the encoding is another.
	Making it UTF-16 (big endian, little endian, w/wo BOM?) unnecessarily
	constrains the implementation. I know first hand it creates a significant barrier
	for C. It requires that the implementation provide all the usual string
	manipulation functions. Consider what would happen if the DOMString type
	were defined as a long and specified the encoding as UCS-4BE? What would
	the Java language binding look like?

> Another solution would be to say that DOMString will itself be a specific
> interface in this binding, derived from or wrapped around whatever the
> implementation wants to use.
> 
	If you mean:

	interface DOMString {
	};

	then great. And maybe aknowledge that this might also be represented as a
	simple typedef in some language bindings (just like it would be silly to
	represent the EventListener interface with anything but a function pointer in C:
	typedef void (*DOM_EventListener)(DOM_Event *evt);). DOMString could just
	an object to be used by the standard string manipulation functions of the
	language. You cannot define the string manipulation functions too...

> There's nothing inherently wrong with providing a DOM-like API that
> supports only 8-bit characters in a specific encoding, if you know that's
> all your customers are going to want... but if you do so, be sure to
> clearly document that divergence and don't claim to be a fully compliant
> DOM implementation. You'll benefit from folks not having to relearn the API
> architecture, but you'll lose interoperability with other systems... and
> interoperability is part of the point of the DOM.
> 
	Then I think we'll have a lot of "DOM-like" implementations of the DOM laying
	round (DOMC being one of them). I don't see how interoperabilty would be
	lost if the actual character encoding of the DOMString type were defined in the
	language binding. And in most languages there is an obvious choice for strings if
	you have a choice at all.

	Mike

Received on Sunday, 17 February 2002 19:15:39 UTC