
Re: [svg-developers] Re: indic language support

From: Chris Lilley <chris@w3.org>
Date: Mon, 29 Apr 2002 05:12:46 +0200
Message-ID: <133281750828.20020429051246@w3.org>
To: "Jim Ley" <jim@jibbering.com>
CC: www-svg@w3.org, svg-developers@yahoogroups.com
On Sunday, April 28, 2002, 11:14:42 PM, Jim wrote:

JL> "Chris Lilley" <chris@w3.org>

>> If it has a suitable fallback font pre-configured.

JL> This is completely different to my understanding, if a character is not
JL> found in the browsers font, it is to find the character in any available
JL> font,

In general that is implemented by picking a suitable 'last chance'
font that has wide coverage. I am not aware of implementations that
proceed to search every font installed - perhaps many hundreds - on
the off-chance that a missing glyph is found, mainly for performance
reasons.


JL>  my understanding was brought from such pages as
JL> http://ppewww.ph.gla.ac.uk/~flavell/charset/fontface-harmful.html and
JL> related (follow the links.)

Yes, I read that page many years ago and have referred to it from some
papers at earlier Unicode conferences. It's a prime reason why, in CSS,
font-family, as opposed to the vendor-HTML FONT tag, works as follows:

a) The font family is a list, not a single value
b) It is a priority ordered list
c) CSS2 added font descriptors, which allow the browser vendor and the user to contribute to the 'font database'
d) CSS2 added the unicode-range descriptor in particular.
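As a concrete sketch of how (a)-(d) combine (the font names here are purely illustrative, not a recommendation):

```css
/* (c)/(d): a font descriptor contributed to the 'font database',
   with unicode-range restricting it to the Devanagari block */
@font-face {
  font-family: "Raghindi";          /* illustrative font name */
  src: local("Raghindi");
  unicode-range: U+0900-097F;
}

/* (a)/(b): a priority-ordered list, tried in order for each
   character, with a generic family as the last resort */
p {
  font-family: "Raghindi", "Arial Unicode MS", sans-serif;
}
```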

Also, a prime difference from the FONT tag is that in CSS, you cannot
put (for example) Symbol, or one of the many fonts that associate
glyph indices with characters on a one to one basis, onto ASCII text
and have it come out looking like a foreign language. Instead, a
conformant CSS processor will make it look like a bunch of 'missing
glyph' markers. Glyphs are assigned based on Unicode characters, not
on random glyph indices.


>> JL>  AIUI, it does in HTML with modern browsers.
>>
>> No, only if the user picks an appropriate font.
>>
>> JL>  How are
>> JL> we to know which fonts a user has that contains a particular
>> JL> character?

We don't. That's why it's not in the author stylesheet but the user
stylesheet (or if you prefer, the user configuration settings).


>> JL> (I don't send image/svg+xml; charset=utf-8
>> JL> which perhaps I should,
>>
>> There is no charset parameter defined for image types.

JL> http://www.w3.org/TR/charmod/#sec-Encodings
JL> (a draft, and I don't follow the exact issues, so my interpretation is
JL> potentially wrong...)

JL> Says:
JL> "Because encoded text cannot be interpreted and processed without knowing
JL> the encoding, it is vitally important that the character encoding [...] is
JL> known at all times and places where text is exchanged or processed. "

Yes, correct. You have demonstrated that the character encoding scheme
needs to be transmitted. I agree. You then assert that this can only
be transmitted, or is best transmitted, as a MIME charset parameter. I
disagree, very strongly.

Since SVG is written in XML, the character encoding is known exactly
at all times. It's what the encoding declaration says it is. If there
is no encoding declaration, then it is either UTF-8 or UTF-16, a
choice which is easily resolvable from looking at the first few bytes
of the file for a BOM as defined in the XML specification.
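That detection procedure can be sketched as follows (a minimal illustration of the idea, not the full autodetection algorithm from the XML specification; the function name and the 100-byte window are my own choices):

```python
import re

def sniff_xml_encoding(data: bytes) -> str:
    """Guess an XML document's encoding from its first bytes.

    Simplified sketch: a byte order mark wins outright; failing
    that, trust the encoding declaration; failing that, the XML
    spec says the document is UTF-8.
    """
    # A BOM, if present, settles the question.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    # No BOM: look for an encoding declaration. For any
    # ASCII-compatible encoding the declaration itself is
    # readable as ASCII.
    head = data[:100].decode("ascii", errors="replace")
    m = re.search(r'encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', head)
    if m:
        return m.group(1)
    # No BOM and no declaration: UTF-8 by definition.
    return "utf-8"
```

Note that the encoding is determined entirely from the first few bytes of the file itself, which is why the method works identically over HTTP, FTP, or local storage.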

Note that this method is robust - the encoding is the same whether the
SVG file is read from local storage, over HTTP, FTP, POP, or whatever.

Note also that if the encoding declaration is *wrong*, then the XML
parser will give a well-formedness error and halt. Thus, there is no
bad data around.

JL> Which seems to be saying to me that when text is transmitted via a
JL> protocol such as http you need to include what the CES is to allow for it
JL> to be processed correctly

Yes.

JL> (It seems to me that saying inside the file is
JL> too late if you're using 8 or 16 or whatever byte CES's)

No, in fact that is very well defined.

So, consider the alternative, a charset parameter copied from the
text/* types and (unwisely) foisted on the application/* types.

This is fragile, out of band information. It raises the possibility
that the charset parameter and the encoding declaration may differ. In
that case, one either has to establish a precedence or declare this to
be an error.

The RFC for XML media types establishes a precedence, so it is not an
error if they conflict. The downside of this is that the simple act of
saving a file locally now involves rewriting the file; otherwise it
will fail with a well-formedness error the next time it is read. It
also means that it is not possible to do server-side processing on the
file, because its encoding declaration might be wrong but overridden
to the correct value in some server config for HTTP. Now, there is
rather a lot of server-side XML processing. Breaking it seems like a
really, really bad idea.

Lastly, if one relies on a charset parameter then it has to be
generated. One way is some per-server naming convention - for example,
I use .htm8 on the W3C server to force XHTML files to be served with a
charset parameter (since they are served as a text/* type). Who knows
about my particular naming convention? How would an authoring tool
know what to generate? Maybe it would be foo.svg.utf8 or
/utf8/foo.svg or .... too many possibilities. Whereas with an encoding
declaration, it is very clear and totally independent of the server
config, which an authoring tool cannot know. Just generate correct XML
according to the XML spec and voila! it all works. The other way to
generate a charset parameter would be to have the server parse each
XML file as it is served, read the encoding declaration, and generate
the HTTP headers accordingly - this is both inefficient and redundant.

JL> http://lists.w3.org/Archives/Public/www-svg/2001Oct/0067.html indicates
JL> you were discussing the registration including the charset issue,

Yes, and the above is a summary of the discussion.

JL>  it
JL> clearly needs one as an XML document,

No, it clearly does not.

JL> rfc3023 says this "In particular,
JL> the charset parameter SHOULD be used in the same manner, as described in
JL> Section 7.1, in order to enhance Interoperability."

The problem is that it *decreases* interoperability (except for text/*
types where I agree it is absolutely required due to the baroque way
the text/* type is defined with a US-ASCII fallback when there is no
charset parameter).


JL>  - Okay it was only
JL> SHOULD, but I'd like to see some very good justification of why you're
JL> going against this.

See above.

JL>  (Perhaps in the registration of image/svg+xml )

Yes, you will see those same arguments deployed in that registration.


-- 
 Chris                            mailto:chris@w3.org
Received on Sunday, 28 April 2002 23:15:25 GMT
