- From: Chris Lilley <chris@w3.org>
- Date: Fri, 17 Oct 2003 15:54:09 +0200
- To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
- Cc: Bert Bos <bert@w3.org>, Tex Texin <tex@i18nguy.com>, www-style@w3.org, W3c I18n Group <w3c-i18n-ig@w3.org>
On Friday, October 17, 2003, 10:48:41 AM, Jukka wrote:

JKK> On Thu, 16 Oct 2003, Chris Lilley wrote:

>> JKK> Anyway, what the XML specification says about the xml:lang attribute is
>> JKK> that "The values of the attribute are language identifiers as defined by
>> JKK> [IETF RFC 1766], Tags for the Identification of Languages, or its
>> JKK> successor on the IETF Standards Track."
>>
>> Please also look at the XML 1.0 errata, and the XML 1.1 specification.

JKK> Good grief. I thought that it was unique to CSS specifications to make
JKK> changes in an "Errata", but the XML 1.0 "Errata" is apparently similar.
JKK> We have been given a _specification_ that is officially approved by the
JKK> W3C, containing a reference to an Errata, which says:
JKK> "This document records all known errors in - -"
JKK> but actually contains substantial _changes_ to the content of the
JKK> specification. It is left to readers to distinguish between typo fixes,
JKK> wording clarifications, and material changes.
JKK> So people who naively think they are reading the official specification
JKK> will be misled. The specification may change at any moment, just by a
JKK> change to the "Errata", with no announcement before or after. And we don't
JKK> even have a copy of the specification as changed by the "Errata".

That last statement is false. Please see

  http://www.w3.org/TR/1998/REC-xml-19980210
  (XML 1.0, first edition)

  http://www.w3.org/TR/2000/REC-xml-20001006
  (XML 1.0, second edition)

which also links to

  http://www.w3.org/TR/2000/REC-xml-20001006-review.html
  (review version with color coded changes)

and then there is the always-current

  http://www.w3.org/TR/REC-xml

which points to the latest version, including any third or subsequent
edition. It's normal practice to reprint periodically, incorporating
clarifications and errata.

JKK> And there is no XML 1.1 specification.

There is a 1.1 specification. However, it doesn't supersede 1.0.
JKK> (There is a candidate dated
JKK> 15 October 2002; it says: "It is inappropriate to cite this document as
JKK> other than 'work in progress.'")

>> JKK> I see no way how an empty string
>> JKK> could be interpreted as an accepted value for the attribute.
>>
>> I do, but then I am reading later specs than you seem to be.

JKK> I was reading the document that is announced by the W3C as a
JKK> specification.

In which you will find the text

  The errata list for this second edition is available at
  http://www.w3.org/XML/xml-V10-2e-errata.

>> JKK> By the HTML 4.* specification,
>>
>> (who cares!) it's being phased out in favour of the one that the rest
>> of XML uses.

JKK> I do care. HTML 4 is the only specification for the semantics of HTML
JKK> elements and attributes;

Yes,

JKK> XHTML 1.0 is just what it says (though the hype says otherwise): a
JKK> reformulation in XML or, rather, a reformulation of the _syntax_
JKK> of HTML 4.

Reformulation of the syntax *into XML*. Moving from an HTML-specific
lang attribute to an XML-generic xml:lang is part of that tightening up
of the syntax - possibly even to the point where it gets implemented in
HTML user agents.

JKK> Why would it need to be unset?

Because otherwise it would erroneously apply to child elements.

JKK> You can use either an appropriate language
JKK> code,

Clearly.

JKK> or one of the indicators "und" and "mul".

No, you should not do that. See RFC 3066
http://www.ietf.org/rfc/rfc3066.txt

  5. You SHOULD NOT use the UND (Undetermined) code unless the
     protocol in use forces you to give a value for the language tag,
     even if the language is unknown. Omitting the tag is preferred.

  6. You SHOULD NOT use the MUL (Multiple) tag if the protocol allows
     you to use multiple languages, as is the case for the
     Content-Language: header.

JKK> The argumentation in the XML 1.0 "errata" is very obscure - it
JKK> looks like they decided on "" and then tried to explain why it
JKK> was needed.
No, they followed RFC 3066 and corrected XML, which previously 'forced
you to give a value' (by inheritance, once set).

JKK> If there was a need for yet
JKK> another special code, it should have been formulated and proposed in the
JKK> appropriate process. But there wasn't; "und" is perhaps not optimally
JKK> clearly defined in ISO 639-2, but it's there for uses just like this.

Actually, RFC 3066 specifically discourages uses like this with a
SHOULD NOT, which seems pretty clear to me.

>> JKK> In practical terms, :lang is pointless until support to language markup
>> JKK> in browsers becomes worth mentioning.
>>
>> I don't follow your point, unless you think that xml:lang is solely
>> something to do with styling.

JKK> I was referring to :lang selectors in CSS. Sorry for not being clear
JKK> enough here.

Aha. Okay, I misunderstood what you were referring to.

>> It's not; it's also of use for searching, spell
>> checking, speech synthesis, and so forth.

JKK> I know the arguments.

(But my arguments assumed you were referring to the lang or xml:lang
attributes, not the :lang selector, so they don't apply.)

JKK> Yet, actual use of lang and xml:lang attributes is
JKK> very limited, and partly _wrong_. Try using lang="ru" for transliterated
JKK> Russian text and view the page on IE and you probably see what I mean.

Do you have a sample handy? I don't have any transliterated text at hand
to test this.

JKK> (It is a fundamental flaw in language markup that there is no way
JKK> to indicate the writing system. But language does not change when
JKK> the letters are transliterated, does it?)

I agree that specifying script and specifying language are orthogonal.

>> JKK> Since the whole point in CSS 2.1
>> JKK> is to define a practical subset of CSS 2.0, I don't see why :lang is kept
>> JKK> there at all.
>>
>> Possibly because, at least in theory, CSS 2.1 is not restricted to
>> buggy HTML browsers that have not changed much over the last 4 years.
>> Instead, it's all CSS implementations.
JKK> Really? So what is the point of CSS 2.1 then? Why have so many
JKK> CSS 2.0 features been removed from it?

Because (thankfully) buggy crappy HTML browsers are not the only
implementation experience we have. There are also a few much less buggy
and actively maintained (X)HTML browsers that implement CSS, and there
are implementations of CSS for XML languages other than (X)HTML (for
example, XForms and SVG).

I agree that the extent of the surgery is a little worrying and in some
cases seems to have paid little attention to non-HTML uses. That
probably reflects the interests and priorities of those actively working
on it.

>> JKK> Besides, the actual meaning of language markup is still obscure.
>> JKK> The whole thing is vaguely defined, little used, and little
>> JKK> supported,
>>
>> I invite you to back up those claims.

JKK> OK, see http://www.cs.tut.fi/~jkorpela/kielimerkkaus/
JKK> It's in Finnish, so it might not be optimally accessible to you.

However, it has an English summary as the final link, which was helpful
to me, as was your summary below.

JKK> Just to summarize a few points:
JKK> - the writing system problem I mentioned above
JKK> - the conflicts between the various meanings and purposes of language
JKK> markup; example: if a document (in a language other than English)
JKK> discusses CSS and mentions, say, the property name vertical-align,
JKK> should it be marked up as being in English (thereby making suitable
JKK> pronunciation possible, but confusing spelling and grammar checkers,
JKK> since it does not really obey normal English rules)

Good point. There is a growing body of 'technical English' that obeys
its own rules and is partially incorporated into other languages,
sometimes with respellings (e.g. in French, (e)mail becomes mèl to
conserve the sound while altering the spelling; other languages keep the
spelling but pronounce it as a word in their own language). How best to
mark that up is a problem.
It doesn't strike me as enough of a problem to justify not using
language markup at all.

JKK> - how do you deal with words and expressions that are commonly
JKK> used in other languages - is "fiancé", when used in English text,
JKK> a French word? what about "status quo"

If it's being used as an English word, it should be marked as English.
It's language, not pronunciation and not etymology.

JKK> (such problems don't exist when language codes are used e.g. as
JKK> for bibliographic purposes; but as you get down to individual
JKK> words and even morphemes, marking up _all_ language changes as
JKK> WCAG 1.0 requires, it's a huge conceptual problem, in addition
JKK> to being quite some work in practice)

I agree, and I am surprised that WCAG 1.0 requires markup at the
individual morpheme level.

JKK> - what do you do with words that contain parts from different
JKK> languages?
JKK> - how do you declare the language of data in attribute (e.g.
JKK> title="..." attributes), as required by WCAG 1.0?

Another illustration of why human-presentable text in attributes is
wrong. It should be corrected by moving title to an element (and not, I
hasten to add, by some bogus attribute-grouping hack like 'titlelang').

JKK> - by W3C example, names are not marked up as being in their
JKK> respective languages; what might justify this, in the light
JKK> of reasons presented for language markup in general.

Could you give some examples where names are not marked up in the
correct language? Some might be omissions and some might be the "fiancé"
use case (where the word is French by etymology but English by usage and
increasingly by pronunciation as well).

Incidentally, I agree with your summary: "The author recommends that at
word level, markup be used to indicate language changes in unproblematic
cases only; "if in doubt, leave it out"."

Thanks, by the way, for pointing me at your essay, which I skimmed to
try to get the gist of what you are saying.
Have you considered submitting a paper to the Internationalization and
Unicode Conference on this topic?

--
 Chris                            mailto:chris@w3.org
Received on Friday, 17 October 2003 09:54:12 UTC