Re: Language X within scope of language Y

From: David Clarke <w3c@dragonthoughts.co.uk>
Date: Thu, 20 Jan 2005 08:18:53 +0000
Message-ID: <$VMk0$Btl27BFwgr@sisko.dragonthoughts.co.uk>
To: www-international@w3.org

I think there is an area of importance that has been missed in this discussion.

When indexing content, as for a search engine, common words are dropped 
out. Thus for indexing something in English, the word "the" would not be 
indexed as it appears in almost every English sentence.

However, the word "the" in French carries the meaning of tea; this 
suggests it should not be omitted from an index. Were it to be embedded 
within a predominantly English text, embedded language marking would 
make a significant difference to processing.

Similarly, an English "the" embedded in a French text should be omitted 
from the index, but the indexer would need the clue to the embedded language.

(Note: the French "the" should have an accent, "thé", but accents are 
regularly omitted in computer-based content.)
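
To make the indexing point concrete, here is a minimal sketch in Python. It is 
only an illustration, not how any particular search engine works: the 
STOP_WORDS lists are tiny and hypothetical, and the (language, text) span 
format stands in for whatever a parser would extract from xml:lang or lang 
markup.

```python
# Hypothetical, deliberately tiny stop-word lists; real indexers use far larger ones.
STOP_WORDS = {
    "en": {"the", "a", "of", "is", "in", "i"},
    "fr": {"le", "la", "les", "de", "du", "un", "une"},
}


def index_terms(spans):
    """Return the indexable (language, word) pairs from (language, text) spans.

    Each span is filtered with the stop-word list of its own language,
    so the same spelling can be dropped in one language and kept in another.
    """
    terms = []
    for lang, text in spans:
        stop = STOP_WORDS.get(lang, set())  # unknown language: drop nothing
        for word in text.lower().split():
            word = word.strip('.,;:!?"')
            if word and word not in stop:
                terms.append((lang, word))
    return terms


# English sentence with an embedded French word (accent omitted, as noted above).
tagged = [("en", "I drank the"), ("fr", "the"), ("en", "in the garden")]

# The same words with no embedded language marking: everything is treated as English.
untagged = [("en", "I drank the the in the garden")]

print(index_terms(tagged))
print(index_terms(untagged))
```

With the embedded marking, the French "the" survives into the index; without 
it, the word is discarded along with the English article.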

In message <129701c4fe94$75b7a100$6501a8c0@sanjose.ibm.com>, Mark Davis 
<mark.davis@jtcsv.com> writes
>I agree. Big gray area. And in practice, I suspect that 99.428571% of the
>time, even if someone *could* annotate a document to indicate that an
>embedded word, sentence, or phrase is French instead of English, they won't.
>So one certainly couldn't depend on it happening in arbitrary documents,
>even if the capability is there (eg in HTML or XML).
>On the other hand, within a closed environment, such as a linguistics
>research project, such capabilities might be used, and then depended on.
>----- Original Message -----
>From: "Peter Constable" <petercon@microsoft.com>
>To: <www-rdf-interest@w3.org>; <www-international@w3.org>;
>Sent: Wednesday, January 19, 2005 17:26
>Subject: RE: Language X within scope of language Y
>> From: Mark Davis [mailto:mark.davis@jtcsv.com]
>> Also, because words get adopted over time, and become "more and more"
>> considered a natural part of the language.
>I almost mentioned lexical borrowings as an issue. To my knowledge (though
>language contact is not an area of expertise for me) linguists have not
>established agreed-upon criteria by which to decide that lexical borrowing
>has become fully incorporated into another language. The process is
>certainly a gradual one.
>So, for instance, most English speakers would have no clue that
>"conversation" came into English from French. Most are probably aware that
>"faux pas" comes from French, but would not be conscious of that each time
>they use it. (Another example that's probably further along in
>internalization would be American usage of "foyer" in which the
>pronunciation is fully Anglicised (Canadians and Brits say /foije/.) I'd
>guess that most times an English speaker uses "nom de plume" or "bête noir"
>they are conscious of the French origin though some might feel this has been
>adopted into their working vocabulary. And I'd guess any time an English
>speaker used "adieu", "un bon idée" or several other expressions they might
>use in an English conversation that they'd perceive themselves as having
>switched temporarily to French.
>So, at what point do you tag xml:lang="en" versus xml:lang="fr"? There are
>no well-defined answers to this one.
>Peter Constable
>Ietf-languages mailing list

David Clarke
Received on Thursday, 20 January 2005 08:21:03 UTC
