
A question about UTF-8 encoding in UNIMARC

From: xietao <xietao@datatrans.com.cn>
Date: Sun, 31 Mar 2002 10:39:53 +0800
Message-Id: <200203310234.VAA19545@www19.w3.org>
To: "www-zig@w3.org" <www-zig@w3.org>
Dear all,

I have a question about UTF-8 encoding in the UNIMARC (ISO 2709) format.


In the USMARC format, leader position 09 is a one-character code that indicates the character coding scheme:

09 - Character coding scheme
Identifies the character coding scheme used in the record. 
# - MARC-8
a - UCS/Unicode

And in
The encoding of Unicode characters will be according to the rules of UTF-8 (UCS Transformation Formats-8) which uses designated bits to indicate whether a UCS/Unicode character is represented by 1 octet (8-bits) or multiple octets. This encoding has the advantage of allowing the Basic Latin (ASCII) subset of the MARC 21 repertoire to be encoded the same as in MARC-8 (with 1 octet), thus preserving the basic structural elements of the MARC 21 record, while enabling record content to be multiscript. A brief description of UTF-8 encoding follows, but a fuller description is carried in the UCS and Unicode standards. 
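To make the quoted passage concrete, here is a minimal sketch in Python showing how many octets UTF-8 uses for a few sample characters. The helper name `utf8_octets` and the sample characters are only illustrations and do not come from any MARC document.

```python
def utf8_octets(ch: str) -> list[int]:
    """Return the UTF-8 octets of a single character as a list of integers."""
    return list(ch.encode("utf-8"))


if __name__ == "__main__":
    # One Basic Latin character, one accented Latin character, one CJK character.
    for ch in ["A", "\u00e9", "\u4e2d"]:
        octets = utf8_octets(ch)
        hex_form = " ".join(f"{b:02X}" for b in octets)
        print(f"U+{ord(ch):04X}: {len(octets)} octet(s): {hex_form}")
```

The Basic Latin character takes a single octet, identical to its ASCII value, which is why the structural elements of the record stay unchanged.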

So I can identify the encoding of an ISO 2709 record from its first 24 bytes (the record leader).
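The check described above could be sketched roughly as follows in Python. This assumes the USMARC values listed earlier, and that the "#" in the documentation stands for a blank (space) character; the function name `usmarc_coding_scheme` is hypothetical. It says nothing about UNIMARC, where position 09 is undefined.

```python
LEADER_LENGTH = 24          # an ISO 2709 record starts with a 24-byte leader
CODING_SCHEME_POSITION = 9  # leader position 09 (zero-based)


def usmarc_coding_scheme(record: bytes) -> str:
    """Return the coding scheme named by leader position 09 of a USMARC record."""
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than the 24-byte leader")
    code = record[CODING_SCHEME_POSITION:CODING_SCHEME_POSITION + 1]
    if code == b" ":   # assumption: '#' in the documentation denotes a blank
        return "MARC-8"
    if code == b"a":
        return "UCS/Unicode"
    return "unknown"


if __name__ == "__main__":
    # A made-up leader with 'a' in position 09, for illustration only.
    leader = bytearray(b"0" * LEADER_LENGTH)
    leader[CODING_SCHEME_POSITION] = ord("a")
    print(usmarc_coding_scheme(bytes(leader)))
```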


But in the UNIMARC format, leader position 09 is undefined:

9 Undefined
Contains a blank. 

Can I use leader position 09 in UNIMARC for this purpose?

If the information about the character encoding, such as UTF-8, is stored in a field instead, which field is it? And does that mean I have to read the field content before I can know the character encoding?

If I export a record whose character set is UCS-2 to an ISO 2709 file encoded in UTF-8, must I also change the contents of any fields that describe the character encoding?
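For the conversion step itself, here is a minimal sketch in Python. It assumes the UCS-2 data can be decoded with Python's UTF-16 codecs and that the byte order is known in advance; the function name `ucs2_to_utf8` is hypothetical. It covers only the re-encoding of the text, not which fields, if any, must be updated.

```python
def ucs2_to_utf8(data: bytes, big_endian: bool = True) -> bytes:
    """Re-encode UCS-2 text as UTF-8 (assumes the data is valid for the UTF-16 codec)."""
    codec = "utf-16-be" if big_endian else "utf-16-le"
    return data.decode(codec).encode("utf-8")


if __name__ == "__main__":
    ucs2 = "\u4e2d\u6587".encode("utf-16-be")  # two CJK characters, 2 bytes each
    utf8 = ucs2_to_utf8(ucs2)
    print(len(ucs2), len(utf8))  # the byte length differs between the two encodings
```

Because the byte length of the text changes, any length or offset in the record that is counted in bytes would presumably have to be recalculated after the conversion.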


DataTrans Software Corp. Ltd.
Received on Saturday, 30 March 2002 21:34:16 UTC
