Re: BOMs from Martin J. Dürst on 2013-11-19 (www-tag@w3.org from November 2013)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 19 Nov 2013 13:32:37 +0900
To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
CC: Bjoern Hoehrmann <derhoermi@gmx.net>, IETF Discussion <ietf@ietf.org>, JSON WG <json@ietf.org>, Anne van Kesteren <annevk@annevk.nl>, www-tag@w3.org, es-discuss <es-discuss@mozilla.org>
Message-ID: <528AE9E5.3000704@it.aoyama.ac.jp>

Okay, here are some more tests.

http://www.sw.it.aoyama.ac.jp/2013/pub/json_tests/test1_utf8_nobom.json
http://www.sw.it.aoyama.ac.jp/2013/pub/json_tests/test2_utf8_bom.json

Both are self-describing JSON files served as application/json; the
first has no BOM, and the second starts with one.

They contain some Japanese, and a tiny bit of Spanish.
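
For anyone who wants to see the difference between the two variants
locally without fetching anything, here is a minimal Python sketch. The
payload text and the helper name has_utf8_bom are made up for
illustration; they are not the actual contents of the files above.

```python
import codecs


def has_utf8_bom(payload: bytes) -> bool:
    """Return True if the payload starts with the UTF-8 signature EF BB BF."""
    return payload.startswith(codecs.BOM_UTF8)


# Hypothetical stand-ins for the two test files (not their real contents).
text = '{"ja": "日本語", "es": "español"}'
without_bom = text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

print(has_utf8_bom(without_bom), has_utf8_bom(with_bom))
```

The two byte strings differ only in the three leading signature bytes,
so decoding with_bom using Python's "utf-8-sig" codec gives back the
same text as decoding without_bom using "utf-8".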

[see more below]

On 2013/11/18 21:59, Henry S. Thompson wrote:
> Bjoern Hoehrmann writes:
>
>> Perl's JSON module gives me
>>
>>    malformed JSON string, neither array, object, number, string
>>    or atom, at character offset 0 (before "\x{ef}\x{bb}\x{bf}[]")
>>
>> Python's json module gives me
>>
>>    ValueError: No JSON object could be decoded
>>
>> Go's "encoding/json" module gives me
>>
>>    invalid character 'ï' looking for beginning of value
>
> I'm curious to know what level you're invoking the parser at.  As
> implied by my previous post about the Python 'requests' package, it
> handles application/json resources by stripping any initial BOM it
> finds -- you can try this with
>
> >>> import requests
> >>> r = requests.get("http://www.ltg.ed.ac.uk/ov-test/b16le.json")
> >>> r.json()

I get a 404 on this example. I can put up UTF-16 examples, too.
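
In the meantime, the difference between the two levels can be tried
without any network access. The following is only a rough sketch under
my own assumptions: the function name loads_payload and the signature
table are invented for illustration, it handles only UTF-8 and UTF-16
signatures (not UTF-32), and it is not meant to describe what
'requests' does internally.

```python
import codecs
import json

# Signatures this sketch recognises; UTF-32 is deliberately left out.
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def loads_payload(payload: bytes):
    """Payload level: strip a leading signature, decode, then parse."""
    for bom, encoding in _SIGNATURES:
        if payload.startswith(bom):
            return json.loads(payload[len(bom):].decode(encoding))
    # No signature found: assume UTF-8 (an assumption of this sketch).
    return json.loads(payload.decode("utf-8"))


# Parser level: a string that still begins with U+FEFF is rejected.
try:
    json.loads("\ufeff[]")
except ValueError as err:
    print("parser level:", err)

# Payload level: the same data with a signature is accepted.
print(loads_payload(codecs.BOM_UTF8 + b"[]"))
print(loads_payload(codecs.BOM_UTF16_LE + '["caf\u00e9"]'.encode("utf-16-le")))
```

The point of the split is that the signature is dealt with while turning
bytes into characters, so the JSON parser itself never sees it.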

Regards,   Martin.

> Signatures are not part of the text of a document, as the UNICODE spec
> makes clear, so asking what happens when you pass a string beginning
> with a BOM to a parser is not really the right question in this
> context, is it?
>
> As I tried to say in an earlier post, there's a distinction which
> needs to be carefully insisted on between, on the one hand, languages
> and their parsers, where I agree signatures/BOMs have no place, and,
> on the other hand, (media-typed) resources/entities/payloads and _their_
> processing, where a discussion of BOMs/signatures _is_ appropriate
> and, often, necessary.
>
> BTW I agree that the status of the UTF-8 BOM as signature is slightly
> hazy, but again the UNICODE spec itself [1] says
>
>    "this sequence can serve as signature for UTF-8 encoded text where
>     the character set is unmarked"
>
> ht
>
> [1] http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

Received on Tuesday, 19 November 2013 04:33:55 UTC