- From: Geoffrey Sneddon <foolistbar@googlemail.com>
- Date: Sun, 21 Dec 2008 17:19:39 +0000
On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:

> I suppose the big pivot point is "as if". A byte-wise implementation
> would replace character globally with byte, and any U+xxxx designation
> with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
> the actual algorithm implementation, no?

It states that what is done must be wholly equivalent to the given algorithm.

>> But an HTML5 implementation, according to the spec, must at a minimum
>> support the UTF-8 and Windows-1252 encodings, so the overall
>> implementation might not depending on exactly how this is done.
>
> The plan is to convert Windows-1252 into UTF-8 before processing; with a
> reasonably good iconv implementation, support for lots of encodings is
> possible. The implementation might not be fully conforming if iconv
> doesn't perform the proper (possibly context-sensitive; I haven't
> checked) substitution when it doesn't recognize a character, but it
> should be close.

I've never seen any way of getting iconv (at least via PHP) to do what HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is, however, possible using mbstring (which also has the advantage of not being system dependent), as well as with PHP6's Unicode support.

-- 
Geoffrey Sneddon
<http://gsnedders.com/>
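
For illustration only, here is a minimal sketch of the substitution discussed above: decoding a byte stream and replacing anything invalid with U+FFFD. It is written in Python rather than PHP and relies on Python's built-in codec error handling, not on iconv or mbstring. The function names are hypothetical, and Python's cp1252 codec is only an approximation of the Windows-1252 handling HTML 5 specifies.

```python
def decode_utf8_lenient(raw: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for byte sequences that are not valid UTF-8.

    Assumption: Python's "replace" error handler stands in for the
    replacement behaviour described in the message; it is not the
    HTML 5 algorithm itself.
    """
    return raw.decode("utf-8", errors="replace")


def windows_1252_to_utf8(raw: bytes) -> bytes:
    """Transcode Windows-1252 input to UTF-8 before further processing.

    Bytes that Python's cp1252 codec leaves undefined become U+FFFD
    instead of raising an error.
    """
    return raw.decode("cp1252", errors="replace").encode("utf-8")


if __name__ == "__main__":
    # 0xFF can never appear in well-formed UTF-8, so it is replaced.
    print(repr(decode_utf8_lenient(b"caf\xff")))
    # 0x93 and 0x94 are the curly double quotes in Windows-1252.
    print(windows_1252_to_utf8(b"\x93hi\x94").decode("utf-8"))
```

The point of the sketch is that the decoder never aborts on bad input: every undecodable byte still produces a character, which is the property the iconv-based approach was missing.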
Received on Sunday, 21 December 2008 09:19:39 UTC