Re: [whatwg/encoding] Explain the relationship between windows-1252, Latin1, and ASCII (PR #345)

@annevk commented on this pull request.



> + <p>The <a>windows-1252</a> <a for=/>encoding</a> has various <a for=encoding>labels</a> like
+ "<code>latin1</code>", "<code>iso-8859-1</code>", "<code>ascii</code>", etc. which have

Using both "like" and "etc." seems redundant. Maybe "labels, such as latin1, iso-..., and ascii, which have ..."?

> @@ -732,6 +747,30 @@ part of the ISO 8859 series. In particular, the necessity of the inclusion of <a
 and <a>ISO-8859-16</a> is doubtful for the purpose of supporting existing content, but there are no
 plans to remove these.</p>
 
+<div class=note id=note-latin1-ascii>
+ <p>The <a>windows-1252</a> <a for=/>encoding</a> has various <a for=encoding>labels</a> like
+ "<code>latin1</code>", "<code>iso-8859-1</code>", "<code>ascii</code>", etc. which have
+ historically been confusing for developers. On the web, and in any software that seeks to be
+ web-compatible by implementing the Encoding Standard, these are synonyms: "<code>latin1</code>" and
+ "<code>ascii</code>" are just labels for <a>windows-1252</a>, and any software following this
+ standard will, for example, decode 0x80 as U+20AC (€) when asked for the Latin1 or ASCII decoding
+ of that byte.
+
+ <p>Software that does not follow the Encoding Standard does not always give the same answers. The
+ root of this is that the original document that specified Latin1 (ISO/IEC 8859-1), did not provide
+ any mappings for bytes in the inclusive ranges 0x00–0x1F or 0x7F–0x9F. Similarly, the original

Nit: I think that whenever we talk about ranges, we always use the form "0x7F to 0x9F".

> @@ -732,6 +747,30 @@ part of the ISO 8859 series. In particular, the necessity of the inclusion of <a
 and <a>ISO-8859-16</a> is doubtful for the purpose of supporting existing content, but there are no
 plans to remove these.</p>
 
+<div class=note id=note-latin1-ascii>
+ <p>The <a>windows-1252</a> <a for=/>encoding</a> has various <a for=encoding>labels</a> like
+ "<code>latin1</code>", "<code>iso-8859-1</code>", "<code>ascii</code>", etc. which have
+ historically been confusing for developers. On the web, and in any software that seeks to be
+ web-compatible by implementing the Encoding Standard, these are synonyms: "<code>latin1</code>" and
+ "<code>ascii</code>" are just labels for <a>windows-1252</a>, and any software following this
+ standard will, for example, decode 0x80 as U+20AC (€) when asked for the Latin1 or ASCII decoding
+ of that byte.

The problem I have with this is that browsers typically have "Latin1" code paths that follow the Unicode view of the world (each byte maps directly to the code point of the same value, U+0000 to U+00FF) rather than windows-1252. So for complicated software, the answer very much depends on how and what you ask.

I also don't really have a good rephrasing that would account for that. Maybe put Latin1 and ASCII in quotes like below?
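
To illustrate how the answer depends on what you ask, here is a minimal sketch using Python's built-in codecs. This is an assumption chosen only for illustration: it is not part of the PR, and Python's codec names do not follow the Encoding Standard's label mapping. The helper `decode_or_error` is hypothetical.

```python
# Illustrative only: Python's built-in codecs are an example of software
# that does not follow the Encoding Standard's label mapping.
BYTE = bytes([0x80])


def decode_or_error(data: bytes, codec: str) -> str:
    """Return the decoded string, or the name of the exception raised."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# Ask for the same byte under three names that the Encoding Standard
# treats as labels of a single encoding.
results = {
    codec: decode_or_error(BYTE, codec)
    for codec in ("cp1252", "latin-1", "ascii")
}

for codec, outcome in results.items():
    if len(outcome) == 1:
        print(f"{codec}: U+{ord(outcome):04X}")
    else:
        print(f"{codec}: {outcome}")
```

Here `cp1252` yields U+20AC, `latin-1` yields U+0080, and `ascii` raises `UnicodeDecodeError`, whereas software following the Encoding Standard would give U+20AC for all three labels.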

> @@ -732,6 +747,30 @@ part of the ISO 8859 series. In particular, the necessity of the inclusion of <a
 and <a>ISO-8859-16</a> is doubtful for the purpose of supporting existing content, but there are no
 plans to remove these.</p>
 
+<div class=note id=note-latin1-ascii>
+ <p>The <a>windows-1252</a> <a for=/>encoding</a> has various <a for=encoding>labels</a> like
+ "<code>latin1</code>", "<code>iso-8859-1</code>", "<code>ascii</code>", etc. which have
+ historically been confusing for developers. On the web, and in any software that seeks to be
+ web-compatible by implementing the Encoding Standard, these are synonyms: "<code>latin1</code>" and

How about "this standard" instead of "the Encoding Standard"?

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/pull/345#pullrequestreview-2767568493

Message ID: <whatwg/encoding/pull/345/review/2767568493@github.com>

Received on Tuesday, 15 April 2025 09:46:07 UTC