[Bug 921] New: Validator inserts text in the middle of a UTF-8 character

http://www.w3.org/Bugs/Public/show_bug.cgi?id=921

           Summary: Validator inserts text in the middle of a UTF-8
                    character
           Product: Validator
           Version: 0.6.7
          Platform: All
               URL: http://validator.w3.org/check?uri=http://forum.druzya.org
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: check
        AssignedTo: link@pobox.com
        ReportedBy: bdew@bdew.yi.org
         QAContact: www-validator-cvs@w3.org


I've discovered a bug in the checker. I tried to check my site 
(http://forum.druzya.org) and the first error message looked broken. This is 
roughly what appeared on my screen; as you can see, some of the HTML code 
produced by the validator leaks into the displayed text:

...663b" title="Список форуммstrong title="Position where error was detected.">?
в Друзей" />

Here's a hex dump of the HTML that produced this:

      0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
000: 3C 6C 69 3E 3C 70 3E 3C 65 6D 3E 4C 69 6E 65 20  <li><p><em>Line
010: 37 2C 20 63 6F 6C 75 6D 6E 20 31 30 33 3C 2F 65  7, column 103</e
020: 6D 3E 3A 20 3C 73 70 61 6E 20 63 6C 61 73 73 3D  m>: <span class=
030: 22 6D 73 67 22 3E 63 68 61 72 61 63 74 65 72 20  "msg">character
040: 64 61 74 61 20 69 73 20 6E 6F 74 20 61 6C 6C 6F  data is not allo
050: 77 65 64 20 68 65 72 65 3C 2F 73 70 61 6E 3E 3C  wed here</span><
060: 2F 70 3E 3C 70 3E 3C 63 6F 64 65 20 63 6C 61 73  /p><p><code clas
070: 73 3D 22 69 6E 70 75 74 22 3E 2E 2E 2E 63 36 64  s="input">...c6d
080: 36 26 23 33 34 3B 20 74 69 74 6C 65 3D 26 23 33  6&#34; title=&#3
090: 34 3B D0 A1 D0 BF D0 B8 D1 81 D0 BE D0 BA 20 D1  4;............ .
0A0: 84 D0 BE D1 80 D1 83 D0 BC D0 3C 73 74 72 6F 6E  ..........<stron

As you can see, it's UTF-8, and at 0x0A9 there is the beginning of a UTF-8 
character that has been split in two by the message markup. The first 
character of the message HTML ("<") got processed as the second byte of that 
character.
                                  
0B0: 67 20 74 69 74 6C 65 3D 22 50 6F 73 69 74 69 6F  g title="Positio
0C0: 6E 20 77 68 65 72 65 20 65 72 72 6F 72 20 77 61  n where error wa
0D0: 73 20 64 65 74 65 63 74 65 64 2E 22 3E BE 3C 2F  s detected.">.</

The byte at position 0x0DD (BE, the second half of the split character) seems 
to be the character that the checker complains about, and I don't see 
anything wrong with it, so this is probably a bug too.

0E0: 73 74 72 6F 6E 67 3E D0 B2 20 D0 94 D1 80 D1 83  strong>.. ......
0F0: D0 B7 D0 B5 D0 B9 26 23 33 34 3B 20 2F 26 23 36  ......&#34; /&#6
100: 32 3B 3C 2F 63 6F 64 65 3E 3C 2F 70 3E           2;</code></p>
[END OF HEXDUMP]
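
The split at 0x0A9 can be reproduced outside the validator. The following is 
only a minimal sketch in Python, not the validator's own code: the title 
bytes and the marker text are copied from the dump above, while the helper 
names (mark_at_byte, mark_at_char, is_valid_utf8) and the offsets 24 and 12 
are assumptions made up for this illustration. It shows that wrapping a 
single byte in markup breaks a two-byte character, whereas wrapping a whole 
character does not.

```python
# UTF-8 bytes of the title text as they appear in the dump, markup removed.
title_utf8 = bytes.fromhex(
    "D0A1 D0BF D0B8 D181 D0BE D0BA 20"       # "Список "
    "D184 D0BE D180 D183 D0BC D0BE D0B2 20"  # "форумов "
    "D094 D180 D183 D0B7 D0B5 D0B9"          # "Друзей"
)

# Marker text as it appears in the validator output.
MARK_OPEN = b'<strong title="Position where error was detected.">'
MARK_CLOSE = b"</strong>"


def mark_at_byte(data: bytes, offset: int) -> bytes:
    """Wrap the single byte at `offset`, ignoring character boundaries."""
    return (data[:offset] + MARK_OPEN + data[offset:offset + 1]
            + MARK_CLOSE + data[offset + 1:])


def mark_at_char(text: str, index: int) -> bytes:
    """Wrap the whole character at `index`, then encode to UTF-8."""
    marked = (text[:index] + MARK_OPEN.decode("ascii") + text[index]
              + MARK_CLOSE.decode("ascii") + text[index + 1:])
    return marked.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Byte 24 is the BE half of the second "о" in "форумов" (bytes 23-24).
broken = mark_at_byte(title_utf8, 24)
# Character 12 is that same "о", counted in characters instead of bytes.
good = mark_at_char(title_utf8.decode("utf-8"), 12)

print(is_valid_utf8(broken), is_valid_utf8(good))
```

The broken variant contains the same D0 3C ... 3E BE 3C sequence that is 
visible at 0x0A9 and 0x0DD in the dump.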

This is how it looked in the original file (it was encoded in CP-1251; the 
recoding to UTF-8 was done by the checker):

      0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
000: 3C 6C 69 6E 6B 20 72 65 6C 3D 22 74 6F 70 22 20   <link rel="top"
010: 68 72 65 66 3D 22 2E 2F 69 6E 64 65 78 2E 70 68   href="./index.ph
020: 70 3F 73 69 64 3D 30 34 34 38 62 32 66 62 61 63   p?sid=0448b2fbac
030: 38 38 66 31 65 39 62 31 66 35 65 65 39 37 36 63   88f1e9b1f5ee976c
040: 64 65 30 36 38 32 22 20 74 69 74 6C 65 3D 22 D1   de0682" title="С
050: EF E8 F1 EE EA 20 F4 EE F0 F3 EC EE E2 20 C4 F0   писок форумов Др

It seems that it barfed at 0x05B. As I said, I see nothing wrong with this 
character whatsoever.

060: F3 E7 E5 E9 22 20 2F 3E                           &#1108;&#1095;&#1093;&#1097;" />
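
To relate offsets in the two dumps, here is a small sketch (again in Python 
and again only an illustration, not what the checker actually runs) that 
recodes the title bytes from the original file from CP-1251 to UTF-8. The 
helper name utf8_offset is an assumption for this example. In CP-1251 every 
character is one byte, so the byte at 0x05B is simply character 12 of the 
title (which starts at 0x04F); after recoding, the same character occupies 
two bytes at a different offset.

```python
# Title bytes from the original file, 0x04F through 0x063 (CP-1251).
original = bytes.fromhex(
    "D1 EF E8 F1 EE EA 20 F4 EE F0 F3 EC EE E2 20 C4 F0 F3 E7 E5 E9"
)

text = original.decode("cp1251")  # one byte per character
recoded = text.encode("utf-8")    # Cyrillic letters take two bytes each


def utf8_offset(s: str, char_index: int) -> int:
    """Byte offset in the UTF-8 encoding where character `char_index` starts."""
    return len(s[:char_index].encode("utf-8"))


char_index = 0x05B - 0x04F        # position of the flagged byte in the title
start = utf8_offset(text, char_index)
end = utf8_offset(text, char_index + 1)

print(text[char_index], char_index, start, end, recoded[start:end].hex())
```

Any position computed against one encoding has to be mapped like this before 
it is used to cut the other; cutting at start + 1 lands between D0 and BE.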

That's all. I hope that my bug report helps (and that it won't be corrupted 
because of all those chars :)




Received on Sunday, 17 October 2004 22:58:17 UTC