[Bug 26278] getElementText - no info about U+200E, U+200F

https://www.w3.org/Bugs/Public/show_bug.cgi?id=26278

Andrey Botalov <botalov.andrey@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #3 from Andrey Botalov <botalov.andrey@gmail.com> ---
There are other whitespace and BiDi characters in
http://www.unicode.org/Public/6.3.0/ucd/PropList.txt and
http://en.wikipedia.org/wiki/Space_(punctuation)#Spaces_in_Unicode.

I think that if only \u200b, \u200e, \u200f, \v and \f are to be removed from
the string by getElementText(), then the spec should also contain an
explanation (a note) of what makes those characters special and why other
invisible "spaces" shouldn't be removed.
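
To make the concern concrete, here is a minimal sketch of stripping only
those five characters. The helper name stripListedChars is hypothetical and
is not taken from the spec or from any driver; the sketch only shows that
another invisible character, such as U+2060, would be left in place.

```javascript
// Hypothetical helper: removes only the five characters named above.
// This is an illustration, not the algorithm from the spec.
function stripListedChars(text) {
  return text.replace(/[\u200b\u200e\u200f\v\f]/g, "");
}

// U+200E (LEFT-TO-RIGHT MARK) is removed, U+2060 (WORD JOINER) is kept.
const stripped = stripListedChars("a\u200eb\u2060c");
console.log(stripped.length); // 4
console.log(stripped === "ab\u2060c"); // true
```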

I don't know much about Unicode, but IMO these "spaces" also look
zero-width:
U+180E
U+200C
U+2060
U+061C
etc.

I also found this line in the gecko-dev repository:
https://github.com/mozilla/gecko-dev/blob/master/browser/base/content/browser.js#L2205

> value = value.replace(/[\u00ad\u034f\u061c\u115f-\u1160\u17b4-\u17b5\u180b-\u180d\u200b\u200e-\u200f\u202a-\u202e\u2060-\u206f\u3164\ufe00-\ufe0f\ufeff\uffa0\ufff0-\ufff8]|\ud834[\udd73-\udd7a]|[\udb40-\udb43][\udc00-\udfff]/g, encodeURIComponent);

It seems that the implementation in Firefox is a bit more complicated.
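
As a rough sketch of what the quoted line does in isolation: the regular
expression below is copied from it, while the wrapper name encodeInvisibles
is hypothetical and the surrounding Firefox code is not reproduced. Each
matched character is percent-encoded rather than removed.

```javascript
// Regular expression copied from the quoted browser.js line.
const invisibles = /[\u00ad\u034f\u061c\u115f-\u1160\u17b4-\u17b5\u180b-\u180d\u200b\u200e-\u200f\u202a-\u202e\u2060-\u206f\u3164\ufe00-\ufe0f\ufeff\uffa0\ufff0-\ufff8]|\ud834[\udd73-\udd7a]|[\udb40-\udb43][\udc00-\udfff]/g;

// Hypothetical wrapper, used only for this illustration.
function encodeInvisibles(value) {
  // Every match is replaced by its percent-encoded UTF-8 form.
  return value.replace(invisibles, encodeURIComponent);
}

console.log(encodeInvisibles("a\u200eb\u2060c")); // a%E2%80%8Eb%E2%81%A0c
```

Of the four code points listed earlier, this pattern matches U+2060 and
U+061C but not U+180E or U+200C.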

Received on Wednesday, 16 July 2014 20:02:28 UTC