Re: Unicode Character 'PILE OF POO' (U+1F4A9) and validator test suite

Hi Mark,

Mark Rogers <mark.rogers@powermapper.com>, 2014-11-08 13:59 -0600:

> Is the Unicode character U+1F4A9 used in the conformance checker test
> suite for URLs really invalid?

No, it's valid. Thanks for catching this and taking the time to report it.

> It’s marked as novalid in test suite files like:
> 
> conformance-checkers/html/elements/a/href/userinfo-username-contains-pile-of-poo-novalid.html

Yeah, I'll need to fix that. But before I do, I'll wait for a fix to the
upstream code of the URL parsing library the validator uses, called
galimatias. I've already filed a pull request with a proposed fix:

  https://github.com/smola/galimatias/pull/46

I expect that'll get fixed relatively soon.

> In RFC 3987 this character is listed in the 10000-1FFFD  range in the
> iuserinfo  -> iunreserved -> ucschar production:
> 
> iuserinfo      = *( iunreserved / pct-encoded / sub-delims / ":" )
> 
> iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
> 
>    ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>                   / %xD0000-DFFFD / %xE1000-EFFFD
> 
> In the Whatwg URL standard it’s listed as a valid URL code point, and
> will be converted to percent encoding during the normalisation process,
> but won’t flag an error. See
> https://url.spec.whatwg.org/#url-code-points
> https://url.spec.whatwg.org/#authority-state

Yup. Your reading of the spec is right. I'd made the mistake of being lazy
and having the test suite just follow the (buggy in this particular case)
behavior of galimatias on this, rather than checking it against the spec.
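
For anyone who wants to check the quoted productions by hand, here is a
rough Python sketch. It is not code from the validator or from galimatias;
the names UCSCHAR_RANGES and in_ucschar are made up for illustration, and
urllib.parse.quote is only an approximation of the URL Standard's
percent-encoding step (it encodes the character's UTF-8 bytes).

```python
from urllib.parse import quote

# ucschar ranges copied from the RFC 3987 production quoted above.
# UCSCHAR_RANGES and in_ucschar are hypothetical names for this sketch.
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]


def in_ucschar(ch: str) -> bool:
    """Return True if the single character ch matches the ucschar production."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


pile_of_poo = "\U0001F4A9"

# U+1F4A9 falls in the %x10000-1FFFD range, so it is allowed in iuserinfo.
print(in_ucschar(pile_of_poo))     # True

# Percent-encoding the UTF-8 bytes, roughly what URL normalisation produces.
print(quote(pile_of_poo, safe=""))  # %F0%9F%92%A9
```

So the character is allowed by the RFC 3987 grammar, and percent-encoding
it is a normalisation step rather than an error.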

I'll follow up here after I've got it all fixed.

  --Mike

-- 
Michael[tm] Smith https://people.w3.org/mike

Received on Monday, 10 November 2014 01:34:43 UTC