- From: Nils Dagsson Moskopp <nils@dieweltistgarnichtso.net>
- Date: Thu, 20 Dec 2012 05:50:06 +0100
- To: Philip Jägenstedt <philipj@opera.com>
- Cc: whatwg <whatwg@lists.whatwg.org>, Ian Hickson <ian@hixie.ch>
Philip Jägenstedt <philipj@opera.com> wrote on Wed, 19 Dec 2012
11:19:14 +0100:

> […]
>
> Redefining/extending what the fragment component does for HTML is
> somewhat risky, so it really comes down to what exactly the
> processing should be.
>
> What should a browser do with a URL ending with #foo&t=10 if there is
> an element on the page with id="foo&t=10"? What about #foo&bar if
> there is an element with id="foo"? I would be surprised if treating
> #foo& the same as #foo were Web compatible...

I wrote a script to extract attribute values from the archive of 8915
HTML pages (<http://html5accessibility.com/HTML5data/html.zip>). The
script and files containing all id and href attributes can be found at
<http://daten.dieweltistgarnichtso.net/src/htmlattrib>.

Following are some results for some possible sub-delimiters between
element id and media fragment, generated using the following script:

===== snip =====
#!/bin/sh
# Usage: pass the candidate delimiter as $1.
# grep interprets $1 as a regular expression, not as a literal string.

# Count id attributes containing the delimiter.
echo `grep "$1" html-attrib-id -c` \
 id attributes containing “"$1"”
# Count id attributes where the delimiter is later followed by “=”.
echo `grep "$1\S*=" html-attrib-id -c` \
 id attributes containing “"$1"” followed by something containing “=”
# Same two counts for the fragment part (after “#”) of href attributes.
echo `grep '#' html-attrib-href | cut -d'#' -f2 | grep "$1" -c` \
 href attributes containing “"$1"” in fragment
echo `grep '#' html-attrib-href | cut -d'#' -f2 | grep "$1\S*=" -c` \
 href attributes containing "$1" followed by something containing “=” \
 in fragment
===== snap =====

From the data set, it seems to me that U+007E TILDE would be a pretty
safe choice for separating element id and media fragment if processing
should be kept to a minimum (just splitting on the delimiter).

With more elaborate processing (only splitting on the delimiter if a
U+003D EQUALS SIGN appears after it), we might also use:

- U+0021 EXCLAMATION MARK
- U+0027 APOSTROPHE
- U+002A ASTERISK
- U+002C COMMA
- U+003B SEMICOLON
- U+0040 COMMERCIAL AT

I did check for characters U+0028 LEFT PARENTHESIS and U+0029 RIGHT
PARENTHESIS but did not include the results for aesthetic reasons.
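The "more elaborate processing" described above (split on the delimiter
only if an equals sign appears somewhere after it) could be sketched in
POSIX shell roughly as follows. The function name split_fragment and
its two-line output format are assumptions made for illustration; they
are not part of the original script or of any specification.

```shell
#!/bin/sh
# split_fragment DELIM FRAGMENT (hypothetical helper, for illustration)
# Prints the element id on line 1 and the media fragment on line 2.
# Line 2 is empty when the fragment is not split.
split_fragment() {
    delim=$1
    frag=$2
    case $frag in
        # Quoting "$delim" makes the shell match it literally,
        # so characters such as * are not treated as wildcards.
        *"$delim"*=*)
            id=${frag%%"$delim"*}     # text before the first delimiter
            media=${frag#*"$delim"}   # text after the first delimiter
            printf '%s\n%s\n' "$id" "$media"
            ;;
        *)
            # No delimiter followed by "=": treat everything as the id.
            printf '%s\n\n' "$frag"
            ;;
    esac
}
```

Under these assumptions, split_fragment '~' 'foo~t=10' would yield the
id "foo" and the media fragment "t=10", while split_fragment '&'
'foo&bar' would leave "foo&bar" intact as an id.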
My shell script also does weird things when given an argument of
U+002D HYPHEN-MINUS, so that is missing as well.

Any faults in my reasoning? Also, where do I get a bigger data set?

[Boring stuff follows]

Regarding U+0021 EXCLAMATION MARK:

4 id attributes containing “!”
0 id attributes containing “!” followed by something containing “=”
2232 href attributes containing “!” in fragment
630 href attributes containing ! followed by something containing “=” in fragment

Regarding U+0024 DOLLAR SIGN:

558023 id attributes containing “$”
1 id attributes containing “$” followed by something containing “=”
89837 href attributes containing “$” in fragment
0 href attributes containing $ followed by something containing “=” in fragment

Regarding U+0026 AMPERSAND:

78 id attributes containing “&”
56 id attributes containing “&” followed by something containing “=”
1362 href attributes containing “&” in fragment
1346 href attributes containing & followed by something containing “=” in fragment

Regarding U+0027 APOSTROPHE:

23 id attributes containing “'”
0 id attributes containing “'” followed by something containing “=”
339 href attributes containing “'” in fragment
9 href attributes containing ' followed by something containing “=” in fragment

Regarding U+002A ASTERISK:

19 id attributes containing “*”
0 id attributes containing “*” followed by something containing “=”
18 href attributes containing “*” in fragment
0 href attributes containing * followed by something containing “=” in fragment

Regarding U+002B PLUS SIGN:

28 id attributes containing “+”
1 id attributes containing “+” followed by something containing “=”
93 href attributes containing “+” in fragment
19 href attributes containing + followed by something containing “=” in fragment

Regarding U+002C COMMA:

130 id attributes containing “,”
0 id attributes containing “,” followed by something containing “=”
428 href attributes containing “,” in fragment
10 href attributes containing , followed by something containing “=” in fragment

Regarding U+003B SEMICOLON:

88 id attributes containing “;”
0 id attributes containing “;” followed by something containing “=”
222 href attributes containing “;” in fragment
8 href attributes containing ; followed by something containing “=” in fragment

Regarding U+0040 COMMERCIAL AT:

8 id attributes containing “@”
0 id attributes containing “@” followed by something containing “=”
208 href attributes containing “@” in fragment
15 href attributes containing @ followed by something containing “=” in fragment

Regarding U+007E TILDE:

2 id attributes containing “~”
0 id attributes containing “~” followed by something containing “=”
1 href attributes containing “~” in fragment
0 href attributes containing ~ followed by something containing “=” in fragment

-- 
Nils Dagsson Moskopp // erlehmann
<http://dieweltistgarnichtso.net>
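A note on the counts above: grep treats its pattern as a regular
expression, so a bare “$” matches the end of every line, and a pattern
that starts with “-” can be parsed as command-line options. This may
explain the very large DOLLAR SIGN counts and the trouble with
HYPHEN-MINUS. The following is a minimal sketch of literal counting;
the function names are hypothetical, and the second function is a
simplification that looks for “=” anywhere after the first delimiter
rather than reproducing the original "\S*=" pattern exactly.

```shell
#!/bin/sh
# count_literal DELIM FILE (hypothetical helper)
# Counts lines of FILE that contain DELIM as a literal string.
count_literal() {
    # -F: fixed string, no regex metacharacters
    # -e: the next argument is the pattern, even if it starts with "-"
    grep -c -F -e "$1" "$2"
}

# count_followed_by_equals DELIM FILE (hypothetical helper)
# Counts lines where "=" occurs somewhere after the first DELIM.
# Assumes DELIM contains no backslash, since awk -v expands escapes.
count_followed_by_equals() {
    awk -v d="$1" '
        {
            i = index($0, d)   # position of first literal delimiter
            if (i && index(substr($0, i + length(d)), "="))
                n++
        }
        END { print n + 0 }    # print 0 when nothing matched
    ' "$2"
}

# Example invocation against the id list from the original script:
#   count_literal '$' html-attrib-id
#   count_followed_by_equals '-' html-attrib-id
```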
Received on Thursday, 20 December 2012 04:51:28 UTC