- From: Simon Pieters <simonp@opera.com>
- Date: Mon, 08 Dec 2014 23:48:17 +0100
- To: "Ian Hickson" <ian@hixie.ch>
- Cc: whatwg@whatwg.org, Sanjoy Pal <sanjoy.pal@samsung.com>
On Mon, 08 Dec 2014 21:50:56 +0100, Simon Pieters <simonp@opera.com> wrote:
> SELECT COUNT(*) as num,
> CASE
> WHEN REGEXP_MATCH(LOWER(body),
> r'<menuitem[^>]*>(\s*[^<]+)+\s*</menuitem>') THEN "has content"
> ELSE "no content"
> END as stat
> FROM [httparchive:runs.2014_08_15_requests_body]
> WHERE mimeType CONTAINS "html"
> AND REGEXP_MATCH(LOWER(body), r'<menuitem')
> GROUP BY stat
> ORDER BY num desc
>
> Row num stat
> 1 10101 no content
Hixie pointed out that this doesn't catch <menuitem> elements whose content
is child elements rather than text. So flipping it to match empty elements
instead gives:
SELECT COUNT(*) as num,
CASE
WHEN REGEXP_MATCH(LOWER(body), r'<menuitem[^>]*>\s*</menuitem>') THEN
"no content"
ELSE "has content"
END as stat
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS "html"
AND REGEXP_MATCH(LOWER(body), r'<menuitem')
GROUP BY stat
ORDER BY num desc
Row num stat
1 10085 no content
2 16 has content
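
To see why the two queries disagree, here is a rough Python sketch that applies both regexes to a few made-up snippets. The sample strings and function names are hypothetical, and it assumes REGEXP_MATCH does a partial (search-style) match, which the r'<menuitem' filter above relies on anyway.

```python
import re

# Original test: text directly between <menuitem ...> and </menuitem>.
HAS_TEXT = re.compile(r'<menuitem[^>]*>(\s*[^<]+)+\s*</menuitem>')
# Flipped test: nothing but whitespace between the start and end tags.
IS_EMPTY = re.compile(r'<menuitem[^>]*>\s*</menuitem>')


def classify_original(body):
    """Mirror the first query: a text-content match means 'has content'."""
    return "has content" if HAS_TEXT.search(body.lower()) else "no content"


def classify_flipped(body):
    """Mirror the second query: an empty-element match means 'no content'."""
    return "no content" if IS_EMPTY.search(body.lower()) else "has content"


# Hypothetical snippets, not taken from the HTTP Archive data.
samples = {
    "empty": '<menuitem label="Save"></menuitem>',
    "text": '<menuitem>Save</menuitem>',
    "element child": '<menuitem><label>community</label></menuitem>',
    "no end tag": '<menuitem label="Save">',
}

for name, body in samples.items():
    print(name, "|", classify_original(body), "|", classify_flipped(body))
```

The "element child" snippet is the case the first regex misses, and the "no end tag" snippet shows why the flipped query also counts pages that never close the element.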
15 of these omit the end tag, as seen by the other query. So what is the
last one doing?
SELECT url,body
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS "html"
AND LOWER(body) CONTAINS '<menuitem'
AND LOWER(body) CONTAINS '</menuitem'
AND NOT REGEXP_MATCH(LOWER(body), r'<menuitem[^>]*>\s*</menuitem>')
Row url body
1 http://www.dod.gr/lib/menuData_v483.php <menus> <!-- BOTTOM NAVIGATION
MENU ---> <menu id="BottomNavigationMenu" type="main" x="30" y="30">
<menuitem x="120" y="120"> <image>community.swf</image>
<label>community</label> ...
Yep, mislabeled XML.
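
The filter in that last query can be sketched in Python roughly as follows. The helper name, page labels, and sample bodies are hypothetical; the "xml-menu" body is only loosely modelled on the result above, with closing tags made up for the sake of the example.

```python
import re

EMPTY_MENUITEM = re.compile(r'<menuitem[^>]*>\s*</menuitem>')


def has_nonempty_closed_menuitem(body):
    """True if the body has a menuitem start tag and end tag but no empty element."""
    lowered = body.lower()
    return ('<menuitem' in lowered
            and '</menuitem' in lowered
            and not EMPTY_MENUITEM.search(lowered))


# Hypothetical bodies, not real HTTP Archive rows.
pages = {
    "xml-menu": '<menus><menu id="m"><menuitem x="120"> '
                '<image>community.swf</image></menuitem></menu></menus>',
    "empty": '<menu><menuitem label="a"></menuitem></menu>',
    "script": 'for(i=0;i<menuitems.length;i++)',
}

matches = [name for name, body in pages.items()
           if has_nonempty_closed_menuitem(body)]
print(matches)
```

Only the XML-style body passes all three conditions. The script snippet contains '<menuitem' by accident (from i<menuitems) but has no end tag, so it is filtered out.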
For completeness, the 15 pages with no end tags fall into two categories:
* for(i=0;i<menuitems.length;i++)
* <xml id=""SolpartMenuDI"" onreadystatechange=""if (this.readyState ==
'complete') spm_initMyMenu(this,
spm_getById('dnn_dnnMENU_ctldnnMENU'))""><root><menuitem id=""2533""
title=""صفحه اصلی"" url=""/Default.aspx?tabid=2533"" lefthtml=""<img
alt="*" BORDER="0"
src="/images/breadcrumb.gif">"" css="" "" />
Previous conclusion stands. :-)
--
Simon Pieters
Opera Software
Received on Monday, 8 December 2014 22:47:01 UTC