[whatwg] Parsing of meta refresh needs tweaking

The spec's parsing rules for meta refresh cause infinite reloading on some
pages. In particular, the spec requires "url=" to be present, but there
are pages that omit it. IE9 apparently also requires "url=", whereas
Gecko, Blink and WebKit allow it to be omitted.

For example, there is http://www.only-for-winners.com/ which has

    <meta http-equiv="refresh" content="0;http://www.aldanitinetwork.com" />

Clearly this is intended to redirect, not reload the current page after 0  
seconds.
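
To make the difference concrete, here is a rough sketch of a lenient parser
that treats "url=" as optional. It is only an illustration, not the spec
algorithm and not any browser's actual code; the name parse_refresh and the
regular expression are made up for this example, and details such as quoted
URLs are ignored.

```python
import re

# Hypothetical pattern for illustration only (not the spec's algorithm).
_REFRESH = re.compile(
    r"""^\s*(\d+)            # delay in seconds, required in this sketch
        (?:\s*[;,]\s*        # separator between delay and target
           (?:url\s*=\s*)?   # optional url= prefix, matched case-insensitively
           (.*?)             # target URL, if any
        )?\s*$""",
    re.IGNORECASE | re.VERBOSE | re.DOTALL,
)


def parse_refresh(content):
    """Return (delay, url) for a refresh content value, or None on failure.

    url is None when no target is given, meaning "reload the current page".
    """
    match = _REFRESH.match(content)
    if match is None:
        return None  # e.g. no leading number
    delay = int(match.group(1))
    url = match.group(2) or None
    return delay, url


if __name__ == "__main__":
    print(parse_refresh("0;http://www.aldanitinetwork.com"))
    print(parse_refresh("0; URL=http://example.com/"))
```

With this approach the example above yields a redirect target instead of a
bare reload, and a value with no leading number fails to parse.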


I queried HTTP Archive for pages that have a meta refresh with a number and
a semicolon followed by something other than "url=":

SELECT page, COUNT(*) AS num
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE page = url
AND mimeType CONTAINS "html"
AND REGEXP_MATCH(LOWER(body), r"<meta\s+[^>]*http-equiv\s*=\s*[\"']?refresh")
AND REGEXP_MATCH(LOWER(body), r"<meta\s+[^>]*content\s*=\s*[\"']?\s*\d+\s*;\s*[^\"'>]")
AND NOT REGEXP_MATCH(LOWER(body), r"<meta\s+[^>]*content\s*=\s*[\"']?\s*\d+\s*;\s*url=")
GROUP BY page

The query returned 23 rows.
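
For anyone who wants to try the same three conditions on a single document
locally, here is a small sketch. The regular expressions are copied from the
query; the function name omits_url_prefix is just a placeholder chosen for
this example.

```python
import re

# Same three patterns as in the query, applied to the lowercased body.
HAS_REFRESH = re.compile(r"<meta\s+[^>]*http-equiv\s*=\s*[\"']?refresh")
HAS_TARGET = re.compile(r"<meta\s+[^>]*content\s*=\s*[\"']?\s*\d+\s*;\s*[^\"'>]")
HAS_URL_PREFIX = re.compile(r"<meta\s+[^>]*content\s*=\s*[\"']?\s*\d+\s*;\s*url=")


def omits_url_prefix(body):
    """True if the body matches the first two patterns but not the third."""
    text = body.lower()
    return bool(
        HAS_REFRESH.search(text)
        and HAS_TARGET.search(text)
        and not HAS_URL_PREFIX.search(text)
    )


if __name__ == "__main__":
    sample = '<meta http-equiv="refresh" content="0;http://www.aldanitinetwork.com" />'
    print(omits_url_prefix(sample))
```

Note that the three patterns are matched independently, so on a page with
several meta elements they can match different elements. The count is
therefore only an approximation.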

I also noticed that Gecko allows the number to be omitted. I only found
one page doing that, and it was using <meta http-equiv="refresh"
content=";URL=">, so it seems we can fail parsing for that case.

-- 
Simon Pieters
Opera Software

Received on Thursday, 11 December 2014 08:07:49 UTC