Re: [whatwg] Parsing of meta refresh needs tweaking

On 2014-12-11 09:09, Simon Pieters wrote:
> The spec's parsing rules for meta refresh cause infinite reloading on
> some pages. In particular, the spec requires the "url=" to be present,
> but there are pages that omit it. IE9 also requires "url=" apparently.
> Gecko/Blink/WebKit allow "url=" to be omitted.
>
> For example, there is http://www.only-for-winners.com/ which has
>
>     <meta http-equiv="refresh" content="0;http://www.aldanitinetwork.com" />
>
> Clearly this is intended to redirect, not reload the current page after
> 0 seconds.
>
>
> SELECT page, COUNT(*) AS num
> FROM [httparchive:runs.2014_08_15_requests_body]
> WHERE page = url
> AND mimeType CONTAINS "html"
> AND REGEXP_MATCH(LOWER(body),
> r"<meta\s+[^>]*http-equiv\s*=\s*[\"']?refresh")
> AND REGEXP_MATCH(LOWER(body),
> r"<meta\s+[^>]*content\s*=\s*[\"']?\s*\d+\s*;\s*[^\"'>]")
> AND NOT REGEXP_MATCH(LOWER(body),
> r"<meta\s+[^>]*content\s*=\s*[\"']?\s*\d+\s*;\s*url=")
> GROUP BY page
>
> 23 rows.
>
> I also noticed that Gecko allows the number to be omitted. I only found
> one page doing that and it was using <meta http-equiv="refresh"
> content=";URL="> so it seems we can fail parsing for that case.
>
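
For readers following the thread, the difference between the two behaviours described above can be sketched roughly as follows. This is a simplified model in Python, not the spec's actual algorithm or any browser's implementation: the function name parse_refresh and the flag require_url_prefix are hypothetical, and details such as quoting and fractional delays are ignored.

```python
import re
from typing import Optional, Tuple

# Simplified model (an assumption, not the spec text): a non-negative integer
# delay, then optionally ';' or ',' followed by the rest of the value.
_REFRESH = re.compile(r"^\s*(\d+)\s*(?:[;,]\s*(.*?))?\s*$", re.DOTALL)
_URL_PREFIX = re.compile(r"^url\s*=\s*", re.IGNORECASE)


def parse_refresh(
    content: str, require_url_prefix: bool
) -> Optional[Tuple[int, Optional[str]]]:
    """Return (delay, target), or None if parsing fails.

    A target of None means "reload the current page".
    """
    m = _REFRESH.match(content)
    if m is None:
        return None  # no leading number, e.g. content=";URL="
    delay = int(m.group(1))
    rest = m.group(2) or ""
    prefix = _URL_PREFIX.match(rest)
    if prefix:
        rest = rest[prefix.end():]  # drop the "url=" prefix
    elif require_url_prefix:
        rest = ""  # strict mode: a target without "url=" is ignored
    return delay, (rest or None)


value = "0;http://www.aldanitinetwork.com"
print(parse_refresh(value, require_url_prefix=True))   # (0, None)
print(parse_refresh(value, require_url_prefix=False))  # (0, 'http://www.aldanitinetwork.com')
```

In strict mode the example from the quoted message reloads the current page after 0 seconds, which is the infinite-reload case; in lenient mode it redirects. A value with no number, such as ";URL=", fails to parse in both modes.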

As I understand it, (a) these pages have been broken in IE for a long time,
and (b) your query found only 23 (?) such pages in your DB.

So why not just leave them broken?

Best regards, Julian

Received on Tuesday, 6 January 2015 12:41:35 UTC