Re: [whatwg/encoding] Amount of bytes to sniff for encoding detection (#102) from JinsukKim on 2017-05-19 (public-webapps-github@w3.org from May 2017)

From: JinsukKim <notifications@github.com>
Date: Thu, 18 May 2017 21:56:33 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/102/302608504@github.com>

Ran the detector over a substantial number of unlabelled documents and got the following stats:

input size     coverage (%)
     1K         84.36
     2K         92.86
     3K         96.28
     4K         98.60

(The remaining 1.40% are exceptions.)

84.36% of the unlabelled HTML documents return the correct text encoding when their first 1K is fed to the detector (meaning the guessed encoding remained the same even when we gave it a bigger chunk, 2K to 4K). That is fairly high, but it doesn't look like a decisive number either.
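For anyone who wants to reproduce this kind of measurement, here is a rough sketch of how the coverage column could be computed. It is not the script used for the numbers above. The `detect` callable is a hypothetical stand-in for whatever detector is being tested, and treating the guess on the full document as the reference is an assumption on my part (it is one way the 4K row can end up below 100%).

```python
from typing import Callable, Dict, Sequence

KIB = 1024


def coverage_by_prefix(
    docs: Sequence[bytes],
    detect: Callable[[bytes], str],  # hypothetical detector: bytes -> encoding label
    sizes: Sequence[int] = (1 * KIB, 2 * KIB, 3 * KIB, 4 * KIB),
) -> Dict[int, float]:
    """Percentage of docs whose guess at each prefix size is already stable.

    A prefix size counts as "covering" a document when the guess made from
    that prefix equals the guess made from every larger prefix and from the
    full document (assumption: the full-document guess is the reference).
    """
    if not docs:
        raise ValueError("need at least one document")

    stable_counts = {size: 0 for size in sizes}
    for doc in docs:
        # Guess for each prefix, followed by the guess for the whole document.
        guesses = [detect(doc[:size]) for size in sizes] + [detect(doc)]
        for i, size in enumerate(sizes):
            if all(g == guesses[i] for g in guesses[i:]):
                stable_counts[size] += 1

    return {size: 100.0 * stable_counts[size] / len(docs) for size in sizes}
```

Calling `coverage_by_prefix(docs, detect)` with a list of raw document bytes returns a dict keyed by prefix size in bytes, which can be printed in the same two-column layout as the table above.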


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/102#issuecomment-302608504

Received on Friday, 19 May 2017 04:57:11 UTC