Re: [whatwg/encoding] Amount of bytes to sniff for encoding detection (#102)

Ran the detector over a substantial number of unlabelled documents and got the following stats:

input size     coverage (%)
     1K         84.36
     2K         92.86
     3K         96.28
     4K         98.60

(The remaining 1.40% are exceptions.)

84.36% of the unlabelled HTML documents return the correct text encoding when only their first 1K is fed to the detector (meaning the guessed encoding stayed the same even when we gave it a bigger chunk, 2K to 4K). That is quite big, but it doesn't look like a decisive number either.
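For anyone who wants to reproduce this kind of measurement, here is a rough sketch of how the coverage column could be computed. It is not the script used for the numbers above. The names `prefix_coverage` and `toy_detect` are made up for illustration, the detector is passed in as a plain callable, and the sketch assumes the reference answer is whatever the detector guesses for the whole document, which may differ from how "correct" was determined above.

```python
from typing import Callable, Dict, Iterable, Sequence


def prefix_coverage(
    docs: Iterable[bytes],
    detect: Callable[[bytes], str],
    sizes: Sequence[int] = (1024, 2048, 3072, 4096),
) -> Dict[int, float]:
    """Percentage of documents whose guess is already stable at each prefix size.

    Assumption: the reference answer is the guess for the whole document.
    A document counts as covered at a size only if the guess for that
    prefix and for every larger prefix in `sizes` matches the reference.
    """
    doc_list = list(docs)
    ordered = sorted(sizes)
    stable = {size: 0 for size in ordered}

    for doc in doc_list:
        reference = detect(doc)
        # Walk from the largest prefix down and stop at the first mismatch,
        # so the resulting coverage never decreases as the prefix grows.
        for size in reversed(ordered):
            if detect(doc[:size]) != reference:
                break
            stable[size] += 1

    total = len(doc_list)
    if total == 0:
        return {}
    return {size: 100.0 * stable[size] / total for size in ordered}


def toy_detect(data: bytes) -> str:
    """Hypothetical stand-in detector: only checks for non-ASCII bytes."""
    return "non-ascii" if any(b >= 0x80 for b in data) else "ascii"
```

Swap `toy_detect` for the real detector call. A document whose first distinguishing byte sits beyond the largest prefix is never counted, so the coverage at the largest size can stay below 100%.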


-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/102#issuecomment-302608504

Received on Friday, 19 May 2017 04:57:11 UTC