*"You must provide it clean (interchange-valid) UTF-8, so any encoding issues mu...

adambyrtek · on Oct 22, 2011

Encoding, but obviously not language, should be provided explicitly as metadata (e.g. the Content-Type HTTP header). Also, most of the content available on the web is already UTF-8 (65.9% according to a recent survey[1]).

[1] http://w3techs.com/technologies/details/en-utf8/all/all
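
For what it's worth, reading the declared encoding out of a Content-Type value takes only a few lines with Python's standard library. This is a rough sketch: the helper name `declared_charset` and the sample header values are made up for illustration, and a real crawler would also have to deal with servers that declare the wrong charset.

```python
from email.message import Message


def declared_charset(content_type, default=None):
    """Return the charset parameter of a Content-Type value, lowercased.

    Returns `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


if __name__ == "__main__":
    # Hypothetical header values, as a server might send them.
    print(declared_charset("text/html; charset=ISO-8859-2"))
    print(declared_charset("text/html"))
```

When the header has no charset parameter the function returns `None`, which is the case where guessing (or assuming UTF-8) comes into play.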

ninjin · on Oct 22, 2011

Mark Pilgrim reverse-engineered (or ripped out, I can't remember which) the encoding detection that Firefox uses. It has done a fairly good job in my web crawling:

http://pypi.python.org/pypi/chardet
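
A minimal sketch of how this might be wired into a crawler, assuming the third-party `chardet` package is installed. The function name `decode_bytes` and the ISO-8859-1 fallback are my own choices for illustration, not anything the library prescribes. It tries strict UTF-8 first and only asks `chardet.detect` for a guess when that fails.

```python
def decode_bytes(raw, fallback="iso-8859-1"):
    """Decode raw bytes, preferring strict UTF-8 over a statistical guess.

    Returns a (text, encoding_used) tuple.
    """
    try:
        # Valid UTF-8 is rarely valid by accident, so trust it when it decodes.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        # Assumption: without a detector, fall back to a fixed single-byte encoding.
        return raw.decode(fallback, errors="replace"), fallback

    guess = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    encoding = guess["encoding"] or fallback  # 'encoding' can be None
    try:
        return raw.decode(encoding, errors="replace"), encoding
    except LookupError:
        # The detector named an encoding this Python build does not know.
        return raw.decode(fallback, errors="replace"), fallback
```

The `confidence` value in the result is also worth logging, since a low score is a hint that the guess should not be trusted blindly.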

e98cuenc · on Oct 22, 2011

In my experience, chardet very often misclassifies iso-8859-1 as iso-8859-2. I saw the misclassification even on small Spanish pages that used only the typical Spanish characters.
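
To illustrate why this pair is easy to confuse (this is a sketch of the underlying ambiguity, not of how chardet works internally): every byte string decodes without error under both encodings, and most accented Spanish vowels sit at the same code points in both. The helper name `latin1_vs_latin2` and the sample sentence are made up for the example.

```python
def latin1_vs_latin2(text):
    """Encode text as iso-8859-1, then decode it under both encodings.

    Returns the set of (latin1_char, latin2_char) pairs that differ.
    """
    raw = text.encode("iso-8859-1")
    as_latin1 = raw.decode("iso-8859-1")
    as_latin2 = raw.decode("iso-8859-2")  # never raises: all 256 bytes are mapped
    return {(a, b) for a, b in zip(as_latin1, as_latin2) if a != b}


if __name__ == "__main__":
    # The accented vowels decode identically; only the ñ byte (0xF1) differs.
    print(latin1_vs_latin2("El señor comió jamón"))
    print(latin1_vs_latin2("áéíóúü"))
```

With so few distinguishing bytes on a short page, a statistical detector has very little evidence to go on.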

Maakuth · on Oct 22, 2011

I'd say in most cases UTF-8 is already used. Of course this depends on the source of the text.