
"You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out before-hand."

In most cases you have to know the language in order to guess the encoding and convert to UTF-8 if necessary. Mutual recursion...



Encoding, but obviously not language, should be provided explicitly as metadata (e.g. the Content-Type HTTP header). Also, most of the content available on the web is already UTF-8 (65.9% according to a recent survey[1]).

[1] http://w3techs.com/technologies/details/en-utf8/all/all
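
For what it's worth, here is a rough sketch of reading the declared encoding out of a Content-Type value, using only the Python standard library. The function name `charset_from_content_type` and its `default` parameter are made up for illustration, and a declared charset can of course still be wrong for the actual bytes.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset declared in a Content-Type value, or `default`.

    Hypothetical helper: it only reports what the header claims and
    does not check the claim against the body.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name and returns None when
    # the header carries no charset parameter.
    return msg.get_content_charset() or default


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_content_type("text/html", default="unknown"))
```

When the parameter is missing you are back to guessing, which is where the detection problem discussed in this thread starts.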


Mark Pilgrim reverse-engineered (or ripped out, can't remember) the encoding detection that Firefox uses. It has done a fairly good job for my web crawling:

http://pypi.python.org/pypi/chardet


In my experience chardet very often misclassifies ISO-8859-1 as ISO-8859-2. I saw this even on small Spanish pages that used only the typical Spanish characters.


I'd say in most cases UTF-8 is already used. Of course this depends on the source of the text.
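
If that holds for your source, one cheap approach is to try a strict UTF-8 decode first and only fall back when it fails. A minimal sketch, assuming a single fallback encoding is acceptable; `decode_utf8_first` and the ISO-8859-1 default are my own choices for illustration, not anything from the tool under discussion:

```python
def decode_utf8_first(raw, fallback="iso-8859-1"):
    """Decode bytes as strict UTF-8, else with a single fallback encoding.

    Returns (text, encoding_used). The fallback is an assumption: every
    byte sequence decodes as ISO-8859-1, so that path never raises, but
    it may yield the wrong characters for other legacy encodings.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


if __name__ == "__main__":
    print(decode_utf8_first("año".encode("utf-8")))
    print(decode_utf8_first("año".encode("iso-8859-1")))
```

This works reasonably well because legacy 8-bit text containing non-ASCII characters rarely happens to be valid UTF-8, so a successful strict decode is decent evidence. It says nothing about which legacy encoding you actually have when the decode fails.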



