Detecting File Encoding with Python's chardet Library

2014-09-17


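When scraping a batch of pages, you often run into pages in different encodings. Simplified Chinese sites generally use only two, UTF-8 or GB2312; include Traditional Chinese sites and the number of encodings grows. If you merge the results from a batch of pages into one place for viewing, mixed encodings tend to produce garbled text, and checking the pages one by one is tedious.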

This problem is easy to solve in Python: the chardet library can detect the encoding of a piece of data:

import chardet

# Open in binary mode: chardet inspects raw, undecoded bytes
with open('/path/file.txt', 'rb') as f:
    data = f.read()
print(chardet.detect(data))

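The return value is a dictionary like the one below: one entry is the confidence of the detection, the other is the detected encoding.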

{'confidence': 0.99, 'encoding': 'utf-8'}
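A low confidence value, or an encoding of None (which chardet can return when it cannot make a guess, for example on empty input), usually means the result should not be trusted blindly. The following is a minimal sketch of one way to handle that. The helper names `choose_encoding` and `detect_file_encoding`, the 0.8 threshold, and the UTF-8 fallback are all assumptions made for illustration and are not part of chardet.

```python
def choose_encoding(result, min_confidence=0.8, fallback="utf-8"):
    """Pick an encoding from a chardet.detect()-style result dict.

    Hypothetical helper: falls back to `fallback` when no encoding was
    detected or the confidence is below `min_confidence`.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


def detect_file_encoding(path, **kwargs):
    """Detect the encoding of the file at `path` (hypothetical helper)."""
    # Imported here so choose_encoding stays usable without chardet installed
    import chardet

    with open(path, "rb") as f:  # binary mode: chardet needs raw bytes
        return choose_encoding(chardet.detect(f.read()), **kwargs)
```

For example, `choose_encoding({'confidence': 0.99, 'encoding': 'utf-8'})` returns `'utf-8'`, while a result with confidence 0.3 would yield the fallback instead.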

With the encoding identified, you can then use the iconv module to transcode the content.
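As an alternative to iconv, the same conversion can be sketched with Python's built-in bytes and str methods. This is a swapped-in technique, not the approach named above. The function name `to_utf8` is hypothetical, and the choice to replace undecodable bytes rather than raise an error is an assumption.

```python
def to_utf8(data, source_encoding):
    """Convert raw bytes from `source_encoding` to UTF-8 bytes.

    Hypothetical helper: `source_encoding` would typically be the
    'encoding' value reported by chardet.detect().
    """
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded,
    # so one bad byte does not abort the whole conversion (an assumption;
    # use errors="strict" to fail loudly instead)
    text = data.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


# Usage: bytes encoded as GB2312 come back as UTF-8
raw = "简体中文".encode("gb2312")
converted = to_utf8(raw, "gb2312")
print(converted.decode("utf-8"))
```

Once every page has been converted to the same encoding this way, the results can be merged into a single file without mixing encodings.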


Author: shisekong
Link: https://blog.361way.com/python-chardet-codetype/4021.html
License: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Kindly fulfill the requirements of this license when adapting or creating a derivative of this work.
