Python 3.4.3: Reading Files / Scraping Web Page Data

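I ran into the following problems while reading a file this way, and am sharing the fixes here:

1. gbk encoding problem: the fix is to read the file in 'rb' mode and then use a regular expression.
2. While collecting the data, some entries appeared more than once, so the duplicates had to be removed.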

Tools/Materials

Python 3.4.3

Windows 7

Method/Steps

Python version

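Code: the key parts are the keyword search (the regular expression) and removing duplicates from the list of matches.

    # -*- coding: utf-8 -*-
    import re
    import os

    # Read raw bytes and decode them explicitly as UTF-8,
    # so the Windows default codec (gbk) is never involved
    with open('index.html', 'rb') as file_open:
        content = file_open.read().decode('utf-8')

    # Keyword search: capture the href of every "reference internal" link
    reg = r'reference internal" href="(.+?)"'
    result = re.findall(reg, content)
    # print(result)

    # Delete the repeated items in result, keeping the original order
    temp = []
    for i in result:
        if i not in temp:
            temp.append(i)

    for line in temp:
        # Skip in-page anchors and keep only the pages of interest
        if '#' not in line and 'environment.html' in line:
            line = 'http://docs.openstack.org/liberty/install-guide-rdo/' + line
            print(line)
            os.system('wget ' + line)  # requires wget to be on the PATH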
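The deduplication step above checks each item against a list, which is fine for a handful of links. As a minimal sketch of an alternative, the standard library's `OrderedDict` can remove duplicates while keeping first-seen order. The helper name `unique_in_order` and the sample links are made up for illustration and are not part of the original script.

```python
from collections import OrderedDict


def unique_in_order(items):
    """Return items with duplicates removed, keeping first-seen order.

    Hypothetical helper for illustration; items must be hashable
    (the href strings returned by re.findall are).
    """
    # OrderedDict remembers insertion order on every Python 3 release,
    # including 3.4, where plain dicts do not guarantee it.
    return list(OrderedDict.fromkeys(items))


# Illustrative input only: the same href found several times on one page
links = ['overview.html', 'environment.html', 'overview.html',
         'environment.html', 'environment-security.html']
print(unique_in_order(links))
```

Each lookup is a hash lookup rather than a scan of the list, which only starts to matter when a page yields thousands of matches.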

The output is as follows:

    D:\Python34\python.exe open_file.py
    http://docs.openstack.org/liberty/install-guide-rdo/common/conventions.html
    http://docs.openstack.org/liberty/install-guide-rdo/overview.html
    http://docs.openstack.org/liberty/install-guide-rdo/environment.html
    http://docs.openstack.org/liberty/install-guide-rdo/environment-security.html
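The script builds each URL by string concatenation and hands it to the external `wget` program, which is not installed on Windows by default. The sketch below shows one possible alternative that stays inside the standard library: `urllib.parse.urljoin` resolves the relative link and `urllib.request.urlretrieve` downloads it. The helper names `absolute_url`, `local_name`, and `download` are assumptions made for this example, not part of the original script.

```python
import posixpath
from urllib.parse import urljoin, urlsplit
from urllib.request import urlretrieve

# Base URL taken from the script above
BASE_URL = 'http://docs.openstack.org/liberty/install-guide-rdo/'


def absolute_url(href, base=BASE_URL):
    """Resolve a relative href from the page against the base URL."""
    return urljoin(base, href)


def local_name(url):
    """Use the last path segment of the URL as the local file name."""
    # Fall back to index.html when the URL path ends with a slash
    return posixpath.basename(urlsplit(url).path) or 'index.html'


def download(url):
    """Save url into the current directory (needs network access)."""
    filename, _headers = urlretrieve(url, local_name(url))
    return filename


url = absolute_url('common/conventions.html')
print(url)
print(local_name(url))
```

Calling `download(url)` would then take the place of `os.system('wget ' + line)`. It is left uncalled here so that the example runs without a network connection.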

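Troubleshooting

1. UnicodeDecodeError: 'gbk' codec can't decode byte 0x9e in position 1270: illegal multibyte sequence

Solution: open the file in 'rb' mode, decode the bytes as UTF-8, and then use a regular expression to extract the key fields.

    # -*- coding: utf-8 -*-
    import re

    # Binary mode returns bytes, so no implicit gbk decoding takes place
    with open('index.html', 'rb') as file_open:
        content = file_open.read().decode('utf-8')

    # Capture the href of every "reference internal" link
    reg = r'reference internal" href="(.+?)"'
    result = re.findall(reg, content)
    print(result)

    for line in result:
        # Skip links that point at in-page anchors
        if '#' not in line:
            print(line)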
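The error appears because `open()` in text mode falls back to the locale's preferred encoding when none is given. On a Chinese-language Windows system that is typically cp936 (gbk), while the page is UTF-8. As a sketch of an alternative to 'rb' plus `decode`, the encoding can be passed to `open()` directly. The helper name `read_text` and the sample string are assumptions made for this example.

```python
import locale
import os
import tempfile


def read_text(path, encoding='utf-8'):
    """Read a text file with an explicit encoding (hypothetical helper)."""
    with open(path, 'r', encoding=encoding) as f:
        return f.read()


# The codec open() would use if no encoding were passed
print(locale.getpreferredencoding(False))

# Illustrative data: a UTF-8 page containing non-ASCII text
sample = '<a class="reference internal" href="overview.html">概述</a>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'index.html')
    with open(path, 'wb') as f:
        f.write(sample.encode('utf-8'))
    # Passing encoding='utf-8' makes the result independent of the locale
    round_trip_ok = read_text(path) == sample

print(round_trip_ok)
```

Either approach works. The point in both cases is that the decoding codec is chosen by the script rather than by the operating system's locale.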

Notes

The system used here is Windows 7, and the Python version is 3.4.3.
