
Python 3.4.3: reading files / scraping web page data

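While reading a file with this method I ran into the following problems, which I am sharing here:
1. gbk encoding error: the fix is to open the file in 'rb' mode, decode the bytes as UTF-8, and then apply a regular expression.
2. The collected data contained identical entries, so the duplicates had to be removed.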
Tools/Materials
1

Python 3.4.3

2

Windows 7

Method/Steps
1

Python version

3

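Code: the key parts are the keyword search (the regular expression) and the removal of duplicate list elements.
# -*- coding: utf-8 -*-
import os
import re

# Open in binary mode so the bytes are not decoded with the platform default codec (gbk)
with open('index.html', 'rb') as file_open:
    content = file_open.read().decode('utf-8')

# Keyword search: capture the href of every "reference internal" link
reg = r'reference internal" href="(.+?)"'
result = re.findall(reg, content)
# print(result)

# Delete the repeated items in result, keeping the original order
temp = []
for i in result:
    if i not in temp:
        temp.append(i)

for line in temp:
    # Skip '#' anchors; the second test further limits the run to links containing 'environment.html'
    if '#' not in line and 'environment.html' in line:
        line = 'http://docs.openstack.org/liberty/install-guide-rdo/' + line
        print(line)
        # Requires a wget executable on the PATH
        os.system('wget ' + line)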
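The search-and-deduplicate step can also be written as a small function. The following is only a sketch, assuming the page marks its links as `class="reference internal" href="..."`; the names `LINK_RE`, `unique_links` and the `sample` string are made up for illustration and are not part of the original script. It uses a set for the membership test, which avoids rescanning the list for every item.

```python
import re

# Hypothetical helper names; the pattern mirrors the one used in the script above.
LINK_RE = re.compile(r'reference internal" href="(.+?)"')


def unique_links(html):
    """Return the captured hrefs in first-seen order, without '#' anchors or repeats."""
    seen = set()
    links = []
    for href in LINK_RE.findall(html):
        if '#' in href or href in seen:
            continue
        seen.add(href)
        links.append(href)
    return links


# Made-up sample markup, only to exercise the function
sample = (
    '<a class="reference internal" href="overview.html">Overview</a>'
    '<a class="reference internal" href="overview.html#intro">Intro</a>'
    '<a class="reference internal" href="environment.html">Environment</a>'
    '<a class="reference internal" href="overview.html">Overview</a>'
)
print(unique_links(sample))  # ['overview.html', 'environment.html']
```

Keeping the order matters here because the links are processed in the order they appear on the page; a plain `set(result)` would remove the duplicates but lose that order.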

4

The output is as follows:
D:\Python34\python.exe open_file.py
http://docs.openstack.org/liberty/install-guide-rdo/common/conventions.html
http://docs.openstack.org/liberty/install-guide-rdo/overview.html
http://docs.openstack.org/liberty/install-guide-rdo/environment.html
http://docs.openstack.org/liberty/install-guide-rdo/environment-security.html
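The script shells out to wget, which has to be installed separately on Windows 7. As a possible alternative, the download could be done with the standard library. This is a sketch under assumptions, not the author's method: `build_url`, `local_name` and `download` are hypothetical helper names, and the base URL is the one used in the script.

```python
import os
import posixpath
from urllib.parse import urljoin
from urllib.request import urlretrieve

BASE_URL = 'http://docs.openstack.org/liberty/install-guide-rdo/'


def build_url(link, base=BASE_URL):
    """Join a relative link from the page with the base URL."""
    return urljoin(base, link)


def local_name(link):
    """Use the last path component of the link as the local file name."""
    return posixpath.basename(link)


def download(link, target_dir='.'):
    """Fetch one page into target_dir and return the local path (needs network access)."""
    path = os.path.join(target_dir, local_name(link))
    urlretrieve(build_url(link), path)
    return path


print(build_url('common/conventions.html'))
```

Calling `download(line)` inside the loop would then take the place of the `os.system('wget ' + line)` call.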

Troubleshooting

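1. UnicodeDecodeError: 'gbk' codec can't decode byte 0x9e in position 1270: illegal multibyte sequence
Solution: open the file in 'rb' mode, decode the bytes as UTF-8, and then use a regular expression to extract the key fields.
# -*- coding: utf-8 -*-
import re

# Binary mode returns raw bytes, so no implicit gbk decoding takes place
with open('index.html', 'rb') as file_open:
    content = file_open.read().decode('utf-8')

# Capture the href of every "reference internal" link
reg = r'reference internal" href="(.+?)"'
result = re.findall(reg, content)
print(result)
for line in result:
    # Skip links that point to an in-page '#' anchor
    if '#' not in line:
        print(line)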
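The error appears because `open()` in text mode, when no encoding is given, decodes with the locale's preferred encoding, which is typically gbk (cp936) on a Chinese-language Windows system. Besides the 'rb' approach, passing the encoding explicitly should give the same result. This is a minimal sketch, assuming the file really is UTF-8; `read_utf8` and the demo file are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile


def read_utf8(path):
    """Read a text file as UTF-8 regardless of the platform default encoding."""
    with open(path, encoding='utf-8') as handle:
        return handle.read()


# Create a small UTF-8 file to read back (stand-in for index.html)
demo_path = os.path.join(tempfile.mkdtemp(), 'index.html')
with open(demo_path, 'wb') as handle:
    handle.write('<a class="reference internal" href="overview.html">概述</a>'.encode('utf-8'))

content = read_utf8(demo_path)
print('overview.html' in content)  # True
```

The 'rb' plus `decode('utf-8')` form and the `encoding='utf-8'` form both decode the same bytes with the same codec; the second simply lets `open()` do the decoding.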

Notes

The system used here is Windows 7 and the Python version is 3.4.3.
