python 3.4.3
windows 7
python version
Code: the key parts are the keyword search and the de-duplication of the list elements.

    # _*_coding:utf-8_*_
    import re
    import os

    file_open = open('index.html', 'rb')
    try:
        content = file_open.read().decode('utf-8')
        # capture the href value of every "reference internal" link
        reg = r'reference internal" href="(.+?)"'
        result = re.findall(reg, content)
        # print(result)
        # delete the same items in result, keeping the original order
        temp = []
        for i in result:
            if i not in temp:
                temp.append(i)
        for line in temp:
            # skip in-page anchors and keep only links matching the keyword
            if '#' not in line and 'environment.html' in line:
                line = 'http://docs.openstack.org/liberty/install-guide-rdo/' + line
                print(line)
                os.system('wget ' + line)
    finally:
        file_open.close()
The output is as follows:

    D:\Python34\python.exe open_file.py
    http://docs.openstack.org/liberty/install-guide-rdo/common/conventions.html
    http://docs.openstack.org/liberty/install-guide-rdo/overview.html
    http://docs.openstack.org/liberty/install-guide-rdo/environment.html
    http://docs.openstack.org/liberty/install-guide-rdo/environment-security.html
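The keyword search and de-duplication steps can also be tried without index.html or wget. The following is only a sketch under stated assumptions: the function name extract_links, the constant names and the sample markup are made up for illustration, and OrderedDict.fromkeys and urljoin are swapped in for the manual loop and the string concatenation used in the script above.

```python
import re
from collections import OrderedDict
from urllib.parse import urljoin

# Base URL taken from the script above.
BASE_URL = 'http://docs.openstack.org/liberty/install-guide-rdo/'

# Same pattern as the script: capture the href of "reference internal" links.
LINK_RE = re.compile(r'reference internal" href="(.+?)"')


def extract_links(html, keyword=''):
    """Return absolute, de-duplicated URLs whose relative path contains keyword.

    Hypothetical helper, not part of the original script.
    """
    hrefs = LINK_RE.findall(html)
    # OrderedDict.fromkeys drops duplicates while keeping first-seen order
    # (a plain dict does not guarantee order on Python 3.4).
    unique = OrderedDict.fromkeys(hrefs)
    return [urljoin(BASE_URL, href)
            for href in unique
            if '#' not in href and keyword in href]


# Made-up sample markup, only for illustration.
sample = (
    '<a class="reference internal" href="overview.html">Overview</a>'
    '<a class="reference internal" href="environment.html">Environment</a>'
    '<a class="reference internal" href="environment.html#security">Security</a>'
    '<a class="reference internal" href="environment.html">Environment</a>'
)

links = extract_links(sample, keyword='environment.html')
print(links)
```

With the sample above, only the environment.html link survives: the duplicate is dropped, the '#' anchor is skipped, and overview.html does not match the keyword. Calling extract_links(sample) with no keyword keeps overview.html as well.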
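1. UnicodeDecodeError: 'gbk' codec can't decode byte 0x9e in position 1270: illegal multibyte sequence

Solution: open the file in 'rb' mode and decode the bytes as UTF-8, then use a regular expression to extract the key fields. In text mode on this system, open() decodes with the default gbk codec, which fails on the UTF-8 bytes in index.html.

    # _*_coding:utf-8_*_
    import re

    file_open = open('index.html', 'rb')
    try:
        # read raw bytes and decode them explicitly as UTF-8
        content = file_open.read().decode('utf-8')
        reg = r'reference internal" href="(.+?)"'
        result = re.findall(reg, content)
        print(result)
        for line in result:
            # skip in-page anchors
            if '#' not in line:
                print(line)
    finally:
        file_open.close()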
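As an alternative to the 'rb' fix above, open() also accepts an explicit encoding argument in text mode. The sketch below assumes the file really is UTF-8; the sample text and the temporary file are made up for illustration so that the two approaches can be compared without index.html.

```python
import locale
import os
import tempfile

# Made-up sample text; the curly apostrophe is a non-ASCII character
# that takes several bytes in UTF-8.
text = '<a class="reference internal" href="overview.html">What\u2019s here</a>'

# Write the text as UTF-8 bytes to a throwaway file.
with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as tmp:
    tmp.write(text.encode('utf-8'))
    path = tmp.name

try:
    # Encoding that open() uses in text mode when none is given.
    print(locale.getpreferredencoding(False))

    # Approach from the post: read bytes, then decode explicitly.
    with open(path, 'rb') as f:
        from_bytes = f.read().decode('utf-8')

    # Alternative: text mode with an explicit encoding.
    with open(path, encoding='utf-8') as f:
        from_text = f.read()
finally:
    os.remove(path)

# Both approaches yield the same string.
print(from_bytes == from_text)
```

Either way, the point is that the encoding is stated explicitly instead of being left to the system default.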
The system used here is Windows 7, and the Python version is 3.4.3.