Operating system: Ubuntu 18
Python3
Scrapy
selenium 2.48.0
PhantomJS
Open jandan.net in Firefox and inspect the page source to analyze the image elements. You can see that where an image is displayed, the page calls the JavaScript function jandan_load_img to load it dynamically.
Analysis of the site's anti-crawler measures:
1. jandan.net publishes a robots.txt policy that disallows crawlers.
2. Image loading goes through an anti-crawler JavaScript routine, so a plain request-based crawler only ever downloads the placeholder image blank.gif.
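To see what a robots.txt policy means for a crawler, the stdlib `urllib.robotparser` can evaluate a policy against a URL. The robots.txt content below is a hypothetical disallow-all example for illustration, not jandan.net's actual file; a crawler that honors such a policy would be denied, which is why the settings.py later sets ROBOTSTXT_OBEY = False.

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt resembling a disallow-all policy (illustration only)
robots_txt = """User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# a polite crawler honoring this policy would be refused
print(parser.can_fetch('MyBot', 'http://jandan.net/ooxx/'))  # → False
```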
That is why we use a PhantomJS + selenium approach to crawl the dynamically rendered pages. By inspecting the image elements we obtain the document structure that surrounds each image.
Install PhantomJS: sudo apt-get install phantomjs
hxb@lion:~/PycharmProjects/jandanSpider/jandanSpider$ sudo apt-get install phantomjs
Install beautifulsoup4: pip install beautifulsoup4
hxb@lion:~/PycharmProjects/jandanSpider/jandanSpider$ sudo pip install beautifulsoup4
Install selenium: pip install selenium
hxb@lion:~/PycharmProjects/jandanSpider/jandanSpider$ sudo pip install selenium
Create the base Scrapy project: jandanSpider
hxb@lion:~/PycharmProjects$ mkdir jandanSpider
hxb@lion:~/PycharmProjects$ ls
GirlsSpider  jandanSpider  meizhiSpider  test1  zhifuSpider  zhihu.py
hxb@lion:~/PycharmProjects$ cd jandanSpider/
hxb@lion:~/PycharmProjects/jandanSpider$ ls
hxb@lion:~/PycharmProjects/jandanSpider$ scrapy startproject jandanSpider
New Scrapy project 'jandanSpider', using template directory '/home/hxb/.local/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/hxb/PycharmProjects/jandanSpider/jandanSpider
You can start your first spider with:
    cd jandanSpider
    scrapy genspider example example.com
Generate the spider file:
hxb@lion:~/PycharmProjects/jandanSpider$ cd jandanSpider/
hxb@lion:~/PycharmProjects/jandanSpider/jandanSpider$ scrapy genspider jandan jandan.net
Created spider 'jandan' using template 'basic' in module:
    jandanSpider.spiders.jandan
Open the newly created project in the PyCharm editor.
Set the project's Python interpreter to the python3.6 environment.
Project configuration: settings.py. Three things to configure: 1. user_agent 2. pipelines 3. request_header. (Two fixes over a default template: USER_AGENT must be a string, not a list, and the user-agent middleware path `scrapy.contrib.downloadermiddleware` is the deprecated spelling of `scrapy.downloadermiddlewares`.)

# -*- coding: utf-8 -*-
BOT_NAME = 'jandanSpider'
SPIDER_MODULES = ['jandanSpider.spiders']
NEWSPIDER_MODULE = 'jandanSpider.spiders'
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'jandanSpider.middlewares.JandanspiderDownloaderMiddleware': 543,
}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'jandanSpider.pipelines.JandanspiderPipeline': 300,
}
File pipeline class: pipelines.py. Save each crawled image to a fixed local directory, using the image title as the file name.

# -*- coding: utf-8 -*-
import os
import requests

class JandanspiderPipeline(object):
    def process_item(self, item, spider):
        dir_path = '/home/hxb/jandan'
        if 'url' in item:
            if not os.path.exists(dir_path):
                os.makedirs(dir_path)
            ext = '.' + item['url'].split('.')[-1]
            path = item['title'] + ext
            file_path = '%s/%s' % (dir_path, path)
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as file_handler:
                    # stream the image down in 1 KB blocks
                    file_stream = requests.get(item['url'], stream=True)
                    for block in file_stream.iter_content(1024):
                        if not block:
                            break
                        file_handler.write(block)
        return item
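One caveat the pipeline above does not handle: an image title may contain characters that are invalid in file names, such as '/'. A small hypothetical helper (not in the original pipeline) that could be applied to `item['title']` before building the path:

```python
import re

def safe_filename(title):
    # replace path separators and other characters unsafe in file names with '_'
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    # fall back to a placeholder name for empty/whitespace-only titles
    return cleaned or 'untitled'

print(safe_filename('who/am/i'))  # → who_am_i
```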
Item field definitions: items.py. Define two fields to hold the image link and title: url and title.
Spider parsing class: jandan.py. (A few fixes over the first draft: the item is created inside the loop so each yield carries its own data, `find()` results are checked against None before use, and the driver is always quit.)

# -*- coding: utf-8 -*-
import scrapy
from jandanSpider.items import JandanspiderItem
from selenium import webdriver
from bs4 import BeautifulSoup as bs4
import time

class JandanSpider(scrapy.Spider):
    name = 'jandan'
    allowed_domains = ['jandan.net']
    start_urls = ['http://jandan.net/ooxx/']

    def parse(self, response):
        driver = webdriver.PhantomJS()
        print('--->req to the URL: %s' % response.url)
        driver.get(response.url)
        # the sleep is essential: give the js image loader time to run
        time.sleep(120)
        soup = bs4(driver.page_source, 'html.parser')
        driver.quit()
        for t_div in soup.find_all('div', {'class': 'row'}):
            img_item = JandanspiderItem()
            title = t_div.find('strong')
            if title is None:
                continue
            img_item['title'] = title.get_text().strip()
            print('title: %s' % img_item['title'])
            link_url = t_div.find('a', {'class': 'view_img_link'})
            if link_url is None:
                continue
            url = link_url.get('href')
            if not url:
                continue
            # hrefs are protocol-relative (//host/...), so prepend http:
            img_item['url'] = 'http://' + url.split('//')[-1]
            yield img_item
        # follow the link to the previous comment page
        pre_page = soup.find('a', {'class': 'previous-comment-page'})
        if pre_page is None:
            return
        pre_page_link = pre_page.get('href')
        if pre_page_link:
            pre_page_url = 'http://' + pre_page_link.split('//')[-1]
            yield scrapy.Request(pre_page_url, callback=self.parse)
        else:
            print('No More Page')
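The image hrefs jandan serves are protocol-relative (e.g. `//wx1.sinaimg.cn/...`), which is why the spider builds the final URL with `'http://' + url.split('//')[-1]`. In isolation the normalization looks like this (note it assumes the path itself contains no `//`):

```python
def normalize_link(href):
    # strip any existing scheme prefix and force plain http
    return 'http://' + href.split('//')[-1]

print(normalize_link('//wx1.sinaimg.cn/large/abc.jpg'))  # → http://wx1.sinaimg.cn/large/abc.jpg
```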
Run the spider: scrapy crawl jandan
hxb@lion:~/PycharmProjects/jandanSpider/jandanSpider$ scrapy crawl jandan
Check the results: the downloaded images appear in the local directory specified in the pipeline.
After opening the URL, wait until the page content has fully loaded before parsing the page elements.
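The fixed `time.sleep(120)` in the spider is a crude way to wait; selenium also ships WebDriverWait for condition-based waiting. The polling idea behind it, as a self-contained sketch independent of selenium (the helper name is hypothetical):

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns truthy or the timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# example: wait for a counter to reach 3
state = {'n': 0}
def bump_and_check():
    state['n'] += 1
    return state['n'] >= 3

print(wait_until(bump_and_check, timeout=5, interval=0.01))  # → True
```

In the spider, the predicate would check for the presence of the js-loaded image elements in `driver.page_source` instead of sleeping a fixed two minutes.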