Operating system: Ubuntu 18
Python3
Scrapy
selenium 2.48.0
PhantomJS
Open jandan.net in Firefox and inspect the page source to analyze the image elements. You can see that where an image is displayed, the page calls the JavaScript function jandan_load_img to load it dynamically.
Analysis of the site's anti-crawler measures:
1. jandan.net publishes a robots.txt policy that disallows crawlers.
2. Image loading goes through an anti-crawler JavaScript routine, so a plain request-based crawler only ever downloads the placeholder image blank.gif.
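To see what a robots.txt policy means for a crawler, the stdlib `urllib.robotparser` can evaluate a policy against a URL. The robots.txt content below is a hypothetical disallow-all example for illustration, not jandan.net's actual file; a crawler that honors such a policy would be denied, which is why the settings.py later sets ROBOTSTXT_OBEY = False.

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt resembling a disallow-all policy (illustration only)
robots_txt = """User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# a polite crawler honoring this policy would be refused
print(parser.can_fetch('MyBot', 'http://jandan.net/ooxx/'))  # → False
```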
That is why we use a PhantomJS + selenium approach to crawl the dynamically rendered pages. By inspecting the image elements we obtain the document structure that surrounds each image.
Install PhantomJS: sudo apt-get install phantomjs
hxb@lion:~/PycharmProjects/jandanSpider/jandanSpider$ sudo apt-get install phantomjs
Install beautifulsoup4: pip install beautifulsoup4
hxb@lion:~/PycharmProjects/jandanSpider/jandanSpider$ sudo pip install beautifulsoup4
Install selenium: pip install selenium
hxb@lion:~/PycharmProjects/jandanSpider/jandanSpider$ sudo pip install selenium
Create the base Scrapy project: jandanSpider
hxb@lion:~/PycharmProjects$ mkdir jandanSpider
hxb@lion:~/PycharmProjects$ ls
GirlsSpider  jandanSpider  meizhiSpider  test1  zhifuSpider  zhihu.py
hxb@lion:~/PycharmProjects$ cd jandanSpider/
hxb@lion:~/PycharmProjects/jandanSpider$ ls
hxb@lion:~/PycharmProjects/jandanSpider$ scrapy startproject jandanSpider
New Scrapy project 'jandanSpider', using template directory '/home/hxb/.local/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/hxb/PycharmProjects/jandanSpider/jandanSpider
You can start your first spider with:
    cd jandanSpider
    scrapy genspider example example.com
Generate the spider file:
hxb@lion:~/PycharmProjects/jandanSpider$ cd jandanSpider/
hxb@lion:~/PycharmProjects/jandanSpider/jandanSpider$ scrapy genspider jandan jandan.net
Created spider 'jandan' using template 'basic' in module:
    jandanSpider.spiders.jandan
Open the newly created project in the PyCharm editor.
Set the project's Python interpreter to the python3.6 environment.
Project configuration: settings.py. Three things to configure: 1. user_agent 2. pipelines 3. request_header. (Two fixes over a default template: USER_AGENT must be a string, not a list, and the user-agent middleware path `scrapy.contrib.downloadermiddleware` is the deprecated spelling of `scrapy.downloadermiddlewares`.)

# -*- coding: utf-8 -*-
BOT_NAME = 'jandanSpider'
SPIDER_MODULES = ['jandanSpider.spiders']
NEWSPIDER_MODULE = 'jandanSpider.spiders'
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'jandanSpider.middlewares.JandanspiderDownloaderMiddleware': 543,
}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'jandanSpider.pipelines.JandanspiderPipeline': 300,
}
File pipeline class: pipelines.py. Save each crawled image to a fixed local directory, using the image title as the file name.

# -*- coding: utf-8 -*-
import os
import requests

class JandanspiderPipeline(object):
    def process_item(self, item, spider):
        dir_path = '/home/hxb/jandan'
        if 'url' in item:
            if not os.path.exists(dir_path):
                os.makedirs(dir_path)
            ext = '.' + item['url'].split('.')[-1]
            path = item['title'] + ext
            file_path = '%s/%s' % (dir_path, path)
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as file_handler:
                    # stream the image down in 1 KB blocks
                    file_stream = requests.get(item['url'], stream=True)
                    for block in file_stream.iter_content(1024):
                        if not block:
                            break
                        file_handler.write(block)
        return item
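One caveat the pipeline above does not handle: an image title may contain characters that are invalid in file names, such as '/'. A small hypothetical helper (not in the original pipeline) that could be applied to `item['title']` before building the path:

```python
import re

def safe_filename(title):
    # replace path separators and other characters unsafe in file names with '_'
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    # fall back to a placeholder name for empty/whitespace-only titles
    return cleaned or 'untitled'

print(safe_filename('who/am/i'))  # → who_am_i
```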
Item field definitions: items.py. Define two fields to hold the image link and title: url and title.
Spider parsing class: jandan.py. (A few fixes over the first draft: the item is created inside the loop so each yield carries its own data, `find()` results are checked against None before use, and the driver is always quit.)

# -*- coding: utf-8 -*-
import scrapy
from jandanSpider.items import JandanspiderItem
from selenium import webdriver
from bs4 import BeautifulSoup as bs4
import time

class JandanSpider(scrapy.Spider):
    name = 'jandan'
    allowed_domains = ['jandan.net']
    start_urls = ['http://jandan.net/ooxx/']

    def parse(self, response):
        driver = webdriver.PhantomJS()
        print('--->req to the URL: %s' % response.url)
        driver.get(response.url)
        # the sleep is essential: give the js image loader time to run
        time.sleep(120)
        soup = bs4(driver.page_source, 'html.parser')
        driver.quit()
        for t_div in soup.find_all('div', {'class': 'row'}):
            img_item = JandanspiderItem()
            title = t_div.find('strong')
            if title is None:
                continue
            img_item['title'] = title.get_text().strip()
            print('title: %s' % img_item['title'])
            link_url = t_div.find('a', {'class': 'view_img_link'})
            if link_url is None:
                continue
            url = link_url.get('href')
            if not url:
                continue
            # hrefs are protocol-relative (//host/...), so prepend http:
            img_item['url'] = 'http://' + url.split('//')[-1]
            yield img_item
        # follow the link to the previous comment page
        pre_page = soup.find('a', {'class': 'previous-comment-page'})
        if pre_page is None:
            return
        pre_page_link = pre_page.get('href')
        if pre_page_link:
            pre_page_url = 'http://' + pre_page_link.split('//')[-1]
            yield scrapy.Request(pre_page_url, callback=self.parse)
        else:
            print('No More Page')
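The image hrefs jandan serves are protocol-relative (e.g. `//wx1.sinaimg.cn/...`), which is why the spider builds the final URL with `'http://' + url.split('//')[-1]`. In isolation the normalization looks like this (note it assumes the path itself contains no `//`):

```python
def normalize_link(href):
    # strip any existing scheme prefix and force plain http
    return 'http://' + href.split('//')[-1]

print(normalize_link('//wx1.sinaimg.cn/large/abc.jpg'))  # → http://wx1.sinaimg.cn/large/abc.jpg
```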
Run the spider: scrapy crawl jandan
hxb@lion:~/PycharmProjects/jandanSpider/jandanSpider$ scrapy crawl jandan
Check the results: the downloaded images appear in the local directory specified in the pipeline.
After opening the URL, wait until the page content has fully loaded before parsing the page elements.
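The fixed `time.sleep(120)` in the spider is a crude way to wait; selenium also ships WebDriverWait for condition-based waiting. The polling idea behind it, as a self-contained sketch independent of selenium (the helper name is hypothetical):

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns truthy or the timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# example: wait for a counter to reach 3
state = {'n': 0}
def bump_and_check():
    state['n'] += 1
    return state['n'] >= 3

print(wait_until(bump_and_check, timeout=5, interval=0.01))  # → True
```

In the spider, the predicate would check for the presence of the js-loaded image elements in `driver.page_source` instead of sleeping a fixed two minutes.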