
Using Python3 and Scrapy to Crawl and Automatically Download Website Images

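This article explains how to install Scrapy and use it to crawl images from a specific website. It covers installing python3, Scrapy, and the PyCharm CE tool, and walks through a working image-crawler project as an example.
Tools/Materials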
1

Ubuntu 18

2

Python3

3

A computer connected to the Internet

Preparing the Scrapy development environment
1

Install python3 on the Ubuntu system:
hxb@lion:~$ sudo apt-get install python3

2

Install the python3-dev dependency package:
hxb@lion:~$ sudo apt-get install python3-dev

3

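Install the pip package, which is used to install the Python libraries that scrapy depends on:
hxb@lion:~$ sudo apt install python-pip
Query the pip version (at this point pip still belongs to Python 2.7):
hxb@lion:~$ pip -V
pip 9.0.1 from /usr/lib/python2.7/dist-packages (python 2.7)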

4

Install PyCharm CE, a Python development tool for Ubuntu, directly from the Software Center.

How to switch between the Python2 and Python3 environments
1

Ubuntu 18 provides Python 2.7 by default. After Python3 is installed, several Python versions coexist on the system, which complicates installing the dependency libraries later, so the environment needs to be switched to python3.
Configure python2 by running the command:
hxb@lion:~/PycharmProjects$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python2 100
[sudo] password for hxb: 
update-alternatives: using /usr/bin/python2 to provide /usr/bin/python (python) in auto mode
hxb@lion:~/PycharmProjects$

2

Configure python3:
hxb@lion:~$ sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 150
update-alternatives: using /usr/bin/python3 to provide /usr/bin/python (python) in auto mode
hxb@lion:~$

3

The following command switches between the python2 and python3 environments as needed:
hxb@lion:~$ sudo update-alternatives --config python
There are 2 choices for the alternative python (providing /usr/bin/python).

  Selection    Path              Priority   Status
------------------------------------------------------------
  0            /usr/bin/python3   150       auto mode
* 1            /usr/bin/python2   100       manual mode
  2            /usr/bin/python3   150       manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/bin/python3 to provide /usr/bin/python (python) in manual mode
hxb@lion:~$ python -V
Python 3.6.5

4

The following error may appear while installing pip. After switching the environment to python3, the python-pip package has to be reinstalled:
hxb@lion:~/PycharmProjects$ sudo apt install python-pip
Reading package lists... Done
Building dependency tree       
Reading state information... Done
python-pip is already the newest version (9.0.1-2.3~ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
hxb@lion:~/PycharmProjects$ pip install pip
Traceback (most recent call last):
  File '/usr/bin/pip', line 9, in <module>
    from pip import main
ModuleNotFoundError: No module named 'pip'
hxb@lion:~/PycharmProjects$ pip -V
Traceback (most recent call last):
  File '/usr/bin/pip', line 9, in <module>
    from pip import main
ModuleNotFoundError: No module named 'pip'
Resolve the problem:
1. Remove pip:
hxb@lion:~/PycharmProjects$ sudo apt-get remove python-pip
2. Install pip again:
hxb@lion:~/PycharmProjects$ sudo apt-get install python-pip

5

Install pip for python3:
hxb@lion:~/PycharmProjects$ sudo apt-get install python3-pip
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  python3-setuptools python3-wheel
Suggested packages:
  python-setuptools-doc
The following NEW packages will be installed:
  python3-pip python3-setuptools python3-wheel
0 upgraded, 3 newly installed, 0 to remove and 1 not upgraded.
Need to get 398 kB of archives.
After this operation, 2,073 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://cn.archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python3-pip all 9.0.1-2.3~ubuntu1 [114 kB]
Get:2 http://cn.archive.ubuntu.com/ubuntu bionic/main amd64 python3-setuptools all 39.0.1-2 [248 kB]
Get:3 http://cn.archive.ubuntu.com/ubuntu bionic/universe amd64 python3-wheel all 0.30.0-0.2 [36.5 kB]
Fetched 398 kB in 2s (172 kB/s)
Selecting previously unselected package python3-pip.
(Reading database ... 171685 files and directories currently installed.)
Preparing to unpack .../python3-pip_9.0.1-2.3~ubuntu1_all.deb ...
Unpacking python3-pip (9.0.1-2.3~ubuntu1) ...
Selecting previously unselected package python3-setuptools.
Preparing to unpack .../python3-setuptools_39.0.1-2_all.deb ...
Unpacking python3-setuptools (39.0.1-2) ...
Selecting previously unselected package python3-wheel.
Preparing to unpack .../python3-wheel_0.30.0-0.2_all.deb ...
Unpacking python3-wheel (0.30.0-0.2) ...
Setting up python3-wheel (0.30.0-0.2) ...
Setting up python3-pip (9.0.1-2.3~ubuntu1) ...
Processing triggers for man-db (2.8.3-2) ...
Setting up python3-setuptools (39.0.1-2) ...

6

Check that pip is installed correctly in the python3 environment:
hxb@lion:~/PycharmProjects$ pip -V
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)
hxb@lion:~/PycharmProjects$

7

Install scrapy on Python3:
hxb@lion:~/PycharmProjects$ pip install scrapy
Import scrapy in the Python3 environment to confirm the installation:
hxb@lion:~$ python
Python 3.6.5 (default, Apr  1 2018, 05:46:30) 
[GCC 7.3.0] on linux
Type 'help', 'copyright', 'credits' or 'license' for more information.
>>> import scrapy
>>>

8

The following error occurred while installing scrapy:
Segmentation fault (core dumped)

9

Running the installation with sudo privileges resolves the error from step 8: sudo -H pip install scrapy
hxb@lion:~/PycharmProjects$ scrapy
Command 'scrapy' not found, did you mean:
  command 'scapy' from deb python-scapy
  command 'scrappy' from deb libscrappy-perl
Try: sudo apt install <deb name>
Install scrapy with sudo:
hxb@lion:~$ sudo -H pip install scrapy
Successfully installed Automat-0.7.0 PyDispatcher-2.0.5 Twisted-18.4.0 attrs-18.1.0 cffi-1.11.5 constantly-15.1.0 cryptography-2.2.2 cssselect-1.0.3 hyperlink-18.0.0 incremental-17.5.0 lxml-4.2.3 parsel-1.5.0 pyOpenSSL-18.0.0 pyasn1-0.4.3 pyasn1-modules-0.2.2 pycparser-2.18 queuelib-1.5.0 scrapy-1.5.0 service-identity-17.0.0 w3lib-1.19.0 zope.interface-4.5.0
hxb@lion:~$

10

Install virtualenv with pip:
hxb@lion:~/PycharmProjects$ pip install virtualenv

11

Install the other libraries that scrapy depends on:
hxb@lion:~/PycharmProjects$ sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
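The manual checks in the steps above (python -V, pip -V, import scrapy) can also be collected into one small script. The following is only a sketch using the standard library; the function names and the list of modules to check are assumptions made for illustration, not part of the original setup.

```python
import importlib.util
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given interpreter version is Python 3 or newer."""
    return version_info[0] >= 3


def module_available(name):
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


def in_virtual_env():
    """Return True when the interpreter runs inside a virtualenv or venv."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


def environment_report(modules=("pip", "scrapy", "virtualenv")):
    """Build a list of human-readable lines describing the current environment."""
    lines = [
        "interpreter: %s" % sys.executable,
        "python 3 active: %s" % is_python3(),
        "virtual environment active: %s" % in_virtual_env(),
    ]
    for name in modules:
        status = "found" if module_available(name) else "missing"
        lines.append("%s: %s" % (name, status))
    return lines


if __name__ == "__main__":
    print("\n".join(environment_report()))
```

Running the script with the same `python` command that will later run scrapy shows which interpreter is active and whether scrapy is visible to it, which is the usual cause of the "No module named" errors seen above.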

Creating a Python project and testing that the environment is ready
1

In the PyCharm CE development tool, create a Python3 project: New Python Project 'GirlsSpider' with python3

2

Add a Python file named GirlsSpider.py and enter the following statement as a test:
print('Hello Girls Spider')
Running GirlsSpider.py shows the expected output:
/home/hxb/PycharmProjects/GirlsSpider/venv/bin/python /home/hxb/PycharmProjects/GirlsSpider/GirlsSpider.py
Hello Girls Spider

Creating the Scrapy project skeleton
1

Use the scrapy command to create the crawler project skeleton:
hxb@lion:~/PycharmProjects$ scrapy startproject meizhiSpider
New Scrapy project 'meizhiSpider', using template directory '/home/hxb/.local/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/hxb/PycharmProjects/meizhiSpider

You can start your first spider with:
    cd meizhiSpider
    scrapy genspider example example.com
hxb@lion:~/PycharmProjects$

2

Use scrapy genspider to generate a spider file. Also check whether the project's Python environment is python3; if it is not, switch it.
hxb@lion:~/PycharmProjects/meizhiSpider$ scrapy genspider jiandan jiandan.net
Created spider 'jiandan' using template 'basic' in module:
  meizhiSpider.spiders.jiandan
hxb@lion:~/PycharmProjects/meizhiSpider$
If the project's Python version turns out to be 2.7, change the project interpreter from python2.7 to python3.6 in PyCharm:
1) File -> Settings -> Project: meizhiSpider -> Project Interpreter
2) Add a new Python interpreter environment
3) Run jiandan.py to check whether the scrapy environment is OK

Analyzing the target website's content for crawling
1

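Testing against the jiandan website showed that its robots restrictions prevented the images from being downloaded, so the xiaohuar site was chosen for the image crawl instead.
1. Target website: xiaohuar.com/hua/
2. XPath of the images: //div[@class='img']/a/img/@src
3. XPath of the next page: //a[text()='下一页']/@href (the link text '下一页' means 'next page')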
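To make the two XPath expressions above concrete, here is a small sketch that applies them to a hand-written sample page. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, so it only accepts well-formed markup and matches the link text in Python instead of with text(). The sample HTML, the http:// scheme, and the function name are assumptions for illustration; the real page structure may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed stand-in for one listing page of the target site.
SAMPLE_PAGE = """
<html><body>
  <div class="img"><a href="/p-1.html"><img src="/d/file/a.jpg" /></a></div>
  <div class="img"><a href="/p-2.html"><img src="/d/file/b.jpg" /></a></div>
  <a href="/hua/list-1.html">下一页</a>
</body></html>
"""


def extract_links(page_source, base_url):
    """Return (absolute image URLs, absolute next-page URL or None)."""
    root = ET.fromstring(page_source.strip())

    # Equivalent of //div[@class='img']/a/img/@src
    image_urls = [
        urljoin(base_url, img.get("src"))
        for img in root.findall(".//div[@class='img']/a/img")
    ]

    # Equivalent of //a[text()='下一页']/@href
    next_url = None
    for anchor in root.iter("a"):
        if (anchor.text or "").strip() == "下一页":
            next_url = urljoin(base_url, anchor.get("href"))
            break
    return image_urls, next_url


if __name__ == "__main__":
    images, next_page = extract_links(SAMPLE_PAGE, "http://xiaohuar.com/hua/")
    print(images)
    print(next_page)
```

Inside a Scrapy spider the same expressions would normally be passed to response.xpath(...), which also copes with real-world HTML that is not well-formed.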

2

Write the code in the following order:
1. jiandan.py: the spider code (refer to the following picture)
2. items.py: defines the items that hold the crawl results
3. pipelines.py: saves the crawl results
4. settings.py: the settings for scrapy
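The spider's parse callback is where yield matters most: it hands back one item per image and then one request for the next page. The sketch below shows that control flow as a plain Python generator without importing scrapy, so it is not the article's jiandan.py. The function name, the dictionary keys, and the http:// scheme are assumptions for illustration.

```python
from urllib.parse import urljoin


def parse_page(base_url, image_srcs, next_href):
    """Yield one item per image, then a follow-up entry for the next page."""
    for src in image_srcs:
        # Each yield hands a single result to the caller and pauses here;
        # the loop resumes when the caller asks for the next result.
        yield {"type": "item", "image_url": urljoin(base_url, src)}

    if next_href:
        # Yielding the next page last lets the crawl continue page by page.
        yield {"type": "follow", "url": urljoin(base_url, next_href)}


if __name__ == "__main__":
    for result in parse_page("http://xiaohuar.com/hua/",
                             ["/d/file/a.jpg", "/d/file/b.jpg"],
                             "list-1.html"):
        print(result)
```

In an actual Scrapy spider the "follow" entry would be a scrapy.Request(url, callback=self.parse) and the items would be instances of the class defined in items.py; using return instead of yield would stop after the first result.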

3

Run the crawler. You can see it working, continuously downloading images to the local disk.
1. Run scrapy:
hxb@lion:~/PycharmProjects/meizhiSpider/meizhiSpider$ scrapy crawl jiandan
2. The image files are saved in the directory: /home/hxb/jiandan
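How the images end up in a local directory can be sketched as an item pipeline. A Scrapy item pipeline is an ordinary Python class with a process_item(self, item, spider) method, so the example below needs no scrapy import. It is not the author's pipelines.py: the class name, the image_url item key, and the injectable fetch function are assumptions for illustration, and the default folder ~/jiandan simply mirrors the directory mentioned above.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def download(url):
    """Fetch the raw bytes behind `url` with a blocking HTTP request."""
    with urlopen(url, timeout=30) as response:
        return response.read()


class SaveImagePipeline:
    """Hypothetical pipeline that writes each item's image into a local folder."""

    def __init__(self, store_dir=os.path.expanduser("~/jiandan"), fetch=download):
        self.store_dir = store_dir
        self.fetch = fetch  # replaceable, so the class can be tried without network access

    def process_item(self, item, spider):
        os.makedirs(self.store_dir, exist_ok=True)
        # Name the file after the last path component of the image URL.
        file_name = os.path.basename(urlparse(item["image_url"]).path)
        target = os.path.join(self.store_dir, file_name)
        with open(target, "wb") as handle:
            handle.write(self.fetch(item["image_url"]))
        # Returning the item passes it on to any later pipeline.
        return item
```

A pipeline like this only runs once it is listed in the ITEM_PIPELINES setting in settings.py. The blocking download keeps the sketch short; for larger crawls Scrapy's built-in ImagesPipeline is usually a better fit because it downloads through the framework itself.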

Notes
1

Installing scrapy through pip requires sudo -H

2

When writing a crawler, pay attention to how yield is used

3

The method described in this example cannot crawl images from websites protected by robots restrictions
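Whether a site's robots rules block a given URL can be checked before writing a spider. The sketch below uses the standard library's urllib.robotparser on a hand-written rule set. The sample rules, the user agent name, and the example.com URLs are assumptions for illustration and do not reflect the real rules of any site named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/public/a.jpg",
                "http://example.com/private/a.jpg"):
        print(url, is_allowed(SAMPLE_ROBOTS, "meizhiSpider", url))
```

For a live site, RobotFileParser can load the rules itself through set_url() and read(). Within Scrapy, the ROBOTSTXT_OBEY setting in settings.py controls whether the crawler applies such rules.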
