
How to Scrape Web Page Data

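Python is a good language for web crawling and web data scraping, and it provides a number of modules for the job. urllib is the package used to open web links, and its urlopen() function fetches a page. bs4 (the BeautifulSoup package) provides the BeautifulSoup() function, which parses the HTML that comes back.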
Tools/Materials
1

Python 3.4

2

BeautifulSoup

Method/Steps
1

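Import the required modules:

from urllib.request import urlopen    # opens web pages
from urllib.error import HTTPError    # handles errors raised when a link fails to open
from bs4 import BeautifulSoup         # parses HTML documents
import re                             # matches target strings with regular expressions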
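The four imports above could be combined roughly as follows. This is a minimal sketch and not a complete crawler: the helper names fetch_html and extract_links are hypothetical, the 10-second timeout is an arbitrary choice, and URLError is caught in addition to HTTPError on the assumption that network failures (not only HTTP status errors) should be handled.

```python
import re
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Return the raw page body as bytes, or None if the page cannot be opened.

    Hypothetical helper, not part of urllib.
    """
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except (HTTPError, URLError) as err:
        # HTTPError covers responses such as 404 or 500;
        # URLError covers failures such as an unreachable host.
        print(f"Could not open {url}: {err}")
        return None


def extract_links(html, pattern):
    """Return the href of every <a> tag whose href matches the regex pattern.

    Hypothetical helper built on BeautifulSoup.find_all().
    """
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=re.compile(pattern))]
```

A typical use would be to pass the result of fetch_html("http://news.baidu.com/") to extract_links with a pattern such as r"^http", after checking that the fetch did not return None.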

3

A Java version that fetches a page and prints a substring of it:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class Capture {
    public static void main(String[] args) throws MalformedURLException, IOException {
        String strUrl = "http://news.baidu.com/";
        URL url = new URL(strUrl);
        HttpURLConnection httpConnection = (HttpURLConnection) url.openConnection();
        StringBuilder stringBuilder = new StringBuilder();
        // Read the response line by line as UTF-8; try-with-resources closes the reader.
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(httpConnection.getInputStream(), "utf-8"))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                stringBuilder.append(line);
            }
        }
        String string = stringBuilder.toString();
        // The two marker strings are empty in the source, so begin and end are both 0
        // and the substring is empty until real markers are supplied.
        int begin = string.indexOf("");
        int end = string.indexOf("");
        System.out.println("IP address:" + string.substring(begin, end));
    }
}
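For comparison with the Java program above, the same fetch-then-slice idea might look like this in Python using only the standard library. It is a sketch under assumptions: fetch_text and text_between are hypothetical helper names, the marker strings are parameters because the original leaves them empty, and unlike the Java substring(begin, end) call this version skips past the start marker and returns None when a marker is missing.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download a page and decode it to a string (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        # Undecodable bytes are replaced rather than raising an error.
        return response.read().decode(encoding, errors="replace")


def text_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns None if either marker is not found (hypothetical helper).
    """
    begin = text.find(start_marker)
    if begin == -1:
        return None
    begin += len(start_marker)  # move past the marker itself
    end = text.find(end_marker, begin)
    if end == -1:
        return None
    return text[begin:end]
```

Searching for the end marker only after the start marker avoids the case where an earlier occurrence of the end marker produces a negative-length slice.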
