1. Scrapy
pip install Scrapy
2. Selenium
pip install selenium
3. chromedriver
First you need to download and install the Chrome browser and check its version number. On a headless Linux server, the official RPM can be installed from the command line:
rpm -ivh google-chrome-stable_current_x86_64.rpm
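You can check which Chrome version was installed with:
google-chrome --version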
Then download the chromedriver build that matches your Chrome version from https://chromedriver.storage.googleapis.com/index.html and place it in /usr/bin.
After that, the spider code below should run without problems.
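To confirm that Chrome and chromedriver can talk to each other, a quick headless check like the following should print a page title (a minimal sketch using the same Selenium 3 style calls as the spider below; the test URL is an arbitrary choice):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
# chromedriver was placed in /usr/bin in the previous step
driver = webdriver.Chrome(chrome_options=options, executable_path='/usr/bin/chromedriver')
driver.get('https://www.jd.com')  # any reachable page will do
print(driver.title)
driver.quit()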
import time

import scrapy
from selenium import webdriver

from comic_spider.items import TopicItem  # item class from the project (not used in this minimal example)


class SearchSpider(scrapy.spiders.Spider):
    name = 'search'
    search_page_url_pattern = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page={page}&enc=utf-8"
    start_urls = ['https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8']

    def __init__(self):
        # Headless Chrome; --no-sandbox is needed when running as root on a server
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path='/usr/bin/chromedriver')
        super(SearchSpider, self).__init__()

    def closed(self, reason):
        # Called when the spider finishes -- remember to shut the browser down
        self.browser.quit()

    def parse(self, response):
        # Read the total number of result pages from the pager
        total_page = response.css('span.p-skip em b::text').extract_first()
        if total_page:
            for i in range(int(total_page)):
                # JD search uses odd page numbers: 1, 3, 5, ...
                next_page_url = self.search_page_url_pattern.format(page=2 * i + 1)
                yield scrapy.Request(next_page_url, callback=self.parse_page)
                time.sleep(1)  # throttle requests

    def parse_page(self, response):
        phone_info_list = response.css('div.p-name a')
        for item in phone_info_list:
            phone_name = item.css('a::attr(title)').extract_first()
            phone_href = item.css('a::attr(href)').extract_first()
            yield dict(name=phone_name, href=phone_href)
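The spider above only creates self.browser; in a typical Scrapy + Selenium setup the browser is actually driven from a downloader middleware, which the project would provide and enable in settings.py under DOWNLOADER_MIDDLEWARES. A minimal sketch of such a middleware (the class name and module placement are assumptions, not part of the original post):

from scrapy.http import HtmlResponse

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        # Render the page in headless Chrome and hand the HTML back to Scrapy
        spider.browser.get(request.url)
        return HtmlResponse(url=request.url,
                            body=spider.browser.page_source,
                            encoding='utf-8',
                            request=request)

Assuming a standard project layout (the import of comic_spider.items suggests the project is called comic_spider), the spider can then be run from the project root and its results written out with:

scrapy crawl search -o phones.json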