Data Extraction Methods
json
- A data interchange format (passed from the back end to the front end); a string that looks like Python types (dicts, lists, ...)
- Import the json module before using it
- Where JSON comes back: typically from AJAX/XHR endpoints; look for them in the browser dev tools' Network panel
- json.loads
  - converts a JSON string into Python types
    json.loads('<json string>')
- json.dumps
  - converts Python types into a JSON string (pass the object itself, not a string wrapped around it)
    json.dumps(python_dict)
  - json.dumps(ret1, ensure_ascii=False, indent=2)
    - ensure_ascii=False: lets Chinese display as Chinese instead of \uXXXX escapes
    - indent: pretty-prints each nested level indented from the previous one, e.g. indent=2
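A minimal round-trip sketch of the two calls above (the sample dict is made up for illustration):

```python
import json

# a JSON string, e.g. the body of an AJAX response
json_str = '{"title": "老友记", "rate": "9.8"}'

# json.loads: JSON string -> Python types (here, a dict)
data = json.loads(json_str)
print(data["title"])  # 老友记

# json.dumps: Python types -> JSON string
print(json.dumps(data))                                # Chinese escaped as \uXXXX
print(json.dumps(data, ensure_ascii=False, indent=2))  # readable Chinese, pretty-printed
```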
- Case study: a Douban TV-show spider
  - Preview tab: search the response previews for the content you want to scrape
  - Filter box: narrows the request list to URLs matching the text you type
  - Implementation:
```python
from parse import parse_url  # local helper module; a sketch of it follows this block
import json


class DoubanSpider:
    """Scrape Douban TV-show listings."""

    def __init__(self):
        self.temp_url = "https://movie.douban.com/j/search_subjects?type=tv&tag=%E7%BE%8E%E5%89%A7&sort=recommend&page_limit=20&page_start={}"

    def get_content_list(self, html_str):
        # the endpoint returns JSON; the shows live under the "subjects" key
        dict_data = json.loads(html_str)
        content_list = dict_data["subjects"]
        return content_list

    def save_content_list(self, content_list):
        with open("douban.json", "a", encoding="utf-8") as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write("\n")
        print('saved')

    def run(self):
        num = 0
        while num < 100:  # five pages of 20 items each
            url = self.temp_url.format(num)
            print(url)
            html_str = parse_url(url)
            content_list = self.get_content_list(html_str)
            self.save_content_list(content_list)
            num += 20


if __name__ == '__main__':
    douban = DoubanSpider()
    douban.run()
```
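`parse_url` above comes from a helper module written locally for these notes, not from PyPI. A plausible minimal sketch of it (the retry loop and timeout are my assumptions):

```python
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
}


def parse_url(url, retries=3):
    """Fetch a URL and return the decoded body (assumed behavior of the helper)."""
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=HEADERS, timeout=5)
            return response.content.decode()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
```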
XPath and lxml
xpath
- A language for extracting data from HTML
- XPath Helper plugin: helps us locate data in the browser's Elements panel
- 1. Selecting nodes (tags)
  /html/head/meta: selects the meta tags under html/head
- 2. The // operator
  //: selects matching tags anywhere in the document
  //li: all li tags on the current page
  /html/meta//link: combined with an absolute path, selects link tags at any depth below that point
- 3. The @ symbol
  - filter on a specific attribute value
    //div[@class='feed-infinite-wrapper']/ul/li: the li tags under ul under the div with class='feed-infinite-wrapper'
    //div[@class='feed-infinite-wrapper']/ul/li/a/@href: the href values of those a tags
- 4. Getting the text of an a tag:
  /a/text()
  - all text under the tag, including nested children:
  /a//text()
- 5. The current node
  - "./a": the a tags under the current node
A runnable sketch of these selectors follows.
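The sketch uses the lxml library introduced in the next section; the HTML snippet is made up, with a class name mirroring the examples above:

```python
from lxml import etree

html = etree.HTML("""
<div class="feed-infinite-wrapper">
  <ul>
    <li><a href="/post/1">first <b>post</b></a></li>
    <li><a href="/post/2">second post</a></li>
  </ul>
</div>
""")

print(html.xpath("//li"))  # all li elements on the page
print(html.xpath("//div[@class='feed-infinite-wrapper']/ul/li/a/@href"))  # ['/post/1', '/post/2']
print(html.xpath("//li/a/text()"))   # direct text only: ['first ', 'second post']
print(html.xpath("//li/a//text()"))  # all text incl. nested <b>: ['first ', 'post', 'second post']

# "./" paths are relative to the current node
for li in html.xpath("//li"):
    print(li.xpath("./a/@href"))
```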
lxml
- Install: pip install lxml
- Usage:

```python
from lxml import etree

element = etree.HTML(html_str)  # html_str: the HTML source; returns an Element object
result = element.xpath("...")   # fill in an XPath expression
```

- Note: XPath runs against the raw response returned for the URL (the page source). The browser's Elements panel is only a safe reference when it matches that response, since JavaScript can rewrite the DOM after load.
- Implementation:
```python
import requests
from lxml import etree

url = "https://movie.douban.com/chart"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
}

response = requests.get(url, headers=headers)
html_str = response.content.decode()

html = etree.HTML(html_str)

# flat extraction: attribute values across the whole page
img_list = html.xpath("//div[@class='indent']/div/table//a[@class='nbg']/img/@src")
url_list = html.xpath("//div[@class='indent']/div/table//div[@class='pl2']/a/@href")

# grouped extraction: select each table first, then use relative ("./") paths inside it
ret1 = html.xpath("//div[@class='indent']/div/table")
for table in ret1:
    item = {}
    item['title'] = table.xpath(".//div[@class='pl2']/a/text()")[0].replace('/', '').strip()
    item['href'] = table.xpath(".//div[@class='pl2']/a/@href")[0]
    item['img'] = table.xpath(".//a[@class='nbg']/img/@src")[0]
    item['comment_num'] = table.xpath(".//span[@class='pl']/text()")[0]
    item['rating_num'] = table.xpath(".//span[@class='rating_nums']/text()")[0]
    print(item)
```
XPath and lxml case study (Qiushibaike)
```python
from lxml import etree
import requests
import json


class QiubaiSpider:
    def __init__(self):
        self.url_temp = "https://www.qiushibaike.com/8hr/page/{}/"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
        }

    def get_url_list(self):
        url_list = [self.url_temp.format(i) for i in range(1, 14)]
        return url_list

    def parse_url(self, url):
        print('now_parse:' + url)
        response = requests.get(url, headers=self.headers)
        # round-trip through GBK, dropping characters it cannot encode, to clean up mojibake
        return response.text.encode('GBK', 'ignore').decode('GBK')

    def get_content_list(self, html_str):
        html = etree.HTML(html_str)
        content_list = []
        div_list = html.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            item = {}
            # guard each XPath result: take the first match, or None if the list is empty
            item["author_name"] = div.xpath(".//h2/text()")[0].strip() if len(div.xpath(".//h2/text()")) > 0 else None
            item["author_img"] = div.xpath(".//img/@src")
            item["content"] = div.xpath('.//div[@class="content"]/span/text()')
            item["content"] = [i.strip() for i in item["content"]]
            item["stats_vote"] = div.xpath('.//span[@class="stats-vote"]/i/text()')
            item["stats_vote"] = item["stats_vote"][0] if len(item["stats_vote"]) > 0 else None
            item["stats_comments"] = div.xpath('.//span[@class="stats-comments"]//i/text()')
            item["stats_comments"] = item["stats_comments"][0] if len(item["stats_comments"]) > 0 else None
            item["content_img"] = div.xpath('.//div[@class="thumb"]//img/@src')
            item["content_img"] = 'https:' + item["content_img"][0] if len(item["content_img"]) > 0 else None
            content_list.append(item)
        return content_list

    def save_content_list(self, content_list):
        with open('qiubai.txt', 'a', encoding='utf-8') as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write('\n')
        print("saved")

    def run(self):
        """Main flow: build URLs, fetch, extract, save."""
        url_list = self.get_url_list()
        for url in url_list:
            html_str = self.parse_url(url)
            content_list = self.get_content_list(html_str)
            self.save_content_list(content_list)


if __name__ == '__main__':
    qiubai = QiubaiSpider()
    qiubai.run()
```
The standard recipe for writing a spider
- 1. URL
  - URL pattern and total page count known: build the list of URL addresses up front
  - Otherwise: start from start_url, then construct each next page's URL as you go
- 2. Send the request, get the response
- 3. Extract the data
  - response is a JSON string: use the json module
  - response is an HTML string: use the lxml module with XPath
- 4. Save
A minimal skeleton of this recipe follows.
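In the skeleton below, the URL pattern, page count, and XPath expression are placeholders, not a real target:

```python
import requests
from lxml import etree


class Spider:
    def __init__(self):
        # 1. URL: pattern and page count are known, so build the list up front
        self.url_temp = "https://example.com/page/{}/"  # placeholder
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
        }

    def get_url_list(self):
        return [self.url_temp.format(i) for i in range(1, 6)]

    def parse_url(self, url):
        # 2. send the request, get the response
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def get_content_list(self, html_str):
        # 3. extract: HTML -> lxml + XPath (for JSON responses, use json.loads instead)
        html = etree.HTML(html_str)
        return html.xpath("//li/a/text()")  # placeholder expression

    def save_content_list(self, content_list):
        # 4. save
        with open("result.txt", "a", encoding="utf-8") as f:
            for content in content_list:
                f.write(content + "\n")

    def run(self):
        for url in self.get_url_list():
            html_str = self.parse_url(url)
            self.save_content_list(self.get_content_list(html_str))


if __name__ == '__main__':
    Spider().run()
```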
Review of basic Python
- str.format: {} placeholders accept any object
| "传智{}播客".format("1") "传智{}播客".format(1) "传智{}播客".format([1,2,3]) "传智{}播客{}".format({1,2,3},[1,2,3])
|
- List comprehensions
  - a quick way to generate a whole list of values
    [i + 10 for i in range(10)] -> [10, 11, 12, ..., 19]
    ["10月{}日".format(i) for i in range(1, 10)] -> ["10月1日", "10月2日", ..., "10月9日"]
- Dict comprehensions
```python
{i + 10: i for i in range(10)}           # {10: 0, 11: 1, ..., 19: 9}
{"a{}".format(i): 10 for i in range(3)}  # {'a0': 10, 'a1': 10, 'a2': 10}
```
- Ternary operator
  - if the condition after `if` holds, the value before `if` is assigned to a; otherwise the value after `else` is

```python
a = 10 if 4 > 3 else 20  # condition is True  -> a = 10
a = 10 if 4 < 3 else 20  # condition is False -> a = 20
```
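This pattern is used throughout the spiders above to guard XPath results that may be empty, e.g.:

```python
# the guard pattern from the Qiushibaike spider: first match, or None
texts = []  # imagine div.xpath(".//h2/text()") matched nothing
author = texts[0].strip() if len(texts) > 0 else None
print(author)  # None
```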