Crawler Concepts, Tools, and HTTP
1. What is a crawler?
A crawler is a program that simulates a client (browser) sending network requests, receives the server's response, and extracts data according to rules.
2. Where does crawled data go?
Presented on web pages (e.g., apps that don't produce their own news fetch it with crawlers)
Analyzed (extract what you want from the data; big companies own big data, small companies scrape it). Example: crawl novels, news, etc. and render them on a front-end page
3. Browser requests
URL
In Chrome, right-click → Inspect, then open the Network tab
URL = scheme + domain/host name + resource path + ?parameters
Tip: use a URL-decoding tool to decode the URL address so the parameters become readable
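As a quick sketch with Python's standard library (urllib.parse, not mentioned in the notes above), splitting and decoding a URL looks like this:

```python
from urllib.parse import urlparse, parse_qs, unquote

url = "https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB&pn=10"
parts = urlparse(url)                          # split into scheme / host / path / query
print(parts.scheme, parts.netloc, parts.path)  # https www.baidu.com /s
print(parse_qs(parts.query))                   # {'wd': ['爬虫'], 'pn': ['10']}
print(unquote("%E7%88%AC%E8%99%AB"))           # 爬虫
```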
When the browser requests a URL, it loads many more resources (the page is built from lots of JS and CSS)
The response for the current URL + JS + CSS + images = what you see in the Elements panel
A crawler requests only the URL itself
The Elements content differs from the response the crawler gets for that URL; in a crawler, always extract data from the response of the current URL
Where is the response for the current URL?
Find the current URL in the Network tab and click Response
Or right-click the page and choose View Page Source (for fast-changing pages, elements in the source may keep changing)
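To see the difference yourself, save the raw response and compare it with the Elements panel; a minimal sketch:

```python
import requests

# Whatever JS injects later will be missing from this file -- a crawler
# must extract data from this raw response, not from the Elements panel.
response = requests.get("http://www.baidu.com")
with open("raw_response.html", "w", encoding="utf-8") as f:
    f.write(response.content.decode())
```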
4. Understanding HTTP and HTTPS
HTTP: HyperText Transfer Protocol
HTTPS: HTTP + SSL (Secure Sockets Layer)
Data is encrypted before each transfer and decrypted after receipt
Less efficient, but secure
Differences between GET and POST requests
A GET request has no request body; a POST request does. GET puts its data in the URL
POST is commonly used for login and registration
POST can carry more and larger data than GET, so it is often used to transmit large text
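A minimal sketch of the difference using requests (httpbin.org is an assumed echo service, not part of these notes):

```python
import requests

# GET: the data rides in the URL as query-string parameters
r1 = requests.get("http://httpbin.org/get", params={"wd": "python"})
print(r1.url)   # http://httpbin.org/get?wd=python

# POST: the data rides in the request body and never appears in the URL
r2 = requests.post("http://httpbin.org/post", data={"email": "...", "password": "..."})
print(r2.url)   # http://httpbin.org/post
```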
HTTP requests (Request Headers / view params in DevTools)
Request line
Request headers: Host, Connection, Cache-Control, User-Agent (tells the server whether the visitor is a browser or a crawler). To mimic a mobile browser, change the User-Agent to that of the target phone model (see the sketch after this list)
Cookie: stores user information and is attached to every request sent to the server
Needed to fetch pages that require login
The server may also use the cookie to judge whether we are a crawler
Request body
Carries the data
A GET request has no request body; a POST request does
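A minimal sketch of that mobile emulation, assuming the iPhone User-Agent string that reappears in the examples later in these notes:

```python
import requests

# Present ourselves as mobile Safari; servers that sniff the UA should
# return their mobile page instead of the desktop one.
mobile_headers = {
    "User-Agent": ("Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) "
                   "AppleWebKit/604.1.38 (KHTML, like Gecko) "
                   "Version/11.0 Mobile/15A372 Safari/604.1"),
}
response = requests.get("http://www.baidu.com", headers=mobile_headers)
print(response.request.headers["User-Agent"])  # verify what was actually sent
```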
HTTP responses
1. Response headers
Set-Cookie: the server uses this field to set cookies locally (inspect them under Cookies)
2. Response body
The response corresponding to the URL
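A small sketch of inspecting Set-Cookie on a response (baidu.com is just an example host; the exact cookies it sets are not guaranteed):

```python
import requests

response = requests.get("http://www.baidu.com")
print(response.headers.get("Set-Cookie"))  # raw Set-Cookie response header, if any
print(response.cookies.get_dict())         # the same cookies parsed into a dict
```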
The requests module
1. Install before use: pip install requests
2. GET/POST requests
response = requests.get(url)  # send a GET request and fetch the response for url
response = requests.post(url, data={...})  # send a POST request carrying a request body (a dict) and fetch the response for url
```python
import requests

url = " "  # target URL (left blank in the original notes)
query_string = {"query": ".....", "from": "zh", "to": "en"}
headers = {"User-Agent": "......"}  # copy the value from the browser's DevTools (F12)
response = requests.post(url, data=query_string, headers=headers)
print(response)  # confirm the request went out and the server responded
print(response.content.decode())
```
3. Response attributes and methods
response.text  # often garbled; usually set response.encoding = 'utf-8' first
response.content.decode()  # content is a raw byte stream; add .decode() to convert it to str
response.encoding = response.apparent_encoding  # infer the encoding from the page content
response.request.url  # the URL the request was sent to, as a string
response.url  # the URL of the response; after a page redirect it differs from the one above
response.request.headers  # request headers, a dict
response.headers  # response headers
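For example (a sketch; github.com is used only because its http:// address redirects, which shows the attributes in action):

```python
import requests

response = requests.get("http://github.com")  # http:// redirects to https://
print(response.url)              # https://github.com/ -- final URL after the redirect
print(response.history)          # [<Response [301]>]  -- the redirect hop(s)
print(response.request.headers)  # headers requests sent on our behalf
print(response.headers)          # headers the server sent back
```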
4. The right ways to get page source
response.content.decode()
response.content.decode("gbk")
response.text  # requests guesses the encoding from the headers; often garbled; fix with response.encoding = 'utf-8'
response.text.encode('GBK', 'ignore').decode('gbk')
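These approaches can be folded into one helper that tries the usual encodings in order; a sketch, not part of the original notes:

```python
def decode_response(response):
    """Return the response body as str, trying common encodings in order."""
    for encoding in ("utf-8", "gbk"):
        try:
            return response.content.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: let requests infer the encoding from the page content
    response.encoding = response.apparent_encoding
    return response.text
```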
5. Sending requests with headers
The goal is to mimic a browser and get exactly the same content the browser gets
headers = {"User-Agent": "......", "Referer": "......"}  # copy the values from DevTools (F12); try adding more parameters if needed
response = requests.get(url, headers=headers)
response = requests.post(url, data=data, headers=headers)
6. Using the timeout parameter
requests.get(url, headers=headers, timeout=3)  # must return within 3 seconds, otherwise an exception is raised
7. The retrying module

```python
from retrying import retry

@retry(stop_max_attempt_number=3)  # retry the decorated function up to 3 times; raise only if all 3 fail, return as soon as one succeeds
def fun1():
    print("this is fun1")
    raise ValueError("this is a test error")  # force an error to show the retry behavior
```
8. Complete example

```python
import requests
from retrying import retry

headers = {"User-Agent": "....."}

@retry(stop_max_attempt_number=3)
def _parse_url(url):
    print("*" * 100)  # progress marker
    response = requests.get(url, headers=headers, timeout=3)
    return response.content.decode()

def parse_url(url):
    try:
        html_str = _parse_url(url)
    except Exception:
        html_str = None
    return html_str

if __name__ == '__main__':  # runs only when this module is executed directly
    url = "http://www.baidu.com"
    print(parse_url(url)[:100])  # first 100 characters
```
9. Handling cookie-related requests with requests
Method 1: put the cookie in the headers: headers = {"User-Agent": "...", "Cookie": "..."}
Or pass a cookie dict to the cookies parameter: requests.get(url, cookies=cookie_dict)
Method 2: send a POST request first to obtain the cookie, then request the post-login page with it
session = requests.session()  # keeps the session with the server; same methods as requests
session.post(url, data, headers)  # cookies the server sets locally are saved in the session
session.get(url)  # automatically carries the cookies saved in the session, so the request succeeds
10. Using the session to request the post-login page

```python
url = "....."
response = session.get(url, headers=headers)
with open("renren3.html", "w", encoding='utf-8') as f:
    f.write(response.content.decode())
```
11. Crawler examples, three ways
1. requests GET to fetch a page (Baidu homepage)

```python
import requests
from retrying import retry

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
}

@retry(stop_max_attempt_number=3)  # retry the decorated function; raise only if all three attempts fail
def get_page(get_url):
    """Fetch the page content with get()."""
    print('*' * 10)
    r = requests.get(get_url, headers=headers, timeout=10)
    return r.text.encode('GBK', 'ignore').decode('gbk')

def _get_page(get_url):
    try:
        html_str = get_page(get_url)
    except Exception:
        html_str = 'Error'
    else:
        with open('baidu_get.text', 'w') as f:
            f.write(html_str[:500])
    return html_str

if __name__ == '__main__':
    get_url = 'https://www.baidu.com/'
    print(_get_page(get_url)[:500])
```
2. requests POST with a request body to crawl translated input (Baidu Translate)

```python
import requests
from retrying import retry
import json

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
    "Referer": "http://fanyi.baidu.com/?aldtype=16047",
}
query_data = {
    "query": "人生苦短,我用python",
    "from": "zh",
    "to": "en",
}

@retry(stop_max_attempt_number=3)  # three attempts in a row; raise only if all three fail
def get_page(url):
    """Request the server."""
    print("*" * 10)
    r = requests.post(url, headers=headers, data=query_data, timeout=3)
    return r.content.decode()

def _get_page(url):
    try:
        html_str = get_page(url)
    except Exception:
        html_str = None
    return html_str

if __name__ == '__main__':
    url = 'http://fanyi.baidu.com/basetrans'
    page_dict = json.loads(_get_page(url))
    print(page_dict['trans'][0]['dst'])
```
3. Cookies to request a page behind a login (Renren)

```python
import requests
from retrying import retry

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
    "Cookie": "anonymid=jl6plze3-antvfl; _r01_=1; ln_uact=17318646019; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; _de=71EA9E23AC1F39D48A6CD3B85E99E01A; depovince=SC; JSESSIONID=abcnrx7CjMin0-AScWtww; ick_login=9a8e1f1c-6cd8-4700-afa6-8cf5315d36c2; jebe_key=8be431ea-e18e-4feb-ab4b-0472222a3513%7C819b727b2164a924b41678741330f60a%7C1535114212567%7C1%7C1535782169180; _ga=GA1.2.609958396.1535782958; _gid=GA1.2.2069950083.1535782958; jebecookies=2f5e7612-0f24-4f28-a1ff-52b96074e414|||||; p=fddc828d0aeca708377ca86756cd44342; first_login_flag=1; t=84129fe73a07099108643adb89c95ad52; societyguester=84129fe73a07099108643adb89c95ad52; id=967724772; xnsid=6c333906; ver=7.0; loginfrom=null; wp_fold=0",
}

@retry(stop_max_attempt_number=3)
def get_page(url):
    """Request the server response."""
    print('*' * 10)
    response = requests.get(url, headers=headers, timeout=5)
    return response.content.decode()

def _get_page(url):
    try:
        html_str = get_page(url)
    except Exception:
        html_str = None
    return html_str

if __name__ == '__main__':
    url = 'http://www.renren.com/967724772/profile'
    with open('renren2.txt', 'w', encoding='utf-8') as f:
        f.write(_get_page(url))
```
```python
import requests
from retrying import retry

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
}
cookie = "anonymid=jl6plze3-antvfl; _r01_=1; ln_uact=17318646019; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; _de=71EA9E23AC1F39D48A6CD3B85E99E01A; depovince=SC; jebe_key=8be431ea-e18e-4feb-ab4b-0472222a3513%7C819b727b2164a924b41678741330f60a%7C1535114212567%7C1%7C1535782169180; _ga=GA1.2.609958396.1535782958; _gid=GA1.2.2069950083.1535782958; fenqi_user_city=36; jebecookies=35d2c64d-78b8-41ef-b5e4-4c887424f2ac|||||; JSESSIONID=abcJLhFJVpWjx73Ufvyww; ick_login=f74bfdc3-17bf-44a5-8227-33b2b0e79e89; p=f527cef24d899db1c7795ed77b5494772; first_login_flag=1; t=f634330753378f3c9eb40dd69dcd74c72; societyguester=f634330753378f3c9eb40dd69dcd74c72; id=967724772; xnsid=67ab8568; ver=7.0; loginfrom=null; wp_fold=0"
# turn the "k1=v1; k2=v2" cookie string into a dict
cookie_dict = {i.split("=")[0]: i.split("=")[-1] for i in cookie.split("; ")}

@retry(stop_max_attempt_number=3)
def get_page(url):
    print('*' * 10)
    r = requests.get(url, headers=headers, cookies=cookie_dict, timeout=3)
    return r.content.decode()

def _get_page(url):
    try:
        html_str = get_page(url)
    except Exception:
        html_str = None
    return html_str

if __name__ == '__main__':
    url = "http://www.renren.com/967724772/profile"
    with open('renren2.txt', 'w') as f:
        f.write(_get_page(url))
```
4. POST to the login page, saving the post-login cookies in a local session
Note: CAPTCHA handling is still unsolved here, so session access did not work; open question!!!

```python
import requests
from retrying import retry

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
}
query_data = {
    "email": "17318646019",
    "password": "******",
}
session = requests.session()

@retry(stop_max_attempt_number=3)
def _get_page(post_url, url):
    session.post(post_url, headers=headers, data=query_data)  # log in; cookies land in the session
    response = session.get(url, headers=headers)  # the session carries those cookies
    return response.content.decode()

if __name__ == '__main__':
    post_url = "http://www.renren.com/PLogin.do"
    url = "http://www.renren.com/967724772/profile"
    with open('renren_session_cookie.txt', 'w', encoding='utf-8') as f:
        f.write(_get_page(post_url, url))
```
Regular Expressions
- Extract the content of each table row:

```python
import re

res_tr = r'<tr>(.*?)</tr>'
m_tr = re.findall(res_tr, language, re.S | re.M)
```
- Extract the link text inside <a>...</a>:

```python
res_tr = r'<a .*?>(.*?)</a>'
m_tr = re.findall(res_tr, content, re.S | re.M)
```
- Match entire <a ...>...</a> tags:

```python
urls = re.findall(r"<a.*?href=.*?<\/a>", content, re.I | re.S | re.M)
```

- Get the URL inside the href attribute of <a href></a>:

```python
res_url = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
link = re.findall(res_url, content, re.I | re.S | re.M)
```
- Get the page title with lookarounds:

```python
title_pat = r'(?<=<title>).*?(?=</title>)'
title_ex = re.compile(title_pat, re.M | re.S)
title_obj = re.search(title_ex, content)
title = title_obj.group()
print(title)
```
- Or get the title with findall:

```python
title = re.findall(r'<title>(.*?)</title>', content)
print(title[0])
```
- Slice a block out of the page with find():

```python
start = content.find(r'<table class="infobox vevent"')
end = content.find(r'</table>')
infobox = content[start:end]
print(infobox)
```
- Strip <br /> tags and newlines:

```python
if '<br />' in value:
    value = value.replace('<br />', '')
    value = value.replace('\n', ' ')
```
- Remove all HTML tags with re.sub:

```python
value = re.sub('<[^>]+>', '', value)
```
Advanced Crawling