I wanted to build a fun project, so let's first organize the approach: how to quickly scrape the key information from each listing, and how to page through the results automatically.
After some thought I settled on the most conventional stack: requests plus re regular expressions, with BeautifulSoup for batch extraction.
import requests
import re
from bs4 import BeautifulSoup
import pymysql
Next come the URLs. Note the site's anti-crawler checks: the first page must be requested as https://tianjin.anjuke.com/sale/, while every later page must use 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page; otherwise the request gets flagged as a bot and the scrape fails. The same loop also handles pagination. (A slightly more polite variant of this fetch loop is sketched right after the code below.)
while page < 11:
    # brower.get("https://tianjin.anjuke.com/sale/p%d/#filtersort" % page)
    # time.sleep(1)
    print("Page " + str(page))
    # proxy = requests.get(pool_url).text
    # proxies = {
    #     'http': 'http://' + proxy
    # }
    if page == 1:
        url = 'https://tianjin.anjuke.com/sale/'
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    else:
        url = 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page,
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    # html = requests.get(url, allow_redirects=False, headers=headers, proxies=proxies)
    html = requests.get(url, headers=headers)
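One caveat: ten rapid-fire requests with identical headers is exactly the traffic pattern anti-crawler systems look for. As a minimal sketch (not part of the original code), the same fetch can go through a requests.Session with a randomized pause between pages; the 1-3 second delay range is an arbitrary assumption.

import random
import time

import requests

session = requests.Session()
session.headers.update({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
})

def fetch_page(page):
    # page 1 uses the bare listing URL; later pages use the /p<n>/ pattern
    if page == 1:
        url = 'https://tianjin.anjuke.com/sale/'
    else:
        url = 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page
    session.headers['referer'] = url
    resp = session.get(url, timeout=10)
    time.sleep(random.uniform(1, 3))  # pause so requests are not back-to-back
    return resp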
The second step is to analyze the page markup. First, locate the listing photos.
Regular expressions, go!
# image URL
myjpg = r'<img src="(.*?)" width="180" height="135" />'
jpg = re.findall(myjpg, html.text)
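A side note: a regex keyed to fixed width/height attributes is brittle if the markup shifts. Since a BeautifulSoup soup object gets built a step later anyway, the same URLs can be pulled with an attribute filter; this is an alternative sketch, not what the original code does.

# assumes soup = BeautifulSoup(html.content, 'lxml'), built in the step below
jpg_bs = [img['src']
          for img in soup.find_all('img', attrs={'width': '180', 'height': '135'})
          if img.has_attr('src')]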
With the photos captured, the remaining fields follow the same pattern, so let's grab them quickly too!
# listing title
mytail = r'<a data-from="" data-company="" title="(.*?)" href'
tail = re.findall(mytail, html.text)

# total price
totalprice = r'<span class="price-det"><strong>(.*?)</strong>'
mytotal = re.findall(totalprice, html.text)

# unit price
simpleprice = r'<span class="unit-price">(.*?)</span> '
simple = re.findall(simpleprice, html.text)
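These captures are display strings, not numbers. If you want numeric columns later, a small cleanup helper like the sketch below does the job; the exact text formats (for example '280万' for a total and '23456元/m²' for a unit price) are assumptions about the page, not something verified in this post.

def to_number(text):
    # keep only digits and the decimal point; the source text format is assumed
    digits = re.sub(r'[^\d.]', '', text)
    return float(digits) if digits else None

total_values = [to_number(t) for t in mytotal]  # totals, in units of 万 (10,000 CNY)
unit_values = [to_number(s) for s in simple]    # unit prices, in CNY per m²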
Next, use BeautifulSoup to pull values from tags by selector. I use the lxml parser here because it is fast; html.parser works as well.
soup = BeautifulSoup(html.content, 'lxml')
Looking at the markup, the detail fields are full of line breaks and the span tags carry no names of their own, so it's time to bring in our guest star, bs4.
A loop collects everything in one pass: a page yields 300 span texts, and since a page lists only 60 properties, the texts are split into groups of five, one record per listing. The re.sub call strips all whitespace from the fifth field so it can be stored cleanly in the database. (An equivalent slicing version is sketched right after the code.)
# house detail spans
itemdetail = soup.select(".details-item span")
# print(len(itemdetail))
you = []
my = []
for i in itemdetail:
    # print(i.get_text())
    you.append(i.get_text())
k = 0
while k < 60:  # 60 listings per page, 5 detail spans each
    my.append([you[5 * k], you[5 * k + 1], you[5 * k + 2], you[5 * k + 3],
               re.sub(r'\s', "", you[5 * k + 4])])
    k = k + 1
# print(my)
# print(len(my))
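For reference, the same grouping can be written with slicing so neither the index arithmetic nor the count of 60 is hand-rolled; this sketch is equivalent to the loop above under the assumption that every listing contributes exactly five spans.

# group the flat span texts five at a time; whitespace-strip the fifth field
my = [you[j:j + 4] + [re.sub(r'\s', '', you[j + 4])]
      for j in range(0, len(you) - len(you) % 5, 5)]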
Next, store it all in the database!
# recent PyMySQL versions require keyword arguments here
db = pymysql.connect(host="localhost", user="root", password="", database="anjuke")
cursor = db.cursor()
print(len(jpg))
for i in range(0, len(tail)):
    jpgs = jpg[i]
    scripts = tail[i]
    localroom = my[i][0]
    localarea = my[i][1]
    localhigh = my[i][2]
    localtimes = my[i][3]
    local = my[i][4]
    total = mytotal[i]
    oneprice = simple[i]
    sql = "insert into shanghai_admin values('%s','%s','%s','%s','%s','%s','%s','%s','%s')" % \
          (jpgs, scripts, local, total, oneprice, localroom, localarea, localhigh, localtimes)
    cursor.execute(sql)
db.commit()
db.close()
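A word of caution: interpolating scraped strings straight into SQL breaks the moment a title contains a quote character, and it is an injection risk. A safer sketch using PyMySQL's parameterized executemany follows; the column order simply mirrors the tuple above.

# let the driver quote the values; the %s here are placeholders, not string formatting
rows = [(jpg[i], tail[i], my[i][4], mytotal[i], simple[i],
         my[i][0], my[i][1], my[i][2], my[i][3])
        for i in range(len(tail))]
cursor.executemany(
    "insert into shanghai_admin values (%s,%s,%s,%s,%s,%s,%s,%s,%s)",
    rows)
db.commit()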
All done! Let's look at the results!
The complete code:
# from selenium import webdriver
import requests
import re
from bs4 import BeautifulSoup
import pymysql
# import time

# chrome_driver = r"C:\Users\秦\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\selenium-3.141.0-py3.8.egg\selenium\webdriver\chrome\chromedriver.exe"
# brower = webdriver.Chrome(executable_path=chrome_driver)
# pool_url = 'http://localhost:5555/random'

page = 1
while page < 11:
    # brower.get("https://tianjin.anjuke.com/sale/p%d/#filtersort" % page)
    # time.sleep(1)
    print("Page " + str(page))
    # proxy = requests.get(pool_url).text
    # proxies = {
    #     'http': 'http://' + proxy
    # }
    if page == 1:
        url = 'https://tianjin.anjuke.com/sale/'
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    else:
        url = 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page,
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    # html = requests.get(url, allow_redirects=False, headers=headers, proxies=proxies)
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.content, 'lxml')

    # image URL
    myjpg = r'<img src="(.*?)" width="180" height="135" />'
    jpg = re.findall(myjpg, html.text)

    # listing title
    mytail = r'<a data-from="" data-company="" title="(.*?)" href'
    tail = re.findall(mytail, html.text)

    # house detail spans
    itemdetail = soup.select(".details-item span")
    # print(len(itemdetail))
    you = []
    my = []
    for i in itemdetail:
        # print(i.get_text())
        you.append(i.get_text())
    k = 0
    while k < 60:  # 60 listings per page, 5 detail spans each
        my.append([you[5 * k], you[5 * k + 1], you[5 * k + 2], you[5 * k + 3],
                   re.sub(r'\s', "", you[5 * k + 4])])
        k = k + 1
    # print(my)
    # print(len(my))

    # total price
    totalprice = r'<span class="price-det"><strong>(.*?)</strong>'
    mytotal = re.findall(totalprice, html.text)

    # unit price
    simpleprice = r'<span class="unit-price">(.*?)</span> '
    simple = re.findall(simpleprice, html.text)

    # recent PyMySQL versions require keyword arguments here
    db = pymysql.connect(host="localhost", user="root", password="", database="anjuke")
    cursor = db.cursor()
    print(len(jpg))
    for i in range(0, len(tail)):
        jpgs = jpg[i]
        scripts = tail[i]
        localroom = my[i][0]
        localarea = my[i][1]
        localhigh = my[i][2]
        localtimes = my[i][3]
        local = my[i][4]
        total = mytotal[i]
        oneprice = simple[i]
        sql = "insert into shanghai_admin values('%s','%s','%s','%s','%s','%s','%s','%s','%s')" % \
              (jpgs, scripts, local, total, oneprice, localroom, localarea, localhigh, localtimes)
        cursor.execute(sql)
    db.commit()
    db.close()
    # button = brower.find_element_by_class_name('aNxt')
    # button.click()
    # time.sleep(1)
    page = page + 1
# brower.close()
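The post never shows the definition of the shanghai_admin table. For anyone reproducing this, here is a guessed schema matching the nine string values inserted above; every column name and type is an assumption, to be run once with an open cursor before the scraping loop.

# hypothetical schema -- the original table layout is not shown in the post
ddl = """
CREATE TABLE IF NOT EXISTS shanghai_admin (
    jpg       VARCHAR(255),  -- image URL
    title     VARCHAR(255),  -- listing title
    address   VARCHAR(255),  -- whitespace-stripped location
    total     VARCHAR(32),   -- total price text
    unitprice VARCHAR(64),   -- unit price text
    rooms     VARCHAR(32),
    area      VARCHAR(32),
    floor     VARCHAR(64),
    built     VARCHAR(32)
) CHARACTER SET utf8mb4
"""
cursor.execute(ddl)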