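Preface
Last night 《狂飙》 (The Knockout), the first breakout hit series of 2023, aired its finale and at one point reached number one on the trending-search list.
"In matters of right and wrong, one careless step can send you into a bottomless abyss; only by holding fast to your beliefs can you stay true to your original aspiration."
With so many viewers discussing the show, I wanted to join in too.
Let's use Python to scrape the comment data for 《狂飙》.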
Code
Import the modules
import requests
import parsel
Disguise the request as a browser
headers = {
'Cookie': 'll="118267"; bid=vmTru_a25m8; __utma=30149280.50068328.1675317520.1675317520.1675317520.1; __utmc=30149280; __utmz=30149280.1675317520.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1675317540%2C%22https%3A%2F%2Fwww.douban.com%2F%22%5D; _pk_ses.100001.4cf6=*; __utma=223695111.62892083.1675317540.1675317540.1675317540.1; __utmb=223695111.0.10.1675317540; __utmc=223695111; __utmz=223695111.1675317540.1.1.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __gads=ID=fb33508fbeefffdc-22b1618a7fd900c1:T=1675317540:RT=1675317540:S=ALNI_Ma0hUcCRHqTpc0wmcM01k3qpX3big; __gpi=UID=0000099c3e5d1190:T=1675317540:RT=1675317540:S=ALNI_MYqY1aqMuFbXYpmO6sFDn6zMnHB9g; __yadk_uid=KpA5hjYEmww6Sf2qskRgZamuj7aaecAC; ct=y; __utmb=30149280.3.10.1675317520; _vwo_uuid_v2=D091DE0AFC99F8C5AFC3169D9CB1E30F3|b218e266efb05a6a7a8652ac6ceecfe9; _pk_id.100001.4cf6=a8eb1a0fc7d89e94.1675317540.1.1675318206.1675317540.',
'Host': 'movie.*****.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}
for page in range(0, 4000):
    print(page)
Send the request
    url = f'https://movie.***.com/subject/35465232/comments?start={page*20}&limit=20&status=P&sort=new_score'
    response = requests.get(url=url, headers=headers)
    # Parse the returned HTML
    select = parsel.Selector(response.text)
    # Select every comment block on the page
    comments = select.css('.comment-item .comment')
    for comment in comments:
        name = comment.css('.comment-info a::text').get()
        try:
            # The class is expected to look like 'allstar50 rating'; reduce it to the star count
            score_str = comment.css('.comment-info .rating::attr(class)').get()
            score = score_str.replace('0 rating', '').replace('allstar', '')
        except AttributeError:
            # No rating element on this comment: .get() returned None
            score = 0
        comment_time = comment.css('.comment-info .comment-time::text').get().strip()
        vote_count = comment.css('.comment-vote .votes.vote-count::text').get()
        comment_content = comment.css('.comment-content span::text').get()
        print(name, score, comment_time, vote_count, comment_content)
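The chained replace calls in the loop only work if the rating class has a form such as 'allstar50 rating'. As a minimal sketch under that assumption, a small helper can pull out the star count with a regular expression and fall back to 0 when a comment carries no rating. The name parse_score is hypothetical and not part of the original script.

```python
import re


def parse_score(class_attr):
    """Return the star count encoded in a class string, or 0 if there is none.

    Assumes the class attribute looks like 'allstar50 rating', which is the
    format implied by the replace() calls in the scraping loop.
    """
    if not class_attr:
        return 0  # no rating element: the selector's .get() returned None
    match = re.search(r'allstar(\d)0', class_attr)
    return int(match.group(1)) if match else 0


print(parse_score('allstar50 rating'))
print(parse_score(None))
```

Unlike the original try/except, this version always returns an integer, which may be easier to work with if the scores are later counted or averaged.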
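Results
The code as posted can collect the first ten pages of comments.
Comments beyond that point are visible, and therefore collectable, only after logging in.
Log in, replace the 'Cookie' value in headers with your own, and you can collect all of them.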
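One way to avoid editing the script every time the cookie changes is to read it from an environment variable, and to stop paging once a page comes back empty instead of always looping 4000 times. The sketch below is only an illustration under stated assumptions: the variable name DOUBAN_COOKIE and the helpers build_headers and crawl are hypothetical, and the page fetcher is passed in as a function so the loop logic can be tried without touching the network.

```python
import os


def build_headers(base_headers, env_var='DOUBAN_COOKIE'):
    """Copy base_headers, overriding 'Cookie' when the env variable is set.

    DOUBAN_COOKIE is a hypothetical variable name chosen for this sketch.
    """
    headers = dict(base_headers)
    cookie = os.environ.get(env_var)
    if cookie:
        headers['Cookie'] = cookie
    return headers


def crawl(fetch_page, max_pages=4000):
    """Collect rows page by page, stopping at the first empty page.

    fetch_page(page) is a hypothetical callable returning a list of rows
    for that page, for example the comments parsed in the loop above.
    """
    rows = []
    for page in range(max_pages):
        page_rows = fetch_page(page)
        if not page_rows:
            # Assumed to mean we are past the last page visible to this session
            break
        rows.extend(page_rows)
    return rows


# Offline demonstration with a fake fetcher that serves three pages.
fake_pages = {0: ['a', 'b'], 1: ['c'], 2: ['d']}
collected = crawl(lambda page: fake_pages.get(page, []))
print(collected)
```

In the script above, calling headers = build_headers(headers) before the loop would then pick up whatever cookie is currently exported in the shell.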
This article is reposted from CDSN.