Environment needed:
A Bilibili account; you must log in first, otherwise you cannot view the historical bullet comment records
A computer with internet access and a browser you are comfortable with; I use Chrome
A Python 3 environment with the requests module. Install it with the following command (using a mirror source is faster):
pip3 install requests -i http://pypi.douban.com/simple
Crawling steps: Log in, open the video page you want to scrape, and open the developer tools (the F12 shortcut in Chrome). Select the Network tab to monitor requests.
Click to view the historical bullet comments and capture the request that gets sent.
The number after rolldate is the bullet comment ID of the corresponding video. In the returned data, timestamp is the date of the bullet comments and new is their count.
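To see this structure for yourself, here is a minimal sketch that fetches the rolldate list (using the same bullet comment ID 2043618 as the reference code further below) and prints each day's timestamp and comment count:

import requests

# 2043618 is the bullet comment ID used in the reference code below
url = 'https://comment.bilibili.com/rolldate,2043618'
resp = requests.get(url)
for entry in resp.json():
    # Each entry pairs a day (Unix timestamp) with the number of new comments that day
    print(entry['timestamp'], entry['new'])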
Select any day in the historical bullet comments to view; this sends a new request.
The request URL has the form dmroll,timestamp,bullet-comment-ID and fetches the bullet comments for that date. For example, the timestamp 1507564800 corresponds to 2017/10/10 0:0:0 (UTC+8).
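You can verify the conversion with Python's standard time module. Note that time.localtime uses the machine's local time zone, so the output below assumes a UTC+8 machine:

import time

# 1507564800 seconds since the Unix epoch
time_array = time.localtime(1507564800)
print(time.strftime('%Y-%m-%d %H:%M:%S', time_array))  # 2017-10-10 00:00:00 on a UTC+8 machine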
This request returns XML data
Use a regular expression to extract all bullet comment messages, with the matching pattern
<d p=".*?">(.*?)</d>
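As an illustration, here is a sketch applying that pattern to a fragment shaped like the XML the endpoint returns. The attribute values inside p="..." are made up for the example; only the element shape matters:

import re

# A made-up fragment in the <d p="...">text</d> shape of the danmu XML
sample = ('<d p="23.8,1,25,16777215,1507564800,0,abcdef12,123456789">first comment</d>'
          '<d p="42.0,1,25,16777215,1507564800,0,abcdef12,123456790">second comment</d>')
res = re.findall('<d p=".*?">(.*?)</d>', sample)
print(res)  # ['first comment', 'second comment']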
Concatenate the strings and save all bullet comments to a local file:
with open('content.txt', mode='w+', encoding='utf8') as f: f.write(content)
The reference code is as follows. It saves the bullet comments into one file per date, because there are far too many for a single file:
import requests
import re
import time
import os

"""
Scrape Bilibili video bullet comment information
"""

# 2043618 is the bullet comment ID of the video; this address returns the timestamp list
# https://www.bilibili.com/video/av1349282
url = 'https://comment.bilibili.com/rolldate,2043618'
# Obtain the bullet comment ID 2043618
video_id = url.split(',')[-1]
print(video_id)
# Obtain the JSON file
html = requests.get(url)
# print(html.json())
# Generate the timestamp list
time_list = [i['timestamp'] for i in html.json()]
# print(time_list)
# Danmu URL format: 'https://comment.bilibili.com/dmroll,timestamp,danmu ID'
# Since the total number of danmu is too large, each day's danmu is saved to its own file
os.makedirs('txt', exist_ok=True)  # make sure the output directory exists
for i in time_list:
    content = ''
    j = 'https://comment.bilibili.com/dmroll,{0},{1}'.format(i, video_id)
    print(j)
    text = requests.get(j).text
    # Match the danmu content
    res = re.findall('<d p=".*?">(.*?)</d>', text)
    # Convert the timestamp to a date format; the string must be converted to an integer first
    timeArray = time.localtime(int(i))
    date_time = time.strftime('%Y-%m-%d %H:%M:%S', timeArray)
    print(date_time)
    content += date_time + '\n'
    for k in res:
        content += k + '\n'
    content += '\n'
    file_path = 'txt/{}.txt'.format(time.strftime('%Y_%m_%d', timeArray))
    print(file_path)
    with open(file_path, mode='w+', encoding='utf8') as f:
        f.write(content)
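After a run, the txt directory contains one file per day, named like txt/2017_10_10.txt, with the date on the first line followed by that day's bullet comments.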
Final Effect
After that, you can run word segmentation on the comments to generate word clouds or perform sentiment analysis; we'll cover that properly when there is time, but a rough sketch follows...
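As a starting point, here is a minimal word-cloud sketch, assuming the third-party jieba and wordcloud packages (pip3 install jieba wordcloud) and one of the comment files produced above; the font path is a placeholder that you must point at a font capable of rendering Chinese:

import jieba
from wordcloud import WordCloud

# Read one day's saved comments (file name from the scraping script above)
with open('txt/2017_10_10.txt', encoding='utf8') as f:
    text = f.read()

# Segment the Chinese text into words and join them with spaces for WordCloud
words = ' '.join(jieba.lcut(text))

# font_path is a placeholder; WordCloud needs a Chinese-capable font file
wc = WordCloud(font_path='simhei.ttf', width=800, height=600, background_color='white')
wc.generate(words)
wc.to_file('wordcloud.png')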
Feel free to share your learning experience in the comments below, and thank you for your support of the Yana Tutorial.