
Writing a Python 3 Function to Scrape Bilibili Video Danmu

Environment needed:

A Bilibili account; you need to log in first, otherwise you cannot view the historical danmu records

A computer with internet access and a browser; I use Chrome

A Python 3 environment and the requests module; install it with the command below (using a mirror source is faster):

pip3 install requests -i http://pypi.douban.com/simple

Crawling steps: log in, open the video page to be scraped, then open the developer tools (the F12 shortcut in Chrome) and select the Network tab to monitor requests


Click to view the historical danmu, and capture the request that is sent



The number after rolldate is the danmu ID of the corresponding video; in the returned data, timestamp marks a date that has danmu, and new is the number of danmu on that date
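Picking the timestamps out of that response looks like this. A minimal sketch, parsing a hard-coded sample payload with the structure described above (the concrete values here are made-up placeholders, not real data from the endpoint):

```python
import json

# A sample of the JSON the rolldate endpoint returns; each entry has a
# 'timestamp' for a day that has danmu and 'new' for how many were posted.
# These values are placeholders for illustration only.
sample = json.loads("""
[
  {"timestamp": 1507564800, "new": 153},
  {"timestamp": 1507651200, "new": 87}
]
""")

# Collect just the timestamps, one per day of historical danmu.
time_list = [entry['timestamp'] for entry in sample]
print(time_list)
```

In the real script, `sample` is simply `requests.get(url).json()`.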


Select any day in the historical danmu to view it; this sends a new request

The URL has the form dmroll,timestamp,danmu ID and fetches the danmu for that date; 1507564800 corresponds to 2017/10/10 0:00:00
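You can verify that timestamp-to-date mapping yourself. A small sketch that pins the conversion to China Standard Time (UTC+8) so the result matches the article's 2017/10/10 regardless of the machine's local timezone:

```python
from datetime import datetime, timezone, timedelta

# Bilibili timestamps are in seconds; interpret 1507564800 in UTC+8
# (China Standard Time) to reproduce the date shown in the article.
cst = timezone(timedelta(hours=8))
date_str = datetime.fromtimestamp(1507564800, tz=cst).strftime("%Y-%m-%d %H:%M:%S")
print(date_str)  # 2017-10-10 00:00:00
```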



This request returns XML data


Use a regular expression to extract all the danmu messages, with the matching pattern

<d p=".*?">(.*?)</d>
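A quick check of that pattern on one sample `<d>` element (the `p` attribute carries metadata such as appearance time, mode, font size, color, and send timestamp; the element text is the comment itself — the attribute values below are illustrative placeholders):

```python
import re

# One <d> element as it appears in the danmu XML; attribute values are
# made-up placeholders in the real format.
xml = '<i><d p="23.826,1,25,16777215,1507564800,0,abc123,901234">前方高能</d></i>'

# Non-greedy match over the p attribute, then capture the element text.
res = re.findall('<d p=".*?">(.*?)</d>', xml)
print(res)  # ['前方高能']
```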

Concatenate the strings and save all the danmu to a local file

with open('content.txt', mode='w+', encoding='utf8') as f:
    f.write(content)

The reference code is as follows; it saves each date's danmu into its own file, because there are too many to keep in one...

import requests
import re
import time
import os
"""
  Scrape Bilibili video danmu information
"""
# 2043618 is the danmu ID of the video; this address returns the timestamp list
# https://www.bilibili.com/video/av1349282
url = 'https://comment.bilibili.com/rolldate,2043618'
# Obtain the danmu ID 2043618
video_id = url.split(',')[-1]
print(video_id)
# Fetch the JSON data
html = requests.get(url)
# print(html.json())
# Generate the timestamp list
time_list = [i['timestamp'] for i in html.json()]
# print(time_list)
# Danmu URL format: 'https://comment.bilibili.com/dmroll,timestamp,danmu ID'
# Since the total number of danmu is too large, each date is saved to its own file
os.makedirs('txt', exist_ok=True)
for i in time_list:
    content = ''
    j = 'https://comment.bilibili.com/dmroll,{0},{1}'.format(i, video_id)
    print(j)
    text = requests.get(j).text
    # Match the danmu content (capture the element text after the p attribute)
    res = re.findall('<d p=".*?">(.*?)</d>', text)
    # Convert the timestamp to a date; the string must be cast to an integer first
    timeArray = time.localtime(int(i))
    date_time = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
    print(date_time)
    content += date_time + '\n'
    for k in res:
        content += k + '\n'
    content += '\n'
    file_path = 'txt/{}.txt'.format(time.strftime("%Y_%m_%d", timeArray))
    print(file_path)
    with open(file_path, mode='w+', encoding='utf8') as f:
        f.write(content)

Final Effect



After that, you can do some word segmentation to generate word clouds, or perform sentiment analysis; more on that when there is time...
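As a tiny first step toward that analysis, you can count how often identical danmu repeat with the standard library alone (real Chinese word segmentation would need a library such as jieba; the sample comments below are placeholders):

```python
from collections import Counter

# Pretend these lines were read back from one of the saved txt files.
danmu = ['前方高能', '233333', '前方高能', '好耳机']

# Count duplicates; the most repeated comments often mark highlights.
counter = Counter(danmu)
print(counter.most_common(1))  # [('前方高能', 2)]
```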

