python-BeautifulSoup

Posted 2024-10-9 Updated 2024-10-10

By Administrator

7~9 min read

Note: when calling requests.post, parameters such as data and cookies are typically passed in as dictionaries.
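To see what those dictionaries turn into, the request can be built without sending it. This is a minimal sketch: the URL, field name, and cookie value are placeholders invented for illustration, and nothing is sent over the network.

```python
import requests

# Hypothetical endpoint and values, for illustration only.
url = "http://example.com/submit"
data = {"user_input": "hello"}
cookies = {"session": "abc123"}

# Build the request without sending it, to inspect what would go on the wire.
prepared = requests.Request("POST", url, data=data, cookies=cookies).prepare()

print(prepared.body)               # form-encoded body built from the data dict
print(prepared.headers["Cookie"])  # Cookie header built from the cookies dict
```

requests.post(url, data=data, cookies=cookies) accepts the same two dictionaries and sends the request directly.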

BeautifulSoup

A parsing tool that extracts data from a web page by using features such as its structure and attributes.

Usage: from bs4 import BeautifulSoup

Parsers

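Parser	Usage	Advantages
Python standard library	BeautifulSoup(markup, 'html.parser')	Moderate speed, tolerant of malformed documents
lxml HTML parser	BeautifulSoup(markup, 'lxml')	Fast, tolerant of malformed documents
lxml XML parser	BeautifulSoup(markup, 'xml')	Fast; the only parser that supports XML
html5lib	BeautifulSoup(markup, 'html5lib')	Parses the document the way a browser does and produces valid HTML5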

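The lxml parser can handle both HTML and XML and is the recommended choice. It is a separate package and must be installed before use.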

from bs4 import BeautifulSoup
a='<p>hello</p>'
soup=BeautifulSoup(a,'lxml') # initialize a BeautifulSoup object
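The snippet above assumes lxml is installed. As a minimal sketch, the hypothetical helper make_soup below tries 'lxml' first and falls back to the built-in 'html.parser' when bs4 raises FeatureNotFound. The helper name and the fallback order are assumptions made for illustration, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # lxml is not installed; use the parser from the standard library.
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p>hello</p>')
print(soup.p.text)
```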

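Once a BeautifulSoup object has been initialized (soup in this example), soup.title refers to the first title node in the parsed document, soup.p to the first p node, and so on. If the document contains no such node, the attribute is None.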
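A short sketch of this attribute-style access, using a made-up HTML string (the markup and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Demo</title></head>'
        '<body><p>hello</p><p>world</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

title_text = soup.title.string  # text of the <title> node
first_p = soup.p.text           # only the first <p> is returned
missing = soup.table            # no <table> in the document, so this is None

print(title_text, first_p, missing)
```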

The find method

soup.find(name, attrs, recursive, text, **kwargs)

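name: the tag name to search for, such as 'div'. To search for elements with a specific class, pass a dictionary such as {'class': 'some-class'} as attrs instead.

attrs: a dictionary of attributes to match, for example {'id': 'some-id'}.

recursive: a boolean that decides whether all descendants are searched recursively (True, the default) or only direct children (False).

text: matches elements by their text content. Beautiful Soup 4.4.0 and later call this argument string; text is the older name.

**kwargs: additional keyword arguments, treated as attribute filters, for example id='some-id' or class_='some-class'.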
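The parameters above can be sketched against a small made-up document. The HTML and variable names are assumptions for illustration, and the string keyword requires Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

html = ('<div id="main"><p class="intro">first</p>'
        '<section><p id="text">second</p></section></div>')
soup = BeautifulSoup(html, 'html.parser')

by_name = soup.find('p')                      # first <p> in the document
by_attrs = soup.find('p', {'id': 'text'})     # attrs passed as a dictionary
by_kwargs = soup.find('p', class_='intro')    # class_ avoids the Python keyword
by_string = soup.find('p', string='second')   # match on the text content

# recursive=False looks only at direct children of <div id="main">,
# so the <p id="text"> nested inside <section> is not found.
main = soup.find('div', id='main')
direct_only = main.find('p', {'id': 'text'}, recursive=False)

print(by_name.text, by_attrs.text, by_kwargs.text, by_string['id'], direct_only)
```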

Appending .text to the result returns the text content of the element that was found.

For example:
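One thing to watch for: find returns None when nothing matches, so calling .text on the result then raises AttributeError. A minimal sketch of a guarded lookup, using a made-up HTML string and illustrative variable names:

```python
from bs4 import BeautifulSoup

html = '<div><p id="text">  hello world  </p></div>'
soup = BeautifulSoup(html, 'html.parser')

node = soup.find('p', {'id': 'text'})
raw = node.text                      # text exactly as written, whitespace included
clean = node.get_text(strip=True)    # same text with surrounding whitespace removed

# No element has id="nope", so find returns None; check before using .text.
missing = soup.find('p', {'id': 'nope'})
fallback = missing.text if missing is not None else ''

print(repr(raw), repr(clean), repr(fallback))
```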

import requests
from bs4 import BeautifulSoup

# Fetch the start page and extract the text of <p id="text">
url="http://eci-2ze66zwnnawobs6kkbqw.cloudeci1.ichunqiu.com/start"
response = requests.get(url)
soup=BeautifulSoup(response.text,"html.parser")
text_content=soup.find('p',{'id':'text'}).text

# Submit the text back, reusing the session cookie from the first response
url1="http://eci-2ze66zwnnawobs6kkbqw.cloudeci1.ichunqiu.com/submit"
data={'user_input':text_content}
cookie={'session':response.cookies['session']}
print(data)
response1=requests.post(url1,data=data,cookies=cookie)
print(response1.text)

Taking Farewell My Concubine (霸王别姬) as an example:

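import requests
from bs4 import BeautifulSoup

url='https://ssr1.scrape.center/detail/1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Look up a <p> by its data-v-63864230 attribute and print its text
b=soup.find('p',{'data-v-63864230':""}).text
print(b)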

soup.find_all returns every matching element instead of only the first one.
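A minimal sketch of find_all on a made-up list (the HTML and variable names are assumptions for illustration). It takes the same filters as find and returns a list-like result that can be iterated:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="item">a</li><li class="item">b</li><li>c</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

all_texts = [li.text for li in soup.find_all('li')]                # every <li>
classed = [li.text for li in soup.find_all('li', class_='item')]   # only class="item"
first_two = soup.find_all('li', limit=2)                           # stop after two matches
no_match = soup.find_all('table')                                  # empty result, not None

print(all_texts, classed, len(first_two), len(no_match))
```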

Computer Fundamentals

License: CC BY 4.0