bs4 (full name: BeautifulSoup) is one of the libraries most commonly used when writing Python crawlers. Its main job is parsing HTML tags.

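Prelude: installing bs4

Install it from a mirror inside mainland China (the Tsinghua PyPI mirror) for a faster download. The package is published on PyPI as beautifulsoup4 and imported as bs4; the lxml parser used later in this article is a separate package.

pip install beautifulsoup4 lxml -i https://pypi.tuna.tsinghua.edu.cn/simple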

1. Initialization

from bs4 import BeautifulSoup
 
soup = BeautifulSoup("<html>A Html Text</html>", "html.parser")

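The constructor takes two arguments: the first is the HTML text to parse, the second names the parser to use. For HTML the usual choice is html.parser, which ships with the Python standard library, so nothing extra has to be installed.

If an HTML or XML document is malformed, different parsers may return different results.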

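Parser	Usage	Advantages
Python standard library	BeautifulSoup(html, "html.parser")	1. Built into Python 2. Decent speed 3. Lenient with malformed documents
lxml HTML	BeautifulSoup(html, "lxml")	1. Very fast 2. Lenient with malformed documents
lxml XML	BeautifulSoup(html, ["lxml", "xml"]) or BeautifulSoup(html, "xml")	1. Very fast 2. The only XML parser supported
html5lib	BeautifulSoup(html, "html5lib")	1. Most lenient 2. Parses pages the way a web browser does 3. Produces valid HTML5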
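The point about malformed markup can be checked directly. The sketch below feeds one broken fragment to each HTML parser from the table. The fragment and the helper name parse_with are illustrative choices, and a parser that is not installed is reported as missing rather than assumed to be present.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A stray closing </p> with no matching opening tag
broken = "<a></p>"


def parse_with(parser):
    """Return the repaired markup as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        return None


results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}

for name, markup in results.items():
    shown = markup if markup is not None else "(not installed)"
    print(f"{name:12} {shown}")
```

With html.parser the stray </p> is simply dropped and the result is <a></a>; lxml and html5lib usually wrap the fragment in html and body tags as well, so the three outputs differ.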

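Pretty-printed output

print(soup.prettify())  # prettify is a method, so the parentheses are required; it returns the indented markup as a string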
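A small sketch of what calling prettify gives back, using a made-up one-line fragment. Without the parentheses the expression only refers to the method itself and nothing is formatted.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# Calling the method returns a str with one tag or text node per line
pretty = soup.prettify()
print(pretty)

# Without parentheses this is just the bound method, not formatted markup
method = soup.prettify
print(callable(method))
```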

2. Basic usage

from bs4 import BeautifulSoup
 
# Build a sample HTML document
html_doc = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="title">
            <b>The Dormouse's story</b>
        </p>
        
        <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
        and they lived at the bottom of a well.</p>
        
        <p class="story">...</p>
    </body>
</html>
"""

2.1 Getting a tag

res = BeautifulSoup(html_doc, 'lxml')

# Attribute-style access returns the first matching tag, here the first <a>
print(res.a)

2.2 Getting the text inside a tag

print(res.a.text)

2.3 Getting all attributes of a tag

print(res.a.attrs)

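2.4 Getting the value of a specific attribute

# Both forms are equivalent; get returns None if the attribute is absent
print(res.a.attrs.get('href'))
print(res.a.get('href'))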
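The lookup styles differ mainly in how they treat a missing attribute. The sketch below uses a hypothetical tag with two classes (not part of the sample document) to show that, and to show that class comes back as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical tag for illustration; the second class "big" is not in the sample document
markup = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
a = BeautifulSoup(markup, "html.parser").a

href = a["href"]                    # dictionary-style access; raises KeyError if missing
missing = a.get("title")            # get returns None for a missing attribute
fallback = a.get("title", "n/a")    # or a default of your choice
classes = a["class"]                # multi-valued attribute, returned as a list

print(href, missing, fallback, classes)
```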

2.5 Iterating over a tag's direct children

for i in res.p.children:
    print(i)

2.6 Getting a tag's direct children as a list

print(res.p.contents)
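Both children and contents include text nodes, not only tags, which is easy to overlook. A minimal sketch on a shortened, made-up paragraph that keeps only the child tags:

```python
from bs4 import BeautifulSoup, Tag

html = "<p>Once <a id='link1'>Elsie</a> and <a id='link2'>Lacie</a></p>"
p = BeautifulSoup(html, "html.parser").p

# contents is a list: text, <a>, text, <a>
count = len(p.contents)

# children is an iterator over the same nodes; filter out the text nodes
child_tags = [child for child in p.children if isinstance(child, Tag)]
ids = [tag["id"] for tag in child_tags]

print(count, ids)
```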

2.7 Getting a tag's parent

print(res.p.parent)

2.8 Walking up through all ancestors to the top-level node

for i in res.p.parents:
    print(i)
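Printing every ancestor in full produces a lot of repeated markup. As a sketch, printing only the names makes the chain easier to read; the tiny document here is made up for the purpose.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>hi</b></p></body></html>", "html.parser")

# parents yields each ancestor in turn, ending with the BeautifulSoup object itself
names = [parent.name for parent in soup.b.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```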

3. Core bs4 methods

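3.1 find

find returns only the first tag that matches the given conditions. The result is a single Tag object, or None when nothing matches.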

3.1.1 Find by tag name (only the first match is returned)

print(res.find(name='p'))

3.1.2 Find a tag with a specific attribute (only the first match is returned)

print(res.find(name='a', id='link2'))

3.1.3 class is a reserved word in Python, so the keyword argument takes a trailing underscore: class_

print(res.find(name='p', class_='title'))

3.1.4 Pass an attrs dictionary to avoid the keyword conflict altogether

print(res.find(name='p', attrs={'class': 'title'}))
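Because find gives back None when nothing matches, chaining an attribute access onto the result can raise AttributeError. The sketch below wraps the lookup in a small helper; first_text is a hypothetical name, not part of bs4.

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>The Dormouse\'s story</b></p><p class="story">...</p>'
soup = BeautifulSoup(html, "html.parser")


def first_text(soup, name, **filters):
    """Return the stripped text of the first match, or None if there is no match."""
    tag = soup.find(name, **filters)
    return tag.get_text(strip=True) if tag is not None else None


title = first_text(soup, "p", class_="title")
nothing = first_text(soup, "p", id="title")  # no <p> carries this id

print(title, nothing)
```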

3.2 find_all

find_all returns every tag that matches the conditions. The result is a list.

3.2.1 Find all tags with a given name; the result is a list

print(res.find_all('a'))
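find_all accepts the same filters as find, plus a limit on the number of results. A minimal sketch on the three links from the sample document, collecting their href values:

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and class, then pull one attribute out of every match
links = soup.find_all("a", class_="sister")
hrefs = [a["href"] for a in links]

# limit stops the search after the given number of matches
first_two = soup.find_all("a", limit=2)

# No match gives an empty list rather than None
tables = soup.find_all("table")

print(hrefs, len(first_two), tables)
```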

3.3 select

select takes a CSS selector and returns a list of matching tags.

3.3.1 Find tags whose class includes title

print(res.select('.title'))

3.3.2 Find all b descendants of tags whose class includes title

print(res.select('.title b'))

3.3.3 Find the tag whose id equals link1

print(res.select('#link1'))
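A few more selector forms, sketched on a trimmed copy of the sample document. select_one is the single-result counterpart of select and returns None when nothing matches; the particular selectors are just examples.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Child combinator: <a class="sister"> directly inside <p class="story">
names = [a.get_text() for a in soup.select("p.story > a.sister")]

# select_one returns the first match instead of a list
second = soup.select_one("#link2")

# Attribute selector: href ending in "tillie"
tillie = soup.select('a[href$="tillie"]')

# No element has id="title", so this is None
missing = soup.select_one("#title")

print(names, second["href"], tillie[0]["id"], missing)
```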

4. Scraping the Douban movie rankings with bs4

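from bs4 import BeautifulSoup
import requests
import re

# Matches any four-digit number, used here to pick out the release year
YEAR_PATTERN = re.compile(r'\d{4}')


def main():
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    }

    # start= is the paging offset; it is left empty here, as in the original listing
    baseurl = "https://movie.douban.com/top250?start="

    response = requests.get(url=baseurl, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')

    movies = []

    # Each <li> under .grid_view holds one movie
    for li in soup.select('.grid_view li'):
        movie = {"title": "", "year": "", "score": 0, "num": 0}

        # The title is spread over the .title spans plus the .other span
        for item in li.select('.title'):
            movie['title'] += item.text.replace("\xa0", " ")
        for item in li.select('.other'):
            movie['title'] += item.text.replace("\xa0", " ")

        # Every four-digit match overwrites the previous one, so the last match wins
        for item in li.select(".bd p"):
            for match in YEAR_PATTERN.finditer(item.text):
                movie['year'] = match.group()

        for item in li.select(".rating_num"):
            movie['score'] = item.text

        # The last <span> inside .star holds the rating count followed by "人评价"
        movie['num'] = li.select(".star span")[-1].text.replace("人评价", "")

        movies.append(movie)

    # The source listing breaks off at the append; printing the records is an assumed ending
    for movie in movies:
        print(movie)


if __name__ == "__main__":
    main()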
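The base URL ends in start=, which points to paging by offset. As a sketch, and assuming the list shows 25 movies per page for 250 in total (an assumption, not stated in the article), the page URLs could be generated like this. page_urls is a hypothetical helper name.

```python
BASE_URL = "https://movie.douban.com/top250?start="


def page_urls(total=250, page_size=25):
    """Build one URL per page; total and page_size are assumed values."""
    return [f"{BASE_URL}{offset}" for offset in range(0, total, page_size)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each URL can then be fetched and parsed in the same way as the single page in the script, ideally with a short pause between requests.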

5. Collecting every hyperlink from the news center of the Energy College website (cqny.edu.cn) with bs4

Sample code:

from bs4 import BeautifulSoup
import requests

host = "https://www.cqny.edu.cn/xwzx/"

soup = BeautifulSoup(requests.get(host).text, 'html.parser')
for item in soup.find_all('a'):
    print(item.get('href'))
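The loop prints href values exactly as they appear in the page, so relative paths, duplicates and None (for anchors without an href) all show up. The sketch below cleans that up with urljoin from the standard library. It runs on a made-up HTML snippet rather than the live site, and the paths in it are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.cqny.edu.cn/xwzx/"

# Hypothetical markup standing in for the downloaded page
sample = """
<a href="news/1.htm">Relative link</a>
<a href="/index.htm">Root-relative link</a>
<a href="https://example.com/x">External link</a>
<a name="top">Anchor without href</a>
<a href="news/1.htm">Duplicate</a>
<a href="javascript:void(0)">Script link</a>
"""


def extract_links(html, base_url):
    """Return absolute http(s) links in page order, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute at all
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])
        if url.startswith(("http://", "https://")) and url not in links:
            links.append(url)
    return links


links = extract_links(sample, BASE_URL)
for link in links:
    print(link)
```

On the real page, the html argument would be the text of the response returned by requests.get.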

Original article by guozi. If you repost it, please credit the source: https://www.sudun.com/ask/78983.html
