Scrapling

🚀 Scrapling

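Scrapling is the first adaptive web scraping library: it learns from website changes and evolves with them. When other libraries break after a site updates its structure, Scrapling automatically relocates your elements and keeps your scrapers running, so you can stop fighting anti-bot systems and stop rewriting selectors after every site update.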

🚀 Quick Start

Basic Usage

from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher
from scrapling.fetchers import FetcherSession, StealthySession, DynamicSession

# HTTP requests with session support
with FetcherSession(impersonate='chrome') as session:  # Use latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text')

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text')

# Advanced stealth mode (Keep the browser open until you finish)
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a')

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a')

# Full browser automation (Keep the browser open until you finish)
with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()')  # XPath selector if you prefer it

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text')

Advanced Parsing & Navigation

from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')  # CSS selector
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...
# Find element by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
first_quote = page.css_first('.quote')
quote_text = first_quote.css('.text::text')
quote_text = page.css('.quote').css_first('.text::text')  # Chained selectors
quote_text = page.css_first('.quote .text').text  # Using `css_first` is faster than `css` if you want the first element
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()

You can also use the parser directly, without fetching a website, as shown below:

from scrapling.parser import Selector

page = Selector("...")

It works exactly the same way!

Async Session Management Examples

import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async def main():
    # `FetcherSession` is context-aware and can work in both sync/async patterns
    async with FetcherSession(http3=True) as session:
        # Inside `async with`, requests are awaited
        page1 = await session.get('https://quotes.toscrape.com/')
        page2 = await session.get('https://quotes.toscrape.com/', impersonate='firefox135')

    # Async session usage
    async with AsyncStealthySession(max_pages=2) as session:
        tasks = []
        urls = ['https://example.com/page1', 'https://example.com/page2']

        for url in urls:
            task = session.fetch(url)
            tasks.append(task)

        print(session.get_pool_stats())  # Optional - The status of the browser tabs pool (busy/free/error)
        results = await asyncio.gather(*tasks)
        print(session.get_pool_stats())

asyncio.run(main())
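The `max_pages=2` argument in the example above suggests a bounded pool of browser tabs. As a rough mental model only, and not Scrapling's actual implementation, a bounded pool can be sketched with an `asyncio.Semaphore`. The names `fake_fetch`, `crawl`, and the `stats` dictionary are hypothetical, and the short sleep stands in for a real page load.

```python
import asyncio


async def fake_fetch(url: str, pool: asyncio.Semaphore, stats: dict) -> str:
    """Hypothetical stand-in for a page fetch; waits for a free slot in the pool."""
    async with pool:
        stats["busy"] += 1
        stats["peak"] = max(stats["peak"], stats["busy"])
        await asyncio.sleep(0.01)  # simulated page load
        stats["busy"] -= 1
        return f"fetched {url}"


async def crawl(urls: list, max_pages: int = 2) -> tuple:
    """Run all fetches concurrently while using at most `max_pages` slots."""
    pool = asyncio.Semaphore(max_pages)
    stats = {"busy": 0, "peak": 0}
    results = await asyncio.gather(*(fake_fetch(u, pool, stats) for u in urls))
    return list(results), stats


if __name__ == "__main__":
    pages, stats = asyncio.run(crawl([f"https://example.com/page{i}" for i in range(1, 6)]))
    print(pages)
    print(stats)
```

All five tasks are scheduled at once, but the semaphore lets only two of them hold a slot at any moment, which is why `asyncio.gather` still returns the results in input order.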

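✨ Key Features

Advanced Website Fetching with Session Support

- HTTP Requests: Fast and stealthy HTTP requests with the Fetcher class. It can impersonate browsers' TLS fingerprints and headers, and use HTTP3.
- Dynamic Loading: Fetch dynamic websites with full browser automation through the DynamicFetcher class, which supports Playwright's Chromium, real Chrome, and a custom stealth mode.
- Anti-bot Bypass: StealthyFetcher provides advanced stealth capabilities using a modified version of Firefox and fingerprint spoofing. It can bypass all levels of Cloudflare's Turnstile with automation.
- Session Management: The FetcherSession, StealthySession, and DynamicSession classes provide persistent sessions for cookie and state management across requests.
- Async Support: Complete async support across all fetchers, with dedicated async session classes.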

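Adaptive Scraping & AI Integration

- 🔄 Smart Element Tracking: Relocate elements after website changes using intelligent similarity algorithms.
- 🎯 Smart Flexible Selection: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
- 🔍 Find Similar Elements: Automatically locate elements similar to the ones you have already found.
- 🤖 MCP Server for Use with AI: Built-in MCP server for AI-assisted web scraping and data extraction. The MCP server has custom capabilities that use Scrapling to extract the targeted content before passing it to the AI (Claude/Cursor/etc.), which speeds up operations and reduces costs by minimizing token usage. (demo video)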
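To make the idea of similarity-based relocation concrete, here is a toy sketch using only the standard library. It is an illustration under simplifying assumptions, not Scrapling's algorithm: elements are plain dictionaries, and the `similarity` and `relocate` helpers, the four equally weighted features, and the 0.5 threshold are all hypothetical choices.

```python
from difflib import SequenceMatcher


def similarity(saved: dict, candidate: dict) -> float:
    """Score how closely a candidate matches a saved element fingerprint (0 to 1)."""
    tag_score = 1.0 if saved["tag"] == candidate["tag"] else 0.0
    # Compare class names as sets so their order does not matter.
    saved_cls, cand_cls = set(saved["classes"]), set(candidate["classes"])
    union = saved_cls | cand_cls
    class_score = len(saved_cls & cand_cls) / len(union) if union else 1.0
    text_score = SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    path_score = SequenceMatcher(None, saved["path"], candidate["path"]).ratio()
    return (tag_score + class_score + text_score + path_score) / 4


def relocate(saved: dict, candidates: list, threshold: float = 0.5):
    """Return the best-scoring candidate, or None if nothing is similar enough."""
    best = max(candidates, key=lambda c: similarity(saved, c), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best


# Fingerprint saved before a (made-up) redesign.
saved = {"tag": "div", "classes": ["product", "card"],
         "text": "Blue Widget $9.99", "path": "html/body/main/div"}
# Elements found on the page after the redesign.
nav = {"tag": "nav", "classes": ["menu"],
       "text": "Home About", "path": "html/body/nav"}
moved = {"tag": "div", "classes": ["product-item", "card"],
         "text": "Blue Widget $9.99", "path": "html/body/section/div"}

match = relocate(saved, [nav, moved])
print(match["path"] if match else "no match")
```

The renamed class and the new parent lower the score of `moved` only slightly, so it still wins over the unrelated `nav` element. A real implementation would have to weigh many more signals.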

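High-Performance & Battle-Tested Architecture

- 🚀 Lightning Fast: Optimized performance that outperforms most Python scraping libraries.
- 🔋 Memory Efficient: Optimized data structures and lazy loading for a minimal memory footprint.
- ⚡ Fast JSON Serialization: 10x faster than the standard library.
- 🏗️ Battle-Tested: Scrapling has 92% test coverage and full type-hint coverage, and hundreds of web scrapers have used it daily over the past year.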

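Developer/Web Scraper Friendly Experience

- 🎯 Interactive Web Scraping Shell: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools that speed up scraping script development, such as converting curl requests to Scrapling requests and viewing request results in your browser.
- 🚀 Use It Directly from the Terminal: Optionally, you can use Scrapling to scrape a URL without writing any code!
- 🛠️ Rich Navigation API: Advanced DOM traversal with parent, sibling, and child navigation methods.
- 🧬 Enhanced Text Processing: Built-in regex, cleaning methods, and optimized string operations.
- 📝 Auto Selector Generation: Generate robust CSS/XPath selectors for any element.
- 🔌 Familiar API: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
- 📘 Complete Type Coverage: Full type hints for excellent IDE support and code completion.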

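New Session Architecture

Scrapling 0.3 introduces a completely new session system:

- Persistent Sessions: Maintain cookies, headers, and authentication across multiple requests.
- Automatic Session Management: Smart session lifecycle handling with proper cleanup.
- Session Inheritance: All fetchers support both one-off requests and persistent session usage.
- Concurrent Session Support: Run multiple isolated sessions simultaneously.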
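The difference between a persistent session and one-off requests can be sketched with a toy in-memory model. Everything here is hypothetical and exists only for illustration: `FakeSite` plays the server, `ToySession` is a minimal context manager that keeps cookies between requests and clears them on exit, and `one_off_get` opens a fresh session per request. None of these are Scrapling classes.

```python
class FakeSite:
    """Hypothetical in-memory 'server': sets a cookie on login and checks it later."""

    def handle(self, path: str, cookies: dict) -> tuple:
        if path == "/login":
            return "logged in", {"sid": "abc123"}
        if path == "/profile":
            ok = cookies.get("sid") == "abc123"
            return ("profile page" if ok else "please log in"), {}
        return "not found", {}


class ToySession:
    """Keeps cookies between requests and cleans up when the block exits."""

    def __init__(self, site: FakeSite):
        self.site = site
        self.cookies = {}
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.cookies.clear()  # lifecycle cleanup
        self.closed = True
        return False  # do not swallow exceptions

    def get(self, path: str) -> str:
        body, new_cookies = self.site.handle(path, self.cookies)
        self.cookies.update(new_cookies)  # state carried to the next request
        return body


def one_off_get(site: FakeSite, path: str) -> str:
    """One-off style: a fresh session per request, so no state is shared."""
    with ToySession(site) as session:
        return session.get(path)


site = FakeSite()
with ToySession(site) as session:
    session.get("/login")
    print(session.get("/profile"))      # the cookie from /login is still present

one_off_get(site, "/login")
print(one_off_get(site, "/profile"))    # a new session has no cookie
```

Two `ToySession` objects never share a cookie dictionary, which is the same isolation property the list above describes for concurrent sessions.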

📦 Installation

Scrapling requires Python 3.10 or higher:

pip install scrapling

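Starting with v0.3.2, this installation only includes the parser engine and its dependencies, without any fetchers or command-line dependencies.

Optional Dependencies

If you are going to use any of the extra features below, the fetchers, or their classes, install the fetchers' dependencies first and then install the browser dependencies with the following commands: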
```
pip install "scrapling[fetchers]"

scrapling install
```
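This downloads all browsers along with their system dependencies and fingerprint-manipulation dependencies.
Extra features:
- Install the MCP server feature: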
```
pip install "scrapling[ai]"
```
- Install the shell features (the web scraping shell and the extract command):
```
pip install "scrapling[shell]"
```
- Install everything:
```
pip install "scrapling[all]"
```
Don't forget to install the browser dependencies with scrapling install after installing any of these extras (if you haven't already).
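Before running a scraper, it can help to confirm that the interpreter meets the stated Python 3.10 requirement and that the package is actually installed. The helper below is a hypothetical convenience, not part of Scrapling; it relies only on the standard library's `importlib.metadata` and does not check whether any extras or browsers are present.

```python
import sys
from importlib import metadata


def scrapling_environment() -> dict:
    """Report the Python version check and the installed scrapling version, if any."""
    try:
        version = metadata.version("scrapling")
    except metadata.PackageNotFoundError:
        version = None  # package is not installed in this environment
    return {
        "python_ok": sys.version_info >= (3, 10),
        "scrapling_version": version,
    }


if __name__ == "__main__":
    print(scrapling_environment())
```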

💻 Usage Examples

Basic Usage

from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher

# Enable adaptive mode
StealthyFetcher.adaptive = True
# Fetch a website's source under the radar!
page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
print(page.status)  # 200
# Scrape data that survives website design changes!
products = page.css('.product', auto_save=True)
# Later, if the website structure changes, pass `adaptive=True`
products = page.css('.product', adaptive=True)
# Scrapling still finds them!

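Advanced Usage

# Advanced scenarios: combine the different fetcher and session classes for more complex scraping, with async, stealth, and dynamic-loading modes plus session management and element selection.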
from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher
from scrapling.fetchers import FetcherSession, StealthySession, DynamicSession

# HTTP requests with session support
with FetcherSession(impersonate='chrome') as session:  # Use latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text')

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text')

# Advanced stealth mode (Keep the browser open until you finish)
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a')

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a')

# Full browser automation (Keep the browser open until you finish)
with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()')  # XPath selector if you prefer it

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text')

📚 Documentation

Scrapling v0.3 includes a powerful command-line interface:

# Launch the interactive web scraping shell
scrapling shell

# Extract a page to a file directly, without programming (extracts the content inside the `body` tag by default)
# If the output file ends with `.txt`, the text content of the target is extracted.
# If it ends with `.md`, the output is a Markdown representation of the HTML content, and `.html` gives the HTML content directly.
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
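The comments on the extract commands describe how the output file's extension selects the output format. That rule can be written down as a small lookup. The `output_format` helper and its behavior for unknown extensions are assumptions made for illustration; the CLI itself may handle other extensions differently.

```python
from pathlib import Path

# Extension-to-format mapping as described in the CLI comments;
# the dictionary and helper names are hypothetical.
FORMATS = {
    ".txt": "text content",
    ".md": "markdown",
    ".html": "raw html",
}


def output_format(filename: str) -> str:
    """Return the format implied by the output file's extension."""
    suffix = Path(filename).suffix
    if suffix not in FORMATS:
        # Assumption: this sketch rejects anything outside the three documented extensions.
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return FORMATS[suffix]


if __name__ == "__main__":
    for name in ("content.md", "content.txt", "captchas.html"):
        print(name, "->", output_format(name))
```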

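⚠️ Important Note

There are many additional features, such as the MCP server and the interactive web scraping shell, but we want to keep this page concise. Please check out the full documentation.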

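🔧 Technical Details

Text Extraction Speed Test (5000 nested elements)

| # | Library | Time (ms) | vs Scrapling |
|---|---------|-----------|--------------|
| 1 | Scrapling | 1.92 | 1.0x |
| 2 | Parsel/Scrapy | 1.99 | 1.036x |
| 3 | Raw Lxml | 2.33 | 1.214x |
| 4 | PyQuery | 20.61 | ~11x |
| 5 | Selectolax | 80.65 | ~42x |
| 6 | BS4 with Lxml | 1283.21 | ~668x |
| 7 | MechanicalSoup | 1304.57 | ~679x |
| 8 | BS4 with html5lib | 3331.96 | ~1735x |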

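Element Similarity & Text Search Performance

Scrapling's adaptive element finding significantly outperforms the alternatives:

| Library | Time (ms) | vs Scrapling |
|---------|-----------|--------------|
| Scrapling | 1.87 | 1.0x |
| AutoScraper | 10.24 | 5.476x |

All benchmarks are averages of 100+ runs. See benchmarks.py for the methodology.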
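The "vs Scrapling" column is each library's average time divided by Scrapling's. The sketch below shows one way to average a callable over many runs with `timeit` and to compute that ratio. It is a generic illustration, not the contents of benchmarks.py; `average_ms` and `relative_to` are hypothetical helper names.

```python
import timeit
from statistics import mean


def average_ms(func, runs: int = 100) -> float:
    """Average wall-clock time of `func` over `runs` single calls, in milliseconds."""
    timings = timeit.repeat(func, number=1, repeat=runs)
    return mean(timings) * 1000


def relative_to(baseline_ms: float, other_ms: float) -> float:
    """How many times slower `other_ms` is than `baseline_ms`, rounded to 3 places."""
    return round(other_ms / baseline_ms, 3)


if __name__ == "__main__":
    # Ratios recomputed from times reported in the tables above.
    print(relative_to(1.92, 1.99))   # Parsel/Scrapy vs Scrapling
    print(relative_to(1.87, 10.24))  # AutoScraper vs Scrapling
    # Timing a trivial stand-in workload; real numbers depend on the machine.
    print(average_ms(lambda: sum(range(1000)), runs=100))
```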

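📄 License

This project is licensed under the BSD-3-Clause License.

Acknowledgments

This project includes code adapted from:

- Parsel (BSD License): used for the translator submodule

Thanks and References

- Daijro's brilliant work on BrowserForge and Camoufox
- Vinyzu's work on Botright
- brotector for browser detection bypass techniques
- fakebrowser for fingerprinting research
- rebrowser-patches for stealth improvements

Designed and crafted with care by Karim Shoair.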

Posted by system on 2025-09-18 02:54
