Scrapling is the first adaptive web scraping library: it learns from website changes and evolves with them. While other libraries break when a site's structure is updated, Scrapling automatically relocates elements and keeps your scraper running, so you can stop fighting anti-bot systems and stop rewriting selectors after every site update.
from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher
from scrapling.fetchers import FetcherSession, StealthySession, DynamicSession

# HTTP requests with session support
with FetcherSession(impersonate='chrome') as session:  # Use the latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text')

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text')

# Advanced stealth mode (keeps the browser open until you finish)
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a')

# Or use the one-off request style, which opens the browser for this request, then closes it when finished
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a')

# Full browser automation (keeps the browser open until you finish)
with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()')  # XPath selector, if you prefer it

# Or use the one-off request style, which opens the browser for this request, then closes it when finished
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text')
from scrapling.fetchers import Fetcher
# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')
# Get quotes with multiple selection methods
quotes = page.css('.quote') # CSS selector
quotes = page.xpath('//div[@class="quote"]') # XPath
quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # and so on...
# Find element by text content
quotes = page.find_by_text('quote', tag='div')
# Advanced navigation
first_quote = page.css_first('.quote')
quote_text = first_quote.css('.text::text')
quote_text = page.css('.quote').css_first('.text::text') # Chained selectors
quote_text = page.css_first('.quote .text').text # Using `css_first` is faster than `css` if you want the first element
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent
# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
You can also use the parser directly, without fetching a website at all:
from scrapling.parser import Selector
page = Selector("...")
It works exactly the same way!
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and works in both sync and async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# Async session usage
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']
    for url in urls:
        task = session.fetch(url)
        tasks.append(task)
    print(session.get_pool_stats())  # Optional - the status of the browser tab pool (busy/free/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())
Scrapling 0.3 introduced a brand-new session system. The Fetcher class makes fast, stealthy HTTP requests; it can impersonate browsers' TLS fingerprints and headers, and can use HTTP/3. The DynamicFetcher class provides full browser automation for scraping dynamic websites. StealthyFetcher adds advanced stealth capabilities, using a modified Firefox build and fingerprint spoofing, and can automatically bypass all levels of Cloudflare's Turnstile. The FetcherSession, StealthySession, and DynamicSession classes provide persistent sessions for cookie and state management across requests.
Scrapling requires Python 3.10 or newer:
pip install scrapling
As of v0.3.2, this installs only the parsing engine and its dependencies, without any fetchers or command-line dependencies.
If you plan to use any of the extras, fetchers, or their classes below, install the fetcher dependencies first, then install the browser dependencies with:
pip install "scrapling[fetchers]"
scrapling install
This downloads all browsers along with their system dependencies and fingerprint-manipulation dependencies.
Extras:
- AI features (the MCP server): `pip install "scrapling[ai]"`
- Shell features (the interactive shell and the `extract` command): `pip install "scrapling[shell]"`
- Everything: `pip install "scrapling[all]"`
After installing any of these extras, don't forget to install the browser dependencies with scrapling install (if you haven't already).
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher

# Enable adaptive mode
StealthyFetcher.adaptive = True
# Fetch the website's source in stealth mode!
page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
print(page.status)  # 200
# Scrape data that keeps working after website design changes!
products = page.css('.product', auto_save=True)
# Later, if the website structure changes, pass `adaptive=True`
products = page.css('.product', adaptive=True)
# and Scrapling still finds them!
Scrapling v0.3 includes a powerful command-line interface:
# Launch the interactive web scraping shell
scrapling shell

# Extract a page straight to a file, no programming needed (extracts the content of the `body` tag by default)
# If the output file ends with `.txt`, the text content of the target is extracted.
# If it ends with `.md`, you get a Markdown representation of the HTML content; `.html` gives the raw HTML.
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
⚠️ Important note
There are many more features, such as the MCP server and the interactive web scraping shell, but we want to keep this page concise. See the full documentation here.
| # | Library | Time (ms) | vs Scrapling |
|---|---|---|---|
| 1 | Scrapling | 1.92 | 1.0x |
| 2 | Parsel/Scrapy | 1.99 | 1.036x |
| 3 | Raw Lxml | 2.33 | 1.214x |
| 4 | PyQuery | 20.61 | ~11x |
| 5 | Selectolax | 80.65 | ~42x |
| 6 | BS4 with Lxml | 1283.21 | ~668x |
| 7 | MechanicalSoup | 1304.57 | ~679x |
| 8 | BS4 with html5lib | 3331.96 | ~1735x |
Scrapling's adaptive element finding significantly outperforms the alternatives:
| Library | Time (ms) | vs Scrapling |
|---|---|---|
| Scrapling | 1.87 | 1.0x |
| AutoScraper | 10.24 | 5.476x |
All benchmarks are averages of 100+ runs. See benchmarks.py for the methodology.
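The averaging approach can be illustrated with a small timing harness. This is a hypothetical sketch, not the actual benchmarks.py; the `benchmark` helper and the toy workload are made up for illustration:

```python
import statistics
import time


def benchmark(fn, runs=100):
    """Return the mean wall-clock time of `fn` over `runs` executions, in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples)


# Toy workload standing in for a parse-and-select pass over a document
doc = "<span class='text'>quote</span>" * 1000
avg_ms = benchmark(lambda: doc.count("span"))
print(f"{avg_ms:.4f} ms")
```

Averaging over many runs smooths out scheduler noise and one-time costs such as caching, which is why single-run timings are not comparable across libraries.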
This project is licensed under the BSD-3-Clause license.
This project includes code adapted from the following projects: