Markdown Web Crawl


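🚀 MD Website Crawler (built on MCP)

This project is a Python web crawler built on the Model Context Protocol (MCP, https://modelcontextprotocol.io/introduction). It extracts website content and saves it to disk as Markdown files.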

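🚀 Quick Start

The crawler is a Python-based MCP web crawler that extracts website content and saves it. Install it, optionally set the output directory, then register it with Claude as described in the sections that follow.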

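✨ Key Features

Extracts website content and saves it as Markdown files
Maps website structure and links
Processes multiple URLs in a batch
Configurable output directory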
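
To illustrate what mapping a site's links can involve, here is a minimal standard-library sketch. It is not the project's actual implementation: the names `LinkCollector` and `collect_links` are hypothetical, and it assumes the page HTML has already been fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects absolute http(s) link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop #fragments.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        # Skip non-web schemes (mailto:, javascript:, ...) and duplicates.
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def collect_links(html, base_url):
    """Return the unique absolute links found in `html`, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


sample = (
    '<a href="/docs">Docs</a> '
    '<a href="about.html#team">About</a> '
    '<a href="mailto:x@example.com">Mail</a>'
)
print(collect_links(sample, "https://example.com/index.html"))
```

Resolving every link against the page URL and stripping fragments keeps the resulting map free of duplicate entries that point to the same page.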

📦 Installation

Clone the repository:

git clone https://github.com/yourusername/webcrawler.git
cd webcrawler

Install the dependencies:

pip install -r requirements.txt

Optional: configure environment variables:

export OUTPUT_PATH=./output  # set your preferred output directory

Output

Crawled content is saved as Markdown files in the configured output directory.
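
The source does not specify how output files are named. As a rough sketch only, one way to derive a file path inside the output directory from a URL might look like this; the function name `markdown_path_for` and the slug scheme are assumptions, not the project's documented behavior.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def markdown_path_for(url, output_dir="./output"):
    """Build a filesystem-safe .md path for `url` under `output_dir` (hypothetical scheme)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Replace every run of characters that is unsafe in file names with "_".
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", raw) or "index"
    return Path(output_dir) / f"{slug}.md"


print(markdown_path_for("https://example.com/docs/intro"))
```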

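Configuration

The server is configured through environment variables:

Variable	Description
`OUTPUT_PATH`	Default output directory (default: ./output)
`MAX_CONCURRENT_REQUESTS`	Maximum number of parallel requests (default: 5)
`REQUEST_TIMEOUT`	Request timeout in seconds (default: 30)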
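
As an illustration of how these variables and their documented defaults could be read in Python, here is a minimal sketch. The `CrawlerConfig` class and `load_config` function are hypothetical names introduced for this example; only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerConfig:
    output_path: str
    max_concurrent_requests: int
    request_timeout: float


def load_config(env=None):
    """Read crawler settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return CrawlerConfig(
        output_path=env.get("OUTPUT_PATH", "./output"),
        # int()/float() raise ValueError on malformed values, failing early.
        max_concurrent_requests=int(env.get("MAX_CONCURRENT_REQUESTS", "5")),
        request_timeout=float(env.get("REQUEST_TIMEOUT", "30")),
    )


print(load_config({"MAX_CONCURRENT_REQUESTS": "10"}))
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.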

Setup with Claude

Install the server through FastMCP: fastmcp install server.py

Or run it directly with custom settings:

{
  "Crawl Server": {
    "command": "fastmcp",
    "args": [
      "run",
      "/Users/mm22/Dev_Projekte/servers-main/src/Webcrawler/server.py"
    ],
    "env": {
      "OUTPUT_PATH": "/Users/user/Webcrawl"
    }
  }
}

📚 Documentation

Development Guide

Live development

fastmcp dev server.py --with-editable .

Debugging

The MCP Inspector (https://modelcontextprotocol.io/docs/tools/inspector) is the recommended tool for debugging.

💻 Usage Examples

Basic usage

mcp call extract_content --url "https://example.com" --output_path "example.md"

Advanced usage

mcp call scan_linked_content --url "https://example.com" | \
mcp call create_index --content_map - --output_path "index.md"
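
The source does not describe the format of the content map that `scan_linked_content` passes to `create_index`. Purely as a sketch, assuming the map is a dictionary from URL to page title, an index file could be generated like this (`build_index` is a hypothetical helper, not the server's actual tool):

```python
def build_index(content_map, title="Index"):
    """Render a Markdown index from a {url: page_title} mapping (assumed shape)."""
    lines = [f"# {title}", ""]
    for url, page_title in sorted(content_map.items()):
        # Fall back to the URL itself when a page has no title.
        lines.append(f"- [{page_title or url}]({url})")
    return "\n".join(lines) + "\n"


print(build_index({
    "https://example.com/guide": "Guide",
    "https://example.com/about": "About",
}))
```

Sorting by URL keeps the generated index stable between runs, so repeated crawls only change the file when the site itself changes.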

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📄 License

This project is open source under the MIT License. See the LICENSE file for details.

Dependencies

Python 3.7+
FastMCP (uv pip install fastmcp)
Other dependencies listed in requirements.txt
