An open-source, intelligent AI web scraper

ScrapeGraphAI is a Python web-scraping library that uses LLMs and direct graph logic to build scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).

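Features:

Three main scraping pipelines are available for extracting information from a website (or a local file):

SmartScraperGraph: a single-page scraper that needs only a user prompt and an input source;
SearchGraph: a multi-page scraper that extracts information from the top n results of a search engine;
SpeechGraph: a single-page scraper that extracts information from a website and generates an audio file.

There is also an optimized scraper:

SmartScraperMultiGraph: scrapes multiple pages at once from a single prompt.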

Installation:

pip install scrapegraphai

Note: installing the library in a virtual environment is recommended, to avoid conflicts with other libraries.
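One way to follow this recommendation from Python itself is the standard library's venv module. The sketch below is only an illustration: create_env and the ".venv" directory name are hypothetical choices, not part of ScrapeGraphAI.

```python
import venv
from pathlib import Path


def create_env(env_dir=".venv", with_pip=True):
    """Create a virtual environment at env_dir and return its path.

    Hypothetical helper for illustration; not part of ScrapeGraphAI.
    """
    path = Path(env_dir)
    # venv.create builds the directory layout and writes pyvenv.cfg;
    # with_pip=True also bootstraps pip into the new environment.
    venv.create(path, with_pip=with_pip)
    return path
```

After calling create_env(), activate the environment (for example, source .venv/bin/activate on Linux or macOS) and run pip install scrapegraphai inside it.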

SpeechGraph with OpenAI

from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "audio_summary.mp3",
}

# ************************************************
# Create the SpeechGraph instance and run it
# ************************************************
speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)

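The official example uses the SpeechGraph class from the ScrapeGraphAI library to create a scraping instance and generate an audio summary. Step by step:

Import the SpeechGraph class: SpeechGraph is imported from the scrapegraphai.graphs module and is used to create the SpeechGraph instance.
graph_config: a dictionary named graph_config holds the configuration SpeechGraph needs, including:

"llm": the language model settings, namely the OpenAI API key and the model name.
"tts_model": the text-to-speech settings, namely the OpenAI API key, the model name, and the voice.
"output_path": the path where the audio summary is written.

Create the SpeechGraph instance: an instance is created from the SpeechGraph class with the following arguments:

prompt: the prompt text describing the audio summary to generate.
source: the URL of the website, or the path of the local file, to scrape.
config: the graph_config dictionary defined earlier.

Run SpeechGraph: calling the instance's run() method runs the scraping process, and the return value is assigned to the result variable.

Print the result: print() outputs the scraping result stored in result.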
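The example writes the placeholder string "OPENAI_API_KEY" in two places. A minimal sketch of one way to build the same dictionary from an environment variable instead is shown here. build_speech_config and its parameters are hypothetical names introduced for illustration, not part of ScrapeGraphAI; the dictionary keys and values are copied from the example above.

```python
import os


def build_speech_config(output_path="audio_summary.mp3", env_var="OPENAI_API_KEY"):
    """Build a SpeechGraph-style config dict, reading the key from the environment.

    Hypothetical helper for illustration; not part of ScrapeGraphAI.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message rather than sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        # Language model settings (same keys as the example above).
        "llm": {"api_key": api_key, "model": "gpt-3.5-turbo"},
        # Text-to-speech settings.
        "tts_model": {"api_key": api_key, "model": "tts-1", "voice": "alloy"},
        # Where the audio summary is written.
        "output_path": output_path,
    }
```

The returned dictionary can be passed as the config argument of SpeechGraph in place of the hand-written graph_config, which keeps the key out of the source file.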

SmartScraperGraph with Ollama

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

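This example uses the SmartScraperGraph class from the ScrapeGraphAI library to create a scraping instance and extract the projects and their descriptions from the web page. Step by step:

Import the SmartScraperGraph class: SmartScraperGraph is imported from the scrapegraphai.graphs module and is used to create the SmartScraperGraph instance.
graph_config: a dictionary named graph_config holds the configuration SmartScraperGraph needs, including: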