While building LLM applications recently, I ran into a key pain point: results from conventional search engines are often cluttered with ads and low-quality content, which seriously hurts the accuracy and rigor of model-generated answers. After repeated testing and comparison, I settled on a DuckDuckGo + Tavily combination, which preserves privacy while markedly improving retrieval quality.
DuckDuckGo is a privacy-first search engine whose no-tracking design suits AI applications that issue large volumes of network requests. Tavily is a newer search engine built specifically for AI workloads: it automatically filters out low-quality pages and returns structured data directly. Used together, DuckDuckGo provides breadth of coverage while Tavily digs deep, and the two complement each other well.
This approach is a particularly good fit for scenarios such as retrieval-augmented question answering, automated research reports, fact checking, and industry monitoring.
First, install the required Python libraries. I recommend isolating dependencies in a virtual environment:
```bash
python -m venv search_env
source search_env/bin/activate   # Linux/macOS
search_env\Scripts\activate      # Windows
pip install duckduckgo-search tavily-python python-dotenv
```
Create a `.env` file to store your API key:

```env
TAVILY_API_KEY=your_api_key_here
```
DuckDuckGo's Python library is simple to use. Note that the old `ddg` helper function has been replaced by the `DDGS` class in recent duckduckgo-search releases:

```python
from duckduckgo_search import DDGS

def ddg_search(query, max_results=5):
    # DDGS.text yields dicts with 'title', 'href', and 'body' keys
    with DDGS() as ddgs:
        results = ddgs.text(query, max_results=max_results)
    return [{
        'title': r['title'],
        'url': r['href'],
        'snippet': r['body']
    } for r in results]
```
Key parameters:

- `max_results`: number of results to return; 3-5 balances quality against speed
- `region`: search region (e.g. `'wt-wt'` for the international edition)
- `safesearch`: content-safety filtering level

Note: DuckDuckGo rate-limits frequent requests, so add a 1-2 second delay between calls.
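That delay advice can be baked into a small decorator so no call site forgets it. A sketch (the `throttle` helper and its defaults are my own, not part of duckduckgo-search):

```python
import time
from functools import wraps

def throttle(min_interval=1.0):
    """Decorator enforcing a minimum delay between successive calls."""
    last_call = [0.0]  # mutable cell so the wrapper can update it
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@throttle(min_interval=1.5)  # stay under DuckDuckGo's informal rate limits
def rate_limited_ddg_search(query, max_results=5):
    # Delegate to the ddg_search wrapper defined above; stubbed here
    return []
```

Any search function can be wrapped the same way, including the Tavily calls below.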
Tavily requires an API key (the free tier is plenty for personal use):

```python
from tavily import TavilyClient

tavily = TavilyClient(api_key="your_api_key")

def tavily_search(query, include_raw_content=False):
    response = tavily.search(
        query=query,
        search_depth="advanced",  # "basic" or "advanced"
        include_raw_content=include_raw_content
    )
    return response['results']
```
Tavily's advanced options include:

- `search_depth`: search depth (`"basic"` or `"advanced"`)
- `include_domains`: restrict results to specific sites
- `exclude_domains`: exclude known low-quality sites
- `include_raw_content`: fetch the full page text

Naively merging results from the two engines produces heavy duplication, so I built a deduplication step based on semantic similarity:
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

def deduplicate_results(results, threshold=0.85):
    unique_results = []
    unique_embeddings = []
    # Encode every snippet once up front instead of re-encoding per comparison
    embeddings = model.encode([r['snippet'] for r in results])
    for i, result in enumerate(results):
        is_duplicate = any(
            cosine_similarity([embeddings[i]], [ue])[0][0] > threshold
            for ue in unique_embeddings
        )
        if not is_duplicate:
            unique_results.append(result)
            unique_embeddings.append(embeddings[i])
    return unique_results
```
Assign each result a credibility score:

```python
from urllib.parse import urlparse

def calculate_credibility_score(result):
    score = 0
    # Domain weighting
    domain = urlparse(result['url']).netloc
    if '.edu' in domain: score += 0.3
    elif '.gov' in domain: score += 0.2
    elif '.org' in domain: score += 0.1
    # Content features
    if len(result['snippet']) > 200: score += 0.2
    if 'references' in result['snippet']: score += 0.1
    # Engine weighting
    if result['source'] == 'tavily': score += 0.2
    return min(score, 1.0)
```
```python
def hybrid_search(query):
    # Query both engines (sequentially here; an async variant appears later)
    ddg_results = ddg_search(query)
    tavily_results = tavily_search(query)
    # Tag each result with its source
    for r in ddg_results: r['source'] = 'ddg'
    for r in tavily_results: r['source'] = 'tavily'
    # Merge and deduplicate
    all_results = ddg_results + tavily_results
    unique_results = deduplicate_results(all_results)
    # Score and sort
    for r in unique_results:
        r['credibility'] = calculate_credibility_score(r)
    return sorted(unique_results, key=lambda x: x['credibility'], reverse=True)
```
Convert the search results into an LLM-friendly format:

```python
def format_for_llm(results, max_length=3000):
    context = ""
    char_count = 0
    for r in results:
        if char_count >= max_length:
            break
        snippet = f"Source: {r['source']} ({r['url']})\n{r['snippet']}\n\n"
        if char_count + len(snippet) <= max_length:
            context += snippet
            char_count += len(snippet)
    return context
```
An optimized prompt template:

```python
SEARCH_PROMPT_TEMPLATE = """Answer the question using the latest information below. If the information is insufficient or uncertain, say so explicitly.

Current date: {current_date}

Question: {query}

Relevant search results:
{search_results}

Answer according to these requirements:
1. Synthesize information from multiple sources
2. Attribute key facts to their sources
3. Distinguish factual statements from speculation
4. Point out any contradictions between sources
5. Maintain a professional, objective tone

Final answer:"""
```
```python
from datetime import datetime

def answer_with_search(llm_client, query):
    # Retrieve and format search results
    search_results = hybrid_search(query)
    formatted_results = format_for_llm(search_results)
    # Build the prompt
    prompt = SEARCH_PROMPT_TEMPLATE.format(
        current_date=datetime.now().strftime("%Y-%m-%d"),
        query=query,
        search_results=formatted_results
    )
    # Call the LLM
    response = llm_client.generate(
        prompt,
        max_tokens=1500,
        temperature=0.3  # lower temperature favors accuracy over creativity
    )
    return {
        "answer": response,
        "sources": [r['url'] for r in search_results[:3]]
    }
```
Cache search results with Redis:

```python
import redis
import pickle
import hashlib

r = redis.Redis(host='localhost', port=6379)

def get_cache_key(query):
    return hashlib.md5(query.encode()).hexdigest()

def cached_search(query, expire=3600):
    cache_key = get_cache_key(query)
    cached = r.get(cache_key)
    if cached:
        return pickle.loads(cached)
    results = hybrid_search(query)
    r.setex(cache_key, expire, pickle.dumps(results))
    return results
```
Speed up searches with asyncio:

```python
import asyncio
from tavily import AsyncTavilyClient

async def async_hybrid_search(query):
    # duckduckgo-search has no stable public async API across versions,
    # so run the synchronous search in a worker thread instead
    ddg_task = asyncio.to_thread(ddg_search, query)
    tavily_task = AsyncTavilyClient(api_key="your_api_key").search(query)
    ddg_results, tavily_response = await asyncio.gather(ddg_task, tavily_task)
    # ...remaining processing is the same as the synchronous version...
```
Tailor results to the user's location:

```python
from duckduckgo_search import DDGS

def localized_search(query, country_code):
    # DuckDuckGo region codes take a country-language form, e.g. "us-en";
    # a real implementation should map each country to its proper region code
    with DDGS() as ddgs:
        ddg_results = ddgs.text(query, region=f"{country_code.lower()}-en")
    # Tavily: restrict results to the country's top-level domain
    tavily_results = tavily.search(
        query=query,
        include_domains=[f".{country_code.lower()}"]
    )
    # ...merge results...
```
A common failure mode is noisy or irrelevant results. Two remedies help: reformulating the query (for example, adding industry-specific terms) and filtering the merged results afterwards. Advanced filtering:
```python
def filter_low_quality(results):
    # Drop results whose URL contains obvious ad/spam markers
    return [r for r in results if not any(
        d in r['url'] for d in ['advert', 'promo', 'click']
    )]
```
Rate-limit mitigation with exponential backoff:

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def safe_api_call(api_func, *args):
    return api_func(*args)
```
Typical parsing-error handling:

```python
def safe_extract(result):
    try:
        return {
            'title': result.get('title', ''),
            'url': result['link'] if 'link' in result else result['url'],
            'snippet': result.get('body', result.get('content', ''))
        }
    except Exception as e:
        print(f"Parse error: {e}")
        return None
```
```python
def generate_research_report(topic):
    search_queries = [
        f"{topic} latest research",
        f"{topic} industry report",
        f"{topic} statistics"
    ]
    all_results = []
    for query in search_queries:
        all_results.extend(hybrid_search(query))
    # Deduplicate and rank...
    # Call the LLM to generate the report...
```
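The elided merge step could look like the following sketch. To keep it self-contained I deduplicate by exact URL here; the semantic `deduplicate_results` from earlier is a drop-in alternative:

```python
def merge_and_rank(all_results):
    # Drop exact-URL duplicates, keeping the first occurrence
    seen, unique = set(), []
    for r in all_results:
        if r['url'] not in seen:
            seen.add(r['url'])
            unique.append(r)
    # Highest credibility first; unscored results sink to the bottom
    return sorted(unique, key=lambda r: r.get('credibility', 0), reverse=True)
```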
```python
def fact_check(claim):
    # Markers suggesting the claim cites some form of evidence
    evidence_markers = [
        "studies show", "data indicate", "according to statistics"
    ]
    if not any(term in claim for term in evidence_markers):
        return {"status": "unverifiable", "confidence": 0}
    results = hybrid_search(f"verify: {claim}")
    # Analyze agreement across results...
```
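One crude way to flesh out the agreement analysis is lexical overlap between the claim and each snippet. This is only a proxy (real stance detection needs an NLI model); the function below is an illustrative assumption, not part of the pipeline above:

```python
def consistency_score(claim, snippets):
    """Average fraction of the claim's words that each snippet also contains."""
    claim_words = set(claim.lower().split())
    if not claim_words or not snippets:
        return 0.0
    overlaps = [
        len(claim_words & set(s.lower().split())) / len(claim_words)
        for s in snippets
    ]
    return sum(overlaps) / len(overlaps)
```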
```python
from datetime import date
from apscheduler.schedulers.background import BackgroundScheduler

def monitor_industry(keywords):
    scheduler = BackgroundScheduler()
    last_check_date = date.today().isoformat()  # track the last successful check

    @scheduler.scheduled_job('interval', hours=6)
    def check_updates():
        nonlocal last_check_date
        for keyword in keywords:
            new_results = hybrid_search(f"{keyword} latest news after:{last_check_date}")
            # Process new results...
        last_check_date = date.today().isoformat()

    scheduler.start()
```
In practice, I've found this search combination especially well suited to scenarios demanding both freshness and accuracy. In finance, for example, a dedicated industry keyword filter list substantially raises the share of relevant results. One practical trick is to maintain a domain whitelist for each vertical, which works especially well in specialized fields such as medicine and law.
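The whitelist trick can be implemented as a simple post-filter over merged results. A sketch (the example domains are illustrative placeholders, not a vetted list):

```python
from urllib.parse import urlparse

# Hypothetical per-vertical whitelists; real lists should be curated per field
VERTICAL_WHITELISTS = {
    'medical': {'nih.gov', 'who.int', 'nejm.org'},
    'legal': {'law.cornell.edu', 'justia.com'},
}

def filter_by_whitelist(results, vertical):
    allowed = VERTICAL_WHITELISTS.get(vertical, set())
    if not allowed:
        return results  # no whitelist for this vertical: pass everything through
    filtered = []
    for r in results:
        host = urlparse(r['url']).netloc.lower()
        # Match the domain itself or any of its subdomains
        if any(host == d or host.endswith('.' + d) for d in allowed):
            filtered.append(r)
    return filtered
```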