Zhihu Content Scraping and Knowledge Analysis in Practice - AI智能范式网

chen2766343375

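1. Project Background and Requirements

"小约翰" is an active creator on Zhihu whose answers, articles, and pins (想法) tend to carry distinctive viewpoints and dense information. For long-term readers and researchers, collecting and analyzing this material systematically is valuable in several ways:

- Knowledge preservation: turn fragmented platform content into a structured archive, so nothing is lost when the platform changes its algorithms or takes content down
- In-depth research: use the full back catalogue to analyze the author's writing style, how their views evolve, and how their body of knowledge is built up
- Personal study: build an offline knowledge base that can be searched and reread at any time
- Derivative work: supply raw material for secondary writing such as reading notes and viewpoint summaries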

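Copying and pasting by hand is slow once there are hundreds of items, and it is hard to keep the formatting consistent. The automated approach in this project addresses three core pain points:

- Unified scraping across content types (answers, articles, pins)
- Export to multiple formats for different use cases
- AI-assisted knowledge mining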

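2. Technical Design

2.1 System Architecture

The system is modular, with the components chained together by Python scripts:

[Zhihu crawler] → [Data cleaning] → [Format conversion] → [Knowledge analysis]
       │                 │                   │                      │
       ↓                 ↓                   ↓                      ↓
[Raw HTML]      → [Clean text]    → [Multi-format files] → [Knowledge graph]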

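2.2 Core Tool Selection

Scraping layer:
- playwright: handles Zhihu's dynamically loaded pages and keeps the login session alive (chosen over selenium as the lighter option)
- BeautifulSoup: parses HTML and extracts structured content
- Custom anti-blocking measures: random delays, rotating request headers, and a proxy IP pool
Processing layer:
- pdfkit: converts HTML to PDF (requires wkhtmltopdf to be installed first)
- python-docx: generates editable Word documents
- jinja2: custom HTML templates for clean typesetting
Analysis layer:
- Tencent Cloud TI platform: word-vector and text-classification APIs
- gensim: local TF-IDF keyword extraction
- networkx: builds the knowledge graph of related concepts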

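Note: Zhihu's robots.txt disallows crawling. Keep the request rate low (3 requests per minute or fewer is suggested), comply with the Cybersecurity Law (《网络安全法》) and related regulations, and use the data only for personal study and research.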
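
The 3 requests per minute ceiling suggested above can be enforced with a small limiter that spaces calls evenly. The sketch below is one possible way to do it, not the project's actual code: the class name `MinIntervalLimiter` and its parameters are made up for illustration, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class MinIntervalLimiter:
    """Allow at most `max_per_minute` calls per minute by spacing them evenly.

    Hypothetical helper for illustration; not part of the original project.
    """

    def __init__(self, max_per_minute=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each page load keeps the crawler at or below the chosen rate, whatever else the scraping loop does in between.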

3. Implementation Walkthrough

3.1 Content Acquisition Module

3.1.1 Parsing the User Profile Page

from playwright.async_api import async_playwright

async def get_user_content(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        )
        page = await context.new_page()

        # Mimic a human interaction pattern
        await page.goto(url, timeout=60000)
        await page.mouse.move(100, 100)
        await page.wait_for_timeout(2000)

        # Basic profile info (element handles; each is None if its selector no longer matches)
        user_name = await page.query_selector('h1.ProfileHeader-name')
        total_answers = await page.query_selector('div.Profile-main >> div.Tabs-item:has-text("回答")')

        # Scroll until the page height stops growing, i.e. everything has loaded
        last_height = 0
        while True:
            await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            await page.wait_for_timeout(3000)
            new_height = await page.evaluate('document.body.scrollHeight')
            if new_height == last_height:
                break
            last_height = new_height

        # Extract every answer card. parse_card is an async helper defined elsewhere
        # in the project (not shown here) that turns one card into a record.
        answer_cards = await page.query_selector_all('div.List-item')
        results = [await parse_card(card) for card in answer_cards]
        await browser.close()
        return results

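3.1.2 Anti-Blocking Countermeasures

Request fingerprint disguise:
- Draw the User-Agent at random from a pool (mobile and desktop)
- Send a different X-Forwarded-For header with each request
- Disable automation fingerprints (e.g. --disable-blink-features=AutomationControlled)
Behavior simulation:
- Scroll the page at random to imitate reading
- Insert Gaussian-distributed random delays between actions (mean 3 s, standard deviation 1.5 s)
- Move the mouse along Bezier curves
Failure handling:
- Save progress automatically when a CAPTCHA is triggered
- Switch IP address after 5 consecutive failures
- Detect the "加载失败" (load failed) notice and retry automatically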
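
Two of the behavior-simulation items above are easy to make concrete: the Gaussian delay and the Bezier mouse path. The following is a minimal sketch under my own assumptions. The function names, the 0.5 s lower clamp (a raw Gaussian can go negative), and the choice of a cubic curve with two control points are illustrative and not taken from the original project.

```python
import random


def human_delay(mean=3.0, stddev=1.5, minimum=0.5, rng=random):
    """Draw a delay in seconds from a Gaussian, clamped to a floor.

    The floor is an assumption added here so the delay is never negative.
    """
    return max(minimum, rng.gauss(mean, stddev))


def bezier_path(start, end, control1, control2, steps=20):
    """Return steps + 1 (x, y) points along a cubic Bezier curve from start to end."""
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        # Cubic Bernstein weights for the four control points
        w0, w1, w2, w3 = u**3, 3 * u**2 * t, 3 * u * t**2, t**3
        x = w0 * start[0] + w1 * control1[0] + w2 * control2[0] + w3 * end[0]
        y = w0 * start[1] + w1 * control1[1] + w2 * control2[1] + w3 * end[1]
        points.append((x, y))
    return points
```

In the crawler, each point from `bezier_path` would be passed in turn to `page.mouse.move(x, y)`, with `page.wait_for_timeout` fed from `human_delay()` (converted to milliseconds) between actions.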

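3.2 Data Cleaning and Storage

3.2.1 Content Normalization

import re

def clean_content(raw_html):
    # raw_html is a parsed BeautifulSoup tree, not a raw string.
    # Remove Zhihu-specific UI elements
    for tag in ['button', 'svg', 'iframe', 'figure']:
        for element in raw_html.select(tag):
            element.decompose()

    # Keep the core content container
    content = raw_html.select_one('div.RichContent-inner')
    if content is None:
        return ''

    # Normalize image links: promote the lazy-load attribute to src
    for img in content.select('img'):
        if 'data-actualsrc' in img.attrs:
            img.attrs['src'] = img.attrs.pop('data-actualsrc')

    # Collapse runs of blank lines
    text = content.get_text(separator='\n')
    return re.sub(r'\n{3,}', '\n\n', text.strip())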

3.2.2 Storage Schema

{
  "metadata": {
    "author": "小约翰",
    "post_id": "123456789",
    "create_time": "2023-07-15T14:32:10",
    "update_time": "2023-07-20T09:15:33",
    "url": "https://www.zhihu.com/question/12345/answer/67890",
    "vote_count": 2568,
    "comment_count": 342
  },
  "content": {
    "text": "Cleaned plain-text content...",
    "html": "<div>HTML content with formatting preserved...</div>",
    "images": ["https://pic1.zhimg.com/80/v2-xxx.jpg"]
  },
  "tags": ["philosophy", "sociology", "cultural criticism"]
}
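
The article does not say how these records are written to disk. One simple option, shown here purely as an assumption, is a JSON Lines file with one record per line, which appends cheaply and keeps Chinese text readable via `ensure_ascii=False`. The function names and the file layout are hypothetical.

```python
import json
from pathlib import Path


def append_record(path, record):
    """Append one record as a single JSON line (UTF-8, non-ASCII left readable)."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Read every record back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Appending rather than rewriting the whole file means an interrupted crawl loses at most the record being written.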

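3.3 Multi-Format Export

3.3.1 PDF Generation Tuning

import pdfkit

def html_to_pdf(html_str, output_path):
    options = {
        'encoding': 'UTF-8',
        'quiet': '',
        'page-size': 'A4',
        'margin-top': '15mm',
        'margin-right': '15mm',
        'margin-bottom': '15mm',
        'margin-left': '15mm',
        'footer-center': '[page]/[topage]',
        'header-left': '小约翰 Zhihu Collection',
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ]
    }

    # Custom CSS for better typesetting
    css = '''
    body { font-family: "Source Han Serif CN"; line-height: 1.6; }
    h1 { font-size: 18pt; border-bottom: 1px solid #eee; }
    img { max-width: 100%; height: auto; }
    '''

    # pdfkit's css argument expects a path to a CSS file, so the rules are
    # inlined as a <style> tag instead
    style_tag = f'<style>{css}</style>'
    if '</head>' in html_str:
        html_str = html_str.replace('</head>', style_tag + '</head>', 1)
    else:
        html_str = style_tag + html_str

    pdfkit.from_string(html_str, output_path, options=options)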

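3.3.2 Word Document Styling

from docx.enum.style import WD_STYLE_TYPE
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor

def apply_style(doc):
    # Body text style
    style = doc.styles['Normal']
    style.font.name = '微软雅黑'
    # font.name only sets the Latin font; CJK text needs the East Asian font too
    style.element.rPr.rFonts.set(qn('w:eastAsia'), '微软雅黑')
    style.font.size = Pt(10.5)

    # Heading style
    heading = doc.styles.add_style('Heading', WD_STYLE_TYPE.PARAGRAPH)
    heading.font.bold = True
    heading.font.color.rgb = RGBColor(44, 62, 80)

    # Code block style
    code_style = doc.styles.add_style('Code', WD_STYLE_TYPE.PARAGRAPH)
    code_style.font.name = 'Consolas'
    code_style.font.size = Pt(9)
    # python-docx has no shading property on paragraph_format, so the light grey
    # background is written into the style's XML directly
    shading = OxmlElement('w:shd')
    shading.set(qn('w:val'), 'clear')
    shading.set(qn('w:color'), 'auto')
    shading.set(qn('w:fill'), 'F0F0F0')
    code_style.element.get_or_add_pPr().append(shading)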

3.4 Knowledge Analysis Module

3.4.1 Tencent Cloud TI Platform Integration

import json
from tencentcloud.common import credential
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.tiems.v20190416 import tiems_client, models

def analyze_with_tencent(text):
    cred = credential.Credential("your-secret-id", "your-secret-key") 
    httpProfile = HttpProfile()
    httpProfile.endpoint = "tiems.tencentcloudapi.com"

    clientProfile = ClientProfile()
    clientProfile.httpProfile = httpProfile
    client = tiems_client.TiemsClient(cred, "ap-beijing", clientProfile)

    req = models.CreateJobRequest()
    params = {
        "Name": "zhihu_analysis",
        "ResourceGroupId": "default",
        "AlgorithmSpecification": {
            "TrainingImageName": "ccr.ccs.tencentyun.com/ti-platform/text-analytics:latest"
        },
        "InputDataConfig": [
            {
                "ChannelName": "input",
                "DataSource": {
                    "Content": text
                }
            }
        ]
    }
    req.from_json_string(json.dumps(params))
    resp = client.CreateJob(req)
    return resp.JobId

3.4.2 Building a Local Knowledge Graph

import spacy
import networkx as nx
from collections import defaultdict

nlp = spacy.load("zh_core_web_lg")

def build_knowledge_graph(texts):
    entity_relations = defaultdict(list)
    graph = nx.Graph()

    for text in texts:
        doc = nlp(text)
        entities = [ent.text for ent in doc.ents if ent.label_ in ["PERSON", "ORG", "PRODUCT"]]

        # Extract (subject, verb, object) triples: start from each direct object
        # and look for a subject attached to the same head verb
        for token in doc:
            if token.dep_ == "dobj":
                subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subject:
                    relation = (subject[0].text, token.head.text, token.text)
                    entity_relations[relation[0]].append((relation[1], relation[2]))

        # Link entities that appear next to each other; repeated pairs add weight
        for a, b in zip(entities, entities[1:]):
            if a == b:
                continue
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)

    # entity_relations is collected for later use; only the graph is returned here
    return graph
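
Section 2.2 lists local TF-IDF keyword extraction as part of the analysis layer, but no code for it appears in this write-up. As a rough stand-in, here is a plain standard-library TF-IDF, which is not gensim's implementation and uses the textbook weighting tf × ln(N/df). It assumes each document has already been split into tokens (Chinese text needs a word segmenter first), and the function name is made up for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=5):
    """Rank tokens per document by TF-IDF.

    docs: list of token lists (already segmented).
    Returns one list of (token, score) pairs per document, best first.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    df = Counter(token for doc in docs for token in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero on an empty document
        scores = {
            token: (count / total) * math.log(n_docs / df[token])
            for token, count in tf.items()
        }
        # Sort by score descending, then by token for a stable order
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_k])
    return results
```

A token that appears in every document scores zero, which is the point of the IDF term: words the author uses everywhere say little about any single post.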

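4. Lessons Learned and Pitfalls

4.1 Content Acquisition

Pagination trap: Zhihu's newer pages mix infinite scrolling with a traditional paginator, so both conditions need to be watched:

// Infinite-scroll condition: paginator hidden and item count a multiple of 20
document.querySelector('div.Pagination')?.style.display === 'none'
&& document.querySelectorAll('div.ContentItem').length % 20 === 0

// Traditional paginator check
document.querySelectorAll('button.Pagination-btn').length > 0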

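Deduplication: Zhihu's recommendation system can surface the same item more than once, so build an MD5 index keyed on post_id. The fingerprint here also includes update_time, so an edited post counts as a new version:

import hashlib

def get_fingerprint(item):
    return hashlib.md5(
        f"{item['post_id']}_{item['update_time']}".encode()
    ).hexdigest()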
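
To show how the fingerprint might be used, here is a small deduplication pass. It is a sketch under my own assumptions: `deduplicate` and its `seen` argument are hypothetical names, and `get_fingerprint` is repeated only so the snippet runs on its own.

```python
import hashlib


def get_fingerprint(item):
    # Same scheme as above: post_id plus update_time
    return hashlib.md5(
        f"{item['post_id']}_{item['update_time']}".encode()
    ).hexdigest()


def deduplicate(items, seen=None):
    """Keep the first item for each fingerprint.

    Pass the same `seen` set across crawl runs to skip items stored earlier.
    """
    seen = set() if seen is None else seen
    unique = []
    for item in items:
        fingerprint = get_fingerprint(item)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(item)
    return unique
```

Persisting `seen` between runs (for example alongside the archive) lets an incremental crawl store only new or edited posts.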

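4.2 Format Conversion Tuning

Chinese text in PDFs: make sure a Chinese font is installed on the system (Source Han Serif, 思源宋体, is recommended). The Ubuntu package in this example installs Noto CJK fonts; Noto Serif CJK is the same design as Source Han Serif under a different family name, so the font-family in the CSS has to name whichever family is actually installed:
```bash
# Ubuntu example
sudo apt install fonts-noto-cjk-extra
```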

Splitting Word documents: long articles are best split into sections along paragraph boundaries so that no single document grows too large:

def split_long_text(text, max_chars=5000):
    paragraphs = text.split('\n')
    sections = []
    current_section = []
    char_count = 0

    for para in paragraphs:
        # Start a new section once adding this paragraph would exceed the limit
        if char_count + len(para) > max_chars and current_section:
            sections.append('\n'.join(current_section))
            current_section = []
            char_count = 0
        current_section.append(para)
        char_count += len(para)

    if current_section:
        sections.append('\n'.join(current_section))
    return sections

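4.3 Knowledge Analysis Tips

Tencent Cloud API tuning: when processing large volumes of text, batching requests improves throughput:

import time

def batch_analyze(texts, batch_size=10):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        # tencent_client is assumed to be an already-initialized SDK client;
        # check the BatchAnalyze call against the SDK version in use
        response = tencent_client.BatchAnalyze(
            Texts=batch,
            Tasks=['keyword', 'category', 'sentiment']
        )
        results.extend(response.Results)
        time.sleep(1)  # stay within the QPS limit
    return results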

Local caching: store analysis results in a local cache database to avoid recomputing them:

import sqlite3

def init_cache_db():
    conn = sqlite3.connect('analysis_cache.db')
    conn.execute('''CREATE TABLE IF NOT EXISTS results
        (text_md5 TEXT PRIMARY KEY, 
         keywords TEXT,
         categories TEXT,
         sentiment REAL)''')
    return conn
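
The table alone does not show how lookups and writes would work, so here is one possible read/write pair against the same schema. Everything beyond the schema is an assumption on my part: the helper names, storing the lists as JSON strings, and the in-memory default used to keep the example self-contained.

```python
import hashlib
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open the cache with the same schema as above (in-memory by default here)."""
    conn = sqlite3.connect(path)
    conn.execute('''CREATE TABLE IF NOT EXISTS results
        (text_md5 TEXT PRIMARY KEY,
         keywords TEXT,
         categories TEXT,
         sentiment REAL)''')
    return conn


def text_key(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def get_cached(conn, text):
    """Return the cached analysis for `text`, or None on a cache miss."""
    row = conn.execute(
        "SELECT keywords, categories, sentiment FROM results WHERE text_md5 = ?",
        (text_key(text),),
    ).fetchone()
    if row is None:
        return None
    return {
        "keywords": json.loads(row[0]),
        "categories": json.loads(row[1]),
        "sentiment": row[2],
    }


def put_cached(conn, text, keywords, categories, sentiment):
    """Insert or overwrite the analysis result for `text`."""
    conn.execute(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
        (
            text_key(text),
            json.dumps(keywords, ensure_ascii=False),
            json.dumps(categories, ensure_ascii=False),
            sentiment,
        ),
    )
    conn.commit()
```

The analysis code would call `get_cached` first and only hit the remote API (followed by `put_cached`) on a miss.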

5. Extended Use Cases

5.1 Automated Knowledge Updates

Scheduled scraping can be run through GitHub Actions:

name: Zhihu Sync
on:
  schedule:
    - cron: '0 12 * * *'  # runs daily at 12:00 UTC
jobs:
  sync:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # needed for the git push step
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
      - run: pip install -r requirements.txt
      - run: python zhihu_crawler.py --user 小约翰
      - name: Commit changes
        run: |
          git config --global user.name "Automated Sync"
          git config --global user.email "actions@users.noreply.github.com"
          git add .
          git commit -m "Update zhihu content" || echo "No changes to commit"
          git push

5.2 Personal Knowledge Portal

A local search system can be built with Flask:

from flask import Flask, request, render_template
import whoosh.index as index
from whoosh.qparser import QueryParser

app = Flask(__name__)
ix = index.open_dir("indexdir")

@app.route('/search')
def search():
    # Default to an empty query so a missing ?q= does not pass None to the parser
    query_str = request.args.get('q', '')
    qp = QueryParser("content", ix.schema)
    q = qp.parse(query_str)

    # Render inside the with block: results are only valid while the searcher is open
    with ix.searcher() as searcher:
        results = searcher.search(q, limit=20)
        return render_template('results.html', results=results)

5.3 Mobile Knowledge Cards

Generating Anki flashcards:

from genanki import Model, Deck, Package, Note

model = Model(
    1607392319,
    'Zhihu QA Model',
    fields=[
        {'name': 'Question'},
        {'name': 'Answer'},
        {'name': 'Source'},
    ],
    templates=[
        {
            'name': 'Card 1',
            'qfmt': '{{Question}}<br><small>{{Source}}</small>',
            'afmt': '{{FrontSide}}<hr id="answer">{{Answer}}',
        },
    ])

def create_anki_deck(contents):
    deck = Deck(2059400110, "小约翰 Zhihu Selections")
    for item in contents:
        note = Note(
            model=model,
            fields=[
                item['title'],
                item['content'],
                f"Published {item['create_time']} · {item['vote_count']} upvotes"
            ]
        )
        deck.add_note(note)
    Package(deck).write_to_file('zhihu.apkg')

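In practice, Zhihu's page structure goes through a major redesign roughly every 3 to 6 months, so the crawler's robustness should be checked regularly. For a long-running collection project, set up monitoring for page-structure changes that raises an alert when a core selector stops matching. The diff-match-patch library is also recommended for diffing historical versions of a post to trace how the author's views evolve.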
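
A selector health check along these lines could be as simple as the sketch below. It is a deliberately limited illustration using only the standard library: it understands plain `tag.class` selectors such as the ones used earlier in this article, not full CSS, and the function name is hypothetical. How the alert is delivered (a failing CI job, an email, and so on) is left to the caller.

```python
from html.parser import HTMLParser


class _TagClassCollector(HTMLParser):
    """Record every (tag, class) pair that occurs in the document."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        # The class attribute may be absent or valueless, hence the `or ""`
        for cls in (dict(attrs).get("class") or "").split():
            self.seen.add((tag, cls))


def missing_selectors(html, selectors):
    """Return the simple `tag.class` selectors that match nothing in `html`."""
    collector = _TagClassCollector()
    collector.feed(html)
    missing = []
    for selector in selectors:
        tag, _, cls = selector.partition(".")
        if (tag, cls) not in collector.seen:
            missing.append(selector)
    return missing
```

Run against a freshly fetched profile page with a list such as `["h1.ProfileHeader-name", "div.List-item", "div.RichContent-inner"]`, a non-empty result is the signal that the page layout has changed and the crawler needs attention.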