This project is a complete system combining data collection, storage, cleaning, analysis, and visualization-backed recommendation, focused on deep mining of Douban movie data. As a graduation-project topic for computer science students, it integrates Python web scraping, the Vue front-end framework, the Flask back-end framework, an LSTM deep-learning model, and ECharts visualization into one coherent whole.

In practice, such a system can help film fans discover movies that match their personal taste, and give film-industry practitioners a reference for market-trend analysis. Technically, the project covers the full pipeline from data acquisition to intelligent recommendation, making it well suited to demonstrating full-stack development skills in a graduation project.
The system uses a front-end/back-end separated architecture: a Vue.js single-page application with ECharts on the front end, a Flask REST API on the back end, a MySQL database for storage, and an LSTM model in the analysis layer.

This combination balances the feasibility of a graduation project with completeness and technical currency. Flask, a lightweight Python web framework, suits small and medium projects better than Django; Vue.js, with its progressive design and rich ecosystem, is an ideal front-end choice.
The system's core data flow breaks down into several stages: data collection (scraper), storage (MySQL), cleaning, analysis and recommendation (TF-IDF similarity plus LSTM sentiment analysis), and finally visualization (ECharts).
Collecting Douban movie data is the foundation of the system, and anti-scraping countermeasures need particular attention. We adopt a distributed scraper architecture with the following main features:
```python
import requests
from bs4 import BeautifulSoup
import time
import random

class DoubanSpider:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
            'Referer': 'https://movie.douban.com/'
        }
        self.proxies = self._get_proxies()
        self.cookies = self._get_cookies()

    def get_movie_list(self, start=0):
        url = f'https://movie.douban.com/top250?start={start}'
        try:
            response = requests.get(url, headers=self.headers,
                                    proxies=self.proxies, cookies=self.cookies)
            if response.status_code == 200:
                return self.parse_movie_list(response.text)
            else:
                self._handle_error(response)
        except Exception as e:
            print(f'Error occurred: {str(e)}')
            time.sleep(random.randint(5, 10))

    def parse_movie_list(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        # Parsing logic...
```
Caveat: Douban enforces strict anti-scraping measures. Keep the request rate low (one request every 2-3 seconds), use a proxy IP pool, and simulate real user behavior. The collected data should cover core fields such as basic movie information, ratings, and comments.
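To keep the request rate polite while still recovering from transient failures, a retry helper with jittered exponential backoff can be layered on top of `get_movie_list`. This is a minimal sketch; the helper names and delay parameters are assumptions for illustration, not values from the project:

```python
import random
import time

def backoff_delays(retries=3, base=2.0, cap=30.0):
    """Yield jittered delays in seconds: base * 2^attempt, capped,
    plus up to 1s of random jitter so requests don't form a regular pattern."""
    for attempt in range(retries):
        yield min(cap, base * (2 ** attempt)) + random.uniform(0, 1)

def fetch_with_retry(fetch, *args, retries=3):
    """Call `fetch` (e.g. spider.get_movie_list); on a falsy result,
    sleep for the next backoff delay and try again."""
    for delay in backoff_delays(retries):
        result = fetch(*args)
        if result is not None:
            return result
        time.sleep(delay)
    return None
```

The jitter matters as much as the delay itself: fixed intervals are an easy fingerprint for anti-scraping systems.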
The collected data needs a sensible database schema. The main tables are designed as follows:
```sql
CREATE TABLE movies (
    id INT PRIMARY KEY AUTO_INCREMENT,
    douban_id VARCHAR(20) UNIQUE,
    title VARCHAR(100) NOT NULL,
    director VARCHAR(100),
    actors TEXT,
    genres VARCHAR(100),
    release_date DATE,
    duration INT,
    rating FLOAT,
    votes INT,
    summary TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE comments (
    id INT PRIMARY KEY AUTO_INCREMENT,
    movie_id INT,
    user_id VARCHAR(50),
    user_name VARCHAR(50),
    rating FLOAT,
    content TEXT,
    comment_time DATETIME,
    FOREIGN KEY (movie_id) REFERENCES movies(id)
);
```
For large data volumes, add appropriate indexes to speed up queries:
```sql
CREATE INDEX idx_movie_genres ON movies(genres);
CREATE INDEX idx_movie_rating ON movies(rating);
CREATE INDEX idx_comment_movie ON comments(movie_id);
```
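To sanity-check that such an index is actually used, inspect the query plan. A small sketch, using SQLite's `EXPLAIN QUERY PLAN` for portability (the production schema targets MySQL, where `EXPLAIN` plays the same role):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, rating REAL)')
conn.execute('CREATE INDEX idx_movie_rating ON movies(rating)')
conn.executemany('INSERT INTO movies (title, rating) VALUES (?, ?)',
                 [('A', 9.2), ('B', 7.8), ('C', 8.9)])

# The range predicate on `rating` can be answered via the index
plan = conn.execute(
    'EXPLAIN QUERY PLAN SELECT title FROM movies WHERE rating >= 8.5'
).fetchall()
print(plan[0][3])  # plan detail, e.g. a SEARCH using idx_movie_rating
```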
The content-based recommender first computes similarity over movie features:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def content_based_recommend(movie_id, top_n=5):
    # Fetch all movies as a DataFrame
    movies = get_all_movies()
    # Combine features: director + actors + genres + summary
    # (fillna avoids errors when a field is NULL in the database)
    movies['content'] = (movies['director'].fillna('') + ' ' +
                         movies['actors'].fillna('') + ' ' +
                         movies['genres'].fillna('') + ' ' +
                         movies['summary'].fillna(''))
    # TF-IDF vectorization; note stop_words='english' only filters English
    # stop words -- Chinese text should be segmented (e.g. with jieba) first
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(movies['content'])
    # linear_kernel on L2-normalized TF-IDF vectors equals cosine similarity
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    # Rank by similarity, skipping the movie itself (always rank 0)
    idx = movies.index[movies['id'] == movie_id].tolist()[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:top_n + 1]
    movie_indices = [i[0] for i in sim_scores]
    return movies.iloc[movie_indices]
```
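A toy run of the same pipeline shows the mechanics; the three-movie DataFrame below is made up and stands in for `get_all_movies()`:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical mini-catalogue; `content` plays the role of the combined features
movies = pd.DataFrame({
    'id': [1, 2, 3],
    'title': ['Interstellar', '2001: A Space Odyssey', 'Before Sunset'],
    'content': ['nolan sci-fi space time',
                'kubrick sci-fi space monolith',
                'linklater romance paris conversation'],
})

tfidf_matrix = TfidfVectorizer().fit_transform(movies['content'])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

idx = movies.index[movies['id'] == 1].tolist()[0]
sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
top = [movies.iloc[i]['title'] for i, _ in sim_scores[1:3]]
print(top)  # the other space film ranks above the romance
```

Because the romance shares no terms with *Interstellar*, its cosine similarity is zero, which is exactly the behavior that makes the combined director/actors/genres/summary feature worth building.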
An LSTM model performs sentiment analysis on comments; the predicted sentiment serves as a proxy for user ratings:
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def build_lstm_model(vocab_size, max_length):
    model = Sequential([
        Embedding(vocab_size, 128, input_length=max_length),
        LSTM(128, dropout=0.2, recurrent_dropout=0.2),
        Dense(1, activation='sigmoid')  # binary sentiment: positive/negative
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

def train_sentiment_analysis():
    # Load comment data (assumes sentiment labels are already annotated)
    comments = load_comments_with_labels()
    # Text preprocessing: word -> index, then pad to a fixed length
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(comments['content'])
    sequences = tokenizer.texts_to_sequences(comments['content'])
    padded = pad_sequences(sequences, maxlen=200)
    # Build and train the model
    model = build_lstm_model(5000, 200)
    model.fit(padded, comments['label'],
              validation_split=0.2,
              epochs=10, batch_size=128)
    return model, tokenizer
```
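The shape of the Keras preprocessing above (word to index, then pre-padding to a fixed length) can be illustrated without TensorFlow. This hand-rolled sketch mirrors the defaults of `texts_to_sequences` plus `pad_sequences` (zero-pad and truncate at the front, unknown words dropped):

```python
def texts_to_padded(texts, vocab, maxlen):
    """Map each text to word indices and left-pad with zeros to maxlen,
    mimicking Tokenizer.texts_to_sequences + pad_sequences defaults."""
    padded = []
    for text in texts:
        seq = [vocab[w] for w in text.split() if w in vocab]
        seq = seq[-maxlen:]                             # truncating='pre'
        padded.append([0] * (maxlen - len(seq)) + seq)  # padding='pre'
    return padded

vocab = {'great': 1, 'movie': 2, 'boring': 3}  # toy word index
print(texts_to_padded(['great movie', 'boring'], vocab, maxlen=4))
# → [[0, 0, 1, 2], [0, 0, 0, 3]]
```

Understanding this layout matters when serving the model: inference input must be padded with exactly the same `maxlen` and tokenizer as training, which is why `train_sentiment_analysis` returns the tokenizer alongside the model.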
The back end exposes a RESTful API with Flask:
```python
from flask import Flask, request
from flask_restful import Api, Resource
from recommender import get_recommendations

app = Flask(__name__)
api = Api(app)

class MovieRecommend(Resource):
    def get(self, movie_id):
        try:
            n = request.args.get('n', default=5, type=int)
            method = request.args.get('method', default='content')
            recommendations = get_recommendations(movie_id, n, method)
            # flask_restful serializes plain dicts to JSON itself,
            # so no jsonify() is needed inside a Resource
            return {
                'status': 'success',
                'data': recommendations.to_dict('records')
            }
        except Exception as e:
            return {'status': 'error', 'message': str(e)}, 500

api.add_resource(MovieRecommend, '/api/recommend/<int:movie_id>')

if __name__ == '__main__':
    app.run(debug=True)
```
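The endpoint can be exercised without a running server via Flask's test client. A minimal self-contained sketch, with a hypothetical stub in place of the real `get_recommendations`:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def get_recommendations_stub(movie_id, n):
    # Stand-in returning fixed records; the real system queries the recommender
    return [{'id': movie_id + i, 'title': f'Movie {movie_id + i}'}
            for i in range(1, n + 1)]

@app.route('/api/recommend/<int:movie_id>')
def recommend(movie_id):
    n = request.args.get('n', default=5, type=int)
    return jsonify({'status': 'success',
                    'data': get_recommendations_stub(movie_id, n)})

client = app.test_client()
resp = client.get('/api/recommend/1', query_string={'n': 2})
print(resp.get_json())
```

Driving the route this way also makes the API easy to cover in unit tests, which is worth showing in a graduation-project defense.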
The front end builds the recommendation UI with Vue.js:
```vue
<template>
  <div class="movie-recommend">
    <el-select v-model="currentMovie" filterable placeholder="Select a movie">
      <el-option
        v-for="movie in movies"
        :key="movie.id"
        :label="movie.title"
        :value="movie.id">
      </el-option>
    </el-select>
    <el-radio-group v-model="method">
      <el-radio-button label="content">Content-based</el-radio-button>
      <el-radio-button label="collab">Collaborative filtering</el-radio-button>
    </el-radio-group>
    <el-button type="primary" @click="getRecommend">Get recommendations</el-button>
    <div class="recommend-list">
      <movie-card
        v-for="movie in recommendMovies"
        :key="movie.id"
        :movie="movie">
      </movie-card>
    </div>
  </div>
</template>

<script>
import axios from 'axios'
import MovieCard from './MovieCard.vue'

export default {
  components: { MovieCard },
  data() {
    return {
      movies: [],
      currentMovie: '',
      method: 'content',
      recommendMovies: []
    }
  },
  methods: {
    async getRecommend() {
      try {
        const res = await axios.get(`/api/recommend/${this.currentMovie}`, {
          params: { method: this.method }
        })
        this.recommendMovies = res.data.data
      } catch (error) {
        this.$message.error('Failed to fetch recommendations')
      }
    }
  }
}
</script>
```
ECharts displays statistics about the movie data:
```javascript
// In a Vue component
methods: {
  initChart() {
    const chart = this.$refs.chart
    if (chart) {
      const myChart = this.$echarts.init(chart)
      const option = {
        title: { text: 'Movie Rating Distribution' },
        tooltip: {},
        xAxis: {
          data: ['1 star', '2 stars', '3 stars', '4 stars', '5 stars']
        },
        yAxis: {},
        series: [{
          name: 'Count',
          type: 'bar',
          data: this.ratingDistribution
        }]
      }
      myChart.setOption(option)
      // Wrap resize in an arrow function so it keeps its chart instance
      window.addEventListener('resize', () => myChart.resize())
    }
  }
}
```
A trend chart shows how movie ratings evolve over time:
```javascript
// In a Vue component
initTrendChart() {
  axios.get('/api/movies/trend').then(res => {
    const data = res.data.data
    const chart = this.$refs.trendChart
    const myChart = this.$echarts.init(chart)
    const option = {
      title: { text: 'Average Rating by Year' },
      tooltip: {
        trigger: 'axis'
      },
      xAxis: {
        type: 'category',
        data: data.years
      },
      yAxis: {
        type: 'value',
        min: 0,
        max: 10
      },
      series: [{
        data: data.ratings,
        type: 'line',
        smooth: true,
        markPoint: {
          data: [
            { type: 'max', name: 'Highest' },
            { type: 'min', name: 'Lowest' }
          ]
        }
      }]
    }
    myChart.setOption(option)
  })
}
```
We recommend creating a Python virtual environment with conda:
```bash
conda create -n movie-recommender python=3.8
conda activate movie-recommender
pip install -r requirements.txt
```
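The project's actual requirements.txt is not listed here; based on the libraries imported throughout this article, a plausible starting point would be the following (package set inferred from the code above, pin versions as needed):

```text
requests
beautifulsoup4
pandas
scikit-learn
tensorflow
flask
flask-restful
gunicorn
pymysql
```

`pymysql` stands in for whichever MySQL driver the back end uses.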
Install the front-end dependencies:
```bash
cd frontend
npm install
```
Deploy with Nginx + Gunicorn:
```nginx
server {
    listen 80;
    server_name yourdomain.com;

    # Serve the built Vue bundle (copied to the web root below)
    location / {
        root /var/www/html;
        try_files $uri $uri/ /index.html;
    }

    # Proxy API calls to the Gunicorn-hosted Flask app
    location /api {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
Start the Flask application with Gunicorn:

```bash
gunicorn -w 4 -b 127.0.0.1:5000 app:app
```
Build the front end and copy the bundle into the web root:

```bash
cd frontend
npm run build
cp -r dist/* /var/www/html/
```
Directions for further optimization:
- Database optimization
- Recommendation algorithm optimization
- Front-end performance optimization
Possible feature extensions:
- A richer user system
- Extended data analysis
- Mobile adaptation
Common questions about data collection:
Q1: How do we cope with Douban's anti-scraping mechanisms?
Q2: What if the scraped data is incomplete?

About recommendation quality:
Q1: How do we handle the cold-start problem?
Q2: What if the recommendations are not accurate enough?

About deployment and operations:
Q1: How do we improve the system's concurrency capacity?
Q2: How do we monitor the system's runtime status?
During actual development, I found that well-designed database indexes brought the most visible query-performance gains, especially when working with user-behavior data. The quality of the recommendations, in turn, depends heavily on feature engineering, which deserves ample time for data exploration and analysis.