In natural language processing, semantic vector representation has long been one of the core challenges. Traditional approaches treat words as independent symbols and cannot capture contextual meaning. Encoder-decoder models built on the Transformer architecture, by contrast, use self-attention to achieve genuinely context-aware semantic encoding.
In my own projects, this end-to-end way of producing semantic vectors has clearly outperformed static embeddings such as word2vec or GloVe. Especially in complex semantic scenarios like polysemy and coreference resolution, dynamically generated token-level vectors accurately reflect what a word actually means in its current context.
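As a quick illustration of the polysemy point, here is a hedged sketch using bert-base-uncased (the same checkpoint used later in this post); the helper name `word_vector` and the example sentences are my own:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    # Assumes the word is a single wordpiece in the vocab (true for "bank")
    inputs = tok(sentence, return_tensors="pt")
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids(word))
    with torch.no_grad():
        return enc(**inputs).last_hidden_state[0, idx]

v_river = word_vector("he sat on the bank of the river", "bank")
v_money = word_vector("she deposited cash at the bank", "bank")
# The two contextual vectors for "bank" are not identical, unlike a static embedding
print(torch.cosine_similarity(v_river, v_money, dim=0))
```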
The encoder is a stack of self-attention modules and feed-forward networks. Taking BERT-base as an example:
```python
# Typical Transformer encoder layer (layer norm and dropout omitted for brevity)
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import MultiheadAttention

class TransformerLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff):
        super().__init__()
        self.self_attn = MultiheadAttention(d_model, nhead)
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Residual connection around multi-head self-attention
        x = x + self.self_attn(x, x, x)[0]
        # Residual connection around the position-wise feed-forward network
        x = x + self.linear2(F.relu(self.linear1(x)))
        return x
```
Key parameters:

- d_model: embedding dimension (typically 768)
- nhead: number of attention heads (typically 12)
- d_ff: feed-forward hidden dimension (typically 3072)

On top of the encoder, the decoder adds masked (causal) self-attention plus cross-attention over the encoder's output, as sketched below.
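Here is a minimal sketch of such a decoder layer, my own illustrative code rather than any library's implementation; `memory` stands for the encoder output and `causal_mask` for the usual look-ahead mask:

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import MultiheadAttention

class DecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff):
        super().__init__()
        self.self_attn = MultiheadAttention(d_model, nhead)
        self.cross_attn = MultiheadAttention(d_model, nhead)
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x, memory, causal_mask):
        # Masked self-attention: each position only attends to earlier positions
        x = x + self.self_attn(x, x, x, attn_mask=causal_mask)[0]
        # Cross-attention: queries from the decoder, keys/values from the encoder output
        x = x + self.cross_attn(x, memory, memory)[0]
        x = x + self.linear2(F.relu(self.linear1(x)))
        return x
```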
A practical observation: visualizing the attention distribution of the decoder's last layer clearly shows the strength of semantic associations between tokens.
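A minimal sketch of pulling those attention maps out of a HuggingFace model (using bert-base-uncased as a stand-in here, so it is the last encoder layer being inspected):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
att_model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

with torch.no_grad():
    out = att_model(**tok("The cat sat on the mat", return_tensors="pt"))

# out.attentions holds one [batch, heads, seq, seq] tensor per layer;
# averaging the heads of the last layer gives a token-to-token strength matrix
token_to_token = out.attentions[-1].mean(dim=1)[0]
```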
For hands-on work, the HuggingFace Transformers library is recommended:
```bash
pip install transformers torch
```
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

# Last-layer hidden states: one 768-dim contextual vector per token
# (6 words plus [CLS] and [SEP] -> 8 tokens)
token_embeddings = outputs.last_hidden_state  # [1, 8, 768]
```
A first application is sentence-level semantic similarity, using the [CLS] vector of each sentence:

```python
from scipy.spatial.distance import cosine

def semantic_similarity(text1, text2):
    # Use the [CLS] token vector of each sentence as its representation
    emb1 = model(**tokenizer(text1, return_tensors="pt"))[0][:, 0, :]
    emb2 = model(**tokenizer(text2, return_tensors="pt"))[0][:, 0, :]
    # scipy's cosine() expects 1-D vectors, so flatten before comparing
    return 1 - cosine(emb1.detach().numpy().flatten(),
                      emb2.detach().numpy().flatten())
```
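Example usage (keeping in mind that raw [CLS] vectors from an un-finetuned BERT are only a rough similarity signal, so absolute scores tend to cluster high):

```python
print(semantic_similarity("A cat sat on the mat", "A kitten rested on the rug"))
print(semantic_similarity("A cat sat on the mat", "Quarterly revenue grew by 10%"))
```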
Token-level vectors can also be clustered, for example to group related tokens within a document:

```python
from sklearn.cluster import KMeans

# Cluster the per-token vectors of the sentence encoded above into 5 groups
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(token_embeddings[0].detach().numpy())
```
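To inspect the result, the input IDs from the earlier snippet can be mapped back to token strings alongside their cluster labels:

```python
# Print each token next to its assigned cluster
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, cluster_id in zip(tokens, clusters):
    print(token, cluster_id)
```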
Using Flash Attention can improve speed by around 30%:
```python
import torch

model = AutoModel.from_pretrained(
    "bert-base-uncased",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```
To compress the vectors, product quantization (PQ) can be applied:
```python
from sklearn.cluster import KMeans

def product_quantize(vectors, m=8, k=256):
    # Classic product quantization: split each vector into m sub-vectors
    # and learn a k-centroid codebook for every sub-space
    d = vectors.shape[1]
    sub_dim = d // m
    codebooks = []
    for i in range(m):
        sub_vectors = vectors[:, i * sub_dim:(i + 1) * sub_dim]
        kmeans = KMeans(n_clusters=k).fit(sub_vectors)
        codebooks.append(kmeans.cluster_centers_)
    return codebooks
```
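To actually compress a vector with the learned codebooks, each sub-vector is replaced by the index of its nearest centroid; the `pq_encode` helper below is my own illustrative addition, not part of the snippet above:

```python
import numpy as np

def pq_encode(vector, codebooks):
    # Replace each sub-vector by the index of its nearest centroid,
    # turning a 768-dim float vector into m small integer codes
    m = len(codebooks)
    sub_dim = len(vector) // m
    codes = []
    for i, centers in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        codes.append(int(np.argmin(np.linalg.norm(centers - sub, axis=1))))
    return codes
```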
When the input exceeds the 512-token limit:
```python
import torch

def process_long_text(text, window_size=500):
    # Tokenize the full text, then run the model over fixed-size windows
    tokens = tokenizer(text, truncation=False)["input_ids"]
    embeddings = []
    for i in range(0, len(tokens), window_size):
        chunk = tokens[i:i + window_size]
        with torch.no_grad():
            # First-token vector of each window serves as its summary
            emb = model(torch.tensor([chunk]))[0][:, 0, :]
        embeddings.append(emb)
    # Average the per-window vectors into one document-level vector
    return torch.mean(torch.stack(embeddings), dim=0)
```
When a general-purpose model underperforms on a specialized domain, fine-tune it on in-domain data:
```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=domain_dataset,  # your prepared in-domain dataset
)
trainer.train()
```
Image-text matching can be done with contrastively trained models such as CLIP:
```python
import torch.nn.functional as F

clip_model = AutoModel.from_pretrained("openai/clip-vit-base-patch32")

def image_text_similarity(image, text):
    # `image` / `text` are preprocessed inputs (pixel_values, tokenized input_ids)
    image_emb = clip_model.get_image_features(image)
    text_emb = clip_model.get_text_features(text)
    return F.cosine_similarity(image_emb, text_emb)
```
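Since `get_image_features` / `get_text_features` expect preprocessed tensors rather than raw files and strings, the inputs can be prepared with the checkpoint's processor (a usage sketch; the image path is a placeholder):

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
batch = processor(text=["a photo of a cat"], images=Image.open("cat.jpg"),
                  return_tensors="pt")
score = image_text_similarity(batch["pixel_values"], batch["input_ids"])
```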
To build a semantic search system:
```python
import faiss

index = faiss.IndexFlatIP(768)  # inner product as the similarity metric
# FAISS expects 2-D float32 arrays of shape [n, 768]; here the [CLS]
# vectors serve as document embeddings
doc_vectors = token_embeddings[:, 0, :].detach().numpy()
index.add(doc_vectors)                   # add all document vectors
D, I = index.search(query_embedding, 5)  # query_embedding: [n_queries, 768]; top-5 results
```
In real deployments, I found that an IVF_PQ index structure can deliver roughly a 100x query speedup with less than 2% loss in retrieval accuracy.
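A rough sketch of building such an index with FAISS follows; the parameter values are illustrative rather than tuned recommendations, and `doc_vectors` / `query_embedding` are the placeholders from the snippet above:

```python
import faiss

d, nlist, m, nbits = 768, 1024, 64, 8    # 64 sub-quantizers of 12 dims each
quantizer = faiss.IndexFlatIP(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits,
                         faiss.METRIC_INNER_PRODUCT)
ivfpq.train(doc_vectors)    # needs a representative sample of vectors to train on
ivfpq.add(doc_vectors)
ivfpq.nprobe = 16           # number of inverted lists probed per query
D, I = ivfpq.search(query_embedding, 5)
```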