Vector retrieval is a critical step in any RAG pipeline, and traditional brute-force nearest-neighbor search cannot keep up at scale: its compute cost is too high and it scales poorly.
HNSW (Hierarchical Navigable Small World) offers an approximate search scheme with logarithmic query complexity; in the benchmark below it cuts query time to a small fraction of the brute-force time at the same accuracy. This article walks through the HNSW algorithm.
Traditional search methods break down in high dimensions, and the linear time complexity of exact nearest-neighbor search (NNS) makes cost unmanageable. HNSW graphs changed how search is done: they deliver real-time retrieval with logarithmic complexity over billions of vectors. Yet most engineers treat HNSW as a black box, tweaking the efSearch and M parameters without really understanding why it is so fast or how to tune it for their specific scenario.
Why HNSW is so fast
HNSW's efficiency comes from probabilistic connectivity and controlled exploration. Two hyperparameters determine the performance trade-off:
M (connectivity): the number of links per node. A larger M improves accuracy but increases memory usage and index-build time.
efSearch (dynamic search breadth): the number of candidates explored at query time. A higher efSearch improves recall at the cost of higher latency.
Low-latency real-time systems (chatbot retrieval, recommendation inference) do well with a small M (8-12) and a small efSearch (32-64). Batch analytics or high-recall RAG pipelines can use a large M (32+) and a large efSearch (200-400) to get near-exact results at a manageable cost, as the sketch below illustrates.
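As a rough illustration of these two profiles, here is a minimal sketch with random stand-in data; the parameter values are the illustrative ranges from above, not benchmarked recommendations:

import numpy as np
import hnswlib

X = np.random.rand(10_000, 128).astype(np.float32)  # stand-in corpus

def build_index(X, M, ef_construction, ef_search):
    # Build an HNSW index with a given connectivity / breadth profile.
    index = hnswlib.Index(space="l2", dim=X.shape[1])
    index.init_index(max_elements=X.shape[0], M=M, ef_construction=ef_construction)
    index.add_items(X, np.arange(X.shape[0]))
    index.set_ef(ef_search)  # query-time breadth; can be changed at any time
    return index

low_latency = build_index(X, M=8, ef_construction=100, ef_search=32)     # chatbots, online inference
high_recall = build_index(X, M=32, ef_construction=400, ef_search=256)   # batch analytics, high-recall RAG
labels, dists = high_recall.knn_query(X[:5], k=10)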
A well-tuned HNSW index can outperform FAISS IVF-PQ on medium-scale data (10 million to 100 million vectors), with lower maintenance overhead.
Mastering HNSW means going beyond the default configuration and understanding its dynamic behavior:
Visualize the layer structure. Use metrics such as average node degree and per-layer occupancy to spot imbalance.
Measure the recall-latency curve. Benchmark across several efSearch values to find the Pareto frontier for your workload.
Optimize the memory footprint. Index size grows as O(M × N); store embeddings in float16 where precision allows (a rough estimate follows this list).
Combine with quantization. Very large deployments can mix HNSW with product quantization (PQ) to compress memory while preserving recall.
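As a back-of-the-envelope aid for the memory point above, here is a minimal sketch; the per-element constant (raw vector plus roughly 2·M level-0 links of 4 bytes each) approximates hnswlib's documented rule of thumb and is not a measured figure:

def hnsw_memory_estimate_bytes(n_vectors: int, dim: int, M: int, bytes_per_float: int = 4) -> int:
    # Rough estimate: raw float vectors plus ~2*M level-0 links of 4 bytes each per element.
    # Higher-level links and allocator overhead are ignored, so treat this as a lower bound.
    per_vector = dim * bytes_per_float + M * 2 * 4
    return n_vectors * per_vector

# Example: 50M vectors, 768-dim float32, M=16 -> roughly 160 GB before any compression.
print(hnsw_memory_estimate_bytes(50_000_000, 768, 16) / 1e9, "GB (approx.)")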
Code walkthrough
This script shows how to replace brute-force k-NN with HNSW on the UCI Breast Cancer dataset. The full pipeline loads the data, runs a quick EDA with a 2D PCA projection to reveal cluster structure, standardizes the features, and optionally applies PCA to produce compact embeddings (distance-based methods are sensitive to scale).
A custom HNSWClassifier wraps the hnswlib library. The fit stage builds the hierarchical small-world index (tuned via M and ef_construction); at prediction time it retrieves k approximate neighbors with a controllable breadth ef_search and assigns the class label by majority vote. The code cross-validates over (M, ef_search, k) to find a high-F1, low-latency operating point, then trains on the training set and evaluates on the test set. A brute-force k-NN baseline is also run as a control to quantify the speed and quality trade-off. Finally, an ef_search sweep visualizes the Pareto curve (latency rising while accuracy plateaus) and a confusion matrix is plotted. All of the logic is wrapped in a run_hnsw_benchmark() function, so a single call runs the whole pipeline.
"""
HNSW in Practice — End-to-end example on a real UCI dataset (Breast Cancer Wisconsin)
---------------------------------------------------------------------
This single-file script demonstrates how to use HNSW (via `hnswlib`) as a fast, high-recall
approximate nearest neighbor (ANN) engine to *replace brute-force k-NN* in a classification
workflow. It connects the article's core idea, "structured search in unstructured space",
to concrete, reproducible steps on a widely available benchmark dataset.
Every stage is annotated to explain not only *what* it does but also *why* it matters to practitioners.
Dependencies:
pip install hnswlib scikit-learn matplotlib numpy scipy
Run:
python hnsw_hands_on.py
"""
# ======================
# PHASE 0 — Imports
# ======================
# Why: We gather the scientific Python stack, sklearn utilities, and hnswlib for HNSW ANN search.
import time
import math
import warnings
from dataclasses import dataclass
from typing import Dict, Any, Tuple, List
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import (
accuracy_score,
f1_score,
classification_report,
confusion_matrix,
)
from sklearn.neighbors import KNeighborsClassifier
try:
import hnswlib # Fast C++ ANN library with HNSW
except ImportError as e:
raise SystemExit(
"此示例需要hnswlib。\n使用以下命令安装:pip install hnswlib"
) from e
# ======================
# Utility Data Structures
# ======================
@dataclass
class CVResult:
params: Dict[str, Any]
mean_f1: float
mean_acc: float
# ======================
# PHASE 1 — Data Loading
# ======================
# Why: We use a *real* benchmark (UCI Breast Cancer Wisconsin), bundled in scikit-learn.
# It's binary, tabular, and has enough dimensionality to show ANN benefits without GPU.
def load_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
data = load_breast_cancer()
X = data["data"] # shape (n_samples, n_features)
y = data["target"] # 二元标签:0 = 恶性,1 = 良性
feature_names = list(data["feature_names"])
return X, y, feature_names
# ======================
# PHASE 2 — EDA (Quick)
# ======================
# Why: In an ANN pipeline, we care about scale/distribution because distance metrics are
# sensitive to feature magnitudes. Quick EDA guides our choice to standardize and optionally reduce dim.
def quick_eda(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
print("\n=== EDA: 基本信息 ===")
print(f"样本数:{X.shape[0]},特征数:{X.shape[1]}")
unique, counts = np.unique(y, return_counts=True)
class_dist = dict(zip(unique, counts))
print(f"类别分布(0=恶性,1=良性):{class_dist}")
# Visualize first two principal components to *reason about* structure.
# Why: Low-dimensional projection lets us see clusters and overlap, which affect k-NN behavior.
with warnings.catch_warnings():
warnings.simplefilter("ignore", category=RuntimeWarning)
pca_2d = PCA(n_components=2, random_state=42).fit_transform(StandardScaler().fit_transform(X))
    plt.figure(figsize=(6, 5))
    plt.title("EDA: 2D PCA projection (standardized)")
    plt.scatter(pca_2d[:, 0], pca_2d[:, 1], c=y, s=15)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.tight_layout()
    plt.show()
# ======================
# PHASE 3 — Feature Engineering
# ======================
# Why:
# - Standardization: Euclidean distances underpin HNSW's "l2" metric. Scaling avoids dominance by high-variance features.
# - Optional PCA: Reduces noise and may improve neighborhood quality and latency. It's a trade-off:
# too aggressive PCA can hide discriminative detail; modest reduction can help ANN structure.
def make_embeddings(
X_train: np.ndarray, X_test: np.ndarray, n_pca: int = None
) -> Tuple[np.ndarray, np.ndarray, StandardScaler, PCA]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
pca_model = None
if n_pca is not None:
pca_model = PCA(n_components=n_pca, random_state=42)
X_train_emb = pca_model.fit_transform(X_train_scaled)
X_test_emb = pca_model.transform(X_test_scaled)
else:
X_train_emb = X_train_scaled
X_test_emb = X_test_scaled
    print(
        f"Embedding dimension: {X_train_emb.shape[1]} "
        f"(PCA={'on' if n_pca else 'off'})"
    )
return X_train_emb, X_test_emb, scaler, pca_model
# ======================
# PHASE 4 — Model Selection: HNSW k-NN Classifier
# ======================
# Why: HNSW provides a *hierarchical small-world graph* over points. Search descends coarse-to-fine
# layers, approximating nearest neighbors with logarithmic scaling—key to fast retrieval at scale.
#
# We'll implement a lightweight k-NN classifier on top of HNSW by majority vote over neighbors.
class HNSWClassifier:
def __init__(
self,
space: str = "l2",
M: int = 16,
ef_construction: int = 200,
ef_search: int = 64,
k: int = 5,
random_state: int = 42,
):
self.space = space
self.M = M
self.ef_construction = ef_construction
self.ef_search = ef_search
self.k = k
self.random_state = random_state
self.index = None
self.labels = None
self.dim = None
def fit(self, X: np.ndarray, y: np.ndarray):
n, d = X.shape
self.dim = d
self.labels = y.astype(np.int32)
# Build HNSW index
self.index = hnswlib.Index(space=self.space, dim=self.dim)
# 'init_index' builds the topological skeleton. ef_construction controls graph quality/speed.
self.index.init_index(max_elements=n, M=self.M, ef_construction=self.ef_construction, random_seed=self.random_state)
self.index.add_items(X, np.arange(n))
self.index.set_ef(self.ef_search) # ef_search controls query-time breadth (recall vs latency)
def predict(self, Xq: np.ndarray) -> np.ndarray:
# Query k approximate neighbors for each vector; majority vote on labels.
# NOTE: For binary, we could also weight votes by inverse distance.
labels_idx, distances = self.index.knn_query(Xq, k=self.k)
# labels_idx: (n_queries, k) -> indices into self.labels
yhat = []
for neigh in labels_idx:
votes = np.bincount(self.labels[neigh], minlength=np.max(self.labels) + 1)
yhat.append(np.argmax(votes))
return np.array(yhat)
def predict_proba(self, Xq: np.ndarray) -> np.ndarray:
# Simple probability from neighbor class frequency (can be replaced by distance weighting).
labels_idx, distances = self.index.knn_query(Xq, k=self.k)
n_classes = np.max(self.labels) + 1
probas = np.zeros((Xq.shape[0], n_classes), dtype=float)
for i, neigh in enumerate(labels_idx):
votes = np.bincount(self.labels[neigh], minlength=n_classes).astype(float)
probas[i] = votes / (votes.sum() + 1e-12)
return probas
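    # Optional variant (not from the original article's code): inverse-distance weighted voting,
    # as suggested by the NOTE in predict(). A minimal sketch; the method name `predict_weighted`
    # and the epsilon constant are illustrative choices.
    def predict_weighted(self, Xq: np.ndarray) -> np.ndarray:
        labels_idx, distances = self.index.knn_query(Xq, k=self.k)
        n_classes = np.max(self.labels) + 1
        yhat = []
        for neigh, dist in zip(labels_idx, distances):
            weights = 1.0 / (dist + 1e-12)  # closer neighbors get proportionally larger votes
            scores = np.bincount(self.labels[neigh], weights=weights, minlength=n_classes)
            yhat.append(int(np.argmax(scores)))
        return np.array(yhat)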
# ======================
# PHASE 5 — Hyperparameter Tuning with Cross-Validation
# ======================
# Why: HNSW has two critical controls:
# - M: graph connectivity degree (memory & index-time vs accuracy)
# - ef_search: query breadth (latency vs recall/accuracy)
# We tune them (and k) using Stratified K-Fold to locate a Pareto-efficient setting.
def hnsw_cv_tuning(
X: np.ndarray,
y: np.ndarray,
param_grid: Dict[str, List[Any]],
n_splits: int = 5,
n_pca: int = None,
random_state: int = 42,
) -> CVResult:
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
results: List[CVResult] = []
# Small, principled grid: balances realism (we'd try more) and time.
grid_list = []
for M in param_grid.get("M", [16]):
for efc in param_grid.get("ef_construction", [200]):
for efs in param_grid.get("ef_search", [32, 64, 128]):
for k in param_grid.get("k", [5]):
grid_list.append(dict(M=M, ef_construction=efc, ef_search=efs, k=k))
print(f"\n=== HNSW CV调优:{len(grid_list)}个配置,{n_splits}折 ===")
for cfg in grid_list:
fold_f1, fold_acc = [], []
for train_idx, val_idx in skf.split(X, y):
X_tr, X_val = X[train_idx], X[val_idx]
y_tr, y_val = y[train_idx], y[val_idx]
# Feature pipeline per fold to avoid leakage
X_tr_emb, X_val_emb, _, _ = make_embeddings(X_tr, X_val, n_pca=n_pca)
clf = HNSWClassifier(
space="l2",
M=cfg["M"],
ef_construction=cfg["ef_construction"],
ef_search=cfg["ef_search"],
k=cfg["k"],
random_state=random_state,
)
clf.fit(X_tr_emb, y_tr)
y_pred = clf.predict(X_val_emb)
fold_f1.append(f1_score(y_val, y_pred))
fold_acc.append(accuracy_score(y_val, y_pred))
res = CVResult(params=cfg, mean_f1=float(np.mean(fold_f1)), mean_acc=float(np.mean(fold_acc)))
results.append(res)
print(f"配置 {cfg} -> F1: {res.mean_f1:.4f} | Acc: {res.mean_acc:.4f}")
# Select best by F1 (can also break ties by accuracy or latency measurements)
best = max(results, key=lambda r: (r.mean_f1, r.mean_acc))
print(f"\n最佳HNSW CV参数:{best.params}(F1={best.mean_f1:.4f},Acc={best.mean_acc:.4f})")
return best
# ======================
# PHASE 6 — Baseline (Brute Force k-NN) for Comparison
# ======================
# Why: As practitioners, we need to *quantify* what HNSW buys us. k-NN (brute) is the canonical baseline.
def brute_knn_baseline(
X_train_emb: np.ndarray,
y_train: np.ndarray,
X_test_emb: np.ndarray,
y_test: np.ndarray,
k: int = 5,
) -> Dict[str, Any]:
t0 = time.perf_counter()
knn = KNeighborsClassifier(n_neighbors=k, algorithm="brute", metric="euclidean")
knn.fit(X_train_emb, y_train)
train_time = time.perf_counter() - t0
t1 = time.perf_counter()
y_pred = knn.predict(X_test_emb)
query_time = time.perf_counter() - t1
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
return {
"model": knn,
"acc": acc,
"f1": f1,
"train_time_s": train_time,
"query_time_s": query_time,
"k": k,
}
# ======================
# PHASE 7 — Train Final HNSW, Evaluate, and Visualize Trade-offs
# ======================
# Why: After selecting params, we fit on train and test on hold-out. We also *sweep ef_search*
# to visualize the core speed-recall dial that defines HNSW's value proposition.
def fit_eval_hnsw_and_visualize(
X_train: np.ndarray,
y_train: np.ndarray,
X_test: np.ndarray,
y_test: np.ndarray,
best_cfg: Dict[str, Any],
sweep_ef_search: List[int],
) -> Dict[str, Any]:
M = best_cfg["M"]
efc = best_cfg["ef_construction"]
k = best_cfg["k"]
# Train with best ef_search for final scoring
efs_final = best_cfg["ef_search"]
# Build once for "final" numbers
clf = HNSWClassifier(M=M, ef_construction=efc, ef_search=efs_final, k=k)
t0 = time.perf_counter()
clf.fit(X_train, y_train)
train_time = time.perf_counter() - t0
# Evaluate with selected ef_search
t1 = time.perf_counter()
y_pred = clf.predict(X_test)
final_query_time = time.perf_counter() - t1
final_acc = accuracy_score(y_test, y_pred)
final_f1 = f1_score(y_test, y_pred)
print("\n=== 最终HNSW(最佳CV参数)===")
print(f"参数:{best_cfg}")
print(f"训练时间:{train_time:.4f}s | 查询时间:{final_query_time:.4f}s")
print(f"测试准确率:{final_acc:.4f} | F1:{final_f1:.4f}")
print("\n分类报告:\n", classification_report(y_test, y_pred, digits=4))
# Sweep ef_search to visualize latency vs accuracy (the *core* HNSW insight)
sweep_times, sweep_acc, sweep_f1 = [], [], []
for efs in sweep_ef_search:
clf.index.set_ef(efs) # adjust query breadth live
t2 = time.perf_counter()
y_hat = clf.predict(X_test)
dt = time.perf_counter() - t2
sweep_times.append(dt)
sweep_acc.append(accuracy_score(y_test, y_hat))
sweep_f1.append(f1_score(y_test, y_hat))
print(f"ef_search={efs:<4d} -> Acc={sweep_acc[-1]:.4f}, F1={sweep_f1[-1]:.4f}, QueryTime={dt:.4f}s")
# Plot the Pareto-ish curve: ef_search (x) vs (latency, quality)
fig, ax1 = plt.subplots(figsize=(7, 5))
ax1.set_title("HNSW权衡:ef_search vs 延迟和质量")
ax1.set_xlabel("ef_search(查询广度)")
ax1.set_ylabel("查询时间(s)")
ax1.plot(sweep_ef_search, sweep_times, marker="o", label="查询时间")
ax2 = ax1.twinx()
ax2.set_ylabel("得分")
ax2.plot(sweep_ef_search, sweep_acc, marker="s", label="准确率")
ax2.plot(sweep_ef_search, sweep_f1, marker="^", label="F1")
# Legend handling for twin axis
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines + lines2, labels + labels2, loc="best")
plt.tight_layout()
plt.show()
# Confusion matrix as an operational sanity check (post best ef_search)
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4, 4))
plt.title("混淆矩阵(HNSW,最佳ef_search)")
plt.imshow(cm, interpolation="nearest")
plt.colorbar()
plt.xticks([0, 1], ["恶性", "良性"])
plt.yticks([0, 1], ["恶性", "良性"])
for (i, j), val in np.ndenumerate(cm):
plt.text(j, i, int(val), ha="center", va="center")
plt.tight_layout()
plt.show()
return {
"clf": clf,
"train_time_s": train_time,
"query_time_s": final_query_time,
"acc": final_acc,
"f1": final_f1,
"sweep": {
"ef_search": sweep_ef_search,
"times": sweep_times,
"acc": sweep_acc,
"f1": sweep_f1,
},
}
# ======================
# PHASE 8 — Narrative Glue (Challenge → Insight → Application)
# ======================
# Challenge:
# Brute-force nearest neighbor search scales poorly; even with modest dimensionality, latency explodes
# as data grows. Distance computations become the bottleneck in RAG and real-time ML systems.
#
# Insight:
# HNSW imposes hierarchy and navigable small-world structure on the vector space. We tune (M, ef_search)
# to move along the recall/latency curve. Standardize inputs; consider PCA to enhance neighborhood purity.
#
# Application:
# Swap KNN brute with HNSW-backed ANN. Cross-validate to pick an operating point. Measure with a
# hold-out and visualize ef_search sweeps to institutionalize a *speed-quality contract* with users.
# ======================
# PHASE 9 — End-to-end Wrapper
# ======================
def run_hnsw_benchmark(
test_size: float = 0.25,
random_state: int = 42,
n_pca: int = 16, # Try None to keep full dims; 16 often yields stable neighborhoods on this dataset
) -> None:
# 1) Load data
X, y, feature_names = load_data()
# 2) Quick EDA to connect intuition to modeling choices
quick_eda(X, y, feature_names)
# 3) Train/Test split (hold-out after CV)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=test_size, stratify=y, random_state=random_state
)
# 4) HNSW CV Tuning (on training only)
# We keep the grid small here; practitioners would expand it under time budgets.
param_grid = {
"M": [12, 16, 32],
"ef_construction": [100, 200],
"ef_search": [32, 64, 128, 256],
"k": [3, 5, 7],
}
best = hnsw_cv_tuning(
X_train, y_train, param_grid, n_splits=5, n_pca=n_pca, random_state=random_state
)
# 5) Build the final embedding pipeline on train/test with the chosen PCA config
X_train_emb, X_test_emb, scaler, pca_model = make_embeddings(
X_train, X_test, n_pca=n_pca
)
# 6) Baseline brute-force kNN, to *quantify* ROI
baseline = brute_knn_baseline(
X_train_emb, y_train, X_test_emb, y_test, k=best.params["k"]
)
print("\n=== 基线:暴力k-NN ===")
print(
f"k={baseline['k']} | 训练 {baseline['train_time_s']:.4f}s | "
f"查询 {baseline['query_time_s']:.4f}s | "
f"准确率={baseline['acc']:.4f} | F1={baseline['f1']:.4f}"
)
# 7) Final HNSW fit/eval and *visualize the ef_search dial* (the central HNSW idea)
sweep = [16, 32, 64, 128, 256, 384]
hnsw_out = fit_eval_hnsw_and_visualize(
X_train_emb, y_train, X_test_emb, y_test, best.params, sweep
)
# 8) Final narrative printout to tie back to the essay:
print("\n=== 叙述:挑战→洞察→应用 ===")
print(
"- 挑战:暴力k-NN需要扫描所有点并成为系统瓶颈。\n"
"- 洞察:HNSW构建分层小世界图,让我们权衡搜索广度(ef_search)\n"
" 与准确性和延迟——高效地从粗到细邻域缩放。\n"
"- 应用:我们用交叉验证调优(M,ef_search,k),选择高F1操作点,\n"
" 并使用ef_search扫描制度化速度-质量合同。"
)
# 9) Summarize ROI
print("\n=== ROI总结(留出测试)===")
print(
f"暴力kNN:准确率={baseline['acc']:.4f},F1={baseline['f1']:.4f},"
f"查询={baseline['query_time_s']:.4f}s"
)
print(
f"HNSW最佳:准确率={hnsw_out['acc']:.4f},F1={hnsw_out['f1']:.4f},"
f"查询={hnsw_out['query_time_s']:.4f}s(ef_search={best.params['ef_search']})"
)
print(
"解释:如果HNSW以更低的查询延迟实现可比较或更好的质量,\n"
"它验证了核心工程声明:*随着规模增长,更智能的检索胜过暴力*。"
)
# Entrypoint
if __name__ == "__main__":
run_hnsw_benchmark()
Experimental results: the hierarchy really works
Think of the dataset as a library in which every vector is a book. Brute-force k-NN is the intern who has to walk every shelf for every query: dependable but slow. HNSW is the seasoned librarian who starts from the top floor (a coarse map) and heads straight for the right section.
On the breast cancer dataset, HNSW matched brute-force search on test accuracy and F1 (≈0.979 / 0.983) while cutting query time to about 1/17 (≈5 ms vs 89 ms over the full test set). The article's central idea, structured search in unstructured space, turns from theory into a tangible performance gain.
What actually worked
Hierarchy plus a small search breadth wins
Cross-validation repeatedly picked a small search breadth (ef_search=32) with moderate connectivity (M=12) and k=5. The parameter sweep showed that pushing ef_search from 16 up to 384 barely moved accuracy while latency kept climbing. Once the search lands in the right region, examining a much larger candidate set finds no better neighbors on this data; it only wastes time.
Lightweight feature processing stabilizes the neighborhoods
Standardization plus PCA down to 16 dimensions produced stable neighborhood structure and consistent cross-validation scores, making the HNSW graph easier to navigate. No fancy metric learning was needed: basic data hygiene (scaling plus mild denoising) was enough to expose structure that HNSW can exploit.
Quality on par with brute force
The confusion matrix is plain: 2 false negatives and 1 false positive, exactly the same level as brute-force k-NN. Diagnostic quality is not compromised, but a clinical system or downstream application no longer has to wait as long.
What did not pay off
A larger ef_search did not buy accuracy
Beyond roughly 32, accuracy and F1 plateau. On this relatively clean, well-separated dataset, the first batch of good neighbors is already good enough; widening the search further looks diligent but is really over-management that just burns CPU time.
Denser connectivity (M) was unnecessary too
We tried M=12, 16, and 32. The graph did not need more links to find the same answers, so for this task the extra memory and index-build time are wasted.
Brute force is still fine at small scale
With a few hundred samples, brute-force k-NN's training and query times remain acceptable. The point is not that brute force is "bad", but that HNSW's advantage grows with data size, and the trend is visible even on a small dataset.
Where to go next
Large-scale trials (millions to hundreds of millions of vectors)
The advantage should widen further at scale. Focus on measuring the recall@k gap versus exact k-NN, P95/P99 tail latency, and per-vector memory footprint; a minimal measurement sketch follows.
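A minimal sketch of what such measurements could look like (the helper names are illustrative; exact_ids would come from a brute-force pass used as ground truth):

import time
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    # Fraction of true top-k neighbors that the ANN index also returned.
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / exact_ids.size

def tail_latency_ms(index, queries: np.ndarray, k: int = 10):
    # Per-query latencies, issued one at a time to mimic online serving, then P95/P99.
    times = []
    for q in queries:
        t0 = time.perf_counter()
        index.knn_query(q[np.newaxis, :], k=k)
        times.append((time.perf_counter() - t0) * 1000.0)
    return np.percentile(times, 95), np.percentile(times, 99)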
Metric and embedding variants
Try normalized vectors with cosine distance, swap PCA for an autoencoder, or evaluate domain-specific embeddings. Different geometries shift HNSW's optimal operating point; see the sketch below.
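For instance, hnswlib exposes a cosine space directly; a minimal sketch with random data, just to show the API (hnswlib reports 1 - cosine similarity as the distance):

import numpy as np
import hnswlib

X = np.random.rand(5_000, 384).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize so cosine and inner-product rankings agree

index = hnswlib.Index(space="cosine", dim=X.shape[1])  # "l2" and "ip" are the other supported spaces
index.init_index(max_elements=X.shape[0], M=16, ef_construction=200)
index.add_items(X)
index.set_ef(64)
labels, dists = index.knn_query(X[:3], k=5)  # dists are 1 - cosine similarity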
Compression and hybrid indexes
Combine HNSW with PQ/OPQ to compress memory, and quantify what that costs in recall and latency. For very large corpora, also evaluate disk-backed HNSW (e.g. mmap) and hot-cache behavior; a hedged FAISS sketch follows.
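One possible starting point is FAISS's HNSW-over-PQ index. This is only a sketch under the assumption that your FAISS build exposes IndexHNSWPQ with a (dim, pq_m, M) constructor; verify the signature for your version before relying on it:

import numpy as np
import faiss  # pip install faiss-cpu

X = np.random.rand(100_000, 128).astype(np.float32)

# Assumed constructor: IndexHNSWPQ(dim, pq_m, M) stores PQ codes (here 16 sub-quantizers)
# instead of raw vectors inside an HNSW graph with connectivity M=32.
index = faiss.IndexHNSWPQ(128, 16, 32)
index.hnsw.efConstruction = 200
index.train(X)           # PQ codebooks require a training pass
index.add(X)
index.hnsw.efSearch = 64
D, I = index.search(X[:5], 10)
# Compare recall and latency against faiss.IndexHNSWFlat(128, 32) to quantify the cost of compression.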
Online update behavior
Stress-test dynamic inserts and deletes and measure the rebalancing overhead. In streaming settings, track how recall drifts over time; a small hnswlib sketch appears below.
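A small hnswlib sketch of the moving parts involved (mark_deleted performs a soft delete and resize_index grows capacity; both exist in recent hnswlib releases, though exact behavior may vary by version):

import numpy as np
import hnswlib

dim, capacity = 64, 1_000
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=capacity, M=16, ef_construction=200)

X0 = np.random.rand(500, dim).astype(np.float32)
index.add_items(X0, np.arange(500))          # initial batch

index.mark_deleted(42)                       # soft-delete: item 42 is skipped at query time
index.resize_index(2_000)                    # grow capacity before streaming in more items
X1 = np.random.rand(1_000, dim).astype(np.float32)
index.add_items(X1, np.arange(500, 1_500))   # incremental inserts

index.set_ef(64)
labels, dists = index.knn_query(X1[:3], k=5)
# In a real stream, periodically re-measure recall against an exact baseline to catch drift.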
Adaptive strategies
Train a controller that sets ef_search per query from difficulty signals (entropy, margin, reranker uncertainty): spend little on easy queries and give hard ones a larger budget. A toy heuristic is sketched below.
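A toy version of such a controller (a sketch; the two-pass heuristic, thresholds, and function name are illustrative, and set_ef mutates shared index state, so concurrent serving would need per-query handling):

import numpy as np

def adaptive_ef_query(index, labels, q, k=5, ef_low=32, ef_high=256, margin_threshold=0.2):
    # First pass with a cheap ef; escalate only when the neighbor vote looks uncertain.
    index.set_ef(ef_low)
    idx, _ = index.knn_query(q[np.newaxis, :], k=k)
    votes = np.bincount(labels[idx[0]], minlength=labels.max() + 1) / k
    top2 = np.sort(votes)[-2:]
    if top2[-1] - top2[-2] >= margin_threshold:   # confident: keep the cheap answer
        return int(np.argmax(votes))
    index.set_ef(ef_high)                         # uncertain: re-query with a larger budget
    idx, _ = index.knn_query(q[np.newaxis, :], k=k)
    votes = np.bincount(labels[idx[0]], minlength=labels.max() + 1)
    return int(np.argmax(votes))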
=== EDA: basic info ===
Samples: 569, features: 30
Class distribution (0=malignant, 1=benign): {np.int64(0): np.int64(212), np.int64(1): np.int64(357)}
=== HNSW CV tuning: 72 configs, 5 folds ===
Embedding dimension: 16 (PCA=on)
Config {'M': 12, 'ef_construction': 100, 'ef_search': 32, 'k': 3} -> F1: 0.9688 | Acc: 0.9601
Config {'M': 12, 'ef_construction': 100, 'ef_search': 32, 'k': 5} -> F1: 0.9726 | Acc: 0.9648
(intermediate configuration results omitted)
Best HNSW CV params: {'M': 12, 'ef_construction': 100, 'ef_search': 32, 'k': 5} (F1=0.9726, Acc=0.9648)
=== Baseline: brute-force k-NN ===
k=5 | train 0.0012s | query 0.0893s | acc=0.9790 | F1=0.9834
=== Final HNSW (best CV params) ===
Params: {'M': 12, 'ef_construction': 100, 'ef_search': 32, 'k': 5}
Train time: 0.0210s | Query time: 0.0052s
Test accuracy: 0.9790 | F1: 0.9834
Classification report:
              precision    recall  f1-score   support
           0     0.9808    0.9623    0.9714        53
           1     0.9780    0.9889    0.9834        90
    accuracy                         0.9790       143
ef_search sweep:
ef_search=16 -> Acc=0.9790, F1=0.9834, QueryTime=0.0043s
ef_search=32 -> Acc=0.9790, F1=0.9834, QueryTime=0.0053s
ef_search=64 -> Acc=0.9790, F1=0.9834, QueryTime=0.0070s
ef_search=128 -> Acc=0.9790, F1=0.9834, QueryTime=0.0107s
ef_search=256 -> Acc=0.9790, F1=0.9834, QueryTime=0.0231s
ef_search=384 -> Acc=0.9790, F1=0.9834, QueryTime=0.0223s
=== ROI summary (hold-out test) ===
Brute-force kNN: acc=0.9790, F1=0.9834, query=0.0893s
HNSW best: acc=0.9790, F1=0.9834, query=0.0052s (ef_search=32)
Conclusion: HNSW keeps the same quality while cutting query latency to roughly 1/17 of brute force,
validating the core claim: as scale grows, smart retrieval beats brute-force scanning.
Conclusion
With the number of embedding vectors still growing exponentially, HNSW makes a simple point: intelligence is not knowing every answer, it is knowing where to start looking.
The future of efficient AI systems may rest less on ever-larger models and more on smart retrieval. Seen this way, HNSW is not just an optimization trick; it changes the underlying logic of how machines store and query memory.
https://avoid.overfit.cn/post/6e8a792fb0eb4f4ab911cce7f3e98644
Author: Everton Gomede, PhD