HNSW in Practice: Replacing Brute-Force k-NN with a Hierarchical Graph Index

Overview: HNSW is an efficient vector-retrieval algorithm. It uses a hierarchical graph structure to perform approximate nearest-neighbor search in logarithmic time, dramatically reducing query latency. Compared with brute-force search it delivers speedups of tens of times while maintaining high recall, and it is widely used in large-scale RAG systems.

Vector retrieval is a critical stage of the RAG pipeline, and traditional brute-force nearest-neighbor search cannot handle large-scale workloads: its computational cost is too high and it scales poorly.

HNSW (Hierarchical Navigable Small World) offers an approximate-search scheme with logarithmic time complexity, typically cutting query time by an order of magnitude. This article walks through the HNSW algorithm in practice.

Traditional search methods break down in high dimensions, and the linear time complexity of exact nearest-neighbor search (NNS) makes costs unmanageable. The HNSW graph changed how search is done: it achieves real-time retrieval with logarithmic complexity over billions of vectors. Yet most engineers treat it as a black box, tweaking the efSearch and M parameters without really understanding why it is so fast or how to optimize it for a specific workload.

Why HNSW Is So Fast

HNSW's efficiency comes from probabilistic connectivity and controlled exploration. Two hyperparameters determine the performance trade-off:

M (connectivity): the number of links per node. A larger M improves accuracy but increases memory usage and index-build time.

efSearch (dynamic search breadth): the number of candidates explored at query time. A higher efSearch yields better recall at the cost of higher latency.

Low-latency real-time systems (chatbot retrieval, recommendation-engine inference) suit a small M (8-12) with a small efSearch (32-64). Batch analytics or high-recall RAG pipelines can use a large M (32+) and a large efSearch (200-400) to get near-exact results at a manageable cost, as in the sketch below.
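As a minimal illustration of these two operating points, the sketch below builds two hnswlib indexes over placeholder data; the parameter values are the rule-of-thumb ranges above, not values tuned for any particular dataset.

import numpy as np
import hnswlib

dim = 128
data = np.random.rand(10_000, dim).astype(np.float32)  # placeholder vectors
ids = np.arange(len(data))

# Low-latency profile: small M, small efSearch
fast_index = hnswlib.Index(space="l2", dim=dim)
fast_index.init_index(max_elements=len(data), M=8, ef_construction=100)
fast_index.add_items(data, ids)
fast_index.set_ef(32)             # narrow query breadth -> lowest latency

# High-recall profile: large M, large efSearch
accurate_index = hnswlib.Index(space="l2", dim=dim)
accurate_index.init_index(max_elements=len(data), M=32, ef_construction=400)
accurate_index.add_items(data, ids)
accurate_index.set_ef(256)        # wide query breadth -> near-exact recall

labels, distances = fast_index.knn_query(data[:5], k=10)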

A well-tuned HNSW index can outperform FAISS IVF-PQ on mid-scale data (10 million to 100 million vectors), with lower maintenance overhead.

Mastering HNSW means going beyond the default configuration and understanding its dynamic behavior:

Visualize the layer structure. Use metrics such as average node degree and per-layer occupancy to spot imbalance.

Measure the recall-latency curve. Benchmark across several efSearch values to find the Pareto frontier for your workload.

Optimize the memory footprint. Index size grows as O(M × N); use float16 embeddings where precision allows.

Combine with quantization. Very large systems can mix HNSW with product quantization (PQ) to compress memory while preserving recall. A rough memory estimate is sketched below.
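To make the O(M × N) growth concrete, here is a rough back-of-the-envelope estimator. The per-element byte counts are assumptions based on hnswlib storing float32 vectors and 4-byte neighbor ids, not exact figures for any particular release; float16 savings mainly apply to embedding copies kept outside the index.

def estimate_hnsw_memory_gb(n_vectors: int, dim: int, M: int) -> float:
    """Rough HNSW memory estimate (approximation, not an exact hnswlib figure)."""
    bytes_vector = dim * 4           # float32 data stored inside the index
    bytes_links_level0 = 2 * M * 4   # level 0 keeps up to 2*M neighbor ids, 4 bytes each
    bytes_links_upper = M * 4        # crude allowance for links on the upper layers
    per_element = bytes_vector + bytes_links_level0 + bytes_links_upper
    return n_vectors * per_element / 1e9

# Example: 100 million 768-dim vectors with M=16 land in the hundreds of GB
print(f"{estimate_hnsw_memory_gb(100_000_000, 768, 16):.1f} GB")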

Code Walkthrough

The script demonstrates how to replace brute-force k-NN with HNSW on the UCI Breast Cancer dataset. The full pipeline: load the data, run a quick EDA with a 2D PCA projection to reveal cluster structure, standardize the features, and optionally apply PCA to produce compact embeddings (distance-based methods are sensitive to feature scale).

A custom HNSWClassifier wraps the hnswlib library. The fit stage builds the hierarchical small-world index (tuned via M and ef_construction); at prediction time it retrieves the k approximate neighbors with a controllable breadth ef_search and assigns the class label by majority vote. The code cross-validates over (M, ef_construction, ef_search, k) to find a high-F1, low-latency operating point, then trains on the training set and evaluates on the test set. It also runs a brute-force k-NN baseline as a control to quantify the difference in speed and quality. Finally, it sweeps ef_search to visualize the Pareto curve (showing how latency rises while accuracy plateaus) and plots a confusion matrix. All of the logic is wrapped in a single run_hnsw_benchmark() function, so one call runs the entire pipeline.

 """  
HNSW in Practice — End-to-end example on a real UCI dataset (Breast Cancer Wisconsin)  
---------------------------------------------------------------------  

这个单文件脚本演示了如何使用HNSW(通过`hnswlib`)作为一个快速、高召回率  
的近似最近邻(ANN)引擎来*替换暴力k-NN*在分类工作流程中。它连接了  
从文章核心思想——"非结构化空间中的结构化搜索"——到具体、可重现的  
在广泛可用的基准数据集上的步骤。  

我们注释每个阶段并解释不仅*做什么*,而且*为什么*对从业者重要。  

依赖:  
    pip install hnswlib scikit-learn matplotlib numpy scipy  

运行:  
    python hnsw_hands_on.py  
"""  

# ======================  
# PHASE 0 — Imports  
# ======================  
# Why: We gather the scientific Python stack, sklearn utilities, and hnswlib for HNSW ANN search.  
import time  
import math  
import warnings  
from dataclasses import dataclass  
from typing import Dict, Any, Tuple, List  

import numpy as np  
import matplotlib.pyplot as plt  

from sklearn.datasets import load_breast_cancer  
from sklearn.preprocessing import StandardScaler  
from sklearn.decomposition import PCA  
from sklearn.model_selection import StratifiedKFold, train_test_split  
from sklearn.metrics import (  
    accuracy_score,  
    f1_score,  
    classification_report,  
    confusion_matrix,  
)  
from sklearn.neighbors import KNeighborsClassifier  

try:  
    import hnswlib  # Fast C++ ANN library with HNSW  
except ImportError as e:  
    raise SystemExit(  
        "此示例需要hnswlib。\n使用以下命令安装:pip install hnswlib"  
    ) from e  

# ======================  
# Utility Data Structures  
# ======================  
@dataclass  
class CVResult:  
    params: Dict[str, Any]  
    mean_f1: float  
    mean_acc: float  

# ======================  
# PHASE 1 — Data Loading  
# ======================  
# Why: We use a *real* benchmark (UCI Breast Cancer Wisconsin), bundled in scikit-learn.  
# It's binary, tabular, and has enough dimensionality to show ANN benefits without GPU.  
def load_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:  
    data = load_breast_cancer()  
    X = data["data"]  # shape (n_samples, n_features)  
    y = data["target"]  # 二元标签:0 = 恶性,1 = 良性  
    feature_names = list(data["feature_names"])  
    return X, y, feature_names  

# ======================  
# PHASE 2 — EDA (Quick)  
# ======================  
# Why: In an ANN pipeline, we care about scale/distribution because distance metrics are  
# sensitive to feature magnitudes. Quick EDA guides our choice to standardize and optionally reduce dim.  
def quick_eda(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:  
    print("\n=== EDA: 基本信息 ===")  
    print(f"样本数:{X.shape[0]},特征数:{X.shape[1]}")  
    unique, counts = np.unique(y, return_counts=True)  
    class_dist = dict(zip(unique, counts))  
    print(f"类别分布(0=恶性,1=良性):{class_dist}")  

    # Visualize first two principal components to *reason about* structure.  
    # Why: Low-dimensional projection lets us see clusters and overlap, which affect k-NN behavior.  
    with warnings.catch_warnings():  
        warnings.simplefilter("ignore", category=RuntimeWarning)  
        pca_2d = PCA(n_components=2, random_state=42).fit_transform(StandardScaler().fit_transform(X))  

    plt.figure(figsize=(6, 5))  
    plt.title("EDA: 2D PCA投影(标准化)")  
    plt.scatter(pca_2d[:, 0], pca_2d[:, 1], c=y, s=15)  
    plt.xlabel("PC1")  
    plt.ylabel("PC2")  
    plt.tight_layout()  
    plt.show()  

# ======================  
# PHASE 3 — Feature Engineering  
# ======================  
# Why:  
#  - Standardization: Euclidean distances underpin HNSW's "l2" metric. Scaling avoids dominance by high-variance features.  
#  - Optional PCA: Reduces noise and may improve neighborhood quality and latency. It's a trade-off:  
#    too aggressive PCA can hide discriminative detail; modest reduction can help ANN structure.  
def make_embeddings(  
    X_train: np.ndarray, X_test: np.ndarray, n_pca: int = None  
) -> Tuple[np.ndarray, np.ndarray, StandardScaler, PCA]:  
    scaler = StandardScaler()  
    X_train_scaled = scaler.fit_transform(X_train)  
    X_test_scaled = scaler.transform(X_test)  

    pca_model = None  
    if n_pca is not None:  
        pca_model = PCA(n_components=n_pca, random_state=42)  
        X_train_emb = pca_model.fit_transform(X_train_scaled)  
        X_test_emb = pca_model.transform(X_test_scaled)  
    else:  
        X_train_emb = X_train_scaled  
        X_test_emb = X_test_scaled  

    print(  
        f"嵌入维度:{X_train_emb.shape[1]} "  
        f"(PCA={'开启' if n_pca else '关闭'})"  
    )  
    return X_train_emb, X_test_emb, scaler, pca_model  

# ======================  
# PHASE 4 — Model Selection: HNSW k-NN Classifier  
# ======================  
# Why: HNSW provides a *hierarchical small-world graph* over points. Search descends coarse-to-fine  
# layers, approximating nearest neighbors with logarithmic scaling—key to fast retrieval at scale.  
#  
# We'll implement a lightweight k-NN classifier on top of HNSW by majority vote over neighbors.  
class HNSWClassifier:  
    def __init__(  
        self,  
        space: str = "l2",  
        M: int = 16,  
        ef_construction: int = 200,  
        ef_search: int = 64,  
        k: int = 5,  
        random_state: int = 42,  
    ):  
        self.space = space  
        self.M = M  
        self.ef_construction = ef_construction  
        self.ef_search = ef_search  
        self.k = k  
        self.random_state = random_state  

        self.index = None  
        self.labels = None  
        self.dim = None  

    def fit(self, X: np.ndarray, y: np.ndarray):  
        n, d = X.shape  
        self.dim = d  
        self.labels = y.astype(np.int32)  

        # Build HNSW index  
        self.index = hnswlib.Index(space=self.space, dim=self.dim)  
        # 'init_index' builds the topological skeleton. ef_construction controls graph quality/speed.  
        self.index.init_index(max_elements=n, M=self.M, ef_construction=self.ef_construction, random_seed=self.random_state)  
        self.index.add_items(X, np.arange(n))  
        self.index.set_ef(self.ef_search)  # ef_search controls query-time breadth (recall vs latency)  

    def predict(self, Xq: np.ndarray) -> np.ndarray:  
        # Query k approximate neighbors for each vector; majority vote on labels.  
        # NOTE: For binary, we could also weight votes by inverse distance.  
        labels_idx, distances = self.index.knn_query(Xq, k=self.k)  
        # labels_idx: (n_queries, k) -> indices into self.labels  
        yhat = []  
        for neigh in labels_idx:  
            votes = np.bincount(self.labels[neigh], minlength=np.max(self.labels) + 1)  
            yhat.append(np.argmax(votes))  
        return np.array(yhat)  

    def predict_proba(self, Xq: np.ndarray) -> np.ndarray:  
        # Simple probability from neighbor class frequency (can be replaced by distance weighting).  
        labels_idx, distances = self.index.knn_query(Xq, k=self.k)  
        n_classes = np.max(self.labels) + 1  
        probas = np.zeros((Xq.shape[0], n_classes), dtype=float)  
        for i, neigh in enumerate(labels_idx):  
            votes = np.bincount(self.labels[neigh], minlength=n_classes).astype(float)  
            probas[i] = votes / (votes.sum() + 1e-12)  
        return probas  

# ======================  
# PHASE 5 — Hyperparameter Tuning with Cross-Validation  
# ======================  
# Why: HNSW has two critical controls:  
#  - M: graph connectivity degree (memory & index-time vs accuracy)  
#  - ef_search: query breadth (latency vs recall/accuracy)  
# We tune them (and k) using Stratified K-Fold to locate a Pareto-efficient setting.  
def hnsw_cv_tuning(  
    X: np.ndarray,  
    y: np.ndarray,  
    param_grid: Dict[str, List[Any]],  
    n_splits: int = 5,  
    n_pca: int = None,  
    random_state: int = 42,  
) -> CVResult:  
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)  
    results: List[CVResult] = []  

    # Small, principled grid: balances realism (we'd try more) and time.  
    grid_list = []  
    for M in param_grid.get("M", [16]):  
        for efc in param_grid.get("ef_construction", [200]):  
            for efs in param_grid.get("ef_search", [32, 64, 128]):  
                for k in param_grid.get("k", [5]):  
                    grid_list.append(dict(M=M, ef_construction=efc, ef_search=efs, k=k))  

    print(f"\n=== HNSW CV调优:{len(grid_list)}个配置,{n_splits}折 ===")  
    for cfg in grid_list:  
        fold_f1, fold_acc = [], []  
        for train_idx, val_idx in skf.split(X, y):  
            X_tr, X_val = X[train_idx], X[val_idx]  
            y_tr, y_val = y[train_idx], y[val_idx]  

            # Feature pipeline per fold to avoid leakage  
            X_tr_emb, X_val_emb, _, _ = make_embeddings(X_tr, X_val, n_pca=n_pca)  

            clf = HNSWClassifier(  
                space="l2",  
                M=cfg["M"],  
                ef_construction=cfg["ef_construction"],  
                ef_search=cfg["ef_search"],  
                k=cfg["k"],  
                random_state=random_state,  
            )  
            clf.fit(X_tr_emb, y_tr)  
            y_pred = clf.predict(X_val_emb)  

            fold_f1.append(f1_score(y_val, y_pred))  
            fold_acc.append(accuracy_score(y_val, y_pred))  

        res = CVResult(params=cfg, mean_f1=float(np.mean(fold_f1)), mean_acc=float(np.mean(fold_acc)))  
        results.append(res)  
        print(f"配置 {cfg} -> F1: {res.mean_f1:.4f} | Acc: {res.mean_acc:.4f}")  

    # Select best by F1 (can also break ties by accuracy or latency measurements)  
    best = max(results, key=lambda r: (r.mean_f1, r.mean_acc))  
    print(f"\n最佳HNSW CV参数:{best.params}(F1={best.mean_f1:.4f},Acc={best.mean_acc:.4f})")  
    return best  

# ======================  
# PHASE 6 — Baseline (Brute Force k-NN) for Comparison  
# ======================  
# Why: As practitioners, we need to *quantify* what HNSW buys us. k-NN (brute) is the canonical baseline.  
def brute_knn_baseline(  
    X_train_emb: np.ndarray,  
    y_train: np.ndarray,  
    X_test_emb: np.ndarray,  
    y_test: np.ndarray,  
    k: int = 5,  
) -> Dict[str, Any]:  
    t0 = time.perf_counter()  
    knn = KNeighborsClassifier(n_neighbors=k, algorithm="brute", metric="euclidean")  
    knn.fit(X_train_emb, y_train)  
    train_time = time.perf_counter() - t0  

    t1 = time.perf_counter()  
    y_pred = knn.predict(X_test_emb)  
    query_time = time.perf_counter() - t1  

    acc = accuracy_score(y_test, y_pred)  
    f1 = f1_score(y_test, y_pred)  

    return {  
        "model": knn,  
        "acc": acc,  
        "f1": f1,  
        "train_time_s": train_time,  
        "query_time_s": query_time,  
        "k": k,  
    }  

# ======================  
# PHASE 7 — Train Final HNSW, Evaluate, and Visualize Trade-offs  
# ======================  
# Why: After selecting params, we fit on train and test on hold-out. We also *sweep ef_search*  
# to visualize the core speed-recall dial that defines HNSW's value proposition.  
def fit_eval_hnsw_and_visualize(  
    X_train: np.ndarray,  
    y_train: np.ndarray,  
    X_test: np.ndarray,  
    y_test: np.ndarray,  
    best_cfg: Dict[str, Any],  
    sweep_ef_search: List[int],  
) -> Dict[str, Any]:  
    M = best_cfg["M"]  
    efc = best_cfg["ef_construction"]  
    k = best_cfg["k"]  

    # Train with best ef_search for final scoring  
    efs_final = best_cfg["ef_search"]  

    # Build once for "final" numbers  
    clf = HNSWClassifier(M=M, ef_construction=efc, ef_search=efs_final, k=k)  
    t0 = time.perf_counter()  
    clf.fit(X_train, y_train)  
    train_time = time.perf_counter() - t0  

    # Evaluate with selected ef_search  
    t1 = time.perf_counter()  
    y_pred = clf.predict(X_test)  
    final_query_time = time.perf_counter() - t1  
    final_acc = accuracy_score(y_test, y_pred)  
    final_f1 = f1_score(y_test, y_pred)  

    print("\n=== 最终HNSW(最佳CV参数)===")  
    print(f"参数:{best_cfg}")  
    print(f"训练时间:{train_time:.4f}s | 查询时间:{final_query_time:.4f}s")  
    print(f"测试准确率:{final_acc:.4f} | F1:{final_f1:.4f}")  
    print("\n分类报告:\n", classification_report(y_test, y_pred, digits=4))  

    # Sweep ef_search to visualize latency vs accuracy (the *core* HNSW insight)  
    sweep_times, sweep_acc, sweep_f1 = [], [], []  
    for efs in sweep_ef_search:  
        clf.index.set_ef(efs)  # adjust query breadth live  
        t2 = time.perf_counter()  
        y_hat = clf.predict(X_test)  
        dt = time.perf_counter() - t2  

        sweep_times.append(dt)  
        sweep_acc.append(accuracy_score(y_test, y_hat))  
        sweep_f1.append(f1_score(y_test, y_hat))  

        print(f"ef_search={efs:<4d} -> Acc={sweep_acc[-1]:.4f}, F1={sweep_f1[-1]:.4f}, QueryTime={dt:.4f}s")  

    # Plot the Pareto-ish curve: ef_search (x) vs (latency, quality)  
    fig, ax1 = plt.subplots(figsize=(7, 5))  
    ax1.set_title("HNSW权衡:ef_search vs 延迟和质量")  
    ax1.set_xlabel("ef_search(查询广度)")  
    ax1.set_ylabel("查询时间(s)")  
    ax1.plot(sweep_ef_search, sweep_times, marker="o", label="查询时间")  

    ax2 = ax1.twinx()  
    ax2.set_ylabel("得分")  
    ax2.plot(sweep_ef_search, sweep_acc, marker="s", label="准确率")  
    ax2.plot(sweep_ef_search, sweep_f1, marker="^", label="F1")  

    # Legend handling for twin axis  
    lines, labels = ax1.get_legend_handles_labels()  
    lines2, labels2 = ax2.get_legend_handles_labels()  
    ax1.legend(lines + lines2, labels + labels2, loc="best")  

    plt.tight_layout()  
    plt.show()  

    # Confusion matrix as an operational sanity check (post best ef_search)  
    cm = confusion_matrix(y_test, y_pred)  
    plt.figure(figsize=(4, 4))  
    plt.title("混淆矩阵(HNSW,最佳ef_search)")  
    plt.imshow(cm, interpolation="nearest")  
    plt.colorbar()  
    plt.xticks([0, 1], ["malignant", "benign"])
    plt.yticks([0, 1], ["malignant", "benign"])
    for (i, j), val in np.ndenumerate(cm):  
        plt.text(j, i, int(val), ha="center", va="center")  
    plt.tight_layout()  
    plt.show()  

    return {  
        "clf": clf,  
        "train_time_s": train_time,  
        "query_time_s": final_query_time,  
        "acc": final_acc,  
        "f1": final_f1,  
        "sweep": {  
            "ef_search": sweep_ef_search,  
            "times": sweep_times,  
            "acc": sweep_acc,  
            "f1": sweep_f1,  
        },  
    }  

# ======================  
# PHASE 8 — Narrative Glue (Challenge → Insight → Application)  
# ======================  
# Challenge:  
#   Brute-force nearest neighbor search scales poorly; even with modest dimensionality, latency explodes  
#   as data grows. Distance computations become the bottleneck in RAG and real-time ML systems.  
#  
# Insight:  
#   HNSW imposes hierarchy and navigable small-world structure on the vector space. We tune (M, ef_search)  
#   to move along the recall/latency curve. Standardize inputs; consider PCA to enhance neighborhood purity.  
#  
# Application:  
#   Swap KNN brute with HNSW-backed ANN. Cross-validate to pick an operating point. Measure with a  
#   hold-out and visualize ef_search sweeps to institutionalize a *speed-quality contract* with users.  

# ======================  
# PHASE 9 — End-to-end Wrapper  
# ======================  
def run_hnsw_benchmark(  
    test_size: float = 0.25,  
    random_state: int = 42,  
    n_pca: int = 16,  # Try None to keep full dims; 16 often yields stable neighborhoods on this dataset  
) -> None:  
    # 1) Load data  
    X, y, feature_names = load_data()  

    # 2) Quick EDA to connect intuition to modeling choices  
    quick_eda(X, y, feature_names)  

    # 3) Train/Test split (hold-out after CV)  
    X_train, X_test, y_train, y_test = train_test_split(  
        X, y, test_size=test_size, stratify=y, random_state=random_state  
    )  

    # 4) HNSW CV Tuning (on training only)  
    #    We keep the grid small here; practitioners would expand it under time budgets.  
    param_grid = {  
        "M": [12, 16, 32],  
        "ef_construction": [100, 200],  
        "ef_search": [32, 64, 128, 256],  
        "k": [3, 5, 7],  
    }  
    best = hnsw_cv_tuning(  
        X_train, y_train, param_grid, n_splits=5, n_pca=n_pca, random_state=random_state  
    )  

    # 5) Build the final embedding pipeline on train/test with the chosen PCA config  
    X_train_emb, X_test_emb, scaler, pca_model = make_embeddings(  
        X_train, X_test, n_pca=n_pca  
    )  

    # 6) Baseline brute-force kNN, to *quantify* ROI  
    baseline = brute_knn_baseline(  
        X_train_emb, y_train, X_test_emb, y_test, k=best.params["k"]  
    )  
    print("\n=== 基线:暴力k-NN ===")  
    print(  
        f"k={baseline['k']} | 训练 {baseline['train_time_s']:.4f}s | "  
        f"查询 {baseline['query_time_s']:.4f}s | "  
        f"准确率={baseline['acc']:.4f} | F1={baseline['f1']:.4f}"  
    )  

    # 7) Final HNSW fit/eval and *visualize the ef_search dial* (the central HNSW idea)  
    sweep = [16, 32, 64, 128, 256, 384]  
    hnsw_out = fit_eval_hnsw_and_visualize(  
        X_train_emb, y_train, X_test_emb, y_test, best.params, sweep  
    )  

    # 8) Final narrative printout to tie back to the essay:  
    print("\n=== 叙述:挑战→洞察→应用 ===")  
    print(  
        "- 挑战:暴力k-NN需要扫描所有点并成为系统瓶颈。\n"  
        "- 洞察:HNSW构建分层小世界图,让我们权衡搜索广度(ef_search)\n"  
        "          与准确性和延迟——高效地从粗到细邻域缩放。\n"  
        "- 应用:我们用交叉验证调优(M,ef_search,k),选择高F1操作点,\n"  
        "              并使用ef_search扫描制度化速度-质量合同。"  
    )  

    # 9) Summarize ROI  
    print("\n=== ROI总结(留出测试)===")  
    print(  
        f"暴力kNN:准确率={baseline['acc']:.4f},F1={baseline['f1']:.4f},"  
        f"查询={baseline['query_time_s']:.4f}s"  
    )  
    print(  
        f"HNSW最佳:准确率={hnsw_out['acc']:.4f},F1={hnsw_out['f1']:.4f},"  
        f"查询={hnsw_out['query_time_s']:.4f}s(ef_search={best.params['ef_search']})"  
    )  
    print(  
        "解释:如果HNSW以更低的查询延迟实现可比较或更好的质量,\n"  
        "它验证了核心工程声明:*随着规模增长,更智能的检索胜过暴力*。"  
    )  

# Entrypoint  
if __name__ == "__main__":  
    run_hnsw_benchmark()

Experimental Results: The Hierarchy Really Works

Think of the dataset as a library where every vector is a book. Brute-force k-NN is the intern who walks every shelf for every query: reliable but slow. HNSW is the experienced librarian who starts from the top-level overview (a coarse map) and heads straight for the right section.

On the breast-cancer dataset, HNSW matched brute-force search on test accuracy and F1 (≈0.979 / 0.983) while cutting query time to roughly 1/17 (≈5 ms vs 89 ms). That turns the article's central claim, structured search in unstructured space, from theory into a tangible performance win.

What Actually Worked

Hierarchy plus a small search breadth was optimal

Cross-validation repeatedly picked a small search breadth (ef_search=32) with moderate connectivity (M=12) and k=5. The parameter sweep showed that pushing ef_search from 16 to 384 barely moved accuracy while latency grew roughly linearly. Once the search has located the right region, examining four times as many candidates finds no better neighbors on this data; it is pure wasted time.

Lightweight feature processing stabilizes neighborhoods

Standardization plus PCA down to 16 dimensions produced stable neighborhood structure and consistent cross-validation scores, making the HNSW graph easier to navigate. No fancy metric learning was required: basic data hygiene (scaling plus mild denoising) was enough to expose structure HNSW can exploit.

Quality on par with brute force

The confusion matrix is blunt: 2 false negatives and 1 false positive, exactly the same level as brute-force k-NN. Diagnostic quality is not compromised, but a clinical system or downstream application no longer has to wait as long.

What Did Not Help

A larger ef_search did not buy accuracy

Beyond roughly 32, accuracy and F1 flatten out. On this relatively clean, well-separated dataset, the first good neighbors are already good enough. Widening the search further looks thorough, but in practice it is over-management that only burns CPU time.

Denser connectivity (M) was also unnecessary

We tried M=12, 16, and 32. The graph did not need more links to find the same answers, so the extra memory and index-build time were wasted on this task.

Brute force is still fine at small scale

With a few hundred samples, brute-force k-NN's training and query times remain acceptable. The point is not that brute force is "bad", but that HNSW's advantage grows with data size, a trend visible even on this small dataset.

Directions for Further Exploration

Large-scale trials (millions to billions of vectors)

The advantage should widen further at scale. Focus on measuring recall@k against exact k-NN, P95/P99 tail latency, and per-vector memory footprint, for example with a helper like the sketch below.
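A minimal recall@k measurement sketch is shown below; it is an illustrative helper, not part of the original script, and it assumes the hnswlib index was built with labels 0..len(X_base)-1 as in HNSWClassifier.

import numpy as np
import hnswlib
from sklearn.neighbors import NearestNeighbors

def recall_at_k(index: hnswlib.Index, X_base: np.ndarray, X_query: np.ndarray, k: int) -> float:
    """Average fraction of the true k nearest neighbors recovered by the ANN index."""
    exact = NearestNeighbors(n_neighbors=k, algorithm="brute").fit(X_base)
    _, true_idx = exact.kneighbors(X_query)
    approx_idx, _ = index.knn_query(X_query, k=k)
    hits = [len(set(t) & set(a)) for t, a in zip(true_idx, approx_idx)]
    return float(np.mean(hits)) / k

# Usage sketch: recall_at_k(clf.index, X_train_emb, X_test_emb, k=10)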

Metric and embedding variants

Try normalized vectors with cosine distance, replace PCA with an autoencoder, or test domain-specific embeddings. Different geometries shift HNSW's optimal operating point; a cosine-space sketch follows below.
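A minimal sketch of the cosine variant with hnswlib is shown below, using placeholder embeddings; hnswlib's "cosine" space ranks by 1 - cosine similarity, and pre-normalized vectors with the "ip" (inner product) space are an equivalent alternative.

import numpy as np
import hnswlib

emb = np.random.rand(5_000, 384).astype(np.float32)   # placeholder embeddings

cos_index = hnswlib.Index(space="cosine", dim=emb.shape[1])
cos_index.init_index(max_elements=len(emb), M=16, ef_construction=200)
cos_index.add_items(emb, np.arange(len(emb)))
cos_index.set_ef(64)

labels, distances = cos_index.knn_query(emb[:3], k=5)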

Compression and hybrid indexes

Combine HNSW with PQ/OPQ to compress memory and quantify the cost in recall and latency. For very large corpora, also evaluate disk-based HNSW (e.g., mmap) and hot-cache behavior.

Online update capability

Stress-test dynamic insertions and deletions and measure the rebalancing overhead. In streaming scenarios, track how recall drifts over time; a minimal insert/delete sketch follows below.
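The sketch below shows the basic dynamic-update calls in hnswlib, assuming capacity is provisioned up front; note that mark_deleted only hides an element from results rather than reclaiming its memory, which is exactly why long-running streams need the drift and rebuild measurements mentioned above.

import numpy as np
import hnswlib

dim = 64
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=20_000, M=16, ef_construction=200)  # leave headroom for growth

first_batch = np.random.rand(10_000, dim).astype(np.float32)
index.add_items(first_batch, np.arange(10_000))

# Streaming insert: later vectors arrive with fresh ids
new_batch = np.random.rand(1_000, dim).astype(np.float32)
index.add_items(new_batch, np.arange(10_000, 11_000))

# Logical delete: the element is excluded from queries but its memory stays allocated
index.mark_deleted(42)

# Grow capacity when the initial headroom runs out
index.resize_index(50_000)
print(index.get_current_count(), "elements stored")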

Adaptive policies

Train a controller that sets ef_search per query based on query difficulty (entropy, margin, re-ranking uncertainty): spend less on easy queries and give hard ones a larger budget, as in the heuristic sketch below.
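As a heuristic illustration of such a controller (not from the original article), the sketch below issues a cheap, narrow query first and only widens ef_search when the neighbor vote margin looks uncertain; index and train_labels follow the HNSWClassifier setup above, and the thresholds are arbitrary.

import numpy as np

def adaptive_predict(index, train_labels, query_vec, k=5,
                     ef_small=32, ef_large=256, margin_threshold=0.2):
    """Two-stage query: escalate to a wider ef_search only when the cheap answer is ambiguous."""
    index.set_ef(ef_small)
    idx, _ = index.knn_query(query_vec.reshape(1, -1), k=k)
    votes = np.bincount(train_labels[idx[0]], minlength=2).astype(float)  # binary labels assumed
    margin = abs(votes[1] - votes[0]) / votes.sum()

    if margin < margin_threshold:      # ambiguous vote: re-query with a bigger budget
        index.set_ef(ef_large)
        idx, _ = index.knn_query(query_vec.reshape(1, -1), k=k)
        votes = np.bincount(train_labels[idx[0]], minlength=2).astype(float)
    return int(np.argmax(votes))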

=== EDA: basic info ===
Samples: 569, features: 30
Class distribution (0 = malignant, 1 = benign): {np.int64(0): np.int64(212), np.int64(1): np.int64(357)}

=== HNSW CV tuning: 72 configurations, 5 folds ===
Embedding dimension: 16 (PCA=on)
Config {'M': 12, 'ef_construction': 100, 'ef_search': 32, 'k': 3} -> F1: 0.9688 | Acc: 0.9601
Config {'M': 12, 'ef_construction': 100, 'ef_search': 32, 'k': 5} -> F1: 0.9726 | Acc: 0.9648
(intermediate configurations omitted)
Best HNSW CV params: {'M': 12, 'ef_construction': 100, 'ef_search': 32, 'k': 5} (F1=0.9726, Acc=0.9648)

=== Baseline: brute-force k-NN ===
k=5 | train 0.0012s | query 0.0893s | acc=0.9790 | F1=0.9834

=== Final HNSW (best CV params) ===
Params: {'M': 12, 'ef_construction': 100, 'ef_search': 32, 'k': 5}
Train time: 0.0210s | Query time: 0.0052s
Test accuracy: 0.9790 | F1: 0.9834

Classification report:
               precision    recall  f1-score   support  
           0     0.9808    0.9623    0.9714        53  
           1     0.9780    0.9889    0.9834        90  
    accuracy                         0.9790       143  

ef_search sweep:
ef_search=16   -> Acc=0.9790, F1=0.9834, QueryTime=0.0043s  
ef_search=32   -> Acc=0.9790, F1=0.9834, QueryTime=0.0053s  
ef_search=64   -> Acc=0.9790, F1=0.9834, QueryTime=0.0070s  
ef_search=128  -> Acc=0.9790, F1=0.9834, QueryTime=0.0107s  
ef_search=256  -> Acc=0.9790, F1=0.9834, QueryTime=0.0231s  
ef_search=384  -> Acc=0.9790, F1=0.9834, QueryTime=0.0223s  

=== ROI summary (hold-out test) ===
Brute-force kNN: acc=0.9790, F1=0.9834, query=0.0893s
HNSW best: acc=0.9790, F1=0.9834, query=0.0052s (ef_search=32)
Conclusion: HNSW keeps the same quality while cutting query latency to about 1/17,
validating the core claim: as scale grows, smart retrieval beats brute-force scanning.

Conclusion

The number of embedding vectors keeps growing exponentially, and HNSW makes a simple point: intelligence is not about knowing every answer, but about knowing where to start looking.

The future of efficient AI systems may depend less on ever-larger models and more on smart retrieval. Seen this way, HNSW is not just an optimization trick; it changes the underlying logic of how machines store and query memories.

https://avoid.overfit.cn/post/6e8a792fb0eb4f4ab911cce7f3e98644

Author: Everton Gomede, PhD
