A Brief Overview of Local Outlier Factor and KL-Divergence Anomaly Detection


Local Outlier Factor

Using local outlier factors, we can detect outliers that lie unusually far from the bulk of the samples. Before outlining the algorithm, a few concepts are needed:
Reachability Distance

$$\mathrm{RD}_k(x, x') = \max\left(\|x' - x'^{(k)}\|,\ \|x - x'\|\right)$$

where $x'^{(k)}$ denotes the $k$-th nearest neighbor of $x'$ in the training set $\{x_i\}_{i=1}^n$. Note that $k$ is selected manually.
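As a toy illustration (hypothetical 1-D points in Python/NumPy, not from the original text), the max in the definition keeps the reachability distance from shrinking below the $k$-th-neighbor distance of $x'$:

```python
import numpy as np

def reach_dist(X, a, b, k):
    """RD_k(x_a, x_b): max of (k-th neighbour distance of x_b) and ||x_a - x_b||."""
    d = np.linalg.norm(X - X[b], axis=1)             # distances from x_b to every point
    kth = np.sort(d[np.arange(len(X)) != b])[k - 1]  # k-th nearest neighbour of x_b
    return max(kth, np.linalg.norm(X[a] - X[b]))

X = np.array([[0.0], [1.0], [1.1], [5.0]])
print(reach_dist(X, 2, 1, 2))  # only 0.1 apart, but clipped up to x_1's 2nd-NN distance -> 1.0
print(reach_dist(X, 3, 1, 2))  # 4.0 apart: the true distance dominates -> 4.0
```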
Local Reachability Density
$$\mathrm{LRD}_k(x) = \left(\frac{1}{k}\sum_{i=1}^k \mathrm{RD}_k(x, x^{(i)})\right)^{-1}$$

Local Outlier Factor
$$\mathrm{LOF}_k(x) = \frac{\frac{1}{k}\sum_{i=1}^k \mathrm{LRD}_k(x^{(i)})}{\mathrm{LRD}_k(x)}$$

Evidently, the larger the LOF of $x$, the more likely $x$ is an outlier. In principle it is a simple algorithm with an intuitive rationale; however, when $n$ is very large, computing all pairwise distances becomes expensive.

Here is a simple MATLAB example:

n=100; x=[(rand(n/2,2)-0.5)*20; randn(n/2,2)]; x(n,1)=14;     % mixed samples; the last point is pushed far away
k=3; x2=sum(x.^2,2);
[s, t]=sort(sqrt(repmat(x2,1,n)+repmat(x2',n,1)-2*x*x'), 2);  % sorted pairwise distances s, with index t

for i=1:k+1                          % i=1: each point itself; i=2..k+1: its k nearest neighbors
    for j=1:k
        RD(:,j)=max(s(t(t(:,i),j+1),k), s(t(:,i),j+1));       % reachability distances
    end
    LRD(:,i)=1./mean(RD,2);          % local reachability density
end
LOF=mean(LRD(:,2:k+1),2)./LRD(:,1);  % average neighbor LRD over own LRD

figure(1); clf; hold on
plot(x(:,1),x(:,2),'rx');
for i=1:n
    plot(x(i,1),x(i,2),'bo', 'MarkerSize', LOF(i)*10);
end

A larger circle indicates a larger LOF, i.e. a more suspicious point.
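For comparison, the three formulas above can be sketched compactly in NumPy (my own translation of the algorithm; the 2-D cluster and planted outlier are hypothetical test data):

```python
import numpy as np

def lof(X, k):
    """Local outlier factor of every row of X (brute-force O(n^2) distances)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    idx = np.argsort(D, axis=1)[:, 1:k + 1]        # k nearest neighbours (skip self)
    kth = D[np.arange(n), idx[:, -1]]              # distance to each point's k-th neighbour
    # RD_k(x, x^{(i)}) = max(k-th-NN distance of x^{(i)}, d(x, x^{(i)}))
    rd = np.maximum(kth[idx], D[np.arange(n)[:, None], idx])
    lrd = 1.0 / rd.mean(axis=1)                    # local reachability density
    return lrd[idx].mean(axis=1) / lrd             # LOF_k(x)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[14.0, 0.0]]])  # cluster plus one far point
scores = lof(X, k=3)
print(scores.argmax())   # index 50: the appended far-away point has the largest LOF
```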

KL Divergence

In unsupervised learning problems, there is usually little information about the outliers. However, when a set of known normal samples $\{x_i\}_{i=1}^n$ is given, we can identify the outliers in a test set $\{x'_i\}_{i=1}^{n'}$ with some confidence.
Kullback-Leibler (KL) divergence, also known as relative entropy, is a powerful tool for estimating the probability density ratio of normal samples to test samples,

$$w(x) = \frac{p'(x)}{p(x)}$$

where $p'(x)$ is the probability density of the normal samples and $p(x)$ that of the test samples, without computing either density directly. The ratio is close to 1 for a normal sample but deviates from 1 for an outlier.
To begin with, let us approximate the density ratio with a parameterized linear model:

$$w_\alpha(x) = \sum_{j=1}^b \alpha_j\psi_j(x) = \alpha^T\psi(x)$$

where $\alpha=(\alpha_1,\ldots,\alpha_b)^T$ is the parameter vector and $\psi(x)=(\psi_1(x),\ldots,\psi_b(x))^T$ is a vector of non-negative basis functions. Then $w_\alpha(x)p(x)$ can be seen as an estimate of $p'(x)$. Define the discrepancy between $w_\alpha(x)p(x)$ and $p'(x)$ as the KL distance, i.e.
$$\mathrm{KL}(p'\,\|\,w_\alpha p) = \int p'(x)\log\frac{p'(x)}{w_\alpha(x)p(x)}\,\mathrm{d}x$$

In general, the KL distance is non-negative and equals zero only if $w_\alpha p = p'$. When the KL distance is sufficiently small, $w_\alpha p$ can be regarded as close to $p'$. To guarantee that $w_\alpha p$ is a well-defined probability density, we impose the constraints

$$\int w_\alpha(x)p(x)\,\mathrm{d}x = 1, \qquad w_\alpha(x)p(x) \ge 0 \ \ \text{for all } x$$
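To see why only one term of the KL distance matters, expand it (a one-line derivation filling in the step to the sample-based problem):

```latex
\mathrm{KL}\left(p' \,\middle\|\, w_{\alpha} p\right)
  = \int p'(x)\,\log\frac{p'(x)}{p(x)}\,\mathrm{d}x
  \;-\; \int p'(x)\,\log w_{\alpha}(x)\,\mathrm{d}x .
```

The first integral does not depend on $\alpha$, so minimizing the KL distance amounts to maximizing $\int p'(x)\log w_\alpha(x)\,\mathrm{d}x$, whose sample average is taken over the normal samples, while the normalization constraint is approximated over the test samples.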

Approximating the integrals by their sample averages then turns the estimation into the following optimization problem:

$$\max_\alpha\ \frac{1}{n}\sum_{i=1}^n \log w_\alpha(x_i) \quad \text{s.t.}\quad \frac{1}{n'}\sum_{i=1}^{n'} w_\alpha(x'_i) = 1, \quad \alpha_1,\ldots,\alpha_b \ge 0$$

We briefly summarize the estimation process:
  1. Initialize α .
  2. Repeat the following steps until $\alpha$ reaches the desired precision:
    1. $\alpha \leftarrow \alpha + \epsilon A^T(1./(A\alpha))$
    2. $\alpha \leftarrow \alpha + (1-b^T\alpha)\,b/(b^Tb)$
    3. $\alpha \leftarrow \max(0,\alpha)$
    4. $\alpha \leftarrow \alpha/(b^T\alpha)$

where $A$ is the matrix whose $(i,j)$-th element is $\psi_j(x_i)$ (basis functions evaluated at the normal samples), $b$ is the vector whose $j$-th element is $\frac{1}{n'}\sum_{i=1}^{n'}\psi_j(x'_i)$ (their averages over the test samples), and $1./$ denotes element-wise division.
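The four update steps can be sketched in NumPy (my own transcription, with hypothetical Gaussian-kernel basis functions and a hand-picked bandwidth; `A` is built on the normal samples and `b` on the test samples, as defined above):

```python
import numpy as np

def kliep(A, b, eps=0.01, tol=1e-3, max_iter=1000):
    """Fixed-point iteration for the KLIEP weights alpha (steps 1-4 above)."""
    rng = np.random.default_rng(0)
    a = rng.random(A.shape[1])
    for _ in range(max_iter):
        a_new = a + eps * (A.T @ (1.0 / (A @ a)))   # 1. gradient ascent on the log-likelihood
        a_new = a_new + b * (1.0 - b @ a_new) / (b @ b)  # 2. project onto b^T a = 1
        a_new = np.maximum(0.0, a_new)              # 3. clip to non-negativity
        a_new = a_new / (b @ a_new)                 # 4. renormalize
        diff = np.linalg.norm(a_new - a)
        a = a_new
        if diff < tol:
            break
    return a

# toy data: normal samples ~ N(0,1); test samples with one planted outlier at 5
rng = np.random.default_rng(1)
x, y = rng.normal(size=50), np.append(rng.normal(size=49), 5.0)
h = 2.0                                             # Gaussian bandwidth (chosen by hand here)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / h)     # basis values at normal samples -> A
R = np.exp(-(y[:, None] - x[None, :]) ** 2 / h)     # basis values at test samples
alpha = kliep(K, R.mean(axis=0))
w = R @ alpha                                       # estimated density ratio at test points
print(w[49])                                        # ratio near 0 for the outlier y = 5
```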

Here is a MATLAB example (Gaussian kernel model):

function [ a ] = KLIEP( k, r )
% k: basis functions evaluated at the normal samples (the matrix A)
% r: basis functions evaluated at the test samples (row-averaged into b)
a0=rand(size(k,2),1); b=mean(r)'; c=sum(b.^2);
for o=1:1000
    a=a0+0.01*k'*(1./(k*a0));      % gradient ascent step (note the parentheses)
    a=a+b*(1-sum(b.*a))/c;         % projection onto the constraint b'*a=1
    a=max(0,a); a=a/sum(b.*a);     % non-negativity, then renormalization
    if norm(a-a0)<0.001, break, end
    a0=a;
end

end
n=100; x=randn(n,1); y=randn(n,1); y(n)=5;   % normal set x; test set y with one planted outlier
hhs=2*[1,5,10].^2; m=5;                      % candidate Gaussian bandwidths; m-fold cross validation
x2=x.^2; xx=repmat(x2,1,n)+repmat(x2',n,1)-2*(x*x');
y2=y.^2; yx=repmat(y2,1,n)+repmat(x2',n,1)-2*y*x';
u=floor(m*(0:n-1)/n)+1;
u=u(randperm(n));                            % random fold assignment

for hk=1:length(hhs)
    hh=hhs(hk); k=exp(-xx/hh); r=exp(-yx/hh);
    for i=1:m
        g(hk,i)=mean(k(u==i,:)*KLIEP(k(u~=i,:),r));   % held-out score for fold i
    end
end
[gh,ggh]=max(mean(g,2));                     % pick the bandwidth with the best average score
HH=hhs(ggh);
k=exp(-xx/HH); r=exp(-yx/HH); s=r*KLIEP(k,r);   % density-ratio estimates at the test points

figure(1); clf; hold on; plot(y,s,'rx');

Estimated density ratio for each test point; a value far from 1 marks an abnormal point.

SVM

Furthermore, outlier detection can be performed with support vector machine techniques. Due to the time limit, we only outline the main structure of the algorithm.
A typical SVM outlier detector finds a hyper-sphere that contains nearly all the sample points; a point lying outside the hyper-sphere can then be regarded as an outlier. Concretely, we obtain the center $c$ and radius $R$ by solving the following optimization problem:

$$\min_{c,R,\xi}\ \left(R^2 + C\sum_{i=1}^n \xi_i\right) \quad \text{s.t.}\quad \|x_i-c\|^2 \le R^2+\xi_i,\ \ \xi_i \ge 0,\ \ i=1,2,\ldots,n$$

It can be solved using Lagrange multipliers:
$$L(c,R,\xi,\alpha,\beta) = R^2 + C\sum_{i=1}^n\xi_i - \sum_{i=1}^n\alpha_i\left(R^2+\xi_i-\|x_i-c\|^2\right) - \sum_{i=1}^n\beta_i\xi_i$$

Then its dual problem can be formulated as:
$$\max_{\alpha,\beta}\ \inf_{c,R,\xi} L(c,R,\xi,\alpha,\beta) \quad \text{s.t.}\quad \alpha \ge 0,\ \beta \ge 0$$

The KKT conditions give:
$$\frac{\partial L}{\partial c}=0 \Rightarrow c=\frac{\sum_{i=1}^n\alpha_i x_i}{\sum_{i=1}^n\alpha_i},\qquad \frac{\partial L}{\partial R}=0 \Rightarrow \sum_{i=1}^n\alpha_i=1,\qquad \frac{\partial L}{\partial \xi_i}=0 \Rightarrow \alpha_i+\beta_i=C,\ \ i=1,2,\ldots,n$$
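Substituting these conditions back into $L$ eliminates $c$, $R$, and $\xi$; the $R^2$ and $\xi_i$ terms cancel, leaving (a step worth checking by hand):

```latex
\inf_{c,R,\xi} L
  = \sum_{i=1}^{n}\alpha_i \,\|x_i - c\|^2
  = \sum_{i=1}^{n}\alpha_i\, x_i^{T}x_i
    - \sum_{i,j=1}^{n}\alpha_i\alpha_j\, x_i^{T}x_j,
\qquad c = \sum_{j=1}^{n}\alpha_j x_j .
```

The second equality uses $\sum_i \alpha_i = 1$, which is exactly the objective of the dual problem.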

Therefore, the dual problem reduces to
$$\hat\alpha = \arg\max_\alpha\ \sum_{i=1}^n \alpha_i x_i^T x_i - \sum_{i,j=1}^n \alpha_i\alpha_j x_i^T x_j \quad \text{s.t.}\quad \sum_{i=1}^n\alpha_i = 1,\ \ 0 \le \alpha_i \le C,\ \ i=1,2,\ldots,n$$

It is a typical quadratic programming problem. After solving it, we can recover $c$ and $R$:
$$\hat c = \sum_{i=1}^n \hat\alpha_i x_i, \qquad \hat R^2 = \Big\|x_i - \sum_{j=1}^n \hat\alpha_j x_j\Big\|^2$$

where $x_i$ is any support vector on the sphere, i.e. one satisfying $\|x_i - \hat c\|^2 = \hat R^2$, which corresponds to $0 < \hat\alpha_i < C$.
Hence, when a sample point $x$ satisfies
$$\|x - \hat c\|^2 > \hat R^2$$

it can be viewed as an outlier.
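As a rough NumPy sketch of this detector (not the article's code: it solves the dual QP approximately with Frank-Wolfe iterations, and the support-vector selection and fallback for $\hat R^2$ are my own heuristics):

```python
import numpy as np

def svdd_dual(X, C, iters=2000):
    """Approximate the dual QP: maximize sum_i a_i x_i'x_i - sum_ij a_i a_j x_i'x_j
    subject to 0 <= a_i <= C and sum_i a_i = 1, via Frank-Wolfe."""
    n = len(X)
    G = X @ X.T                           # Gram matrix x_i' x_j
    d = np.diag(G).copy()
    a = np.full(n, 1.0 / n)               # feasible starting point
    for t in range(iters):
        grad = d - 2.0 * (G @ a)          # gradient of the dual objective
        s = np.zeros(n)                   # linear maximizer over the capped simplex:
        budget = 1.0                      # pour mass C onto the largest-gradient coords
        for i in np.argsort(-grad):
            s[i] = min(C, budget)
            budget -= s[i]
            if budget <= 0.0:
                break
        a += (2.0 / (t + 2.0)) * (s - a)  # Frank-Wolfe update, stays feasible
    c = a @ X                             # center  c = sum_i a_i x_i
    dist2 = np.sum((X - c) ** 2, axis=1)
    on_sphere = (a > 1e-3) & (a < C - 1e-3)   # approximately 0 < a_i < C
    if not on_sphere.any():
        on_sphere = a > 1e-3              # fallback: all points carrying dual weight
    R2 = np.median(dist2[on_sphere])      # heuristic estimate of R^2
    return c, R2

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(60, 2)), [[8.0, 8.0]]])   # a cluster plus one far point
c, R2 = svdd_dual(X, C=0.1)
print(np.sum((X[-1] - c) ** 2) > R2)      # True: the planted far point lies outside the ball
```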