A Brief Overview of Local Outlier Factor and KL-Divergence Anomaly Detection Algorithms


Local Outlier Factor

Using local outlier factors, we can detect outliers that lie far away from most of the samples. Before outlining the algorithm, a few concepts must come first:
Reachability Distance

$$\mathrm{RD}_k(x', x) = \max\big(\|x' - x'^{(k)}\|,\ \|x' - x\|\big)$$

where $x'^{(k)}$ stands for the $k$-th point nearest to $x'$ in the training set $\{x_i\}_{i=1}^{n}$. Note that $k$ is selected manually.
Local Reachability Density
$$\mathrm{LRD}_k(x) = \Big(\frac{1}{k}\sum_{i=1}^{k}\mathrm{RD}_k(x^{(i)}, x)\Big)^{-1}$$

Local Outlier Factor
$$\mathrm{LOF}_k(x) = \frac{\frac{1}{k}\sum_{i=1}^{k}\mathrm{LRD}_k(x^{(i)})}{\mathrm{LRD}_k(x)}$$

Evidently, as the LOF of $x$ grows, the probability that $x$ is an outlier also goes up. In principle it is a simple algorithm with an intuitive rationale; however, when $n$ is very large, it demands a tremendous amount of computation, since all pairwise distances and $k$-nearest neighbours must be found.
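The three definitions above can be sketched directly in NumPy (a toy illustration of the same formulas; the function name `lof_scores` and the sample data are my own, not taken from the original MATLAB example):

```python
import numpy as np

def lof_scores(X, k=3):
    """Local outlier factor of every row of X, following RD -> LRD -> LOF."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    order = np.argsort(D, axis=1)             # order[:, 0] is the point itself
    knn = order[:, 1:k + 1]                   # indices of the k nearest neighbours
    kdist = D[np.arange(n), order[:, k]]      # distance to the k-th neighbour
    # reachability distance RD_k(o, x) = max(kdist(o), d(o, x)) for o in kNN(x)
    RD = np.maximum(kdist[knn], D[np.arange(n)[:, None], knn])
    LRD = 1.0 / RD.mean(axis=1)               # local reachability density
    return LRD[knn].mean(axis=1) / LRD        # local outlier factor

# toy data: a Gaussian cluster plus one isolated point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(30, 2)), [[10.0, 10.0]]])
scores = lof_scores(X, k=3)
print(scores[-1])  # LOF of the isolated point (expected to be much larger than 1)
```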

Here is a simple MATLAB example:

n=100; x=[(rand(n/2,2)-0.5)*20; randn(n/2,2)]; x(n,1)=14;  % toy data with an outlier
k=3; x2=sum(x.^2,2);
[s, t]=sort(sqrt(repmat(x2,1,n)+repmat(x2',n,1)-2*x*x'), 2);  % sorted pairwise distances

for i=1:k+1                 % i=1: each point itself; i=2..k+1: its k nearest neighbours
    for j=1:k
        RD(:,j)=max(s(t(t(:,i),j+1),k), s(t(:,i),j+1));  % reachability distances
    end
    LRD(:,i)=1./mean(RD,2); % local reachability density
end
LOF=mean(LRD(:,2:k+1),2)./LRD(:,1);  % mean neighbour LRD divided by own LRD

figure(1); clf; hold on
plot(x(:,1),x(:,2),'rx');
for i=1:n
    plot(x(i,1),x(i,2),'bo', 'MarkerSize', LOF(i)*10);
end

A larger circle indicates a more anomalous point.

KL Divergence

In unsupervised learning problems, there is usually little information about the outliers. However, when a set of known normal samples $\{x_i\}_{i=1}^{n}$ is given, we can figure out the outliers in a test set $\{x'_i\}_{i=1}^{n'}$ with some degree of confidence.
Kullback–Leibler (KL) divergence, also known as relative entropy, is a powerful tool for estimating the density ratio of normal samples to test samples,

$$w(x) = \frac{p(x)}{p'(x)}$$

where $p(x)$ is the probability density of the normal samples and $p'(x)$ is that of the test samples, while avoiding separate estimation of the two densities. The ratio at a normal sample approaches 1, while at an outlier it stays far from 1.
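For intuition, here is a small illustration with known densities (my own toy example, assuming the normal data follow $N(0,1)$ and the test data are contaminated by 10% outliers centred at 5):

```python
import math

def gauss_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

p = lambda x: gauss_pdf(x)                                          # density of normal samples
p_test = lambda x: 0.9 * gauss_pdf(x) + 0.1 * gauss_pdf(x, mu=5.0)  # density of test samples

w_inlier = p(0.0) / p_test(0.0)    # ratio at a typical normal point: close to 1
w_outlier = p(5.0) / p_test(5.0)   # ratio at the outlier location: nearly 0
print(w_inlier, w_outlier)
```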
To begin with, let us transform the density-ratio model into a parameterized linear model:

$$w_{\alpha}(x) = \sum_{j=1}^{b}\alpha_j\psi_j(x) = \alpha^T\psi(x)$$

where $\alpha = (\alpha_1, \dots, \alpha_b)^T$ is the parameter vector and $\psi(x) = (\psi_1(x), \dots, \psi_b(x))^T$ is a non-negative basis function vector. Then $w_{\alpha}(x)p'(x)$ can be seen as an estimate of $p(x)$. Define the similarity between $w_{\alpha}(x)p'(x)$ and $p(x)$ as the KL divergence, i.e.

$$\mathrm{KL}\big(p\,\big\|\,w_{\alpha}p'\big) = \int p(x)\log\frac{p(x)}{w_{\alpha}(x)p'(x)}\,dx$$

In general, the KL divergence is non-negative and equals zero only if $w_{\alpha}p' = p$. When the KL divergence is sufficiently small, $w_{\alpha}p'$ can be regarded as close to $p$. To guarantee that $w_{\alpha}p'$ is a well-defined probability density, we impose the following constraints:

$$\int w_{\alpha}(x)p'(x)\,dx = 1, \qquad w_{\alpha}(x)p'(x) \ge 0 \ \ \text{for all } x$$

Then, approximating the expectation and the integral by sample averages, the estimation reduces to the following optimization problem:

$$\max_{\alpha}\ \frac{1}{n}\sum_{i=1}^{n}\log w_{\alpha}(x_i) \quad \text{s.t.}\ \ \frac{1}{n'}\sum_{i=1}^{n'} w_{\alpha}(x'_i) = 1,\ \ \alpha_1,\dots,\alpha_b \ge 0$$

We briefly summarize the estimation process:
  1. Initialize $\alpha$.
  2. Repeat the following steps until $\alpha$ reaches a suitable precision:
    1. $\alpha \leftarrow \alpha + \epsilon A^T(1./(A\alpha))$
    2. $\alpha \leftarrow \alpha + (1 - b^T\alpha)\,b/(b^T b)$
    3. $\alpha \leftarrow \max(0, \alpha)$
    4. $\alpha \leftarrow \alpha/(b^T\alpha)$

where $A$ is the matrix whose $(i,j)$-th element is $\psi_j(x_i)$, and $b$ is the vector whose $j$-th element is $\frac{1}{n'}\sum_{i=1}^{n'}\psi_j(x'_i)$.
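The four-step iteration above can be sketched in NumPy (a sketch under my own naming; `kliep` and the Gaussian-kernel design matrices mirror the structure of the estimation process but the helper names and toy data are assumptions of mine):

```python
import numpy as np

def kliep(A, b, eps=0.01, tol=1e-3, max_iter=1000):
    """Estimate alpha; A[i,j] = psi_j(x_i), b[j] = mean of psi_j over test samples."""
    alpha = np.ones(A.shape[1])
    alpha /= b @ alpha                                   # start feasible: b^T alpha = 1
    for _ in range(max_iter):
        prev = alpha
        alpha = alpha + eps * A.T @ (1.0 / (A @ alpha))  # 1. gradient ascent step
        alpha = alpha + (1 - b @ alpha) * b / (b @ b)    # 2. restore b^T alpha = 1
        alpha = np.maximum(0.0, alpha)                   # 3. project onto alpha >= 0
        alpha = alpha / (b @ alpha)                      # 4. renormalise
        if np.linalg.norm(alpha - prev) < tol:
            break
    return alpha

# Gaussian kernels centred at the normal samples
rng = np.random.default_rng(1)
x = rng.normal(size=50)                    # normal samples
y = np.append(rng.normal(size=49), 5.0)    # test samples, the last one is an outlier
h = 2.0
A = np.exp(-(x[:, None] - x[None, :]) ** 2 / h)         # psi_j at the normal samples
Psi_test = np.exp(-(y[:, None] - x[None, :]) ** 2 / h)  # psi_j at the test samples
alpha = kliep(A, Psi_test.mean(axis=0))
w = Psi_test @ alpha   # estimated density ratio at the test points; small at the outlier
```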

Here is a MATLAB example (Gaussian kernel model):

function [ a ] = KLIEP( k, r )
% k(i,j) = psi_j(x_i) over the normal samples; r(i,j) = psi_j(x'_i) over the test samples
a0=rand(size(k,2),1); b=mean(r)'; c=sum(b.^2);
for o=1:1000
    a=a0+0.01*k'*(1./(k*a0));  % gradient ascent step
    a=a+b*(1-sum(b.*a))/c;     % restore the constraint b'*a = 1
    a=max(0,a);                % projection to the non-negative orthant
    a=a/sum(b.*a);             % renormalisation
    if norm(a-a0)<0.001, break, end
    a0=a;
end

end
n=100; x=randn(n,1); y=randn(n,1); y(n)=5;  % normal samples x; test samples y with one outlier
hhs=2*[1,5,10].^2; m=5;                     % candidate bandwidths; m-fold cross validation
x2=x.^2; xx=repmat(x2,1,n)+repmat(x2',n,1)-2*(x*x');  % squared distances among x
y2=y.^2; yx=repmat(y2,1,n)+repmat(x2',n,1)-2*y*x';    % squared distances y to x
u=floor(m*(0:n-1)/n)+1;
u=u(randperm(n));                           % random fold assignment

for hk=1:length(hhs)
    hh=hhs(hk); k=exp(-xx/hh); r=exp(-yx/hh);  % Gaussian kernel design matrices
    for i=1:m
        g(hk,i)=mean(k(u==i,:)*KLIEP(k(u~=i,:),r));  % hold-out score of fold i
    end
end
[gh,ggh]=max(mean(g,2));
HH=hhs(ggh);                                % bandwidth with the best CV score
k=exp(-xx/HH); r=exp(-yx/HH); s=r*KLIEP(k,r);  % density-ratio values at the test points

figure(1); clf; hold on; plot(y,s,'rx');

The vertical axis shows the abnormality score (the estimated density ratio s); values far from 1 indicate abnormal points.

SVM

Furthermore, outlier detection can be done with support vector machine techniques. Due to the time limit, we only outline the main structure of the algorithm.
A typical SVM outlier detector finds a hyper-ball that contains nearly all the sample points; a point lying outside the hyper-ball can then be regarded as an outlier. Concretely speaking, we obtain the center $c$ and radius $R$ by solving the following optimization problem:

$$\min_{c,R,\xi}\ \Big(R^2 + C\sum_{i=1}^{n}\xi_i\Big) \quad \text{s.t.}\ \|x_i - c\|^2 \le R^2 + \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,2,\dots,n$$

It can be solved using Lagrange multipliers:

$$L(c,R,\xi,\alpha,\beta) = R^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\big(R^2 + \xi_i - \|x_i - c\|^2\big) - \sum_{i=1}^{n}\beta_i\xi_i$$

Then its dual problem can be formulated as:

$$\max_{\alpha,\beta}\ \inf_{c,R,\xi} L(c,R,\xi,\alpha,\beta) \quad \text{s.t.}\ \alpha \ge 0,\ \beta \ge 0$$

The KKT conditions give:

$$\frac{\partial L}{\partial c} = 0 \ \Rightarrow\ c = \frac{\sum_{i=1}^{n}\alpha_i x_i}{\sum_{i=1}^{n}\alpha_i}, \qquad \frac{\partial L}{\partial R} = 0 \ \Rightarrow\ \sum_{i=1}^{n}\alpha_i = 1, \qquad \frac{\partial L}{\partial \xi_i} = 0 \ \Rightarrow\ \alpha_i + \beta_i = C,\ \ i = 1,2,\dots,n$$

Therefore, the dual problem reduces to

$$\hat{\alpha} = \arg\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i x_i^T x_i - \sum_{i,j=1}^{n}\alpha_i\alpha_j x_i^T x_j \quad \text{s.t.}\ \sum_{i=1}^{n}\alpha_i = 1,\ \ 0 \le \alpha_i \le C,\ \ i = 1,2,\dots,n$$

It is a typical quadratic programming problem. After solving it, we can further recover $c$ and $R$:

$$\hat{R}^2 = \Big\|\tilde{x} - \sum_{j=1}^{n}\hat{\alpha}_j x_j\Big\|^2, \qquad \hat{c} = \sum_{i=1}^{n}\hat{\alpha}_i x_i$$

where $\tilde{x}$ is any support vector, i.e. a point whose dual variable satisfies $0 < \hat{\alpha}_i < C$; such a point lies exactly on the sphere, $\|\tilde{x} - \hat{c}\|^2 = \hat{R}^2$.
Hence, when a sample point x satisfies
$$\|x - \hat{c}\|^2 > \hat{R}^2$$

it can be viewed as an outlier.
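The whole pipeline — solve the dual QP, recover ĉ and R̂², then flag points outside the ball — can be sketched with a generic solver (a sketch only, assuming a linear kernel and SciPy's SLSQP; the function name `svdd` and the toy data are my own):

```python
import numpy as np
from scipy.optimize import minimize

def svdd(X, C=1.0):
    """Fit the hyper-ball by solving the dual problem above (linear kernel)."""
    n = len(X)
    K = X @ X.T                         # Gram matrix, K[i, j] = x_i^T x_j
    d = np.diag(K).copy()
    fun = lambda a: a @ K @ a - d @ a   # negated dual objective
    jac = lambda a: 2 * K @ a - d
    res = minimize(fun, np.full(n, 1.0 / n), jac=jac,
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
                   method="SLSQP", options={"maxiter": 500})
    alpha = res.x
    c = alpha @ X                       # centre  c = sum_i alpha_i x_i
    on_sphere = (alpha > 1e-6) & (alpha < C - 1e-3)   # support vectors with 0 < alpha < C
    dist2 = np.linalg.norm(X - c, axis=1) ** 2
    R2 = dist2[on_sphere].max() if on_sphere.any() else dist2.max()
    return c, R2

# toy data: a Gaussian cluster plus one far-away point
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(40, 2)), [[8.0, 8.0]]])
c, R2 = svdd(X, C=0.1)
d2 = np.linalg.norm(X - c, axis=1) ** 2
outliers = d2 > R2   # points outside the ball are flagged as outliers
```

Choosing C < 1 is what allows the isolated point to stay outside the ball: its dual variable is capped at C, so it cannot pull the radius out to enclose itself.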