Do you hate image theft, article plagiarism, and tomb raiding? PostgreSQL with cosine and linear-correlation algorithms for text, image, and array similarity - Part 2: the smlar extension in detail


Tags

PostgreSQL , text analysis , cosine , smlar , similarity , simlar , tf , idf , tf-idf , tag


Background

Let's start with two examples.

Example 1

Suppose the database contains these two records:

"I want a dog"  // 狗  
"I want a chihuahua"  // 吉娃娃狗  

Then we search them with the following query condition:

"dog|chihuahua"  

Obviously both records match, but which one should rank first?

Example 2

Search for "狗|吉娃娃狗" (dog|chihuahua) in a search engine.

Which one ranks first? Try it and see; skip to the second page so the ads don't skew the picture.

(Figure: search-engine results for "dog|chihuahua")

Clearly the chihuahua results rank first. Why?

This is the tf-idf algorithm at work. Across the corpus, "dog" appears in far more documents than "chihuahua", so by the idf formula ( log(total number of documents / number of documents containing the term) ) the idf of "dog" is lower than that of "chihuahua".

So when tf is the same, tf*idf is determined entirely by idf.
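
A quick worked example with hypothetical numbers: if the corpus holds 1,000,000 documents, "dog" appears in 100,000 of them and "chihuahua" in 1,000, then idf(dog) = log(1000000/100000) = log(10) while idf(chihuahua) = log(1000000/1000) = log(1000). With tf = 1 in both documents, the "chihuahua" document gets the higher tf*idf score, which is exactly the ordering the search engine shows.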

For the concepts of tf and idf, see

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

So does PostgreSQL, which supports full text search, use TF-IDF by default?

PostgreSQL does not use idf when computing rank by default

From the source code of ts_rank and ts_rank_cd we can see that neither function takes idf into account.

The same question has been asked online:

http://stackoverflow.com/questions/18296444/does-postgresql-use-tf-idf

Within the ts_rank function, there is no native method to rank results using their global (corpus) frequency. The rank algorithm does however rank based on frequency within the document:  

http://www.postgresql.org/docs/9.3/static/textsearch-controls.html  

So if I search for "dog|chihuahua" the following two documents would have the same rank despite the relatively lower frequency of the word "chihuahua":  

"I want a dog"  // 狗  
"I want a chihuahua"  // 奇瓦瓦狗  

However, the following line would get ranked higher than the previous two lines above, because it contains the stemmed token "dog" twice in the document:  

"dog lovers have an average of 1.5 dogs"  

In short: higher term frequency(TF) within the document results in a higher rank, but a lower term frequency in the corpus has no impact.  

One caveat: the text search does ignore stop-words, so you will not match on ultra high frequency words like "the","a","of","for" etc (assuming you have correctly set your language)  

Postgres does not use TF-IDF as a similarity measure among documents.  

ts_rank is higher if a document contains query terms more frequently. It does not take into account the global frequency of the term.  

ts_rank_cd is higher if a document contains query terms closer together and more frequently. It does not take into account the global frequency of the term.  

There is an extension from the text search creators called smlar, that lets you calculate the similarity between arrays using TF-IDF.   

It also lets you turn tsvectors into arrays, and supports fast indexing.  

PostgreSQL's built-in rank functions ignore IDF and only consider term frequency within the current document.

Currently PostgreSQL computes the relevance between a tsquery and a tsvector with ts_rank and ts_rank_cd; the algorithms are described in

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

How can we make PostgreSQL take IDF into account when computing rank?

See the following article for the details:

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

Introduction to the smlar extension

The smlar extension supports several similarity formulas (algorithms): cosine (the default), tfidf, and overlap. It also provides a function that computes similarity with a user-defined formula.

Function interfaces

1. Compute the similarity of two arrays

float4 smlar(anyarray, anyarray)  
    - computes similarity of two arrays. Arrays should be the same type.  
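
A minimal example with the default cosine formula: the two int arrays below share 2 of their 3 elements, so the expected similarity is 2/sqrt(3*3) ≈ 0.666667.

postgres=# select smlar('{1,4,6}'::int[], '{5,4,6}'::int[]);
  smlar   
----------
 0.666667
(1 row)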

2. Compute the similarity of two arrays of a user-defined composite type (element, weight). The calculation can use all elements, or be restricted to the overlapping elements only (with different weights the similarity changes, for example under the cosine formula).

You can also think of such an array as a weighted array carrying tf-idf values, for example [('中国', 0.1), ('日本', 0.1), ('海盗', 0.9)].

This is very useful for text similarity analysis.

float4 smlar(anyarray, anyarray, bool useIntersect)  
    -  computes similarity of two arrays of composite types. Composite type looks like:  
        CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);  
       the useIntersect option restricts the denominator to the intersected elements  
       see the examples in sql/composite_int4.sql or sql/composite_text.sql  

Examples

postgres=# create type tp as (c1 text, c2 float4);
CREATE TYPE

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.1), ('中国',1.1),('china',1)]::_tp);
  smlar   
----------
 0.816497
(1 row)

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.1), ('中国',1.1),('china',1)]::_tp,true);
 smlar 
-------
     1
(1 row)

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.2), ('中国',1.1),('china',1)]::_tp,true);
  smlar   
----------
 0.999822
(1 row)

Element order does not matter:
postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('中国',1.1),('你好',2.2),('china',1)]::_tp,true);
  smlar   
----------
 0.999822
(1 row)

Note that GiST indexes do not support the composite (weighted) type; attempting to use one raises:

elog(ERROR, "GiST  doesn't support composite (weighted) type");

3. Compute the similarity of two arrays with a user-defined formula containing the three variables N.i, N.a, and N.b; this lets you define your own algorithm.

float4 smlar( anyarray a, anyarray b, text formula );  
    - computes similarity of two arrays by given formula, arrays should   
    be the same type.   
    Predefined variables in formula:  
      N.i   - number of common elements in both arrays (intersection)  
      N.a   - number of unique elements in first array  
      N.b   - number of unique elements in second array  
    Example:  
    smlar('{1,4,6}'::int[], '{5,4,6}' )  
    smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)' )  
    Those calls are equivalent.  
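
For instance, a Jaccard-style formula can be supplied directly; with N.i = 2 and N.a = N.b = 3 the expected result is 2 / (3 + 3 - 2) = 0.5 (a sketch):

postgres=# select smlar('{1,4,6}'::int[], '{5,4,6}'::int[], 'N.i / (N.a + N.b - N.i)');
 smlar 
-------
   0.5
(1 row)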

Operator interface

1. Test whether two arrays are similar (returns true when the similarity is greater than the limit; the limit is set via the smlar.threshold parameter).

anyarray % anyarray  
    - returns true if similarity of that arrays is greater than limit  

float4 show_smlar_limit()  - deprecated  
    - shows the limit for % operation  

float4 set_smlar_limit(float4) - deprecated  
    - sets the limit for % operation  

Use the GUC variable smlar.threshold (see below) instead of   
show_smlar_limit/set_smlar_limit  
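
A quick sketch of the % operator with the default cosine formula: the two arrays below share 2 of their 3 elements, so the similarity is 2/sqrt(3*3) ≈ 0.67, which exceeds the threshold.

postgres=# set smlar.threshold = 0.6;
SET
postgres=# select '{1,2,3}'::int[] % '{2,3,4}'::int[];
 ?column? 
----------
 t
(1 row)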

Conversion function interface

1. Convert a tsvector to a text array

text[] tsvector2textarray(tsvector)  
    - transforms tsvector type to text array  

2. Sort an array and remove duplicate elements

anyarray array_unique(anyarray)  
    - sort and unique array  

3. Test whether an array contains an element; returns 1.0 if it does and 0 if it does not.

float4 inarray(anyarray, anyelement)  
    - returns zero if the second argument is not present in the first one,  
      and 1.0 otherwise  

4. A ternary-like variant: tests whether an array contains an element; returns the third argument if it does, and the fourth argument if it does not.

float4 inarray(anyarray, anyelement, float4, float4)  
    - returns the fourth argument if the second argument is not present in   
      the first one, and the third argument otherwise  

Parameters

1. The similarity threshold: when the similarity is below this value, the % operator (which tests whether two arrays are similar) returns false.

smlar.threshold  FLOAT  
    Arrays with similarity lower than threshold are not similar   
    by % operation  

2. Whether to keep the global statistics (idf) cache in transaction-independent memory

smlar.persistent_cache BOOL  
    Cache of global stat is stored in transaction-independent memory  

3. The similarity formula to use

smlar.type  STRING  
    Type of similarity formula: cosine(default), tfidf, overlap  

The relevant source code; note that the tfidf branch relies on the statistics table that stores idf data.

        switch(getSmlType())  
        {  
                case ST_TFIDF:  
                        PG_RETURN_FLOAT4( TFIDFSml(sa, sb) );  
                        break;  
                case ST_COSINE:  
                        {  
                                int                             cnt;  
                                double                  power;  

                                power = ((double)(sa->nelems)) * ((double)(sb->nelems));  
                                cnt = numOfIntersect(sa, sb);  

                                PG_RETURN_FLOAT4(  ((double)cnt) / sqrt( power ) );  
                        }  
                        break;  
                case ST_OVERLAP:  
                        {  
                                float4 res = (float4)numOfIntersect(sa, sb);  

                                PG_RETURN_FLOAT4(res);  
                        }  
                        break;  
                default:  
                        elog(ERROR,"Unsupported formula type of similarity");  
        }  

The algorithms are described in

《从相似度算法谈起 - Effective similarity search in PostgreSQL》
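
A quick sketch of how smlar.type changes the result for the same pair of arrays (per the source above, overlap simply returns the number of intersecting elements, while cosine divides that count by the square root of the product of the array lengths):

postgres=# set smlar.type = 'overlap';
SET
postgres=# select smlar('{1,4,6}'::int[], '{5,4,6}'::int[]);
 smlar 
-------
     2
(1 row)

postgres=# set smlar.type = 'cosine';
SET
postgres=# select smlar('{1,4,6}'::int[], '{5,4,6}'::int[]);
  smlar   
----------
 0.666667
(1 row)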

4. When the similarity formula is tfidf, this parameter must be set, and a statistics table is required that records, for each term, the number of documents it appears in, plus the total number of documents.

smlar.stattable STRING  
    Name of the table storing set-wide statistics. Table should be   
    defined as  

    CREATE TABLE table_name (  
        value   data_type UNIQUE,  -- the term  
        ndoc    int4 (or bigint)  NOT NULL CHECK (ndoc>0)  -- number of documents containing the term  
    );  

    And the row with a null value means the total number of documents.  
    See the examples in sql/*g.sql files  

    Note: used only for smlar.type = 'tfidf'  

5. Configuration when the similarity formula is tfidf.

How term frequency (TF) is calculated: raw count, 1 + log(count), or the constant 1.

smlar.tf_method STRING  
    Calculation method for term frequency. Values:  
        "n"     - simple counting of entries (default)  
        "log"   - 1 + log(n)  
        "const" - TF is equal to 1  
    Note: used only for smlar.type = 'tfidf'  

Note:

GIN index supports only smlar.tf_method = "const"

6. Configuration when the similarity formula is tfidf.

Whether idf is computed as log(1 + d/df) instead of log(d/df).

smlar.idf_plus_one BOOL  
    If false (default), calculate idf as log(d/df),  
    if true - as log(1+d/df)  
    Note: used only for smlar.type = 'tfidf'  

Recommended parameter configuration

The module provides several GUC variables (smlar.threshold and others); it is highly  
recommended to add them to postgresql.conf:  

custom_variable_classes = 'smlar'       # list of custom variable class names  
                                        # (only needed on PostgreSQL 9.1 and earlier)  
smlar.threshold = 0.6                   # or any other value > 0 and < 1  

and other smlar.* variables  

Index operator classes

The core of the smlar extension is computing the similarity of two arrays (of any type); to speed this up, it also supports indexes.

The operator classes for each array type are listed below; use them when creating indexes.

GiST/GIN support for  %  and  &&  operations for:  
  Array Type   |  GIN operator class  | GiST operator class    
---------------+----------------------+----------------------  
 bit[]         | _bit_sml_ops         |   
 bytea[]       | _bytea_sml_ops       | _bytea_sml_ops  
 char[]        | _char_sml_ops        | _char_sml_ops  
 cidr[]        | _cidr_sml_ops        | _cidr_sml_ops  
 date[]        | _date_sml_ops        | _date_sml_ops  
 float4[]      | _float4_sml_ops      | _float4_sml_ops  
 float8[]      | _float8_sml_ops      | _float8_sml_ops  
 inet[]        | _inet_sml_ops        | _inet_sml_ops  
 int2[]        | _int2_sml_ops        | _int2_sml_ops  
 int4[]        | _int4_sml_ops        | _int4_sml_ops  
 int8[]        | _int8_sml_ops        | _int8_sml_ops  
 interval[]    | _interval_sml_ops    | _interval_sml_ops  
 macaddr[]     | _macaddr_sml_ops     | _macaddr_sml_ops  
 money[]       | _money_sml_ops       |   
 numeric[]     | _numeric_sml_ops     | _numeric_sml_ops  
 oid[]         | _oid_sml_ops         | _oid_sml_ops  
 text[]        | _text_sml_ops        | _text_sml_ops  
 time[]        | _time_sml_ops        | _time_sml_ops  
 timestamp[]   | _timestamp_sml_ops   | _timestamp_sml_ops  
 timestamptz[] | _timestamptz_sml_ops | _timestamptz_sml_ops  
 timetz[]      | _timetz_sml_ops      | _timetz_sml_ops  
 varbit[]      | _varbit_sml_ops      |   
 varchar[]     | _varchar_sml_ops     | _varchar_sml_ops  
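
For example, to index a text[] column you would pick the matching operator class (a sketch using a hypothetical table t; choose GIN or GiST as needed, and note the GIN restriction on smlar.tf_method mentioned above):

create table t (id int primary key, tags text[]);
create index t_tags_gin_idx  on t using gin  (tags _text_sml_ops);
create index t_tags_gist_idx on t using gist (tags _text_sml_ops);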

Example: tfidf similarity with smlar

For an example using the default cosine formula, see

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

Now let's see how to use tfidf.

1. Install the smlar extension

/* see http://blog.databasepatterns.com/2014/07/postgresql-install-smlar-extension.html */  
create extension if not exists smlar;  

2. Create a test table

/* your basic table to search against: */  
create table documents (  
  document_id int primary key,  
  body text not null  
);  

3. Load test data. Note that you can generate test data directly at http://www.mockaroo.com.

/*   
   I created 100,000 "lorem ipsum" documents here http://www.mockaroo.com/c5418bd0  
   In retrospect, not a great choice due to the small number of unique words used to generate the dataset  
*/  
copy documents   
from program 'curl "http://www.mockaroo.com/c5418bd0/download?count=100000&key=5b15a410"'   
with (format csv, header true);  

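/* or load the same data from a previously downloaded local file: */  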
copy documents   
from '/home/digoal/test.csv'   
with (format csv, header true);  

4. Create the statistics table, which records how many documents each term appears in, plus the total number of documents.

For everyday use, we can export terms and their idf from an existing dictionary, generate the total document count and per-term document counts, and load them into this table.

For example, export from a word library such as scws or jieba and import it into this table.

/* this table holds document frequencies (# of docs in which a term appears) for the documents.body column: */  
create table documents_body_stats (  
  value text unique,  
  ndoc int not null  
);  

5. Generate the idf statistics from the local data (in real production this step can be skipped; importing IDF from a dictionary is recommended instead).

/* used ts_stat for convenience, not ideal, but good for quick n dirty: */  
insert into documents_body_stats  
  select  
    word,  
    ndoc  
  from   
    ts_stat( 'select to_tsvector(''simple'', body) from documents' );  

6. Insert the total document count

/* the smlar tfidf table needs the total document count as well. It's added as a row with null in the value column: */  
insert into documents_body_stats values   
  (null, (select count(*) from documents) );  

7. A function that converts text into a text array

/* turn documents into array of words. you could also use tsvector2textarray( to_tsvector(...) ) : */  
create or replace function tokenize(text) returns text[] language sql strict immutable as $$  
 select regexp_split_to_array( lower($1), '[^[:alnum:]]' );  
$$;  
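
For instance, tokenize simply lower-cases the text and splits it on non-alphanumeric characters:

postgres=# select tokenize('I want a chihuahua');
       tokenize       
----------------------
 {i,want,a,chihuahua}
(1 row)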

Note that converting a tsvector to an array drops duplicate occurrences (and the tsvector has already removed stop words):

postgres=# select tsvector2textarray(to_tsvector('i am digoal digoal')),  to_tsvector('i am digoal digoal');  
 tsvector2textarray | to_tsvector    
--------------------+--------------  
 {digoal}           | 'digoal':3,4  
(1 row)  

8. Create an expression index

/* use smlar's text array opclass. gist is a little more flexible than gin in this case (allows 'n' tf_method): */  
create index on documents using gist ( tokenize(body) _text_sml_ops ); --24 seconds  

9. Set the parameters so that array similarity is computed with the tfidf formula

/* setup smlar: */  
set smlar.type = 'tfidf';  
set smlar.stattable = 'documents_body_stats';  
set smlar.tf_method = 'n';  
set smlar.threshold = 0.4;  

10. Query

Rows are returned when the TFIDF similarity >= smlar.threshold.

/* the query */  
select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  


postgres=# explain (analyze,verbose,timing,costs,buffers) select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  
                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=2.18..2.18 rows=1 width=237) (actual time=7.647..7.647 rows=0 loops=1)  
   Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
   Buffers: shared hit=79  
   ->  Sort  (cost=2.18..2.18 rows=1 width=237) (actual time=7.647..7.647 rows=0 loops=1)  
         Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
         Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])) DESC  
         Sort Method: quicksort  Memory: 25kB  
         Buffers: shared hit=79  
         ->  Index Scan using documents_tokenize_idx on public.documents  (cost=0.14..2.17 rows=1 width=237) (actual time=7.641..7.641 rows=0 loops=1)  
               Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])  
               Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) % '{fringilla,malesuada,euismod}'::text[])  
               Rows Removed by Index Recheck: 61  
               Buffers: shared hit=79  
 Planning time: 0.148 ms  
 Execution time: 7.703 ms  
(15 rows)  

We can raise the threshold to filter out more rows and make the query more efficient.

postgres=# set smlar.threshold = 0.8;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  
                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=2.18..2.18 rows=1 width=237) (actual time=1.051..1.051 rows=0 loops=1)  
   Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
   Buffers: shared hit=29  
   ->  Sort  (cost=2.18..2.18 rows=1 width=237) (actual time=1.049..1.049 rows=0 loops=1)  
         Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
         Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])) DESC  
         Sort Method: quicksort  Memory: 25kB  
         Buffers: shared hit=29  
         ->  Index Scan using documents_tokenize_idx on public.documents  (cost=0.14..2.17 rows=1 width=237) (actual time=1.042..1.042 rows=0 loops=1)  
               Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])  
               Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) % '{fringilla,malesuada,euismod}'::text[])  
               Rows Removed by Index Recheck: 1  
               Buffers: shared hit=29  
 Planning time: 0.190 ms  
 Execution time: 1.113 ms  
(15 rows)  

If you don't need the results ordered by similarity, you can also drop the order by clause to improve performance further.

Training text idf with MADlib

Basic steps for clustering documents can vary a bit depending on your
precise goals, but one example would be:

1. Often there is an initial first level of processing to handle
tokenization, word stemming, stopword filtering, etc.

There is no right or wrong way of handling this step and there is a fair
amount of art to doing it well.

2. Build feature vectors for each document.
There are a variety of methods for handling this, one common example
would be to build tf-idfs (http://en.wikipedia.org/wiki/Tfidf).

For this you need the following metrics (see the SQL sketch at the end of this section):
a) the word count for each document
b) how many documents each word occurs in across the corpus
c) the total document count

3. Having produced feature vectors for your documents, you can then call
   the kmeans function in MADlib to perform the actual clustering
   (http://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html).
   Given a tf-idf feature vector the most common distance function is cosine
   similarity, though other distance functions may make sense depending on your use case.

http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
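
A minimal SQL sketch of those three metrics, assuming the documents table and the tokenize() function from the smlar example above:

/* a) term frequency of each word per document */  
select document_id, w as word, count(*) as tf  
from documents, unnest(tokenize(body)) as w  
group by document_id, w;  

/* b) document frequency: how many documents each word occurs in */  
select w as word, count(distinct document_id) as ndoc  
from documents, unnest(tokenize(body)) as w  
group by w;  

/* c) total number of documents */  
select count(*) as n_docs from documents;  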

The rum extension is also gaining tfidf support.

References

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

http://www.sigaev.ru/

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=blob;f=README;h=8fa81c51d749cc714e52257a15edf547f0b26edf;hb=8d02df18c0bbfd6ccba94a9582499ec8746047e5

http://www.pgcon.org/2012/schedule/events/443.en.html

http://www.pgcon.org/2012/schedule/attachments/252_smlar-2012.pdf

http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/
