对盗图、盗文、盗墓深恶痛绝吗?PostgreSQL结合余弦、线性相关算法 在文本、图片、数组相似 等领域的应用 - 2 smlar插件详解

本文涉及的产品
云原生数据库 PolarDB MySQL 版,Serverless 5000PCU 100GB
云原生数据库 PolarDB 分布式版,标准版 2核8GB
云数据库 RDS MySQL Serverless,0.5-2RCU 50GB
简介:

标签

PostgreSQL , 文本分析 , cosine , smlar , 相似性 , simlar , tf , idf , tf-idf , tag


背景

以2个例子作为开始,

例1

在数据库中有两条这样的记录

"I want a dog"  // 狗  
"I want a chihuahua"  // 吉娃娃狗  

然后使用这样的查询条件进行查询

"dog|chihuahua"  

很显然,两条记录都会被匹配到,但是哪条记录应该排在前面呢?

例2

在搜索引擎中搜索"狗|吉娃娃狗"

哪个会排在前面呢?试试就知道了,翻到第二页,以防被广告冲昏头脑

pic

很显然吉娃娃被排在了前面,为什么呢?

其实都是tf-idf算法起到的作用,因为在全局文本中,狗比吉娃娃出现的次数多,所以根据idf的算法 ( log(总文本数/包含此词的文本数) ) 狗的idf比吉娃娃的低。

那么当tf相同时,很显然tf*idf取决于idf的大小。

关于tf与idf的概念,请参考

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

那么作为支持全文检索的PostgreSQL数据库,默认支持TF-IDF吗?

PostgreSQL 默认不使用idf计算rank

我们在ts_rank, ts_rank_cd的代码中可以了解到,PostgreSQL这两个函数并不关心idf。

其实在网上也有人问过这样的问题

http://stackoverflow.com/questions/18296444/does-postgresql-use-tf-idf

Within the ts_rank function, there is no native method to rank results using their global (corpus) frequency. The rank algorithm does however rank based on frequency within the document:  

http://www.postgresql.org/docs/9.3/static/textsearch-controls.html  

So if I search for "dog|chihuahua" the following two documents would have the same rank despite the relatively lower frequency of the word "chihuahua":  

"I want a dog"  // 狗  
"I want a chihuahua"  // 奇瓦瓦狗  

However, the following line would get ranked higher than the previous two lines above, because it contains the stemmed token "dog" twice in the document:  

"dog lovers have an average of 1.5 dogs"  

In short: higher term frequency(TF) within the document results in a higher rank, but a lower term frequency in the corpus has no impact.  

One caveat: the text search does ignore stop-words, so you will not match on ultra high frequency words like "the","a","of","for" etc (assuming you have correctly set your language)  
Postgres does not use TF-IDF as a similarity measure among documents.  

ts_rank is higher if a document contains query terms more frequently. It does not take into account the global frequency of the term.  

ts_rank_cd is higher if a document contains query terms closer together and more frequently. It does not take into account the global frequency of the term.  

There is an extension from the text search creators called smlar, that lets you calculate the similarity between arrays using TF-IDF.   

It also lets you turn tsvectors into arrays, and supports fast indexing.  

PostgreSQL内置的rank计算方法并不关心IDF,而仅仅计算当前文本的词频。

目前PostgreSQL通过ts_rank与ts_rank_cd计算tsquery与tsvector的相关性,算法详见

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

如何让PostgreSQL计算rank时关心IDF?

如何让PostgreSQL计算rank时关心IDF?详见此文

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

smlar插件介绍

smlar插件支持多种相似度计算公式(算法),cosine(default), tfidf, overlap。同时还提供了自定义公式计算相似度的函数。

函数接口

1. 计算数组的相似度

float4 smlar(anyarray, anyarray)  
    - computes similary of two arrays. Arrays should be the same type.  

2. 计算自定义复合数组(元素,权重)的数组的相似度,可以计算所有元素,同时也允许只计算重叠元素的部分(当权重不同时,相似度不同,例如cosine算法)。

或者我们把它理解为包含tfidf的加权数组,比如 [('中国', 0.1), ('日本', 0.1), ('海盗', 0.9)]

在文本相似度分析中很有用。

float4 smlar(anyarray, anyarray, bool useIntersect)  
    -  computes similary of two arrays of composite types. Composite type looks like:  
        CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);  
       useIntersect option points to use only intersected elements in denominator  
       see an exmaples in sql/composite_int4.sql or sql/composite_text.sql  

例子

postgres=# create type tp as (c1 text, c2 float4);
CREATE TYPE

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.1), ('中国',1.1),('china',1)]::_tp);
  smlar   
----------
 0.816497
(1 row)

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.1), ('中国',1.1),('china',1)]::_tp,true);
 smlar 
-------
     1
(1 row)

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.2), ('中国',1.1),('china',1)]::_tp,true);
  smlar   
----------
 0.999822
(1 row)

与顺序无关
postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('中国',1.1),('你好',2.2),('china',1)]::_tp,true);
  smlar   
----------
 0.999822
(1 row)

注意

elog(ERROR, "GiST  doesn't support composite (weighted) type");

3. 通过自定义公式,公式中包含3个变量N.i, N.a, N.b,计算两个数组的相似性,你可以自定义算法。

float4 smlar( anyarray a, anyarray b, text formula );  
    - computes similary of two arrays by given formula, arrays should   
    be the same type.   
    Predefined variables in formula:  
      N.i   - number of common elements in both array (intersection)  
      N.a   - number of uniqueelements in first array  
      N.b   - number of uniqueelements in second array  
    Example:  
    smlar('{1,4,6}'::int[], '{5,4,6}' )  
    smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)' )  
    That calls are equivalent.  

操作符接口

1. 判断两个数组是否相似,(当相似值大于limit值时, limit值通过smlar.threshold参数设置)

anyarray % anyarray  
    - returns true if similarity of that arrays is greater than limit  

float4 show_smlar_limit()  - deprecated  
    - shows the limit for % operation  

float4 set_smlar_limit(float4) - deprecated  
    - sets the limit for % operation  

Use instead of show_smlar_limit/set_smlar_limit GUC variable   
smlar.threshold (see below)  

转换函数接口

1. 将tsvector类型转换为text array类型

text[] tsvector2textarray(tsvector)  
    - transforms tsvector type to text array  

2. 对数组内的元素排序并去除重复元素

anyarray array_unique(anyarray)  
    - sort and unique array  

3. 判断数组内是否包含某元素,包含时返回1.0, 不包含时返回0。

float4 inarray(anyarray, anyelement)  
    - returns zero if second argument does not present in a first one  
      and 1.0 in opposite case  

4. 类似三目操作符,判断数组内是否包含某元素,包含时返回第三个参数的值,不包含时返回第四个参数值。

float4 inarray(anyarray, anyelement, float4, float4)  
    - returns fourth argument if second argument does not present in   
      a first one and third argument in opposite case  

参数

1. 相似度LIMIT值,当相似性低于这个值时,%操作符(计算两个数组是否相似) 将返回false。

smlar.threshold  FLOAT  
    Array's with similarity lower than threshold are not similar   
    by % operation  

2. 是否持久化idf表

smlar.persistent_cache BOOL  
    Cache of global stat is stored in transaction-independent memory  

3. 默认的相似性计算公式

smlar.type  STRING  
    Type of similarity formula: cosine(default), tfidf, overlap  

源码,其中tfidf计算需要用到统计idf的表。

        switch(getSmlType())  
        {  
                case ST_TFIDF:  
                        PG_RETURN_FLOAT4( TFIDFSml(sa, sb) );  
                        break;  
                case ST_COSINE:  
                        {  
                                int                             cnt;  
                                double                  power;  

                                power = ((double)(sa->nelems)) * ((double)(sb->nelems));  
                                cnt = numOfIntersect(sa, sb);  

                                PG_RETURN_FLOAT4(  ((double)cnt) / sqrt( power ) );  
                        }  
                        break;  
                case ST_OVERLAP:  
                        {  
                                float4 res = (float4)numOfIntersect(sa, sb);  

                                PG_RETURN_FLOAT4(res);  
                        }  
                        break;  
                default:  
                        elog(ERROR,"Unsupported formula type of similarity");  
        }  

算法详见

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

4. 当相似性计算公式为tfidf时,需要设置这个参数,并且需要一张统计信息表,记录每个词的idf,以及总共有多少文本。

smlar.stattable STRING  
    Name of table stored set-wide statistic. Table should be   
    defined as  

    CREATE TABLE table_name (  
        value   data_type UNIQUE,  -- 词  
        ndoc    int4 (or bigint)  NOT NULL CHECK (ndoc>0)  -- 该词一共出现在几个文本中  
    );  

    And row with null value means total number of documents.  
    See an examples in sql/*g.sql files  

    Note: used on for smlar.type = 'tfidf'  

5. 当相似性计算公式为tfidf时的配置。

TF(词频)计算方法设置:出现次数、1+log(出现次数)、常数1

smlar.tf_method STRING  
    Calculation method for term frequency. Values:  
        "n"     - simple counting of entries (default)  
        "log"   - 1 + log(n)  
        "const" - TF is equal to 1  
    Note: used on for smlar.type = 'tfidf'  

注意 :

GIN index supports only smlar.tf_method = \"const\""

6. 当相似性计算公式为tfidf时的配置。

idf是否+1后再取LOG

smlar.idf_plus_one BOOL  
    If false (default), calculate idf as log(d/df),  
    if true - as log(1+d/df)  
    Note: used on for smlar.type = 'tfidf'  

建议的参数配置

Module provides several GUC variables smlar.threshold, it's highly  
recommended to add to postgesql.conf:  

custom_variable_classes = 'smlar'       # list of custom variable class names  
smlar.threshold = 0.6                   #or any other value > 0 and < 1  

and other smlar.* variables  

索引op class

smlar插件的核心,实际上是计算两个数组(任意类型的数组)的相似性,当然为了提高速度,它也支持索引。

不同类型对应的ops如下,在创建索引是需要使用它们

GiST/GIN support for  %  and  &&  operations for:  
  Array Type   |  GIN operator class  | GiST operator class    
---------------+----------------------+----------------------  
 bit[]         | _bit_sml_ops         |   
 bytea[]       | _bytea_sml_ops       | _bytea_sml_ops  
 char[]        | _char_sml_ops        | _char_sml_ops  
 cidr[]        | _cidr_sml_ops        | _cidr_sml_ops  
 date[]        | _date_sml_ops        | _date_sml_ops  
 float4[]      | _float4_sml_ops      | _float4_sml_ops  
 float8[]      | _float8_sml_ops      | _float8_sml_ops  
 inet[]        | _inet_sml_ops        | _inet_sml_ops  
 int2[]        | _int2_sml_ops        | _int2_sml_ops  
 int4[]        | _int4_sml_ops        | _int4_sml_ops  
 int8[]        | _int8_sml_ops        | _int8_sml_ops  
 interval[]    | _interval_sml_ops    | _interval_sml_ops  
 macaddr[]     | _macaddr_sml_ops     | _macaddr_sml_ops  
 money[]       | _money_sml_ops       |   
 numeric[]     | _numeric_sml_ops     | _numeric_sml_ops  
 oid[]         | _oid_sml_ops         | _oid_sml_ops  
 text[]        | _text_sml_ops        | _text_sml_ops  
 time[]        | _time_sml_ops        | _time_sml_ops  
 timestamp[]   | _timestamp_sml_ops   | _timestamp_sml_ops  
 timestamptz[] | _timestamptz_sml_ops | _timestamptz_sml_ops  
 timetz[]      | _timetz_sml_ops      | _timetz_sml_ops  
 varbit[]      | _varbit_sml_ops      |   
 varchar[]     | _varchar_sml_ops     | _varchar_sml_ops  

smlar 实现 tfidf相似性的 例子

默认的cosine算法的例子参考

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

接下来看看如何使用tfidf

1. 安装smlar插件

/* see http://blog.databasepatterns.com/2014/07/postgresql-install-smlar-extension.html */  
create extension if not exists smlar;  

2. 创建测试表

/* your basic table to search against: */  
create table documents (  
  document_id int primary key,  
  body text not null  
);  

3. 导入测试数据,注意可以直接到http://www.mockaroo.com勾选生成测试数据

/*   
   I created 100,000 "lorem ipsum" documents here http://www.mockaroo.com/c5418bd0  
   In retrospect, not a great choice due to the small number of unique words used to generate the dataset  
*/  
copy documents   
from program 'curl "http://www.mockaroo.com/c5418bd0/download?count=100000&key=5b15a410"'   
with (format csv, header true);  

copy documents   
from '/home/digoal/test.csv'   
with (format csv, header true);  

4. 创建统计表, 记录每个词在多少篇文本中出现过,以及总的文本数。

对于日常的使用,我们可以从词库中导出词组与idf,并生成文本总数,与词出现的次数,导入这张表

比如从scws, jieba分词等词库,导出并导入该表

/* this table holds document frequencies (# of docs in which a term appears) for the documents.body column: */  
create table documents_body_stats (  
  value text unique,  
  ndoc int not null  
);  

5. 使用本地数据,生成idf,(实际生产时,可以忽略,建议使用从词库导入的IDF)

/* used ts_stat for convenience, not ideal, but good for quick n dirty: */  
insert into documents_body_stats  
  select  
    word,  
    ndoc  
  from   
    ts_stat( 'select to_tsvector(''simple'', body) from documents' );  

6. 插入文本数

/* the smlar tfdif table needs the total document count as well. It's added as a row with null in the value column: */  
insert into documents_body_stats values   
  (null, (select count(*) from documents) );  

7. 将文本转换为text数组的函数

/* turn documents into array of words. you could also use tsvector2textarray( to_tsvector(...) ) : */  
create or replace function tokenize(text) returns text[] language sql strict immutable as $$  
 select regexp_split_to_array( lower($1), '[^[:alnum:]]' );  
$$;  

注意tsvector转换为数组时,会丢失重复值

postgres=# select tsvector2textarray(to_tsvector('i am digoal digoal')),  to_tsvector('i am digoal digoal');  
 tsvector2textarray | to_tsvector    
--------------------+--------------  
 {digoal}           | 'digoal':3,4  
(1 row)  

8. 创建表达式索引

/* use smlar's text array opclass. gist is a little more flexible than gin in this case (allows 'n' tf_method): */  
create index on documents using gist ( tokenize(body) _text_sml_ops ); --24 seconds  

9. 参数设置,使用tfidf公式计算数组相似度

/* setup smlar: */  
set smlar.type = 'tfidf';  
set smlar.stattable = 'documents_body_stats';  
set smlar.tf_method = 'n';  
set smlar.threshold = 0.4;  

10. 查询

当TFIDF similarity >= smlar.threshold时,返回。

/* the query */  
select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  


postgres=# explain (analyze,verbose,timing,costs,buffers) select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  
                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=2.18..2.18 rows=1 width=237) (actual time=7.647..7.647 rows=0 loops=1)  
   Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
   Buffers: shared hit=79  
   ->  Sort  (cost=2.18..2.18 rows=1 width=237) (actual time=7.647..7.647 rows=0 loops=1)  
         Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
         Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])) DESC  
         Sort Method: quicksort  Memory: 25kB  
         Buffers: shared hit=79  
         ->  Index Scan using documents_tokenize_idx on public.documents  (cost=0.14..2.17 rows=1 width=237) (actual time=7.641..7.641 rows=0 loops=1)  
               Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])  
               Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) % '{fringilla,malesuada,euismod}'::text[])  
               Rows Removed by Index Recheck: 61  
               Buffers: shared hit=79  
 Planning time: 0.148 ms  
 Execution time: 7.703 ms  
(15 rows)  

我们可以把相似度调高,从而排除更多的记录,提高查询效率

postgres=# set smlar.threshold = 0.8;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  
                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=2.18..2.18 rows=1 width=237) (actual time=1.051..1.051 rows=0 loops=1)  
   Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
   Buffers: shared hit=29  
   ->  Sort  (cost=2.18..2.18 rows=1 width=237) (actual time=1.049..1.049 rows=0 loops=1)  
         Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
         Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])) DESC  
         Sort Method: quicksort  Memory: 25kB  
         Buffers: shared hit=29  
         ->  Index Scan using documents_tokenize_idx on public.documents  (cost=0.14..2.17 rows=1 width=237) (actual time=1.042..1.042 rows=0 loops=1)  
               Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])  
               Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) % '{fringilla,malesuada,euismod}'::text[])  
               Rows Removed by Index Recheck: 1  
               Buffers: shared hit=29  
 Planning time: 0.190 ms  
 Execution time: 1.113 ms  
(15 rows)  

如果你不需要对结果按相似度排序的话,也可以把order by去掉,提升性能。  

MADlib训练文本idf

Basic steps for clustering documents can vary a bit depending on your
precise goals, but one example would be:

1. Often there is an initial first level of processing to handle
tokenization, word stemming, filter stopwords etc.

There is no right or wrong way of handling this step and there is a fair
amount of art to doing it well.

2. Build feature vectors for each document.
There are a variety of methods for handling this, one common example
would be to build tf-idfs (http://en.wikipedia.org/wiki/Tfidf).

For which you need the following metrics:
a) word count for each document
b) how many documents word occur across all documents
c) documents count

  1. Having produced a feature vectors for your documents you can then call
    the kmeans function in MADlib to perform the actual clustering (
    http://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html).
    Given a tf-idf
    feature vector the most common distance function is cosine similarity,
    though other distance functions may make sense depending on your use case.

http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html

rum也在支持tfidf了。

参考

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

http://www.sigaev.ru/

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=blob;f=README;h=8fa81c51d749cc714e52257a15edf547f0b26edf;hb=8d02df18c0bbfd6ccba94a9582499ec8746047e5

http://www.pgcon.org/2012/schedule/events/443.en.html

http://www.pgcon.org/2012/schedule/attachments/252_smlar-2012.pdf

http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
相关文章
|
1天前
|
机器学习/深度学习 算法
应用规则学习算法识别有毒的蘑菇
应用规则学习算法识别有毒的蘑菇
|
2天前
|
存储 算法 Java
数据结构与算法 数组和链表
数据结构与算法 数组和链表
8 0
|
6天前
|
存储 机器学习/深度学习 算法
R语言贝叶斯Metropolis-Hastings采样 MCMC算法理解和应用可视化案例
R语言贝叶斯Metropolis-Hastings采样 MCMC算法理解和应用可视化案例
|
6天前
|
数据采集 机器学习/深度学习 算法
数据分享|WEKA关联规则挖掘Apriori算法在学生就业数据中的应用
数据分享|WEKA关联规则挖掘Apriori算法在学生就业数据中的应用
|
6天前
|
存储 算法
Leetcode 30天高效刷数据结构和算法 Day1 两数之和 —— 无序数组
给定一个无序整数数组和目标值,找出数组中和为目标值的两个数的下标。要求不重复且可按任意顺序返回。示例:输入nums = [2,7,11,15], target = 9,输出[0,1]。暴力解法时间复杂度O(n²),优化解法利用哈希表实现,时间复杂度O(n)。
17 0
|
10天前
|
机器学习/深度学习 自然语言处理 算法
机器学习算法原理与应用:深入探索与实战
【5月更文挑战第2天】本文深入探讨机器学习算法原理,包括监督学习(如线性回归、SVM、神经网络)、非监督学习(聚类、PCA)和强化学习。通过案例展示了机器学习在图像识别(CNN)、自然语言处理(RNN/LSTM)和推荐系统(协同过滤)的应用。随着技术发展,机器学习正广泛影响各领域,但也带来隐私和算法偏见问题,需关注解决。
|
11天前
|
机器学习/深度学习 算法 C语言
【C言专栏】递归算法在 C 语言中的应用
【4月更文挑战第30天】本文介绍了递归算法在C语言中的应用,包括基本概念(通过调用自身解决子问题)、特点(调用自身、终止条件、栈空间)和实现步骤(定义递归函数、分解问题、设置终止条件、组合解)。文中通过阶乘计算和斐波那契数列两个案例展示了递归的使用,强调了递归可能导致的栈溢出问题及优化需求。学习递归有助于理解和应用“分而治之”策略。
|
11天前
|
机器学习/深度学习 数据可视化 算法
【Python机器学习专栏】t-SNE算法在数据可视化中的应用
【4月更文挑战第30天】t-SNE算法是用于高维数据可视化的非线性降维技术,通过最小化Kullback-Leibler散度在低维空间保持数据点间关系。其特点包括:高维到二维/三维映射、保留局部结构、无需预定义簇数量,但计算成本高。Python中可使用`scikit-learn`的`TSNE`类实现,结合`matplotlib`进行可视化。尽管计算昂贵,t-SNE在揭示复杂数据集结构上极具价值。
|
11天前
|
机器学习/深度学习 算法 数据挖掘
【Python机器学习专栏】层次聚类算法的原理与应用
【4月更文挑战第30天】层次聚类是数据挖掘中的聚类技术,无需预设簇数量,能生成数据的层次结构。分为凝聚(自下而上)和分裂(自上而下)两类,常用凝聚层次聚类有最短/最长距离、群集平均和Ward方法。优点是自动确定簇数、提供层次结构,适合小到中型数据集;缺点是计算成本高、过程不可逆且对异常值敏感。在Python中可使用`scipy.cluster.hierarchy`进行实现。尽管有局限,层次聚类仍是各领域强大的分析工具。
|
11天前
|
机器学习/深度学习 算法 前端开发
【Python机器学习专栏】集成学习算法的原理与应用
【4月更文挑战第30天】集成学习通过组合多个基学习器提升预测准确性,广泛应用于分类、回归等问题。主要步骤包括生成基学习器、训练和结合预测结果。算法类型有Bagging(如随机森林)、Boosting(如AdaBoost)和Stacking。Python中可使用scikit-learn实现,如示例代码展示的随机森林分类。集成学习能降低模型方差,缓解过拟合,提高预测性能。

相关产品

  • 云原生数据库 PolarDB