对盗图、盗文、盗墓深恶痛绝吗?PostgreSQL结合余弦、线性相关算法 在文本、图片、数组相似 等领域的应用 - 2 smlar插件详解

本文涉及的产品
RDS MySQL Serverless 基础系列,0.5-2RCU 50GB
RDS PostgreSQL Serverless,0.5-4RCU 50GB 3个月
推荐场景:
对影评进行热评分析
云数据库 RDS SQL Server,基础系列 2核4GB
简介:

标签

PostgreSQL , 文本分析 , cosine , smlar , 相似性 , simlar , tf , idf , tf-idf , tag


背景

以2个例子作为开始,

例1

在数据库中有两条这样的记录

"I want a dog"  // 狗  
"I want a chihuahua"  // 吉娃娃狗  

然后使用这样的查询条件进行查询

"dog|chihuahua"  

很显然,两条记录都会被匹配到,但是哪条记录应该排在前面呢?

例2

在搜索引擎中搜索"狗|吉娃娃狗"

哪个会排在前面呢?试试就知道了,翻到第二页,以防被广告冲昏头脑

pic

很显然吉娃娃被排在了前面,为什么呢?

其实都是tf-idf算法起到的作用,因为在全局文本中,狗比吉娃娃出现的次数多,所以根据idf的算法 ( log(总文本数/包含此词的文本数) ) 狗的idf比吉娃娃的低。

那么当tf相同时,很显然tf*idf取决于idf的大小。

关于tf与idf的概念,请参考

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

那么作为支持全文检索的PostgreSQL数据库,默认支持TF-IDF吗?

PostgreSQL 默认不使用idf计算rank

我们在ts_rank, ts_rank_cd的代码中可以了解到,PostgreSQL这两个函数并不关心idf。

其实在网上也有人问过这样的问题

http://stackoverflow.com/questions/18296444/does-postgresql-use-tf-idf

Within the ts_rank function, there is no native method to rank results using their global (corpus) frequency. The rank algorithm does however rank based on frequency within the document:  

http://www.postgresql.org/docs/9.3/static/textsearch-controls.html  

So if I search for "dog|chihuahua" the following two documents would have the same rank despite the relatively lower frequency of the word "chihuahua":  

"I want a dog"  // 狗  
"I want a chihuahua"  // 奇瓦瓦狗  

However, the following line would get ranked higher than the previous two lines above, because it contains the stemmed token "dog" twice in the document:  

"dog lovers have an average of 1.5 dogs"  

In short: higher term frequency(TF) within the document results in a higher rank, but a lower term frequency in the corpus has no impact.  

One caveat: the text search does ignore stop-words, so you will not match on ultra high frequency words like "the","a","of","for" etc (assuming you have correctly set your language)  
Postgres does not use TF-IDF as a similarity measure among documents.  

ts_rank is higher if a document contains query terms more frequently. It does not take into account the global frequency of the term.  

ts_rank_cd is higher if a document contains query terms closer together and more frequently. It does not take into account the global frequency of the term.  

There is an extension from the text search creators called smlar, that lets you calculate the similarity between arrays using TF-IDF.   

It also lets you turn tsvectors into arrays, and supports fast indexing.  

PostgreSQL内置的rank计算方法并不关心IDF,而仅仅计算当前文本的词频。

目前PostgreSQL通过ts_rank与ts_rank_cd计算tsquery与tsvector的相关性,算法详见

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

如何让PostgreSQL计算rank时关心IDF?

如何让PostgreSQL计算rank时关心IDF?详见此文

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

smlar插件介绍

smlar插件支持多种相似度计算公式(算法),cosine(default), tfidf, overlap。同时还提供了自定义公式计算相似度的函数。

函数接口

1. 计算数组的相似度

float4 smlar(anyarray, anyarray)  
    - computes similary of two arrays. Arrays should be the same type.  

2. 计算自定义复合数组(元素,权重)的数组的相似度,可以计算所有元素,同时也允许只计算重叠元素的部分(当权重不同时,相似度不同,例如cosine算法)。

或者我们把它理解为包含tfidf的加权数组,比如 [('中国', 0.1), ('日本', 0.1), ('海盗', 0.9)]

在文本相似度分析中很有用。

float4 smlar(anyarray, anyarray, bool useIntersect)  
    -  computes similary of two arrays of composite types. Composite type looks like:  
        CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);  
       useIntersect option points to use only intersected elements in denominator  
       see an exmaples in sql/composite_int4.sql or sql/composite_text.sql  

例子

postgres=# create type tp as (c1 text, c2 float4);
CREATE TYPE

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.1), ('中国',1.1),('china',1)]::_tp);
  smlar   
----------
 0.816497
(1 row)

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.1), ('中国',1.1),('china',1)]::_tp,true);
 smlar 
-------
     1
(1 row)

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.2), ('中国',1.1),('china',1)]::_tp,true);
  smlar   
----------
 0.999822
(1 row)

与顺序无关
postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('中国',1.1),('你好',2.2),('china',1)]::_tp,true);
  smlar   
----------
 0.999822
(1 row)

注意

elog(ERROR, "GiST  doesn't support composite (weighted) type");

3. 通过自定义公式,公式中包含3个变量N.i, N.a, N.b,计算两个数组的相似性,你可以自定义算法。

float4 smlar( anyarray a, anyarray b, text formula );  
    - computes similary of two arrays by given formula, arrays should   
    be the same type.   
    Predefined variables in formula:  
      N.i   - number of common elements in both array (intersection)  
      N.a   - number of uniqueelements in first array  
      N.b   - number of uniqueelements in second array  
    Example:  
    smlar('{1,4,6}'::int[], '{5,4,6}' )  
    smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)' )  
    That calls are equivalent.  

操作符接口

1. 判断两个数组是否相似,(当相似值大于limit值时, limit值通过smlar.threshold参数设置)

anyarray % anyarray  
    - returns true if similarity of that arrays is greater than limit  

float4 show_smlar_limit()  - deprecated  
    - shows the limit for % operation  

float4 set_smlar_limit(float4) - deprecated  
    - sets the limit for % operation  

Use instead of show_smlar_limit/set_smlar_limit GUC variable   
smlar.threshold (see below)  

转换函数接口

1. 将tsvector类型转换为text array类型

text[] tsvector2textarray(tsvector)  
    - transforms tsvector type to text array  

2. 对数组内的元素排序并去除重复元素

anyarray array_unique(anyarray)  
    - sort and unique array  

3. 判断数组内是否包含某元素,包含时返回1.0, 不包含时返回0。

float4 inarray(anyarray, anyelement)  
    - returns zero if second argument does not present in a first one  
      and 1.0 in opposite case  

4. 类似三目操作符,判断数组内是否包含某元素,包含时返回第三个参数的值,不包含时返回第四个参数值。

float4 inarray(anyarray, anyelement, float4, float4)  
    - returns fourth argument if second argument does not present in   
      a first one and third argument in opposite case  

参数

1. 相似度LIMIT值,当相似性低于这个值时,%操作符(计算两个数组是否相似) 将返回false。

smlar.threshold  FLOAT  
    Array's with similarity lower than threshold are not similar   
    by % operation  

2. 是否持久化idf表

smlar.persistent_cache BOOL  
    Cache of global stat is stored in transaction-independent memory  

3. 默认的相似性计算公式

smlar.type  STRING  
    Type of similarity formula: cosine(default), tfidf, overlap  

源码,其中tfidf计算需要用到统计idf的表。

        switch(getSmlType())  
        {  
                case ST_TFIDF:  
                        PG_RETURN_FLOAT4( TFIDFSml(sa, sb) );  
                        break;  
                case ST_COSINE:  
                        {  
                                int                             cnt;  
                                double                  power;  

                                power = ((double)(sa->nelems)) * ((double)(sb->nelems));  
                                cnt = numOfIntersect(sa, sb);  

                                PG_RETURN_FLOAT4(  ((double)cnt) / sqrt( power ) );  
                        }  
                        break;  
                case ST_OVERLAP:  
                        {  
                                float4 res = (float4)numOfIntersect(sa, sb);  

                                PG_RETURN_FLOAT4(res);  
                        }  
                        break;  
                default:  
                        elog(ERROR,"Unsupported formula type of similarity");  
        }  

算法详见

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

4. 当相似性计算公式为tfidf时,需要设置这个参数,并且需要一张统计信息表,记录每个词的idf,以及总共有多少文本。

smlar.stattable STRING  
    Name of table stored set-wide statistic. Table should be   
    defined as  

    CREATE TABLE table_name (  
        value   data_type UNIQUE,  -- 词  
        ndoc    int4 (or bigint)  NOT NULL CHECK (ndoc>0)  -- 该词一共出现在几个文本中  
    );  

    And row with null value means total number of documents.  
    See an examples in sql/*g.sql files  

    Note: used on for smlar.type = 'tfidf'  

5. 当相似性计算公式为tfidf时的配置。

TF(词频)计算方法设置:出现次数、1+log(出现次数)、常数1

smlar.tf_method STRING  
    Calculation method for term frequency. Values:  
        "n"     - simple counting of entries (default)  
        "log"   - 1 + log(n)  
        "const" - TF is equal to 1  
    Note: used on for smlar.type = 'tfidf'  

注意 :

GIN index supports only smlar.tf_method = \"const\""

6. 当相似性计算公式为tfidf时的配置。

idf是否+1后再取LOG

smlar.idf_plus_one BOOL  
    If false (default), calculate idf as log(d/df),  
    if true - as log(1+d/df)  
    Note: used on for smlar.type = 'tfidf'  

建议的参数配置

Module provides several GUC variables smlar.threshold, it's highly  
recommended to add to postgesql.conf:  

custom_variable_classes = 'smlar'       # list of custom variable class names  
smlar.threshold = 0.6                   #or any other value > 0 and < 1  

and other smlar.* variables  

索引op class

smlar插件的核心,实际上是计算两个数组(任意类型的数组)的相似性,当然为了提高速度,它也支持索引。

不同类型对应的ops如下,在创建索引是需要使用它们

GiST/GIN support for  %  and  &&  operations for:  
  Array Type   |  GIN operator class  | GiST operator class    
---------------+----------------------+----------------------  
 bit[]         | _bit_sml_ops         |   
 bytea[]       | _bytea_sml_ops       | _bytea_sml_ops  
 char[]        | _char_sml_ops        | _char_sml_ops  
 cidr[]        | _cidr_sml_ops        | _cidr_sml_ops  
 date[]        | _date_sml_ops        | _date_sml_ops  
 float4[]      | _float4_sml_ops      | _float4_sml_ops  
 float8[]      | _float8_sml_ops      | _float8_sml_ops  
 inet[]        | _inet_sml_ops        | _inet_sml_ops  
 int2[]        | _int2_sml_ops        | _int2_sml_ops  
 int4[]        | _int4_sml_ops        | _int4_sml_ops  
 int8[]        | _int8_sml_ops        | _int8_sml_ops  
 interval[]    | _interval_sml_ops    | _interval_sml_ops  
 macaddr[]     | _macaddr_sml_ops     | _macaddr_sml_ops  
 money[]       | _money_sml_ops       |   
 numeric[]     | _numeric_sml_ops     | _numeric_sml_ops  
 oid[]         | _oid_sml_ops         | _oid_sml_ops  
 text[]        | _text_sml_ops        | _text_sml_ops  
 time[]        | _time_sml_ops        | _time_sml_ops  
 timestamp[]   | _timestamp_sml_ops   | _timestamp_sml_ops  
 timestamptz[] | _timestamptz_sml_ops | _timestamptz_sml_ops  
 timetz[]      | _timetz_sml_ops      | _timetz_sml_ops  
 varbit[]      | _varbit_sml_ops      |   
 varchar[]     | _varchar_sml_ops     | _varchar_sml_ops  

smlar 实现 tfidf相似性的 例子

默认的cosine算法的例子参考

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

接下来看看如何使用tfidf

1. 安装smlar插件

/* see http://blog.databasepatterns.com/2014/07/postgresql-install-smlar-extension.html */  
create extension if not exists smlar;  

2. 创建测试表

/* your basic table to search against: */  
create table documents (  
  document_id int primary key,  
  body text not null  
);  

3. 导入测试数据,注意可以直接到http://www.mockaroo.com勾选生成测试数据

/*   
   I created 100,000 "lorem ipsum" documents here http://www.mockaroo.com/c5418bd0  
   In retrospect, not a great choice due to the small number of unique words used to generate the dataset  
*/  
copy documents   
from program 'curl "http://www.mockaroo.com/c5418bd0/download?count=100000&key=5b15a410"'   
with (format csv, header true);  

copy documents   
from '/home/digoal/test.csv'   
with (format csv, header true);  

4. 创建统计表, 记录每个词在多少篇文本中出现过,以及总的文本数。

对于日常的使用,我们可以从词库中导出词组与idf,并生成文本总数,与词出现的次数,导入这张表

比如从scws, jieba分词等词库,导出并导入该表

/* this table holds document frequencies (# of docs in which a term appears) for the documents.body column: */  
create table documents_body_stats (  
  value text unique,  
  ndoc int not null  
);  

5. 使用本地数据,生成idf,(实际生产时,可以忽略,建议使用从词库导入的IDF)

/* used ts_stat for convenience, not ideal, but good for quick n dirty: */  
insert into documents_body_stats  
  select  
    word,  
    ndoc  
  from   
    ts_stat( 'select to_tsvector(''simple'', body) from documents' );  

6. 插入文本数

/* the smlar tfdif table needs the total document count as well. It's added as a row with null in the value column: */  
insert into documents_body_stats values   
  (null, (select count(*) from documents) );  

7. 将文本转换为text数组的函数

/* turn documents into array of words. you could also use tsvector2textarray( to_tsvector(...) ) : */  
create or replace function tokenize(text) returns text[] language sql strict immutable as $$  
 select regexp_split_to_array( lower($1), '[^[:alnum:]]' );  
$$;  

注意tsvector转换为数组时,会丢失重复值

postgres=# select tsvector2textarray(to_tsvector('i am digoal digoal')),  to_tsvector('i am digoal digoal');  
 tsvector2textarray | to_tsvector    
--------------------+--------------  
 {digoal}           | 'digoal':3,4  
(1 row)  

8. 创建表达式索引

/* use smlar's text array opclass. gist is a little more flexible than gin in this case (allows 'n' tf_method): */  
create index on documents using gist ( tokenize(body) _text_sml_ops ); --24 seconds  

9. 参数设置,使用tfidf公式计算数组相似度

/* setup smlar: */  
set smlar.type = 'tfidf';  
set smlar.stattable = 'documents_body_stats';  
set smlar.tf_method = 'n';  
set smlar.threshold = 0.4;  

10. 查询

当TFIDF similarity >= smlar.threshold时,返回。

/* the query */  
select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  


postgres=# explain (analyze,verbose,timing,costs,buffers) select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  
                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=2.18..2.18 rows=1 width=237) (actual time=7.647..7.647 rows=0 loops=1)  
   Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
   Buffers: shared hit=79  
   ->  Sort  (cost=2.18..2.18 rows=1 width=237) (actual time=7.647..7.647 rows=0 loops=1)  
         Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
         Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])) DESC  
         Sort Method: quicksort  Memory: 25kB  
         Buffers: shared hit=79  
         ->  Index Scan using documents_tokenize_idx on public.documents  (cost=0.14..2.17 rows=1 width=237) (actual time=7.641..7.641 rows=0 loops=1)  
               Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])  
               Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) % '{fringilla,malesuada,euismod}'::text[])  
               Rows Removed by Index Recheck: 61  
               Buffers: shared hit=79  
 Planning time: 0.148 ms  
 Execution time: 7.703 ms  
(15 rows)  

我们可以把相似度调高,从而排除更多的记录,提高查询效率

postgres=# set smlar.threshold = 0.8;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  
                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=2.18..2.18 rows=1 width=237) (actual time=1.051..1.051 rows=0 loops=1)  
   Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
   Buffers: shared hit=29  
   ->  Sort  (cost=2.18..2.18 rows=1 width=237) (actual time=1.049..1.049 rows=0 loops=1)  
         Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
         Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])) DESC  
         Sort Method: quicksort  Memory: 25kB  
         Buffers: shared hit=29  
         ->  Index Scan using documents_tokenize_idx on public.documents  (cost=0.14..2.17 rows=1 width=237) (actual time=1.042..1.042 rows=0 loops=1)  
               Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])  
               Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) % '{fringilla,malesuada,euismod}'::text[])  
               Rows Removed by Index Recheck: 1  
               Buffers: shared hit=29  
 Planning time: 0.190 ms  
 Execution time: 1.113 ms  
(15 rows)  

如果你不需要对结果按相似度排序的话,也可以把order by去掉,提升性能。  

MADlib训练文本idf

Basic steps for clustering documents can vary a bit depending on your
precise goals, but one example would be:

1. Often there is an initial first level of processing to handle
tokenization, word stemming, filter stopwords etc.

There is no right or wrong way of handling this step and there is a fair
amount of art to doing it well.

2. Build feature vectors for each document.
There are a variety of methods for handling this, one common example
would be to build tf-idfs (http://en.wikipedia.org/wiki/Tfidf).

For which you need the following metrics:
a) word count for each document
b) how many documents word occur across all documents
c) documents count

  1. Having produced a feature vectors for your documents you can then call
    the kmeans function in MADlib to perform the actual clustering (
    http://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html).
    Given a tf-idf
    feature vector the most common distance function is cosine similarity,
    though other distance functions may make sense depending on your use case.

http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html

rum也在支持tfidf了。

参考

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

http://www.sigaev.ru/

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=blob;f=README;h=8fa81c51d749cc714e52257a15edf547f0b26edf;hb=8d02df18c0bbfd6ccba94a9582499ec8746047e5

http://www.pgcon.org/2012/schedule/events/443.en.html

http://www.pgcon.org/2012/schedule/attachments/252_smlar-2012.pdf

http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍如何基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
数据采集 算法 数据可视化
基于Python的k-means聚类分析算法的实现与应用,可以用在电商评论、招聘信息等各个领域的文本聚类及指标聚类,效果很好
本文介绍了基于Python实现的k-means聚类分析算法,并通过微博考研话题的数据清洗、聚类数量评估、聚类分析实现与结果可视化等步骤,展示了该算法在文本聚类领域的应用效果。
608 1
|
机器学习/深度学习 存储 人工智能
文本情感识别分析系统Python+SVM分类算法+机器学习人工智能+计算机毕业设计
使用Python作为开发语言,基于文本数据集(一个积极的xls文本格式和一个消极的xls文本格式文件),使用Word2vec对文本进行处理。通过支持向量机SVM算法训练情绪分类模型。实现对文本消极情感和文本积极情感的识别。并基于Django框架开发网页平台实现对用户的可视化操作和数据存储。
357 0
文本情感识别分析系统Python+SVM分类算法+机器学习人工智能+计算机毕业设计
|
机器学习/深度学习 数据采集 算法
Python基于KMeans算法进行文本聚类项目实战
Python基于KMeans算法进行文本聚类项目实战
|
数据采集 自然语言处理 数据可视化
基于Python的社交媒体评论数据挖掘,使用LDA主题分析、文本聚类算法、情感分析实现
本文介绍了基于Python的社交媒体评论数据挖掘方法,使用LDA主题分析、文本聚类算法和情感分析技术,对数据进行深入分析和可视化,以揭示文本数据中的潜在主题、模式和情感倾向。
1827 0
|
文字识别 算法 Java
文本,保存图片09,一个可以用id作为图片名字的pom插件,利用雪花算法生成唯一的id
文本,保存图片09,一个可以用id作为图片名字的pom插件,利用雪花算法生成唯一的id
|
算法 数据可视化 搜索推荐
基于python的k-means聚类分析算法,对文本、数据等进行聚类,有轮廓系数和手肘法检验
本文详细介绍了基于Python实现的k-means聚类分析算法,包括数据准备、预处理、标准化、聚类数目确定、聚类分析、降维可视化以及结果输出的完整流程,并应用该算法对文本数据进行聚类分析,展示了轮廓系数法和手肘法检验确定最佳聚类数目的方法。
590 0
|
算法 JavaScript
「AIGC算法」将word文档转换为纯文本
使用Node.js模块`mammoth`和`html-to-text`,该代码示例演示了如何将Word文档(.docx格式)转换为纯文本以适应AIGC的文本识别。流程包括将Word文档转化为HTML,然后进一步转换为纯文本,进行格式调整,并输出到控制台。转换过程中考虑了错误处理。提供的代码片段展示了具体的实现细节,包括关键库的导入和转换函数的调用。
332 0
|
存储 JSON 关系型数据库
PostgreSQL Json应用场景介绍和Shared Detoast优化
PostgreSQL Json应用场景介绍和Shared Detoast优化
|
关系型数据库 数据库 PostgreSQL
Docker【应用 03】给Docker部署的PostgreSQL数据库安装PostGIS插件(安装流程及问题说明)
Docker【应用 03】给Docker部署的PostgreSQL数据库安装PostGIS插件(安装流程及问题说明)
1278 0
|
关系型数据库 数据库 PostgreSQL
PostgreSQL【应用 01】使用Vector插件实现向量相似度查询(Docker部署的PostgreSQL安装pgvector插件说明)和Milvus向量库对比
PostgreSQL【应用 01】使用Vector插件实现向量相似度查询(Docker部署的PostgreSQL安装pgvector插件说明)和Milvus向量库对比
1770 1

相关产品

  • 云原生数据库 PolarDB
  • 云数据库 RDS PostgreSQL 版
  • 推荐镜像

    更多