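Business scenario
In the age of social media, people share their lives, opinions, and feelings on many platforms. For platform administrators and brand owners, understanding how users feel has become essential. To help with this, the pg_jieba extension for PostgreSQL can be used to segment posts into words and analyze their sentiment, forming the basis of a social media sentiment analysis system that classifies each post as positive, negative, or neutral and stores the result in the database.
This article takes the Chinese comments from a Douban movie review dataset on Kaggle, segments them with the pg_jieba extension on Alibaba Cloud PostgreSQL (several custom dictionaries can be defined and switched between), and runs statistical analyses on the segmentation results.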
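Data preparation
The Douban movie review dataset on Kaggle contains a large number of Chinese and English movie reviews, which makes it well suited to experimenting with scenario-based analysis using the pg_jieba extension. The dataset is available at: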
https://www.kaggle.com/datasets/utmhikari/doubanmovieshortcomments
Dataset metadata
The fields of the Kaggle review dataset are described as follows:
ID:the ID of the comment (start from 0)
Movie_Name_EN:the English name of the movie
Movie_Name_CN:the Chinese name of the movie
Crawl_Date:the date that the data are crawled
Number:the number of the comment
Username:the username of the account
Date:the date that the comment posted
Star:the star that users give to the movie (from 1 to 5, 5 grades)
Comment:the content of the comment
Like:the count of "like" on the comment
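Based on these fields, create the corresponding table in PostgreSQL. Note that like is a keyword, so the column is named Like_Count here; the Date field is stored as Comment_riqi. The table is created as follows: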
CREATE TABLE movie_comments (
ID SERIAL PRIMARY KEY,
Movie_Name_EN VARCHAR(255),
Movie_Name_CN VARCHAR(255),
Crawl_Date DATE,
Number INTEGER,
Username VARCHAR(255),
Comment_riqi DATE,
Star INTEGER,
Comment TEXT,
Like_Count INTEGER
);
Data import
import csv

from sqlalchemy import create_engine, Column, Integer, String, Date
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

# Connect to the PostgreSQL database using SQLAlchemy (placeholder credentials)
engine = create_engine('postgresql://XXXXXXXX:YYYYTTTT@pgm-ZZZZZZZZZZZ.pg.rds.aliyuncs.com:5432/demodb')
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base()


# ORM mapping for the movie_comments table created above
class MovieComments(Base):
    __tablename__ = 'movie_comments'
    id = Column(Integer, primary_key=True)
    movie_name_en = Column(String)
    movie_name_cn = Column(String)
    crawl_date = Column(Date)
    number = Column(Integer)
    username = Column(String)
    comment_riqi = Column(Date)
    star = Column(Integer)
    comment = Column(String)
    like_count = Column(Integer)


# Open the CSV file (assumed to be UTF-8 encoded) and parse the data
with open('DMSC.csv', 'r', encoding='utf-8', newline='') as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)  # Skip the header row
    count = 0
    for row in csvreader:
        # Build a MovieComments object from the row and add it to the session
        movie_comment = MovieComments(
            id=int(row[0]),
            movie_name_en=row[1],
            movie_name_cn=row[2],
            crawl_date=row[3],
            number=int(row[4]),
            username=row[5],
            comment_riqi=row[6],
            star=int(row[7]),
            comment=row[8],
            like_count=int(row[9]),
        )
        session.add(movie_comment)
        count += 1
        if count % 100 == 0:
            # Commit every 100 rows to keep each transaction small
            session.commit()
    # Commit the remaining rows
    session.commit()

# Close the database connection
session.close()
engine.dispose()
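The import loop calls int() on every numeric field and passes the date columns through as strings, so a single malformed row would abort the whole run. The following is a minimal sketch of a stricter row parser, not part of the original script: parse_row and the sample row are hypothetical, and it assumes the dates in the file are ISO formatted (YYYY-MM-DD).

```python
from datetime import date
from typing import List, Optional


def parse_row(row: List[str]) -> Optional[dict]:
    """Convert one CSV row (10 columns, in the dataset's order) to typed values.

    Returns None for a malformed row so the caller can skip or log it.
    Assumption: dates are ISO formatted (YYYY-MM-DD).
    """
    if len(row) != 10:
        return None
    try:
        return {
            "id": int(row[0]),
            "movie_name_en": row[1],
            "movie_name_cn": row[2],
            "crawl_date": date.fromisoformat(row[3]),
            "number": int(row[4]),
            "username": row[5],
            "comment_riqi": date.fromisoformat(row[6]),
            "star": int(row[7]),
            "comment": row[8],
            "like_count": int(row[9]),
        }
    except ValueError:
        # Raised by int() or date.fromisoformat() on bad input
        return None


# Made-up sample row in the dataset's column order
sample = ["0", "Avengers Age of Ultron", "复仇者联盟2", "2017-01-22", "1",
          "some_user", "2015-05-12", "3", "一般般", "2404"]
parsed = parse_row(sample)
```

In the loop, a row for which parse_row returns None could be skipped, and the resulting dictionary could be passed to the model as MovieComments(**parsed).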
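Custom dictionary
After importing the data, populate the custom dictionary by writing the Chinese and English movie names into the dictionary table. This improves segmentation accuracy for these names and gives the later analyses more meaningful tokens to work with: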
INSERT INTO JIEBA_USER_DICT(word, dict_name, weight)
SELECT TMP.Movie_Name_CN, 0, 100
FROM
(
SELECT DISTINCT Movie_Name_CN as Movie_Name_CN
FROM movie_comments
) AS TMP;
INSERT INTO JIEBA_USER_DICT(word, dict_name, weight)
SELECT TMP.Movie_Name_EN, 0, 100
FROM
(
SELECT DISTINCT Movie_Name_EN as Movie_Name_EN
FROM movie_comments
) AS TMP;
INSERT INTO jieba_user_dict VALUES ('钢铁侠',0,100);
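The dictionary rows can also be prepared on the client before insertion. The sketch below is only an illustration under assumptions: build_dict_rows is a hypothetical helper, and it mirrors the (word, dict_name, weight) layout and the values 0 and 100 used in the statements above, dropping blank names and duplicates the way SELECT DISTINCT does.

```python
from typing import Iterable, List, Tuple


def build_dict_rows(names: Iterable[str], dict_name: int = 0,
                    weight: int = 100) -> List[Tuple[str, int, int]]:
    """Return de-duplicated (word, dict_name, weight) rows, preserving input order."""
    seen = set()
    rows = []
    for name in names:
        word = name.strip()
        # Skip blank names and names that were already emitted
        if not word or word in seen:
            continue
        seen.add(word)
        rows.append((word, dict_name, weight))
    return rows


rows = build_dict_rows(["钢铁侠", "复仇者联盟", "钢铁侠", "  ", "Iron Man"])
```

Each tuple can then be bound as parameters to an INSERT INTO jieba_user_dict(word, dict_name, weight) statement.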
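Analysis scenario examples
Viewing the segmentation results
Comments can be segmented with the to_tsvector function and the jiebacfg text search configuration provided by pg_jieba. For example, the following SQL query returns the segmentation result for each comment: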
SELECT id, movie_name_cn, to_tsvector('jiebacfg', comment) as words
FROM movie_comments
limit 10;
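Word frequency statistics
The segmentation results can be analyzed statistically. For example, the following SQL query returns the number of times each word appears: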
demodb=> SELECT word, count(*) as frequency
demodb-> FROM (
demodb(> SELECT unnest(tsvector_to_array(words)) as word
demodb(> FROM (
demodb(> SELECT to_tsvector('jiebacfg', comment) as words
demodb(> FROM movie_comments
demodb(> ) sub1
demodb(> ) sub2
demodb-> GROUP BY word
demodb-> ORDER BY frequency DESC limit 10;
word | frequency
------+-----------
| 2124991
电影 | 303655
剧情 | 191414
没有 | 161814
不错 | 155734
说 | 131681
觉得 | 131395
好看 | 130803
喜欢 | 126598
一个 | 118641
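(10 rows)
The query first uses tsvector_to_array to convert each comment's segmentation result into an array, then uses unnest to expand these arrays into a single column, and finally groups and counts the values in that column.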
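One detail is worth keeping in mind when reading these numbers: a tsvector stores each distinct lexeme only once per document, so count(*) here is the number of comments that contain a word, not the total number of occurrences. The Python sketch below reproduces the same unnest-and-count logic on comments that are assumed to be already segmented; the token lists are made up for illustration and document_frequency is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List


def document_frequency(tokenized_comments: Iterable[List[str]]) -> Counter:
    """Count, for each word, the number of comments in which it appears."""
    counts = Counter()
    for tokens in tokenized_comments:
        # set() mirrors tsvector semantics: one entry per distinct lexeme
        counts.update(set(tokens))
    return counts


# Made-up, pre-segmented comments
comments = [
    ["电影", "好看", "电影"],   # "电影" occurs twice but is counted once
    ["剧情", "不错"],
    ["电影", "剧情", "没有"],
]
freq = document_frequency(comments)
```

Calling freq.most_common(10) then gives the equivalent of ORDER BY frequency DESC LIMIT 10.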
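Analyzing the reviews of a specific movie
If you are only interested in the comments on one movie, add a WHERE clause to narrow the analysis. For example, the following query returns the number of times each word appears in comments on movies whose Chinese title contains "复仇者联盟" (The Avengers):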
demodb=> SELECT word, count(*) as frequency
demodb-> FROM (
demodb(> SELECT unnest(tsvector_to_array(words)) as word
demodb(> FROM (
demodb(> SELECT to_tsvector('jiebacfg', comment) as words
demodb(> FROM movie_comments
demodb(> WHERE movie_name_cn like '%复仇者联盟%'
demodb(> ) sub1
demodb(> ) sub2
demodb-> GROUP BY word
demodb-> ORDER BY frequency DESC
demodb-> LIMIT 10;
word | frequency
--------+-----------
| 132433
电影 | 13480
英雄 | 12421
绿巨人 | 11514
剧情 | 10530
钢铁 | 8662
没有 | 8459
侠 | 7911
好看 | 7727
没 | 7200
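(10 rows)
Comparing high-rated and low-rated reviews
The words commonly used in high-rated and low-rated comments can be compared. For example, the following query returns the number of times each word appears in comments with a rating above 4 and in comments with a rating below 2: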
SELECT word, count(*) as frequency, 'high' as rating
FROM (
SELECT unnest(tsvector_to_array(words)) as word
FROM (
SELECT to_tsvector('jiebacfg', comment) as words
FROM movie_comments
WHERE star > 4
) sub1
) sub2
GROUP BY word
UNION ALL
SELECT word, count(*) as frequency, 'low' as rating
FROM (
SELECT unnest(tsvector_to_array(words)) as word
FROM (
SELECT to_tsvector('jiebacfg', comment) as words
FROM movie_comments
WHERE star < 2
) sub1
) sub2
GROUP BY word;
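A similar comparison can be written as a single query. Note that this version splits each comment on spaces with string_to_array instead of segmenting it with pg_jieba, so it only separates text that is already space-delimited: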
SELECT word, SUM(CASE WHEN star > 4 THEN 1 ELSE 0 END) AS high_score_count, SUM(CASE WHEN star < 2 THEN 1 ELSE 0 END) AS low_score_count
FROM (
SELECT word, star
FROM (
SELECT unnest(string_to_array(Comment, ' ')) AS word, star
FROM movie_comments
WHERE star > 4 OR star < 2
) AS words
WHERE length(word) > 1
) AS filtered_words
GROUP BY word
HAVING SUM(CASE WHEN star > 4 THEN 1 ELSE 0 END) > 0 AND SUM(CASE WHEN star < 2 THEN 1 ELSE 0 END) > 0
ORDER BY high_score_count DESC, low_score_count DESC, word ASC;
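This query first uses string_to_array to split each comment into an array of words, then uses unnest to expand the array into one row per word. Words shorter than 2 characters are filtered out. Finally, CASE expressions count how often each word occurs in high-rated and low-rated comments, and GROUP BY groups the rows by word. The HAVING clause ensures that only words appearing in both high-rated and low-rated comments are returned. The results are sorted by high-rating count, low-rating count, and then alphabetically by word.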
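The filtering and counting steps of this comparison can be sketched in Python as follows. This is an illustration rather than a translation of either query: compare_ratings is a hypothetical name, the input is assumed to be already segmented, and each word is counted once per comment (as in the tsvector-based query), while the length filter and the "present in both groups" condition follow the single-query version.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def compare_ratings(rows: Iterable[Tuple[int, List[str]]]) -> Dict[str, Tuple[int, int]]:
    """Return {word: (high_count, low_count)} for words seen in both groups.

    rows holds (star, tokens) pairs; star > 4 is "high", star < 2 is "low".
    """
    high = defaultdict(int)
    low = defaultdict(int)
    for star, tokens in rows:
        if star > 4:
            bucket = high
        elif star < 2:
            bucket = low
        else:
            continue  # Ratings 2 to 4 are ignored, like the WHERE clause
        for word in set(tokens):
            if len(word) > 1:  # Same filter as length(word) > 1
                bucket[word] += 1
    # Equivalent of the HAVING clause: keep words present in both groups
    return {w: (high[w], low[w]) for w in high if w in low}


# Made-up, pre-segmented sample data
sample_rows = [
    (5, ["电影", "好看"]),
    (1, ["电影", "难看"]),
    (3, ["电影"]),
    (5, ["电影", "的"]),
]
result = compare_ratings(sample_rows)
```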
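Analyzing word co-occurrence
You can measure how often two words appear together in the same comment. For example, the following query returns the number of comments in which both "电影" (movie) and "好看" (good-looking, enjoyable) appear: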
SELECT count(*) as cooccurrence
FROM (
SELECT to_tsvector('jiebacfg', comment) as words
FROM movie_comments
) sub
WHERE words @@ to_tsquery('jiebacfg', '电影 & 好看');
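The same match can be used to count how many distinct movies have at least one such comment:
SELECT COUNT(DISTINCT Movie_Name_CN) AS Movie_Count
FROM movie_comments
WHERE to_tsvector('jiebacfg', Comment) @@ to_tsquery('jiebacfg', '电影 & 好看');
@@ is PostgreSQL's full-text match operator; it checks whether a tsvector matches a given tsquery. A tsvector is the preprocessed, searchable form of a document, and a tsquery is the query used to search it.
to_tsvector('jiebacfg', Comment) converts the Comment column to a tsvector using the jiebacfg text search configuration, so that the pg_jieba extension performs the Chinese word segmentation.
to_tsquery('jiebacfg', '电影 & 好看') combines "电影" and "好看" with AND and converts them to a tsquery using the same configuration.
The @@ operator returns true if the tsvector matches the tsquery and false otherwise.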
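For an AND query, a document matches when every listed lexeme is present in it, regardless of order or distance. The sketch below models that check and both counts (matching comments and distinct movies) in Python. It assumes pre-segmented input; matches_all, cooccurrence_stats, and the sample rows are hypothetical and exist only to illustrate the logic.

```python
from typing import Iterable, List, Tuple


def matches_all(tokens: Iterable[str], required: Iterable[str]) -> bool:
    """True when every required word occurs among the tokens (AND semantics)."""
    return set(required).issubset(tokens)


def cooccurrence_stats(rows: Iterable[Tuple[str, List[str]]],
                       required: List[str]) -> Tuple[int, int]:
    """Return (matching comment count, distinct movie count) for (movie, tokens) rows."""
    comment_count = 0
    movies = set()
    for movie, tokens in rows:
        if matches_all(tokens, required):
            comment_count += 1
            movies.add(movie)
    return comment_count, len(movies)


# Made-up, pre-segmented sample data
sample_rows = [
    ("A", ["电影", "好看"]),
    ("A", ["好看", "的", "电影"]),
    ("B", ["电影", "难看"]),
    ("B", ["好看"]),
]
stats = cooccurrence_stats(sample_rows, ["电影", "好看"])
```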
Other analysis scenarios
Count the comments for each movie and sort by count in descending order.
SELECT Movie_Name_CN, COUNT(*) AS Comment_Count FROM movie_comments GROUP BY Movie_Name_CN ORDER BY Comment_Count DESC;
Find all comments rated 5 stars with more than 100 likes.
SELECT * FROM movie_comments WHERE Star = 5 AND Like_Count > 100;
Count the comments for each user and sort by count in descending order.
SELECT Username, COUNT(*) AS Comment_Count FROM movie_comments GROUP BY Username ORDER BY Comment_Count DESC;
Find the comments on a given movie rated 3 stars or lower and sort them by number of likes in descending order.
SELECT * FROM movie_comments WHERE Movie_Name_CN = '西游降魔篇' AND Star <= 3 ORDER BY Like_Count DESC;
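Count the comments posted in each month and sort chronologically. The comment date (Comment_riqi) is used here rather than the crawl date.
SELECT DATE_TRUNC('month', Comment_riqi) AS Month, COUNT(*) AS Comment_Count FROM movie_comments GROUP BY Month ORDER BY Month ASC;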
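DATE_TRUNC('month', ...) maps every date to the first day of its month, which is what makes the monthly grouping work. A minimal Python sketch of the same grouping, using made-up dates and hypothetical helper names:

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Tuple


def month_start(d: date) -> date:
    """Truncate a date to the first day of its month."""
    return d.replace(day=1)


def comments_per_month(dates: Iterable[date]) -> List[Tuple[date, int]]:
    """Return (month, count) pairs sorted chronologically."""
    counts = Counter(month_start(d) for d in dates)
    return sorted(counts.items())


# Made-up comment dates
sample_dates = [date(2015, 5, 12), date(2015, 5, 30), date(2015, 6, 1), date(2014, 12, 31)]
monthly = comments_per_month(sample_dates)
```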
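Notes
Before using the pg_jieba extension, pg_jieba must be added to the shared_preload_libraries parameter.
You can use the parameter settings feature of RDS PostgreSQL to add pg_jieba to shared_preload_libraries. For the detailed steps, see "Set instance parameters" (设置实例参数). Note in particular that after modifying the parameter you must click the Submit button; otherwise the change does not take effect and an error is reported.
About the jieba_load_user_dict function in RDS PG: its parameters differ between RDS PG versions, as follows:
1) 1.1.0 applies to PostgreSQL 10 to 13
2) 1.2.0 applies to PostgreSQL 14 and 15
In select jieba_load_user_dict(parameter 1, parameter 2):
Parameter 1 is the number of the custom dictionary to load.
Parameter 2 indicates whether to load the default dictionary: 0 loads the default dictionary, 1 does not.
To view the details of the pg_jieba extension: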
demodb=> \dx+ pg_jieba;
Objects in extension "pg_jieba"
Object Description
 function jieba_end(internal)
 function jieba_gettoken(internal,internal,internal)
 function jieba_gettoken_with_position(internal,internal,internal)
 function jieba_lextype(internal)
 function jieba_load_user_dict(integer,integer)
 function jieba_query_start(internal,integer)
 function jieba_start(internal,integer)
 table jieba_user_dict
 text search configuration jiebacfg
 text search configuration jiebacfg_pos
 text search configuration jiebaqry
 text search dictionary jieba_stem
 text search parser jieba
 text search parser jieba_position
 text search parser jiebaqry
 type word_type
(16 rows)
To view the part-of-speech (token type) table used by jieba segmentation:
demodb=> select * from ts_token_type('jiebaqry');
 tokid | alias |         description
-------+-------+-----------------------------
     1 | nz    | other proper noun
     2 | n     | noun
     3 | m     | numeral
     4 | i     | idiom
     5 | l     | temporary idiom
     6 | d     | adverb
     7 | s     | space
     8 | t     | time
     9 | mq    | numeral-classifier compound
- Using the tsvector_to_array function
tsvector_to_array is a PostgreSQL function that converts a tsvector into an array of its lexemes (text[]). tsvector is PostgreSQL's built-in full-text search type for storing preprocessed text, including lexemes, positions, and weights. tsvector_to_array keeps only the lexemes and discards the position and weight information. For example, tsvector_to_array('a:1 b:2 c:1 d:4'::tsvector) returns {a,b,c,d}. In full-text analysis queries, it is usually combined with unnest so that a tsvector can be analyzed word by word.
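Used together with unnest, it turns a tsvector into one row per word. The following example query splits the tsvector of every comment into individual word rows:
SELECT movie_name_cn, word
FROM (
SELECT movie_name_cn, unnest(tsvector_to_array(to_tsvector('jiebacfg', comment))) AS word
FROM movie_comments
) AS words
WHERE length(word) > 1
ORDER BY movie_name_cn, word;
The query first uses to_tsvector to convert the text in the comment column to a tsvector, then uses tsvector_to_array to convert the tsvector to an array of lexemes, and finally uses unnest to expand the array into one row per word. A WHERE clause filters out words shorter than 2 characters, and the results are sorted by movie name and word.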
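To make the behaviour of tsvector_to_array concrete, the sketch below models it in Python on the textual form of a tsvector: positions and weights after the colon are dropped and only the distinct lexemes remain. lexemes_from_tsvector_text is a hypothetical helper that handles only simple lexemes without embedded spaces or colons, and the result is sorted with Python's default ordering purely to make it deterministic.

```python
from typing import List


def lexemes_from_tsvector_text(text: str) -> List[str]:
    """Extract the distinct lexemes from a simple textual tsvector.

    Example entry formats handled: a:1, fat:2,4, rat:5A, 'cat':3
    """
    lexemes = set()
    for entry in text.split():
        # Keep the part before the first colon and drop optional quotes
        lexeme = entry.split(":", 1)[0].strip("'")
        if lexeme:
            lexemes.add(lexeme)
    return sorted(lexemes)


simple = lexemes_from_tsvector_text("a:1 b:2 c:1 d:4")
with_weights = lexemes_from_tsvector_text("fat:2,4 cat:3 rat:5A")
```

Here simple contains only a, b, c, and d, and with_weights contains cat, fat, and rat, with no position or weight data.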