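I want to get the count of each word in this Wikipedia dataset (people_wiki.csv). I can get every word and its number of occurrences as a dictionary, but I can't split the dictionary's key-value pairs into separate columns. I have tried several approaches (from_dict, from_records, to_frame, pivot_table, etc.). Is this possible in Python?
Sample dataset: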
URI name text
http://dbpedia.org/resource/George_Clooney George Clooney 'george timothy clooney born may 6 1961 is an american actor writer producer director and activist he has received three golden globe awards for his work as an actor and two academy awards one for acting and the other for producingclooney made his...'
I tried:
clooney_word_count_table = pd.DataFrame.from_dict(clooney['word_count'], orient='index', columns=['word','count'])
I also tried:
clooney['word_count'].to_frame()
Here is my code:
import pandas as pd
from collections import Counter
people = pd.read_csv("people_wiki.csv")
clooney = people[people['name'] == 'George Clooney']
clooney['word_count'] = clooney['text'].apply(lambda x: Counter(x.split(' ')))
clooney_word_count_table = pd.DataFrame.from_dict(clooney['word_count'], orient='index', columns=['word','count'])
clooney_word_count_table
Output:
word_count
35817 {'george': 1, 'timothy': 1, 'clooney': 9, 'ii': ...
I would like clooney_word_count_table to be an output DataFrame with 2 columns:
word count
normalize 1
george 3
combat 1
producer 2
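The problem is that clooney is a DataFrame (with a single row at index 35817), so clooney['word_count'] is a Series with index 35817 that holds one value (your count dictionary).
DataFrame.from_dict then treats this Series as equivalent to {35817: {'george': 1, ...}}, which gives you the confusing result.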
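To make that concrete, here is a minimal sketch that uses a made-up one-row frame in place of people_wiki.csv (the sample text is invented, and the row label 35817 is borrowed from the output above purely for illustration). It shows that the column is a one-element Series whose only value is the Counter, and that pulling that value out with .iloc[0] gives a plain mapping that converts cleanly into two columns.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the single George Clooney row.
clooney = pd.DataFrame(
    {
        "name": ["George Clooney"],
        "text": ["george clooney is an actor and an activist"],
    },
    index=[35817],
)
clooney["word_count"] = clooney["text"].apply(lambda x: Counter(x.split()))

series = clooney["word_count"]
print(len(series))         # the Series has exactly one element
print(list(series.index))  # its only label is the row index

# Take the Counter out of the Series, then build the two-column table.
counts = series.iloc[0]
table = pd.DataFrame(list(counts.items()), columns=["word", "count"])
print(table)
```

Extracting the single value first is one possible route; it works only because there is exactly one matching row.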
Try something like:
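c = Counter()
# update() mutates c in place; apply() is used here only for that side effect
clooney['text'].apply(lambda x: c.update(x.split()))
# the words become the index and the counts go into a single 'count' column
pd.DataFrame.from_dict(c, orient='index', columns=['count'])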
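For the exact two-column layout asked for (word and count as ordinary columns), the index can be named and moved into a column. The following is a self-contained sketch under stated assumptions: the small in-memory frame stands in for people_wiki.csv, the second row is invented, and the Clooney text is only the opening fragment quoted in the question.

```python
from collections import Counter

import pandas as pd

# Hypothetical miniature of people_wiki.csv (only the columns needed here).
people = pd.DataFrame(
    {
        "name": ["George Clooney", "Someone Else"],
        "text": [
            "george timothy clooney born may 6 1961 is an american actor "
            "writer producer director and activist he has received three "
            "golden globe awards for his work as an actor and two academy awards",
            "placeholder text for another person",
        ],
    }
)

clooney = people[people["name"] == "George Clooney"]

# Accumulate counts across every matching row (here just one).
c = Counter()
for text in clooney["text"]:
    c.update(text.split())

# Words start out as the index; rename it and turn it into a column.
table = (
    pd.DataFrame.from_dict(c, orient="index", columns=["count"])
    .rename_axis("word")
    .reset_index()
)
print(table.sort_values("count", ascending=False).head())
```

A plain loop is used instead of apply() because the goal is only the side effect on c; either form should give the same counts.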