Introduction: Hacking on Twitter Data
Since I'm already very familiar with Python, Twitter, and even NLTK, I'll get straight to the point.
With Twitter data, there are two questions we can focus on:
What are people talking about right now?
Extracting relationships from the tweets
And as data scientists, we should also pay attention to data visualization.
What are people talking about right now?
For NLP problems in Python, the nltk package can help.
>>> import nltk
>>> import cPickle
>>> words = cPickle.load(open("myData.pickle"))
>>> freq_dist = nltk.FreqDist(words)
>>> freq_dist.keys()[:50] # 50 most frequent tokens
[u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin', u'@justinbieber', u'be', u'the', u'tonight', u'gonna', u'at', u'in', u'bieber', u'and', u'you', u'watching', u'tina', u'for', u'a', u'wait', u'fey', u'of', u'@justinbieber:', u'if', u'with', u'so', u"can't", u'who', u'great', u'it', u'going', u'im', u':)', u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal', u'see', u'that', u'what', u'but', u'tonight!', u':d', u'2', u'will']
FreqDist makes it easy to count word frequencies and pick out the most frequent tokens as topics…
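The session above is Python 2 with an older NLTK, where FreqDist.keys() came back sorted by frequency. As a minimal Python 3 sketch of the same counting idea, the standard-library Counter can stand in for FreqDist (in NLTK 3, FreqDist is a Counter subclass and offers the same most_common method). The token list and the helper name top_tokens are made up for illustration.

```python
from collections import Counter

# Hypothetical token list standing in for the pickled tweet tokens.
words = "snl on rt snl justin on snl tonight on snl".split()

freq_dist = Counter(words)

def top_tokens(counter, n):
    """Return the n most frequent tokens, most frequent first."""
    return [token for token, _count in counter.most_common(n)]

top = top_tokens(freq_dist, 2)
print(top)
```

In practice you would probably filter stopwords such as "on" and "rt" before treating the top tokens as topics.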
Extracting relationships from the tweets
This is another typical Twitter problem: analyzing RT (retweet) relationships in order to analyze the social network…
>>> import re
>>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
>>> example_tweets = ["RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?",
... "Justin Bieber is on SNL 2nite. w00t?!? (via @SocialWebMining)"]
>>> for t in example_tweets:
...     rt_patterns.findall(t)
...
[('RT', ' @SocialWebMining')]
[('via', ' @SocialWebMining')]
This simple approach is enough to detect retweets.
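To get from detected retweets to a social graph, one option is to turn each match into a (source, retweeter) edge. The sketch below reuses the same regular expression; the function names and the (author, text) input shape are my own assumptions, not something the book prescribes.

```python
import re

rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def retweet_sources(text):
    """Return the screen names a tweet credits through 'RT' or 'via'."""
    sources = []
    for _marker, mentions in rt_patterns.findall(text):
        sources.extend(re.findall(r"@(\w+)", mentions))
    return sources

def retweet_edges(tweets):
    """Build (source, retweeter) pairs from (author, text) tuples."""
    return [(src, author)
            for author, text in tweets
            for src in retweet_sources(text)]

# Hypothetical authors; the tweet texts are the examples from above.
tweets = [
    ("alice", "RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?"),
    ("bob", "Justin Bieber is on SNL 2nite. w00t?!? (via @SocialWebMining)"),
    ("carol", "No retweet here"),
]
edges = retweet_edges(tweets)
print(edges)
```

Note that the pattern is a heuristic: a word that merely ends in "rt" right before a mention (say "start @user") would also match.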
Visualizing Tweet Graphs
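What is really valuable in this chapter is its introduction to data visualization...
The two visualization packages below are worth getting familiar with; they are very handy.
Protovis Gallery, based on JavaScript, easy to use in web services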
http://mbostock.github.com/protovis/ex/
Graphviz - Graph Visualization Software, based on DOT language
http://www.graphviz.org/Documentation.php
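As a rough illustration of how little is needed to feed Graphviz, here is a sketch that renders a list of edges as DOT text. The function name to_dot and the sample edges are hypothetical, and the sketch assumes node names contain no double quotes.

```python
def to_dot(edges, name="retweets"):
    """Render (source, target) pairs as a DOT digraph string."""
    lines = ["digraph %s {" % name]
    for src, dst in edges:
        # Quote node IDs so names with unusual characters stay valid DOT.
        lines.append('    "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)

dot = to_dot([("SocialWebMining", "alice"), ("SocialWebMining", "bob")])
print(dot)
```

Saving the output to a file such as graph.dot and running `dot -Tpng graph.dot -o graph.png` should produce an image.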
Microformats: Semantic Markup and Common Sense Collide
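What are microformats? They embed structured data in semi-structured documents such as HTML. They are an interim solution for the period in which the current web evolves toward the semantic web.
Anyone who has done web content extraction knows how painful it is to pull content out of HTML documents.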
This chapter begins by briefly introducing the microformats landscape and then digs right into some examples involving specific uses of the XFN (XHTML Friends Network), geo, hRecipe, and hReview microformats.
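To give a feel for why microformats make extraction less painful: XFN encodes relationships as values of the rel attribute on ordinary links, so a small parser is enough to recover them. This is a minimal sketch using only the standard library; the class name, the sample HTML, and the subset of XFN values are my own choices.

```python
from html.parser import HTMLParser

# A subset of the XFN relationship values; the full vocabulary is larger.
XFN_VALUES = {"friend", "acquaintance", "contact", "met",
              "co-worker", "colleague", "me"}

class XFNParser(HTMLParser):
    """Collect (href, [xfn values]) for every <a> tag carrying XFN rel values."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rels = set((attrs.get("rel") or "").split()) & XFN_VALUES
        if rels and attrs.get("href"):
            self.links.append((attrs["href"], sorted(rels)))

# Hypothetical blogroll snippet.
html = ('<a href="http://example.com/bob" rel="friend met">Bob</a> '
        '<a href="http://example.com/x">X</a>')
parser = XFNParser()
parser.feed(html)
print(parser.links)
```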
This chapter is rather boring, so I'll skip it.
Mailboxes: Oldies but Goodies
This chapter is about how to use CouchDB to store and analyze email…
CouchDB, a document oriented database that provides map/reduce capabilities that are quite nice for building up indexes on the data and performing an aggregate frequency analysis that answers questions such as, “How many messages were sent by so-and-so?” or “How many messages were sent out on such-and-such a date?”
CouchDB is very similar to MongoDB; its distinguishing features are that all access goes through a RESTful API and that it supports map/reduce.
CouchDB: The Definitive Guide (O’Reilly)
The chapter shows how to use CouchDB's map/reduce functionality to sort and to count frequencies,
and combines it with couchdb-lucene for full-text search.
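CouchDB views are normally written in JavaScript and run inside the database, so the following is not CouchDB's API. It is only a plain-Python imitation of the map, group, reduce flow behind a question like "How many messages were sent by so-and-so?". The field name "From" and the sample messages are assumptions.

```python
from collections import defaultdict

def map_sender(doc):
    """Map step: emit one (sender, 1) pair per message."""
    yield doc["From"], 1

def reduce_sum(values):
    """Reduce step: add up the values emitted for one key."""
    return sum(values)

def map_reduce(docs, map_fn, reduce_fn):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Return rows ordered by key, similar to how a view is sorted.
    return [(key, reduce_fn(values)) for key, values in sorted(grouped.items())]

# Hypothetical mailbox documents.
messages = [
    {"From": "bob@example.com", "Date": "2001-05-01"},
    {"From": "alice@example.com", "Date": "2001-05-01"},
    {"From": "alice@example.com", "Date": "2001-05-02"},
]
counts = map_reduce(messages, map_sender, reduce_sum)
print(counts)
```

Swapping the map function for one that emits doc["Date"] would answer the per-date question in the same way.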
One interesting find is a tool for visualizing timelines, SIMILE Timeline, http://simile-widgets.org/wiki/Timeline
Twitter: Friends, Followers, and Setwise Operations
This chapter describes how to analyze social relationships on Twitter.
OAuth
OAuth addresses the following problem:
An exchange of information (often called a “dance”) among a client application that needs access to a protected resource, a resource owner such as a social network, and an end user who needs to authorize the client application to access the protected resource (without giving it a username/password combination).
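Put simply: you want to use an application to access a social service, but you don't want to give the application your password for that service. With OAuth, the application redirects you to the service's site, where you authenticate with your password; once that succeeds, the service authorizes the application, so the application can get your information without ever knowing your password.
The following link gives a good explanation of the difference between Twitter's Basic HTTP authentication and OAuth: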
https://dev.twitter.com/docs/auth/authorizing-request
Perhaps I'm too familiar with Twitter, so I find the author's treatment too shallow. It covers the analysis of the following social problems:
Calculating Similarity by Computing Common Friends and Followers
Measuring Influence
Clique Detection and Analysis
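The first of these boils down to set operations on user IDs, which is where the chapter title comes from. A minimal sketch, assuming friend and follower IDs have already been fetched into Python sets (the IDs below are invented, and Jaccard similarity is just one reasonable way to score the overlap):

```python
def overlap(a, b):
    """Return the common members of two ID sets and their Jaccard similarity."""
    common = a & b
    union = a | b
    jaccard = len(common) / len(union) if union else 0.0
    return common, jaccard

# Hypothetical user IDs.
alice_friends = {1, 2, 3, 4}
bob_friends = {3, 4, 5}
alice_followers = {2, 4, 9}

common, similarity = overlap(alice_friends, bob_friends)

# Mutual relationships: people Alice follows who also follow her back.
mutual = alice_friends & alice_followers

print(common, similarity, mutual)
```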
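The most valuable parts of this chapter are:
It introduces the networkx package for graph analytics problems such as clique detection, http://networkx.lanl.gov/
It introduces using Redis to store social relationships, which is far more effective than traditional file storage
It introduces http://www.infochimps.com/ , a data services company that offers a range of APIs for social and geo data; they are quite handy
For example, the Strong Links API used in this chapter finds the Twitter users who have the most influence on a given user
It introduces Ubigraph, a 3D graph-drawing package with Python support, http://ubietylab.net/ubigraph/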
Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
In this chapter, we’ll largely use CouchDB’s map/reduce capabilities to exploit the entities in tweets (@mentions, #hashtags, etc.) to try to answer the question, “What’s everyone talking about?”
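A very interesting topic... Twitter's role as a media channel far outweighs its role as a social network, so analyzing tweet content is both important and fun.
A tweet itself is unstructured plain text, and analyzing it would normally require fairly strong NLP techniques. Because tweets are so short, people embed structurable entities in them in order to express more in very few words.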
Hashtag, #Bob, #WorldCup
Links, "http://bit.ly/9o4uoG"
Mention, @mentions, @crowdflower
Retweet, RT @user, via @user, r"(RT|via)((?:\b\W*@\w+)+)"
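These entities largely represent the content of a tweet, and extracting and analyzing them is very easy, with no need for complex NLP techniques
1) When you fetch tweets through the Twitter API, the response already includes the list of all entities
2) easy_install twitter-text-py, a package for extracting entities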
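If neither option is at hand, a few regular expressions give a rough approximation. This sketch is not twitter-text-py and will miss edge cases that the official rules handle (for example, a "#" inside a URL is counted as a hashtag); the patterns and the function name are my own.

```python
import re

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")

def extract_entities(text):
    """Return hashtags, mentions, and URLs found in a tweet's text."""
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
        "urls": URL.findall(text),
    }

# Hypothetical tweet built from the example entities above.
tweet = "RT @crowdflower: Great read on #WorldCup stats http://bit.ly/9o4uoG"
entities = extract_entities(tweet)
print(entities)
```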
The rest of the chapter runs analyses around these entities to answer questions like the following:
How many of the user entities that appear most frequently in Tim’s tweets are also his friends?
What are the most frequently occurring entities that appear in Tim’s tweets?
Who does Tim retweet the most often?
How many of Tim’s tweets get retweeted?
How many of Tim’s tweets contain at least one entity?
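Counting hashtags reveals the hottest topics
The proportion of a user's tweets that contain hashtags indicates how information-rich that user's tweets are
The overlap between the hashtags of two topics indicates how similar the two topics are
Counting links reveals the most popular social media
Counting mentions reveals friendships between users, as well as relationships between users and topics
Counting retweets reveals who is influential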
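The book does these counts with CouchDB map/reduce; as a minimal in-memory sketch of the same statistics, assume each tweet has already been reduced to its entities. The dictionary layout (including the retweet_of field) and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical pre-extracted entities for five tweets by one user.
tweets = [
    {"hashtags": ["WorldCup"], "mentions": ["crowdflower"], "retweet_of": "crowdflower"},
    {"hashtags": ["WorldCup", "Bob"], "mentions": [], "retweet_of": None},
    {"hashtags": [], "mentions": [], "retweet_of": None},
    {"hashtags": [], "mentions": ["SocialWebMining"], "retweet_of": "SocialWebMining"},
    {"hashtags": ["worldcup"], "mentions": ["crowdflower"], "retweet_of": "crowdflower"},
]

# Hottest topics: hashtag frequencies, case-folded.
hashtag_counts = Counter(tag.lower() for t in tweets for tag in t["hashtags"])

# Who does this user retweet most often?
retweet_counts = Counter(t["retweet_of"] for t in tweets if t["retweet_of"])

# How many tweets contain at least one entity, and what share carry hashtags?
with_entity = sum(1 for t in tweets if t["hashtags"] or t["mentions"])
hashtag_ratio = sum(1 for t in tweets if t["hashtags"]) / len(tweets)

print(hashtag_counts.most_common(1), retweet_counts.most_common(1))
print(with_entity, hashtag_ratio)
```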
Finally we reach Visualizing Tons of Tweets; this author clearly places great emphasis on visualization
Open source rotating Tag Cloud
WP-Cumulus, http://code.google.com/p/word-cumulus-goog-vis/wiki/UserGuide
LinkedIn: Clustering Your Professional Network for Fun (and Profit?)
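This chapter shows how to crawl your connections' information from LinkedIn and cluster the connections by job title.
The treatment is fairly shallow. Clustering first requires a way to measure how similar two titles are, so the chapter introduces various distances: edit distance, n-gram similarity, Jaccard distance, MASI distance
Most of these distance functions are implemented in NLTK. NLTK is powerful; I haven't used it in a long time and should revisit it when I have time
It then covers clustering methods: greedy, hierarchical, k-means… I won't go into the details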
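For readers who want something concrete, here is a small pure-Python sketch of two of those distances plus a greedy clustering pass. It is my own simplified version, not the book's code: each title joins the first cluster whose seed title is within a Jaccard distance threshold, and the threshold of 0.5 and the sample titles are arbitrary.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 minus the ratio of shared items to all items."""
    a, b = set(a), set(b)
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

def greedy_cluster(titles, threshold=0.5):
    """Put each title in the first cluster whose seed is within threshold."""
    clusters = []
    for title in titles:
        tokens = title.lower().split()
        for cluster in clusters:
            if jaccard_distance(tokens, cluster[0].lower().split()) <= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters

# Hypothetical job titles.
titles = ["Software Engineer", "Senior Software Engineer",
          "Chief Executive Officer", "Software Developer"]
clusters = greedy_cluster(titles)
print(clusters)
```

The result depends on input order and on the threshold, which is the usual weakness of greedy approaches compared with hierarchical or k-means clustering.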
Google Buzz: TF-IDF, Cosine Similarity, and Collocations
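This chapter introduces simple NLP techniques such as TF-IDF and cosine similarity; they are fairly basic, so I'll skip them
One section covers n-grams and bigrams, which relies entirely on NLTK functionality, so I'll skip that too.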
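Basic as they are, TF-IDF and cosine similarity fit in a few lines, so here is a minimal sketch for reference. It uses one common weighting (term frequency normalized by document length, times log(N/df)); other variants exist, and the function names and sample documents are made up.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Hypothetical tokenized posts.
docs = [
    "justin bieber on snl tonight".split(),
    "watching justin bieber on snl".split(),
    "couchdb map reduce views".split(),
]
vecs = tf_idf_vectors(docs)
sim_close = cosine_similarity(vecs[0], vecs[1])
sim_far = cosine_similarity(vecs[0], vecs[2])
print(sim_close, sim_far)
```

The first two documents share several terms and score well above the third, which shares none.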
Blogs et al.: Natural Language Processing (and Beyond)
This chapter covers deeper NLP techniques; personally I think you would be better off reading the NLTK book directly
Facebook: The All-in-One Wonder
From the standpoint of the social web, Facebook truly is an all-in-one wonder.
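It mainly covers Facebook's auth mechanism and how to use the Graph API
The book is fairly basic and well suited as an introduction to the field; most importantly, it introduces some handy visualization tools:
Protovis Gallery, based on JavaScript, easy to use in web services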
http://mbostock.github.com/protovis/ex/
Graphviz - Graph Visualization Software, based on DOT language
http://www.graphviz.org/Documentation.php
Ubigraph, a 3D graph-drawing package with Python support, http://ubietylab.net/ubigraph/
SIMILE Timeline, a tool for visualizing timelines, http://simile-widgets.org/wiki/Timeline
Open source rotating Tag Cloud
WP-Cumulus, http://code.google.com/p/word-cumulus-goog-vis/wiki/UserGuide
The JavaScript InfoVis Toolkit provides tools for creating Interactive Data Visualizations for the Web.
This article is excerpted from cnblogs (博客园); originally published 2012-01-31