Mining of Massive Datasets – Link Analysis

Introduction:

5.1 PageRank

5.1.1 Early Search Engines and Term Spam

As people began to use search engines to find their way around the Web, unethical people saw the opportunity to fool search engines into leading people to their page.

Techniques for fooling search engines into believing your page is about something it is not are called term spam.

The ability of term spammers to operate so easily rendered early search engines almost useless. To combat term spam, Google introduced two innovations:

1. PageRank was used to simulate where Web surfers, starting at a random page, would tend to congregate if they repeatedly followed randomly chosen out-links from the page where they were currently located. Pages that would have a large number of surfers were considered more “important” than pages that would rarely be visited. Google prefers important pages to unimportant pages when deciding which pages to show first in response to a search query.

2. The content of a page was judged not only by the terms appearing on that page, but by the terms used in or near the links to that page.

It is reasonable to ask why simulation of random surfers should allow us to approximate the intuitive notion of the “importance” of pages. There are two related motivations that inspired this approach.

• Users of the Web “vote with their feet.” They tend to place links to pages they think are good or useful pages to look at, rather than bad or useless pages. 
• The behavior of a random surfer indicates which pages users of the Web are likely to visit. Users are more likely to visit useful pages than useless pages.

 

5.1.2 Definition of PageRank

Think of the Web as a directed graph, where pages are the nodes, and there is an arc from page p1 to page p2 if there are one or more links from p1 to p2.

Here is an example of a tiny version of the Web, with only four pages:

Page A has links to each of the other three pages;

Page B has links to A and D only;

Page C has a link only to A;

Page D has links to B and C only.

In general, we can define the transition matrix of the Web to describe what happens to random surfers after one step. This matrix M has n rows and columns, if there are n pages. The element mij in row i and column j has value 1/k if page j has k arcs out, and one of them is to page i.

The transition matrix M for the example above is shown below. It is easy to see why: A has links to 3 pages, so the probability of moving from A to each of them is 1/3, and likewise for the other columns.

      A     B     C     D
A     0    1/2    1     0
B    1/3    0     0    1/2
C    1/3    0     0    1/2
D    1/3   1/2    0     0

So, given this transition matrix, how do we produce PageRank?

PageRank simulates random surfing of the Web: the higher the probability that a page is visited, the more important it should be, that is, the higher its PageRank should be.

Initially the surfer is equally likely to be at any page, so the initial distribution v0 has 1/n in each component. How, then, do we compute the probability that each page is being visited after n random steps on the Web?

If M is the transition matrix of the Web, then after one step, the distribution of the surfer will be Mv0, the v1 below. Repeating the multiplication gives vn after n steps, the probability that each page is being visited after n random steps, i.e. each page's PageRank. This is a Markov process:

v0 = (1/4, 1/4, 1/4, 1/4)
v1 = M·v0 = (9/24, 5/24, 5/24, 5/24)
v2 = M·v1 = (15/48, 11/48, 11/48, 11/48)
v3 = M·v2 = (11/32, 7/32, 7/32, 7/32)
...
vn → (3/9, 2/9, 2/9, 2/9) in the limit
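
To make the iteration concrete, here is a minimal sketch of the power iteration in Python with NumPy. The function name and the fixed step count are our own illustration, not from the book; the matrix is the four-page example above.

```python
import numpy as np

# Transition matrix M of the four-page example (columns are A, B, C, D).
M = np.array([
    [0,   1/2, 1, 0  ],   # arcs into A (from B and C)
    [1/3, 0,   0, 1/2],   # arcs into B (from A and D)
    [1/3, 0,   0, 1/2],   # arcs into C (from A and D)
    [1/3, 1/2, 0, 0  ],   # arcs into D (from A and B)
])

def pagerank_power_iteration(M, steps=50):
    n = M.shape[0]
    v = np.full(n, 1 / n)     # v0: surfer equally likely to be anywhere
    for _ in range(steps):
        v = M @ v             # one step of the Markov process: v' = Mv
    return v

print(pagerank_power_iteration(M))   # approaches (3/9, 2/9, 2/9, 2/9)
```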

To see why multiplying a distribution vector v by M gives the distribution x = Mv at the next step, we reason as follows.

The probability xi that a random surfer will be at node i at the next step is Σj mij vj. Here, mij is the probability that a surfer at node j will move to node i at the next step (often 0 because there is no link from j to i), and vj is the probability that the surfer was at node j at the previous step.

 

As we shall see, computing PageRank by simulating random surfers is a time-consuming process. One might think that simply counting the number of in-links for each page would be a good approximation to where random surfers would wind up.

Simplified PageRank Doesn’t Work

If that is all we did, then the hypothetical shirt-seller could simply create a “spam farm” of a million pages, each of which linked to his shirt page. Then the shirt page would look very important indeed, and a search engine would be fooled.

 

5.1.3 Structure of the Web

The previous section gave the basic PageRank algorithm, but the structure of the real Web is nowhere near that simple, and applying the algorithm directly causes problems. Let us first look at what the real Web structure looks like.

                    Figure 5.2: The “bowtie” picture of the Web

The bowtie picture above shows the structure of the real Web, which consists of the following parts:

1. A large strongly connected component (SCC). 
2. The in-component, consisting of pages that could reach the SCC by following links, but were not reachable from the SCC. 
3. The out-component, consisting of pages reachable from the SCC but unable to reach the SCC. 
4. Tendrils, which are of two types. Some tendrils consist of pages reachable from the in-component but not able to reach the in-component. The other tendrils can reach the out-component, but are not reachable from the out-component. 
5. Tubes, which are pages reachable from the in-component and able to reach the out-component, but unable to reach the SCC or be reached from the SCC. 
6. Isolated components that are unreachable from the large components (the SCC, in- and out-components) and unable to reach those components.

Several of these structures violate the assumptions needed for the Markov process iteration to converge to a limit.

As a result, PageRank is usually modified to prevent the following problems.

First is the dead end, a page that has no links out.

The second problem is groups of pages that all have out-links, but the group as a whole never links to any pages outside it. These structures are called spider traps.

 

5.1.4 Avoiding Dead Ends

Recall that a page with no link out is called a dead end.

      A     B     C     D
A     0    1/2    0     0
B    1/3    0     0    1/2
C    1/3    0     0    1/2
D    1/3   1/2    0     0

In the graph above, C is a dead end. Running the Markov process iteration on such a graph drives every page's PageRank to 0, because once the surfer reaches C there is no next step.

A simple solution is to first delete the dead-end nodes and their incident arcs (recursively, since deleting a dead end may expose new dead ends), and then compute the PageRank of the remaining nodes.

Finally, compute the PageRank of each dead-end node from the PageRank of its predecessors. For C in the example:

PRc = 1/3*PRa + 1/2*PRd
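
A minimal sketch of this remove-then-restore strategy, assuming a dict-of-sets adjacency representation; the function names are our own, not from the book. Removal loops because deleting a dead end can expose new dead ends; restoration runs in the reverse of removal order and divides by out-degrees taken from the full graph, exactly as in the formula for C above.

```python
def remove_dead_ends(out_links):
    """out_links: dict mapping each page to the set of pages it links to.
    Returns the reduced graph and the removed pages in removal order."""
    g = {u: set(vs) for u, vs in out_links.items()}
    removed = []
    while True:
        dead = [u for u, vs in g.items() if not vs]   # current dead ends
        if not dead:
            return g, removed
        for u in dead:
            removed.append(u)
            del g[u]
        for vs in g.values():                 # drop arcs into removed pages
            vs.difference_update(dead)

def restore_dead_ends(pr, out_links, removed):
    """pr: PageRank of the reduced graph. Each removed page gets the rank
    its predecessors send it, processed in reverse removal order so every
    predecessor's rank is already known."""
    for u in reversed(removed):
        pr[u] = sum(pr[p] / len(out_links[p])         # full-graph out-degree
                    for p, vs in out_links.items() if u in vs)
    return pr
```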

Alternatively, the taxation method introduced in the next section can also be used to solve this problem.

 

5.1.5 Spider Traps and Taxation

As we mentioned, a spider trap is a set of nodes with no dead ends but no arcs out of the set. These structures can appear intentionally or unintentionally on the Web, and they cause the PageRank calculation to place all the PageRank within the spider traps.

This situation differs from a dead end: after entering a spider trap there is always a next step, but the surfer can never leave.

      A     B     C     D
A     0    1/2    0     0
B    1/3    0     0    1/2
C    1/3    0     1    1/2
D    1/3   1/2    0     0

This is a simplified spider trap: node C has a self-loop, so once the surfer reaches C it can never leave. It is an abstraction and simplification of real spider traps on the Web.

Note that in general spider traps can have many nodes; there are spider traps with millions of nodes that spammers construct intentionally.

Running the Markov process iteration on such a graph gives the spider trap all of the PageRank, while every other node's PageRank goes to 0.

The spammer thereby achieves his goal: the spider trap he deliberately built receives the highest PageRank.

The solution is to introduce randomness, in the same spirit as escaping a local optimum: some random event lets the surfer jump out of the trap.

The Markov process above thus becomes the following, adding a random surfer with probability 1 − β:

v′ = βMv + (1 − β)e/n

where β is a chosen constant, usually in the range 0.8 to 0.9, e is a vector of all 1’s with the appropriate number of components, and n is the number of nodes in the Web graph.

With probability β, the random surfer decides to follow an out-link from the present page.

With probability 1 − β, the surfer teleports to a random page, as if a new random surfer started there.

With the random surfers added, the spider-trap node C still ends up with a large share of the PageRank, but the effect is now bounded, and the other nodes also receive some PageRank.
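
A sketch of the taxation iteration v′ = βMv + (1 − β)e/n, reusing the NumPy style of the earlier sketch; β = 0.85 is just a typical value from the 0.8–0.9 range mentioned above.

```python
import numpy as np

def pagerank_with_taxation(M, beta=0.85, steps=50):
    n = M.shape[0]
    v = np.full(n, 1 / n)
    teleport = np.full(n, 1 / n)     # (1/n)·e, the uniform teleport term
    for _ in range(steps):
        v = beta * (M @ v) + (1 - beta) * teleport
    return v
```

Run on the spider-trap matrix above, C still receives the largest share of PageRank, but A, B, and D now retain nonzero shares.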

 

5.2 Efficient Computation of PageRank

To compute the PageRank for a large graph representing the Web, we have to perform a matrix–vector multiplication on the order of 50 times, until the vector is close to unchanged at one iteration.

This involves representing the sparse transition matrix compactly and parallelizing the iteration with MapReduce...
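
As a hint of what that looks like, here is a sketch of one iteration step over a compact sparse representation: for each page we store only its successor list (its out-degree is the list's length), never the full n×n matrix. The data layout and names are our own; a real MapReduce implementation would additionally stripe this structure and the vector across machines.

```python
out_links = {                 # page -> list of successors (sparse M)
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}

def step(v, out_links, beta=0.85):
    """One iteration v' = beta*M*v + (1-beta)*e/n over the sparse form.
    Assumes dead ends have already been removed."""
    n = len(v)
    nxt = {p: (1 - beta) / n for p in v}      # taxation term for every page
    for p, succs in out_links.items():
        share = beta * v[p] / len(succs)      # p's rank split over its links
        for q in succs:
            nxt[q] += share
    return nxt
```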

 

5.3 Topic-Sensitive PageRank

5.3.1 Motivation for Topic-Sensitive Page Rank

Different people have different interests, and sometimes distinct interests are expressed using the same term in a query. The canonical example is the search query jaguar, which might refer to the animal, the automobile, a version of the Mac operating system, or even an ancient game console.

But there are far too many users to create one favorite vector for each user. So,

The topic-sensitive PageRank approach creates one vector for each of some small number of topics, biasing the PageRank to favor pages of that topic.

Then we only need to classify each user under one of these topics in order to approximate the user's interests.

 

5.3.2 Biased Random Walks

Suppose we have identified some pages that represent a topic such as “sports.” To create a topic-sensitive PageRank for sports, we can arrange that the random surfers are introduced only to a random sports page, rather than to a random page of any kind.

The only real difference is that the surfer is now biased. In the original model every teleport was uniformly random over all pages; with the bias, we assume the surfer teleports only (or with much higher probability) to pages of a specific topic, such as sports.

The mathematical formulation for the iteration is as follows.

Suppose S is a set of the pages we have identified as belonging to a certain topic.

Let eS be a vector that has 1 in the components corresponding to the pages in S and 0 in the other components.

v′ = βMv + (1 − β)eS/|S|
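
A sketch of this biased iteration: it is identical to the taxation version except that teleports land only on the topic set S. Representing S as a set of page indices and the particular choice S = {1, 3} (pages B and D) are our own invented example.

```python
import numpy as np

def topic_sensitive_pagerank(M, S, beta=0.85, steps=50):
    n = M.shape[0]
    e_S = np.zeros(n)
    e_S[list(S)] = 1.0          # 1 for pages in the topic set S, 0 elsewhere
    v = np.full(n, 1 / n)
    for _ in range(steps):
        v = beta * (M @ v) + (1 - beta) * e_S / len(S)
    return v

# e.g. with the four-page M above: topic_sensitive_pagerank(M, S={1, 3})
```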


This article is excerpted from 博客园 (cnblogs); original publication date: 2011-09-06.
