《程序员》第9期智慧擂台题目——高频词汇提取

简介: 本文为原创,如需转载,请注明作者和出处,谢谢!     面对浩瀚的信息海洋,找到想要的资源有时真的是不容易。在大量文字中搜索高频词汇是信息搜索和数据压缩的共通课题。
本文为原创,如需转载,请注明作者和出处,谢谢!

    面对浩瀚的信息海洋,找到想要的资源有时真的是不容易。在大量文字中搜索高频词汇是信息搜索和数据压缩的共通课题。
这次智慧擂台请大家在一个比较庞大的英文文本中找出M个数量最多的短语(由N个单词组成)。统一处理相同的文本文件,该文本只包含英文单词、空格和回行符,比较谁的程序效率最高。

  程序输入:M,N,文本文件路径(M不超过20,N不超过8)
程序输出:高频短语及其数量清单

擂台规则:提交符合以上要求的可执行程序,语言招式不限,点到为止;
我们将在统一的环境下对每位选手的作品进行公平的测试,
比较出综合用时最少的程序。

源程序


import  java.io. * ;
import  java.util. * ;

class  tt
{
    
public  String phrase;
    
public   int  count;
}

public   class  searchphrase
{

    
private   static  LinkedHashMap phrase  =   new  LinkedHashMap();
    
static  tt[] max_phrase;

    
private   static  Vector SeparateString(String s)
    {
        Vector v 
=   new  Vector();
        String temp 
=   "" ;
        
for  ( int  i  =   0 ; i  <  s.length(); i ++ )
        {
            
if  (s.charAt(i)  !=   '   ' )
            {
                temp 
+=  s.charAt(i);
            }
            
else
            {
                
if  (temp  !=   "" )
                    v.add(temp);
                temp 
=   "" ;
            }
        }
        
if  (temp  !=   "" )
            v.add(temp);
        
return  v;
    }

    
private   static   void  swap( int  pos,  int  count, String phrase)
    {
        
int  i;
        
if  (max_phrase[pos  -   1 ].count  <  count)
        {
            
for  (i  =  pos  -   1 ; i  >   0 ; i -- )
            {
                
if  (max_phrase[i  -   1 ].count  >  max_phrase[i].count)
                    
break ;
            }
            max_phrase[pos].count 
=  max_phrase[i].count;
            max_phrase[pos].phrase 
=  max_phrase[i].phrase;
            max_phrase[i].count 
=  count;
            max_phrase[i].phrase 
=  phrase;
        }

    }

    
private   static   void  adjust_max( int  count, String phrase)
    {
        
int  i, j;
        
if  (count  <=  max_phrase[max_phrase.length  -   1 ].count)
            
return ;
        
for  (i  =  max_phrase.length  -   1 ; i  >=   0 ; i -- )
        {
            
if  (max_phrase[i].phrase.equals(phrase))
            {
                max_phrase[i].count 
=  count;
                
if  (i  >   0 )
                {
                    swap(i, count, phrase);
                }
                
return ;
            }
        }
        max_phrase[max_phrase.length 
-   1 ].count  =  count;
        max_phrase[max_phrase.length 
-   1 ].phrase  =  phrase;
        
if  (i  >   0 )
        {
            swap(max_phrase.length 
-   1 , count, phrase);
        }
    }

    
private   static   void  js(Vector v,  int  n)
    {
        String s;
        
for  ( int  i  =   0 ; i  <  v.size()  -  n  +   1 ; i ++ )
        {
            s 
=   "" ;
            
for  ( int  j  =  i; j  <  i  +  n; j ++ )
            {
                s 
+=  v.get(j)  +   "   " ;
            }
            
int  count  =   1 ;
            
if  (phrase.containsKey(s.hashCode()))
            {
                count 
=  Integer.parseInt(phrase.get(s.hashCode()).toString());
                count
++ ;
            }
            phrase.put(s.hashCode(), count);
            adjust_max(count, s);
        }
    }

    
public   static   void  main(String[] args)
    {
        
try
        {
            
long  t;
            
int  m, n;
            String path;
            m 
=  Integer.parseInt(args[ 0 ]);
            n 
=  Integer.parseInt(args[ 1 ]);
            path 
=  args[ 2 ];
            max_phrase 
=   new  tt[m];
            
for  ( int  i  =   0 ; i  <  m; i ++ )
            {
                max_phrase[i] 
=   new  tt();
                max_phrase[i].count 
=   0 ;
                max_phrase[i].phrase 
=   "" ;
            }
            t 
=  ( new  java.util.Date()).getTime();
            java.io.FileReader fr 
=   new  java.io.FileReader(path);
            java.io.BufferedReader br 
=   new  BufferedReader(fr);
            String s;

            Vector v 
=   null ;

            
while  ((s  =  br.readLine())  !=   null )
            {
                v 
=  SeparateString(s);
                js(v, n);
            }
            
for  ( int  i  =   0 ; i  <  m; i ++ )
            {
                System.out.println(max_phrase[i].phrase);
                System.out.println(max_phrase[i].count);
                System.out.println();
            }
            t 
=  ( new  java.util.Date()).getTime()  -  t;
            System.out.print(t);
            System.out.println(
"  ms " );
        }
        
catch  (Exception e)
        {
            System.out.println(e.getMessage());
        }

    }

}

测试结果1:m = 20 n = 8 

under games played won drawn lost goals for
71

tabulated under games played won drawn lost goals
70

games played won drawn lost goals for against
70

May Xinhua Following are the results from the
69

played won drawn lost goals for against and
59

won drawn lost goals for against and points
59

Jan Xinhua Following are the results from the
48

Chinas economic efficiency indicators of the sector of
39

The industrial statistics include all stateowned enterprises and
39

industrial statistics include all stateowned enterprises and the
39

statistics include all stateowned enterprises and the nonstateowned
39

include all stateowned enterprises and the nonstateowned ones
39

all stateowned enterprises and the nonstateowned ones with
39

stateowned enterprises and the nonstateowned ones with annual
39

enterprises and the nonstateowned ones with annual sales
39

and the nonstateowned ones with annual sales income
39

Xinhua Chinas economic efficiency indicators of the sector
39

the nonstateowned ones with annual sales income over
39

nonstateowned ones with annual sales income over million
39

up percent over the same period last year
35

13594 ms


测试结果2  m = 10 n = 5

Xinhua Following are the results
295

May Xinhua Following are the
209

Following are the results from
183

are the results from the
176

April Xinhua Following are the
141

Jan Xinhua Following are the
122

billion yuan billion US dollars
120

won drawn lost goals for
88

played won drawn lost goals
88

Dec Xinhua Following are the
87

12437 ms

 
以上源程序是采用的是最简单的方法,谁有更好,效率更高的方法,请跟贴!!



国内最棒的Google Android技术社区(eoeandroid),欢迎访问!

《银河系列原创教程》发布

《Java Web开发速学宝典》出版,欢迎定购

目录
相关文章
|
7月前
|
数据处理 5G
信息论与编码期末复习——计算题+基础汇总(二)
信息论与编码期末复习——计算题+基础汇总(二)
133 0
信息论与编码期末复习——计算题+基础汇总(二)
|
7月前
|
存储 算法 索引
模拟算法题练习(二)(DNA序列修正、无尽的石头)
模拟算法题练习(二)(DNA序列修正、无尽的石头)
|
7月前
|
自然语言处理 算法 Java
基于Python实现四大名著章节标注汉语拼音
基于Python实现四大名著章节标注汉语拼音
58 0
|
7月前
|
机器学习/深度学习 算法 BI
【设计】 【数学】1622 奇妙序列
【设计】 【数学】1622 奇妙序列
|
算法 测试技术
蓝桥算法_单词分析-wordAnalysis
蓝桥算法_单词分析-wordAnalysis
|
算法 Java
LeetCode150道面试经典题--单词规律(简单)
给定一种规律 pattern 和一个字符串 s ,判断 s 是否遵循相同的规律。 这里的 遵循 指完全匹配,例如, pattern 里的每个字母和字符串 s 中的每个非空单词之间存在着双向连接的对应规律
92 0
数学知识补充(一)度量空间
数学知识补充(一)度量空间
79 0
《LeetCode刷题计划》前K个高频单词
《LeetCode刷题计划》前K个高频单词
|
存储
【LeetCode】-- 692. 前K个高频单词
【LeetCode】-- 692. 前K个高频单词