Comparing the similarity of file contents in Ruby

Introduction:
1. Definition of similarity

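Let T1 and T2 be the two texts, each split into fixed-length strings, with |T1| = n and |T2| = m. Let C be the collection of strings that T1 and T2 have in common, with |C| = s. The similarity is then p = s / max(n, m), a value between 0 and 1.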

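2. Design of the similarity detection algorithm

Algorithm design:

Treat every 4 characters as one string. Split T1 and T2 into such strings; if fewer than 4 characters remain at the end, pad with spaces. Count the strings in each text, recording |T1| = n and |T2| = m, and set s = 0. Take the first string from T1 and check whether it occurs in T2. If it does, add 1 to s and delete the matching string from T2, and keep checking T2 until it no longer contains the string being tested. Then return to T1, take the next string, and test it against T2. Repeat until every string in T1 has been tested or every string in T2 has been deleted, then stop and record s. Divide s by the larger of n and m; the result is the similarity of T1 and T2. Run the check first with T1 as the template and then with T2 as the template, and take the smaller of the two similarity values.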
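
The splitting and scoring steps above can be tried on two short strings before reading files. This is only a sketch: the helper name `chunks` and the sample strings are made up for illustration, and the toy input has no repeated chunks, so a plain count of shared chunks stands in for the delete-and-count loop.

```ruby
# Hypothetical helper (not from the original code): split text into
# fixed-width chunks, padding the last chunk with spaces if needed.
def chunks(text, width = 4)
  padded = text.dup
  remainder = padded.size % width
  padded << " " * (width - remainder) unless remainder.zero?
  padded.scan(/.{#{width}}/m)   # /m lets "." match newlines as well
end

t1 = chunks("abcdefghij")   # 10 characters, so the last chunk is padded
t2 = chunks("abcdXXXXij")

p t1   # => ["abcd", "efgh", "ij  "]
p t2   # => ["abcd", "XXXX", "ij  "]

# Count the chunks of t1 that also occur in t2 (no duplicates in this input).
s = t1.count { |c| t2.include?(c) }
similarity = s / [t1.size, t2.size].max.to_f
puts similarity   # 2 shared chunks out of 3
```

Two of the three chunks match, so the score is 2/3.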


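A Ruby implementation follows. Note that it runs the check in one direction only, using the first file as the template.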

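# Pad str with trailing spaces so that its length is a multiple of i.
def fill_str(str, i = 4)
  return str if str.size % i == 0
  str << " " * (i - str.size % i)   # use i here, not a hard-coded 4
end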


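# Returns the similarity of two files as a Float, or nil if an error occurred.
def txt_cmp(f0, f1)
  # File.read closes the file handle once the content has been read.
  str_f0, str_f1 = fill_str(File.read(f0)), fill_str(File.read(f1))
  # Split into 4-character strings; /m lets "." match newlines too.
  a0, a1 = str_f0.scan(/.{4}/m), str_f1.scan(/.{4}/m)
  n, m, s = a0.size, a1.size, 0
  a0.each do |txt|
    if a1.include?(txt)
      size = a1.size
      # Remove every copy of txt from a1 and add the number removed to s.
      s += size - a1.keep_if { |item| item != txt }.size
    end
    break if a1.empty?
  end
  s / [n, m].max.to_f
rescue => e
  warn "error : #{e.message}\n#{e.backtrace[0..2].join("\n")}"
  nil
end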

abort "you must compare 2 text files" if ARGV.size != 2
f0, f1 = ARGV
r = txt_cmp(f0, f1)
exit 1 if r.nil?   # txt_cmp has already reported the error
puts "#{f0} and #{f1} semblance is #{r*100}%"
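
The algorithm description also calls for checking in both directions and keeping the smaller score, which the script above does not do. A minimal sketch of that step, working on arrays of chunks rather than files; the names `one_way` and `symmetric_similarity` are hypothetical, not part of the original code.

```ruby
# Hypothetical helper: score one direction using the counting rule from
# the description. Every copy of a template chunk found in `other` is
# counted and removed.
def one_way(template, other)
  remaining = other.dup            # leave the caller's array untouched
  s = 0
  template.each do |chunk|
    before = remaining.size
    remaining.delete(chunk)        # Array#delete removes all equal items
    s += before - remaining.size
    break if remaining.empty?
  end
  denom = [template.size, other.size].max
  denom.zero? ? 0.0 : s / denom.to_f
end

# Hypothetical wrapper: score both directions and keep the smaller value.
def symmetric_similarity(a, b)
  [one_way(a, b), one_way(b, a)].min
end

a = %w[AAAA BBBB CCCC]
b = %w[AAAA AAAA AAAA]
p one_way(a, b)   # three copies of "AAAA" are removed from b: 3/3
p one_way(b, a)   # a holds a single "AAAA": 1/3
p symmetric_similarity(a, b)
```

The example shows why the direction matters: a repeated chunk in the second text can inflate the one-way score, and taking the minimum keeps the lower, more conservative value.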


Below are four files, 1.txt, 2.txt, a.txt and b.txt, with the following contents:

1.txt

NFC East rival quarterbacks Tony Romo(notes) of the Dallas Cowboys and Eli Manning(notes) of the New York Giants now have something else in common ḂẂ they've used the same wedding planner to help them tie the knot. Todd Fiscus, the man with the plan, set up what he called "man food" at Dallas' Arlington Hall on Saturday, when Romo married former Miss Missouri Candace Crawford. "I have a lot of football players to feed," said Fiscus, who had pizza and short ribs on the menu.

However, Romo apparently put all the tunes together. "Tony picked out every song, and when it plays, and what the keynote things are," Fiscus said.

Sounds like a very orderly occasion, but there was one wild card ḂẂ whether Cowboys owner Jerry Jones would be able to attend. With the continued lockout, owners and players are not supposed to have any contact away from the negotiating table. But Jones received special dispensation from the NFL to attend, just as the Green Bay Packers recently were informed that they will, in fact, receive their Super Bowl rings in a June 16 ceremony no matter what the labor situation is at that time. Jones was there along with virtually all of Romo's teammates.

It is unknown whether Jones and Romo actually discussed any labor issues at the wedding ḂẂ we're guessing this was more of a "friendly", though Jones is one of the most powerful owners on the NFL's side of things and Romo's marquee value gives him a lot of play on the other side.

"I've gotten special permission," Jones recently told ESPN's Ed Werder. "But more than anything, (I got the) right ticket from him and his fianceẀḊ ḂẂ Romo's wife-to-be. (It's) one of prettiest invitations I've ever seen.

"So, yes, I will be there and (I'm) proud for him. He's got the best end of this deal."

Romo, who had been linked romantically before with Jessica Simpson and Carrie Underwood, proposed to Crawford last December. Crawford's brother Chace is known for his role on the TV show "Gossip Girl' and has also been linked romantically with Underwood.

According to the new Mrs. Romo, the lockout may play a part in the couple's plans for a honeymoon; usually around this time of year, her husband would be participating in minicamps and other off-season workouts.

"This lockout has been quite a dent in the honeymoon idea," she told WFAA-TV. "We'll see. We haven't really gotten there yet. We're taking a day at a time with the lockout. We (are not) even sure if we're gonna get to go (on) one."

2.txt

Officially, Memorial Day, observed on the last Monday of May (this year it's May 30), honors the war dead. Unofficially, the day honors the start of summer. (More on that in a moment.)

The upcoming three-day weekend has prompted searches on Yahoo! for "when is memorial day," "what is memorial day," and "memorial day history." The day was originally known as "Decoration Day" because the day was dedicated to the Civil War dead, when mourners would decorate gravesites as a remembrance.

The holiday was first widely observed on May 30, 1868, when 5,000 people helped decorate the gravesites of 20,000 Union and Confederate soldiers buried at Arlington National Cemetery. (Some parts of the South still remember members of the Confederate Army with Confederate Memorial Day.)

After World War I, the observances were widened to honor the fallen from all American wars--and in 1971, Congress declared Memorial Day a national holiday.

Towns across the country now honor military personnel with services, parades, and fireworks. A national moment of remembrance takes place at 3 p.m. At Arlington National Cemetery, headstones are graced with small American flags.

This day is not to be confused with Veterans Day, which is observed on November 11 to honor military veterans, both alive and dead.

However, confusion abounds anyway, with the weekend marking for many the kickoff of summer, and it is reserved for weekend getaways, picnics, and sales. Searches on "memorial day sales," "memorial day recipes," and "memorial day weekend" are just some of the lookups related to the festivities.

a.txt

23l4kj23 klgjdlskgj235 3lkj 0952ru lkfj lkqejfg
2t34lktj3409t uj34gjklejeglekjfdklsafjalsfj
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
sdgakdgjsdalgjaslfjsalkfjsadlf

b.txt

23l4kj23 klgjdlskgj235 3lkj 0952ru lkfj lkqejfg
2t34lktj3409t uj34gjklejeglekjfdklsafjalsfj
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
sdgakdgjsdalgjaslfjsalkfjsadlf

The test runs are as follows:


ruby -EISO-8859-14 txtcmp.rb 1.txt 2.txt
1.txt and 2.txt semblance is 8.653846153846153%

ruby txtcmp.rb a.txt b.txt

a.txt and b.txt semblance is 79.54545454545455%


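Because 1.txt contains non-UTF-8 characters, the comparison fails under the default encoding, so an external encoding is specified for that run.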
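
The encoding problem can be reproduced without the test files. The sketch below uses two bytes chosen only as a stand-in for invalid UTF-8 input (they are an assumption, not taken from 1.txt) and shows two possible ways around the failure: declaring a single-byte encoding, as the `-E` switch does, or scrubbing the invalid bytes. Inside a script, `File.read(path, encoding: "ISO-8859-14")` should give a similar result to the command-line switch.

```ruby
# Stand-in for text read from disk that contains bytes invalid in UTF-8.
raw = "they \xA1\xAA used".b           # 12 bytes, binary encoding

as_utf8 = raw.dup.force_encoding("UTF-8")
puts as_utf8.valid_encoding?           # false: the two bytes are not valid UTF-8

begin
  as_utf8.scan(/.{4}/m)                # regex matching on a broken string
rescue ArgumentError => e
  puts "scan failed: #{e.message}"
end

# Option 1: declare a single-byte encoding, so every byte is one character.
as_latin = raw.dup.force_encoding("ISO-8859-14")
p as_latin.scan(/.{4}/m).size          # 12 characters give 3 chunks

# Option 2: stay in UTF-8 and replace the invalid bytes before splitting.
scrubbed = as_utf8.scrub("?")
puts scrubbed.valid_encoding?          # true
p scrubbed.scan(/.{4}/m)
```

Scrubbing changes the text being compared, so the two options can yield slightly different similarity scores for the same pair of files.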
