开发者社区> 问答> 正文

删除重复的内容Java

我收到了这段文字,我需要过滤掉这些重复的行和单词。我不知道是否有比我正在做的更好的方法。

00:00:00,413|03:50:25,600|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:00,413|03:50:25,600|ISDB|PERFEITAMENTE. EU
00:00:01,135|00:00:01,315|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|BOBAS PARA
00:00:02,218|00:00:02,398|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS
00:00:02,398|00:00:02,759|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO
00:00:02,759|00:00:03,274|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:03,274|00:00:04,357|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A?
00:00:04,357|00:00:05,259|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,259|00:00:05,414|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,414|00:00:05,775|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS
00:00:05,775|00:00:06,677|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,775|00:00:06,677|ISDB|COLOCARAM AS FOTOS COMO
00:00:06,677|00:00:06,858|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:06,677|00:00:06,858|ISDB|COLOCARAM AS FOTOS COMO PAPEL
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:06,858|03:50:32,400|ISDB|PAREDE, PARECE AT?QUE
00:00:07,914|00:00:07,916|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:07,914|00:00:07,916|ISDB|PAREDE, PARECE AT?QUE EU
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT?QUE EU GOSTO
00:00:08,997|00:00:09,178|ISDB|PAREDE, PARECE AT?QUE EU GOSTO

而且我正在使用该代码,将这些行放入HashSet中,以免重复。

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
public class Testecc {
   public static void main(String args[]) throws Exception {
      String filePath = "C://teste//teste1.txt";
      String input = null;
      //Buffered reader
      BufferedReader br = new BufferedReader(new FileReader(filePath));
      while((input=br.readLine()) !=null){
                input=br.readLine();

      //FileWriter (criando arquivo)
      FileWriter writer = new FileWriter("C://teste//teste.txt");
      //hashset para elimitar duplicatas
      Set set = new HashSet();
      String line;
      //adicionando linhas no hashset
      while((line=br.readLine())!=null){
          String line1= line.substring(0,31);
          String line2=line.substring(31);
          System.out.println(line);
          if(set.add(line2)){

      writer.append(line1+line2+"\n");
          }
      }
      writer.flush();
      System.out.println("Pronto!");
   }
}
   }

这样,我删除了重复的行,如下所示:

00:00:01,135|00:00:01,315|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|BOBAS PARA
00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS
00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV�S
00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A�.
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A�. ELES
00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS
00:00:05,775|00:00:06,677|ISDB|COLOCARAM AS FOTOS COMO
00:00:06,677|00:00:06,858|ISDB|COLOCARAM AS FOTOS COMO PAPEL
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:06,858|03:50:32,400|ISDB|PAREDE, PARECE AT� QUE
00:00:07,914|00:00:07,916|ISDB|PAREDE, PARECE AT� QUE EU
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT� QUE EU GOSTO

但是我还需要删除重复的单词。

我真的没有主意。

我怎样才能做到这一点?

问题来源:Stack Overflow

展开
收起
montos 2020-03-24 15:11:10 946 0
2 条回答
写回答
取消 提交回答
  • 重复的词行为单位还是全局,如果是行,那么就针对行再过滤一遍。 如果是全渠建议用一下倒排,标记出每个次出现的位置,后续统一处理。

    2020-03-25 09:59:08
    赞同 展开评论 打赏
  • 有一个地图,其中包含按特定键分组的线值。键是一行的开头,从您感兴趣的单词开始,例如前5个字母。然后将这些线添加到地图中,如果该线长于先前找到的线,则将其替换。

    try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {
    
      final Map<String, String> map = new LinkedHashMap<>();
    
      br.lines().forEach(line -> {
            String message = line.substring(line.lastIndexOf("|") + 1);
            if (message.isEmpty()) {
              return;
            }
            String key = message.split(" ")[0];
            if (map.get(key) == null) {
              map.put(key, line);
            } else if (map.get(key).length() < line.length()) {
              map.remove(key);
              map.put(key, line);
            }
          }
      );
    
      map.forEach((k, v) -> System.out.println(v));
    }
    

    上面的代码将为您提供以下输出。

    00:00:00,413|03:50:25,600|ISDB|>> FALAM QUE A GENTE COMBINA
    00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
    00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV?
    00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A? ELES
    00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
    00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT?QUE EU GOSTO
    

    回答来源:Stack Overflow

    2020-03-24 15:11:56
    赞同 展开评论 打赏
问答分类:
问答标签:
问答地址:
问答排行榜
最热
最新

相关电子书

更多
Spring Cloud Alibaba - 重新定义 Java Cloud-Native 立即下载
The Reactive Cloud Native Arch 立即下载
JAVA开发手册1.5.0 立即下载