
Is there any Linux command that can do fuzzy matching on multiple columns?

I have two CSV files. File 1:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0

File 2:

PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018

祖安文状元 2020-01-07 14:19:25
1 answer
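  • Here is one for GNU awk (it uses PROCINFO["sorted_in"] to pick the best-matching candidate). It hashes the value of every field in file2 and attaches the PID to that value, e.g. field[2]["66M"]="S2". Then, for each record in file1, it counts the matches per PID and prints the PID with the highest count. file2 must come before file1 on the command line, since the NR==FNR block is the one that reads file2: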

    BEGIN {
        FS=OFS=","
        PROCINFO["sorted_in"]="@val_num_desc"                      # for(i in matches) visits the highest count first
    }
    NR==FNR {                                                      # file2 (first file argument)
        for(i=1;i<=6;i++)                                          # fields 1-6
            if($i!="")
                field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1 # attach PID to value
        next
    }
    {                                                              # file1
        for(i=1;i<=6;i++) {                                        # fields 1-6
            if($i in field[i]) {                                   # if value matches
                split(field[i][$i],t,FS)                           # get PIDs
                for(j in t)                                        # and
                    matches[t[j]]++                                # increase PID counts
            } else {                                               # if no value match
                for(j in field[i])                                 # for all field values
                    if($i~j || j~$i) {                             # "go fuzzy" :D (an empty $i makes j~$i true for every j)
                        split(field[i][j],t,FS)                    # a value can belong to several PIDs
                        for(k in t)
                            matches[t[k]]+=0.5                     # fuzzy is half a match
                    }
            }
        }
        for(i in matches) {                                        # the best match first
            print $0,i
            break                                                  # we only want the best match
        }
        delete matches                                             # reset the counts for the next record
    }
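For awks without arrays of arrays or PROCINFO["sorted_in"] (both are GNU awk features), the same scoring idea can be sketched as a brute-force comparison of every file1 row against every file2 row. This is only an illustration, not the answer's method, and it makes several assumptions of its own: only columns 2-6 are compared, empty fields are skipped, plain substring containment via index() stands in for the regex test, ties go to the file2 row that appears first, and the temporary file names are hypothetical. The data is a subset of the rows from the question.

```shell
#!/bin/sh
# Sketch only: brute-force variant of the scoring idea in portable awk.
# Assumptions (not from the answer): columns 2-6 are compared, empty
# fields are skipped, substring containment replaces the regex test,
# and ties go to the file2 row that appears first.
set -e
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

cat > "$dir/file2" <<'EOF'
PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018
EOF

cat > "$dir/file1" <<'EOF'
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,M,Lee,,1990,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,Paul,ACB,Row,,,201211.0
EOF

result=$(awk -F, -v OFS=, '
    NR == FNR {                              # first file argument: file2
        if (FNR > 1) {                       # skip the header line
            n++
            for (i = 1; i <= 6; i++) ref[n, i] = $i
        }
        next
    }
    FNR == 1 { print $0, "PID"; next }       # file1 header gets a PID column
    {
        best = ""; bestscore = 0
        for (r = 1; r <= n; r++) {           # score every file2 row
            score = 0
            for (i = 2; i <= 6; i++) {
                a = $i ""; b = ref[r, i] ""  # force string comparison
                if (a == "" || b == "") continue
                if (a == b) score += 1       # exact match counts 1
                else if (index(a, b) || index(b, a)) score += 0.5  # containment counts half
            }
            if (score > bestscore) { bestscore = score; best = ref[r, 1] }
        }
        print $0, best                       # best stays empty when nothing matched
    }
' "$dir/file2" "$dir/file1")
printf '%s\n' "$result"
```

On these four rows the sketch appends S2, S3, S1 and S6 respectively. It keeps a running maximum instead of relying on array traversal order, which is what lets it run without GNU awk's sorted traversal, at the cost of comparing every pair of rows.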
    
    2020-01-07 14:19:35