
Extract text from a txt file and convert it to a DataFrame

I have a txt file with the following contents:

google.com('172.217.163.46', 443)
        commonName: *google.com
        issuer: GTS CA 1O1
        notBefore: 2020-02-12 11:47:11
        notAfter:  2020-05-06 11:47:11

facebook.com('31.13.79.35', 443)
        commonName: *facebook.com
        issuer: DigiCert SHA2 High Assurance Server CA
        notBefore: 2020-01-16 00:00:00
        notAfter:  2020-04-15 12:00:00

How can I convert this to a DataFrame?

I tried the following and had partial success:

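import numpy as np
import pandas as pd
from io import StringIO

# read the whole file into the string that read_csv parses below
with open("out.txt", "r") as f:
    data = f.read()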


a=(pd.read_csv(StringIO(data),
              header=None,
     #use a delimiter not present in the text file
     #forces pandas to read data into one column
              sep="/",
              names=['string'])
     #limit number of splits to 1
  .string.str.split(':',n=1,expand=True)
  .rename({0:'Name',1:'temp'},axis=1)
  .assign(temp = lambda x: np.where(x.Name.str.strip()
                             #look for string that ends 
                             #with a bracket
                              .str.match(r'(.\*)]$)'),
                              x.Name,
                              x.temp),
          Name = lambda x: x.Name.str.replace(r'(.\*)]$)','Name')
          )
   #remove whitespace
 .assign(Name = lambda x: x.Name.str.strip())
 .pivot(columns='Name',values='temp')
 .ffill()
 .dropna(how='any')
 .reset_index(drop=True)
 .rename_axis(None,axis=1)
 .filter(['Name','commonName','issuer','notBefore','notAfter'])      
  )

But this loops and gives me multiple records; for example, a single row ends up with several duplicates.

Source of question: Stack Overflow

is大龙 2020-03-24 23:38:56 732 0
1 answer
  • This file is not in CSV format, so you should not use read_csv to read it; parse it by hand instead. Here you could do:

    import pandas as pd

    cols = {'commonName', 'issuer', 'notBefore', 'notAfter'}   # columns to keep
    rows = []                                                  # list of records
    rec = None                                                 # current block, none seen yet
    with open("out.txt") as fd:
        for line in fd:
            line = line.strip()
            if ':' in line:
                key, value = line.split(':', 1)                # data line: parse it
                key = key.strip()
                if rec is not None and key in cols:
                    rec[key] = value.strip()                   # drop the space after ':'
            elif line:
                rec = {'Name': line}                           # initial line of a block
                rows.append(rec)

    a = pd.DataFrame(rows)         # and build the dataframe from the list of records
    

    This gives:

                                    Name      commonName                                   issuer               notAfter             notBefore
    0  google.com('172.217.163.46', 443)     *google.com                               GTS CA 1O1    2020-05-06 11:47:11   2020-02-12 11:47:11
    1   facebook.com('31.13.79.35', 443)   *facebook.com   DigiCert SHA2 High Assurance Server CA    2020-04-15 12:00:00   2020-01-16 00:00:00
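
The same hand-parsing idea can be wrapped in a function that accepts any iterable of lines, which makes it easy to try on in-memory text before pointing it at out.txt. The following is only a minimal sketch: the function name `parse_cert_blocks`, the `FIELDS` tuple, and the inline `sample` string are illustrative assumptions, not part of the original answer. It also converts the two date columns with `pd.to_datetime`, on the assumption that every timestamp uses the format shown in the question.

```python
import pandas as pd

# Field names taken from the sample in the question (assumed to be complete).
FIELDS = ("commonName", "issuer", "notBefore", "notAfter")


def parse_cert_blocks(lines):
    """Hypothetical helper: turn block-structured lines into a list of dicts."""
    rows = []
    rec = None                                # no block seen yet
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                          # blank line between blocks
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep:
            # "key: value" line; keep it only if it is a wanted field
            if rec is not None and key in FIELDS:
                rec[key] = value.strip()
        else:
            rec = {"Name": line}              # header line starts a new block
            rows.append(rec)
    return rows


# In-memory copy of the sample data, used here instead of out.txt.
sample = """\
google.com('172.217.163.46', 443)
        commonName: *google.com
        issuer: GTS CA 1O1
        notBefore: 2020-02-12 11:47:11
        notAfter:  2020-05-06 11:47:11

facebook.com('31.13.79.35', 443)
        commonName: *facebook.com
        issuer: DigiCert SHA2 High Assurance Server CA
        notBefore: 2020-01-16 00:00:00
        notAfter:  2020-04-15 12:00:00
"""

df = pd.DataFrame(parse_cert_blocks(sample.splitlines()))

# Convert the date strings to real timestamps so they can be compared or sorted.
for col in ("notBefore", "notAfter"):
    df[col] = pd.to_datetime(df[col])

print(df.dtypes)
```

To read the real file instead, something like `with open("out.txt") as fd: df = pd.DataFrame(parse_cert_blocks(fd))` should work, since a file object is itself an iterable of lines.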
    

    Source of answer: Stack Overflow

    2020-03-24 23:39:05