开发者社区> 问答> 正文

数据比较

我们有一个SQL Server表,其中包含“公司名称”,“地址”和“联系人姓名”(以及其他名称)。

我们会定期从外部来源接收数据文件,这些数据文件要求我们对此表进行匹配。不幸的是,由于数据来自完全不同的系统,因此数据略有不同。例如,我们有“ 123 E. Main St.” 我们会收到“ 123 East Main Street”。另一个示例,我们有“ Acme,LLC”,文件包含“ Acme Inc.”。另一个是,我们有“爱德史密斯”,而他们有“爱德华史密斯”

我们有一个旧系统,该系统利用一些相当复杂且占用大量CPU的方法来处理这些匹配项。一些涉及纯SQL,而其他涉及Access数据库中的VBA代码。当前的系统是好的,但不是完美的,麻烦且难以维护

此处的管理层希望扩大其用途。将继承系统支持的开发人员希望用一种更灵活的解决方案来替换它,该解决方案需要更少的维护。

有没有一种普遍接受的方式来处理这种数据匹配?

展开
收起
心有灵_夕 2019-12-29 12:49:33 1054 0
1 条回答
写回答
取消 提交回答
  • 这是我为几乎相同的堆栈编写的(我们需要标准化硬件的制造商名称,并且有各种各样的变体)。不过,这是客户端(准确地说是VB.Net)-并使用Levenshtein距离算法(已修改,以获得更好的结果):

    Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String
            Dim bestMatch As String = ""
            Dim bestDistance As Integer = 1000 'Almost anything should be better than that!
    
            For Each matchCandidate As String In stringList
                Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
                If candidateDistance < bestDistance Then
                    bestMatch = matchCandidate
                    bestDistance = candidateDistance
                End If
            Next
    
            Return bestMatch
        End Function
    
        'This will be used to determine how similar strings are.  Modified from the link below...
        'Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentNews.newsDetails&newsID=37030&from=list
        Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
            Dim sLength As Integer = s.Length ' length of s
            Dim tLength As Integer = t.Length ' length of t
            Dim lvCost As Integer ' cost
            Dim lvDistance As Integer = 0
            Dim zeroCostCount As Integer = 0
    
            Try
                ' Step 1
                If tLength = 0 Then
                    Return sLength
                ElseIf sLength = 0 Then
                    Return tLength
                End If
    
                Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
                Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}
    
                ' fill first row
                For lvIndex As Integer = 0 To sLength
                    poBuffer(lvIndex) = lvIndex
                Next
    
                'fill first column
                For lvIndex As Integer = 1 To tLength
                    poBuffer(lvIndex * (sLength + 1)) = lvIndex
                Next
    
                For lvRowIndex As Integer = 0 To sLength - 1
                    Dim s_i As Char = s(lvRowIndex)
                    For lvColIndex As Integer = 0 To tLength - 1
                        If s_i = t(lvColIndex) Then
                            lvCost = 0
                            zeroCostCount += 1
                        Else
                            lvCost = 1
                        End If
                        ' Step 6
                        Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
                        Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
                        Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
                        Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
                        lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
                        poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
                    Next
                Next
            Catch ex As ThreadAbortException
                Err.Clear()
            Catch ex As Exception
                WriteDebugMessage(Application.StartupPath , [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
            End Try
    
            Return lvDistance - zeroCostCount
        End Function
    
    2019-12-29 12:50:04
    赞同 展开评论 打赏
问答排行榜
最热
最新

相关电子书

更多
存储分层企业数据存储类型选择与优化 立即下载
实战-如何基于HBase构建图片视频数据的统一存储检索方案 立即下载
数据无边界:非结构化数据在MaxCompute上的处理 立即下载