我们有一个SQL Server表,其中包含“公司名称”,“地址”和“联系人姓名”(以及其他名称)。
我们会定期从外部来源接收数据文件,这些数据文件要求我们对此表进行匹配。不幸的是,由于数据来自完全不同的系统,因此数据略有不同。例如,我们有“ 123 E. Main St.” 我们会收到“ 123 East Main Street”。另一个示例,我们有“ Acme,LLC”,文件包含“ Acme Inc.”。另一个是,我们有“爱德史密斯”,而他们有“爱德华史密斯”
我们有一个旧系统,该系统利用一些相当复杂且占用大量CPU的方法来处理这些匹配项。一些涉及纯SQL,而其他涉及Access数据库中的VBA代码。当前的系统是好的,但不是完美的,麻烦且难以维护
此处的管理层希望扩大其用途。将继承系统支持的开发人员希望用一种更灵活的解决方案来替换它,该解决方案需要更少的维护。
有没有一种普遍接受的方式来处理这种数据匹配?
这是我为几乎相同的堆栈编写的(我们需要标准化硬件的制造商名称,并且有各种各样的变体)。不过,这是客户端(准确地说是VB.Net)-并使用Levenshtein距离算法(已修改,以获得更好的结果):
Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String
Dim bestMatch As String = ""
Dim bestDistance As Integer = 1000 'Almost anything should be better than that!
For Each matchCandidate As String In stringList
Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
If candidateDistance < bestDistance Then
bestMatch = matchCandidate
bestDistance = candidateDistance
End If
Next
Return bestMatch
End Function
'This will be used to determine how similar strings are. Modified from the link below...
'Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentNews.newsDetails&newsID=37030&from=list
Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
Dim sLength As Integer = s.Length ' length of s
Dim tLength As Integer = t.Length ' length of t
Dim lvCost As Integer ' cost
Dim lvDistance As Integer = 0
Dim zeroCostCount As Integer = 0
Try
' Step 1
If tLength = 0 Then
Return sLength
ElseIf sLength = 0 Then
Return tLength
End If
Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}
' fill first row
For lvIndex As Integer = 0 To sLength
poBuffer(lvIndex) = lvIndex
Next
'fill first column
For lvIndex As Integer = 1 To tLength
poBuffer(lvIndex * (sLength + 1)) = lvIndex
Next
For lvRowIndex As Integer = 0 To sLength - 1
Dim s_i As Char = s(lvRowIndex)
For lvColIndex As Integer = 0 To tLength - 1
If s_i = t(lvColIndex) Then
lvCost = 0
zeroCostCount += 1
Else
lvCost = 1
End If
' Step 6
Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
Next
Next
Catch ex As ThreadAbortException
Err.Clear()
Catch ex As Exception
WriteDebugMessage(Application.StartupPath , [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
End Try
Return lvDistance - zeroCostCount
End Function
版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。