[OCR Study Notes] 7. A Roundup of Classic Star Models with PyTorch Implementations (Part 2)


1.5 ResNet

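As networks get deeper, model accuracy tends to improve as well. Deepening a network, however, runs into a persistent problem: vanishing gradients become more and more pronounced. During backpropagation, the gradients reaching the earlier layers become very small, which means some layers are barely updated; adding depth then brings no benefit and only increases computation. A deeper network also has more parameters, which makes optimization harder. The residual network greatly alleviates this problem.

Kaiming He and colleagues at Microsoft Research Asia proposed the Deep Residual Network, which won the ImageNet competition that year (ILSVRC 2015). The network is called ResNet for short (named after "Residual"). It reaches 152 layers and lowers the top-5 error rate to 3.57%, whereas GoogLeNet, the 2014 winner, had an error rate of 6.7%. The structure of ResNet is shown in the figure: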

[Figure: ResNet structure]

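A residual network adds an identity mapping that sends the current output directly across layers to the next layer. This acts as a shortcut that skips the computation of the current layer and is called a skip connection. During backpropagation, the gradient of the next layer is likewise passed directly to the previous layer, which greatly alleviates vanishing gradients in deep networks. Its expression is: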

[Formula image: residual mapping expression]
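Assuming the expression is the standard residual formulation from the ResNet paper, with $x$ the block input, $\mathcal{F}$ the stacked layers and $\{W_i\}$ their weights (this notation is an assumption, not taken from the text), it can be sketched together with the gradient that explains why the shortcut helps:

```latex
y = \mathcal{F}(x, \{W_i\}) + x
\qquad\Longrightarrow\qquad
\frac{\partial L}{\partial x}
  = \frac{\partial L}{\partial y}
    \left( 1 + \frac{\partial \mathcal{F}}{\partial x} \right)
```

The constant term 1 comes from the identity shortcut: even when $\partial \mathcal{F} / \partial x$ is very small, the gradient of the next layer still reaches the previous layer unchanged.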

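By carrying features forward layer by layer, ResNet preserves the feature information of every level of the model, which mitigates vanishing gradients to a considerable extent, and the shortcut connections add few extra parameters. These advantages have made ResNet the backbone of many detection and segmentation algorithms.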
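The point about parameters can be checked with a minimal sketch. `TinyResidualBlock` and `count_params` are hypothetical names introduced only for this illustration; the block is a simplified stand-in, not the implementation used in this article. It shows that an identity shortcut adds no parameters at all (a projection shortcut, used when shapes change, would add a small 1x1 convolution).

```python
import torch
import torch.nn as nn


class TinyResidualBlock(nn.Module):
    """Hypothetical minimal block computing relu(F(x) + x)."""

    def __init__(self, channels, use_shortcut=True):
        super().__init__()
        # F(x): two 3x3 convolutions with batch normalization
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.use_shortcut = use_shortcut

    def forward(self, x):
        out = self.body(x)
        if self.use_shortcut:
            out = out + x  # identity shortcut: a plain addition, no weights
        return torch.relu(out)


def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


plain = TinyResidualBlock(16, use_shortcut=False)
residual = TinyResidualBlock(16, use_shortcut=True)
x = torch.randn(2, 16, 8, 8)
y = residual(x)
print(count_params(plain), count_params(residual), tuple(y.shape))
```

Both blocks report the same parameter count, and the output keeps the input shape, which is what makes the element-wise addition possible.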

The following is a PyTorch implementation of ResNet:

import math
from collections import OrderedDict  # required by the OrderedDict() calls below
import torch.nn as nn
def conv3x3(in_planes, out_planes, stride=1):  
    # "3x3 convolution with padding"  
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=1, bias=False)  
class BasicBlock(nn.Module):  
    expansion = 1  
    def __init__(self, inplanes, planes, stride=1, downsample=None):  
        super(BasicBlock, self).__init__()  
        m = OrderedDict()  
        m['conv1'] = conv3x3(inplanes, planes, stride)  
        m['bn1'] = nn.BatchNorm2d(planes)  
        m['relu1'] = nn.ReLU(inplace=True)  
        m['conv2'] = conv3x3(planes, planes)  
        m['bn2'] = nn.BatchNorm2d(planes)  
        self.group1 = nn.Sequential(m)  
        self.relu= nn.Sequential(nn.ReLU(inplace=True))  
        self.downsample = downsample  
    def forward(self, x):  
        if self.downsample is not None:  
            residual = self.downsample(x)  
        else:  
            residual = x  
        out = self.group1(x) + residual  
        out = self.relu(out)  
        return out  
class Bottleneck(nn.Module):  
    expansion = 4  
    def __init__(self, inplanes, planes, stride=1, downsample=None):  
        super(Bottleneck, self).__init__()  
        m  = OrderedDict()  
        m['conv1'] = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)  
        m['bn1'] = nn.BatchNorm2d(planes)  
        m['relu1'] = nn.ReLU(inplace=True)  
        m['conv2'] = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)  
        m['bn2'] = nn.BatchNorm2d(planes)  
        m['relu2'] = nn.ReLU(inplace=True)  
        m['conv3'] = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)  
        m['bn3'] = nn.BatchNorm2d(planes * 4)  
        self.group1 = nn.Sequential(m)  
        self.relu= nn.Sequential(nn.ReLU(inplace=True))  
        self.downsample = downsample  
    def forward(self, x):  
        if self.downsample is not None:  
            residual = self.downsample(x)  
        else:  
            residual = x  
        out = self.group1(x) + residual  
        out = self.relu(out)  
        return out  
class ResNet(nn.Module):  
    def __init__(self, block, layers, num_classes=1000):  
        self.inplanes = 64  
        super(ResNet, self).__init__()    
        m = OrderedDict()  
        m['conv1'] = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)  
        m['bn1'] = nn.BatchNorm2d(64)  
        m['relu1'] = nn.ReLU(inplace=True)  
        m['maxpool'] = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)  
        self.group1= nn.Sequential(m)  
        self.layer1 = self._make_layer(block, 64, layers[0])  
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)  
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)  
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)  
        self.avgpool = nn.Sequential(nn.AvgPool2d(7))  
        self.group2 = nn.Sequential(OrderedDict([('fc', nn.Linear(512 * block.expansion, num_classes))]))     
        for m in self.modules():  
            if isinstance(m, nn.Conv2d):  
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels  
                m.weight.data.normal_(0, math.sqrt(2. / n))  
            elif isinstance(m, nn.BatchNorm2d):  
                m.weight.data.fill_(1)  
                m.bias.data.zero_()  
    def _make_layer(self, block, planes, blocks, stride=1):  
        downsample = None  
        if stride != 1 or self.inplanes != planes * block.expansion:  
            # Project the shortcut with a 1x1 conv when the shape changes
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )
        layers = []  
        layers.append(block(self.inplanes, planes, stride, downsample))  
        self.inplanes = planes * block.expansion  
        for i in range(1, blocks):  
            layers.append(block(self.inplanes, planes))  
        return nn.Sequential(*layers)  
    def forward(self, x):  
        x = self.group1(x)  
        x = self.layer1(x)  
        x = self.layer2(x)  
        x = self.layer3(x)  
        x = self.layer4(x)  
        x = self.avgpool(x)  
        x = x.view(x.size(0), -1)  
        x = self.group2(x)  
        return x  
def resnet18(pretrained=False, model_root=None, **kwargs):
    # pretrained and model_root are accepted but unused in this listing
    model = ResNet(BasicBlock, [2, 2, 2, 2], **kwargs)
    return model

1.6 DenseNet

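Inspired by ResNet, DenseNet was proposed in 2016. DenseNet makes the input of each convolutional layer the concatenation of the outputs of all preceding layers. This dense connectivity lets every layer reuse all previously learned features without relearning them. As in ResNet, gradients also propagate more easily, which makes deep networks easier to train. The structure of DenseNet is shown in the figure.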

[Figure: DenseNet structure]

Compared with ResNet, DenseNet goes a step further and keeps concatenating low-level features forward. The main formula is as follows:
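Assuming the formula is the standard dense-connectivity equation from the DenseNet paper, where $x_\ell$ is the output of layer $\ell$, $H_\ell$ is its composite function, and $[\cdot]$ denotes channel-wise concatenation (this notation is an assumption), the contrast with ResNet can be sketched as:

```latex
\text{ResNet:}\quad x_\ell = H_\ell(x_{\ell-1}) + x_{\ell-1}
\qquad\qquad
\text{DenseNet:}\quad x_\ell = H_\ell\big([x_0, x_1, \ldots, x_{\ell-1}]\big)
```

ResNet sums the previous output with the new features, while DenseNet concatenates every earlier output, so the number of input channels grows with depth.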


Detailed configuration table:

[Table image: DenseNet configuration details]

Note that the DenseNet variants in the table consist mainly of Dense Blocks and Transition Layers, and each conv in the table stands for the sequence BatchNorm + ReLU + Conv.
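How Dense Blocks and Transition Layers change the channel count can be traced with plain arithmetic. The helper below is a hypothetical sketch (`densenet_channels` is not part of any library). It assumes the stem outputs twice the growth rate and that each transition compresses channels by a factor `theta`, which mirrors the bookkeeping in the implementation that follows.

```python
def densenet_channels(growth_rate, block_config, theta=0.5):
    """Trace channel counts through dense blocks and transition layers.

    Assumption: the stem outputs 2 * growth_rate channels.
    """
    num_feature = 2 * growth_rate
    trace = [("stem", num_feature)]
    for i, num_layers in enumerate(block_config):
        # Every dense layer concatenates growth_rate new channels.
        num_feature += growth_rate * num_layers
        trace.append((f"denseblock{i + 1}", num_feature))
        if i != len(block_config) - 1:
            # A transition layer compresses the channels by theta.
            num_feature = int(num_feature * theta)
            trace.append((f"transition{i + 1}", num_feature))
    return trace


# Configuration used for DenseNet-121 in this article
trace = densenet_channels(growth_rate=32, block_config=(6, 12, 24, 16))
for name, channels in trace:
    print(f"{name:12s} {channels}")
```

With growth rate 32 and blocks (6, 12, 24, 16), the trace ends at 1024 channels, which is the input width of the final classifier.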

The following is a PyTorch implementation of DenseNet:

import torch
import torch.nn as nn
from collections import OrderedDict
class _DenseLayer(nn.Sequential):
    def __init__(self, in_channels, growth_rate, bn_size):
        super(_DenseLayer, self).__init__()
        self.add_module('norm1', nn.BatchNorm2d(in_channels))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module('conv1', nn.Conv2d(in_channels, bn_size * growth_rate,
                                           kernel_size=1,
                                           stride=1, bias=False))
        self.add_module('norm2', nn.BatchNorm2d(bn_size*growth_rate))
        self.add_module('relu2', nn.ReLU(inplace=True))
        self.add_module('conv2', nn.Conv2d(bn_size*growth_rate, growth_rate,
                                           kernel_size=3,
                                           stride=1, padding=1, bias=False))
    # Override forward: concatenate the input with the newly computed features
    def forward(self, x):
        new_features = super(_DenseLayer, self).forward(x)
        return torch.cat([x, new_features], 1)
class _DenseBlock(nn.Sequential):
    def __init__(self, num_layers, in_channels, bn_size, growth_rate):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            self.add_module('denselayer%d' % (i+1), _DenseLayer(in_channels+growth_rate*i, growth_rate, bn_size))
class _Transition(nn.Sequential):
    def __init__(self, in_channels, out_channels):
        super(_Transition, self).__init__()
        self.add_module('norm', nn.BatchNorm2d(in_channels))
        self.add_module('relu', nn.ReLU(inplace=True))
        self.add_module('conv', nn.Conv2d(in_channels, out_channels,
                                          kernel_size=1,
                                          stride=1, bias=False))
        self.add_module('pool', nn.AvgPool2d(kernel_size=2, stride=2))
class DenseNet_BC(nn.Module):
    def __init__(self, growth_rate=12, block_config=(6,12,24,16),
                 bn_size=4, theta=0.5, num_classes=10):
        super(DenseNet_BC, self).__init__()
        # The initial convolution has 2 * growth_rate filters
        num_init_feature = 2 * growth_rate
        # num_classes == 10 is treated as CIFAR-10 (small 32x32 inputs)
        if num_classes == 10:
            self.features = nn.Sequential(OrderedDict([
                ('conv0', nn.Conv2d(3, num_init_feature,
                                    kernel_size=3, stride=1,
                                    padding=1, bias=False)), ]))
        else:
            self.features = nn.Sequential(OrderedDict([
                ('conv0', nn.Conv2d(3, num_init_feature,
                                    kernel_size=7, stride=2,
                                    padding=3, bias=False)),
                ('norm0', nn.BatchNorm2d(num_init_feature)),
                ('relu0', nn.ReLU(inplace=True)),
                ('pool0', nn.MaxPool2d(kernel_size=3, stride=2, padding=1)) ]))
        num_feature = num_init_feature
        for i, num_layers in enumerate(block_config):
            self.features.add_module('denseblock%d' % (i+1), _DenseBlock(num_layers, num_feature, bn_size, growth_rate))
            num_feature = num_feature + growth_rate * num_layers
            if i != len(block_config)-1:
                self.features.add_module('transition%d' % (i + 1), _Transition(num_feature, int(num_feature * theta)))
                num_feature = int(num_feature * theta)
        self.features.add_module('norm5', nn.BatchNorm2d(num_feature))
        self.features.add_module('relu5', nn.ReLU(inplace=True))
        self.features.add_module('avg_pool', nn.AdaptiveAvgPool2d((1, 1)))
        self.classifier = nn.Linear(num_feature, num_classes)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.constant_(m.bias, 0)
    def forward(self, x):
        features = self.features(x)
        out = features.view(features.size(0), -1)
        out = self.classifier(out)
        return out
# DenseNet_BC for ImageNet
def DenseNet121():
    return DenseNet_BC(growth_rate=32, block_config=(6, 12, 24, 16), num_classes=1000)
def DenseNet169():
    return DenseNet_BC(growth_rate=32, block_config=(6, 12, 32, 32), num_classes=1000)
def DenseNet201():
    return DenseNet_BC(growth_rate=32, block_config=(6, 12, 48, 32), num_classes=1000)
def DenseNet161():
    return DenseNet_BC(growth_rate=48, block_config=(6, 12, 36, 24), num_classes=1000,)
# DenseNet_BC for cifar
def densenet_BC_100():
    return DenseNet_BC(growth_rate=12, block_config=(16, 16, 16))
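def test():
    # densenet_cifar() is not defined in this listing; use the CIFAR variant above
    net = densenet_BC_100()
    x = torch.randn(2, 3, 32, 32)
    y = net(x)
    print(y.size())  # torch.Size([2, 10])
if __name__ == '__main__':
    test()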

1.7 SENet

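SENet is inspired by the attention ideas of recent years. Its core idea is to predict a weight for every output channel and then reweight each channel accordingly. Ordinary convolution operates in 2D space, so in essence it models only the spatial information of the image and does not model the relationships between channels; SENet is an attempt to model that inter-channel information.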

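The basic structure of SENet is shown in the figure:

A series of convolution and pooling operations first produces a feature map of size C×H×W, after which the Squeeze and Excitation operations are applied:

  • Squeeze: apply global average pooling to the C×H×W feature map to obtain a 1×1×C feature map, which can be viewed as having a global receptive field.
  • Excitation: use a fully connected network to apply a nonlinear transformation to the result of Squeeze.
  • Feature recalibration: use the result of Excitation as weights and multiply them with the input features.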
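The three steps above can be sketched as a standalone module. `SEBlock` is a hypothetical name, and the reduction ratio of 16 is an assumed default chosen to match the `planes//16` used in the implementation below. This sketch uses `nn.Linear` for the excitation, whereas the article's code uses equivalent 1x1 convolutions.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Hypothetical minimal Squeeze-and-Excitation module."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.squeeze(x).view(b, c)        # Squeeze: (B, C, H, W) -> (B, C)
        w = self.excite(s).view(b, c, 1, 1)   # Excitation: one weight per channel
        return x * w                          # Recalibration: broadcast over H x W


x = torch.randn(2, 32, 8, 8)
se = SEBlock(32)
out = se(x)
print(tuple(out.shape))
```

Because the sigmoid keeps every weight between 0 and 1, the block can only attenuate channels relative to each other; it never changes the shape of the feature map.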

The following is a PyTorch implementation of SENet:

import torch
import torch.nn as nn
import torch.nn.functional as F
class BasicBlock(nn.Module):
    def __init__(self, in_planes, planes, stride=1):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes)
            )
        # SE layers
        self.fc1 = nn.Conv2d(planes, planes//16, kernel_size=1)  # Use nn.Conv2d instead of nn.Linear
        self.fc2 = nn.Conv2d(planes//16, planes, kernel_size=1)
    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
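        # Squeeze: global average pooling to a 1x1 map per channel
        w = F.avg_pool2d(out, out.size(2))
        # Excitation: two 1x1 convs acting as fully connected layers
        w = F.relu(self.fc1(w))
        w = torch.sigmoid(self.fc2(w))  # F.sigmoid is deprecated
        # Recalibration: scale each channel (w broadcasts over H x W)
        out = out * w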
        out += self.shortcut(x)
        out = F.relu(out)
        return out
class PreActBlock(nn.Module):
    def __init__(self, in_planes, planes, stride=1):
        super(PreActBlock, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
        if stride != 1 or in_planes != planes:
            self.shortcut = nn.Sequential(nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride, bias=False))
        # SE layers
        self.fc1 = nn.Conv2d(planes, planes//16, kernel_size=1)
        self.fc2 = nn.Conv2d(planes//16, planes, kernel_size=1)
    def forward(self, x):
        out = F.relu(self.bn1(x))
        shortcut = self.shortcut(out) if hasattr(self, 'shortcut') else x
        out = self.conv1(out)
        out = self.conv2(F.relu(self.bn2(out)))
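        # Squeeze: global average pooling to a 1x1 map per channel
        w = F.avg_pool2d(out, out.size(2))
        # Excitation: two 1x1 convs acting as fully connected layers
        w = F.relu(self.fc1(w))
        w = torch.sigmoid(self.fc2(w))  # F.sigmoid is deprecated
        # Recalibration: scale each channel (w broadcasts over H x W)
        out = out * w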
        out += shortcut
        return out
class SENet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=10):
        super(SENet, self).__init__()
        self.in_planes = 64
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.layer1 = self._make_layer(block,  64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        self.linear = nn.Linear(512, num_classes)
    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes
        return nn.Sequential(*layers)
    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = F.avg_pool2d(out, 4)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out
def SENet18():
    return SENet(PreActBlock, [2,2,2,2])
def test():
    net = SENet18()
    y = net(torch.randn(1,3,32,32))
    print(y.size())
# test()

1.8 SKNet

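SKNet studies an attention mechanism over convolution kernels. Receptive fields (kernels) of different sizes work differently for targets at different scales (near or far, large or small). Architectures such as Inception add multiple kernels to adapt to images at different scales, but once training is finished the parameters are fixed, so all of the multi-scale information is used indiscriminately (every kernel carries the same weight).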

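SKNet proposes a mechanism for kernel importance: different images yield kernels with different importance. According to the authors, the module brings a large improvement on super-resolution tasks, and the experiments in the paper also confirm strong performance on classification. SKNet applies different kernel weights to different images; in other words, it dynamically generates kernels for images of different scales.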
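The idea of input-dependent kernel weights can be sketched at the tensor level. The tensors below are random stand-ins (assumptions for illustration only): `branch_feats` plays the role of the outputs of M convolution branches with different kernel sizes, and `logits` plays the role of the scores predicted from the input.

```python
import torch

torch.manual_seed(0)
B, M, C, H, W = 2, 2, 4, 8, 8  # M branches, e.g. a 3x3 and a 5x5 kernel

# Stand-in for the outputs of the M convolution branches
branch_feats = torch.randn(B, M, C, H, W)
# Stand-in for the per-branch, per-channel scores predicted from the input
logits = torch.randn(B, M, C)

# Softmax across the branch dimension: for every image and channel,
# the M kernel weights are positive and sum to 1
attention = torch.softmax(logits, dim=1)

# Weighted sum over branches gives the selected feature map
selected = (branch_feats * attention[..., None, None]).sum(dim=1)
print(tuple(selected.shape))
```

Since the weights depend on the input, two different images mix the same kernels in different proportions, unlike a fixed multi-branch design where every branch always contributes equally.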

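The overall structure is shown in the figure below:

[Figure: overall SKNet structure]

The figure below is the network diagram redrawn from the code by the author of the GiantPandaCV WeChat official account.

[Figure: SKNet network diagram redrawn from the code]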

import torch
from torch import nn
# SKConv: the module that replaces the plain 3x3 convolution
class SKConv(nn.Module):
    def __init__(self, features, WH, M, G, r, stride=1 ,L=32):
        """ Constructor
        Args:
            features: input channel dimensionality.
            WH: input spatial dimensionality, used for GAP kernel size.
            M: the number of branches.
            G: num of convolution groups.
            r: the ratio used to compute d, the length of z.
            stride: stride, default 1.
            L: the minimum dim of the vector z in paper, default 32.
        """
        super(SKConv, self).__init__()
        d = max(int(features/r), L)
        self.M = M
        self.features = features
        self.convs = nn.ModuleList([])
        for i in range(M):
            self.convs.append(nn.Sequential(
                nn.Conv2d(features, features, kernel_size=3+i*2, stride=stride, padding=1+i, groups=G),
                nn.BatchNorm2d(features),
                nn.ReLU(inplace=False)
            ))
        self.gap = nn.AvgPool2d(int(WH/stride))
        self.fc = nn.Linear(features, d)
        self.fcs = nn.ModuleList([])
        for i in range(M):
            self.fcs.append(
                nn.Linear(d, features)
            )
        self.softmax = nn.Softmax(dim=1)
    def forward(self, x):
        # Split: run every branch and stack the results -> (B, M, C, H, W)
        feas = torch.stack([conv(x) for conv in self.convs], dim=1)
        # Fuse: sum the branches, apply global average pooling, compress to z
        fea_U = torch.sum(feas, dim=1)
        fea_s = self.gap(fea_U).flatten(1)  # (B, C); keeps the batch dim even when B == 1
        fea_z = self.fc(fea_s)
        # Select: one attention vector per branch, softmax across branches -> (B, M, C)
        attention_vectors = torch.stack([fc(fea_z) for fc in self.fcs], dim=1)
        attention_vectors = self.softmax(attention_vectors)
        attention_vectors = attention_vectors.unsqueeze(-1).unsqueeze(-1)
        fea_v = (feas * attention_vectors).sum(dim=1)
        return fea_v
# SKUnit: the new residual block built around SKConv
class SKUnit(nn.Module):
    def __init__(self, in_features, out_features, WH, M, G, r, mid_features=None, stride=1, L=32):
        """ Constructor
        Args:
            in_features: input channel dimensionality.
            out_features: output channel dimensionality.
            WH: input spatial dimensionality, used for GAP kernel size.
            M: the number of branches.
            G: num of convolution groups.
            r: the ratio used to compute d, the length of z.
            mid_features: the channel dim of the middle conv with stride not 1, default out_features/2.
            stride: stride.
            L: the minimum dim of the vector z in paper.
        """
        super(SKUnit, self).__init__()
        if mid_features is None:
            mid_features = int(out_features/2)
        self.feas = nn.Sequential(
            nn.Conv2d(in_features, mid_features, 1, stride=1),
            nn.BatchNorm2d(mid_features),
            SKConv(mid_features, WH, M, G, r, stride=stride, L=L),
            nn.BatchNorm2d(mid_features),
            nn.Conv2d(mid_features, out_features, 1, stride=1),
            nn.BatchNorm2d(out_features)
        )
        if in_features == out_features: # when the dim does not change, the input can be added directly to the output
            self.shortcut = nn.Sequential()
        else: # when the dim changes, the input must be projected before it is added to the output
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_features, out_features, 1, stride=stride),
                nn.BatchNorm2d(out_features)
            )
    def forward(self, x):
        fea = self.feas(x)
        return fea + self.shortcut(x)
class SKNet(nn.Module):
    def __init__(self, class_num):
        super(SKNet, self).__init__()
        self.basic_conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.BatchNorm2d(64)
        ) # 32x32
        self.stage_1 = nn.Sequential(
            # WH must match the spatial size of each unit's input; stride=2 is dropped
            # here so the map stays 32x32, as the size notes and AvgPool2d(8) below assume.
            SKUnit(64, 256, 32, 2, 8, 2),
            nn.ReLU(),
            SKUnit(256, 256, 32, 2, 8, 2),
            nn.ReLU(),
            SKUnit(256, 256, 32, 2, 8, 2),
            nn.ReLU()
        ) # 32x32
        self.stage_2 = nn.Sequential(
            SKUnit(256, 512, 32, 2, 8, 2, stride=2),
            nn.ReLU(),
            SKUnit(512, 512, 16, 2, 8, 2),
            nn.ReLU(),
            SKUnit(512, 512, 16, 2, 8, 2),
            nn.ReLU()
        ) # 16x16
        self.stage_3 = nn.Sequential(
            SKUnit(512, 1024, 16, 2, 8, 2, stride=2),
            nn.ReLU(),
            SKUnit(1024, 1024, 8, 2, 8, 2),
            nn.ReLU(),
            SKUnit(1024, 1024, 8, 2, 8, 2),
            nn.ReLU()
        ) # 8x8
        self.pool = nn.AvgPool2d(8)
        self.classifier = nn.Sequential(
            nn.Linear(1024, class_num),
            # nn.Softmax(dim=1)
        )
    def forward(self, x):
        fea = self.basic_conv(x)
        fea = self.stage_1(fea)
        fea = self.stage_2(fea)
        fea = self.stage_3(fea)
        fea = self.pool(fea)
        fea = fea.flatten(1)  # (B, 1024); keeps the batch dim even when B == 1
        fea = self.classifier(fea)
        return fea
if __name__=='__main__':
    x = torch.rand(8,64,32,32)
    conv = SKConv(64, 32, 3, 8, 2)
    out = conv(x)
    criterion = nn.L1Loss()
    loss = criterion(out, x)
    loss.backward()
    print('out shape : {}'.format(out.shape))
    print('loss value : {}'.format(loss))

1.9 ResNeSt-Net

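ResNeSt builds on ResNet and introduces the Split-Attention block, which enables feature-map attention across different feature-map groups. A Split-Attention block is a computational unit composed of a feature-map group and split-attention operations.

A ResNeSt block divides the input into K groups, denoted Cardinal 1 to K, and then splits each cardinal group into R parts, denoted Split 1 to R, so there are G = K × R groups in total.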
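The grouping described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the official ResNeSt code: `SimpleSplitAttention` is a hypothetical name, the small dense network is shared across cardinal groups, and batch normalization and the grouped convolutions of the real block are left out.

```python
import torch
import torch.nn as nn


class SimpleSplitAttention(nn.Module):
    """Simplified sketch: K cardinal groups, R splits each, c channels per split."""

    def __init__(self, K, R, c, hidden=8):
        super().__init__()
        self.K, self.R, self.c = K, R, c
        # Small dense network shared across cardinal groups (a simplification)
        self.fc = nn.Sequential(
            nn.Linear(c, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, R * c),
        )

    def forward(self, x):
        B, _, H, W = x.shape
        # (B, G*c, H, W) -> (B, K, R, c, H, W), with G = K * R groups in total
        x = x.view(B, self.K, self.R, self.c, H, W)
        fused = x.sum(dim=2)                  # sum the R splits of each cardinal group
        s = fused.mean(dim=(-2, -1))          # global average pooling -> (B, K, c)
        logits = self.fc(s).view(B, self.K, self.R, self.c)
        attn = torch.softmax(logits, dim=2)   # softmax over the R splits
        out = (x * attn[..., None, None]).sum(dim=2)  # (B, K, c, H, W)
        return out.reshape(B, self.K * self.c, H, W)


K, R, c = 2, 3, 4
G = K * R  # total number of feature-map groups
x = torch.randn(2, G * c, 5, 5)
out = SimpleSplitAttention(K, R, c)(x)
print(G, tuple(out.shape))
```

Each cardinal group collapses its R splits into one attention-weighted map, so the output has K × c channels while the input carried G × c.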

The source code is available at: https://github.com/zhanghang1989/ResNeSt
