Text recognition is an important application of AI. A typical OCR pipeline consists of image input, preprocessing, text detection, text recognition, and result output.

Among these, text detection and text recognition are the two core stages. On the detection side, my OCR_detection column has already covered a number of deep-learning-based methods that handle text detection across a variety of scenarios (some articles are still in progress and will be added to the column once finished); see the relevant articles in that column for details.

In earlier OCR systems, recognition was a two-step process: character segmentation followed by classification. A text-line image containing a string of characters was first cut into individual characters, typically with the projection method, and each character crop was then fed into a CNN for classification. This approach is now somewhat dated; the current mainstream is end-to-end text recognition based on deep learning, which drops the explicit segmentation step and recasts text recognition as a sequence-learning problem. Even though input images differ in scale and text length, after passing through a DCNN and an RNN, followed by CTC transcription at the output stage, the entire text image is recognized in one pass. In other words, character segmentation is absorbed into the deep network itself.
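To make the contrast concrete, here is a minimal sketch of the older projection-based segmentation step (a hypothetical helper, assuming a binarized single-line image where text pixels are non-zero):

import numpy as np

def split_chars_by_projection(binary_img):
    # Vertical projection: count foreground pixels in each column.
    # Gaps between characters show up as all-zero columns.
    profile = binary_img.sum(axis=0)
    boxes, in_char, start = [], False, 0
    for x, v in enumerate(profile):
        if v > 0 and not in_char:
            in_char, start = True, x
        elif v == 0 and in_char:
            in_char = False
            boxes.append((start, x))
    if in_char:
        boxes.append((start, len(profile)))
    # One crop per detected character, ready for a CNN classifier.
    return [binary_img[:, s:e] for s, e in boxes]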

Today there are two mainstream end-to-end deep-learning OCR techniques: CRNN OCR and attention OCR. The main difference between them lies in the final output (transcription) layer, i.e., how the sequence features learned by the network are converted into the final recognition result. Both approaches use a CNN+RNN architecture in the feature-learning stage; CRNN OCR performs alignment with the CTC algorithm, while attention OCR uses an attention mechanism. This section focuses on the more widely used CRNN algorithm.


The CRNN model combines a CNN and an RNN, trained jointly. It achieves, to a certain degree, end-to-end recognition of text sequences of arbitrary length: instead of segmenting individual characters first, it recasts text recognition as a temporally dependent sequence-learning problem, i.e., image-based sequence recognition. ("To a certain degree" because, although the input image does not need precise per-character position annotations, the original image still has to be cropped to the text line beforehand.)

CRNN consists of three parts:
CNN (convolutional layers): a deep CNN extracts features from the input image, producing a feature map;
RNN (recurrent layers): a bidirectional RNN (BLSTM) makes predictions over the feature sequence, learning each feature vector in the sequence and outputting a distribution over the predicted labels;
CTC loss (transcription layer): the CTC loss converts the sequence of label distributions obtained from the recurrent layers into the final label sequence (a usage sketch follows this list).
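As a concrete illustration of the transcription layer, PyTorch's built-in nn.CTCLoss consumes exactly such per-timestep label distributions. The shapes and values below are illustrative only (blank index 0, 26 time steps, 37 classes):

import torch
import torch.nn as nn

T, b, nclass = 26, 4, 37  # time steps, batch size, classes (index 0 = blank)
log_probs = torch.randn(T, b, nclass).log_softmax(2)           # stand-in for RNN output
targets = torch.randint(1, nclass, (b, 10), dtype=torch.long)  # label indices per sample
input_lengths = torch.full((b,), T, dtype=torch.long)
target_lengths = torch.randint(5, 11, (b,), dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)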


During prediction, a standard CNN first extracts features from the text image; a BLSTM then fuses the feature vectors to extract the contextual features of the character sequence, yielding a probability distribution for each feature column; finally, CTC decoding over these distributions produces the text sequence.
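The simplest decoding rule for that final step is greedy (best-path) CTC decoding: take the argmax of each column's distribution, collapse consecutive repeats, and drop blanks. A minimal sketch, assuming blank index 0:

def ctc_greedy_decode(log_probs, blank=0):
    # log_probs: [T, b, nclass] -> one label-index sequence per sample
    best = log_probs.argmax(dim=2)  # [T, b]
    results = []
    for j in range(best.size(1)):
        seq, prev = [], blank
        for t in range(best.size(0)):
            k = best[t, j].item()
            if k != prev and k != blank:  # collapse repeats, then remove blanks
                seq.append(k)
            prev = k
        results.append(seq)
    return results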

By using BLSTM and CTC to capture the contextual relationships within the text image, recognition accuracy improves noticeably and the model becomes more robust.

During training, CRNN rescales all training images to a uniform w×32 size (w×h). At test time, to avoid the drop in accuracy caused by stretching characters, CRNN preserves the input image's aspect ratio, but the image height must still be normalized to 32 pixels; the width of the resulting convolutional feature map then dynamically determines the sequence length (number of time steps) of the LSTM.
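A minimal sketch of this test-time preprocessing (a hypothetical helper; PIL and torchvision are assumed):

from PIL import Image
import torchvision.transforms.functional as TF

def preprocess(img, target_h=32):
    # Keep the aspect ratio, but normalize the height to 32 pixels.
    w, h = img.size
    new_w = max(1, round(w * target_h / h))
    img = img.convert('L').resize((new_w, target_h), Image.BILINEAR)
    return TF.to_tensor(img)  # shape [1, 32, new_w]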

Code:
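The snippets below need PyTorch and a BidirectionalLSTM helper module, which is not shown in the original listing. For completeness, here are the required imports and one common implementation of that helper (the LSTM runs over the width dimension of the feature map; its bidirectional outputs are projected to the target size by a linear layer):

import torch
import torch.nn as nn
from collections import OrderedDict


class BidirectionalLSTM(nn.Module):

    def __init__(self, nIn, nHidden, nOut):
        super(BidirectionalLSTM, self).__init__()
        self.rnn = nn.LSTM(nIn, nHidden, bidirectional=True)
        self.embedding = nn.Linear(nHidden * 2, nOut)

    def forward(self, input):
        # input: [T, b, nIn]; the bidirectional LSTM returns [T, b, 2 * nHidden]
        recurrent, _ = self.rnn(input)
        T, b, h = recurrent.size()
        t_rec = recurrent.view(T * b, h)
        output = self.embedding(t_rec)  # project to [T * b, nOut]
        output = output.view(T, b, -1)
        return output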

class CRNN(nn.Module):

    def __init__(self, imgH, nc, nclass, nh, leakyRelu=False):
        super(CRNN, self).__init__()
        assert imgH % 16 == 0, 'imgH has to be a multiple of 16'

        # 1x32x128
        self.conv1 = nn.Conv2d(nc, 64, 3, 1, 1)
        self.relu1 = nn.ReLU(True)
        self.pool1 = nn.MaxPool2d(2, 2)

        # 64x16x64
        self.conv2 = nn.Conv2d(64, 128, 3, 1, 1)
        self.relu2 = nn.ReLU(True)
        self.pool2 = nn.MaxPool2d(2, 2)

        # 128x8x32
        self.conv3_1 = nn.Conv2d(128, 256, 3, 1, 1)
        self.bn3 = nn.BatchNorm2d(256)
        self.relu3_1 = nn.ReLU(True)
        self.conv3_2 = nn.Conv2d(256, 256, 3, 1, 1)
        self.relu3_2 = nn.ReLU(True)
        self.pool3 = nn.MaxPool2d((2, 2), (2, 1), (0, 1))

        # 256x4x16
        self.conv4_1 = nn.Conv2d(256, 512, 3, 1, 1)
        self.bn4 = nn.BatchNorm2d(512)
        self.relu4_1 = nn.ReLU(True)
        self.conv4_2 = nn.Conv2d(512, 512, 3, 1, 1)
        self.relu4_2 = nn.ReLU(True)
        self.pool4 = nn.MaxPool2d((2, 2), (2, 1), (0, 1))

        # 512x2x16
        self.conv5 = nn.Conv2d(512, 512, 2, 1, 0)
        self.bn5 = nn.BatchNorm2d(512)
        self.relu5 = nn.ReLU(True)

        # 512x1x16
        self.rnn = nn.Sequential(
            BidirectionalLSTM(512, nh, nh),
            BidirectionalLSTM(nh, nh, nclass))

    def forward(self, input):
        # conv features
        x = self.pool1(self.relu1(self.conv1(input)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = self.pool3(self.relu3_2(self.conv3_2(self.relu3_1(self.bn3(self.conv3_1(x))))))
        x = self.pool4(self.relu4_2(self.conv4_2(self.relu4_1(self.bn4(self.conv4_1(x))))))
        conv = self.relu5(self.bn5(self.conv5(x)))
        # print(conv.size())

        b, c, h, w = conv.size()
        assert h == 1, "the height of conv must be 1"
        conv = conv.squeeze(2)
        conv = conv.permute(2, 0, 1)  # [w, b, c]

        # rnn features
        output = self.rnn(conv)

        return output


class CRNN_v2(nn.Module):

    def __init__(self, imgH, nc, nclass, nh, leakyRelu=False):
        super(CRNN_v2, self).__init__()
        assert imgH % 16 == 0, 'imgH has to be a multiple of 16'

        # 1x32x128
        self.conv1_1 = nn.Conv2d(nc, 32, 3, 1, 1)
        self.bn1_1 = nn.BatchNorm2d(32)
        self.relu1_1 = nn.ReLU(True)

        self.conv1_2 = nn.Conv2d(32, 64, 3, 1, 1)
        self.bn1_2 = nn.BatchNorm2d(64)
        self.relu1_2 = nn.ReLU(True)
        self.pool1 = nn.MaxPool2d(2, 2)

        # 64x16x64
        self.conv2_1 = nn.Conv2d(64, 64, 3, 1, 1)
        self.bn2_1 = nn.BatchNorm2d(64)
        self.relu2_1 = nn.ReLU(True)

        self.conv2_2 = nn.Conv2d(64, 128, 3, 1, 1)
        self.bn2_2 = nn.BatchNorm2d(128)
        self.relu2_2 = nn.ReLU(True)
        self.pool2 = nn.MaxPool2d(2, 2)

        # 128x8x32
        self.conv3_1 = nn.Conv2d(128, 96, 3, 1, 1)
        self.bn3_1 = nn.BatchNorm2d(96)
        self.relu3_1 = nn.ReLU(True)

        self.conv3_2 = nn.Conv2d(96, 192, 3, 1, 1)
        self.bn3_2 = nn.BatchNorm2d(192)
        self.relu3_2 = nn.ReLU(True)
        self.pool3 = nn.MaxPool2d((2, 2), (2, 1), (0, 1))

        # 192x4x32
        self.conv4_1 = nn.Conv2d(192, 128, 3, 1, 1)
        self.bn4_1 = nn.BatchNorm2d(128)
        self.relu4_1 = nn.ReLU(True)
        self.conv4_2 = nn.Conv2d(128, 256, 3, 1, 1)
        self.bn4_2 = nn.BatchNorm2d(256)
        self.relu4_2 = nn.ReLU(True)
        self.pool4 = nn.MaxPool2d((2, 2), (2, 1), (0, 1))

        # 256x2x32
        self.bn5 = nn.BatchNorm2d(256)

        self.rnn = nn.Sequential(
            BidirectionalLSTM(512, nh, nh),
            BidirectionalLSTM(nh, nh, nclass))

    def forward(self, input):
        # conv features
        x = self.pool1(self.relu1_2(self.bn1_2(self.conv1_2(self.relu1_1(self.bn1_1(self.conv1_1(input)))))))
        x = self.pool2(self.relu2_2(self.bn2_2(self.conv2_2(self.relu2_1(self.bn2_1(self.conv2_1(x)))))))
        x = self.pool3(self.relu3_2(self.bn3_2(self.conv3_2(self.relu3_1(self.bn3_1(self.conv3_1(x)))))))
        x = self.pool4(self.relu4_2(self.bn4_2(self.conv4_2(self.relu4_1(self.bn4_1(self.conv4_1(x)))))))
        conv = self.bn5(x)
        # print(conv.size())

        b, c, h, w = conv.size()
        assert h == 2, "the height of conv must be 2"
        conv = conv.reshape([b, c * h, w])
        conv = conv.permute(2, 0, 1)  # [w, b, c]

        # rnn features
        output = self.rnn(conv)

        return output
def conv3x3(nIn, nOut, stride=1):
    # 3x3 convolution with padding
    return nn.Conv2d(nIn, nOut, kernel_size=3, stride=stride, padding=1, bias=False)


class basic_res_block(nn.Module):

    def __init__(self, nIn, nOut, stride=1, downsample=None):
        super(basic_res_block, self).__init__()
        m = OrderedDict()
        m['conv1'] = conv3x3(nIn, nOut, stride)
        m['bn1'] = nn.BatchNorm2d(nOut)
        m['relu1'] = nn.ReLU(inplace=True)
        m['conv2'] = conv3x3(nOut, nOut)
        m['bn2'] = nn.BatchNorm2d(nOut)
        self.group1 = nn.Sequential(m)

        self.relu = nn.Sequential(nn.ReLU(inplace=True))
        self.downsample = downsample

    def forward(self, x):
        if self.downsample is not None:
            residual = self.downsample(x)
        else:
            residual = x
        out = self.group1(x) + residual
        out = self.relu(out)
        return out
class CRNN_res(nn.Module):

    def __init__(self, imgH, nc, nclass, nh):
        super(CRNN_res, self).__init__()
        assert imgH % 16 == 0, 'imgH has to be a multiple of 16'
        self.conv1 = nn.Conv2d(nc, 64, 3, 1, 1)
        self.relu1 = nn.ReLU(True)
        self.res1 = basic_res_block(64, 64)
        # 1x32x128
        down1 = nn.Sequential(nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False), nn.BatchNorm2d(128))
        self.res2_1 = basic_res_block(64, 128, 2, down1)
        self.res2_2 = basic_res_block(128, 128)
        # 64x16x64
        down2 = nn.Sequential(nn.Conv2d(128, 256, kernel_size=1, stride=2, bias=False), nn.BatchNorm2d(256))
        self.res3_1 = basic_res_block(128, 256, 2, down2)
        self.res3_2 = basic_res_block(256, 256)
        self.res3_3 = basic_res_block(256, 256)
        # 128x8x32
        down3 = nn.Sequential(nn.Conv2d(256, 512, kernel_size=1, stride=(2, 1), bias=False), nn.BatchNorm2d(512))
        self.res4_1 = basic_res_block(256, 512, (2, 1), down3)
        self.res4_2 = basic_res_block(512, 512)
        self.res4_3 = basic_res_block(512, 512)
        # 256x4x16
        self.pool = nn.AvgPool2d((2, 2), (2, 1), (0, 1))
        # 512x2x16
        self.conv5 = nn.Conv2d(512, 512, 2, 1, 0)
        self.bn5 = nn.BatchNorm2d(512)
        self.relu5 = nn.ReLU(True)
        # 512x1x16

        self.rnn = nn.Sequential(
            BidirectionalLSTM(512, nh, nh),
            BidirectionalLSTM(nh, nh, nclass))

    def forward(self, input):
        # conv features
        x = self.res1(self.relu1(self.conv1(input)))
        x = self.res2_2(self.res2_1(x))
        x = self.res3_3(self.res3_2(self.res3_1(x)))
        x = self.res4_3(self.res4_2(self.res4_1(x)))
        x = self.pool(x)
        conv = self.relu5(self.bn5(self.conv5(x)))
        # print(conv.size())
        b, c, h, w = conv.size()
        assert h == 1, "the height of conv must be 1"
        conv = conv.squeeze(2)
        conv = conv.permute(2, 0, 1)  # [w, b, c]

        # rnn features
        output = self.rnn(conv)

        return output
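A quick shape check of the base model, reusing the imports above (illustrative values: 37 classes including the blank, hidden size 256, a batch of four grayscale 32x128 crops):

model = CRNN(imgH=32, nc=1, nclass=37, nh=256)
x = torch.randn(4, 1, 32, 128)
out = model(x)
print(out.size())  # [w, b, nclass]; for this input, torch.Size([33, 4, 37])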