![[ECCV 2022] OCR-free Document Understanding Transformer (open-sourced)](imgpxy.png)
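Here, n is the size of the output feature map and d is the hidden feature dimension. This module can be a convolutional neural network or a Transformer-based vision model. After comparing the options experimentally, the authors chose Swin Transformer [9] as the backbone.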
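To make n and d concrete, here is a minimal sketch of how the output shape of a Swin-style encoder follows from the input resolution. It assumes the standard Swin layout (a 4×4 patch partition, then a 2× patch merging before each later stage). The function name `swin_output_shape`, the 2560×1920 input size, and the base width of 128 are illustrative assumptions, not values taken from the text above.

```python
def swin_output_shape(height, width, patch_size=4, num_stages=4, embed_dim=128):
    """Return (n, d) for a Swin-style hierarchical encoder.

    Assumption: standard Swin layout. The patch partition downsamples by
    `patch_size`; each of the later stages starts with a patch merging
    that halves the resolution and doubles the channel width.
    """
    merges = num_stages - 1                  # patch-merging steps after stage 1
    stride = patch_size * 2 ** merges        # total downsampling factor
    if height % stride or width % stride:
        raise ValueError(f"input size must be a multiple of {stride}")
    n = (height // stride) * (width // stride)  # number of output feature vectors
    d = embed_dim * 2 ** merges                 # width of each feature vector
    return n, d


# Hypothetical input resolution and base width, for illustration only.
n, d = swin_output_shape(2560, 1920)
print(n, d)  # 4800 1024
```

Under these assumptions the encoder turns one page image into a sequence of 4800 vectors of dimension 1024, which is what the decoder then attends to.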
1. Initialization weights for the BART part: https://huggingface.co/hyunwoongko/asian-bart-ecjk
2. The authors' explanation of the metric differences for LayoutLM and other models on CORD: Performance gap of baseline methods · Issue #42 · clovaai/donut (github.com)
3. Paper: [2111.15664] OCR-free Document Understanding Transformer (arxiv.org)
[1] Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: Bros: A pre-trained language model focusing on text and layout for better key information extraction from documents. Proceedings of the AAAI Conference on Artificial Intelligence 36(10), 10767–10775 (Jun 2022).
[2] Hwang, W., Kim, S., Yim, J., Seo, M., Park, S., Park, S., Lee, J., Lee, B., Lee, H.: Post-ocr parsing: building simple and robust parser via bio tagging. In: Workshop on Document Intelligence at NeurIPS 2019 (2019).
[3] Hwang, W., Yim, J., Park, S., Yang, S., Seo, M.: Spatial dependency parsing for semi-structured document information extraction. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 330–343. Association for Computational Linguistics, Online (Aug 2021).
[4] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 1192–1200. KDD ’20, Association for Computing Machinery, New York, NY, USA (2020).
[5] Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W., Zhang, M., Zhou, L.: LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 2579–2591. Association for Computational Linguistics, Online (Aug 2021).
[6] Duong, Q., Hämäläinen, M., Hengchen, S.: An unsupervised method for OCR post-correction and spelling normalisation for Finnish. In: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). pp. 240–248. Linköping University Electronic Press, Sweden, Reykjavik, Iceland (Online) (May 31–2 Jun 2021).
[7] Rijhwani, S., Anastasopoulos, A., Neubig, G.: OCR Post Correction for Endangered Language Texts. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 5931–5942. Association for Computational Linguistics, Online (Nov 2020).
[8] Schaefer, R., Neudecker, C.: A two-step approach for automatic OCR post-correction. In: Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. pp. 52–57. International Committee on Computational Linguistics, Online (Dec 2020).
[9] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10012–10022 (October 2021).
[10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009).
[11] Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR). pp. 991–995 (2015).
[12] Park, S., Shin, S., Lee, B., Lee, J., Surh, J., Seo, M., Lee, H.: Cord: A consolidated receipt dataset for post-ocr parsing. In: Workshop on Document Intelligence at NeurIPS 2019 (2019).
[13] Guo, H., Qin, X., Liu, J., Han, J., Liu, J., Ding, E.: Eaten: Entity-aware attention for single shot visual text extraction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 254–259 (2019).
[14] Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2200–2209 (2021).
Edited by: 高学
Reviewed by: 殷飞