
PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

🌈 Abstract

The paper proposes a Position Forest Transformer (PosFormer) for Handwritten Mathematical Expression Recognition (HMER). PosFormer models a mathematical expression as a position forest structure and explicitly parses the relative position relationships between symbols to enable position-aware symbol-level feature representation learning. It also introduces an Implicit Attention Correction module to enhance attention precision in the sequence-based decoder. PosFormer achieves state-of-the-art performance on single-line and multi-line HMER benchmarks and shows significant gains in recognizing complex mathematical expressions.

🙋 Q&A

[01] Position Forest Transformer (PosFormer)

1. What is the key idea behind PosFormer?

  • PosFormer models the mathematical expression as a position forest structure to explicitly capture the relative position relationships between symbols.
  • This position forest coding enables parsing of the nested levels and relative positions of each symbol, which assists in position-aware symbol-level feature representation learning.

2. How does PosFormer's position forest coding work?

  • The LaTeX mathematical expression sequence is encoded into a position forest structure based on the syntax rules.
  • Each substructure (e.g., superscript-subscript, fraction) is encoded into a tree with the main body as the root node, upper part as the left node, and lower part as the right node.
  • These encoded trees are then arranged in series or nested to form the final position forest structure.
  • Each symbol is assigned a position identifier in the forest that denotes its relative spatial position (a minimal coding sketch follows this list).
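
The following is a minimal sketch, not the authors' released code, of how such position identifiers could be derived for a tiny LaTeX subset containing only `^{...}` and `_{...}` groups. The single-character tokenizer, the function names, and the identifier strings built from "M" (main body), "L" (upper child), and "R" (lower child) are simplifications assumed here for illustration.

```python
# Minimal position-forest coding sketch for a tiny LaTeX subset.
# Each symbol gets an identifier built from M / L / R; its nested level
# is the number of L/R steps, i.e. len(identifier) - 1.

def tokenize(latex):
    """Split a tiny LaTeX string into single-character tokens."""
    return [ch for ch in latex if not ch.isspace()]

def encode_positions(tokens, prefix="M"):
    """Return a list of (symbol, identifier) pairs for one token stream."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in "^_" and i + 1 < len(tokens) and tokens[i + 1] == "{":
            # Find the matching closing brace of this script group.
            depth, j = 1, i + 2
            while j < len(tokens) and depth:
                depth += {"{": 1, "}": -1}.get(tokens[j], 0)
                j += 1
            child = prefix + ("L" if tok == "^" else "R")  # upper -> L, lower -> R
            out += encode_positions(tokens[i + 2 : j - 1], child)
            i = j
        else:
            out.append((tok, prefix))
            i += 1
    return out

if __name__ == "__main__":
    for sym, ident in encode_positions(tokenize("x^{2_{k}}+y")):
        print(f"{sym!r:5} id={ident:4} nested_level={len(ident) - 1}")
```

For `x^{2_{k}}+y`, the sketch assigns `x`, `+`, and `y` the identifier `M`, `2` the identifier `ML`, and `k` the identifier `MLR`, so the nested levels are 0, 1, and 2 respectively.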

3. What are the two sub-tasks in the position recognition component of PosFormer?

  • Nested level prediction: Predicting the number of nested levels the symbol resides in.
  • Relative position prediction: Predicting the relative position of the symbol (e.g., "M", "L", "R") within the nested substructure. A sketch of both prediction heads follows this list.
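
Below is a hedged PyTorch sketch of how these two auxiliary sub-tasks could be attached to the decoder as classification heads. The feature dimension, class counts, and loss handling are illustrative placeholders rather than values from the paper.

```python
import torch
import torch.nn as nn

class PositionHeads(nn.Module):
    """Two auxiliary heads over symbol-level decoder features."""
    def __init__(self, d_model=256, max_nested_levels=8, num_relative_positions=6):
        super().__init__()
        self.nested_head = nn.Linear(d_model, max_nested_levels)        # sub-task 1
        self.relative_head = nn.Linear(d_model, num_relative_positions)  # sub-task 2

    def forward(self, decoder_feats):
        # decoder_feats: [B, T, d_model] symbol-level features from the decoder
        return self.nested_head(decoder_feats), self.relative_head(decoder_feats)

def position_loss(nested_logits, rel_logits, nested_tgt, rel_tgt, ignore_index=-1):
    """Cross-entropy over both sub-tasks; padded positions are ignored."""
    ce = nn.CrossEntropyLoss(ignore_index=ignore_index)
    return (ce(nested_logits.flatten(0, 1), nested_tgt.flatten())
            + ce(rel_logits.flatten(0, 1), rel_tgt.flatten()))
```

In training, such an auxiliary loss would typically be added to the symbol-recognition loss with a weighting factor; the exact weighting used by the authors is not reproduced here.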

4. How does the Implicit Attention Correction (IAC) module work?

  • IAC introduces zero attention as the refinement term when decoding entity symbols, instead of using the accumulated attention from previous structure symbols.
  • This helps address the coverage problem, where the model allocates more attention to unimportant regions when decoding structure symbols.
  • The refined attention weights are then used to extract fine-grained feature representations for recognition (see the sketch below).
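
The sketch below illustrates the correction idea in the spirit of coverage-style attention refinement: attention accumulated at steps that decoded structure symbols (e.g., `^`, `_`, `{`, `}`) contributes zero to the refinement term used for later entity symbols. The shapes, function names, and the use of a plain sum as the refinement are assumptions for illustration; a real module would pass the accumulated attention through a learned transformation.

```python
import torch

def refinement_term(past_attn, past_is_structure):
    """
    past_attn:         [B, t, H*W]  attention maps from the t previous decoding steps
    past_is_structure: [B, t]       True where that step emitted a structure symbol
    Structure-symbol steps are zeroed out before accumulation, so they do not
    distort the coverage-style refinement for subsequent entity symbols.
    """
    masked = past_attn.masked_fill(past_is_structure.unsqueeze(-1), 0.0)
    return masked.sum(dim=1)  # [B, H*W]

def corrected_attention(curr_logits, past_attn, past_is_structure):
    # curr_logits: [B, H*W] raw cross-attention logits at the current step
    refine = refinement_term(past_attn, past_is_structure)
    return torch.softmax(curr_logits - refine, dim=-1)
```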

[02] Experimental Results

1. How does PosFormer perform on the single-line CROHME datasets?

  • Without scale augmentation, PosFormer surpasses previous SOTA methods by 3.19%, 4.79%, and 5.88% on the CROHME 2014/2016/2019 test sets, respectively.
  • With scale augmentation, PosFormer further improves the ExpRate metric by 2.03%, 1.22%, and 2.00% on the same datasets.

2. How does PosFormer perform on the multi-line M2E dataset?

  • PosFormer achieves the highest performance on the M2E dataset, with an ExpRate of 58.33% and a CER of 0.0366.
  • This represents improvements of 2.13% and 1.83% over the CoMER method and the recent LAST method, respectively.

3. How does PosFormer perform on the complex MNE dataset?

  • The MNE dataset consists of three subsets (N1, N2, N3) with nested levels of 1, 2, and 3, respectively.
  • PosFormer exhibits performance gains of 0.86%, 1.65%, and 10.04% on the three subsets, respectively, demonstrating its effectiveness in recognizing complex mathematical expressions.

4. How does PosFormer compare to tree-based methods?

  • Tree-based methods model the mathematical expression as a syntax tree and predict the entire tree structure, while PosFormer models it as a position forest and parses the nested levels and relative positions of symbols.
  • PosFormer can be easily integrated into sequence-based methods to enhance structural relationship perception, without the need for an extra tree-based decoder that may conflict with the sequence-based objective.