PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer
Abstract
The paper proposes the Position Forest Transformer (PosFormer) for Handwritten Mathematical Expression Recognition (HMER). PosFormer models a mathematical expression as a position forest structure and explicitly parses the relative position relationships between symbols, enabling position-aware symbol-level feature representation learning. It also introduces an Implicit Attention Correction module to improve attention precision in the sequence-based decoder. PosFormer achieves state-of-the-art performance on single-line and multi-line HMER benchmarks and shows significant gains in recognizing complex mathematical expressions.
Q&A
[01] Position Forest Transformer (PosFormer)
1. What is the key idea behind PosFormer?
- PosFormer models the mathematical expression as a position forest structure to explicitly capture the relative position relationships between symbols.
- This position forest coding enables parsing of the nested levels and relative positions of each symbol, which assists in position-aware symbol-level feature representation learning.
2. How does PosFormer's position forest coding work?
- The LaTeX mathematical expression sequence is encoded into a position forest structure based on the syntax rules.
- Each substructure (e.g., superscript-subscript, fraction) is encoded into a tree, with the main body as the root node, the upper part as the left node, and the lower part as the right node.
- These encoded trees are then arranged in series or nested to form the final position forest structure.
- Each symbol is assigned a position identifier in the forest to denote its relative spatial position (a small encoding sketch follows this list).
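A minimal sketch of this encoding, assuming a tokenized LaTeX input and handling only superscript/subscript groups; the function name, token handling, and identifier format are illustrative assumptions, not the authors' implementation:

```python
def encode_position_forest(tokens):
    """Assign each entity symbol a (nested level, relative-position path) identifier.

    "M" marks the main body, "L" the upper part (left node), and "R" the lower
    part (right node), following the tree encoding described above.
    """
    path = []        # relative-position path of the currently open groups
    pending = []     # relation that the next "{ ... }" group will open
    encoded = []
    for tok in tokens:
        if tok == "^":             # upper part -> left node in the tree
            pending.append("L")
        elif tok == "_":           # lower part -> right node in the tree
            pending.append("R")
        elif tok == "{":
            path.append(pending.pop() if pending else "M")
        elif tok == "}":
            path.pop()
        else:                      # entity symbol: record nested level and path
            encoded.append((tok, len(path), "M" + "".join(path)))
    return encoded

# Tokens of "x^{2}" -> [('x', 0, 'M'), ('2', 1, 'ML')]
print(encode_position_forest(["x", "^", "{", "2", "}"]))
```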
3. What are the two sub-tasks in the position recognition component of PosFormer?
- Nested level prediction: Predicting the number of nested levels the symbol resides in.
- Relative position prediction: Predicting the relative position of the symbol (e.g., "M", "L", "R") within the nested substructure (a sketch of both heads follows this list).
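A hedged sketch of how two such auxiliary heads could sit on top of the decoder's position-aware symbol features; the hidden size, class counts, and module names here are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class PositionRecognitionHeads(nn.Module):
    """Two auxiliary classification heads over decoder symbol features."""
    def __init__(self, d_model=256, max_nested_levels=8, num_rel_positions=3):
        super().__init__()
        # predicts how many nested levels the symbol resides in
        self.level_head = nn.Linear(d_model, max_nested_levels)
        # predicts the relative position within the substructure ("M", "L", "R")
        self.rel_head = nn.Linear(d_model, num_rel_positions)

    def forward(self, symbol_feats):
        # symbol_feats: (batch, seq_len, d_model) position-aware decoder outputs
        return self.level_head(symbol_feats), self.rel_head(symbol_feats)

# Both heads would be trained jointly with the symbol classifier, e.g. with
# cross-entropy losses on labels derived from the position forest.
heads = PositionRecognitionHeads()
level_logits, rel_logits = heads(torch.randn(2, 10, 256))
```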
4. How does the Implicit Attention Correction (IAC) module work?
- When refining attention for entity symbols, IAC substitutes zero attention for the attention accumulated while decoding structure symbols, rather than carrying that accumulated attention into the refinement term.
- This helps address the coverage problem: when decoding structure symbols the model allocates attention to unimportant regions, and accumulating that attention would otherwise distort the refinement for subsequent entity symbols.
- The refined attention weights are then used to extract fine-grained feature representations for recognition (a sketch of the correction follows this list).
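One reading of this correction, sketched below: attention steps that decoded structure symbols contribute zero attention to the coverage-style refinement term, so their misallocated attention does not suppress regions needed by later entity symbols. The tensor shapes, the structure-symbol set, and the function itself are illustrative assumptions:

```python
import torch

# Assumed set of structure symbols with no explicit visual region
STRUCTURE_SYMBOLS = {"^", "_", "{", "}"}

def refined_coverage(past_attn, past_tokens):
    """past_attn: (t, H*W) attention maps from the t previous decoding steps;
    past_tokens: the t previously decoded tokens.
    Returns the accumulated attention used as the refinement term."""
    keep = torch.tensor(
        [0.0 if tok in STRUCTURE_SYMBOLS else 1.0 for tok in past_tokens]
    ).unsqueeze(-1)                        # (t, 1): zero out structure-symbol steps
    # structure-symbol steps contribute zero attention to the accumulation
    return (past_attn * keep).sum(dim=0)   # (H*W,)

# Example: two past steps, one of them the structure symbol "^"
attn = torch.rand(2, 16)
print(refined_coverage(attn, ["x", "^"]).shape)  # torch.Size([16])
```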
[02] Experimental Results
1. How does PosFormer perform on the single-line CROHME datasets?
- Without scale augmentation, PosFormer surpasses previous SOTA methods by 3.19%, 4.79% and 5.88% on the CROHME 2014/2016/2019 test sets, respectively.
- With scale augmentation, PosFormer further improves the ExpRate metric by 2.03%, 1.22%, and 2.00% on the same datasets.
2. How does PosFormer perform on the multi-line M2E dataset?
- PosFormer achieves the highest performance on the M2E dataset, with an ExpRate of 58.33% and a CER of 0.0366.
- This represents a 2.13% and 1.83% improvement over the CoMER method and the latest LAST method, respectively.
3. How does PosFormer perform on the complex MNE dataset?
- The MNE dataset consists of three subsets (N1, N2, N3) with nested levels of 1, 2, and 3, respectively.
- PosFormer exhibits performance gains of 0.86%, 1.65%, and 10.04% on the three subsets, respectively, demonstrating its effectiveness in recognizing complex mathematical expressions.
4. How does PosFormer compare to tree-based methods?
- Tree-based methods model the mathematical expression as a syntax tree and predict the entire tree structure, while PosFormer models it as a position forest and parses the nested levels and relative positions of symbols.
- PosFormer can be easily integrated into sequence-based methods to enhance structural relationship perception, without the need for an extra tree-based decoder that may conflict with the sequence-based objective.