PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer
Abstract
The paper proposes the Position Forest Transformer (PosFormer) for Handwritten Mathematical Expression Recognition (HMER). PosFormer models a mathematical expression as a position forest structure and explicitly parses the relative position relationships between symbols, enabling position-aware symbol-level feature representation learning. It also introduces an Implicit Attention Correction module to improve attention precision in the sequence-based decoder. PosFormer achieves state-of-the-art performance on single-line and multi-line HMER benchmarks and shows significant gains in recognizing complex mathematical expressions.
Q&A
[01] Position Forest Transformer (PosFormer)
1. What is the key idea behind PosFormer?
- PosFormer models the mathematical expression as a position forest structure to explicitly capture the relative position relationships between symbols.
- This position forest coding enables parsing of the nested levels and relative positions of each symbol, which assists in position-aware symbol-level feature representation learning.
2. How does PosFormer's position forest coding work?
- The LaTeX mathematical expression sequence is encoded into a position forest structure based on the syntax rules.
- Each substructure (e.g., superscript-subscript, fraction) is encoded into a tree, with the main body as the root node, the upper part as the left node, and the lower part as the right node.
- These encoded trees are then arranged in series or nested to form the final position forest structure.
- Each symbol is assigned a position identifier in the forest to denote its relative spatial position (a small encoding sketch follows this list).
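A minimal sketch of this encoding, assuming a tokenized LaTeX input and handling only superscript/subscript groups; the function name, token handling, and identifier format are illustrative assumptions, not the authors' implementation:

```python
def encode_position_forest(tokens):
    """Assign each entity symbol a (nested level, relative-position path) identifier.

    "M" marks the main body, "L" the upper part (left node), and "R" the lower
    part (right node), following the tree encoding described above.
    """
    path = []        # relative-position path of the currently open groups
    pending = []     # relation that the next "{ ... }" group will open
    encoded = []
    for tok in tokens:
        if tok == "^":             # upper part -> left node in the tree
            pending.append("L")
        elif tok == "_":           # lower part -> right node in the tree
            pending.append("R")
        elif tok == "{":
            path.append(pending.pop() if pending else "M")
        elif tok == "}":
            path.pop()
        else:                      # entity symbol: record nested level and path
            encoded.append((tok, len(path), "M" + "".join(path)))
    return encoded

# Tokens of "x^{2}" -> [('x', 0, 'M'), ('2', 1, 'ML')]
print(encode_position_forest(["x", "^", "{", "2", "}"]))
```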
3. What are the two sub-tasks in the position recognition component of PosFormer?
- Nested level prediction: Predicting the number of nested levels the symbol resides in.
- Relative position prediction: Predicting the relative position of the symbol (e.g., "M", "L", "R") within the nested substructure (a sketch of both heads follows this list).
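A hedged sketch of how two such auxiliary heads could sit on top of the decoder's position-aware symbol features; the hidden size, class counts, and module names here are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class PositionRecognitionHeads(nn.Module):
    """Two auxiliary classification heads over decoder symbol features."""
    def __init__(self, d_model=256, max_nested_levels=8, num_rel_positions=3):
        super().__init__()
        # predicts how many nested levels the symbol resides in
        self.level_head = nn.Linear(d_model, max_nested_levels)
        # predicts the relative position within the substructure ("M", "L", "R")
        self.rel_head = nn.Linear(d_model, num_rel_positions)

    def forward(self, symbol_feats):
        # symbol_feats: (batch, seq_len, d_model) position-aware decoder outputs
        return self.level_head(symbol_feats), self.rel_head(symbol_feats)

# Both heads would be trained jointly with the symbol classifier, e.g. with
# cross-entropy losses on labels derived from the position forest.
heads = PositionRecognitionHeads()
level_logits, rel_logits = heads(torch.randn(2, 10, 256))
```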
4. How does the Implicit Attention Correction (IAC) module work?
- When refining attention for entity symbols, IAC substitutes zero attention for the attention accumulated while decoding structure symbols, rather than carrying that accumulated attention into the refinement term.
- This helps address the coverage problem: when decoding structure symbols the model allocates attention to unimportant regions, and accumulating that attention would otherwise distort the refinement for subsequent entity symbols.
- The refined attention weights are then used to extract fine-grained feature representations for recognition (a sketch of the correction follows this list).
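One reading of this correction, sketched below: attention steps that decoded structure symbols contribute zero attention to the coverage-style refinement term, so their misallocated attention does not suppress regions needed by later entity symbols. The tensor shapes, the structure-symbol set, and the function itself are illustrative assumptions:

```python
import torch

# Assumed set of structure symbols with no explicit visual region
STRUCTURE_SYMBOLS = {"^", "_", "{", "}"}

def refined_coverage(past_attn, past_tokens):
    """past_attn: (t, H*W) attention maps from the t previous decoding steps;
    past_tokens: the t previously decoded tokens.
    Returns the accumulated attention used as the refinement term."""
    keep = torch.tensor(
        [0.0 if tok in STRUCTURE_SYMBOLS else 1.0 for tok in past_tokens]
    ).unsqueeze(-1)                        # (t, 1): zero out structure-symbol steps
    # structure-symbol steps contribute zero attention to the accumulation
    return (past_attn * keep).sum(dim=0)   # (H*W,)

# Example: two past steps, one of them the structure symbol "^"
attn = torch.rand(2, 16)
print(refined_coverage(attn, ["x", "^"]).shape)  # torch.Size([16])
```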
[02] Experimental Results
1. How does PosFormer perform on the single-line CROHME datasets?
- Without scale augmentation, PosFormer surpasses previous SOTA methods by 3.19%, 4.79% and 5.88% on the CROHME 2014/2016/2019 test sets, respectively.
- With scale augmentation, PosFormer further improves the ExpRate metric by 2.03%, 1.22%, and 2.00% on the same datasets.
2. How does PosFormer perform on the multi-line M2E dataset?
- PosFormer achieves the highest performance on the M2E dataset, with an ExpRate of 58.33% and a CER of 0.0366.
- This represents a 2.13% and 1.83% improvement over the CoMER method and the latest LAST method, respectively.
3. How does PosFormer perform on the complex MNE dataset?
- The MNE dataset consists of three subsets (N1, N2, N3) with nested levels of 1, 2, and 3, respectively.
- PosFormer exhibits performance gains of 0.86%, 1.65%, and 10.04% on the three subsets, respectively, demonstrating its effectiveness in recognizing complex mathematical expressions.
4. How does PosFormer compare to tree-based methods?
- Tree-based methods model the mathematical expression as a syntax tree and predict the entire tree structure, while PosFormer models it as a position forest and parses the nested levels and relative positions of symbols.
- PosFormer can be easily integrated into sequence-based methods to enhance structural relationship perception, without the need for an extra tree-based decoder that may conflict with the sequence-based objective.