Overview

Selective state-space models have achieved great success in long-sequence modeling. However, their capacity for language representation, especially in complex hierarchical reasoning tasks, remains underexplored. Most large language models rely on flat Euclidean embeddings, which limits their ability to capture latent hierarchies. To address this limitation, we propose Hierarchical Mamba (HiM), which integrates the efficient Mamba2 architecture with the exponential growth and curved nature of hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected onto the Poincaré ball (via a tangent-based mapping) or the Lorentzian manifold (via a cosine- and sine-based mapping) with learnable curvature, and optimized with a combined hyperbolic loss. HiM captures relational distances across hierarchical levels, enabling effective long-range reasoning; this makes it well suited for tasks such as mixed-hop prediction and multi-hop inference in hierarchical classification. We evaluated HiM on four linguistic and medical datasets for mixed-hop prediction and multi-hop inference tasks. Experimental results demonstrate that: 1) both HiM variants effectively capture hierarchical relationships across the four ontological datasets, surpassing Euclidean baselines; and 2) HiM-Poincaré captures fine-grained semantic distinctions with higher h-norms, while HiM-Lorentz provides more stable, compact, and hierarchy-preserving embeddings, favoring robustness over detail.
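As an illustration of the tangent-based projection described above, the following is a minimal sketch (not the authors' code) of the standard exponential map at the origin onto the Poincaré ball with a learnable curvature; the class name, variable names, and the use of plain PyTorch are assumptions.

```python
import torch
import torch.nn as nn


class PoincareProjection(nn.Module):
    def __init__(self, init_curvature=1.0):
        super().__init__()
        # Learnable positive curvature magnitude c, so the ball has curvature -c
        self.log_c = nn.Parameter(torch.tensor(float(init_curvature)).log())

    def forward(self, v):
        # v: Euclidean (tangent-space) embeddings from the sequence model, shape (..., d)
        c = self.log_c.exp()
        norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        # exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||)
        return torch.tanh(c.sqrt() * norm) * v / (c.sqrt() * norm)
```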
Keywords: language embedding, mamba, hyperbolic geometry, long-term dependencies, hierarchical reasoning

Paper link | GitHub link | Citation

Methodology

ST-TransformerG2G Model

The proposed ST-TransformerG2G model enhances the previous TransformerG2G model with Graph Convolutional Networks (GCNs). Key features include (a minimal sketch follows the list):

  • An additional GCN block with three GCN layers to capture spatial interactions for each graph snapshot.
  • Node embeddings generated by the GCN block are fed into a vanilla transformer encoder module.
  • Positional encoding is added to each input node token embedding.
  • The final output representation of each node is a multivariate normal distribution.
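The sketch below illustrates how these pieces fit together, assuming PyTorch and PyTorch Geometric; the class name, layer sizes, and the ELU-based variance head are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv


class STTransformerG2G(nn.Module):
    def __init__(self, in_dim, hid_dim, emb_dim, lookback, nhead=4, num_layers=2):
        super().__init__()
        # GCN block: three layers capturing spatial structure of each snapshot
        self.gcn1 = GCNConv(in_dim, hid_dim)
        self.gcn2 = GCNConv(hid_dim, hid_dim)
        self.gcn3 = GCNConv(hid_dim, hid_dim)
        # Learned positional encoding over the look-back window
        self.pos = nn.Parameter(torch.zeros(lookback, hid_dim))
        enc_layer = nn.TransformerEncoderLayer(hid_dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Heads for the multivariate normal output (mean and variance)
        self.mu_head = nn.Linear(hid_dim, emb_dim)
        self.sigma_head = nn.Sequential(nn.Linear(hid_dim, emb_dim), nn.ELU())

    def forward(self, snapshots):
        # snapshots: list of (x, edge_index) pairs, one per timestamp in the window;
        # a fixed node set across snapshots is assumed
        tokens = []
        for x, edge_index in snapshots:
            h = self.gcn1(x, edge_index).relu()
            h = self.gcn2(h, edge_index).relu()
            h = self.gcn3(h, edge_index).relu()
            tokens.append(h)                       # (num_nodes, hid_dim)
        seq = torch.stack(tokens, dim=1)           # (num_nodes, lookback, hid_dim)
        seq = seq + self.pos[: seq.size(1)]        # add positional encoding per token
        out = self.encoder(seq)[:, -1]             # last token's representation per node
        mu = self.mu_head(out)
        sigma = self.sigma_head(out) + 1 + 1e-14   # keep the variance strictly positive
        return mu, sigma
```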

DG-Mamba Model

The main architecture of the proposed DG-Mamba model processes a sequence of discrete-time graph snapshots {G_t}_{t=1}^{T}. Key features include (see the sketch after this list):

  • A look-back parameter l ∈ {1, 2, 3, 4, 5} for historical context integration.
  • Each graph snapshot undergoes projection and convolution to capture localized node features while maintaining spatial relationships.
  • A State-Space Model (SSM) layer efficiently captures long-range temporal dependencies using the selective scan mechanism (Mamba architecture).
  • The output of the Mamba layer is passed through an activation function for non-linearity, followed by mean pooling to generate an aggregate representation.
  • A linear projection layer with tanh activation refines the node embeddings.
  • Two additional projection heads (one linear projection layer and one nonlinear projection mapping with ELU activation function) are used to obtain the mean and variance of Gaussian embeddings.
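The following is a minimal sketch of a DG-Mamba-style block, assuming PyTorch and the `mamba_ssm` package's `Mamba` layer (which requires a CUDA build); the class name, layer sizes, and the SiLU activation after the Mamba layer are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba


class DGMambaBlock(nn.Module):
    def __init__(self, in_dim, hid_dim, emb_dim):
        super().__init__()
        # Per-snapshot projection + 1D convolution over the look-back window
        self.proj = nn.Linear(in_dim, hid_dim)
        self.conv = nn.Conv1d(hid_dim, hid_dim, kernel_size=3, padding=1)
        # Selective SSM (Mamba) layer for long-range temporal dependencies
        self.mamba = Mamba(d_model=hid_dim, d_state=16, d_conv=4, expand=2)
        self.refine = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.Tanh())
        # Projection heads for the Gaussian embedding (mean and variance)
        self.mu_head = nn.Linear(hid_dim, emb_dim)
        self.sigma_head = nn.Sequential(nn.Linear(hid_dim, emb_dim), nn.ELU())

    def forward(self, x):
        # x: (num_nodes, lookback, in_dim) node features over l historical snapshots
        h = self.proj(x)                                   # (N, l, hid_dim)
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)   # localized feature mixing
        h = torch.nn.functional.silu(self.mamba(h))        # SSM + non-linearity
        h = h.mean(dim=1)                                  # mean pooling over time
        h = self.refine(h)                                 # linear + tanh refinement
        mu = self.mu_head(h)
        sigma = self.sigma_head(h) + 1 + 1e-14             # keep the variance positive
        return mu, sigma
```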

GDG-Mamba Model

The main architecture of the GDG-Mamba model processes a series of discrete-time graph snapshots {G_t}_{t=1}^{T} using GINE convolution. Key features include (see the sketch after this list):

  • Enhanced spatial representation of the graph by considering both node and edge-level features at each timestamp.
  • Generated graph sequence representations are processed through the Mamba block to capture temporal dynamics.
  • Mean pooling and a linear layer with tanh nonlinearity are applied before outputting the final Gaussian embeddings.
  • The model incorporates node and edge-level features across both spatial and temporal dimensions.
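The following is a minimal sketch of a GDG-Mamba-style block, assuming PyTorch, PyTorch Geometric's GINEConv (which folds edge features into the spatial aggregation), and the `mamba_ssm` package; the class name, dimensions, and a fixed node set across snapshots are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINEConv
from mamba_ssm import Mamba


class GDGMambaBlock(nn.Module):
    def __init__(self, in_dim, edge_dim, hid_dim, emb_dim):
        super().__init__()
        mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                            nn.Linear(hid_dim, hid_dim))
        # GINE convolution incorporates edge features into node aggregation
        self.gine = GINEConv(mlp, edge_dim=edge_dim)
        # Mamba block over the temporal sequence of per-node representations
        self.mamba = Mamba(d_model=hid_dim, d_state=16, d_conv=4, expand=2)
        self.refine = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.Tanh())
        self.mu_head = nn.Linear(hid_dim, emb_dim)
        self.sigma_head = nn.Sequential(nn.Linear(hid_dim, emb_dim), nn.ELU())

    def forward(self, snapshots):
        # snapshots: list of (x, edge_index, edge_attr) triples over the look-back window
        tokens = [self.gine(x, ei, ea) for x, ei, ea in snapshots]
        seq = torch.stack(tokens, dim=1)           # (num_nodes, lookback, hid_dim)
        h = self.mamba(seq).mean(dim=1)            # temporal Mamba + mean pooling
        h = self.refine(h)                         # linear + tanh before the output heads
        mu = self.mu_head(h)
        sigma = self.sigma_head(h) + 1 + 1e-14     # keep the variance positive
        return mu, sigma
```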

Results

We used five different graph benchmarks to validate and compare our proposed models. A summary of the dataset statistics, including the number of nodes, edges, and timesteps, the train/validation/test split (in timesteps), and the embedding size, is provided in Table 1.

Figure 4. Comparison of MAP and MRR for the temporal link prediction task across the DynG2G, TransformerG2G, ST-TransformerG2G (i.e., TransformerG2G + GCN), DG-Mamba, and GDG-Mamba (i.e., DG-Mamba + GINEConv) models.
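For reference, the sketch below shows one common way to compute the MRR and MAP metrics reported in Figure 4, assuming per-node candidate scores and binary labels (1 = true edge); the function names and scoring convention are assumptions, not the authors' exact evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score


def mrr(scores, labels):
    """Mean Reciprocal Rank over nodes: 1 / rank of the first true edge."""
    rr = []
    for s, y in zip(scores, labels):
        order = np.argsort(-s)                    # rank candidates by descending score
        hits = np.where(y[order] == 1)[0]
        if hits.size > 0:
            rr.append(1.0 / (hits[0] + 1))        # 1-based rank of the first hit
    return float(np.mean(rr))


def mean_average_precision(scores, labels):
    """MAP over nodes: mean of the per-node average precision."""
    return float(np.mean([average_precision_score(y, s)
                          for s, y in zip(scores, labels) if y.sum() > 0]))
```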

Acknowledgements

We would like to acknowledge support from the DOE SEA-CROGS project (DE-SC0023191) and the AFOSR project (FA9550-24-1-0231).

Citation

@article{parmanand2024comparative,
  title={A Comparative Study on Dynamic Graph Embedding based on Mamba and Transformers},
  author={Parmanand Pandey, Ashish and Varghese, Alan John and Patil, Sarang and Xu, Mengjia},
  journal={arXiv e-prints},
  pages={arXiv--2412},
  year={2024}
}