Appendix for Linear Dynamics-embedded Neural Network for Long-Sequence Modeling (2024)

Tongyi Liang, Han-XiongLi
Department of Systems Engineering, City University of Hong Kong
tyliang4-c@my.cityu.edu.hk, mehxli@cityu.edu.hk
Corresponding authors

Abstract

This appendix provides all necessary materials for the paper ’Linear Dynamics-embedded Neural Network for Long-Sequence Modeling’, including model details, experimental configurations, and PyTorch implementation. ¹¹1The codes are available at https://github.com/leonty1/DeepLDNN

•
Appendix A: Notations.
•
Appendix B: Model Details.
•
Appendix B.1: Convolutional View of Continuous SSMs.
•
Appendix B.2: Numerical Discretization.
•
Appendix B.3: Parameterization and Initialization of LDNN.
•
Appendix B.4: HiPPO Initialization.
•
Appendix C: Comparison with Related Model.
•
Appendix C.1: Structure Comparison of SSMs.
•
Appendix C.1: Parameterization and Initialization of SSMs.
•
Appendix C.2: Relationship betweenLDNN, S4, and S5.
•
Appendix D: Supplementary Results.
•
Appendix E: Experimental Configurations for Reproducibility.
•
Appendix F: PyTorch Implementation of LDNN Layer.

Appendix A Notations

Notations	Descriptions
SSMs	State space models
$u(t)\in\mathbb{R}^{H}$	System input sequence
$x(t)\in\mathbb{R}^{N}$	System state
$y(t)\in\mathbb{R}^{M}$	System output sequence
$A\in\mathbb{R}^{N\times N}$	System matrix in continuous SSMs
$B\in\mathbb{R}^{N\times H}$	Input matrix in continuous SSMs
$C\in\mathbb{R}^{M\times N}$	Output matrix in continuous SSMs
$D\in\mathbb{R}^{M\times H}$	Direct transition matrix in continuous SSMs
$\bar{A}\in\mathbb{R}^{N\times N}$	System matrix in discrete SSMs
$\bar{B}\in\mathbb{R}^{N\times H}$	Input matrix in discrete SSMs
$\bar{C}\in\mathbb{R}^{M\times N}$	Output matrix in discrete SSMs
$\bar{D}\in\mathbb{R}^{M\times H}$	Direct transition matrix in discrete SSMs
$\Delta\in\mathbb{R_{+}}$	Discrete time step in discrete SSMs
$K$	State kernel in convolutional SSMs
$V$	System kernel in convolutional SSMs
$\Lambda$	Diagonal system matrix in diagonal SSMs
$FFT$	Fast Fourier Transform

Appendix B Model Details

B.1 Convolutional View of Continuous SSMs

Here, we introduce the convolutional view of continuous SSMs [1].

$\displaystyle x(t)$	$\displaystyle=e^{At}x(0)+\int_{0}^{t}e^{A(t-\tau)}Bu(\tau)d\tau$
	$\displaystyle=e^{At}x(0)+\int_{0}^{t}e^{At}Bu(t-\tau)d\tau$	(1)
	$\displaystyle=e^{At}x(0)+\int_{0}^{t}h(t)u(t-\tau)d\tau$	(2)
	$\displaystyle=e^{At}x(0)+(h\ast u)(t)$	(3)

Using the change of variables, we reformulate Eq. (1). Then, let $h(t)=e^{At}B$ , we obtain the convolutional SSMs (3) according to the definition of convolution.

B.2 Numerical Discretization

B.2.1 Zero-order Hold Method

The state transition function is an ordinary differential equation (ODE). We can obtain its analytical solution as follows.

	$\displaystyle\dot{x}(t)=Ax(t)+Bu(t)$
	$\displaystyle\dot{x}(t)-Ax(t)=Bu(t)$
	$\displaystyle e^{-tA}\dot{x}(t)-e^{-tA}Ax(t)=e^{-tA}Bu(t)$
	$\displaystyle\frac{1}{dt}\left[e^{-tA}x(t))\right]=e^{-tA}Bu(t)$
	$\displaystyle\int_{0}^{t}\frac{1}{d\tau}\left[e^{-\tau A}x(\tau)\right]d\tau=%\int_{0}^{t}e^{-\tau A}Bu(\tau)d\tau$
	$\displaystyle e^{-tA}x(t)-x(0)=\int_{0}^{t}e^{-\tau A}Bu(\tau)d\tau$
	$\displaystyle x(t)=e^{tA}x(0)+e^{tA}\int_{0}^{t}e^{-\tau A}Bu(\tau)d\tau$
	$\displaystyle x(t)=e^{At}x(0)+\int_{0}^{t}e^{A(t-\tau)}Bu(\tau)d\tau$		(4)

Eq. (4) is the analytical solution of $x(t)$ . Then, we rewrite Eq. (4) with initial time $t_{0}$ .

x(t)=e^{A(t-t_{0})}x(t_{0})+\int_{t_{0}}^{t}e^{A(t-\tau)}Bu(\tau)d\tau

(5)

When we sample the $u(t)$ with time interval $\Delta$ , $t$ becomes $k\Delta$ , where $k=0,1,...$ is a positive integer. The Zero-order Hold method assumes $u(t)=u(k\Delta)$ . For $t\in[k\Delta,(k+1)\Delta]$ . Thus, we have

x((k+1)\Delta)=e^{A(\Delta)}x(k\Delta)+\int_{k\Delta}^{(k+1)\Delta}e^{A((k+1)%\Delta-\tau)}d\tau Bu(k\Delta)

(6)

We abbreviate $x((k+1)\Delta)$ , $x(k\Delta)$ , and $u(k\Delta)$ as $x_{k+1}$ , $x_{k}$ , and $u_{k}$ , respectively. Here, we get the discrete transition function.

x_{k+1}=\bar{A}x_{k}+\bar{B}u_{k}

(7)

with $\bar{A}=e^{A\Delta}$ , $\bar{B}=\int_{k\Delta}^{(k+1)\Delta}e^{A((k+1)\Delta-\tau)}d\tau B$ .

We can further simplify $\bar{B}$ assuming that $A$ is invertible.

$\displaystyle\bar{B}$	$\displaystyle=\int_{k\Delta}^{(k+1)\Delta}e^{A((k+1)\Delta-\tau)}d\tau B$
	$\displaystyle=\int_{0}^{\Delta}e^{At}dtB$
	$\displaystyle=\int_{0}^{\Delta}A^{-1}\frac{de^{At}}{dt}dtB$
	$\displaystyle=A^{-1}(e^{At}-I)B$	(8)

B.2.2 Numerical Approximation

Based on Taylor series expansion, the first derivative of $x$ can be approximated by numerical differentiation. Using the forward Euler Eq. (9), or the backward Euler Eq. (10) to approximate ${x}$ , we obtain the $\bar{A}$ , and $\bar{B}$ as described in Eq. (11).

\frac{x_{k+1}-x_{k}}{\Delta}\approx\dot{x}_{k}=Ax_{t}+Bu_{t}

(9)

\frac{x_{k+1}-x_{k}}{\Delta}\approx\dot{x}=Ax_{t+1}+Bu_{t}

(10)

When we use the Generalized Bilinear Transformation method (GBT).

\begin{cases}\bar{A}=(I-\alpha A\Delta)^{-1}(I+(1-\alpha)\Delta A)\\\bar{B}=(I-\alpha\Delta A)^{-1}\Delta B\end{cases}

(11)

There are three special cases for the GBT with different $\alpha$ : the forward Euler method is GBT with $\alpha=0$ , the Bilinear method is GBT with $\alpha=0.5$ , and the backward Euler method is GBT with $\alpha=1$ . Those methods approximate the differential equation based on Taylor series expansion.

B.3 Parameterization and Initialization of LDNN

The diagonal SSMs have learnable parameters $\Lambda,B,C,D$ , and a time step $\Delta$ for discretization. We introduce the parameterization and initialization of these parameters, respectively.

Parameter $\Lambda$ . According to Proposition 1 in Section III. A, we know that all elements in $\Lambda$ must have negative real parts to ensure state convergence. Thus, we restrict $\Lambda$ with an enforcing function $f_{+}$ , expressed as $-f_{+}(Re(\Lambda))+Im(\Lambda)i$ . The enforcing function $f_{+}$ outputs positive real numbers and may have many forms, for example, the Gaussian function, the rectified linear unit function (ReLU), and the Sigmoid function. A random or constant function can initialize $\Lambda$ . Besides, it can be initialized by the eigenvalues of some specially structured matrices, such as the HiPPO matrix introduced in [2]. We initialize $\Lambda$ via HiPPO throughout this work.

Parameter $B$ and $C$ . $B$ and $C$ are the parameters of the linear projection function. We parameterize them as learnable full matrices. Furthermore, the initialization of $B$ is given as random numbers under HiPPO framework as introduced in section B.4. $C$ is initialized by truncated normal distribution.

Parameter $D$ . Different parameterizations of $D$ have different meanings. If we parameterize $D$ as an untrainable zero matrix, the output of SSMs is only dependent on the state. When the input and output are the same size, we can parameterize it as an identity matrix, also known as residual connection [3]. In this work, $D$ is parameterized as a trainable diagonal matrix, which is initialized by a constant 1 in this work.

Parameter $\Delta$ . $\Delta\in\mathbb{R}$ is a scalar for a given SSM. We set it as a learnable parameter and initialize it by randomly sampling from a bounded interval. This work uses [0.001, 0.1] as fault choice if not otherwise specified. We experimentally find that relaxing the size of $\Delta$ from $\mathbb{R}$ to $\mathbb{R}^{N}$ will improve model accuracy, which is also reported in S5 [4]. Therefore, $\Delta\in\mathbb{R}^{N}$ is used across all experiments.

B.4 HiPPO Initialization

HiPPO theory introduces a way to compress continuous signals and discrete-time series by projection onto polynomial bases [2]. The continuous SSMs, as a particular type of ordinary differential equation (ODE), also belong to this framework. Thus, the structured HiPPO matrix shall serve as a good initialization method of $\Lambda,B$ . Following [4], we choose the HiPPO-LegS matrix for initialization, which is defined as

	$\displaystyle\mathbf{A}_{nk}$	$\displaystyle=-\begin{cases}(2n+1)^{1/2}(2k+1)^{1/2},&n>k\\n+1,&n=k\\0,&n<k\end{cases}.$		(12)
	$\displaystyle b_{n}$	$\displaystyle=(2n+1)^{\frac{1}{2}}.$		(13)

The naive diagonalization of $\mathbf{A}_{nk}$ to initialize $\Lambda$ would lead to numerically infeasible and unstable issues. Gu et al. [5] proposed that this problem is solved by equivalently transforming $\mathbf{A}_{nk}$ into a normal plus low-rank (NPLR) matrix, which is expressed as a normal matrix

Appendix C Comparison with related models

C.1 Structure Comparison of SSMs

According to SSM’s different input and output dimensions, the current related work can be divided into two categories. One type is built on SSM with single-input single-output (SISO), including S4, DSS, and S4D; the other type is based on SSM with multi-input multi-output (MIMO), including S5 and LDNN in this work.

As shown in Fig. 1, SISO SSM uses univariate sequences as input and output, while MIMO SSM directly models multivariate sequences. Usually, multiple SISO SSMs are used to model a multivariate sequence independently, and then a linear layer is used for feature fusion, as used in S4 and DSS. Compared with SISO SSM, MIMO SSM does not require an additional linear layer.

Table 2 concludes the structure of different SSM-based models. Except for S4, other methods are based on diagonal SSMs. S5 is the only one that directly utilizes recursive SSM for reasoning and learning. Though other models can perform recursive reasoning, learning is based on convolutional SSM.

Appendix for Linear Dynamics-embedded Neural Network for Long-Sequence Modeling (1)

Model	Type	Structure	Convolutional	Kernel Computation	Convolution	Recurrence	Discretiztion
S4	SISO	DPLR	✓	Cauchy	FFT	vanilla	Bilinear
DSS	SISO	Diagonal	✓	softmax	FFT	vanilla	ZOH
S4D	SISO	Diagonal	✓	Vandermonde	FFT	vanilla	Optional
S5	MIMO	Diagonal	✕	✕	✕	Scan operation	Bilinear
LDNN	MIMO	Diagonal	✓	Vandermonde	FFT	vanilla	ZOH

C.2 Relationship Between S4, S5, and LDNN

S4 and S5 are the most representative works in SISO and MIMO SSM. Here, we analyze the relationship between LDNN and them. Fig 2 presents the computational flow of those models. The following statements are summarized:

•
S4 is based on the SISO SSM, while S5 and LDNN are based on MIMO SSM.
•
S4 uses a DPLR parameterization for system matrix $A$ . S5 and LNDD both use diagonal SSM.
•
All three models can make inferences in recurrent mode. However, they differ in the learning process. S4 and LDNN learn in convolutional representations, but S5 learns in recurrent representation.
•
S4 calculates the kernel and convolution in the frequency domain, but LDNN calculates convolution in the frequency domain and the kernel in the time domain.
•
Multi-Head LNDD and multi-copy of S4 are block-diagonal MIMO SSM, but they differ in the structure of SISO and MIMO SSMs, as shown in Fig 1.
•
S4 with $H$ copies is the special case of Multi-Head LNDD with head number $S$ = input size $H$ .
•
S5 is equivalent to Multi-Head LNDD with head $S$ =1.
•
$A,B$ , and $C$ in S4 and S5 are complex numbers, but LDNN only parameterizes $A$ as complex numbers.
•
Bidirectional settings in LDNN do not introduce additional parameters, but S4 and S5 do.

Appendix for Linear Dynamics-embedded Neural Network for Long-Sequence Modeling (2)

Appendix D Supplementary Results

D.1 Extend Results on LRA

Model	ListOps	Text	Retrieval	Image	Pathfinder	Path-X	Avg.
length	2,000	4,096	4,000	1,024	1,024	16,384	-
Transformer [6]	36.37	64.27	57.46	42.44	71.40	-	53.66
Reformer [7]	37.27	56.10	53.40	38.07	68.50	-	50.56
Performer [8]	18.01	65.40	53.82	42.77	77.05	-	51.18
Linear Trans [9]	16.13	65.90	53.09	42.34	75.30	-	50.46
BigBird [10]	36.05	64.02	59.29	40.83	74.87	-	54.17
Luna-256 [11]	37.25	64.57	79.29	47.38	77.72	-	59.37
FNet [12]	35.33	65.11	59.61	38.67	77.80	-	54.42
Nyströmformer [13]	37.15	65.52	79.56	41.58	70.94	-	57.46
H-Transformer-1D [14]	49.53	78.69	63.99	46.05	68.78	-	61.42
CCNN [15]	43.60	84.08	-	88.90	91.51	-	68.02
CDIL-CNN [16]	60.60	87.62	84.27	64.49	91.00	-	77.59
S4 [5]	59.60	86.82	90.90	88.65	94.2	96.35	86.09
DSS [17]	60.6	84.8	87.8	85.7	84.6	87.8	81.88
S4D [18]	60.47	86.18	89.46	88.19	93.06	91.95	84.89
S5 [4]	62.15	89.31	91.40	88.00	95.33	98.58	87.46
LDNN	62.20	88.25	90.15	87.25	93.87	92.76	85.75

D.2 Extend Results on Raw Speech Classification

Model	MFCC	16kHz	8kHz
(Length)	(784)	(16,000)	(8,000)
Transformer [19, 6]	90.75	-	-
Performer [8]	80.85	30.77	30.68
ODE-RNN [20]	65.9	-	-
NRDE [21]	89.8	16.49	15.12
ExpRNN [22]	82.13	11.6	10.8
LipschitzRNN [23]	88.38	-	-
CKConv [24]	95.3	71.66	65.96
WaveGAN-D [25]	-	96.25	-
LSSL [26]	93.58	-	-
S4 [5]	93.96	98.32	96.30
LDNN	94.46	97.59	94.23

Model	Parameters	16kHz	8kHz
(Length)		(16,000)	(8,000)
InceptionNet [27]	481K	61.24	05.18
ResNet-18 [27]	216K	77.86	08.74
XResNet-50 [27]	904K	83.01	07.72
ConvNet [27]	26.2M	95.51	07.26
S4-LegS [5]	307K	96.08	91.32
S4-FouT [28]	307K	95.27	91.59
S4-(LegS/FouT) [28]	307K	95.32	90.72
S4D-LegS [17]	306K	95.83	91.08
S4D-Inv [17]	306K	96.18	91.80
S4D-Lin [17]	306K	96.25	91.58
Liquid-S4 [29]	224K	96.78	90.00
S5 [4]	280K	96.52	94.53
LDNN	220K	96.08	88.83

D.3 Extend Results on Pixel-level 1-D Image classification

Model	sMNSIT	psMNIST	sCIFAR
(Length)	(784)	(784)	(1024)
Transformer [19, 6]	98.9	97.9	62.2
CCNN [15]	99.72	98.84	93.08
FlexTCN [30]	99.62	98.63	80.82
CKConv [24]	99.32	98.54	63.74
TrellisNet [31]	99.20	98.13	73.42
TCN [32]	99.0	97.2	-
LSTM [33, 34]	98.9	95.11	63.01
r-LSTM [19]	98.4	95.2	72.2
Dilated GRU [35]	99.0	94.6	-
Dilated RNN [35]	98.0	96.1	-
IndRNN [36]	99.0	96.0	-
expRNN [22]	98.7	96.6	-
UR-LSTM [33]	99.28	96.96	71.00
UR-GRU [33]	99.27	96.51	74.4
LMU [37]	-	97.15	-
HiPPO-RNN [2]	98.9	98.3	61.1
UNIcoRNN [38]	-	98.4	-
LMU-FFT [39]	-	98.49	-
LipschitzRNN [23]	99.4	96.3	64.2
LSSL [26]	99.53	98.76	84.65
S4 [5]	99.63	98.70	91.80
S4D [18]	-	-	89.92
Liquid-S4 [29]	-	-	92.02
S5 [4]	99.65	98.67	90.10
LDNN	99.54	98.45	88.12

Appendix E Experimental Configurations for Reproducibility

E.1 Hyperparameters

Details of all experiments are described in this part. Table 7 lists the key hyperparameter , including model depth, learning rate, and so on.

Dataset	Batch	Epoch	Depth	Head	H	N	M	LR		Dropout	Prenorm
Dataset	Batch	Epoch	Depth	Head	H	N	M	SSM	Others	Dropout	Prenorm
ListOps	100	80	6	4	256	256	256	0.01	0.01	0	False
Text	16	80	6	256	256	256	256	0.001	0.004	0.1	True
Retrieval	32	20	6	64	128	128	128	0.001	0.002	0	True
Image	50	200	6	64	256	512	256	0.001	0.005	0.1	False
Pathfinder	64	300	6	8	192	256	192	0.001	0.005	0.05	True
Pathx	8	200	6	8	192	256	192	0.0005	0.001	0	True
SC10-MFCC	16	80	4	32	128	128	128	0.001	0.006	0.1	False
SC10	16	150	6	32	128	128	128	0.001	0.006	0.1	True
SC35	16	100	6	32	128	128	128	0.001	0.008	0.1	False
sMNSIT	50	150	4	16	128	96	128	0.002	0.008	0.1	True
psMNIST	50	200	4	8	128	128	128	0.001	0.004	0.15	True
sCIFAR	50	200	6	64	256	512	256	0.001	0.005	0.1	True

Activation

MIMO SSM directly models the multivariate sequence; no additional layer is needed to mix features. Therefore, we follow S5 and use a weighted sigmoid gated unit. Specifically, the LDNN outout $\mathbf{y}_{k}\in\mathbb{R}^{M}$ is fed into the activation function expressed as $\mathbf{u}_{k}^{\prime}=\mathrm{Gelu}(\mathbf{y}_{k})\odot\sigma(\mathbf{W}*%\mathrm{Gelu}(\mathbf{y}_{k}))$ , where $\mathbf{W}\in\mathbb{R}^{M\times M}$ is a learnable dense matrix. This activation function is used as the default setting if not otherwise specified.

Normalization

Either batch or layer normalization is applied before or after LDNN. Batch normalization after LDNN is used if not otherwise specified.

Initialization

All experiments are initialized using the same configuration introduced in B.3.

Loss and Metric

Cross-entropy loss is used for all classification tasks. Binary or multi-class accuracy is used for metric evaluation.

Optimizer

AdamW is used across all experiments. The learning rate applied to SSM is named $LR_{SSM}$ , and the other is named $LR_{other}$ . The learning rate is dynamically adjusted by $CosineAnnealingLR$ or $ReduceLROnPlateau$ in PyTorch.

E.2 Task Specific Hyperparameters

Here, we specify any task-specific details, hyperparameters, or architectural differences from the defaults outlined above.

E.2.1 Listops

The bidirectional setting is not used. Leakyrelu activation is applied. $C$ is initialized by HiPPO.

E.2.2 Text

The learning rate is adjusted by $ReduceLROnPlateau$ with factor=0.5, patience=5. $LR_{other}$ is applied to SSM parameter $C$ .

E.2.3 Retrieval

We follow the experimental configuration in S4. The model takes two documents as input and outputs two sequences. A mean pooling layer is then used to transform these two sequences into vectors, noted as $y_{1}$ and $y_{2}$ . Four features are created by concatenating $y_{1}$ and $y_{2}$ as following

\displaystyle y=[y_{1},y_{2},y_{1}*y_{2},y_{1}-y_{2}].

(19)

This concatenated feature is then fed to a linear layer and gelu function for binary classification.

The learning rate is adjusted by $CosineAnnealingLR$ with warmup steps=1,000 and total training steps=50,000. $LR_{other}$ is applied to SSM parameter $C$ .

E.2.4 Image

The learning rate is adjusted by $ReduceLROnPlateau$ with factor=0.6, patience=5. $LR_{other}$ is applied to SSM parameters $B$ and $C$ . Data augmentation, including horizontal flips and random crops, is applied.

E.2.5 Pathfinder

The learning rate is adjusted by $CosineAnnealingLR$ with warmup steps=5,000 and total training steps=40,000. $LR_{other}$ is applied to SSM parameter $C$ .

E.2.6 Path-X

The learning rate is adjusted by $CosineAnnealingLR$ with warmup steps=10,000 and total training steps=1,000,000. $LR_{other}$ is applied to SSM parameter $C$ . $\Delta$ is initialized by uniformly sampling from [0.0001, 0.1]. 50% training set is used before epoch=110. Validation and testing sets remain unchanged. A scale factor of 0.0625 is applied to $\Delta$ .

E.2.7 Speech Commands 10 - MFCC

The learning rate is adjusted by $ReduceLROnPlateau$ with factor=0.2, patience=5. $LR_{other}$ is applied to SSM parameter $C$ .

E.2.8 Speech Commands 10

The learning rate is adjusted by $ReduceLROnPlateau$ with factor=0.2, patience=10. $LR_{other}$ is applied to SSM parameter $C$ .

E.2.9 Speech Commands 35

The learning rate is adjusted by $CosineAnnealingLR$ with total training steps=270,000.

E.2.10 Sequential MNIST

The learning rate is adjusted by $ReduceLROnPlateau$ with factor=0.2, patience=10. $LR_{other}$ is applied to SSM parameters $B$ and $C$ .

E.2.11 Permuted Sequential MNIST

The learning rate is adjusted by $CosineAnnealingLR$ with warmup steps=1,000 and total training steps=81,000. $LR_{other}$ is applied to SSM parameter $C$ .

E.2.12 Sequential CIFAR

The same hyperparameter is used as in LRA-Image.

E.3 Dataset Details

Here, we provide more detailed introductions to LRA, Speech Commands, and 1D image classification. This work follows the same data preprocessing process of S4 and S5. For the preprocessing details of each task, please refer to the code we provide at https://github.com/leonty1/DeepLDNN.

LRA

ListOps:The ListOps contains mathematical operations performed on lists of single-digit integers, expressed in prefix notation [40]. The goal is to predict each complete sequence’s corresponding solution, which is also a single-digit integer. Consequently, this constitutes a ten-way balanced classification problem. For example, [MIN 2 9 [MAX 4 7 ] 0 ] has the solution 0. All sequences have a uniform length of 2000 (if not padded with zero). The dataset has a total of 10,000 samples, which are divided into 8:1:1 for training, validation, and testing.

Text:This dataset is based on the IMDB sentiment dataset. This task aims to classify the sentiment of a given movie review (text) as either positive or negative. For example, a positive comment: ’Probably my all-time favorite movie,…’. The maximum length of each sequence is $4,096$ . IMDB contains $25,000$ training examples and $25,000$ testing examples.

Retrieval:This task measures the similarity between two sequences based on the AAN dataset [41]. The maximum length of each sequence is 4,000. It is a binary classification task. There are $147,086$ training samples, $18,090$ validation samples, and $17,437$ test samples.

Image: This task is based on the CIFAR-10 dataset [42]. Grayscale CIFAR-10 image has a resolution of $32\times 32$ , which is flattened into a 1D sequence for a ten-way classification. All sequences have a length of 1024. It has $45,000$ training examples, $5,000$ validation examples, and $10,000$ test examples.

Pathfinder: This task aims to classify whether the two small circles depicted in the picture are connected with dashed lines, constituting a binary classification task [43]. A grayscale image has a size of $32\times 32$ , which is flattened into a sequence with length $1,024$ . There are $200,000$ examples, which are split into 8:1:1 for training, validation, and testing process.

Path-X: A more challenging version of the Pathfinder. The image’s resolution was increased to $128\times 128$ , resulting in a sixteenfold increase in sequence length, from 1024 to 16,384.

Raw Speech Commands

Speech Commands-35:This dataset records audio of 35 different words [44]. This task aims to determine which word a given audio is. It is a multi-classification problem with 35 categories. There are two audio collection frequencies, $16KHz$ and $8KHz$ . All audio sequences have the same length, $16,000$ if sampled at $16KHz$ or $8,000$ if sampled at $16KHz$ . It contains $24,482$ training samples, $5,246$ validation samples, and $5,247$ testing samples.

Speech Commands-10: This database contains ten categories of audio, a subset of Speech Commands-35.

Speech Commands-MFCC: The original audio in Speech Commands-10 is pre-processed into MFCC features with length of 161.

Pixel-level 1-D Image Classification

Sequential MNIST (sMNIST) :10-way digit classification from a $28\times 28$ grayscale image of a handwritten digit, where the input image is flattened into a $784$ -length scalar sequence.

Permuted Sequential MNIST (psMNIST):This task aims to perform 10-category digit classification from a $28\times 28$ grayscale image of handwritten digits. The original image is first flattened into a sequence of length 784. Next, this sequence is rearranged in a fixed order.

Sequential CIFAR (sCIFAR):Color version of image task, where each image is an (R,G,B) triple.

E.4 Implementation Configurations

All the experiments are conducted with:

•
Operating System: Windows 10, version 22H2
•
CPU: AMD Ryzen Threadripper 3960X 24-Core Processor @ 3.8GHz
•
GPU: NVIDIA GeForce RTX 3090 with 24 GB of memory
•
Software: Python 3.9.12, Cuda 11.3, PyTorch[45] 1.12.1.

Appendix F PyTorch Implementation of LDNN Layer

⬇

1 import torch

3 # B = batch, C = channel, S = head, H = input size,

4 # M = output size, N = state size, L = sequence length

6 def discretize_zoh(Lambda, B, Delta):

7 """ Discretize the diagonal, continuous-time linear SSM with MIMO

8 Args:

9 Lambda (complex64): diagonal state matrix (C, S, N)

10 B (complex64): input matrix (C, S, N, H)

11 Delta (float32): discretization step sizes (C, S, N)

12 Returns:

13 discretized Lambda_bar (complex64), B_bar (complex64) """

14 Lambda_bar = Lambda * Delta

15 Identity = torch.ones_like(Lambda)

16 B_coef = (reciprocal(Lambda) * (torch.exp(Lambda_bar)-Identity))

17 B_bar = torch.einsum(’cn,cnh->cnh’, B_coef, B)

18 return Lambda_bar, B_bar

20 def ldnn(Lambda_bar, B_bar, C_tilde, D, input):

21 """ Discretized SSM as linear dynamic-embedded neural network.

22 Args:

23 input (float32): input sequence of features (B, H, L)

24 Returns: y (float32): outputs (B, M, L) """

26 #Split input into heads, h=H/S

27 u = u.reshape(B, S, h, L)

29 #Calculate B*u

30 B_u = torch.einsum(’csnh,bshl->bcsnl’, B_bar, u)

32 #Compute State Kernel

33 length = torch.arange(Lk).cuda()

34 p=torch.einsum(’csn,l->csnl’, Lambda_bar, length)

35 state_kernel = p.exp() # [channel, head, N, L]

36 state_kernel = state_kernel.real # real part of complex kernel

38 #Bidirectional kernel for non-causal state inference

39 if self.bidirectional:

40 #reversal backforward kernel

41 state_kernel_new=F.pad(state_kernel,(0, L))+F.pad(state_kernel.flip(-1),( L, 0))

42 else:

43 state_kernel_new =state_kernel

45 #Efficient convolution for state inference via FFT

46 k_f = torch.fft.rfft(state_kernel_new, n=n)

47 u_f = torch.fft.rfft(B_u, n=n)

48 x_f = torch.einsum(’bcsnl,csnl->bcsnl’, u_f, k_f)

49 x = torch.fft.irfft(x_f, n=n)[..., :L]

51 #Calculate C*X

52 C_x = torch.einsum(’csmn, bcsnl->bcsml’, C, x)

54 #Calculate output with Du

55 y = C_x + torch.einsum(’csmh, bshl->bcsml’, self.D, u_D)

57 #Mix channels using linear projection

58 y= dropout(y)

59 y = self.channel_mixer(y)

61 #Activation

62 y=activation(y)

64 return y

References

[1]Hongyu Hè and Marko Kabic.A unified view of long-sequence models towards million-scaledependencies.arXiv preprint arXiv:2302.06218, 2023.
[2]Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré.Hippo: Recurrent memory with optimal polynomial projections.Advances in neural information processing systems,33:1474–1487, 2020.
[3]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition.In Proceedings of the IEEE conference on computer vision andpattern recognition, pages 770–778, 2016.
[4]JimmyTH Smith, Andrew Warrington, and ScottW Linderman.Simplified state space layers for sequence modeling.arXiv preprint arXiv:2208.04933, 2022.
[5]Albert Gu, Karan Goel, and Christopher Ré.Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396, 2021.
[6]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,AidanN Gomez, Łukasz Kaiser, and Illia Polosukhin.Attention is all you need.Advances in neural information processing systems, 30, 2017.
[7]Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya.Reformer: The efficient transformer.arXiv preprint arXiv:2001.04451, 2020.
[8]Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song,Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin,Lukasz Kaiser, etal.Rethinking attention with performers.arXiv preprint arXiv:2009.14794, 2020.
[9]Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and FrançoisFleuret.Transformers are rnns: Fast autoregressive transformers with linearattention.In International conference on machine learning, pages5156–5165. PMLR, 2020.
[10]Manzil Zaheer, Guru Guruganesh, KumarAvinava Dubey, Joshua Ainslie, ChrisAlberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, LiYang,etal.Big bird: Transformers for longer sequences.Advances in neural information processing systems,33:17283–17297, 2020.
[11]Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, andLuke Zettlemoyer.Luna: Linear unified nested attention.Advances in Neural Information Processing Systems,34:2441–2453, 2021.
[12]James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon.Fnet: Mixing tokens with fourier transforms.arXiv preprint arXiv:2105.03824, 2021.
[13]Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung,Yin Li, and Vikas Singh.Nyströmformer: A nyström-based algorithm for approximatingself-attention.In Proceedings of the AAAI Conference on ArtificialIntelligence, volume35, pages 14138–14148, 2021.
[14]Zhenhai Zhu and Radu Soricut.H-transformer-1d: Fast one-dimensional hierarchical attention forsequences.arXiv preprint arXiv:2107.11906, 2021.
[15]DavidW Romero, DavidM Knigge, Albert Gu, ErikJ Bekkers, Efstratios Gavves,JakubM Tomczak, and Mark Hoogendoorn.Towards a general purpose cnn for long range dependencies in $n$ d.arXiv preprint arXiv:2206.03398, 2022.
[16]Lei Cheng, Ruslan Khalitov, Tong Yu, Jing Zhang, and Zhirong Yang.Classification of long sequential data using circular dilatedconvolutional neural networks.Neurocomputing, 518:50–59, 2023.
[17]Ankit Gupta, Albert Gu, and Jonathan Berant.Diagonal state spaces are as effective as structured state spaces.Advances in Neural Information Processing Systems,35:22982–22994, 2022.
[18]Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré.On the parameterization and initialization of diagonal state spacemodels.Advances in Neural Information Processing Systems,35:35971–35983, 2022.
[19]Trieu Trinh, Andrew Dai, Thang Luong, and Quoc Le.Learning longer-term dependencies in rnns with auxiliary losses.In International Conference on Machine Learning, pages4965–4974. PMLR, 2018.
[20]Yulia Rubanova, RickyTQ Chen, and DavidK Duvenaud.Latent ordinary differential equations for irregularly-sampled timeseries.Advances in neural information processing systems, 32, 2019.
[21]Patrick Kidger, James Morrill, James Foster, and Terry Lyons.Neural controlled differential equations for irregular time series.Advances in Neural Information Processing Systems,33:6696–6707, 2020.
[22]Mario Lezcano-Casado and David Martınez-Rubio.Cheap orthogonal constraints in neural networks: A simpleparametrization of the orthogonal and unitary group.In International Conference on Machine Learning, pages3794–3803. PMLR, 2019.
[23]NBenjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, andMichaelW Mahoney.Lipschitz recurrent neural networks.arXiv preprint arXiv:2006.12070, 2020.
[24]DavidW Romero, Anna Kuzina, ErikJ Bekkers, JakubM Tomczak, and MarkHoogendoorn.Ckconv: Continuous kernel convolution for sequential data.arXiv preprint arXiv:2102.02611, 2021.
[25]Chris Donahue, Julian McAuley, and Miller Puckette.Adversarial audio synthesis.arXiv preprint arXiv:1802.04208, 2018.
[26]Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, andChristopher Ré.Combining recurrent, convolutional, and continuous-time models withlinear state space layers.Advances in neural information processing systems, 34:572–585,2021.
[27]Naoki Nonaka and Jun Seita.In-depth benchmarking of deep neural network architectures for ecgdiagnosis.In Machine Learning for Healthcare Conference, pages 414–439.PMLR, 2021.
[28]Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher Ré.How to train your hippo: State space models with generalizedorthogonal basis projections.arXiv preprint arXiv:2206.12037, 2022.
[29]Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, AlexanderAmini, and Daniela Rus.Liquid structural state-space models.arXiv preprint arXiv:2209.12951, 2022.
[30]DavidW Romero, Robert-Jan Bruintjes, JakubM Tomczak, ErikJ Bekkers, MarkHoogendoorn, and JanC van Gemert.Flexconv: Continuous kernel convolutions with differentiable kernelsizes.arXiv preprint arXiv:2110.08059, 2021.
[31]Shaojie Bai, JZico Kolter, and Vladlen Koltun.Trellis networks for sequence modeling.arXiv preprint arXiv:1810.06682, 2018.
[32]Shaojie Bai, JZico Kolter, and Vladlen Koltun.An empirical evaluation of generic convolutional and recurrentnetworks for sequence modeling.arXiv preprint arXiv:1803.01271, 2018.
[33]Albert Gu, Caglar Gulcehre, Thomas Paine, Matt Hoffman, and Razvan Pascanu.Improving the gating mechanism of recurrent neural networks.In International Conference on Machine Learning, pages3800–3809. PMLR, 2020.
[34]Sepp Hochreiter and Jürgen Schmidhuber.Long short-term memory.Neural computation, 9(8):1735–1780, 1997.
[35]Shiyu Chang, Yang Zhang, Wei Han, MoYu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui,Michael Witbrock, MarkA Hasegawa-Johnson, and ThomasS Huang.Dilated recurrent neural networks.Advances in neural information processing systems, 30, 2017.
[36]Shuai Li, Wanqing Li, Chris Cook, CeZhu, and Yanbo Gao.Independently recurrent neural network (indrnn): Building a longerand deeper rnn.In Proceedings of the IEEE conference on computer vision andpattern recognition, pages 5457–5466, 2018.
[37]Aaron Voelker, Ivana Kajić, and Chris Eliasmith.Legendre memory units: Continuous-time representation in recurrentneural networks.Advances in neural information processing systems, 32, 2019.
[38]TKonstantin Rusch and Siddhartha Mishra.Unicornn: A recurrent model for learning very long time dependencies.In International Conference on Machine Learning, pages9168–9178. PMLR, 2021.
[39]NarsimhaReddy Chilkuri and Chris Eliasmith.Parallelizing legendre memory unit training.In International Conference on Machine Learning, pages1898–1907. PMLR, 2021.
[40]Nikita Nangia and SamuelR Bowman.Listops: A diagnostic dataset for latent tree learning.arXiv preprint arXiv:1804.06028, 2018.
[41]DragomirR Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara.The acl anthology network corpus.Language Resources and Evaluation, 47:919–944, 2013.
[42]Alex Krizhevsky, Geoffrey Hinton, etal.Learning multiple layers of features from tiny images.2009.
[43]Drew Linsley, Junkyung Kim, Vijay Veerabadran, Charles Windolf, and ThomasSerre.Learning long-range spatial dependencies with horizontal gatedrecurrent units.Advances in neural information processing systems, 31, 2018.
[44]Pete Warden.Speech commands: A dataset for limited-vocabulary speech recognition.arXiv preprint arXiv:1804.03209, 2018.
[45]Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, GregoryChanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, etal.Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019.

	$\displaystyle\mathbf{A}^{\mathrm{Normal}}$	$\displaystyle=-\begin{cases}(n+\frac{1}{2})^{1/2}(k+\frac{1}{2})^{1/2},&n>k\\\frac{1}{2},&n=k\\(n+\frac{1}{2})^{1/2}(k+\frac{1}{2})^{1/2},&n<k\end{cases}.$		(17)
	$\displaystyle\mathbf{P}_{n}$	$\displaystyle=(n+\frac{1}{2})^{\frac{1}{2}}$		(18)