Tongyi Liang, Han-Xiong Li
Department of Systems Engineering, City University of Hong Kong
tyliang4-c@my.cityu.edu.hk, mehxli@cityu.edu.hk
Corresponding authors
Abstract
This appendix provides all necessary materials for the paper "Linear Dynamics-embedded Neural Network for Long-Sequence Modeling", including model details, experimental configurations, and the PyTorch implementation. The code is available at https://github.com/leonty1/DeepLDNN.
Contents:

- Appendix A: Notations.
- Appendix B: Model Details.
- Appendix B.1: Convolutional View of Continuous SSMs.
- Appendix B.2: Numerical Discretization.
- Appendix B.3: Parameterization and Initialization of LDNN.
- Appendix B.4: HiPPO Initialization.
- Appendix C: Comparison with Related Models.
- Appendix C.1: Structure Comparison of SSMs.
- Appendix C.2: Relationship between LDNN, S4, and S5.
- Appendix D: Supplementary Results.
- Appendix E: Experimental Configurations for Reproducibility.
- Appendix F: PyTorch Implementation of LDNN Layer.
Appendix A Notations
Notations | Descriptions |
---|---|
SSMs | State space models |
$u$ | System input sequence |
$x$ | System state |
$y$ | System output sequence |
$A$ | System matrix in continuous SSMs |
$B$ | Input matrix in continuous SSMs |
$C$ | Output matrix in continuous SSMs |
$D$ | Direct transition matrix in continuous SSMs |
$\bar{A}$ | System matrix in discrete SSMs |
$\bar{B}$ | Input matrix in discrete SSMs |
$\bar{C}$ | Output matrix in discrete SSMs |
$\bar{D}$ | Direct transition matrix in discrete SSMs |
$\Delta$ | Discrete time step in discrete SSMs |
$K_x$ | State kernel in convolutional SSMs |
$K_y$ | System kernel in convolutional SSMs |
$\Lambda$ | Diagonal system matrix in diagonal SSMs |
FFT | Fast Fourier Transform |
Appendix B Model Details
B.1 Convolutional View of Continuous SSMs
Here, we introduce the convolutional view of continuous SSMs [1].
$$\dot{x}(t) = A x(t) + B u(t), \qquad y(t) = C x(t) + D u(t) \tag{1}$$

$$x(t) = \int_{0}^{t} e^{A(t-\tau)} B\, u(\tau)\, d\tau \tag{2}$$

$$x(t) = (K_x * u)(t), \qquad y(t) = (K_y * u)(t) + D u(t) \tag{3}$$

Using a change of variables and assuming a zero initial state, we reformulate Eq. (1) as Eq. (2). Then, letting the state kernel be $K_x(t) = e^{At}B$ and the system kernel be $K_y(t) = C e^{At} B$, we obtain the convolutional SSMs (3) according to the definition of convolution.
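To make the convolutional view concrete, the following is a minimal sketch that samples the system kernel $K_y(t) = Ce^{At}B$ for a diagonal state matrix and applies it with an FFT-based causal convolution. The function name, tensor shapes, and the diagonal assumption are ours for illustration, not the paper's implementation.

```python
import torch

def convolutional_ssm_output(Lambda, B, C, D, u, dt):
    """Sketch of Eq. (3) for a diagonal SSM.

    Assumed shapes: Lambda (N,) complex, B (N,), C (N,), D scalar, u (L,) real;
    dt is the interval used to sample the continuous kernel.
    """
    L = u.shape[-1]
    t = dt * torch.arange(L)
    # Sample the system kernel K_y(t) = C e^{Lambda t} B elementwise over the diagonal state.
    basis = torch.exp(Lambda[:, None] * t[None, :])                 # (N, L)
    K = torch.einsum("n,nl->l", (C * B).to(basis.dtype), basis).real
    # Causal convolution via FFT; zero-pad to 2L to avoid circular wrap-around.
    y = torch.fft.irfft(torch.fft.rfft(K, n=2 * L) * torch.fft.rfft(u, n=2 * L), n=2 * L)[:L]
    return dt * y + D * u   # dt approximates the integral in the continuous convolution
```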
B.2 Numerical Discretization
B.2.1 Zero-order Hold Method
The state transition function is an ordinary differential equation (ODE). We can obtain its analytical solution as follows.
$$x(t) = e^{At} x(0) + \int_{0}^{t} e^{A(t-\tau)} B\, u(\tau)\, d\tau \tag{4}$$

Eq. (4) is the analytical solution of the state transition ODE. Then, we rewrite Eq. (4) with an arbitrary initial time $t_0$:

$$x(t) = e^{A(t-t_0)} x(t_0) + \int_{t_0}^{t} e^{A(t-\tau)} B\, u(\tau)\, d\tau \tag{5}$$

When we sample $x(t)$ with time interval $\Delta$, $t$ becomes $t_k = k\Delta$, where $k$ is a positive integer. The zero-order hold method assumes the input is held constant within each interval, i.e., $u(\tau) = u(t_k)$ for $t_k \le \tau < t_{k+1}$. Thus, we have

$$x(t_{k+1}) = e^{A\Delta} x(t_k) + \left(\int_{0}^{\Delta} e^{A\tau}\, d\tau\right) B\, u(t_k) \tag{6}$$

We abbreviate $x(t_k)$, $u(t_k)$, and $y(t_k)$ as $x_k$, $u_k$, and $y_k$, respectively. Here, we get the discrete transition function

$$x_{k+1} = \bar{A} x_k + \bar{B} u_k \tag{7}$$

with $\bar{A} = e^{A\Delta}$ and $\bar{B} = \left(\int_{0}^{\Delta} e^{A\tau}\, d\tau\right) B$.

We can further simplify $\bar{B}$ assuming that $A$ is invertible:

$$\bar{B} = A^{-1}\left(e^{A\Delta} - I\right) B \tag{8}$$
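For a diagonal state matrix, Eqs. (7)–(8) reduce to elementwise operations. The snippet below is a minimal sketch of ZOH discretization under that assumption; names and shapes are ours.

```python
import torch

def zoh_discretize(Lambda, B, dt):
    """Zero-order hold discretization for a diagonal SSM.

    Lambda: (N,) complex diagonal of A;  B: (N, H) input matrix;  dt: scalar step.
    """
    A_bar = torch.exp(dt * Lambda)                 # Eq. (7): A_bar = exp(A * Delta)
    B_bar = ((A_bar - 1.0) / Lambda)[:, None] * B  # Eq. (8) with diagonal (invertible) A
    return A_bar, B_bar
```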
B.2.2 Numerical Approximation
Based on the Taylor series expansion, the first derivative of $x(t)$ can be approximated by numerical differentiation. Using the forward Euler approximation, Eq. (9), or the backward Euler approximation, Eq. (10), for $\dot{x}(t)$, we obtain $\bar{A}$ and $\bar{B}$ as described in Eq. (11).

$$\dot{x}(t) \approx \frac{x(t+\Delta) - x(t)}{\Delta} \tag{9}$$

$$\dot{x}(t) \approx \frac{x(t) - x(t-\Delta)}{\Delta} \tag{10}$$

More generally, when we use the generalized bilinear transformation (GBT) method with parameter $\alpha$,

$$\bar{A} = \left(I - \alpha\Delta A\right)^{-1}\left(I + (1-\alpha)\Delta A\right), \qquad \bar{B} = \left(I - \alpha\Delta A\right)^{-1}\Delta B \tag{11}$$

There are three special cases of the GBT with different $\alpha$: the forward Euler method is GBT with $\alpha = 0$, the bilinear method is GBT with $\alpha = 1/2$, and the backward Euler method is GBT with $\alpha = 1$. These methods approximate the differential equation based on the Taylor series expansion.
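The GBT update in Eq. (11) can be written directly for a full matrix $A$; the sketch below (our naming) exposes the three special cases through the `alpha` argument.

```python
import torch

def gbt_discretize(A, B, dt, alpha=0.5):
    """Generalized bilinear transformation, Eq. (11).

    alpha = 0 -> forward Euler, alpha = 0.5 -> bilinear, alpha = 1 -> backward Euler.
    """
    I = torch.eye(A.shape[0], dtype=A.dtype)
    inv = torch.linalg.inv(I - alpha * dt * A)
    A_bar = inv @ (I + (1.0 - alpha) * dt * A)
    B_bar = inv @ (dt * B)
    return A_bar, B_bar
```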
B.3 Parameterization and Initialization of LDNN
The diagonal SSMs have learnable parameters $\Lambda$, $B$, $C$, and $D$, and a time step $\Delta$ for discretization. We introduce the parameterization and initialization of these parameters in turn.
Parameter $\Lambda$. According to Proposition 1 in Section III.A, all elements of $\Lambda$ must have negative real parts to ensure state convergence. Thus, we restrict the real part of $\Lambda$ to be negative by passing it through an enforcing function and negating the result. The enforcing function outputs positive real numbers and may take many forms, for example the Gaussian function, the rectified linear unit (ReLU), or the Sigmoid function. $\Lambda$ can be initialized randomly or with a constant. Besides, it can be initialized by the eigenvalues of specially structured matrices, such as the HiPPO matrix introduced in [2]. We initialize $\Lambda$ via HiPPO throughout this work.
Parameters $B$ and $C$. $B$ and $C$ are the parameters of the linear projection functions. We parameterize them as learnable full matrices. Furthermore, $B$ is initialized with random numbers under the HiPPO framework, as introduced in Section B.4, and $C$ is initialized from a truncated normal distribution.
Parameter $D$. Different parameterizations of $D$ have different meanings. If we parameterize $D$ as an untrainable zero matrix, the output of the SSM depends only on the state. When the input and output have the same size, we can parameterize it as an identity matrix, which is also known as a residual connection [3]. In this work, $D$ is parameterized as a trainable diagonal matrix initialized with the constant 1.
Parameter $\Delta$. $\Delta$ is a scalar for a given SSM. We set it as a learnable parameter and initialize it by randomly sampling from a bounded interval; this work uses [0.001, 0.1] as the default choice if not otherwise specified. We experimentally find that relaxing $\Delta$ from a scalar to a vector improves model accuracy, which is also reported in S5 [4]. Therefore, a vector-valued $\Delta$ is used across all experiments.
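The sketch below illustrates how these constraints could be realized in PyTorch: an enforcing function keeps the real part of $\Lambda$ non-positive, and a vector-valued $\Delta$ is sampled from the default interval. Function names and the choice of ReLU as the enforcing function are ours.

```python
import torch

def init_delta(size, dt_min=0.001, dt_max=0.1):
    # Vector-valued time step Delta, sampled uniformly from [dt_min, dt_max].
    return dt_min + (dt_max - dt_min) * torch.rand(size)

def constrain_lambda(lambda_re, lambda_im):
    # Enforce non-positive real parts (Proposition 1): pass the unconstrained real
    # part through a positive enforcing function (ReLU here) and negate it.
    return -torch.relu(lambda_re) + 1j * lambda_im
```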
B.4 HiPPO Initialization
HiPPO theory introduces a way to compress continuous signals and discrete time series by projection onto polynomial bases [2]. The continuous SSMs, as a particular type of ordinary differential equation (ODE), also belong to this framework. Thus, the structured HiPPO matrix can serve as a good initialization of $\Lambda$. Following [4], we choose the HiPPO-LegS matrix for initialization, which is defined as

$$\left(A_{\mathrm{LegS}}\right)_{nk} = -\begin{cases} \sqrt{(2n+1)(2k+1)}, & n > k \\ n+1, & n = k \\ 0, & n < k \end{cases} \tag{12}$$

$$\left(B_{\mathrm{LegS}}\right)_{n} = \sqrt{2n+1} \tag{13}$$
The naive diagonalization of the HiPPO-LegS matrix to initialize $\Lambda$ would be numerically infeasible and unstable. Gu et al. [5] proposed to solve this problem by equivalently transforming the HiPPO matrix into a normal plus low-rank (NPLR) form, which is expressed through a normal matrix

$$A^{\mathrm{Normal}} = V \Lambda V^{*} \tag{14}$$

together with a low-rank term,

$$A = A^{\mathrm{Normal}} - P Q^{\top} = V \Lambda V^{*} - P Q^{\top} \tag{15}$$

where $V \in \mathbb{C}^{N \times N}$ is unitary, $\Lambda$ is diagonal, and $P, Q \in \mathbb{R}^{N \times r}$ form a low-rank factorization.
The HiPPO-LegS matrix can be further rewritten as

$$A_{\mathrm{LegS}} = A^{\mathrm{Normal}}_{\mathrm{LegS}} - P P^{\top} \tag{16}$$

where

$$\left(A^{\mathrm{Normal}}_{\mathrm{LegS}}\right)_{nk} = -\begin{cases} \sqrt{(n+1/2)(k+1/2)}, & n > k \\ 1/2, & n = k \\ -\sqrt{(n+1/2)(k+1/2)}, & n < k \end{cases} \tag{17}$$

$$P_n = \sqrt{n + 1/2} \tag{18}$$
We initialize $\Lambda$ using the eigenvalues of $A^{\mathrm{Normal}}_{\mathrm{LegS}}$. Following S5 [4], the eigenvectors of $A^{\mathrm{Normal}}_{\mathrm{LegS}}$ are used for the initialization of $B$.
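A minimal sketch of this initialization is given below: it builds the normal part of the HiPPO-LegS matrix from Eqs. (17)–(18) and returns its eigenvalues (for $\Lambda$) and eigenvectors (for initializing $B$). Function names are ours.

```python
import torch

def hippo_legs_normal_init(N):
    """Eigen-decomposition of the normal part of HiPPO-LegS (Eqs. (16)-(18))."""
    P = torch.sqrt(torch.arange(N, dtype=torch.float64) + 0.5)        # Eq. (18)
    S = P[:, None] * P[None, :]
    # Eq. (17): +PP^T above the diagonal, -PP^T below it, -1/2 on the diagonal.
    A_normal = torch.triu(S, diagonal=1) - torch.tril(S, diagonal=-1) \
        - 0.5 * torch.eye(N, dtype=torch.float64)
    eigvals, eigvecs = torch.linalg.eig(A_normal)   # eigenvalues all have real part -1/2
    return eigvals, eigvecs
```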
Appendix C Comparison with Related Models
C.1 Structure Comparison of SSMs
According to the SSM's input and output dimensions, the current related work can be divided into two categories. One type is built on the SSM with single-input single-output (SISO), including S4, DSS, and S4D; the other type is based on the SSM with multi-input multi-output (MIMO), including S5 and the LDNN in this work.
As shown in Fig. 1, SISO SSM uses univariate sequences as input and output, while MIMO SSM directly models multivariate sequences. Usually, multiple SISO SSMs are used to model a multivariate sequence independently, and then a linear layer is used for feature fusion, as used in S4 and DSS. Compared with SISO SSM, MIMO SSM does not require an additional linear layer.
Table 2 summarizes the structures of the different SSM-based models. Except for S4, the other methods are based on diagonal SSMs. S5 is the only one that directly uses the recurrent SSM for both inference and learning. Although the other models can also perform recurrent inference, their learning is based on the convolutional SSM.
Model | Type | Structure | Convolutional | Kernel Computation | Convolution | Recurrence | Discretization |
---|---|---|---|---|---|---|---|
S4 | SISO | DPLR | ✓ | Cauchy | FFT | vanilla | Bilinear |
DSS | SISO | Diagonal | ✓ | softmax | FFT | vanilla | ZOH |
S4D | SISO | Diagonal | ✓ | Vandermonde | FFT | vanilla | Optional |
S5 | MIMO | Diagonal | ✕ | ✕ | ✕ | Scan operation | Bilinear |
LDNN | MIMO | Diagonal | ✓ | Vandermonde | FFT | vanilla | ZOH |
C.2 Relationship Between S4, S5, and LDNN
S4 and S5 are the most representative works on SISO and MIMO SSMs, respectively. Here, we analyze the relationship between LDNN and them. Fig. 2 presents the computational flow of these models. The following observations are summarized:

- S4 is based on the SISO SSM, while S5 and LDNN are based on the MIMO SSM.
- S4 uses a DPLR parameterization for the system matrix $A$; S5 and LDNN both use diagonal SSMs.
- All three models can perform inference in recurrent mode. However, they differ in the learning process: S4 and LDNN learn in the convolutional representation, whereas S5 learns in the recurrent representation.
- S4 computes both the kernel and the convolution in the frequency domain, whereas LDNN computes the convolution in the frequency domain and the kernel in the time domain.
- Multi-Head LDNN and multi-copy S4 are both block-diagonal MIMO SSMs, but they differ in the structure of the underlying SISO and MIMO SSMs, as shown in Fig. 1.
- S4 with $H$ copies is a special case of Multi-Head LDNN with the head number equal to the input size $H$.
- S5 is equivalent to Multi-Head LDNN with a single head.
- $\Lambda$, $B$, and $C$ in S4 and S5 are complex-valued, but LDNN only parameterizes $\Lambda$ as complex-valued.
- The bidirectional setting in LDNN does not introduce additional parameters, but those of S4 and S5 do.
Appendix D Supplementary Results
D.1 Extended Results on LRA
Model | ListOps | Text | Retrieval | Image | Pathfinder | Path-X | Avg. |
length | 2,000 | 4,096 | 4,000 | 1,024 | 1,024 | 16,384 | - |
Transformer [6] | 36.37 | 64.27 | 57.46 | 42.44 | 71.40 | - | 53.66 |
Reformer [7] | 37.27 | 56.10 | 53.40 | 38.07 | 68.50 | - | 50.56 |
Performer [8] | 18.01 | 65.40 | 53.82 | 42.77 | 77.05 | - | 51.18 |
Linear Trans [9] | 16.13 | 65.90 | 53.09 | 42.34 | 75.30 | - | 50.46 |
BigBird [10] | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | - | 54.17 |
Luna-256 [11] | 37.25 | 64.57 | 79.29 | 47.38 | 77.72 | - | 59.37 |
FNet [12] | 35.33 | 65.11 | 59.61 | 38.67 | 77.80 | - | 54.42 |
Nyströmformer [13] | 37.15 | 65.52 | 79.56 | 41.58 | 70.94 | - | 57.46 |
H-Transformer-1D [14] | 49.53 | 78.69 | 63.99 | 46.05 | 68.78 | - | 61.42 |
CCNN [15] | 43.60 | 84.08 | - | 88.90 | 91.51 | - | 68.02 |
CDIL-CNN [16] | 60.60 | 87.62 | 84.27 | 64.49 | 91.00 | - | 77.59 |
S4 [5] | 59.60 | 86.82 | 90.90 | 88.65 | 94.2 | 96.35 | 86.09 |
DSS [17] | 60.6 | 84.8 | 87.8 | 85.7 | 84.6 | 87.8 | 81.88 |
S4D [18] | 60.47 | 86.18 | 89.46 | 88.19 | 93.06 | 91.95 | 84.89 |
S5 [4] | 62.15 | 89.31 | 91.40 | 88.00 | 95.33 | 98.58 | 87.46 |
LDNN | 62.20 | 88.25 | 90.15 | 87.25 | 93.87 | 92.76 | 85.75 |
D.2 Extended Results on Raw Speech Classification
Model | MFCC | 16kHz | 8kHz |
(Length) | (784) | (16,000) | (8,000) |
Transformer [19, 6] | 90.75 | - | - |
Performer [8] | 80.85 | 30.77 | 30.68 |
ODE-RNN [20] | 65.9 | - | - |
NRDE [21] | 89.8 | 16.49 | 15.12 |
ExpRNN [22] | 82.13 | 11.6 | 10.8 |
LipschitzRNN [23] | 88.38 | - | - |
CKConv [24] | 95.3 | 71.66 | 65.96 |
WaveGAN-D [25] | - | 96.25 | - |
LSSL [26] | 93.58 | - | - |
S4 [5] | 93.96 | 98.32 | 96.30 |
LDNN | 94.46 | 97.59 | 94.23 |
Model | Parameters | 16kHz | 8kHz |
(Length) | (16,000) | (8,000) | |
InceptionNet [27] | 481K | 61.24 | 05.18 |
ResNet-18 [27] | 216K | 77.86 | 08.74 |
XResNet-50 [27] | 904K | 83.01 | 07.72 |
ConvNet [27] | 26.2M | 95.51 | 07.26 |
S4-LegS [5] | 307K | 96.08 | 91.32 |
S4-FouT [28] | 307K | 95.27 | 91.59 |
S4-(LegS/FouT) [28] | 307K | 95.32 | 90.72 |
S4D-LegS [17] | 306K | 95.83 | 91.08 |
S4D-Inv [17] | 306K | 96.18 | 91.80 |
S4D-Lin [17] | 306K | 96.25 | 91.58 |
Liquid-S4 [29] | 224K | 96.78 | 90.00 |
S5 [4] | 280K | 96.52 | 94.53 |
LDNN | 220K | 96.08 | 88.83 |
D.3 Extended Results on Pixel-level 1-D Image Classification
Model | sMNIST | psMNIST | sCIFAR |
(Length) | (784) | (784) | (1024) |
Transformer [19, 6] | 98.9 | 97.9 | 62.2 |
CCNN [15] | 99.72 | 98.84 | 93.08 |
FlexTCN [30] | 99.62 | 98.63 | 80.82 |
CKConv [24] | 99.32 | 98.54 | 63.74 |
TrellisNet [31] | 99.20 | 98.13 | 73.42 |
TCN [32] | 99.0 | 97.2 | - |
LSTM [33, 34] | 98.9 | 95.11 | 63.01 |
r-LSTM [19] | 98.4 | 95.2 | 72.2 |
Dilated GRU [35] | 99.0 | 94.6 | - |
Dilated RNN [35] | 98.0 | 96.1 | - |
IndRNN [36] | 99.0 | 96.0 | - |
expRNN [22] | 98.7 | 96.6 | - |
UR-LSTM [33] | 99.28 | 96.96 | 71.00 |
UR-GRU [33] | 99.27 | 96.51 | 74.4 |
LMU [37] | - | 97.15 | - |
HiPPO-RNN [2] | 98.9 | 98.3 | 61.1 |
UNIcoRNN [38] | - | 98.4 | - |
LMU-FFT [39] | - | 98.49 | - |
LipschitzRNN [23] | 99.4 | 96.3 | 64.2 |
LSSL [26] | 99.53 | 98.76 | 84.65 |
S4 [5] | 99.63 | 98.70 | 91.80 |
S4D [18] | - | - | 89.92 |
Liquid-S4 [29] | - | - | 92.02 |
S5 [4] | 99.65 | 98.67 | 90.10 |
LDNN | 99.54 | 98.45 | 88.12 |
Appendix E Experimental Configurations for Reproducibility
E.1 Hyperparameters
Details of all experiments are described in this part. Table 7 lists the key hyperparameters, including model depth, learning rate, and so on.
Dataset | Batch | Epoch | Depth | Head | H | N | M | SSM LR | LR | Dropout | Prenorm |
---|---|---|---|---|---|---|---|---|---|---|---|
ListOps | 100 | 80 | 6 | 4 | 256 | 256 | 256 | 0.01 | 0.01 | 0 | False |
Text | 16 | 80 | 6 | 256 | 256 | 256 | 256 | 0.001 | 0.004 | 0.1 | True |
Retrieval | 32 | 20 | 6 | 64 | 128 | 128 | 128 | 0.001 | 0.002 | 0 | True |
Image | 50 | 200 | 6 | 64 | 256 | 512 | 256 | 0.001 | 0.005 | 0.1 | False |
Pathfinder | 64 | 300 | 6 | 8 | 192 | 256 | 192 | 0.001 | 0.005 | 0.05 | True |
Path-X | 8 | 200 | 6 | 8 | 192 | 256 | 192 | 0.0005 | 0.001 | 0 | True |
SC10-MFCC | 16 | 80 | 4 | 32 | 128 | 128 | 128 | 0.001 | 0.006 | 0.1 | False |
SC10 | 16 | 150 | 6 | 32 | 128 | 128 | 128 | 0.001 | 0.006 | 0.1 | True |
SC35 | 16 | 100 | 6 | 32 | 128 | 128 | 128 | 0.001 | 0.008 | 0.1 | False |
sMNIST | 50 | 150 | 4 | 16 | 128 | 96 | 128 | 0.002 | 0.008 | 0.1 | True |
psMNIST | 50 | 200 | 4 | 8 | 128 | 128 | 128 | 0.001 | 0.004 | 0.15 | True |
sCIFAR | 50 | 200 | 6 | 64 | 256 | 512 | 256 | 0.001 | 0.005 | 0.1 | True |
Activation
MIMO SSM directly models the multivariate sequence, so no additional layer is needed to mix features. Therefore, we follow S5 and use a weighted sigmoid gated unit. Specifically, the LDNN output is fed into a gated activation in which it is multiplied element-wise by a sigmoid gate computed from a learnable dense matrix applied to it. This activation function is used as the default setting if not otherwise specified.
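As a sketch, such a weighted sigmoid gated unit can be written as follows; the GELU pre-activation and the exact gating form follow S5's gated activation and are our assumption rather than a verbatim copy of the released code.

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """Weighted sigmoid gated unit (sketch): y = x * sigmoid(W x) after a GELU."""
    def __init__(self, d_model):
        super().__init__()
        self.W = nn.Linear(d_model, d_model)   # learnable dense matrix

    def forward(self, x):
        x = torch.nn.functional.gelu(x)
        return x * torch.sigmoid(self.W(x))
```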
Normalization
Either batch or layer normalization is applied before or after LDNN. Batch normalization after LDNN is used if not otherwise specified.
Initialization
All experiments are initialized using the same configuration introduced in B.3.
Loss and Metric
Cross-entropy loss is used for all classification tasks. Binary or multi-class accuracy is used for metric evaluation.
Optimizer
AdamW is used across all experiments. A separate learning rate is applied to the SSM parameters, while the remaining parameters use the global learning rate. The learning rate is dynamically adjusted during training by a PyTorch scheduler, either ReduceLROnPlateau or a warmup-and-decay schedule, depending on the task (see Section E.2).
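A sketch of how the two learning rates and a plateau scheduler could be wired in PyTorch is shown below; the parameter-group selection by name and the illustrative rates are ours.

```python
import torch

def build_optimizer(model, lr=4e-3, ssm_lr=1e-3):
    # SSM parameters (e.g., Lambda, B, Delta) get their own, smaller learning rate;
    # excluding them from weight decay is a common convention in related SSM work.
    ssm_params = [p for n, p in model.named_parameters() if "ssm" in n]
    other_params = [p for n, p in model.named_parameters() if "ssm" not in n]
    optimizer = torch.optim.AdamW(
        [{"params": ssm_params, "lr": ssm_lr, "weight_decay": 0.0},
         {"params": other_params, "lr": lr}])
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.2, patience=5)
    return optimizer, scheduler
```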
E.2 Task Specific Hyperparameters
Here, we specify any task-specific details, hyperparameters, or architectural differences from the defaults outlined above.
E.2.1 ListOps
The bidirectional setting is not used. The LeakyReLU activation is applied. The system matrix is initialized by HiPPO.
E.2.2 Text
The learning rate is adjusted by ReduceLROnPlateau with factor=0.5 and patience=5. The SSM learning rate is applied to the SSM parameters.
E.2.3 Retrieval
We follow the experimental configuration of S4. The model takes two documents as input and outputs two sequences. A mean pooling layer is then used to transform these two sequences into vectors, denoted as $v_1$ and $v_2$. Four features are created by concatenating $v_1$ and $v_2$ as follows:

$$\left[v_1,\; v_2,\; v_1 \odot v_2,\; v_1 - v_2\right] \tag{19}$$
This concatenated feature is then fed to a linear layer and a GELU activation for binary classification.
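A sketch of this dual-encoder head is given below; the four-feature concatenation follows the common S4-style retrieval head, and the GELU-then-linear ordering is our assumption.

```python
import torch
import torch.nn as nn

class RetrievalHead(nn.Module):
    """Mean-pool two encoded documents, build interaction features, and classify."""
    def __init__(self, d_model, n_classes=2):
        super().__init__()
        self.out = nn.Linear(4 * d_model, n_classes)

    def forward(self, seq1, seq2):
        v1, v2 = seq1.mean(dim=1), seq2.mean(dim=1)            # (batch, d_model) each
        feats = torch.cat([v1, v2, v1 * v2, v1 - v2], dim=-1)  # Eq. (19)
        return self.out(torch.nn.functional.gelu(feats))
```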
The learning rate is adjusted with warmup steps=1,000 and total training steps=50,000. The SSM learning rate is applied to the SSM parameters.
E.2.4 Image
The learning rate is adjusted by ReduceLROnPlateau with factor=0.6 and patience=5. The SSM learning rate is applied to the SSM parameters. Data augmentation, including horizontal flips and random crops, is applied.
E.2.5 Pathfinder
The learning rate is adjusted with warmup steps=5,000 and total training steps=40,000. The SSM learning rate is applied to the SSM parameters.
E.2.6 Path-X
The learning rate is adjusted with warmup steps=10,000 and total training steps=1,000,000. The SSM learning rate is applied to the SSM parameters. $\Delta$ is initialized by uniformly sampling from [0.0001, 0.1]. 50% of the training set is used before epoch 110; the validation and test sets remain unchanged. A scale factor of 0.0625 is additionally applied.
E.2.7 Speech Commands 10 - MFCC
The learning rate is adjusted by ReduceLROnPlateau with factor=0.2 and patience=5. The SSM learning rate is applied to the SSM parameters.
E.2.8 Speech Commands 10
The learning rate is adjusted by ReduceLROnPlateau with factor=0.2 and patience=10. The SSM learning rate is applied to the SSM parameters.
E.2.9 Speech Commands 35
The learning rate is adjusted over 270,000 total training steps.
E.2.10 Sequential MNIST
The learning rate is adjusted by ReduceLROnPlateau with factor=0.2 and patience=10. The SSM learning rate is applied to the SSM parameters.
E.2.11 Permuted Sequential MNIST
The learning rate is adjusted with warmup steps=1,000 and total training steps=81,000. The SSM learning rate is applied to the SSM parameters.
E.2.12 Sequential CIFAR
The same hyperparameters are used as in the LRA Image task.
E.3 Dataset Details
Here, we provide more detailed introductions to LRA, Speech Commands, and 1-D image classification. This work follows the same data preprocessing procedures as S4 and S5. For the preprocessing details of each task, please refer to the code we provide at https://github.com/leonty1/DeepLDNN.
LRA
ListOps: The ListOps dataset contains mathematical operations performed on lists of single-digit integers, expressed in prefix notation [40]. The goal is to predict each complete sequence's solution, which is also a single-digit integer; this constitutes a ten-way balanced classification problem. For example, [MIN 2 9 [MAX 4 7 ] 0 ] has the solution 0. All sequences have a uniform length of 2,000 after zero padding. The dataset has a total of 10,000 samples, split 8:1:1 into training, validation, and test sets.
Text: This dataset is based on the IMDB sentiment dataset. The task is to classify the sentiment of a given movie review (text) as either positive or negative; for example, a positive review reads 'Probably my all-time favorite movie, …'. The maximum length of each sequence is 4,096. IMDB contains 25,000 training examples and 25,000 testing examples.
Retrieval: This task measures the similarity between two sequences, based on the AAN dataset [41]. The maximum length of each sequence is 4,000. It is a binary classification task with dedicated training, validation, and test splits.
Image: This task is based on the CIFAR-10 dataset [42]. Each grayscale CIFAR-10 image has a resolution of 32×32 and is flattened into a 1-D sequence for ten-way classification. All sequences have a length of 1,024. The dataset is split into training, validation, and test sets.
Pathfinder: This task aims to classify whether the two small circles depicted in an image are connected by a dashed path, constituting a binary classification task [43]. Each grayscale image has a size of 32×32 and is flattened into a sequence of length 1,024. The examples are split 8:1:1 for training, validation, and testing.
Path-X: A more challenging version of Pathfinder. The image resolution is increased to 128×128, resulting in a sixteenfold increase in sequence length, from 1,024 to 16,384.
Raw Speech Commands
Speech Commands-35: This dataset contains audio recordings of 35 different spoken words [44]. The task is to determine which word a given audio clip contains, a multi-class classification problem with 35 categories. There are two sampling rates, 16 kHz and 8 kHz. All audio sequences have the same length: 16,000 if sampled at 16 kHz or 8,000 if sampled at 8 kHz. The dataset is split into training, validation, and testing sets.
Speech Commands-10: This database contains ten categories of audio, a subset of Speech Commands-35.
Speech Commands-MFCC: The original audio in Speech Commands-10 is pre-processed into MFCC features with length of 161.
Pixel-level 1-D Image Classification
Sequential MNIST (sMNIST): 10-way digit classification from a 28×28 grayscale image of a handwritten digit, where the input image is flattened into a scalar sequence of length 784.
Permuted Sequential MNIST (psMNIST): This task also performs 10-way digit classification from a grayscale image of a handwritten digit. The original image is first flattened into a sequence of length 784; this sequence is then rearranged in a fixed order.
Sequential CIFAR (sCIFAR): The color version of the Image task, where each pixel is an (R, G, B) triple.
E.4 Implementation Configurations
All the experiments are conducted with:
- Operating System: Windows 10, version 22H2
- CPU: AMD Ryzen Threadripper 3960X 24-Core Processor @ 3.8 GHz
- GPU: NVIDIA GeForce RTX 3090 with 24 GB of memory
- Software: Python 3.9.12, CUDA 11.3, PyTorch [45] 1.12.1.
Appendix F PyTorch Implementation of LDNN Layer
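The full implementation is released at https://github.com/leonty1/DeepLDNN. Since the original code listing did not survive extraction, the following is a minimal, self-contained sketch of a diagonal MIMO SSM layer that is consistent with Table 2 (ZOH discretization, Vandermonde-style kernel, FFT convolution); all names, shapes, and initialization details below are ours and may differ from the released code.

```python
import torch
import torch.nn as nn

class LDNNLayerSketch(nn.Module):
    """Minimal sketch of a diagonal MIMO SSM layer (not the official implementation)."""

    def __init__(self, H, N, dt_min=0.001, dt_max=0.1):
        super().__init__()
        # Diagonal state matrix Lambda; a negative real part is enforced in forward().
        self.lambda_re = nn.Parameter(torch.rand(N))
        self.lambda_im = nn.Parameter(torch.randn(N))
        # Input/output projections and diagonal feedthrough D.
        self.B = nn.Parameter(torch.randn(N, H) / N ** 0.5)
        self.C = nn.Parameter(torch.randn(H, N) / N ** 0.5)
        self.D = nn.Parameter(torch.ones(H))
        # Vector-valued time step Delta, stored in log space.
        self.log_dt = nn.Parameter(torch.log(dt_min + (dt_max - dt_min) * torch.rand(N)))

    def forward(self, u):
        """u: (batch, H, L) real input; returns (batch, H, L)."""
        L = u.shape[-1]
        dt = torch.exp(self.log_dt)                                   # (N,)
        lam = -torch.relu(self.lambda_re) + 1j * self.lambda_im       # negative real parts
        # ZOH discretization (diagonal case), Eqs. (7)-(8).
        A_bar = torch.exp(dt * lam)                                   # (N,)
        B_bar = ((A_bar - 1.0) / lam)[:, None] * self.B.to(A_bar.dtype)   # (N, H)
        # Vandermonde-style kernel: powers of A_bar over the sequence length.
        k = torch.arange(L, device=u.device)
        vander = torch.exp((dt * lam)[:, None] * k[None, :])          # (N, L) == A_bar ** k
        # System kernel mapping input channel i to output channel o at each lag.
        K = torch.einsum("on,nl,ni->oil", self.C.to(A_bar.dtype), vander, B_bar).real
        # Causal convolution via FFT; pad to 2L to avoid circular wrap-around.
        K_f = torch.fft.rfft(K, n=2 * L)                              # (H, H, F)
        U_f = torch.fft.rfft(u, n=2 * L)                              # (batch, H, F)
        y = torch.fft.irfft(torch.einsum("oif,bif->bof", K_f, U_f), n=2 * L)[..., :L]
        return y + self.D[None, :, None] * u
```

In a full model, this layer would be stacked with normalization, the gated activation of Appendix E.1, and residual connections.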
References
- [1] Hongyu Hè and Marko Kabic. A unified view of long-sequence models towards million-scale dependencies. arXiv preprint arXiv:2302.06218, 2023.
- [2] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems, 33:1474–1487, 2020.
- [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [4] Jimmy T.H. Smith, Andrew Warrington, and Scott W. Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
- [5] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- [6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [7] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
- [8] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- [9] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
- [10] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
- [11] Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. Luna: Linear unified nested attention. Advances in Neural Information Processing Systems, 34:2441–2453, 2021.
- [12] James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. FNet: Mixing tokens with Fourier transforms. arXiv preprint arXiv:2105.03824, 2021.
- [13] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14138–14148, 2021.
- [14] Zhenhai Zhu and Radu Soricut. H-Transformer-1D: Fast one-dimensional hierarchical attention for sequences. arXiv preprint arXiv:2107.11906, 2021.
- [15] David W. Romero, David M. Knigge, Albert Gu, Erik J. Bekkers, Efstratios Gavves, Jakub M. Tomczak, and Mark Hoogendoorn. Towards a general purpose CNN for long range dependencies in ND. arXiv preprint arXiv:2206.03398, 2022.
- [16] Lei Cheng, Ruslan Khalitov, Tong Yu, Jing Zhang, and Zhirong Yang. Classification of long sequential data using circular dilated convolutional neural networks. Neurocomputing, 518:50–59, 2023.
- [17] Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022.
- [18] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.
- [19] Trieu Trinh, Andrew Dai, Thang Luong, and Quoc Le. Learning longer-term dependencies in RNNs with auxiliary losses. In International Conference on Machine Learning, pages 4965–4974. PMLR, 2018.
- [20] Yulia Rubanova, Ricky T.Q. Chen, and David K. Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. Advances in Neural Information Processing Systems, 32, 2019.
- [21] Patrick Kidger, James Morrill, James Foster, and Terry Lyons. Neural controlled differential equations for irregular time series. Advances in Neural Information Processing Systems, 33:6696–6707, 2020.
- [22] Mario Lezcano-Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, pages 3794–3803. PMLR, 2019.
- [23] N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney. Lipschitz recurrent neural networks. arXiv preprint arXiv:2006.12070, 2020.
- [24] David W. Romero, Anna Kuzina, Erik J. Bekkers, Jakub M. Tomczak, and Mark Hoogendoorn. CKConv: Continuous kernel convolution for sequential data. arXiv preprint arXiv:2102.02611, 2021.
- [25] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.
- [26] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in Neural Information Processing Systems, 34:572–585, 2021.
- [27] Naoki Nonaka and Jun Seita. In-depth benchmarking of deep neural network architectures for ECG diagnosis. In Machine Learning for Healthcare Conference, pages 414–439. PMLR, 2021.
- [28] Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher Ré. How to train your HiPPO: State space models with generalized orthogonal basis projections. arXiv preprint arXiv:2206.12037, 2022.
- [29] Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. arXiv preprint arXiv:2209.12951, 2022.
- [30] David W. Romero, Robert-Jan Bruintjes, Jakub M. Tomczak, Erik J. Bekkers, Mark Hoogendoorn, and Jan C. van Gemert. FlexConv: Continuous kernel convolutions with differentiable kernel sizes. arXiv preprint arXiv:2110.08059, 2021.
- [31] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. arXiv preprint arXiv:1810.06682, 2018.
- [32] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
- [33] Albert Gu, Caglar Gulcehre, Thomas Paine, Matt Hoffman, and Razvan Pascanu. Improving the gating mechanism of recurrent neural networks. In International Conference on Machine Learning, pages 3800–3809. PMLR, 2020.
- [34] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [35] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A. Hasegawa-Johnson, and Thomas S. Huang. Dilated recurrent neural networks. Advances in Neural Information Processing Systems, 30, 2017.
- [36] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466, 2018.
- [37] Aaron Voelker, Ivana Kajić, and Chris Eliasmith. Legendre memory units: Continuous-time representation in recurrent neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- [38] T. Konstantin Rusch and Siddhartha Mishra. UnICORNN: A recurrent model for learning very long time dependencies. In International Conference on Machine Learning, pages 9168–9178. PMLR, 2021.
- [39] Narsimha Reddy Chilkuri and Chris Eliasmith. Parallelizing Legendre memory unit training. In International Conference on Machine Learning, pages 1898–1907. PMLR, 2021.
- [40] Nikita Nangia and Samuel R. Bowman. ListOps: A diagnostic dataset for latent tree learning. arXiv preprint arXiv:1804.06028, 2018.
- [41] Dragomir R. Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. The ACL Anthology Network corpus. Language Resources and Evaluation, 47:919–944, 2013.
- [42] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- [43] Drew Linsley, Junkyung Kim, Vijay Veerabadran, Charles Windolf, and Thomas Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. Advances in Neural Information Processing Systems, 31, 2018.
- [44] Pete Warden. Speech Commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
- [45] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.