Appendix for Linear Dynamics-embedded Neural Network for Long-Sequence Modeling (2024)

Tongyi Liang, Han-Xiong Li
Department of Systems Engineering, City University of Hong Kong
tyliang4-c@my.cityu.edu.hk, mehxli@cityu.edu.hk
Corresponding authors

Abstract

This appendix provides all necessary materials for the paper 'Linear Dynamics-embedded Neural Network for Long-Sequence Modeling', including model details, experimental configurations, and the PyTorch implementation. The code is available at https://github.com/leonty1/DeepLDNN.

Contents:

  • Appendix A: Notations.

  • Appendix B: Model Details.

  • Appendix B.1: Convolutional View of Continuous SSMs.

  • Appendix B.2: Numerical Discretization.

  • Appendix B.3: Parameterization and Initialization of LDNN.

  • Appendix B.4: HiPPO Initialization.

  • Appendix C: Comparison with Related Models.

  • Appendix C.1: Structure Comparison of SSMs.

  • Appendix C.1: Parameterization and Initialization of SSMs.

  • Appendix C.2: Relationship between LDNN, S4, and S5.

  • Appendix D: Supplementary Results.

  • Appendix E: Experimental Configurations for Reproducibility.

  • Appendix F: PyTorch Implementation of LDNN Layer.

Appendix A Notations

Notation | Description
SSMs | State space models
$u(t) \in \mathbb{R}^{H}$ | System input sequence
$x(t) \in \mathbb{R}^{N}$ | System state
$y(t) \in \mathbb{R}^{M}$ | System output sequence
$A \in \mathbb{R}^{N \times N}$ | System matrix in continuous SSMs
$B \in \mathbb{R}^{N \times H}$ | Input matrix in continuous SSMs
$C \in \mathbb{R}^{M \times N}$ | Output matrix in continuous SSMs
$D \in \mathbb{R}^{M \times H}$ | Direct transition matrix in continuous SSMs
$\bar{A} \in \mathbb{R}^{N \times N}$ | System matrix in discrete SSMs
$\bar{B} \in \mathbb{R}^{N \times H}$ | Input matrix in discrete SSMs
$\bar{C} \in \mathbb{R}^{M \times N}$ | Output matrix in discrete SSMs
$\bar{D} \in \mathbb{R}^{M \times H}$ | Direct transition matrix in discrete SSMs
$\Delta \in \mathbb{R}_{+}$ | Discrete time step in discrete SSMs
$K$ | State kernel in convolutional SSMs
$V$ | System kernel in convolutional SSMs
$\Lambda$ | Diagonal system matrix in diagonal SSMs
FFT | Fast Fourier Transform

Appendix B Model Details

B.1 Convolutional View of Continuous SSMs

Here, we introduce the convolutional view of continuous SSMs [1].

$$
\begin{aligned}
x(t) &= e^{At}x(0) + \int_{0}^{t} e^{A(t-\tau)} B u(\tau)\, d\tau \\
&= e^{At}x(0) + \int_{0}^{t} e^{A\tau} B u(t-\tau)\, d\tau \qquad (1) \\
&= e^{At}x(0) + \int_{0}^{t} h(\tau)\, u(t-\tau)\, d\tau \qquad (2) \\
&= e^{At}x(0) + (h \ast u)(t) \qquad (3)
\end{aligned}
$$

The change of variables $\tau \to t-\tau$ gives Eq. (1). Letting $h(t) = e^{At}B$, we obtain the convolutional SSM (3) directly from the definition of convolution.
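To make the convolutional view concrete, the following minimal sketch (not from the released code; all names and numbers are illustrative) checks numerically that, for a scalar SSM with $x(0)=0$, simulating the ODE and convolving the input with the kernel $h(t)=e^{At}B$ yield the same state up to discretization error.

import torch

A, B = -0.5, 1.0               # scalar system and input matrices
dt, T = 1e-3, 5.0              # fine time step and horizon
t = torch.arange(0.0, T, dt)
u = torch.sin(2.0 * t)         # an arbitrary input signal

# (i) Forward-Euler simulation of x'(t) = A x(t) + B u(t), with x(0) = 0
x_euler = torch.zeros_like(t)
for k in range(1, len(t)):
    x_euler[k] = x_euler[k - 1] + dt * (A * x_euler[k - 1] + B * u[k - 1])

# (ii) Discretized convolution with the kernel h(t) = e^{At} B
h = torch.exp(A * t) * B
x_conv = dt * torch.stack([torch.sum(h[:k + 1].flip(0) * u[:k + 1]) for k in range(len(t))])

print(torch.max(torch.abs(x_euler - x_conv)))  # small, and shrinks as dt -> 0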

B.2 Numerical Discretization

B.2.1 Zero-order Hold Method

The state transition function is an ordinary differential equation (ODE). We can obtain its analytical solution as follows.

$$
\begin{aligned}
\dot{x}(t) &= Ax(t) + Bu(t) \\
\dot{x}(t) - Ax(t) &= Bu(t) \\
e^{-tA}\dot{x}(t) - e^{-tA}Ax(t) &= e^{-tA}Bu(t) \\
\frac{d}{dt}\left[e^{-tA}x(t)\right] &= e^{-tA}Bu(t) \\
\int_{0}^{t}\frac{d}{d\tau}\left[e^{-\tau A}x(\tau)\right] d\tau &= \int_{0}^{t} e^{-\tau A}Bu(\tau)\, d\tau \\
e^{-tA}x(t) - x(0) &= \int_{0}^{t} e^{-\tau A}Bu(\tau)\, d\tau \\
x(t) &= e^{tA}x(0) + e^{tA}\int_{0}^{t} e^{-\tau A}Bu(\tau)\, d\tau \\
x(t) &= e^{At}x(0) + \int_{0}^{t} e^{A(t-\tau)}Bu(\tau)\, d\tau \qquad (4)
\end{aligned}
$$

Eq. (4) is the analytical solution for $x(t)$. We can rewrite Eq. (4) with an arbitrary initial time $t_0$:

$$
x(t) = e^{A(t-t_{0})}x(t_{0}) + \int_{t_{0}}^{t} e^{A(t-\tau)}Bu(\tau)\, d\tau \qquad (5)
$$

When we sample $u(t)$ with time interval $\Delta$, $t$ becomes $k\Delta$, where $k = 0, 1, \ldots$ is a non-negative integer. The zero-order hold (ZOH) method assumes $u(t) = u(k\Delta)$ for $t \in [k\Delta, (k+1)\Delta]$. Thus, we have

$$
x((k+1)\Delta) = e^{A\Delta}x(k\Delta) + \int_{k\Delta}^{(k+1)\Delta} e^{A((k+1)\Delta-\tau)}\, d\tau\, B\, u(k\Delta) \qquad (6)
$$

We abbreviate $x((k+1)\Delta)$, $x(k\Delta)$, and $u(k\Delta)$ as $x_{k+1}$, $x_{k}$, and $u_{k}$, respectively, which gives the discrete transition function

$$
x_{k+1} = \bar{A}x_{k} + \bar{B}u_{k} \qquad (7)
$$

with $\bar{A} = e^{A\Delta}$ and $\bar{B} = \int_{k\Delta}^{(k+1)\Delta} e^{A((k+1)\Delta-\tau)}\, d\tau\, B$.

We can further simplify $\bar{B}$ assuming that $A$ is invertible:

$$
\begin{aligned}
\bar{B} &= \int_{k\Delta}^{(k+1)\Delta} e^{A((k+1)\Delta-\tau)}\, d\tau\, B \\
&= \int_{0}^{\Delta} e^{At}\, dt\, B \\
&= \int_{0}^{\Delta} A^{-1}\frac{d e^{At}}{dt}\, dt\, B \\
&= A^{-1}\left(e^{A\Delta} - I\right)B \qquad (8)
\end{aligned}
$$
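The ZOH formulas above are straightforward to implement for a diagonal system matrix. The sketch below (illustrative, not the paper's released code) computes $\bar{A}$ and $\bar{B}$ elementwise for $A = \mathrm{diag}(\Lambda)$ and cross-checks them against the dense-matrix expressions via torch.matrix_exp.

import torch

N, H, Delta = 4, 3, 0.01
Lambda = -torch.rand(N) - 0.1          # negative real eigenvalues: stable and invertible
B = torch.randn(N, H)

# Diagonal case: A = diag(Lambda), so everything is elementwise.
A_bar = torch.exp(Lambda * Delta)                   # diagonal of exp(A * Delta)
B_bar = ((A_bar - 1.0) / Lambda).unsqueeze(-1) * B  # A^{-1}(exp(A * Delta) - I) B

# Cross-check against the dense-matrix formulas of Eq. (8).
A_dense = torch.diag(Lambda)
A_bar_dense = torch.matrix_exp(A_dense * Delta)
B_bar_dense = torch.linalg.inv(A_dense) @ (A_bar_dense - torch.eye(N)) @ B
print(torch.allclose(torch.diag(A_bar_dense), A_bar, atol=1e-6))
print(torch.allclose(B_bar_dense, B_bar, atol=1e-6))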

B.2.2 Numerical Approximation

Based on the Taylor series expansion, the first derivative of $x$ can be approximated by numerical differentiation. Using the forward Euler rule, Eq. (9), or the backward Euler rule, Eq. (10), to approximate $\dot{x}$, we obtain $\bar{A}$ and $\bar{B}$ as described in Eq. (11).

$$
\frac{x_{k+1}-x_{k}}{\Delta} \approx \dot{x}_{k} = Ax_{k} + Bu_{k} \qquad (9)
$$
$$
\frac{x_{k+1}-x_{k}}{\Delta} \approx \dot{x}_{k+1} = Ax_{k+1} + Bu_{k} \qquad (10)
$$

More generally, we can use the generalized bilinear transformation (GBT) method:

$$
\begin{cases}
\bar{A} = (I - \alpha\Delta A)^{-1}\left(I + (1-\alpha)\Delta A\right) \\
\bar{B} = (I - \alpha\Delta A)^{-1}\Delta B
\end{cases} \qquad (11)
$$

There are three special cases of the GBT for different $\alpha$: the forward Euler method is GBT with $\alpha = 0$, the bilinear method is GBT with $\alpha = 0.5$, and the backward Euler method is GBT with $\alpha = 1$. All of these methods approximate the differential equation based on the Taylor series expansion.
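As a concrete illustration (a sketch under the notation of Eq. (11), not the released implementation), the following function implements the GBT discretization and recovers the three special cases by varying alpha.

import torch

def gbt_discretize(A, B, Delta, alpha):
    """Return (A_bar, B_bar) of Eq. (11) for x_{k+1} = A_bar x_k + B_bar u_k."""
    I = torch.eye(A.shape[0], dtype=A.dtype)
    M = torch.linalg.inv(I - alpha * Delta * A)
    A_bar = M @ (I + (1.0 - alpha) * Delta * A)
    B_bar = M @ (Delta * B)
    return A_bar, B_bar

A = torch.tensor([[-1.0, 0.5], [0.0, -2.0]])
B = torch.tensor([[1.0], [0.5]])
for alpha in (0.0, 0.5, 1.0):  # forward Euler, bilinear, backward Euler
    A_bar, B_bar = gbt_discretize(A, B, Delta=0.01, alpha=alpha)
    print(alpha, A_bar)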

B.3 Parameterization and Initialization of LDNN

The diagonal SSMs have learnable parameters $\Lambda, B, C, D$, and a time step $\Delta$ for discretization. We introduce the parameterization and initialization of these parameters in turn.

Parameter $\Lambda$. According to Proposition 1 in Section III.A, all elements of $\Lambda$ must have negative real parts to ensure state convergence. Thus, we restrict $\Lambda$ with an enforcing function $f_{+}$, expressed as $-f_{+}(\mathrm{Re}(\Lambda)) + \mathrm{Im}(\Lambda)i$. The enforcing function $f_{+}$ outputs positive real numbers and can take many forms, for example, the Gaussian function, the rectified linear unit (ReLU), or the sigmoid function. $\Lambda$ can be initialized randomly, with a constant, or with the eigenvalues of specially structured matrices such as the HiPPO matrix introduced in [2]. We initialize $\Lambda$ via HiPPO throughout this work.

Parameters $B$ and $C$. $B$ and $C$ are the parameters of the linear projection functions. We parameterize them as learnable full matrices. $B$ is initialized with random numbers under the HiPPO framework, as introduced in Section B.4, and $C$ is initialized from a truncated normal distribution.

Parameter $D$. Different parameterizations of $D$ have different meanings. If we parameterize $D$ as an untrainable zero matrix, the output of the SSM depends only on the state. When the input and output have the same size, we can parameterize it as an identity matrix, which is also known as a residual connection [3]. In this work, $D$ is parameterized as a trainable diagonal matrix initialized to the constant 1.

Parameter $\Delta$. $\Delta \in \mathbb{R}$ is a scalar for a given SSM. We set it as a learnable parameter and initialize it by sampling uniformly from a bounded interval; this work uses [0.001, 0.1] as the default choice if not otherwise specified. We experimentally find that relaxing the size of $\Delta$ from $\mathbb{R}$ to $\mathbb{R}^{N}$ improves model accuracy, which is also reported in S5 [4]. Therefore, $\Delta \in \mathbb{R}^{N}$ is used across all experiments.
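The following sketch summarizes one possible way to realize this parameterization in PyTorch. The module name, the choice of the sigmoid as the enforcing function, and the plain random initialization of $B$ and $C$ are illustrative placeholders; the HiPPO-based initialization of Section B.4 would replace them in practice.

import torch
import torch.nn as nn

class DiagonalSSMParams(nn.Module):
    """Holds Lambda, B, C, D, and Delta for one diagonal SSM (illustrative only)."""

    def __init__(self, N, H, M, delta_min=0.001, delta_max=0.1):
        super().__init__()
        # Unconstrained parameters; the HiPPO-based initialization of Section B.4
        # would replace the plain random initialization of Lambda, B, and C.
        self.Lambda_re = nn.Parameter(torch.randn(N))
        self.Lambda_im = nn.Parameter(torch.randn(N))
        self.B = nn.Parameter(torch.randn(N, H) / H ** 0.5)
        self.C = nn.Parameter(nn.init.trunc_normal_(torch.empty(M, N), std=0.1))
        self.D = nn.Parameter(torch.ones(M))  # trainable diagonal D, initialized to 1
        self.Delta = nn.Parameter(delta_min + (delta_max - delta_min) * torch.rand(N))

    def Lambda(self):
        # Enforcing function f_+ (here the sigmoid, one of the options listed above)
        # keeps the real part of Lambda strictly negative.
        return -torch.sigmoid(self.Lambda_re) + 1j * self.Lambda_im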

B.4 HiPPO Initialization

HiPPO theory introduces a way to compress continuous signals and discrete time series by projection onto polynomial bases [2]. Continuous SSMs, as a particular type of ordinary differential equation (ODE), also fall within this framework. Thus, the structured HiPPO matrix serves as a good initialization for $\Lambda$ and $B$. Following [4], we choose the HiPPO-LegS matrix for initialization, defined as

$$
\mathbf{A}_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2}, & n > k \\ n+1, & n = k \\ 0, & n < k \end{cases} \qquad (12)
$$
$$
b_{n} = (2n+1)^{\frac{1}{2}} \qquad (13)
$$

Naively diagonalizing $\mathbf{A}_{nk}$ to initialize $\Lambda$ is numerically infeasible and unstable. Gu et al. [5] solved this problem by equivalently transforming $\mathbf{A}_{nk}$ into a normal plus low-rank (NPLR) form, which is expressed as a normal matrix

$$
\mathbf{A}^{\mathrm{Normal}} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^{*} \qquad (14)
$$

together with a low-rank term.

$$
\mathbf{A} = \mathbf{A}^{\mathrm{Normal}} - \mathbf{P}\mathbf{Q}^{\top} = \mathbf{V}\left(\mathbf{\Lambda} - (\mathbf{V}^{*}\mathbf{P})(\mathbf{V}^{*}\mathbf{Q})^{*}\right)\mathbf{V}^{*} \qquad (15)
$$

where $\mathbf{V} \in \mathbb{C}^{N \times N}$ is unitary, $\mathbf{\Lambda} \in \mathbb{C}^{N \times N}$ is diagonal, and $\mathbf{P}, \mathbf{Q} \in \mathbb{R}^{N \times r}$ form a low-rank factorization.

The HiPPO-LegS matrix can be further rewritten as

$$
\mathbf{A}_{nk} = \mathbf{A}^{\mathrm{Normal}} - \mathbf{P}\mathbf{P}^{\top} \qquad (16)
$$

where

$$
\mathbf{A}^{\mathrm{Normal}}_{nk} = -\begin{cases} (n+\tfrac{1}{2})^{1/2}(k+\tfrac{1}{2})^{1/2}, & n > k \\ \tfrac{1}{2}, & n = k \\ -(n+\tfrac{1}{2})^{1/2}(k+\tfrac{1}{2})^{1/2}, & n < k \end{cases} \qquad (17)
$$
$$
\mathbf{P}_{n} = (n+\tfrac{1}{2})^{\frac{1}{2}} \qquad (18)
$$

We initialize $\Lambda$ with the eigenvalues of $\mathbf{A}^{\mathrm{Normal}}$. Following S5 [4], the eigenvectors of $\mathbf{A}^{\mathrm{Normal}}$ are used to initialize $B$ and $C$.
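A compact sketch of this initialization path is given below: it builds the normal part of the HiPPO-LegS matrix from Eqs. (17)-(18), checks that subtracting $\mathbf{P}\mathbf{P}^{\top}$ recovers Eq. (12), and takes its eigendecomposition. Function and variable names are illustrative, not the released code.

import torch

def hippo_legs_normal(N):
    """Normal part of HiPPO-LegS (Eq. (17)) and the low-rank vector P (Eq. (18))."""
    n = torch.arange(N, dtype=torch.float64)
    P = torch.sqrt(n + 0.5)
    S = P.unsqueeze(1) * P.unsqueeze(0)           # sqrt(n + 1/2) * sqrt(k + 1/2)
    lower = torch.arange(N).unsqueeze(1) > torch.arange(N).unsqueeze(0)
    A_normal = -torch.where(lower, S, -S)         # skew-symmetric off-diagonal part
    A_normal.fill_diagonal_(-0.5)
    return A_normal, P

N = 8
A_normal, P = hippo_legs_normal(N)
Lambda, V = torch.linalg.eig(A_normal)            # Lambda initializes the diagonal SSM;
                                                  # V is used for B and C (following S5)
A_legs = A_normal - torch.outer(P, P)             # recovers the HiPPO-LegS matrix, Eq. (12)
print(torch.allclose(A_legs.triu(1), torch.zeros(N, N, dtype=torch.float64)))  # upper part is 0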

Appendix C Comparison with Related Models

C.1 Structure Comparison of SSMs

According to the SSM's input and output dimensions, related work can be divided into two categories. One type is built on SSMs with single-input single-output (SISO), including S4, DSS, and S4D; the other is based on SSMs with multi-input multi-output (MIMO), including S5 and the LDNN in this work.

As shown in Fig. 1, a SISO SSM takes a univariate sequence as input and output, while a MIMO SSM directly models multivariate sequences. Usually, multiple SISO SSMs model the channels of a multivariate sequence independently, and a linear layer then fuses the features, as in S4 and DSS. Compared with SISO SSMs, MIMO SSMs do not require this additional linear layer.

Table 2 summarizes the structure of different SSM-based models. Except for S4, all methods are based on diagonal SSMs. S5 is the only one that directly uses the recurrent SSM for both inference and learning; the other models can perform recurrent inference, but their learning is based on the convolutional SSM.

[Figure 1: structures of SISO and MIMO SSMs.]

Table 2: Structure comparison of SSM-based models.

Model | Type | Structure | Convolutional Kernel Computation | Convolution | Recurrence | Discretization
S4 | SISO | DPLR | Cauchy | FFT | vanilla | Bilinear
DSS | SISO | Diagonal | softmax | FFT | vanilla | ZOH
S4D | SISO | Diagonal | Vandermonde | FFT | vanilla | Optional
S5 | MIMO | Diagonal | - | - | Scan operation | Bilinear
LDNN | MIMO | Diagonal | Vandermonde | FFT | vanilla | ZOH

C.2 Relationship Between S4, S5, and LDNN

S4 and S5 are the most representative SISO and MIMO SSM works, respectively. Here, we analyze their relationship to LDNN. Fig. 2 presents the computational flow of these models. The comparison is summarized as follows:

  • S4 is based on the SISO SSM, while S5 and LDNN are based on the MIMO SSM.

  • S4 uses a DPLR parameterization for the system matrix $A$; S5 and LDNN both use diagonal SSMs.

  • All three models can perform inference in recurrent mode, but they differ in the learning process: S4 and LDNN learn in the convolutional representation, whereas S5 learns in the recurrent representation.

  • S4 computes both the kernel and the convolution in the frequency domain, while LDNN computes the convolution in the frequency domain and the kernel in the time domain.

  • Multi-Head LDNN and multiple copies of S4 are both block-diagonal MIMO SSMs, but they differ in the structure of the SISO and MIMO SSMs, as shown in Fig. 1.

  • S4 with $H$ copies is the special case of Multi-Head LDNN with head number $S$ equal to the input size $H$.

  • S5 is equivalent to Multi-Head LDNN with $S = 1$.

  • $A$, $B$, and $C$ in S4 and S5 are complex-valued, but LDNN only parameterizes $A$ as complex.

  • The bidirectional setting in LDNN does not introduce additional parameters, whereas it does in S4 and S5.

[Figure 2: computational flow of S4, S5, and LDNN.]

Appendix D Supplementary Results

D.1 Extended Results on LRA

Model | ListOps | Text | Retrieval | Image | Pathfinder | Path-X | Avg.
(Length) | 2,000 | 4,096 | 4,000 | 1,024 | 1,024 | 16,384 | -
Transformer [6] | 36.37 | 64.27 | 57.46 | 42.44 | 71.40 | - | 53.66
Reformer [7] | 37.27 | 56.10 | 53.40 | 38.07 | 68.50 | - | 50.56
Performer [8] | 18.01 | 65.40 | 53.82 | 42.77 | 77.05 | - | 51.18
Linear Trans. [9] | 16.13 | 65.90 | 53.09 | 42.34 | 75.30 | - | 50.46
BigBird [10] | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | - | 54.17
Luna-256 [11] | 37.25 | 64.57 | 79.29 | 47.38 | 77.72 | - | 59.37
FNet [12] | 35.33 | 65.11 | 59.61 | 38.67 | 77.80 | - | 54.42
Nyströmformer [13] | 37.15 | 65.52 | 79.56 | 41.58 | 70.94 | - | 57.46
H-Transformer-1D [14] | 49.53 | 78.69 | 63.99 | 46.05 | 68.78 | - | 61.42
CCNN [15] | 43.60 | 84.08 | - | 88.90 | 91.51 | - | 68.02
CDIL-CNN [16] | 60.60 | 87.62 | 84.27 | 64.49 | 91.00 | - | 77.59
S4 [5] | 59.60 | 86.82 | 90.90 | 88.65 | 94.2 | 96.35 | 86.09
DSS [17] | 60.6 | 84.8 | 87.8 | 85.7 | 84.6 | 87.8 | 81.88
S4D [18] | 60.47 | 86.18 | 89.46 | 88.19 | 93.06 | 91.95 | 84.89
S5 [4] | 62.15 | 89.31 | 91.40 | 88.00 | 95.33 | 98.58 | 87.46
LDNN | 62.20 | 88.25 | 90.15 | 87.25 | 93.87 | 92.76 | 85.75

D.2 Extended Results on Raw Speech Classification

Model | MFCC | 16kHz | 8kHz
(Length) | (784) | (16,000) | (8,000)
Transformer [19, 6] | 90.75 | - | -
Performer [8] | 80.85 | 30.77 | 30.68
ODE-RNN [20] | 65.9 | - | -
NRDE [21] | 89.8 | 16.49 | 15.12
ExpRNN [22] | 82.13 | 11.6 | 10.8
LipschitzRNN [23] | 88.38 | - | -
CKConv [24] | 95.3 | 71.66 | 65.96
WaveGAN-D [25] | - | 96.25 | -
LSSL [26] | 93.58 | - | -
S4 [5] | 93.96 | 98.32 | 96.30
LDNN | 94.46 | 97.59 | 94.23

Model | Parameters | 16kHz | 8kHz
(Length) | - | (16,000) | (8,000)
InceptionNet [27] | 481K | 61.24 | 05.18
ResNet-18 [27] | 216K | 77.86 | 08.74
XResNet-50 [27] | 904K | 83.01 | 07.72
ConvNet [27] | 26.2M | 95.51 | 07.26
S4-LegS [5] | 307K | 96.08 | 91.32
S4-FouT [28] | 307K | 95.27 | 91.59
S4-(LegS/FouT) [28] | 307K | 95.32 | 90.72
S4D-LegS [17] | 306K | 95.83 | 91.08
S4D-Inv [17] | 306K | 96.18 | 91.80
S4D-Lin [17] | 306K | 96.25 | 91.58
Liquid-S4 [29] | 224K | 96.78 | 90.00
S5 [4] | 280K | 96.52 | 94.53
LDNN | 220K | 96.08 | 88.83

D.3 Extended Results on Pixel-level 1-D Image Classification

Model | sMNIST | psMNIST | sCIFAR
(Length) | (784) | (784) | (1,024)
Transformer [19, 6] | 98.9 | 97.9 | 62.2
CCNN [15] | 99.72 | 98.84 | 93.08
FlexTCN [30] | 99.62 | 98.63 | 80.82
CKConv [24] | 99.32 | 98.54 | 63.74
TrellisNet [31] | 99.20 | 98.13 | 73.42
TCN [32] | 99.0 | 97.2 | -
LSTM [33, 34] | 98.9 | 95.11 | 63.01
r-LSTM [19] | 98.4 | 95.2 | 72.2
Dilated GRU [35] | 99.0 | 94.6 | -
Dilated RNN [35] | 98.0 | 96.1 | -
IndRNN [36] | 99.0 | 96.0 | -
expRNN [22] | 98.7 | 96.6 | -
UR-LSTM [33] | 99.28 | 96.96 | 71.00
UR-GRU [33] | 99.27 | 96.51 | 74.4
LMU [37] | - | 97.15 | -
HiPPO-RNN [2] | 98.9 | 98.3 | 61.1
UNIcoRNN [38] | - | 98.4 | -
LMU-FFT [39] | - | 98.49 | -
LipschitzRNN [23] | 99.4 | 96.3 | 64.2
LSSL [26] | 99.53 | 98.76 | 84.65
S4 [5] | 99.63 | 98.70 | 91.80
S4D [18] | - | - | 89.92
Liquid-S4 [29] | - | - | 92.02
S5 [4] | 99.65 | 98.67 | 90.10
LDNN | 99.54 | 98.45 | 88.12

Appendix E Experimental Configurations for Reproducibility

E.1 Hyperparameters

Details of all experiments are described in this part. Table 7 lists the key hyperparameters, including model depth, learning rates, and others.

Table 7: Hyperparameters of all experiments.

Dataset | Batch | Epoch | Depth | Head | H | N | M | LR (SSM) | LR (Others) | Dropout | Prenorm
ListOps | 100 | 80 | 6 | 4 | 256 | 256 | 256 | 0.01 | 0.01 | 0 | False
Text | 16 | 80 | 6 | 256 | 256 | 256 | 256 | 0.001 | 0.004 | 0.1 | True
Retrieval | 32 | 20 | 6 | 64 | 128 | 128 | 128 | 0.001 | 0.002 | 0 | True
Image | 50 | 200 | 6 | 64 | 256 | 512 | 256 | 0.001 | 0.005 | 0.1 | False
Pathfinder | 64 | 300 | 6 | 8 | 192 | 256 | 192 | 0.001 | 0.005 | 0.05 | True
Path-X | 8 | 200 | 6 | 8 | 192 | 256 | 192 | 0.0005 | 0.001 | 0 | True
SC10-MFCC | 16 | 80 | 4 | 32 | 128 | 128 | 128 | 0.001 | 0.006 | 0.1 | False
SC10 | 16 | 150 | 6 | 32 | 128 | 128 | 128 | 0.001 | 0.006 | 0.1 | True
SC35 | 16 | 100 | 6 | 32 | 128 | 128 | 128 | 0.001 | 0.008 | 0.1 | False
sMNIST | 50 | 150 | 4 | 16 | 128 | 96 | 128 | 0.002 | 0.008 | 0.1 | True
psMNIST | 50 | 200 | 4 | 8 | 128 | 128 | 128 | 0.001 | 0.004 | 0.15 | True
sCIFAR | 50 | 200 | 6 | 64 | 256 | 512 | 256 | 0.001 | 0.005 | 0.1 | True

Activation

MIMO SSMs directly model the multivariate sequence, so no additional layer is needed to mix features. Therefore, we follow S5 and use a weighted sigmoid gated unit. Specifically, the LDNN output $\mathbf{y}_k \in \mathbb{R}^{M}$ is fed into the activation function $\mathbf{u}_k' = \mathrm{GELU}(\mathbf{y}_k) \odot \sigma(\mathbf{W}\,\mathrm{GELU}(\mathbf{y}_k))$, where $\mathbf{W} \in \mathbb{R}^{M \times M}$ is a learnable dense matrix. This activation function is the default setting if not otherwise specified.
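A minimal sketch of this gated activation, assuming the (batch, M, length) layout used elsewhere in this appendix; the module name is illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGELU(nn.Module):
    """u' = GELU(y) * sigmoid(W GELU(y)), applied along the channel dimension M."""

    def __init__(self, M):
        super().__init__()
        self.W = nn.Linear(M, M)  # learnable dense matrix W in R^{M x M}

    def forward(self, y):  # y: (batch, M, length)
        g = F.gelu(y)
        gate = torch.sigmoid(self.W(g.transpose(-1, -2)).transpose(-1, -2))
        return g * gate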

Normalization

Either batch or layer normalization is applied before or after LDNN. Batch normalization after LDNN is used if not otherwise specified.

Initialization

All experiments are initialized using the same configuration introduced in B.3.

Loss and Metric

Cross-entropy loss is used for all classification tasks. Binary or multi-class accuracy is used for metric evaluation.

Optimizer

AdamW is used across all experiments. The learning rate applied to the SSM parameters is named $LR_{SSM}$, and the other is named $LR_{other}$. The learning rate is dynamically adjusted by CosineAnnealingLR or ReduceLROnPlateau in PyTorch.
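The sketch below illustrates this optimizer setup with two parameter groups and a cosine schedule; the stand-in model and the name-based filter for the SSM parameters are assumptions for illustration, not the exact training code.

import torch
import torch.nn as nn

class TinyLDNN(nn.Module):  # stand-in for the full model, for illustration only
    def __init__(self):
        super().__init__()
        self.Lambda_re = nn.Parameter(torch.randn(8))
        self.C = nn.Parameter(torch.randn(8, 8))
        self.head = nn.Linear(8, 10)

model = TinyLDNN()
total_steps = 10_000
ssm_names = ("Lambda", "B", "C", "D", "Delta")  # assumed naming of the SSM parameters

ssm_params = [p for n, p in model.named_parameters() if n.split(".")[-1].startswith(ssm_names)]
other_params = [p for n, p in model.named_parameters() if not n.split(".")[-1].startswith(ssm_names)]

optimizer = torch.optim.AdamW(
    [{"params": ssm_params, "lr": 1e-3, "weight_decay": 0.0},   # LR_SSM
     {"params": other_params, "lr": 4e-3}],                     # LR_other
    weight_decay=0.01,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)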

E.2 Task Specific Hyperparameters

Here, we specify any task-specific details, hyperparameters, or architectural differences from the defaults outlined above.

E.2.1 Listops

The bidirectional setting is not used. LeakyReLU activation is applied. $C$ is initialized by HiPPO.

E.2.2 Text

The learning rate is adjusted by ReduceLROnPlateau with factor=0.5 and patience=5. $LR_{other}$ is applied to the SSM parameter $C$.

E.2.3 Retrieval

We follow the experimental configuration in S4. The model takes two documents as input and outputs two sequences. A mean pooling layer then transforms these two sequences into vectors, denoted $y_1$ and $y_2$. Four features are created by concatenating $y_1$ and $y_2$ as follows:

$$
y = [y_{1},\; y_{2},\; y_{1} \ast y_{2},\; y_{1} - y_{2}]. \qquad (19)
$$

This concatenated feature is then fed to a linear layer with a GELU activation for binary classification.
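A small sketch of this retrieval head; the ordering of the GELU and the linear layer, and all sizes, are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalHead(nn.Module):
    def __init__(self, M, num_classes=2):
        super().__init__()
        self.linear = nn.Linear(4 * M, num_classes)

    def forward(self, y1, y2):  # y1, y2: (batch, M) pooled document encodings
        feats = torch.cat([y1, y2, y1 * y2, y1 - y2], dim=-1)  # Eq. (19)
        return self.linear(F.gelu(feats))  # GELU-then-linear ordering is an assumption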

The learning rate is adjusted by CosineAnnealingLR with 1,000 warmup steps and 50,000 total training steps. $LR_{other}$ is applied to the SSM parameter $C$.

E.2.4 Image

The learning rate is adjusted by ReduceLROnPlateau with factor=0.6 and patience=5. $LR_{other}$ is applied to the SSM parameters $B$ and $C$. Data augmentation, including horizontal flips and random crops, is applied.

E.2.5 Pathfinder

The learning rate is adjusted by CosineAnnealingLR with 5,000 warmup steps and 40,000 total training steps. $LR_{other}$ is applied to the SSM parameter $C$.

E.2.6 Path-X

The learning rate is adjusted by CosineAnnealingLR with 10,000 warmup steps and 1,000,000 total training steps. $LR_{other}$ is applied to the SSM parameter $C$. $\Delta$ is initialized by uniformly sampling from [0.0001, 0.1], and a scale factor of 0.0625 is applied to $\Delta$. Only 50% of the training set is used before epoch 110; the validation and testing sets remain unchanged.

E.2.7 Speech Commands 10 - MFCC

The learning rate is adjusted by ReduceLROnPlateau with factor=0.2 and patience=5. $LR_{other}$ is applied to the SSM parameter $C$.

E.2.8 Speech Commands 10

The learning rate is adjusted by ReduceLROnPlateau with factor=0.2 and patience=10. $LR_{other}$ is applied to the SSM parameter $C$.

E.2.9 Speech Commands 35

The learning rate is adjusted by CosineAnnealingLR with 270,000 total training steps.

E.2.10 Sequential MNIST

The learning rate is adjusted by ReduceLROnPlateau with factor=0.2 and patience=10. $LR_{other}$ is applied to the SSM parameters $B$ and $C$.

E.2.11 Permuted Sequential MNIST

The learning rate is adjusted by CosineAnnealingLR with 1,000 warmup steps and 81,000 total training steps. $LR_{other}$ is applied to the SSM parameter $C$.

E.2.12 Sequential CIFAR

The same hyperparameters are used as in LRA-Image.

E.3 Dataset Details

Here, we provide more detailed introductions to LRA, Speech Commands, and 1-D image classification. This work follows the same data preprocessing as S4 and S5. For the preprocessing details of each task, please refer to the code we provide at https://github.com/leonty1/DeepLDNN.

LRA

ListOps: ListOps contains mathematical operations performed on lists of single-digit integers, expressed in prefix notation [40]. The goal is to predict each complete sequence's solution, which is also a single-digit integer, so this is a ten-way balanced classification problem. For example, [MIN 2 9 [MAX 4 7 ] 0 ] has the solution 0. All sequences have a uniform length of 2,000 (shorter sequences are zero-padded). The dataset has a total of 10,000 samples, which are divided 8:1:1 for training, validation, and testing.

Text: This dataset is based on the IMDB sentiment dataset. The task is to classify the sentiment of a given movie review (text) as either positive or negative; for example, a positive comment: 'Probably my all-time favorite movie, ...'. The maximum length of each sequence is 4,096. IMDB contains 25,000 training examples and 25,000 testing examples.

Retrieval: This task measures the similarity between two sequences based on the AAN dataset [41]. The maximum length of each sequence is 4,000, and it is a binary classification task. There are 147,086 training samples, 18,090 validation samples, and 17,437 test samples.

Image: This task is based on the CIFAR-10 dataset [42]. A grayscale CIFAR-10 image has a resolution of 32×32 and is flattened into a 1-D sequence of length 1,024 for ten-way classification. It has 45,000 training examples, 5,000 validation examples, and 10,000 test examples.

Pathfinder: This task is to classify whether the two small circles depicted in a picture are connected by dashed lines, constituting a binary classification task [43]. Each grayscale image has a size of 32×32 and is flattened into a sequence of length 1,024. There are 200,000 examples, which are split 8:1:1 for training, validation, and testing.

Path-X: A more challenging version of Pathfinder. The image resolution is increased to 128×128, resulting in a sixteenfold increase in sequence length, from 1,024 to 16,384.

Raw Speech Commands

Speech Commands-35: This dataset records audio of 35 different words [44]. The task is to determine which word a given audio clip contains; it is a multi-class problem with 35 categories. There are two sampling rates, 16 kHz and 8 kHz. All audio sequences have the same length: 16,000 if sampled at 16 kHz, or 8,000 if sampled at 8 kHz. It contains 24,482 training samples, 5,246 validation samples, and 5,247 testing samples.

Speech Commands-10: This dataset contains ten categories of audio, a subset of Speech Commands-35.

Speech Commands-MFCC: The original audio in Speech Commands-10 is pre-processed into MFCC features with a length of 161.

Pixel-level 1-D Image Classification

Sequential MNIST (sMNIST): Ten-way digit classification from a 28×28 grayscale image of a handwritten digit, where the input image is flattened into a scalar sequence of length 784.

Permuted Sequential MNIST (psMNIST): This task also performs ten-way digit classification from a 28×28 grayscale image of a handwritten digit. The original image is first flattened into a sequence of length 784, and this sequence is then rearranged in a fixed order.
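A short sketch of this preprocessing (the permutation seed is arbitrary here; only the fact that the permutation is fixed across the dataset matters):

import torch

image = torch.rand(28, 28)                   # stand-in for a grayscale MNIST digit
sequence = image.reshape(-1)                 # flatten to a length-784 sequence (sMNIST)
perm = torch.randperm(784, generator=torch.Generator().manual_seed(92))
permuted_sequence = sequence[perm]           # the same fixed permutation is reused for psMNIST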

Sequential CIFAR (sCIFAR): The color version of the image task, where each pixel is an (R, G, B) triple.

E.4 Implementation Configurations

All the experiments are conducted with:

  • Operating System: Windows 10, version 22H2

  • CPU: AMD Ryzen Threadripper 3960X 24-Core Processor @ 3.8GHz

  • GPU: NVIDIA GeForce RTX 3090 with 24 GB of memory

  • Software: Python 3.9.12, CUDA 11.3, PyTorch [45] 1.12.1.

Appendix F PyTorch Implementation of LDNN Layer

import torch
import torch.nn.functional as F

# B = batch, C = channel, S = head, H = input size,
# M = output size, N = state size, L = sequence length


def discretize_zoh(Lambda, B, Delta):
    """Discretize the diagonal, continuous-time linear MIMO SSM with the ZOH method.

    Args:
        Lambda (complex64): diagonal state matrix (C, S, N)
        B (complex64): input matrix (C, S, N, H/S)
        Delta (float32): discretization step sizes (C, S, N)
    Returns:
        discretized Lambda_bar (complex64), B_bar (complex64)
    """
    Lambda_bar = Lambda * Delta
    Identity = torch.ones_like(Lambda)
    B_coef = torch.reciprocal(Lambda) * (torch.exp(Lambda_bar) - Identity)
    B_bar = torch.einsum('csn,csnh->csnh', B_coef, B.to(B_coef.dtype))
    return Lambda_bar, B_bar


def ldnn(Lambda_bar, B_bar, C_tilde, D, u, num_heads,
         bidirectional=False, dropout=None, channel_mixer=None, activation=F.gelu):
    """Discretized SSM as a linear dynamics-embedded neural network.

    Args:
        Lambda_bar (complex64): discrete diagonal state matrix (C, S, N)
        B_bar (complex64): discrete input matrix (C, S, N, H/S)
        C_tilde (float32): output matrix (C, S, M/S, N)
        D (float32): direct transition matrix (C, S, M/S, H/S)
        u (float32): input sequence of features (B, H, L)
    Returns:
        y (float32): outputs (B, M, L)
    """
    Bsz, H, L = u.shape
    S = num_heads

    # Split the input into heads, h = H / S
    u = u.reshape(Bsz, S, H // S, L)

    # Calculate B * u (real parts are used so the FFT convolution stays real-valued)
    B_u = torch.einsum('csnh,bshl->bcsnl', B_bar.real, u)

    # Compute the state kernel K[l] = exp(Lambda_bar * l), shape (C, S, N, L)
    length = torch.arange(L, device=u.device).to(Lambda_bar.dtype)
    p = torch.einsum('csn,l->csnl', Lambda_bar, length)
    state_kernel = p.exp().real  # real part of the complex kernel

    # Bidirectional kernel for non-causal state inference
    if bidirectional:
        # forward kernel padded on the right, reversed backward kernel padded on the left
        state_kernel = F.pad(state_kernel, (0, L)) + F.pad(state_kernel.flip(-1), (L, 0))

    # Efficient convolution for state inference via FFT
    n = 2 * L
    k_f = torch.fft.rfft(state_kernel, n=n)
    u_f = torch.fft.rfft(B_u, n=n)
    x_f = torch.einsum('bcsnl,csnl->bcsnl', u_f, k_f)
    x = torch.fft.irfft(x_f, n=n)[..., :L]

    # Calculate C * x
    C_x = torch.einsum('csmn,bcsnl->bcsml', C_tilde, x)

    # Calculate the output with the direct term D * u
    y = C_x + torch.einsum('csmh,bshl->bcsml', D, u)

    # Mix channels and heads with a linear projection, then dropout and activation
    if dropout is not None:
        y = dropout(y)
    if channel_mixer is not None:
        y = channel_mixer(y)  # (B, C, S, M/S, L) -> (B, M, L)
    y = activation(y)

    return y
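For reference, a minimal usage sketch of the two functions above; all sizes are arbitrary, and the lambda passed as channel_mixer simply flattens the channel, head, and per-head output dimensions as a stand-in for the learnable projection used in the model.

import torch

Cc, S, N, H, M, L, Bsz = 1, 2, 16, 8, 8, 64, 4
h, m = H // S, M // S
Lambda = torch.complex(-torch.rand(Cc, S, N) - 0.1, torch.randn(Cc, S, N))
B = torch.randn(Cc, S, N, h, dtype=torch.cfloat)
C_tilde = torch.randn(Cc, S, m, N)
D = torch.randn(Cc, S, m, h)
Delta = 0.01 * torch.ones(Cc, S, N)

Lambda_bar, B_bar = discretize_zoh(Lambda, B, Delta)
u = torch.randn(Bsz, H, L)
y = ldnn(Lambda_bar, B_bar, C_tilde, D, u, num_heads=S,
         channel_mixer=lambda t: t.flatten(1, 3))  # (Bsz, Cc, S, m, L) -> (Bsz, M, L)
print(y.shape)  # torch.Size([4, 8, 64])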

References

  • [1] Hongyu Hè and Marko Kabic. A unified view of long-sequence models towards million-scale dependencies. arXiv preprint arXiv:2302.06218, 2023.
  • [2] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems, 33:1474–1487, 2020.
  • [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [4] Jimmy T. H. Smith, Andrew Warrington, and Scott W. Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
  • [5] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
  • [6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • [7] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
  • [8] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
  • [9] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
  • [10] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
  • [11] Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. Luna: Linear unified nested attention. Advances in Neural Information Processing Systems, 34:2441–2453, 2021.
  • [12] James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. FNet: Mixing tokens with Fourier transforms. arXiv preprint arXiv:2105.03824, 2021.
  • [13] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14138–14148, 2021.
  • [14] Zhenhai Zhu and Radu Soricut. H-Transformer-1D: Fast one-dimensional hierarchical attention for sequences. arXiv preprint arXiv:2107.11906, 2021.
  • [15] David W. Romero, David M. Knigge, Albert Gu, Erik J. Bekkers, Efstratios Gavves, Jakub M. Tomczak, and Mark Hoogendoorn. Towards a general purpose CNN for long range dependencies in ND. arXiv preprint arXiv:2206.03398, 2022.
  • [16] Lei Cheng, Ruslan Khalitov, Tong Yu, Jing Zhang, and Zhirong Yang. Classification of long sequential data using circular dilated convolutional neural networks. Neurocomputing, 518:50–59, 2023.
  • [17] Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022.
  • [18] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.
  • [19] Trieu Trinh, Andrew Dai, Thang Luong, and Quoc Le. Learning longer-term dependencies in RNNs with auxiliary losses. In International Conference on Machine Learning, pages 4965–4974. PMLR, 2018.
  • [20] Yulia Rubanova, Ricky T. Q. Chen, and David K. Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. Advances in Neural Information Processing Systems, 32, 2019.
  • [21] Patrick Kidger, James Morrill, James Foster, and Terry Lyons. Neural controlled differential equations for irregular time series. Advances in Neural Information Processing Systems, 33:6696–6707, 2020.
  • [22] Mario Lezcano-Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, pages 3794–3803. PMLR, 2019.
  • [23] N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney. Lipschitz recurrent neural networks. arXiv preprint arXiv:2006.12070, 2020.
  • [24] David W. Romero, Anna Kuzina, Erik J. Bekkers, Jakub M. Tomczak, and Mark Hoogendoorn. CKConv: Continuous kernel convolution for sequential data. arXiv preprint arXiv:2102.02611, 2021.
  • [25] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.
  • [26] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in Neural Information Processing Systems, 34:572–585, 2021.
  • [27] Naoki Nonaka and Jun Seita. In-depth benchmarking of deep neural network architectures for ECG diagnosis. In Machine Learning for Healthcare Conference, pages 414–439. PMLR, 2021.
  • [28] Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher Ré. How to train your HiPPO: State space models with generalized orthogonal basis projections. arXiv preprint arXiv:2206.12037, 2022.
  • [29] Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. arXiv preprint arXiv:2209.12951, 2022.
  • [30] David W. Romero, Robert-Jan Bruintjes, Jakub M. Tomczak, Erik J. Bekkers, Mark Hoogendoorn, and Jan C. van Gemert. FlexConv: Continuous kernel convolutions with differentiable kernel sizes. arXiv preprint arXiv:2110.08059, 2021.
  • [31] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. arXiv preprint arXiv:1810.06682, 2018.
  • [32] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
  • [33] Albert Gu, Caglar Gulcehre, Thomas Paine, Matt Hoffman, and Razvan Pascanu. Improving the gating mechanism of recurrent neural networks. In International Conference on Machine Learning, pages 3800–3809. PMLR, 2020.
  • [34] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [35] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A. Hasegawa-Johnson, and Thomas S. Huang. Dilated recurrent neural networks. Advances in Neural Information Processing Systems, 30, 2017.
  • [36] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466, 2018.
  • [37] Aaron Voelker, Ivana Kajić, and Chris Eliasmith. Legendre memory units: Continuous-time representation in recurrent neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • [38] T. Konstantin Rusch and Siddhartha Mishra. UnICORNN: A recurrent model for learning very long time dependencies. In International Conference on Machine Learning, pages 9168–9178. PMLR, 2021.
  • [39] Narsimha Reddy Chilkuri and Chris Eliasmith. Parallelizing Legendre memory unit training. In International Conference on Machine Learning, pages 1898–1907. PMLR, 2021.
  • [40] Nikita Nangia and Samuel R. Bowman. ListOps: A diagnostic dataset for latent tree learning. arXiv preprint arXiv:1804.06028, 2018.
  • [41] Dragomir R. Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. The ACL Anthology Network corpus. Language Resources and Evaluation, 47:919–944, 2013.
  • [42] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [43] Drew Linsley, Junkyung Kim, Vijay Veerabadran, Charles Windolf, and Thomas Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. Advances in Neural Information Processing Systems, 31, 2018.
  • [44] Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
  • [45] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.