---
title: TSTransformerPlus
keywords: fastai
sidebar: home_sidebar
summary: "This is a PyTorch implementation created by Ignacio Oguiza (timeseriesAI@gmail.com)."
description: "This is a PyTorch implementation created by Ignacio Oguiza (timeseriesAI@gmail.com)."
nb_path: "nbs/124_models.TSTransformerPlus.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
{% endraw %} {% raw %}

class TSTransformerPlus[source]

TSTransformerPlus(c_in:int, c_out:int, seq_len:int, n_layers:int=6, d_model:int=128, n_heads:int=16, d_head:Optional[int]=None, act:str='reglu', d_ff:int=256, pos_dropout:float=0.0, attn_drop_rate:float=0, mlp_drop_rate:float=0, drop_path_rate:float=0.0, pre_norm:bool=False, use_cls_token:bool=True, pct_random_steps:float=1.0, fc_dropout:float=0.0, bn:bool=True, y_range:Optional[tuple]=None, custom_subsampling:Optional[Callable]=None, custom_head:Optional[Callable]=None, verbose:bool=True) :: Sequential

Time series transformer model based on ViT (Vision Transformer):

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Args:

- c_in: the number of features (aka variables, dimensions, channels) in the time series dataset.
- c_out: the number of target classes.
- seq_len: number of time steps in the time series.
- n_layers: number of layers (or blocks) in the encoder. Default: 6.
- d_model: total dimension of the model (number of features created by the model). Default: 128 (range(64-512)).
- n_heads: parallel attention heads. Default: 16 (range(8-16)).
- d_head: size of the learned linear projection of queries, keys and values in the MHA. Usual values: 16-512. Default: None -> (d_model/n_heads) = 8.
- act: the activation function of the intermediate layer: relu, gelu, geglu or reglu. Default: reglu.
- d_ff: the dimension of the feedforward network model. Default: 256 (range(256-512)).
- pos_dropout: dropout applied to the embedded sequence steps after position embeddings have been added.
- attn_drop_rate: dropout rate applied to the attention layer.
- mlp_drop_rate: dropout rate applied to the MLP layer.
- drop_path_rate: dropout applied to the output of the MultiheadAttention and PositionwiseFeedForward layers.
- pre_norm: if True, normalization will be applied as the first step in the sublayers. Default: False.
- use_cls_token: if True, the output will come from the transformed class token. Otherwise, a pooling layer will be applied.
- pct_random_steps: percentage of steps that will be chosen during training (with replacement).
- fc_dropout: dropout applied to the final fully connected layer.
- bn: indicates if batchnorm will be applied to the head.
- y_range: range of possible y values (used in regression tasks).
- custom_subsampling: an optional callable (an nn.Conv1d with dilation > 1 or stride > 1, for example) that will be used to reduce the sequence length.
- custom_head: custom head that will be applied to the network. It must accept all required kwargs (pass a partial function).

Input shape: x: bs (batch size) x nvars (aka features, variables, dimensions, channels) x seq_len (aka time steps)

{% endraw %} {% raw %}
{% endraw %} {% raw %}
bs = 16
nvars = 4
seq_len = 50
c_out = 2
xb = torch.rand(bs, nvars, seq_len)
model = TSTransformerPlus(nvars, c_out, seq_len)
test_eq(model(xb).shape, (bs, c_out))
model
TSTransformerPlus(
  (backbone): _TSTransformerBackbone(
    (to_embedding): Sequential(
      (0): Transpose(1, 2)
      (1): Linear(in_features=4, out_features=128, bias=True)
    )
    (pos_dropout): Dropout(p=0.0, inplace=False)
    (encoder): _TransformerEncoder(
      (layers): ModuleList(
        (0): ModuleList(
          (0): MultiheadAttention(
            (W_Q): Linear(in_features=128, out_features=128, bias=False)
            (W_K): Linear(in_features=128, out_features=128, bias=False)
            (W_V): Linear(in_features=128, out_features=128, bias=False)
            (sdp_attn): ScaledDotProductAttention()
            (to_out): Sequential(
              (0): Linear(in_features=128, out_features=128, bias=True)
              (1): Dropout(p=0, inplace=False)
            )
          )
          (1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (2): PositionwiseFeedForward(
            (0): Linear(in_features=128, out_features=256, bias=True)
            (1): ReGLU()
            (2): Dropout(p=0, inplace=False)
            (3): Linear(in_features=128, out_features=128, bias=True)
            (4): Dropout(p=0, inplace=False)
          )
          (3): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (4): Identity()
        )
        (1): ModuleList(
          (0): MultiheadAttention(
            (W_Q): Linear(in_features=128, out_features=128, bias=False)
            (W_K): Linear(in_features=128, out_features=128, bias=False)
            (W_V): Linear(in_features=128, out_features=128, bias=False)
            (sdp_attn): ScaledDotProductAttention()
            (to_out): Sequential(
              (0): Linear(in_features=128, out_features=128, bias=True)
              (1): Dropout(p=0, inplace=False)
            )
          )
          (1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (2): PositionwiseFeedForward(
            (0): Linear(in_features=128, out_features=256, bias=True)
            (1): ReGLU()
            (2): Dropout(p=0, inplace=False)
            (3): Linear(in_features=128, out_features=128, bias=True)
            (4): Dropout(p=0, inplace=False)
          )
          (3): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (4): Identity()
        )
        (2): ModuleList(
          (0): MultiheadAttention(
            (W_Q): Linear(in_features=128, out_features=128, bias=False)
            (W_K): Linear(in_features=128, out_features=128, bias=False)
            (W_V): Linear(in_features=128, out_features=128, bias=False)
            (sdp_attn): ScaledDotProductAttention()
            (to_out): Sequential(
              (0): Linear(in_features=128, out_features=128, bias=True)
              (1): Dropout(p=0, inplace=False)
            )
          )
          (1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (2): PositionwiseFeedForward(
            (0): Linear(in_features=128, out_features=256, bias=True)
            (1): ReGLU()
            (2): Dropout(p=0, inplace=False)
            (3): Linear(in_features=128, out_features=128, bias=True)
            (4): Dropout(p=0, inplace=False)
          )
          (3): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (4): Identity()
        )
        (3): ModuleList(
          (0): MultiheadAttention(
            (W_Q): Linear(in_features=128, out_features=128, bias=False)
            (W_K): Linear(in_features=128, out_features=128, bias=False)
            (W_V): Linear(in_features=128, out_features=128, bias=False)
            (sdp_attn): ScaledDotProductAttention()
            (to_out): Sequential(
              (0): Linear(in_features=128, out_features=128, bias=True)
              (1): Dropout(p=0, inplace=False)
            )
          )
          (1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (2): PositionwiseFeedForward(
            (0): Linear(in_features=128, out_features=256, bias=True)
            (1): ReGLU()
            (2): Dropout(p=0, inplace=False)
            (3): Linear(in_features=128, out_features=128, bias=True)
            (4): Dropout(p=0, inplace=False)
          )
          (3): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (4): Identity()
        )
        (4): ModuleList(
          (0): MultiheadAttention(
            (W_Q): Linear(in_features=128, out_features=128, bias=False)
            (W_K): Linear(in_features=128, out_features=128, bias=False)
            (W_V): Linear(in_features=128, out_features=128, bias=False)
            (sdp_attn): ScaledDotProductAttention()
            (to_out): Sequential(
              (0): Linear(in_features=128, out_features=128, bias=True)
              (1): Dropout(p=0, inplace=False)
            )
          )
          (1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (2): PositionwiseFeedForward(
            (0): Linear(in_features=128, out_features=256, bias=True)
            (1): ReGLU()
            (2): Dropout(p=0, inplace=False)
            (3): Linear(in_features=128, out_features=128, bias=True)
            (4): Dropout(p=0, inplace=False)
          )
          (3): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (4): Identity()
        )
        (5): ModuleList(
          (0): MultiheadAttention(
            (W_Q): Linear(in_features=128, out_features=128, bias=False)
            (W_K): Linear(in_features=128, out_features=128, bias=False)
            (W_V): Linear(in_features=128, out_features=128, bias=False)
            (sdp_attn): ScaledDotProductAttention()
            (to_out): Sequential(
              (0): Linear(in_features=128, out_features=128, bias=True)
              (1): Dropout(p=0, inplace=False)
            )
          )
          (1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (2): PositionwiseFeedForward(
            (0): Linear(in_features=128, out_features=256, bias=True)
            (1): ReGLU()
            (2): Dropout(p=0, inplace=False)
            (3): Linear(in_features=128, out_features=128, bias=True)
            (4): Dropout(p=0, inplace=False)
          )
          (3): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (4): Identity()
        )
      )
    )
  )
  (head): Sequential(
    (0): TokenLayer()
    (1): LinBnDrop(
      (0): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=128, out_features=2, bias=False)
    )
  )
)
{% endraw %}
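The arguments documented above can be tweaked at construction time. As a quick sketch (the smaller n_layers/d_model/n_heads values and the use_cls_token=False / fc_dropout settings below are chosen only for illustration), the same shape check still passes when the class token is replaced by a pooling head:

{% raw %}
bs = 16
nvars = 4
seq_len = 50
c_out = 2
xb = torch.rand(bs, nvars, seq_len)
# illustrative, non-default settings: smaller encoder and a pooling head instead of the class token
model = TSTransformerPlus(nvars, c_out, seq_len, n_layers=3, d_model=64, n_heads=8,
                          use_cls_token=False, fc_dropout=0.1)
test_eq(model(xb).shape, (bs, c_out))
{% endraw %}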

Subsampling

Transformers are difficult to apply directly to long sequences, since self-attention memory and compute grow quadratically with sequence length. To work around this, we have included a way to subsample the sequence and generate a more manageable input.

{% raw %}
from tsai.data.validation import get_splits
from tsai.data.core import get_ts_dls
X = np.zeros((10, 3, 5000)) 
y = np.random.randint(0,2,X.shape[0])
splits = get_splits(y)
dls = get_ts_dls(X, y, splits=splits)
xb, yb = dls.train.one_batch()
xb
TSTensor(samples:8, vars:3, len:5000)
{% endraw %}

If you try to use TSTransformerPlus on a sequence this long, you'll likely get an 'out-of-memory' error.

To avoid this, you can subsample the sequence to reduce the input length. This can be done in multiple ways. Here are a few examples:

{% raw %}
custom_subsampling = Conv1d(xb.shape[1], xb.shape[1], ks=100, stride=50, padding='same', groups=xb.shape[1]).to(default_device())
custom_subsampling(xb).shape
torch.Size([8, 3, 100])
{% endraw %} {% raw %}
custom_subsampling = Conv1d(xb.shape[1], 2, ks=100, stride=50, padding='same').to(default_device())
custom_subsampling(xb).shape
torch.Size([8, 2, 100])
{% endraw %} {% raw %}
custom_subsampling = nn.Sequential(Pad1d((0, 50), 0), nn.MaxPool1d(kernel_size=100, stride=50)).to(default_device())
custom_subsampling(xb).shape
torch.Size([8, 3, 100])
{% endraw %} {% raw %}
custom_subsampling = nn.Sequential(Pad1d((0, 50), 0), nn.AvgPool1d(kernel_size=100, stride=50)).to(default_device())
custom_subsampling(xb).shape
torch.Size([8, 3, 100])
{% endraw %}
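Note that the four options above all use a stride of 50, so with the padding applied they reduce the 5000-step sequence to ceil(seq_len / stride) = 100 steps, which the attention layers can handle comfortably. A quick check of that arithmetic (math is assumed to be imported):

{% raw %}
import math
# subsampled length produced by the padded, strided layers shown above
math.ceil(5000 / 50)
100
{% endraw %}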

Once you decide which type of transform you want to apply, you just need to pass the layer via the custom_subsampling argument:

{% raw %}
bs = 16
nvars = 4
seq_len = 1000
c_out = 2
xb = torch.rand(bs, nvars, seq_len)
custom_subsampling = Conv1d(xb.shape[1], xb.shape[1], ks=5, stride=3, padding='same', groups=xb.shape[1])
model = TSTransformerPlus(nvars, c_out, seq_len, custom_subsampling=custom_subsampling)
test_eq(model(xb).shape, (bs, c_out))
custom_subsampling: (?, 4, 1000) --> (?, 4, 334)
{% endraw %}
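Putting the pieces together, here is a minimal sketch (reusing the dls created above; the kernel size and stride are illustrative) that runs TSTransformerPlus on the long 5000-step batch that would otherwise cause memory problems:

{% raw %}
xb, yb = dls.train.one_batch()
# depthwise strided conv reduces the 5000-step sequence before it reaches the transformer
custom_subsampling = Conv1d(xb.shape[1], xb.shape[1], ks=100, stride=50, padding='same', groups=xb.shape[1])
model = TSTransformerPlus(xb.shape[1], 2, xb.shape[-1], custom_subsampling=custom_subsampling).to(default_device())
test_eq(model(xb).shape, (xb.shape[0], 2))
{% endraw %}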