An Overview of Quantized Activations

In this second tutorial, we take a deeper look at quantized activations.
We were already introduced to quantized activations in the previous tutorial, when we looked at input and output quantization of QuantConv2d with the Int8ActPerTensorFloat quantizer. The same result can be obtained with a different syntax by coupling QuantConv2d with QuantIdentity layers, which by default use the Int8ActPerTensorFloat quantizer. As an example, we compare - on the same input - the result of a QuantConv2d with output_quant enabled against that of a QuantConv2d followed by a QuantIdentity:
[1]:
import torch
from brevitas.nn import QuantConv2d, QuantIdentity
from brevitas.quant.scaled_int import Int8ActPerTensorFloat

torch.manual_seed(0)
output_quant_conv = QuantConv2d(
    in_channels=2, out_channels=3, kernel_size=(3,3), output_quant=Int8ActPerTensorFloat)

torch.manual_seed(0)
default_quant_conv = QuantConv2d(
    in_channels=2, out_channels=3, kernel_size=(3,3))
output_identity_quant = QuantIdentity()

inp = torch.randn(1, 2, 5, 5)
out_tensor1 = output_quant_conv(inp)
out_tensor2 = output_identity_quant(default_quant_conv(inp))

out_tensor1.isclose(out_tensor2).all().item()
[1]:
True

We can observe a similar behaviour if we enable input quantization too:

[2]:
torch.manual_seed(0)
input_output_quant_conv = QuantConv2d(
    in_channels=2, out_channels=3, kernel_size=(3,3),
    input_quant=Int8ActPerTensorFloat, output_quant=Int8ActPerTensorFloat)

torch.manual_seed(0)
default_quant_conv = QuantConv2d(
    in_channels=2, out_channels=3, kernel_size=(3,3))
input_identity_quant = QuantIdentity()
output_identity_quant = QuantIdentity()

inp = torch.randn(1, 2, 5, 5)
out_tensor1 = input_output_quant_conv(inp)
out_tensor2 = output_identity_quant(default_quant_conv(input_identity_quant(inp)))

out_tensor1.isclose(out_tensor2).all().item()
[2]:
True

From an algorithmic point of view, then, the two implementations are doing the same thing. However, as will become clearer in later tutorials, there are currently some scenarios where picking one style over the other makes a difference when it comes to exporting to a format such as standard ONNX. In the meantime, we can simply keep in mind that both alternatives exist.

As was the case with QuantConv2d, when we disable quantization of an activation, the layer behaves like its floating-point variant. In the case of QuantIdentity, that means behaving like an identity function:

[3]:
disabled_quant_identity = QuantIdentity(act_quant=None)
(inp == disabled_quant_identity(inp)).all().item()
[3]:
True

Again, as was the case for QuantConv2d, quantized activation layers can also return a QuantTensor:

[4]:
return_quant_identity = QuantIdentity(return_quant_tensor=True)
out_tensor = return_quant_identity(inp)
out_tensor
[4]:
QuantTensor(value=tensor([[[[-0.4566, -0.5707, -0.5517,  0.5897,  1.5409],
          [ 0.5136, -0.5897, -0.5707,  0.1902, -0.0761],
          [-0.4946, -1.5029, -0.1902,  0.4376,  1.3317],
          [-1.6361,  2.0736,  1.7122,  2.3780, -1.1224],
          [-0.3234, -1.0844, -0.0761, -0.0951, -0.7610]],

         [[-1.5980,  0.0190, -0.7419,  0.1902,  0.6278],
          [ 0.6468, -0.2473, -0.5327,  1.1605,  0.4376],
          [-0.7990, -1.2936, -0.7419, -1.3127, -0.2283],
          [-2.4351, -0.0761,  0.2283,  0.7990, -0.1902],
          [-0.3615, -1.2175, -0.6278, -0.4566,  1.9214]]]],
       grad_fn=<MulBackward0>), scale=tensor(0.0190, grad_fn=<DivBackward0>), zero_point=tensor(0.), bit_width=tensor(8.), signed_t=tensor(True), training_t=tensor(True))
[5]:
out_tensor.is_valid
[5]:
True

As expected, a QuantIdentity with quantization disabled behaves like an identity function also when a QuantTensor is passed in. However, when return_quant_tensor is set to False (the default), the quantization metadata is stripped out, i.e. the input QuantTensor is returned as an implicitly quantized torch.Tensor:

[6]:
out_torch_tensor = disabled_quant_identity(out_tensor)
out_torch_tensor
[6]:
tensor([[[[-0.4566, -0.5707, -0.5517,  0.5897,  1.5409],
          [ 0.5136, -0.5897, -0.5707,  0.1902, -0.0761],
          [-0.4946, -1.5029, -0.1902,  0.4376,  1.3317],
          [-1.6361,  2.0736,  1.7122,  2.3780, -1.1224],
          [-0.3234, -1.0844, -0.0761, -0.0951, -0.7610]],

         [[-1.5980,  0.0190, -0.7419,  0.1902,  0.6278],
          [ 0.6468, -0.2473, -0.5327,  1.1605,  0.4376],
          [-0.7990, -1.2936, -0.7419, -1.3127, -0.2283],
          [-2.4351, -0.0761,  0.2283,  0.7990, -0.1902],
          [-0.3615, -1.2175, -0.6278, -0.4566,  1.9214]]]],
       grad_fn=<MulBackward0>)
[7]:
return_disabled_quant_identity = QuantIdentity(act_quant=None, return_quant_tensor=True)
identity_out_tensor = return_disabled_quant_identity(out_tensor)
identity_out_tensor
[7]:
QuantTensor(value=tensor([[[[-0.4566, -0.5707, -0.5517,  0.5897,  1.5409],
          [ 0.5136, -0.5897, -0.5707,  0.1902, -0.0761],
          [-0.4946, -1.5029, -0.1902,  0.4376,  1.3317],
          [-1.6361,  2.0736,  1.7122,  2.3780, -1.1224],
          [-0.3234, -1.0844, -0.0761, -0.0951, -0.7610]],

         [[-1.5980,  0.0190, -0.7419,  0.1902,  0.6278],
          [ 0.6468, -0.2473, -0.5327,  1.1605,  0.4376],
          [-0.7990, -1.2936, -0.7419, -1.3127, -0.2283],
          [-2.4351, -0.0761,  0.2283,  0.7990, -0.1902],
          [-0.3615, -1.2175, -0.6278, -0.4566,  1.9214]]]],
       grad_fn=<MulBackward0>), scale=tensor(0.0190, grad_fn=<DivBackward0>), zero_point=tensor(0.), bit_width=tensor(8.), signed_t=tensor(True), training_t=tensor(True))

Moving on from QuantIdentity, let's take a look at QuantReLU. Everything we said so far about QuantIdentity also applies to QuantReLU. The difference is that QuantReLU implements a ReLU function followed by quantization, while QuantIdentity is really just the quantization operator. Additionally, by default QuantReLU adopts the Uint8ActPerTensorFloat quantizer, meaning that the output of quantization is unsigned:

[8]:
from brevitas.nn import QuantReLU

return_quant_relu = QuantReLU(return_quant_tensor=True)
return_quant_relu(inp)
[8]:
QuantTensor(value=tensor([[[[0.0000, 0.0000, 0.0000, 0.5974, 1.5402],
          [0.5041, 0.0000, 0.0000, 0.1867, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.4481, 1.3255],
          [0.0000, 2.0817, 1.7083, 2.3804, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],

         [[0.0000, 0.0187, 0.0000, 0.1867, 0.6254],
          [0.6348, 0.0000, 0.0000, 1.1668, 0.4387],
          [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.2334, 0.7935, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.0000, 1.9230]]]], grad_fn=<MulBackward0>), scale=tensor(0.0093, grad_fn=<DivBackward0>), zero_point=tensor(0.), bit_width=tensor(8.), signed_t=tensor(False), training_t=tensor(True))

Like QuantIdentity, QuantReLU is also special compared to other non-linear quantized activation layers in that it preserves the metadata of an input QuantTensor even when quantization is disabled:

[9]:
return_disabled_quant_relu = QuantReLU(act_quant=None, return_quant_tensor=True)
relu_out_tensor = return_disabled_quant_relu(out_tensor)
assert relu_out_tensor.is_valid
assert relu_out_tensor.scale == out_tensor.scale
assert relu_out_tensor.zero_point == out_tensor.zero_point
assert relu_out_tensor.bit_width == out_tensor.bit_width

That doesn’t apply to other layers like, say, QuantSigmoid:

[10]:
from brevitas.nn import QuantSigmoid

return_disabled_quant_sigmoid = QuantSigmoid(act_quant=None, return_quant_tensor=True)
sigmoid_out_tensor = return_disabled_quant_sigmoid(out_tensor)
sigmoid_out_tensor
[10]:
QuantTensor(value=(tensor([[[[0.3878, 0.3611, 0.3655, 0.6433, 0.8236],
          [0.6257, 0.3567, 0.3611, 0.5474, 0.4810],
          [0.3788, 0.1820, 0.4526, 0.6077, 0.7911],
          [0.1630, 0.8883, 0.8471, 0.9151, 0.2456],
          [0.4198, 0.2527, 0.4810, 0.4762, 0.3184]],

         [[0.1683, 0.5048, 0.3226, 0.5474, 0.6520],
          [0.6563, 0.4385, 0.3699, 0.7614, 0.6077],
          [0.3102, 0.2152, 0.3226, 0.2120, 0.4432],
          [0.0805, 0.4810, 0.5568, 0.6898, 0.4526],
          [0.4106, 0.2284, 0.3480, 0.3878, 0.8723]]]],
       grad_fn=<SigmoidBackward0>), None, None, None), scale=None, zero_point=None, bit_width=None, signed_t=None, training_t=tensor(True))
[11]:
sigmoid_out_tensor.is_valid
[11]:
False

Something to always keep in mind is that the non-linearity of a quantized activation layer is always applied to the dequantized representation of the input. For example, let's say we first quantize a floating-point torch.Tensor with an unsigned shifted quantizer such as ShiftedUint8ActPerTensorFloat, i.e. one whose zero-point is set such that the integer representation of its output is non-negative. Then, we pass this tensor as input to a QuantReLU with quantization disabled. The fact that the input to QuantReLU is unsigned in its integer form doesn't mean QuantReLU has no effect: ReLU is called on the dequantized representation, which includes both positive and negative values:

[12]:
from brevitas.quant.shifted_scaled_int import ShiftedUint8ActPerTensorFloat

shifted_quant_identity = QuantIdentity(act_quant=ShiftedUint8ActPerTensorFloat, return_quant_tensor=True)
return_disabled_quant_relu = QuantReLU(act_quant=None, return_quant_tensor=True)
return_disabled_quant_relu(shifted_quant_identity(inp))
[12]:
QuantTensor(value=tensor([[[[0.0000, 0.0000, 0.0000, 0.5854, 1.5485],
          [0.5099, 0.0000, 0.0000, 0.1888, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.4532, 1.3219],
          [0.0000, 2.0772, 1.6996, 2.3794, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],

         [[0.0000, 0.0189, 0.0000, 0.1888, 0.6232],
          [0.6421, 0.0000, 0.0000, 1.1708, 0.4343],
          [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.2266, 0.7931, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.0000, 1.9262]]]], grad_fn=<ReluBackward0>), scale=tensor(0.0189, grad_fn=<DivBackward0>), zero_point=tensor(129., grad_fn=<SWhereBackward0>), bit_width=tensor(8.), signed_t=tensor(False), training_t=tensor(True))

Let’s now consider the very common scenario of a QuantConv2d followed by a ReLU or QuantReLU. In particular, let’s say we have a QuantConv2d with output quantization enabled followed by a ReLU:

[13]:
torch.manual_seed(0)
output_quant_conv = QuantConv2d(
    in_channels=2, out_channels=3, kernel_size=(3,3), output_quant=Int8ActPerTensorFloat)
torch.relu(output_quant_conv(inp))
[13]:
tensor([[[[0.0000, 0.0000, 0.0000],
          [1.3134, 1.2557, 1.0392],
          [0.4186, 0.0000, 0.0000]],

         [[0.7361, 0.5340, 0.8516],
          [0.2887, 0.3175, 0.0000],
          [0.8949, 1.6743, 0.0722]],

         [[0.0000, 0.0000, 0.0289],
          [0.0000, 0.0000, 0.2021],
          [0.0000, 0.0000, 0.4907]]]], grad_fn=<ReluBackward0>)

We compare it against a QuantConv2d with default settings (i.e. output quantization disabled), followed by a QuantReLU with default settings (i.e. activation quantization enabled):

[14]:
torch.manual_seed(0)
default_quant_conv = QuantConv2d(
    in_channels=2, out_channels=3, kernel_size=(3,3))
default_quant_relu = QuantReLU()
default_quant_relu(default_quant_conv(inp))
[14]:
tensor([[[[0.0000, 0.0000, 0.0000],
          [1.3078, 1.2555, 1.0397],
          [0.4185, 0.0000, 0.0000]],

         [[0.7454, 0.5427, 0.8566],
          [0.2943, 0.3269, 0.0000],
          [0.8893, 1.6674, 0.0785]],

         [[0.0065, 0.0000, 0.0262],
          [0.0000, 0.0000, 0.1962],
          [0.0000, 0.0000, 0.4839]]]], grad_fn=<MulBackward0>)

We can see the results are close but not quite the same. In the first case, we quantized the output of QuantConv2d with an 8-bit signed quantizer and then passed it through a ReLU, meaning that half of the numerical range covered by the signed quantizer is lost: for all practical purposes the output can now be treated as a 7-bit unsigned number (although it's not explicitly marked as such). In the second case, we perform unsigned 8-bit quantization after the ReLU. Because the range covered by the quantizer now includes only non-negative numbers, we don't waste a bit as in the previous case.
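
To make the difference concrete, we can compare the two scale factors directly. The following is just a sketch that reuses the input and the layer definitions from above (the variable names are purely illustrative): with a signed 8-bit output quantizer the scale has to cover both the positive and the negative pre-ReLU outputs, while the unsigned quantizer applied after the ReLU only has to cover the non-negative range, so it ends up with a finer scale:

[ ]:
torch.manual_seed(0)
signed_output_conv = QuantConv2d(
    in_channels=2, out_channels=3, kernel_size=(3,3),
    output_quant=Int8ActPerTensorFloat, return_quant_tensor=True)

torch.manual_seed(0)
unsigned_conv = QuantConv2d(in_channels=2, out_channels=3, kernel_size=(3,3))
unsigned_relu = QuantReLU(return_quant_tensor=True)

# Scale of the signed quantizer applied before the ReLU vs. scale of the
# unsigned quantizer applied after the ReLU.
signed_scale = signed_output_conv(inp).scale
unsigned_scale = unsigned_relu(unsigned_conv(inp)).scale
assert (unsigned_scale < signed_scale).item()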

A word of caution regarding premade activation quantizers such as Uint8ActPerTensorFloat, ShiftedUint8ActPerTensorFloat, and Int8ActPerTensorFloat, which anticipates some of the themes of the next tutorial. To minimize user interaction, Brevitas initializes scale and zero-point by collecting statistics for a number of training steps (by default 30). This can be seen as a very basic calibration step, although it typically happens during training and with quantization already enabled. The statistics are accumulated in an exponential moving average which, at the end of the collection phase, is used to initialize a learned parameter. During the collection phase, then, the quantizer behaves differently between train() and eval() mode: in train() mode, the statistics for that particular batch are returned; in eval() mode, the exponential moving average is returned. After the collection phase is over, the learned parameter is returned in both execution modes. We can easily observe this behaviour with an example. Let's first define a quantized activation and two random input tensors:

[15]:
quant_identity = QuantIdentity(return_quant_tensor=True)
inp1 = torch.randn(3, 3)
inp2 = torch.randn(3, 3)

We then compare the output scale factors of the two tensors between train() and eval() mode. The scale factors computed in train() mode are in general different, while those computed in eval() mode are the same.

[16]:
out1_train = quant_identity(inp1)
out2_train = quant_identity(inp2)
out1_train.scale.isclose(out2_train.scale).item()
[16]:
False
[17]:
quant_identity.eval()
out1_eval = quant_identity(inp1)
out2_eval = quant_identity(inp2)
out1_eval.scale.isclose(out2_eval.scale).item()
[17]:
True
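
To see what happens once the collection phase is over, we can push a fresh quantized activation past its collection window and check that the scale no longer depends on the input, even in train() mode. This is a minimal sketch assuming the default of 30 collection steps mentioned above (the variable names are purely illustrative):

[ ]:
trained_quant_identity = QuantIdentity(return_quant_tensor=True)
# Exhaust the statistics collection phase with 30 training-mode forward passes.
for _ in range(30):
    trained_quant_identity(torch.randn(3, 3))
# The scale is now a learned parameter, so it no longer depends on the input,
# in train() as well as in eval() mode.
out1_post = trained_quant_identity(inp1)
out2_post = trained_quant_identity(inp2)
assert out1_post.scale.isclose(out2_post.scale).item()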

By default, the only layer that is an exception to this is QuantHardTanh. That is because the interface of torch.nn.Hardtanh already requires users to manually specify min_val and max_val, so Brevitas preserves that requirement whether quantization is enabled or disabled. With quantization enabled, by default those values are used for initialization, but then the range is learned. Let's look at an example:

[18]:
from brevitas.nn import QuantHardTanh

QuantHardTanh()
---------------------------------------------------------------------------
DependencyError                           Traceback (most recent call last)
<ipython-input-18-8145d2f87fcb> in <module>
      1 from brevitas.nn import QuantHardTanh
      2
----> 3 QuantHardTanh()

c:\brevitas_fx\src\brevitas\nn\quant_activation.py in __init__(self, act_quant, input_quant, return_quant_tensor, **kwargs)
    117             act_quant=act_quant,
    118             return_quant_tensor=return_quant_tensor,
--> 119             **kwargs)
    120
    121

c:\brevitas_fx\src\brevitas\nn\quant_layer.py in __init__(self, act_impl, passthrough_act, input_quant, act_quant, return_quant_tensor, **kwargs)
     77             passthrough_act,
     78             act_quant,
---> 79             **kwargs)
     80
     81     @property

c:\brevitas_fx\src\brevitas\nn\mixin\act.py in __init__(self, act_impl, passthrough_act, act_quant, **kwargs)
    157             proxy_prefix='act_',
    158             kwargs_prefix='',
--> 159             **kwargs)
    160
    161     @property

c:\brevitas_fx\src\brevitas\nn\mixin\base.py in __init__(self, quant, proxy_protocol, none_quant_injector, proxy_prefix, kwargs_prefix, **kwargs)
     98             quant_injector = quant
     99             quant_injector = quant_injector.let(**filter_kwargs(kwargs_prefix, kwargs))
--> 100             quant = quant_injector.proxy_class(self, quant_injector)
    101         else:
    102             if not isinstance(quant, proxy_protocol):

c:\brevitas_fx\src\brevitas\proxy\runtime_quant.py in __init__(self, quant_layer, quant_injector)
    108
    109     def __init__(self, quant_layer, quant_injector):
--> 110         super(ActQuantProxyFromInjector, self).__init__(quant_layer, quant_injector)
    111         self.is_passthrough_act = _is_passthrough_act(quant_injector)
    112

c:\brevitas_fx\src\brevitas\proxy\quant_proxy.py in __init__(self, quant_layer, quant_injector, export_mode, export_handler)
     74         # Use a normal list and not a ModuleList since this is a pointer to parent modules
     75         self.tracked_module_list = []
---> 76         self.add_tracked_module(quant_layer)
     77         self.export_handler = export_handler
     78         self.export_mode = export_mode

c:\brevitas_fx\src\brevitas\proxy\quant_proxy.py in add_tracked_module(self, module)
    130             self.tracked_module_list.append(module)
    131             self.update_tracked_modules()
--> 132             self.init_tensor_quant()
    133         else:
    134             raise RuntimeError("Trying to add None as a parent module.")

c:\brevitas_fx\src\brevitas\proxy\runtime_quant.py in init_tensor_quant(self)
    120
    121     def init_tensor_quant(self):
--> 122         tensor_quant = self.quant_injector.tensor_quant
    123         act_impl = self.quant_injector.act_impl
    124         is_act_enabled = _is_act_enabled(act_impl, tensor_quant)

    [... skipping hidden 1 frame]

DependencyError: 'Int8ActPerTensorFloatMinMaxInit' can not resolve attribute 'max_val' while building 'scaling_init_impl'

As expected, we get an error concerning the missing max_val attribute. Let's try to pass it then, together with min_val:

[19]:
quant_hard_tanh = QuantHardTanh(max_val=1.0, min_val=-1.0, return_quant_tensor=True)

The layer is now correctly initialized. We can see that the output scale factors are all the same between train() and eval() mode:

[20]:
out1_train = quant_hard_tanh(inp1)
quant_hard_tanh.eval()
out2_eval = quant_hard_tanh(inp2)
out1_train.scale.isclose(out2_eval.scale).item()
[20]:
True

Finally, a reminder that mixing things up is perfectly legal and encouraged in Brevitas. For example, a QuantIdentity with act_quant=Int8ActPerTensorFloatMinMaxInit is equivalent to a default QuantHardTanh, and conversely a QuantHardTanh with act_quant=Int8ActPerTensorFloat is equivalent to a default QuantIdentity. This is possible because - as will be explained in the next tutorial - the same layer can accept different keyword arguments depending on which quantizer is set. So a QuantIdentity with act_quant=Int8ActPerTensorFloatMinMaxInit is going to expect the arguments min_val and max_val, the same way a default QuantHardTanh would, as sketched below.
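
Here is a minimal sketch of that equivalence. It assumes Int8ActPerTensorFloatMinMaxInit can be imported from brevitas.quant.scaled_int (as suggested by the error traceback above) and reuses the layers and quantizers introduced earlier; the variable names are purely illustrative:

[ ]:
from brevitas.quant.scaled_int import Int8ActPerTensorFloatMinMaxInit

# A QuantIdentity with a min/max-initialized quantizer now expects min_val and
# max_val, just like a default QuantHardTanh does.
hard_tanh_style_identity = QuantIdentity(
    act_quant=Int8ActPerTensorFloatMinMaxInit, min_val=-1.0, max_val=1.0)
default_hard_tanh = QuantHardTanh(min_val=-1.0, max_val=1.0)

# Conversely, a QuantHardTanh with the default QuantIdentity quantizer
# no longer requires min_val and max_val.
identity_style_hard_tanh = QuantHardTanh(act_quant=Int8ActPerTensorFloat)
default_identity = QuantIdentity()

# On an input within [-1, 1], each pair should behave the same.
test_inp = torch.empty(3, 3).uniform_(-1.0, 1.0)
assert hard_tanh_style_identity(test_inp).isclose(default_hard_tanh(test_inp)).all().item()
assert identity_style_hard_tanh(test_inp).isclose(default_identity(test_inp)).all().item()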