An overview of QuantTensor and QuantConv2d#
In this initial tutorial, we take a first look at QuantTensor, a basic data structure in Brevitas, and at QuantConv2d, a typical quantized layer. QuantConv2d is a subclass of QuantWeightBiasInputOutputLayer (typically imported as QuantWBIOL), meaning that it supports quantization of its weight, bias, input and output. Other subclasses of QuantWBIOL are QuantLinear, QuantConv1d, QuantConvTranspose1d and QuantConvTranspose2d, and they all follow the same principles.
If we take a look at the __init__ method of QuantConv2d, we notice a few things:
[1]:
import inspect
from brevitas.nn import QuantConv2d
from brevitas.nn import QuantIdentity
from IPython.display import Markdown, display
def pretty_print_source(source):
    display(Markdown('```python\n' + source + '\n```'))
source = inspect.getsource(QuantConv2d.__init__)
pretty_print_source(source)
def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: Union[int, Tuple[int, int]],
        stride: Union[int, Tuple[int, int]] = 1,
        padding: Union[int, Tuple[int, int]] = 0,
        dilation: Union[int, Tuple[int, int]] = 1,
        groups: int = 1,
        bias: bool = True,
        padding_type: str = 'standard',
        weight_quant: Optional[WeightQuantType] = Int8WeightPerTensorFloat,
        bias_quant: Optional[BiasQuantType] = None,
        input_quant: Optional[ActQuantType] = None,
        output_quant: Optional[ActQuantType] = None,
        return_quant_tensor: bool = False,
        **kwargs) -> None:
    Conv2d.__init__(
        self,
        in_channels=in_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
        dilation=dilation,
        groups=groups,
        bias=bias)
    QuantWBIOL.__init__(
        self,
        weight_quant=weight_quant,
        bias_quant=bias_quant,
        input_quant=input_quant,
        output_quant=output_quant,
        return_quant_tensor=return_quant_tensor,
        **kwargs)
    assert self.padding_mode == 'zeros'
    assert not (padding_type == 'same' and padding != 0)
    self.padding_type = padding_type
QuantConv2d is a subclass of both Conv2d and QuantWBIOL. Its initialization method exposes the usual arguments of a Conv2d, as well as: an extra flag to support same padding; four different arguments to set a quantizer for, respectively, weight, bias, input, and output; a return_quant_tensor boolean flag; the **kwargs placeholder to intercept additional arbitrary keyword arguments.
By default weight_quant=Int8WeightPerTensorFloat, while bias_quant, input_quant and output_quant are set to None. That means that by default weights are quantized to 8-bit signed integer with a per-tensor floating-point scale factor (a very common type of quantization, adopted e.g. by the ONNX standard opset), while quantization of bias, input, and output is disabled. We can easily verify all of this at runtime on an example:
[2]:
default_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=False)
[3]:
print(f'Is weight quant enabled: {default_quant_conv.is_weight_quant_enabled}')
print(f'Is bias quant enabled: {default_quant_conv.is_bias_quant_enabled}')
print(f'Is input quant enabled: {default_quant_conv.is_input_quant_enabled}')
print(f'Is output quant enabled: {default_quant_conv.is_output_quant_enabled}')
Is weight quant enabled: True
Is bias quant enabled: False
Is input quant enabled: False
Is output quant enabled: False
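As a side note on the **kwargs placeholder mentioned above: keyword arguments with a recognized prefix are intercepted and forwarded to the corresponding quantizer. A minimal sketch (assuming, as is typical in Brevitas, that the default weight quantizer accepts a weight_bit_width override):
kwargs_quant_conv = QuantConv2d(
    in_channels=2, out_channels=3, kernel_size=(3,3), bias=False,
    weight_bit_width=4)  # picked up through **kwargs and applied to the weight quantizer
print(kwargs_quant_conv.quant_weight().bit_width)  # expected: tensor(4.)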
If we now try to pass in a random floating-point tensor as input, as expected we get the output of the convolution:
[4]:
import torch
out = default_quant_conv(torch.randn(1, 2, 5, 5))
out
[4]:
tensor([[[[-0.2594, 0.5392, 0.5916],
[ 0.3493, 0.6813, 0.2499],
[ 1.3732, 0.1229, -0.0084]],
[[ 0.0031, -0.1702, 0.1069],
[-0.8181, -0.8056, 0.0385],
[-0.4738, 0.0589, 0.1278]],
[[-0.1718, -0.1162, -0.1526],
[-0.9903, -0.3541, 0.1645],
[ 0.0557, -0.4458, -0.2080]]]], grad_fn=<ThnnConv2DBackward0>)
In this case we are computing the convolution between an unquantized input tensor and quantized weights, so the output in general is unquantized.
A QuantConv2d with quantization disabled everywhere behaves like a standard Conv2d. Again, we can easily verify this:
[5]:
from torch.nn import Conv2d
torch.manual_seed(0) # set a seed to make sure the random weight init is reproducible
disabled_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=False, weight_quant=None)
torch.manual_seed(0) # reproduce the same random weight init as above
float_conv = Conv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=False)
inp = torch.randn(1, 2, 5, 5)
assert torch.isclose(disabled_quant_conv(inp), float_conv(inp)).all().item()
QuantTensor#
We can directly observe the quantized weights by calling the weight quantizer on the layer’s weights, i.e. default_quant_conv.weight_quant(default_quant_conv.weight), which for shortness is already implemented as default_quant_conv.quant_weight():
[6]:
default_quant_conv.quant_weight()
[6]:
QuantTensor(value=tensor([[[[-0.0790, 0.0503, -0.0934],
[-0.1149, -0.1903, -0.1329],
[-0.1813, 0.0108, 0.0593]],
[[ 0.0970, -0.0215, -0.0144],
[ 0.2280, 0.1239, -0.0090],
[ 0.1957, -0.2011, -0.0108]]],
[[[-0.0018, -0.1957, 0.1993],
[-0.0359, 0.1778, -0.1400],
[ 0.0916, 0.1059, 0.2173]],
[[-0.1670, 0.1939, -0.2191],
[-0.0215, 0.1688, -0.1383],
[-0.0449, -0.1185, 0.1742]]],
[[[-0.0808, -0.1652, -0.0233],
[-0.0700, 0.0467, -0.0485],
[ 0.1059, 0.1418, 0.1077]],
[[-0.0593, 0.0108, 0.0036],
[-0.1508, 0.0808, 0.1616],
[ 0.0144, -0.0287, -0.1365]]]], grad_fn=<MulBackward0>), scale=tensor(0.0018, grad_fn=<DivBackward0>), zero_point=tensor(0.), bit_width=tensor(8.), signed_t=tensor(True), training_t=tensor(True))
Notice how the quantized weights are wrapped in a data structure implemented by Brevitas called QuantTensor. A QuantTensor is a way to represent an affine quantized tensor together with all its metadata: the value of the quantized tensor in dequantized format, its scale, zero_point and bit_width, whether the quantized value is signed or not, and whether the tensor was generated in training mode.
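For example (a small sketch reusing the layer above), the individual metadata fields can be accessed directly on the returned QuantTensor:
qw = default_quant_conv.quant_weight()
# Each piece of metadata is directly accessible as a field; the signedness and
# training flags are also visible in the repr above as signed_t and training_t.
print(qw.scale)
print(qw.zero_point)
print(qw.bit_width)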
As expected, the quantized value (in dequantized format) can be computed from its integer representation, together with the zero-point and scale:
[7]:
int_weight = default_quant_conv.int_weight()
zero_point = default_quant_conv.quant_weight_zero_point()
scale = default_quant_conv.quant_weight_scale()
quant_weight_manually = (int_weight - zero_point) * scale
assert default_quant_conv.quant_weight().value.isclose(quant_weight_manually).all().item()
A valid QuantTensor correctly populates all its fields with values != None and respects the affine quantization invariant, i.e. value / scale + zero_point is (accounting for rounding errors) an integer that can be represented within the interval defined by the bit_width and signed fields of the QuantTensor. A non-valid one doesn't. We can observe that the quantized weights are indeed marked as valid:
[8]:
assert default_quant_conv.quant_weight().is_valid
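To make the invariant concrete, here is a manual sketch of the same kind of check (is_valid performs a more thorough version of this internally):
qw = default_quant_conv.quant_weight()
int_repr = qw.value / qw.scale + qw.zero_point
# The rescaled values should land on integers (up to rounding errors)...
assert torch.isclose(int_repr, int_repr.round(), atol=1e-4).all().item()
# ...that fit in the signed 8-bit range [-128, 127].
assert (int_repr.min() >= -128).item() and (int_repr.max() <= 127).item()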
Calling is_valid is relatively expensive, so it should be used sparingly, but there are a few cases where a non-valid QuantTensor might be generated that it is important to be aware of. Say we have two QuantTensors produced by the same quantized activation, and we want to sum them together:
[10]:
from brevitas.quant_tensor import QuantTensor
quant_act = QuantIdentity(return_quant_tensor=True)
out_tensor_0 = quant_act(torch.randn(1,2,5,5))
out_tensor_1 = quant_act(torch.randn(1,2,5,5))
assert out_tensor_0.is_valid
assert out_tensor_1.is_valid
print(out_tensor_0.scale)
print(out_tensor_1.scale)
tensor(0.0173, grad_fn=<DivBackward0>)
tensor(0.0307, grad_fn=<DivBackward0>)
Both QuantTensors are valid, but since the quantized activation is in training mode by default, their scale factors are going to be different. It is important to note that the behaviour is different at evaluation time, where the two scale factors will be the same.
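As a quick sketch of that evaluation-time behaviour (assuming the default runtime-statistics based scaling), we can switch the activation to eval mode and check that repeated calls now share the same scale:
quant_act.eval()
eval_out_0 = quant_act(torch.randn(1, 2, 5, 5))
eval_out_1 = quant_act(torch.randn(1, 2, 5, 5))
# In eval mode the scale comes from the collected statistics, so it's identical across calls.
assert eval_out_0.scale.isclose(eval_out_1.scale).item()
quant_act.train()  # switch back to training mode for what follows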
[11]:
out_tensor = out_tensor_0 + out_tensor_1
out_tensor
[11]:
QuantTensor(value=tensor([[[[ 0.9489, -0.9111, -0.0536, 0.5788, 0.3645],
[ 0.3401, 1.4325, 0.6498, 0.6411, -1.4390],
[-1.9029, 0.7012, 0.1591, 1.9235, 0.5883],
[-2.7258, 2.5330, 0.9165, -0.0820, 3.4148],
[-0.3651, 1.0164, 0.9567, -0.2758, -1.1376]],
[[-0.2414, 2.2111, -1.9124, -2.3814, -0.8805],
[ 1.3191, -0.8965, -0.2048, -3.8113, 1.1142],
[-0.3381, -0.2238, 1.2661, 0.0068, 0.2567],
[ 0.0731, -0.4280, 0.0909, 0.0875, -1.6851],
[-0.7744, -1.4127, -0.8143, 1.3557, -0.2802]]]],
grad_fn=<AddBackward0>), scale=tensor(0.0240, grad_fn=<DivBackward0>), zero_point=tensor(0.), bit_width=tensor(9.), signed_t=tensor(True), training_t=tensor(True))
Because we set training to True for both of them, we are allowed to sum them even though they have different scale factors. The output QuantTensor will have the correct bit_width, and a scale which is the average of the two original scale factors. This is done only at training time, in order to propagate gradient information; the consequence, however, is that the resulting QuantTensor is no longer valid:
[12]:
assert not out_tensor.is_valid
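We can also sketch a check of the averaging behaviour described above:
# The scale of the sum is the average of the two original scale factors.
expected_scale = (out_tensor_0.scale + out_tensor_1.scale) / 2
assert out_tensor.scale.isclose(expected_scale).item()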
QuantTensor implements __torch_function__ to handle being called from torch functional operators (e.g. ops under torch.nn.functional). Passing a QuantTensor to supported ops that are invariant to quantization, e.g. max-pooling, preserves the validity of the QuantTensor. Example:
[108]:
import torch
quant_identity = QuantIdentity(return_quant_tensor=True)
quant_tensor = quant_identity(torch.randn(1, 3, 4, 4))
torch.nn.functional.max_pool2d(quant_tensor, kernel_size=2, stride=2)
[108]:
QuantTensor(value=tensor([[[[1.5800, 1.0157],
[1.4445, 0.8577]],
[[0.5643, 1.2414],
[1.0383, 0.9028]],
[[0.5191, 0.6546],
[2.1442, 0.5868]]]], grad_fn=<MaxPool2DWithIndicesBackward0>), scale=tensor(0.0226, grad_fn=<DivBackward0>), zero_point=tensor(0.), bit_width=tensor(8.), signed_t=tensor(True), training_t=tensor(True))
For ops that are not invariant to quantization, a QuantTensor decays into a floating-point torch.Tensor. Example:
[109]:
torch.tanh(quant_tensor)
[109]:
tensor([[[[-0.4943, -0.9938, -0.9073, 0.7681],
[-0.3262, 0.9186, 0.1786, 0.3659],
[ 0.7489, 0.8946, -0.0451, -0.5594],
[-0.1346, -0.4943, -0.4770, 0.6951]],
[[ 0.0676, 0.5111, 0.4943, 0.8459],
[-0.8990, -0.9426, 0.0676, -0.7945],
[-0.9220, 0.0676, -0.5594, 0.6321],
[-0.0676, 0.7772, 0.7177, -0.4414]],
[[ 0.4770, 0.2220, 0.0676, 0.5747],
[-0.0451, -0.6710, -0.4594, -0.3462],
[ 0.9729, -0.7177, -0.5896, -0.5276],
[-0.0900, 0.8852, 0.5276, -0.4414]]]], grad_fn=<TanhBackward0>)
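A small sketch makes the decay explicit: the result of torch.tanh is a plain torch.Tensor rather than a QuantTensor:
tanh_out = torch.tanh(quant_tensor)
# The quantization metadata is dropped; only the floating-point values remain.
assert not isinstance(tanh_out, QuantTensor)
assert isinstance(tanh_out, torch.Tensor)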
Input Quantization#
We can obtain a valid output QuantTensor by making sure that both input and weight of QuantConv2d are quantized. To do so, we can set a quantizer for input_quant. In this example we pick a signed 8-bit quantizer with a per-tensor floating-point scale factor:
[110]:
from brevitas.quant.scaled_int import Int8ActPerTensorFloat
input_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=False,
input_quant=Int8ActPerTensorFloat, return_quant_tensor=True)
out_tensor = input_quant_conv(torch.randn(1, 2, 5, 5))
out_tensor
[110]:
QuantTensor(value=tensor([[[[ 0.9693, -0.9431, 0.2459],
[ 0.5416, 0.9037, -0.5278],
[-0.6207, -1.3578, -0.4815]],
[[ 0.4551, -1.4065, 0.8889],
[-0.3393, 0.0803, -0.1748],
[-0.0977, 0.6284, -0.7193]],
[[ 0.3655, 0.7626, -0.2634],
[-0.3453, 0.3349, 0.1923],
[ 0.5993, -0.9579, 0.3557]]]], grad_fn=<ThnnConv2DBackward0>), scale=tensor([[[[3.2208e-05]]]], grad_fn=<MulBackward0>), zero_point=tensor(0.), bit_width=tensor(21.), signed_t=tensor(True), training_t=tensor(True))
[111]:
assert out_tensor.is_valid
[111]:
True
What happens internally is that the input tensor passed to input_quant_conv is quantized before being passed to the convolution operator. That means we are now computing a convolution between two quantized tensors, which implies that the output of the operation is also quantized. As expected, then, out_tensor is marked as valid.
Another important thing to notice is how the bit_width field of out_tensor is relatively high at 21 bits. In Brevitas, the assumption is always that the output bit-width of an operator reflects the worst-case size of the accumulator required by that operation. In other terms, given the size of the input and weight tensors and their bit-widths, 21 is the bit-width that would be required to represent the largest possible output value that could be generated. This makes sure that the affine quantization invariant is always respected.
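As a back-of-the-envelope check of the 21-bit figure (a sketch; the exact formula Brevitas uses may differ in minor details), we can bound the worst-case accumulator size from the shapes and bit-widths involved:
import math

max_abs_product = 255 * 255       # worst-case magnitude of an 8-bit x 8-bit product
accumulated_terms = 3 * 3 * 2     # kernel_height * kernel_width * in_channels
worst_case_sum = max_abs_product * accumulated_terms
print(math.ceil(math.log2(worst_case_sum)))  # 21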
We could have obtained a similar result by directly passing a QuantTensor as input. In this example we are defining a QuantTensor ourselves, but it could also be the output of a previous layer.
[112]:
from brevitas.quant_tensor import QuantTensor
scale = 0.0001
bit_width = 8
zero_point = 0.
int_value = torch.randint(low=- 2 ** (bit_width - 1), high=2 ** (bit_width - 1) - 1, size=(1, 2, 5, 5))
quant_value = (int_value - zero_point) * scale
quant_tensor_input = QuantTensor(
quant_value,
scale=torch.tensor(scale),
zero_point=torch.tensor(zero_point),
bit_width=torch.tensor(float(bit_width)),
signed=True,
training=True)
quant_tensor_input
[112]:
QuantTensor(value=tensor([[[[ 5.7000e-03, 2.5000e-03, -1.2400e-02, -7.2000e-03, 3.7000e-03],
[-2.3000e-03, 7.0000e-04, -1.2700e-02, 5.2000e-03, 4.0000e-04],
[-7.9000e-03, 9.5000e-03, 6.6000e-03, 5.4000e-03, 2.5000e-03],
[ 1.1100e-02, 2.4000e-03, 1.0000e-02, -3.7000e-03, 7.2000e-03],
[-1.1500e-02, -5.8000e-03, -9.3000e-03, 1.0000e-02, 3.5000e-03]],
[[-6.8000e-03, 1.1500e-02, -1.0600e-02, -1.5000e-03, -1.9000e-03],
[ 2.9000e-03, 9.5000e-03, 7.2000e-03, -3.7000e-03, 7.7000e-03],
[-2.4000e-03, -8.9000e-03, -1.2000e-02, -8.1000e-03, 7.2000e-03],
[-1.1300e-02, -9.7000e-03, -1.0000e-03, 1.0100e-02, 3.8000e-03],
[-1.1900e-02, 6.9000e-03, 8.3000e-03, 1.0000e-04, -6.9000e-03]]]]), scale=tensor(1.0000e-04), zero_point=tensor(0.), bit_width=tensor(8.), signed_t=tensor(True), training_t=tensor(True))
[113]:
assert quant_tensor_input.is_valid
[113]:
True
Note how we are explicitly forcing value, scale, zero_point and bit_width to be floating-point torch.Tensor, as this is expected by Brevitas but it's currently not enforced automatically at initialization time.
If we now pass quant_tensor_input to return_quant_conv, we will see that indeed the output is a valid 21-bit QuantTensor:
[114]:
return_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=False, return_quant_tensor=True)
out_tensor = return_quant_conv(quant_tensor_input)
out_tensor
[114]:
QuantTensor(value=tensor([[[[ 0.0085, 0.0066, 0.0050],
[-0.0038, -0.0009, -0.0115],
[-0.0055, -0.0037, 0.0009]],
[[ 0.0015, -0.0027, -0.0079],
[-0.0034, -0.0060, 0.0043],
[-0.0008, 0.0052, -0.0033]],
[[-0.0015, 0.0082, -0.0038],
[-0.0021, 0.0004, -0.0054],
[-0.0021, -0.0079, 0.0013]]]], grad_fn=<ThnnConv2DBackward0>), scale=tensor([[[[1.8448e-07]]]], grad_fn=<MulBackward0>), zero_point=tensor(0.), bit_width=tensor(21.), signed_t=tensor(True), training_t=tensor(True))
[115]:
assert out_tensor.is_valid
[115]:
True
We can also pass an input QuantTensor to a layer that has input_quant enabled. In that case, the input gets re-quantized:
[116]:
input_quant_conv(quant_tensor_input)
[116]:
QuantTensor(value=tensor([[[[-0.0035, -0.0037, -0.0050],
[ 0.0010, -0.0051, -0.0027],
[-0.0010, 0.0047, 0.0017]],
[[ 0.0021, 0.0002, 0.0027],
[ 0.0028, 0.0002, -0.0044],
[ 0.0008, -0.0052, -0.0024]],
[[ 0.0010, -0.0052, -0.0011],
[-0.0018, 0.0024, 0.0011],
[-0.0001, 0.0039, 0.0035]]]], grad_fn=<ThnnConv2DBackward0>), scale=tensor([[[[1.7410e-07]]]], grad_fn=<MulBackward0>), zero_point=tensor(0.), bit_width=tensor(21.), signed_t=tensor(True), training_t=tensor(True))
Output Quantization#
Let’s now look at what would have happened if we had instead enabled output quantization:
[117]:
from brevitas.quant.scaled_int import Int8ActPerTensorFloat
output_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=False,
output_quant=Int8ActPerTensorFloat, return_quant_tensor=True)
out_tensor = output_quant_conv(torch.randn(1, 2, 5, 5))
out_tensor
[117]:
QuantTensor(value=tensor([[[[ 0.2111, 0.4060, 0.3654],
[-0.7876, 0.8119, -0.9825],
[-0.5115, 0.3979, -0.3248]],
[[ 0.3816, 0.0568, -0.0812],
[ 1.0312, -0.7876, 0.8038],
[-0.3491, -0.4141, 0.0650]],
[[-0.5846, -0.4222, -0.0731],
[-0.7389, 0.5034, -0.2517],
[-0.1624, -0.4385, 0.7308]]]], grad_fn=<MulBackward0>), scale=tensor(0.0081, grad_fn=<DivBackward0>), zero_point=tensor(0.), bit_width=tensor(8.), signed_t=tensor(True), training_t=tensor(True))
[118]:
assert out_tensor.is_valid
[118]:
True
The output is again a valid QuantTensor. However, what happened internally is quite different from before, and the easiest way to see it is to look at the bit_width. In the previous case, the bit_width reflected the size of the output accumulator. In this case instead, we have bit_width=tensor(8.), which is what we expect since output_quant had been set to an Int8 quantizer.
Bias Quantization#
There is an important scenario where the various options we just saw make a practical difference, and it's quantization of bias. In many contexts, such as in the ONNX standard opset and in FINN, bias is assumed to be quantized with a scale factor equal to input scale * weight scale, which means that we need a valid quantized input somehow. A predefined bias quantizer that reflects that assumption is brevitas.quant.scaled_int.Int8Bias. If we simply tried to set it on a QuantConv2d without any sort of input quantization, we would get an error:
[119]:
from brevitas.quant.scaled_int import Int8Bias
bias_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=True,
bias_quant=Int8Bias, return_quant_tensor=True)
bias_quant_conv(torch.randn(1, 2, 5, 5))
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_48365/2280634207.py in <module>
4 in_channels=2, out_channels=3, kernel_size=(3,3), bias=True,
5 bias_quant=Int8Bias, return_quant_tensor=True)
----> 6 bias_quant_conv(torch.randn(1, 2, 5, 5))
/opt/conda/envs/torch_1.10/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []
/workspace/scratch/git/fork_brevitas/src/brevitas/nn/quant_conv.py in forward(self, input)
190
191 def forward(self, input: Union[Tensor, QuantTensor]) -> Union[Tensor, QuantTensor]:
--> 192 return self.forward_impl(input)
193
194 def inner_forward_impl(self, x: Tensor, quant_weight: Tensor, quant_bias: Optional[Tensor]):
/workspace/scratch/git/fork_brevitas/src/brevitas/nn/quant_layer.py in forward_impl(self, inp)
330
331 if self.bias is not None:
--> 332 quant_bias = self.bias_quant(self.bias, output_scale, output_bit_width)
333 if not self.training and self.cache_inference_quant_bias:
334 self._cached_bias = _CachedIO(quant_bias.detach(), metadata_only=False)
/opt/conda/envs/torch_1.10/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []
/workspace/scratch/git/fork_brevitas/src/brevitas/proxy/parameter_quant.py in forward(self, x, input_scale, input_bit_width)
160 impl = self.export_handler if self.export_mode else self.tensor_quant
161 if self.requires_input_scale and input_scale is None:
--> 162 raise RuntimeError("Input scale required")
163 if self.requires_input_bit_width and input_bit_width is None:
164 raise RuntimeError("Input bit-width required")
RuntimeError: Input scale required
We can solve the issue by passing in a valid QuantTensor, e.g. the quant_tensor_input we defined above:
[120]:
bias_quant_conv(quant_tensor_input)
[120]:
QuantTensor(value=tensor([[[[ 0.0005, 0.0043, -0.0004],
[ 0.0005, 0.0106, 0.0012],
[ 0.0021, 0.0007, -0.0050]],
[[-0.0067, -0.0035, -0.0059],
[-0.0050, -0.0015, -0.0039],
[ 0.0015, 0.0028, -0.0008]],
[[-0.0051, -0.0050, 0.0060],
[-0.0015, 0.0037, 0.0071],
[ 0.0067, 0.0035, -0.0071]]]], grad_fn=<ThnnConv2DBackward0>), scale=tensor([[[[1.8108e-07]]]], grad_fn=<MulBackward0>), zero_point=tensor(0.), bit_width=tensor(22.), signed_t=tensor(True), training_t=tensor(True))
Or by enabling input quantization and then passing in either a floating-point torch.Tensor or a QuantTensor:
[121]:
input_bias_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=True,
input_quant=Int8ActPerTensorFloat, bias_quant=Int8Bias, return_quant_tensor=True)
input_bias_quant_conv(torch.randn(1, 2, 5, 5))
[121]:
QuantTensor(value=tensor([[[[-0.3825, 0.1371, 0.9135],
[-0.2016, 0.7495, -0.4071],
[-0.0755, 0.5283, 0.2388]],
[[ 0.0788, -0.3802, -0.2234],
[ 0.8678, -0.5546, 0.4408],
[-0.6788, 0.4422, 0.3007]],
[[ 0.4412, -0.3205, 1.0033],
[-0.0083, -0.3295, -0.2076],
[ 0.4417, -0.1046, -0.3493]]]], grad_fn=<ThnnConv2DBackward0>), scale=tensor([[[[3.8610e-05]]]], grad_fn=<MulBackward0>), zero_point=tensor(0.), bit_width=tensor(22.), signed_t=tensor(True), training_t=tensor(True))
[122]:
input_bias_quant_conv(quant_tensor_input)
[122]:
QuantTensor(value=tensor([[[[ 0.0036, 0.0024, -0.0033],
[ 0.0050, 0.0080, -0.0014],
[-0.0036, -0.0080, -0.0029]],
[[ 0.0083, -0.0093, 0.0048],
[ 0.0035, 0.0015, -0.0011],
[-0.0003, 0.0067, 0.0013]],
[[-0.0009, -0.0019, 0.0039],
[ 0.0010, 0.0056, -0.0037],
[ 0.0091, -0.0095, 0.0054]]]], grad_fn=<ThnnConv2DBackward0>), scale=tensor([[[[1.8384e-07]]]], grad_fn=<MulBackward0>), zero_point=tensor(0.), bit_width=tensor(22.), signed_t=tensor(True), training_t=tensor(True))
Notice how the output has bit_width=tensor(22.). This is because, in the worst case, summing a 21-bit integer (the size of the accumulator before bias is added) and an 8-bit integer (the size of the quantized bias) gives a 22-bit integer.
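The same back-of-the-envelope reasoning as before applies here (again a sketch, not the exact Brevitas formula):
import math

acc_max = 2 ** (21 - 1)   # largest magnitude representable by the 21-bit signed accumulator
bias_max = 2 ** (8 - 1)   # largest magnitude of the 8-bit signed quantized bias
# One extra bit accounts for the sign.
print(math.ceil(math.log2(acc_max + bias_max)) + 1)  # 22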
Let’s now try to enable output quantization instead of input quantization. That wouldn't have solved the problem with bias quantization, as output quantization is performed after the bias is added:
[123]:
output_bias_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=True,
output_quant=Int8ActPerTensorFloat, bias_quant=Int8Bias, return_quant_tensor=True)
output_bias_quant_conv(torch.randn(1, 2, 5, 5))
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_48365/2990591641.py in <module>
2 in_channels=2, out_channels=3, kernel_size=(3,3), bias=True,
3 output_quant=Int8ActPerTensorFloat, bias_quant=Int8Bias, return_quant_tensor=True)
----> 4 output_bias_quant_conv(torch.randn(1, 2, 5, 5))
/opt/conda/envs/torch_1.10/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []
/workspace/scratch/git/fork_brevitas/src/brevitas/nn/quant_conv.py in forward(self, input)
190
191 def forward(self, input: Union[Tensor, QuantTensor]) -> Union[Tensor, QuantTensor]:
--> 192 return self.forward_impl(input)
193
194 def inner_forward_impl(self, x: Tensor, quant_weight: Tensor, quant_bias: Optional[Tensor]):
/workspace/scratch/git/fork_brevitas/src/brevitas/nn/quant_layer.py in forward_impl(self, inp)
330
331 if self.bias is not None:
--> 332 quant_bias = self.bias_quant(self.bias, output_scale, output_bit_width)
333 if not self.training and self.cache_inference_quant_bias:
334 self._cached_bias = _CachedIO(quant_bias.detach(), metadata_only=False)
/opt/conda/envs/torch_1.10/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []
/workspace/scratch/git/fork_brevitas/src/brevitas/proxy/parameter_quant.py in forward(self, x, input_scale, input_bit_width)
160 impl = self.export_handler if self.export_mode else self.tensor_quant
161 if self.requires_input_scale and input_scale is None:
--> 162 raise RuntimeError("Input scale required")
163 if self.requires_input_bit_width and input_bit_width is None:
164 raise RuntimeError("Input bit-width required")
RuntimeError: Input scale required
Not all scenarios require bias quantization to depend on the scale factor of the input. In those cases, biases can be quantized the same way weights are quantized, with their own scale factor. In Brevitas, a predefined quantizer that reflects this other scenario is Int8BiasPerTensorFloatInternalScaling. In this case a valid quantized input is not required:
[124]:
from brevitas.quant.scaled_int import Int8BiasPerTensorFloatInternalScaling
bias_internal_scale_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=True,
bias_quant=Int8BiasPerTensorFloatInternalScaling, return_quant_tensor=False)
bias_internal_scale_quant_conv(torch.randn(1, 2, 5, 5))
[124]:
tensor([[[[ 0.2152, 0.8346, 0.0746],
[-0.0738, -0.5212, 0.1019],
[-0.6004, 0.1500, -0.1453]],
[[-1.1551, -1.3458, -0.1312],
[ 0.2502, -0.5267, 0.2412],
[-0.3556, -0.3289, -0.2276]],
[[-0.4599, -0.6094, 0.4682],
[-0.5064, -0.6768, -0.6638],
[ 0.0066, -0.3581, 0.2359]]]], grad_fn=<ThnnConv2DBackward0>)
There are a couple of situations to be aware of concerning bias quantization that can lead to changes in the output zero_point.
Let's consider the scenario where we compute the convolution between a quantized input tensor and quantized weights. In the first case, we then add an unquantized bias on top of the output. In the second one, we add a bias quantized with its own scale factor, e.g. with the Int8BiasPerTensorFloatInternalScaling quantizer. In both cases, in order to make sure the output QuantTensor is valid (i.e. the affine quantization invariant is respected), the output zero_point becomes non-zero:
[125]:
unquant_bias_input_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=True,
input_quant=Int8ActPerTensorFloat, return_quant_tensor=True)
out_tensor = unquant_bias_input_quant_conv(torch.randn(1, 2, 5, 5))
out_tensor
[125]:
QuantTensor(value=tensor([[[[-0.6879, -0.6632, -0.2411],
[ 0.2064, -0.7371, 0.3910],
[ 0.9533, 0.2994, 0.6546]],
[[-0.4684, -0.4495, -0.5021],
[ 0.5738, 0.4199, -0.3380],
[ 0.6218, -0.0408, -0.8483]],
[[-0.5625, 0.1837, -1.0575],
[-1.2816, -0.4993, -0.3409],
[ 0.4556, -1.4269, 0.5369]]]], grad_fn=<ThnnConv2DBackward0>), scale=tensor([[[[3.0975e-05]]]], grad_fn=<MulBackward0>), zero_point=tensor([[[[ 1276.0774]],
[[-3152.4585]],
[[ 7320.2324]]]], grad_fn=<DivBackward0>), bit_width=tensor(21.), signed_t=tensor(True), training_t=tensor(True))
[126]:
assert out_tensor.is_valid
[126]:
True
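The second case described above (a bias quantized with its own internal scale) behaves the same way; here is a minimal sketch, reusing the quantizers imported earlier:
internal_bias_input_quant_conv = QuantConv2d(
    in_channels=2, out_channels=3, kernel_size=(3,3), bias=True,
    input_quant=Int8ActPerTensorFloat,
    bias_quant=Int8BiasPerTensorFloatInternalScaling,
    return_quant_tensor=True)
internal_out_tensor = internal_bias_input_quant_conv(torch.randn(1, 2, 5, 5))
# The bias scale is unrelated to input_scale * weight_scale, so the output
# zero_point is expected to be non-zero for the invariant to hold.
print(internal_out_tensor.zero_point)
assert internal_out_tensor.is_valid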
Finally, an important point about QuantTensor. With the exception of learned bit-width (which will be the subject of a separate tutorial) and some of the bias quantization scenarios we have just seen, returning a QuantTensor is usually not necessary and can create extra complexity. This is why return_quant_tensor currently defaults to False. We can easily see it in an example:
[127]:
bias_input_quant_conv = QuantConv2d(
in_channels=2, out_channels=3, kernel_size=(3,3), bias=True,
input_quant=Int8ActPerTensorFloat, bias_quant=Int8Bias)
bias_input_quant_conv(torch.randn(1, 2, 5, 5))
[127]:
tensor([[[[ 0.8357, 0.0733, 0.9527],
[ 0.1803, 0.2154, 0.7598],
[ 1.1121, -0.8728, 1.0039]],
[[ 0.7917, 1.0063, 0.6516],
[-0.1852, -0.7263, 0.0956],
[-0.1876, 0.2747, -0.1617]],
[[ 0.8299, 0.9934, -0.3821],
[ 0.4865, 0.9309, -0.7924],
[-0.4201, 0.2343, 0.1532]]]], grad_fn=<ThnnConv2DBackward0>)
Although not obvious, the output is actually implicitly quantized.
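One way to convince ourselves of this (a sketch, rebuilding the same configuration with return_quant_tensor=True) is to expose the metadata that the plain torch.Tensor output above hides:
revealing_conv = QuantConv2d(
    in_channels=2, out_channels=3, kernel_size=(3,3), bias=True,
    input_quant=Int8ActPerTensorFloat, bias_quant=Int8Bias,
    return_quant_tensor=True)
quant_out = revealing_conv(torch.randn(1, 2, 5, 5))
# Same computation as above, but now scale, zero_point and bit_width are visible.
print(quant_out.scale, quant_out.zero_point, quant_out.bit_width)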