This article describes two example Python functions for post-training dynamic range quantization in TensorFlow. It also presents a prototype inference API that makes quantized model deployments easier by making them plug-and-play with the TensorFlow Keras inference API.
The first function converts a floating point model to a dynamic range quantized model, and the second function runs inference with the dynamic range quantized model.
Dynamic range quantization reduces the memory footprint and the inference time of a model. It can achieve about a 4x reduction in model size, since 32-bit floating point weights are stored as 8-bit integers, along with a 2x to 3x speed-up in inference.
Post-training dynamic range model quantization
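A minimal sketch of such a conversion function, assuming a trained Keras model object. The function name `convert_to_dynamic_range_tflite` and the output path are illustrative and not taken from the original code; the converter calls are the standard `tf.lite.TFLiteConverter` ones.

```python
def convert_to_dynamic_range_tflite(keras_model, output_path):
    """Convert a float Keras model to a dynamic range quantized .tflite file."""
    # Imported inside the function so this module loads even without TensorFlow.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    # Optimize.DEFAULT with no representative dataset selects dynamic range
    # quantization: the weights are stored as 8-bit integers at conversion time.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_bytes = converter.convert()

    # The converter returns the serialized model as bytes; write it to disk.
    with open(output_path, "wb") as f:
        f.write(tflite_bytes)
    return output_path
```

Calling `convert_to_dynamic_range_tflite(model, "model_quant.tflite")` on a trained model would then write the quantized file, whose size can be compared with the original floating point model.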
Inference using post-training dynamic range quantized model
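The inference steps can be sketched as follows, using `tf.lite.Interpreter`. This is an illustration under stated assumptions, not the original listing: the helper name `run_tflite_inference` is hypothetical, and the model is assumed to have a single input and a single output with a batch dimension of 1.

```python
def run_tflite_inference(model_path, sample):
    """Run one input sample through a .tflite model and return its output."""
    # Imported inside the function so this module loads even without TensorFlow.
    import numpy as np
    import tensorflow as tf

    # Load the quantized model and allocate memory for its tensors.
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()[0]
    output_details = interpreter.get_output_details()[0]

    # Cast the sample to the dtype the model expects and add the batch axis.
    batch = np.expand_dims(
        np.asarray(sample, dtype=input_details["dtype"]), axis=0
    )
    interpreter.set_tensor(input_details["index"], batch)
    interpreter.invoke()

    # get_tensor returns a copy; drop the batch axis before returning.
    return interpreter.get_tensor(output_details["index"])[0]
```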
Here, the floating point model weights are quantized statically to 8-bit integers at conversion time. To reduce inference time, the activations are also quantized to 8 bits dynamically, based on their observed range at inference time.
Computations involving the weights and the activations are therefore performed with 8-bit integers. The user also does not need to provide a representative dataset to calibrate the quantized model.
Note that the outputs are still stored as floating point values. The trade-off is that dynamic range operations give a slightly smaller speed-up than full fixed-point computation, in which the weights, activations and outputs are all stored as integers.
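To make the arithmetic concrete, here is a simplified pure-Python sketch of the idea described above. It is not how the TensorFlow Lite kernels are implemented: it assumes a single symmetric scale per tensor, and the helper names are hypothetical. The weights are quantized once, the activations are quantized when they arrive, the multiply-accumulate runs on integers, and the result is scaled back to floating point.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one scale for the whole list."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dynamic_range_dot(weights_q, weight_scale, activations):
    """Dot product with pre-quantized weights and run-time quantized activations."""
    # The activation scale depends on the range observed for this input.
    acts_q, act_scale = quantize_symmetric(activations)
    # Integer multiply-accumulate.
    acc = sum(w * a for w, a in zip(weights_q, acts_q))
    # The output is converted back to floating point.
    return acc * weight_scale * act_scale


weights = [0.5, -1.0, 0.25]
weights_q, weight_scale = quantize_symmetric(weights)  # once, at conversion time
activations = [2.0, 1.0, -4.0]                         # known only at inference time

approx = dynamic_range_dot(weights_q, weight_scale, activations)
exact = sum(w * a for w, a in zip(weights, activations))
print("quantized weights:", weights_q)
print("quantized result:", approx, "float result:", exact)
```

The quantized result is close to the floating point result but not identical, which is the accuracy cost paid for the smaller model and faster integer arithmetic.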
A simple inference API for the quantized model
The key challenge in making quantized models production-ready is adapting the inference pipeline to the slightly different prediction steps that a quantized model requires.
An inference API such as the simple prototype outlined above, built around the TFLiteModel() class, makes the code compatible with the TensorFlow Keras inference API.
Writing plug-and-play APIs that are compatible with regular deep neural network models, especially for inference pipelines, can make production deployments of quantized models much simpler.
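As a rough illustration of this idea, a wrapper class could hide the interpreter steps behind a Keras-style `predict()` method. The sketch below is a hypothetical reconstruction and should not be read as the original `TFLiteModel()` implementation. It assumes a model with one input and one output whose batch dimension is 1, so samples are fed one at a time.

```python
class TFLiteModel:
    """Hypothetical Keras-like wrapper around a .tflite model file."""

    def __init__(self, model_path):
        # Imported here so the class can be defined without TensorFlow installed.
        import tensorflow as tf

        self._interpreter = tf.lite.Interpreter(model_path=model_path)
        self._interpreter.allocate_tensors()
        self._input = self._interpreter.get_input_details()[0]
        self._output = self._interpreter.get_output_details()[0]

    def predict(self, x):
        """Return predictions for a batch, mirroring Keras model.predict(x)."""
        import numpy as np

        x = np.asarray(x, dtype=self._input["dtype"])
        outputs = []
        for sample in x:
            # Feed one sample at a time with an explicit batch axis of 1.
            self._interpreter.set_tensor(
                self._input["index"], sample[np.newaxis, ...]
            )
            self._interpreter.invoke()
            outputs.append(self._interpreter.get_tensor(self._output["index"])[0])
        # Stack per-sample outputs back into a single batch array.
        return np.stack(outputs)
```

With such a wrapper, existing code that calls `model.predict(x_test)` on a Keras model could be pointed at `TFLiteModel("model_quant.tflite")` instead, where the file name is only an example.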
Moad Computer is an actionable insights firm. We provide enterprises with end-to-end artificial intelligence solutions. Actionable Insights blog is a quick overview of things we are most excited about.