Google has developed a novel, multilingual text vectorizer called RETVec (Resilient & Efficient Text Vectorizer) to help make text classifiers more robust and efficient. RETVec has been used to improve the Gmail spam detection rate by 38% and reduce the false positive rate by 19.4%. It has also reduced the TPU usage of the model by 83%.

RETVec achieves these improvements by combining a novel, highly compact character encoder, an augmentation-driven training regime, and metric learning. It is also open source and available on GitHub.
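To give a feel for what a compact character encoder can look like, here is a minimal sketch that turns each character's Unicode code point into a fixed-width binary vector, so any language can be handled without a vocabulary or preprocessing step. This is an illustration only, not RETVec's actual implementation: the function names `encode_char` and `encode_word`, the 24-bit width, and the 16-character word limit are assumptions chosen for the example.

```python
# Illustrative sketch only: not the RETVec library code.
# BITS and MAX_CHARS are assumed values picked for this example.
BITS = 24
MAX_CHARS = 16


def encode_char(ch: str, bits: int = BITS) -> list[int]:
    """Return the code point of `ch` as `bits` binary digits, least significant bit first."""
    code_point = ord(ch)
    return [(code_point >> i) & 1 for i in range(bits)]


def encode_word(word: str, max_chars: int = MAX_CHARS, bits: int = BITS) -> list[list[int]]:
    """Encode a word as a max_chars x bits matrix, truncating long words and zero-padding short ones."""
    rows = [encode_char(ch, bits) for ch in word[:max_chars]]
    padding = [[0] * bits for _ in range(max_chars - len(rows))]
    return rows + padding


if __name__ == "__main__":
    matrix = encode_word("héllo")
    print(len(matrix), "rows x", len(matrix[0]), "bits")
```

Because every valid Unicode code point fits in 21 bits, a fixed 24-bit row can represent any character, which is one way to get a small, vocabulary-free input representation. In a real system a learned embedding model would sit on top of such a matrix.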

RETVec has several benefits, including:

  • Improved classification performance: RETVec can help models achieve state-of-the-art classification performance.
  • Reduced computational cost: RETVec can drastically reduce the computational cost of text classification.
  • Out-of-the-box support for all languages: RETVec works out-of-the-box on every language and all UTF-8 characters without the need for text preprocessing.
  • Faster inference speed: Models trained with RETVec exhibit faster inference speed due to its compact representation.
  • Smaller models: Having smaller models reduces computational costs and decreases latency.
  • Seamless conversion to TFLite: Models trained with RETVec can be seamlessly converted to TFLite for mobile and edge devices.
  • TensorFlow.js layer implementation: A JavaScript implementation of RETVec allows models to be deployed on the web via TensorFlow.js.

RETVec is an open-source text vectorizer that allows you to build more resilient and efficient server-side and on-device text classifiers. The Gmail spam filter uses it to help protect Gmail inboxes against malicious emails.

If you are interested in learning more about RETVec, you can read the NeurIPS 2023 paper or check out the demo web page running a RETVec-based model. You can also get started with RETVec by following the tutorial.

Source: Google