Microsoft Unveils Florence-2: A Unified Vision Model
Microsoft has introduced Florence-2, a groundbreaking vision foundation model designed to handle a diverse array of vision and vision-language tasks. This model stands out by using a unified, prompt-based approach that allows it to perform multiple tasks such as image captioning, object detection, visual grounding, and segmentation with remarkable efficiency and accuracy. Florence-2 leverages a large-scale dataset, FLD-5B, which includes 5.4 billion annotations across 126 million images, to achieve its versatile capabilities.
Florence-2's architecture incorporates a sequence-to-sequence structure with a DaViT vision encoder that converts images into visual token embeddings. These embeddings are combined with text embeddings generated by BERT and processed through a transformer-based multi-modal encoder-decoder. This setup enables the model to understand and execute various vision tasks through textual prompts, making it adaptable and powerful despite its compact size. The model comes in two versions, with 232 million and 771 million parameters, allowing it to outperform many larger models in zero-shot and fine-tuning scenarios.
One of the significant advancements with Florence-2 is its ability to perform on par with, or better than, specialized models across different vision tasks. For example, in zero-shot captioning tests on the COCO dataset, both versions of Florence-2 outperformed DeepMind's Flamingo visual language model and Microsoft's own Kosmos-2 model. This performance is attributed to the comprehensive and diverse annotations in the FLD-5B dataset, which were generated using an iterative strategy of automated image annotation and model refinement.
The release of Florence-2 under the permissive MIT license on platforms like Hugging Face signifies a step forward in making advanced AI models accessible for a wider range of applications. This model's versatility and efficiency are expected to significantly reduce the need for multiple task-specific vision models, streamlining the development process for applications in various fields, from automated image analysis to advanced visual comprehension systems.
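To make the prompt-based interface concrete, here is a minimal sketch of loading Florence-2 from Hugging Face and switching tasks by changing only the text prompt. The model IDs and task tokens follow the public Florence-2 model card; the exact API surface (notably `post_process_generation`) is an assumption based on that card and may change:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Task tokens from the Florence-2 model card: each vision task is selected
# by a single text prompt, illustrating the unified interface.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}

def run_task(model, processor, image: Image.Image, task: str, extra_text: str = ""):
    """Run one Florence-2 task on an image and return the parsed output."""
    prompt = TASK_PROMPTS[task] + extra_text
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Converts the raw token stream into task-specific output
    # (a caption string, or boxes plus labels for detection).
    return processor.post_process_generation(
        raw, task=TASK_PROMPTS[task], image_size=(image.width, image.height)
    )

if __name__ == "__main__":
    # Downloads the weights on first run; trust_remote_code is needed
    # because the model ships custom modeling code.
    model_id = "microsoft/Florence-2-base"
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    image = Image.new("RGB", (640, 480), "white")  # placeholder image
    print(run_task(model, processor, image, "caption"))
```

Swapping `"caption"` for `"object_detection"` reuses the same model and weights; only the prompt token changes, which is the design point the article highlights.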