Microsoft Unveils Florence-2: A Unified Vision Model

19 Jun 2024

Microsoft has introduced Florence-2, a vision foundation model designed to handle a wide range of vision and vision-language tasks. It uses a unified, prompt-based approach: a single model performs image captioning, object detection, visual grounding, and segmentation, with the task selected by a text prompt. Florence-2 is trained on a large-scale dataset, FLD-5B, which contains 5.4 billion annotations across 126 million images.
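To give a sense of how densely FLD-5B is annotated, the two figures quoted above can be divided to get a rough average. This is only back-of-the-envelope arithmetic on the rounded numbers in this article, not a statistic reported by Microsoft, and the variable names are chosen for illustration.

```python
# Rounded dataset figures as quoted in the article.
total_annotations = 5_400_000_000  # 5.4 billion annotations
total_images = 126_000_000         # 126 million images

# Rough average number of annotations attached to each image.
annotations_per_image = total_annotations / total_images

print(f"{annotations_per_image:.1f} annotations per image on average")
```

On these rounded figures, each image carries roughly 43 annotations on average.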

Florence-2 uses a sequence-to-sequence architecture. A DaViT vision encoder converts images into visual token embeddings, which are combined with text embeddings generated by BERT and processed by a transformer-based multi-modal encoder-decoder. This setup lets the model carry out different vision tasks in response to textual prompts, making it adaptable despite its compact size. The model comes in two versions, with 232 million and 771 million parameters, and it outperforms many larger models in zero-shot and fine-tuning scenarios.
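The prompt-based interface can be sketched in a few lines of Python. The example below is a minimal sketch, not an official recipe. It assumes the Hugging Face `transformers` library, the `microsoft/Florence-2-base` checkpoint, and the task tokens and processor calls shown on the public model card, any of which may differ between versions. The task names in `TASK_PROMPTS` and the helpers `build_prompt` and `run_task` are hypothetical names introduced for illustration.

```python
# Keys are hypothetical labels chosen for this sketch; the prompt tokens
# are assumed from the Florence-2 model card on Hugging Face.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task, text=""):
    """Return the task token, followed by any extra text the task needs."""
    return TASK_PROMPTS[task] + text


def run_task(image, task, text="", model_id="microsoft/Florence-2-base"):
    """Run one task on a PIL image and return the parsed result.

    Assumes `transformers` and `torch` are installed and that the checkpoint
    can be downloaded; the model ships custom code, hence trust_remote_code.
    """
    # Imported here so build_prompt stays usable without the heavy dependency.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    # The same model handles every task; only the text prompt changes.
    inputs = processor(text=build_prompt(task, text), images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Convert the generated token string into captions, boxes, or polygons.
    return processor.post_process_generation(
        generated_text,
        task=TASK_PROMPTS[task],
        image_size=(image.width, image.height),
    )
```

Under these assumptions, calling `run_task(image, "caption")` and `run_task(image, "object_detection")` on the same image would reuse one set of weights and switch behavior purely through the prompt, which is the point of the unified design described above.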

One of the significant advances in Florence-2 is its ability to perform on par with or better than specialized models across different vision tasks. For example, in zero-shot captioning tests on the COCO dataset, both versions of Florence-2 outperformed DeepMind's Flamingo visual language model and Microsoft's own Kosmos-2 model. This performance is attributed to the comprehensive and diverse annotations in the FLD-5B dataset, which were generated using an iterative strategy of automated image annotation and model refinement.

Florence-2 has been released under the permissive MIT license on platforms such as Hugging Face, making the model available for a wider range of applications. Its versatility and efficiency are expected to reduce the need for multiple task-specific vision models, streamlining development for applications ranging from automated image analysis to advanced visual comprehension systems.