cross-posted from: https://lemmy.world/post/1134694
KOSMOS-2: Microsoft’s New AI Breakthrough Generating Text, Images, Video & Sound in Real-Time!
Microsoft has unveiled its latest AI breakthrough, KOSMOS-2, which can generate text, images, video, and sound in real-time[1]. This multimodal large language model (MLLM) is grounded in the real world: it can tie the phrases in its text output to specific regions of an image[4]. It was trained on a large-scale dataset of grounded image-text pairs called GrIT[2].
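If you're wondering what "grounded" looks like in practice, here's a minimal sketch using the Hugging Face transformers port of the weights. The microsoft/kosmos-2-patch14-224 checkpoint, the snowman test image, and the exact prompt are my own assumptions (they come from the Hugging Face side, not from this announcement), so treat it as illustrative rather than official.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

# Load the Kosmos-2 checkpoint published on the Hugging Face Hub
# (assumed name; not part of Microsoft's announcement text above).
model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

# The <grounding> tag asks the model to emit location tokens alongside the caption.
prompt = "<grounding>An image of"

# Example image hosted alongside the checkpoint (assumed URL).
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    use_cache=True,
    max_new_tokens=128,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into a clean caption plus the grounded entities
# (each entity is a phrase, its character span, and normalized bounding boxes).
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```

The `post_process_generation` step is where the grounding shows up: instead of just a caption, you also get a list of phrases with the bounding boxes they refer to, which is what the GrIT training data teaches the model to produce.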
KOSMOS-2 is a significant step forward in AI technology, thanks to its ability to work across multiple modalities[6]. It could make image and video processing in computer vision applications more efficient, accurate, and accessible[3].
This breakthrough is a testament to Microsoft’s commitment to advancing AI technology and its potential to transform industries across the board. We can’t wait to see what the future holds with KOSMOS-2!
Citations:
[1] https://youtube.com/watch?v=VxsqtoytLsA
[2] https://www.microsoft.com/en-us/research/publication/kosmos-2-grounding-multimodal-large-language-models-to-the-world/
[3] https://azure.microsoft.com/en-us/blog/announcing-a-renaissance-in-computer-vision-ai-with-microsofts-florence-foundation-model/
[4] https://arstechnica.com/information-technology/2023/03/microsoft-unveils-kosmos-1-an-ai-language-model-with-visual-perception-abilities/
[5] https://www.linkedin.com/posts/trishuhl_generativeai-multimodal-ai-activity-7040590986057564160-ImJp
[6] https://www.cjco.com.au/article/news/unleashing-the-power-of-kosmos-2-a-leap-forward-in-ai-tech-with-grounded-multimodal-language-models/
I found this page, which says there's an online demo, but the demo link wouldn't work from my computer. Maybe you'll have more luck.
https://github.com/microsoft/unilm/tree/master/kosmos-2
502 still, but thanks for the link! This is super interesting. I’m curious to see where it goes.