2025 | Trends in Artificial Intelligence (252/340)

On the back of Google’s ‘Attention is All You Need’ Transformer research paper in 2017, the first wave of ‘modern AI’ (read: LLMs) focused on text: models such as OpenAI’s GPT-3 and Meta’s Llama-1 showed that teaching computers to finish sentences at scale could unlock broad reasoning abilities. Yet human communication is rarely text-only, and often not even text-first. Images, audio, video, and sensor readings carry context that words alone miss, so researchers at the same companies – and peers like Google, Anthropic, and xAI, among others – began extending language models to handle additional signals.

Multimodal AI models are the result. They embed text, pictures, sound, and video into a shared representation and generate outputs in any of those formats. A single query can reference a paragraph and a diagram, and the model can respond with a spoken summary or an annotated image – without switching systems.

The path to this capability unfolded stepwise: OpenAI’s CLIP paired vision and language in 2021; Meta followed with ImageBind in 2023 and Chameleon in 2024; and by 2024-2025, frontier systems such as GPT-4o, Claude 3, and Chameleon had become fully multimodal. Each new modality forced the models to align meaning across formats rather than optimize for one.

The payoff is practical. A field engineer can aim a phone camera at machinery and receive a plain-language fault diagnosis; a clinician can attach an X-ray to a note and get a structured report draft; and an analyst can combine charts, transcripts, and audio clips in a single query. Compared with text-only models, multimodal systems cut context switching, capture richer detail, and enable applications – quality control, assistive tech, content creation – where visual or auditory information matters as much as words.

Rising Competition = AI Model Releases
