Combining vision and language could be the key to better AI – TechCrunch

Depending on the theory of intelligence you subscribe to, achieving “human” AI will require a system that can take advantage of multiple modalities – for example, sound, vision, and text – to reason about the world. For example, when shown an image of an overturned truck and a police car on a snowy highway, a human-level AI could infer that dangerous road conditions caused an accident. Or, running on a robot and asked to grab a can of soda from the refrigerator, it would navigate among people, furniture, and pets to retrieve the can and place it within reach of the requester.

Today’s AI falls short of this. But new research shows encouraging signs of progress, from robots that can work out the steps needed to satisfy basic commands (e.g., “get a bottle of water”) to text-producing systems that learn from explanations. In this relaunched edition of Deep Science, our weekly series on the latest developments in AI and the wider scientific field, we cover work from DeepMind, Google, and OpenAI that makes progress toward systems that can – if not fully understand the world – solve narrow tasks such as image generation with impressive robustness.

OpenAI’s enhanced DALL-E, DALL-E 2, is easily the most impressive project to emerge from an AI research lab this week. As my colleague Devin Coldewey writes, while the original DALL-E demonstrated remarkable prowess in creating images to match virtually any prompt (e.g., “a dog wearing a beret”), DALL-E 2 goes further. The images it produces are much more detailed, and DALL-E 2 can intelligently replace a given area in an image, for example inserting a painting into a photo of a marble floor complete with the appropriate highlights.


An example of the types of images that DALL-E 2 can generate.

DALL-E 2 received most of the attention this week. But on Thursday, Google researchers detailed an equally impressive visual understanding system called Visually-Driven Prosody for Text-to-Speech – VDTTS – in a post on Google’s AI Blog. VDTTS can generate realistic, lip-synced speech given nothing more than text and video frames of the person speaking.

The speech generated by VDTTS, while not a perfect substitute for recorded dialogue, is still quite good, with compelling, human-like expressiveness and timing. Google envisions it one day being used in a studio to replace original audio that was recorded under noisy conditions.

Of course, visual understanding is just one step on the way to better AI. Another element is language comprehension, which lags behind in many respects, even setting aside the well-documented toxicity and bias issues of AI. In one striking example, a state-of-the-art Google system, Pathways Language Model (PaLM), memorized 40% of the data used to “train” it, according to one paper – leading PaLM to plagiarize text down to copyright notices in code snippets.

Fortunately, DeepMind, the Alphabet-backed AI lab, is among those exploring techniques to solve this problem. In a new study, DeepMind researchers investigate whether AI language systems – which learn to generate text from many existing text examples (think books and social media) – could benefit from being given explanations of those texts. After annotating dozens of language tasks (e.g., “Answer these questions by identifying whether the second sentence is an appropriate paraphrase of the metaphorical first sentence”) with explanations (e.g., “David’s eyes were not literally daggers; this is a metaphor used to imply that David was staring at Paul with ferocity.”) and evaluating the performance of different systems on them, the DeepMind team found that the explanations did indeed improve the systems’ performance.
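To make the idea concrete, here is a minimal sketch of explanation-augmented few-shot prompting of the kind the DeepMind study evaluates. The `build_prompt` helper, the task wording, and the prompt layout are illustrative assumptions, not the paper’s actual format:

```python
def build_prompt(examples, query, with_explanations=True):
    """Assemble a few-shot prompt; optionally append an explanation
    after each worked example's answer (hypothetical format)."""
    parts = []
    for ex in examples:
        block = f"Q: {ex['question']}\nA: {ex['answer']}"
        if with_explanations and ex.get("explanation"):
            block += f"\nExplanation: {ex['explanation']}"
        parts.append(block)
    # The new query is left unanswered for the language model to complete.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

# Example modeled on the metaphor-paraphrase task described above.
examples = [{
    "question": ("Is the second sentence an appropriate paraphrase of the "
                 "metaphorical first sentence? 'David's eyes were daggers.' "
                 "/ 'David stared at Paul with ferocity.'"),
    "answer": "Yes",
    "explanation": ("David's eyes were not literally daggers; the metaphor "
                    "implies David was staring at Paul with ferocity."),
}]

print(build_prompt(examples, "Is 'time is money' meant literally?"))
```

The study’s comparison amounts to feeding a frozen language model prompts built with `with_explanations=True` versus `False` and measuring task accuracy on the held-out query.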

DeepMind’s approach, if it holds up within the academic community, could one day be applied to robotics, forming the building blocks of a robot that can understand vague requests (e.g., “throw away the trash”) without step-by-step instructions. Google’s new “Do As I Can, Not As I Say” project provides a glimpse of that future, albeit with significant limitations.

A collaboration between Robotics at Google and the Everyday Robots team at Alphabet’s X lab, Do As I Can, Not As I Say seeks to condition an AI language system to propose “achievable” and “contextually appropriate” actions for a robot, given an arbitrary task. The robot acts as the “hands and eyes” of the language system while the system supplies high-level semantic knowledge about the task – the theory being that the language system encodes a wealth of knowledge useful to the robot.


Image Credits: Robotics at Google

A system called SayCan selects which skill the robot should perform in response to a command, taking into account (1) the likelihood that a given skill will be useful and (2) the possibility of successfully performing that skill. For example, in response to someone saying “I spilled my Coke, can you get me something to clean it up?”, SayCan can direct the robot to find a sponge, pick up the sponge, and bring it to the person who made the request.
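The two-factor selection described above can be sketched as a simple product of scores: a language-model estimate that a skill is useful for the instruction, multiplied by the robot’s estimate that the skill can succeed in its current state. The skill names and all numeric scores below are made-up placeholders, not real model outputs:

```python
def select_skill(usefulness, affordance):
    """Pick the skill maximizing P(useful | instruction) * P(success | state),
    mirroring SayCan's combination of language and affordance scores."""
    return max(usefulness, key=lambda skill: usefulness[skill] * affordance[skill])

# Hypothetical scores for "I spilled my Coke, can you get me
# something to clean it up?"
usefulness = {"find a sponge": 0.6, "find an apple": 0.1, "go to the table": 0.3}
affordance = {"find a sponge": 0.8, "find an apple": 0.9, "go to the table": 0.7}

print(select_skill(usefulness, affordance))  # → find a sponge
```

Note that “find an apple” has the highest affordance score (the robot is confident it could do it), but its low usefulness score keeps it from being selected – the product is what matters.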

SayCan is limited by robotic hardware – on more than one occasion, the research team observed the robot chosen for the experiments accidentally dropping objects. Still, along with DALL-E 2 and DeepMind’s work on contextual understanding, it illustrates how AI systems, when combined, can bring us that much closer to a Jetsons-type future.

