LINGO-1 trains a video-language model that comments on the ongoing scene. You can ask it to explain its decisions ("why are you stopped?") and its plans ("what are you going to do next?"). The explicit reasoning step comes with key benefits:
- Explainability: driving models are no longer a mysterious black box whose safety you can only pray for.
- Counterfactuals: it's able to imagine scenarios that are not in the training data, and reason through how to handle them correctly.
- Long-tail programming: there are so many edge cases in driving that it's impossible to get good data coverage on all of them. Instead of collecting thousands of examples to "neurally program" a case, you can now have a human teacher write prompts that explain a handful of examples (see the sketch below).
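To make "write prompts to explain a handful of examples" concrete, here's a toy few-shot prompt builder. The scenarios, guidance strings, and the `build_prompt` helper are all hypothetical illustrations, not anything from LINGO-1 itself.

```python
# Toy sketch: "teaching" an edge case with a few annotated examples instead of
# thousands of logged clips. Scenario text and helper names are made up.
EDGE_CASE_EXAMPLES = [
    ("A ball rolls into the road from between parked cars.",
     "Brake early: a child may follow the ball."),
    ("The light is green but a fire truck approaches with sirens on.",
     "Yield and hold position until the emergency vehicle clears the junction."),
]

def build_prompt(new_scene: str) -> str:
    """Assemble a few-shot prompt that explains how to handle a new edge case."""
    lines = ["You are the driving commentator. Explain how to handle each scene."]
    for scene, guidance in EDGE_CASE_EXAMPLES:
        lines.append(f"Scene: {scene}\nGuidance: {guidance}")
    lines.append(f"Scene: {new_scene}\nGuidance:")
    return "\n\n".join(lines)

print(build_prompt("A cyclist swerves to avoid an open car door ahead."))
```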
LINGO-1 is closely related to a few works in game AI:
- MineDojo (my team's work at NVIDIA, https://t.co/sYdp8RzTCk): learns a reward model, "MineCLIP", that aligns Minecraft gameplay videos with their transcripts, grounding commentary text in the video pixels (see the first sketch after this list).
- Thought Cloning (@jeffclune): a pixel -> language -> action loop in gridworlds (see the second sketch after this list).
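For intuition on the MineCLIP idea, here's a toy PyTorch sketch of CLIP-style video-text alignment reused as a reward signal. This is not the actual MineCLIP architecture; the class name, mean-pooled encoders, and dimensions are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoTextAligner(nn.Module):
    """Toy CLIP-style aligner: embeds a short video clip and a text snippet
    into a shared space. A stand-in for MineCLIP, not its real architecture."""
    def __init__(self, frame_dim=512, text_dim=300, embed_dim=128):
        super().__init__()
        self.video_proj = nn.Linear(frame_dim, embed_dim)  # per-frame features -> shared space
        self.text_proj = nn.Linear(text_dim, embed_dim)    # token embeddings -> shared space
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def encode_video(self, frames):   # frames: (B, T, frame_dim)
        return F.normalize(self.video_proj(frames.mean(dim=1)), dim=-1)

    def encode_text(self, tokens):    # tokens: (B, L, text_dim)
        return F.normalize(self.text_proj(tokens.mean(dim=1)), dim=-1)

    def contrastive_loss(self, frames, tokens):
        v, t = self.encode_video(frames), self.encode_text(tokens)
        logits = self.logit_scale.exp() * v @ t.T   # (B, B) similarity matrix
        labels = torch.arange(len(v))
        # Symmetric InfoNCE: matched video/transcript pairs sit on the diagonal.
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

    def reward(self, frames, tokens):
        # Reward for RL: how well the agent's recent clip matches the task prompt.
        v, t = self.encode_video(frames), self.encode_text(tokens)
        return (v * t).sum(dim=-1)   # cosine similarity in [-1, 1]
```

And a minimal sketch of the Thought Cloning pixel -> language -> action loop, with the "thought" reduced to a single token from a tiny vocabulary. The real agent generates free-form language, so treat the names and shapes here as assumptions.

```python
import torch
import torch.nn as nn

class ToyThoughtCloningAgent(nn.Module):
    """Toy pixel -> language -> action agent: verbalize a plan, then act on it."""
    def __init__(self, obs_dim=64, thought_vocab=16, n_actions=5, hidden=128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)              # pixels -> features
        self.thought_head = nn.Linear(hidden, thought_vocab)   # features -> "language"
        self.thought_embed = nn.Embedding(thought_vocab, hidden)
        self.action_head = nn.Linear(2 * hidden, n_actions)    # features + thought -> action

    def forward(self, obs):
        h = torch.relu(self.encoder(obs))
        thought_logits = self.thought_head(h)
        thought = thought_logits.argmax(dim=-1)                 # pick a "thought" first...
        action_logits = self.action_head(torch.cat([h, self.thought_embed(thought)], dim=-1))
        return thought_logits, action_logits                    # ...then act conditioned on it

# Training clones both the demonstrator's thoughts (transcripts) and actions.
```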
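A design note on both sketches: the language bottleneck is what buys the explainability and long-tail benefits above, since a human can read and edit the intermediate text, whereas a purely end-to-end pixel -> action policy offers no such handle.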