Inner Monologue:
Embodied Reasoning through Planning with Language Models


  • Wenlong Huang*
  • Fei Xia*
  • Ted Xiao*
  • Harris Chan
  • Jacky Liang
  • Pete Florence


  • Andy Zeng
  • Jonathan Tompson
  • Igor Mordatch
  • Yevgen Chebotar
  • Pierre Sermanet


  • Noah Brown
  • Tomas Jackson
  • Linda Luu
  • Sergey Levine
  • Karol Hausman
  • Brian Ichter


  • Robotics at Google
    * Equal contribution and listed in alphabetical order.

Abstract

Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robotics. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them, answers that change over time in response to the agent’s own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, object recognition, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion across three domains, including simulated and real tabletop rearrangement tasks and long-horizon mobile manipulation tasks in a real kitchen environment.

Video Walkthrough


Approach

Prior works have shown that large language models (LLMs) demonstrate impressive planning capabilities for long-horizon embodied tasks given arbitrary language instructions. However, this planning has remained one-directional: the LLM influences the agent and the environment, but no feedback is routed back to the LLM. This issue is particularly prominent when an intermediate action fails during execution, because the LLM is never informed of the failure. In this work, we formulate an inner monologue by continually adding information from various sources of feedback into the language model prompts. While any textual feedback can be incorporated, we focus our studies on three types of feedback: passive scene description, active scene description, and success detection. Passive scene description covers feedback that is consistently provided in a structured form, such as object recognition results. Active scene description, on the other hand, covers free-form questions that the LLM may ask and the corresponding unstructured answers provided by a learned model (e.g., a VQA model) or a person; this channel can also be repurposed to inject human preferences during plan generation. Success detection refers to binary feedback indicating whether the last action succeeded, which is particularly useful in many long-horizon settings.
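To make this concrete, below is a minimal sketch of what such a closed-loop prompting scheme could look like. It is illustrative only: `llm_complete`, `execute_skill`, `detect_success`, and `describe_scene` are hypothetical stand-ins for an LLM completion API, a low-level control policy, a success detector, and a scene describer; the exact prompt format used in the paper differs.

```python
# Illustrative sketch of an Inner Monologue-style planning loop (not the
# authors' implementation). All callables passed in are hypothetical stand-ins.

def inner_monologue(instruction, llm_complete, execute_skill,
                    detect_success, describe_scene, max_steps=20):
    # The prompt accumulates the full history of actions and feedback,
    # forming the "inner monologue" the LLM conditions on at every step.
    prompt = f"Human: {instruction}\n"
    for _ in range(max_steps):
        # Ask the LLM for the next robot action given everything so far.
        action = llm_complete(prompt + "Robot action: ").strip()
        if action.lower() == "done":
            break
        prompt += f"Robot action: {action}\n"
        execute_skill(action)

        # Success detection: binary feedback about the last action.
        succeeded = detect_success(action)
        prompt += f"Success: {'yes' if succeeded else 'no'}\n"

        # Passive scene description: structured feedback such as the
        # objects currently detected in the scene.
        prompt += f"Scene: {describe_scene()}\n"
    return prompt
```

Active scene description would extend this loop by letting the LLM emit a question that is answered by a VQA model or a person before the next action is chosen.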

Results

In order to study how different sources of environment feedback can support a rich inner monologue that enables complex robotic control, we analyze diverse long-horizon manipulation and navigation tasks in simulation and in the real world. As Inner Monologue is not dependent on a specific LLM or a particular type of grounding feedback, we study different Inner Monologue implementations in three environments, with different LLM planning methods and different sources of feedback from the environment. For more details about the experiments, implementations, and the prompts used for the LLM in each domain, please refer to the paper and its appendix.

Simulated Tabletop Rearrangement

Given an unseen task instruction, we show that LLMs can not only generate sensible action plans, as observed in previous works, but can also incorporate injected textual feedback from success detection and passive scene description. The video below shows one instantiation that uses passive scene description as feedback (Scene). Specifically, the LLM first infers the desired sub-tasks given the high-level instruction. Then, the scene description keeps track of the achieved sub-tasks after each step. Additionally, the LLM generates chain-of-thought text about what remains to be achieved after each step. We demonstrate that this can elicit complex replanning behaviors in tasks with combinatorial state spaces (e.g., “put all blocks in bowls with matching colors”, “stack all the blocks”).
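As an illustration of this structure, the hypothetical prompt fragment below shows how Scene feedback and Robot thought statements could interleave with actions; the wording and objects are assumptions and do not reproduce the exact prompts from the paper.

```python
# Hypothetical monologue fragment for "put all blocks in bowls with matching colors".
# The phrasing and object names are illustrative assumptions.
EXAMPLE_MONOLOGUE = """\
Human: put all blocks in bowls with matching colors
Robot thought: Goal state is [blue block in blue bowl, red block in red bowl].
Robot action: pick up the blue block and place it in the blue bowl
Scene: [blue block is in blue bowl, red block is on the table]
Robot thought: The red block still needs to go into the red bowl.
Robot action: pick up the red block and place it in the red bowl
Scene: [blue block is in blue bowl, red block is in red bowl]
Robot thought: All blocks are in bowls with matching colors. The task is complete.
Robot action: done
"""
```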

Real-World Tabletop Rearrangement

We demonstrate another implementation of Inner Monologue in a real-world tabletop environment, where perceptual models may be subject to occlusions. We leverage passive scene description (implemented as object recognition) and success detection as feedback. Given the list of visible and occluded objects and the success detection results, we show that Inner Monologue can complete tasks like “stack all the blocks” and “put bottles and fruits in different plates”, even under considerable perturbations to the primitive policy.
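As a rough sketch of how such perception outputs could be rendered into text for the planner, consider the helper below; the function name and exact phrasing are assumptions, not the format used in the experiments.

```python
# Sketch of turning object recognition and success detection outputs into
# textual feedback for the LLM prompt. Names and phrasing are assumptions.

def format_feedback(visible_objects, occluded_objects, last_action_succeeded):
    lines = [f"Scene: visible objects are {', '.join(visible_objects)}."]
    if occluded_objects:
        lines.append(f"Scene: occluded objects are {', '.join(occluded_objects)}.")
    lines.append(f"Success: {'yes' if last_action_succeeded else 'no'}")
    return "\n".join(lines)

# Example:
#   format_feedback(["red block", "blue block"], ["green block"], False)
# returns three lines of Scene/Success feedback to append to the prompt.
```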

Real-World Mobile Manipulation

The method is also amenable to complex, realistic household tasks given a wide range of skills beyond pick-and-place. In the video below, we leverage success detection feedback. Although natural failures already occur in such settings, we use adversarial human interventions to force policy failures in order to demonstrate the replanning capability of Inner Monologue. We show that the LLM can effectively replan when the current or a previous plan step fails. This allows the robot to recover from failures and complete complex tasks like “put a coke in the top drawer”.
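A minimal sketch of this failure-driven replanning is shown below, assuming the same hypothetical helpers as in the earlier sketch; on a reported failure, the LLM is simply queried again with the failure appended to the monologue, and may retry the skill or choose a different one.

```python
# Illustrative retry-on-failure loop driven by success detection feedback.
# `llm_complete`, `execute_skill`, and `detect_success` are hypothetical stand-ins.

def execute_step_with_replanning(prompt, llm_complete, execute_skill,
                                 detect_success, max_attempts=3):
    for _ in range(max_attempts):
        action = llm_complete(prompt + "Robot action: ").strip()
        prompt += f"Robot action: {action}\n"
        execute_skill(action)
        if detect_success(action):
            prompt += "Success: yes\n"
            return prompt, True
        # The failure is appended to the monologue; on the next query the
        # LLM may repeat the same skill or replan with a different one.
        prompt += "Success: no\n"
    return prompt, False
```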

Emergent Capabilities

Although LLMs can generate fluent continuations from the prompted examples, we surprisingly find that, when informed with environment feedback, Inner Monologue demonstrates many impressive reasoning and replanning behaviors beyond the examples given in the prompt. Using a pre-trained LLM as the backbone, the method also inherits many appealing properties from the LLM’s versatility and general-purpose language understanding. In this section, we demonstrate a few of these emergent capabilities.

Continued Adaptation to New Instructions

Although not explicitly prompted for this, the LLM planner can react to human interaction that changes the high-level goal mid-task. Below we show a challenging case, where human feedback changes the goal during plan execution, and then changes the goal yet again by saying “finish the previous task”. We can see that the planner incorporates the feedback correctly by switching tasks twice. In another instance, despite not being explicitly prompted to terminate after a human says “please stop”, the LLM planner generalizes to this scenario and predicts a “done” action.
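The hypothetical transcript below illustrates the kind of mid-task goal switching described above; the specific instructions and objects are assumptions, not the actual episode.

```python
# Hypothetical transcript of mid-task goal changes via human feedback.
# Instructions, objects, and phrasing are illustrative assumptions.
GOAL_SWITCH_EXAMPLE = """\
Human: put the blocks in the bowls with matching colors
Robot action: pick up the yellow block and place it in the yellow bowl
Scene: [yellow block is in yellow bowl]
Human: actually, stack all the blocks instead
Robot thought: Goal state is [all blocks stacked in one tower].
Robot action: pick up the red block and place it on the yellow block
Scene: [red block is on yellow block]
Human: finish the previous task
Robot thought: Goal state is [all blocks in bowls with matching colors].
Robot action: pick up the red block and place it in the red bowl
"""
```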

Self-Proposing Goals under Infeasibility

Instead of mindlessly following human-given instructions, Inner Monologue can also act as an interactive problem solver by proposing alternative goals to attempt when the previous goal becomes infeasible. Below, to solve the task “put any two blocks inside the purple bowl”, Inner Monologue first attempts to pick up the purple block; the action fails because the purple block is intentionally made too heavy for the robot. After a hint that “the purple block is too heavy”, it proposes to “find a lighter block” and successfully solves the task in the end.
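A hypothetical reconstruction of this episode as a monologue is shown below; the exact wording is an assumption, but the structure mirrors the described interaction.

```python
# Hypothetical reconstruction of the "too heavy" episode described above.
# Wording is illustrative, not the exact model output.
INFEASIBLE_GOAL_EXAMPLE = """\
Human: put any two blocks inside the purple bowl
Robot action: pick up the purple block and place it in the purple bowl
Success: no
Human: the purple block is too heavy
Robot thought: I need to find a lighter block.
Robot action: pick up the blue block and place it in the purple bowl
Success: yes
Robot action: pick up the red block and place it in the purple bowl
Success: yes
Robot thought: Two blocks are in the purple bowl. The task is complete.
Robot action: done
"""
```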

Multilingual Interaction

Pre-trained LLMs are known to be able to translate from one language to another, without any finetuning. We observe that such multilingual understanding also transfers to the embodied settings considered in this work. Specifically, the human-provided new instruction is written in Chinese, but the LLM can correctly interpret it, re-narrate it as a concrete goal to execute in English, and accordingly replan its future actions. Occasionally, we find that this capability even extends to symbols and emojis.

Interactive Scene Understanding

We also observe that Inner Monologue demonstrates interactive understanding of the scene, using past actions and environment feedback as context. Below, after a task instruction has been executed, we turn to asking questions about the scene, again a structure that does not appear in the prompt. Surprisingly, we find that it can often correctly answer these questions, which require temporal and embodied reasoning.

Robustness to Feedback Order

So far we have prompted the language model following certain conventions. For instance, in the simulated tabletop domain, the convention is [Robot action, Scene, Robot thought]. In practice, we find that the LLM planner is robust to occasionally swapping the order of feedback. In one example, a new human instruction is injected in the middle of plan execution, a structure that has not been seen in the example prompts. Nonetheless, the planner recognizes the change and generates a new “Robot thought: Goal state is...” statement, allowing it to solve the new task.

Robustness to Typos

Inherited from the LLM backbone, our approach is also robust to typos in human instruction.

Citation

Acknowledgements

The authors would like to thank Kanishka Rao and Vincent Vanhoucke for valuable feedback and discussions. In addition, the authors would like to acknowledge the large team who built SayCan, upon which we construct our Kitchen Mobile Manipulation experiments.

The website template was borrowed from Jon Barron.