Vision Transformer Model and Transformer State Encoder.
We choose DINOv2 as our visual foundation backbone because of its strong dense prediction and sim-to-real transfer abilities.
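As a minimal sketch, the dense visual representation v can be obtained from the public DINOv2 release via torch.hub; the model variant, image resolution, and output key below are illustrative assumptions rather than choices stated here.

```python
import torch

# Load a DINOv2 backbone from the official repo (ViT-S/14 chosen for illustration)
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

image = torch.randn(1, 3, 224, 224)            # one RGB observation, 224x224 assumed
with torch.no_grad():
    out = backbone.forward_features(image)      # dict of intermediate outputs
    v = out["x_norm_patchtokens"]               # dense patch tokens: (1, 256, 384) for ViT-S/14
```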
The Transformer State Encoder module summarizes the state at each timestep as a vector s. The input to this encoder includes the visual representation v, the goal feature g, and an embedding f of a STATE token.
We concatenate these features and feed them to a non-causal transformer encoder, which returns the output at the STATE token position as the state feature vector. Since the transformer state encoder consumes both visual and goal features, the resulting state feature vector can also be viewed as a goal-conditioned visual state representation.
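The following is a hypothetical sketch of this encoder: the visual tokens v, goal feature g, and a learned STATE token embedding f are concatenated into one token sequence, passed through a bidirectional (non-causal) transformer encoder, and the output at the STATE token position is read out as s. The dimensions and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerStateEncoder(nn.Module):
    def __init__(self, dim=384, num_layers=2, num_heads=6):
        super().__init__()
        # Learned embedding f of the STATE token
        self.state_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)  # non-causal: full attention

    def forward(self, v, g):
        # v: (B, N_v, dim) visual tokens, g: (B, 1, dim) goal feature
        f = self.state_token.expand(v.shape[0], -1, -1)
        tokens = torch.cat([f, g, v], dim=1)   # concatenate along the sequence axis
        out = self.encoder(tokens)             # no causal mask is applied
        return out[:, 0]                       # state feature s at the STATE token position
```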
Causal Transformer Decoder with KV-Cache. We use a causal transformer decoder to perform explicit memory modeling over time. This enables both long-horizon planning (e.g., exhaustive exploration with back-tracking) and short-horizon planning (e.g., navigating around an object). Concretely, the causal transformer decoder constructs its state belief b from the sequence of state features s within the same trajectory. During rollout and inference, we leverage the KV-cache technique to store past feed-forward results in two cache matrices, one for keys and one for values. With a KV-cache, our causal transformer decoder performs feed-forward computation only on the most recent state feature, so computation time grows linearly in t rather than quadratically.
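Below is a minimal sketch of single-step decoding with a KV-cache; all names and sizes are illustrative assumptions. At each timestep only the newest state feature s_t is projected, and its key and value are appended to the cached K and V matrices, which is what keeps per-step compute linear in t.

```python
import math
import torch
import torch.nn as nn

class CachedCausalSelfAttention(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.k_cache = None   # (B, t, dim) keys of past steps
        self.v_cache = None   # (B, t, dim) values of past steps

    def forward(self, s_t):
        # s_t: (B, 1, dim) -- only the current state feature is fed forward
        q, k, v = self.q_proj(s_t), self.k_proj(s_t), self.v_proj(s_t)
        if self.k_cache is None:
            self.k_cache, self.v_cache = k, v
        else:
            self.k_cache = torch.cat([self.k_cache, k], dim=1)
            self.v_cache = torch.cat([self.v_cache, v], dim=1)
        # Attend over all cached timesteps; causality holds because the cache
        # contains only past and current keys/values.
        attn = (q @ self.k_cache.transpose(-2, -1)) / math.sqrt(q.shape[-1])
        b_t = attn.softmax(dim=-1) @ self.v_cache
        return b_t                              # belief feature b at step t
```

Resetting the two cache tensors at the start of each trajectory keeps attention restricted to state features from the same episode.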