unknown PR #2639
Conversation
first draft, need to add usage and thumbnails
assets/192_pi0/fast.png
Outdated
We store images (except the thumbnail) in this dataset, to keep this repo as lean as possible. Would it be possible to move this one there?
pi0.md
Outdated
We have ported the first **robot foundation models** to **Hugging Face LeRobot**! Both **π0 and π0-FAST**, developed by Physical Intelligence, are now available in the **LeRobot repository**, bringing generalist robotic intelligence to the Hugging Face ecosystem. If you are curious about how Vision-Language-Action (VLA) models differ from Vision-Language Models (VLMs) and how actions are represented, dive into this blog post to find out!

[Huggingface collection of Pi0 models](https://huggingface.co/collections/lerobot/pi0-models-67a0f92dc62e52ace7220eba) | [LeRobot] (link)
Still not up, is it?
not yet! 30min
---

## 🔍 What is π0?
I think there are many bold terms in this section. Personally I find it somewhat distracting, but it's totally your call.
pi0.md
Outdated
π0 is trained on data from **7 robotic platforms** and **68 unique tasks**, demonstrating strong **zero-shot** and **fine-tuned performance** on complex, real-world tasks such as **laundry folding, table bussing, grocery bagging, box assembly, and object retrieval**.

Unlike standard robotic policies, **π0 employs flow matching** to produce **smooth, real-time action trajectories at 50Hz**, making it highly **efficient, precise, and adaptable** for real-world deployment.
We don't need to explain how flow matching works, but perhaps it'd be helpful to contextualize a bit as one recent technique behind some great quality improvements in diffusion models.
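To give a concrete picture of the sampling loop behind that claim, here is a minimal sketch of flow-matching inference: starting from Gaussian noise, a learned velocity field is integrated toward a clean action chunk with a few Euler steps. The `velocity_net` name, its signature, and all shapes are illustrative assumptions, not LeRobot's actual implementation.

```python
import torch

def sample_actions(velocity_net, obs_embedding, horizon=50, action_dim=7, steps=10):
    """Illustrative flow-matching sampler (hypothetical `velocity_net`).

    Integrates a learned velocity field from pure noise (t=0) toward a
    clean action chunk (t=1) with simple Euler steps.
    """
    x = torch.randn(1, horizon, action_dim)      # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)             # current flow time in [0, 1)
        v = velocity_net(x, t, obs_embedding)    # predicted velocity dx/dt
        x = x + dt * v                           # Euler integration step
    return x                                     # (1, horizon, action_dim) trajectory
```

A handful of integration steps over a whole action chunk is what makes it feasible to serve smooth trajectories at high control rates.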
## How to Use π0 in LeRobot?

### Fine-tuning the π0 Pretrained Model
Do we always need to fine-tune before use? Otherwise, we could explain how to use a fine-tuned model, and then the benefits of fine-tuning.
### Fine-tuning the π0 Pretrained Model

To fine-tune the **π0** model using the `pi0_base` checkpoint from `openpi`, run the following command:
Links here would be nice :)
```bash
--dataset.repo_id=danaaubakirova/koch_test
```
To fine-tune the π0 neural network with PaliGemma and Expert Gemma, which were pretrained using VLM default parameters before π0 fine-tuning, execute:
Why do we need this?
explained above in suggestion!
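To make the distinction concrete, here is a rough sketch of instantiating the π0 architecture from its config, so PaliGemma and the Gemma action expert keep their default VLM initialization instead of loading the `pi0_base` robot checkpoint. The module paths and class names follow the lerobot repo layout but should be treated as assumptions:

```python
# Sketch: build the π0 architecture without the robot-pretrained checkpoint.
# Module paths / class names are assumptions based on the lerobot layout.
from lerobot.common.policies.pi0.configuration_pi0 import PI0Config
from lerobot.common.policies.pi0.modeling_pi0 import PI0Policy

config = PI0Config()        # default architecture hyperparameters
policy = PI0Policy(config)  # weights start from the underlying VLM init,
                            # not from the pi0_base robot checkpoint
```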
pi0.md
Outdated
### **Handling 2D Attention Masks**

The resulting **2D causal mask** exhibits strong **block sparsity**, but defining the boundaries of each block—especially in a batch of samples—is a bit tricky. We are used to causal masks with triangular structures for autoregressive modeling, but this is not one of these cases.

As you can see in this example below: the image (first element) has some padding tokens, representing empty cameras. Then, text tokens are added, with text tokens as well. This "prefix" part forms a fully noncausal attention, as in PaliGemma. Then, the suffix (state + action/time tokens) has a causal-block structure. The eager naive implementation matmuls and softmaxes over all of this, which is quite inefficient.
Suggested change, from:

As you can see in this example below: the image (first element) has some padding tokens, representing empty cameras. Then, text tokens are added, with text tokens as well. This "prefix" part forms a fully noncausal attention, as in PaliGemma. Then, the suffix (state + action/time tokens) has a causal-block structure. The eager naive implementation matmuls and softmaxes over all of this, which is quite inefficient.

to:

As you can see in this example below: the image (first element) has some padding tokens, representing empty cameras. Then, text tokens are added, with state tokens as well. This "prefix" part forms a fully noncausal attention, as in PaliGemma. Then, the suffix (state + action/time tokens) has a causal-block structure. The eager naive implementation matmuls and softmaxes over all of this, which is quite inefficient.
I guess? Or should we remove?
I think it's missing 'text padding tokens' rather
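For readers following along, here is a small sketch of how such a block-sparse 2D mask could be assembled: the prefix (image + text) attends bidirectionally, each suffix block sees the prefix and all blocks up to and including itself, and padding is masked out everywhere. The block layout and shapes are illustrative assumptions, not the exact lerobot code.

```python
import torch

def make_pi0_style_mask(prefix_len, block_lens, pad_mask):
    """Illustrative 2D attention mask: noncausal prefix + causal suffix blocks.

    prefix_len: number of image/text tokens (fully bidirectional, as in PaliGemma)
    block_lens: sizes of the suffix blocks (e.g. [state, action/time tokens])
    pad_mask:   bool tensor (seq_len,), True where the token is real (not padding)
    """
    seq_len = prefix_len + sum(block_lens)
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # Prefix tokens attend to the whole prefix, bidirectionally.
    mask[:prefix_len, :prefix_len] = True

    # Each suffix block sees the prefix and every block up to and including itself.
    start = prefix_len
    for length in block_lens:
        end = start + length
        mask[start:end, :end] = True
        start = end

    # Padding tokens neither attend nor get attended to.
    mask &= pad_mask.unsqueeze(0) & pad_mask.unsqueeze(1)
    return mask

# e.g. a 6-token prefix with one padded slot, then 2 state + 4 action/time tokens
pad = torch.tensor([1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1], dtype=torch.bool)
attn = make_pi0_style_mask(prefix_len=6, block_lens=[2, 4], pad_mask=pad)
```

A fused attention kernel can exploit this block sparsity instead of materializing the dense mask the way the eager implementation does.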
Unlike standard robotic policies, **π0 employs flow matching** to produce **smooth, real-time action trajectories at 50Hz**, making it highly **efficient, precise, and adaptable** for real-world deployment.

## How to Use π0 in LeRobot?
In line with @pcuenca I'd add something like this (needs the snippet @Cadene )
First of all, you need to upgrade your lerobot install, which now leverages `transformers` as a dependency! After a git clone, simply run:
```bash
pip install -e ".[pi0]"
```
π0 models are foundational models that, much like PaliGemma, are made to be adapted to a variety of frameworks, environments, and scene inputs. The base models here are usable as-is, in particular π0.
Inference on π0 pretrained model
add python snippet...
However, performance is reduced, since it's a conversion from JAX to PyTorch and from a specific environment. We recommend fine-tuning your own π0 on your own environment, like below.
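Until the official snippet lands, here is a minimal sketch of what inference could look like with the ported checkpoint. The class name and `select_action` call follow lerobot's usual policy API, but the import path, batch keys, and shapes are assumptions that depend on your robot config:

```python
import torch
from lerobot.common.policies.pi0.modeling_pi0 import PI0Policy  # path is an assumption

# Load the ported π0 checkpoint from the Hub.
policy = PI0Policy.from_pretrained("lerobot/pi0")
policy.eval()

# Dummy observation; real keys and shapes depend on your robot's config.
batch = {
    "observation.images.top": torch.zeros(1, 3, 224, 224),  # camera frame
    "observation.state": torch.zeros(1, 14),                 # proprioceptive state
    "task": ["pick up the cube"],                            # language instruction
}
with torch.no_grad():
    action = policy.select_action(batch)  # next action from the predicted chunk
```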
suggestions are broken with code snippets 💀 but seems fine
One approach is **semantic action representation**, where actions are described as **high-level concepts** like sub-tasks or keypoints. While this allows for few-shot and zero-shot learning, it often relies on hand-designed low-level controllers, limiting flexibility across different robots. In contrast, low-level control representations map actions directly to motor commands, enabling precise movements but making training **less stable and harder to scale**.

Most existing VLAs use **discrete action tokenization**, converting continuous actions into discrete tokens generated autoregressively. The most common method—per-dimension, per-timestep binning—struggles with high-frequency control tasks, leading to lossy representations and **inefficient training**. Alternatives like vector quantization (VQ) and time-series compression help, but **VQ is sensitive to hyperparameters**, making it less reliable for diverse robot designs.
Suggested change, from:

Most existing VLAs use **discrete action tokenization**, converting continuous actions into discrete tokens generated autoregressively. The most common method—per-dimension, per-timestep binning—struggles with high-frequency control tasks, leading to lossy representations and **inefficient training**. Alternatives like vector quantization (VQ) and time-series compression help, but **VQ is sensitive to hyperparameters**, making it less reliable for diverse robot designs.

to:

Most existing VLAs use **discrete action tokenization**, converting continuous actions into discrete tokens generated autoregressively. The most common method—meaning per-dimension and per-timestep binning—struggles with high-frequency control tasks, leading to lossy representations and **inefficient training**. Alternatives like vector quantization (VQ) and time-series compression help, but **VQ is sensitive to hyperparameters**, making it less reliable for diverse robot designs.
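For reference, here is a tiny sketch of the per-dimension, per-timestep binning scheme criticized above, which shows both the token blow-up at high control rates and where the lossiness comes from; the bin count and value range are arbitrary choices:

```python
import numpy as np

def tokenize_actions(actions, n_bins=256, low=-1.0, high=1.0):
    """Per-dimension, per-timestep binning: every scalar becomes one token."""
    clipped = np.clip(actions, low, high)
    ids = np.round((clipped - low) / (high - low) * (n_bins - 1)).astype(int)
    return ids.reshape(-1)  # one token per (timestep, dim): long sequences at 50Hz

def detokenize_actions(ids, horizon, dim, n_bins=256, low=-1.0, high=1.0):
    """Inverse map; the quantization error here is the 'lossy representation'."""
    return low + ids.reshape(horizon, dim) / (n_bins - 1) * (high - low)

actions = np.random.uniform(-1, 1, size=(50, 7))    # a 1 s chunk at 50Hz, 7-DoF arm
tokens = tokenize_actions(actions)                  # 350 tokens for one second of control
recon = detokenize_actions(tokens, horizon=50, dim=7)
print(tokens.shape, np.abs(actions - recon).max())  # error is at most half a bin width
```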
This reverts commit 0029bfa.
Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles
require additional reviews. Alternatively, you can write a community article following the process here.
Preparing the Article
You're not quite done yet, though. Please make sure to follow this process (as documented here):
Add metadata (such as authors) to your md file. You can also specify `guest` or `org` for the authors. Here is an example of a complete PR: #2382
Getting a Review
Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.
Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.