
unknown PR #2639

Merged
merged 30 commits into from
Feb 3, 2025

Conversation

danaaubakirova
Contributor

Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles
require additional reviews. Alternatively, you can write a community article following the process here.

Preparing the Article

You're not quite done yet, though. Please make sure to follow this process (as documented here):

  • Add an entry to _blog.yml.
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
  • Check you use a short title and blog path.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content in https://huggingface.co/new-blog. Do not click publish, this is just a way to do an early check.

Here is an example of a complete PR: #2382

Getting a Review

Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.

Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.

@danaaubakirova danaaubakirova marked this pull request as draft February 3, 2025 15:01
@danaaubakirova danaaubakirova changed the title Dana/pi0 unknown PR Feb 3, 2025
@danaaubakirova danaaubakirova marked this pull request as ready for review February 3, 2025 17:02
Member


We store images (except the thumbnail) in this dataset, to keep this repo as lean as possible. Would it be possible to move this one there?

pi0.md Outdated

We have ported the first **robot foundation models** to **Hugging Face LeRobot**! Both **π0 and π0-FAST**, developed by Physical Intelligence, are now available in the **LeRobot repository**, bringing generalist robotic intelligence to the Hugging Face ecosystem. If you are curious about how Vision-Language-Action (VLA) models differ from Vision-Language Models (VLMs) and how actions are represented, dive into this blog post to find out!

[Hugging Face collection of Pi0 models](https://huggingface.co/collections/lerobot/pi0-models-67a0f92dc62e52ace7220eba) | [LeRobot](link)
Member


Still not up, is it?

Contributor


not yet! 30min


---

## 🔍 What is π0?
Member


I think there are many bold terms in this section. Personally I find it somewhat distracting, but it's totally your call.

pi0.md Outdated

π0 is trained on data from **7 robotic platforms** and **68 unique tasks**, demonstrating strong **zero-shot** and **fine-tuned performance** on complex, real-world tasks such as **laundry folding, table bussing, grocery bagging, box assembly, and object retrieval**.

Unlike standard robotic policies, **π0 employs flow matching** to produce **smooth, real-time action trajectories at 50Hz**, making it highly **efficient, precise, and adaptable** for real-world deployment.
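As context for readers, the sampling side of flow matching can be sketched in a few lines: a model predicts a velocity field, and the action chunk is obtained by Euler-integrating that field from noise to data. The sketch below is illustrative only, not the π0 implementation; to keep it runnable it hard-codes the ideal velocity for a known target instead of a trained network.

```python
# Toy sketch of flow-matching sampling (illustrative, NOT the π0/LeRobot code).
# A trained model would predict the velocity; here we hard-code the ideal
# velocity field for a linear path from noise to a known `target`.

def velocity(x, t, target):
    # Ideal velocity of the straight-line probability path toward `target`.
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]

def sample_action(noise, target, steps=10):
    # Euler-integrate dx/dt = v(x, t) from t=0 to t=1.
    x, dt = list(noise), 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

action = sample_action(noise=[0.0, 1.0], target=[0.5, -0.2])
print([round(a, 3) for a in action])  # converges to the target, up to float error
```

In the real policy the same few integration steps are cheap enough to produce whole action chunks at high control rates, which is where the 50 Hz figure comes from.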
Member


We don't need to explain how flow matching works, but perhaps it'd be helpful to contextualize a bit as one recent technique behind some great quality improvements in diffusion models.


## How to Use π0 in LeRobot?

### Fine-tuning the π0 Pretrained Model
Member


Do we always need to fine-tune before use? Otherwise, we could explain how to use a fine-tuned model, and then the benefits of fine-tuning.


### Fine-tuning the π0 Pretrained Model

To fine-tune the **π0** model using the `pi0_base` checkpoint from `openpi`, run the following command:
Member


Links here would be nice :)

```
--dataset.repo_id=danaaubakirova/koch_test
```

To fine-tune the π0 neural network with PaliGemma and Expert Gemma, which were pretrained using VLM default parameters before π0 fine-tuning, execute:
Member


Why do we need this?

Contributor


explained above in suggestion!

pi0.md Outdated
### **Handling 2D Attention Masks**
The resulting **2D causal mask** exhibits strong **block sparsity**, but defining the boundaries of each block—especially in a batch of samples—is a bit tricky. We are used to causal masks with triangular structures for autoregressive modeling, but this is not one of those cases.

As you can see in this example below: the image (first element) has some padding tokens, representing empty cameras. Then, text tokens are added, with text tokens as well. This "prefix" part forms a fully noncausal attention, as in PaliGemma. Then, the suffix (state + action/time tokens) has a causal-block structure. The eager naive implementation matmuls and softmaxes over all of this, which is quite inefficient.
Member


Suggested change
As you can see in this example below: the image (first element) has some padding tokens, representing empty cameras. Then, text tokens are added, with text tokens as well. This "prefix" part forms a fully noncausal attention, as in PaliGemma. Then, the suffix (state + action/time tokens) has a causal-block structure. The eager naive implementation matmuls and softmaxes over all of this, which is quite inefficient.
As you can see in this example below: the image (first element) has some padding tokens, representing empty cameras. Then, text tokens are added, with state tokens as well. This "prefix" part forms a fully noncausal attention, as in PaliGemma. Then, the suffix (state + action/time tokens) has a causal-block structure. The eager naive implementation matmuls and softmaxes over all of this, which is quite inefficient.

I guess? Or should we remove?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's missing 'text padding tokens' rather
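For readers following this thread, the prefix/suffix mask structure under discussion can be sketched in plain Python. This is an illustrative toy (names and shapes are hypothetical, not the LeRobot implementation): prefix tokens (images + text, including padding) attend bidirectionally among real tokens, while suffix tokens (state + action/time) see the real prefix and attend causally over suffix blocks.

```python
# Toy sketch of the 2D block mask described above (NOT the LeRobot code).
# mask[i][j] == True means query token i may attend to key token j.

def build_mask(prefix_pad, suffix_blocks):
    # prefix_pad: True where the prefix token is padding (e.g. empty cameras)
    # suffix_blocks: block id per suffix token; a token sees its own block
    # and all earlier blocks, giving the causal-block structure
    n_p, n_s = len(prefix_pad), len(suffix_blocks)
    n = n_p + n_s
    mask = [[False] * n for _ in range(n)]
    for i in range(n_p):                        # prefix rows: fully
        for j in range(n_p):                    # non-causal over real tokens
            mask[i][j] = not (prefix_pad[i] or prefix_pad[j])
    for i, bi in enumerate(suffix_blocks):      # suffix rows
        row = n_p + i
        for j in range(n_p):                    # suffix sees real prefix
            mask[row][j] = not prefix_pad[j]
        for j, bj in enumerate(suffix_blocks):  # causal over blocks
            mask[row][n_p + j] = bj <= bi
    return mask

# 3 prefix tokens (last one is padding), 3 suffix tokens in 2 blocks
m = build_mask(prefix_pad=[False, False, True], suffix_blocks=[0, 1, 1])
```

The resulting boolean grid makes the block sparsity visible: a naive eager implementation still matmuls and softmaxes over the full grid, which is the inefficiency the section calls out.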

danaaubakirova and others added 2 commits February 3, 2025 18:51
Unlike standard robotic policies, **π0 employs flow matching** to produce **smooth, real-time action trajectories at 50Hz**, making it highly **efficient, precise, and adaptable** for real-world deployment.

## How to Use π0 in LeRobot?

Contributor

@molbap molbap Feb 3, 2025


In line with @pcuenca I'd add something like this (needs the snippet @Cadene )

Suggested change
First of all, you need to upgrade your lerobot install, which now leverages `transformers` as a dependency! After a git clone, simply run:
```bash
pip install -e ".[pi0]"
```
π0 models are foundational models that, much like PaliGemma, are made to be adapted to a variety of frameworks, environments, and scene inputs. The base models here are usable as-is, in particular π0.

Inference on π0 pretrained model

add python snippet...

However, performance is reduced, as it's a conversion from JAX to PyTorch and from a specific environment. We recommend fine-tuning your own π0 on your own environment, like below.

Contributor


suggestions are broken with code snippets 💀 but seems fine

danaaubakirova and others added 3 commits February 3, 2025 18:58
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
danaaubakirova and others added 3 commits February 3, 2025 19:47
Co-authored-by: Pablo Montalvo <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
danaaubakirova and others added 2 commits February 3, 2025 19:51

One approach is **semantic action representation**, where actions are described as **high-level concepts** like sub-tasks or keypoints. While this allows for few-shot and zero-shot learning, it often relies on hand-designed low-level controllers, limiting flexibility across different robots. In contrast, low-level control representations map actions directly to motor commands, enabling precise movements but making training **less stable and harder to scale**.

Most existing VLAs use **discrete action tokenization**, converting continuous actions into discrete tokens generated autoregressively. The most common method—per-dimension, per-timestep binning—struggles with high-frequency control tasks, leading to lossy representations and **inefficient training**. Alternatives like vector quantization (VQ) and time-series compression help, but **VQ is sensitive to hyperparameters**, making it less reliable for diverse robot designs.
Contributor


Suggested change
Most existing VLAs use **discrete action tokenization**, converting continuous actions into discrete tokens generated autoregressively. The most common method—per-dimension, per-timestep binning—struggles with high-frequency control tasks, leading to lossy representations and **inefficient training**. Alternatives like vector quantization (VQ) and time-series compression help, but **VQ is sensitive to hyperparameters**, making it less reliable for diverse robot designs.
Most existing VLAs use **discrete action tokenization**, converting continuous actions into discrete tokens generated autoregressively. The most common method—meaning per-dimension and per-timestep binning—struggles with high-frequency control tasks, leading to lossy representations and **inefficient training**. Alternatives like vector quantization (VQ) and time-series compression help, but **VQ is sensitive to hyperparameters**, making it less reliable for diverse robot designs.
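The per-dimension, per-timestep binning discussed in this paragraph is simple enough to sketch concretely. The toy below (illustrative only, assuming actions pre-normalized to [-1, 1]) shows why it is lossy: each continuous value is snapped to the nearest of `n_bins` levels, and the round-trip error is bounded by half a bin width.

```python
# Sketch of naive per-dimension, per-timestep binning (the lossy baseline
# discussed above). Assumes actions are normalized to [-1, 1]; not the
# actual tokenizer of any specific VLA.

def tokenize(actions, n_bins=256):
    # one discrete token per action dimension per timestep
    toks = []
    for step in actions:
        for a in step:
            b = int((a + 1.0) / 2.0 * (n_bins - 1) + 0.5)  # nearest bin
            toks.append(min(max(b, 0), n_bins - 1))
    return toks

def detokenize(tokens, dim, n_bins=256):
    # map each token back to its bin center and regroup into timesteps
    vals = [t / (n_bins - 1) * 2.0 - 1.0 for t in tokens]
    return [vals[i:i + dim] for i in range(0, len(vals), dim)]

chunk = [[0.10, -0.52], [0.11, -0.49]]        # 2 timesteps x 2 dims
round_trip = detokenize(tokenize(chunk), dim=2)
# round-trip error is bounded by half a bin width (~0.004 for 256 bins)
```

At 50 Hz, every extra timestep multiplies the token count by the action dimension, which is exactly the sequence-length and quantization pressure that motivates alternatives like FAST.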

@merveenoyan merveenoyan merged commit 0029bfa into huggingface:main Feb 3, 2025
1 check passed
merveenoyan added a commit that referenced this pull request Feb 3, 2025
This reverts commit 0029bfa.
@merveenoyan merveenoyan mentioned this pull request Feb 3, 2025
merveenoyan added a commit that referenced this pull request Feb 3, 2025