
[WIP] short blogpost on tooling around processing video datasets #2631

Merged: 13 commits from `video-tooling` into `main` on Feb 12, 2025

Conversation

@sayakpaul (Member) commented on Jan 29, 2025:

With @hlky.

TODO

  • Update _blog.yml.
  • Update thumbnail.
  • Update dates.
  • Any remaining todos related to the content.

@@ -0,0 +1,149 @@
---
title: "Build awesome datasets for video generation"
thumbnail: /blog/assets/vid_ds_scripts/thumbnail.png

@sayakpaul (Member, Author):
TODO: change.

@pcuenca (Member):
We also need the changes to _blog.yml etc.

@sayakpaul (Member, Author):
Has been taken care of.

@sayakpaul requested a review from @hlky on January 29, 2025 at 14:59
@hlky (Contributor) left a comment:

Looks good

@sayakpaul (Member, Author):

@hlky feel free to add more content if needed. @pcuenca, could you give this a review?

@sayakpaul requested a review from @pcuenca on January 29, 2025 at 15:44
@sayakpaul marked this pull request as ready for review on January 30, 2025 at 08:37
@sayakpaul (Member, Author):

@hlky I updated some content now that we support NSFW filtering.

@sayakpaul (Member, Author):

@pcuenca, a friendly ping.

@pcuenca (Member) commented on Feb 4, 2025:

Sure, I can take a look tomorrow if that works.

@sayakpaul (Member, Author):

@pcuenca that works. Thanks!

@pcuenca (Member) left a comment:

I just did a quick pass. I think it's going in a good direction, but I'd suggest focusing a bit on cohesion and flow.

I'd also recommend checking this post by @mfarre; it has great, insightful discussions about filtering, scoring, processing, etc.


#### Entire video

- predict a motion score with [OpenCV](https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.html)
@pcuenca (Member):
I think we should state what it is and what it will be used for. The link to optical flow methods might be a bit confusing, in my opinion.

@hlky (Contributor):
We aren't really using the motion score yet. It would be used to filter out videos with little or no motion, or with a lot of motion, but we need to do some analysis of videos and their scores to determine useful thresholds.
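
For context, here is a minimal sketch of how such a motion score could be computed with OpenCV's dense (Farneback) optical flow. The function name, sampling stride, and mean aggregation are illustrative assumptions, not the pipeline's actual implementation:

```python
import cv2
import numpy as np

def motion_score(video_path: str, stride: int = 5) -> float:
    """Mean optical-flow magnitude across sampled frame pairs."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise ValueError(f"could not read {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    magnitudes, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        if idx % stride:  # sample every `stride`-th frame to keep this cheap
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between the two sampled frames.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0
        )
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        magnitudes.append(float(mag.mean()))
        prev_gray = gray
    cap.release()
    # Higher scores mean more motion; useful thresholds still need analysis.
    return float(np.mean(magnitudes)) if magnitudes else 0.0
```

A video with a near-zero score is essentially static, while an unusually high score often indicates fast cuts or camera shake, which is why thresholds at both ends would be useful.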

Comment on lines 56 to 58
Florence-2 [`microsoft/Florence-2-large`](http://hf.co/microsoft/Florence-2-large) to run `<CAPTION>`, `<DETAILED_CAPTION>`, `<DENSE_REGION_CAPTION>` and `<OCR_WITH_REGION>`.

We can swap in any other captioner here. We can also caption the entire video (e.g., with a model like [Qwen2.5](https://huggingface.co/docs/transformers/main/en/model_doc/qwen2_5_vl)) instead of captioning individual frames.
@pcuenca (Member):
Are we captioning individual frames with Florence-2? What do tags like `<DENSE_REGION_CAPTION>` mean?

In general, I think we should introduce the task earlier in the post: we are going to grab a lot of videos from the Internet, and we are going to caption them so that the models to be trained have enough information to match text descriptions with frames, sequences of frames, or full videos. And of course, we need to filter out bad-quality content, videos that are too short or too long, etc.

@hlky (Contributor):
Florence-2 is run on the extracted frames. The tags are the task types for Florence-2; the examples are below.

> introduce the task earlier in the post

It's explained in the ## Tooling section, no?
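
For readers following along, here is a sketch of running one of these task prompts on an extracted frame, following the usage pattern from the `microsoft/Florence-2-large` model card; the frame path is a placeholder:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)

# One of the frames extracted from a video; the path is illustrative.
image = Image.open("frames/frame_0001.png").convert("RGB")
task = "<DETAILED_CAPTION>"  # also: <CAPTION>, <DENSE_REGION_CAPTION>, <OCR_WITH_REGION>

inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Task-specific parsing: plain text for the caption tasks, boxes plus
# labels for the region and OCR tasks.
print(processor.post_process_generation(raw, task=task, image_size=image.size))
```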


## Filtering examples

In the [dataset](https://huggingface.co/datasets/finetrainers/crush-smol) for the model [`finetrainers/crush-smol-v0`](https://hf.co/finetrainers/crush-smol-v0) we filtered on `pwatermark < 0.1` and `aesthetic > 5.5`. This highly restrictive filtering resulted in 47 videos out of 1493 total.
@pcuenca (Member):
This one appears to have been captioned with Qwen actually, but we said we'd use Florence.

I'd explain that this is an example dataset we obtained using this process.

@hlky (Contributor):
Yeah, this dataset was obtained using this process, then recaptioned with Qwen, as we'd already had success with Qwen captions for another video model.
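
To make the numbers above concrete, here is a sketch of applying those thresholds to a per-video metadata table; the file name and column names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical per-video metadata emitted by the scoring step.
meta = pd.read_parquet("video_metadata.parquet")

# The highly restrictive thresholds used for crush-smol.
kept = meta[(meta["pwatermark"] < 0.1) & (meta["aesthetic"] > 5.5)]
print(f"kept {len(kept)} of {len(meta)} videos")  # e.g., 47 of 1493
```

And here is a sketch of the whole-video recaptioning step, following the usage pattern from the Qwen2.5-VL model card; the model size, video path, and prompt are assumptions:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///data/videos/clip_0001.mp4"},
        {"type": "text", "text": "Describe this video in detail."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
# Drop the prompt tokens, keep only the generated caption.
caption = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```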

Comment on lines 64 to 66
Let's review the example frames from `pwatermark`: two with text have scores of 0.69 and 0.61, and the "toy car with a bunch of mice in it" scores 0.60, then 0.17 as the toy car is crushed. All of these example frames were filtered out by `pwatermark < 0.1`. `pwatermark` is effective at detecting text and watermarks, but the score gives no indication of whether it is a text overlay or a toy car's license plate. Our filtering required all scores to be below the threshold; averaging across frames, with a threshold of around 0.2 to 0.3, would be a more effective strategy for `pwatermark`.

Let's review the example frames from the aesthetic scores: the pink castle initially scores 5.5, then 4.44 as it is crushed; the action figure scores lower at 4.99, dropping to 4.84 as it is crushed; and the shard of glass scores low at 4.04. Again, filtering required every frame's score to exceed the threshold; in this case, using the aesthetic score from the first frame only would be a more effective strategy.
@pcuenca (Member):
I'm a bit lost here. Screenshots would go a long way, IMO.
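
For concreteness, here is a sketch of the alternative thresholding strategies described in the quoted passage; the exact averaged `pwatermark` threshold within the suggested 0.2 to 0.3 range is an assumption:

```python
import numpy as np

def keep_video(pwatermark_scores, aesthetic_scores) -> bool:
    """Per-video keep/drop decision using the suggested strategies."""
    # Average pwatermark across frames instead of requiring every frame
    # to pass; 0.25 is an assumed point in the suggested 0.2-0.3 range.
    if np.mean(pwatermark_scores) >= 0.25:
        return False
    # Judge aesthetics on the first frame only, since the score drops
    # as the object is crushed later in the clip.
    return aesthetic_scores[0] > 5.5
```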

@sayakpaul requested a review from @pcuenca on February 7, 2025 at 08:25
@sayakpaul (Member, Author):

@pcuenca thanks for your reviews so far. I plan to merge this today, but feel free to review once more; we can merge after that.

@sayakpaul merged commit 878cf95 into main on Feb 12, 2025
1 check passed
@sayakpaul deleted the video-tooling branch on February 12, 2025 at 03:49
@julien-c (Member):
There's a TODO left, btw:

[screenshot of the remaining TODO]
