[WIP] short blogpost on tooling around processing video datasets #2631
@@ -0,0 +1,149 @@
---
title: "Build awesome datasets for video generation"
thumbnail: /blog/assets/vid_ds_scripts/thumbnail.png
TODO: change.
We also need the changes to `_blog.yml`, etc.
Has been taken care of.
Looks good
Co-authored-by: hlky <[email protected]>
@hlky I updated some content now that we support NSFW filtering.
@pcuenca, a friendly ping.
Sure, I can take a look tomorrow if that works.
@pcuenca that works. Thanks!
#### Entire video

- predict a motion score with [OpenCV](https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.html)
I think we should state what it is and what it will be used for. The link to optical flow methods might be a bit confusing in my opinion.
We aren't really using the motion score yet. It would be used to filter out videos with little or no motion, or with too much motion, but we first need to analyze videos and their scores to determine useful thresholds.
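One possible way to approach that threshold analysis is to look at the distribution of motion scores and keep only the middle band. This is a minimal sketch under stated assumptions, not something the PR implements: the function names, the 10th/90th percentile cut-offs, and the sample scores are all hypothetical.

```python
from statistics import quantiles


def motion_band(scores, low_pct=10, high_pct=90):
    """Return (low, high) cut-offs taken from percentiles of the observed scores.

    The percentile defaults are placeholders, not values from the post.
    """
    cuts = quantiles(scores, n=100, method="inclusive")  # 99 cut points
    return cuts[low_pct - 1], cuts[high_pct - 1]


def filter_by_motion(videos, low, high):
    """Keep videos whose motion score lies inside [low, high]."""
    return [name for name, score in videos.items() if low <= score <= high]


# Hypothetical motion scores, for illustration only.
videos = {
    "static.mp4": 0.0,
    "calm.mp4": 2.0,
    "walk.mp4": 5.0,
    "run.mp4": 8.0,
    "shaky.mp4": 40.0,
}
low, high = motion_band(list(videos.values()))
kept = filter_by_motion(videos, low, high)
print(kept)
```

Deriving the band from percentiles rather than fixed numbers means the cut-offs adapt to whatever scale the motion score ends up having.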
vid_ds_scripts.md
Outdated
Florence-2 [`microsoft/Florence-2-large`](http://hf.co/microsoft/Florence-2-large) to run `<CAPTION>`, `<DETAILED_CAPTION>`, `<DENSE_REGION_CAPTION>` and `<OCR_WITH_REGION>`.

We can bring in any other captioner in this regard. We can also caption the entire video (e.g., with a model like [Qwen2.5](https://huggingface.co/docs/transformers/main/en/model_doc/qwen2_5_vl)) as opposed to captioning individual frames.
Are we captioning individual frames with Florence-2? What do those tags (`<DENSE_REGION_CAPTION>`, etc.) mean?
In general, I think we should introduce the task earlier in the post: we are going to grab a lot of videos from the Internet, and we are going to caption them so that the models to be trained have enough information to match text descriptions with frames, sequences of frames, or full videos. And of course we need to filter out bad-quality content, videos that are too short or too long, etc.
Florence-2 is running on the extracted frames. The tags are the task types for Florence-2; the examples are below.
> introduce the task earlier in the post

It's explained in the `## Tooling` section, no?
vid_ds_scripts.md
Outdated

## Filtering examples

In the [dataset](https://huggingface.co/datasets/finetrainers/crush-smol) for the model [`finetrainers/crush-smol-v0`](https://hf.co/finetrainers/crush-smol-v0) we filtered on `pwatermark < 0.1` and `aesthetic > 5.5`. This highly restrictive filtering left 47 videos out of 1493 in total.
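A minimal sketch of this strict filter, assuming each video carries a list of per-frame score dictionaries. The record layout, file names, and sample scores are hypothetical; only the two thresholds and the rule that every frame must pass come from the post.

```python
PWATERMARK_MAX = 0.1  # from the post: pwatermark < 0.1
AESTHETIC_MIN = 5.5   # from the post: aesthetic > 5.5


def passes_strict_filter(frames):
    """Return True only if every extracted frame passes both thresholds."""
    return bool(frames) and all(
        f["pwatermark"] < PWATERMARK_MAX and f["aesthetic"] > AESTHETIC_MIN
        for f in frames
    )


# Hypothetical per-frame scores, for illustration only.
dataset = {
    "clean.mp4": [
        {"pwatermark": 0.02, "aesthetic": 5.9},
        {"pwatermark": 0.04, "aesthetic": 5.7},
    ],
    "watermarked.mp4": [
        {"pwatermark": 0.65, "aesthetic": 6.1},
        {"pwatermark": 0.03, "aesthetic": 6.0},
    ],
    "dull.mp4": [{"pwatermark": 0.01, "aesthetic": 4.2}],
}
kept = [name for name, frames in dataset.items() if passes_strict_filter(frames)]
print(kept)
```

Because a single failing frame rejects the whole video, this rule is very restrictive, which is consistent with the small number of videos that survived.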
This one appears to have been captioned with Qwen actually, but we said we'd use Florence.
I'd explain that this is an example dataset we obtained using this process.
Yeah, this dataset was obtained using this process and then recaptioned with Qwen, as we'd already had success with Qwen captions for another video model.
vid_ds_scripts.md
Outdated
Let's review the example frames from `pwatermark`: two frames with text score 0.69 and 0.61, and the "toy car with a bunch of mice in it" scores 0.60, then 0.17 as the toy car is crushed. All of these example frames were filtered out by `pwatermark < 0.1`. `pwatermark` is effective at detecting text and watermarks, but the score gives no indication of whether it has found a text overlay or a toy car's license plate. Our filtering required every frame's score to be below the threshold; for `pwatermark`, averaging across frames with a threshold of around 0.2 - 0.3 would be a more effective strategy.

Let's review the example frames from the aesthetic scores: the pink castle initially scores 5.5, then 4.44 as it is crushed; the action figure scores lower at 4.99, dropping to 4.84 as it is crushed; and the shard of glass scores low at 4.04. Again, filtering required every frame's score to pass the threshold (here, to be above it); in this case, using the aesthetic score from the first frame only would be a more effective strategy.
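The two alternative strategies suggested here could look roughly like the sketch below. The function names are hypothetical, 0.3 is simply the upper end of the suggested 0.2 - 0.3 range, and apart from the toy car's `pwatermark` scores the sample values are made up for illustration.

```python
from statistics import mean


def pwatermark_ok(frame_scores, threshold=0.3):
    """Average pwatermark across frames instead of requiring every frame to pass."""
    return mean(frame_scores) < threshold


def aesthetic_ok(frame_scores, threshold=5.5):
    """Judge aesthetics on the first frame only."""
    return frame_scores[0] > threshold


# pwatermark scores quoted in the text for the toy car (mean is about 0.385).
toy_car = [0.60, 0.17]
print("toy car kept:", pwatermark_ok(toy_car))

# Hypothetical video where one frame briefly shows some text.
brief_text = [0.05, 0.25, 0.06]
print("all frames < 0.1:", all(s < 0.1 for s in brief_text))
print("mean < 0.3:", pwatermark_ok(brief_text))

# Hypothetical video whose aesthetic score drops once the object is crushed.
crushed = [5.8, 4.9]
print("all frames > 5.5:", all(s > 5.5 for s in crushed))
print("first frame > 5.5:", aesthetic_ok(crushed))
```

Averaging tolerates a single frame with incidental text, while the first-frame rule avoids penalizing a video for the drop in aesthetic score that happens as the object is crushed.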
I'm a bit lost here. Screenshots would go a long way imo.
Co-authored-by: Pedro Cuenca <[email protected]>
@pcuenca thanks for your reviews thus far. Plan to merge this today. But if you want to review once more, feel free to. We can merge then.
With @hlky.
TODO