[WIP] short blogpost on tooling around processing video datasets #2631

sayakpaul · 2025-01-29T05:06:46Z

TODO

- Update _blog.yml.
- Update thumbnail.
- Update dates.
- Any remaining todos related to the content.

sayakpaul · 2025-01-29T05:09:05Z

vid_ds_scripts.md

@@ -0,0 +1,149 @@
+---
+title: "Build awesome datasets for video generation"
+thumbnail: /blog/assets/vid_ds_scripts/thumbnail.png


TODO: change.

We also need the changes to _blog.yml etc.

Has been taken care of.

hlky

Looks good

vid_ds_scripts.md

Co-authored-by: hlky <[email protected]>

sayakpaul · 2025-01-29T15:43:59Z

@hlky feel free to add more content if needed. @pcuenca possible to give this a review?

sayakpaul · 2025-01-30T16:12:45Z

@hlky I updated some content now that we support NSFW filtering.

sayakpaul · 2025-02-04T06:58:27Z

@pcuenca, a friendly ping.

pcuenca · 2025-02-04T08:54:04Z

Sure, I can take a look tomorrow if that works.

sayakpaul · 2025-02-04T10:42:37Z

@pcuenca that works. Thanks!

pcuenca

I just did a quick pass. I think it's going in a good direction, but I'd suggest focusing a bit on cohesion and flow.

I'd also recommend checking this post by @mfarre; it has great, insightful discussions about filtering, scoring, processing, etc.

pcuenca · 2025-02-05T17:02:48Z

vid_ds_scripts.md

@@ -0,0 +1,149 @@
+---
+title: "Build awesome datasets for video generation"
+thumbnail: /blog/assets/vid_ds_scripts/thumbnail.png


We also need the changes to _blog.yml etc.

vid_ds_scripts.md

pcuenca · 2025-02-05T17:09:53Z

vid_ds_scripts.md

+
+#### Entire video
+
+- predict a motion score with [OpenCV](https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.html)


I think we should state what it is and what it will be used for. The link to optical flow methods might be a bit confusing in my opinion.

We aren't really using the motion score yet. It would be used to filter out videos with little to no motion or with too much motion, but we first need to analyze videos and their scores to determine useful thresholds.
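One way the threshold analysis mentioned above could start is by reading cut-offs from percentiles of the observed motion scores. This is only a sketch: the function names, the percentile choices, and the scores are all made up for illustration, and the motion score itself is assumed to be computed elsewhere.

```python
from statistics import quantiles


def motion_bounds(scores, low_pct=5, high_pct=95):
    """Return (low, high) cut-offs taken from percentiles of observed scores."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(scores, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]


def keep_by_motion(video_scores, low, high):
    """Keep videos whose motion score lies inside [low, high]."""
    return [name for name, score in video_scores.items() if low <= score <= high]


# Made-up scores, purely for illustration.
video_scores = {
    "static.mp4": 0.02,
    "pan.mp4": 1.4,
    "crush.mp4": 2.1,
    "walk.mp4": 1.8,
    "shaky.mp4": 9.7,
}
low, high = motion_bounds(list(video_scores.values()), low_pct=20, high_pct=80)
kept = keep_by_motion(video_scores, low, high)
print(kept)
```

Percentile cut-offs adapt to whatever scale the motion score ends up on, which may help while absolute thresholds are still unknown.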

pcuenca · 2025-02-05T17:14:12Z

vid_ds_scripts.md

+Florence-2 [`microsoft/Florence-2-large`](http://hf.co/microsoft/Florence-2-large) to run `<CAPTION>`, `<DETAILED_CAPTION>`, `<DENSE_REGION_CAPTION>` and `<OCR_WITH_REGION>`.
+
+We can bring in any other captioner in this regard. We can also caption the entire video (e.g., with a model like [Qwen2.5](https://huggingface.co/docs/transformers/main/en/model_doc/qwen2_5_vl)) as opposed to captioning individual frames.


Are we captioning individual frames with Florence 2? What do those tags <DENSE_REGION_CAPTION> etc mean?

In general, I think we should introduce the task earlier in the post: we are going to grab a lot of videos from the Internet, and we are going to caption them so that the models to be trained have enough information to match text descriptions with frames/sequences of frames/full videos. And of course we need to filter out bad-quality content, videos that are too short or too long, etc.

Florence-2 is running on the extracted frames. The tags are the task types for Florence-2, the examples are below.

> introduce the task earlier in the post

It's explained in the ## Tooling section, no?
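To make "Florence-2 is running on the extracted frames" concrete, here is a rough sketch of looping the four task prompts over frames. The model call itself is deliberately left out: `run_task` is a hypothetical callable standing in for it, and `fake_run_task` and the frame file names are placeholders, not part of the actual scripts.

```python
from typing import Callable, Dict, List

# Task prompts named in the quoted draft.
TASKS = ["<CAPTION>", "<DETAILED_CAPTION>", "<DENSE_REGION_CAPTION>", "<OCR_WITH_REGION>"]


def annotate_frames(
    frames: List[str], run_task: Callable[[str, str], object]
) -> List[Dict[str, object]]:
    """Run every task prompt on every extracted frame, one result row per frame."""
    results = []
    for frame in frames:
        row: Dict[str, object] = {"frame": frame}
        for task in TASKS:
            row[task] = run_task(frame, task)
        results.append(row)
    return results


def fake_run_task(frame: str, task: str) -> str:
    # Stand-in for the real model call; returns a placeholder string.
    return f"{task} output for {frame}"


rows = annotate_frames(["frame_000.jpg", "frame_001.jpg"], fake_run_task)
print(rows[0]["<CAPTION>"])
```

Keeping the model behind a callable means another captioner could be swapped in without changing the loop.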

pcuenca · 2025-02-05T17:15:51Z

vid_ds_scripts.md

+
+## Filtering examples
+
+In the [dataset](https://huggingface.co/datasets/finetrainers/crush-smol) for the model [`finetrainers/crush-smol-v0`](https://hf.co/finetrainers/crush-smol-v0) we filtered on `pwatermark < 0.1` and `aesthetic > 5.5`. This highly restrictive filtering resulted in 47 videos out of 1493 total. 


This one appears to have been captioned with Qwen actually, but we said we'd use Florence.

I'd explain that this is an example dataset we obtained using this process.

Yeah this dataset was obtained using this process then recaptioned with Qwen as we'd already had success with Qwen captions for another video model.
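For reference, the strict filter described in the quoted paragraph (`pwatermark < 0.1` and `aesthetic > 5.5` on every frame) might look roughly like this. The data layout, the function name, the "clean" video, and all aesthetic values are assumptions for illustration; only the two thresholds, the toy car's `pwatermark` scores, and the 47 / 1493 counts come from the draft.

```python
def passes_strict(frames, max_pwatermark=0.1, min_aesthetic=5.5):
    """True only if every frame clears both thresholds."""
    return all(
        f["pwatermark"] < max_pwatermark and f["aesthetic"] > min_aesthetic
        for f in frames
    )


# Per-frame scores; the aesthetic values and the "clean" video are made up.
videos = {
    "clean": [
        {"pwatermark": 0.02, "aesthetic": 5.9},
        {"pwatermark": 0.04, "aesthetic": 5.7},
    ],
    "toy_car": [
        {"pwatermark": 0.60, "aesthetic": 5.8},
        {"pwatermark": 0.17, "aesthetic": 5.6},
    ],
}
kept = [name for name, frames in videos.items() if passes_strict(frames)]

# Share of videos kept in the run described above (47 out of 1493).
kept_pct = round(47 / 1493 * 100, 1)
print(kept, kept_pct)
```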

pcuenca · 2025-02-05T17:17:33Z

vid_ds_scripts.md

+Let's review the example frames from `pwatermark`: the two with text score 0.69 and 0.61, and the "toy car with a bunch of mice in it" scores 0.60, then 0.17 as the toy car is crushed. All example frames were filtered out by `pwatermark < 0.1`. `pwatermark` is effective at detecting text/watermarks; however, the score gives no indication of whether it is a text overlay or a toy car's license plate. Our filtering required all scores to be below the threshold; an average across frames, with a threshold of around 0.2 - 0.3, would be a more effective strategy for `pwatermark`.
+
+Let's review the example frames from aesthetic scores: the pink castle initially scores 5.5, then 4.44 as it is crushed; the action figure scores lower at 4.99, dropping to 4.84 as it is crushed; and the shard of glass scores low at 4.04. Again, filtering required every frame's score to pass the threshold; in this case, using the aesthetic score from the first frame only would be a more effective strategy.


I'm a bit lost here. Screenshots would go a long way imo.
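Until screenshots are in, a small sketch may help show what the two alternative strategies in the quoted paragraphs amount to: averaging `pwatermark` across frames, and checking the aesthetic score of the first frame only. The function names and the 0.25 default are assumptions (the draft only says "around 0.2 - 0.3"); the toy car and pink castle scores are the ones quoted above.

```python
from statistics import mean


def passes_mean_pwatermark(pwatermark_scores, threshold=0.25):
    """Compare the average pwatermark across frames against the threshold."""
    return mean(pwatermark_scores) < threshold


def passes_first_frame_aesthetic(aesthetic_scores, threshold=5.5):
    """Only look at the aesthetic score of the first frame."""
    return aesthetic_scores[0] > threshold


toy_car_pwatermark = [0.60, 0.17]  # scores quoted in the draft
pink_castle_aesthetic = [5.5, 4.44]  # scores quoted in the draft

print(passes_mean_pwatermark(toy_car_pwatermark))
print(passes_first_frame_aesthetic(pink_castle_aesthetic))
```

With these defaults the toy car's average (0.385) still exceeds the threshold, and the pink castle sits exactly on the 5.5 boundary, so a strict `>` comparison still rejects it. Whether that is the desired outcome depends on the thresholds finally chosen.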

vid_ds_scripts.md

Co-authored-by: Pedro Cuenca <[email protected]>

sayakpaul · 2025-02-11T03:05:21Z

@pcuenca thanks for your reviews thus far. Plan to merge this today. But if you want to review once more, feel free to. We can merge then.

julien-c · 2025-02-12T09:29:30Z

there's a TODO left btw:

sayakpaul added 3 commits January 29, 2025 10:32

start

40afdec

update authors.

5baef9c

updates

04cf312

sayakpaul commented Jan 29, 2025

sayakpaul requested a review from hlky January 29, 2025 14:59

hlky approved these changes Jan 29, 2025

Apply suggestions from code review

1345564

Co-authored-by: hlky <[email protected]>

sayakpaul requested a review from pcuenca January 29, 2025 15:44

sayakpaul marked this pull request as ready for review January 30, 2025 08:37

updates

b19a1ce

pcuenca reviewed Feb 5, 2025

hlky and others added 3 commits February 6, 2025 12:31

Apply suggestions from code review

ad98d00

Co-authored-by: Pedro Cuenca <[email protected]>

Address some comments

cdc5624

Merge branch 'main' into video-tooling

02f9a9b

sayakpaul requested a review from pcuenca February 7, 2025 08:25

sayakpaul added 4 commits February 7, 2025 20:15

updates

e728339

updates

38d2204

updates

f352dc0

resolve conflicts

cd0ea05

change publication date.

4ca6d05

sayakpaul merged commit 878cf95 into main Feb 12, 2025
1 check passed

sayakpaul deleted the video-tooling branch February 12, 2025 03:49
