Why is data imported as {"@type": "PodcastEpisode"} and how to change it to something more appropriate for my website #347
galaxy101quest asked:

I'm sure a bunch of people have noticed that most data that gets imported into a vector database ends up with the type "PodcastEpisode". I saw a few online tutorials where the data is also classified as a podcast, even though it is a different type, so I'm not the only one. The chat tool certainly becomes smarter as more data is provided, but it still bothers me that the data is automatically assigned "podcast" when it's not :)

*My question is: what is the recommended way to select a data type?*

I've set up my website to use the RSS feed, and I've noticed that all articles (approx. 400) get imported as "PodcastEpisode". If I import only about 10 articles and chat with it, it tells me it's a podcast. When I import more data, it becomes smarter and no longer uses "podcast".

I saw that the file NLWeb/code/python/data_loading/db_load_utils.py lists these:

```
# Item type categorization
SKIP_TYPES = ["ItemList", "ListItem", "AboutPage", "WebPage", "WebSite", "Person"]
INCLUDE_TYPES = [
    "Recipe", "NeurIPSPoster", "InvitedTalk", "Oral", "Movie",
    "LocalBusiness", "Review",
    "TVShow", "TVEpisode", "Product", "Offer", "PodcastEpisode", "Book",
    "Podcast", "TVSeries", "ProductGroup", "Event", "FoodEstablishment",
    "Apartment", "House", "Home", "RealEstateListing",
    "SingleFamilyResidence", "Offer",
    "AggregateOffer", "Event", "BusinessEvent", "Festival", "MusicEvent",
    "EducationEvent",
    "SocialEvent", "SportsEvent"
]
```

*Should I just delete all types from INCLUDE_TYPES and create a new one, e.g. "News", and will that work?*

I would appreciate any feedback on how you've dealt with this and how you suggest others classify their websites.
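For context, lists like SKIP_TYPES/INCLUDE_TYPES are typically used to filter which schema.org items get loaded. The sketch below is illustrative only, not NLWeb's actual code: the `should_index` helper and the custom type list are hypothetical, but they show how adding a type such as "News" to an include list would be expected to behave.

```python
# Illustrative sketch of include/skip filtering over schema.org JSON-LD items.
# should_index and the custom INCLUDE_TYPES below are hypothetical, not NLWeb's code.
SKIP_TYPES = {"ItemList", "ListItem", "AboutPage", "WebPage", "WebSite", "Person"}
INCLUDE_TYPES = {"News", "TechArticle", "Report"}  # a custom list, as asked above

def should_index(item: dict) -> bool:
    """Decide whether a schema.org JSON-LD item should be loaded into the db."""
    item_type = item.get("@type", "")
    # In JSON-LD, @type may be a single string or a list of strings
    types = item_type if isinstance(item_type, list) else [item_type]
    if any(t in SKIP_TYPES for t in types):
        return False
    return any(t in INCLUDE_TYPES for t in types)

print(should_index({"@type": "News"}))            # True
print(should_index({"@type": "PodcastEpisode"}))  # False
```

Note that a filter like this only decides what gets loaded; it cannot change the @type that an earlier ingestion step has already stamped on the data, which turns out to be the real issue below.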
Replies: 3 comments 3 replies
rvguha replied:

Really good point. Yes, this should be fixed. Yes, you can make that change and it should work.
rvguha replied:

It really matters in only one place, rss2schema.py. Change that to whatever you want.

Or just create schema files natively. Here is an example:
scifi_movies_schemas.txt.gz
<https://1drv.ms/u/c/6c6197aa87f7f4c4/EcT094eql2EggGxmBAAAAAABMTnzUOSLZXtnQyhHFj8JvQ?e=mTBGf7>
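The suggestion above amounts to changing where the @type is assigned during RSS ingestion. The sketch below is not NLWeb's actual rss2schema.py (the real parse_rss_2_0 handles more fields); it only shows the idea of making the type a parameter instead of hard-coding "PodcastEpisode".

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of RSS -> schema.org conversion with a configurable @type.
# NLWeb's real rss2schema.py differs; this is illustrative only.
def rss_items_to_schema(rss_xml: str, item_type: str = "NewsArticle") -> list[dict]:
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "@context": "https://schema.org",
            "@type": item_type,  # e.g. "NewsArticle" or "TechArticle", not "PodcastEpisode"
            "name": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items

rss = """<rss version="2.0"><channel>
  <item><title>Hello</title><link>https://example.com/a</link>
        <description>First post</description></item>
</channel></rss>"""
print(rss_items_to_schema(rss)[0]["@type"])  # NewsArticle
```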
On Wed, Sep 10, 2025, galaxy101quest wrote:

> Thanks for the quick reply, rvguha :) I tried doing that just now, but unfortunately it didn't work :(
>
> I simply deleted all of the old INCLUDE_TYPES in db_load_utils.py, listed a few new ones, e.g. "TechArticle", "Report", etc., and tested a new database import. It again imported the new entry as {"@type": "PodcastEpisode", ...}.
>
> I was wondering why, and when I searched for "podcast" in the NLWeb project files, I saw 225 matches in 21 files, so I guess it's not just one place to edit this out. Some of the other types, e.g. "RealEstateListing" and "Festival", had far fewer matches. So I'm guessing it's not as easy as that. Do you have any other suggestions?
rvguha replied:

Fantastic. Great to hear that. Can you send a PR with your changes?
galaxy101quest replied:

Thanks, you were right :)

For anyone else who might be reading this and uses RSS: I made modifications to "def parse_rss_2_0" in the rss2schema.py file, and it all worked out well in the end :)

FYI: I found that the default RSS ingestion was not precise enough for my website (it produced only 1 point in the db), so I further adjusted db_load.py and introduced chunking based on the article's HTML elements (H1, H2, H3, H4, H5, <p>, <ul>, <table>, etc.). This worked well: an article of approx. 3,000 words ended up as 50 points in Qdrant, each with proper schema_json, and afterwards the chat was able to pull details out of the article much better, with high precision and no errors or missing information.
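The heading-based chunking described above can be sketched roughly as follows. This is not the poster's actual db_load.py change; it is a minimal stdlib-only illustration where each H1-H5 starts a new chunk and the text of subsequent elements (p, ul, table, etc.) accumulates into the current chunk.

```python
from html.parser import HTMLParser

# Illustrative sketch of heading-based chunking, not the actual db_load.py patch.
class HeadingChunker(HTMLParser):
    HEADINGS = {"h1", "h2", "h3", "h4", "h5"}

    def __init__(self):
        super().__init__()
        self.chunks = [[]]  # lists of text fragments, one list per chunk

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.chunks.append([])  # each heading opens a new chunk

    def handle_data(self, data):
        if data.strip():
            self.chunks[-1].append(data.strip())

def chunk_html(html: str) -> list[str]:
    """Split article HTML into one text chunk per heading section."""
    parser = HeadingChunker()
    parser.feed(html)
    return [" ".join(c) for c in parser.chunks if c]

html = "<h1>Intro</h1><p>One.</p><h2>Details</h2><p>Two.</p><ul><li>Three</li></ul>"
print(chunk_html(html))  # ['Intro One.', 'Details Two. Three']
```

Each resulting chunk would then be embedded as its own point (e.g. in Qdrant), which is why a ~3,000-word article can turn into dozens of points instead of one.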