Why is data imported as {"@type": "PodcastEpisode"} and how to change it to something more appropriate for my website #347
galaxy101quest asked:

I'm sure a bunch of people have noticed that most data that gets imported into a vector database ends up with the type "PodcastEpisode". I saw a few online tutorials where the data is also classified as a podcast, even though it is a different type, so I'm not the only one. The chat tool certainly becomes smarter as more data is provided, but it still bothers me that the data is automatically assigned "podcast" when it's not :)

*My question is: what is the recommended way to select a data type?*

I've set up my website to use the RSS feed, and I've noticed that all articles (approx. 400) get imported as "PodcastEpisode". If I import only about 10 articles and chat with it, it tells me it's a podcast. When I import more data, it becomes smarter and no longer uses "podcast".

I saw that the file NLWeb/code/python/data_loading/db_load_utils.py lists these:

```
# Item type categorization
SKIP_TYPES = ["ItemList", "ListItem", "AboutPage", "WebPage", "WebSite", "Person"]
INCLUDE_TYPES = [
    "Recipe", "NeurIPSPoster", "InvitedTalk", "Oral", "Movie",
    "LocalBusiness", "Review",
    "TVShow", "TVEpisode", "Product", "Offer", "PodcastEpisode", "Book",
    "Podcast", "TVSeries", "ProductGroup", "Event", "FoodEstablishment",
    "Apartment", "House", "Home", "RealEstateListing",
    "SingleFamilyResidence", "Offer",
    "AggregateOffer", "Event", "BusinessEvent", "Festival", "MusicEvent",
    "EducationEvent",
    "SocialEvent", "SportsEvent"
]
```

*Should I just delete all types from INCLUDE_TYPES and create a new one, e.g. "News", and will that work?*

I would appreciate any feedback on how you've dealt with this and how you suggest others classify their websites.
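For context, lists like SKIP_TYPES/INCLUDE_TYPES are typically used to filter which schema.org items get loaded. The sketch below is illustrative only, not NLWeb's actual code: the `should_index` helper and the custom type list are hypothetical, but they show how adding a type such as "News" to an include list would be expected to behave.

```python
# Illustrative sketch of include/skip filtering over schema.org JSON-LD items.
# should_index and the custom INCLUDE_TYPES below are hypothetical, not NLWeb's code.
SKIP_TYPES = {"ItemList", "ListItem", "AboutPage", "WebPage", "WebSite", "Person"}
INCLUDE_TYPES = {"News", "TechArticle", "Report"}  # a custom list, as asked above

def should_index(item: dict) -> bool:
    """Decide whether a schema.org JSON-LD item should be loaded into the db."""
    item_type = item.get("@type", "")
    # In JSON-LD, @type may be a single string or a list of strings
    types = item_type if isinstance(item_type, list) else [item_type]
    if any(t in SKIP_TYPES for t in types):
        return False
    return any(t in INCLUDE_TYPES for t in types)

print(should_index({"@type": "News"}))            # True
print(should_index({"@type": "PodcastEpisode"}))  # False
```

Note that a filter like this only decides what gets loaded; it cannot change the @type that an earlier ingestion step has already stamped on the data, which turns out to be the real issue below.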
Replies: 3 comments 3 replies
rvguha replied:

Really good point. Yes, this should be fixed. Yes, you can make that change and it should work.
rvguha replied:

It really matters in only one place, rss2schema.py. Change that to whatever you want.

Or just create schema files natively. Here is an example:
scifi_movies_schemas.txt.gz
<https://1drv.ms/u/c/6c6197aa87f7f4c4/EcT094eql2EggGxmBAAAAAABMTnzUOSLZXtnQyhHFj8JvQ?e=mTBGf7>
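The suggestion above amounts to changing where the @type is assigned during RSS ingestion. The sketch below is not NLWeb's actual rss2schema.py (the real parse_rss_2_0 handles more fields); it only shows the idea of making the type a parameter instead of hard-coding "PodcastEpisode".

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of RSS -> schema.org conversion with a configurable @type.
# NLWeb's real rss2schema.py differs; this is illustrative only.
def rss_items_to_schema(rss_xml: str, item_type: str = "NewsArticle") -> list[dict]:
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "@context": "https://schema.org",
            "@type": item_type,  # e.g. "NewsArticle" or "TechArticle", not "PodcastEpisode"
            "name": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items

rss = """<rss version="2.0"><channel>
  <item><title>Hello</title><link>https://example.com/a</link>
        <description>First post</description></item>
</channel></rss>"""
print(rss_items_to_schema(rss)[0]["@type"])  # NewsArticle
```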
On Wed, Sep 10, 2025, galaxy101quest wrote:

> Thanks for the quick reply, rvguha :) I tried doing that just now, but unfortunately it didn't work :(
>
> I simply deleted all of the old INCLUDE_TYPES in db_load_utils.py, listed a few new ones, e.g. "TechArticle", "Report", etc., and tested a new database import. It again imported the new entry as {"@type": "PodcastEpisode", ...}.
>
> I was wondering why, and when I searched for "podcast" in the NLWeb project files, I saw 225 matches in 21 files, so I guess it's not just one place to edit this out. Some of the other types, e.g. "RealEstateListing" and "Festival", had far fewer matches. So I'm guessing it's not as easy as that. Do you have any other suggestions?
rvguha replied:

Fantastic. Great to hear that. Can you send a PR with your changes?
galaxy101quest replied:

Thanks, you were right :)

For anyone else who might be reading this and uses RSS: I made modifications to "def parse_rss_2_0" in the rss2schema.py file, and it all worked out well in the end :)

FYI: I found that the default RSS ingestion was not precise enough for my website (it produced only 1 point in the db), so I further adjusted db_load.py and introduced chunking based on the article's HTML elements (H1, H2, H3, H4, H5, <p>, <ul>, <table>, etc.). This worked well: an article of approx. 3,000 words ended up as 50 points in Qdrant, each with proper schema_json, and afterwards the chat was able to pull details out of the article much better, with high precision and no errors or missing information.
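The heading-based chunking described above can be sketched roughly as follows. This is not the poster's actual db_load.py change; it is a minimal stdlib-only illustration where each H1-H5 starts a new chunk and the text of subsequent elements (p, ul, table, etc.) accumulates into the current chunk.

```python
from html.parser import HTMLParser

# Illustrative sketch of heading-based chunking, not the actual db_load.py patch.
class HeadingChunker(HTMLParser):
    HEADINGS = {"h1", "h2", "h3", "h4", "h5"}

    def __init__(self):
        super().__init__()
        self.chunks = [[]]  # lists of text fragments, one list per chunk

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.chunks.append([])  # each heading opens a new chunk

    def handle_data(self, data):
        if data.strip():
            self.chunks[-1].append(data.strip())

def chunk_html(html: str) -> list[str]:
    """Split article HTML into one text chunk per heading section."""
    parser = HeadingChunker()
    parser.feed(html)
    return [" ".join(c) for c in parser.chunks if c]

html = "<h1>Intro</h1><p>One.</p><h2>Details</h2><p>Two.</p><ul><li>Three</li></ul>"
print(chunk_html(html))  # ['Intro One.', 'Details Two. Three']
```

Each resulting chunk would then be embedded as its own point (e.g. in Qdrant), which is why a ~3,000-word article can turn into dozens of points instead of one.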