Implement Opinion Extraction Stage in Podcast Processing Pipeline #3
Open
sanchitmonga22 wants to merge 8 commits into main from smonga/opinion_extractor
Conversation
This commit introduces the Opinion Extraction Stage to the AllInVault podcast processing pipeline, enhancing the system's ability to extract and track opinions expressed in podcast episodes. Key changes include:

- Added `OpinionExtractorService` for managing opinion extraction using LLM integration.
- Created `ExtractOpinionsStage` to facilitate the execution of the opinion extraction process within the pipeline.
- Developed `Opinion` model to represent extracted opinions, including metadata and relationships.
- Implemented `OpinionRepository` for managing storage and retrieval of opinion data.
- Updated architecture documentation to reflect the new stage and its components, ensuring clarity and adherence to SOLID principles.
- Modified CLI to support the new extraction stage and its parameters.

This update improves the modularity and scalability of the system, allowing for better tracking of opinions over time and enhancing the overall functionality of the podcast processing pipeline.
This commit introduces significant improvements to the opinion extraction system within the AllInVault podcast processing pipeline. Key changes include:

- Added support for multiple transcript formats (JSON and TXT) in the `OpinionExtractorService`, enhancing flexibility in processing.
- Implemented a new script, `run_opinion_extraction_for_first_10.py`, to facilitate the extraction of opinions from the first ten episodes in chronological order, with configurable parameters for batch processing and rate limit management.
- Enhanced the `Opinion` model to include additional fields for tracking opinion evolution, contradictions, and speaker timestamps, improving the granularity of opinion data.
- Introduced a `CategoryRepository` for managing opinion categories, allowing for structured categorization and improved data organization.

These enhancements improve the scalability and maintainability of the system, allowing for better tracking and analysis of opinions expressed in podcast episodes.
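As a rough illustration of the JSON/TXT transcript support mentioned above, format handling could be dispatched on the file extension. This is only a sketch and not the PR's code: the function name `load_transcript_text` and the JSON layout (a top-level `"segments"` list of objects with `"speaker"` and `"text"` keys) are assumptions made for the example.

```python
import json
from pathlib import Path


def load_transcript_text(path: Path) -> str:
    """Return plain transcript text from a .txt or .json file.

    Assumption: JSON transcripts store a list of segments under a
    top-level "segments" key, each with "text" and an optional "speaker".
    The real schema used by the pipeline may differ.
    """
    suffix = path.suffix.lower()
    if suffix == ".txt":
        # Plain-text transcripts are returned unchanged.
        return path.read_text(encoding="utf-8")
    if suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        # Flatten segments into "speaker: text" lines.
        lines = [
            f'{seg.get("speaker", "unknown")}: {seg["text"]}'
            for seg in data["segments"]
        ]
        return "\n".join(lines)
    raise ValueError(f"Unsupported transcript format: {suffix}")
```

Raising on unknown extensions, rather than guessing, keeps a mislabeled file from silently producing an empty extraction.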
… Logic

This commit introduces significant enhancements to the opinion tracking architecture and refines the opinion extraction logic within the podcast processing pipeline. Key changes include:

- Expanded the architecture documentation to include a comprehensive overview of the Opinion Evolution Tracking system, detailing data models for Opinion, OpinionAppearance, and SpeakerStance.
- Implemented a new migration function in the OpinionRepository to seamlessly convert legacy opinion formats to the new structure, ensuring data integrity and backward compatibility.
- Refactored the OpinionExtractorService to utilize the new Opinion model, allowing for better tracking of speaker stances and opinion evolution across episodes.
- Updated the LLMService to support enhanced prompts for opinion extraction, improving the accuracy and relevance of extracted data.
- Enhanced the pipeline orchestrator to process episodes individually, allowing for more granular control over opinion extraction and metadata management.
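To make the legacy-to-new migration idea more concrete, here is a minimal sketch. The three class names come from the commit message, but every field shown, the flat legacy record layout, and the function name `migrate_legacy` are hypothetical; the PR's actual models are not visible on this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpeakerStance:
    speaker: str
    stance: str  # hypothetical values, e.g. "support", "oppose", "unclear"


@dataclass
class OpinionAppearance:
    episode_id: str
    stances: List[SpeakerStance] = field(default_factory=list)


@dataclass
class Opinion:
    id: str
    title: str
    appearances: List[OpinionAppearance] = field(default_factory=list)


def migrate_legacy(record: dict) -> Opinion:
    """Convert a record to the nested structure.

    Assumption: legacy records are flat (one episode, one speaker), while
    new-format records already carry an "appearances" list.
    """
    if "appearances" in record:
        # Already in the new format: rebuild the nested objects as-is.
        appearances = [
            OpinionAppearance(
                a["episode_id"],
                [SpeakerStance(**s) for s in a.get("stances", [])],
            )
            for a in record["appearances"]
        ]
    else:
        # Legacy format: wrap the single speaker/episode pair.
        stance = SpeakerStance(record["speaker"], record.get("stance", "unclear"))
        appearances = [OpinionAppearance(record["episode_id"], [stance])]
    return Opinion(record["id"], record["title"], appearances)
```

Accepting both shapes in one function means the migration can be re-run over a partially converted store without damaging records that were already migrated.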
…tion system within the AllInVault podcast processing pipeline. Key changes include:

- Expanded the architecture documentation to include a detailed overview of the new Opinion Extraction System, including a comprehensive system architecture diagram and component responsibilities.
- Implemented a new script, `convert_opinions_to_intermediate.py`, to convert existing opinions into intermediate formats, facilitating better data management and processing.
- Added a new script, `show_opinions.py`, for displaying opinions and their relationships, enhancing the usability of the opinion data.
- Refactored the `run_opinion_extraction_for_first_10.py` script to utilize the new multi-stage opinion extraction architecture, improving the efficiency and flexibility of the extraction process.

These enhancements improve the overall functionality and clarity of the opinion extraction system, allowing for better tracking and analysis of opinions expressed in podcast episodes.
…iew of the new LLM integration, including the modular architecture for LLM providers (DeepSeek and OpenAI) and their respective roles.

- Implemented a new script, `run_opinion_extraction_for_all_episodes.py`, which processes all podcast episodes in chronological order with robust checkpoint management, allowing for resumable execution and detailed logging.
- Updated the `LLMService` and `DeepSeekProvider` to support the new model configurations and improved error handling, enhancing the reliability of LLM interactions.
- Enhanced the `OpinionExtractionService` to utilize the new LLM model defaults and improved the speaker identification logic within the `DeepSeekProvider`.

These enhancements improve the scalability, maintainability, and clarity of the opinion extraction system, facilitating better tracking and analysis of opinions expressed in podcast episodes.
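One common way to structure a modular provider layer like the one described here is a small interface plus a registry that selects a provider by name. The sketch below assumes that shape; `CompletionProvider`, `EchoProvider`, and `ProviderRegistry` are illustrative names, not the PR's classes, and no real DeepSeek or OpenAI API is called.

```python
from typing import Dict, Optional, Protocol


class CompletionProvider(Protocol):
    """Hypothetical interface each LLM backend would implement."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in backend that echoes the prompt; replaces a real API client."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class ProviderRegistry:
    """Routes requests to a named provider, falling back to a default."""

    def __init__(self, providers: Dict[str, CompletionProvider], default: str) -> None:
        if default not in providers:
            raise KeyError(f"Unknown default provider: {default}")
        self.providers = providers
        self.default = default

    def complete(self, prompt: str, provider: Optional[str] = None) -> str:
        name = provider or self.default
        if name not in self.providers:
            raise KeyError(f"Unknown provider: {name}")
        return self.providers[name].complete(prompt)
```

With this layout, adding a backend means registering one more object; the calling code does not change.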
This commit introduces a robust checkpoint management system to the opinion extraction pipeline, allowing for resumable processing and improved error recovery. Key changes include:

- Added `CheckpointService` to track and manage the progress of opinion extraction across episodes, ensuring that the process can be resumed from the last successful point.
- Enhanced the `OpinionExtractionService` to utilize the new checkpointing features, allowing for better control over the extraction process and improved logging.
- Refactored the `ExtractOpinionsStage` to integrate checkpoint management, enabling granular control over opinion extraction stages.
- Updated the architecture documentation to reflect the new checkpoint management system and its integration into the opinion extraction pipeline.

These enhancements improve the scalability, maintainability, and reliability of the opinion extraction system, facilitating better tracking and analysis of opinions expressed in podcast episodes.
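The resumable behaviour described in this commit can be sketched as a small file-backed checkpoint. This is a simplified illustration under assumptions, not the PR's `CheckpointService`: the class `EpisodeCheckpoint`, the helper `process_all`, and the JSON file layout are all hypothetical.

```python
import json
from pathlib import Path


class EpisodeCheckpoint:
    """Hypothetical checkpoint: remembers which episode IDs have finished."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.completed = []
        if path.exists():
            # Resume from whatever an earlier run recorded.
            self.completed = json.loads(path.read_text(encoding="utf-8"))["completed"]

    def is_done(self, episode_id: str) -> bool:
        return episode_id in self.completed

    def mark_done(self, episode_id: str) -> None:
        if episode_id not in self.completed:
            self.completed.append(episode_id)
        # Write to a temporary file first, then swap it in, so an
        # interrupted write does not leave a corrupt checkpoint.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"completed": self.completed}), encoding="utf-8")
        tmp.replace(self.path)


def process_all(episode_ids, extract, checkpoint):
    """Run extract() on each episode not yet checkpointed; return those processed."""
    processed = []
    for episode_id in episode_ids:
        if checkpoint.is_done(episode_id):
            continue  # finished in an earlier run
        extract(episode_id)  # may raise; nothing is recorded for this episode
        checkpoint.mark_done(episode_id)
        processed.append(episode_id)
    return processed
```

Because an episode is marked done only after `extract` returns, a crash mid-episode causes that episode to be retried on the next run while earlier episodes are skipped.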
This commit introduces significant improvements to the opinion categorization system and checkpoint management within the AllInVault platform. Key changes include:

- Expanded the architecture documentation to detail the new Opinion Categorization System, including a modular architecture diagram and component responsibilities.
- Implemented a `Categorization Service` to standardize raw categories and utilize LLM for intelligent categorization, ensuring categories exist in the repository.
- Enhanced the `Checkpoint Service` to save and retrieve LLM responses, improving error recovery and allowing for efficient caching of categorization results.
- Added a new script, `run_from_stage.py`, to facilitate resuming the extraction process from the last completed stage, improving usability and control over the extraction pipeline.
- Updated the `OpinionExtractionService` and `RelationshipService` to integrate the new checkpointing features, allowing for better tracking and management of extraction stages.

These enhancements improve the scalability, maintainability, and reliability of the opinion extraction system, facilitating better organization and analysis of opinions expressed in podcast episodes.
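The categorization flow above (standardize the raw label, fall back to an LLM, cache the response, make sure the category exists) might look roughly like the following. Everything here is an assumption for illustration: the alias table, the function names, the use of a plain set as the category store, and the `classify` callable standing in for the LLM call.

```python
from typing import Callable, Dict, List, Set

# Hypothetical alias table mapping raw labels to standard category names.
ALIASES: Dict[str, str] = {"political": "politics", "econ": "economics"}


def standardize(raw: str) -> str:
    """Lowercase, collapse whitespace, and apply known aliases."""
    key = " ".join(raw.strip().lower().split())
    return ALIASES.get(key, key)


def categorize(
    raw: str,
    known: Set[str],
    classify: Callable[[str, List[str]], str],
    cache: Dict[str, str],
) -> str:
    """Return a category name that is guaranteed to be in `known`.

    `classify` stands in for the LLM call; `cache` stands in for saved
    LLM responses, so a repeated label does not trigger a second call.
    """
    name = standardize(raw)
    if name not in known:
        if name not in cache:
            cache[name] = classify(name, sorted(known))
        name = cache[name]
        known.add(name)  # ensure the chosen category exists in the store
    return name
```

Keeping the cache keyed on the standardized label, rather than the raw string, lets variants such as differing case or spacing share one cached response.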
This commit introduces the Opinion Extraction Stage to the AllInVault podcast processing pipeline, enhancing the system's ability to extract and track opinions expressed in podcast episodes. Key changes include:
- Added `OpinionExtractorService` for managing opinion extraction using LLM integration.
- Created `ExtractOpinionsStage` to facilitate the execution of the opinion extraction process within the pipeline.
- Developed `Opinion` model to represent extracted opinions, including metadata and relationships.
- Implemented `OpinionRepository` for managing storage and retrieval of opinion data.

This update improves the modularity and scalability of the system, allowing for better tracking of opinions over time and enhancing the overall functionality of the podcast processing pipeline.