-
Notifications
You must be signed in to change notification settings - Fork 0
Add Glassdoor platform adapter for job scraping #39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
This repository is associated with groupthinking whose free trial has ended. Subscribe at jazzberry.ai. |
|
Important Review skippedBot user detected. To trigger a single review, invoke the You can disable this status message by setting the Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. Comment |
Co-authored-by: groupthinking <[email protected]>
Co-authored-by: groupthinking <[email protected]>
Co-authored-by: groupthinking <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
| // Phase 1: JobSpy platforms (4 core platforms) | |
| const jobSpyPlatforms = params.platforms.filter(p => | |
| ['indeed', 'linkedin', 'glassdoor', 'ziprecruiter'].includes(p) | |
| ); |
In searchAllPlatforms the Glassdoor platform is always grouped with the JobSpy MCP platforms and sent to jobSpyAdapter (lines 80‑83), and the method has no branch for the new 'glassdoor-direct' option. As a result, multi-platform searches ignore USE_GLASSDOOR_DIRECT=true and drop any 'glassdoor-direct' entry from params.platforms, returning either JobSpy data or no Glassdoor results at all even though a dedicated GlassdoorAdapter was added elsewhere. Consider routing Glassdoor through the dedicated adapter when requested so the new scraper can actually run during parallel searches.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a comprehensive Glassdoor platform adapter for job scraping, extending the AJOB4AGENT system with a new job source. The implementation follows the established LinkedIn adapter pattern with a Playwright-based scraper, utility functions, comprehensive tests, and integration with the agent orchestrator.
Key Changes
- New Glassdoor scraper with Playwright-based automation, pagination support, and rate limiting with jitter
- Comprehensive utility functions for URL building, HTML parsing, and rating/review count parsing
- 88+ unit tests with HTML fixtures achieving 80%+ coverage targets
- Agent orchestrator integration with dual routing (dedicated adapter + JobSpy fallback)
Reviewed changes
Copilot reviewed 23 out of 25 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
services/platform-adapters/types/job.ts |
Shared type definitions for unified Job interface and search filters |
services/platform-adapters/glassdoor/src/types.ts |
Glassdoor-specific types extending base Job interface with rating, salary, and Easy Apply fields |
services/platform-adapters/glassdoor/src/GlassdoorJobScraper.ts |
Core scraper class with modal dismissal, throttling, and pagination logic |
services/platform-adapters/glassdoor/src/utils/*.ts |
URL builder, HTML parser, and rating parser utilities |
services/platform-adapters/glassdoor/src/base/* |
Base scraper class and duplicated type definitions |
services/platform-adapters/glassdoor/tests/*.test.ts |
Comprehensive unit tests for scraper, parsers, and utilities |
services/platform-adapters/glassdoor/tests/fixtures/*.html |
Mock HTML fixtures for testing |
services/platform-adapters/glassdoor/package.json |
Package configuration with dependencies and test scripts |
services/platform-adapters/glassdoor/Dockerfile |
Multi-stage Docker build with Playwright support |
services/platform-adapters/glassdoor/README.md |
Comprehensive documentation with API reference and examples |
services/agent-orchestrator/src/adapters/glassdoor-adapter.ts |
HTTP-based adapter for agent orchestrator (simplified regex parsing) |
services/agent-orchestrator/src/adapters/platform-manager.ts |
Integration with dual routing for glassdoor/glassdoor-direct platforms |
.gitignore |
Excludes TypeScript build artifacts from platform-adapters |
| async search(filters: GlassdoorSearchFilters): Promise<GlassdoorJob[]> { | ||
| await this.initialize(); | ||
|
|
||
| const page = await this.newPage(); | ||
| const allJobs: GlassdoorJob[] = []; | ||
|
|
||
| try { | ||
| const searchUrl = this.buildSearchUrl(filters); | ||
| console.log(`🔍 Searching Glassdoor: ${searchUrl}`); | ||
|
|
||
| await this.throttle(); | ||
| await page.goto(searchUrl, { | ||
| waitUntil: 'networkidle', | ||
| timeout: this.config.timeout | ||
| }); | ||
|
|
||
| // Handle potential modal or overlay | ||
| await this.dismissModals(page); | ||
|
|
||
| let pageNum = 1; | ||
| const maxResults = this.config.maxResults || 50; | ||
|
|
||
| while (allJobs.length < maxResults) { | ||
| console.log(`📄 Processing page ${pageNum}...`); | ||
|
|
||
| // Handle modals that might appear mid-session | ||
| await this.dismissModals(page); | ||
|
|
||
| const jobs = await this.parseJobList(page); | ||
| allJobs.push(...jobs); | ||
|
|
||
| console.log(`✅ Total jobs collected: ${allJobs.length}`); | ||
|
|
||
| if (allJobs.length >= maxResults) { | ||
| break; | ||
| } | ||
|
|
||
| const hasMore = await this.hasNextPage(page); | ||
| if (!hasMore) { | ||
| console.log('📍 No more pages available'); | ||
| break; | ||
| } | ||
|
|
||
| await this.throttle(); | ||
| await this.goToNextPage(page); | ||
| pageNum++; | ||
| } | ||
|
|
||
| return allJobs.slice(0, maxResults); | ||
|
|
||
| } finally { | ||
| await page.close(); | ||
| } | ||
| } |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The search() method is fully overridden in the subclass and doesn't call super.search(), which duplicates pagination logic from the base class. This violates the DRY principle and makes maintenance harder since changes to search logic would need to be duplicated. Consider either calling super.search() and only customizing the modal dismissal, or refactoring the base class to allow hooks for platform-specific behavior.
| private async parseBasicInfoFromDetailPage(page: Page): Promise<GlassdoorJob> { | ||
| const title = await page.$eval('[data-test="job-title"], .JobDetails_jobTitle__bFQf_, h1', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => 'Unknown Title'); | ||
|
|
||
| const company = await page.$eval('[data-test="employer-name"], .EmployerProfile_employerName__Xemli', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => 'Unknown Company'); | ||
|
|
||
| const location = await page.$eval('[data-test="emp-location"], .JobDetails_location__4j4Qv', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => ''); | ||
|
|
||
| const salary = await page.$eval('[data-test="detailSalary"], .JobDetails_salaryEstimate__cPQyl', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => undefined); | ||
|
|
||
| const ratingText = await page.$eval('[data-test="rating"], .EmployerProfile_ratingValue__2BBWA', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => ''); | ||
|
|
||
| const reviewCountText = await page.$eval('[data-test="review-count"]', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => ''); | ||
|
|
||
| const easyApply = await page.$('[data-test="easy-apply"], .JobDetails_easyApply__YZw6j') | ||
| .then(el => el !== null) | ||
| .catch(() => false); | ||
|
|
||
| // Extract job ID from URL | ||
| const url = page.url(); | ||
| const idMatch = url.match(/jl=(\d+)/) || url.match(/-JV_(\d+)/); | ||
| const id = idMatch ? idMatch[1] : `gd-${Date.now()}`; | ||
|
|
||
| return { | ||
| id, | ||
| title, | ||
| company, | ||
| location, | ||
| salary: salary || undefined, | ||
| companyRating: ratingText ? parseRating(ratingText) : undefined, | ||
| reviewCount: reviewCountText ? parseReviewCount(reviewCountText) : undefined, | ||
| easyApply, | ||
| salaryEstimate: salary ? parseSalaryEstimate(salary) : undefined, | ||
| url, | ||
| tags: [], | ||
| source: 'glassdoor', | ||
| scrapedAt: new Date().toISOString() | ||
| }; | ||
| } |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parseBasicInfoFromDetailPage method returns a GlassdoorJob object with potentially empty/default values when elements are not found (lines 157-203). However, the method name suggests it only parses "basic info", not a complete job object. This could lead to confusion about what data is guaranteed to be present. Consider renaming to parseJobFromDetailPage or returning a partial type to clarify expectations.
| const errorMessage = error instanceof Error ? error.message : 'Unknown error'; | ||
| console.error('❌ Glassdoor search failed:', errorMessage); |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mixing error types in the catch block - using Error type check but also catching all errors with unknown. The pattern on line 97 error instanceof Error ? error.message : 'Unknown error' is correct, but this could be simplified using optional chaining: (error as Error)?.message || 'Unknown error'. More importantly, errors should be logged with full stack traces for debugging, not just the message.
| const errorMessage = error instanceof Error ? error.message : 'Unknown error'; | |
| console.error('❌ Glassdoor search failed:', errorMessage); | |
| const errorMessage = (error as Error)?.message || 'Unknown error'; | |
| console.error('❌ Glassdoor search failed:', errorMessage, error); |
| private async rateLimitInterceptor(config: any) { | ||
| const now = Date.now(); | ||
| const timeSinceLastRequest = now - this.lastRequestTime; | ||
|
|
||
| if (timeSinceLastRequest < this.minRequestInterval) { | ||
| const delay = this.minRequestInterval - timeSinceLastRequest; | ||
| console.log(`⏳ Glassdoor rate limiting: waiting ${delay}ms`); | ||
| await new Promise(resolve => setTimeout(resolve, delay)); | ||
| } | ||
|
|
||
| if (this.requestCount >= this.maxRequestsPerWindow) { | ||
| const windowReset = this.lastRequestTime + this.rateLimitWindow; | ||
| if (now < windowReset) { | ||
| const delay = windowReset - now; | ||
| console.log(`⚠️ Glassdoor rate limit reached: waiting ${Math.ceil(delay / 1000)}s`); | ||
| await new Promise(resolve => setTimeout(resolve, delay)); | ||
| this.requestCount = 0; | ||
| } | ||
| } | ||
|
|
||
| this.lastRequestTime = Date.now(); | ||
| this.requestCount++; | ||
|
|
||
| return config; |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rate limiting logic has a potential race condition. The requestCount is checked and incremented in the interceptor (lines 59-70), but if multiple requests are made concurrently, they could all pass the check before any increments the counter. Consider using an atomic operation or mutex to ensure thread-safety, or document that this adapter should not be used concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in commit 2dca7cc. Implemented a mutex pattern using Promise chaining for thread-safe rate limiting:
- Added
acquireRateLimitLock()method that ensures only one request processes rate limiting at a time - Uses Promise chaining to serialize access - each request waits for the previous to complete before checking/updating counters
- Added
windowStartTimefor proper rate limit window tracking - Updated
getRateLimitStatus()to use the correct window timing
The adapter is now safe for concurrent access.
| console.log('⚠️ No job listings found on page'); | ||
| return []; | ||
| } | ||
|
|
||
| // Get page HTML and parse job cards | ||
| const html = await page.content(); | ||
| const rawJobs = parseJobListHtml(html); | ||
|
|
||
| // Transform raw job cards to GlassdoorJob format | ||
| const jobs: GlassdoorJob[] = rawJobs.map(raw => { | ||
| const job = transformRawJobCard(raw); | ||
|
|
||
| // Parse salary estimate if available | ||
| if (raw.salary) { | ||
| job.salaryEstimate = parseSalaryEstimate(raw.salary); | ||
| } | ||
|
|
||
| return job; | ||
| }); | ||
|
|
||
| console.log(`📋 Parsed ${jobs.length} jobs from current page`); | ||
| return jobs; | ||
| } | ||
|
|
||
| /** | ||
| * Check if there's a next page available | ||
| */ | ||
| protected async hasNextPage(page: Page): Promise<boolean> { | ||
| try { | ||
| const html = await page.content(); | ||
| return hasNextPageButton(html); | ||
| } catch { | ||
| return false; | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Navigate to the next page | ||
| */ | ||
| protected async goToNextPage(page: Page): Promise<void> { | ||
| const nextButton = await page.$('[data-test="pagination-next"], .nextButton, [aria-label="Next"]'); | ||
|
|
||
| if (nextButton) { | ||
| await nextButton.click(); | ||
| await page.waitForLoadState('networkidle'); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Get detailed job information from a job detail page | ||
| */ | ||
| async getJobDetails(jobUrl: string): Promise<GlassdoorJob> { | ||
| await this.initialize(); | ||
|
|
||
| const page = await this.newPage(); | ||
|
|
||
| try { | ||
| await this.throttle(); | ||
| await page.goto(jobUrl, { | ||
| waitUntil: 'networkidle', | ||
| timeout: this.config.timeout | ||
| }); | ||
|
|
||
| // Wait for job description to load | ||
| await page.waitForSelector('[data-test="job-description"], .JobDetails_jobDescription__uW_fK, #JobDescriptionContainer', { | ||
| timeout: 10000 | ||
| }).catch(() => { | ||
| console.log('⚠️ Job description selector not found, continuing...'); | ||
| }); | ||
|
|
||
| // Parse the page content | ||
| const html = await page.content(); | ||
| const detail = parseJobDetailHtml(html); | ||
|
|
||
| // Get basic info from the page | ||
| const basicInfo = await this.parseBasicInfoFromDetailPage(page); | ||
|
|
||
| return { | ||
| ...basicInfo, | ||
| descriptionHtml: detail.description, | ||
| description: this.stripHtml(detail.description), | ||
| companySize: detail.companySize, | ||
| industry: detail.industry, | ||
| headquarters: detail.headquarters, | ||
| benefits: detail.benefits, | ||
| skills: detail.skills, | ||
| employmentType: detail.employmentType, | ||
| url: jobUrl, | ||
| source: 'glassdoor', | ||
| scrapedAt: new Date().toISOString() | ||
| }; | ||
|
|
||
| } finally { | ||
| await page.close(); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Parse basic job info from the detail page | ||
| */ | ||
| private async parseBasicInfoFromDetailPage(page: Page): Promise<GlassdoorJob> { | ||
| const title = await page.$eval('[data-test="job-title"], .JobDetails_jobTitle__bFQf_, h1', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => 'Unknown Title'); | ||
|
|
||
| const company = await page.$eval('[data-test="employer-name"], .EmployerProfile_employerName__Xemli', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => 'Unknown Company'); | ||
|
|
||
| const location = await page.$eval('[data-test="emp-location"], .JobDetails_location__4j4Qv', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => ''); | ||
|
|
||
| const salary = await page.$eval('[data-test="detailSalary"], .JobDetails_salaryEstimate__cPQyl', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => undefined); | ||
|
|
||
| const ratingText = await page.$eval('[data-test="rating"], .EmployerProfile_ratingValue__2BBWA', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => ''); | ||
|
|
||
| const reviewCountText = await page.$eval('[data-test="review-count"]', | ||
| el => el.textContent?.trim() || '' | ||
| ).catch(() => ''); | ||
|
|
||
| const easyApply = await page.$('[data-test="easy-apply"], .JobDetails_easyApply__YZw6j') | ||
| .then(el => el !== null) | ||
| .catch(() => false); | ||
|
|
||
| // Extract job ID from URL | ||
| const url = page.url(); | ||
| const idMatch = url.match(/jl=(\d+)/) || url.match(/-JV_(\d+)/); | ||
| const id = idMatch ? idMatch[1] : `gd-${Date.now()}`; | ||
|
|
||
| return { | ||
| id, | ||
| title, | ||
| company, | ||
| location, | ||
| salary: salary || undefined, | ||
| companyRating: ratingText ? parseRating(ratingText) : undefined, | ||
| reviewCount: reviewCountText ? parseReviewCount(reviewCountText) : undefined, | ||
| easyApply, | ||
| salaryEstimate: salary ? parseSalaryEstimate(salary) : undefined, | ||
| url, | ||
| tags: [], | ||
| source: 'glassdoor', | ||
| scrapedAt: new Date().toISOString() | ||
| }; | ||
| } | ||
|
|
||
| /** | ||
| * Strip HTML tags from a string | ||
| */ | ||
| private stripHtml(html: string): string { | ||
| return html | ||
| .replace(/<[^>]*>/g, ' ') | ||
| .replace(/\s+/g, ' ') | ||
| .trim(); | ||
| } | ||
|
|
||
| /** | ||
| * Handle Glassdoor's anti-bot measures | ||
| * Override the throttle method with more aggressive rate limiting | ||
| */ | ||
| protected async throttle(): Promise<void> { | ||
| const now = Date.now(); | ||
| const elapsed = now - this.lastRequestTime; | ||
|
|
||
| // Use a longer throttle time for Glassdoor (default 3 seconds) | ||
| const throttleMs = this.config.throttleMs || 3000; | ||
|
|
||
| // Add some randomness to avoid detection patterns | ||
| const jitter = Math.floor(Math.random() * 1000); | ||
| const totalDelay = throttleMs + jitter; | ||
|
|
||
| if (elapsed < totalDelay) { | ||
| const delay = totalDelay - elapsed; | ||
| console.log(`⏳ Throttling: waiting ${delay}ms`); |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The console.log statements should be replaced with proper logging using a logger instance. Using console.log directly in production code makes it difficult to control log levels, format output consistently, or redirect logs to different destinations. Consider using a logging library or creating a logger interface.
| await new Promise(resolve => setTimeout(resolve, delay)); | ||
| } | ||
|
|
||
| this.lastRequestTime = Date.now(); |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lastRequestTime property is declared as protected in the base class, but the throttle method in GlassdoorJobScraper (line 223) attempts to access it. While this is technically correct for protected members, the overridden throttle() method never calls super.throttle(), which means the base class's lastRequestTime won't be updated. This could cause issues if other methods in the base class rely on lastRequestTime being current.
| this.lastRequestTime = Date.now(); | |
| this.lastRequestTime = Date.now(); | |
| await super.throttle(); |
Co-authored-by: Copilot <[email protected]>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
|
@copilot apply changes based on the comments in this thread |
…efactor search() to use hooks, add thread-safety docs Co-authored-by: groupthinking <[email protected]>
services/platform-adapters/glassdoor/src/GlassdoorJobScraper.ts
Outdated
Show resolved
Hide resolved
Applied all code review changes in commit d54b047:
All 88 tests still passing. |
…ifying shared config Co-authored-by: groupthinking <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
…tead of fragile regex patterns Co-authored-by: groupthinking <[email protected]>
…e concurrent access Co-authored-by: groupthinking <[email protected]>
Co-authored-by: vercel[bot] <35613825+vercel[bot]@users.noreply.github.com>
| const searchParams: Record<string, string> = {}; | ||
|
|
||
| if (params.searchTerm) { | ||
| searchParams.sc = 'keyword=' + encodeURIComponent(params.searchTerm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The search parameter construction double-encodes the search term, causing malformed query parameters that Glassdoor won't understand.
View Details
📝 Patch Details
diff --git a/services/agent-orchestrator/src/adapters/glassdoor-adapter.ts b/services/agent-orchestrator/src/adapters/glassdoor-adapter.ts
index 45e1593..5e70728 100644
--- a/services/agent-orchestrator/src/adapters/glassdoor-adapter.ts
+++ b/services/agent-orchestrator/src/adapters/glassdoor-adapter.ts
@@ -166,7 +166,7 @@ export class GlassdoorAdapter {
const searchParams: Record<string, string> = {};
if (params.searchTerm) {
- searchParams.sc = 'keyword=' + encodeURIComponent(params.searchTerm);
+ searchParams.sc = 'keyword=' + params.searchTerm;
}
if (params.location) {
diff --git a/services/platform-adapters/glassdoor/src/utils/urlBuilder.ts b/services/platform-adapters/glassdoor/src/utils/urlBuilder.ts
index 83a9553..6776cb7 100644
--- a/services/platform-adapters/glassdoor/src/utils/urlBuilder.ts
+++ b/services/platform-adapters/glassdoor/src/utils/urlBuilder.ts
@@ -55,7 +55,7 @@ export function buildGlassdoorSearchUrl(filters: GlassdoorSearchFilters): string
// Keywords/search term
if (filters.keywords) {
- params.sc = 'keyword=' + encodeURIComponent(filters.keywords);
+ params.sc = 'keyword=' + filters.keywords;
}
// Location handling
Analysis
Double-encoding of search parameters causes malformed Glassdoor API requests
What fails: Search parameter construction in GlassdoorAdapter.buildSearchParams() and buildGlassdoorSearchUrl() double-encodes the search term, resulting in malformed query parameters that Glassdoor cannot parse.
How to reproduce:
// In services/platform-adapters/glassdoor/src/utils/urlBuilder.ts
const filters = { keywords: 'software engineer' };
const url = buildGlassdoorSearchUrl(filters);
// Result: ?sc=keyword%3Dsoftware%2520engineerWhat happens: When axios/URL API receives a pre-encoded parameter value like sc='keyword=software%20engineer', it automatically URL-encodes all parameter values again, resulting in:
- Original space:
20→ encoded to%20→ re-encoded to%2520 - Original
=sign: → encoded to%3D - Final URL:
?sc=keyword%3Dsoftware%2520engineer - Glassdoor receives:
sc parameter value = "keyword=software%20engineer"(with literal%20instead of space)
Expected behavior: axios should handle all URL encoding automatically. Parameters passed to axios should NOT be pre-encoded. Let axios/URL API do the single encoding pass:
- With fix:
?sc=keyword%3Dsoftware+engineer - Glassdoor receives:
sc parameter value = "keyword=software engineer"(properly decoded)
Fix applied: Removed encodeURIComponent() calls in both:
services/agent-orchestrator/src/adapters/glassdoor-adapter.tsline 169services/platform-adapters/glassdoor/src/utils/urlBuilder.tsline 58
Changed from: searchParams.sc = 'keyword=' + encodeURIComponent(params.searchTerm); Changed to: searchParams.sc = 'keyword=' + params.searchTerm;
This allows axios to perform a single, correct URL encoding rather than double-encoding the parameter values.
Verification: All existing tests pass, URL now correctly generates sc=keyword%3Dsoftware+engineer instead of sc=keyword%3Dsoftware%2520engineer
| params.sc = 'keyword=' + encodeURIComponent(filters.keywords); | ||
| } | ||
|
|
||
| // Location handling | ||
| if (filters.location) { | ||
| params.locT = 'C'; // City type | ||
| params.locKeyword = filters.location; | ||
| } | ||
|
|
||
| // Remote work filter | ||
| if (filters.remote) { | ||
| params.remoteWorkType = '1'; | ||
| } | ||
|
|
||
| // Salary filter | ||
| if (filters.salary?.min) { | ||
| params.minSalary = filters.salary.min.toString(); | ||
| } | ||
| if (filters.salary?.max) { | ||
| params.maxSalary = filters.salary.max.toString(); | ||
| } | ||
|
|
||
| // Date posted filter | ||
| const fromAge = mapDatePosted(filters.datePosted); | ||
| if (fromAge) { | ||
| params.fromAge = fromAge; | ||
| } | ||
|
|
||
| // Easy Apply filter | ||
| if (filters.easyApplyOnly) { | ||
| params.applicationType = '1'; | ||
| } | ||
|
|
||
| // Company rating filter (minimum stars) | ||
| if ( | ||
| filters.companyRating !== undefined && | ||
| filters.companyRating !== null && | ||
| filters.companyRating >= 1 && | ||
| filters.companyRating <= 5 | ||
| ) { | ||
| params.minRating = filters.companyRating.toString(); | ||
| } | ||
|
|
||
| // Company size filter | ||
| const employeeCount = mapCompanySize(filters.companySize); | ||
| if (employeeCount) { | ||
| params.employerSizes = employeeCount; | ||
| } | ||
|
|
||
| // Job type filter | ||
| const jobType = mapJobType(filters.jobType); | ||
| if (jobType) { | ||
| params.jobType = jobType; | ||
| } | ||
|
|
||
| // Industry filter | ||
| if (filters.industry) { | ||
| params.industry = filters.industry; | ||
| } | ||
|
|
||
| // Company name filter | ||
| if (filters.companyName) { | ||
| params.employer = filters.companyName; | ||
| } | ||
|
|
||
| // Build query string | ||
| const queryString = Object.entries(params) | ||
| .map(([key, value]) => `${key}=${encodeURIComponent(value)}`) | ||
| .join('&'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URL builder double-encodes the search term by first encoding it, then encoding it again when building the final query string.
View Details
📝 Patch Details
diff --git a/services/agent-orchestrator/src/adapters/glassdoor-adapter.ts b/services/agent-orchestrator/src/adapters/glassdoor-adapter.ts
index 45e1593..5e70728 100644
--- a/services/agent-orchestrator/src/adapters/glassdoor-adapter.ts
+++ b/services/agent-orchestrator/src/adapters/glassdoor-adapter.ts
@@ -166,7 +166,7 @@ export class GlassdoorAdapter {
const searchParams: Record<string, string> = {};
if (params.searchTerm) {
- searchParams.sc = 'keyword=' + encodeURIComponent(params.searchTerm);
+ searchParams.sc = 'keyword=' + params.searchTerm;
}
if (params.location) {
diff --git a/services/platform-adapters/glassdoor/src/utils/urlBuilder.ts b/services/platform-adapters/glassdoor/src/utils/urlBuilder.ts
index 83a9553..6776cb7 100644
--- a/services/platform-adapters/glassdoor/src/utils/urlBuilder.ts
+++ b/services/platform-adapters/glassdoor/src/utils/urlBuilder.ts
@@ -55,7 +55,7 @@ export function buildGlassdoorSearchUrl(filters: GlassdoorSearchFilters): string
// Keywords/search term
if (filters.keywords) {
- params.sc = 'keyword=' + encodeURIComponent(filters.keywords);
+ params.sc = 'keyword=' + filters.keywords;
}
// Location handling
Analysis
Double-encoding of search terms in Glassdoor URL builder
What fails: The URL builder pre-encodes the search term keyword using encodeURIComponent(), then encodes it again when building the query string. This results in %20 (space character) being double-encoded as %2520, which malforms the search parameters sent to Glassdoor.
How to reproduce:
import { buildGlassdoorSearchUrl } from './services/platform-adapters/glassdoor/src/utils/urlBuilder';
const filters = { keywords: 'python developer' };
const url = buildGlassdoorSearchUrl(filters);
console.log(url);
// Actual: https://www.glassdoor.com/Job/jobs.htm?sc=keyword%3Dpython%2520developer
// Expected: https://www.glassdoor.com/Job/jobs.htm?sc=keyword%3Dpython%20developerWhat happens vs. expected behavior:
- Current (broken):
sc=keyword%3Dpython%2520developer- the space becomes%2520(double-encoded) - When decoded once:
keyword=python%20developer(still has encoded space) - Expected:
sc=keyword%3Dpython%20developer- space should be%20only (single-encoded)
Root cause: At line 58 of services/platform-adapters/glassdoor/src/utils/urlBuilder.ts, the code pre-encodes the search term:
params.sc = 'keyword=' + encodeURIComponent(filters.keywords);Then at lines 124-126, it encodes all parameter values again:
const queryString = Object.entries(params)
.map(([key, value]) => ` This causes double-encoding of the entire keyword= parameter value. Since axios automatically applies encodeURIComponent to param values, the same issue exists in services/agent-orchestrator/src/adapters/glassdoor-adapter.ts at line 169.
Fix: Remove the pre-encoding at the parameter construction stage and let the query string builder handle all encoding:
In services/platform-adapters/glassdoor/src/utils/urlBuilder.ts (line 58):
params.sc = 'keyword=' + filters.keywords; // Remove encodeURIComponentIn services/agent-orchestrator/src/adapters/glassdoor-adapter.ts (line 169):
searchParams.sc = 'keyword=' + params.searchTerm; // Remove encodeURIComponentBoth locations use the same pattern and require the same fix.
|
@copilot Fix this repo |
I see build errors in both services. The main issues are: Glassdoor adapter:
Agent orchestrator:
Would you like me to:
Or should I focus on a specific area? |
…nd playwright dev dependency Co-authored-by: groupthinking <[email protected]>
| /** | ||
| * Export jobs to JSON file | ||
| * Search for jobs with the given filters | ||
| */ | ||
| async exportToJson(jobs: Job[], filePath: string): Promise<void> { | ||
| const fs = await import('fs/promises'); | ||
| const path = await import('path'); | ||
| async search(filters: F): Promise<T[]> { | ||
| await this.initialize(); | ||
|
|
||
| const page = await this.newPage(); | ||
| const allJobs: T[] = []; | ||
|
|
||
| try { | ||
| const searchUrl = this.buildSearchUrl(filters); | ||
| await this.throttle(); | ||
| await page.goto(searchUrl, { | ||
| waitUntil: 'networkidle', | ||
| timeout: this.config.timeout | ||
| }); | ||
|
|
||
| // Ensure directory exists | ||
| const dir = path.dirname(filePath); | ||
| await fs.mkdir(dir, { recursive: true }); | ||
| let pageNum = 1; | ||
| const maxResults = this.config.maxResults || 50; | ||
|
|
||
| // Write jobs to file | ||
| while (allJobs.length < maxResults) { | ||
| const jobs = await this.parseJobList(page); | ||
| allJobs.push(...jobs); | ||
|
|
||
| if (allJobs.length >= maxResults) { | ||
| break; | ||
| } | ||
|
|
||
| const hasMore = await this.hasNextPage(page); | ||
| if (!hasMore) { | ||
| break; | ||
| } | ||
|
|
||
| await this.throttle(); | ||
| await this.goToNextPage(page); | ||
| pageNum++; | ||
| } | ||
|
|
||
| return allJobs.slice(0, maxResults); | ||
|
|
||
| } finally { | ||
| await page.close(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shared BaseJobScraper class is missing the beforeParseJobList() hook and its integration into the search() method, making it architecturally incomplete compared to the improved version in the Glassdoor module.
View Details
📝 Patch Details
diff --git a/services/platform-adapters/base/BaseJobScraper.ts b/services/platform-adapters/base/BaseJobScraper.ts
index 4f7c267..9c4ddb8 100644
--- a/services/platform-adapters/base/BaseJobScraper.ts
+++ b/services/platform-adapters/base/BaseJobScraper.ts
@@ -90,6 +90,14 @@ export abstract class BaseJobScraper<T extends Job = Job, F extends SearchFilter
this.lastRequestTime = Date.now();
}
+ /**
+ * Hook called before parsing job list on each page.
+ * Subclasses can override to add platform-specific behavior (e.g., dismissing modals).
+ */
+ protected async beforeParseJobList(_page: Page): Promise<void> {
+ // Default implementation does nothing
+ }
+
/**
* Search for jobs with the given filters
*/
@@ -107,10 +115,16 @@ export abstract class BaseJobScraper<T extends Job = Job, F extends SearchFilter
timeout: this.config.timeout
});
+ // Allow subclasses to perform platform-specific setup
+ await this.beforeParseJobList(page);
+
let pageNum = 1;
const maxResults = this.config.maxResults || 50;
while (allJobs.length < maxResults) {
+ // Allow subclasses to handle modals or other interruptions before parsing
+ await this.beforeParseJobList(page);
+
const jobs = await this.parseJobList(page);
allJobs.push(...jobs);
Analysis
Missing beforeParseJobList() hook in root BaseJobScraper class
What fails: The root BaseJobScraper class in services/platform-adapters/base/BaseJobScraper.ts is missing the beforeParseJobList() extension hook and its integration into the search() method, causing architectural inconsistency with the improved Glassdoor adapter implementation.
How to reproduce: Compare the two files:
- Root base:
services/platform-adapters/base/BaseJobScraper.ts - Glassdoor base:
services/platform-adapters/glassdoor/src/base/BaseJobScraper.ts
The Glassdoor version includes (lines 93-136 in complete form):
- A protected
beforeParseJobList(_page: Page): Promise<void>hook method with default no-op implementation - A call to
await this.beforeParseJobList(page)after the initial page navigation - A call to
await this.beforeParseJobList(page)before each iteration of the job parsing loop
The root base class was missing all three elements.
Result before fix: Subclasses using the root base class cannot override beforeParseJobList() to handle platform-specific setup (like dismissing modals) without reimplementing the entire search() method.
Expected behavior: Both base classes should be architecturally identical, providing the same extension points. The beforeParseJobList() hook allows subclasses to add platform-specific behavior (demonstrated in GlassdoorJobScraper.beforeParseJobList() which dismisses modals) without needing to override the entire search() method.
Fix applied: Added the missing hook method and integrated it into the search() method at two points:
- After initial page load (allows platform-specific setup before first parse)
- Before each parsing loop iteration (allows cleanup between pages, e.g., dismissing modals)
This provides a consistent, extensible base class for all platform adapters and matches the architectural pattern established in the Glassdoor implementation.
Original prompt
Objective
Develop Glassdoor platform adapter for job scraping, similar to the LinkedIn adapter, to expand job source coverage and provide users with more opportunities.
Related Issue
Closes #22
Requirements
Technical Specifications
Directory Structure
Glassdoor-Specific Job Interface
GlassdoorJobScraper Class
URL Builder Utility