Skip to content

feat: Add timeout handling to CAPTCHA solver API calls#7

Open
Cordycepsers wants to merge 1 commit intomainfrom
dev
Open

feat: Add timeout handling to CAPTCHA solver API calls#7
Cordycepsers wants to merge 1 commit intomainfrom
dev

Conversation

@Cordycepsers
Copy link
Copy Markdown
Owner

@Cordycepsers Cordycepsers commented Sep 22, 2025

🎯 Summary

This PR adds comprehensive timeout handling to the CAPTCHA solver API calls to prevent infinite hangs when CAPTCHA services are unresponsive.

🔧 Changes Made

Configuration Updates

  • Added CAPTCHA_INITIAL_REQUEST_TIMEOUT = 30 seconds for submission requests
  • Added CAPTCHA_POLLING_REQUEST_TIMEOUT = 10 seconds for polling requests

Exception Handling

  • Created CaptchaSolverError custom exception class
  • Added timeout handling to all 4 problematic API calls:
    • 2captcha submission (POST)
    • 2captcha polling (GET)
    • Anticaptcha submission (POST)
    • Anticaptcha polling (POST)

Error Propagation

  • Updated solve_recaptcha() method to properly propagate timeout errors
  • Enhanced all method docstrings to document timeout behavior
  • Updated request handler to catch and handle CaptchaSolverError

Testing

  • Added comprehensive unit tests (10 test cases)
  • Tests cover timeout scenarios, network errors, and successful flows
  • All tests pass ✅

🚨 Impact

  • Prevents infinite hangs when CAPTCHA services are unresponsive
  • Improves reliability by failing fast instead of blocking indefinitely
  • Better error handling with specific timeout error messages
  • Maintains backward compatibility while adding robust error handling

📋 Testing

pytest tests/test_captcha_solver.py -v

🔍 Code Quality

  • Formatted with Black
  • Passed flake8 linting
  • Added comprehensive docstrings

🎯 Fixes

Resolves the issue where CAPTCHA solver API calls could hang indefinitely without timeout parameters.

Summary by Sourcery

Add robust timeout and error handling to CAPTCHA solver integration by introducing configurable timeouts, a custom CaptchaSolverError, and updating the request handler to handle solver failures. Refactor test suites to include dedicated timeout tests for CAPTCHA solver and streamline analysis module tests.

New Features:

  • Add configuration options for CAPTCHA submission and polling timeouts

Bug Fixes:

  • Prevent infinite hangs by adding timeout handling to CAPTCHA solver API calls

Enhancements:

  • Introduce CaptchaSolverError for unified error handling and propagate solver exceptions in solve_recaptcha
  • Wrap 2captcha and Anticaptcha submission and polling requests with configurable timeouts and network error catching
  • Update request handler to catch CaptchaSolverError and continue gracefully if CAPTCHA solving fails

Documentation:

  • Enhance method docstrings to document timeout behavior for CAPTCHA solver

Tests:

  • Add unit tests for CAPTCHA solver covering timeout scenarios, network errors, and successful flows
  • Refactor analysis module tests into a simplified asyncio-based runner

- Add timeout constants to anti-scraping config (30s for submission, 10s for polling)
- Add CaptchaSolverError exception for timeout and network error handling
- Update all requests.post() and requests.get() calls in captcha_solver.py with timeouts
- Improve error propagation in solve_recaptcha() method
- Update request_handler.py to properly handle CaptchaSolverError
- Add comprehensive unit tests for timeout behavior
- Update method docstrings to document timeout behavior

This prevents infinite hangs when CAPTCHA services are unresponsive and provides
better error handling for timeout scenarios.
Copilot AI review requested due to automatic review settings September 22, 2025 13:11
@vercel
Copy link
Copy Markdown

vercel bot commented Sep 22, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
proposal-master Ready Ready Preview Comment Sep 22, 2025 1:11pm

@sourcery-ai
Copy link
Copy Markdown

sourcery-ai bot commented Sep 22, 2025

Reviewer's Guide

This PR integrates configurable timeouts into all CAPTCHA solver API interactions, centralizes error handling via a custom exception, propagates errors cleanly through the solver and request handler, and adds focused unit tests to verify timeout and network‐failure scenarios.

File-Level Changes

Change Details Files
Add timeout configuration parameters
  • Defined CAPTCHA_INITIAL_REQUEST_TIMEOUT for submission
  • Defined CAPTCHA_POLLING_REQUEST_TIMEOUT for polling
src/anti_scraping/config.py
Introduce CaptchaSolverError and document timeout behavior
  • Created CaptchaSolverError exception class
  • Extended solve_recaptcha docstring to describe timeout and error cases
src/anti_scraping/captcha_solver.py
Implement timeout handling in 2captcha integration
  • Added timeout to initial POST submission request
  • Added timeout to polling GET request within loop
  • Catch requests.Timeout and RequestException to raise CaptchaSolverError
src/anti_scraping/captcha_solver.py
Implement timeout handling in Anticaptcha integration
  • Added timeout to createTask POST request
  • Added timeout to getTaskResult POST polling
  • Catch timeout and network errors to raise CaptchaSolverError
src/anti_scraping/captcha_solver.py
Propagate and handle solver errors in request workflow
  • solve_recaptcha now re‐raises CaptchaSolverError and ValueError
  • Imported CaptchaSolverError in request_handler
  • Wrapped CAPTCHA solve attempt to catch solver errors and log them without blocking
src/anti_scraping/captcha_solver.py
src/anti_scraping/request_handler.py
Add unit tests for timeout and error scenarios
  • New tests for submission and polling timeouts on both providers
  • Tests verifying correct timeout arguments on requests
  • Tests for network errors and successful flows
  • Test propagation of timeout errors through solve_recaptcha
tests/test_captcha_solver.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds comprehensive timeout handling to the CAPTCHA solver API calls to prevent infinite hangs when CAPTCHA services are unresponsive. The implementation introduces a custom CaptchaSolverError exception class, configurable timeout values for different types of requests, and proper error propagation throughout the system.

Key Changes:

  • Added timeout parameters to all 4 CAPTCHA API calls (2captcha and Anticaptcha submission/polling)
  • Created CaptchaSolverError custom exception for timeout and network error handling
  • Enhanced error propagation in the request handler to catch and handle CAPTCHA solver timeouts

Reviewed Changes

Copilot reviewed 6 out of 14 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/test_captcha_solver.py Comprehensive test suite with 10 test cases covering timeout scenarios, network errors, and successful flows
tests/test_analysis_module.py Replaced complex test file with simplified functional tests (file reorganization)
test_analysis_module.py Removed duplicate test file to consolidate testing
src/anti_scraping/request_handler.py Updated to import and handle CaptchaSolverError with proper exception catching
src/anti_scraping/config.py Added timeout configuration constants for CAPTCHA requests
src/anti_scraping/captcha_solver.py Core implementation with timeout handling, custom exception, and enhanced error propagation

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines 92 to 110
while True:
poll_response = requests.get(poll_url, params=poll_payload)
poll_result = poll_response.json()

if poll_result.get('status') == 1:
return poll_result.get('request')

try:
poll_response = requests.get(
poll_url,
params=poll_payload,
timeout=config.CAPTCHA_POLLING_REQUEST_TIMEOUT,
)
poll_result = poll_response.json()
except requests.Timeout:
raise CaptchaSolverError("CAPTCHA polling request timed out")
except requests.RequestException as e:
raise CaptchaSolverError(
f"CAPTCHA polling request failed: {str(e)}"
)

if poll_result.get("status") == 1:
return poll_result.get("request")

time.sleep(5) # Wait 5 seconds before polling again
Copy link

Copilot AI Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The infinite polling loop lacks a maximum retry limit or total timeout, which could cause the application to poll indefinitely if the CAPTCHA service never returns a ready status. Consider adding a maximum polling duration or retry count to prevent infinite loops.

Copilot uses AI. Check for mistakes.
Comment on lines 149 to 167
while True:
poll_response = requests.post(poll_url, json=poll_payload)
poll_result = poll_response.json()

if poll_result.get('status') == 'ready':
return poll_result.get('solution', {}).get('gRecaptchaResponse')

try:
poll_response = requests.post(
poll_url,
json=poll_payload,
timeout=config.CAPTCHA_POLLING_REQUEST_TIMEOUT,
)
poll_result = poll_response.json()
except requests.Timeout:
raise CaptchaSolverError("CAPTCHA polling request timed out")
except requests.RequestException as e:
raise CaptchaSolverError(
f"CAPTCHA polling request failed: {str(e)}"
)

if poll_result.get("status") == "ready":
return poll_result.get("solution", {}).get("gRecaptchaResponse")

time.sleep(5) # Wait 5 seconds before polling again
Copy link

Copilot AI Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The infinite polling loop lacks a maximum retry limit or total timeout, which could cause the application to poll indefinitely if the CAPTCHA service never returns a ready status. Consider adding a maximum polling duration or retry count to prevent infinite loops.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected a request using 'http://'. This request will be unencrypted, and attackers could listen into traffic on the network and be able to obtain sensitive information. Use 'https://' instead. (link)
  • Detected a request using 'http://'. This request will be unencrypted, and attackers could listen into traffic on the network and be able to obtain sensitive information. Use 'https://' instead. (link)

General comments:

  • Introduce a cap or overall timeout in the while-True polling loops in both 2captcha and Anticaptcha methods so you don’t risk infinite polling if the service never responds.
  • Replace the ad-hoc print statements in the CAPTCHA request handler and test runner with a proper logging framework to control verbosity and integrate with existing logs.
  • Consider refactoring the new monolithic test_analysis_module.py into discrete pytest-style functions with fixtures instead of a custom async main, to improve readability, parallelism, and maintainability.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Introduce a cap or overall timeout in the while-True polling loops in both 2captcha and Anticaptcha methods so you don’t risk infinite polling if the service never responds.
- Replace the ad-hoc print statements in the CAPTCHA request handler and test runner with a proper logging framework to control verbosity and integrate with existing logs.
- Consider refactoring the new monolithic `test_analysis_module.py` into discrete pytest-style functions with fixtures instead of a custom async main, to improve readability, parallelism, and maintainability.

## Individual Comments

### Comment 1
<location> `src/anti_scraping/captcha_solver.py:38-47` </location>
<code_context>
-            captcha_id = result.get('request')
-            
+
+        try:
+            response = requests.post(
+                url, data=payload, timeout=config.CAPTCHA_INITIAL_REQUEST_TIMEOUT
+            )
+            result = response.json()
+        except requests.Timeout:
+            raise CaptchaSolverError("CAPTCHA submission request timed out")
+        except requests.RequestException as e:
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Consider limiting polling attempts to avoid infinite loops.

Both polling loops lack a termination condition if the CAPTCHA is never solved. Adding a maximum attempt limit or timeout will prevent hangs and excessive resource use.
</issue_to_address>

### Comment 2
<location> `src/anti_scraping/request_handler.py:81` </location>
<code_context>
+                    if captcha_solution:
+                        # Retry with CAPTCHA solution
+                        return self._retry_with_captcha_solution(url, captcha_solution)
+                except CaptchaSolverError as e:
+                    print(f"CAPTCHA solving failed: {str(e)}")
+                    # Continue with original response if CAPTCHA solving fails
</code_context>

<issue_to_address>
**question (bug_risk):** Continuing with the original response after CAPTCHA solving failure may not be ideal.

Returning the original response after CAPTCHA failure may cause downstream code to process a CAPTCHA page. Consider raising an exception or returning an explicit error to prevent this.
</issue_to_address>

### Comment 3
<location> `src/anti_scraping/request_handler.py:112-117` </location>
<code_context>
         return False
-    
+
     def _attempt_captcha_solve(self, page_url: str) -> Optional[str]:
         """Attempt to solve CAPTCHA"""
-        # This is a simplified version - real implementation would extract 
</code_context>

<issue_to_address>
**suggestion:** The placeholder implementation for CAPTCHA solving may lead to confusion.

Consider raising NotImplementedError or clearly marking this method as a stub to avoid accidental use in production environments.

```suggestion
    def _attempt_captcha_solve(self, page_url: str) -> Optional[str]:
        """Attempt to solve CAPTCHA

        This method is a stub and must be implemented for production use.
        """
        raise NotImplementedError("CAPTCHA solving is not implemented. This method is a stub.")
```
</issue_to_address>

### Comment 4
<location> `src/anti_scraping/config.py:32-33` </location>
<code_context>
-    
+    CAPTCHA_SOLVER_API_KEY: Optional[str] = os.getenv("CAPTCHA_API_KEY", "")
+    CAPTCHA_SOLVER_PROVIDER: str = "2captcha"  # or 'anticaptcha'
+    CAPTCHA_INITIAL_REQUEST_TIMEOUT: int = 30  # seconds for submission requests
+    CAPTCHA_POLLING_REQUEST_TIMEOUT: int = 10  # seconds for polling requests
+
     # Randomization settings
</code_context>

<issue_to_address>
**suggestion:** Timeout values for CAPTCHA requests are hardcoded and may not suit all environments.

Recommend making timeout values configurable through environment variables or constructor parameters to accommodate varying deployment needs.

```suggestion
    CAPTCHA_INITIAL_REQUEST_TIMEOUT: int = int(os.getenv("CAPTCHA_INITIAL_REQUEST_TIMEOUT", "30"))  # seconds for submission requests
    CAPTCHA_POLLING_REQUEST_TIMEOUT: int = int(os.getenv("CAPTCHA_POLLING_REQUEST_TIMEOUT", "10"))  # seconds for polling requests
```
</issue_to_address>

### Comment 5
<location> `tests/test_captcha_solver.py:45` </location>
<code_context>
+        assert call_args[1]["timeout"] == 30
+
+    @patch("src.anti_scraping.captcha_solver.requests.post")
+    def test_2captcha_submission_network_error(self, mock_post, captcha_solver):
+        """Test that 2captcha submission handles network errors properly"""
+        # Mock network error
</code_context>

<issue_to_address>
**suggestion (testing):** No test for provider value case sensitivity.

Please add a test to confirm provider values are treated case-insensitively, as the implementation lowercases the input. This will help catch configuration issues related to case mismatches.
</issue_to_address>

### Comment 6
<location> `src/anti_scraping/captcha_solver.py:72` </location>
<code_context>
                url, data=payload, timeout=config.CAPTCHA_INITIAL_REQUEST_TIMEOUT
</code_context>

<issue_to_address>
**security (python.lang.security.audit.insecure-transport.requests.request-with-http):** Detected a request using 'http://'. This request will be unencrypted, and attackers could listen into traffic on the network and be able to obtain sensitive information. Use 'https://' instead.

*Source: opengrep*
</issue_to_address>

### Comment 7
<location> `src/anti_scraping/captcha_solver.py:95` </location>
<code_context>
                        poll_url,
</code_context>

<issue_to_address>
**security (python.lang.security.audit.insecure-transport.requests.request-with-http):** Detected a request using 'http://'. This request will be unencrypted, and attackers could listen into traffic on the network and be able to obtain sensitive information. Use 'https://' instead.

*Source: opengrep*
</issue_to_address>

### Comment 8
<location> `tests/test_captcha_solver.py:199-202` </location>
<code_context>

</code_context>

<issue_to_address>
**issue (code-quality):** Avoid loops in tests. ([`no-loop-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-loop-in-tests))

<details><summary>Explanation</summary>Avoid complex code, like loops, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals

Some ways to fix this:

* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.

> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>

### Comment 9
<location> `src/anti_scraping/captcha_solver.py:70` </location>
<code_context>
    def _solve_with_2captcha(self, site_key: str, page_url: str) -> Optional[str]:
        """Solve CAPTCHA using 2captcha API

        Uses configurable timeouts for submission and polling requests.

        Raises:
            CaptchaSolverError: If timeout occurs or network errors happen
        """
        url = "http://2captcha.com/in.php"

        payload = {
            "key": self.api_key,
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url,
            "json": 1,
        }

        try:
            response = requests.post(
                url, data=payload, timeout=config.CAPTCHA_INITIAL_REQUEST_TIMEOUT
            )
            result = response.json()
        except requests.Timeout:
            raise CaptchaSolverError("CAPTCHA submission request timed out")
        except requests.RequestException as e:
            raise CaptchaSolverError(f"CAPTCHA submission request failed: {str(e)}")

        if result.get("status") == 1:
            captcha_id = result.get("request")

            # Poll for solution
            poll_url = "http://2captcha.com/res.php"
            poll_payload = {
                "key": self.api_key,
                "action": "get",
                "id": captcha_id,
                "json": 1,
            }

            while True:
                try:
                    poll_response = requests.get(
                        poll_url,
                        params=poll_payload,
                        timeout=config.CAPTCHA_POLLING_REQUEST_TIMEOUT,
                    )
                    poll_result = poll_response.json()
                except requests.Timeout:
                    raise CaptchaSolverError("CAPTCHA polling request timed out")
                except requests.RequestException as e:
                    raise CaptchaSolverError(
                        f"CAPTCHA polling request failed: {str(e)}"
                    )

                if poll_result.get("status") == 1:
                    return poll_result.get("request")

                time.sleep(5)  # Wait 5 seconds before polling again
        return None

</code_context>

<issue_to_address>
**issue (code-quality):** Explicitly raise from a previous error [×4] ([`raise-from-previous-error`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/raise-from-previous-error/))
</issue_to_address>

### Comment 10
<location> `src/anti_scraping/captcha_solver.py:132` </location>
<code_context>
    def _solve_with_anticaptcha(self, site_key: str, page_url: str) -> Optional[str]:
        """Solve CAPTCHA using Anticaptcha API

        Uses configurable timeouts for submission and polling requests.

        Raises:
            CaptchaSolverError: If timeout occurs or network errors happen
        """
        url = "https://api.anti-captcha.com/createTask"

        payload = {
            "clientKey": self.api_key,
            "task": {
                "type": "RecaptchaV2TaskProxyless",
                "websiteURL": page_url,
                "websiteKey": site_key,
            },
        }

        try:
            response = requests.post(
                url, json=payload, timeout=config.CAPTCHA_INITIAL_REQUEST_TIMEOUT
            )
            result = response.json()
        except requests.Timeout:
            raise CaptchaSolverError("CAPTCHA submission request timed out")
        except requests.RequestException as e:
            raise CaptchaSolverError(f"CAPTCHA submission request failed: {str(e)}")

        if result.get("errorId") == 0:
            task_id = result.get("taskId")

            # Poll for solution
            poll_url = "https://api.anti-captcha.com/getTaskResult"
            poll_payload = {"clientKey": self.api_key, "taskId": task_id}

            while True:
                try:
                    poll_response = requests.post(
                        poll_url,
                        json=poll_payload,
                        timeout=config.CAPTCHA_POLLING_REQUEST_TIMEOUT,
                    )
                    poll_result = poll_response.json()
                except requests.Timeout:
                    raise CaptchaSolverError("CAPTCHA polling request timed out")
                except requests.RequestException as e:
                    raise CaptchaSolverError(
                        f"CAPTCHA polling request failed: {str(e)}"
                    )

                if poll_result.get("status") == "ready":
                    return poll_result.get("solution", {}).get("gRecaptchaResponse")

                time.sleep(5)  # Wait 5 seconds before polling again
        return None

</code_context>

<issue_to_address>
**issue (code-quality):** Explicitly raise from a previous error [×4] ([`raise-from-previous-error`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/raise-from-previous-error/))
</issue_to_address>

### Comment 11
<location> `src/anti_scraping/request_handler.py:54` </location>
<code_context>
    def make_request(
        self,
        url: str,
        method: str = "GET",
        headers: Optional[Dict[str, str]] = None,
        **kwargs,
    ) -> Optional[requests.Response]:
        """
        Make HTTP request with anti-scraping measures
        """
        print("--- Entering make_request ---")
        domain = urlparse(url).netloc
        # Wait if rate limit is exceeded
        self.rate_limiter.wait_if_needed(domain)

        # Get random user-agent
        user_agent = self.user_agent_manager.get_random_user_agent()

        # Get random proxy
        proxy = self.proxy_manager.get_random_proxy()

        # Set default headers
        default_headers = {
            "User-Agent": user_agent,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        }

        if headers:
            default_headers.update(headers)

        # Add random delay before request
        delay = random.uniform(*self.rate_limiter.random_delay_range)
        time.sleep(delay)

        # Make the request
        try:
            print("--- About to call requests.request ---")
            response = requests.request(
                method=method,
                url=url,
                headers=default_headers,
                proxies=proxy,
                timeout=30,
                **kwargs,
            )
            print("--- requests.request was called ---")

            # Check if CAPTCHA was encountered
            if self._is_captcha_encountered(response):
                print("CAPTCHA detected, attempting to solve...")
                try:
                    captcha_solution = self._attempt_captcha_solve(url)
                    if captcha_solution:
                        # Retry with CAPTCHA solution
                        return self._retry_with_captcha_solution(url, captcha_solution)
                except CaptchaSolverError as e:
                    print(f"CAPTCHA solving failed: {str(e)}")
                    # Continue with original response if CAPTCHA solving fails

            # Record the request
            self.rate_limiter.record_request(domain)

            return response

        except requests.exceptions.RequestException as e:
            print(f"Request failed: {str(e)}")
            if proxy:
                self.proxy_manager.mark_proxy_as_failed(proxy)
            raise

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Merge dictionary updates via the union operator ([`dict-assign-update-to-union`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/dict-assign-update-to-union/))
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines 38 to +47
try:
if self.provider == '2captcha':
if self.provider == "2captcha":
return self._solve_with_2captcha(site_key, page_url)
elif self.provider == 'anticaptcha':
elif self.provider == "anticaptcha":
return self._solve_with_anticaptcha(site_key, page_url)
else:
raise ValueError(f"Unsupported CAPTCHA provider: {self.provider}")
except (CaptchaSolverError, ValueError):
# Re-raise these exceptions as they indicate configuration or timeout issues
raise
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Consider limiting polling attempts to avoid infinite loops.

Both polling loops lack a termination condition if the CAPTCHA is never solved. Adding a maximum attempt limit or timeout will prevent hangs and excessive resource use.

if captcha_solution:
# Retry with CAPTCHA solution
return self._retry_with_captcha_solution(url, captcha_solution)
except CaptchaSolverError as e:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (bug_risk): Continuing with the original response after CAPTCHA solving failure may not be ideal.

Returning the original response after CAPTCHA failure may cause downstream code to process a CAPTCHA page. Consider raising an exception or returning an explicit error to prevent this.

Comment on lines 112 to 117
def _attempt_captcha_solve(self, page_url: str) -> Optional[str]:
"""Attempt to solve CAPTCHA"""
# This is a simplified version - real implementation would extract
# This is a simplified version - real implementation would extract
# site key from the page and pass it to solver
print("CAPTCHA solving not implemented in this example")
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: The placeholder implementation for CAPTCHA solving may lead to confusion.

Consider raising NotImplementedError or clearly marking this method as a stub to avoid accidental use in production environments.

Suggested change
def _attempt_captcha_solve(self, page_url: str) -> Optional[str]:
"""Attempt to solve CAPTCHA"""
# This is a simplified version - real implementation would extract
# This is a simplified version - real implementation would extract
# site key from the page and pass it to solver
print("CAPTCHA solving not implemented in this example")
return None
def _attempt_captcha_solve(self, page_url: str) -> Optional[str]:
"""Attempt to solve CAPTCHA
This method is a stub and must be implemented for production use.
"""
raise NotImplementedError("CAPTCHA solving is not implemented. This method is a stub.")

Comment on lines +32 to +33
CAPTCHA_INITIAL_REQUEST_TIMEOUT: int = 30 # seconds for submission requests
CAPTCHA_POLLING_REQUEST_TIMEOUT: int = 10 # seconds for polling requests
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Timeout values for CAPTCHA requests are hardcoded and may not suit all environments.

Recommend making timeout values configurable through environment variables or constructor parameters to accommodate varying deployment needs.

Suggested change
CAPTCHA_INITIAL_REQUEST_TIMEOUT: int = 30 # seconds for submission requests
CAPTCHA_POLLING_REQUEST_TIMEOUT: int = 10 # seconds for polling requests
CAPTCHA_INITIAL_REQUEST_TIMEOUT: int = int(os.getenv("CAPTCHA_INITIAL_REQUEST_TIMEOUT", "30")) # seconds for submission requests
CAPTCHA_POLLING_REQUEST_TIMEOUT: int = int(os.getenv("CAPTCHA_POLLING_REQUEST_TIMEOUT", "10")) # seconds for polling requests

assert call_args[1]["timeout"] == 30

@patch("src.anti_scraping.captcha_solver.requests.post")
def test_2captcha_submission_network_error(self, mock_post, captcha_solver):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): No test for provider value case sensitivity.

Please add a test to confirm provider values are treated case-insensitively, as the implementation lowercases the input. This will help catch configuration issues related to case mismatches.


try:
poll_response = requests.get(
poll_url,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security (python.lang.security.audit.insecure-transport.requests.request-with-http): Detected a request using 'http://'. This request will be unencrypted, and attackers could listen into traffic on the network and be able to obtain sensitive information. Use 'https://' instead.

Source: opengrep

Comment on lines +199 to +202
for call_args in mock_post.call_args_list:
# Submission uses 30s timeout, polling uses 10s timeout
timeout = call_args[1]["timeout"]
assert timeout in [30, 10]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

ExplanationAvoid complex code, like loops, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

captcha_id = result.get('request')


try:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Explicitly raise from a previous error [×4] (raise-from-previous-error)

task_id = result.get('taskId')


try:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Explicitly raise from a previous error [×4] (raise-from-previous-error)


if headers:
default_headers.update(headers)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants