JDS refactor #299

Open

plebhash wants to merge 17 commits into stratum-mining:main from plebhash:2026-02-03-jd-server-sv2

Conversation

@plebhash plebhash commented Feb 26, 2026

close #24

  • jd_server_sv2
  • bitcoin_core_sv2
  • pool_sv2 leveraging jd_server_sv2
  • integration tests

@plebhash plebhash force-pushed the 2026-02-03-jd-server-sv2 branch 5 times, most recently from e7d7bdd to 40f2a41 Compare February 27, 2026 14:31
@plebhash plebhash force-pushed the 2026-02-03-jd-server-sv2 branch 3 times, most recently from 06d8930 to 1d84c3d Compare March 6, 2026 18:19
/// Represents a downstream client connected to this node.
#[derive(Clone)]
pub struct Downstream {
pub downstream_data: Arc<Mutex<DownstreamData>>,
Collaborator

We can move DownstreamData fields to Downstream struct here.

Member Author

@plebhash plebhash Mar 8, 2026

please note that this is simply following the same pattern as ChannelManager

I'd be happy to follow the suggestion above, but if DownstreamData is superfluous here, why isn't it on ChannelManager?

the immediate knee-jerk answer is probably the number of fields (few here vs multiple on ChannelManager), under the rationale that there aren't enough fields to justify a separate struct

but this would fall into the pattern fragmentation issue I've been flagging, which usually introduces unnecessary friction to reason about the code


tbh I ultimately don't really care whether we:

  • keep DownstreamData as-is
  • move DownstreamData fields to Downstream struct

if I really had to pick one, I'd keep it as-is, for the reasons described above

but also no energy to dive into a long debate about this, so if we're opinionated and want to proceed with this debt of extra fragmentation, I'm not going to fight it

cc @GitGab19

Collaborator

We are changing this in dashmap PR for Pool. But I won't block your PR on this. Feel free to resolve.

Member Author

ok if we're already aware of this, moving DownstreamData fields to Downstream struct
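For illustration, the change being agreed on here can be sketched like this (field names are hypothetical, not the actual jd_server_sv2 definitions): the fields previously held behind DownstreamData move directly onto Downstream, removing one layer of indirection.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical field names, for illustration only.

// Before: fields live behind a separate DownstreamData struct.
pub struct DownstreamData {
    pub channel_id: u32,
    pub flags: u32,
}

#[derive(Clone)]
pub struct DownstreamNested {
    pub downstream_data: Arc<Mutex<DownstreamData>>,
}

// After: the same fields inlined on Downstream; only the fields that
// actually need shared mutation keep a lock around them.
#[derive(Clone)]
pub struct DownstreamFlat {
    pub channel_id: u32,
    pub flags: Arc<Mutex<u32>>,
}
```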

Member Author

@Shourya742 can you do a sanity check on the adaptations I introduced here?

Member Author

renaming pool-config.toml.template -> pool-jds-config.toml.template

and making embedded-jds the default

not sure if it's the right move, @lucasbalieiro @GitGab19 please sanity check?

Collaborator

this is going to break other things in the build process.

We'll also probably need some tricks to run the apps with/without JDP, as we currently do with --profile.

If it gets too complex to explain, I'll post a commit so you can cherry-pick it.

I'm still checking the fallout

Member Author

yeah, please share the commit

# Protocol Extensions Configuration
# Extensions that the pool supports (will accept if requested by clients)
# Comment/uncomment to enable/disable specific extensions:
supported_extensions = [
Member Author

not sure if this is needed

Member Author

confirmed this is not needed

@plebhash plebhash force-pushed the 2026-02-03-jd-server-sv2 branch from e8a35c3 to cbfa504 Compare March 8, 2026 22:19
plebhash commented Mar 8, 2026

@lucasbalieiro can you please do a sanity check on cbfa504?

plebhash commented Mar 8, 2026

so far I haven't added any new integration tests... might add some before we merge this (open to suggestions)

but marking it as ready for review... hopefully CI should be all green now

@plebhash plebhash marked this pull request as ready for review March 8, 2026 22:20
@plebhash plebhash force-pushed the 2026-02-03-jd-server-sv2 branch from cbfa504 to afae806 Compare March 10, 2026 14:53
@plebhash plebhash force-pushed the 2026-02-03-jd-server-sv2 branch 4 times, most recently from 2dadb44 to ea5cc2b Compare March 10, 2026 15:22
plebhash commented Mar 11, 2026

the issue @lucasbalieiro reported on today's dev call is a potential race-condition bug in libmultiprocess

In this PR, pool_sv2 now spawns both BitcoinCoreSv2TDP and BitcoinCoreSv2JDP in the same process (two independent IPC client lifecycles). This pattern seems relevant to triggering this bug.

I created this gist and asked @Sjors to reproduce, triage and sanity-check before we report to libmultiprocess repo

it's non-deterministic and more easily reproducible via docker than cargo run

moving forward, we have a few options:

  • introduce aggressive refactors on bitcoin_core_sv2 (plus jd_server_sv2, pool_sv2, translator_sv2 and integration_tests_sv2 as a consequence) with a workaround that makes sure this bug is not triggered
  • harden Docker lifecycle/restart behavior to reduce bug likelihood in normal runs (partial mitigation, not root fix)

personally speaking, I'd lean towards the second option

whenever it gets fixed (either on v31 or some backport v30.3 or v31.1), hopefully we won't have to worry about it anymore

@GitGab19
Member

I also think we could proceed with the second option, even though it's still not clear to me how we can harden it on docker side (cc @lucasbalieiro).

Sjors commented Mar 11, 2026

The gist refers to the 30.x Bitcoin Core branch, which has known crash issues (when the client disconnects mid-wait). The fixes are in master, and I'm not sure if they will be back-ported or if it's better to wait for v31.

Maybe use master on CI until a 31.x branch exists?

plebhash commented Mar 11, 2026

thanks for the input @Sjors

> The gist refers to the 30.x Bitcoin Core branch, which has known crash issues (when the client disconnects mid wait). The fixes are in master, and I'm not sure if they will be back-ported or if it's better to wait for v31.

v31 is introducing breaking changes on capnp interfaces, which are set to be implemented as a follow-up via #318

> Maybe use master on CI until a 31.x branch exists?

this was detected on manual testing by @lucasbalieiro , not on CI

also, using master would require us to bring the scope of #318 here


re @GitGab19 @lucasbalieiro

> even though it's still not clear to me how we can harden it on docker side

during this investigation I asked gpt-5.3-codex for suggestions in this direction and it said it would be feasible

here are the steps it's suggesting, for which a sanity check would be appreciated

  • keep direct socket-file mount (BITCOIN_SOCKET_PATH -> /root/.bitcoin/node.sock)
    (on Docker Desktop/macOS, data-dir mount + socket lookup can be unreliable)
  • add startup gate: wait for node.sock ([ -S ... ]) with timeout before starting pool_sv2 / jd_client_sv2
  • use graceful container stop (stop_signal: SIGINT + stop_grace_period)
  • document safer ops (--remove-orphans, avoid forced docker kill/restart during normal runs)

this won’t fix the root race in 30.x, but it should reduce accidental reproduction in normal Docker UX until v31-compatible follow-up work lands.
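The "startup gate" step above can be sketched as a small polling helper; a real Docker entrypoint would additionally check the file is a socket ([ -S ... ]), while this sketch only checks existence:

```rust
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

/// Poll until `path` shows up or `timeout` elapses. Sketch of the gate
/// that would block pool_sv2 / jd_client_sv2 startup until the node's
/// IPC socket (e.g. BITCOIN_SOCKET_PATH) is available.
pub fn wait_for_path(path: &Path, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if path.exists() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(Duration::from_millis(100));
    }
}
```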


@lucasbalieiro can you prepare a commit for me to cherry-pick here?

either to replace ea5cc2b, or to be added on top of it... let me know which approach you think is better

Sjors commented Mar 11, 2026

Under the old code I think you can avoid crashes by calling interrupt on waitNext() before disconnecting.

@plebhash
Member Author

> Under the old code I think you can avoid crashes by calling interrupt on waitNext() before disconnecting.

we already do interrupt_wait_request() in our monitor cancellation paths (JDP + TDP), so we’re following the interrupt-before-disconnect pattern in normal graceful shutdown.

like I mentioned above, we can try to mitigate it at a docker level (depending on @lucasbalieiro analysis).

but I’m hesitant to add extra lower-level interrupt wiring for bitcoin_core_sv2 in this PR: it would add churn/debt, and it likely only gives partial mitigation on 30.x (abrupt teardown cases can still bypass cleanup).
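For reference, the interrupt-before-disconnect ordering we follow in the graceful paths can be sketched like this (the names mirror the discussion but are stand-ins, not the real bitcoin_core_sv2 API):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stand-in for a monitor holding a blocking waitNext()-style call.
// `interrupt_wait_request` and `disconnect` are illustrative names.
pub struct MonitorHandle {
    interrupted: Arc<AtomicBool>,
}

impl MonitorHandle {
    pub fn new() -> Self {
        Self { interrupted: Arc::new(AtomicBool::new(false)) }
    }

    /// Step 1: wake the blocked wait so the IPC thread can unwind cleanly.
    pub fn interrupt_wait_request(&self) {
        self.interrupted.store(true, Ordering::SeqCst);
    }

    /// Step 2: tear down the connection only after the wait was
    /// interrupted; disconnecting while a wait is still blocked is the
    /// 30.x crash path discussed above.
    pub fn disconnect(self) -> Result<(), &'static str> {
        if self.interrupted.load(Ordering::SeqCst) {
            Ok(())
        } else {
            Err("disconnect while wait still pending")
        }
    }
}
```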

@plebhash
Member Author

Adding integration tests:

  • jds_reject_setup_connection_with_non_job_declaration_protocol
    Verifies JDS rejects SetupConnection when protocol is not JDP, returning SetupConnectionError("unsupported-protocol").

  • jds_reject_setup_connection_without_declare_tx_data_flag
    Verifies JDS rejects JDP SetupConnection when DECLARE_TX_DATA is missing, returning SetupConnectionError("missing-declare-tx-data-flag").

  • jds_reject_declare_mining_job_with_invalid_mining_job_token
    Verifies JDS rejects DeclareMiningJob for both malformed and unallocated tokens, returning DeclareMiningJobError("invalid-mining-job-token").

  • pool_rejects_reused_set_custom_mining_job_token
    Verifies pool/JDS token lifecycle enforcement by replaying a consumed SetCustomMiningJob token and asserting SetCustomMiningJobError("invalid-mining-job-token").

  • pool_without_jds_rejects_set_custom_mining_job
    Verifies pool started without embedded JDS (start_pool, no [jds]) rejects custom jobs with SetCustomMiningJobError("jd-not-supported").
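The token-lifecycle rule the last two tests exercise can be reduced to single-use token bookkeeping; this is a simplification of whatever state the pool/JDS actually keep, only capturing the rule the tests assert:

```rust
use std::collections::HashSet;

// Simplified single-use token store, for illustration only.
pub struct TokenStore {
    allocated: HashSet<u64>,
}

impl TokenStore {
    pub fn new() -> Self {
        Self { allocated: HashSet::new() }
    }

    /// AllocateMiningJobToken.Success hands out a fresh token.
    pub fn allocate(&mut self, token: u64) {
        self.allocated.insert(token);
    }

    /// Consuming a token removes it: a replayed or never-allocated token
    /// yields the "invalid-mining-job-token" error code the tests expect.
    pub fn consume(&mut self, token: u64) -> Result<(), &'static str> {
        if self.allocated.remove(&token) {
            Ok(())
        } else {
            Err("invalid-mining-job-token")
        }
    }
}
```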

plebhash added 16 commits March 13, 2026 11:42
Changed BitcoinCoreSv2TDP::handler_* methods from pub to pub(crate) to better encapsulate internal implementation:
- handle_coinbase_output_constraints()
- handle_request_transaction_data()
- handle_submit_solution()

These handlers are only called internally by the run() loop and monitors.
External crates (pool, jd-client) communicate via channels and never call
handlers directly, so they should not be part of the public API contract.

This improves API clarity by making it explicit that the channel-based
interface is the intended way to interact with BitcoinCoreSv2TDP.
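As a minimal illustration of the visibility change (type and method body are stand-ins, not the actual implementation):

```rust
mod bitcoin_core_sv2 {
    pub struct BitcoinCoreSv2Tdp;

    impl BitcoinCoreSv2Tdp {
        // pub(crate): reachable from this crate's run() loop and
        // monitors, but no longer part of the public API surface seen
        // by external crates like pool or jd-client.
        pub(crate) fn handle_submit_solution(&self) -> &'static str {
            "handled"
        }
    }
}

// Inside the same crate the handler is still callable:
fn drive() -> &'static str {
    bitcoin_core_sv2::BitcoinCoreSv2Tdp.handle_submit_solution()
}
```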
@plebhash plebhash force-pushed the 2026-02-03-jd-server-sv2 branch from ea5cc2b to 9264728 Compare March 13, 2026 14:43
lucasbalieiro commented Mar 13, 2026

@plebhash, regarding the Docker changes:

You can cherry-pick this commit: lucasbalieiro@7f2d9db

and apply it on top of your commit f8c8907

Or merge the two. Up to you whether that’s the preferred approach.

In my commit I remove jd_server from being published as a Docker image, since it will now be embedded inside the pool application. I also update README.md and docker_env.example.

Dropped jd_server from the docker_compose.yml services.

Another change adjusts how signals are handled by the Docker engine. This solution works on macOS; it still needs testing on Linux.

Regarding the pool_and_miner_apps and pool_and_miner_apps_no_jd profiles: properly wiring the new pool parameter to dynamically enable/disable JDS would require additional work and experimentation. My workaround is simpler: the pool always starts with JDS enabled, regardless of profile. When the profile is pool_and_miner_apps_no_jd, the only difference is that jd_client_sv2 is not started.

nerding out about the signal handling issue:

When running pool_sv2 with JDS enabled and sending CTRL-C to Docker, Docker sends a SIGTERM to the container and allows a 10-second grace period for shutdown; if that period is exceeded, it sends a SIGKILL.

However, our applications expect SIGINT (the signal CTRL-C sends for a keyboard interrupt) to perform graceful shutdown. Because of that, the container was not triggering the application's shutdown logic, which then broke the node. I changed the container's stop signal accordingly.

The solution is based on this option in the Docker Compose docs:
docker_docs#stop_signal
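For reference, the compose-side shape of that change (service name illustrative, values as described above):

```yaml
services:
  pool_sv2:
    # Our apps shut down gracefully on SIGINT (CTRL-C), not SIGTERM,
    # so have Docker send SIGINT on stop.
    stop_signal: SIGINT
    # Time allowed for graceful shutdown before Docker escalates to SIGKILL.
    stop_grace_period: 10s
```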

Other approaches I tried but rejected:

plebhash commented Mar 13, 2026

thanks @lucasbalieiro

just to clarify the crash path in Bitcoin Core (rephrasing your point):

our graceful path is triggered when the app receives Ctrl+C (SIGINT), which cancels tasks and calls interruptWait via CancellationToken.

if the process is terminated abruptly (SIGTERM without a handler path, or especially SIGKILL), that graceful path can be bypassed.


this is not Docker-specific; similar behavior can happen with cargo run or direct binaries if they are stopped abruptly (kill, pkill, forced restart, etc.).

however these are all highly unlikely scenarios in regular usage


so the goal here is mitigation, not a full fix:

we should avoid adding heavy local shutdown-workaround complexity in this PR, and instead harden Docker behavior to reduce exposure.

lucasbalieiro@7f2d9db does that by aligning container stop behavior with our graceful path (SIGINT) and simplifying the Docker topology around embedded JDS.

root fix remains upstream in Bitcoin Core/libmultiprocess (with fixes already in master; v31 coming out soon), so I’ll cherry-pick this as an interim mitigation.

@lucasbalieiro
Collaborator

started a long-running session with the changes from this PR.

The blocks will be mined to our signet and carry the pr299 signature.

lucasbalieiro commented Mar 13, 2026

After finding 1+ blocks, the JDC sends AllocateMiningJobToken and receives AllocateMiningJobToken.Success, but right after that the JDS refuses the job:

2026-03-13T20:41:00.847773Z  WARN jd_client_sv2::channel_manager::jd_message_handler: Received: DeclareMiningJobError(request_id: 6, error_code: invalid-mining-job-token, error_details: B064K())
2026-03-13T20:41:00.847789Z  WARN jd_client_sv2::channel_manager::jd_message_handler: ⚠️ JDS refused the declared job with a DeclareMiningJobError ❌. Starting fallback mechanism.
2026-03-13T20:41:00.847869Z ERROR jd_client_sv2::channel_manager: Error handling JDS message error=JDCError { kind: DeclareMiningJobError, action: Fallback, _owner: PhantomData<jd_client_sv2::error::ChannelManager> }
2026-03-13T20:41:00.847908Z DEBUG jd_client_sv2::status: Sending status from ChannelManager: UpstreamShutdownFallback(DeclareMiningJobError)

This triggers the fallback mechanism in JDC, which drops all clients and flags the JDS as a malicious upstream:

2026-03-13T20:41:01.857119Z  INFO jd_client_sv2: Upstream previously marked as malicious, skipping initial attempt warnings.

Also, another problem is that the fallback mechanism does NOT wait for the jd_client downstream server port to be free, generating the error:

2026-03-13T20:41:01.860854Z ERROR jd_client_sv2::channel_manager: Failed to bind downstream server at 127.0.0.1:34265 error=Os { code: 98, kind: AddrInUse, message: "Address already in use" }

This problem happened consistently on my setup after finding 1+ blocks.
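One possible shape for the rebind issue, assuming we want the fallback to retry until the port frees up rather than fail outright (a sketch, not the jd_client_sv2 code):

```rust
use std::io;
use std::net::TcpListener;
use std::thread;
use std::time::{Duration, Instant};

/// Retry binding until the address frees up or `timeout` elapses,
/// instead of failing immediately with AddrInUse during fallback.
pub fn bind_with_retry(addr: &str, timeout: Duration) -> io::Result<TcpListener> {
    let deadline = Instant::now() + timeout;
    loop {
        match TcpListener::bind(addr) {
            Ok(listener) => return Ok(listener),
            Err(e) if e.kind() == io::ErrorKind::AddrInUse && Instant::now() < deadline => {
                // The previous listener still holds the port;
                // back off briefly and try again.
                thread::sleep(Duration::from_millis(100));
            }
            Err(e) => return Err(e),
        }
    }
}
```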

Attaching the logs from a testing session @ 69f7447:
jdc.log
pool.log
tproxy.log


Development

Successfully merging this pull request may close these issues.

Refactor: JDS

5 participants