Netbird won't reconnect to other windows peer on reboot until I ping the peer #2860

Open
Mikesco3 opened this issue Nov 7, 2024 · 10 comments

Comments

@Mikesco3

Mikesco3 commented Nov 7, 2024

Describe the problem

Windows clients can't reach peers until I ping them.
I believe the issue mainly shows up after reboots; I haven't yet monitored whether the connection also drops later during the day.

To Reproduce

Steps to reproduce the behavior:

  1. Reboot the computer
  2. When it comes back on, I can't reconnect to the Windows peer even if I wait 5-15 minutes
  3. When I ping it, there is a 3 - 5 second delay but then the peer replies to pings and I can access it normally after that.
  4. If either one reboots, we're back to the same issue.

Are you using NetBird Cloud?

I'm using the NetBird SelfHosted control plane.

NetBird version

0.31.0

  • I have also made sure they are all running the latest version of netbird.
  • Windows 11 Pro

Expected behavior

I have a few computers in different locations and I'm using Netbird to connect them together.

I installed the Netbird client, joined it to the control plane, and did the same on other Windows PCs across the internet.

When I reboot the computer, I cannot reach the other peer across Netbird.
What seems to help is pinging the computer; after that I can reach it.
I haven't checked whether it drops again after a while.

What I would hope is that if I have another Windows computer connected from a different location via Netbird, I would be able to reconnect to it, even if I have to wait a bit until the other services come back online (at least that works with ZeroTier and other similar products).

Screenshots

image


Additional context

If I ping the peer I'm trying to access, it starts replying after a 3 to 5 second pause, and then I can connect to it fine:

ping server-cr

Slight pause (3 to 5 seconds)

Pinging server-cr.netbird.selfhosted [100.65.118.84] with 32 bytes of data:
Reply from 100.65.118.84: bytes=32 time=22ms TTL=128
Reply from 100.65.118.84: bytes=32 time=22ms TTL=128
Reply from 100.65.118.84: bytes=32 time=21ms TTL=128
Reply from 100.65.118.84: bytes=32 time=23ms TTL=128

Ping statistics for 100.65.118.84:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 21ms, Maximum = 23ms, Average = 22ms

Then we can connect to the Peer Windows PC with no issues...


Attachments

I've attached a copy of my config files:

  • My Server's docker-compose.yml
  • My Server's management.json
  • One of my Windows Netbird App config.json
  • A screenshot of the connection error

NetBird status -dA output:

I grabbed the output while the computer wasn't having the connection issues and before I pinged the peer.

> NetBird status -dA
Peers detail:
 server-nl.netbird.selfhosted:
  NetBird IP: 100.65.114.233
  Public key: MyRandomGibberishKey
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/host
  ICE candidate endpoints (Local/Remote): 127.0.0.1:51820/192.168.1.2:51820
  Relay server address: rels://netbird.anon-rvQ2R.domain:443
  Last connection update: 1 minute, 41 seconds ago
  Last WireGuard handshake: 1 minute, 36 seconds ago
  Transfer status (received/sent) 392 B/396 B
  Quantum resistance: false
  Routes: -
  Latency: 570µs

 server-cr.netbird.selfhosted:
  NetBird IP: 100.65.118.84
  Public key: MyRandomGibberishKey
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): srflx/srflx
  ICE candidate endpoints (Local/Remote): 198.51.100.0:27911/198.51.100.1:1032
  Relay server address: rels://netbird.anon-rvQ2R.domain:443
  Last connection update: 1 minute, 41 seconds ago
  Last WireGuard handshake: 1 minute, 36 seconds ago
  Transfer status (received/sent) 1.2 KiB/548 B
  Quantum resistance: false
  Routes: -
  Latency: 21.9377ms

 beth-lt.netbird.selfhosted:
  NetBird IP: 100.65.130.238
  Public key: MyRandomGibberishKey
  Status: Disconnected
  -- detail --
  Connection type:
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address:
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 0s

OS: windows/amd64
Daemon version: 0.31.0
CLI version: 0.31.0
Management: Connected to https://netbird.anon-rvQ2R.domain:443
Signal: Connected to https://netbird.anon-rvQ2R.domain:443
Relays:
  [stun:netbird.anon-rvQ2R.domain:3478] is Available
  [turn:netbird.anon-rvQ2R.domain:3478?transport=udp] is Available
  [rels://netbird.anon-rvQ2R.domain:443] is Available
Nameservers:
FQDN: beth-nlvm.netbird.selfhosted
NetBird IP: 100.65.55.28/16
Interface type: Userspace
Quantum resistance: false
Routes: -
Peers count: 2/3 Connected
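
For anyone comparing this output across several machines, the peer section can be pulled apart with a short script. This is only a minimal sketch, assuming the text layout is exactly what is pasted above (a peer name line ending in a colon, followed by "Key: value" lines). `parse_peers` is a hypothetical helper name, not part of NetBird.

```python
def parse_peers(status_text):
    """Return {peer_name: {field: value}} from the 'Peers detail:' section.

    Assumes the plain-text layout pasted in this issue; lines without a
    colon (such as the 'Transfer status' line) are skipped.
    """
    peers = {}
    current = None
    in_peers = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if line == "Peers detail:":
            in_peers = True
            continue
        if not in_peers or not line or line.startswith("--"):
            continue
        if line.startswith("OS:"):
            break  # the daemon summary starts here, the peer list is over
        if line.endswith(":") and " " not in line:
            # A peer header such as "server-cr.netbird.selfhosted:"
            current = line[:-1]
            peers[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only, values may contain colons
            key, _, value = line.partition(":")
            peers[current][key.strip()] = value.strip()
    return peers
```

Filtering the result on `fields["Status"] != "Connected"` would list the peers that are down at the moment the output was captured.
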
@mlsmaycon
Collaborator

It seems like a crash. @Mikesco3 can you check if there was any event in the event viewer (system and applications)?

@Mikesco3
Author

Mikesco3 commented Nov 7, 2024

I just have two Event ID 86 entries:

SCEP Certificate enrollment initialization for Local system via https://-KeyId-844db4655e5dcb9f989f2082a7662b019449d1bd.microsoftaik.azure.net/templates/Aik/scep failed:

GetCACaps

Method: GET(172ms)
Stage: GetCACaps
The connection with the server was terminated abnormally 0x80072efe (WinHttp: 12030 ERROR_WINHTTP_CONNECTION_ERROR)

- System
  - Provider
   [ Name]  Microsoft-Windows-CertificateServicesClient-CertEnroll
   [ Guid]  {54164045-7C50-4905-963F-E5BC1EEF0CCA}
   [ EventSourceName]  CertEnroll
  - EventID 86
   [ Qualifiers]  49754
   Version 0
   Level 2
   Task 0
   Opcode 0
   Keywords 0x80000000000000
  - TimeCreated
   [ SystemTime]  2024-11-07T18:35:02.8888095Z
   EventRecordID 12728
   Correlation
  - Execution
   [ ProcessID]  5048
   [ ThreadID]  0
   Channel Application
   Computer BethNL-VM
  - Security
   [ UserID]  S-1-5-18

- EventData
  Context WORKGROUP\BETHNL-VM$
  Url https://-KeyId-844db4655e5dcb9f989f2082a7662b019449d1bd.microsoftaik.azure.net/templates/Aik/scep
  MessageText GetCACaps
  Method GET(62ms)
  Stage GetCACaps
  ErrorCode The connection with the server was terminated abnormally 0x80072efe (WinHttp: 12030 ERROR_WINHTTP_CONNECTION_ERROR)

@mlsmaycon
Collaborator

mlsmaycon commented Nov 7, 2024

Could you please run the following command in an elevated powershell:

[System.Environment]::SetEnvironmentVariable('NB_WINDOWS_PANIC_LOG', "$env:ProgramData\netbird\netbird.err", 'Machine')

then try to reproduce the issue? If you can reproduce it, you should see a file in C:\ProgramData\netbird\netbird.err. Please share it with us.

@Mikesco3
Author

Mikesco3 commented Nov 7, 2024

  1. I ran your command in an elevated PowerShell...
  2. I killed netbird with `taskkill /f /im netbi*`
  3. restarted netbird (still no connection; your file was still empty)
  4. rebooted the computer
  5. attempted to open the UNC path (network drive).

I get the error where Windows cannot reach the computer

  6. ping the computer and after a 3 to 5 second delay, the peer responds.
  7. I can access the peer's network share normally
  8. Your netbird.err file is still empty
  9. Windows Logs \ Application has two new Event ID 86 entries, as I just posted.
    I also have some Information entries in the System part of the Windows logs:

Attempted to reserve URL http://*:5357/. Status 0x0. Process Id 0x4 Executable path , User SYSTEM
Attempted to reserve URL http://+:80/Temporary_Listen_Addresses/. Status 0x0. Process Id 0x4 Executable path , User SYSTEM
Attempted to reserve URL http://+:80/Temporary_Listen_Addresses/. Status 0x0. Process Id 0x4 Executable path , User SYSTEM
Attempted to reserve URL https://+:5986/wsman/. Status 0x0. Process Id 0x4 Executable path , User SYSTEM

etc...

Create URL group 0xFE00000220000001. Status 0x0. Process Id 0xAA8 Executable path \Device\HarddiskVolume3\Windows\System32\svchost.exe, User LOCAL SERVICE

etc...

@Mikesco3
Author

Mikesco3 commented Nov 7, 2024

I can reach the system practically right away by its Netbird IP...
So it seems to be related to DNS??
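
One way to check the DNS suspicion is to time the name lookup on its own, separately from any connection attempt. Below is a minimal sketch, assuming Python is installed on the client. `time_lookup` is a hypothetical helper, and the `resolver` parameter exists only so the function can be tried without a real network.

```python
import socket
import time


def time_lookup(name, resolver=socket.gethostbyname):
    """Resolve `name` and return (address or None, seconds elapsed)."""
    start = time.perf_counter()
    try:
        address = resolver(name)
    except OSError:
        # socket.gaierror is a subclass of OSError, so failed lookups land here
        address = None
    return address, time.perf_counter() - start
```

Calling `time_lookup("server-cr")` right after a reboot, and again once the ping has started working, should show whether the delay sits in name resolution rather than in the tunnel itself.
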

Update
If I add the Netbird IP address of the peer in question to the hosts file and reboot, then it is available right away...

Temporary solution

  1. Ping the netbird peer you want to reach
  2. Grab the IP address that replies
    In my example server-cr points to 100.65.118.84
  3. Enter the information into the hosts file
    C:\Windows\System32\drivers\etc\hosts
    At the end of the file add (adjust for your case):
100.65.118.84   server-cr
  4. Reboot. In my case the drives were available immediately after I logged in.
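
With more than a couple of peers, the hosts lines can be generated rather than typed by hand. This is a minimal sketch, assuming a plain name-to-IP mapping. `add_hosts_entries` is a hypothetical helper that only works on text; reading and writing the actual hosts file (which needs administrator rights) is left out on purpose.

```python
def add_hosts_entries(hosts_text, entries):
    """Return hosts_text with 'IP   name' lines appended for unmapped names.

    `entries` maps host name -> IP address. Names already present in
    hosts_text (case-insensitive) are left alone, so running this twice
    does not create duplicates.
    """
    existing = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        existing.update(name.lower() for name in fields[1:])  # fields[0] is the IP
    new_lines = [
        f"{ip}   {name}"
        for name, ip in entries.items()
        if name.lower() not in existing
    ]
    if not new_lines:
        return hosts_text
    # Make sure the appended lines start on a fresh line
    sep = "" if (not hosts_text or hosts_text.endswith("\n")) else "\n"
    return hosts_text + sep + "\n".join(new_lines) + "\n"
```

For the example in this issue the mapping would be `{"server-cr": "100.65.118.84"}`.
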

@mlsmaycon
Collaborator

Can you share the client logs?

You can bundle them with:

netbird debug bundle -A

@Mikesco3
Author

Mikesco3 commented Nov 7, 2024

Here are the logs...
I checked lightly to see if I needed to obfuscate anything and there only seem to be public keys...

netbird.debug.984634646.zip

@Mikesco3
Author

Mikesco3 commented Nov 7, 2024

BTW, I don't know if this makes a difference or not... Before running the current setup of Netbird,

  1. I had previously set up netbird on another VPS.
  2. Installed the clients and connected.
  3. Once that worked,
  4. I removed the netbird clients and the config folder from \programdata\
  5. set up the current VPS
  6. re-installed netbird...

So I don't know if there are any leftovers in the registry, that could point to the old vps??

However this wouldn't make sense, because I'm also having the issue on machines that never had Netbird before...

@Mikesco3
Author

Mikesco3 commented Nov 9, 2024

For now my solution has been to add the IPs of the machines I need to reach to the hosts file, and that worked.

It's not a pretty fix but it's working like a charm.
