feat(linstorvolumemanager): cache controller uri in a file #83 (Closed)
Millefeuille42 wants to merge 95 commits into 3.2.12-8.3
Conversation
Force-pushed 958caec to 42b42bc
Wescoeur requested changes Apr 14, 2025
Comment on lines +238 to +244
```python
address = uri.removeprefix("linstor://")
session = util.timeout_call(10, util.get_localAPI_session)
for host_ref, host_record in session.xenapi.host.get_all():
    if host_record.get('address', '') != address:
        continue
    return util.strtobool(
        session.xenapi.host.call_plugin(host_ref, PLUGIN, PLUGIN_CMD, {})
    )
```
Member
I'm not sure that's necessary here. We can instead refactor the codebase:
- We will assume that the cache is valid 99% of the time.
- We can directly attempt to create a LINSTOR instance from the URI without checking anything using a plugin. This is the initial idea.
- I think we can implement a new static function on `LinstorVolumeManager` that creates an instance of the class using the cached value directly, without any checks. If it fails, we rebuild the local cache and try again. This allows us to improve several smapi functions.
- There are a few edge cases left with this idea:
  - In some places we use the URI to create the journaler (`linstor.KV`) and to create a linstor object. We could try to use the cached URI without checks again and add a try/catch directly on these specific cases. In case of connection failure, we explicitly request the cache update.
  - I see one last edge case concerning the creation of a linstor instance using `get_ips_from_xha_config_file`; we could directly use the function to get the cached URI. In the worst case we can again fall back on the xha config file.
Millefeuille42 (Author, Member)
Sounds good from my POV. Feel free to continue when you have time. :)
Force-pushed 42b42bc to 065e30c
Force-pushed c315328 to 9760980
Nambrok reviewed Jun 17, 2025
Wescoeur requested changes Jun 17, 2025
Wescoeur reviewed Jun 19, 2025
Wescoeur reviewed Jun 24, 2025
Wescoeur reviewed Jun 24, 2025
Wescoeur reviewed Jun 25, 2025
Wescoeur reviewed Jun 25, 2025
Wescoeur reviewed Jun 25, 2025
Nambrok reviewed Jun 26, 2025
Force-pushed 4b663cb to 363372d
Wescoeur requested changes Jun 26, 2025
Comment on lines +228 to +235
```python
def get_cached_controller_uri(ctx=None):
    try:
        with ctx if ctx else shared_reader(CONTROLLER_CACHE_PATH) as f:
            return f.read().strip()
    except FileNotFoundError:
        pass
    except Exception as e:
        util.SMlog('Unable to read controller URI cache file at `{}` : {}'.format(CONTROLLER_CACHE_PATH, e))
```
Member
Suggested change:
```diff
-def get_cached_controller_uri(ctx=None):
-    try:
-        with ctx if ctx else shared_reader(CONTROLLER_CACHE_PATH) as f:
-            return f.read().strip()
-    except FileNotFoundError:
-        pass
-    except Exception as e:
-        util.SMlog('Unable to read controller URI cache file at `{}` : {}'.format(CONTROLLER_CACHE_PATH, e))
+def _read_controller_uri_from_file(f):
+    try:
+        return f.read().strip()
+    except Exception as e:
+        util.SMlog('Unable to read controller URI cache file at `{}`: {}'.format(CONTROLLER_CACHE_PATH, e))
+
+
+def read_controller_uri_cache():
+    try:
+        with shared_reader(CONTROLLER_CACHE_PATH) as f:
+            return _read_controller_uri_from_file(f)
+    except FileNotFoundError:
+        pass
+    except Exception as e:
+        util.SMlog('Unable to read controller URI cache file at `{}`: {}'.format(CONTROLLER_CACHE_PATH, e))
```
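The `shared_reader`/`excl_writer` helpers used throughout this PR are not shown in the thread. A plausible flock-based sketch of what such context managers might look like (the real implementation may well differ):

```python
import contextlib
import fcntl

# Hypothetical recreations of the shared_reader/excl_writer context managers
# referenced in the PR code; illustrative only.

@contextlib.contextmanager
def shared_reader(path):
    # Shared (read) lock: many readers may hold it simultaneously.
    with open(path, 'r') as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_SH)
        try:
            yield f
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)


@contextlib.contextmanager
def excl_writer(path):
    # Exclusive (write) lock: blocks all other readers and writers.
    # 'r+' raises FileNotFoundError if the cache file is absent, which
    # matches the except clauses in the PR code.
    with open(path, 'r+') as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)
        try:
            yield f
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```

With this shape, `FileNotFoundError` escapes the `open()` call before any lock is taken, which is why both `read` and `delete` paths above catch it explicitly.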
Comment on lines +238 to +246
```python
def delete_controller_uri_cache(ctx=None):
    try:
        with ctx if ctx else excl_writer(CONTROLLER_CACHE_PATH) as f:
            f.seek(0)
            f.truncate()
    except FileNotFoundError:
        pass
    except Exception as e:
        util.SMlog('Unable to delete uri cache file at `{}` : {}'.format(CONTROLLER_CACHE_PATH, e))
```
Member
Suggested change:
```diff
-def delete_controller_uri_cache(ctx=None):
-    try:
-        with ctx if ctx else excl_writer(CONTROLLER_CACHE_PATH) as f:
+def delete_controller_uri_cache():
+    try:
+        with excl_writer(CONTROLLER_CACHE_PATH) as f:
             f.seek(0)
             f.truncate()
     except FileNotFoundError:
         pass
     except Exception as e:
-        util.SMlog('Unable to delete uri cache file at `{}` : {}'.format(CONTROLLER_CACHE_PATH, e))
+        util.SMlog('Unable to delete URI cache file at `{}`: {}'.format(CONTROLLER_CACHE_PATH, e))
```
Comment on lines +249 to +279
```python
def write_controller_uri_cache(uri, ctx=None):
    try:
        if not os.path.exists(CONTROLLER_CACHE_DIRECTORY):
            os.makedirs(CONTROLLER_CACHE_DIRECTORY)
            os.chmod(CONTROLLER_CACHE_DIRECTORY, 0o700)
        with ctx if ctx else excl_writer(CONTROLLER_CACHE_PATH) as f:
            f.seek(0)
            f.write(uri)
            f.truncate()
    except FileNotFoundError:
        pass
    except Exception as e:
        util.SMlog('Unable to write URI cache file at `{}` : {}'.format(CONTROLLER_CACHE_PATH, e))


def build_controller_uri_cache():
    with excl_writer(CONTROLLER_CACHE_PATH) as f:
        uri = get_cached_controller_uri(contextlib.nullcontext(f))
        if uri:
            return uri
        uri = _get_controller_uri()
        if not uri:
            for retries in range(9):
                time.sleep(1)
                uri = _get_controller_uri()
                if uri:
                    break

                retries += 1
                if retries >= 10:
                    break
                time.sleep(1)
        if uri:
            write_controller_uri_cache(uri, contextlib.nullcontext(f))
        return uri
```
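The `ctx` parameter in the code above relies on `contextlib.nullcontext` to reuse a file handle that is already open under the exclusive lock, instead of reopening (and re-locking) the file. A standalone illustration of that trick, with a made-up path and contents:

```python
import contextlib
import tempfile

# Illustration of the optional-context pattern: a function accepts an
# already-entered file wrapped in contextlib.nullcontext so a caller holding
# the lock can reuse its handle, while plain callers let the function open
# the file itself.

def read_cached_uri(path, ctx=None):
    with ctx if ctx else open(path) as f:
        f.seek(0)
        return f.read().strip()


with tempfile.NamedTemporaryFile('w+', delete=False) as tmp:
    tmp.write('linstor://192.0.2.10')
    path = tmp.name

# Caller without an existing handle: the function opens and closes the file.
print(read_cached_uri(path))

with open(path) as f:
    # Caller already holding the file open: nullcontext makes the handle
    # "with"-compatible without closing it on exit.
    print(read_cached_uri(path, contextlib.nullcontext(f)))
    assert not f.closed
```

This is what lets `build_controller_uri_cache` call `get_cached_controller_uri` and `write_controller_uri_cache` on the same locked file descriptor without deadlocking on a second `excl_writer`.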
Member
Cannot add suggestion due to github limitation but:
```python
def build_controller_uri_cache():
    uri = ''
    try:
        with excl_writer(CONTROLLER_CACHE_PATH) as f:
            uri = _read_controller_uri_from_file(f)
            if uri:
                return uri
            uri = _get_controller_uri()
            if not uri:
                for retries in range(9):
                    time.sleep(1)
                    uri = _get_controller_uri()
                    if uri:
                        break
            if uri:
                f.seek(0)
                f.write(uri)
                f.truncate()
    except FileNotFoundError:
        if os.path.exists(CONTROLLER_CACHE_DIRECTORY):
            raise
        os.makedirs(CONTROLLER_CACHE_DIRECTORY)
        os.chmod(CONTROLLER_CACHE_DIRECTORY, 0o700)
        return build_controller_uri_cache()
    except Exception as e:
        util.SMlog('Unable to write URI cache file at `{}` : {}'.format(CONTROLLER_CACHE_PATH, e))
    return uri
```
Comment on lines +282 to +286
```python
def get_controller_uri():
    uri = get_cached_controller_uri()
    if not uri:
        uri = build_controller_uri_cache()
    return uri
```
Member
Suggested change:
```diff
 def get_controller_uri():
-    uri = get_cached_controller_uri()
+    uri = read_controller_uri_cache()
     if not uri:
         uri = build_controller_uri_cache()
     return uri
```
```python
    :param function logger: Function to log messages.
    :param int attempt_count: Number of attempts to join the controller.
    """
    uri = get_cached_controller_uri()
```
Member
Suggested change:
```diff
-    uri = get_cached_controller_uri()
+    uri = read_controller_uri_cache()
```
Force-pushed e7801da to ca5b52d
Force-pushed 6cdeb37 to 3579d92
- fix(linstor): prevent use of e before assignment in nested try-except
- fix(linstor): use util.get_master_ref to get the master ref
- fix(linstor): log host_ref instead of UUID to prevent a XAPI call
- fix(log_failed_call): set error value for the call without an actual error
- fix(linstorhostcall): use next(iter(...)) instead of list conversion
- cleanup(linstor): remove currently unused get_primary function

Signed-off-by: Mathieu Labourier <mathieu.labourier@vates.tech>
Co-authored-by: Damien Thenot <damien.thenot@vates.tech>
Co-authored-by: Ronan Abhamon <ronan.abhamon@vates.tech>
Upstream patch of ae10349 is incorrect. All "@mock.patch('blktap2.VDI.PhyLink', autospec=True)" lines must be removed because PhyLink is mocked globally. Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
- Use specific DRBD options to detect failures within a short delay.
- Use these options to control quorum with drbd-reactor.
- Provide a better compromise in terms of availability.

Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
Impacted functions: `_get_volumes_info` and `_get_volume_node_names_and_size`. Before this change, "usable_size" validity was checked too early, which could lead to an exception for no good reason even though the size could be known on at least one host despite an issue on other machines. Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
…ll context Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
…mutators Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
The session attr is not set during "attach/detach from config" calls. In this context the local method must always be called. Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
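The dispatch fix described in that commit can be illustrated with a small sketch. The class and method names below are hypothetical, not the real linstorhostcall code:

```python
# Illustrative sketch: when no XAPI session is attached (attach/detach from
# config), calls must be dispatched to the local implementation instead of
# going through a host plugin call that would require a session.

class VolumeCall:
    def __init__(self, session=None):
        self.session = session

    def get_info(self):
        if self.session is None:
            # No session: running from config context, stay local.
            return self._get_info_local()
        return self._get_info_remote()

    def _get_info_local(self):
        return 'local'

    def _get_info_remote(self):
        return 'remote:{}'.format(self.session)


print(VolumeCall().get_info())             # no session -> local path
print(VolumeCall(session='s1').get_info())  # session -> remote path
```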
Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
A change in lvm2 (https://github.com/xcp-ng-rpms/lvm2/pull/3/files) introduces an issue in LargeBlockSR: `/dev/` is no longer scanned, meaning the loop device is never used for VG activation, so we must add a custom scan parameter to LVM commands. We also now systematically call _redo_vg_connection to use our custom parameters to enable the LV on the correct device before calling `EXTSR.attach()`. Signed-off-by: Damien Thenot <damien.thenot@vates.tech>
This is not done on each and every implementation of SR, but only on the ones that call cleanup.start_gc_service (like FileSR) and on the classes that inherit from them without calling super on detach. This prevents useless error logs like "Failed to stop xxx.service: Unit xxx.service not loaded." Signed-off-by: Mathieu Labourier <mathieu.labourier@vates.tech>
When the pool master changes and the new master doesn't have a local DB path, `get_database_path` fails during the SR.scan call. This patch allows creating a diskless path if necessary. Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
…104) In `_request_device_path`: before this change, an exception was thrown when a resource was missing, but not when the returned path was empty. Now it's raised in both cases. Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
Add a way in `linstorvolumemanager` to verify at init that all nodes are using the same LINSTOR version. Raise an error early if a mismatch is detected so that SR ops are properly disabled with clear feedback to the user. Signed-off-by: Antoine Bartuccio <antoine.bartuccio@vates.tech>
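The version-uniformity check that commit describes could look roughly like the following sketch. The function name and the node/version data are fabricated stand-ins, not the actual linstorvolumemanager code or LINSTOR API output:

```python
# Hedged sketch of an "all nodes must run the same LINSTOR version" check,
# raised early at init so SR operations fail with a clear message.

def check_uniform_version(node_versions):
    """node_versions: dict mapping node name -> version string."""
    versions = set(node_versions.values())
    if len(versions) > 1:
        raise RuntimeError(
            'LINSTOR version mismatch across nodes: {}'.format(
                ', '.join(
                    '{}={}'.format(n, v)
                    for n, v in sorted(node_versions.items())
                )
            )
        )


check_uniform_version({'node-1': '1.24.2', 'node-2': '1.24.2'})  # passes silently
```

Raising at init (rather than on first failed operation) is what gives the user actionable feedback before any SR op is attempted.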
Avoid python version mismatch that pulls incompatible dependencies in github actions when running unittests. Signed-off-by: Antoine Bartuccio <antoine.bartuccio@vates.tech>
During the construction of the volume set in `LinstorVolumeManager`, a resource might not be properly deleted following a previous operation, such as a snapshot. In this case, it's renamed with the prefix `DELETED_`. Unfortunately, this part fails at the end of the renaming process because it attempts to delete the original volume name from the set, which is not present because the set is still being initialized at that point.
Without this fix, we have this trace:
```
Nov 26 08:20:30 xcp-node-1 SM: [1045423] Cannot clean volume 389e891c-2150-4caa-b201-073dedb8b886: Could not destroy resource `xcp-volume-9ef5f4a1-f101-4f34-8029-f2ad698decdd` from SR `xcp-sr-linstor_group_thin_device`: (Node: 'xcp-nodo
-1') Failed to delete lvm volume
Nov 26 08:20:30 xcp-node-1 SM: [1045423] Trying to update volume UUID 389e891c-2150-4caa-b201-073dedb8b886 to DELETED_389e891c-2150-4caa-b201-073dedb8b886...
Nov 26 08:20:30 xcp-node-1 SM: [1045423] Raising exception [47, The SR is not available [opterr='389e891c-2150-4caa-b201-073dedb8b886']]
Nov 26 08:20:30 xcp-node-1 SM: [1045423] lock: released /var/lock/sm/33dcb0ef-e089-4b6b-ab79-3d337045528e/sr
Nov 26 08:20:30 xcp-node-1 SM: [1045423] ***** generic exception: vdi_snapshot: EXCEPTION <class 'xs_errors.SROSError'>, The SR is not available [opterr='389e891c-2150-4caa-b201-073dedb8b886']
Nov 26 08:20:30 xcp-node-1 SM: [1045423] File "/opt/xensource/sm/SRCommand.py", line 113, in run
Nov 26 08:20:30 xcp-node-1 SM: [1045423] return self._run_locked(sr)
Nov 26 08:20:30 xcp-node-1 SM: [1045423] File "/opt/xensource/sm/SRCommand.py", line 157, in _run_locked
Nov 26 08:20:30 xcp-node-1 SM: [1045423] target = sr.vdi(self.vdi_uuid)
Nov 26 08:20:30 xcp-node-1 SM: [1045423] File "/opt/xensource/sm/LinstorSR", line 537, in wrap
Nov 26 08:20:30 xcp-node-1 SM: [1045423] return load(self, *args, **kwargs)
Nov 26 08:20:30 xcp-node-1 SM: [1045423] File "/opt/xensource/sm/LinstorSR", line 463, in load
Nov 26 08:20:30 xcp-node-1 SM: [1045423] raise xs_errors.XenError('SRUnavailable', opterr=str(e))
Nov 26 08:20:30 xcp-node-1 SM: [1045423]
Nov 26 08:20:30 xcp-node-1 SM: [1045423] ***** LINSTOR resources on XCP-ng: EXCEPTION <class 'xs_errors.SROSError'>, The SR is not available [opterr='389e891c-2150-4caa-b201-073dedb8b886']
Nov 26 08:20:30 xcp-node-1 SM: [1045423] File "/opt/xensource/sm/SRCommand.py", line 392, in run
Nov 26 08:20:30 xcp-node-1 SM: [1045423] ret = cmd.run(sr)
Nov 26 08:20:30 xcp-node-1 SM: [1045423] File "/opt/xensource/sm/SRCommand.py", line 113, in run
Nov 26 08:20:30 xcp-node-1 SM: [1045423] return self._run_locked(sr)
Nov 26 08:20:30 xcp-node-1 SM: [1045423] File "/opt/xensource/sm/SRCommand.py", line 157, in _run_locked
Nov 26 08:20:30 xcp-node-1 SM: [1045423] target = sr.vdi(self.vdi_uuid)
Nov 26 08:20:30 xcp-node-1 SM: [1045423] File "/opt/xensource/sm/LinstorSR", line 537, in wrap
Nov 26 08:20:30 xcp-node-1 SM: [1045423] return load(self, *args, **kwargs)
Nov 26 08:20:30 xcp-node-1 SM: [1045423] File "/opt/xensource/sm/LinstorSR", line 463, in load
Nov 26 08:20:30 xcp-node-1 SM: [1045423] raise xs_errors.XenError('SRUnavailable', opterr=str(e))
Nov 26 08:20:30 xcp-node-1 SM: [1045423]
```
Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
In the case of a DRBD resource deletion issue via the LINSTOR API,
a wrong exception was thrown in the `VDI.delete` code instead of
simply logging and properly terminating the command.
Trace before correction:
```
Nov 26 08:19:58 xcp-node-1 SM: [1055074] Failed to remove the volume (maybe is leaf coalescing) for 389e891c-2150-4caa-b201-073dedb8b886 err: Cannot destroy volume `389e891c-2150-4caa-b201-073dedb8b886`: Could not destroy resource `xcp-
volume-9ef5f4a1-f101-4f34-8029-f2ad698decdd` from SR `xcp-sr-linstor_group_thin_device`: (Node: 'xcp-nodo-1') Failed to delete lvm volume
Nov 26 08:19:58 xcp-node-1 SM: [1055074] Raising exception [80, Failed to mark VDI hidden [opterr=Cannot destroy volume `389e891c-2150-4caa-b201-073dedb8b886`: Could not destroy resource `xcp-volume-9ef5f4a1-f101-4f34-8029-f2ad698decdd`
from SR `xcp-sr-linstor_group_thin_device`: (Node: 'xcp-nodo-1') Failed to delete lvm volume]]
Nov 26 08:19:58 xcp-node-1 SM: [1055074] lock: released /var/lock/sm/33dcb0ef-e089-4b6b-ab79-3d337045528e/sr
Nov 26 08:19:58 xcp-node-1 SM: [1150284] lock: acquired /var/lock/sm/33dcb0ef-e089-4b6b-ab79-3d337045528e/sr
Nov 26 08:19:58 xcp-node-1 SM: [1055074] ***** generic exception: vdi_delete: EXCEPTION <class 'xs_errors.SROSError'>, Failed to mark VDI hidden [opterr=Cannot destroy volume `389e891c-2150-4caa-b201-073dedb8b886`: Could not destroy res
ource `xcp-volume-9ef5f4a1-f101-4f34-8029-f2ad698decdd` from SR `xcp-sr-linstor_group_thin_device`: (Node: 'xcp-nodo-1') Failed to delete lvm volume]
Nov 26 08:19:58 xcp-node-1 SM: [1055074] File "/opt/xensource/sm/SRCommand.py", line 113, in run
Nov 26 08:19:58 xcp-node-1 SM: [1055074] return self._run_locked(sr)
Nov 26 08:19:58 xcp-node-1 SM: [1055074] File "/opt/xensource/sm/SRCommand.py", line 163, in _run_locked
Nov 26 08:19:58 xcp-node-1 SM: [1055074] rv = self._run(sr, target)
Nov 26 08:19:58 xcp-node-1 SM: [1055074] File "/opt/xensource/sm/SRCommand.py", line 254, in _run
Nov 26 08:19:58 xcp-node-1 SM: [1055074] return target.delete(self.params['sr_uuid'], self.vdi_uuid)
Nov 26 08:19:58 xcp-node-1 SM: [1055074] File "/opt/xensource/sm/LinstorSR", line 1775, in delete
Nov 26 08:19:58 xcp-node-1 SM: [1055074] raise xs_errors.XenError('VDIDelete', opterr=str(e))
Nov 26 08:19:58 xcp-node-1 SM: [1055074]
Nov 26 08:19:58 xcp-node-1 SM: [1055074] ***** LINSTOR resources on XCP-ng: EXCEPTION <class 'xs_errors.SROSError'>, Failed to mark VDI hidden [opterr=Cannot destroy volume `389e891c-2150-4caa-b201-073dedb8b886`: Could not destroy resou
rce `xcp-volume-9ef5f4a1-f101-4f34-8029-f2ad698decdd` from SR `xcp-sr-linstor_group_thin_device`: (Node: 'xcp-nodo-1') Failed to delete lvm volume]
Nov 26 08:19:58 xcp-node-1 SM: [1055074] File "/opt/xensource/sm/SRCommand.py", line 392, in run
Nov 26 08:19:58 xcp-node-1 SM: [1055074] ret = cmd.run(sr)
Nov 26 08:19:58 xcp-node-1 SM: [1055074] File "/opt/xensource/sm/SRCommand.py", line 113, in run
Nov 26 08:19:58 xcp-node-1 SM: [1055074] return self._run_locked(sr)
Nov 26 08:19:58 xcp-node-1 SM: [1055074] File "/opt/xensource/sm/SRCommand.py", line 163, in _run_locked
Nov 26 08:19:58 xcp-node-1 SM: [1055074] rv = self._run(sr, target)
Nov 26 08:19:58 xcp-node-1 SM: [1055074] File "/opt/xensource/sm/SRCommand.py", line 254, in _run
Nov 26 08:19:58 xcp-node-1 SM: [1055074] return target.delete(self.params['sr_uuid'], self.vdi_uuid)
Nov 26 08:19:58 xcp-node-1 SM: [1055074] File "/opt/xensource/sm/LinstorSR", line 1775, in delete
Nov 26 08:19:58 xcp-node-1 SM: [1055074] raise xs_errors.XenError('VDIDelete', opterr=str(e))
Nov 26 08:19:58 xcp-node-1 SM: [1055074]
```
Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
Before this change `EMEDIUMTYPE` was not tested, so we only checked the DRBD openers if we attempted a local opening in write mode. As a reminder, `EMEDIUMTYPE` is returned if a local read-only opening is attempted and the volume is open for writing on another machine. Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.tech>
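The errno handling that commit changes can be sketched in isolation. `should_check_openers` is a hypothetical helper name, not the actual sm code; `EMEDIUMTYPE` is the real Linux errno (124) described in the message:

```python
import errno

# Sketch of the change: EMEDIUMTYPE is returned when a local read-only open
# is attempted while the DRBD volume is open for writing on another machine.
# Before the fix, only write-mode open failures triggered the DRBD openers
# check; after the fix, a read-only EMEDIUMTYPE failure does too.

EMEDIUMTYPE = getattr(errno, 'EMEDIUMTYPE', 124)  # Linux-specific errno


def should_check_openers(error, write_mode):
    # write_mode failures were already handled; EMEDIUMTYPE is the new case.
    return write_mode or error.errno == EMEDIUMTYPE


err = OSError(EMEDIUMTYPE, 'wrong medium type')
print(should_check_openers(err, write_mode=False))  # True after the fix
```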
Explicit error message pointing to missing vdi_type on SRMetadata update. Add a distinction for unpacking corrupted empty metadata headers for easier diagnosis. Signed-off-by: Antoine Bartuccio <antoine.bartuccio@vates.tech>
Adding SRs on multiple USB devices may fail because /usr/lib/udev/scsi_id returns the same device ID for all the USB devices. This change fixes that by checking the drive type and using the device serial number if it is read correctly. Signed-off-by: Frederic Bor <frederic.bor@wanadoo.fr>
Co-authored-by: Ronan Abhamon <ronan.abhamon@vates.tech> Co-authored-by: Damien Thenot <damien.thenot@vates.tech> Signed-off-by: Mathieu Labourier <mathieu.labourier@vates.tech>
Force-pushed e3e4980 to 8458755
Member
Replaced by: #121.
No description provided.