Optional garbage collection and CheckpointManager._global_deps by Ig-dolci · Pull Request #187 · dolfin-adjoint/pyadjoint

Ig-dolci · 2024-12-13T11:40:36Z

PR Description

Optional garbage collection:
This PR adds optional garbage collection during checkpointing, so that the user can work around cases where Python does not properly track and clean up checkpoint objects in memory.

Details of the experiment used to test manually triggered garbage collection:

Degrees of freedom (DoFs): 40,401
Test type: Burgers test
Total steps: 1,000

In the plot, the black curve represents the run with garbage collection enabled, while the blue curve shows the run without garbage collection during checkpointing.

Checkpoint Manager _global_deps:
A private attribute, _global_deps, is introduced in the CheckpointManager class. This attribute stores dependencies that are used at each time step and are not time-dependent.
- If a block variable is included in _global_deps, it will not be cleaned during checkpointing. This prevents unnecessary cleanup and re-creation of checkpoints for dependencies that do not change with time.
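
The rule above can be sketched as follows. This is only an illustration under stated assumptions: `FakeBlockVariable` and `clear_checkpoints` are hypothetical stand-ins and not pyadjoint classes or functions. Only the `_checkpoint` attribute and the idea of a `_global_deps` set come from this PR.

```python
class FakeBlockVariable:
    """Hypothetical stand-in for a block variable that holds a checkpoint."""

    def __init__(self, name, checkpoint):
        self.name = name
        self._checkpoint = checkpoint


def clear_checkpoints(block_variables, global_deps):
    """Clear every checkpoint except those of global dependencies."""
    for bv in block_variables:
        if bv not in global_deps:
            bv._checkpoint = None


# One time-independent dependency (e.g. a constant coefficient) and two
# time-dependent states.
nu = FakeBlockVariable("nu", 0.01)
u0 = FakeBlockVariable("u0", [1.0, 2.0])
u1 = FakeBlockVariable("u1", [1.5, 2.5])

# Assumed to play the role of CheckpointManager._global_deps.
global_deps = {nu}
clear_checkpoints([nu, u0, u1], global_deps)

print(nu._checkpoint, u0._checkpoint, u1._checkpoint)  # 0.01 None None
```

Keeping the checkpoint of `nu` avoids re-creating it at every timestep, which is the saving described in the bullet above.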


jrmaddison · 2025-01-22T10:49:48Z

Is it equivalent to instead drop zero output Blocks on the tape for garbage collection? Or is this slightly different?

Ig-dolci · 2025-01-22T13:22:43Z

> Is it equivalent to instead drop zero output Blocks on the tape for garbage collection? Or is this slightly different?

I believe it is different. I noticed that memory usage kept growing, mainly during the recomputation process, even after I cleared the checkpoint using block_variable._checkpoint = None. After some discussions here, my hypothesis is that Python might not be tracking all objects in memory properly. So I am only allowing the user to invoke the garbage collector manually, which appears to help.


connorjward

I think this could do with a lot more explanation. This is very complicated so adding some substantial comments and expanding docstrings would be extremely helpful.

The code style seems fine.

connorjward · 2025-02-06T11:38:30Z

pyadjoint/checkpointing.py

    Args:
        schedule (checkpoint_schedules.schedule): A schedule provided by the `checkpoint_schedules` package.
        tape (Tape): A list of blocks :class:`Block` instances.
+        gc_timestep_frequency (int): The timestep frequency for garbage collection.


This could be clearer. Perhaps "the number of timesteps between garbage collections"

Also it should state that if None then no collection is done, or similar.
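
A minimal sketch of the behaviour the suggested docstring wording describes: collect every `gc_timestep_frequency` timesteps, and do nothing when it is `None`. The function `run_timesteps` is hypothetical and is not how pyadjoint implements this; only the parameter name comes from the diff above.

```python
import gc


def run_timesteps(n_steps, gc_timestep_frequency=None):
    """Run n_steps steps, collecting garbage every gc_timestep_frequency steps.

    Returns the timesteps after which gc.collect() was called. If
    gc_timestep_frequency is None, no explicit collection is done.
    """
    collected_at = []
    for timestep in range(1, n_steps + 1):
        # ... the work of one timestep would go here ...
        if gc_timestep_frequency is not None and timestep % gc_timestep_frequency == 0:
            gc.collect()
            collected_at.append(timestep)
    return collected_at


print(run_timesteps(1000, gc_timestep_frequency=100)[:3])  # [100, 200, 300]
print(run_timesteps(1000))  # []
```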

connorjward · 2025-02-06T11:39:53Z

pyadjoint/checkpointing.py

+        # The user can manually invoke the garbage collector if Python fails to
+        # track and clean all checkpoint objects in memory properly.


This is confusing because setting gc_timestep_frequency suggests that GC is being run automatically, whereas here you say manually

connorjward · 2025-02-06T11:41:09Z

pyadjoint/checkpointing.py

+                for deps in self.tape.timesteps[timestep - 1].checkpointable_state:
+                    self._global_deps.add(deps)
+            else:
+                deps_to_clear = self._global_deps - self._global_deps.intersection(


I think you might want set.difference https://docs.python.org/3/library/stdtypes.html#frozenset.difference


connorjward · 2025-02-06T17:18:31Z

pyadjoint/checkpointing.py

+                # Check if the block variables stored at `self._global_deps` are still
+                # dependencies in the previous timestep. If not, will remove them from the
+                # global dependencies.
+                deps_to_clear = self._global_deps.difference(self._global_deps.intersection(


I don't think you need to have the intersection here.

But I could be wrong.

I will check this.
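
For reference, the set identity behind this suggestion is A \ (A ∩ B) = A \ B, so the inner intersection does not change the result. A quick check, using strings as hypothetical stand-ins for block variables (`current_deps` is an assumed name for the previous timestep's dependencies):

```python
global_deps = {"nu", "f", "u_prev"}
current_deps = {"nu", "f", "u"}

with_intersection = global_deps.difference(global_deps.intersection(current_deps))
without_intersection = global_deps.difference(current_deps)

print(with_intersection)     # {'u_prev'}
print(without_intersection)  # {'u_prev'}
```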


connorjward · 2025-02-06T17:24:12Z

pyadjoint/checkpointing.py

+                # Clear the checkpoint once it is not a global dependency and should be stored
+                # only in the ``self.tape.timesteps`` checkpoints when needed.


I'm afraid I don't quite understand what this means. Could you rephrase this?

Is this text easier to understand?

For non-global dependencies, checkpoint storage occurs at a self.tape timestep only when required by an action from the schedule. Thus, we have to clear the checkpoints of block variables that are not in self._global_deps.

Yeah I think that's good. Thanks.

Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>

angus-g · 2025-02-11T10:59:32Z

I think something that could be worth looking at here is the circular reference between OverloadedType and BlockVariable:

pyadjoint/pyadjoint/overloaded_type.py

Lines 96 to 98 in c7d7392

    def create_block_variable(self):
        self.block_variable = BlockVariable(self)
        return self.block_variable

In effect, any subclass of OverloadedType can only be deleted after garbage collection, not reference counting. I think this means that several data arrays sit around for longer than they should, rather than being deleted when their owner goes out of scope.
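
The effect can be reproduced in isolation. The sketch below assumes CPython and uses hypothetical stand-in classes that only mimic the mutual reference, not the real pyadjoint types. A `weakref.ref` probe is used to observe when the object is actually freed.

```python
import gc
import weakref


class BlockVariable:
    """Hypothetical stand-in: keeps a strong reference to its owner."""

    def __init__(self, output):
        self.output = output


class Overloaded:
    """Hypothetical stand-in for an OverloadedType subclass."""

    def create_block_variable(self):
        self.block_variable = BlockVariable(self)  # owner <-> block variable cycle
        return self.block_variable


gc.disable()  # stop the cyclic collector from running on its own in this demo
try:
    obj = Overloaded()
    obj.create_block_variable()
    probe = weakref.ref(obj)  # lets us observe whether the object is still alive

    del obj
    alive_after_del = probe() is not None  # reference counting cannot free the cycle

    gc.collect()
    alive_after_collect = probe() is not None  # the cyclic collector can
finally:
    gc.enable()

print(alive_after_del, alive_after_collect)  # True False
```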

Ig-dolci · 2025-02-11T11:04:58Z

> I think something that could be worth looking at here is the circular reference between OverloadedType and BlockVariable:
>
> pyadjoint/pyadjoint/overloaded_type.py
>
> Lines 96 to 98 in c7d7392
>
>     def create_block_variable(self):
>         self.block_variable = BlockVariable(self)
>         return self.block_variable
>
> In effect, any subclass of OverloadedType can only be deleted after garbage collection, not reference counting. I think this means that several data arrays sit around for longer than they should, rather than being deleted when their owner goes out of scope.

Thank you. I will investigate that.

connorjward · 2025-02-11T11:21:14Z

> In effect, any subclass of OverloadedType can only be deleted after garbage collection, not reference counting. I think this means that several data arrays sit around for longer than they should, rather than being deleted when their owner goes out of scope.

Is there any chance that this could be made a weakref so as to avoid this cycle?

dham · 2025-02-11T11:32:27Z

> > In effect, any subclass of OverloadedType can only be deleted after garbage collection, not reference counting. I think this means that several data arrays sit around for longer than they should, rather than being deleted when their owner goes out of scope.
>
> Is there any chance that this could be made a weakref so as to avoid this cycle?

I think so. I think the BlockVariable should not prolong the lifetime of the OverloadedType, so BlockVariable.output should be a weakref. We have to be careful that we're not abusing BlockVariable.output anywhere (i.e. relying on it as a source of information after the operation has been taped).
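
A rough sketch of what a weakly referenced `output` could look like, assuming CPython. `WeakBlockVariable` and `Overloaded` are hypothetical stand-ins. This is not the change made in pyadjoint, and it ignores the places where existing code may read `output` after taping.

```python
import weakref


class Overloaded:
    """Hypothetical stand-in for an OverloadedType subclass."""

    def create_block_variable(self):
        self.block_variable = WeakBlockVariable(self)
        return self.block_variable


class WeakBlockVariable:
    """Hypothetical variant whose `output` does not keep its owner alive."""

    def __init__(self, output):
        self._output_ref = weakref.ref(output)

    @property
    def output(self):
        # Returns None once the owner has been deleted, so any code that
        # reads `output` after taping has to handle that case.
        return self._output_ref()


obj = Overloaded()
bv = obj.create_block_variable()
output_before = bv.output is obj

del obj  # no cycle of strong references, so reference counting frees the owner
output_after = bv.output

print(output_before, output_after)  # True None
```

The `None` result after `del obj` is exactly the hazard raised above: anything that relies on `output` after the operation has been taped would need to be audited.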

Ig-dolci · 2025-02-11T11:50:20Z

> > In effect, any subclass of OverloadedType can only be deleted after garbage collection, not reference counting. I think this means that several data arrays sit around for longer than they should, rather than being deleted when their owner goes out of scope.
>
> Is there any chance that this could be made a weakref so as to avoid this cycle?

I already tried using a weakref for BlockVariable.output, but I hit a number of errors that I do not remember now. It needs to be done very carefully.

angus-g · 2025-02-11T11:50:40Z

I had a hacky go at that before, which worked for the forward run of the tape (with some implementation ugliness). The underlying OverloadedType was deleted at some point before/during the adjoint call, so that might need some care.

jrmaddison · 2025-02-11T13:50:28Z

Looks easier to break the cycle on the other side, see #194 for an attempt.

Ig-dolci · 2025-02-12T15:58:16Z

An update:

I have tested this PR merged with PR 194 against PR 194 alone for Burgers' equation, using the following setup: 40,000 DoFs and 1,000 time steps.

The chart uses SingleDiskStorageSchedule with the fix from PR 4020. The black line shows the results for PR 194 alone, and the blue line shows this PR merged with PR 194 using gc_timestep_frequency=100.

Ig-dolci · 2025-02-12T16:08:07Z

I will also check PR 4033 with the same example and add the result here.

Ig-dolci · 2025-02-12T19:47:19Z

Now using firedrake PR 4033 merged with firedrake PR 4020, which automatically uses SingleDiskStorageSchedule.

Again, I have tested this PR merged with pyadjoint PR 194 against PR 194 alone for Burgers' equation, using the following setup: 40,000 DoFs and 1,000 time steps.

The black line shows the results for PR 194 alone, and the blue line shows this PR merged with PR 194 using gc_timestep_frequency=100.


Ig-dolci · 2025-02-18T10:37:01Z

pyadjoint/checkpointing.py

+                # Check if the block variables stored at `self._global_deps` are still
+                # dependencies in the previous timestep. If not, will remove them from the
+                # global dependencies.
+                deps_to_clear = self._global_deps.difference(self._global_deps.intersection(


I will check this.


connorjward

Looks great. Very readable now.

Ig-dolci added 10 commits December 8, 2024 10:49

Enable the user to set the options to be passed to the inner product solver.

37e5021

flake8

f9413f7

opt gc_collect

5d43b93

Add clear checkpoint OverloadedType method

6a9e256

Minor changes

4858699

global_deps

3c7c319

Avoid clean and copy global_deps

653690e

Remove unecessary changes

4586d75

merge master

9664618

docs

c91208b

Ig-dolci commented Jan 21, 2025


Add gc arguments

e20b60e

Ig-dolci commented Jan 23, 2025


Ig-dolci and others added 4 commits January 23, 2025 11:42

Fixing and and change docs

ce30153

fixing

5168326

Raise an error only in tape.enable_checkpointing

226c75e

Add _adj_deps_cleaned into TimeStep

dd227b3

Ig-dolci commented Jan 24, 2025


Ig-dolci added 3 commits February 5, 2025 15:32

Test global deps

48ef717

Small change

5874738

Enhance docs

f60c903

Ig-dolci changed the title ~~Optional garbage collection and more...~~ Optional garbage collection and CheckpointManager._global_deps Feb 5, 2025

Ig-dolci marked this pull request as ready for review February 5, 2025 17:59

connorjward requested changes Feb 6, 2025


Docs enhancement

68870da

Ig-dolci commented Feb 6, 2025

Ig-dolci commented Feb 6, 2025

Ig-dolci commented Feb 6, 2025

connorjward requested changes Feb 6, 2025


Ig-dolci and others added 2 commits February 6, 2025 17:29

Apply suggestions from code review

51278ce

Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>

docs enhancement

8c8c25a

angus-g mentioned this pull request Feb 11, 2025

Fix memory leak for disk checkpointing firedrakeproject/firedrake#4020

Merged

Merge master

328eb04

Ig-dolci commented Feb 18, 2025


Ig-dolci and others added 2 commits February 18, 2025 11:10

Apply suggestions from code review

0d3d8aa

Simplify deps_to_clear

81c33d7

Ig-dolci requested a review from connorjward February 18, 2025 11:41

connorjward approved these changes Feb 18, 2025


Ig-dolci merged commit c7bf2ec into master Feb 18, 2025
1 of 2 checks passed
