Using neutrino gun events to probe memory growth #46598

Open · Dr15Jones opened this issue Nov 4, 2024 · 18 comments

@Dr15Jones (Contributor) commented Nov 4, 2024

I started from workflow 12861.0 (which is a neutrino gun) and then modified step3 (reconstruction) by

  1. Replaced PoolSource with RepeatingCachedRootSource. This allows the job to run an arbitrarily large number of events without any memory growth from the source, since all events are already cached in memory.
  2. Replaced each PoolOutputModule with an AsciiOutputModule, keeping identical outputCommands and turning verbosity off. This avoids the output files growing too large and makes the jobs run faster.
  3. Depending on the exact measurement being done, added the SimpleMemoryCheck service with periodic (every 10 seconds) sampling turned on and/or the PeriodicAllocMonitor service (sampling every 10 seconds).
  4. Set the framework's "begin processing event" message to print only once every 100 events, which makes the job run faster and avoids clogging the output file. (A configuration sketch of these modifications follows the cmsDriver command below.)

The full step3 cmsDriver.py command I started from was

step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,SKIM:LogError+LogErrorMonitor,PAT,NANO,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2024_realistic --datatier RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3_2024 --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --filein file:step2.root --fileout file:step3.root
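
For concreteness, here is a minimal sketch of what those four modifications look like when appended to the configuration that cmsDriver.py generates (with --no_exec). The parameter names used for RepeatingCachedRootSource, AsciiOutputModule and PeriodicAllocMonitor are my assumptions and should be checked with edmPluginHelp rather than taken as exact:

# sketch: append to the generated step3 configuration
import FWCore.ParameterSet.Config as cms

# 1. source that caches the input events in memory and repeats them indefinitely
process.source = cms.Source("RepeatingCachedRootSource",
    fileName = cms.untracked.string("file:step2.root"),  # assumed parameter name
    repeatNEvents = cms.untracked.uint32(10))            # assumed parameter name

# 2. replace every output module with an AsciiOutputModule that keeps the same
#    outputCommands but writes no file and prints nothing
for name, module in list(process.outputModules_().items()):
    setattr(process, name, cms.OutputModule("AsciiOutputModule",
        outputCommands = module.outputCommands,
        verbosity = cms.untracked.uint32(0)))  # assumed parameter name

# 3. memory monitoring (use one or both depending on the measurement)
process.add_(cms.Service("SimpleMemoryCheck"))  # periodic-sampling option not shown
process.add_(cms.Service("PeriodicAllocMonitor",
    fileName = cms.untracked.string("periodicAlloc.log")))  # assumed parameter name

# 4. report "begin processing event" only every 100 events
process.MessageLogger.cerr.FwkReport.reportEvery = 100

# let the job run far more events than the cached input file contains
process.maxEvents.input = 1000000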

@cmsbuild (Contributor) commented Nov 4, 2024

A new Issue was created by @Dr15Jones.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones (Contributor Author)

assign reconstruction, dqm

@cmsbuild (Contributor) commented Nov 4, 2024

New categories assigned: reconstruction,dqm

@antoniovagnerini, @jfernan2, @mandrenguyen, @rseidita, you have been requested to review this pull request/issue and eventually sign. Thanks.

@Dr15Jones (Contributor Author)

From the RSS and presently allocated memory measurements I get the following time profile of the job (8 threads and 8 streams)

[plot: RSS and presently allocated memory versus time]

If I exclude the startup memory growth and plot allocated memory versus number of events started in the job we get

[plot: allocated memory versus number of events started]

which shows an approximate 330 byte memory growth per event processed.

Trying to make a similar plot using the RSS information shows more variance:

[plot: RSS versus number of events started]
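
For reference, the ~330 byte/event figure above can be extracted by fitting a line to the periodic samples. A minimal sketch, assuming the samples have already been pulled out into a two-column text file of (events started, allocated bytes); that file format is hypothetical, not the literal PeriodicAllocMonitor output:

# sketch: estimate memory growth per event from periodic samples
# assumes samples.txt has two whitespace-separated columns per line:
#   <events_started> <allocated_bytes>   (hypothetical format)
import numpy as np

events, allocated = np.loadtxt("samples.txt", unpack=True)

# drop the startup region where memory is still ramping up, then fit a line;
# the slope is the average growth in bytes per event started
steady = events > 0.1 * events.max()
slope, intercept = np.polyfit(events[steady], allocated[steady], 1)
print(f"~{slope:.0f} bytes of growth per event")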

@Dr15Jones (Contributor Author)

Without the PeriodicAllocMonitor (and without preloading libPerfToolsAllocMonitorPreload.so) the job runs at a poor 24 events/second. The modules taking the most time are:

[table: modules taking the most time per event]

The events are empty, so modules taking between ~5 and ~20 ms to process nothing seems pretty high.

@Dr15Jones (Contributor Author)

Thread scaling in the job is hampered by the following "one"-type modules (edm::one modules), which the framework must work around. [These showed up while running the StallMonitor service.]

StallMonitor>         Module label  # of stalls  Total stalled time  Max stalled time
StallMonitor>  -------------------  -----------  ------------------  ----------------
StallMonitor>  ecalMonitorTask                3          0.347438 s        0.129493 s
StallMonitor>  NANOEDMAODSIMoutput           10           2.86258 s          0.4244 s
StallMonitor>  SKIMStreamLogError             7           1.30872 s        0.266045 s
StallMonitor>  particleLevel                  6           1.08233 s         0.18974 s
StallMonitor>  rivetProducerHTXS              6           1.08204 s        0.189733 s

NOTE: the NANOEDMAODSIMoutput and SKIMStreamLogError had not yet been replaced with AsciiOutputModule at the time I made the measurement.
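
For completeness, the stall table above comes from the StallMonitor service; a minimal sketch of turning it on (assuming the usual fileName parameter spelling):

# sketch: enable the framework's StallMonitor service; without a fileName it
# still prints the end-of-job stall summary, the file adds the detailed trace
process.add_(cms.Service("StallMonitor",
    fileName = cms.untracked.string("stallMonitor.log")))  # assumed parameter name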

@Dr15Jones (Contributor Author)

My intent is to write a customise function which could be used to create a standard test for memory leaks/hoarding.
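
The shape such a function would take is the usual cmsDriver customise pattern: take the process, apply the changes listed at the top of this issue, and return it. A sketch, where the function name and contents are placeholders, not anything that exists yet:

# sketch of the intended customise function (name and contents are placeholders)
import FWCore.ParameterSet.Config as cms

def customiseForMemoryProbe(process):
    # apply the source/output/monitoring changes from the issue description,
    # e.g. as in the configuration sketch after the cmsDriver command above
    process.MessageLogger.cerr.FwkReport.reportEvery = 100
    process.add_(cms.Service("SimpleMemoryCheck"))
    return process

It would then be attached the same way as the existing customisation in the cmsDriver command, e.g. --customise SomePackage/SomeSubPackage/SomeFile.customiseForMemoryProbe (placeholder path).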

@Dr15Jones (Contributor Author) commented Nov 4, 2024

From the PeriodicAllocMonitor information, we do 1.1 million allocation requests each event. Some of that will be from copying the cached input data but the rest is just churn.

Revised:
I ran a job with just the RepeatingCachedRootSource and the AsciiOutputModule instances (I removed the DQM output module since no DQM modules are in that job) where the job did not read the data products from the cache, just asked which event to run next. It took less than 80 seconds to process 1M events using 8 threads and 8 streams, and the average number of allocations per event was less than 200.

Then if I did ask for each data product from each event it took 265 seconds to process 100k events (380 events/s) and it averaged 3400 allocations per event.

I ran the PeriodicAllocMonitor for both of those jobs (which does slow them down some) and there is no sign of any memory increase.

@Dr15Jones (Contributor Author)

See #46603

@Dr15Jones (Contributor Author)

So I dropped the DQM parts of the configuration and re-ran, processing 20,000 events (just to make things faster). The allocation monitoring does not show any memory increase, so it appears that the 300+ byte/event memory growth is coming from DQM-related code. In addition, without DQM the average live allocation is only 3.2 GB and the number of per-event allocations drops to 450k. Therefore DQM is contributing 3+ GB of memory (at 8 threads/8 streams) and more than doubles the number of allocations per event.

[plot: allocated memory without the DQM parts of the configuration]
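
For reference, a rough sketch of what dropping the DQM parts can look like, assuming the conventional names that cmsDriver gives the DQM paths and output module (anything containing "dqm"/"DQM", and DQMoutput); those names come from the usual step3 layout, not from this issue:

# sketch: strip DQM paths and the DQM output module from a step3 process;
# the name matching below relies on the usual cmsDriver conventions (assumption)
for name in list(process.endpaths_()) + list(process.paths_()):
    if "dqm" in name.lower():
        if process.schedule is not None:
            process.schedule.remove(getattr(process, name))
        delattr(process, name)
if hasattr(process, "DQMoutput"):
    delattr(process, "DQMoutput")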

@Dr15Jones (Contributor Author) commented Nov 6, 2024

So I used a script to generate a configuration that matches step3's module and path dependencies but with each producer replaced by the trivial AddIntsProducer test module. This lets us see the effect of the framework infrastructure alone on the memory behavior.

Running over 70,000 events using 1 thread, the job was doing 9,000 allocations per event with an average of 270 kB allocated per event (the configuration has 2000 modules, each doing at least 2 new calls, one for each of the two products it puts into the Event), and the live allocation stayed nearly constant at 22.6 MB.

Running over 140,000 events using 8 threads gave the same allocations and memory per event and a live allocation of 83.3 MB.

So the framework only accounts for about 1% of the allocations being done per event.
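
The generated skeleton looks roughly like the following: same module labels and path structure as step3, but every producer is the framework test module AddIntsProducer wired to mimic the original dependencies. The labels parameter name and the two-products-per-module behaviour are my reading of that test module, so treat the details as assumptions:

# sketch of the kind of skeleton configuration the generator script produces
import FWCore.ParameterSet.Config as cms

process = cms.Process("SKELETON")
process.source = cms.Source("EmptySource")
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(70000))

# one trivial stand-in per original producer, inputs wired to match the
# original module dependencies (labels shown here are just illustrative)
process.siPixelClusters = cms.EDProducer("AddIntsProducer",
    labels = cms.VInputTag())                       # assumed parameter name
process.siPixelRecHits = cms.EDProducer("AddIntsProducer",
    labels = cms.VInputTag("siPixelClusters"))

process.p = cms.Path(process.siPixelClusters + process.siPixelRecHits)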

@Dr15Jones (Contributor Author)

I used the fixed ModuleAllocMonitor service to look for modules in the job that do lots of allocations during event processing. Here are the largest:

valEmtfStage2Digis: L1TMuonEndCapTrackProducer uses 1.3M allocations on the first event of each stream and then 20k for all subsequent events. The startup is from here

track_finder_->process(iEvent, iSetup, *out_hits_tmp, *out_tracks);

slimmedPatTrigger:PATTriggerObjectStandAloneSlimmer uses 1.3M allocations on the first event of 6 of the 8 streams in the job and then just 100 on all others. The startup cost comes from here

struct key_hash {
  std::size_t operator()(edm::ParameterSetID const &iKey) const { return iKey.smallHash(); }
};
typedef tbb::concurrent_unordered_map<edm::ParameterSetID, std::vector<std::string>, key_hash> AllLabelsMap;
CMS_THREAD_SAFE AllLabelsMap allLabelsMap;
}  // namespace

and the reason only 6 of the 8 are affected is that those 6 streams hit that code section at about the same time, so they are all trying to fill that thread-safe global concurrently.

patTrigger:PATTriggerProducer always allocates 330k times each event. It does a lot of work on the HLT trigger information, which is still filled for this workflow even though the events are empty.

dqmHLTFiltersDQMonitor:HLTFiltersDQMonitor always allocates 158k times each event.

valBmtfDigis:L1TMuonBarrelTrackProducer always allocates 85k times each event.

HLTSiStripMonitorTrack: SiStripMonitorTrack always allocates 65k times each event.

All subsequent modules allocate less than 50k times each event.

@makortel (Contributor) commented Nov 7, 2024

valEmtfStage2Digis: L1TMuonEndCapTrackProducer uses 1.3M allocations on the first event of each stream and then 20k for all subsequent events. The startup is from here

track_finder_->process(iEvent, iSetup, *out_hits_tmp, *out_tracks);

This one already has an open issue #42526

HLTSiStripMonitorTrack: SiStripMonitorTrack always allocates 65k times each event.

This one has an open issue as well #46498

@Dr15Jones (Contributor Author)

See #46628 which decreases patTrigger:PATTriggerProducer allocations from 330k to 48k.

@mmusich (Contributor) commented Nov 8, 2024

HLTSiStripMonitorTrack: SiStripMonitorTrack always allocates 65k times each event.

For my own education, can you explain the mechanics behind obtaining these numbers?
I've been trying the following recipe:

cmsrel CMSSW_14_2_X_2024-11-07-2300
cd CMSSW_14_2_X_2024-11-07-2300/src/
cmsenv
runTheMatrix.py -l 12861.0 --maxSteps=2
cd 12861.0_NuGun+2024
cmsDriver.py step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,SKIM:LogError+LogErrorMonitor,PAT,NANO,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2024_realistic --datatier RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3_2024 --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --filein file:step2.root --fileout file:step3.root --no_exec
echo 'process.add_(cms.Service("ModuleAllocMonitor", fileName = cms.untracked.string("moduleAlloc.log")))' >> step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
echo 'process.ModuleAllocMonitor.moduleNames = cms.untracked.vstring(["HLTSiStripMonitorTrack"])' >> step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
setenv LD_PRELOAD "libPerfToolsAllocMonitorPreload.so libPerfToolsMaxMemoryPreload.so"
cmsRun step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
wget https://raw.githubusercontent.com/cms-sw/cmssw/refs/heads/master/PerfTools/AllocMonitor/scripts/edmModuleAllocMonitorAnalyze.py
python3 edmModuleAllocMonitorAnalyze.py moduleAlloc.log

(Notice that edmModuleAllocMonitorAnalyze.py was not in the PATH; I tried to amend that in #46633.)

That gives me:

Memory Report: total memory requested: 95750866
Memory Report:  max memory used: 9708024
Memory Report:  presently used: 643496
Memory Report:  # allocations calls:   17393
Memory Report:  # deallocations calls: 17024

How should I read these numbers?

@Dr15Jones (Contributor Author)

@mmusich you actually load two different monitors: one is a Service (ModuleAllocMonitor) and the other starts because you added it to LD_PRELOAD (libPerfToolsMaxMemoryPreload.so). The printout you refer to is from the LD_PRELOAD library and gives statistics about the entire program.

The data obtained by the ModuleAllocMonitor was written to the file you specified in the Service's configuration, i.e. moduleAlloc.log. That is where you want to look for information specific to HLTSiStripMonitorTrack. In principle you should be able to point edmModuleAllocMonitorAnalyze.py at the file moduleAlloc.log to get useful information.

@Dr15Jones (Contributor Author)

The 85k from L1TMuonBarrelTrackProducer is brought down to 19 for all but the first event of a stream in #46640.
