Using neutrino gun events to probe memory growth #46598

Open · Dr15Jones opened this issue Nov 4, 2024 · 18 comments

@Dr15Jones (Contributor) commented Nov 4, 2024

I started from workflow 12861.0 (which is a neutrino gun) and then modified step3 (reconstruction) by

  1. Replaced PoolSource with RepeatingCachedRootSource. This allows the job to run an arbitrarily large number of events without any memory growth from the source, since all events are already cached in memory.
  2. Replaced each PoolOutputModule with an AsciiOutputModule, keeping identical outputCommands and turning verbosity off. This avoids the output files growing too large and makes the jobs run faster.
  3. Depending on the exact measurement being done, added the SimpleMemoryCheck service with periodic (every 10 seconds) sampling turned on and/or the PeriodicAllocMonitor service (sampling every 10 seconds).
  4. Set the framework's "begin processing event" message to print only once every 100 events, which makes the job run faster and avoids clogging the output file. (A configuration sketch of these modifications follows the cmsDriver command below.)

The full step3 cmsDriver.py command I started from was

step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,SKIM:LogError+LogErrorMonitor,PAT,NANO,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2024_realistic --datatier RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3_2024 --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --filein file:step2.root --fileout file:step3.root
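
For concreteness, here is a minimal sketch of what those four modifications look like when appended to the configuration that cmsDriver.py generates (with --no_exec). The parameter names used for RepeatingCachedRootSource, AsciiOutputModule and PeriodicAllocMonitor are my assumptions and should be checked with edmPluginHelp rather than taken as exact:

# sketch: append to the generated step3 configuration
import FWCore.ParameterSet.Config as cms

# 1. source that caches the input events in memory and repeats them indefinitely
process.source = cms.Source("RepeatingCachedRootSource",
    fileName = cms.untracked.string("file:step2.root"),  # assumed parameter name
    repeatNEvents = cms.untracked.uint32(10))            # assumed parameter name

# 2. replace every output module with an AsciiOutputModule that keeps the same
#    outputCommands but writes no file and prints nothing
for name, module in list(process.outputModules_().items()):
    setattr(process, name, cms.OutputModule("AsciiOutputModule",
        outputCommands = module.outputCommands,
        verbosity = cms.untracked.uint32(0)))  # assumed parameter name

# 3. memory monitoring (use one or both depending on the measurement)
process.add_(cms.Service("SimpleMemoryCheck"))  # periodic-sampling option not shown
process.add_(cms.Service("PeriodicAllocMonitor",
    fileName = cms.untracked.string("periodicAlloc.log")))  # assumed parameter name

# 4. report "begin processing event" only every 100 events
process.MessageLogger.cerr.FwkReport.reportEvery = 100

# let the job run far more events than the cached input file contains
process.maxEvents.input = 1000000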

@cmsbuild (Contributor) commented Nov 4, 2024

A new Issue was created by @Dr15Jones.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones (Contributor Author)

assign reconstruction, dqm

@cmsbuild (Contributor) commented Nov 4, 2024

New categories assigned: reconstruction,dqm

@antoniovagnerini, @jfernan2, @mandrenguyen, @rseidita, you have been requested to review this pull request/issue and eventually sign. Thanks.

@Dr15Jones (Contributor Author)

From the RSS and presently allocated memory measurements I get the following time profile of the job (8 threads and 8 streams)

[plot: RSS and presently allocated memory versus time]

If I exclude the startup memory growth and plot allocated memory versus number of events started in the job we get

[plot: allocated memory versus number of events started]

which shows an approximate 330 byte memory growth per event processed.

Trying to make a similar plot using the RSS information shows more variance:

[plot: RSS versus number of events started]
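
For reference, the ~330 byte/event figure above can be extracted by fitting a line to the periodic samples. A minimal sketch, assuming the samples have already been pulled out into a two-column text file of (events started, allocated bytes); that file format is hypothetical, not the literal PeriodicAllocMonitor output:

# sketch: estimate memory growth per event from periodic samples
# assumes samples.txt has two whitespace-separated columns per line:
#   <events_started> <allocated_bytes>   (hypothetical format)
import numpy as np

events, allocated = np.loadtxt("samples.txt", unpack=True)

# drop the startup region where memory is still ramping up, then fit a line;
# the slope is the average growth in bytes per event started
steady = events > 0.1 * events.max()
slope, intercept = np.polyfit(events[steady], allocated[steady], 1)
print(f"~{slope:.0f} bytes of growth per event")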

@Dr15Jones (Contributor Author)

Without the PeriodicAllocMonitor (and without preloading libPerfToolsAllocMonitorPreload.so) the job runs at a poor 24 events/second. The modules taking the most time are:

[table: modules taking the most time per event]

The events are empty, so modules taking between ~5 and ~20 ms to process nothing seems pretty high.

@Dr15Jones (Contributor Author)

Thread scaling in the job is hampered by the following "one"-type modules (edm::one modules), which the framework must work around. [These showed up while running the StallMonitor service.]

StallMonitor>         Module label  # of stalls  Total stalled time  Max stalled time
StallMonitor>  -------------------  -----------  ------------------  ----------------
StallMonitor>  ecalMonitorTask                3          0.347438 s        0.129493 s
StallMonitor>  NANOEDMAODSIMoutput           10           2.86258 s          0.4244 s
StallMonitor>  SKIMStreamLogError             7           1.30872 s        0.266045 s
StallMonitor>  particleLevel                  6           1.08233 s         0.18974 s
StallMonitor>  rivetProducerHTXS              6           1.08204 s        0.189733 s

NOTE: the NANOEDMAODSIMoutput and SKIMStreamLogError had not yet been replaced with AsciiOutputModule at the time I made the measurement.
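
For completeness, the stall table above comes from the StallMonitor service; a minimal sketch of turning it on (assuming the usual fileName parameter spelling):

# sketch: enable the framework's StallMonitor service; without a fileName it
# still prints the end-of-job stall summary, the file adds the detailed trace
process.add_(cms.Service("StallMonitor",
    fileName = cms.untracked.string("stallMonitor.log")))  # assumed parameter name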

@Dr15Jones (Contributor Author)

My intent is to write a customise function which could be used to create a standard test for memory leaks/hoarding.
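
The shape such a function would take is the usual cmsDriver customise pattern: take the process, apply the changes listed at the top of this issue, and return it. A sketch, where the function name and contents are placeholders, not anything that exists yet:

# sketch of the intended customise function (name and contents are placeholders)
import FWCore.ParameterSet.Config as cms

def customiseForMemoryProbe(process):
    # apply the source/output/monitoring changes from the issue description,
    # e.g. as in the configuration sketch after the cmsDriver command above
    process.MessageLogger.cerr.FwkReport.reportEvery = 100
    process.add_(cms.Service("SimpleMemoryCheck"))
    return process

It would then be attached the same way as the existing customisation in the cmsDriver command, e.g. --customise SomePackage/SomeSubPackage/SomeFile.customiseForMemoryProbe (placeholder path).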

@Dr15Jones (Contributor Author) commented Nov 4, 2024

From the PeriodicAllocMonitor information, we do 1.1 million allocation requests each event. Some of that will be from copying the cached input data but the rest is just churn.

Revised:
I ran a job with just the RepeatingCachedRootSource and the AsciiOutputModule instances (I removed the DQM output module since no DQM modules are in that job) where the job did not read the data products from the cache, just asked which event to run next. It took less than 80 seconds to process 1M events using 8 threads and 8 streams, and the average number of allocations per event was less than 200.

Then if I did ask for each data product from each event it took 265 seconds to process 100k events (380 events/s) and it averaged 3400 allocations per event.

I ran the PeriodicAllocMonitor for both of those jobs (which does slow them down some) and there is no sign of any memory increase.

@Dr15Jones (Contributor Author)

See #46603

@Dr15Jones (Contributor Author)

So I dropped the DQM parts of the configuration and re-ran, processing 20,000 events (just to make things faster). The allocation monitoring does not show any memory increase, so it appears that the 300+ byte/event memory growth is coming from DQM-related code. In addition, without DQM the average live allocation is only 3.2 GB and the number of per-event allocations drops to 450k. Therefore DQM is contributing 3+ GB of memory (at 8 threads/8 streams) and more than doubles the number of allocations per event.

[plot: allocated memory without the DQM parts of the configuration]
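
For reference, a rough sketch of what dropping the DQM parts can look like, assuming the conventional names that cmsDriver gives the DQM paths and output module (anything containing "dqm"/"DQM", and DQMoutput); those names come from the usual step3 layout, not from this issue:

# sketch: strip DQM paths and the DQM output module from a step3 process;
# the name matching below relies on the usual cmsDriver conventions (assumption)
for name in list(process.endpaths_()) + list(process.paths_()):
    if "dqm" in name.lower():
        if process.schedule is not None:
            process.schedule.remove(getattr(process, name))
        delattr(process, name)
if hasattr(process, "DQMoutput"):
    delattr(process, "DQMoutput")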

@Dr15Jones (Contributor Author) commented Nov 6, 2024

So I used a script to generate a configuration that matches step3's module and path dependencies but with each producer replaced by the trivial AddIntsProducer test module. This lets us see the effect of the framework infrastructure alone on the memory behavior.

Running over 70,000 events using 1 thread, the job was doing 9,000 allocations per event with an average of 270 kB allocated per event (the configuration has 2000 modules, each doing at least 2 new calls, one for each of the two products it puts into the Event), and the live allocation stayed nearly constant at 22.6 MB.

Running over 140,000 events using 8 threads gave the same allocations and memory per event and a live allocation of 83.3 MB.

So the framework only accounts for about 1% of the allocations being done per event.
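
The generated skeleton looks roughly like the following: same module labels and path structure as step3, but every producer is the framework test module AddIntsProducer wired to mimic the original dependencies. The labels parameter name and the two-products-per-module behaviour are my reading of that test module, so treat the details as assumptions:

# sketch of the kind of skeleton configuration the generator script produces
import FWCore.ParameterSet.Config as cms

process = cms.Process("SKELETON")
process.source = cms.Source("EmptySource")
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(70000))

# one trivial stand-in per original producer, inputs wired to match the
# original module dependencies (labels shown here are just illustrative)
process.siPixelClusters = cms.EDProducer("AddIntsProducer",
    labels = cms.VInputTag())                       # assumed parameter name
process.siPixelRecHits = cms.EDProducer("AddIntsProducer",
    labels = cms.VInputTag("siPixelClusters"))

process.p = cms.Path(process.siPixelClusters + process.siPixelRecHits)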

@Dr15Jones (Contributor Author)

I used the fixed ModuleAllocMonitor service to look for modules in the job that do lots of allocations during event processing. Here are the largest:

valEmtfStage2Digis: L1TMuonEndCapTrackProducer uses 1.3M allocations on the first event of each stream and then 20k for all subsequent events. The startup is from here

track_finder_->process(iEvent, iSetup, *out_hits_tmp, *out_tracks);

slimmedPatTrigger:PATTriggerObjectStandAloneSlimmer uses 1.3M allocations on the first event of 6 of the 8 streams in the job and then just 100 on all others. The startup cost comes from here

struct key_hash {
  std::size_t operator()(edm::ParameterSetID const &iKey) const { return iKey.smallHash(); }
};
typedef tbb::concurrent_unordered_map<edm::ParameterSetID, std::vector<std::string>, key_hash> AllLabelsMap;
CMS_THREAD_SAFE AllLabelsMap allLabelsMap;
}  // namespace

and the reason only 6 of the 8 are affected is that those 6 streams hit that code section at about the same time, so they are all trying to fill that thread-safe global concurrently.

patTrigger:PATTriggerProducer always allocates 330k times each event. It does a lot of work on the HLT trigger information, which is still filled for this workflow even though the events are empty.

dqmHLTFiltersDQMonitor:HLTFiltersDQMonitor always allocates 158k times each event.

valBmtfDigis:L1TMuonBarrelTrackProducer always allocates 85k times each event.

HLTSiStripMonitorTrack: SiStripMonitorTrack always allocates 65k times each event.

All subsequent modules allocate less than 50k times each event.

@makortel (Contributor) commented Nov 7, 2024

valEmtfStage2Digis: L1TMuonEndCapTrackProducer uses 1.3M allocations on the first event of each stream and then 20k for all subsequent events. The startup is from here

track_finder_->process(iEvent, iSetup, *out_hits_tmp, *out_tracks);

This one already has an open issue #42526

HLTSiStripMonitorTrack: SiStripMonitorTrack always allocates 65k times each event.

This one has an open issue as well #46498

@Dr15Jones (Contributor Author)

See #46628 which decreases patTrigger:PATTriggerProducer allocations from 330k to 48k.

@mmusich (Contributor) commented Nov 8, 2024

HLTSiStripMonitorTrack: SiStripMonitorTrack always allocates 65k times each event.

For my own education, can you explain the mechanics behind obtaining these numbers?
I've been trying the following recipe:

cmsrel CMSSW_14_2_X_2024-11-07-2300
cd CMSSW_14_2_X_2024-11-07-2300/src/
cmsenv
runTheMatrix.py -l 12861.0 --maxSteps=2
cd 12861.0_NuGun+2024
cmsDriver.py step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,SKIM:LogError+LogErrorMonitor,PAT,NANO,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2024_realistic --datatier RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3_2024 --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --filein file:step2.root --fileout file:step3.root --no_exec
echo 'process.add_(cms.Service("ModuleAllocMonitor", fileName = cms.untracked.string("moduleAlloc.log")))' >> step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
echo 'process.ModuleAllocMonitor.moduleNames = cms.untracked.vstring(["HLTSiStripMonitorTrack"])' >> step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
setenv LD_PRELOAD "libPerfToolsAllocMonitorPreload.so libPerfToolsMaxMemoryPreload.so"
cmsRun step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
wget https://raw.githubusercontent.com/cms-sw/cmssw/refs/heads/master/PerfTools/AllocMonitor/scripts/edmModuleAllocMonitorAnalyze.py
python3 edmModuleAllocMonitorAnalyze.py moduleAlloc.log

(Notice that edmModuleAllocMonitorAnalyze.py was not in the PATH; I tried to amend that in #46633.)

That gives me:

Memory Report: total memory requested: 95750866
Memory Report:  max memory used: 9708024
Memory Report:  presently used: 643496
Memory Report:  # allocations calls:   17393
Memory Report:  # deallocations calls: 17024

How should I read these numbers?

@Dr15Jones (Contributor Author)

@mmusich you actually load two different monitors: one is a Service (ModuleAllocMonitor) and the other starts because you added it to LD_PRELOAD (libPerfToolsMaxMemoryPreload.so). The printout you refer to is from the LD_PRELOAD library and gives statistics about the entire program.

The data obtained by the ModuleAllocMonitor was written to the file you specified in the Service's configuration, i.e. moduleAlloc.log. That is where you want to look for information specific to HLTSiStripMonitorTrack. In principle you should be able to point edmModuleAllocMonitorAnalyze.py at the file moduleAlloc.log to get useful information.

@Dr15Jones (Contributor Author)

The 85k from L1TMuonBarrelTrackProducer is brought down to 19 for all but the first event of a stream in #46640.
