Extending TCAV to object detection

How do we extend the TCAV logic to object detection?

Let's say we use a network with an RPN (region proposal network) for object detection.
The output is then a class confidence and a bounding box for each proposal (usually thousands of them).
Next, we use IoU (intersection over union) to filter the predictions.
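
To make the filtering step concrete, here is a minimal sketch in plain Python. It assumes that the IoU filtering means greedy non-maximum suppression, that boxes are `(x1, y1, x2, y2)` tuples, and that the threshold is 0.5. The names `iou` and `nms` are illustrative and not taken from any particular detection library.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS (assumed filtering scheme): return indices of kept boxes."""
    # Visit boxes from highest to lowest confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Even after this step, several detections usually remain per image, so the filtered output is still a set of predictions rather than a single score.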

This is not as simple as with an image classifier, which produces a single logit per class.
How do we pick which logit, out of so many predictions, to use for the gradient calculation?
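
One candidate answer, sketched here only to make the question concrete and not something TCAV itself prescribes: take the predictions that survive filtering and use the target-class logit of the top-scoring ones as the scalar(s) to differentiate. The function name `select_target_logits`, the `top_k` parameter, and the layout of `class_logits` (one list of per-class logits per proposal) are all assumptions for illustration.

```python
def select_target_logits(class_logits, keep, class_id, top_k=1):
    """Pick (proposal_index, logit) pairs for the target class.

    class_logits: per-proposal lists of class logits (assumed layout).
    keep: indices of proposals that survived filtering.
    class_id: index of the class whose concept sensitivity we want.
    top_k: how many of the strongest kept proposals to use (assumption).
    """
    # Rank the surviving proposals by their logit for the target class.
    ranked = sorted(keep, key=lambda i: class_logits[i][class_id], reverse=True)
    return [(i, class_logits[i][class_id]) for i in ranked[:top_k]]


class_logits = [[2.0, 0.1], [0.3, 1.5], [1.2, 0.4]]
keep = [0, 2]  # hypothetical indices left after filtering
selected = select_target_logits(class_logits, keep, class_id=0)
```

Each selected logit could then play the role that the single class logit plays for a classifier, with its gradient with respect to the chosen layer's activations computed by the framework. Whether to use one detection, the top few, or an aggregate over all kept detections is exactly the open question.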