Add a tensorrt backend #33
base: master
Conversation
I'll add the rest of my comments once there's a clear story for multi-GPU support.
self.context = self.trt_engine.create_execution_context()
# create buffers for inference
self.buffers = {}
for batch_size in range(1, self.trt_engine.max_batch_size + 1):
I'm leaving a comment here to remind me:
We need to do some memory measurement to figure out if all of these buffers are necessary. I wonder if allocating buffers only for batch sizes [1, 2, 5, 10] or other combinations might be better.
Things to test:
- How many TensorRT models can the NX hold?
- How much extra memory does this allocate (relating to "Use an s3 bucket to store large files instead of Git LFS" #1)?
- What's the speed if we do [1, 2, 10] vs [1, 2, 5, 10] vs [1, 2, 3, 4, 5, 6, ..., 10]?
Another question we'll have to figure out: Should this be configurable via the init?
I'll do some tests to figure out how much memory is needed for those buffers. Another thought: if we don't get a performance improvement from a larger batch size, we don't have to do this at all. Based on my tests, a larger batch size improves inference time by about 10% but lowers preprocessing performance, so the overall performance is actually a little lower than with a small batch size.
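To make the idea concrete, here is a minimal sketch of what a configurable, sparse buffer allocation could look like. The helper names (`_create_buffers`, `_allocate_buffers`, `_buffer_for`) and the default size list are assumptions for illustration, not the API in this PR:

```python
# Sketch only: allocate buffers for a configurable subset of batch sizes
# instead of every size in range(1, max_batch_size + 1).
DEFAULT_BUFFER_BATCH_SIZES = [1, 2, 5, 10]  # assumed default, to be tuned


def _create_buffers(self, buffer_batch_sizes=None):
    """Allocate host/device buffers only for the requested batch sizes."""
    if buffer_batch_sizes is None:
        buffer_batch_sizes = DEFAULT_BUFFER_BATCH_SIZES
    self.buffers = {}
    for batch_size in buffer_batch_sizes:
        if batch_size > self.trt_engine.max_batch_size:
            continue  # skip sizes the engine cannot serve
        # _allocate_buffers is a hypothetical helper returning
        # (inputs, outputs, bindings, stream) for the given batch size
        self.buffers[batch_size] = self._allocate_buffers(batch_size)


def _buffer_for(self, batch_size):
    """Return the smallest pre-allocated buffer that fits the batch."""
    usable = [size for size in self.buffers if size >= batch_size]
    return self.buffers[min(usable)]
```

If we go this way, the `buffer_batch_sizes` list could also be exposed through `__init__`, which would answer the configurability question above.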
self.ctx.pop()
return final_outputs

def _prepare_post_process(self):
I'm starting to think that there are too many constants and GridNet-specific functions here, and it might be easier to make a separate class specifically for parsing GridNet bounding boxes.
For now, let's clean up the rest of the code first, then discuss how that would work.
These constants are only necessary for detectors; maybe we need another parameter like is_detector in the constructor to indicate whether this capsule is a detector or a classifier?
Or we could check whether these constants exist before calling the post-process function.
Yeah, but I'm thinking that this is super duper specific to GridNet detectors in particular. Maybe we can just offer a function for parsing GridNet detector outputs, and name it as such.
class GridNetParser:
    def __init__(self, parameters):
        ...

    def parse_detection_results(self, prediction):
        ...


class BaseTensorRTBackend:
    ...
The benefit would be to separate all of these GridNet-specific parameters out of the BaseTensorRTBackend 🤔
Great idea, we should have separate parsers for different architectures.
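To make the split concrete, a rough usage sketch (the class name, constructor arguments, and method names here are placeholders, not the final API):

```python
# Hypothetical capsule that composes the two classes, keeping
# BaseTensorRTBackend free of GridNet-specific parameters.
class MyDetectorCapsule:
    def __init__(self):
        self.backend = BaseTensorRTBackend("detector.trt")  # placeholder args
        self.parser = GridNetParser(parameters={})          # placeholder params

    def process(self, frames):
        raw_outputs = self.backend.batch_predict(frames)
        return [self.parser.parse_detection_results(out) for out in raw_outputs]
```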
outputs.append(HostDeviceMem(host_mem, device_mem))
return inputs, outputs, bindings, stream

def do_inference(self, bindings: List[int], inputs: List[HostDeviceMem], outputs: List[HostDeviceMem],
Suggested change:
def _do_inference(self, bindings: List[int],
                  inputs: List[HostDeviceMem],
                  outputs: List[HostDeviceMem],
                  stream: cuda.Stream,
                  batch_size: int = 1) -> List[List[float]]:
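For context, the body of this method could follow the usual pycuda + TensorRT implicit-batch pattern. This is a sketch rather than the code in this PR; it assumes `HostDeviceMem` exposes `.host` and `.device` attributes as in the NVIDIA samples, and that `self.context` is the execution context created earlier:

```python
import pycuda.driver as cuda


def _do_inference(self, bindings, inputs, outputs, stream, batch_size=1):
    # Copy input data from host to device asynchronously
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Run inference on the CUDA stream
    self.context.execute_async(batch_size=batch_size, bindings=bindings,
                               stream_handle=stream.handle)
    # Copy predictions back from device to host
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Wait for all queued work on the stream to finish
    stream.synchronize()
    return [out.host for out in outputs]
```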
def batch_predict(self, input_data_list: List[Any]) -> List[Any]:
    task_size = len(input_data_list)
    curr_index = 0
    while curr_index < task_size:
This logic may need to be revisited if we decide not to have buffers [0->10], and instead have combinations of [1, 2, 5, 10], for example
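For example, the loop could pick the largest pre-allocated batch size that still fits the remaining work (a sketch; `_process_batch` is a hypothetical helper that runs one batch through the buffers for that size):

```python
def batch_predict(self, input_data_list):
    task_size = len(input_data_list)
    curr_index = 0
    results = []
    while curr_index < task_size:
        remaining = task_size - curr_index
        # Largest pre-allocated batch size that doesn't exceed the remaining
        # work; fall back to the smallest buffer if nothing fits (the final
        # partial batch then underfills that buffer).
        candidates = [size for size in self.buffers if size <= remaining]
        batch_size = max(candidates) if candidates else min(self.buffers)
        batch = input_data_list[curr_index:curr_index + batch_size]
        results.extend(self._process_batch(batch, batch_size))
        curr_index += batch_size
    return results
```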
Co-authored-by: Alex Thiel <[email protected]>
…ules into tensorrt_backend
@apockill I resolved most of your comments except the post-process stuff. The code is tested and working, and the performance is almost the same. I'll continue to work on the GridParser stuff. In the meantime, you can take another look.
out_lists = [out_array.tolist() for out_array in out_array_by_batch]
batch_outputs.append(out_lists)
final_outputs = list(zip(*batch_outputs))
final_outputs = [list(item) for item in final_outputs]
Is there a reason we need to cast each item to a list? After zip, each item is already held as a tuple.
I just wanted to match the original type hint. I can also change the type hint instead.
Got it. Yeah, just change the type hint. Tuples are cheaper and faster anyway.
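For reference, keeping the tuples would just mean adjusting the annotation along these lines (a sketch; the real signature may differ):

```python
from typing import Any, List, Tuple


def _collect_outputs(batch_outputs: List[List[Any]]) -> List[Tuple[Any, ...]]:
    # zip already yields tuples, so no per-item list() conversion is needed
    return list(zip(*batch_outputs))
```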
Add BaseTensorRTBackend that provides support for TensorRT models