DEEP INTO TRTIS: BERT PRACTICAL DEPLOYMENT ON NVIDIA GPU
Xu Tianhao, Deep Learning Solution Architect, NVIDIA
AGENDA
• TensorRT Hyperscale Inference Platform overview
• TensorRT Inference Server
• Overview and Deep Dive: Key features
• Deployment possibilities: Generic deployment ecosystem
• Hands-on
• NVIDIA BERT Overview
• FasterTransformer and TRT optimized BERT inference
• Deploy BERT TensorFlow model with custom op
• Deploy BERT TensorRT model with plugins
• Benchmarking
• Open Discussion
TENSORRT HYPERSCALE INFERENCE PLATFORM
• WORLD'S MOST ADVANCED SCALE-OUT GPU
• INTEGRATED INTO TENSORFLOW & ONNX SUPPORT
• TENSORRT INFERENCE SERVER
ANNOUNCING TESLA T4
WORLD'S MOST ADVANCED INFERENCE GPU
Universal Inference Acceleration
320 Turing Tensor Cores
2,560 CUDA cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16GB | 320GB/s
NEW TURING TENSOR CORE
MULTI-PRECISION FOR AI INFERENCE
65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4
WORLD'S MOST PERFORMANT INFERENCE PLATFORM
Up To 36X Faster Than CPUs | Accelerates All AI Workloads
[Charts: Speedup vs. CPU Server — CPU Server | Tesla P4 | Tesla T4]
• Natural Language Processing Inference (GNMT): 1.0x | 10x | 36x
• Speech Inference (DeepSpeech 2): 1.0x | 4x | 21x
• Video Inference (ResNet-50, 7ms latency limit): 1.0x | 10x | 27x
[Chart: Peak Performance (TFLOPS / TOPS) — P4: 5.5 FP32 TFLOPS, 22 INT8 TOPS; T4: 65 FP16 TFLOPS, 130 INT8 TOPS, 260 INT4 TOPS]
NVIDIA TENSORRT OVERVIEW
From Every Framework, Optimized For Each Target Platform
[Diagram: TensorRT takes trained models from every framework and targets Tesla V100, DRIVE PX 2, NVIDIA T4, Jetson TX2, and NVIDIA DLA]
NVIDIA TENSORRT OVERVIEW
From Every Framework, Optimized For Each Target Platform
• Quantized INT8 (Precision Optimization): significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss
• Layer Fusion (Graph Optimization): improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution
• Kernel Auto-Tuning: optimizes execution time by choosing the best data layout and best parallel algorithms for the target Jetson, Tesla, or DrivePX GPU platform
• Dynamic Tensor Memory (Memory Optimization): reduces memory footprint and improves memory re-use by allocating memory for each tensor only for the duration of its usage
• Multi-Stream Execution: scales to multiple input streams by processing them in parallel using the same model and weights
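As a rough illustration of how these optimizations are requested, here is a minimal sketch using the TensorRT 6-era Python builder API (the ONNX model path and sizes are placeholders; INT8 additionally needs a calibrator, omitted here):

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    def build_fp16_engine(onnx_path, max_batch=8):
        builder = trt.Builder(TRT_LOGGER)
        network = builder.create_network()
        parser = trt.OnnxParser(network, TRT_LOGGER)
        with open(onnx_path, 'rb') as f:
            parser.parse(f.read())            # import the trained FP32 graph
        builder.max_batch_size = max_batch
        builder.max_workspace_size = 1 << 30  # scratch space for kernel auto-tuning
        builder.fp16_mode = True              # allow reduced-precision kernels
        # layer fusion and kernel auto-tuning happen during the build
        return builder.build_cuda_engine(network)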
GRAPH OPTIMIZATION
• Vertical Fusion
• Horizontal Fusion
• Layer Elimination
[Diagram: an un-optimized Inception-style network (1x1/3x3/5x5 conv + bias + relu branches, max pool, concat) vs. the TensorRT-optimized network, where each conv + bias + relu branch is fused into a single 1x1/3x3/5x5 CBR kernel and redundant concat layers are eliminated]

Network | Layers before | Layers after
VGG19 | 43 | 27
Inception V3 | 309 | 113
ResNet-152 | 670 | 159
TENSORRT PERFORMANCE
developer.nvidia.com/tensorrt
40x Faster CNNs on V100 vs. CPU-Only, Under 7ms Latency (ResNet50)
140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

[Chart: ResNet50 inference throughput and latency — CPU-Only: 140 images/sec @ 14 ms; V100 + TensorFlow: 305 images/sec @ 6.67 ms; V100 + TensorRT: 5,700 images/sec @ 6.83 ms]
Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.6GHz 3.5GHz Turbo (Broadwell) HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6GHz 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.

[Chart: OpenNMT inference throughput and latency — CPU-Only + Torch: 4 sentences/sec @ 280 ms; V100 + Torch: 25 sentences/sec @ 153 ms; V100 + TensorRT: 550 sentences/sec @ 117 ms]
Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6GHz 3.5GHz Turbo (Broadwell) HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6GHz 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.6GHz 3.5GHz Turbo (Broadwell) HT On.
AGENDA
• TensorRT Hyperscale Inference Platform overview
• TensorRT Inference Server
• Overview and Deep Dive: Key features
• Deployment possibilities: Generic deployment ecosystem
• Hands-on
• NVIDIA BERT Overview
• FasterTransformer and TRT optimized BERT inference
• Deploy BERT TensorFlow model with custom op
• Deploy BERT TensorRT model with plugins
• Benchmarking
• Open Discussion
INEFFICIENCY LIMITS INNOVATION
Difficulties with Deploying Data Center Inference
• Single Model Only: some systems are overused while others are underutilized
• Single Framework Only: solutions can only support models from one framework
• Custom Development: developers need to reinvent the plumbing for every application
[Diagram: separate, siloed serving stacks for ASR, NLP, and recommender workloads]
NVIDIA TENSORRT INFERENCE SERVER
Architected for Maximum Datacenter Utilization
Maximize real-time inference
performance of GPUs
Quickly deploy and manage multiple
models per GPU per node
Easily scale to heterogeneous GPUs
and multi GPU nodes
Integrates with orchestration
systems and auto scalers via latency
and health metrics
Now open source for thorough
customization and integration
[Diagram: TensorRT Inference Server instances serving models across heterogeneous nodes — NVIDIA T4, Tesla V100, and Tesla P4 GPUs]
FEATURES
Utilization | Usability | Performance | Customization
• Dynamic Batching: inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA
• Concurrent Model Execution: multiple models (or multiple instances of the same model) may execute on the GPU simultaneously
• CPU Model Inference Execution: framework-native models can execute inference requests on the CPU
• Multiple Model Format Support: PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow+TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT Plans, Caffe2 NetDef (ONNX import path)
• Metrics: utilization, count, memory, and latency
• Model Control API: explicitly load/unload models into and out of TRTIS based on changes made in the model-control configuration
• System/CUDA Shared Memory: inputs/outputs that need to be passed to/from TRTIS are stored in system/CUDA shared memory, reducing HTTP/gRPC overhead
• Library Version: link against libtrtserver.so to include all the inference server functionality directly in your application
• Custom Backend: gives the user more flexibility by providing their own implementation of an execution engine through the use of a shared library
• Model Ensemble: a pipeline of one or more models and the connection of input and output tensors between those models (can be used with custom backends)
• Streaming API: built-in support for audio streaming input, e.g. for speech recognition
INFERENCE SERVER ARCHITECTURE
Models supported:
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT Plans
● Caffe2 NetDef (ONNX import)
● ONNX graph
● PyTorch JIT (.pt)
Multi-GPU support
Concurrent model execution
Server HTTP REST API/gRPC
Python/C++ client libraries
Available with Monthly Updates
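For example, a minimal Python client sketch, assuming the 19.xx tensorrtserver client wheel and a model named "mymodel" with one FP32 input and output of size 16 (matching the configuration example shown later; names and shapes are illustrative):

    import numpy as np
    from tensorrtserver.api import InferContext, ProtocolType

    # connect to the server's HTTP endpoint and the model
    ctx = InferContext("localhost:8000", ProtocolType.HTTP, "mymodel")
    input0 = np.zeros(16, dtype=np.float32)   # one batch element
    result = ctx.run({"input0": (input0,)},
                     {"output0": InferContext.ResultFormat.RAW},
                     batch_size=1)
    print(result["output0"][0])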
COMMON WAYS TO FULLY UTILIZE GPU
1. Increase computation intensity: increase the batch size
2. Execute multiple tasks concurrently with multiple streams or MPS (Multi-Process Service)
DYNAMIC BATCHING SCHEDULER
[Diagram: batch-1 and batch-4 requests arrive at the TensorRT Inference Server; the dynamic batcher in the framework backend groups them before handing them to the runtime's execution contexts]
DYNAMIC BATCHING SCHEDULER
Grouping requests into a single "batch" increases overall GPU throughput.
Preferred batch size and wait time are configuration options; assume a preferred batch size of 4 gives the best utilization in this example.
[Diagram: the dynamic batcher in the ModelY backend combines four queued requests into one batch-4 execution on a runtime context]
DYNAMIC BATCHING
TensorRT Inference Server groups inference requests based on customer-defined metrics for optimal performance.
The customer defines 1) batch size (required) and 2) latency requirements (optional).
Example: no dynamic batching (batch size 1 & 8) vs. dynamic batching:
2.5X Faster Inferences/Second at a 50ms End-to-End Server Latency Threshold
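Dynamic batching is enabled per model in its config.pbtxt; for example (the same block appears in the model configuration slide later):

    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 100
    }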
MPS VS CUDA STREAMS IN TRTIS
TRTIS CUDA Streams are 1-4% slower than MPS but provide
some usability advantages and other methods to maximize
performance over MPS limitations
MPS
• Multiple processes on a single GPU (no
interconnect/intercommunication between processes)
• Shares GPU memory between multiple processes; if one process oversubscribes the memory, the others are starved — harder to coordinate memory usage
• Experimental in nv-docker
CUDA Streams
• One process on a single GPU with multiple
streams/execution contexts
• More holistic view of memory - easier to coordinate
memory usage
• Maximize GPU utilization by using batching vs having
several processes executing at batch size 1
Concurrent Model Execution
Create one execution context for each instance of a group of a certain model:

    max_batch_size: 8
    instance_group [
      {
        count: 4
        kind: KIND_GPU
        gpus: [ 0, 1 ]
      },
      {
        count: 4
        kind: KIND_CPU
        gpus: [ 3, 4 ]
      }
    ]
CONCURRENT MODEL EXECUTION - RESNET 50
4x Better Performance and Improved GPU Utilization Through Multiple Model Concurrency
[Diagram: 14 concurrent inference requests enter the TensorRT Inference Server's ResNet50 request queue and are spread across 12 RN50 model instances, each on its own CUDA stream, on a V100 16GB GPU]
Common Scenario
One API using multiple copies of the same model on a GPU.
Example: 12 instances of TRT FP16 ResNet50 (each model takes 1.33GB GPU memory) are loaded onto the GPU and can run concurrently on a 16GB V100 GPU.
14 concurrent inference requests happen: each model instance fulfills one request simultaneously and 2 are queued in the per-model scheduler queues in TensorRT Inference Server to execute after the 12 requests finish.
With this configuration, 2832 inferences per second at 33.94 ms with batch size 8 on each inference server instance is achieved.
Concurrent Model Execution
Create one execution context for each instance of a group of a certain model:

    max_batch_size: 8
    instance_group [
      {
        count: 4
        kind: KIND_GPU
        gpus: [ 0, 1 ]
      },
      {
        count: 4
        kind: KIND_CPU
        gpus: [ 3, 4 ]
      }
    ]

• Scheduling threads
• Multiple streams
• Priority: MAX, DEFAULT, MIN
Model Control and Model Configuration
Performing an HTTP POST to /api/modelcontrol/<load|unload>/<model name> loads or unloads a model from the inference server.
Model Control Modes
1) NONE
• Server attempts to load all models at startup
• Changes to the model repo will be ignored
• Model control API requests will have no effect
2) POLL
• Server attempts to load all models at startup
• Changes to the model repo will be detected and the server will attempt to load and unload models based on those changes
• Model control requests will have no effect
3) EXPLICIT
• Server does not load any models in the model repo at startup
• All model loading and unloading must be initiated using the Model Control API
[Diagram: local model repository]
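For example, assuming the server's HTTP endpoint is on port 8000 and the server runs in EXPLICIT mode, a model named bert_trt can be loaded and unloaded with:

    curl -X POST localhost:8000/api/modelcontrol/load/bert_trt
    curl -X POST localhost:8000/api/modelcontrol/unload/bert_trt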
Model Control and Model Configuration

    name: "mymodel"
    platform: "tensorrt_plan"
    max_batch_size: 8
    input [
      {
        name: "input0"
        data_type: TYPE_FP32
        dims: [ 16 ]
        reshape: { shape: [ ] }
      }
    ]
    output [
      {
        name: "output0"
        data_type: TYPE_FP32
        dims: [ 16 ]
      }
    ]
    version_policy: { all { } }
    instance_group [
      {
        count: 2
        kind: KIND_GPU
        gpus: [ 0, 1 ]
      }
    ]
    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 100
    }
    optimization {
      graph { level: 1 }
      cuda { graphs: 1 }
      priority: PRIORITY_MAX
    }

• dims: -1 for dynamic dimensions; reshape for the dims the model actually accepts
• Supports multiple backends (platform)
• Version control: serve selected versions
• Instances for concurrent execution: select multiple GPUs, select CPU or GPU for execution; there can be multiple groups
• Preferred batch size is configurable; set max queue delay for SLA control
• Multiple optimizations: set graph level to 1 to trigger XLA for TF; set cuda graphs to 1 to use CUDA Graphs for small-batch inference; set priority to MAX to raise scheduler thread priority and CUDA stream priority (TRT only for now)
• ExecutionAccelerators: enable onnx-tensorrt or tensorflow-tensorrt to automatically benefit from TensorRT integration
MODEL ENSEMBLING
• Pipeline of one or more models and the connection of input and output tensors between those models
• Use for model stitching or data flow of multiple models, such as data preprocessing → inference → data post-processing
• Collects the output tensors in each step and provides them as input tensors for other steps according to the specification
• Ensemble models inherit the characteristics of the models involved, so the meta-data in the request header must comply with the models within the ensemble
• Any step can also be a custom backend

    ensemble_scheduling {
      step [
        {
          model_name: "image_preprocess_model"
          model_version: -1
          input_map {
            key: "RAW_IMAGE"
            value: "IMAGE"
          }
          output_map {
            key: "PREPROCESSED_OUTPUT"
            value: "preprocessed_image"
          }
        },
        {
          model_name: "classification_model"
          model_version: -1
          input_map {
            key: "FORMATTED_IMAGE"
            value: "preprocessed_image"
          }
          output_map {
            key: "CLASSIFICATION_OUTPUT"
            value: "CLASSIFICATION"
          }
        },
        {
          model_name: "segmentation_model"
          model_version: -1
          input_map {
            key: "FORMATTED_IMAGE"
            value: "preprocessed_image"
          }
          output_map {
            key: "SEGMENTATION_OUTPUT"
            value: "SEGMENTATION"
          }
        }
      ]
    }
CUSTOM BACKEND
Integrate custom, non-framework code into TRTIS
It is not uncommon for a model to have some non-ML-model parts
• BERT: tokenizer, feature extractor
A custom backend allows these parts to be integrated into TRTIS
Implement the code as a shared library using a backwards-compatible C API
Benefit from the full TRTIS feature set (same as framework backends)
• Dynamic batcher, sequence batcher, concurrent execution, multi-GPU, etc.
Provides deployment flexibility; TRTIS provides a standard, consistent interface protocol between models and custom components
STREAMING INFERENCE REQUESTS
New Streaming API
Based on the correlation ID, the audio requests are sent to the appropriate batch slot in the sequence batcher*
[Diagram: inference requests tagged with correlation IDs (Corr 1-3) flow into per-model request queues and are routed to batch slots in the DeepSpeech2 and Wav2Letter sequence batchers in the framework inference backend]
*Correct order of requests is assumed at entry into the endpoint. Note: Corr = Correlation ID
STREAMING API
FSM maintained in StreamInferContext
[State diagram: a non-streaming request runs Request Done → Finished Done, then CompleteExecution() is called to write the result back; a streaming (bidirectional) request runs Request Done → Initialized Done, loops through Next / Response Done while data remains, resets between sequences, and on an unexpected event or finish starts writing all remaining data back]
TRTIS LIBRARY VERSION
Tightly couple TRTIS functionality into the controlling application via a shared library
Smaller binary: plug the TRTIS library into an existing application
Removes the REST and gRPC endpoints
Still leverages GPU optimizations like dynamic batching and model concurrency
Very low communication overhead (same system and CUDA memory address space)
Backward-compatible C interface
AVAILABLE METRICS
GPU Utilization (per GPU, per second):
• Power usage — proxy for load on the GPU
• Power limit — maximum GPU power limit
• GPU utilization — utilization rate [0.0 - 1.0)
GPU Memory (per GPU, per second):
• GPU total memory — total GPU memory, in bytes
• GPU used memory — used GPU memory, in bytes
Count, GPU & CPU (per model, per request):
• Request count — number of inference requests
• Execution count — number of model inference executions; request count / execution count = average dynamic request batching
• Inference count — number of inferences performed (one request counts as "batch size" inferences)
Latency, GPU & CPU (per model, per request):
• Request time — end-to-end inference request handling time
• Compute time — time a request spends executing the inference model (in the appropriate framework)
• Queue time — time a request spends waiting in the queue before being executed
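The metrics are exported in Prometheus text format; assuming the server's default metrics port 8002, they can be inspected with:

    curl localhost:8002/metrics

A Prometheus server can scrape the same endpoint to feed dashboards and auto scalers.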
PERF_CLIENT TOOL
• Measures throughput (inf/s) and latency under varying client loads
• perf_client modes:
1. Specify how many concurrent outstanding requests to maintain, and it will find a stable latency and throughput for that level
2. Generate a throughput vs. latency curve by increasing the request concurrency until a specified latency or concurrency limit is reached
• Generates a file containing CSV output of the results
• Easy steps to help visualize the throughput vs. latency tradeoffs
GENERIC INFERENCE SERVER DEPLOYMENT ARCHITECTURE
[Architecture diagram: REC and IMG clients call APIs (e.g. ASR) through a load balancer into a containerized inference service (CPU/GPU) that wraps pre-processing, TRTIS, and post-processing; a cluster-level metrics service drives an auto scaler; TRTIS GPU nodes run multiple workloads (TensorRT, TensorFlow, C2/ONNX) and pull models from a model repository on a persistent volume / network storage location, populated by your training/pruning/validation flow. Legend distinguishes already-existing components from those new from NVIDIA]
TENSORRT INFERENCE SERVER COLLABORATION WITH KUBEFLOW
For a more detailed explanation and step-by-step guidance for this collaboration, refer to this GitHub repo.
What is Kubeflow?
• Open-source project to make ML workflows on Kubernetes simple, portable, and
scalable
• Customizable scripts and configuration files to deploy containers on their chosen
environment
Problems it solves
• Easily set up an ML stack/pipeline that can fit into the majority of enterprise
datacenter and multi-cloud environments
How it helps TensorRT Inference Server
• TensorRT Inference Server is deployed as a component inside of a production
workflow to
• Optimize GPU performance
• Enable auto-scaling, traffic load balancing, and redundancy/failover via
metrics
TRTIS Helm Chart
Helm: the most used "package manager" for Kubernetes
We built a simple chart ("package") for the TensorRT Inference Server. You can use it to easily deploy an instance of the server. It can also be easily configured to point to a different image, model store, etc.
https://github.com/NVIDIA/tensorrt-inference-server/tree/b6b45ead074d57e3d18703b7c0273672c5e92893/deploy/single_server
A simple helm chart for installing a single instance of the NVIDIA TensorRT Inference Server
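A minimal sketch of using the chart, assuming Helm v2 (current at the time) and the chart path from the URL above:

    git clone https://github.com/NVIDIA/tensorrt-inference-server.git
    cd tensorrt-inference-server/deploy/single_server
    # values.yaml can be edited first to point at your image and model store
    helm install .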
AGENDA
• TensorRT Hyperscale Inference Platform overview
• TensorRT Inference Server
• Overview and Deep Dive: Key features
• Deployment possibilities: Generic deployment ecosystem
• Hands-on
• NVIDIA BERT Overview
• FasterTransformer and TRT optimized BERT inference
• Deploy BERT TensorFlow model with custom op
• Deploy BERT TensorRT model with plugins
• Benchmarking
• Open Discussion
WHAT IS BERT?
BERT: Bidirectional Encoder Representations from Transformers
Widely used in multiple NLP tasks due to its high accuracy.
WHAT IS BERT
Transformer Encoder Part
TENSORFLOW INFERENCE
Out-of-the-box TF inference is not efficient:
1. TF ops are very small, so kernel launches cost a lot of time; e.g., GELU/LayerNorm consists of several small ops
2. Multi-head self-attention lacks an efficient GPU implementation
3. TF scheduling is not good
NVIDIA'S INFERENCE
Optimization ideas:
1. Optimize the calculations with CUDA and integrate the implementation into TF with a custom op (see the sketch below)
2. Optimize the inference with TensorRT
3. Algorithm-level acceleration
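For idea 1, the CUDA kernels are packaged as a TensorFlow custom op; loading the shared library (built in the hands-on section below) is a one-liner, and the fused ops it registers then replace the small stock ops in the graph:

    import tensorflow as tf

    # load the FasterTransformer custom op library built from the project
    ft_lib = tf.load_op_library('./libtf_fastertransformer.so')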
NVIDIA'S INFERENCE
CUDA Optimization - Performance

<batch_size, layers, seq_len, head_num, size_per_head> | P4 FP32 (ms) | T4 FP32 (ms) | T4 FP16 (ms)
(1, 12, 32, 12, 64) | 3.43 | 2.74 | 1.56
(1, 12, 64, 12, 64) | 4.04 | 3.64 | 1.77
(1, 12, 128, 12, 64) | 6.22 | 5.93 | 2.23

Performance over different seq_len on P4 and T4
NVIDIA'S INFERENCE
CUDA Optimization - Resources
Where you can find it:
FasterTransformer project (open-sourced): https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer
NVIDIA'S INFERENCE
TRT Optimization
[Diagrams: the BERT compute graph before and after TensorRT optimization with plugins]
NVIDIA'S INFERENCE
TRT Optimization - Resources
Where you can find it:
BERT TRT demo (open-sourced): https://github.com/NVIDIA/TensorRT/tree/release/6.0/demo/BERT (to be re-located to DeepLearningExamples)
Blog: https://devblogs.nvidia.com/nlu-with-tensorrt-bert/
HANDS-ON
Deploy BERT TensorFlow model with custom op
1. Follow the FasterTransformer README and generate the custom op lib: libtf_fastertransformer.so
2. Prepare gemm_config.in with the best GEMM algorithms by running the built binaries
3. Modify sample/tensorflow_bert/profile_bert_inference.py to create the SQuAD model, and use the saved_model API to export the model (see the sketch below)
4. Arrange the exported model in a tree structure: ./bert_ft/1/model.savedmodel/<exported files>
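Step 3's export might look like the following sketch (TF 1.x; sess and the input placeholders/output tensor come from the modified profile_bert_inference.py, and the tensor names match the config.pbtxt shown later):

    import tensorflow as tf

    # assumes: sess, input_ids, input_mask, segment_ids (placeholders), prediction (output tensor)
    tf.saved_model.simple_save(
        sess, "./bert_ft/1/model.savedmodel",
        inputs={"input_ids": input_ids,
                "input_mask": input_mask,
                "segment_ids": segment_ids},
        outputs={"prediction": prediction})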
HANDS-ON
Deploy BERT TensorRT model with plugins
1. Follow the TensorRT/demo/BERT README and generate the plugin libs: libbert_plugins.so and libcommon.so
2. Follow the README and run sample_bert with the additional arg '--saveEngine=model.plan'
3. Arrange the model dir in a tree structure: bert_trt/1/model.plan
HANDS-ON
Prepare model_repository, run trtserver and perf_client
1. Prepare model_repository

Model directory:

    model_repository/
    |-- bert_fastertransformer
    |   |-- 1
    |   |   `-- model.savedmodel
    |   |       |-- saved_model.pb
    |   |       `-- variables
    |   |           |-- variables.data-00000-of-00001
    |   |           `-- variables.index
    |   `-- config.pbtxt
    `-- bert_trt
        |-- 1
        |   `-- model.plan
        `-- config.pbtxt

config.pbtxt for bert_fastertransformer:

    name: "bert_fastertransformer"
    platform: "tensorflow_savedmodel"
    input [
      {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [ 1, 128 ]
      },
      {
        name: "input_mask"
        data_type: TYPE_INT32
        dims: [ 1, 128 ]
      },
      {
        name: "segment_ids"
        data_type: TYPE_INT32
        dims: [ 1, 128 ]
      }
    ]
    output [
      {
        name: "prediction"
        data_type: TYPE_FP32
        dims: [ 2, 1, 128 ]
      }
    ]
    instance_group {
      kind: KIND_GPU
      count: 1
    }
    version_policy: { specific { versions: [ 1 ] } }

config.pbtxt for bert_trt:

    name: "bert_trt"
    platform: "tensorrt_plan"
    max_batch_size: 1
    input [
      {
        name: "segment_ids"
        data_type: TYPE_INT32
        dims: [ 128 ]
      },
      {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [ 128 ]
      },
      {
        name: "input_mask"
        data_type: TYPE_INT32
        dims: [ 128 ]
      }
    ]
    output [
      {
        name: "cls_squad_logits"
        data_type: TYPE_FP32
        dims: [ 128, 2, 1, 1 ]
      }
    ]
    instance_group {
      kind: KIND_GPU
      count: 1
    }
    version_policy: { specific { versions: [ 32 ] } }
HANDS-ON
Prepare model_repository, run trtserver and perf_client
1. Launch trtserver serving HTTP/gRPC:
   1. NV_GPU=x nvidia-docker run --rm -it --name=trtis_bert -p8000:8000 -p8001:8001 -v/path/to/model_repository:/models nvcr.io/nvidia/tensorrtserver:19.11-py3
   2. export LD_PRELOAD=/path/to/libcommon.so:/path/to/libbert_plugins.so:/path/to/libtf_fastertransformer.so
   3. trtserver --model-store=/models --log-verbose=1 --strict-model-config=True
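Before benchmarking, the server and models can be sanity-checked over the v1 HTTP API (assuming the default port mapping above):

    curl localhost:8000/api/health/ready      # returns 200 when the server is ready
    curl localhost:8000/api/status/bert_trt   # status of the bert_trt model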
HANDS-ON
Prepare model_repository, run trtserver and perf_client
1. Run perf_client to infer over gRPC:
   1. Launch the client docker: docker run --net=host --rm -it nvcr.io/nvidia/tensorrtserver-clients
   2. ./install/bin/perf_client -m bert_trt -d -c8 -l200 -p2000 -b1 -i grpc -u localhost:8001 -t1 --max-threads=8

Result reported by perf_client:

    Request concurrency: 1
    Client:
      Request count: 59
      Throughput: 944 infer/sec
      Avg latency: 34422 usec (standard deviation 288 usec)
      p50 latency: 34457 usec
      p90 latency: 34667 usec
      p95 latency: 34877 usec
      p99 latency: 35130 usec
      Avg gRPC time: 34452 usec ((un)marshal request/response 27 usec + response wait 34425 usec)
    Server:
      Request count: 70
      Avg request latency: 33473 usec (overhead 26 usec + queue 58 usec + compute 33389 usec)
HANDS-ON
Benchmarking - FasterTransformer
SQuAD task inference (FasterTransformer): batchsize=1, tensorflow backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS | AvgL(ms) | TP99(ms) | Concurrent
CPU | FP32 | 7.7 | 131 | 203 | 1
CPU | FP32 | 10.4 | 289 | 339 | 4
GPU | FP32 | 104.5 | 9.5 | 11.8 | 1
GPU | FP32 | 137 | 21.9 | 23.7 | 4
GPU | FP16 | 267.5 | 3.7 | 3.9 | 1
GPU | FP16 | 461.5 | 8.7 | 10.3 | 4

Progression CPU → multi-thread CPU → GPU FP32 → concurrent GPU FP32 → GPU FP16 → concurrent GPU FP16; highlights: CPU 7.7 QPS @ 131 ms → GPU FP32 104.5 QPS @ 9.5 ms → concurrent GPU FP16 461.5 QPS @ 8.7 ms
Virtual GPU feature in TensorFlow to enable multi-stream: --tf-add-vgpu="0;4;3000"
HANDS-ON
Benchmarking - FasterTransformer
SQuAD task inference (FasterTransformer): batchsize=32, tensorflow backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS | AvgL(ms) | TP99(ms) | Concurrent
CPU | FP32 | 0.4 | 2491 | 2810 | 1
GPU | FP32 | 5.5 | 182 | 184 | 1
GPU | FP32 | 5 | 186 | 186 | 4
GPU | FP16 | 21.8 | 46 | 48.8 | 1
GPU | FP16 | 21.6 | 46.1 | 48.1 | 4

Progression CPU → GPU FP32 → GPU FP16: QPS 0.4 → 5.5 → 21.8; avg latency 2491 ms → 182 ms → 46 ms
HANDS-ON
Benchmarking - TensorRT
SQuAD task inference (TensorRT): batchsize=1, TensorRT backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS | AvgL(ms) | TP99(ms) | Concurrent
GPU | FP32 | 163 | 12.3 | 12.4 | 1
GPU | FP32 | 156 | 12.8 | 14 | 4
GPU | FP16 | 438.5 | 4.6 | 4.6 | 1
GPU | FP16 | 473.5 | 4.2 | 5.1 | 4

Progression GPU FP32 → GPU FP16: QPS 163 → 473.5; avg latency 12.3 ms → 4.2 ms
HANDS-ON
Benchmarking - TensorRT
SQuAD task inference (TensorRT): batchsize=32, TensorRT backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS | AvgL(ms) | TP99(ms) | Concurrent
GPU | FP32 | 6.5 | 157 | 159 | 1
GPU | FP32 | 6.5 | 316 | 356 | 4
GPU | FP16 | 29.5 | 34.2 | 34.8 | 1
GPU | FP16 | 30.5 | 134 | 151 | 4

Progression GPU FP32 → GPU FP16: QPS 6.5 → 30.5; avg latency 157 ms → 134 ms
LEARN MORE AND DOWNLOAD TO USE
Learn more here:
https://nvidia.com/data-center-inference
https://docs.nvidia.com/deeplearning/sdk/inference-release-notes/index.html
https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/quickstart.html
Get the ready-to-deploy container with monthly updates from the NGC container registry:
https://ngc.nvidia.com/catalog/containers/nvidia%2Ftensorrtserver
Open source GitHub repository:
https://github.com/NVIDIA/tensorrt-inference-server
ADDITIONAL RESOURCES
Engineering developer blog (benchmarks, model concurrency, etc.): https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/
Kubeflow guest blog: https://www.kubeflow.org/blog/nvidia_tensorrt/
Open source announcement: https://news.developer.nvidia.com/nvidia-tensorrt-inference-server-now-open-source
More:
• Data center inference page & TensorRT page
• DevTalk Forum for Support
• TensorRT Hyperscale Inference Platform infographic
• NVIDIA AI Inference Platform technical overview
• NVIDIA TensorRT Inference Server and Kubeflow
• NVIDIA TensorRT Inference Server Now Available
NVIDIA DEEP LEARNING INSTITUTE (DLI)
DLI full-day deep learning training @ GTC CHINA 2019
Global developer training certificates | Fully configured GPU lab environments | 5 new courses debuting | Annual 40%-off special offer
View courses and register: www.nvidia.cn/dli
• Training Neural Networks with Multiple GPUs: tackle the algorithmic and engineering challenges of large-scale neural network training
• CUDA Python: easily accelerate Python applications on the GPU
• Computer Vision: from-zero introduction to deep learning methods and practice
• Natural Language Processing (NLP): essential theory and applied skills
• Multiple Data Types: advanced applications fusing machine vision and NLP techniques
• Perception Systems for Autonomous Vehicles (new for 2019): learn to build autonomous vehicles with NVIDIA DRIVE AGX
• Industrial Inspection: build automated industrial inspection models with deep learning
• Developing AI Applications with Jetson Nano: robotics fundamentals; get your own Jetson Nano kit