Transcript of DeepStream SDK: Towards Large Scale Deployment of Intelligent Video Analytics Systems (Jeremy Furtek, November 1, 2017)

  • November 1, 2017

    DeepStream SDK: Towards Large Scale Deployment of Intelligent Video Analytics Systems

    Jeremy Furtek

  • 2

    MOTIVATION

  • 3

    Proliferation of Video Data

    Content Filtering

    Retail Analytics

    Traffic Engineering

    Law Enforcement

    Ad Injection

    Securing Critical Infrastructure

    Parking Management

    In-Vehicle Analytics

  • 4

    Advances in Deep Learning: Classification and Beyond

    Image Classification (AlexNet, 2012)

    Object Detection (R-FCN, 2016)

    Semantic Image Segmentation (ENet, 2016)

    Image Compression (WaveOne, 2017)

    Structure From Motion (SfM-Net, 2017)

    Pose-Aware Face Recognition (USC, 2017)

  • 5

    WHAT IS DEEPSTREAM?

  • HOW IT WORKS

    DEEP LEARNING

  • HOW IT WORKS

    VIDEO

    DEEP LEARNING WITH DEEPSTREAM

    DeepStream-Enabled App or Service

  • VIDEO ANALYTICS WITH DEEP LEARNING ON NVIDIA GPUS: AN OVERVIEW

  • 9

    From Video to Analytics With Deep Learning…

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image)]

  • 10

    From Video to Analytics With Deep Learning…

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image)]

    Video Decode

    Fully accelerated hardware decoding on NVIDIA GPUs (NVDEC) is accessible via the Video Decode API.
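    As a rough sketch of what sits beneath that API, a decoder can be created through the low-level NVCUVID interface; the field values below are illustrative, and a real application also needs a video parser and frame callbacks:

    #include <nvcuvid.h>

    // Minimal sketch: create an H.264 hardware decoder producing NV12 frames.
    CUvideodecoder createH264Decoder(int width, int height)
    {
        CUVIDDECODECREATEINFO info = {};
        info.CodecType           = cudaVideoCodec_H264;
        info.ulWidth             = width;
        info.ulHeight            = height;
        info.ulNumDecodeSurfaces = 8;   // decode picture buffer size (assumed)
        info.ChromaFormat        = cudaVideoChromaFormat_420;
        info.OutputFormat        = cudaVideoSurfaceFormat_NV12;
        info.DeinterlaceMode     = cudaVideoDeinterlaceMode_Weave;
        info.ulTargetWidth       = width;
        info.ulTargetHeight      = height;
        info.ulNumOutputSurfaces = 2;

        CUvideodecoder decoder = nullptr;
        cuvidCreateDecoder(&decoder, &info);  // returns CUDA_SUCCESS on success
        return decoder;
    }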

  • 11

    From Video to Analytics With Deep Learning…

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image) → Image Conversion → BGR image]

  • 12

    From Video to Analytics With Deep Learning…

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image) → Image Conversion → BGR image]

    Image Conversion

    Custom CUDA kernels and/or optimized libraries (e.g. NVIDIA Performance Primitives, or NPP) enable high performance image conversions.

    __global__ void resize(uint8_t* dst,
                           const uint8_t* src,
                           int dst_w,
                           int dst_h,
                           int src_w,
                           int src_h)
    {
        // Body left empty on the slide; a minimal nearest-neighbor resize,
        // one thread per destination pixel, would be:
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= dst_w || y >= dst_h) return;
        dst[y * dst_w + x] = src[(y * src_h / dst_h) * src_w + (x * src_w / dst_w)];
    }
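    A kernel like this is typically launched with a 2-D grid sized to the destination image; an illustrative launch configuration (the block size is an assumption, not from the slides):

    dim3 block(16, 16);
    dim3 grid((dst_w + block.x - 1) / block.x,   // ceil(dst_w / 16)
              (dst_h + block.y - 1) / block.y);  // ceil(dst_h / 16)
    resize<<<grid, block>>>(dst, src, dst_w, dst_h, src_w, src_h);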

  • 13

    From Video to Analytics With Deep Learning…

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image) → Image Conversion → BGR image → Inference → inference results (cat 0.9, dog 0.05, tree 0.05)]

  • 14

    From Video to Analytics With Deep Learning…

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image) → Image Conversion → BGR image → Inference → inference results (cat 0.9, dog 0.05, tree 0.05)]

    Inference

    TensorRT* provides a neural network optimizer and a high-performance runtime library for inference deployment on NVIDIA GPUs.

    Step 1: Optimize a trained neural network
    Step 2: Perform real-time inference with the TensorRT runtime

    *The current DeepStream release supports TensorRT 2.1.
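    In code, those two steps look roughly like the following sketch against the TensorRT 2.x C++ API (the logger, the output layer name "prob", the batch size, and the workspace size are illustrative assumptions, not from the slides):

    #include <cstdio>
    #include "NvInfer.h"
    #include "NvCaffeParser.h"

    using namespace nvinfer1;
    using namespace nvcaffeparser1;

    // Minimal logger required by the TensorRT builder and runtime.
    class Logger : public ILogger
    {
        void log(Severity severity, const char* msg) override
        {
            if (severity != Severity::kINFO) printf("%s\n", msg);
        }
    } gLogger;

    // Step 1: parse the Caffe deploy network and weights, then build an
    // optimized engine (layer fusion, kernel autotuning, precision selection).
    ICudaEngine* buildEngine()
    {
        IBuilder* builder = createInferBuilder(gLogger);
        INetworkDefinition* network = builder->createNetwork();

        ICaffeParser* parser = createCaffeParser();
        const IBlobNameToTensor* blobs =
            parser->parse("deploy.prototxt", "trained.caffemodel",
                          *network, DataType::kFLOAT);
        network->markOutput(*blobs->find("prob"));  // output layer name assumed

        builder->setMaxBatchSize(1);
        builder->setMaxWorkspaceSize(16 << 20);
        return builder->buildCudaEngine(*network);
    }

    Step 2 then amounts to engine->createExecutionContext() and calling execute() or enqueue() on device buffers each frame; within DeepStream this is wrapped by the inference task shown later in this talk.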

  • 15

    From Video to Analytics With Deep Learning…

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image) → Image Conversion → BGR image → Inference → inference results (cat 0.9, dog 0.05, tree 0.05) → Display / Storage]

  • 16

    From Video to Analytics With Deep Learning…

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image) → Image Conversion → BGR image → Inference → inference results (cat 0.9, dog 0.05, tree 0.05) → Display / Storage]

    Display / Storage

    Processing of inference output is application-dependent.

    Some common options are:

    • Real-time display with overlay (see the sketch below)
    • Alert generation
    • Database insertion
    • Augmented video encoding
    • “Slew-to-Cue” camera control
    • Tracker update
    • Custom post-processing algorithms (CUDA, OpenCV, …)
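    As a sketch of the overlay option, assuming the inference output has already been parsed into a box, label, and score (OpenCV-based; the function and its parameters are illustrative, not from the slides):

    #include <cstdio>
    #include <string>
    #include <opencv2/opencv.hpp>

    // Draw one detection on a BGR frame before display or re-encoding.
    void drawDetection(cv::Mat& frame, const cv::Rect& box,
                       const std::string& label, float score)
    {
        cv::rectangle(frame, box, cv::Scalar(0, 255, 0), 2);
        char text[64];
        snprintf(text, sizeof(text), "%s %.2f", label.c_str(), score);
        cv::putText(frame, text, box.tl() + cv::Point(0, -5),
                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 255, 0));
    }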

  • 17

    Video to Analytics With the DeepStream SDK

    DeepStream runtime library (libdeepstream)

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image) → Image Conversion → BGR image → Inference → inference results (cat 0.9, dog 0.05, tree 0.05) → Display / Storage, with decode, conversion, and inference hosted inside the DeepStream runtime library]

  • 18

    USING THE DEEPSTREAM LIBRARY

  • 19

    DeepStream from the Ground Up

    Tensors

    DeepStream tensors are objects that implement the IStreamTensor interface and represent a 4-dimensional array of values stored in NCHW order.

    [Diagram: a 4-D tensor with dimensions N × C × H × W]

    typedef enum {
        FLOAT_TENSOR,   // float32
        NV12_FRAME,     // decoded YCbCr
        OBJ_COORD,      // detection bounding box
        CUSTOMER_TYPE   // user defined
    } TENSOR_TYPE;

    typedef enum {
        GPU_DATA,       // device memory
        CPU_DATA,       // host memory
    } MEMORY_TYPE;

    IStreamTensor* createStreamTensor(int maxlen,
                                      size_t elemsize,
                                      TENSOR_TYPE Ttype,
                                      MEMORY_TYPE Mtype,
                                      int deviceID);
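    For example, a float32 tensor resident in device memory on GPU 0 could be created as follows (the length value is an illustrative assumption):

    // Illustrative: up to 10 float32 elements, stored in GPU memory on device 0.
    IStreamTensor* tensor = createStreamTensor(10, sizeof(float),
                                               FLOAT_TENSOR, GPU_DATA, 0);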

  • 20

    DeepStream from the Ground Up

    Modules

    DeepStream modules are objects that implement the IModule interface and operate on one or more input tensors to produce one or more output tensors.

    void IModule::execute(const ModuleContext& ctx,
                          const std::vector<IStreamTensor*>& in,
                          const std::vector<IStreamTensor*>& out) = 0;

    [Diagram: a DeepStream module consuming input[0] … input[M-1] and producing output[0] … output[N-1]]
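    A custom module is therefore a class deriving from IModule whose execute() reads its inputs and fills its outputs. A skeleton might look like the following; only execute() appears on the slides, the constructor convention is inferred from the examples later in this talk, and any other IModule virtuals are omitted:

    // Hypothetical skeleton; the full IModule interface is not shown in the talk.
    class UserInferenceParser : public IModule
    {
    public:
        explicit UserInferenceParser(
            const std::vector<std::pair<IModule*, int>>& inputs)
            : inputs_(inputs) {}

        void execute(const ModuleContext& ctx,
                     const std::vector<IStreamTensor*>& in,
                     const std::vector<IStreamTensor*>& out) override
        {
            // Read raw network output from in[0], threshold / NMS the
            // detections, and write bounding boxes to out[0].
        }

    private:
        std::vector<std::pair<IModule*, int>> inputs_;  // (producer, output index)
    };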

  • 21

    DeepStream from the Ground Up

    Building the DeepStream dataflow graph

    DeepStream modules maintain connectivity information using a std::vector of pairs that represent the inputs to the module.

    [Diagram: A.output[1] → C.input[0] and B.output[0] → C.input[1]; Module A also exposes A.output[0], and Module B also exposes B.output[1]]

    Inputs to Module C:

        IModule*   index
        &A         1
        &B         0
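    Expressed directly in code, the table above is just the input vector handed to Module C when it is constructed:

    // Module C consumes A's output 1 and B's output 0 (matches the table).
    std::vector<std::pair<IModule*, int>> c_inputs;
    c_inputs.push_back(std::make_pair(&A, 1));  // A.output[1] -> C.input[0]
    c_inputs.push_back(std::make_pair(&B, 0));  // B.output[0] -> C.input[1]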

  • 22

    Using the DeepStream Library

    Step 1: Create a device worker object

    // Create a device worker
    IDeviceWorker* pDW = createDeviceWorker(1,  // num channels
                                            0); // GPU device ID

  • 23

    Using the DeepStream Library

    Step 2: Add a decode task

    // Create a device worker
    IDeviceWorker* pDW = createDeviceWorker(1, 0);

    // Add a decode task, specifying the CODEC type
    pDW->addDecodeTask(cudaVideoCodec_H264);

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image)]

  • 24

    Using the DeepStream Library

    Step 3: Add an image conversion module

    // Create a device worker
    IDeviceWorker* pDW = createDeviceWorker(1, 0);

    // Add a decode task, specifying the CODEC type
    pDW->addDecodeTask(cudaVideoCodec_H264);

    // Add a module for color conversion.
    // Color conversion module outputs:
    //   0: BGR_PLANAR
    //   1: NV12 (YCbCr)
    IModule* pCC = pDW->addColorSpaceConvertorTask(BGR_PLANAR);

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image) → Image Conversion → BGR image]

    Caffe uses BGR ordering (not RGB).

  • 25

    Inference in DeepStream: CAFFE and TensorRT

    Training (Caffe): the solver parameters (solver.prototxt) and the training network description (network.prototxt) are consumed by

        caffe train -solver solver.prototxt

    which produces the trained weights (trained.caffemodel, in BINARYPROTO format).

    Deployment (TensorRT): the deploy network description (deploy.prototxt) and the trained weights are consumed by TensorRT, which produces an in-memory (serializable) CUDA engine.

  • 26

    Using the DeepStream Library

    Step 4: Add an inference module

    // Add a module for color conversion.
    // Color conversion module outputs:
    //   0: BGR_PLANAR
    //   1: NV12 (YCbCr)
    IModule* pCC = pDW->addColorSpaceConvertorTask(BGR_PLANAR);

    // Add an inference module using output 0 from color conversion
    std::vector<std::pair<IModule*, int>> inf_inputs{std::make_pair(pCC, 0)};
    IModule* pInf = pDW->addInferenceTask(inf_inputs,
                                          prototxt_file.c_str(),
                                          model_file.c_str(),
                                          input_layer_name,
                                          output_layer_names_vector,
                                          num_channels,
                                          &inference_params);

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image) → Image Conversion → BGR image → Inference → inference results (cat 0.9, dog 0.05, tree 0.05)]

  • 27

    Using the DeepStream Library

    Step 5: Add custom output processor(s)

    // Add a parser module using output 0 from inference
    std::vector<std::pair<IModule*, int>> parser_inputs;
    parser_inputs.push_back(std::make_pair(pInf, 0));
    IModule* pParse = new UserInferenceParser(parser_inputs);
    pDW->addCustomerTask(pParse);

    // Add a display module using image and parser outputs
    std::vector<std::pair<IModule*, int>> display_inputs;
    display_inputs.push_back(std::make_pair(pCC, 0));
    display_inputs.push_back(std::make_pair(pParse, 0));
    IModule* pDisp = new UserDisplayModule(display_inputs);
    pDW->addCustomerTask(pDisp);

    [Diagram: compressed video bitstream ("1 0 1 1 0…") → Video Decode → decoded video frame (YCbCr image) → Image Conversion → BGR image → Inference → Parser → Display / Storage]

  • 28

    Using the DeepStream Library

    Step 6: Start the device worker

    // Add a display module using image conversion and parser outputs
    std::vector<std::pair<IModule*, int>> display_inputs;
    display_inputs.push_back(std::make_pair(pCC, 0));  // BGR output
    display_inputs.push_back(std::make_pair(pParse, 0));
    IModule* pDisp = new UserDisplayModule(display_inputs);
    pDW->addCustomerTask(pDisp);

    // Start the device worker...
    pDW->start();

  • 29

    Using the DeepStream Library

    Step 7: Initiate data input

    // Start the device worker...
    pDW->start();

    // Initiate data input by pushing data to the device worker
    while (1)
    {
        // Read data from file/socket/database...
        pDW->pushPacket(data, num_bytes, channel_num);
    }
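    With a concrete source substituted in, the feed loop might stream an elementary-stream file in fixed-size chunks (the buffer size, types, and error handling below are illustrative assumptions):

    #include <cstdint>
    #include <cstdio>

    // Push an H.264 bitstream file into one channel of the device worker.
    void feedFile(IDeviceWorker* pDW, const char* path, int channel_num)
    {
        FILE* f = fopen(path, "rb");
        if (!f) return;

        uint8_t data[64 * 1024];
        size_t num_bytes;
        while ((num_bytes = fread(data, 1, sizeof(data), f)) > 0)
            pDW->pushPacket(data, num_bytes, channel_num);

        fclose(f);
    }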

  • 30

    SCALING WITH DEEPSTREAM

  • 31

    DECODE PERFORMANCE ON NVIDIA GPUS

  • 32

    Scaling Video Analytics With DeepStream: Multiple Channels

    [Diagram: GPU 0, running the DeepStream runtime library (libdeepstream), hosts four parallel pipelines (Video Decode → Image Conversion → Inference), one per channel (channel 0 through channel 3), all feeding a shared Display / Storage stage]

  • 33

    Faster Inference with INT8 Integers

    GPUs with Compute Capability 6.1 and higher support dot product instructions with 8-bit operands. (Examples of GPUs with this capability are Tesla P4 and P40.)
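    In CUDA these instructions are exposed through the __dp4a intrinsic, which computes a 4-way dot product of packed 8-bit values with a 32-bit accumulator; a minimal illustration (not DeepStream code):

    // Requires compute capability 6.1+ (e.g. Tesla P4/P40); build with -arch=sm_61.
    // Each int packs four signed 8-bit values.
    __global__ void dot8(const int* a, const int* b, int* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __dp4a(a[i], b[i], 0);  // 4 INT8 multiply-adds into an int32
    }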

  • 34

    Inference with INT8: Accuracy

    From: “8-bit Inference with TensorRT,” Szymon Migacz, NVIDIA, GTC 2017.

  • 35

    Inference with INT8: Performance

    From: “8-bit Inference with TensorRT,” Szymon Migacz, NVIDIA, GTC 2017.

  • 36

    Using INT8 Inference in DeepStream

    Step 1: Run TensorRT offline to generate a calibration table file

    [Diagram: the deploy network description (deploy.prototxt) and the trained weights (trained.caffemodel, BINARYPROTO) are consumed by TensorRT, which produces a CUDA engine and a calibration table]

  • 37

    Using INT8 Inference in DeepStream

    Step 2: Provide the path to the calibration file when creating the inference task

    // Add an inference module using output 0 from color conversion
    std::vector<std::pair<IModule*, int>> inf_inputs{std::make_pair(pCC, 0)};
    inference_params.calibrationTableFile = calibration_file.c_str();
    IModule* pInf = pDW->addInferenceTask(inf_inputs,
                                          prototxt_file.c_str(),
                                          model_file.c_str(),
                                          input_layer_name,
                                          output_layer_names_vector,
                                          num_channels,
                                          &inference_params);

  • 38

    Scaling Video Analytics With DeepStream: Multiple GPUs

    [Diagram: two device workers, one per GPU. The DeepStream device worker with device ID = 0 runs per-channel pipelines (Video Decode → Image Conversion → Inference) for its channels 0 and 1 on GPU 0; the device worker with device ID = 1 does the same on GPU 1; both feed Display / Storage]

  • 39

    Examples (Videos)

  • 40

    DeepStream Resources

    • DeepStream SDK Download

    https://developer.nvidia.com/deepstream-sdk

    • NVIDIA Parallel For All Blog: DeepStream: Next Generation Video Analytics for Smart Cities

    https://devblogs.nvidia.com/parallelforall/deepstream-video-analytics-smart-cities/

    • NVIDIA Developer Forums

    https://devtalk.nvidia.com/

    • Caffe Model Zoo

    https://github.com/BVLC/caffe/wiki/Model-Zoo

    • NVIDIA Video Encode and Decode GPU Support Matrix

    https://developer.nvidia.com/video-encode-decode-gpu-support-matrix


  • NVIDIA DEEP LEARNING INSTITUTE

    Training available as online self-paced labs and instructor-led workshops.

    Take self-paced labs at www.nvidia.com/dlilabs

    Find or request an instructor-led workshop at www.nvidia.com/dli

    Educators can download the Teaching Kit at developer.nvidia.com/teaching-kit and contact [email protected] for info on the University Ambassador Program.

    Topic areas: Fundamentals, Parallel Computing, Game Development & Digital Content, Finance, Intelligent Video Analytics, Healthcare, Robotics, Autonomous Vehicles, Virtual Reality