20191210-eichelberger-Approaches To High …...2019/12/10  · • Forecasting WAN Traffic in HPC...

Approaches To High Resolution Network Telemetry & Analytics With Machine Learning Corey Eichelberger


Outline
• Introduction
• Streaming Telemetry Overview
• Streaming Telemetry Implementation At NCSA
• Comparison Of SNMP and Streaming Telemetry
• So, We Have All This Data. What Do We Do With It?
• LoudML
• Machine Learning Efforts At NCSA
• Where Do We Go From Here?
• Acknowledgements


Introduction
• Current Standard:
  • SNMP Polling At 1-5 Minute Intervals
  • RRD Based NMS (Cacti, LibreNMS, etc.)
• Outstanding Issues:
  • Lossy Databases
  • RRD Lock Step
  • Distributed Polling
  • SNMP Is Resource Intensive
• Goals:
  • Better Operational Visibility
  • Lower Time to Resolution
  • Lossless Database


Streaming Telemetry Overview
• Subscription Based Model
• Asynchronous Push
  • Periodic Updates
  • State Changes
• Data Models
  • OpenConfig
• gRPC
  • Google Protocol Buffers (GPB)
  • SSL Support
• Support For UDP Streaming From Native Sensors
  • Direct From Line Card or NPU
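The two push modes listed above (periodic updates and state changes) can be illustrated with a small simulation. This is only a sketch of the semantics, not a gRPC or vendor API: the function names and the sample link-state data are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Sample = Tuple[int, str]  # (timestamp in seconds, observed value)


def periodic_updates(states: Iterable[Sample], interval_s: int) -> Iterator[Sample]:
    """Emit a sample every `interval_s` seconds, whether or not the value changed."""
    for ts, value in states:
        if ts % interval_s == 0:
            yield ts, value


def state_changes(states: Iterable[Sample]) -> Iterator[Sample]:
    """Emit a sample only when the value differs from the last one emitted."""
    last = object()  # sentinel that never equals a real value
    for ts, value in states:
        if value != last:
            yield ts, value
            last = value


# Hypothetical per-second view of one interface's operational state.
link_state = [(0, "up"), (1, "up"), (2, "up"), (3, "down"),
              (4, "down"), (5, "up"), (6, "up")]

periodic = list(periodic_updates(link_state, interval_s=2))
on_change = list(state_changes(link_state))
print(periodic)   # samples at t = 0, 2, 4, 6
print(on_change)  # samples only at the transitions
```

In this toy data the periodic stream reports the outage one second late (at t=4), while the state-change stream reports both transitions at the second they occur and sends nothing in between.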


Streaming Telemetry Implementation At NCSA
• TIG (Telegraf, InfluxDB, Grafana) Stack
  • Telegraf: Collector
    • Easily Extensible Via Plugins
      • SNMP
      • Juniper Telemetry Interface
      • Cisco Model Driven Telemetry
  • InfluxDB: Database
    • Time Series Database
    • Scalable to millions of data points per second
  • Grafana: Visualization/Alerting
• Juniper Implementation
  • Junos Telemetry Interface (JTI)
• Arista/Mellanox
  • Work In Progress


Comparison of SNMP vs Streaming Telemetry

WAN Interface - 60s Collection Interval (SNMP)


Comparison of SNMP vs Streaming Telemetry

Same Interface - 2s Collection Interval (Streaming Telemetry)
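The difference between the two graphs can be reproduced with arithmetic on octet counters. The sketch below uses synthetic numbers (a 1 Gb/s baseline with a single 2-second burst), not data from the slides, and ignores counter wrap for simplicity.

```python
from itertools import accumulate


def rates_bps(counters, interval_s):
    """Convert successive octet-counter readings into bits per second."""
    return [(b - a) * 8 / interval_s for a, b in zip(counters, counters[1:])]


# Synthetic traffic: thirty 2-second intervals at 1 Gb/s (250 MB per interval),
# with one interval bursting to 10 Gb/s (2.5 GB per interval).
per_interval_octets = [250_000_000] * 30
per_interval_octets[12] = 2_500_000_000

# Cumulative counter as a 2 s stream would see it (31 readings over 60 s).
counter_2s = [0] + list(accumulate(per_interval_octets))
# The same counter as a 60 s poll would see it (first and last reading only).
counter_60s = [counter_2s[0], counter_2s[-1]]

fast = rates_bps(counter_2s, 2)
slow = rates_bps(counter_60s, 60)

print(max(fast) / 1e9)  # peak seen at 2 s resolution, in Gb/s
print(slow[0] / 1e9)    # single value seen at 60 s resolution, in Gb/s
```

At 2-second resolution the burst shows up as a 10 Gb/s peak; the 60-second poll folds the same octets into a single 1.3 Gb/s average, so the burst is invisible.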


So, We Have All This Data. What Do We Do With It?
• Goals:
  • Find Issues Before They Happen
  • Better Forecasting of Traffic Patterns
  • Capacity Planning
  • Intent Based Networking
• Machine Learning Overview:
  • Supervised:
    • Resource Intensive
    • Requires Labeling Data
    • Requires Human Interaction
  • Unsupervised:
    • Point and shoot
    • Reliable?
• Packages Available:
  • TensorFlow, SciPy, LoudML, etc.
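As a concrete picture of the unsupervised, "point and shoot" idea, the sketch below flags points that sit far from the recent rolling mean. This is a deliberately simple rolling z-score, not the Donut model or anything LoudML does internally; the window size, threshold, and sample data are assumptions.

```python
from statistics import mean, pstdev


def rolling_zscore_anomalies(values, window, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            continue  # flat history: z-score is undefined, so skip this point
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical utilization samples (percent) with one obvious spike.
utilization = [10, 12, 11, 13, 12, 11, 10, 12, 95, 11, 12]
flagged = rolling_zscore_anomalies(utilization, window=5)
print(flagged)
```

No labels are needed, which is the appeal. The "Reliable?" caveat also shows up here: once the spike enters the history window it inflates the standard deviation, so a second spike shortly afterwards could go unflagged.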


LoudML
• Support For A Variety of Data Sources
  • Elasticsearch
  • InfluxDB
  • MongoDB
• Easy To Set Up and Get Started
• Donut Unsupervised Learning Model
  • Created By Researchers At Alibaba
• Models Are Abstracted
  • YAML Config Files
  • Simplified Training, Validation and Implementation


Machine Learning Efforts At NCSA
• Disclaimer: I'm a network engineer, not a data scientist
• Issues Encountered (LoudML):
  • Accuracy of Models
  • Lack of control and insight into the models being created
  • Forecasting WAN Traffic in HPC
  • Lack of Data
• Transition To TensorFlow
  • Better control over models and training
  • Requires a large amount of time and resources
  • Best for those with a data science background
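Part of the extra control that comes with building models directly is deciding how the time series is turned into training examples. A common preparation step, shown here in plain Python as an assumption about the workflow rather than the approach used at NCSA, is to slice the series into fixed-length input windows, each paired with a future value to predict. The function name and parameters are hypothetical.

```python
def make_windows(series, n_inputs, horizon=1):
    """Slice a series into (input window, target) pairs for forecasting.

    Each input holds `n_inputs` consecutive samples; the target is the
    sample `horizon` steps after the end of that window.
    """
    inputs, targets = [], []
    for start in range(len(series) - n_inputs - horizon + 1):
        end = start + n_inputs
        inputs.append(series[start:end])
        targets.append(series[end + horizon - 1])
    return inputs, targets


# Hypothetical traffic samples, one per collection interval.
traffic = [1, 2, 3, 4, 5, 6]
X, y = make_windows(traffic, n_inputs=3)
print(X)  # three overlapping windows of length 3
print(y)  # the value that follows each window
```

A series of N samples yields only N - n_inputs - horizon + 1 examples, which is one way the "Lack of Data" issue becomes visible: long input windows over a short collection history leave very little to train on.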


Where Do We Go From Here?
• Migration From LoudML To TensorFlow
• Migration To Center Wide Monitoring
  • Monitoring data from Networking, Systems, Storage, Security in one place
• Holistic Approach To Monitoring/Alerting
• Continue Training Models
• Explore In-Band Network Telemetry


Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1238993. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


References
• Telegraf: https://github.com/influxdata/telegraf
• InfluxDB: https://github.com/influxdata/influxdb
• Grafana: https://github.com/grafana/grafana
• LoudML: https://github.com/regel/loudml
• Donut Unsupervised Learning: https://github.com/haowen-xu/donut


Questions:
• Contact Corey Eichelberger
  • email: [email protected]
