OpenBMC - Platform Telemetry
Neeraj Ladkani – [email protected]
Open Source Firmware Conference 2019
• The rise and rapid evolution of data analytics, AI, and machine learning workloads have a significant impact on cloud hardware design.
• Commercial cloud infrastructure requires high availability and needs state-of-the-art telemetry to build predictive failsafe models.
• The BMC's role has evolved from a legacy hardware management service to a central intelligent controller serving cloud control plane operations.
Cloud Telemetry Conundrums
Specialization with Standardization
• Processors
  • Processor errors and CPU crash dump
• Memory
  • Memory correctable and uncorrectable errors
• IO
  • PCIe correctable and uncorrectable errors
  • SMART data for disks
• Add-on cards and custom silicon
  • Thermal data
  • Vendor-specific telemetry
• Host Subsystem
  • OS heartbeat
  • Network link status
• Power Supply
  • Fault history
  • Energy storage attributes
  • Consumption history
• BMC
  • Firmware stats
  • Request and response history
  • BMC CPU/Memory/Flash stability
• Mainboard HW
  • Hot Swap Controller faults
  • Voltage Regulator faults
Objective
• Standardize telemetry model
• Design a configurable BMC telemetry and health monitoring framework for OpenBMC platforms (hardware, thermal, power, BMC, and custom)
• Provide a generic interface to remotely access the metric data using both a push and pull model.
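To make the push/pull distinction concrete, here is a minimal Python sketch of one object serving both access models. The class name `MetricSource`, its methods, and the metric name `bmc_cpu_load` are hypothetical and are not part of OpenBMC or the proposal; they only illustrate the idea.

```python
class MetricSource:
    """Holds the latest value of each metric and serves both access models."""

    def __init__(self):
        self._latest = {}        # metric name -> most recent value
        self._subscribers = []   # callables notified on every new reading

    def subscribe(self, callback):
        """Push model: register a callable invoked on every new reading."""
        self._subscribers.append(callback)

    def publish(self, name, value):
        """Record a new reading and push it to all subscribers."""
        self._latest[name] = value
        for callback in self._subscribers:
            callback(name, value)

    def read(self, name):
        """Pull model: the client asks for the most recent reading."""
        return self._latest[name]


source = MetricSource()
received = []
source.subscribe(lambda name, value: received.append((name, value)))
source.publish("bmc_cpu_load", 0.01)   # hypothetical metric name
latest = source.read("bmc_cpu_load")
```

In the push case the client learns about the reading as soon as it is published; in the pull case it decides when to ask.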
Possible Solutions
• Custom daemons for every subsystem and custom IPMI/Redfish to push telemetry information
  • Use native binary blobs and OEM URIs
• Custom methods to specify telemetry parameters such as metric definition, sensing interval, and triggers
Telemetry Collection Subsystem
• Use “collectd” for collecting metrics.
• “collectd” plugins can be written or provided by subsystem owners to collect metrics (Hardware as a Service).
• Integrate IPMI and Redfish subsystems with collectd using intermediate translation services.
• Supports aggregation of metrics data, which enables space-efficient storage of data.
Redfish Telemetry Model
• Use the standard Redfish telemetry model (Credit: Paul Vancil)
• Flexible, extensible, and complete for OpenBMC client interfaces
• Supports push (Redfish event model) and pull (event logs) models
• Supports Triggers for specific scenarios
Redfish Telemetry – Sample Metric Report
Source: https://www.dmtf.org/documents/redfish-spmf/redfish-telemetry-white-paper-010a
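The slide's sample report is not reproduced in this transcript. As a rough illustration only, the Python sketch below assembles a dictionary shaped like a Redfish MetricReport resource. The property names (`Id`, `Name`, `MetricValues`, `MetricId`, `MetricValue`, `Timestamp`, `MetricProperty`) follow the DMTF schema; the report id, metric id, sensor URI, reading, and timestamp are hypothetical and not taken from the talk.

```python
import json
from datetime import datetime, timezone


def build_metric_report(report_id, readings, timestamp=None):
    """Assemble a dict shaped like a Redfish MetricReport resource.

    `readings` maps a metric id to a (value, metric_property_uri) pair.
    All ids and URIs passed in here are hypothetical examples.
    """
    ts = (timestamp or datetime.now(timezone.utc)).isoformat()
    return {
        "@odata.id": f"/redfish/v1/TelemetryService/MetricReports/{report_id}",
        "Id": report_id,
        "Name": f"Metric report {report_id}",
        "MetricValues": [
            {
                "MetricId": metric_id,
                # The schema carries metric values as strings.
                "MetricValue": str(value),
                "Timestamp": ts,
                "MetricProperty": uri,
            }
            for metric_id, (value, uri) in readings.items()
        ],
    }


report = build_metric_report(
    "AvgPlatformPower",  # hypothetical report id
    {
        "AveragePowerWatts": (
            142.5,  # made-up reading
            "/redfish/v1/Chassis/1/Power#/PowerControl/0/PowerConsumedWatts",
        )
    },
    timestamp=datetime(2019, 9, 3, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(report, indent=2))
```

A client using the pull model would fetch such a document from the telemetry service; with the push model the same payload would arrive through an event subscription.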
Get Involved
• Workgroup call (bi-weekly): https://github.com/openbmc/openbmc/wiki/Platform-telemetry-and-health-monitoring-Work-Group
• Community requirements: https://docs.google.com/spreadsheets/d/12gMMXB9r_WfWDf5wz-Z_zXsz6RNheC6p2LKp7HePAEE/edit?usp=sharing
• Design proposals:
  https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/22257
  https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/23758
  https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/24357
OpenBMC Metrics Collection
Proposal + Progress on Collectd Integration
Kun Yi ([email protected])
Open Source Firmware Conference 2019
Context: What are "metrics"?
● "a degree to which a software system or process possesses some property" -- Wikipedia
● Timeseries data
● Sensor value is a good example for BMC systems
● Metrics enable monitoring system data at scale such as:
  ○ How much performance improves across the fleet when the BMCs are updated?
  ○ How many times do BMCs report thermal throttling on a group of machines running a heavy load?
Context: Characteristics of Metrics

Characteristics               | Metric                                                                       | Log                                       | Event
Generally numeric             | Yes                                                                          | No                                        | Maybe
Time Interval                 | Regular                                                                      | Irregular                                 | Irregular
Urgent                        | No                                                                           | No                                        | Maybe
Target                        | Automation                                                                   | Human                                     | Human/Automation
Impact of losing a data point | Low                                                                          | Medium                                    | High
Examples                      | CPU load, memory usage, disk usage, daemon restart count, system uptime, ... | dmesg/kmesg, systemd journal, rsyslog, ... | catastrophic GPIO, sensor value over threshold, system disk full, ...
Existing OpenBMC collection   | Ad hoc solutions                                                             | systemd journal, rsyslog, Redfish logging | IPMI SEL, Redfish events
Context: Collectd and RRDTool
● Collectd [1]
  ○ Metrics collection daemon
  ○ Written in C
  ○ Highly configurable
  ○ Over 100 plugins available
  ○ Supports various data formats including CSV and RRD
  ○ Used in OpenWRT
● RRDTool [2]
  ○ Based on RRD format
  ○ Includes shared library "libRRD"
Context: Round Robin Database (RRD)
● Stores data in a circular buffer
● Automatically aggregates data according to configuration
● Constant size
[Figure: RRD Format Illustration. Updates arrive as Primary Data Points (PDP); a Consolidation Function (CF) combines them into Consolidated Data Points (CDP), which are stored in a Round Robin Archive (RRA). Credit: Gabriel Matute]
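The round-robin behavior described above (a circular buffer, automatic aggregation, constant size) can be sketched in a few lines of Python. This is a toy model, not RRDtool's implementation: the class name `RoundRobinArchive`, the parameters `steps` and `rows`, and the sample values are hypothetical, and averaging is just one possible consolidation function.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size archive that consolidates primary data points (PDPs).

    Every `steps` PDPs are reduced by the consolidation function `cf`
    into one consolidated data point (CDP). At most `rows` CDPs are kept;
    the oldest CDP is dropped once the archive is full.
    """

    def __init__(self, steps, rows, cf=lambda xs: sum(xs) / len(xs)):
        self.steps = steps
        self.cf = cf
        self.pending = []                # PDPs awaiting consolidation
        self.cdps = deque(maxlen=rows)   # circular buffer of CDPs

    def update(self, value):
        """Accept one PDP; emit a CDP when enough PDPs have accumulated."""
        self.pending.append(value)
        if len(self.pending) == self.steps:
            self.cdps.append(self.cf(self.pending))
            self.pending = []


# Average every 2 updates, keep only the 3 most recent averages.
rra = RoundRobinArchive(steps=2, rows=3)
for load in [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 6.0, 10.0]:
    rra.update(load)
print(list(rra.cdps))  # [3.0, 6.0, 8.0]
```

Eight updates produce four averages, but only the three newest survive, which is why storage stays constant regardless of how long the collector runs.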
Design: Requirements
● Must be able to persist certain critical metrics
● Resource-friendly
  ○ Trade-off between storage and amount of data to persist
  ○ Persist only the important data
  ○ External program can scrape from BMC frequently
● Common, simple interface for instrumenting
System Diagram (Illustrative)
Progress
● Created proof-of-concept
  ○ Collect BMC load and memory usage using Collectd plugins
  ○ Use OEM IPMI command to transfer data to the host
  ○ Host translates data to feed into other collection frameworks
● Preliminary study on resource consumption
  ○ Default bitbake recipe for rrdtool includes too many dependencies
Progress: Resource Consumption
● Tested on OpenBMC 2.7, ARMv7a
● Image size
  ○ By default, the rrdtool recipe includes perl, python, graphics libraries, ...
  ○ Building default rrdtool+collectd takes >7MB of flash space after xz compression
  ○ Building the minimally required recipe trims it down to 2.6MB
● CPU/Memory
  ○ With a few metrics being collected, memory consumption is ~4.8MB
  ○ CPU usage is ~1%
  ○ Will increase with the number of metrics being collected
● RRD file size
  ○ 23KB for 1 metric updated every 30s and kept for a day
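The 23KB figure is consistent with a back-of-the-envelope estimate. The sketch below assumes each stored data point occupies one 8-byte floating-point value and ignores file header overhead; neither assumption is stated in the talk, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_ROW = 8  # assumption: each stored value is one 8-byte double


def rra_payload_bytes(step_seconds, retention_seconds, bytes_per_row=BYTES_PER_ROW):
    """Estimate the data payload of one archive holding one metric."""
    rows = retention_seconds // step_seconds  # number of stored data points
    return rows * bytes_per_row


size = rra_payload_bytes(step_seconds=30, retention_seconds=SECONDS_PER_DAY)
print(size)  # 23040 bytes: 2880 rows of 8 bytes each
```

Under the same assumptions, halving the update interval or doubling the retention period doubles the payload, which is the storage trade-off named in the requirements.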
Future
● Configurable RRDtool recipe to drop unnecessary dependencies
● More code into librrd+ (librrd C++ wrapper)
● Look into generating events
● Look into tagging metrics
  ○ RRD file has no intrinsic string meta fields
  ○ "Collectd is moving (slowly but calmly) towards implementing arbitrary key/value attributes attached to each value."
● Redfish Telemetry Metric Report
  ○ Current proposal of JSON definition [3]
References
[1] Collectd: https://github.com/collectd/collectd
[2] RRDtool: https://github.com/oetiker/rrdtool-1.x
[3] DMTF Redfish API JSON definition: https://redfish.dmtf.org/schemas/v1/MetricReportDefinition.v1_2_0.json
Credits
Gabriel Matute for his awesome work as an intern!
Questions?