AIOps Practice in Network Operation and Maintenance · 2019-09-27 · RAN (D U/CU) BBU X-Haul...
2
Challenges: Telecom Networks Are Becoming More Complex
[Figure: end-to-end telecom network architecture. Labels: AI, OSS, EMS, Orchestrator, BSS; Operation Center; Cloud OS, Server, Switch, Storage; CloudAIR, CloudRAN (Joint Innovation), CloudFAN (Joint Innovation), CloudCampus, Cloud-based BNG, CloudEdge/CloudCore, NFVI, CloudFabric, CloudBackbone, Metro as a Fabric (Joint Innovation), DCI & CloudOptix Backbone, Orchestrator & MANO, Agile Controller, SD-WAN, Agile Private Line; RAN (DU/CU), BBU, X-Haul, Switch, Campus Switch, AP, MxU, ONT, OLT, CPE, SW, WDM, BRAS/PE, P Router, Aggregation & Metro; Mobile Connection, Enterprise Connection, Home Connection, Premium Home Broadband; Core Network, Edge Network Cloud, Network CloudEngine hosting RAN (CU), vBNG, 5G Core (CP/UP), IMS, MME, Video, IoT and apps.]
Vertical layers and heterogeneous components: difficult to integrate and troubleshoot
Millions of nodes and myriad connections: difficult to deploy and to locate faults
Billions of services: O&M needs to be more agile and to handle massive O&M data
3
AIOps: Using AI to Assist Human Operation and Maintenance
Before 2008: operation by scripts and tools
2008 - 2012: operation by ITIL and Web
2012 - 2015: operation by DevOps
4
AIOps: Gartner's View
Gartner 2016: Utilize big data, machine learning and other analytical technologies to directly and indirectly enhance the relevant technical capabilities of IT services through preventive forecasting, personalization and dynamic analysis to achieve higher quality, reasonable cost and efficient support for the products or services being maintained.
Gartner 2018: AIOps platforms enhance IT operations through greater insights by combining big data, machine learning and visualization. I&O leaders should initiate AIOps deployment to refine performance analysis today and augment to IT service management and automation over the next two to five years.
5
Focus on Different Capability Requirements for O&M Scenarios
Product lifecycle: Planning, Deployment, ..., Survey, Monitoring, Troubleshooting, Service Recovery, Upgrading & Patching, Change & Optimization
Capability requirements: fast and automatic; real-time; fast to root cause; self-recovery; less service interruption; safe and error free; not much expertise required; no need to be on site
Pain points:
• So many layers and heterogeneous components: the work can cost more than 10 days.
• Hard to find the root cause across layers and components (physical and VM): more than 3 hours.
• Human errors in configuration cause accidents: more than 25%.
6
AIOps Target in Huawei
[Figure: evolution toward an autonomous network. Starting point: rule-based (template, customization, configuration), top-down management, with AI attached to service delivery and assurance above the NMS/EMS and devices. Target: target/intent-based (scenario, model, training) autonomous network with intelligent service operation, where service delivery and assurance develops and trains AI models, controllers and devices run AI reasoning, data flows up and execution flows down.]
7
Definition of Autonomous Driving Network
Based on scenarios, the system gradually replaces the “hand (operation), eye (monitoring), brain (decision), heart (intention)”, and finally realizes the “Autonomous Driving Network”.
Level definition
L0: Manual Management. Definition: manually managed network. Capability: no automation, no auxiliary system.
L1: Assisted Automation. Definition: tool-assisted support at discrete points. Capability: tool assistance; rules and expert experience are codified.
L2: Partial Autonomous Network. Definition: task-oriented automation. Capability: policy-based automation; frees the hands.
L3: Conditional Autonomous Network. Definition: scenario-based closed loop; responsibility stays with people. Capability: perceives environmental changes; frees the hands and eyes.
L4: Highly Autonomous Network. Definition: single-scenario autonomy; responsibility lies with the system. Capability: predictive autonomy; frees the hands, eyes and brain.
L5: Full Autonomous Network. Definition: full-scenario closed-loop autonomy. Capability: intent driven; humans are fully freed.
System characteristics: operation (hand), monitoring (eye), decision (brain), intention (heart)
Scenarios: planning and design, deployment, business distribution, maintenance and optimization
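Note: the level table can be read as a small lookup structure. The sketch below is only an illustration of that reading; the names AutonomyLevel, FREED and minimum_level are hypothetical, and treating L1 as freeing no faculty and L5 as freeing all four is an interpretation of the slide, not a formal definition.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    L0_MANUAL = 0
    L1_ASSISTED = 1
    L2_PARTIAL = 2
    L3_CONDITIONAL = 3
    L4_HIGH = 4
    L5_FULL = 5


# Human faculties that each level takes over, as read from the slide.
# Assumption: L1 only assists, so nothing is freed yet.
FREED = {
    AutonomyLevel.L0_MANUAL: frozenset(),
    AutonomyLevel.L1_ASSISTED: frozenset(),
    AutonomyLevel.L2_PARTIAL: frozenset({"hand"}),
    AutonomyLevel.L3_CONDITIONAL: frozenset({"hand", "eye"}),
    AutonomyLevel.L4_HIGH: frozenset({"hand", "eye", "brain"}),
    AutonomyLevel.L5_FULL: frozenset({"hand", "eye", "brain", "heart"}),
}


def minimum_level(faculties):
    """Return the lowest level that frees every faculty in `faculties`."""
    wanted = frozenset(faculties)
    for level in AutonomyLevel:  # iterates from L0 up to L5
        if wanted <= FREED[level]:
            return level
    raise ValueError(f"unknown faculty in {sorted(wanted)}")


if __name__ == "__main__":
    print(minimum_level({"hand", "eye"}).name)
```

For example, a scenario that must free both hands and eyes needs at least the Conditional Autonomous Network level under this reading.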
8
AI-enabled Autonomous Driving Network
Control & management convergence
Intelligent Service: cross-domain and global experience
Intelligent Sites: embedded AI capabilities
Functions: intelligence function, analytics function, automation function, ...
Data: MR data, RU location, SLA data, KPI data, hardware status, network load
AI tiers: Cloud AI (data & inference, model training), Network AI, Site AI
Intent translation
Edge Intelligence: smart sensors and data collection with real-time awareness
Cloud Intelligence: applications in the cloud
Local Intelligence
Low-latency control loops (TTI < 0.5 ms)
Massive live streaming data (200 GB/day/site)
Full mobile network data and status, all-scenario automation
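Note: to give a feel for the 200 GB/day/site figure, the arithmetic below converts it into an average sustained bit rate. It assumes decimal gigabytes and an even spread over the day, neither of which the slide states; real traffic would be bursty.

```python
# Back-of-the-envelope conversion of 200 GB/day/site into an average bit rate.
GB = 10**9                       # assumption: decimal gigabytes
bytes_per_day = 200 * GB         # figure quoted on the slide
seconds_per_day = 24 * 60 * 60   # 86400

avg_mbit_per_s = bytes_per_day * 8 / seconds_per_day / 1e6

if __name__ == "__main__":
    print(f"average per site: {avg_mbit_per_s:.1f} Mbit/s")
```

Under these assumptions each site streams roughly 18.5 Mbit/s on average, which suggests why part of the processing would sit close to the site rather than being hauled to the cloud.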
9
Challenges of AIOps
Model: the object is hard to understand
• Complex relationships between components
• Differentiated realization methods
• Complex solution combinations
Knowledge: experience is hard to use
• Knowledge is in the “mind” of technicians
• Lack of inheritance
• Siloed knowledge management
• Unstructured
Algorithm: AI methods are hard to choose
• No one algorithm for all
• Scenario specific
• Not enough live data (logs, KPIs, alarms, etc.) for training
……
10
Model: PnP for Radio Station Deployment Based on eModel
[Figure: planning/audit (FullPara Audit) and the live network feed eModel, which produces NE configuration through the NMS; eModel maps between the Operator Model and the Huawei NE Model.]
Simplified parameters are enabled by eModel, which consists of an Engine and Policies.
1) Many complex parameter relations are predefined.
2) Parameter policies specify how features are configured in different scenarios.
3) Policies are script-based, so they can be used to transfer expert knowledge quickly.
eModel: mapping from customer language to device language
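Note: the engine-plus-policy idea can be illustrated with a minimal sketch. Everything in it is hypothetical (the parameter names, the two policies, the single relation and the expand function); it shows only the general shape of expanding a few customer-language parameters into a device-language configuration and checking it against predefined relations, not how eModel is actually built.

```python
from typing import Callable, Dict, List

Intent = Dict[str, object]        # customer-language parameters
DeviceConfig = Dict[str, object]  # device-language parameters
Policy = Callable[[Intent], DeviceConfig]
Relation = Callable[[DeviceConfig], bool]


def coverage_policy(intent: Intent) -> DeviceConfig:
    """Hypothetical scenario policy: choose radio parameters per scenario."""
    if intent["scenario"] == "dense_urban":
        return {"tx_power_dbm": 40, "antenna_tilt_deg": 8}
    return {"tx_power_dbm": 46, "antenna_tilt_deg": 4}


def carrier_policy(intent: Intent) -> DeviceConfig:
    """Hypothetical policy: pass the planned bandwidth to the carrier."""
    return {"carrier_bandwidth_mhz": intent["bandwidth_mhz"]}


# Hypothetical predefined relation between parameters.
RELATIONS: Dict[str, Relation] = {
    "high power requires shallow tilt":
        lambda cfg: cfg["tx_power_dbm"] < 45 or cfg["antenna_tilt_deg"] <= 6,
}


def expand(intent: Intent, policies: List[Policy],
           relations: Dict[str, Relation]) -> DeviceConfig:
    """Engine: apply every policy, then verify the predefined relations."""
    config: DeviceConfig = {}
    for policy in policies:
        config.update(policy(intent))
    for name, holds in relations.items():
        if not holds(config):
            raise ValueError(f"relation violated: {name}")
    return config


if __name__ == "__main__":
    intent = {"scenario": "dense_urban", "bandwidth_mhz": 20}
    print(expand(intent, [coverage_policy, carrier_policy], RELATIONS))
```

Because the policies are plain functions, an expert could add or change one without touching the engine, which is the property the slide attributes to script-based policies.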
11
Knowledge & Policy: Knowledge Generation and Management for Automation and Self-service Support
“The greatest waste in the process of operation and maintenance is the waste of knowledge.”
[Figure: product knowledge and O&M knowledge feed the O&M Knowledge Graph and Knowledge Base, which supports automation and self-service. Below it, EMS/MANO/Controller A and B exchange commands and data with Device/App A, B and C.]
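Note: one common way to hold O&M knowledge for automation is as subject-relation-object triples. The sketch below assumes that representation; the class, the relation names caused_by and fixed_by, and all alarm, cause and action names are hypothetical and are not taken from the slide.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny triple store: (subject, relation) -> set of objects."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[(subject, relation)].add(obj)

    def objects(self, subject, relation):
        # .get avoids creating empty entries for unknown keys
        return sorted(self._edges.get((subject, relation), ()))


def suggest_actions(kg, alarm):
    """Follow alarm -> caused_by -> fixed_by and list (cause, action) pairs."""
    suggestions = []
    for cause in kg.objects(alarm, "caused_by"):
        for action in kg.objects(cause, "fixed_by"):
            suggestions.append((cause, action))
    return suggestions


if __name__ == "__main__":
    kg = KnowledgeGraph()
    # Hypothetical knowledge entries for illustration only.
    kg.add("link_down", "caused_by", "fiber_loose")
    kg.add("link_down", "caused_by", "board_fault")
    kg.add("fiber_loose", "fixed_by", "reseat_fiber")
    kg.add("board_fault", "fixed_by", "replace_board")
    print(suggest_actions(kg, "link_down"))
```

The same lookup could back either an automated remediation flow or a self-service query from a technician.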
12
Knowledge & Policy: Real-time Assistant
Digitalized O&M Knowledge Cloud: O&M Knowledge Graph, Policy, Model
Assistants: AR intelligent glasses, intelligent assistant robot
Remote Expertise
• Network-level assurance
• Network planning
• O&M policy management
• ……
Field Technician
• Hardware operation
• Cable operation
• Status check
• Troubleshooting
• ……
13
Algorithm in Anomaly Finding: Pattern Finding through Log Analysis
Stages: logs, log summary, log mode, anomaly detection
Single log anomaly detection: the log appears as an exception; it violates normal rules.
Multiple log anomaly detection: log exception summary; the logs violate sequential rules.
Log exception scenarios:
• One-dimension analysis, such as power failure
• Multi-dimension analysis, such as an inconsistent number of up/down events
• Sequential analysis: logs with time series characteristics, such as OSPF logs
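Note: the three log exception scenarios can each be sketched as a small check. The log line format, the function names and the "backward transition" rule are assumptions made for illustration; only the OSPF neighbor state order is standard. This is a rule-based sketch, not the pattern-mining algorithm the slide refers to.

```python
import re
from collections import Counter

# Standard OSPF neighbor states, in the order a healthy adjacency advances.
OSPF_ORDER = ["Down", "Attempt", "Init", "2-Way",
              "ExStart", "Exchange", "Loading", "Full"]
RANK = {state: i for i, state in enumerate(OSPF_ORDER)}


def single_log_anomalies(lines, bad_patterns=(r"power failure",)):
    """One-dimension check: a line is anomalous on its own."""
    return [line for line in lines
            if any(re.search(p, line, re.IGNORECASE) for p in bad_patterns)]


def updown_mismatch(lines):
    """Multi-dimension check: interfaces whose up and down counts differ.

    Assumes hypothetical lines ending in 'interface <name> up|down'.
    """
    counts = Counter()
    for line in lines:
        match = re.search(r"interface (\S+) (up|down)$", line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    interfaces = sorted({name for name, _ in counts})
    return [name for name in interfaces
            if counts[(name, "up")] != counts[(name, "down")]]


def sequence_violations(states):
    """Sequential check: flag transitions that move backward in OSPF_ORDER."""
    return [(a, b) for a, b in zip(states, states[1:]) if RANK[b] < RANK[a]]


if __name__ == "__main__":
    logs = [
        "10:00 interface ge0/1 down",
        "10:01 interface ge0/1 up",
        "10:02 interface ge0/2 down",
        "10:03 board 3 power failure",
    ]
    print(single_log_anomalies(logs))
    print(updown_mismatch(logs))
    print(sequence_violations(["Init", "2-Way", "ExStart", "Exchange", "Full", "Init"]))
```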
14
Algorithm in Fault Prediction: Time Series Algorithms Used in KPI Analysis
Goal: find abnormalities in advance.
[Figure: KPI timeline with three phases: normal status, prediction area, alarm appears.]
Flow: raw data (historical KPI data, real-time KPI data), rule filter (normal/abnormal), expert annotation, model training, deterioration model, patterns, result. Uncertain samples are fed back to update the model.
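Note: the slide does not name the time series algorithm behind the deterioration model. As a stand-in, the sketch below fits a least-squares line to recent KPI samples and estimates how many sampling intervals remain before the trend reaches an alarm threshold. The function names and the choice of a linear trend are assumptions for illustration only.

```python
def fit_line(values):
    """Least-squares slope and intercept for samples at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def steps_until_threshold(values, threshold):
    """Estimate sampling intervals left before the trend hits `threshold`.

    Returns 0.0 if the fitted value is already at or above the threshold,
    and None if the trend is flat or falling (no crossing predicted).
    """
    slope, intercept = fit_line(values)
    current = intercept + slope * (len(values) - 1)
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


if __name__ == "__main__":
    kpi = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical rising KPI samples
    print(steps_until_threshold(kpi, threshold=8.0))
```

A positive result marks the "prediction area": the KPI is still below the alarm threshold but is trending toward it.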
15
Algorithm in Precise Alarm: Alarm Reduction and Root Alarm Recommendation through Alarm Analysis
Two techniques: time series dynamic adaptive technology; root cause diagnosis based on frequent item mining and random forest
Main flow of the dynamic algorithm (steps as labeled in the flowchart): alarms, object type separation, alarm sequence segmentation, effect verification, rule application, strobe and oscillation detection, END
Intermediate results: time and object association matrix (time correlation matrix, object correlation suppression matrix); time domain merging and inflection point fitting
Example matrix (columns 1 to 30, intermediate columns elided in the source):
      1    2    3    …   30
ALM1  435  546  578  …   643
ALM2  528  634  697  …   724
ALM3  261  325  365  …   471
ALM4  142  267  375  …   794
Key point: the “strobe and oscillation detection” mechanism automatically identifies whether tuning is needed (tuning needed / no need to tune).
Monthly aging recalculation, self-renewal
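Note: the first two steps of the flow, object type separation and alarm sequence segmentation, could look roughly like the sketch below. The alarm record fields, the function names and the rule of cutting a sequence wherever the gap between consecutive alarms exceeds max_gap are all assumptions; the slide does not say how segments are delimited.

```python
from collections import defaultdict


def separate_by_object_type(alarms):
    """Group alarms by object type, each group sorted by time."""
    groups = defaultdict(list)
    for alarm in sorted(alarms, key=lambda a: a["time"]):
        groups[alarm["object_type"]].append(alarm)
    return dict(groups)


def segment(sequence, max_gap):
    """Split a time-ordered alarm sequence where the gap exceeds max_gap."""
    segments = []
    for alarm in sequence:
        if segments and alarm["time"] - segments[-1][-1]["time"] <= max_gap:
            segments[-1].append(alarm)      # continue the current burst
        else:
            segments.append([alarm])        # start a new burst
    return segments


if __name__ == "__main__":
    # Hypothetical alarm records; times are in seconds.
    alarms = [
        {"time": 0, "object_type": "RRU", "name": "ALM1"},
        {"time": 5, "object_type": "RRU", "name": "ALM2"},
        {"time": 100, "object_type": "RRU", "name": "ALM1"},
        {"time": 3, "object_type": "Cell", "name": "ALM3"},
    ]
    groups = separate_by_object_type(alarms)
    for object_type, seq in groups.items():
        print(object_type, [[a["name"] for a in s] for s in segment(seq, max_gap=10)])
```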
Key algorithm: Alarm Correlation Mining Based on Frequent Item Mining
[Figure: an alarm sequence (A C A B C C B D) is cut by a time window into relevant and irrelevant alarms. *Common color -> common parameters]
ALM1  ALM2  RELATIVITY
A     B     0.7
B     C     0.98
B     D     0.82
Key algorithm: Root Cause Mining Based on Random Forest
Infer the managed object causality and operating system causality from the known alarm causality, then obtain the root cause alarm with a random forest algorithm.
[Figure: random forest. Dataset X is sampled into N1, N2, N3 and N4 features for TREE #1 to TREE #4, which output CLASS C, CLASS D, CLASS B and CLASS C; majority voting gives the final class.]
[Figure: causality graph over alarms A, B, C and D. Alarm types: running system, hardware system, signaling system. Managed objects: RRU, board, cell.]
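Note: a minimal sketch of pairwise alarm correlation mining and of the majority vote shown in the random forest figure. The tumbling time windows, the Jaccard-style score and the min_support cutoff are assumptions chosen for illustration; the slide does not define how its RELATIVITY values are computed, and this code does not reproduce them.

```python
from collections import Counter
from itertools import combinations


def pair_correlation(events, window, min_support=2):
    """Score alarm pairs that frequently share a time window.

    events: iterable of (time, alarm_name).
    Score = windows with both / windows with either (Jaccard index).
    """
    windows = {}
    for time, name in events:
        windows.setdefault(int(time // window), set()).add(name)

    single = Counter()
    pair = Counter()
    for names in windows.values():
        single.update(names)
        pair.update(combinations(sorted(names), 2))

    return {
        (a, b): n / (single[a] + single[b] - n)
        for (a, b), n in pair.items()
        if n >= min_support          # keep only frequent pairs
    }


def majority_vote(tree_outputs):
    """Final class = the class predicted by the most trees."""
    return Counter(tree_outputs).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical alarm events: (seconds, alarm name).
    events = [(0, "A"), (1, "B"), (10, "A"), (11, "B"),
              (20, "A"), (21, "C"), (30, "B"), (31, "C")]
    print(pair_correlation(events, window=10))
    print(majority_vote(["C", "D", "B", "C"]))   # tree outputs from the figure
```

In a fuller pipeline the frequent pairs would feed the causality inference, and the vote would come from trained trees rather than a fixed list.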
16
Summary
• AI can help a great deal as an O&M assistant, but it cannot do everything at the current stage.
• Model, Knowledge/Policy and Algorithm are the key factors.
• AIOps needs to be scenario oriented; there is no one way for all.
• AIOps is long-term work and needs different targets and emphases at different stages.
• AIOps and Zero-Touch do not mean no human intervention; humans will take on a different role.
17
Vision and Mission
Bring digital to every person, home and organization for a fully connected, intelligent world