Slide 1/18 IBM Research
Lilliput meets Brobdingnagian: Data Center Systems Management through
Mobile Devices
Jan Rellermeyer, Thomas Osiecki, Michael Kistler, Ahmed Gheith
3rd International Workshop on Dependability of Clouds, Data Centers and Virtual Machine Technology (DCDV)
Held in conjunction with Dependable Systems and Networks (DSN)
Budapest, Hungary, June 18, 2013
Saurabh Bagchi, Fahad Arshad
Slide 2/18 IBM Research
System Management Workflow
Something is wrong!
Patch
Slide 3/18 IBM Research
Systems Management: A Changed View
Filtering
Patch
Slide 4/18 IBM Research
So What Exactly Are the Changes?
1. Platform being used for doing the systems management
Server:
1. Large screen
2. Resource rich
3. Within organization’s security perimeter
4. High dependability
Mobile devices:
1. Small screen
2. Resource constrained
3. Outside organization’s security perimeter
4. Lower dependability
Slide 5/18 IBM Research
So What Exactly Are the Changes?
2. Layered systems management to flat hierarchy
Filtering
Slide 6/18 IBM Research
Case Study: IBM Research’s IBM Remote Project
Always Connected, Instantaneous, Focused, Simple
User Interface:
• visualization of complex data
• relevance first
• drill-down UI
Communication:
• direct connection to the managed machines
• refresh rate vs. power consumption
IBM Blade Centers
Slide 7/18 IBM Research
Case Study: IBM Remote Project
Slide 8/18 IBM Research
Research Challenges Due To The Changes
1. Platform being used for doing the systems management: Server to Mobile Devices
I. How do we optimize the scarce resources of the systems management platforms? Primarily, battery and communication bandwidth.
II. How do we handle the fact that the platforms will be insecure and fault-intolerant for parts of their operation?
III. How do we visualize the (hopefully) rare failure event in a deluge of systems monitoring data?
Slide 9/18 IBM Research
Research Challenges Due To The Changes
2. Layered systems management to flat hierarchy
I. Can we avoid chaos due to the looser coordination?
II. Can we leverage overlap between interests to cut down on traffic to individual mobile devices?
Slide 10/18 IBM Research
Solution Directions for Question 1
I. How do we optimize the scarce resources of the systems management platforms? Primarily, battery and communication bandwidth.
1. Platform being used for doing the systems management: Server to Mobile Devices
• Minimize number of messages, while still receiving enough to reliably detect failures
  – Use publish-subscribe or other push mechanism, in preference to pull mechanism
  – BUT: Most hardware management modules do not support push
  – Use an intermediate server for aggregation and filtering
• Apply principles of rare event detection
  – Non-events occur with much higher frequency than events of interest
  – BUT: Requires model of events: time distribution, correlation, etc.
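The intermediate-server idea above can be sketched as follows. This is a minimal illustration, not the IBM Remote implementation: the server pulls status from management modules (which cannot push), suppresses non-events, and pushes only state changes to the mobile device. The `Aggregator` class, the `push` callback, and the three-level severity scale are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumed severity scale; the slides do not define one.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


@dataclass
class Aggregator:
    """Server-side component: pulls from modules, pushes only changes."""
    push: Callable[[str, str], None]   # hypothetical push channel to the phone
    min_severity: str = "warning"
    last_seen: Dict[str, str] = field(default_factory=dict)

    def poll(self, readings: Dict[str, str]) -> int:
        """Process one round of pulled readings; return messages pushed."""
        sent = 0
        threshold = SEVERITY[self.min_severity]
        for machine, status in readings.items():
            previous = self.last_seen.get(machine, "ok")
            self.last_seen[machine] = status
            if status == previous:
                continue  # non-event: nothing changed, send nothing
            # Forward the transition if either side is severe enough,
            # so that recoveries are reported as well as failures.
            if max(SEVERITY[status], SEVERITY[previous]) >= threshold:
                self.push(machine, status)
                sent += 1
        return sent


outbox: List[Tuple[str, str]] = []
agg = Aggregator(push=lambda machine, status: outbox.append((machine, status)))
agg.poll({"blade1": "ok", "blade2": "ok"})        # all quiet, nothing pushed
agg.poll({"blade1": "critical", "blade2": "ok"})  # failure pushed
agg.poll({"blade1": "critical", "blade2": "ok"})  # unchanged, nothing pushed
agg.poll({"blade1": "ok", "blade2": "ok"})        # recovery pushed
print(outbox)
```

Four polling rounds over two machines produce two messages to the device. Only transitions cross the cellular link, while the frequent polling stays inside the data center.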
Slide 11/18 IBM Research
Solution Directions for Question 1
II. How do we handle mismatch in dependability characteristics (between target platform and management platform)?
– Mobile device can be physically compromised and OS-level protection can be bypassed
– Mobile devices are often employee owned
1. Platform being used for doing the systems management: Server to Mobile Devices
• Application security and server-side security need to be built in
  – Periodic authentications, not one-time authentications
  – Biometric-based authentication
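Periodic authentication, as opposed to one-time authentication, can be sketched as a session that goes stale after a fixed interval. This is a minimal sketch under stated assumptions: the `Session` class, the `verify` callback (standing in for a password or biometric check), and the 300-second interval are hypothetical and not taken from the slides.

```python
import time
from typing import Callable, Optional


class Session:
    """Session that must be re-authenticated every `max_age` seconds."""

    def __init__(self, verify: Callable[[str], bool], max_age: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.verify = verify        # hypothetical credential check
        self.max_age = max_age      # assumed re-authentication interval
        self.clock = clock          # injectable so the example can fake time
        self.authenticated_at: Optional[float] = None

    def authenticate(self, credential: str) -> bool:
        """Record the time of a successful check; clear state on failure."""
        if self.verify(credential):
            self.authenticated_at = self.clock()
            return True
        self.authenticated_at = None
        return False

    def is_valid(self) -> bool:
        if self.authenticated_at is None:
            return False
        return self.clock() - self.authenticated_at < self.max_age

    def run(self, command: Callable[[], str]) -> str:
        """Run a management command only while the session is fresh."""
        if not self.is_valid():
            raise PermissionError("re-authentication required")
        return command()


now = [0.0]  # fake clock, in seconds
session = Session(verify=lambda c: c == "secret", max_age=300,
                  clock=lambda: now[0])
session.authenticate("secret")
result = session.run(lambda: "restarted blade1")  # allowed: session is fresh
now[0] = 301.0  # five minutes later, run() raises until authenticate() is called again
```

The check sits in `run`, so a stolen or unattended device loses the ability to issue commands after the interval even if the application stays open.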
Slide 12/18 IBM Research
Solution Directions for Question 1
III. How do we visualize the needle in the haystack?
  – Needle: Outages, failures, or behavior that is indicative of an imminent failure
  – Haystack: Deluge of monitored data about target platforms
  – Screen real estate is limited
1. Platform being used for doing the systems management: Server to Mobile Devices
• First off, deliver only a small superset of relevant messages
  – Push notification, such as, through Google Cloud Messaging (GCM)
• Drill-down views, starting with summary alert view for all machines in data center
  – Followed up with root cause analysis techniques that run on servers
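A drill-down view of this kind can be sketched as two levels of aggregation: a summary for the whole data center that fits on a small screen, then the detail for one severity. The `Alert` record, the severity names, and the sample data below are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, List, NamedTuple


class Alert(NamedTuple):
    machine: str
    severity: str   # assumed levels: "critical", "warning", "info"
    message: str


def summarize(alerts: List[Alert]) -> Dict[str, int]:
    """Top level: one count per severity across all machines."""
    return dict(Counter(a.severity for a in alerts))


def by_machine(alerts: List[Alert], severity: str) -> Dict[str, List[str]]:
    """Second level: messages of one severity, grouped by machine."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for a in alerts:
        if a.severity == severity:
            grouped[a.machine].append(a.message)
    return dict(grouped)


alerts = [
    Alert("blade1", "critical", "power supply failed"),
    Alert("blade1", "info", "fan speed changed"),
    Alert("blade2", "warning", "temperature high"),
    Alert("blade3", "info", "login from console"),
]
summary = summarize(alerts)                 # what the first screen shows
critical = by_machine(alerts, "critical")   # shown after tapping "critical"
```

The first screen carries only the three counts in `summary`. The per-machine detail is fetched when the administrator selects a severity.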
Slide 13/18 IBM Research
Solution Directions for Question 2
I. Tight vertical integration of different software layers implies different domain experts will be concurrently involved in problem troubleshooting
1. Layered systems management to flat hierarchy, OR Crowdsourcing systems management
• Relevant features of social media will be used
  – Example: At IBM, you can “friend” specific Blade Centers and have “circles” of administrators
• Role-based Access Control (RBAC) can be used for security control of different software layers
  – Fine-grained roles can be assigned
  – RBAC solutions exist for sophisticated management of these roles, such as hierarchies, overlaps, and transience
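Role hierarchies in RBAC can be sketched as follows: a role inherits the permissions of its parent roles, and a user is allowed an action if any assigned role grants it. The role names, permission names, and the two lookup tables are hypothetical; the slides do not specify a role model.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "viewer": {"read_health"},
    "operator": {"restart_blade"},
    "admin": {"update_firmware"},
}
# Each role inherits everything granted to its parent roles.
ROLE_PARENTS: Dict[str, Set[str]] = {
    "operator": {"viewer"},
    "admin": {"operator"},
}


def permissions_of(role: str, seen: Optional[Set[str]] = None) -> Set[str]:
    """Collect a role's own permissions plus those inherited from parents."""
    seen = set() if seen is None else seen
    if role in seen:          # guard against cycles in the hierarchy
        return set()
    seen.add(role)
    perms = set(ROLE_PERMISSIONS.get(role, set()))
    for parent in ROLE_PARENTS.get(role, set()):
        perms |= permissions_of(parent, seen)
    return perms


def is_allowed(user_roles: Iterable[str], permission: str) -> bool:
    """A user may act if any of their roles grants the permission."""
    return any(permission in permissions_of(role) for role in user_roles)


can_restart = is_allowed(["operator"], "restart_blade")    # own permission
can_read = is_allowed(["operator"], "read_health")         # inherited from viewer
can_flash = is_allowed(["operator"], "update_firmware")    # admin only
```

Overlapping roles fall out of the same check, since `is_allowed` accepts several roles per user. Transience would need an expiry attached to each role assignment, which this sketch omits.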
Slide 14/18 IBM Research
Solution Directions for Question 2
II. Overlap between interests of multiple mobile devices and their geographical proximity
1. Layered systems management to flat hierarchy, OR Crowdsourcing systems management
• Commonalities of interest can be used to cut down on cellular bandwidth usage
  – Commonalities can exist due to proximal geographic location or overlap among system administration responsibilities
  – Distribute information to a subset of mobile devices and then use local communication (Bluetooth, Wi-Fi) to disseminate information among proximal devices
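One way to picture the subset-then-local-dissemination idea is a delivery plan that picks one relay per proximity group among the devices interested in a machine. This is a sketch under assumptions: the `Device` record, the coarse `location` label, and the alphabetical relay choice are invented for the example, and a real system might rotate relays to spread battery cost.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Device(NamedTuple):
    name: str
    location: str              # assumed coarse proximity group, e.g. a room
    interests: FrozenSet[str]  # machines this administrator follows


def plan_delivery(devices: List[Device], machine: str) -> Dict[str, List[str]]:
    """Map each relay (reached over cellular) to the peers it forwards to
    over local communication such as Bluetooth or Wi-Fi."""
    groups: Dict[str, List[str]] = {}
    for d in devices:
        if machine in d.interests:
            groups.setdefault(d.location, []).append(d.name)
    plan: Dict[str, List[str]] = {}
    for names in groups.values():
        relay, *peers = sorted(names)  # deterministic choice for the example
        plan[relay] = peers
    return plan


devices = [
    Device("alice", "lab", frozenset({"blade1"})),
    Device("bob", "lab", frozenset({"blade1", "blade2"})),
    Device("carol", "home", frozenset({"blade1"})),
    Device("dave", "lab", frozenset({"blade2"})),
]
plan = plan_delivery(devices, "blade1")
cellular_messages = len(plan)  # one per relay instead of one per device
```

Three devices follow `blade1`, but only two cellular messages are sent. The relay `alice` forwards to `bob` locally, and `dave`, who does not follow `blade1`, receives nothing.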
Slide 15/18 IBM Research
Case Study: IBM Remote
• Health view (left) broken into critical, non-critical, and system-level health messages
• Event log view (right) is filtered to show only warnings and errors
Slide 16/18 IBM Research
Related Work
• Much work on managing mobile devices, which is the opposite direction from what we are discussing in this paper
  – Some work on mobile agents for managing servers [18 – NOMS02, 19 – Software07]
  – Sophistication lies in designing a dynamic set of agents whose monitoring policies can be changed on the fly
• Some commercial prototypes for monitoring and control of target end points from mobile devices
  – UCSand for Android devices [21] for Cisco Unified Systems monitoring and control
  – PCMonitor [22] from MMSoft Design Ltd.
  – VMware vCenter Mobile Access [23] is a virtual appliance on the server side for managing a datacenter from mobile devices
  – Recent offering from HP [18]
Slide 17/18 IBM Research
Take-away Lessons
• A changed vision of systems management is happening – mobile clients being used to manage large masses of physical and virtual servers
• This throws open some technical challenges
1. Management to be done through resource-constrained mobile devices which have lower dependability than target devices
2. Crowd-sourcing of systems management, rather than linear flow of control through hierarchies of sysadmins
• These challenges are being addressed in multiple projects at commercial organizations, including in the IBM Remote project at IBM Research
Slide 18/18 IBM Research
Presentation available at: Dependable Computing Systems Lab (DCSL) web site
engineering.purdue.edu/dcsl