FlexFrame™ for SAP ® Version 4.0 FA Agents - Installation and Administration Edition March 2007 Document Version 2.0

  • Fujitsu Siemens Computers GmbH

    Copyright Fujitsu Siemens Computers GmbH 2007

    FlexFrame, PRIMECLUSTER, PRIMEPOWER and PRIMERGY are trademarks of Fujitsu Siemens Computers
    SPARC64 is a registered trademark of Fujitsu Ltd.
    SAP and NetWeaver are trademarks or registered trademarks of SAP AG in Germany and in several other countries
    Linux is a registered trademark of Linus Torvalds
    SUSE Linux is a registered trademark of Novell, Inc., in the United States and other countries
    Java and Solaris are trademarks of Sun Microsystems, Inc. in the United States and other countries
    Intel and PXE are registered trademarks of Intel Corporation in the United States and other countries
    MaxDB is a registered trademark of MySQL AB, Sweden
    MySQL is a registered trademark of MySQL AB, Sweden
    NetApp, Network Appliance, Open Network Technology for Appliance Products, Write Anywhere File Layout and WAFL are trademarks or registered trademarks of Network Appliance, Inc. in the United States and other countries
    Oracle is a registered trademark of ORACLE Corporation
    EMC, CLARiiON, Symmetrix, PowerPath, Celerra and SnapSure are trademarks or registered trademarks of EMC Corporation in the United States and other countries
    SPARC is a trademark of SPARC International, Inc. in the United States and other countries
    Ethernet is a registered trademark of XEROX, Inc., Digital Equipment Corporation and Intel Corporation
    Windows, Excel and Word are registered trademarks of Microsoft Corporation

    All other hardware and software names used are trademarks of their respective companies.

    All rights, including rights of translation, reproduction by printing, copying or similar methods, in part or in whole, are reserved. Offenders will be liable for damages. All rights, including rights created by patent grant or registration of a utility model or design, are reserved.

    Delivery subject to availability. Right of technical modification reserved.

  • FA Agents - Installation and Administration

    Contents

    1 Introduction
    1.1 FlexFrame-Autonomy
    1.2 Additional Documentation
    1.3 Target Group
    1.4 Notational Conventions
    1.5 Document History
    1.6 Related Documents

    2 First Steps
    2.1 Installation and Startup
    2.2 Installation Requirements
    2.2.1 The FlexFrame Solution
    2.2.2 Installation
    2.2.2.1 Installation Packages
    2.2.2.2 Standard Installation
    2.2.3 Configuration
    2.2.4 Starting and Stopping
    2.3 FA WebInterface
    2.3.1 Function
    2.3.2 Installation
    2.3.3 Configuration
    2.3.4 Starting and Stopping
    2.4 DomainManager

    3 FlexFrame Autonomy Architecture
    3.1 Pool Creation and Grouping
    3.1.1 Virtual FlexFrame Autonomy Pools
    3.1.2 Grouping
    3.1.2.1 Manual Group Creation
    3.1.2.2 Configuration in the LDAP Directory
    3.1.2.3 Automatic (generic) Group Creation
    3.2 Service Classes
    3.2.1 Service Priority
    3.2.2 Service Power Value
    3.2.3 Class Creation Rules
    3.2.4 Testament Types
    3.2.4.1 Service-specific Testaments
    3.2.4.2 Enqueue Service in Case of Replicated Enqueue
    3.3 FA Configuration, Work and Log Files
    3.4 Systems
    3.5 Service Types
    3.5.1 Replicated Enqueue Service
    3.5.2 Live Cache
    3.6 Generic Services
    3.6.1 Service State Model
    3.6.2 Service Detection Model
    3.6.3 Service Reaction Model
    3.7 FlexFrame Performance and Accounting Option
    3.7.1 Performance Option
    3.7.2 Accounting Option
    3.7.3 Billing

    4 FlexFrame Autonomy
    4.1 FlexFrame Autonomy Reactions
    4.1.1 Restart
    4.1.2 Reboot
    4.1.3 Switchover
    4.1.3.1 General Rules
    4.1.3.2 Internal Switchover
    4.1.3.3 External Switchover
    4.1.4 Maintenance
    4.2 Self-Repair Strategies
    4.2.1 Self-Repair in the Event of a Service Failure
    4.2.2 Self-Repair in the Event of a Node Failure
    4.2.3 Takeover by a Spare Node (Switchover)
    4.2.4 Multi Node Failure
    4.2.4.1 Case 1: ShortTime Failure
    4.2.4.2 Case 2: LongTime Failure
    4.3 Takeover Rules
    4.3.1 Overview
    4.3.2 TakeOver strategy
    4.3.2.1 Overview
    4.3.2.2 FirstFit
    4.3.2.3 LowPrioFit
    4.3.3 TakeOver Rule
    4.3.3.1 Overview
    4.3.3.2 Static Takeover Rule
    4.3.3.3 Dynamic TakeOver Rules
    4.4 Operating Mode
    4.4.1 Event Mode
    4.4.2 Local Reaction Mode
    4.4.3 Central Reaction Mode
    4.5 Autonomous Operation of a FlexFrame Infrastructure
    4.5.1 FlexFrame Autonomy and the Adaptive Computing Controller (ACC)
    4.5.2 FlexFrame Autonomy and FSC FlexFrame Scripts
    4.6 FlexFrame Autonomy and User Interactions
    4.6.1 myAMC.FA Agents: Starting/Stopping/Status
    4.6.1.1 Starting the myAMC.FA Agents Manually
    4.6.1.2 Stopping the myAMC.FA Agents Manually
    4.6.1.3 Status of the myAMC.FA Agents
    4.6.2 Starting/Stopping an SAP Instance
    4.6.2.1 Starting an SAP Instance
    4.6.2.2 Stopping an SAP Instance
    4.7 Possible Applications
    4.7.1 General
    4.7.2 Semi-autonomous Operation
    4.7.2.1 Monitoring of Application Instances
    4.7.3 Autonomy for Application Instances
    4.7.3.1 Restart
    4.7.3.2 Reboot
    4.7.3.3 Switchover
    4.8 FA Work and Log Files
    4.8.1 General
    4.8.2 Overview, Principal Directories, Files
    4.8.3 Collecting Diagnostic Information for Support Assistance
    4.8.4 Selected Files
    4.8.4.1 Livelist
    4.8.4.2 Services List
    4.8.4.3 Services Log
    4.8.4.4 Reboot
    4.8.4.5 Switchover
    4.8.4.6 XML Repository
    4.8.4.7 BlackBoard
    4.9 Migration of FA Agent Versions on Pool Level
    4.10 The FA Migration Tool
    4.10.1 Pool Mode
    4.10.2 File Mode
    4.10.3 Usage of Help
    4.11 Command Line Interface
    4.12 Command Execution at All Nodes of a Pool

    5 WebInterface
    5.1 Installation / Configuration
    5.1.1 Prerequisites
    5.1.2 Installation
    5.1.3 Configuration
    5.1.3.1 Web Server
    5.1.3.2 Login IDs
    5.1.3.3 Link to the myAMC.Messenger Database
    5.1.3.4 LDAP Options
    5.1.3.5 GUI Options
    5.1.3.6 Other Settings
    5.2 Visualization
    5.2.1 Starting the WebInterface / Access via Web Browser
    5.2.2 Login
    5.2.3 Overview of Elements
    5.2.4 Pool / Group Tree
    5.2.4.1 Overview
    5.2.4.2 Status
    5.2.4.3 Selecting an Element
    5.2.4.4 Different Tree Presentations
    5.2.5 Status Display
    5.2.5.1 Node Panel
    5.2.5.2 System Panel
    5.2.5.3 Instance Panel
    5.2.5.4 Assigning States to Colors
    5.2.6 Message Display
    5.2.6.1 Fields
    5.2.6.2 Navigation
    5.2.6.3 Viewsets
    5.2.6.4 Sorting
    5.2.7 Configuration of FlexFrame Autonomy with the Webinterface
    5.3 Interaction
    5.3.1 Commands
    5.3.1.1 Activating the Context Menus
    5.3.1.2 Confirming a Command
    5.3.1.3 Pools
    5.3.1.4 Groups
    5.3.1.5 Nodes
    5.3.1.6 System
    5.3.1.7 Instance
    5.3.2 Updates
    5.3.2.1 Update Interval
    5.3.2.2 Manual Update
    5.3.2.3 Reinitialization
    5.3.2.4 Pause Mode (No Update)
    5.4 Info and Help
    5.5 FlexFrame Performance and Accounting Plug-in
    5.5.1 FlexFrame Reporting Plug-in

    6 FlexFrame Autonomy Power Shutdown Concept
    6.1 General
    6.2 Power Shutdown Architecture
    6.3 Basics
    6.3.1 Power Shutdown for Blade Systems
    6.3.2 Power Shutdown for PRIMERGY Systems
    6.3.3 Power Shutdown for PRIMEPOWER Systems
    6.4 Power Shutdown Configuration
    6.4.1 Switchover Control Parameters
    6.4.2 User, Password and Community
    6.4.3 Management Blades
    6.4.4 Application Nodes
    6.4.5 Default Shutdown Mode

    7 Parameter Reference
    7.1 FA Agents
    7.1.1 FA Agent Configuration Files
    7.2 SNMP Traps
    7.2.1 General
    7.2.2 Structure
    7.2.3 Default Parameter File
    7.3 Pooling and Grouping
    7.3.1 Pooling
    7.3.2 Grouping
    7.3.2.1 LDAP Grouping
    7.3.2.2 Manual Group Assignment
    7.3.2.3 Generic Grouping
    7.3.3 Default Parameter File
    7.4 Service Classes
    7.4.1 Service Priority
    7.4.2 Service Power Value
    7.4.3 Class Creation Rules
    7.4.4 Example
    7.5 FlexFrame Autonomy
    7.5.1 General Parameters
    7.5.2 Parameters for the Performance and Accounting Option
    7.5.3 Node-Related Parameters
    7.5.4 Service-Related Parameters
    7.5.5 Parameters for the Definition of a Generic Service
    7.5.5.1 Parametering of the Service Detection
    7.5.6 Path Configuration
    7.5.7 Shutdown Configuration
    7.5.8 Default Parameter File

    8 BlackBoard
    8.1 General
    8.2 Implementation
    8.3 Generating BlackBoard Commands
    8.3.1 WebInterface
    8.3.2 Interactive
    8.4 Commands

    9 FlexFrame Autonomous Agent Traps
    9.1 Format of the FlexFrame Autonomy SNMP Traps
    9.2 Severities
    9.3 Overview of the FlexFrame Autonomy SNMP Traps

    10 Troubleshooting

    11 Abbreviations

    12 Glossary

    13 Index

    1 Introduction

    For many companies, applications such as SAP now provide the basis for handling all important business processes. Failure of these components therefore results in considerable costs. Companies must also be able to react very rapidly to changing market and organizational demands, which means that it must be possible to adapt the capacity of existing IT resources very quickly to changing requirements.

    The myAMC components, which monitor the availability and utilization of IT systems and respond automatically to system failures, are the answer to these demands. FlexFrame-Autonomy complements the powerful monitoring and management functions of myAMC with functions that permit the autonomous operation of a distributed application environment. These functions reduce the number of manual interventions and make the operation of your business-critical applications more efficient.

    FlexFrame offers a flexible hardware architecture which can be adapted to altered requirements and, together with management components, permits highly available operation of this infrastructure. Partial outages are automatically repaired or compensated for. FlexFrame Autonomy is an integral component of every FlexFrame solution. Its built-in autonomy functions support anything from operation with considerably fewer operator interventions right up to a high-availability solution.

    This manual describes the functional concepts and the application scenarios for FlexFrame-Autonomy.

    FlexFrame Autonomy for distributed database instances, SAP central instances and SAP application instances

    FlexFrame Autonomy supports SAP, SAPDB and Oracle instances

    Status monitoring, restart, reboot or switchover of instances

    1.1 FlexFrame-Autonomy

    The Application Management Center myAMC is a solution for monitoring and managing IT infrastructures. The resources required for a business process, from a printer and the network to the server and the applications that run on it, can be monitored using myAMC. The FlexFrame Autonomy component myAMC.FA substantially extends the range of functions. In addition to monitoring, this component also provides the option of restoring failed services automatically. These self-repair mechanisms are not limited to a single local system, however; they also permit a failed service to be moved automatically to another resource which, in line with a defined rule for operation, is suitable for operating the service.

    This function permits a considerable reduction in the number of manual interventions by an administrator. Availability is increased, and the costs for operating a complex applications environment are reduced.

    For this functionality, myAMC.FA uses its agents and management components to detect, collect and analyze the information. Autonomous functions can be configured for various tasks and requirements by combining different detectors and manager components and by defining and selecting the reaction and decision rules. In conjunction with the powerful myAMC GUI, the entire infrastructure can be presented in a straightforward manner in an IT cockpit.
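    The interplay of detectors, decision rules and reactions described above can be pictured with a small Python sketch. It is purely illustrative: the names `ServiceStatus`, `restart_first`, `then_switchover` and `decide`, the rule order and the limit of two restart attempts are assumptions made for this example and are not part of myAMC.FA, its configuration files or its interfaces. Only the reaction names (restart, reboot, switchover) are taken from this manual.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class Reaction(Enum):
    """Reactions named in this manual, plus NONE for a healthy service."""
    NONE = "none"
    RESTART = "restart"
    REBOOT = "reboot"
    SWITCHOVER = "switchover"


@dataclass(frozen=True)
class ServiceStatus:
    """What a (hypothetical) detector reports about one service."""
    service: str
    node: str
    running: bool
    failed_restarts: int = 0


# A decision rule maps a status to a reaction, or None if it does not apply.
Rule = Callable[[ServiceStatus], Optional[Reaction]]


def restart_first(status: ServiceStatus) -> Optional[Reaction]:
    # Assumed policy: try a local restart up to two times.
    if not status.running and status.failed_restarts < 2:
        return Reaction.RESTART
    return None


def then_switchover(status: ServiceStatus) -> Optional[Reaction]:
    # Assumed policy: once restarts are exhausted, move the service elsewhere.
    if not status.running:
        return Reaction.SWITCHOVER
    return None


def decide(status: ServiceStatus, rules: Sequence[Rule]) -> Reaction:
    """Return the reaction of the first rule that applies."""
    for rule in rules:
        reaction = rule(status)
        if reaction is not None:
            return reaction
    return Reaction.NONE


if __name__ == "__main__":
    rules = [restart_first, then_switchover]
    print(decide(ServiceStatus("app01", "node1", running=False), rules).value)
    print(decide(ServiceStatus("app01", "node1", False, failed_restarts=2), rules).value)
```

    Keeping the rules as an ordered list of small functions mirrors the idea of selecting and combining reaction and decision rules: changing the behavior means changing the list, not the decision loop.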

    1.2 Additional Documentation

    Further application options for other myAMC management components are described in the document myAMC.Overview. Use of the Messenger for editing and forwarding myAMC.FA messages is described in the documentation myAMC.Messenger.

    1.3 Target Group

    This documentation is intended to support both users of FlexFrame Autonomy and administrators who wish to integrate this solution in an enterprise IT management solution.

    1.4 Notational Conventions

    The following conventions are used in this manual:

    Additional information that should be observed.

    Warning that must be observed.

    fixed font Names of paths, files, commands, and system output.

    Names of variables.

    fixed font User input in command examples (where applicable, with variables)

    1.5 Document History

    Document Version   Changes                    Date
    1.0                First Edition              2005-03-22
    1.1                Adding some new features   2006-05-26
    2.0                FA Agents Version 3.0      2006-11-30

    1.6 Related Documents

    FlexFrame for SAP Planning Tool

    FlexFrame for SAP Installation of a FlexFrame Environment

    FlexFrame for SAP Installation Guide for SAP Solutions

    FlexFrame for SAP Administration and Operation

    FlexFrame for SAP Network Design and Configuration Guide

    FlexFrame for SAP Installation ACC 1.0 SP13

    FlexFrame for SAP myAMC.FA_LogAgent - Concept and Usage

    FlexFrame for SAP Upgrading FlexFrame 3.1 or 3.2 to 4.0

    FlexFrame for SAP White Paper

    PRIMECLUSTER Documentation

    ServerView Documentation

    SUSE Linux Enterprise Server Documentation

    Solaris Documentation

    2 First Steps

    2.1 Installation and Startup

    This chapter describes how you start and stop the FlexFrame Autonomy components. It also describes how FlexFrame Autonomy is installed and how its basic configuration is set up.

    FlexFrame Autonomy provides a comprehensive, flexible and scalable solution for setting up semi-autonomous IT processes. Its functionality falls into three subareas:

    FA_AppAgents: FlexFrame Autonomy Application Agents for monitoring, checking and controlling instances

    FA_CtrlAgent: FlexFrame Autonomy Control Agent for monitoring, checking and controlling Application Nodes with a separate Control Node.

    FA WebInterface: A component for displaying the active services on a web front-end.

    To monitor instances, the FA_AppAgent reports the availability of each instance cyclically at a configurable interval. For this purpose the FA_AppAgent must be active on every node.

    myAMC.Messenger is used to forward information on faults and autonomous reactions to the outside. This messaging component of the myAMC family should be operated on the Control Node.

    2.2 Installation Requirements

    2.2.1 The FlexFrame Solution

    The FlexFrame Autonomy solution was conceived and developed especially for the FlexFrame for SAP solution from Fujitsu Siemens Computers. Consequently the FlexFrame solution with the components Shared OS, Virtualized SAP Application and NetApp Storage on the target computers is a prerequisite for the procedure described in the following.

    Further details on FlexFrame configurations can be found in the FlexFrame manual Installation of a FlexFrame Environment. Use of FlexFrame Autonomy on other Linux architectures (e.g. standalone systems or for monitoring processes which do not belong to SAP R/3) is not described in this manual and is not supported.

    The following prerequisites are thus particularly important:

    Server architecture with IP storage (NetApp Filer) and Client, Server and Storage LANs.

    Paths for read-only and read/write Root Images.

    SAP start scripts from Fujitsu Siemens Computers

    Operating system SUSE Linux Enterprise Server (SLES)

    FA Agents are installed in a directory on the storage system which is reachable and available to all nodes in accordance with the FlexFrame rules for jointly used programs. Programs are always accessed and installed via a Control Node.

    The FA Agents are installed using two RPM packages. Normally the agents are stored in the directories /opt/myAMC/FA_AppAgent and /opt/myAMC/FA_CtrlAgent. In a FlexFrame environment the /opt/myAMC directory is located on the Filer and is available from every Application Node and Control Node.

    Multiple FlexFrame Autonomy versions can be installed simultaneously. Installation, configuration and activation of a version are three separate activities.

    Installation, parameterization and configuration of a new version can thus be performed during ongoing operation. Only when all preparations have been completed is the active version deactivated and the new version activated.

    Deactivation and activation of a version always takes place on a pool-specific basis. In this way new agent versions can, for example, first be activated in a pool with test systems.

    2.2.2 Installation

    In the case of a FlexFrame standard installation, new software components are installed via one of the Control Nodes. The FlexFrame Autonomy software is contained in the /opt/myAMC directory. Ensure that all servers (Control and Application Nodes) use the same directories. FlexFrame Autonomy is thus also installed in a tree on a Filer (NFS share).

    The NFS file systems used have to support NFS file locking.

    control1:/opt/myAMC # mount
    filer1_qa:/vol/volFF/FlexFrame/myAMC on /FlexFrame/myAMC type nfs (rw,nfsvers=3,intr,noac,wsize=32768,rsize=32768,addr=172.16.1.204)
    filer1_qa:/vol/volFF/FlexFrame/scripts on /FlexFrame/scripts type nfs (rw,nfsvers=3,intr,nolock,noac,wsize=32768,rsize=32768,addr=172.16.1.204)
    control1:/opt/myAMC # ls -al /opt/myAMC
    lrwxrwxrwx 1 root root 16 Dec 2 18:35 /opt/myAMC -> /FlexFrame/myAMC
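
    In the mount output above, the myAMC share is mounted without the nolock option, whereas the scripts share uses nolock. The following Python sketch shows one possible way to check mount output for NFS shares on which file locking is disabled. It is not part of the FA software; the function name nfs_mounts_without_locking is hypothetical, and the sketch assumes the output format of the Linux mount command as shown above.

```python
import re

# Hypothetical helper, not part of the FA software: parses lines in the
# format printed by the Linux `mount` command and reports NFS mounts
# that were mounted with the `nolock` option (file locking disabled).
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \(([^)]*)\)$")


def nfs_mounts_without_locking(mount_output):
    """Return the mount points of NFS file systems mounted with 'nolock'."""
    offenders = []
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if not match:
            continue  # skip lines that are not mount entries
        _source, mount_point, fs_type, options = match.groups()
        if fs_type.startswith("nfs") and "nolock" in options.split(","):
            offenders.append(mount_point)
    return offenders


# Sample taken from the mount output shown in this section.
SAMPLE = """\
filer1_qa:/vol/volFF/FlexFrame/myAMC on /FlexFrame/myAMC type nfs (rw,nfsvers=3,intr,noac,wsize=32768,rsize=32768,addr=172.16.1.204)
filer1_qa:/vol/volFF/FlexFrame/scripts on /FlexFrame/scripts type nfs (rw,nfsvers=3,intr,nolock,noac,wsize=32768,rsize=32768,addr=172.16.1.204)
"""

if __name__ == "__main__":
    print(nfs_mounts_without_locking(SAMPLE))
```

    For the sample, only /FlexFrame/scripts is reported; the share that holds /opt/myAMC supports locking as required.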

    2.2.2.1 Installation Packages

    The following packages must be installed:

    myAMC.FlexFrame Autonomy Application Agent; the installation package for this is called

    myAMC.FA_AppAgent-X.Y-Z.i386.rpm

    myAMC.FlexFrame Autonomy Control Agent; the installation package for this is called

    myAMC.FA_CtrlAgent-X.Y-Z.i386.rpm

    myAMC.FlexFrame Autonomy WebInterface; the installation package for this is called

    myAMC.FA_WebGui-X.Y-Z.i386.rpm

    myAMC.DomainManager (optional, e.g. for the Performance and Accounting option); the installation package for this is called

    myAMC.FA_DomainManager-X.Y-Z.i386.rpm

    where X.Y-Z stands for the version number.

    2.2.2.2 Standard Installation

    Standard installation is performed from a directory tree which is fully writable by the user root.

    1. Log onto the target computer as root and copy the rpm packages to a temporary directory.

    2. Install the required package with

    rpm -ihv --nodeps myAMC.FA_AppAgent-X.Y-Z.i386.rpm

    After all the required packages have been installed, the start scripts may need to be copied to the ROOT Images of the various node types (Application / Control).

    Only the myAMC.FA_CtrlAgent may run on the Control Node, and only the myAMC.FA_AppAgent may run on the Application Nodes.

    2.2.3 Configuration

    The FlexFrame Autonomy Agents do not require any additional configuration for use in productive operation.

    The myAMC_FA.xml file is created during installation. This file already contains a complete parameter set for the operation of the FA_AppAgents and FA_CtrlAgents. The services to be monitored and the reaction scenarios which run in the event of problems are parameterized in this file. The parameters and their default values are described in section 7.5. The mode in which the agents are to operate is also configured here.

    In the course of the startup, the start and stop times, the function of the MonitorAlerts, and the times for a reboot and switchover in particular need to be checked. The MonitorAlerts are a component part of the FlexFrame basic installation. The MonitorAlertTime must always be at least three times as great as the parameterized CheckCycleTime.
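
    The relation between MonitorAlertTime and CheckCycleTime can be verified before the agents are restarted. The following Python sketch only illustrates the rule stated above; the function name is hypothetical, and it is assumed that both values are given in the same unit.

```python
def monitor_alert_time_is_valid(monitor_alert_time, check_cycle_time):
    """Check the rule MonitorAlertTime >= 3 * CheckCycleTime.

    Assumption: both values use the same unit (e.g. seconds).
    """
    return monitor_alert_time >= 3 * check_cycle_time


if __name__ == "__main__":
    # Example values are illustrative only, not defaults of the FA Agents.
    print(monitor_alert_time_is_valid(30, 10))
    print(monitor_alert_time_is_valid(29, 10))
```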

    In the startup scenarios, the real start, stop, restart and reboot times must be determined individually for each service type. If the times specified for start, restart, reboot or switchover are not sufficient, this can result in unwanted reaction escalations.

    Changes in the parameter file become effective only after the agents have been re-started.

    The FA migration tool transfers the data of the configuration file of an existing installation automatically to a new configuration file. Parameters which were not yet present in the older version of the configuration file are initially set to their default values.
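
    The behavior described for the migration can be pictured as a merge of old values into a set of new defaults. The following Python sketch illustrates this idea only; it is not the FA migration tool. The function name and the parameter NewParameter are hypothetical, and dropping parameters that no longer exist in the new version is an assumption which the text above does not confirm.

```python
def migrate_parameters(old_params, new_defaults):
    """Merge an old parameter set into the defaults of a new version.

    Values from the old configuration are kept; parameters that are new
    in the new version keep their default value.
    ASSUMPTION: parameters unknown to the new version are dropped.
    """
    migrated = dict(new_defaults)
    for name, value in old_params.items():
        if name in new_defaults:
            migrated[name] = value
    return migrated


if __name__ == "__main__":
    # Illustrative values; "NewParameter" is a hypothetical name.
    old = {"CheckCycleTime": 10, "MonitorAlertTime": 40}
    defaults = {"CheckCycleTime": 15, "MonitorAlertTime": 45, "NewParameter": "default"}
    print(migrate_parameters(old, defaults))
```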

    2.2.4 Starting and Stopping

    During installation, links to the start/stop scripts of the FA Agents are created in /etc/init.d/. Call a script without any options to display all available options, e.g. start or stop.

    Example: Starting the FA Application Agent:

    /etc/init.d/myAMC.FA_AppAgent start
    /opt/myAMC/FA_AppAgent/myAMC.FA_AppAgent start

    Example: Starting the FA Control Agent:

    /etc/init.d/myAMC.FA_CtrlAgent start
    /opt/myAMC/FA_CtrlAgent/myAMC.FA_CtrlAgent start

    2.3 FA WebInterface

    2.3.1 Function

    The FA WebInterface visualizes all nodes and services which exist in a FlexFrame System insofar as these are monitored by an FA_AppAgent. The status, availability and messages of the Application and Control Agents are displayed.

    2.3.2 Installation

    The installation package is called myAMC.FA_WebGui-X.Y-Z.i386.rpm.

    A prerequisite here is that an Apache Tomcat Servlet Container is installed. Currently Tomcat >= 5.0.x is supported.

    2.3.3 Configuration

    Provided no paths have been changed in the FA configuration, the configuration of the WebInterface is restricted to linking it into the Tomcat configuration file. For this purpose the following change must be made in the Tomcat configuration file (e.g. /opt/jakarta-tomcat-/conf/server.xml):

    The following line has to be added at the end of the configuration file (in front of ):

    Changes to the web server require Tomcat to be restarted or reloaded.

    Further settings can be made in the files /opt/myAMC/config/FA_WebGui.conf (general settings, paths, cycle times, database settings) and /opt/myAMC/config/amc-users.xml (user administration). The various settings are described in section 5.1.3.

    Changes require the FA WebInterface to be restarted or reloaded (e.g. via the Tomcat Service Manager) or Tomcat to be restarted or reloaded.

    2.3.4 Starting and Stopping

    The WebInterface can always be reached if Apache Tomcat is running. Tomcat can generally be started using the script /etc/init.d/jakarta-tomcat start.

    The WebInterface can then be reached at the following address:

    http://:8080/FAwebgui/

    The specified port can be changed in the Tomcat configuration file server.xml.

    Prerequisites here are Mozilla >= 1.4.1 or Internet Explorer >= 6.0 and the Sun Java plugin >= 1.4.2.

    2.4 DomainManager

    The DomainManager is installed on the Control Node. It can be integrated in PRIMECLUSTER, but this is not the default; the integration has to be carried out individually for each project.

    The accounting and performance data collected by the FA Application Agents is automatically adopted by the ITDW and can be visualized and evaluated with the help of the FA WebGUI with the Accounting and Performance management plugin.

    The DomainManager is configured via the file /opt/myAMC/DomainManager/config/DomainManager.xml. Pool-specific configuration is also possible. Changes to parameters in the DomainManager configuration are dynamically recognized and adopted.

    Alternatively to processing through the DomainManager, the files can also be accessed by an external DomainManager which runs outside of FlexFrame. In addition to this, extension of the Tomcat server by means of the myAMC.Fileretriever module is possible. This is optional and not part of the standard delivery.

    3 FlexFrame Autonomy Architecture

    FlexFrame Autonomy is a powerful component for high-availability operation of systems with distributed instances. A FlexFrame solution consists of Storage, Application Servers and redundant Control Nodes. This product has been implemented for this solution comprising storage, servers and connectivity. It enables fast and flexible setup of solutions whose autonomous functions simplify the operation of applications and make it more flexible. The figure below shows an overview of the FlexFrame architecture and the associated FlexFrame Autonomy components:

    The benefit of the FlexFrame Autonomy solution lies in the flexibility for integrating new nodes and instances without changing the configuration.

    Components of FlexFrame-Autonomy:

    FlexFrame Autonomy Application Agents (FA_AppAgents)

    FlexFrame Autonomy Control Agents (FA_CtrlAgents)

    The FlexFrame Autonomy components permit highly available, semi-autonomous operation of distributed applications. In principle the instances can be distributed to any number of nodes within a FlexFrame solution. The individual services are monitored via FlexFrame Autonomy Agents. By default, the Application Agents currently support SAP central instances and SAP application instances, as well as SAPDB and Oracle database instances.

    3.1 Pool Creation and Grouping

    FlexFrame Autonomy Version 2.0 permits pool creation and grouping functions to be implemented.

    3.1.1 Virtual FlexFrame Autonomy Pools

    A virtual FlexFrame Autonomy pool is created by assigning hardware resources to it. From the viewpoint of autonomy and of the high-availability functions, an Autonomy pool is an independent structure. In a standard installation of Version 1, all resources of a FlexFrame solution are managed in a single pool. Configuration of the pools takes place directly with the configuration of FlexFrame in the LDAP. The FA_AppAgents ascertain the pool affiliation at startup. Configuration of the FlexFrame Agents always relates to one pool, i.e. there is one directory tree with the parameters and configuration data for each pool.

    In a pool, the FA Agents provide the autonomous functions restart, reboot and switchover of services and nodes. These reactions no longer relate to all nodes of a FlexFrame solution, but only to the set of nodes which belong to the same pool.

    Pool creation results in virtual FlexFrame Autonomy pools being created, each of which performs autonomous functions independently of other pools which exist in the same FlexFrame solution.

    A FlexFrame Autonomy pool always consists of one Control Agent and n Application Agents. Each Control Agent is responsible only for the Application Nodes which belong to its pool and shares a joint config and data directory with its Application Nodes. For each pool it is thus possible to parameterize autonomous behavior which is independent of other pools.

    The flexibility and security of a virtual FlexFrame pool is based on two major new features which the FlexFrame Autonomy Agents provide.

    A Control Agent for each virtual FlexFrame Autonomy pool

    Each Application Agent is provided with a flexible assignment to the pool and thus to the Control Agent with which it interworks.

    The use and interleaving of these two new options with the FlexFrame basic functionality offers a large number of new options to enhance the flexibility in server farms.

    The virtual FlexFrame Autonomy pools provide the option of simple and secure operation of multiple Autonomy clusters which run in parallel and simultaneously in a distributed IT infrastructure.

    [Figure: FA_Version 1.0 compared with FA_Version 1.x; the labels in the figure show config and data directories]

    A virtual FlexFrame Autonomy pool offers the advantage of complete separation of all reactions and the associated parameter sets for the start and stop times. FlexFrame Autonomy can also be completely disabled for a virtual pool (e.g. for service and maintenance) without affecting any other virtual FlexFrame Autonomy pool which is running in parallel.

    A virtual FlexFrame Autonomy pool does not share its FlexFrame Autonomy Agents with any other pool. Depending on the configuration, the virtual FlexFrame Autonomy pools can therefore use different versions of the agent binaries.

    3.1.2 Grouping

    For flexible server farming, FlexFrame offers grouping functions which differ from the pool in that these enable nodes and services within a pool to be assigned to different groups. A group is thus always a part of a virtual FlexFrame pool.

    Grouping can also be implemented according to the same generic rules. Group schemas can be defined for this purpose. In the parameter file you select the schema which is to be used for group creation.

    The configuration information for the groups is stored in the myAMC_FA_Group.xml file. The entries in this file can be made manually or by taking them over from the LDAP directory. Configuration can take place through concrete assignment or through generic assignment.

    3.1.2.1 Manual Group Creation

    The group assignment is entered in the configuration file manually. In the event of manual group creation each node name is unambiguously assigned a group name.

    3.1.2.2 Configuration in the LDAP Directory

    As of FlexFrame V3.1 the group information can be stored in the LDAP directory. When the Agents are started, the group information is read directly from the LDAP directory.

    3.1.2.3 Automatic (generic) Group Creation

    Automatic group creation is performed on the basis of generic information which the Application Agents can ascertain automatically. For generic group creation it makes sense to use the host names, the IP addresses or the operating system employed.

    In the event of generic group creation, the myAMC_FA_Group.xml file does not contain the concrete host name but a creation element which enables the algorithm for generic group creation to find a group assignment.

    Example: On the basis of the platform information, i.e. if no manual configuration of groups takes place, two groups are created: the group of the Linux nodes within a FlexFrame installation and the group of the Solaris nodes within a FlexFrame installation.

    In this case the group name is also created generically. For this purpose each schema is assigned a group naming rule which combines a fixed part with a variable part.
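
    The platform-based example can be sketched in a few lines of Python. This is an illustration of the naming principle (fixed part plus variable part) only, not the algorithm used by the FA Agents; the function name, the fixed part "Group_" and the host names are hypothetical.

```python
from collections import defaultdict


def create_groups(nodes, fixed_part="Group_"):
    """Assign nodes to generically named groups.

    nodes: iterable of (host_name, platform) pairs.
    The group name combines a fixed part with a variable part, here the
    platform. ASSUMPTION: "Group_" is an illustrative fixed part.
    """
    groups = defaultdict(list)
    for host_name, platform in nodes:
        groups[fixed_part + platform].append(host_name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical host names.
    nodes = [("an1", "Linux"), ("an2", "Solaris"), ("an3", "Linux")]
    print(create_groups(nodes))
```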

    Automatic group creation is not currently used by myAMC FA Agents in a FlexFrame environment, as the groups are usually configured statically by the FlexFrame configuration tool.

    3.2 Service Classes

    The service classes are required for the prioritized operation of individual services or systems. A service is defined as an application instance which must be identified unambiguously and which can be started and stopped individually, e.g. central instance, application instance or database instance (CI, APP, DB).

    A service class defines the minimum requirements which must be provided when services are taken over in the event of a switchover.

    When multiple nodes fail simultaneously, the spare nodes in the group take into account the priorities of the services which have failed. First all prio 1 services are taken over, and only then all services with a lower priority. It will be possible to extend the attributes of a service in the future, as already shown in the examples (e.g. operating system).

    A system is a logical unit consisting of multiple service instances which together define a system. In an SAP system these comprise the database instance, central instance and application instances.

    The following attributes are defined in the service classes:

    Service priority

    Service power value

    In the future it will be possible to enhance such a service class by further attributes which, for example, define the operating system required by a service or the number of CPUs or the performance requirement of the service.

    3.2.1 Service Priority

    The highest service priority is 1. Every service is assigned this priority by default, i.e. if no service classes are defined, all services have the priority 1. The higher the number, the lower the priority of a service.

    Priority 0 has a special status. Setting priority 0 for a service class enables the autonomous functions to be disabled for a service.

    The service priority is evaluated for all autonomous reactions. If, for example, a service of a productive system and a service of a test system are running on the same node and the test system's service is assigned priority 5, an autonomous reaction triggered by the test system's service is not executed, because the productive system's service, which is functioning without error, has the higher priority of 1.
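
    The priority rules (1 is the highest priority, larger numbers mean lower priority, 0 disables the autonomous functions) can be illustrated with a short Python sketch. The function name and the dictionary layout are hypothetical; the sketch only shows the ordering, not the decision logic of the FA Agents.

```python
def takeover_order(failed_services):
    """Order failed services for takeover by service priority.

    Each service is a dict with a "name" and an optional "priority".
    Priority 1 is the highest and also the default; priority 0 means
    that the autonomous functions are disabled for the service.
    """
    eligible = [s for s in failed_services if s.get("priority", 1) != 0]
    # sorted() is stable, so services with equal priority keep their order.
    return sorted(eligible, key=lambda s: s.get("priority", 1))


if __name__ == "__main__":
    # Hypothetical service names.
    failed = [
        {"name": "test_ci", "priority": 5},
        {"name": "prod_db"},
        {"name": "sandbox_app", "priority": 0},
        {"name": "prod_ci", "priority": 1},
    ]
    print([s["name"] for s in takeover_order(failed)])
```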

    3.2.2 Service Power Value

    The service power value specifies for a service a performance number which defines the maximum performance (SAPS) required by this service.

    This value is used for takeover scenarios; the add rule requires the service power value.

    A failed service with a performance value of 50 can, for example, be taken over by a node which still has at least 50 of its maximum performance number free.
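
    The capacity check from the example can be written down as follows. This Python sketch assumes that the free capacity of a node is its maximum performance number minus the sum of the power values of the services already running on it; this reading of the add rule, and all names in the code, are assumptions made for illustration.

```python
def can_take_over(node_max_power, running_power_values, failed_power_value):
    """Check whether a node has enough free performance for a failed service.

    ASSUMPTION: free capacity = maximum performance number of the node
    minus the sum of the power values of its running services.
    """
    free = node_max_power - sum(running_power_values)
    return free >= failed_power_value


if __name__ == "__main__":
    # A node with maximum 100 that already runs services of 30 and 20
    # has 50 free and can take over a failed service with value 50.
    print(can_take_over(100, [30, 20], 50))
    print(can_take_over(100, [30, 30], 50))
```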

    3.2.3 Class Creation Rules

    A service belongs either to the default class which always exists or it can be assigned unambiguously to another class by evaluating the aforementioned variables.

    3.2.4 Testament Types

    The switchover scenarios use testaments to transport the service information to other nodes. The creation of the testaments can be node-based or service-based. With node-based testaments, all services of a node are always taken over together by one takeover node. With service-based testaments, the services can be taken over by different nodes.

    The parameter for the testament type and the takeover rules therefore strongly influence the possible takeover scenarios.
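
    The difference between the two testament types can be illustrated with a small Python sketch. The data layout and all names are hypothetical and do not reflect the format of the testaments written by the FA Agents; the sketch only shows that a node-based testament bundles all services of a node, while service-based testaments allow each service to go to a different node.

```python
def build_testaments(node_name, services, testament_type):
    """Build illustrative testaments for the services of a failed node.

    testament_type "node":    one testament containing all services.
    testament_type "service": one testament per service.
    """
    if testament_type == "node":
        return [{"origin": node_name, "services": list(services)}]
    if testament_type == "service":
        return [{"origin": node_name, "services": [s]} for s in services]
    raise ValueError("unknown testament type: %r" % (testament_type,))


if __name__ == "__main__":
    # Hypothetical node and service names.
    print(build_testaments("an1", ["DB", "CI"], "node"))
    print(build_testaments("an1", ["DB", "CI"], "service"))
```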

    3.2.4.1 Service-specific Testaments

    Service-specific testaments are used for services which require individual takeover scenarios. The following services require service-specific testaments.

    3.2.4.2 Enqueue Service in Case of Replicated Enqueue

    The enqueue service with replicated enqueue service has its own service type. This special testament is built dynamically if a replicated enqueue service exists. For this service-specific testament, the service-based takeover rule applies. Only nodes with a replicated enqueue service can apply.

    3.3 FA Configuration, Work and Log Files

    The figure below provides an overview of the configuration and log files which are generated by FA components and stored on the common file system. These files also form the permanent memory which is required, for example, to restore the services needed when a system is rebooted.

    Directory structure of /opt/myAMC/:

    ./FAwebgui                      FA web interface
    ./vFF/Common/.vFF_template.     Template for pool directories
    ./vFF/vFF_                      Pool directory
      config                        Pool-specific configuration data
      log                           Log files
      log/AppAgt                    Log files of Application Nodes
      log/CtrlAgt                   Log files of Control Nodes
      log.common                    Common log files
      data/FA/                      FA data directory
      data/FA/blackboard            Blackboard directory
      data/FA/livelist              Live list
      data/FA/servicelists          Service files of all nodes
      data/FA/servicelogs           Service files of all nodes (history)
      data/FA/xmlrepository         XML files for the web interface
      data/FA/reboot                Reboot files for all nodes
      data/FA/switchover            Switchover files for all nodes
      data/FA/performance           Performance and accounting files

    In a standard FlexFrame installation the directory tree /opt/myAMC exists for myAMC.FA. All the directories and files required for the myAMC.FA software are located here.

    3.4 Systems

    A system is based on several services which belong to a logical group. SAP systems are an example of logical systems. The services of such a system can be distributed across several Application Nodes within one pool.

    The FA_AppAgents autonomously identify the services and the system they belong to; standard SAP services are identified automatically.

    3.5 Service Types

    The FA_AppAgents are able to identify standard SAP services and their hierarchy in a logical SAP system. For these services the FA_AppAgents perform the autonomous reactions restart, reboot and switchover.

    DB, CI, APP, J, JC, SCS, ASCS, ERS, LC

    Version 3.0 of the FA Agents can monitor the service types SCS and ASCS with replicated enqueue service (ERS). The detection of SCS/ASCS with or without replicated enqueue service is done automatically.

    3.5.1 Replicated Enqueue Service

    Version 3.0 of the FA Agents can monitor the service types SCS and ASCS in replicated enqueue service (ERS) scenarios. The detection of SCS/ASCS with or without replicated enqueue service is done automatically.

    For an SCS or ASCS service there is a replicated enqueue service on which the enqueue table is replicated. If the SCS or ASCS service fails, it must be restarted on the node of an associated replicated enqueue service.

    The SCS or ASCS service takes over the enqueue table present there and stops the replicated enqueue service. Once the replicated enqueue service is stopped, the testament is published and, through the autonomy scenarios for internal switchover, the replicated enqueue service is assigned a new node and starts up there. If the SCS or ASCS service fails again, another replicated enqueue service is therefore available for a new takeover scenario. This scenario works with one or more replicated enqueue services in one system.

    The rules for the switchover of the replicated enqueue are the same as those configured for the other services. The switchover to the replicated enqueue service is based on a service-based testament. The rule to apply for this service is based on the generally configured takeover rule as well as on service priority and the takeover type for this pool or the dynamic takeover table.

    3.5.2 Live Cache

    With version 3.0 of the FA Agents it is possible to integrate the live cache into the standard autonomy scenarios. The FA Agents offer the standard autonomy functions restart, reboot and switchover for the live cache.

    A special feature of the live cache is that it can be stopped from the SAP GUI. For this restart scenario you have to check the restart times of the live cache; otherwise this scenario cannot be diagnosed (recognized) as a fault of the live cache.

    3.6 Generic Services

    Generic services are services which are not integrated into the autonomy rules of the FA_AppAgents. With generic services it is now possible to integrate other virtualized services into the autonomy monitoring and reaction scenarios.

    A generic service is a logical application suite consisting of one or more subservices.

    For this purpose a generic service is defined through a set of parameters which are used for its identification and which generate the service states. The description and definition of a service is arranged in several models:

    Service state model

    Service detection model

    Service reaction model

    3.6.1 Service State Model

    The autonomy scenarios are based on a defined state model. The standard service state model uses the following states:

    Starting

    Running

    Stopping

    Error

    The state changes are initiated through events from an event script or through detection. For a generic service, implementation and integration in the standard start/stop procedure of the service is necessary. The standard state model knows the following events:

    Start

    Stop

    Restart

    Error

    Watch

    Nowatch
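
    The state model can be pictured as a small state machine. The following Python sketch uses the state and event names listed above, but the transitions themselves are assumptions made for illustration: this manual names the states and events without specifying which event leads to which state. The class name and the attribute watched are hypothetical.

```python
STATES = ("Starting", "Running", "Stopping", "Error")
EVENTS = ("Start", "Stop", "Restart", "Error", "Watch", "Nowatch")

# ASSUMPTION: illustrative transition table, not taken from the manual.
TRANSITIONS = {
    "Start": "Starting",
    "Restart": "Starting",
    "Stop": "Stopping",
    "Error": "Error",
}


class ServiceStateMachine:
    """Minimal sketch of a service state model driven by events."""

    def __init__(self, state="Running"):
        self.state = state
        self.watched = True  # ASSUMPTION: Watch/Nowatch toggle monitoring

    def handle(self, event):
        """Apply an event from an event script and return the new state."""
        if event not in EVENTS:
            raise ValueError("unknown event: %r" % (event,))
        if event == "Watch":
            self.watched = True
        elif event == "Nowatch":
            self.watched = False
        else:
            self.state = TRANSITIONS[event]
        return self.state

    def detected_running(self):
        """State change through detection: a starting service is found running."""
        if self.state == "Starting":
            self.state = "Running"
        return self.state


if __name__ == "__main__":
    machine = ServiceStateMachine()
    print(machine.handle("Restart"))
    print(machine.detected_running())
```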

    3.6.2 Service Detection Model

    The service detection model provides the basis for identifying the service and building the state model. A service detection model needs the parameters for the identification of the service components. The parameters are the subservice and the processes of the subservice. For this there are parameters for hierarchy and process count. There are also a process filter and exception rules to avoid ambiguities.

    3.6.3 Service Reaction Model

    The service reaction model defines the reaction and the connection to the start, stop and restart scripts. The reaction API has the parameters Script and Parameter:

    Script       The call reference for the script
    Parameter    Set of parameters for the script

    The reaction API of the FA_AppAgents provides a set of parameters which can be used as call parameters in service-specific scripts:

    @{SIDENT}@     Parameter for the SID (in upper case)
    @{sident}@     Parameter for the SID (in lower case)
    @{SRV}@        Parameter for service name (in upper case)
    @{srv}@        Service name
    @{SRVDISP}@    Display service name (in upper case)
    @{srvdisp}@    Display service name
    @{NIDENT:2@    Instance number (two-digits)
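
    The way such placeholders are replaced in a parameter string can be sketched in Python. This is an illustration only and not the implementation used by the FA_AppAgents; the function name and the example command line are hypothetical. The instance number placeholder is left out because its notation cannot be derived reliably from the list above, and unknown placeholders are left unchanged.

```python
import re

# Matches placeholders of the form @{NAME}@.
PLACEHOLDER = re.compile(r"@\{([A-Za-z]+)\}@")


def expand_parameters(template, sid, service, display_name):
    """Replace reaction API placeholders in a parameter string.

    Hypothetical helper: unknown placeholders are left unchanged.
    """
    values = {
        "SIDENT": sid.upper(),
        "sident": sid.lower(),
        "SRV": service.upper(),
        "srv": service,
        "SRVDISP": display_name.upper(),
        "srvdisp": display_name,
    }
    return PLACEHOLDER.sub(
        lambda match: values.get(match.group(1), match.group(0)), template
    )


if __name__ == "__main__":
    # Hypothetical parameter string for a service-specific script.
    print(expand_parameters("start @{sident}@ @{SRV}@", "P22", "ci", "ci"))
```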

    3.7 FlexFrame Performance and Accounting Option

    The FA Agents provide optional performance and accounting data. The agents collect node-, service- and group-based information.

    The FlexFrame performance and accounting option requires an additional service to be activated on the Control Node. This service calculates the performance and accounting data from the raw data.

    The FA Agents produce performance and accounting collets in the data directory of the pool. There are three types of collet data:

    Collets per node with the name pattern Perf_Node~.prf..col

    Collets per service group with the name pattern Perf_Group~.prf..col

    Collets per service with the name pattern Perf_Service~~~.prf..col

    The number and size of the collets produced by the FA Agents can be adjusted. By default, 10 collets are kept for each service or node. This results in a ring buffer of data which is automatically reorganized by the agents. For sizing purposes the required storage size can be calculated from the number of nodes and the length of the report cycle.

    The parameters of the DomainManager and of the backup routine have to be configured in such a way that the raw data can be safely processed before it is overwritten by the FA Agents.
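
    The ring buffer behavior explains why processing must keep pace with the agents: once the buffer is full, each new collet replaces the oldest one. The following Python sketch illustrates this with a fixed-size buffer. The class name is hypothetical, and the sketch models the principle only, not the file handling of the FA Agents.

```python
from collections import deque


class ColletRing:
    """Fixed-size ring buffer: a new entry replaces the oldest when full."""

    def __init__(self, max_collets=10):
        self._collets = deque(maxlen=max_collets)

    def write(self, collet):
        """Store a collet; return the overwritten collet, or None."""
        overwritten = None
        if len(self._collets) == self._collets.maxlen:
            overwritten = self._collets[0]
        self._collets.append(collet)  # deque drops the oldest entry itself
        return overwritten

    def pending(self):
        """Return the collets currently held, oldest first."""
        return list(self._collets)


if __name__ == "__main__":
    ring = ColletRing(max_collets=3)
    for number in range(1, 5):
        print(number, ring.write(number))
    print(ring.pending())
```

    Any collet returned by write() in this sketch corresponds to raw data that would be lost if it had not been processed or backed up beforehand.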

    The following graphic shows the architecture of the performance and accounting option.

    [Figure: architecture of the performance and accounting option. Labels: mySAP.com application server, database server, network, storage; DB, CI, APP; FA Application Agents; FA Performance and Accounting Service; ITDW; accounting and performance collets]

    3.7.1 Performance Option

    The performance option measures several performance values. For all measured values there are a minimum, an average, a maximum and a total value. This data is supplied in absolute as well as relative form. The performance option enables monitoring and evaluation of the servers and services over a longer period of time. For every node the following data is available as minimum, average and maximum values:

    Load of SAP, database or generic services

    Other services

    Machine idle


    Using the generic services increases the granularity of the performance values. The data of the performance and accounting option can be visualized directly in the FlexFrame FA Web GUI with the performance and accounting management plugin. The granularity of the view and the time span can be freely defined.

    Service Groups

    Services are combined into groups according to specific criteria. This enables a group-aggregated evaluation of the recorded data. The collected data is aggregated per report cycle and is created for every node. By default the following groups exist:

    SAP      SAP services
    DB       Database services
    IDLE     Share (proportion) of the free CPU capacity
    OTHER    Sum of all processes not belonging to a defined group

    It is possible to define further services and assign them to existing or new groups.

    3.7.2 Accounting Option

    Like the performance option, the accounting option is a part of the FA Agents that can be activated optionally. The accounting data is produced in a multistage process that aggregates and analyzes the recorded raw data.

    [Figure: Accounting data table. Columns: System SID, Service, CPU (ms), CPU (%), Mem (Kb), Mem (%), SAPS (abs), SAPS (%), Hostname, Timestamp. The example rows show services of system P22 (DB, CI, APP, J, JC, SCS, ASCS) on hosts 1 to 7. Values: Min, Max, Avg, Total per report cycle.]

    The accounting data is determined on the basis of SAPS values. SAPS is the unit of measurement used for sizing a server for SAP operation. SAPS values are only available within the scope of a defined benchmark with defined SAP transactions. Therefore only SAPS equivalents can be produced and calculated during operation. For this purpose the agents dynamically evaluate information on the SAP version and hardware workload data and use this to calculate the SAPS equivalent values.

    Important parameters for the accounting are the detection cycle and the report cycle. The detection cycle determines the number of measurements within a report cycle. The minimum, maximum and average values for a report cycle are calculated from these individual measurements. The detection cycle used here always corresponds to the detection cycle of the FA Agents, which is also the one parameterized for the autonomy function.

    The following figure shows the ascertainment and calculation of values with regard to the detection cycle and report cycle.

    [Figure: SAPS over time t. One measurement per detection cycle (default 10 sec); per report cycle (1 min) the max workload, min workload and total work are determined in relation to the server capacity.]
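The relationship between detection cycle and report cycle can be illustrated with a minimal Python sketch. It is not the agents' implementation: the function and class names are hypothetical, and the total is assumed here to be the plain sum of the individual measurements.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReportValues:
    """Aggregated values of one report cycle (hypothetical structure)."""
    minimum: float
    average: float
    maximum: float
    total: float


def measurements_per_report_cycle(report_cycle_s: int, detection_cycle_s: int) -> int:
    """Number of detection-cycle measurements that fall into one report cycle."""
    return report_cycle_s // detection_cycle_s


def aggregate_report_cycle(samples: Sequence[float]) -> ReportValues:
    """Reduce the measurements of one report cycle to min, avg, max and total.

    Assumption: 'total' is the sum of the individual measurements.
    """
    if not samples:
        raise ValueError("a report cycle needs at least one measurement")
    total = sum(samples)
    return ReportValues(min(samples), total / len(samples), max(samples), total)
```

With the values from the figure (detection cycle 10 sec, report cycle 1 min), each report cycle is built from six measurements.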

    Automatic Calculation of SAPS Values

    The SAPS calculation is based on the automatically and dynamically determined workload capacity of a node. Modern servers offer a variety of technical features such as cache, CPU and hyper-threading, and operating systems differ in how well they can use them. As a result, the automatic detection can make wrong assumptions about the workload capacity of a node. In these cases the automatic evaluation can result in incorrect workload calculations.

    If the internal automatic ascertainment of the SAPS value results in defective values, the manual SAPS calculation can be used.

    Manual Calculation of SAPS Values

    If the maximum workload number of a server cannot be determined correctly by the FA Application Agent, the workload number can be defined individually for each node. The workload values are then calculated using the prepared workload data. In this way the individual particularities of the workload capacity of a node can be taken into consideration. For this purpose, however, the workload values for each node have to be entered manually.

    3.7.3 Billing

    In a further calculation stage, chargeable workload units can be calculated from the SAPS-based accounting data. For this calculation, a range of parameters can be set that enable differentiated pricing of the workload used.

    In the default mode, all systems and services are charged at the same value every time.

    With the help of the FlexFrame ControlCenterAccounting plug-ins, the pricing can be determined through additional configuration settings.

    The following specifications are required for this purpose:

    Service contract no.

    System ID

    Service ID

    Date range

    Day type, e.g. weekday, holiday, weekend

    Time of day, e.g. daytime, nighttime operation

    The billing table enables highly differentiated billing of accounting data down to the service contract level. Depending on system, service, date range, time of day and day type, different service contract items with different workload prices can be used.

    [Figure: Billing table applied to the CPU/SAPS values. Columns: System ID, Service type, from Date, to Date, day Type, Unit, from Time, to Time, Accounting rule, Service Contract, Service level rule, Accounting Price. Example rows: P22 / all / 01.01.1900 to 01.01.2100 / workday / 00:00 to 24:00; P22 / all / 01.01.1900 to 01.01.2100 / weekend / 00:00 to 24:00; Q22 / all / 01.01.2006 to 01.01.2100 / daily / 00:00 to 24:00; P23 / other / 01.01.2006 to 01.01.2100 / daily / 00:00 to 24:00; P23 / DB / 01.01.2006 to 01.01.2100 / daily / 00:00 to 24:00. All rows use the accounting rule Standard, the service contract SC_12345 and the service level rule Sapsrule; the accounting prices shown are 0.15, 0.20, 0.30, 0.15 and 0.25. The values pass through an aggregation cycle and an accounting cycle into the accounting report.]
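How a billing table of this kind could be evaluated is sketched below in Python. This is an illustration under assumptions, not the plug-in's logic: the names BillingRule and find_price are hypothetical, holidays are not modelled, the first matching rule wins, and the prices assigned to the two example rules are illustrative only.

```python
from dataclasses import dataclass
from datetime import date, time
from typing import Optional, Sequence


@dataclass(frozen=True)
class BillingRule:
    system_id: str
    service_type: str  # "all" matches every service
    from_date: date
    to_date: date
    day_type: str      # "workday", "weekend" or "daily" (matches every day)
    from_time: time
    to_time: time      # time.max stands in for 24:00
    price: float


def day_type_of(day: date) -> str:
    """Classify a date; holidays are deliberately not handled in this sketch."""
    return "weekend" if day.weekday() >= 5 else "workday"


def find_price(rules: Sequence[BillingRule], system_id: str, service_type: str,
               day: date, at: time) -> Optional[float]:
    """Return the price of the first rule matching all criteria, else None."""
    for rule in rules:
        if rule.system_id != system_id:
            continue
        if rule.service_type not in ("all", service_type):
            continue
        if not (rule.from_date <= day <= rule.to_date):
            continue
        if rule.day_type not in ("daily", day_type_of(day)):
            continue
        if not (rule.from_time <= at <= rule.to_time):
            continue
        return rule.price
    return None


# Illustrative rules; the prices are assumptions, not taken from the manual.
EXAMPLE_RULES = [
    BillingRule("P22", "all", date(1900, 1, 1), date(2100, 1, 1),
                "workday", time(0, 0), time.max, 0.15),
    BillingRule("P22", "all", date(1900, 1, 1), date(2100, 1, 1),
                "weekend", time(0, 0), time.max, 0.20),
]
```

A lookup for system P22 on a weekday then resolves to the workday rule, and a lookup for an unknown system returns no price.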


  • FlexFrame Autonomy


    4 FlexFrame Autonomy

    The operation of SAP systems is becoming increasingly complex, and the number of components required is constantly rising.

    Installation, configuration and operation of a distributed SAP installation consequently involve considerable administrative effort. The demands on the systems change rapidly, and it must be possible to expand an existing configuration or replace failed components both quickly and flexibly.

    Through the use of Autonomy Agents, FlexFrame enables the number of operator interventions to be reduced and availability to be increased. This chapter describes the application scenarios for the FlexFrame Autonomy functions.

    Installation and startup of the agents is described in chapter 2.

    To permit active operation of a FlexFrame Autonomy installation, the FlexFrame Autonomy Agents must run on the Application Nodes and a Control Agent on the active Control Node.

    Use of the Messenger component is optional and is required only for displaying and forwarding events of the FA Agents and for integration into Enterprise Event Management Systems.

    The FlexFrame Autonomy Application Agent is used to monitor SAP central instances, SAP application instances and database instances. In the event of a problem, so-called self-repair mechanisms are used for these services. Execution of these self-repair mechanisms can be triggered locally or centrally. For each service/node these mechanisms can be divided into the following categories:

    Monitoring of a service

    Restarting of a service if it was down

    Rebooting of a node if a service could not be started again after one or more restarts

    Switchover (automatic change) to another node if the reboot could not be performed or was not successful

    Detecting of START, STOP and maintenance situations

    Control functions for displaying activities and statuses, sending mails and SMSs, configurable in conjunction with time, contact and problem situation


    4.1 FlexFrame Autonomy Reactions

    FlexFrame Autonomy detects problems and decides autonomously on the reactions to be implemented after evaluating rules which can be controlled via parameters. FlexFrame Autonomy knows the following basic reactions:

    Restart

    Reboot

    Switchover (internal / external)

    These basic reactions, combined with pool creation, grouping, the service classes, and the service priorities result in a large number of reaction scenarios.

    4.1.1 Restart

    The FA_AppAgent restarts a service if a required subservice is down or no longer available. After the restart it checks, on the basis of a configurable time interval, whether the service is available again. The restart is not performed if any service which runs on the node has already triggered a reboot. Furthermore, failure of multiple subservices of a service leads to a restart within the configured time interval only until service availability has been restored.

    The number of restart attempts for restoring service availability can be configured. If the number of parameterized restart attempts is 0, failure of the service results directly in a reboot attempt. If the number of reboots permitted for the nodes is 0, a switchover is initiated.

    The restart reactions are not affected by pool creation, grouping or service classification.
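The escalation from restart to reboot to switchover described above can be summarized in a small Python sketch. The names are hypothetical and the counters are simplified: how the agents actually track attempts across reboots is not specified here.

```python
from enum import Enum


class Reaction(Enum):
    RESTART = "restart"
    REBOOT = "reboot"
    SWITCHOVER = "switchover"


def next_reaction(restarts_done: int, max_restarts: int,
                  reboots_done: int, max_reboots: int) -> Reaction:
    """Pick the next autonomous reaction for a failed service.

    Simplified model: restart until the configured number of attempts is
    used up, then reboot, then switch over. With max_restarts == 0 a
    failure leads directly to a reboot attempt; with max_reboots == 0 as
    well, a switchover is initiated.
    """
    if restarts_done < max_restarts:
        return Reaction.RESTART
    if reboots_done < max_reboots:
        return Reaction.REBOOT
    return Reaction.SWITCHOVER
```

The sketch leaves out the rule that no restart is performed once a service on the node has already triggered a reboot.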

    4.1.2 Reboot

    A node is rebooted if a monitored service has failed and could not be made available again after the configured number of restart attempts, or if no restarts are allowed.

    The autonomous reaction reboot also evaluates the service class and the service priority of the service which causes the reboot. If multiple services are running simultaneously on the node, the reboot rule is used to check whether services with the same or a higher priority are still running. If this is the case, the reboot is not performed; instead, an alarm is generated which informs the administrator of the problem.
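A minimal Python sketch of the reboot rule follows. It assumes that a larger number means a higher priority (consistent with priority 0 being used for unimportant services) and ignores the service class; the function name is hypothetical.

```python
from typing import Iterable


def reboot_allowed(failed_priority: int, running_priorities: Iterable[int]) -> bool:
    """Return True if the node may be rebooted for the failed service.

    The reboot is blocked as soon as any service still running on the node
    has the same or a higher priority than the failed service. In that
    case only an alarm would be raised.
    Assumption: larger number = higher priority.
    """
    return all(p < failed_priority for p in running_priorities)
```

For example, a failed service of priority 2 may cause a reboot while only services of priority 0 and 1 are running, but not while another priority 2 service is still up.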


    4.1.3 Switchover

    A switchover always leads to all the monitored services of a node being moved to another Server Node.

    The decision to move to another node can be taken locally by the FlexFrame Autonomy Agent (internal switchover), or by the Control Agent on one of the Control Nodes (external switchover).

    4.1.3.1 General Rules

    Takeover in the event of a node failure is implemented using what is termed an applicant rule. The applicant rule states that each spare node may apply to take over the services of a failed node.

    Pool creation, grouping and service classes permit new switchover scenarios which can satisfy different availability requirements depending on the parameterization.

    This results in the following scenarios:

    Pool-dependent switchover

    Group-dependent switchover

    Service-prioritized switchover

    The failure of a node is only reacted to within a virtual FlexFrame pool.

    Groups can be defined within a virtual FlexFrame pool. The applicant rule states that a node only issues an application when a node in its own group fails.

    The granularity of the reaction to a system failure can be further refined by prioritizing individual services.

    The applicant rule states that in the event of simultaneous failure of multiple services, the application is first issued for the switchover file (testament) of the service with the higher priority. Only if all higher-priority services have been taken over by another node and free spare nodes exist do these apply for the switchover files of lower-priority services which still need to be taken over.

    When services with priority 0 fail, no applications are made by spare nodes. This prevents spare nodes from being used up by the failure of unimportant test systems.

    The parameter file also contains a minimum priority parameter. This parameter provides a very simple way to define, for example, that spare nodes only apply to take over the services of a node if none of the failed services has a lower priority than that entered there.

    In conjunction with the basic rule that all services have priority 1 by default, a lower priority can be configured for individual services. This provides a simple way to prevent valuable spare nodes from being used up by the failure of test systems.
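The priority-related parts of the applicant rule can be sketched in Python as follows. This is one possible reading of the rules above, with hypothetical names: a larger number is taken to mean a higher priority, and the minimum priority is passed in explicitly because its default value is not stated here.

```python
from typing import List, Mapping


def application_order(failed_services: Mapping[str, int],
                      minimum_priority: int) -> List[str]:
    """Return the failed services a spare node would apply for, in order.

    - If any failed service has a priority below minimum_priority, the
      spare node does not apply at all.
    - Services with priority 0 never trigger an application.
    - The remaining services are applied for in descending priority.
    """
    if any(prio < minimum_priority for prio in failed_services.values()):
        return []
    eligible = [(name, prio) for name, prio in failed_services.items() if prio > 0]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in eligible]
```

Pool and group membership are not modelled; in a fuller model they would be checked before this function is reached.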


    4.1.3.2 Internal Switchover

    In the case of an internal switchover the Application Agent recognizes that a service is down and cannot (or depending on the configuration may not) be restored using a restart or reboot. The FlexFrame Autonomy Agent then initiates an internal switchover.

    The actual takeover by another node begins with the transfer message. Only spare nodes can apply and take over these services. The node which takes over control starts the required services. If, after the maximum switchover time, the FA_AppAgent on the system that is to take over control is not able to start the services, it reports this by means of an SNMP trap. The switchover is aborted and must be processed further by the administrator.

    4.1.3.3 External Switchover

    In contrast to the internal switchover, the external switchover is detected and initiated by the Control Agent on a Control Node. This is required if the system is showing no sign of life or can no longer be reached in the network. Reachability is tested using Ping or SSH tests. The user decides whether to perform only Ping tests, only SSH tests or both kinds of test. Additionally, the Ping requests can be configured for the client LAN, server LAN and/or storage LAN interfaces.

    The takeover is performed in the same way as for the internal switchover.

    To enable user-specific actions before or after a node is powered down, the CtrlAgent calls hook scripts, which can be customized by the user. The scripts are provided with the return code of the previously executed action.

    Pre-PowerOff hook script: Called with return code 0 as argument, as there was no previously executed action.

    Post-PowerOff hook script: Called with the return code of the Pre-PowerOff hook script (if it failed, i.e. the return code was != 0) or with the return code of the power off script.

    If the configuration value IgnorePoffHookResult is set to true, the return codes of the hook scripts are ignored. If it is set to false, they are used as hints on how to proceed in case of errors: if the Pre-PowerOff hook script returns a value != 0, the power off is not performed; if the Post-PowerOff hook script returns a value != 0, the switchover does not proceed. This enables the user to customize the external switchover and power off processes based on additional information or rules, or to perform additional actions, e.g. mounting SAN devices.
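The interplay of the hook scripts and IgnorePoffHookResult can be sketched in Python as follows. This is an interpretation, not the CtrlAgent's code: the scripts are modelled as plain callables, all names are hypothetical, the second condition is read as referring to the Post-PowerOff hook, and the handling of a failing power off script itself is left open.

```python
from typing import Callable, NamedTuple, Optional


class PowerOffResult(NamedTuple):
    powered_off: bool
    proceed_with_switchover: bool


def run_power_off(pre_hook: Callable[[int], int],
                  power_off: Callable[[], int],
                  post_hook: Callable[[int], int],
                  ignore_poff_hook_result: bool) -> PowerOffResult:
    """Run Pre-PowerOff hook, power off and Post-PowerOff hook in sequence."""
    # The Pre-PowerOff hook has no predecessor, so it is called with 0.
    pre_rc = pre_hook(0)
    pre_failed = pre_rc != 0

    powered_off = False
    power_off_rc: Optional[int] = None
    # A failed pre hook blocks the power off unless hook results are ignored.
    if ignore_poff_hook_result or not pre_failed:
        power_off_rc = power_off()
        powered_off = power_off_rc == 0

    # The post hook sees the pre hook's return code if that hook failed,
    # otherwise the return code of the power off script.
    post_rc = post_hook(pre_rc if pre_failed else power_off_rc)

    # A failed post hook stops the switchover unless hook results are ignored.
    proceed = ignore_poff_hook_result or post_rc == 0
    return PowerOffResult(powered_off, proceed)
```

Modelling the scripts as callables keeps the control flow visible; in practice each callable would wrap the execution of the corresponding script and return its exit code.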

    4.1.4 Maintenance

    The autonomous functions and reactions of the FA Agents can be disabled for individual services by calling a maintenance script. This is always required when application instances are to be started and stopped without autonomous reactions.


    A service is set to nowatch using the following scripts on the relevant Application Node:

    sapdb nowatch

    sapci nowatch

    sapapp nowatch

    A service is reincluded in monitoring using the following scripts on the relevant Application Node:

    sapdb watch

    sapci watch

    sapapp watch

    4.2 Self-Repair Strategies

    In terms of the strategy for restoring a failed service, a distinction must be made between the following failures:

    Service failure

    Node failure

    A detailed description of the procedure for the subsequent autonomous reaction was provided in the preceding chapter.

    4.2.1 Self-Repair in the Event of a Service Failure

    If a service failure occurs, this is detected by the myAMC.FA_AppAgent and an attempt is made to make the service available again using the following autonomous reactions and their escalations:

    Restart of the service

    Reboot of the node

    Switchover (internal)

    Implementation and the number of the above-mentioned autonomous reactions and the escalations can be affected by the configuration.

    4.2.2 Self-Repair in the Event of a Node Failure

    If a node failure occurs, this is detected by the myAMC.FA_CtrlAgent and an attempt is made to make the service available again using the following autonomous reactions:

    Switchover (external)


    4.2.3 Takeover by a Spare Node (Switchover)

    The standard rule in the FlexFrame concept for taking over the services of a failed node is to have them taken over by a spare node.

    Every Application Node in a standard FlexFrame installation on which an FA_AppAgent is running and on which none of the monitored services exists is automatically a spare node.

    If a switchover is started as a result of a node failure or escalation of a service failure, all spare nodes apply to take over the services. The quickest node in the application procedure is chosen and takes over the tasks.

    4.2.4 Multi Node Failure

    The simultaneous failure of multiple systems or nodes is called a Multi Node Failure. It indicates a different kind of failure than a single node or system failure, and its cause is usually more complex. From version V30A10 on, the FA Agents support the automatic detection of Multi Node Failures with different reactions and additional alarms. This allows the recognition of failure states which require the attention and decision of an administrator. Several new parameters allow the usual behaviour to be modified, for example by delaying or skipping reactions. Additionally, a set of new alarms triggered by user-configurable indicators informs the administrator of a Multi Node Failure so that appropriate action can be taken. These indicators can be configured per pool.

    There are two different Multi Node Failure scenarios:

    1. Simultaneous failure of multiple Nodes, Systems or Services, e.g. due to a power outage in a blade cabinet, which takes effect within a short period of time (e.g. one minute)

    2. Failure of several Nodes, Systems or Services within a specific time range which is longer than the one specified above (e.g. one hour)

    These scenarios are called ShortTime Failure and LongTime Failure.

    The CtrlAgent keeps a list of all failures, with each entry containing the node name and a timestamp. If the number of entries within a scenario-specific time range exceeds the limit, the CtrlAgent assumes a Multi Node Failure.

    4.2.4.1 Case 1: ShortTime Failure

    Simultaneous failure of multiple Nodes, Systems or Services, e.g. due to a power outage in a blade cabinet, which takes effect within a short period of time (e.g. one minute).


    MultiNodeFailure_ShortTime_FailureCount
    Specifies the number of failures within the time range which leads to a Multi Node Failure state.

    MultiNodeFailure_ShortTime_FailureTime
    Specifies the time range (in seconds) to be used for failure aggregation.

    MultiNodeFailure_ShortTime_ReactionDelay
    Specifies a delay time (in seconds) before the CtrlAgent reacts to failures.

    MultiNodeFailure_ShortTime_ReactionAction (for future use)
    Specifies a reaction different from the normal mode of operation.

    In case of a ShortTime Multi Node Failure the CtrlAgent sends an emergency alarm. Additionally, the usual autonomous reactions can be delayed or skipped (by setting MultiNodeFailure_ShortTime_ReactionDelay to a very large value).

    4.2.4.2 Case 2: LongTime Failure

    Failure of several Nodes, Systems or Services within a specific time range which is longer than the one specified above (e.g. one hour).

    MultiNodeFailure_LongTime_FailureCount
    Specifies the number of failures within the time range which leads to a Multi Node Failure state.

    MultiNodeFailure_LongTime_FailureTime
    Specifies the time range (in seconds) to be used for failure aggregation.

    MultiNodeFailure_LongTime_ReactionDelay
    Specifies a delay time (in seconds) before the CtrlAgent reacts to failures.

    MultiNodeFailure_LongTime_ReactionAction (for future use)
    Specifies a reaction different from the normal mode of operation.

    In case of a LongTime Multi Node Failure the CtrlAgent sends an emergency alarm. Additionally, the usual autonomous reactions can be delayed or skipped (by setting MultiNodeFailure_LongTime_ReactionDelay to a very large value).
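The failure list and the per-scenario threshold can be sketched in Python as follows. This is a simplified model with hypothetical names, not the CtrlAgent's implementation. It assumes that reaching the configured FailureCount within the FailureTime window is enough to signal a Multi Node Failure; the text also allows the reading that the count must be exceeded.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MultiNodeFailureDetector:
    failure_count: int    # e.g. value of MultiNodeFailure_ShortTime_FailureCount
    failure_time: float   # e.g. value of MultiNodeFailure_ShortTime_FailureTime (seconds)
    failures: List[Tuple[str, float]] = field(default_factory=list)

    def record(self, node: str, timestamp: float) -> bool:
        """Store a failure and report whether a Multi Node Failure is assumed."""
        self.failures.append((node, timestamp))
        # Count only entries inside the scenario-specific time range.
        recent = [entry for entry in self.failures
                  if timestamp - entry[1] <= self.failure_time]
        # Assumption: reaching the configured count triggers the state.
        return len(recent) >= self.failure_count
```

One detector instance per scenario (ShortTime and LongTime) would be fed with the same failure events, each with its own count and time range.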


    4.3 Takeover Rules

    4.3.1 Overview

    Rule-based high availability for nodes and services is achieved by evaluating rule sets which control the takeover of services from a failed node. They consist of qualification rules, a takeover strategy and takeover rules.

    The qualification rules specify which nodes may apply for the services of a failed node.

    The takeover strategy defines the conflict resolution mode to be used when more than one node applies for a node testament.

    The takeover rules control the actual takeover, i.e. the service start order and possibly service displacement or replacement.

    4.3.2 TakeOver strategy

    4.3.2.1 Overview

    The qualification rules specify which nodes may apply for the services of a failed node. When a switchover is performed, all nodes may apply to take over the failed node's services by taking part in an auction. As long as the auction lasts, all nodes which match the requirements specified in the failed node's testament may apply. When the auction is finished, the takeover strategy is used to decide which node has won.

    4.3.2.2 FirstFit

    FirstFit specifies that the first node, which applied for a testament, is the winner. This is the default strategy.

    4.3.2.3 LowPrioFit

    From version V30A10 on, the FA Agents provide a new strategy: LowPrioFit. The application node containing the services with the lowest priority wins the auction. It therefore has the best chance to replace or displace some running services in order to take over the failed services.

    By definition a spare node is considered to have the lowest priority, so it will win an auction over a node with running services. A node with only services of priority 0 will win over a node with services of priority 1 and higher, and so on.

    This strategy can be used as an alternative to FirstFit. It changes only the behaviour of the new takeover rules: add rule, replace rule and substitution rule. If only the spare node rule is used, the behaviour is the same as with the FirstFit strategy, because all spare nodes have the same priority and the first one wins the auction.
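The difference between the two strategies can be made concrete with a short Python sketch. The names are hypothetical and the model is simplified: a node's rank under LowPrioFit is assumed to be the highest priority among its running services, a spare node ranks below every node with services, and ties are resolved in favour of the earliest application.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Applicant:
    node: str
    applied_at: float                         # time of the application
    service_priorities: Tuple[int, ...] = ()  # empty tuple = spare node


def first_fit(applicants: Sequence[Applicant]) -> str:
    """FirstFit: the node that applied first wins the auction."""
    return min(applicants, key=lambda a: a.applied_at).node


def low_prio_fit(applicants: Sequence[Applicant]) -> str:
    """LowPrioFit: the node running the lowest-priority services wins."""
    def rank(a: Applicant) -> Tuple[int, float]:
        # Spare nodes get -1 so that they beat nodes with priority 0 services.
        highest = max(a.service_priorities) if a.service_priorities else -1
        return (highest, a.applied_at)
    return min(applicants, key=rank).node
```

If all applicants are spare nodes, both functions pick the earliest applicant, which matches the statement that LowPrioFit behaves like FirstFit when only the spare node rule is used.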

    4.3.3 TakeOver Rule

    4.3.3.1 Overview

    In version 3 and higher, the FA Agents offer the option of configuring various takeover rules. It is now possible