Netbackup Operations Procedures

ISL/OPS/A869 ISSUE 2 Netbackup Operational Procedures Veritas Netbackup Procedures



About this document...

If you do not want to receive future updates of this document, please de-register by sending your name, address, and OUC to the ISIS Helpdesk (01206 894088).

Author

The author of this document may be contacted at:

John Wesson

Eldon House

Sheffield S1 3PL

Content approval

This is Issue 2 of this document.

The information contained in this document was approved for use.

Approver Signature

Filing

The filing reference for this document is ISL/OPS/A869.

History

Issue    Date      Author         Reason
Issue 1  23/02/05  Mick Sweeting  DRAFT 1
Issue 2  26/05/07  John Wesson    Incorporate OPS produced docs.

Contents

1 Introduction
2 Contacts
  2.1 Customer Experience Management Centre Operations
  2.2 Customer Experience Management Centre Second Line Team
  2.3 Customer Experience Management Centre Second Line DBA
  2.4 Storage management
  2.5 OPERATIONS / MEDIA CONTACT LIST
  2.6 ADIC GRAU Contact/Info
  2.7 Escalation
  2.8 Hardware Information
  2.9 Drive Information
  2.10 IBM Hardware Contact
3 Tape Library Naming Conventions
4 IBM Library Information
  4.1 Using robtest to interrogate the robot
  4.2 mtlib Commands
  4.3 Dealing with problems when a robot goes into Pause mode
  4.4 Media movements - Inserting media
  4.5 Ejecting media from the IBM library
  4.6 Drive problems
  4.7 Library problems
  4.8 mtlib commands - Description and useful commands
5 ADIC Library Information
  5.1 Using robtest to interrogate the robot
  5.2 dasadmin (only to be used on ADIC libraries)
  5.3 Mount a Volume in a tape drive
  5.4 Dismount a Volume from a tape drive
  5.5 Media Movements - Inserting Media
  5.6 Ejecting Media from the ADIC Library
  5.7 Drive Problems
  5.8 AMLs
  5.9 Common Problems
  5.10 Points To Note
  5.11 Library Problem
6 Accessing NetBackup
  6.1 NBU 4.5
7 NetBackup Daemon Problems
  7.2 Shutdown Netbackup
8 Netbackup Activity Logs and process information
  8.1 Introduction
  8.2 File system full problems
  8.3 Full List of NETBACKUP Processes
  8.4 General Drive Testing and Fault Resolution
  8.5 Dealing With Tape Drive Problems
  8.6 Determining whether problem is Drive or Robot.
  8.7 Device Configuration Utility - Tpconfig
9 Deconfigure/Reconfigure Sequent Drives
10 Resetting the SCSI extenders
  10.1 Overview
11 Useful Media Commands
  11.1 BPMEDIA: Freeze, unfreeze, suspend or unsuspend media
  11.2 Tape Management
12 Restore Guidance
  12.1 Background
  12.2 Restore information
  12.3 Media related items
  12.4 Restore processes
  12.5 Failover Restores
  12.6 Reduction of NetBackup drive usage to allow a restore to run
  12.7 Additional information
13 Backups
14 NetBackup Logmon Error Messages
  14.1 NetBackup Processes and Procmon Error Messages
15 NetBackup STATUS Exits - the big hitters
  15.1 Exit Status 41 Network connection timed out
  15.2 Exit Status 1 Backup was partially successful
  15.3 Exit Status 52 Timed out waiting for Media Manager to mount volume
  15.4 Exit Status 71 None of the files in the file list exist
  15.5 Exit Status 219 The required storage unit is not available
  15.6 Exit Status 54 Timed out connecting to client
  15.7 Exit Status 84 Media write error
  15.8 Exit Status 57 Client connection refused
  15.9 Exit Status 96 Unable to allocate new media for backup, storage unit has none available
  15.10 Exit Status 131 Client is not validated to use the server
  15.11 Application Resource Alert
  15.12 Slow throughput of backups
16 Netbackup Clients
  16.1 Netbackup Documentation
  16.2 Troubleshooting Guide
  16.3 NetBackup reporting
  16.4 Supportal
17 APPENDICES

1 Introduction

This document is intended for use in identifying and resolving Veritas NetBackup problems. It is not intended to replace any software manuals and should be used in conjunction with the current Veritas manuals. It also assumes a certain level of Unix command experience and use of dasadmin commands.

2 Contacts

2.1 Customer Experience Management Centre Operations

The Lines of Business Operations team is based at the Sheffield Customer Experience Management Centre and provides a first line and monitoring role for Netbackup alarms. This group is available 24 hours a day, 7 days a week, and can be contacted as follows:

Tel: 0800 216662 Option 3

Fax: 01142774224

Operations Service Manager: 0800 216662 Option 6,2

Email: [email protected]

The Clarify class for the Operations Group is: CCLOBOPS

2.2 Customer Experience Management Centre Second Line Team

The BT/HP Second Line team are based at the Sheffield Customer Experience Management Centre, and provide second line technical support for Unix, Wintel and Netbackup. The team is available 24 hours a day, 7 days a week, and can be contacted as follows:

Tel: 0800 216662 Option 5,2

Fax: 01142774224

Second Line team leader: 0800 216662 Option 5,4

Email: [email protected]

Clarify classes for the Second Line team are:

Unix: CWHPOPSUX

Wintel: CWHPOPSNT

Netbackup: CWHPOPSSTNB

Customer Experience Management Centre Second Line DBA

A Second Line DBA is available 24 hours a day, 7 days a week, and can be contacted as follows:

Tel: 0114 277 4032

Fax: 01142774224

Clarify class: DBAORA2L

2.3 Storage management

Contact information for the storage management support group with responsibility for Netbackup:

Bridge Clarify class CWBACKUP

Outside office hours CWBACKUP Bridge Callout NETBACKUP

2.4 OPERATIONS / MEDIA CONTACT LIST

A full list of site contacts can be found in Departmental Contacts under Contacts at

http://dataintegrity.intra.bt.com/

2.5 ADIC GRAU Contact/Info

2.5.1 Contact Number / fault logging / hotline

ADIC GRAU 01344 488786

Name                Office         Mobile
Maurice Rutherford  0118 922 9100  07710 576425
Mike Halliday       0118 922 9100  07850 000152

2.5.2 ADIC Callout Agreement

Callout will be instigated by the Second Line team/Netbackup Support out of hours where we consider there to be a serious degradation of the system, i.e. over 50 % of tape drives are unavailable. Otherwise a call will be placed during normal working hours. On site access will be arranged by the Second Line team via the Change / Problem Management System.

2.5.3 Hours of Cover

Full cover will be provided on a 24 hours per day, 365 days per year basis with a 2-hour response time.

2.6 Escalation

2.6.1 HP / ADIC

John Orman

(Tel 01908 656267)

Gary Page

(Tel: 0118 922 9100)

(Mobile: 07919 330945)

Pete Meade

(Tel 0151 706 8805)

(Mobile: 07802 471232)

Maurice Rutherford

(Tel: 0118 922 9100)

(Mobile: 07710 576425)

2.7 Hardware Information

Hardware information for each site can be found in Hardware Information under Documentation at

http://dataintegrity.intra.bt.com/

2.8 Drive Information

This can be found under Tape Library Info under Media & Library under Drive & Media at

http://byadsm03.nat.bt.com/

2.9 IBM Hardware Contact

In the event of a failure of any IBM kit a call should be made to IBM via Storagetek (01483 728101). The following information will be required in order for Storagetek to place a call with IBM.

The site id number.

The 4-digit machine type number as listed beside hostname.

The serial number of the machine.

The project name, which in most instances is Brunix

2.9.1 IBM Callout Agreement

Callout will be instigated by the Second Line team/Netbackup Support out of hours where we consider there to be a serious degradation of the system, i.e. over 50 % of tape drives are unavailable. Otherwise a call will be placed during normal working hours. On site access will be arranged by the Second Line team/Netbackup Support group via the Change / Problem Management System.

2.9.2 Hours of Cover

Full cover will be provided on a 24 hours per day, 365 days per year basis with a 2-hour response time.

3 Tape Library Naming Conventions

The NetBackup tape libraries (also known as tape robots) come from two different manufacturers, IBM and ADIC/GRAU.

You can tell a good deal just from the library name.

- the first two letters give the location, e.g., IP is Ipswich.

- the next three letters are either IBM or GRA, indicating the manufacturer is either IBM or ADIC/GRAU

- the next two letters are LB, to indicate this is a library rather than a normal server

- finally there is a robot number. This is needed as one site may have several IBM or ADIC/GRAU libraries.
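As a minimal sketch of the convention above, the following Python helper splits a library name into its parts. The function name and the returned field names are illustrative only, and the site code is returned undecoded because IP (Ipswich) is the only site expansion given here.

```python
import re

# Manufacturer codes taken from the naming convention described above.
MANUFACTURERS = {"IBM": "IBM", "GRA": "ADIC/GRAU"}

# Two-letter site code, three-letter manufacturer code, literal "LB", robot number.
NAME_PATTERN = re.compile(r"^([A-Z]{2})(IBM|GRA)LB(\d+)$")


def decode_library_name(name):
    """Split a tape library name such as IPGRALB1 into its components.

    Returns a dict, or None if the name does not follow the convention.
    """
    match = NAME_PATTERN.match(name.strip().upper())
    if match is None:
        return None
    site, maker, robot = match.groups()
    return {
        "site_code": site,
        "manufacturer": MANUFACTURERS[maker],
        "robot_number": int(robot),
    }


if __name__ == "__main__":
    print(decode_library_name("IPGRALB1"))
    print(decode_library_name("TPIBMLB2"))
```

A name that does not fit the pattern, such as an ordinary server hostname, returns None rather than a guess.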

Understanding the naming convention means you can tell a good deal just from a name like IPGRALB1, TPIBMLB2, etc.

An overview of our tape libraries at the major sites can be found under the URL http://dataintegrity.intra.bt.com/. Just click the tab in the left-hand menu for 'Tape Drive Layout Diagrams', then click on the site you are looking for. Alternatively, issuing tpconfig -d on any server will indicate the library type it uses (tlh indicates an IBM 3494, and tlm indicates an ADIC/GRAU AML library) and show the drives that server has.

4 IBM Library Information

Note: Most of our IBM libraries are IBM 3494s, and these have a library manager which understands the mtlib commands explained below. Newer IBM 3584s are now coming into service which do not have a library manager. You cannot use mtlib commands on these, so you will have to rely on robtest and NetBackup for library information.

4.1 Using robtest to interrogate the robot

The robtest utility in /usr/openv/volmgr/bin can be used to interrogate and control the IBM silo from the system running the daemon, in the same way that it can be used to control the ADIC silo from machines that drive an ADIC silo.

Issue the following command:

#robtest

Use ? to get help in robtest.

The main commands available are drstat (to display drive status) and view (to examine tapes). Note that unload cannot be used, as the tape drives are not connected to the SUN systems.

For further information regarding the robtest utility, refer to ISL/OPS/B274 - BRUNIX media management procedures at

http://documents.intra.bt.com/

Note: Do not leave your robtest session active any longer than you have to. When you are in the robtest utility, communication between the tlhcd daemon and the media servers is blocked. If a media server makes a request to tlhcd (either a mount or a dismount request) while the robtest utility is active, the request will be blocked. This will result in the tape drives on that media server being switched into AVR mode until the robtest session is terminated.

4.2 mtlib Commands

The following commands can be used to interrogate the robot using the mtlib utility (/usr/bin/mtlib):

Command                        Result
mtlib -l 3494c -q I            Query Inventory. Produces an inventory of the silo and the category of each tape.
mtlib -l 3494c -q L            Query Library. Produces various items of library data, including:
                               State - this should display "Automated Operational State". If the library has been PAUSED then a pause state will display.
                               Available cells - displays the number of empty cells.
mtlib -l 3494c -q S            Statistical data.
mtlib -l 3494c -q K            Count of tapes.
mtlib -l 3494c -C -t -V        Changes the category of a tape.
mtlib -l 3494c -C -t FF10 -V   Can also be used to eject tapes from the silo.
mtlib -l 3494c -D              Displays the silo device numbers.
mtlib -l 3494c -q M            Displays which tapes are loaded in the above device drives.

For further information regarding the robtest utility, refer to ISL/OPS/B274 - BRUNIX media management procedures at

http://documents.intra.bt.com/

4.3 Dealing with problems when a robot goes into Pause mode

Occasionally the IBM library may go into a paused state. This results in no cartridge movement and probably a number of backup failures. The cause is that the two I/O station problem slots have been filled with cartridges that the robot has had problems handling.

If the robot goes into a paused state, contact Media Ops and ask them to empty the I/O station problem slots and note the cartridge numbers. When this has been done the library should automatically restart and work as normal. Confirm that backups are running and that cartridges are being loaded and unloaded. Occasionally the problem may be due to a failure in the robotic arm, in which case all cartridge manipulation will fail. The cartridges would then be placed in the I/O station problem slots, causing the library to go into a paused state again. This will require a callout to the engineers to resolve the problem.

If the library does recover then the problem cartridges can be re-inserted into the library. If this fails then they need to be checked to see if they contain any valid data and if so need to be put into a frozen state. If no valid data exists then they can be deleted from the Media Manager database.

4.4 Media movements - Inserting media

Inserting media into the IBM library is a straightforward process. There is no insert command to run, as the library will automatically take cartridges from the I/O station, or hopper, and place them in available slots. The only problems that are likely to occur are that there are no slots available or that the library is not functioning correctly.

Once cartridges have been inserted and accepted by the library they can be added to the NetBackup media management database as normal via the GUI or command line. This should be done from the NetBackup master server.

4.4.1 Checking available slots

There are two ways to find out how many available slots there are.

The least disruptive way is to use the following webpage, which includes ADIC and IBM library information: http://byadsm03.nat.bt.com/grau_info

An alternative is to interrogate the tape library. As the library is being checked, this command must be issued from the mount server. Log onto the mount server and run the robtest utility.

Caution: While robtest is running no further library actions, i.e. mounts and dismounts, can take place.

Select the appropriate library, which is normally option 1, and to get a list of the possible options use the ? command.

dyadsm01 $ robtest

Configured robots with local control supporting test utilities:

TLH(0) LMCP device path = /dev/lmcp0

Robot Selection

---------------

1) TLH 0

2) none/quit

Enter choice: 1

Robot selected: TLH(0) LMCP device path = /dev/lmcp0

Invoking robotic test utility:

/usr/openv/volmgr/bin/tlhtest -r /dev/lmcp0 -d /dev/rmt2.1 003590E1A00 -d /dev/rmt1.1 003590E1A01 -d /dev/rmt3.1 003590E1A02

Opening /dev/lmcp0

Enter tlh commands (? returns help information)

?

To exit the utility, type q or Q.

audit - Audit library for volser

catinv [] - Print library inventory by category

dm [|] - Dismount volser from drive

drmapclear - Clear drive address mapping

drmapfreeze - Freeze drive address mapping

drmapshow - Show drive address mapping

drstat [|] - Print drive status

eject [bulk] - Eject volser to standard (or bulk) output area

inv [] - Print library inventory

libstat - Print library status

m [|] - Mount volser

setcat - Set volser category

types - Print list of media types

verbose - Toggle verbose mode

view - Print volser data

SCSI commands:

unload [|] - Issue SCSI unload

= d1 if drive 1, d2 if drive 2, ..., d256 if drive 256

Use the libstat option to display the status of the library.

libstat

Library information:

state: Automated Operational State

input stations: 1

output stations: 1

input/output status: All convenience input stations empty

All convenience output stations empty

machine type: 3494

sequence number: 0x16552

number of cells: 5431

available cells: 3285

number of subsystems: 17

convenience capacity: 10

accessor config: 01

accessor 0 status: Accessor available

Gripper 1 available

Gripper 2 not installed

Vision system operational

comp avail status: Primary library manager installed.

Primary library manager available.

Primary hard drive installed.

Primary hard drive available.

Secondary hard drive installed.

Secondary hard drive available.

Convenience input station installed.

Convenience input station available.

Convenience output station installed.

Convenience output station available.

avail 3490 cln cycles: 0

avail 3590 cln cycles: 9

QUERY LIBRARY DATA complete
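If the libstat output is captured to a file, the free-slot figure can be pulled out with a short script rather than read by eye. The following is a minimal sketch only: the function names are illustrative, and the parsing rules (a "key: value" layout, with lines lacking a colon treated as continuations of the previous key) are assumptions drawn from the sample output above rather than a documented format.

```python
def parse_libstat(text):
    """Parse 'key: value' lines from captured libstat output into a dict of lists."""
    info = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the trailing "QUERY LIBRARY DATA complete" marker.
        if not line or line.startswith("QUERY "):
            continue
        key, sep, value = line.partition(":")
        if sep:
            last_key = key.strip()
            info[last_key] = [value.strip()] if value.strip() else []
        elif last_key is not None:
            # No colon: treat as a continuation of the previous key.
            info[last_key].append(line)
    return info


def library_summary(info):
    """Report whether the library is operational and how many cells are free."""
    state = " ".join(info.get("state", []))
    return {
        "operational": state == "Automated Operational State",
        "available_cells": int(info["available cells"][0]),
        "total_cells": int(info["number of cells"][0]),
    }


# Trimmed copy of the sample output shown above.
SAMPLE_LIBSTAT = """
Library information:
state: Automated Operational State
input/output status: All convenience input stations empty
All convenience output stations empty
machine type: 3494
number of cells: 5431
available cells: 3285
QUERY LIBRARY DATA complete
"""

if __name__ == "__main__":
    print(library_summary(parse_libstat(SAMPLE_LIBSTAT)))
```

Because this works on saved output, it does not lengthen the time robtest is held open.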

Note: To find out the drive details on the server, issue the tpconfig -d command on the master/media server that uses the library. The output will look similar to that shown below.

dyadsm01 $ tpconfig -d

Index DriveName DrivePath Type Shared Status

***** ********* ********** **** ****** ******

0 Drive1 /dev/rmt2.1 hcart2 No UP

TLH(0) IBM Device Name=003590E1A00

1 Drive2 /dev/rmt1.1 hcart2 No UP

TLH(0) IBM Device Name=003590E1A01

2 Drive3 /dev/rmt3.1 hcart2 No UP

TLH(0) IBM Device Name=003590E1A02

Currently defined robotics are:

TLH(0) LMCP device path = /dev/lmcp0,

volume database host = dyadsm01

From this it can be seen that there are 3 drives configured to be used by this server, dyadsm01.
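Where the drive list is needed in a script, captured tpconfig -d output can be parsed as sketched below. This is an illustration under assumptions: the function and field names are hypothetical, and the two line patterns are derived only from the TLH sample above, so other robot types may print differently.

```python
import re

# A drive row: index, name, path, type, shared flag, status.
DRIVE_LINE = re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$")
# The row that follows each drive in the TLH sample, e.g. "TLH(0) IBM Device Name=003590E1A00".
ROBOT_LINE = re.compile(r"^([A-Z0-9]+)\((\d+)\)\s+IBM Device Name=(\S+)$")


def parse_tpconfig(text):
    """Return a list of drive dicts from captured 'tpconfig -d' output."""
    drives = []
    for raw in text.splitlines():
        line = raw.strip()
        drive = DRIVE_LINE.match(line)
        if drive:
            index, name, path, dtype, shared, status = drive.groups()
            drives.append({
                "index": int(index), "name": name, "path": path,
                "type": dtype, "shared": shared, "status": status,
            })
            continue
        robot = ROBOT_LINE.match(line)
        if robot and drives:
            # Attach the robot details to the drive row printed just before.
            drives[-1]["robot_type"] = robot.group(1)
            drives[-1]["robot_number"] = int(robot.group(2))
            drives[-1]["device_name"] = robot.group(3)
    return drives


# Copy of the sample output shown above.
SAMPLE_TPCONFIG = """
Index DriveName DrivePath Type Shared Status
***** ********* ********** **** ****** ******
0 Drive1 /dev/rmt2.1 hcart2 No UP
TLH(0) IBM Device Name=003590E1A00
1 Drive2 /dev/rmt1.1 hcart2 No UP
TLH(0) IBM Device Name=003590E1A01
2 Drive3 /dev/rmt3.1 hcart2 No UP
TLH(0) IBM Device Name=003590E1A02
Currently defined robotics are:
TLH(0) LMCP device path = /dev/lmcp0,
volume database host = dyadsm01
"""

if __name__ == "__main__":
    for d in parse_tpconfig(SAMPLE_TPCONFIG):
        print(d["name"], d["path"], d["status"], d.get("device_name"))
```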

4.5 Ejecting media from the IBM library

To eject media from a library, issue the following command for each tape:

mtlib -l library-device-name -C -s FF00 -t FF10 -V aaannn

The library device name can be found by running tpconfig -d on the master server.

Here -l is the library's file name, -C indicates that you want to change the category of a volume, -s is the starting value, -t is the value you want to change it to and -V is the volume serial number to be changed.

This will change the state of cartridge aaannn from being in the library to ejected and move it accordingly.

For further details of the mtlib command you can issue mtlib -?, which will produce a list of all the possible parameters and their meaning.
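Since the eject command has to be issued once per tape, a list of volsers can be turned into a list of commands for review before anything is run. The sketch below only builds argument lists and executes nothing. It assumes the options take a leading dash (-l, -C, -s, -t, -V) as in the mtlib examples in section 4.2; the function name and the volser format check (three letters followed by three digits, as the aaannn placeholder suggests) are assumptions for illustration.

```python
import re

# Assumed volser layout, following the "aaannn" placeholder used above.
VOLSER_FORMAT = re.compile(r"^[A-Z]{3}\d{3}$")


def build_eject_commands(library_device, volsers, source="FF00", target="FF10"):
    """Return one mtlib argument list per volser; nothing is executed here."""
    commands = []
    for volser in volsers:
        if not VOLSER_FORMAT.match(volser):
            raise ValueError("unexpected volser format: %r" % volser)
        commands.append([
            "mtlib", "-l", library_device,
            "-C", "-s", source, "-t", target,
            "-V", volser,
        ])
    return commands


if __name__ == "__main__":
    # "3494c" is the library name used in the section 4.2 examples;
    # the volsers are made-up values for illustration.
    for cmd in build_eject_commands("3494c", ["ABC123", "ABC124"]):
        print(" ".join(cmd))
```

Printing the commands first gives the operator a chance to check the volser list before any category change is made.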

4.6 Drive problems

If it is a drive problem then the SCSI commands to manipulate the drive must be issued from the master/media server. For example, to rewind a cartridge and prepare it to be unloaded from a drive you would issue the mt -f /dev/rmt1 rewoff command.

The most common problem is when a cartridge does not eject from a drive.

4.6.1 Identifying stuck cartridges

Compare the information from the robtest drstat (see below for details) command to that from the vmoprcmd command. If drstat lists a cartridge as being in the drive and vmoprcmd does not, and there are no active jobs using the cartridge, then it is safe to assume that the cartridge is stuck. To confirm this it is also worth checking the log files for SCSI errors related to the drive. These can indicate when the failure occurred and help identify the cause to the engineers.
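The comparison described above can be sketched as a small function. It assumes the three inputs have already been gathered by hand or by other scripts: the volser that drstat reports for each drive, the set of volsers that vmoprcmd lists as mounted, and the set of volsers in use by active jobs. The function name and data shapes are hypothetical, and a result is only a list of suspects to check against the SCSI errors in the logs.

```python
def find_stuck_cartridges(drstat_volsers, vmoprcmd_volsers, active_job_volsers):
    """Return {drive: volser} for cartridges that look stuck.

    drstat_volsers     -- dict of drive number -> volser reported by drstat ("" if empty)
    vmoprcmd_volsers   -- set of volsers that vmoprcmd lists as mounted
    active_job_volsers -- set of volsers being used by active jobs
    """
    suspects = {}
    for drive, volser in drstat_volsers.items():
        if not volser:
            continue  # the robot reports this drive as empty
        if volser in vmoprcmd_volsers:
            continue  # NetBackup also knows the tape is mounted
        if volser in active_job_volsers:
            continue  # a job is still using the tape
        suspects[drive] = volser
    return suspects


if __name__ == "__main__":
    # Made-up values for illustration.
    drstat = {1: "ABC123", 2: "", 3: "DEF456", 4: "GHI789"}
    print(find_stuck_cartridges(drstat, {"DEF456"}, {"GHI789"}))
```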

If you suspect a cartridge to be stuck, log onto the master/media server which uses the drive and issue the mt -f /dev/xxxxx rewoff command from the root account, where /dev/xxxxx is the file name for the drive. This should rewind the cartridge and eject it ready for the robot. If this command fails it could be because the cartridge has already been rewound or because there are problems communicating with the drive. To determine which, log onto the mount server and use the robtest dismount option. If this fails then the operators will have to power-cycle the drive to force it to eject the cartridge; this should be followed by the robtest dismount. If no progress is made after all this then it is time to get the engineers involved.

4.7 Library problems

If there are problems with the library, i.e. drives in AVR and cartridges not being mounted, then attempt a manual mount of a cartridge.

For example to mount a cartridge you would run robtest (see above for details) and then run the appropriate command.

A useful command is the libstat command. This shows the library status on the first line, which should be:

state: Automated Operational State

Further down the output the status of the I/O station/hopper is displayed and whether it is full and requires emptying. The status should be:

input/output status: All convenience input stations empty

All convenience output stations empty

Another useful command is drstat, which lists information for all the drives. From this it is possible to tell if a cartridge is in the drive and its identity.

Drive 3 information:

drive number: 3

device name: 003590B1A02

device number: 0x203140

device class: 0x11 - 3590 Model B1A/other

device category: 0x0000

mounted volser:

mounted category: 0x0000

device states: Device installed in ATL.

Dev is available to ATL.

ACL is installed.

In this example you can see that there is no cartridge in drive 3.
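When there are many drives, captured drstat output can be reduced to a simple drive-to-volser map. The sketch below is based only on the Drive 3 sample above. The function name is illustrative, and the appearance of a populated "mounted volser:" line (shown here for a made-up Drive 4) is an assumption, since the sample shows only an empty drive.

```python
import re

DRIVE_HEADER = re.compile(r"^Drive\s+(\d+)\s+information:$")


def mounted_volsers(drstat_text):
    """Map each drive number to its mounted volser, or None if the drive is empty."""
    result = {}
    current = None
    for raw in drstat_text.splitlines():
        line = raw.strip()
        header = DRIVE_HEADER.match(line)
        if header:
            current = int(header.group(1))
            result[current] = None
            continue
        if current is not None and line.startswith("mounted volser:"):
            volser = line.split(":", 1)[1].strip()
            result[current] = volser or None
    return result


# Drive 3 is copied from the sample above; Drive 4 is a hypothetical
# entry added to show a drive with a cartridge loaded.
SAMPLE_DRSTAT = """
Drive 3 information:
drive number: 3
device name: 003590B1A02
mounted volser:
mounted category: 0x0000
Drive 4 information:
drive number: 4
mounted volser: ABC123
"""

if __name__ == "__main__":
    print(mounted_volsers(SAMPLE_DRSTAT))
```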

4.8 mtlib commands - Description and useful commands

The mtlib command allows the IBM library to be interrogated and manipulated by a user.

To display all possible options use the mtlib -? command.

To find out what the logical device number is for a library display the /etc/ibmatl.conf file and the library will be at the end of the file.

mtlib -l 3494c -qV -V xxxnnn

This will display the status of a cartridge, i.e. whether it is in the library.

mtlib -l 3494c -qM

This will display all mounted cartridges and the library device number each one is in.

mtlib -l 3494c -D

This will display all the devices and their numbers.

mtlib -l 3494c -qL

This will display the status of the library.

5 ADIC Library Information

Note: Most of our ADIC/GRAU libraries are AMLs, and these have a library manager which understands the dasadmin commands explained below. Newer ADIC/GRAU libraries do not have a library manager. You cannot use dasadmin commands on these, so you will have to rely on robtest and NetBackup for library information.

5.1 Using robtest to interrogate the robot

Please refer to the instructions for the IBM robot above in section 4.1, as the process is the same.

5.2 dasadmin (only to be used on ADIC libraries)

LISTD: List Drive Status

This command displays the drive status for all clients or a specific client.

dasadmin ld

If the robot has more than 15 drives use dasadmin ld2.

E.g.

dybkup01 $ dasadmin ld

listd for client: successful

drive: DRIVE1 amu drive: 01 st: UP type: N sysid: client: dycase01 volser: cleaning 0 clean_count: 10

drive: DRIVE2 amu drive: 02 st: UP type: N sysid: client: dycase01 volser: cleaning 0 clean_count: 29

drive: DRIVE3 amu drive: 03 st: UP type: N sysid: client: dycase01 volser: cleaning 0 clean_count: 28

drive: DRIVE4 amu drive: 04 st: UP type: N sysid: client: dycase01 volser: cleaning 0 clean_count: 28

drive: DRIVE5 amu drive: 05 st: UP type: N sysid: client: dyespwp1 volser: cleaning 0 clean_count: 14

drive: DRIVE6 amu drive: 06 st: UP type: N sysid: client: dyespwp1 volser: cleaning 0 clean_count: 1

drive: DRIVE7 amu drive: 07 st: UP type: N sysid: client: dyespwp1 volser: cleaning 0 clean_count: 15

drive: DRIVE8 amu drive: 08 st: UP type: N sysid: client: dyespwp1 volser: cleaning 0 clean_count: 24

drive: DRIVE9 amu drive: 09 st: UP type: N sysid: client: dyvsisb1 volser: cleaning 0 clean_count: 6

drive: DRIVE10 amu drive: 10 st: UP type: N sysid: client: dyvsisb1 volser: cleaning 0 clean_count: 3

drive: DRIVE11 amu drive: 11 st: UP type: N sysid: client: dybkup01 volser: DEF595 cleaning 0 clean_count: 17

drive: DRIVE12 amu drive: 12 st: UP type: N sysid: client: dybkup01 volser: cleaning 0 clean_count: 7

drive: DRIVE13 amu drive: 13 st: UP type: N sysid: client: dynebk01 volser: cleaning 0 clean_count: 5

drive: DRIVE14 amu drive: 14 st: UP type: N sysid: client: dynebk01 volser: cleaning 0 clean_count: 23

drive: DRIVE15 amu drive: 15 st: UP type: N sysid: client: dynebk01 volser: DEC406 cleaning 0 clean_count: 2

To display the drive list for a specific client:

dasadmin ld dybkup01

dybkup01 $ dasadmin ld dybkup01

listd for client: dybkup01 successful

drive: DRIVE11 amu drive: 11 st: UP type: N sysid: client: dybkup01 volser: DEF595 cleaning 0 clean_count: 17

drive: DRIVE12 amu drive: 12 st: UP type: N sysid: client: dybkup01 volser: cleaning 0 clean_count: 7

drive: DRIVE23 amu drive: 23 st: UP type: N sysid: client: dybkup01 volser: cleaning 0 clean_count: 13

drive: DRIVE24 amu drive: 24 st: UP type: N sysid: client: dybkup01 volser: DEG393 cleaning 0 clean_count: 16

das options:

Display      Description of Parameter
drive        Drive number
st           Drive status UP or DOWN
type         Drive type
sysid        Reserved
client       Client name allocated to the drive
volser       Mounted volume on the drive
cleaning     Actual cleaning activity
             0: no clean activity on the drive
             1: cleaning media mounted on the drive
clean_count  Number of mounts until the next cleaning interval
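The fields described above can be extracted from captured dasadmin ld output with a regular expression, as in the sketch below. The function names are illustrative, and the pattern is an assumption derived from the sample listings in this section, where sysid and volser may be empty.

```python
import re

# One listd row, in the layout shown in the sample output above.
LISTD_LINE = re.compile(
    r"^drive:\s*(?P<drive>\S+)\s+amu drive:\s*(?P<amu_drive>\d+)\s+"
    r"st:\s*(?P<st>\S+)\s+type:\s*(?P<type>\S+)\s+"
    r"sysid:\s*(?P<sysid>\S*?)\s*client:\s*(?P<client>\S+)\s+"
    r"volser:\s*(?P<volser>\S*?)\s*cleaning\s+(?P<cleaning>\d+)\s+"
    r"clean_count:\s*(?P<clean_count>\d+)$"
)


def parse_listd(text):
    """Return one dict per drive row found in captured 'dasadmin ld' output."""
    drives = []
    for raw in text.splitlines():
        found = LISTD_LINE.match(raw.strip())
        if not found:
            continue  # e.g. the "listd for client: successful" header
        drive = found.groupdict()
        drive["cleaning"] = int(drive["cleaning"])
        drive["clean_count"] = int(drive["clean_count"])
        drives.append(drive)
    return drives


def loaded_drives(drives):
    """Map drive name to volser for drives that currently hold a tape."""
    return {d["drive"]: d["volser"] for d in drives if d["volser"]}


# Rows copied from the sample output above.
SAMPLE_LISTD = """\
listd for client: successful
drive: DRIVE10 amu drive: 10 st: UP type: N sysid: client: dyvsisb1 volser: cleaning 0 clean_count: 3
drive: DRIVE11 amu drive: 11 st: UP type: N sysid: client: dybkup01 volser: DEF595 cleaning 0 clean_count: 17
drive: DRIVE15 amu drive: 15 st: UP type: N sysid: client: dynebk01 volser: DEC406 cleaning 0 clean_count: 2
"""

if __name__ == "__main__":
    print(loaded_drives(parse_listd(SAMPLE_LISTD)))
```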

ALLOCD: Allocate drive to different client

dasadmin allocd DRIVEx UP|DOWN client

To see the range of tapes assigned use the dasadmin qvolsrange command.

Note: This command returns a list of volsers that are accessible to the specified client within the requested volser range.

dasadmin qvolsrange beginvolser endvolser count (client name)

e.g. dasadmin qvolsrange DTP216 DTP220 8

Parameter    Description of Parameter
beginvolser  Specifies the first volser in the range
endvolser    Specifies the last volser in the range
count        Specifies the number of volsers to report within the range. (This number can be larger than the actual number required.)
client name  If a client name is specified the volser range is checked for that client only; if none is specified then all are checked

5.3 Mount a Volume in a tape drive

This command mounts a volume on a drive from the library.

dasadmin mount -t media-type volser drive

e.g. dasadmin mount -t 3590 DTP216 DRIVE4

Parameter   Description of Parameter
media-type  Specifies the type of media you are using, e.g. 3590
volser      Specifies the specific media you wish to be mounted, e.g. DTP216
drive       Specifies the drive number on which the media is to be mounted, e.g. DRIVE4

5.4 Dismount a Volume from a tape drive

This command dismounts a volume and replaces it back into the library.

dasadmin dismount -t media-type volser

e.g. dasadmin dismount -t 3590 DTP216

Parameter   Description of Parameter
media-type  Specifies the type of media you are using, e.g. 3590
volser      Specifies the specific media you wish to be dismounted, e.g. DTP216

5.5 Media Movements - Inserting Media

Unlike the IBM library, it is necessary to issue a command to insert media from the input hopper into the library. This command will insert volumes from a specific insert area into the library area.

dasadmin insert -t media-type area

e.g. dasadmin insert -t 3590 I01

Parameter   Description of Parameter
media-type  Specifies the type of media you are using, e.g. 3590
area        Specifies the area where the tape(s) will be inserted, e.g. I01

5.6 Ejecting Media from the ADIC Library

This command will move volume(s) to the eject area to be removed from the library.

dasadmin eject (-c) -t media-type volser-range area

e.g. dasadmin eject -c -t 3590 DTP216,DTP220 E01

Parameter     Description of Parameter
-c            Tells DAS to remove the volser from the catalog
media-type    Specifies the type of media you are using, e.g. 3590
volser-range  Specifies one or more volsers to be ejected
area          Specifies the area where the tape(s) will be ejected, e.g. E01
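The mount, dismount, insert and eject forms in sections 5.3 to 5.6 share the same -t media-type layout, so they can be assembled by small helpers when scripting a sequence of movements. The sketch below only builds argument lists in the order given in those sections and runs nothing. The function names are hypothetical, and the leading dash on -c is assumed from the parameter table above.

```python
def dasadmin_mount(media_type, volser, drive):
    """Arguments for: dasadmin mount -t media-type volser drive"""
    return ["dasadmin", "mount", "-t", media_type, volser, drive]


def dasadmin_dismount(media_type, volser):
    """Arguments for: dasadmin dismount -t media-type volser"""
    return ["dasadmin", "dismount", "-t", media_type, volser]


def dasadmin_insert(media_type, area):
    """Arguments for: dasadmin insert -t media-type area"""
    return ["dasadmin", "insert", "-t", media_type, area]


def dasadmin_eject(media_type, volser_range, area, remove_from_catalog=False):
    """Arguments for: dasadmin eject (-c) -t media-type volser-range area"""
    args = ["dasadmin", "eject"]
    if remove_from_catalog:
        args.append("-c")  # ask DAS to remove the volser from the catalog
    args += ["-t", media_type, volser_range, area]
    return args


if __name__ == "__main__":
    # Values taken from the examples in sections 5.3 to 5.6.
    print(" ".join(dasadmin_mount("3590", "DTP216", "DRIVE4")))
    print(" ".join(dasadmin_dismount("3590", "DTP216")))
    print(" ".join(dasadmin_insert("3590", "I01")))
    print(" ".join(dasadmin_eject("3590", "DTP216,DTP220", "E01", remove_from_catalog=True)))
```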

5.7 Drive Problems

Robotic problems are more straightforward to identify than drive problems.

Determine what type of robot you are using:-

Predominantly we use ADIC robots at the moment (AMLs).

5.8 AMLs

The amu is a PC which sits on the front of the robot and handles all requests for action from all the backup servers. The tape drives sit in the robot, but there is no direct electrical connection between the robot and the drives. The robot arm actions requests given to it by the amu. The racking is used to store the tapes. The amu is contacted using standard IP addressing. ADIC provide a binary that can be used to communicate with the library, and that command is dasadmin. A full list of dasadmin commands and their use can be found by typing dasadmin -?

5.9 Common Problems

The drives are in AVR mode. Check the messages/syslog for errors, then:

Ping the amu from the server; the IP address will be found in /etc/hosts.

If there is no response from either server, call out ADIC. This could also be a network problem, such as network connectivity to the box having been lost, in which case there is nothing to be done until the network has been restored.

If you can ping the amu, then issue the command:

#dasadmin ld

which will talk to the amu and query the status of the drives. If this hangs or returns "unexpected response code received from the amu", call out ADIC.

If dasadmin ld returns a list of drives, then select a tape from media manager which does not have a time assigned value.

Try to mount it in a drive. Use the command:

#dasadmin mount -t 3590 volser DRIVEn

If this returns "unexpected response code received from the amu", then call out ADIC. If it works then either use reset with drive control, or

#mt -f /drive off

and

#dasadmin dismount -t 3590 volser

5.10 Points To Note

Some more observations on problems encountered with ADIC robots.

The tlmd daemon, which NetBackup uses to talk to the robot, will periodically test the state of the tape drives. If drives are in AVR mode (not DOWN-TLM) then when the daemon gets a positive response the drives come back up automatically.

In the syslog you may see the message "robot encountered an error handling a volume"; this could cause intermittent problems. You may be able to get away with freezing the tape or, if it is a scratch tape, moving some from another pool (see exit status 96 for detailed instructions on moving media).

Then call up ADIC. The command to freeze a tape is:

#bpmedia -freeze -ev volser

If the robot door is opened it will stop the robot, and you will need to call ADIC out to restart the machine.

dasadmin ld2 may have to be used to view tape drive numbers above 15.

By using vmoprcmd it is possible to determine the UNIX device files associated with the NetBackup Drive index. When the output appears it contains the drive index number, and the device file associated with it.

Check whether a tape is stuck on a drive:

Is there an RVSN on the drive index? If so it means that NetBackup can read the tape label, and the drive is functioning to some degree.

Use drive control to reset the drive. This effectively does mt -f off and tells the robot to put the tape away. If this does not work, make sure backups are not using that tape drive, and issue mt -f /device-file off. If this takes you back to the prompt (i.e. it has worked) then instruct the robot to put the tape away by issuing the dasadmin dismount command, then try the drive out again. If the drive fails again, then an engineer will have to be called, because it may be that the drive can read the header label but not position on the tape. The sequence of events would be:

mt -f /devicefile off

dasadmin dismount -t 3590 volser

Run a backup, and check on job monitor to see if the tape has positioned.

If the mt -f /device-file off reports an error then call an engineer. Take the drive down in NetBackup until the engineer has dealt with it.

If there is no RVSN on the drive:

Firstly interrogate the robot to determine whether the robot has been putting tapes in the drive. Use dasadmin ld to list the drives.

If there is a tape on the drive, check that the tape isn't just sitting on the lip of the drive by instructing the robot to put the tape away. This can be done by issuing the command:

#dasadmin dismount -t 3590 volser

If the command is successful then the tape was on the lip.

If the robot reports that the drive cannot be unloaded, then issue:

#mt -f /device-file off

If this takes you back to the prompt (i.e. it has worked) then instruct the robot to put the tape away, then try the drive out again. If it doesn't work, try:

#mt -f /device-file status

then call out an engineer. It is worth checking the syslog again, because the previous actions may have forced UNIX to report an IO error.

5.11 Library Problem

If there are problems with the library, i.e. drives in AVR and cartridges not being mounted, check the messages log on the master server for the following:

Feb 10 09:39:30 dybkup01 tlmd[25753]: [ID 897060 daemon.error] TLM(0) dismount failure for volser SDY815 on drive DRIVE12, d_errno = 10, The AMU was unable to communicate with the robot.

Feb 10 09:39:30 dybkup01 tlmd[18661]: [ID 160136 daemon.error] TLM(0) going to DOWN state, status: Robot hardware or communication error

Feb 10 09:39:55 dybkup01 tlmd[26277]: [ID 969665 daemon.error] TLM(0) dismount failure for volser SDY470 on drive DRIVE11, d_errno = 10, The AMU was unable to communicate with the robot.

Feb 10 10:33:13 dybkup01 tlmd[18661]: [ID 861719 daemon.error] TLM(0) drive DRIVE11 (device 0) is being DOWNED, status: Robotic dismount failure
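Entries of this kind can be picked out of a messages file with a simple filter. The following is a minimal sketch only: the function name robot_comms_errors is hypothetical, and the search patterns are assumptions taken from the example entries above, so other wording may appear in practice.

```shell
# Hypothetical helper: read syslog text on stdin and print only the tlmd
# entries that point at a robot/AMU communication failure.
# The two patterns are assumptions based on the example entries above.
robot_comms_errors() {
    grep 'tlmd\[' | grep -E 'unable to communicate with the robot|going to DOWN state'
}
```

It would be fed the messages log for the platform, for example robot_comms_errors < /var/adm/messages on a Sun master server.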

On seeing these messages, initiate a call with the vendor, ADIC.

6 Accessing NetBackup

6.1 NBU 4.5

When you log on to a server, access NetBackup using one of the following methods (this is only possible on a NetBackup master server, not a client): -

Via the vt100 panels:

#bpadm

From the toolbar: select the Start button, then Programs, Veritas NetBackup, NetBackup Administration, and from there access the required panel.

By clicking your shortcut icon (if one has been set up).

You may also wish to amend your profile to include the NetBackup directories in your PATH.

Another method of accessing the panels, without the need to amend profiles, is to su storage as root, which will give you the profile necessary to perform the above.

7 NetBackup Daemon Problems

There are six main NetBackup daemons, which run constantly.

NetBackup:-

bprd

bpdbm

Media Manager:-

vmd

ltid

avrd

tlmd/tlhd

These daemons run on the server only, and not on any of the clients. On slave servers only the media manager daemons run.

Of these the most significant are bprd and ltid.

bprd starts bpdbm, and ltid starts vmd, avrd, and tlmd or tlhd.

tlmd and tlhd are robotic daemons; which one is used depends on the type of robot. tlmd is used with ADIC robots, whilst tlhd is used with IBM robots.

7.1.1 Daemon Descriptions

bprd - On master servers this daemon handles requests for backups and restores, and scheduled backups.

bpdbm - On master servers bpdbm handles all the configuration, error and file databases.

ltid - On masters and slaves this daemon controls the reservation and management of volumes.

avrd - On masters and slaves this daemon performs automatic volume recognition, i.e. it recognises a volume from the label on the tape.

vmd - The volume manager daemon manages the volume database containing details about tape usage.

tlmd/tlhd - The robotic control daemons perform robot handling.

7.1.2 Starting NetBackup

The script bp.kill_all in /usr/openv/netbackup/bin/goodies will kill off all NetBackup daemons under normal conditions. If after running bp.kill_all there are daemons hung (usually bpsched processes) these should be killed, using kill, or kill -9 if necessary.

Alternatively, if you experience problems shutting down NetBackup using the above script, use the nbu_kill script, which can be found in /usr/openv/btscripts.

7.1.3 Starting up individual NetBackup daemons

To start up NetBackup daemons you must always be root.

To check for NetBackup daemons, issue bpps -a, which can be found in directory /usr/openv/netbackup/bin.
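As a rough illustration of checking that listing against the daemons named earlier in this section, the sketch below reports which expected daemons are absent. The function name missing_daemons is hypothetical, and it assumes each daemon appears in the listing as the last component of a full path, as in the bpps -a examples elsewhere in this document.

```shell
# Hypothetical helper: read "bpps -a" style output on stdin and print the
# name of each expected daemon (given as arguments) that does not appear.
# Assumes daemons are listed with a full path ending in the daemon name.
missing_daemons() {
    listing=$(cat)
    for daemon in "$@"; do
        # Look for "/<daemon>" followed by a space or the end of the line.
        if ! printf '%s\n' "$listing" | grep -Eq "/${daemon}( |\$)"; then
            printf '%s\n' "$daemon"
        fi
    done
}
```

On an ADIC master server it might be used as bpps -a | missing_daemons bprd bpdbm vmd ltid avrd tlmd; no output means all of the named daemons were found.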

The Netbackup startup script netbackup is in /usr/openv/netbackup/bin/goodies.

If bprd has crashed, issuing /usr/openv/netbackup/bin/initbprd will start it again.

If bpdbm has crashed, bprd will periodically start it up again.

If vmd has crashed then it can be started by issuing:-

#vmadm

s> special actions

i> initiate Media Manager Volume Daemon

If just tlmd has not started, or has crashed, it can be started using ./tlmd from /usr/openv/volmgr/bin.

This is the recommended command when starting NetBackup.

If avrd has crashed, issue /usr/openv/volmgr/bin/stopltid to stop ltid, and then /usr/openv/volmgr/bin/ltid to restart ltid.

7.2 Shutdown Netbackup

The script bp.kill_all in /usr/openv/netbackup/bin/goodies will kill off all NetBackup daemons under normal conditions. If after running bp.kill_all there are daemons hung (usually bpsched processes) these should be killed, using kill, or kill -9 if necessary.

Alternatively, if you experience problems shutting down NetBackup using the above script, use the nbu_kill script, which can be found in /usr/openv/btscripts.

7.2.1 Problem resolution

It is rare that there are problems with the NetBackup daemons.

Common problems are: -

The daemons weren't started after a reboot. In this case log on to the server, su - root, then issue ./netbackup from /usr/openv/netbackup/bin/goodies.

The hostname on the box has changed. This should be reviewed with NetBackup Support and Service Delivery.

tlmd can crash if there has been a severe robot problem. To start this daemon, log on as root and issue ./tlmd from /usr/openv/volmgr/bin.

Tlmd can crash if there is a network problem.

If after trying the above Netbackup does not start, then it will be necessary to callout NetBackup support.

8 Netbackup Activity Logs and process information

8.1 Introduction

This directory (/usr/openv/netbackup/logs) is where detailed activity logs will be placed on the NetBackup client box if certain sub-directories exist. These sub-directories should only be created if unexplained problems are occurring with the NetBackup product and more information is required to isolate the problem. For further information on Veritas NetBackup Logging procedures see ISL/OPS/B194 at http://documents.intra.bt.com/

Warning: Some of these logs can potentially grow very large, and should only be enabled if unexplained problems exist.

8.2 File system full problems

In the event of filesystems listed below becoming full or being reported as over a specified percentage, this should be addressed with the immediate removal of log directories under /usr/openv/netbackup/logs, leaving only admin and user_ops

/usr/openv/ mountpoints > 100%

If the situation arises where the mountpoint for NetBackup reaches or exceeds 100%, then the following action can be taken to try and reduce the utilisation.

Logon to the server and su to root and issue the following command:-

bpimage -cleanup -allclients

An alternative is to bounce the daemons, which will force the cleanup process to start. This has been added to the crontab on all BRUNIX master servers to run it regularly.
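To decide whether a mountpoint has reached the point where the cleanup above is needed, the percentage used can be read from df output. This is a minimal sketch under stated assumptions: the function name fs_percent_used is hypothetical, and it assumes df -k prints the mount point as the last field of a single line and the usage as a field ending in "%", which is not guaranteed on every platform.

```shell
# Hypothetical helper: read "df -k" style output on stdin and print the
# percentage used (digits only) for the mount point given as $1.
# Assumes the mount point is the last field on its line and that the
# usage field is the first field of the form "<digits>%".
fs_percent_used() {
    awk -v mp="$1" '$NF == mp {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+%$/) { sub(/%/, "", $i); print $i; exit }
    }'
}
```

For example, df -k /usr/openv | fs_percent_used /usr/openv prints a bare number that can be compared against whatever threshold the monitoring uses.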

8.3 Full List of NETBACKUP Processes

Here are descriptions of NetBackup processes:

bprd

-request daemon

-can be terminated and initiated from the admin interfaces

-responds to client and administrative requests

-restores

-backups

-archives

-"list files backed-up or archived"

-manual/immediate backups

-reread configuration database

bpsched

-backup scheduler

-started by bprd on user directed backups and archives

-started by bprd on immediate/manual backups

-started by bprd every "Wakeup Interval" for regularly scheduled incremental and full backups

-uses information from the class & storage unit databases to determine what clients to start, when to start them, and what storage unit to write backups/archives to

bpdm

-disk manager

-used on storage units of type Disk

-started by bpbrm on backups and restores

-during backups and restores, one of these is started (on the server with the storage unit) for each client backup or restore

bptm

-removable media (tape) manager

-used on storage units of type Logical Tape

-started by bpbrm on backups and restores

-during backups and restores, one of these is started (on the server with the storage unit) for each client backup or restore

-also responsible for managing the media database

-used to display info in the Media Reports screen when you select Media List

bpbrm

-backup/restore manager

-started by bpsched on backups/archives

-started by bprd on restores

-during backups and restores, one of these is started (on the server with the storage unit) for each client backup or restore

-responsible for managing both the client and the media manager processes. Uses error status from both to determine ultimate status of backup or restore.

bpdbm

-database manager

-manages class, config/behavior, storage unit, and error DB's

-started by the inetd(1M) process

bpcd

-"client daemon"

-used on clients (and remote servers) to initiate other product programs, without requiring /.rhosts entries for the server on each client

-started by the inetd(1M) process

bparchive

-command-line program on clients to initiate archives

-communicates with bprd on server

bpbackup

-command-line program on clients to initiate backups

-communicates with bprd on server

bpbkar

-program used on standard clients to generate backup images

-not used directly by client users

bplist

-command-line program on clients to initiate file lists

-communicates with bprd on server

bprestore

-command-line program on clients to initiate restores

-communicates with bprd on server

tar

-program used on standard clients to restore backup images

bp

-menu user interface for backups, archives, and restores

8.4 General Drive Testing and Fault Resolution

When a media hardware problem is identified by Ops Analysts they will attempt resolution themselves, without involving NetBackup Support. This resolution will include callout of onsite support (e.g. NACC Ops, CE), if required, and subsequent liaison to resolve the fault. However, NetBackup Support may be contacted for more detailed problem analysis if the problem is not immediately identified as a tape/drive/library fault.

8.5 Dealing With Tape Drive Problems

The majority of backup service-affecting problems that occur on NetBackup are caused by tape drive failures of one sort or another. It is therefore important to understand the components in a drive path, and which tools we can use to diagnose the problem.

When dealing with drive problems, always check first whether the problem is drive or robot related; often the error codes are the same.

8.6 Determining whether problem is Drive or Robot.

It is not always going to be easy to determine whether the problem is that the drive is broken, or the robot, but for the most part you can tell very quickly by looking at the following.

Is the robot a shared library?

If it is, are tapes being mounted on another server? If so it is unlikely that the problem is robotic.

Can you obtain drive status?

If you cannot, you definitely have a robot problem.

Are the tape drives in AVR mode?

If so it may be a local problem with the connection to the robot.

Are the tape drives down, and do they remain down when you bring them up?

If so this usually means you have a drive related problem in the first instance.

What does the messages/syslog say?

NetBackup media management logs useful information to the syslog, and this should always be examined.

At the end of your initial diagnosis you may not be completely sure whether the problem is robot or drive related, but if in doubt start looking at the drive first, as this is much more common a problem.
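The questions above can be written down as a simple checklist function. This is only a sketch of the guidance in this section, not a diagnostic tool: the function name triage_hints and the wording of the hints are hypothetical, and it simply prints every hint that applies rather than reaching a single verdict.

```shell
# Hypothetical helper encoding the questions above. Each argument is
# "yes" or "no":
#   $1  tapes are being mounted on another server sharing the library
#   $2  drive status can be obtained
#   $3  the tape drives are in AVR mode
#   $4  the drives go down again after being brought up
triage_hints() {
    [ "$1" = yes ] && echo "robot fault unlikely (tapes mount on another server)"
    [ "$2" = no ]  && echo "robot problem (drive status cannot be obtained)"
    [ "$3" = yes ] && echo "possible local problem with the connection to the robot"
    [ "$4" = yes ] && echo "drive problem likely (drives will not stay up)"
    return 0
}
```

If no hint is printed the guidance above still applies: if in doubt, start by looking at the drive.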

8.6.1 Points To Note

The following points to bear in mind when dealing with drive problems come from experience rather than methodical diagnosis, but they have happened on more than one occasion and are not easy to spot.

Everything seems ok until you take one drive down

The H/W engineers (usually Sun or Sequent) have swapped the SCSI cables around. NetBackup will spot that the tape it wants is in a drive and use it; however, if the tape drives are the wrong way around, then when you shut one down the tape will be put in the down drive. This is very confusing. The resolution is very simple: change the robot drive numbers around and run stopltid and ltid. The NetBackup Support team should do this.

ltid, avrd, vmd, tlmd or tldd will not start after a reboot, or after you have started NetBackup

Look in the syslog first; it always tells you why they did not start. If the device file is not in UNIX then ltid will not start, but this is reported in the syslog.

After a reboot, it has been known (particularly on Compaq) for the device files to be renamed. This has the same effect as removing the device files. It is also a nightmare trying to map the new files back on. There is no easy way to identify whether this has happened, but if an ls -al command is issued on the directory it should be possible to identify whether there are new device files.

SCSI extenders complicate the situation!

Many of the drive problems are caused by SCSI extenders, either failing or having a glitch. Resetting the extenders may be enough to fix a problem; however, from our point of view we just call out an engineer, and it will be the engineer who determines what action is required.

Reboot or Not Reboot

It is by no means clear when a reboot is required after or during work on a drive. Some manufacturers support peripherals better than others do. By deconfiguring the SCSI bus on Sequent boxes we can do most operations without requiring a reboot. Compaq also are fairly robust. HP and Sun are questionable, although Sun should not require a reboot under normal circumstances. If the SCSI cable is unplugged from the back of the host machine, then most times you will have to reboot the machine. If you cannot use a device after it has been fixed, always seek guidance from TSG before arranging to reboot a box.

8.7 Device Configuration Utility - Tpconfig

Tpconfig is the netbackup utility to configure devices used by NetBackup. Full details about its use are covered in the Veritas NetBackup manuals.

This utility will mostly be used by Storage Administrators, but there is one command that the OAs will find useful:

tpconfig -d - this lists all the devices configured in NetBackup.

9 Deconfigure/Reconfigure Sequent Drives

9.1.1 Deconfigure

Prior to deconfiguring, ensure that the drive in question is DOWN in NetBackup, as otherwise the drive will be actively polled, preventing the deconfig.

To perform this task root access is required.

Check on Device Manager for the device name by which the drive in question is set, or use tpconfig -l, which displays the same information.

In brief the steps are:

1. Check

dumpconf | grep tc

2. Deconfigure

devctl -d tcn

devctl -d scsibusnn

3. Reconfigure

devctl -c qcicn

4. Check

dumpconf | grep tc

9.1.2 To perform the DECONFIG

List current config using: dumpconf | grep tc

Enter "devctl -d tcn" where n = drive number e.g. tc2

(this can be found by using tpconfig, list drive configuration)

If all is well, you will see a brief statement confirming that the drive in question has been deconfigured

# devctl -d tc0

devctl: deconfiguring tc0 from scsibus18

The output from the above command will give the scsibus number

Enter "devctl -d scsibusnn" where n = scsibus number e.g. scsibus18

If all is well, you will see a brief statement confirming that the scsibus in question has been deconfigured (see below)

# devctl -d scsibus18

devctl: deconfiguring scsibus18 from qcic5

Keep a note of the drivename, scsibus and qcic (or fcbr) for reconfiguring

The engineer should now be able to carry out any work necessary.

9.1.3 Reconfiguration

Prior to reconfiguring, if you have misplaced the deconfig details, you can find them in the relevant ktlog via /usr/adm/ktlog/yyyy/mm/dd - e.g. /usr/adm/ktlog/1999/10/04. You should find output similar to the following:

#37f8c521 16:17:53 tolog/note p8598 devctl -dD tc1

#37f8c521 16:17:53 tolog/note p8598 NAME CFGTYPE DEVNUM UNIT FLAGS OnBUS OnDEVICE

#37f8c521 16:17:53 tolog/note p8598 deconfig: tc1 tc 1 0x00000000 S scsi scsibus22

#37f8c522 16:17:54 tolog/note p8598 devctl: deleted tc1: type: tc: devnum 0x1

#37f8c53b 16:18:19 tolog/note p8635 devctl -dD scsibus22

#37f8c53b 16:18:19 tolog/note p8635 NAME CFGTYPE DEVNUM UNIT FLAGS OnBUS OnDEVICE

#37f8c53b 16:18:19 tolog/note p8635 deconfig: scsibus22 scsibus 22 0x00000060 SM mscsi qcic6

#37f8c53c 16:18:20 tolog/note p8635 devctl: deleted scsibus22: type: scsibus: devnum 0x16

This information can also be listed using the ktmesg command
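The deconfig details can be pulled out of such ktlog output with a short filter. This is a sketch only: the function name deconfig_details is hypothetical, and the field positions are assumptions taken from the example entries above (the word "deconfig:" in the fifth field, the device name in the sixth, and the parent device in the last).

```shell
# Hypothetical helper: read ktlog text on stdin and, for every "deconfig:"
# entry, print the device that was deconfigured and what it was attached to.
# Field positions are assumed from the example ktlog entries above.
deconfig_details() {
    awk '$5 == "deconfig:" { print $6, "was on", $NF }'
}
```

Against the example above it would report that tc1 was on scsibus22 and that scsibus22 was on qcic6, which are the values needed for the reconfig.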

9.1.4 To perform the RECONFIG

Enter "devctl -c qcicn" where n = qcic number e.g. qcic13

If all is well, you will see a brief statement confirming that the devices have been found as in the example below.

# devctl -c qcic5

devctl: Found scsibus18, tc0

To check what the current settings are and confirm the above commands, enter "dumpconf | grep tc" which will produce output similar to the following:

tc0 tc 0 0x00000000 S scsi scsibus18

tc1 tc 1 0x00000000 S scsi scsibus22

tc2 tc 2 0x00000000 S scsi scsibus26

tc3 tc 3 0x00000000 S scsi scsibus30

tcpmux pseudo -

If the original settings are showing, the work has been completed successfully.

If, on using the 'devctl -c qcicn' command to reconfigure, only one new device shows, try a query on the found device e.g. 'devctl -c scsibus18' which may detect the other device.

Note: If the system responses diverge in any way from the examples given above, contact Sequent TSS Group.
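For reference, the whole deconfigure/reconfigure sequence for one drive can be printed as a checklist before starting work. The sketch below only prints the commands described in this section and runs nothing; the function name sequent_drive_steps is hypothetical, and the three values must be taken from the actual dumpconf and devctl output for the drive concerned.

```shell
# Hypothetical helper: print (not run) the command sequence described in
# this section for one drive.
#   $1 = drive name, e.g. tc0
#   $2 = scsibus name, e.g. scsibus18
#   $3 = controller name, e.g. qcic5
sequent_drive_steps() {
    printf '%s\n' \
        "dumpconf | grep tc" \
        "devctl -d $1" \
        "devctl -d $2" \
        "devctl -c $3" \
        "dumpconf | grep tc"
}
```

For the example used above this would be called as sequent_drive_steps tc0 scsibus18 qcic5.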

10 Resetting the SCSI extenders

10.1 Overview

There is a requirement to use SCSI extenders to connect some tape drives because there is a physical limit of 25m on SCSI, which in most computer halls is not enough to allow for full use to be made of expensive robotic libraries.

The use of SCSI extenders greatly increases the number of points of failure that are present in the configuration. It also complicates problem resolution.

NOTE: the resetting of SCSI extenders is usually done by the Data Centre Operations teams. However the procedure is as follows:

1. Remove the covers at the back of the robot. Identify the offending drive.

2. On top of the drive there is a small LCD panel connected via a cable. On this panel, using the arrow keys, scroll down to "UNLOAD DRIVE" and press "RETURN". This will eject any tape which might be mounted.

3. Next, go to the PC at the end of the silo and send a 'KEEP' signal as follows: from the open window, click on 'COMMANDS' then, from the drop down menu, click on 'KEEP'. Move the cursor to "SOURCE" and enter the drive number, e.g. D21 (must be in upper case). Then click on "EXECUTE".

4. Go back to the LCD display panel connected to the drive and, from the keypad menu, press 'E' then '2'. You should see 'FIBLEN' on the display. This indicates that you have carried out a 'fibre length' check. Then press 'C' to clear. However, if nothing happens after entering '2' then power OFF the drive. The ON/OFF switch is at the back of the drive on the right hand side. Then power back ON.

5. Reset both SCSI extenders.

11 Useful Media Commands

11.1 BPMEDIA: Freeze, unfreeze, suspend or unsuspend media

bpmedia -[parameter] -ev [media_id]

e.g.

#bpmedia -freeze -ev volser (DTP216)

Parameter     Description of Parameter

-freeze       Freeze specified media id

-unfreeze     Unfreeze specified media id

-suspend      Suspend specified media id

-unsuspend    Unsuspend specified media id

-ev           Specify media id

-h            Specify host
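To reduce the chance of mistyping the action, the command line can be assembled by a small wrapper that accepts only the four actions listed above. This is a minimal sketch: the function name bpmedia_cmd is hypothetical, and it only prints the command so that it can be checked before being run as root.

```shell
# Hypothetical helper: print the bpmedia command for one media id.
#   $1 = action: freeze, unfreeze, suspend or unsuspend
#   $2 = media id (volser), e.g. DTP216
# Nothing is executed; the command is only printed for review.
bpmedia_cmd() {
    case "$1" in
        freeze|unfreeze|suspend|unsuspend) ;;
        *) echo "unknown action: $1" >&2; return 1 ;;
    esac
    [ -n "$2" ] || { echo "media id required" >&2; return 1; }
    printf 'bpmedia -%s -ev %s\n' "$1" "$2"
}
```

For example, bpmedia_cmd freeze DTP216 prints the same command as the example at the start of this section.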

11.2 Tape Management

# bpadm

Reports

Media

Media Summary

11.2.1 Killing Processes

# kill -9 {PID Number}

#bpps -a

root 1206 1 0 Mar 22 ? 0:32 /usr/openv/netbackup/bin/bprd

root 1218 1 0 Mar 22 ? 1:35 /usr/openv/netbackup/bin/bpdbm

11.2.2 VT100 Command Line

For producing readable lists from usually non-readable text files.

# cd /usr/openv/netbackup/bin/admincmd

# ./bpcllist ORACLE -L or -U

11.2.3 Logs

These may be in various locations dependent on the platform; however, check the following first.

/usr/adm/ktlog/ - Sequent

/usr/spool/adm/

/usr/adm/syslog/

/var/adm/syslog/ - HP

/var/adm/messages - Sun

errpt -a - AIX

/usr/openv/netbackup/logs/bp*

11.2.4 WHO Command

who -b (Shows when system was last re-booted)

11.2.5 MT (SCSI Level) Commands

All mt commands require the full pathname of the drive device file. They can normally only be issued as root, and talk to the AMU on a SCSI level.

# mt -f /dev/rmt/tc0c status

tc0: Waiting up to 90 seconds for tape ready...

tc0: Device Not Ready (check cartridge).

/dev/rmt/tc0c: I/O error

# mt -f /dev/rmt/tc1c status

/dev/rmt/tc1c: I/O error

# mt -f /dev/rmt/tc2c status

tc2: Waiting up to 90 seconds for tape ready...

tc2: Device Not Ready (check cartridge).

/dev/rmt/tc2c: I/O error

# mt -f /dev/rmt/tc3c status

tc3: Waiting up to 90 seconds for tape ready...

tc3: Device Not Ready (check cartridge).

/dev/rmt/tc3c: I/O error

The above mt commands will work only if a tape is loaded onto the drive, so tapes should be loaded using das commands.

12 Restore Guidance

12.1 Background

This section is intended for use by groups who may be required to perform restores on Unix Systems using NetBackup. There are a number of different ways to run a restore. These are:

1. From the Admin GUI.

2. From the command line interface, the bpadm or bp panels.

3. From the command line directly.

12.1.1 Introduction

The restore facility for Netbackup is extremely powerful. Its use must only be considered with sufficient justification by the requestor. It can be very easy to destroy a UNIX box with Netbackup, particularly if you are restoring files in / or /usr. If you are at all unsure about what is being requested by the user, then refuse to complete the request. In the past many unnecessary restore requests have been executed because the customer was not completely sure what was required to fix the problem.

In short do not be afraid to ask why they require restores.

Wherever possible, the file or files should be restored to an alternative path. If the user wishes to restore to the original directory, check whether you need to overwrite data; a restore with overwrite not allowed is much safer and should be considered the normal method for performing restores. Always impress on the requestor the importance of performing restores safely, even if it means the user has to do some work.

12.1.2 NetBackup

NetBackup backs up data to cartridge. Recovering this data is straightforward as long as the correct information is used to initiate the restore process.

It must be remembered that a file name in UNIX does not always refer to a real file, as it could be a hard or soft link to another real file. If NetBackup is asked to back up a soft link it will not follow it! So to ensure that the data is backed up, the target of the link must be specified. This also holds true when attempting to restore the file. For hard links the situation is slightly different, but as hard links are not commonly used, details are not included in this document.

If you attempt to restore a soft link then no data will be restored. Therefore providing the name of the soft link as the file to be restored is worthless.

12.2 Restore information

There are some pieces of information that must be provided to allow a restore to be carried out. These are as follows:

1. The host name of the box from which the data was backed up.

2. The operating system of the box. Is it a UNIX or an NT client?

3. The date and time, or dates and times between which, the backup was taken. The shorter the time between the start and end the quicker the search through the NetBackup database will be, and hence the restore will take less time.

4. Should the data be placed in an alternate location, or can the data be overwritten? (Insist the requestor specifies what is required).

5. Should the file, files or directories be renamed when restored?

6. The host name of the target box, if different to the source box.

Cross client restores are only possible if both the target and source box are connected to the same NetBackup server.

Cross client restores must be enabled within NetBackup.

7. The fully qualified names of the files or directories to be restored.

8. A contact number for when the work is complete.

9. Acquiring a timeline:

In order for Bridge Operations to manage high priority restores (P1s or P2s) effectively, ensure we have a documented timeline provided by the requester so that we can checkpoint our progress. The onus is on Bridge Operations to obtain this information from the requester before we take on responsibility for the restore. Ensure this information is documented in the Clarify case. This will give ourselves and the MIT team some visibility of expected progress and will allow us to make a considered judgement on further escalation. This timeline can only be used for very rough guidance. It will vary depending on the size of the database and whether locally attached drives are used. If locally attached drives are used then the restore should take no longer than the backup. On a shared server it should be estimated at 20% longer than the backup.
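The rough guidance above amounts to a one-line calculation. The sketch below is illustrative only: the function name estimate_restore_minutes is hypothetical, it works in whole minutes, and the result should be treated as the same very rough guide as the rule it encodes.

```shell
# Hypothetical helper: rough restore time estimate from the backup time.
#   $1 = backup duration in whole minutes
#   $2 = "local" (locally attached drives) or "shared" (shared server)
# local:  no longer than the backup; shared: 20% longer than the backup.
estimate_restore_minutes() {
    case "$2" in
        local)  echo "$1" ;;
        shared) echo $(( $1 * 120 / 100 )) ;;
        *) echo "drive type must be local or shared" >&2; return 1 ;;
    esac
}
```

For a backup that took 100 minutes, the estimate is 100 minutes with locally attached drives and 120 minutes on a shared server.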

12.3 Media related items

Identifying the media required and its location

The quickest and easiest way to identify the media required is to run the restore request and look in the log file.

An alternative is to use the bpimagelist command on the master server.

tpadsm01 $ bpimagelist -?

bpimagelist: unrecognized option -?

USAGE: bpimagelist [-media] [-l|-L|-U|-idonly]

[-d mm/dd/yyyy hh:mm:ss] [-e mm/dd/yyyy hh:mm:ss] [-hoursago hours]

[-keyword keyword phrase]

[-client client_name] [-server server_name]

[-backupid backup_id] [-option option_name]

[-class class_name] [-ct class_type]

[-rl retention_level]

[-sl sched_label] [-st sched_type]

[-M master_server...] [-v]

The class name, client name, start and end dates (and possibly times) are the minimum requirements to get a comprehensive media list. If you can specify more information the list will be more accurate.

bpimagelist -media -d 07/01/2002 -e 07/05/2002 -client tpedm01-fe -class ORACLE

This will provide a list of all the cartridges used for ORACLE class backups between the dates specified for the TPEDM01 client.

12.4 Restore processes

If a restore is required to resolve a service affecting problem with a production system then a problem record should be raised and can be used to initiate and document the restore.

Note: All restores run by Operations use the overwrite = no option by default. This ensures that customer data is not inadvertently overwritten. Therefore, for any restore request where the data is to be restored to its original location, the files affected will have to be deleted before the restore can complete.

If a restore is required as part of some testing or development work then a change record should be raised. This will allow the implementors to plan their work effectively.

The record will be used to track the progress of the restore, to deal with any unforeseen problems, i.e. cartridges not available, and request the appropriate access. Restores must be run from an account that has the correct access level to the files and directories, usually the root account is used.

If it is for a number of files or directories then a list file should be created. This list file should contain each fully qualified file name, one on each line. This file can then be used as an input file for the restore and save a lot of work for the Operations team.

Warning: The list file must not contain any blank lines or extra blank spaces at the end of the file names. If it does the restore process will fail with an invalid line length message and an Exit Status 144 error code.
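A list file can be checked for these problems before it is handed to the restore. This is a minimal sketch: the function name bad_list_lines is hypothetical, and it checks only for the two conditions named in the warning (blank lines, and names ending in a space or tab).

```shell
# Hypothetical helper: read a restore list file on stdin and print the
# line number of every blank line and every line ending in a space or tab.
# No output means neither problem was found.
bad_list_lines() {
    awk '/^[ \t]*$/ || /[ \t]$/ { print NR }'
}
```

It would be run as bad_list_lines < listfile, where listfile is whatever name the list file has been given.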

If the files need to be renamed individually then a rename list file should be created. This file must contain the following syntax:

change original_fully_qualified_file_name to new_fully_qualified_file_name

One line for each rename must be specified; no wildcards or substitution can be done.

Warning: The same restrictions apply to this file as to the list file.
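Lines for the rename list file can be generated rather than typed, which avoids stray spaces. The sketch below is illustrative only: the function name rename_line is hypothetical, and its checks (both names fully qualified, no wildcard characters) are a simple reading of the rules above rather than a complete validation.

```shell
# Hypothetical helper: print one rename-file line in the syntax shown above.
#   $1 = original fully qualified file name
#   $2 = new fully qualified file name
rename_line() {
    case "$1$2" in
        *'*'*|*'?'*) echo "wildcards are not allowed" >&2; return 1 ;;
    esac
    case "$1" in /*) ;; *) echo "original name must be fully qualified" >&2; return 1 ;; esac
    case "$2" in /*) ;; *) echo "new name must be fully qualified" >&2; return 1 ;; esac
    printf 'change %s to %s\n' "$1" "$2"
}
```

One call is made per file to be renamed, with the output appended to the rename list file.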

The dates between which the backups were done must be supplied, if possible. By specifying a specific date range the search time through the NetBackup database will be reduced.

12.4.1 Preparation for Restore

Before invoking the bp utility, ask yourself the following questions.

Do I need root to perform the restore?

Am I sure that the host name is correct for the restore (particularly important for E10k boxes)?

Do I need to be on the client to do the restore or can I do it from the server?

Is the client in a HA configuration?

Is the restore request for a raw partition or a filesystem file?

Has the user requested a restore to an alternative path, and if so what do you need to do about links?

Once you are sure you know the answer to each of the above, then proceed.

12.4.2 Performing the Restore

Logon to the correct Master backup server. This assumes you are doing a Server directed restore.

12.4.3 Logon to Root

NetBackup uses standard UNIX file permissions to control access to files so unless you have read/write permissions to the file you will need to use root. Remember that you may not even see the file in NetBackup if you do not have the correct permissions. Using root will ensure that you can always see the client.

12.4.4 Invoke bp

Use /usr/openv/netbackup/bin/bp

At this point you select the restore menu, and then you will be presented with options to restore from different kinds of backups. If you are restoring from normal filesystem backups, then select restore from backups. If you are restoring from raw partitions then select restore from raw. If you are restoring NT then select restore ms-windows.

Primary Options Menu. Initiating a restore

At this screen you will be able to make all the selections needed before actually initiating the restore.

You can check all the options are correct before you initiate the restore.

Source client/Destination Client: both these entries must be the same; if not, you may restore a file from one machine to another. Make sure that the client name is as known to NetBackup, with all relevant suffixes (e.g. -fe).

Date Range: ensure that you have the correct date range and that it is as narrow as possible. When Netbackup searches its database it could be uncompressing large amounts of data about backups. Obviously the wider the search the more time it will take to do this. You also run the risk of filling the /usr/openv disk partition.

File Path: specify the file path you require to search on.

Specify Alternative Path: use this if you wish to restore to an alternative directory, care must be used with this option: - remember that the restore from prompt requires you to enter the pattern that will be used to match for restores. It must be the directory you wish to restore or at a lower directory path level than all the different paths you wish to restore. E.g. If you wished to restore /usr/openv/patches, and /usr/openv/version. The restore from prompt may default to /usr/openv/patches/, this would mean that you could end up restoring patches to a different directory, but version would be restored to the original location. The correct location for the restore from would be /usr/openv. Also when you specify an alternative path, the restore will create a path from that level on, so a restore using restore from as /usr/openv and restore to as /tmp will create /tmp/patches directory.

Directory Depth: set your directory depth for searching; this defaults to 1. If the depth is too shallow it may seem that the file is not backed up, because only files and directories down to that level are shown. Setting the level to zero means you will see everything down from the path, but be careful with its use, as it can give you masses of data to look at.

When you are selecting files and directories you are searching the NetBackup database, so this may take some time. You can also drill through directories using zoom in and zoom out, which is useful if you are searching for a file and the user is unsure where it was. If you need to, you can build up a restore job: for instance, you may select some files for restore, change your path and select some more. Use edit/view to see your selections before you initiate a restore. When initiating a restore, read the prompts carefully and decide where you wish to place the progress log. The output to this log can be significant in size if you are changing the paths for a number of files.

12.4.5 TAG: 7054095 Running a restore

The progress of the restore must be tightly monitored in order to enable successful completion in the minimum time possible. Be aware of the timelines specified.

1) Firstly, has the job actually started or is it queued? If the job is queued, Section 12.6 (Reduction of NetBackup drive usage to allow a restore to run) shows you how to ensure sufficient resources are made available to allow the restore to commence.

2) Once the restore has been initiated start monitoring the process log. The process logs are in the format of bplog.rest.xxx and unless you specified differently whilst setting up the restore are created in the root directory. e.g. tail -f /bplog.rest.001

3) Ensure the tapes requested are successfully mounted. If the log states the tapes are not in the library, or the restore job appears to hang, check the physical location of the tapes using the relevant dasadmin or mtlib commands as documented in the respective Robotic Library sections. For tapes not in the library, inform the relevant Hardware site and ask them to locate the requested tapes and insert them back into the library. Once the tapes have arrived back onsite and been placed in the library, ensure the tapes are marked as onsite. Note that in some circumstances when vaulting has run, the required tapes will be physically on drives but marked as not available to the robot, which will cause the restore to hang. In these circumstances the status of the tape will need to be changed back and any pending requests either resubmitted or denied as appropriate. To check for this, run vmoprcmd on the relevant master/media server and check the output for any Pending Requests. Running vmoprcmd -resubmit < request id > will allow the restore to continue.

4) Under NetBackup 4.5 you can check up on the restore using the activity monitor. This will show you the throughput in KB/s, the number of files restored and the percentage complete. One other way to check up on what is happening is via the URL .

5) The bp process created by the restore on the server can be checked to ensure that its CPU time is incrementing.

6) It is also worthwhile checking that the files requested to be restored are actually being restored by cd'ing to the relevant directory on the client and using the ls command to ensure progress is being made.
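When several restores have been run, it is easy to tail the wrong progress log in step 2. A minimal sketch, assuming the logs keep the bplog.rest.xxx naming described above; latest_restore_log is a hypothetical helper name, not a NetBackup command:

```shell
# latest_restore_log [DIR]
# Prints the most recently modified bplog.rest.* file in DIR.
# DIR defaults to / because, as noted above, that is where the progress
# log is created unless another location was chosen for the restore.
latest_restore_log() (
    dir=${1:-/}
    # ls -t sorts newest first; errors are hidden when no log exists yet.
    ls -t "${dir%/}"/bplog.rest.* 2>/dev/null | head -n 1
)
```

It could then be used as tail -f "$(latest_restore_log)" once the restore has been initiated.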

When a restore is running there are some points to bear in mind:-

When you are looking at the restore job it may take a while before a restore kicks in. This is because the restore is searching a large NB client database for the files; this may take up to 20 minutes. On heavily used servers, the restore may time out. If so just reissue the restore request.

If the file is in a large system single stream backup you may notice it takes a long time to restore relatively small files, after the restore is positioned on the tape. This is because the restore has to read through the backup to find the files on the tape.

You may see waiting for mount of tape. This could be because the drives are in use for backups, or because the tape is offsite; you can check for an offsite tape by looking at the requested tape in Media Manager.

Alternatively you may wish to use the panels

Issue bpadm

NetBackup Server: tpcds1

NetBackup Administration

------------------------

s) Storage Unit Management...

c) Class Management...

g) Global Configuration...

r) Reports...

m) Manual Backups...

x) Special Actions...

u) User Backup/Restore...

e) Media Management...

h) Help

q) Quit

ENTER CHOICE:

Select u

Master Server: tpcds1

Client: tpcds1

Main Menu

---- ----

b) Backup...

r) Restore...

h) Help

q) Quit

ENTER CHOICE:

Select r>

Master Server: tpcds1

Client: tpcds1

Restore Menu

------------

b) Restore Files and Directories from Backups...

a) Restore Files and Directories from Archives...

r) Restore From Raw Partition Backups...

f) Restore From Auspex FastBack Backups...

d) Restore From True Image Backups...

o) Restore From Oracle DB Backups...

i) Restore From Informix DB Backups...

s) Restore From Sybase DB Backups...

t) Restore From SQL-BackTrack DB Backups...

p) Restore From SAP DB Backups...

2) Restore From DB2 DB Backups...

m) Change Master Server...

h) Help

q) Quit Menu

ENTER CHOICE:

Select b>

Path: /usr/users/storage/

Start Date: 11/29/98 22:11:39 Master Server: tpcds1

End Date: 12/01/99 23:59:59 Source Client: tpcds1

Files Selected: 0 Destination Client: tpcds1

Directory Depth: 1 level Class Type: Standard

Display Mode: Brief Keyword Phrase:

Restore Backups

---------------

s) Select Files and Directories... p) Change Path...

e) Edit/View Selected Files... d) Change Date Range...

i) Initiate Restore c) Change Directory Depth...

x) Change Display Mode to Verbose m) Change Master Server...

l) List Backup Images... b) Change Source Client...

a) Specify Alternate Path... t) Change Destination Client

q) Quit Menu y) Change Class Type

h) Help k) Change Keyword Phrase

ENTER CHOICE:

Then p> to change path

d> to change date range

c> directory depth 0 gives the most information

s> to select your files/directories

When you are satisfied with the above

Enter i> to initiate restore

TAG: 7054096 Finishing A Restore

Send the restore log to the requestor, and ask them to verify the restore.

12.4.6 TAG: 7054097 Problems Locating a File.

If you cannot find the file you want to restore consider the following.

Is the source and destination client name correct?

Is the file in a directory which is a linked directory name?

Are the date ranges too narrow?

Is the path name right?

Is the client/server a HA configuration?

Is the restore a raw partition?

Have the directory level settings been too high or too low?

Do I need to be root?

Finally, it may be that the file was not backed up; this should be checked against the policy to see whether it covers the requested files/directories.

12.5 TAG: 7054100 Failover Restores

A failover restore is needed when a Media Server is unreachable and your restore job is trying to use its tape drives. The bpmedia -movedb command is then required to move the NetBackup catalog entries for the media holding the files you are restoring to an alternative Media Server.

In the case that gave rise to this procedure, the restore attempts failed because NetBackup was trying to mount the relevant tape into the nfmpramb drive, which was inoperable at the time. The restore request itself was resolved by copying an identical file across from the nfmprama server, but the need to provide a means of managing this problem in future remained.

Following guidance re. failover restores in the manuals, C/R 5400777 was raised to update the byadsm01 bp.conf in the hopes that this would enable the automatic switch to alternative drive(s) specified in the bp.conf file, in the event of similar restore failures. Unfortunately, the restore tests which followed failed and case 140017461 has been raised with Veritas to look into the causes.

In the meantime, however, the following command has been successfully tested and can be used with root access once the specific tape required is known:

"/usr/openv/netbackup/bin/admincmd/bpmedia -movedb -ev media_id -newserver new_server -oldserver old_server"

In this example the command would read:

"/usr/openv/netbackup/bin/admincmd/bpmedia -movedb -ev BBL007 -newserver nfmprama -oldserver nfmpramb"

Details of the tape(s) required will always be shown in the relevant bplog.rest.00n log (usually in the root directory) from the failed restore attempt.

Once the restore has been completed successfully, it would be advisable to run the same command in reverse to avoid confusion at a future date should further restores be required using the same tape. Again in this context, the command would read:

"/usr/openv/netbackup/bin/admincmd/bpmedia -movedb -ev BBL007 -newserver nfmpramb -oldserver nfmprama"

As this method only provides a very specific solution and needs to be keyed in, efforts will be continued to progress the case with Veritas, in the hopes that we will be able to configure automatic restore failover functionality.
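Because the command has to be keyed in twice with the server names swapped, it is easy to get the order wrong. The sketch below only prints the two command lines for review and does not run bpmedia; movedb_cmd and failover_plan are hypothetical helper names, and the bpmedia path and options are those quoted in this section.

```shell
# Path as quoted in this section.
BPMEDIA=/usr/openv/netbackup/bin/admincmd/bpmedia

# movedb_cmd MEDIA_ID NEW_SERVER OLD_SERVER
# Prints (does not execute) the bpmedia command line.
movedb_cmd() {
    printf '%s -movedb -ev %s -newserver %s -oldserver %s\n' \
        "$BPMEDIA" "$1" "$2" "$3"
}

# failover_plan MEDIA_ID FAILED_SERVER ALTERNATE_SERVER
# Prints the command to run before the restore and the reverse
# command to run once the restore has completed.
failover_plan() {
    echo "# before the restore:"
    movedb_cmd "$1" "$3" "$2"
    echo "# after the restore has completed:"
    movedb_cmd "$1" "$2" "$3"
}
```

For the example in this section, failover_plan BBL007 nfmpramb nfmprama prints the two commands shown above, in the order they should be run.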

TAG: 7054102 Overview

Bplist is a command supplied by Veritas to interrogate the NetBackup backup images database. It is an extremely useful NetBackup function as it allows users the ability to check the existence of backups, without having to use the restore panels. This command can be easily incorporated into scripts that can be run on a regular basis. There are a large number of parameters that are associated with bplist; these are documented in the NetBackup systems administration guide.

12.5.1 TAG: 7054359 Functional description.

The bplist binary will search the NetBackup master server for backups; by default the command will search using client name and master name specified in your bp.conf file. The bp.conf file resides in /usr/openv/netbackup or you can specify your own bp.conf file in your home directory that overrides the global bp.conf.

The NetBackup database of backups is really a catalogue, comprising a series of directories and flat files; on some of the high volume servers the image information will be compressed. When a bplist command is executed, the range of dates specified should be considered carefully, as you could end up searching the entire set of backups for a server.

Using various parameters you will be able to get listings similar to the Unix command ls, and where there are multiple occurrences of a file in a listing, this will indicate that there are multiple backups.

12.5.2 TAG: 7054104 Using Bplist

The command /usr/openv/netbackup/bin/bplist can be used to find out if a file or directory was backed up on a specific date, e.g.

bplist -s mm/dd/yy -e mm/dd/yy fully_qualified_file_name

This command would have to be run on the client box, using the root account or one that has access to the files or directories being queried.

The bplist command has global execute permissions, but it is important to realise that NetBackup security is based upon Unix file permissions. If you do not have permission to restore a file, then you will be unable to perform a bplist for the file; you will simply be returned to the command line.

The keyword function allows backups to be associated with a keyword, which is indexed and should speed up query times. It can be specified when the bpbackup command is issued, or by the NetBackup administrator when a scheduled backup is defined.

Before listing backups, confirm whether the file is filesystem, or raw partition.

Please note that by limiting the date range the database search time will be reduced.

It is also possible to do recursive searches through a directory tree structure. Use the bplist -? command to get a full list of the available options.

12.5.3 TAG: 7054105 Bplisting Filesystem Backups

The following examples show the command being issued from the client to find files in a directory called /usr/home.

Listing filesystem file backups.

bplist -R -l -s mm/dd/yy /usr/home

The example above searches for filesystem backups, recursively listing all files and directories from /usr/home downwards. The -R option tells bplist to recurse through the directories below /usr/home; giving it a number limits the depth of the search, e.g. -R 2 will show 2 directory levels.

The -s option tells bplist to start searching from the specified date onwards; there is also an -e option for the end date, and you can specify the hours and minutes (see the manual). Using an end date will reduce the search time. The -b option displays the backup date and time of each file.

bplist -R -b -l -s mm/dd/yy /usr/home

Will give a listing where the date and time of the backups are listed.
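As noted in the overview, bplist lends itself to scripting. The following is a minimal sketch of a check that could be run on a regular basis, assuming bplist prints nothing (or exits non-zero) when no backup is found in the date range; backed_up is a hypothetical helper name, and the BPLIST variable is only there so the path can be overridden.

```shell
# Path to bplist as given in this section; override BPLIST if it differs.
BPLIST=${BPLIST:-/usr/openv/netbackup/bin/bplist}

# backed_up START END PATH
#   START, END  date range in mm/dd/yy form; keep it narrow (see above)
#   PATH        fully qualified file or directory name
# Prints the listing and returns 0 if bplist reports at least one entry,
# otherwise prints nothing and returns 1.
backed_up() (
    listing=$("$BPLIST" -l -b -s "$1" -e "$2" "$3" 2>/dev/null) || exit 1
    [ -n "$listing" ] || exit 1
    printf '%s\n' "$listing"
)
```

A calling script might then use: backed_up 05/01/07 05/02/07 /usr/home/somefile || echo "no backup found". The dates and file name here are placeholders.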

12.5.4 TAG: 7054106 Bplisting Raw Partition backups.

To list a raw partition backup the -r parameter must be specified, this will tell NetBackup to look for raw partition backups only.

A typical bplist command for a raw partition might be

bplist -r -s mm/dd/yy /raw/volume/name

Keyword Function

If backups are performed using the keyword within a backup class, it is even quicker to identify backups. Backups performed using the standard orahot and oracold scripts use this and a backup can be identified using the keyword.

E.g.

bplist -R -keyword "dba_*" -l /

12.5.5 TAG: 7054107 HA Environments

If you are running bplist in an HA set-up, be aware of how the client was backed up: either by the shared IP address or by the physical IP address. Once you know this, ensure that you are using the right client name, which is supplied with the -C client-name option.

Problems

If you do not get a result but you think you should, please do not hesitate to contact NetBackup support.

12.5.6 TAG: 7054368 Command line

The command line restore command, bprestore, is used when backed up or archived files or directories are to be restored. If a directory is specified all files backed up will be restored.

Caution: Note that by default a restore will probably use the MPRN unless the client name is the same as that used by NetBackup for the backup.

Warning: You will need to include the -t 13 parameter if the client is an NT box. The default type is standard and suitable for UNIX clients only.

For example the syntax of the bprestore command to restore a list of files and rename them is as follows:

bprestore -K -L log_file_name -R rename_file -C clientname -s mm/dd/yy -e mm/dd/yy -f listfile

Where -K means that existing files will not be overwritten.

-L is the location of the log file (these can be large and must be managed).

-R is the file listing the renames to be done.

-C is the client server name. Please ensure you specify the same name that NetBackup uses to backup the data, e.g. dyfin04-fe instead of dyfin04.

-s is the start date.

-e is the end date.

-f is the list of files to be restored.
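The rename file passed with -R is a plain text file. As a minimal sketch, assuming each entry takes the form "change backup_filepath to restore_filepath" with both paths starting with / (this should be checked against the administration guide for the installed release), a helper such as the hypothetical add_rename below could be used to build it:

```shell
# add_rename RENAME_FILE BACKUP_PATH RESTORE_PATH
# Appends one rename entry to RENAME_FILE.
# Assumed entry format: change <backup path> to <restore path>
add_rename() {
    case $2 in
        /*) ;;
        *) echo "backup path must start with /" >&2; return 1 ;;
    esac
    case $3 in
        /*) ;;
        *) echo "restore path must start with /" >&2; return 1 ;;
    esac
    printf 'change %s to %s\n' "$2" "$3" >> "$1"
}
```

For example, add_rename /tmp/rename_file /usr/openv /tmp would add an entry to restore everything backed up under /usr/openv beneath /tmp instead; the file name /tmp/rename_file is a placeholder.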

Another example of how to restore a specific file back to its original location and over-write any existing file is as follows:

bprestore -L log_file_name -C clientname -s mm/dd/yy -e mm/dd/yy fully_qualified_file_name

Other options and parameters can be specified. The complete syntax is:

/usr/openv/netbackup/bin/bprestore [-A | -B] [-K] [-l | -H | -y][-r] [-T] [-L progress_log] [-R rename_file] [-C client] [-D client]

[-S master_server] [-t class_type] [-c class] [-s mm/dd/yy

[hh:mm:ss]] [-e mm/dd/yy [hh:mm:ss]] [-w [hh:mm:ss]] [-k

"keyword_phrase"] -f listfile | filenames

12.6 TAG: 7054369 Reduction of NetBackup drive usage to allow a restore to run

12.6.1 TAG: 7054370 Introduction

The NetBackup infrastructure is in constant use. This means that the resources required for a high priority restore might all be allocated and not immediately available. The following document outlines the procedure to follow to ensure that sufficient resources are made available in a controlled manner. This will cause the least impact on other customers as their backups will not be cancelled but merely queued for the duration of the restore.

12.6.2 TAG: 7054371 Identification

When a restore is submitted it has a higher priority than a backup job. It will therefore take the first available drive and start running but due to the way that NetBackup runs it will try to keep a cartridge loaded for as long as there are backup jobs that can write to it. This is even in preference to other backups that may have been queued for a long time. So a restore may be queued because backup jobs are getting preference as it makes more efficient use of the cartridge drives.

If there are no drives available then it is advisable to try and free one by reducing the number of jobs running.

12.6.3 TAG: 7054372 Action

Reducing the number of running jobs can be achieved by the following method.

From the NetBackup administration GUI, open up the POLICIES icon, select the ORACLE job class, make a note of the Max Jobs/class parameter and reduce it to 4. This will have the effect of limiting the ORACLE backup to one drive only. Carry out a similar process for the ORACLE_REDO policy.

Note: It may be necessary to reduce the maximum number of jobs even further if there are limited cartridge drives available. If the restore is of a high priority and it is deemed necessary, kill off any system/non-urgent backups from the Activity Monitor.

This will gradually reduce the number of active jobs and release some drives.

Once the restore has completed you must re-instate the Max Jobs/class parameters to their original values.

12.7 TAG: 7054373 Additional information

12.7.1 TAG: 7054374 Raw volumes

Raw volume restores are a bit more complicated. In general, specific files cannot be restored, only the complete raw volume; for this reason such restores are rarely done. If a restore is required then CWBACKUP should be consulted.

13 TAG: 7054110 Backups

It may be necessary on occasions to run test backups or manual backups as part of problem resolution. Please follow the actions listed below.

Using the Netbackup GUI -

On the main NetBackup screen you will see a list of policies. Search for small_test and highlight it, then pull down the Actions menu and select manual backup, then choose the schedule and the specific client name.

Check on Job Monitor that the job becomes active.

If you receive an error when using the GUI

Use bpadm (VT100)

#bpadm

select m> manual backups

select b> browse classes forward (keep going until small_test comes up)

then enter

i> initiate backup

14 TAG: 7054112 NetBackup Logmon Error Messages

Please note: with status codes 49, 50, 51, 56, 57, 58, 59, 76, 84, 95, 164 and 205, the Omnibus Threshold Rule provides an extra filter. Such traps are suppressed until over 20 traps have been issued in a twenty-minute period, at which point a trap is then sent. Please see section 14.1 below for fuller details.

The error codes are explained in more depth in section 14.

14.1 TAG: 7054114 NetBackup Processes and Procmon Error Messages

There are six processes that need to be active on NetBackup servers, and these consist of two NetBackup processes and four Media Manager processes as follows:

NetBackup :-

bpdbm ; bprd

Media Manager :-

ltid ; avrd ; vmd ; tlmd
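A quick manual check for the six processes can be sketched as follows. This is an illustration only and not how Procmon itself works; missing_daemons is a hypothetical helper that reads process names on standard input, so it makes no assumption about which ps options a given server supports.

```shell
# The six processes listed above.
REQUIRED="bpdbm bprd ltid avrd vmd tlmd"

# missing_daemons
# Reads process names (one per line, with or without a leading path)
# on stdin and prints each required process that is not present.
missing_daemons() (
    running=$(sed 's|.*/||')      # keep only the base name of each entry
    for p in $REQUIRED; do
        # -x matches the whole line, so "vmd" does not match "xvmd"
        printf '%s\n' "$running" | grep -qx "$p" || printf '%s\n' "$p"
    done
)
```

On a server where ps accepts these options it could be fed with: ps -e -o comm= | missing_daemons. No output means all six processes were found.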