Cluster in Detail
Copyright 2000-2006 by George Chiesa and dotNSF, Inc - ALL RIGHTS RESERVED. It is kindly requested that this presentation is NOT publicly posted, see "license" slide.
Clustering: For Geeks...
& for Normal People Too!
George Chiesa <[email protected]>
Daniel Nashed <[email protected]>
[Title-slide diagrams: standard REPLICA (Push/Pull) and CLREPL (Push only) between DATABASE (VIEW, DATA) boxes, each with SERVER UPDATE]
This Presentation was not researched
nor conceived at the British Library
This was not conceived at BL.uk
This is bubble-bath-ware!
License: You have a limited license to this presentation.
Copyright 2000-2006 dotNSF and its suppliers. This presentation is non-exclusively LICENSED to you for internal usage within your own entity, company or organization.
For fair-usage purposes, please quote the source as "Bubble-Bath Ideas presentation at DNUG 2006, by G. Chiesa and D. Nashed"
We request this presentation NOT to be publicly reposted, please !
Public abstracts will be posted at http://dotNSF.com & http://nashcom.de
Disclaimers: NO Proofs...
This presentation is based upon empirical info
Observed behaviours, features, bugs, beyond...
I can NOT prove many of the hypotheses here
Please accept these pearls of wisdom "as is"
Some of this information may be obsolete soon
but it's useful to know what the state of art is
We ALWAYS report security issues to IBM in private.
and no, we will not discuss security bugs (all fixed:-)
Ok, just one hack from a red book
where I wrote something in...
Download and get this redbook:
SG24-7017, Lotus Security Handbook (2004)
Hint: firefox's "modify header" plugin extension (free)
If you are using Reverse Proxies:
What is "Clustering for Geeks"
Clustering 101 (definitions/vocabulary)
"Clustering for Geeks" is the art of
using documented functionality
and "stable observed behaviours"
to "automagically" provide a better and cheaper servICE (not serVER)
In some cases,
thinking quite outside of the box
pushing the product to the limits !
The 50/50 rule/s:
50% of what you KNOW about clusters...
is quite useless !
50% of what you don't know about clusters
is quite useful !!!
Value Proposition 50%+50%=100%
50% of DDTs (Don't Do That!s)
And 50% of DO this !
What we're covering today
60' version of a much longer workshop...
What is called "1352 Native Clustering"
Which pieces are client/server based
How each major piece works "per se"
How to make the puzzle work for you
About questions...
IT IS "OK"(not impolite)
To interrupt...
to ASK questions...
'ala' easyjet...
"within reason" :-)
We reserve the right to postpone the answers, but, when in doubt, raise hand!
100% of what you do not understand can, and WILL probably hurt you!
Once upon a time... last millennium...
The STATE of the ART in 1995...
was THIN ethernet (ethernet 10 as in 10Mb)
if you were an IBM SHOP, you had TR/4/16
Each adaptor had one and only one address
And in 1995 LOTUS was already shipping
Clustering and Failover embedded in Notes 4.01
(at the time called NPN=Notes Public Networks)
So a LOT within Notes has a strong LEGACY.
So, we're going to provoke your brain to think!
Server Configured in 1995...
This is the MOST controversial!
If I were you I would use...
JUST ONE TCPIP NOTES PORT
You can still have as many addresses
You can still listen to 0.0.0.0 in notes.ini
You can still have complex tcpip routing tables
YOU DO NOT NEED THE EXTRA LOGIC
of Notes trying to cope with Ethernet 10
and just one IP address per physical card.
K.I.S.S. (at the Notes/Domino Layer!!!)
Stay awake, more controversy to come...
Listen...(Bonus HACK): ( 42 443 )
This time the answer is not 42 ;-) but instead: 443!
You can specify what you are "listening to"
You must understand netstat -an | find "LISTEN"
If you bind addresses you will listen on just those, BUT
You CAN specify "0.0.0.0" as a specific address!
You can use this to listen on all addresses at a port
Example: You can set a Notes server to
also listen for NRPC on port 443 on 0.0.0.0
this is a useful hack when you are behind a proxy
and want to access your home server
and the proxy only allows access to ports 80 and 443
port 443 proxies use the transparent "connect method"
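To make the netstat check above concrete, here is a minimal sketch that pulls the listening TCP sockets out of `netstat -an` output. It assumes the four-column Windows layout (Proto, Local Address, Foreign Address, State); the function name and the sample addresses are made up for illustration.

```python
def listening_sockets(netstat_output):
    """Return (address, port) pairs for TCP sockets in the LISTENING state.

    Assumes Windows-style `netstat -an` lines with four columns:
    Proto, Local Address, Foreign Address, State.
    """
    sockets = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[0].upper() != "TCP":
            continue
        if fields[3].upper() != "LISTENING":
            continue
        # Split on the last colon so "[::]:1352" style addresses also work.
        address, _, port = fields[1].rpartition(":")
        if port.isdigit():
            sockets.append((address, int(port)))
    return sockets


# Hypothetical sample output, not captured from a real server.
SAMPLE = """\
  TCP    0.0.0.0:443            0.0.0.0:0              LISTENING
  TCP    0.0.0.0:1352           0.0.0.0:0              LISTENING
  TCP    192.168.1.10:1352      192.168.1.77:50321     ESTABLISHED
"""

for address, port in listening_sockets(SAMPLE):
    print(f"listening on {address} port {port}")
```

If the hack works, both 1352 and 443 should show up against 0.0.0.0.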
This is how I connect to my server
When visiting customers
Using http proxies and not allowing 1352 direct.
If the customer agrees to allow me to connect to my own server while at their premises... using their proxy
PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
HACK! How does that work?
In my server's Notes.ini
PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
Voila': I can connect using the HTTP Proxy "transparent connect method" to 443
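For the curious, this is roughly what the "connect method" looks like on the wire. The sketch only builds the HTTP CONNECT request a client sends to the proxy and checks the proxy's status line; it does not open any sockets. The host name is a placeholder, and the helper names are invented for illustration.

```python
def build_connect_request(target_host, target_port=443):
    """Build the HTTP CONNECT request sent to the proxy to open a tunnel."""
    authority = f"{target_host}:{target_port}"
    request = f"CONNECT {authority} HTTP/1.1\r\nHost: {authority}\r\n\r\n"
    return request.encode("ascii")


def tunnel_established(status_line):
    """A 2xx status line from the proxy means the tunnel is open."""
    parts = status_line.split(None, 2)
    return (
        len(parts) >= 2
        and parts[0].startswith("HTTP/")
        and parts[1].startswith("2")
    )


# "domino.example.com" is a placeholder for your own server's name.
print(build_connect_request("domino.example.com").decode("ascii"))
print(tunnel_established("HTTP/1.1 200 Connection established"))
```

Once the proxy answers with a 2xx status it just relays bytes, which is why NRPC can travel through a proxy that only allows port 443.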
Cluster Aware "1352" Notes Clients:
a.k.a. Cluster-READY clients
Definition:
A Notes Client is said to be cluster-aware when it will perform custom logic to transparently and automatically fail-over from one server to another, upon server directive or LACK of reply
QUIZ:
what % of Notes Clients are CLUSTER Aware?
hint: what was the first version of Cluster Aware Notes client?
If I told you Notes 4.01 was the first one...
Cluster.NCF (client side)
Servers also use it to connect to other servers!
Time=22/12/2001 14:26:46 (80256B2A:004F5AD8)
Cluster/NotesWeb
CN=Notes2/O=Notesweb
CN=Notes1/O=Notesweb
Time=03/01/2002 16:18:24 (80256B36:0059935B)
TheConifers.com
CN=dotNSF.TheConifers.com/O=TheConifers
CN=Linux.TheConifers.com/O=TheConifers
CN=WebSphere.TheConifers.com/O=TheConifers
CN=Win2k.TheConifers.com/O=TheConifers
CN=www.TheConifers.com/O=TheConifers
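A minimal sketch of reading a file laid out like the sample above. The layout is inferred only from this slide (a `Time=` line, then the cluster name, then one member per line); it is not a documented format, so treat the parser and its function name as assumptions.

```python
def parse_cluster_ncf(text):
    """Parse Cluster.NCF-style text into a list of cluster records.

    Assumed layout (inferred from the sample, not from documentation):
    a "Time=" line starts a record, the next line is the cluster name,
    and every following line until the next "Time=" is a cluster mate.
    """
    clusters = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Time="):
            current = {"time": line[len("Time="):], "name": None, "members": []}
            clusters.append(current)
        elif current is None:
            continue  # ignore anything before the first record
        elif current["name"] is None:
            current["name"] = line
        else:
            current["members"].append(line)
    return clusters


SAMPLE = """\
Time=22/12/2001 14:26:46 (80256B2A:004F5AD8)
Cluster/NotesWeb
CN=Notes2/O=Notesweb
CN=Notes1/O=Notesweb
Time=03/01/2002 16:18:24 (80256B36:0059935B)
TheConifers.com
CN=dotNSF.TheConifers.com/O=TheConifers
CN=Linux.TheConifers.com/O=TheConifers
CN=WebSphere.TheConifers.com/O=TheConifers
CN=Win2k.TheConifers.com/O=TheConifers
CN=www.TheConifers.com/O=TheConifers
"""

for cluster in parse_cluster_ncf(SAMPLE):
    print(cluster["name"], len(cluster["members"]), "mates")
```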
Clustering
COMPLEX SET of design methodologies, techniques and heuristics
applied to "stuff"
that you can use to "make"
"n" things to be perceived as ONE bigger/better & "more reliable"
The key words of this slide are "PERCEIVED as"
NB: We're going to focus on
MultiPlatform SOFTWARE Clustering
The "i" in RAID stands for: In-Expensive
In 1987, Patterson, Gibson and Katz at the University of California Berkeley published "A Case for Redundant Arrays of Inexpensive Disks (RAID)". This paper described various types of disk arrays, referred to by the acronym RAID. The basic idea of RAID was to combine multiple small, inexpensive disk drives into an array of disk drives which yields performance exceeding that of a Single Large Expensive Drive (SLED). Additionally, this array of drives appears to the computer as a single logical storage unit or drive.
Perspective...
Cluster Examples: 3, 5 or 20+
Cluster.ncf: default max 2 mates TIMES 20 clusters (LKB 185700: Cluster_Name_Cache_Size=n in notes.ini)
Clustering & Failover in Action
Server QUIT while reading...
Cluster Mates:
"Mate" is an industry NON-PC (non politically correct!) std term
Definition:
A cluster of something is composed of mates
logically siblings among them (no master)
Domino-wise, a Cluster Mate can be:
Available (normal) (SAI > SAT)
Busy (Server_Availability_Index <= Server_Availability_Threshold)
Tip: You CAN BUSY a server by setting SAT=100
Unavailable (or unreachable/perceived as such)
Restricted (Temp=1 or Perm=2)
Invalid (never contacted)
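The states above can be sketched as a small classifier. The SAI/SAT comparison follows the slide; the order in which the checks are applied and all parameter names are assumptions made for illustration, not Domino's documented logic.

```python
def mate_state(sai, sat, reachable=True, restricted=0, contacted=True):
    """Classify a cluster mate using the states listed on the slide.

    sai: Server_Availability_Index (0..100)
    sat: Server_Availability_Threshold (0..100)
    restricted: 0 = no, 1 = temporary, 2 = permanent
    The precedence of the checks below is an assumption.
    """
    if not contacted:
        return "INVALID"
    if not reachable:
        return "UNAVAILABLE"
    if restricted in (1, 2):
        return "RESTRICTED"
    if sai <= sat:
        return "BUSY"
    return "AVAILABLE"


print(mate_state(sai=100, sat=0))    # normal case
print(mate_state(sai=100, sat=100))  # the SAT=100 tip forces BUSY
```

Since the index never exceeds 100, setting the threshold to 100 makes the comparison `SAI <= SAT` always true, which is why the tip works.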
Server Tasks involved
cladmin: server task in R5 that takes care of administrative things
(D6+ not in servertasks=, launched automatically)
cldbdir: keeps the cluster directory up to date (D6+ not in servertasks=, launched automatically)
clrepl: pushes changes to other replicas based on information from the cluster directory
(D6+ not in servertasks=, launched automatically)
logs periodically into the replication log (manual: tell clrepl log)
replica should still be active as a fallback and to init replicas!
Servers regularly check the state of their Cluster Mates
The API-level call NSPingServer gives back a list of cluster mates and their availability
You can check this information via
> show cluster
Cluster Information
Cluster name: nsh-cluster, Server name: nsh-dus-02/Srv/NashCom/DE
Server cluster probe timeout: 1 minute(s)
Server cluster probe count: 185
Server availability threshold: 0
Server availability index: 100 (state: AVAILABLE)
Cluster members (2)...
server: nsh-dus-02/Srv/NashCom/DE, availability index: 100
server: nsh-dus-01/Srv/NashCom/DE, availability: 42
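If you capture this console output in a script, a small parser is enough to get the mates and their availability. This is a sketch based only on the two member-line shapes visible above ("availability index:" and "availability:"); other Domino releases may print the lines differently.

```python
import re

# Matches both "availability index: 100" and "availability: 42".
MEMBER_LINE = re.compile(r"server:\s*([^,]+),\s*availability(?:\s+index)?:\s*(\d+)")


def cluster_members(show_cluster_output):
    """Map each cluster mate named in `show cluster` output to its availability."""
    return {
        match.group(1).strip(): int(match.group(2))
        for match in MEMBER_LINE.finditer(show_cluster_output)
    }


SAMPLE = """\
Cluster name: nsh-cluster, Server name: nsh-dus-02/Srv/NashCom/DE
Server availability threshold: 0
Server availability index: 100 (state: AVAILABLE)
Cluster members (2)...
server: nsh-dus-02/Srv/NashCom/DE, availability index: 100
server: nsh-dus-01/Srv/NashCom/DE, availability: 42
"""

for name, availability in cluster_members(SAMPLE).items():
    print(name, availability)
```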
Portfolio techniques / Sizing heuristics
There are always 2 practical limits:
Lower:
at LEAST how many you need to reduce risk
Upper:
at MOST how many you can manage effectively
Tip: Start with 3 or 4, fine tune afterwards
but please do NOT start with 2 or 6
Class of service:
by "n" instances of resource
Say, for the purpose of example, you have "3"
"whatevers": OSs, Sites, Servers, Routers, ISPs
say you name the 3 elements A, B and C
With 3 elements you can define the following
Classes of Service:
Top, simultaneously present in A+B+C
Middle, present in either: AB, AC or BC
Single, present just in A or B or C
Homework: Try the combinations for 4 units,
C(4,4) + C(4,3) + C(4,2) + C(4,1)
Nota benissimo: DO STOP AT 4 ! ! !
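The homework can be checked with a few lines. This sketch simply enumerates every non-empty combination of units, grouped by how many units hold the resource; the function name is invented for illustration.

```python
from itertools import combinations


def classes_of_service(units):
    """Group all non-empty combinations of units by size, largest first."""
    return {
        size: ["".join(combo) for combo in combinations(units, size)]
        for size in range(len(units), 0, -1)
    }


for size, placements in classes_of_service("ABC").items():
    print(size, placements)

# Homework: C(4,4) + C(4,3) + C(4,2) + C(4,1)
total_for_four = sum(len(p) for p in classes_of_service("ABCD").values())
print(total_for_four)
```

With 3 units there are 7 placements to keep track of; with 4 there are already 15, which is one practical reason to stop at 4.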
Almost Real Time Replication...
a) we need to define how we will synchronize
Bad News: Scheduled replication not good enough...
Some apps must be cluster aware enabled!
Good News:
NATIVE Event/Queue Driven = CLREPL =
(aka Almost Real Time)
Most apps will automatically work better
b) we still need to spread the load/access.
ClDbDir
It's a Notes Database, similar to the catalog, Cluster Specific (RepId depends on ClusterName)
Maintained by a server task of the same name
It's in the Enterprise Edition of Domino
Contains info about databases deployed in a cluster
Is used by Notes/Domino Cluster Aware modules
to know where to push what (and what NOT to!!!)
and for "failovers": a server finds the resource elsewhere!
Like CATALOG, each server updates its OWN dbs
BEWARE: 8192 is the maximum number of useful entries; you do NOT get a warning NOR an error message!
ClDbDir (contents)
Bonus Hack: Set Config Cluster_Admin_On=1
It also works IN NON Clustered servers!
You can afterwards do:
CL DEL filename (cluster delete)
CL COPY source dest REPLICA
CL OUT database (out of service)
CL IN database (in service again)
(both work but are only meaningful in clusters)
Useful to OUT-of-service databases BEFORE adding an OLD server to a cluster
useful for decommissioning an old server
you HAVE to add a server to get it into
the CLIENT's Cluster.NCF
From LKB: How Push-Pull (std) Replica works
[Diagram labels: DATABASE (VIEW, DATA), REPLICA, Push, Pull, Push, Pull, SERVER UPDATE, DATABASE (DATA, VIEW)]
From LKB: How Push Cluster Replica works !
[Diagram labels: DATABASE (VIEW, DATA), (replica), Push, Push, SERVER UPDATE, DATABASE (DATA, VIEW), CLREPL]
How does Cluster Replication work (details)
Document changes are captured and trigger the Cluster Replicator via a message queue
The Cluster Replicator reads the message queue and pushes changes to all other replicas in the cluster regardless of replication settings (aka almost "real time" replication)
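As a toy model of the flow just described: changes land in an in-memory queue, and a replicator drains the queue and pushes each change to every mate that holds a replica. This is purely illustrative and says nothing about how CLREPL is actually implemented; the class, method, server and database names are all invented.

```python
from collections import deque


class ClusterReplicatorSketch:
    """Toy model: an in-memory event queue drained by a push-only replicator."""

    def __init__(self, replicas_by_db):
        # database path -> list of cluster mates holding a replica
        self.replicas_by_db = replicas_by_db
        self.queue = deque()  # memory only: contents are lost on restart
        self.pushed = []      # (mate, database, note_id), kept for inspection

    def on_document_change(self, database, note_id):
        """A document change is captured and queued."""
        self.queue.append((database, note_id))

    def drain(self):
        """Read the queue and push every change to all mates with a replica."""
        while self.queue:
            database, note_id = self.queue.popleft()
            for mate in self.replicas_by_db.get(database, []):
                self.pushed.append((mate, database, note_id))
        return self.pushed


replicator = ClusterReplicatorSketch({"mail/jdoe.nsf": ["Notes1", "Notes2"]})
replicator.on_document_change("mail/jdoe.nsf", 0x8FA)
print(replicator.drain())
```

Because the queue lives in memory, anything still queued when the server stops is gone, which is why scheduled replication is still needed as a safety net.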
CLREPL
CLREPL is a server task
It's an in-memory QUEUE-driven event replicator (REMEMBER BATH TUB !)
that SHOULD push content within 15 seconds at most, 7 on average
thus ClRepl is also sometimes called RTR
or "ALMOST" REAL TIME REPLICATOR
the KEY here is in "ALMOST"
ClRepl (cont'd)
ClRepl PUSHES content modified locally to all cluster mates containing replicas of the modified database
Tips: It PUSHES ignoring source ACL
Check that the queue is not overfilled
Always schedule CLASS+1 of them
NB: CLREPL does NOT initialize "Replica Stubs"
It also knows what to push and what NOT to push; not pushed are:
Out Of Service (for quite obvious reasons) but also
Pending Delete (cldbdir does final push, not clrepl !)
ClRepl (cont'd)
ClRepl will keep an IN-memory queue
It's a QUEUE, and can be overfilled
It's in MEMORY and is NOT disk persistent
THUS, also schedule normal replicas
Tip: within reason, overscheduling pull replicas is not a huge issue, because the deltas are small
e.g. an enabled Replica document from */Srv/Whatever to <each>/Srv/Whatever, PULL, every 60 mins
Will make servers catch up fast, pulling at restart time.
TIP: SH ST REPLICA.CLUSTER.*Q*
(Daniel to explain detailed stats)
Cluster Replicator Performance & Statistics
General Rule: number of clrepl = cluster members "minus" 1
R5: servertasks=events4, repl, router, clrepl, clrepl, clrepl, ...
D6: Cluster_Replicators=n
My tip: set it to CLASS_OF_SERVICE PLUS one, not minus one; overscheduling it is cheap, underschedule it and you will have problems!
Check if clustering works properly via
Show Stat Replica.Cluster.*
Replica.Cluster.WorkQueueDepth should be "small", i.e. less than 10
Replica.Cluster.RetryWaiting should also be "small", i.e. less than 5
Replica.Cluster.Failed should be zero if possible (easy to say :-)
Check the Max and Average Times in queue, should be < 10 seconds
Show Stat Server.Cluster.*
Server.Cluster.OpenRedirects.xxx.Unsuccessful = 0
check for unsuccessful redirects!
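The rules of thumb above are easy to script. A minimal sketch, assuming you have the `Show Stat Replica.Cluster.*` output as text in "name = value" form; the thresholds are the ones from the slide, the function names and the sample values are made up.

```python
def parse_stats(show_stat_output):
    """Turn "Name = value" lines into a dict of floats, skipping other lines."""
    stats = {}
    for line in show_stat_output.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue
        try:
            stats[name.strip()] = float(value.strip().replace(",", ""))
        except ValueError:
            continue  # non-numeric statistic, not relevant here
    return stats


def cluster_replication_warnings(stats):
    """Return the names of statistics that break the slide's rules of thumb."""
    warnings = []
    if stats.get("Replica.Cluster.WorkQueueDepth", 0) >= 10:
        warnings.append("Replica.Cluster.WorkQueueDepth")
    if stats.get("Replica.Cluster.RetryWaiting", 0) >= 5:
        warnings.append("Replica.Cluster.RetryWaiting")
    if stats.get("Replica.Cluster.Failed", 0) > 0:
        warnings.append("Replica.Cluster.Failed")
    return warnings


# Hypothetical values for illustration only.
SAMPLE = """\
Replica.Cluster.WorkQueueDepth = 25
Replica.Cluster.RetryWaiting = 1
Replica.Cluster.Failed = 0
"""

print(cluster_replication_warnings(parse_stats(SAMPLE)))
```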
How to restrict access (LKB 7002910)
Domino server clusters have an optional workload balancing feature that lets you distribute the workload of heavily-used databases across multiple servers in a cluster. To distribute workload, you limit or restrict the work that a server can perform using the following settings in the NOTES.INI:
Server_Availability_Threshold
This setting allows you to specify the maximum availability level beyond which the server attempts to redirect user requests to other servers in the cluster. A server's availability index is recalculated each minute and compared against any threshold you set. If the index falls below the server threshold, the server becomes BUSY. The Cluster Manager redirects access requests from a BUSY server to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives access to the BUSY server. Each time a redirection occurs, Notes generates a workload balancing event in the Notes log (LOG.NSF).
Server_MaxUsers
This setting specifies the maximum number of user sessions allowed on a server. When the server reaches this limit, the server goes into a MAXUSERS state. The Cluster Manager then attempts to redirect new user requests to other servers in the cluster. To see how often requests are being redirected, check the LOG.NSF for failover events. If redirection of the user request is unsuccessful, the user receives a message, and is not allowed access to the server.
Server_Restricted
This setting enables a server to deny new open database requests and places the server in a RESTRICTED state. Users who have active connections to databases retain their connections. The Cluster Manager attempts to redirect new requests to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives a message and is not allowed access to the server. For each redirection attempt, Notes generates a failover event in the LOG.NSF.
Note: You can use the Server_Restricted setting for any Domino server. This setting is not restricted to clusters.
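The practical difference between the three settings is what happens when the redirect fails. A small sketch of that decision, using the state names from the text; the result codes ("LOCAL", "REDIRECTED", "DENIED") and the function name are invented for illustration.

```python
# What the user gets when the server is not AVAILABLE and the redirect fails.
FAILED_REDIRECT_OUTCOME = {
    "BUSY": "LOCAL",         # user still receives access to the busy server
    "MAXUSERS": "DENIED",    # user receives a message, no access
    "RESTRICTED": "DENIED",  # user receives a message, no access
}


def handle_open_request(state, redirect_succeeded):
    """Outcome of a new open request for a server in the given state."""
    if state == "AVAILABLE":
        return "LOCAL"
    if redirect_succeeded:
        return "REDIRECTED"
    return FAILED_REDIRECT_OUTCOME[state]


for state in ("AVAILABLE", "BUSY", "MAXUSERS", "RESTRICTED"):
    print(state, handle_open_request(state, redirect_succeeded=False))
```

In other words, BUSY is a soft limit (work still gets done if nobody else can take it), while MAXUSERS and RESTRICTED are hard limits.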
SAI examples, un/touched
You may want to smooth this (or not)
Best Practices for Cluster Replication
Ensure you have full Manager access for LocalDomainServers as a Server group or, better, */Srv/Org as Manager of type Server in all ACLs. I prefer hardcoding OUs to groups. Works always!
Make sure all applications provide roles to give access to documents with reader fields (remember computed author fields)
Give servers all rights and roles to "see" all documents
Don't use replication formulas for clustered databases
Have a scheduled replication in case some events in the clrepl queue get lost or the server is down...
Add startup replication documents "from *" to ensure databases are up to date after server restart
Schedule replication to the name of the cluster instead of single server names (load balancing & failover)
Cluster Replication & Database Quotas
There are issues with Database Quotas before R5.0.10
Good news:
New option in R5.0.10: CLREPL_OVERRIDE_QUOTAS=1
Domino 6 overrides quotas by default
you get the old behavior with Clrepl_Obeys_Quotas=1 (DDT)
Bad news:
If you already have this problem you need to delete the replication history and CutOff Date to resolve existing replication problems
LotusScript can clear the replication history
Set rep = db.ReplicationInfo , Call rep.ClearHistory() , Call rep.Save()
but it cannot remove the CutOffDate (in most cases not needed)
Notes Named Network & Directory Assistance
Customer was using Notes Named Networks (NNN) across WAN connection
Caused unintended traffic
Directory Assistance (DA)
Multiple replicas of 4 directories were used
The first server in the list was, in some cases, a remote server in the same NNN!
Changed configuration to use the local server only
All servers had replicas of all directories
One external directory had huge number of deletion stubs due to external company always reimporting the directory :-(
Changes/Recommendations
Only local servers in the same NNN
Use only local directories in (DA)
Use "*" to specify the local replica only (TN #1087708)
Evaluating Extended Directory Catalog for further optimization
Directory catalog could simplify working with external addresses and allow more flexibility
Avoid large numbers of changes in Domino directories
Less need to update views in Domino Directory
Fewer deletion stubs
Not the first time we have seen nightly complete delete/add import agents in customer environments
How to use NNNs (KISS)
One for TCPIP (and one per Cluster Port )
Other High Availability Tips
Domino 6/7 supports multiple versions on one logical UNIX/Linux box
much easier updates and coexistence of multiple releases, and allows an easy-to-handle "go back" scenario
Fault Recovery
Maximize server availability
Faster server restart after crash!
Automatically collect NSDs for faster troubleshooting
Domino Server Availability Index (SAI or AI)
Domino 6+ uses a new algorithm to calculate the workload of a server and the resulting AI
A number of customers reported unpredictable, alternating AI which caused Clustering to fail.
Algorithm was enhanced in D6.0.2CF2 and additional notes.ini parameters have been introduced.
But there is another bug that is hopefully finally fixed in D6.5.6 and D7.0.2!
We traced AI at customer site
Live Environment
Test Environment with Server.Load
LoadMon
Domino 6/7 use a module called "LoadMon"
A routine that calculates the speed of current transactions, summarizes them, and compares them with previous intervals and minimum values (RunningAvgTime & MinAvgTransTime)
unit: microseconds
OPEN_DB
OPEN_NOTE
CLOSE_DB
DB_INFO_GET
DB_REPLINFO_GET
GET_OBJECT_SIZE
READ_OBJECT
GET_SPECIAL_NOTE_ID
DB_READ_HIST
DB_WRITE_HIST
SERVER_AVAILABLE_LITE
NIF_OPEN_NOTE
Expansion Factor (XF)
XF is calculated based on the performance values of current transactions in relation to minimum time for a transaction
It's the number of times the current transactions take longer than the minimum transaction time
XF values for different transactions build an overall XF
This XF is computed and converted into AI based on a Range to scale the XF (TN #1112352)
Notes.ini Server_Transinfo_Range=n (6 by default) specifies the maximum Expansion Factor of a Domino Server. The maximum XF is calculated as 2 raised to the power n (64 by default)
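A rough sketch of the arithmetic. The XF definition (current time divided by minimum time) and the 2^n ceiling come from the slide. The logarithmic mapping from XF to AI used below is an ASSUMPTION for illustration; the slide only says the XF is scaled into an AI over the range, and the exact formula is not given here. Function names are invented.

```python
import math


def expansion_factor(current_avg_us, min_avg_us):
    """How many times longer current transactions take than the minimum time."""
    return max(1.0, current_avg_us / min_avg_us)


def availability_index(xf, transinfo_range=6):
    """Scale an expansion factor into a 0..100 availability index.

    ASSUMPTION: a log2 scale where XF = 1 gives AI = 100 and
    XF = 2**transinfo_range (64 by default) gives AI = 0.
    """
    max_xf = 2 ** transinfo_range
    clamped = min(max(xf, 1.0), max_xf)
    return round(100 * (1 - math.log2(clamped) / transinfo_range))


xf = expansion_factor(current_avg_us=240.0, min_avg_us=60.0)
print(xf, availability_index(xf))
```

Under this assumed scale a larger Server_Transinfo_Range makes the index less sensitive: the same slowdown costs fewer AI points.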
LoadMon Notes.ini Settings
SERVER_TRANSINFO_MAX (default 5 / max 60)
number of statistics collections stored in LoadMon
SERVER_TRANSINFO_UPDATE_INTERVAL (default 15)
interval for statistics capturing & calculation
SERVER_MIN_TRANS (default 5)
minimum transactions needed for a statistic value to be valid
SERVER_TRANSINFO_NORMALIZE (default 3000)
SERVER_TRANSINFO_HTTP_NORMALIZE (12000)
as far as we found out, used to initialize empty statistics (zero in loadmon.ncf) on startup in Domino 6
Debugging LoadMon
debug_loadmon=1
Enables LoadMon Debugging, writes additional information to server console
07.10.2003 07:08:09 Loadmon: Domino AI = 100, XF = 1
And adds 46 additional statistics counters (server.loadmon.*)
Can be captured locally or remotely via "show server" or statistics collection program.
nstats servername or C-API NSFGetServerStats (...)
loadmon.ncf
loadmon.ncf in the Domino data directory stores the last information from loadmon before the server is shut down
loaded on server start to initialize statistics counters
Server.LoadMon.TransInfo.AI.Type = 0
Server.LoadMon.TransInfo.CurrentTransCount.CLOSE_DB = 3
Server.LoadMon.TransInfo.CurrentTransCount.DB_INFO_GET = 2
Server.LoadMon.TransInfo.CurrentTransCount.DB_READ_HIST = 0
Server.LoadMon.TransInfo.CurrentTransCount.DB_REPLINFO_GET = 5
Server.LoadMon.TransInfo.CurrentTransCount.DB_WRITE_HIST = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_NOTE_INFO = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_OBJECT_SIZE = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_SPECIAL_NOTE_ID = 0
Server.LoadMon.TransInfo.CurrentTransCount.NIF_OPEN_NOTE = 0
Server.LoadMon.TransInfo.CurrentTransCount.OPEN_DB = 3
Server.LoadMon.TransInfo.CurrentTransCount.OPEN_NOTE = 7
Server.LoadMon.TransInfo.CurrentTransCount.READ_OBJECT = 0
Server.LoadMon.TransInfo.CurrentTransCount.SERVER_AVAILABLE_LITE = 2
Server.LoadMon.TransInfo.HttpNormalize = 12000
Server.LoadMon.TransInfo.IntervalInSeconds = 15
Server.LoadMon.TransInfo.Max = 5
Server.LoadMon.TransInfo.MinAvgTransTime.CLOSE_DB = 58.1818181818182
46 statistics found
BEWARE OF LARGE VALUES OVERFLOWING INTO NEGATIVE VALUES
Fix: quit the server, delete loadmon.ncf, restart the server (do this after upgrades!)
se co DEBUG_LOADMON=1 (se co = set config)
Server.LoadMon.TransInfo.MinAvgTransTime.DB_INFO_GET = 119.875
Server.LoadMon.TransInfo.MinAvgTransTime.DB_READ_HIST = 210.666666666667
Server.LoadMon.TransInfo.MinAvgTransTime.DB_REPLINFO_GET = 88.5714285714286
Server.LoadMon.TransInfo.MinAvgTransTime.DB_WRITE_HIST = 240.2
Server.LoadMon.TransInfo.MinAvgTransTime.GET_NOTE_INFO = 110.235087719298
Server.LoadMon.TransInfo.MinAvgTransTime.GET_OBJECT_SIZE = 141.777777777778
Server.LoadMon.TransInfo.MinAvgTransTime.GET_SPECIAL_NOTE_ID = 93.333333333
Server.LoadMon.TransInfo.MinAvgTransTime.NIF_OPEN_NOTE = 1,031.4285714286
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_DB = 429.166666666667
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_NOTE = 272.987714987715
Server.LoadMon.TransInfo.MinAvgTransTime.READ_OBJECT = 134.285714285714
Server.LoadMon.TransInfo.MinAvgTransTime.SERVER_AVAILABLE_LITE = 95.3333333
Server.LoadMon.TransInfo.MinTrans = 5
Server.LoadMon.TransInfo.Normalize = 3000
Server.LoadMon.TransInfo.Range = 15
Server.LoadMon.TransInfo.RunningAvgTime.CLOSE_DB = 214.333333333333
Server.LoadMon.TransInfo.RunningAvgTime.DB_INFO_GET = 172
Server.LoadMon.TransInfo.RunningAvgTime.DB_READ_HIST = 0
Server.LoadMon.TransInfo.RunningAvgTime.DB_REPLINFO_GET = 187
Server.LoadMon.TransInfo.RunningAvgTime.DB_WRITE_HIST = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_NOTE_INFO = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_OBJECT_SIZE = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_SPECIAL_NOTE_ID = 0
Server.LoadMon.TransInfo.RunningAvgTime.NIF_OPEN_NOTE = 0
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_DB = 4,143
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_NOTE = 738
Server.LoadMon.TransInfo.RunningAvgTime.READ_OBJECT = 0
Server.LoadMon.TransInfo.RunningAvgTime.SERVER_AVAILABLE_LITE = 104
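To read a listing like this without a calculator, a small script can help. The sketch below parses a few of the lines shown above and divides each RunningAvgTime by the matching MinAvgTransTime. Treating that ratio as the per-transaction expansion factor is our assumption based on the XF definition, not something the console output states; the helper names are hypothetical.

```python
import re

# A few lines copied from the listing above.
SAMPLE = """\
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_DB = 429.166666666667
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_NOTE = 272.987714987715
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_DB = 4,143
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_NOTE = 738
Server.LoadMon.TransInfo.RunningAvgTime.READ_OBJECT = 0
"""

STAT_LINE = re.compile(r"^\s*(Server\.LoadMon\.TransInfo\.[\w.]+)\s*=\s*(\S+)\s*$")
RUNNING = "Server.LoadMon.TransInfo.RunningAvgTime."
MINIMUM = "Server.LoadMon.TransInfo.MinAvgTransTime."


def parse_stats(text: str) -> dict:
    """Map statistic name to numeric value, skipping lines that do not match."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            # The console prints thousands separators ("4,143"); drop them.
            stats[match.group(1)] = float(match.group(2).replace(",", ""))
    return stats


def running_to_minimum_ratios(stats: dict) -> dict:
    """Ratio RunningAvgTime / MinAvgTransTime per transaction type (assumed XF)."""
    ratios = {}
    for name, running in stats.items():
        if not name.startswith(RUNNING):
            continue
        transaction = name[len(RUNNING):]
        minimum = stats.get(MINIMUM + transaction)
        # Skip transactions without a minimum or without current activity.
        if minimum and running > 0:
            ratios[transaction] = running / minimum
    return ratios


if __name__ == "__main__":
    for transaction, ratio in running_to_minimum_ratios(parse_stats(SAMPLE)).items():
        print(f"{transaction}: {ratio:.2f}")
```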
AI in D6.0.1 without Optimizing of Loadmon
Domino 6.0.1 AIX 5.1 dropping AI
What did we find out?
AI with the default interval of 15 sec and 5 sampling values does not always result in a steady AI
We needed to find values which provide
steady values, so that cluster failover does not occur "randomly" or cause ping-pong effects
a reasonable time to reflect the current workload in the AI
Standard interval and sampling (15 * 5) cover 75 seconds
An interval of 10 seconds with 20 sampling values covers 200 seconds
Standard Server.Load scripts do not help much because most transactions are not used in the standard scripts
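The trade-off above is simply interval times number of samples. The sketch below computes that window and, as an illustration only, shows how a longer rolling window dampens a single spike. It assumes LoadMon averages the stored samples, which the slides do not confirm, and all names are hypothetical.

```python
from collections import deque


def window_seconds(interval_s: int, samples: int) -> int:
    """Time span covered by `samples` collections taken every `interval_s` seconds."""
    return interval_s * samples


class RollingAverage:
    """Average over the last `samples` values (assumed smoothing model)."""

    def __init__(self, samples: int):
        self._values = deque(maxlen=samples)

    def add(self, value: float) -> float:
        self._values.append(value)
        return sum(self._values) / len(self._values)


if __name__ == "__main__":
    print(window_seconds(15, 5))   # default settings
    print(window_seconds(10, 20))  # settings suggested on the slide

    # One slow interval (XF 40) in an otherwise quiet series (XF 1).
    series = [1.0] * 19 + [40.0]
    short, long_ = RollingAverage(5), RollingAverage(20)
    for xf in series:
        short_avg, long_avg = short.add(xf), long_.add(xf)
    print(short_avg, long_avg)  # the 20-sample average moves much less
```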
Listen... (HACK 2)
You need to understand which fields are
Listens (usually in specific tabs)
HostNames that are NOT Listens
For example: you can tell Domino that its HTTP hostname
is the name of something else,
even on a different machine
URLs will be created nicely
HACK 3: How to use clustering for server consolidation
Add ALL servers to ONE CLUSTER...
Make sure no DB exists more than 3 times in the cluster
SET the SAT of the OLD servers to 100
This will BUSY them out
Users will LOAD BALANCE to the new servers
(this applies to all NON-admin / NON-manager users)
Unless you forgot an app that exists only on the old servers,
because then users will continue to access the old servers
For example, to get this
Cluster name: DOMPMAC01, Server name: DOMAGP01/SRV/Customer
Server cluster probe timeout: 1 minute(s)
Server cluster probe count: 47191
Server cluster default port: *
Server availability threshold: 100
Server availability index: 0 (state: BUSY)
Server availability default minimum transaction time: 3000
Cluster members (11):
Server: DOMPMA01/SRV/Customer, availability index: 81
Server: DOMPMA02/SRV/Customer, availability index: 78
Server: DOMPIN02/SRV/Customer, availability index: 65
Server: DOMPIN01/SRV/Customer, availability index: 63
Server: DOMMYP01/OLD/SRV/Customer, availability index: 0
Server: DOMMYP02/OLD/SRV/Customer, availability index: 0
server: DOMOEP01/SRV/Customer, availability: BUSY
server: DOMHEP01/SRV/Customer, availability: BUSY
server: DOMCVP01/SRV/Customer, availability: BUSY
server: DOMVGP01/SRV/Customer, availability: BUSY
server: DOMAGP01/SRV/Customer, availability: BUSY
Cluster information:
Cluster name: DOMPMAC01, Server name: DOMMYP01/SRV/Customer
Server cluster probe timeout: 1 minute(s)
Server cluster probe count: 62831
Server cluster default port: *
Server availability threshold: 0
Server availability index: 28 (state: AVAILABLE)
Server availability default minimum transaction time: 3000
Cluster members (11):
Server: DOMPMA02/SRV/Customer, availability index: 79 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPMA01/SRV/Customer, availability index: 78 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPIN01/SRV/Customer, availability index: 64 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPIN02/SRV/Customer, availability index: 39 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMMYP02/OLD/SRV/Customer, availability index: 0 )) SERVER_TRANSINFO_RANGE=2 & SAT=0
Server: DOMMYP01/OLDSRV/Customer, availability index: 0 )) SERVER_TRANSINFO_RANGE=2 & SAT=0
server: DOMHEP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMVGP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMCVP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMAGP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMOEP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
Fine tuning via SAI/SAT/Range
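The interplay of SAI and SAT in the listing above can be sketched as follows. This assumes a server is BUSY when its availability index falls below its threshold and that a threshold of 100 always means BUSY, which matches the states shown in the output. Ranking the remaining servers by highest index is only an illustration, not the documented client algorithm, and the index of 0 for the BUSY server is a placeholder because the listing does not print one.

```python
from dataclasses import dataclass


@dataclass
class ClusterMember:
    name: str
    availability_index: int      # SAI, 0..100
    availability_threshold: int  # SAT, 0..100


def is_busy(member: ClusterMember) -> bool:
    """Assumed rule: SAT of 100 busies the server out, otherwise BUSY when SAI < SAT."""
    if member.availability_threshold >= 100:
        return True
    return member.availability_index < member.availability_threshold


def failover_candidates(members: list) -> list:
    """Non-busy members, highest availability index first (illustrative ordering)."""
    available = [m for m in members if not is_busy(m)]
    return sorted(available, key=lambda m: m.availability_index, reverse=True)


# Values taken from the annotated listing; DOMHEP01's index is a placeholder.
MEMBERS = [
    ClusterMember("DOMPMA02/SRV/Customer", 79, 5),
    ClusterMember("DOMPIN02/SRV/Customer", 39, 5),
    ClusterMember("DOMMYP02/OLD/SRV/Customer", 0, 0),
    ClusterMember("DOMHEP01/SRV/Customer", 0, 100),
]

if __name__ == "__main__":
    for member in failover_candidates(MEMBERS):
        print(member.name, member.availability_index)
```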
And when you turn off a server...
Remember to ignore the probe failures
If they annoy you, increase the period of the probe
Server_Cluster_Probe_Timeout=1 (minute)
Dead servers do not run CLDBDIR, thus (hack!)
in the new servers' CLDBDIR, manually DELETE
ALL instances of DBs on the old servers
Failover by replica ID then uses the new servers!
CLREPL will NOT attempt to keep dead servers updated (EXTREMELY IMPORTANT!!!!!!!!)
You can keep old dead servers
in the cluster for a reasonably long time
BUT you must check the logs and
sh st replica.cluster.*q*
You can't have lost transactions...
because CLDBDIR thinks the old servers
are EMPTY but alive
CL Manager will say once a minute that they are
unreachable, which is what you want for
AUTOMATIC user failover... over time...
To finally delete the server
use AdminP !!!
Other Caveats/Tips/Tricks:
You must make sure you edit the old servers' records in the NAB to remove mail routing
You do not want mail routing to be attempted
via old dead servers
You'd better run the server decommission report BEFORE turning them off...
a machine that is turned off produces no reports
DO NOT remove the old servers from the cluster yet
Always TEST failovers
with a TEST user ID that is
NON Administrator
NON manager of apps databases
It is assumed that managers know where they want to access DBs
and will NOT attempt failover
If you test with ADMIN.id, it will drive you MAD
Cluster Analysis
Cluster Analysis is a great feature for tracking down problems in your cluster
It's part of the Admin Client (Server / Analysis / Cluster ...)
Run it to find problems with ACLs, replication, non-existent databases, ...
Tips
Run it, print it and sign off all warnings you find
Use FT Search to remove multiple occurrences of similar or already fixed problems until the DB is empty
Run the analysis again to check that you addressed all problems
Failover
Definition:
Server Initiated
due to reactive Load Balance or failures
Client Initiated
server is dead or perceived as dead
requires client to know how to connect to cluster mates without server assistance!
Tips: insert the address in the name:
CN=<FullyQualifiedDomainName>/Whatever
CN=194.196.39.11/Srv/LotusEmea/Net
DEBUG_NOSTDOUT=1
If you leave debug parameters ON in prod
capture the debug in files
debug_Outfile=
and NOT in StdOut
for performance reasons and also...
for sanity of old 3rd party apps (&BACKUPs)
DO NOT USE THIS PLEASE
DEBUG_RUN_AS_ROOT=1
it WILL allow you to run as root in UNIX/Linux
it will NOT allow you later to run as non root
unless you fix all the owners, permissions,etc
of everything it created. (just DDT please!)
Exception: some custom restores required root
GET A NEW VERSION OF THE RESTORE TOOL
Replication Debugging
DEBUG_REPL=2 & DEBUG_REPL_ALL=1
Log_Replication: (not ORable, different values, -1 does not work!)
Log_Replication=0....No replication logging
Log_Replication=1....Logs server replication events
Log_Replication=2....Adds logging of replication activity at the database level
Log_Replication=3....Adds logging of replication activity at the note level
Log_Replication=4....Adds logging of replication activity at the field level
Log_Replication=5....Adds summary logging
RTR_Logging: (Tip: You can OR (sum) these, i.e. 63 is a LOT!)
RTR_Logging= 1....Default level of logging (major routines, events, etc.)
RTR_Logging= 2....Log all context structure changes
RTR_Logging= 4....Log replications: attempted & performed
RTR_Logging= 8....Log iterations through main polling loop
RTR_Logging=16...Verbose debug logging
RTR_Logging=32...Log all lock operations
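Because the RTR_Logging values are bit flags, combining them is a plain bitwise OR. The sketch below models the six values listed above and prints the resulting notes.ini line; the enum member names are our own labels, not Domino identifiers.

```python
from enum import IntFlag


class RtrLogging(IntFlag):
    DEFAULT = 1            # major routines, events, etc.
    CONTEXT_CHANGES = 2    # all context structure changes
    REPLICATIONS = 4       # replications attempted & performed
    POLLING_LOOP = 8       # iterations through the main polling loop
    VERBOSE_DEBUG = 16     # verbose debug logging
    LOCK_OPERATIONS = 32   # all lock operations


def notes_ini_line(flags: RtrLogging) -> str:
    """Render the combined flags as a notes.ini setting."""
    return f"RTR_Logging={int(flags)}"


# All six flags OR-ed together.
EVERYTHING = (
    RtrLogging.DEFAULT
    | RtrLogging.CONTEXT_CHANGES
    | RtrLogging.REPLICATIONS
    | RtrLogging.POLLING_LOOP
    | RtrLogging.VERBOSE_DEBUG
    | RtrLogging.LOCK_OPERATIONS
)

if __name__ == "__main__":
    print(notes_ini_line(RtrLogging.DEFAULT | RtrLogging.REPLICATIONS))
    print(notes_ini_line(EVERYTHING))
```

Log_Replication, by contrast, is a single level from 0 to 5 and cannot be combined this way.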
Replica.Cluster.Docs.Added = 26790
Replica.Cluster.Docs.Deleted = 16060
Replica.Cluster.Docs.Updated = 378378
Replica.Cluster.Failed = 30
Replica.Cluster.Files.Local = 83
Replica.Cluster.Files.Remote = 83
Replica.Cluster.Retry.Skipped = 222
Replica.Cluster.Retry.Waiting = 0
Replica.Cluster.SecondsOnQueue = 13
Replica.Cluster.SecondsOnQueue.Avg = 2
Replica.Cluster.SecondsOnQueue.Max = 3593
Replica.Cluster.Servers = 1
Replica.Cluster.SessionBytes.In = 160450213
Replica.Cluster.SessionBytes.Out = 824894460
Replica.Cluster.Successful = 13484
Replica.Cluster.WorkQueueDepth = 0
Replica.Cluster.WorkQueueDepth.Avg = 0
Replica.Cluster.WorkQueueDepth.Max = 4
sh st replica.cluster.* (if you do not read the stats, why bother clustering?)
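If reading these statistics by hand gets tedious, a small script can flag the interesting ones. The sketch below parses captured console output and reports failed replications and queue backlog. The thresholds (`max_depth`, `max_seconds`) and function names are hypothetical and would need tuning for a real environment.

```python
# A few lines copied from the listing above.
SAMPLE = """\
Replica.Cluster.Failed = 30
Replica.Cluster.SecondsOnQueue = 13
Replica.Cluster.SecondsOnQueue.Max = 3593
Replica.Cluster.Successful = 13484
Replica.Cluster.WorkQueueDepth = 0
"""


def parse_show_stat(text: str) -> dict:
    """Map statistic name to numeric value; non-numeric lines are ignored."""
    stats = {}
    for line in text.splitlines():
        name, separator, value = line.partition("=")
        if not separator:
            continue
        try:
            stats[name.strip()] = float(value.strip().replace(",", ""))
        except ValueError:
            continue  # text statistic, not needed in this sketch
    return stats


def queue_warnings(stats: dict, max_depth: int = 10, max_seconds: int = 60) -> list:
    """Return human-readable warnings; thresholds are illustrative assumptions."""
    warnings = []
    if stats.get("Replica.Cluster.WorkQueueDepth", 0) > max_depth:
        warnings.append("WorkQueueDepth above threshold")
    if stats.get("Replica.Cluster.SecondsOnQueue", 0) > max_seconds:
        warnings.append("SecondsOnQueue above threshold")
    failed = stats.get("Replica.Cluster.Failed", 0)
    if failed > 0:
        warnings.append(f"Failed replications: {int(failed)}")
    return warnings


if __name__ == "__main__":
    for warning in queue_warnings(parse_show_stat(SAMPLE)):
        print(warning)
```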
Network_Sprayer_Address=*
Useful to disable name checking after connect
I just wish it worked better (it does not always work)
DO_NOT_USE_REMEMBERED_ADDRESSES=1
Failover by Path
Normally, you should NOT get it
What you should mostly get is failover by RepId
It is a sign that you have multiple instances of the same replica ID on one server
You should (almost) never have duplicates
SH DIR on the server tells you about duplicates
Requested to be added to the ADMIN client next
Server_TransInfo_Normalize
default = 3000
Units are milliseconds * 100 of a standard transaction
3000 is a BAAAAAAAAD default
Fortunately loadmon.ncf helps to save the old real times for all transactions
USE: AvailabilityIndexType=1 (for non-HTTP)
Server_TransInfo_Range
If you don't know better,
set between 10 and 40
default is 6 and is WAAAAAAAAY TOOO LOW
Allegedly (rumour)
it also helps NON-clustered HTTP servers
Apparently some code in HTTP checks the SAI
for self-tuning, and a better SAI uses the HW better
Tell CLREPL pause/resume
Useful to be able to read something
if you are using a very high debug level
Remember to resume it, or else you will go nuts trying to figure out what happened.
Show AI (in AIX but is different)
What should be seen is this:
> show ai
nconsole DOMPHU00 "sh ai"
Range  XF    Hits   Min AI  Max AI
1      2     48406  93      100
2      4     1380   77      93
3      8     1226   64      77
4      16    821    51      64
5      32    106    38      51
6      64    39     26      37
7      128   16     20      25
...
Current value of SERVER_TRANSINFO_RANGE is 6.
<<changes suggested for SERVER_TRANSINFO_RANGE>>
nconsole DOMPHU01 "sh ai"
Range  XF    Hits   Min AI  Max AI
1      2     48826  93      100
2      4     1052   77      93
3      8     1148   64      77
4      16    711    51      64
5      32    197    38      51
6      64    40     27      38
7      128   0
8      256   4      1       5
9      512   13     0       0
10     1024  11     0       0
11     2048  1      0       0
12     4096  1      0       0
Clustering: For Geeks... and for Normal People Too! Q&A
George Chiesa <[email protected]>
Daniel Nashed <[email protected]>
SUPPORT "EXTRA" MATERIAL
These are the support pages...
Which you can get by asking for them at the back of your business card...
We politely request NO REPOSTING...
Mission Critical Service
Much better defined by the
Total Cost of NOT HAVING IT
when you need it
In other words, something that, despite having a (well known?) TCO,
may prove much more
painful & expensive "NOT TO HAVE"
Keys: TOTAL costs of NOT having
The "Nines":
2 nines (99%) =circa= 88 hours/year
3 nines (99.9%) =circa= 9 hours/year
4 nines (99.99%) =circa= 52 minutes/year
5 nines (99.999%) =circa= 5 minutes/year
Downtime costs per user = [ (Total hours of unscheduled downtime X 25% of user population X Hourly user salary) + (Total hours of scheduled downtime X Hourly messaging administrator salary) ] / Number of messaging users
NOTA BENE: R.S.E. and Change Management/Control needs
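The "nines" figures above follow directly from the number of hours in a year. A minimal sketch, assuming a non-leap year of 8760 hours (the function name is our own):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for nines in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(nines)
        print(f"{nines}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes/year)")
```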
Business Users do NOT care what you do with your PLANNED downtime
as much as they care NOT to have ANY UNPLANNED downtime during "biz time"
Business users can plan around PLANNED unavailability of mission critical systems
What Business Users usually can NOT accept is having to put up with both planned and unplanned downtime
YOU CAN NOT REDUCE BOTH TO ZERO
on an individual component basis
Key: "individual component basis"
Never begin asking for the budget...
ask for preference/aversion
acceptable amount of UNplanned downtime against the money needed to prevent it
Have the users KEEP a contingency "Plan B" for alternative/manual processing up to date, so they realise how mission critical their system really is...
TEST their plan B (fire drill :-)
Ask again for the "TC of not Having"
Ask again for "Not Having Aversion"
RunFaster=1
RunSafer=1
DoNotCrash=1
DoNotGetHacked=1
DoNotScrewMySLA=1
DoNotRuinMyBonus=1
DoNotGetMeSacked=1
Which of these do ACTUALLY EXIST ?
High Availability
My petty own TWO definitions
Historical = (ex-post)
the FACT that a service has been available in the past
Predicted = (ex-ante)
a "PERCEPTION" in terms of Probability that a service will be up
when it will be needed in the future
KEY: do NOT extrapolate past availability
Strategic Planning:
My petty own definition (borrowed from many :-)
Analyze possible future scenarios/events, their value and impact to you
What can go wrong, and how much will it cost me/my entity NOT to have the service
Estimate the "a priori" / "pari passu" probability of these events
Analyze, decide and take actions TODAY that will improve the probability of the desired events and scenarios actually happening
Keyword of this slide is TODAY
There is no such thing as
"THE BEST" practice as absolute recipe
Does it make sense to ask ?
Will the server be up tomorrow?
NO SLA will make it happen...at most you will get damages/penalties
It makes sense to Actively Plan & Design:
WHAT CAN I DO TODAY to IMPROVE the probability or likelihood that a Service will be perceived as available when needed?
The (pre) Works
You must apply generally agreed Best Practices
for making the individual items more reliable
Examples:
Clean your network of unwanted traffic
Deploy Storage & IO sensibly, i.e. http://www.Lotus.com/Performance
Automate the deployment customizations
The Works: Networking
Apply standard tuning to OS and TCP
DELETE every single other protocol you can
PRINT and understand relevant KB notes
Examples of advised TcpIp hacks:
EnablePMTUDiscovery=0
TcpTimedWaitDelay=30
etc
Analyze your network, and Investigate and Eliminate ALL non-essential traffic
Domino and I/O Optimization
Don't do this!
[Diagram labels: single RAID5 volume; I/O controller; bottlenecks; controller channels; OS kernel; page file; Notes executables; log files; Domino data]
Domino and I/O Optimization
(better)
[Diagram labels: RAID5 volume; I/O controller; separate drive; OS kernel; page file; bottlenecks; controller channels; Notes executables; log files; Domino data]
Domino and I/O Optimization
(even better)
[Diagram labels: RAID(1, 5) volume; I/O controller; separate drive; OS kernel; page file; controller channels; bottlenecks; Notes executables; log files; Domino data; OS; Page; Domino]
Domino and I/O Optimization
(much better)
[Diagram labels: RAID(1, 5) volume; separate drive; OS kernel; page files; controllers; Notes executables; log files; bottlenecks; Apps, Domino; I/O technology; OS technology; I/O controller; OS; Page; Domino; I/O controller; RAID(1, 5) volume; \data; \data]
Hardening HARDWARE Install and Post-upgrade script
Any/Everything in the box installed CAN fail
If something is not installed it can not fail
Physically remove from the boxes ANY hw not used
Modems, Audio, etc
DISABLE everything you can't take out
Classic: lpt1, com1, com2, etc
BOOT SEQUENCE: C, CD, A
DOCUMENT AND REQUIRE PAPER SIGN OFF BY OPS
Hardening SOFTWARE Install and Post-upgrade scripts
MOST SW vulnerabilities are based on SW Bugs
ALL software has (some) known + unknown BUGS
If a software is not installed it can not run :-)
If a software is not running its Bugs don't matter
UNINSTALL everything you do not absolutely need
Remove all un-needed online-documentation
Win32: SPECIFICALLY UNINSTALL WORKSTATION LAN SERVICES!!!
Remote Management: do NOT mix/share intranet security/passwords/domains/etc
Hardening SOFTWARE Install and Post-upgrade scripts (cont')
UNPLUG ALL NETWORK CABLES BEFORE UPGRADE, install from "safe" CDs, NEVER via LAN/WAN/etc
After a new install, Windows Update or equivalent:
disable everything you do not need
better yet, UNINSTALL what you do not need
check what services are running / started / auto
netstat -an | find /i "listen" (check EACH)
Beware of R.S.E. (Reverse Social Engineering)
DOCUMENT AND REQUIRE PAPER SIGN OFF BY OPs
Hard trends / environmental changes
It's a wild world out there...
There is a lot of Win32 out there... online / aDSL! unpatched / running as "Admin"
Most Win32 patches require "reboots"
Linux is only as secure as its admins are senior
and vice versa, which is also true at the lower end
Vulnerabilities (KNOWN and not)
13% of DNS servers have known vulnerabilities, according to ICANN
PACE of change in OS patching levels
External and "Internal" ScriptKiddies
Dilbertian Examples or WYPIWYG
IF you pay people to keep up the UPTIME of individual machines (stress on individual)
They WILL schedule + preventative maintenance time
They will NOT apply patches as soon as they are available
They will NOT down a service EVEN when at risk
99% of hacked/virus-infected machines were hit through
"already well known vulnerabilities"
It will cost you much MORE money and trouble
and you will get LESS value for your money
SLAs are as useful to prevent damage as insurance/assurance [ :-) ]
They make you feel better about the OUTCOMES of evil things
but they do NOTHING TO prevent evil things
from happening in the first place
Some "Dilbertian" examples:
"I will insure my house in order for it NOT to catch fire", when you'd better buy insurance in case of disaster BUT ALSO
get a smoke detector (detection)
get fire extinguishers (response!)
"I will ask people to sign NDAs..."
High Availability
Something that is "likely" to be available...
Must be architected and run as such
"Architected" implies with "HEURISTICS", most of which are "difficult to quantify"
It's easier to measure square feet of grass mown
than to quantify "Garden Landscaping Work"
"Run" requires having meaningful WYPIWYG
The HUMAN Factor: WYPIWYG
WYPIWYG is actually W.Y.P.I.W.Y.G.
"What You Pay Is What You Get"
If you measure the wrong things...
you WILL get wrong behaviours and outputs
SPOFs = Single Points of Failure
Definition:
A single point of failure is anything that is not redundant enough and whose failure will damage the availability of a service
I will NOT repeat here the trivial ones
Some "hidden SPOFs":
check the bill of materials for anything that has a quantity of 1
mouse/keyboard/switch ==> IMPLY SAME RACK
UPS/ISP/Site:
you may have to consider multi-site/multi-homed
The HUMAN Factor: WTPIWTD
WTPIWTD is actually W.T.P.I.W.T.D.
"What THEY Pay Is What THEY Demand"
Make sure the BizSponsor pays by BILLBACK
a class of service with expected resilience
a % of your fulfillment platform
Never let a user "own" a box that you run
easier said than done, but try :-)
The beauty of Notes/Domino: Secure Replication
Deploy to more than one site, enabled by
Replication of databases
scheduled replication
event driven replication
both
Tips:
do NOT deploy by OS copy or FTP, use replicas
Hardcode Cluster OU in ACLs, i.e. */Srv/<whatever>
[Names]: Add to prevent pull replication issues
![Page 19: Cluster in Detail](https://reader034.fdocuments.in/reader034/viewer/2022050801/553cbfce4a7959fe7f8b498c/html5/thumbnails/19.jpg)
Credits:
Our Teachers
Lotus/IBM/Iris:
too many links, thanx to all !
Our Partners:
Penumbra Partnering Inc. http://www.PENUMBRA.org
Our Customers
Some names in our site :-)
From Lotus Operating Principles:
"Establish Purpose Before Action" as in
Alice (In wonderland)
Tell me Mr. Cat, which "Route" should I use?
Cat:
Where do you want to go ?
Alice:
Dunno, haven't figured that out yet!
Cat:
it does not matter which one you choose!
ALL the "answers" are already out there
somewhere, mostly on the internet
the VALUE question is how to figure out
WHAT ARE THE RELEVANT QUESTIONS?
It's useful to define "relevant"
the "YOU ARE HERE" has changed
from "my Domino World"
to "my Enterprise choices"
Who moved my cheese?
MTBF = MEAN TIME BETWEEN (guaranteed !!!) FAILURES
Average of when you can expect something to fail
Assumes everything will eventually fail - by design!
MTBF implies P(F,eventually)=1.0
Murphy's LAW ...and...
Never Let a Machine Know You Need It :-)
Please engrave on my tombstone:
The devil is in the variances to averages
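A rough way to see what "P(F,eventually)=1.0" means in numbers, assuming a constant failure rate of 1/MTBF (the exponential model, which is an assumption of this sketch and not something the slides state). The MTBF figure and the `prob_failed_by` helper are hypothetical. Under this model the standard deviation of the time to failure equals the MTBF itself, which is one reading of "the devil is in the variances".

```python
import math

def prob_failed_by(t_hours, mtbf_hours):
    """P(at least one failure by time t), assuming a constant failure rate 1/MTBF."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)

mtbf = 100_000.0  # hypothetical MTBF in hours

for t in (1_000, 100_000, 1_000_000):
    # The probability climbs towards 1.0 as t grows: failure is "guaranteed".
    print(t, prob_failed_by(t, mtbf))
```

Note that at t equal to the MTBF the probability of having failed is already about 63%, not 50%.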
PLAN AROUND UnPlanned Failures
you KNOW, with P(X fails, eventually) = 1,
that individual components (= something) will fail (eventually)
but you do not know WHEN, WHAT, HOW
TRY to make cross-correlations work for you
Don't forget Murphy's Law
Leverage on differences
reduce risk by using stuff that will fail eventually BUT with negative or zero correlation
Win32 code-streams have a huge in-built-correlation, so do UNIX's/Linux's
Lower Correlation between Win32,Linux,etc
Lower Correlation between AS400/iSeries / rest
Use this to weigh how you "spread" stuff
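A small sketch of why low correlation matters. It treats "server is down right now" as a yes/no event and uses the standard formula for the joint probability of two such events with a given correlation. The probabilities, the correlation values and the `prob_both_fail` helper are illustrative assumptions, not measurements of any platform.

```python
import math

def prob_both_fail(p1, p2, rho):
    """P(both are down) for two yes/no failure events with correlation rho."""
    joint = p1 * p2 + rho * math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    # Clamp to the range that any valid joint distribution allows.
    lower = max(0.0, p1 + p2 - 1.0)
    upper = min(p1, p2)
    return min(max(joint, lower), upper)

p = 0.01  # hypothetical chance that one server is down at a given moment

for rho in (0.0, 0.5, 0.9):
    # Higher correlation (e.g. same OS code-stream) means more joint outages.
    print(rho, prob_both_fail(p, p, rho))
```

With zero correlation the pair is down together about 1 time in 10,000; at a correlation of 0.5 it is roughly 50 times more often.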
![Page 20: Cluster in Detail](https://reader034.fdocuments.in/reader034/viewer/2022050801/553cbfce4a7959fe7f8b498c/html5/thumbnails/20.jpg)
Embedded Dis-Services
Anything having EITHER an MTBF, an SLA or windowsupdate.com or liveupdates
has "Embedded individual outages"
SLA implies Dis-Service agreement trade-offs
The Business User does NOT care for INDIVIDUAL SLAs/MTBFs
So you could, can and must
Architect and Design
a CLUSTERed Solution
and offer a CLUSTER SLA
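To put a number on a "CLUSTER SLA", here is a minimal sketch that assumes the cluster is up as long as at least one member is up and that members fail independently (both are simplifying assumptions; correlated failures make the real figure worse). The availabilities and the `cluster_availability` helper are hypothetical.

```python
def cluster_availability(member_availabilities):
    """Availability of a cluster that is up while at least one member is up.

    Assumes members fail independently of each other.
    """
    all_down = 1.0
    for availability in member_availabilities:
        all_down *= (1.0 - availability)
    return 1.0 - all_down

# Two hypothetical servers, each individually available 99% of the time
print(cluster_availability([0.99, 0.99]))
```

Two "two nines" boxes give roughly "four nines" as a pair, which is why the individual SLAs matter less to the business user than the cluster's.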
Manage measurables, the right ones
If you measure & pay people for the cluster SLA
and "free" them from the components' SLAs:
For Individual Machines/OS/HW/Components:
they will get downed to investigate/fix/update
sooner, a.s.a.p. known vulnerabilities/problems
+ preventive maintenance done during prime time
less dependencies on graveyard-shift work
The user will get better and overall cheaper service
less dis-service, and smoother/safer Operations
Operators will match demand of services with + offer
Portfolio Principles
"there is nothing wrong with putting all your eggs in one basket, just watch that basket" Henry Ford
don't put all your eggs in one basket cause you can't watch it close enough
don't put all your eggs in too many baskets cause you can't watch them all close enough
Testing Tips & Tricks:
my first SW manager taught me in 1980:
Design with Testing in mind;
what you cannot PROVE works will either NOT work from day one (but remain hidden until needed) or fail in the future...
Document the testing... for regression testing
A Fellow Penumbra told me: You do not need a boat, you need a friend who has one and knows how to use it....
Same for a protocol analyser: you just can NOT guess the client/server dialogue (e.g. caching)
Copyright 2002 dotNSF, Inc. - All rights reserved
Please contact dotNSF at +44 771 85 87 673 for more presentations & information... visit: http://dotNSF.com
"What You Pay Is What You Get"
If you measure the wrong things...
you WILL get wrong behaviours and output
WYPIWYG is actually W.Y.P.I.W.Y.G
High Availability
The art of doing something "automagically" to improve the perceived performance of the cluster, usually by making intelligent usage of idle resources.
Proactive:
Load Spreading
Reactive:
Performing Load "re-"Balancing by trying to fail over to less busy clustermates
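The reactive case can be sketched as "if my server is too busy, pick the least busy clustermate". This is a toy illustration only, not how Domino itself decides: the load scores, the threshold and the `pick_server` helper are all hypothetical.

```python
def pick_server(load_by_server, current, busy_threshold=0.8):
    """Stay on `current` unless it is too busy; otherwise pick the least busy clustermate."""
    if load_by_server[current] < busy_threshold:
        return current
    # Only other cluster members are candidates for fail over.
    mates = {name: load for name, load in load_by_server.items() if name != current}
    if not mates:
        return current  # nowhere else to go
    return min(mates, key=mates.get)

# Hypothetical load scores between 0.0 (idle) and 1.0 (saturated)
loads = {"ServerA": 0.9, "ServerB": 0.4, "ServerC": 0.2}

print(pick_server(loads, "ServerA"))
print(pick_server(loads, "ServerB"))
```

The proactive case (load spreading) would instead distribute users or replicas before any server gets busy.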