MACH the Knife


The MACH11 Cluster


Overview
The High Availability Data Replication Secondary (HDR)
Setup of the HDR Secondary
The Remote Standalone Secondary (RSS)
Setup of the RSS node
The Shared Disk Secondary (SDS)
Setup of the SDS node
Failover within the MACH11 Cluster
Special Case with SDS only Clusters
Promotion of the RSS node into an HDR Secondary
Demotion of the HDR Secondary into an RSS Node



Overview

The MACH11 cluster, introduced with IDS 11,  is an extension of
the traditional HDR.  It provides a fully integrated
solution  for multiple levels of availability, and is the
foundation for Continuous Availability.  

The MACH11 cluster introduces two new types of secondary server
which complement the existing HDR secondary. The first is the Remote
Standalone Secondary (RSS) and the other is the Shared Disk Secondary
(SDS).  The main difference between the RSS node and the SDS node
is that while the RSS node maintains a physical copy of data on disk, the SDS node maintains only the shared memory buffer
pool.  As the name implies, the SDS node is also attached to the
same physical disks as the primary node by using a shared disk
subsystem.

There can be only one primary node within the cluster, and only one HDR
secondary.  However, there can be any number of RSS and/or SDS nodes
within the cluster.  It is also important to understand that only logged
data is replicated within the MACH11 cluster.

For additional information, check out "Availability Solutions with Informix Dynamic Server 11".

The High Availability Data Replication Secondary (HDR)

High Availability Data Replication (HDR) has been part of the
Informix Dynamic Server since IDS 6.  It provides support for a
hot backup system which is also available for dirty read processing.
HDR works by shipping the logs from the primary node to the
secondary, where they are applied to the physical chunks on the
secondary.  HDR is a member of the MACH11 cluster, and much of the
technology used to implement the rest of the MACH11 cluster is based
on HDR technology.


Setup of the HDR Secondary

There are six steps to bring up an HDR secondary.  The first two
steps are often overlooked, yet are fairly important.  They
involve making sure that the chunk files exist on what will become the
secondary server and making certain that any UDR/DataBlade executables
are installed on the secondary node.  The chunk files must have the same
paths as they do on the primary node, and the UDR/DataBlade executables
must be in the same locations.  This may involve nothing more
than issuing the Unix 'touch' command, or establishing links
to the appropriate directory.  Also, care must be taken to ensure
that the chunk files have the proper owner, group and permissions.
Generally these will be owner informix, group informix, with read and
write permission for owner and group.
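
For example, assuming the primary's chunks live under a made-up /ifmxdata directory, the preparation on the secondary host might look something like this (the paths are illustrative only):

   touch /ifmxdata/rootdbs_chunk /ifmxdata/datadbs_chunk1
   chown informix:informix /ifmxdata/rootdbs_chunk /ifmxdata/datadbs_chunk1
   chmod 660 /ifmxdata/rootdbs_chunk /ifmxdata/datadbs_chunk1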

The following chart describes the steps to create an HDR secondary.

Step 1 (secondary):  Create chunk files on the secondary.  This is a manual step which must be performed on the secondary node.
Step 2 (secondary):  Install UDRs and DataBlades on the secondary.  This is a manual step which must be performed on the secondary node.
Step 3 (primary):    Update the reserved pages to set this node as the primary and set the identity of the secondary node:  onmode -d primary <secondary_node>
Step 4 (primary):    Perform a backup of the primary:  ontape -s -L 0  (or onbar -b -L 0)
Step 5 (secondary):  Perform a physical restore on the secondary:  ontape -p  (or onbar -r -p)
Step 6 (secondary):  Mark the secondary as a secondary and point it to the primary instance:  onmode -d secondary <primary_node>

table 1

In step 3 we set a flag in the reserved pages which identifies this
node as an HDR primary, and also identify the network connection to the
HDR secondary.  In step 4, we perform a full system backup of the
primary and then (step 5) perform a physical restore on the HDR
secondary node.  (A physical restore does not perform the rollforward
of the logical log files.  That means when the ontape/onbar command is
finished, the restored instance is positioned at the backup checkpoint.
The rollforward of the logs is done by transmitting the logical logs
from the primary node.)
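
As a worked example, suppose the primary instance is named server_main and the secondary is named server_sec (names chosen purely for illustration).  The sequence from table 1 would then be:

   On server_main:  onmode -d primary server_sec
   On server_main:  ontape -s -L 0
   On server_sec:   ontape -p
   On server_sec:   onmode -d secondary server_main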

The HDR secondary must run the same executable binary oninit as the
primary.  The host of the HDR secondary must be similar to the
primary but does not have to be identical.  For instance, the
primary might be a 24 processor system and the HDR secondary might be a
4 processor system.  However, it is important not to undersize the
HDR secondary system, because if the HDR secondary is unable to
process the log records as fast as the primary creates them, then
backflow can occur.  If this should happen, then user activity on
the primary can block until the HDR secondary can catch up.

While the HDR secondary can be used for report processing, its primary
purpose is to provide failover support in the event that the primary
node is lost.  To make the secondary into a primary node, simply
run 'onmode -d primary <old_primary_node>'.

The Remote Standalone Secondary (RSS)

The primary purpose of the RSS node is to act as a backup for the
HDR secondary.  If the primary node is down, the HDR secondary is normally promoted into the primary.  However, if the
original primary is going to be down for an extended period of time, it
is possible to promote the RSS node into the HDR secondary.

Unlike the HDR secondary, the RSS node communicates with the primary
using a full duplex model.  This means that it is not necessary
for the secondary to acknowledge every message sent from the primary
before the next message is sent.  Because the communication model
is full duplex, the RSS node can normally make better use of the
available network bandwidth than the HDR secondary.

This in turn means
that the RSS node is better able to handle long distance communication
networks than the HDR secondary is.  However, this comes at a
cost.  The RSS node can only work in asynchronous mode.  Even
the checkpoint is asynchronous.  Because of this, the RSS node is
not able to be promoted directly into a primary node.  However,
it can be promoted into the HDR secondary and then subsequently be
promoted into the primary node.

Not only can the RSS node be promoted to the HDR secondary node, but
also the HDR secondary node can be demoted into an RSS node.  This
might be desired during some periods of time to take advantage of the
full duplexed communications model.

Setup of the RSS node

RSS requires that the server utilize Index Page Logging.  Normally
when an index is created, we only log the create index operation, not
the work that is done by the create index itself.  With
traditional HDR, the index pages from the index build are directly transmitted to the
secondary as part of the index creation.  With RSS, we felt that the
cost of attempting to transfer the index to multiple RSS nodes would
impact user activity too much, so we chose instead to simply place those pages
into the logical log. We do not place all of the index into the log as
a single transaction.  Instead we may generate multiple
transactions to log the index creation so as to avoid any long
transaction during the index build.  This feature is activated by
setting LOG_INDEX_BUILDS to 1 in the onconfig.   
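
Concretely, the relevant onconfig entry on the primary is simply:

   LOG_INDEX_BUILDS 1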

The setup of an RSS node is very similar to the setup of the HDR secondary node. 

Step 1 (RSS node):  Setup chunk files on the secondary.  Manual process to create any chunk files on the secondary node.
Step 2 (RSS node):  Install any UDRs and DataBlades.  Manual process to install any UDRs and/or DataBlades on the secondary node.
Step 3 (primary):   Register the RSS node in the sysha database:  onmode -d add RSS <node> <password>
Step 4 (primary):   Perform a backup on the source:  ontape -s -L 0  (or onbar -b -L 0)
Step 5 (RSS node):  Perform a physical restore on the secondary:  ontape -p  (or onbar -r -p)
Step 6 (RSS node):  Connect to the primary:  onmode -d RSS <primary> <password>

table 2

We establish the potential RSS node in step three and also set an
optional password for the initial connect request.  If the sysha
database does not yet exist on the primary node, it will automatically
be created in the root chunk.  The optional password is only used
in the initial connection from the RSS node to the primary.  After
taking a full system backup on the primary and restoring it on the
secondary (again a physical restore), the setup is completed by issuing
onmode -d RSS on the RSS node.  This causes a network connection to be
made to the primary and replication to begin.  If a password was used
as part of the onmode -d add RSS command on the primary, then the same
password is required as part of the onmode -d RSS command.
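
Continuing the earlier illustrative names, suppose the primary is server_main, the new RSS node is server_rss, and we pick a made-up password of secret123.  The sequence from table 2 would then look like:

   On server_main:  onmode -d add RSS server_rss secret123
   On server_main:  ontape -s -L 0
   On server_rss:   ontape -p
   On server_rss:   onmode -d RSS server_main secret123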

The Shared Disk Secondary (SDS)

While the HDR secondary and RSS nodes maintain both the buffer cache
and a disk copy of the database chunks, the shared disk secondary node
only maintains the buffer cache.  Instead of maintaining a copy of
the chunks on local disk, the SDS node uses the same physical disks as
the primary on a shared disk subsystem such as Veritas or GPFS.
The reason that we implemented the Shared Disk Secondary was to
take advantage of newer disk technology.  For instance, the
customer might want to have a standby instance but use disk
mirroring or some other hardware availability solution to provide
the disk redundancy.

The setup of the SDS node is a different process than the setup of
the HDR secondary or RSS nodes.  Instead of performing a backup of
the primary node and a physical restore on the secondary node, the SDS
node is instantiated simply by issuing a checkpoint on the primary node
and having the SDS node start the roll-forward of the logs as of that
checkpoint LSN.  As the primary flushes logs to disk, it sends the
LSN that it has flushed to the SDS node.  The SDS node will then
read and process the logs up to that LSN.  As the SDS node is
processing log records, it sends a notification to the primary as to
how far in the logs it has processed.  That way, the primary is
able to determine when it is safe to flush a buffer back to disk.
Basically, the primary will not flush a page to disk until the
SDS nodes have progressed past the LSN at which the page was changed by
an update to that page.

There are some  new onconfig parameters which need to be set in
order to use the shared disk secondary node.  These are -

  • SDS_ENABLE
This parameter must be set on the
secondary node to enable the support of SDS.  It is not directly
set on the primary node.  Rather it is internally set as part of
the onmode -d set SDS primary command.  Be aware that the database engine can not be initialized (oninit -iy) if SDS_ENABLE is set.


To enable this parameter, set the value to 1.  To disable set to 0.


  • SDS_TIMEOUT
This is the maximum time (in seconds)
that the primary will wait for the SDS node to advance far enough so
that a given page can be flushed.  If  the SDS node has not
advanced far enough in the logs to allow the primary to flush the page
when this time expires, then the SDS node will be automatically
disconnected from the cluster and be shut down.


  • SDS_TEMPDBS
The shared disk secondary is not
allowed to use the same temporary dbspace as the primary.  Instead
it  must use a local dbspace and avoid using the existing
temporary dbspaces which are defined within the database.  This
parameter is used to define the temporary dbspace which the SDS node
will use.  This dbspace is dynamically created when the SDS node is
started.  It is not created by running onspaces.


The format of this parameter is
<dbspace_name>,<path>,<pagesize>,<offset>,<size>.
There can be multiple of these in the onconfig.  At the
startup of the SDS node, these temporary dbspaces are created and
assigned dbspace/chunk numbers at the high end of the available numbers
rather than the next dbspace/chunk number.  The first SDS_TEMPDBS
would have a chunk number of 32766 and a dbspace number of 2047.
Since the SDS_TEMPDBS is local to the instance, the same chunk
and dbspace numbers will be used on another SDS node within the same
cluster.


  • SDS_PAGING
This parameter specifies the paths to two files
that are used to hold pages which might need to be flushed on the SDS
node between checkpoints.  These are files, not chunks.  Each
file can act as temporary disk storage for chunks of any page size.
The paging files are extended in 1 megabyte increments.


There is an in-memory hash structure which keeps track of all of the
pages contained within the paging files.  When we are reading a
page from disk, we first examine the hash structures to see if the page
is contained in one of the paging files.  If it is, then the read
is switched to read the image contained in the paging file.  If
the page is not found in the hash structure, then the page is read from
the chunk.  By using the paging file, we are certain to read the
correct version of the page.


We need two paging files in order to support non-blocking checkpoints.
 Basically what happens is that at the start of the checkpoint,
the two paging file hash structures are swapped.  At the end of
the checkpoint, the oldest paging file is reset so that it appears to
be empty.

table 3
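
Putting these together, a hypothetical onconfig fragment for an SDS node might look like the following.  All names, paths and sizes are illustrative only, and the exact SDS_PAGING syntax should be checked against the documentation for your version:

   SDS_ENABLE   1
   SDS_TIMEOUT  20
   SDS_TEMPDBS  sdstmpdbs1,/ifmxlocal/sdstmpdbs1,2,0,100000
   SDS_PAGING   /ifmxlocal/sdspage1,/ifmxlocal/sdspage2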

Setup of the SDS node

It is fairly simple to start up a shared disk secondary node, but
there is some preparation which must first be performed.  The
first thing that must be done is to ensure that the disks are available
on the SDS node in the same location as they are on the primary.
It is also important to understand that the SDS secondary can
only run on a true shared disk device.  You will not be successful
attempting to run SDS using NFS cross-mounted files, because of
operating system buffering.  The devices must be true shared disk
devices such as Veritas or GPFS.

Once that has been done, the onconfig file must be updated to set
the parameters described in table 3.  I would recommend that
both SDS_TEMPDBS and SDS_PAGING point to files on a local disk rather
than to files on the shared disk subsystem, if possible.

There are basically only two steps necessary to start up the shared disk secondary instance.  These steps are:

Step 1 (primary):   Define the primary listener port:  onmode -d set SDS primary <port>
Step 2 (SDS node):  Connect to the primary:  oninit

table 4

The first step will define the port at
which the primary node will accept an SDS connection.  This
information is written to page zero of the reserved pages.  Unlike
the RSS setup, there is no optional password.  Since the disks are
shared between the primary and the SDS nodes, the environment is
already controlled, so it was felt that there was no need
for any password.  When the SDS node is started up in step 2, it
reads the reserved pages and finds out where the primary node is
located.  It then connects to the primary and establishes itself
as an SDS node attached to the primary.
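
Once the onconfig parameters from table 3 are in place on the SDS host, the two steps from table 4 amount to something like the following, where the placeholder stands for the primary's listener port as described above:

   On the primary:   onmode -d set SDS primary <primary_listener_port>
   On the SDS node:  oninit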

Failover within the MACH11 Cluster

One of the advantages of the MACH11
cluster is that there are multiple levels of failover.  There is
the HDR secondary, the RSS node, and the SDS node.  However,
because there are multiple levels of failover, performing a failover
operation could have involved an additional level of complexity.

The onmode -d make primary <listener_port>  [force] command
 simplifies failover.  Basically this command simply moves
the role of primary to the node on which the command is issued,
regardless of the role of the server within the cluster.  If there
are SDS nodes within the cluster, then the new primary will do what is
necessary to realign those nodes to the new primary.  If there is
an HDR secondary, then it will be realigned to the new primary.
 If there are RSS nodes, then they will be redirected
automatically to the new primary.  If the cluster is replicating
by using Enterprise Replication as well, then all of the ER connections will
also be automatically realigned to the new primary node.

Normally the onmode -d make primary
command will attempt to connect to the existing primary node and shut
it down.  However, if the primary is currently down, then you can
use the 'force' option to cause the command to proceed even though it
can not connect to the primary node.
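
For example, to fail over to an SDS node whose listener port happens to be 9101 (a made-up value), one would run the following on that node:

   onmode -d make primary 9101

and, if the old primary is already down and can not be contacted:

   onmode -d make primary 9101 force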

The ideal order of failover is:

  1. First failover to a SDS node.
  2. Then failover to the HDR secondary.
  3. Finally failover to an RSS node

The following chart describes what happens to the remaining nodes within
the cluster when failover occurs.

If the new primary is an SDS node:  the remaining SDS nodes realign to the new primary, the HDR secondary realigns to the new primary, and the remaining RSS nodes realign to the new primary.
If the new primary is the HDR secondary:  all SDS nodes are shut down, and the remaining RSS nodes realign to the new primary.
If the new primary is an RSS node:  all SDS nodes are shut down, the HDR secondary is shut down, and the remaining RSS nodes are shut down.

Table 5

Special Case with SDS only Clusters

There is a special case in which there are
only SDS nodes within the MACH11 cluster.  Generally the primary node
needs to be started first and then the SDS nodes.  This is because the
SDS node connects to the primary during startup to request that a
checkpoint be issued.  It is possible that the entire cluster could
fail, such as might happen with a power failure in a blade server.
In such a case, it might be possible that the primary can not be
restarted - perhaps because of a power supply failure.  If for any
reason it is not possible to restart the primary, then it would seem
that the entire cluster would be down.

However, there is a new option to the
oninit process which will not only start up the instance, but will also
shift the role of primary server at the same time.  If such a condition
should occur in which all of the nodes within an SDS-only cluster
fail and the primary can not be restarted, then all that needs to be
done is to issue oninit -SDS=<new_listener_port> on one of the other SDS nodes.
When this option is used to start the server, the reserved pages are
updated to allow that instance to be started as the primary node.
After that is run, the remaining SDS nodes can be started by simply
running oninit, and they will connect to the new primary.

It is very important that, before oninit
-SDS=<port> is run, the DBA be certain that the existing
primary is really down.  Failure to perform that check will result
in disk corruption.
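
For instance, after verifying that the old primary really is down, a surviving SDS node could be brought up as the new primary along the following lines (9102 is a made-up listener port):

   oninit -SDS=9102     (on the SDS node that is to become the new primary)
   oninit               (afterwards, on each of the remaining SDS nodes)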

Promotion of the RSS node into an HDR Secondary

It is fairly simple to promote an RSS node
into an HDR Secondary node while the servers are online.
Basically all this entails is replacing the method of receiving
the logs on the secondary node, as both the RSS and HDR secondary nodes
use the same technology to apply the log records.  Suppose that our
primary node is called server_main and the RSS node is called
server_sec.  To promote server_sec into an HDR secondary, all that
would need to be done is:

Step 1 (server_main):  Define server_main as an HDR primary:  onmode -d primary server_sec
Step 2 (server_sec):   Connect server_sec to server_main as an HDR secondary:  onmode -d secondary server_main

Table 6

This procedure will cause the RSS log
transport layer to be shut down, but will not shut down the apply threads
on server_sec.  Then when step 2 is executed, the HDR
transport threads will start and send the log records to the already
running apply threads.

Demotion of the HDR Secondary into an RSS Node

It may be that there is a desire to
convert the existing HDR secondary into an RSS node.  Again, this
is a fairly easy process to perform while online.  Basically all that
needs to be done is to issue onmode -d RSS <primary_node> on the
HDR secondary, which then becomes the RSS node.  However, it needs to
be understood that once this command is issued, the primary is no
longer an HDR primary.  The DBA will need to execute onmode -d primary
to reestablish the HDR relationship among the nodes.
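
Using the same illustrative names as above, demoting the HDR secondary server_sec under the primary server_main would simply be:

   On server_sec:   onmode -d RSS server_main

If an HDR secondary is wanted again later, onmode -d primary would be run on the primary to name one, as described in the earlier sections.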