MACH the Knife
The MACH11 Cluster
The MACH11 cluster, introduced with IDS 11, is an extension of
traditional HDR. It provides a fully integrated solution for
multiple levels of availability and is the foundation for
Continuous Availability.
The MACH11 cluster introduces two new types of secondary server
which complement the existing HDR secondary. The first is the Remote
Standalone Secondary (RSS) and the other is the Shared Disk Secondary
(SDS). The main difference between the RSS node and the SDS node
is that while the RSS node maintains a physical copy of data on disk, the SDS node maintains only the shared memory buffer
pool. As the name implies, the SDS node is attached to the
same physical disks as the primary node through a shared disk
subsystem. There can be only one primary node within the cluster,
and only one HDR secondary; however, there can be any number of
RSS and/or SDS nodes within the cluster. It is also important to
understand that only logged data is replicated within the MACH11
cluster.
For additional information, check out "Availability Solutions with Informix Dynamic Server 11".
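The composition rules above can be summarized in a small sketch (illustrative Python only, not IDS code): exactly one primary, at most one HDR secondary, and any number of RSS or SDS nodes.

```python
# Hypothetical sketch of MACH11 cluster composition rules (not IDS code).
from collections import Counter

def validate_cluster(roles):
    """roles: list of node roles, e.g. 'primary', 'hdr', 'rss', 'sds'."""
    counts = Counter(roles)
    if counts["primary"] != 1:
        return False   # there can be only one primary node
    if counts["hdr"] > 1:
        return False   # at most one HDR secondary
    return True        # RSS and SDS counts are unconstrained

print(validate_cluster(["primary", "hdr", "rss", "rss", "sds"]))  # True
print(validate_cluster(["primary", "hdr", "hdr"]))                # False
```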
High Availability Data Replication (HDR) has been part of
Informix Dynamic Server since IDS 6. It provides support for a
hot backup system which is also available for dirty read processing.
HDR works by shipping the logs from the primary node to the
secondary, where they are applied to the physical chunks on the
secondary. HDR is a member of the MACH11 cluster, and much of the
technology used to implement the rest of the MACH11 cluster is
based on HDR technology.
There are six steps to bring up an HDR secondary. The first two
steps are often overlooked, yet they are fairly important: make sure
that the chunk files exist on what will become the secondary server,
and make certain that any UDR/DataBlade executables are installed on
the secondary node. The chunk files must have the same paths as they
do on the primary node, and the UDR/DataBlade executables must be in
the same locations. This may involve nothing more than issuing the
UNIX 'touch' command or establishing links to the appropriate
directory. Also, care must be taken to ensure that the chunk files
have the proper owner, group, and permissions. Generally these will
be owner informix, group informix, with read and write permission
for owner and group.
The following chart describes the steps to create an HDR secondary.
|1||Create chunk files on the secondary||This is a manual step which must be performed on the secondary node.|
|2||Install UDRs and DataBlades on the secondary||This is a manual step which must be performed on the secondary node.|
|3||Update the reserved pages to mark this node as the primary and identify the secondary node||onmode -d primary <secondary_node>|
|4||Perform a backup of the primary||ontape -s -L 0 (onbar -b -L 0)|
|5||Perform a physical restore on the secondary||ontape -p (onbar -r -p)|
|6||Mark the secondary as a secondary and point it to the primary instance||onmode -d secondary <primary_node>|
In step 3, we set a flag in the reserved pages which identifies this
node as an HDR primary and also identifies the network connection to
the HDR secondary. In step 4, we perform a full system backup of the
primary, and then in step 5 we perform a physical restore on the HDR
secondary node. (A physical restore does not roll forward the
logical log files; when the ontape/onbar command finishes, the
restored instance is positioned at the backup checkpoint. The
rollforward of the logs is done by transmitting the logical logs
from the primary node.)
The HDR secondary must run the same oninit executable binary as the
primary. The host of the HDR secondary must be similar to the
primary but does not have to be identical. For instance, the
primary might be a 24-processor system and the HDR secondary a
4-processor system. However, it is important not to undersize the
HDR secondary system: if the HDR secondary is unable to process the
log records as fast as the primary creates them, then backflow can
occur. If this should happen, user activity on the primary can
block until the HDR secondary catches up.
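The blocking effect can be illustrated with a toy producer/consumer model (a made-up simulation, not IDS internals; the rates and buffer size are arbitrary):

```python
# Toy simulation: if the HDR secondary applies log records more slowly
# than the primary generates them, the in-flight log buffer fills up and
# the primary must stall ("backflow"). Not IDS source; all numbers invented.
def simulate(gen_rate, apply_rate, buffer_cap, ticks):
    in_flight = 0
    stalled_ticks = 0
    for _ in range(ticks):
        in_flight = max(0, in_flight - apply_rate)   # secondary applies logs
        if in_flight + gen_rate > buffer_cap:
            stalled_ticks += 1                       # primary must block
        else:
            in_flight += gen_rate                    # primary generates logs
    return stalled_ticks

# Adequately sized secondary: the primary never stalls.
print(simulate(gen_rate=10, apply_rate=10, buffer_cap=50, ticks=100))  # 0
# Undersized secondary: the primary frequently blocks.
print(simulate(gen_rate=10, apply_rate=5, buffer_cap=50, ticks=100))
```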
While the HDR secondary can be used for report processing, its primary
purpose is to provide failover support in the event that the primary
node is lost. To make the secondary into a primary node, simply
run 'onmode -d primary <old_primary_node>'.
The primary purpose of the RSS node is to act as a backup for the
HDR secondary. If the primary node is down, the HDR secondary is normally promoted into the primary. However, if the
original primary is going to be down for an extended period of time, it
is possible to promote the RSS node into the HDR secondary.
Unlike the HDR secondary, the RSS node communicates with the primary
using a full duplexed model. This means that it is not necessary
for the secondary to acknowledge every message sent from the primary
before the next message is sent. Because the communication model
is full duplexed, the RSS node can normally make better use of the
available network bandwidth than the HDR secondary can, and it is
better able to handle long distance communication networks.
However, this comes at a cost: the RSS node can only work in
asynchronous mode. Even the checkpoint is asynchronous. Because
of this, the RSS node cannot be promoted directly into a primary
node. However, it can be promoted into the HDR secondary and then
subsequently be promoted into the primary node.
Not only can the RSS node be promoted to the HDR secondary node, but
also the HDR secondary node can be demoted into an RSS node. This
might be desired during some periods of time to take advantage of the
full duplexed communications model.
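The benefit of a full-duplex model can be sketched with back-of-the-envelope arithmetic (an illustrative model only; the per-buffer transmit time and latency figures are invented):

```python
# Illustrative timing model for shipping n log buffers over a link.
# s = time to transmit one buffer, latency = one-way network latency.
# Half-duplex (HDR-style): wait a full round trip for an ack per buffer.
# Full-duplex (RSS-style): buffers are pipelined while acks flow back.
def half_duplex_time(n, s, latency):
    return n * (s + 2 * latency)     # transmit, then wait for the ack

def full_duplex_time(n, s, latency):
    return n * s + latency           # only the last buffer pays the latency

print(half_duplex_time(1000, s=1, latency=50))  # 101000
print(full_duplex_time(1000, s=1, latency=50))  # 1050
```

Note how the half-duplex cost grows with latency per message, which is why the advantage of RSS is most pronounced over long-distance links.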
RSS requires that the server use Index Page Logging. Normally
when an index is created, we only log the create index operation,
not the pages written by the index build itself. With traditional
HDR, the index pages from the index build are transmitted directly
to the secondary as part of the index creation. With RSS, we felt
that the cost of attempting to transfer the index to multiple RSS
nodes would impact user activity too much, so we chose instead to
simply place those pages into the logical log. We do not place the
whole index into the log as a single transaction; instead we may
generate multiple transactions to log the index creation, so as to
avoid a long transaction during the index build. This feature is
activated by setting LOG_INDEX_BUILDS to 1 in the onconfig file.
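The idea of splitting one index build into several smaller logged transactions can be sketched as follows (a simplification for illustration, not the server's actual logging code; the batch size is arbitrary):

```python
# Sketch of the idea behind LOG_INDEX_BUILDS: instead of logging the whole
# index build as one long transaction, log the index pages in several
# smaller transactions. Purely illustrative; not IDS source code.
def log_index_build(index_pages, pages_per_txn):
    transactions = []
    for i in range(0, len(index_pages), pages_per_txn):
        transactions.append(index_pages[i:i + pages_per_txn])
    return transactions

txns = log_index_build(list(range(10)), pages_per_txn=4)
print(len(txns))   # 3 small transactions instead of one long one
print(txns[0])     # [0, 1, 2, 3]
```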
The setup of an RSS node is very similar to the setup of the HDR secondary node.
|1||Set up chunk files on the secondary||Manual process to create any chunk files on the secondary node|
|2||Install any UDRs and DataBlades||Manual process to install any UDRs and/or DataBlades on the secondary node|
|3||Register the RSS node in the sysha database||onmode -d add RSS <node> <password>|
|4||Perform a backup on the source||ontape -s -L 0 (onbar -b -L 0)|
|5||Perform a physical restore on the secondary||ontape -p (onbar -r -p)|
|6||Connect to the primary||onmode -d RSS <primary> <password>|
We establish the potential RSS node in step three and also set an
optional password for the initial connect request. If the sysha
database does not yet exist on the primary node, it will automatically
be created in the root chunk. The optional password is only used
in the initial connection from the RSS node to the primary. After
taking a full system backup on the primary and restoring it on the
secondary (again a physical restore), the setup is completed by issuing
onmode -d RSS on the RSS node.
This will cause a network connection to the primary to be
established and replication to begin. If a password was used as
part of the onmode -d add RSS command on the primary, then the same
password is required as part of the onmode -d RSS command.
While the HDR secondary and RSS nodes maintain both the buffer cache
and a disk copy of the database chunks, the shared disk secondary node
only maintains the buffer cache. Instead of maintaining a copy of
the chunks on local disk, the SDS node uses the same physical disks as
the primary on a shared disk subsystem such as Veritas or GPFS.
The reason that we implemented the Shared Disk Secondary was to
take advantage of newer disk technology. For instance, a
customer might want to have a standby instance but use disk
mirroring or some other hardware availability solution to
provide the disk redundancy.
The setup of the SDS node is a different process than the setup of
the HDR secondary or RSS nodes. Instead of performing a backup of
the primary node and physical restore on the secondary node, the SDS
node is instantiated by simply issuing a checkpoint on the primary node
and the SDS node starting the roll-forward of the logs as of that
checkpoint LSN. As the primary flushes logs to disk, it sends
the LSN that it has flushed to the SDS node. The SDS node then
reads and processes the logs up to that LSN. As the SDS node
processes log records, it sends a notification to the primary
indicating how far in the logs it has processed. That way, the
primary is able to determine when it is safe to flush a buffer back
to disk. Basically, the primary will not flush a page to disk until
the SDS nodes have progressed past the LSN of the log record that
last changed the page.
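The flush rule can be expressed as a one-line predicate (a simplified model, not IDS source; the LSN values are invented):

```python
# Simplified model of the SDS flush-gating rule: the primary may flush a
# dirty page only after every SDS node reports an applied LSN at or beyond
# the LSN of the log record that last changed that page.
def can_flush(page_lsn, sds_applied_lsns):
    return all(applied >= page_lsn for applied in sds_applied_lsns)

print(can_flush(page_lsn=120, sds_applied_lsns=[130, 125]))  # True
print(can_flush(page_lsn=120, sds_applied_lsns=[130, 110]))  # False: one SDS lags
```

In the real server, a primary that waits too long on a lagging SDS node is governed by a timeout, as described in the onconfig parameters below.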
There are some new onconfig parameters which need to be set in
order to use the shared disk secondary node. These are:
|SDS_ENABLE||This parameter must be set on the secondary node to enable the support of SDS. It is not directly set on the primary node; rather, it is internally set as part of the onmode -d set SDS primary command. Be aware that the database engine can not be initialized (oninit -iy) if SDS_ENABLE is set. To enable this parameter, set the value to 1; to disable, set it to 0.|
|SDS_TIMEOUT||This is the maximum time (in seconds) that the primary will wait for the SDS node to advance far enough so that a given page can be flushed. If the SDS node has not advanced far enough in the logs to allow the primary to flush the page when this time expires, then the SDS node will be automatically disconnected from the cluster and shut down.|
|SDS_TEMPDBS||The shared disk secondary is not allowed to use the same temporary dbspace as the primary. Instead it must use a local dbspace and avoid using the existing temporary dbspaces which are defined within the database. This parameter is used to define the temporary dbspace which the SDS node will use. This dbspace is dynamically created when the SDS node is started; it is not created by running onspaces. The format of this parameter is|
|SDS_PAGING||This parameter is the path to two files that are used to hold pages which might need to be flushed on the SDS node between checkpoints. These are files, not chunks. Each file can act as temporary disk storage for chunks of any page size. The paging files are extended in 1 megabyte increments. There is an in-memory hash structure which keeps track of all of the pages held in the paging files. We need two paging files in order to support non-blocking checkpoints.|
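Why two paging files help can be sketched with a double-buffering model (an illustrative toy, not the server's actual implementation; names and structure are invented):

```python
# Toy illustration of dual SDS paging files: while a checkpoint drains one
# file, new pages keep landing in the other, so stashing pages never blocks
# on checkpoint processing. Invented sketch, not IDS internals.
class SdsPaging:
    def __init__(self):
        self.files = [[], []]   # the two paging "files"
        self.active = 0         # index of the file receiving new pages
        self.tracked = {}       # in-memory hash: page id -> file index

    def stash_page(self, page_id):
        self.files[self.active].append(page_id)
        self.tracked[page_id] = self.active

    def checkpoint(self):
        # Swap roles: new pages go to the other file while the old one
        # is drained as part of checkpoint processing.
        old = self.active
        self.active = 1 - self.active
        drained, self.files[old] = self.files[old], []
        for page in drained:
            self.tracked.pop(page, None)
        return drained

p = SdsPaging()
p.stash_page(1); p.stash_page(2)
drained = p.checkpoint()
p.stash_page(3)             # stashing continues without blocking
print(drained)              # [1, 2]
print(p.files[p.active])    # [3]
```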
It is fairly simple to start up a shared disk secondary node, but
there is some preparation which must first be performed. The
first thing that must be done is to ensure that the disks are
available on the SDS node in the same location as they are on the
primary. It is also important to understand that the SDS secondary
can only run on a true shared disk device. You will not be
successful attempting to run SDS over NFS cross-mounted files,
because of operating system buffering. The devices must be
shared disk devices such as Veritas or GPFS.
Once that has been done, the onconfig file must be updated to set
the parameters described in table 3. I would recommend that
both SDS_TEMPDBS and SDS_PAGING point to files on a local disk rather
than to files on the shared disk subsystem, if possible.
There are basically only two steps necessary to start up the shared disk secondary instance. These steps are
|1||Define the primary listener port||onmode -d set SDS primary <port>|
|2||Connect to the primary||oninit|
The first step will define the port at
which the primary node will accept an SDS connection. This
information is written to page zero of the reserved pages. Unlike
the RSS setup, there is no optional password. Since the disks are
shared between the primary and the SDS nodes, the environment is
already a controlled environment so it was felt that there was no need
for any password. When the SDS node is started up in step 2, it
reads the reserved pages and finds out where the primary node is
located. It then connects to the primary and establishes itself
as an SDS node attached to the primary.
One of the advantages of the MACH11
cluster is that there are multiple levels of failover. There is
the HDR secondary, the RSS node, and the SDS node. However,
multiple levels of failover could have introduced an additional
level of complexity when performing a failover operation.
The onmode -d make primary <listener_port> [force] command
simplifies failover: it simply moves the role of primary to the
node on which the command is issued, regardless of that server's
current role within the cluster. If there
are SDS nodes within the cluster, then the new primary will do what is
necessary to realign those nodes to the new primary. If there is
an HDR secondary, then it will be realigned to the new primary.
If there are RSS nodes, then they will be redirected
automatically to the new primary. If the cluster is replicating
by using Enterprise Replication as well, then all of the ER connections will
also be automatically realigned to the new primary node.
Normally the onmode -d make primary
command will attempt to connect to the existing primary node and shut
it down. However, if the primary is currently down, then you can
use the 'force' option to cause the command to proceed even though it
can not connect to the primary node.
The ideal order of failover is:
- First fail over to an SDS node.
- Then fail over to the HDR secondary.
- Finally fail over to an RSS node.
The following chart describes what happens to the remaining nodes
within the cluster when failover occurs.
|New Primary||Remaining SDS nodes||HDR secondary||Remaining RSS nodes|
|SDS node||Remaining SDS nodes realign to the new SDS primary||HDR secondary realigns to the new SDS primary||Remaining RSS nodes realign to the new SDS primary|
|HDR secondary||All SDS nodes are shut down||New Primary||Remaining RSS nodes realign to the new primary|
|RSS node||All SDS nodes are shut down||HDR secondary is shut down||Remaining RSS nodes are shut down|
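The failover matrix above can be expressed as a small lookup table (a sketch for reasoning about outcomes, not an administration tool; the role names are shorthand):

```python
# The failover chart as a lookup: FAILOVER[new_primary_role][other_role]
# gives what happens to the remaining nodes of that role.
FAILOVER = {
    "sds": {"sds": "realign", "hdr": "realign", "rss": "realign"},
    "hdr": {"sds": "shut down", "hdr": "becomes primary", "rss": "realign"},
    "rss": {"sds": "shut down", "hdr": "shut down", "rss": "shut down"},
}

def outcome(new_primary_role, remaining_role):
    return FAILOVER[new_primary_role][remaining_role]

print(outcome("sds", "rss"))  # realign
print(outcome("rss", "sds"))  # shut down
```

The table makes the preferred failover order visible: promoting an SDS node preserves every other node, while promoting an RSS node costs the rest of the cluster.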
There is a special case in which there are
only SDS nodes within the MACH11 cluster. Generally the primary node
needs to be started first and then the SDS nodes. This is because the
SDS node connects to the primary during startup to request that a
checkpoint be issued. It
is possible that the
entire cluster could fail, such as might happen with a power
failure in a blade server. In such a case, it might be possible
that the primary can not be restarted
- perhaps because of a power supply failure. If for any reason it
is not possible to restart the primary, then it would seem that the
cluster would be down.
However, there is a new option to the oninit process which will not
only start up the instance, but will also shift the role of primary
server at the same time. If such a condition should occur in which
all of the nodes within an SDS-only cluster fail and the primary
can not be restarted, then all that needs to be done is to issue
oninit -SDS=<new_listener_port> on one of the other SDS nodes.
When this option is used to start the server, the reserved pages
are updated to allow that instance to be started as the primary
node. After that is run, the remaining SDS nodes can be started by
simply running oninit, and they will connect to the new primary.
It is very important that before oninit
-SDS=<port> is run, that the DBA be certain that the existing
primary is really down. Failure to perform that check will result
in disk corruption.
It is fairly simple to promote an RSS node into an HDR secondary
node while the servers are online. Basically, all this entails is
replacing the method of receiving the logs on the secondary node,
as both the RSS and HDR secondary nodes use the same technology to
apply the log records. Suppose that our primary node is called
server_main and the RSS node is called server_sec. To promote
server_sec into an HDR secondary, all that would need to be done is:
|1||Define server_main as an HDR primary||onmode -d primary server_sec|
|2||Connect server_sec to the server_main as an HDR secondary||onmode -d secondary server_main|
This procedure will cause the RSS log transport layer to be shut
down, but will not shut down the apply threads on server_sec.
When step 2 is executed, the HDR transport threads will start and
send the log records to the already running apply threads.
It may be that there is a desire to convert the existing HDR
secondary into an RSS node. Again, this is a fairly easy process
to perform while online. Basically, all that needs to be done is to
issue onmode -d RSS <primary_node> on the HDR secondary (soon to be
the RSS node). However, it needs to be understood that once this
command is issued, the primary is no longer an HDR primary.
The DBA will need to execute onmode -d primary to
reestablish the HDR relationship among the nodes.