Tutorials/howto maintain cluster

From CubeiaWiki

How To: Maintain Cluster

NB:  This how-to is relevant to the Firebase Enterprise Edition only.

Prerequisites

You should have a basic understanding of a Firebase cluster before reading this how-to. You may find these two links interesting:

Versions and Deployments

A cluster is expected to outlive multiple versions of its games, and indeed of the Firebase platform itself. You should arrange your installation directory on the cluster so that it is easy to switch between versions.

The folders in an installation that are most often changed when a new Firebase version is released are these:

Image:install-dirs-fb.jpeg

Of course, the folder changed when updating a game is the "game" folder.

To make switching between versions quick (which is important if you need to roll back on a live cluster), we'll use symlinks, as shown in the installation manual. The result may look like this (using 'ls -l'):

Image:install-dirs-fbl.jpeg

If you study the image above you'll see that the important binary directories are now symlinks, making switching to a new version really quick:

cd /usr/local/firebase
rm -f bin; rm -f lib
ln -s /usr/local/firebase/<new-version>/bin /usr/local/firebase/bin
ln -s /usr/local/firebase/<new-version>/lib /usr/local/firebase/lib
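
The commands above can be wrapped in a small helper so that an upgrade and a roll-back are the same operation with a different version argument. This is a minimal sketch, not part of the Firebase distribution: the function name `switch_version` is hypothetical, and it assumes every version is unpacked under the installation root with its own bin and lib folders. The `-n` option of `ln` is not in POSIX but is supported by both GNU and BSD `ln`.

```shell
#!/bin/sh
# Hypothetical helper: point <root>/bin and <root>/lib at <root>/<version>.
switch_version() {
    root=$1
    version=$2
    for dir in bin lib; do
        # Refuse to switch if the target version is incomplete.
        if [ ! -d "$root/$version/$dir" ]; then
            echo "missing $root/$version/$dir" >&2
            return 1
        fi
        # Refuse to touch a real directory; only symlinks are replaced.
        if [ -e "$root/$dir" ] && [ ! -L "$root/$dir" ]; then
            echo "$root/$dir exists and is not a symlink" >&2
            return 1
        fi
    done
    for dir in bin lib; do
        # -n: do not follow an existing symlink to a directory, replace it.
        ln -sfn "$root/$version/$dir" "$root/$dir"
    done
}
```

With this in place, switching is `switch_version /usr/local/firebase <new-version>`, and rolling back is the same call with the previous version's directory name.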

Updates

Firebase supports rolling updates as long as the serialized form of the game, or of Firebase itself, has not changed. We recommend the following for live clusters:

  • Use rolling updates for games only if you're certain the serialized form is the same and you have tested the update in a staging environment.
  • Do not use rolling updates for Firebase upgrades. Too many things can go wrong, even if you have a very good staging environment.

For a live cluster, an upgrade with restart may look like this:

  1. Make sure the tournament schedule is clear for the down-time.
  2. Notify players when the server is going down (and preferably some time before as well).
  3. Close firewall.
  4. Stop the cluster in reverse startup order.
  5. Perform software update.
  6. Start the cluster normally.
  7. Perform initial tests to make sure everything is OK.
  8. Open firewall.
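
Steps 3 to 8 above can be sketched as a script that runs each step in order and stops at the first failure, so that the firewall is never reopened on a half-upgraded cluster. Everything site-specific here is a placeholder: the node names, the step functions, and their bodies (which only echo what a real step would do) are assumptions for illustration, not Firebase commands. Steps 1 and 2 are left as manual tasks.

```shell
#!/bin/sh
# Hypothetical startup order of the cluster nodes; replace with your own.
STARTUP_ORDER="node-a node-b node-c"

# Print the given words in reverse order.
reverse_words() {
    reversed=""
    for word in "$@"; do
        reversed="$word${reversed:+ $reversed}"
    done
    echo "$reversed"
}

# Placeholder steps: each one only echoes what a real step would do.
close_firewall()  { echo "close firewall"; }
stop_node()       { echo "stop $1"; }
update_software() { echo "update software"; }
start_node()      { echo "start $1"; }
smoke_test()      { echo "run initial tests"; }
open_firewall()   { echo "open firewall"; }

upgrade_cluster() {
    close_firewall || return 1
    # Stop in reverse startup order.
    for node in $(reverse_words $STARTUP_ORDER); do
        stop_node "$node" || return 1
    done
    update_software || return 1
    # Start in normal order.
    for node in $STARTUP_ORDER; do
        start_node "$node" || return 1
    done
    # Only reopen the firewall if the initial tests pass.
    smoke_test || return 1
    open_firewall
}
```

Calling `upgrade_cluster` with the placeholders prints the steps in the order they would run; replacing a placeholder with a real command that returns non-zero on failure stops the sequence at that point.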

Monitoring a Cluster

Firebase comes with a large set of JMX beans, which can be monitored with the standard JConsole that ships with the Java distribution. For a cluster, however, you may prefer a tool that gathers data from all servers at once, such as Nagios, instead of handling multiple console windows.
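
As a sketch of how a single monitored value could be handed to Nagios: a Nagios plugin reports its status through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of text. The function below applies that convention to an integer value. How the value is read from a Firebase JMX bean is not shown and depends on the JMX client you use; the function name, the label, and the thresholds are made up for illustration.

```shell
#!/bin/sh
# check_threshold LABEL VALUE WARN CRIT
# Prints a one-line status and returns a Nagios-style exit code.
check_threshold() {
    label=$1
    value=$2
    warn=$3
    crit=$4
    # A missing or non-integer value is reported as UNKNOWN.
    case "$value" in
        ''|*[!0-9]*)
            echo "UNKNOWN - $label: no numeric value"
            return 3
            ;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $label is $value"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $label is $value"
        return 1
    fi
    echo "OK - $label is $value"
    return 0
}
```

For example, `check_threshold "blocked threads" 60 50 100` prints `WARNING - blocked threads is 60` and returns 1.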

Failure Modes

There are five primary failure modes for a Firebase cluster:

  1. Runtime failure. For example, out-of-memory errors, deadlocks, etc.
  2. Start, stop, or failure errors. That is, errors that happen when a node goes down or is started.
    1. Firebase deadlock. This should not happen; save the logs and report a bug.
    2. Game specific deadlock / failure. This should be tested in staging before going live.
    3. Event or state drop. These errors should not occur unless the system load is very high, and Cubeia regards them as bugs.
      1. Dropped events. An event was lost either on its way in for execution or on its way out.
      2. Dropped state. A state replication failed, and the last executed event was not saved from the failing server.

A runtime failure is almost always a bug. The exception is when runtime threads are blocked due to external integrations.

Failures are handled in the best possible manner. Firebase is tested with "kill" and "kill -9", and should survive a disconnected network cable. However, more involved hardware failures may result in unspecified errors.

Dropped events or states are hard to detect. Cubeia tests Firebase using specialized games that follow a request/response pattern, with integer counters verifying legal sequences. It should be noted that even when these kinds of errors have been reported in the Cubeia test environment, they usually affect fewer than one percent of the executing tables.
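
The counter check can be illustrated with a small sketch. This is not Cubeia's test harness; it only shows the idea that if every response carries the next integer in a sequence, a dropped event or state shows up as a gap or a repeat. The function name and the one-counter-per-line input format are assumptions.

```shell
#!/bin/sh
# Reads one integer per line on stdin. Succeeds only if every value is
# exactly one greater than the value before it.
verify_sequence() {
    prev=""
    status=0
    while read -r n; do
        if [ -n "$prev" ] && [ "$n" -ne $((prev + 1)) ]; then
            echo "illegal sequence: $prev followed by $n"
            status=1
        fi
        prev=$n
    done
    return $status
}
```

For example, `printf '1\n2\n4\n' | verify_sequence` prints `illegal sequence: 2 followed by 4` and returns 1, while an unbroken sequence prints nothing and returns 0.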
