# 2 Node Cluster

## Overview

In this scenario, we are going to set up two Storware Backup & Recovery servers in High Availability, Active/Passive mode. This is made possible by tools such as Pacemaker and Corosync, so at least a basic understanding of these is highly desirable. This how-to is intended for RPM-based systems such as Red Hat / CentOS. If you run Storware Backup & Recovery on a different OS, you may need to refer to your distribution docs.

Our environment is built on the following assumptions:

1. **node1** - first Storware Backup & Recovery server + Storware Backup & Recovery node, IP: `10.41.0.4`
2. **node2** - second Storware Backup & Recovery server + Storware Backup & Recovery node, IP: `10.41.0.5`
3. **Cluster IP**: `10.41.0.10` - we will use this IP to connect to the **active** Storware Backup & Recovery service. This IP will float between our servers and will always point to the active instance.
4. MariaDB master <-> master replication

{% hint style="info" %}
Make sure to run all of the commands with administrative privileges. For simplicity, the following commands will be executed as `root`.
{% endhint %}

![](https://github.com/Storware/backup-and-recovery-manual/blob/master/deployment/.gitbook/assets/overview-high_availability%20\(1\)%20\(1\)%20\(1\)%20\(1\)%20\(1\)%20\(1\)%20\(1\).png)

## HA cluster setup

### **Preparing the environment**

1. **Stop and disable the Storware Backup & Recovery server, node and database**, as the cluster will manage these resources.

   ```
   systemctl disable vprotect-server vprotect-node mariadb --now
   ```
2. **Enable HA repo:**

   ```
   dnf config-manager --enable highavailability
   ```
3. Use yum to **check for and apply pending updates**

   ```
   yum update
   ```
4. **Check the hosts file** `/etc/hosts`, as you might find an entry such as:

   ```
   127.0.0.1 <your_hostname_here>
   ```

   **Delete it,** as this prevents the cluster from functioning properly (your nodes will not "see" each other) and add entries of your two nodes:

   ```
   NODE1_IP <easy_to_use_node1_name>
   NODE2_IP <easy_to_use_node2_name>
   ```

   In this case, we will add:

   ```
   10.41.0.4  node1
   10.41.0.5  node2
   ```

### Installation

{% hint style="info" %}
Run these commands on **both servers**
{% endhint %}

1. **On both servers run**

   ```
   yum install -y pacemaker pcs psmisc policycoreutils-python-utils
   ```
2. **Add a firewall rule to allow HA traffic** - TCP ports 2224, 3121, and 21064, and UDP port 5405 (both servers)

   ```
   firewall-cmd --permanent --add-service=high-availability
   firewall-cmd --reload
   ```
3. **(Optional)** While testing, depending on your environment, you may encounter problems related to network traffic, permissions, etc. It might be convenient to temporarily disable the firewall and SELinux, but we do not recommend disabling these mechanisms in a production environment, as doing so creates significant security issues.\
   **If you choose to disable the firewall, bear in mind that Storware will no longer be available on ports 80/443. Instead, connect to ports 8080/8181 respectively.**

   ```
   # setenforce 0
   # sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config
   # systemctl mask firewalld.service
   # systemctl stop firewalld.service
   # iptables --flush
   ```
4. **Enable and start PCS daemon**

   ```
   systemctl enable pcsd.service --now
   ```
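
Before moving on, you can check that the PCS daemon is listening and that the firewall rule is active:

```
systemctl status pcsd --no-pager
ss -tlnp | grep 2224                                     # pcsd listens on TCP port 2224
firewall-cmd --list-services | grep high-availability
```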

### **Cluster configuration**

Installation of the `pcs` package automatically creates the user **hacluster**, with password authentication disabled. While this may be fine for running commands locally, you will need a password for this account to perform the rest of the configuration - **configure the same password on both nodes:**

* Set password for **hacluster**

  ```
  passwd hacluster
  Changing password for user hacluster.
  New password:
  Retype new password:
  passwd: all authentication tokens updated successfully.
  ```
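
If you prefer to script this step, the password can also be set non-interactively (the password below is only a placeholder - pick your own and use the same one on both nodes):

```
# placeholder password - replace with your own
echo "MyS3curePassw0rd" | passwd --stdin hacluster
```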

**Corosync configuration**

1. On node1, issue a command to authenticate as the **hacluster** user:

   ```
   [root@node1 ~]# pcs host auth node1 node2
   Username: hacluster
   Password:
   node1: Authorized
   node2: Authorized
   ```
2. **Generate and synchronise the corosync configuration**

   ```
   [root@node1 ~]# pcs cluster setup sbrcluster node1 node2
   ```

   Take a look at your output, which should look similar to the one below:

   ```
   Destroying cluster on nodes: node1, node2...
   node1: Stopping Cluster (pacemaker)...
   node2: Stopping Cluster (pacemaker)...
   node1: Successfully destroyed cluster
   node2: Successfully destroyed cluster

   Sending 'pacemaker_remote authkey' to 'node1', 'node2'
   node1: successful distribution of the file 'pacemaker_remote authkey'
   node2: successful distribution of the file 'pacemaker_remote authkey'
   Sending cluster config files to the nodes...
   node1: Succeeded
   node2: Succeeded

   Synchronizing pcsd certificates on nodes node1, node2...
   node1: Success
   node2: Success
   Restarting pcsd on the nodes in order to reload the certificates...
   node1: Success
   node2: Success
   ```
3. **Enable and start your new cluster**

   ```
   [root@node1 ~]# pcs cluster start --all && pcs cluster enable --all
   node1: Starting Cluster (corosync)...
   node2: Starting Cluster (corosync)...
   node1: Starting Cluster (pacemaker)...
   node2: Starting Cluster (pacemaker)...
   node1: Cluster Enabled
   node2: Cluster Enabled
   ```
4. OK! Your cluster is now enabled. You have not created any resources (such as a floating IP) yet, but before you proceed, there are still a few settings to modify. Because we are using only two nodes, you need to **disable the default quorum policy** (this command should not return any output)

   ```
   [root@node1 ~]# pcs property set no-quorum-policy=ignore
   ```
5. You should also **define default failure settings.** These two settings combined determine how many failures may occur before a node is marked as ineligible to host a resource, and after what time this restriction is lifted. We define the defaults here, but it may be a good idea to also set these values at the resource level, depending on your experience. **Run these commands**:

   ```
   [root@node1 ~]# pcs resource defaults failure-timeout=30s
   [root@node1 ~]# pcs resource defaults migration-threshold=3
   ```
6. Since we are not using any fencing device in this environment, you need to **disable stonith.** The second part of this command verifies the running configuration; a short cluster health check is also shown after this list. These commands normally do not return any output. **Run this command**:

   ```
   [root@node1 ~]# pcs property set stonith-enabled=false && crm_verify -L
   ```
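
Before moving on to resource creation, it may be worth a quick cluster health check; a minimal sketch using standard `pcs` tooling:

```
pcs status                 # both nodes should be reported as Online
pcs property list          # confirm stonith-enabled=false and no-quorum-policy=ignore
corosync-cfgtool -s        # corosync link status on the local node
```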

### **Resource creation**

1. First, you will create a resource that represents the **floating IP** 10.41.0.10.

{% hint style="info" %}
From this moment, you need to use this IP when connecting to your Storware Backup & Recovery server.
{% endhint %}

2. Adjust your IP and cidr\_netmask, and you're good to go:

   ```
   [root@node1 ~]# pcs resource create "Failover_IP" ocf:heartbeat:IPaddr2 ip=10.41.0.10 cidr_netmask=22 op monitor interval=30s
   ```
3. Immediately, you should see the IP up and running on one of the nodes (most likely the one on which you issued this command):

   ```
   [root@node1 ~]# ip a
   [..]
   2: ens160:  mtu 1500 qdisc mq state UP group default qlen 1000
       link/ether 00:50:56:a6:9f:c6 brd ff:ff:ff:ff:ff:ff
       inet 10.41.0.4/22 brd 10.41.3.255 scope global ens160
          valid_lft forever preferred_lft forever
       inet 10.41.0.10/22 brd 10.41.3.255 scope global secondary ens160
          valid_lft forever preferred_lft forever
       inet6 fe80::250:56ff:fea6:9fc6/64 scope link
          valid_lft forever preferred_lft forever
   ```
4. As you can see, the floating IP 10.41.0.10 has been successfully assigned as the second IP of interface ens160. You should also check that the Storware Backup & Recovery web interface is up and running, by opening a web browser and navigating to [https://10.41.0.10](https://10.41.0.10/).
5. The next step is to **define a resource** responsible for monitoring network connectivity. Note that you need to use **your gateway IP** in the **host\_list** parameter.

   ```
   [root@node1 ~]# pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=10.41.0.1 clone
   [root@node1 ~]# pcs constraint location Failover_IP rule score=-INFINITY pingd lt 1 or not_defined pingd
   ```
6. You have to **define a set of cluster resources** responsible for the other services crucial to the Storware node and the server itself. Here, we will logically link these services with the floating IP: whenever the floating IP disappears from a server, these services will be stopped. You also have to define the proper order for services to start and stop, as, for example, starting the Storware server without a running database makes little sense. Placing both resources in the group `vProtect-group` takes care of this: members of a group run on the same node and start in the order they are listed.

   ```
   [root@node1 ~]# pcs resource create "vProtect-node" systemd:vprotect-node op monitor timeout=300s on-fail="stop" --group vProtect-group
   [root@node1 ~]# pcs resource create "vProtect-server" service:vprotect-server op start on-fail="stop" timeout="300s" op stop timeout="300s" on-fail="stop" op monitor timeout="300s" on-fail="stop" --group vProtect-group
   ```

   These commands do not return any output.
7. **Define resource colocation**

   ```
   [root@node1 ~]# pcs constraint colocation add Failover_IP with vProtect-group
   ```
8. **Set node preference**

   ```
   [root@node1 ~]# pcs constraint location Failover_IP prefers node1=INFINITY
   [root@node1 ~]# pcs constraint location vProtect-group prefers node1=INFINITY
   ```

At this point, the pacemaker HA cluster is functional.
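
You can review the resources and the constraints you have just defined:

```
pcs status resources       # Failover_IP and vProtect-group should be Started
pcs constraint             # lists the location and colocation constraints
```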

However, there is still one thing we need to consider - **creating DB replication**.

### MariaDB replication

In this section, we explain how to set up master<->master MariaDB replication.

1. **On both nodes**, if you have the firewall enabled, **allow** communication via **port** 3306

   ```
   firewall-cmd --add-port=3306/tcp --permanent
   firewall-cmd --complete-reload
   ```
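
You can verify that the port has been opened:

```
firewall-cmd --list-ports   # the output should include 3306/tcp
```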

#### Steps to run on the first server, node1: 10.41.0.4

This server will be the source of DB replication.

1. **Stop the Storware server, node and database**

   ```
   [root@node1 ~]# systemctl stop vprotect-server vprotect-node mariadb
   ```
2. **Copy your license and node information** from the first node to the second node:

   ```
   [root@node1 ~]# scp /opt/vprotect/node/.session.properties node2:/opt/vprotect/node/.session.properties
   [root@node1 ~]# scp /opt/vprotect/server/license.key node2:/opt/vprotect/server/license.key
   ```
3. **Edit the config file, enable binary logging, and start MariaDB again**. Depending on your distribution, the config file location may vary. Most likely it is `/etc/my.cnf` or `/etc/my.cnf.d/server.cnf`

   In the **\[mysqld]** section, add the lines:

   ```
   [root@node1 ~]# vi /etc/my.cnf.d/server.cnf
   .
   .
   [mysqld]
   log-bin
   server_id=1
   replicate-do-db=vprotect
   .
   .
   [root@node1 ~]# systemctl start mariadb
   ```
4. Now **log in to your MariaDB**, **create** a user used for replication, and assign appropriate rights to it.

   For the purpose of this task, we will set the username to 'replicator' and the password to `R3pLic4ti0N`

   ```
   [root@node1 ~]# mysql -u root -p
   Enter password:
   [..]
   MariaDB [(none)]> create user 'replicator'@'%' identified by 'R3pLic4ti0N';
   Query OK, 0 rows affected (0.026 sec)

   MariaDB [(none)]> grant replication slave on *.* to 'replicator'@'%';
   Query OK, 0 rows affected (0.001 sec)

   MariaDB [(none)]> FLUSH PRIVILEGES;
   Query OK, 0 rows affected (0.001 sec)
   ```

   **Don't log out** just yet, as we still need to check the master status.
5. **Write down the log file name and position**, as they are required for proper slave configuration.

   ```
   MariaDB [(none)]> show master status;
   +----------------------+----------+--------------+------------------+
   | File                 | Position | Binlog_Do_DB | Binlog_Ignore_DB |
   +----------------------+----------+--------------+------------------+
   | node1-bin.000007     |    46109 |              |                  |
   +----------------------+----------+--------------+------------------+
   ```
6. Dump the vprotect database and copy it onto the second server (node2).

   ```
   [root@node1 ~]# mysqldump -u root -p vprotect > /tmp/vprotect.sql
   [root@node1 ~]# scp /tmp/vprotect.sql root@node2:/tmp/
   ```
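
Before switching to node2, it may be worth confirming that binary logging is now active (assuming root access to MariaDB):

```
mysql -u root -p -e "SHOW VARIABLES LIKE 'log_bin';"   # expect: log_bin | ON
```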

#### Steps to run on the 2nd server, node2: 10.41.0.5

1. **Stop the Storware server, node, and database**

   ```
   [root@node2 ~]# systemctl stop vprotect-server vprotect-node mariadb
   ```
2. **Edit** the MariaDB **config** file. **Assign a different server id**, for example: 2. Then **start MariaDB**.

   ```
   [root@node2 ~]# vi /etc/my.cnf.d/server.cnf
   [mysqld]
   log-bin
   server_id=2
   replicate-do-db=vprotect
   [root@node2 ~]# systemctl start mariadb
   ```
3. **Load the database dump** copied from node1.

   ```
   [root@node2 ~]# mysql -u root -p vprotect < /tmp/vprotect.sql
   ```

At this point, you have two identical databases on your two servers.

4. **Log in to the MariaDB instance, create a replication user with a password**. Use the same user as on node1. Grant the necessary permissions.

   Set the master host. You ***must*** use the `MASTER_LOG_FILE` and `MASTER_LOG_POS` values written down earlier. Change the IP of the master host to match your network configuration.

   ```
   MariaDB [(none)]> STOP SLAVE;
   MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='10.41.0.4', MASTER_USER='replicator', MASTER_PASSWORD='R3pLic4ti0N', MASTER_LOG_FILE='node1-bin.000007', MASTER_LOG_POS=46109;
   Query OK, 0 rows affected (0.004 sec)
   ```
5. **Start the slave**, check the master status, and **write down the file name and position.**

   ```
   MariaDB [(none)]> start slave;
   Query OK, 0 rows affected (0.001 sec)

   MariaDB [(none)]> SHOW MASTER STATUS;
   +----------------------+----------+--------------+------------------+
   | File                 | Position | Binlog_Do_DB | Binlog_Ignore_DB |
   +----------------------+----------+--------------+------------------+
   | node2-bin.000002     |   501051 |              |                  |
   +----------------------+----------+--------------+------------------+
   1 row in set (0.000 sec)
   ```

**Go back to the first server (node1)**

1. **Stop the slave,** then change the master host using the parameters noted down in the previous step. Also, **change** the master host IP to match your network configuration.

   ```
   MariaDB [(none)]> stop slave;
   MariaDB [(none)]> change master to master_host='10.41.0.5', master_user='replicator', master_password='R3pLic4ti0N', master_log_file='node2-bin.000002', master_log_pos=501051;
   Query OK, 0 rows affected (0.004 sec)
   MariaDB [(none)]> start slave;
   Query OK, 0 rows affected (0.001 sec)
   ```

At this point, you have successfully configured MariaDB master<->master replication.
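
To confirm that replication is healthy, check the slave threads on **both** nodes; both values should report `Yes`:

```
mysql -u root -p -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Slave_SQL_Running"
```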

## Testing the setup

The fastest way to test our setup is to invoke

```
pcs node standby node1
```

This puts node1 into standby mode, which prevents it from hosting any cluster resources.

After a while, you should see your resources up and running on node2.
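
Once you have confirmed the failover, bring node1 back; because of the INFINITY location preference set earlier, the resources should migrate back to it:

```
pcs node unstandby node1
pcs status                 # after a while, resources should return to node1
```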

Note that if you perform a normal OS shutdown (not a forced one), Pacemaker will wait a long time for the node to come back online, which in fact prevents the shutdown from completing. As a result, resources ***will not*** switch correctly to the other node.
