Group Replication and Troubleshooting

This page provides quick-reference and troubleshooting information for Porta on Prem with MySQL’s group replication.

QuickRef

Main machine database has container name porta-db
- MySQL port 3306
- Replication port 33061
Backup machine database has container name porta-db-2
- MySQL port 3307
- Replication port 33062
Arbiter machine database has container name porta-db-3
- MySQL port 3308
- Replication port 33063

View database logs

These logs will contain information about replication, and other MySQL logs.

Helper: porta-onprem-bundle\porta-helpers\porta-database\view-database-log.bat

View database replication group members as reported by this machine

Helper: porta-onprem-bundle\porta-helpers\porta-database\view-group-repl-status.bat

View database replication group members as reported by each machine

Helper: porta-onprem-bundle\porta-helpers\porta-database\view-ALL-group-repl-status.bat

Start Group Replication

Helper: porta-onprem-bundle\porta-helpers\porta-database\actions\START-repl.bat

Stop Group Replication

Helper: porta-onprem-bundle\porta-helpers\porta-database\actions\STOP-repl.bat

Create a backup of the database

See: Disaster Backup and Restoration

Restore from a backup of the database

See: Disaster Backup and Restoration

Resetting a Database Container

See the “! Reset Database !” section in Replication Group Member Recovery.

Successful Log Messages

Non-Primary Member

Example of successful log messages for a non-primary member:

# The database has found the group to join and it has a primary member
2024-01-24T14:44:29.647313Z 16 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address 10.100.100.176:3306.'

# If there are changes to be applied, the database will begin applying them
2024-01-29T14:42:36.696980Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'

# The database has joined the group and begun replicating
2024-01-24T14:44:29.647977Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.100.100.176:3306, 10.100.100.177:3307 on view 17061069903020065:2.'

# The database has successfully joined the group and its data is in sync with the other databases
2024-01-24T14:44:39.153344Z 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'
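
When scanning a long log for these messages, it can help to split each line into its fields. The following is a minimal sketch, assuming the log lines have the same shape as the examples above (timestamp, thread id, level, error code, subsystem, message). The names `parse_log_line` and `is_online` are hypothetical and not part of the Porta helpers.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Matches lines shaped like the examples in this section:
# 2024-01-24T14:44:39.153344Z 0 [System] [MY-011490] [Repl] <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<thread>\d+)\s+"
    r"\[(?P<level>[^\]]+)\]\s+\[(?P<code>MY-\d+)\]\s+\[(?P<subsystem>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    thread: int
    level: str
    code: str
    subsystem: str
    message: str


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Return the parsed fields, or None if the line has a different shape."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogEntry(
        m["timestamp"], int(m["thread"]), m["level"],
        m["code"], m["subsystem"], m["message"],
    )


def is_online(lines: Iterable[str]) -> bool:
    """True if any line carries MY-011490, the 'declared online' message shown above."""
    entries = (parse_log_line(line) for line in lines)
    return any(e is not None and e.code == "MY-011490" for e in entries)
```

Lines that do not match, such as the `#` annotations in the example, return `None` and are ignored by `is_online`.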

Primary Member

Example of successful log messages for a primary member:

# The replication group lost its primary member and has elected a new one
2024-01-29T15:13:13.328189Z 0 [System] [MY-011507] [Repl] Plugin group_replication reported: 'A new primary with address 192.168.50.42:3308 was elected. The new primary will execute all previous group transactions before allowing writes.'

# This database has been elected as the new primary
2024-01-29T15:13:13.634688Z 22 [System] [MY-011510] [Repl] Plugin group_replication reported: 'This server is working as primary member.'

Troubleshooting Errors

`ERROR 3092 (HY000) at line 1: The server is not configured properly to be an active member of the group.`

The cause of this error depends on the details in this machine’s error log (and often the logs of other machines). Check this machine’s error log and look up the errors you find in this document. It may also help to check the logs of the other machines in the group for errors around the same timestamp.

Each Machine Only Sees Itself

If each machine only sees itself in the replication group, this likely means that there was no existing group to join. This can happen if the machines are not able to communicate with each other, or if the group replication process was not bootstrapped on any of the machines.

Example after viewing replication group status on each machine (each will have one OFFLINE entry like this):

+---------------------------+--------------------------------------+----------------+-------------+--------------+-------------+----------------+----------------------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | MEMBER_COMMUNICATION_STACK |
+---------------------------+--------------------------------------+----------------+-------------+--------------+-------------+----------------+----------------------------+
| group_replication_applier | 3059e925-bc43-11ee-a26c-0242ac120002 | 10.100.100.177 |        3307 | OFFLINE      |             |                | XCom                       |
+---------------------------+--------------------------------------+----------------+-------------+--------------+-------------+----------------+----------------------------+

To fix this, group replication needs to be restarted on each machine: See Bootstrap the Replication Group in the recovery guide.
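
To check for this condition without reading each table by eye, the status output can be parsed. This is a rough sketch, assuming the helper prints a standard MySQL ASCII table like the example above (presumably a query on `performance_schema.replication_group_members`). The function names are hypothetical.

```python
from typing import Dict, List


def parse_member_table(text: str) -> List[Dict[str, str]]:
    """Turn a MySQL ASCII table into one dict per data row, keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first | ... | line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def sees_only_itself_offline(rows: List[Dict[str, str]]) -> bool:
    """True when the machine reports a single member and that member is OFFLINE."""
    return len(rows) == 1 and rows[0].get("MEMBER_STATE") == "OFFLINE"
```

If `sees_only_itself_offline` is true for the output captured on every machine, the group most likely was never bootstrapped.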

A Joining Member Creates Its Own Group Instead of Joining Existing Group

If a joining member creates its own group, this likely means that the group_replication_bootstrap_group setting is set to ON for the joining member.

Run the set-bootstrap-OFF.bat helper in porta-helpers\porta-database\actions\caution to turn this off.
Run the restart-repl.bat helper in porta-helpers\porta-database\actions to restart replication on the machine.
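
For orientation, the sketch below lists the standard MySQL statements that correspond to the two situations. The contents of the .bat helpers are not shown on this page, so treat this as an assumption about what they wrap rather than a description of them. The function name `replication_statements` is hypothetical.

```python
from typing import List


def replication_statements(bootstrap: bool) -> List[str]:
    """Return the SQL statements for bootstrapping a group or for (re)joining one.

    Assumption: these are the stock MySQL Group Replication statements; the
    Porta helpers may run something different.
    """
    if bootstrap:
        # Only ONE member should ever bootstrap, and the flag is turned
        # off again immediately so a later restart cannot create a second group.
        return [
            "SET GLOBAL group_replication_bootstrap_group = ON;",
            "START GROUP_REPLICATION;",
            "SET GLOBAL group_replication_bootstrap_group = OFF;",
        ]
    # A joining member that formed its own group: clear the flag, leave the
    # accidental group, then start again so it looks for the existing group.
    return [
        "SET GLOBAL group_replication_bootstrap_group = OFF;",
        "STOP GROUP_REPLICATION;",
        "START GROUP_REPLICATION;",
    ]
```

The current value can be inspected with `SHOW VARIABLES LIKE 'group_replication_bootstrap_group';` before and after running the helpers.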

Error joining existing group

`Timeout while waiting for the group communication engine to be ready` / `Error connecting to all peers`

Another member’s logs might also display `Old incarnation found while trying to add node`.

Error log example (for the member that is failing to join):

[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Timeout while waiting for the group communication engine to be ready!'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The group communication engine is not ready for the member to join. Local port: 33062'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 33062'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node porta-db-3:33063 when joining a group. My local port is: 33062.'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node porta-db-3:33063 when joining a group. My local port is: 33062.'
[ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 33062'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 33062'
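
Because several of these messages point to different sections of this page, a small lookup can speed up triage. The sketch below matches log lines against message fragments quoted in this document. The mapping and the name `sections_for` are illustrative only, and the list is not exhaustive.

```python
from typing import Iterable, List

# (fragment to look for in a log line, section of this page to read)
SIGNATURES = [
    ("Old incarnation found while trying to add node",
     "Old incarnation found while trying to add node"),
    ("This member has more executed transactions than those present in the group",
     "This member has more executed transactions than those present in the group"),
    ("Timeout while waiting for the group communication engine to be ready",
     "Error joining existing group"),
    ("Error connecting to all peers",
     "Error joining existing group"),
    ("not configured properly to be an active member of the group",
     "Troubleshooting Errors"),
]


def sections_for(log_lines: Iterable[str]) -> List[str]:
    """Return the relevant section names, in order of first appearance, without repeats."""
    found: List[str] = []
    for line in log_lines:
        for fragment, section in SIGNATURES:
            if fragment in line and section not in found:
                found.append(section)
    return found
```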

If Error is Happening On First Join

Try:

Ping the other machine
- ping <MACHINE_IP>
Telnet (from PowerShell)
- telnet <MACHINE_IP> <DATABASE_PORT>
Remote MySQL access
- porta-helpers\porta-run\sysadmin\CLI\access-mysql.bat

If these all work, check the hosts file and ensure the IP-to-container-name mapping is correct on both machines. (This was the issue the last time this error was seen during setup.)
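
The telnet check can also be scripted so that every MySQL and replication port is tested in one pass. This is a minimal sketch using the container names and ports from the QuickRef. It assumes the names resolve from the machine where it runs (for example through the hosts file), so a failure here may point at the hosts mapping rather than the network. `port_is_open` and `check_all` are hypothetical names.

```python
import socket
from typing import Dict, Tuple

# Container name -> (MySQL port, replication port), taken from the QuickRef.
MEMBERS: Dict[str, Tuple[int, int]] = {
    "porta-db": (3306, 33061),
    "porta-db-2": (3307, 33062),
    "porta-db-3": (3308, 33063),
}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and names that do not resolve.
        return False


def check_all(members: Dict[str, Tuple[int, int]] = MEMBERS) -> Dict[str, Dict[int, bool]]:
    """Check both ports of every member and return {name: {port: reachable}}."""
    return {
        name: {port: port_is_open(name, port) for port in ports}
        for name, ports in members.items()
    }
```

Calling `check_all()` on each machine and comparing the results shows whether a failure is specific to one direction or one port.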

If Error is Happening On Re-join

This error often occurs if there is no bootstrapped member. Run this check to get a sense of the state of the cluster: View Database Replication Group Members as reported by each machine

If each machine is OFFLINE and only sees itself, then you need to bootstrap the group replication process on one of the machines. See Bootstrap the Replication Group in the recovery guide.
If a machine is listed as UNREACHABLE, check the details in its logs and see the section Old incarnation found while trying to add node.
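
The two cases above can be expressed as a small decision function. This is a sketch only: it assumes the per-machine status has already been collected into a dictionary, and the name `triage` and its return values are hypothetical labels for the sections of this page.

```python
from typing import Dict, List, Tuple

# machine name -> list of (MEMBER_HOST, MEMBER_STATE) rows as reported by that machine
Views = Dict[str, List[Tuple[str, str]]]


def triage(views: Views) -> str:
    """Suggest which path to follow based on what each machine reports."""
    # Any UNREACHABLE entry: read that member's logs and the "Old incarnation" section.
    if any(state == "UNREACHABLE" for rows in views.values() for _, state in rows):
        return "unreachable"
    # Every machine sees exactly one OFFLINE member (itself): nothing is bootstrapped.
    if views and all(
        len(rows) == 1 and rows[0][1] == "OFFLINE" for rows in views.values()
    ):
        return "bootstrap"
    # Anything else needs a closer look at the database logs.
    return "inspect-logs"
```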

`Old incarnation found while trying to add node`

If another member (e.g., porta-db-3) has already tried and failed to join the group and is listed as OFFLINE, this warning may appear in the existing member’s logs when the failed member tries to START GROUP_REPLICATION again.

Example of what this log entry would look like on the primary member in an existing group:

[Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Old incarnation found while trying to add node porta-db-3:33063 17339258949477546. Please stop the old node or wait for it to leave the group.'

Example of what this log entry would look like on the member that is not able to join the existing group:

[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 33063'
[ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'

Run this check on each machine to get a sense of the state of the cluster: View Database Replication Group Members as reported by each machine

There is an `UNREACHABLE` Primary Member

One of the machines will likely display a member’s MEMBER_STATE as UNREACHABLE:

+---------------------------+--------------------------------------+----------------+-------------+--------------+-------------+----------------+----------------------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | MEMBER_COMMUNICATION_STACK |
+---------------------------+--------------------------------------+----------------+-------------+--------------+-------------+----------------+----------------------------+
| group_replication_applier | e0ff1ffa-bac5-11ee-9fc8-0242ac120002 | 10.100.100.176 |        3306 | UNREACHABLE  | PRIMARY     | 8.0.32         | XCom                       |
| group_replication_applier | faa25e2f-bac6-11ee-a05b-0242ac120005 | 10.100.100.177 |        3307 | ONLINE       | SECONDARY   | 8.0.32         | XCom                       |
| group_replication_applier | faa25e2f-bac6-11ee-a05b-0242ac120003 | 10.100.100.178 |        3308 | ONLINE       | SECONDARY   | 8.0.32         | XCom                       |
+---------------------------+--------------------------------------+----------------+-------------+--------------+-------------+----------------+----------------------------+

NOTE: When the UNREACHABLE member also has a MEMBER_ROLE of PRIMARY, no new primary will be elected until the unreachable member has been kicked out. If there is a majority remaining, the member should be kicked out automatically after a timeout period.

If there is no majority remaining to vote the member out, or you would like to remove the member quickly, group replication needs to be restarted on each machine: See Bootstrap the Replication Group in the recovery guide.
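
Whether a majority remains is simple arithmetic: strictly more than half of the group's members must still be reachable. The sketch below works it out from the MEMBER_STATE column. It treats every state other than UNREACHABLE as reachable, which is a simplifying assumption, and the function names are hypothetical.

```python
from typing import Iterable


def has_majority(group_size: int, reachable: int) -> bool:
    """True when strictly more than half of the group is reachable."""
    return reachable > group_size // 2


def majority_from_states(states: Iterable[str]) -> bool:
    """Apply has_majority to a list of MEMBER_STATE values from one machine's view."""
    states = list(states)
    reachable = sum(1 for state in states if state != "UNREACHABLE")
    return has_majority(len(states), reachable)
```

In the three-member example above, two ONLINE members out of three form a majority, so the UNREACHABLE primary should be expelled automatically. With only one of three reachable, there is no majority and the group has to be bootstrapped again.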

There is no `UNREACHABLE` Primary Member

If there is no UNREACHABLE primary member, then there may be a problem with transactions. Check the machine’s database logs (see section View database logs) for details. If the logs contain “This member has more executed transactions than those present in the group”, see the section This member has more executed transactions than those present in the group.

`This member has more executed transactions than those present in the group`

Sometimes this can occur when attempting to join a group that the member is already a part of. Check the group members from the failed joiner to confirm that it is not already a member of the group: View Database Replication Group Members as reported by each machine. If all members are present in the result, then the member is already a part of the group.

If the member is not a part of the group, this error can also occur if a member has a higher gtid_executed value than the existing members. This can happen if the member was previously part of a group, or was the primary of its own group, and is now being added to a different group.
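
To see which transactions the joiner has that the group does not, the two gtid_executed values can be compared. The sketch below does this offline in Python, assuming the classic untagged GTID set format (`uuid:1-5:7,uuid:1-3`). The function names are hypothetical and the UUIDs in the usage note are made up. On the server itself, MySQL's built-in `GTID_SUBTRACT()` gives the same answer.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end)


def parse_gtid_set(text: str) -> Dict[str, List[Interval]]:
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: sorted list of inclusive intervals}."""
    result: Dict[str, List[Interval]] = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        bucket = result.setdefault(uuid.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            bucket.append((int(start), int(end or start)))
    for bucket in result.values():
        bucket.sort()
    return result


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the parts of intervals a that are not covered by intervals b."""
    out: List[Interval] = []
    for start, end in a:
        cur = start
        for b_start, b_end in b:
            if b_end < cur:
                continue  # this b interval lies entirely before the current position
            if b_start > end:
                break  # this and all later b intervals lie after the a interval
            if b_start > cur:
                out.append((cur, b_start - 1))
            cur = max(cur, b_end + 1)
            if cur > end:
                break
        if cur <= end:
            out.append((cur, end))
    return out


def extra_transactions(joiner: str, group: str) -> Dict[str, List[Interval]]:
    """Transactions present in the joiner's gtid_executed but missing from the group's."""
    joiner_set, group_set = parse_gtid_set(joiner), parse_gtid_set(group)
    extra = {}
    for uuid, intervals in joiner_set.items():
        missing = subtract(intervals, group_set.get(uuid, []))
        if missing:
            extra[uuid] = missing
    return extra
```

For example, `extra_transactions("aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-10", "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-7")` reports the interval `(8, 10)` for that UUID. An empty result means the joiner has nothing the group lacks.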

In most cases, recovery follows steps similar to those in “Single Database Failing to Rejoin Group”. Ideally, wait until off hours, when a Disguise engineer or support team member can assist.