Group Replication and Troubleshooting
This page provides installation information for Porta on Prem with MySQL’s group replication.
QuickRef
- Main machine database has container name porta-db
  - MySQL port: 3306
  - Replication port: 33061
- Backup machine database has container name porta-db-2
  - MySQL port: 3307
  - Replication port: 33062
- Arbiter machine database has container name porta-db-3
  - MySQL port: 3308
  - Replication port: 33063
View database logs
These logs contain replication information as well as other MySQL log output.
- Helper: porta-onprem-bundle\porta-helpers\porta-database\view-database-log.bat
View database replication group members as reported by this machine
- Helper: porta-onprem-bundle\porta-helpers\porta-database\view-group-repl-status.bat
View database replication group members as reported by each machine
- Helper: porta-onprem-bundle\porta-helpers\porta-database\view-ALL-group-repl-status.bat
Start Group Replication
- Helper: porta-onprem-bundle\porta-helpers\porta-database\actions\START-repl.bat
Stop Group Replication
- Helper: porta-onprem-bundle\porta-helpers\porta-database\actions\STOP-repl.bat
Create a backup of the database
See: Disaster Backup and Restoration
Restore from a backup of the database
See: Disaster Backup and Restoration
Resetting a Database Container
See the “! Reset Database !” section in Replication Group Member Recovery
Successful Log Messages
Non-Primary Member
Example of successful log messages for a non-primary member:
Primary Member
Example of successful log messages for a primary member:
Troubleshooting Errors
ERROR 3092 (HY000) at line 1: The server is not configured properly to be an active member of the group.
The cause of this error depends on the details in this machine's error log (and often the logs of the other machines). Check the error logs of this machine and look for those errors in this document. It may also help to check the logs of the other machines in the group for errors around the same timestamp.
Each Machine Only Sees Itself
If each machine only sees itself in the replication group, this likely means that there was no existing group to join. This can happen if the machines are not able to communicate with each other, or if the group replication process was not bootstrapped on any of the machines.
Example after viewing the replication group status on each machine (each will have one OFFLINE entry like this):
To fix this, group replication needs to be restarted on each machine: See Bootstrap the Replication Group in the recovery guide.
A Joining Member Creates Its Own Group Instead of Joining Existing Group
If a joining member creates its own group, this likely means that the group_replication_bootstrap_group setting is set to ON for the joining member.
- Run the set-bootstrap-OFF.bat helper in porta-helpers/porta-database/actions/caution to turn this off.
- Run the restart-repl.bat helper in porta-helpers/porta-database/actions to restart replication on the machine.
Error joining existing group
Timeout while waiting for the group communication engine to be ready / Error connecting to all peers
Another member’s logs might also display “Old incarnation found while trying to add node”.
Error log example (for the member that is failing to join):
If Error is Happening On First Join
Try:
- pinging (ping <MACHINE_IP>)
- telnet (from PowerShell): telnet <MACHINE_IP> <DATABASE_PORT>
- remote mysql access: porta-helpers\porta-run\sysadmin\CLI\access-mysql.bat
If these all work, check the hosts file and ensure the IP to container name mapping is correct on both machines. (This was the issue the last time this error was seen during setup).
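The hosts-file check can be partly automated. The following is a minimal sketch, not part of the Porta bundle: it parses hosts-file text and reports container names that are missing or mapped to an unexpected IP. The function name, the sample IPs, and the expected mapping are illustrative assumptions; only the container names come from the QuickRef.

```python
def hosts_mapping_issues(hosts_text, expected):
    """Compare hosts-file text against an expected {name: ip} mapping.

    Returns a list of human-readable problems; an empty list means every
    expected name maps to the expected IP.
    """
    found = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Keep the first entry seen for a name (assumed resolver behaviour).
            found.setdefault(name.lower(), ip)

    issues = []
    for name, ip in expected.items():
        actual = found.get(name.lower())
        if actual is None:
            issues.append(f"{name}: no entry in hosts file")
        elif actual != ip:
            issues.append(f"{name}: mapped to {actual}, expected {ip}")
    return issues


# Hypothetical IPs for illustration; substitute the real machine addresses.
EXPECTED = {
    "porta-db": "10.0.0.11",
    "porta-db-2": "10.0.0.12",
    "porta-db-3": "10.0.0.13",
}

SAMPLE_HOSTS = """
# sample hosts file content
10.0.0.11  porta-db
10.0.0.21  porta-db-2   # wrong IP
"""

for problem in hosts_mapping_issues(SAMPLE_HOSTS, EXPECTED):
    print(problem)
```

To use it against a real machine, read the hosts file (on Windows this is typically C:\Windows\System32\drivers\etc\hosts) and pass its contents as hosts_text. Run it on each machine, since the mapping must be correct on both sides.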
If Error is Happening On Re-join
This error often occurs if there is no bootstrapped member. Run this check to get a sense of the state of the cluster: View Database Replication Group Members as reported by each machine
- If each machine is OFFLINE and only sees itself, then you need to bootstrap the group replication process on one of the machines. See Bootstrap the Replication Group in the recovery guide.
- If a machine is listed as UNREACHABLE, check the details in its logs and see the section Old incarnation found while trying to add node.
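The decision above can be written down as a small function. This is a sketch under stated assumptions: the function name and the input shape (one list of (member_host, member_state) rows per machine, copied by hand from the status helper output) are hypothetical, and the final fallback to the logs is an assumption rather than a rule from this guide.

```python
def recommend_next_step(views):
    """Suggest which section of this guide applies.

    `views` maps each machine name to the member rows that machine reports,
    as (member_host, member_state) tuples.
    """
    all_rows = [row for rows in views.values() for row in rows]

    # Any UNREACHABLE member points to the "Old incarnation" section.
    if any(state == "UNREACHABLE" for _, state in all_rows):
        return "Old incarnation found while trying to add node"

    # Every machine reports exactly one row, and that row is OFFLINE:
    # nobody has bootstrapped a group yet.
    isolated = all(
        len(rows) == 1 and rows[0][1] == "OFFLINE" for rows in views.values()
    )
    if views and isolated:
        return "Bootstrap the Replication Group"

    # Assumed fallback: nothing obvious in the member lists, so read the logs.
    return "View database logs"


example = {
    "porta-db": [("porta-db", "OFFLINE")],
    "porta-db-2": [("porta-db-2", "OFFLINE")],
    "porta-db-3": [("porta-db-3", "OFFLINE")],
}
print(recommend_next_step(example))
```

The UNREACHABLE check runs first because a machine that still lists other members is part of an existing group, and bootstrapping a second group in that situation would make things worse.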
Old incarnation found while trying to add node
If another member, e.g., porta-db-3, has already tried and failed to join the group and is listed as OFFLINE, this error may display on the existing member when trying to START GROUP_REPLICATION again.
Example of what this log entry would look like on the primary member in an existing group:
Example of what this log entry would look like on the member that is not able to join the existing group:
Run this check on each machine to get a sense of the state of the cluster: View Database Replication Group Members as reported by each machine
There is an UNREACHABLE Primary Member
One of the machines will likely display one of the member’s MEMBER_STATE as UNREACHABLE:
NOTE: When the UNREACHABLE member also has a MEMBER_ROLE of PRIMARY, no new primary will be elected until the unreachable member has been kicked out. If there is a majority remaining, the member should be kicked out automatically after a timeout period.
If there is no majority remaining to vote the member out, or you would like to remove the member quickly, group replication needs to be restarted on each machine: see Bootstrap the Replication Group in the recovery guide.
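Whether a majority remains can be worked out from the MEMBER_STATE column reported by one machine. The sketch below is a simplification with a hypothetical function name: it treats every member that is not UNREACHABLE as able to vote, and requires strictly more than half of the listed members.

```python
def reachable_majority(member_states):
    """Return True if the members still reachable form a strict majority.

    `member_states` is the MEMBER_STATE column as reported by one machine,
    one entry per listed group member (including the reporting machine).
    """
    total = len(member_states)
    # Simplifying assumption: anything not UNREACHABLE can still vote.
    reachable = sum(1 for state in member_states if state != "UNREACHABLE")
    return total > 0 and reachable > total / 2


# Three-member group (main, backup, arbiter) with one member lost.
print(reachable_majority(["ONLINE", "ONLINE", "UNREACHABLE"]))
# Same group with two members lost.
print(reachable_majority(["ONLINE", "UNREACHABLE", "UNREACHABLE"]))
```

With the three machines in the QuickRef, losing one member leaves two of three, so the group can expel it on its own. Losing two leaves one of three, so the group must be restarted.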
There is no UNREACHABLE Primary Member
If there is no UNREACHABLE primary member, then there may be a problem with transactions. Check the machine’s database logs (see section View database logs) for details. If the logs contain “This member has more executed transactions than those present in the group”, see the section This member has more executed transactions than those present in the group.
This member has more executed transactions than those present in the group
Sometimes this can occur when attempting to join a group that the member is already a part of. Check the group members from the failed joiner to confirm that it is not already a member of the group: View Database Replication Group Members as reported by each machine. If all members are present in the result, then the member is already a part of the group.
If the member is not a part of the group, this error can also occur if a member has a higher gtid_executed value than the existing members. This can happen if the member was previously part of a group, or was the primary of its own group, and is now being added to a different group.
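To see which transactions the joiner has that the group lacks, compare the gtid_executed values copied from each machine. The following is a sketch only: the function names are hypothetical, the UUIDs are made up, and it assumes the classic untagged GTID set format (uuid:first-last[:first-last...], comma separated). MySQL's built-in GTID_SUBTRACT() function performs the same comparison on the server.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(start, end), ...]} (inclusive)."""
    result = {}
    compact = "".join(text.split())  # gtid_executed output may contain newlines
    for part in compact.split(","):
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = result.setdefault(uuid.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return result


def subtract_ranges(ranges, remove):
    """Return the parts of `ranges` not covered by any interval in `remove`."""
    remaining = []
    for start, end in ranges:
        pieces = [(start, end)]
        for r_start, r_end in remove:
            next_pieces = []
            for p_start, p_end in pieces:
                if r_end < p_start or r_start > p_end:
                    next_pieces.append((p_start, p_end))  # no overlap
                    continue
                if p_start < r_start:
                    next_pieces.append((p_start, r_start - 1))  # left remainder
                if r_end < p_end:
                    next_pieces.append((r_end + 1, p_end))  # right remainder
            pieces = next_pieces
        remaining.extend(pieces)
    return remaining


def extra_transactions(joiner_gtids, group_gtids):
    """Return {uuid: ranges} executed on the joiner but absent from the group."""
    joiner = parse_gtid_set(joiner_gtids)
    group = parse_gtid_set(group_gtids)
    extra = {}
    for uuid, ranges in joiner.items():
        left = subtract_ranges(ranges, group.get(uuid, []))
        if left:
            extra[uuid] = left
    return extra


# Made-up UUIDs: the joiner has two transactions the group has never seen.
GROUP_UUID = "3e11fa47-71ca-11e1-9e33-c80aa9429562"
print(extra_transactions(f"{GROUP_UUID}:1-12", f"{GROUP_UUID}:1-10"))
```

An empty result means the joiner has nothing the group lacks, so the error is more likely the "already a member" case described above. A non-empty result shows which server UUID the extra transactions came from, which is useful context to hand to whoever assists with recovery.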
In most cases, recovery follows steps similar to those for “Single Database Failing to Rejoin Group”, but ideally it should wait until off hours and until a Disguise engineer or support team member can assist.