**架构说明**:
- **负载层**: HAProxy + Keepalived 提供高可用负载均衡和VIP管理
- **数据库层**: MariaDB Galera 集群实现多活数据库
- **监控层**: Prometheus + Grafana 实现全方位监控
- **仲裁层**: Galera Arbitrator 提供奇数节点支持
---
## 环境准备
### 系统要求
- **操作系统**: RockyLinux 8/9
- **硬件要求**:
- 最小 4GB RAM (生产环境建议 16GB+)
- 最小 2 CPU 核心 (生产环境建议 8 核心+)
- **网络要求**:
- 节点间延迟 < 5ms
- 节点间带宽 > 1Gbps
### 基础配置
```bash
# 所有节点执行
# 1. 禁用SELinux
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
setenforce 0
# 2. 配置防火墙
firewall-cmd --permanent --add-port={3306,4444,4567,4568,9200,8080}/tcp
firewall-cmd --reload
# 3. 配置主机名解析
cat >> /etc/hosts <<EOF
192.168.1.101 db1
192.168.1.102 db2
192.168.1.103 db3
192.168.1.201 lb1
192.168.1.202 lb2
EOF
# 4. 时间同步
dnf install -y chrony
systemctl enable --now chronyd
chronyc sources
```
---
## MariaDB Galera 集群部署
### 安装组件
```bash
# 所有数据库节点执行
# 1. 添加MariaDB仓库
curl -sS https://downloads.mariadb.com/MariaDB/mariadb_repo_setup | sudo bash
# 2. 安装必要组件
dnf install -y MariaDB-server MariaDB-backup galera-4
```
### Galera 配置
**文件路径**:`/etc/my.cnf.d/galera.cnf`
```ini
[mysqld]
# 基础配置
binlog_format=ROW
default-storage-engine=InnoDB
innodb_autoinc_lock_mode=2
# Galera 配置
wsrep_on=ON
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_name="galera_cluster"
wsrep_cluster_address="gcomm://db1,db2,db3"
wsrep_node_name="db1" # 节点名需修改
wsrep_node_address="192.168.1.101" # 节点IP需修改
wsrep_sst_method=mariabackup
wsrep_sst_auth="sst_user:${SST_PASSWORD}" # 替换${SST_PASSWORD}
# 防脑裂配置
wsrep_provider_options="gcache.size=1G;
pc.ignore_sb=true;
evs.suspect_timeout=PT30S;
evs.inactive_timeout=PT1M"
```
### 健康检测配置 (9200端口)
```bash
# 所有数据库节点执行
# 1. 安装xinetd
dnf install -y xinetd
# 2. 创建检测脚本
cat > /usr/local/bin/galera_healthcheck <<'EOF'
#!/bin/bash
CONFIG_FILE="/etc/mysql/conf.d/healthcheck.cnf"
RESPONSE=$(mysql --defaults-extra-file=$CONFIG_FILE -e "SHOW STATUS LIKE 'wsrep_ready'" 2>/dev/null | grep 'ON')
if [ -n "$RESPONSE" ]; then
echo -e "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
exit 0
else
echo -e "HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n"
exit 1
fi
EOF
# 3. 创建配置文件
cat > /etc/mysql/conf.d/healthcheck.cnf <<EOF
[client]
user=exporter
password=${EXPORTER_PASSWORD} # 替换${EXPORTER_PASSWORD}
EOF
chmod 600 /etc/mysql/conf.d/healthcheck.cnf
chmod +x /usr/local/bin/galera_healthcheck
# 4. 配置xinetd服务
cat > /etc/xinetd.d/galera-healthcheck <<EOF
service galera-healthcheck
{
disable = no
flags = REUSE
socket_type = stream
port = 9200
wait = no
user = root
group = root
server = /usr/local/bin/galera_healthcheck
log_on_failure += USERID
per_source = UNLIMITED
}
EOF
# 5. 启动服务
systemctl enable --now xinetd
# 验证健康检查
curl -i http://localhost:9200
```
### 初始化集群
```bash
# 1. 在第一个节点初始化集群
galera_new_cluster
# 2. 启动其他节点
systemctl start mariadb
# 3. 所有节点设置开机自启
systemctl enable mariadb
```
### 创建系统用户
```sql
-- 在所有节点执行
-- SST用户 (替换${SST_PASSWORD})
CREATE USER 'sst_user'@'localhost' IDENTIFIED BY '${SST_PASSWORD}';
GRANT RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sst_user'@'localhost';
-- 监控用户 (替换${EXPORTER_PASSWORD})
CREATE USER 'exporter'@'localhost' IDENTIFIED BY '${EXPORTER_PASSWORD}' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
```
---
## HAProxy + Keepalived 部署
### HAProxy 配置
**文件路径**:`/etc/haproxy/haproxy.cfg`
```conf
global
log /dev/log local0
maxconn 4000
user haproxy
group haproxy
defaults
mode tcp
timeout connect 10s
timeout client 30s
timeout server 30s
frontend galera_front
bind *:3306
default_backend galera_back
backend galera_back
balance leastconn
option tcp-check
tcp-check connect port 9200
server db1 192.168.1.101:3306 check port 9200 inter 5s rise 2 fall 3
server db2 192.168.1.102:3306 check port 9200 inter 5s rise 2 fall 3
server db3 192.168.1.103:3306 check port 9200 inter 5s rise 2 fall 3
listen stats
bind *:8080
mode http
stats enable
stats uri /haproxy?stats
stats refresh 10s
stats auth admin:${HAPROXY_ADMIN_PASS} # 替换${HAPROXY_ADMIN_PASS}
```
### Keepalived 配置
**主节点文件路径**:`/etc/keepalived/keepalived.conf`
```conf
vrrp_script chk_haproxy {
script "pidof haproxy"
interval 2
weight 2
}
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass ${KEEPALIVED_AUTH_PASS} # 替换${KEEPALIVED_AUTH_PASS}
}
virtual_ipaddress {
192.168.1.200/24
}
track_script {
chk_haproxy
}
}
```
> **备节点配置**:
> - `state` 设置为 `BACKUP`
> - `priority` 设置为 `90`
### 启动服务
```bash
# 所有负载节点执行
systemctl enable --now haproxy keepalived
# 验证VIP
ip addr show eth0 | grep 192.168.1.200
```
---
## 备份与恢复
### 定时备份脚本
**文件路径**:`/usr/local/bin/backup_galera.sh`
```bash
#!/bin/bash
# 配置文件:包含所有敏感信息
CONFIG_FILE="/etc/galera/backup.conf"
# 加载配置文件
if [ ! -f "$CONFIG_FILE" ]; then
echo "$(date) - 配置文件 $CONFIG_FILE 不存在" >> /var/log/mariabackup.log
exit 1
fi
source "$CONFIG_FILE"
# 创建备份目录
mkdir -p $BACKUP_DIR/$DATE || {
echo "$(date) - 创建目录失败" >> $LOG_FILE
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "MariaDB 备份失败:创建目录失败"
exit 1
}
chown -R mysql:mysql $BACKUP_DIR/$DATE
# 设置文件限制
ulimit -n 65535
# 开始备份
echo "===== 开始备份: $(date) =====" >> $LOG_FILE
mariabackup --backup \
--user=$BACKUP_USER \
--password=$BACKUP_PASS \
--target-dir=$BACKUP_DIR/$DATE \
--parallel=4 \
--galera-info \
--open-files-limit=65535 2>> $LOG_FILE
if [ $? -ne 0 ]; then
echo "备份失败!" >> $LOG_FILE
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "MariaDB 备份失败:$(tail -n 10 $LOG_FILE)"
exit 1
fi
# 准备备份
echo "===== 准备备份: $(date) =====" >> $LOG_FILE
mariabackup --prepare \
--target-dir=$BACKUP_DIR/$DATE \
--open-files-limit=65535 2>> $LOG_FILE
# 验证关键文件存在
if [ -f "$BACKUP_DIR/$DATE/xtrabackup_checkpoints" ]; then
echo "备份验证成功" >> $LOG_FILE
# 使用XZ多核压缩备份目录
echo "===== 压缩备份: $(date) =====" >> $LOG_FILE
tar -I 'xz -T0' -cf $BACKUP_DIR/$DATE.tar.xz -C $BACKUP_DIR $DATE
if [ $? -eq 0 ]; then
# 删除原始备份目录
rm -rf $BACKUP_DIR/$DATE
echo "备份压缩成功" >> $LOG_FILE
else
echo "备份压缩失败!" >> $LOG_FILE
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "MariaDB 备份压缩失败"
fi
else
echo "备份验证失败: 缺少xtrabackup_checkpoints文件!" >> $LOG_FILE
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "MariaDB 备份验证失败:缺少关键文件"
exit 1
fi
# 保留最近365天备份
find $BACKUP_DIR -name "*.tar.xz" -mtime +365 -delete
echo "===== 备份完成: $(date) =====" >> $LOG_FILE
```
### 备份配置文件
**文件路径**:`/etc/galera/backup.conf`
```bash
# 备份配置 - 请替换实际值
BACKUP_ROOT="${SECURE_BACKUP_PATH}" # 替换为安全路径
BACKUP_DIR="${BACKUP_ROOT}/galera_backups"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_USER="sst_user"
BACKUP_PASS="${SST_PASSWORD}" # 替换${SST_PASSWORD}
LOG_FILE="/var/log/mariabackup.log"
DT_TOKEN="${DINGTALK_TOKEN}" # 替换${DINGTALK_TOKEN}
DT_SIGN="${DINGTALK_SIGN}" # 替换${DINGTALK_SIGN}
```
### 设置定时任务
```bash
chmod 600 /etc/galera/backup.conf
chmod +x /usr/local/bin/backup_galera.sh
# 每天凌晨2点执行备份
echo "0 2 * * * /usr/local/bin/backup_galera.sh" | crontab -
```
### 自动化恢复脚本
**文件路径**:`/usr/local/bin/restore_galera.sh`
```bash
#!/bin/bash
# Galera集群数据恢复脚本
# 用法: restore_galera.sh <备份文件路径>
# 示例: restore_galera.sh ${BACKUP_ROOT}/galera_backups/20231001-020000.tar.xz
# 配置文件
CONFIG_FILE="/etc/galera/restore.conf"
source "$CONFIG_FILE"
if [ $# -eq 0 ]; then
echo "错误:请提供备份文件路径"
echo "用法: $0 <备份文件路径>"
exit 1
fi
BACKUP_FILE=$1
TEMP_DIR="/tmp/galera_restore_$(date +%s)"
LOG_FILE="/var/log/galera_restore.log"
# 验证备份文件存在
if [ ! -f "$BACKUP_FILE" ]; then
echo "错误:备份文件 $BACKUP_FILE 不存在" | tee -a $LOG_FILE
exit 1
fi
echo "===== 开始恢复: $(date) =====" | tee -a $LOG_FILE
# 1. 停止MariaDB服务
echo "[步骤1/8] 停止MariaDB服务..." | tee -a $LOG_FILE
systemctl stop mariadb >> $LOG_FILE 2>&1
if [ $? -ne 0 ]; then
echo "错误:停止MariaDB服务失败" | tee -a $LOG_FILE
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "Galera恢复失败:停止服务失败"
exit 1
fi
# 2. 清空数据目录
echo "[步骤2/8] 清空数据目录..." | tee -a $LOG_FILE
rm -rf /var/lib/mysql/* >> $LOG_FILE 2>&1
# 3. 创建临时目录
echo "[步骤3/8] 创建临时目录 $TEMP_DIR..." | tee -a $LOG_FILE
mkdir -p $TEMP_DIR >> $LOG_FILE 2>&1
if [ $? -ne 0 ]; then
echo "错误:创建临时目录失败" | tee -a $LOG_FILE
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "Galera恢复失败:创建临时目录失败"
exit 1
fi
# 4. 解压备份(使用XZ多核解压)
echo "[步骤4/8] 解压备份文件 $BACKUP_FILE..." | tee -a $LOG_FILE
tar -I 'xz -T0' -xf $BACKUP_FILE -C $TEMP_DIR >> $LOG_FILE 2>&1
if [ $? -ne 0 ]; then
echo "错误:解压备份失败" | tee -a $LOG_FILE
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "Galera恢复失败:解压备份失败"
exit 1
fi
# 获取解压后的目录名
BACKUP_DIR_NAME=$(ls $TEMP_DIR)
RESTORE_DIR="$TEMP_DIR/$BACKUP_DIR_NAME"
# 5. 执行恢复
echo "[步骤5/8] 恢复数据到/var/lib/mysql..." | tee -a $LOG_FILE
mariabackup --copy-back \
--target-dir=$RESTORE_DIR \
--open-files-limit=65535 >> $LOG_FILE 2>&1
if [ $? -ne 0 ]; then
echo "错误:数据恢复失败" | tee -a $LOG_FILE
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "Galera恢复失败:数据恢复失败"
exit 1
fi
# 6. 修复权限
echo "[步骤6/8] 修复数据目录权限..." | tee -a $LOG_FILE
chown -R mysql:mysql /var/lib/mysql >> $LOG_FILE 2>&1
# 7. 启动服务
echo "[步骤7/8] 启动MariaDB服务..." | tee -a $LOG_FILE
systemctl start mariadb >> $LOG_FILE 2>&1
if [ $? -ne 0 ]; then
echo "错误:启动MariaDB服务失败" | tee -a $LOG_FILE
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "Galera恢复失败:启动服务失败"
exit 1
fi
# 8. 验证恢复
echo "[步骤8/8] 验证集群状态和数据一致性..." | tee -a $LOG_FILE
mysql -e "SHOW STATUS LIKE 'wsrep%'" >> $LOG_FILE 2>&1
mysql -e "CHECK TABLE mysql.user EXTENDED" >> $LOG_FILE 2>&1
# 清理临时文件
rm -rf $TEMP_DIR
echo "===== 恢复完成: $(date) =====" | tee -a $LOG_FILE
echo "恢复日志: $LOG_FILE"
echo "请验证数据完整性和集群状态"
# 发送成功通知
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "Galera数据恢复成功完成"
```
### 恢复配置文件
**文件路径**:`/etc/galera/restore.conf`
```bash
# 恢复配置 - 请替换实际值
DT_TOKEN="${DINGTALK_TOKEN}" # 替换${DINGTALK_TOKEN}
DT_SIGN="${DINGTALK_SIGN}" # 替换${DINGTALK_SIGN}
```
### 恢复操作流程
1. **准备配置文件**:
```bash
# 创建配置文件
sudo mkdir -p /etc/galera
sudo nano /etc/galera/restore.conf
# 添加内容:
# DT_TOKEN="your_actual_token"
# DT_SIGN="your_actual_sign"
sudo chmod 600 /etc/galera/restore.conf
```
2. **选择备份文件**:
```bash
# 列出可用备份
ls -l ${SECURE_BACKUP_PATH}/galera_backups/*.tar.xz
```
3. **执行恢复脚本**:
```bash
chmod +x /usr/local/bin/restore_galera.sh
/usr/local/bin/restore_galera.sh ${SECURE_BACKUP_PATH}/galera_backups/20231001-020000.tar.xz
```
4. **验证恢复结果**:
```sql
-- 检查集群状态
SHOW STATUS LIKE 'wsrep%';
-- 验证关键数据
SELECT COUNT(*) FROM important_table;
```
### 恢复流程注意事项
1. **配置文件安全**:确保配置文件权限为600
2. **空间要求**:确保有足够的磁盘空间(XZ压缩需要更多临时空间)
3. **服务中断**:恢复期间数据库不可用
4. **验证步骤**:恢复后必须执行数据验证
5. **性能优化**:XZ多核解压会充分利用CPU资源
---
## 故障处理手册
### 常规故障排查
```sql
-- 检查集群状态
SHOW STATUS LIKE 'wsrep%';
-- 关键指标:
-- wsrep_cluster_size: 集群节点数 (应为3)
-- wsrep_cluster_status: 集群状态 (应为Primary)
-- wsrep_connected: 网络连接状态 (应为ON)
-- wsrep_ready: 节点就绪状态 (应为ON)
```
### 脑裂检测与处理
#### 自动化脑裂检测脚本
**文件路径**:`/usr/local/bin/cluster_integrity_check.sh`
```bash
#!/bin/bash
# 配置文件
CONFIG_FILE="/etc/galera/monitor.conf"
source "$CONFIG_FILE"
# 获取当前节点状态
CURRENT_STATUS=$(mysql -u$MONITOR_USER -p$MONITOR_PASS -e "SHOW STATUS LIKE 'wsrep_cluster_status'" -sN | awk '{print $2}')
CURRENT_SIZE=$(mysql -u$MONITOR_USER -p$MONITOR_PASS -e "SHOW STATUS LIKE 'wsrep_cluster_size'" -sN | awk '{print $2}')
if [ "$CURRENT_STATUS" != "Primary" ]; then
MESSAGE="[CRITICAL] Node $(hostname) in $CURRENT_STATUS state!"
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "$MESSAGE"
exit 1
fi
if [ "$CURRENT_SIZE" -lt "$EXPECTED_SIZE" ]; then
MESSAGE="[CRITICAL] Cluster size $CURRENT_SIZE < $EXPECTED_SIZE!"
/usr/local/bin/dingtalk-cli-sender -t $DT_TOKEN -s $DT_SIGN "$MESSAGE"
exit 1
fi
```
### 监控配置文件
**文件路径**:`/etc/galera/monitor.conf`
```bash
# 监控配置 - 请替换实际值
EXPECTED_SIZE=3
MONITOR_USER="exporter"
MONITOR_PASS="${EXPORTER_PASSWORD}" # 替换${EXPORTER_PASSWORD}
DT_TOKEN="${DINGTALK_TOKEN}" # 替换${DINGTALK_TOKEN}
DT_SIGN="${DINGTALK_SIGN}" # 替换${DINGTALK_SIGN}
```
---
## 附录
### 敏感信息占位符说明
| **占位符** | **说明** | **示例值** |
|--------------------------|------------------------------|-------------------------|
| `${SECURE_BACKUP_PATH}` | 安全备份存储路径 | /secure/db_backups |
| `${SST_PASSWORD}` | SST用户密码 | s3cretPass |
| `${EXPORTER_PASSWORD}` | 监控用户密码 | ExporterPass123 |
| `${HAPROXY_ADMIN_PASS}` | HAProxy管理密码 | AdminPass123 |
| `${KEEPALIVED_AUTH_PASS}`| Keepalived认证密码 | Keep@l1ve |
| `${DINGTALK_TOKEN}` | 钉钉机器人Token | d14e9ddfefd414... |
| `${DINGTALK_SIGN}` | 钉钉机器人签名 | SEC1089bac8cb8b... |
### 安全最佳实践
1. **配置文件管理**:
```bash
# 设置配置文件权限
chmod 600 /etc/galera/*.conf
chown root:root /etc/galera/*.conf
```
2. **密码轮换策略**:
- SST密码:每90天轮换
- 监控密码:每180天轮换
- HAProxy密码:每365天轮换
3. **审计日志**:
```bash
# 记录所有管理操作
echo "$(date) $(whoami) $@" >> /var/log/galera_audit.log
```
### 版本更新记录
| **版本** | **日期** | **修改内容** |
|----------|----------------|----------------------------------|
| 1.7 | 2025-10-03 | 添加自动化恢复脚本 |
| **1.8** | **2025-10-04** | **统一敏感信息管理,增强安全性** |
| | | **备份压缩优化为XZ多核格式** |
| | | **敏感路径隐私保护** |
### 紧急恢复联系人
| **角色** | **联系方式** | **响应时间** |
|----------------|-------------------------|--------------|
| 值班运维 | +86-138XXXXXXX | 24/7 |
| 高级DBA | dba-oncall@example.com | 24/7 |
> **最后更新**: 2025-10-04
> **安全等级**: 机密
> **文档校验**: 使用校验和验证完整性
> `md5sum: a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6`
---
## 安全增强说明
### 敏感信息管理策略
1. **统一配置中心**:
- 所有敏感信息存储在`/etc/galera/`目录下的配置文件中
- 配置文件权限设置为`600`,仅root可访问
- 使用环境变量或source命令加载配置
2. **占位符系统**:
| 占位符 | 实际值位置 |
|------------------------|--------------------------|
| `${SECURE_BACKUP_PATH}` | backup.conf |
| `${SST_PASSWORD}` | galera.cnf, backup.conf |
| `${EXPORTER_PASSWORD}` | healthcheck.cnf, monitor.conf |
| `${HAPROXY_ADMIN_PASS}` | haproxy.cfg |
| `${KEEPALIVED_AUTH_PASS}`| keepalived.conf |
| `${DINGTALK_TOKEN}` | 各脚本配置文件 |
| `${DINGTALK_SIGN}` | 各脚本配置文件 |
3. **部署流程**:
```mermaid
graph LR
A[准备配置文件] --> B[替换占位符]
B --> C[设置文件权限]
C --> D[验证配置]
D --> E[部署服务]
```
|