问题说明

  • MySQL 5.7 上使用 Xtrabackup 执行全备时,应用程序产生大量连接数不够报警,查看 processlist 内有大量等待锁的 SQL。
  • processlist 内还有有一个执行几千秒的 select 语句
  • Xtrabackup 备份日志内有 failed to execute query ‘FLUSH NO_WRITE_TO_BINLOG TABLES’: 2013 (HY000) Lost connection to MySQL server during query 报错信息

问题复现

  • connectionA 在表上执行 select
1
select sleep(300) from tbl1;
  • 使用 xtrabackup 执行全备份
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
$ xtrabackup --defaults-file=/app/xiang/mysql5.7/mysql_data/my_3306.cnf --backup \
--user='xtrabackup_user'  --host 127.0.0.1 --password='backupUserPasswd'  \
--target-dir=./backupfile
# .......
# 发现日志均卡卡在这个位置
240220 11:49:38 >> log scanned up to (720749972)
240220 11:49:39 >> log scanned up to (720749972)
240220 11:49:40 >> log scanned up to (720749972)
240220 11:49:41 >> log scanned up to (720749972)
240220 11:49:42 >> log scanned up to (720749972)
240220 11:49:43 >> log scanned up to (720749972)
240220 11:49:44 >> log scanned up to (720749972)
240220 11:49:45 >> log scanned up to (720749972)
240220 11:49:46 >> log scanned up to (720749972)
240220 11:49:47 >> log scanned up to (720749972)
240220 11:49:48 >> log scanned up to (720749972)
240220 11:49:49 >> log scanned up to (720749972)
240220 11:49:50 >> log scanned up to (720749972)
240220 11:49:51 >> log scanned up to (720749972)
240220 11:49:52 >> log scanned up to (720749972)
240220 11:49:53 >> log scanned up to (720749972)
240220 11:49:54 >> log scanned up to (720749972)
240220 11:49:55 >> log scanned up to (720749972)
240220 11:49:56 >> log scanned up to (720749972)
240220 11:49:57 >> log scanned up to (720749972)
240220 11:49:58 >> log scanned up to (720749972)
240220 11:49:59 >> log scanned up to (720749972)
240220 11:50:00 >> log scanned up to (720749972)
240220 11:50:01 >> log scanned up to (720749972)
240220 11:50:02 >> log scanned up to (720749972)
  • connectionB 查看 processlist 发现 xtrabackup 线程卡在 FLUSH NO_WRITE_TO_BINLOG TABLES
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
mysql> show processlist;
+----+-----------------+-----------------+------+------------+------+-------------------------+---------------------------------+
| Id | User            | Host            | db   | Command    | Time | State                   | Info                            |
+----+-----------------+-----------------+------+------------+------+-------------------------+---------------------------------+
|  2 | root            | localhost       | tdb  | Query      | 2936 | User sleep              | select sleep(300) from tbl1     |
|  7 | xtrabackup_user | 127.0.0.1:36348 | NULL | Query      | 2550 | Waiting for table flush | FLUSH NO_WRITE_TO_BINLOG TABLES |
|  9 | root            | localhost       | tdb  | Field List | 2520 | Waiting for table flush | NULL                            |
| 10 | root            | localhost       | tdb  | Sleep      | 2499 |                         | NULL                            |
| 13 | root            | localhost       | NULL | Query      |    0 | starting                | show processlist                |
+----+-----------------+-----------------+------+------------+------+-------------------------+---------------------------------+
  • connectionC 执行一个查询操作,发现长时间无返回
1
mysql> select * from tbl1;
  • connectionB 查看 processlist,发现新执行的 select * from tbl1 也处于 Waiting 状态
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
mysql> show processlist;
+----+-----------------+-----------------+------+------------+------+-------------------------+---------------------------------+
| Id | User            | Host            | db   | Command    | Time | State                   | Info                            |
+----+-----------------+-----------------+------+------------+------+-------------------------+---------------------------------+
|  2 | root            | localhost       | tdb  | Query      | 3040 | User sleep              | select sleep(300) from tbl1     |
|  7 | xtrabackup_user | 127.0.0.1:36348 | NULL | Query      | 2654 | Waiting for table flush | FLUSH NO_WRITE_TO_BINLOG TABLES |
|  9 | root            | localhost       | tdb  | Field List | 2624 | Waiting for table flush | NULL                            |
| 10 | root            | localhost       | tdb  | Query      |   15 | Waiting for table flush | select * from tbl1              |
| 13 | root            | localhost       | NULL | Query      |    0 | starting                | show processlist                |
+----+-----------------+-----------------+------+------------+------+-------------------------+---------------------------------+

问题分析

FTWRL(FLUSH TABLES WITH READ LOCK )

  • 该命令主要用于备份工具获取一致性备份,FTWRL总共需要持有两把全局的 MDL ,并且还需要关闭所有表对象
  • FTWRL 操作步骤
    • 上全局读锁(lock_global_read_lock):上全局读锁会导致所有更新操作都会被堵塞
    • 清理表缓存(close_cached_tables):关闭表过程中,如果有大查询导致关闭表等待,那么所有访问这个表的查询和更新都需要等待
    • 上全局COMMIT锁(make_global_read_lock_block_commit):上全局COMMIT锁时,会堵塞活跃事务提交
  • 由于慢 sql 导致 FTWRL 无法拿锁以至于阻塞后续 sql 的执行

问题处理

  • Plan1:xtrabcakup 增加响应参数,在无法拿到 FTWRL 一段时间后终止退出,以减小影响范围
  • Plan2:MySQL 升级至 MySQL8.0,由于 MySQL8.0 新增了备份锁,通过备份锁备份可以不需要 FTWRL备份(但 DDL 语句仍会阻塞备份进程)

MySQL5.7 xtrabcakup 通过参数控制影响范围

xtrabackup 主动退出

参数说明

  • –ftwrl-wait-timeout=SECONDS
    • 参数说明:https://docs.percona.com/percona-xtrabackup/2.4/xtrabackup_bin/xbk_option_reference.html#-ftwrl-wait-timeoutseconds
    • 在想要开始执行 FTWRL 前查看库中是否有长查询,如果有,则推迟执行 FTWRL,超过此参数执行时间则放弃备份
    • 默认为 0,不检测,直接执行 FTWRL
  • –ftwrl-wait-threshold=SECONDS
    • 参数说明:https://docs.percona.com/percona-xtrabackup/2.4/xtrabackup_bin/xbk_option_reference.html#-ftwrl-wait-thresholdseconds
    • 此参数影响 –ftwrl-wait-timeout=SECONDS 参数检测时判定为长查询的时间阈值
    • 默认为 60s
  • –ftwrl-wait-threshold=SECONDS
    • 参数说明:https://docs.percona.com/percona-xtrabackup/2.4/xtrabackup_bin/xbk_option_reference.html#-ftwrl-wait-query-typeallupdate
    • 推荐使用默认值(all),指定在 xtrabackup 发出全局锁之前允许完成哪些类型的查询。

推荐配置

  • 备份命令:
    • –ftwrl-wait-timeout=120
    • –ftwrl-wait-threshold=0
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
$ xtrabackup --defaults-file=/app/xiang/mysql5.7/mysql_data/my_3306.cnf --backup \
--user='xtrabackup_user'  --host 127.0.0.1 --password='backupUserPasswd' \
--ftwrl-wait-timeout=120 --ftwrl-wait-threshold=0  \
--target-dir=./backupfile
# ......
# 备份日志中会打印由于由于慢查询而阻塞 ftwrl 操作
240220 14:30:19 Waiting 120 seconds for queries running longer than 0 seconds to finish
240220 14:30:19 Waiting for query 18 (duration 3 sec): select sleep(300) from tbl1240220 14:30:20 >> log scanned up to (720749972)
240220 14:30:20 Waiting for query 18 (duration 4 sec): select sleep(300) from tbl1240220 14:30:21 >> log scanned up to (720749972)
240220 14:30:21 Waiting for query 18 (duration 5 sec): select sleep(300) from tbl1240220 14:30:22 >> log scanned up to (720749972)
240220 14:30:22 Waiting for query 18 (duration 6 sec): select sleep(300) from tbl1240220 14:30:23 >> log scanned up to (720749972)
240220 14:30:23 Waiting for query 18 (duration 7 sec): select sleep(300) from tbl1240220 14:30:24 >> log scanned up to (720749972)
  • 此时数据库建立一个新连接查看 processlist
    • processlist 内并没有 FLUSH TABLES WITH READ LOCK 语句,也没有阻塞其他查询的正常运行
1
2
3
4
5
6
7
8
mysql> show processlist;
+----+-----------------+-----------------+------+---------+------+------------+-----------------------------+
| Id | User            | Host            | db   | Command | Time | State      | Info                        |
+----+-----------------+-----------------+------+---------+------+------------+-----------------------------+
| 15 | root            | localhost       | NULL | Query   |    0 | starting   | show processlist            |
| 18 | root            | localhost       | tdb  | Query   |  115 | User sleep | select sleep(300) from tbl1 |
| 23 | xtrabackup_user | 127.0.0.1:36390 | NULL | Sleep   |    1 |            | NULL                        |
+----+-----------------+-----------------+------+---------+------+------------+-----------------------------+

存在问题

  • 如果不调整 –ftwrl-wait-threshold 使用缺省值,则一些刚开始执行的慢sql可能仍会造成阻塞
    • 由于 select sleep(300) from tbl1 开始执行 2s后全备任务需执行 FTWRL,导致语句并没有被判定为慢查询(long query)
1
2
3
4
5
6
7
8
mysql> show processlist;
+----+-----------------+-----------------+------+---------+------+-------------------------+-----------------------------+
| Id | User            | Host            | db   | Command | Time | State                   | Info                        |
+----+-----------------+-----------------+------+---------+------+-------------------------+-----------------------------+
| 15 | root            | localhost       | NULL | Query   |    0 | starting                | show processlist            |
| 17 | xtrabackup_user | 127.0.0.1:36382 | NULL | Query   |  112 | Waiting for table flush | FLUSH TABLES WITH READ LOCK |
| 18 | root            | localhost       | tdb  | Query   |  114 | User sleep              | select sleep(300) from tbl1 |
+----+-----------------+-----------------+------+---------+------+-------------------------+-----------------------------+
  • 即使把 –ftwrl-wait-threshold 参数设为 0,由于判定有没有慢查询的查询动作和执行 FTWRL 操作之间仍存在时间差,时间差内仍可能存在刚开始的执行的慢查询导致FTWRL阻塞

MySQL5.7 xtrabcakup 通过参数处理库中的慢查询

参数说明

  • –kill-long-queries-timeout=
    • 参数说明:https://docs.percona.com/percona-xtrabackup/2.4/xtrabackup_bin/xbk_option_reference.html#-kill-long-queries-timeout
    • 当 FTWRL 被阻塞超过此参数指定时间时,kill 阻塞的 sql
    • 注意,需要有 SUPER权限执行 kill 语句,否则 xtrabackup 会不断重试并产生大量日志
  • –kill-long-query-type=select|all
    • 当阻塞时间超过 –kill-long-queries-timeout 时,可以被 kill 的查询类型,默认为 select,近kill select 类型的语句

推荐配置

  • –kill-long-queries-timeout=30
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
$ xtrabackup --defaults-file=/app/xiang/mysql5.7/mysql_data/my_3306.cnf \
--backup --user='xtrabackup_user'  --host 127.0.0.1 --password='backupUserPasswd'  \
--kill-long-queries-timeout=30  \
--target-dir=./backupfile
# ......
240220 15:29:54 Executing FLUSH TABLES WITH READ LOCK...
240220 15:29:54 Kill query timeout 30 seconds.
240220 15:29:55 >> log scanned up to (720749972)
240220 15:29:56 >> log scanned up to (720749972)
240220 15:29:57 >> log scanned up to (720749972)
240220 15:30:24 Connecting to MySQL server host: 127.0.0.1, user: xtrabackup_user, password: set, port: 3306, socket: /app/xiang/mysql5.7/mysql_data/data/mysql_3306.sock
240220 15:30:24 Killing query 18 (duration 46 sec): select sleep(300) from tbl1
240220 15:30:24 Kill query thread stopped
240220 15:30:24 Starting to backup non-InnoDB tables and files
# ......

MySQL8.0 备份锁

  • MySQL8.0 由于引入了轻量级的备份锁,并不会阻塞DML操作

验证

  • connectionA 在表上执行 select
1
select sleep(300) from tbl1;
  • 使用 xtrabackup 执行全备份
1
2
3
4
5
6
7
$ xtrabackup --defaults-file=/app/xiang/mysql8.0/mysql_data/my_3307.cnf --backup \
--user='xtrabackup_user'  --host 127.0.0.1 --password='backupUserPasswd'  \
--target-dir=./backupfile
# .......
2024-02-20T17:10:30.997949+08:00 0 [Note] [MY-011825] [Xtrabackup] Done: Writing file /home/xiang/percona-xtrabackup-8.0.35-30-Linux-x86_64.glibc2.17-minimal/backupfile/xtrabackup_info
2024-02-20T17:10:32.001495+08:00 0 [Note] [MY-011825] [Xtrabackup] Transaction log of lsn (1039984930) to (1039985726) was copied.
2024-02-20T17:10:32.209796+08:00 0 [Note] [MY-011825] [Xtrabackup] completed OK!
  • 备份过程并没有被阻塞

最佳实践

MySQL5.7

当语句不允许被 kill 时

  • 当不允许主动 kill 语句时,使用 –ftwrl-wait-timeout=120 –ftwrl-wait-threshold=0 参数
  • 由于 –ftwrl-wait-timeout 优先级高,当 120s 后慢sql仍没执行完时,xtrabackup 会退出,不会执行 kill 操作,–kill-long-queries-timeout=30 仅作为意外情况的兜底配置使用,避免对库造成更大影响
  • 注意:
    • 备份失败需有通知机制,人工介入处理,避免长时间无有效备份可用
    • 备份用户需要有 super 权限
1
2
3
4
5
xtrabackup --defaults-file=/app/xiang/mysql5.7/mysql_data/my_3306.cnf --backup \
--user='xtrabackup_user'  --host 127.0.0.1 --password='backupUserPasswd' \
--ftwrl-wait-timeout=120 --ftwrl-wait-threshold=0 \
--kill-long-queries-timeout=30  \
--target-dir=./backupfile

当语句允许 kill 时

  • 当仅允许 kill select 时
1
2
3
4
xtrabackup --defaults-file=/app/xiang/mysql5.7/mysql_data/my_3306.cnf --backup \
--user='xtrabackup_user'  --host 127.0.0.1 --password='backupUserPasswd' \
--kill-long-queries-timeout=30  \
--target-dir=./backupfile
  • 当所有类型的语句都允许 kill 时
1
2
3
4
xtrabackup --defaults-file=/app/xiang/mysql5.7/mysql_data/my_3306.cnf --backup \
--user='xtrabackup_user'  --host 127.0.0.1 --password='backupUserPasswd' \
--kill-long-queries-timeout=30 --kill-long-query-type=all \
--target-dir=./backupfile

MySQL8.0

  • MySQL8.0 由于引入了轻量级的备份锁(BACKUP LOCK),DML操作不在影响备份,仅需在执行 DDL 时避开全备/增备时段即可。

参考