MySQL磁盘IO高问题排查:从现象到根因的完整指南
情境与背景
在生产环境中,经常会遇到CPU和内存使用率很低,但磁盘IO很高的情况。这通常是IO密集型瓶颈的典型表现,需要系统性地排查才能找到根本原因。
一、问题定位:确认是MySQL导致的IO高
1.1 系统层面检查
磁盘IO监控:
## 第一步:确认IO来源
**系统命令**:
```bash
# 查看磁盘IO统计
iostat -x 1 10
# 查看进程IO使用
iotop
# 查看磁盘使用情况
df -h
# 查看磁盘读写速度
dd if=/dev/zero of=/tmp/test bs=1G count=1 oflag=direct
关键指标:
iostat_metrics:
rMB/s: "每秒读取MB数"
wMB/s: "每秒写入MB数"
%util: "设备繁忙程度"
avgqu-sz: "平均队列长度"
await: "平均等待时间(ms)"
确认MySQL进程:
# 查找MySQL进程ID
ps aux | grep mysqld
# 查看MySQL进程的IO使用
iotop -p <pid>
### 1.2 MySQL状态检查
**MySQL状态查询**:
```markdown
## 第二步:检查MySQL状态
**连接状态**:
```bash
# 查看当前连接
SHOW PROCESSLIST;
# 查看完整连接信息
SHOW FULL PROCESSLIST;
# 查看连接数
SHOW STATUS LIKE 'Threads_%';
关键状态变量:
mysql_status:
Threads_connected: "当前连接数"
Threads_running: "活跃连接数"
Questions: "总查询数"
Queries: "总SQL语句数"
Innodb_buffer_pool_reads: "从磁盘读取的页数"
Innodb_buffer_pool_read_requests: "缓冲池读取请求数"
## 二、慢查询分析
### 2.1 慢查询日志配置
**启用慢查询日志**:
```markdown
## 第三步:分析慢查询
**配置慢查询日志**:
```bash
# 查看当前配置
SHOW VARIABLES LIKE 'slow_query%';
SHOW VARIABLES LIKE 'long_query_time';
# 临时启用慢查询日志
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;
SET GLOBAL log_queries_not_using_indexes = ON;
my.cnf配置:
slow_query_config:
slow_query_log: "ON"
slow_query_log_file: "/var/log/mysql/slow.log"
long_query_time: 1
log_queries_not_using_indexes: "ON"
log_slow_admin_statements: "ON"
### 2.2 慢查询日志分析
**分析工具**:
```markdown
**慢查询分析工具**:
```bash
# 使用mysqldumpslow分析
mysqldumpslow -s t /var/log/mysql/slow.log
# 使用pt-query-digest分析(Percona Toolkit)
pt-query-digest /var/log/mysql/slow.log
# 查看前10条最慢查询
mysqldumpslow -s t -t 10 /var/log/mysql/slow.log
关键分析维度:
slow_query_analysis:
Query_time: "查询耗时"
Lock_time: "锁等待时间"
Rows_sent: "返回行数"
Rows_examined: "检查行数"
Full_scan: "是否全表扫描"
Full_join: "是否全表连接"
## 三、执行计划分析
### 3.1 EXPLAIN分析
**执行计划详解**:
```markdown
## 第四步:分析执行计划
**EXPLAIN使用**:
```sql
EXPLAIN SELECT * FROM orders WHERE user_id = 12345;
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 12345;
执行计划字段:
explain_columns:
id: "查询ID"
select_type: "查询类型"
table: "表名"
type: "访问类型"
possible_keys: "可能使用的索引"
key: "实际使用的索引"
key_len: "索引长度"
ref: "与索引比较的列"
rows: "预计扫描行数"
Extra: "额外信息"
type字段解读:
type_interpretation:
ALL: "全表扫描(最差)"
index: "索引全扫描"
range: "范围扫描"
ref: "非唯一索引扫描"
eq_ref: "唯一索引扫描"
const: "常量查询(最优)"
Extra字段解读:
extra_interpretation:
Using index: "使用覆盖索引"
Using where: "使用WHERE条件"
Using filesort: "使用文件排序(性能差)"
Using temporary: "使用临时表(性能差)"
Using join buffer: "使用连接缓冲"
### 3.2 索引失效场景
**常见索引失效原因**:
```markdown
## 索引失效场景
**索引失效原因**:
```yaml
index_failure_cases:
- "SELECT * FROM users WHERE name LIKE '%test%'" # 前缀模糊匹配
- "SELECT * FROM users WHERE age + 1 = 30" # 列上有函数运算
- "SELECT * FROM users WHERE status = 1 OR age > 18" # OR条件(只有一个条件有索引)
- "SELECT * FROM users WHERE id IN (SELECT id FROM orders)" # 子查询
- "SELECT * FROM users WHERE name = 'test' COLLATE utf8_bin" # 字符集不一致
- "SELECT * FROM users WHERE created_at > '2024-01-01' AND status = 1" # 索引顺序问题
复合索引最左前缀原则:
composite_index_principle:
index: "idx_name_age_status (name, age, status)"
valid_queries:
- "WHERE name = 'test'"
- "WHERE name = 'test' AND age = 18"
- "WHERE name = 'test' AND age = 18 AND status = 1"
invalid_queries:
- "WHERE age = 18" # 不满足最左前缀
- "WHERE age = 18 AND status = 1" # 不满足最左前缀
- "WHERE status = 1" # 不满足最左前缀
## 四、InnoDB状态分析
### 4.1 缓冲池分析
**缓冲池状态**:
```markdown
## 第五步:分析InnoDB状态
**缓冲池命中率**:
```sql
-- 计算缓冲池命中率
SELECT
(1 - (sum(innodb_buffer_pool_reads) / sum(innodb_buffer_pool_read_requests))) * 100 AS hit_rate
FROM information_schema.global_status;
缓冲池配置:
buffer_pool_config:
innodb_buffer_pool_size:
description: "缓冲池大小"
recommendation: "物理内存的50-70%"
innodb_buffer_pool_instances:
description: "缓冲池实例数"
recommendation: "每4GB一个实例"
innodb_buffer_pool_dump_at_shutdown:
description: "关闭时保存缓冲池"
recommendation: "开启"
innodb_buffer_pool_load_at_startup:
description: "启动时加载缓冲池"
recommendation: "开启"
缓冲池状态查询:
SHOW ENGINE INNODB STATUS;
-- 查看缓冲池状态
SELECT * FROM information_schema.INNODB_BUFFER_POOL_STATS;
### 4.2 日志刷盘分析
**日志刷盘配置**:
```markdown
**日志刷盘策略**:
```yaml
log_flush_config:
innodb_flush_log_at_trx_commit:
description: "日志刷盘策略"
values:
- "0: 每秒刷盘(性能最好,可能丢失数据)"
- "1: 每次事务提交刷盘(最安全,性能最差)"
- "2: 每次事务提交写入OS缓存,每秒刷盘(平衡)"
innodb_log_file_size:
description: "重做日志文件大小"
recommendation: "256MB-2GB"
innodb_log_buffer_size:
description: "日志缓冲区大小"
recommendation: "16-64MB"
sync_binlog:
description: "binlog刷盘策略"
values:
- "0: 由OS决定"
- "1: 每次事务提交刷盘"
刷盘频率分析:
-- 查看日志刷盘统计
SHOW STATUS LIKE 'Innodb_os_log%';
-- Innodb_os_log_written: 写入日志字节数
-- Innodb_os_log_fsyncs: fsync次数
### 4.3 磁盘写入分析
**写入热点分析**:
```markdown
**写入操作分析**:
```sql
-- 查看写入统计
SHOW STATUS LIKE 'Com_insert%';
SHOW STATUS LIKE 'Com_update%';
SHOW STATUS LIKE 'Com_delete%';
-- 查看InnoDB写入统计
SHOW STATUS LIKE 'Innodb_data_written';
SHOW STATUS LIKE 'Innodb_pages_written';
临时表分析:
-- 查看临时表使用情况
SHOW GLOBAL STATUS LIKE 'Created_tmp%';
-- 临时表配置
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
## 五、常见根因与解决方案
### 5.1 全表扫描
**问题与解决方案**:
```markdown
## 常见根因分析
### 根因1:全表扫描
**现象**:
```yaml
symptoms:
- "type = ALL"
- "rows_examined 远大于 rows_sent"
- "Extra: Using where"
原因:
causes:
- "缺少索引"
- "索引失效"
- "查询条件不适合索引"
解决方案:
solutions:
- "创建合适的索引"
- "优化查询条件"
- "使用覆盖索引"
example:
problem: "SELECT * FROM orders WHERE user_id = 12345"
solution: "CREATE INDEX idx_user_id ON orders(user_id)"
### 5.2 索引失效
**问题与解决方案**:
```markdown
### 根因2:索引失效
**现象**:
```yaml
symptoms:
- "key = NULL"
- "possible_keys 有值但 key 为NULL"
原因:
causes:
- "使用LIKE '%xxx'"
- "列上有函数运算"
- "OR条件不满足"
- "字符集不一致"
解决方案:
solutions:
- "避免前缀模糊匹配"
- "避免在列上使用函数"
- "优化OR条件"
- "统一字符集"
example:
problem: "SELECT * FROM users WHERE DATE(created_at) = '2024-01-01'"
solution: "SELECT * FROM users WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02'"
### 5.3 文件排序
**问题与解决方案**:
```markdown
### 根因3:文件排序
**现象**:
```yaml
symptoms:
- "Extra: Using filesort"
- "ORDER BY 字段没有索引"
原因:
causes:
- "ORDER BY 字段没有索引"
- "ORDER BY 多个字段,索引顺序不一致"
解决方案:
solutions:
- "创建包含ORDER BY字段的索引"
- "使用覆盖索引"
example:
problem: "SELECT * FROM orders WHERE user_id = 12345 ORDER BY create_time DESC"
solution: "CREATE INDEX idx_user_time ON orders(user_id, create_time DESC)"
### 5.4 日志刷盘频繁
**问题与解决方案**:
```markdown
### 根因4:日志刷盘频繁
**现象**:
```yaml
symptoms:
- "高write IO"
- "Innodb_os_log_fsyncs 频繁"
- "小事务频繁提交"
原因:
causes:
- "innodb_flush_log_at_trx_commit = 1"
- "sync_binlog = 1"
- "大量小事务"
解决方案:
solutions:
- "调整 innodb_flush_log_at_trx_commit = 2"
- "调整 sync_binlog = 100"
- "合并小事务"
example:
before: "每秒1000次提交,每次fsync"
after: "合并为每秒10次提交,每次fsync"
### 5.5 数据写入量大
**问题与解决方案**:
```markdown
### 根因5:数据写入量大
**现象**:
```yaml
symptoms:
- "高write IO"
- "磁盘使用率持续增长"
- "大量INSERT/UPDATE操作"
原因:
causes:
- "批量写入"
- "频繁更新"
- "日志表持续写入"
解决方案:
solutions:
- "使用批量插入"
- "使用异步写入"
- "分库分表"
- "归档历史数据"
example:
before: "INSERT INTO logs VALUES (...)"
after: "INSERT INTO logs VALUES (...), (...), (...)"
## 六、优化方案总结
### 6.1 索引优化
**索引优化策略**:
```markdown
## 优化方案
### 索引优化
```yaml
index_optimization:
create_index:
- "为WHERE条件列创建索引"
- "为ORDER BY列创建索引"
- "使用复合索引"
drop_index:
- "删除冗余索引"
- "删除未使用的索引"
analyze_index:
- "定期使用ANALYZE TABLE"
- "检查索引碎片"
### 6.2 SQL优化
**SQL优化策略**:
```markdown
### SQL优化
```yaml
sql_optimization:
select_columns:
- "避免SELECT *"
- "只查询需要的列"
join_optimization:
- "使用INNER JOIN替代子查询"
- "小表驱动大表"
limit_optimization:
- "避免OFFSET过大"
- "使用游标分页"
group_by_optimization:
- "使用索引优化GROUP BY"
- "避免ORDER BY RAND()"
### 6.3 配置优化
**配置优化策略**:
```markdown
### 配置优化
```yaml
configuration_optimization:
memory:
- "innodb_buffer_pool_size = 物理内存的50-70%"
- "innodb_log_buffer_size = 64M"
io:
- "innodb_flush_log_at_trx_commit = 2"
- "sync_binlog = 100"
- "innodb_log_file_size = 1G"
connection:
- "max_connections = 合理值"
- "wait_timeout = 60"
### 6.4 架构优化
**架构优化策略**:
```markdown
### 架构优化
```yaml
architecture_optimization:
read_write_split:
- "主从复制"
- "读请求路由到Slave"
sharding:
- "分库分表"
- "按业务拆分"
caching:
- "Redis缓存热点数据"
- "应用层缓存"
partitioning:
- "使用分区表"
- "按时间分区"
## 七、实战案例
### 7.1 案例:全表扫描导致IO高
**案例描述**:
```markdown
## 实战案例
### 案例1:全表扫描
**问题现象**:
- CPU: 10%
- 内存: 30%
- 磁盘IO: 90%+
**排查过程**:
```yaml
investigation:
step_1: "iotop确认MySQL进程IO高"
step_2: "查看慢查询日志"
step_3: "发现慢查询:SELECT * FROM orders WHERE status = 1"
step_4: "EXPLAIN分析:type = ALL,全表扫描"
step_5: "检查索引:status列没有索引"
解决方案:
solution:
create_index: "CREATE INDEX idx_status ON orders(status)"
result:
before: "查询耗时5秒,IO高"
after: "查询耗时<100ms,IO正常"
### 7.2 案例:日志刷盘导致IO高
**案例描述**:
```markdown
### 案例2:日志刷盘频繁
**问题现象**:
- 每秒大量小事务提交
- write IO持续很高
- innodb_flush_log_at_trx_commit = 1
**排查过程**:
```yaml
investigation:
step_1: "iostat显示高write IO"
step_2: "SHOW STATUS LIKE 'Innodb_os_log_fsyncs'"
step_3: "发现每秒fsync次数超过1000"
step_4: "检查应用代码:频繁小事务"
解决方案:
solution:
batch_transactions: "合并小事务为批量操作"
config_tuning: "innodb_flush_log_at_trx_commit = 2"
result:
before: "write IO: 50MB/s"
after: "write IO: 5MB/s"
## 八、面试1分钟精简版(直接背)
**完整版**:
排查步骤:1. 使用iostat/iotop查看磁盘IO情况,确认是MySQL进程;2. 使用SHOW PROCESSLIST查看当前会话;3. 分析慢查询日志,找出耗时查询;4. 使用EXPLAIN分析执行计划,看是否全表扫描或索引失效;5. 查看InnoDB状态,检查缓冲池命中率、日志刷盘频率;6. 常见根因:全表扫描、索引失效、排序使用filesort、日志刷盘策略不当。优化方案:添加索引、优化SQL、调整innodb_flush_log_at_trx_commit、增加缓存。
**30秒超短版**:
iostat确认IO来源,慢查询日志找问题,EXPLAIN分析执行计划,常见原因:全表扫描、索引失效、filesort、刷盘频繁;优化:加索引、优化SQL、调整配置。
## 九、总结
### 9.1 排查流程总结
```yaml
troubleshooting_flow:
step_1: "确认IO来源"
command: "iostat, iotop"
step_2: "检查MySQL状态"
command: "SHOW PROCESSLIST"
step_3: "分析慢查询"
command: "mysqldumpslow, pt-query-digest"
step_4: "分析执行计划"
command: "EXPLAIN"
step_5: "检查InnoDB状态"
command: "SHOW ENGINE INNODB STATUS"
step_6: "定位根因并优化"
actions: ["加索引", "优化SQL", "调整配置"]
9.2 常见根因总结
common_causes:
full_table_scan:
description: "全表扫描"
solution: "添加索引"
index_failure:
description: "索引失效"
solution: "优化查询条件"
filesort:
description: "文件排序"
solution: "创建排序索引"
log_flush:
description: "日志刷盘频繁"
solution: "调整刷盘策略"
heavy_write:
description: "写入量大"
solution: "批量写入、异步写入"
9.3 最佳实践清单
best_practices:
monitoring:
- "监控磁盘IO"
- "监控慢查询"
- "设置IO告警"
prevention:
- "定期检查索引使用情况"
- "定期分析慢查询日志"
- "定期优化配置"
optimization:
- "使用覆盖索引"
- "避免全表扫描"
- "优化排序和分组"
9.4 记忆口诀
IO高先看进程,确认MySQL是元凶,
慢查询日志找问题,EXPLAIN分析执行计划,
全表扫描最常见,索引失效是关键,
filesort要优化,刷盘策略调一调,
添加索引加缓存,IO问题解决了。
文档信息
- 本文作者:soveran zhong
- 本文链接:https://blog.clockwingsoar.cn/2026/05/09/mysql-io-troubleshooting-best-practices/
- 版权声明:自由转载-非商用-非衍生-保持署名(创意共享3.0许可证)