polysh: a tool for operating on a cluster

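Background

A production service cluster usually has at least two machines, and a cluster with even moderately high concurrency has many more. When the company's operations support is weak, or logs are not yet collected centrally, searching logs is extremely painful: every time you need to look something up, or even just test a search condition, you have to run it on every server, and the log you need is often still hard to find. In that situation, a tool that can operate in batch on all machines matching a given pattern is indispensable.

In short, we need a way to send commands to production machines in batch.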

Using polysh

step 0 : log in to a production machine
step 1 : install polysh
sudo yum search polysh

The search may well not find the package.
In that case, refresh the yum repository cache first:
sudo yum clean all

sudo yum install polysh.noarch
step 2 : configure your personal SSH settings
...~]$ sudo vi .ssh/config
Host *
    StrictHostKeyChecking no
step 3 : run polysh
[root@l-group1 ~]$ polysh "l-service<1-2>.example.cn1"
ready (2)> 
m-service.example.cn1 :  17:24:57 up 334 days,  2:23,  3 users,  load average: 0.13, 0.21, 0.24
l-service.example.cn1 :  17:24:57 up 334 days,  2:23,  4 users,  load average: 0.10, 0.14, 0.19
ready (2)>
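
The host pattern passed to polysh in step 3 uses a numeric range, `<1-2>`, to name several machines at once. The sketch below illustrates how such a pattern can be read. It is not polysh's own parser: `expand_hosts` is a hypothetical helper, and it only handles plain ascending `<start-end>` ranges like the one in the example.

```python
import re
from typing import List

# One numeric range such as <1-2>, as used in the polysh example above.
# Assumption: this mirrors only that simple form, not polysh's full syntax.
_RANGE = re.compile(r"<(\d+)-(\d+)>")


def expand_hosts(pattern: str) -> List[str]:
    """Expand every <start-end> range in pattern into concrete host names."""
    match = _RANGE.search(pattern)
    if match is None:
        # No range left: the pattern is already a single host name.
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    hosts: List[str] = []
    for n in range(start, end + 1):
        # Substitute the first range, then recurse to expand any later ranges.
        expanded = pattern[:match.start()] + str(n) + pattern[match.end():]
        hosts.extend(expand_hosts(expanded))
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("l-service<1-2>.example.cn1"):
        print(host)
```

Expanding the pattern yourself before connecting is a cheap way to check that it covers exactly the machines you intend to touch.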

How to read GC logs

Demo

without evacuation (partially stw)
1.416: [GC pause (young) (initial-mark), 0.62417980 secs]                    // initial mark phase
…....
2.042: [GC concurrent-root-region-scan-start]                                // concurrent scan phase start
2.067: [GC concurrent-root-region-scan-end, 0.0251507]                       // concurrent scan phase end
2.068: [GC concurrent-mark-start]                                            // concurrent mark phase start
3.198: [GC concurrent-mark-reset-for-overflow]                               //   the marking stack overflowed; concurrent marking recovers on its own by restarting
4.053: [GC concurrent-mark-end, 1.9849672 sec]                               // concurrent mark phase end
4.055: [GC remark 4.055: [GC ref-proc, 0.0000254 secs], 0.0030184 secs]      // remark phase done, stw
 [Times: user=0.00 sys=0.00, real=0.00 secs]
4.088: [GC cleanup 117M->106M(138M), 0.0015198 secs]                         // cleanup phase done, stw
 [Times: user=0.00 sys=0.00, real=0.00 secs]
4.090: [GC concurrent-cleanup-start]                                         // concurrently free regions that are empty, start
4.091: [GC concurrent-cleanup-end, 0.0002721]                                // concurrently free regions that are empty, end
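
To make the phase boundaries in the log above easier to see, here is a minimal sketch that extracts the timestamped events and pairs each `-start` line with its `-end` line. The helper names (`parse_events`, `concurrent_phase_durations`) are hypothetical, and the regular expression assumes only the line shapes that appear in this excerpt.

```python
import re
from typing import Dict, List, Tuple

# "<uptime>: [GC <event>" as it appears in the excerpt above.
_EVENT = re.compile(r"^(\d+\.\d+): \[GC ([a-z-]+(?: \([a-z-]+\))*)")

# Timestamped lines copied from the excerpt, without the trailing comments.
SAMPLE = """\
1.416: [GC pause (young) (initial-mark), 0.62417980 secs]
2.042: [GC concurrent-root-region-scan-start]
2.067: [GC concurrent-root-region-scan-end, 0.0251507]
2.068: [GC concurrent-mark-start]
3.198: [GC concurrent-mark-reset-for-overflow]
4.053: [GC concurrent-mark-end, 1.9849672 sec]
4.055: [GC remark 4.055: [GC ref-proc, 0.0000254 secs], 0.0030184 secs]
4.088: [GC cleanup 117M->106M(138M), 0.0015198 secs]
4.090: [GC concurrent-cleanup-start]
4.091: [GC concurrent-cleanup-end, 0.0002721]
"""


def parse_events(log: str) -> List[Tuple[float, str]]:
    """Return (uptime_seconds, event_name) for every timestamped GC line."""
    events = []
    for line in log.splitlines():
        match = _EVENT.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def concurrent_phase_durations(events: List[Tuple[float, str]]) -> Dict[str, float]:
    """Pair '<phase>-start' with '<phase>-end' and return elapsed seconds."""
    started: Dict[str, float] = {}
    durations: Dict[str, float] = {}
    for timestamp, name in events:
        if name.endswith("-start"):
            started[name[: -len("-start")]] = timestamp
        elif name.endswith("-end"):
            phase = name[: -len("-end")]
            if phase in started:
                durations[phase] = round(timestamp - started.pop(phase), 3)
    return durations


if __name__ == "__main__":
    for phase, seconds in concurrent_phase_durations(parse_events(SAMPLE)).items():
        print(phase, seconds)
```

The durations computed from the timestamps should agree, to within rounding, with the durations the JVM prints on the `-end` lines, which is a quick sanity check that the log is being read correctly.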

with evacuation (completely stw)
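0.522: [GC pause (young), 0.15877971 secs]                                    // this is a young GC (ygc)
                                                                              // starts at 0.522; the stw pause lasts 0.15877971 secs
   [Parallel Time: 157.1 ms]                                                  //   parallel time of this ygc: 157.1 ms
      [GC Worker Start (ms):  522.1  522.2  522.2  522.2                      //     start time of each of the 4 GC workers
       Avg: 522.2, Min: 522.1, Max: 522.2, Diff:   0.1]                       //       average, min, max, max-min difference
      [Ext Root Scanning (ms):  1.6  1.5  1.6  1.9                            //     time each of the 4 GC workers spent scanning roots
       Avg:   1.7, Min:   1.5, Max:   1.9, Diff:   0.4]                       //       average, min, max, max-min difference
      [Update RS (ms):  38.7  38.8  50.6  37.3                                //     time each of the 4 GC workers spent updating the RS (remembered sets)
       Avg:  41.3, Min:  37.3, Max:  50.6, Diff:  13.3]                       //       average, min, max, max-min difference
         [Processed Buffers : 2 2 3 2                                         //     number of update buffers processed by each of the 4 GC workers
          Sum: 9, Avg: 2, Min: 2, Max: 3, Diff: 1]                            //       sum, average, min, max, max-min difference
      [Scan RS (ms):  9.9  9.7  0.0  9.7                                      //     time each of the 4 GC workers spent scanning the RS
       Avg:   7.3, Min:   0.0, Max:   9.9, Diff:   9.9]                       //       average, min, max, max-min difference
      [Object Copy (ms):  106.7  106.8  104.6  107.9                          //     time each of the 4 GC workers spent copying live objects
       Avg: 106.5, Min: 104.6, Max: 107.9, Diff:   3.3]                       //       average, min, max, max-min difference
      [Termination (ms):  0.0  0.0  0.0  0.0                                  //     termination time of each of the 4 GC workers
       Avg:   0.0, Min:   0.0, Max:   0.0, Diff:   0.0]                       //       average, min, max, max-min difference
         [Termination Attempts : 1 4 4 6                                      //     number of termination attempts by each of the 4 GC workers
          Sum: 15, Avg: 3, Min: 1, Max: 6, Diff: 5]                           //       sum, average, min, max, max-min difference
      [GC Worker End (ms):  679.1  679.1  679.1  679.1                        //     end time of each of the 4 GC workers
       Avg: 679.1, Min: 679.1, Max: 679.1, Diff:   0.1]                       //       average, min, max, max-min difference
      [GC Worker (ms):  156.9  157.0  156.9  156.9                            //     total running time of each of the 4 GC workers
       Avg: 156.9, Min: 156.9, Max: 157.0, Diff:   0.1]                       //       average, min, max, max-min difference
      [GC Worker Other (ms):  0.3  0.3  0.3  0.3                              //     time each of the 4 GC workers spent on other tasks
       Avg:   0.3, Min:   0.3, Max:   0.3, Diff:   0.0]                       //       average, min, max, max-min difference
   [Clear CT:   0.1 ms]                                                       //   time to clear the card table; this phase runs serially
   [Other:   1.5 ms]                                                          //   time spent on other work; this phase runs serially
      [Choose CSet:   0.0 ms]                                                 //     time to choose regions for the collection set
      [Ref Proc:   0.3 ms]                                                    //     time to process object references
      [Ref Enq:   0.0 ms]                                                     //     time to enqueue object references
      [Free CSet:   0.3 ms]                                                   //     time to free the collection set
   [Eden: 12M(12M)->0B(10M) Survivors: 0B->2048K Heap: 13M(64M)->9739K(64M)]  //   collection result
                                                                              //   Eden usage went from 12M to 0B; its capacity shrank from 12M to 10M
                                                                              //   Survivors grew from 0B to 2048K
                                                                              //   heap usage dropped from 13M to 9739K; capacity unchanged
   [Times: user=0.59 sys=0.02, real=0.16 secs]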
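
The `[Eden: ... Survivors: ... Heap: ...]` summary line is usually the first thing to read in a pause entry. The sketch below turns it into byte counts so the amount reclaimed can be computed. `parse_summary` and `to_bytes` are hypothetical helpers, and the pattern assumes integer sizes with a `B`/`K`/`M`/`G` suffix exactly as in the excerpt above; other JDK versions may print the line differently.

```python
import re
from typing import Dict

_UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
_SIZE = r"\d+[BKMG]"

# used(capacity)->used(capacity) for Eden and Heap, used->used for Survivors.
_SUMMARY = re.compile(
    rf"\[Eden: (?P<eden_before>{_SIZE})\((?P<eden_cap_before>{_SIZE})\)"
    rf"->(?P<eden_after>{_SIZE})\((?P<eden_cap_after>{_SIZE})\) "
    rf"Survivors: (?P<surv_before>{_SIZE})->(?P<surv_after>{_SIZE}) "
    rf"Heap: (?P<heap_before>{_SIZE})\((?P<heap_cap_before>{_SIZE})\)"
    rf"->(?P<heap_after>{_SIZE})\((?P<heap_cap_after>{_SIZE})\)\]"
)

LINE = "[Eden: 12M(12M)->0B(10M) Survivors: 0B->2048K Heap: 13M(64M)->9739K(64M)]"


def to_bytes(size: str) -> int:
    """Convert a size such as '9739K' into bytes."""
    return int(size[:-1]) * _UNITS[size[-1]]


def parse_summary(line: str) -> Dict[str, int]:
    """Return every figure of the summary line as a byte count."""
    match = _SUMMARY.search(line)
    if match is None:
        raise ValueError("not a G1 heap summary line in the assumed format")
    return {name: to_bytes(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    summary = parse_summary(LINE)
    freed = summary["heap_before"] - summary["heap_after"]
    print("heap freed (K):", freed // 1024)
```

For the line above, 13M is 13312K, so the pause freed 13312K - 9739K = 3573K of heap while 2048K of live objects moved into the survivor space.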

References

Authoritative reference
https://blogs.oracle.com/poonam/understanding-g1-gc-logs

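Performance optimization

Overview

Business indicators
  • Request volume
  • Response time
  • Data volume
System indicators
  • CPU load
  • Memory pressure
  • IO pressure: network, disk
  • Storage pressure: database, cache
  • Middleware pressure: message queues
  • Java: JVM
Third category of indicators
  • System monitoring/alerting
  • System logs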

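Troubleshooting approach

How to think about it
  • What is the most critical system/interface?
  • Where have most of the bugs/incidents occurred in the recent past?
  • If the number of stores/order volume/… grows 2x/5x/10x, which point gives out first?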
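
The last question above can be turned into a rough back-of-the-envelope check. The sketch below assumes that resource utilization grows linearly with load, which is a simplification, and every number in it is hypothetical rather than measured; `saturated_at` is an illustrative helper, not part of any tool.

```python
from typing import Dict, List


def saturated_at(utilization: Dict[str, float], factor: float) -> List[str]:
    """Resources whose projected utilization reaches 100% at the given growth factor.

    Assumes utilization scales linearly with load. Results are ordered worst first.
    """
    projected = {name: used * factor for name, used in utilization.items()}
    over = [name for name, value in projected.items() if value >= 1.0]
    return sorted(over, key=lambda name: projected[name], reverse=True)


if __name__ == "__main__":
    # Hypothetical current utilization as a fraction of capacity.
    current = {"cpu": 0.30, "db_connections": 0.45, "disk_io": 0.15}
    for factor in (2, 5, 10):
        print(f"{factor}x:", saturated_at(current, factor))
```

With these made-up figures nothing saturates at 2x, while the database connections are the first resource to run out by 5x, so that is where the investigation would start.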
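Troubleshooting process
  • Define the scope and the points to check
  • Investigate and identify the problems
  • Determine the impact of each problem and rank them by priority
  • Solve the problems one by one
Think first, then act
  • Don't be too quick to act, fixing each issue the moment it is found; what you fix may not be the important problem
  • Don't act rashly on an incomplete investigation and trigger a chain reaction
Think comprehensively, plan comprehensively
  • The canary (gradual rollout) strategy, rollback strategy, and branching strategy must all be covered
Logic and data
  • The solution is sound in logic
  • The data (pre-evaluation and post-evaluation) shows a good result
  • Keep analyzing bad cases and iterating on the solution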