最新消息:

linux perf – 性能测试和优化工具

未分类 admin 5142浏览 0评论

Perf简介

Perf是Linux kernel自带的系统性能优化工具。虽然它的版本还只是0.0.2,Perf已经显现出它强大的实力,足以与目前Linux流行的OProfile相媲美了。

Perf 的优势在于与Linux Kernel的紧密结合,它可以最先应用到加入Kernel的new feature。而像OProfile, GProf等通常会“慢一拍”。Perf的基本原理跟OProfile等类似,也是在CPU的PMU registers中Get/Set performance counters来获得诸如instructions executed, cache-missed suffered, branches mispredicted等信息。Linux kernel对这些registers进行了一系列抽象,所以你可以按进程,按CPU或者按counter group等不同类别来查看Sample信息。

 

使用Perf

Perf的使用流程和OProfile很像。所以如果你会用OProfile的话,用Perf就很简单。这里只是简单翻译一下在[1]中的Perf examples中举的例子。有更多发现的话以后会继续更新。

$ perf record -f -- git gc   Counting objects: 1283571, done. Compressing objects: 100% (206724/206724), done. Writing objects: 100% (1283571/1283571), done. Total 1283571 (delta 1070675), reused 1281443 (delta 1068566) [ perf record: Captured and wrote 31.054 MB perf.data (~1356768 samples) ] 
  
$ perf report --sort comm,dso,symbol | head -10
# Samples: 1355726
#
# Overhead          Command                            Shared Object  Symbol
# ........  ...............  .......................................  ......
#
    31.53%              git  /usr/bin/git                             [.] 0x0000000009804f
    13.41%        git-prune  /usr/bin/git-prune                       [.] 0x000000000ad06d
    10.05%              git  /lib/tls/i686/cmov/libc-2.8.90.so        [.] _nl_make_l10nflist
     5.36%        git-prune  /usr/lib/libz.so.1.2.3.3                 [.] 0x00000000009d51
     4.48%              git  /lib/tls/i686/cmov/libc-2.8.90.so        [.] memcpy

perf record相当于opcontrol –-start, 而perf report相当于opreport.

Perf用例

查看所有可用的counters用’perf list’:

titan:~> perf list
 [...]
 kmem:kmalloc                             [Tracepoint event]
 kmem:kmem_cache_alloc                    [Tracepoint event]
 kmem:kmalloc_node                        [Tracepoint event]
 kmem:kmem_cache_alloc_node               [Tracepoint event]
 kmem:kfree                               [Tracepoint event]
 kmem:kmem_cache_free                     [Tracepoint event]
 kmem:mm_page_free_direct                 [Tracepoint event]
 kmem:mm_pagevec_free                     [Tracepoint event]
 kmem:mm_page_alloc                       [Tracepoint event]
 kmem:mm_page_alloc_zone_locked           [Tracepoint event]
 kmem:mm_page_pcpu_drain                  [Tracepoint event]
 kmem:mm_page_alloc_extfrag               [Tracepoint event]
你可以用以上counter的任意组合来跑你的测试程序。比如,用以下命令来看跑
hackbench时page alloc/free的次数。
titan:~> perf stat -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc
-e kmem:mm_pagevec_free -e kmem:mm_page_free_direct ./hackbench 10
Time: 0.575
Performance counter stats for './hackbench 10':
         13857  kmem:mm_page_pcpu_drain
         27576  kmem:mm_page_alloc
          6025  kmem:mm_pagevec_free
         20934  kmem:mm_page_free_direct
   0.613972165  seconds time elapsed
Perf可以帮你统计N次结果的数值波动情况:
titan:~> perf stat --repeat 5 -e kmem:mm_page_pcpu_drain -e
  kmem:mm_page_alloc -e kmem:mm_pagevec_free -e
  kmem:mm_page_free_direct ./hackbench 10
Time: 0.627
Time: 0.644
Time: 0.564
Time: 0.559
Time: 0.626
Performance counter stats for './hackbench 10' (5 runs):
         12920  kmem:mm_page_pcpu_drain    ( +-   3.359% )
         25035  kmem:mm_page_alloc         ( +-   3.783% )
          6104  kmem:mm_pagevec_free       ( +-   0.934% )
         18376  kmem:mm_page_free_direct   ( +-   4.941% )
   0.643954516  seconds time elapsed   ( +-   2.363% )
有了以上的统计数据,你可以开始sample某一个你关心的tracepoint(比如page
allocations):
titan:~/git> perf record -f -e kmem:mm_page_alloc -c 1 ./git gc
Counting objects: 1148, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (450/450), done.
Writing objects: 100% (1148/1148), done.
Total 1148 (delta 690), reused 1148 (delta 690)
[ perf record: Captured and wrote 0.267 MB perf.data (~11679 samples) ]
查看哪个function引起了page allocations:
titan:~/git> perf report
# Samples: 10646
#
# Overhead          Command               Shared Object
# ........  ...............  ..........................
#
   23.57%       git-repack  /lib64/libc-2.5.so
   21.81%              git  /lib64/libc-2.5.so
   14.59%              git  ./git
   11.79%       git-repack  ./git
    7.12%              git  /lib64/ld-2.5.so
    3.16%       git-repack  /lib64/libpthread-2.5.so
    2.09%       git-repack  /bin/bash
    1.97%               rm  /lib64/libc-2.5.so
    1.39%               mv  /lib64/ld-2.5.so
    1.37%               mv  /lib64/libc-2.5.so
    1.12%       git-repack  /lib64/ld-2.5.so
    0.95%               rm  /lib64/ld-2.5.so
    0.90%  git-update-serv  /lib64/libc-2.5.so
    0.73%  git-update-serv  /lib64/ld-2.5.so
    0.68%             perf  /lib64/libpthread-2.5.so
    0.64%       git-repack  /usr/lib64/libz.so.1.2.3

更进一步的查看:

titan:~/git> perf report –sort comm,dso,symbol

# Samples: 10646
#
# Overhead          Command               Shared Object  Symbol
# ........  ...............  ..........................  ......
#
    9.35%       git-repack  ./git                       [.] insert_obj_hash
    9.12%              git  ./git                       [.] insert_obj_hash
    7.31%              git  /lib64/libc-2.5.so          [.] memcpy
    6.34%       git-repack  /lib64/libc-2.5.so          [.] _int_malloc
    6.24%       git-repack  /lib64/libc-2.5.so          [.] memcpy
    5.82%       git-repack  /lib64/libc-2.5.so          [.] __GI___fork
    5.47%              git  /lib64/libc-2.5.so          [.] _int_malloc
    2.99%              git  /lib64/libc-2.5.so          [.] memset
  
同时,call-graph(函数调用图)也可以被记录下来,并且能告诉你每个函数所占用的百分比。
titan:~/git> perf record -f -g -e kmem:mm_page_alloc -c 1 ./git gc
Counting objects: 1148, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (450/450), done.
Writing objects: 100% (1148/1148), done.
Total 1148 (delta 690), reused 1148 (delta 690)
[ perf record: Captured and wrote 0.963 MB perf.data (~42069 samples) ]
titan:~/git> perf report -g
# Samples: 10686
#
# Overhead          Command               Shared Object
# ........  ...............  ..........................
#
   23.25%       git-repack  /lib64/libc-2.5.so
               |
               |--50.00%-- _int_free
               |
               |--37.50%-- __GI___fork
               |          make_child
               |
               |--12.50%-- ptmalloc_unlock_all2
               |          make_child
               |
                --6.25%-- __GI_strcpy
   21.61%              git  /lib64/libc-2.5.so
               |
               |--30.00%-- __GI_read
               |          |
               |           --83.33%-- git_config_from_file
               |                     git_config
               |                     |
  [...]
用以下命令可以查看整个系统10秒内的page allocation次数:

titan:~/git> perf stat -a -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct sleep 10

Performance counter stats for 'sleep 10':
        171585  kmem:mm_page_pcpu_drain
        322114  kmem:mm_page_alloc
         73623  kmem:mm_pagevec_free
        254115  kmem:mm_page_free_direct
  10.000591410  seconds time elapsed

用以下命令查看每隔1秒,系统page allocation的波动状况:

titan:~/git> perf stat --repeat 10 -a -e kmem:mm_page_pcpu_drain -e
  kmem:mm_page_alloc -e kmem:mm_pagevec_free -e
  kmem:mm_page_free_direct sleep 1
Performance counter stats for 'sleep 1' (10 runs):
         17254  kmem:mm_page_pcpu_drain    ( +-   3.709% )
         34394  kmem:mm_page_alloc         ( +-   4.617% )
          7509  kmem:mm_pagevec_free       ( +-   4.820% )
         25653  kmem:mm_page_free_direct   ( +-   3.672% )
   1.058135029  seconds time elapsed   ( +-   3.089% )

通过反汇编往往能找出是哪行代码生成的指令会引起问题。

titan:~/git> perf annotate __GI___fork
------------------------------------------------
 Percent |      Source code & Disassembly of libc-2.5.so
------------------------------------------------  : Disassembly of section .plt:  : Disassembly of section .text:  : 00000031a2e95560 <__fork>: [...] 0.00 : 31a2e95602: b8 38 00 00 00 mov $0x38,�x 0.00 : 31a2e95607: 0f 05 syscall 83.42 : 31a2e95609: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax 0.00 : 31a2e9560f: 0f 87 4d 01 00 00 ja 31a2e95762 <__fork+0x202> 0.00 : 31a2e95615: 85 c0 test  �x,�x 
以上结果显示__GI__forks的83.42%的时间来源于0x38的系统调用。

值得优化某个特定的函数吗?

你也许想知道是否值得去优化你程序中的某个特定函数。一个很好的例子是git mailing list中讨论的关于SHA1 哈希算法优化的问题,我们可以用perf来预判优化的结果。具体参见Linus的回信[2].

"perf report --sort comm,dso,symbol" profiling shows the following for
'git fsck --full' on the kernel repo, using the Mozilla SHA1:
   47.69%               git  /home/torvalds/git/git     [.] moz_SHA1_Update
   22.98%               git  /lib64/libz.so.1.2.3       [.] inflate_fast
    7.32%               git  /lib64/libc-2.10.1.so      [.] __GI_memcpy
    4.66%               git  /lib64/libz.so.1.2.3       [.] inflate
    3.76%               git  /lib64/libz.so.1.2.3       [.] adler32
    2.86%               git  /lib64/libz.so.1.2.3       [.] inflate_table
    2.41%               git  /home/torvalds/git/git     [.] lookup_object
    1.31%               git  /lib64/libc-2.10.1.so      [.] _int_malloc
    0.84%               git  /home/torvalds/git/git     [.] patch_delta
    0.78%               git  [kernel]                   [k] hpet_next_event

很明显,SHA1加密算法的性能在这里很关键。

如何测量latency

如果你在build kernel时enabled了

 CONFIG_PERF_COUNTER=y
 CONFIG_EVENT_TRACING=y

那你可以加-tip参数来使用几个新的performance counter来测量scheduler的lantencies。

perf stat -e sched:sched_stat_wait -e task-clock ./hackbench 20

以上命令能够得出等待CPU用了多少时间。你可以重复10次这样的操作:

aldebaran:/home/mingo> perf stat --repeat 10 -e /
             sched:sched_stat_wait:r -e task-clock ./hackbench 20
Time: 0.251
Time: 0.214
Time: 0.254
Time: 0.278
Time: 0.245
Time: 0.308
Time: 0.242
Time: 0.222
Time: 0.268
Time: 0.244
Performance counter stats for './hackbench 20' (10 runs):
         59826  sched:sched_stat_wait    #      0.026 M/sec   ( +-   5.540% )
   2280.099643  task-clock-msecs         #      7.525 CPUs    ( +-   1.620% )
   0.303013390  seconds time elapsed   ( +-   3.189% )
读取scheduling的events counter
# perf list 2>&1 | grep sched:
 sched:sched_kthread_stop                   [Tracepoint event]
 sched:sched_kthread_stop_ret               [Tracepoint event]
 sched:sched_wait_task                      [Tracepoint event]
 sched:sched_wakeup                         [Tracepoint event]
 sched:sched_wakeup_new                     [Tracepoint event]
 sched:sched_switch                         [Tracepoint event]
 sched:sched_migrate_task                   [Tracepoint event]
 sched:sched_process_free                   [Tracepoint event]
 sched:sched_process_exit                   [Tracepoint event]
 sched:sched_process_wait                   [Tracepoint event]
 sched:sched_process_fork                   [Tracepoint event]
 sched:sched_signal_send                    [Tracepoint event]
 sched:sched_stat_wait                      [Tracepoint event]
 sched:sched_stat_sleep                     [Tracepoint event]
 sched:sched_stat_iowait                    [Tracepoint event]

对于latency analysis而言,stat_wait/sleep/iowait是值得注意的event。如果你想看所有delays和它们的mix/max/avg,你可以:

 perf record -e sched:sched_stat_wait:r -f -R -c 1 ./hackbench 20
 perf trace

perf stats for doing nothing
http://blog.csdn.net/bluebeach/article/details/5912062

Perf stats for “doing nothing”

I’ve recently discovered the perf Linux tool. I heard that oprofile was deprecated and that there is a new tool, and I noted down to try it sometime.

Updated: more languages, fixed typos, more details, some graphs. Apologies if this shows twice in your feed.

The problem with perf stats is that I hate bloat, or even perceived bloat. Even when it doesn’t affect me in any way, the concept of wasted cycles makes me really sad.

You probably can guess where this is going… I said, well, let’s see what perf says about a simple “null” program. Surely doing nothing should be just a small number of instructions, right?

Note: I think that perf also records kernel-side code, because the lowest I could get was about ~50K instructions for starting a null program in assembler that doesn’t use libc and just executes the syscall asm instruction. However, these ~50K instructions are noise the moment you start to use more high-level languages. Yes, this is expected, but the I was still shocked. And there’s lots of delta between languages I’d expected to behave somewhat identical.

Again, this is not important in the real world. At all. They are just numbers, and probably the noise (due to short runtime) has lots of influence on the resulting numbers. And I might have screwed up the measurements somehow.

Test setup

Each program was the equivalent of ‘exit 0’ in the appropriate form for the language. During the measurements, the machine was as much as possible idle (single-user mode, measurements run at real-time priority, etc.). For compiled languages, -O2 was used. For scripts, a simple #!/path/to/interpreter (without options, except in the case of Python, see below) was used. Each program/script was run 500 times (perf’s -r 500) and I’ve checked that the variations were small (±0.80% on the metrics I used).

You can find all the programs I’ve used at http://git.k1024.org/perf-null.git/, the current tests are for the tag version perf-null-0.1.

The raw data for the below tables/graphs is at log-4.

Results

Compiled languages

Language Cycles Instructions
asm 63K 51K
c-dietlibc 74K 57K
c-libc-static 177K 107K
c-libc-shared 506K 300K
c++-static 178K 107K
c++-dynamic 1,750K 1,675K
haskell-single 2,229K 1,338K
haskell-threaded 2,629K 1,522K
ocaml-bytecode 3,271K 2,741K
ocaml-native 1,042K 666K

Going from dietlibc to glibc doubles the number of instructions, and for libc going from static to dynamic linking again roughly doubles it. I didn’t manage to compile a program dynamically-linked against dietlibc.

C++ is interesting. Linked statically, it is in the same ballpark as C, but when linked dynamically, it executes an order of magnitude (!) more instructions. I would guess that the initialisation of the standard C++ library is complex?

Haskell, which has a GC and quite a complex runtime, executes slightly less instructions than C++, but uses more cycles. Not bad, given the capabilities of the runtime. The two versions of the Haskell program are with the single-threaded runtime and with the multi-threaded one; not much difference. A fully statically-linked Haskell binary (not recommended usually) goes below 1M instructions, but not by much.

OCaml is a very nice surprise. The bytecode runtime is a bit slow to startup, but the (native) compiled version is quite fast to start: only 2× number of instructions and cycles compared to C, for an advanced language. And twice as fast as Haskell ☺. Nice!

Shells

Language Cycles Instructions
dash 766K 469K
bash 1,680K 1,044K
mksh 1,258K 942K
mksh-static 504K 322K

So, dash takes ~470K instructions to start, which is way below the C++ count and a bit higher than the C one. Hence, I’d guess that dash is implemented in C ☺.

Next, bash is indeed slower on startup than dash, and by slightly more than 2× (both instructions and cycles). So yes, switching /bin/sh from bash to dash makes sense.

I wasn’t aware of mksh, so thanks for the comments. It is, in the static variant, more efficient that dash, by about 1.5×. However, the dynamically linked version doesn’t look too great (dash is also dynamically linked; I would guess a statically-linked dash “beats” mksh-static).

Text processing

I’ve added perl here (even though it’s a ‘full’ language) just for comparison; it’s also in the next section.

Language Cycles Instructions
mawk 849K 514K
gawk 1,363K 980K
perl 2,946K 2,213K

A normal spread. I knew the reason why mawk is Priority: required is that it’s faster than gawk, but I wouldn’t have guessed it’s almost twice as fast.

Interpreted languages

Here is where the fun starts…

Language Cycles Instructions
lua 5.1 1,947K 1,485K
lua 5.2 1,724K 1,335K
lua jit 1,209K 803K
perl 2,946K 2,213K
tcl 8.4 5,011K 4,552K
tcl 8.5 6,888K 6,022K
tcl 8.6 8,196K 7,236K
ruby 1.8 7,013K 6,128K
ruby 1.9.3 35,870K 35,022K
python 2.6 -S 11,752K 10,247K
python 2.7 -S 11,438K 10,198K
python 3.2 -S 29,003K 27,409K
pypy -S 21,106K 10,036K
python 2.6 25,143K 21,989K
python 2.7 47,325K 50,217K
python 2.7 -O 47,341K 50,185K
python 3.2 113,567K 124,133K
python 3.2 -O 113,424K 124,133K
pypy 90,779K 68,455K

The numbers here are not quite what I expected. There’s a huge delta between the fastest (hi Lua!) and the slowest (bye Python!).

I wasn’t familiar with Lua, so I tested it thanks to the comments. It is, I think, the only language which actually improves from one version to the next (bonus points), and where the JIT version also make is faster. In context, lua jit starts faster than C++.

Perl is the one that goes above C++’s instructions count, but not by much. From the point of view of the system, a Perl ‘hello world’ is only about 1.3×-1.6x slower than a C++ one. Not bad, not bad.

Next category is composed of TCL and Ruby, both of which had older versions 2-3× slower than Perl, but whose most recent versions are even more slower. TCL has an almost constant slowdown across versions (5M, 6.9M, 8.2M cycles), but Ruby seems to have taken a significant step backwards: 1.9.3 is 5× slower than 1.8. I wonder why? As for TCL, I didn’t expect it to be slower to startup than Perl; good to know.

Last category is Python. Oh my. If you run perf stat python -c 'pass' you get some unbelievable numbers, like 50M instructions to do, well, nothing. Yes, it has a GC, yes, it does import modules at runtime, but still… On closer investigation, the site module and the imports it does do eat a lot of time. Running a simpler python -S brings it back to a more reasonable 10M instructions, which is in-line with the other interpreted languages.

However, even with the -S taken into account, Python also slows down across versions: a tiny improvement from 2.6 to 2.7, but (like Ruby) a 3× slowdown from 2.7 to 3.2. Trying the “optimised” version (-O) doesn’t help at all. Trying pypy, which was based on Python 2.7, makes it around 2× slower to startup (both with and without -S).

So in the interpreted languages, it seems only Lua is trying to improve, the rest of the languages are piling up bloat with every version. Note: I should have tried multiple perl versions too.

Java

Java is in its own category; you guess why ☺, right?

GCJ was version 4.6, whereas by java below I mean OpenJDK Runtime Environment (IcedTea6 1.11) (6b24-1.11-4).

Language Cycles Instructions
null-gcj 97,156K 74,576K
java -jamvm 85,535K 80,102K
java -server 147,174K 136,803K
java -zero 132,967K 124,977K
java -cacao 229,799K 205,312K

Using gcj to compile to “native code” (not sure whether that’s native-native or something else) results in a binary that uses less than 100M cycles to start, but the jamvm VM is faster than that (85M cycles). Not bad for java! Python 3.2 is slower to startup—yes, I think the world has gone crazy.

However, the other VMs are a few times slower: server (the default one) is ~150M cycles, and cacao is ~230M cycles. Wow.

The other thing about java is that it was the only one that couldn’t be put nicely in a file that you just ‘exec’ (there is binfmt_misc indeed, but that doesn’t allow different Java classes to use different Java VMs, so I don’t count this), as opposed to every single other thing I tested here. Someone didn’t grow on Unix?

Comparative analysis

Since there are almost 4 orders of magnitude difference between all the things tested here, a graph of cycles or instructions is not really useful. However, cycles/instruction, branches percentage and branches miss-predicted percentage can be. Hence first the cycles/instructions:

Cycles/instruction

Pypy is jumping out of the graph here, with the top value of over 2 cycles/instruction. Lua JIT is also bigger than Lua non-JIT, so maybe there’s something to this (mostly joking, two data points don’t make a series). On the other hand, Python wins as best cycles/instruction (0.91). Lots of ILP, to get below 1?

Java gets, irrespective of VM, consistently near 1.0-1.1. C++ gets very different numbers between static linking (1.666) and dynamic linking (1.045), whereas C has basically identical numbers. mksh also has a difference between dynamic and static linking. Hmm…

Ruby, TCL and Python have consistent values across versions.

And that’s about what I can see from that graph. Next up, percentage of branches out of total instructions and percentage of branches missed:

Branch statistics

Note that the two lines shouldn’t really be on the same graph; for the branch %, the 100% is the total instructions count, but for the branch miss %, the 100% is the total branch count. Anyway.

There are two low-value outliers:

  • dynamically-linked C++ has a low branch percentage (17.46%) and a very low branch miss percentage (only 4.32%)
  • gcj-compiled java has a very low branch miss percentage (only 2.82%!!!), even though is has a “regular” branch percentage (20.85%)

So it seems the gcj libraries are well optimised? I’m not familiar enough with this topic, but on the graph it does indeed stand out.

On the other end, mksh-static has a high branch miss percentage: 11.60%, which jumps clearly ahead of all the others; this might be why it has a high cycles/instruction count, due to all the stalls in misprediction; one has to wonder why it confuses the branch predictor?

I find it interesting that the overall branch count is very similar across languages, both when most of the cost is in the kernel (e.g. asm) and when the user-space cost heavily over-weighs the kernel (e.g. Java). The average is 20.85%, minimum is 17.46%, max 22.93%, standard deviation (if I used gnumeric correctly) is just 0.01. This seems a bit suspicious to me ☺. On the other hand, the mispredicted branches percentage varies much more: from a measly 2.82% to 11.60% (5x difference).

Summary

So to recap, counting just instructions:

  • going from dietlibc to glibc: 2× increase
  • going from statically-linked libc to dynamically-linked libc: doubles it again
  • going from C to C++: 5× increase
  • C++ to Perl: 1.3×
  • Perl to Ruby: 3×
  • Ruby to Python (-S): 1.6x
  • Python -S to regular Python: 5x
  • Python to Java: 1×-2×, depending on version/runtime
  • branch percentage (per total instructions) is quite consistent across all of the programs

Overall, you get roughly three orders of magnitude slower startup between a plain C program using dietlibc and Python. And all, to do basically nothing.

On the other hand, I learned some interesting things while doing it, so it wasn’t quite for nothing ☺.


转载请注明:爱开源 » linux perf – 性能测试和优化工具

您必须 登录 才能发表评论!