linux gprof tool

https://www.ibm.com/developerworks/cn/linux/l-gnuprof.html

Tensilica

Profiling with the Xtensa ISS has several advantages over hardware profiling:

- You do not need to compile the Xtensa program with special options (e.g., ‘-hwpg’) before profiling it.
- There is no instrumentation code added to the Xtensa program, so the profile results are not distorted by any extra code.
- The Xtensa ISS can easily record the execution of every instruction, so there is no need to rely on statistical approximations like PC-sampling.
- Instead of counting execution cycles, the Xtensa ISS can optionally record profile data for other events, such as cache misses. You can then use xt-gprof or Xplorer to view a profile of these other events.

benchmark

pipeline interlock

Consider the following pair of instructions, where the AND reads the register that the preceding LD writes:
LD adr -> r10
AND r10,r3 -> r11
The data read from address adr is not available until the end of the Memory Access stage of the LD instruction. By that time, the AND instruction has already passed through the ALU. Resolving this by forwarding alone would require the data from memory to be passed backwards in time to the input of the ALU, which is not possible. The solution is to delay the AND instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are stalled: they are prevented from flopping their inputs and so stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no-operation instruction (NOP) inserted between the LD and AND instructions.
This NOP is called a pipeline bubble because it floats through the pipeline like an air bubble, occupying resources without producing useful results. The hardware that detects a data hazard and stalls the pipeline until the hazard clears is called a pipeline interlock.
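
The load-use rule described above can be sketched as a small model that counts the bubbles an interlock would insert. This is only an illustration under stated assumptions: a simple in-order pipeline where a load's result is needed one cycle too early by the very next instruction. The types and names (`insn`, `load_use_hazard`, `count_bubbles`) are hypothetical and do not correspond to any Xtensa tool or API.

```c
#include <stddef.h>

/* Hypothetical instruction record for a simple in-order pipeline model. */
typedef enum { OP_LD, OP_ALU } op_kind;

typedef struct {
    op_kind kind;
    int dest;   /* destination register number */
    int src1;   /* first source register, or -1 if unused */
    int src2;   /* second source register, or -1 if unused */
} insn;

/* Returns 1 when `next` reads the register that the load `prev` writes. */
int load_use_hazard(const insn *prev, const insn *next)
{
    if (prev->kind != OP_LD)
        return 0;
    return next->src1 == prev->dest || next->src2 == prev->dest;
}

/* Counts the one-cycle bubbles the interlock would insert, assuming only
 * back-to-back load-use pairs stall (all other results are forwarded). */
size_t count_bubbles(const insn *prog, size_t n)
{
    size_t bubbles = 0;
    for (size_t i = 1; i < n; i++)
        bubbles += (size_t)load_use_hazard(&prog[i - 1], &prog[i]);
    return bubbles;
}
```

For the LD/AND pair above (LD writes r10, AND reads r10 and r3), `count_bubbles` reports one bubble. Placing an unrelated instruction between the two removes it, which is the kind of reordering a scheduling compiler attempts.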

branch delay

performance tuning

data alignment for vectorization

void sum(int *a, int *b, int *c, int n)
{
/* Tell the compiler that a, b, and c each point to 8-byte-aligned data. */
#pragma aligned (a, 8)
#pragma aligned (b, 8)
#pragma aligned (c, 8)
    int i;
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
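
The pragmas above are a promise to the compiler, so the caller has to pass pointers that really are 8-byte aligned. A minimal sketch of one way to do that follows, assuming a C11 library that provides `aligned_alloc`; the helper names `is_aligned` and `alloc_int_array_aligned8` are hypothetical, not part of any Xtensa API.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p is aligned to `align` bytes (align must be a power of two). */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Allocates room for n ints on an 8-byte boundary; release with free(). */
int *alloc_int_array_aligned8(size_t n)
{
    size_t bytes = n * sizeof(int);
    /* aligned_alloc expects a size that is a multiple of the alignment,
     * so round the byte count up to the next multiple of 8. */
    bytes = (bytes + 7u) & ~(size_t)7u;
    return aligned_alloc(8, bytes);
}
```

Buffers obtained this way can be passed to `sum`. A debug build could also check `is_aligned(a, 8)` before the call, since handing misaligned data to code compiled under an alignment promise may give wrong results or a fault.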

Controlling Vectorization Through Pragmas
The concurrent pragma, placed immediately before a loop, tells the compiler that each iteration of the loop is independent of all other iterations. This pragma will often make a loop vectorizable.

void copy(int *a, int *b, int n)
{
    int i;
#pragma concurrent
    for (i = 0; i < n; i++)
        a[i] = b[i];
}
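
The independence claim holds only if the arrays do not overlap in a way that makes one iteration read what another writes. For example, calling copy(buf + 1, buf, n) makes iteration i write buf[i + 1], which iteration i + 1 then reads, so the pragma would assert something false. A conservative check is sketched below; `ranges_overlap` is a hypothetical helper name, and it assumes a flat address space where comparing pointer values as integers is meaningful.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n-element int ranges starting at a and b share any bytes.
 * Conservative: it also flags a == b, where copying would still be safe. */
int ranges_overlap(const int *a, const int *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    uintptr_t len = (uintptr_t)(n * sizeof(int));
    return pa < pb + len && pb < pa + len;
}
```

A caller could assert `!ranges_overlap(a, b, n)` in debug builds before invoking a loop marked concurrent.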


Xtensa simulation environment tuning analysis (ISS profile)
