Xtensa Simulation Environment Tuning Analysis (ISS Profile)



Linux gprof tool

https://www.ibm.com/developerworks/cn/linux/l-gnuprof.html

Tensilica

Profiling with the Xtensa ISS has several advantages over hardware profiling:

  • You do not need to compile the Xtensa program with special options (e.g., ‘-hwpg’)
    before profiling it.
  • There is no instrumentation code added to the Xtensa program, so the profile results
    are not distorted by any extra code.
  • The Xtensa ISS can easily record the execution of every instruction, so there is no need
    to rely on statistical approximations like PC-sampling.
  • Instead of counting execution cycles, the Xtensa ISS can optionally record profile data
    for other events, such as cache misses. You can then use xt-gprof or Xplorer to view
    a profile of these other events.
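The difference between recording every instruction and PC-sampling can be illustrated with a small, self-contained sketch. It is not ISS or xt-gprof code: the trace format, `NUM_FUNCS`, `exact_profile`, and `sampled_profile` are all hypothetical names invented for this illustration. The trace stores, for each cycle, the index of the function that was executing.

```c
#include <stddef.h>

#define NUM_FUNCS 2  /* hypothetical: number of functions in the toy trace */

/* Exact profile: attribute every cycle of the trace to the function running in it. */
static void exact_profile(const int *trace, size_t n, unsigned counts[NUM_FUNCS])
{
    for (size_t f = 0; f < NUM_FUNCS; f++)
        counts[f] = 0;
    for (size_t i = 0; i < n; i++)
        counts[trace[i]]++;
}

/* Sampled profile: inspect the trace only every `period` cycles and
 * credit each sample with `period` cycles, as a PC-sampling profiler would. */
static void sampled_profile(const int *trace, size_t n, size_t period,
                            unsigned counts[NUM_FUNCS])
{
    for (size_t f = 0; f < NUM_FUNCS; f++)
        counts[f] = 0;
    for (size_t i = 0; i < n; i += period)
        counts[trace[i]] += (unsigned)period;
}
```

With a trace in which function 1 runs only on every fourth cycle, sampling with a period of 4 can miss function 1 entirely, while the exact profile charges it its true share. This is the kind of distortion that full instruction recording avoids.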

benchmark

  • pipeline interlock

    Consider the following instructions, where the AND uses the result of the load that immediately precedes it:
    LD adr -> r10
    AND r10,r3 -> r11
    The data read from the address adr is not present in the data cache until after the Memory Access stage of the LD instruction. By this time, the AND instruction is already through the ALU. To resolve this would require the data from memory to be passed backwards in time to the input to the ALU. This is not possible. The solution is to delay the AND instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are stalled – they are prevented from flopping their inputs and so stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no-operation instruction (NOP) inserted between the LD and AND instructions.
    This NOP is termed a pipeline bubble since it floats in the pipeline, like an air bubble, occupying resources but not producing useful results. The hardware to detect a data hazard and stall the pipeline until the hazard is cleared is called a pipeline interlock.
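The load-use interlock described above can be modelled with a short sketch. It assumes the simple five-stage pipeline from the text, with exactly one bubble when an instruction reads a register written by the load directly before it. The `Insn` record and both function names are hypothetical, and the actual latencies of a given Xtensa configuration may differ.

```c
#include <stddef.h>

/* Hypothetical, simplified instruction record (illustration only). */
typedef struct {
    int is_load; /* 1 if the result only becomes available after the memory stage */
    int dest;    /* destination register number, or -1 if none */
    int src1;    /* first source register, or -1 if unused */
    int src2;    /* second source register, or -1 if unused */
} Insn;

/* Returns 1 if `next` reads the register that the load `prev` writes. */
static int load_use_hazard(const Insn *prev, const Insn *next)
{
    if (!prev->is_load || prev->dest < 0)
        return 0;
    return next->src1 == prev->dest || next->src2 == prev->dest;
}

/* Issue cycles for n instructions: one cycle each, plus one bubble
 * for every instruction that depends on the load directly before it. */
static size_t issue_cycles(const Insn *prog, size_t n)
{
    size_t cycles = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && load_use_hazard(&prog[i - 1], &prog[i]))
            cycles++; /* interlock stalls fetch/decode for one cycle */
        cycles++;
    }
    return cycles;
}
```

In this model the LD/AND pair costs three issue cycles instead of two. Placing one independent instruction between them also costs three cycles but does useful work in the slot the bubble would have occupied, which is why schedulers try to separate a load from its first use.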

  • branch delay

performance tuning

  • data alignment for vectorization
void sum(int *a, int *b, int *c, int n)
{
#pragma aligned (a, 8)
#pragma aligned (b, 8)
#pragma aligned (c, 8)
    int i;
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
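The alignment pragmas are a promise to the compiler, so the caller has to make sure the arrays really are 8-byte aligned. The sketch below shows one portable way to do that with C11 `alignas`, plus a debug-time check. The helper `is_aligned`, the arrays `out`, `lhs`, `rhs`, and the wrapper `sum_checked` are hypothetical names for this example; the Xtensa pragmas are left out so the code compiles with any C11 compiler.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define N 8

/* Hypothetical helper: 1 if p is a multiple of `align` bytes
 * (align must be a power of two). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* C11 alignas guarantees at least 8-byte alignment for these arrays. */
static alignas(8) int out[N];
static alignas(8) int lhs[N];
static alignas(8) int rhs[N];

/* Same loop as sum(), with the alignment promise verified in debug builds. */
static void sum_checked(int *a, const int *b, const int *c, int n)
{
    assert(is_aligned(a, 8) && is_aligned(b, 8) && is_aligned(c, 8));
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Passing a pointer into the middle of an array (for example `out + 1`, which is only 4-byte aligned when `int` is 4 bytes) would break the promise, so a check like this is worth keeping during development.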
  • Controlling Vectorization Through Pragmas
    The concurrent pragma tells the compiler that each iteration of the loop is independent of all other iterations. This pragma will often make a loop vectorizable.
void copy(int *a, int *b, int n)
{
    int i;
#pragma concurrent
    for (i = 0; i < n; i++)
        a[i] = b[i];
}
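Independence of iterations is something the programmer asserts, not something the compiler checks. The sketch below, with hypothetical helper names `copy_serial` and `ranges_overlap`, shows a case where the same loop is not independent: when the destination overlaps the source one element ahead, each iteration reads a value written by the previous one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The same loop as copy(), without any pragma, executed strictly in order. */
static void copy_serial(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i];
}

/* Hypothetical helper: 1 if the n-element ranges at a and b share any bytes. */
static int ranges_overlap(const int *a, const int *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    uintptr_t bytes = (uintptr_t)(n * sizeof(int));
    return pa < pb + bytes && pb < pa + bytes;
}
```

Calling `copy_serial(buf + 1, buf, 4)` on `{1, 2, 3, 4, 5}` propagates the first element through the whole array, whereas `memmove` over the same ranges behaves as if every source element were read first. A vectorized loop reads several source elements before writing, so its result would differ from the serial one here. It therefore seems safest to apply the pragma only when the ranges are known not to overlap, which a check such as `ranges_overlap` can confirm during testing.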

