- UID
- 3808
- 精华
- 积分
- 1480
- 威望
- 点
- 宅币
- 个
- 贡献
- 次
- 宅之契约
- 份
- 最后登录
- 1970-1-1
- 在线时间
- 小时
|
本帖最后由 watermelon 于 2020-1-24 03:00 编辑
今天本来是有一道数据结构的题目,给出了三种不同的算法,要求自己动手实现,算出他们的时间复杂度并且编程计算各个函数所用的时间来和分析的时间增长率来做对比。
一开始我想的是原先翻译站长的帖子时候其教我的QueryPerformanceCount这个API来获取函数运行时间的;然后我想了想问问群里有没有别的更快的办法,结果和同学们有了很
深入的交流。其中了解到了rdtsc,tdtsc等指令。今天晚上我着重学习了rdtsc指令,以下内容均为intel手册https://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf中的学习笔
记,代码均为例程,由于学习了之前内容,所以对例程比较好理解,小弟我按照自己的理解在原注释的基础上增添了一些自己的注释,更加便于理解。
RDTSC全称为read-time stamp counter, 是一条指令用来计算程序运行时候cpu的时钟周期数,其会返回一个64-bit的数据;其中高32位储存在edx中,低32位储存在eax中。
在C语言中包含对应的头文件,内嵌汇编即可使用rdtsc指令,如果使用过QueryPerformanceCount这个API的同学一定对rdtsc的使用上手很快!
(一)rdtsc工作原理
rdtsc指令用来可以记录处理器每一个时钟周期循环,当重置的时候,rdtsc会将时间戳置零。此处应当注意!rdtsc记录的是时钟周期的循环数目而不是时间!这点与
QueryPerformanceCount在理解上是相通的,我们如果想要用rdtsc计算程序运行的时间,当然要用rdtsc记录的时钟周期数除以CPU在当时的频率即可。
seconds = cycles / frequency
Note:frequency is given in Hz, where:1,000,000Hz = 1Mhz(这里的兆是按照数学上处理的,而不是计算机中常见的1024*x)
(二)rdtsc的在使用的时候的注意事项
有一些编译器使用rdtsc时候会显示编译错误,比如Visual C++ 5.0,此时可以使用 "__emit"来编写宏,是编译器可以正确认识到rdtsc和cpuid指令(cpuid指令的作用下面会讨论)。
此处两处宏定义为:
- #define cpuid __asm __emit 0fh __asm __emit 0a2h
- #define rdtsc __asm __emit 0fh __asm __emit 031h
复制代码
由于我使用的是VS2013,所以我不用加上这两句即可成功编译程序。
(三)影响rdtsc指令结果的因素
<1> Out-of-Order-Execution
此处的意思是在使用rdtsc时候,程序执行的顺序可能会和源代码中编写的代码的先后顺序不一样,手册上将这个原因归结于操作系统所做的事情太多了,可能会造成混乱。
此时我们就可以用到CPUID这条“顺序执行指令”(serializing instruction)了,CPUID这条指令可以保证程序执行的顺序和源代码编写的是一样的,但是也会有副作用,比如
cpuid指令使用的时间消耗要进行考虑,后期要将其去除,详细请见文章后面的例程。
比如Pentium Ⅱ Processor和Pentium Pro Processor由于rdtsc的Out-of-Order execution需要使用cpuid这样的串行指令来保证程序执行的先后顺序和源代码编写的先后顺序相同,
而在拥有MMX技术的处理器上则不用cpuid这条指令。
<2> Data Cache and Instruction Cache
手册上明确说了,在相同的代码段使用rdtsc可能会经常出现不同的结果(怎么样,薛定谔的猫),而这个经常是由于缓存效应引起的。只有小段的代码才能完全消除“缓存效应”(Cache effects),
此时需要将数据和指令全部存储在距离CPU最近的L1缓存器中,而将内存中的数据存入高速缓存的操作在之前通常叫做“Cache Warming”,注意,此处对数据的大小还有限制,要小于1KB。
并且还要将所用的指令写在一个循环或者函数中,并且要多次调用来获得更加精确的值,此处在例程中也会有非常明显的体现。
<3> Register Overwriting
在之前讨论了有些编译器不认得rdtsc和cpuid指令,故要在之前的宏定义中利用__emit来重写编译器。
但是此时rdtsc和cpuid所操作的寄存器有可能会因为这些指令改写寄存器而导致混乱(rdtsc改写edx,eax; cpuid改写eax, ebx, ecx, edx)。
因此编译器可能不会将数据正确存入上述寄存器中,所以必须要手动将数据压栈保存,类似于C语言中函数的开始部分的pushad操作,末尾的popad操作,后面的程序也会看到pushad和popad。
但是此时有两种情况可以对Register Overwriting不予考虑;第一:完全使用rdtsc指令编写的独立代码段。第二:被测量的代码段使用汇编语言编写,并在该代码段内的变量被实际使用了,
不能使只声明而不使用,此时编译器会自动处理栈的申请操作。
<4> Counter Overflow
计数溢出,这个一般不会出现,但是如果程序编写不小心,还是可以会出现的,比如低32位的结果赋值在eax中,可以记录时钟周期循环次数为2^32,如果一个CPU当时频率为400MHz,
那么他最多可以记录(2^32)/400M = 10.74秒,如果此时一个程序运行超过10秒,那么他就会溢出了。但是对于同样的频率400MHz的CPU,使用全部64位来记录使用时候,则需要1400年左右的时间才会溢出。
接下来是小弟我一句手册上的例子程序,认真读了并且在原来注释的基础上增添了一些自己的理解所增加的注释,便于读者理解,交流讨论,同时便于日后复习。
程序1:利用CPUID串行指令来使rdtsc按照源码编写顺序执行
- #include <Windows.h>
- #include <intrin.h>
- #include <stdio.h>
- #include <stdlib.h>
- /* If your compiler can't implicitly recongize the instruction
- * such as `cpuid` and `rdtsc`, you need to make these two macros
- * below to overwrite the complier.
- * For example, when you use the Visual C++ 5.0, you need to make
- * these two marcos. If you use the Visual Studio 2013, needn't.
- */
- //#define cpuid __asm __emit 0fh __asm __emit 0a2h
- //#define rdtsc __asm __emit 0fh __asm __emit 031h
- #pragma intrinsic(__rdtsc)
- int main(void)
- {
- int time, subtime;
- float x = 5.0f;
- __asm
- {
- // Make three warm-up passes through the timing
- // routine to make sure that the CPUID and RDTSC
- // instruction are ready.
- cpuid
- rdtsc
- mov subtime, eax // Let `subtime` to store lower
- // 32 bits data of rdtsc
- // TODO:
- // You can make some processes which you want to test
- // its time consuming.
- cpuid
- rdtsc
- sub eax, subtime
- mov subtime, eax // Calculate the result of time consuming.
- /* Make three duplications. */
- cpuid
- rdtsc
- mov subtime, eax
- cpuid
- rdtsc
- sub eax, subtime
- mov subtime, eax
- cpuid
- rdtsc
- mov subtime, eax
- cpuid
- rdtsc
- sub eax, subtime
- mov subtime, eax // Only the last value of subtime is k
- // subtime should now represent the overhead
- // cost of the MOV and CPUID instrcutions.
- fld x // Send the float `x` to the FPU register.
- fld x
- cpuid // Serialize execution
- rdtsc // Read the time stamp to eax
- mov time, eax
- fdiv // float type division
- cpuid // Serialize execution
- rdtsc
- sub eax, time // Get the difference which is the result of
- // CPU cycles.
- }
- time = time - subtime; // Subtract the overhead
- printf("%d\n", time); // Print the total time of divide to screen.
- return 0;
- }
复制代码
程序2:为了使结果更加精确,重复多次运行
- // This code will find an average number of cycles taken
- // to go through a loop. There is no cache warming, so
- // all cache effects are included in the cycle count.
- // To use this in your own code, simply paste in the six
- // marked sections into the designated locations in your code.
- #include <stdio.h>
- #include <Windows.h>
- #include <intrin.h>
- // If you use the Visual C++5.0 to compile your code, add these macros.
- /* BEING SECTION 1 */
- //#define CPUID __asm __emit 0fh __asm __emit 0a2h
- //#define RDTSC __asm __emit 0fh __asm __emit 031h
- /* END SECTION 1 */
- #define SIZE 5
- /* BEGIN SECTION 2 */
- unsigned FindBase(); // Function Declaration.
- /* END SECTION 2 */
- int main(int argc, char *argv[])
- {
- int i;
- /* BEGIN SECTION 3 */
- unsigned int base = 0, iterations = 0, sum = 0;
- unsigned int cycles_high1 = 0, cycles_low1 = 0;
- unsigned int cycles_high2 = 0, cycles_low2 = 0;
- unsigned __int64 temp_cycles1 = 0;
- unsigned __int64 temp_cycles2 = 0;
- // Stored signed so it can be converted to a
- // double for viewing.
- __int64 total_cycles = 0;
- double seconds = 0.0L;
- unsigned int mhz = 8100000000; // The frequency of CPU working now, it should be your CPU specification!!!
- base = FindBase();
- /* END SECTION 3 */
- for (i = 0; i < SIZE; i++)
- {
- /* BEGIN SECTION 4 */
- __asm
- {
- pushad
- cpuid // Lineral execution.
- rdtsc
- mov cycles_high1, edx
- mov cycles_low1, eax
- popad
- }
- /* END SECTION 4 */
- // TODO: User code to be measured is in this section.
- Sleep(3000); // Sleep 3 seconds.
- /* BEGIN SECTION 5 */
- __asm
- {
- pushad
- cpuid // Lineral execution.
- rdtsc
- mov cycles_high2, edx
- mov cycles_low2, eax
- popad
- }
- // Move the cycle counts into 64-bit integers.
- // It is easy to understand if you make a computation on your paper with your pen.
- temp_cycles1 = ((unsigned __int64)cycles_high1 << 32) | cycles_low1;
- temp_cycles2 = ((unsigned __int64)cycles_high2 << 32) | cycles_low2;
- // Add to total cycle count
- total_cycles += temp_cycles2 - temp_cycles1 - base;
- iterations++;
- /* END SECTION 5 */
- }
- // Now the total cycle count and iterations are available to be used as desired.
- // Example:
- seconds = (double)(total_cycles) / (double)(mhz);
- printf("Average cycles per loop: %f\n",
- (double)(total_cycles / iterations));
- printf("Average seconds per loop: %f\n",
- seconds / iterations);
- return 0;
- }
- /* BEGIN SECTION 6 */
- // This function used to measure the base time.
- unsigned int FindBase()
- {
- unsigned int base, base_extra = 0;
- unsigned int cycles_low, cycles_high;
- // The following test run the basic cycle counter to
- // find the overhead associated with each cycle
- // measurement. It is run multiple times simply
- // because the first call to CPUID normally takes
- // longer than subsequent calls.
- // Typically after the second run the results are
- // consistent. It is run three times just to make sure.
- __asm
- {
- pushad
- cpuid
- rdtsc
- mov cycles_high, edx
- mov cycles_low, eax
- popad
- pushad
- cpuid
- rdtsc
- popad
- pushad
- cpuid
- rdtsc
- mov cycles_high, edx
- mov cycles_low, eax
- popad
- pushad
- cpuid
- rdtsc
- popad
- pushad
- cpuid
- rdtsc
- // You know, `cycles_high` and `cycles_low` are multiply evaluated.
- // To make the result more precise.
- mov cycles_high, edx
- mov cycles_low, eax
- popad
- pushad
- cpuid
- rdtsc
- sub eax, cycles_low
- mov base_extra, eax // Computation.
- popad
- pushad
- cpuid
- rdtsc
- mov cycles_high, edx
- mov cycles_low, eax
- popad
- pushad
- cpuid
- rdtsc
- sub eax, cycles_low
- mov base, eax // Computation, again.
- popad
- }
- // The following provides insurance for the above code,
- // in the instance the final test causes a miss to
- // the instruction cache.
- if (base_extra < base)
- {
- base = base_extra;
- }
- return base;
- }
- /* END SECTION 6 */
复制代码
程序3:讨论利用小段代码和重复计算结果来消除Cache effects
- // Code for testing a small, stand-alone section of code
- // for repeatable results.
- #include <stdio.h>
- #include <Windows.h>
- #include <intrin.h>
- // ....You know, if you use the msvc5.0, please add these two marcos.
- //#define CPUID __asm __emit 0fh __asm __emit 0a2h
- //#define RDTSC __asm __emit 0fh __asm __emit 031h
- unsigned int TestFunc();
- int main(int argc, char *argv[])
- {
- unsigned int base = 0;
- unsigned int cyc, cycles1, cycles2, cycles3;
- unsigned int cycles4, cycles5;
- // The following tests run the basic cycle counter to
- // find the overhead associated with each cycle
- // measurement.
- // It is run multiple times simply because the first
- // call to CPUID normally takes longer than subsequent
- // calls. Typically after the second run the results
- // consistent. It is run three times just to make sure.
- __asm
- {
- cpuid
- rdtsc
- mov cyc, eax
- cpuid
- rdtsc
- sub eax, cyc
- mov base, eax
- cpuid
- rdtsc
- mov cyc, eax
- cpuid
- rdtsc
- sub eax, cyc
- mov base, eax
- cpuid
- rdtsc
- mov cyc, eax
- cpuid
- rdtsc
- sub eax, cyc
- mov base, eax
- }
- // This section calls the function that contains the
- // user's code. It is called multiple times to
- // eliminate instruction cache effects and get
- // repeatable results.
- cycles1 = TestFunc();
- cycles2 = TestFunc();
- cycles3 = TestFunc();
- cycles4 = TestFunc();
- cycles5 = TestFunc();
- // By the second or third run, both data and instruction
- // cache effects should have been eliminated, and
- // results will be consistent.
- printf("Base : %d\n", base);
- printf("Cycle counts:\n");
- printf("%d\n", cycles1 - base);
- printf("%d\n", cycles2 - base);
- printf("%d\n", cycles3 - base);
- printf("%d\n", cycles4 - base);
- printf("%d\n", cycles5 - base);
- return 0;
- }
- unsigned int TestFunc()
- {
- // TEST CODE, DO WHAT YOU WANT.
- float z, q, x, y;
- float result;
- unsigned cycles;
- // Load the values here, not at creation, to make sure
- // the data is moved into the cache.
- // Actually, I have a question here????????
- // According to the previous pages,
- // this varibles should be defined in this section.
- cycles = 0;
- result = 0.0f;
- x = 2.0f;
- y = 100.0f;
- z = 12.0f;
- q = 5.0f;
- __asm
- {
- cpuid
- rdtsc
- mov cycles, eax
- }
- // process
- z += y;
- q *= x;
- result = z / q;
- __asm
- {
- cpuid
- rdtsc
- sub eax, cycles
- mov cycles, eax
- }
- return cycles;
- }
复制代码
由于小弟英文水平和知识水平有限,肯定会有一些理解有偏差甚至知识性错误的地方,希望各位大佬多多批评指正。
参考文献:
Intel ---- Use the RDTSC Instruction for Performance Moniroting
https://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf
MSDN: https://docs.microsoft.com/zh-cn/previous-versions/twchhe95(v=vs.110)
blog:
http://blog.chinaunix.net/uid-24774106-id-2779245.html
https://blog.csdn.net/solstice/article/details/5196544 |
|