Debugging Race Conditions
Multithreading makes writing, maintaining, and debugging parallel units of execution considerably harder, especially when those units share resources such as global variables or thread context state.
Broken invariants introduced by the parallel nature of a program and the relative order in which its threads get scheduled by the operating system are called race conditions.
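These failures are hard to reproduce because the thread scheduling order that triggers them is nondeterministic. The best way to prevent this class of bug is careful review during the design and coding stages.
Multithreaded programming is hard even for the best engineers. The first step in avoiding these bugs is to recognize them. Race conditions mainly arise in three ways: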
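1. Shared memory is modified from multiple threads without proper synchronization. The usual result is a logic error in which internal state is corrupted by unsynchronized access. This case is especially nasty because it typically produces no obvious symptom such as a crash, only wrong results.
2. The lifetime of a shared variable ends before the worker threads that use it have finished. Once the variable is freed, those threads access invalid memory.
3. Code that belongs to a module, such as a DLL, is executed by a worker thread after the module has been unloaded. This also results in an access violation.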
    ...
        // Start multiple threads to use the shared hash object
        //
        for (n = 0; n < threads.Length; n++)
        {
            threads[n] = new Thread(ThreadProc);
            threads[n].Start(Encoding.UTF8.GetBytes("abc"));
        }
        //
        // Wait for all the threads to finish
        //
        for (n = 0; n < threads.Length; n++)
        {
            threads[n].Join();
        }
    }
    private static HashAlgorithm g_hashFunc;
}
C:\book\code\chapter_09\HashRaceCondition\Bug>test.exe 2
Thread #3 done processing. Hash value was qZk+NkcGgWq6PiVxeFDCbJzQ2J0=
Thread #4 done processing. Hash value was qZk+NkcGgWq6PiVxeFDCbJzQ2J0=
C:\book\code\chapter_09\HashRaceCondition\Bug>test.exe 2
Thread #3 done processing. Hash value was +MHYcAb79+XMSwJsMTi8BGiD3HE=
Thread #4 done processing. Hash value was AAAAAAAAAAAAAAAAAAAAAAAAAAA=
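There are usually two ways to fix the error above: 1. give each thread its own hash object; 2. guard the global variable with a lock. These two approaches sit at opposite extremes.
The former requires every thread to manage the lifetime of a hash object, creating and destroying it. The latter serializes the hashing work and so gives up the performance advantage of multiple cores.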
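The two extremes can be sketched side by side. The example below is in portable C++ rather than the C# of the listing, and it is only an illustration: `ToyHasher` is a hypothetical stand-in for a stateful hash object (it uses the FNV-1a constants, not the algorithm from the original program), and the function names `HashWithPerThreadObjects` and `HashWithGlobalLock` are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a stateful hash object: it keeps intermediate
// state in a member, so sharing one instance without a lock is unsafe.
class ToyHasher {
public:
    std::uint32_t Compute(const std::string& data) {
        state_ = 2166136261u;            // reset to the FNV-1a offset basis
        for (unsigned char c : data) {
            state_ ^= c;
            state_ *= 16777619u;         // FNV-1a prime
        }
        return state_;
    }
private:
    std::uint32_t state_ = 0;
};

// Option 1: every thread creates and destroys its own hasher.
std::vector<std::uint32_t> HashWithPerThreadObjects(const std::string& data, int threadCount) {
    std::vector<std::uint32_t> results(threadCount);
    std::vector<std::thread> threads;
    for (int n = 0; n < threadCount; n++) {
        threads.emplace_back([&results, &data, n] {
            ToyHasher local;             // private to this thread
            results[n] = local.Compute(data);
        });
    }
    for (auto& t : threads) t.join();
    return results;
}

// Option 2: one shared hasher, with every use serialized by a lock.
std::vector<std::uint32_t> HashWithGlobalLock(const std::string& data, int threadCount) {
    ToyHasher shared;
    std::mutex sharedLock;
    std::vector<std::uint32_t> results(threadCount);
    std::vector<std::thread> threads;
    for (int n = 0; n < threadCount; n++) {
        threads.emplace_back([&, n] {
            std::lock_guard<std::mutex> guard(sharedLock);  // one thread hashes at a time
            results[n] = shared.Compute(data);
        });
    }
    for (auto& t : threads) t.join();
    return results;
}
```

Both functions return the same values for the same input. The difference is cost: the first pays for an object per thread, while the second lets only one thread hash at any moment.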
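A better solution is the object pool design pattern.
An object pool is a collection of objects. A thread checks an object out of the pool to perform an operation and returns it afterwards, at which point the object is marked as reusable.
Resetting an object to a reusable state is clearly cheaper than creating a new one, and the pattern still allows threads to run in parallel.
A pool is usually pre-filled with objects, although the version below creates them lazily. A typical object pool is defined as follows: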
class ObjectPool<T>
    where T : class, new()
{
    public ObjectPool(
        int capacity
        )
    {
        ...
        m_objects = new Stack<T>(capacity);
        //
        // objects will be lazily created only as needed
        //
        for (n = 0; n < capacity; n++)
        {
            m_objects.Push(null);
        }
        m_objectsLock = new object();
        m_semaphore = new Semaphore(capacity, capacity);
    }

    public T GetObject()
    {
        T obj;
        m_semaphore.WaitOne();
        lock (m_objectsLock)
        {
            obj = m_objects.Pop();
        }
        if (obj == null)
        {
            obj = new T(); // delay-create the object
        }
        return obj;
    }

    public void ReleaseObject(
        T obj
        )
    {
        ...
        lock (m_objectsLock)
        {
            m_objects.Push(obj);
        }
        //
        // Signal that one more object is available in the pool
        //
        m_semaphore.Release();
    }

    private Stack<T> m_objects;
    private object m_objectsLock;
    private Semaphore m_semaphore;
}
With the earlier C# code changed to use the object pool, the results are correct:
Thread #3 done processing. Hash value was qZk+NkcGgWq6PiVxeFDCbJzQ2J0=
Thread #4 done processing. Hash value was qZk+NkcGgWq6PiVxeFDCbJzQ2J0=
Test completed in 16 ms
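The same pool idea can be sketched in portable C++. This is a rough analogue under stated assumptions, not a port of the listing: a `std::condition_variable` plays the role of the semaphore, empty `std::unique_ptr` slots play the role of the `null` entries, and `MaxObjectsInUse` is a hypothetical helper added only to observe how many objects are checked out at once.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
class ObjectPool {
public:
    // Start with `capacity` empty slots; objects are created lazily on first use.
    explicit ObjectPool(std::size_t capacity) : slots_(capacity) {}

    std::unique_ptr<T> GetObject() {
        std::unique_lock<std::mutex> lock(mutex_);
        // Block while every slot is checked out.
        available_.wait(lock, [this] { return !slots_.empty(); });
        std::unique_ptr<T> obj = std::move(slots_.back());
        slots_.pop_back();
        lock.unlock();
        if (!obj) {
            obj = std::make_unique<T>();  // delay-create the object
        }
        return obj;
    }

    void ReleaseObject(std::unique_ptr<T> obj) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            slots_.push_back(std::move(obj));
        }
        available_.notify_one();  // one more object is available in the pool
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    std::vector<std::unique_ptr<T>> slots_;
};

// Hypothetical helper: runs `threadCount` workers that each borrow one object
// and returns the largest number of objects checked out at the same moment.
int MaxObjectsInUse(std::size_t capacity, int threadCount) {
    ObjectPool<int> pool(capacity);
    std::atomic<int> inUse{0};
    std::atomic<int> maxInUse{0};
    std::vector<std::thread> threads;
    for (int n = 0; n < threadCount; n++) {
        threads.emplace_back([&] {
            std::unique_ptr<int> obj = pool.GetObject();
            int now = ++inUse;
            int seen = maxInUse.load();
            while (now > seen && !maxInUse.compare_exchange_weak(seen, now)) {
            }
            *obj += 1;  // stand-in for real work on the borrowed object
            --inUse;
            pool.ReleaseObject(std::move(obj));
        });
    }
    for (auto& t : threads) t.join();
    return maxInUse.load();
}
```

Up to `capacity` threads can work in parallel, each on its own object, and a released object is handed to the next caller instead of being recreated.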
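The code in this example will very likely run as expected many times, yet it contains a serious race condition: the memory owned by the smart pointer spParameter is destroyed when MainHR exits, so using it in the worker thread's callback routine is unsafe. Run the program enough times and the printed messages start to show garbage characters, which indicates memory corruption.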
Test message... Hello World!
Test message... Hello World!
Test message... Hello World!
Test message... Hello World!
Test message... Äello World!
Test message... Äello World!
Success.
Test message... okkkk
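One common remedy for this kind of lifetime race is reference counting: the worker holds its own reference to the parameter, so the memory cannot be freed while the callback still uses it. The sketch below shows the idea in standard C++ and makes several assumptions: `Parameter` and `StartWorker` are hypothetical names, and `std::async` stands in for whatever mechanism the original program used to start its worker thread.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <string>

// Hypothetical parameter block handed to the worker thread.
struct Parameter {
    std::string message;
};

// Starts a worker that reads the parameter after this function has returned.
// Capturing the shared_ptr by value gives the worker its own reference, so
// the Parameter stays alive until the worker is done with it.
std::future<std::string> StartWorker(const std::string& text) {
    auto spParameter = std::make_shared<Parameter>(Parameter{text});
    return std::async(std::launch::async, [spParameter] {
        return "Test message... " + spParameter->message;
    });
}   // the local spParameter goes away here; the worker's copy keeps the object alive
```

Calling `StartWorker("Hello World!").get()` returns the full message regardless of how the two threads are scheduled, because the last reference is released only after the worker has finished reading.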
Fixing DLL lifetime management bugs:
The best approach is to make the asynchronous part of the API explicit in its design, so that callers know how to use it correctly.
That being said, there is a way to fix the problem even if you were really bent on keeping the API signature unchanged. The idea behind this fix, again, is to use reference counting, only this time you keep track of references to the DLL module itself. You increment the reference count of the DLL module by calling LoadLibrary every time a new background thread is created by your API, and you then have the worker-thread callback routine decrement it upon its exit by using the atomic FreeLibraryAndExitThread Win32 API call. It's critical that you use this atomic call to free the library and exit the thread as part of the same transaction, as opposed to calling FreeLibrary followed by ExitThread. This way, control is never returned to the callback even if the backing DLL module gets unloaded from memory as its reference count drops down to 0 from the FreeLibrary call!