Fix a race condition at Python shutdown when waiting for threads.
Wait until the Python thread state of all non-daemon threads get
deleted (join all non-daemon threads), rather than just wait until
Python threads complete.
* Add threading._shutdown_locks: set of Thread._tstate_lock locks
of non-daemon threads used by _shutdown() to wait until all Python
thread states get deleted. See Thread._set_tstate_lock().
* Add also threading._shutdown_locks_lock to protect access to
threading._shutdown_locks.
* Add test_finalization_shutdown() test.