Parallel execution of threads

I have seen someone implement a system of threads, controlled by fork-join, in which the threads share and update variables. The system seemed to rely on two properties to guarantee mutual exclusion when performing the shared-variable updates: (1) non-pre-emption of thread execution, i.e. that a thread runs uninterrupted until it suspends itself at a timing control, and (2) that the threads run on a single processor.
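
To make the pattern concrete, here is a minimal sketch of the kind of thing I mean (the names and details are my own, not the actual code): each forked thread performs a multi-statement read-modify-write of a shared variable and relies on never being pre-empted before it reaches the timing control.

module fork_join_example;
   int balance = 100;

   // each forked thread does a multi-statement update and relies on
   // never being pre-empted between the read and the write-back
   task automatic withdraw(int amount);
      int tmp;
      tmp = balance;        // read
      tmp = tmp - amount;   // modify
      balance = tmp;        // write back; assumed atomic as a group
      #1;                   // the thread suspends itself only here
   endtask

   initial begin
      fork
         withdraw(30);
         withdraw(50);
      join
      $display("balance = %0d", balance);  // 20 if the updates never interleave
   end
endmodule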

The first of these properties, non-pre-emption, is, as I understand it, a requirement imposed by the SystemVerilog LRM. However, I have not seen anything that makes me believe the LRM imposes the second property, that of a single processor. I have therefore considered the system mentioned above to be inherently non-portable. This seems unfortunate, because the system is deployed across an organization but will not be robust against improvements in simulator technology in the direction of using multiple processors.

Can anyone comment or enlighten?

This is not correct. The SystemVerilog LRM does not guarantee non-pre-emption. See section 4.7 Nondeterminism. The LRM does guarantee execution order within a single thread (process).

The reason Verilog allows preemption of a thread has more to do with optimization than anything to do with multiprocessor design. A real Verilog design has thousands, even millions, of “threads”. It is not practical to model these as actual OS threads, so the compiler’s optimizations combine as many threads together as they can. For example, all the threads that are activated on @(posedge clk) might be combined into one real thread. This does not happen so much on the “testbench” side of the environment, but the option is there.
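
To make that concrete, here is a minimal sketch of my own: a compiler is free to execute the three processes below as a single real thread, since they all wake up on the same event.

module merge_candidates(input logic clk, input logic [7:0] din);
   logic [7:0] stage1, stage2, stage3;

   // three separately written processes, all sensitive to posedge clk;
   // an optimizing compiler may legally run them as one real thread
   always @(posedge clk) stage1 <= din;
   always @(posedge clk) stage2 <= stage1;
   always @(posedge clk) stage3 <= stage2;
endmodule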

Until the number of processors in a multiprocessor platform approaches the number of threads in a simulated design, how the threads are partitioned onto each processor remains a bottleneck.

Okay. Interesting. Sounds like it is necessary to develop cooperating parallel threads under the assumption that any interleaving of statements may be possible, and hence full attention must be given to mutual exclusion of critical sections at the thread-implementation level.
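
For example, something like this (a sketch of my own; a semaphore used as a lock is just one way to do it):

module locked_update;
   int shared_count = 0;
   semaphore lock;

   task automatic bump();
      int tmp;
      repeat (10) begin
         lock.get(1);             // enter the critical section
         tmp = shared_count;
         shared_count = tmp + 1;  // safe even if threads interleave
         lock.put(1);             // leave the critical section
         #1;
      end
   endtask

   initial begin
      lock = new(1);              // one key, used as a mutex
      fork
         bump();
         bump();
      join
      $display("shared_count = %0d", shared_count);  // always 20
   end
endmodule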

Do you agree?

I must admit that I didn’t scour the LRM, but thought that non-pre-emption must be assured by it, based on this statement by Janick Bergeron in "Writing Testbenches in SystemVerilog", pg. 163:

"In an operating system, every thread has a limit on the amount of processor time it can have during each execution slice. Once that limit is exhausted,the thread is kicked out of the processor to be replaced by another. There is no such limit in a simulator. Any execution thread keeps executing until it explicitly requests to be kicked out…

“In SystemVerilog, an execution thread simulates, and keeps simulating, until an active timing-control statement–@, # or wait–is executed.”
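
The picture I took away from that (my own illustration, not from the book) is that each of the threads below runs its statements back to back and yields only when it reaches the # delay:

module yield_illustration;
   int a, b;

   initial begin : thread_a
      a = 1;       // executes...
      a = a + 1;   // ...without interruption...
      #1;          // ...and yields only here, at the timing control
      a = a + b;
   end

   initial begin : thread_b
      b = 10;
      #1;
      b = b + a;
   end
endmodule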

In reply to fostler:

Yes I agree.

With all due respect to my colleague, that book is not the standard, nor has Janick been involved with the development of the Verilog/SystemVerilog standard. I can see that one could make that statement from simple observations. In practice, no simulator is going to randomly interleave statements without some predictable cause.

There are a few places in the LRM where the order is undefined, but people have come to depend on a specific implementation to give them a specific ordering of what should be a race condition.
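
A classic example of the kind of race I mean (a minimal illustration of my own): the LRM does not define whether the $display below sees the old or the new value of data at a given clock edge, yet code like this often “works” because a particular simulator happens to schedule the two processes in a fixed order.

module race_example;
   reg       clk = 0;
   reg [7:0] data = 0;

   always #5 clk = ~clk;

   // writer: blocking assignment at the clock edge
   always @(posedge clk)
      data = data + 1;

   // reader: samples data at the same clock edge;
   // the LRM does not say which process runs first
   always @(posedge clk)
      $display("[%0t] data = %0d", $time, data);

   initial #50 $finish;
endmodule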

In reply to dave_59:

Just came across a situation where a simulator (not Questa) behaves as if statements are not executing atomically. In the example below, there is a wait(done) in an initial block. In another thread, done is set to 1 and then back to 0 with no intervening delay. If you assume atomic execution, the wait statement should never unblock, because there is no point in time at which done is true while the waiting thread has a chance to execute. However, if the compiler inlines the evaluation of the wait expression (realize that the expression can be arbitrarily more complex than a simple done bit), it effectively interleaves statements from two different threads.

module top;
   reg [5:0] counter;
   reg       clk;
   reg       done;

   // clock generator and overall timeout
   initial begin : initial_block
      counter = 0;
      clk = 0;
      done = 0;
      repeat (100) begin
         #5 clk = 0;
         #5 clk = 1;
      end
      $display("timeout");
      $finish;
   end

   // pulses done for zero time when counter reaches 12
   always @(posedge clk) begin : counter_block
      counter = counter + 1;
      done = (counter == 12);
      if (done) begin
         counter = 0;
         done = 0;   // cleared in the same thread with no delay
      end
   end

   // should never wake up if counter_block executes atomically
   initial begin : done_block
      wait(done);
      $display("I'm done");
      $finish;
   end
endmodule

This technique is used by many simulators as part of an optimization to limit the number of context switches they need to perform.