Implementing a driver with a configurable number of outstanding transactions

Suppose I have a protocol with two phases, ctrl and data, and I want to implement an outstanding-transaction limiter inside the driver.

The following simplified code illustrates my first attempt.
These are the phase drivers:


task ctrl_phase_driver();
	trans tr;
	vif.master_cb.ctrl_valid <= 0;
	forever
	begin
		if(tr == null)
		begin
			wait(pending_tr.size());
			tr = pending_tr.pop_front();
			@(vif.master_cb);
		end

		vif.master_cb.ctrl_sig   <= tr.ctrl_sig;
		vif.master_cb.ctrl_valid <= 1;

		@(vif.master_cb iff vif.master_cb.ctrl_ready);

		vif.master_cb.ctrl_valid <= 0;

		outstanding_tr.push_back(tr); // push transaction to data phase driver queue

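		// take the next transaction if one is already queued; otherwise
		// set tr to null so the loop waits again (avoids popping an empty queue)
		tr = pending_tr.size() ? pending_tr.pop_front() : null;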

	end
endtask

task data_phase_driver();
	trans tr;
	vif.master_cb.data_valid <= 0;
	forever
	begin
		if(tr == null)
		begin
			wait(outstanding_tr.size());
			tr = outstanding_tr.pop_front();
			@(vif.master_cb);
		end

		foreach(tr.data[beat_num])
		begin
			vif.master_cb.data  <= tr.data[beat_num];
			vif.master_cb.last  <= (beat_num+1 == tr.data.size());
			vif.master_cb.data_valid <= 1;
			@(vif.master_cb iff vif.master_cb.data_ready);

			if(beat_num+1 == tr.data.size())
				--num_outstanding;

			vif.master_cb.data_valid <= 0;

			repeat(tr.delay_between_beats[beat_num]) @(vif.master_cb);
		end

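		// take the next transaction if one is already queued; otherwise
		// set tr to null so the loop waits again (avoids popping an empty queue)
		tr = outstanding_tr.size() ? outstanding_tr.pop_front() : null;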
	end
endtask

In addition, there is a central scheduler which pops transaction items from seq_item_port and pushes them to pending_tr.
The outstanding mechanism has to be central because there are actually multiple two-phase interfaces, and the outstanding limit is a function of both max_total_outstanding and max_if_outstanding
(for example, there are 3 interfaces with max_if_outstanding of 4, 5 and 6, but the whole system can be configured with max_total_outstanding=8).
For this discussion we can neglect that and view the scheduler as (pseudocode):


task sch();

	forever
	begin
		wait(
			(req_fifo.size())
			&&
			(num_outstanding < cfg.num_if_outstanding_transactions)
		);

		pending_tr.push_back( req_fifo.pop_front() );
		++num_outstanding;
	end

endtask
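
For reference, the combined limit that the pseudocode above leaves out could be tracked roughly as follows. This is only a minimal sketch: the class name outstanding_limiter, its methods can_issue/issue/retire and the counters if_outstanding and total_outstanding are hypothetical names made up for illustration. Only max_total_outstanding and max_if_outstanding come from the description above.

class outstanding_limiter;
	int unsigned max_total_outstanding; // system-wide limit
	int unsigned max_if_outstanding[];  // one limit per interface
	int unsigned if_outstanding[];      // current count per interface (hypothetical)
	int unsigned total_outstanding;     // current count over all interfaces (hypothetical)

	function new(int unsigned max_total, int unsigned max_if[]);
		max_total_outstanding = max_total;
		max_if_outstanding    = max_if;
		if_outstanding        = new[max_if.size()]; // elements start at 0
	endfunction

	// a new transaction may start only if both limits still have room
	function bit can_issue(int unsigned if_idx);
		return (total_outstanding < max_total_outstanding) &&
		       (if_outstanding[if_idx] < max_if_outstanding[if_idx]);
	endfunction

	// call when a transaction is handed to an interface
	function void issue(int unsigned if_idx);
		if_outstanding[if_idx]++;
		total_outstanding++;
	endfunction

	// call when the last data beat of a transaction has been accepted
	function void retire(int unsigned if_idx);
		if_outstanding[if_idx]--;
		total_outstanding--;
	endfunction
endclass

With the numbers from the example, the object would be built as new(8, '{4, 5, 6}). The single-interface sch above corresponds to checking can_issue before the push and calling issue where it does ++num_outstanding.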

The problem is the following.
Let's say that cfg.num_if_outstanding_transactions=1.
Since num_outstanding is decremented (inside data_phase_driver) only after @(vif.master_cb iff vif.master_cb.data_ready),
sch becomes aware of it only after the clocking block event. Hence the next ctrl phase starts one cycle late with respect to the last data phase beat (i.e., the next ctrl phase cannot be back-to-back with the last data beat of the previous transaction).

My current solution is to add a separate thread to data_phase_driver which monitors the end of the transaction:


task data_phase_driver();
	trans tr;

	fork
		forever
		begin
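			// note: the raw interface signals are read here, not the
			// clocking-block samples (vif.master_cb.*)
			wait(vif.master_cb.triggered && vif.data_ready && vif.data_valid && vif.last);
			--num_outstanding;
			// block until the current clocking event is over, so the same
			// beat is not counted twice
			wait(!vif.master_cb.triggered);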
		end
	join_none

	vif.master_cb.data_valid <= 0;
	forever
	begin
		if(tr == null)
		begin
			wait(outstanding_tr.size());
			tr = outstanding_tr.pop_front();
			@(vif.master_cb);
		end

		foreach(tr.data[beat_num])
		begin
			vif.master_cb.data  <= tr.data[beat_num];
			vif.master_cb.last  <= (beat_num+1 == tr.data.size());
			vif.master_cb.data_valid <= 1;
			@(vif.master_cb iff vif.master_cb.data_ready);

			vif.master_cb.data_valid <= 0;

			repeat(tr.delay_between_beats[beat_num]) @(vif.master_cb);
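		end

		// move on to the next transaction (null if none is queued), as in the
		// first version; otherwise the same transaction is driven again
		tr = outstanding_tr.size() ? outstanding_tr.pop_front() : null;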
	end
endtask

This code works with my current simulator, but I have the feeling that I'm doing it badly (e.g., sampling signals without the clocking block).
Any suggestions or comments would be great.

Thanks!