I have a DUT that takes a stream of bytes as input and produces its output in a pipelined way. I need to write a scoreboard for it in UVM.

The DUT has nearly identical input and output interfaces: the input has an 8-bit data line, data_in, with a valid signal, and the output has a data_out line with its own valid. The DUT calculates the sum of three consecutive bytes in the stream and drives the result on data_out. The first, second and third values on data_out are calculated as follows:

data_out_0 = data_in_0 + data_in_1 + data_in_2

data_out_1 = data_in_1 + data_in_2 + data_in_3

data_out_2 = data_in_2 + data_in_3 + data_in_4

…etc

I am trying to come up with a skeleton for the UVM monitor and scoreboard. To write a scoreboard for this pipelined output, is it a good idea to use semaphores in the monitor? Something like:

semaphore lock = new(1);

task run_phase (uvm_phase phase);
  fork
    collect_transactions();
    collect_transactions();
  join
endtask: run_phase

task collect_transactions();
  forever begin
    lock.get();
    // wait for clk
    // … capture one byte
    // Unlock semaphore
    lock.put();
    // … wait for one clock
    // collect one byte
    // … wait for one clock
    // collect another byte
  end
endtask

But I am not sure what to do with the queue. It will get overwritten by the other parallel task in the fork-join. Or, if I use an automatic queue inside the fork-join, will that work?

Or, if there is an easier way to do scoreboarding for this kind of pipelined output, please suggest it.

In reply to GC:

I believe you are overcomplicating this. You do not have pipelined processing. Simply generate a seq_item containing 3 bytes of data. In the driver, you provide the correct data to the DUT and, if needed, store the data you want to reuse in the next step. This makes your life easier.
In the monitor, you observe the virtual interface and extract your 3 bytes of data without any semaphore.
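
As a rough illustration of a monitor with no semaphore and no fork, here is a minimal sketch. It is one possible variant, not the only way to do it: it publishes one item per valid byte rather than three bytes per item, and leaves any grouping to the subscriber. The interface byte_if, its signal names (clk, data, valid), the class names and the "vif" config_db key are all made up for the example, and valid is assumed to be sampled on the rising clock edge.

```systemverilog
// Hypothetical interface; signal names are assumptions, not from the DUT.
interface byte_if (input logic clk);
  logic [7:0] data;
  logic       valid;
endinterface

package byte_mon_pkg;
  import uvm_pkg::*;
  `include "uvm_macros.svh"

  // Hypothetical transaction: one observed byte.
  class byte_item extends uvm_sequence_item;
    bit [7:0] data;
    `uvm_object_utils(byte_item)
    function new(string name = "byte_item");
      super.new(name);
    endfunction
  endclass

  class byte_monitor extends uvm_monitor;
    `uvm_component_utils(byte_monitor)

    virtual byte_if vif;
    uvm_analysis_port #(byte_item) ap;

    function new(string name, uvm_component parent);
      super.new(name, parent);
      ap = new("ap", this);
    endfunction

    function void build_phase(uvm_phase phase);
      super.build_phase(phase);
      // "vif" is an assumed config_db key set by the testbench top.
      if (!uvm_config_db #(virtual byte_if)::get(this, "", "vif", vif))
        `uvm_fatal("NOVIF", "virtual interface 'vif' not set")
    endfunction

    // A single sequential loop: sample once per clock, publish when valid.
    task run_phase(uvm_phase phase);
      forever begin
        @(posedge vif.clk);
        if (vif.valid === 1'b1) begin
          byte_item it = byte_item::type_id::create("it");
          it.data = vif.data;
          ap.write(it);
        end
      end
    endtask
  endclass
endpackage
```

Because only one process samples the bus, there is no shared queue to protect. If data_out is wider than 8 bits, the output side would need a wider item or a parameterized monitor.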

Oh, so did you mean capturing the data from the driving interface and then sending it to the scoreboard via an analysis port? Then in the scoreboard I can store the bytes in a queue, expected_data, calculate what the output interface is expected to have, and then compare?
For example, I can calculate the output data from the stimulus data and then compare:

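// Stop two entries before the end so that [i+2] stays in range.
// sum needs more than 8 bits (10 bits cover 3 x 255) unless data_out is truncated.
for (int i = 0; i + 2 < expected_data.size(); i++)
begin
sum = expected_data[i] + expected_data[i+1] + expected_data[i+2];
sum_1.push_back(sum);
end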

sum_1 should then hold the same values as those captured at the output interface.
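
The same idea can also be done on the fly, comparing each output as it arrives instead of post-processing the whole queue. The following is only a sketch under several assumptions: sum3_item, sum3_scoreboard and the _in/_out suffixes are hypothetical names, data_out is assumed to carry the full 10-bit sum (mask the expected value to 8 bits if the DUT truncates), and each output is assumed to appear at least one clock after its third input byte, so the expected value is already queued when the output is observed.

```systemverilog
package sum3_sb_pkg;
  import uvm_pkg::*;
  `include "uvm_macros.svh"

  // Separate write_in()/write_out() callbacks for the two monitors.
  `uvm_analysis_imp_decl(_in)
  `uvm_analysis_imp_decl(_out)

  // Hypothetical item: wide enough for a byte or a 3-byte sum.
  class sum3_item extends uvm_sequence_item;
    bit [9:0] data;
    `uvm_object_utils(sum3_item)
    function new(string name = "sum3_item");
      super.new(name);
    endfunction
  endclass

  class sum3_scoreboard extends uvm_scoreboard;
    `uvm_component_utils(sum3_scoreboard)

    uvm_analysis_imp_in  #(sum3_item, sum3_scoreboard) in_export;
    uvm_analysis_imp_out #(sum3_item, sum3_scoreboard) out_export;

    bit [7:0] window[$];    // last (up to) three input bytes
    bit [9:0] expected[$];  // predicted sums not yet compared

    function new(string name, uvm_component parent);
      super.new(name, parent);
      in_export  = new("in_export", this);
      out_export = new("out_export", this);
    endfunction

    // Input side: slide a 3-byte window and predict one sum per new byte.
    function void write_in(sum3_item t);
      bit [9:0] s;
      window.push_back(t.data[7:0]);
      if (window.size() == 3) begin
        s = window[0] + window[1] + window[2];
        expected.push_back(s);
        void'(window.pop_front());
      end
    endfunction

    // Output side: compare against the oldest pending prediction.
    function void write_out(sum3_item t);
      bit [9:0] want;
      if (expected.size() == 0) begin
        `uvm_error("SB", "output observed with no expected value pending")
        return;
      end
      want = expected.pop_front();
      if (t.data != want)
        `uvm_error("SB", $sformatf("mismatch: got %0d, expected %0d", t.data, want))
    endfunction

    // Flag predictions that never saw a matching output.
    function void check_phase(uvm_phase phase);
      super.check_phase(phase);
      if (expected.size() != 0)
        `uvm_error("SB", $sformatf("%0d expected value(s) never observed", expected.size()))
    endfunction
  endclass
endpackage
```

In the environment, the input monitor's analysis port would be connected to in_export and the output monitor's to out_export. For input bytes 1, 2, 3, 4, 5 the predicted sums are 6, 9 and 12.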