Generate ONLY the Verilog module code for the following specification.

## Problem Description

Implement a synchronous accelerator that computes the inclusive prefix sums of one batch of eight signed input lanes.

INPUT BATCH FORMAT:
- A batch is accepted on each rising edge of `clk` for which `rst=0` and `in_valid=1`.
- The eight input lanes are packed into `in_data` as follows:
  - x0 = `in_data[11:0]`
  - x1 = `in_data[23:12]`
  - x2 = `in_data[35:24]`
  - x3 = `in_data[47:36]`
  - x4 = `in_data[59:48]`
  - x5 = `in_data[71:60]`
  - x6 = `in_data[83:72]`
  - x7 = `in_data[95:84]`
- Each `xk` is a signed 12-bit two's-complement value in the range -2048 to 2047.

PREFIX-SUM RULE:
- For every accepted batch, produce eight signed outputs `y0` through `y7` defined by:
  - `y0 = x0`
  - `y1 = x0 + x1`
  - `y2 = x0 + x1 + x2`
  - `y3 = x0 + x1 + x2 + x3`
  - `y4 = x0 + x1 + x2 + x3 + x4`
  - `y5 = x0 + x1 + x2 + x3 + x4 + x5`
  - `y6 = x0 + x1 + x2 + x3 + x4 + x5 + x6`
  - `y7 = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7`
- Use exact arithmetic. Do not saturate, wrap, truncate, round, or reorder lanes.
- The outputs must be packed into `out_data` using the same lane order:
  - `y0` in `out_data[14:0]`
  - `y1` in `out_data[29:15]`
  - `y2` in `out_data[44:30]`
  - `y3` in `out_data[59:45]`
  - `y4` in `out_data[74:60]`
  - `y5` in `out_data[89:75]`
  - `y6` in `out_data[104:90]`
  - `y7` in `out_data[119:105]`
- Each `yk` must be represented as a signed 15-bit two's-complement value. The exact output range for this benchmark is -16384 to 16376 inclusive.

TIMING AND THROUGHPUT:
- The module has no ready or backpressure signal.
- It must be able to accept one new input batch on every cycle for which `in_valid=1`, including long runs of back-to-back valid batches.
- Each accepted batch must produce exactly one output cycle with `out_valid=1`.
- The output latency must be exactly 3 clock cycles for every accepted batch.
- Count latency from the rising edge on which the batch is accepted. If that acceptance edge is cycle N, the batch's result must not appear on cycles N, N+1, or N+2.
- If a batch is accepted on cycle N, then on cycle N+3:
  - `out_valid` must be 1
  - `out_data` must contain the eight prefix sums for that batch
- That batch must not appear earlier than cycle N+3 and must not be delayed past cycle N+3.
- Example: if a batch is accepted on the 5th rising edge after reset deasserts, its result must be presented on the 8th rising edge after reset deasserts.
- If batches are accepted on consecutive cycles, their corresponding output cycles must also occur on consecutive cycles after the pipeline fills.
- The evaluator may present a new input batch on the same cycle that a previous batch is being produced on the outputs.

RESET AND IDLE BEHAVIOUR:
- `rst` is synchronous and active-high.
- While `rst=1`, do not accept new input batches, discard any in-flight batches, and drive `out_valid=0` and `out_data=0`.
- After reset is deasserted, the next accepted batch starts a new empty pipeline with the full 3-cycle latency.
- If no batch was accepted exactly 3 cycles earlier, then `out_valid` must be 0 and `out_data` must be 0.
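
NON-NORMATIVE REFERENCE MODEL:

The packing, prefix-sum, latency, and reset rules above can be cross-checked against a small behavioural model. The Python sketch below is an illustration only and is not part of the required Verilog output. All names in it (`sign_extend`, `unpack_lanes`, `pack_lanes`, `prefix_sums`, `PrefixScan8Model`, `step`) are hypothetical. It assumes one `step` call per rising edge of `clk`, and that "cycle N+3" means the third `step` call after the accepting one; how that maps onto sampling instants in a real testbench is not settled by this sketch.

```python
from collections import deque

IN_W, OUT_W, LANES, LATENCY = 12, 15, 8, 3


def sign_extend(value, width):
    """Interpret the low `width` bits of `value` as two's complement."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def unpack_lanes(word, width):
    """Split a packed word into LANES signed values, lane 0 in the low bits."""
    return [sign_extend(word >> (width * k), width) for k in range(LANES)]


def pack_lanes(values, width):
    """Pack signed values into one word, lane 0 in the low bits."""
    mask = (1 << width) - 1
    word = 0
    for k, v in enumerate(values):
        word |= (v & mask) << (width * k)
    return word


def prefix_sums(xs):
    """Inclusive prefix sums with exact (unbounded) integer arithmetic."""
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out


class PrefixScan8Model:
    """Assumed cycle-level model: one step() call per rising clock edge."""

    def __init__(self):
        self._clear()

    def _clear(self):
        # One slot per batch accepted 3, 2, and 1 calls ago (oldest first).
        self.pipe = deque([None] * LATENCY)

    def step(self, rst, in_valid, in_data):
        """Return (out_valid, out_data) for this cycle."""
        if rst:
            # Synchronous reset: drop in-flight batches, drive zeros.
            self._clear()
            return (0, 0)
        oldest = self.pipe.popleft()
        if in_valid:
            ys = prefix_sums(unpack_lanes(in_data, IN_W))
            self.pipe.append(pack_lanes(ys, OUT_W))
        else:
            self.pipe.append(None)
        return (0, 0) if oldest is None else (1, oldest)
```

In this model, a batch passed to `step` with `rst=0` and `in_valid=1` is returned by the third following `step` call, and every other call returns `(0, 0)`. Eight lanes of -2048 give `y7 = -16384` and eight lanes of 2047 give `y7 = 16376`, which matches the stated output range.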

## Interface Specification

Module Name: prefix_scan8

Ports:
- input 1 clk // System clock
- input 1 rst // Synchronous active-high reset
- input 1 in_valid // High when `in_data` contains a valid 8-lane input batch to be accepted this cycle
- input 96 in_data // Packed batch of eight signed 12-bit input values, with lane 0 in bits [11:0] and lane 7 in bits [95:84]
- output 1 out_valid // High when `out_data` contains the valid prefix-sum result for one accepted batch
- output 120 out_data // Packed batch of eight signed 15-bit inclusive prefix sums, with lane 0 in bits [14:0] and lane 7 in bits [119:105]

## Requirements

- Generate ONLY the Verilog module code
- Do NOT output any reasoning, analysis, scratchpad, or tags
- Start directly with `module prefix_scan8` as the first line of your response
- Do NOT include any testbenches
- Do NOT include any explanations or comments outside the code
- End with `endmodule`
- Ensure the code is correct and synthesizable