# Prefix Scan8 Test Strategy

## Scope

The testbench should treat the DUT as a black-box synchronous batch
accelerator that accepts one packed 8-lane input vector on every cycle with
`in_valid=1` and `rst=0` and, exactly 3 cycles later, emits one packed 8-lane
output vector containing the inclusive prefix sums of that same batch.

The benchmark is intentionally focused on parallel-prefix style arithmetic:
correct lane unpacking, signed 12-bit interpretation, exact cumulative
addition across the eight lanes, fixed-latency pipeline timing, continuous
throughput for back-to-back batches, and reset flushing.

Python is useful here because it can generate packed expected vectors for both
directed and randomized batches without hand-calculating every 15-bit field.
If a Python helper is adopted, it must run only offline to generate expected
cycle/value pairs. The runnable Verilog testbench must hardcode those final
expectations and must not execute Python at runtime.

## Coverage Goals

- Reset behaviour: confirm `out_valid=0` and `out_data=0` while `rst=1`.
- Reset input ignore: confirm a cycle with both `rst=1` and `in_valid=1` does
  not accept a batch or schedule any future output.
- Exact latency: confirm every accepted batch produces its output exactly
  3 cycles later, not earlier and not later.
- Idle output behaviour: confirm `out_valid=0` and `out_data=0` on cycles
  where no batch was accepted exactly 3 cycles earlier.
- Lane packing: confirm lane 0 maps to the low bits and lane 7 maps to the
  high bits on both input and output buses.
- Inclusive semantics: confirm output lane `i` equals the sum of input lanes
  `0` through `i` of the same batch, so each lane includes its own input.
- Signed arithmetic: confirm negative 12-bit inputs are sign-extended before
  accumulation and negative prefix sums are encoded correctly as signed
  15-bit two's-complement values.
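- Extreme values: confirm the DUT handles batches containing `-2048` and
  `2047` without truncation, saturation, or wraparound. The worst-case sums,
  `8 * 2047 = 16376` and `8 * -2048 = -16384`, both fit in a signed 15-bit
  field.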
- Back-to-back throughput: confirm consecutive accepted batches yield
  consecutive valid outputs after the 3-cycle latency.
- Reset flush: confirm asserting reset while outputs are pending discards all
  pre-reset in-flight batches and forces post-reset timing to restart from an
  empty pipeline.
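
The lane-packing and signed-arithmetic goals above can be illustrated with a
small offline sketch. The helper names `to_signed`, `pack`, and `unpack` are
hypothetical and not part of any existing script; the widths (12-bit inputs,
15-bit outputs, 8 lanes, lane 0 in the low bits) come from this document.

```python
LANES, IN_W, OUT_W = 8, 12, 15  # lane count and field widths from the goals above

def to_signed(value, width):
    """Interpret the low `width` bits of `value` as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value & (1 << (width - 1)) else value

def pack(lanes, width):
    """Pack signed lane values into one word with lane 0 in the low bits."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    word = 0
    for index, value in enumerate(lanes):
        if not lo <= value <= hi:
            raise ValueError(f"lane {index} value {value} does not fit in {width} bits")
        word |= (value & ((1 << width) - 1)) << (index * width)
    return word

def unpack(word, width, lanes=LANES):
    """Split a packed word into signed lane values, lane 0 first."""
    return [to_signed(word >> (index * width), width) for index in range(lanes)]

if __name__ == "__main__":
    batch = [-3, 7, -2, 1, -8, 4, 0, 5]
    packed = pack(batch, IN_W)
    print(f"in_data = 96'h{packed:024x}")
    assert unpack(packed, IN_W) == batch  # round trip preserves sign and lane order
```

Rejecting out-of-range values in `pack` is a deliberate choice in this sketch:
a silent mask would hide a bad stimulus value instead of flagging it before
vectors are frozen.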

## Planned Directed Scenarios

- Hold reset high for several cycles, including one cycle with `in_valid=1`,
  then deassert reset and confirm no output appears 3 cycles later from the
  ignored reset-time batch.
- Feed a simple positive batch such as `1, 2, 3, 4, 5, 6, 7, 8`; verify the
  exact output lanes are `1, 3, 6, 10, 15, 21, 28, 36`.
- Feed a mixed-sign batch such as `-3, 7, -2, 1, -8, 4, 0, 5`; verify the
  exact output lanes are `-3, 4, 2, 3, -5, -1, -1, 4`.
- Feed a duplicate-heavy batch such as `5, 5, 5, 5, 5, 5, 5, 5`; verify the
  cumulative outputs increase as `5, 10, 15, 20, 25, 30, 35, 40`.
- Feed an all-zero batch; verify all output lanes are zero and the DUT does
  not emit stale data from a previous batch.
- Feed an extreme alternating batch such as
  `2047, -2048, 2047, -2048, 2047, -2048, 2047, -2048`; verify the exact
  output lanes are `2047, -1, 2046, -2, 2045, -3, 2044, -4` and that the
  negative lanes are packed as correct 15-bit two's-complement values.
- Feed at least four back-to-back batches with no gaps; verify outputs appear
  on four consecutive cycles exactly 3 cycles later and each output matches
  its corresponding input batch.
- Insert one or two idle cycles with `in_valid=0` between accepted batches;
  verify the output valid stream contains matching gaps 3 cycles later.
- Accept one or more batches, then assert reset before their scheduled output
  cycles; verify those pending outputs never appear.
- After reset, feed a fresh batch and verify it still incurs the full 3-cycle
  latency and is not contaminated by pre-reset state.
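
The expected lane values for the directed batches above can be cross-checked
with a few lines of offline Python. This is only a sketch: the dictionary name
`DIRECTED_BATCHES` and its tags are hypothetical labels, and the prefix sums
use `itertools.accumulate` rather than anything resembling the RTL structure.

```python
from itertools import accumulate

def inclusive_prefix_sums(batch):
    """Return the running sums of `batch`; element i covers lanes 0..i."""
    return list(accumulate(batch))

# Directed batches taken from the scenarios above; the tags are illustrative.
DIRECTED_BATCHES = {
    "positive": [1, 2, 3, 4, 5, 6, 7, 8],
    "mixed_sign": [-3, 7, -2, 1, -8, 4, 0, 5],
    "duplicates": [5] * 8,
    "all_zero": [0] * 8,
    "extreme_alternating": [2047, -2048] * 4,
}

if __name__ == "__main__":
    for tag, batch in DIRECTED_BATCHES.items():
        print(f"{tag:20s} {inclusive_prefix_sums(batch)}")
```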

## Checking Method

- Maintain a cycle counter in the testbench and an expected-output queue keyed
  by the cycle at which each accepted batch's output should appear.
- On every cycle, compare the observed `out_valid` against whether an output
  is expected for that cycle.
- When an output is expected, compare the entire 120-bit `out_data` word
  exactly.
- When no output is expected, require both `out_valid=0` and `out_data=0`.
- Use clear failure messages that include the stimulus tag, cycle number, and
  both expected and observed packed output values.
- Prefer a compact set of auditable directed vectors first, then add a small
  number of deterministic pseudo-random batches to broaden signed-value and
  carry-propagation coverage.
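
The per-cycle rule above is sketched here in Python purely to make the
comparison logic concrete; the real checker belongs in the Verilog testbench.
The function name `check_cycle`, the `expected` mapping, and the message layout
are assumptions made for illustration.

```python
def check_cycle(cycle, out_valid, out_data, expected):
    """Compare one observed cycle against the expectation for that cycle.

    `expected` maps a cycle number to `(tag, packed_120_bit_word)`. Cycles that
    are absent from the mapping must show out_valid=0 and out_data=0.
    Returns None on a match, otherwise a failure message.
    """
    if cycle in expected:
        tag, want_data = expected[cycle]
        want_valid = 1
    else:
        tag, want_data, want_valid = "idle", 0, 0
    if out_valid == want_valid and out_data == want_data:
        return None
    # 120 bits print as 30 hex digits.
    return (
        f"FAIL tag={tag} cycle={cycle} "
        f"expected valid={want_valid} data=0x{want_data:030x} "
        f"observed valid={out_valid} data=0x{out_data:030x}"
    )
```

Comparing `out_data` on idle cycles as well as valid ones matches the
requirement that the bus reads zero whenever no output is scheduled.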

## Python Use

If a helper script is added later, use it only offline to generate golden
vectors for the directed and pseudo-random test batches above.

Planned workflow:

1. Encode each test cycle as `(rst, in_valid, batch_or_none, tag)`.
2. Simulate the spec, not the RTL:
   - If `rst=1`, clear all pending outputs and record `(out_valid=0,
     out_data=0)` for that cycle.
   - If `rst=0` and `in_valid=1`, unpack the eight signed 12-bit values,
     compute the eight inclusive prefix sums as normal Python integers, pack
     them into signed 15-bit fields, and schedule that packed output for
     cycle `current_cycle + 3`.
   - On each cycle, emit either the scheduled packed output or the idle value
     `(out_valid=0, out_data=0)`.
3. Print the resulting expected arrays or Verilog initializer lines for the
   final testbench.
4. Copy those literal expected values into `testbench.v`. The final
   simulation must remain pure Verilog and must not call Python.
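
A minimal sketch of steps 1 to 3 follows. It assumes each batch is supplied as
a list of eight already-unpacked signed integers, that cycles are numbered from
0 in stimulus order, and that the output arrays are named `expected_valid` and
`expected_data`; none of these details are fixed by this document.

```python
from itertools import accumulate

LATENCY = 3   # cycles from acceptance to output, per the spec above
OUT_W = 15    # signed output field width

def pack_outputs(sums):
    """Pack signed prefix sums into one word, lane 0 in the low bits."""
    mask = (1 << OUT_W) - 1
    return sum((value & mask) << (lane * OUT_W) for lane, value in enumerate(sums))

def simulate(cycles):
    """Model the spec for a list of `(rst, in_valid, batch_or_none, tag)` tuples.

    Returns one `(out_valid, out_data, source_tag)` tuple per cycle.
    """
    pending = {}  # output cycle -> (packed word, tag of the accepted batch)
    trace = []
    for cycle, (rst, in_valid, batch, tag) in enumerate(cycles):
        if rst:
            pending.clear()              # reset flushes every in-flight batch
            trace.append((0, 0, None))
            continue
        if in_valid:
            if len(batch) != 8 or not all(-2048 <= v <= 2047 for v in batch):
                raise ValueError(f"bad batch for tag {tag!r}: {batch}")
            pending[cycle + LATENCY] = (pack_outputs(accumulate(batch)), tag)
        if cycle in pending:
            data, source = pending.pop(cycle)
            trace.append((1, data, source))
        else:
            trace.append((0, 0, None))   # idle value
    return trace

def verilog_lines(trace):
    """Format the trace as Verilog assignments (array names are hypothetical)."""
    return [
        f"expected_valid[{cycle}] = 1'b{valid}; "
        f"expected_data[{cycle}] = 120'h{data:030x};"
        for cycle, (valid, data, _tag) in enumerate(trace)
    ]
```

Running `simulate` on a stimulus list and printing `verilog_lines(trace)` gives
text that can be pasted into `testbench.v`, which keeps the final simulation
free of any Python dependency.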

## Golden-Data Confidence

The offline Python helper should be kept simple and auditable:

- Use plain integer arithmetic for the prefix sums rather than mirroring any
  RTL structure.
- Add a tiny self-check in the script for a few hand-computed batches so the
  packing and signed conversions are validated before freezing vectors.
- Seed any pseudo-random batch generation so the resulting golden vectors are
  deterministic and reproducible.

This keeps the future Verilog testbench self-contained while making the
expected results easy to regenerate and review.
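
As one possible shape for the seeded generation and self-check described
above, the sketch below uses a private `random.Random` instance. The names
`random_batches`, `encode`, and `self_check` are hypothetical. Python does not
promise identical `randint` sequences across interpreter versions, so recording
the Python version next to the frozen vectors is a sensible precaution.

```python
import random
from itertools import accumulate

def random_batches(seed, count, lanes=8, width=12):
    """Generate `count` batches of signed `width`-bit values from a fixed seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return [[rng.randint(lo, hi) for _ in range(lanes)] for _ in range(count)]

def encode(value, width=15):
    """Two's-complement encoding of a signed value in `width` bits."""
    return value & ((1 << width) - 1)

def self_check():
    """Validate sums and signed encoding against hand-computed cases."""
    assert list(accumulate([1, 2, 3, 4, 5, 6, 7, 8])) == [1, 3, 6, 10, 15, 21, 28, 36]
    assert list(accumulate([-3, 7, -2, 1, -8, 4, 0, 5])) == [-3, 4, 2, 3, -5, -1, -1, 4]
    assert encode(-1) == 0x7FFF        # all ones in 15 bits
    assert encode(-16384) == 0x4000    # most negative sum: 8 * -2048
    assert encode(16376) == 0x3FF8     # most positive sum: 8 * 2047

if __name__ == "__main__":
    self_check()
    for batch in random_batches(seed=2024, count=4):
        print(batch, "->", list(accumulate(batch)))
```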