When hardware decode (mppvideodec/NV12) is active, wrap the appsink in a
GstBin with a videoscale element so the VPU decodes at full stream
resolution but Python only receives a frame pre-scaled to the SDL display
size (default 640x480).
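The wrapper bin can be sketched as a launch-style description handed to Gst.parse_bin_from_description(); this is a hedged sketch, not the project's actual code — the appsink name and the bilinear method property are assumptions:

```python
# Hypothetical helper producing the bin description; element layout follows
# the commit text (videoscale -> NV12 capsfilter at display size -> appsink).
def scaled_sink_description(width: int = 640, height: int = 480,
                            have_videoscale: bool = True) -> str:
    """Build a description for Gst.parse_bin_from_description().

    With videoscale present, the VPU output is downscaled to the display
    size before it ever reaches appsink; without it, unscaled NV12 passes
    straight through (the fallback described below).
    """
    if have_videoscale:
        return (
            "videoscale method=bilinear ! "
            f"video/x-raw,format=NV12,width={width},height={height} ! "
            "appsink name=frame_sink"
        )
    return "appsink name=frame_sink"

desc = scaled_sink_description()
```

The resulting string would be passed as `Gst.parse_bin_from_description(desc, True)` so the bin exposes a ghost sink pad to link after mppvideodec.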
Effect:
- NV12 buffer per frame: 3,133,440 B (1080p) → 460,800 B (640x480)
- memmove per frame: ~33 ms (80.5% budget) → ~5 ms (expected ~12%)
The videoscale bilinear step runs entirely in software on the A35 cores,
but because its output is 6.7× smaller, its cost is far lower than the
avoided memmove.
SDL still handles final aspect-ratio fitting inside the viewport, so
visual quality is unchanged relative to what the 640x480 display can show.
Fallback: if videoscale is not available, unscaled NV12 is used as before.
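The buffer sizes above follow directly from NV12's 1.5 bytes per pixel, assuming the decoder pads 1080p to a 16-row-aligned height of 1088 (which is where 3,133,440 B comes from):

```python
def nv12_size(width: int, height: int) -> int:
    """NV12 holds a full-resolution Y plane plus a half-resolution
    interleaved UV plane: 12 bits (1.5 bytes) per pixel overall."""
    return width * height * 3 // 2

# 1080p decode surfaces are commonly padded to a 16-row boundary (1088).
full = nv12_size(1920, 1088)    # 3,133,440 B per frame
scaled = nv12_size(640, 480)    # 460,800 B per frame
# full / scaled == 6.8 by bytes; by visible pixels (1920x1080 vs 640x480)
# the reduction is 6.75, i.e. the ~6.7x quoted above.
```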
Add benchmark log table to development-status.md comparing:
- a201594: extract_dup+from_buffer_copy (2 copies, 6MB/frame) → 36.5ms, 87.6% budget
- da02e74: buffer.map+memmove into reusable ctypes array (1 copy, 3MB/frame) → 33.6ms, 80.5% budget
Note that the 3.1MB memmove is now the remaining bottleneck and further
reduction would require DMA-buf zero-copy via kernel VPU driver support.
Update next actions: profile SDL upload overhead, explore dmabuf fd path,
and consider 720p downscale option if stutter appears under combined load.
Instead of extract_dup (GLib alloc+memcpy → Python bytes) followed by
from_buffer_copy (Python bytes → ctypes array) — two 3MB copies per frame —
use Gst.Buffer.map(READ) to get a zero-allocation pointer to the decoded
frame memory, then memmove directly into a pre-allocated reusable ctypes
array (_raw_arr).
This reduces the per-frame copy path from 2 copies (6MB) to 1 memmove
(3MB), with no Python bytes object allocation at all. The memmove happens
under _frame_lock so render() on the main thread never reads a partial frame.
_raw_arr is allocated once on the first frame (or on resolution change) and
reused for every subsequent frame.
_Frame no longer carries a pixels field. Tests updated accordingly.
Benchmark updated to use the same buffer.map+memmove path as the app.
mppvideodec outputs NV12 (hardware format), which GStreamer videoconvert
converts to BGRA in scalar software code — slower than avdec_h264, which
uses libav's NEON-optimised YUV→BGRA path.
Default behaviour: software decode (avdec_h264) at PRIMARY rank.
The MPP plugin is still detected and logged so the user knows it is
installed and operational.
Set R36S_HW_DECODE=1 to re-enable the rank boost once a zero-copy
NV12→SDL_UpdateNVTexture (or similar) upload path is implemented.
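The opt-in gate can be sketched as below; the function names are illustrative, but mppvideodec is the registry feature name from the Rockchip MPP plugin, and the rank boost uses the standard GstPluginFeature API:

```python
import os

def hw_decode_enabled(env=os.environ) -> bool:
    """Hardware decode is opt-in: only when R36S_HW_DECODE=1."""
    return env.get("R36S_HW_DECODE") == "1"

def maybe_boost_mpp_rank() -> None:
    """Raise mppvideodec above avdec_h264 when the user opts in."""
    if not hw_decode_enabled():
        return  # default: software decode at PRIMARY rank wins autoplugging
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst
    feature = Gst.Registry.get().lookup_feature("mppvideodec")
    if feature is not None:
        feature.set_rank(Gst.Rank.PRIMARY + 1)
```

Because decodebin picks elements by rank, leaving mppvideodec below avdec_h264 is enough to keep software decode the default while the plugin stays installed and detectable.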
On linux-aarch64 the conda gst-libav package has an unfixable ABI mismatch
(libdav1d.so.6 missing, libicuuc.so.78 via libxml2-16). Fix: use system
gstreamer1.0-libav installed via apt with GST_PLUGIN_PATH, and preload
system libgomp.so.1 to avoid static TLS block errors when dlopen loads
libgstlibav.so. avdec_h264 and avdec_aac now register correctly on device.
These vars are stored in conda activate.d/gst-env.sh and in deploy/run.sh.
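A hedged sketch of what gst-env.sh sets; the exact library paths are assumptions for a Debian/Ubuntu aarch64 layout, not copied from the repo:

```shell
# Sketch of conda activate.d/gst-env.sh (paths are illustrative).
# Prefer the apt-installed gstreamer1.0-libav plugins over the broken
# conda gst-libav build.
export GST_PLUGIN_PATH="/usr/lib/aarch64-linux-gnu/gstreamer-1.0${GST_PLUGIN_PATH:+:$GST_PLUGIN_PATH}"

# Preload the system libgomp so that dlopen(libgstlibav.so) does not fail
# with "cannot allocate memory in static TLS block".
export LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libgomp.so.1${LD_PRELOAD:+ $LD_PRELOAD}"
```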