When hardware decode (mppvideodec/NV12) is active, wrap the appsink in a
GstBin with a videoscale element so the VPU decodes at full stream
resolution but Python only receives a frame pre-scaled to the SDL display
size (default 640x480).
Effect:
NV12 buffer per frame: 3,133,440 B (1080p) → 460,800 B (640x480)
memmove per frame: ~33 ms (80.5% of the frame budget) → ~5 ms (~12% expected)
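The buffer sizes above follow from NV12's 1.5 bytes per pixel; the
3,133,440 B figure implies the decoder pads 1080p output to 1088 rows
(a common 16-row VPU alignment, inferred here from the arithmetic):

```python
# NV12 stores a full-resolution Y plane plus a half-resolution
# interleaved UV plane: 1.5 bytes per pixel overall.
def nv12_size(width: int, height: int) -> int:
    return width * height * 3 // 2

full = nv12_size(1920, 1088)   # decoded stream resolution, padded to 1088 rows
small = nv12_size(640, 480)    # pre-scaled to the SDL display size

print(full, small)             # 3133440 460800
```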
The videoscale bilinear step runs entirely in software on the A35 cores
but scales down 6.7×, so its cost is far lower than the avoided memmove.
SDL still handles final aspect-ratio fitting inside the viewport, so
visual quality is unchanged relative to what the 640x480 display can show.
Fallback: if videoscale is not available, unscaled NV12 is used as before.
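One way to express the bin wiring and the fallback is to build the
appsink branch of the launch description conditionally. This is a
sketch: `build_sink_branch` is a hypothetical helper, and the
availability flag is passed in so the example runs without GStreamer;
in the real code the check would be something like
`Gst.ElementFactory.find("videoscale") is not None`:

```python
def build_sink_branch(has_videoscale: bool,
                      width: int = 640, height: int = 480) -> str:
    """Return the appsink branch of a gst-launch-style description.

    With videoscale present, the VPU still decodes at full stream
    resolution, but a caps filter forces a software downscale before
    the buffer ever crosses into Python.
    """
    if has_videoscale:
        return (f"videoscale ! "
                f"video/x-raw,format=NV12,width={width},height={height} ! "
                f"appsink name=sink emit-signals=true max-buffers=1 drop=true")
    # Fallback: unscaled NV12, exactly as before.
    return "appsink name=sink emit-signals=true max-buffers=1 drop=true"
```

The appsink properties shown (emit-signals, max-buffers, drop) are
illustrative defaults, not taken from the original pipeline.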
Instead of extract_dup (GLib alloc+memcpy → Python bytes) followed by
from_buffer_copy (Python bytes → ctypes array) — two 3MB copies per frame —
use Gst.Buffer.map(READ) to get a zero-allocation pointer to the decoded
frame memory, then memmove directly into a pre-allocated reusable ctypes
array (_raw_arr).
This reduces the per-frame copy path from two copies (~6 MB) to one memmove
(~3 MB), with no Python bytes object allocation at all. The memmove happens
under _frame_lock so render() on the main thread never reads a partial frame.
_raw_arr is allocated once on the first frame (or on resolution change) and
reused for every subsequent frame.
_Frame no longer carries a pixels field. Tests updated accordingly.
Benchmark updated to use the same buffer.map+memmove path as the app.