When hardware decode (mppvideodec/NV12) is active, wrap the appsink in a
GstBin containing a videoscale element. The VPU still decodes at full
stream resolution, but Python receives only frames already scaled to the
SDL display size (default 640x480).
Effect:
  NV12 buffer per frame: 3,133,440 B (1080p) → 460,800 B (640x480)
  memmove per frame:     ~33 ms (80.5% of frame budget) → ~5 ms (expected, ~12%)
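
The figures above can be reproduced with a little arithmetic. This is a minimal sketch, not code from this change: nv12_size is a hypothetical helper, and the 16-row height alignment is an inference from the 3,133,440 B figure (it matches a 1920x1088 buffer), not something stated here.

```python
def nv12_size(width: int, height: int, align: int = 1) -> int:
    """Bytes in one NV12 frame: full-res Y plane plus half-res interleaved UV."""
    # Round the height up to a multiple of `align` (assumed decoder padding).
    padded_height = -(-height // align) * align
    return width * padded_height * 3 // 2


full = nv12_size(1920, 1080, align=16)  # 1920 x 1088 after assumed padding
scaled = nv12_size(640, 480)
print(full, scaled)                     # 3133440 460800
print(round(full / scaled, 1))          # 6.8 (byte ratio)

# Frame budget implied by "33 ms is 80.5% of the budget".
budget_ms = 33 / 0.805
print(round(budget_ms, 1))              # 41.0 (about 24 fps)
print(round(5 / budget_ms * 100, 1))    # 12.2 (percent of budget at ~5 ms)
```

The implied budget of about 41 ms per frame corresponds to a stream of roughly 24 fps, which is consistent with the ~12% estimate for a ~5 ms copy.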
The videoscale bilinear step runs entirely in software on the A35 cores,
but it reduces the pixel count by about 6.7×, so its cost is expected to
be far lower than that of the memmove it avoids.
SDL still handles final aspect-ratio fitting inside the viewport, so
visual quality is unchanged relative to what the 640x480 display can show.
Fallback: if videoscale is not available, unscaled NV12 is used as before.
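
A minimal sketch of how such a sink bin and its fallback could be built, assuming PyGObject and GStreamer 1.0. The helper names (sink_description, make_sink_bin), the element name "sink", and the appsink settings (emit-signals, max-buffers, drop) are illustrative assumptions, not the code in this change.

```python
def sink_description(width=640, height=480, have_videoscale=True):
    """Return a gst-launch style description for the sink bin."""
    # Hypothetical appsink settings: keep only the newest frame.
    appsink = "appsink name=sink emit-signals=true max-buffers=1 drop=true"
    if not have_videoscale:
        # Fallback: deliver unscaled NV12 straight to the appsink.
        return appsink
    caps = f"video/x-raw,format=NV12,width={width},height={height}"
    return f"videoscale method=bilinear ! {caps} ! {appsink}"


def make_sink_bin(width=640, height=480):
    """Build the GstBin; GStreamer is imported lazily so the module loads without it."""
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)
    # ElementFactory.find returns None when the plugin is not installed.
    have_videoscale = Gst.ElementFactory.find("videoscale") is not None
    description = sink_description(width, height, have_videoscale)
    # ghost_unlinked_pads=True exposes the first element's sink pad on the bin.
    return Gst.parse_bin_from_description(description, True)
```

The description string is kept in a separate pure function so the scaled and fallback paths can be checked without a GStreamer installation. The frame callback would then look up the appsink inside the bin by name, for example with bin.get_by_name("sink").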