
docs: update benchmark log with full videoscale optimization history and results

Matteo Benedetto, 1 week ago. Commit `528c7f94e7` on `main`.
Changed: docs/development-status.md (52 lines)
## Milestone 3 — SDL Video Viewport, HUD, and Wayland Compatibility
## Tasks In Progress
- **NV12 frame path optimization complete**: `videoscale(nearest-neighbour)→640×480` GstBin reduces Python memmove from 32 ms (77% budget) to 1 ms (2.5%) with no FPS or drop regression. Awaiting visual smoke test on device via MatHacks.sh launcher.
- Verify that the SDL-texture playback path is smooth enough on real host playback and on R36S hardware
- Measure whether BGRA frame upload is acceptable on RK3326 or whether a future YUV texture path is needed
- Device deployment on the physical R36S is now wired through ArkOS `Ports -> MatHacks`, with the heavy runtime under `/home/ark` and only a lightweight stub launcher under `/roms/ports`
- Device env bootstrap on the physical R36S reaches a clean `from r36s_dlna_browser.app import Application` inside `/home/ark/miniconda3/envs/r36s-dlna-browser`
- ArkOS launcher asset added at `deploy/arkos/MatHacks.sh`; current launcher uses the `/home/ark/R36SHack` checkout plus verified `LD_LIBRARY_PATH`, `GST_PLUGIN_PATH`, and `LD_PRELOAD` exports needed to load the system `gstreamer1.0-libav` plugins from the conda runtime
- **Rockchip MPP hardware decode now deployed**: `librockchip-mpp` and `gst-mpp` compiled from source via Docker QEMU (arm64v8/ubuntu:focal), installed on device, `mppvideodec` confirmed visible to GStreamer. The `gstreamer_backend.py` probe auto-boosts `mppvideodec` rank if `/dev/vpu_service` is accessible.
- **Pre-built .so files bundled** in `deploy/arkos/mpp-libs/`; `setup_hw_decode.sh` installs them automatically without network access.
## NV12 Render Path Benchmark Log
All runs performed on the physical R36S (RK3326, 4× A35 @ 1.3 GHz, 1 GB RAM) over SSH.
Stream: 1920×1080 H.264 MKV @ 24 fps via MiniDLNA over LAN. Frame budget: 41.7 ms.
| Commit | Copy / pipeline strategy | Copy mean | Copy % budget | FPS | Dropped | A/V drift |
|--------|--------------------------|-----------|---------------|-----|---------|-----------|
| `a201594` | `extract_dup` → bytes + `from_buffer_copy` → ctypes (2 copies, 6 MB/frame) | 36,499 µs | 87.6% | 24.01 | 1 | −42.8 ms |
| `da02e74` | `buffer.map(READ)` + `memmove` into reusable ctypes array (1 copy, 3.1 MB/frame) | 33,551 µs | 80.5% | 23.98 | 0 | −38.0 ms |
| `995830e` | `videoscale(nearest)→640×480` in GstBin + `memmove` (1 copy, **0.46 MB/frame**) | **1,033 µs** | **2.5%** | **23.99** | **0** | **−6.9 ms** |
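The frame budget and per-frame sizes in the table follow directly from the stream parameters; a quick sanity check (pure arithmetic, no GStreamer required):

```python
# NV12 stores a full-resolution Y plane plus a half-resolution
# interleaved UV plane: 1.5 bytes per pixel overall.
BYTES_PER_PIXEL_NV12 = 1.5

def nv12_frame_bytes(width: int, height: int) -> int:
    """Size of one tightly packed NV12 frame in bytes."""
    return int(width * height * BYTES_PER_PIXEL_NV12)

def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps

full = nv12_frame_bytes(1920, 1080)   # 3_110_400 B ≈ 3.1 MB
scaled = nv12_frame_bytes(640, 480)   # 460_800 B ≈ 0.46 MB
budget = frame_budget_ms(24)          # ≈ 41.7 ms

print(full, scaled, round(budget, 1))  # 3110400 460800 41.7
```

The 3.1 MB → 0.46 MB drop is the 6.7× reduction the `995830e` row reports.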
**Key observations (`da02e74`):**
- 1 dropped frame eliminated (0 vs 1)
- Jitter improved: stdev 2.6 ms vs 3.6 ms
- A/V drift tighter: −38 ms vs −43 ms
- Copy cost still 80.5% of frame budget — the 3.1 MB `memmove` on each frame is the remaining bottleneck
- Further reduction requires DMA-buf zero-copy (kernel VPU→SDL import without CPU memcpy), which depends on device driver support not currently available through gst-mpp's appsink path
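The single-copy strategy from `da02e74` can be sketched with stdlib `ctypes` alone; the `Gst.Buffer.map(Gst.MapFlags.READ)` call that yields the source pointer in the real appsink callback is shown only as a comment, and a same-sized bytes object stands in for the mapped buffer:

```python
import ctypes

FRAME_BYTES = 460_800  # 640×480 NV12 after the videoscale bin

# Pre-allocated once at pipeline start and reused for every frame,
# so no per-frame Python allocation occurs.
frame_store = (ctypes.c_char * FRAME_BYTES)()

def copy_frame(src_ptr: int, size: int) -> None:
    """One memmove from mapped GstBuffer data into the reusable array.

    In the real callback src_ptr comes from
    `ok, mapinfo = buffer.map(Gst.MapFlags.READ)` and the buffer is
    unmapped afterwards; only the copy itself is demonstrated here.
    """
    ctypes.memmove(frame_store, src_ptr, size)

# Stand-in for a mapped buffer: a bytes object pinned via a ctypes copy.
fake_frame = bytes(range(256)) * (FRAME_BYTES // 256)
src = (ctypes.c_char * FRAME_BYTES).from_buffer_copy(fake_frame)
copy_frame(ctypes.addressof(src), FRAME_BYTES)
assert bytes(frame_store) == fake_frame
```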
**Optimization history:**
- `a201594` → `da02e74`: replaced `extract_dup + from_buffer_copy` (2 copies, 6 MB/frame) with `buffer.map(READ) + memmove` into a pre-allocated ctypes array (1 copy, 3.1 MB). Saved ~3 MB/frame allocation; copy cost reduced by 8% but still ~81% of budget.
- `da02e74` → `995830e`: identified that the 3.1 MB memmove is necessary only because the appsink receives full 1920×1080 frames, while the display is 640×480. Inserted a `GstBin` containing `videoscale(method=nearest-neighbour) → capsfilter(NV12,640×480) → appsink` as the playbin video-sink. This causes the GStreamer pipeline thread to do SW scale before Python sees the frame; Python then receives only 460 KB (6.7× smaller). Memmove drops from 32 ms to 1 ms (31× improvement, 2.5% budget). FPS and drop count are unchanged (23.99, 0). A/V drift improved from −38 ms to −7 ms.
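The winning sink bin from `995830e`, written in `gst-launch`-style notation (a sketch only; the real code builds a `GstBin` programmatically and assigns it to playbin's `video-sink` property, and the exact appsink properties shown here are assumptions):

```
videoscale method=nearest-neighbour
    ! video/x-raw,format=NV12,width=640,height=480
    ! appsink emit-signals=true sync=true
```

Note there is deliberately no `queue` element: as the alternatives table below shows, appsink back-pressure is what keeps `mppvideodec` paced.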
**Alternatives tested and rejected during `995830e`:**
| Variant | Result | Root cause |
|---------|--------|-----------|
| Bilinear videoscale (no queue) | 20.92 fps, 46 drops | Bilinear reads adjacent rows → loads ~89% of source cache lines, similar cost to memmove; scheduling pressure causes drops |
| Nearest-neighbour + leaky=2 queue | 1.86 fps, 30 drops | `leaky=2` allows mppvideodec to race ahead; queue fills and drops ~93% of frames as stale |
| Nearest-neighbour, no queue | **23.99 fps, 0 drops** ✅ | Nearest reads ~44% of source cache lines; back-pressure from appsink naturally rate-limits mppvideodec |
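The cache-line percentages in the table come from how many distinct source rows each scaler touches for a 1080 → 480 vertical scale (a rough model, assuming roughly one cache-line stream per source row read):

```python
SRC_ROWS, DST_ROWS = 1080, 480

# Nearest-neighbour picks a single source row per output row.
nearest_rows = len({round(y * SRC_ROWS / DST_ROWS) for y in range(DST_ROWS)})

# Bilinear blends the two source rows bracketing each output position.
bilinear_rows = len({r for y in range(DST_ROWS)
                     for r in (int(y * SRC_ROWS / DST_ROWS),
                               int(y * SRC_ROWS / DST_ROWS) + 1)})

print(f"nearest:  {nearest_rows / SRC_ROWS:.0%}")   # ≈ 44%
print(f"bilinear: {bilinear_rows / SRC_ROWS:.0%}")  # ≈ 89%
```

Because the vertical step is 2.25 source rows per output row, nearest touches 480 of 1080 rows while bilinear touches 960, which is why bilinear costs nearly as much as the full-frame memmove it was meant to replace.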
**Key observations (`995830e`):**
- Memmove reduced from 32 ms (3.1 MB) to ~1 ms (460 KB) — 31× improvement
- No FPS or drop regression vs unscaled path
- A/V drift improved significantly (−7 ms vs −38 ms)
- SW nearest-neighbour scale on A35 costs ~14 ms per frame (estimated from cache line count), but this happens synchronously in the GStreamer pipeline thread BEFORE the appsink callback, not in the Python memmove measurement
- Remaining 97.5% of frame budget is available for SDL upload, HUD rendering, and other pipeline work
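For the SDL upload step, `SDL_UpdateNVTexture` takes separate Y and UV plane pointers and pitches; for the tightly packed 640×480 NV12 frames coming out of the appsink the plane layout works out as follows (a sketch, assuming the scaled caps carry no row padding):

```python
WIDTH, HEIGHT = 640, 480

y_pitch = WIDTH                      # 1 byte per Y sample, no padding
y_size = y_pitch * HEIGHT            # 307_200 B
uv_pitch = WIDTH                     # interleaved U/V pairs, half-height plane
uv_size = uv_pitch * (HEIGHT // 2)   # 153_600 B

frame_size = y_size + uv_size        # 460_800 B, matching the benchmark

# The two plane pointers passed to
#   SDL_UpdateNVTexture(texture, None, y_ptr, y_pitch, uv_ptr, uv_pitch)
# would be: y_ptr = base, uv_ptr = base + y_size.
print(y_size, uv_size, frame_size)  # 307200 153600 460800
```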
## Blockers Or Open Questions
## Next Recommended Actions
1. Run a visual playback smoke test on device directly via the app launcher (MatHacks.sh) to confirm HUD and video render correctly together under KMSDRM with the videoscale path active (nearest-neighbour 640×480 NV12).
2. Measure SDL_UpdateNVTexture upload cost for the now-smaller 640×480 texture (was 1920×1080). If it is sub-millisecond, the render path is considered optimized.
3. If visual quality from nearest-neighbour scaling is noticeably poor on-device, switch the scaler to bilinear via `scale.set_property("method", 1)` and re-benchmark. The bilinear result above (20.92 fps, 46 drops) applied only to the benchmark stream; actual app playback may behave differently because the pipeline structure inside the real app differs slightly from the benchmark.
4. Consider profiling the SDL render loop under combined video+HUD load to confirm 30+ fps UI responsiveness alongside decoding.
5. Investigate DMA-buf import as a future zero-copy path: gst-mpp may expose DRM DMA-buf fds that SDL's KMSDRM backend can import directly via `SDL_CreateTextureFromSurface` or a custom EGL path, eliminating the CPU memmove and SW scale entirely. This is a significant engineering effort and is not needed given current performance.
6. `avdec_hevc` is still missing (HEVC decoders not in system apt `gstreamer1.0-libav 1.16.1`); `mppvideodec` covers H.264/H.265/VP8/VP9 via HW so this is less critical now.
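For item 2, a minimal timing harness in the same style as the benchmark numbers above (the SDL call is replaced by a same-sized `memmove` stand-in so the harness itself runs anywhere; on device, swap in the real `SDL_UpdateNVTexture` call):

```python
import ctypes
import statistics
import time

FRAME_BYTES = 460_800  # 640×480 NV12
src = (ctypes.c_char * FRAME_BYTES)()
dst = (ctypes.c_char * FRAME_BYTES)()

def upload_stand_in() -> None:
    # Replace with the SDL_UpdateNVTexture(...) call on device.
    ctypes.memmove(dst, src, FRAME_BYTES)

samples_us = []
for _ in range(200):
    t0 = time.perf_counter()
    upload_stand_in()
    samples_us.append((time.perf_counter() - t0) * 1e6)

print(f"mean {statistics.mean(samples_us):8.1f} µs  "
      f"stdev {statistics.stdev(samples_us):6.1f} µs  "
      f"p99 {sorted(samples_us)[197]:8.1f} µs")
```

Reporting mean, stdev, and a high percentile keeps the result comparable with the copy-mean and jitter figures in the benchmark log.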