diff --git a/docs/development-status.md b/docs/development-status.md
index 3657262..755834d 100644
--- a/docs/development-status.md
+++ b/docs/development-status.md
@@ -48,30 +48,41 @@ Milestone 3 — SDL Video Viewport, HUD, and Wayland Compatibility
 
 ## Tasks In Progress
 
+- **NV12 frame path optimization complete**: a `videoscale(nearest-neighbour)→640×480` GstBin reduces the Python memmove from 32 ms (77% of budget) to 1 ms (2.5%) with no FPS or drop regression. Awaiting a visual smoke test on device via the MatHacks.sh launcher.
 - Verify that the SDL-texture playback path is smooth enough on real host playback and on R36S hardware
-- Measure whether BGRA frame upload is acceptable on RK3326 or whether a future YUV texture path is needed
 - Device deployment on the physical R36S is now wired through ArkOS `Ports -> MatHacks`, with the heavy runtime under `/home/ark` and only a lightweight stub launcher under `/roms/ports`
-- Device env bootstrap on the physical R36S reaches a clean `from r36s_dlna_browser.app import Application` inside `/home/ark/miniconda3/envs/r36s-dlna-browser`
-- ArkOS launcher asset added at `deploy/arkos/MatHacks.sh`; current launcher uses the `/home/ark/R36SHack` checkout plus verified `LD_LIBRARY_PATH`, `GST_PLUGIN_PATH`, and `LD_PRELOAD` exports needed to load the system `gstreamer1.0-libav` plugins from the conda runtime
-- **Rockchip MPP hardware decode now deployed**: `librockchip-mpp` and `gst-mpp` compiled from source via Docker QEMU (arm64v8/ubuntu:focal), installed on device, `mppvideodec` confirmed visible to GStreamer. The `gstreamer_backend.py` probe auto-boosts `mppvideodec` rank if `/dev/vpu_service` is accessible.
-- **Pre-built .so files bundled** in `deploy/arkos/mpp-libs/`; `setup_hw_decode.sh` installs them automatically without network access.
 
 ## NV12 Render Path Benchmark Log
 
 All runs performed on the physical R36S (RK3326, 4× A35 @ 1.3 GHz, 1 GB RAM) over SSH.
-Stream: 1920×1080 H.264 MKV @ 24fps via MiniDLNA over LAN. Frame budget: 41.7 ms.
+Stream: 1920×1080 H.264 MKV @ 24 fps via MiniDLNA over LAN. Frame budget: 41.7 ms.
 
-| Commit | Copy strategy | Copy mean | Copy % budget | FPS | Dropped | A/V drift |
-|--------|--------------|-----------|---------------|-----|---------|-----------|
+| Commit | Copy / pipeline strategy | Copy mean | Copy % budget | FPS | Dropped | A/V drift |
+|--------|--------------------------|-----------|---------------|-----|---------|-----------|
 | `a201594` | `extract_dup` → bytes + `from_buffer_copy` → ctypes (2 copies, 6 MB/frame) | 36,499 µs | 87.6% | 24.01 | 1 | −42.8 ms |
-| `da02e74` | `buffer.map(READ)` + `memmove` into reusable ctypes array (1 copy, 3 MB/frame) | 33,551 µs | 80.5% | 23.98 | 0 | −38.0 ms |
+| `da02e74` | `buffer.map(READ)` + `memmove` into reusable ctypes array (1 copy, 3.1 MB/frame) | 33,551 µs | 80.5% | 23.98 | 0 | −38.0 ms |
+| `995830e` | `videoscale(nearest)→640×480` in GstBin + `memmove` (1 copy, **0.46 MB/frame**) | **1,033 µs** | **2.5%** | **23.99** | **0** | **−6.9 ms** |
 
-**Key observations (`da02e74`):**
-- 1 dropped frame eliminated (0 vs 1)
-- Jitter improved: stdev 2.6 ms vs 3.6 ms
-- A/V drift tighter: −38 ms vs −43 ms
-- Copy cost still 80.5% of frame budget — the 3.1 MB `memmove` on each frame is the remaining bottleneck
-- Further reduction requires DMA-buf zero-copy (kernel VPU→SDL import without CPU memcpy), which depends on device driver support not currently available through gst-mpp's appsink path
+**Optimization history:**
+
+- `a201594` → `da02e74`: replaced `extract_dup + from_buffer_copy` (2 copies, 6 MB/frame) with `buffer.map(READ) + memmove` into a pre-allocated ctypes array (1 copy, 3.1 MB). Saved ~3 MB/frame of allocation; copy cost dropped by 8% but still consumed ~81% of budget.
+
+- `da02e74` → `995830e`: identified that the 3.1 MB memmove is necessary only because the appsink receives full 1920×1080 frames, while the display is 640×480. Inserted a `GstBin` containing `videoscale(method=nearest-neighbour) → capsfilter(NV12,640×480) → appsink` as the playbin video-sink. The GStreamer pipeline thread now performs the software scale before Python sees the frame, so Python receives only 460 KB (6.7× smaller). Memmove drops from 32 ms to 1 ms (a 31× improvement, 2.5% of budget). FPS and drop count are unchanged (23.99, 0). A/V drift improved from −38 ms to −7 ms.
+
+**Alternatives tested and rejected during `995830e`:**
+
+| Variant | Result | Root cause |
+|---------|--------|------------|
+| Bilinear videoscale (no queue) | 20.92 fps, 46 drops | Bilinear reads adjacent rows → loads ~89% of source cache lines, a cost similar to the memmove; scheduling pressure causes drops |
+| Nearest-neighbour + leaky=2 queue | 1.86 fps, 30 drops | `leaky=2` allows mppvideodec to race ahead; the queue fills and drops ~93% of frames as stale |
+| Nearest-neighbour, no queue | **23.99 fps, 0 drops** ✅ | Nearest reads ~44% of source cache lines; back-pressure from the appsink naturally rate-limits mppvideodec |
+
+**Key observations (`995830e`):**
+- Memmove reduced from 32 ms (3.1 MB) to ~1 ms (460 KB) — a 31× improvement
+- No FPS or drop regression vs the unscaled path
+- A/V drift improved significantly (−7 ms vs −38 ms)
+- SW nearest-neighbour scaling on the A35 costs ~14 ms per frame (estimated from cache-line counts), but it happens synchronously in the GStreamer pipeline thread **before** the appsink callback, so it does not appear in the Python memmove measurement
+- The remaining 97.5% of the frame budget is available for SDL upload, HUD rendering, and other pipeline work
 
 ## Blockers Or Open Questions
 
@@ -94,8 +105,9 @@ Stream: 1920×1080 H.264 MKV @ 24fps via MiniDLNA over LAN. Frame budget: 41.7 m
 
 ## Next Recommended Actions
 
-1. Consider profiling what the remaining 19.5% of frame budget (≈8 ms) consists of — likely SDL_UpdateNVTexture upload + render call overhead + Python GIL churn. If SDL upload is the bottleneck, try `SDL_LockTexture` for direct write instead.
-2. Investigate DMA-buf / dmabuf fd import as a future zero-copy path: gst-mpp may expose DRM DMA-buf fds that SDL's KMSDRM backend can import directly, eliminating the CPU memmove entirely.
-3. Run a visual playback smoke test on device directly via the app launcher (MatHacks.sh) to confirm HUD and video render correctly together under KMSDRM at the current 80.5% copy load.
-4. If 80.5% copy cost causes visible stutter under load (UI overhead competing for A35 cycles), the next option is to reduce resolution at the appsink by inserting a `videoscale` element to 1280×720 before the appsink, cutting memmove to ~1.3 MB/frame (≈35% budget).
-5. `avdec_hevc` is still missing (HEVC decoders not in system apt `gstreamer1.0-libav 1.16.1`); `mppvideodec` covers H.264/H.265/VP8/VP9 via HW so this is less critical now.
+1. Run a visual playback smoke test on device directly via the app launcher (MatHacks.sh) to confirm HUD and video render correctly together under KMSDRM with the videoscale path active (nearest-neighbour 640×480 NV12).
+2. Measure the `SDL_UpdateNVTexture` upload cost for the now-smaller 640×480 texture (was 1920×1080). If it is sub-millisecond, the render path can be considered optimized.
+3. If visual quality from nearest-neighbour scaling is noticeably poor on-device, switch to bilinear scaling (`scale.set_property("method", 1)`) and re-benchmark; the bilinear result (20.92 fps, 46 drops) applied only to the benchmark stream, and actual app playback may behave differently because the pipeline structure inside the real app differs slightly from the benchmark harness.
+4. Profile the SDL render loop under combined video+HUD load to confirm 30+ fps UI responsiveness alongside decoding.
+5. Investigate DMA-buf import as a future zero-copy path: gst-mpp may expose DRM DMA-buf fds that SDL's KMSDRM backend could import via an EGL image path (`EGL_EXT_image_dma_buf_import`), eliminating both the CPU memmove and the SW scale entirely. This is a significant engineering effort and is not needed given current performance.
+6. `avdec_hevc` is still missing (HEVC decoders are not in the system apt `gstreamer1.0-libav 1.16.1`); `mppvideodec` covers H.264/H.265/VP8/VP9 in HW, so this is less critical now.
\ No newline at end of file
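
For reference, the `995830e` scaling bin described above could be expressed as a `Gst.parse_bin_from_description` string roughly like the following sketch. The element layout matches the log (nearest-neighbour videoscale, NV12 640×480 capsfilter, appsink); the sink name and the `max-buffers=1` setting are assumptions used here to model the appsink back-pressure, not the exact code in `gstreamer_backend.py`:

```python
# Hypothetical sketch of the playbin video-sink bin from the benchmark log.
# max-buffers=1 with no leaky flag models the back-pressure that
# rate-limits mppvideodec (the leaky=2 variant was rejected above).
BIN_DESC = (
    "videoscale method=nearest-neighbour"
    " ! video/x-raw,format=NV12,width=640,height=480"
    " ! appsink name=frame_sink emit-signals=true max-buffers=1"
)

# On device, with PyGObject available, this would be wired up as:
#   sink_bin = Gst.parse_bin_from_description(BIN_DESC, True)
#   playbin.set_property("video-sink", sink_bin)
print(BIN_DESC)
```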
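
The single-copy strategy from `da02e74` (one `memmove` into a pre-allocated ctypes array, carried into `995830e` at the smaller frame size) can be sketched with stdlib ctypes alone. The callback name and the stand-in source buffer are hypothetical; on device the source would be the pointer from `Gst.Buffer.map(Gst.MapFlags.READ)`:

```python
import ctypes

# NV12 frame size at the scaled 640×480 resolution:
# full-resolution Y plane plus a UV plane half the size of Y
# gives 1.5 bytes per pixel.
FRAME_BYTES = 640 * 480 * 3 // 2  # 460,800 bytes (~0.46 MB)

# Allocated ONCE, outside the per-frame path, so each appsink
# callback costs exactly one memmove and zero Python allocations.
frame_store = (ctypes.c_uint8 * FRAME_BYTES)()

def on_new_sample(src, size):
    # src stands in for the mapped Gst.Buffer memory; ctypes.memmove
    # accepts a bytes object or a raw address as the source.
    ctypes.memmove(frame_store, src, min(size, FRAME_BYTES))
    return frame_store

# Stand-in for one mapped NV12 frame (hypothetical test data).
fake_frame = bytes([7]) * FRAME_BYTES
on_new_sample(fake_frame, len(fake_frame))
print(frame_store[0], frame_store[FRAME_BYTES - 1])
```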
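
The headline numbers in the benchmark table follow directly from NV12's 1.5 bytes/pixel; a quick arithmetic check of the per-frame sizes and the 41.7 ms budget:

```python
def nv12_bytes(width, height):
    # NV12 = full-resolution Y plane + 2x2-subsampled interleaved UV plane
    return width * height * 3 // 2

full = nv12_bytes(1920, 1080)   # 3,110,400 bytes, the "3.1 MB/frame" in the table
small = nv12_bytes(640, 480)    #   460,800 bytes, the "0.46 MB/frame" / "460 KB"

print(round(full / small, 2))   # size ratio, the "6.7× smaller" claim
print(round(1000 / 24, 1))      # ms per frame at 24 fps, the "41.7 ms" budget
```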
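
One plausible reading of the ~44% vs ~89% cache-line figures in the alternatives table is vertical row sampling (this is an assumption about how the estimate was made, not confirmed from the source): at a 1080→480 downscale, nearest-neighbour touches only the source rows it samples, while bilinear reads two adjacent source rows per output row:

```python
SRC_ROWS, DST_ROWS = 1080, 480

# Nearest-neighbour: one sampled source row per output row.
nearest_rows = DST_ROWS
# Bilinear: two adjacent source rows per output row; at this ratio the
# pairs are mostly distinct, so ~2x is a reasonable upper bound.
bilinear_rows = min(2 * DST_ROWS, SRC_ROWS)

print(round(nearest_rows / SRC_ROWS, 3))   # fraction of source rows touched (nearest)
print(round(bilinear_rows / SRC_ROWS, 3))  # fraction of source rows touched (bilinear)
```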