# Milestone 3 — SDL Video Viewport, HUD, and Wayland Compatibility
## Tasks In Progress
- **NV12 frame path optimization complete**: `videoscale(nearest-neighbour)→640×480` GstBin reduces Python memmove from 32 ms (77% budget) to 1 ms (2.5%) with no FPS or drop regression. Awaiting visual smoke test on device via MatHacks.sh launcher.
- Verify that the SDL-texture playback path is smooth enough during real playback, both on the host and on R36S hardware
- Measure whether BGRA frame upload is acceptable on RK3326 or whether a future YUV texture path is needed
- Device deployment on the physical R36S is now wired through ArkOS `Ports -> MatHacks`, with the heavy runtime under `/home/ark` and only a lightweight stub launcher under `/roms/ports`
- Device env bootstrap on the physical R36S reaches a clean `from r36s_dlna_browser.app import Application` inside `/home/ark/miniconda3/envs/r36s-dlna-browser`
- ArkOS launcher asset added at `deploy/arkos/MatHacks.sh`; current launcher uses the `/home/ark/R36SHack` checkout plus verified `LD_LIBRARY_PATH`, `GST_PLUGIN_PATH`, and `LD_PRELOAD` exports needed to load the system `gstreamer1.0-libav` plugins from the conda runtime
- **Rockchip MPP hardware decode now deployed**: `librockchip-mpp` and `gst-mpp` compiled from source via Docker QEMU (arm64v8/ubuntu:focal), installed on device, `mppvideodec` confirmed visible to GStreamer. The `gstreamer_backend.py` probe auto-boosts `mppvideodec` rank if `/dev/vpu_service` is accessible.
- **Pre-built .so files bundled** in `deploy/arkos/mpp-libs/`; `setup_hw_decode.sh` installs them automatically without network access.
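The `/dev/vpu_service` probe mentioned above can be sketched roughly as follows (the helper name and the exact rank value are assumptions for illustration, not the project's `gstreamer_backend.py` code):

```python
import os

def should_boost_mpp(device="/dev/vpu_service"):
    """True when the Rockchip VPU node is usable, i.e. HW decode can work."""
    return os.access(device, os.R_OK | os.W_OK)

# In the real probe this would gate the rank boost (requires gi/GStreamer):
#   feature = Gst.Registry.get().lookup_feature("mppvideodec")
#   if feature is not None and should_boost_mpp():
#       feature.set_rank(Gst.Rank.PRIMARY + 1)  # outrank avdec_* SW decoders
```

Boosting the rank (rather than building an explicit pipeline) lets playbin's autoplugger pick `mppvideodec` on its own, while hosts without the VPU node fall back to software decode untouched.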
## NV12 Render Path Benchmark Log
All runs performed on the physical R36S (RK3326, 4× A35 @ 1.3 GHz, 1 GB RAM) over SSH.
Stream: 1920×1080 H.264 MKV @ 24fps via MiniDLNA over LAN. Frame budget: 41.7 ms.
- `a201594` → `da02e74`: replaced `extract_dup + from_buffer_copy` (2 copies, 6 MB/frame) with `buffer.map(READ) + memmove` into a pre-allocated ctypes array (1 copy, 3.1 MB). Saved ~3 MB/frame of allocation; copy cost reduced by 8%.
  - A/V drift tighter: −38 ms vs −43 ms
  - Copy cost still 80.5% of the frame budget — the 3.1 MB `memmove` on each frame is the remaining bottleneck
- `da02e74` → `995830e`: identified that the 3.1 MB memmove is necessary only because the appsink receives full 1920×1080 frames, while the display is 640×480. Inserted a `GstBin` containing `videoscale(method=nearest-neighbour) → capsfilter(NV12,640×480) → appsink` as the playbin video-sink. This causes the GStreamer pipeline thread to do SW scale before Python sees the frame; Python then receives only 460 KB (6.7× smaller). Memmove drops from 32 ms to 1 ms (31× improvement, 2.5% budget). FPS and drop count are unchanged (23.99, 0). A/V drift improved from −38 ms to −7 ms.
- Further reduction requires DMA-buf zero-copy (kernel VPU→SDL import without CPU memcpy), which depends on device driver support not currently available through gst-mpp's appsink path
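The winning bin structure can be written as a parse-launch description (the element properties and the `frame_sink` name here are illustrative assumptions; the project builds the bin programmatically):

```python
def scaled_nv12_sink_desc(width=640, height=480):
    """Bin description for playbin's video-sink: SW nearest-neighbour scale
    down to display size, then hand NV12 frames to an appsink. Deliberately
    no queue, so appsink back-pressure rate-limits mppvideodec (this is the
    23.99 fps / 0 drops configuration from the table below)."""
    return (
        "videoscale method=nearest-neighbour ! "
        f"video/x-raw,format=NV12,width={width},height={height} ! "
        "appsink name=frame_sink emit-signals=true max-buffers=1 sync=true"
    )

# With GStreamer present this would plug in as:
#   sink = Gst.parse_bin_from_description(scaled_nv12_sink_desc(), True)
#   playbin.set_property("video-sink", sink)
```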
**Alternatives tested and rejected during `995830e`:**
| Variant | Result | Root cause |
|---------|--------|-----------|
| Bilinear videoscale (no queue) | 20.92 fps, 46 drops | Bilinear reads adjacent rows → loads ~89% of source cache lines, similar cost to memmove; scheduling pressure causes drops |
| Nearest-neighbour + leaky=2 queue | 1.86 fps, 30 drops | `leaky=2` allows mppvideodec to race ahead; queue fills and drops ~93% of frames as stale |
| Nearest-neighbour, no queue | **23.99 fps, 0 drops** ✅ | Nearest reads ~44% of source cache lines; back-pressure from appsink naturally rate-limits mppvideodec |
**Key observations (`995830e`):**
- Memmove reduced from 32 ms (3.1 MB) to ~1 ms (460 KB) — 31× improvement
- No FPS or drop regression vs unscaled path
- A/V drift improved significantly (−7 ms vs −38 ms)
- SW nearest-neighbour scale on A35 costs ~14 ms per frame (estimated from cache line count), but this happens synchronously in the GStreamer pipeline thread BEFORE the appsink callback, not in the Python memmove measurement
- Remaining 97.5% of frame budget is available for SDL upload, HUD rendering, and other pipeline work
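The buffer-size arithmetic behind the 31× memmove improvement, as a quick sanity check:

```python
def nv12_bytes(width, height):
    # NV12 = full-res Y plane + half-res interleaved UV plane = 1.5 B/pixel
    return width * height * 3 // 2

full = nv12_bytes(1920, 1080)   # 3,110,400 B ≈ 3.1 MB per frame
scaled = nv12_bytes(640, 480)   # 460,800 B ≈ 460 KB per frame
shrink = full / scaled          # 6.75× less data crossing into Python
budget_ms = 1000 / 24           # ≈ 41.7 ms per frame at 24 fps
```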
## Blockers Or Open Questions
## Next Recommended Actions
1. Run a visual playback smoke test on device directly via the app launcher (MatHacks.sh) to confirm HUD and video render correctly together under KMSDRM with the videoscale path active (nearest-neighbour 640×480 NV12).
2. Measure SDL_UpdateNVTexture upload cost for the now-smaller 640×480 texture (was 1920×1080). If it is sub-millisecond, the render path is considered optimized.
3. If visual quality from nearest-neighbour scaling is noticeably poor on-device, switch to `scale.set_property("method", 1)` (bilinear) and re-benchmark; the bilinear result (20.92 fps, 46 drops) applied only to the benchmark stream, and the pipeline structure inside the real app differs slightly, so actual playback may behave differently.
4. Profile the SDL render loop under combined video+HUD load to confirm 30+ fps UI responsiveness alongside decoding.
5. Investigate DMA-buf import as a future zero-copy path: gst-mpp may expose DRM DMA-buf fds that SDL's KMSDRM backend could import via a custom EGL path, eliminating the CPU memmove and SW scale entirely. This is a significant engineering effort and is not needed given current performance.
6. `avdec_hevc` is still missing (HEVC decoders are not in the system apt `gstreamer1.0-libav` 1.16.1); `mppvideodec` covers H.264/H.265/VP8/VP9 in hardware, so this is less critical now.
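For action 2, `SDL_UpdateNVTexture` takes separate Y and UV plane pointers and pitches; for tightly packed frames coming out of the appsink the split is pure arithmetic. A sketch, under the assumption that stride equals width (plausible for the 640-wide caps, but worth confirming against the negotiated caps):

```python
def nv12_planes(frame, width, height):
    """Split one packed NV12 frame into (y, y_pitch, uv, uv_pitch) as
    expected by SDL_UpdateNVTexture. Assumes no row padding."""
    y_size = width * height
    y = frame[:y_size]
    uv = frame[y_size:y_size + y_size // 2]
    return y, width, uv, width

# Hypothetical pysdl2 call site (pointers obtained via ctypes from the views):
#   sdl2.SDL_UpdateNVTexture(texture, None, y_ptr, y_pitch, uv_ptr, uv_pitch)
```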