
docs: update benchmark log with full videoscale optimization history and results

Matteo Benedetto, 1 week ago. Commit `528c7f94e7` on `main`.
Changed: docs/development-status.md (52 lines)
## Milestone 3 — SDL Video Viewport, HUD, and Wayland Compatibility
## Tasks In Progress
- **NV12 frame path optimization complete**: `videoscale(nearest-neighbour)→640×480` GstBin reduces Python memmove from 32 ms (77% budget) to 1 ms (2.5%) with no FPS or drop regression. Awaiting visual smoke test on device via MatHacks.sh launcher.
- Verify that the SDL-texture playback path is smooth enough on real host playback and on R36S hardware
- Measure whether BGRA frame upload is acceptable on RK3326 or whether a future YUV texture path is needed
- Device deployment on the physical R36S is now wired through ArkOS `Ports -> MatHacks`, with the heavy runtime under `/home/ark` and only a lightweight stub launcher under `/roms/ports`
- Device env bootstrap on the physical R36S reaches a clean `from r36s_dlna_browser.app import Application` inside `/home/ark/miniconda3/envs/r36s-dlna-browser`
- ArkOS launcher asset added at `deploy/arkos/MatHacks.sh`; current launcher uses the `/home/ark/R36SHack` checkout plus verified `LD_LIBRARY_PATH`, `GST_PLUGIN_PATH`, and `LD_PRELOAD` exports needed to load the system `gstreamer1.0-libav` plugins from the conda runtime
- **Rockchip MPP hardware decode now deployed**: `librockchip-mpp` and `gst-mpp` compiled from source via Docker QEMU (arm64v8/ubuntu:focal), installed on device, `mppvideodec` confirmed visible to GStreamer. The `gstreamer_backend.py` probe auto-boosts `mppvideodec` rank if `/dev/vpu_service` is accessible.
- **Pre-built .so files bundled** in `deploy/arkos/mpp-libs/`; `setup_hw_decode.sh` installs them automatically without network access.
## NV12 Render Path Benchmark Log
All runs performed on the physical R36S (RK3326, 4× A35 @ 1.3 GHz, 1 GB RAM) over SSH.
Stream: 1920×1080 H.264 MKV @ 24 fps via MiniDLNA over LAN. Frame budget: 41.7 ms.
| Commit | Copy / pipeline strategy | Copy mean | Copy % budget | FPS | Dropped | A/V drift |
|--------|--------------------------|-----------|---------------|-----|---------|-----------|
| `a201594` | `extract_dup` → bytes + `from_buffer_copy` → ctypes (2 copies, 6 MB/frame) | 36,499 µs | 87.6% | 24.01 | 1 | −42.8 ms |
| `da02e74` | `buffer.map(READ)` + `memmove` into reusable ctypes array (1 copy, 3.1 MB/frame) | 33,551 µs | 80.5% | 23.98 | 0 | −38.0 ms |
| `995830e` | `videoscale(nearest)→640×480` in GstBin + `memmove` (1 copy, **0.46 MB/frame**) | **1,033 µs** | **2.5%** | **23.99** | **0** | **−6.9 ms** |
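The frame budget and per-frame sizes in the table follow directly from the stream parameters; a quick sanity check (pure arithmetic, no GStreamer required):

```python
# NV12 stores a full-resolution Y plane plus a half-resolution
# interleaved UV plane: 1.5 bytes per pixel overall.
BYTES_PER_PIXEL_NV12 = 1.5

def nv12_frame_bytes(width: int, height: int) -> int:
    """Size of one tightly packed NV12 frame in bytes."""
    return int(width * height * BYTES_PER_PIXEL_NV12)

def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps

full = nv12_frame_bytes(1920, 1080)   # 3_110_400 B ≈ 3.1 MB
scaled = nv12_frame_bytes(640, 480)   # 460_800 B ≈ 0.46 MB
budget = frame_budget_ms(24)          # ≈ 41.7 ms

print(full, scaled, round(budget, 1))  # 3110400 460800 41.7
```

The 3.1 MB → 0.46 MB drop is the 6.7× reduction the `995830e` row reports.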
**Key observations (`da02e74`):**
- 1 dropped frame eliminated (0 vs 1)
- Jitter improved: stdev 2.6 ms vs 3.6 ms
- A/V drift tighter: −38 ms vs −43 ms
- Copy cost still 80.5% of frame budget — the 3.1 MB `memmove` on each frame is the remaining bottleneck
- Further reduction requires DMA-buf zero-copy (kernel VPU→SDL import without CPU memcpy), which depends on device driver support not currently available through gst-mpp's appsink path
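The single-copy strategy from `da02e74` can be sketched with stdlib `ctypes` alone; the `Gst.Buffer.map(Gst.MapFlags.READ)` call that yields the source pointer in the real appsink callback is shown only as a comment, and a same-sized bytes object stands in for the mapped buffer:

```python
import ctypes

FRAME_BYTES = 460_800  # 640×480 NV12 after the videoscale bin

# Pre-allocated once at pipeline start and reused for every frame,
# so no per-frame Python allocation occurs.
frame_store = (ctypes.c_char * FRAME_BYTES)()

def copy_frame(src_ptr: int, size: int) -> None:
    """One memmove from mapped GstBuffer data into the reusable array.

    In the real callback src_ptr comes from
    `ok, mapinfo = buffer.map(Gst.MapFlags.READ)` and the buffer is
    unmapped afterwards; only the copy itself is demonstrated here.
    """
    ctypes.memmove(frame_store, src_ptr, size)

# Stand-in for a mapped buffer: a bytes object pinned via a ctypes copy.
fake_frame = bytes(range(256)) * (FRAME_BYTES // 256)
src = (ctypes.c_char * FRAME_BYTES).from_buffer_copy(fake_frame)
copy_frame(ctypes.addressof(src), FRAME_BYTES)
assert bytes(frame_store) == fake_frame
```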
**Optimization history:**
- `a201594` → `da02e74`: replaced `extract_dup + from_buffer_copy` (2 copies, 6 MB/frame) with `buffer.map(READ) + memmove` into a pre-allocated ctypes array (1 copy, 3.1 MB). Saved ~3 MB/frame allocation; copy cost reduced by 8% but still ~81% of budget.
- `da02e74` → `995830e`: identified that the 3.1 MB memmove is necessary only because the appsink receives full 1920×1080 frames, while the display is 640×480. Inserted a `GstBin` containing `videoscale(method=nearest-neighbour) → capsfilter(NV12,640×480) → appsink` as the playbin video-sink. This causes the GStreamer pipeline thread to do SW scale before Python sees the frame; Python then receives only 460 KB (6.7× smaller). Memmove drops from 32 ms to 1 ms (31× improvement, 2.5% budget). FPS and drop count are unchanged (23.99, 0). A/V drift improved from −38 ms to −7 ms.
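The winning sink bin from `995830e`, written in `gst-launch`-style notation (a sketch only; the real code builds a `GstBin` programmatically and assigns it to playbin's `video-sink` property, and the exact appsink properties shown here are assumptions):

```
videoscale method=nearest-neighbour
    ! video/x-raw,format=NV12,width=640,height=480
    ! appsink emit-signals=true sync=true
```

Note there is deliberately no `queue` element: as the alternatives table below shows, appsink back-pressure is what keeps `mppvideodec` paced.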
**Alternatives tested and rejected during `995830e`:**
| Variant | Result | Root cause |
|---------|--------|-----------|
| Bilinear videoscale (no queue) | 20.92 fps, 46 drops | Bilinear reads adjacent rows → loads ~89% of source cache lines, similar cost to memmove; scheduling pressure causes drops |
| Nearest-neighbour + leaky=2 queue | 1.86 fps, 30 drops | `leaky=2` allows mppvideodec to race ahead; queue fills and drops ~93% of frames as stale |
| Nearest-neighbour, no queue | **23.99 fps, 0 drops** ✅ | Nearest reads ~44% of source cache lines; back-pressure from appsink naturally rate-limits mppvideodec |
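The cache-line percentages in the table come from how many distinct source rows each scaler touches for a 1080 → 480 vertical scale (a rough model, assuming roughly one cache-line stream per source row read):

```python
SRC_ROWS, DST_ROWS = 1080, 480

# Nearest-neighbour picks a single source row per output row.
nearest_rows = len({round(y * SRC_ROWS / DST_ROWS) for y in range(DST_ROWS)})

# Bilinear blends the two source rows bracketing each output position.
bilinear_rows = len({r for y in range(DST_ROWS)
                     for r in (int(y * SRC_ROWS / DST_ROWS),
                               int(y * SRC_ROWS / DST_ROWS) + 1)})

print(f"nearest:  {nearest_rows / SRC_ROWS:.0%}")   # ≈ 44%
print(f"bilinear: {bilinear_rows / SRC_ROWS:.0%}")  # ≈ 89%
```

Because the vertical step is 2.25 source rows per output row, nearest touches 480 of 1080 rows while bilinear touches 960, which is why bilinear costs nearly as much as the full-frame memmove it was meant to replace.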
**Key observations (`995830e`):**
- Memmove reduced from 32 ms (3.1 MB) to ~1 ms (460 KB) — 31× improvement
- No FPS or drop regression vs unscaled path
- A/V drift improved significantly (−7 ms vs −38 ms)
- SW nearest-neighbour scale on A35 costs ~14 ms per frame (estimated from cache line count), but this happens synchronously in the GStreamer pipeline thread BEFORE the appsink callback, not in the Python memmove measurement
- Remaining 97.5% of frame budget is available for SDL upload, HUD rendering, and other pipeline work
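For the SDL upload step, `SDL_UpdateNVTexture` takes separate Y and UV plane pointers and pitches; for the tightly packed 640×480 NV12 frames coming out of the appsink the plane layout works out as follows (a sketch, assuming the scaled caps carry no row padding):

```python
WIDTH, HEIGHT = 640, 480

y_pitch = WIDTH                      # 1 byte per Y sample, no padding
y_size = y_pitch * HEIGHT            # 307_200 B
uv_pitch = WIDTH                     # interleaved U/V pairs, half-height plane
uv_size = uv_pitch * (HEIGHT // 2)   # 153_600 B

frame_size = y_size + uv_size        # 460_800 B, matching the benchmark

# The two plane pointers passed to
#   SDL_UpdateNVTexture(texture, None, y_ptr, y_pitch, uv_ptr, uv_pitch)
# would be: y_ptr = base, uv_ptr = base + y_size.
print(y_size, uv_size, frame_size)  # 307200 153600 460800
```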
## Blockers Or Open Questions
## Next Recommended Actions
1. Run a visual playback smoke test on device directly via the app launcher (MatHacks.sh) to confirm HUD and video render correctly together under KMSDRM with the videoscale path active (nearest-neighbour 640×480 NV12).
2. Measure SDL_UpdateNVTexture upload cost for the now-smaller 640×480 texture (was 1920×1080). If it is sub-millisecond, the render path is considered optimized.
3. If visual quality from nearest-neighbour scaling is noticeably poor on-device, switch the scaler to bilinear via `scale.set_property("method", 1)` and re-benchmark. The bilinear result above (20.92 fps, 46 drops) applied only to the benchmark stream; actual app playback may behave differently because the pipeline structure inside the real app differs slightly from the benchmark.
4. Consider profiling the SDL render loop under combined video+HUD load to confirm 30+ fps UI responsiveness alongside decoding.
5. Investigate DMA-buf import as a future zero-copy path: gst-mpp may expose DRM DMA-buf fds that SDL's KMSDRM backend can import directly via `SDL_CreateTextureFromSurface` or a custom EGL path, eliminating the CPU memmove and SW scale entirely. This is a significant engineering effort and is not needed given current performance.
6. `avdec_hevc` is still missing (HEVC decoders not in system apt `gstreamer1.0-libav 1.16.1`); `mppvideodec` covers H.264/H.265/VP8/VP9 via HW so this is less critical now.
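For item 2, a minimal timing harness in the same style as the benchmark numbers above (the SDL call is replaced by a same-sized `memmove` stand-in so the harness itself runs anywhere; on device, swap in the real `SDL_UpdateNVTexture` call):

```python
import ctypes
import statistics
import time

FRAME_BYTES = 460_800  # 640×480 NV12
src = (ctypes.c_char * FRAME_BYTES)()
dst = (ctypes.c_char * FRAME_BYTES)()

def upload_stand_in() -> None:
    # Replace with the SDL_UpdateNVTexture(...) call on device.
    ctypes.memmove(dst, src, FRAME_BYTES)

samples_us = []
for _ in range(200):
    t0 = time.perf_counter()
    upload_stand_in()
    samples_us.append((time.perf_counter() - t0) * 1e6)

print(f"mean {statistics.mean(samples_us):8.1f} µs  "
      f"stdev {statistics.stdev(samples_us):6.1f} µs  "
      f"p99 {sorted(samples_us)[197]:8.1f} µs")
```

Reporting mean, stdev, and a high percentile keeps the result comparable with the copy-mean and jitter figures in the benchmark log.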