FYP23081 - VisionFlow

The University of Hong Kong

Department of Computer Science

COMP4804 Final Year Project

FYP23081

VisionFlow: A Cloud-based Cross-Platform Collaborative Visual Scripting IDE for Computer Vision Tasks through WebRTC

Justus Ip / Supervised by T.W. Chim

Motivations

In recent years, Computer Vision (CV) has emerged as a powerful tool for applications across domains such as manufacturing, healthcare, transportation and surveillance.
However, there are certain problems that hinder the adoption of these technologies in practice.

Parking spot detection using Mask-RCNN.

Hardware Constraint

Many advanced CV libraries and models require powerful GPUs or TPUs to run smoothly.
This makes it infeasible for small organisations to experiment with these "niche" technologies due to financial constraints.

Nvidia RTX Series Graphics Cards. Many CV and ML libraries rely on CUDA, Nvidia's GPU hardware acceleration technology, to run smoothly.

User Friendliness

Learning to develop CV tasks requires intermediate coding skills.
Setting up the required software and libraries for CV tasks can be complicated and time-consuming.
This poses difficulties for students and researchers without a technical programming background.
Although node-based visual scripting IDEs such as MATLAB allow programming without prior programming knowledge, there is no equivalent for CV tasks.

Node-based visual scripting IDEs such as MATLAB allow programming without prior programming knowledge.

Black Box Problem

When developing CV pipelines, for example with OpenCV, it is hard to evaluate performance because the intermediate stages give no real-time feedback.
Without that feedback, it is difficult to tell how well each stage of the pipeline is performing and to locate the stage where a problem arises.

License Plate Recognition Process. The steps in the middle are usually opaque during the development process, making it hard to evaluate the performance.

Introducing VisionFlow

VisionFlow is a Web-Based Visual Scripting Integrated Development Environment (IDE) for Computer Vision programs.
It enables users to develop computer vision programs by connecting ready-to-use building blocks instead of writing code.

A Python script that makes a video blurry, using the OpenCV library.
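The original script appears only as an image, so the following is a minimal sketch of what such a script could look like, not the author's code. The function names `odd_kernel` and `blur_video`, the default kernel size, and the `mp4v` output codec are assumptions made for illustration; the OpenCV calls themselves (`VideoCapture`, `VideoWriter`, `GaussianBlur`) are standard.

```python
def odd_kernel(size: int) -> int:
    """Return `size` if it is odd, else the next odd number.

    cv2.GaussianBlur requires positive, odd kernel dimensions.
    """
    if size < 1:
        raise ValueError("kernel size must be positive")
    return size if size % 2 == 1 else size + 1


def blur_video(src: str, dst: str, size: int = 15) -> int:
    """Blur every frame of `src`, write the result to `dst`, return the frame count."""
    import cv2  # imported lazily so odd_kernel works without OpenCV installed

    k = odd_kernel(size)
    cap = cv2.VideoCapture(src)
    if not cap.isOpened():
        raise IOError(f"cannot open {src}")

    # Fall back to 30 fps when the container does not report a frame rate.
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")  # assumed output codec
    writer = cv2.VideoWriter(dst, fourcc, fps, (width, height))

    frames = 0
    try:
        while True:
            ok, frame = cap.read()
            if not ok:  # end of stream or read error
                break
            writer.write(cv2.GaussianBlur(frame, (k, k), 0))
            frames += 1
    finally:
        cap.release()
        writer.release()
    return frames
```

Calling `blur_video("input.mp4", "blurred.mp4")` with hypothetical file names would process the whole clip. Even this small task needs capture, writer and loop boilerplate, which is what the node-based version hides.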

The same program remade using VisionFlow.

You can build very advanced computer vision programs with VisionFlow.

You can integrate and output data into other services and your custom apps.

You can build custom nodes with the included Code Editor.

It is fully web-based. It works on any device with a web browser.
It has full touch support.
It supports real-time collaboration, just like Google Docs.

Technology

Real-Time Video Streaming through WebRTC

VisionFlow uses the WebRTC protocol, commonly used in video conferencing software, to enable real-time, low-latency video streaming.
Video encoding libraries such as libx264 (H.264) and libvpx (VP8/VP9) are used to compress and encode video streams in real time.
The streaming process is protected by end-to-end encryption, implemented through the DTLS-SRTP specification, which enhances security and privacy.

VisionFlow's WebRTC Signaling and System Diagram.
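As a rough illustration of the signaling step, the sketch below tracks the offer/answer exchange on the side that sends the offer. The state names follow the standard `RTCPeerConnection` signaling states; the class `SignalingSession`, its message dictionaries and the candidate queue are hypothetical and are not taken from VisionFlow's implementation. The SDP strings are placeholders.

```python
class SignalingSession:
    """Tracks offer/answer state and buffers early ICE candidates for one peer."""

    def __init__(self) -> None:
        self.state = "stable"
        self.has_remote_description = False
        self.queued = []   # candidates received before the answer
        self.applied = []  # candidates handed to the connection

    def create_offer(self) -> dict:
        """Start negotiation; only valid when no negotiation is in flight."""
        if self.state != "stable":
            raise RuntimeError(f"cannot create an offer in state {self.state!r}")
        self.state = "have-local-offer"
        return {"type": "offer", "sdp": "<placeholder SDP>"}

    def handle(self, message: dict) -> None:
        """Process one message received over the signaling channel."""
        kind = message.get("type")
        if kind == "answer":
            if self.state != "have-local-offer":
                raise RuntimeError("answer received without a pending offer")
            self.state = "stable"
            self.has_remote_description = True
            # Candidates that arrived early can be applied now.
            self.applied.extend(self.queued)
            self.queued.clear()
        elif kind == "candidate":
            # A candidate cannot be applied before the remote description is set.
            target = self.applied if self.has_remote_description else self.queued
            target.append(message["candidate"])
        else:
            raise ValueError(f"unexpected signaling message: {kind!r}")
```

Queuing early candidates is a common pattern because candidates and the answer travel independently over the signaling channel and may arrive in either order.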

Real-Time Edits and Collaboration through WebSocket

VisionFlow uses the WebSocket protocol over TLS for encrypted, real-time, bidirectional data communication between the client and server.
Through WebSocket, project edits are published to the computation nodes as they are made, and their effects are reflected on the client immediately.
Users can also collaborate and work together in real-time, making simultaneous edits and updates to the project, similar to Google Docs.

VisionFlow's Project State Sync Flow. The project state is synced across clients connected through WebSocket, and can be fetched from a fast Redis store on client request.
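The sync flow can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not VisionFlow's code: a plain dictionary stands in for the Redis store, a callback per client stands in for an open WebSocket connection, and the names `ProjectHub`, `connect`, `publish` and the message fields are hypothetical.

```python
import json
from typing import Callable, Dict, List


class ProjectHub:
    """Keeps the project state and fans each edit out to the other clients."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}  # stand-in for the Redis store
        self.clients: Dict[str, Callable[[str], None]] = {}  # stand-in for sockets

    def connect(self, client_id: str, send: Callable[[str], None]) -> dict:
        """Register a client and return the current project snapshot."""
        self.clients[client_id] = send
        return {key: json.loads(value) for key, value in self.store.items()}

    def publish(self, sender_id: str, node_id: str, props: dict) -> None:
        """Persist one node edit, then broadcast it to every other client."""
        self.store[node_id] = json.dumps(props)
        message = json.dumps({"node": node_id, "props": props, "from": sender_id})
        for client_id, send in self.clients.items():
            if client_id != sender_id:
                send(message)


def demo() -> tuple:
    """Two clients edit together; a third joins late and fetches the snapshot."""
    hub = ProjectHub()
    alice_inbox: List[str] = []
    bob_inbox: List[str] = []
    hub.connect("alice", alice_inbox.append)
    hub.connect("bob", bob_inbox.append)
    hub.publish("alice", "blur-1", {"kernel": 15})
    late_snapshot = hub.connect("carol", [].append)
    return alice_inbox, bob_inbox, late_snapshot
```

In `demo`, the sender does not receive its own edit, the other connected client receives it as a message, and the late joiner recovers the same state from the store. The sketch applies edits in arrival order and does not address conflicting simultaneous edits.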

Extensible API with Pythonic Syntax

VisionFlow provides an Application Programming Interface (API) that allows users to develop custom nodes using pure Python syntax without calling custom library functions.
This is made possible through Python object introspection: information such as input and output field names and types is read at runtime, much like reflection in Java.
This approach flattens the learning curve, enabling users to develop custom nodes in a manner similar to coding a computer vision program conventionally.

VisionFlow's Class Introspection and Shadowing.
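To make the idea concrete, the sketch below reads a node's input and output fields from ordinary type annotations using only the standard library. It is an assumption about how such introspection could work, not VisionFlow's actual API: the `Threshold` node, the `run` method convention and the `describe_node` helper are all hypothetical.

```python
import inspect
from typing import get_type_hints


class Threshold:
    """Hypothetical custom node written as plain Python, with no framework imports."""

    def run(self, value: float, limit: float = 0.5) -> bool:
        return value > limit


def describe_node(node_cls: type) -> dict:
    """Read a node's input and output fields from its `run` signature at runtime."""
    hints = get_type_hints(node_cls.run)
    signature = inspect.signature(node_cls.run)
    inputs = {}
    for name, param in signature.parameters.items():
        if name == "self":
            continue
        has_default = param.default is not inspect.Parameter.empty
        inputs[name] = {
            "type": hints.get(name, object).__name__,
            "default": param.default if has_default else None,
        }
    return {
        "name": node_cls.__name__,
        "inputs": inputs,
        "output": hints.get("return", object).__name__,
    }
```

`describe_node(Threshold)` yields a description with two `float` inputs (`value`, and `limit` with default `0.5`) and a `bool` output, which an editor could use to draw the node's sockets. The sketch handles plain classes such as `float` or `bool` as annotations; generic annotations would need extra handling.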

Distributed Computing

VisionFlow can be configured to distribute load across multiple backend servers (compute nodes) to harness their collective processing power, enabling parallel processing of multiple video streams. The results are then aggregated on a central master server.
Communication between the backend compute nodes and the master server uses Remote Procedure Calls (RPC), chosen for low latency and efficient data transfer.

VisionFlow's Backend Architecture. Multiple compute nodes can be set up, each taking in a separate camera stream. Data is then aggregated on the master server.
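The dispatch-and-aggregate pattern can be sketched as follows. This is a local simulation under stated assumptions: a thread pool stands in for the RPC calls to remote compute nodes, the even-number check is a placeholder for real inference, and the names `compute_node` and `master` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def compute_node(stream_id: str, frames: List[int]) -> Dict[str, object]:
    """Stand-in for one compute node: process a stream and return a summary."""
    # Placeholder for inference: count even-numbered frames as detections.
    detections = sum(1 for frame in frames if frame % 2 == 0)
    return {"stream": stream_id, "frames": len(frames), "detections": detections}


def master(streams: Dict[str, List[int]], workers: int = 4) -> Dict[str, object]:
    """Stand-in for the master server: dispatch streams in parallel, then aggregate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(compute_node, stream_id, frames)
            for stream_id, frames in streams.items()
        ]
        results = [future.result() for future in futures]
    return {
        "per_stream": {result["stream"]: result for result in results},
        "total_frames": sum(result["frames"] for result in results),
        "total_detections": sum(result["detections"] for result in results),
    }
```

For example, `master({"cam-a": [0, 1, 2, 3], "cam-b": [4, 5]})` processes both streams concurrently and reports 6 frames and 3 detections in total. In a real deployment each `pool.submit` would be replaced by an RPC to a separate machine.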

Multi-Platform Support

VisionFlow supports streaming to Chromium-based browsers such as Google Chrome and Microsoft Edge through the WebM video format, and to WebKit-based browsers such as Safari through the H.264 video format.
VisionFlow allows touch-based interaction, enabling users to create and run CV programs on mobile devices such as iPad and iPhone. It supports multi-touch gestures such as zooming and panning.

VisionFlow running on WebKit on iPhone and iPad. The same project is open on both devices and the video feeds play simultaneously.
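The browser-to-format mapping described in this section can be illustrated with a small helper. This is only a sketch of the mapping: in practice WebRTC negotiates codecs through the SDP offer/answer rather than by inspecting the user agent, and the function name `pick_video_format` and the fallback choice are assumptions.

```python
def pick_video_format(user_agent: str) -> str:
    """Return 'webm' for Chromium-based browsers and 'h264' for WebKit-based ones."""
    ua = user_agent.lower()
    # Browsers on iPhone and iPad are built on WebKit, including Chrome ("CriOS").
    if "iphone" in ua or "ipad" in ua:
        return "h264"
    # Check Chromium markers before Safari, because Chromium user-agent
    # strings also contain "Safari/".
    if "chrome/" in ua or "chromium/" in ua or "edg/" in ua:
        return "webm"
    if "safari/" in ua:
        return "h264"
    return "webm"  # assumed default for unrecognised browsers
```

The order of the checks matters: testing for `Safari/` first would misclassify desktop Chrome and Edge as WebKit browsers.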

Next Steps

Enhancing the visual scripting capabilities: Creating additional nodes and advanced functionality so that users can build complex computer vision workflows more easily.
Performance optimization: Optimizing the performance of VisionFlow to ensure faster execution of visual scripts, efficient resource utilization, and the ability to handle large-scale computer vision tasks.
Documentation and tutorials: Developing comprehensive documentation, tutorials, and educational resources to support users in understanding and maximizing the potential of VisionFlow, enabling them to quickly get started and excel in their computer vision tasks.
Publishing to the public: Making VisionFlow accessible to the public by offering a public release/open-source version. Public access can lead to valuable feedback, bug reports, and feature suggestions, ultimately improving the quality and usability of VisionFlow for all users. Monetization strategies can be explored to sustain the project.

Screenshots