Vision and AI Middleware

Embedded Vision Middleware for Safety and Security

Arcturus Vision Middleware transforms cameras from passive observers into active vision detection systems by enabling analytics for safety, security and surveillance.

The middleware provides enablement for systems that need to detect, classify, record, report and act on events that occur in live video. The core of the middleware consists of a pipeline of optimized vision processing algorithms combined with Convolutional Neural Network (CNN) object detection and classification using a multi-box, multi-class approach. The output of the vision processing is stored in a database and provided to specialized, higher-level analytics applications developed specifically for safety, surveillance and security. These applications include motion detection, intrusion, boundary crossing, zone incursion, loitering or motionless behavior, abandoned-package detection, face verification, object disappearance, and crowd and crowd-flow analytics.
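The pipeline shape described above can be sketched as two stages feeding database-ready records. This is a minimal illustration only; the stage names, data layout and placeholder values are assumptions, not the middleware's actual API.

```python
# Minimal sketch of the detect -> classify -> record pipeline shape.
# Stage internals are placeholders; real stages run optimized vision
# algorithms and multi-box, multi-class CNN inference.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    object_class: str   # CNN multi-class label
    confidence: float   # classifier score
    bbox: tuple         # (x, y, w, h) from the multi-box detector

def motion_stage(frame) -> List[tuple]:
    # Placeholder for background subtraction / blob tracking:
    # returns candidate regions worth classifying.
    return [(120, 80, 60, 140)]

def cnn_stage(frame, regions) -> List[Detection]:
    # Placeholder for multi-box, multi-class CNN inference.
    return [Detection("person", 0.91, r) for r in regions]

def analyze(frame, timecode: float) -> List[dict]:
    # The returned records are what a system like this would write to
    # the analytics database for higher-level applications to query.
    regions = motion_stage(frame)
    return [
        {"timecode": timecode, "class": d.object_class,
         "confidence": d.confidence, "bbox": d.bbox}
        for d in cnn_stage(frame, regions)
    ]

records = analyze(frame=None, timecode=12.4)
print(records[0]["class"])  # person
```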

The vision system ties directly into Arcturus Voice and Media Middleware, including video source acquisition from native hardware or IP camera streams, live video using WebRTC, and stored video playback using RTSP with scrubbing controls. SIP-based video and voice call routing through VoIP communication systems can also be supported. The Mbarx Secure IoT endpoint stack provides the connectivity, notification and host control architecture.


Edge-Based Vision Processing Overview

Vision Middleware Diagram

Reduce Bandwidth

Existing systems rely on the continuous transmission of pixel data from each camera to a central location for processing or monitoring. As deployments of surveillance cameras and their resolutions increase, this places upward pressure on the network, resulting in data bottlenecks, extensibility problems and the requirement for new infrastructure to support capacity expansion. With an edge-based approach, analytics processing occurs closer to the camera source or in the same physical system. Continuous high bit-rate traffic is replaced by small-packet notifications that define important events and provide access to video on-demand. With an architecture that includes vision processing at the edge, it is possible to:

  • Significantly lower bandwidth, improving network management and resilience
  • Reduce dependency on large, high-availability backbone connections
  • Lower operating costs on pay-per-bit networks, such as LTE
  • Upgrade or expand systems without replacing existing infrastructure
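The bandwidth savings can be made concrete with a back-of-envelope comparison. The figures below (stream bit rate, notification size, event count) are illustrative assumptions, not measured values for any specific deployment.

```python
# Back-of-envelope comparison of continuous streaming vs. edge-based
# small-packet notifications. All figures are illustrative assumptions.
STREAM_MBPS = 4.0       # typical 1080p H.264 stream bit rate (assumed)
NOTIFY_BYTES = 1024     # size of one event notification (assumed)
EVENTS_PER_DAY = 200    # detections per camera per day (assumed)

SECONDS_PER_DAY = 86400

# Continuous streaming: Mbit/s -> MB/s -> MB/day -> GB/day
continuous_gb_per_day = STREAM_MBPS / 8 * SECONDS_PER_DAY / 1024

# Edge notifications: only small packets leave the device
edge_mb_per_day = NOTIFY_BYTES * EVENTS_PER_DAY / 1024**2

print(f"continuous streaming: {continuous_gb_per_day:.1f} GB/day per camera")
print(f"edge notifications:   {edge_mb_per_day:.2f} MB/day per camera")
```

Even with generous event counts, per-camera traffic drops from tens of gigabytes per day to a fraction of a megabyte, with full video still available on demand.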

Increase Effectiveness

Intelligence at the edge transforms video surveillance from passive observation to active detection by identifying specific events and acting on them. Analysts can be notified of events in real time as they occur, instead of waiting for an observer to report them. This approach allows analysts to respond quickly, using streamlined workflows to identify the event, how it transpired and who the actors are, and to formulate an effective response. Since processing is done close to the video source, detection events can trigger local actions such as lights, sirens, locks, etc. even if connectivity is lost. This means:

  • Improved focus, situational awareness and workflows for analysts
  • Reduced response times to critical events
  • Automated real-time local actions and integration with other equipment
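The local-action behavior described above can be sketched as a small event dispatcher. The event names and output names here are hypothetical; in a real deployment the activation step would drive a GPIO, relay or other local interface.

```python
# Sketch of a local action dispatcher: detection events fire local
# outputs immediately, independent of any network uplink.
# Event and output names are illustrative assumptions.
ACTIONS = {
    "intrusion":      ["siren", "floodlight"],
    "zone_incursion": ["floodlight"],
    "loitering":      [],  # notify-only event
}

triggered = []

def activate(output: str) -> None:
    # In a real system this would toggle a GPIO/relay or I/O line;
    # here we simply record the activation.
    triggered.append(output)

def on_event(event_type: str, network_up: bool) -> None:
    # Local actions run unconditionally; remote notification is
    # best-effort and deferred if the uplink is down.
    for output in ACTIONS.get(event_type, []):
        activate(output)
    if network_up:
        pass  # queue a small-packet notification to the monitoring center

# Local response still fires with connectivity lost:
on_event("intrusion", network_up=False)
print(triggered)  # ['siren', 'floodlight']
```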

Easily Search Archived Video

Vision processing is performed continuously and output is written to a database with camera source and video time-code references. This database forms an ongoing narrative of the live video as it is analyzed. A database approach makes it possible to perform forensic searches using standard queries, eliminating the need to review footage frame-by-frame. Database queries can include filters, ranges and classes:

  • Time and date range
  • Region of interest
  • Object classification type
  • Event type
  • Characterization
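A forensic search combining these filter types might look like the following. The schema and column names are assumptions for illustration, not the product's actual database layout.

```python
import sqlite3

# Illustrative forensic query over an analytics database, combining the
# filter types listed above. Schema and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (camera_id TEXT, ts REAL, region TEXT, "
             "object_class TEXT, event_type TEXT)")
conn.executemany("INSERT INTO events VALUES (?,?,?,?,?)", [
    ("cam-02", 1000.0, "loading-dock", "person",  "loitering"),
    ("cam-02", 1500.0, "loading-dock", "vehicle", "motion"),
    ("cam-02", 9000.0, "lobby",        "person",  "motion"),
])

rows = conn.execute(
    "SELECT ts FROM events"
    " WHERE ts BETWEEN ? AND ?"       # time and date range
    " AND region = ?"                 # region of interest
    " AND object_class = ?"           # object classification type
    " AND event_type = ?",            # event type
    (0, 2000, "loading-dock", "person", "loitering"),
).fetchall()
print(rows)  # [(1000.0,)]
```

A single indexed query like this replaces hours of frame-by-frame review: the database narrative is searched, and only the matching time-codes are pulled from video storage.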


    Vision Algorithms
  • Background subtraction (using spatiotemporal binary similarity)
  • Blob or object tracking
  • Euclidean distance
  • Feature extraction using co-occurrence matrices, Motion History Intensity (MHI), Bag of Words (BoW), tracklets
    Inference and Classification
  • Lightweight deep neural network designed for embedded vision applications
  • Streamlined neural network architecture using depth-wise separable convolutions
  • Multi-box detection with pre-trained classes for detecting people, vehicles, cyclists, motorcycles, luggage, backpacks, bags
    Secure Web Service
  • Real-time event notifications from detection applications
  • Live video mode
  • Playback video mode
  • Video player with scrubbing, play and pause controls
  • Event notification timeline
  • Timeline event clustering
  • Workflows to receive notifications and review incidents
  • Analytics database search workflow
    Video Acquisition
  • V4L – Video for Linux (local camera)
  • IP cameras
  • MJPEG, H.264
  • Resolutions up to FHD (1080p), 30 fps
    Video Pre-processing
  • Image privacy masking
  • Blur and solid masking types
    Video Post-processing
  • Image overlays, e.g. location, branding, date/time
  • Configurable presentation including bounding boxes
  • Object-based dynamic masking
    Video Output
  • Event video clips
  • WebRTC live video (VP8/VP9)
  • RTSP video playback
  • SIP/RTP video communications
  • Local or remote video storage
  • MJPEG, H.264, VP8, VP9 encode and decode
    Video Storage
  • Configurable encoding format
  • Video output to NVR
  • Video event clips
  • Periodic storage syncing
  • Loop recording with event protection
    Vision Analytics Output
  • Analytics database
  • Date and time ranges
  • Region of interest
  • Object classes
  • Event and incident types
  • Faces and other metadata tags
    Connectivity
  • I/O and/or UART for local connectivity
  • Secure system integration and cloud connectivity via network
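To make the first algorithm stage concrete, here is a deliberately simplified background-subtraction sketch. The middleware's own implementation uses spatiotemporal binary similarity; a running-average model with thresholding, shown below, only illustrates the same input/output shape (frames in, foreground mask out).

```python
import numpy as np

# Simplified background subtraction: a running-average background model
# plus per-pixel thresholding. This is an illustration of the concept,
# not the middleware's spatiotemporal binary similarity algorithm.
def update_background(bg: np.ndarray, frame: np.ndarray, alpha=0.05):
    # Slowly adapt the background model toward the current frame.
    return (1 - alpha) * bg + alpha * frame

def foreground_mask(bg: np.ndarray, frame: np.ndarray, thresh=25):
    # Pixels that differ strongly from the background are foreground.
    return (np.abs(frame.astype(int) - bg.astype(int)) > thresh).astype(np.uint8)

bg = np.full((4, 4), 100, dtype=np.uint8)   # learned background level
frame = bg.copy()
frame[1:3, 1:3] = 200                       # a moving object appears

mask = foreground_mask(bg, frame)
bg = update_background(bg, frame)           # keep the model current
print(int(mask.sum()))  # 4 foreground pixels (the 2x2 object)
```

Blob tracking and CNN classification then operate only on the foreground regions this stage surfaces, which is what keeps the full pipeline viable on embedded CPUs.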


Video encoding, algorithmic and CNN processing availability varies across CPUs based on overall performance and available hardware acceleration.

Processors and Architectures
  • Arm®v8 Cortex®-A53 (64-bit), Arm Cortex-A9 (32-bit), Power®e500
Operating System Support
  • Linux 3.x, 4.x
  • GCC
Refer to Voice and Media Middleware for additional information on video options and to Mbarx Secure IoT for connectivity and integration details.

    Complete Vision System Solution

    Arcturus offers simple engagement packages to help get development moving quickly. This can include specialized algorithm processing, custom analytics applications or tailored workflows. Systems can take advantage of a number of preexisting blocks including video storage, playback, live stream, voice and video communications. A responsive web interface provides streamlined workflows, real-time notifications, live and playback video modes, event searches, filters, queries and a high-level timeline of events.

    Event Detection and Notification

    Analytics applications report events via web-based incident notifications or distribute them via a secure protocol. Live video streams use WebRTC, eliminating the need for plugins with most modern browsers. Streamlined workflows help alert analysts to situations quickly and respond effectively.
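An incident notification of this kind might carry a payload like the following. The field names and URL are hypothetical, shown only to illustrate the small-packet, video-on-demand pattern.

```python
import json

# Hypothetical incident-notification payload: the web service pushes a
# small JSON event rather than video; field names are assumptions.
payload = json.dumps({
    "camera_id": "cam-01",
    "event_type": "intrusion",
    "timecode": 12.4,
    # Video is fetched on demand, not streamed continuously:
    "stream_url": "rtsp://recorder.local/cam-01?start=12.0",
})

event = json.loads(payload)
# A browser client would cluster events onto the notification timeline
# and offer the linked stream for live (WebRTC) or playback (RTSP) review.
print(event["camera_id"], event["event_type"])  # cam-01 intrusion
```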

    Vision System Home Screen Capture
    Streamlined Incident Workflows

    Incident notifications allow analysts to review the sequence of video that triggered an event from a timeline summary. Video playback supports integrated RTSP-based player controls with play/pause and scrubbing. Video detection overlays, location and timestamp information are generated dynamically and applied as an overlay during playback, maintaining the forensic integrity of the stored video. Video sequences can be exported from the system, packaged as clips or as time-code references to external storage.
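The playback-time overlay idea can be sketched as follows: detection boxes come from the analytics database and are drawn on a copy of each decoded frame, so the archived video is never modified. The grayscale-frame representation and function names are illustrative assumptions.

```python
import numpy as np

# Sketch of a dynamic playback overlay: draw a detection bounding box on
# a *copy* of the decoded frame, leaving the stored frame untouched
# (preserving forensic integrity). Frame layout is an assumption.
def draw_bbox(frame: np.ndarray, x: int, y: int, w: int, h: int, value=255):
    out = frame.copy()                 # never touch the archived frame
    out[y, x:x + w] = value            # top edge
    out[y + h - 1, x:x + w] = value    # bottom edge
    out[y:y + h, x] = value            # left edge
    out[y:y + h, x + w - 1] = value    # right edge
    return out

stored = np.zeros((10, 10), dtype=np.uint8)   # archived frame (unmodified)
shown = draw_bbox(stored, 2, 2, 5, 5)         # what the analyst sees

print(int(stored.sum()), int(shown.sum()))    # stored stays all-zero
```

Because the overlay exists only in the rendered copy, exported clips can include or omit annotations without any re-encode of the original footage.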

    Vision Intrusion Detection