AI on the Edge
It is widely known today that AI applications require substantial computing power and energy. The AlphaGo system, for example, used almost 2000 CPUs and 300 GPUs, resulting in a cost of roughly $3000 per game. Moreover, in the pursuit of better accuracy, capability and performance, deep learning models keep growing larger and more demanding. For example, the winner of the 2015 ImageNet image recognition challenge was 16 times larger than the winner from 2012, and the 2015 winner in speech recognition required ten times more training operations than the one from 2014. These facts encourage investment in developing efficient methods for reducing the tremendous memory demands of such models, improving the efficiency and lowering the computational cost of the inference process, and bringing machine learning (ML) even to the smallest and most power-efficient hardware devices.

Some market research reports project that the global edge AI software market will reach hundreds of millions of USD by the end of the first quarter of this century. This rapid growth is attributed mostly to the emergence of the 5G network. Many reports also predict that the video surveillance segment will have the largest market size, along with the autonomous vehicles, access management and predictive maintenance segments.

This thesis presents the concepts and a broad overview of edge AI and ML. Some useful and popular methods, as well as the available hardware and software infrastructure for enabling these technologies in the constrained environment of embedded systems, are also presented. The specific development of a human presence detection application is then described and tested.

In the ML industry, there are four basic categories of demand. The first does not need to be low cost and requires high-power computing; it covers the training of large ML systems needed for research and for exploring what can be done with ML.
The second category is the training of already-designed and deployed systems and models with new data, or the addition of new features, labels and objects that, for example, need to be recognized by an existing image recognition system. This consumes less power, can use lower-precision arithmetic, and is cheaper. The third category is running ML on powerful servers in data centres, used, for example, for newsfeeds on news and social media sites, for filtering search results, and so on. In this case, power consumption and latency are the main concerns, because the trend shows a fast-growing number of services using this technology and a growing number of users relying on them. The last category is embedded devices: cars, phones, smart cameras, etc. All these edge devices have less power available, a smaller memory footprint, and usually some form of hardware-accelerated arithmetic.
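The lower-precision arithmetic mentioned above is one of the main levers for shrinking a model's memory footprint on edge devices. As a minimal sketch (not taken from the thesis, and using hypothetical weight values), the following shows simple symmetric linear quantization of 32-bit floating-point weights to 8-bit integers, yielding a 4x reduction in storage at the cost of a small rounding error:

```python
import numpy as np

np.random.seed(0)

# Hypothetical weights of a small layer, stored as 32-bit floats.
weights_fp32 = np.random.uniform(-1.0, 1.0, size=1000).astype(np.float32)

# Symmetric linear quantization: map the largest magnitude to the int8 range.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)

# Dequantize to approximate the original values at inference time.
weights_restored = weights_int8.astype(np.float32) * scale

print(weights_fp32.nbytes)  # 4000 bytes
print(weights_int8.nbytes)  # 1000 bytes: a 4x reduction
print(np.max(np.abs(weights_fp32 - weights_restored)))  # small rounding error
```

Real deployment toolchains apply more elaborate schemes (per-channel scales, calibration data, quantization-aware training), but the underlying trade-off is the same: fewer bits per weight in exchange for a bounded loss of precision.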