Making ‘computers with eyes’: data labelling in vision systems

Having attended London Tech Week’s ‘AI Summit’, Sam Holland took an interest in the latest AI software on show – particularly its potential to perform object detection with less training from humans. Here, he discusses the manual form of AI training known as ‘data labelling’ and how modern vision systems benefit from such human input.

Exploring ‘data labelling’ and why many AI systems require it

While data labelling has many applications (such as those relevant to language processing and humanoid robotics), this article looks at the term in the context of object detection and object recognition (while later focusing on object classification). Accordingly, the piece considers the role of the two vision systems known as machine vision and computer vision.

The term ‘data labelling’ encompasses the process by which humans train AI software to ‘see’, and therefore recognise and/or analyse, relevant items that fall within the view of one or more computer-connected cameras.

To consider the value of data labelling in vision systems, three questions need to be addressed:

1. What does the labelling process in vision systems actually involve?
2. How are training datasets for vision systems achieved?
3. How do computer vision and machine vision put those datasets to use?

These three questions are explored in the sections that follow.

The labelling process in vision systems

The data labelling process, which is also sometimes called data annotation or data tagging, relies on the work of people (annotators), who may be anything from dedicated in-house labelling staff to contributors on a crowdsourced third-party platform. (See ‘crowdsourced data collection’ for examples of how crowdsourcing can also inform human-based research.) The annotators’ work is manual: they view images and/or videos and assign the correct noun, or label, to each of the many on-screen objects.

As just one example of the above, consider how valuable this form of training is to machine learning in autonomous cars: one or more humans view a video of a car driving through a street and, throughout the on-screen journey, label every stop sign, traffic light, pedestrian, and so on. Every correctly applied noun helps train the automotive software for the time when the car is eventually left to its own devices.
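As a rough illustration of what annotators actually produce, the sketch below shows what a single bounding-box label for one frame of such a driving video might look like. The field and file names are purely illustrative assumptions, not the format of any particular annotation tool.

```python
# A minimal sketch of a single human-made annotation record.
# Field names here are illustrative, not taken from any particular tool.
from dataclasses import dataclass

@dataclass
class BoxAnnotation:
    image_id: str       # which image or video frame the label belongs to
    label: str          # the noun assigned by the annotator, e.g. "stop sign"
    x: int              # left edge of the bounding box, in pixels
    y: int              # top edge of the bounding box, in pixels
    width: int          # box width, in pixels
    height: int         # box height, in pixels
    annotator_id: str   # who drew the box, useful for quality review

# One frame of a driving video might accumulate several such records:
frame_labels = [
    BoxAnnotation("frame_0042.jpg", "stop sign", 512, 120, 64, 64, "annotator_7"),
    BoxAnnotation("frame_0042.jpg", "pedestrian", 300, 210, 40, 110, "annotator_7"),
]
```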

Such data labelling can often, if not always, be carried out intuitively, thanks to the instinctive way human brains – as opposed to computer processors – work. More specifically, human brains benefit from intuition and common sense, whereas processors benefit from datasets (the value of which is covered next). Consider, for instance, that you could easily tell apart an orange dog and a fox, even though both are canines of the same colour. If a vision system required that distinction, a human would need to apply their common sense to label the orange dogs and the foxes so that the software could build a ‘training dataset’ from the results.

Achieving training datasets for vision systems

A training dataset is simply one or more pieces of data that are fed into an AI system’s algorithms so that the given vision system can form analyses and/or predictions based on its trained ability to ‘see’ the object(s) of interest. How computer vision systems and machine vision systems use these training datasets does differ, however: computer vision lends itself to more analytical forms of image processing, whereas machine vision is more useful in time-sensitive, industrial contexts (such as robotic picking and packing on the factory floor).
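To make the idea of a training dataset concrete, the following sketch shows one simple way labelled records might be gathered and split into training and validation sets. The record format, split ratio, and function name are illustrative assumptions rather than a prescribed pipeline.

```python
# A minimal sketch of gathering labelled records into a training dataset.
# The split ratio and record format are illustrative assumptions.
import random

def build_training_dataset(annotations, train_fraction=0.8, seed=42):
    """Pair each labelled image with its class name and split off a hold-out set.

    `annotations` is assumed to be a list of (image_path, label) tuples
    produced by human annotators, e.g. ("frame_0042.jpg", "stop sign").
    """
    records = list(annotations)
    random.Random(seed).shuffle(records)        # shuffle reproducibly
    cut = int(len(records) * train_fraction)    # e.g. 80% kept for training
    train_set, validation_set = records[:cut], records[cut:]
    return train_set, validation_set

train_set, validation_set = build_training_dataset(
    [("frame_0042.jpg", "stop sign"), ("frame_0043.jpg", "pedestrian")]
)
```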

Regardless of the application, using data labelling to achieve such training datasets is, of course, labour-intensive and time-consuming. Many software companies therefore stress the importance of using dedicated annotation professionals. The AI organisation Keymakr, for instance, explains that even ‘automatic data labelling’ (wherein machines train other machines) is limited and still requires human intervention. While such labelling may expedite the annotation of “easily identifiable labels”, says Keymakr, it still yields “a significant amount of errors” that annotators must review and verify.
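The sketch below illustrates the kind of triage such a workflow might apply: machine-proposed labels above a confidence threshold are provisionally accepted, while the rest are queued for human review. The threshold value and record format are assumptions for illustration and are not drawn from Keymakr.

```python
# A minimal sketch of triaging machine-generated labels before trusting them.
# The 0.9 threshold is an illustrative assumption, not a recommended figure.
def triage_auto_labels(proposals, confidence_threshold=0.9):
    """Split machine-proposed labels into 'accept as-is' and 'send to a human'.

    `proposals` is assumed to be a list of dicts like
    {"image": "frame_0042.jpg", "label": "stop sign", "confidence": 0.97}.
    """
    auto_accepted, needs_human_review = [], []
    for proposal in proposals:
        if proposal["confidence"] >= confidence_threshold:
            auto_accepted.append(proposal)        # the easily identifiable labels
        else:
            needs_human_review.append(proposal)   # annotators verify the rest
    return auto_accepted, needs_human_review
```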

Computer vision and machine vision

Computer vision (CV) involves software that analyses a multitude of visual stimuli, such as those that may be digitally scrutinised on a screen or page. Given the data processing demands involved, CV systems must be programmed to carry out not only object detection but also object recognition and object classification.

Object detection is the process by which a vision system identifies that an object is present, usually a basic one such as a large piece of industrial equipment; object recognition is the ability to pick out and name multiple specific objects within a varied image; and, perhaps most significantly of all, object classification is the process by which a CV system not only recognises objects but also assigns a ‘class’ to each of the various objects within an image or video.
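To make the three terms concrete, the snippet below sketches how their outputs might differ in shape: from a located region, to named objects, to objects with assigned classes and scores. The field names are assumptions for the sake of the example rather than any standard API.

```python
# Illustrative data shapes for the three capabilities; field names are assumed.

# Object detection: "something is here" - a located region of interest.
detection = {"image": "floor_cam_01.jpg", "box": (220, 80, 400, 310)}

# Object recognition: the located regions are matched to known objects.
recognition = {"image": "floor_cam_01.jpg",
               "objects": [{"box": (220, 80, 400, 310), "name": "robotic arm"}]}

# Object classification: each recognised object is assigned a class with a score.
classification = {"image": "floor_cam_01.jpg",
                  "objects": [{"box": (220, 80, 400, 310),
                               "class": "industrial equipment",
                               "score": 0.93}]}
```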

Consider that CV systems that scan mammograms and radiographs have (for decades, in fact) proven suitable for assisting in cancer diagnoses. This is a product of computer vision’s capacity to classify images of tumours: object classification may be used to ascertain the extent to which a tumour is likely to be benign or malignant, because tumours tend to have different textures depending on which of the two applies.

CV systems’ object recognition and classification capabilities suit applications that demand great time, care, and attention. Machine vision (MV), on the other hand, is mainly useful ‘on the factory floor’. It gives machines a specific industrial capability: their cameras ‘see’ the objects of interest and, provided the MV system has been trained correctly on the datasets produced by data annotators, their processors respond to those objects. An example of object detection in industry is a robotic arm distinguishing, and then collecting, a required component from a manufacturing conveyor belt.
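A picking cell of that kind ultimately reduces to a simple decision per camera frame. The sketch below shows one plausible version of that logic; the class name, score threshold, and detection format are hypothetical.

```python
# A minimal sketch of the per-frame decision an MV-driven picking cell might run.
# The class name, threshold, and detection format are hypothetical.
def select_pick_target(detections, wanted_class="component_a", min_score=0.8):
    """Return the highest-confidence detection of the required component, if any.

    `detections` is assumed to be a list of dicts such as
    {"class": "component_a", "score": 0.92, "centre": (412, 188)}.
    """
    candidates = [d for d in detections
                  if d["class"] == wanted_class and d["score"] >= min_score]
    if not candidates:
        return None                          # nothing to pick on this frame
    return max(candidates, key=lambda d: d["score"])

target = select_pick_target(
    [{"class": "component_a", "score": 0.92, "centre": (412, 188)}]
)
if target is not None:
    print("send robot arm to", target["centre"])
```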

The importance of context in vision systems

On top of advanced vision systems’ potential to identify, recognise – and even classify – objects, such technology is further informed by the nature of human reasoning. Scene context encompasses the use of semantic reasoning to improve the likelihood of achieving object recognition. After all, even humans benefit from seeing objects in context: as a Frontiers paper from the University of Trento and Vrije Universiteit Amsterdam explains, humans are more likely to recognise a sandcastle on a beach than a sandcastle on a football field.

In a Nature paper, ‘Machine vision benefits from human contextual expectations’, the researchers explain that this similarity between artificial and human reasoning (as well as their differences) can be put to use in vision systems. In what may be seen as an enhanced form of data labelling, the Nature researchers asked participants to indicate their contextual expectations of target objects (such as cars and people) in relation to their associated contexts (such as roads and other urban environments).

The researchers concluded that this contextual input from humans, combined with a neural network, can enhance the accuracy of object recognition in vision systems. “We demonstrate,” write the researchers, “that … predicted human expectations can be used to improve the performance of state-of-the-art object detectors.”
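The snippet below sketches one simple way such human expectations could, in principle, be folded into a detector’s output: each raw detection score is blended with a human-derived expectation of seeing that object class in the given scene. The prior values, blending weight, and the simple weighted average are illustrative assumptions and do not reproduce the Nature paper’s method.

```python
# A minimal sketch of re-weighting detector scores with human context priors.
# The prior values and the weighted blend are illustrative assumptions,
# not the method used in the Nature paper.
context_prior = {
    ("road", "car"): 0.9,          # cars are strongly expected on roads
    ("road", "pedestrian"): 0.6,
    ("beach", "car"): 0.1,         # cars are rarely expected on a beach
}

def rescore_with_context(detections, scene, prior=context_prior, weight=0.5):
    """Blend each raw detector score with a human-derived expectation for the scene."""
    rescored = []
    for d in detections:
        expectation = prior.get((scene, d["class"]), 0.5)   # neutral if unknown
        blended = (1 - weight) * d["score"] + weight * expectation
        rescored.append({**d, "score": blended})
    return rescored

print(rescore_with_context([{"class": "car", "score": 0.55}], scene="road"))
```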

The combination of human and artificial intelligence

The applications of vision systems range from the basic industrial level (such as conveyor belt monitoring on the factory floor) through to context-specific cancer diagnoses. It is also clear that the data labelling and scene-context input behind these systems’ training datasets remain largely human-led operations.

Nevertheless, it cannot yet be ascertained whether artificial intelligence will ever master what humans would call ‘common sense’. As automation continues its rise, it remains possible that data labelling will itself eventually become a job for machines rather than humans. That would mark yet another breakthrough in the phenomenon known as machine learning.

More information on the value of artificial intelligence in medicine can be found at our Healthcare page.

Plus, IoT Insider’s sister publication Electronic Specifier has more at its AI section. Also, feel free to comment below or delve deeper at our LinkedIn.