Once we’ve trained it to do so.
We humans use our eyes and brains to see, visually sensing the world around us. The goal of computer vision is to give this seeing capability to a machine. Computer vision automatically extracts and analyses useful information from an image or a sequence of images. A computer can also see things that we can’t, with multiple channels of perception, meaning that in some respects machines have greater visual capabilities than we do. Because of this superhuman seeing power, there is loads of potential for computer vision to have a profound impact on our lives. But first, we’ve got to train the machines. We can only achieve this with sufficient data and an understanding of how computers actually see the world.
Right now, computer vision often has fun and playful applications. If you’ve ever seen a video demo of augmented reality glasses, the focus is on wowing you with amazing visuals, like projecting sharks swimming through your living room or creatures bursting through the wall. But in actuality, computer vision tech has not yet reached the sophisticated level that it could attain in the future. In short, there is loads more potential for computer vision to advance.
Let’s take the not-hotdog app from HBO’s Silicon Valley, the cute inside joke of the tech world, as an example. For those of you who haven’t seen the show (you really should), it revolves around a tech startup company. They created an app that they hoped would become the ‘Shazam of food’, but when they tested it, the app could only identify hotdogs; everything else was categorised as ‘not-hotdog’. The creators of the show then developed the app in real-life for laughs.
You might argue that the app fills a great need of humanity, who were previously unable to tell hot dogs from not-hotdogs for themselves. But the least tech-savvy characters of the show were less than impressed with the app’s anticlimactic demonstration. The demonstration scene shows us that building a useful classifier is anything but simple. It also shows how people’s expectations of what such classifiers can do sometimes clash with the reality of what they can actually achieve. Dinesh seems pleased with the outcome of the app, as he points out that it works well as a classifier and only needs to be trained. However, as Monica dryly notes, Jian-Yang would have to repeat what he did to train the app to identify hot dogs for every single food in existence. That would involve feeding the classifier thousands of images of each food, a super boring process, as Jian-Yang would have to manually scrape the internet for those pictures. So, as a core piece of tech, the model is good; what it lacks is sufficient training data.
It’s not just in the show that such complexity exists. The real-life creators of the app stress in their blog post that while the basic app was built on one laptop with a single GPU over a weekend, the time and effort that went into making the final product user-friendly was considerable. They spent weeks optimising the overall accuracy of the app, as well as a whole weekend optimizing the user experience around iOS & Android permissions.
So, computer vision, in this case training a model to identify and classify objects in images, is not so simple to implement. It requires thousands of images as training data, as well as lots of time, effort and patience from its developers. The not-hotdog app shows us that while computer vision technology has heaps of potential in the future, sufficient training data is crucial to making it work.
Chihuahua vs Muffin
Even when a classifier has been trained with extensive data, there is still the potential for it to make mistakes, just as a child might when they haven’t yet seen enough examples of an object to properly learn what it is. Consider the phenomenon that gripped social media: the chihuahua vs muffin meme.
The ability to distinguish chihuahuas from muffins, puppies from bagels, and parrots from guacamole, is the hallmark of any image classifier, of course. When put to the test, image recognition software Clarifai achieved an impressive accuracy rate, correctly distinguishing chihuahuas from muffins 95.8% of the time.
However, if you look at these other images, you can see that the model went way off the mark when it tried to classify them. Here, you can see that not only did the model fail to detect the object in the image, it has also classified the water as a car.
Likewise, in this image, Microsoft’s CaptionBot seems to be seeing a dog instead of a terrifying bug creature. Even then, the classifier isn’t certain, stating that it ‘can’t be confident’.
Where did these classifiers go wrong? Why is one classifier able to confidently distinguish chihuahuas from muffins, while others are unable to accurately classify a duck or a bug? The answer is: we don’t quite know. On one level, we could say that it comes down to data. In general, the more training data a classifier is given, the more accurate it will be at identifying an object; a lack of sufficient training data will hamper a classifier’s ability to detect objects. So, if we fed these classifiers more pictures of ducks and bugs, they should get better at spotting them in images.
But ultimately, we should acknowledge an obvious but significant truth: that computer vision and human vision are nothing alike. Sometimes, we simply don’t know why a machine sees something totally different in the image. As computers are increasingly teaching themselves to see, we aren’t sure precisely how computer vision differs from our own.
How do Computers See the World?
Researchers from Cornell and Wyoming Universities have tried to gain a deeper understanding of the difference between human and computer vision by attempting to trick these image classifying algorithms into seeing things that aren’t there. The group used a neural network called AlexNet that has achieved considerably accurate results in image recognition. Operating AlexNet in reverse, they asked a version of the software to create an image of a guitar by generating random pixels across an image. This version had no previous knowledge of guitars. They then asked a second version of the software, which had already been trained to recognise guitars, to rate the image made by the first network. This confidence rating given by the second network was then used to refine the first network’s next attempt at creating a guitar image. After thousands of rounds of this between the two networks, the first network made an image that the second network recognized as a guitar with 99% confidence.
These ‘guitar’ images looked nothing like a guitar to the human eye and in fact resembled coloured static. The researchers found that the software has no interest in piecing together the structural details that a human might need to recognise a guitar. Instead, they believe that the software is looking at specific distance or colour relationships between pixels, and possibly also overall colour and texture. The findings suggest that neural networks develop a variety of visual cues that help them to identify objects, and while some of these cues might seem familiar to humans, others might not. This research offers us new insight into the way that computers really see the world. What we can take away from it, however, is that more needs to be done to understand how computers actually see the world before computer vision technology can reach its full potential.
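To make the feedback loop described above a little more concrete, here is a minimal sketch in Python. It is not the researchers’ code and it does not use AlexNet: `toy_confidence` is a hypothetical stand-in for a trained classifier (a fixed linear score squashed into a 0 to 1 range), and `evolve_image` is a simple hill-climbing loop that we assume is a fair caricature of the idea. It starts from random pixels, changes one pixel at a time, and keeps a change only when the classifier’s confidence does not drop.

```python
import math
import random


def toy_confidence(image, weights):
    """Hypothetical stand-in for a trained classifier.

    Returns a 'guitar' confidence between 0 and 1 from a weighted pixel sum.
    """
    score = sum(pixel * weight for pixel, weight in zip(image, weights))
    return 1.0 / (1.0 + math.exp(-score))


def evolve_image(weights, rounds=5000, seed=0):
    """Hill-climb from random pixels towards an image the classifier likes."""
    rng = random.Random(seed)
    image = [rng.random() for _ in weights]  # start from random noise
    best = toy_confidence(image, weights)
    for _ in range(rounds):
        candidate = list(image)
        # Re-draw a single pixel at random.
        candidate[rng.randrange(len(candidate))] = rng.random()
        confidence = toy_confidence(candidate, weights)
        # Keep the change only if the classifier is at least as confident.
        if confidence >= best:
            image, best = candidate, confidence
    return image, best


# Made-up classifier weights for an 8x8 'image' (64 pixels).
weight_rng = random.Random(42)
weights = [weight_rng.uniform(-1, 1) for _ in range(64)]

image, confidence = evolve_image(weights)
print(f"classifier confidence: {confidence:.4f}")
```

Even in this toy version, the loop ends up with an image the classifier rates very highly, yet the pixels are just whatever values happened to push the score up. Nothing in the loop rewards the image for looking like anything to a person.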
Making Security Visual
With sufficient training data, and a better overall understanding of exactly how computers see the world, the future applications of computer vision could be immense. So what might computer vision be like in the future?
If we trained computers to recognise the fingerprints and iris scans of the whole population, this could transform the way we think about our security and privacy in the future. Computer vision could facilitate a shift towards using iris and fingerprint scans to manage access to restricted areas and buildings, as well as retrieve our medical or criminal records. While this would require a small sacrifice of anonymity, we would essentially become the password to our own data; access to our records would not be based on the information that we remember or the key that we carry. This would make it easier to access our records, minimise scenarios where patients are misidentified, and also simplify the process of restricting access to high-security areas. Such technology could be a great stride towards making us safer; it could be used to prevent high-risk dementia patients from leaving their nursing home or to reduce the number of prisoners escaping from jail.
Computer vision has begun to play a role in keeping us safe online, and that role will become increasingly vital in the future. Facebook describes the emerging importance of computer vision in its recently announced policy on preventing the spread of terrorism via social media sites. Working in partnership with other big players on the social media scene like YouTube and Twitter, a key element of Facebook’s anti-terrorism strategy is to use image matching technology. Whenever a person attempts to upload terrorist content, AI systems analyse whether the image matches a known terrorist photo or video. So, if a terrorist image has been removed previously, other accounts will be prevented from uploading the same image to Facebook. Thus, in many cases, terrorist content intended for upload simply never reaches the platform. In the future, computer vision will play an increasingly important role in our security, both in public and online.
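As a rough illustration of what image matching against previously removed content could look like, here is a small Python sketch. It is not Facebook’s system, whose details are not described here. We assume a very simple ‘average hash’: each pixel becomes a 1 if it is brighter than the image’s mean and a 0 otherwise, and two images count as a match when their hashes differ in only a few positions. The function names, the tiny 4x4 grayscale images and the `max_distance` threshold are all hypothetical.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(hash_a, hash_b):
    """Number of positions at which two hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def is_blocked(pixels, blocked_hashes, max_distance=2):
    """True if the image is close to any previously removed image."""
    candidate = average_hash(pixels)
    return any(hamming(candidate, known) <= max_distance
               for known in blocked_hashes)


# A made-up 4x4 grayscale image that was removed earlier.
removed = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]
blocked_hashes = {average_hash(removed)}

# The same picture, slightly brightened before being uploaded again.
reupload = [[min(255, value + 15) for value in row] for row in removed]

# A different picture altogether.
unrelated = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]

print(is_blocked(reupload, blocked_hashes))
print(is_blocked(unrelated, blocked_hashes))
```

Comparing compact hashes rather than raw pixels means that small changes, such as a brightness tweak, do not stop a re-upload from being recognised, while a genuinely different image produces a hash that is far away.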
Computer vision could have a significant impact on the way our cities are planned and built, making them more efficient and safe. Construction workers could benefit from augmented reality blueprints and plans to use throughout the building process. This would make the building experience more visual, allowing construction workers to work with greater accuracy without having to consult multiple plans after each step. The building process would become faster and more efficient, lessening the potential for errors that could cause problems further down the line. This, in turn, would improve the quality of buildings, which would be safer and sturdier as a result.
The quality assessment process of surfaces, such as roads and pavements, could also be enhanced with computer vision. Through a combination of human expertise and the superhuman seeing power of a machine, the quality inspection of these surfaces would become more accurate, as human error would be minimised. Computer vision would be able to detect complex defects in surfaces faster and more accurately and so also improve the quality of our roads.
In addition, computer vision can generate data which can be used to revolutionise the way in which we manage traffic in major metropolitan areas. Traditionally, dangerous spots could only be identified after accidents had already occurred. However, as computer vision also identifies and classifies near-miss situations, it enables a preventative rather than reactive approach to avoiding collisions. The algorithm will be able to pinpoint the most dangerous intersections, as well as produce information that will be helpful for preventing crashes, such as the days and hours which carry the most risk. Authorities can then use this data to prioritise the most hazardous locations. This knowledge will also influence decision-makers in their planning of roadways and other infrastructure, allowing them to avoid building particularly hazardous intersections. In this way, the way our cities are planned and built could be shaped by computer vision. In achieving this future, imagine how far we’ll have come from the not-hotdog app!
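Once near-misses have been detected, turning them into a list of risky places and times is mostly a matter of counting. The Python sketch below assumes the vision system has already produced one record per near-miss, made up of an intersection name and a timestamp. The record format, the `rank_risk` function and the sample events are all invented for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def rank_risk(events):
    """Rank intersections and (weekday, hour) slots by near-miss count."""
    by_place = Counter(place for place, _ in events)
    by_slot = Counter((WEEKDAYS[when.weekday()], when.hour)
                      for _, when in events)
    return by_place.most_common(), by_slot.most_common()


# Made-up near-miss records: (intersection, time of the event).
events = [
    ("Main St / 1st Ave", datetime(2024, 1, 8, 8, 15)),
    ("Main St / 1st Ave", datetime(2024, 1, 8, 8, 40)),
    ("Main St / 1st Ave", datetime(2024, 1, 9, 17, 5)),
    ("Oak Rd / 2nd Ave", datetime(2024, 1, 10, 17, 30)),
]

places, slots = rank_risk(events)
print("riskiest intersection:", places[0])
print("riskiest time slot:", slots[0])
```

The same counts could be broken down further, for example by type of road user, but even this simple ranking shows how near-miss data can point to a problem before any collision has happened.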
Computer Vision at Scyfer
One of our goals as a deep learning company is to use state-of-the-art computer vision knowledge in our projects to come up with solutions that are both innovative and in line with our clients’ needs. With this in mind, we designed a traffic management solution using computer vision for the SETA consortium. SETA challenged us to model mobility and build a solution that would better count traffic modalities and classify specific traffic situations. Scyfer analyses CCTV camera image data and fuses information from multiple sources together. In the camera stream, we use machine learning to track cars, buses, motorcycles, bicycles and pedestrians. As a result of this, we generate high-quality data on the mobility pattern of the viewed area. This, in turn, provides valuable data for traffic modelling. The resulting models will be used to inform decision-makers on how to improve town planning and infrastructure, as well as enable individuals to plan their journeys more efficiently.
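To give a flavour of how tracked objects become counts per traffic modality, here is a minimal Python sketch. It is not Scyfer’s pipeline. We assume a tracker has already labelled each detection with a frame number, a track ID that stays the same while an object is followed across frames, and a class label. That record format and the `count_modalities` function are hypothetical.

```python
from collections import defaultdict


def count_modalities(detections):
    """Count distinct tracked objects per class label.

    Each detection is (frame_number, track_id, label). An object seen in
    many frames keeps one track ID, so it is only counted once.
    """
    tracks_by_label = defaultdict(set)
    for _frame, track_id, label in detections:
        tracks_by_label[label].add(track_id)
    return {label: len(ids) for label, ids in tracks_by_label.items()}


# Made-up tracker output for four frames of video.
detections = [
    (0, 1, "car"),
    (1, 1, "car"),         # same car, next frame
    (1, 2, "bicycle"),
    (2, 2, "bicycle"),     # same bicycle, next frame
    (2, 3, "car"),
    (3, 4, "pedestrian"),
]

counts = count_modalities(detections)
print(counts)
```

Counting track IDs rather than raw detections is the key design choice here, since a single car that stays in view for a hundred frames should still count as one car.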
Towards Complementing Human Expertise
As it advances, computer vision has the power to take away some of the hassles of our lives so that we can focus on the things that matter to us. In the workplace, computer vision will complement human expertise and, in a sense, give workers superhuman abilities: to construct buildings smarter and more efficiently and to spot visual defects with greater accuracy. Out and about, or online, computer vision could take away some of our anxieties about our safety, by helping us to manage our traffic better or prevent the spread of terrorist content on the web. In this way, computer vision has the potential to change the way we see the world, by allowing us to focus on the things that matter the most to us. But what is crucial to realising this future is properly trained models and relevant data, so that greater levels of accuracy can be achieved, as well as a greater understanding of how computers see the world. So, while the applications of computer vision right now are fun, in the future, it has the potential to optimise many aspects of our lives.