Machines have learned to see better than ever. What might this mean for urbanism?
Within the span of a few years, computer vision technologies have become commonplace. Google offers reverse image search, while Facebook tags users’ friends in their photos and lets them add smiley faces, blushing cheeks, and puppy ears. Ethics experts have concerns about user privacy: it has been a year since an app that finds a person’s social media profiles from a photo of them came onto the market.
How does a computer see our world? What can it tell us about our cities? Strelka Magazine talked to Habidatum urban data analyst and Strelka Institute alumna Anna Lvova and learned about computer vision, its applications, and the ways it may transform our cities.
Machine vision has attained unparalleled results. As recently as the mid-2000s, a computer could not tell the difference between a cat and a dog. Today, it can distinguish a Cocker Spaniel from an Irish Terrier in less than a second. Panasonic’s fridge sends a warning when food goes bad. Volvo cars slow down automatically when they detect a deer or a moose crossing the road. Cameras are effectively becoming another search box: analysing a photo can help find necessary auto parts, a recipe for the food you have left, or a shirt you just saw on a stranger and a pair of shoes to match. There are also wilder applications: a Russian company has even applied photo recognition and analysis to palm-reading. These services are dubbed “Shazam for X”, with X being faces, clothes, fingerprints, etc. Like Shazam, they identify what they are shown, except that they perceive and analyse images instead of music.
A computer has an easy time recognizing a stingray, a mushroom, or a hamster. A muzzle, a ladle, or a headlight presents a more complex challenge. The introduction of computer vision has had a great impact on the development of robotics, UAVs, augmented reality, medical diagnosis, and a multitude of other fields.
So far, CityClass, created by Roman Kuchukov, has been the only urban development project in Russia where computer vision has been applied. The decision to use the technology was motivated more by pure research interest than by market demand. Some time ago, Kuchukov was engaged in creating an urban development concept for Irkutsk and was assigned to draft a functional zoning map for the city: a time-consuming and repetitive task. Roman came up with an idea: if a person can tell the difference between an industrial zone, a residential area, and a historic district by looking at satellite imagery, so can a machine.
Roman Kuchukov, Strelka Institute graduate, architect, CityClass project founder: “The city map is divided into cells, each one containing one of seven development types: pre-revolution development, Stalinist development, microdistricts, modern development, private residence areas, industrial development, and green spaces. First of all, I selected and marked several cells of each type on the map. The computer analysed these examples and learned to distinguish between different types of development. The images were then processed via the neural network, which provided me with a completed zoning map.”
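The workflow Kuchukov describes (mark a few example cells, let the program learn from them, then classify every other cell) can be sketched in a few lines of Python. This is not the CityClass code: the project ran satellite images through a neural network, whereas the sketch below swaps in a much simpler nearest-centroid classifier and assumes each cell has already been reduced to two invented features, building density and share of green cover. All names and numbers are hypothetical.

```python
from math import dist

# Hypothetical hand-labelled cells: (building_density, green_share) per type.
# Only three of the seven development types are shown to keep the sketch short.
labelled_cells = {
    "industrial": [(0.70, 0.05), (0.80, 0.10)],
    "private residence": [(0.30, 0.40), (0.25, 0.50)],
    "green space": [(0.02, 0.95), (0.05, 0.90)],
}


def train(examples):
    """Average the feature vectors of each type into a single centroid."""
    centroids = {}
    for label, vectors in examples.items():
        n = len(vectors)
        centroids[label] = tuple(
            sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))
        )
    return centroids


def classify(cell, centroids):
    """Return the type whose centroid lies closest to the cell's features."""
    return min(centroids, key=lambda label: dist(cell, centroids[label]))


centroids = train(labelled_cells)

# Unlabelled grid of cells (again, invented feature values).
grid = [
    [(0.75, 0.08), (0.28, 0.45)],
    [(0.03, 0.92), (0.72, 0.12)],
]
zoning_map = [[classify(cell, centroids) for cell in row] for row in grid]
for row in zoning_map:
    print(row)
```

The structure mirrors the quote: a handful of labelled cells per type is the only human input, and the rest of the map is filled in automatically.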
According to Victor Lempitsky, associate professor at the Skolkovo Institute of Science and Technology, the prototype could easily be adapted to work with other cities and tasks. “A neural network trained on one task can then be used to compare images for a different task. The most popular image database, ImageNet, does not offer a lot of images of buildings. However, a neural network trained on it will be able to recognize the similarities and differences between various types of buildings. Moreover, a neural network can be fine-tuned for similar tasks. A network trained to recognize Paris buildings will easily adjust to recognizing buildings in Moscow. The transition will be much quicker and require significantly fewer training examples than starting anew.”
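The reuse Lempitsky describes can be illustrated in its simplest form: keep the already-trained part of the network frozen and train only a small new final layer on a handful of examples. The sketch below is a toy under loud assumptions. The `pretrained_features` function is a hypothetical stand-in for a real trained network, the "images" are 2×2 grids of brightness values, and the two classes (ornate versus plain facades) are invented for illustration.

```python
def pretrained_features(image):
    """Stand-in for a frozen, already-trained network (hypothetical).
    It maps a tiny grayscale image to two numbers and is never updated."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    contrast = max(pixels) - min(pixels)
    return (mean, contrast)


def predict(image, head):
    """Apply the frozen features, then the small trainable last layer."""
    w, b = head
    x = pretrained_features(image)
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


def train_head(samples, epochs=20, lr=0.1):
    """Fit only the new last layer (a perceptron) on a few labelled images."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for image, label in samples:
            x = pretrained_features(image)
            error = label - predict(image, (w, b))
            w = [w[0] + lr * error * x[0], w[1] + lr * error * x[1]]
            b += lr * error
    return w, b


# Four invented training images for the new task: 1 = ornate, 0 = plain.
samples = [
    ([[0.0, 1.0], [1.0, 0.0]], 1),
    ([[0.5, 0.5], [0.5, 0.5]], 0),
    ([[0.1, 0.9], [0.9, 0.1]], 1),
    ([[0.4, 0.6], [0.5, 0.5]], 0),
]
head = train_head(samples)

print(predict([[0.0, 0.9], [0.8, 0.1]], head))    # unseen high-contrast image
print(predict([[0.5, 0.6], [0.5, 0.55]], head))   # unseen low-contrast image
```

Because only the two weights and the bias of the last layer are learned, four examples are enough here, which is the point of the quote: the expensive part of the learning has already been done elsewhere.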
In his project, Kuchukov only used one set of images to train the program. As development types are generally similar in different cities, he applied it to analyse a total of five cities: Moscow, Nizhny Novgorod, Kazan, Samara and Yekaterinburg.
Looking at the results for Moscow, it is easy to identify the historic center – a large cluster of red cells – as well as ZIL, Leninsky Avenue, Sparrow Hills, and new development near Moscow State University.
Roman Kuchukov: “We are able to give a machine skills which take people many years to acquire, and then use the neural network to complete tasks which take up to 90% of an urban planner’s time. This is an important result of computer learning, and it has the potential to completely change the practices and approaches we utilize today.”
What does an image tell us about a city?
OpenStreetMap, Google Earth, and Yandex.Maps have become common tools for experts working with cities and architecture. Map services allow for estimating development density, average building height, facade conditions, commercial diversity, and even the presence of outdoor signs. The online maps and satellite images used in CityClass present a viable alternative to official documentation. The latter is often obsolete or even missing. What is labeled as a “small forest” in official documents might turn out to be a developed settlement. And a historic monument might already be history itself.
When, in 2014, the Proshin mansion on Tverskaya-Yamskaya St. was accidentally demolished, the public learned of it only when a passer-by peeked behind the fence. If public activists had access to satellite surveillance, computer algorithms could have notified them of suspicious activities and the illegal demolition could have been prevented.
Building a good image recognition engine requires polished algorithms, lots of computational power, and a large database of images to train the program. Recently, there has been no lack of any of these, and urban research can now draw on images from social networks, UAVs, CCTV, and orbital satellites.
Poverty, sales, oil, and water through the eye of a satellite
The Earth observation industry is booming: more than 1300 satellites currently orbit the Earth, including private satellites offering imagery for sale. Silicon Valley VCs are on the hunt. Uber recently entered a partnership with DigitalGlobe, whose images are also used by Google: the service wants to bypass map providers and direct its drivers using the images itself. The good news is that stronger competition means cheaper satellite imagery. A few years ago, satellite pictures were exclusive to governments and the largest companies. Today, even a mid-sized mall in the middle of an American desert can afford them.
These companies are seeking the ability to scan every place on Earth every single hour. They have not yet figured out exactly how vast the potential of this endeavour is, but express full confidence in its future. Here are a few of the already discovered applications:
Stanford researchers trained a neural net to predict the poverty level in Africa, where governments do not have the funds necessary to collect sociodemographic data. The researchers first gathered daytime images to find human settlements, and then compared those to images taken after dusk to learn where people cannot even afford to light their homes at night.
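The day-and-night comparison described above can be sketched as a simple grid operation. The Stanford team trained a neural network on real imagery; the code below does not reproduce that and only illustrates the comparison itself, with invented per-cell values and assumed thresholds.

```python
# Hypothetical per-cell values on a 0..1 scale, as if already extracted
# from satellite imagery: built-up share by day, brightness by night.
daytime_built_up = [
    [0.8, 0.7, 0.0],
    [0.6, 0.1, 0.0],
]
night_brightness = [
    [0.9, 0.2, 0.0],
    [0.1, 0.0, 0.0],
]

SETTLED = 0.5  # assumed threshold: the cell counts as a settlement
LIT = 0.3      # assumed threshold: the cell counts as lit at night


def unlit_settlements(day, night):
    """Return (row, col) of cells that are built up by day but dark at night."""
    return [
        (r, c)
        for r, row in enumerate(day)
        for c, built in enumerate(row)
        if built >= SETTLED and night[r][c] < LIT
    ]


flagged = unlit_settlements(daytime_built_up, night_brightness)
print(flagged)
```

Cells with no settlement are ignored even though they are also dark, which is why the daytime pass has to come first.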
The private company Orbital Insight offers its own poverty index, but its other solutions are even more intriguing. One of Orbital Insight’s products analyses the occupancy of parking lots near shopping malls and predicts sales from the number of parked cars and how long they stay. The company sells this data to retailers and periodically releases a global analysis; in its latest forecast, the company came closer to the real numbers than Bloomberg. Orbital Insight has also learned how to discover illegal deforestation sites, calculate the progress of urbanization in developing regions, estimate the amount of potable water, and gauge global oil reserves by tracking the shadows cast on oil storage tanks. And, of course, how to determine zoning types and track road development and urban sprawl.
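How a car count turns into a sales forecast is not described here, so the following is only a guess at the simplest possible version: fit a straight line to past car counts and reported sales, then read a forecast off that line. The figures are invented and deliberately tidy, and Orbital Insight’s actual model is not public in this article.

```python
# Invented history: average cars counted per satellite image in a quarter,
# and the sales the retailer later reported (say, in millions).
cars = [120, 150, 180, 210]
sales = [2.4, 3.0, 3.6, 4.2]


def fit_line(xs, ys):
    """Ordinary least squares for one variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(cars, sales)


def forecast(car_count):
    """Predict sales for a newly observed car count."""
    return slope * car_count + intercept


print(round(forecast(200), 2))
```

A real forecast would also need the duration of stay mentioned above, seasonality, and far more data; the sketch only shows why counting cars is informative at all.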
Down to earth
While satellites watch us from above, ordinary cameras keep their eyes at the same level as ours. Google Street View has allowed many of us to visit the most distant places on the planet, walk the streets of foreign cities, and even enter buildings. The service is a perfect tool for analysing the morphology of streets, their design codes, green zones, illumination, number of fences, and quality of road materials. For instance, the authors of the What Makes Paris Look Like Paris? study used Street View images to discover the visual differences between boulevards and streets and find the architectural elements unique to the city.
Place Pulse, a crowdsourced survey developed by MIT, is another intriguing project. Respondents are presented with two Google Street View images and have to choose the one that looks safer, more beautiful, livelier, or wealthier. The survey allowed the researchers to find out whether urban perception differs in various countries, and also helped gather an enormous amount of tagged data, which could be further used to train programs and predict the characteristics of other cities.
Social networks and crowdsourcing provide yet another valuable source of information. Since 2015, people have been producing over a trillion photos annually. Some of them end up on social networks. The networks have been used for urban analysis for quite some time, but researchers have usually prioritised text and geolocation data. Meanwhile, there is very little information exclusive to the text form: images show smiles, emotions, poses, the faces of your friends, and even your location. Even without people, images reveal the condition of buildings and the level of air pollution. In Singapore, Instagram images are analysed to determine air quality.
Last comes the most obvious and yet most frightening source of visual data, the omnipresent CCTV cameras. They track traffic speed and can read number plates, but they are also learning to recognize human faces and compare them to databases. City security services eagerly employ this technology. One example is the authorities in Wales comparing pictures taken of UEFA Champions League Final visitors with criminal databases. But there are much wider applications. The face is becoming a new credit card, a new employee pass card, and a new passport. Your face can be used to pay a bus fare or enter a museum.
How does computer vision work?
Imagine that you have to explain what a human looks like to an alien. You say that a human has two eyes. But what if a human turns in such a way that you can only see one eye? Would the alien still recognize this object as a human? Seeing and perceiving is an incredibly complex ability: we learn to do it from the moment of our birth, and to us it comes naturally that objects can change their form, position, and surroundings without changing their essence. But for machines, this is not at all obvious.
Machine learning has helped to resolve this problem. Conventional programs are sets of explicit instructions which tell the computer to “walk up to the crossing, stop, then walk again”. In machine learning, the program is instead shown a large number of examples. At some point, it will recognize the existing patterns and understand that the red traffic light signals a stop.
So-called convolutional neural networks are the champions of image recognition. These networks are a collection of units called artificial neurons, which are organized in layers. Uploaded images are divided into a multitude of smaller pieces, which then travel through each of these layers. For example, if we upload an image of a person, one layer might analyse facial features, another body parts, and a third clothing. One particular neuron may react to the shine of a dress or the folds of a shirt. There is no need to know what is happening in each layer or neuron. After the analysis, the neural network makes a decision as to whether there is a person in the image or not.
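The operation that gives these networks their name can be shown on a toy image. In the sketch below, a small grid of weights (a kernel) slides over the image, and at each position the overlapping values are multiplied and summed. The kernel here is hand-picked to react to vertical edges, whereas a real network learns its kernels from data. As in most neural network libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    sum the element-wise products at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each row prints as [0, 3, 3, 0]
```

The output is zero over the uniform areas and large only where the brightness changes, which is the sense in which a single neuron "reacts" to one kind of detail.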
Victor Lempitsky: “Machine learning means that the ‘supervisor’ only controls the output values and does not have to pay any attention to the neuron values in dozens of intermediate levels. The final result being correct is the ultimate goal, and the exact way the machine learned to achieve that result is much less important.”
ImageNet Large-Scale Visual Recognition Challenge is the Olympics for neural networks. The contest has been held annually since 2010 and consists of three events. The first is to go through 150,000 photographs and tag each appearance of one out of 1000 possible items (for instance, an umbrella, a Doberman, or a labyrinth). The second is to find a specific object in a photograph. The third is to do the same, but with video.
Back in 2010, 28.2% of the answers made by the winner of the first contest were incorrect (compared to 5.1% of wrong answers made by a human). In 2015, the machines surpassed the human with only 4.94% wrong answers.
Victor Lempitsky: “Here, neural networks are superior to a human brain, simply because people are incapable of simultaneously processing thousands of parameters. Also, computers do it much more quickly. A developed and trained neural net analyses an image within hundreds of milliseconds. For GPU-accelerated networks, this number goes down to tens of milliseconds.”
These neural networks were first proposed 30 years ago: back then, they were already able to recognize hand-written numbers. For certain reasons, the technology was given the cold shoulder. After 20 years of oblivion, neural networks made their comeback and found favour within both the scientific and business communities. Yann LeCun, one of the earliest and most recognizable names in machine learning, now works at Facebook.
What kind of future do you see?
Although computers have already surpassed human capabilities, they still make mistakes. One notorious example happened when the Google Photos algorithm tagged two black people as gorillas. Discussions surrounding the mishap focused mainly on ethical problems, but there was another danger to it: if urban services adopt computer recognition on a larger scale, the cost of an error will increase exponentially.
However, the main complaint with machine vision is its potential to violate personal privacy. The thought that anyone can be found anywhere is terrifying. It is important to remember that it is we who decide how technology will be utilized. It can be used to search for a missing person, just as it can be employed to spy on someone after work. The first ape to ever pick up a stick could use it both to procure food and beat up other apes. And in both cases, the stick was merely a tool.
The fear of losing one’s job is another driver behind Neo-Luddism. Recently, Spanish architect David Romero created full-colour renditions of Frank Lloyd Wright’s destroyed buildings from black-and-white photographs. Software is able to restore image colour within seconds, allowing anyone to enjoy the entire palette of pre-revolution Moscow and Saint Petersburg, the original Pennsylvania Station, and pre-war Dresden and Rotterdam.
Today, machines not only process existing images, but also create images themselves. Dutch researchers trained a neural network to turn sketches into photorealistic images. The Prisma app transforms uploaded photos into works of art.
Can you imagine a service which could actually paint a picture based on a selection of images chosen by the user? Of course, a building is more than just a picture – there are also materials, planning and utilities involved. The common idea used to be that machines should handle the boring calculations and leave the creative work to humans. Computer vision is changing this approach. Will computers be able to imagine better cities than humans can?
Alarmists say that computer vision heralds total autocracy and technocracy. Others predict an age of democracy and self-organization unlike anything we have seen before. Already today, technology allows a shift from representative to direct democracy: why delegate your decisions to another person when you could cast your personal vote from any place on Earth?
Here, machine vision makes another step forward: what if certain decisions were not required at all?
Why does functional zoning even exist, forcing citizens to use certain places to live and others to work? As a result of this, some abide by the rules while others find them obstructive and violate them.
City management is doing its best to keep pace with modern times. But adapting takes time, and meanwhile, reality does not stand still. A look at new satellite images might reveal that a de jure green space is now effectively a residential area.
Computer vision is able to turn city management into city surveillance. And we must decide whether we will use it to spy and enforce or to observe and adapt.