In a world that uploads more than two billion images to social media every day, the fourteen million images in the ImageNet visual database may seem modest. Yet this database has played a pivotal role in the development of global AI systems that identify, classify and create images. Computer vision (object detection, facial recognition, scene reconstruction, image classification, pattern detection, edge detection, video tracking, etc.) is the cornerstone of contemporary AI, and arguably its most controversial branch.

Each of the fourteen million images in the ImageNet database has been labelled with a noun selected from a predetermined list of categories, identifying the principal object in the image (toilet tissue, chair, goldfish, etc.). Each labelled image is then linked, through a nested hierarchy of some 22,000 subcategories, to one of nine top-level categories. For example, a chair is a kind of seat, which is a kind of furniture, which is a kind of furnishing, which belongs, finally, to the top-level category of artifact. This mind-numbing labelling work was carried out and verified by some 50,000 pieceworkers, hired through Amazon's Mechanical Turk, who labelled an average of fifty images per minute.
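The chair-to-artifact chain described above can be sketched as a simple parent lookup. This is a toy illustration with a hypothetical parent table, not ImageNet's actual data format (ImageNet organizes its categories as WordNet synsets):

```python
# Hypothetical parent table sketching ImageNet's nested hierarchy.
# Real ImageNet categories are WordNet synsets, not plain strings.
PARENT = {
    "chair": "seat",
    "seat": "furniture",
    "furniture": "furnishing",
    "furnishing": "artifact",
    "artifact": None,  # one of the nine top-level categories
}

def path_to_root(label):
    """Walk the parent links from a label up to its top-level category."""
    path = [label]
    while PARENT.get(path[-1]) is not None:
        path.append(PARENT[path[-1]])
    return path

print(path_to_root("chair"))
# ['chair', 'seat', 'furniture', 'furnishing', 'artifact']
```

In the real database the same walk is performed over WordNet's hypernym links, so every labelled image inherits the full chain of broader categories above it.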

ImageNet is an astounding feat: a critical component of global AI research, freely accessible at no cost, and supported by sustained, collaborative and responsible research. But it is also a recipe for disaster. Fundamental questions of privacy (the images were scraped from the Internet without permission), bias (assumptions embedded in the labelling and classification system) and technical error have challenged its authoritative status. In recent years, the ImageNet research team, led by Fei-Fei Li at Stanford University, has actively addressed many of these concerns.


A two-dimensional representation of 50,000 images from ImageNet, 2012, Image: Andrej Karpathy