Microsoft has quietly deleted MS Celeb, a dataset of 10 million images used to train facial recognition systems. According to a report from the Financial Times (via Engadget), the database, which contains pictures of roughly 100,000 celebrities, had become problematic because it also includes images of "arguably private individuals." More disturbingly, the dataset has reportedly been used by tech giants including IBM, Panasonic, and Nvidia, as well as by Chinese companies suspected of using it to improve their surveillance technologies.
The Redmond giant created its MS Celeb database back in 2016 by gathering images from the web that were published under a Creative Commons license. At the time, the company claimed that MS Celeb was the largest publicly available facial recognition dataset in the world. The problem with that approach is that the dataset includes pictures of people who never asked to be part of it, which raises privacy concerns.
However, Microsoft denied that its MS Celeb database was removed because of privacy concerns. "The site was intended for academic purposes," the company told the Financial Times. "It was run by an employee that is no longer with Microsoft." Still, in a period when tech giants are facing increased scrutiny, it wouldn't have been right for Microsoft to leave this massive database online while calling for regulation of facial recognition technology at the same time.
Even though MS Celeb has now been taken offline, its contents haven't completely disappeared from the Internet. Adam Harvey, a researcher familiar with similar massive datasets, told the Financial Times that "people are posting it on GitHub, hosting the files on Dropbox and Baidu Cloud, so there is no way from stopping them from continuing to post it and use it for their own purposes."