Building a scalable data management system for computer vision tasks
By Shirley Du | Engineer, Discovery
On Pinterest, computer vision plays a central role in enabling search and discovery from every Pin and every image within a Pin. We train visual models not just to identify objects, but also to predict attributes, related ideas, and exact product matches. As people conduct hundreds of millions of visual searches on Pinterest every month, these models are becoming increasingly powerful, with results surfaced through technologies like Lens camera search, automated Shop the Look, and visual search across the platform.
Successful training of machine learning models, especially deep learning models, typically relies on collecting high-quality training data at scale. Furthermore, in production applications, many auxiliary tasks often require more engineering work than the model design itself: data collection, label cleaning, quality control, crowdsourcing, versioning, dataset storage, taxonomy management, and so on. To help with these tasks, we designed a centralized, scalable, and flexible data management tool called Voyager.
Voyager serves as a centralized hub for storing computer vision training data at Pinterest. The training and evaluation of deep learning models require datasets with various labels, formats and sizes, and Voyager allows us to create such datasets easily from scratch. This feature is particularly useful in exploratory applications because we only need a small dataset to start experimenting, and it’s most convenient when we can just bootstrap a dataset on our own.
Imagine we wanted to bootstrap a dataset with 100 images of high heels. In Voyager, there are several ways to quickly define a taxonomy and add images to do this:
- Add images by performing a text search. Simply search by the term you are looking for (e.g., “heels”), and choose result images to be added to your category.
- Add images by performing a visual search. Select an existing image and our visual search backend will find images that are visually similar.
- Drag and drop from your local desktop.
- Import from the Pinterest database using our unified computer vision dataset schema (explained below).
Even though 100 images might not seem like a lot, this may be enough to bootstrap a new category when fine-tuning existing models. What makes Voyager even more powerful is that it also supports dynamic taxonomy management modeled on a file system: users can manage their data much as they manage files and folders on a laptop. This is especially useful for data collection in deep learning applications. For example, to train our Lens model, we need training images that resemble the quality distribution of user input images, which are often low resolution, shot in low light, and sometimes taken at unconventional angles. A file-system-like tool makes collecting images that satisfy these requirements fast and easy.
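To make the file-system analogy concrete, here is a minimal sketch of a path-based taxonomy. The `Taxonomy` class, its methods, and the category paths are hypothetical illustrations and not Voyager's actual interface.

```python
from typing import Dict, List


class Taxonomy:
    """Hypothetical taxonomy where categories are slash-separated paths, like folders."""

    def __init__(self) -> None:
        # Maps a category path to the image ids filed directly under it.
        self._images: Dict[str, List[str]] = {}

    def add(self, path: str, image_id: str) -> None:
        """File an image under a category, creating the category if needed."""
        self._images.setdefault(path.strip("/"), []).append(image_id)

    def move(self, src: str, dst: str) -> None:
        """Rename a category and everything nested under it."""
        src, dst = src.strip("/"), dst.strip("/")
        for path in list(self._images):
            if path == src or path.startswith(src + "/"):
                self._images[dst + path[len(src):]] = self._images.pop(path)

    def paths(self, prefix: str = "") -> List[str]:
        """List category paths at or below a prefix, in sorted order."""
        prefix = prefix.strip("/")
        return sorted(
            p for p in self._images
            if not prefix or p == prefix or p.startswith(prefix + "/")
        )

    def images(self, path: str) -> List[str]:
        """Return the image ids filed directly under one category."""
        return list(self._images.get(path.strip("/"), []))


taxonomy = Taxonomy()
taxonomy.add("shoes/heels", "img-1")
taxonomy.add("shoes/heels/stiletto", "img-2")
# Reorganizing a branch is a single move, as with folders on a laptop.
taxonomy.move("shoes/heels", "footwear/high-heels")
print(taxonomy.paths())
```

Keying categories by path keeps renames and reorganizations cheap, which matters when a taxonomy is still changing during exploration.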
In addition, to further improve the user experience and diversity of visual search results, we also focus on attributes of images, such as color, material, gender, and other useful information associated with the image. To collect this kind of data, Voyager allows our internal product specialists to label multiple attributes in real time.
Voyager is not just a web tool. It is also a system that handles data collection, cleaning, visualization, and deep vision model training. This system is supported by an underlying unified data labeling schema.
Unified data labeling schema
An important lesson we learned from developing our data management toolchain is that we needed to define a unified schema optimized for all computer vision tasks at Pinterest. Without a unified schema, we frequently ran into issues like difficult data visualization, incompatible dataset formats, brittle type and schema inference in vision dataset consumers, along with a host of other concerns.
Thus, our schema satisfies the following requirements:
- Each image can have image-level labels (e.g., this image is a living room scene) and region-level labels (e.g., this region in this image is a bed). A region can be defined as a box, a mask, a polygon, etc.
- Any region (including the full image) can have a label set from several independent taxonomies (e.g., semantic category, color palette, material, pattern, etc). Describing the same region across multiple taxonomies has proven to be important for our multi-task learning efforts, which we will cover in a future blog post.
- It should be possible to relate a region to regions in other images (e.g., this couch in this living room scene is similar to that couch in the other living room scene).
- We should be able to record labels from taxonomies that have no spatial associations in the image (e.g., this image is black-and-white).
- Images need to be co-located as raw bytes with the metadata, not linked, to protect against relocated or modified images (with safe mechanisms for removing the training examples of deleted images).
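The requirements above can be sketched as a pair of record types. This is an illustrative sketch only: the class names, field names, and the choice to attach image-level and non-spatial labels to a "full_image" region are assumptions, not the actual Pinterest schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Region:
    """Hypothetical labeled area of an image; the default covers the full image."""
    region_id: str
    # Geometry kind: "full_image", "box", "mask", or "polygon".
    kind: str = "full_image"
    # Flat coordinate list; left empty for full-image regions.
    coords: List[float] = field(default_factory=list)
    # Taxonomy name -> labels, so one region can be described by
    # several independent taxonomies at once.
    labels: Dict[str, List[str]] = field(default_factory=dict)
    # Links to regions in other images as (image_id, region_id) pairs.
    related: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class ImageRecord:
    """Hypothetical dataset entry that stores the image bytes with its metadata."""
    image_id: str
    image_bytes: bytes  # raw bytes co-located with the labels, not a link
    regions: List[Region] = field(default_factory=list)

    def labels_for(self, taxonomy: str) -> List[Tuple[str, str]]:
        """Return (region_id, label) pairs for one taxonomy across all regions."""
        return [
            (region.region_id, label)
            for region in self.regions
            for label in region.labels.get(taxonomy, [])
        ]


record = ImageRecord(
    image_id="img-1",
    image_bytes=b"placeholder image bytes",
    regions=[
        # Image-level and non-spatial labels live on the full-image region.
        Region("r0", labels={"scene": ["living room"],
                             "color_mode": ["black-and-white"]}),
        # A box region labeled under two taxonomies and linked to another image.
        Region("r1", kind="box", coords=[0.1, 0.4, 0.6, 0.9],
               labels={"category": ["couch"], "material": ["leather"]},
               related=[("img-2", "r3")]),
    ],
)
print(record.labels_for("category"))  # [('r1', 'couch')]
```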
The advantage of this schema is that on the data labeling side (web tool), we can easily render image-level and region-level information and write modifications back to the database. On the deep learning model training side (training framework), the trainers can quickly deserialize and extract image metadata in a single pass. By storing raw image bytes in this unified data structure, we also ensure that all images stay intact, which makes model performance more reliable and makes reproducing and retraining a model more feasible.
As the system scales, modularizing the data management system and separating responsibilities become increasingly important. A consistent, unified, and backward-compatible schema is an important first step towards connecting modules together and achieving those goals.
With a reliable unified schema, we are ready to perform data cleaning in Voyager. However, our datasets evolve continuously because hundreds of product specialists clean our data at the same time. In industrial applications, where massive amounts of data are produced every day and fed into machine learning models, continuous dataset management is even more common. Maintaining and iterating on models requires easy access to existing data entries so that historical performance and analysis can be stored and tracked, and so that model performance can be reliably reproduced. Reproducibility is crucial to industrial model development. Without it, we risk claiming gains from modifications to our deep learning models without realizing that the real source of improvement was random. To solve this problem, we implemented an automatic versioning system for keeping track of data.
To achieve automatic versioning, we developed pipelines running under Voyager that publish datasets in real time. These pipelines pick up the latest snapshot of our datasets and dump it into the database. This ensures reusability within the organization and reproducibility of model performance within the team.
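One way to picture snapshot publishing is the sketch below, which derives a version id from the dataset contents. Content hashing, the `VersionedStore` class, and the record fields are all assumptions made for illustration; the article does not describe how Voyager assigns versions or what the database looks like.

```python
import hashlib
import json
from typing import Dict, List


def snapshot_version(records: List[dict]) -> str:
    """Derive a deterministic version id from the dataset contents."""
    # Sort records and keys so the same contents always hash identically.
    canonical = json.dumps(
        sorted(records, key=lambda r: r["image_id"]), sort_keys=True
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


class VersionedStore:
    """Hypothetical in-memory stand-in for the database that receives snapshots."""

    def __init__(self) -> None:
        self._snapshots: Dict[str, List[dict]] = {}

    def publish(self, records: List[dict]) -> str:
        """Store the current dataset state and return its version id."""
        version = snapshot_version(records)
        # Keep a deep copy (via a JSON round trip) so later edits to the
        # live dataset cannot change an already published version.
        self._snapshots.setdefault(version, json.loads(json.dumps(records)))
        return version

    def load(self, version: str) -> List[dict]:
        """Fetch the exact records that were published under a version id."""
        return self._snapshots[version]


store = VersionedStore()
dataset = [{"image_id": "a", "label": "heels"}]
v1 = store.publish(dataset)

dataset[0]["label"] = "boots"  # a specialist relabels the image
v2 = store.publish(dataset)

# The earlier snapshot is unchanged, so a model trained on v1 can be reproduced.
print(v1 != v2, store.load(v1)[0]["label"])
```

Recording the version id alongside each trained model is what lets a later comparison separate a real modeling gain from a change in the underlying data.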
Scalable training data collection pipeline
Crowdsourcing is the most efficient way to collect data at scale. However, ensuring data integrity and consistency is nontrivial because scene images vary, and different crowdsourcing labelers and QA reviewers may disagree on what counts as an object and on how a bounding box should be drawn around it. As a result, ensuring data quality can be a headache.
Take collecting bounding boxes for detection tasks as an example. To train the detection model behind automated Shop the Look, which gives Pinners the best experience for exploring and shopping the objects in a scene, we need training images with well-drawn bounding boxes around all prominent objects, following a consistent but sophisticated labeling guide. We ask labelers to label one category at a time (e.g., label all the chairs in the scene) so that bounding boxes are drawn quickly and in line with our training guide. Then we ask labelers to verify that each box is drawn correctly (e.g., boxes are tight enough). After the boxes are verified, we ask labelers to verify that each image is labeled at full recall (e.g., all the chairs in the scene are correctly labeled). We iterate on this process until the quality meets our standard. We call this a “targeted” data collection pipeline because we collect one category at a time. This pipeline is extremely quick and scalable for collecting detection data, and it is being adopted widely at Pinterest.
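The draw, verify, and recall-check loop can be sketched as follows. The three callables stand in for crowdsourcing tasks, and their names, signatures, and the `max_rounds` cutoff are hypothetical rather than part of the pipeline described here.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def collect_category(
    image_id: str,
    category: str,
    draw_boxes: Callable[[str, str, List[Box]], List[Box]],
    box_is_valid: Callable[[str, str, Box], bool],
    has_full_recall: Callable[[str, str, List[Box]], bool],
    max_rounds: int = 3,
) -> Tuple[List[Box], bool]:
    """Run draw -> verify boxes -> verify recall for one category in one image."""
    accepted: List[Box] = []
    for _ in range(max_rounds):
        # Step 1: labelers draw boxes for objects not yet covered.
        candidates = draw_boxes(image_id, category, accepted)
        # Step 2: a second task keeps only the boxes judged correct.
        accepted += [b for b in candidates if box_is_valid(image_id, category, b)]
        # Step 3: a third task checks that every instance is labeled.
        if has_full_recall(image_id, category, accepted):
            return accepted, True
    # Quality bar not reached within the allowed number of rounds.
    return accepted, False


# Simulated labelers: each drawing round finds one chair that is still missing.
truth: List[Box] = [(0, 0, 10, 10), (20, 20, 30, 30)]


def draw(image_id: str, category: str, existing: List[Box]) -> List[Box]:
    return [b for b in truth if b not in existing][:1]


def valid(image_id: str, category: str, box: Box) -> bool:
    return box in truth


def recall(image_id: str, category: str, boxes: List[Box]) -> bool:
    return set(boxes) == set(truth)


boxes, complete = collect_category("img-1", "chair", draw, valid, recall)
print(len(boxes), complete)  # 2 True
```

Splitting drawing, box verification, and recall verification into separate small tasks keeps each question simple for the labeler, which is the property the one-category-at-a-time design relies on.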
After we gather targeted data from this pipeline, we ingest the results into our visualization and cleaning tool to perform internal QA. This workflow also merges the results when they conflict with our existing image dataset and deduplicates images against our test dataset. If the quality doesn’t meet our standard, we ask the labelers to redo the task and repeat the process; otherwise, an export workflow is kicked off to dump the cleaned data to our source-of-truth location.
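As a rough sketch of the ingest step, the code below drops incoming images that overlap with the test set and merges labels for images that already exist. It assumes exact-duplicate detection by hashing raw bytes and a "newest label wins" merge policy; both are illustrative choices, and the function and field names are hypothetical.

```python
import hashlib
from typing import Dict, List, Set


def content_key(image_bytes: bytes) -> str:
    """Identify an image by the hash of its raw bytes (exact duplicates only)."""
    return hashlib.sha256(image_bytes).hexdigest()


def ingest(
    existing: Dict[str, dict], incoming: List[dict], test_keys: Set[str]
) -> Dict[str, int]:
    """Merge a labeled batch into the training set and report what happened."""
    stats = {"added": 0, "merged": 0, "dropped_test_overlap": 0}
    for item in incoming:
        key = content_key(item["image_bytes"])
        if key in test_keys:
            # Never let a test image leak into the training data.
            stats["dropped_test_overlap"] += 1
            continue
        if key in existing:
            # Assumed policy: incoming labels overwrite older ones per taxonomy.
            existing[key]["labels"].update(item["labels"])
            stats["merged"] += 1
        else:
            existing[key] = {
                "image_bytes": item["image_bytes"],
                "labels": dict(item["labels"]),
            }
            stats["added"] += 1
    return stats


train: Dict[str, dict] = {}
test_keys = {content_key(b"test-image")}
batch = [
    {"image_bytes": b"chair-1", "labels": {"category": "chair"}},
    {"image_bytes": b"chair-1", "labels": {"material": "wood"}},
    {"image_bytes": b"test-image", "labels": {"category": "chair"}},
]
stats = ingest(train, batch, test_keys)
print(stats)
```

A production system would likely need near-duplicate detection rather than exact byte matching, since resized or re-encoded copies of a test image hash differently.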
Extensible for the future
We believe that a reliable, scalable, and maintainable data management system is crucial to deep learning model training as vision tasks get more and more complicated. Our data management system significantly simplifies common deep learning tasks by reducing the turnaround time for the cycle of data collection, model training, debugging, and further data collection. Looking ahead, we look forward to an even more powerful system that allows us to do instant training: an instant feedback loop for debugging models. Please stay tuned!
Acknowledgements: Our data management tool is a collaborative effort at Pinterest. Special thanks to Dmitry Kislyuk, Michael Feng, Andrew Zhai, Jeff Harris, and Chuck Rosenberg for their collaboration on this project.