Song Cui and Dhananjay Shrouty | Software Engineers, Content Knowledge
Interest Taxonomy at Pinterest
We recently began rolling out the beta version of Pinterest Trends, a new tool that gives a view of the top U.S. search terms on Pinterest within the past 12 months, and a look into when search terms peak for a better understanding of how content performs on the platform. Because people come to Pinterest to plan, we have unique insight into emerging trends, which is why we also publish an annual Pinterest 100 report (the most recent one released last month) to show what’s to come in 2020.
We’re able to gather these insights because Pinterest is fundamentally a different kind of platform, where over 320 million people from around the world come to save ideas and plan, from the everyday, like figuring out what to wear to work, to life stages big and small, like purchasing a new home or deciding where to travel. More than 200 billion ideas have been saved to over 4 billion boards to date on the platform, providing insights into emerging trends and early signals of consumer behavior and evolving tastes.
But in order to understand trends as they’re happening, we need to understand the content Pinners are searching for and the categories the Pins are associated with. To do that, we’ve built a taxonomy-based knowledge management system that enables content understanding in a highly efficient way.
Taxonomy is a methodology that classifies entities and defines the hierarchical relationships among them. It’s widely used as a knowledge management system in the industry, and has a proven record of improving the accuracy of machine learning models in search, user-behavior modeling, and classification tasks.
At Pinterest, we use a taxonomy to organize popular topics and entities (which we call “interests”) and curate nodes for ads targeting. These interests are grouped together in a hierarchical parent-child tree structure, with each child being a subclass of its single parent. The top-level taxonomy nodes define the broad verticals, such as “Women’s Fashion” and “DIY and Crafts”, that capture the general interests associated with Pins across Pinterest. Child nodes, which go up to 11 levels deep, capture more granular topics.
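The tree structure described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the production data model: the `InterestNode` class and the "Shoes" and "Boots" nodes are hypothetical, while "Women’s Fashion" is one of the verticals named above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class InterestNode:
    """Hypothetical taxonomy node: every non-root node has exactly one parent."""
    name: str
    parent: Optional["InterestNode"] = field(default=None, repr=False)
    children: List["InterestNode"] = field(default_factory=list)

    def add_child(self, name: str) -> "InterestNode":
        """Create a child node whose single parent is this node."""
        child = InterestNode(name=name, parent=self)
        self.children.append(child)
        return child

    def path(self) -> List[str]:
        """Names from the top-level vertical down to this node."""
        names = []
        node: Optional["InterestNode"] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))

    def level(self) -> int:
        """Depth in the tree, counting a top-level vertical as level 1."""
        return len(self.path())


# Illustrative nodes only; "Shoes" and "Boots" are made-up children.
fashion = InterestNode("Women's Fashion")
shoes = fashion.add_child("Shoes")
boots = shoes.add_child("Boots")
print(boots.path(), boots.level())
```

Storing a single `parent` reference per node is what enforces the subclass-of-one-parent rule; the depth of any node is then just the length of its path to the root.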
Example use cases
The Interest Taxonomy has many use cases across business, product, and engineering, where it powers production signals. In this post, we share four of them.
Ads Manager is the main interface for advertisers. To assist Pinterest advertisers, some Interest Taxonomy nodes are available for selection there, as shown below:
The Interest Taxonomy is used in interest based targeting to help advertisers reach the right audience based on Pinterest’s unique understanding of Pinners’ interests, taste, and what they’re planning.
To see their Ads campaign performance by interests, advertisers can select the ‘interests targeting’ breakdown at the Ads Group or Promoted Pin level. This granularity is available for the ‘delivery’, ‘performance’ or ‘custom’ reports.
Mapping Pins to Interest Taxonomy
We built Pin2Interest (P2I), a scalable machine learning system for content classification, to map our corpus of 200B+ Pins to our interest taxonomy. The results from P2I are used to generate personalized recommendations and create ranking features for other machine learning models. P2I is in production and has many consumers such as home feed ranking and Ads targeting.
P2I leverages both text and visual inputs such as annotations, visual embeddings, and board names. It uses Natural Language Processing (NLP) techniques such as lexical expansion and embedding similarities to map the inputs of every single image to a list of taxonomy nodes as prediction candidates. Then, a search relevance model is used to predict and rank the matching score between the image and every single taxonomy node. A sample P2I output is shown below, including the most relevant interest prediction with a score for the image.
The taxonomy hierarchy information is also used as a P2I ranking signal. More than 99% of Pins can be mapped to at least one taxonomy node. The granularity and quality of the taxonomy are critical for P2I accuracy. If the content of an image belongs to a very specific topic and the taxonomy has no similar node covering that topic, P2I will map the image to a node with a different context, and prediction accuracy drops.
P2I and the Interest Taxonomy provide critical insight into content understanding. For example, we can monitor the number of images mapped to each taxonomy node, and this metric tells us which topics are trending and which are declining in Pinterest content.
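As a minimal sketch of that kind of monitoring, the snippet below counts Pins per taxonomy node for two periods and reports the relative change. The function names, the `(pin_id, node_name)` input format, and the node names are assumptions made for illustration; they are not P2I’s actual output schema.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def pins_per_node(labels: Iterable[Tuple[str, str]]) -> Counter:
    """Count Pins per taxonomy node from (pin_id, node_name) pairs."""
    return Counter(node for _pin_id, node in labels)


def relative_change(previous: Counter, current: Counter) -> Dict[str, float]:
    """Relative change in Pin count per node between two periods."""
    changes = {}
    for node in set(previous) | set(current):
        before, after = previous[node], current[node]
        # A node with no Pins in the earlier period has no baseline, so skip it.
        if before > 0:
            changes[node] = (after - before) / before
    return changes


# Toy data: hypothetical Pin ids and node names.
last_period = pins_per_node(
    [("p1", "Boots"), ("p2", "Boots"), ("p3", "Sandals"), ("p4", "Sandals")]
)
this_period = pins_per_node(
    [("p5", "Boots"), ("p6", "Boots"), ("p7", "Boots"), ("p8", "Sandals")]
)
print(relative_change(last_period, this_period))
```

Positive values point to topics gaining content and negative values to topics losing it; a real pipeline would also need a rule for nodes that are new in the current period.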
Mapping users to Interest Taxonomy
The Interest Taxonomy is also leveraged in an ML system, called user2interest, that infers users’ interests. One of the most important input signals to this system is the set of Pins a user has engaged with, together with the interest labels that Pin2Interest (described in the previous section) assigns to those Pins.
The user interest signal is widely used at Pinterest for Ads targeting and organic recommendations, and it can also provide insights on the Interest Taxonomy from the user’s perspective. For example, we can compute statistics such as the number of users per taxonomy node to inform our advertisers of Pinners’ overall interest shifts.
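To make the input signal concrete, here is a deliberately simple baseline that sums the interest scores of a user’s engaged Pins and normalizes them. It is an assumption-laden stand-in, not the user2interest model itself, which is an ML system whose details are not given here; the function name and data shapes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def user_interest_scores(
    engaged_pins: List[str],
    pin_interests: Dict[str, List[Tuple[str, float]]],
) -> Dict[str, float]:
    """Aggregate per-Pin interest scores into a normalized user profile.

    pin_interests maps a Pin id to (taxonomy_node, score) pairs, standing in
    for Pin2Interest output. Pins without labels are ignored.
    """
    totals: Dict[str, float] = defaultdict(float)
    for pin_id in engaged_pins:
        for node, score in pin_interests.get(pin_id, []):
            totals[node] += score
    norm = sum(totals.values())
    if norm == 0:
        return {}
    return {node: score / norm for node, score in totals.items()}


# Toy example with hypothetical Pins and nodes.
profile = user_interest_scores(
    ["a", "b"],
    {"a": [("Boots", 0.8)], "b": [("Boots", 0.4), ("Sandals", 0.4)]},
)
print(profile)
```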
Mapping queries to Interest Taxonomy
Query2Interest (Q2I) maps short text queries to taxonomy nodes. This signal leverages Pintext, a multitask text embedding system at Pinterest, to compute the similarity score between the short text and the taxonomy nodes, so that queries with similar categories and meanings are grouped under the same nodes. Q2I is in production and is used on various Ads and organic surfaces. Mapping queries to the Interest Taxonomy helps Pinterest understand users’ intent so we can serve relevant results to them.
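The similarity step can be illustrated with cosine similarity over embedding vectors. The sketch below assumes cosine similarity as the scoring function and uses tiny made-up two-dimensional vectors in place of real Pintext embeddings; `rank_nodes` and the node names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_nodes(
    query_vec: Sequence[float],
    node_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k taxonomy nodes most similar to the query embedding."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in node_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up embeddings; real ones would come from a text embedding model.
node_vecs = {"Boots": [1.0, 0.0], "Sandals": [0.0, 1.0]}
print(rank_nodes([0.9, 0.1], node_vecs, top_k=1))
```

At production scale, an exhaustive scan like this would typically be replaced by an approximate nearest-neighbor index.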
Creating and maintaining Interest Taxonomy
The taxonomy curation process has two important components: 1) RDF data modeling, with visualization and curation in WebProtégé; and 2) an engineering workflow that facilitates incremental changes to the taxonomy. We introduce the details below.
RDF data modeling, WebProtégé visualization and curation
To model the data in the taxonomy, we use RDF (Resource Description Framework) triples to generate graphs, which can then be used for curation as well. We use the open source tool WebProtégé for visualization and human curation of the taxonomy, which lets us build a high-quality taxonomy through collaborative curation. The RDF data model we use is depicted below:
The figure below shows the data modeling in WebProtégé that we use for collaborative curation.
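For readers unfamiliar with RDF, a triple is a (subject, predicate, object) statement. The sketch below represents a few taxonomy facts as plain Python tuples and derives a child-to-parent map while checking the single-parent rule. The use of `rdfs:subClassOf` and `rdfs:label` is an assumption based on the subclass relationship described earlier, and the `ex:` identifiers are hypothetical; the actual data model may use different properties.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

SUBCLASS_OF = "rdfs:subClassOf"  # assumed predicate for the parent-child link
LABEL = "rdfs:label"             # assumed predicate for the display name

# Hypothetical triples; "ex:" stands for an example namespace.
triples: List[Triple] = [
    ("ex:womens_fashion", LABEL, "Women's Fashion"),
    ("ex:shoes", LABEL, "Shoes"),
    ("ex:shoes", SUBCLASS_OF, "ex:womens_fashion"),
    ("ex:boots", LABEL, "Boots"),
    ("ex:boots", SUBCLASS_OF, "ex:shoes"),
]


def parent_map(graph: List[Triple]) -> Dict[str, str]:
    """Map each child to its parent, rejecting nodes with multiple parents."""
    parents: Dict[str, List[str]] = defaultdict(list)
    for subject, predicate, obj in graph:
        if predicate == SUBCLASS_OF:
            parents[subject].append(obj)
    offenders = {node: ps for node, ps in parents.items() if len(ps) > 1}
    if offenders:
        raise ValueError(f"nodes with more than one parent: {offenders}")
    return {node: ps[0] for node, ps in parents.items()}


print(parent_map(triples))
```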
From RDF to production DB
The engineering workflows take the RDF graphs (in XML format) as input and generate relational DB tables for downstream consumption. Taxonomy generation is incremental: each iteration extends the taxonomy produced by the previous one. Every new version supports operations such as adding a node, renaming an existing node, deleting a node, and merging two or more nodes into one, so that the taxonomy stays high quality and relevant to the content on Pinterest. We have developed heuristic rules for every situation that requires a node change (such as node renaming, node merging, and node deletion).
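The RDF-to-table step can be sketched with the Python standard library alone. Everything specific here is an assumption for illustration: the sample document, the use of `owl:Class` elements with `rdfs:label` and `rdfs:subClassOf`, the `example.org` IRIs, and the `interest` table layout are not the actual export format or production schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Hypothetical RDF/XML input with two nodes.
SAMPLE_RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/interest/womens_fashion">
    <rdfs:label>Women's Fashion</rdfs:label>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/interest/shoes">
    <rdfs:label>Shoes</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://example.org/interest/womens_fashion"/>
  </owl:Class>
</rdf:RDF>
"""


def load_taxonomy(rdf_xml: str, conn: sqlite3.Connection) -> None:
    """Parse class elements from RDF/XML and upsert them into one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS interest ("
        "iri TEXT PRIMARY KEY, label TEXT, parent_iri TEXT)"
    )
    root = ET.fromstring(rdf_xml)
    for cls in root.findall(f"{{{OWL}}}Class"):
        iri = cls.get(f"{{{RDF}}}about")
        label = cls.findtext(f"{{{RDFS}}}label")
        parent = cls.find(f"{{{RDFS}}}subClassOf")
        parent_iri = parent.get(f"{{{RDF}}}resource") if parent is not None else None
        # INSERT OR REPLACE lets a newer taxonomy version overwrite a renamed
        # or re-parented node that keeps the same IRI.
        conn.execute(
            "INSERT OR REPLACE INTO interest VALUES (?, ?, ?)",
            (iri, label, parent_iri),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_taxonomy(SAMPLE_RDF_XML, conn)
print(conn.execute("SELECT label, parent_iri FROM interest ORDER BY label").fetchall())
```

Keying rows on a stable identifier rather than on the label is one way to make renames cheap; deletions and merges would need extra handling that this sketch leaves out.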
Updating Interest Taxonomy
Pinterest Taxonomy aims to capture the most important and timely topics from Pinterest content. Active topics used in various products such as topic feed and shopping are all covered by our taxonomy. These terms are mined from popular annotations used in Pins, board names, and top search queries.
When we want to add a new topic to the existing Taxonomy, we first send the candidate terms to teams like content safety and legal for review. Then, we rely on a neural network based ML algorithm to predict how likely each existing node is to be the parent of the candidate term. The predicted parents are reviewed manually. After that, the new nodes are added to the current Taxonomy in WebProtégé by our Taxonomists. The whole process is depicted below:
The key assumption made by the NTE model is that at least one affine projection exists such that, once a new term’s (e.g., “litecoin”) embedding is transformed using this matrix, the transformed embedding’s nearest neighbor is its parent (e.g., “Cryptocurrency”). Thus the key is to learn the transformation matrix. For simplicity, the new term is represented as q (query) and the potential parent is represented as p (parent) in the following model diagram:
The loss is then computed as the sum of two components. The first component encourages the projection of the query q to be similar to its true parent’s embedding e_p. The second component encourages the query projection to be dissimilar from the embeddings of m “negatively sampled” parents, denoted p′. The overall loss for each query-parent pair (q, p) is:
We collect positive labels from the existing taxonomy hierarchy and use negative samples for training the model. The model is used in production for several large-scale taxonomy expansion projects. Final human review is still needed: because the taxonomy is exposed externally to advertisers, we need very high quality data.
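To show how the two loss components fit together, here is a small pure-Python sketch. It assumes a dot-product similarity and a logistic negative-sampling loss, which is one common way to realize "similar to the true parent, dissimilar from sampled negatives"; the exact loss used by the NTE model may differ. The names `project`, `pair_loss`, and `nearest_parent`, the toy embeddings, and the "Gardening" node are all hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def project(W: Sequence[Vector], b: Vector, e_q: Vector) -> List[float]:
    """Affine projection W @ e_q + b of the query embedding."""
    return [sum(w * x for w, x in zip(row, e_q)) + b_i for row, b_i in zip(W, b)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def pair_loss(W, b, e_q: Vector, e_p: Vector, negatives: Sequence[Vector]) -> float:
    """Assumed loss: pull the projection toward e_p, push it from negatives."""
    proj = project(W, b, e_q)
    positive_term = -log_sigmoid(dot(proj, e_p))
    negative_term = -sum(log_sigmoid(-dot(proj, e_neg)) for e_neg in negatives)
    return positive_term + negative_term


def nearest_parent(W, b, e_q: Vector, candidates: Dict[str, Vector]) -> str:
    """Predicted parent: the candidate most similar to the projected query."""
    proj = project(W, b, e_q)
    return max(candidates, key=lambda name: dot(proj, candidates[name]))


# Toy setup: identity projection and made-up 2-d embeddings.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
candidates = {"Cryptocurrency": [1.0, 0.0], "Gardening": [0.0, 1.0]}
litecoin = [0.9, 0.1]
print(nearest_parent(W, b, litecoin, candidates))
print(pair_loss(W, b, litecoin, candidates["Cryptocurrency"], [candidates["Gardening"]]))
```

Training would adjust `W` and `b` by gradient descent to reduce this loss over the positive pairs collected from the existing hierarchy, which is omitted here.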
To support Pinterest’s international expansion, the Pinterest taxonomy has been translated into 17 languages for 20 countries, and it will continue to expand to new markets. The English Taxonomy serves as the ground truth for all international versions.
Moving forward, we’re excited to keep evolving how we capture and understand trends in a more timely and systematic manner. Our Interest Taxonomy and downstream signals (e.g. P2I, U2I, B2I, Q2I) will be updated regularly and automatically. In the future, we will also work toward automatically building new types of relationships among the entities in our taxonomy and associating attributes with them. If you’re interested in learning more about what our Knowledge Engineering team and other teams do, and how to join us, check out our careers page.
We want to thank everyone who has contributed to this project: our EM Rui Li, Yunsong Guo; our PM Troy Ma, Miwa Takaki; our engineers Yimeng Zhang, Emaad Ahmed Manzoor (intern), International team Helene Labriet-Gross, Evelyn Obamos, Francesca Di Marco, CatherineRose Mountain, Serena Perfetto and the Stanford Protege team. Also special thanks to Bo Zhao, Jinyu Xie, Rui Huang for their thoughtful suggestions.