Understanding Pins through keyword extraction

Heath Vinicombe | Software Engineer, Knowledge

Pinterest is well known for its visual discovery engine, but much of our content is also accompanied by text. As a result, it’s important for us to be able to understand text to ensure we recommend relevant ideas to Pinners. In this blog post, I’ll talk about “annotations,” one of the main signals we use at Pinterest for understanding text.

Overview of annotations

Annotations are short keywords or phrases between one and six words that describe the subject of the Pin.

In addition to its text content, each annotation also has a confidence score and a language associated with it. We extract multiple annotations per Pin across a total of 28 languages.

For the above example Pin, we might extract the following annotations:

  • (EN, sloth sanctuary, 0.99)
  • (EN, sloths, 0.95)
  • (EN, costa rica, 0.90)
  • (EN, caribbean, 0.85)
  • (EN, animals, 0.80)
  • (EN, travel, 0.80)
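
For illustration, each tuple above can be modeled as a small value type; the field names here are hypothetical, not an internal schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """One extracted keyword for a Pin (hypothetical field names)."""
    language: str   # e.g. "EN"
    term: str       # e.g. "sloth sanctuary"
    score: float    # confidence in [0, 1]

annotations = [
    Annotation("EN", "sloth sanctuary", 0.99),
    Annotation("EN", "sloths", 0.95),
    Annotation("EN", "costa rica", 0.90),
]
```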

Using annotations

Annotations are a fundamental signal used across a variety of product surfaces at Pinterest, often as features within machine learning models. Adding annotation-based features to these models has produced significant gains in experiment metrics, and annotations are typically among the most important features. Examples of where we use annotations include:

  • Ads CTR prediction
  • Home feed candidate generation & ranking
  • Related Pins candidate generation & ranking
  • Search retrieval & ranking
  • Board suggestions for new Pins
  • Detecting unsafe content

Case studies

Search

Annotations are stored in the inverted index. When a user performs a search, annotations are used to retrieve Pins with annotations matching the user’s query. The advantages of storing annotations in the inverted index rather than storing all tokens are:

  • Annotation scores tend to correlate better with relevance than TF-IDF alone
  • Storing just the annotations uses less space than storing all tokens, which is important when there are over 200 billion Pins to index
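
As a simplified sketch of this retrieval path (illustrative, not the production search stack), an annotation-based inverted index maps each annotation term to the Pins that carry it along with their scores, and a query retrieves and ranks matching Pins:

```python
from collections import defaultdict

# Hypothetical inverted index: annotation term -> list of (pin_id, annotation_score).
index = defaultdict(list)

def add_pin(pin_id, annotations):
    """Index a Pin by its (term, score) annotation pairs."""
    for term, score in annotations:
        index[term].append((pin_id, score))

def search(query_terms, limit=10):
    """Retrieve Pins whose annotations match the query, ranked by summed annotation score."""
    scores = defaultdict(float)
    for term in query_terms:
        for pin_id, score in index.get(term, []):
            scores[pin_id] += score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]

add_pin("pin1", [("sloth sanctuary", 0.99), ("costa rica", 0.90)])
add_pin("pin2", [("costa rica", 0.95), ("travel", 0.80)])
print(search(["costa rica", "sloths"]))
```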

Related Pins

Related Pins are the list of recommendations you see under “more like this” after tapping on a Pin. Annotations are used to generate some of the features used by the related Pins model. In particular, the annotations for a Pin can be thought of as a sparse vector with indices corresponding to annotation ids and values corresponding to annotation scores. The cosine similarity between the annotation vectors of two Pins is a good measure of the relatedness of the two Pins. In the figure below, the two Maseratis are more similar to one another than they are to the Honda, and this is reflected in the cosine similarity scores.
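
A minimal sketch of that similarity computation, treating each Pin's annotations as a sparse {term: score} dictionary (the Pins and scores here are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse annotation vectors stored as {term: score} dicts."""
    dot = sum(score * b.get(term, 0.0) for term, score in a.items())
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

maserati_1 = {"maserati": 0.95, "sports cars": 0.90, "luxury cars": 0.85}
maserati_2 = {"maserati": 0.90, "sports cars": 0.85, "cars": 0.80}
honda      = {"honda": 0.95, "cars": 0.80, "sedans": 0.70}

print(cosine_similarity(maserati_1, maserati_2))  # higher: the two Maseratis
print(cosine_similarity(maserati_1, honda))       # lower: Maserati vs. Honda
```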

Content Safety Filter

Pinterest works hard on classifying content that goes against our community guidelines, such as self-injury and pornography. Annotations are one of the signals used by content safety filters to detect unsafe content and prevent our Pinners from encountering it.

System Overview

The workhorse of the annotation system is a weekly Scalding batch workflow that computes annotations for all Pins. The issue with such batch workflows, however, is that there may be a multi-day lag before annotations are computed for fresh Pins. To mitigate this, we also have an “Instant Annotator” service that computes annotations for fresh Pins within seconds of their creation and stores them in HBase. Annotation consumers can fall back to these instant annotations if the batch annotations have not yet been computed for a Pin.
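
As a hedged sketch of that consumer-side fallback, where the store arguments are hypothetical stand-ins for the weekly batch output and the HBase table written by the Instant Annotator:

```python
def get_annotations(pin_id, batch_store, instant_store):
    """Return batch annotations when available, otherwise fall back to instant annotations.

    `batch_store` stands in for the weekly batch output and `instant_store` for the
    HBase-backed instant annotations; both are plain dicts in this sketch.
    """
    annotations = batch_store.get(pin_id)
    if annotations:
        return annotations
    # Fresh Pin: the weekly workflow has not run yet, so use the instant annotations.
    return instant_store.get(pin_id, [])

batch_store = {"old_pin": [("sloths", 0.95)]}
instant_store = {"fresh_pin": [("costa rica", 0.88)]}
print(get_annotations("fresh_pin", batch_store, instant_store))
```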

See below for an overview of the various components, which are described in more detail in the following sections.

Annotations dictionary

Annotations are limited to a finite vocabulary known internally as the Dictionary. This dictionary is stored in a MySQL database along with additional metadata. A UI makes it easy to view dictionary terms, add new terms, delete terms and view change logs.

The advantage of using such a dictionary over allowing annotations to be arbitrary ngrams is that it guarantees the annotations will be valid and useful phrases instead of misspellings (e.g., “recipies”), stopwords (e.g., “the”), fragments (e.g., “of liberty”) and generic phrases (e.g., “ideas”, “things”). Furthermore, the dictionary is a convenient place to store additional metadata such as translation and knowledge graph relations. This dictionary is used by many teams at Pinterest and not just for annotations.

The dictionary initially started with popular topics that were manually entered by users, but it has grown to include additional sources of terms such as search queries, hashtags, etc. A significant amount of human curation has gone into building the dictionary to ensure its quality is maintained, and we periodically use heuristics to trim out bad terms and use a spell checker to remove misspellings. We have around 100,000 terms in the dictionary for each language.
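
For illustration only, a dictionary entry might carry fields along these lines; the names are hypothetical, not the actual MySQL schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class DictionaryTerm:
    """Hypothetical shape of a dictionary entry and its metadata."""
    term_id: int
    language: str                        # e.g. "EN"
    text: str                            # surface form, e.g. "sloths"
    canonical_id: Optional[int] = None   # e.g. "sloth" points at the canonical "sloths" entry
    translations: Dict[str, str] = field(default_factory=dict)  # language -> translated text
    related_term_ids: List[int] = field(default_factory=list)   # knowledge-graph relations
```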

Candidate Extraction

The first step in computing annotations for a Pin is to extract potential candidates from a variety of text sources such as:

  • Pin title, description, url
  • Board name and description
  • Page title and description of the link
  • Search queries that frequently lead to clicks on the Pin
  • Names of objects detected in the image using a visual classifier

The following steps are used to extract candidates (a simplified sketch follows the list):

  • A text language detector determines the language of the text.
  • The text is tokenized into words using a language-specific tokenizer.
  • A sliding window is used to generate all ngrams containing between 1 and 6 words.
  • The ngrams are normalized by stripping out accents and punctuation and then stemming or lemmatizing depending on the language.
  • Ngrams are matched against the annotations dictionary.
  • The extracted annotations are canonicalized to reduce duplication (e.g., “sloth” is canonicalized to “sloths” since it is not useful to have both of these annotations on a Pin). Canonical mappings are stored in the dictionary.
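
Here is a simplified, English-only sketch of the ngram, dictionary-matching and canonicalization steps; production uses per-language tokenizers, stemmers and lemmatizers rather than the toy normalization below:

```python
import re
import unicodedata

def normalize(text):
    """Strip accents and punctuation and lowercase (no stemming/lemmatization in this sketch)."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"[^\w\s]", "", text).lower().strip()

def extract_candidates(text, dictionary, canonical_map, max_len=6):
    """Generate 1-6 word ngrams, keep those found in the dictionary, and canonicalize them."""
    tokens = normalize(text).split()  # toy whitespace tokenizer
    candidates = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in dictionary:
                candidates.add(canonical_map.get(ngram, ngram))
    return candidates

dictionary = {"sloth sanctuary", "sloths", "sloth", "costa rica"}
canonical_map = {"sloth": "sloths"}
print(extract_candidates("The Sloth Sanctuary in Costa Rica", dictionary, canonical_map))
```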
Features

Features are extracted for each annotation candidate, to be used later for scoring.

Pin-annotation features:

  • TF-IDF
  • Embedding similarity: cosine similarity between the Pin embedding and the annotation embedding
  • Source: some text sources tend to yield higher quality annotations than others, and annotations extracted from multiple sources (e.g., both Pin title and board title) tend to be better than annotations present in only a single source (e.g., just board title)

Annotation features:

  • IDF
  • Category entropy: annotations that are popular across multiple categories tend to be more generic and less useful (see the sketch after this list)
  • Search frequency
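
As a concrete illustration of category entropy, one plausible formulation (not necessarily the exact production definition) is the Shannon entropy of an annotation's occurrence counts across categories; the counts below are made up:

```python
import math

def category_entropy(category_counts):
    """Shannon entropy (in bits) of an annotation's distribution over categories.

    Higher entropy means the annotation occurs broadly across categories and is likely
    generic; lower entropy means it is concentrated in a few categories and more specific.
    """
    total = sum(category_counts.values())
    entropy = 0.0
    for count in category_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

print(category_entropy({"travel": 900, "animals": 100}))   # concentrated: low entropy
print(category_entropy({"travel": 250, "animals": 250,
                        "food": 250, "home decor": 250}))   # generic: high entropy
```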

We found our model performed better when we normalized our features such that the value distribution was similar across language and Pin popularity (i.e., number of repins).
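
The post does not specify which normalization was used; one plausible sketch is a per-group percentile rank, which maps a feature onto the same [0, 1] scale within each language (or repin bucket):

```python
def percentile_rank_normalize(group_values):
    """Map raw feature values within one group (e.g. one language) to [0, 1] percentile ranks."""
    n = len(group_values)
    if n <= 1:
        return [0.5] * n
    ranks = {v: i for i, v in enumerate(sorted(set(group_values)))}
    m = len(ranks) - 1
    return [ranks[v] / m if m > 0 else 0.5 for v in group_values]

# Normalize a feature separately per language so its distribution looks similar across languages.
features_by_language = {"EN": [0.1, 0.5, 2.0], "DE": [10.0, 50.0, 200.0]}
normalized = {lang: percentile_rank_normalize(vals) for lang, vals in features_by_language.items()}
print(normalized)  # both languages end up on the same [0, 1] scale
```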

Model

Not all annotations we extract as candidates are relevant to the Pin. For example, take the following Pin description:

“The Sloth Sanctuary in Costa Rica is the only sloth sanctuary in the world. Click to read more about my journey there + see pics of baby sloths!”

From that description, we extract annotations such as “world”, “journey” and “read” that are not relevant to the Pin and do not make good keywords. The purpose of our model, then, is to score annotations so that we can filter out irrelevant ones and keep only the most useful.

Training labels are obtained through crowdsourcing, where judges are asked to label, for a given (Pin, annotation) pair, whether the annotation is relevant to the Pin. Around 150,000 labels per language are used.

Initially, we started with a logistic regression model that predicts the probability that an annotation is relevant to the Pin. This yielded decent results and performed much better than previous versions of annotations that did not use a model. Later, we migrated to a Gradient Boosted Decision Tree model trained with XGBoost. Switching to this model gave us a 4% absolute improvement in precision and simplified our feature engineering, since we could remove all monotonic feature transforms and no longer needed to impute values for missing features.
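
A minimal sketch of that setup with XGBoost; the feature matrix and labels below are random placeholders, and the hyperparameters are illustrative rather than the production configuration:

```python
import numpy as np
import xgboost as xgb

# Each row is one (Pin, annotation) pair; columns stand in for features such as TF-IDF,
# embedding similarity, source, IDF, category entropy and search frequency.
X = np.random.rand(1000, 6)
# Crowdsourced relevance judgments: 1 if the annotation is relevant to the Pin, else 0.
y = np.random.randint(0, 2, size=1000)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",  # predict the probability that the annotation is relevant
    "max_depth": 6,
    "eta": 0.1,
    "eval_metric": "logloss",
}
model = xgb.train(params, dtrain, num_boost_round=200)

# Score new (Pin, annotation) candidates and keep only the high-confidence ones.
scores = model.predict(xgb.DMatrix(np.random.rand(5, 6)))
keep = scores > 0.5
```

One practical note from the post: tree-based models like this handle missing values and arbitrary monotonic transforms natively, which is why the migration removed the need for imputation and feature transforms.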

Conclusion

High-quality keywords can be turned into incredibly useful signals with a variety of applications across recommendations, retrieval and ranking. Pinterest has seen many engagement and relevance wins through adopting such signals.

Acknowledgments: Thanks to Anant Subramanian, Arun Prasad, Attila Dobi, Heath Vinicombe, Jennifer Zhao, Miwa Takaki, Troy Ma and Yunsong Guo for their contributions to this project.

