Word Embedding and Sentence Embedding: The Tools We Use
If you’ve ever been to a foreign country where you don’t understand the language, you know how difficult it is to communicate.
Now think about computers.
As we all know, they speak the language of numbers. Everything that goes through a computer is converted into numbers and then converted back so humans can understand it. To work with text, computers need to represent words as numbers, which is exactly what word embeddings do.
This is not an easy task, and that's why there's so much research on the topic. If you're interested in word and sentence embeddings, you already know how important they are when you're building a solution that relies on a computer correctly understanding what a human is saying.
When we started developing KISSPlatform, our agile competitive intelligence tool, we used many different NLP (natural language processing) models to build it.
One of KISSPlatform's most important features is that people can input their ideas, describing them as if they were talking with a friend. Based on that description, KISSPlatform can surface relevant patents, tell you who your competitors are, and score the idea.
A key part of building our software was choosing the best word embedding and sentence embedding techniques to deliver the best experience.
Our ultimate goal is to make a competitive intelligence tool that’s as easy to use as your phone camera. Think of it as an idea selfie that you can re-take any time you want, iterating how the idea looks and even its surroundings. In the end, you can upload the best idea and beat your competitors #InnovationLife!
These are the tools we used for word embedding and sentence embedding.
Word Embedding Tools
This was our approach to creating word embeddings for NLP.
Word2Vec is a great tool for word embedding. We like the Gensim implementation.
Basically, Word2Vec places related words in the same neighborhood of a vector space: the more related two words are, the closer their vectors end up.
If you’re interested in learning more, there are really good Word2Vec tutorials out there. The Gensim word2vec tutorial is pretty good! I personally used it when I was learning how to train our earliest models. Shout out to Free Code Camp, too, which posted a tutorial of its own. I really like their clear explanations and great (and free!) tutorials.
I’ve been programming since I was 16, which is somewhat (*cough, cough*) pre-modern AI. Word2Vec was super helpful to learn when I wanted to teach myself AI programming. It’s a great way to start if you’re learning about word embedding, even if it has many limitations compared to other word embedding tools.
Next up: FastText. What is it, and what are the pros and cons of FastText vs. Word2Vec?
FastText, like Word2Vec, is a word embedding model, but they each treat words differently. Word2Vec considers words as the smallest language unit to train on. FastText instead uses character n-grams.
This difference can have a significant impact.
For example, think about “-ing” words (or gerunds for those of you who remember your English grammar classes). In Word2Vec, “thinking” is a single word. In FastText, it’s a collection of n-grams. It makes a big difference because it enables the model to potentially recognize both “think” and “thinking” as being related.
This difference means that FastText embeddings can be more accurate than Word2Vec embeddings, especially for rare words and morphological variants the model has barely (or never) seen.
Both the continuous bag-of-words (CBOW) and skip-gram architectures can be used to train FastText embeddings. One trade-off: training on n-grams increases the time required, whereas Word2Vec is simpler and therefore faster.
Words are great and n-grams are even better! But those are not the only ways to create word embedding vectors.
Doc2Vec starts with Word2Vec but instead of being limited only to words, it also considers the identity of the paragraph in which the words are found. In other words, it considers the context of the document for training.
We ultimately didn’t use Doc2Vec, although we did try it on the kinds of documents we train on, like company and product descriptions and patents.
A quick note on word embedding tools
Word embedding models like Word2Vec and Doc2Vec have some great advantages; one of them is that they’re fast and easy to train. They can be very helpful, but they’re also limited. For example, they might consider “red”, “yellow”, and “green” to be related. And they are, but not when you’re stopped at a traffic light!
The AI said they were all the same!
Deep Learning Models We Use
We also used deep learning models, in large part because they come pre-trained. These models all use a form of transfer learning: they’re pre-trained on a generic set of documents and can then be fine-tuned on your specific documents. How generic that initial set of documents really is matters a lot.
For example, we found that some “generic” models didn’t work as well for us because they were trained on a set of documents that may have been wide-ranging but still lacked many technical words. Training a model on Wikipedia is not the same as training it on arXiv!
These are the models we tried.
BERT stands for Bidirectional Encoder Representations from Transformers and was provided as an open-source model by Google AI Language researchers in 2018.
The model has already been trained on a huge number of web documents. Unsurprisingly, Google has quite a few of those!
We, just like many others, could then take this pre-trained model and fine-tune it on our own documents. In our case, we used US patent applications, plus company and product descriptions.
BERT is popular and there’s a good reason. Unlike other methods, it’s deeply context-based. This means that the context of words in relation to each other is important. Just like it would be in any language.
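A quick sketch of what “context-based” means in practice, using the Hugging Face Transformers library with the public `bert-base-uncased` checkpoint (an illustrative choice, not necessarily the model we deployed): the same word gets a different vector depending on the sentence around it.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vec(sentence: str, word: str) -> torch.Tensor:
    """Return BERT's contextual vector for `word` inside `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    # Find the position of the word's token in the input.
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = word_vec("the river bank was muddy", "bank")
v2 = word_vec("the bank approved the loan", "bank")
# The two "bank" vectors differ because their contexts differ.
sim = torch.cosine_similarity(v1, v2, dim=0)
```

A static embedding like Word2Vec would give “bank” one fixed vector for both sentences; here the cosine similarity is noticeably below 1.0.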
Unlike other context-based models, it’s bidirectional. Concepts like “pride of lions” vs “lion pride” wouldn’t confuse it.
The important thing when using BERT is to fine-tune it. If you want to learn how, with a Jupyter Notebook example no less, check out the official TensorFlow BERT fine-tuning tutorial.
After using BERT, we realized it’s not great for comparing the similarity of two sentences. Although BERT is bidirectional, there’s a limit to how much context it can handle (a maximum of 512 tokens, which can be words or sub-words), and comparing sentences requires feeding each pair through the network together, which gets expensive fast.
SBERT creates sentence embeddings rather than word embeddings, meaning that the context of words within a sentence isn’t lost. Complete code and documentation can be found on the SBERT website, created by the authors of the original paper.
We used one version of SBERT to create a more universal sentence embedding for multiple tasks. Although SBERT is not the only sentence embedding model, we’re huge fans of it!
Google Universal Sentence Encoder
Now, what if you want to create sentence embeddings for many tasks, not just for comparing two sentences? Then Google’s Universal Sentence Encoder (USE) comes to the rescue! It includes two different encoder architectures that can be used for fine-tune training: a Transformer or a Deep Averaging Network (DAN).
Building KISSPlatform has been a journey of testing which tools serve what we want to do, and of understanding that, as important as they are, they’re just a means to our final goal: supporting everyone with a unique idea to innovate and iterate faster and better.
This is just the list of tools we personally tried; there are plenty of others out there that can help you build your own software.