Domain Names & Word Vectors – a Powerful Combo
It’s been a while since I’ve posted to my own site. While I’m unable to share private datasets and related code, there are a lot of great tools to be excited about for anyone who uses data to help solve a problem.
One area I spend a lot of time on is text analytics, which can be applied in many ways: to determine the frequency of words in a book, related documents, websites, news articles, social media, etc. The words can be combined to create a unique dataset for your own purpose, which can later be used for sentiment analysis, identifying common themes, categorization, trend analysis or predicting associated words, for example. Word matrices assign numerical values to words, which allows us to make calculations (yes, you can calculate with words), for instance:
- “domain names are cool” reflected in the vector [76, 23, 13, 182] and
- “he owns really a cool domain” with the vector [44, 985, 32, 76, 182, 76]
Here you can easily spot ‘cool’, expressed by the integer value 182, and ‘domain’, expressed by the integer 76. Word vectors allow us to apply natural language processing and machine learning, which can help us reach our stated objective.
So how does this apply to the domain industry?
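To make the word-to-integer idea concrete, here is a minimal Python sketch. The helper names `build_vocab` and `encode` are hypothetical, and IDs are simply assigned in order of first appearance, so the numbers will not match the 76 and 182 above (those presumably come from a much larger vocabulary). Treat it as an illustration of the technique, not as the code behind any real project.

```python
def build_vocab(sentences):
    """Give each distinct lowercase word an integer ID, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                # IDs start at 1; 0 is left free (commonly reserved for padding).
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentence, vocab):
    """Replace every word in the sentence with its integer ID."""
    return [vocab[word] for word in sentence.lower().split()]


sentences = ["domain names are cool", "he owns really a cool domain"]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]

# IDs that appear in both encoded sentences correspond to the shared words.
shared_ids = set(encoded[0]) & set(encoded[1])
shared_words = sorted(w for w, i in vocab.items() if i in shared_ids)

print(encoded)
print(shared_words)
```

The shared IDs are what let you "spot" the same word across sentences: once text is numeric, comparing sentences becomes a set or array operation rather than string matching.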
Domains are simply words or combinations of words. Depending on the purpose, we can assign various components to help us recognize the value of that domain name. For example, in 2020, there were approximately 205,000 domain names registered with the term ‘covid’ or ‘corona’ in the name. Perhaps we want to understand related terms in this dataset, such as ‘mask’, ‘treatment’ or ‘vaccine’, that might influence demand and, in turn, the value of the domain name(s). Or perhaps we need to protect people from ingesting chloroquine and want to review all domain name registrations using that term to ensure the content is not supporting this potentially deadly idea. Text analytics can also be a powerful tool to identify related terms that might not come to mind, such as ‘fish tank cleaner’ or ‘bleach’ (and no, do not ingest any of these poisonous chemicals).
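A rough sketch of that kind of review in Python might look like the following. The domain list is made up for illustration (it is not taken from any registration dataset), and simple substring matching is an assumption that will produce false positives on real data, so read it as a starting point only.

```python
from collections import Counter

# Hypothetical sample; a real review would load registrations from a file or feed.
domains = [
    "covidmaskshop.com",
    "coronavaccineinfo.org",
    "bestcovidtreatment.net",
    "cooldomains.com",
    "covid-mask-outlet.com",
]

seed_terms = ("covid", "corona")
related_terms = ("mask", "treatment", "vaccine")


def label(domain):
    """Return the lowercase name without its final extension (e.g. '.com')."""
    return domain.lower().rsplit(".", 1)[0]


# Keep only the domains that contain at least one seed term.
matches = [d for d in domains if any(t in label(d) for t in seed_terms)]

# Count how often each related term shows up among the matching domains.
related_counts = Counter(
    term for d in matches for term in related_terms if term in label(d)
)

print(matches)
print(related_counts.most_common())
```

Swapping in a different watch list (for example ‘chloroquine’) only means changing `seed_terms`; the counting step stays the same.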
Word vectors are also great for discovering associations, alternatives and categorizations. Sure, we can open a thesaurus, but sometimes it’s just too literal. For example, the term ‘book’ returns the following results: magazine, novel, textbook, pamphlet, etc. Using a word matrix, however, I often discover a broader set of related words such as novel, story, chapter, comic, published, stories, etc., which, depending on your purpose, may be more applicable.
For anyone who prefers visual representations over long paragraphs of complicated text, understanding how to effectively represent data is, in my opinion, worth the investment. Below are two short videos to show how words are used in matrices and later reflected in interactive three-dimensional plots.
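One common way to find such associations is to rank words by the cosine similarity of their vectors. The sketch below uses tiny, hand-made three-dimensional vectors purely for illustration; they are invented values, not the output of any trained model, and `most_similar` is a hypothetical helper written here rather than a library call.

```python
import math

# Toy vectors, invented for this example. Real word vectors are learned from
# large text collections and usually have hundreds of dimensions.
vectors = {
    "book": [0.9, 0.8, 0.1],
    "novel": [0.85, 0.75, 0.2],
    "chapter": [0.6, 0.9, 0.3],
    "domain": [0.1, 0.2, 0.95],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, top_n=2):
    """Rank every other word by its cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


neighbours = most_similar("book", vectors)
print(neighbours)
```

With these toy values, ‘novel’ and ‘chapter’ rank closest to ‘book’, while ‘domain’ points in a very different direction and scores far lower.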
If you’re interested in data science, especially natural language processing, please connect. It’s an interesting field and I always appreciate learning how others are wrangling their text projects.