Machine Learning. You may have heard of it.
It’ll be the end of times. It’ll be the start of humanity. It’ll know when I want to buy more dog food. These are comments you’ve heard from your relatives, or even questions you’ve asked yourself. If you’re in an IT role, it may have usurped your other favorite question: “Can you set up my WiFi?”
Yes, machine learning is a big topic, and its future is contested by folks like Stephen Hawking and Ray Kurzweil. But it isn’t all doom and gloom, eternal life, and confused grandparents. No, most of the time, machine learning techniques are just used to get a handle on datasets that are more massive than malicious.
What you’re looking at (with generous hand-waving and oversimplification) are 2500 English words, organized by the similarities a computer found after it read Wikipedia. I was trying to reproduce an image made in 2008 by Joseph Turian, who was building on the work of Laurens van der Maaten, Geoffrey Hinton, Ronan Collobert, and Jason Weston. My only goal, besides learning, was to make some fun artwork for my flat.
If you’re curious about the low-level view of what I just wrote, read the following couple sections. I swear they’re still pretty high-level, with links to more depth if you’re into that sort of thing.
If you’re like a good majority of my family and friends, who requested “something we can understand, you nerd”, scroll down until you start seeing some technicolor images, where you can read more about word graphs or just stare at the pretty colors.
If you’re like the editors here at CircleCI and wonder how this at all relates to what I’m paid to do day to day, keep your eyes peeled for a future post on the testing mechanisms I used to build out this project.
tSNE (t-distributed Stochastic Neighbor Embedding) is a machine learning technique developed by Van der Maaten and Hinton in 2008. tSNE reduces the dimensionality of datasets in a way that preserves localized similarities.
While a reduction of dimensions sounds like a really foreign concept, we as humans are actually ridiculously good at it. Think about painting. When you look at a painting, you are effectively viewing a three-dimensional space in two dimensions. Now, before humans understood how vanishing points and perspective worked, our mapping of three dimensions to two was poor. Local similarities were poorly preserved.
Now, imagine that instead of our paintings representing three dimensions, they represent four. Okay, now 10. No, now 500. I would be willing to guess that your ability to imagine 4D was pushing the envelope, and that 500 dimensions was a few too many.
Computers don’t have this problem with massive numbers of dimensions in datasets. To them, it’s just a walk in the binary park (sorta). But we humans really like three dimensions at most and far prefer two when we can get it. tSNE’s incredible value is that it can take that 500-dimension dataset and reduce it to 2D such that the space ends up useful, rather than confusing. I.e., tSNE is able to reduce dimensions while losing less information than a more naive approach would.
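If you want to see the reduction in action, here’s a minimal sketch using scikit-learn’s TSNE. The data here is just random stand-in points, not real word embeddings; the shapes mirror the “many dimensions down to two” idea.

```python
# Minimal sketch: reduce a synthetic 50-dimensional dataset to 2D with tSNE.
# The random points are a stand-in for a real dataset like word embeddings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 50))  # 200 points, each in 50 dimensions

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points_2d = tsne.fit_transform(points)

print(points_2d.shape)  # each point now lives in 2D, ready to plot
```

The `perplexity` knob roughly controls how many neighbors each point “cares about” when the layout is computed; it’s one of the few parameters worth fiddling with.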
I first encountered the original Turian image in 2013 while attending the University of Toronto. Hinton’s machine learning course was one of my first exposures to the subject. The course is available via Coursera and is a pleasant introduction to the field.
SENNA and Word Embeddings
Having the ability to view large datasets is great, but it does require something rather obvious: the dataset. A potentially non-obvious choice for a dataset is all of Wikipedia (yes, all of it). In 2008, Collobert and Weston turned their SENNA algorithm loose on a download of Wikipedia and turned the largest encyclopedia in human history into a bunch of numbers.
That bunch of numbers is known as a language model and is traditionally stored as a matrix. The matrix they built is 130,000x50: 130,000 words, each represented by 50 decimal values. That model is useful because it’s understandable by a machine and can be used for various predictions, like which word is most likely to follow another word in a sentence, and so on.
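To make the “words as rows of numbers” idea concrete, here’s a toy sketch of how such a matrix can answer “which words are similar?”. The vocabulary and vectors below are made up for illustration; a real model like SENNA’s would be 130,000 rows of 50 numbers each.

```python
# Toy word-embedding matrix: each row is a word's vector. Similar words get
# similar rows. We rank neighbors by cosine similarity.
import numpy as np

vocab = ["dog", "cat", "puppy", "car"]
embeddings = np.array([
    [0.90, 0.10, 0.30],  # dog
    [0.80, 0.20, 0.40],  # cat
    [0.85, 0.15, 0.35],  # puppy
    [0.10, 0.90, 0.70],  # car
])

def nearest(word, k=2):
    # Cosine similarity between the query row and every row of the matrix.
    vec = embeddings[vocab.index(word)]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(vec)
    sims = embeddings @ vec / norms
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != word][:k]

print(nearest("dog"))  # ['puppy', 'cat'] -- 'car' is the odd one out
```

It’s exactly this kind of row-by-row similarity that tSNE preserves when it squashes the 50 columns down to two for plotting.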
Turian’s original image was only 2,500 words. Throwing caution to the wind, I decided to see if I could render all 130,000 possible words. My laptop was unhappy with these attempts and bucked most tries. However, I was able to get 11,340 words to render without too much heartache.
I chose the words based on some basic criteria:
- Is this one of the 3,000 most common words in the English language?
- Is this a location in the world (city, state, country, etc.)?
- Is this a number, letter, or symbol?
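The selection step above can be sketched as a simple filter. The word lists here are tiny hypothetical stand-ins, not the real 3,000-word or places lists:

```python
# Hypothetical sketch of the word-selection criteria: keep a vocabulary word
# if it is a common word, a place, or a single character (number/letter/symbol).
common_words = {"the", "dog", "run"}      # stand-in for the 3,000 most common
places = {"toronto", "canada"}            # stand-in for the places list

vocab = ["the", "toronto", "7", "xylophone", "!"]

def keep(word):
    w = word.lower()
    return w in common_words or w in places or len(w) == 1

selected = [w for w in vocab if keep(w)]
print(selected)  # ['the', 'toronto', '7', '!']
```

Running every word of the 130,000-word model through a filter like this is what whittled the set down to something a laptop could render.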
Then, I rendered each dataset in four different ways. The first (and closest to how Turian rendered his) was a two-dimensional dataset turned into a single image. The second through fourth were three images of a three-dimensional dataset, rendered in colors where the hue represents the axis not being plotted. So, if you have a graph with three axes (x, y, and z), you graph x against y while z becomes the color; then y against z while x becomes the color; and so on.
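That hue trick can be sketched in a few lines of matplotlib. The points below are synthetic stand-ins for a 3D tSNE output; the left-out z axis is normalized to [0, 1] and used as the hue of each dot.

```python
# Sketch of the "hue encodes the missing axis" rendering: plot x vs y,
# and color each point by its z value. Synthetic points stand in for
# real tSNE output.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from matplotlib import colors

rng = np.random.default_rng(1)
xyz = rng.uniform(size=(500, 3))

# Normalize z to [0, 1], then treat it as the hue in HSV (full sat/value).
z_raw = xyz[:, 2]
z = (z_raw - z_raw.min()) / (z_raw.max() - z_raw.min())
hsv = np.stack([z, np.ones_like(z), np.ones_like(z)], axis=1)
rgb = colors.hsv_to_rgb(hsv)

fig, ax = plt.subplots()
ax.scatter(xyz[:, 0], xyz[:, 1], c=rgb, s=8)
fig.savefig("xy_hue_z.png")
```

Swapping which column is plotted and which becomes the hue gives the other two images of the trio.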
Below are the three datasets and associated images. The images come in two flavors: wall art and svg. If you click on the svg version, it’ll load in your browser, and you can find a word you’re curious about with CMD+F (or CTRL+F on Windows).
NOTE: The full prints below are tens of MB in size, which is why they’re not inlined here.
Places in the world
In the above image, we’re looking at 8,401 places in the world laid out in two dimensions. The full set of “places in the world” numbered 46,201, but the trained model only knew about roughly eight thousand of them.
Provinces and states
3,000 most common words
These are the 3,000 most common words in the English language. Only 117 of these words overlap with the Places in the World set.
A strong concentration of verbs
Print Quality Images!
Note that all original images can be found here in varying formats (png, svg).