MLX Week 1
I recently joined an ML bootcamp in London (mlx.institute). It’s a program run by ex-startup founders and big-tech software engineers.
Having completed the first week, I can say that being in a room with a bunch of smart people working towards the same goal is one of the most exciting and refreshing feelings. The program was quite intense, running from 9-6 every day, even on holidays, but the challenge was clearly there.
This week’s task was to predict the number of upvotes (likes) a Hacker News (an online story/post sharing website) post receives from its title, posting time, URL domain and author. The other (arguably more important) goal was to learn, through this ‘toy’ task, some modern natural language processing (NLP) methods and to do some hands-on coding.
I say this because we put a fair bit of effort into understanding a core paper, Word2Vec (Mikolov et al., 2013). Word2Vec is a shallow neural network, trained in a self-supervised manner on a large body of text, that represents single words as vectors in a space capturing the semantic (meaning) and syntactic (usage) properties of each word. We trained Word2Vec on text8 (the first \(10^8\) bytes of Wikipedia). This was a taste of ML research: we first had to implement the model from the paper, then explore its hyperparameter and training-parameter space.
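As a rough illustration (not our exact code), a skip-gram Word2Vec in torch can be as small as an embedding layer plus a linear layer scoring every vocabulary word as a possible context word. The vocabulary size, dimensions and random batch below are placeholders, and I show the plain full-softmax version for brevity; the paper’s negative sampling and hierarchical softmax are what make training at scale tractable.

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    """Minimal skip-gram Word2Vec: predict a context word from a centre word."""

    def __init__(self, vocab_size: int, embedding_dim: int = 100):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)  # centre-word vectors (the ones we keep)
        self.out_embed = nn.Linear(embedding_dim, vocab_size)    # scores over the whole vocabulary

    def forward(self, centre_ids: torch.Tensor) -> torch.Tensor:
        # centre_ids: (batch,) -> logits over the vocabulary, one row per centre word
        return self.out_embed(self.in_embed(centre_ids))

# Each (centre, context) pair from a sliding window is one classification example.
model = SkipGram(vocab_size=50_000)
loss_fn = nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

centre = torch.randint(0, 50_000, (512,))   # placeholder batch of centre-word ids
context = torch.randint(0, 50_000, (512,))  # placeholder batch of context-word ids
optimiser.zero_grad()
loss = loss_fn(model(centre), context)
loss.backward()
optimiser.step()
```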
NLP and computational linguistics are deep fields and we’ve barely scratched the surface, but this week can be thought of as a programmer’s introduction to the field.
By taking a bag-of-words representation, we assume that humans ‘judge’ a title based on buzzwords, and that some aggregation mechanism (here, the mean of the word vectors) accurately captures the appeal of the title. Other approaches clearly exist, such as one-hot encoding titles with popular nouns, or sequence-based models such as recurrent neural networks or self-attention that capture the interactions between the different words in a title. What we really want is to represent the meaning of a title and use that meaning for downstream tasks.
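Concretely, the mean pooling looks something like the sketch below. `word_to_id` and `embeddings` stand for the vocabulary and embedding matrix coming out of the trained Word2Vec model; the names and the naive whitespace tokenisation are illustrative rather than our actual pipeline.

```python
import torch

def title_embedding(title: str, word_to_id: dict[str, int], embeddings: torch.Tensor) -> torch.Tensor:
    """Mean-pool the Word2Vec vectors of a title's known words (bag-of-words)."""
    ids = [word_to_id[w] for w in title.lower().split() if w in word_to_id]
    if not ids:  # no in-vocabulary words: fall back to a zero vector
        return torch.zeros(embeddings.shape[1])
    return embeddings[torch.tensor(ids)].mean(dim=0)
```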
In the end, we also realised the original regression task was too hard and switched to a classification task over bins corresponding to the rough order of magnitude of a post’s score.
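The exact bin edges we settled on aren’t important here, but the idea is roughly this (illustrative thresholds):

```python
import math

def score_to_bin(upvotes: int) -> int:
    """Map an upvote count to a rough order-of-magnitude class:
    0 for 1-9, 1 for 10-99, 2 for 100-999, 3 for 1000+."""
    return min(int(math.log10(max(upvotes, 1))), 3)
```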
I’d describe our outcome as
- a Word2Vec model
- a multi-feature model that combines a bag-of-words embedding of the post title with the other post features (see the sketch after this list). This was packaged into a Streamlit app and dockerised.
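A minimal sketch of what such a multi-feature model can look like, assuming the title has already been mean-pooled into a vector and the other features are a domain id and posting hour; the feature set, dimensions and class name here are illustrative, not our exact architecture.

```python
import torch
import torch.nn as nn

class UpvoteClassifier(nn.Module):
    """Concatenate the title embedding with other post features, then classify the score bin."""

    def __init__(self, title_dim: int = 100, n_domains: int = 1000, n_bins: int = 4):
        super().__init__()
        self.domain_embed = nn.Embedding(n_domains, 16)  # learned embedding for the URL domain
        self.head = nn.Sequential(
            nn.Linear(title_dim + 16 + 1, 64),  # title vector + domain vector + posting hour
            nn.ReLU(),
            nn.Linear(64, n_bins),
        )

    def forward(self, title_vec, domain_id, hour):
        x = torch.cat([title_vec, self.domain_embed(domain_id), hour.unsqueeze(-1)], dim=-1)
        return self.head(x)

# Usage with placeholder data:
model = UpvoteClassifier()
logits = model(
    title_vec=torch.randn(8, 100),           # batch of mean-pooled title vectors
    domain_id=torch.randint(0, 1000, (8,)),  # batch of domain ids
    hour=torch.rand(8) * 24,                 # posting hour as a float feature
)
```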
I’d do a few things differently next time. First, I’d build a strong baseline with the simplest possible model (an XGBoost or a linear regression). Next, I’d do more EDA on the titles, even just printing out some popular posts. Finally, I’d explore training the model end-to-end too.
Another challenge during the project was moving fast, which meant code constantly needed refactoring. One trick I picked up from a previous mentor is to define the interfaces of objects early (a small sketch below), which lets you form abstractions quickly and massively reduces cognitive load when coding.
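In Python this can be as light as a `typing.Protocol`; `TitleEncoder` below is a hypothetical name to illustrate the trick, not an interface from our codebase.

```python
from typing import Protocol

import torch

class TitleEncoder(Protocol):
    """Anything that turns a post title into a fixed-size vector."""

    def encode(self, title: str) -> torch.Tensor: ...

# Downstream code (the classifier, the Streamlit app) depends only on this
# interface, so a mean-pooled Word2Vec encoder can later be swapped for an
# RNN or self-attention encoder without refactoring the callers.
```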
Tools learnt:
- Machine Learning: PyTorch (torch)
- ML Infra: Weights & Biases (training monitoring), Hugging Face (dataset storage)
- Deployment: Docker, Streamlit
Looking forward to next week.
References
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. ‘Efficient Estimation of Word Representations in Vector Space’. arXiv, 7 September 2013. https://doi.org/10.48550/arXiv.1301.3781.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. ‘Distributed Representations of Words and Phrases and Their Compositionality’. In Advances in Neural Information Processing Systems, Vol. 26. Curran Associates, Inc., 2013. https://proceedings.neurips.cc/paper_files/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.