I started studying machine learning about 8 months ago. I was a programmer before that.
Since then I have worked on a few problems I found interesting. I spent a lot of time on reinforcement learning algorithms on Atari games and on another environment I developed. I’ve also tried some small neural network architectural ideas and played a bit with NLP.
This is a list of practices I adopted along the way that I think are worth sharing. I’ve also open-sourced most of the helper classes I wrote at github.com/lab.
Debugging machine learning projects, especially reinforcement learning, is very hard, so you have to try your best to get it right the first time. Once, I spent two days trying to get code working; it was painful. Overestimating myself was the biggest mistake I made. Compared to normal programming, it’s far easier in machine learning to make subtle mistakes that are very hard to catch.
I realized it was more efficient to spend about twice as long on coding as I usually would. I write a small bit of code, read it, then write more, and so on. Although it feels slow, it has been more efficient overall because of the debugging time saved.
Although this was obvious from the beginning, it took me a few bugs and days of debugging to adopt it.
It is better to write small pieces of code and test them before adding more. For example, if you are trying Q-learning, it makes sense to first have a simple model, without double-Q, prioritized replay, recurrence, etc. Once you get the simplest version working, you can add the other complexities one after the other. This way you will only have a small amount of code to debug if it doesn’t work. Also, since the simpler versions often run faster, it’s easier to iterate when you are fixing things.
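As a rough illustration of what a "simplest version" could look like for Q-learning, here is a sketch of a plain tabular update with none of the extras. The function names, the table layout and the hyperparameter values are assumptions made for this example; they are not taken from lab.

```python
from collections import defaultdict


def make_q_table(n_actions):
    """Q-table mapping state -> list of action values, all starting at 0."""
    return defaultdict(lambda: [0.0] * n_actions)


def q_update(q, state, action, reward, next_state, done, *, lr=0.1, gamma=0.99):
    """Apply one plain Q-learning update in place and return the TD error."""
    # Target: r + gamma * max_a' Q(s', a'); no bootstrapping on terminal steps.
    bootstrap = 0.0 if done else max(q[next_state])
    td_error = reward + gamma * bootstrap - q[state][action]
    q[state][action] += lr * td_error
    return td_error


# Hypothetical transition: taking action 1 in "s0" ends the episode with reward 1.
q = make_q_table(n_actions=2)
q_update(q, "s0", 1, 1.0, "s1", True, lr=0.5)
```

Because the update is a single small function, it can be checked by hand on a couple of transitions before anything like a replay buffer or a neural network is added.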
Another advantage is that you will have some tangible results from the beginning and will be gradually improving.
You should keep in mind the problem you are solving. The evaluation criteria should measure success in solving that problem. For example, if the final benefit of your model is saving money, you should be able to estimate how much money a given model can save. Otherwise, you could get caught up in optimizing some loss whilst making little progress on the actual problem.
You also need a baseline performance level and a theoretical optimum. If there are other solutions, you should know the state-of-the-art performance. This way you can check your relative progress and know the limits.
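One possible way to put a number on relative progress is to report how much of the gap between the baseline and the optimum a model closes. This normalization is an assumption for illustration, not a standard the problem dictates, and `relative_progress` is a hypothetical helper.

```python
def relative_progress(score, *, baseline, optimal):
    """Fraction of the baseline-to-optimal gap closed by `score`.

    0.0 means no better than the baseline, 1.0 means at the theoretical optimum.
    Works for lower-is-better metrics too, as long as `optimal` is the best value.
    """
    if optimal == baseline:
        raise ValueError("optimal and baseline must differ")
    return (score - baseline) / (optimal - baseline)


# Made-up numbers: baseline saves 100, the best possible saving is 200, the model saves 150.
progress = relative_progress(150.0, baseline=100.0, optimal=200.0)
print(progress)
```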
Saving pre-processed data is kind of obvious, but at the beginning I didn’t do it because I was too lazy to write the code for saving and loading data. The pre-processing could take more than a minute depending on the dataset. This could amount to hours in total when repeated across many trials.
Now, I save data after pre-processing all the time. I usually save the data in NumPy arrays and meta information in JSON files.
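A minimal sketch of that layout, assuming one `.npy` file per array and a single `meta.json` next to them. The directory structure and the function names are assumptions for this example.

```python
import json
from pathlib import Path

import numpy as np


def save_preprocessed(directory, arrays, meta):
    """Write each array to <name>.npy and the meta information to meta.json."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name, array in arrays.items():
        np.save(directory / f"{name}.npy", array)
    with open(directory / "meta.json", "w") as f:
        json.dump(meta, f, indent=2)


def load_preprocessed(directory):
    """Load every .npy file in the directory plus meta.json."""
    directory = Path(directory)
    with open(directory / "meta.json") as f:
        meta = json.load(f)
    arrays = {path.stem: np.load(path) for path in directory.glob("*.npy")}
    return arrays, meta
```

A training script can then check whether the directory exists and only run the slow pre-processing when it does not.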
You should start saving models from the prototyping stage, especially if the training takes more than a couple of minutes.
I often found myself training the simplest versions (to check code correctness) without model saving/loading code, and regretted it later when I wanted to play with the trained model (try predictions, visualize, etc.).
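Even a prototype can carry a few lines of checkpointing. The sketch below assumes the model parameters are a picklable Python object, and the helper names are hypothetical. With PyTorch, `torch.save` on the model’s `state_dict()` would take the place of `pickle`.

```python
import pickle
from pathlib import Path


def save_checkpoint(path, params, step):
    """Save parameters and the training step, replacing any older checkpoint."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so an interrupted save
    # does not corrupt the previous checkpoint.
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)
    tmp.replace(path)


def load_checkpoint(path):
    """Return (params, step) from a checkpoint written by save_checkpoint."""
    with open(path, "rb") as f:
        checkpoint = pickle.load(f)
    return checkpoint["params"], checkpoint["step"]
```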
When I started off, I didn’t keep the experiments and results organized - looking through unorganized logs was hard, reproducing results was often impossible, and I lost most of the analytics data.
It’s quite important to be able to re-evaluate old experiments and to keep the old results. Most people seem to save the config files of the experiments.
I chose to use a separate Python file for each experiment, and I keep all the key experimental variations. I also keep track of the git commits along with experiment results.
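One way to tie results to a commit is to write a small JSON record per run. This is only a sketch under assumed names (`save_run_record`, the record fields) and is not how the lab tools do it. The only git command it relies on is `git rev-parse HEAD`.

```python
import json
import subprocess
import time
from pathlib import Path


def current_git_commit():
    """Return the current commit hash, or None if git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def save_run_record(path, *, experiment, config, results, commit=None):
    """Write one JSON record linking an experiment, its commit, config and results."""
    record = {
        "experiment": experiment,
        "commit": commit if commit is not None else current_git_commit(),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "config": config,
        "results": results,
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record
```

Committing before each run keeps the recorded hash meaningful; uncommitted changes would not be captured by this record.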
I have open-sourced the tools I used to organize experiments at github.com/lab.
Most of the time, ideas for analysing training logs and stats pop up after the training is complete. So it’s best to keep everything stored and ready to be analyzed.
I used to log only the stuff that I thought was important at the time, and I always wanted more when analyzing. So now I log everything I can think of, down to the gradients and weights. If the models are big, I pick some random nodes from each layer for logging.
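Picking random nodes works best when the same nodes are logged at every step. A sketch of one way to do that, with hypothetical layer names and helper functions, is to choose the indices once with a fixed seed:

```python
import random


def pick_logged_indices(layer_sizes, per_layer=3, seed=0):
    """Choose a fixed random subset of node indices for each layer."""
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(range(size), min(per_layer, size)))
        for name, size in layer_sizes.items()
    }


def select_for_logging(values_by_layer, indices):
    """Keep only the pre-selected nodes from each layer's values."""
    return {
        name: [values[i] for i in indices[name]]
        for name, values in values_by_layer.items()
    }


# Hypothetical layer sizes; the indices are chosen once, before training starts.
indices = pick_logged_indices({"fc1": 100, "fc2": 2})
```

Inside the training loop, `select_for_logging` would be applied to the flattened weights or gradients of each layer before writing them to the log.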
Also, histograms and distributions are way more useful than summary statistics. They help you spot problems and opportunities for improvement. For instance, with a histogram of losses, you will see whether you have a large loss on a few samples or whether the losses are roughly normally distributed.
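A toy example of why the summary can hide this: the two sets of per-sample losses below are made up so that they share the same mean, yet their histograms look nothing alike. The `histogram` helper is a plain-Python stand-in written for this sketch.

```python
from statistics import mean


def histogram(values, bin_edges):
    """Count values per [lo, hi) bin; the last bin also includes its upper edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            is_last = i == len(counts) - 1
            if lo <= v < hi or (is_last and v == hi):
                counts[i] += 1
                break
    return counts


# Made-up per-sample losses with identical means.
even_losses = [1.0] * 10
spiky_losses = [0.5] * 9 + [5.5]  # one sample carries most of the loss

edges = [0.0, 2.0, 4.0, 6.0]
print(mean(even_losses), histogram(even_losses, edges))
print(mean(spiky_losses), histogram(spiky_losses, edges))
```

The mean alone cannot tell these two runs apart, while the last bin of the second histogram immediately points at the single hard sample.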
I have included some code I wrote to log histograms of NumPy arrays or PyTorch tensors to TensorBoard summaries in lab.
TensorBoard is super useful and easy to use. Use it for monitoring and analyzing.
Measure time taken and progress for each component. It helps figure out where to improve efficiency. Also, it shows you what your program is doing when running.
I time most of the initialization code, and the main steps in the training loop, like sampling, processing samples, training, calculating validation error, etc.
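The lab helpers are not reproduced here. As a minimal sketch of the idea, a context manager can accumulate wall-clock time per named section. The `timed` helper, the section names and the toy workload are all assumptions for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Total seconds spent in each named section.
timings = defaultdict(float)


@contextmanager
def timed(name):
    """Add the wall-clock time spent inside the with-block to timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start


# Toy loop standing in for the sampling and training steps.
for _ in range(3):
    with timed("sample"):
        batch = [i * i for i in range(1000)]
    with timed("train"):
        total = sum(batch)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f}s")
```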
Again, I’ve included helper classes I used for monitoring in lab.
In the first couple of weeks, I tended not to worry much about software design. I hacked together experiments with copy-pastes from earlier experiments and some changes. It was faster to prototype, but it became a problem later, like with any other software project.
So if you plan on working on a project long-term, it is better to use good design from the start, or to rewrite with a well-thought-out design after trying the prototypes, which is usually what I do.
Good design and code reuse save you time: the code is shorter, it is less prone to bugs because you write less of it, and it is easier to maintain and fix.
Good programming practices such as type hints, declaring constants, meaningful names, named parameters, etc. help as the code gets bigger. They help you read and understand the code without having to navigate across the code base.
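To make these practices concrete, here is a small hypothetical example: a linear exploration schedule written with named constants, type hints and keyword-only parameters. The schedule and its values are assumptions chosen only to illustrate the style.

```python
# Named constants instead of magic numbers scattered through the code.
EPSILON_START: float = 1.0
EPSILON_END: float = 0.1
EPSILON_DECAY_STEPS: int = 10_000


def epsilon_at(step: int, *, start: float, end: float, decay_steps: int) -> float:
    """Linearly anneal the exploration rate from `start` to `end` over `decay_steps`."""
    if step <= 0:
        return start
    if step >= decay_steps:
        return end
    fraction = step / decay_steps
    return start + (end - start) * fraction


# Keyword-only arguments make the call site readable without opening the definition.
epsilon = epsilon_at(
    5_000,
    start=EPSILON_START,
    end=EPSILON_END,
    decay_steps=EPSILON_DECAY_STEPS,
)
```

A call like `epsilon_at(5000, 1.0, 0.1, 10000)` would be rejected, so the meaning of each number always stays visible where the function is used.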
I found myself reading my code a lot more than with other software projects I had worked on. There were occasions where I had to refer to old code that I had thrown away too. So it’s good to keep the code readable even if you plan on throwing it away.
Also, refactoring tools work well when the code uses type hints and named parameters.
Commenting complex logic and simplified equations improves code readability. It also helps you write bug-free code. If you see the derivations next to the code, you are less likely to make a mistake or a typo.
I sometimes used literate programming, with LaTeX for the maths. Too bad the IDE doesn’t render it, so I generate HTML pages and proofread the code there. It also helps me understand algorithms better. I still refer to some of the old annotated RL algorithms I wrote when I forget stuff.
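As an example of keeping the derivation next to the code, here is a sketch of computing discounted returns, with the steps from the definition to the recursion written as comments. It is an illustration written for this post, not one of the annotated algorithms.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], *, gamma: float) -> List[float]:
    """Return G_t for every step t of a single episode."""
    # Definition:  G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    # Factor out:  G_t = r_t + gamma * (r_{t+1} + gamma * r_{t+2} + ...)
    # Recursion:   G_t = r_t + gamma * G_{t+1}, with G = 0 after the last step
    returns = [0.0] * len(rewards)
    next_return = 0.0  # G_{t+1}; zero beyond the end of the episode
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * next_return
        returns[t] = next_return
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

With the recursion spelled out above the loop, it is easy to check that the loop runs backwards and that `next_return` really plays the role of G_{t+1}.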
IDE features like refactoring, go to definition, etc. are quite useful as your code base grows. Also, good auto-complete is handy when you start using new libraries.
With Jupyter Notebooks, you can write code in small sequential steps: write a small piece of code, run it, check results, go to next step.
But notebooks become unfriendly as things get more complex.
Most of the time, I use a notebook when I start working on a new problem. Then I rewrite it in Python with proper design.
I rely on TensorBoard for the basic analytics, and for everything else I use Jupyter notebooks.
You can do custom analyses by writing code. It might be harder than clicking a button on TensorBoard, but you can do precisely what you want. You can control everything from the type of visualization to how axis labels and ticks are placed.
Also, notebooks keep visualizations saved, so you can check them anytime without running code.
I’ve included some helper classes I used for creating custom visualizations of TensorBoard summaries in lab.
On notebooks, you can check and test each step as you are parsing data. You can output samples and statistics after each pre-processing step. Since the state is in memory, you don’t have to re-run all the previous steps when you make a change.
When it’s your own code, it’s easier to play with. Using libraries for the core parts you want to experiment with is a bad idea, because it becomes difficult later when you want to make changes.
This is different from using libraries in other software development work, where you don’t have to deviate from standards. But in most machine learning scenarios, a lot of things are still kind of alpha, and you want to try variations to improve them.
It’s easy to become outdated if you stay focused on a project and don’t read for a while.
A journal helps organize your ideas. It acts as a reference and helps you avoid repeating mistakes.
The journal also helped me get back to work after returning from long vacations. You can start from where you left off.
I use a long Markdown document as the journal. I also used a Moleskine for some projects.
Tracking time is quite important for me because I am working by myself. It’s easy to lose focus and waste time.
At first, I tracked time in the journal itself, but later I started using a calendar app.