Reproducing GPT-2 from scratch is a great way to build an understanding of deep learning and large language models. A number of learning resources about the GPT-2 architecture are available, including Andrej Karpathy's free Zero to Hero course and Sebastian Raschka's book Build a Large Language Model from Scratch. While these resources provide excellent coverage of how GPT-2 works under the hood, there are a few practical challenges you'll face when actually training on expensive hardware that they don't cover in much depth.

It's important to understand the process of provisioning servers and the details of what you're doing, or you can rack up an unnecessarily large bill. I recently used a modified version of the reference PyTorch implementation from Karpathy's llm.c project to do my own reproduction of the model (this version appeared to be more up to date than the NanoGPT and build-nanogpt variants at the time I ran it). In this post I'll share some lessons learned and offer suggestions in case you're considering reproducing the model yourself.

Many cloud providers offer VPSs with GPUs attached. I've seen Lambda Labs recommended here and there, so I checked their service first. One thing that I made sure to do was to ask for some trial credits - if they weren't available, I would have chosen a different service. This instinct paid off because I made a number of mistakes and would have felt bad about paying the bill afterwards.

Lambda Labs instances worked with no fuss. If you've used VPSs like DigitalOcean's droplets or Linode, Lambda's instances work the same way: you upload your SSH public key and remote into an Ubuntu instance.

If, like me, you haven't rented expensive GPUs before, it's important to know how they're billed. I was used to small Linux VPSs that charge $5-$10/mo, but servers set up for GPU workloads cost far more than that - about $24/hr for the instance I was interested in. Billing is a flat hourly rate that accrues by the minute. It doesn't matter whether you're actively running any jobs, or even whether the instance is currently running: the second you provision it, the clock starts ticking, and you keep being billed until you destroy the instance and any data on it. This has major implications for how you'll want to organize your training.
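It's worth doing a quick back-of-the-envelope estimate before the meter starts running. Something like this, where the rate is the ballpark figure above and the hours are made-up placeholders you should swap for your own:

```python
# Back-of-the-envelope cost estimate before provisioning an 8 x H100 instance.
# The rate matches the ballpark quoted above; the hours are placeholders.
hourly_rate = 24.00      # USD/hr, accrued by the minute, running or not
prep_hours = 1.5         # download + tokenize + shard
training_hours = 3.0     # the actual run
overhead_hours = 0.5     # setup, copying files back, forgetting to shut down

total = hourly_rate * (prep_hours + training_hours + overhead_hours)
print(f"Estimated bill: ${total:,.2f}")
```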

Let's say you wanted to train the network on 8 x H100s. If you rented a server with this setup, which costs about $24/hr, you would soon discover that you spend a good chunk of your time on data preparation. That involves downloading, tokenizing, and sharding the data - in my case the 10B-token sample of fineweb-edu. This is an IO- and CPU-bound operation, so all those expensive GPUs won't make it go any faster. If you run all that prep and your training run then fails, you're out $20.
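For a sense of what that prep step looks like, here is a rough sketch: download the fineweb-edu sample, tokenize it with the GPT-2 encoder, and write fixed-size shards of token ids. The shard size and filenames are my own choices, and llm.c's dev/data script does the same job with multiprocessing and a slightly different on-disk format.

```python
# Rough sketch of data prep: download, tokenize, and shard fineweb-edu into
# flat .bin files of uint16 GPT-2 token ids. Not the actual llm.c script.
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token                      # <|endoftext|>, used to separate documents
shard_size = 100_000_000                 # tokens per shard (my own choice)

ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")

buf, shard_idx = [], 0
for doc in ds:
    buf.append(eot)
    buf.extend(enc.encode_ordinary(doc["text"]))
    while len(buf) >= shard_size:
        np.array(buf[:shard_size], dtype=np.uint16).tofile(f"fineweb_train_{shard_idx:06d}.bin")
        buf = buf[shard_size:]
        shard_idx += 1
```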

My recommendation would be to start out with the cheapest server that is close to what you want to use for a full run. For example, if you eventually want to run on 8 x H100s, you'd ideally want a cheaper server that still has multiple GPUs with at least 40GB of memory each. That way you can fiddle with distributing the load and tuning hyperparameters, which improves your chances of getting things right on the real run.
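Before paying for the big machine, it's worth confirming on the cheap one that the GPUs show up the way you expect. A check as simple as this is enough:

```python
# Quick check that the cheap box looks the way the expensive one will:
# every GPU visible, with the memory you expect.
import torch

for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {p.name}, {p.total_memory / 2**30:.0f} GiB")
```

From there, short multi-GPU smoke runs (the llm.c PyTorch script launches with torchrun) let you shake out data loading and the distributed setup while the hourly rate is still low.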

What I ended up doing was renting 4 x 48GB A6000s for cheap and performing the data preparation there. As soon as the data preparation finished, I started compressing a copy of the project plus the sharded data with tar -czvf llm.c.tar.gz llm.c. While that was running, I opened another SSH session and ran some experiments, sanity checks, and small training runs to make sure everything was working as expected. This gave me a chance to work out a few kinks that I didn't know about.
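One cheap sanity check worth doing at this stage is opening a shard and making sure it decodes back into readable text. This assumes the flat uint16 layout from the sketch above; llm.c's real shards start with a small header, so skip past that first if you used its script.

```python
# Sanity check: a shard should decode back into readable English web text.
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
tokens = np.fromfile("fineweb_train_000000.bin", dtype=np.uint16)
print(f"{len(tokens):,} tokens in shard")
print(enc.decode(tokens[:200].tolist()))   # should read like English web text
```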

After tar completed, I copied the archive back to my machine, along with the logs from my small runs: scp -r -i ~/.ssh/id_ed25519 ubuntu@{Machine IP}:/home/ubuntu/llm.c/pylog_gpt2_124M . Once that finished, I was done, so I shut down the instance and destroyed it.

Next, I spun up an 8 x 80GB H100 instance. I moved my project and sharded data over with scp -C -i ~/.ssh/id_ed25519 llm.c.tar.gz ubuntu@{Machine IP}:/home/ubuntu/ and untarred it with tar -xzvf llm.c.tar.gz, which took about half as long as the initial data preparation. It probably would have been faster to upload the archive to an S3 bucket or something, but I don't have one.

I tried running my code with a set of hyperparameters I had picked out, and the process failed with an out-of-memory (OOM) error. I had chosen an aggressive mini-batch size, and I didn't feel comfortable tweaking parameters by trial and error while the bill racked up. Instead, I blew up the instance, planning to try different hyperparameters on a different set of hardware.
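For what it's worth, the usual way out of that kind of OOM is to shrink the per-GPU mini-batch and make up the difference with gradient accumulation, so the effective batch size per optimizer step stays the same. The arithmetic is simple; the numbers below are illustrative, not my actual settings.

```python
# Relating total batch size, per-GPU mini-batch, and gradient accumulation.
total_batch_size = 524_288          # tokens per optimizer step (2**19, a common GPT-2 choice)
B, T = 64, 1024                     # per-GPU mini-batch (sequences) and sequence length
num_gpus = 8

tokens_per_micro_step = B * T * num_gpus
assert total_batch_size % tokens_per_micro_step == 0
grad_accum_steps = total_batch_size // tokens_per_micro_step
print(f"grad_accum_steps = {grad_accum_steps}")   # 1 here; halving B would double it
```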

I had a hyperparameter set that I was confident would work on 8 x 40GB A100s, so I spun that up and repeated the process. This time, the run completed successfully. I had added some code that saved a model and did a little inferencing, so after the run finished I copied the logs and model back to my home machine and blew up this instance as well.
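A minimal version of that save-and-sample step looks something like the following. The function name and checkpoint layout here are my own, not llm.c's, and model is assumed to be the trained GPT-2 module from the training script, living on the GPU and returning (B, T, vocab_size) logits.

```python
# Sketch of "save a checkpoint, then sample from it" at the end of training.
import torch
import torch.nn.functional as F
import tiktoken

def save_and_sample(model, prompt="Hello, I'm a language model,", max_new_tokens=64):
    # Checkpoint layout is my own choice, not llm.c's format.
    torch.save({"model": model.state_dict()}, "gpt2_124M_final.pt")

    enc = tiktoken.get_encoding("gpt2")
    tokens = torch.tensor(enc.encode(prompt), dtype=torch.long, device="cuda")[None, :]

    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(tokens)                    # assumes (B, T, vocab_size) logits
            probs = F.softmax(logits[:, -1, :], dim=-1)
            # top-k sampling with k=50, the usual GPT-2 demo setting
            topk_probs, topk_idx = torch.topk(probs, 50, dim=-1)
            next_tok = topk_idx.gather(-1, torch.multinomial(topk_probs, 1))
            tokens = torch.cat([tokens, next_tok], dim=1)
    print(enc.decode(tokens[0].tolist()))
```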

I've been slowly building up an understanding of GPT-2's architecture over the last few months, and it was very fun to finally try a training run. I can easily recommend the two learning resources above. If you have any questions about the process or get stuck, feel free to give me a holler.