Inserting and Hosting 551 Millions Records Cheaply - Case Study

Recently I worked on a pet project with cloud implementation. There was a dump file in text format with 551 million line of records which needs to be stored in a way where filtered results could be quickly returned from a query, and with a goal of low cost implementation.

The dataset is static. Large number of records. No relational requirement. The storage and querying has to be fast and low cost. Azure table storage seems to be a good candidate for this purpose. Azure table storage is a NoSQL key value store and is very cheap.

The dataset (with some Azure table metadata) is estimated to be around 50 GB when it is stored in Azure table storage. With the choice of locally redundant storage (LRS), the price of the dataset storage is ~50 GB * $0.07 / GB per month = ~$3.50 per month. For the data transfer calculation, Microsoft charges $0.00036 / 10,000 transaction for table (that includes read, write, delete). If it was to insert 551 million record transactions, it would cost $19.84. But we could better. More on that later.

Although the application and API used against the Azure table would be made public, this pet project is more an educational project for myself so I don't expect large traffic (way less than 10,000). The operating transaction cost would be minimal.

Back to the text file. It contains 551 million records. To increase the insert performance into Azure table storage, here are the implemented strategy.

- Provisioned an Azure VM in the same region. Network latency and throughput is much better compare with personal computer.
- Batched 100 records (max limit) in a transaction whenever possible.
- Built some logic to insert records parallelly. 24 consoles were used to process different records simultaneously.
- Turned Nagle off to increase throughput for table insert.

Azure VM
The process is mostly compute intensive. Fsv2, Fs, F Azure VM type are the compute optimized VM. In this case, Fsv2 (2 vcpu, 4 GB RAM, 4 data disks, 4000 max iops, $75.89 / month) and standard HDD disk was selected.

Multiple records in batch
Azure table storage has a maximum of 100 records limit in a batch operation. All records in the batch must have same partition key. Batching not only increase insert performance, it also save cost. One batch operation is considered as a transaction. If all batches are 100 records, 551 millions records insert cost would be only $0.1984! In this case, due to the parallel logic put in place, average about 5 out of 6 batches contain 100 records.

Parallel Logic
Single process would be boring and inefficient. The process do need to read the text file line by line and insert into Azure table storage. It doesn't seem like there is any native parallel process of reading file and performing the insert. Especially consider the file is around 25 GB in size and the VM has only 4 GB of memory. I thought of a logic where it could have multiple powershell process its own unique section line by line simultaneously with minimum memory required. I picked 24, just an arbitrary number but with previous experience. It does consume the 2 vcpu on VM around 90% the entire time. Good.

Turned off Nagle
Microsoft storage team has an article of how the Nagle algorithm affects the Azure table storage insert performance. Personally has seen quite a significant improvement for Azure queue by turning off Nagle. [System.Net.ServicePointManager]::UseNagleAlgorithm = $false

Once these are setup with a new Azure table created, I ran the powershell to process the records and insert into the Azure table. With 24 process running simultaneously, it inserts around 12k batch transaction / minute. Estimating ~90 records per batch transaction, that is average of 18,000 records insert / second.

The total times for inserting 551 millions records into Azure table took about 8 and 1/2 hours.

Setup cost (including get server up and script ready),
Azure VM ~ $1.46
Azure table (storage + transaction) - $3.23
Bandwidth - $0.04
Total initial setup cost: $4.73

VM is only needed for this initial setup process and is decommissioned right after.

Operating cost (including all website components in addition to Azure table)
Azure table (storage) ~ $3
Azure table transaction (it would need to be > 270k transaction to be charged $0.01) - $0
Azure functions (as backend API, $0.20 / million execution, 1 million execution free / month) - $0
Azure static website (angular app size is small < 850 KB) ~ $0
Outbound Bandwidth (First 5 GB is free, between 5GB - 10TB, $0.087 / GB. Estimate application returns around 100 byte per each request) ~ $0
Total monthly operating cost: ~$3

Potential improvements. I could simplify the parallel logic to further increase the ratio of 100 records batches. Also increase the degree of parallelism (eg. 48 or more simultaneous process) along VM with more vpcu to further reduce the duration.

No comments:

Post a Comment