Prof. Antonin Portelli, a researcher at the University of Edinburgh and with DiRAC (Distributed Research Utilizing Advanced Computing), was impressed with Storj back in 2021 when he first ran performance tests against Storj’s decentralized cloud storage. The results showed that Storj was a promising way to share the extremely large datasets the team’s research generated on DiRAC’s high performance compute cluster. The DiRAC team began using Storj to transfer large datasets to the various universities and institutions involved in the research.
Fast forward two years, and the DiRAC team’s use of Storj had grown. Prof. Portelli heard about further improvements Storj had made and decided to run a new round of tests: one to measure how data transfer performance had improved, and one to evaluate Storj for archiving a supercomputer’s file system. This article details the testing and explains the technology behind the results. If you’d like a deeper dive from Prof. Portelli, tune in to the recent webinar where he shared this data.
Why is this testing so important for those working with big data?
The data associated with DiRAC’s research is mostly stored on-premises on disk and tape. But when research is a collaboration among institutions around the world, sharing enormous datasets securely and cost effectively is a challenge. Prof. Portelli explained why Storj decentralized cloud storage was a good candidate for this use case.
- Scientific data is extremely expensive to generate, so high resilience is critical.
- Scientific data generally needs to be accessible from several continents.
- Science is naturally decentralized through international collaborations!
- Science already uses distributed storage & computing systems. For example, the LHC computing grid.
Storj has the resilience, global accessibility, and price point to fit these requirements, which is why Prof. Portelli felt it would be a great option for scientific data curation and sharing. Of course, none of this matters if data transfers take too long.
The data transfer performance test using Storj
To test Storj’s performance, Prof. Portelli used a particle physics dataset of 2,000 binary files of 4.7 GB each, roughly 9.3 TB in total. The benchmark involved uploading the dataset from Edinburgh to Storj, downloading it back to Edinburgh, downloading it to a data center in New Jersey (US), and comparing against a direct SFTP transfer as a baseline.
After the upload to Storj, Prof. Portelli tracked the locations of the storage nodes on the Storj network that received pieces of the files. The map below shows the geographic distribution.
Storj erasure-codes each file segment into 80 pieces and distributes them across the network; on download, pieces are requested in parallel, and the 29 fastest to arrive are enough to reconstitute the segment. The data below shows the transfer rates for downloading to Edinburgh (UK) as well as to a data center in New Jersey (US), across various transfer groupings.
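The benefit of needing only the 29 fastest of 80 pieces can be shown with a small simulation. This is a sketch, not Storj code; the latency distribution below is invented purely for illustration.

```python
import random

random.seed(42)

# Simulated piece-fetch times (seconds) for 80 storage nodes:
# most respond quickly, but a few are slow stragglers.
latencies = [random.lognormvariate(-1.0, 1.0) for _ in range(80)]

# If every piece were required, the slowest node would set the pace.
wait_for_all = max(latencies)

# Storj needs any 29 of the 80 pieces, so the download completes
# as soon as the 29th-fastest response arrives.
wait_for_29 = sorted(latencies)[28]

print(f"29th-fastest piece: {wait_for_29:.2f}s")
print(f"slowest piece:      {wait_for_all:.2f}s")
```

The straggler nodes never hold up the transfer, which is exactly the long-tail avoidance that makes parallel piece requests effective.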
Prof. Portelli notes there was essentially no difference between download transfer rates in the US and the UK. “What you can see is really quite remarkable: there’s essentially no difference in the transfer rate between downloading the dataset in the UK and in the United States. And this is very likely thanks to the really good distribution between Europe and the US we just saw on the map.” As for Prof. Portelli’s reaction to the transfer rates? “The transfer rate is around 800 megabytes per second. This is extremely fast!” He also noted that upload was about 2x slower than download, which he expects is a bottleneck in the origin file system, not in the Storj network.
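As a back-of-envelope check, that rate moves the whole benchmark dataset in a few hours. This assumes a flat 800 MB/s sustained over the entire run, which real transfers only approximate.

```python
dataset_bytes = 9.3e12   # ~9.3 TB benchmark dataset
rate = 800e6             # ~800 MB/s observed download rate

hours = dataset_bytes / rate / 3600
print(f"{hours:.1f} hours")  # 3.2 hours
```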
Prof. Portelli then compared these results with the testing he did two years ago and the improvement is impressive.
This data shows a 2x-4x improvement in Storj transfer performance. Prof. Portelli summed up the results with two key takeaways:
- You can reach very fast transfer rates with parallelism.
- There is no apparent rate loss due to geographical distance between the EU and the US East Coast.
In Prof. Portelli’s words, “You can really reach very fast transfer rates thanks to parallelism. And this is something which is built-in natively in the Storj network, which is really nice because the data from many nodes are scattered all over the world. So you can really expect a good buildup of performances.”
Test for archiving a supercomputer to Storj
Another challenge scientific researchers face is what to do with the data on a supercomputer when it reaches end of life. Prof. Portelli admitted that the performance test above used a dataset that was convenient to transfer: a small number of big files. A much harder problem is archiving an entire file system whose files are not optimized for transfer.
A DiRAC IBM Blue Gene/Q supercomputer was decommissioned in 2016 and its file system was preserved. In total it contained 300 TB across around 95M files. The team decided to archive this enormous dataset to Storj. Prof. Portelli knew this would be a challenge, as it is the equivalent of “backing up an entire data center.”
Prof. Portelli used Restic to group the small files together, reducing the 95M files to roughly 4M objects. The results show that transfer rates reached 200 MB/s. While this is 4x slower than the previous test, Prof. Portelli explained that the gap comes from the work Restic does to concatenate files into larger pieces, deduplicate them, and encrypt them, which is a very CPU- and I/O-intensive operation.
Prof. Portelli was five days into the archival when he shared this data. He estimates the entire 300 TB will be transferred in 10-15 days. “The only thing that really matters here and interests me is that Storj plus Restic brought archiving this whole file system to a humanly acceptable time. So now we have hope.”
Prof. Portelli’s key takeaways:
- Use Restic! It massively helps with many small files.
- Restic uses rclone, which provides access to the native Storj protocol.
- Very decent 200 MiB/s rate considering the challenge.
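The numbers above can be sanity-checked with some quick arithmetic. This is an illustration only: it assumes a flat 200 MB/s for the entire run, whereas the actual rate fluctuates, which is why Prof. Portelli’s 10-15 day estimate lands somewhat below the flat-rate figure.

```python
# Restic's grouping of small files into larger objects.
files_before = 95e6
files_after = 4e6
factor = files_before / files_after
print(f"grouping factor: ~{factor:.0f}x fewer objects")  # ~24x

# Wall-clock time for the full archive at a constant rate.
archive_bytes = 300e12   # 300 TB file system
rate = 200e6             # 200 MB/s observed transfer rate
days = archive_bytes / rate / 86400
print(f"~{days:.0f} days at a flat 200 MB/s")  # ~17 days
```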
Prof. Portelli is excited by this use case. “I think it's really interesting because archiving an entire file system for a data center is probably one of the hardest data transfer exercises you can do. And doing it to the cloud adds another layer of difficulty. But with Storj and Restic, this is possible.”
So how does Storj get such fast performance?
How do you get great performance when your data is encrypted, erasure-coded, and distributed over a network of more than 24,000 storage nodes? Storj is optimized to use less bandwidth, avoid long-tail variance, minimize coordination, and maximize parallelism. And as Prof. Portelli’s testing demonstrated, parallelism across Storj’s globally distributed network makes it possible to move massive datasets fast.
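On the client side, the pattern of racing many nodes and keeping only the fastest responses can be sketched as follows. This is a toy illustration, not the real uplink client: fetch_piece and its delays are invented stand-ins for network requests.

```python
import concurrent.futures
import random
import time

random.seed(7)
TOTAL_PIECES = 80   # pieces stored per segment
NEEDED = 29         # pieces required to reconstitute the segment

def fetch_piece(i):
    # Stand-in for a network request to storage node i.
    time.sleep(random.uniform(0.01, 0.2))
    return i

with concurrent.futures.ThreadPoolExecutor(max_workers=TOTAL_PIECES) as pool:
    futures = [pool.submit(fetch_piece, i) for i in range(TOTAL_PIECES)]
    pieces = []
    # Collect results in completion order and stop at the 29th.
    for fut in concurrent.futures.as_completed(futures):
        pieces.append(fut.result())
        if len(pieces) == NEEDED:
            break
    # Abandon the stragglers: cancel any request not yet started.
    for fut in futures:
        fut.cancel()

print(f"reconstructing from the {len(pieces)} fastest pieces")
```

Because the loop exits at the 29th completed future, slow nodes never gate the result, mirroring the long-tail avoidance described above.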
Storj was already fast when Prof. Portelli did his original testing in 2021, so what changed to create a 2x-4x improvement?
Storj performance improvements:
- Increased and optimized the network - Doubled the number of storage nodes (12k to 24k) and expanded the global footprint. Performance and consistency naturally improve as the network grows. Storj has also improved node selection and data placement.
- Improved transport - Eliminated protocol/database round-trips and unnecessary pipeline stalls, and improved the transport protocol with techniques like the Noise protocol and TCP Fast Open.
- Smarter connection pooling and caching - Optimized connection pooling to minimize unneeded opening and closing of connections, which can slow performance.
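Connection pooling itself is a general technique: pay the connection-setup cost once and reuse the handle across many requests. Here is a minimal sketch of the reuse pattern, with a dummy Connection class standing in for a real network handle (this is illustrative only, not Storj’s implementation).

```python
import queue

class Connection:
    """Dummy stand-in for an expensive-to-open network connection."""
    opened = 0
    def __init__(self):
        Connection.opened += 1

class ConnectionPool:
    def __init__(self, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(Connection())

    def acquire(self):
        return self._pool.get()

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=4)
for _ in range(100):          # serve 100 requests...
    conn = pool.acquire()
    pool.release(conn)

print(Connection.opened)      # ...while only 4 connections were ever opened
```

Without the pool, each request would open and close its own connection, and that per-request setup cost is exactly the overhead pooling eliminates.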
And Storj isn’t done. Performance remains a roadmap priority, and work continues to reduce latency and drive time to first byte down as far as possible.
Prof. Portelli admits he was skeptical of Storj at first, but can’t deny the past two years of results from use and testing. “At DiRAC, we like breaking things. People give us new CPUs, new GPUs, and we try to push it as hard as we can until it breaks. It's been two years that I have been trying to break Storj and I thought I would be able to, but I can’t. I’ve not seen the limitations yet. And the transfer rates are extremely competitive.”
Want more details on the supercomputer specs and testing? You can hear the details from Prof. Portelli as well as the live Q&A from attendees in the on-demand webinar: Maximize Performance for Big Data.