In this guide, we will walk through the process of configuring Hugging Face Datasets with Storj DCS using S3FS, along with the rationale for this architecture, until a native integration pattern is released.
What is Hugging Face – and Hugging Face Datasets?
Hugging Face is a platform that allows developers to train and deploy open-source AI models. It's similar to GitHub in that it provides a space for developers to code and deploy AI applications, including language models, transformers, text2image, and more.
One of the standout features of the platform is “Datasets” – a collection of over 5,000 ML datasets that are available for use. These datasets are pre-processed versions of publicly available data and are ready for use in natural language processing tasks such as text classification, language modeling, and machine translation. They are designed to be easily integrated into Hugging Face's model training and evaluation library, allowing users to quickly and easily train models on the provided data.
The models and datasets on the platform can be made public or collaborated on privately within an organization repository. Additionally, the datasets are accompanied by extensive documentation in the form of Dataset Cards and Dataset Previews, allowing users to explore the data directly in their browser. While many of the datasets are public, organizations and individuals can also create private datasets to comply with licensing or privacy issues.
The 🤗 datasets library provides programmatic access to the datasets, making it easy to incorporate datasets from the Hub into your projects. With a single line of code you can access a dataset; even if it is too large to fit on your computer, you can use streaming to access the data efficiently. The Hugging Face Hub documentation is a great place to learn more about Datasets on the platform.
Setup Storj with S3Fs
Storj will use s3fs in order to work with the Hugging Face APIs. First, install the required dependencies.
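For example, assuming a Python environment with pip available:

```shell
pip install datasets s3fs
```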
Next, enter your Storj S3-compatible access key and secret key (see AWS SDK and Hosted Gateway MT).
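As a sketch, the credentials can be collected into a `storage_options` dictionary that the later steps hand to s3fs. The key and secret below are placeholders; `https://gateway.storjshare.io` is Storj's hosted gateway endpoint:

```python
# Placeholder Storj S3-compatible credentials; replace with the
# access key and secret key generated for your account.
storage_options = {
    "key": "your-access-key",
    "secret": "your-secret-key",
    # The hosted gateway endpoint is passed through to botocore.
    "client_kwargs": {"endpoint_url": "https://gateway.storjshare.io"},
}
```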
Create a bucket (see Create a Bucket) for the dataset to be stored in. In this walk-through the bucket will be called my-dataset-bucket.
Transfer existing Hugging Face dataset to Storj
If your dataset is already on the Hugging Face Hub, you can use the load_dataset_builder function to download it and transfer it to Storj. It first downloads the raw dataset to your specified cache_dir, then prepares it and uploads it to Storj using the storage_options defined previously. Here we transfer the dataset imdb to Storj.
Save dataset to Storj
Once you've encoded a dataset, you can persist it using the save_to_disk method.
Load dataset from Storj
Using the load_from_disk method, you can download your datasets.
What is S3Fs
S3Fs is a Python library that provides a file-like interface to Amazon S3. It builds on top of botocore, the low-level library underlying the AWS SDK for Python. With S3Fs, you can use typical file-system operations like cp, mv, ls, du, and glob to interact with your S3 buckets, as well as put and get local files to and from S3.
You can access the S3FS configuration for Storj here: https://s3fs.readthedocs.io/en/latest/index.html?highlight=storj#s3-compatible-storage
One of the key features of S3Fs is the ability to connect to S3 in different ways. You can connect anonymously, which only allows you to access publicly-available, read-only buckets, or you can use credentials that you explicitly supply or that are stored in configuration files.
When you open a file using S3Fs, you are given an S3File object that emulates the standard File protocol (read, write, tell, seek). This allows you to use functions that expect a file with S3. The library supports only binary read and write modes and uses blocked caching.
S3Fs is built on top of fsspec, a library that provides a common interface to various file-systems. This allows S3Fs to be easily integrated with other libraries that use fsspec.
Why should I store my datasets on Storj DCS?
Put simply: better performance and better economics.
Ultimately, storing your Hugging Face datasets on Storj provides a number of benefits, including:
Speed and performance: Decentralized storage has a positive impact on performance and speed, allowing faster data retrieval and scalable resources.
Cost savings: Storj uses a pay-as-you-use pricing model, which can be beneficial if you need to store large amounts of data and expect to use it infrequently.
Data sharing and distribution: Storj allows users to share data with others in a secure and private way. Additionally, it allows for split-storage or multi-cloud storage, enabling users to distribute their data over several clouds and locations. This has been helpful for compliance and redundancy reasons.
Decentralization: Storj is a decentralized cloud storage platform, meaning that data is stored across a network of nodes, rather than in a centralized data center. This makes the data more resistant to outages and censorship, and can also help to ensure that the data is stored in a more secure and private way.
Want step-by-step written instructions? See Storj Docs
https://docs.storj.io/dcs/how-tos/hugging-face
Interested in creating a Storj account? storj.io/signup?partner=kevin