We are using the HuggingFace hub to host all new datasets. All you have to do is upload your dataset, along with any challenge splits, using the steps outlined below.
Table of Contents
- Creating the dataset
- Potential Errors
Create a HuggingFace account
To upload the dataset, you will need a HuggingFace account. If you already have one, you can skip this step.
If not, go to huggingface.co/join and create an account.
Join the GEM Organization
We are hosting all datasets in the GEM organization, which you can join by following this link.
To install all the requirements, follow the steps below:
# Install the hub interface.
pip install huggingface_hub
# Install Git for large files.
git lfs install
# Log in to the hub
huggingface-cli login
Great, you are now prepared to create your dataset!
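Before moving on, a quick sanity check of the tooling can save a failed upload later. The sketch below is not part of the official instructions: `check_step` is a hypothetical helper name, and it only tests whether each command runs, not whether you are logged in.

```shell
# check_step LABEL COMMAND...: run COMMAND quietly and report whether it succeeded.
# (check_step is an illustrative helper, not part of any HuggingFace tooling.)
check_step() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ok: $label"
    else
        echo "missing: $label"
        return 1
    fi
}

check_step "git" git --version || echo "  install git first"
check_step "git lfs" git lfs version || echo "  install git-lfs, then run: git lfs install"
check_step "huggingface-cli" command -v huggingface-cli || echo "  run: pip install huggingface_hub"
```

To confirm the login itself, `huggingface-cli whoami` prints the account you are currently logged in as.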
Creating the dataset
The following steps largely follow this tutorial.
Set up the repository
First, we will create the empty repository we will use to host the dataset.
huggingface-cli repo create YOUR_DATASET_NAME --type dataset --organization GEM
# Once created, we can download it.
git clone https://huggingface.co/datasets/GEM/YOUR_DATASET_NAME
Preparing the files
You will need to add the following files to the repository.
- your_dataset_name.json is a dataset card that is created following our other tutorial and using our collection tool. If you are completing the data part first, feel free to leave it empty for now. However, only a dataset with a completed data card is part of GEM, so please add it once it is ready.
- The raw data files of the dataset (optional; if they are hosted elsewhere, you can specify the URLs in the dataset script).
- your_dataset_name.py is your dataset loading script (optional if your data files are already in one of the supported formats: csv, jsonl, json, parquet, or txt). For information on how to create a dataset script, see the documentation. You can start from the template and simply fill in the details.
- dataset_infos.json contains metadata about the dataset (required only if you have a dataset script).
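As a rough aid, the file list above can be turned into a small presence check to run inside the cloned repository. This is only a sketch: `list_repo_files` is a hypothetical helper, and it reports on the three named files without judging which are required in your case.

```shell
# list_repo_files DIR NAME: report which of the named files exist in DIR.
# (Illustrative helper; the raw data files are not checked because their
# names depend on your dataset.)
list_repo_files() {
    dir=$1
    name=$2
    for f in "$name.json" "$name.py" "dataset_infos.json"; do
        if [ -f "$dir/$f" ]; then
            echo "found: $f"
        else
            echo "absent: $f"
        fi
    done
}

# Example: run from inside the cloned repository.
list_repo_files . your_dataset_name
```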
While we don't have strong restrictions on the dataset formats, please follow the guidelines:
- Each dataset should have splits named
test. Additional challenge sets splits can be named
- Each split should have a field called gem_id. In its naming convention, id is an incrementing number starting at 1. Please look at our existing data loader for reference.
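One way to catch a missing gem_id field before uploading is a quick text scan of each split. The sketch below assumes the split is stored as a JSON Lines file with one example per line (the file name `test.jsonl` is hypothetical). It is a plain text match rather than a real JSON parse, and it does not validate the naming convention itself.

```shell
# count_missing_gem_id FILE: print how many non-empty lines of a JSON Lines
# file contain no "gem_id" key. Text match only; nested or escaped keys are
# not handled.
count_missing_gem_id() {
    grep -v '^[[:space:]]*$' "$1" | grep -vc '"gem_id"[[:space:]]*:' || true
}

# Example: report on a split file if it exists in the current directory.
if [ -f test.jsonl ]; then
    echo "examples without gem_id: $(count_missing_gem_id test.jsonl)"
fi
```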
Uploading the dataset
First, add all the dataset files to git lfs tracking, and then use git as usual to track all other files.
cp /somewhere/data/*.json .
# Quote the pattern so git lfs stores it as a pattern instead of the shell expanding it.
git lfs track "*.json"
git add .gitattributes
git add *.json
git commit -m "add json files"
Afterwards, you can also add the data card and all other files and commit them. Once everything is ready, simply run git push. After you enter your HuggingFace username and password, everything will be uploaded to the Hub!
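Before pushing, it can be worth confirming that the data files are actually routed through git lfs. The following sketch uses `git check-attr` to read the filter that .gitattributes assigns to each file; `is_lfs_tracked` is a hypothetical helper name, and the `*.json` glob simply mirrors the commands above.

```shell
# is_lfs_tracked FILE: succeed if .gitattributes assigns the "lfs" filter to
# FILE. Must be run inside the repository.
is_lfs_tracked() {
    git check-attr filter -- "$1" 2>/dev/null | grep -q ': filter: lfs$'
}

for f in *.json; do
    [ -e "$f" ] || continue   # the glob matched nothing
    if is_lfs_tracked "$f"; then
        echo "lfs: $f"
    else
        echo "plain git: $f"
    fi
done
```

After committing, `git lfs ls-files` gives a second view of which files are stored through LFS.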
You can find an example dataset with a challenge dataset (generated using an NL-Augmenter transformation) here.
You can update the dataset simply by pushing updates the same way.
Potential Errors
git: 'lfs' is not a git command.
git lfs needs to be installed separately. Depending on your operating system, you can follow this post to solve the issue.
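As a rough starting point, the snippet below prints a likely install command for the current operating system. The package-manager choices (Homebrew on macOS, apt on Linux) are assumptions that will not match every setup, and `lfs_install_hint` is a hypothetical helper name.

```shell
# lfs_install_hint [OS]: print a likely install command for git-lfs.
# OS defaults to the output of `uname -s`. The package managers named here
# are assumptions; other distributions and setups differ.
lfs_install_hint() {
    os=${1:-$(uname -s)}
    case $os in
        Darwin) echo "brew install git-lfs" ;;
        Linux)  echo "sudo apt-get install git-lfs" ;;
        *)      echo "see the installer at https://git-lfs.com" ;;
    esac
}

if ! git lfs version >/dev/null 2>&1; then
    echo "git-lfs not found; try: $(lfs_install_hint)"
fi
```

Once git-lfs is installed, run `git lfs install` again so that the hooks are set up for your user account.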
403 Client Error: Forbidden for url
You may encounter the following when trying to create a dataset:
403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create - You don't have the rights to create a dataset under this namespace
This happens when you are not part of the organization or have a typo in the creation command. Ensure that you are (1) logged in, (2) a member of the GEM organization, and (3) have typed the --organization GEM option using all upper-case letters.
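One way to rule out the third cause is to keep the organization name in a single variable and generate the command from it. This is only a sketch: `GEM_ORG` and `gem_create_command` are hypothetical names, and the function prints the command for review instead of running it.

```shell
# The organization name must be typed in upper case, so keep it in one place.
GEM_ORG="GEM"

# gem_create_command NAME: print (not run) the repository creation command.
gem_create_command() {
    echo "huggingface-cli repo create $1 --type dataset --organization $GEM_ORG"
}

gem_create_command YOUR_DATASET_NAME
```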