
Take a simple example from this website: https://huggingface.co/datasets/Dahoas/rm-static

If I want to load this dataset online, I just use:

from datasets import load_dataset
dataset = load_dataset("Dahoas/rm-static") 

What if I want to load the dataset from a local path? I first downloaded the files, keeping the same folder structure as shown under "Files and versions" on the web page:

-data
|-test-00000-of-00001-bf4c733542e35fcb.parquet
|-train-00000-of-00001-2a1df75c6bce91ab.parquet
-.gitattributes
-README.md
-dataset_infos.json

Then I put them into my folder, but loading raises an error:

dataset_path = "/data/coco/dataset/Dahoas/rm-static"
tmp_dataset = load_dataset(dataset_path)

It shows FileNotFoundError: No (supported) data files or dataset script found in /data/coco/dataset/Dahoas/rm-static.

3 Answers


Save the data with save_to_disk, then load it with load_from_disk. For example:

import datasets
ds = datasets.load_dataset("Dahoas/rm-static") 
ds.save_to_disk("Path/to/save")

Later, if you want to reuse it, just load it back with load_from_disk:

ds = datasets.load_from_disk("Path/to/save")

You can verify this by printing both datasets; you will get the same result. This is the easier way out. The data is generally saved in the Arrow format.
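To make that verification concrete, here is a minimal round-trip sketch (the save path is a placeholder):

import datasets

# Download once from the Hub, then persist locally in Arrow format
ds = datasets.load_dataset("Dahoas/rm-static")
ds.save_to_disk("Path/to/save")

# Reload from disk; no network access needed
ds_local = datasets.load_from_disk("Path/to/save")

# Both print the same DatasetDict (same splits, features, and row counts)
print(ds)
print(ds_local)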

As for the second method, where you download the Parquet files yourself: it requires you to declare the data files and their configuration (which file belongs to which split) explicitly when loading, as sketched below.
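A minimal sketch of that explicit declaration, assuming the Parquet files were downloaded to the path from the question:

from datasets import load_dataset

# Point each split explicitly at its Parquet file
data_files = {
    "train": "/data/coco/dataset/Dahoas/rm-static/data/train-00000-of-00001-2a1df75c6bce91ab.parquet",
    "test": "/data/coco/dataset/Dahoas/rm-static/data/test-00000-of-00001-bf4c733542e35fcb.parquet",
}
dataset = load_dataset("parquet", data_files=data_files)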

  • Would suggest going with the first method; it may be time-consuming once, but it saves effort and time the next time you load the data. Please feel free to post any more questions about this. Commented Sep 1, 2023 at 4:35
  • Hi, thanks for your reply. I tried your method, but when I load the dataset with dataset = load_dataset("Path/to/save") it raises an error: ValueError: Couldn't cast _data_files: list<item: struct<filename: string>> child 0, item: struct<filename: string> child 0, filename: string _fingerprint: string _format_columns: null _format_kwargs: struct<> _format_type: null _output_all_columns: bool _split: string to {'builder_name': Value(dtype='string', id=None), } because column names don't match
    – 4daJKong
    Commented Sep 1, 2023 at 6:00
  • On the other hand, if I use dataset = load_from_disk("Path/to/save"), it works without a problem.
    – 4daJKong
    Commented Sep 1, 2023 at 6:01
  • Hi @4daJKong. That (load_from_disk) is the correct way to do it. Were you able to load it that way? You can refer to huggingface.co/docs/datasets/package_reference/loading_methods. Commented Sep 4, 2023 at 4:11

I solved this myself; it is easy to do:

from datasets import load_dataset

data_files = {"train": "train-00000-of-00001-2a1df75c6bce91ab.parquet", "test": "test-00000-of-00001-8c7c51afc6d45980.parquet"}
raw_datasets = load_dataset("parquet", data_dir="/Your/Path/Dahoas/rm-static/data", data_files=data_files)
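As a side note, with recent versions of datasets you can often omit data_files entirely; this sketch assumes the library infers the splits from the "train-*" and "test-*" file name patterns:

from datasets import load_dataset

# Splits are inferred from the train-*/test-* file name patterns
raw_datasets = load_dataset("parquet", data_dir="/Your/Path/Dahoas/rm-static/data")
print(raw_datasets)  # DatasetDict with train and test splits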

You can load a CSV data file from a local path using:

from datasets import load_dataset
dataset = load_dataset('csv', data_files='final.csv')

or to load multiple files, use:

dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})
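This returns a DatasetDict keyed by split name; a quick usage sketch (the CSV file names above are placeholders):

from datasets import load_dataset

dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})

# Inspect the splits and the first training row
print(dataset)              # DatasetDict with train/test splits and row counts
print(dataset['train'][0])  # first row as a dict of column -> value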

For more details, follow the Hugging Face documentation.
