Loading shards slow, datasets error · Issue 4081 · oobabooga/text-generation-webui
Same model and same machine: sometimes it takes less than 1 minute, but sometimes it takes more than 10 minutes. I've also encountered an issue where the data loading process seems to be triggered 8 times in parallel, which, I suspect, leads to excessive disk-read overhead and contention. Resolved: this was caused by low disk performance.
I'm able to get fast iteration speeds when iterating over the dataset without shuffling. Using TFRecords with TFDS seems to be a lot faster, which isn't really what I'd expect. Here is the worker function I used to debug this; it loads only the file paths from the dataset but does the reading locally:
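The original worker code didn't survive the copy, so here is a minimal reconstruction of the idea, assuming a datasets.Dataset with a hypothetical "path" column: each DataLoader worker opens and reads its files directly, so the heavy I/O happens in the worker process rather than through Arrow random access.

```python
from datasets import Dataset
from torch.utils.data import DataLoader

# Toy stand-in: a map-style Dataset holding only file paths
# ("path" is an assumed column name).
ds = Dataset.from_dict({"path": ["shard-0.bin", "shard-1.bin"]})

def collate(batch):
    # Each worker reads its files directly; only the paths come
    # from the dataset itself.
    out = []
    for example in batch:
        with open(example["path"], "rb") as f:
            out.append(f.read())
    return out

loader = DataLoader(ds, batch_size=2, num_workers=2, collate_fn=collate)
```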
In particular, it splits the dataset into shards of 500 MB and uploads each shard as a Parquet file on the Hub.
However, there might be huge datasets that exceed the size of your local SSD; the rest of this blog post covers that case. To parallelize data loading, we give each process some shards (or data sources) to process. Saving a dataset on the HF Hub using .push_to_hub() does upload multiple shards.
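As a concrete illustration (the repo id username/my-dataset and the JSON source files are placeholders), the max_shard_size argument of push_to_hub controls the shard size described above:

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="data/*.jsonl", split="train")

# Uploads the dataset as Parquet shards; max_shard_size caps each
# shard at roughly 500 MB, matching the behavior described above.
ds.push_to_hub("username/my-dataset", max_shard_size="500MB")
```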
Even on small problems and on your desktop, it can speed up I/O tenfold and simplify data management and processing of large datasets. I am currently only running on one node, typically with 16 GPUs; it's very possible the way I'm doing this is suboptimal. When I shuffle the dataset, the iteration speed is reduced by ~1000x.
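The slowdown after shuffling is a known effect in datasets: Dataset.shuffle() only builds an indices mapping, so iteration turns into random access into the Arrow file. A sketch of two common mitigations (the JSON source files are placeholders):

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="data/*.jsonl", split="train")

# Option 1: flatten_indices() rewrites the rows in shuffled order,
# a one-time cost that restores contiguous reads.
fast = ds.shuffle(seed=42).flatten_indices()

# Option 2: buffered shuffling over an IterableDataset keeps reads
# mostly sequential while approximating a global shuffle.
it = ds.to_iterable_dataset(num_shards=64).shuffle(seed=42, buffer_size=10_000)
```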

Failure after loading checkpoint shards. · Issue 655
Therefore it's unnecessary to have a number of workers greater than the number of shards.
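A sketch of how shards and workers interact (the source files are placeholders): the shards of an IterableDataset are divided among DataLoader workers, so any worker beyond n_shards would sit idle.

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("json", data_files="data/*.jsonl", split="train")
it = ds.to_iterable_dataset(num_shards=64)

# Cap num_workers at the shard count; extra workers get no shards.
loader = DataLoader(it, num_workers=min(8, it.n_shards))
```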
That way the resulting shards are not copies of dataset.arrow. A possible workaround is to keep the data on the shared filesystem and bundle the small recordings into larger archives. Just wondering if there's any hope of speeding this up. Is there any way that checkpoint shards can be cached or preloaded?
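None of the threads show a definitive fix, but a common mitigation for slow checkpoint-shard loading in transformers is to skip materializing a randomly initialized model before the weights arrive; a sketch (the model id is a placeholder, and device_map="auto" requires accelerate):

```python
import torch
from transformers import AutoModelForCausalLM

# low_cpu_mem_usage loads each shard directly into its final tensors
# instead of allocating a full random-init model first; device_map
# places weights on devices as they load.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
```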
How do I download and load a dataset in batches without caching all of it? I processed the datasets into several shards. Loading checkpoint shards is very slow. I noticed that when I try loading a shard around 10 GB in size, it takes more than 10 GB of RAM and eventually crashes my runtime.
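For loading in batches without caching everything, streaming mode is the usual answer; a sketch (the dataset id is a placeholder, and IterableDataset.iter needs a recent datasets release):

```python
from datasets import load_dataset

# streaming=True never writes the full dataset to the cache; examples
# are downloaded and decoded on the fly, so RAM stays bounded.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

for batch in ds.iter(batch_size=1000):
    ...  # process one batch at a time
```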
Datasets loading slow · Issue 21 · huggingface/jat · GitHub
I am wondering what is the right way to do data reading/loading under DDP.
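A minimal sketch of one common pattern, assuming a placeholder Hub dataset and the standard RANK/WORLD_SIZE environment variables set by the DDP launcher: each rank streams only its own subset of shards via split_dataset_by_node.

```python
import os
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# With n_shards divisible by world_size, no two ranks read the
# same underlying file.
ds = load_dataset("username/my-dataset", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```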
I have a relatively large dataset. If I want to load the shards as one piece I can do so. I suspect this might be an issue with my dataset consisting of large arrays.

load_dataset is extremely slow in loading HF datasets · Beginners