📦 Business Process Automation Datasets

Curated datasets for document AI, workflow automation, and enterprise chatbot development.

a-m-team/AM-DeepSeek-R1-Distilled-1.4M

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-03-30

task_categories:text-generation language:zh language:en license:cc-by-nc-4.0 size_categories:1M<n<10M arxiv:2503.19633 region:us code math reasoning ...truncated...

agentica-org/DeepCoder-Preview-Dataset

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-09

language:en license:mit size_categories:10K<n<100K format:parquet modality:text library:datasets library:dask library:mlcroissant library:polars region:us ...truncated...

agentica-org/DeepScaleR-Preview-Dataset

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-02-10

language:en license:mit size_categories:10K<n<100K format:json modality:text library:datasets library:pandas library:mlcroissant library:polars region:us

alfathterry/bbc-full-text-document-classification

Train your NLP skills with this dataset.

Use case:

Source: Kaggle | Type: Text | Updated: 2024-04-04

education nlp multiclass classification news

allenai/olmOCR-mix-0225

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-02-25

license:odc-by size_categories:100K<n<1M format:parquet modality:text library:datasets library:pandas library:mlcroissant library:polars region:us

Anthropic/EconomicIndex

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-03-27

language:en license:mit size_categories:1K<n<10K format:csv modality:tabular modality:text library:datasets library:pandas library:mlcroissant library:polars ...truncated...

ayoubcherguelaine/company-documents-dataset

Company Documents Dataset for Classification and Information Retrieval

Use case:

Source: Kaggle | Type: Text | Updated: 2024-05-23

business

ByteDance-Seed/Multi-SWE-bench

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-11

task_categories:text-generation license:other arxiv:2504.02605 region:us code

ByteDance-Seed/Multi-SWE-RL

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-09

task_categories:text-generation license:other arxiv:2504.02605 region:us code

cais/hle

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-04

license:mit size_categories:1K<n<10K format:parquet modality:image modality:text library:datasets library:pandas library:mlcroissant library:polars region:us

cais/mmlu

Use case:

Source: Hugging Face | Type: Text | Updated: 2024-03-08

task_categories:question-answering task_ids:multiple-choice-qa annotations_creators:no-annotation language_creators:expert-generated multilinguality:monolingual source_datasets:original language:en license:mit size_categories:100K<n<1M format:parquet ...truncated...

camel-ai/loong

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-01

task_categories:question-answering language:en license:mit size_categories:1K<n<10K format:parquet modality:text library:datasets library:pandas library:mlcroissant library:polars ...truncated...

CohereForAI/kaleidoscope

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-10

language:ar language:bn language:hr language:nl language:en language:fr language:de language:hi language:hu language:lt ...truncated...

Congliu/Chinese-DeepSeek-R1-Distill-data-110k

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-02-21

task_categories:text-generation task_categories:text2text-generation task_categories:question-answering language:zh license:apache-2.0 size_categories:100K<n<1M format:json modality:tabular modality:text library:datasets ...truncated...

dataturks/resume-entities-for-ner

A document annotation dataset to perform NER on resumes.

Use case:

Source: Kaggle | Type: Text | Updated: 2018-07-12

earth and nature biology business linguistics nlp text

davanstrien/reasoning-required

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-10

task_categories:text-classification task_categories:text-generation language:en license:mit size_categories:1K<n<10K format:parquet modality:text library:datasets library:pandas library:mlcroissant ...truncated...

devashishprasad/documnet-layout-recognition-dataset-publaynet-t0

IBM's PubLayNet dataset on Kaggle for document layout recognition.

Use case:

Source: Kaggle | Type: Text | Updated: 2021-06-10

computer science computer vision deep learning cnn image

facebook/natural_reasoning

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-02-21

task_categories:text-generation language:en license:cc-by-nc-4.0 size_categories:1M<n<10M format:json modality:text library:datasets library:pandas library:mlcroissant library:polars ...truncated...

fka/awesome-chatgpt-prompts

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-01-06

task_categories:question-answering license:cc0-1.0 size_categories:n<1K format:csv modality:text library:datasets library:pandas library:mlcroissant library:polars region:us ...truncated...

FreedomIntelligence/medical-o1-reasoning-SFT

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-02-22

task_categories:question-answering task_categories:text-generation language:en language:zh license:apache-2.0 size_categories:10K<n<100K format:json modality:text library:datasets library:pandas ...truncated...

future-technologies/Universal-Transformers-Dataset

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-10

task_categories:text-classification task_categories:token-classification task_categories:table-question-answering task_categories:question-answering task_categories:zero-shot-classification task_categories:translation task_categories:summarization task_categories:feature-extraction task_categories:text-generation task_categories:text2text-generation ...truncated...

getomni-ai/ocr-benchmark

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-02-21

license:mit size_categories:1K<n<10K format:imagefolder modality:image modality:text library:datasets library:mlcroissant region:us

glaiveai/reasoning-v1-20m

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-03-19

task_categories:text-generation language:en license:apache-2.0 size_categories:10M<n<100M format:parquet modality:text library:datasets library:dask library:mlcroissant library:polars ...truncated...

HuggingFaceFW/fineweb

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-01-31

task_categories:text-generation language:en license:odc-by size_categories:10B<n<100B format:parquet modality:tabular modality:text library:datasets library:dask library:mlcroissant ...truncated...

humansintheloop/arabic-documents-ocr-dataset

10K images that are further classified into 12 classes (Invoices, Books, etc.)

Use case:

Source: Kaggle | Type: Text | Updated: 2023-06-07

global business image text middle east arabic

jensenbaxter/10dataset-text-document-classification

A collection of ~1000 newsgroup documents from 10 different newsgroups

Use case:

Source: Kaggle | Type: Text | Updated: 2020-06-08

earth and nature business online communities

kageneko/legal-case-document-summarization

Use case:

Source: Kaggle | Type: Text | Updated: 2024-03-17

law

konradb/pfizer-documents

Pfizer-BioNTech vaccine-related data - #pfizerdocuments

Use case:

Source: Kaggle | Type: Text | Updated: 2022-11-02

public health

LLM360/MegaMath

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-09

task_categories:text-generation language:en license:odc-by size_categories:100M<n<1B format:parquet modality:text library:datasets library:dask library:mlcroissant library:polars ...truncated...

m-a-p/COIG-P

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-09

size_categories:1M<n<10M format:parquet modality:text library:datasets library:dask library:mlcroissant library:polars arxiv:2504.05535 region:us

miscovery/General_Facts_in_English_Arabic_Egyptian_Arabic

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-06

task_categories:question-answering task_categories:translation task_categories:text-generation task_categories:fill-mask language:en language:ar license:mit size_categories:10K<n<100K format:csv modality:tabular ...truncated...

MohamedRashad/Quran-Recitations

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-03-30

task_categories:automatic-speech-recognition task_categories:text-to-speech language:ar size_categories:100K<n<1M format:parquet modality:audio modality:text library:datasets library:dask library:mlcroissant ...truncated...

mozilla-foundation/common_voice_17_0

Use case:

Source: Hugging Face | Type: Text | Updated: 2024-06-16

annotations_creators:crowdsourced language_creators:crowdsourced multilinguality:multilingual source_datasets:extended|common_voice language:ab language:af language:am language:ar language:as language:ast ...truncated...

nenriki/document-clustering

Text similarity and Agglomerative Document Clustering

Use case:

Source: Kaggle | Type: Text | Updated: 2022-03-07

nlp clustering

nisten/battlefield-medic-sharegpt

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-08

license:mit size_categories:1K<n<10K format:json modality:text library:datasets library:pandas library:mlcroissant library:polars region:us

nvidia/Llama-Nemotron-Post-Training-Dataset

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-09

license:cc-by-4.0 size_categories:1M<n<10M format:json modality:text library:datasets library:dask library:mlcroissant region:us

nvidia/OpenCodeReasoning

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-07

task_categories:text-generation license:cc-by-4.0 size_categories:100K<n<1M format:parquet modality:text library:datasets library:dask library:mlcroissant library:polars arxiv:2504.01943 ...truncated...

OmniSVG/MMSVG-Icon

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-09

license:cc-by-nc-sa-4.0 size_categories:100K<n<1M format:json modality:text library:datasets library:pandas library:mlcroissant library:polars arxiv:2504.06263 region:us

OmniSVG/MMSVG-Illustration

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-09

license:cc-by-nc-sa-4.0 size_categories:100K<n<1M format:json modality:text library:datasets library:pandas library:mlcroissant library:polars arxiv:2504.06263 region:us

open-r1/codeforces-cots

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-03-28

license:cc-by-4.0 size_categories:100K<n<1M format:parquet modality:tabular modality:text library:datasets library:dask library:mlcroissant library:polars region:us

open-r1/OpenR1-Math-220k

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-02-18

language:en license:apache-2.0 size_categories:100K<n<1M format:parquet modality:text library:datasets library:dask library:mlcroissant library:polars region:us

open-thoughts/OpenThoughts2-1M

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-07

license:apache-2.0 size_categories:1M<n<10M format:parquet modality:text library:datasets library:dask library:mlcroissant library:polars region:us synthetic ...truncated...

openai/gsm8k

Use case:

Source: Hugging Face | Type: Text | Updated: 2024-01-04

task_categories:text2text-generation annotations_creators:crowdsourced language_creators:crowdsourced multilinguality:monolingual source_datasets:original language:en license:mit size_categories:10K<n<100K format:parquet modality:text ...truncated...

openai/openai_humaneval

Use case:

Source: Hugging Face | Type: Text | Updated: 2024-01-04

task_categories:text2text-generation annotations_creators:expert-generated language_creators:expert-generated multilinguality:monolingual source_datasets:original language:en license:mit size_categories:n<1K format:parquet modality:text ...truncated...

patrickaudriaz/tobacco3482jpg

Document Structure Learning Dataset

Use case:

Source: Kaggle | Type: Text | Updated: 2019-04-10

earth and nature education image

proj-persona/PersonaHub

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-03-04

task_categories:text-generation task_categories:text-classification task_categories:token-classification task_categories:fill-mask task_categories:table-question-answering task_categories:text2text-generation language:en language:zh license:cc-by-nc-sa-4.0 size_categories:100K<n<1M ...truncated...

Rapidata/2k-ranked-images-open-image-preferences-v1

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-10

license:apache-2.0 size_categories:1K<n<10K format:parquet modality:image modality:text library:datasets library:pandas library:mlcroissant library:polars region:us ...truncated...

Rapidata/Reve-AI-Halfmoon_t2i_human_preference

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-08

task_categories:text-to-image task_categories:image-to-text task_categories:image-classification task_categories:reinforcement-learning language:en license:cdla-permissive-2.0 size_categories:10K<n<100K format:parquet modality:image modality:text ...truncated...

Rapidata/text-2-video-human-preferences-pika2.2

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-08

task_categories:video-classification task_categories:text-to-video task_categories:text-classification language:en license:apache-2.0 size_categories:1K<n<10K format:parquet modality:image modality:tabular modality:text ...truncated...

ritvik1909/document-classification-dataset

A small dataset to try out Document Classification algorithms

Use case:

Source: Kaggle | Type: Text | Updated: 2022-07-06

computer science nlp computer vision image text multiclass classification

sachinsharma1123/document-classification

Classify the document with correct labels

Use case:

Source: Kaggle | Type: Text | Updated: 2020-07-12

earth and nature computer science

shaz13/real-world-documents-collections

A document type collection from various public datasets

Use case:

Source: Kaggle | Type: Text | Updated: 2020-07-06

earth and nature

shivamb/legal-citation-text-classification

Legal Industry - Citations Text Classification

Use case:

Source: Kaggle | Type: Text | Updated: 2021-11-11

australia government law nlp text

shivamkushwaha/bbc-full-text-document-classification

2225 documents in five categories can be used for clustering and classification.

Use case:

Source: Kaggle | Type: Text | Updated: 2019-01-26

software

SparkAudio/voxbox

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-11

task_categories:text-to-speech language:zh language:en license:cc-by-nc-sa-4.0 size_categories:10M<n<100M format:webdataset modality:audio modality:text library:datasets library:webdataset ...truncated...

starriver030515/FUSION-Finetune-12M

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-10

task_categories:question-answering task_categories:visual-question-answering task_categories:table-question-answering language:en language:zh license:apache-2.0 size_categories:1K<n<10K format:parquet modality:image modality:text ...truncated...

sthabile/noisy-and-rotated-scanned-documents

Can a predictive model be used to recognise the angle of a scanned document?

Use case:

Source: Kaggle | Type: Text | Updated: 2020-03-06

cnn text

sunilthite/text-document-classification-dataset

Text Document Classification Dataset for Classification and Clustering

Use case:

Source: Kaggle | Type: Text | Updated: 2023-12-04

education software text news english

tanishqdublish/text-classification-documentation

Text Document Classification Dataset for Classification and Clustering

Use case:

Source: Kaggle | Type: Text | Updated: 2024-01-12

education software news email and messaging text classification

UCSC-VLAA/MedReason

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-10

license:apache-2.0 size_categories:10K<n<100K format:json modality:text library:datasets library:pandas library:mlcroissant library:polars arxiv:2504.00993 region:us ...truncated...

vevotx/Tahoe-100M

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-08

license:cc0-1.0 size_categories:100M<n<1B format:parquet modality:tabular modality:text modality:timeseries library:datasets library:dask library:mlcroissant library:polars ...truncated...

virtuoussy/Multi-subject-RLVR

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-02

task_categories:question-answering language:en license:apache-2.0 size_categories:100K<n<1M format:parquet modality:text library:datasets library:pandas library:mlcroissant library:polars ...truncated...

wikimedia/wikipedia

Use case:

Source: Hugging Face | Type: Text | Updated: 2024-01-09

task_categories:text-generation task_categories:fill-mask task_ids:language-modeling task_ids:masked-language-modeling language:ab language:ace language:ady language:af language:alt language:am ...truncated...

yahma/alpaca-cleaned

Use case:

Source: Hugging Face | Type: Text | Updated: 2023-04-10

task_categories:text-generation language:en license:cc-by-4.0 size_categories:10K<n<100K format:json modality:text library:datasets library:pandas library:mlcroissant library:polars ...truncated...

yiweilu2033/well-documented-alzheimers-dataset

This is a well-documented, skull-stripped, new MRI dataset. Take what you want.

Use case:

Source: Kaggle | Type: Text | Updated: 2024-12-16

diseases computer vision classification deep learning image