📦 Business Process Automation Datasets

Curated datasets for document AI, workflow automation, and enterprise chatbot development.

aanonyyy/F5I9N7A1

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-10-22

size_categories:100K<n<1M format:parquet modality:audio modality:text library:datasets library:dask library:mlcroissant library:polars region:us

alfathterry/bbc-full-text-document-classification

Train your NLP skills with this dataset

Use case:

Source: Kaggle | Type: Text | Updated: 2024-04-04

education nlp multiclass classification news

Anthropic/hh-rlhf

Use case:

Source: Hugging Face | Type: Text | Updated: 2023-05-26

license:mit size_categories:100K<n<1M format:json modality:text library:datasets library:dask library:mlcroissant library:polars arxiv:2204.05862 region:us ...truncated...

APRIL-AIGC/Soul-Bench

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-16

task_categories:image-to-video size_categories:n<1K format:text modality:audio modality:image modality:text modality:video library:datasets library:mlcroissant arxiv:2512.13495 ...truncated...

ayoubcherguelaine/company-documents-dataset

Company Documents Dataset for Classification and Information Retrieval

Use case:

Source: Kaggle | Type: Text | Updated: 2024-05-23

business

bshada/open-schematics

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-17

task_categories:text-generation task_categories:image-to-text task_categories:text-to-image language:en license:cc-by-4.0 size_categories:10K<n<100K format:parquet modality:image modality:text library:datasets ...truncated...

cais/hle

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-09-10

license:mit size_categories:1K<n<10K format:parquet modality:image modality:text library:datasets library:pandas library:mlcroissant library:polars region:us

datapizza-ai-lab/salaries

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-28

language:it license:cc-by-nc-4.0 size_categories:10K<n<100K format:json modality:tabular modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

dataturks/resume-entities-for-ner

A document annotation dataset to perform NER on resumes.

Use case:

Source: Kaggle | Type: Text | Updated: 2018-07-12

earth and nature biology business linguistics nlp text

devashishprasad/documnet-layout-recognition-dataset-publaynet-t0

IBM's PubLayNet dataset on Kaggle for document layout recognition.

Use case:

Source: Kaggle | Type: Text | Updated: 2021-06-10

computer science computer vision deep learning cnn image

facebook/research-plan-gen

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-28

size_categories:10K<n<100K format:parquet modality:text library:datasets library:pandas library:polars library:mlcroissant region:us

facebook/sam-audio-bench

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-29

language:en license:cc-by-nc-4.0 size_categories:n<1K format:parquet modality:tabular modality:text library:datasets library:pandas library:mlcroissant library:polars ...truncated...

FBK-MT/MCIF

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-15

task_categories:automatic-speech-recognition task_categories:question-answering task_categories:summarization task_categories:visual-question-answering task_categories:translation language:en language:de language:it language:zh license:cc-by-4.0 ...truncated...

fka/awesome-chatgpt-prompts

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-31

task_categories:question-answering task_categories:text-generation license:cc0-1.0 size_categories:n<1K format:csv modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

FreedomIntelligence/medical-o1-reasoning-SFT

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-04-22

task_categories:question-answering task_categories:text-generation language:en language:zh license:apache-2.0 size_categories:10K<n<100K format:json modality:text library:datasets library:pandas ...truncated...

FutureMa/DramaBench

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-29

task_categories:text-generation language:en license:mit size_categories:n<1K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

gaia-benchmark/GAIA

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-10-28

language:en size_categories:n<1K format:parquet modality:audio modality:document modality:image modality:text library:datasets library:pandas library:mlcroissant ...truncated...

google/deepsearchqa

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-17

task_categories:question-answering language:en license:apache-2.0 size_categories:n<1K format:csv modality:text library:datasets library:pandas library:mlcroissant library:polars ...truncated...

google/mobile-actions

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-18

language:en license:cc-by-4.0 size_categories:1K<n<10K format:json modality:text library:datasets library:pandas library:mlcroissant library:polars region:us ...truncated...

HeshamHaroon/Arabic_Function_Calling

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-14

task_categories:text-generation task_categories:question-answering language:ar license:apache-2.0 size_categories:10K<n<100K format:json modality:text library:datasets library:pandas library:mlcroissant ...truncated...

HiDream-ai/ReCo-Data

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-26

task_categories:image-to-video language:en license:cc-by-nc-sa-4.0 size_categories:1M<n<10M format:webdataset modality:text library:datasets library:webdataset library:mlcroissant arxiv:2512.17650 ...truncated...

hotpotqa/hotpot_qa

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-08-11

task_categories:question-answering annotations_creators:crowdsourced language_creators:found multilinguality:monolingual source_datasets:original language:en license:cc-by-sa-4.0 size_categories:100K<n<1M format:parquet modality:text ...truncated...

HuggingFaceFW/fineweb

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-07-11

task_categories:text-generation language:en license:odc-by size_categories:10B<n<100B modality:tabular modality:text arxiv:2306.01116 arxiv:2109.07445 arxiv:2406.17557 doi:10.57967/hf/2493 ...truncated...

HuggingFaceH4/MATH-500

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-15

benchmark:official task_categories:text-generation language:en size_categories:n<1K format:json modality:text library:datasets library:pandas library:mlcroissant library:polars ...truncated...

humansintheloop/arabic-documents-ocr-dataset

10K images that are further classified into 12 classes (Invoices, Books, etc.)

Use case:

Source: Kaggle | Type: Text | Updated: 2023-06-07

global business image text middle east arabic

Idavidrein/gpqa

Use case:

Source: Hugging Face | Type: Text | Updated: 2024-03-28

task_categories:question-answering task_categories:text-generation language:en license:cc-by-4.0 size_categories:1K<n<10K format:csv modality:tabular modality:text library:datasets library:pandas ...truncated...

ILSVRC/imagenet-1k

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-09-17

task_categories:image-classification task_ids:multi-class-image-classification annotations_creators:crowdsourced language_creators:crowdsourced multilinguality:monolingual source_datasets:original language:en license:other size_categories:1M<n<10M format:parquet ...truncated...

jensenbaxter/10dataset-text-document-classification

A collection of ~1000 newsgroup documents from 10 different newsgroups

Use case:

Source: Kaggle | Type: Text | Updated: 2020-06-08

earth and nature business online communities

kageneko/legal-case-document-summarization

Use case:

Source: Kaggle | Type: Text | Updated: 2024-03-17

law

keyushnisar/legal-docu

Formats, Usage, and Prerequisites

Use case:

Source: Kaggle | Type: Text | Updated: 2025-03-04

education law intermediate text generation

konradb/pfizer-documents

Pfizer-BioNTech vaccine-related data - #pfizerdocuments

Use case:

Source: Kaggle | Type: Text | Updated: 2022-11-02

public health

Lewandofski/OpenVE-3M

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-25

license:cc-by-nc-4.0 size_categories:1M<n<10M format:webdataset modality:text modality:video library:datasets library:webdataset library:mlcroissant arxiv:2512.07826 region:us ...truncated...

lirang04/GroundingME

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-26

task_categories:zero-shot-object-detection license:other size_categories:1K<n<10K format:parquet format:optimized-parquet modality:image modality:text library:datasets library:dask library:polars ...truncated...

LLM360/TxT360-3efforts

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-11

license:cc-by-4.0 size_categories:1M<n<10M format:parquet modality:text library:datasets library:dask library:mlcroissant library:polars arxiv:2512.06201 region:us

MiniMaxAI/VIBE

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-23

task_categories:text-generation language:en license:mit size_categories:n<1K format:parquet modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

Mxode/Chinese-Instruct

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-05-09

task_categories:text-generation task_categories:question-answering language:zh license:cc-by-sa-4.0 size_categories:1M<n<10M format:json modality:text library:datasets library:dask library:mlcroissant ...truncated...

nebius/SWE-rebench-openhands-trajectories

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-27

license:cc-by-4.0 size_categories:10K<n<100K format:parquet modality:tabular modality:text library:datasets library:pandas library:polars library:mlcroissant region:us ...truncated...

nenriki/document-clustering

Text similarity and Agglomerative Document Clustering

Use case:

Source: Kaggle | Type: Text | Updated: 2022-03-07

nlp clustering

nvidia/Nemotron-CC-v2.1

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-22

task_categories:text-generation license:other size_categories:1B<n<10B format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant arxiv:2508.14444 ...truncated...

nvidia/Nemotron-Instruction-Following-Chat-v1

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-15

language:en license:cc-by-4.0 size_categories:100K<n<1M format:json modality:text library:datasets library:pandas library:mlcroissant library:polars region:us

nvidia/Nemotron-Math-Proofs-v1

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-18

language:en license:cc-by-sa-4.0 size_categories:100K<n<1M format:json modality:text library:datasets library:pandas library:mlcroissant library:polars arxiv:2512.15489 ...truncated...

nvidia/Nemotron-Pretraining-Code-v2

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-22

task_categories:text-generation license:other size_categories:100M<n<1B format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant arxiv:2508.14444 ...truncated...

nvidia/Nemotron-Pretraining-Specialized-v1

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-22

task_categories:text-generation license:cc-by-4.0 size_categories:10M<n<100M format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant arxiv:2508.14444 ...truncated...

openai/frontierscience

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-16

license:apache-2.0 size_categories:n<1K format:json modality:text library:datasets library:dask library:mlcroissant region:us

openai/gdpval

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-09-25

size_categories:n<1K format:parquet modality:audio modality:document modality:image modality:text modality:video library:datasets library:pandas library:mlcroissant ...truncated...

openai/gsm8k

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-20

benchmark:official task_categories:text-generation annotations_creators:crowdsourced language_creators:crowdsourced multilinguality:monolingual source_datasets:original language:en license:mit size_categories:10K<n<100K format:parquet ...truncated...

OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-12

task_categories:text-generation language:en license:apache-2.0 size_categories:100K<n<1M format:parquet modality:text library:datasets library:dask library:mlcroissant library:polars ...truncated...

patrickaudriaz/tobacco3482jpg

Document Structure Learning Dataset

Use case:

Source: Kaggle | Type: Text | Updated: 2019-04-10

earth and nature education image

ritvik1909/document-classification-dataset

A small dataset to try out Document Classification algorithms

Use case:

Source: Kaggle | Type: Text | Updated: 2022-07-06

computer science nlp computer vision image text multiclass classification

ronantakizawa/github-top-developers

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-20

task_categories:text-classification task_categories:time-series-forecasting task_categories:text-retrieval language:en license:mit size_categories:10K<n<100K format:csv modality:tabular modality:text library:datasets ...truncated...

roneneldan/TinyStories

Use case:

Source: Hugging Face | Type: Text | Updated: 2024-08-12

task_categories:text-generation language:en license:cdla-sharing-1.0 size_categories:1M<n<10M format:parquet modality:text library:datasets library:dask library:mlcroissant library:polars ...truncated...

sachinsharma1123/document-classification

Classify the document with the correct labels

Use case:

Source: Kaggle | Type: Text | Updated: 2020-07-12

earth and nature computer science

scrapegraphai/scrapegraphai-100k

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-21

language:en license:apache-2.0 size_categories:10K<n<100K format:parquet modality:tabular modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

shaz13/real-world-documents-collections

A document type collection from various public datasets

Use case:

Source: Kaggle | Type: Text | Updated: 2020-07-06

earth and nature

shivamkushwaha/bbc-full-text-document-classification

2225 documents in five categories can be used for clustering and classification.

Use case:

Source: Kaggle | Type: Text | Updated: 2019-01-26

software

spatialverse/SAGE-3D_Collision_Mesh

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-17

task_categories:robotics task_categories:image-to-3d license:cc-by-nc-4.0 size_categories:n<1K format:imagefolder modality:image library:datasets library:mlcroissant arxiv:2510.21307 region:us

spatialverse/SAGE-3D_InteriorGS_usdz

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-17

task_categories:robotics task_categories:image-to-3d language:en license:cc-by-nc-4.0 size_categories:n<1K format:imagefolder modality:3d modality:image library:datasets library:mlcroissant ...truncated...

sthabile/noisy-and-rotated-scanned-documents

Can a predictive model be used to recognise the angle of a scanned document?

Use case:

Source: Kaggle | Type: Text | Updated: 2020-03-06

cnn text

sunilthite/text-document-classification-dataset

Text Document Classification Dataset for Classification and Clustering

Use case:

Source: Kaggle | Type: Text | Updated: 2023-12-04

education software text news english

tanishqdublish/text-classification-documentation

Text Document Classification Dataset for Classification and Clustering

Use case:

Source: Kaggle | Type: Text | Updated: 2024-01-12

education software news email and messaging text classification

TeichAI/claude-4.5-opus-high-reasoning-250x

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-11-28

size_categories:n<1K format:json modality:text library:datasets library:pandas library:mlcroissant library:polars region:us

wikimedia/wikipedia

Use case:

Source: Hugging Face | Type: Text | Updated: 2024-01-09

task_categories:text-generation task_categories:fill-mask task_ids:language-modeling task_ids:masked-language-modeling language:ab language:ace language:ady language:af language:alt language:am ...truncated...

yiweilu2033/well-documented-alzheimers-dataset

This is a well-documented, skull-stripped, new MRI dataset. Take what you want.

Use case:

Source: Kaggle | Type: Text | Updated: 2024-12-16

diseases computer vision classification deep learning image