📦 Business Process Automation Datasets

Curated datasets for document AI, workflow automation, and enterprise chatbot development.

ajibawa-2023/Cpp-Code-Large

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-04

task_categories:text-generation language:en license:mit size_categories:1M<n<10M format:json modality:text library:datasets library:dask library:polars library:mlcroissant ...truncated...

alfathterry/bbc-full-text-document-classification

Train your NLP skills with this dataset.

Use case:

Source: Kaggle | Type: Text | Updated: 2024-04-04

education nlp multiclass classification news

allenai/Dolci-Think-SFT-Olmo-Hybrid

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-05

license:odc-by size_categories:1M<n<10M format:parquet format:optimized-parquet modality:text library:datasets library:dask library:polars library:mlcroissant region:us

Anthropic/EconomicIndex

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-05

language:en license:mit arxiv:2503.04761 region:us AI LLM Economic Impacts Anthropic

AudioVisual-Caption/ASID-1M

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-04

task_categories:image-text-to-text language:en license:cc-by-2.0 size_categories:100K<n<1M format:json modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

AweAI-Team/BeyondSWE

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-05

task_categories:text-generation language:en license:cc-by-4.0 size_categories:n<1K format:json modality:3d modality:text library:datasets library:pandas library:polars ...truncated...

AweAI-Team/Scale-SWE

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-05

size_categories:10K<n<100K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant arxiv:2602.09892 region:us

AweAI-Team/Scale-SWE-Distilled

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-28

size_categories:10K<n<100K format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant arxiv:2602.09892 region:us

ayoubcherguelaine/company-documents-dataset

Company Documents Dataset for Classification and Information Retrieval

Use case:

Source: Kaggle | Type: Text | Updated: 2024-05-23

business

bigcode/the-stack-v2

Use case:

Source: Hugging Face | Type: Text | Updated: 2024-04-23

task_categories:text-generation language_creators:crowdsourced language_creators:expert-generated multilinguality:multilingual language:code license:other size_categories:1B<n<10B format:parquet modality:tabular modality:text ...truncated...

BytedTsinghua-SIA/CUDA-Agent-Ops-6K

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-27

task_categories:text-generation language:en license:cc-by-4.0 size_categories:1K<n<10K format:parquet modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

crownelius/Opus-4.6-Reasoning-3300x

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-02

license:apache-2.0 size_categories:1K<n<10K format:parquet modality:text library:datasets library:pandas library:polars library:mlcroissant region:us

dataturks/resume-entities-for-ner

A document annotation dataset to perform NER on resumes.

Use case:

Source: Kaggle | Type: Text | Updated: 2018-07-12

earth and nature biology business linguistics nlp text

dddraxxx/ref-adv-s

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-02

task_categories:visual-question-answering task_categories:object-detection language:en license:cc-by-4.0 size_categories:1K<n<10K format:parquet format:optimized-parquet modality:image modality:text library:datasets ...truncated...

devashishprasad/documnet-layout-recognition-dataset-publaynet-t0

IBM's PubLayNet dataset on Kaggle for document layout recognition.

Use case:

Source: Kaggle | Type: Text | Updated: 2021-06-10

computer science computer vision deep learning cnn image

FINAL-Bench/ALL-Bench-Leaderboard

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-08

task_categories:text-generation task_categories:visual-question-answering task_categories:text-to-image task_categories:text-to-video task_categories:text-to-audio annotations_creators:expert-generated source_datasets:original language:en license:mit size_categories:n<1K ...truncated...

GD-ML/GenMRP

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-06

task_categories:tabular-classification task_categories:graph-ml size_categories:100K<n<1M format:csv modality:text library:datasets library:dask library:polars library:mlcroissant region:us ...truncated...

google/WaxalNLP

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-03

task_categories:automatic-speech-recognition task_categories:text-to-speech language_creators:creator_1 multilinguality:multilingual source_datasets:UGSpeechData source_datasets:DigitalUmuganda/AfriVoice source_datasets:original language:ach language:aka language:amh ...truncated...

HuggingFaceFW/finephrase

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-07

task_categories:text-generation task_ids:language-modeling annotations_creators:machine-generated language_creators:found source_datasets:HuggingFaceFW/fineweb-edu/sample-350BT language:en license:odc-by size_categories:1B<n<10B modality:tabular modality:text ...truncated...

HuggingFaceFW/fineweb

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-07-11

task_categories:text-generation language:en license:odc-by size_categories:10B<n<100B modality:tabular modality:text arxiv:2306.01116 arxiv:2109.07445 arxiv:2406.17557 doi:10.57967/hf/2493 ...truncated...

HuggingFaceFW/fineweb-edu

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-07-11

task_categories:text-generation language:en license:odc-by size_categories:1B<n<10B format:parquet modality:tabular modality:text library:datasets library:dask library:polars ...truncated...

humansintheloop/arabic-documents-ocr-dataset

10K images that are further classified into 12 classes (Invoices, Books, etc.)

Use case:

Source: Kaggle | Type: Text | Updated: 2023-06-07

global business image text middle east arabic

Jackrong/Qwen3.5-reasoning-700x

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-02

task_categories:question-answering language:en license:apache-2.0 size_categories:n<1K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

jensenbaxter/10dataset-text-document-classification

A collection of ~1000 newsgroup documents from 10 different newsgroups

Use case:

Source: Kaggle | Type: Text | Updated: 2020-06-08

earth and nature business online communities

kageneko/legal-case-document-summarization

Use case:

Source: Kaggle | Type: Text | Updated: 2024-03-17

law

karpathy/tinystories-gpt4-clean

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-08

license:cdla-sharing-1.0 size_categories:1M<n<10M format:parquet modality:text library:datasets library:pandas library:polars library:mlcroissant arxiv:2305.07759 region:us

konradb/pfizer-documents

Pfizer-BioNTech vaccine-related data - #pfizerdocuments

Use case:

Source: Kaggle | Type: Text | Updated: 2022-11-02

public health

LeeXiangNO1/DyNativeGaussian_sequence

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-10

license:cc-by-nc-4.0 size_categories:10K<n<100K format:text modality:3d modality:text library:datasets library:mlcroissant region:us

nebius/SWE-rebench-V2

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-03

task_categories:text-generation language:en license:cc-by-4.0 size_categories:10K<n<100K format:parquet modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

nebius/SWE-rebench-V2-PRs

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-03

task_categories:text-generation language:en license:cc-by-4.0 size_categories:100K<n<1M format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant ...truncated...

nenriki/document-clustering

Text similarity and Agglomerative Document Clustering

Use case:

Source: Kaggle | Type: Text | Updated: 2022-03-07

nlp clustering

nohurry/Opus-4.6-Reasoning-3000x-filtered

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-10

license:apache-2.0 size_categories:1K<n<10K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant region:us

nvidia/Nemotron-ClimbMix

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-10-21

task_categories:text-generation language:en license:cc-by-nc-4.0 size_categories:100M<n<1B format:json modality:tabular library:datasets library:dask library:mlcroissant arxiv:2504.13161 ...truncated...

nvidia/Nemotron-Research-GooseReason-0.7M

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-01

language:en license:cc-by-nc-4.0 size_categories:100K<n<1M format:json modality:document modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

nvidia/Nemotron-Terminal-Corpus

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-27

task_categories:question-answering language:en license:cc-by-4.0 size_categories:100K<n<1M format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant ...truncated...

ogutsevda/graph-nucls

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-03

task_categories:graph-ml license:cc-by-nc-sa-4.0 size_categories:n<1K format:imagefolder modality:image library:datasets library:mlcroissant arxiv:2603.00143 region:us histopathology ...truncated...

ogutsevda/graph-pannuke

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-03

task_categories:graph-ml license:cc-by-nc-sa-4.0 size_categories:1K<n<10K format:csv modality:text library:datasets library:dask library:polars library:mlcroissant arxiv:2603.00143 ...truncated...

OmniLottie/MMLottie-2M

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-07

language:en license:cc-by-nc-sa-4.0 size_categories:1M<n<10M format:parquet modality:image modality:text library:datasets library:dask library:polars library:mlcroissant ...truncated...

OmniLottie/MMLottieBench

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-09

language:en license:cc-by-nc-sa-4.0 size_categories:n<1K format:parquet modality:image modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

openai/graphwalks

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-05

license:mit size_categories:1K<n<10K format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant region:us

openai/gsm8k

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-12-20

benchmark:official task_categories:text-generation annotations_creators:crowdsourced language_creators:crowdsourced multilinguality:monolingual source_datasets:original language:en license:mit size_categories:10K<n<100K format:parquet ...truncated...

patrickaudriaz/tobacco3482jpg

Document Structure Learning Dataset

Use case:

Source: Kaggle | Type: Text | Updated: 2019-04-10

earth and nature education image

peteromallet/dataclaw-peteromallet

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-25

task_categories:text-generation language:en license:mit size_categories:n<1K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

reasat/badlad-train

Bengali document layout analysis dataset

Use case:

Source: Kaggle | Type: Text | Updated: 2023-05-06

beginner computer vision deep learning bengali object detection

ritvik1909/document-classification-dataset

A small dataset to try out Document Classification algorithms

Use case:

Source: Kaggle | Type: Text | Updated: 2022-07-06

computer science nlp computer vision image text multiclass classification

Roman1111111/gemini-3-pro-10000x-hard-high-reasoning

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-20

task_categories:question-answering task_categories:text-generation language:en license:mit size_categories:10K<n<100K format:json modality:text library:datasets library:pandas library:polars ...truncated...

Roman1111111/gemini-3.1-pro-hard-high-reasoning

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-21

task_categories:question-answering task_categories:text-generation language:en license:mit size_categories:1K<n<10K format:json modality:text library:datasets library:pandas library:polars ...truncated...

ronantakizawa/github-codereview

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-05

task_categories:text-generation language:en language:code license:mit size_categories:100K<n<1M format:parquet modality:tabular modality:text library:datasets library:dask ...truncated...

ronantakizawa/github-top-code

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-23

task_categories:text-generation language:code license:mit size_categories:1M<n<10M format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant ...truncated...

ronantakizawa/webui

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-28

task_categories:image-to-text task_categories:text-generation task_categories:object-detection language:en license:mit size_categories:10K<n<100K format:parquet format:optimized-parquet modality:image modality:text ...truncated...

roneneldan/TinyStories

Use case:

Source: Hugging Face | Type: Text | Updated: 2024-08-12

task_categories:text-generation language:en license:cdla-sharing-1.0 size_categories:1M<n<10M format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant ...truncated...

sachinsharma1123/document-classification

Classify documents with the correct labels

Use case:

Source: Kaggle | Type: Text | Updated: 2020-07-12

earth and nature computer science

ScaleAI/SWE-Atlas-QnA

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-04

size_categories:n<1K format:csv modality:text library:datasets library:pandas library:polars library:mlcroissant region:us

shaz13/real-world-documents-collections

A document type collection from various public datasets

Use case:

Source: Kaggle | Type: Text | Updated: 2020-07-06

earth and nature

shivamkushwaha/bbc-full-text-document-classification

2225 documents in five categories can be used for clustering and classification.

Use case:

Source: Kaggle | Type: Text | Updated: 2019-01-26

software

skylenage/DeepVision-103K

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-26

task_categories:image-text-to-text language:en license:mit size_categories:100K<n<1M format:parquet format:optimized-parquet modality:image modality:text library:datasets library:pandas ...truncated...

sthabile/noisy-and-rotated-scanned-documents

Can a predictive model be used to recognise the angle of a scanned document?

Use case:

Source: Kaggle | Type: Text | Updated: 2020-03-06

cnn text

sunilthite/text-document-classification-dataset

Text Document Classification Dataset for Classification and Clustering

Use case:

Source: Kaggle | Type: Text | Updated: 2023-12-04

education software text news english

tanishqdublish/text-classification-documentation

Text Document Classification Dataset for Classification and Clustering

Use case:

Source: Kaggle | Type: Text | Updated: 2024-01-12

education software news email and messaging text classification

TeichAI/claude-4.5-opus-high-reasoning-250x

Use case:

Source: Hugging Face | Type: Text | Updated: 2025-11-28

size_categories:n<1K format:json modality:text library:datasets library:pandas library:mlcroissant library:polars region:us

TianHongZXY/CHIMERA

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-03

task_categories:text-generation task_categories:question-answering annotations_creators:machine-generated language:en license:apache-2.0 size_categories:10K<n<100K format:parquet format:optimized-parquet modality:text library:datasets ...truncated...

TIGER-Lab/MMLU-Pro

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-01-19

benchmark:official task_categories:question-answering language:en license:mit size_categories:10K<n<100K format:parquet modality:tabular modality:text library:datasets library:pandas ...truncated...

togethercomputer/CoderForge-Preview

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-02-26

size_categories:100K<n<1M format:parquet format:optimized-parquet modality:text library:datasets library:dask library:polars library:mlcroissant region:us

TuringEnterprises/Open-RL

Use case:

Source: Hugging Face | Type: Text | Updated: 2026-03-04

task_categories:question-answering language:en license:mit size_categories:n<1K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

vafaeii/open-scilay

Large-scale dataset for Robust OCR, Layout Analysis, and VLM Pre-training

Use case:

Source: Kaggle | Type: Text | Updated: 2026-02-18

computer vision image text image-to-text synthetic

yiweilu2033/well-documented-alzheimers-dataset

This is a new, well-documented, skull-stripped MRI dataset. Take what you want.

Use case:

Source: Kaggle | Type: Text | Updated: 2024-12-16

diseases computer vision classification deep learning image