📦 Business Process Automation Datasets

Curated datasets for document AI, workflow automation, and enterprise chatbot development.

alfathterry/bbc-full-text-document-classification

Train your NLP skills with this dataset.

Source: Kaggle | Type: Text | Updated: 2024-04-04

education nlp multiclass classification news

AlicanKiraz0/Turkish-Finance-SFT-Dataset

Source: Hugging Face | Type: Text | Updated: 2026-02-12

task_categories:question-answering language:tr license:mit size_categories:1K<n<10K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

allenai/molmospaces

Source: Hugging Face | Type: Text | Updated: 2026-02-16

license:odc-by license:cc-by-4.0 size_categories:100K<n<1M format:parquet format:optimized-parquet modality:tabular modality:text library:datasets library:pandas library:polars ...truncated...

allenai/olmOCR-bench

Source: Hugging Face | Type: Text | Updated: 2026-02-19

benchmark:official benchmark:eval-yaml language:en license:odc-by size_categories:1K<n<10K modality:document modality:text arxiv:2502.18443 region:us text

atreydesai/qgqa-gpt-5.2-20260213-041705

Source: Hugging Face | Type: Text | Updated: 2026-02-13

size_categories:1K<n<10K format:parquet format:optimized-parquet modality:tabular modality:text library:datasets library:pandas library:polars library:mlcroissant region:us

ayoubcherguelaine/company-documents-dataset

Company Documents Dataset for Classification and Information Retrieval

Source: Kaggle | Type: Text | Updated: 2024-05-23

business

cais/hle

Source: Hugging Face | Type: Text | Updated: 2026-01-20

benchmark:official benchmark:eval-yaml license:mit size_categories:1K<n<10K format:parquet modality:image modality:text library:datasets library:pandas library:polars ...truncated...

commoncrawl/CommonLID

Source: Hugging Face | Type: Text | Updated: 2026-02-10

task_categories:text-classification language:ace language:acf language:aeb language:afr language:amh language:apd language:ara language:arb language:arg ...truncated...

DataMuncher-Labs/UltiMath

Source: Hugging Face | Type: Text | Updated: 2026-01-18

task_categories:text-generation language:en license:cc-by-sa-4.0 size_categories:10B<n<100B format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant ...truncated...

dataturks/resume-entities-for-ner

A document annotation dataset to perform NER on resumes.

Source: Kaggle | Type: Text | Updated: 2018-07-12

earth and nature biology business linguistics nlp text

deepgenteam/DeepGen-1.0

Source: Hugging Face | Type: Text | Updated: 2026-02-13

license:apache-2.0 size_categories:n<1K format:imagefolder modality:image library:datasets library:mlcroissant arxiv:2602.12205 region:us

devashishprasad/documnet-layout-recognition-dataset-publaynet-t0

IBM's PubLayNet dataset on Kaggle for document layout recognition.

Source: Kaggle | Type: Text | Updated: 2021-06-10

computer science computer vision deep learning cnn image

futurehouse/labbench2

Source: Hugging Face | Type: Text | Updated: 2026-02-12

task_categories:question-answering license:cc-by-sa-4.0 size_categories:1K<n<10K format:parquet format:optimized-parquet modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

FutureMa/EvasionBench

Source: Hugging Face | Type: Text | Updated: 2026-02-19

benchmark:official benchmark:eval-yaml task_categories:text-classification language:en license:apache-2.0 size_categories:10K<n<100K format:parquet modality:text library:datasets library:pandas ...truncated...

galaxyMindAiLabs/stem-reasoning-complex

Source: Hugging Face | Type: Text | Updated: 2026-02-15

task_categories:text-generation task_categories:question-answering language:en language:zh license:apache-2.0 size_categories:100K<n<1M format:parquet modality:text library:datasets library:dask ...truncated...

GD-ML/IntTravel_dataset

Source: Hugging Face | Type: Text | Updated: 2026-02-19

task_categories:other size_categories:1B<n<10B format:csv modality:text library:datasets library:dask library:polars library:mlcroissant arxiv:2602.11664 region:us ...truncated...

google/MapTrace

Source: Hugging Face | Type: Text | Updated: 2026-01-03

task_categories:image-to-text language:en license:cc-by-4.0 size_categories:10K<n<100K format:parquet modality:text library:datasets library:dask library:polars library:mlcroissant ...truncated...

google/WaxalNLP

Source: Hugging Face | Type: Text | Updated: 2026-02-18

task_categories:automatic-speech-recognition task_categories:text-to-speech language_creators:creator_1 multilinguality:multilingual source_datasets:UGSpeechData source_datasets:DigitalUmuganda/AfriVoice source_datasets:original language:ach language:aka language:amh ...truncated...

HuggingFaceFW/fineweb

Source: Hugging Face | Type: Text | Updated: 2025-07-11

task_categories:text-generation language:en license:odc-by size_categories:10B<n<100B modality:tabular modality:text arxiv:2306.01116 arxiv:2109.07445 arxiv:2406.17557 doi:10.57967/hf/2493 ...truncated...

HuggingFaceFW/fineweb-edu

Source: Hugging Face | Type: Text | Updated: 2025-07-11

task_categories:text-generation language:en license:odc-by size_categories:1B<n<10B format:parquet modality:tabular modality:text library:datasets library:dask library:polars ...truncated...

humansintheloop/arabic-documents-ocr-dataset

10K images that are further classified into 12 classes (Invoices, Books, etc.)

Source: Kaggle | Type: Text | Updated: 2023-06-07

global business image text middle east arabic

Idavidrein/gpqa

Source: Hugging Face | Type: Text | Updated: 2026-01-22

benchmark:official task_categories:question-answering task_categories:text-generation language:en license:cc-by-4.0 size_categories:1K<n<10K format:csv modality:tabular modality:text library:datasets ...truncated...

ILSVRC/imagenet-1k

Source: Hugging Face | Type: Text | Updated: 2025-09-17

task_categories:image-classification task_ids:multi-class-image-classification annotations_creators:crowdsourced language_creators:crowdsourced multilinguality:monolingual source_datasets:original language:en license:other size_categories:1M<n<10M format:parquet ...truncated...

jensenbaxter/10dataset-text-document-classification

A collection of ~1000 newsgroup documents from 10 different newsgroups

Source: Kaggle | Type: Text | Updated: 2020-06-08

earth and nature business online communities

kageneko/legal-case-document-summarization

Source: Kaggle | Type: Text | Updated: 2024-03-17

law

kensho/PubTables-v2

Source: Hugging Face | Type: Text | Updated: 2026-02-12

license:cdla-permissive-2.0 size_categories:1M<n<10M format:webdataset modality:image modality:text library:datasets library:webdataset library:mlcroissant arxiv:2512.10888 region:us

keyushnisar/legal-docu

Formats, Usage, and Prerequisites

Source: Kaggle | Type: Text | Updated: 2025-03-04

education law intermediate text generation

konradb/pfizer-documents

Pfizer-BioNTech vaccine-related data - #pfizerdocuments

Source: Kaggle | Type: Text | Updated: 2022-11-02

public health

lm-provers/FineProofs-SFT

Source: Hugging Face | Type: Text | Updated: 2026-02-14

task_categories:text-generation task_categories:question-answering language:en license:apache-2.0 size_categories:10K<n<100K format:parquet format:optimized-parquet modality:tabular modality:text library:datasets ...truncated...

ma-xu/fine-t2i

Source: Hugging Face | Type: Text | Updated: 2026-02-20

task_categories:image-to-text task_categories:text-to-image language:en license:apache-2.0 size_categories:100K<n<1M format:webdataset modality:image modality:text library:datasets library:webdataset ...truncated...

markov-ai/computer-use

Source: Hugging Face | Type: Text | Updated: 2026-02-13

task_categories:robotics task_categories:image-to-text license:apache-2.0 size_categories:n<1K format:parquet format:optimized-parquet modality:image modality:text modality:timeseries modality:video ...truncated...

MathArena/aime_2026

Source: Hugging Face | Type: Text | Updated: 2026-02-16

benchmark:official benchmark:eval-yaml language:en license:cc-by-nc-sa-4.0 size_categories:n<1K format:parquet format:optimized-parquet modality:tabular modality:text library:datasets ...truncated...

moonworks/lunara-aesthetic-image-variations

Source: Hugging Face | Type: Text | Updated: 2026-02-09

task_categories:image-to-image task_categories:text-to-image license:apache-2.0 size_categories:1K<n<10K format:parquet modality:image modality:text library:datasets library:dask library:polars ...truncated...

Nanbeige/ToolMind-Web-QA

Source: Hugging Face | Type: Text | Updated: 2026-02-19

task_categories:text-generation language:en license:apache-2.0 arxiv:2602.13367 region:us synthetic deep search

nenriki/document-clustering

Text similarity and Agglomerative Document Clustering

Source: Kaggle | Type: Text | Updated: 2022-03-07

nlp clustering

nohurry/Opus-4.6-Reasoning-3000x-filtered

Source: Hugging Face | Type: Text | Updated: 2026-02-10

license:apache-2.0 size_categories:1K<n<10K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant region:us

nvidia/SAGE-10k

Source: Hugging Face | Type: Text | Updated: 2026-02-11

task_categories:text-to-3d language:en license:apache-2.0 size_categories:10K<n<100K arxiv:2602.10116 region:us Scene-Generation Interactive-Scenes Embodied-AI Scene-Understanding ...truncated...

nyuuzyou/suno

Source: Hugging Face | Type: Text | Updated: 2026-02-03

task_categories:audio-classification task_categories:text-to-audio annotations_creators:found multilinguality:multilingual source_datasets:original language:en language:ja language:multilingual license:cc0-1.0 size_categories:100K<n<1M ...truncated...

openbmb/Ultra-FineWeb

Source: Hugging Face | Type: Text | Updated: 2025-12-10

task_categories:text-generation language:en language:zh license:apache-2.0 size_categories:1B<n<10B modality:text arxiv:2505.05427 arxiv:2506.07900 arxiv:2412.04315 region:us ...truncated...

openbmb/Ultra-FineWeb-L3

Source: Hugging Face | Type: Text | Updated: 2026-02-09

task_categories:text-generation language:en language:zh license:apache-2.0 size_categories:n<1K format:json modality:text library:datasets library:pandas library:polars ...truncated...

openbmb/UltraData-Math

Source: Hugging Face | Type: Text | Updated: 2026-02-10

task_categories:text-generation language:en language:zh license:apache-2.0 size_categories:100M<n<1B format:parquet modality:text library:datasets library:dask library:polars ...truncated...

opencsg/Fineweb-Edu-Chinese-V2.2

Source: Hugging Face | Type: Text | Updated: 2026-02-02

task_categories:text-generation task_categories:question-answering language:zh license:apache-2.0 size_categories:10B<n<100B arxiv:2501.08197 arxiv:2305.11206 arxiv:2305.15717 arxiv:2307.01850 arxiv:2307.08701 ...truncated...

OpenDriveLab-org/Kai0

Source: Hugging Face | Type: Text | Updated: 2026-02-14

task_categories:robotics license:cc-by-nc-sa-4.0 size_categories:1K<n<10K format:parquet modality:tabular modality:timeseries modality:video library:datasets library:pandas library:polars ...truncated...

openfoodfacts/product-database

Source: Hugging Face | Type: Text | Updated: 2026-02-19

language:en language:fr language:de language:es language:it language:nl language:pl language:pt language:sv language:bg ...truncated...

OpenMed/Medical-Reasoning-SFT-Mega

Source: Hugging Face | Type: Text | Updated: 2026-02-06

task_categories:text-generation task_categories:question-answering language:en license:apache-2.0 size_categories:1M<n<10M format:parquet format:optimized-parquet modality:text library:datasets library:dask ...truncated...

OpenResearcher/OpenResearcher-Dataset

Source: Hugging Face | Type: Text | Updated: 2026-02-12

license:mit size_categories:10K<n<100K format:parquet format:optimized-parquet modality:tabular modality:text library:datasets library:dask library:polars library:mlcroissant ...truncated...

patrickaudriaz/tobacco3482jpg

Document Structure Learning Dataset

Source: Kaggle | Type: Text | Updated: 2019-04-10

earth and nature education image

perplexity-ai/draco

Source: Hugging Face | Type: Text | Updated: 2026-02-13

language:en license:mit size_categories:n<1K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant arxiv:2602.11685 ...truncated...

PleIAs/common_corpus

Source: Hugging Face | Type: Text | Updated: 2026-02-19

language:en language:fr language:de language:zh language:it language:es language:ja language:pl language:la language:nl ...truncated...

princeton-nlp/SWE-bench_Verified

Source: Hugging Face | Type: Text | Updated: 2025-02-18

size_categories:n<1K format:parquet modality:text library:datasets library:pandas library:mlcroissant library:polars region:us

reasat/badlad-train

Bengali document layout analysis dataset

Source: Kaggle | Type: Text | Updated: 2023-05-06

beginner computer vision deep learning bengali object detection

ritvik1909/document-classification-dataset

A small dataset to try out Document Classification algorithms

Source: Kaggle | Type: Text | Updated: 2022-07-06

computer science nlp computer vision image text multiclass classification

sachinsharma1123/document-classification

Classify documents with the correct labels.

Source: Kaggle | Type: Text | Updated: 2020-07-12

earth and nature computer science

shaz13/real-world-documents-collections

A document type collection from various public datasets

Source: Kaggle | Type: Text | Updated: 2020-07-06

earth and nature

shivamkushwaha/bbc-full-text-document-classification

2225 documents in five categories can be used for clustering and classification.

Source: Kaggle | Type: Text | Updated: 2019-01-26

software

sthabile/noisy-and-rotated-scanned-documents

Can a predictive model be used to recognise the angle of a scanned document?

Source: Kaggle | Type: Text | Updated: 2020-03-06

cnn text

sunilthite/text-document-classification-dataset

Text Document Classification Dataset for Classification and Clustering

Source: Kaggle | Type: Text | Updated: 2023-12-04

education software text news english

tanishqdublish/text-classification-documentation

Text Document Classification Dataset for Classification and Clustering

Source: Kaggle | Type: Text | Updated: 2024-01-12

education software news email and messaging text classification

TeichAI/claude-4.5-opus-high-reasoning-250x

Source: Hugging Face | Type: Text | Updated: 2025-11-28

size_categories:n<1K format:json modality:text library:datasets library:pandas library:mlcroissant library:polars region:us

TeichAI/Pony-Alpha-15k

Source: Hugging Face | Type: Text | Updated: 2026-02-17

size_categories:10K<n<100K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant region:us

tencent/CL-bench

Source: Hugging Face | Type: Text | Updated: 2026-02-06

task_categories:text-generation language:en license:other size_categories:1K<n<10K format:json modality:text library:datasets library:pandas library:polars library:mlcroissant ...truncated...

TIGER-Lab/MMLU-Pro

Source: Hugging Face | Type: Text | Updated: 2026-01-19

benchmark:official task_categories:question-answering language:en license:mit size_categories:10K<n<100K format:parquet modality:tabular modality:text library:datasets library:pandas ...truncated...

uv-scripts/ocr

Source: Hugging Face | Type: Text | Updated: 2026-02-19

region:us uv-script ocr vision-language-model document-processing hf-jobs

yiweilu2033/well-documented-alzheimers-dataset

This is a well-documented, skull-stripped, new MRI dataset. Take what you want.

Source: Kaggle | Type: Text | Updated: 2024-12-16

diseases computer vision classification deep learning image