Business Process Automation Datasets
Curated datasets for document AI, workflow automation, and enterprise chatbot development.
alfathterry/bbc-full-text-document-classification
Train your NLP skills with this dataset.
Source: Kaggle | Type: Text | Updated: 2024-04-04
education
nlp
multiclass classification
news
AlicanKiraz0/Turkish-Finance-SFT-Dataset
Source: Hugging Face | Type: Text | Updated: 2026-02-12
task_categories:question-answering
language:tr
license:mit
size_categories:1K<n<10K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
allenai/molmospaces
Source: Hugging Face | Type: Text | Updated: 2026-02-16
license:odc-by
license:cc-by-4.0
size_categories:100K<n<1M
format:parquet
format:optimized-parquet
modality:tabular
modality:text
library:datasets
library:pandas
library:polars
...truncated...
allenai/olmOCR-bench
Source: Hugging Face | Type: Text | Updated: 2026-02-19
benchmark:official
benchmark:eval-yaml
language:en
license:odc-by
size_categories:1K<n<10K
modality:document
modality:text
arxiv:2502.18443
region:us
text
atreydesai/qgqa-gpt-5.2-20260213-041705
Source: Hugging Face | Type: Text | Updated: 2026-02-13
size_categories:1K<n<10K
format:parquet
format:optimized-parquet
modality:tabular
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
region:us
ayoubcherguelaine/company-documents-dataset
Company Documents Dataset for Classification and Information Retrieval
Source: Kaggle | Type: Text | Updated: 2024-05-23
business
cais/hle
Source: Hugging Face | Type: Text | Updated: 2026-01-20
benchmark:official
benchmark:eval-yaml
license:mit
size_categories:1K<n<10K
format:parquet
modality:image
modality:text
library:datasets
library:pandas
library:polars
...truncated...
commoncrawl/CommonLID
Source: Hugging Face | Type: Text | Updated: 2026-02-10
task_categories:text-classification
language:ace
language:acf
language:aeb
language:afr
language:amh
language:apd
language:ara
language:arb
language:arg
...truncated...
DataMuncher-Labs/UltiMath
Source: Hugging Face | Type: Text | Updated: 2026-01-18
task_categories:text-generation
language:en
license:cc-by-sa-4.0
size_categories:10B<n<100B
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
...truncated...
dataturks/resume-entities-for-ner
A document annotation dataset to perform NER on resumes.
Source: Kaggle | Type: Text | Updated: 2018-07-12
earth and nature
biology
business
linguistics
nlp
text
deepgenteam/DeepGen-1.0
Source: Hugging Face | Type: Text | Updated: 2026-02-13
license:apache-2.0
size_categories:n<1K
format:imagefolder
modality:image
library:datasets
library:mlcroissant
arxiv:2602.12205
region:us
devashishprasad/documnet-layout-recognition-dataset-publaynet-t0
IBM's PubLayNet dataset on Kaggle for document layout recognition.
Source: Kaggle | Type: Text | Updated: 2021-06-10
computer science
computer vision
deep learning
cnn
image
futurehouse/labbench2
Source: Hugging Face | Type: Text | Updated: 2026-02-12
task_categories:question-answering
license:cc-by-sa-4.0
size_categories:1K<n<10K
format:parquet
format:optimized-parquet
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
FutureMa/EvasionBench
Source: Hugging Face | Type: Text | Updated: 2026-02-19
benchmark:official
benchmark:eval-yaml
task_categories:text-classification
language:en
license:apache-2.0
size_categories:10K<n<100K
format:parquet
modality:text
library:datasets
library:pandas
...truncated...
galaxyMindAiLabs/stem-reasoning-complex
Source: Hugging Face | Type: Text | Updated: 2026-02-15
task_categories:text-generation
task_categories:question-answering
language:en
language:zh
license:apache-2.0
size_categories:100K<n<1M
format:parquet
modality:text
library:datasets
library:dask
...truncated...
GD-ML/IntTravel_dataset
Source: Hugging Face | Type: Text | Updated: 2026-02-19
task_categories:other
size_categories:1B<n<10B
format:csv
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
arxiv:2602.11664
region:us
...truncated...
google/MapTrace
Source: Hugging Face | Type: Text | Updated: 2026-01-03
task_categories:image-to-text
language:en
license:cc-by-4.0
size_categories:10K<n<100K
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
...truncated...
google/WaxalNLP
Source: Hugging Face | Type: Text | Updated: 2026-02-18
task_categories:automatic-speech-recognition
task_categories:text-to-speech
language_creators:creator_1
multilinguality:multilingual
source_datasets:UGSpeechData
source_datasets:DigitalUmuganda/AfriVoice
source_datasets:original
language:ach
language:aka
language:amh
...truncated...
HuggingFaceFW/fineweb
Source: Hugging Face | Type: Text | Updated: 2025-07-11
task_categories:text-generation
language:en
license:odc-by
size_categories:10B<n<100B
modality:tabular
modality:text
arxiv:2306.01116
arxiv:2109.07445
arxiv:2406.17557
doi:10.57967/hf/2493
...truncated...
HuggingFaceFW/fineweb-edu
Source: Hugging Face | Type: Text | Updated: 2025-07-11
task_categories:text-generation
language:en
license:odc-by
size_categories:1B<n<10B
format:parquet
modality:tabular
modality:text
library:datasets
library:dask
library:polars
...truncated...
humansintheloop/arabic-documents-ocr-dataset
10K images that are further classified into 12 classes (Invoices, Books, etc.)
Source: Kaggle | Type: Text | Updated: 2023-06-07
global
business
image
text
middle east
arabic
Idavidrein/gpqa
Source: Hugging Face | Type: Text | Updated: 2026-01-22
benchmark:official
task_categories:question-answering
task_categories:text-generation
language:en
license:cc-by-4.0
size_categories:1K<n<10K
format:csv
modality:tabular
modality:text
library:datasets
...truncated...
ILSVRC/imagenet-1k
Source: Hugging Face | Type: Text | Updated: 2025-09-17
task_categories:image-classification
task_ids:multi-class-image-classification
annotations_creators:crowdsourced
language_creators:crowdsourced
multilinguality:monolingual
source_datasets:original
language:en
license:other
size_categories:1M<n<10M
format:parquet
...truncated...
jensenbaxter/10dataset-text-document-classification
A collection of ~1000 newsgroup documents from 10 different newsgroups
Source: Kaggle | Type: Text | Updated: 2020-06-08
earth and nature
business
online communities
kageneko/legal-case-document-summarization
Source: Kaggle | Type: Text | Updated: 2024-03-17
law
kensho/PubTables-v2
Source: Hugging Face | Type: Text | Updated: 2026-02-12
license:cdla-permissive-2.0
size_categories:1M<n<10M
format:webdataset
modality:image
modality:text
library:datasets
library:webdataset
library:mlcroissant
arxiv:2512.10888
region:us
keyushnisar/legal-docu
Formats, Usage, and Prerequisites
Source: Kaggle | Type: Text | Updated: 2025-03-04
education
law
intermediate
text generation
konradb/pfizer-documents
Pfizer-BioNTech vaccine-related data - #pfizerdocuments
Source: Kaggle | Type: Text | Updated: 2022-11-02
public health
lm-provers/FineProofs-SFT
Source: Hugging Face | Type: Text | Updated: 2026-02-14
task_categories:text-generation
task_categories:question-answering
language:en
license:apache-2.0
size_categories:10K<n<100K
format:parquet
format:optimized-parquet
modality:tabular
modality:text
library:datasets
...truncated...
ma-xu/fine-t2i
Source: Hugging Face | Type: Text | Updated: 2026-02-20
task_categories:image-to-text
task_categories:text-to-image
language:en
license:apache-2.0
size_categories:100K<n<1M
format:webdataset
modality:image
modality:text
library:datasets
library:webdataset
...truncated...
markov-ai/computer-use
Source: Hugging Face | Type: Text | Updated: 2026-02-13
task_categories:robotics
task_categories:image-to-text
license:apache-2.0
size_categories:n<1K
format:parquet
format:optimized-parquet
modality:image
modality:text
modality:timeseries
modality:video
...truncated...
MathArena/aime_2026
Source: Hugging Face | Type: Text | Updated: 2026-02-16
benchmark:official
benchmark:eval-yaml
language:en
license:cc-by-nc-sa-4.0
size_categories:n<1K
format:parquet
format:optimized-parquet
modality:tabular
modality:text
library:datasets
...truncated...
moonworks/lunara-aesthetic-image-variations
Source: Hugging Face | Type: Text | Updated: 2026-02-09
task_categories:image-to-image
task_categories:text-to-image
license:apache-2.0
size_categories:1K<n<10K
format:parquet
modality:image
modality:text
library:datasets
library:dask
library:polars
...truncated...
Nanbeige/ToolMind-Web-QA
Source: Hugging Face | Type: Text | Updated: 2026-02-19
task_categories:text-generation
language:en
license:apache-2.0
arxiv:2602.13367
region:us
synthetic
deep search
nenriki/document-clustering
Text similarity and Agglomerative Document Clustering
Source: Kaggle | Type: Text | Updated: 2022-03-07
nlp
clustering
nohurry/Opus-4.6-Reasoning-3000x-filtered
Source: Hugging Face | Type: Text | Updated: 2026-02-10
license:apache-2.0
size_categories:1K<n<10K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
region:us
nvidia/SAGE-10k
Source: Hugging Face | Type: Text | Updated: 2026-02-11
task_categories:text-to-3d
language:en
license:apache-2.0
size_categories:10K<n<100K
arxiv:2602.10116
region:us
Scene-Generation
Interactive-Scenes
Embodied-AI
Scene-Understanding
...truncated...
nyuuzyou/suno
Source: Hugging Face | Type: Text | Updated: 2026-02-03
task_categories:audio-classification
task_categories:text-to-audio
annotations_creators:found
multilinguality:multilingual
source_datasets:original
language:en
language:ja
language:multilingual
license:cc0-1.0
size_categories:100K<n<1M
...truncated...
openbmb/Ultra-FineWeb
Source: Hugging Face | Type: Text | Updated: 2025-12-10
task_categories:text-generation
language:en
language:zh
license:apache-2.0
size_categories:1B<n<10B
modality:text
arxiv:2505.05427
arxiv:2506.07900
arxiv:2412.04315
region:us
...truncated...
openbmb/Ultra-FineWeb-L3
Source: Hugging Face | Type: Text | Updated: 2026-02-09
task_categories:text-generation
language:en
language:zh
license:apache-2.0
size_categories:n<1K
format:json
modality:text
library:datasets
library:pandas
library:polars
...truncated...
openbmb/UltraData-Math
Source: Hugging Face | Type: Text | Updated: 2026-02-10
task_categories:text-generation
language:en
language:zh
license:apache-2.0
size_categories:100M<n<1B
format:parquet
modality:text
library:datasets
library:dask
library:polars
...truncated...
opencsg/Fineweb-Edu-Chinese-V2.2
Source: Hugging Face | Type: Text | Updated: 2026-02-02
task_categories:text-generation
task_categories:question-answering
language:zh
license:apache-2.0
size_categories:10B<n<100B
arxiv:2501.08197
arxiv:2305.11206
arxiv:2305.15717
arxiv:2307.01850
arxiv:2307.08701
...truncated...
OpenDriveLab-org/Kai0
Source: Hugging Face | Type: Text | Updated: 2026-02-14
task_categories:robotics
license:cc-by-nc-sa-4.0
size_categories:1K<n<10K
format:parquet
modality:tabular
modality:timeseries
modality:video
library:datasets
library:pandas
library:polars
...truncated...
openfoodfacts/product-database
Source: Hugging Face | Type: Text | Updated: 2026-02-19
language:en
language:fr
language:de
language:es
language:it
language:nl
language:pl
language:pt
language:sv
language:bg
...truncated...
OpenMed/Medical-Reasoning-SFT-Mega
Source: Hugging Face | Type: Text | Updated: 2026-02-06
task_categories:text-generation
task_categories:question-answering
language:en
license:apache-2.0
size_categories:1M<n<10M
format:parquet
format:optimized-parquet
modality:text
library:datasets
library:dask
...truncated...
OpenResearcher/OpenResearcher-Dataset
Source: Hugging Face | Type: Text | Updated: 2026-02-12
license:mit
size_categories:10K<n<100K
format:parquet
format:optimized-parquet
modality:tabular
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
...truncated...
patrickaudriaz/tobacco3482jpg
Document Structure Learning Dataset
Source: Kaggle | Type: Text | Updated: 2019-04-10
earth and nature
education
image
perplexity-ai/draco
Source: Hugging Face | Type: Text | Updated: 2026-02-13
language:en
license:mit
size_categories:n<1K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
arxiv:2602.11685
...truncated...
PleIAs/common_corpus
Source: Hugging Face | Type: Text | Updated: 2026-02-19
language:en
language:fr
language:de
language:zh
language:it
language:es
language:ja
language:pl
language:la
language:nl
...truncated...
princeton-nlp/SWE-bench_Verified
Source: Hugging Face | Type: Text | Updated: 2025-02-18
size_categories:n<1K
format:parquet
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
reasat/badlad-train
Bengali document layout analysis dataset
Source: Kaggle | Type: Text | Updated: 2023-05-06
beginner
computer vision
deep learning
bengali
object detection
ritvik1909/document-classification-dataset
A small dataset to try out Document Classification algorithms
Source: Kaggle | Type: Text | Updated: 2022-07-06
computer science
nlp
computer vision
image
text
multiclass classification
sachinsharma1123/document-classification
Classify each document with the correct label
Source: Kaggle | Type: Text | Updated: 2020-07-12
earth and nature
computer science
shaz13/real-world-documents-collections
A document type collection from various public datasets
Source: Kaggle | Type: Text | Updated: 2020-07-06
earth and nature
shivamkushwaha/bbc-full-text-document-classification
2225 documents in five categories can be used for clustering and classification.
Source: Kaggle | Type: Text | Updated: 2019-01-26
software
sthabile/noisy-and-rotated-scanned-documents
Can a predictive model be used to recognise the angle of a scanned document?
Source: Kaggle | Type: Text | Updated: 2020-03-06
cnn
text
sunilthite/text-document-classification-dataset
Text Document Classification Dataset for Classification and Clustering
Source: Kaggle | Type: Text | Updated: 2023-12-04
education
software
text
news
english
tanishqdublish/text-classification-documentation
Text Document Classification Dataset for Classification and Clustering
Source: Kaggle | Type: Text | Updated: 2024-01-12
education
software
news
email and messaging
text classification
TeichAI/claude-4.5-opus-high-reasoning-250x
Source: Hugging Face | Type: Text | Updated: 2025-11-28
size_categories:n<1K
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
TeichAI/Pony-Alpha-15k
Source: Hugging Face | Type: Text | Updated: 2026-02-17
size_categories:10K<n<100K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
region:us
tencent/CL-bench
Source: Hugging Face | Type: Text | Updated: 2026-02-06
task_categories:text-generation
language:en
license:other
size_categories:1K<n<10K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
TIGER-Lab/MMLU-Pro
Source: Hugging Face | Type: Text | Updated: 2026-01-19
benchmark:official
task_categories:question-answering
language:en
license:mit
size_categories:10K<n<100K
format:parquet
modality:tabular
modality:text
library:datasets
library:pandas
...truncated...
uv-scripts/ocr
Source: Hugging Face | Type: Text | Updated: 2026-02-19
region:us
uv-script
ocr
vision-language-model
document-processing
hf-jobs
yiweilu2033/well-documented-alzheimers-dataset
This is a well-documented, skull-stripped, new MRI dataset. Take what you want.
Source: Kaggle | Type: Text | Updated: 2024-12-16
diseases
computer vision
classification
deep learning
image