Business Process Automation Datasets
Curated datasets for document AI, workflow automation, and enterprise chatbot development.
ajibawa-2023/Cpp-Code-Large
Source: Hugging Face | Type: Text | Updated: 2026-03-04
task_categories:text-generation
language:en
license:mit
size_categories:1M<n<10M
format:json
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
...truncated...
alfathterry/bbc-full-text-document-classification
Train your NLP skills with this dataset.
Source: Kaggle | Type: Text | Updated: 2024-04-04
education
nlp
multiclass classification
news
allenai/Dolci-Think-SFT-Olmo-Hybrid
Source: Hugging Face | Type: Text | Updated: 2026-03-05
license:odc-by
size_categories:1M<n<10M
format:parquet
format:optimized-parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
region:us
Anthropic/EconomicIndex
Source: Hugging Face | Type: Text | Updated: 2026-03-05
language:en
license:mit
arxiv:2503.04761
region:us
AI
LLM
Economic Impacts
Anthropic
AudioVisual-Caption/ASID-1M
Source: Hugging Face | Type: Text | Updated: 2026-03-04
task_categories:image-text-to-text
language:en
license:cc-by-2.0
size_categories:100K<n<1M
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
AweAI-Team/BeyondSWE
Source: Hugging Face | Type: Text | Updated: 2026-03-05
task_categories:text-generation
language:en
license:cc-by-4.0
size_categories:n<1K
format:json
modality:3d
modality:text
library:datasets
library:pandas
library:polars
...truncated...
AweAI-Team/Scale-SWE
Source: Hugging Face | Type: Text | Updated: 2026-03-05
size_categories:10K<n<100K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
arxiv:2602.09892
region:us
AweAI-Team/Scale-SWE-Distilled
Source: Hugging Face | Type: Text | Updated: 2026-02-28
size_categories:10K<n<100K
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
arxiv:2602.09892
region:us
ayoubcherguelaine/company-documents-dataset
Company Documents Dataset for Classification and Information Retrieval
Source: Kaggle | Type: Text | Updated: 2024-05-23
business
bigcode/the-stack-v2
Source: Hugging Face | Type: Text | Updated: 2024-04-23
task_categories:text-generation
language_creators:crowdsourced
language_creators:expert-generated
multilinguality:multilingual
language:code
license:other
size_categories:1B<n<10B
format:parquet
modality:tabular
modality:text
...truncated...
BytedTsinghua-SIA/CUDA-Agent-Ops-6K
Source: Hugging Face | Type: Text | Updated: 2026-02-27
task_categories:text-generation
language:en
license:cc-by-4.0
size_categories:1K<n<10K
format:parquet
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
crownelius/Opus-4.6-Reasoning-3300x
Source: Hugging Face | Type: Text | Updated: 2026-03-02
license:apache-2.0
size_categories:1K<n<10K
format:parquet
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
region:us
dataturks/resume-entities-for-ner
A document annotation dataset to perform NER on resumes.
Source: Kaggle | Type: Text | Updated: 2018-07-12
earth and nature
biology
business
linguistics
nlp
text
dddraxxx/ref-adv-s
Source: Hugging Face | Type: Text | Updated: 2026-03-02
task_categories:visual-question-answering
task_categories:object-detection
language:en
license:cc-by-4.0
size_categories:1K<n<10K
format:parquet
format:optimized-parquet
modality:image
modality:text
library:datasets
...truncated...
devashishprasad/documnet-layout-recognition-dataset-publaynet-t0
IBM's PubLayNet dataset on Kaggle for document layout recognition.
Source: Kaggle | Type: Text | Updated: 2021-06-10
computer science
computer vision
deep learning
cnn
image
FINAL-Bench/ALL-Bench-Leaderboard
Source: Hugging Face | Type: Text | Updated: 2026-03-08
task_categories:text-generation
task_categories:visual-question-answering
task_categories:text-to-image
task_categories:text-to-video
task_categories:text-to-audio
annotations_creators:expert-generated
source_datasets:original
language:en
license:mit
size_categories:n<1K
...truncated...
GD-ML/GenMRP
Source: Hugging Face | Type: Text | Updated: 2026-03-06
task_categories:tabular-classification
task_categories:graph-ml
size_categories:100K<n<1M
format:csv
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
region:us
...truncated...
google/WaxalNLP
Source: Hugging Face | Type: Text | Updated: 2026-03-03
task_categories:automatic-speech-recognition
task_categories:text-to-speech
language_creators:creator_1
multilinguality:multilingual
source_datasets:UGSpeechData
source_datasets:DigitalUmuganda/AfriVoice
source_datasets:original
language:ach
language:aka
language:amh
...truncated...
HuggingFaceFW/finephrase
Source: Hugging Face | Type: Text | Updated: 2026-03-07
task_categories:text-generation
task_ids:language-modeling
annotations_creators:machine-generated
language_creators:found
source_datasets:HuggingFaceFW/fineweb-edu/sample-350BT
language:en
license:odc-by
size_categories:1B<n<10B
modality:tabular
modality:text
...truncated...
HuggingFaceFW/fineweb
Source: Hugging Face | Type: Text | Updated: 2025-07-11
task_categories:text-generation
language:en
license:odc-by
size_categories:10B<n<100B
modality:tabular
modality:text
arxiv:2306.01116
arxiv:2109.07445
arxiv:2406.17557
doi:10.57967/hf/2493
...truncated...
HuggingFaceFW/fineweb-edu
Source: Hugging Face | Type: Text | Updated: 2025-07-11
task_categories:text-generation
language:en
license:odc-by
size_categories:1B<n<10B
format:parquet
modality:tabular
modality:text
library:datasets
library:dask
library:polars
...truncated...
humansintheloop/arabic-documents-ocr-dataset
10K images that are further classified into 12 classes (Invoices, Books, etc.)
Source: Kaggle | Type: Text | Updated: 2023-06-07
global
business
image
text
middle east
arabic
Jackrong/Qwen3.5-reasoning-700x
Source: Hugging Face | Type: Text | Updated: 2026-03-02
task_categories:question-answering
language:en
license:apache-2.0
size_categories:n<1K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
jensenbaxter/10dataset-text-document-classification
A collection of ~1000 newsgroup documents from 10 different newsgroups
Source: Kaggle | Type: Text | Updated: 2020-06-08
earth and nature
business
online communities
kageneko/legal-case-document-summarization
Source: Kaggle | Type: Text | Updated: 2024-03-17
law
karpathy/tinystories-gpt4-clean
Source: Hugging Face | Type: Text | Updated: 2026-02-08
license:cdla-sharing-1.0
size_categories:1M<n<10M
format:parquet
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
arxiv:2305.07759
region:us
konradb/pfizer-documents
Pfizer-BioNTech vaccine-related data - #pfizerdocuments
Source: Kaggle | Type: Text | Updated: 2022-11-02
public health
LeeXiangNO1/DyNativeGaussian_sequence
Source: Hugging Face | Type: Text | Updated: 2026-02-10
license:cc-by-nc-4.0
size_categories:10K<n<100K
format:text
modality:3d
modality:text
library:datasets
library:mlcroissant
region:us
nebius/SWE-rebench-V2
Source: Hugging Face | Type: Text | Updated: 2026-03-03
task_categories:text-generation
language:en
license:cc-by-4.0
size_categories:10K<n<100K
format:parquet
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
nebius/SWE-rebench-V2-PRs
Source: Hugging Face | Type: Text | Updated: 2026-03-03
task_categories:text-generation
language:en
license:cc-by-4.0
size_categories:100K<n<1M
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
...truncated...
nenriki/document-clustering
Text similarity and Agglomerative Document Clustering
Source: Kaggle | Type: Text | Updated: 2022-03-07
nlp
clustering
nohurry/Opus-4.6-Reasoning-3000x-filtered
Source: Hugging Face | Type: Text | Updated: 2026-02-10
license:apache-2.0
size_categories:1K<n<10K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
region:us
nvidia/Nemotron-ClimbMix
Source: Hugging Face | Type: Text | Updated: 2025-10-21
task_categories:text-generation
language:en
license:cc-by-nc-4.0
size_categories:100M<n<1B
format:json
modality:tabular
library:datasets
library:dask
library:mlcroissant
arxiv:2504.13161
...truncated...
nvidia/Nemotron-Research-GooseReason-0.7M
Source: Hugging Face | Type: Text | Updated: 2026-03-01
language:en
license:cc-by-nc-4.0
size_categories:100K<n<1M
format:json
modality:document
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
nvidia/Nemotron-Terminal-Corpus
Source: Hugging Face | Type: Text | Updated: 2026-02-27
task_categories:question-answering
language:en
license:cc-by-4.0
size_categories:100K<n<1M
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
...truncated...
ogutsevda/graph-nucls
Source: Hugging Face | Type: Text | Updated: 2026-03-03
task_categories:graph-ml
license:cc-by-nc-sa-4.0
size_categories:n<1K
format:imagefolder
modality:image
library:datasets
library:mlcroissant
arxiv:2603.00143
region:us
histopathology
...truncated...
ogutsevda/graph-pannuke
Source: Hugging Face | Type: Text | Updated: 2026-03-03
task_categories:graph-ml
license:cc-by-nc-sa-4.0
size_categories:1K<n<10K
format:csv
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
arxiv:2603.00143
...truncated...
OmniLottie/MMLottie-2M
Source: Hugging Face | Type: Text | Updated: 2026-03-07
language:en
license:cc-by-nc-sa-4.0
size_categories:1M<n<10M
format:parquet
modality:image
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
...truncated...
OmniLottie/MMLottieBench
Source: Hugging Face | Type: Text | Updated: 2026-03-09
language:en
license:cc-by-nc-sa-4.0
size_categories:n<1K
format:parquet
modality:image
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
openai/graphwalks
Source: Hugging Face | Type: Text | Updated: 2026-03-05
license:mit
size_categories:1K<n<10K
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
region:us
openai/gsm8k
Source: Hugging Face | Type: Text | Updated: 2025-12-20
benchmark:official
task_categories:text-generation
annotations_creators:crowdsourced
language_creators:crowdsourced
multilinguality:monolingual
source_datasets:original
language:en
license:mit
size_categories:10K<n<100K
format:parquet
...truncated...
patrickaudriaz/tobacco3482jpg
Document Structure Learning Dataset
Source: Kaggle | Type: Text | Updated: 2019-04-10
earth and nature
education
image
peteromallet/dataclaw-peteromallet
Source: Hugging Face | Type: Text | Updated: 2026-02-25
task_categories:text-generation
language:en
license:mit
size_categories:n<1K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
reasat/badlad-train
Bengali document layout analysis dataset
Source: Kaggle | Type: Text | Updated: 2023-05-06
beginner
computer vision
deep learning
bengali
object detection
ritvik1909/document-classification-dataset
A small dataset to try out Document Classification algorithms
Source: Kaggle | Type: Text | Updated: 2022-07-06
computer science
nlp
computer vision
image
text
multiclass classification
Roman1111111/gemini-3-pro-10000x-hard-high-reasoning
Source: Hugging Face | Type: Text | Updated: 2026-02-20
task_categories:question-answering
task_categories:text-generation
language:en
license:mit
size_categories:10K<n<100K
format:json
modality:text
library:datasets
library:pandas
library:polars
...truncated...
Roman1111111/gemini-3.1-pro-hard-high-reasoning
Source: Hugging Face | Type: Text | Updated: 2026-02-21
task_categories:question-answering
task_categories:text-generation
language:en
license:mit
size_categories:1K<n<10K
format:json
modality:text
library:datasets
library:pandas
library:polars
...truncated...
ronantakizawa/github-codereview
Source: Hugging Face | Type: Text | Updated: 2026-03-05
task_categories:text-generation
language:en
language:code
license:mit
size_categories:100K<n<1M
format:parquet
modality:tabular
modality:text
library:datasets
library:dask
...truncated...
ronantakizawa/github-top-code
Source: Hugging Face | Type: Text | Updated: 2026-02-23
task_categories:text-generation
language:code
license:mit
size_categories:1M<n<10M
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
...truncated...
ronantakizawa/webui
Source: Hugging Face | Type: Text | Updated: 2026-02-28
task_categories:image-to-text
task_categories:text-generation
task_categories:object-detection
language:en
license:mit
size_categories:10K<n<100K
format:parquet
format:optimized-parquet
modality:image
modality:text
...truncated...
roneneldan/TinyStories
Source: Hugging Face | Type: Text | Updated: 2024-08-12
task_categories:text-generation
language:en
license:cdla-sharing-1.0
size_categories:1M<n<10M
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
...truncated...
sachinsharma1123/document-classification
Classify the document with correct labels
Source: Kaggle | Type: Text | Updated: 2020-07-12
earth and nature
computer science
ScaleAI/SWE-Atlas-QnA
Source: Hugging Face | Type: Text | Updated: 2026-03-04
size_categories:n<1K
format:csv
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
region:us
shaz13/real-world-documents-collections
A document type collection from various public datasets
Source: Kaggle | Type: Text | Updated: 2020-07-06
earth and nature
shivamkushwaha/bbc-full-text-document-classification
2225 documents in five categories can be used for clustering and classification.
Source: Kaggle | Type: Text | Updated: 2019-01-26
software
skylenage/DeepVision-103K
Source: Hugging Face | Type: Text | Updated: 2026-02-26
task_categories:image-text-to-text
language:en
license:mit
size_categories:100K<n<1M
format:parquet
format:optimized-parquet
modality:image
modality:text
library:datasets
library:pandas
...truncated...
sthabile/noisy-and-rotated-scanned-documents
Can a predictive model be used to recognise the angle of a scanned document?
Source: Kaggle | Type: Text | Updated: 2020-03-06
cnn
text
sunilthite/text-document-classification-dataset
Text Document Classification Dataset for Classification and Clustering
Source: Kaggle | Type: Text | Updated: 2023-12-04
education
software
text
news
english
tanishqdublish/text-classification-documentation
Text Document Classification Dataset for Classification and Clustering
Source: Kaggle | Type: Text | Updated: 2024-01-12
education
software
news
email and messaging
text classification
TeichAI/claude-4.5-opus-high-reasoning-250x
Source: Hugging Face | Type: Text | Updated: 2025-11-28
size_categories:n<1K
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
TianHongZXY/CHIMERA
Source: Hugging Face | Type: Text | Updated: 2026-03-03
task_categories:text-generation
task_categories:question-answering
annotations_creators:machine-generated
language:en
license:apache-2.0
size_categories:10K<n<100K
format:parquet
format:optimized-parquet
modality:text
library:datasets
...truncated...
TIGER-Lab/MMLU-Pro
Source: Hugging Face | Type: Text | Updated: 2026-01-19
benchmark:official
task_categories:question-answering
language:en
license:mit
size_categories:10K<n<100K
format:parquet
modality:tabular
modality:text
library:datasets
library:pandas
...truncated...
togethercomputer/CoderForge-Preview
Source: Hugging Face | Type: Text | Updated: 2026-02-26
size_categories:100K<n<1M
format:parquet
format:optimized-parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
region:us
TuringEnterprises/Open-RL
Source: Hugging Face | Type: Text | Updated: 2026-03-04
task_categories:question-answering
language:en
license:mit
size_categories:n<1K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
vafaeii/open-scilay
Large-scale dataset for robust OCR, layout analysis, and VLM pre-training
Source: Kaggle | Type: Text | Updated: 2026-02-18
computer vision
image
text
image-to-text
synthetic
yiweilu2033/well-documented-alzheimers-dataset
This is a well-documented, skull-stripped, new MRI dataset. Take what you want.
Source: Kaggle | Type: Text | Updated: 2024-12-16
diseases
computer vision
classification
deep learning
image