Business Process Automation Datasets
Curated datasets for document AI, workflow automation, and enterprise chatbot development.
aanonyyy/F5I9N7A1
Source: Hugging Face | Type: Text | Updated: 2025-10-22
size_categories:100K<n<1M
format:parquet
modality:audio
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
region:us
alfathterry/bbc-full-text-document-classification
Train your NLP skills with this dataset.
Source: Kaggle | Type: Text | Updated: 2024-04-04
education
nlp
multiclass classification
news
Anthropic/hh-rlhf
Source: Hugging Face | Type: Text | Updated: 2023-05-26
license:mit
size_categories:100K<n<1M
format:json
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
arxiv:2204.05862
region:us
...truncated...
APRIL-AIGC/Soul-Bench
Source: Hugging Face | Type: Text | Updated: 2025-12-16
task_categories:image-to-video
size_categories:n<1K
format:text
modality:audio
modality:image
modality:text
modality:video
library:datasets
library:mlcroissant
arxiv:2512.13495
...truncated...
ayoubcherguelaine/company-documents-dataset
Company Documents Dataset for Classification and Information Retrieval
Source: Kaggle | Type: Text | Updated: 2024-05-23
business
bshada/open-schematics
Source: Hugging Face | Type: Text | Updated: 2025-12-17
task_categories:text-generation
task_categories:image-to-text
task_categories:text-to-image
language:en
license:cc-by-4.0
size_categories:10K<n<100K
format:parquet
modality:image
modality:text
library:datasets
...truncated...
cais/hle
Source: Hugging Face | Type: Text | Updated: 2025-09-10
license:mit
size_categories:1K<n<10K
format:parquet
modality:image
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
datapizza-ai-lab/salaries
Source: Hugging Face | Type: Text | Updated: 2025-12-28
language:it
license:cc-by-nc-4.0
size_categories:10K<n<100K
format:json
modality:tabular
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
dataturks/resume-entities-for-ner
A document annotation dataset to perform NER on resumes.
Source: Kaggle | Type: Text | Updated: 2018-07-12
earth and nature
biology
business
linguistics
nlp
text
devashishprasad/documnet-layout-recognition-dataset-publaynet-t0
IBM's PubLayNet dataset on Kaggle for document layout recognition.
Source: Kaggle | Type: Text | Updated: 2021-06-10
computer science
computer vision
deep learning
cnn
image
facebook/research-plan-gen
Source: Hugging Face | Type: Text | Updated: 2025-12-28
size_categories:10K<n<100K
format:parquet
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
region:us
facebook/sam-audio-bench
Source: Hugging Face | Type: Text | Updated: 2025-12-29
language:en
license:cc-by-nc-4.0
size_categories:n<1K
format:parquet
modality:tabular
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
...truncated...
FBK-MT/MCIF
Source: Hugging Face | Type: Text | Updated: 2025-12-15
task_categories:automatic-speech-recognition
task_categories:question-answering
task_categories:summarization
task_categories:visual-question-answering
task_categories:translation
language:en
language:de
language:it
language:zh
license:cc-by-4.0
...truncated...
fka/awesome-chatgpt-prompts
Source: Hugging Face | Type: Text | Updated: 2025-12-31
task_categories:question-answering
task_categories:text-generation
license:cc0-1.0
size_categories:n<1K
format:csv
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
FreedomIntelligence/medical-o1-reasoning-SFT
Source: Hugging Face | Type: Text | Updated: 2025-04-22
task_categories:question-answering
task_categories:text-generation
language:en
language:zh
license:apache-2.0
size_categories:10K<n<100K
format:json
modality:text
library:datasets
library:pandas
...truncated...
FutureMa/DramaBench
Source: Hugging Face | Type: Text | Updated: 2025-12-29
task_categories:text-generation
language:en
license:mit
size_categories:n<1K
format:json
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
gaia-benchmark/GAIA
Source: Hugging Face | Type: Text | Updated: 2025-10-28
language:en
size_categories:n<1K
format:parquet
modality:audio
modality:document
modality:image
modality:text
library:datasets
library:pandas
library:mlcroissant
...truncated...
google/deepsearchqa
Source: Hugging Face | Type: Text | Updated: 2025-12-17
task_categories:question-answering
language:en
license:apache-2.0
size_categories:n<1K
format:csv
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
...truncated...
google/mobile-actions
Source: Hugging Face | Type: Text | Updated: 2025-12-18
language:en
license:cc-by-4.0
size_categories:1K<n<10K
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
...truncated...
HeshamHaroon/Arabic_Function_Calling
Source: Hugging Face | Type: Text | Updated: 2025-12-14
task_categories:text-generation
task_categories:question-answering
language:ar
license:apache-2.0
size_categories:10K<n<100K
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
...truncated...
HiDream-ai/ReCo-Data
Source: Hugging Face | Type: Text | Updated: 2025-12-26
task_categories:image-to-video
language:en
license:cc-by-nc-sa-4.0
size_categories:1M<n<10M
format:webdataset
modality:text
library:datasets
library:webdataset
library:mlcroissant
arxiv:2512.17650
...truncated...
hotpotqa/hotpot_qa
Source: Hugging Face | Type: Text | Updated: 2025-08-11
task_categories:question-answering
annotations_creators:crowdsourced
language_creators:found
multilinguality:monolingual
source_datasets:original
language:en
license:cc-by-sa-4.0
size_categories:100K<n<1M
format:parquet
modality:text
...truncated...
HuggingFaceFW/fineweb
Source: Hugging Face | Type: Text | Updated: 2025-07-11
task_categories:text-generation
language:en
license:odc-by
size_categories:10B<n<100B
modality:tabular
modality:text
arxiv:2306.01116
arxiv:2109.07445
arxiv:2406.17557
doi:10.57967/hf/2493
...truncated...
HuggingFaceH4/MATH-500
Source: Hugging Face | Type: Text | Updated: 2025-12-15
benchmark:official
task_categories:text-generation
language:en
size_categories:n<1K
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
...truncated...
humansintheloop/arabic-documents-ocr-dataset
10K images that are further classified into 12 classes (Invoices, Books, etc.)
Source: Kaggle | Type: Text | Updated: 2023-06-07
global
business
image
text
middle east
arabic
Idavidrein/gpqa
Source: Hugging Face | Type: Text | Updated: 2024-03-28
task_categories:question-answering
task_categories:text-generation
language:en
license:cc-by-4.0
size_categories:1K<n<10K
format:csv
modality:tabular
modality:text
library:datasets
library:pandas
...truncated...
ILSVRC/imagenet-1k
Source: Hugging Face | Type: Text | Updated: 2025-09-17
task_categories:image-classification
task_ids:multi-class-image-classification
annotations_creators:crowdsourced
language_creators:crowdsourced
multilinguality:monolingual
source_datasets:original
language:en
license:other
size_categories:1M<n<10M
format:parquet
...truncated...
jensenbaxter/10dataset-text-document-classification
A collection of ~1000 newsgroup documents from 10 different newsgroups
Source: Kaggle | Type: Text | Updated: 2020-06-08
earth and nature
business
online communities
kageneko/legal-case-document-summarization
Source: Kaggle | Type: Text | Updated: 2024-03-17
law
keyushnisar/legal-docu
Formats, Usage, and Prerequisites
Source: Kaggle | Type: Text | Updated: 2025-03-04
education
law
intermediate
text generation
konradb/pfizer-documents
Pfizer-BioNTech vaccine-related data - #pfizerdocuments
Source: Kaggle | Type: Text | Updated: 2022-11-02
public health
Lewandofski/OpenVE-3M
Source: Hugging Face | Type: Text | Updated: 2025-12-25
license:cc-by-nc-4.0
size_categories:1M<n<10M
format:webdataset
modality:text
modality:video
library:datasets
library:webdataset
library:mlcroissant
arxiv:2512.07826
region:us
...truncated...
lirang04/GroundingME
Source: Hugging Face | Type: Text | Updated: 2025-12-26
task_categories:zero-shot-object-detection
license:other
size_categories:1K<n<10K
format:parquet
format:optimized-parquet
modality:image
modality:text
library:datasets
library:dask
library:polars
...truncated...
LLM360/TxT360-3efforts
Source: Hugging Face | Type: Text | Updated: 2025-12-11
license:cc-by-4.0
size_categories:1M<n<10M
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
arxiv:2512.06201
region:us
MiniMaxAI/VIBE
Source: Hugging Face | Type: Text | Updated: 2025-12-23
task_categories:text-generation
language:en
license:mit
size_categories:n<1K
format:parquet
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
Mxode/Chinese-Instruct
Source: Hugging Face | Type: Text | Updated: 2025-05-09
task_categories:text-generation
task_categories:question-answering
language:zh
license:cc-by-sa-4.0
size_categories:1M<n<10M
format:json
modality:text
library:datasets
library:dask
library:mlcroissant
...truncated...
nebius/SWE-rebench-openhands-trajectories
Source: Hugging Face | Type: Text | Updated: 2025-12-27
license:cc-by-4.0
size_categories:10K<n<100K
format:parquet
modality:tabular
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
region:us
...truncated...
nenriki/document-clustering
Text similarity and Agglomerative Document Clustering
Source: Kaggle | Type: Text | Updated: 2022-03-07
nlp
clustering
nvidia/Nemotron-CC-v2.1
Source: Hugging Face | Type: Text | Updated: 2025-12-22
task_categories:text-generation
license:other
size_categories:1B<n<10B
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
arxiv:2508.14444
...truncated...
nvidia/Nemotron-Instruction-Following-Chat-v1
Source: Hugging Face | Type: Text | Updated: 2025-12-15
language:en
license:cc-by-4.0
size_categories:100K<n<1M
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
nvidia/Nemotron-Math-Proofs-v1
Source: Hugging Face | Type: Text | Updated: 2025-12-18
language:en
license:cc-by-sa-4.0
size_categories:100K<n<1M
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
arxiv:2512.15489
...truncated...
nvidia/Nemotron-Pretraining-Code-v2
Source: Hugging Face | Type: Text | Updated: 2025-12-22
task_categories:text-generation
license:other
size_categories:100M<n<1B
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
arxiv:2508.14444
...truncated...
nvidia/Nemotron-Pretraining-Specialized-v1
Source: Hugging Face | Type: Text | Updated: 2025-12-22
task_categories:text-generation
license:cc-by-4.0
size_categories:10M<n<100M
format:parquet
modality:text
library:datasets
library:dask
library:polars
library:mlcroissant
arxiv:2508.14444
...truncated...
openai/frontierscience
Source: Hugging Face | Type: Text | Updated: 2025-12-16
license:apache-2.0
size_categories:n<1K
format:json
modality:text
library:datasets
library:dask
library:mlcroissant
region:us
openai/gdpval
Source: Hugging Face | Type: Text | Updated: 2025-09-25
size_categories:n<1K
format:parquet
modality:audio
modality:document
modality:image
modality:text
modality:video
library:datasets
library:pandas
library:mlcroissant
...truncated...
openai/gsm8k
Source: Hugging Face | Type: Text | Updated: 2025-12-20
benchmark:official
task_categories:text-generation
annotations_creators:crowdsourced
language_creators:crowdsourced
multilinguality:monolingual
source_datasets:original
language:en
license:mit
size_categories:10K<n<100K
format:parquet
...truncated...
OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B
Source: Hugging Face | Type: Text | Updated: 2025-12-12
task_categories:text-generation
language:en
license:apache-2.0
size_categories:100K<n<1M
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
...truncated...
patrickaudriaz/tobacco3482jpg
Document Structure Learning Dataset
Source: Kaggle | Type: Text | Updated: 2019-04-10
earth and nature
education
image
ritvik1909/document-classification-dataset
A small dataset to try out Document Classification algorithms
Source: Kaggle | Type: Text | Updated: 2022-07-06
computer science
nlp
computer vision
image
text
multiclass classification
ronantakizawa/github-top-developers
Source: Hugging Face | Type: Text | Updated: 2025-12-20
task_categories:text-classification
task_categories:time-series-forecasting
task_categories:text-retrieval
language:en
license:mit
size_categories:10K<n<100K
format:csv
modality:tabular
modality:text
library:datasets
...truncated...
roneneldan/TinyStories
Source: Hugging Face | Type: Text | Updated: 2024-08-12
task_categories:text-generation
language:en
license:cdla-sharing-1.0
size_categories:1M<n<10M
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
...truncated...
sachinsharma1123/document-classification
Classify the document with correct labels
Source: Kaggle | Type: Text | Updated: 2020-07-12
earth and nature
computer science
scrapegraphai/scrapegraphai-100k
Source: Hugging Face | Type: Text | Updated: 2025-12-21
language:en
license:apache-2.0
size_categories:10K<n<100K
format:parquet
modality:tabular
modality:text
library:datasets
library:pandas
library:polars
library:mlcroissant
...truncated...
shaz13/real-world-documents-collections
A document type collection from various public datasets
Source: Kaggle | Type: Text | Updated: 2020-07-06
earth and nature
shivamkushwaha/bbc-full-text-document-classification
2,225 documents in five categories that can be used for clustering and classification.
Source: Kaggle | Type: Text | Updated: 2019-01-26
software
spatialverse/SAGE-3D_Collision_Mesh
Source: Hugging Face | Type: Text | Updated: 2025-12-17
task_categories:robotics
task_categories:image-to-3d
license:cc-by-nc-4.0
size_categories:n<1K
format:imagefolder
modality:image
library:datasets
library:mlcroissant
arxiv:2510.21307
region:us
spatialverse/SAGE-3D_InteriorGS_usdz
Source: Hugging Face | Type: Text | Updated: 2025-12-17
task_categories:robotics
task_categories:image-to-3d
language:en
license:cc-by-nc-4.0
size_categories:n<1K
format:imagefolder
modality:3d
modality:image
library:datasets
library:mlcroissant
...truncated...
sthabile/noisy-and-rotated-scanned-documents
Can a predictive model be used to recognise the angle of a scanned document?
Source: Kaggle | Type: Text | Updated: 2020-03-06
cnn
text
sunilthite/text-document-classification-dataset
Text Document Classification Dataset for Classification and Clustering
Source: Kaggle | Type: Text | Updated: 2023-12-04
education
software
text
news
english
tanishqdublish/text-classification-documentation
Text Document Classification Dataset for Classification and Clustering
Source: Kaggle | Type: Text | Updated: 2024-01-12
education
software
news
email and messaging
text classification
TeichAI/claude-4.5-opus-high-reasoning-250x
Source: Hugging Face | Type: Text | Updated: 2025-11-28
size_categories:n<1K
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
wikimedia/wikipedia
Source: Hugging Face | Type: Text | Updated: 2024-01-09
task_categories:text-generation
task_categories:fill-mask
task_ids:language-modeling
task_ids:masked-language-modeling
language:ab
language:ace
language:ady
language:af
language:alt
language:am
...truncated...
yiweilu2033/well-documented-alzheimers-dataset
This is a well-documented, skull-stripped, new MRI dataset. Take what you want.
Source: Kaggle | Type: Text | Updated: 2024-12-16
diseases
computer vision
classification
deep learning
image