Business Process Automation Datasets
Curated datasets for document AI, workflow automation, and enterprise chatbot development.
a-m-team/AM-DeepSeek-R1-Distilled-1.4M
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-03-30
task_categories:text-generation
language:zh
language:en
license:cc-by-nc-4.0
size_categories:1M<n<10M
arxiv:2503.19633
region:us
code
math
reasoning
...truncated...
agentica-org/DeepCoder-Preview-Dataset
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-09
language:en
license:mit
size_categories:10K<n<100K
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
region:us
...truncated...
agentica-org/DeepScaleR-Preview-Dataset
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-02-10
language:en
license:mit
size_categories:10K<n<100K
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
alfathterry/bbc-full-text-document-classification
Train your NLP skills with this dataset.
Use case:
Source: Kaggle | Type: Text | Updated: 2024-04-04
education
nlp
multiclass classification
news
allenai/olmOCR-mix-0225
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-02-25
license:odc-by
size_categories:100K<n<1M
format:parquet
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
Anthropic/EconomicIndex
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-03-27
language:en
license:mit
size_categories:1K<n<10K
format:csv
modality:tabular
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
...truncated...
ayoubcherguelaine/company-documents-dataset
Company Documents Dataset for Classification and Information Retrieval
Use case:
Source: Kaggle | Type: Text | Updated: 2024-05-23
business
ByteDance-Seed/Multi-SWE-bench
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-11
task_categories:text-generation
license:other
arxiv:2504.02605
region:us
code
ByteDance-Seed/Multi-SWE-RL
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-09
task_categories:text-generation
license:other
arxiv:2504.02605
region:us
code
cais/hle
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-04
license:mit
size_categories:1K<n<10K
format:parquet
modality:image
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
cais/mmlu
Use case:
Source: Hugging Face | Type: Text | Updated: 2024-03-08
task_categories:question-answering
task_ids:multiple-choice-qa
annotations_creators:no-annotation
language_creators:expert-generated
multilinguality:monolingual
source_datasets:original
language:en
license:mit
size_categories:100K<n<1M
format:parquet
...truncated...
camel-ai/loong
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-01
task_categories:question-answering
language:en
license:mit
size_categories:1K<n<10K
format:parquet
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
...truncated...
CohereForAI/kaleidoscope
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-10
language:ar
language:bn
language:hr
language:nl
language:en
language:fr
language:de
language:hi
language:hu
language:lt
...truncated...
Congliu/Chinese-DeepSeek-R1-Distill-data-110k
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-02-21
task_categories:text-generation
task_categories:text2text-generation
task_categories:question-answering
language:zh
license:apache-2.0
size_categories:100K<n<1M
format:json
modality:tabular
modality:text
library:datasets
...truncated...
dataturks/resume-entities-for-ner
A document annotation dataset to perform NER on resumes.
Use case:
Source: Kaggle | Type: Text | Updated: 2018-07-12
earth and nature
biology
business
linguistics
nlp
text
davanstrien/reasoning-required
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-10
task_categories:text-classification
task_categories:text-generation
language:en
license:mit
size_categories:1K<n<10K
format:parquet
modality:text
library:datasets
library:pandas
library:mlcroissant
...truncated...
devashishprasad/documnet-layout-recognition-dataset-publaynet-t0
IBM's PubLayNet dataset on Kaggle for document layout recognition.
Use case:
Source: Kaggle | Type: Text | Updated: 2021-06-10
computer science
computer vision
deep learning
cnn
image
facebook/natural_reasoning
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-02-21
task_categories:text-generation
language:en
license:cc-by-nc-4.0
size_categories:1M<n<10M
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
...truncated...
fka/awesome-chatgpt-prompts
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-01-06
task_categories:question-answering
license:cc0-1.0
size_categories:n<1K
format:csv
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
...truncated...
FreedomIntelligence/medical-o1-reasoning-SFT
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-02-22
task_categories:question-answering
task_categories:text-generation
language:en
language:zh
license:apache-2.0
size_categories:10K<n<100K
format:json
modality:text
library:datasets
library:pandas
...truncated...
future-technologies/Universal-Transformers-Dataset
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-10
task_categories:text-classification
task_categories:token-classification
task_categories:table-question-answering
task_categories:question-answering
task_categories:zero-shot-classification
task_categories:translation
task_categories:summarization
task_categories:feature-extraction
task_categories:text-generation
task_categories:text2text-generation
...truncated...
getomni-ai/ocr-benchmark
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-02-21
license:mit
size_categories:1K<n<10K
format:imagefolder
modality:image
modality:text
library:datasets
library:mlcroissant
region:us
glaiveai/reasoning-v1-20m
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-03-19
task_categories:text-generation
language:en
license:apache-2.0
size_categories:10M<n<100M
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
...truncated...
HuggingFaceFW/fineweb
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-01-31
task_categories:text-generation
language:en
license:odc-by
size_categories:10B<n<100B
format:parquet
modality:tabular
modality:text
library:datasets
library:dask
library:mlcroissant
...truncated...
humansintheloop/arabic-documents-ocr-dataset
10K images that are further classified into 12 classes (Invoices, Books, etc.)
Use case:
Source: Kaggle | Type: Text | Updated: 2023-06-07
global
business
image
text
middle east
arabic
jensenbaxter/10dataset-text-document-classification
A collection of ~1000 newsgroup documents from 10 different newsgroups
Use case:
Source: Kaggle | Type: Text | Updated: 2020-06-08
earth and nature
business
online communities
kageneko/legal-case-document-summarization
Use case:
Source: Kaggle | Type: Text | Updated: 2024-03-17
law
konradb/pfizer-documents
Pfizer-BioNTech vaccine-related data - #pfizerdocuments
Use case:
Source: Kaggle | Type: Text | Updated: 2022-11-02
public health
LLM360/MegaMath
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-09
task_categories:text-generation
language:en
license:odc-by
size_categories:100M<n<1B
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
...truncated...
m-a-p/COIG-P
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-09
size_categories:1M<n<10M
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
arxiv:2504.05535
region:us
miscovery/General_Facts_in_English_Arabic_Egyptian_Arabic
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-06
task_categories:question-answering
task_categories:translation
task_categories:text-generation
task_categories:fill-mask
language:en
language:ar
license:mit
size_categories:10K<n<100K
format:csv
modality:tabular
...truncated...
MohamedRashad/Quran-Recitations
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-03-30
task_categories:automatic-speech-recognition
task_categories:text-to-speech
language:ar
size_categories:100K<n<1M
format:parquet
modality:audio
modality:text
library:datasets
library:dask
library:mlcroissant
...truncated...
mozilla-foundation/common_voice_17_0
Use case:
Source: Hugging Face | Type: Text | Updated: 2024-06-16
annotations_creators:crowdsourced
language_creators:crowdsourced
multilinguality:multilingual
source_datasets:extended|common_voice
language:ab
language:af
language:am
language:ar
language:as
language:ast
...truncated...
nenriki/document-clustering
Text similarity and Agglomerative Document Clustering
Use case:
Source: Kaggle | Type: Text | Updated: 2022-03-07
nlp
clustering
nisten/battlefield-medic-sharegpt
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-08
license:mit
size_categories:1K<n<10K
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
nvidia/Llama-Nemotron-Post-Training-Dataset
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-09
license:cc-by-4.0
size_categories:1M<n<10M
format:json
modality:text
library:datasets
library:dask
library:mlcroissant
region:us
nvidia/OpenCodeReasoning
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-07
task_categories:text-generation
license:cc-by-4.0
size_categories:100K<n<1M
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
arxiv:2504.01943
...truncated...
OmniSVG/MMSVG-Icon
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-09
license:cc-by-nc-sa-4.0
size_categories:100K<n<1M
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
arxiv:2504.06263
region:us
OmniSVG/MMSVG-Illustration
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-09
license:cc-by-nc-sa-4.0
size_categories:100K<n<1M
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
arxiv:2504.06263
region:us
open-r1/codeforces-cots
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-03-28
license:cc-by-4.0
size_categories:100K<n<1M
format:parquet
modality:tabular
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
region:us
open-r1/OpenR1-Math-220k
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-02-18
language:en
license:apache-2.0
size_categories:100K<n<1M
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
region:us
open-thoughts/OpenThoughts2-1M
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-07
license:apache-2.0
size_categories:1M<n<10M
format:parquet
modality:text
library:datasets
library:dask
library:mlcroissant
library:polars
region:us
synthetic
...truncated...
openai/gsm8k
Use case:
Source: Hugging Face | Type: Text | Updated: 2024-01-04
task_categories:text2text-generation
annotations_creators:crowdsourced
language_creators:crowdsourced
multilinguality:monolingual
source_datasets:original
language:en
license:mit
size_categories:10K<n<100K
format:parquet
modality:text
...truncated...
openai/openai_humaneval
Use case:
Source: Hugging Face | Type: Text | Updated: 2024-01-04
task_categories:text2text-generation
annotations_creators:expert-generated
language_creators:expert-generated
multilinguality:monolingual
source_datasets:original
language:en
license:mit
size_categories:n<1K
format:parquet
modality:text
...truncated...
patrickaudriaz/tobacco3482jpg
Document Structure Learning Dataset
Use case:
Source: Kaggle | Type: Text | Updated: 2019-04-10
earth and nature
education
image
proj-persona/PersonaHub
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-03-04
task_categories:text-generation
task_categories:text-classification
task_categories:token-classification
task_categories:fill-mask
task_categories:table-question-answering
task_categories:text2text-generation
language:en
language:zh
license:cc-by-nc-sa-4.0
size_categories:100K<n<1M
...truncated...
Rapidata/2k-ranked-images-open-image-preferences-v1
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-10
license:apache-2.0
size_categories:1K<n<10K
format:parquet
modality:image
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
region:us
...truncated...
Rapidata/Reve-AI-Halfmoon_t2i_human_preference
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-08
task_categories:text-to-image
task_categories:image-to-text
task_categories:image-classification
task_categories:reinforcement-learning
language:en
license:cdla-permissive-2.0
size_categories:10K<n<100K
format:parquet
modality:image
modality:text
...truncated...
Rapidata/text-2-video-human-preferences-pika2.2
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-08
task_categories:video-classification
task_categories:text-to-video
task_categories:text-classification
language:en
license:apache-2.0
size_categories:1K<n<10K
format:parquet
modality:image
modality:tabular
modality:text
...truncated...
ritvik1909/document-classification-dataset
A small dataset to try out Document Classification algorithms
Use case:
Source: Kaggle | Type: Text | Updated: 2022-07-06
computer science
nlp
computer vision
image
text
multiclass classification
sachinsharma1123/document-classification
Classify the document with the correct labels
Use case:
Source: Kaggle | Type: Text | Updated: 2020-07-12
earth and nature
computer science
shaz13/real-world-documents-collections
A document type collection from various public datasets
Use case:
Source: Kaggle | Type: Text | Updated: 2020-07-06
earth and nature
shivamb/legal-citation-text-classification
Legal Industry - Citations Text Classification
Use case:
Source: Kaggle | Type: Text | Updated: 2021-11-11
australia
government
law
nlp
text
shivamkushwaha/bbc-full-text-document-classification
2225 documents in five categories can be used for clustering and classification.
Use case:
Source: Kaggle | Type: Text | Updated: 2019-01-26
software
SparkAudio/voxbox
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-11
task_categories:text-to-speech
language:zh
language:en
license:cc-by-nc-sa-4.0
size_categories:10M<n<100M
format:webdataset
modality:audio
modality:text
library:datasets
library:webdataset
...truncated...
starriver030515/FUSION-Finetune-12M
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-10
task_categories:question-answering
task_categories:visual-question-answering
task_categories:table-question-answering
language:en
language:zh
license:apache-2.0
size_categories:1K<n<10K
format:parquet
modality:image
modality:text
...truncated...
sthabile/noisy-and-rotated-scanned-documents
Can a predictive model be used to recognise the angle of a scanned document?
Use case:
Source: Kaggle | Type: Text | Updated: 2020-03-06
cnn
text
sunilthite/text-document-classification-dataset
Text Document Classification Dataset for Classification and Clustering
Use case:
Source: Kaggle | Type: Text | Updated: 2023-12-04
education
software
text
news
english
tanishqdublish/text-classification-documentation
Text Document Classification Dataset for Classification and Clustering
Use case:
Source: Kaggle | Type: Text | Updated: 2024-01-12
education
software
news
email and messaging
text classification
UCSC-VLAA/MedReason
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-10
license:apache-2.0
size_categories:10K<n<100K
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
arxiv:2504.00993
region:us
...truncated...
vevotx/Tahoe-100M
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-08
license:cc0-1.0
size_categories:100M<n<1B
format:parquet
modality:tabular
modality:text
modality:timeseries
library:datasets
library:dask
library:mlcroissant
library:polars
...truncated...
virtuoussy/Multi-subject-RLVR
Use case:
Source: Hugging Face | Type: Text | Updated: 2025-04-02
task_categories:question-answering
language:en
license:apache-2.0
size_categories:100K<n<1M
format:parquet
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
...truncated...
wikimedia/wikipedia
Use case:
Source: Hugging Face | Type: Text | Updated: 2024-01-09
task_categories:text-generation
task_categories:fill-mask
task_ids:language-modeling
task_ids:masked-language-modeling
language:ab
language:ace
language:ady
language:af
language:alt
language:am
...truncated...
yahma/alpaca-cleaned
Use case:
Source: Hugging Face | Type: Text | Updated: 2023-04-10
task_categories:text-generation
language:en
license:cc-by-4.0
size_categories:10K<n<100K
format:json
modality:text
library:datasets
library:pandas
library:mlcroissant
library:polars
...truncated...
yiweilu2033/well-documented-alzheimers-dataset
This is a well-documented, skull-stripped, new MRI dataset. Take what you want.
Use case:
Source: Kaggle | Type: Text | Updated: 2024-12-16
diseases
computer vision
classification
deep learning
image
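Many of the tag lines in the entries above follow a `key:value` pattern (for example `license:mit` or `size_categories:10K<n<100K`), mixed with bare topic tags such as `code` or `nlp`. The following is a minimal sketch of how such tag lines could be grouped for filtering, assuming the tags are available as a list of strings. The function name `parse_tags` and the `topics` key are hypothetical choices made for this illustration and are not part of any catalog or library API.

```python
from collections import defaultdict


def parse_tags(lines):
    """Group catalog tag lines into a dict mapping key -> list of values.

    Lines shaped like "key:value" are split on the first colon. Bare tags
    (e.g. "code") are collected under the hypothetical key "topics".
    Blank lines and the "...truncated..." marker are skipped.
    """
    grouped = defaultdict(list)
    for raw in lines:
        tag = raw.strip()
        if not tag or tag == "...truncated...":
            continue
        key, sep, value = tag.partition(":")
        if sep:
            grouped[key].append(value)
        else:
            grouped["topics"].append(tag)
    return dict(grouped)


# Tag lines copied from the first entry in the listing.
example = [
    "task_categories:text-generation",
    "language:zh",
    "language:en",
    "license:cc-by-nc-4.0",
    "size_categories:1M<n<10M",
    "arxiv:2503.19633",
    "region:us",
    "code",
    "math",
    "reasoning",
    "...truncated...",
]

tags = parse_tags(example)
print(tags["language"])
print(tags["license"])
print(tags["topics"])
```

Because a truncated tag list is incomplete, a key missing from the parsed result does not mean the dataset lacks that tag.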