Dataset Formatting Standards
Intermediate
Enforce consistent dataset formatting for HuggingFace: column naming, instruction format, train/test splits, and metadata requirements for reproducible ML experiments.
File Patterns
**/*.py
**/*.jsonl
**/*.csv
This rule applies to files matching the patterns above.
Rule Content
rule-content.md
# Dataset Formatting Standards
## Rule
All datasets prepared for HuggingFace MUST follow standard column naming, include metadata, use deterministic splits, and include a dataset card (README.md).
## Column Naming Standards
### Instruction Tuning
```python
# Required columns: "instruction" and "output"; "input" is optional
{
    "instruction": "Write a test for this function",
    "input": "def add(a, b): return a + b",  # Optional context
    "output": "def test_add(): assert add(2, 3) == 5",
}
```
### Chat Format
```python
# Required column: "messages", a list of {"role", "content"} turns
{
    "messages": [
        {"role": "system", "content": "You are a code reviewer."},
        {"role": "user", "content": "Review this function..."},
        {"role": "assistant", "content": "The function has..."},
    ]
}
```
### Text Classification
```python
{
    "text": "The API response time improved by 40%",
    "label": "positive",
}
```
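
The three column layouts above can be checked mechanically before upload. The following is a minimal sketch, not part of the rule itself: the function name `detect_format` and the allowed role set (taken from the chat example) are assumptions made for illustration.

```python
from typing import Any

# Assumption: the only valid roles are the three shown in the chat example.
CHAT_ROLES = {"system", "user", "assistant"}


def detect_format(record: dict[str, Any]) -> str:
    """Return the format a record conforms to, or raise ValueError."""
    keys = set(record)
    if keys == {"messages"}:
        messages = record["messages"]
        if not isinstance(messages, list) or not messages:
            raise ValueError("'messages' must be a non-empty list")
        for msg in messages:
            if not isinstance(msg, dict) or set(msg) != {"role", "content"}:
                raise ValueError("each message needs exactly 'role' and 'content'")
            if msg["role"] not in CHAT_ROLES:
                raise ValueError(f"unknown role: {msg['role']!r}")
        return "chat"
    # "input" is optional, so accept the column set with or without it.
    if {"instruction", "output"} <= keys <= {"instruction", "input", "output"}:
        return "instruction"
    if keys == {"text", "label"}:
        return "classification"
    raise ValueError(f"unrecognized columns: {sorted(keys)}")
```

Running this over every row and asserting that a single format name comes back is one way to catch mixed layouts early.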
## Split Requirements
```python
# ALWAYS use deterministic splits with a fixed seed
# (`dataset` is an existing datasets.Dataset)
split = dataset.train_test_split(test_size=0.1, seed=42)  # 90% train, 10% test

# Standard split ratios
#   Training:   80-90%
#   Validation: 5-10%
#   Test:       5-10%
# For small datasets (< 1000 examples), use k-fold cross-validation
```
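
`train_test_split` yields two splits, while the ratios above name three, and the k-fold advice has no code attached. One way to cover both is to compute seeded index lists in plain Python and then pick rows by index. This is a sketch under assumptions: `split_indices` and `kfold_indices` are hypothetical helper names, and the 80/10/10 default is just one point inside the stated ranges.

```python
import random


def split_indices(n, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle range(n) with a fixed seed and cut it into three index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # same seed, same order every run
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "test": indices[:n_test],
        "validation": indices[n_test:n_test + n_val],
        "train": indices[n_test + n_val:],
    }


def kfold_indices(n, k=5, seed=42):
    """Yield (train, validation) index lists for k folds of a seeded shuffle."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]
```

With a `datasets.Dataset`, each index list can be passed to `dataset.select(...)` to materialize the corresponding split.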
## Dataset Card (Required)
```markdown
---
language: en
license: apache-2.0
task_categories:
- text-generation
size_categories:
- 1K<n<10K
---
# Dataset Name
## Description
What this dataset contains and what it's for.
## Format
Column descriptions and example rows.
## Collection Method
How the data was collected/generated.
## Limitations
Known biases, missing categories, quality issues.
```
## Good Formatting
```python
from datasets import Dataset

dataset = Dataset.from_list([
    {"instruction": "Explain recursion", "output": "Recursion is..."},
    # Consistent format, clean text, no HTML artifacts
])
```
## Bad Formatting
```python
from datasets import Dataset

dataset = Dataset.from_list([
    {"prompt": "explain recursion", "response": "Recursion is..."},
    {"input": "what is OOP", "answer": "OOP stands for..."},
    # Inconsistent column names, mixed formats
])
```
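
Rows like the ones in the bad example can often be repaired by renaming columns before building the dataset. The following sketch is illustrative only: the alias table and the `normalize` helper are hypothetical, and treating a lone `"input"` as the instruction is an assumption based on the second bad row.

```python
# Hypothetical alias table covering the non-standard names in the bad example.
ALIASES = {
    "prompt": "instruction",
    "response": "output",
    "answer": "output",
}


def normalize(record):
    """Rename known aliases to the standard instruction-tuning columns."""
    out = {}
    for key, value in record.items():
        target = ALIASES.get(key, key)
        if target in out:
            raise ValueError(f"duplicate column after renaming: {target}")
        out[target] = value
    # Assumption: a row with "input" but no "instruction" used "input"
    # to hold the instruction itself.
    if "instruction" not in out and "input" in out:
        out["instruction"] = out.pop("input")
    missing = {"instruction", "output"} - set(out)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return out
```

Applying `normalize` to each row and passing the results to `Dataset.from_list` gives every row the same columns; rows that cannot be mapped raise instead of slipping through.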
## Anti-Patterns
- Inconsistent column names across rows
- No dataset card (README.md), so others can't understand the data
- Non-deterministic splits (different results every run)
- Missing data quality checks (empty strings, duplicates, encoding issues)
- No train/test split (can't evaluate model performance)
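
The data quality checks named in the list above (empty strings, duplicates, encoding issues) can be sketched as a single pass over the rows. This is a minimal example under assumptions: `quality_report` is a hypothetical helper, duplicates are detected by exact match only, and encoding damage is detected only through the Unicode replacement character U+FFFD.

```python
def quality_report(records, text_fields=("instruction", "output")):
    """Count rows with empty fields, exact duplicates, and encoding damage."""
    report = {"empty": 0, "duplicate": 0, "encoding": 0}
    seen = set()
    for record in records:
        values = [str(record.get(field, "")) for field in text_fields]
        # A missing or whitespace-only field counts as empty.
        if any(not v.strip() for v in values):
            report["empty"] += 1
        key = tuple(values)
        if key in seen:
            report["duplicate"] += 1
        seen.add(key)
        # U+FFFD appears where bytes could not be decoded.
        if any("\ufffd" in v for v in values):
            report["encoding"] += 1
    return report
```

A pre-upload script might require all three counts to be zero, or at least log them in the dataset card's Limitations section.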