# Local Model Training & Fine-Tuning Guide
## What Was Established
Guide for fine-tuning local LLMs (DeepSeek) using Hugging Face transformers, with emphasis on VRAM-efficient techniques for single-GPU setups.
## Key Decisions
- **Framework:** Hugging Face `transformers` + `Trainer` API for fine-tuning
- **Model:** `deepseek-ai/deepseek-llm-7b` (example model)
- **Efficiency:** LoRA (Low-Rank Adaptation) + 4-bit quantization via `bitsandbytes` to fit large models on consumer GPUs
## Setup
```bash
pip install torch transformers datasets accelerate peft bitsandbytes
```

Verify the GPU with `nvidia-smi` — CUDA 11.8+ is required.
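If `nvidia-smi` looks fine but training still falls back to CPU, a quick PyTorch-side check (assuming the `torch` install from above) can confirm CUDA is visible:

```python
import torch

# Confirm the installed PyTorch build can see the GPU.
print(torch.cuda.is_available())           # should print True
print(torch.version.cuda)                  # CUDA runtime version PyTorch was built against
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. the GPU model reported by nvidia-smi
```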
## Fine-Tuning Steps
### 1. Load Model with Quantization (VRAM-saving)
```python
import torch  # needed for the bfloat16 compute dtype below
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "deepseek-ai/deepseek-llm-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# 4-bit NF4 quantization with double quantization keeps the 7B weights
# within consumer-GPU VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```

### 2. Prepare Dataset
Dataset formats:

- **JSONL (recommended):** `{"text": "...", "output": "..."}`
- **Hugging Face Datasets:** `load_dataset("json", data_files="train.jsonl")`
- **Plain text:** for continued pretraining
Structure for instruction fine-tuning (a loading/formatting sketch follows):

```
### Instruction: {user_query}
### Response: {answer}
```
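A minimal sketch of loading the JSONL file and rendering each record into the template above, assuming the field names from the JSONL example (`text` for the user query, `output` for the answer):

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="train.jsonl")

def format_example(example):
    # Render each record into the instruction/response template.
    example["text"] = (
        f"### Instruction: {example['text']}\n"
        f"### Response: {example['output']}"
    )
    return example

dataset = dataset.map(format_example)
```

The tokenization step below then operates on the formatted `text` column.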
### 3. Tokenize & Chunk

```python
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)
dataset = tokenized_dataset["train"].train_test_split(test_size=0.1)
```

### 4. Configure & Run Training
With `per_device_train_batch_size=4` and `gradient_accumulation_steps=2`, the effective batch size is 8.

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size of 8
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=500,
    fp16=True,   # on Ampere or newer GPUs, bf16=True matches the bfloat16 compute dtype from step 1
    optim="adamw_torch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)

trainer.train()
```

### 5. LoRA for VRAM Efficiency
Apply the LoRA config before constructing the `Trainer` in step 4, so that only the small adapter matrices are trained.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
```
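As a sanity check, the PEFT-wrapped model can report how few parameters LoRA actually trains:

```python
# Prints trainable vs. total parameter counts; with r=8 on q_proj/v_proj
# only a small fraction of the 7B parameters is trainable.
model.print_trainable_parameters()
```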
### 6. Save & Load

```python
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

# Load for inference (if LoRA was used, this directory holds the adapter
# weights, and loading it through pipeline typically requires peft installed)
from transformers import pipeline
pipe = pipeline("text-generation", model="./fine_tuned_model", device=0)
```
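A hypothetical generation call against the loaded pipeline, reusing the instruction template from step 2 (the prompt text is illustrative only):

```python
prompt = "### Instruction: Explain the difference between LoRA and full fine-tuning.\n### Response:"
result = pipe(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])
```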
## Advanced Tips

- **Gradient checkpointing:** `model.gradient_checkpointing_enable()` — reduces VRAM (see the sketch after this list)
- **DeepSpeed:** `pip install deepspeed` — for multi-GPU training
- **Monitor training:** `tensorboard --logdir ./results/runs`
- **Start small:** test with 1K samples before scaling up
- **Monitor GPU:** `nvidia-smi` during training
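A minimal sketch of the gradient-checkpointing tip above, applied to the model loaded in step 1 (disabling the KV cache alongside it is a common companion setting during training):

```python
model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save VRAM
model.config.use_cache = False         # the KV cache is not needed while training
```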
## Domain-Specific: Legal Document Training

For legal documents (contracts, emails, court docs):

- **Extract text:** `pdfplumber` for PDFs, the `email` stdlib module for emails, `python-docx` for DOCX
- **Redact PII:** `spaCy` NER to mask names, organizations, dates (see the sketch after this list)
- **Chunk by sections** (e.g., "Article 1" headings) with max 1024 tokens
- **Use LoRA** for small datasets
- **Combine with RAG** for real-time retrieval
- **Evaluate** with ROUGE-L (summarization), precision/recall (clause extraction)
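A hypothetical sketch of the spaCy-based PII redaction mentioned above, assuming the `en_core_web_sm` model has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def redact(text: str) -> str:
    """Replace person, organization, and date entities with their label."""
    doc = nlp(text)
    redacted = text
    # Work backwards so character offsets stay valid as we splice.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "ORG", "DATE"}:
            redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
    return redacted

print(redact("Acme Corp signed the agreement with Jane Doe on March 3, 2021."))
```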
## Open Questions
- None identified — this was a general guide, not a specific project decision.
## Related Pages
Troubleshooting DeepSeek Language Switching, Wiki Pipeline Scripts, Ollama Configuration
## Sources
- `ingested/chats/031-Training Locally Hosted AI Models Guide.md` (Training Locally Hosted AI Models Guide)
- `raw/conversations/deepseek/031-Training Locally Hosted AI Models Guide.json`