AI market intelligence model · Live on Ollama

Quantiva

Fine-tuning a Local LLM into an Indian Equity Technical Analyst

A case study in fine-tuning, debugging, and benchmarking an 8B open-weight model into a focused, grounded domain assistant — built end to end on a Mac mini and a $0.067/hr rented GPU.

Qwen3-8BUnsloth (LoRA)llama.cpp / GGUFOllamaMongoDBPython

Hardware: Apple M2 Pro (16 GB) for dev + deployment; rented RTX 4090 for training/merge

Total GPU spend: ~$2–3 across the entire project

View on Ollama

1. The problem

I wanted a local LLM that behaves like an Indian equity technicalanalyst: given a stock's recent price action, it should produce a structured, disciplined read — correct overbought/oversold classification, clear invalidation levels, Indian-market conventions (₹, NSE symbols, Nifty index tags) — and do it entirely on my own machine, fed by my own market database.

Out of the box, the base model (Qwen3-8B) was not fit for this. When handed a stock data block, it exhibited consistent, reproducible failure modes:

Inverted RSI logic — calling an RSI of 37 "overbought," or a deep-oversold reading "bullish momentum."
Wrong trade directions — placing stop-losses above entry, targets below.
Currency/format drift — "Rs" instead of ₹, occasional CJK characters in output.
No structural discipline — no reasoning block, no explicit "what invalidates this view."

The goal was to fix these specific, observable failures — and then prove they were fixed.

2. Architecture decision: fine-tune for skill, retrieve for data

The first real decision shaped everything else: the model should learn the skill of analysis, not memorize stock data.

Fine-tuning teaches voice, RSI logic, output format, and discipline. Done once.
A retrieval layer (MongoDB + a context-builder function) injects live data into the prompt at question-time.

This separation means the model never goes stale: prices update daily in MongoDB without ever retraining. The model is the analyst's brain; the database is the live data feed. It also rules out a whole class of problems — the model isn't expected to "know" prices, so it shouldn't be hallucinating them.

No vector DB — structured OHLCV lives in MongoDB and is queried directly. Universe: Nifty 500 / total market, ~5 years of daily candles already collected.

3. Data pipeline: get_stock_context()

The bridge between MongoDB and the model is a single function that, given a ticker:

Pulls ~60 daily candles from MongoDB.
Computes indicators: RSI(14), 50/200-day moving averages, 52-day range, period returns, volume ratio.
Formats a fixed text block — index header, ₹ values, the [TICKER NSE - Data as of …] convention.

This format is the contract: the model is trained on exactly this layout, so the inference-time context must match it byte-for-byte.

4. Training data: diversity over volume

I built the training set deliberately small and clean rather than large and noisy.

5 hand-written gold-standard seeds spanning the RSI spectrum: RELIANCE (oversold 26.3), ADANIENT (overbought 76.6), ITC (oversold 27.9), HDFCBANK (neutral 37.3), LT (bullish-but-not-overbought 63.7).
Scaled to 205 examples via a deterministic, rule-driven generator, then an independent validator, then a language-variety pass for natural phrasing.
Validation gates: 0 ticker mismatches, 0 "Rs" leakage, 0 CJK characters, 0 RSI misclassifications, 197/199 unique headlines.

I deliberately did not scale to 500 — past ~200 clean, diverse examples, more rows mostly added redundancy. Split 90/10 (seed=42) → 184 train / 21 val, formatted as system→user→assistant conversations.

5. Fine-tuning: LoRA on a rented RTX 4090

Using Unsloth's FastLanguageModel on unsloth/Qwen3-8B (4-bit), LoRA adapters (r=16, α=16) on the attention + MLP projections, Qwen3 chat template, train_on_responses_only, 3 epochs, lr 2e-4.

Result (4m 11s, 69 steps):

Metric	Value
Train loss	2.243 → 1.119
Eval loss	1.198 → 0.951 → 0.881 → 0.872
Trainable params	0.53% of total

Eval loss declined monotonically with no divergence — no overfitting. A direct GPU eval on held-out tickers (WIPRO 45.2, SUNPHARMA 78.4) confirmed every base-model failure mode was fixed: correct RSI classification, correct trade directions, clean ₹, decisive reasoning, varied structure. Fine-tuning verifiably worked at the adapter level.

6. The hard bug: a base-model mismatch in the GGUF merge

To run locally in Ollama, the LoRA adapter had to be merged into the base weights and exported to GGUF. The Mac couldn't do the fp16 merge (≈32 GB RAM needed, 16 GB available), so I did it on the GPU box. The first merge produced a GGUF that, when loaded in Ollama, looped endlessly, inverted RSI again, and analyzed the index instead of the stock.

Root cause:the adapter was trained on unsloth/qwen3-8b-unsloth-bnb-4bit, but the merge script loaded a different base — stock Qwen/Qwen3-8B. Merging an adapter onto a base it wasn't trained against corrupted the weights, producing degenerate output.

Fix:a re-merge script that loads the adapter on its exact training base via Unsloth's native pipeline, exports q8_0 through Unsloth's own llama.cpp path, and auto-writes a correct Qwen3 ChatML Modelfile.

Lesson: when output degenerates after a known-good training run, suspect the plumbing (merge base, export path, chat template) before the model.

7. Local deployment

ollama create from the corrected Modelfile, then ollama run. The model runs locally on the M2 Pro at ~5–10 tok/s — usable, not fast. Held-out tests across the RSI spectrum (oversold 24.8, neutral 58, bullish 68.5, overbought 78–81) all produced clean, correctly-classified, non-looping output. The Python wrapper wires get_stock_context() → Ollama so a user types a ticker and gets analysis grounded in live MongoDB data. The model is now published as mrinalux/lcx-ai on Ollama.

8. The hallucination problem (and why it's the real product risk)

Asked a question its data couldn't answer — "How has ADANIGREEN done over the last 5 years?" — the model fabricateda precise figure ("+185.58%"), because get_stock_context() only feeds ~60 days. With no 5-year data in the prompt, the model filled the gap with plausible fiction.

This is the single most dangerous property for a finance tool: confidently wrong numbers a user might act on. The fix has two layers:

Feed real data — extend the context builder to compute and include actual historical returns, CAGR, all-time-high, and drawdown from MongoDB.
Constrain scope — restrict the assistant to analyzing only the data block present, declining out-of-scope questions rather than inventing.

The honest takeaway: the model is a ~60-day technical analyst. Within its data window it's reliable; outside it, it must be prevented from answering, not trusted to.

9. Benchmarking: proving it, honestly

A claim isn't proven until it's measured against a baseline on held-out data with objective checks.

25 held-out stocks, zero overlap with the 205 training symbols, stratified 5-per-band across the RSI spectrum.
10 auto-graded checks, each mapped to a specific failure mode the project claimed to fix.
Identical inputs (same system prompt, same sampling) to both the fine-tuned model and base Qwen3-8B.

The first benchmark run scored a fake 0/25 due to two grader bugs — a benchmark that fails good answers is worse than no benchmark — so the grader was fixed before any result was reported.

Final result (held-out, identical prompts):

Check	Quantiva	Base Qwen3-8B
RSI label correct	100%	92%
RSI value cited	100%	96%
₹ used (not Rs)	100%	96%
No CJK characters	100%	100%
No repetition/loop	100%	100%
Reasoning block present	100%	0%
Current price cited	100%	80%
Invalidation / stop-loss	100%	92%
No hallucinated prices	100%	100%
Response length OK	100%	96%
Overall pass	100%	0%

Where fine-tuning made the decisive difference is format adherence and analytical discipline — the reasoning block (100% vs 0%) and explicit invalidation (100% vs 92%), plus consistent completeness.

10. Known limitations

Technical-only, data-dependent. Analyzes a provided ~60-day price/RSI/volume snapshot; can't fetch data itself; no fundamentals, news, or macro.
Hallucinates outside its data window. Mitigated by feeding real data + scope restriction, but a known risk requiring enforcement, not trust.
Borderline-RSI sensitivity in the 30–45 and 60–70 zones where the oversold/overbought boundary is fuzziest.
Small training set (205). Limits coverage of unusual regimes, though held-out testing showed good generalization.
Latency. ~33–54 s per query on the M2 Pro; a consumer GPU brings this to single-digit seconds.
Self-reported benchmark, n=25. Grades technical correctness and format discipline, not investment profitability.
Not financial advice. Never validated against actual market outcomes.

11. What this demonstrates

End to end, on consumer hardware and ~$3 of GPU time:

Collected and structured ~5 years of NSE OHLCV into MongoDB.
Built a retrieval + technical-indicator layer.
Diagnosed reproducible base-model failure modes.
Generated and validated a small, diverse, gate-checked training set.
Fine-tuned Qwen3-8B with LoRA (eval loss 1.20 → 0.87, no overfitting).
Diagnosed and fixed a non-obvious base-mismatch merge bug — by isolating training from export.
Quantized to q8_0 GGUF and deployed locally via Ollama.
Wired live MongoDB data into inference; identified and began enforcing against hallucination.
Built a stratified, held-out, baseline-compared benchmark — and caught/fixed its own grader bugs before trusting results.

The project's real lessons are the unglamorous ones: fine-tuning was the easy 30%; the model isn't the product; isolate failures before fixing them; and a benchmark is only as trustworthy as its grader.