Project Overview
Buying or selling a used car in Nigeria is an exercise in information asymmetry. Without reliable price data, buyers overpay and sellers undervalue — and neither party has a data-backed anchor for negotiation. This model changes that.
Built from 9,835 listings scraped directly from Autochek Nigeria (8,022 cars after cleaning and the price cap), the LightGBM model predicts used car prices in Nigerian Naira with an R² of 83% and a MAPE of 15.56%, meaning predictions are off by about 16% of the actual market price on average. The key technical innovation is using sentence embeddings from all-mpnet-base-v2 to encode car names rather than manually extracting brand, model, and year as separate features.
Business impact: Given the average car price in the dataset of ₦18.1 million, a 15.56% MAPE translates to a typical prediction error of ~₦2.8 million — a reasonable margin for a market where negotiation ranges are often ₦2–5 million.
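The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch that uses only the mean price and MAPE quoted in this overview; the variable names are illustrative.

```python
# Figures quoted in this overview
mean_price_ngn = 18_140_400   # mean listing price in the dataset (Naira)
mape = 0.1556                 # mean absolute percentage error (15.56%)

# Typical absolute error implied at the mean price
typical_error_ngn = mean_price_ngn * mape
print(f"Typical error: NGN {typical_error_ngn / 1e6:.1f} million")
```

Because MAPE is a relative measure, the same percentage implies a smaller Naira error for cheaper cars and a larger one near the ₦40 million cap.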
Why the Nigerian Used Car Market?
Nigeria's automotive market has structural characteristics that make second-hand cars the dominant purchase category — creating a large, underserved market for data-backed pricing tools:
- High import duties make new vehicles unaffordable for the majority of buyers
- Lower maintenance costs on familiar used models create strong value preference
- No standardized pricing data, so buyers and sellers negotiate without anchors
Global tools like Kelley Blue Book or CarGurus don't cover the Nigerian market. Prices vary significantly by origin (local vs. foreign-used), mileage is reported inconsistently, and brand perception in Nigeria differs from Western markets. All of this calls for a locally trained model rather than an adaptation of international tools.
Dataset
9,835 listings scraped from Autochek Nigeria (autochek.africa/ng) — one of Nigeria's largest used car marketplaces — with prices capped at ₦40 million to focus on the mainstream second-hand market segment:
| Statistic | Price (₦) |
|---|---|
| Mean | 18,140,400 |
| Median | 16,500,000 |
| Std Deviation | 8,460,545 |
| Min (after cleaning) | 2,315,000 |
| Max (cap) | 40,000,000 |
| Final sample size | 8,022 cars |
Raw features: Car name (brand + model + year), price, origin (local/foreign), mileage, engine type, transmission, fuel type, interior color, exterior color.
Price cap rationale: Capping at ₦40 million aligns the model with the target demographic — budget-conscious buyers in the second-hand market. Including ultra-luxury vehicles (>₦40M) would introduce a price segment with entirely different dynamics and degrade performance for the 95% of listings below the cap.
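A price cap like this amounts to a simple filter over the scraped rows. The sketch below is an assumption about how it could be done, not the project's actual cleaning code: the function name, the dict-based row layout, and the sample rows are all hypothetical, and the cap is treated as inclusive because the table lists ₦40,000,000 as the maximum.

```python
PRICE_CAP_NGN = 40_000_000  # cap stated in this overview

def apply_price_cap(listings, cap=PRICE_CAP_NGN):
    """Keep listings that have a known price at or below the cap."""
    return [
        row for row in listings
        if row.get("price") is not None and row["price"] <= cap
    ]

# Toy rows for illustration only (not real Autochek data)
sample = [
    {"name": "Toyota Camry 2015", "price": 16_500_000},
    {"name": "Hypothetical Luxury SUV 2022", "price": 95_000_000},
    {"name": "Honda Accord 2012", "price": None},
]
kept = apply_price_cap(sample)
print([row["name"] for row in kept])
```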
The Semantic Embedding Innovation
The core technical decision that differentiates this model is how car names are encoded. The conventional approach — manually extracting brand, model, and year as separate categorical features — loses important contextual relationships. Sentence embeddings preserve them:
Manual Feature Extraction
- Extract brand as separate column
- Extract model as separate column
- Extract year as numeric feature
- One-hot encode brand/model (high cardinality)
- Loses semantic relationships between models
- Fails completely on unseen car names
Sentence Embeddings (all-mpnet-base-v2)
- Encode full car name as 768-dim vector
- Captures "Toyota Camry 2015 vs 2016" nuance
- Preserves brand prestige relationships
- Generalizes to unseen car models
- Stepwise selection reduces 768D → 10 features
- Maintains contextual information throughout
The all-mpnet-base-v2 model from Sentence Transformers generates a 768-dimensional embedding for each car name. Stepwise feature selection then identifies the 10 most informative dimensions — retaining the semantic signal while eliminating noise. This approach is particularly valuable for the Nigerian market where the same model name can represent very different value propositions across years (e.g., a Toyota Camry 2010 vs. 2020 in a market with specific import patterns).
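The encoding step described above can be sketched with the Sentence Transformers library. This is a minimal illustration, not the project's code: the helper names are hypothetical, and the zero-based `name_emb_<i>` column naming is an assumption inferred from the feature names used in this write-up. The library import is done lazily inside the function so the naming helper can be used without it installed.

```python
def embedding_columns(dim=768, prefix="name_emb_"):
    """Column names for each embedding dimension, e.g. name_emb_307.

    Assumes zero-based indexing (name_emb_0 ... name_emb_767).
    """
    return [f"{prefix}{i}" for i in range(dim)]

def embed_car_names(names):
    """Encode car names into 768-dim vectors with all-mpnet-base-v2."""
    # Imported lazily: requires `pip install sentence-transformers`
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-mpnet-base-v2")
    # encode() returns one vector per input string: shape (len(names), 768)
    return model.encode(list(names))

# Example usage (downloads the model on first call):
# vectors = embed_car_names(["Toyota Camry 2010", "Toyota Camry 2020"])
# features = dict(zip(embedding_columns(), vectors[0]))
```

Encoding the full name string keeps brand, model, and year in one input, so no parsing rules are needed for names that don't follow a fixed pattern.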
Feature Engineering
The final model uses 12 features selected from hundreds of candidates — 10 embedding dimensions plus 2 categorical variables:
Semantic Embedding Dimensions (10)
Dimensions selected from the 768-dim car name embedding include name_emb_307, name_emb_741, name_emb_559, name_emb_618, name_emb_207, name_emb_661, name_emb_766, name_emb_541, and name_emb_518, each capturing a different aspect of brand/model identity.
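Stepwise selection of this kind is often implemented as a greedy forward search. The write-up does not say which direction or scoring criterion was used, so the sketch below is an assumption: a generic forward selection that takes any scoring function (for example, a cross-validated R² from a model fit on the candidate subset). The function name and the toy gains are hypothetical.

```python
from __future__ import annotations
from typing import Callable, Sequence

def forward_stepwise(candidates: Sequence[str],
                     score: Callable[[list[str]], float],
                     k: int) -> list[str]:
    """Greedy forward selection: add the feature that most improves the score."""
    selected: list[str] = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining and len(selected) < k:
        # Score every one-feature extension of the current subset
        trial = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial, key=trial.get)
        if trial[best_feature] <= best_score:
            break  # no remaining feature improves the score
        best_score = trial[best_feature]
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected

# Toy scoring function for illustration: each feature adds a fixed, made-up gain
toy_gain = {"name_emb_0": 0.5, "name_emb_1": 0.3, "name_emb_2": 0.0, "origin_local": 0.4}
chosen = forward_stepwise(list(toy_gain), lambda fs: sum(toy_gain[f] for f in fs), k=3)
print(chosen)
```

In practice the scoring function would refit a model for each candidate subset, which is expensive with 768 embedding dimensions, so the stopping rule and `k` matter for runtime.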
Origin (Local vs Foreign)
origin_local — whether the car is locally sourced or foreign-used. A strong predictor due to Nigeria's import duty structure: foreign-used (Tokunbo) vehicles command different pricing than locally-registered equivalents.
Interior Color: Coffee Brown
interior_color_coffee_brown — showed statistical significance in price predictions. Specific interior color preferences vary meaningfully in the Nigerian market, influencing resale value beyond what intuition suggests.
Exterior Color: Dark Silver
exterior_color_dark_silver — dark silver exterior also showed statistically significant price correlation. Color preferences in used car markets are market-specific and this feature captures a Nigerian buyer preference signal.
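Indicator features such as origin_local, interior_color_coffee_brown, and exterior_color_dark_silver look like one-hot encoded categories. The sketch below shows one way such columns could be produced; the naming rule (lowercase, spaces replaced by underscores) and both helper functions are assumptions inferred from the feature names, not the project's actual preprocessing.

```python
def indicator_name(column, value):
    """Build a one-hot column name such as interior_color_coffee_brown."""
    return f"{column}_{value.strip().lower().replace(' ', '_')}"

def one_hot(row, column, levels):
    """Return {indicator_name: 0 or 1} for one categorical column of a row."""
    return {indicator_name(column, lv): int(row.get(column) == lv) for lv in levels}

# Toy listing for illustration only
listing = {"origin": "Local", "interior_color": "Coffee Brown"}
features = {}
features.update(one_hot(listing, "origin", ["Local", "Foreign"]))
features.update(one_hot(listing, "interior_color", ["Coffee Brown", "Black"]))
print(features)
```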
Removed Features
Mileage and engine type were dropped during feature selection — both showed minimal influence on price predictions in the Nigerian market context. This is likely due to inconsistent mileage reporting in Nigerian listings, and buyer prioritization of brand/model prestige over technical specifications when making purchase decisions.
Model & Performance
Algorithm: LightGBM Regressor — chosen for its speed, performance with tabular data, and ability to handle the mixed feature types (embedding dimensions + categorical indicators) without preprocessing overhead.
Configuration: n_estimators=300, max_depth=6, learning_rate=0.1, objective="regression"
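With LightGBM's scikit-learn interface, the configuration above could be expressed roughly as follows. This is a sketch under assumptions: the function name and the `X_train` / `y_train` arguments are hypothetical, and only the four hyperparameters listed in this write-up are set (everything else stays at library defaults). The import is lazy so the parameter dictionary can be inspected without LightGBM installed.

```python
# Hyperparameters as listed in this write-up
LGBM_PARAMS = {
    "n_estimators": 300,
    "max_depth": 6,
    "learning_rate": 0.1,
    "objective": "regression",
}

def train_price_model(X_train, y_train, params=LGBM_PARAMS):
    """Fit a LightGBM regressor on the selected features."""
    # Imported lazily: requires `pip install lightgbm`
    from lightgbm import LGBMRegressor

    model = LGBMRegressor(**params)
    model.fit(X_train, y_train)
    return model

# Example usage, assuming a feature matrix and a price vector already exist:
# model = train_price_model(X_train, y_train)
# predicted_prices = model.predict(X_test)
```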
Performance in context: In a market where sellers price cars by feel and negotiation ranges span ₦2–5 million, an RMSE of ₦3.48 million sits within the usual negotiation range, which is a meaningful improvement over having no reference price at all. The model gives both buyers and sellers a data-anchored starting point for any negotiation.
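For reference, the three metrics quoted in this write-up (R², MAPE, RMSE) follow standard definitions, sketched here in plain Python. The toy values are made up for illustration and are not the model's actual predictions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the prices."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a percentage."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy prices in millions of Naira (illustrative only)
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(rmse(actual, predicted), mape(actual, predicted), r2(actual, predicted))
```

RMSE penalizes large misses more heavily than MAPE does, which is one reason the RMSE figure can exceed the typical error implied by MAPE at the mean price.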
Residual Diagnostics
Three standard diagnostic checks support the model's fit:
- Residuals vs. Predicted Values: Random scatter around zero with no clear pattern, consistent with roughly constant error variance (homoscedasticity) and no systematic under- or over-prediction across the price range.
- Distribution of Residuals: Approximately normal and centered at zero, indicating that prediction errors are not biased in one direction.
- Q-Q Plot: Residuals closely follow the theoretical normal distribution in the central quantiles, with slight tail deviations indicating a small number of outliers, an expected pattern in vehicle pricing datasets.
What clean diagnostics tell us: The absence of systematic patterns in the residuals means the model isn't consistently wrong in any particular direction or price range. The errors look like random noise, not a structural modeling failure.
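The numbers behind these plots are straightforward to compute. The sketch below derives residuals and the point pairs for a normal Q-Q plot using only the standard library; the plotting position `(i + 0.5) / n` is one common convention and an assumption here, as are the function names and the toy values.

```python
from statistics import NormalDist, mean, pstdev

def residuals(y_true, y_pred):
    """Actual minus predicted, one value per listing."""
    return [t - p for t, p in zip(y_true, y_pred)]

def qq_points(res):
    """Pair theoretical normal quantiles with sorted standardized residuals."""
    mu, sigma = mean(res), pstdev(res)
    standardized = sorted((r - mu) / sigma for r in res)
    n = len(standardized)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, standardized))

# Toy prices in millions of Naira (illustrative only)
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [11.0, 19.0, 32.0, 38.0, 50.0]
res = residuals(actual, predicted)
points = qq_points(res)
print(mean(res), points)
```

Points lying close to the diagonal indicate approximately normal residuals; plotting `res` against `predicted` gives the residuals vs. predicted values view.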
Key Insights
Embeddings Beat Manual Extraction
Semantic embeddings capture brand prestige, model popularity, and year-over-year differences far better than manually splitting car names into brand/model/year columns. The 768-dimensional vector preserves relationships that categorical encoding destroys.
Mileage Is Not a Strong Predictor Here
Mileage — critical in Western car markets — has minimal predictive value in Nigeria. Inconsistent reporting in listings and a buyer culture focused on brand/model over odometer readings explain this counterintuitive finding.
Origin Drives Significant Price Differences
Local vs. foreign-used (Tokunbo) is a strong price predictor — reflecting Nigeria's import duty structure and distinct buyer preferences between domestically-registered and imported vehicles.
Specific Colors Matter in This Market
Coffee brown interior and dark silver exterior both showed statistical significance — a finding specific to Nigerian buyer preferences that wouldn't emerge from a model trained on international data.
Live Demo
The model is deployed on Hugging Face Spaces. Enter any car name, origin, transmission, fuel type, and color details to get an instant price estimate in Nigerian Naira.