adeyemi@adediranadeyemi.com +234 816 273 5399
Machine Learning · NLP · Nigeria Market

Predicting Used Car Prices in the Nigerian Market

A LightGBM model trained on 9,835 Autochek Nigeria listings that uses sentence embeddings to capture brand and model semantics — achieving 83% R² and 15.56% MAPE. Buyers and sellers get instant, data-backed price estimates in Naira.

Stack
Python · LightGBM · all-mpnet-base-v2 · Flask
Data
9,835 listings scraped from Autochek Nigeria
Type
Regression ML · Semantic Embeddings · Web Scraping
Nigerian used car price prediction ML model by Adediran Adeyemi
0.83 R² — 83% of price variance explained
15.6% MAPE — typically within ±16% of actual price
9,835 Car listings scraped from Autochek Nigeria
768D Sentence embeddings → 10 selected features

Project Overview

Buying or selling a used car in Nigeria is an exercise in information asymmetry. Without reliable price data, buyers overpay and sellers undervalue — and neither party has a data-backed anchor for negotiation. This model changes that.

Built from 9,835 listings scraped directly from Autochek Nigeria (8,022 after cleaning and the ₦40 million price cap), the LightGBM model predicts used car prices in Nigerian Naira with an R² of 0.83 and a MAPE of 15.56%, meaning predictions miss the actual market price by about 16% on average. The key technical innovation is using sentence embeddings from all-mpnet-base-v2 to encode car names rather than manually extracting brand, model, and year as separate features.

Business impact: Given the average car price in the dataset of ₦18.1 million, a 15.56% MAPE translates to a typical prediction error of ~₦2.8 million — a reasonable margin for a market where negotiation ranges are often ₦2–5 million.
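
The arithmetic behind that figure can be checked directly. A minimal sketch using only the mean price and MAPE reported on this page; the variable names are illustrative, and applying MAPE to the mean price is a back-of-envelope shortcut rather than a per-car error bound:

```python
# Figures reported in this write-up.
MEAN_PRICE_NGN = 18_140_400   # dataset mean price in Naira
MAPE = 0.1556                 # mean absolute percentage error (15.56%)

# Rough "typical" error in Naira: MAPE applied to the mean price.
typical_error_ngn = MEAN_PRICE_NGN * MAPE

print(f"Typical error: ~NGN {typical_error_ngn / 1e6:.1f} million")
```

For a cheaper car the same percentage is a smaller Naira amount: 15.56% of the ₦2,315,000 minimum price is roughly ₦360,000.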

Why the Nigerian Used Car Market?

Nigeria's automotive market has structural characteristics that make second-hand cars the dominant purchase category — creating a large, underserved market for data-backed pricing tools:

  • 💰 High import duties make new vehicles unaffordable for the majority of buyers
  • 🔧 Lower maintenance costs on familiar used models create strong value preference
  • 📊 No standardized pricing data: buyers and sellers negotiate without anchors

Global tools like Kelley Blue Book or CarGurus don't cover the Nigerian market. Prices vary significantly by origin (local vs. foreign-used), mileage is reported inconsistently, and brand perception in Nigeria differs from Western markets. All of this calls for a locally trained model rather than an adaptation of international tools.

Dataset

9,835 listings scraped from Autochek Nigeria (autochek.africa/ng) — one of Nigeria's largest used car marketplaces — with prices capped at ₦40 million to focus on the mainstream second-hand market segment:

Statistic               Price (₦)
Mean                    18,140,400
Median                  16,500,000
Std Deviation           8,460,545
Min (after cleaning)    2,315,000
Max (cap)               40,000,000
Final sample size       8,022 cars

Raw features: Car name (brand + model + year), price, origin (local/foreign), mileage, engine type, transmission, fuel type, interior color, exterior color.

Price cap rationale: Capping at ₦40 million aligns the model with the target demographic — budget-conscious buyers in the second-hand market. Including ultra-luxury vehicles (>₦40M) would introduce a price segment with entirely different dynamics and degrade performance for the 95% of listings below the cap.
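
As a rough illustration of the cap, the filter below keeps only listings priced at or below ₦40 million. This is a sketch under assumptions: the raw price format ("₦9,500,000"), the field names, and the sample rows are invented for the example and are not taken from the project's scraper.

```python
PRICE_CAP_NGN = 40_000_000  # cap stated in this write-up

def parse_price(text):
    """Turn a scraped price string such as '₦18,500,000' into an int.

    The string format is an assumption about the raw scrape.
    Returns None when the string contains no digits.
    """
    digits = "".join(ch for ch in text if ch in "0123456789")
    return int(digits) if digits else None

def apply_price_cap(listings, cap=PRICE_CAP_NGN):
    """Keep listings with a parseable price at or below the cap."""
    kept = []
    for row in listings:
        price = parse_price(row["price"])
        if price is not None and price <= cap:
            kept.append({**row, "price_ngn": price})
    return kept

# Made-up rows, for illustration only.
sample = [
    {"name": "Toyota Camry 2015", "price": "₦9,500,000"},
    {"name": "Lexus LX 570 2021", "price": "₦95,000,000"},  # above the cap
    {"name": "Honda Accord 2012", "price": ""},             # missing price
]
capped = apply_price_cap(sample)
print(len(capped), capped[0]["price_ngn"])
```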

The Semantic Embedding Innovation

The core technical decision that differentiates this model is how car names are encoded. The conventional approach — manually extracting brand, model, and year as separate categorical features — loses important contextual relationships. Sentence embeddings preserve them:

❌ Conventional Approach

Manual Feature Extraction

  • Extract brand as separate column
  • Extract model as separate column
  • Extract year as numeric feature
  • One-hot encode brand/model (high cardinality)
  • Loses semantic relationships between models
  • Fails completely on unseen car names

✓ This Project's Approach

Sentence Embeddings (all-mpnet-base-v2)

  • Encode full car name as 768-dim vector
  • Captures "Toyota Camry 2015 vs 2016" nuance
  • Preserves brand prestige relationships
  • Generalizes to unseen car models
  • Stepwise selection reduces 768D → 10 features
  • Maintains contextual information throughout
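
To make the contrast concrete, here is a minimal sketch of the conventional route: split the name into brand, model, and year, then one-hot encode the brand. The parsing rule (first token is the brand, a trailing four-digit token is the year) and the three-brand vocabulary are assumptions chosen for illustration, not the project's code.

```python
def split_car_name(name):
    """Naive split of 'Brand Model Year' into its three parts."""
    tokens = name.split()
    year = None
    if tokens and tokens[-1].isdigit() and len(tokens[-1]) == 4:
        year = int(tokens.pop())
    brand = tokens[0] if tokens else ""
    model = " ".join(tokens[1:])
    return brand, model, year

def one_hot(value, vocabulary):
    """1 in the slot of the matching known value, 0 everywhere else."""
    return [1 if value == known else 0 for known in vocabulary]

# Hypothetical vocabulary built from the training data.
brands_seen_in_training = ["Toyota", "Honda", "Lexus"]

brand, model, year = split_car_name("Toyota Camry 2015")
print(brand, model, year, one_hot(brand, brands_seen_in_training))

# A brand missing from the vocabulary encodes as all zeros,
# so the model receives no brand signal for it.
unseen_brand, _, _ = split_car_name("Changan CS55 2022")
print(one_hot(unseen_brand, brands_seen_in_training))
```

The same rule also mis-parses multi-word brands such as "Land Rover", which is part of why hand-written extraction is fragile.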

The all-mpnet-base-v2 model from Sentence Transformers generates a 768-dimensional embedding for each car name. Stepwise feature selection then identifies the 10 most informative dimensions — retaining the semantic signal while eliminating noise. This approach is particularly valuable for the Nigerian market where the same model name can represent very different value propositions across years (e.g., a Toyota Camry 2010 vs. 2020 in a market with specific import patterns).
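
The encoding step might look roughly like the sketch below. It assumes the `sentence-transformers` package is installed, and it assumes columns are named `name_emb_<index>` with a zero-based index (the naming matches the feature names on this page, but the indexing convention is a guess). The helper names are hypothetical.

```python
EMBEDDING_DIM = 768  # output size of all-mpnet-base-v2

def embedding_column_names(dim=EMBEDDING_DIM, prefix="name_emb_"):
    """Column names for each embedding dimension (zero-based, assumed)."""
    return [f"{prefix}{i}" for i in range(dim)]

def encode_car_names(names):
    """Return one 768-dim vector per car name.

    Needs `pip install sentence-transformers`; the model weights are
    downloaded on first use. Imported lazily so the helpers around it
    work without the dependency.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-mpnet-base-v2")
    return model.encode(list(names))  # array of shape (len(names), 768)

def select_columns(vector, column_names, keep):
    """Pick the named dimensions out of one embedding vector."""
    by_name = dict(zip(column_names, vector))
    return {name: float(by_name[name]) for name in keep}

columns = embedding_column_names()
print(len(columns), columns[307])
```

With the package installed, `encode_car_names(["Toyota Camry 2010", "Toyota Camry 2020"])` returns two vectors, and `select_columns(vectors[0], columns, ["name_emb_307"])` would pull out one of the dimensions named on this page.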

Feature Engineering

The final model uses 12 features selected from hundreds of candidates: dimensions of the car name embedding plus binary indicators for origin, interior color, and exterior color.

Semantic Embedding Dimensions (10)

Selected dimensions of the 768-dim car name embedding, including name_emb_307, name_emb_741, name_emb_559, name_emb_618, name_emb_207, name_emb_661, name_emb_766, name_emb_541, and name_emb_518, each capturing a different semantic aspect of brand/model identity.

Origin (Local vs Foreign)

origin_local — whether the car is locally sourced or foreign-used. A strong predictor due to Nigeria's import duty structure: foreign-used (Tokunbo) vehicles command different pricing than locally-registered equivalents.

Interior Color: Coffee Brown

interior_color_coffee_brown — showed statistical significance in price predictions. Specific interior color preferences vary meaningfully in the Nigerian market, influencing resale value beyond what intuition suggests.

Exterior Color: Dark Silver

exterior_color_dark_silver — dark silver exterior also showed statistically significant price correlation. Color preferences in used car markets are market-specific and this feature captures a Nigerian buyer preference signal.
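
The three indicator features above can be derived from raw listing fields with a few comparisons. A minimal sketch, assuming the raw fields are called `origin`, `interior_color`, and `exterior_color` and hold free-text values; only the output feature names come from this page.

```python
def _norm(value):
    """Lowercase and collapse whitespace so 'Coffee  Brown' matches."""
    return " ".join(str(value).lower().split())

def categorical_indicators(listing):
    """Build the three 0/1 features from one raw listing dict."""
    return {
        "origin_local": int(_norm(listing.get("origin", "")) == "local"),
        "interior_color_coffee_brown": int(
            _norm(listing.get("interior_color", "")) == "coffee brown"
        ),
        "exterior_color_dark_silver": int(
            _norm(listing.get("exterior_color", "")) == "dark silver"
        ),
    }

# Made-up listing, for illustration only.
example = {
    "origin": "Foreign",
    "interior_color": "Coffee Brown",
    "exterior_color": "Black",
}
print(categorical_indicators(example))
```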

Removed Features

Mileage and engine type were dropped during feature selection — both showed minimal influence on price predictions in the Nigerian market context. This is likely due to inconsistent mileage reporting in Nigerian listings, and buyer prioritization of brand/model prestige over technical specifications when making purchase decisions.

Model & Performance

Algorithm: LightGBM Regressor — chosen for its speed, performance with tabular data, and ability to handle the mixed feature types (embedding dimensions + categorical indicators) without preprocessing overhead.

Configuration: n_estimators=300, max_depth=6, learning_rate=0.1, objective="regression"
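
In code, that configuration maps onto LightGBM's scikit-learn style wrapper roughly as follows. The four parameter values are the ones listed above; everything else (helper names, the lazy import, leaving all other parameters at their defaults) is an assumption of this sketch.

```python
# Hyperparameters as listed in this write-up.
LGBM_PARAMS = {
    "n_estimators": 300,
    "max_depth": 6,
    "learning_rate": 0.1,
    "objective": "regression",  # squared-error (L2) loss in LightGBM
}

def build_model(params=LGBM_PARAMS):
    """Create an untrained regressor. Needs `pip install lightgbm`."""
    from lightgbm import LGBMRegressor  # imported lazily
    return LGBMRegressor(**params)

def fit_and_predict(X_train, y_train, X_new):
    """Train on the feature matrix and return price predictions."""
    model = build_model()
    model.fit(X_train, y_train)
    return model.predict(X_new)
```

Here `X_train` would hold one row per car with the 12 selected features, and `y_train` the prices in Naira.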

83% R² Score Variance in car prices explained by the model
15.6% MAPE Predictions within ±16% of actual price
₦3.5M RMSE vs. avg price of ₦18.1M — 19% relative error

Performance in context: For a market where sellers manually price cars by feel and negotiation ranges span ₦2–5 million, an RMSE of ₦3.48 million is a meaningful improvement over no baseline at all. The model gives both buyers and sellers a data-anchored starting point for any negotiation.
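
For readers who want to reproduce the three headline metrics on their own predictions, the standard definitions are short enough to write out. The toy values below are invented to exercise the functions; only the final ratio uses figures from this page.

```python
from math import sqrt

def r2_score(y_true, y_pred):
    """Share of variance explained: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.15 = 15%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the prices."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy values, for illustration only.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
print(r2_score(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))

# Relative error quoted above: reported RMSE divided by the mean price.
relative_rmse = 3_480_000 / 18_140_400
print(f"{relative_rmse:.0%}")
```

RMSE squares each error before averaging, so a few large misses pull it above the MAPE-based figure. That is why the 19% relative RMSE is higher than the 15.56% MAPE.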

Residual Diagnostics

Three standard diagnostic checks support the model's fit:

  • Residuals vs. Predicted Values: Random scatter around zero with no clear pattern, consistent with roughly constant error variance (homoscedasticity) and an appropriate model specification. No systematic under- or over-prediction across the price range.
  • Distribution of Residuals: Approximately normal and centered at zero, indicating that prediction errors are unbiased on average.
  • Q-Q Plot: Residuals closely follow the theoretical normal distribution in the central quantiles, with slight tail deviations indicating a small number of outliers, an expected pattern in vehicle pricing datasets.

What clean diagnostics tell us: The absence of systematic patterns in the residuals means the model isn't consistently wrong in any particular direction — it doesn't systematically over-price Toyotas or under-price Hondas, for example. The errors are random noise, not a structural modeling failure.
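
The numbers behind such plots can be computed without a plotting library. This sketch uses only the Python standard library; the bucketed mean residual is a crude numeric stand-in for eyeballing the residuals-vs-predicted scatter, and the toy prices are invented. None of it is taken from the project's notebook.

```python
from statistics import NormalDist, mean, pstdev

def residuals(y_true, y_pred):
    return [t - p for t, p in zip(y_true, y_pred)]

def qq_points(res):
    """Pairs of (theoretical normal quantile, standardized residual)."""
    n = len(res)
    mu, sigma = mean(res), pstdev(res)
    if sigma == 0:
        raise ValueError("residuals are constant; Q-Q plot is undefined")
    ordered = sorted((r - mu) / sigma for r in res)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def mean_residual_by_bucket(y_pred, res, n_buckets=3):
    """Average residual within buckets of increasing predicted price.

    A bucket mean far from zero hints at systematic over- or
    under-prediction in that price range.
    """
    pairs = sorted(zip(y_pred, res))
    size = max(1, len(pairs) // n_buckets)
    chunks = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    return [mean([r for _, r in chunk]) for chunk in chunks]

# Toy values (in millions of Naira), for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
y_pred = [11.0, 19.0, 31.0, 39.0, 51.0, 59.0]
res = residuals(y_true, y_pred)

print(mean_residual_by_bucket(y_pred, res))
print(qq_points(res)[:2])
```

Points from `qq_points` that fall near the line y = x correspond to the "closely follow the theoretical normal distribution" reading above; departures at either end are the tail outliers.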

Key Insights

Embeddings Beat Manual Extraction

Semantic embeddings capture brand prestige, model popularity, and year-over-year differences far better than manually splitting car names into brand/model/year columns. The 768-dimensional vector preserves relationships that categorical encoding destroys.

Mileage Is Not a Strong Predictor Here

Mileage — critical in Western car markets — has minimal predictive value in Nigeria. Inconsistent reporting in listings and a buyer culture focused on brand/model over odometer readings explain this counterintuitive finding.

Origin Drives Significant Price Differences

Local vs. foreign-used (Tokunbo) is a strong price predictor — reflecting Nigeria's import duty structure and distinct buyer preferences between domestically-registered and imported vehicles.

Specific Colors Matter in This Market

Coffee brown interior and dark silver exterior both showed statistical significance — a finding specific to Nigerian buyer preferences that wouldn't emerge from a model trained on international data.

Live Demo

The model is deployed on Hugging Face Spaces. Enter any car name, origin, transmission, fuel type, and color details to get an instant price estimate in Nigerian Naira:

Live Nigerian used car price predictor — enter car name (e.g., "Toyota Camry 2018"), origin, and specifications to get an instant ₦ price estimate.

Tech Stack

LightGBM: Price Model
all-mpnet-base-v2: Embeddings
Beautiful Soup: Web Scraping
Flask: Web App
Docker: Deployment
HuggingFace: Hosting
Python · LightGBM · Sentence Transformers · all-mpnet-base-v2 · Scikit-learn · Pandas · Beautiful Soup · Flask · Docker · HuggingFace Spaces · Web Scraping · Nigerian Market

Work with Adediran Adeyemi

Need a pricing model built for your specific market?

I build ML models that understand local market dynamics — not just global patterns. Nigeria, e-commerce, or any domain where data-backed pricing matters. First call is free.