Nigerian Used Car Price Prediction

Project Overview

Buying or selling a used car in Nigeria is an exercise in information asymmetry. Without reliable price data, buyers overpay and sellers undervalue — and neither party has a data-backed anchor for negotiation. This model changes that.

Trained on listings scraped directly from Autochek Nigeria (9,835 scraped, with a final sample of 8,022 cars after cleaning), the LightGBM model predicts used car prices in Nigerian Naira with an R² of 0.83 and a MAPE of 15.56%, meaning predictions miss the actual market price by about 16% on average. The key technical innovation is using sentence embeddings from all-mpnet-base-v2 to encode car names rather than manually extracting brand, model, and year as separate features.

Business impact: Given the average car price in the dataset of ₦18.1 million, a 15.56% MAPE translates to a typical prediction error of ~₦2.8 million — a reasonable margin for a market where negotiation ranges are often ₦2–5 million.
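The arithmetic behind that figure can be checked in a few lines. This is only a back-of-the-envelope sketch using the two numbers quoted above; the variable names are illustrative and not taken from the project code.

```python
# Figures quoted in the text above.
mean_price_ngn = 18_140_400  # dataset mean price in naira
mape = 0.1556                # mean absolute percentage error, as a fraction

# Rough naira-scale error implied by the MAPE at the mean price.
typical_error_ngn = mape * mean_price_ngn
print(f"Typical error: NGN {typical_error_ngn / 1e6:.1f} million")
```

Because MAPE is a relative measure, the same percentage implies a smaller naira error for cheaper cars and a larger one near the price cap.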

Why the Nigerian Used Car Market?

Nigeria's automotive market has structural characteristics that make second-hand cars the dominant purchase category — creating a large, underserved market for data-backed pricing tools:

High import duties make new vehicles unaffordable for the majority of buyers
Lower maintenance costs on familiar used models create strong value preference
No standardized pricing data: buyers and sellers negotiate without anchors

Global tools like Kelley Blue Book or CarGurus don't cover the Nigerian market. Prices vary significantly by origin (local vs. foreign-used), mileage is reported inconsistently, and brand perception in Nigeria differs from Western markets. All of this calls for a locally trained model rather than an adaptation of international tools.

Dataset

9,835 listings were scraped from Autochek Nigeria (autochek.africa/ng), one of Nigeria's largest used car marketplaces. Prices are capped at ₦40 million to focus on the mainstream second-hand market segment, and the final cleaned sample contains 8,022 cars:

Statistic	Price (₦)
Mean	18,140,400
Median	16,500,000
Std Deviation	8,460,545
Min (after cleaning)	2,315,000
Max (cap)	40,000,000
Final sample size	8,022 cars

Raw features: Car name (brand + model + year), price, origin (local/foreign), mileage, engine type, transmission, fuel type, interior color, exterior color.

Price cap rationale: Capping at ₦40 million aligns the model with the target demographic — budget-conscious buyers in the second-hand market. Including ultra-luxury vehicles (>₦40M) would introduce a price segment with entirely different dynamics and degrade performance for the 95% of listings below the cap.
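A minimal sketch of how such a cap could be applied with pandas is shown here. The column names (`name`, `price`), the helper `apply_price_cap`, and the toy rows are assumptions for illustration; the source does not show the actual cleaning code.

```python
import pandas as pd

PRICE_CAP_NGN = 40_000_000  # cap stated in the text

def apply_price_cap(listings: pd.DataFrame, price_col: str = "price") -> pd.DataFrame:
    """Keep listings that have a valid price at or below the cap."""
    prices = pd.to_numeric(listings[price_col], errors="coerce")
    mask = prices.notna() & (prices <= PRICE_CAP_NGN)
    return listings.loc[mask].reset_index(drop=True)

# Toy rows with made-up prices, only to show the filter in action.
demo = pd.DataFrame({
    "name": ["Toyota Camry 2015", "Lexus LX 570 2020", "Honda Accord 2012"],
    "price": [9_500_000, 85_000_000, 6_200_000],
})
capped = apply_price_cap(demo)
print(capped)
```

Rows with missing or non-numeric prices are dropped along with those above the cap, which is one plausible reading of "after cleaning" in the table above.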

The Semantic Embedding Innovation

The core technical decision that differentiates this model is how car names are encoded. The conventional approach — manually extracting brand, model, and year as separate categorical features — loses important contextual relationships. Sentence embeddings preserve them:

❌ Conventional Approach

Manual Feature Extraction

Extract brand as separate column
Extract model as separate column
Extract year as numeric feature
One-hot encode brand/model (high cardinality)
Loses semantic relationships between models
Fails completely on unseen car names

✓ This Project's Approach

Sentence Embeddings (all-mpnet-base-v2)

Encode full car name as 768-dim vector
Captures "Toyota Camry 2015 vs 2016" nuance
Preserves brand prestige relationships
Generalizes to unseen car models
Stepwise selection reduces 768D → 10 features
Maintains contextual information throughout

The all-mpnet-base-v2 model from Sentence Transformers generates a 768-dimensional embedding for each car name. Stepwise feature selection then identifies the 10 most informative dimensions — retaining the semantic signal while eliminating noise. This approach is particularly valuable for the Nigerian market where the same model name can represent very different value propositions across years (e.g., a Toyota Camry 2010 vs. 2020 in a market with specific import patterns).
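The encoding step described above might look roughly like the sketch below, using the Sentence Transformers library. The helper names and the `name_emb_<i>` column naming (zero-based, inferred from the feature names listed later) are assumptions; the model is loaded lazily so the helper can also be exercised with a stand-in encoder.

```python
from typing import Callable, Optional, Sequence

import numpy as np

MODEL_NAME = "all-mpnet-base-v2"  # model named in the text

def embed_car_names(names: Sequence[str],
                    encode: Optional[Callable] = None) -> np.ndarray:
    """Return one embedding row per car name.

    If no encoder is passed, load the Sentence Transformers model lazily.
    """
    if encode is None:
        from sentence_transformers import SentenceTransformer
        encode = SentenceTransformer(MODEL_NAME).encode
    return np.asarray(encode(list(names)), dtype=np.float32)

def embedding_columns(dim: int) -> list:
    """Column names for the embedding dimensions (naming is an assumption)."""
    return [f"name_emb_{i}" for i in range(dim)]

# Typical use (downloads the model on first call):
#   vectors = embed_car_names(["Toyota Camry 2010", "Toyota Camry 2020"])
#   vectors.shape would be (2, 768) for all-mpnet-base-v2
```

Encoding the full string keeps brand, model, and year in one vector, so no parsing rules are needed for unusual or previously unseen names.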

Feature Engineering

The final model uses 12 features selected from hundreds of candidates — 10 embedding dimensions plus 2 categorical variables:

Semantic Embedding Dimensions (10)

Dimensions selected from the 768-dim car name embedding, including name_emb_307, name_emb_741, name_emb_559, name_emb_618, name_emb_207, name_emb_661, name_emb_766, name_emb_541, and name_emb_518, each capturing a different aspect of brand/model identity.
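The source does not say which stepwise procedure or selection criterion was used. As one possible illustration, the sketch below does greedy forward selection, scoring each candidate column by the R² of an ordinary least-squares fit. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R² of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def forward_select(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Greedily add the column that raises R² the most, until k are chosen."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda j: r_squared(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic check: only columns 4 and 17 actually drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 30))
y_demo = 3.0 * X_demo[:, 4] - 2.0 * X_demo[:, 17] + rng.normal(scale=0.1, size=500)
chosen = forward_select(X_demo, y_demo, k=2)
print(chosen)
```

In practice the scoring would be done on held-out data or with an information criterion, otherwise adding columns never lowers the score.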

Origin (Local vs Foreign)

origin_local — whether the car is locally sourced or foreign-used. A strong predictor due to Nigeria's import duty structure: foreign-used (Tokunbo) vehicles command different pricing than locally-registered equivalents.

Interior Color: Coffee Brown

interior_color_coffee_brown — showed statistical significance in price predictions. Specific interior color preferences vary meaningfully in the Nigerian market, influencing resale value beyond what intuition suggests.

Exterior Color: Dark Silver

exterior_color_dark_silver — dark silver exterior also showed statistically significant price correlation. Color preferences in used car markets are market-specific and this feature captures a Nigerian buyer preference signal.
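The three indicator features named above could be derived from the raw categorical columns along these lines. The raw column names (`origin`, `interior_color`, `exterior_color`) and the value spellings are assumptions; only the three output feature names come from the text.

```python
import pandas as pd

CATEGORICAL_COLUMNS = ["origin", "interior_color", "exterior_color"]
SELECTED_INDICATORS = [
    "origin_local",
    "interior_color_coffee_brown",
    "exterior_color_dark_silver",
]

def indicator_features(listings: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode the categoricals and keep only the selected indicators."""
    # Normalise values so "Coffee Brown" becomes "coffee_brown".
    cats = listings[CATEGORICAL_COLUMNS].apply(
        lambda col: col.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    dummies = pd.get_dummies(cats, columns=CATEGORICAL_COLUMNS, dtype=int)
    # reindex guarantees all three columns exist even if a value never occurs.
    return dummies.reindex(columns=SELECTED_INDICATORS, fill_value=0)

demo = pd.DataFrame({
    "origin": ["Local", "Foreign"],
    "interior_color": ["Coffee Brown", "Black"],
    "exterior_color": ["White", "Dark Silver"],
})
features = indicator_features(demo)
print(features)
```

Reindexing to a fixed column list keeps the feature layout stable at prediction time, when a single input row contains only one value per category.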

Removed Features

Mileage and engine type were dropped during feature selection — both showed minimal influence on price predictions in the Nigerian market context. This is likely due to inconsistent mileage reporting in Nigerian listings, and buyer prioritization of brand/model prestige over technical specifications when making purchase decisions.

Model & Performance

Algorithm: LightGBM Regressor — chosen for its speed, performance with tabular data, and ability to handle the mixed feature types (embedding dimensions + categorical indicators) without preprocessing overhead.

Configuration: n_estimators=300, max_depth=6, learning_rate=0.1, objective="regression"
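Expressed with LightGBM's scikit-learn interface, that configuration would look roughly as follows. The `random_state` argument and the `build_model` helper are additions for reproducibility and are not stated in the source; all other LightGBM parameters are left at their library defaults.

```python
# Hyperparameters exactly as listed in the text.
LGBM_PARAMS = {
    "n_estimators": 300,
    "max_depth": 6,
    "learning_rate": 0.1,
    "objective": "regression",
}

def build_model(random_state: int = 42):
    """Create an unfitted regressor with the stated configuration."""
    # Imported lazily so the parameters can be inspected without LightGBM installed.
    from lightgbm import LGBMRegressor
    return LGBMRegressor(**LGBM_PARAMS, random_state=random_state)

# Typical use, given a feature matrix X_train and prices y_train:
#   model = build_model()
#   model.fit(X_train, y_train)
#   predicted_prices = model.predict(X_test)
```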

R² Score: 0.83 (share of variance in car prices explained by the model)
MAPE: 15.6% (predictions within about ±16% of actual price on average)
RMSE: ₦3.5M (vs. an average price of ₦18.1M, a 19% relative error)

Performance in context: For a market where sellers manually price cars by feel and negotiation ranges span ₦2–5 million, an RMSE of ₦3.48 million is a meaningful improvement over no baseline at all. The model gives both buyers and sellers a data-anchored starting point for any negotiation.
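For reference, the three reported metrics can be computed from actual and predicted prices as in the sketch below. It uses the standard definitions with NumPy; the function name and the toy numbers in the comment are illustrative, and the last line reproduces the relative error from the two figures quoted above.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """R², MAPE (as a fraction), and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mape = float(np.mean(np.abs(err) / np.abs(y_true)))
    r2 = float(1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    return {"r2": r2, "mape": mape, "rmse": rmse}

# Example call on toy values: regression_metrics([10, 20, 30], [12, 18, 30])

# Relative RMSE from the figures in the text: RMSE divided by mean price.
relative_rmse = 3_480_000 / 18_140_400
print(f"Relative RMSE: {relative_rmse:.0%}")
```

RMSE penalises large misses more heavily than MAPE does, which is why the relative RMSE (about 19%) sits above the MAPE (about 16%).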

Residual Diagnostics

Three standard diagnostic checks confirm the model's statistical validity:

Residuals vs. Predicted Values: Random scatter around zero with no clear pattern — confirming homoscedasticity (constant variance) and appropriate model specification. No systematic under- or over-prediction across the price range.
Distribution of Residuals: Approximately normal distribution centered at zero, indicating that prediction errors are unbiased on average.
Q-Q Plot: Residuals closely follow the theoretical normal distribution in the central quantiles, with slight tail deviations indicating a small number of outliers, an expected pattern in vehicle pricing datasets.

What clean diagnostics tell us: The absence of systematic patterns in the residuals means the model isn't consistently wrong in any particular direction — it doesn't systematically over-price Toyotas or under-price Hondas, for example. The errors are random noise, not a structural modeling failure.
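The three checks above are visual, but rough numeric counterparts can be computed as in the sketch below. The helper names, the choice of summary statistics, and the synthetic data are assumptions for illustration and are not the project's diagnostic code.

```python
from statistics import NormalDist

import numpy as np

def residual_summary(y_true, y_pred) -> dict:
    """Mean residual, plus correlation between |residual| and prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    # A value near zero suggests the error spread does not grow with price.
    spread_corr = float(np.corrcoef(y_pred, np.abs(resid))[0, 1])
    return {"mean_residual": float(resid.mean()), "spread_corr": spread_corr}

def qq_points(residuals):
    """Theoretical normal quantiles paired with sorted standardised residuals."""
    r = np.sort(np.asarray(residuals, dtype=float))
    z = (r - r.mean()) / r.std(ddof=1)
    probs = (np.arange(1, len(r) + 1) - 0.5) / len(r)
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, z

# Synthetic predictions with constant-variance normal noise.
rng = np.random.default_rng(0)
pred = rng.uniform(2e6, 40e6, size=1000)
actual = pred + rng.normal(0, 3.5e6, size=1000)
summary = residual_summary(actual, pred)
theo, obs = qq_points(actual - pred)
print(summary)
```

Plotting `obs` against `theo` gives the Q-Q plot; points that leave the diagonal at the ends correspond to the tail deviations described above.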

Key Insights

Embeddings Beat Manual Extraction

Semantic embeddings capture brand prestige, model popularity, and year-over-year differences far better than manually splitting car names into brand/model/year columns. The 768-dimensional vector preserves relationships that categorical encoding destroys.

Mileage Is Not a Strong Predictor Here

Mileage — critical in Western car markets — has minimal predictive value in Nigeria. Inconsistent reporting in listings and a buyer culture focused on brand/model over odometer readings explain this counterintuitive finding.

Origin Drives Significant Price Differences

Local vs. foreign-used (Tokunbo) is a strong price predictor — reflecting Nigeria's import duty structure and distinct buyer preferences between domestically-registered and imported vehicles.

Specific Colors Matter in This Market

Coffee brown interior and dark silver exterior both showed statistical significance — a finding specific to Nigerian buyer preferences that wouldn't emerge from a model trained on international data.

Live Demo

The model is deployed on Hugging Face Spaces. Enter a car name (e.g., "Toyota Camry 2018"), origin, transmission, fuel type, and color details to get an instant price estimate in Nigerian Naira.

Tech Stack

LightGBM: Price Model
all-mpnet-base-v2: Embeddings
Beautiful Soup: Web Scraping
Flask: Web App
Docker: Deployment
Hugging Face: Hosting

Tags: Python, LightGBM, Sentence Transformers, all-mpnet-base-v2, Scikit-learn, Pandas, Beautiful Soup, Flask, Docker, Hugging Face Spaces, Web Scraping, Nigerian Market

Related Projects

If this case study is relevant to your business challenge, these projects may also interest you:

NYC Property Price Prediction

Lead Scoring & Conversion Prediction

Retail Location Strategy Analysis

Customer Reviews NLP Classification
