[Aiffel X SOCAR] VIAI — Vehicle Inspection A.I. Project Summary

Jieun Jeon
15 min read · Jan 21, 2022

Link to the Korean version

From November 10, 2021 to December 15, 2021, I worked on VIAI (Vehicle Inspection A.I.) project with a team I met in the Modulabs Aiffel AI program.

I want to share how our team worked and how the results came out. We implemented the Semantic Segmentation Model with the Human-In-The-Loop approach and deployed the service on GCP + GKE + TorchServe + KubeFlow.

The project was a corporate cooperation task from SOCAR (a car-sharing company in South Korea), and the subject was Vehicle Damage Detection.

Team

Team VIAI

Before the hackathon, we had already gotten to know each other’s study and coding styles, and the areas each of us wanted to focus on, through the Aiffel AI curriculum. Based on this, I could think about how to form a team that would produce the best results for the project.

The team name VIAI stands for Vehicle Inspection A.I., reflecting our goal of serving an A.I. product that automatically detects vehicle damage.

A little background on the SOCAR app:

To use a SOCAR vehicle, you must check its exterior before driving. If you find any damage, you can take pictures and upload them to the SOCAR mobile app. Otherwise, you may be held responsible for a previous accident.

Project Overview

The problem definition of Vehicle Damage Detection, SOCAR’s second corporate cooperation task, and the goals the VIAI team set are as follows:

Problem Definition

🛠 The problems with SOCAR’s current vehicle damage inspection process are as follows.

  1. The manpower burden of inspecting each vehicle’s damage status (damaged area, type of damage): on average, humans inspect 70,000 to 80,000 photos a day.
  2. Difficulty tracking the exact time damage occurred to a vehicle (again, the manpower burden of tracking all photos taken by users).
  3. The need to improve the performance of the vehicle damage detection model.

Goal of VIAI

✔ The goals the VIAI team set to solve the above problems are as follows.

  1. Automate detection and monitoring of damaged areas and damage types in vehicle images.
  2. Provide a web annotation tool with an admin dashboard so that annotators can create masking images through the dashboard UI.
  3. Implement data-centric AI through a Human-In-The-Loop cycle.

Tech Stack

Model

  • PyTorch
  • mmSegmentation
  • mmDetection

DataSet

  • Transfer Learning — ImageNet, Stanford Cars Dataset

Data Pipeline

  • Kubernetes
  • kubeflow

Serving

  • TorchServe
  • Google Cloud Platform
  • GKE (Google Kubernetes Engine)
  • Cloud Function
  • Cloud Storage
  • AI Platform

Client

  • Flask
  • PostgreSQL
  • VIA

Dataset

The dataset is a total of 7,000 vehicle images provided by SOCAR, organized in a train/, test/, valid/ folder structure for each damage type: Dent, Scratch, and Spacing.

Project Milestone

Project Preparation period

October to early November

All of the team members needed to study Semantic Segmentation, the subject of the project. Since we lacked knowledge of model design as well as the GCP cloud environment, Kubernetes, and Kubeflow, I started studying models, GCP, Docker, and Kubernetes in advance, from mid-to-late September when the team was formed.

SOCAR’s data was provided around mid-November, so before that, we started developing models using other vehicle image datasets from Kaggle.

Mid-November

Based on that study, we built a baseline model and experimented with various techniques and models to improve performance. We first went through the given vehicle photos and the masks of the accident areas, and organized techniques for preprocessing the data. Because we ran many model experiments, frameworks such as mmDetection and mmSegmentation let us quickly check the results of various models and build a final model based on them.

At the same time, we devised a minimal architecture for model serving. For the Project Interim Report, we used GCP’s cloud-native solutions with Flask to complete a UI: a dashboard showing the results of vehicle image masking by accident type.

In this period, the goals we set for the actual hackathon were as follows.

  1. Build an end-to-end machine learning pipeline.
  2. Apply more model research and ideas.
  3. Refine the dataset and apply augmentation.

Hackathon period: November to December

1. Problem definition and implementation

As mentioned in the project problem definition above, one of the VIAI team’s goals was to implement data-centric AI. We chose to run the Human-In-The-Loop cycle repeatedly to improve model performance by improving data quality.

Because the amount of data provided was small and its quality was not refined, poor masking results from the model could be corrected by people: by building a pipeline that lets a human re-mask the accident area, we could improve both the quality of the data and the performance of the model.

To implement this, we made an MVP version of the app using Cloud Function, Compute Engine, Cloud Storage, etc. for the Project Interim Report (PIR).

For the finalized data pipeline, we included TorchServe serving and Kubeflow pipelines as follows.

2. Efforts to improve datasets and models

Dataset Experiments

1) Masked Only vs All Data

First, checking the distribution of the dataset, we saw that the proportion of masked images varies greatly by damage type. Accordingly, we experimented with (1) training only on vehicle images with masks and (2) training on all vehicle images, with or without masks.

As a result, when training only on masked vehicle images, the model produced errors in the following situations.

A) Light reflected on the car’s surface was classified as a Dent.

B) Shadows were classified as Dents.

C) In some cases, Scratches were classified as Dents.

From this, we concluded that the model should be trained on all the data, not only on the masked data.
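The all-images setting can be sketched like this (my reconstruction for illustration, not the team’s actual code): images without an annotated mask are paired with an all-zero mask, so the model also sees negative examples such as reflections, shadows, and clean paint.

```python
import numpy as np

def build_samples(image_names, masks, shape=(4, 4)):
    """Pair every image with a mask. Images listed in `masks` keep their
    annotated mask; undamaged images get an all-zero mask so the model
    also learns negatives (reflections, shadows, clean paint)."""
    zero = np.zeros(shape, dtype=np.uint8)
    return [(name, masks.get(name, zero)) for name in image_names]
```

With this, the training loop can iterate over the full 7,000-image set without special-casing undamaged vehicles.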

2) Data Augmentation

We checked the parameter importance with respect to val_iou to decide which augmentation techniques to use.

The figures were confirmed through experiments tracked with WandB, which we also used to monitor every model experiment.
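For illustration, here is a minimal joint-augmentation sketch (the actual techniques and parameters came from the val_iou importance analysis; everything below is a placeholder): geometric transforms must be applied to the image and mask together so the damage annotation stays aligned, while photometric jitter touches only the image.

```python
import numpy as np

def augment(image, mask, rng):
    """Apply identical random flips to image and mask, plus a brightness
    jitter on the image only (placeholder parameters)."""
    if rng.random() < 0.5:                      # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                      # vertical flip
        image, mask = image[::-1], mask[::-1]
    # brightness jitter: scale pixel values, keep uint8 range
    image = np.clip(image.astype(np.float32) * rng.uniform(0.8, 1.2),
                    0, 255).astype(np.uint8)
    return image, mask
```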

3) Attempt to improve the performance of the model by increasing the amount of data.

Eventually, we tried to increase the amount of data, an effective way to improve model performance quickly. We tried two main methods: the first was to create images with code that composites damaged patches onto normal car images, and the second was to collect vehicle photos via Google search and create masks for them one by one.

a) Attempt to create a damaged vehicle image by combining the damaged image with a normal car image

First, each part of the vehicle was extracted and a bounding box was calculated. The algorithm then calculated and marked the center point of each part, cropped the accident image, and composited it at that center.
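That compositing step can be sketched roughly as follows (my reconstruction; it assumes the patch fits inside the image and omits boundary clipping and any blending):

```python
import numpy as np

def paste_at_center(car, patch, part_bbox):
    """Paste a cropped damage patch onto the car image, centered in the
    bounding box (x0, y0, x1, y1) of a detected part."""
    out = car.copy()
    x0, y0, x1, y1 = part_bbox
    cy, cx = (y0 + y1) // 2, (x0 + x1) // 2   # center of the part
    h, w = patch.shape[:2]
    top, left = cy - h // 2, cx - w // 2      # top-left so the patch is centered
    out[top:top + h, left:left + w] = patch
    return out
```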

However, even to the human eye, there was a clear difference between the accident image and the composited image (brightness, vehicle background, pixel differences from different cameras, and so on).

Failure

It resulted in worse model results, and in the end we decided to focus on data quality rather than quantity.

b) Create masking using the CVAT image masking tool

Secondly, we used the CVAT tool to create masks ourselves and included the added images in the train set.

After adding the data, we confirmed that the model’s performance improved. Thanks to Taewon for masking more than 200 vehicle images. 👍🏻

Model experiment

Baseline model experiment

For the early baselines, we experimented with U-Net, U-Net++ (Nested U-Net), and DeepLabv3.

The test IoU results of the baseline model experiments are as follows.

To improve performance further, we used model package tools such as mmDetection and mmSegmentation to experiment quickly with various models. We also wrote utility code that converts polygon data into mask images and, conversely, mask images into polygon data.
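As an illustration of the polygon-to-mask direction, here is a dependency-free even-odd scanline fill (a stand-in written for this article, not our actual utility code):

```python
import numpy as np

def polygon_to_mask(points, height, width):
    """Rasterize a polygon [(x, y), ...] into a binary mask using an
    even-odd scanline fill. Horizontal edges are skipped by the
    half-open crossing test."""
    mask = np.zeros((height, width), dtype=np.uint8)
    n = len(points)
    for y in range(height):
        xs = []
        for i in range(n):
            (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
            # does this edge cross the horizontal line through row y?
            if (y0 <= y < y1) or (y1 <= y < y0):
                xs.append(x0 + (y - y0) * (x1 - x0) / (y1 - y0))
        xs.sort()
        # fill between alternating crossing pairs
        for left, right in zip(xs[::2], xs[1::2]):
            mask[y, int(np.ceil(left)):int(np.floor(right)) + 1] = 1
    return mask
```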

Among the packaged models we trained, Mask R-CNN and DeepLabv3+ showed good performance.

And through several experiments, we settled on U-Net with an EfficientNet encoder as our model.

Transfer Learning

To upgrade the model further, we ran more experiments applying transfer learning. Transfer learning is a technique that applies a model trained on one domain to another domain or task. We experimented with taking the weights of previously trained models and applying them to the image dataset we wanted to segment.

We tested three datasets for transfer learning: ImageNet, the Stanford Cars Dataset, and a SOCAR dataset. (The third, the SOCAR dataset, reused the vehicle images from SOCAR’s earlier vehicle part classification project.)

As a result of the transfer learning experiments, pretraining on the SOCAR dataset worked best for Scratch, and the Stanford Cars dataset worked best for Dent and Spacing.

This is the final model performance result.

3. Data Pipeline

The service we built can be divided into two web applications.

  1. A user client where a user uploads a photo.
  2. An Admin Dashboard that gives SOCAR administrators tools to detect damage on each vehicle, track accidents, inspect inference results, and directly annotate mask images.

The data flow designed to make the above two apps is as follows.

Cloud Function, Cloud Storage, AI Platform

The flow when a user uploads a picture is as follows:

  1. The user uploads vehicle exterior photos to the Cloud Storage bucket images-original using the web client.
  2. The upload triggers the Cloud Function run-upload via a Cloud Storage trigger.
  3. Based on the inference results of run-upload, the Cloud Storage bucket images-inferred, Cloud SQL viai-images, and the Kubeflow pipelines come into play.

The role of the Cloud Function run-upload is as follows:

  1. Create a new record and update the data in the DB.
  2. Take the uploaded image and hit the inference API, AI Platform Prediction (TorchServe), for dent, scratch, and spacing.
  3. If something goes wrong or an error is returned during inference (error scenario: the ckpt file is not found), move the image in <original-images> from originals/ to the issues/ folder.
  4. Take the results (mask images in base64 format), store them in Cloud Storage <images-inferred>, and update the data in the DB.
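The error-handling branch of run-upload boils down to a routing decision like the following (a hypothetical simplification; the actual GCS copy/delete calls are omitted):

```python
def route_inferred_image(blob_path, inference_ok):
    """Decide the destination of an uploaded image after inference:
    successes stay under originals/, failures (e.g. a missing ckpt file)
    move to issues/ for manual review."""
    name = blob_path.rsplit("/", 1)[-1]
    return f"originals/{name}" if inference_ok else f"issues/{name}"
```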

Database — PostgreSQL X Cloud SQL

The database was deployed to Cloud SQL using PostgreSQL.

I thought it would be better to combine vehicle usage history, user information, and vehicle information and build it as a Relational Database.

Also, the reason we chose PostgreSQL over MySQL was that PostgreSQL supports the data types we needed (boolean, arrays).

And most importantly, we expect high volumes of both reads and writes, so PostgreSQL is more suitable than MySQL, which is better suited to read-heavy workloads.

Model Serving For Inference — TorchServe X AI Platform

Model serving could also have been deployed on Kubeflow, but because the GCP credits we received were running out, we chose AI Platform Prediction with a dockerized TorchServe model.

In addition, we trained a separate model for each type of damage and needed a separate inference endpoint for each, so TorchServe’s model serving fit well. TorchServe is an open-source model server for PyTorch, jointly developed by AWS and Facebook. We chose it because it has the built-in prediction API and multi-model serving we needed, as well as model version management for A/B testing and RESTful endpoints for monitoring metrics.

Model Retrain Pipeline — Kubernetes, Kubeflow Pipeline

The retraining pipeline was implemented with Kubernetes, Kubeflow Fairing, and Kubeflow Pipelines.

The accumulation of images directly inspected (re-masked) by an inspector was set as the trigger for model retraining. In addition, we built a module that calculates a confidence score for the model’s inference results and added it to the baseline retraining pipeline as an extra check.

The original goal was to apply a classifier model that determines whether a model-inferred image actually contains damage, but the classifier’s performance was poor, so we temporarily used a calculated confidence score: the average of the probability values over the area with a pixel value of 255 (i.e., the area classified as damage).

*This is a temporary method and not really suitable, because it computes the confidence of each pixel rather than the confidence of the image itself.
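The temporary score takes only a few lines (a sketch matching the description above, where probs is the model’s per-pixel damage probability map and mask its binarized output):

```python
import numpy as np

def confidence_score(mask, probs):
    """Mean predicted probability over pixels classified as damage
    (mask == 255); returns 0.0 if no pixel was classified as damage."""
    damaged = probs[mask == 255]
    return float(damaged.mean()) if damaged.size else 0.0
```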

The flow of the pipeline was set as follows.

  1. Count the number of newly added images in the images-annotated bucket.
  • If it’s less than a certain number, stop the pipeline.
  • If it’s more than a certain number, continue.

2. Retrain with the newly added images.

  • Checkpoint naming convention: <data_class>_<test_IoU>_<train_date>.pth
  • data_class: scratch, dent, spacing
  • test_IoU: formatted as 2.2f
  • train_date: yy_mm_dd

3. Evaluate whether the newly trained model has a higher test IoU than the currently running model.

  • The checkpoint is named with its test IoU.
  • If the new IoU is higher, execute the Cloud Function that automates deployment by swapping in the new checkpoint.
  • If it is not higher, terminate the pipeline.
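Steps 2 and 3 can be sketched as follows (my reconstruction: names follow the checkpoint convention above, and the deploy decision compares the test IoU parsed from each name):

```python
def checkpoint_name(data_class, test_iou, train_date):
    """Build <data_class>_<test_IoU>_<yy_mm_dd>.pth, IoU with 2 decimals."""
    return f"{data_class}_{test_iou:.2f}_{train_date}.pth"

def should_deploy(new_ckpt, current_ckpt):
    """Deploy only if the retrained model's test IoU beats the serving one."""
    iou = lambda name: float(name.split("_")[1])
    return iou(new_ckpt) > iou(current_ckpt)
```

Encoding the IoU in the file name makes the comparison possible without loading either checkpoint.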

The pipeline above was run as a Recurring Job, allowing us to periodically check the count of newly masked images.

We also added the WandB authentication code to the Dockerfile of every pipeline, so we could check and monitor the results of each retrained model.

Retraining Pipeline Component — GCS X Docker X Kubeflow Pipeline

The Retraining Pipeline Component was implemented as follows.

1) Create PVC and access with GCS

A) Setting GCS access rights for kubeflow pipelines

B) DSL Component Configuration

We take the model’s data type (dent, spacing, or scratch) as an argument so that training can start for that type.

import argparse
parser = argparse.ArgumentParser(description="Define metadata")
parser.add_argument('-t', '--type', required=True,
                    help="Choose Data Type - dent, scratch, spacing")
args = parser.parse_args()
data_type = args.type  # dent, scratch, or spacing

After writing the train function, we dockerized it, built the image, uploaded it to GCR, and used it from the pipeline.

C) Write the pipeline code.

import os

import kfp
from kfp import dsl
from kubernetes.client.models import V1EnvVar, V1VolumeMount, V1Volume, \
    V1SecretVolumeSource

# pipeline_gcs is the pipeline function built from the components above
if __name__ == '__main__':
    my_run = kfp.Client().create_run_from_pipeline_func(
        pipeline_gcs, arguments={}, experiment_name='Sample Experiment')

In the GKE and on-premise environments, we mounted the storage bucket with gcsfuse and accessed it that way.

Kubeflow was deployed to GKE to configure the pipelines.


Annotation Tool

By connecting VIA, an open-source image annotation tool, to the web app built with Flask, we expected SOCAR’s vehicle management administrators to be able to inspect and mask vehicles easily in one place.

After importing the original image that requires annotation from the GCS bucket images-annotated, we transferred the exported .json file to the Flask web app, converted each category_id into a mask image, and saved it.

from io import BytesIO
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("images-annotated")
blob = bucket.blob(f'data/dent/masks/{img_name}')

# serialize the mask image (seg_img, a PIL Image) to PNG bytes in memory
bs64 = BytesIO()
seg_img.save(bs64, format="png")
img_as_str = bs64.getvalue()

# upload the PNG bytes straight to the bucket
blob.upload_from_string(img_as_str, content_type='image/png')

This is the exported JSON created using the VIA annotation tool.
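Parsing that export can look like the following (assuming the common VIA 2.x JSON layout, where polygon regions carry all_points_x / all_points_y inside shape_attributes; treat the exact keys as an assumption):

```python
def polygons_from_via(via_export):
    """Extract polygon vertex lists [(x, y), ...] from a VIA-style export
    dict keyed by file; non-polygon regions are skipped."""
    polygons = []
    for file_entry in via_export.values():
        for region in file_entry.get("regions", []):
            shape = region.get("shape_attributes", {})
            if shape.get("name") == "polygon":
                polygons.append(list(zip(shape["all_points_x"],
                                         shape["all_points_y"])))
    return polygons
```

Each extracted polygon can then be rasterized into the mask image that gets uploaded to the bucket.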

4. The Result

We implemented the User Client using Flask, which allows SOCAR app users to upload images of the vehicle’s exterior before starting the rental of the vehicle.


Demo — Dashboard

*The Demo-Annotation .gif file was too large to upload in this article. You can see the full Demo.gif in the GitHub README.

The roles of the Admin Dashboard in detail are as follows.

  • We assume SOCAR’s vehicle inspection staff use it.
  • With the login function, an annotator name is specified and stored so we can track who created each annotation.
  • Users can view detailed information for each vehicle, user, and car-sharing event through the list views in the Cars, Users, and Events tabs.
  • Events are highlighted in red if damage was detected, and an event can be backtracked to find who first caused the damage.
  • Users can check the photos taken at a reservation (event) and the model’s inference results, and tools are provided to create masks immediately if the model-inferred masking is wrong.
  • When annotation is completed, the newly created mask image is stored in the bucket and used for model retraining.
  • If an image is already annotated, the date and time of inspection and the inspector’s information are displayed.

Now we’re done and can finally release this project 👏🏻

5. Future Research Plans

The project is complete, but our future research plans are as follows.

  1. Model auto-deploy pipeline (CD)
  2. Input image preprocessing (crop, classifier)
  3. Edit image re-training criteria

When the performance of the model improves through retraining and validation, the pipeline could automatically deploy the new model to the GCP AI Platform using the same Docker file.

Image preprocessing (cropping, plus a classifier model) could relieve the burden on the segmentation model.

We also plan to revisit the criteria that decide which images enter retraining.

SOCAR Feedback

We were able to hear feedback after the final presentation with the SOCAR data group.

  • At the business level, asking vehicle managers to annotate images purely to improve model performance is a very cumbersome task.
  • Possible improvements to the confidence score calculation method.
  • Advantages and disadvantages of using IoU as a metric, and possible improvements.
  • How to advance the car part clustering model.

The SOCAR data group leaders and each team leader pointed out exactly the things we had wondered about during development, and suggested appropriate future experimental and research keywords. Thank you so much for the detailed feedback from SOCAR’s Data Group: DK, KP, Yoon, and Cheese.

Conclusion

It was my first time building this kind of end-to-end ML workflow from scratch, and I’m so happy to see this project resulting in a complete ML product. Everyone in the VIAI team did a great job contributing their talents and skills to this project.

I not only gained the skills and ML knowledge used in the project, but also learned how to communicate complex technical ideas and develop systems while considering scalability and the scope of the problem.

After the project, everyone on the team began building new careers, and this VIAI project became a turning point that inspired me toward the new fields of MLOps and Data Engineering. Everything was possible because we did it together as the VIAI team. 👍🏻

Thank you for reading!

VIAI Github Repository
