Data Science & Machine Learning in Containers

Deep Dive Into Containerization for Data Science & Machine Learning

image source: neptune.ai
Containers bring real benefits to data science work:

  • keeping your environments clean (and making them easy to reset),
  • most importantly, making it easier to move things from development to production.

In this article, we will cover:

  • Version control at all stages
  • MLOps vs DevOps
  • The need for identical development and production environments
  • Essentials of containers (meaning, scope, Dockerfile, docker-compose, etc.)
  • Jupyter Notebook in containers
  • Application development with TensorFlow in containers as a microservice
  • GPU & Docker

What you need to know

In order to fully understand the implementation of machine learning projects in containers, you should:

  • Be able to program in Python,
  • Be able to build basic machine learning and deep learning models with TensorFlow or Keras,
  • Have deployed at least one machine learning model.

Machine learning iterative processes and dependency

Learning is an iterative process. When a child learns to walk, it goes through a repetitive process of walking, falling, standing, walking, and so on — until it “clicks” and it can confidently walk.

Machine learning follows the same pattern. A typical project iterates through these stages:

  • EDA (Exploratory Data Analysis)
  • Data pre-processing
  • Feature engineering
  • Model training
  • Model evaluation
  • Model tuning and debugging
  • Deployment
The iteration itself happens at three levels:

  • The Micro Level (tuning hyperparameters): once you select a model (or set of models), you begin another iterative process at the micro level, with the aim of finding the best model hyperparameters.
  • The Macro Level (solving your problem): the first model you build for a problem will rarely be the best possible, even if you tune it perfectly with cross-validation. That's because fitting model parameters and tuning hyperparameters are only two parts of the entire machine learning problem-solving workflow. At this stage, you iterate through techniques for improving the model on the problem you are solving, such as trying other models or ensembling.
  • The Meta Level (improving your data): while improving your model (or training the baseline), you may find that the data you are using is of poor quality (for example, mislabeled) or that you need more observations of a certain type (for example, images taken at night). In those situations, improving your datasets and/or getting more data becomes very important. You should always keep the dataset as relevant as possible to the problem you are solving.

Version control at all stages

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. Because of the iterative processes involved in the development of an ML-powered product, versioning has become crucial to the success of the product, and future maintenance or optimization.

MLOps vs DevOps

Before we dive into containers for machine learning with TensorFlow, let’s quickly go through the similarities and differences between MLOps and DevOps.

  1. Development: the DevOps lifecycle is largely linear, while MLOps is more experimental in nature. The team needs to be able to manipulate model parameters and data features, and retrain models frequently as the data changes, which requires more complex feedback loops. The team also needs to be able to track experiments for reproducibility without impeding workflow reusability.
  2. Testing: in MLOps, testing requires additional methods beyond what is normally done in DevOps. For example, MLOps requires data validation, model validation, tests of model quality, model integration tests, and differential tests.
  3. Deployment: the deployment process in MLOps is similar to DevOps, but it depends on the type of ML system you're deploying. It becomes easier if the ML system is designed to be decoupled from the rest of the product and acts as an external unit of the software.
  4. Production: a machine learning model in production is never finished, and keeping it healthy can be more challenging than with traditional software. Its performance can degrade over time as user data changes, so MLOps needs model monitoring and auditing to avoid unpleasant surprises.

Need for identical development and production environment

In software engineering there are typically two stages of product development: development and production. These can be reduced to one when a cloud-native setup is used for both, but the majority of ML applications are still developed on local machines before being pushed to the cloud, which is why the two environments need to match as closely as possible.

Essentials of Containers in MLOps

A container is a standard unit of software that packages code and all its dependencies, so the application runs quickly and reliably from one computing environment to another.
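
To make the idea concrete, here is a minimal, hypothetical Dockerfile that packages a small Python application together with its dependencies (the file names and base image are assumptions for illustration):

# Hypothetical example: package a small Python app and its pinned dependencies into an image
FROM python:3.9-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and define how the container starts
COPY . .
CMD ["python", "main.py"]

Building this image (for example with docker build -t my-app .) produces a self-contained unit that runs the same way on any machine with Docker installed.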

Jupyter notebook in containers

Being able to run a Jupyter Notebook on Docker is great for data scientists, because you can do research and experimentation with or without directly interfacing with the host machine.
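
A minimal docker-compose service for such a setup could look like the sketch below (the service name, image, and host folder are assumptions for illustration and may differ from the original snippet):

version: "3"
services:
  jupyter_notebook:
    image: jupyter/tensorflow-notebook        # official Jupyter Docker Stacks image with TensorFlow
    ports:
      - "8888:8888"                           # host port : container port
    volumes:
      - ./notebook:/home/jovyan/projectDir    # bind-mount a host folder into the container
    environment:
      - JUPYTER_ENABLE_LAB=yes                # run JupyterLab instead of the classic notebook

The key options used here are explained below.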

  • Ports: we use this to map the host machine port to the container port. Since Jupyter runs on port 8888 by default, we have mapped it to port 8888 on the localhost, but feel free to change the localhost port if 8888 is already used by another service.
  • Volumes: with this, we can bind-mount a local host directory to our working directory in the container. This is very useful for persisting intermediate files, such as model artefacts, because files inside the container are removed once it stops running. In this case, we have bind-mounted the host notebook folder to projectDir (the project directory inside the running container).
  • Environment: the official Jupyter image can be run as Jupyter Notebook or JupyterLab. With the environment tag, we can specify our preferred choice. Setting "JUPYTER_ENABLE_LAB" to yes indicates that we want to run JupyterLab rather than the classic notebook.

Application development with TensorFlow in containers

For this project, we will develop an automated image classification solution for photographs of marine invertebrates taken by researchers in South Africa, and serve the model with TensorFlow Serving. You can read more about the problem and the provided dataset on ZindiAfrica. We start from a trained Keras model saved as model.h5; since TensorFlow Serving expects the SavedModel format organized into numbered version folders, we first convert and export it:

import time

from tensorflow.keras.models import load_model

# Use the current timestamp as the version folder name expected by TensorFlow Serving
ts = int(time.time())

# Load the trained Keras model and re-export it in the SavedModel format
loadmodel = load_model('model.h5')
loadmodel.save(filepath=f'/home/jovyan/projectDir/classifier/my_model/{ts}', save_format='tf')
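
If you want to double-check the export before serving it, TensorFlow ships with the saved_model_cli utility, which can inspect the exported directory and its serving signature (the path below assumes the export location used above; replace <timestamp> with the generated folder name):

# Inspect the exported SavedModel and its serving_default signature
saved_model_cli show --dir /home/jovyan/projectDir/classifier/my_model/<timestamp> --all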

Model Serving Architecture

The goal is to build and deploy this model as a microservice, using containers, exposing a REST API that can be consumed by a bigger service such as the company website. With TensorFlow Serving, there are two options for API endpoints: REST and gRPC.

  • REST: the standard HTTP/JSON interface. Requests and responses are plain JSON, which makes the endpoint easy to test and to consume from almost any client. This is the option we use in this project; TensorFlow Serving exposes it on port 8501 by default.
  • gRPC: an open-source remote procedure call system, initially developed at Google in 2015. It is preferred when working with extremely large payloads during inference, because it provides low-latency communication and smaller payloads than REST.
Project architecture
The docker-compose code-snippet
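
A minimal sketch of such a compose file, assuming a tf_model_serving service built from a local Dockerfile alongside the Jupyter service described earlier (details may differ from the original snippet), could look like this:

version: "3"
services:
  tf_model_serving:                           # TensorFlow Serving microservice
    build:
      context: .
      dockerfile: Dockerfile                  # see the Dockerfile sketch below
    ports:
      - "8501:8501"                           # REST API port
  jupyter_notebook:                           # experimentation / client service
    image: jupyter/tensorflow-notebook
    ports:
      - "8888:8888"
    volumes:
      - ./notebook:/home/jovyan/projectDir
    environment:
      - JUPYTER_ENABLE_LAB=yes
    depends_on:
      - tf_model_serving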

The predict function for inference

import json

import numpy as np
import requests

def model_predict(url, image):
    # Build the JSON payload expected by the TensorFlow Serving REST API
    request_json = json.dumps({"signature_name": "serving_default", "instances": image.tolist()})
    request_headers = {"content-type": "application/json"}
    response_json = requests.post(url, data=request_json, headers=request_headers)
    prediction = json.loads(response_json.text)['predictions']
    pred_class = np.argmax(prediction)               # index of the most likely class
    confidence_level = prediction[0][pred_class]     # probability assigned to that class
    return (pred_class, confidence_level)
  • Build: the compose entry for the tf_model_serving service has a build option, which defines the context and the name of the Dockerfile used to build the image. In this case, we have it named Dockerfile; a sketch of the docker commands it contains follows.
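
A minimal sketch of such a Dockerfile, assuming the official tensorflow/serving base image and the SavedModel exported earlier (the exact contents may differ), is:

# Serve the exported SavedModel with TensorFlow Serving
FROM tensorflow/serving

# Copy the model (with its numbered version folders) into the default model base path
COPY ./classifier/my_model /models/my_model

# Tell TensorFlow Serving which model to load and expose
ENV MODEL_NAME=my_model

The tensorflow/serving image loads models from /models/<MODEL_NAME> and exposes its gRPC and REST endpoints on ports 8500 and 8501 by default.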
Inside the client code, the prediction URL is built from the serving host name, the REST port, and the model name:

tf_service_host = 'tf_model_serving'   # docker-compose service name, resolvable on the compose network
model_name = 'my_model'
REST_API_port = '8501'
model_predict_url = 'http://'+tf_service_host+':'+REST_API_port+'/v1/models/'+model_name+':predict'
  • HOST: the name of the machine running the TensorFlow Serving service. Inside the compose network, the service name (tf_model_serving) doubles as the hostname.
  • PORT: the server port of the URL. For the REST API, the port is 8501 by default, as seen in the architecture above.
  • MODEL_NAME: the name of the model we are serving. We set this to "my_model" while configuring the server.
  • VERB: this can be classify, regress, or predict, depending on the model signature. In our case, we use "predict".

Putting these together, the resulting endpoint is http://tf_model_serving:8501/v1/models/my_model:predict.
With the endpoint and the predict function in place, we can run inference on the test images:

predicted_classes = []
for img in test_data:
    predicted_classes.append(model_predict(url=model_predict_url, image=np.expand_dims(img, 0)))
This will return:

[(0, 0.75897634),
(85, 0.798368514),
(77, 0.995417),
(120, 0.997971237),
(125, 0.906099916),
(66, 0.996572495),
(79, 0.977153897),
(106, 0.864411),
(57, 0.952410817),
(90, 0.99959296)]
for pred_class, confidence_level in predicted_classes:
    # Class_Name maps each class index to the corresponding species name (defined during training)
    print(f'predicted class= {Class_Name[pred_class]} with confidence level of {confidence_level}')

With the output:
predicted class= Actiniaria with confidence level of 0.75897634
predicted class= Ophiothrix_fragilis with confidence level of 0.798368514
predicted class= Nassarius speciosus with confidence level of 0.995417
predicted class= Salpa_spp_ with confidence level of 0.997971237
predicted class= Solenocera_africana with confidence level of 0.906099916
predicted class= Lithodes_ferox with confidence level of 0.996572495
predicted class= Neolithodes_asperrimus with confidence level of 0.977153897
predicted class= Prawns with confidence level of 0.864411
predicted class= Hippasteria_phrygiana with confidence level of 0.952410817
predicted class= Parapagurus_bouvieri with confidence level of 0.99959296

GPU and Docker

Docker is a great tool for creating containerized machine learning and data science environments for research and experimentation, and it is even better if we can leverage GPU acceleration (when available on the host machine) to speed things up, especially with deep learning.

  • GPU support packages and software: to give containers access to the GPU, the host machine needs the NVIDIA GPU driver and the NVIDIA Container Toolkit installed, and the container needs a GPU-enabled image (for example, a tensorflow/tensorflow image with a -gpu tag). Inside the container, you can verify that TensorFlow sees the GPU:

import tensorflow as tf

tf.config.experimental.list_physical_devices('GPU')
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Conclusion

Check out the project's GitHub repository in the link below, and remember to star it:

