Custom Docker images in AML

Data Science
Author

Jaume Amores

Published

April 22, 2024

Note: this post is just a draft in progress. As of now, it consists of a collection of random notes.

Things to change wrt hello-world

  • The conda env installs the current module, using the setup.py and settings.ini files taken from nbdev.
  • The pipeline is run from a python_scripts folder.
  • The conda env file is in the root folder.
  • The config is in a configs folder.
  • The components are in a my_lib/aml folder.
  • We add a Dockerfile that copies everything needed by the component scripts: setup.py, settings.ini, data, wheels, etc. This is basically everything copied to the simulation folder, except for the JSON file that configures how the pipeline is built (e.g., the one indicating the name of the environment). See the layout sketched right after this list.
    • We are going to try two methods: one based on a Python image, and another based on an AML image.
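
With these changes, and following the cells below, the simulation folder ends up looking roughly like this (a sketch; the exact contents are created step by step in the rest of the post):

simulation/
├── Dockerfile
├── hello_world.yml            (conda environment file)
├── settings.ini               (from nbdev, edited)
├── setup.py                   (from nbdev)
├── aml_utils.py
├── hello_world_pipeline.py
├── python_scripts/
├── my_lib/
│   └── aml/
│       ├── preprocessing.py
│       ├── training.py
│       └── inference.py
├── configs/
│   └── pipeline_input.json
└── data/
    ├── dummy_input.csv
    └── dummy_test.csv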

Copying files

import os
import shutil

os.makedirs ('simulation/python_scripts', exist_ok=True)
os.makedirs ('simulation/my_lib/aml', exist_ok=True)
os.makedirs ('simulation/configs', exist_ok=True)
os.makedirs ('simulation/data', exist_ok=True)

# shutil.copy ('./hello_world.yml', 'simulation') => a different one will be created 
shutil.copy ('./pipeline_input.json', 'simulation/configs')

shutil.copy ('preprocessing/preprocessing.py', 'simulation/my_lib/aml')
shutil.copy ('training/training.py', 'simulation/my_lib/aml')
shutil.copy ('inference/inference.py', 'simulation/my_lib/aml')

shutil.copy ('data/dummy_input.csv', 'simulation/data')
shutil.copy ('data/dummy_test.csv', 'simulation/data')

Copying settings.ini and setup.py from nbdev

cd ../../..
git clone https://github.com/fastai/nbdev.git
cp nbdev/settings.ini home/posts/data_science/simulation
cp nbdev/setup.py home/posts/data_science/simulation
cd home/posts/data_science/simulation

Changing settings.ini

Edit the settings.ini file and replace the following entries as follows:

lib_name = my_lib
repo = simulation
requirements = pandas
               scikit-learn
               numpy
dev_requirements = joblib
                   azure-ai-ml
lib_path = my_lib

Then remove the following entries:

pip_requirements
conda_requirements
console_scripts

Changing the conda environment file

%%writefile simulation/hello_world.yml
name: hello_world
dependencies:
    - python=3.10
    - pip
    - pip:
        - -e .[dev]
Overwriting simulation/hello_world.yml

Using custom docker image

I searched the list of curated environments in my workspace, using the keyword sklearn in the search text box. At the time of writing this tutorial, the environment found is sklearn-1.1:30. By clicking on it, and then on the Context tab, we can read its dockerfile, whose first line indicates the base docker image used: FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20240415.v1.

I found documentation about this docker image by googling “docker mcr.microsoft.com/azureml/openmpi4.1.0”, which led to the following URL: https://hub.docker.com/_/microsoft-azureml. There, we can find additional links:

  • https://github.com/Azure/AzureML-Containers => contains the docker file for each image. Note: the SDK code contained in this link is v1.

We can also explore this docker image by running it in interactive mode (see a cheat sheet of docker commands here and here):

docker pull mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20240415.v1
docker run -it --entrypoint bash mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20240415.v1

In this tutorial we can see how to test a docker image locally (the example uses the TensorFlow Serving image):

docker run --rm -d -v $PWD/$BASE_PATH:$MODEL_BASE_PATH -p 8501:8501 \
 -e MODEL_BASE_PATH=$MODEL_BASE_PATH -e MODEL_NAME=$MODEL_NAME \
 --name="tfserving-test" docker.io/tensorflow/serving:latest
sleep 10

Possibilities

Either copy the dockerfile and replace the line:

COPY conda_dependencies.yaml .

with:

RUN mkdir -p data
COPY <MY-CONDA-ENV-YAML> .
COPY settings.ini .
COPY setup.py .
COPY data/dummy_input.csv data/
COPY data/dummy_test.csv data/

and replace the name conda_dependencies.yaml everywhere else in the file with the name of your conda environment file…

or name your conda env file conda_dependencies.yaml:

mv simulation/hello_world.yml simulation/conda_dependencies.yaml

and use the curated docker image as base image in your dockerfile:

FROM mcr.microsoft.com/azureml/curated/sklearn-1.1:30

RUN mkdir -p data
COPY settings.ini .
COPY setup.py .
COPY data/dummy_input.csv data/
COPY data/dummy_test.csv data/

Note that we can dedicate a specific folder for all the files that need to be copied and used in the Dockerfile, including conda_dependencies.yaml.

Final Dockerfile

Of the two possibilities mentioned above, we use the first one, which is more modular:

cd ..
/mnt/batch/tasks/shared/LS_root/mounts/clusters/jaumecpu/code/Users/jau.m/home/posts/data_science
%%writefile simulation/Dockerfile
FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20240415.v1

WORKDIR /

ENV CONDA_PREFIX=/azureml-envs/sklearn-1.1
ENV CONDA_DEFAULT_ENV=$CONDA_PREFIX
ENV PATH=$CONDA_PREFIX/bin:$PATH

# This is needed for mpi to locate libpython
ENV LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

# Copy the files needed to build the conda environment and run the components
RUN mkdir -p data
COPY hello_world.yml conda_dependencies.yaml
COPY settings.ini .
COPY setup.py .
COPY data/dummy_input.csv data/
COPY data/dummy_test.csv data/

# Create conda environment
RUN conda env create -p $CONDA_PREFIX -f conda_dependencies.yaml -q && \
    rm conda_dependencies.yaml && \
    conda run -p $CONDA_PREFIX pip cache purge && \
    conda clean -a -y
Overwriting Dockerfile

Testing docker

#docker pull mcr.microsoft.com/azureml/curated/sklearn-1.1:30
docker build -t hello_world .
docker run -v ~/cloudfiles/code/Users/jau.m/home/posts/data_science/simulation/:/host_dir -it --entrypoint bash hello_world
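
Once inside the container, a quick sanity check is to import the main dependencies and read one of the copied data files. This is just a sketch, assuming the image was built from the final Dockerfile above (so that pandas and scikit-learn are available and data/dummy_input.csv was copied into the image):

import pandas as pd
import sklearn

# versions come from the conda environment created in the Dockerfile
print ("pandas", pd.__version__, "scikit-learn", sklearn.__version__)

# this file was copied into the image by the Dockerfile
df = pd.read_csv ("data/dummy_input.csv")
print (df.head ())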

Let’s try running the first job of the pipeline function: preprocessing_training_job. For this, we first look at how the script needs to be run from the command line, as indicated in the command call of the preprocessing component:

        command="python preprocessing.py "
            "--input_file ${{inputs.input_file}} "
            "-x ${{inputs.x}} "
            "--output_folder ${{outputs.output_folder}} "
            "--output_filename ${{inputs.output_filename}}",

Then, in order to see what inputs.input_file, inputs.x and inputs.output_filename are, we look at how the preprocessing_training_job is created:

    preprocessing_training_job = preprocessing_component(
        input_file=preprocessing_training_input_file,
        #output_folder: automatically determined
        output_filename=preprocessing_training_output_filename,
        x=x,
    )

and, in order to fill in those values, we look at the ones passed to the pipeline function:

    three_components_pipeline_object = three_components_pipeline(
        # first preprocessing component
        preprocessing_training_input_file=Input(type="uri_file", path=config.preprocessing_training_input_file),
        preprocessing_training_output_filename=config.preprocessing_training_output_filename,
        x=config.x,

If we just replace those values, we get:

python preprocessing.py --input_file config.preprocessing_training_input_file -x config.x --output_folder <automatically-determined> --output_filename config.preprocessing_training_output_filename

These values are given in the config file, so let's visualize it:

!cat configs/pipeline_input.json
{
    "preprocessing_training_input_file": "./data/dummy_input.csv",
    "preprocessing_training_output_filename":"preprocessed_training_data.csv",
    "x": 10,
    "preprocessing_test_input_file": "./data/dummy_test.csv",
    "preprocessing_test_output_filename": "preprocessed_test_data.csv",
    "training_output_filename": "model.pk",
    "inference_output_filename": "inference_results.csv",
    "experiment_name": "e2e_three_components_in_script",
    "compute_name": "jaumecpu",
    "image": "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    "conda_file": "./hello_world.yml",
    "name_env": "hello-world",
    "description_env": "Hello World",
    "docker_context_path": "."
}
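
For reference, the pipeline script loads this JSON with the read_config function (defined in aml_utils.py further below), which wraps the dictionary in a Bunch so that each entry can be accessed as an attribute. A minimal sketch of what that gives us:

import json
from sklearn.utils import Bunch

with open ("configs/pipeline_input.json", "rt") as config_file:
    config = Bunch (**json.load (config_file))

# these are the values we substitute in the preprocessing command above
print (config.preprocessing_training_input_file)       # ./data/dummy_input.csv
print (config.preprocessing_training_output_filename)  # preprocessed_training_data.csv
print (config.x)                                       # 10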

With all this, we can put the pieces together. The names of the output folders are automatically generated by AML, and the folders are automatically created. In our case, we will name the output folder preprocessing_training_output_folder, and create it beforehand. We also have to copy the script my_lib/aml/preprocessing.py to the current folder. Putting it all together, we run the following at the command line inside the docker container:

cp host_dir/my_lib/aml/preprocessing.py .
mkdir preprocessing_training_output_folder
python preprocessing.py --input_file ./data/dummy_input.csv --output_folder preprocessing_training_output_folder --output_filename preprocessed_training_data.csv -x 10
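
If everything worked, the preprocessed file should now be inside the output folder we just created. An optional check, assuming the output file name used above:

import pandas as pd

# preprocessing.py was asked to write preprocessed_training_data.csv into this folder
out = pd.read_csv ("preprocessing_training_output_folder/preprocessed_training_data.csv")
print (out.head ())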

Change aml_utils.py

In order to use a custom docker image, we need to use a different way of creating the environment:

env = Environment(
    build=BuildContext(path=docker_context_path),
    name=name_env,
    description=description_env,
)

This change affects the function create_env in aml_utils.py:

def create_env (
    ml_client,
    image: str="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file: str="./pipeline.yml",
    name_env: str="pipeline",
    description_env: str="Pipeline environment",
    docker_context_path=None,
):
    """Creates environment in AML workspace."""
    if docker_context_path is None:
        env = Environment (
            image=image,
            conda_file=conda_file,
            name=name_env,
            description=description_env,
        )
    else:
        env = Environment(
            build=BuildContext(path=docker_context_path),
            name=name_env,
            description=description_env,
        )
    ml_client.environments.create_or_update (env)
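
For example, with the Dockerfile above sitting in the simulation folder (our docker build context), the call would look roughly like this (a hypothetical usage sketch, using the values from our config file and the connect helper shown in the full file below):

# assumes we run this from the simulation folder, which contains the Dockerfile
ml_client = connect ()
create_env (
    ml_client,
    name_env="hello-world",
    description_env="Hello World",
    docker_context_path=".",  # build the environment image from the local docker context
)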

The change also affects the functions that call create_env (connect_setup_and_run in aml_utils.py, and run_pipeline in hello_world_pipeline.py), since they need to pass the additional parameter docker_context_path. We also need to import BuildContext from azure.ai.ml.entities. With these changes, the complete aml_utils.py file is as follows:

%%writefile aml_utils.py
# Standard imports
import json

# Third-party imports
from sklearn.utils import Bunch

# AML imports
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment, BuildContext
from azure.identity import DefaultAzureCredential

def connect ():
    """Connects to Azure ML workspace and returns a handle to use it."""
    # authenticate
    credential = DefaultAzureCredential()

    # Get a handle to the workspace
    ml_client = MLClient.from_config (
        credential=credential,
    )
    return ml_client

def create_env (
    ml_client,
    image: str="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file: str="./pipeline.yml",
    name_env: str="pipeline",
    description_env: str="Pipeline environment",
    docker_context_path=None,
):
    """Creates environment in AML workspace."""
    if docker_context_path is None:
        env = Environment (
            image=image,
            conda_file=conda_file,
            name=name_env,
            description=description_env,
        )
    else:
        env = Environment(
            build=BuildContext(path=docker_context_path),
            name=name_env,
            description=description_env,
        )
    ml_client.environments.create_or_update (env)
    
def connect_setup_and_run (
    pipeline_object, 
    experiment_name: str="pipeline experiment",
    compute_name: str="jaumecpu",
    image: str="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file: str="./pipeline.yml",
    name_env: str="pipeline",
    description_env: str="Pipeline environment",
    docker_context_path=None,
):
    """Does all the setup required to run the pipeline.
    
    This includes: connecting, creating environment, indicating our compute instance,
    creating and running the pipeline.
    """
    # connect
    ml_client = connect ()

    # create env
    create_env (
        ml_client,
        image=image,
        conda_file=conda_file,
        name_env=name_env,
        description_env=description_env,
        docker_context_path=docker_context_path,
    )

    # compute
    pipeline_object.settings.default_compute = compute_name 

    # create pipeline and run
    pipeline_job = ml_client.jobs.create_or_update(
        pipeline_object,
        # Project's name
        experiment_name=experiment_name,
    )

    # ----------------------------------------------------
    # Pipeline running
    # ----------------------------------------------------
    ml_client.jobs.stream(pipeline_job.name)

def read_config (config_path: str):
    # Read config json file
    with open (config_path,"rt") as config_file:
        config = json.load (config_file)

    config = Bunch (**config)

    return config
Overwriting aml_utils.py
cp aml_utils.py simulation/

AML documentation

Tutorial

Change config file

Add the following line to the previous config file: "docker_context_path": "."

%%writefile configs/pipeline_input.json
{
    "preprocessing_training_input_file": "./data/dummy_input.csv",
    "preprocessing_training_output_filename":"preprocessed_training_data.csv",
    "x": 10,
    "preprocessing_test_input_file": "./data/dummy_test.csv",
    "preprocessing_test_output_filename": "preprocessed_test_data.csv",
    "training_output_filename": "model.pk",
    "inference_output_filename": "inference_results.csv",
    "experiment_name": "e2e_three_components_in_script",
    "compute_name": "jaumecpu",
    "image": "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    "conda_file": "./hello_world.yml",
    "name_env": "hello-world",
    "description_env": "Hello World",
    "docker_context_path": "."
}
Overwriting configs/pipeline_input.json


Change hello_world_pipeline.py file

Imports

The imports section remains the same as before:

%%writefile simulation/hello_world_pipeline.py
# Standard imports
import argparse

# AML imports
from azure.ai.ml import (
    command,
    dsl,
    Input,
    Output,
)

# Utility functions
from aml_utils import (
    connect,
    create_env,
    connect_setup_and_run,
    read_config,
)
Overwriting hello_world_pipeline.py

Pipeline function

%%writefile -a simulation/hello_world_pipeline.py
@dsl.pipeline(
    description="Simulation hello-world",
)
def three_components_pipeline(
    # Preprocessing component parameters, first component:
    preprocessing_training_input_file: str,
    preprocessing_training_output_filename: str,
    x: int,
    
    # Preprocessing component parameters, second component:
    preprocessing_test_input_file: str,
    preprocessing_test_output_filename: str,
    
    # Training component parameters:
    training_output_filename: str, 
    
    # Inference component parameters:
    inference_output_filename: str,
):
    """
    Third pipeline: preprocessing, training and inference.
    
    Parameters
    ----------
    preprocessing_training_input_file: str
        Path to file containing training data to be preprocessed.
    preprocessing_training_output_filename: str
        Name of file containing the preprocessed, training data.
    x: int
        Number to add to input data for preprocessing it.
    preprocessing_test_input_file: str
        Path to file containing test data to be preprocessed.
    preprocessing_test_output_filename: str
        Name of file containing the preprocessed, test data.
    training_output_filename: str
        Name of file containing the trained model.
    inference_output_filename: str
        Name of file containing the output data with inference results.
    """
        
    # -------------------------------------------------------------------------------------
    # Preprocessing
    # -------------------------------------------------------------------------------------
    # Interface
    preprocessing_component = command(
        inputs=dict(
            input_file=Input (type="uri_file"),
            x=Input (type="number"),
            output_filename=Input (type="string"),
        ),
        outputs=dict(
            output_folder=Output (type="uri_folder"),
        ),
        code="./my_lib/aml/",  # location of source code: in this case, the my_lib/aml folder
        command="python preprocessing.py "
            "--input_file ${{inputs.input_file}} "
            "-x ${{inputs.x}} "
            "--output_folder ${{outputs.output_folder}} "
            "--output_filename ${{inputs.output_filename}}",
        environment="hello-world@latest",
        display_name="Pre-processing",
    )

    # Instantiation
    preprocessing_training_job = preprocessing_component(
        input_file=preprocessing_training_input_file,
        #output_folder: automatically determined
        output_filename=preprocessing_training_output_filename,
        x=x,
    )
    preprocessing_test_job = preprocessing_component(
        input_file=preprocessing_test_input_file,
        #output_folder: automatically determined
        output_filename=preprocessing_test_output_filename,
        x=x,
    )

    # -------------------------------------------------------------------------------------
    # Training component
    # -------------------------------------------------------------------------------------
    # Interface
    training_component = command(
        inputs=dict(
            input_folder=Input (type="uri_folder"),
            input_filename=Input (type="string"),
            output_filename=Input (type="string"),
        ),
        outputs=dict(
            output_folder=Output (type="uri_folder"),
        ),
        code="./my_lib/aml/",  # location of source code: in this case, the my_lib/aml folder
        command="python training.py "
            "--input_folder ${{inputs.input_folder}} "
            "--input_filename ${{inputs.input_filename}} "
            "--output_folder ${{outputs.output_folder}} "
            "--output_filename ${{inputs.output_filename}}",
        environment="hello-world@latest",
        display_name="Training",
    )

    # Instantiation
    training_job = training_component(
        input_folder=preprocessing_training_job.outputs.output_folder,
        input_filename=preprocessing_training_output_filename,
        #output_folder: automatically determined
        output_filename=training_output_filename,
    )

    # -------------------------------------------------------------------------------------
    # Inference
    # -------------------------------------------------------------------------------------
    # Interface
    inference_component = command(
        inputs=dict(
            preprocessed_input_folder=Input (type="uri_folder"),
            preprocessed_input_filename=Input (type="string"),
            model_input_folder=Input (type="uri_folder"),
            model_input_filename=Input (type="string"),
            output_filename=Input (type="string"),
        ),
        outputs=dict(
            output_folder=Output (type="uri_folder"),
        ),
        code="./my_lib/aml/",  # location of source code: in this case, the my_lib/aml folder
        command="python inference.py " 
            "--preprocessed_input_folder ${{inputs.preprocessed_input_folder}} "
            "--preprocessed_input_filename ${{inputs.preprocessed_input_filename}} "
            "--model_input_folder ${{inputs.model_input_folder}} "
            "--model_input_filename ${{inputs.model_input_filename}} "
            "--output_folder ${{outputs.output_folder}} "
            "--output_filename ${{inputs.output_filename}} ",

        environment="hello-world@latest",
        display_name="inference",
    )

    # Instantiation
    inference_job = inference_component(
        preprocessed_input_folder=preprocessing_test_job.outputs.output_folder,
        preprocessed_input_filename=preprocessing_test_output_filename,
        model_input_folder=training_job.outputs.output_folder,
        model_input_filename=training_output_filename,
        #output_folder: automatically determined
        output_filename=inference_output_filename,
    )
    
Appending to hello_world_pipeline.py

Create and run pipeline

Next we define a function that both creates and runs the pipeline implemented above. This function performs all the steps implemented so far: it reads a config file, instantiates a pipeline object by calling our three_components_pipeline function, and finally performs the pipeline set-up and runs it by calling connect_setup_and_run:

%%writefile -a simulation/hello_world_pipeline.py
def run_pipeline (
    config_path: str="./pipeline_input.json",
    experiment_name="hello-world-experiment",
):
    # read config
    config = read_config (config_path)

    # Build pipeline 
    three_components_pipeline_object = three_components_pipeline(
        # first preprocessing component
        preprocessing_training_input_file=Input(type="uri_file", path=config.preprocessing_training_input_file),
        preprocessing_training_output_filename=config.preprocessing_training_output_filename,
        x=config.x,
        
        # second preprocessing component
        preprocessing_test_input_file=Input(type="uri_file", path=config.preprocessing_test_input_file),
        preprocessing_test_output_filename=config.preprocessing_test_output_filename,
        
        # Training component parameters:
        training_output_filename=config.training_output_filename,
        
        # Inference component parameters:
        inference_output_filename=config.inference_output_filename,
    )

    connect_setup_and_run (
        three_components_pipeline_object, 
        experiment_name=experiment_name,
        compute_name=config.compute_name,
        image=config.image,
        conda_file=config.conda_file,
        name_env=config.name_env,
        description_env=config.description_env,
        docker_context_path=config.docker_context_path,
    )
    
Appending to hello_world_pipeline.py

Parsing arguments

%%writefile -a simulation/hello_world_pipeline.py
def parse_args ():
    """Parses input arguments"""
    
    parser = argparse.ArgumentParser()
    parser.add_argument (
        "--config-path", 
        type=str, 
        default="configs/pipeline_input.json",
        help="Path to config file specifying pipeline input parameters.",
    )
    parser.add_argument (
        "--experiment-name", 
        type=str, 
        default="simulation",
        help="Name of experiment.",
    )

    args = parser.parse_args()
    
    print ("Running hello-world pipeline with args", args)
    
    return args

Appending to hello_world_pipeline.py

Main section

%%writefile -a simulation/hello_world_pipeline.py
def main ():
    """Parses arguments and runs pipeline"""
    args = parse_args ()
    run_pipeline (
        args.config_path,
        args.experiment_name,
    )

# -------------------------------------------------------------------------------------
# -------------------------------------------------------------------------------------
if __name__ == "__main__":
    main ()
Appending to hello_world_pipeline.py

Try

cd simulation
/mnt/batch/tasks/shared/LS_root/mounts/clusters/jaumecpu/code/Users/jau.m/home/posts/data_science/simulation
%run hello_world_pipeline.py
Running hello-world pipeline with args Namespace(config_path='configs/pipeline_input.json', experiment_name='simulation')
Found the config file in: /config.json
Uploading simulation (0.04 MBs): 100%|██████████| 43370/43370 [00:01<00:00, 38468.84it/s] 

RunId: upbeat_leg_dmp9bcvs9y
Web View: https://ml.azure.com/runs/upbeat_leg_dmp9bcvs9y?wsid=/subscriptions/6af6741b-f140-48c2-84ca-027a27365026/resourcegroups/helloworld/workspaces/helloworld

Streaming logs/azureml/executionlogs.txt
========================================

[2024-04-22 16:10:08Z] Submitting 2 runs, first five are: b47b7c2a:96c70098-5988-4f87-82e5-533cf367757a,f0c6ed40:8447c9f9-a361-4615-961a-f9362407c3fe
[2024-04-22 16:20:46Z] Completing processing run id 96c70098-5988-4f87-82e5-533cf367757a.
[2024-04-22 16:20:46Z] Completing processing run id 8447c9f9-a361-4615-961a-f9362407c3fe.
[2024-04-22 16:20:47Z] Submitting 1 runs, first five are: 0ef51f82:68e226d8-0c7a-4d40-ab80-df94e1eae12e
[2024-04-22 16:21:09Z] Completing processing run id 68e226d8-0c7a-4d40-ab80-df94e1eae12e.
[2024-04-22 16:21:10Z] Submitting 1 runs, first five are: 005c2297:9906cb00-9ecf-4b37-9a83-58942197aef9
[2024-04-22 16:21:33Z] Completing processing run id 9906cb00-9ecf-4b37-9a83-58942197aef9.

Execution Summary
=================
RunId: upbeat_leg_dmp9bcvs9y
Web View: https://ml.azure.com/runs/upbeat_leg_dmp9bcvs9y?wsid=/subscriptions/6af6741b-f140-48c2-84ca-027a27365026/resourcegroups/helloworld/workspaces/helloworld

End