import os
import shutil
os.makedirs ('simulation/python_scripts', exist_ok=True)
os.makedirs ('simulation/my_lib/aml', exist_ok=True)
os.makedirs ('simulation/configs', exist_ok=True)
os.makedirs ('simulation/data', exist_ok=True)

# shutil.copy ('./hello_world.yml', 'simulation') => a different one will be created
shutil.copy ('./pipeline_input.json', 'simulation/configs')
shutil.copy ('preprocessing/preprocessing.py', 'simulation/my_lib/aml')
shutil.copy ('training/training.py', 'simulation/my_lib/aml')
shutil.copy ('inference/inference.py', 'simulation/my_lib/aml')
shutil.copy ('data/dummy_input.csv', 'simulation/data')
shutil.copy ('data/dummy_test.csv', 'simulation/data')
Custom Docker images in AML
Note: this post is just a draft in progress. As of now, it consists of a collection of random notes.
Things to change wrt hello-world
- The conda env includes the current module, with setup.py and settings.ini from nbdev.
- The pipeline is run from a python_scripts folder.
- The conda env is in the root folder.
- The config is in a configs folder.
- The components are in a my_lib/aml folder.
- We add a Dockerfile that copies files such as setup.py, settings.ini, data, wheels, etc. (everything needed by the component scripts, which is basically everything copied to the simulation folder, except for the JSON file used to configure how the pipeline is built, e.g., the one indicating the name of the environment, etc.).
- We are going to try two methods: one based on the python image, and another based on the AML image.
Copying files
Copying settings.ini and setup.py from nbdev
cd ../../..
git clone https://github.com/fastai/nbdev.git
cp nbdev/settings.ini home/posts/data_science/simulation
cp nbdev/setup.py home/posts/data_science/simulation
cd home/posts/data_science/simulation
Changing settings.ini
Edit the settings.ini
file and replace the following entries as follows:
lib_name = my_lib
repo = simulation
requirements = pandas scikit-learn numpy
dev_requirements = joblib azure-ai-ml
lib_path = my_lib
Then remove the following entries:
pip_requirements
conda_requirements
dev_requirements
console_scripts
Changing the conda environment file
%%writefile simulation/hello_world.yml
name: hello_world
dependencies:
- python=3.10
- pip
- pip:
- -e .[dev]
Overwriting simulation/hello_world.yml
Using custom docker image
I just searched in the list of curated environments present in my workspace, using the keyword sklearn
in the search text box. At the time of writing this tutorial, the environment found is sklearn-1.1:30
. By clicking on it, and then on the Context
tab, we can read its dockerfile, which indicates, in the first line, the base docker image used: FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20240415.v1
.
I found documentation about the docker image to be used by googling “docker mcr.microsoft.com/azureml/openmpi4.1.0”, which provided the following URL: https://hub.docker.com/_/microsoft-azureml
. In there, we can find additional links: - https://github.com/Azure/AzureML-Containers => contains docker files for each image - Note: SDK code contained in this link is v1.
We can also explore this docker image by running it in interactive mode (see a cheat sheet of docker commands in here and here):
docker pull mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20240415.v1
docker run -it --entrypoint bash mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20240415.v1
In this tutorial we can see how to test this docker image:
docker run --rm -d -v $PWD/$BASE_PATH:$MODEL_BASE_PATH -p 8501:8501 \
-e MODEL_BASE_PATH=$MODEL_BASE_PATH -e MODEL_NAME=$MODEL_NAME \
--name="tfserving-test" docker.io/tensorflow/serving:latest
sleep 10
Possibilities
Either copy the dockerfile and replace the line:
COPY conda_dependencies.yaml .
with:
RUN mkdir -p data
COPY <MY-CONDA-ENV-YAML> .
COPY settings.ini .
COPY setup.py .
COPY data/dummy_input.csv data/
COPY data/dummy_test.csv data/
and change the name conda_dependencies.yaml with <MY-CONDA-ENV-YAML> in the rest of the Dockerfile, or name your conda env file conda_dependencies.yaml:
mv simulation/hello_world.yml simulation/conda_dependencies.yaml
and use the curated docker image as base image in your dockerfile:
FROM mcr.microsoft.com/azureml/curated/sklearn-1.1:30
RUN mkdir -p data
COPY settings.ini .
COPY setup.py .
COPY data/dummy_input.csv data/
COPY data/dummy_test.csv data/
Note that we can dedicate a specific folder for all the files that need to be copied and used in the Dockerfile, including conda_dependencies.yaml.
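For instance, something along the following lines would assemble such a dedicated build-context folder. This is only a sketch: the docker_context folder name is hypothetical and is not used in the rest of this post.
import os
import shutil

# Hypothetical dedicated build-context folder (illustration only)
os.makedirs ('docker_context/data', exist_ok=True)
for f in ['settings.ini', 'setup.py', 'conda_dependencies.yaml', 'Dockerfile']:
    shutil.copy (f, 'docker_context')
for f in ['data/dummy_input.csv', 'data/dummy_test.csv']:
    shutil.copy (f, 'docker_context/data')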
Final Dockerfile
From the two possibilities mentioned above, we use the first one, which is more modular:
cd ..
/mnt/batch/tasks/shared/LS_root/mounts/clusters/jaumecpu/code/Users/jau.m/home/posts/data_science
mv simulation/hello_world.yml simulation/conda_dependencies.yaml
mv: cannot stat 'simulation/hello_world.yml': No such file or directory
%%writefile simulation/Dockerfile
FROM mcr.microsoft.com/azureml/curated/sklearn-1.1:30

RUN mkdir -p data
COPY settings.ini .
COPY setup.py .
COPY data/dummy_input.csv data/
COPY data/dummy_test.csv data/
Overwriting simulation/Dockerfile
%%writefile simulation/Dockerfile
FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20240415.v1

WORKDIR /

ENV CONDA_PREFIX=/azureml-envs/sklearn-1.1
ENV CONDA_DEFAULT_ENV=$CONDA_PREFIX
ENV PATH=$CONDA_PREFIX/bin:$PATH

# This is needed for mpi to locate libpython
ENV LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

# Create conda environment
RUN mkdir -p data
# The conda env file is copied under the name expected by the RUN step below
COPY hello_world.yml conda_dependencies.yaml
COPY settings.ini .
COPY setup.py .
COPY data/dummy_input.csv data/
COPY data/dummy_test.csv data/

RUN conda env create -p $CONDA_PREFIX -f conda_dependencies.yaml -q && \
    rm conda_dependencies.yaml && \
    conda run -p $CONDA_PREFIX pip cache purge && \
    conda clean -a -y
Overwriting Dockerfile
Testing docker
#docker pull mcr.microsoft.com/azureml/curated/sklearn-1.1:30
docker build -t hello_world .
docker run -v ~/cloudfiles/code/Users/jau.m/home/posts/data_science/simulation/:/host_dir -it --entrypoint bash hello_world
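Once inside the container, we can run a couple of quick sanity checks from its Python interpreter (a minimal sketch; it only assumes that the conda environment created by the image is on the PATH, as set in the Dockerfile):
# Run inside the container, e.g. by starting `python` from the bash prompt
import sys

import pandas as pd
import sklearn

print ("python:", sys.version.split()[0])
print ("scikit-learn:", sklearn.__version__)
print ("pandas:", pd.__version__)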
Let’s try running the first job of the pipeline function: preprocessing_training_job. For this, we first look at how the script needs to be run from the command line, as indicated in the command call of the preprocessing component:
command="python preprocessing.py "
        "--input_file ${{inputs.input_file}} "
        "-x ${{inputs.x}} "
        "--output_folder ${{outputs.output_folder}} "
        "--output_filename ${{inputs.output_filename}}",
Then, in order to see what the inputs inputs.input_file, inputs.x and inputs.output_filename are, we look at how the preprocessing_training_job is created:
preprocessing_training_job = preprocessing_component(
    input_file=preprocessing_training_input_file,
    # output_folder: automatically determined
    output_filename=preprocessing_training_output_filename,
    x=x,
)
and, in order to fill in those values we look at the ones passed to the pipeline function:
three_components_pipeline_object = three_components_pipeline(
    # first preprocessing component
    preprocessing_training_input_file=Input(type="uri_file", path=config.preprocessing_training_input_file),
    preprocessing_training_output_filename=config.preprocessing_training_output_filename,
    x=config.x,
If we just replace those, we get:
python preprocessing.py --input config.preprocessing_training_input_file --output_filename config.preprocessing_training_output_filename -x config.x
These values are given in the config file, so we visualize it:
!cat configs/pipeline_input.json
{
"preprocessing_training_input_file": "./data/dummy_input.csv",
"preprocessing_training_output_filename":"preprocessed_training_data.csv",
"x": 10,
"preprocessing_test_input_file": "./data/dummy_test.csv",
"preprocessing_test_output_filename": "preprocessed_test_data.csv",
"training_output_filename": "model.pk",
"inference_output_filename": "inference_results.csv",
"experiment_name": "e2e_three_components_in_script",
"compute_name": "jaumecpu",
"image": "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
"conda_file": "./hello_world.yml",
"name_env": "hello-world",
"description_env": "Hello World",
"docker_context_path": "."
}
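As a quick illustration (this snippet is hypothetical and not part of the repo), we can load the config and print the command that the first preprocessing job will effectively run; only the output folder is left open, since AML chooses it automatically:
import json

with open ('configs/pipeline_input.json') as f:
    config = json.load (f)

# Substitute the config values into the preprocessing command
print (
    "python preprocessing.py"
    f" --input_file {config['preprocessing_training_input_file']}"
    f" -x {config['x']}"
    " --output_folder <chosen by AML>"
    f" --output_filename {config['preprocessing_training_output_filename']}"
)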
With all this, we can put the pieces together. The names of the output folders are automatically generated by AML, and the folders are automatically created. In our case, we will name the output folder preprocessing_training_ouput_folder, and create it beforehand. We also have to copy the script my_lib/aml/preprocessing.py to the current folder. Putting it all together, we run the following in the command line inside the docker container:
cp host_dir/my_lib/aml/preprocessing.py .
mkdir preprocessing_training_ouput_folder
python preprocessing.py --input ./data/dummy_input.csv --output_folder preprocessing_training_ouput_folder --output_filename preprocessed_training_data.csv -x 10
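If we want to double-check the result, a quick look from the container's Python would be something like the following sketch. It assumes, as the component docstring suggests, that preprocessing simply adds x to the input values; adapt the check to whatever preprocessing.py actually does.
import pandas as pd

df_in = pd.read_csv ('data/dummy_input.csv')
df_out = pd.read_csv ('preprocessing_training_ouput_folder/preprocessed_training_data.csv')
print (df_out.head())
# Assumption: preprocessing adds x (here 10) to the numeric input values
# print ((df_out.values == df_in.values + 10).all())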
Change aml_utils.py
In order to use a custom docker image, we need to use a different way of creating the environment:
env = Environment(
    build=BuildContext(path=docker_context_path),
    name=name_env,
    description=description_env,
)
This change affects the function create_env
in aml_utils.py
:
def create_env (
    ml_client,
    image: str="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file: str="./pipeline.yml",
    name_env: str="pipeline",
    description_env: str="Pipeline environment",
    docker_context_path=None,
):
    "Creates environment in AML workspace"
    if docker_context_path is None:
        env = Environment (
            image=image,
            conda_file=conda_file,
            name=name_env,
            description=description_env,
        )
    else:
        env = Environment(
            build=BuildContext(path=docker_context_path),
            name=name_env,
            description=description_env,
        )
    ml_client.environments.create_or_update (env)
The change also affects functions that call create_env
(connect_setup_and_run
in aml_utils.py
, and run_pipeline
in hello_world_pipeline.py
), since they need to pass the additional parameter docker_context_path
. We also need to import BuildContext
from azure.ai.ml.entities
. With these changes, the complete aml_utils.py
file is as follows:
%%writefile aml_utils.py
# Standard imports
import json

# Third-party imports
from sklearn.utils import Bunch

# AML imports
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment, BuildContext
from azure.identity import DefaultAzureCredential


def connect ():
    """Connects to Azure ML workspace and returns a handle to use it."""
    # authenticate
    credential = DefaultAzureCredential()

    # Get a handle to the workspace
    ml_client = MLClient.from_config (
        credential=credential,
    )
    return ml_client


def create_env (
    ml_client,
    image: str="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file: str="./pipeline.yml",
    name_env: str="pipeline",
    description_env: str="Pipeline environment",
    docker_context_path=None,
):
    "Creates environment in AML workspace"
    if docker_context_path is None:
        env = Environment (
            image=image,
            conda_file=conda_file,
            name=name_env,
            description=description_env,
        )
    else:
        env = Environment(
            build=BuildContext(path=docker_context_path),
            name=name_env,
            description=description_env,
        )
    ml_client.environments.create_or_update (env)


def connect_setup_and_run (
    pipeline_object,
    experiment_name: str="pipeline experiment",
    compute_name: str="jaumecpu",
    image: str="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file: str="./pipeline.yml",
    name_env: str="pipeline",
    description_env: str="Pipeline environment",
    docker_context_path=None,
):
    """Does all the setup required to run the pipeline.

    This includes: connecting, creating environment, indicating our compute instance,
    creating and running the pipeline.
    """
    # connect
    ml_client = connect ()

    # create env
    create_env (
        ml_client,
        image=image,
        conda_file=conda_file,
        name_env=name_env,
        description_env=description_env,
        docker_context_path=docker_context_path,
    )

    # compute
    pipeline_object.settings.default_compute = compute_name

    # create pipeline and run
    pipeline_job = ml_client.jobs.create_or_update(
        pipeline_object,
        # Project's name
        experiment_name=experiment_name,
    )

    # ----------------------------------------------------
    # Pipeline running
    # ----------------------------------------------------
    ml_client.jobs.stream(pipeline_job.name)


def read_config (config_path: str):
    # Read config json file
    with open (config_path, "rt") as config_file:
        config = json.load (config_file)

    config = Bunch (**config)

    return config
Overwriting aml_utils.py
cp aml_utils.py simulation/
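Before wiring everything into the pipeline, we can optionally smoke-test the new docker-based environment creation on its own. This is just a sketch, run from the simulation folder, reusing the environment name and description from the config file:
from aml_utils import connect, create_env

ml_client = connect ()
create_env (
    ml_client,
    name_env="hello-world",
    description_env="Hello World",
    docker_context_path=".",
)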
Change config file
Add the following line to the previous config file: "docker_context_path": "."
%%writefile configs/pipeline_input.json
{
"preprocessing_training_input_file": "./data/dummy_input.csv",
"preprocessing_training_output_filename":"preprocessed_training_data.csv",
"x": 10,
"preprocessing_test_input_file": "./data/dummy_test.csv",
"preprocessing_test_output_filename": "preprocessed_test_data.csv",
"training_output_filename": "model.pk",
"inference_output_filename": "inference_results.csv",
"experiment_name": "e2e_three_components_in_script",
"compute_name": "jaumecpu",
"image": "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
"conda_file": "./hello_world.yml",
"name_env": "hello-world",
"description_env": "Hello World",
"docker_context_path": "."
}
Overwriting configs/pipeline_input.json
rm configs/untitled.txt
rm: cannot remove 'configs/untitled.txt': No such file or directory
No need
Change hello_world_pipeline.py file
Imports
Same imports section:
%%writefile simulation/hello_world_pipeline.py
# Standard imports
import argparse
# AML imports
from azure.ai.ml import (
command,
dsl,
Input,
Output,
)
# Utility functions
from aml_utils import (
connect,
create_env,
connect_setup_and_run,
read_config,
)
Overwriting hello_world_pipeline.py
Pipeline function
%%writefile -a simulation/hello_world_pipeline.py
@dsl.pipeline(
    description="Simulation hello-world",
)
def three_components_pipeline(
    # Preprocessing component parameters, first component:
    preprocessing_training_input_file: str,
    preprocessing_training_output_filename: str,
    x: int,

    # Preprocessing component parameters, second component:
    preprocessing_test_input_file: str,
    preprocessing_test_output_filename: str,

    # Training component parameters:
    training_output_filename: str,

    # Inference component parameters:
    inference_output_filename: str,
):
    """
    Third pipeline: preprocessing, training and inference.

    Parameters
    ----------
    preprocessing_training_input_file: str
        Path to file containing training data to be preprocessed.
    preprocessing_training_output_filename: str
        Name of file containing the preprocessed, training data.
    x: int
        Number to add to input data for preprocessing it.
    preprocessing_test_input_file: str
        Path to file containing test data to be preprocessed.
    preprocessing_test_output_filename: str
        Name of file containing the preprocessed, test data.
    training_output_filename: str
        Name of file containing the trained model.
    inference_output_filename: str
        Name of file containing the output data with inference results.
    """
    # -------------------------------------------------------------------------------------
    # Preprocessing
    # -------------------------------------------------------------------------------------
    # Interface
    preprocessing_component = command(
        inputs=dict(
            input_file=Input (type="uri_file"),
            x=Input (type="number"),
            output_filename=Input (type="string"),
        ),
        outputs=dict(
            output_folder=Output (type="uri_folder"),
        ),
        code="./my_lib/aml/",  # location of source code
        command="python preprocessing.py "
                "--input_file ${{inputs.input_file}} "
                "-x ${{inputs.x}} "
                "--output_folder ${{outputs.output_folder}} "
                "--output_filename ${{inputs.output_filename}}",
        environment="hello-world@latest",
        display_name="Pre-processing",
    )

    # Instantiation
    preprocessing_training_job = preprocessing_component(
        input_file=preprocessing_training_input_file,
        # output_folder: automatically determined
        output_filename=preprocessing_training_output_filename,
        x=x,
    )
    preprocessing_test_job = preprocessing_component(
        input_file=preprocessing_test_input_file,
        # output_folder: automatically determined
        output_filename=preprocessing_test_output_filename,
        x=x,
    )

    # -------------------------------------------------------------------------------------
    # Training component
    # -------------------------------------------------------------------------------------
    # Interface
    training_component = command(
        inputs=dict(
            input_folder=Input (type="uri_folder"),
            input_filename=Input (type="string"),
            output_filename=Input (type="string"),
        ),
        outputs=dict(
            output_folder=Output (type="uri_folder"),
        ),
        code="./my_lib/aml/",  # location of source code
        command="python training.py "
                "--input_folder ${{inputs.input_folder}} "
                "--input_filename ${{inputs.input_filename}} "
                "--output_folder ${{outputs.output_folder}} "
                "--output_filename ${{inputs.output_filename}}",
        environment="hello-world@latest",
        display_name="Training",
    )

    # Instantiation
    training_job = training_component(
        input_folder=preprocessing_training_job.outputs.output_folder,
        input_filename=preprocessing_training_output_filename,
        # output_folder: automatically determined
        output_filename=training_output_filename,
    )

    # -------------------------------------------------------------------------------------
    # Inference
    # -------------------------------------------------------------------------------------
    # Interface
    inference_component = command(
        inputs=dict(
            preprocessed_input_folder=Input (type="uri_folder"),
            preprocessed_input_filename=Input (type="string"),
            model_input_folder=Input (type="uri_folder"),
            model_input_filename=Input (type="string"),
            output_filename=Input (type="string"),
        ),
        outputs=dict(
            output_folder=Output (type="uri_folder"),
        ),
        code="./my_lib/aml/",  # location of source code
        command="python inference.py "
                "--preprocessed_input_folder ${{inputs.preprocessed_input_folder}} "
                "--preprocessed_input_filename ${{inputs.preprocessed_input_filename}} "
                "--model_input_folder ${{inputs.model_input_folder}} "
                "--model_input_filename ${{inputs.model_input_filename}} "
                "--output_folder ${{outputs.output_folder}} "
                "--output_filename ${{inputs.output_filename}} ",
        environment="hello-world@latest",
        display_name="inference",
    )

    # Instantiation
    inference_job = inference_component(
        preprocessed_input_folder=preprocessing_test_job.outputs.output_folder,
        preprocessed_input_filename=preprocessing_test_output_filename,
        model_input_folder=training_job.outputs.output_folder,
        model_input_filename=training_output_filename,
        # output_folder: automatically determined
        output_filename=inference_output_filename,
    )
Appending to hello_world_pipeline.py
Create and run pipeline
Next we define a function that both creates and runs the pipeline implemented above. This function performs all the steps implemented so far: it reads a config file, instantiates a pipeline object by calling our three_components_pipeline
function, and finally performs the pipeline set-up and runs it by calling connect_setup_and_run
:
%%writefile -a simulation/hello_world_pipeline.py
def run_pipeline (
    config_path: str="./pipeline_input.json",
    experiment_name="hello-world-experiment",
):
    # read config
    config = read_config (config_path)

    # Build pipeline
    three_components_pipeline_object = three_components_pipeline(
        # first preprocessing component
        preprocessing_training_input_file=Input(type="uri_file", path=config.preprocessing_training_input_file),
        preprocessing_training_output_filename=config.preprocessing_training_output_filename,
        x=config.x,

        # second preprocessing component
        preprocessing_test_input_file=Input(type="uri_file", path=config.preprocessing_test_input_file),
        preprocessing_test_output_filename=config.preprocessing_test_output_filename,

        # Training component parameters:
        training_output_filename=config.training_output_filename,

        # Inference component parameters:
        inference_output_filename=config.inference_output_filename,
    )

    connect_setup_and_run (
        three_components_pipeline_object,
        experiment_name=experiment_name,
        compute_name=config.compute_name,
        image=config.image,
        conda_file=config.conda_file,
        name_env=config.name_env,
        description_env=config.description_env,
        docker_context_path=config.docker_context_path,
    )
Appending to hello_world_pipeline.py
Parsing arguments
%%writefile -a simulation/hello_world_pipeline.py
def parse_args ():
    """Parses input arguments"""
    parser = argparse.ArgumentParser()
    parser.add_argument ("--config-path",
        type=str,
        default="configs/pipeline_input.json",
        help="Path to config file specifying pipeline input parameters.",
    )
    parser.add_argument ("--experiment-name",
        type=str,
        default="simulation",
        help="Name of experiment.",
    )
    args = parser.parse_args()
    print ("Running hello-world pipeline with args", args)

    return args
Appending to hello_world_pipeline.py
Main section
%%writefile -a simulation/hello_world_pipeline.py
def main ():
    """Parses arguments and runs pipeline"""
    args = parse_args ()
    run_pipeline (
        args.config_path,
        args.experiment_name,
    )

# -------------------------------------------------------------------------------------
# -------------------------------------------------------------------------------------
if __name__ == "__main__":
    main ()
Appending to hello_world_pipeline.py
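As an alternative to running the script as a whole, the same pipeline can be launched from a notebook or REPL sitting in the simulation folder by calling run_pipeline directly (a sketch using the same defaults as below):
from hello_world_pipeline import run_pipeline

run_pipeline (
    config_path="configs/pipeline_input.json",
    experiment_name="simulation",
)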
Try
cd simulation
/mnt/batch/tasks/shared/LS_root/mounts/clusters/jaumecpu/code/Users/jau.m/home/posts/data_science/simulation
%run hello_world_pipeline.py
Running hello-world pipeline with args Namespace(config_path='configs/pipeline_input.json', experiment_name='simulation')
Found the config file in: /config.json
Uploading simulation (0.04 MBs): 100%|██████████| 43370/43370 [00:01<00:00, 38468.84it/s]
RunId: upbeat_leg_dmp9bcvs9y
Web View: https://ml.azure.com/runs/upbeat_leg_dmp9bcvs9y?wsid=/subscriptions/6af6741b-f140-48c2-84ca-027a27365026/resourcegroups/helloworld/workspaces/helloworld
Streaming logs/azureml/executionlogs.txt
========================================
[2024-04-22 16:10:08Z] Submitting 2 runs, first five are: b47b7c2a:96c70098-5988-4f87-82e5-533cf367757a,f0c6ed40:8447c9f9-a361-4615-961a-f9362407c3fe
[2024-04-22 16:20:46Z] Completing processing run id 96c70098-5988-4f87-82e5-533cf367757a.
[2024-04-22 16:20:46Z] Completing processing run id 8447c9f9-a361-4615-961a-f9362407c3fe.
[2024-04-22 16:20:47Z] Submitting 1 runs, first five are: 0ef51f82:68e226d8-0c7a-4d40-ab80-df94e1eae12e
[2024-04-22 16:21:09Z] Completing processing run id 68e226d8-0c7a-4d40-ab80-df94e1eae12e.
[2024-04-22 16:21:10Z] Submitting 1 runs, first five are: 005c2297:9906cb00-9ecf-4b37-9a83-58942197aef9
[2024-04-22 16:21:33Z] Completing processing run id 9906cb00-9ecf-4b37-9a83-58942197aef9.
Execution Summary
=================
RunId: upbeat_leg_dmp9bcvs9y
Web View: https://ml.azure.com/runs/upbeat_leg_dmp9bcvs9y?wsid=/subscriptions/6af6741b-f140-48c2-84ca-027a27365026/resourcegroups/helloworld/workspaces/helloworld