nbmodular

Convert notebooks to modular code.

Convert data science notebooks with poor modularity to fully modular notebooks that are automatically exported as python modules.

Motivation

In data science, it is usual to develop experimentally and quickly based on notebooks, with little regard to software engineering practices and modularity. It can become challenging to start working on someone else’s notebooks with no modularity in terms of separate functions, and a great degree of duplicated code between the different notebooks. This makes it difficult to understand the logic in terms of semantically separate units, see what are the commonalities and differences between the notebooks, and be able to extend, generalize, and configure the current solution.

Objectives

nbmodular is a library conceived to help convert the cells of a notebook into separate functions with clear dependencies in terms of inputs and outputs. This is done through a combination of tools which semi-automatically understand the data-flow in the code, based on mild assumptions about its structure. It also helps test the current logic and compare it against a modularized solution, to make sure that the refactored code is equivalent to the original one.

Install

pip install nbmodular

Usage

Load ipython extension

%load_ext nbmodular.core.cell2func

This allows us to use the following magic commands, among others:

function
print
function_info
print_pipeline

Let’s go one by one

function

Basic usage

The magic command function runs the code in the cell as it would normally be run, and at the same time performs a number of additional steps. Let’s go over each one in turn through the following example:

%%function two_plus_three
a = 2
b = 3
c = a+b
print (f'The result of adding {a}+{b} is {c}')

The result of adding 2+3 is 5

(a, b, c)

(2, 3, 5)

As we can see, the previous cell just runs as it normally would. In addition to this, the code is analyzed using an abstract syntax tree (AST), and the result of this analysis is stored in a new object called two_plus_three_info. Let’s look at some of the information provided by this object.

First, the object stores the list of variables that were created inside this function:

two_plus_three_info.created_variables

['a', 'b', 'c']

By default, this object also stores the values of those variables:

two_plus_three_info.current_values

{'a': 2, 'b': 3, 'c': 5}

It stores the names of the variables used by this function and created before calling it:

two_plus_three_info.previous_variables

[]

In the previous example, there are no previous variables. We will see later an example which makes use of previous variables.
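To give an intuition of how created and previous variables can be told apart, here is a minimal sketch based on Python's standard ast module. It is not nbmodular's actual implementation: the names VariableCollector and analyze_cell are hypothetical, and the sketch deliberately ignores comprehensions, augmented assignments and nested scopes.

```python
import ast
import builtins


class VariableCollector(ast.NodeVisitor):
    """Collect names that a snippet assigns and names it reads before assigning."""

    def __init__(self):
        self.created = []   # names bound inside the snippet
        self.previous = []  # names read before being bound (must exist beforehand)

    def _store(self, name):
        if name not in self.created:
            self.created.append(name)

    def _load(self, name):
        is_builtin = hasattr(builtins, name)
        if name not in self.created and name not in self.previous and not is_builtin:
            self.previous.append(name)

    def visit_Assign(self, node):
        # The right-hand side is evaluated before the targets are bound.
        self.visit(node.value)
        for target in node.targets:
            self.visit(target)

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            self._store(node.id)
        elif isinstance(node.ctx, ast.Load):
            self._load(node.id)


def analyze_cell(code):
    """Return (created, previous) variable names for a snippet of code."""
    collector = VariableCollector()
    collector.visit(ast.parse(code))
    return collector.created, collector.previous


created, previous = analyze_cell("a = 2\nb = 3\nc = a+b\nprint(c)\n")
# created is ['a', 'b', 'c'] and previous is []
```

A name that is read before being assigned in the snippet, and that is not a builtin, ends up in previous; this is the situation illustrated further below with my_previous_variable.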

In addition to this, the cell magic %%function creates a new function which can be called normally later on. In our previous example, a function called two_plus_three has been created. Let’s call it:

two_plus_three ()

The result of adding 2+3 is 5

We can also print the code of that function, using the line magic %print:

%print two_plus_three

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')

Using the cell magic %%function is handy when we want to be able to inspect the variables created in the cell. In the near future, we plan to make it possible to prevent some of the variables from persisting outside of the cell, to avoid memory issues. We plan to do this in two ways:

Delete the variable (del), with the disadvantage that we won’t be able to inspect it later on.
Delete the variable only when a new cell magic is executed, so that we can still inspect the variables created in the last cell, and then move on to execute the next cell, at which point we remove previous variables that were memory-consuming.

In the longer term, we might also delete variables based on how much memory they consume, using some threshold parameter.

Let’s see now an example which uses variables created elsewhere:

my_previous_variable=10

%%function add_100
my_previous_variable = my_previous_variable + 100
print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')

The result of adding 100 to my_previous_variable is 110

add_100_info.previous_variables

['my_previous_variable']

my_previous_variable is also included in the list of created_variables, since a new value for this variable has been generated:

add_100_info.created_variables

['my_previous_variable']

All the functions created so far can be printed at once using print all:

%print all

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')

def add_100(my_previous_variable):
    my_previous_variable = my_previous_variable + 100
    print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')
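The functions printed above can be thought of as the cell code wrapped in a def statement whose arguments are the previous variables. The following is a minimal sketch of that idea, not nbmodular's actual code generator; build_function_source is a hypothetical helper, and the optional return_values argument is an assumption used only for illustration.

```python
import textwrap


def build_function_source(name, cell_code, arguments=(), return_values=()):
    """Wrap the code of a cell into the source of a function definition."""
    lines = [f"def {name}({', '.join(arguments)}):"]
    # Indent the cell code so that it becomes the body of the function.
    lines.append(textwrap.indent(cell_code.rstrip("\n"), "    "))
    if return_values:
        lines.append("    return " + ", ".join(return_values))
    return "\n".join(lines) + "\n"


source = build_function_source(
    "add_100",
    "my_previous_variable = my_previous_variable + 100\n",
    arguments=["my_previous_variable"],
    return_values=["my_previous_variable"],
)

namespace = {}
exec(source, namespace)            # define the generated function
result = namespace["add_100"](50)  # call it like any other function
```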

And they are also written to a python module with the same name as the notebook (the current notebook being called “index.ipynb”):

!cat ../nbmodular/index.py

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')

def add_100(my_previous_variable):
    my_previous_variable = my_previous_variable + 100
    print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')

# -----------------------------------------------------
# pipeline
# -----------------------------------------------------
def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    """Pipeline calling each one of the functions defined in this module."""
    
    # load result
    result_file_name += '.pk'
    path_variables = Path ("index") / result_file_name
    if load and path_variables.exists():
        result = joblib.load (path_variables)
        return result

    two_plus_three ()
    add_100 (my_previous_variable)

    # save result
    result = Bunch ()
    if save:    
        path_variables.parent.mkdir (parents=True, exist_ok=True)
        joblib.dump (result, path_variables)
    return result

Dynamic outputs

So far, none of the created functions return any result. This is because there is no other function that needs any of the variables created inside either two_plus_three or add_100. Let’s see what happens when we add a new function that requires the variable c, which was created in two_plus_three:

%%function multiply_by_two
#|echo: false
d = c*2
print (f'Two times {c} is {d}')

Two times 5 is 10

Our new function makes use of the result computed in two_plus_three, so we need that function to return this result. This is done automatically, and the function two_plus_three is updated:

%print two_plus_three

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')
    return c

We can see that two_plus_three now returns c. We can call it with the updated signature:

my_new_c = two_plus_three ()
my_new_c

The result of adding 2+3 is 5
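The rule described in this section (a function returns a variable only when a later function needs it) can be sketched as follows. This is an illustration under simplifying assumptions, not the library's implementation: infer_outputs is a hypothetical helper, and each function is described only by the names it creates and the names it requires.

```python
def infer_outputs(functions):
    """Map each function name to the created variables that later functions need.

    `functions` is a list of dicts, in pipeline order, with the keys
    'name', 'created' and 'previous'.
    """
    outputs = {}
    for index, function in enumerate(functions):
        needed_later = set()
        for later in functions[index + 1:]:
            needed_later.update(later["previous"])
        outputs[function["name"]] = [
            variable for variable in function["created"] if variable in needed_later
        ]
    return outputs


functions = [
    {"name": "two_plus_three", "created": ["a", "b", "c"], "previous": []},
    {"name": "add_100", "created": ["my_previous_variable"],
     "previous": ["my_previous_variable"]},
    {"name": "multiply_by_two", "created": ["d"], "previous": ["c"]},
]
outputs = infer_outputs(functions)
# two_plus_three must return c, because multiply_by_two requires it
```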

Indicating function position

When adding a new function, we can indicate in which position of the pipeline we want it to be added. By default, it is added at the end. To indicate the position, simply pass --position to the cell magic:

%%function my_function_in_pos_2 --position 2
<my code...>

Section print_pipeline below includes an example of this.
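Conceptually, the position works like an index into the ordered list of functions that make up the pipeline. A minimal sketch, assuming the pipeline is just a list of function names (add_to_pipeline is a hypothetical helper, and replacing an existing entry with the same name is an assumption of this sketch):

```python
def add_to_pipeline(pipeline, name, position=None):
    """Return a new list of function names with `name` placed at `position`."""
    # Drop any earlier entry with the same name, so a function appears only once.
    updated = [existing for existing in pipeline if existing != name]
    if position is None:
        updated.append(name)            # default: add at the end
    else:
        updated.insert(position, name)  # explicit position in the pipeline
    return updated


pipeline = ["two_plus_three", "add_100", "multiply_by_two"]
pipeline = add_to_pipeline(pipeline, "get_my_previous_variable", position=0)
```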

print

We can see each of the defined functions with print my_function:

%print multiply_by_two

def multiply_by_two(c):
    d = c*2
    print (f'Two times {c} is {d}')

We can print all the functions defined so far with %%function using %print all:

%print all

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')
    return c

def add_100(my_previous_variable):
    my_previous_variable = my_previous_variable + 100
    print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')

def multiply_by_two(c):
    d = c*2
    print (f'Two times {c} is {d}')

print_pipeline

As we add functions to the notebook, a pipeline function is defined. We can print this pipeline with the magic print_pipeline

%print_pipeline

# -----------------------------------------------------
# pipeline
# -----------------------------------------------------
def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    """Pipeline calling each one of the functions defined in this module."""
    
    # load result
    result_file_name += '.pk'
    path_variables = Path ("index") / result_file_name
    if load and path_variables.exists():
        result = joblib.load (path_variables)
        return result

    c = two_plus_three ()
    add_100 (my_previous_variable)
    multiply_by_two (c)

    # save result
    result = Bunch (c=c)
    if save:    
        path_variables.parent.mkdir (parents=True, exist_ok=True)
        joblib.dump (result, path_variables)
    return result

As we can see, the first and last parts of the pipeline function are dedicated to loading previously stored results, if the pipeline was run before, and saving the results of this execution. The central part calls the functions defined so far, using proper inputs and outputs. Having a pipeline function implemented for us is handy to see the data-flow (in terms of inputs and outputs) from the first function call to the last one.

One detail that we can see in the previous pipeline is that the variable my_previous_variable has not been defined before being used. However, if we try to call the pipeline function, it will not fail. This is because my_previous_variable exists in the global scope, and it is therefore treated as a global variable. If we want to make sure that all variables are local, we can do:

%delete_globals

raised_exception=False
try:
    index_pipeline()
except Exception as e:
    print (f'could not run pipeline: {e}')
    raised_exception=True
assert raised_exception

The result of adding 2+3 is 5
could not run pipeline: name 'my_previous_variable' is not defined
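The error above follows from how Python resolves names: a name that is neither a parameter nor assigned inside a function is looked up in the global namespace at call time. The following sketch reproduces the effect with an explicit dictionary standing in for the notebook's global scope; the use of exec and the dictionary are only illustration devices and say nothing about how %delete_globals is implemented.

```python
source = (
    "def add_100():\n"
    "    return my_previous_variable + 100\n"
)
namespace = {"my_previous_variable": 10}  # stand-in for the notebook globals
exec(source, namespace)

first_result = namespace["add_100"]()     # works while the global exists

del namespace["my_previous_variable"]     # the name no longer exists globally
try:
    namespace["add_100"]()
    error_message = None
except NameError as error:
    error_message = str(error)
```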

We can then add a new function that will provide a value for my_previous_variable:

%%function get_my_previous_variable --position 0
my_previous_variable = 100

%print_pipeline

# -----------------------------------------------------
# pipeline
# -----------------------------------------------------
def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    """Pipeline calling each one of the functions defined in this module."""
    
    # load result
    result_file_name += '.pk'
    path_variables = Path ("index") / result_file_name
    if load and path_variables.exists():
        result = joblib.load (path_variables)
        return result

    my_previous_variable = get_my_previous_variable ()
    c = two_plus_three ()
    add_100 (my_previous_variable)
    multiply_by_two (c)

    # save result
    result = Bunch (my_previous_variable=my_previous_variable,c=c)
    if save:    
        path_variables.parent.mkdir (parents=True, exist_ok=True)
        joblib.dump (result, path_variables)
    return result

Now we can call the pipeline without issues

index_pipeline()

The result of adding 2+3 is 5
The result of adding 100 to my_previous_variable is 200
Two times 5 is 10

{'my_previous_variable': 100, 'c': 5}

We can see that the returned value for my_previous_variable is the original value, since this value was not returned by add_100. If we want this function to return that variable, we need to either create another function that makes use of that value, or explicitly indicate that we want add_100 to return that variable, as follows:

%%function add_100 --include-output my_previous_variable
my_previous_variable = my_previous_variable + 100
print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')

The result of adding 100 to my_previous_variable is 200

We can see that my_previous_variable was added in the output:

%print add_100

def add_100(my_previous_variable):
    my_previous_variable = my_previous_variable + 100
    print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')
    return my_previous_variable

Now we can call the function and obtain the output we indicated:

add_100(50)==150

The result of adding 100 to my_previous_variable is 150

True

Another possibility is to modify the signature of a previously defined function using the line magic add_to_signature. Let’s do that with multiply_by_two. As we can see in the code above, this function doesn’t output anything at the moment.

%print multiply_by_two

def multiply_by_two(c):
    d = c*2
    print (f'Two times {c} is {d}')

Let’s call add_to_signature on it:

%add_to_signature multiply_by_two --output d

and check the result:

%print multiply_by_two

def multiply_by_two(c):
    d = c*2
    print (f'Two times {c} is {d}')
    return d

multiply_by_two (150)

Two times 150 is 300

The pipeline is updated with these changes:

%print_pipeline

# -----------------------------------------------------
# pipeline
# -----------------------------------------------------
def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    """Pipeline calling each one of the functions defined in this module."""
    
    # load result
    result_file_name += '.pk'
    path_variables = Path ("index") / result_file_name
    if load and path_variables.exists():
        result = joblib.load (path_variables)
        return result

    my_previous_variable = get_my_previous_variable ()
    c = two_plus_three ()
    my_previous_variable = add_100 (my_previous_variable)
    d = multiply_by_two (c)

    # save result
    result = Bunch (my_previous_variable=my_previous_variable,d=d,c=c)
    if save:    
        path_variables.parent.mkdir (parents=True, exist_ok=True)
        joblib.dump (result, path_variables)
    return result

Let’s check the result of calling the new pipeline:

cell_processor.call_history

[('two_plus_three',
  "#|echo: false\na = 2\nb = 3\nc = a+b\nprint (f'The result of adding {a}+{b} is {c}')\n"),
 ('add_100',
  "#|echo: false\nmy_previous_variable = my_previous_variable + 100\nprint (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')\n"),
 ('add_100',
  "#|echo: false\nmy_previous_variable = my_previous_variable + 100\nprint (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')\n"),
 ('hybrid', 'x = 3\nx = x + 4\nprint (x)\n'),
 ('hybrid', 'x = 3\nx = x + 4\nprint (x)\n'),
 ('multiply_by_two',
  "#|echo: false\nd = c*2\nprint (f'Two times {c} is {d}')\n"),
 ('get_my_previous_variable --position 0',
  '#| echo: false\nmy_previous_variable = 100\n'),
 ('add_100 --include-output my_previous_variable',
  "#| echo: false\nmy_previous_variable = my_previous_variable + 100\nprint (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')\n")]

function_info

We can get access to many of the details of each of the defined functions by calling function_info on a given function name:

two_plus_three_info = %function_info two_plus_three

This allows us to see:

The name and value (at the time of running) of the local variables, arguments and results from the function:

two_plus_three_info.arguments

[]

two_plus_three_info.current_values

{'a': 2, 'b': 3, 'c': 5}

The variables in current_values can be accessed directly as attributes of two_plus_three_info:

two_plus_three_info.a, two_plus_three_info.b, two_plus_three_info.c

(2, 3, 5)
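Exposing dictionary entries as attributes is a common Python pattern that can be built on __getattr__. The following is a minimal sketch of that pattern; FunctionInfoSketch is a hypothetical stand-in and does not reflect the actual class used by nbmodular.

```python
class FunctionInfoSketch:
    """Hypothetical stand-in for an info object holding the values of local variables."""

    def __init__(self, name, current_values):
        self.name = name
        self.current_values = dict(current_values)

    def __getattr__(self, attribute):
        # Only called when normal attribute lookup fails, so `name` and
        # `current_values` themselves are resolved as usual.
        try:
            return self.__dict__["current_values"][attribute]
        except KeyError:
            raise AttributeError(attribute) from None


info = FunctionInfoSketch("two_plus_three", {"a": 2, "b": 3, "c": 5})
values = (info.a, info.b, info.c)
```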

We can also see the return values of the function:

two_plus_three_info.return_values

['c']

We can inspect the original code written in the cell…

print (two_plus_three_info.original_code)

a = 2
b = 3
c = a+b
print (f'The result of adding {a}+{b} is {c}')

the code of the function we just created:

print (two_plus_three_info.code)

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')
    return c

... and the AST trees:

print (two_plus_three_info.get_ast (code=two_plus_three_info.original_code))

Module(
  body=[
    Assign(
      targets=[
        Name(id='a', ctx=Store())],
      value=Constant(value=2)),
    Assign(
      targets=[
        Name(id='b', ctx=Store())],
      value=Constant(value=3)),
    Assign(
      targets=[
        Name(id='c', ctx=Store())],
      value=BinOp(
        left=Name(id='a', ctx=Load()),
        op=Add(),
        right=Name(id='b', ctx=Load()))),
    Expr(
      value=Call(
        func=Name(id='print', ctx=Load()),
        args=[
          JoinedStr(
            values=[
              Constant(value='The result of adding '),
              FormattedValue(
                value=Name(id='a', ctx=Load()),
                conversion=-1),
              Constant(value='+'),
              FormattedValue(
                value=Name(id='b', ctx=Load()),
                conversion=-1),
              Constant(value=' is '),
              FormattedValue(
                value=Name(id='c', ctx=Load()),
                conversion=-1)])],
        keywords=[]))],
  type_ignores=[])
None

print (two_plus_three_info.get_ast (code=two_plus_three_info.code))

Module(
  body=[
    FunctionDef(
      name='two_plus_three',
      args=arguments(
        posonlyargs=[],
        args=[],
        kwonlyargs=[],
        kw_defaults=[],
        defaults=[]),
      body=[
        Assign(
          targets=[
            Name(id='a', ctx=Store())],
          value=Constant(value=2)),
        Assign(
          targets=[
            Name(id='b', ctx=Store())],
          value=Constant(value=3)),
        Assign(
          targets=[
            Name(id='c', ctx=Store())],
          value=BinOp(
            left=Name(id='a', ctx=Load()),
            op=Add(),
            right=Name(id='b', ctx=Load()))),
        Expr(
          value=Call(
            func=Name(id='print', ctx=Load()),
            args=[
              JoinedStr(
                values=[
                  Constant(value='The result of adding '),
                  FormattedValue(
                    value=Name(id='a', ctx=Load()),
                    conversion=-1),
                  Constant(value='+'),
                  FormattedValue(
                    value=Name(id='b', ctx=Load()),
                    conversion=-1),
                  Constant(value=' is '),
                  FormattedValue(
                    value=Name(id='c', ctx=Load()),
                    conversion=-1)])],
            keywords=[])),
        Return(
          value=Name(id='c', ctx=Load()))],
      decorator_list=[])],
  type_ignores=[])
None

cell_processor

This line magic gives us access to the CellProcessor object managing the logic for running the above magic commands, which can come in handy:

cell_processor = %cell_processor

Merging function cells

In order to explore intermediate results, it is convenient to split the code of a function among different cells. This can be done by passing the flag --merge:

%%function analyze
x = [1, 2, 3]
y = [100, 200, 300]
z = [u+v for u,v in zip(x,y)]

[101, 202, 303]

%print analyze

def analyze(x):
    x = [1, 2, 3]
    y = [100, 200, 300]
    z = [u+v for u,v in zip(x,y)]

%%function analyze --merge
product = [u*v for u, v in zip(x,y)]

%print analyze

def analyze(x):
    x = [1, 2, 3]
    y = [100, 200, 300]
    z = [u+v for u,v in zip(x,y)]
    product = [u*v for u, v in zip(x,y)]
    return x
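At its simplest, merging amounts to appending the code of the new cell to the code already stored for that function name. A minimal sketch of this bookkeeping, assuming the cells are kept in a plain dictionary (add_cell is a hypothetical helper, not part of nbmodular):

```python
def add_cell(cells, name, code, merge=False):
    """Store the code of a cell under a function name, optionally merging."""
    if merge and name in cells:
        # Append the new code to the code of the previous cell(s).
        cells[name] = cells[name].rstrip("\n") + "\n" + code
    else:
        cells[name] = code
    return cells


cells = {}
add_cell(cells, "analyze", "x = [1, 2, 3]\ny = [100, 200, 300]\n")
add_cell(cells, "analyze", "product = [u*v for u, v in zip(x,y)]\n", merge=True)
```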

Test functions

Test functions are implemented taking pytest as target test engine.

By passing the flag --test we indicate that the logic in the cell is dedicated to test other functions in the notebook.

This has the following consequences:

- The test function is not included in the overall pipeline.
- It has no inputs and outputs. 
- Required variables are obtained by calling a *data* function (see below) in the body, rather than taking those as input of the function.

Let’s see an example

%%function multiply_by_two --test
assert multiply_by_two(150)==300

Let’s look at the code generated for this test function:

%print test_multiply_by_two --test

def test_multiply_by_two():
    assert multiply_by_two(150)==300

Now, imagine that in order to obtain the input to multiply_by_two we need some code that obtains that input. We can define a data function that encapsulates this code and returns it to our test function:

%%function input_multiply_by_two --test --data
factors=[2, 2, 3, 5, 5]
value_to_multiply = 1
for factor in factors:
    value_to_multiply *= factor

factors=[2, 2, 3, 5, 5]
value_to_multiply = 1
for factor in factors:
    value_to_multiply *= factor

Now we change test_multiply_by_two a little bit to use value_to_multiply as input of multiply_by_two:

%%function multiply_by_two --test
print(multiply_by_two(value_to_multiply))

Let’s see how test_multiply_by_two is implemented after applying the previous change:

%print test_multiply_by_two --test

def test_multiply_by_two():
    value_to_multiply = test_input_multiply_by_two()
    print(multiply_by_two(value_to_multiply))

We can see that the variable value_to_multiply is returned by calling the “test data” function test_input_multiply_by_two. We use this type of implementation to make it possible to use test engines such as pytest where the test functions need to be self-contained, i.e., they need to operate independently of other functions. Although pytest uses fixtures for this purpose, our test data functions provide an alternative to it.

We can see that test_input_multiply_by_two returns the required value_to_multiply, so that it can be used by test_multiply_by_two.

%print test_input_multiply_by_two --test --data

def test_input_multiply_by_two():
    factors=[2, 2, 3, 5, 5]
    value_to_multiply = 1
    for factor in factors:
        value_to_multiply *= factor
    return value_to_multiply

To prevent conflicts, two test data functions cannot return a variable with the same name:

%%function second_function --test --data
value_to_multiply = 10

If we run the previous code, we get a ValueError exception with the following message:

ValueError: detected common variables with other test data functions {'value_to_multiply'}:
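A check of this kind can be sketched as a set intersection between the variables returned by the new test data function and those returned by the existing ones. The helper check_no_common_variables below is hypothetical, and only the wording of the error message is taken from the output above.

```python
def check_no_common_variables(new_outputs, existing_outputs):
    """Raise ValueError if a new test data function returns an already used name.

    `existing_outputs` maps each test data function name to the list of
    variable names it returns.
    """
    already_used = {
        variable
        for variables in existing_outputs.values()
        for variable in variables
    }
    common = set(new_outputs) & already_used
    if common:
        raise ValueError(
            f"detected common variables with other test data functions {common}"
        )


existing = {"test_input_multiply_by_two": ["value_to_multiply"]}
try:
    check_no_common_variables(["value_to_multiply"], existing)
    conflict_message = None
except ValueError as error:
    conflict_message = str(error)
```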

Test functions are written in a separate test module, with the prefix test_:

os.listdir ('../tests')

['test_index.py']

assert os.listdir ('../tests')==['test_index.py']

Imports

In order to include libraries in our python module, we can use the magic imports. Those will be written at the beginning of the module:

%%imports
import pandas as pd

!cat ../nbmodular/index.py

#|echo: false
import pandas as pd
def get_my_previous_variable():
    my_previous_variable = 100
    return my_previous_variable

def two_plus_three():
    a = 2
    b = 3
    c = a+b
    print (f'The result of adding {a}+{b} is {c}')
    return c

def add_100(my_previous_variable):
    my_previous_variable = my_previous_variable + 100
    print (f'The result of adding 100 to my_previous_variable is {my_previous_variable}')
    return my_previous_variable

def multiply_by_two(c):
    d = c*2
    print (f'Two times {c} is {d}')
    return d

def analyze(x):
    x = [1, 2, 3]
    y = [100, 200, 300]
    z = [u+v for u,v in zip(x,y)]
    product = [u*v for u, v in zip(x,y)]
    return x

# -----------------------------------------------------
# pipeline
# -----------------------------------------------------
def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    """Pipeline calling each one of the functions defined in this module."""
    
    # load result
    result_file_name += '.pk'
    path_variables = Path ("index") / result_file_name
    if load and path_variables.exists():
        result = joblib.load (path_variables)
        return result

    my_previous_variable = get_my_previous_variable ()
    c = two_plus_three ()
    my_previous_variable = add_100 (my_previous_variable)
    d = multiply_by_two (c)
    x = analyze (x)

    # save result
    result = Bunch (x=x,my_previous_variable=my_previous_variable,d=d,c=c)
    if save:    
        path_variables.parent.mkdir (parents=True, exist_ok=True)
        joblib.dump (result, path_variables)
    return result

Imports can be indicated separately for the test module by passing the flag --test:

%%imports --test
import matplotlib.pyplot as plt

!cat ../tests/test_index.py

#|echo: false
import matplotlib.pyplot as plt
def test_input_multiply_by_two():
    factors=[2, 2, 3, 5, 5]
    value_to_multiply = 1
    for factor in factors:
        value_to_multiply *= factor
    return value_to_multiply

def test_multiply_by_two():
    assert multiply_by_two(150)==300

def test_multiply_by_two():
    value_to_multiply = test_input_multiply_by_two()
    print(multiply_by_two(value_to_multiply))

Defined functions

The cell magic %%function can also be used on cells that define functions:

import datetime
name = 'Jaume'

def determine_approximate_age (name, birthday_year=2000):
    #|echo: false
    current_year = datetime.datetime.today().year
    approximate_age = current_year-birthday_year
    print (f'hello {name}, your approximate age is {approximate_age}')
    return approximate_age

hello Jaume, your approximate age is 23

determine_approximate_age_info

Function determine_approximate_age:
    Arguments: ['name']
    Keyword arguments: {'birthday_year': 2000}
    Output: ['approximate_age']
    Created variables: ['current_year', 'approximate_age']

determine_approximate_age_info.approximate_age, determine_approximate_age_info.current_year

(23, 2023)

def determine_approximate_age(name, birthday_year=2000):
    current_year = datetime.datetime.today().year
    approximate_age = current_year-birthday_year
    print (f'hello {name}, your approximate age is {approximate_age}')
    return approximate_age, current_year

Functions can also be provided already defined, with their signature and return values. The only caveat is that, if we want the function to be executed, the variables in the argument list need to be created outside of the function. Otherwise we need to pass the flag --not-run to avoid errors:

%%function --not-run
def myfunc (x, y, a=1, b=3):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c

def myfunc (x, y, a=1, b=3):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c

Although the internal code of the function is not executed, it is still parsed using an AST:

myfunc_info.created_variables

['c']

myfunc_info.previous_variables

['a', 'b']

This makes it possible to provide tentative warnings regarding names not found in the argument list:

def other_func (x, y):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c

def other_func (x, y):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c

Let’s do the same but running the function:

a=1
b=3

%%function
def myfunc (x, y, a=1, b=3):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c

def myfunc (x, y, a=1, b=3):
    #|echo: false
    print ('hello', a, b)
    c = a+b
    return c

hello 1 3

myfunc (10, 20)

hello 1 3

%print analyze

myfunc_info = %function_info myfunc
#|echo: false

myfunc_info

Function myfunc:
    Arguments: ['x', 'y']
    Keyword arguments: {'a': 1, 'b': 3}
    Output: ['c']
    Created variables: ['c']

myfunc_info.c

Storing local variables in memory

By default, when we run a cell function its local variables are stored in a dictionary called current_values:

%print analyze

The stored variables can be accessed by calling the magic function_info:

my_new_function_info = %function_info my_new_function

my_new_function_info.current_values

{'my_new_local': 3, 'my_other_new_local': 4}

This default behaviour can be overridden by passing the flag --not-store:

%print analyze

my_second_new_function_info = %function_info my_second_new_function

my_second_new_function_info.current_values

{'my_second_variable': '__REMOVED__',
 'my_second_other_variable': '__REMOVED__'}
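The effect shown above, where the names of the variables are kept but their values are replaced by a marker, can be sketched as follows. The helper record_values is hypothetical; only the '__REMOVED__' marker is taken from the output above.

```python
def record_values(local_variables, store=True, placeholder="__REMOVED__"):
    """Return the values to keep in memory for the local variables of a function."""
    if store:
        return dict(local_variables)
    # Keep the variable names, but drop the (possibly large) values.
    return {name: placeholder for name in local_variables}


stored = record_values({"my_new_local": 3, "my_other_new_local": 4})
not_stored = record_values(
    {"my_second_variable": 3, "my_second_other_variable": 4}, store=False
)
```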

(Un)packing Bunch I/O

from sklearn.utils import Bunch

%print analyze

%print analyze

%print analyze

def bunch_processor(x, day=1):
    a = x["a"]
    b = x["b"]
    c = 3
    a = 4
    x["a"] = a
    x["c"] = c
    x["day"] = day
    return x

Function’s info object holding local variables

df = pd.DataFrame (dict(Year=[1,2,3], Month=[1,2,3], Day=[1,2,3]))
fy = '2023'

%print analyze

other args: fy 2023 x {'a': 1, 'b': 2} y [100, 200, 300]

An info object named after the function, with the suffix _info (days_info in this example), is created in memory, and can be used to get access to local variables:

days_info.df_group

	index	Year	Month	Day
0	0	1	1	1
1	1	2	2	2
2	2	3	3	3

There is more information in this object: previous variables, code, etc.

days_info.current_values

{'df_group':    index  Year  Month  Day
 0      0     1      1    1
 1      1     2      2    2
 2      2     3      3    3}

days_info

Function days:
    Arguments: ['df', 'fy']
    Keyword arguments: {'x': 1, 'y': 3, 'n': 4}
    Output: ['df_group']
    Created variables: ['df_group']

The function can also be called directly:

days (df*100, 100, x=4)

other args: fy 100 x 4 y 3

	index	Year	Month	Day
0	0	100	100	100
1	1	200	200	200
2	2	300	300	300

Saving and loading

Saving / loading previous results

Functions can load previously computed results and save the results of the current execution. Let’s see an example:

x = 3
n = 5

def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

computing multiples

After running the previous cell, we can load the result of the function from disk:

joblib.load ('results/multiples_result.pickle')

[0, 3, 6, 9, 12]

By default, the result is saved in a folder called “results”, inside the current directory, and with a file name that is the same one as the name of the function, adding the suffix “_result” at the end. The type of result file used by default is “pickle”. All of these options can be changed as we will see later.

We can avoid recomputing the results if we pass the flag --load:

def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

As we can see, the function hasn’t run, since there is no message printed on screen. If we don’t use the load flag, it will run normally:

def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

computing multiples

Saving / loading local variables

Instead of saving / loading the variables returned by the function, we can save or load the local variables by passing the flag --io-locals:

def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

computing multiples

After running the previous cell, we will have a file with path locals/multiples_locals.pickle, storing the local variables of the function:

joblib.load ('locals/multiples_locals.pickle')

{'factors': range(0, 5), 'result': [0, 3, 6, 9, 12]}

By default, the file is saved in a folder called “locals”, inside the current directory, and with a file name that is the same one as the name of the function, adding the suffix “_locals” at the end. The type of file used by default is “pickle”. All of these options can be changed as we will see later.
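To illustrate what “the local variables of the function” means here, the sketch below runs the body of multiples in a fresh namespace and collects every name it creates, then round-trips the result through pickle. This is only an illustration built on exec and the standard pickle module; run_and_collect_locals is a hypothetical helper, and nbmodular may collect and store the variables differently.

```python
import pickle


def run_and_collect_locals(body, inputs):
    """Execute `body` with `inputs` predefined and return the new variables it creates."""
    namespace = dict(inputs)
    exec(body, namespace)
    return {
        name: value
        for name, value in namespace.items()
        if name not in inputs and name != "__builtins__"
    }


body = "factors = range(n)\nresult = [x*i for i in factors]\n"
local_variables = run_and_collect_locals(body, {"x": 3, "n": 5})

# Round trip through pickle, the default file type mentioned above.
restored = pickle.loads(pickle.dumps(local_variables))
```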

Again, we can avoid recomputing the results if we pass the flag --load. This will load the local variables into the notebook’s memory. To demonstrate that, let’s first delete those variables from memory:

del factors
del result

We now load them from disk by passing the flags --load and --io-locals:

def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

As we can see, the function hasn’t run, since there is no printed message, and the local variables have been loaded and are now available:

print (f'factors: {factors}, result: {result}')

factors: range(0, 5), result: [0, 3, 6, 9, 12]

Loading / saving in function’s code

We can insert loading / saving code into the function being defined by passing the flag --io-code:

def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    return result

computing multiples

multiples??

Signature:
multiples(
    n,
    x,
    load=False,
    save=False,
    io_type='pickle',
    io_root_path='results',
    io_file='multiples_result',
    load_args={},
    save_args={},
)
Docstring: <no docstring>
Source:   
def multiples(n, x, load=False, save=False, io_type="pickle", io_root_path="results", io_file="multiples_result", load_args={}, save_args={}):
    path_variables = Path (io_root_path) / f"{io_file}.{io_type}"
    if load and path_variables.exists():
        result = function_io.load (path_variables, io_type, **load_args)
        return result

    print ('computing multiples')
    factors = range(n)
    result = [x*i for i in factors]
    if save:
        function_io.save (result, path_variables, io_type, **save_args)
    return result
File:      /tmp/ipykernel_19269/2033624923.py
Type:      function

Calling this function with save=True will save the results to ‘results/multiples_result.pickle’, by default. This is the same path as the one used before, so let us remove it from disk first:

os.remove ('results/multiples_result.pickle')

multiples (7, 5, save=True)

computing multiples

[0, 5, 10, 15, 20, 25, 30]

joblib.load ('results/multiples_result.pickle')

[0, 5, 10, 15, 20, 25, 30]

We can also skip the computation in subsequent calls, by passing load=True:

multiples (7, 5, load=True)

[0, 5, 10, 15, 20, 25, 30]

As we can see, no message has been printed by calling the function, since the result is loaded from disk and the computation is skipped.
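The load / compute / save logic seen in the generated function can also be written as a generic wrapper. The sketch below uses the standard pickle module and a temporary directory; run_with_io is a hypothetical helper that only mirrors the load and save parameters shown above, not the function_io module used by the generated code.

```python
import pickle
import tempfile
from pathlib import Path


def run_with_io(func, *args, path, load=False, save=False):
    """Call `func`, optionally loading a stored result or saving the new one."""
    path = Path(path)
    if load and path.exists():
        with path.open("rb") as file:
            return pickle.load(file)   # skip the computation
    result = func(*args)
    if save:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as file:
            pickle.dump(result, file)
    return result


calls = []  # records every actual computation


def multiples(n, x):
    calls.append((n, x))
    return [x * i for i in range(n)]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results" / "multiples_result.pickle"
    first = run_with_io(multiples, 7, 5, path=path, save=True)
    second = run_with_io(multiples, 7, 5, path=path, load=True)
```

In this sketch the second call returns the stored result, and calls contains a single entry because the computation ran only once.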

Loading / saving config parameters

def multiples (x, n):
    print ('computing multiples')
    factors = range(n)
    result = pd.DataFrame (dict(
        factors=factors,
        multiples=[x*i for i in factors],
    ))
    return result

computing multiples

pd.read_parquet ('results_df/computed_multiples.parquet')

	factors	multiples
0	0	0
1	1	3
2	2	6
3	3	9
4	4	12