Python API

This page outlines how to use the cache programmatically. We step through the three aspects illustrated in the diagram below: caching, staging and executing.

../_images/execution_process.svg

Illustration of the execution process.

Note

The full Jupyter notebook for this page can be accessed here: api.ipynb. Try it for yourself!

Initialisation

from pathlib import Path
import nbformat as nbf
from jupyter_cache import get_cache
from jupyter_cache.base import NbBundleIn
from jupyter_cache.executors import load_executor, list_executors
from jupyter_cache.utils import (
    tabulate_cache_records, 
    tabulate_stage_records
)

First we set up a cache and ensure that it is cleared.

Important

Clearing a cache wipes its entire content, including any settings (such as the cache limit).

cache = get_cache(".jupyter_cache")
cache.clear_cache()
cache
JupyterCacheBase('/Users/cjs14/GitHub/jupyter-cache/docs/using/.jupyter_cache')
print(cache.list_cache_records())
print(cache.list_staged_records())
[]
[]

Caching Notebooks

To directly cache a notebook:

record = cache.cache_notebook_file(
    path=Path("example_nbs", "basic.ipynb")
)
record
NbCacheRecord(pk=1)

This will add a physical copy of the notebook to the cache (stripped of any text cells) and return the record that has been added to the cache database.

Important

The returned record is static: it will not update if the database is subsequently updated.

The record stores metadata for the notebook:

record.to_dict()
{'data': {},
 'pk': 1,
 'uri': 'example_nbs/basic.ipynb',
 'accessed': datetime.datetime(2020, 3, 13, 14, 21, 46, 271953),
 'description': '',
 'hashkey': '818f3412b998fcf4fe9ca3cca11a3fc3',
 'created': datetime.datetime(2020, 3, 13, 14, 21, 46, 271943)}

Important

The URI that the notebook was read from is stored, but it has no effect on later comparison of notebooks; notebooks are compared only by their internal content.

We can retrieve cache records by their primary key (pk):

cache.list_cache_records()
[NbCacheRecord(pk=1)]
cache.get_cache_record(1)
NbCacheRecord(pk=1)

To load the entire notebook that is related to a pk:

nb_bundle = cache.get_cache_bundle(1)
nb_bundle
NbBundleOut(nb=Notebook(cells=1), record=NbCacheRecord(pk=1), artifacts=NbArtifacts(paths=0))
nb_bundle.nb
{'cells': [{'cell_type': 'code',
   'execution_count': 1,
   'metadata': {},
   'outputs': [{'name': 'stdout', 'output_type': 'stream', 'text': '1\n'}],
   'source': 'a=1\nprint(a)'}],
 'metadata': {'kernelspec': {'display_name': 'Python 3',
   'language': 'python',
   'name': 'python3'},
  'language_info': {'codemirror_mode': {'name': 'ipython', 'version': 3},
   'file_extension': '.py',
   'mimetype': 'text/x-python',
   'name': 'python',
   'nbconvert_exporter': 'python',
   'pygments_lexer': 'ipython3',
   'version': '3.6.1'},
  'test_name': 'notebook1'},
 'nbformat': 4,
 'nbformat_minor': 2}

Trying to add a notebook to the cache that matches an existing one will result in an error, since the cache ensures that all notebook hashes are unique:

record = cache.cache_notebook_file(
    path=Path("example_nbs", "basic.ipynb")
)
---------------------------------------------------------------------------
CachingError                              Traceback (most recent call last)
<ipython-input-10-5beccef01961> in <module>
      1 record = cache.cache_notebook_file(
----> 2     path=Path("example_nbs", "basic.ipynb")
      3 )

~/GitHub/jupyter-cache/jupyter_cache/cache/main.py in cache_notebook_file(self, path, uri, artifacts, data, check_validity, overwrite)
    271             ),
    272             check_validity=check_validity,
--> 273             overwrite=overwrite,
    274         )
    275 

~/GitHub/jupyter-cache/jupyter_cache/cache/main.py in cache_notebook_bundle(self, bundle, check_validity, overwrite, description)
    208             if not overwrite:
    209                 raise CachingError(
--> 210                     "Notebook already exists in cache and overwrite=False."
    211                 )
    212             shutil.rmtree(path.parent)

CachingError: Notebook already exists in cache and overwrite=False.

If we load a notebook external to the cache, then we can try to match it to one stored inside the cache:

notebook = nbf.read(str(Path("example_nbs", "basic.ipynb")), 4)
notebook
{'cells': [{'cell_type': 'markdown',
   'metadata': {},
   'source': '# a title\n\nsome text\n'},
  {'cell_type': 'code',
   'execution_count': 1,
   'metadata': {},
   'source': 'a=1\nprint(a)',
   'outputs': [{'name': 'stdout', 'output_type': 'stream', 'text': '1\n'}]}],
 'metadata': {'test_name': 'notebook1',
  'kernelspec': {'display_name': 'Python 3',
   'language': 'python',
   'name': 'python3'},
  'language_info': {'codemirror_mode': {'name': 'ipython', 'version': 3},
   'file_extension': '.py',
   'mimetype': 'text/x-python',
   'name': 'python',
   'nbconvert_exporter': 'python',
   'pygments_lexer': 'ipython3',
   'version': '3.6.1'}},
 'nbformat': 4,
 'nbformat_minor': 2}
cache.match_cache_notebook(notebook)
NbCacheRecord(pk=1)

Notebooks are matched by a hash based only on the aspects of the notebook that affect its execution (and hence its outputs). So a notebook whose text cells have changed will still match the cached notebook:

notebook.cells[0].source = "change some text"
cache.match_cache_notebook(notebook)
NbCacheRecord(pk=1)

But changing code cells will result in a different hash, and so will not be matched:

notebook.cells[1].source = "change some source code"
cache.match_cache_notebook(notebook)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-16-ece82e78c6b0> in <module>
----> 1 cache.match_cache_notebook(notebook)

~/GitHub/jupyter-cache/jupyter_cache/cache/main.py in match_cache_notebook(self, nb)
    328         """
    329         hashkey = self._hash_notebook(nb)
--> 330         cache_record = NbCacheRecord.record_from_hashkey(hashkey, self.db)
    331         return cache_record
    332 

~/GitHub/jupyter-cache/jupyter_cache/cache/db.py in record_from_hashkey(hashkey, db)
    150             if result is None:
    151                 raise KeyError(
--> 152                     "Cache record not found for NB with hashkey: {}".format(hashkey)
    153                 )
    154             session.expunge(result)

KeyError: 'Cache record not found for NB with hashkey: 74933d8a93d1df9caad87b2e6efcdc69'
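The matching behaviour above can be sketched in a few lines. The following is a simplified illustration of the idea only, not jupyter-cache's actual hashing algorithm: only code-cell sources feed the key, so markdown edits leave it unchanged while code edits alter it.

```python
import hashlib
import json

def code_hash(nb):
    """Hash only the code-cell sources of a notebook-like dict
    (a simplified stand-in for jupyter-cache's content hashing)."""
    code = [c["source"] for c in nb["cells"] if c["cell_type"] == "code"]
    return hashlib.md5(json.dumps(code).encode()).hexdigest()

nb = {"cells": [
    {"cell_type": "markdown", "source": "# a title\n\nsome text\n"},
    {"cell_type": "code", "source": "a=1\nprint(a)"},
]}
before = code_hash(nb)

nb["cells"][0]["source"] = "change some text"   # markdown edit: hash unchanged
assert code_hash(nb) == before

nb["cells"][1]["source"] = "change some source code"  # code edit: hash changes
assert code_hash(nb) != before
```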

To understand the difference between an external notebook, and one stored in the cache, we can ‘diff’ them:

print(cache.diff_nbnode_with_cache(1, notebook, as_str=True))
nbdiff
--- cached pk=1
+++ other: 
## inserted before nb/cells/0:
+  code cell:
+    execution_count: 1
+    source:
+      change some source code
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          1

## deleted nb/cells/0:
-  code cell:
-    execution_count: 1
-    source:
-      a=1
-      print(a)
-    outputs:
-      output 0:
-        output_type: stream
-        name: stdout
-        text:
-          1


If we cache this altered notebook, note that this will not remove the previously cached notebook:

nb_bundle = NbBundleIn(
    nb=notebook,
    uri=Path("example_nbs", "basic.ipynb"),
    data={"tag": "mytag"}
)
cache.cache_notebook_bundle(nb_bundle)
NbCacheRecord(pk=2)
print(tabulate_cache_records(
    cache.list_cache_records(), path_length=1, hashkeys=True
))
  ID  Origin URI    Created           Accessed          Hashkey
----  ------------  ----------------  ----------------  --------------------------------
   2  basic.ipynb   2020-03-13 14:21  2020-03-13 14:21  74933d8a93d1df9caad87b2e6efcdc69
   1  basic.ipynb   2020-03-13 14:21  2020-03-13 14:21  818f3412b998fcf4fe9ca3cca11a3fc3

Notebooks are retained in the cache until the cache limit is reached, at which point the oldest notebooks are removed.

cache.get_cache_limit()
1000
cache.change_cache_limit(100)
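Conceptually, the eviction rule can be sketched as follows. This is an illustration of the behaviour described above (ordering here is by creation time), not the library's internal code:

```python
from datetime import datetime

def apply_cache_limit(records, limit):
    """Keep at most `limit` records, dropping the oldest-created first
    (a conceptual sketch of the cache-limit behaviour)."""
    newest_first = sorted(records, key=lambda r: r["created"], reverse=True)
    return newest_first[:limit]

records = [
    {"pk": 1, "created": datetime(2020, 3, 13, 14, 21)},
    {"pk": 2, "created": datetime(2020, 3, 13, 14, 25)},
    {"pk": 3, "created": datetime(2020, 3, 13, 14, 30)},
]
kept = apply_cache_limit(records, limit=2)
print([r["pk"] for r in kept])  # → [3, 2]
```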

Staging Notebooks for Execution

Notebooks can be staged by adding their path as a stage record.

Important

This does not physically add the notebook to the cache; it merely stores its URI for later use.

record = cache.stage_notebook_file(Path("example_nbs", "basic.ipynb"))
record
NbStageRecord(pk=1)
record.to_dict()
{'uri': '/Users/cjs14/GitHub/jupyter-cache/docs/using/example_nbs/basic.ipynb',
 'traceback': '',
 'created': datetime.datetime(2020, 3, 13, 14, 21, 47, 304914),
 'assets': [],
 'pk': 1}

If the staged notebook relates to one in the cache, we will be able to retrieve the cache record:

cache.get_cache_record_of_staged(1)
NbCacheRecord(pk=1)
print(tabulate_stage_records(
    cache.list_staged_records(), path_length=2, cache=cache
))
  ID  URI                      Created             Assets    Cache ID
----  -----------------------  ----------------  --------  ----------
   1  example_nbs/basic.ipynb  2020-03-13 14:21         0           1

We can also retrieve a merged notebook. This is a copy of the source notebook with the following added to it from the cached notebook:

  • Selected notebook metadata keys (generally only those keys that affect its execution)

  • All code cells, with their outputs and metadata (only selected metadata can be merged if cell_meta is not None)

In this way we create a notebook that is fully up-to-date for both its code and textual content:

cache.merge_match_into_file(
    cache.get_staged_record(1).uri,
    nb_meta=('kernelspec', 'language_info', 'widgets'),
    cell_meta=None
)
(1,
 {'cells': [{'cell_type': 'markdown',
    'metadata': {},
    'source': '# a title\n\nsome text\n'},
   {'cell_type': 'code',
    'execution_count': 1,
    'metadata': {},
    'outputs': [{'name': 'stdout', 'output_type': 'stream', 'text': '1\n'}],
    'source': 'a=1\nprint(a)'}],
  'metadata': {'test_name': 'notebook1',
   'kernelspec': {'display_name': 'Python 3',
    'language': 'python',
    'name': 'python3'},
   'language_info': {'codemirror_mode': {'name': 'ipython', 'version': 3},
    'file_extension': '.py',
    'mimetype': 'text/x-python',
    'name': 'python',
    'nbconvert_exporter': 'python',
    'pygments_lexer': 'ipython3',
    'version': '3.6.1'}},
  'nbformat': 4,
  'nbformat_minor': 2})

If we add a notebook that cannot be found in the cache, it will be listed for execution:

record = cache.stage_notebook_file(Path("example_nbs", "basic_failing.ipynb"))
record
NbStageRecord(pk=2)
cache.get_cache_record_of_staged(2)  # returns None
cache.list_staged_unexecuted()
[NbStageRecord(pk=2)]
print(tabulate_stage_records(
    cache.list_staged_records(), path_length=2, cache=cache
))
  ID  URI                              Created             Assets    Cache ID
----  -------------------------------  ----------------  --------  ----------
   2  example_nbs/basic_failing.ipynb  2020-03-13 14:21         0
   1  example_nbs/basic.ipynb          2020-03-13 14:21         0           1

To remove a notebook from the staging area:

cache.discard_staged_notebook(1)
print(tabulate_stage_records(
    cache.list_staged_records(), path_length=2, cache=cache
))
  ID  URI                              Created             Assets
----  -------------------------------  ----------------  --------
   2  example_nbs/basic_failing.ipynb  2020-03-13 14:21         0

Execution

If we have some staged notebooks:

cache.clear_cache()
cache.stage_notebook_file(Path("example_nbs", "basic.ipynb"))
cache.stage_notebook_file(Path("example_nbs", "basic_failing.ipynb"))
NbStageRecord(pk=2)
print(tabulate_stage_records(
    cache.list_staged_records(), path_length=2, cache=cache
))
  ID  URI                              Created             Assets
----  -------------------------------  ----------------  --------
   2  example_nbs/basic_failing.ipynb  2020-03-13 14:21         0
   1  example_nbs/basic.ipynb          2020-03-13 14:21         0

Then we can select an executor (registered via entry points) to execute the staged notebooks.

Tip

To view the executor's log, make sure logging is enabled, or pass a logger directly to load_executor().

list_executors()
[EntryPoint.parse('basic = jupyter_cache.executors.basic:JupyterExecutorBasic')]
from logging import basicConfig, INFO
basicConfig(level=INFO)

executor = load_executor("basic", cache=cache)
executor
JupyterExecutorBasic(cache=JupyterCacheBase('/Users/cjs14/GitHub/jupyter-cache/docs/using/.jupyter_cache'))

Calling run_and_cache() will run all staged notebooks that do not already have matches in the cache. It will return a dictionary with lists for:

  • succeeded: The notebook was executed successfully with no (or only expected) exceptions

  • excepted: A notebook cell was encountered that raised an unexpected exception

  • errored: An exception occurred before/after the actual notebook execution
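As a sketch, the returned dictionary can be post-processed like this. The result value below is a hand-written stand-in with the structure described above, not real output:

```python
# Stand-in for the dictionary returned by run_and_cache(),
# with the three lists described above.
result = {
    "succeeded": ["docs/basic.ipynb"],
    "excepted": ["docs/basic_failing.ipynb"],
    "errored": [],
}

# Report each notebook under its outcome:
for status in ("succeeded", "excepted", "errored"):
    for uri in result[status]:
        print(f"{status}: {uri}")

# Fail a build (for example) if anything did not execute cleanly:
if result["excepted"] or result["errored"]:
    print("Some notebooks did not execute cleanly")
```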

Tip

Code cells can be tagged with raises-exception to let the executor know that a cell may raise an exception (see this issue on its behaviour).
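For illustration, tagging a cell might look like the following. The cell is shown here as a plain dict mirroring the nbformat v4 structure; in practice you would add the tag via the Jupyter interface or nbformat:

```python
# A minimal sketch: a code cell that is expected to raise,
# structured as in the nbformat v4 JSON schema.
cell = {
    "cell_type": "code",
    "source": "raise ValueError('expected!')",
    "metadata": {},
    "outputs": [],
    "execution_count": None,
}

# Cells are tagged via the standard "tags" metadata list:
cell["metadata"].setdefault("tags", []).append("raises-exception")
print(cell["metadata"]["tags"])  # → ['raises-exception']
```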

Note

You can use the filter_uris and/or filter_pks options to only run selected staged notebooks. You can also specify the timeout for execution in seconds using the timeout option.

result = executor.run_and_cache()
result
INFO:jupyter_cache.executors.base:Executing: /Users/cjs14/GitHub/jupyter-cache/docs/using/example_nbs/basic.ipynb
INFO:jupyter_cache.executors.base:Execution Succeeded: /Users/cjs14/GitHub/jupyter-cache/docs/using/example_nbs/basic.ipynb
INFO:jupyter_cache.executors.base:Executing: /Users/cjs14/GitHub/jupyter-cache/docs/using/example_nbs/basic_failing.ipynb
ERROR:jupyter_cache.executors.base:Execution Failed: /Users/cjs14/GitHub/jupyter-cache/docs/using/example_nbs/basic_failing.ipynb
{'succeeded': ['/Users/cjs14/GitHub/jupyter-cache/docs/using/example_nbs/basic.ipynb'],
 'excepted': ['/Users/cjs14/GitHub/jupyter-cache/docs/using/example_nbs/basic_failing.ipynb'],
 'errored': []}

Successfully executed notebooks will be added to the cache, and data about their execution (such as time taken) will be stored in the cache record:

cache.list_cache_records()
[NbCacheRecord(pk=1)]
record = cache.get_cache_record(1)
record.to_dict()
{'data': {'execution_seconds': 1.7455324890000004},
 'pk': 1,
 'uri': '/Users/cjs14/GitHub/jupyter-cache/docs/using/example_nbs/basic.ipynb',
 'accessed': datetime.datetime(2020, 3, 13, 14, 21, 50, 803042),
 'description': '',
 'hashkey': '818f3412b998fcf4fe9ca3cca11a3fc3',
 'created': datetime.datetime(2020, 3, 13, 14, 21, 50, 803031)}

Notebooks which failed to run will not be added to the cache, but details about their execution (including the exception traceback) will be added to the stage record:

record = cache.get_staged_record(2)
print(record.traceback)
Traceback (most recent call last):
  File "/Users/cjs14/GitHub/jupyter-cache/jupyter_cache/executors/basic.py", line 152, in execute
    executenb(nb_bundle.nb, cwd=tmpdirname)
  File "/anaconda/envs/mistune/lib/python3.7/site-packages/nbconvert/preprocessors/execute.py", line 737, in executenb
    return ep.preprocess(nb, resources, km=km)[0]
  File "/anaconda/envs/mistune/lib/python3.7/site-packages/nbconvert/preprocessors/execute.py", line 405, in preprocess
    nb, resources = super(ExecutePreprocessor, self).preprocess(nb, resources)
  File "/anaconda/envs/mistune/lib/python3.7/site-packages/nbconvert/preprocessors/base.py", line 69, in preprocess
    nb.cells[index], resources = self.preprocess_cell(cell, resources, index)
  File "/anaconda/envs/mistune/lib/python3.7/site-packages/nbconvert/preprocessors/execute.py", line 448, in preprocess_cell
    raise CellExecutionError.from_cell_and_msg(cell, out)
nbconvert.preprocessors.execute.CellExecutionError: An error occurred while executing the following cell:
------------------
raise Exception('oopsie!')
------------------

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-1-714b2b556897> in <module>
----> 1 raise Exception('oopsie!')

Exception: oopsie!

We now have two staged records, and one cache record:

print(tabulate_stage_records(
    cache.list_staged_records(), path_length=2, cache=cache
))
  ID  URI                              Created             Assets    Cache ID
----  -------------------------------  ----------------  --------  ----------
   2  example_nbs/basic_failing.ipynb  2020-03-13 14:21         0
   1  example_nbs/basic.ipynb          2020-03-13 14:21         0           1
print(tabulate_cache_records(
    cache.list_cache_records(), path_length=1, hashkeys=True
))
  ID  Origin URI    Created           Accessed          Hashkey
----  ------------  ----------------  ----------------  --------------------------------
   1  basic.ipynb   2020-03-13 14:21  2020-03-13 14:21  818f3412b998fcf4fe9ca3cca11a3fc3

Timeout

A timeout argument can also be passed to run_and_cache(), taking a value in seconds. Alternatively, the timeout can be specified inside the notebook metadata:

'execution': {
   'timeout': 30
 }

Note

A timeout specified in the notebook metadata will take precedence over one passed as an argument to run_and_cache().
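The precedence rule can be sketched as below. resolve_timeout is an illustrative helper, not part of the jupyter-cache API:

```python
def resolve_timeout(nb_metadata, argument_timeout):
    """Illustrative helper: a timeout in the notebook's "execution"
    metadata wins over the argument passed to run_and_cache()."""
    return nb_metadata.get("execution", {}).get("timeout", argument_timeout)

# Metadata value takes precedence over the argument:
print(resolve_timeout({"execution": {"timeout": 30}}, argument_timeout=60))  # → 30

# Without metadata, the argument is used:
print(resolve_timeout({}, argument_timeout=60))  # → 60
```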