Mining your IPython Notebooks with nbgrep
July 23, 2014

One of the things I like most about IPython Notebook is right there in the name: it’s a great notebook. Often I’ll figure out how to do something in one of my notebooks, be it talking to a certain API, formatting a graph in a particular way, parsing a certain kind of file, and so on.

The problem is this: I have notebooks in lots of directories around my machine. Each data collection/analysis project has its own repository. I’ve got notebooks that go with presentations, notebooks for blog posts, a general playground directory... you get the idea. So I’ll often remember that I solved a particular problem in the past, but not where I solved it.

The second problem is that grep and ack don’t work well with .ipynb files.

  1. They’re not normal line-oriented text, they’re JSON files.
  2. They don’t just contain the code; they also contain your text and, more distractingly, the cell outputs, many of which might be SVG or base64-encoded images, large HTML tables, etc.

I found a useful technique from Michelle Gill that helps address this second problem. Using jq, a command-line JSON processor, you can pick out only the code cells.

$ jq '.worksheets[].cells[] | select(.cell_type=="code") | .input[]' MyFile.ipynb
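
If jq isn’t installed, the same filter can be approximated in Python. This is only a sketch, not part of nbgrep: the function name code_lines is made up for illustration, and the fallback to a top-level "cells" list with "source" fields (the layout used by the newer nbformat 4) is an assumption that goes beyond the jq filter above.

```python
import json


def code_lines(notebook_path):
    """Yield each line of source from the code cells of one notebook."""
    with open(notebook_path, encoding="utf-8") as handle:
        notebook = json.load(handle)

    # The jq filter above walks .worksheets[].cells[] (nbformat 3).
    # Assumption: newer notebooks keep cells in a top-level "cells" list.
    cells = [
        cell
        for worksheet in notebook.get("worksheets", [])
        for cell in worksheet.get("cells", [])
    ]
    cells += notebook.get("cells", [])

    for cell in cells:
        if cell.get("cell_type") != "code":
            continue  # skip markdown, headings, raw text
        # Code is stored under "input" (nbformat 3) or "source" (nbformat 4),
        # either as a single string or as a list of line strings.
        source = cell.get("input", cell.get("source", ""))
        text = source if isinstance(source, str) else "".join(source)
        # Outputs (images, HTML tables) are never touched, only the code.
        yield from text.splitlines()
```

Calling list(code_lines("MyFile.ipynb")) gives plain lines of code that an ordinary line-oriented search can work with.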

Great! Now I just need to find all the notebooks. Since I’m on OSX, I know that Spotlight knows where all my .ipynb files are, and I can access that from the CLI with mdfind.

$ mdfind -onlyin ~/work -name '.ipynb'

Update: Thanks to a fork by Thomas Spura, this now works on Linux using find if you don’t have mdfind; I’ve updated the original gist.
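
For a version that needs neither mdfind nor find, a directory walk in Python does the same job, just more slowly than a Spotlight lookup. This is a rough sketch rather than what the gist does: find_notebooks is a hypothetical name, and skipping the .ipynb_checkpoints autosave directories is my own addition.

```python
import os


def find_notebooks(root):
    """Return the sorted paths of all .ipynb files under root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(os.path.expanduser(root)):
        # Assumption: autosave copies in .ipynb_checkpoints are unwanted
        # duplicates, so prune those directories from the walk.
        dirnames[:] = [d for d in dirnames if d != ".ipynb_checkpoints"]
        found.extend(
            os.path.join(dirpath, name)
            for name in filenames
            if name.endswith(".ipynb")
        )
    return sorted(found)
```

find_notebooks("~/work") covers the same ground as the mdfind call above, without depending on a search index.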

Bolting those ideas together, I have the very useful script nbgrep. So if I want to find the notebook where I was playing around with the Twitter API, it’s an nbgrep twitter away. (Bonus: in the terminal, you even get Python syntax highlighting.)

$ nbgrep twitter

/Users/jbarratt/work/notebookcookbook/Tweet Relief.ipynb:

import twitter
auth = twitter.oauth.OAuth(creds['access_token'], 
twitter_api = twitter.Twitter(auth=auth)
search_results ='#oscon', count
    search_results =**kwargs)
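
The grouping in that output (a notebook path, then the matching lines of code) is easy to mimic. The sketch below is not the nbgrep script itself, which is a shell script. It assumes you already have (path, code lines) pairs for each notebook, uses the hypothetical name grep_code, and matches case-sensitively, which may differ from how nbgrep behaves.

```python
import re


def grep_code(notebooks, pattern):
    """Return a report of matching code lines, grouped by notebook.

    notebooks is an iterable of (path, lines) pairs, where lines holds
    the code-cell source lines already extracted from that notebook.
    """
    regex = re.compile(pattern)  # assumption: case-sensitive matching
    report = []
    for path, lines in notebooks:
        hits = [line for line in lines if regex.search(line)]
        if hits:
            report.append(f"{path}:")  # header line, as in the output above
            report.extend(hits)
            report.append("")  # blank line between notebooks
    return report
```

Printing "\n".join(grep_code(pairs, "twitter")) lists only the notebooks that mention twitter, and only the lines that matched, which is why the statements in the sample output look cut off.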