Very powerful data analysis environment – org mode with ob-ipython

Introduction

Emacs org-mode with ob-ipython is the most powerful data analysis environment I have ever used. I find it much more powerful than the other tools I have tried, including jupyter and beaker web notebooks or just writing python in PyCharm.

Emacs org mode with ob-ipython is like a jupyter or beaker notebook, but in Emacs instead of the browser and with many more features.

The word “Emacs” may be scary. There are pre-packaged and pre-configured emacs distributions that have a much gentler learning curve, my favorite being Spacemacs (I am in the process of rebasing my config on it). You can use just 1% of the capabilities of Emacs (probably the majority of Emacs users do not approach 10% of them) and still benefit from it.

If you are going to bring up the common quote that “emacs is a fine operating system, but it lacks a decent text editor”: Emacs now has a decent text editor thanks to the vim emulation evil-mode. It’s the best vim emulation in existence, and even many vim packages have been ported. Spacemacs is a nice emacs distribution that bundles evil mode.

I will try to introduce and describe org mode with ob-ipython for users who have never used Emacs before.

Since this blog post was written in org mode, reading it linearly in the exported format is a less optimal experience than reading the org mode file directly in org mode.

Features (aka “What’s that powerful about it”)

Embed code blocks in any language

You can embed source code and evaluate it with C-c C-c. The results of the evaluation are appended after the source code block. A result can be text (including an org table) or an image (charts).

What’s more, you can have the org file and an ipython console open side by side. With ipython, reading python docstrings and code completion work well. See my screenshot.

Since ob-ipython uses jupyter, you can get the same environment for anything that has a jupyter kernel, including matlab, Scala, Spark, R and many more.

Results can be exported to many formats, like latex (demo) or this post.

This blog post is just an export of an org mode file via org2blog. All code examples were written in org mode using the workflow described in this post.

Exporting works to formats like html, latex (native and beamer), markdown, jira, odt (which can be imported into google docs and word), wiki formats and many more.

Syntax highlighting can be preserved for some exports, like html or latex.

You can learn just one way to edit documents and presentations that can be exported to the majority of formats on earth.

Programmable documents (aka “Literate programming”)

Emacs org mode with org babel is a full-fledged literate programming environment. Some people have published whole books or research papers as one large executable document in org. There is even a research paper about it.

The book Python computations in science and engineering supports org mode, and it’s a far better book reading experience than anything I have experienced before. I can tweak and re-run code examples, link from my other notes, tag or bookmark interesting sections, jump between sections and much more.

When writing latex in college, I recall situations where I was halfway through a latex document, came up with an idea for some parameter tweak, and suddenly had to re-generate all the charts.

With org mode, the document is generated programmatically. Not only can you easily re-generate it, but readers of your document can tweak parameters or supply their own data set and re-generate the whole document.

Another example is training a machine learning model. You can define your model parameters as org constants. You can tweak a model parameter and have separate org mode headings for things like “performance statistics”, “top misclassified cross-validation samples”, etc. An added benefit is that you can commit all of this to git.

As soon as you learn org mode, all of it is easy and seamless.

Built in excel alternative

Sometimes just “manually” editing the data is the most productive thing to do. You can do it with org mode spreadsheet capabilities on org tables.

The added benefit is that formulas are written in lisp, which is a cooler and more powerful language than Visual Basic. http://orgmode.org/manual/Translator-functions.html

Integration with pandas

My current Table->Pandas->Table workflow works. It is somewhat clunky, but it can be improved. See examples section.

Integration with other formats

You can export org tables to many formats by exporting them to pandas and then using a pandas exporter. Nevertheless, org itself supports sql, csv, latex and html exporters.
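
As a minimal sketch of the pandas route, assuming pandas is installed and that the org table reaches python as a list of rows with the header first (the shape used in the examples section), the snippet below builds a data frame and calls two pandas exporters. The sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: an org table as a list of rows, header row first.
table = [
    ["date", "x", "y"],
    ["<2016-06-15 Wed>", 1, 1],
    ["<2016-06-16 Thu>", 2, 2],
]

df = pd.DataFrame(table[1:], columns=table[0])

# Called without a path, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)

# to_html returns an HTML <table> element as a string.
html_text = df.to_html(index=False)

print(csv_text)
```

Passing a file name as the first argument to either exporter should write the output to disk instead of returning it.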

Pass data between languages

Similar functionality is offered by beaker notebook.

I found out that org mode as an intermediate format for data sometimes works better for me.

Since the intermediate format for a data frame is the org table, I can import a data frame into org, edit it as a spreadsheet and export it back. See Pass data directly between languages in the examples section.

Outline view is powerful for organizing your work

Org mode outline view is very handy for organizing your work. When working on some larger problem, I focus on only a small subset of it. Org mode lets me expand just the sections that are currently relevant.

I also find embedding TODO items in the tree quite handy. When I encounter some problem I mark a subtree as TODO, and I can later inspect just the subtree headlines with TODO items in them. See: todo.png

You can link to your existing codebase with org-ctags. It seems possible to provide IDE-like navigation between code defined in org src buffers, but I haven’t configured it yet.

Many more

You don’t have to use all features offered by org mode.

Embed latex formulas

Also works in html export with mathjax.

Fast integration with source control

I like to keep my notes in source control. To avoid the overhead of additional committing, I use magit-mode. Out of the box you can commit directly from Emacs with 6 keystrokes. With a few lines of elisp you can auto-generate commit messages or commit automatically based on some condition (e.g. on save, on file close or via focus-out-hook).

Everything in org is plain text, including results of eval of code blocks, so it will be treated well by the source control.

Run a webserver that will let people do basic editing of your org files in the browser

Spaced repetition framework (remember all those pesky maths formulas)

If you are like me, you have forgotten a lot of maths formulas since college. Spaced repetition is a learning methodology that helps you avoid forgetting important facts like maths formulas. I recommend this very good post about spaced repetition in general from gwern.

People primarily use spaced repetition for learning words in new languages, but I use it for maths formulas or technical facts.

There are spaced repetition tools like anki or super memo, but as soon as you want advanced features like latex support, they support them very badly (IMO) or not at all.

org-drill is a spaced repetition framework for org mode that allows you to use all of the org features for creating flash cards. Also take a look at this interesting blog post.

Calendar

Managing papers citations

Tagging

Agenda views

Go on a diet

Installation

Install Emacs (with vim emulation)

Although I don’t use it myself yet, I recommend Spacemacs, a pre-configured emacs distribution, like the “Ubuntu” of Emacs.

Install python packages

If you skip these commands, you may run into trouble.

pip install --upgrade pip
pip install --upgrade ipython
pip install --upgrade pyzmq
pip install --upgrade jupyter

Install ob-ipython

org mode should be bundled with your emacs installation. If you are new to emacs, you can install packages such as ob-ipython using M-x package-install.

Elisp configuration

Add to your Emacs config:

(require 'org)
(require 'ob-ipython)

;; don't prompt me to confirm every time I want to evaluate a block
(setq org-confirm-babel-evaluate nil)

;;; display/update images in the buffer after I evaluate
(add-hook 'org-babel-after-execute-hook 'org-display-inline-images 'append)

Troubleshooting

First check whether restarting the ipython kernel helps:

(ob-ipython-kill-kernel)

Open the “Python” buffer to see python errors

Toggle elisp debug on error

(toggle-debug-on-error)

My workflow

I settled on a workflow of having two buffers open side by side. On one side I have the org file, on the other side the ipython console.

I experiment with commands in the ipython console, and I copy the permanent results I want to remember or share with people back into the org src block.

Both windows re-use the same ipython kernel (so they share variables). You may have multiple kernels running. I have code completion and python docstrings in the ipython buffer.

Screenshot

ob-ipython.png

Default ipython configuration

If you want to run some code in each ipython block, you can add it to ~/.ipython/profile_default/startup. For example, to avoid adding %matplotlib inline to each source code block:

echo "%matplotlib inline" >> ~/.ipython/profile_default/startup/66-matplot.ipy

TODO Configure yasnippet

ob-ipython docs suggest yasnippet for editing code. So far I have been using custom elisp code, but a few things could be nicer with yasnippet.

# -*- mode: snippet -*-
# name: ipython block
# key: py
# --
#+BEGIN_SRC ipython :session ${1::file ${2:$$(let ((temporary-file-directory "./")) (make-temp-file "py" nil ".png"))} }:exports ${3:both}
$0
#+END_SRC

Examples

Org table to pandas and plotting

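| date             |  x | y |  z |
|------------------+----+---+----|
| <2016-06-15 Wed> |  1 | 1 |  1 |
| <2016-06-16 Thu> |  2 | 2 |  2 |
| <2016-06-17 Fri> |  4 | 3 |  3 |
| <2016-06-18 Sat> |  8 | 4 |  4 |
| <2016-06-19 Sun> | 16 | 5 | 30 |
| <2016-06-20 Mon> | 32 | 6 | 40 |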

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

df = pd.DataFrame(table[1:], columns=table[0])
df.plot()

plot.png
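
The date column in the table above arrives in python as org timestamp strings such as <2016-06-15 Wed>, so it needs converting before it can serve as a time axis. Here is a minimal sketch using only the standard library; parse_org_date is a hypothetical helper name, and it assumes date-only timestamps in the format shown in the table.

```python
from datetime import date, datetime

def parse_org_date(stamp):
    """Turn an org timestamp like '<2016-06-15 Wed>' into a datetime.date."""
    # Drop the angle (active) or square (inactive) brackets and keep only
    # the ISO date; the weekday name is locale-dependent, so it is ignored.
    iso_part = stamp.strip("<>[]").split()[0]
    return datetime.strptime(iso_part, "%Y-%m-%d").date()

dates = [parse_org_date(s) for s in ["<2016-06-15 Wed>", "<2016-06-16 Thu>"]]
print(dates)
```

With pandas, the converted values could then be assigned back with something like df["date"] = df["date"].apply(parse_org_date) and used as the index before plotting.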

Org table -> Pandas -> Org table

You have to write a small reusable snippet to print a pandas data frame in org format. You can add it to your builtin ipython code snippets. You also need to tell the src block to interpret results directly with :results output raw drawer :noweb yes.

def arr_to_org(arr):
    line = "|".join(str(item) for item in arr)
    return "|{}|".format(line)

def df_to_org(df):
    return "\n".join([arr_to_org(df.columns)] +
                     [arr_to_org(row) for row in df.values])

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

df = pd.DataFrame(table[1:], columns=table[0])
df.y = df.y.apply(lambda y: y * 2)
print(df_to_org(df))

| date             |  x |  y |  z |
| <2016-06-15 Wed> |  1 |  2 |  1 |
| <2016-06-16 Thu> |  2 |  4 |  2 |
| <2016-06-17 Fri> |  4 |  6 |  3 |
| <2016-06-18 Sat> |  8 |  8 |  4 |
| <2016-06-19 Sun> | 16 | 10 | 30 |
| <2016-06-20 Mon> | 32 | 12 | 40 |

Afterwards, you may assign the result table to a variable, edit it with org spreadsheet capabilities and use it in another python script.
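
Outside of org babel, for example in a standalone script that reads a table out of an .org file, the same round trip can be sketched with the standard library alone. org_to_rows and rows_to_org are hypothetical helper names, cells come back as strings, and the parser assumes that no cell contains a literal | character.

```python
def org_to_rows(text):
    """Parse org table text into a list of rows (lists of cell strings)."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        # Skip separator lines such as |---+---| and anything that is not a row.
        if not line.startswith("|") or line.startswith("|-"):
            continue
        # Remove the outer pipes, then split the remaining text into cells.
        inner = line[1:-1] if line.endswith("|") else line[1:]
        rows.append([cell.strip() for cell in inner.split("|")])
    return rows

def rows_to_org(rows):
    """Format rows as org table text, with a separator after the header row."""
    lines = ["|" + "|".join(str(cell) for cell in row) + "|" for row in rows]
    return "\n".join(lines[:1] + ["|-"] + lines[1:])

org_text = """
| date             | x | y |
|------------------+---+---|
| <2016-06-15 Wed> | 1 | 2 |
| <2016-06-16 Thu> | 2 | 4 |
"""

rows = org_to_rows(org_text)
print(rows_to_org(rows))
```

The output is not column-aligned; pressing TAB inside the table in org mode realigns it.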

Share code between code blocks

Since all code is executed within the same ipython kernel, it’s enough to put blocks one after another.

constant = 30

def some_function(x):
    return constant * x

print(some_function(30))

TODO Connect to existing ipython kernel

I added support for connecting to an existing ipython kernel in https://github.com/gregsexton/ob-ipython/pull/71/files.

You can start an ipython kernel on a server with lots of ram and cpu and connect to it from a lightweight local machine running emacs.

Create the kernel using the following script (outside of org mode, as it blocks):

#!/usr/bin/env python
import os
from ipykernel.kernelapp import IPKernelApp

app = IPKernelApp.instance()
app.initialize([])
kernel = app.kernel
kernel.shell.push({'print_me': 'Running in previously started kernel.'})

app.start()

It will give you the name of a connection json file. Pass it as the session name.

#+BEGIN_SRC ipython :session kernel-8520.json
  print(print_me)
#+END_SRC
Running in previously started kernel.
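
When the kernel runs on a remote server, it can help to see which address and ports the connection file refers to, for instance before forwarding them to the local machine. The sketch below assumes the standard Jupyter connection file fields; summarize_connection is a hypothetical helper, and the sample values are made up so the snippet runs without a live kernel.

```python
import json
import os
import tempfile

PORT_KEYS = ["shell_port", "iopub_port", "stdin_port", "control_port", "hb_port"]

def summarize_connection(path):
    """Return the address and ports stored in a kernel connection file."""
    with open(path) as handle:
        info = json.load(handle)
    return {
        "address": "{}://{}".format(info["transport"], info["ip"]),
        "ports": {key: info[key] for key in PORT_KEYS},
    }

# Made-up sample standing in for a real kernel-NNNN.json file.
sample = {
    "transport": "tcp", "ip": "127.0.0.1", "key": "not-a-real-key",
    "signature_scheme": "hmac-sha256",
    "shell_port": 50001, "iopub_port": 50002, "stdin_port": 50003,
    "control_port": 50004, "hb_port": 50005,
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample, tmp)

summary = summarize_connection(tmp.name)
os.remove(tmp.name)
print(summary["address"], summary["ports"])
```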

TODO Use global constant

TODO Data frame sharing with org tables

TODO Pass data directly between languages

TODO Different language kernels

This should work:

#+BEGIN_SRC ipython :session :kernel clojure
(+ 1 2)
#+END_SRC

#+RESULTS:
: 3

Additional configuration I plan to do

Problems I did not resolve yet:

TODO ob-ipython-inspect in popup

Currently it opens a separate buffer. I would prefer a popup.

TODO Configure the org-edit-src-code to use ipython completion.

Currently, I have code completion working only in the ipython buffer. It seems doable to configure it in the edit source block as well.

TODO Capture results from ipython to src block.

To avoid manual copying between the ipython buffer and the source code block, I could implement an ob-ipython-capture function that would add the last executed command in the ipython console to the src block. Keyboard macros can work across buffers, so this could be a simple keyboard macro, but I haven’t tried it out yet.

TODO Figure out why SVG doesn’t work

To produce SVG graphics rather than png, you may specify the output format globally to IPython.

%config InlineBackend.figure_format = 'svg'

12 Comments

      1. Yeah, I’m facing this problem, too. I installed htmlize, which helped. Thanks for the gist. Very cool of you.

  1. Heya, this bit of code simplifies the printing of tables a bit:

    from tabulate import tabulate

    def tab(df, headers="keys"):
        """
        Pretty print DataFrame in an org table. Org tables are good.
        They also export nicely.
        """
        return tabulate(df, tablefmt="orgtbl", headers=headers)

    By the way, thanks for this article. I’m absolutely loving this package. Way better than ein.
