Introduction
GitHub data on Google BigQuery
GitHub recently made all of its files queryable with Google BigQuery, a SQL-like distributed query engine. See the announcement from GitHub. You can execute the BigQuery queries in the BigQuery console.
What are numpy, scipy and pandas, and why are the top functions useful?
numpy, scipy and pandas are popular Python packages for data analysis. They are quite large, so it may be hard to distinguish between the functions you use all the time and the functions you use only once in a while.
When looking at the results in this post, my reaction to the majority of entries was “well, duh, obviously”, but there were a few entries that I had somehow missed and had been doing the wrong way. For example, I must shamefully admit that I somehow missed pandas .iloc and had been using .reset_index instead to get the n-th row.
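As a minimal illustration (the .reset_index variant below is just one possible detour, not necessarily the exact code I used):

import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]}, index=["x", "y", "z"])

# Direct positional lookup of the third row.
print(df.iloc[2])

# The roundabout route: reset the index to positions, then look up by label.
print(df.reset_index(drop=True).loc[2])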
It would also be very cool to sort code completions in Python packages based on their frequency on GitHub, and I may hack something together for Emacs soon.
Methodology
Using the new GitHub data on BigQuery, I calculated the most popular numpy, scipy and pandas functions and modules used in GitHub repositories.
Results are approximate. I first look for Python files that import pandas (or numpy/scipy in their respective sections). Then I extract matches of the regular expression:
r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)'
It captures all characters that follow pd. or pandas., except for those in the negated character group.
The captures may include modules as well as some false positives. When constructing the regular expression, my priority was to avoid false negatives, so there are some false positives like pandas.pydata.org. It may also pick up modules when parsing import lines, e.g. it would extract “io” from import scipy.io.
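To make the extraction concrete, here is a small Python sketch (not part of the BigQuery pipeline) that applies the same regular expression to a few made-up lines:

import re

# The extraction regex from above, applied locally for illustration.
pattern = re.compile(r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)')

samples = [
    ' df = pd.DataFrame({"a": [1, 2]})',            # -> ['DataFrame']
    ' s = pandas.Series([1, 2, 3])',                # -> ['Series']
    'import pandas.io.sql as psql',                 # -> ['io.sql']    (a module, not a function)
    ' webbrowser.open("http://pandas.pydata.org")', # -> ['pydata.org'] (false positive)
]

for line in samples:
    print(pattern.findall(line))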
Data source
I will be using the table contents_py from https://bigquery.cloud.google.com/dataset/fh-bigquery:github_extracts. It is a smaller BigQuery table that contains only the Python files stored on GitHub.
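The queries in this post can be pasted into the BigQuery console, but they can also be run programmatically. Below is a minimal sketch using the google-cloud-bigquery Python client; it assumes you have a Google Cloud project with BigQuery enabled and credentials configured, and the count query is just a placeholder rather than one of the queries from this post:

from google.cloud import bigquery

client = bigquery.Client()

query = '''
SELECT COUNT(*) AS python_files
FROM [fh-bigquery:github_extracts.contents_py]
WHERE content CONTAINS "import pandas"
'''

# The queries in this post are written in BigQuery legacy SQL,
# hence use_legacy_sql=True.
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
for row in client.query(query, job_config=job_config).result():
    print(row.python_files)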
Link to example usage
The URL to the example usage may sometimes be broken, because it assumes the file lives on the master branch. Other tables, like sample_contents, include the branch in a sample_ref field, which lets me generate the correct link.
contents_py only lists the repository and file path. Until this gets resolved, I generate the link based on the assumption that all files are on the master branch.
I could potentially also link to the exact line number, but with the current table schema it would be a lot of work. It would be much easier if the content were exposed as a repeated field of (line contents, line number) pairs. I asked the Google BigQuery team about it.
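For reference, the link construction amounts to the following hypothetical helper, which mirrors the CONCAT(...) expression in the queries below (the repository and path are made up):

# Mirrors CONCAT("https://github.com/", repo, "/blob/master/", path) from the
# queries; it hard-codes the master branch, which is why some links break.
def example_url(sample_repo_name, sample_path):
    return "https://github.com/{}/blob/master/{}".format(sample_repo_name, sample_path)

print(example_url("someuser/somerepo", "analysis/load_data.py"))
# https://github.com/someuser/somerepo/blob/master/analysis/load_data.py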
This post on GitHub
The revision history of this blog post is stored on GitHub.
Top pandas functions and modules
Results are approximate and based on the heuristic assumption that people usually prefix pandas functions with “pd.” or “pandas.”.
SELECT
  REGEXP_EXTRACT(line, r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) AS count_total,
  CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url,
FROM (
  SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name
  FROM [fh-bigquery:github_extracts.contents_py]
  WHERE content CONTAINS "import pandas"
  HAVING NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:pd|pandas)\.')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;
Full result list in Google Docs. Top 20 results:
Top pandas data frame functions
Results are again approximate and based on the heuristic assumption that data frames are usually named with the suffix “df”. To filter out noise, only files containing “import pandas” and matching the regexp r".*df\s=.*(?:pandas|pd)\." are included.
SELECT REGEXP_EXTRACT(line, r"df\.([a-zA-Z-_\.]+)") AS pandas_function, COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos, CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url FROM ( SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name FROM [fh-bigquery:github_extracts.contents_py] WHERE content CONTAINS "import pandas" and REGEXP_MATCH(content, r".*df\s=.*(?:pandas|pd)\.") HAVING line CONTAINS "df.") GROUP BY 1 HAVING LENGTH(pandas_function) > 1 ORDER BY 2 DESC LIMIT 1000;
Full result list in Google Docs. Top 20 results:
Top numpy functions and modules
Results are again approximate; the query is a simple string replacement of the pandas version.
SELECT
  REGEXP_EXTRACT(line, r'[^a-zA-Z](?:np|numpy)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) AS count_total,
  CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url,
FROM (
  SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name
  FROM [fh-bigquery:github_extracts.contents_py]
  WHERE content CONTAINS "import numpy"
  HAVING NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:np|numpy)\.')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;
Full result list in Google Docs. Top 20 results:
Top scipy functions and modules
Results are again approximate; the query is a simple string replacement of the numpy version.
SELECT
  REGEXP_EXTRACT(line, r'[^a-zA-Z](?:sp|scipy)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) AS count_total,
  CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url,
FROM (
  SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name
  FROM [fh-bigquery:github_extracts.contents_py]
  WHERE content CONTAINS "import scipy"
  HAVING NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:sp|scipy)\.')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;
Full result list in Google Docs. Top 20 results:
Attribution
The regular expression used to extract functions was improved by Felipe in the comments.
Other posts
You may also take a look at my other posts:
From Felipe’s comment:
Try this one:
SELECT REGEXP_EXTRACT(line, r'[^a-zA-Z](pd\.[^",\(\,\`) \':\[\]/={}]*)') function, COUNT(*) c
FROM (
SELECT SPLIT(content, '\n') AS line,
FROM [fh-bigquery:github_extracts.contents_py]
WHERE content CONTAINS "import pandas as pd"
HAVING
NOT LEFT(LTRIM(line),1)='#'
AND REGEXP_MATCH(line, r'[^a-zA-Z]pd\.')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1000
pd.DataFrame 39718
pd.Series 16664
pd.read_csv 15876
pd.concat 6930
pd.Timestamp 6411