Top pandas, numpy and scipy functions and modules used in github repos

Introduction

Github data on Google BigQuery

Github recently introduced all files query-able using the Google BigQuery, SQL-like distributed query engine. See the announcement from github. You can execute the BigQuery queries at the BigQuery console.

What is numpy, scipy and pandas and why top functions are useful?

numpy, scipy and pandas are popular python packages for data analysis. They are quite big, so it may be hard to distinguish between functions you would use all the time and functions that you use once in a while.

When looking at results in this post, my reaction to majority of entries was “well, duh, obviously”, but there were a few entries that I somehow missed and I was doing things the wrong way. For example, I must shamefully admit that I somehow missed pandas .iloc function and I have been doing .reset_index instead to get the n-th row.

It would also be very cool to sort code completions in python packages based on the frequency on github and I may hack something for Emacs soon.

Methodology

Using the new github data on BigQuery I calculated the most popular numpy, scipy and pandas functions and modules used in github repos.

Results are approximate. I firstly look for python files that import pandas (or numpy/scipy in their respective sections). Then I extract the regular expression:

r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)'

It looks for all characters following the pd. or pandas. except the negated character group.

It may include modules or some false positives

When constructing regular expression, my priority was to avoid false negatives, so there are some false positives like pandas.pydata.org. It may sometimes include modules from parsing the import lines, like it would parse out “io” from import scipy.io.

Data source

I will be using table contents_py from https://bigquery.cloud.google.com/dataset/fh-bigquery:github_extracts. It is a smaller BigQuery table that contains only python files stored on github.

This post on github

Revision history of this blog post is stored on github.

Top pandas functions and modules

Results are approximate and based on the heuristic assumption that people usually prefix python pandas functions by “pd.” or “pandas.”.

SELECT
  REGEXP_EXTRACT(line,
        r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) as count_total,
  CONCAT("https://github.com/",
        FIRST(sample_repo_name),
        "/blob/master/",
        FIRST(sample_path)) AS example_url,
FROM (
  SELECT
    SPLIT(content, '\n') AS line,
    sample_path,
    sample_repo_name
  FROM
    [fh-bigquery:github_extracts.contents_py]
  WHERE
    content CONTAINS "import pandas"
  HAVING
    NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:pd|pandas)\.') )
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;

Full result list in google docs. Top 20 results:

function count_distinct_repos count_total example_url
DataFrame 5486 47478
read_csv 4056 17567 https://github.com/fcollman/MakeAT/blob/master/make_make_file.py
Series 2248 19124
concat 1869 7456
to_datetime 774 3176 https://github.com/cbyn/bitpredict/blob/master/model/features.py
merge 650 2642 https://github.com/dmnfarrell/mirnaseq/blob/master/mirdeep2.py
date_range 548 3233 https://github.com/and2egg/philharmonic/blob/master/philharmonic/simulator/environment.py
read_table 499 1683 https://github.com/cdeboever3/cdpybio/blob/master/cdpybio/express.py
util.testing 477 1856 https://github.com/sauloal/cnidaria/blob/master/scripts/venv/lib/python2.7/site-packages/pandas/tseries/tests/test_timeseries_legacy.py
isnull 468 1459 https://github.com/Weissger/ext2rdf/blob/master/src/RDFConverter/TripleStructureConverter.py
DataFrame.from_dict 399 1455 https://github.com/mdbartos/vic_utils/blob/master/deprecated/mohseni_reg.py
Timestamp 387 7029 https://github.com/paulperry/quant/blob/master/vti_agg_7030.py
DatetimeIndex 336 1629 https://github.com/readevalprint/zipline/blob/master/zipline/utils/tradingcalendar.py
Index 322 2772 https://github.com/caseyclements/dask/blob/master/dask/dataframe/shuffle.py
read_excel 302 946 https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Lab/SPF_forecasts.py
notnull 284 713 https://github.com/DataViva/dataviva-scripts/blob/master/scripts/secex_monthly/_rdo_temp.py
DataFrame.from_csv 265 802 https://github.com/idbedead/RNA-sequence-tools/blob/master/RNA_Seq_analysis/make_monocle_data_js.py
HDFStore 251 783
DataFrame.from_records 249 534
MultiIndex.from_tuples 237 744 https://github.com/ZoomerAnalytics/xlwings/blob/master/xlwings/tests/test_xlwings.py
rolling_mean 233 651 https://github.com/Ernestyj/PyStudy/blob/master/finance/DaysTest/DaysDataPrepare.py

Top pandas data frame functions

Results are again approximate and based on the heuristic assumption that data frames are usually named with the suffix “df”. To filter out noise, only files containing “import pandas” and matching regexp r”.*df\s=.*(?:pandas|pd)\.” are included.

SELECT
  REGEXP_EXTRACT(line, r"df\.([a-zA-Z-_\.]+)") AS pandas_function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  CONCAT("https://github.com/",
          FIRST(sample_repo_name),
          "/blob/master/",
          FIRST(sample_path)) AS example_url
FROM (
  SELECT
    SPLIT(content, '\n') AS line,
    sample_path,
    sample_repo_name
  FROM
    [fh-bigquery:github_extracts.contents_py]
  WHERE
    content CONTAINS "import pandas"
    and REGEXP_MATCH(content, r".*df\s=.*(?:pandas|pd)\.")
  HAVING
    line CONTAINS "df.")
GROUP BY 1
HAVING LENGTH(pandas_function) > 1
ORDER BY 2 DESC
LIMIT 1000;

Full result list in Google Docs. Top 20 results:

pandas_function count_distinct_repos example_url
columns 1290
index 958
to_csv 945
loc 729
groupby 614 https://github.com/fepz/AyCC/blob/master/process_results.py
set_index 571
drop 473 https://github.com/lukassnoek/skbold/blob/master/skbold/exp_model/parse_presentation_logfile.py
ix 450
iloc 418
shape 387 https://github.com/sdpython/ensae_projects/blob/master/_unittests/ut_data/test_data_helper.py
iterrows 348 https://github.com/rmhyman/DataScience/blob/master/Lesson1/titanic_data_heuristic1.py
sort 341
append 340 https://github.com/MadsJensen/CAA/blob/master/calc_ali.py
copy 298 https://github.com/wavelets/lifelines/blob/master/tests/test_estimation.py
rename 288 https://github.com/Kirubaharan/hydrology/blob/master/Lake_bathymetry/dt_bathymetry/bathymetry_gps_merge.py
reset_index 283
apply 278 https://github.com/lukovkin/ufcnn-keras/blob/master/models/UFCNN_predict.py
dropna 273
head 263 https://github.com/Kirubaharan/hydrology/blob/master/Lake_bathymetry/dt_bathymetry/bathymetry_gps_merge.py
values 259
fillna 228
plot 203 https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Python/bootcamp_pandas-input.py

Top numpy functions and modules

Results are again approximate and it’s a simple string replace from the pandas version.

SELECT
  REGEXP_EXTRACT(line,
        r'[^a-zA-Z](?:np|numpy)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) as count_total,
  CONCAT("https://github.com/",
        FIRST(sample_repo_name),
        "/blob/master/",
        FIRST(sample_path)) AS example_url,
FROM (
  SELECT
    SPLIT(content, '\n') AS line,
    sample_path,
    sample_repo_name
  FROM
    [fh-bigquery:github_extracts.contents_py]
  WHERE
    content CONTAINS "import numpy"
  HAVING
    NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:np|numpy)\.') )
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;

Full result list in Google docs. Top 20 results:

function count_distinct_repos count_total example_url
array 23877 604263 https://github.com/AlexBourassa/Generic_UI/blob/master/Widgets/GraphWidget/Fitter.py
zeros 19406 280579 https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py
arange 13587 158705 https://github.com/jamesp/jpy/blob/master/jpy/maths/derive.py
sqrt 10297 77810 https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py
ones 10028 80998 https://github.com/iamtrask/keras/blob/master/keras/models.py
sum 9829 85793 https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py
mean 9773 56402 https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py
linspace 8769 62970 https://github.com/Titan-C/learn-dmft/blob/master/examples/plot_ipt_coex.py
asarray 7745 82563 https://github.com/ratnania/caid/blob/master/caid-gui/viewer.py
ndarray 7617 71141 https://github.com/eirikgje/healpy/blob/master/healpy/pixelfunc.py
dot 7386 90422 https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py
exp 6979 42446 https://github.com/pkgw/pwkit/blob/master/pwkit/dulk_models.py
abs 6979 43168 https://github.com/eirikgje/healpy/blob/master/healpy/pixelfunc.py
where 6781 56778 https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py
empty 6632 51718 https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py
max 6533 31860
concatenate 6425 36532 https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py
log 5742 33105 https://github.com/pkgw/pwkit/blob/master/pwkit/dulk_models.py
sin 5302 25481 https://github.com/jamesp/jpy/blob/master/jpy/maths/derive.py
vstack 5251 25913 https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py
min 5064 21231 https://github.com/gwpy/seismon/blob/master/seismon/psd.py

Top scipy functions and modules

Results are again approximate and it’s a simple string replace from the numpy version.

SELECT
  REGEXP_EXTRACT(line,
        r'[^a-zA-Z](?:sp|scipy)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) as count_total,
  CONCAT("https://github.com/",
        FIRST(sample_repo_name),
        "/blob/master/",
        FIRST(sample_path)) AS example_url,
FROM (
  SELECT
    SPLIT(content, '\n') AS line,
    sample_path,
    sample_repo_name
  FROM
    [fh-bigquery:github_extracts.contents_py]
  WHERE
    content CONTAINS "import scipy"
  HAVING
    NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:sp|scipy)\.') )
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;

Full result list in google docs. Top 20 results:

function count_distinct_repos count_total example_url
stats 2281 5717 https://github.com/geophysics/mtpy/blob/master/mtpy/modeling/occam2d.py
sparse 1706 6500 https://github.com/tscholak/smbkmeans/blob/master/tfidf_smbkmeans.py
optimize 1531 2788 https://github.com/cni/t1fit/blob/master/t1_fitter.py
io 1218 3079 https://github.com/wojtekwalczak/FB_datalab/blob/master/lib/most_distinctive.py
linalg 1199 3047
interpolate 972 2022 https://github.com/geophysics/mtpy/blob/master/mtpy/modeling/occam2d.py
special 968 1792 https://github.com/liberatorqjw/scikit-learn/blob/master/sklearn/utils/fixes.py
signal 915 1883 https://github.com/garibaldu/radioblobs/blob/master/code/code_1d/old_and_extra/score_GA.py
ndimage 864 2196 https://github.com/cni/t1fit/blob/master/t1_fitter.py
misc 650 1135 https://github.com/sillvan/hyperspy/blob/master/hyperspy/drawing/_markers/point.py
integrate 574 986 https://github.com/kleskjr/scipy/blob/master/scipy/stats/tests/test_distributions.py
sparse.linalg 495 1056
spatial.distance 469 721 https://github.com/wjchen84/rapprentice/blob/master/rapprentice/registration.py
spatial 420 766 https://github.com/delmic/odemis/blob/master/src/odemis/acq/align/coordinates.py
io.loadmat 414 1501 https://github.com/jdsika/TUM_SmartCardLab/blob/master/DPA/benchmark.py
sparse.csr_matrix 401 1305
org 369 894 https://github.com/chiotlune/ext/blob/master/gnuradio-3.7.0.1/gr-filter/examples/fir_filter_ccc.py
csr_matrix 361 2541 https://github.com/tscholak/smbkmeans/blob/master/tfidf_smbkmeans.py
array 352 3873 https://github.com/PMBio/limix/blob/master/limix/deprecated/io/data_util.py
issparse 334 2309 https://github.com/thilbern/scikit-learn/blob/master/sklearn/linear_model/stochastic_gradient.py

Attribution

Regular expression used to extract function have improved upon by Felipe in the comment.

8 Comments

  1. Try this one:

    SELECT REGEXP_EXTRACT(line, r'[^a-zA-Z](pd\.[^”,\(\,\`) \’:\[\]/={}]*)’) function, COUNT(*) c
    FROM (
    SELECT SPLIT(content, ‘\n’) AS line,
    FROM [fh-bigquery:github_extracts.contents_py]
    WHERE content CONTAINS “import pandas as pd”
    HAVING
    NOT LEFT(LTRIM(line),1)=’#’
    AND REGEXP_MATCH(line, r'[^a-zA-Z]pd\.’)
    )
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 1000

    pd.DataFrame 39718
    pd.Series 16664
    pd.read_csv 15876
    pd.concat 6930
    pd.Timestamp 6411