Introduction
GitHub data on Google BigQuery
GitHub recently made all of its files queryable with Google BigQuery, a SQL-like distributed query engine. See the announcement from GitHub. You can execute the BigQuery queries in the BigQuery console.
What are numpy, scipy and pandas, and why are the top functions useful?
numpy, scipy and pandas are popular Python packages for data analysis. They are quite large, so it may be hard to distinguish between the functions you use all the time and the functions you use only once in a while.
When looking at the results in this post, my reaction to the majority of entries was “well, duh, obviously”, but there were a few entries that I had somehow missed and had been doing the wrong way. For example, I must shamefully admit that I somehow missed pandas .iloc and had been using .reset_index instead to get the n-th row.
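As a minimal illustration (the .reset_index variant below is just one possible detour, not necessarily the exact code I used):

import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]}, index=["x", "y", "z"])

# Direct positional lookup of the third row.
print(df.iloc[2])

# The roundabout route: reset the index to positions, then look up by label.
print(df.reset_index(drop=True).loc[2])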
It would also be very cool to sort code completions in Python packages based on their frequency on GitHub, and I may hack something together for Emacs soon.
Methodology
Using the new GitHub data on BigQuery, I calculated the most popular numpy, scipy and pandas functions and modules used in GitHub repositories.
Results are approximate. I first look for Python files that import pandas (or numpy/scipy in their respective sections). Then I extract matches of the regular expression:
r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)'
It captures all characters that follow pd. or pandas., except for those in the negated character group.
The captures may include modules as well as some false positives. When constructing the regular expression, my priority was to avoid false negatives, so there are some false positives like pandas.pydata.org. It may also pick up modules when parsing import lines, e.g. it would extract “io” from import scipy.io.
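To make the extraction concrete, here is a small Python sketch (not part of the BigQuery pipeline) that applies the same regular expression to a few made-up lines:

import re

# The extraction regex from above, applied locally for illustration.
pattern = re.compile(r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)')

samples = [
    ' df = pd.DataFrame({"a": [1, 2]})',            # -> ['DataFrame']
    ' s = pandas.Series([1, 2, 3])',                # -> ['Series']
    'import pandas.io.sql as psql',                 # -> ['io.sql']    (a module, not a function)
    ' webbrowser.open("http://pandas.pydata.org")', # -> ['pydata.org'] (false positive)
]

for line in samples:
    print(pattern.findall(line))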
Data source
I will be using the table contents_py from https://bigquery.cloud.google.com/dataset/fh-bigquery:github_extracts. It is a smaller BigQuery table that contains only the Python files stored on GitHub.
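The queries in this post can be pasted into the BigQuery console, but they can also be run programmatically. Below is a minimal sketch using the google-cloud-bigquery Python client; it assumes you have a Google Cloud project with BigQuery enabled and credentials configured, and the count query is just a placeholder rather than one of the queries from this post:

from google.cloud import bigquery

client = bigquery.Client()

query = '''
SELECT COUNT(*) AS python_files
FROM [fh-bigquery:github_extracts.contents_py]
WHERE content CONTAINS "import pandas"
'''

# The queries in this post are written in BigQuery legacy SQL,
# hence use_legacy_sql=True.
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
for row in client.query(query, job_config=job_config).result():
    print(row.python_files)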
Link to example usage
The URL to the example usage may sometimes be broken, because it assumes the file lives on the master branch. Other tables, like sample_contents, include the branch in a sample_ref field, which lets me generate the correct link.
contents_py only lists the repository and file path. Until this gets resolved, I generate the link based on the assumption that all files are on the master branch.
I could potentially also link to the exact line number, but with the current table schema it would be a lot of work. It would be much easier if the content were exposed as a repeated field of (line contents, line number) pairs. I asked the Google BigQuery team about it.
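For reference, the link construction amounts to the following hypothetical helper, which mirrors the CONCAT(...) expression in the queries below (the repository and path are made up):

# Mirrors CONCAT("https://github.com/", repo, "/blob/master/", path) from the
# queries; it hard-codes the master branch, which is why some links break.
def example_url(sample_repo_name, sample_path):
    return "https://github.com/{}/blob/master/{}".format(sample_repo_name, sample_path)

print(example_url("someuser/somerepo", "analysis/load_data.py"))
# https://github.com/someuser/somerepo/blob/master/analysis/load_data.py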
This post on GitHub
The revision history of this blog post is stored on GitHub.
Top pandas functions and modules
Results are approximate and based on the heuristic assumption that people usually prefix pandas functions with “pd.” or “pandas.”.
SELECT
  REGEXP_EXTRACT(line, r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) AS count_total,
  CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url,
FROM (
  SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name
  FROM [fh-bigquery:github_extracts.contents_py]
  WHERE content CONTAINS "import pandas"
  HAVING NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:pd|pandas)\.')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;
Full result list in Google Docs. Top 20 results:
Top pandas data frame functions
Results are again approximate and based on the heuristic assumption that data frames are usually named with the suffix “df”. To filter out noise, only files containing “import pandas” and matching the regexp r".*df\s=.*(?:pandas|pd)\." are included.
SELECT REGEXP_EXTRACT(line, r"df\.([a-zA-Z-_\.]+)") AS pandas_function, COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos, CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url FROM ( SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name FROM [fh-bigquery:github_extracts.contents_py] WHERE content CONTAINS "import pandas" and REGEXP_MATCH(content, r".*df\s=.*(?:pandas|pd)\.") HAVING line CONTAINS "df.") GROUP BY 1 HAVING LENGTH(pandas_function) > 1 ORDER BY 2 DESC LIMIT 1000;
Full result list in Google Docs. Top 20 results:
Top numpy functions and modules
Results are again approximate; the query is a simple string replacement of the pandas version.
SELECT
  REGEXP_EXTRACT(line, r'[^a-zA-Z](?:np|numpy)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) AS count_total,
  CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url,
FROM (
  SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name
  FROM [fh-bigquery:github_extracts.contents_py]
  WHERE content CONTAINS "import numpy"
  HAVING NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:np|numpy)\.')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;
Full result list in Google Docs. Top 20 results:
Top scipy functions and modules
Results are again approximate; the query is a simple string replacement of the numpy version.
SELECT
  REGEXP_EXTRACT(line, r'[^a-zA-Z](?:sp|scipy)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) AS count_total,
  CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url,
FROM (
  SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name
  FROM [fh-bigquery:github_extracts.contents_py]
  WHERE content CONTAINS "import scipy"
  HAVING NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:sp|scipy)\.')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;
Full result list in Google Docs. Top 20 results:
Attribution
The regular expression used to extract functions was improved by Felipe in the comments.
Other posts
You may also take a look at my other posts:
From Felipe’s comment:
Try this one:
SELECT REGEXP_EXTRACT(line, r'[^a-zA-Z](pd\.[^",\(\,\`) \':\[\]/={}]*)') function, COUNT(*) c
FROM (
SELECT SPLIT(content, '\n') AS line,
FROM [fh-bigquery:github_extracts.contents_py]
WHERE content CONTAINS "import pandas as pd"
HAVING
NOT LEFT(LTRIM(line),1)='#'
AND REGEXP_MATCH(line, r'[^a-zA-Z]pd\.')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1000
pd.DataFrame 39718
pd.Series 16664
pd.read_csv 15876
pd.concat 6930
pd.Timestamp 6411