Robert Kozikowski's blog

More advanced GitHub code search

Introduction

Documentation is not always enough, so I like to search for code examples on GitHub.

For example, today I wanted to see which numpy functions people pass to the aggfunc parameter of the pandas pivot_table.

When searching for things like "pivot_table aggfunc np", the built-in GitHub search does some kind of "OR" matching with heuristic scoring. It sometimes returns lines matching only aggfunc or only pivot_table.

I extracted a BigQuery table, kozikow_github.data_py, that makes it easier to search GitHub for code examples. Each row has a field "line" with the line's contents and a URL pointing to that line on GitHub, along with prev_line and next_line fields holding the contents of the neighboring lines.
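Since prev_line and next_line are in the table, you can also match calls that are split across two lines, which a plain line-level search would miss. A minimal sketch (the regexes are just an illustration):

SELECT
  prev_line,
  line,
  url
FROM [wide-silo-135723:kozikow_github.data_py]
WHERE
  REGEXP_MATCH(prev_line, r"pivot_table")
  AND REGEXP_MATCH(line, r"aggfunc")
LIMIT 20;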

You get 1 TB of free data scans per month on BigQuery, so you are very unlikely to run out of quota by querying my table.

The table targets scientific Python, but a simple string replacement in the SQL query would produce a similar table for any other framework or language, as sketched below. Instructions are in the section "How to generate your own public table".
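For example, a JavaScript version would only need the innermost SELECT of the generating query (shown later in this post) pointed at a different extract. A sketch, where the contents_js table and the require filter are my assumptions about the analogous fh-bigquery extract:

# Inner SELECT of the generating query, retargeted at JavaScript.
# [fh-bigquery:github_extracts.contents_js] and the "require("
# filter are assumptions, not verified names.
SELECT
    id AS file_id,
    SPLIT(REPLACE(content, "\n", " \n"), "\n") AS line,
    sample_repo_name,
    sample_path,
FROM
    [fh-bigquery:github_extracts.contents_js]
WHERE
    content CONTAINS "require("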

Demo

You can run the query in the BigQuery console, though you may need to register first.

SELECT
  line,
  url
FROM [wide-silo-135723:kozikow_github.data_py]
WHERE
   REGEXP_MATCH(line, r"(?:pd|pandas).*pivot_table")
   AND (REGEXP_MATCH(line, r"aggfunc.*(?:np|numpy)"))
LIMIT 20;
Sample results (line, then url):

meta_df = pd.pivot_table(meta_df, values="reads", rows=index, aggfunc=np.sum)
  https://github.com/brwnj/repertoire/blob/master/repertoire/metacompare.py#L40

temp = pd.pivot_table(temp, values="reads", rows=index, aggfunc=np.sum)
  https://github.com/brwnj/repertoire/blob/master/repertoire/metacompare.py#L48

fp = pd.pivot_table(f, values='count', index='Entry Exit GroupID', columns='AgeEnteredBucket', aggfunc=np.sum, fill_value=0)
  https://github.com/dssg/homelessness-public/blob/master/clean.py#L442

table = pd.tools.pivot.pivot_table(df, rows=['week'], cols=['swimlane'], values='count', aggfunc=np.count_nonzero)
  https://github.com/worldofchris/jlf/blob/master/jlf_stats/metrics.py#L173

print grid_to_string(pd.DataFrame.pivot_table(df, values='Size', index=[field + ' Method' for field in fields], columns=['Compressor'], aggfunc=np.min))
  https://github.com/batterseapower/timeseries-compression/blob/master/python/analyse.py#L44

transpositions = pd.DataFrame.pivot_table(df[df['Compressor'] == compressor], values='Size', index=[field + ' Codec'], columns=[field + ' Method'], aggfunc=np.min)
  https://github.com/batterseapower/timeseries-compression/blob/master/python/analyse.py#L56

printable_df = pd.DataFrame.pivot_table(df, values='Size', index=[field + ' Codec', field + ' Method'], columns=['Compressor'], aggfunc=np.min)
  https://github.com/batterseapower/timeseries-compression/blob/master/python/analyse.py#L73

printable_df = pd.DataFrame.pivot_table(df, values='Size', index=[field + ' Codec'], columns=[field + ' Method'], aggfunc=np.min)
  https://github.com/batterseapower/timeseries-compression/blob/master/python/analyse.py#L76

#dfQB = pd.pivot_table(dfQB,index=['name'],aggfunc=np.sum).reset_index()
  https://github.com/kpolimis/nflPlayerPerformance/blob/master/notebooks/plots_3year.py#L49

dfs.append(index_converter(pd.pivot_table(x, index=index_field, aggfunc=np.sum), 'sum_var'))
  https://github.com/khughitt/Pharmacogenomics_Prediction_Pipeline_P3/blob/master/tools/pipeline_helpers.py#L34

dfs.append(index_converter(pd.pivot_table(x, index=index_field, aggfunc=np.mean), 'mean_var'))
  https://github.com/khughitt/Pharmacogenomics_Prediction_Pipeline_P3/blob/master/tools/pipeline_helpers.py#L35

dfs.append(index_converter(pd.pivot_table(x[x>0], index=index_field, aggfunc=np.sum), 'sum_pos'))
  https://github.com/khughitt/Pharmacogenomics_Prediction_Pipeline_P3/blob/master/tools/pipeline_helpers.py#L62

I found a new numpy function, np.count_nonzero.
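Instead of eyeballing 20 rows, the same table can rank the aggregation functions by popularity. A sketch, assuming legacy BigQuery SQL as in the queries above:

# Count how often each numpy function is passed as aggfunc.
SELECT
  REGEXP_EXTRACT(line, r"aggfunc=((?:np|numpy)\.\w+)") AS func,
  COUNT(*) AS uses
FROM [wide-silo-135723:kozikow_github.data_py]
WHERE
  REGEXP_MATCH(line, r"aggfunc=(?:np|numpy)\.")
GROUP BY func
ORDER BY uses DESC
LIMIT 10;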

SQL generating the table

SELECT
    file_id,
    line,
    line_number,
    url,
    LAG(line, 1) OVER (
        PARTITION BY file_id ORDER BY line_number) AS prev_line,
    LEAD(line, 1) OVER (
        PARTITION BY file_id ORDER BY line_number) AS next_line,
FROM (SELECT
    file_id,
    # RTRIM strips the space that was appended below.
    RTRIM(line) AS line,
    POSITION(line) AS line_number,
    CONCAT("https://github.com/",
            sample_repo_name,
            "/blob/master/",
            sample_path, "#L",
            STRING(POSITION(line))) AS url,
    FROM (SELECT
        id AS file_id,
        # Append a space to every line: legacy SQL SPLIT drops
        # empty strings, which would break the line numbering.
        SPLIT(REPLACE(content, "\n", " \n"), "\n") AS line,
        sample_repo_name,
        sample_path,
    FROM
        [fh-bigquery:github_extracts.contents_py]
    WHERE
        content CONTAINS "import pandas"
        OR content CONTAINS "import numpy"
        OR content CONTAINS "import scipy"
        OR content CONTAINS "import scikit"
        OR content CONTAINS "import sklearn"))

It took 47 seconds to generate it.

How to generate your own public table

In the BigQuery console, after creating and choosing a project:

  • In the left sidebar, click the dropdown next to your project and choose "Create new dataset".
  • Click the dropdown next to the new dataset and choose "Share Dataset".
  • In the "Add People" section, grant "All Authenticated Users" the "View" permission.
  • In the query window, open "Show options".
  • Choose a destination table (you can create a new one) and check "Allow Large Results".
  • Press "Run Query".
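After sharing, anyone authenticated can query your table the same way as mine. The project, dataset and table names below are placeholders for whatever you picked:

# Placeholder names: substitute your own project, dataset and table.
SELECT
  line,
  url
FROM [your-project:your_dataset.data_py]
WHERE
  REGEXP_MATCH(line, r"pivot_table")
LIMIT 10;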

Storage cost

The table is only 21 GB, so at $0.02 per GB per month (https://cloud.google.com/bigquery/pricing) storing it costs me about 42 cents per month. I am still on the Google Cloud free trial, so for now it's free.