Top emacs packages used in github repos

Introduction
Bias towards repos with license
Query on sample_contents
Query on full, unsampled, data
Results
Things to improve

Introduction

Today github announced new BigQuery tables that let you query data about github repos using the BigQuery. I found about it from hacker news. Also see announcement from Google. The killer new feature is the ability to query the contents of all files stored on github.

I will use it to list top packages required in elisp files on github. If you are impatient you can jump directly to Results. To run a query go to BigQuery console and click “Compose Query”.

In contrast to old BigQuery with top starred repos I will list all packages required in any elisp file, rather than just repositories written in emacs lisp. This will include built in emacs packages or packages not stored on github.

Revision history of this blog post is stored on github.

Bias towards repos with license

As this hacker news post mentions, only repos with license are indexed. That will bias the result towards packages rather than personal emacs configs people put on github.

Query on sample_contents

sample_contents only contains 10% sample files from top 130k repos according to stars. It also contain pre-joined data between tables contents and files. It’s good for data exploration before running the bigger query on the full dataset.

SELECT
  line,
  COUNT(DISTINCT(sample_repo_name)) as count
FROM (
  SELECT
    SPLIT(content, '\n') line,
    sample_repo_name
  FROM
    [bigquery-public-data:github_repos.sample_contents]
  WHERE
    (sample_path LIKE '%.el'
      OR sample_path LIKE '%.emacs')
  HAVING
    line LIKE '(require%)')
GROUP BY
  1
ORDER BY
  count DESC
LIMIT
  10;

Top 10 results:

line	count
(require ‘cl-lib)	32
(require ‘ert)	22
(require ‘cl)	20
(require ‘comint)	19
(require ‘org)	14
(require ‘font-lock)	13
(require ‘compile)	11
(require ‘eieio)	10
(require ‘package)	9
(require ‘dash)	9

Since the top entry got only 32 hits it made me worry that sample size is too small.

Only 266 emacs repos are included in sample_contents:

SELECT
  COUNT(DISTINCT(sample_repo_name))
FROM [bigquery-public-data:github_repos.sample_contents]
WHERE
(sample_path LIKE '%.el' OR
    sample_path LIKE '%.emacs')

That made me think that I need to query the unsampled data.

Query on full, unsampled, data

contents contains all files, but require join with files to get file path. My query:

SELECT
  TOP(line, 5000),
  COUNT(*) AS c
FROM (
  SELECT
    SPLIT(contents.content, '\n') line,
    contents.id AS id
  FROM
    [bigquery-public-data:github_repos.contents] AS contents
  JOIN (
    SELECT
      path,
      id
    FROM
      [bigquery-public-data:github_repos.files]
    WHERE
      path LIKE '%.el'
      OR path LIKE '%.emacs') AS files
  ON
    (contents.id == files.id)
  WHERE
    contents.content CONTAINS '(require \''
  HAVING
    LEFT(line, 10) == "(require '"
    AND RIGHT(line, 1) == ")");

Thanks to all optimizations getting top 5000 packages takes only 60 seconds.

I have been getting a few errors, including Error: Resources exceeded during query execution. for a while. It required a couple of BigQuery tricks to make it work

TOP/COUNT is faster than GROUP BY/ORDER/LIMIT.
Filtering data prior to join in sub-query reduces memory usage.
Regexps and globs are expensive. LEFT/RIGHT is much faster.
Avoid scanning files without .el or .emacs extension (thanks to filtering done by join).

Results

Full list of top 5000 is at https://github.com/kozikow/kozikow-blog/blob/master/github_emacs_packages_5000.csv .

Packages with 1000+ usages:

package	count
(require ‘cl-lib)	18901
(require ‘org)	12144
(require ‘cl)	11695
(require ‘ob)	11073
(require ‘ert)	7978
(require ‘semantic)	6684
(require ‘easymenu)	5821
(require ‘comint)	5255
(require ‘eieio)	4691
(require ‘gnus)	4446
(require ‘font-lock)	4272
(require ‘quail)	3991
(require ‘helm)	3839
(require ‘dash)	3650
(require ‘calc-macs)	3602
(require ‘compile)	3524
(require ‘gnus-util)	3418
(require ‘erc)	3371
(require ‘calc-ext)	3295
(require ‘thingatpt)	3294
(require ‘org-compat)	3282
(require ‘org-macs)	3268
(require ‘shimbun)	3019
(require ‘url-parse)	2921
(require ‘format-spec)	2696
(require ‘company)	2607
(require ‘package)	2512
(require ‘ring)	2399
(require ‘message)	2334
(require ‘cc-mode)	2329
(require ‘python)	2243
(require ‘nnheader)	2232
(require ‘mm-util)	2227
(require ‘custom)	2223
(require ‘calendar)	2190
(require ‘nnoo)	2187
(require ‘gnus-sum)	2044
(require ‘mh-e)	2022
(require ‘ox)	1978
(require ‘gyp)	1964
(require ‘magit)	1953
(require ‘ede)	1900
(require ‘json)	1874
(require ‘dired)	1849
(require ‘nnmail)	1799
(require ‘button)	1781
(require ‘tramp)	1693
(require ‘slime)	1629
(require ‘etags)	1619
(require ‘ansi-color)	1618
(require ‘wid-edit)	1611
(require ‘url)	1566
(require ‘mm-decode)	1562
(require ‘gnus-art)	1543
(require ‘helm-help)	1538
(require ‘semantic/format)	1504
(require ‘outline)	1495
(require ‘imenu)	1493
(require ‘ob-eval)	1457
(require ‘ob-core)	1419
(require ‘url-util)	1396
(require ‘ecb-util)	1374
(require ‘pcomplete)	1357
(require ‘url-vars)	1337
(require ‘speedbar)	1336
(require ‘widget)	1273
(require ‘esh-util)	1260
(require ‘s)	1239
(require ‘helm-utils)	1234
(require ‘auto-complete)	1232
(require ‘xml)	1231
(require ‘semantic/db)	1219
(require ‘haskell-mode)	1201
(require ‘semantic/analyze)	1195
(require ‘ewoc)	1152
(require ‘gnus-range)	1151
(require ‘eshell)	1065
(require ‘cider-client)	1040
(require ‘eieio-base)	1037
(require ‘find-func)	1032
(require ‘semantic/ctxt)	1008
(require ‘help-mode)	1006
(require ‘gnus-int)	1001

Things to improve

In sampled query I counted unique repos that reference given library. In full query I just counted number of times given line appeared in any emacs file. It is better to count unique repos, but calculating distinct repos on a full contents would require additional join.

7 Comments

says:

June 30, 2016 at 2:51 pm

Nice! I added the article.

Oh, and a personal nit:

Pingback: Top angular directives on github | Robert Kozikowski's blog
Pingback: Top pandas functions used in github repos | Robert Kozikowski's blog
Pingback: Top angular directives used in GitHub repos | 神刀安全网
Pingback: 2016-07-04 Emacs News - sacha chua :: living an awesome life
Oscar Najera says:

July 6, 2016 at 11:06 pm

I would also test against the use-package keyword. Many of us use it more often than require

1. kozikow says:
  
  July 7, 2016 at 12:00 am
  
  Thanks! I’ll fix. I am writing new post with clustering packages, so I will do it correctly there. I expect results will be similar (except if use-package users would use different packages than require users).

	hanale10 on Using PyCharm docker integrati…
	Dodge Coates on Very powerful data analysis en…
	Dodge Coates on Very powerful data analysis en…
	Dodge Coates on Very powerful data analysis en…
	kozikow on Very powerful data analysis en…

Top emacs packages used in github repos

Table of Contents

Introduction

Bias towards repos with license

Query on sample_contents

Query on full, unsampled, data

Results

Things to improve

Published by kozikow

7 Comments

Table of Contents

Introduction

Bias towards repos with license

Query on sample_contents

Query on full, unsampled, data

Results

Things to improve

Like this:

Published by kozikow

7 Comments