Table of Contents
Introduction
Today github announced new BigQuery tables that let you query data about github repos using the BigQuery. I found about it from hacker news. Also see announcement from Google. The killer new feature is the ability to query the contents of all files stored on github.
I will use it to list top packages required in elisp files on github. If you are impatient you can jump directly to Results. To run a query go to BigQuery console and click “Compose Query”.
In contrast to old BigQuery with top starred repos I will list all packages required in any elisp file, rather than just repositories written in emacs lisp. This will include built in emacs packages or packages not stored on github.
Revision history of this blog post is stored on github.
Bias towards repos with license
As this hacker news post mentions, only repos with license are indexed. That will bias the result towards packages rather than personal emacs configs people put on github.
Query on sample_contents
sample_contents only contains 10% sample files from top 130k repos according to stars. It also contain pre-joined data between tables contents and files. It’s good for data exploration before running the bigger query on the full dataset.
SELECT line, COUNT(DISTINCT(sample_repo_name)) as count FROM ( SELECT SPLIT(content, '\n') line, sample_repo_name FROM [bigquery-public-data:github_repos.sample_contents] WHERE (sample_path LIKE '%.el' OR sample_path LIKE '%.emacs') HAVING line LIKE '(require%)') GROUP BY 1 ORDER BY count DESC LIMIT 10;
Top 10 results:
line | count |
---|---|
(require ‘cl-lib) | 32 |
(require ‘ert) | 22 |
(require ‘cl) | 20 |
(require ‘comint) | 19 |
(require ‘org) | 14 |
(require ‘font-lock) | 13 |
(require ‘compile) | 11 |
(require ‘eieio) | 10 |
(require ‘package) | 9 |
(require ‘dash) | 9 |
Since the top entry got only 32 hits it made me worry that sample size is too small.
Only 266 emacs repos are included in sample_contents:
SELECT COUNT(DISTINCT(sample_repo_name)) FROM [bigquery-public-data:github_repos.sample_contents] WHERE (sample_path LIKE '%.el' OR sample_path LIKE '%.emacs')
266
That made me think that I need to query the unsampled data.
Query on full, unsampled, data
contents contains all files, but require join with files to get file path. My query:
SELECT TOP(line, 5000), COUNT(*) AS c FROM ( SELECT SPLIT(contents.content, '\n') line, contents.id AS id FROM [bigquery-public-data:github_repos.contents] AS contents JOIN ( SELECT path, id FROM [bigquery-public-data:github_repos.files] WHERE path LIKE '%.el' OR path LIKE '%.emacs') AS files ON (contents.id == files.id) WHERE contents.content CONTAINS '(require \'' HAVING LEFT(line, 10) == "(require '" AND RIGHT(line, 1) == ")");
Thanks to all optimizations getting top 5000 packages takes only 60 seconds.
I have been getting a few errors, including Error: Resources exceeded during query execution.
for a while. It required a couple of BigQuery tricks to make it work
- TOP/COUNT is faster than GROUP BY/ORDER/LIMIT.
- Filtering data prior to join in sub-query reduces memory usage.
- Regexps and globs are expensive. LEFT/RIGHT is much faster.
- Avoid scanning files without .el or .emacs extension (thanks to filtering done by join).
Results
Full list of top 5000 is at https://github.com/kozikow/kozikow-blog/blob/master/github_emacs_packages_5000.csv .
Packages with 1000+ usages:
package | count |
---|---|
(require ‘cl-lib) | 18901 |
(require ‘org) | 12144 |
(require ‘cl) | 11695 |
(require ‘ob) | 11073 |
(require ‘ert) | 7978 |
(require ‘semantic) | 6684 |
(require ‘easymenu) | 5821 |
(require ‘comint) | 5255 |
(require ‘eieio) | 4691 |
(require ‘gnus) | 4446 |
(require ‘font-lock) | 4272 |
(require ‘quail) | 3991 |
(require ‘helm) | 3839 |
(require ‘dash) | 3650 |
(require ‘calc-macs) | 3602 |
(require ‘compile) | 3524 |
(require ‘gnus-util) | 3418 |
(require ‘erc) | 3371 |
(require ‘calc-ext) | 3295 |
(require ‘thingatpt) | 3294 |
(require ‘org-compat) | 3282 |
(require ‘org-macs) | 3268 |
(require ‘shimbun) | 3019 |
(require ‘url-parse) | 2921 |
(require ‘format-spec) | 2696 |
(require ‘company) | 2607 |
(require ‘package) | 2512 |
(require ‘ring) | 2399 |
(require ‘message) | 2334 |
(require ‘cc-mode) | 2329 |
(require ‘python) | 2243 |
(require ‘nnheader) | 2232 |
(require ‘mm-util) | 2227 |
(require ‘custom) | 2223 |
(require ‘calendar) | 2190 |
(require ‘nnoo) | 2187 |
(require ‘gnus-sum) | 2044 |
(require ‘mh-e) | 2022 |
(require ‘ox) | 1978 |
(require ‘gyp) | 1964 |
(require ‘magit) | 1953 |
(require ‘ede) | 1900 |
(require ‘json) | 1874 |
(require ‘dired) | 1849 |
(require ‘nnmail) | 1799 |
(require ‘button) | 1781 |
(require ‘tramp) | 1693 |
(require ‘slime) | 1629 |
(require ‘etags) | 1619 |
(require ‘ansi-color) | 1618 |
(require ‘wid-edit) | 1611 |
(require ‘url) | 1566 |
(require ‘mm-decode) | 1562 |
(require ‘gnus-art) | 1543 |
(require ‘helm-help) | 1538 |
(require ‘semantic/format) | 1504 |
(require ‘outline) | 1495 |
(require ‘imenu) | 1493 |
(require ‘ob-eval) | 1457 |
(require ‘ob-core) | 1419 |
(require ‘url-util) | 1396 |
(require ‘ecb-util) | 1374 |
(require ‘pcomplete) | 1357 |
(require ‘url-vars) | 1337 |
(require ‘speedbar) | 1336 |
(require ‘widget) | 1273 |
(require ‘esh-util) | 1260 |
(require ‘s) | 1239 |
(require ‘helm-utils) | 1234 |
(require ‘auto-complete) | 1232 |
(require ‘xml) | 1231 |
(require ‘semantic/db) | 1219 |
(require ‘haskell-mode) | 1201 |
(require ‘semantic/analyze) | 1195 |
(require ‘ewoc) | 1152 |
(require ‘gnus-range) | 1151 |
(require ‘eshell) | 1065 |
(require ‘cider-client) | 1040 |
(require ‘eieio-base) | 1037 |
(require ‘find-func) | 1032 |
(require ‘semantic/ctxt) | 1008 |
(require ‘help-mode) | 1006 |
(require ‘gnus-int) | 1001 |
Things to improve
In sampled query I counted unique repos that reference given library. In full query I just counted number of times given line appeared in any emacs file. It is better to count unique repos, but calculating distinct repos on a full contents would require additional join.