Table of Contents
Introduction
All GitHub contents recently became queryable via Google BigQuery. See the announcement from GitHub. You can execute BigQuery queries in the BigQuery console.
I used it to find the top Angular directives used on GitHub. In the first two sections I list only “ng-” directives. Even though parsing HTML with regexps is error-prone, it is accurate enough for frequency analysis.
The third section, Custom directives, uses a heuristic to find custom Angular directives. You can use it to discover popular directives you were not aware of.
The revision history of this post is stored on GitHub.
You may also look at my other posts analyzing GitHub using BigQuery.
Unique usage per repository in top repositories with example usage
Counting unique usage is expensive and I am out of my free quota for this month. This query runs on a smaller table that contains only a 10% random sample of files from the 130k most popular repositories.
It calculates how many unique repos use each directive and adds a link to an example usage.
SELECT
  REGEXP_EXTRACT(line, r".*[ <]+(ng-[a-zA-Z0-9-]+).*") AS line,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/",
         REGEXP_EXTRACT(FIRST(sample_ref), r"refs/heads/(.*)$"),
         "/", FIRST(sample_path)) AS example_url,
FROM (
  SELECT SPLIT(content, '\n') line, sample_repo_name, sample_path, sample_ref
  FROM [bigquery-public-data:github_repos.sample_contents]
  WHERE (sample_path LIKE '%.html' OR sample_path LIKE '%.ng')
  HAVING line CONTAINS 'ng-')
GROUP BY 1
ORDER BY count_distinct_repos DESC
LIMIT 500;
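The core of the query can be sketched in plain Python. This is a minimal illustration with hypothetical input data, not the real pipeline; the actual query runs over BigQuery's sample_contents table.

```python
import re
from collections import defaultdict

# Same idea as the query's pattern: an "ng-" attribute preceded by a space or '<'.
NG_DIRECTIVE = re.compile(r"[ <](ng-[a-zA-Z0-9-]+)")

def count_distinct_repos(files):
    """files: iterable of (repo_name, path, content) tuples."""
    repos_by_directive = defaultdict(set)
    example_by_directive = {}
    for repo, path, content in files:
        # Mirror the WHERE clause: only .html and .ng files.
        if not (path.endswith(".html") or path.endswith(".ng")):
            continue
        for line in content.split("\n"):
            for directive in NG_DIRECTIVE.findall(line):
                repos_by_directive[directive].add(repo)
                # Keep the first occurrence as the "example usage" link.
                example_by_directive.setdefault(directive, (repo, path))
    return sorted(
        ((d, len(repos), example_by_directive[d])
         for d, repos in repos_by_directive.items()),
        key=lambda t: -t[1],
    )

# Hypothetical sample input.
files = [
    ("a/app", "index.html", '<div ng-click="go()" ng-model="x"></div>'),
    ("b/site", "main.html", '<button ng-click="run()">Run</button>'),
    ("c/lib", "notes.txt", "ng-click appears here but is not counted"),
]
print(count_distinct_repos(files))
```

As in the query, each repository is counted at most once per directive, which is why a set is used rather than a plain counter.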
Full results on Google Docs. Top 20 results:
Top usages in all repositories
The previous query looked only at a sample of the data. This query looks through all files accessible on GitHub.
SELECT TOP(line, 500), COUNT(*) AS c
FROM (
  SELECT
    REGEXP_EXTRACT((SPLIT(contents.content, '\n')),
                   r".*[^a-zA-Z](ng-[a-zA-Z0-9-]+).*") line,
    contents.id AS id
  FROM [bigquery-public-data:github_repos.contents] AS contents
  JOIN (
    SELECT path, id
    FROM [bigquery-public-data:github_repos.files]
    WHERE path LIKE '%.ng' OR path LIKE '%.html') AS files
  ON (contents.id == files.id)
  HAVING line CONTAINS "ng-");
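TOP(line, 500) returns fast, approximate top counts. An exact equivalent of the counting step can be sketched in Python with a Counter over hypothetical lines; note how the `[^a-zA-Z]` guard rejects matches embedded in longer words:

```python
import re
from collections import Counter

# Same idea as the full-table query: "ng-" must not be preceded by a letter.
NG_DIRECTIVE = re.compile(r"[^a-zA-Z](ng-[a-zA-Z0-9-]+)")

# Hypothetical sample lines.
lines = [
    '<a ng-click="open()">',
    '<input ng-model="name" ng-click="focus()">',
    '<span title="eng-model">',  # preceded by a letter, so not counted
]

counts = Counter(m for line in lines for m in NG_DIRECTIVE.findall(line))
print(counts.most_common(2))  # exact counterpart of TOP(line, N) + COUNT(*)
```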
Full results on Google Docs. Top 20 results:
directive | count |
---|---|
ng-click | 1572920 |
ng-model | 1355222 |
ng-show | 962245 |
ng-repeat | 697010 |
ng-if | 601903 |
ng-controller | 591669 |
ng-app | 460875 |
ng-class | 452863 |
ng-bind | 283218 |
ng-hide | 217121 |
ng-disabled | 168468 |
ng-include | 125913 |
ng-init | 125508 |
ng-submit | 118507 |
ng-switch-when | 111254 |
ng-href | 109513 |
ng-src | 108365 |
ng-template | 108241 |
ng-change | 101197 |
ng-bind-html | 89604 |
Custom directives
Methodology
I tried a heuristic for finding custom directives: extract all HTML tags and compare their relative frequency in all HTML files vs. “probably Angular HTML” files.
“Probably Angular HTML” is based on the assumption that “ng-” is ubiquitous in Angular HTML, but not that frequent otherwise. Also, some Angular files use the .ng extension. This method obviously has some false positives and negatives. Looking through the results, a ratio of 2.0 was optimal. Here you can see the top 50 results that were just past the edge of exclusion (ratio between 2.0 and 2.5); very few of those entries are legitimate.
I am again using the sampled sample_contents table, since I ran out of free quota.
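The heuristic itself can be sketched in Python. This is a simplified illustration with hypothetical (tag, probably_angular) pairs; the real query derives both from file contents and paths.

```python
from collections import defaultdict

def html_to_angular_ratio(tag_occurrences, threshold=2.0):
    """tag_occurrences: iterable of (tag, probably_angular) pairs.

    Keeps tags whose total occurrence count divided by their count in
    "probably Angular" files is below the threshold, i.e. tags that
    appear mostly in Angular HTML.
    """
    total = defaultdict(int)
    angular = defaultdict(int)
    for tag, probably_angular in tag_occurrences:
        total[tag] += 1
        if probably_angular:
            angular[tag] += 1
    return {
        tag: total[tag] / angular[tag]
        for tag in total
        if angular[tag] and total[tag] / angular[tag] < threshold
    }

# Hypothetical data: a custom directive concentrated in Angular files
# vs. a generic tag spread across all HTML.
occurrences = [
    ("ion-content", True), ("ion-content", True), ("ion-content", False),
    ("table", True), ("table", False), ("table", False), ("table", False),
]
print(html_to_angular_ratio(occurrences))
```

A low ratio means the tag rarely appears outside Angular files, which is exactly the signature of a custom directive.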
BigQuery query
SELECT
  tag,
  COUNT(1) / SUM(IF(probably_angular, 1, 0)) AS html_to_angular_ratio,
  COUNT(DISTINCT(sample_repo_name)) AS distinct_repository_count,
  CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/",
         REGEXP_EXTRACT(FIRST(sample_ref), r"refs/heads/(.*)$"),
         "/", FIRST(sample_path)) AS example_url
FROM (
  SELECT
    SPLIT(REGEXP_REPLACE(
        REGEXP_REPLACE(content, r"['\"\\\/\$]+[a-zA-Z-]*", ""),
        r"[^a-zA-Z-]+", " "), " ") AS tag,
    (REGEXP_MATCH(content, r".*[ <]+ng-[a-zA-Z0-9-]+.*")
     OR (sample_path LIKE '%.ng')) AS probably_angular,
    sample_repo_name, sample_path, sample_ref
  FROM [bigquery-public-data:github_repos.sample_contents]
  WHERE (sample_path LIKE '%.html' OR sample_path LIKE '%.ng'))
GROUP BY 1
HAVING html_to_angular_ratio < 2.0
ORDER BY distinct_repository_count DESC
LIMIT 1000;
Results
All results in Google Docs, including ng- entries.
For example, it found directives from the Ionic framework, Bootstrap, and ng-file-upload. Top 20 results, excluding entries that start with ng- or *angular*:
My other posts analyzing GitHub using BigQuery
You may also take a look at my other posts:
nice!
some comments:
“big query” -> “bigquery” (one word please! otherwise I won’t be able to find your contributions)
“TOP(line, 300)” -> fast, approximate results, cool
“FROM [bigquery-public-data:github_repos.contents]” -> so you are going for the full 1.7TB archive? This query gets you out of the free tier. That’s ok, if you or someone else is willing to pay for it :).
“WHERE path LIKE ‘%.ng’ OR path LIKE ‘%.html’) AS files” -> it might make sense to first extract this data to a new table. Then every new query can go over this table instead, and all querying will be cheaper thereafter. It’s a good investment 🙂
Thanks, added.
Adding my comment:
> “big query” -> “bigquery”
Fixed.
> At this size and type of analysis we are really pushing the boundaries of BigQuery.
The GitHub data set size, 3 TB, is tiny compared to some of the data sets Dremel is used on. I thought that BigQuery uses Dremel under the hood?
Anyway, if you end up overwhelming a single Dremel worker (which was happening in my case), you will start hitting Dremel OOMs at a much smaller data set size than 3 TB. Figuring out whether the data under some join key is too big to fit into a single worker’s memory may not be obvious – I couldn’t find any external docs that would suggest it.
> That’s ok, if you or someone else is willing to pay for it:).
I am using the 300 USD free trial.