Harvesting my Gists from GitHub

By R. S. Doiel 2017-12-10

This is a just quick set of notes on harvesting my Gists on GitHub so I have an independent copy for my own website.

Assumptions

In this gist I assume you are using Bash on a POSIX system (e.g. Raspbian on a Raspberry Pi) with the standard compliment of Unix utilities (e.g. cut, sed, curl). I also use Stephen Dolan’s jq as well as Caltech Library’s datatools. See the respective GitHub repositories for installation instructions. The gist harvest process was developed against GitHub’s v3 API (see developer.github.com).

In the following examples “USER” is assumed to hold your GitHub user id (e.g. rsdoiel for https://github.com/rsdoiel).

Getting my basic profile

This retrieves the public view of your profile.

    curl -o USER "https://api.github.com/users/USER"

Find the urL for your gists

Get the gists url from `USER.json.

    GISTS_URL=$(jq ".gists_url" "USER.json" | sed -E 's/"//g' | cut -d '{' -f 1)
    curl -o gists.json "${GISTS_URL}"

Now gists.json should hold a JSON array of objects representing your Gists.

Harvesting the individual Gists.

When you look at gists.json you’ll see a multi-level JSON structure. It has been formatted by the API so be easy to scrape. But since this data is JSON and Caltech Library has some nice utilities for working with JSON I’ll use jsonrange and jq to pull out a list of individual Gists URLS.

    jsonrange -i gists.json | while read I; do 
        jq ".[$I].files" gists.json | sed -E 's/"//g'
    done

Expanding this we can now curl each individual gist metadata to find URL to the raw file.

    jsonrange -i gists.json | while read I; do 
        jq ".[$I].files" gists.json | jsonrange -i - | while read FNAME; do
            jq ".[$I].files[\"$FNAME\"].raw_url" gists.json | sed -E 's/"//g'; 
        done;
    done

Now that we have URLs to the raw gist files we can use curl again to fetch each.

What do we want to store with our harvested gists? The raw files, metadata about the Gist (e.g. when it was created), the Gist ID. Putting it all together we have the following script.

    #!/bin/bash
    if [[ "$1" = "" ]]; then
        echo "USAGE: $(basename "$0") GITHUB_USERNAME"
        exit 1
    fi
    USER="$1"
    curl -o "$USER.json" "https://api.github.com/users/$USER"
    if [[ ! -s "$USER.json" ]]; then
        echo "Someting went wrong getting https://api.github.cm/users/${USER}"
        exit 1
    fi
    GISTS_URL=$(jq ".gists_url" "$USER.json" | sed -E 's/"//g' | cut -d '{' -f 1)
    curl -o gists.json "${GISTS_URL}"
    if [[ ! -s gists.json ]]; then
        echo "Someting went wrong getting ${GISTS_URL}"
        exit 1
    fi
    # For each gist harvest our file
    jsonrange -i gists.json | while read I; do
        GIST_ID=$(jq ".[$I].id" gists.json | sed -E 's/"//g')
        mkdir -p "gists/$GIST_ID"
        echo "Saving gists/$GIST_ID/metadata.json"
        jq ".[$I]" gists.json > "gists/$GIST_ID/metadata.json"
        jq ".[$I].files" gists.json | jsonrange -i - | while read FNAME; do
            URL=$(jq ".[$I].files[\"$FNAME\"].raw_url" gists.json | sed -E 's/"//g')
            echo "Saving gist/$GIST_ID/$FNAME"
            curl -o "gists/$GIST_ID/$FNAME" "$URL"
        done;
    done