Analyze Data with MongoDB and Ruby

This post gives a brief introduction to using MongoDB with Ruby. We will see how to install and connect to MongoDB, how to populate the database with fake documents generated in just a couple of lines of code, and how to build some queries to gain insights into the stored documents.

Prerequisites

Let’s assume you already have Ruby installed.

Throughout this post, we are going to create a small playground consisting of a few files. Let’s create a project directory for them: mkdir -p ~/projects/rumo and cd ~/projects/rumo (rumo for Ruby Mongo). I also like to use rvm to create a fresh and separate Ruby environment for each project: rvm --rvmrc --create 2.3.1 and cd ./ to load the configuration. But that’s optional.

The next thing you need is MongoDB. With brew (a package manager for macOS), it’s as easy as brew install mongo. For any other route, such as building MongoDB from source, the official website offers excellent guidance. In case you haven’t read the full instructions there, here’s the gist of how to start Mongo:

MongoDB needs a directory where it can store all its data. For me, it’s: mkdir -p ~/.mongodb/data/
Open another shell and start MongoDB with the previously created data directory passed as an argument: mongod --dbpath ~/.mongodb/data.

Having MongoDB installed and RVM in place, it’s time to create the Ruby files for our playground. Our project file structure is going to look like this:

├── .rvmrc
├── Gemfile
├── Rakefile
└── lib
    ├── boot.rb
    ├── database.rb
    └── document.rb

Create Gemfile with the following lines:

source 'https://rubygems.org'
gem 'rake'
gem 'mongo'
gem 'faker' # for generating fake documents

We will use the faker gem in a moment to generate test data to populate our MongoDB with, while the rake gem will allow us to boot an IRB session with our project files already loaded. The Rakefile to achieve the latter looks as follows:

task default: %w[ console ]
task :console do
  require 'irb'
  require 'irb/completion'
  require_relative './lib/boot.rb'
  ARGV.clear
  IRB.start
end

The lib/boot.rb ties everything together and loads both the installed gems and our custom files (which we are going to see hereafter):

require 'mongo'
require 'json'
require 'faker'
require 'date'
require_relative 'document'
require_relative 'database'

Generating documents

To have something in the database to run queries against, we need data. Faker fulfills exactly this need: generating data for all kinds of purposes. The Faker::Book class, for example, provides generators for author names, ISBNs, publishers, and much more. We mainly use this to generate documents for a fictitious book store. The below code should be sufficiently self-explanatory. Create lib/document.rb with the following contents:

module RuMo
  class Document
    def self.generate_many(qty = 5)
      (1..qty).map { generate_one }
    end

    def self.generate_one
      { title:        Faker::Book.title,
        author:       Faker::Book.author,
        publisher:    Faker::Book.publisher,
        isbn:         Faker::Code.isbn,
        price:        Faker::Commerce.price,
        # Newer Faker releases expect keyword arguments: between(from: ..., to: ...)
        release_date: Faker::Date.between( Date.new(2000,1,1), Date.today ),
        genres:       (0..rand(4)).map { Faker::Book.genre }
      }
    end
  end
end
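
The genres line in generate_one may deserve a second look: rand(4) returns an integer from 0 to 3, so the range (0..rand(4)) holds one to four elements, and each book ends up with one to four genres, possibly with repeats. The following sketch illustrates this without Faker. SAMPLE_GENRES, pick_genres, and pick_unique_genres are hypothetical stand-ins invented for this illustration and are not part of the project files.

```ruby
# Hypothetical stand-in for Faker::Book.genre, so the sketch runs without gems.
SAMPLE_GENRES = ['Fiction', 'Humor', 'Mythology', 'Folklore', 'Classic'].freeze

# Mirrors the genres line in generate_one: rand(4) yields 0..3,
# so the range (0..rand(4)) has between one and four elements.
def pick_genres(rng = Random.new)
  (0..rng.rand(4)).map { SAMPLE_GENRES.sample(random: rng) }
end

# Variant without repeats: draw one to four distinct genres in a single call.
def pick_unique_genres(rng = Random.new)
  SAMPLE_GENRES.sample(rng.rand(1..4), random: rng)
end

# Smallest and largest number of genres seen over many draws.
sizes = Array.new(1_000) { pick_genres.size }
puts sizes.minmax.inspect
```

Whether repeated genres matter depends on what you want to analyze; for the genre counts later in this post, a repeat simply counts the same book twice for that genre.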

Accessing the database

Our generated books need a place to be stored. Accessing Mongo and storing data is surprisingly easy:

creating a connection via Mongo::Client.new(<uri>),
accessing a collection by Mongo::Client#[Symbol], and
inserting an array of documents with Mongo::Collection#insert_many(Array).

module RuMo
  class Database
    attr_reader :name, :port

    def initialize(name: 'test', port: 27017)
      @name = name
      @port = port
    end

    def conn
      @conn ||= Mongo::Client.new("mongodb://127.0.0.1:#{port}/#{name}")
    end

    def collection
      @collection ||= conn[:books]
    end

    def insert(documents = [])
      collection.insert_many(documents)
    end
  end
end

Populating the database

Now that we have everything set up, let’s insert some documents and start writing some queries. Recall that we can boot directly into our project using the rake task we wrote:

rake
2.3.1 :001 > db = RuMo::Database.new
2.3.1 :002 > db.insert RuMo::Document.generate_many(100)
D, [2016-11-19T11:06:58.950593 #85952] DEBUG -- : MONGODB | 127.0.0.1:27017 | test.insert | STARTED | {"insert"=>"books", "documents"=>[{:title=>"I Sing the Body Electric", :author=>"Humberto Friesen", :publisher=>"Parragon", :isbn=>"777712218-5", :price=>47.51, :genres=>["Narrative nonfiction", "Classic"], :_id=>BSON::ObjectId('583024421c705c4fc09494...
D, [2016-11-19T11:06:58.952993 #85952] DEBUG -- : MONGODB | 127.0.0.1:27017 | test.insert | SUCCEEDED | 0.002316s
 => #<Mongo::BulkWrite::Result:0x007fe047969a40 @results={"n_inserted"=>100, "n"=>100, "inserted_ids"=>[BSON::ObjectId('583024421c705c4fc09494de'), <... and 99 more BSON::ObjectId>]}>

That’s it. From now on, we omit Mongo’s debug outputs.

Now that we have a couple of sample books in our database, we can start to build useful queries. Let’s start by finding books that cost no more than 15, whatever the unit of price in our system is. The $lte operator stands for “less than or equal to”:

2.3.1 :003 > affordable_books = db.collection.find({ price: { :$lte => 15 } })
 => #<Mongo::Collection::View:0x70300620577380 namespace='test.books' @filter={"price"=>{"$lte"=>15}} @options={}>
2.3.1 :004 > affordable_books.count
 => 16
2.3.1 :005 > affordable_books.first
 => {"_id"=>BSON::ObjectId('583024421c705c4fc094947e'), "title"=>"The Monkey's Raincoat", "author"=>"Prince Bosco", "publisher"=>"Target Books", "isbn"=>"964194353-7", "price"=>0.33, "genres"=>["Fiction narrative", "Humor", "Reference book", "Mythology"]}

Obviously, your numbers might vary.
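
For intuition, the $lte filter behaves much like a plain Ruby select over the documents. The sketch below runs on a small hand-written array rather than on MongoDB; the sample books and the affordable helper are invented for illustration.

```ruby
# Invented sample data; real documents would come from db.collection.
BOOKS = [
  { title: 'A', price: 9.99 },
  { title: 'B', price: 15 },
  { title: 'C', price: 47.51 }
].freeze

# In-memory counterpart of find({ price: { :$lte => limit } }).
def affordable(books, limit)
  books.select { |book| book[:price] <= limit }
end

puts affordable(BOOKS, 15).map { |book| book[:title] }.inspect
# prints ["A", "B"]
```

Several operators can share one condition hash. For example, { price: { :$gte => 5, :$lte => 15 } } should select a price range from 5 to 15 inclusive.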

Aggregation

Aggregations go beyond simply finding data: they let you transform and filter your data (similar in spirit to map-reduce) and apply basic statistical analyses to it. We will take a look at two concrete examples of aggregations. Every aggregation consists of one or more stages which together form a pipeline. Each kind of stage is optional and may appear once or several times within a pipeline.
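
The pipeline idea itself can be sketched in a few lines of plain Ruby: every stage takes a list of documents and hands a new list to the next stage. This is only a mental model, not how MongoDB implements aggregation; run_pipeline and the three stage lambdas are hypothetical names made up for this sketch.

```ruby
# Runs documents through the stages in order; each stage is a callable
# that receives an array of documents and returns a new array.
def run_pipeline(docs, stages)
  stages.reduce(docs) { |current, stage| stage.call(current) }
end

# Three toy stages: filter, sort, and reshape.
CHEAP      = ->(docs) { docs.select { |doc| doc[:price] <= 15 } }
BY_PRICE   = ->(docs) { docs.sort_by { |doc| doc[:price] } }
TITLE_ONLY = ->(docs) { docs.map { |doc| { title: doc[:title] } } }

sample = [
  { title: 'A', price: 20 },
  { title: 'B', price: 12 },
  { title: 'C', price: 5 }
]

# Keeps the two cheap books, orders them by price, and strips everything but the title.
p run_pipeline(sample, [CHEAP, BY_PRICE, TITLE_ONLY])
```

An empty list of stages returns the documents unchanged, and nothing prevents the same stage from appearing twice.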

Most popular genres

Our first aggregation counts how often each genre is associated with the books in our database, sorted in descending order.

query = 
  [
    { :$unwind => '$genres' },
    {
      :$group => {
        _id:   '$genres',
        count: { :$sum => 1 }
      }
    },
    { :$sort => { count: -1 } }
  ]

The query is comprised of three stages:

The $unwind stage splits each document into several documents by the specified field. More concretely, using the above example query: one document containing two genres, say Graphic Novel and Fiction, goes in, and two documents with one genre each come out. Each document still has the genres field, but now the first document has the value genres: 'Graphic Novel' while the second one has genres: 'Fiction'.
The $group stage groups the incoming documents by genres. The field (or number of fields) to group a set of documents by is specified through the _id key. As part of this stage, we also count the number of documents in each group and store the result in the count field.
As the last step in the above pipeline, we sort the documents by the count field in descending order (-1: descending, 1: ascending).

Use Mongo::Collection#aggregate to run the above query:

2.3.1 :006 > db.collection.aggregate(query).each { |doc| puts doc }
{"_id"=>"Fiction narrative", "count"=>16}
{"_id"=>"Biography/Autobiography", "count"=>16}
...
{"_id"=>"Folklore", "count"=>2}
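
To make the three stages more tangible, the same computation can be sketched in plain Ruby on an in-memory array. This is an illustration of what the pipeline computes, not of how MongoDB executes it; the sample documents and the genre_counts helper are invented for this purpose.

```ruby
# Invented sample documents with the same shape as the generated books.
SAMPLE_BOOKS = [
  { title: 'A', genres: ['Fiction', 'Humor'] },
  { title: 'B', genres: ['Fiction'] },
  { title: 'C', genres: ['Humor', 'Fiction', 'Folklore'] }
].freeze

def genre_counts(books)
  # $unwind: one output document per genre, with genres now a single value.
  unwound = books.flat_map do |book|
    book[:genres].map { |genre| book.merge(genres: genre) }
  end

  # $group: bucket by the genres value and count the members of each bucket.
  grouped = unwound.group_by { |doc| doc[:genres] }
                   .map { |genre, docs| { _id: genre, count: docs.size } }

  # $sort: descending by count (the -1 in the MongoDB query).
  grouped.sort_by { |doc| -doc[:count] }
end

genre_counts(SAMPLE_BOOKS).each { |doc| puts "#{doc[:_id]}: #{doc[:count]}" }
# prints:
# Fiction: 3
# Humor: 2
# Folklore: 1
```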

Number of books published per year

Our second aggregation is slightly more complex: we extract the publication year from each document and count the number of books published each year. This query is instructive because it shows how to extract data from within a document.

query =
  [
    {
      :$project => {
        year: { :$year => '$release_date' }
      }
    },
    {
      :$group => {
        _id:   '$year',
        books: { :$sum => 1 }
      }
    },
    { :$sort => { books: -1 } }
  ]

The query is comprised of three stages:

The $project stage is essentially a map function: each incoming document is transformed into exactly one outgoing document for the next pipeline stage. Compare this with $unwind from the first example: picture a box full of candy bars, each consisting of n pieces. The unwind operation breaks each candy bar into its pieces and puts them back into the box, so more items come out than went in. A projection, by contrast, leaves every bar whole and only rewraps it. In our concrete example, we extract the year from the release_date using MongoDB’s built-in $year aggregation operator and store the result in a field called year. After this stage, each document contains two fields: _id (which is included unless explicitly excluded) and the extracted year.
The $group stage groups the incoming documents by year (recall: the field or fields to group the data by are specified through the _id key), and captures the number of books per year in a new books field.
As the last step in the above pipeline, we sort the documents by the books field in descending order (-1: descending, 1: ascending).

We already know how to run the query, and after the above explanation, it’s also clear what it should return: a list of documents with two fields (_id as the year and books as the number of books published in that year), sorted in descending order by book count.

2.3.1 :007 > db.collection.aggregate(query).each { |doc| puts doc }
{"_id"=>2009, "books"=>12}
{"_id"=>2003, "books"=>12}
...
{"_id"=>2004, "books"=>3}
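
As with the genre example, the stages can be mirrored in plain Ruby to check one’s understanding. The sketch below uses Date#year where MongoDB uses $year; the sample documents and the books_per_year helper are invented for illustration.

```ruby
require 'date'

# Invented sample documents; only release_date matters for this query.
DATED_BOOKS = [
  { title: 'A', release_date: Date.new(2009, 5, 1) },
  { title: 'B', release_date: Date.new(2003, 1, 15) },
  { title: 'C', release_date: Date.new(2009, 11, 30) }
].freeze

def books_per_year(books)
  # $project: one output document per input, keeping only the derived year
  # (MongoDB would also carry the _id along).
  projected = books.map { |book| { year: book[:release_date].year } }

  # $group: bucket by year and count the documents in each bucket.
  grouped = projected.group_by { |doc| doc[:year] }
                     .map { |year, docs| { _id: year, books: docs.size } }

  # $sort: descending by number of books.
  grouped.sort_by { |doc| -doc[:books] }
end

books_per_year(DATED_BOOKS).each { |doc| puts "#{doc[:_id]}: #{doc[:books]}" }
# prints:
# 2009: 2
# 2003: 1
```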

Conclusion

We saw how to install MongoDB and get it running, how easy it is to connect to the database from Ruby, and how queries and aggregations are structured. I hope you find this brief introduction and basic project helpful for your further explorations with Ruby and MongoDB. As a next step, you could try a more sophisticated wrapper around the plain MongoDB interface, such as Mongoid.
