How I fell in love with MongoDB

Today I want to tell you the story of how I met MongoDB and the whole NoSQL world.

It was November 2014. I was attending a Web Information Retrieval class and discussing with a PhD student the projects I had to work on that semester. "You know," I said, "I have a thing for Twitter, I really love analyzing that data and extracting any kind of information from it. I would like to do something cool for my next projects, but I have trouble storing the data... I can't always dump it into a TSV file; that way, analyzing huge quantities of data becomes a nightmare." "MongoDB," he replied. "What?" "Just Google it. It stores data in a format similar to JSON, the native data type of Twitter." That said, he left, leaving me with a very confused look on my face and the word "MongoDB" written in my notebook.

A couple of months later I started working on my project for the Big Data class. We had to pick one of the technologies presented in the course, study it and then present a demo. The choice was quite wide: we could pick anything from data warehouses to non-relational databases and other tools. I felt like a kid in a candy shop. I did some research on the tools suggested during the class, but none of them completely satisfied me: data warehouses were too relational, and I was looking for something new and different from the usual SQL stuff, something like a NoSQL database. In the end I narrowed my choice down to three DBMSs: Cassandra with its columns, OrientDB with its graph model, or MongoDB with its documents.
MongoDB took the win. It was the most scalable of the three, DB-Engines ranked it as the best non-relational database, it was perfect for storing my Twitter data and, last but not least, green is my favourite color. So I gave it a try. In less than a day I had set up everything on my laptop and was already collecting tweets in my database using a Python driver. Following an easy tutorial, I also started running some queries, from the basic find to the trickiest aggregations. In the end it took me more time to prepare the slides for presenting my project than to develop the project itself.
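To give you an idea of how little code that workflow takes, here is a minimal sketch of the kind of thing I was doing with the Python driver (pymongo). The database and collection names, and the sample tweet, are made up for illustration; they are not from the original project.

```python
# Minimal pymongo sketch: store tweets as JSON-like documents, then query them.
# Assumes a MongoDB instance running locally; names below are illustrative only.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
tweets = client["twitter_demo"]["tweets"]

# Tweets arrive from the API as JSON, so they can be inserted almost as-is.
tweets.insert_one({
    "user": {"screen_name": "alice"},
    "text": "Trying out MongoDB for my Big Data project",
    "lang": "en",
    "retweet_count": 3,
})

# A basic find: all English tweets.
for doc in tweets.find({"lang": "en"}):
    print(doc["text"])

# A simple aggregation: count tweets per user, most active first.
pipeline = [
    {"$group": {"_id": "$user.screen_name", "n": {"$sum": 1}}},
    {"$sort": {"n": -1}},
]
for row in tweets.aggregate(pipeline):
    print(row["_id"], row["n"])
```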
During the exam I told my professor that I felt really guilty for working so little and letting MongoDB do all the work. He smiled and replied: "You're not cheating, you have just picked the perfect DB for your purposes. That is what I want you to learn: how to master big data analysis by picking the right tools."


In March 2015 I started working on my master's thesis. I had several meetings with my advisor, trying to find the perfect topic for my work. I wanted to do a thesis on data integration, and my first requirement was that it had to be a really cool topic, no matter the amount of work.
The data integration research at my department (which, if you're wondering, is the department of Engineering in Computer Science at the University of Rome "La Sapienza") mainly focuses on ontology-based data access (OBDA); the group has also developed one of the best-known OBDA tools, Mastro, which works (or, spoiler alert, I should say worked) only with relational sources.
For those who have never heard of ontology-based data access, let me just say that it is a paradigm that allows you to access data through a high-level query language, without caring about the data's native query language: by means of some mappings, you can query a database without knowing anything about SQL. What you're actually doing is querying the ontology that the OBDA system builds for you on top of the data provided by the source.
The ontologies are usually expressed in RDF, so what Mastro does is link relational data to RDF by means of mappings written in a dedicated mapping language (R2RML).
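Just to make the idea concrete, here is a toy sketch of what "querying the ontology instead of the source" feels like. This is not how Mastro works internally (a real OBDA system keeps the data virtual and rewrites the query rather than materializing triples), and all names here are invented; it only illustrates the mapping-then-query idea in Python with rdflib.

```python
# Toy illustration of the OBDA idea, not an actual Mastro workflow:
# rows from a relational table are mapped to RDF triples, and the user
# only writes SPARQL over the ontology vocabulary, never SQL.
from rdflib import Graph, Namespace, Literal, URIRef

EX = Namespace("http://example.org/university#")
graph = Graph()

# Pretend these rows came from a relational table STUDENT(id, name).
rows = [(1, "Ada"), (2, "Grace")]

# The "mapping": every row becomes a subject with an ex:hasName property.
for student_id, name in rows:
    subject = URIRef(f"http://example.org/university/student/{student_id}")
    graph.add((subject, EX.hasName, Literal(name)))

# The user queries the ontology, knowing nothing about the SQL schema.
results = graph.query("""
    PREFIX ex: <http://example.org/university#>
    SELECT ?name WHERE { ?s ex:hasName ?name }
""")
for (name,) in results:
    print(name)
```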

Now let me get back to my thesis: can there be anything cooler than ontology-based data access? Yes, there is: ontology-based data access with non-relational sources! This is a brand new branch of data integration; just consider that the first papers on the topic were published in February 2016. Initially my advisor suggested that I use RDF as the language for the data source, so that mapping RDF to RDF would be quite an easy job. But I never liked easy stuff and, more importantly, by that time my heart already belonged to MongoDB, so I asked my advisor to let me try using a document-oriented database as the source.
To make a long story short: I defined "Major", a mapping language that puts JSON queries in correspondence with RDF triples, and I designed an unfolding algorithm that exploits these mappings to "translate" a conjunctive query over RDF into a JSON query that can be executed on MongoDB. Moreover, my timing could not have been better, because I worked on the algorithm in November 2015, right after the release of MongoDB 3.2 with its brand new $lookup operator. This operator turned out to be the key to my algorithm, as it allowed me to perform join-like operations across collections.
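For readers who have never seen it, here is a sketch of the kind of join-like aggregation that $lookup (introduced in MongoDB 3.2) makes possible. The collection and field names are hypothetical examples of mine, not the actual output produced by the unfolding algorithm described above.

```python
# Sketch of a $lookup-based "join" between two collections in pymongo.
# Collection and field names are illustrative, not from the thesis itself.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["thesis_demo"]

pipeline = [
    # For each document in "students", pull in the matching "exams" documents,
    # much like a left outer join in SQL.
    {"$lookup": {
        "from": "exams",
        "localField": "student_id",
        "foreignField": "student_id",
        "as": "exams",
    }},
    {"$unwind": "$exams"},
    {"$project": {"_id": 0, "name": 1, "exams.course": 1, "exams.grade": 1}},
]

for doc in db["students"].aggregate(pipeline):
    print(doc)
```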

I won't go further into the details of my thesis for now, but I'll be glad to tell you more if you're interested. Let me just say that in the end the implementation really worked, and I'm really proud of the results I achieved with this work. It turned out that, as my professor said, MongoDB really is the perfect DB for my purposes!

In the picture you can see me on 26 January 2016, defending my thesis and telling the world how amazing the NoSQL world is!


 
