Thursday, August 18, 2016

GSoC 2016 - Add MongoDB to Tajo Storage

Introduction

The purpose of this blog post is to bring together and describe my contribution to Apache Tajo under the Google Summer of Code 2016 program. It also describes the issues I faced while implementing the MongoDB storage module for Apache Tajo, and the improvements that are possible or required.

Apache Tajo
Apache Tajo™ is a big data warehouse for Apache Hadoop. The key idea is that it supports SQL (in effect, a relational database management system) on top of the Hadoop file system in a distributed manner. If you are interested in learning more, start from here.


Introduction to the Project - Add MongoDB Support for Apache Tajo
Does it only work as a data warehouse for Hadoop? Of course not. Since Tajo has a generic structure for reading and writing data, other data storage systems can be connected to it. In other words, Tajo can work as a big data warehouse for other storage systems too. All you need is a storage module for the particular storage system. Tajo already contains default storage modules for HDFS, RDBMSs (for example MySQL and PostgreSQL), and Amazon S3. My project was to implement a storage module for MongoDB, so that users can connect their MongoDB databases to Apache Tajo and run queries on them.


Commitment

For more details on the module I implemented, you can refer to my blog posts.
 

Issues to be Solved and Future Development


Table Schema Problem
Apache Tajo can handle two kinds of table spaces.
  • Metadata-providing table spaces
    • Here, the table space itself provides the metadata of its tables, such as:
      • List of tables in the database
      • Table schemas
      • Table statistics (e.g. number of rows)
    • Examples are the MySQL and PostgreSQL table spaces. These are well-structured database systems that hold all the required metadata, so the table space itself can provide it.
  • File spaces (table spaces which do not provide metadata)
    • Apache Tajo's core functionality is to provide SQL on top of Hadoop (or any other file system). File systems can't provide metadata themselves because they do not keep schemas or statistics.
    • Therefore Tajo maintains a catalog which contains the metadata of the table space: schema details, statistics, etc.
MongoDB, on the other hand, does not have a schema, but it can provide some metadata, such as:
  • The list of tables (collections, actually; Tajo tables are mapped to collections in MongoDB)
  • Table statistics
Therefore it was recommended that the MongoDB table space be implemented as a metadata-providing table space. But MongoDB does not maintain a schema, so schema details can't be provided by the table space.

In the end, nobody maintains the schema details: the catalog does not maintain them because metadata is provided by the table space, and the table space can't maintain them because MongoDB is schemaless.
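To make the gap concrete, here is a small Python sketch (not part of tajo-storage-mongodb) that looks at a few documents from one collection and records which value types appear under each field. The sample documents and the helper name `infer_field_types` are made up for illustration; the point is only that a collection alone does not dictate a single column type per field.

```python
from collections import defaultdict


def infer_field_types(documents):
    """Collect, for every field name, the set of Python type names seen."""
    seen = defaultdict(set)
    for doc in documents:
        for field, value in doc.items():
            seen[field].add(type(value).__name__)
    return dict(seen)


# Hypothetical documents that could all live in the same collection.
sample = [
    {"_id": 1, "name": "alice", "age": 30},
    {"_id": 2, "name": "bob", "age": "thirty-one"},
    {"_id": 3, "name": "carol", "email": "carol@example.com"},
]

field_types = infer_field_types(sample)

# Fields observed with more than one type have no single column type.
ambiguous = sorted(f for f, types in field_types.items() if len(types) > 1)
print(field_types)
print("ambiguous fields:", ambiguous)
```

In this sample, `age` appears both as a number and as a string, and `email` exists in only one document, so any schema derived this way is a guess that somebody still has to store.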

We need to solve this problem. For reading it doesn't matter much, but for data insertion the schema is really important. Because of this issue, even though the appender is implemented, the MongoDB table space still does not support insert queries.
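One way to see why insertion depends on the schema: a row arrives as an ordered list of values, and turning it into a MongoDB document requires a field name for every position. The Python sketch below is only an illustration under that assumption; `row_to_document` and the column names are hypothetical and are not the appender's actual code.

```python
def row_to_document(column_names, row):
    """Pair each positional value of a row with its column name."""
    if len(column_names) != len(row):
        raise ValueError(
            f"row has {len(row)} values but schema has {len(column_names)} columns"
        )
    return dict(zip(column_names, row))


# With a schema, the row becomes a well-formed document.
schema = ["id", "name", "age"]  # hypothetical column names
document = row_to_document(schema, (1, "alice", 30))
print(document)
```

Without the list of column names there is nothing to use as document keys, which is why the missing schema blocks insert queries while reads can still work.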

Support Nested Columns
Apache Tajo supports nested columns in its tables. MongoDB also supports nested columns, as embedded documents. The current implementation of tajo-storage-mongodb does not support nested columns. This should be implemented in the future.
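As a rough picture of what this mapping involves, the Python sketch below walks an embedded document and lists its leaf fields using MongoDB's dot notation (for example `address.city`). The helper `flatten` and the sample document are hypothetical; this is not how tajo-storage-mongodb works today, only an illustration of the nested fields a nested column would have to cover.

```python
def flatten(document, prefix=""):
    """Return a flat dict mapping dotted field paths to leaf values."""
    flat = {}
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Embedded document: recurse and keep the parent path as prefix.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical document with two levels of embedding.
person = {
    "name": "alice",
    "address": {"city": "Colombo", "geo": {"lat": 6.9, "lon": 79.8}},
}
flat_person = flatten(person)
print(flat_person)
```

Arrays are left untouched here; a real implementation would also need to decide how to represent them.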

Query Testing for Insertion
Because of the schema problem, the MongoDB table space does not support insert queries. All the internal work for data insertion into the MongoDB table space is done, so as soon as the schema problem is resolved, the query tests for insertion should be enabled (the test code is already added).

Improving Performance 
The performance of the module can be improved in various ways. For example, higher-level projections and filtering can be pushed down to MongoDB for better performance.
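To sketch what push-down could look like, the Python snippet below turns a list of selected columns and simple comparison predicates into the filter and projection documents that a MongoDB find query accepts. The function name `build_find_arguments` and the tuple format for predicates are assumptions made for this example; only the query operators (`$eq`, `$gt`, and so on) and the projection form come from MongoDB itself.

```python
# SQL comparison operators mapped to MongoDB query operators.
OPERATORS = {
    "=": "$eq", "!=": "$ne",
    ">": "$gt", ">=": "$gte",
    "<": "$lt", "<=": "$lte",
}


def build_find_arguments(columns, predicates):
    """Build (filter, projection) documents for a MongoDB find query.

    predicates is a list of (column, operator, value) tuples that are
    combined with AND, as in a simple SQL WHERE clause.
    """
    filter_doc = {}
    for column, op, value in predicates:
        filter_doc.setdefault(column, {})[OPERATORS[op]] = value

    projection = {column: 1 for column in columns}
    if "_id" not in projection:
        projection["_id"] = 0  # _id is returned unless explicitly excluded
    return filter_doc, projection


# Roughly: SELECT name, age FROM people WHERE age > 30 AND age <= 40
filter_doc, projection = build_find_arguments(
    ["name", "age"], [("age", ">", 30), ("age", "<=", 40)]
)
print(filter_doc)
print(projection)
```

With a driver such as PyMongo, these two documents could be passed as `collection.find(filter_doc, projection)`, so MongoDB returns only the matching documents and the requested fields instead of the whole collection.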

So that's it. :-) 

Last but not least, I want to thank Jaehwa Jung (my mentor), Jihoon Son, and the other members of the Apache Tajo community for their help and immense guidance throughout the project.

I gained great exposure to many technologies and concepts. It was one of the best experiences of my life (as a programmer, it's the best experience so far ;-) and I am planning to contribute more in the future.