Thursday, August 18, 2016

GSoC 2016 - Add MongoDB to Tajo Storage


The purpose of this blog post is to bring together and describe my contribution to Apache Tajo under the Google Summer of Code 2016 program. It also describes the issues I faced while implementing the MongoDB storage module for Apache Tajo, and the possible and required improvements.

Apache Tajo
Apache Tajo™ is a big data warehouse for Apache Hadoop. The key idea is that it supports SQL (in effect, relational database management) on top of the Hadoop file system in a distributed manner. If you want to know more, start from here.

Introduction to the Project - Add MongoDB Support for Apache Tajo
Does it only work as a data warehouse for Hadoop? Of course not. Because Tajo has a generic structure for reading and writing data, other data storage systems can be connected to it. In other words, Tajo can work as a big data warehouse for other storage systems too. All you need is a storage module for the particular storage system. Tajo already contains default storage modules for HDFS, RDBMSs (for example MySQL and PostgreSQL) and Amazon S3. My project was to implement a storage module for MongoDB, so that users can connect their MongoDB databases to Apache Tajo and run queries on them.


For more details on the module I implemented, you can refer to my blog posts.

Issues to be Solved and Future Development

Table Schema Problem
Apache Tajo can handle two kinds of table spaces.
  • Table spaces that provide metadata
    • Here, the table space itself provides the metadata of its tables, such as:
      • The list of tables in the database
      • Table schemas
      • Table statistics (e.g. the number of rows)
    • The MySQL and PostgreSQL table spaces are examples. They are well-structured database systems that contain all the required metadata, so the table space itself can provide it.
  • File spaces (table spaces that do not provide metadata)
    • Apache Tajo's basic function is to provide SQL on top of Hadoop (or any other file system). File systems can't provide metadata themselves because they do not keep schemas or statistics.
    • Therefore Tajo maintains a catalog that contains the metadata of the table space: schema details, statistics, etc.
MongoDB, on the other hand, does not have a schema, but it can provide some metadata, such as:
  • The list of tables (collections, actually; Tajo tables are mapped to collections in MongoDB)
  • Table statistics
I was therefore encouraged to implement the MongoDB table space as a metadata-providing table space. But MongoDB does not maintain a schema, so the table space can't provide schema details.

In the end, nobody maintains the schema details: the catalog does not maintain them because metadata is supposed to be provided by the table space, and the table space can't maintain them because MongoDB is schemaless.

We need to solve this problem. For reading, it doesn't matter too much, but for data insertion the schema is really important. Because of this issue, even though the appender is implemented, the MongoDB table space still does not support insert queries.

Support Nested Columns
Apache Tajo supports nested columns in its tables. MongoDB also supports nested columns, as embedded documents. The current implementation of tajo-storage-mongodb does not support nested columns. This should be implemented in the future.

Query Testing for Insertion
Because of the schema problem, the MongoDB table space does not support insert queries. All the internal work for inserting data into the MongoDB table space is done, so as soon as the schema problem is resolved, the query tests for insertion should be enabled (the test code is already added).

Improving Performance 
The performance of the module can be improved in various ways. For example, higher-level projections and filters can be pushed down to MongoDB for better performance.
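
As a rough illustration of the push-down idea, the sketch below builds the projection and filter documents that a scanner could hand to MongoDB instead of reading every field of every document. It uses plain Java maps rather than the MongoDB driver so that it runs on its own. The class and method names are hypothetical and are not part of tajo-storage-mongodb.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tajo: shows the shape of the documents
// a scanner could send to MongoDB when projections and filters are pushed down.
final class PushDownSketch {

    // Builds a MongoDB-style projection: 1 includes a field, 0 excludes it.
    static Map<String, Object> projection(List<String> targetColumns) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_id", 0); // _id is returned by default unless it is excluded
        for (String column : targetColumns) {
            doc.put(column, 1);
        }
        return doc;
    }

    // Builds a MongoDB-style filter for "column > value" using the $gt operator.
    static Map<String, Object> greaterThan(String column, Object value) {
        Map<String, Object> op = new LinkedHashMap<>();
        op.put("$gt", value);
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put(column, op);
        return doc;
    }

    public static void main(String[] args) {
        // Roughly what "SELECT name, age FROM people WHERE age > 30" could become.
        System.out.println(projection(Arrays.asList("name", "age")));
        System.out.println(greaterThan("age", 30));
    }
}
```

With documents like these, only the matching rows and the requested fields would travel from MongoDB to Tajo, which is where the performance gain would come from.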

So that's it. :-) 

Last but not least, I want to thank Jaehwa Jung (my mentor), Jihoon Son and the other members of the Apache Tajo community for helping and guiding me so much throughout the project.

I got great exposure to many technologies and concepts. It was one of the best experiences of my life (as a programmer it's the best experience so far ;-) ) and I am planning to contribute more in the future.

Friday, June 24, 2016

Implemented Components at the Mid Term Evaluation - GSoC 2016

It's 25th June and the mid-term evaluation of GSoC is this week. This post is about my progress over the past four weeks. It contains a list of the implemented components and their purposes, along with my understanding of, and plans for, the implementation of the other components.

List of implemented components

The Maven module contains the following Java classes.
  • MongoCollectionReader
  • MongoDocumentDeserializer
In addition, several classes were implemented for testing purposes.
The next section contains a brief description of each class.

ConnectionInfo is the class I use to keep the MongoDB connection info. Its main purpose is to convert the given URI into a MongoClientURI. An object can be created using the static fromURI method, and that object contains the relevant MongoClientURI. The conversion is done by parsing the provided URI and mapping the relevant parameters into a new MongoClientURI object.
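
The parse-and-map step can be sketched roughly as follows. This is only an illustration under a few assumptions: it keeps the parsed host, port and database itself instead of wrapping a MongoClientURI, so that it runs without the MongoDB driver, and the class name ConnectionInfoSketch and the method toMongoUriString are made up for the example.

```java
import java.net.URI;

// Hypothetical stand-in for the ConnectionInfo class described above.
// The real class wraps a MongoClientURI from the MongoDB Java driver;
// this sketch only keeps the parsed parts so it runs without the driver.
final class ConnectionInfoSketch {
    final String host;
    final int port;
    final String database;

    private ConnectionInfoSketch(String host, int port, String database) {
        this.host = host;
        this.port = port;
        this.database = database;
    }

    // Static factory in the spirit of fromURI: parse the URI and map its parts.
    static ConnectionInfoSketch fromURI(URI uri) {
        if (!"mongodb".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Not a mongodb URI: " + uri);
        }
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("No single host in URI: " + uri);
        }
        // 27017 is MongoDB's default port; getPort() returns -1 when none is given.
        int port = uri.getPort() == -1 ? 27017 : uri.getPort();
        // The path has the form "/dbname"; strip the leading slash.
        String path = uri.getPath();
        String database = (path == null || path.length() <= 1) ? null : path.substring(1);
        return new ConnectionInfoSketch(host, port, database);
    }

    // Rebuilds a single-host MongoDB connection string from the parsed parts.
    String toMongoUriString() {
        return "mongodb://" + host + ":" + port + (database == null ? "" : "/" + database);
    }

    public static void main(String[] args) {
        ConnectionInfoSketch info = fromURI(URI.create("mongodb://localhost/tajo_test"));
        System.out.println(info.toMongoUriString());
    }
}
```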

MongoDBFragment is an extension of the Fragment abstract class in the storage-common module. A fragment is like a split in MapReduce. In other words, the table space divides the data set into splits and sends them to be processed separately. A fragment has a start_key and an end_key, which define its range. I haven't implemented that functionality yet, because at the moment the table space takes the whole data set as a single fragment.

MongoDBFragmentSerde is the Java class used to serialize the MongoDBFragment class. A fragment has to be sent to remote nodes in the cluster for processing, so it must be possible to serialize it into a string and send it. This is done using Google's Protocol Buffers. For that, a Serde class is needed that can serialize and deserialize the fragment class.
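
To make the fragment and serde idea concrete, here is a minimal sketch of a fragment with a key range and a round trip through a string. Note that it swaps Protocol Buffers for a simple Base64-based text encoding so that it needs nothing beyond the JDK. The class names, the field names and the meaning given to empty keys are assumptions for the example, not the module's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical fragment: one slice of a collection, bounded by a key range.
final class FragmentSketch {
    final String collection;
    final String startKey; // assumed: "" means "from the beginning"
    final String endKey;   // assumed: "" means "to the end"

    FragmentSketch(String collection, String startKey, String endKey) {
        this.collection = collection;
        this.startKey = startKey;
        this.endKey = endKey;
    }
}

// Hypothetical serde: turns a fragment into a string and back.
// The real module uses Protocol Buffers; this uses Base64 fields joined by '.'.
final class FragmentSerdeSketch {

    private static String enc(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    private static String dec(String s) {
        return new String(Base64.getDecoder().decode(s), StandardCharsets.UTF_8);
    }

    static String serialize(FragmentSketch f) {
        // '.' never appears in Base64 output, so it is a safe separator.
        return enc(f.collection) + "." + enc(f.startKey) + "." + enc(f.endKey);
    }

    static FragmentSketch deserialize(String s) {
        // The -1 limit keeps trailing empty fields (empty start or end keys).
        String[] parts = s.split("\\.", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed fragment: " + s);
        }
        return new FragmentSketch(dec(parts[0]), dec(parts[1]), dec(parts[2]));
    }

    public static void main(String[] args) {
        // The whole collection as a single fragment, as in the current implementation.
        FragmentSketch whole = new FragmentSketch("people", "", "");
        FragmentSketch copy = deserialize(serialize(whole));
        System.out.println(copy.collection + " [" + copy.startKey + ", " + copy.endKey + ")");
    }
}
```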

Metadata Provider provides metadata about the table space. It has two important methods, getTables() and getTableDesc(). Since a Mongo collection is mapped to a table in Tajo, getTables() returns the list of collections. getTableDesc() returns the table details, such as table stats and column descriptions. Here lies a problem: since MongoDB is schemaless I can't provide column descriptions, but one of my mentors thought it would be better if I implemented a metadata provider.

MongoDBTableSpace extends the TableSpace class to support the Mongo storage system. It is the base class of this module. The MongoDB table space is defined using this class, and it also has the create table and purge table methods.

Since I implemented a metadata provider, I was planning to implement this as a normal storage system that provides a schema, like JDBC. There was an error that I couldn't figure out, so I asked for help from one of my mentors, Jihoon (he is the one who implemented example-storage-module). At the moment we are both trying to figure out a solution. Meanwhile, I wanted to implement the scanner anyway, so I created a new branch, TAJO-2079-FileSpace, which supports MongoDB as a file storage system. Today I was able to run it. The scanner works fine and can read data from Mongo collections. Select queries work. Composite attributes are still not supported.

MongoCollectionReader is quite similar to the JSON line reader in the example storage module. It provides iteration over the documents of a collection. It returns results as Tuples according to the target projection.

MongoDocumentDeserializer is used to parse a given document and convert it into a Tuple. At the moment it is implemented in a similar way to the JSON deserializer. Mongo documents can be converted to a JSON string easily, so what this class currently does is convert the document into a JSON object and then convert that. This should be improved for performance in the future.
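
A direct conversion that skips the JSON round trip could look roughly like the sketch below. It is only an illustration under some assumptions: a plain Map stands in for the Mongo document (the driver's Document class also implements Map<String, Object>), an Object[] stands in for a Tajo Tuple, and the class and method names are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the module's deserializer: maps a document
// to a row of values in the order given by the target projection.
final class DocumentToTupleSketch {

    // Picks the target columns out of the document, in projection order.
    // Fields missing from the document become null, because documents in
    // one collection do not have to share the same fields.
    static Object[] toTuple(Map<String, Object> document, List<String> targetColumns) {
        Object[] tuple = new Object[targetColumns.size()];
        for (int i = 0; i < tuple.length; i++) {
            tuple[i] = document.get(targetColumns.get(i));
        }
        return tuple;
    }

    public static void main(String[] args) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("_id", 1);
        document.put("name", "Alice");
        document.put("age", 30);

        // The projection asks for a field that this document does not have.
        Object[] tuple = toTuple(document, Arrays.asList("name", "city"));
        System.out.println(Arrays.toString(tuple));
    }
}
```

A real deserializer would also have to convert each value to the Tajo type of its column, which this sketch leaves out.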

In the next section I describe the testing approach.
MongoDBTestServer creates and hosts a MongoDB instance on localhost for testing. It also loads data from JSON files in the data set and creates Mongo collections. The Mongo instance is created using flapdoodle.embed.mongo. When it runs for the first time, it downloads MongoDB for the relevant platform. Jihoon thinks this can be a problem for continuous integration if it doesn't happen quickly. Finally, it registers the MongoDB table space in Tajo for testing.
This is a simple unit test class for the ConnectionInfo class. It does not use a MongoDBTestServer instance.
This checks the functionality of MetadataProvider. It uses the server and gets the metadata provider for databases through the table space. In the TAJO-2079-FileSpace branch this test is disabled, because file spaces do not provide metadata.
This class tests the functionality of the table space. It tests the handler, the create and delete functionality, and methods such as getMetadataProvider.

This class tests the table space with queries. Its parent class is QueryTestCaseBase. It runs simple tests using query files and compares the results with result files. At the moment it only runs select queries.

That is the description of currently implemented modules. 

Regarding the current situation of the project:
The original branch for this issue is TAJO-2079, but as I mentioned in the scanner section, there was an issue with the metadata-providing table space. Therefore I decided to create it as a file table space for now and improve it to support metadata in the future. So I created a new branch, TAJO-2079-FileSpace. It contains the Mongo table space as a file storage system (Tajo itself keeps all the metadata), and I implemented the scanner for that. To see the scanner, you have to visit the TAJO-2079-FileSpace branch. Since a file space does not have a metadata provider, to see the functionality of MetadataProvider you have to visit the TAJO-2079 branch.

The next blog post will be about how MongoDB is mapped to Tajo tables. :)

Monday, May 30, 2016

Automate the build and run in local machine - GSoC 2016

Wrote a small bash script today. Just a few lines. Specifically, it does these things:
  • Builds the tajo-storage-mongodb module and copies the .jar file to the snapshot. The snapshot is already configured to use Mongo storage.
  • Removes the logs.
  • Starts Tajo.
  • Waits a little and opens the log file with gedit.
  • Stops Tajo.
Lol. It is really a small script, but it simplified my work a lot.

Anyway, I was able to run Tajo with this configuration.
Of course the table space doesn't do anything yet, but seeing something like this makes me really happy. :D

The First Week ( GSoC )

The beginning of the coding period was not actually as rushed as expected. This week was allocated to discussing the architecture of the module with my mentors. Actually, that was done long before. Of course there are still questions regarding the architecture, but they can't be solved beforehand. They will be solved during the implementation. It's agile, guys!

Project at the moment

Created a new module for the MongoDB storage plugin, which I am going to implement throughout the summer ;)
Created the following main classes by implementing the relevant interfaces and abstract classes.
  • MongoDbTableSpace
  • MongoDbFragment
  • MongoDbScanner 
  • MongoDbAppender
Also implemented a class called ConnectionInfo to keep the MongoDB connection info. When I implemented it, I copied a lot from the JDBC connection info class. Thank you blrunner. Hope you will not be mad at me about that. ;)

Problems and Solutions

Let's discuss some questions that came up in the first week. The first question was about the newly created module. When I built it using the mvn command, it said the module was built successfully, but the relevant jar was not in the snapshot. I couldn't find out why. I built it several times (around 10 times), changing the pom.xml file each time. The problem was not with the mvn configuration. The module was built in the module directory, but it also has to be copied into the snapshot directory. That is done by a command in the pom.xml of the tajo-storage module. Anyway, I added the line and it started to work fine.

The next question is replication. It is something complex. ;) The thing is that in the configuration for HDFS you can provide multiple hosts. MongoDB can also have multiple hosts, as replicas. Should the storage plugin I write include that functionality? If so, how the URI is passed becomes a question. The details of a table space are given as a URI. By default, a Java URI doesn't allow multiple hosts. Then how does HDFS do that? It is something to be studied.
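
The multiple-hosts limitation can be seen with a small experiment. In the sketch below, java.net.URI accepts a comma-separated host list, but it can no longer identify a single server, so getHost() returns null and only the raw authority string is left to split by hand. The class and method names are made up for the example, and this is not necessarily how HDFS or the final module handles it.

```java
import java.net.URI;
import java.util.Arrays;

// Hypothetical sketch: what java.net.URI does with a multi-host MongoDB URI.
final class MultiHostUriSketch {

    // Splits the raw authority ("host1:27017,host2:27017") into host:port pairs.
    static String[] hosts(URI uri) {
        String authority = uri.getAuthority();
        if (authority == null) {
            throw new IllegalArgumentException("URI has no authority: " + uri);
        }
        // Drop "user:password@" if it is present.
        int at = authority.lastIndexOf('@');
        if (at >= 0) {
            authority = authority.substring(at + 1);
        }
        return authority.split(",");
    }

    public static void main(String[] args) {
        URI single = URI.create("mongodb://host1:27017/db");
        URI multi = URI.create("mongodb://host1:27017,host2:27017/db");

        System.out.println(single.getHost());              // the single host is parsed
        System.out.println(multi.getHost());               // no single host can be parsed
        System.out.println(Arrays.toString(hosts(multi))); // manual split of the authority
    }
}
```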


I set up Travis for my GitHub account. It is cool. I mean great. It can be named as one of the coolest things provided on the internet. Travis automatically builds the projects in my GitHub repositories. We can configure it with .travis.yml. And the best thing is that it is completely free for open source projects. :D :D

Sunday, May 8, 2016

The Simple Contribution - GSoC 2016

Got a reply from the mongo community. Seems like I have to learn a lot about mapping document based databases to column based databases.

Yesterday something marvelous happened. My mentor asked me to do a commit. Actually, he told me how to do it. At first I couldn't even understand the issue, but he explained it really well. I made the changes in my repo yesterday, and today I made the pull request. Lol, I should have done it yesterday, but I had doubts. Anyway, Travis the bot is running the tests automatically. I don't have to worry about that. 😂 I think what I edited does not affect the unit tests or integration tests, but Jaehwa said that after testing with a MariaDB server, he'll prepare to commit my patch.

I also got an email from a student (Subashini Hariharan) who is doing her Master's. She wants to add a Cassandra plugin for Tajo as her Master's project. I don't know whether it is enough for that, but I think it's a great idea. I think it is possible and will be easier to do than MongoDB. So I introduced her to my mentors.

I still need to understand the storage module architecture. It can't be that complicated. I want to understand how to map Mongo collections to Tajo tables. The thing is that we don't have much time. So many assignments and submissions. I am going to go through the storage module again today. That's it for today.

Wednesday, April 27, 2016

Moving On - GSoC 2016

So far, I have filled in the Google tax form. It got rejected once and I submitted it again. I hope it will not be rejected this time. On the development side there is still no progress. The thing is that I am busy with a lot of academic work these days. Poor me!

I also have interviews for internships next semester. What a tragedy. Anyway, I have to prepare for those too.

But I found a way to deal with all of this. Part of my semester project is to develop a web service that collects data from a mobile application. Yeah, I have to develop the mobile app too ;)
I chose to develop the web service using Java and MongoDB. ;-)
Now I have to use the Mongo driver for that. I am using it as a pre-project to get familiar with the Mongo Java driver. Today I tried it with several applications and it works fine. I am positive. :D :D

Now I am gonna connect a MySQL database on my local machine to the Tajo in the virtual machine. ;) Thanks to VMware, which has a virtual network card too, I can just ping from my physical machine to the virtual machine. Funny. ;) I am following Jaehwa's docs.

Finally, the GSoC Payoneer account thing is really complicated. I always have doubts about it. What to do. This looks like a diary, doesn't it? Never mind. It's public anyway.

Saturday, April 23, 2016

The Story of the Mouse

This is another one I brought over from somewhere else. :D
It's hard to say exactly where the idea about a mouse came into my head from. As far as I remember, it came from that rhyme, "a thieving cat _ _ _ _, a little mouse, takas takas". So somehow the idea got into my head. After that, the chain of thought went down some crazy paths and stopped at a really wonderful place.
I think many of you have heard the saying that habit is greater than birth. There was a dispute about this between two famous monks who lived in our country, Ven. Weedagama Maithree Thero and Ven. Thotagamuwe Sri Rahula Thero. Rahula Thero was a pupil of Weedagama Thero. Anyway, this is the story. Weedagama Thero had trained a cat to hold the lamp. That is, while he was working, the cat would hold the lamp. So Weedagama Thero said that habit is greater than birth. The pupil, Rahula Thero, who had seen this, took a mouse with him one day when he went to his teacher's chamber and let it go. There is no need to spell out what a cat does when it sees a mouse. It threw the lamp aside and chased after the mouse. And so, they say, Rahula Thero proved that birth is greater.
This is mentioned in books too. Now the question that came to me is what happened to that mouse. Did it get caught by the cat? Or did it escape with its life? Nobody knows. To tell the truth, there is no mention of that mouse in the books after that. Well, that is no surprise. Who thinks about a mouse? Anyway, that mouse must have earned a great deal of merit to be included in the story even that much. "Anyway, never mind, well done mouse, you went down in history, so what if you got caught by the cat and died."
(Now don't ask me about the cat. The same goes for it. Though it probably wasn't killed by the mouse.)
So we all live in different ways. We die and go. Some go down in history. Most people don't make it into history even as much as that mouse did. That is not our fault. Not everyone can go down in history.
Okay, that's it. May all of you read this and think whatever you like, each according to your own wisdom. Merit to all of you who read this.