Friday, June 24, 2016

Implemented Components at the Mid Term Evaluation - GSoC 2016

It's 25th June, and the GSoC mid-term evaluation is this week. This post covers my progress over the past four weeks: it lists the components I have implemented and their purposes, then describes my current understanding and my plans for the remaining components.

List of implemented components

The Maven module (https://github.com/janakact/tajo/tree/TAJO-2079-FileSpace/tajo-storage/tajo-storage-mongodb) contains the following Java classes.
  • ConnectionInfo.java
  • MongoDBFragment.java
  • MongoDBFragmentSerde.java
  • MongoDBMetadataProvider.java
  • MongoDBTableSpace.java
  • MongoDBScanner.java
  • MongoCollectionReader.java
  • MongoDocumentDeserializer.java
Further, the following classes were implemented for testing purposes.
  • MongoDBTestServer.java
  • TestConnectionInfo.java
  • TestMetadataProvider.java
  • TestMongoDBTableSpace.java
  • TestMongoDBQueryTest.java
The next section contains a brief description of each class listed above.

ConnectionInfo
ConnectionInfo is the class I use to hold the MongoDB connection details. Its main purpose is to convert a given URI into a MongoClientURI. An object is created using the static fromURI method, and the resulting object contains the relevant MongoClientURI. The conversion is done by parsing the provided URI and mapping the relevant parameters into a new MongoClientURI object.
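
To make the parsing step concrete, here is a minimal sketch that uses only the JDK. It is not the actual ConnectionInfo class: the class name ConnectionInfoSketch, the "user" and "password" query parameters, and the toMongoConnectionString method are assumptions made for illustration, and the real class builds a MongoClientURI from the MongoDB Java driver rather than a plain string.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names and parameter keys are assumptions, not the module's API.
final class ConnectionInfoSketch {
    final String host;
    final int port;
    final String database;
    final Map<String, String> params;

    private ConnectionInfoSketch(String host, int port, String database,
                                 Map<String, String> params) {
        this.host = host;
        this.port = port;
        this.database = database;
        this.params = params;
    }

    // Parses a URI such as mongodb://host:port/db?user=u&password=p.
    static ConnectionInfoSketch fromURI(String uri) {
        URI parsed = URI.create(uri);
        if (!"mongodb".equals(parsed.getScheme())) {
            throw new IllegalArgumentException("Not a mongodb URI: " + uri);
        }
        // Fall back to MongoDB's default port when none is given.
        int port = parsed.getPort() == -1 ? 27017 : parsed.getPort();
        String path = parsed.getPath();
        String database = (path == null || path.length() <= 1) ? "" : path.substring(1);

        // Split the query string into key=value pairs.
        Map<String, String> params = new LinkedHashMap<>();
        String query = parsed.getQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    params.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        return new ConnectionInfoSketch(parsed.getHost(), port, database, params);
    }

    // Maps the parsed pieces onto the standard MongoDB connection string layout.
    String toMongoConnectionString() {
        StringBuilder sb = new StringBuilder("mongodb://");
        String user = params.get("user");
        String password = params.get("password");
        if (user != null && password != null) {
            sb.append(user).append(':').append(password).append('@');
        }
        sb.append(host).append(':').append(port).append('/').append(database);
        return sb.toString();
    }
}
```

In the real module, the string produced at the end would presumably be handed to the MongoClientURI constructor instead of being returned as is.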

MongoDBFragment
MongoDBFragment extends the Fragment abstract class in the storage-common module. A fragment is like a split in MapReduce: the table space divides the data set into fragments and sends them to be processed separately. Each fragment has a start_key and an end_key, which define its range. I haven't implemented that functionality yet, because at the moment the table space treats the whole data set as a single fragment.
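
As an illustration of what range-based fragments could look like once they are implemented, here is a small sketch. It assumes numeric keys and an even split, neither of which comes from the module; KeyRangeFragment and split are hypothetical names, and the class does not extend Tajo's Fragment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of key-range splitting; not the module's MongoDBFragment.
final class KeyRangeFragment {
    final long startKey; // inclusive
    final long endKey;   // exclusive

    KeyRangeFragment(long startKey, long endKey) {
        this.startKey = startKey;
        this.endKey = endKey;
    }

    // Divides [startKey, endKey) into at most `parts` contiguous fragments.
    static List<KeyRangeFragment> split(long startKey, long endKey, int parts) {
        if (parts < 1 || endKey < startKey) {
            throw new IllegalArgumentException("Invalid range or part count");
        }
        List<KeyRangeFragment> fragments = new ArrayList<>();
        long total = endKey - startKey;
        long base = total / parts;
        long remainder = total % parts;
        long cursor = startKey;
        for (int i = 0; i < parts; i++) {
            // The first `remainder` fragments take one extra key each.
            long size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                continue; // skip empty fragments when parts exceeds the key count
            }
            fragments.add(new KeyRangeFragment(cursor, cursor + size));
            cursor += size;
        }
        return fragments;
    }
}
```

Calling split with parts set to 1 reproduces the current behaviour, where the whole data set is one fragment.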

MongoDBFragmentSerde
MongoDBFragmentSerde is the Java class used to serialize the MongoDBFragment class. Fragments have to be sent to remote nodes for processing, so they must be serializable into a string that can be sent. This is done using Google's Protocol Buffers, which needs a Serde class that can serialize and deserialize the fragment class.
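
The serialize and deserialize round trip can be sketched as follows. Note that this sketch deliberately swaps Protocol Buffers for the JDK's DataOutputStream plus Base64 so that it runs without generated classes; SimpleFragment, its fields, and FragmentSerdeSketch are hypothetical names, not the module's types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;

// Hypothetical stand-in for a fragment serde; the real one relies on Protocol Buffers.
final class FragmentSerdeSketch {

    static final class SimpleFragment {
        final String collection;
        final String startKey;
        final String endKey;

        SimpleFragment(String collection, String startKey, String endKey) {
            this.collection = collection;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // Writes the fields in a fixed order and encodes the bytes as a string.
    static String serialize(SimpleFragment fragment) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(fragment.collection);
            out.writeUTF(fragment.startKey);
            out.writeUTF(fragment.endKey);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reads the fields back in the same order they were written.
    static SimpleFragment deserialize(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            return new SimpleFragment(in.readUTF(), in.readUTF(), in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the sketch is only the contract: whatever serialize produces on one node, deserialize must be able to rebuild into an equivalent fragment on another.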

MongoDBMetadataProvider
The metadata provider supplies metadata about the table space. Its two important methods are getTables() and getTableDesc(). Since a Mongo collection is mapped to a table in Tajo, getTables() returns the list of collections. getTableDesc() returns the table details, such as table stats and column descriptions. Here lies a problem: since MongoDB is schemaless, I can't provide column descriptions. Even so, one of my mentors thought it would be better if I implemented a metadata provider.

MongoDBTableSpace
MongoDBTableSpace extends the TableSpace class to support the MongoDB storage system. It is the base class of this module. A MongoDB table space is defined using this class, and it also provides the create table and purge table methods.

MongoDBScanner
Since I had implemented a metadata provider, I was planning to implement this as a normal storage system that provides a schema, like JDBC. There was an error that I couldn't figure out, so I asked one of my mentors, Jihoon, for help (he is the one who implemented the example-storage-module). At the moment we are both trying to find a solution. Meanwhile, I wanted to implement the scanner anyway, so I created a new branch, TAJO-2079-FileSpace, which supports MongoDB as a file storage system. Today I was able to run it. The scanner works fine and can read data from Mongo collections. Select queries work, but composite attributes are not supported yet.

MongoCollectionReader
This class is quite similar to the JSON line reader in the example storage module. It provides iteration over the documents of a collection and returns the results as Tuples according to the target projection.

MongoDocumentDeserializer
This class parses a given document and converts it into a Tuple. At the moment it is implemented much like the JSON deserializer. Mongo documents can easily be converted to a JSON string, so what this class currently does is convert the document into a JSON object and then convert that into a Tuple. This should be improved for performance in the future.
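
The idea of turning a document into a tuple according to a target projection can be sketched like this. The sketch models a document as a plain Map and a tuple as an Object array, and it skips the intermediate JSON step described above; DocumentToTupleSketch and toTuple are hypothetical names, and returning null for nested values is an assumption that mirrors the fact that composite attributes are not supported yet.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real deserializer works on Mongo documents and Tajo Tuples.
final class DocumentToTupleSketch {

    // Picks the projected fields from the document, in projection order.
    static Object[] toTuple(Map<String, Object> document, List<String> targetColumns) {
        Object[] tuple = new Object[targetColumns.size()];
        for (int i = 0; i < targetColumns.size(); i++) {
            Object value = document.get(targetColumns.get(i));
            // Missing fields stay null; nested documents and arrays are not mapped.
            if (value instanceof Map || value instanceof List) {
                tuple[i] = null;
            } else {
                tuple[i] = value;
            }
        }
        return tuple;
    }
}
```

Because the projection drives the loop, fields that the query does not ask for are never touched, which is one reason a direct conversion could be cheaper than going through a full JSON string.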


In the next section I describe the testing approach.

MongoDBTestServer.java
This class creates and hosts a MongoDB instance on localhost for testing. It also loads data from JSON files in the data set and creates Mongo collections. The Mongo instance is created using flapdoodle.embed.mongo. When it runs for the first time, it downloads MongoDB for the relevant platform. Jihoon thinks this could be a problem for continuous integration if the download doesn't happen quickly. Finally, it registers the MongoDB table space in Tajo for testing.

TestConnectionInfo.java
This is a simple unit test class for the ConnectionInfo class. It does not use a MongoDBTestServer instance.

TestMetadataProvider.java
This class checks the functionality of the MetadataProvider. It uses the test server and gets the metadata provider for databases through the table space. In the TAJO-2079-FileSpace branch this test is disabled, because file spaces do not provide metadata.

TestMongoDBTableSpace.java
This class tests the functionality of the table space. It tests the handler, the create and delete functionality, and methods such as getMetadataProvider.

TestMongoDBQueryTest
This class tests the table space with queries. Its parent class is QueryTestCaseBase. It runs simple tests using query files and compares the results with result files. At the moment it only runs select queries.

That is the description of currently implemented modules. 

Regarding the current state of the project:
The original branch for this issue is TAJO-2079, but as I mentioned in the scanner section, there was an issue with the table space that provides metadata. Therefore I decided to build it as a file table space for now and improve it to support metadata in the future. So I created a new branch, TAJO-2079-FileSpace. It contains the Mongo table space as a file storage system (Tajo itself keeps all the metadata), and I implemented the scanner on it. To see the scanner, visit the TAJO-2079-FileSpace branch. Since a file space does not have a metadata provider, visit the TAJO-2079 branch to see the functionality of MetadataProvider.

The next blog post will be about how the mapping from MongoDB to Tajo tables is done. :)