MapR Architecture

by Xiang Dong (EMC Labs China)
Original Article in Chinese:
http://qing.weibo.com/2294942122/88ca09aa330003zv.html
Translated by fcicq ( @fcicq & id:fcicq)

(fcicq:
This is the #20 entry for Hadoopアドベントカレンダー2011 http://www.zusaar.com/event/174001

Thanks to id:nagixx for http://d.hatena.ne.jp/nagixx/20111216/1324006829 , but I'm still wondering what he will write next :)

I'm working on restoring my own Chinese blog @ http://www.fcicq.net/wp/ , but since that is not finished yet,
I have to post this on Hatena Diary; there is the GFW, you know :D

I found this article is largely a rewrite of the presentation mentioned below,
but since the author works for EMC, and EMC partners with MapR for Greenplum, he knows more internal details...
so I decided to translate most of it and publish it here.)

At Hadoop Summit 2011 in June, M.C. Srivas, the founder of MapR, gave the talk "Design, Scale and Performance of MapR's Distribution for Hadoop":
http://www.slideshare.net/mcsrivas/design-scale-and-performance-of-maprs-distribution-for-hadoop
It introduced some of MapR's design methods, along with implementation details and performance data. ... (fcicq: some parts omitted...)

MapR believes that to solve the various problems Hadoop currently has, the following issues must be considered:

(fcicq: Page 10 of the slides; "Page xx" for short below)
1 The scalability of the (centralized) metadata server is poor. Making it distributed is the solution: every node is a metadata server.
Problem: a metadata server must not eat too much memory, so that there is room left to run MapReduce applications.
2 The number of blocks each datanode supports should be increased, while the block-report size is reduced at the same time.
3 As memory capacity is limited, the memory cost of the lookup service should be reduced.
4 Fast restart (for better availability).

(fcicq: Page 11)
MapR expects these methods to yield a scalability boost:
Number of nodes: from ~2k to 10k+
Capacity: from 10-50PB to 1-10EB
Number of files: from 150 million to 1 trillion

At the same time, full random R/W and enterprise features (snapshots, mirrors, ...) should be supported.
MapR also expects a performance increase, better exploitation of the hardware, and support for new types of hardware (SSD, 10GbE, ...).

(fcicq: Page 12)
The core of MapR is its distributed NameNode (NN for short).
In the MapR design, a distributed NN (fcicq: a mini NN) is called a container.
Unlike Hadoop's NN, a container maintains not only the metadata of user files but also the data blocks themselves.
Each container is 16-32GB in size (so there are many containers on a single node), and different nodes hold replicas of the same container.
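
The slides do not give a concrete data model, so the following is only a minimal sketch with made-up names, not MapR's actual code. The point it illustrates: file metadata and data blocks live side by side inside one replicated, size-bounded unit, instead of all metadata sitting in one giant central NN.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container descriptor: unlike an HDFS NameNode entry, it
// carries file metadata AND the data blocks themselves, and the whole
// unit is replicated across nodes.
class ContainerDescriptor {
    static final long MAX_SIZE_BYTES = 32L << 30;       // containers are 16-32GB

    final long containerId;                             // resolved via the CLDB (see below)
    final List<String> replicaNodes;                    // nodes holding replicas of this container
    final Map<String, Long> fileMetadata = new HashMap<>(); // file name -> inode-like record
    final Map<Long, byte[]> blocks = new HashMap<>();   // block id -> block data, in the same unit

    ContainerDescriptor(long containerId, List<String> replicaNodes) {
        this.containerId = containerId;
        this.replicaNodes = replicaNodes;
    }
}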

(fcicq: Page 13)
For end users, the container concept is too low-level and complex, so the volume was introduced to lower the barrier and improve flexibility.
The MapR volume (http://www.mapr.com/doc/display/MapR/Volumes ) is similar to the volume concept in traditional storage.

Containers are managed through volumes (there is no need to manage containers directly):
storage quota, replication level, snapshot, and mirror settings are all exposed to end users at the volume level.

The metadata of containers and volumes is maintained in the CLDB
(container location database, http://www.mapr.com/doc/display/MapR/Glossary#Glossary-containerlocationdatabase ).
The CLDB is a centralized service, so MapR designed a series of fault-tolerance mechanisms for it; see http://www.mapr.com/doc/display/MapR/CLDB+Failover
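
The name "container location database" suggests a lookup flow like the sketch below: the CLDB only answers "where does container X live?", while per-file metadata stays inside the containers themselves, which keeps the centralized service small. All names here are illustrative assumptions, not MapR's real API.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical CLDB: a small map from container id to replica locations.
class Cldb {
    private final Map<Long, List<String>> locations = new ConcurrentHashMap<>();

    void register(long containerId, List<String> replicaNodes) {
        locations.put(containerId, replicaNodes);
    }

    // The only question the CLDB answers; per-file metadata is NOT here.
    List<String> locate(long containerId) {
        return locations.get(containerId);
    }
}

class Client {
    // Resolve the container's location once, then talk to a replica directly.
    byte[] readBlock(Cldb cldb, long containerId, long blockId) {
        List<String> replicas = cldb.locate(containerId);
        String node = replicas.get(0);        // pick a replica, e.g. the first
        return fetchBlockFrom(node, blockId); // direct node I/O, CLDB not involved
    }

    private byte[] fetchBlockFrom(String node, long blockId) {
        return new byte[0]; // stub: the real transport is out of scope here
    }
}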

Using a distributed NN leads to a large number of distributed transactions:
a user may operate on two containers at the same time.
(fcicq: Page 22-23)
For these circumstances, MapR thinks both traditional 2PC and quorum-based protocols (typo in the original article) have their limitations.
MapR proposed a new method: the MapR lockless transaction (TX for short).

There are few details in Srivas' talk;
from the limited slides, we can infer the following method:
(fcicq: Page 21,24)
1 Every node writes its own WAL (write-ahead log).
There are two kinds of WAL: the OP log stores modification & rollback info for metadata, and the value log stores the same for data blocks.
2 Each log entry has a global ID (easy with ZooKeeper), which makes distributed transactions possible.
(fcicq: someone told me ZK's lock granularity prevents it from doing this type of work; the IDs may be generated by another service.)
3 With the WAL, a TX can be rolled back quickly (in less than 2 sec).
4 No explicit commit is required for a MapR lockless transaction (by default, a TX is assumed to commit successfully).
5 Replicas monitor for conflicts; if one is found, roll back, otherwise confirm the TX.

MapR claims this approach has high throughput, and no lock is required during a TX.
Thanks to the WAL, a program crash during a TX is no longer a problem.
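
Putting points 1-5 together, my reading of the flow is roughly the sketch below. This is an assumption pieced together from the slides, not MapR's code: log the intent (with its undo record) under a globally ordered ID, apply it optimistically without taking any lock, and roll back via the WAL only if a replica reports a conflict.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Speculative sketch of a lockless, WAL-based transaction (hypothetical names).
class LocklessTx {
    // Stand-in for the global ID service (ZooKeeper or otherwise; see the
    // translator's note on point 2).
    static final AtomicLong GLOBAL_ID = new AtomicLong();

    // One WAL entry: redo info to apply, undo info to roll back (point 1).
    record LogEntry(long txId, String redo, String undo) {}

    final List<LogEntry> wal = new ArrayList<>(); // this node's OP/value log

    long apply(String redo, String undo) {
        long txId = GLOBAL_ID.incrementAndGet();  // point 2: global ordering
        wal.add(new LogEntry(txId, redo, undo));  // log the intent first
        execute(redo);                            // apply optimistically, no lock held
        // Point 4: no explicit commit. The TX stands unless a replica
        // later reports a conflict for this txId.
        return txId;
    }

    // Point 5: a replica detected a conflict -> undo quickly via the WAL (point 3).
    void onConflict(long txId) {
        for (int i = wal.size() - 1; i >= 0; i--) {
            LogEntry e = wal.get(i);
            if (e.txId() == txId) {
                execute(e.undo());
                wal.remove(i);
                return;
            }
        }
    }

    private void execute(String op) {
        // apply the operation to local metadata/blocks (stub)
    }
}

The trade-off is visible in the code: onConflict() is the only expensive path, which is acceptable exactly because, as analyzed below, conflicts are rare.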

(fcicq: In fact I read the presentation rather carelessly; anyone who has read it could produce this summary.
Nothing new :( But read the analysis in the next paragraph carefully, that is the point...)

In fact, the design of the MapR lockless transaction is a trade-off, arrived at by carefully analyzing the characteristics of MapR's distributed TXs:
as a big-data analytics platform, the datasets processed on Hadoop are usually read-only or read-mostly (current HDFS is effectively read-only),
so the chance of conflict between distributed TXs is low. In other words, the probability of the high-cost operation (rollback) is very low, and it is therefore reasonable to abandon (high-cost) locks.

Besides the distributed NN, some other advanced features are implemented:
1 Direct Access NFS.
Users can mount the MapR filesystem remotely via any NFS client and use it like a local filesystem.
Some questions from Hadoop Summit:
Q: Symbolic links? A: Yes.
Q: O_DIRECT? A: Not reasonable to support over NFS.

Random I/O greatly extends the use cases of MapR's Hadoop (the original HDFS is roughly read-only).
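
To see why this matters: the program below uses nothing but standard Java file APIs, and the /mapr/... path is just an assumed mount point. Over an NFS-mounted MapR volume such a program works unchanged; the in-place overwrite in the middle of the file is exactly what stock HDFS cannot do.

import java.io.IOException;
import java.io.RandomAccessFile;

public class RandomWriteDemo {
    public static void main(String[] args) throws IOException {
        // Assumed NFS mount point of the cluster; any ordinary path works.
        try (RandomAccessFile f =
                 new RandomAccessFile("/mapr/cluster1/user/demo/data.bin", "rw")) {
            f.setLength(4096);               // pre-size the file
            f.seek(1024);
            f.write("hello".getBytes());     // random write: overwrite in place
            f.seek(1024);
            byte[] buf = new byte[5];
            f.readFully(buf);                // random read of what we just wrote
        }
    }
}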

2 Snapshot & Mirror.
(fcicq: Advertising omitted...)