The algorithm used in a stock exchange system

Original Articles in Chinese: http://t.cn/R2DoIAe & http://t.cn/R2DXusg

Background:

Shareholder info (more than 10^9 records) must be managed in memory.
The server has multiple CPU sockets, with one process pinned to each processor (or core?).
The expected QPS per process (Query, Modify, Delete, Add) is in the millions, and operations must run concurrently.

Data Structure:

The shareholders info is identified by:
Shareholder Type (char(1), [A-F])
Shareholder Code (char(10), [A-Z][0-9]{9})
Security ID (short)

The shareholder data is a 64-bit value.

Solution:
Compress the shareholder identifiers into a single 64-bit key, so the entire key-value struct is 16 bytes.
Manage the key-value pairs in an open-addressing hash map (keeping the load factor under control); concurrent modifications can be done with CAS operations. A sketch follows.
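
The article gives no code, so here is a minimal sketch (in C++) of one way the scheme could look. The exact bit widths (here 3 bits for the type, 35 bits for the code, 16 bits for the security ID, 54 bits in total), the hash function, the linear probing, and the absence of delete handling are my own illustrative assumptions, not details from the original.

  // Sketch: pack the three identifier fields into one 64-bit key, then keep
  // {key, value} pairs in a fixed-size open-addressing table whose empty
  // slots are claimed with a single CAS on the key word.
  #include <atomic>
  #include <cstdint>
  #include <vector>

  // type in 'A'..'F', code like "A123456789", sec is the security ID.
  uint64_t pack_key(char type, const char code[10], uint16_t sec) {
      uint64_t t = static_cast<uint64_t>(type - 'A' + 1);          // 1..6, 3 bits, never 0
      uint64_t digits = 0;
      for (int i = 1; i < 10; ++i)                                  // 9 decimal digits
          digits = digits * 10 + static_cast<uint64_t>(code[i] - '0');
      uint64_t c = static_cast<uint64_t>(code[0] - 'A') * 1000000000ULL + digits;  // < 26e9, fits in 35 bits
      return (t << 51) | (c << 16) | sec;                           // 3 + 35 + 16 = 54 bits used
  }

  struct Slot {                       // exactly 16 bytes, as in the article
      std::atomic<uint64_t> key{0};   // 0 = empty (real keys are never 0 here)
      std::atomic<uint64_t> value{0};
  };

  class ConcurrentMap {
  public:
      explicit ConcurrentMap(size_t capacity) : slots_(capacity) {}

      // Add or Modify; returns false only when the table is full.
      bool put(uint64_t key, uint64_t value) {
          const size_t n = slots_.size();
          for (size_t i = 0, idx = mix(key) % n; i < n; ++i, idx = (idx + 1) % n) {
              uint64_t cur = slots_[idx].key.load(std::memory_order_acquire);
              if (cur == 0) {                                        // empty: try to claim it
                  uint64_t expected = 0;
                  if (slots_[idx].key.compare_exchange_strong(
                          expected, key, std::memory_order_acq_rel))
                      cur = key;                                     // we claimed the slot
                  else
                      cur = expected;                                // another thread got here first
              }
              if (cur == key) {                                      // ours, or an existing entry
                  slots_[idx].value.store(value, std::memory_order_release);
                  return true;
              }
          }
          return false;  // table full; a real system keeps the load factor well below 1
      }

      // Query; returns false if the key is absent.
      bool get(uint64_t key, uint64_t& value) const {
          const size_t n = slots_.size();
          for (size_t i = 0, idx = mix(key) % n; i < n; ++i, idx = (idx + 1) % n) {
              uint64_t cur = slots_[idx].key.load(std::memory_order_acquire);
              if (cur == key) {
                  value = slots_[idx].value.load(std::memory_order_acquire);
                  return true;
              }
              if (cur == 0) return false;                            // hit an empty slot: not present
          }
          return false;
      }

  private:
      static uint64_t mix(uint64_t k) {                              // cheap 64-bit mixer
          k ^= k >> 33; k *= 0xff51afd7ed558ccdULL; k ^= k >> 33;
          return k;
      }
      std::vector<Slot> slots_;
  };

Delete would need a tombstone marker, and a freshly claimed slot is briefly visible with value 0, so a real implementation needs more care; the point here is only the 16-byte slot plus one CAS per insert.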

Difference between Research and Innovation

OCR'd from http://weibo.com/flyinglichen

What is the major difference between doing research and setting up a startup?
Research: money to knowledge -> burn money to gain knowledge
Innovation: knowledge to money -> create benefit for the company as soon as possible

The two should be separated and distinguished!!!

Two categories of innovation:
Disruptive innovation: usually done by startups, which have the desire to survive.
Improving current products: big companies

Two kinds of products:
Antibody product -> you have to buy it, e.g. something so far ahead that it dwarfs all other peers. This is what a startup should focus on.
Vitamin product -> it's helpful, but not necessary.

A startup can begin from a very low position in the market, but should grow its share at a log-scale rate.

You always need the best people from every area: marketing, sales, etc. Engineers cannot do everything; actually they are not good at most "things"...
Market-driven instead of technology-driven: never start with a certain technology and then try to find somewhere to sell it. Start from customer needs and work backward through the technology to see how to satisfy them.

Big-company labs often just compete with universities, which makes no sense. They are too slow.
An experienced VC firm helps. Not all money is the same: money from a good VC is not simply money; they can help a lot during the infancy of the company.

What should students prepare?
CS students are trained to be critical (to the point where they can do their own research), so they tend to ignore the bright side.
But to do a startup, you need to influence others positively. Keep an open mind.

Thoughts on Architecture Design

by Lin Shiding (a top architect @ Baidu.com)
Online view (in Chinese): http://wenku.baidu.com/view/8f7a79dcad51f01dc281f127.html

Translated & transcribed by fcicq

Page 2
Begin with Examples:

  • Storage
  • Distributed
  • Service Architecture
  • Computation Models

Page 3
Storage(1):

  • Structure: File, Object, Table
  • Characteristics of Data
    • mutable or not
    • size
    • data layout
  • Access Pattern
    • Realtime Read(Query)/Write
    • Batched Write, Realtime Query
    • Stream Read
    • Scan / Range Query
  • "Realtimeness"
    • Realtimeness
    • Freshness
    • Consistency

Page 4
Storage(2):

  • Conflicts (fcicq: or trade-offs)
    • Latency / Throughput
    • Random / Sequential
    • Scale / Freshness (Realtimeness / Latency)
  • Model
    • B+ Tree (Realtime, Random)
    • Log-based (Batched, Sequential)
  • Solve the conflicts:
    • Weaken the requirements
    • Exploit the locality
    • Combine models

Page 5
Storage Model: B+ tree
(pic)

Page 6
Storage Model: Log-based structure
(pic)

Page 7
Storage (Model): Combined(/Hybrid) Model
(S: high sequential-read performance, high storage capacity)
(R: high random-read performance, low storage capacity)
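
The slides only show pictures for these models, so here is a toy sketch (my illustration, not from the deck) of what the "Combine models" idea from Storage(2) can look like: random writes land in an in-memory ordered structure, and are periodically flushed as immutable sorted runs that are written and read sequentially. This is the basic shape of a log-structured merge design.

  // Toy hybrid store: the memtable handles realtime random writes (the
  // B+-tree-ish side), and flushed sorted runs give cheap sequential,
  // high-capacity storage (the log-based side).  Reads check the memtable
  // first, then the runs from newest to oldest.
  #include <algorithm>
  #include <cstdint>
  #include <map>
  #include <optional>
  #include <vector>

  class HybridStore {
  public:
      explicit HybridStore(size_t memtable_limit) : limit_(memtable_limit) {}

      void put(uint64_t key, uint64_t value) {
          memtable_[key] = value;
          if (memtable_.size() >= limit_) flush();
      }

      std::optional<uint64_t> get(uint64_t key) const {
          auto it = memtable_.find(key);
          if (it != memtable_.end()) return it->second;
          for (auto run = runs_.rbegin(); run != runs_.rend(); ++run) {   // newest run first
              auto pos = std::lower_bound(run->begin(), run->end(),
                                          std::make_pair(key, uint64_t{0}));
              if (pos != run->end() && pos->first == key) return pos->second;
          }
          return std::nullopt;
      }

  private:
      void flush() {                                   // one sequential write of a sorted run
          runs_.emplace_back(memtable_.begin(), memtable_.end());
          memtable_.clear();
          // A real system would also merge/compact runs in the background.
      }

      size_t limit_;
      std::map<uint64_t, uint64_t> memtable_;          // realtime, random-friendly
      std::vector<std::vector<std::pair<uint64_t, uint64_t>>> runs_;  // batched, sequential
  };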

Page 8
Distributed

  • Goal
    • Scaling (capacity): scalability
    • Fault Tolerance: availability
  • Methods
    • Partition
    • Replication

(With per-node failure probability p and k nodes: replication brings the data-loss probability down to P = p^k, while partitioning without replication raises the any-failure probability to P = 1-(1-p)^k; see the small calculation after this list.)

  • Point
    • Protocol Design
    • Debugging
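
The formula on the slide is terse, so here is my reading of it as a small worked calculation (the per-node failure probability p = 0.01 is just an example value):

  // With per-node failure probability p and k nodes:
  //  - partitioning the data over k nodes WITHOUT replication fails if ANY
  //    node fails:          P = 1 - (1 - p)^k   (worse as k grows)
  //  - keeping k replicas of the data loses it only if ALL replicas fail:
  //                         P = p^k             (better as k grows)
  #include <cmath>
  #include <cstdio>

  int main() {
      const double p = 0.01;                       // example per-node failure rate
      for (int k : {1, 3, 10})
          std::printf("k=%2d  partition-only: %.6f  replicated: %.2e\n",
                      k, 1.0 - std::pow(1.0 - p, k), std::pow(p, k));
      return 0;
  }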

Page 9
Distributed: Partition

  • Static hashing
    • cannot be modified/tuned after the build
  • Consistent hashing (see the sketch after this list)
    • only K/n (a fraction of the total data) is affected when a node changes
  • Mapping
    • Split and Combine/Merge
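
A minimal consistent-hashing ring, as an illustration of the K/n property (the node names, the virtual-node count, and std::hash as the ring hash are my arbitrary choices, not something from the slides):

  // Each node is hashed onto a ring of integer points (several virtual nodes
  // per physical node smooth the distribution).  A key belongs to the first
  // node clockwise from its hash, so adding or removing one of n nodes only
  // remaps roughly K/n of the K keys.
  #include <cstdint>
  #include <functional>
  #include <map>
  #include <string>

  class HashRing {
  public:
      void add_node(const std::string& node, int vnodes = 16) {
          for (int i = 0; i < vnodes; ++i)
              ring_[point(node, i)] = node;
      }
      void remove_node(const std::string& node, int vnodes = 16) {
          for (int i = 0; i < vnodes; ++i)
              ring_.erase(point(node, i));
      }
      // Assumes at least one node has been added.
      const std::string& node_for(const std::string& key) const {
          auto it = ring_.lower_bound(std::hash<std::string>{}(key));
          if (it == ring_.end()) it = ring_.begin();   // wrap around the ring
          return it->second;
      }
  private:
      static uint64_t point(const std::string& node, int i) {
          return std::hash<std::string>{}(node + "#" + std::to_string(i));
      }
      std::map<uint64_t, std::string> ring_;           // ring point -> node
  };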

Page 10
Distributed: Replication

  • Granularity
    • Machine
    • Record
    • Group

(Pic Translation:
粒度: Granularity, 开销: Cost,
并行度: Degree of parallelism, 可靠性: Reliability)

Page 11
Replication is not omnipotent (fcicq: it cannot solve every problem)

(Pic Translation:
故障率: Failure Rate)

Page 12
(Pic Translation:
时间: Time, 吞吐: Throughput, 输入: Load, 极限: Designed maximum throughput,
文艺模型: Well-designed / Optimal Model, 普通模型: Typical / poorly-designed Model)

Page 13
Service Architecture

  • Goal
    • High throughput
    • Stable throughput/serving under extreme load
  • Model
    • Basic: threadpool + queue (see the skeleton after this list)
    • Complex/Advanced: event-driven
  • Ensure stability
    • Reduce the granularity of resource allocation; schedule actively
    • Flow control
      • Load (data/metrics) feedback, throttling (fcicq: on high load)
      • (Response) latency deadline, multi-level queue (fcicq: looks like QoS)
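
The slide only names the model, so here is a bare-bones "threadpool + queue" skeleton with one crude flow-control knob: a bounded queue that rejects new work when full. This is my illustration; the load feedback, latency deadlines and multi-level queues mentioned above are not shown.

  #include <condition_variable>
  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  class Server {
  public:
      Server(size_t threads, size_t max_queue) : max_queue_(max_queue) {
          for (size_t i = 0; i < threads; ++i)
              workers_.emplace_back([this] { worker_loop(); });
      }
      ~Server() {
          {
              std::lock_guard<std::mutex> lk(mu_);
              stop_ = true;
          }
          cv_.notify_all();
          for (auto& t : workers_) t.join();
      }
      // Flow control: reject the request outright when the queue is full, so
      // the pool keeps serving at its designed rate instead of piling up latency.
      bool submit(std::function<void()> task) {
          {
              std::lock_guard<std::mutex> lk(mu_);
              if (queue_.size() >= max_queue_) return false;   // throttled
              queue_.push(std::move(task));
          }
          cv_.notify_one();
          return true;
      }
  private:
      void worker_loop() {
          for (;;) {
              std::function<void()> task;
              {
                  std::unique_lock<std::mutex> lk(mu_);
                  cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                  if (stop_ && queue_.empty()) return;
                  task = std::move(queue_.front());
                  queue_.pop();
              }
              task();                                          // run one request
          }
      }
      std::mutex mu_;
      std::condition_variable cv_;
      std::queue<std::function<void()>> queue_;
      std::vector<std::thread> workers_;
      size_t max_queue_;
      bool stop_ = false;
  };

Rejecting early at the queue is what keeps the served throughput flat instead of collapsing once the input exceeds the designed maximum (the "well-designed model" curve on Page 12).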

Page 14
Computation

  • Data Intensive
    • MapReduce
    • Scan-Filter
  • Compute Intensive (CPU-Bound)
    • seti@home
  • Communication Intensive (Traditional HPC)
    • Machine Learning
    • Matrix related calculation

Page 15
Scan-Filter

  • Example: Calculate the Intersection of the 2 sets.
  • Input: list1 + list2, |list1|>>|list2|
  • Output: {}
  • MapReduce
    • Sort + Partition + Reduce
  • Scan-Filter (Model)
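
The slide gives no code, so here is a minimal sketch of the Scan-Filter idea for this example (my illustration):

  // Build an in-memory hash set from the small list, then stream through the
  // large list once and keep only the elements that hit the set.  Unlike the
  // MapReduce route (sort + partition + reduce), the large input is never
  // sorted, only scanned sequentially.
  #include <cstdint>
  #include <unordered_set>
  #include <vector>

  std::vector<uint64_t> intersect(const std::vector<uint64_t>& list1,   // the large list
                                  const std::vector<uint64_t>& list2) { // the small list
      std::unordered_set<uint64_t> filter(list2.begin(), list2.end());
      std::vector<uint64_t> result;
      for (uint64_t x : list1)                  // one sequential scan of list1
          if (filter.count(x) != 0)
              result.push_back(x);
      return result;
  }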

Page 16
How to make a Storage System?
How to make a High-Performance Service?
How to make a Data Warehouse?

What is Architecture?
What does an Architect need?

Page 17-19
Three methods for an Architect

Understand the requirements

  • Tradeoff
    • Can't satisfy all the requirements
    • Don't have to treat all the requirements the same. (fcicq: the important ones have high priority)

(Yellow words on the right side:
Say no to unreasonable requirements! But still give end-to-end solutions)

  • Find the root requirements
    • Divide (breaking down a problem), Abstract, Dimensional (or scale) Reduction

(fcicq: "Dimensional Reduction" here means solve a simplified and/or scale-reduced problem?)

    • Define primitives and combination rules.
  • Understand how the requirements change as time goes by

Choose the methods

  • Estimate, Simulate, Implement

(Yellow word:
Back-of-the-Envelope Calculation
Monte-Carlo Simulation
Discrete Event Simulation
Emulation)

  • Divide vs Iteration
  • Design Patterns

Maintaining an appropriate pace

  • Planning the reachable path
  • Produce/Deliver periodically / on a regular basis

(Pic Translation:
迭代: Iteration)

Give us some color to see see

"Give you some color to see see" is chinglish, which means "I will teach you a lesson".

Thanks to the firewall, real competitors are not going to appear in China in the short term.

Without real competition, all kinds of copycat products have boomed.
Users' requirements are so easy to fulfill that product / interface design matters much more than the backend / infrastructure / algorithms.
(I have to admit that many great product managers / designers have appeared in recent years. But in reality... no(?) new model is invented first in China?)

Developers chase the tails of technologies like MongoDB and node.js, thinking that is the way to solve every scalability problem.
Thankfully some of them have found the right things, like nginx, Hadoop, Erlang...
One thing worth mentioning: agentzh developed ngx_lua, which may become one of the next big things in my opinion.

A joke says one famous company in China spent more than 2 years & 1000 man-years developing a cloud platform that is now abandoned. The company is believed to be Aliyun.
Hu Ning commented on the joke: without rarely-seen system architecture masters like Jeff Dean and Sanjay Ghemawat, do you think leaders' Hooyou*, a brainwashing culture and a huge-crowd strategy will make a successful system? No way!
(This sentence is hard to translate :) )

Hooyou:
1 http://en.wikipedia.org/wiki/Hooyou
2 Hooyou & brainwashing both mean tricking you into thinking something that is only a daydream is possible.

I'm prepared to see more, and larger-than-before, failures in the future, whether in product divisions, technology divisions, or anything else.

But few other people admit this reality. See also https://twitter.com/#!/fcicq/status/179400077062840321

MapR Architecture

by EMC Labs China, Xiang Dong
Original Article in Chinese:
http://qing.weibo.com/2294942122/88ca09aa330003zv.html
Translated by fcicq ( @fcicq & id:fcicq)

(fcicq:
This is entry #20 for the Hadoop Advent Calendar 2011 http://www.zusaar.com/event/174001

Thanks to id:nagixx for http://d.hatena.ne.jp/nagixx/20111216/1324006829 , but I am still wondering what he will write next :)

I'm working on restoring my own Chinese blog @ http://www.fcicq.net/wp/ , but since that is not finished...
I have to post it on Hatena Diary; there is the GFW, you know :D

I found that this article is a rewrite of the presentation mentioned below,
but since the author works for EMC, and EMC partners with MapR for Greenplum, he knows more internal details...
so I decided to translate most parts & publish them here.)

At Hadoop Summit 2011 in June, the founder of MapR, M.C. Srivas, gave a talk titled "Design, Scale and Performance of MapR's Distribution for Hadoop".
http://www.slideshare.net/mcsrivas/design-scale-and-performance-of-maprs-distribution-for-hadoop
It introduced some of the design methods used by MapR, some implementation details and performance data. ... (fcicq: some parts omitted...)

MapR thinks that to solve the various problems Hadoop currently has, the following issues should be considered:

(fcicq: Page 10 of the slide, Page xx for short below)
1 The scalability of the (centralized) metadata server is poor. Making it distributed is the solution: every node is a metadata server.
Problem: the metadata server should not eat too much memory, leaving room to run MapReduce applications.
2 The number of blocks each datanode supports should be increased, and the block-report size should be reduced at the same time.
3 As memory capacity is limited, the memory cost of the search service should be reduced.
4 Fast restart (for better availability)

(fcicq: Page 11)
MapR expects a scalability boost from these methods:
Number of nodes: ~2k to 10k+
Capacity: 10-50 PB to 1-10 EB
Number of files: 150 million to 1 trillion

At the same time, full random read/write & some enterprise features should be supported (snapshots, mirrors...).
MapR also expects some performance increase, exploiting the hardware and supporting new types of hardware (SSD, 10GbE...).

(fcicq: Page 12)
The core of MapR is its distributed NameNode (NN for short).
In the MapR design, the distributed NN (fcicq: mini NN) is also called a Container.
Unlike Hadoop's NN, a Container maintains not only the user files' metadata, but also the data blocks.
Each container is 16-32 GB in size (there will be many containers on a single node). Different nodes hold replicas of the same container.

(fcicq: Page 13)
For end users, the concept of a Container is too low-level/complex, so the Volume is introduced to lower the barrier & improve flexibility.
The concept of a MapR Volume (http://www.mapr.com/doc/display/MapR/Volumes ) is similar to a Volume in traditional storage.

Container management is achieved by managing Volumes (there is no need to manage Containers directly):
Volume storage quotas, replication level, snapshot and mirror settings are now available to the end user.

The metadata of Containers & Volumes is maintained in the CLDB
(container location database, http://www.mapr.com/doc/display/MapR/Glossary#Glossary-containerlocationdatabase ).
The CLDB is a centralized service, so MapR designed a series of fault-tolerance methods; see http://www.mapr.com/doc/display/MapR/CLDB+Failover

Using a distributed NN leads to a large number of distributed transactions.
A user may operate on two containers at the same time.
(fcicq: Page 22-23)
For these circumstances, MapR thinks both traditional 2PC and Quorum-based (typo in the original article) protocols have their limitations.
MapR proposed a new method: the MapR lockless transaction (TX for short).

There are few details in Srivas' talk.
From the limited slides, we found the following method:
(fcicq: Page 21, 24)
1 Every node writes its own WAL (Write-Ahead Log).
There are two kinds of WAL: the OP log stores the modification & rollback info for metadata, and the Value log stores the same for data blocks.
2 Each log entry has a global ID (easy with ZooKeeper), which makes distributed transactions possible.
(fcicq: someone told me ZK's lock granularity prevents it from doing this type of work; the IDs may be generated by another service.)
3 With the WAL, a TX can be rolled back quickly (in less than 2 sec).
4 No explicit commit is required for a MapR lockless transaction (by default, the TX commits successfully).
5 Replicas monitor for conflicts; if there is one, roll back, otherwise confirm the TX.

MapR claims this approach has high throughput, and no lock is required during the TX.
Thanks to the WAL, a program crash during a TX is no longer a problem.

(fcicq: In fact I read the presentation rather carelessly; anyone who has read it should arrive at this summary.
Nothing new :( But read the analysis in the next paragraph carefully, which is the point...)

In fact, the design of the MapR lockless transaction is a trade-off, reached by carefully analyzing the characteristics of MapR's distributed TX:
as a big-data analytics platform, the data set processed on Hadoop is often read-only, or involves many more reads than writes (current HDFS is effectively read-only),
so the chance of conflict in a distributed TX is low. In other words, the probability of the high-cost operation (rollback) is very low, so it is reasonable to abandon (high-cost) locks.

Besides the distributed NN, some other advanced features are implemented:
1 Direct Access NFS.
Users can mount MapR's HDFS remotely via any NFS client & use it like a local filesystem.
Some questions at Hadoop Summit:
Q: Symbolic links? A: Yes.
Q: O_DIRECT? A: Not reasonable to support over NFS.

Random IO greatly extends the use cases for MapR's Hadoop (original HDFS ~= read-only).

2 Snapshot & Mirror.
(fcicq: Advertising omitted...)