Monday, January 31, 2011

Some useful settings and plugins for VI Editor : Part 1 - General settings

Here are some useful settings, commands and plugins to make the VI editor more developer friendly (some settings are relevant only for PHP users).

You need to add the following settings to the .vimrc file in your home directory, or alternatively you can add them to /etc/vim/vimrc, which will enable these settings for all users on the system.

"For highlighting the code add
 syntax on

"Show a ruler at the bottom of screen
  set ruler
  set laststatus=2

"Show matching brackets.
  set showmatch

"To do a case insesnitive search.
  set ignorecase

"To replace TAB with shift of 4 spaces.
  set tabstop=4
  set shiftwidth=4
  set expandtab

"Show line numbers.
 set number

"Jump 5 lines when running out of the screen
 set scrolljump=5

"Indicate jump out of the screen when 3 lines before end of the screen
 set scrolloff=3

"Set indentation rules
 setlocal autoindent
 setlocal smartindent

"Correct indentation after opening a docblock and automatic * on every line
  setlocal formatoptions=qroct

"Append ending brackets whenever open a bracket
 inoremap [ []
 inoremap ( ( )

"Spell Checking
 set spell spelllang=en_us

 Now you can move to the next or previous misspelled word using the ]s and [s commands. You can also use the following:
 zg   Add the word under the cursor as a good word
 z=   Suggest corrections for the word under the cursor

For other posts related to VIM settings and plugins you can also visit

Some useful settings and plugins for VI Editor : Part 2 - Autocompletion
Some useful settings and plugins for VI Editor : Part 3 - PHP documentor
Some useful settings and plugins for VI Editor : Part 4 - CodeSniffer Integration
Some useful settings and plugins for VI Editor : Part 5 - Comment a code block

Wednesday, January 26, 2011

MongoDB vs MySQL: speed test part 2, Select queries

In the first part of this post I compared the performance of insert operations in MongoDB and MySQL. In this part I compare the performance of different select operations. The test setup is the same as in part 1.

1. Selects on an indexed column with different limit clauses

To check the performance of selects on an indexed column with limit clause, a number of queries with different limit clauses were executed on both the databases.

sample mysql query:
SELECT ID, NAME, BIRTH_DT, CONTACT_ADDRESS, CITY, TOTAL_EXP, ENTRY_DT, PROFILE, SUMMARY from db.resume where ID > 1000 limit 100000, 1000

sample MongoDB query:
$collection->find(array('ID' => array('$gt'=>1000)))->skip(100000)->limit(1000);
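
The post does not include the timing code itself; a minimal sketch of how such a run could look with the legacy PHP Mongo extension is shown below (the connection details, the list of offsets and the iterator_to_array() call that forces the fetch are my own assumptions, not taken from the post).

<?php
// Sketch: time the indexed select with different offsets (assumed setup, not the original script).
$m          = new Mongo();                                    // legacy PHP Mongo extension
$collection = $m->selectDB('db')->selectCollection('resume');

foreach (array(0, 100000, 200000, 300000) as $start) {
    $t      = microtime(true);
    $cursor = $collection->find(array('ID' => array('$gt' => 1000)))
                         ->skip($start)
                         ->limit(1000);
    iterator_to_array($cursor);                               // force the cursor to fetch the batch
    printf("skip %d: %.4f ms\n", $start, (microtime(true) - $t) * 1000);
}
?>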


Start Limit    Total Records Fetched    MySQL       MongoDB
0              1000                     0.846 ms    0.0710 ms
100000         1000                     0.903 ms    0.0391 ms
200000         1000                     0.969 ms    0.0209 ms
300000         1000                     1.029 ms    0.0889 ms
400000         1000                     1.058 ms    0.0488 ms
500000         1000                     1.149 ms    0.0482 ms
600000         1000                     1.214 ms    0.0460 ms
700000         1000                     1.170 ms    0.0469 ms
800000         1000                     1.196 ms    0.0450 ms
900000         1000                     1.216 ms    0.0460 ms

2. Selects on a non-indexed column with different limit clauses

sample mysql query:
SELECT ID, NAME, BIRTH_DT, CONTACT_ADDRESS, CITY, TOTAL_EXP, ENTRY_DT, PROFILE, SUMMARY from db.resume where TOTAL_EXP > 5 limit 100000, 1000

sample MongoDB query:
$collection->find(array('TOTAL_EXP' => array('$gt'=>5)))->skip(100000)->limit(1000);

Start Limit    Total Records Fetched    MySQL       MongoDB
0              1000                     1.133 ms    0.0679 ms
100000         1000                     1.166 ms    0.0469 ms
200000         1000                     1.334 ms    0.0469 ms
300000         1000                     1.293 ms    0.0438 ms
400000         1000                     2.047 ms    0.0450 ms


3. Selects on an indexed column with sorting

sample mysql query:

SELECT ID, NAME, BIRTH_DT, CONTACT_ADDRESS, CITY, TOTAL_EXP, ENTRY_DT, PROFILE, SUMMARY from db.resume where ID > 1000 order by USERNAME asc

sample MongoDB query:

$collection->find(array('ID' => array('$gt'=>1000)))->sort(array("USERNAME"=>1));

Avg time in Mysql : 1.973 sec
Avg time in MongoDB : 0.138 ms

4. Selects with IN clause on an Indexed Column

To check the performance of select queries on an indexed key with an IN clause, a number of queries were executed on both databases and an average was taken. Each query had 100 random ID values in the IN clause.
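
The post does not show the queries used for this case; illustrative versions (the ID values below are placeholders) would look like the following.

sample mysql query:
SELECT ID, NAME, BIRTH_DT, CONTACT_ADDRESS, CITY, TOTAL_EXP, ENTRY_DT, PROFILE, SUMMARY from db.resume where ID IN (5839427, 5839512, 5839600)

sample MongoDB query:
$collection->find(array('ID' => array('$in' => $ids)));  // $ids holds the 100 random ID values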

Avg time in Mysql : 4.865 ms
Avg time in MongoDB : 1.570 ms

It's clear from the above results that MongoDB outperformed MySQL in each case by a large margin.


MongoDB vs MySQL: speed test part 1, Insert queries

Recently I started exploring NoSQL databases as an alternative for some of our high-traffic MySQL tables. After going through a number of articles on the net, I decided to explore MongoDB. I compared the performance of different database operations (inserts and different types of selects) in MongoDB and in MySQL. The performance comparison of insert operations is given here.

Test setup:
For testing I used a 3.16 GHz, Intel Xeon CPU with 2 GB of memory and 350 GB of disk.

MySQL:

key_buffer = 128M
sort_buffer_size = 512K
read_buffer_size = 256K
max_allowed_packet = 1M

Table schema:

ID int(11)
NAME varchar(35)
BIRTH_DT date
CONTACT_ADDRESS varchar(150)
CITY int(11)
TOTAL_EXP varchar(5)
ENTRY_DT date
PROFILE varchar(250)
SUMMARY varchar(250)

MongoDB:

For Mongo, two shard servers, one config server and one mongos were started on the same machine, with the chunk size set to 10.

Sample document:

{"_id" : ObjectId("4ca6cca6a87305c90b000000"),
"ID" : "5839427",
"NAME" : "Gaurav Asthana",
"BIRTH_DT" : "1981-06-29",
"CONTACT_ADDRESS" : "Noida, India",
"CITY" : "19",
"TOTAL_EXP" : "06.10",
"ENTRY_DT" : "2010-11-26",
"PROFILE" : "zxzzzzz zzzzzzzzzz zzzzzzzzzzzzzzzzzz zzzzzzzz",
"SUMMARY" : "abcfsf fsdfs gdgdfg gdfgdh dfghdh dfhdh" }

An index was also created on the ID field.

I created a simple PHP script to perform the benchmark. This script inserted a total of 15 lakh (1.5 million) records into both MySQL and MongoDB. I recorded the time for each batch of 100 records that was inserted, so in total I recorded 15,000 readings. The average time taken by each database is given below.
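
The benchmark script itself is not reproduced in the post; the MongoDB side of it could look roughly like the sketch below (connection details and the placeholder field values are my own assumptions, following the sample document shown above).

<?php
// Sketch of the insert benchmark, MongoDB side only (assumed setup; field values are placeholders).
$m          = new Mongo();
$collection = $m->selectDB('db')->selectCollection('resume');

$batchStart = microtime(true);
for ($i = 1; $i <= 1500000; $i++) {
    $collection->insert(array(
        'ID'       => (string) $i,
        'NAME'     => 'Test User ' . $i,
        'BIRTH_DT' => '1981-06-29',
        'CITY'     => '19',
        'ENTRY_DT' => '2010-11-26',
    ));
    if ($i % 100 == 0) {                                      // record the time per batch of 100 inserts
        printf("batch %d: %.2f ms\n", $i / 100, (microtime(true) - $batchStart) * 1000);
        $batchStart = microtime(true);
    }
}
?>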

Average time per batch of 100 records :

Mysql : 18.77 ms
MongoDB : 5.53 ms

Size on disk:

Mysql : 390 MB
MongoDB : 1.6 GB

In my benchmark, MongoDB came out three times faster than MySQL for insert queries, but it occupied four times more disk space than MySQL.

Tuesday, January 25, 2011

Classification of NoSQL Databases

NoSQL databases can be broadly classified as:

1. Distributed vs. Not-distributed databases

Distributed databases take responsibility for data partitioning (for scalability) and replication (for availability) and do not leave that to the client. Non-distributed databases leave the responsibility for data partitioning and replication to the clients.

Table 1: Distributed and Non-distributed databases

Distributed: Amazon Dynamo, Amazon S3, Scalaris, Voldemort, CouchDb (thru Lounge), Riak, MongoDb, BigTable, Cassandra, HyperTable, HBase

Not Distributed: Redis, Tokyo Tyrant, MemcacheDb, Amazon SimpleDb

2. Disk vs. Memory databases

A useful dimension is whether the database is memory-driven or disk-driven. This is important since a disk-driven database requires an explicit cache, while in a memory-driven one the data is not durable.

Table 2: Memory driven and disk driven databases

Memory: Scalaris, Redis

Configurable: BigTable, Cassandra, Hbase, HyperTable

Disk: CouchDb, MongoDb, Riak, Voldemort

On one end of the spectrum is Scalaris which is entirely memory-driven, and Redis which is primarily memory oriented. Cassandra, BigTable, Hypertable, Hbase allow configuring how large the Memtable can get, so that provides a lot of control. CouchDb, MongoDb and Riak all use on-disk B+ trees, and Voldemort uses BDB and MySQL.

3. Data Model richness

On the basis of the data model, the various NoSQL databases can be grouped into the following three groups.

3.1 Key-value Stores


These systems store values and an index to find them, based on a programmer-defined key. These data stores use a data model similar to the popular memcached distributed in-memory cache, with a single key-value index for all the data. Like memcached, none of these systems offer secondary indices or keys.
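
For illustration, the access pattern of such a store boils down to a get/put on a single key; a minimal sketch using the PHP Memcache extension (the host, port and key names are placeholders) is:

<?php
// Key-value access: one key maps to one opaque value; lookups are possible only by that key.
$cache = new Memcache();                                      // PHP Memcache extension
$cache->connect('localhost', 11211);                          // placeholder host/port

$cache->set('resume:5839427', json_encode(array('NAME' => 'Gaurav Asthana', 'CITY' => '19')));
$value = $cache->get('resume:5839427');                       // no secondary indexes, no range queries
?>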

3.2 Document Stores


These systems store documents. The documents are indexed and a simple query mechanism may be provided. Document stores support more complex data than the key-value stores. The term “document store” is not ideal, because these systems store objects (generally objects without pointers, described in JSON notation), not necessarily documents. Unlike the key-value stores, they generally support multiple indexes and multiple types of documents (objects) per database, and they support complex values.
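
For illustration, with the legacy MongoDB PHP driver used in the earlier benchmarks, a secondary index can be created and queried on any field of the stored objects (this is a sketch of my own; collection and field names follow the sample document above):

<?php
// Document store: objects with several indexable fields, not just a single primary key.
$m          = new Mongo();
$collection = $m->selectDB('db')->selectCollection('resume');

$collection->ensureIndex(array('CITY' => 1));                 // secondary index on a non-key field
$cursor = $collection->find(array('CITY' => '19'));           // query by that secondary index
?>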

3.3 Column Stores

These systems store extensible records that can be partitioned across nodes. They are also referred to as “Extensible Record Stores”. Their basic data model is rows and columns, and their basic scalability model is splitting both rows and columns over multiple nodes. Rows are split across nodes through conventional sharding, on the primary key. They typically split by range rather than a hash function (this means that queries on ranges of values do not have to go to every node). Columns of a table are distributed over multiple nodes by using “column groups”.

This may seem like added complexity, but column groups are simply a way for the customer to indicate which columns are best grouped together. These two partitionings (horizontal and vertical) can be used simultaneously on the same table. The column groups must be predefined for an extensible record store. However, that is not a big constraint, as new attributes can be defined at any time. Rows are not that dissimilar from documents: they can have a variable number of attributes (fields), the attribute names must be unique, rows are grouped into collections (tables), and an individual row’s attributes can be of any type.

Table 3: Classification of NoSQL databases based on data model

Key-Value store: Amazon Dynamo, Amazon S3, Redis, Scalaris, Voldemort, SimpleDb

Document store: CouchDb, MongoDb, Riak

Column store: Cassandra, Google BigTable, HBase, HyperTable

Multi Version Concurrency Control

Multi Version Concurrency Control, or MVCC, is an efficient method for letting multiple processes access the same data in parallel without corrupting the data and without the possibility of deadlocks. It is an alternative to lock-based approaches, where every process first has to request an exclusive lock on a data item before it can be read or updated. MVCC is used in some relational databases as well as in most distributed databases.

Instead of letting each process access the data exclusively for a certain amount of time, MVCC allows processes to read the data in parallel, even while another process is updating it. To maintain consistency, each data item carries some kind of time stamp or revision. When a process reads a data item, it retrieves not only its value but also its revision. If the process then attempts to update the data item, it writes the new value together with the previously read revision number to the database. If the actual revision in the store is still the same, the new value is written and the revision of the data item is incremented.

But if the revision in the store is not the same as the revision read by the writing process, then there must have been another process which has updated the data item in the meantime. In a relational database, the writing process would be a transaction, which would in this case be aborted or restarted.
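
As a concrete illustration of this revision check, here is a small sketch of an optimistic update loop against MongoDB using the legacy PHP driver (the collection name, the rev field and the retry limit are my own assumptions, not part of the text):

<?php
// Sketch: MVCC-style optimistic update with a revision counter (assumed schema and retry limit).
$m    = new Mongo();
$docs = $m->selectDB('db')->selectCollection('counters');

for ($attempt = 0; $attempt < 5; $attempt++) {
    $doc    = $docs->findOne(array('_id' => 'item1'));        // read the value and its current revision
    $result = $docs->update(
        array('_id' => 'item1', 'rev' => $doc['rev']),        // write only if the revision is unchanged
        array('$set' => array('value' => $doc['value'] + 1),
              '$inc' => array('rev'   => 1)),
        array('safe' => true)
    );
    if ($result['n'] == 1) {                                   // nobody changed it in the meantime: done
        break;
    }
    // another process updated the item first: re-read and retry
}
?>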

In distributed databases there are at least two cases for such a conflict. The first is that two processes attempt to write the same data item on the same node. In this case the database can detect the conflict during the write operation and abort it, so the client needs to re-read the data item and retry its desired update, much like the behavior of an RDBMS in this situation.

Another case is that multiple clients update the same data item on different nodes. If the distributed database uses asynchronous replication, then this conflict cannot be detected and handled during the write operation. The nodes first have to be synchronized before they can handle the conflict. The conflict resolution can happen during the replication or during the first read operation on the conflicting data item.

Some databases that implement this store all conflicting revisions of the data item and let the client decide how the conflict should be handled. In such systems a read request returns all conflicting versions of the value, and the client has to choose one or merge the versions and write the corrected revision back to the database.