Tuesday, December 21, 2010

Using Thrift with Java and PHP

Thrift is a software framework for scalable cross-language services development. Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages.

This post provides a step-by-step guide to installing Thrift and writing a server (in Java) and a client (in PHP) with it.

1. Download Thrift

Basic requirements
Please go through this link for the list of prerequisites for building the Thrift compiler.

Download the latest stable release from here and extract it, or do an SVN checkout:

$ svn co http://svn.apache.org/repos/asf/thrift/trunk thrift

2. Build and Install

Now go to the Thrift directory and run:

$ ./bootstrap.sh
$ ./configure
$ make
$ make install

This will install Thrift on your system.

3. Writing a Thrift file

The next step is to write a Thrift definition (.thrift) file. This file describes the data structures and functions available to your remote service. For this post I am going to write a simple service for fetching a user profile.

profileservice.thrift

namespace php ProfileService #client
namespace java test.services.profile.thrift #server

enum JobType {
  P, //Permanent
  T  //Temporary
}

enum EmploymentStatus {
  F, //Full Time
  P  //Part Time
}

exception ProfileServiceException {
  1: i32 code,
  2: string message
}

struct Profile {
  1: i32 profileId,
  2: string name,
  3: string birthDate,
  4: string contactAddress,
  5: i32 cityId,
  6: double totalExperience,
  7: JobType jobType,
  8: EmploymentStatus employmentStatus,
  9: string summary,
}

service ProfileService {
  Profile getProfileById(1: i32 profileId) throws (1: ProfileServiceException e),
  Profile getProfileByName(1: string name) throws (1: ProfileServiceException e),
}

4. Using the Thrift Compiler

Now it's time to generate the Thrift code for the server and the client. For the Java server, run:

thrift --gen java profileservice.thrift

Running the Java code generation creates a directory called gen-java/. Under this you can find the files and classes generated from your Thrift definition. For my definition it generated the following files under gen-java/test/services/profile/thrift/ (the path is based on the java namespace declared in the .thrift file):

$ ls gen-java/test/services/profile/thrift/
EmploymentStatus.java
JobType.java
Profile.java
ProfileServiceException.java
ProfileService.java

For the PHP client, run:

thrift --gen php profileservice.thrift

For PHP it creates a directory called gen-php/. For my definition it generated the following files under gen-php/profileservice/ (again based on the namespace declared in the .thrift file):

$ ls gen-php/profileservice/
ProfileService.php
profileservice_types.php

5. Creating a Thrift Server using Java

The next step is to create a Java source file implementing the service interface (the functions we defined in profileservice.thrift). The generated interface is ProfileService.Iface, and in our case we name the class that implements it "ProfileServiceImpl". You will also need the Thrift Java library for this; you can get lib/java/libthrift.jar from your Thrift source directory.

ProfileServiceImpl.java

package server;

import java.util.*;
import org.apache.thrift.*;
import test.services.profile.thrift.*;

public class ProfileServiceImpl implements ProfileService.Iface
{
    public Profile getProfileById(int profileId) throws ProfileServiceException, TException {
        Profile profile = new Profile();
        // your code goes here: look up the profile by id and populate it
        return profile;
    }

    public Profile getProfileByName(String name) throws ProfileServiceException, TException {
        Profile profile = new Profile();
        // your code goes here: look up the profile by name and populate it
        return profile;
    }
}

Now write a Java server for this service.

Server.java

package server;

import java.io.*;
import org.apache.thrift.protocol.*;
import org.apache.thrift.protocol.TBinaryProtocol.*;
import org.apache.thrift.server.*;
import org.apache.thrift.transport.*;
import test.services.profile.thrift.*;

public class Server
{
    private void start()
    {
        try
        {
            TServerSocket serverTransport = new TServerSocket(7911);
            ProfileService.Processor processor = new ProfileService.Processor(new ProfileServiceImpl());
            Factory protFactory = new TBinaryProtocol.Factory(true, true);
            TServer server = new TThreadPoolServer(processor, serverTransport, protFactory);
            System.out.println("Starting server on port 7911 ...");
            server.serve();
        }
        catch (TTransportException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String[] args)
    {
        Server srv = new Server();
        srv.start();
    }
}

This program simply has a main function that binds the service to a particular port and makes the server ready to accept connections and serve responses. This code will generally stay the same unless you want to add functionality at the server level.

Compile all the files and start the server.
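A minimal sketch of how that might look from the project root (the paths and jar names here are assumptions for illustration; the exact classpath depends on your Thrift version, and newer libthrift builds also need the SLF4J jars on the classpath):

$ javac -cp libthrift.jar gen-java/test/services/profile/thrift/*.java
$ javac -cp libthrift.jar:gen-java server/*.java
$ java -cp libthrift.jar:gen-java:. server.Server
Starting server on port 7911 ...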

6. Creating a Thrift Client using PHP

Now it's time to write a Thrift client in PHP to use this service. You'll need to include the language-specific libraries that provide access to Thrift. Look for the folder ./lib/php/src/ in your Thrift source directory; it contains the library files you will need.

For this tutorial I have created a folder testclient in my home directory. Now create a subfolder named src-php and copy all the library files into it. You will also need to mv or cp the autogenerated Thrift files for this project (from the gen-php folder) into the packages folder of these library files. Here is my directory structure for this project:

testclient
..src-php
....autoload.php
....ext
....packages
......profileservice
........ProfileService.php
........profileservice_types.php
....protocol
....server
....Thrift.php
....transport

Write a PHP client script to connect to the Thrift ProfileService server.

ProfileServiceClient.php

<?php
// Set up the path to the thrift library folder (src-php in the layout above)
$GLOBALS['THRIFT_ROOT'] = 'src-php';
// Load up all the thrift stuff
require_once $GLOBALS['THRIFT_ROOT'].'/Thrift.php';
require_once $GLOBALS['THRIFT_ROOT'].'/protocol/TBinaryProtocol.php';
require_once $GLOBALS['THRIFT_ROOT'].'/transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'].'/transport/TBufferedTransport.php';

// Load the package that we autogenerated for this tutorial
require_once $GLOBALS['THRIFT_ROOT'].'/packages/profileservice/ProfileService.php';

try {
    // Create a thrift connection (the server listens on port 7911)
    $socket = new TSocket('localhost', 7911);
    $transport = new TBufferedTransport($socket);
    $protocol = new TBinaryProtocol($transport);

    // Create a profile service client
    $client = new ProfileServiceClient($protocol);

    // Open up the connection
    $transport->open();
    $data = $client->getProfileById(123);
    $transport->close();
    $socket->close();
    print_r($data);
}
catch (TException $tx) {
    // a general thrift exception
    echo "ThriftException: ".$tx->getMessage()."\r\n";
}
?>

To run the client, execute:
php ProfileServiceClient.php

Friday, December 10, 2010

An Intro to NoSQL

What is NoSQL


For a quarter of a century, the relational database (RDBMS) has been the dominant model for database management. In the past, relational databases were used for nearly everything. Because of their rich feature set, query capabilities and transaction management, they seemed fit for almost every task one could imagine doing with a database. But their feature richness is also their flaw, because it makes building distributed RDBMSs very complex. In particular, it is difficult and inefficient to perform transactions and join operations in a distributed system.

This is why there are now non-relational databases with limited feature sets and no full ACID support, which are better suited to use in a distributed environment. These databases are currently called NoSQL databases. The need to look at non-SQL systems arises out of scalability issues with relational databases, which stem from the fact that relational databases were not designed to be distributed (which is key to write scalability) and could therefore afford to provide abstractions like ACID transactions and a rich high-level query model. NoSQL databases address the scalability issue in various ways – by being distributed, by providing a simpler data/query model, by relaxing consistency requirements, and so on.

The name first suggests that these databases do not support the SQL query language and are not relational. But it also means "Not Only SQL", which is not so aggressive against relational databases. This stands for a new paradigm: One database technology alone is not fit for everything. Instead it is necessary to have different kinds of databases for different demands. Most NoSQL databases are developed to run on clusters consisting of commodity computers and therefore have to be distributed and failure tolerant. To achieve this, they have to make different trade-offs regarding the ACID properties, transaction management, query capabilities and performance. They are usually designed to fit the requirements of most web services and most of them are schema free and bring their own query languages.

Why NoSQL

Even though RDBMSs have provided database users with the best mix of simplicity, robustness, flexibility, performance, scalability, and compatibility, their performance in each of these areas is not necessarily better than that of an alternative solution pursuing one of these benefits in isolation. Today the situation is slightly different. For an increasing number of applications, one of these benefits is becoming more and more critical; and while still considered a niche requirement, it is rapidly becoming mainstream, so much so that for an increasing number of database users it is beginning to eclipse the others in importance. That benefit is scalability.

Relational databases scale well, but usually only when that scaling happens on a single server node. When the capacity of that single node is reached, you need to scale out and distribute that load across multiple server nodes. This is when the complexity of relational databases starts to rub against their potential to scale. Try scaling to hundreds or thousands of nodes, rather than a few, and the complexities become overwhelming, and the characteristics that make RDBMS so appealing drastically reduce their viability as platforms for large distributed systems.

Cloud computing also has placed new challenges on the database. The economic vision for cloud computing is to provide computing resources on demand with a "pay-as-you-go" model. A pool of computing resources can exploit economies of scale and a levelling of variable demand by adding or subtracting computing resources as workload demand changes. The traditional RDBMS has been unable to provide these types of elastic services. For cloud services to be viable, vendors have had to address this limitation, because a cloud platform without a scalable data store is not much of a platform at all. So, to provide customers with a scalable place to store application data, vendors had only one real option. They had to implement a new type of database system that focuses on scalability, at the expense of the other benefits that come with relational databases.

Wednesday, November 24, 2010

Consistent Hashing

Hashing is a common method of mapping a key to a location. This is useful for many things, but in more relevant terms it can be used to map keys to servers with great effect. Simple hashing uses a "key mod N" algorithm, where a key's hash is taken modulo N, the number of slots or servers. This spreads keys evenly across the N slots. The problem with this algorithm is that adding or removing a slot or server requires a complete rehash of all the keys, and with a huge data set it is simply not feasible to rehash and redistribute everything.

Consistent Hashing is a specific implementation of hashing that is well suited to many of today's web-scale load-balancing problems. It is used because it addresses the problems of the typical "key mod N" method of distributing keys across a series of servers: it allows servers to be added or removed without significantly upsetting the distribution of keys, and it does not require that all keys be rehashed to accommodate the change in the number of servers. With consistent hashing, only about K/N keys need to be remapped on average, where K is the number of keys and N is the number of servers.

Consistent Hashing is implemented by mapping both keys and servers onto the edge of a circle. Each server is hashed to an angle on the circle, and each key is hashed onto the same circle. A key is stored in the first server (bucket) encountered when moving clockwise from the key's position; equivalently, each bucket owns all keys between the previous bucket and itself. If a bucket becomes unavailable, the keys that mapped to it are remapped to the next bucket on the circle, so only the keys that lived on the unavailable bucket are lost. Similarly, when a bucket is added, the keys between the previous bucket and the new one are mapped to the new bucket; keys that now belong to the new bucket but were stored earlier elsewhere become unavailable. (A short Java sketch of this idea follows the figures below.)

As shown in Figure 1, keys 1, 2, 3 and 4 map to slots A, B and C. To find which slot a key goes in, we move clockwise around the circle from the key until we reach a slot. So here key 1 goes into slot A, key 2 into slot B, key 3 into slot C, and key 4 wraps around into slot A again. If C is removed, key 3 would then belong to slot A.

If another slot D is added as shown in Figure 2, it will take keys 3 and 4 and only leave key 1 belonging to A.

Fig 1: Keys distribution with Consistent hashing
Fig 2: Keys re-distribution with Consistent hashing
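
Here is a minimal sketch of such a ring in Java, using a sorted map as the circle. The class name, the FNV-1a hash and the absence of virtual nodes are simplifications for this post, not part of any particular library; production implementations usually place many virtual nodes per server to smooth the distribution.

ConsistentHashRing.java

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing
{
    // The circle: positions (angles) mapped to the server hashed there
    private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();

    public void addServer(String server) {
        ring.put(hash(server), server);
    }

    public void removeServer(String server) {
        ring.remove(hash(server));
    }

    // Walk clockwise from the key's position to the first server on the circle
    public String getServerFor(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    // Any reasonably uniform hash will do; FNV-1a is used here for brevity
    private int hash(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h = (h ^ s.charAt(i)) * 0x01000193;
        }
        return h & 0x7fffffff; // keep the position non-negative
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addServer("serverA");
        ring.addServer("serverB");
        ring.addServer("serverC");
        System.out.println("key1 -> " + ring.getServerFor("key1"));
        ring.removeServer("serverC"); // only keys that lived on serverC move
        System.out.println("key1 -> " + ring.getServerFor("key1"));
    }
}

Removing serverC only changes the mapping of the keys that were stored on it; every other key keeps its server, which is exactly the property that a plain "key mod N" scheme lacks.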

Tuesday, November 23, 2010

Database Consistency

A database system is said to be in a consistent state if it satisfies all known integrity constraints. Integrity is defined as "the accuracy or correctness of data in the database". There are two well-known types of integrity constraints:

1. Entity integrity constraints (for example, no primary key value can be null)
2. Referential integrity constraints (a field in one table that refers to another table must refer to a value that actually exists in that table).

A database is in a correct state if it is both consistent and if it accurately reflects the true state of affairs in the real world. A database that is in a correct state will always be consistent. But consistent does not necessarily mean correct. A consistent database can be incorrect.

Different types of consistency exist:

Strong consistency means that all processes connected to the database always see the same version of a value: a committed value is instantly reflected by any read operation on the database until it is changed by another write operation.

Eventual Consistency is weaker and does not guarantee that each process sees the same version of the data item. Even the process which writes the value could get an old version during the inconsistency window. This behavior is usually caused by the replication of the data over different nodes.

Read-your-own-writes consistency: some distributed databases can ensure that a process can always read its own writes. For this, the database has to always connect the process to nodes that already store the data written by that process.

A subtype of read-your-own-writes consistency is session consistency. Here it is only guaranteed that a process can read its own writes during a session; if the process starts a new session, it might see an older value during the inconsistency window.

Another variant of eventual consistency is monotonic read consistency, which assures that when a newly written value is read the first time, all subsequent reads on this data item will not return any older values. This type of consistency allows the database to replicate newly written data, before it allows the clients to see the new version.

Most NoSQL databases can only provide eventual consistency.
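
To make these distinctions concrete, here is a tiny, purely illustrative Java sketch. The two hash maps stand in for replicas of a distributed store; the key and values are invented for illustration and this is not any real database API.

ConsistencyDemo.java

import java.util.HashMap;
import java.util.Map;

public class ConsistencyDemo
{
    public static void main(String[] args) {
        // Two replicas of the same data item; replication between them is lazy
        Map<String, String> replicaA = new HashMap<String, String>();
        Map<String, String> replicaB = new HashMap<String, String>();
        replicaA.put("profile:123", "v1");
        replicaB.put("profile:123", "v1");

        // A client updates the item on replica A; B has not been updated yet
        replicaA.put("profile:123", "v2");

        // During the inconsistency window a read from B still returns the old value
        System.out.println("read from B: " + replicaB.get("profile:123")); // v1

        // Read-your-own-writes: route this client's reads to the replica it wrote to
        System.out.println("read from A: " + replicaA.get("profile:123")); // v2

        // Eventually the write propagates and both replicas converge again
        replicaB.put("profile:123", "v2");
        System.out.println("read from B: " + replicaB.get("profile:123")); // v2
    }
}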

Wednesday, September 8, 2010

Smarty with symfony 1.4

For more than two years we have been using symfony 1.0, and we used symfony 1.2 for some projects as well. Since it is now official that the symfony developers are not going to support the 1.0 and 1.2 versions, I thought I would give symfony 1.4 a try for one of my new projects.

After installing symfony 1.4.6 and setting some config parameters, I was able to run my first page successfully. Now it was time to enable the Smarty plugin. I tried to install the sfSmartyViewPlugin that I had used with version 1.2, but found out that symfony 1.4 does not support it. I googled and found that sfSmarty3Plugin should be used instead.

It took me a long time to make this plugin work, since I was not able to find enough information on the net about how to configure it.

Here I am giving the steps that I followed in case someone needs it:

Prerequisites:

- sfSmarty3Plugin
- Smarty3

1. extract sfSmarty3Plugin into the plugins/ folder

2. vi config/ProjectConfiguration.class.php and add
$this->enablePlugins('sfSmarty3Plugin');
in the setup function

3. vi plugins/sfSmarty3Plugin/config/settings.yml and set

allow_php_tag: true # to allow the old style
lib_dir: lib/vendor/smarty/libs # path to the Smarty.class.php
left_delimiter: '{' # change it if you do not use the default one
right_delimiter: '}' # change it if you do not use the default one

4. vi config/module.yml

  default: # For all environments
    enabled: true
    is_internal: false
    view_class: sfSmarty
    partial_view_class: sfSmarty

5. In module.yml of each app

  default: # For all environments
    enabled: true
    is_internal: false
    view_class: sfSmarty
    partial_view_class: sfSmarty

6. In app.yml of each app

  # default values
  all:
    sfSmarty:
      class_path: lib/vendor/smarty/libs # path to the Smarty.class.php
      template_extension: .tpl
      template_security: false

7. rename layout.php to layout.tpl

8. Make changes in settings.yml related to smarty configs

Tuesday, July 27, 2010

Memcache

Caching helps; everybody knows it. The performance of a web site can be improved a great deal with proper caching. Caching can be done at various levels - browser level, proxy level, server level, database level, etc. Here I will be talking about a system known as memcache that helps with caching at both the server and the database level. Memcache is designed for fast storage and retrieval of information. It is a distributed system that stores information in memory only.

Memcache was developed by Danga Interactive, which describes it as 'a high performance, distributed memory object caching system'. Memcache consists of a daemon and client APIs for various languages, so you run the memcached daemon on servers where free memory is available and use the client APIs to set and get data from it. Data sent by the client APIs is serialized and stored in memcache against a key provided by the client.

But there are some shortcomings with its distributed architecture. If you have 3 memcache daemons running, you are distributing data over those daemons, and one of them goes down, you simply lose the data set on that server. What also happens is that, since the number of servers has changed, the distribution logic now divides data among the remaining 2 daemons. And later, when you bring the third daemon back up, the distribution logic changes again and no longer points to the daemons where the data written during the outage was stored.

You can download it from http://www.danga.com/memcached/. Download the tar.gz source file, untar it, go to the memcached directory and compile it using ./configure, make and make install.

Now run the memcached binary in daemon mode and assign it some amount of memory. To see a list of all available options, run memcached -h. For example, to start memcached in daemon mode with 64 MB of RAM, listening on localhost port 11211, run the following command:

./memcached -d -m 64 -l 127.0.0.1 -p 11211

This will start the server. Now clients can connect to this server and store data over there. There are apis available for different languages (perl, python, ruby, java, C# and C) that allow you to connect to memcached daemon and store/retrieve variables, arrays and objects from it. One point to note here: Objects stored in memcache are language dependent. If you extract an object using java api which was stored by a php api. You would not be able to parse it.