
Digging into WSO2 BAM


In this blog post, I'm going to explain how to set up a WSO2 BAM 2.3.0 clustered deployment that receives data from mediation agents [WSO2 ESB] and service agents [WSO2 AS].

Components of the setup

1. ESB Cluster
2. AS Cluster
3. BAM Cluster
    --- Two Data Receiver (DR) nodes
    --- Four Cassandra nodes
    --- Two Data Analyzer (DA) nodes
    --- Two ZooKeeper nodes
    --- One dashboard node

WSO2 BAM is used to aggregate, analyze and visualize the data events coming from different agents.
By default, WSO2 BAM contains data agents for
--Collecting mediation statistics
--Collecting service statistics
and more. For more information, please refer to [1].

Once the different data agents send their stat events to the BAM side, the raw data is first stored in BAM's integrated NoSQL Cassandra data store.

Note - In WSO2 BAM, the primary default data store is the NoSQL Cassandra store and the secondary data store is an H2-based RDBMS database. The secondary database can be changed to any other RDBMS database type or to a Cassandra database. The reason for keeping Cassandra as the primary data store is that, in a real-world use case, a very large volume of raw statistics data arrives at BAM from different data agents. Cassandra has been chosen because it provides horizontal scalability and distributed storage; in other words, we can have a large number of cluster nodes and write to them in parallel.
Then, through Hive scripts, map-reduce jobs are scheduled [either one-time or periodically] on the underlying Hadoop file system; the raw data is analyzed and the analyzed data is moved to a relational database.
Then, by querying this relational database, the analyzed data is visualized as gadgets/rendered HTML pages using the inbuilt WSO2 dashboard capability. So, in a company, the managerial level can use this visualized data to analyze and make decisions about their business-related data.
Due to the flexible and componentized architecture of WSO2 BAM, the same BAM node can be scaled to act as different components.
For example, a single BAM node consists of the following components:
  • Data-Receiver
  • Data-Analyzer
  • Internal Cassandra
  • Internal Hadoop
  • Internal ZooKeeper

If an organization wants a BAM node to function only as a data receiver, that can be achieved easily through configuration. The same applies to the other inbuilt BAM components as well.
Additionally, BAM can be set up with an external Cassandra store, an external Hadoop cluster or an external ZooKeeper cluster instead of using the internal embedded ones.

In our setup, we have used BAM to collect mediation statistics and service statistics.
First, we'll look into the data flow of the setup. The diagram below shows the high-level architecture of the setup.


  • Setting ESB & AS Data Agents
First, we need to enable BAM statistics in ESB and AS. Then we have to enable the mediation agent and the service agent in each of them. For those steps, please refer to;


     In our case, we have set up two DR nodes to receive BAM stat events in a load-balancing manner. Therefore, we need to configure the ESB and AS data agents as load-balancing data agents and add the URLs of the two BAM receiver nodes, in load-balancing form, on the ESB and AS data agent side. To see how to do that, please refer to
http://docs.wso2.org/display/BAM230/Setting+up+Multi+Receiver+and+Load+Balancing++Data+Agent
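
As a hedged illustration (the exact URL syntax and configuration fields are described in the document linked above; the host names below are placeholders and the ports are the BAM defaults), the receiver URL configured on the ESB/AS data agent side would list both DR nodes so that events are distributed between them:

tcp://bam_dr_node1_ip:7611,tcp://bam_dr_node2_ip:7611

with the corresponding authenticator URLs pointing to the same two nodes:

ssl://bam_dr_node1_ip:7711,ssl://bam_dr_node2_ip:7711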
  • Setting BAM Data Receiver Nodes
Next, we need to configure the BAM Data Receiver [DR] nodes. By setting up multiple BAM receiver nodes, the data events from ESB/AS are received by those BAM nodes in a highly available manner: if one DR node is down, the other node acts as the data receiver. Once data events are received by these BAM nodes, they need to be sent to the primary Cassandra storage, where the raw data is stored.
As for the configuration changes, we need to point these nodes to the Cassandra cluster and define the read/write consistency levels that the data receivers use when writing data to Cassandra.
Pointing the DR nodes to the Cassandra cluster can be done by modifying cassandra-component.xml, which can be found in {BAM_Home}/repository/conf, as below.
<Cassandra>
<Cluster>
    <Name>Test Cluster</Name>
    <Nodes>cass_node1_ip:9160,cass_node2_ip:9160,cass_node3_ip:9160</Nodes>
    <DefaultPort>9160</DefaultPort>
    <AutoDiscovery disable="false" delay="1000" />
</Cluster>
</Cassandra>

The read/write consistency levels used by the data receivers when writing data to the Cassandra cluster can be changed in streamdefn.xml, which can be found in {BAM_Home}/repository/conf/advanced.
For example;

<StreamDefinition>
    <ReplicationFactor>3</ReplicationFactor>
    <ReadConsistencyLevel>QUORUM</ReadConsistencyLevel>
    <WriteConsistencyLevel>QUORUM</WriteConsistencyLevel>
    <StrategyClass>org.apache.cassandra.locator.SimpleStrategy</StrategyClass>
</StreamDefinition>


In the above configuration, WriteConsistency and ReadConsistency have been set to QUORUM = (replication_factor / 2) + 1 = (3 / 2) + 1 = 2. Therefore, at least 2 Cassandra nodes must be up and running for a write to succeed.

Hence we need to decide what level of node failure the system should tolerate and choose the WriteConsistency and ReadConsistency accordingly. With a replication factor of 3, QUORUM already tolerates one node being down; to tolerate more, the level can be relaxed to 'ONE' or 'ANY', at the cost of weaker consistency guarantees. Please refer to http://www.datastax.com/docs/1.1/dml/data_consistency
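
For instance, keeping the replication factor at 3 but relaxing both levels to ONE, the same streamdefn.xml entry would look like this (shown purely as a variation of the earlier example):

<StreamDefinition>
    <ReplicationFactor>3</ReplicationFactor>
    <ReadConsistencyLevel>ONE</ReadConsistencyLevel>
    <WriteConsistencyLevel>ONE</WriteConsistencyLevel>
    <StrategyClass>org.apache.cassandra.locator.SimpleStrategy</StrategyClass>
</StreamDefinition>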

Additionally, we need to change the default user store of each of these nodes to a common user store by configuring user-mgt.xml in the {BAM_Home}/repository/conf location.
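
As a minimal sketch of that change (the datasource name jdbc/WSO2_UM_DB is a hypothetical example; it must match a datasource you define in master-datasources.xml that points to the shared user store database), the realm configuration in user-mgt.xml would reference the shared datasource like this:

<Realm>
    <Configuration>
        <!-- other realm properties remain unchanged -->
        <Property name="dataSource">jdbc/WSO2_UM_DB</Property>
    </Configuration>
    <!-- user store manager configuration remains unchanged -->
</Realm>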

Since these BAM DR nodes will not use the BAM data analyzing feature or the inbuilt Cassandra support;

-- The BAM Tool Box Deployer feature can be removed using the feature manager.

-- Start the BAM nodes without starting the Cassandra instance bundled with the server, by passing the below property with the server startup command.

sh wso2server.sh -Ddisable.cassandra.server.startup=true


  • Setting BAM Cassandra Cluster
Next, we need to configure the BAM Cassandra cluster. As the Cassandra cluster, you can either set up an external Cassandra cluster or use BAM nodes with their inbuilt Cassandra support. In this setup, we have used four BAM nodes with their inbuilt Cassandra support as the Cassandra cluster. In each of these BAM nodes, you have to change the cassandra.yaml file, which can be found in the {BAM_Home}/repository/conf/etc location.

Basically, we need to change the following configurations in the cassandra.yaml file (a sketch of these settings is shown after the list).
--cluster_name - Change to a common name across the cluster
--listen_address - Hostname of each BAM Cassandra node
--seeds - Hostnames of the seed nodes in the Cassandra cluster. Here we have set only one BAM node as the seed node.
--rpc_address - Hostname of each BAM Cassandra node
--rpc_port - A common port value across the Cassandra BAM nodes. The default value is 9160.
--storage_port - The default value is 7000. Should be common across the Cassandra BAM nodes.
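
A minimal sketch of the relevant part of cassandra.yaml for one node is shown below (the host names are placeholders, the cluster name matches the earlier cassandra-component.xml example, and the seed_provider block is the standard Cassandra 1.1 way of listing the seeds mentioned above):

cluster_name: 'Test Cluster'
listen_address: cass_node1_hostname
rpc_address: cass_node1_hostname
rpc_port: 9160
storage_port: 7000
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "cass_node1_hostname"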

NOTE: If you have set up the BAM Cassandra nodes with port-offset values, then you have to pass two additional system properties at server startup, as below, to connect all the nodes to one Cassandra cluster.

-Dcassandra.rpc.port=default_port[9160]+offset
-Dcassandra.storage.port=default_port[7000]+offset

With the above system properties, the Cassandra RPC and storage ports are set to common values that account for the offset. Please note that the same nodes defined above also need to be defined in the cassandra.yaml file.
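
For example, assuming a node started with a Carbon port offset of 1 (a hypothetical value), the startup command could look like:

sh wso2server.sh -DportOffset=1 -Dcassandra.rpc.port=9161 -Dcassandra.storage.port=7001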

The next major thing we need to verify is whether the Cassandra cluster has been created successfully and whether the nodes have joined it.
For that, we used the nodetool utility shipped with Apache Cassandra 1.1.3. First, download Apache Cassandra 1.1.3 from here, unzip it and execute the below nodetool command, supplying the host and JMX port of one of the Cassandra nodes.

 ./nodetool -u admin -pw admin -host <cassandra_node_host> -p <jmx_port> ring

The above command lists all the nodes connected to that Cassandra cluster.
Additionally, we need to change the default user store of each of these nodes to the common user store by configuring user-mgt.xml in the {BAM_Home}/repository/conf location, as described earlier.
  • Setting BAM Data Analyzer Nodes
The next step is to configure the BAM data analyzer nodes, which analyze the raw data stored in the Cassandra primary storage and put the results into a separate secondary storage. For that, BAM provides Hive script support: the Hive scripts are scheduled as tasks on the local inbuilt Hadoop system of the BAM nodes, and those tasks analyze the raw data.
To produce analyzed statistics from ESB mediation data and AS service data, BAM itself ships with predefined Hive scripts that run these analytics jobs on Hadoop.
These Hive scripts are included in the BAM binary pack in a deployable artifact type called a BAM toolbox; a rough sketch of what such a script looks like is shown below.
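
The following is only an illustrative, heavily simplified sketch, not one of the exact shipped scripts: the keyspace, column family, column mappings, table names and JDBC properties are assumptions modeled on the default BAM toolboxes. It maps raw events from Cassandra into a Hive table, summarizes them, and writes the summary to the relational secondary storage.

CREATE EXTERNAL TABLE IF NOT EXISTS ServiceStatsRaw
    (key STRING, service_name STRING, response_time BIGINT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.host" = "cass_node1_ip,cass_node2_ip",
    "cassandra.port" = "9160",
    "cassandra.ks.name" = "EVENT_KS",
    "cassandra.cf.name" = "service_stats_stream",
    "cassandra.columns.mapping" = ":key,payload_service_name,payload_response_time");

CREATE EXTERNAL TABLE IF NOT EXISTS ServiceStatsSummary
    (service_name STRING, avg_response_time DOUBLE, request_count INT)
STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
    'mapred.jdbc.driver.class' = 'com.mysql.jdbc.Driver',
    'mapred.jdbc.url' = 'jdbc:mysql://summary-db-host:3306/bam_summary_db',
    'mapred.jdbc.username' = 'bam_user',
    'mapred.jdbc.password' = 'bam_password',
    'hive.jdbc.update.on.duplicate' = 'true',
    'hive.jdbc.primary.key.fields' = 'service_name');

INSERT OVERWRITE TABLE ServiceStatsSummary
    SELECT service_name, avg(response_time), count(1)
    FROM ServiceStatsRaw
    GROUP BY service_name;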
In our deployment, we have kept two BAM nodes as DA nodes: one acts in read-write mode with the ability to deploy toolbox artifacts, while the other node is in read-only mode with the BAM toolbox deployment feature disabled. In addition, we have used an external ZooKeeper cluster to schedule the Hadoop jobs in a highly available manner. The steps on how we did that can be found in the section "Configuring data analyzer cluster" described in

Once you have configured the two DA nodes and the ZooKeeper cluster, check whether the DA nodes really provide high availability: first deploy the relevant analytics scripts as toolboxes and schedule them to run as periodic tasks, then shut down one DA node and check whether the scheduled task is properly triggered on the second DA node while the first one is down.

In this setup, we have not configured an external Hadoop cluster and have used the BAM inbuilt Hadoop support instead. The reason is that the resources allocated to us were limited and the data in our setup is not analyzed very frequently. Thus, if one DA server goes down and the analysis of a raw data entry fails the first time, that entry is still eligible to be analyzed at the next task execution; since the data does not need to be analyzed frequently, we used the BAM inbuilt Hadoop support as it is.
The advantage you get from having an external Hadoop cluster is a possible performance increase. Basically, this affects the execution of a single Hive analytics operation. If a Hive operation is expensive, its execution can be made faster by splitting the operation among multiple Hadoop nodes, whereas in the above setup it will always execute on the local node. So if you need to scale the execution of individual jobs, you can add an external Hadoop cluster and add nodes to it so that the operations finish executing earlier. If the individual Hive operations are not that large and do not execute for a long period, then going with the internal Hadoop of each BAM node is fine.

Since these nodes will not use the inbuilt Cassandra support;
-- Start the BAM nodes without starting the Cassandra instance bundled with the server, by passing the below property with the server startup command.

sh wso2server.sh -Ddisable.cassandra.server.startup=true

  • Setting BAM Dashboard Node
This node is only used to show the presentation dashboards with the collected, analyzed statistics. Therefore, you can edit the toolboxes to include only the presentation part and deploy them on this node. Then modify master-datasources.xml in the {BAM_Home}/repository/conf/datasources location to add the secondary relational database storage, which contains the analyzed statistics and which is queried when visualizing data on the dashboard.
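
A minimal sketch of such a datasource entry in master-datasources.xml is shown below (the datasource name, JNDI name and MySQL connection details are assumptions for illustration; adjust them to your own summary database):

<datasource>
    <name>WSO2BAM_DATASOURCE</name>
    <description>Secondary RDBMS storage holding the analyzed statistics</description>
    <jndiConfig>
        <name>jdbc/WSO2BAM_DATASOURCE</name>
    </jndiConfig>
    <definition type="RDBMS">
        <configuration>
            <url>jdbc:mysql://summary-db-host:3306/bam_summary_db</url>
            <username>bam_user</username>
            <password>bam_password</password>
            <driverClassName>com.mysql.jdbc.Driver</driverClassName>
            <maxActive>50</maxActive>
            <maxWait>60000</maxWait>
            <testOnBorrow>true</testOnBorrow>
            <validationQuery>SELECT 1</validationQuery>
        </configuration>
    </definition>
</datasource>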
Since this node will not use the inbuilt Cassandra support;
-- Start the BAM node without starting the Cassandra instance bundled with the server, by passing the below property with the server startup command.

sh wso2server.sh -Ddisable.cassandra.server.startup=true


Also disable data analyzing on this BAM node. For that, remove the analyzer-related features from the server using the feature manager.









