Apache Cassandra

MySense · Posted on 2014-2-26 09:32:28

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters,[1] with asynchronous masterless replication allowing low latency operations for all clients.
Cassandra also places a high value on performance. University of Toronto researchers studying NoSQL systems concluded that "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments."[2]
Cassandra's data model is a partitioned row store with tunable consistency.[3] Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key.[4] Other columns may be indexed separately from the primary key.[5]
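To make the partition key / clustering column split concrete, here is a minimal Python sketch of a partitioned row store. The `Table` class and its method names are hypothetical and exist only for illustration; the sketch sorts rows when they are read, which is a simplification and not a description of how Cassandra stores data.

```python
class Table:
    """Toy partitioned row store: partition key -> rows ordered by clustering key."""

    def __init__(self):
        # Maps partition key -> {clustering key -> column values}.
        self._partitions = {}

    def insert(self, partition_key, clustering_key, values):
        # Writing the same full primary key again overwrites the row (an upsert).
        partition = self._partitions.setdefault(partition_key, {})
        partition[clustering_key] = dict(values)

    def read_partition(self, partition_key):
        # Rows of one partition come back ordered by clustering key.
        # Sorting at read time is a simplification made for this sketch.
        rows = self._partitions.get(partition_key, {})
        return [(ck, rows[ck]) for ck in sorted(rows)]


readings = Table()
readings.insert("sensor-1", 3, {"temp": 20})
readings.insert("sensor-1", 1, {"temp": 18})
readings.insert("sensor-2", 2, {"temp": 30})
print(readings.read_partition("sensor-1"))
```

A query that names the partition key touches a single partition, which is why the first component of the primary key matters so much for data modelling.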
Tables may be created, dropped, and altered at runtime without blocking updates and queries.[6]
Cassandra does not support joins or subqueries, except for batch analysis via Hadoop. Rather, Cassandra emphasizes denormalization through features like collections.[7]

The Apache Cassandra Project

MySense · Posted on 2014-2-26 09:53:39

Introducing DataStax Enterprise

What is DataStax Enterprise?

DataStax Enterprise is a NoSQL database platform architected for today's line-of-business applications that is powered by Apache Cassandra and designed to securely manage real-time, analytic, and search data all in the same database cluster.

How Does DataStax Enterprise Work?

DataStax Enterprise contains a production-certified version of Cassandra for handling real-time, transactional workloads as well as advanced security for protecting sensitive data.

Analytics on Cassandra data may easily be performed by adding nodes dedicated to analytic operations (currently powered by Hadoop). Enterprise search operations on Cassandra data can be run by adding nodes devoted to search (currently handled by Solr) to a cluster.

Each workload (real-time, analytics, and search) is isolated to nodes devoted to that workload, so that real-time transactional workloads do not negatively impact analytic operations, which in turn do not affect search tasks. Full workload management is built into each cluster.

Capacity or additional workloads are added simply by adding new nodes to a cluster and choosing how to replicate data between them.


MySense · Posted on 2014-2-26 10:25:56

Architecture in brief

An overview of Cassandra's structure.

Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur. Cassandra addresses the problem of failures by employing a peer-to-peer distributed system in which all nodes are the same and data is distributed among all nodes in the cluster. Each node exchanges information across the cluster every second. A commit log on each node captures write activity to ensure data durability. Data is also written to an in-memory structure, called a memtable, and then written to a data file on disk, called an SSTable, once the memory structure is full. All writes are automatically partitioned and replicated throughout the cluster.
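The write path described above (commit log, then memtable, then a flush to an SSTable) can be sketched on a single node as follows. This is a toy model under stated assumptions: the `Node` class, the `memtable_limit` threshold counted in rows, and the simplified `read` method are all hypothetical and are not Cassandra's implementation.

```python
class Node:
    """Toy single-node write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []    # append-only record of every write, for durability
        self.memtable = {}      # in-memory structure holding recent writes
        self.sstables = []      # immutable sorted snapshots flushed from the memtable
        self.memtable_limit = memtable_limit  # hypothetical "full" threshold, in rows

    def write(self, key, value):
        self.commit_log.append((key, value))  # record the write first
        self.memtable[key] = value            # then apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A full memtable is written out as one sorted, read-only SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Simplified read (an assumption of this sketch): newest data wins,
        # so check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None


node = Node(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # the memtable reaches its limit and is flushed
node.write("a", 3)   # the newer value for "a" lives in the fresh memtable
print(node.read("a"), node.read("b"), len(node.sstables))
```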

Cassandra is a row-oriented database. Cassandra's architecture allows any authorized user to connect to any node in any data center and access data using the CQL language. For ease of use, CQL uses a similar syntax to SQL. From the CQL perspective the database consists of tables. Typically, a cluster has one keyspace per application. Developers can access CQL through cqlsh as well as via drivers for application languages.

Client read or write requests can go to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured. For more information, see Client requests.
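The coordinator role can be illustrated with a short Python sketch in which any node accepts a request and forwards it to the nodes that own the data. The `Cluster` class and the static `owners` map are hypothetical; a real cluster derives ownership from its configuration instead of a hard-coded dictionary.

```python
class Cluster:
    """Toy cluster in which any node can coordinate a client request."""

    def __init__(self, nodes, owners):
        # nodes: names of all nodes in the cluster.
        # owners: hypothetical static map from a row key to the nodes storing it.
        self.storage = {name: {} for name in nodes}
        self.owners = owners

    def _check(self, contact_node):
        if contact_node not in self.storage:
            raise ValueError(f"unknown node: {contact_node}")

    def write(self, contact_node, key, value):
        # The node the client connected to acts as coordinator (a proxy):
        # it forwards the write to the owning nodes and stores nothing itself
        # unless it happens to be one of them.
        self._check(contact_node)
        replicas = self.owners[key]
        for name in replicas:
            self.storage[name][key] = value
        return {"coordinator": contact_node, "forwarded_to": list(replicas)}

    def read(self, contact_node, key):
        # Simplification: ask only the first owning node for the value.
        self._check(contact_node)
        first_replica = self.owners[key][0]
        return self.storage[first_replica].get(key)


cluster = Cluster(nodes=["n1", "n2", "n3"], owners={"user:1745": ["n1", "n2"]})
receipt = cluster.write("n3", "user:1745", "smith")
print(receipt)
print(cluster.read("n2", "user:1745"))
```

In this sketch the client talks to `n3`, which owns no copy of the row, yet the write still lands on `n1` and `n2`.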

Key components for configuring Cassandra

Gossip: A peer-to-peer communication protocol to discover and share location and state information about the other nodes in a Cassandra cluster.

Gossip information is also persisted locally by each node to use immediately when a node restarts. You may want to purge gossip history on node restart for various reasons, such as when the node's IP address has changed.
Partitioner: A partitioner determines how to distribute the data across the nodes in the cluster. Choosing a partitioner determines which node to place the first copy of data on.

You must set the partitioner type and assign each node a num_tokens value. If not using virtual nodes (vnodes), use the initial_token setting instead.
Replica placement strategy: Cassandra stores copies (replicas) of data on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines which nodes to place replicas on. The first replica of data is simply the first copy; it is not unique in any sense.

When you create a keyspace, you must define the replica placement strategy and the number of replicas you want.
Snitch: A snitch defines the topology information that the replication strategy uses to place replicas and route requests efficiently.

You need to configure a snitch when you create a cluster. The snitch is responsible for knowing the location of nodes within your network topology and distributing replicas by grouping machines into data centers and racks.
The cassandra.yaml file is the main configuration file for Cassandra. In this file, you set the initialization properties for a cluster, caching parameters for tables, properties for tuning and resource utilization, timeout settings, client connections, backups, and security.
Cassandra stores table properties in the system keyspace. You set storage configuration attributes on a per-keyspace or per-table basis programmatically or by using CQL from a client application such as cqlsh.

By default, a node is configured to store the data it manages in the /var/lib/cassandra directory. In a production cluster deployment, you change the commitlog_directory setting so that it points to a different disk drive from the one used by data_file_directories.
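How a partitioner and a replica placement strategy work together can be sketched in Python as a token ring. The sketch assumes one token per node (no vnodes) and places replicas on consecutive nodes around the ring, in the spirit of SimpleStrategy. `token_for` and `Ring` are hypothetical names, and MD5 is used only as a stand-in hash because it ships with the standard library; Cassandra's default partitioner uses a Murmur3 hash.

```python
import hashlib
from bisect import bisect_left


def token_for(key: str) -> int:
    """Map a partition key to a token (stand-in hash, see the note above)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy token ring with one token per node."""

    def __init__(self, node_tokens):
        # node_tokens: {node name: token}. Sort by token to form the ring.
        self._ring = sorted((token, name) for name, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, key_token, replication_factor):
        # The first replica goes to the first node whose token is >= the key's
        # token, wrapping around at the end of the ring.
        start = bisect_left(self._tokens, key_token) % len(self._ring)
        # Further replicas go to the next nodes clockwise; there can never be
        # more replicas than nodes.
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring({"node-a": 0, "node-b": 100, "node-c": 200})
print(ring.replicas(50, 2))    # key token 50 falls in node-b's range
print(ring.replicas(250, 2))   # past the last token, so the ring wraps around
```

With real keys, the token would come from `token_for(key)` and the node tokens would span the whole hash range; small integers are used here only to keep the example readable.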


MySense · Posted on 2014-2-26 10:46:21

Querying Cassandra

Quickly master inserting and retrieving data from Cassandra 2.0 using the cqlsh utility.
Attention: The information presented here applies only to Cassandra 2.x, not to Cassandra 1.2.

You can run Cassandra Query Language (CQL) using the cqlsh utility to:

Create a keyspace, which is akin to the namespace of an SQL database.
Use the keyspace to create a table, which is similar to an SQL table.
Insert data into the table.
Use queries to sort, retrieve, alter, automatically expire, and drop the data.

Procedure

From a terminal:

Assuming Cassandra is running, start cqlsh from the installation directory on Windows, Linux, or Mac OS X. For example, in a shell on Mac OS X:

$ ./bin/cqlsh

At the cqlsh prompt, use the DESCRIBE cqlsh command to see the keyspaces that already exist in Cassandra:

DESCRIBE keyspaces;

The output is a list of system keyspaces containing tables of details about database objects and cluster configuration:

system system_auth system_traces

Create a keyspace.

CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };

Use the keyspace, just as you would use an SQL database.

USE mykeyspace;

Create a simple table with three columns for the ids, first names, and last names of users.

CREATE TABLE users (
   user_id int PRIMARY KEY,
   fname text,
   lname text
);

Check that your table and keyspace have been created.

DESCRIBE TABLES;

The output is the list of tables, in this case just one, in the keyspace you're using:

users

Insert the ids, first names, and last names of a few users into the table.

INSERT INTO users (user_id,  fname, lname)
   VALUES (1745, 'john', 'smith');
INSERT INTO users (user_id,  fname, lname)
   VALUES (1744, 'john', 'doe');
INSERT INTO users (user_id,  fname, lname)
   VALUES (1746, 'john', 'smith');

Retrieve all the data from the users table.

SELECT * FROM users;

The output lists the data in the order Cassandra stores it.

 user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1744 |  john |   doe
    1746 |  john | smith

Retrieve data about users whose last name is smith by first creating an index, and then querying the table.

CREATE INDEX ON users (lname);

Note: You need the index because your WHERE clause uses a column that isn't part of the primary key.

SELECT * FROM users WHERE lname = 'smith';

 user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1746 |  john | smith

Drop the users table.

DROP TABLE users;
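Why the index step in the procedure above is needed can be illustrated with a conceptual Python sketch: without an index, rows are reachable only by primary key, so a lookup by lname needs a second mapping from column value to row keys. The data mirrors the rows inserted earlier; `build_index` and `select_where` are hypothetical helpers, and this is not how Cassandra implements its indexes.

```python
# Rows keyed by primary key (user_id), in the same order they were inserted.
users = {
    1745: {"fname": "john", "lname": "smith"},
    1744: {"fname": "john", "lname": "doe"},
    1746: {"fname": "john", "lname": "smith"},
}


def build_index(rows, column):
    """Map each value of `column` to the primary keys of the rows holding it."""
    index = {}
    for key, row in rows.items():
        index.setdefault(row[column], []).append(key)
    return index


def select_where(rows, index, value):
    """Return the rows whose indexed column equals `value`, via the index."""
    return [(key, rows[key]) for key in index.get(value, [])]


lname_index = build_index(users, "lname")
for user_id, row in select_where(users, lname_index, "smith"):
    print(user_id, row["fname"], row["lname"])
```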
