Ceph Storage: A Distributed Object Storage System

CEPH Storage

Ceph is an open-source, software-defined, distributed storage system. Software-defined storage (SDS) is a form of storage virtualization that separates the storage hardware from the software that manages the storage infrastructure. Ceph is a true SDS solution: it runs on commodity hardware from any manufacturer, so customers are free to choose their hardware and avoid vendor lock-in. Ceph is massively scalable (up to exabytes and beyond) and has no single point of failure. Today, private and public cloud models are widely used to deliver IT infrastructure to customers.

Ceph is very popular in cloud storage solutions such as OpenStack. The cloud depends on commodity hardware, and Ceph makes full use of this hardware to provide a fault-tolerant, cost-effective storage system. Ceph is a unified storage solution that provides file, block, and object access from a single platform. RAID technology has long been the fundamental building block of storage systems. RAID consumes a lot of disk space and takes a significant amount of time to rebuild a failed disk whose capacity is in the order of terabytes. Integrating RAID also increases the cost of the storage. A Ceph storage system addresses these problems and eliminates the need for RAID. Ceph client support has been part of the mainline Linux kernel since version 2.6.34.

WHY OBJECT STORAGE?

An object is a combination of data and metadata components. Each object is identified by a unique ID, which rules out the possibility of another object with the same ID. Traditional storage solutions are not capable of providing object storage; they provide only file and block-based storage. Object-based storage has many advantages over traditional file and block-based storage. In particular, it provides platform and hardware independence, leaving the choice of both open. The basic building block of Ceph is the object. Any form of data, whether file or block, is stored in a Ceph cluster as objects, and Ceph replicates these objects across the cluster to improve reliability. In Ceph, objects are not tied to a physical path, which makes them flexible and location-independent. This enables Ceph to scale linearly from the petabyte level to the exabyte level.

CEPH RELEASES

At the time of writing, Hammer (v0.94.3) is the latest release of Ceph. The release before it was Giant.

Figure: Ceph release comparison list

CEPH ARCHITECTURE

A Ceph storage cluster is made up of several different software daemons, each of which takes care of a distinct Ceph function. The daemons are independent of one another, which helps keep the cost of a Ceph cluster low compared to other storage systems. In the figure below, the lower part is RADOS, which is internal to the Ceph cluster and has no direct client interface; the upper part contains all the client interfaces.

Figure: Ceph Architecture

CEPH DEPLOYMENT

Suppose we have three nodes with the hostnames ceph-node1, ceph-node2, and ceph-node3.

1. Install ceph-deploy on ceph-node1 by executing

# yum install ceph-deploy

2. Create a Ceph cluster by using the ceph-deploy tool,

# ceph-deploy new ceph-node1

The new subcommand of ceph-deploy deploys a new cluster with the default cluster name, ceph. It generates the cluster configuration file ceph.conf and the monitor keyring file ceph.mon.keyring in the current working directory. When Ceph runs with authentication and authorization enabled, every client must supply a user name and a keyring containing the secret key of that user. The default user name is client.admin.

3. To install the Ceph software binaries on all the nodes using ceph-deploy, execute the following command from ceph-node1

# ceph-deploy install --release emperor ceph-node1 ceph-node2 ceph-node3

Here, emperor is the name of the Ceph release to install.

4. Create the first monitor on ceph-node1

# ceph-deploy mon create-initial

5. Check the cluster status by

# ceph status

Initially, the cluster won’t be healthy.

Creating Object Storage Device

Create Object Storage Devices (OSDs) on ceph-node1 and add them to the Ceph cluster as follows:

1. List the disks on the node by,

# ceph-deploy disk list ceph-node1

From the output, identify the disks (other than the OS-partition disks) on which we should create Ceph OSDs.

2. The disk zap subcommand destroys the existing partition table and content of the disk.

# ceph-deploy disk zap ceph-node1:sdb ceph-node1:sdc ceph-node1:sdd

3. The osd create subcommand first prepares the disk, that is, it formats the disk with a filesystem, which is xfs by default. Then it activates the disk’s first partition as the data partition and the second partition as the journal:

# ceph-deploy osd create ceph-node1:sdb ceph-node1:sdc ceph-node1:sdd

4. Check the cluster status for new OSD entries:

# ceph status

At this stage, the cluster will not be healthy. We need to add a few more nodes to the Ceph cluster so that it can set up a distributed, replicated object storage, and hence become healthy.

RADOS

The Reliable Autonomic Distributed Object Store (RADOS), or storage cluster, is the heart of the Ceph storage system. RADOS gives Ceph its core features: a distributed object store, high availability, reliability, no single point of failure, self-healing, and self-management. The data access methods of Ceph, such as the RADOS block device (RBD), CephFS, the RADOS gateway, and the librados library, all operate on top of the RADOS layer. RADOS stores data in the form of objects inside a pool. When a write request reaches a Ceph cluster, the location to which the data should be written is calculated by an algorithm called CRUSH. Based on that result, RADOS distributes the data to the cluster nodes in the form of objects.

RADOS also performs data replication. It creates copies of objects and distributes them across different failure zones, so that no two copies reside in the same zone and every object is replicated at least once. RADOS also checks object states to ensure that every object remains consistent. If an inconsistency is found, recovery is performed with the help of the remaining object copies. These recovery operations are hidden from the end user. RADOS consists of two major components, the Object Storage Device (OSD) and the Monitor.

1. RADOS Object Storage Device (OSD):

An OSD stores client data in the form of objects on the physical disk drives of each node in the cluster. A Ceph cluster consists of many OSDs. For any read or write operation, the client requests the cluster maps from the monitors and, after examining the maps, interacts directly with the OSDs for I/O. Each object has one primary copy and several secondary copies that are scattered across other OSDs. Each OSD plays the role of primary OSD for some objects and at the same time acts as a secondary OSD for others. When a disk fails, the OSDs perform recovery operations: a secondary OSD holding replicated copies of the affected objects is promoted to primary, and new secondary copies are created.

2. Ceph Monitors:

Ceph monitors do not store client data. They serve updated cluster maps to clients and other cluster nodes, which periodically check with the monitors for the most recent copies of the cluster maps. Monitors are responsible for the health of the Ceph cluster: they store cluster information, the states of nodes, and cluster configuration information, and they keep the master copy of the cluster map. A typical Ceph cluster has more than one monitor, and a multi-monitor cluster forms a quorum in which decision making is distributed among all the monitors. An odd number of monitors is recommended to avoid split-brain scenarios, and a production cluster should have at least three. Out of all the monitors, one operates as the leader; if the leader goes down, one of the remaining monitors takes over as leader. The cluster map includes the monitor, OSD, PG, and CRUSH maps.

•Monitor map: This holds end-to-end information about the monitor nodes, which includes the Ceph cluster ID, the monitor hostnames, and their IP addresses with port numbers. It also stores the current epoch of the map along with its creation and last-changed information.

•OSD map: This stores fields such as the cluster ID, the creation and last-changed information of the OSD map, and information related to pools such as pool names, pool IDs, type, replication level, and placement groups. It also stores OSD information such as count, state, weight, and OSD host information. We can check the cluster’s OSD map by executing:

# ceph osd dump

•PG map: This holds the time stamp, last OSD map, full ratio, and near full ratio information. It also keeps track of each placement group ID, object count, state, state stamp, up and acting OSD sets. To check cluster PG map, execute:

# ceph pg dump

•CRUSH map: This holds information about the cluster’s storage devices, the failure-domain hierarchy, and the rules that apply to failure domains when storing data. To check the cluster CRUSH map, execute the following command:

# ceph osd crush dump
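The recommendation above to run an odd number of monitors follows from majority-quorum arithmetic. The short Python sketch below is only an illustration of that arithmetic, not Ceph code; the function names are made up for this example.

```python
def quorum_size(monitors):
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors):
    """Monitors that may fail while a majority can still be formed."""
    return monitors - quorum_size(monitors)


# Print the quorum size and failure tolerance for small clusters.
for n in range(1, 8):
    print(f"{n} monitors: quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Going from three monitors to four raises the quorum size but not the number of failures the cluster survives, which is why the even count buys nothing.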

librados

librados is a C library that allows applications to work directly with RADOS, bypassing other interface layers when interacting with the Ceph cluster. It offers an API through which applications can talk to the cluster directly and in parallel, with no HTTP overhead. Applications link against the librados library and use its native protocol, thereby gaining access to RADOS. This direct interaction with RADOS improves application performance. librados also serves as the base for the other service interfaces that are built on top of it, which include the Ceph File System, the Ceph RADOS gateway, and the Ceph Block Device.
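As an illustration of direct access through librados, here is a minimal sketch using the Python binding (the rados module that ships with the Ceph client packages). It assumes a reachable cluster, a readable /etc/ceph/ceph.conf with the client.admin keyring next to it, and an existing pool; the function name, pool name, and object name are illustrative only.

```python
def write_and_read(pool, name, data, conffile="/etc/ceph/ceph.conf"):
    """Store one object in a pool through librados and read it back."""
    import rados  # Python binding for librados; needs the Ceph client packages

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)  # I/O context bound to one pool
        try:
            ioctx.write_full(name, data)  # create or overwrite the whole object
            return ioctx.read(name)       # read it back straight from RADOS
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

On a node with admin credentials, a call such as write_and_read("newpool", "hello-object", b"hello") would be expected to return the bytes that were written. Note that read() fetches a limited number of bytes by default, so larger objects need an explicit length.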

RADOS GATEWAY

The Ceph object gateway is also known as the RADOS gateway. It exposes APIs that are compatible with Amazon S3 and with Swift (OpenStack Object Storage). It can be considered a proxy that converts HTTP requests to RADOS requests and vice versa. The S3 and Swift APIs share a common namespace inside a Ceph cluster, so we can write data with one API and retrieve it with the other. Apart from the S3 and Swift APIs, an application can bypass the RADOS gateway and get direct, parallel access to the Ceph cluster through librados. Removing the additional layers in this way is effective for applications that require extreme performance from the storage. Running more than one gateway spreads the request load across the gateways.

• S3 compatible: This provides an Amazon S3 RESTful API-compatible interface to Ceph storage clusters. The RESTful (Representational State Transfer) style is a popular way of building APIs for cloud computing services.

• Swift compatible API: It provides an OpenStack Swift API-compatible interface to Ceph storage clusters. Ceph Object Gateway can be used as a replacement for Swift in an OpenStack cluster.

• Admin API: This is helpful for the administration of our Ceph cluster over HTTP RESTful API.

Figure: Different access methods using RADOS Gateway

RADOS BLOCK DEVICE(RBD)

In block storage, data is stored as volumes that are made up of blocks and are attached to nodes. This provides the large storage capacity required by applications. These blocks are mapped to the operating system and are controlled by its file system. Ceph introduced a new protocol called RBD. RBD provides reliable, distributed, high-performance block storage disks to clients. RBD drivers have been integrated into the Linux kernel. RBD supports images of up to 16 exabytes. The Ceph block device provides full support for cloud platforms such as OpenStack and CloudStack. In OpenStack, the Ceph block device is used with the Cinder and Glance components.

1. Create an RBD image named ‘testrbd’ with a size of 20480 MB (20 GB)

# rbd create testrbd --size 20480

2. Listing RBDs by,

# rbd ls

3. Retrieve information about the block device by,

# rbd --image testrbd info

4. Map the remote RBD image to a local RBD device,

# echo '{ceph-monitor ip} name=admin,secret=Qwer12%$&*wqMN ceph-pool ceph-image' > /sys/bus/rbd/add

‘ceph-image’ is the name of the RBD image and ‘ceph-pool’ is the name of the pool.

5. Format the device,

# mkfs.xfs -L rbddevice /dev/rbd0

rbddevice is the label used to identify the RBD device in a multiple RBD environment.

6. Remove the rbd device by executing,

# echo "0" > /sys/bus/rbd/remove

CEPH File System

Ceph provides a file system on top of RADOS. It uses a metadata daemon that manages metadata and keeps it separate from the data. This separation helps to reduce complexity and improves reliability. CephFS offers a POSIX-compliant, distributed file system of any size. The Ceph file system uses the same Ceph storage cluster as Ceph block devices and Ceph object storage. To use a Ceph file system, we require at least one metadata server. Linux kernel version 2.6.34 and above supports CephFS. There are two ways to use CephFS: through the native kernel driver or through Ceph FUSE.

Mounting CephFS with kernel driver

1. Check kernel version of the client by using command ‘uname -r’ and create a mount point directory,

# mkdir /mnt/cephkernel

2. Mount cephfs by,

# mount -t ceph <monitor ip>:<port no of monitor>:/ /mnt/cephkernel -o name=admin,secret=<key>

eg: mount -t ceph 192.168.1.65:6789:/ /mnt/cephkernel -o name=admin,secret='Mwkwwk&%$75757HJF'

Here, key is the admin secret key located in /etc/ceph/ceph.client.admin.keyring. Quote the key if it contains characters that are special to the shell.

Mounting CephFS as FUSE

FUSE stands for Filesystem in Userspace. It is a mechanism that allows non-privileged users to create their own file systems without editing kernel code.

1. Install the ceph-fuse package on the client machine by,

# yum install ceph-fuse

2. Create a directory called ‘cephfs’ under /mnt for mounting,

# mkdir /mnt/cephfs

3. Mount by,

# ceph-fuse -m <monitor ip>:<port number of monitor> <mount point name>

eg: ceph-fuse -m 192.168.1.34:6789 /mnt/cephfs

4. To mount permanently, open /etc/fstab and add,

<ceph-id> <mount point> <Type> <options>

id=admin /mnt/cephfs fuse.ceph defaults 0 0

PLACEMENT GROUP(PG)

A placement group is a logical collection of objects that are replicated on OSDs to provide reliability in the storage system. We can consider a PG as a logical container that holds multiple objects and is mapped onto multiple OSDs. Placement groups are essential for the scalability and performance of a Ceph storage system. Without PGs, it would be difficult to track and manage the multiple replicated copies of an object that are spread over many OSDs. Every placement group consumes resources such as CPU and memory in order to manage its objects. Increasing the number of PGs in a cluster spreads the load more evenly across OSDs, but the PG count should be increased in a controlled way. 50 to 100 PGs per OSD is recommended.

CEPH POOLS

A Ceph pool is a logical partition for storing objects. Ceph provides easy storage management using these pools. Each pool holds several placement groups, and each placement group holds objects that are mapped to OSDs. A Ceph pool ensures data availability by creating several copies of each object. At the time of pool creation, we can define the replica size. The default replica size is 2 (the object plus one additional copy). When we first deploy a Ceph cluster without creating a pool, Ceph uses default pools to store data.

A Ceph pool supports snapshots and allows setting ownership of and access to objects. In Ceph, data management starts as soon as the client writes data to a pool. The data is first written to a primary OSD. The primary OSD then replicates the same data to secondary and tertiary OSDs, as required by the pool replication size. After finishing their writes, the secondary and tertiary OSDs send an acknowledgment to the primary OSD. Only then does the primary OSD send an acknowledgment to the client, confirming that the write operation has been completed.
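The acknowledgment order described above can be sketched as a toy simulation. This is not how Ceph is implemented; the OSD class and the replicated_write function are hypothetical names used only to show that the client is acknowledged after every replica has confirmed its write.

```python
from dataclasses import dataclass, field


@dataclass
class OSD:
    """Toy stand-in for an OSD: an ID plus an in-memory object store."""
    osd_id: int
    store: dict = field(default_factory=dict)

    def write(self, name, data):
        self.store[name] = data
        return True  # acknowledgment that the write is stored


def replicated_write(primary, replicas, name, data):
    """Write to the primary, replicate, and acknowledge only when all agree."""
    primary_ack = primary.write(name, data)
    replica_acks = [osd.write(name, data) for osd in replicas]
    # The client sees success only after every replica has acknowledged.
    return primary_ack and all(replica_acks)


primary, secondary, tertiary = OSD(0), OSD(1), OSD(2)
acked = replicated_write(primary, [secondary, tertiary], "obj1", b"payload")
print("client acknowledged:", acked)
```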

Creating a Pool

Creating a Ceph pool requires a pool name, the PG and PGP numbers, and a pool type, which is replicated by default. PGP is the total number of placement groups used for placement purposes, that is, for placing objects inside a pool.

1. Creating a pool named as ‘newpool’ with 128 PG and PGP numbers by,

# ceph osd pool create newpool 128 128

2. Listing of pools can be done in two ways,

# ceph osd lspools

# rados lspools

3. The default replication size for a Ceph pool created with CEPH emperor or earlier releases is two. We can set replication size by,

# ceph osd pool set newpool size 4

4. Take a snapshot of a pool by,

# rados mksnap snapshot01 -p newpool

CRUSH

Traditional storage systems normally store data together with its metadata. The metadata, which is data about data, records information such as where the data is physically stored. Each time new data is added to the storage system, the metadata is first updated with the physical location where the data will be stored, after which the actual data is stored. This approach does not scale to exabyte-level data, and it creates a single point of failure for the storage system: if we lose the storage metadata, we lose all our data. So it is important to keep central metadata safe from disasters, either by keeping multiple copies on a single node or by replicating the entire data and metadata. Such complex management of metadata is a bottleneck for a storage system’s scalability, high availability, and performance.

How does it work?

Ceph solves this problem with the Controlled Replication Under Scalable Hashing (CRUSH) algorithm. Unlike traditional systems that rely on storing and managing a central metadata or index table, Ceph uses the CRUSH algorithm to compute where data should be written to or read from. Instead of storing metadata, CRUSH computes it on demand, which removes the limitations of storing metadata in the traditional way. The metadata computation process is known as a CRUSH lookup, and it is not system dependent. Ceph lets clients perform this on-demand computation themselves and then read or write the data. For a read or write operation on a Ceph cluster, the client first contacts a Ceph monitor and retrieves a copy of the cluster map. The cluster map tells the client the state and configuration of the Ceph cluster. The data is converted to objects with an object ID and a pool name or ID. The object ID is then hashed against the number of placement groups to determine the placement group within the required Ceph pool.

The calculated placement group then goes through a CRUSH lookup (on-demand metadata computation) to determine the primary OSD for storing or retrieving the data. After computing the OSD ID, the client contacts this OSD directly and stores the data. All these compute operations are performed by the client, so they do not impact cluster performance. Once the data is written to the primary OSD, the same node performs a CRUSH lookup and computes the location of the secondary placement groups and OSDs, so that the data is replicated across the cluster for high availability.
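The two-step lookup described above (object to placement group, then placement group to OSDs) can be sketched in a few lines of Python. This is a simplified stand-in and not the real CRUSH algorithm: it uses SHA-256 and a plain ranking of OSDs in place of Ceph's own hash function and hierarchy-aware CRUSH rules, and all function names are hypothetical.

```python
import hashlib


def stable_hash(text):
    """Deterministic 64-bit hash (a stand-in for Ceph's own hash function)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(pool_id, object_name, pg_num):
    """Hash the object name into one of the pool's placement groups."""
    return f"{pool_id}.{stable_hash(object_name) % pg_num:x}"


def pg_to_osds(pg_id, osd_ids, replicas):
    """Rank OSDs by a per-(PG, OSD) hash; the first one acts as primary."""
    ranked = sorted(osd_ids,
                    key=lambda osd: stable_hash(f"{pg_id}:{osd}"),
                    reverse=True)
    return ranked[:replicas]


osds = list(range(6))  # six OSDs with IDs 0..5
pg = object_to_pg(pool_id=3, object_name="my-object", pg_num=128)
acting = pg_to_osds(pg, osds, replicas=3)
print("placement group:", pg, "acting set:", acting)
```

Because every step is a pure function of the object name and the cluster map, any client holding the same map computes the same location without asking a central metadata server.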

Recovery and Rebalancing

In the event of a component failure, Ceph waits for 300 seconds (the default) before it marks a down OSD as out and initiates the recovery operation. This interval is controlled by the ‘mon osd down out interval’ parameter in the Ceph cluster configuration file. During recovery, Ceph regenerates the affected data that was placed on the failed node. CRUSH replicates data to many nodes, and these replicated copies are used for the recovery. When a new disk or host is added to a Ceph cluster, CRUSH starts a rebalancing operation during which it moves data from existing hosts or disks to the new one. Rebalancing keeps all disks equally utilized, which makes the cluster perform more efficiently. All the existing OSDs work in parallel to move the data, which helps the rebalancing operation complete faster.
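To illustrate why rebalancing only moves part of the data, the sketch below places 1024 placement groups on five OSDs, adds a sixth, and compares the results. It uses a simple hash ranking as a stand-in for CRUSH, so the numbers are illustrative only; the names score and place are hypothetical.

```python
import hashlib


def score(pg, osd):
    """Deterministic pseudo-random weight for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(pg, osds, replicas=3):
    """Pick the highest-scoring OSDs for a placement group."""
    ranked = sorted(osds, key=lambda osd: score(pg, osd), reverse=True)
    return set(ranked[:replicas])


pgs = range(1024)
before = {pg: place(pg, [0, 1, 2, 3, 4]) for pg in pgs}
after = {pg: place(pg, [0, 1, 2, 3, 4, 5]) for pg in pgs}  # OSD 5 is new

moved = [pg for pg in pgs if before[pg] != after[pg]]
print(f"{len(moved)} of {len(pgs)} placement groups gained a replica on OSD 5")
```

In this scheme a placement group either keeps its OSD set or swaps exactly one member for the new OSD, so data flows only toward the added disk and never between the existing ones.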

CEPH and OpenStack

OpenStack is a set of software tools for building and managing cloud computing platforms for public and private clouds. Ceph provides robust, reliable storage for OpenStack. Ceph can be integrated with OpenStack components such as Cinder, Glance, Nova, and Keystone. The main benefits of integrating Ceph with OpenStack include:

1. Ceph is a unified storage solution of block, file and mainly object storage for OpenStack, allowing different applications to use storage as they need.

2. Ceph supports rich APIs for both Swift and S3 object storage interfaces.

3. It provides a snapshot feature to OpenStack volumes that can be used as a backup.

4. Ceph provides a feature-rich storage backend at a very low cost which in turn limits the OpenStack deployment cost.

5. It provides advanced block storage capabilities, such as cloning of VMs, for OpenStack clouds.

CEPH Best Practices:

1. The OSD journal

Ceph first writes the data from Ceph clients to a journal. After the write to the journal completes, the data is written to the backing storage. The journal is a small partition on the same disk as the OSD, a partition on a separate SSD (solid-state drive), or a file on a file system. 10 GB is a common journal size. Ceph uses journaling for speed and consistency. Ceph supports file systems such as Btrfs and XFS for OSDs. A sync operation runs every five seconds by default; it flushes journaled writes to the backing file system so that the corresponding journal space can be reused.

Using SSD partitions for journaling results in faster writes to the journal, so SSD partitions are recommended for journals. The backing storage can be composed of slower disks such as SATA disks. In the case of a journal failure on a Btrfs-based file system, there will be only minimal data loss or no data loss at all. The failure of a journal disk that serves OSDs running on XFS or ext4 file systems will result in data loss, so Btrfs is preferred. Btrfs is a copy-on-write file system, which means that if the content of a block is changed, the changed block is written separately. This preserves the old block, so the old data remains available even after a journal failure. When external SSDs are used for the journal, we should not exceed a ratio of four to five OSDs per journal disk.

Figure: Ceph OSD journaling

In the above figure, (1) indicates the first data writing from the client to the journal. (2) indicates the data writing from journal to back storage, which is physical disks like SATA disks.
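As a rough sizing aid for the journal, the Ceph documentation suggests a journal of at least twice the product of the expected throughput and the sync interval. The sketch below applies that rule of thumb; the function name and the 100 MB/s figure are assumptions for illustration, and the five-second default is the sync interval mentioned above.

```python
def journal_size_mb(throughput_mb_s, sync_interval_s=5.0):
    """Rule of thumb: 2 * expected throughput * maximum sync interval."""
    return 2 * throughput_mb_s * sync_interval_s


# Example: a backing disk that sustains about 100 MB/s (assumed value).
size = journal_size_mb(100)
print(f"suggested minimum journal size: {size:.0f} MB")
```

By this estimate the common 10 GB journal leaves generous headroom for a single SATA disk.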

2.Number of Placement Groups

Setting the correct number of placement groups is an essential step in building Ceph storage clusters. The formula to calculate the total number of placement groups for a Ceph cluster is:

Total PGs = (Total number of OSD * 100) / maximum replication count

The maximum replication count is the largest number of replicas configured for an object. The result must be rounded up to the nearest power of 2. For example, a result of 1888.82 is rounded up to 2048.

The total number of PGs per pool in the Ceph cluster is calculated by,

PGs per pool = ((Total number of OSD * 100) / maximum replication count) / pool count

This value also needs to be rounded up to the nearest power of two.
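The two formulas above can be turned into a small calculator. This is a minimal sketch of the arithmetic only; the function names and the nine-OSD example cluster are assumptions made for illustration.

```python
def round_up_pow2(value):
    """Smallest power of two that is greater than or equal to value."""
    power = 1
    while power < value:
        power *= 2
    return power


def total_pgs(osd_count, max_replicas, pool_count=1):
    """(OSDs * 100) / replication count / pools, rounded up to a power of 2."""
    raw = (osd_count * 100) / max_replicas / pool_count
    return round_up_pow2(raw)


print(round_up_pow2(1888.82))        # the example value from the text
print(total_pgs(9, 3))               # 9 OSDs, 3 replicas, whole cluster
print(total_pgs(9, 3, pool_count=4)) # same cluster split across 4 pools
```

For the hypothetical nine-OSD cluster with three replicas, the raw result is 300 for the whole cluster and 75 per pool across four pools, before rounding up.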

CONCLUSION

Compared with other storage solutions available today, Ceph offers more features. Ceph is an open-source, software-defined storage solution that runs on commodity hardware, which makes it economical. Ceph provides a variety of interfaces through which clients can connect to a Ceph cluster, which increases flexibility for clients. For data protection, Ceph does not rely on RAID technology. Rather, it uses replication, which has proved to be a better solution than RAID. Every component of Ceph is reliable and supports high availability. Ceph does not have any single point of failure, which is a major challenge for other storage solutions available today. One of the biggest advantages of Ceph is its unified nature: it provides block, file, and object storage, while other storage systems are still incapable of providing this.

Ceph is a distributed storage system, and clients can perform quick transactions using Ceph. It does not follow the traditional method of storing data by maintaining metadata. Instead, it introduces a new mechanism that allows clients to dynamically calculate the data locations they need. This improves performance for clients, as they no longer need to wait for a metadata server to return data locations. Where other storage systems cannot provide reliability against multiple failures, Ceph detects and corrects failures at the disk, node, network, and data center level.

Other storage solutions can only provide reliability up to disk or node failure. Ceph provides a unified, distributed, highly scalable, and reliable object storage solution, which is much needed for today’s and the future’s unstructured data. The world’s storage needs are increasing, so we need a storage system that scales to the exabyte level without affecting data reliability and performance. Ceph provides a solution to all these problems.