Lecture Notes of Day 21: MongoDB Sharding
Objective: Understand the concept of sharding in MongoDB for horizontal scaling.
Outcome: By the end of this lecture, students will be able to set up and manage a sharded MongoDB cluster.
1. Introduction to MongoDB Sharding
As data grows in size, a single server may not be sufficient to handle the workload. MongoDB provides sharding as a method of horizontal scaling that distributes data across multiple machines.
1.1 What is Sharding?
Sharding is the process of distributing data across multiple servers to:
- Improve read and write performance
- Increase storage capacity
- Support high availability (when each shard is deployed as a replica set)
- Handle large-scale applications
1.2 Why Do We Need Sharding?
Without sharding, MongoDB stores all data on a single server (or a single replica set), which can lead to:
- Performance bottlenecks due to high traffic.
- Increased response time for queries.
- Storage limitations of a single machine.
By implementing sharding, MongoDB distributes the load across multiple servers, so each server handles only a portion of the data and the queries.
2. Components of a Sharded Cluster
A MongoDB sharded cluster consists of three main components:
1. Shards
   - Store the actual data in a distributed manner.
   - Each shard is a replica set to ensure high availability.
2. Config Servers
   - Store metadata and configuration settings of the cluster.
   - Help the cluster track which shard contains which data.
3. Query Routers (mongos)
   - Handle client requests and direct queries to the appropriate shard.
   - Act as an interface between the application and the sharded cluster.
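The way these three components cooperate can be illustrated with a toy model. The sketch below is not how mongos is implemented: the chunk table stands in for the metadata held by the config servers, the names chunkTable and routeQuery are hypothetical, and only equality matches on an assumed numeric shard key user_id are handled.

```javascript
// Toy model of the routing decision made by a query router.
// chunkTable and routeQuery are hypothetical names, not MongoDB APIs.

// Each chunk covers shard key values in [min, max) and lives on one shard.
// This table plays the role of the config server metadata.
const chunkTable = [
  { min: -Infinity, max: 1000, shard: "shard1ReplSet" },
  { min: 1000, max: 2000, shard: "shard2ReplSet" },
  { min: 2000, max: Infinity, shard: "shard1ReplSet" },
];

const SHARD_KEY = "user_id"; // assumed shard key for this sketch

// Return which shards must receive a query with the given filter.
function routeQuery(filter) {
  const allShards = [...new Set(chunkTable.map((c) => c.shard))];
  const value = filter[SHARD_KEY];
  // Without a usable shard key value, every shard has to be asked.
  if (typeof value !== "number" || Number.isNaN(value)) {
    return { type: "broadcast", shards: allShards };
  }
  // With a shard key value, exactly one chunk (and so one shard) matches.
  const chunk = chunkTable.find((c) => value >= c.min && value < c.max);
  return { type: "targeted", shards: [chunk.shard] };
}

console.log(routeQuery({ user_id: 1500 })); // targeted, shard2ReplSet only
console.log(routeQuery({ name: "Asha" }));  // broadcast to both shards
```

The second query shows why the application never needs to know where data lives: the router falls back to asking every shard when the filter gives it nothing to target with.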
3. How Sharding Works
MongoDB divides data across shards using a shard key.
3.1 Shard Key
A shard key is a field (or a combination of fields) used to determine the distribution of documents in a collection. It should:
- Appear in the most frequent queries, so that the query router can send them to the relevant shard instead of to every shard.
- Have high cardinality (many unique values) to distribute data evenly.
- Avoid hotspots (uneven distribution that concentrates load on a few shards and leads to performance issues).
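One rough way to compare candidate shard keys against the cardinality criterion is to measure them on a sample of documents. The sketch below uses made-up sample data and a hypothetical helper, keyStats; the fields user_id and country are assumptions chosen for illustration.

```javascript
// Hypothetical sample of documents from a users collection.
const countries = ["IN", "US", "IN", "IN", "UK"];
const sampleDocs = [];
for (let i = 0; i < 1000; i++) {
  sampleDocs.push({ user_id: i, country: countries[i % countries.length] });
}

// Summarise a candidate shard key: how many distinct values it has, and
// what share of the documents the single most common value holds.
function keyStats(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  const largest = Math.max(...counts.values());
  return { field, distinct: counts.size, largestShare: largest / docs.length };
}

console.log(keyStats(sampleDocs, "country")); // 3 distinct values, largest share 0.6
console.log(keyStats(sampleDocs, "user_id")); // 1000 distinct values, largest share 0.001
```

With country as the key, 60% of the sample shares one value and cannot be split across shards, whereas user_id leaves every document free to be placed separately. High cardinality is necessary but not sufficient: the key also has to match the query pattern.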
3.2 Sharding Methods
MongoDB supports two main types of sharding:
1. Range-Based Sharding
   - Data is distributed based on contiguous ranges of shard key values.
   - Example: If the shard key is age, documents are distributed as:
     - Shard 1: { age: 1 - 30 }
     - Shard 2: { age: 31 - 60 }
     - Shard 3: { age: 61 - 100 }
   - Disadvantage: If most reads or writes target a specific range, the shard holding that range may be overloaded.
2. Hash-Based Sharding
   - Data is distributed using a hash function on the shard key.
   - Gives a more even distribution across shards.
   - Example: If the shard key is _id, each document is placed according to the hash of its _id. The placement is deterministic, but consecutive _id values usually land on different shards.
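The difference between the two methods shows up clearly on a skewed workload. The following is a minimal sketch under stated assumptions: the range boundaries are the age ranges from the example above, and MD5 modulo the shard count is only a stand-in for hashing. MongoDB uses its own hash function and assigns ranges of hashed values to chunks rather than taking a modulo. The function names are hypothetical.

```javascript
const crypto = require("crypto");

const SHARDS = ["Shard 1", "Shard 2", "Shard 3"];

// Range-based placement using the age ranges from the example.
function rangeShard(age) {
  if (age <= 30) return "Shard 1";
  if (age <= 60) return "Shard 2";
  return "Shard 3";
}

// Hash-based placement. MD5 modulo the shard count is an assumption made
// for this sketch, not the scheme MongoDB uses internally.
function hashShard(value) {
  const digest = crypto.createHash("md5").update(String(value)).digest();
  return SHARDS[digest.readUInt32BE(0) % SHARDS.length];
}

// Count how many documents each shard receives under a placement rule.
function countByShard(docs, place) {
  const counts = Object.fromEntries(SHARDS.map((s) => [s, 0]));
  for (const doc of docs) counts[place(doc)] += 1;
  return counts;
}

// Skewed workload: 1000 users, all aged between 20 and 29.
const docs = Array.from({ length: 1000 }, (_, i) => ({ _id: i, age: 20 + (i % 10) }));

const byRange = countByShard(docs, (d) => rangeShard(d.age));
const byHash = countByShard(docs, (d) => hashShard(d._id));

console.log("range on age:", byRange); // every document lands on Shard 1
console.log("hash on _id:", byHash);   // documents spread over all three shards
```

Because the hash is deterministic, hashShard returns the same shard every time for the same _id, which is what lets a router find the document again later.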
4. Setting Up a Sharded Cluster
4.1 Prerequisites
- Install MongoDB on multiple servers or instances.
- Ensure network connectivity between all nodes.
4.2 Steps to Configure a Sharded Cluster
1. Start Config Servers
Config servers store the cluster metadata. Start them using:
    mongod --configsvr --replSet configReplSet --port 27019 --dbpath /data/configdb
2. Start Shards (Replica Sets)
Start a mongod for each shard. The second command uses port 27020 to match the second shard added in step 5; its data path is an assumed example:
    mongod --shardsvr --replSet shard1ReplSet --port 27018 --dbpath /data/shard1
    mongod --shardsvr --replSet shard2ReplSet --port 27020 --dbpath /data/shard2
3. Initiate the Replica Sets
Connect to the config server and to each shard with the shell, and initialize each replica set:
    rs.initiate()
4. Start Query Router (mongos)
    mongos --configdb configReplSet/localhost:27019 --port 27017
5. Add Shards to the Cluster
Connect to mongos with the shell and run:
    sh.addShard("shard1ReplSet/localhost:27018")
    sh.addShard("shard2ReplSet/localhost:27020")
6. Enable Sharding for a Database
    sh.enableSharding("mydatabase")
7. Choose a Shard Key and Shard a Collection
    sh.shardCollection("mydatabase.users", { "user_id": "hashed" })
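Steps 3 and 5 above rely on two pieces of data: a replica set configuration document and a shard connection string. The sketch below builds both in plain JavaScript so their shape is visible. The helper names replicaSetConfig and shardConnectionString are hypothetical, and the single-member sets on localhost simply mirror the commands in these notes; a production deployment would normally list several members per set.

```javascript
// Build a replica set configuration document of the form accepted by
// rs.initiate(). Helper names are hypothetical, chosen for this sketch.
function replicaSetConfig(name, hosts, isConfigServer = false) {
  const config = {
    _id: name, // must match the --replSet value given to mongod
    members: hosts.map((host, i) => ({ _id: i, host })),
  };
  // Config server replica sets are marked with configsvr: true.
  if (isConfigServer) config.configsvr = true;
  return config;
}

// Build the "replicaSetName/host1,host2" string passed to sh.addShard().
function shardConnectionString(name, hosts) {
  return `${name}/${hosts.join(",")}`;
}

const configRs = replicaSetConfig("configReplSet", ["localhost:27019"], true);
const shard1Rs = replicaSetConfig("shard1ReplSet", ["localhost:27018"]);
const shard1Uri = shardConnectionString("shard1ReplSet", ["localhost:27018"]);

console.log(JSON.stringify(configRs, null, 2));
console.log(shard1Uri);

// In the MongoDB shell these values would be used as:
//   rs.initiate(configRs)    // while connected to localhost:27019
//   rs.initiate(shard1Rs)    // while connected to localhost:27018
//   sh.addShard(shard1Uri)   // while connected to mongos on localhost:27017
```

Passing an explicit configuration to rs.initiate() instead of calling it with no arguments makes the member list and the set name visible in one place, which helps when the same script is reused for several shards.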
5. Advantages of Sharding
- Scalability: Supports large datasets and high traffic.
- High Availability: Each shard is a replica set, so the data on a shard remains available if one of its members fails.
- Improved Query Performance: Distributes queries across multiple servers.
- Load Balancing: Spreads workload across multiple nodes.
6. Challenges of Sharding
- Complexity in Setup: Requires configuring config servers, shards, and query routers.
- Shard Key Selection: Choosing an inefficient key can lead to hotspots.
- Operational Overhead: Monitoring and maintaining multiple nodes.
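Part of that monitoring is checking whether data is spread evenly, since a poor shard key tends to show up as one shard holding far more than the others. The sketch below is a simple illustration, not a MongoDB feature: the document counts are made up (in practice they might be read from the output of db.collection.getShardDistribution() in the shell), and both the skewRatio helper and the 1.5 threshold are assumptions.

```javascript
// Hypothetical per-shard document counts for one collection.
const docsPerShard = { shard1ReplSet: 90000, shard2ReplSet: 10000 };

// Ratio of the fullest shard to the average shard.
// 1.0 means perfectly even; larger values mean more skew.
function skewRatio(counts) {
  const values = Object.values(counts);
  const average = values.reduce((a, b) => a + b, 0) / values.length;
  return Math.max(...values) / average;
}

const ratio = skewRatio(docsPerShard);

// The 1.5 threshold is an arbitrary choice for this sketch.
console.log(ratio > 1.5 ? "possible hotspot" : "roughly balanced"); // prints "possible hotspot"
```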
7. Summary
- Sharding is a horizontal scaling technique used in MongoDB.
- A sharded cluster consists of Shards, Config Servers, and Query Routers.
- Data is distributed using Range-Based or Hash-Based sharding.
- Setting up a sharded cluster involves multiple configuration steps.
- Sharding provides scalability and load balancing, and high availability when each shard is a replica set.
8. Assignment
1. What is sharding, and why is it important in MongoDB?
2. Explain the difference between Range-Based and Hash-Based sharding.
3. List the steps required to configure a MongoDB sharded cluster.
4. What are some challenges faced while implementing sharding?
