Characteristics of Big Data
Application of Big Data Processing
Introduction to BIG DATA
Where to get Big Data?
Types of Big Data
Storage layer - HDFS (Hadoop Distributed File System)
MapReduce
YARN
How Hadoop works?
Hadoop Eco System
Hadoop Architecture
Hadoop Installation & Environment Setup
Setting Up A Single Node Hadoop Cluster
Ubuntu User Configuration
SSH Setup With Key Generation
Disable IPv6
Download and Install Hadoop 3.1.2
Working with Configuration Files
Start The Hadoop instances
Hadoop Distributed File System (HDFS)
HDFS Features and Goals
HDFS Architecture
Read Operations in HDFS
Write Operations In HDFS
HDFS Operations
YARN
YARN Features
YARN Architecture
Resource Manager
Node Manager
Application Master
Container
Application Workflow in Hadoop YARN
Hadoop MapReduce
How MapReduce Works?
MapReduce Examples with Python
Running The MapReduce Program & Storing The Data File To HDFS
Create A Python Script
Hadoop Environment Setup
Execute The Script
Apache Hive Definition
Why Apache Hive?
Features Of Apache Hive
Hive Architecture
Hive Metastore
Hive Query Language
SQL vs Hive
Hive Installation
Apache Pig Definition
MapReduce vs. Apache Pig vs. Hive
Apache Pig Architecture
Installation Process Of Apache Pig
Execute Apache Pig Script
Hadoop Eco Components
NoSQL Data Management
Apache HBase
Apache Cassandra
MongoDB
Introduction To Kafka
The Architecture of Apache Flume
Apache Spark Ecosystem
MongoDB is another kind of NoSQL database, designed for high performance, easy scalability, and high availability. It is built around collections and documents. A MongoDB database is a container for collections. A collection is a set of documents, roughly analogous to a table in a relational database, and a document is a set of key-value pairs.
Documents have a dynamic schema, which means that documents in the same collection do not need to share the same structure, and the same field may hold different data types in different documents. That is why MongoDB is so flexible. This dynamic schema is often described as schema-less, meaning that a document can contain almost any set of fields.
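To make the idea of a dynamic schema concrete, here is a minimal sketch in plain Python. It does not use MongoDB or its driver: a list of dictionaries stands in for a collection, and the sample documents and the helper names all_fields and field_types are made up for illustration.

```python
# A plain-Python stand-in for a collection: a list of dict "documents".
# This is NOT MongoDB's API; it only illustrates the dynamic schema idea.
# All sample values below are hypothetical.
posts = [
    {"_id": 1, "postTitle": "Dog Videos", "likes": 4000000},
    {"_id": 2, "postTitle": "Cat Photos", "likes": "many", "tags": ["cats"]},
    {"_id": 3, "author": {"name": "Sam"}},
]


def all_fields(collection):
    """Return every field name that appears in at least one document."""
    return sorted({key for doc in collection for key in doc})


def field_types(collection, field):
    """Return the Python type names seen for `field` across the documents."""
    return {type(doc[field]).__name__ for doc in collection if field in doc}


# Documents in the same collection carry different sets of fields...
print(all_fields(posts))
# ...and the same field ("likes") holds an int in one document and a str in another.
print(field_types(posts, "likes"))
```

Nothing in the "collection" rejects the third document for lacking a postTitle, which is the flexibility the text describes.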
MongoDB Document Structure:
MongoDB comes with a document-based query language that makes querying documents straightforward. The following sample document gives an idea of what might be stored in MongoDB.
{
  _id: ObjectId(23f2918g201),
  postTitle: "Dog Videos",
  facebookUser: 1037882,
  facebookURL: "facebook.com/dogvideos",
  likes: 4000000,
  comments: 9073,
  shares: 100000050,
  postMetadata: [
    {
      timePosted: 10929682,
      clicksOnPost: 923081208,
      usersReached: 99302981069
    }
  ]
}
In this JSON-style snippet, we have information about a Facebook post titled "Dog Videos" from a specific Facebook user, which received a large number of likes, comments, and shares. The postMetadata field holds further key-value pairs, showing that documents can be nested. MongoDB differs from a relational database in that each record is stored as a JSON-like document (internally in a binary form called BSON). This is what makes a schema-less design possible: no two documents in a collection need to have the same fields. Each individual document is still structured, but the collection does not enforce what the documents look like. The document query language lets us reach deep into a nested document to pull out the data we need. This is also why complex joins are often unnecessary: related data lives inside the same document, so we do not have to go from table to table to collect it.
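Reaching into a nested document can be sketched in plain Python as follows. The helper get_path is a hypothetical name, not part of MongoDB; it only mimics the idea of addressing a nested value with a dot-separated path, using a subset of the sample document above.

```python
def get_path(document, path):
    """Follow a dot-separated path such as "postMetadata.0.clicksOnPost".

    Numeric path parts index into lists; other parts are dictionary keys.
    This is a simplified illustration, not MongoDB's query engine.
    """
    current = document
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]
        else:
            current = current[part]
    return current


# A subset of the sample document, written as a Python dictionary.
post = {
    "postTitle": "Dog Videos",
    "likes": 4000000,
    "postMetadata": [
        {
            "timePosted": 10929682,
            "clicksOnPost": 923081208,
            "usersReached": 99302981069,
        }
    ],
}

# Everything about the post is reachable from the one document, no join needed.
print(get_path(post, "postTitle"))
print(get_path(post, "postMetadata.0.clicksOnPost"))
```

Because the metadata travels inside the post itself, a single lookup replaces what would be a join between a posts table and a metadata table in a relational design.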
MongoDB Features
This part discusses some key features of MongoDB:
Ad-hoc Queries: Ad-hoc queries are queries whose shape is not known when the database is designed. MongoDB supports them directly: documents can be searched by field, by range, or by regular expression without defining the queries in advance. Such queries run against the live data, and their performance can be improved with indexes.
Schema-Less Database: In MongoDB, a single collection can hold different kinds of documents. Because the collection is schema-less, one document may differ from another in its fields, content, and size. This gives MongoDB flexibility in dealing with varied data.
Document Oriented: MongoDB is a document-oriented database. Relational databases arrange data in tables and rows, where every row has a fixed number of columns and each column stores a particular type of data. In MongoDB, documents and fields take the place of rows and columns: documents can store different types of data, and similar documents are grouped into collections. Every document has a unique key, the _id field, which can be supplied by the user or generated by the system as an ObjectId.
Indexing: Indexing is crucial for tuning the performance of search queries. We should index the fields that appear in our frequent search criteria; without a suitable index, MongoDB has to scan every document in the collection. Any field can be indexed, using the primary index on _id and secondary indexes on other fields.
Aggregation: MongoDB provides an aggregation framework for processing data in batches: documents are grouped, operations are applied to each group, and a single combined result is returned. There are three ways to aggregate: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.
Replication: Replication is how MongoDB provides redundancy. It copies the same data to several machines. A replica set has one primary node and one or more secondary nodes. If the primary node goes down, one of the secondary nodes is elected as the new primary. This reduces maintenance downtime and keeps operations running smoothly.
GridFS: This feature stores and retrieves files, and is especially useful for files larger than 16 MB, the maximum size of a single document. GridFS divides a file into chunks and stores each chunk as a separate document. Chunks have a default size of 255 kB, except for the last chunk, which is only as large as needed. When we query GridFS for a file, the chunks are reassembled as required.
Sharding: Sharding is used to deal with very large databases, where a single server would struggle to hold the data or to answer queries on it. Sharding distributes the data across multiple MongoDB instances: a large collection is split into smaller parts, and each part is stored on a separate instance called a "shard". The shards together form a sharded cluster.
High Performance: MongoDB is an open-source database that offers high performance together with high availability and scalability. Indexing speeds up query responses, while replication keeps the data available. This makes MongoDB a good fit for big data and real-time applications.
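The GridFS chunking described above can be sketched in plain Python. This is only an illustration of the splitting arithmetic, not the GridFS implementation: the function names are hypothetical, the chunk fields files_id, n, and data are modeled on GridFS chunk documents, and the 255 kB default is assumed here to mean 255 × 1024 bytes.

```python
CHUNK_SIZE = 255 * 1024  # assumed meaning of the 255 kB default, in bytes


def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Split `data` into chunk documents of at most `chunk_size` bytes each."""
    return [
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]


def reassemble(chunks):
    """Rebuild the original bytes by joining the chunks in sequence order."""
    ordered = sorted(chunks, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)


# 600 KiB of zero bytes as a stand-in for a real file (hypothetical data).
payload = bytes(600 * 1024)
chunks = split_into_chunks("example-file", payload)

# Two full chunks plus a smaller final chunk.
print(len(chunks), [len(chunk["data"]) for chunk in chunks])
print(reassemble(chunks) == payload)
```

Storing the sequence number n with each chunk is what allows the pieces to be put back together in the right order, whatever order they are read in.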