Structured, Semi-structured and Unstructured data
Big Data is characterized by huge volume, high velocity, and a wide variety of data. That data falls into three types: structured, semi-structured, and unstructured.
- Structured data is data whose elements are addressable for effective analysis. It is organised into a formatted repository, typically a database. Example: a relational database.
- Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyse. With some processing it can be stored in a relational database (which can be very hard for some kinds of semi-structured data), but semi-structured formats exist to save space. Examples: XML, JSON.
- Unstructured data is data that is not organised in a pre-defined manner or does not have a pre-defined data model, so it is not a good fit for a mainstream relational database. Alternative platforms exist for storing and managing it. It is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word documents, PDFs, text, media, logs.
NoSQL (Not Only SQL database)
NoSQL is an approach to database design that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats. NoSQL, which stands for “not only SQL,” is an alternative to traditional relational databases, in which data is placed in tables and the data schema is carefully designed before the database is built. NoSQL databases are especially useful for working with large sets of distributed data.
Key-value stores, or key-value databases, implement a simple data model that pairs a unique key with an associated value.
Document databases, also called document stores, store semi-structured data and descriptions of that data in document format. They allow developers to create and update programs without needing to reference a master schema. Use of document databases has increased along with use of JavaScript and the JavaScript Object Notation (JSON).
Wide-column stores organize data tables as columns instead of as rows.
Graph data stores organize data as nodes, which are like records in a relational database, and edges, which represent connections between nodes.
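As a rough illustration of the four models described above, the sketch below represents each one with plain Java collections. The class name DataModelSketch and every sample key and value are hypothetical; real NoSQL stores add persistence, distribution and indexing on top of these shapes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DataModelSketch {
    // Key-value: a unique key paired with an opaque value.
    static Map<String, String> keyValue() {
        Map<String, String> store = new HashMap<>();
        store.put("session:42", "opaque-session-bytes");
        return store;
    }

    // Document: the value is a nested, self-describing (JSON-like) structure.
    static Map<String, Object> document() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "rick");
        doc.put("friends", List.of(Map.of("name", "cate")));
        return doc;
    }

    // Column-oriented: values are grouped by column name, then by row key.
    static Map<String, Map<String, String>> columns() {
        Map<String, Map<String, String>> columns = new LinkedHashMap<>();
        columns.computeIfAbsent("name", c -> new LinkedHashMap<>()).put("row1", "rick");
        columns.computeIfAbsent("name", c -> new LinkedHashMap<>()).put("row2", "cate");
        columns.computeIfAbsent("city", c -> new LinkedHashMap<>()).put("row1", "Paris");
        return columns;
    }

    // Graph: nodes plus edges between them, stored as an adjacency list.
    static Map<String, Set<String>> graph() {
        Map<String, Set<String>> edges = new HashMap<>();
        edges.computeIfAbsent("rick", n -> new HashSet<>()).add("cate");
        edges.computeIfAbsent("cate", n -> new HashSet<>()).add("rick");
        return edges;
    }

    public static void main(String[] args) {
        System.out.println(keyValue());
        System.out.println(document());
        System.out.println(columns());
        System.out.println(graph());
    }
}
```

Note how the document and column shapes tolerate missing fields (row2 has no city), which is the flexibility a fixed relational schema does not offer.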
Couchbase
Couchbase Server, originally known as Membase, is an open-source, distributed (shared-nothing architecture) multi-model NoSQL document-oriented database software package that is optimized for interactive applications. Couchbase Server is designed to provide easy-to-scale key-value or JSON document access with low latency and high sustained throughput. It is designed to be clustered from a single machine to very large-scale deployments spanning many machines.
Couchbase Inc. describes Couchbase as an Engagement Database, a new category of database that enables enterprises to continually create and reinvent the customer experience. Unlike traditional databases, the Engagement Database taps into dynamic data, at any scale and across any channel or device, to liberate data’s full potential at a time when the strategic use of data to create exceptional customer experiences has become a key competitive differentiator for businesses.
In the Engagement Database architecture, data is first cached in memory, then replicated for availability, and finally written to disk.
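The memory-first ordering can be pictured with a toy model. The sketch below is only an illustration under simplifying assumptions: WritePathSketch and its method names are hypothetical, the "replica" and "disk" are ordinary in-memory maps, and the queues are drained by explicit calls rather than by background threads.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

final class WritePathSketch {
    private final Map<String, String> memory = new HashMap<>();   // in-memory cache
    private final Map<String, String> replica = new HashMap<>();  // stand-in for another node
    private final Map<String, String> disk = new HashMap<>();     // stand-in for persistent storage
    private final Queue<String> replicationQueue = new ArrayDeque<>();
    private final Queue<String> diskQueue = new ArrayDeque<>();

    // Step 1: the write lands in memory and is queued for the later steps.
    void write(String key, String value) {
        memory.put(key, value);
        replicationQueue.add(key);
        diskQueue.add(key);
    }

    // Step 2: copy every queued item to the replica.
    void replicate() {
        String key;
        while ((key = replicationQueue.poll()) != null) {
            replica.put(key, memory.get(key));
        }
    }

    // Step 3: persist every queued item to disk.
    void flushToDisk() {
        String key;
        while ((key = diskQueue.poll()) != null) {
            disk.put(key, memory.get(key));
        }
    }

    String readFromMemory(String key) { return memory.get(key); }
    boolean isReplicated(String key) { return replica.containsKey(key); }
    boolean isPersisted(String key) { return disk.containsKey(key); }
}
```

A read issued right after write() is already served from memory, even though replicate() and flushToDisk() have not run yet; that gap is what gives low write latency in this kind of design.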
Core features of Couchbase
Data: Couchbase Server stores data as items. Each item consists of a key, by which the item is referenced, and an associated value, which must be either binary or a JSON document.
Buckets, Memory, and Storage: Items are stored in named Buckets and are kept either only in memory or both in memory and on disk.
Services: Services can be deployed to support different forms of data access. Details are given in the Services section below.
Clusters and Availability: A single node running Couchbase Server is considered a cluster of one node. As successive nodes are initialized, each can be configured to join the existing cluster.
Across the nodes of each cluster, Couchbase data is evenly distributed and replicated: nodes can be removed, and node-failure handled, without data-loss. Data can be selected for replication across clusters residing in different data centres, to ensure high availability.
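One common way to spread keys evenly over nodes is to hash each key to a fixed number of partitions and assign partitions to nodes. The sketch below shows that idea in a deliberately simplified form; it is not Couchbase's actual algorithm. The partition count of 1024, the round-robin assignment, the "next node" replica placement, and the name PartitionSketch are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class PartitionSketch {
    static final int PARTITIONS = 1024; // assumed partition count for this sketch

    // Hash the key to a partition id in the range [0, PARTITIONS).
    static int partitionOf(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % PARTITIONS);
    }

    // Partitions are assigned to nodes round-robin; returns the owning node's index.
    static int nodeOf(String key, int nodeCount) {
        return partitionOf(key) % nodeCount;
    }

    // The replica lives on the following node, so one node failure loses no data.
    static int replicaNodeOf(String key, int nodeCount) {
        return (nodeOf(key, nodeCount) + 1) % nodeCount;
    }

    public static void main(String[] args) {
        String key = "u:king_arthur";
        System.out.println("partition = " + partitionOf(key));
        System.out.println("active node = " + nodeOf(key, 4));
        System.out.println("replica node = " + replicaNodeOf(key, 4));
    }
}
```

Because the mapping depends only on the key, any client can compute where an item lives without asking a central coordinator.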
Services
Couchbase Server provides the following services:
- Data: Supports the storing, setting, and retrieving of data-items, specified by key.
- Query: Parses queries specified in the N1QL query-language, executes the queries, and returns results. The Query Service interacts with both the Data and Index services.
- Index: Creates indexes, for use by the Query and Analytics services.
- Search: Creates indexes purpose-built for Full Text Search. This supports language-aware searching, allowing users to search for, say, the word beauties and also obtain results for beauty and beautiful.
- Analytics: Supports join, set, aggregation, and grouping operations, which are expected to be large, long-running, and highly consumptive of memory and CPU resources.
- Eventing: Supports near real-time handling of changes to data: code can be executed both in response to document-mutations, and as scheduled by timers.
N1QL
N1QL (pronounced nickel) is used for manipulating the JSON data in Couchbase, just as SQL manipulates data in an RDBMS. It has SELECT, INSERT, UPDATE, DELETE, and MERGE statements to operate on JSON data.
The N1QL data model is non-first normal form (N1NF) with support for nested attributes and domain-oriented normalization. The N1QL data model is also a proper superset and generalization of the relational model.
Example
{
"email": "[email protected]",
"friends": [
{"name":"rick"},
{"name":"cate"}
]
}
Like Query
SELECT * FROM `bucket` WHERE email LIKE "%@example.org";
Array Query
SELECT * FROM `bucket` WHERE ANY x IN friends SATISFIES x.name = "cate" END;
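To make the meaning of the two predicates concrete, the sketch below evaluates their equivalents over an in-memory document using plain Java, with no Couchbase dependency. The class N1qlSemanticsSketch and the sample address someone@example.org are hypothetical, and the LIKE handling covers only this one pattern (a leading % followed by a literal suffix).

```java
import java.util.List;
import java.util.Map;

final class N1qlSemanticsSketch {
    // Sample document shaped like the JSON example; the email value is made up.
    static Map<String, Object> sampleDoc() {
        return Map.of(
            "email", "someone@example.org",
            "friends", List.of(Map.of("name", "rick"), Map.of("name", "cate")));
    }

    // Equivalent of: email LIKE "%@example.org" (% matches any prefix).
    static boolean emailMatches(Map<String, Object> doc) {
        Object email = doc.get("email");
        return email instanceof String && ((String) email).endsWith("@example.org");
    }

    // Equivalent of: ANY x IN friends SATISFIES x.name = <name> END
    static boolean hasFriendNamed(Map<String, Object> doc, String name) {
        Object friends = doc.get("friends");
        if (!(friends instanceof List)) {
            return false; // a missing or non-array field never satisfies ANY
        }
        for (Object x : (List<?>) friends) {
            if (x instanceof Map && name.equals(((Map<?, ?>) x).get("name"))) {
                return true; // one matching element is enough
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = sampleDoc();
        System.out.println(emailMatches(doc));
        System.out.println(hasFriendNamed(doc, "cate"));
    }
}
```

The ANY ... SATISFIES form is what lets a query reach into the nested friends array without first flattening it into a separate table.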
Programming model
Couchbase provides client libraries for different programming languages, including Java, .NET, PHP, Ruby, C, Python, and Node.js.
The following is the core API that Couchbase offers, in an abstract sense:
# Get a document by key
doc = get(key)
# Modify a document; note that the whole document
# needs to be passed in
set(key, doc)
# Modify a document when no one has modified it
# since my last read
casVersion = doc.getCas()
cas(key, casVersion, changedDoc)
# Create a new document, with an expiration time
# after which the document will be deleted
addIfNotExist(key, doc, timeToLive)
# Delete a document
delete(key)
# When the value is an integer, increment the integer
increment(key)
# When the value is an integer, decrement the integer
decrement(key)
# When the value is an opaque byte array, append more
# data to the existing value
append(key, newData)
# Query the data
results = query(viewName, queryParameters)
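The compare-and-swap (cas) call above is the least obvious of these operations, so here is a minimal in-memory sketch of its semantics. It is not the Couchbase client: CasSketch, Versioned, and the simple counter used as the version number are assumptions chosen to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

final class CasSketch {
    // A stored value together with the version (CAS) number of its last write.
    static final class Versioned {
        final String value;
        final long cas;

        Versioned(String value, long cas) {
            this.value = value;
            this.cas = cas;
        }
    }

    private final Map<String, Versioned> items = new HashMap<>();
    private long nextCas = 1;

    synchronized Versioned get(String key) {
        return items.get(key);
    }

    // Unconditional write: replaces the whole value and returns its new CAS.
    synchronized long set(String key, String value) {
        long cas = nextCas++;
        items.put(key, new Versioned(value, cas));
        return cas;
    }

    // Conditional write: succeeds only if nobody has written since expectedCas.
    synchronized boolean cas(String key, long expectedCas, String newValue) {
        Versioned current = items.get(key);
        if (current == null || current.cas != expectedCas) {
            return false; // stale version: the caller should re-read and retry
        }
        items.put(key, new Versioned(newValue, nextCas++));
        return true;
    }

    public static void main(String[] args) {
        CasSketch store = new CasSketch();
        long v1 = store.set("counter", "1");
        System.out.println(store.cas("counter", v1, "2")); // true: no write since v1
        System.out.println(store.cas("counter", v1, "3")); // false: v1 is now stale
    }
}
```

This is optimistic concurrency: no lock is held between the read and the write, and a conflicting writer is detected only at update time.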
Couchbase Java SDK
The code snippet below shows how the Java SDK may be used for some common operations:
import com.couchbase.client.java.*;
import com.couchbase.client.java.document.*;
import com.couchbase.client.java.document.json.*;
import com.couchbase.client.java.query.*;

public class Example {
    public static void main(String... args) throws Exception {
        // Initialize the Connection
        Cluster cluster = CouchbaseCluster.create("localhost");
        cluster.authenticate("username", "password");
        Bucket bucket = cluster.openBucket("bucketname");

        // Create a JSON Document
        JsonObject arthur = JsonObject.create()
            .put("name", "Arthur")
            .put("email", "[email protected]")
            .put("interests", JsonArray.from("Holy Grail", "African Swallows"));

        // Store the Document
        bucket.upsert(JsonDocument.create("u:king_arthur", arthur));

        // Load the Document and print it
        // Prints Content and Metadata of the stored Document
        System.out.println(bucket.get("u:king_arthur"));

        // Create a N1QL Primary Index (but ignore if it exists)
        bucket.bucketManager().createN1qlPrimaryIndex(true, false);

        // Perform a N1QL Query
        N1qlQueryResult result = bucket.query(
            N1qlQuery.parameterized("SELECT name FROM `bucketname` WHERE $1 IN interests",
                JsonArray.from("African Swallows"))
        );

        // Print each found Row
        for (N1qlQueryRow row : result) {
            // Prints {"name":"Arthur"}
            System.out.println(row);
        }
    }
}
Spring Data Couchbase
The Spring Data Couchbase project provides integration with the Couchbase Server database. Its key functional areas are a POJO-centric model for interacting with Couchbase Buckets and support for easily writing a Repository-style data access layer.
1. Data Model
First create an entity class representing the JSON document to persist.
@Document
public class Person {
    @Id
    private String id;

    @Field
    @NotNull
    private String firstName;

    @Field
    @NotNull
    private String lastName;

    @Field
    @NotNull
    private DateTime created;

    @Field
    private DateTime updated;

    // standard getters and setters
}
2. Couchbase Repository
We declare a repository interface for the Person class by extending CrudRepository<Person, String> and adding a derivable query method:
public interface PersonRepository extends CrudRepository<Person, String> {
    List<Person> findByFirstName(String firstName);
}
3. Service Layer
For our service layer, we define an interface and an implementation using the Spring Data repository abstraction. Here is our PersonService interface:
public interface PersonService {
    Person findOne(String id);
    List<Person> findAll();
    List<Person> findByFirstName(String firstName);
    void create(Person person);
    void update(Person person);
    void delete(Person person);
}
4. Service Implementation
@Service
@Qualifier("PersonRepositoryService")
public class PersonRepositoryService implements PersonService {
    @Autowired
    private PersonRepository repo;

    public Person findOne(String id) {
        return repo.findOne(id);
    }

    public List<Person> findAll() {
        // Copy the Iterable returned by the repository into a List
        List<Person> people = new ArrayList<>();
        Iterator<Person> it = repo.findAll().iterator();
        while (it.hasNext()) {
            people.add(it.next());
        }
        return people;
    }

    public List<Person> findByFirstName(String firstName) {
        return repo.findByFirstName(firstName);
    }

    public void create(Person person) {
        // Stamp the creation time before saving
        person.setCreated(DateTime.now());
        repo.save(person);
    }

    public void update(Person person) {
        // Stamp the modification time before saving
        person.setUpdated(DateTime.now());
        repo.save(person);
    }

    public void delete(Person person) {
        repo.delete(person);
    }
}