SQL Interfaces with Big Data systems

Big Data SQL applications provide SQL query kind of interface to query the data underlying in HDFS and other storage systems in Big Data

As SQL is popular common language used across globally, big data systems can be easily accessed using SQL interface.

Few are the below SQL interfaces described to access big data systems

Hive: Popular SQL engine on Hadoop. Provides SQL interface to access data in HDFS. Hive queries internally trigger MapReduce jobs for processing.

Hive on Apache Tez: Hive with Tez combination gives HiveQL performing on Hive execution engine. SQL query based on analytics which is highly performance oriented with distinct MapReduce jobs executed as DAG workflow thereby increasing parallelism

Hive on Apache Spark: Hive utilizing Spark framework for execution and utilizes in memory clusters for fast performance

Hive with LLAP: Hive with low latency queries running on Hadoop and Yarn. In normal Hive, every time a SQL job is submitted to Hive Server, Yarn application will be started. With LLAP introduced from Hive 2 where LLAP stands for Long Lived Analytical Processing, there will be LLAP daemon, also called HiveServer2-interactive, launches a long running YARN application when it starts. This YARN application will execute all the SQL queries, independently of the user running the query.

Impala: Impala is MPP based SQL query engine that provides low latency and high performance SQL queries on data stored in HDFS. Impala does not use Hadoop, it has daemons running on all nodes which caches the data in HDFS. These daemons return data quickly without having go through the whole MapReduce job. Impala does not provide fault tolerance compared to Hive

Apache Drill: Low latency SQL query engine for interactive SQL analytics. Has unique ability to discover schemas on read with data discovery and exploration capabilities on data residing in multiple formats residing either in flat files, HDFS or any file systems, NoSQL databases. This is open source version for Google BigQuery which is a service in Google Cloud

Apache Presto: Open source distributed SQL query engine for interactive analytics against many datasources like Cassandra, Hive, RDBMS and few proprietary datastores. One query from Presto can aggregate data from multiple data sources. This is mainly used in adhoc SQL analytics for large data sets. Presto supports ANSI SQL queries across a range of data sources, including Hive, Cassandra, relational databases or proprietary file systems (such as Amazon Web Services’ S3.)

Apache Tajo: Apache Tajo is a relational and distributed data warehouse for Hadoop. It is designed for low latency and scalable adhoc query analysis. It has distributed SQL query processing engine with advanced query optimization. It is ANSI SQL compliant allow access to Hive Metastore and supports various file formats. Tajo is designed for low-latency and scalable adhoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources

Flink Table API and SQL: The Table API is a SQL-like expression language for relational stream and batch processing that can be easily embedded in Flink’s DataSet and DataStream APIs (Java and Scala)

Apache Phoenix: Relational interface over HBase, Phoenix takes SQL Query as input and compiles to set of HBase scans, coordinates running of scans and output JDBC results

BlinkDB: Large scale data warehouse system that adds the ability to create and use smaller samples of large datasets to make queries even faster. It runs queries on data samples and present results with valid threshold values

Hadapt: Cloud-Optimized system offering an analytical platform for performing complex analytics on structured and unstructured data. Like Apache Hive and other technologies, Hadapt provides a familiar JDBC/ODBC interface for submitting SQL or MapReduce jobs to the cluster. Hadapt provides a cost-based query optimizer, which can decide between a combination of MapReduce jobs and MPP database jobs to fulfill a query, or the job can be handled by the MPP database for fast interactive response

Spark SQL: Allows querying structured and unstructured data within Spark using SQL. Access variety of data sources and file formats such as Hive, HBase, Cassandra, Avro, Parquet, ORC, JSON and relational datasets. Spark SQL uses Hive Metastore with access to Hive data, queries and UDF’s

Splice Machine: General purpose RDBMS, a unique hybrid database that combines the advantage of SQL, the scale out of NoSQL, and the performance of in-memory technology. It provides ANSI SQL and ACID transactions of an RDBMS on the Hadoop eco system

Apache Calcite: Open source framework for building databases. It includes SQL parser, validator and JDBC driver, Query optimization tools, including relational algebra API, rule based planner and cost based query optimizer. Apache Hive uses Calcite for cost-based query optimization while Apache Drill and Apache Kylin uses SQL parser

Apache Kylin: Apache Kylin is an open-source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets

Google BigQuery: Google BigQuery is a cloud-based big data analytics warehouse for processing very large read-only data sets. BigQuery was designed for analyzing data on the order of billions of rows, using a SQL-like syntax. It runs on the Google Cloud Storage infrastructure and can be accessed with a REST-oriented application program interface (API)

Apache Trafodion: Trafodion is another SQL on HBase similar to Apache Phoenix and also integrated with Hive with supporting transactions

Lingual: Provides full ANSI SQL interface for Hadoop that can be easily integrated with existing BI tools and also supports JDBC

Apache Druid: Druid is an open-source data store designed for sub-second queries on real-time and historical data. It is primarily used for business intelligence (OLAP) queries. Druid SQL is a built-in SQL layer and an alternative to Druid’s native JSON-based query language

Actian Vector: High performance analytic database that makes use of vector processing, multiple data model to perform the same operation on multiple data simultaneously

AtScale: High performance OLAP server platform on Hadoop. This acts as a semantic layer between Big Data systems like Cloudera, Hortonworks or MapR and data visualization tools like Tableau, Qlik, MicroStrategy. Data access over Hadoop is leveraged through SQL or MDX

Citus: A horizontally scalable database built on PostgreSQL to solve real time big data challenges with a horizontally scalable architecture that combines with massive parallel query processing across highly available clusters.

Greenplum: High analytical query performance engine for petabytes of data built on PostgreSQL. It leverages standards-compliant SQL to support BI and reporting tools

HAWQ: Parallel SQL processing engine built on top of the HDFS optimized for analytics with full ACID transaction support. HAWQ breaks complex queries into smaller tasks and distributes them to query processing units for execution

JethroData: Index based SQL engine that enables interactive BI on big data. It indexes every column in HDFS and the queries uses indexes to access the data only they need instead of full scan, resulting in faster response time and low resource utilization

SQLstream: Provides interactive real time processing of data in motion to build new real time processing applications. It has optimized SQL query engine for unstructured machine data streams.

VoltDB: In memory, massively parallel relational database. This adds horizontal partitioning, active-active redundant clustering.

Vertica: Product of HP. Is designed for column oriented storage optimization, MPP, SQL query interface, In-database Machine Learning for large scale data analytics. Vertica is infastructure independent supporting deployments on multiple cloud platforms (AWS, Azure, Google), on-premises and natively on Hadoop nodes.

Netezza: Product of IBM. Designs and markets high-performance data warehouse appliances and advanced analytics applications for uses including enterprise data warehousing, business intelligence, predictive analytics and business continuity planning.

Aster: Analytics platform acquired by Teradata. Provides integration with SQL and MapReduce. The mapper and reducer functions are implemented in SQL

Oracle Big Data SQL: Oracle Big Data SQL supports queries against non-relational data stored in multiple big data sources, including Apache Hive, HDFS, Oracle NoSQL Database, and Apache HBase

Spark SQL with Tachyon: Spark SQL integrated with Tachyon, an in-memory storage system integrated to store intermediate results which makes faster processing. Tachyon caching is powerful than Spark in memory caching. Spark cache is volatile across different jobs. Tachyon will be useful when the data or intermediate results has to be shared between different applications, caching the RDD for interactive exploratory analysis. Now Tachyon is called as Alluxio.

KSQL: KSQL provides a simple and completely interactive SQL interface for processing streaming data in Kafka provided from Confluent

Azure Stream Analytics: Streaming analytics platform supporting SQL like query language for performing transformations over stream of events provided by Azure

PolyBase: Provided by Microsoft where T-SQL statements access data stored in Azure Blob storage or HDFS. PolyBase enables Azure SQL Data Warehouse to import and export data from Azure Data Lake Store, and from Azure Blob Storage.

Big SQL: IBM Big SQL is a massive parallel processing engine that access various data sources like HDFS, RDBMS, NoSQL databases, Object stores using single database connection or single query.

Azure Synapse Analytics(formerly Azure SQL Data Warehouse): Azure Synapse Analytics , a fully managed cloud data warehouse and big data analytics platform for enterprises of any size that combines lightning-fast query performance

Amazon Redshift: Data warehousing service from Amazon Web Services which is fast, simple, cost-effective

Cassandra CQL: SQL kind of query language to access Cassandra database

Apache Kudu:Columnar storage developed for Hadoop system, has advantages like HBase and Parquet. It is as fast as HBase at ingesting data and random access and almost as quick as Parquet when it comes to analytics queries. Kudu has SQL interface from Spark and Impala

Dremio: Data lake engine that provides a self-service semantic layer directly on data lake storage without moving data to proprietary data warehouses, no cubes, no aggregation tables or extracts. Dremio reads data from any source (RDBMS, HDFS, S3, NoSQL) into Arrow(Apache Arrow-in memory storage) buffers, and provides fast SQL access via ODBC, JDBC, and REST for BI, Python, R

OmniSciDB: OmniSciDB is an open source SQL-based, relational, columnar database engine that leverages the full performance and parallelism of modern hardware (both CPUs and GPUs) to enable querying of multi-billion row datasets in milliseconds, without the need for indexing, pre-aggregation, or downsampling

Amazon Athena: Serverless interactive query service for analyzing big data in Amazon S3 using standard SQL. This is built on Presto

Share this:

Related

Leave a comment Cancel reply