Azure Databricks (a fully managed Apache Spark offering)
Databricks Introduction:
Azure Databricks = Best of Databricks + Best of Azure
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform (PaaS).
It is a fast, easy-to-use, and collaborative Apache Spark–based analytics platform. Designed in collaboration with the creators of Apache Spark, it combines the best of Databricks and Azure to help you accelerate innovation with one-click setup, streamlined workflows, and an interactive workspace that enables collaboration among data scientists, data engineers, and business analysts. Because it’s an Azure service, you benefit from native integrations with other Azure services such as Power BI, SQL Data Warehouse, and Cosmos DB. You also get enterprise-grade Azure security, including Active Directory integration, compliance, and enterprise-grade SLAs.
With Databricks you can:
– Launch your new Spark environment with a single click.
– Integrate effortlessly with a wide variety of data stores.
– Use Databricks Notebooks to unify your processes and instantly deploy to production.
– Improve and scale your analytics with a high-performance processing engine optimized for the comprehensive, trusted Azure platform.
Learning Resources:
– Webinar recording on Azure Databricks
Azure SQL Data Sync – keep your data in sync between SQL Server and SQL DB in a hybrid Azure environment
SQL Data Sync is a service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across multiple SQL databases and SQL Server instances.
Internal Mechanism & Performance impact:
Data Sync uses insert, update, and delete triggers to track changes. It creates side tables in the user database for change tracking. These change tracking activities have an impact on your database workload. Assess your service tier and upgrade if needed.
Since Data Sync is trigger-based, transactional consistency is not guaranteed. Microsoft guarantees that all changes are made eventually, and that Data Sync does not cause data loss.
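To make the trigger-plus-side-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not Data Sync's actual schema or T-SQL: the `customers` table, the `customers_tracking` side table, and the trigger names are all hypothetical, and SQLite only stands in for SQL Server. It shows why every insert, update, or delete on a tracked table costs an extra write.

```python
import sqlite3

# In-memory database standing in for a user database (assumption: SQLite,
# not SQL Server; the real Data Sync objects look different).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

-- Hypothetical side table: one row per tracked change.
CREATE TABLE customers_tracking (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    row_id    INTEGER NOT NULL,
    operation TEXT    NOT NULL
);

CREATE TRIGGER customers_ins AFTER INSERT ON customers
BEGIN
    INSERT INTO customers_tracking (row_id, operation) VALUES (NEW.id, 'I');
END;

CREATE TRIGGER customers_upd AFTER UPDATE ON customers
BEGIN
    INSERT INTO customers_tracking (row_id, operation) VALUES (NEW.id, 'U');
END;

CREATE TRIGGER customers_del AFTER DELETE ON customers
BEGIN
    INSERT INTO customers_tracking (row_id, operation) VALUES (OLD.id, 'D');
END;
""")

# Three ordinary writes against the user table...
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

# ...produce three extra rows in the side table.
changes = conn.execute(
    "SELECT row_id, operation FROM customers_tracking ORDER BY change_id"
).fetchall()
print(changes)  # [(1, 'I'), (1, 'U'), (1, 'D')]
```

Each user write is doubled by a tracking write, which is the overhead to keep in mind when you assess your service tier.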
Limitations & Requirements:
1. Each table must have a primary key
2. Snapshot isolation must be enabled
3. A table cannot have an identity column that is not the primary key
4. The names of objects (databases, tables, and columns) cannot contain the printable characters period (.), left square bracket ([), or right square bracket (]).
5. Unsupported data types
a. FileStream
b. SQL/CLR UDT
c. XMLSchemaCollection (XML supported)
d. Cursor, Timestamp, Hierarchyid
6. Azure Active Directory authentication is not supported.
Dimension | Limit |
Maximum number of sync groups any database can belong to | 5 |
Maximum number of endpoints in a single sync group | 30 |
Maximum number of on-premises endpoints in a single sync group | 5 |
Database, table, schema, and column names | 50 characters per name |
Tables in a sync group | 500 |
Columns in a table in a sync group | 1000 |
Data row size on a table | 24 MB |
Minimum sync interval | 5 minutes |
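Some of these limits are easy to check before you build a sync group. The following is a rough pre-flight sketch in Python, not part of any Azure SDK: the function `check_sync_schema` and its input shape (a dict of table name to column names) are assumptions made for illustration, and the numbers are copied from the limits listed above.

```python
from typing import Dict, List

# Values taken from the limits listed above.
MAX_NAME_LENGTH = 50
MAX_TABLES_PER_GROUP = 500
MAX_COLUMNS_PER_TABLE = 1000
FORBIDDEN_NAME_CHARS = ".[]"


def check_sync_schema(tables: Dict[str, List[str]]) -> List[str]:
    """Return human-readable problems; an empty list means none were found.

    `tables` maps an unqualified table name to its column names
    (hypothetical input shape, chosen for this sketch).
    """
    problems: List[str] = []
    if len(tables) > MAX_TABLES_PER_GROUP:
        problems.append(f"too many tables: {len(tables)} > {MAX_TABLES_PER_GROUP}")
    for table, columns in tables.items():
        if len(columns) > MAX_COLUMNS_PER_TABLE:
            problems.append(f"{table}: too many columns ({len(columns)})")
        # The same naming rules apply to the table and to each column.
        for name in [table, *columns]:
            if len(name) > MAX_NAME_LENGTH:
                problems.append(f"{name!r}: longer than {MAX_NAME_LENGTH} characters")
            bad = [ch for ch in FORBIDDEN_NAME_CHARS if ch in name]
            if bad:
                problems.append(f"{name!r}: contains forbidden character(s) {bad}")
    return problems


print(check_sync_schema({"Orders": ["OrderId", "Total"]}))
print(check_sync_schema({"Order.Items": ["Id", "Qty[1]"]}))
```

The first call reports nothing, the second flags the period in the table name and the brackets in the column name. The check does not cover primary keys, data types, or row size, which need a look at the real database schema.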
Configurations:
1. Frequency can be set in seconds, minutes, hours, and days (minimum 5 minutes)
2. You can select the tables and columns you want to sync
3. For an on-premises database you must install and configure a local sync agent (the Data Sync Agent)
Setup Azure SQL Data Sync:
Check here for a step-by-step tutorial.
1. Create a Hub database on SQL DB (Hub & Spoke topology)
a. Hub database: must be Azure SQL DB
b. Spoke/Member database: the remaining databases can be either Azure SQL DB or SQL Server instances
2. Create Sync group (On the Hub Database create a “New Sync Group”)
a. Sync Schema: which data is being synchronized
b. Sync metadata database: must be an Azure SQL DB
c. Sync Interval: frequency
d. Conflict Resolution Policy: (Hub wins or Member wins)
3. Add Sync Members (Spokes, can be either SQL DB or SQL instance)
a. Sync Agent Gateway (for on-prem): download and install it on the on-premises server
b. Sync Direction: bi-directional, or one direction
4. Configure Sync group
a. Select the Tables/Columns which you want to sync.
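The setup steps above can be pictured as a small data model. This is only a sketch in Python to show how the pieces relate (hub, members, interval, conflict policy); none of these class or function names come from Azure, and a real sync group is created through the portal or PowerShell, not through this code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

MIN_INTERVAL_MINUTES = 5  # minimum sync interval from the limits above


class ConflictPolicy(Enum):
    HUB_WINS = "hub_wins"
    MEMBER_WINS = "member_wins"


class SyncDirection(Enum):
    BIDIRECTIONAL = "bidirectional"
    TO_HUB = "to_hub"
    FROM_HUB = "from_hub"


@dataclass
class SyncMember:
    """A spoke: either an Azure SQL DB or an on-premises SQL Server database."""
    name: str
    on_premises: bool
    direction: SyncDirection = SyncDirection.BIDIRECTIONAL


@dataclass
class SyncGroup:
    """Hypothetical model of a sync group defined on the hub database."""
    hub: str
    interval_minutes: int
    policy: ConflictPolicy
    members: List[SyncMember] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.interval_minutes < MIN_INTERVAL_MINUTES:
            raise ValueError("sync interval must be at least 5 minutes")

    def needs_agent(self) -> bool:
        # An on-premises member is the case that requires the local agent.
        return any(m.on_premises for m in self.members)


def resolve_conflict(policy: ConflictPolicy, hub_value: str, member_value: str) -> str:
    """Pick the surviving value when hub and member changed the same row."""
    return hub_value if policy is ConflictPolicy.HUB_WINS else member_value


group = SyncGroup(hub="hub-db", interval_minutes=15, policy=ConflictPolicy.HUB_WINS)
group.members.append(SyncMember("onprem-sales", on_premises=True))
print(group.needs_agent())
print(resolve_conflict(group.policy, "hub edit", "member edit"))
```

With "Hub wins" the hub's version of a conflicting row overwrites the member's, and with "Member wins" it is the other way around, so the policy decides whose edit is silently discarded.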
PowerShell scripts to:
– Sync SQL DB & on-prem SQL Server instance
– Sync between multiple SQL Databases
Best Practices: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-best-practices-data-sync
Monitor: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-sync-monitor-oms
Troubleshoot: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-troubleshoot-data-sync
Recommendations:
Data Sync is not appropriate for the following scenarios:
1. Disaster Recovery
2. Read Scale
3. ETL (OLTP to OLAP)
4. Migration from on-premises SQL Server to Azure SQL Database
Further reading: https://blogs.msdn.microsoft.com/igorpag/2017/07/06/azure-sql-data-sync-test-drive-and-first-impressions/