
Microsoft Azure Data Engineering Associate (DP-203) Study Guide


Manage – Data Sources and Ingestion

Posted on 2022-03-23 (updated 2024-08-05) by Benjamin Goodwin

The Manage hub includes the enablers to provision, configure, and connect many Azure Synapse Analytics features to other products. From the creation of SQL pools to the configuration of GitHub, read on to learn about each possibility.

Analytics Pools

When you run a SQL script in Azure Synapse Analytics Studio, some form of compute machine is required to perform its execution. Those machines are called pools and come in two forms: SQL and Spark. You should provision a SQL pool if you are working primarily with relational databases, and a Spark pool if you are working primarily with files. It is perfectly reasonable for your data analytics solution to span both of those scenarios, i.e., relational databases and files. If that is the case, you can provision both types of pools and use them as required to transform, analyze, and store data. You will need to find the optimal configuration for your given scenario.

SQL POOLS

By default, when you provision Azure Synapse Analytics, a serverless SQL pool is provided. Creating a dedicated SQL pool is rather simple: Click the + New link (refer to Figure 3.27) and provide the necessary information, as shown in Figure 3.28. Remember that dedicated SQL pools were previously referred to as Azure SQL data warehouses.

FIGURE 3.28 Creating an Azure Synapse Analytics dedicated SQL pool

Table 3.7 summarizes the numerous differences between dedicated and serverless SQL pools.

TABLE 3.7 Dedicated vs. serverless SQL pools

Capability                  Dedicated SQL pool    Serverless SQL pool
Database tables             Yes                   No
External tables (CETAS)     Yes                   Yes
Triggers                    Yes                   No
Materialized views          Yes                   No
DDL statements              Yes                   Views and security only
DML statements              Yes                   No
Azure AD integration        Yes                   Yes
Views                       Yes                   Yes
Stored procedures           Yes                   Yes
Select performance level    Yes                   No
Max query duration          No                    Yes, 30 minutes

The most obvious constraint of a serverless SQL pool is that you cannot create tables to store data in. When using a serverless pool, you should instead consider using external tables. In Chapter 2 you learned how to create an external table. That process requires creating a data source, using CREATE EXTERNAL DATA SOURCE, that identifies the file(s) containing the data you want to query. Then you identify the file format using CREATE EXTERNAL FILE FORMAT. Finally, you use the data source and the file format as parameters in the WITH clause of the CREATE EXTERNAL TABLE statement. In reality, then, you are creating a pointer to the file that is mapped to a queryable table structure. It acts like a table, but it is not the same as a table created on a dedicated SQL pool.

The other capability to call out is the max query duration. Unlike dedicated SQL pools, which allow you to choose the performance level of the node, as shown in Table 3.8, serverless SQL pools manage scaling for you. Cost management for serverless pools is handled by limiting the amount of data queried daily, weekly, and/or monthly. This is in contrast to dedicated pools, where you pay for the selected performance level.
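The three-step pattern described above can be sketched as follows. All names, the storage URL, and the column definitions are hypothetical placeholders, not taken from the text:

```sql
-- 1) Data source: identifies the storage location holding the files.
--    (The URL and all object names below are illustrative only.)
CREATE EXTERNAL DATA SOURCE SampleDataSource
WITH (
    LOCATION = 'https://samplestorage.dfs.core.windows.net/samplecontainer'
);

-- 2) File format: describes how the files are encoded.
CREATE EXTERNAL FILE FORMAT SampleParquetFormat
WITH (
    FORMAT_TYPE = PARQUET
);

-- 3) External table: maps a queryable table structure onto the files,
--    referencing the data source and file format in its WITH clause.
CREATE EXTERNAL TABLE dbo.SampleReadings
(
    ReadingId    INT,
    ReadingTime  DATETIME2,
    ReadingValue FLOAT
)
WITH (
    LOCATION    = '/readings/',   -- path relative to the data source
    DATA_SOURCE = SampleDataSource,
    FILE_FORMAT = SampleParquetFormat
);

-- The external table can then be queried like any other table:
SELECT TOP 10 * FROM dbo.SampleReadings;
```

No data is copied by these statements; the table remains a pointer onto the underlying files, which is why it behaves differently from a dedicated SQL pool table.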

TABLE 3.8 Dedicated SQL pool performance level

Performance level    Maximum nodes    Distributions per node    Memory
DW100c               1                60                        60 GB
DW300c               1                60                        180 GB
DW500c               1                60                        300 GB
DW1000c              2                30                        600 GB
DW1500c              3                20                        900 GB
DW2500c              5                12                        1,500 GB
DW5000c              10               6                         3,000 GB
DW7500c              15               4                         4,500 GB
DW10000c             20               3                         6,000 GB
DW15000c             30               2                         9,000 GB
DW30000c             60               1                         18,000 GB

Notice in Table 3.8 that the ratios between maximum nodes and distributions conform to the law of 60: a dedicated SQL pool always has 60 distributions, spread evenly across its nodes. Using this table together with knowledge of your number of data rows can help you determine the optimal size. Remember that you need roughly 1,000,000 rows of data per distribution for the performance-boosting caching to kick in. If each of your distributions held 10,000,000 rows of data, which performance level would you choose? The maximum would be DW5000c, because you want at least 1,000,000 rows in each distribution on every node. Perhaps DW1500c or DW2500c would be better choices to be certain you breach that threshold. But, as with all such decisions, the choice depends on the scenario and on how compute-intensive your queries are.
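As a rough worked sketch of that sizing arithmetic (the 600,000,000-row total is a hypothetical figure, not from the text), a quick query shows how the fixed 60 distributions and the node count interact:

```sql
-- A dedicated SQL pool always spreads data across 60 distributions, so
-- rows per distribution depend only on data volume, not performance level.
-- Hypothetical example: 600,000,000 total rows -> 10,000,000 per distribution.
SELECT
    level,
    nodes,
    600000000 / 60    AS rows_per_distribution,  -- fixed at 10,000,000 here
    600000000 / nodes AS rows_per_node           -- shrinks as nodes increase
FROM (VALUES
    ('DW1500c',  3),
    ('DW2500c',  5),
    ('DW5000c', 10)
) AS t(level, nodes);
```

Since rows per distribution are fixed by data volume, adding nodes mainly buys more compute and memory per query, which is why the "right" level is ultimately a cost-versus-workload judgment.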
