mmeshref

Inside SQL Server Engine


Question


As I said earlier, I'll be posting some articles on how the SQL Server engine works.

I'll do it in English; brother Tareq could translate it to Arabic if you want.

My first article is about SQL Server Architecture.

SQL Server Architecture

Below is a diagram showing all the components of SQL Server. We'll dig into each of these components and get some idea of what it does; after that, we'll dig deeper into how a query gets executed.

[Image: sqla.png — diagram of the SQL Server components]

SQL Server Protocols:

Below are the communication protocols that SQL Server supports:

  1. Shared Memory: good for connecting to an instance on the same machine.
  2. Named Pipes: good for connecting to an instance on the LAN.
  3. TCP/IP: the most widely used protocol over the internet. It can be used to communicate across interconnected networks of different hardware and operating systems.
  4. Virtual Interface Adapter (VIA): a specialized protocol that works only with VIA hardware.

Query Processor:

Below are the components of the Query Processor, which is responsible for taking the user's query in its textual form, parsing it, executing it, and producing the results:

  1. Command Parser: generates a query tree out of the supplied T-SQL code to be consumed by the Query Optimizer
  2. Query Optimizer: transforms the query tree into what's known as a Query Plan, an optimized plan for how execution will happen, taking into consideration many variables like system I/O, CPU usage, etc.
  3. SQL Manager: responsible for anything related to stored procedures and managing their query plans, in addition to the SQL Server 2005 autoparameterization feature, in which some ad hoc queries are treated as if they were parameterized stored procedures and query plans are generated and cached for them
  4. Database Manager: accesses and retrieves all the metadata needed for query compilation and optimization (i.e. datatypes, indexes, etc.)
  5. Query Executor: the final phase of the relational engine, in which the query executor takes the query plan (a.k.a. execution plan) and communicates with the storage engine (if needed) to execute the query plan's iterators
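To get a feel for the query plans the Query Processor compiles and caches, here is a small hedged peek (assuming SQL Server 2005 or later, where these dynamic management views exist) that lists cached plans together with the statement text they were compiled for:

SELECT cp.usecounts, cp.objtype, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC;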

Storage Engine:

The storage engine is the part of SQL Server that is responsible for anything related to storage: it maintains the indexes and the data, and performs all operations on the data. It works with the Query Processor to locate and retrieve data, so it's the next step after the query is ready to be executed.

It mainly does these operations:

  1. Access Methods
  2. Row & Index Operations
  3. Page Allocation Operations
  4. Versioning Operations
  5. Transaction Services
  6. Locking Operations

In the next article (which could be today or tomorrow), we'll be talking about the Storage Engine operations listed above, so stay tuned.


17 replies to this question.


Storage Engine Operations

Continuing from where we stopped in the last lesson, we'll be talking about the Storage Engine's operations in this one.

Storage Engine: Access Methods

When SQL Server needs to locate data, it uses the access methods. The access methods either receive OLE DB row sets to be inserted (i.e. in the case of an INSERT call) or produce OLE DB row sets to be retrieved (i.e. in the case of a SELECT call).

The access methods code contains methods to open a table, retrieve qualified data, and update data, but it doesn’t actually retrieve the pages, it just makes the request to the buffer manager which does this work for it.

The access methods code is used not only for queries but also for qualified updates and deletes (i.e. an UPDATE with a WHERE clause) and for any data modifications that affect index entries.

Storage Engine: Row and Index Operations:

Row and index operations are components of the access methods code which carry out the actual access to rows and indexes. Each component is responsible for manipulating and maintaining its respective on-disk data structures: rows of data or B-tree indexes.

They understand and manipulate information on data and index pages. They consist of two types of operations, row operations and index operations:

Row Operations:

The row operations code, as its name suggests, retrieves, modifies, and performs operations on individual rows. It performs an operation within a single row, such as "retrieve column 2" or "write this value to column 3".

Row operations will work hand in hand with the lock and transaction management components, and as a result the row is found and appropriately locked as a part of a transaction.

After the formatting or modification of the row is completed, the row operation code inserts or deletes the row.

Index Operations:

The index operations code maintains and supports searching on B-trees, which are used for SQL Server indexes. A B-tree groups records that have similar index keys, allowing fast access to data by searching on a key value.

Branches of the B-tree are joined together or split apart as necessary so that the search for a given record always traverses the same number of levels and therefore requires the same number of page accesses.

Storage Engine: Page Allocation Operations:

The allocation operations code manages a collection of pages and keeps track of which pages in a database have already been used, for what purpose they have been used, and how much space is available on each page.

In SQL Server, each database is a collection of 8 KB disk pages that are spread across one or more physical files.

SQL Server uses 8 types of disk pages:

  1. Data Pages: to store user data
  2. LOB Pages: to store user data for large objects
  3. Index Pages: to store index rows
  4. Page Free Space (PFS) Pages: to keep track of which pages in the database are available to hold new data
  5. Global Allocation Map (GAM) & Shared Global Allocation Map (SGAM) Pages: Keep track of which extents* are allocated
  6. Index Allocation Map (IAM) Pages: to store information about extents used by a table or index
  7. Bulk Changed Map (BCM) Pages: Information about extents modified by bulk operations since the last BACKUP LOG statement
  8. Differential Changed Map (DCM) Pages: information about extents that have changed since the last BACKUP DATABASE statement

*Extents are the basic unit in which space is managed. An extent is 8 physically contiguous pages, i.e. 64 KB.
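To get a feel for these numbers, here is a hedged sketch (assuming SQL Server 2005+ and a hypothetical table dbo.Orders) that reports how many 8 KB pages each index of the table uses; dividing by 8 approximates the number of extents:

SELECT index_id, index_type_desc, alloc_unit_type_desc,
       page_count,                        -- number of 8 KB pages
       page_count / 8.0 AS approx_extents -- an extent = 8 contiguous pages = 64 KB
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'DETAILED');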

Storage Engine: Versioning Operations:

A new type of access method was introduced in SQL Server 2005: access through the version store.

Row versioning allows SQL Server to maintain older versions of changed rows, and the versioning technology supports Snapshot Isolation* as well as other features of SQL Server 2005, including online index builds and triggers. The versioning operations code is the code that maintains row versions for whatever purpose they are needed.

*Snapshot Isolation: prior to SQL Server 2005, concurrency was based solely on locking, while in 2005 snapshot isolation using versioning was introduced to enhance OLTP performance by decreasing locking and deadlocking problems in applications.
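As a small sketch of how this is turned on (the database name SampleDB and the table dbo.Orders are placeholders), snapshot isolation must first be allowed at the database level and can then be requested per session:

ALTER DATABASE SampleDB SET ALLOW_SNAPSHOT_ISOLATION ON;

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
    -- reads see committed data as of the start of the snapshot transaction,
    -- served from the version store in tempdb instead of blocking on locks
    SELECT * FROM dbo.Orders WHERE Cust_Id = 2;
COMMIT;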

Storage Engine: Transaction Services.

A core feature of SQL Server is its ability to ensure that transactions are atomic (that is, all or nothing). In addition, transactions must be durable, which means that if a transaction has been committed, it must be recoverable by SQL Server no matter what, even if a total system failure occurs 1 millisecond after the commit was acknowledged.

There are 4 properties that transactions must adhere to, called the ACID properties (Atomicity, Consistency, Isolation, and Durability).

If a system failure happens while a transaction is in progress, SQL Server rolls the database back to the state that existed before the transaction. This is done via what's known as Write-Ahead Logging, in which SQL Server synchronously writes to the log before writing any data to the data pages (the data page writes can be asynchronous, since everything can be reconstructed from the log files).
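A tiny example of the atomicity this buys us (the table and column names below are made up for illustration): both statements are logged before the corresponding data pages are written, so a crash at any point before the COMMIT leaves neither change in the database after recovery.

BEGIN TRAN;
    UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountId = 1;
    UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE AccountId = 2;
COMMIT;   -- only after the commit record is hardened in the log is the transaction durable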

The transaction management component defines the boundaries of statements that must be grouped together to form an operation, it also handles transactions that cross databases within the same instance.

For distributed transactions to another SQL Server instance, the transaction management component coordinates with the Microsoft Distributed Transaction Coordinator (MS DTC) service using operating system remote procedure calls (RPC). It also marks the save points you designate within a transaction, at which work can be partially rolled back or undone.

SQL Server 2005 supports two concurrency models to guarantee the ACID properties:

  1. Pessimistic Concurrency: this is the model that existed in every earlier version of SQL Server. It guarantees correctness and consistency by locking data so that it cannot be changed.
  2. Optimistic Concurrency: this is a new model introduced in SQL Server 2005; it provides consistent data by keeping older versions of rows with committed values in an area of tempdb called the version store. With optimistic concurrency, readers don't block writers and writers don't block readers, but writers still block writers. The cost of these non-blocking reads and writes must be considered, as SQL Server needs to spend more time managing the version store to support this model.

Storage Engine: Locking Operations.

Locking is a very important part of a multi-user database system such as SQL Server. Even if you're working entirely at the snapshot isolation level (in which readers don't block writers and vice versa), writers still acquire locks and can still block other writers, and if two writers try to change the same data concurrently, a conflict occurs and must be resolved.

The locking code manages compatibility between the lock types, resolves deadlocks, and escalates locks if needed. The locking code controls table, page and row locks as well as system data locks

SQL Server has the following locking types:

  1. Shared Locks: for multiple readers
  2. Update Locks: used when SQL Server intends to modify a page, and later promotes the update page lock to an exclusive lock before actually making the changes
  3. Exclusive Locks: for writers
  4. Intent Lock: a lock taken "by intention" on a higher-granularity resource (such as a table or page) to signal that a transaction holds or intends to acquire locks on lower-level resources within it. There are 6 different types of intent locks
  5. Extent Lock: used for space allocation (extents allocation)
  6. Schema Lock: used for schema modifications, there are 2 different types of schema locks.
  7. Bulk Update Lock: used when performing bulk-copy into a table with TABLOCK* hint
  8. Key-Range Lock: used to place a lock over a range of rows

*TABLOCK Hint: specifies that a lock will be placed on the entire table. This is useful when inserting a large amount of data, to avoid the row-level locking that would be slow for such a large number of rows.
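For example, a bulk-style load might look like the hedged sketch below (dbo.Target and dbo.Staging are hypothetical tables); the TABLOCK hint takes a table-level lock instead of many row locks:

INSERT INTO dbo.Target WITH (TABLOCK)
SELECT * FROM dbo.Staging;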

In the next lesson, we'll be covering a little bit of SQL Server Operating System


SQL Server Operating System:

SQL Server has an internal operating system which is responsible for tasks like managing threads, fibers, processes, and memory inside SQL Server.

Two major pieces that belong to the SQL Server OS are its ability to support the NUMA architecture even if no NUMA hardware exists, and the scheduler.

NUMA (Non-Uniform Memory Access) Architecture

NUMA is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors. You can find more information about it here; below is a diagram of typical NUMA nodes:

[Image: numa.gif — typical NUMA node layout]

SQL Server 2005 is Non-Uniform Memory Access (NUMA) aware, and both scheduling and memory access can take advantage of NUMA hardware by default.

SQL Server 2005 allows you to subdivide one or more physical NUMA nodes into smaller NUMA nodes, referred to as software NUMA or soft-NUMA. You typically use soft-NUMA when you have many CPUs and do not have hardware NUMA because soft-NUMA allows only for the subdividing of CPUs but not memory. You can also use soft-NUMA to subdivide hardware NUMA nodes into groups of fewer CPUs than is provided by the hardware NUMA.

SQL Server OS Scheduler:

Before SQL Server 7.0, SQL Server depended entirely on the OS for scheduling. This was not ideal, because the OS treated SQL Server worker threads the same way it treated any other threads, which is a poor fit for a high-performance system like SQL Server.

Starting with SQL Server 7.0, SQL Server has had its own scheduling mechanism, enhancing the overall performance of the system.

You can think of a SQL Server scheduler as a logical CPU used by SQL Server workers. A worker can be either a thread or a fiber that is bound to a logical scheduler.

Workers are created when the scheduler receives a request (a task to execute) and there are no idle workers.

SQL Server handles the worker pool very efficiently, freeing and destroying workers when they are idle or more resources are needed.

In SQL Server 2005, each actual CPU (whether hyperthreaded or physical) has a scheduler created for it when SQL Server starts up.

Each scheduler is set to either ONLINE or OFFLINE based on the affinity mask settings. Changing the affinity mask value can change a scheduler from ONLINE to OFFLINE and vice versa without restarting SQL Server, but any active workers finish their work first before the scheduler is turned OFFLINE.
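You can watch the schedulers and their workers with the sketch below (SQL Server 2005+; the filter on scheduler_id is a common way to exclude internal/hidden schedulers):

SELECT scheduler_id, cpu_id, status,
       current_workers_count, active_workers_count,
       current_tasks_count, runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;   -- user-visible schedulers only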


Tasks:

The unit of work for a SQL Server worker is a request, or task, which you can think of as equivalent to a single batch sent from the client to the server. Once a request is received by SQL Server, it is bound to a worker, and that worker processes the entire request before handling any other request (note that a worker is not the same as a scheduler). This is true even if the request is blocked for some reason (an I/O request, a lock, etc.). So each worker handles only one task at a time, and then moves on to another task. In short: a SQL Server instance has schedulers equal in number to the active CPUs; each scheduler is internally responsible for multiple workers (consider it an internal process); and each worker handles one task at a time.

Threads vs. Fibers:

A fiber is a unit of execution that runs under a thread, and hence a thread can contain multiple fibers. Fibers have less overhead than threads, but they don't provide many advantages over simply running entirely on threads.

A worker can run as either a thread or a fiber; you can configure SQL Server to run in fiber mode by setting the lightweight pooling option to 1.
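A minimal sketch of switching the option (it's an advanced sp_configure setting, and to my knowledge the change only takes effect after the SQL Server service is restarted):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'lightweight pooling', 1;   -- 1 = fiber mode, 0 = thread mode (default)
RECONFIGURE;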

Some components of SQL Server don't work very well when SQL Server is running in fiber mode, like CLR queries and SQLMail, because they need certain thread-specific facilities provided by the OS. Although SQL Server can switch to thread mode on an as-needed basis, the overhead of such switches might outweigh the benefit of running SQL Server in fiber mode.

Relation between NUMA & Schedulers:

With a NUMA configuration, every node has some subset of the machine's processors and the same number of schedulers (equal to the number of processors). When a machine is configured for hardware NUMA, each node has the same number of processors, but for soft-NUMA, which you configure yourself, you can assign different numbers of processors to each node. There will still be the same number of schedulers as processors.

Next article will cover SQL Server Memory Management.


SQL Server Memory Management

In this section, we'll take a look at how SQL Server manages memory. It's done mainly through these parts:

  1. Buffer Pool & Data Cache
  2. Access to In-Memory Data Pages
  3. Managing Pages in Data Cache
  4. Checkpoints

SQL Server Memory: Buffer Pool & Data Cache

The main memory component in SQL Server is the buffer pool. All memory not used by another memory component remains in the buffer pool to be used as a data cache for pages read in from the database files on disk.

The buffer manager manages disk I/O functions for bringing data and index pages into the data cache so data can be shared among users.

When other components require memory, they can request a buffer from the buffer pool.

A buffer is a page in memory that’s the same size as the data or index page. Most of the buffers taken from the buffer pool for other memory components go to other kinds of memory caches, the largest of which is typically the cache for procedure and query plans, which is usually called the procedure cache.

Occasionally SQL Server must request contiguous memory in blocks larger than 8 KB, but this is kept to a minimum, so direct calls to the OS are rare.
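If you want to see what the data cache is holding, the sketch below (SQL Server 2005+) counts the 8 KB buffers per database currently in the buffer pool:

SELECT DB_NAME(database_id) AS database_name,
       COUNT(*)            AS cached_pages,
       COUNT(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_pages DESC;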

Access to In-Memory Data Pages

Access to pages in the data cache must be fast, because even with real memory, it would be amazingly slow to scan the whole data cache for a page when you have gigabytes of data. Pages in the data cache are therefore hashed for fast access.

Given a dbid-fileno-pageno identifier (a combination of database ID, file number, and page number), the hash function converts that key to the hash bucket that should be checked, making it easy to identify the buffer holding the specific page needed.

Managing Pages in the Data Cache

You can use a data page or an index page only if it exists in memory. Therefore, a buffer in the data cache must be available for the page to be read into. Keeping a supply of buffers available for immediate use is an important performance optimization. If a buffer isn’t readily available, many memory pages might have to be searched simply to locate a buffer to free up for use as a workspace.

SQL Server uses the LRU-K algorithm to maintain information useful for writing changed pages to disk and for marking as free those pages that have not been referenced for some time.

The LRU-K algorithm is a great improvement over a strict LRU (Least Recently Used) algorithm. It's also an improvement over the LFU (Least Frequently Used) algorithm involving reference counters, because it requires far fewer adjustments by the engine and much less bookkeeping overhead. An LRU-K algorithm keeps track of the last K times a page was referenced and can differentiate between types of pages, such as index and data pages, with different levels of frequency.

SQL Server uses a K value of 2, so it keeps track of the two most recent accesses of each buffer page.

The data cache is periodically scanned from the start to the end. Because the buffer cache is all in memory, these scans are quick. During the scan, a value is associated with each buffer based on its usage history. When the value gets low enough, the dirty page indicator is checked. If the page is dirty, a write is scheduled to write the modifications to disk. After the modified page is written to disk, or if the page was not dirty to start with, the page is freed, and the buffer associated with this page is placed back into the free list of pages.

Each instance of SQL Server has a lazywriter thread for each NUMA node that scans through the buffer cache associated with that node. The lazywriter thread sleeps for a specific interval of time, and when it wakes up, it examines the size of the free buffer list. If the list is below a certain threshold, which depends on the total size of the buffer pool, the lazywriter thread scans the buffer pool to repopulate the free list.

SQL Server must be constantly aware of the amount of free memory, so it queries the system periodically to determine the amount of free physical memory available.

If it detects that too much paging is happening on the OS, it releases memory to the OS.

SQL Server Memory Checkpoints

The checkpoint process also scans the buffer cache periodically and writes any dirty data pages for a particular database to disk. The difference between the checkpoint process and the lazywriter (or the worker threads' management of pages) is that the checkpoint process never puts buffers on the free list. The purpose of the checkpoint process is only to ensure that pages written before a certain time are written to disk, so that the number of dirty pages in memory is always kept to a minimum, which in turn ensures that the length of time SQL Server requires to recover a database after a failure is kept to a minimum.

When a checkpoint occurs, SQL Server writes a checkpoint record to the transaction log, which lists all the transactions that are active. This allows the recovery process to build a table containing a list of all the potentially dirty pages. Checkpoints occur automatically at regular intervals, but can also be requested manually.

Checkpoints are triggered when:

  1. A database owner issues a CHECKPOINT command.
  2. The log is getting full (more than 70%) and the database is in SIMPLE recovery mode.
  3. A long recovery time is estimated, i.e. recovery is predicted to take longer than the recovery interval configuration option.
  4. An orderly shutdown of SQL Server is requested.
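For illustration, the first trigger above corresponds to simply running the CHECKPOINT statement, and the third is governed by the recovery interval setting (an advanced sp_configure option):

CHECKPOINT;   -- manually flush the current database's dirty pages to disk

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'recovery interval', 5;   -- target upper bound on recovery time, in minutes
RECONFIGURE;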

Next time we'll start to talk about the most important part of SQL Server, which is "Query Processing" and how a query gets executed inside SQL Server. This covers topics like indexing, iterators, physical joins and how they are done internally, aggregation at the physical level, etc.


Query Processing

This lesson is the first lesson on query processing and how a query gets processed. To understand the query processing components, we first need to understand iterators.

Iterators

What are iterators?

SQL Server breaks queries down into a set of fundamental building blocks that we call operators or iterators. Each iterator implements a single basic operation such as scanning data from a table, updating data in a table, filtering or aggregating data, or joining two data sets. In all, there are a few dozen such primitive iterators. Iterators may have one, two, or more children and can be combined into trees which we call query plans.

By building appropriate query plans, we can execute any SQL statement. In practice, there are frequently many valid query plans for a given statement. The query optimizer’s job is to find the best (e.g., the cheapest) query plan for a given statement.

How do iterators work?

An iterator reads input rows either from a data source such as a table or from its children (if it has any) and produces output rows which it returns to its parent. The output rows that an iterator produces depend on the operation that the iterator performs.

All iterators implement the same set of core methods. For example, the Open method tells an iterator to prepare to produce output rows while the GetRow method requests that an iterator produce a new output row. Because all iterators implement the same methods, iterators are independent of one another. That is, an iterator does not need specialized knowledge of its children (if any) or parent. Consequently, iterators can be easily combined in many different ways and into many different query plans.

When SQL Server executes a query plan, control flows down the tree. That is, we call the methods Open and GetRow on the iterator at the root of the query plan and these methods propagate down through the tree to the leaf iterators. Data flows or more accurately is pulled up the tree when one iterator calls another iterator’s GetRow method.

A Simple Iterator Example:

Consider the following query:

select count(*) from t

The simplest way to execute this query is to scan each row in table [t] and count the rows. SQL Server uses two iterators to achieve this result: one to scan the rows in [t] and another to count them:

<-- count(*) <-- Scan [t]

To execute this query plan, we call Open on the count(*) iterator. The count(*) iterator performs the following tasks in Open:

  1. it calls Open on the scan iterator which readies the scan to produce rows;
  2. it calls GetRow repeatedly on the scan, counting the rows returned, and stopping only when GetRow returns that it has reached the end of the scan; and
  3. it calls Close on the scan iterator to indicate that we are done.

Thus, by the time the count(*) iterator returns from Open, it has already calculated the number of rows in [t]. To complete execution we call GetRow on the count(*) and it returns this result. (Technically, we call GetRow on the count(*) one more time since we do not know that count(*) produces only a single row until we try. The second GetRow call returns that we have reached the end of the result set.)

Note that the count(*) iterator does not care or need to know that it is counting rows from a scan; it will count rows from any sub-tree we put below it regardless of how simple or complex the sub-tree may be.

[Image: iterators1.png — iterator tree for the count(*) example]
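If you want to see this iterator tree yourself, a quick (hedged) way, assuming a small table t exists, is to ask SQL Server for the textual plan; it typically shows a Stream Aggregate (the count) sitting on top of a scan of t:

SET SHOWPLAN_TEXT ON;
GO
select count(*) from t;
GO
SET SHOWPLAN_TEXT OFF;
GO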

A More Complex Iterator Example:

Consider the following query:

SELECT a FROM t1 UNION ALL SELECT a FROM t2

This query combines the output from two tables. To achieve this result, we need two table scan operators (one for t1 and another for t2). (Remember that all query plans are built out of the same set of basic operators.) To combine the results from two different sub-trees (in this case two different table scans), we use the “concatenation” operator. The concatenation operator can have more than one child. It reads and returns all rows from its first child and then proceeds to do the same with its next child.

[Image: iterators2.png — iterator tree for the UNION ALL example: a concatenation over two table scans]

Note that when an operator has more than one child, the order of the children matters. The topmost child is the first child while the bottommost child is the second. The concatenation operator processes the children in this order.

Iterators' Memory:

All iterators require some small fixed amount of memory to store state, perform calculations, and so forth. We do not track this fixed memory or try to reserve this memory before executing a query. When we cache an executable plan, we cache this fixed memory so that we do not need to allocate it again and to speed up subsequent executions of the cached plan.

However, some iterators which we refer to as memory consuming iterators require additional memory to execute. This additional memory is used to store row data. The amount of memory required by a memory consuming operator is generally proportional to the number of rows processed. To ensure that the server does not run out of memory and that queries containing memory consuming iterators do not fail, we estimate how much memory we need and we reserve a memory grant before we execute such a query.

Memory consuming iterators can affect performance in a few ways:

  1. Queries with memory consuming iterators may have to wait to acquire the necessary memory grant and begin execution if the server is executing other such queries and does not have enough available memory. This can directly affect performance by delaying execution.
  2. If there are too many queries competing for limited memory resources, the server may suffer from reduced concurrency and/or throughput. This is generally not a major issue for data warehouses but is undesirable in OLTP systems.
  3. If a memory consuming iterator requests too little memory, it may need to spill data to disk during execution. Spilling can have a significant adverse impact on the query and system performance due to the extra I/O overhead. Moreover, if an iterator spills too much data, it can run out of space in tempdb and fail.

The main memory consuming iterators are sort, hash aggregate, and hash join which we'll be covering in later lessons.
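While such queries are running, you can see their reservations with the hedged sketch below (SQL Server 2005+); each row is a query that has asked for a memory grant:

SELECT session_id, query_cost,
       requested_memory_kb, granted_memory_kb, used_memory_kb
FROM sys.dm_exec_query_memory_grants;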

Blocking vs. Non-blocking Iterators

Most iterators can be classified into two categories:

  1. Iterators that consume input rows and produce output rows at the same time (in the GetRow method). We often refer to these iterators as “non-blocking.”
  2. Iterators that consume all input rows (generally in the Open method) before producing any output rows. We refer to these iterators as “blocking” or “stop-and-go.”

The compute scalar iterator is a simple example of a non-blocking iterator. It reads an input row, computes a new output value using the input values from the current row, immediately outputs the new value, and continues to the next input row.

The sort iterator is a good example of a blocking iterator. The sort cannot determine the first output row until it has read and sorted all input rows. (The last input row could be the first output row; there is no way to know until we have seen it.)

Blocking iterators often but not always consume memory. For example, sort consumes memory. The count(*) example (and other scalar aggregates such as sum, min, max, etc.) do not consume memory yet are blocking. It is not possible to know the number of rows until we’ve read and counted them all.

If an iterator has two children, the iterator may be blocking with respect to one and non-blocking with respect to the other. Hash join is a good example of such an iterator.

Non-blocking iterators are generally optimal for OLTP queries where response time is important. They are especially desirable for TOP N queries or queries with a FAST N hint. Since the goal is to return rows as quickly as possible, we’d like to avoid blocking operators which might process more data than necessary before returning the first rows. Non-blocking iterators can also be useful when evaluating an EXISTS subquery where we again would like to avoid processing more data than necessary to conclude that at least one output row exists.

Hint: If a query requires ordered output, creating the right index can enable the optimizer to find a query plan that uses the index to produce the desired order instead of introducing a blocking sort. This may dramatically improve response time. Of course, there are other reasons to use indexes too.
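A small sketch of that hint (the table, column, and index names below are hypothetical): with an index on the ordering column, the optimizer can stream the top rows in order rather than sorting the whole table first.

CREATE INDEX IX_Orders_OrderDate ON dbo.Orders (OrderDate);

SELECT TOP 10 OrderKey, OrderDate
FROM dbo.Orders
ORDER BY OrderDate;   -- can be answered by an ordered scan of IX_Orders_OrderDate, with no sort iterator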

Iterators' Dynamic Cursor Support

The iterators used in a dynamic cursor query plan have special properties. Among other things, a dynamic cursor must be able to return a subset of the result set at a time, must be able to scan forward or backward, and must be able to acquire scroll locks (make changes). To support this functionality, an iterator must be able to save and restore its state, scan forward or backward, must process one input row for each output row it produces, and must be non-blocking. Not all iterators have all of these properties.

For a query to be executed using a dynamic cursor, the optimizer must be able to find a query plan that uses only iterators that support dynamic cursors. This is not always possible and this is why some queries cannot be executed using a dynamic cursor.

Hint: Just as creating the right index can enable the optimizer to eliminate a sort, for the same reason it can sometimes (though not always) enable the optimizer to find a dynamic cursor query plan. Unfortunately, this is not always possible. For example, there is no way to perform aggregation without violating the one-input-row-for-each-output-row property (though it is sometimes possible to work around this problem using an indexed view).

You can read more about SQL Server Iterators in either Craig Freedman's technical blog or msdn page on iterators

Next we'll be talking about SQL Server Indexing, we'll be mainly talking about these topics:

  1. Table Scan
  2. Table Seek
  3. Bookmark Lookup
  4. Seek Predicates
  5. Indexing Examples


SQL Server Indexing Part 1

In this lesson, we'll be talking about how indexing works.

Table Scan vs. Table Seek

Scans vs. Seeks

Scans and seeks are the iterators that SQL Server uses to read data from tables and indexes.  These iterators are among the most fundamental ones that we support.  They appear in nearly every query plan.

What is the difference between a scan and a seek?

A scan returns the entire table or index.  A seek efficiently returns rows from one or more ranges of an index based on a predicate.  For example, consider the following query:

select OrderDate from Orders where OrderKey = 2

Table Scan

With a scan, we read each row in the orders table, evaluate the predicate “where OrderKey = 2” and, if the predicate is true (i.e., if the row qualifies), return the row.  In this case, we refer to the predicate as a “residual” predicate.  To maximize performance, whenever possible we evaluate the residual predicate in the scan.  However, if the predicate is too expensive, we may evaluate it in a separate filter iterator. 

Since a scan touches every row in the table whether or not it qualifies, the cost is proportional to the total number of rows in the table.  Thus, a scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate.  However, if the table is large and if most of the rows do not qualify, we touch many more pages and rows and perform many more I/Os than is necessary.

The following figure illustrates the scan:

[Image: scanb.png — a table scan touching every row of the table]

Table Seek

Going back to the example, if we have an index on OrderKey, a seek may be a better plan.  With a seek, we use the index to navigate directly to those rows that satisfy the predicate.  In this case, we refer to the predicate as a “seek” predicate.  In most cases, we do not need to re-evaluate the seek predicate as a residual predicate; the index ensures that the seek only returns rows that qualify.

Since a seek only touches rows that qualify and pages that contain these qualifying rows, the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.  Thus, a seek is generally a more efficient strategy if we have a highly selective seek predicate; that is, if we have a seek predicate that eliminates a large fraction of the table.

The following figure illustrates the seek:

[Image: seekf.png — an index seek navigating directly to the qualifying rows]

Bookmark Lookup

SQL Server can use an index to efficiently locate rows that qualify for a predicate.  When deciding whether to use an index, SQL Server considers several factors.  These factors include checking whether the index covers all of the columns that the query references (for the table in question).

What does it mean for an index to cover a column?

The heap or clustered index for a table (often called the “base table”) contains (or covers) all columns in the table.  Non-clustered indexes, on the other hand, contain (or cover) only a subset of the columns in the table.  By limiting the set of columns stored in a non-clustered index, we can store more rows on each page.  Thus, we save disk space and improve the efficiency of seeks and scans by reducing the number of I/Os and the number of pages that we touch.

Each non-clustered index covers the key columns that were specified when it was created.  Also, if the base table is a clustered index, each non-clustered index on this table covers the clustered index keys (often called the “clustering keys”) regardless of whether they are part of the non-clustered index’s key columns.  In SQL Server 2005, we can also add additional non-key columns to a non-clustered index using the “include” clause of the create index statement.

For example, given this schema:

create table T_heap (a int, b int, c int, d int, e int)

create index T_heap_a on T_heap (a)

create index T_heap_bc on T_heap (b, c)

create index T_heap_d on T_heap (d) include (e)

create table T_clu (a int, b int, c int, d int, e int)

create unique clustered index T_clu_a on T_clu (a)

create index T_clu_b on T_clu (b)

create index T_clu_ac on T_clu (a, c)

create index T_clu_d on T_clu (d) include (e)

The covered columns for each index are:

  T_heap_a: a
  T_heap_bc: b, c
  T_heap_d: d, e
  T_clu_a: a, b, c, d, e (the clustered index covers all columns)
  T_clu_b: b, a
  T_clu_ac: a, c
  T_clu_d: d, e, a

How does this relate to index seeks and bookmark lookups?

Consider this query (using the above schema):

select e from T_clu where b = 2

At first glance, this query looks like a perfect candidate for an index seek using the non-clustered index on column b (T_clu_b) with the seek predicate “b = 2”.  However, this index does not cover column e so a seek or scan of this index cannot return the value of column e.  The solution is simple.  For each row that we fetch from the non-clustered index, we can lookup the value of column e in the clustered index.  We call this operation a “bookmark lookup.”  A “bookmark” is a pointer to the row in the heap or clustered index.  We store the bookmark for each row in the non-clustered index precisely so that we can always navigate from the non-clustered index to the corresponding row in the base table.

The following figure illustrates a bookmark lookup:

[Image: bokmark.png — a bookmark lookup from the non-clustered index into the clustered index]

Tradeoffs:

Bookmark lookup is not a cheap operation.  Each bookmark lookup performs a random I/O into the clustered index.  Random I/Os are very expensive.  When comparing various plan alternatives including scans, seeks, and seeks with bookmark lookups, the optimizer must decide whether it is cheaper to perform more sequential I/Os and touch more rows using an index scan or a seek with a less selective predicate that covers all required columns or to perform fewer random I/Os and touch fewer rows using a seek with a more selective predicate and a bookmark lookup.

In some cases, you can introduce a better plan option by creating a new index or by adding one or more columns to an existing index so as to eliminate a bookmark lookup or change a scan into a seek.  In SQL Server 2000, the only way to add columns to an index is to add additional key columns.  As I mentioned above, in SQL Server 2005, you can also add columns using the include clause of the create index statement.  Included columns are more efficient than key columns; they save disk space and make searching and updating the index more efficient.
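For instance, following the schema above, one (hypothetical) way to remove the bookmark lookup in the earlier "select e from T_clu where b = 2" example is to add e as an included column, so the non-clustered index alone covers the query:

create index T_clu_b_e on T_clu (b) include (e)

-- "select e from T_clu where b = 2" can now be satisfied by a seek of T_clu_b_e alone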

Of course, whenever you create new indexes or add new key or included columns to an existing index, you do consume additional disk space and you make it more expensive to search and update the index.  Thus, you must balance the frequency and importance of the queries that benefit from the new index against the queries or updates that are slower.

Next time we will talk about seek predicates, i.e. how SQL Server determines whether an index can be used to evaluate your query predicate, so that it performs an index seek rather than a table scan. We'll also look at some indexing examples.


SQL Server Indexing Part 2

In this lesson we will finish up indexing by talking about seek predicates and giving some examples to show how indexing works.

Seek Predicates

Before SQL Server can perform an index seek, it must determine whether the keys of the index are suitable for evaluating a predicate in the query.

Seek Predicates : Single-column indexes

Single column indexes are fairly straightforward. SQL Server can use single column indexes to answer most simple comparisons including equality and inequality (greater than, less than, etc.) comparisons. More complex expressions such as functions over a column and “like” predicates with a leading wildcard character will generally prevent SQL Server from using an index seek.

For example, suppose we have a single column index on a column “a.” We can use this index to seek on these predicates:

  1. a = 3.14
  2. a > 100
  3. a between 0 and 99
  4. a like ‘abc%’
  5. a in (2, 3, 5, 7)

However, we cannot use it to seek on these predicates:

  1. ABS(a) = 1
  2. a + 1 = 9
  3. a like ‘%abc’

Seek Predicates : Multi-column indexes

Multi-column indexes are slightly more complex. With a multi-column index, the order of the keys matters. It determines the sort order of the index and it affects the set of seek predicates that SQL Server can evaluate using the index.

For an easy way to visualize why order matters, think about the phone book. The phone book is like an index with the keys (last name, first name). The contents of the phone book are sorted by last name and it is easy to look someone up if we know their last name. However, if we have only a first name, it is very difficult to get a list of people with that name. We would need another phone book sorted on first name.

In the same way, if we have an index on two columns, we can only use the index to satisfy a predicate on the second column if we have an equality predicate on the first column. Even if we cannot use the index to satisfy the predicate on the second column, we may be able to use it on the first column. In this case, we introduce a residual predicate for the predicate on the second column. This is the same residual predicate that we use for scans.

For example, suppose we have a two column index on column “a” and “b.” We can use this index to seek on any of the predicates that worked on the single column index. We can also use it to seek on these additional predicates:

  1. a = 3.14 and b = ‘pi’
  2. a = ‘xyzzy’ and b <= 0

For the next set of examples, we can use the index to satisfy the predicate on column a, but not on column b. In these cases, we need a residual predicate.

  1. a > 100 and b > 100
  2. a like ‘abc%’ and b = 2

Finally, we cannot use the index to seek on the next set of predicates as we cannot seek even on column a. In these cases, we must use a different index (e.g., one where column b is the leading column) or we must use a scan with a residual predicate.

  1. b = 0
  2. a + 1 = 9 and b between 1 and 9
  3. a like ‘%abc’ and b in (1, 3, 5)

Indexing at a Glance

The optimizer must choose an appropriate “access path” to read data from each table referenced in a query. The optimizer considers many factors when deciding which index to use, whether to do a scan or a seek, and whether to do a bookmark lookup. These factors include:

  1. How many I/Os will a seek or scan of the index perform?
  2. Are the keys of the index suitable for evaluating a predicate in the query?
  3. How selective is the predicate? (That is, what percentage of the total rows in the table qualifies for this predicate? The lower this number the better.)
  4. Does the index cover all of the necessary columns?

Indexing Examples.

Schema

This is the schema that is going to be used in the examples:

create table T (a int, b int, c int, d int, x char(200))

create unique clustered index Ta on T(a)

create index Tb on T(b)

create index Tcd on T(c, d)

create index Tdc on T(d, c)

If you want to try the examples, the table was populated using the following script:

set nocount on

declare @i int

set @i = 0

while @i < 100000

begin

insert T values (@i, @i, @i, @i, @i)

set @i = @i + 1

end

Covered Columns for the Example:

  Ta: a, b, c, d, x (the clustered index covers all columns)
  Tb: b, a
  Tcd: c, d, a
  Tdc: d, c, a

Indexing Examples – I/O

Consider this query:

select a, b from T

This query has no WHERE clause so we must use a scan. However, there are two indexes we can scan. There is the clustered index Ta and there is the non-clustered index Tb. Both of these indexes cover columns a and b. However, the clustered index also covers columns c, d, and x. Because column x is a char(200), the total width of each row in the clustered index is over 200 bytes, fewer than 40 rows fit on each 8K page, and the index requires more than 2,500 pages to store all 100,000 rows. In contrast, the total width of each row in the non-clustered index is only 8 bytes plus some overhead, hundreds of rows fit on each page, and the index requires fewer than 250 pages to store all 100,000 rows. By scanning the non-clustered index, we can execute the query while performing many fewer I/Os.

As a result, the query optimizer will scan using the non-clustered index.

Indexing Examples – Selective Example

Consider this query:

select a from T

where c > 150 and c < 160 and d > 100 and d < 200

This query has two different predicates that we might use for an index seek. We can use the predicate on column c with the non-clustered index Tcd or we can use the predicate on column d with the non-clustered index Tdc.

The optimizer looks at the selectivity of the two predicates to determine which index to use. The predicate on column c selects only 9 rows while the predicate on column d selects 99 rows. Thus, it is cheaper to seek using the index Tcd and evaluate a residual predicate on column d for 9 rows than it is to seek using the index Tdc and evaluate a residual predicate on column c for 99 rows.

Indexing Examples - Seek vs. Scan

Consider these two queries:

select a from T where a between 1001 and 9000

select a from T where a between 101 and 90000

As you might expect, for the first query, the optimizer chooses a clustered index seek to satisfy the predicate on column a.

For the second query, instead of the clustered index seek, the optimizer chooses an index scan of the non-clustered index Tb and uses a residual predicate for the WHERE clause.

What happened? The predicate in the first query selects 8,000 out of 100,000 rows; this is about 8% of the table, or about 230 pages of the clustered index. The predicate in the second query selects 89,900 rows; this is nearly 90% of the table, and if we were to use the clustered index it would mean touching over 2,500 pages. By comparison, we can scan the entire non-clustered index Tb and touch only 174 pages. Thus, the optimizer chooses the plan that requires significantly fewer I/Os.

Indexing Examples - Seek with bookmark lookup vs. scan example

Consider these two queries:

select x from T where b between 101 and 200

select x from T where b between 1001 and 2000

We again have two plans from which to choose. We can scan the clustered index directly and apply the predicate on column b as a residual. Or, we can use the non-clustered index Tb and perform a seek using the predicate on column b then do a bookmark lookup on the clustered index to get the value of column x for each qualifying row. As we explained before, bookmark lookups perform random I/Os which are very expensive. Thus, the plan with the bookmark lookup is only a good plan when the seek predicate is very selective.

The first query touches only 100 rows and the optimizer concludes that the bookmark lookup is worthwhile.

The second query touches 1,000 rows. Although this is still only 1% of the table, the optimizer concludes that 1,000 random I/Os are more expensive than 2,800 sequential I/Os and opts for the clustered index scan.

Next time we'll be talking about SQL Server Joins

This will be both about Logical Joins (joins that we use inside out T-SQL) and Physical Joins (which is the actual physical operations done inside SQL Server to perform these joins).

The Logical Joins we'll talk about are:

  1. Inner Join
  2. Outer Join
  3. Cross Join
  4. Cross Apply
  5. Semi-Join
  6. Anti-Semi-Join

As for the Physical Joins, SQL Server supports 3 Physical Joins operators:

  1. Nested Loops Join
  2. Merge Join
  3. Hash Join


SQL Server Joins

Joins are one of the most important operations performed by a relational database system.  An RDBMS uses joins to match rows from one table with rows from another table.  For example, we can use joins to match sales with customers or books with authors.  Without joins, we might have a list of sales and customers or books and authors, but we would have no way to determine which customers bought which items or which authors wrote which books.

We can join two tables explicitly by writing a query that lists both tables in the FROM clause.  We can also join two tables by using a variety of different sub-queries.  Finally, SQL Server may introduce joins for a variety of purposes into a query plan during optimization.

We’ll start by introducing the logical join operators that SQL Server supports.  These are:

  1. Inner Join
  2. Outer Join
  3. Cross Join
  4. Cross Apply
  5. Semi-Join
  6. Anti-semi-join

Here is a simple schema and data set that we will be using to illustrate each join type:

create table Customers (Cust_Id int, Cust_Name varchar(10))

insert Customers values (1, 'Craig')

insert Customers values (2, 'John Doe')

insert Customers values (3, 'Jane Doe')

create table Sales (Cust_Id int, Item varchar(10))

insert Sales values (2, 'Camera')

insert Sales values (3, 'Computer')

insert Sales values (3, 'Monitor')

insert Sales values (4, 'Printer')

Inner Join:

Inner joins are the most common join type.  An inner join simply looks for two rows that put together satisfy a join predicate.  For example, this query uses the join predicate “S.Cust_Id = C.Cust_Id” to find all Sales and Customer rows with the same Cust_Id:

select * from Sales S inner join Customers C

on S.Cust_Id = C.Cust_Id

Cust_Id Item Cust_Id Cust_Name

----------- ---------- ----------- ----------

2 Camera 2 John Doe

3 Computer 3 Jane Doe

3 Monitor 3 Jane Doe

Inner joins are fully commutative.  “A inner join B” and “B inner join A” are equivalent.

Left/Right Outer Joins

Suppose that we would like to see a list of all sales; even those that do not have a matching customer.  We can write this query using an outer join.  An outer join preserves all rows in one or both of the input tables even if we cannot find a matching row per the join predicate.  For example:

select * from Sales S left outer join Customers C on S.Cust_Id = C.Cust_Id

Cust_Id Item Cust_Id Cust_Name

----------- ---------- ----------- ----------

2 Camera 2 John Doe

3 Computer 3 Jane Doe

3 Monitor 3 Jane Doe

4 Printer NULL NULL

Left and right outer joins are not commutative: a left outer join preserves all the rows of the left table, while a right outer join preserves all the rows of the right table.

Full Outer Joins

Using a full outer join, we can find all customers regardless of whether they purchased anything and all sales regardless of whether they have a valid customer:

select * from Sales S full outer join Customers C on S.Cust_Id = C.Cust_Id

Cust_Id Item Cust_Id Cust_Name

----------- ---------- ----------- ----------

2 Camera 2 John Doe

3 Computer 3 Jane Doe

3 Monitor 3 Jane Doe

4 Printer NULL NULL

NULL NULL 1 Craig

Outer Joins Summary

The following table shows which rows will be preserved or NULL extended for each outer join variation:

  left outer join: preserves all rows from the left (first) table
  right outer join: preserves all rows from the right (second) table
  full outer join: preserves all rows from both tables

Cross Join

A cross join performs a full Cartesian product of two tables.  That is, it matches every row of one table with every row of another table.  You cannot specify a join predicate for a cross join using the ON clause though you can use a WHERE clause to achieve essentially the same result as an inner join.

Cross joins are fairly uncommon.  Two large tables should never be cross joined as this will result in a very expensive operation and a very large result set.

select * from Sales S cross join Customers C

Cust_Id Item Cust_Id Cust_Name

----------- ---------- ----------- ----------

2 Camera 1 Craig

3 Computer 1 Craig

3 Monitor 1 Craig

4 Printer 1 Craig

2 Camera 2 John Doe

3 Computer 2 John Doe

3 Monitor 2 John Doe

4 Printer 2 John Doe

2 Camera 3 Jane Doe

3 Computer 3 Jane Doe

3 Monitor 3 Jane Doe

4 Printer 3 Jane Doe

Cross Apply Joins

We introduced cross apply in SQL Server 2005 to enable joins with a table valued function (TVF) where the TVF has a parameter that changes for each execution.  For example, the following query returns the same result as the above inner join using a TVF and cross apply:

create function dbo.fn_Sales(@Cust_Id int)

returns @Sales table (Item varchar(10))

as

begin

insert @Sales select Item from Sales where Cust_Id = @Cust_Id

return

end

select * from Customers cross apply dbo.fn_Sales(Cust_Id)

Cust_Id Cust_Name Item

----------- ---------- ----------

2 John Doe Camera

3 Jane Doe Computer

3 Jane Doe Monitor

Outer Apply Joins

We can also use outer apply to find all Customers regardless of whether they purchased anything.  This is similar to an outer join.

select * from Customers outer apply dbo.fn_Sales(Cust_Id)

Cust_Id Cust_Name Item

----------- ---------- ----------

1 Craig NULL

2 John Doe Camera

3 Jane Doe Computer

3 Jane Doe Monitor

Semi Join and Anti-Semi Join

A semi-join returns rows from one table that would join with another table without performing a complete join.

An anti-semi-join returns rows from one table that would not join with another table; these are the rows that would be NULL extended if we performed an outer join.

Unlike the other join operators, there is no explicit syntax to write “semi-join,” but SQL Server uses semi-joins in a variety of circumstances.  For example, we may use a semi-join to evaluate an EXISTS sub-query:

select * from Customers C where exists ( select * from Sales S where S.Cust_Id = C.Cust_Id )

Cust_Id Cust_Name

----------- ----------

2 John Doe

3 Jane Doe

Physical Joins Operations

Below we will talk about how SQL Server performs the join operations internally, SQL Server supports three physical join operators:

  1. Nested Loop Join
  2. Merge Join
  3. Hash Join

Nested Loop Join

In its simplest form, a nested loops join compares each row from one table (known as the outer table) to each row from the other table (known as the inner table) looking for rows that satisfy the join predicate.  (Note that the terms “inner” and “outer” are overloaded; their meaning must be inferred from context.  “Inner table” and “outer table” refer to the inputs to the join.  “Inner join” and “outer join” refer to the logical operations.)

for each row R1 in the outer table
    for each row R2 in the inner table
        if R1 joins with R2
            return (R1, R2)

It’s the nesting of the for loops in this algorithm that gives nested loops join its name.

The total number of rows compared and, thus, the cost of this algorithm is proportional to the size of the outer table multiplied by the size of the inner table.  Since this cost grows quickly as the size of the input tables grow, in practice we try to minimize the cost by reducing the number of inner rows that we must consider for each outer row.

For example, using the same schema from before:

create table Customers (Cust_Id int, Cust_Name varchar(10))

insert Customers values (1, 'Craig')

insert Customers values (2, 'John Doe')

insert Customers values (3, 'Jane Doe')

create table Sales (Cust_Id int, Item varchar(10))

insert Sales values (2, 'Camera')

insert Sales values (3, 'Computer')

insert Sales values (3, 'Monitor')

insert Sales values (4, 'Printer')

Consider this query:

select * from Sales S inner join Customers C on S.Cust_Id = C.Cust_Id

option(loop join)

The outer table in this plan is Customers while the inner table is Sales.  Thus, we begin by scanning the Customers table.  We take one customer at a time and, for each customer, we scan the Sales table.  Since there are 3 customers, we execute the scan of the Sales table 3 times.  Each scan of the Sales table returns 4 rows.  We compare each sale to the current customer and evaluate whether the two rows have the same Cust_Id.  If they do, we return the pair of rows.  We have 3 customers and 4 sales so we perform this comparison a total of 3*4 or 12 times.  Only 3 of these comparisons result in a match.

Now consider what happens if we create an index on Sales:

create clustered index CI on Sales(Cust_Id)

This time, instead of doing a table scan of Sales, we do an index seek.  We still do the index seek 3 times – once for each customer, but each index seek returns only those rows that match the current Cust_Id and qualify for the join predicate.  Thus, the seek returns a total of only 3 rows as compared to the 12 rows returned by the table scan.

Notice that the index seek depends on C.Cust_Id which comes from the outer table of the join – the table scan of Customers.  Each time we execute the index seek (recall that we execute it 3 times – once for each customer), C.Cust_Id has a different value.  We refer to C.Cust_Id as a “correlated parameter.”  If a nested loops join has correlated parameters, we output them in the showplan as “OUTER REFERENCES.”  We often refer to this type of nested loops join, where we have an index seek that depends on a correlated parameter, as an “index join.”  This is a common scenario.

The nested loops join supports the following logical join operators:

  1. Inner Join
  2. Left Outer Join
  3. Cross Join
  4. Cross Apply & Outer Apply
  5. Left Semi-Join & Left Anti-Semi-Join

The nested loops join does not support the following logical join operators:

  1. Right & Full Outer Join
  2. Right Semi-Join & Right Anti-Semi-Join

Why does the nested loops join only support left joins?

We can easily extend the nested loops join algorithm to support left outer and semi-joins.  For instance, here is pseudo-code for left outer join.  We can write similar pseudo-code for left semi-join and left anti-semi-join.

for each row R1 in the outer table

begin

for each row R2 in the inner table

if R1 joins with R2

return (R1, R2)

if R1 did not join

return (R1, NULL)

end

This algorithm keeps track of whether we joined a particular outer row.  If, after exhausting all inner rows, we find that a particular outer row did not join, we output it as a NULL extended row.

Now consider how we might support right outer join.  In this case, we want to return pairs (R1, R2) for rows that join and pairs (NULL, R2) for rows of the inner table that do not join.  The problem is that we scan the inner table multiple times – once for each row of the outer table.  We may encounter the same inner rows multiple times during these multiple scans.  At what point can we conclude that a particular inner row has not or will not join?  Moreover, if we are using an index join, we might not encounter some inner rows at all.  Yet these rows should also be returned for an outer join.

Fortunately, since right outer join commutes into left outer join and right semi-join commutes into left semi-join, we can use the nested loops join for right outer and semi-joins.  However, while these transformations are valid, they may affect performance.   For example, the join “Customer left outer join Sales” using the above schema with the clustered index on Sales, could use an index seek on Sales just as in the inner join example.  On the other hand, the join “Customer right outer join Sales” can be transformed into “Sales left outer join Customer,” but now we need an index on Customer.
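For example, here is a rough sketch of that rewrite using the Customers and Sales tables from above; the optimizer performs this transformation internally, and the two queries are logically equivalent:

select * from Customers C right outer join Sales S on C.Cust_Id = S.Cust_Id

-- is the same as

select * from Sales S left outer join Customers C on S.Cust_Id = C.Cust_Id

After the commute, Customers becomes the inner table of the nested loops join, so it is an index on Customers(Cust_Id) rather than on Sales that would help.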

The same concept applies to full outer join.

Is NL join good or bad?

Neither actually.  There is no “best” join algorithm and no join algorithm is inherently good or bad.  Each join algorithm performs well in the right circumstances and poorly in the wrong circumstances.  Because the complexity of a nested loops join is proportional to the size of the outer input multiplied by the size of the inner input, a nested loops join generally performs best for relatively small input sets.  The inner input need not be small, but, if it is large, it helps to include an index on a highly selective join key.

In some cases, a nested loops join is the only join algorithm that SQL Server can use.  SQL Server must use a nested loops join for cross join as well as some complex cross applies and outer applies.  Moreover, with one exception (for full outer join), a nested loops join is the only join algorithm that SQL Server can use without at least one equijoin predicate.
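For instance (just an illustrative query against the tables above), a join whose only predicate is an inequality has no equijoin key at all, so SQL Server has to fall back to a nested loops join for it:

select * from Customers C inner join Sales S on C.Cust_Id < S.Cust_Id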

Next we'll be talking about Hash Join and Merge Join.


Thank you, Mohamed... keep the articles coming!



I am glad you liked it. If you know of someone who can translate it to Arabic so everyone here could benefit, that would be great :)


SQL Server Joins Part 2

This lesson will be about Hash Join and Merge Join.

Merge Join

Unlike the nested loops join which supports any join predicate, the merge join requires at least one equijoin predicate. Moreover, the inputs to the merge join must be sorted on the join keys. For example, if we have a join predicate “T1.a = T2.b,” table T1 must be sorted on T1.a and table T2 must be sorted on T2.b.

The merge join works by simultaneously reading and comparing the two sorted inputs one row at a time. At each step, we compare the next row from each input. If the rows are equal, we output a joined row and continue. If the rows are not equal, we discard the lesser of the two inputs and continue. Since the inputs are sorted, we know that we are discarding a row that is less than any of the remaining rows in either input and, thus, can never join.

We can express the algorithm in pseudo-code as:

get first row R1 from input 1

get first row R2 from input 2

while not at the end of either input

begin

if R1 joins with R2

begin

return (R1, R2)

get next row R2 from input 2

end

else if R1 < R2

get next row R1 from input 1

else

get next row R2 from input 2

end

Unlike the nested loops join where the total cost may be proportional to the product of the number of rows in the input tables, with a merge join each table is read at most once and the total cost is proportional to the sum of the number of rows in the inputs. Thus, merge join is often a better choice for larger inputs.

The above pseudo-code implements a one-to-many merge join. After we join two rows, we discard R2 and move to the next row of input 2. This presumes that we will never find another row from input 1 that will ever join with the discarded row. In other words, there may not be duplicates in input 1. On the other hand, it is acceptable that we might have duplicates in input 2 since we did not discard the current row from input 1.

Merge join can also support many-to-many merge joins. In this case, we must keep a copy of each row from input 2 whenever we join two rows. This way, if we later find that we have a duplicate row from input 1, we can play back the saved rows. On the other hand, if we find that the next row from input 1 is not a duplicate, we can discard the saved rows. We save these rows in a worktable in tempdb. The amount of disk space we need depends on the number of duplicates in input 2.
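One hedged way to observe that worktable is SET STATISTICS IO ON. Assuming the T1 and T2 tables defined later in this lesson (before any unique index exists, so the join is many-to-many), the IO statistics should list a Worktable entry in addition to T1 and T2:

set statistics io on

select * from T1 join T2 on T1.a = T2.a option (merge join)

set statistics io off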

A one-to-many merge join is always more efficient than a many-to-many merge join since it does not need a worktable.

To use a one-to-many merge join, the optimizer must be able to determine that one of the inputs consists strictly of unique rows. Typically, this means that there is either a unique index on the input or there is an explicit operator in the plan (perhaps a sort distinct or a group by) to ensure that the input rows are unique.

Sort merge join vs. index merge join

There are two ways that we can get sorted inputs for a merge join: we may explicitly sort the inputs using a sort operator or we may read the rows from an index. In general, a plan using an index to achieve sort order is cheaper than a plan using an explicit sort.

Join predicates

Merge join supports multiple equijoin predicates so long as the inputs are sorted on all of the join keys. The specific sort order does not matter so long as both inputs are sorted in the same order. For example, if we have a join predicate “T1.a = T2.a and T1.b = T2.b,” we can use a merge join so long as T1 and T2 are both sorted either on “(a, b)” or on “(b, a).”

Merge join also supports residual predicates. For example, consider the join predicate “T1.a = T2.a and T1.b > T2.b.” Although we cannot use the inequality predicate as part of a merge join, we can still use the equijoin portion of this predicate to perform a merge join (presuming the inputs are sorted on column “a”). For each pair of rows that join on column “a,” we can then apply the inequality predicate. If the inequality evaluates to true, we return the joined row; if not, we discard it.
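As a small illustration (using the T1 and T2 tables defined below), a query of this shape can still use a merge join; the equality on column a drives the merge while the inequality on column b is applied afterwards, and you would expect it to show up in the RESIDUAL clause of the Merge Join operator:

select * from T1 join T2 on T1.a = T2.a and T1.b > T2.b option (merge join)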

Outer and semi-joins

Merge join supports all outer and semi-joins variations. To implement an outer join, we simply need to track whether each row has joined. Instead of discarding a row that has not joined, we can NULL extend it and output it as appropriate. We can implement a semi-join or an anti-semi-join in a similar way.

Merge join supports a special case for full outer join. In some cases, we generate a merge join for a full outer join even if we have no equijoin predicate.

The following examples use this simple schema:

create table T1 (a int, b int, x char(200))

create table T2 (a int, b int, x char(200))

set nocount on

declare @i int

set @i = 0

while @i < 1000

begin

insert T1 values (@i * 2, @i * 5, @i)

insert T2 values (@i * 3, @i * 7, @i)

set @i = @i + 1

end

Let’s start with a simple example:

select * from T1 join T2 on T1.a = T2.a option (merge join)

Since we’ve asked for a merge join with the option hint and since there are no indexes on these two tables, the optimizer must add explicit sorts to the plan:

Rows Executes

334 1 |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([T1].[a])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[a]))

1000 1 |--Sort(ORDER BY:([T1].[a] ASC))

1000 1 | |--Table Scan(OBJECT:([T1]))

668 1 |--Sort(ORDER BY:([T2].[a] ASC))

1000 1 | |--Table Scan(OBJECT:([T2]))

Although the rows in both input tables are indeed unique, the optimizer does not know that and cannot enforce it so we generate a many-to-many join.

Note that each table scan executed only once regardless of the number of rows processed. Also, note that the sort on T2 only returned 668 rows out of 1000 rows. What happened? After processing 668 rows from T2, the merge join encountered a row from T2 that is greater than any row in T1. At this point, the merge join read all the remaining rows from T1. Once it reached the end of T1, the merge join exited even though it did not read all of the rows in T2.

Now observe what happens if we create an index on T1:

create unique clustered index T1ab on T1(a, b)

And run the same query again:

Rows Executes

334 1 |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([T1].[a])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[a]))

1000 1 |--Clustered Index Scan(OBJECT:([T1].[T1ab]), ORDERED FORWARD)

668 1 |--Sort(ORDER BY:([T2].[a] ASC))

1000 1 |--Table Scan(OBJECT:([T2]))

This is the same plan except that we now need a sort only on T2 since we can use the index on T1 to achieve the sort order. Note that even though we now have a unique index on T1, we still have a many-to-many join. Why? The index guarantees uniqueness on “(a, b)” not on column “a” alone. We could have duplicate values of column “a” and, thus, so long as we join only on a, we need the many-to-many join.

Now let’s try adding an index on T2:

create unique clustered index T2a on T2(a)

With both indexes, we no longer need the hint to get a merge join:

select * from T1 join T2 on T1.a = T2.a

Rows Executes

334 1 |--Merge Join(Inner Join, MERGE:([T2].[a])=([T1].[a]), RESIDUAL:([T2].[a]=[T1].[a]))

668 1 |--Clustered Index Scan(OBJECT:([T2].[T2a]), ORDERED FORWARD)

1000 1 |--Clustered Index Scan(OBJECT:([T1].[T1ab]), ORDERED FORWARD)

Observe that, with both indexes, we no longer need any sorts. Moreover, the unique index on T2 does guarantee uniqueness on column “a” so we can now do a one-to-many join. Notice that the MANY-TO-MANY keyword is gone in this example. (There is no ONE-TO-MANY keyword in text showplan.) Notice also that the optimizer switched the order of the inputs to put the unique input T2 first so we can use a one-to-many join.

Hash Join

When it comes to physical join operators, hash join does the heavy lifting. While nested loops join works well with relatively small data sets and merge join helps with moderately sized data sets, hash join excels at performing the largest joins. Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.

Hash join shares many characteristics with merge join. Like merge join, it requires at least one equijoin predicate, supports residual predicates, and supports all outer and semi-joins. Unlike merge join, it does not require ordered input sets and, while it does support full outer join, it does require an equijoin predicate.

The hash join executes in two phases: build and probe. During the build phase, it reads all rows from the first input (often called the left or build input), hashes the rows on the equijoin keys, and creates an in-memory hash table. During the probe phase, it reads all rows from the second input (often called the right or probe input), hashes these rows on the same equijoin keys, and looks or probes for matching rows in the hash table. Since hash functions can lead to collisions (two different key values that hash to the same value), we typically must check each potential match to ensure that it really joins.

for each row R1 in the build table

begin

calculate hash value on R1 join key(s)

insert R1 into the appropriate hash bucket

end

for each row R2 in the probe table

begin

calculate hash value on R2 join key(s)

for each row R1 in the corresponding hash bucket

if R1 joins with R2

return (R1, R2)

end

Note that unlike the nested loops and merge joins which immediately begin flowing output rows, the hash join is blocking on its build input. That is, it must completely read and process its entire build input before it can return any rows. Moreover, unlike the other join methods, the hash join requires a memory grant to store the hash table. Thus, there is a limit to the number of concurrent hash joins that SQL Server can run at any given time. While these characteristics are generally not a problem for data warehouses, they are undesirable for most OLTP applications.

Hash Join – Memory & Spilling

Before a hash join begins execution, SQL Server tries to estimate how much memory it will need to build its hash table. We use the cardinality estimate for the size of the build input along with the expected average row size to estimate the memory requirement. To minimize the memory required by the hash join, we try to choose the smaller of the two tables as the build table. We then try to reserve this much memory to ensure that the hash join can successfully execute.

What happens if we grant the hash join less memory than it requests or if the estimate is too low? In these cases, the hash join may run out of memory during the build phase. If the hash join runs out of memory, it begins spilling a small percentage of the total hash table to disk (to a workfile in tempdb). The hash join keeps track of which “partitions” of the hash table are still in memory and which ones have been spilled to disk. As we read each new row from the build table, we check to see whether it hashes to an in-memory or an on-disk partition. If it hashes to an in-memory partition, we proceed normally. If it hashes to an on-disk partition, we write the row to disk. This process of running out of memory and spilling partitions to disk may repeat multiple times until the build phase is complete.
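If you want to watch the memory grant of a large hash join while it runs, a sketch (assuming SQL Server 2005 or later, where the sys.dm_exec_query_memory_grants DMV is available) is to query it from a second session:

select session_id, requested_memory_kb, granted_memory_kb, used_memory_kb

from sys.dm_exec_query_memory_grants

A row whose used_memory_kb is pushing up against its granted_memory_kb is a candidate for spilling to tempdb.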

Hash Join - Left deep vs. right deep vs. bushy hash join trees

These terms refer to the shape of the query plan as illustrated by this figure:

hashjointree.png


The shape of the join tree is particularly interesting for hash joins as it affects the memory consumption.

In a left deep tree, the output of one hash join is the build input to the next hash join. Because hash joins consume their entire build input before moving to the probe phase, in a left deep tree only adjacent pairs of hash joins are active at the same time. For example, in the above picture, we begin by building the hash table for HJ1. When HJ1 begins probing, we use the output of HJ1 to build the hash table for HJ2. When HJ1 is done probing, we can release the memory used by its hash table. Only then do we begin probing HJ2 and building the hash table for HJ3. Thus, HJ1 and HJ3 are never active at the same time and can share the same memory grant. The total amount of memory we need is max(HJ1 + HJ2, HJ2 + HJ3).

In a right deep tree, the output of one hash join is the probe input to the next hash join. All of the hash joins must build their complete hash tables before we can begin probing. All of the hash joins are active at once and cannot share memory. When we do begin probing, rows flow up the entire tree of hash joins without blocking. Thus, the total amount of memory we need is HJ1 + HJ2 + HJ3.
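To make the difference concrete with made-up numbers: if the hash tables for HJ1, HJ2, and HJ3 need 100, 50, and 200 MB respectively, the left deep tree peaks at max(100 + 50, 50 + 200) = 250 MB, while the right deep tree needs the full 100 + 50 + 200 = 350 MB.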

Hash Join - Examples

create table T1 (a int, b int, x char(200))

create table T2 (a int, b int, x char(200))

create table T3 (a int, b int, x char(200))

set nocount on

declare @i int

set @i = 0

while @i < 1000

begin

insert T1 values (@i * 2, @i * 5, @i)

set @i = @i + 1

end

set @i = 0

while @i < 10000

begin

insert T2 values (@i * 3, @i * 7, @i)

set @i = @i + 1

end

set @i = 0

while @i < 100000

begin

insert T3 values (@i * 5, @i * 11, @i)

set @i = @i + 1

end

select * from T1 join T2 on T1.a = T2.a

Rows Executes

334 1 |--Hash Match(Inner Join, HASH:([T1].[a])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[a]))

1000 1 |--Table Scan(OBJECT:([T1]))

10000 1 |--Table Scan(OBJECT:([T2]))

Notice that T2 has ten times as many rows as T1, and indeed the optimizer chooses to use T1 as the build table and T2 as the probe table.

select * from (T1 join T2 on T1.a = T2.a) join T3 on T1.b = T3.a

Rows Executes

334 1 |--Hash Match(Inner Join, HASH:([T1].[b])=([T3].[a]), RESIDUAL:([T1].[b]=[T3].[a]))

334 1 |--Hash Match(Inner Join, HASH:([T1].[a])=([T2].[a]), RESIDUAL:([T1].[a]=[T2].[a]))

1000 1 | |--Table Scan(OBJECT:([T1]))

10000 1 | |--Table Scan(OBJECT:([T2]))

100000 1 |--Table Scan(OBJECT:([T3]))

Note that the optimizer has selected a left deep plan. First, we join the two small tables, T1 and T2. The results of this join yield only 334 rows which we use to build a hash table before joining with the large table T3.

Now observe what happens if we add a predicate to restrict the size of the smaller two tables. (A single where clause suffices; the optimizer can derive “T2.a < 100” from “T1.a < 100” and “T1.a = T2.a”.)

select * from (T1 join T2 on T1.a = T2.a) join T3 on T1.b = T3.a where T1.a < 100

Rows Executes

17 1 |--Hash Match(Inner Join, HASH:([T2].[a])=([T1].[a]), RESIDUAL:([T1].[a]=[T2].[a]))

34 1 |--Table Scan(OBJECT:([T2]), WHERE:([T2].[a]<(100)))

50 1 |--Hash Match(Inner Join, HASH:([T1].[b])=([T3].[a]), RESIDUAL:([T1].[b]=[T3].[a]))

50 1 |--Table Scan(OBJECT:([T1]), WHERE:([T1].[a]<(100)))

100000 1 |--Table Scan(OBJECT:([T3]))

This time the optimizer selected a right deep plan. T1 and T2 are now so small (50 and 34 rows, respectively) that it is better to build hash tables on these two tables and probe using the large table T3 than it is to build a hash table on an intermediate hash join result.

Joins Summary

The following table summarizes the characteristics of the three physical join operators:

joins.png

Next time we'll be talking about how SQL Server does the aggregation internally on the physical level.


SQL Server Aggregation Part 1

Aggregation refers to the collapse of a larger set of rows into a smaller set of rows. Typical aggregate functions are COUNT, MIN, MAX, SUM, and AVG.

Scalar Aggregation

Scalar aggregates are queries with aggregate functions in the select list and no GROUP BY clause. Scalar aggregates always return a single row.

There is one operator for scalar aggregation: stream aggregate. For example:

create table t (a int, b int, c int)

select count(*) from t

|--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1005],0)))

|--Stream Aggregate(DEFINE:([Expr1005]=Count(*)))

|--Table Scan(OBJECT:([t]))

This is the aggregation equivalent of “Hello World!” The stream aggregate just counts the number of input rows and returns this result. The stream aggregate actually computes the count ([Expr1005]) as a bigint. The compute scalar is needed to convert this result to the expected output type of int. Note that a scalar stream aggregate is one of the very few operators that can produce an output row even with an empty input set.
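For instance (a tiny sanity check on the empty table t we just created; the Expr numbers in your plan may differ):

select count(*) from t -- returns a single row containing 0

select min(a) from t -- returns a single row containing NULL

Both queries return exactly one row even though t contains no rows yet.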

Examples:

It is easy to see how to implement other simple scalar aggregate functions such as MIN, MAX, and SUM. We can also calculate multiple scalar aggregates at the same time using a single stream aggregate operator:

select min(a), max(b) from t

|--Stream Aggregate(DEFINE:([Expr1004]=MIN([t].[a]), [Expr1005]=MAX([t].[b])))

|--Table Scan(OBJECT:([t]))

This plan just reads each row of table t and keeps track of the minimum value of column a and the maximum value of column b. Note that we do not need to convert the result for the MIN and MAX aggregates since the types of these aggregates are computed based on the types of columns a and b.

Some aggregates such as AVG are actually calculated from two other aggregates such as SUM and COUNT:

select avg(a) from t

|--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1005]=(0) THEN NULL ELSE [Expr1006]/CONVERT_IMPLICIT(int,[Expr1005],0) END))

|--Stream Aggregate(DEFINE:([Expr1005]=COUNT_BIG([t].[a]), [Expr1006]=SUM([t].[a])))

|--Table Scan(OBJECT:([t]))

This time the compute scalar also calculates the average from the sum and count. The CASE expression is needed to make sure that we do not divide by zero.

Now let’s take a look at what happens if we add a DISTINCT keyword to the aggregate:

select count(distinct a) from t

|--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1007],0)))

|--Stream Aggregate(DEFINE:([Expr1007]=COUNT([t].[a])))

|--Sort(DISTINCT ORDER BY:([t].[a] ASC))

|--Table Scan(OBJECT:([t]))

This query must only count rows that have a unique value for column a. We use the sort operator to eliminate rows with duplicate values of column a. It is easy to remove duplicate rows once we sort the input set since the duplicates will be adjacent to one another.

Not all distinct aggregates require duplicate elimination. For example, MIN and MAX behave identically with and without the distinct keyword. The minimum and maximum values of a set remain the same whether or not the set includes duplicate values.

select min(distinct a), max(distinct b) from t

|--Stream Aggregate(DEFINE:([Expr1004]=MIN([t].[a]), [Expr1005]=MAX([t].[b])))

|--Table Scan(OBJECT:([t]))

If we have a unique index, we also can skip the duplicate elimination since the index guarantees that there are no duplicates:

create unique index ta on t(a)

select count(distinct a) from t

|--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1007],0)))

|--Stream Aggregate(DEFINE:([Expr1007]=COUNT([t].[a])))

|--Index Scan(OBJECT:([t].[ta]))

Consider this query:

select count(distinct a), count(distinct b) from t

As we saw before, we can compute “count(distinct a)” by eliminating rows that have duplicate values for column a. Similarly, we can compute “count(distinct b)” by eliminating rows that have duplicate values for column b. But, given that these two sets of rows are different, how can we compute both at the same time? The answer is we cannot. We must first compute one aggregate result, then the other, and then we must combine the two results into a single output row:

|--Nested Loops(Inner Join)

|--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1010],0)))

| |--Stream Aggregate(DEFINE:([Expr1010]=COUNT([t].[a])))

| |--Sort(DISTINCT ORDER BY:([t].[a] ASC))

| |--Table Scan(OBJECT:([t]))

|--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[Expr1011],0)))

|--Stream Aggregate(DEFINE:([Expr1011]=COUNT([t].[b])))

|--Sort(DISTINCT ORDER BY:([t].[b] ASC))

|--Table Scan(OBJECT:([t]))

The two inputs to the nested loops join compute the two counts from the original query. One of the inputs removes duplicates for column a while the other input removes duplicates for column b. The nested loops join has no join predicate; it is a cross join. Since both inputs to the nested loops join each produce a single row – they are both scalar aggregates – the result of the cross join is also a single row. The cross join just serves to “glue” the two columns of the result into a single row.

In the next lesson we'll talk about group-by aggregation, which is mainly what happens when we specify a GROUP BY clause; there are two physical operators used:

  1. Stream Aggregate
  2. Hash Aggregate


A valuable topic, thank you.


May God bless you.

Seriously, an excellent topic.


Well done, really.

If there is room to add some translations and hints, that would be excellent. May God bless your efforts.


Please share your whole book and stop copy/pasting :) Thanks
