SQL Server – TABLESAMPLE, retrieving random data from SQL Server

SQL Server Concepts

TABLESAMPLE allows you to limit the number of rows from a table in the FROM clause. The rows retrieved are random and they are are not in any order. With the TAMPLESAMPLE option, you can get a sample set of data from your table without having to read through the entire table or having to assign temporary random values to each row of data.

The syntax for the TABLESPACE clause is as follows:

TABLESAMPLE [SYSTEM] (sample_number [ PERCENT | ROWS ] )
[ REPEATABLE (repeat_seed) ]

 

SYSTEM specifies an ANSI SQL implementation-dependent sampling method. Specifying SYSTEM is optional, but this option is the only sampling method available in SQL Server and is applied by default.

Data is read at a page level.  Each 8K page for the table is given a random value and based on this random value and the value you specify either number of rows or percentage will determine how many rows are returned.  So based on this, each run can retrieve a totally different number of rows of data.

When the number of rows is specified, that number is converted into a percentage of the rows and, therefore, pages that should be returned. The TABLESAMPLE operation is then performed with that computed percentage.

If the table is made up of a single page, either all rows on the page are returned or none of the rows are returned. In this case, TABLESAMPLE SYSTEM can only return 100 percent or 0 percent of the rows on a page, regardless of the number of rows on the page.

Using TABLESAMPLE SYSTEM for a specific table limits the execution plan to use a table scan (a scan of the heap, or of the clustered index if one exists) on that table. Although the plan shows a table scan is performed, only those pages that are included in the result set are actually required to be read from the data file.

The REPEATABLE option causes a selected sample to be returned again. When REPEATABLE is specified with the same repeat_seed value, SQL Server returns the same subset of rows, as long as no changes have been made to the table. When REPEATABLE is specified with a different repeat_seed value, SQL Server will typically return a different sample of the rows in the table. The following actions to the table are considered changes: inserting, updating, deleting, index rebuilding, index defragmenting, restoring a database, and attaching a database.

Usage

/*Return a sample 50 percent of the rows of the result set. */ 
TABLESAMPLE 50 PERCENT) 

/* Return a sample of 25 rows from the result set. */
TABLESAMPLE (25 ROWS) 

Examples

SELECT Name, Color
FROM [Production].[Product] 
TABLESAMPLE (10 PERCENT) ;
SELECT Name, Color
FROM [Production].[Product] 
TABLESAMPLE (100 ROWS) ;

There are some restrictions in using TABLESAMPLE

  • TABLESAMPLE cannot be applied to derived tables, tables from linked servers, and tables derived from table-valued functions, rowset functions, or OPENXML.
  • TABLESAMPLE cannot be specified in the definition of a view or an inline table-valued function.
  • The number of rows returned is approximate. 100 ROWS doesn’t necessarily return 100 records.

Consider the scenario when we are using TABLESAMPLE in a Join. If TABLESAMPLE is specified for either of the two tables, some rows returned from the unsampled table are unlikely to have a matching row in the sampled table.

 

Author: simpleBIinsights

Leave a Reply