SQL is a language for accessing and manipulating database standardized by ANSI. To be successful with database-centric applications (which includes most of the applications Data Warehousing domain), one must be strong enough in SQL. In this article, we will learn more about SQL by breaking the subject in the form of several question-answer sessions commonly asked in Interviewes.

 

Difference b/w Data Warehousing and Business Intelligence?

Data warehousing deals with all aspects of managing the development, implementation and operation of a data warehouse or data mart, including meta data management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management and backup/recovery planning. Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of their business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions. Typically, the term ’business intelligence’ is used to encompass OLAP, data visualization, data mining and query/reporting tools. (Reference: Les Barbusinski)

 

Why is Data Modeling Important?

Data modeling is probably the most labor intensive and time consuming part of the development process. The goal of the data model is to make sure that the all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end users.

In computer science, data modeling is the process of creating a data model by applying a data model theory to create a data model instance. A data model theory is a formal data model description. In data modeling, we are structuring and organizing data. These data structures are then typically implemented in a database management system. In addition to defining and organizing the data, data modeling will impose (implicitly or explicitly) constraints or limitations on the data placed within the structure.

Managing large quantities of structured and unstructured data is a primary function of information systems. Data models describe structured data for storage in data management systems such as relational databases. They typically do not describe unstructured data, such as word processing documents, email messages, pictures, digital audio, and video.

 

Question 1: SQL Query to find second highest salary of Employee

Answer : There are many ways to find second highest salary of Employee in SQL, you can either use SQL Join or Subquery to solve this problem. Here is SQL query using Subquery :
selectMAX(Salary) from Employee WHERE Salary NOT IN (selectMAX(Salary) from Employee );
Question 2: SQL Query to find Max Salary from each department.
Answer :
SELECT DeptID, MAX(Salary) FROM Employee GROUP BY DeptID. 
Question 3:Write SQL Query to display current date.
Ans:SQL has built in function called GetDate() which returns current timestamp.
SELECT GetDate(); 
Question 4:Write an SQL Query to check whether date passed to Query is date of given format or not.
Ans: SQL has IsDate() function which is used to check passed value is date or not of specified format ,it returns 1(true) or 0(false) accordingly.
SELECT ISDATE('1/08/13') AS"MM/DD/YY";
It will return 0 because passed date is not in correct format.

Question 5: Write a SQL Query to print the name of distinct employee whose DOB is between 01/01/1960 to 31/12/1975.
Ans:
SELECT DISTINCT EmpName FROM Employees WHERE DOB BETWEEN ‘01/01/1960’AND31/12/1975’;
Question 6:Write an SQL Query find number of employees according to gender whose DOB is between 01/01/1960 to 31/12/1975.
Answer : SELECT COUNT(*), sex from Employees WHERE DOB BETWEEN ‘01/01/1960 ' AND ‘31/12/1975’ GROUP BY sex;

Question 7:Write an SQL Query to find employee whose Salary is equal or greater than 10000.
Answer : SELECT EmpName FROM Employees WHERE Salary>=10000;

Question 8:Write an SQL Query to find name of employee whose name Start with ‘M’
Ans: SELECT * FROM Employees WHERE EmpName like 'M%';

Question 9: find all Employee records containing the word "Joe", regardless of whether it was stored as JOE, Joe, or joe.
Answer : SELECT * from Employees  WHERE upper(EmpName) like upper('joe%');

Question 10: Write a SQL Query to find year from date.
Answer :  SELECT YEAR(GETDATE()) as "Year";

 

What is a Surrogate Key?

A surrogate key is a substitution for the natural primary key. It is just a unique identifier or number for each row that can be used for the primary key to the table. The only requirement for a surrogate primary key is that it should be unique for each row in the table. It is useful because the natural primary key can change and this makes updates more difficult. Surrogated keys are always integer or numeric.

 

How to Optimize Stored Procedure Optimization?

There are many tips and tricks for the same. Here are few:

  • Include SET NOCOUNT ON statement.
  • Use schema name with object name.
  • Do not use the prefix “sp_” in the stored procedure name.
  • Use IF EXISTS (SELECT 1) instead of (SELECT *).
  • Use the sp_executesql stored procedure instead of the EXECUTE statement.
  • Try to avoid using SQL Server cursors whenever possible.
  • Keep the Transaction as short as possible.
  • Use TRY-Catch for error handling.

(Read more here)

What is SQL Injection? How to Protect Against SQL Injection Attack?

SQL injection is an attack in which malicious code is inserted into strings that are later passed to an instance of SQL Server for parsing and execution. Any procedure that constructs SQL statements should be reviewed for injection vulnerabilities because SQL Server will execute all syntactically valid queries that it receives. Even parameterized data can be manipulated by a skilled and determined attacker.

Here are few methods which can be used to protect again SQL Injection attack:

  • Use Type-Safe SQL Parameters
  • Use Parameterized Input with Stored Procedures
  • Use the Parameters Collection with Dynamic SQL
  • Filtering Input parameters
  • Use the escape character in LIKE clause
  • Wrapping Parameters with QUOTENAME() and REPLACE()

How to Find Out the List Schema Name and Table Name for the Database?

We can use following script:

SELECT '['+SCHEMA_NAME(schema_id)+'].['+name+']' AS SchemaTable
FROM sys.tables

 

SET UP OF SAMPLE DATA FOR PRACTICING SQL

For the purpose of our demonstration, we will primarily use two database tables with just a few records - EMPLOYEE table and DEPT table. EMPLOYEE table will contain 10 records pertaining to 10 employees with funny sounding names of an imaginary organization and DEPT or Department table will contain 5 departments of that organization. Click here to download the DDL/INSERT statements for this data if you want to practice the below SQLs in your personal computer

Contents of these tables are not same with Oracle emp and dept tables!!

What is PRIMARY KEY?

A PRIMARY KEY constraint is a unique identifier for a row within a database table. Every table should have a primary key constraint to uniquely identify each row, and only one primary key constraint can be created for each table. The primary key constraints are used to enforce entity integrity.

 

What is UNIQUE KEY Constraint?

A UNIQUE constraint enforces the uniqueness of the values in a set of columns; so no duplicate values are entered. The unique key constraints are used to enforce entity integrity as the primary key constraints.

 

What is FOREIGN KEY?

A FOREIGN KEY constraint prevents any actions that would destroy links between tables with the corresponding data values. A foreign key in one table points to a primary key in another table. Foreign keys prevent actions that would leave rows with foreign key values when there are no primary keys with that value. The foreign key constraints are used to enforce referential integrity.

 

What are Different Types of Locks?

  • Shared Locks: Used for operations that do not change or update data (read-only operations), such as a SELECT statement.
  • Update Locks: Used on resources that can be updated. It prevents a common form of deadlock that occurs when multiple sessions are reading, locking, and potentially updating resources later.
  • Exclusive Locks: Used for data-modification operations, such as INSERT, UPDATE, or DELETE. It ensures that multiple updates cannot be made to the same resource at the same time.
  • Intent Locks: Used to establish a lock hierarchy. The types of intent locks are as follows: intent shared (IS), intent exclusive (IX), and shared with intent exclusive (SIX).
  • Schema Locks: Used when an operation dependent on the schema of a table is executing. The types of schema locks are schema modification (Sch-M) and schema stability (Sch-S).
  • Bulk Update Locks: Used when bulk-copying data into a table and the TABLOCK hint is specified.

What are Pessimistic Lock and Optimistic Lock?

Optimistic Locking is a strategy where you read a record, take note of a version number and check that the version hasn’t changed before you write the record back. If the record is dirty (i.e. different version to yours), then you abort the transaction and the user can re-start it.

Pessimistic Locking is when you lock the record for your exclusive use until you have finished with it. It has much better integrity than optimistic locking but requires you to be careful with your application design to avoid Deadlocks.

 

When is the use of UPDATE_STATISTICS command?

This command is basically used when a large amount of data is processed. If a large amount of deletions, modifications or Bulk Copy into the tables has occurred, it has to update the indexes to take these changes into account. UPDATE_STATISTICS updates the indexes on these tables accordingly.

 

What is Dirty Read?

A dirty read occurs when two operations, say, read and write occur together giving the incorrect or unedited data. Suppose, A changed a row but did not committed the changes. B reads the uncommitted data but his view of the data may be wrong so that is Dirty Read.

 

Why can’t I use Outer Join in an Indexed View?

Rows can logically disappear from an indexed view based on OUTER JOIN when you insert data into a base table. This makes incrementally updating OUTER JOIN views relatively complex to implement, and the performance of the implementation would be slower than for views based on standard (INNER) JOIN.(Read More Here)

 

What is DataWarehousing?

  • Subject-oriented, which means that the data in the database is organized so that all the data elements relating to the same real-world event or object are linked together;
  • Time-variant, which means that the changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time;
  • Non-volatile, which means that data in the database is never over-written or deleted, once committed, the data is static, read-only, but retained for future reporting.
  • Integrated, which means that the database contains data from most or all of an organization’s operational applications, and that this data is made consistent.

Difference between a Function and a Stored Procedure?

UDF can be used in the SQL statements anywhere in the WHERE/HAVING/SELECT section, whereas Stored procedures cannot be. UDFs that return tables can be treated as another rowset. This can be used in JOINs with other tables. Inline UDF’s can be thought of as views that take parameters and can be used in JOINs and other Rowset operations.

What is subquery? Explain the Properties of a Subquery?

Subqueries are often referred to as sub-selects as they allow a SELECT statement to be executed arbitrarily within the body of another SQL statement. A subquery is executed by enclosing it in a set of parentheses. Subqueries are generally used to return a single row as an atomic value although they may be used to compare values against multiple rows with the IN keyword.

A subquery is a SELECT statement that is nested within another T-SQL statement. A subquery SELECT statement if executed independently of the T-SQL statement, in which it is nested, will return a resultset. This implies that a subquery SELECT statement can stand alone, and it does not depend on the statement in which it is nested. A subquery SELECT statement can return any number of values and can be found in the column list of a SELECT statement, and FROM, GROUP BY, HAVING, and/or ORDER BY clauses of a T-SQL statement. A subquery can also be used as a parameter to a function call. Basically, a subquery can be used anywhere an expression can be used. (Read more here)

What are Different Types of Join?

Cross Join

A cross join that does not have a WHERE clause produces the Cartesian product of the tables involved in the join. The size of a Cartesian product result set is the number of rows in the first table multiplied by the number of rows in the second table. The common example is when company wants to combine each product with a pricing table to analyze each product at each price.

Inner Join

A join that displays only the rows that have a match in both joined tables is known as inner Join. This is the default type of join in the Query and View Designer.

Outer Join

A join that includes rows even if they do not have related rows in the joined table is an Outer Join. You can create three different outer join to specify the unmatched rows to be included:

  • Left Outer Join: In Left Outer Join, all the rows in the first-named table, i.e. “left” table, which appears leftmost in the JOIN clause, are included. Unmatched rows in the right table do not appear.  
  • Right Outer Join: In Right Outer Join, all the rows in the second-named table, i.e. “right” table, which appears rightmost in the JOIN clause are included. Unmatched rows in the left table are not included.
  • Full Outer Join: In Full Outer Join, all the rows in all joined tables are included, whether they are matched or not.

Self Join

This is a particular case when one table joins to itself with one or two aliases to avoid confusion. A self join can be of any type, as long as the joined tables are the same. A self join is rather unique in that it involves a relationship with only one table. The common example is when company has a hierarchal reporting structure whereby one member of staff reports to another. Self Join can be Outer Join or Inner Join. (Read more here)

 

What is RDBMS?

Relational Data Base Management Systems (RDBMS) are database management systems that maintain data records and indices in tables. Relationships may be created and maintained across and among the data and tables. In a relational database, relationships between data items are expressed by means of tables. Interdependencies among these tables are expressed by data values rather than by pointers. This allows a high degree of data independence. An RDBMS has the capability to recombine the data items from different files, providing powerful tools for data usage. (Read more here)

 

What are the Properties of the Relational Tables?

Relational tables have the following six properties:

  • Values are atomic.
  • Column values are of the same kind.
  • Each row is unique.
  • The sequence of columns is insignificant.
  • The sequence of rows is insignificant.
  • Each column must have a unique name.

What is Normalization?

Database normalization is a data design and organization process applied to data structures based on rules that help building relational databases. In relational database design, the process of organizing data to minimize redundancy is called normalization. Normalization usually involves dividing a database into two or more tables and defining relationships between the tables. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.

 

What is De-normalization?

De-normalization is the process of attempting to optimize the performance of a database by adding redundant data. It is sometimes necessary because current DBMSs implement the relational model poorly. A true relational DBMS would allow for a fully normalized database at the logical level, while providing physical storage of data that is tuned for high performance. De-normalization is a technique to move from higher to lower normal forms of database modeling in order to speed up database access.

 

How is ACID property related to Database?

ACID (an acronym for Atomicity Consistency Isolation Durability) is a concept that Database Professionals generally look for while evaluating databases and application architectures. For a reliable database, all this four attributes should be achieved.

Atomicity is an all-or-none proposition.

Consistency guarantees that a transaction never leaves your database in a half-finished state.

Isolation keeps transactions separated from each other until they are finished.

Durability guarantees that the database will keep track of pending changes in such a way that the server can recover from an abnormal termination. (Read more here)

 

What are the Different Normalization Forms?

1NF: Eliminate Repeating Groups

Make a separate table for each set of related attributes, and give each table a primary key. Each field contains at most one value from its attribute domain.

2NF: Eliminate Redundant Data

If an attribute depends on only part of a multi-valued key, then remove it to a separate table.

3NF: Eliminate Columns Not Dependent On Key

If attributes do not contribute to a description of the key, then remove them to a separate table. All attributes must be directly dependent on the primary key. (Read more here)

BCNF: Boyce-Codd Normal Form

If there are non-trivial dependencies between candidate key attributes, then separate them out into distinct tables.

4NF: Isolate Independent Multiple Relationships

No table may contain two or more 1:n or n:m relationships that are not directly related.

5NF: Isolate Semantically Related Multiple Relationships

There may be practical constrains on information that justify separating logically related many-to-many relationships.

ONF: Optimal Normal Form

A model limited to only simple (elemental) facts, as expressed in Object Role Model notation.

DKNF: Domain-Key Normal Form

A model free from all modification anomalies is said to be in DKNF.

Remember, these normalization guidelines are cumulative. For a database to be in 3NF, it must first fulfill all the criteria of a 2NF and 1NF database.

 

What are Primary Keys and Foreign Keys?

Primary keys are the unique identifiers for each row. They must contain unique values and cannot be null. Due to their importance in relational databases, Primary keys are the most fundamental aspect of all keys and constraints. A table can have only one primary key.

Foreign keys are a method of ensuring data integrity and manifestation of the relationship between tables.

 

What is a Stored Procedure?

A stored procedure is a named group of SQL statements that have been previously created and stored in the server database. Stored procedures accept input parameters so that a single procedure can be used over the network by several clients using different input data. And when the procedure is modified, all clients automatically get the new version. Stored procedures reduce network traffic and improve performance. Stored procedures can be used to help ensure the integrity of the database.

e.g. sp_helpdb, sp_renamedb, sp_depends etc.

What is a Trigger?

A trigger is a SQL procedure that initiates an action when an event (INSERT, DELETE or UPDATE) occurs. Triggers are stored in and managed by the DBMS. Triggers are used to maintain the referential integrity of data by changing the data in a systematic fashion. A trigger cannot be called or executed; DBMS automatically fires the trigger as a result of a data modification to the associated table. Triggers can be considered to be similar to stored procedures in that both consist of procedural logic that is stored at the database level. Stored procedures, however, are not event-drive and are not attached to a specific table as triggers are. Stored procedures are explicitly executed by invoking a CALL to the procedure while triggers are implicitly executed. In addition, triggers can also execute stored procedures.

Nested Trigger: A trigger can also contain INSERT, UPDATE and DELETE logic within itself; so when the trigger is fired because of data modification, it can also cause another data modification, thereby firing another trigger. A trigger that contains data modification logic within itself is called a nested trigger. (Read more here)

What are the Different Types of Triggers?

There are two types of Triggers.

1)      DML Trigger

There are two types of DML Triggers

1.Instead of Trigger
Instead of Triggers are fired in place of the triggering action such as an insert, update, or delete.

2. After Trigger
After triggers execute following the triggering action, such as an insert, update, or delete.

2)      DDL Trigger

This type of trigger is fired against Drop Table, Create Table, Alter Table or Login events. DDL Triggers are always After Triggers.

 

What is a View?

A simple view can be thought of as a subset of a table. It can be used for retrieving data as well as updating or deleting rows. Rows updated or deleted in the view are updated or deleted in the table the view was created with. It should also be noted that as data in the original table changes, so does the data in the view as views are the way to look at parts of the original table. The results of using a view are not permanently stored in the database. The data accessed through a view is actually constructed using standard T-SQL select command and can come from one to many different base tables or even other views.

 

What is an Index?

An index is a physical structure containing pointers to the data. Indices are created in an existing table to locate rows more quickly and efficiently. It is possible to create an index on one or more columns of a table, and each index is given a name. The users cannot see the indexes; they are just used to speed up queries. Effective indexes are one of the best ways to improve performance in a database application. A table scan happens when there is no index available to help a query. In a table scan, the SQL Server examines every row in the table to satisfy the query results. Table scans are sometimes unavoidable, but on large tables, scans have a terrific impact on performance.

 

What is a Linked Server?

Linked Servers is a concept in SQL Server by which we can add other SQL Server to a Group and query both the SQL Server databases using T-SQL Statements. With a linked server, you can create very clean, easy–to-follow SQL statements that allow remote data to be retrieved, joined and combined with local data. Stored Procedures sp_addlinkedserver, sp_addlinkedsrvlogin will be used to add new Linked Server. (Read more here)

 

What is a Cursor?

A cursor is a database object used by applications to manipulate data in a set on a row-by-row basis, instead of the typical SQL commands that operate on all the rows in the set at one time.

In order to work with a cursor, we need to perform some steps in the following order:

  • Declare cursor
  • Open cursor
  • Fetch row from the cursor
  • Process fetched row
  • Close cursor
  • Deallocate cursor (Read more here)

What is Collation?

Collation refers to a set of rules that determine how data is sorted and compared. Character data is sorted using rules that define the correct character sequence with options for specifying case sensitivity, accent marks, Kana character types, and character width. (Read more here)

 

What is User-defined Functions?

Types of User-defined Functions that can be created?

User-defined Functions allow defining its own T-SQL functions that can accept zero or more parameters and return a single scalar data value or a table data type.

Different Types of User-Defined Functions created are as follows:

Scalar User-defined Function

A scalar user-defined function returns one of the scalar data types. Text, ntext, image and timestamp data types are not supported. These are the type of user-defined functions that most developers are used to in other programming languages.

Inline Table-Value User-defined Function

An Inline table-value user-defined function returns a table data type and is an exceptional alternative to a view as the user-defined function can pass parameters into a T-SQL select command and in essence provide us with a parameterized, non-updateable view of the underlying tables.

Multi-Statement Table-Value User-defined Function

A multi-statement table-value user-defined function returns a table, and it is also an exceptional alternative to a view as the function can support multiple T-SQL statements to build the final result where the view is limited to a single SELECT statement. Also, the ability to pass parameters into a T-SQL select command or a group of them gives us the capability to in essence create a parameterized, non-updateable view of the data in the underlying tables. Within the create function command, you must define the table structure that is being returned. After creating this type of user-defined function, It can be used in the FROM clause of a T-SQL command unlike the behavior encountered while using a stored procedure which can also return record sets. (Read here for example)

 

Difference between a HAVING clause and a WHERE clause?

They specify a search condition for a group or an aggregate. But the difference is that HAVING can be used only with the SELECT statement. HAVING is typically used in a GROUP BY clause. When GROUP BY is not used, HAVING behaves like a WHERE clause. Having Clause is basically used only with the GROUP BY function in a query, whereas WHERE Clause is applied to each row before they are part of the GROUP BY function in a query. (Read more here)

 

What are the Difference between Clustered and a Non-clustered Index?

A clustered index is a special type of index that reorders the way records in the table are physically stored. Therefore, the table can have only one clustered index. The leaf nodes of a clustered index contain the data pages.

A non-clustered index is a special type of index in which the logical order of the index does not match the physical stored order of the rows on disk. The leaf node of a non-clustered index does not consist of the data pages. Instead, the leaf nodes contain index rows. (Read more here)

 

What are the Different Index Configurations a Table can have?

A table can have one of the following index configurations:

  • No indexes
  • A clustered index
  • A clustered index and many non-clustered indexes
  • A non-clustered index
  • Many non-clustered indexes

 

What are Different Types of Collation Sensitivity?

Case sensitivity – A and a, B and b, etc.

Accent sensitivity – a and á, o and ó, etc.

Kana Sensitivity – When Japanese Kana characters Hiragana and Katakana are treated differently, it is called Kana sensitive.

Width sensitivity – When a single-byte character (half-width) and the same character represented as a double-byte character (full-width) are treated differently, it is width sensitive. (Read more here)

 

What is OLTP (Online Transaction Processing)?

In OLTP –(online transaction processing) systems, relational database design uses the discipline of data modeling and generally follows the Codd rules of data normalization in order to ensure absolute data integrity. Using these rules, complex information is broken down into its most simple structures (a table) where all of the individual atomic level elements relate to each other and satisfy the normalization rules.

 

Difference between a Primary Key and a Unique Key?

Both primary key and unique key enforce uniqueness of the column on which they are defined. But by default, the primary key creates a clustered index on the column, whereas unique key creates a non-clustered index by default. Another major difference is that primary key doesn’t allow NULLs, but unique key allows one NULL only. (Read more here)

 

Difference between DELETE  and TRUNCATE Commands?

Delete command removes the rows from a table on the basis of the condition that we provide with a WHERE clause. Truncate will actually remove all the rows from a table, and there will be no data in the table after we run the truncate command.

TRUNCATE

  • TRUNCATE is faster and uses fewer system and transaction log resources than DELETE. (Read all the points below)
  • TRUNCATE removes the data by deallocating the data pages used to store the table’s data, and only the page deallocations are recorded in the transaction log.
  • TRUNCATE removes all the rows from a table, but the table structure, its columns, constraints, indexes and so on remains. The counter used by an identity for new rows is reset to the seed for the column.
  • You cannot use TRUNCATE TABLE on a table referenced by a FOREIGN KEY constraint.
  • Using T-SQL – TRUNCATE cannot be rolled back unless it is used in TRANSACTION. OR TRUNCATE can be rolled back when used with BEGIN … END TRANSACTION using T-SQL.
  • TRUNCATE is a DDL Command.
  • TRUNCATE resets the identity of the table.

DELETE

  • DELETE removes rows one at a time and records an entry in the transaction log for each deleted row.
  • DELETE does not reset Identity property of the table.
  • DELETE can be used with or without a WHERE clause
  • DELETE activates Triggers if defined on table.
  • DELETE can be rolled back.
  • DELETE is DML Command.
  • DELETE does not reset the identity of the table.

 

What is Connection Pooling and why it is Used?

To minimize the cost of opening and closing connections, ADO.NET uses an optimization technique called connection pooling.

The pooler maintains ownership of the physical connection. It manages connections by keeping alive a set of active connections for each given connection configuration. Whenever a user calls Open on a connection, the pooler looks for an available connection in the pool. If a pooled connection is available, it returns it to the caller instead of opening a new connection. When the application calls Close on the connection, the pooler returns it to the pooled set of active connections instead of closing it. Once the connection is returned to the pool, it is ready to be reused on the next Open call.

 

What are the Properties and Different Types of Sub-Queries?

Properties of a Sub-Query

  • A sub-query must be enclosed in the parenthesis.
  • A sub-query must be put on the right hand of the comparison operator, and
  • A sub-query cannot contain an ORDER BY clause, however sub-query can use ORDER BY when used with TOP clause. Read Comment by David Bridge
  • A query can contain more than one sub-query.

Types of Sub-query

  • Single-row sub-query, where the sub-query returns only one row.
  • Multiple-row sub-query, where the sub-query returns multiple rows, and
  • Multiple column sub-query, where the sub-query returns multiple columns

 

What is the difference between inner and outer join? Explain with example.

Inner Join

Inner join is the most common type of Join which is used to combine the rows from two tables and create a result set containing only such records that are present in both the tables based on the joining condition (predicate).

Inner join returns rows when there is at least one match in both tables

If none of the record matches between two tables, then INNER JOIN will return a NULL set. Below is an example of INNER JOIN and the resulting set.

SELECT dept.name DEPARTMENT, emp.name EMPLOYEE 
FROM DEPT dept, EMPLOYEE emp
WHERE emp.dept_id = dept.id
Department Employee
HR Inno
HR Privy
Engineering Robo
Engineering Hash
Engineering Anno
Engineering Darl
Marketing Pete
Marketing Meme
Sales Tomiti
Sales Bhuti
Outer Join

Outer Join can be full outer or single outer

Outer Join, on the other hand, will return matching rows from both tables as well as any unmatched rows from one or both the tables (based on whether it is single outer or full outer join respectively).

Notice in our record set that there is no employee in the department 5 (Logistics). Because of this if we perform inner join, then Department 5 does not appear in the above result. However in the below query we perform an outer join (dept left outer join emp), and we can see this department.

SELECT dept.name DEPARTMENT, emp.name EMPLOYEE 
FROM DEPT dept, EMPLOYEE emp
WHERE dept.id = emp.dept_id (+)
Department Employee
HR Inno
HR Privy
Engineering Robo
Engineering Hash
Engineering Anno
Engineering Darl
Marketing Pete
Marketing Meme
Sales Tomiti
Sales Bhuti
Logistics  

The (+) sign on the emp side of the predicate indicates that emp is the outer table here. The above SQL can be alternatively written as below (will yield the same result as above):

SELECT dept.name DEPARTMENT, emp.name EMPLOYEE 
FROM DEPT dept LEFT OUTER JOIN EMPLOYEE emp
ON dept.id = emp.dept_id  

What is the difference between JOIN and UNION?

SQL JOIN allows us to “lookup” records on other table based on the given conditions between two tables. For example, if we have the department ID of each employee, then we can use this department ID of the employee table to join with the department ID of department table to lookup department names.

UNION operation allows us to add 2 similar data sets to create resulting data set that contains all the data from the source data sets. Union does not require any condition for joining. For example, if you have 2 employee tables with same structure, you can UNION them to create one result set that will contain all the employees from both of the tables.

SELECT * FROM EMP1
UNION
SELECT * FROM EMP2;

What is the difference between UNION and UNION ALL?

UNION and UNION ALL both unify for add two structurally similar data sets, but UNION operation returns only the unique records from the resulting data set whereas UNION ALL will return all the rows, even if one or more rows are duplicated to each other.

In the following example, I am choosing exactly the same employee from the emp table and performing UNION and UNION ALL. Check the difference in the result.

SELECT * FROM EMPLOYEE WHERE ID = 5
UNION ALL
SELECT * FROM EMPLOYEE WHERE ID = 5
ID MGR_ID DEPT_ID NAME SAL DOJ
5.0 2.0 2.0 Anno 80.0 01-Feb-2012
5.0 2.0 2.0 Anno 80.0 01-Feb-2012
SELECT * FROM EMPLOYEE WHERE ID = 5
UNION 
SELECT * FROM EMPLOYEE WHERE ID = 5
ID MGR_ID DEPT_ID NAME SAL DOJ
5.0 2.0 2.0 Anno 80.0 01-Feb-2012

What is the difference between WHERE clause and HAVING clause?

WHERE and HAVING both filters out records based on one or more conditions. The difference is, WHERE clause can only be applied on a static non-aggregated column whereas we will need to use HAVING for aggregated columns.

To understand this, consider this example.
Suppose we want to see only those departments where department ID is greater than 3. There is no aggregation operation and the condition needs to be applied on a static field. We will use WHERE clause here:

SELECT * FROM DEPT WHERE ID > 3
ID NAME
4 Sales
5 Logistics

Next, suppose we want to see only those Departments where Average salary is greater than 80. Here the condition is associated with a non-static aggregated information which is “average of salary”. We will need to use HAVING clause here:

SELECT dept.name DEPARTMENT, avg(emp.sal) AVG_SAL
FROM DEPT dept, EMPLOYEE emp
WHERE dept.id = emp.dept_id (+)
GROUP BY dept.name
HAVING AVG(emp.sal) > 80
DEPARTMENT AVG_SAL
Engineering 90

As you see above, there is only one department (Engineering) where average salary of employees is greater than 80.

What is the difference among UNION, MINUS and INTERSECT?

UNION combines the results from 2 tables and eliminates duplicate records from the result set.

MINUS operator when used between 2 tables, gives us all the rows from the first table except the rows which are present in the second table.

INTERSECT operator returns us only the matching or common rows between 2 result sets.

To understand these operators, let’s see some examples. We will use two different queries to extract data from our emp table and then we will perform UNION, MINUS and INTERSECT operations on these two sets of data.

UNION

SELECT * FROM EMPLOYEE WHERE ID = 5
UNION 
SELECT * FROM EMPLOYEE WHERE ID = 6
ID MGR_ID DEPT_ID NAME SAL DOJ
5 2 2.0 Anno 80.0 01-Feb-2012
6 2 2.0 Darl 80.0 11-Feb-2012

MINUS

SELECT * FROM EMPLOYEE
MINUS
SELECT * FROM EMPLOYEE WHERE ID > 2
ID MGR_ID DEPT_ID NAME SAL DOJ
1   2 Hash 100.0 01-Jan-2012
2 1 2 Robo 100.0 01-Jan-2012

INTERSECT

SELECT * FROM EMPLOYEE WHERE ID IN (2, 3, 5)
INTERSECT
SELECT * FROM EMPLOYEE WHERE ID IN (1, 2, 4, 5)
ID MGR_ID DEPT_ID NAME SAL DOJ
5 2 2 Anno 80.0 01-Feb-2012
2 1 2 Robo 100.0 01-Jan-2012

What is Self Join and why is it required?

Self Join is the act of joining one table with itself.

Self Join is often very useful to convert a hierarchical structure into a flat structure

In our employee table example above, we have kept the manager ID of each employee in the same row as that of the employee. This is an example of how a hierarchy (in this case employee-manager hierarchy) is stored in the RDBMS table. Now, suppose if we need to print out the names of the manager of each employee right beside the employee, we can use self join. See the example below:

SELECT e.name EMPLOYEE, m.name MANAGER
FROM EMPLOYEE e, EMPLOYEE m
WHERE e.mgr_id = m.id (+)
EMPLOYEE MANAGER
Pete Hash
Darl Hash
Inno Hash
Robo Hash
Tomiti Robo
Anno Robo
Privy Robo
Meme Pete
Bhuti Tomiti
Hash  

The only reason we have performed a left outer join here (instead of INNER JOIN) is we have one employee in this table without a manager (employee ID = 1). If we perform inner join, this employee will not show-up.

How can we transpose a table using SQL (changing rows to column or vice-versa) ?

The usual way to do it in SQL is to use CASE statement or DECODE statement.

How to generate row number in SQL Without ROWNUM

Generating a row number – that is a running sequence of numbers for each row is not easy using plain SQL. In fact, the method I am going to show below is not very generic either. This method only works if there is at least one unique column in the table. This method will also work if there is no single unique column, but collection of columns that is unique. Anyway, here is the query:

SELECT name, sal, (SELECT COUNT(*)  FROM EMPLOYEE i WHERE o.name >= i.name) row_num
FROM EMPLOYEE o
order by row_num
NAME SAL ROW_NUM
Anno 80 1
Bhuti 60 2
Darl 80 3
Hash 100 4
Inno 50 5
Meme 60 6
Pete 70 7
Privy 50 8
Robo 100 9
Tomiti 70 10

The column that is used in the row number generation logic is called “sort key”. Here sort key is “name” column. For this technique to work, the sort key needs to be unique. We have chosen the column “name” because this column happened to be unique in our Employee table. If it was not unique but some other collection of columns was, then we could have used those columns as our sort key (by concatenating those columns to form a single sort key).

Also notice how the rows are sorted in the result set. We have done an explicit sorting on the row_num column, which gives us all the row numbers in the sorted order. But notice that name column is also sorted (which is probably the reason why this column is referred as sort-key). If you want to change the order of the sorting from ascending to descending, you will need to change “>=” sign to “<=” in the query.

As I said before, this method is not very generic. This is why many databases already implement other methods to achieve this. For example, in Oracle database, every SQL result set contains a hidden column called ROWNUM. We can just explicitly select ROWNUM to get sequence numbers.

How to select first 5 records from a table?

This question, often asked in many interviews, does not make any sense to me. The problem here is how do you define which record is first and which is second. Which record is retrieved first from the database is not deterministic. It depends on many uncontrollable factors such as how database works at that moment of execution etc. So the question should really be – “how to select any 5 records from the table?” But whatever it is, here is the solution:

In Oracle,

SELECT * 
FROM EMP
WHERE ROWNUM <= 5;

In SQL Server,

SELECT TOP 5 * FROM EMP;

Generic solution,

I believe a generic solution can be devised for this problem if and only if there exists at least one distinct column in the table. For example, in our EMP table ID is distinct. We can use that distinct column in the below way to come up with a generic solution of this question that does not require database specific functions such as ROWNUM, TOP etc.

SELECT  name 
FROM EMPLOYEE o
WHERE (SELECT count(*) FROM EMPLOYEE i WHERE i.name < o.name) < 5
name
Inno
Anno
Darl
Meme
Bhuti

I have taken “name” column in the above example since “name” is happened to be unique in this table. I could very well take ID column as well.

In this example, if the chosen column was not distinct, we would have got more than 5 records returned in our output.

Do you have a better solution to this problem? If yes, post your solution in the comment.

What is the difference between ROWNUM pseudo column and ROW_NUMBER() function?

ROWNUM is a pseudo column present in Oracle database returned result set prior to ORDER BY being evaluated. So ORDER BY ROWNUM does not work.

ROW_NUMBER() is an analytical function which is used in conjunction to OVER() clause wherein we can specify ORDER BY and also PARTITION BY columns.

Suppose if you want to generate the row numbers in the order of ascending employee salaries for example, ROWNUM will not work. But you may use ROW_NUMBER() OVER() like shown below:

SELECT name, sal, row_number() over(order by sal desc) rownum_by_sal
FROM EMPLOYEE o
name Sal ROWNUM_BY_SAL
Hash 100 1
Robo 100 2
Anno 80 3
Darl 80 4
Tomiti 70 5
Pete 70 6
Bhuti 60 7
Meme 60 8
Inno 50 9
Privy 50 10

What are the differences among ROWNUM, RANK and DENSE_RANK?

ROW_NUMBER assigns contiguous, unique numbers from 1.. N to a result set.

RANK does not assign unique numbers—nor does it assign contiguous numbers. If two records tie for second place, no record will be assigned the 3rd rank as no one came in third, according to RANK. See below:

SELECT name, sal, rank() over(order by sal desc) rank_by_sal
FROM EMPLOYEE o
name Sal RANK_BY_SAL
Hash 100 1
Robo 100 1
Anno 80 3
Darl 80 3
Tomiti 70 5
Pete 70 5
Bhuti 60 7
Meme 60 7
Inno 50 9
Privy 50 9

DENSE_RANK, like RANK, does not assign unique numbers, but it does assign contiguous numbers. Even though two records tied for second place, there is a third-place record. See below:

SELECT name, sal, dense_rank() over(order by sal desc) dense_rank_by_sal
FROM EMPLOYEE o
name Sal DENSE_RANK_BY_SAL
Hash 100 1
Robo 100 1
Anno 80 2
Darl 80 2
Tomiti 70 3
Pete 70 3
Bhuti 60 4
Meme 60 4
Inno 50 5
Privy 50 5