Building Interactive Text-to-SQL Systems

Koops, Reinier

Title

Building Interactive Text-to-SQL Systems

Author

Koops, Reinier (TU Delft Electrical Engineering, Mathematics and Computer Science; ING AI for FinTech Research; ING)

Contributor

Houben, G.J.P.M. (mentor)
Gadiraju, Ujwal (mentor)
Brons, J. (mentor)
Lan, G. (graduation committee)

Degree granting institution

Delft University of Technology

Programme

Computer Science

Date

2022-05-24

Abstract

Natural Language Interfaces for Databases (NLIDBs) offer a way for users to reason about data. They do not require the user to know the data structure or its relations, or to be familiar with a query language like SQL; they only require natural language. This thesis focuses on a subset of NLIDBs, namely those that take 'plain English' sentences as input and produce SQL queries as output.

Study 1 recruits participants from multiple origins (i.e. academia, a crowdsourcing platform, and the banking industry) without selecting on their query language capabilities. Next, participants are segmented by query language capability to distinguish non-experts from experts. A common way to retrieve information from databases is by using SQL, so knowledge of SQL is taken as a proxy for participants' skill level (i.e. SQL proficient or non-SQL proficient). We create an approach that automatically evaluates user-generated queries for near semantic equivalence against a predefined gold-standard SQL query, and use it to segment participants. It identifies 70 of the 242 participants as SQL proficient. To differentiate between the segments, we define 42 requirements often implemented for NLIDB systems, from which both segments select their preferred requirements. We are unable to find statistically significant differences between the segments' preferences. However, exploratory findings reveal the importance of origin: participants from the banking industry prefer explanation over answer accuracy, unlike the other segments.
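
The idea of checking a user-generated query against a gold-standard query can be illustrated with a minimal sketch. It assumes an execution-based check: two queries count as near-equivalent when they return the same rows, ignoring order, on one sample database. This is an assumption made for illustration and not necessarily the evaluation used in the thesis; the `accounts` table and the names `result_multiset` and `near_equivalent` are hypothetical.

```python
import sqlite3
from collections import Counter


def result_multiset(conn, sql):
    """Run a query and return its rows as an order-insensitive multiset."""
    return Counter(conn.execute(sql).fetchall())


def near_equivalent(conn, user_sql, gold_sql):
    """Assumed criterion: same rows (ignoring order) on this one database."""
    try:
        user_rows = result_multiset(conn, user_sql)
    except sqlite3.Error:
        return False  # a query that fails to run is counted as not equivalent
    return user_rows == result_multiset(conn, gold_sql)


# Hypothetical sample data, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL);
    INSERT INTO accounts VALUES (1, 'Ann', 120.0), (2, 'Bob', 40.0), (3, 'Cy', 75.5);
""")

gold = "SELECT owner FROM accounts WHERE balance > 50"
same = near_equivalent(
    conn, "SELECT a.owner FROM accounts AS a WHERE NOT a.balance <= 50", gold
)
different = near_equivalent(
    conn, "SELECT owner FROM accounts WHERE balance > 100", gold
)
broken = near_equivalent(conn, "SELEC owner FROM accounts", gold)
```

A check like this is only "near" equivalence: two queries can agree on one database and still differ on another, so the sample data needs to cover the cases that matter.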

Study 2 is inspired by the exploratory findings of Study 1 and uses requirements from Study 1 to create an application that tests two conditions: one with an explanation in the form of color-coding (i.e. showing the relations between the natural language question asked and the model's output columns) and one without. NLIDBs make it hard for users to verify whether the answer provided by the model is correct. Study 2 therefore uses these two conditions to test whether color-coding improves participants' performance. Our findings suggest that color-coding only improves performance for non-aggregate selection queries with multiple columns.
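
As a rough illustration of the color-coding condition, the sketch below gives each output column a color and tags the question words that refer to it. The abstract does not say how the application derives these links, so the simple lexical match used here is an assumption, and `PALETTE` and `color_code` are hypothetical names.

```python
import re

PALETTE = ["red", "blue", "green", "orange"]  # hypothetical color names


def color_code(question, columns):
    """Assign one color per output column and tag matching question words."""
    colors = {col: PALETTE[i % len(PALETTE)] for i, col in enumerate(columns)}
    tagged = []
    for word in question.split():
        key = re.sub(r"\W", "", word).lower()  # strip punctuation
        # Assumed linking rule: the word equals one part of the column name.
        match = next(
            (c for c in columns if key and key in c.lower().split("_")), None
        )
        tagged.append((word, colors[match] if match else None))
    return colors, tagged


colors, tagged = color_code(
    "Show the name and balance of each customer", ["name", "balance"]
)
```

Rendering each tagged word and its column header in the same color lets a user see at a glance which part of the question produced which column.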

Subject

NLIDB
NLP
SQL
Text-to-SQL
Deep Learning

To reference this document use:

http://resolver.tudelft.nl/uuid:8ccc193c-35db-472f-a013-fe9aa87b44e7

Related dataset 4TU.ResearchData

https://doi.org/10.4121/19733029

Part of collection

Student theses

Document type

master thesis

Files

PDF

Reinier_Koops_MSc_Thesis.pdf

4.18 MB
