Querying a Database

Suppose we have a program that lets us type in a natural language question and gives us back the right answer:

How hard is it to write such a program? And can we just use the same techniques that we've encountered so far in this book, or does it involve something new? In this section, we will show that solving the task in a restricted domain is pretty straightforward. But we will also see that to address the problem in a more general way, we have to open up a whole new box of ideas and techniques, involving the representation of meaning.

So let's start off by assuming that we have data about cities and countries in a structured form. To be concrete, we will use a database table whose first few rows are shown in Table 10-1.

The data illustrated in Table 10-1 is drawn from the Chat-80 system (Warren & Pereira, 1982). Population figures are given in thousands, but note that the data used in these examples dates back at least to the 1980s, and was already somewhat out of date at the point when (Warren & Pereira, 1982) was published.

Table 10-1. city_table: A table of cities, countries, and populations



















The obvious way to retrieve answers from this tabular data involves writing queries in a database query language such as SQL.

SQL (Structured Query Language) is a language designed for retrieving and managing data in relational databases. If you want to find out more about SQL, http://www.w3schools.com/sql/ is a convenient online reference.

For example, executing the query (2) will pull out the value 'greece':

(2) SELECT Country FROM city_table WHERE City = 'athens'

This specifies a result set consisting of all values for the column Country in data rows where the value of the City column is 'athens' .

How can we get the same effect using English as our input to the query system? The feature-based grammar formalism described in Chapter 9 makes it easy to translate from English to SQL. The grammar sql0.fcfg illustrates how to assemble a meaning representation for a sentence in tandem with parsing the sentence. Each phrase structure rule is supplemented with a recipe for constructing a value for the feature SEM. You can see that these recipes are extremely simple; in each case, we use the string concatenation operation + to splice the values for the child constituents to make a value for the parent constituent.

>>> nltk.data.show_cfg('grammars/book_grammars/sql0.fcfg') % start S

S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp] VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp] VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap] NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n] PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np] AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp] NP[SEM='Country="greece"'] -> 'Greece' NP[SEM='Country="china"'] -> 'China' Det[SEM='SELECT'] -> 'Which' | 'What' N[SEM='City FROM city_table'] -> 'cities' IV[SEM=''] -> 'are' A[SEM=''] -> 'located' p[sEM=''] -> 'in'

This allows us to parse a query into SQL: >>> from nltk import load_parser

>>> cp = load_parser('grammars/book_grammars/sql0.fcfg')

>>> query = 'What cities are located in China'

SELECT City FROM city_table WHERE Country="china"

® - Your Turn: Run the parser with maximum tracing on, i.e., cp =

load_parser('grammars/book_grammars/sql0.fcfg', trace=3), and ex" v i ^ amine how the values of SEM are built up as complete edges are added to the chart.

Finally, we execute the query over the database city.db and retrieve some results: >>> from nltk.sem import chat80

>>> rows = chat80.sql_query('corpora/city_database/city.db', q) >>> for r in rows: print r[0], O

canton chungking dairen harbin kowloon mukden peking shanghai sian tientsin

Since each row r is a one-element tuple, we print out the member of the tuple rather than the tuple itself O.

To summarize, we have defined a task where the computer returns useful data in response to a natural language query, and we implemented this by translating a small subset of English into SQL. We can say that our NLTK code already "understands" SQL, given that Python is able to execute SQL queries against a database, and by extension it also "understands" queries such as What cities are located in China. This parallels being able to translate from Dutch into English as an example of natural language understanding. Suppose that you are a native speaker of English, and have started to learn Dutch. Your teacher asks if you understand what (3) means:

(3) Margrietje houdt van Brunoke.

If you know the meanings of the individual words in (3), and know how these meanings are combined to make up the meaning of the whole sentence, you might say that (3) means the same as Margrietje loves Brunoke.

An observer—let's call her Olga—might well take this as evidence that you do grasp the meaning of (3). But this would depend on Olga herself understanding English. If she doesn't, then your translation from Dutch to English is not going to convince her of your ability to understand Dutch. We will return to this issue shortly.

The grammar sql0.fcfg, together with the NLTK Earley parser, is instrumental in carrying out the translation from English to SQL. How adequate is this grammar? You saw that the SQL translation for the whole sentence was built up from the translations of the components. However, there does not seem to be a lot of justification for these component meaning representations. For example, if we look at the analysis of the noun phrase Which cities, the determiner and noun correspond respectively to the SQL fragments SELECT and City FROM city_table. But neither of these has a well-defined meaning in isolation from the other.

There is another criticism we can level at the grammar: we have "hard-wired" an embarrassing amount of detail about the database into it. We need to know the name of the relevant table (e.g., city_table) and the names of the fields. But our database could have contained exactly the same rows of data yet used a different table name and different field names, in which case the SQL queries would not be executable. Equally, we could have stored our data in a different format, such as XML, in which case retrieving the same results would require us to translate our English queries into an XML query language rather than SQL. These considerations suggest that we should be translating English into something that is more abstract and generic than SQL.

In order to sharpen the point, let's consider another English query and its translation:

(4) a. What cities are in China and have populations above 1,000,000?

b. SELECT City FROM city_table WHERE Country = 'china' AND Population > 1000

Your Turn: Extend the grammar sql0.fcfg so that it will translate (4a) into (4b), and check the values returned by the query. Remember that figures in the Chat-80 database are given in thousands, hence 1000 in (4b) represents one million inhabitants.

You will probably find it easiest to first extend the grammar to handle queries like What cities have populations above 1,000,000 before tackling conjunction. After you have had a go at this task, you can compare your solution to grammars/book_grammars/sql1.fcfg in the NLTK data distribution.

Observe that the and conjunction in (4a) is translated into an AND in the SQL counterpart, (4b). The latter tells us to select results from rows where two conditions are true together: the value of the Country column is 'china' and the value of the Population column is greater than 1000. This interpretation for and involves a new idea: it talks about what is true in some particular situation, and tells us that Condi AND Cond2 is true in situation s if and only if condition Condi is true in s and condition Cond2 is true in s. Although this doesn't account for the full range of meanings of and in English, it has the nice property that it is independent of any query language. In fact, we have given it the standard interpretation from classical logic. In the following sections, we will explore an approach in which sentences of natural language are translated into logic instead of an executable query language such as SQL. One advantage of logical formalisms is that they are more abstract and therefore more generic. If we wanted to, once we had our translation into logic, we could then translate it into various other special-purpose languages. In fact, most serious attempts to query databases via natural language have used this methodology.

Was this article helpful?

0 0

Post a comment