Friday, April 22, 2016

Results

Though preliminary, my experiments show promising results. I ran the training program on the training split of Free917 to train the model, then tested it against the test split:

Total:                276
Total not broken:     206
Total simple:         158
Total found:          97 (61% simple)
Total not found:      61

I only count simple questions (easy object -> property questions) because it's too tedious to generate training data for non-simple sentences. It's important to note, however, that my method isn't limited to simple questions. The real takeaway is that this method for extracting the main entity from the sentence works 61% of the time, which is higher than I was expecting.

Looking at errors:
  • Question pattern lookup fail: The question pattern simply wasn't recorded in the training set. This is solved by adding a couple more questions to the training set to cover the missed cases.
  • Freebase search fail: The words extracted from the KDG are insufficient to find the correct entity on Freebase. This problem is much more difficult to solve.
  • Further K-parser errors: This problem is out of my control.
I have two options now: find out how to improve on the 61% as much as possible, or start working on the other half of the problem, getting the "property" out of the KDG. The latter is more interesting, but I'm not sure it's even possible.

Thursday, April 14, 2016

A more in-depth look at training

In the previous post, I explained the system I'm using to train a main-entity extraction model, but in very general terms. Here is a more detailed explanation.

The notion of the "main entity" of a question is vague. Consider the following question: "How many benches are in Central Park?" Is the question asking about benches or Central Park? The answer comes from how information is structured in Freebase. If there were a topic in Freebase for benches and it had a property that represented the number of benches in Central Park, then it would be logical to call "benches" the main entity of the question. However, it is more likely that there is a topic in Freebase for Central Park and that it has a property representing the number of benches. In this (more plausible) case, the main entity is Central Park.

If you don't leverage some kind of training system and just read questions without any prior knowledge, you will run into this problem. I know because I did during the earlier stages of this project: I tried to just look at the KDG of the sentence and guess where the main entity was, and that turned out to be a terribly inaccurate process. With training, however, my system gets an idea of where the main entity is located based on question structure, which reduces the amount of "guessing" it has to do.

"Training" is a vague term too, when it comes to computers. When I refer to training, I mean the following: feeding a system a number of (input, desired_output) pairs and getting back a model that can guess desired_output based on input that it hasn't already seen before. Here is a simple example. John is training Bob. Bob knows that when John says a number, he should say a number back; input is a number and desired_output is a number. John tells him that when he hears 2, he should say 4; when he hears 3, he should say 6; when he hears 4, he should hear 8; and continues this for a hundred more numbers or so. By now, Bob has a pretty good idea of what to do even if he doesn't know the exact desired output for the input he is given; just double the input. To test him, John tells him "1" and he correctly responds "2" even though he had never seen that example before. Based on training data supplied by John, Bob has created a model of the system John has designed.

In the scope of this project, input is the KDG of the question being asked and the desired output is the PATH to the main entity in the graph. Remember that since a KDG is a directed tree (acyclic connected graph), there's exactly one path from the root node to every other node. I'll illustrate this with the example from my last post. The sentence is "How many schools are in the school district of Philadelphia" and the KDG looks like this:


The correct main entity of this question is the school district of Philadelphia because there happens to be a topic on Freebase for it with a property that lists the schools in it. The "district-8" node is the node which represents this entity, so the correct path to the node from the root node ("are-4") is just "is_inside_location". If the main entity was, for example, Philadelphia, then the path would be "is_inside_location -> is_part_of".
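To show what a "path" looks like in code, here is a small Python sketch (not my actual implementation) that recovers the edge-label path from the root to any node. The KDG is stored as hypothetical (parent, edge_label, child) triples using the node names from the example above:

edges = [
    ("are-4", "agent", "schools-3"),
    ("schools-3", "trait", "?"),
    ("are-4", "is_inside_location", "district-8"),
    ("district-8", "complement_word", "school-7"),
    ("district-8", "is_part_of", "philadelphia-10"),
]

def path_to(node, edges, root):
    # Since a KDG is a tree, every node has exactly one parent,
    # so the root-to-node path is unique.
    parent = {child: (par, label) for par, label, child in edges}
    labels = []
    while node != root:
        par, label = parent[node]
        labels.append(label)
        node = par
    return list(reversed(labels))

print(path_to("district-8", edges, "are-4"))       # ['is_inside_location']
print(path_to("philadelphia-10", edges, "are-4"))  # ['is_inside_location', 'is_part_of']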

The pair for this example would look like this: (KDG of "how many schools are in the school district of Philadelphia?", is_inside_location). Now imagine there are hundreds of these. How do you use all this information to predict the correct path given a NEW KDG? You have to compare it to all the existing ones and see if the STRUCTURE matches. If the structure matches the KDG in a pair, then there's a really good chance that the path listed in that pair is correct for the new KDG too. Structure in this sense means the vague structure of the graph: "does it have an agent edge?" or "does it have an is_inside_location edge?" and so on. You can't just check if the KDG itself matches a known KDG, because you won't have seen it before. Back to our John and Bob example: if Bob hears 1, he can't try to remember what John said the correct answer for 1 was, because John never told him. Instead, he generalizes based on the patterns he saw.
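Here is a rough Python sketch of that lookup (a simplification of the idea, not my exact code): each training pair is boiled down to the set of edge labels in its KDG plus the stored path, and a new KDG is matched against whichever stored structure it overlaps most.

def structure(edges):
    # The "vague structure" of a KDG: just which edge labels it contains.
    return frozenset(label for _, label, _ in edges)

# Hypothetical training store: edge-label structure -> path to the main-entity node.
trained = {
    frozenset({"agent", "trait", "is_inside_location",
               "complement_word", "is_part_of"}): ["is_inside_location"],
    # ... hundreds more pairs extracted from the Free917 training split ...
}

def predict_path(new_edges):
    # Pick the stored structure with the largest overlap with the new KDG's structure.
    target = structure(new_edges)
    best = max(trained, key=lambda known: len(known & target))
    return trained[best]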

That's the whole idea. I'd like to know if I explained this well, so leave a comment if I lost you somewhere and you still want to understand.

I've already implemented this system, and I'm preparing to run it on the test cases that Free917 provides to see how accurate it is in predicting the main entity of a question. Results from that should be one or two posts from now.

Sunday, April 10, 2016

Main object extraction

As I wrote earlier, the first step of answering a question is finding out what it's asking about. The question "How old do you have to be to play Monopoly?" is asking about the board game Monopoly. I've proposed a method to train a model to be able to do this using K-parser.

I'll illustrate my process with a more complex example: "how many schools are in the school district of philadelphia?" The KDG looks like this (for brevity, I replaced every element with its class and stripped unimportant nodes):

root: be
  agent: schools
    trait: ?
      subclass_of: quantity
  is_inside_location: district
    complement_word: school
    is_part_of: philadelphia
(Take note that the ? with the subclass "quantity" represents the words "How many...?")

I have an operation that takes a node of the graph and returns a list of strings. It builds the list out of the node's name and the names of any "complement_word", "trait", "is_part_of", etc. nodes that come out of it. For example, the strings for "district" would be {"district", "school", "philadelphia"}. I can then concatenate them ("district school philadelphia") and search for that using Google's Freebase API.
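Roughly, the operation looks like the following Python sketch (the edge list and the search URL are just illustrative, not my real code):

import urllib.parse

DESCRIPTIVE_EDGES = {"complement_word", "trait", "is_part_of"}

edges = [
    ("district", "complement_word", "school"),
    ("district", "is_part_of", "philadelphia"),
]

def gather_strings(node, edges):
    # Collect the node's name plus the names of its descriptive children, recursively.
    words = [node]
    for parent, label, child in edges:
        if parent == node and label in DESCRIPTIVE_EDGES:
            words.extend(gather_strings(child, edges))
    return words

query = " ".join(gather_strings("district", edges))  # "district school philadelphia"
url = ("https://www.googleapis.com/freebase/v1/search?query="
       + urllib.parse.quote(query))
print(query)
print(url)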

The trick here is finding the correct node to do this on. One node in the graph (and the subgraph that stems from it) usually represents the main object of the question. I think a model can be trained with (graph structure, path to correct node) pairs. These pairs can be extracted from the training set by taking each KDG and using brute force to find the correct path. This works on most of the sentences that K-parser correctly parses; my experiments right now give me about 75%, but I'm sure it can be improved to around 90%.
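The brute-force labelling step could look something like this Python sketch. It reuses the gather_strings idea from above and a path_to helper like the one sketched in the April 14 post; search_freebase is a hypothetical stand-in for a call to the Freebase search API, and the gold entity id comes from the Free917 training data.

def find_correct_path(edges, root, gold_entity_id,
                      gather_strings, path_to, search_freebase):
    # Try every node as the candidate main entity; keep the path to the first
    # node whose search terms retrieve the entity Free917 says is correct.
    nodes = {child for _, _, child in edges} | {root}
    for node in nodes:
        query = " ".join(gather_strings(node, edges))
        results = search_freebase(query)       # list of candidate entity ids
        if results and results[0] == gold_entity_id:
            return path_to(node, edges, root)  # this (structure, path) pair is kept
    return None                                # K-parser error or search failure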

The crux of the idea is that the actual entities do not matter in finding the main entity, only the words between them. With a single training example: "how many schools are in the school district of philadelphia," the model can now extract the main object of any question of the form "how many X are in Y." With over 600 examples, this system should be able to generalize to all forms of questions. However, whether that is truly the case remains to be seen, because I haven't implemented this system yet.

Saturday, April 2, 2016

Breaking the problem down further

This past week or two I've been classifying each training example from Free917 based on how "easy" it is to answer. The only criterion is whether outside information is needed to answer the question, and an example where this is evident is the old Monopoly question: "How old do you have to be to play Monopoly?" To answer this question, the computer needs to know not only the meanings of all the words in the sentence, but also that "how old do you have to be" means we are looking for some kind of "minimum age" of the Monopoly board game (information that does exist in Freebase). An example of an easy question would be something like "Who is the pitcher for the Chicago Cubs?" The only information needed is Chicago Cubs and pitcher, which are both readily available; no extra intelligence is needed.
In my labeling, 1 means the question is easy and 0 means it is not.

My other task is determining how to answer the "easy" questions, and at this point it seems the simplistic system I described a couple of posts ago can be adequate, with some tuning.