Friday, March 4, 2016

The challenge of named-entity recognition

One of the biggest challenges that I face is named-entity recognition. I'll illustrate the meaning of the term with an example.

Imagine you have a list of directions written in a language you do not know. One of them reads:
Gok bok nok
You haven't the slightest clue as to what this means, until someone tells you "gok bok" roughly translates to "get." Now you just have to figure out what "nok" means and retrieve it, so you go to the library and ask the person working there for "nok." They reply with the question "Brok nok ork grok nok?" Your limited capabilities with this language tell you that the clerk is asking you to specify which kind of nok you want, as there are two different meanings of the word, one relating to "brok" and one to "grok."

You have no idea which one you need, so you go back to the person who gave you the instructions to get some context. However, he just gives you more gibberish that you have to translate and somehow relate to "grok" and "brok," the two different kinds of nok, to find out which nok you need.

It should be clear where this is going: you are the computer trying to figure out what an English sentence means, and you've been thrown an ambiguous word. In the question "Who is the pitcher for the Chicago Cubs?" the meaning of the word "pitcher" might seem obvious to someone who knows what the Chicago Cubs are. Computers can deal with this kind of ambiguity (is it a container for holding liquids or a baseball player?), but not without significant work.
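To make the "pitcher" problem concrete, here is a toy sketch of one classic approach: pick the sense whose definition shares the most words with the context (a simplified Lesk-style overlap). This is not what I actually run; the sense glosses in `SENSES` and the entity description in `ENTITY_NOTES` are hypothetical, hand-written stand-ins for a real knowledge base.

```python
import re

# Hypothetical, hand-written glosses; a real system would pull these
# from a knowledge base instead.
SENSES = {
    "pitcher": {
        "baseball player": "the player on a baseball team who throws the ball to the batter",
        "container": "a container with a handle and spout for holding and pouring liquids",
    }
}

# Hypothetical background knowledge about named entities.
ENTITY_NOTES = {
    "chicago cubs": "a professional baseball team based in Chicago",
}

STOPWORDS = {"a", "an", "the", "is", "for", "to", "of", "and",
             "who", "with", "on", "in", "what"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def expand(sentence):
    """Append the description of every known entity mentioned in the sentence."""
    lowered = sentence.lower()
    extra = [desc for name, desc in ENTITY_NOTES.items() if name in lowered]
    return " ".join([sentence] + extra)


def disambiguate(word, sentence, use_entities=True):
    """Return (best_sense_or_None, scores) using word overlap with each gloss."""
    text = expand(sentence) if use_entities else sentence
    context = tokens(text) - {word}
    scores = {label: len(context & tokens(gloss))
              for label, gloss in SENSES[word].items()}
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) > 1 and ranked[0] == ranked[1]:
        return None, scores  # tie: the context does not settle it
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    question = "Who is the pitcher for the Chicago Cubs?"
    print(disambiguate("pitcher", question, use_entities=False))
    print(disambiguate("pitcher", question, use_entities=True))
```

Without the entity note, the question shares no words with either gloss and the function gives up. Only once it knows that the Chicago Cubs are a baseball team does the baseball sense win, which is exactly why I need a knowledge base behind the scenes.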

However, in my case I face another challenge on top of this. In the above example I assumed that the worker had access to a library that knew the meanings of the word "nok." I've been looking for one, and I've come across two serious candidates. I need two things: complete search results (I should find what I'm looking for), and each result linked to its counterpart in Freebase.

The first is Google's Knowledge Graph (wiki), which Google claims is the successor of Freebase (a claim I find dubious, for reasons outside the scope of this post). Put mildly, its searching capability is miserable. Here are the results obtained from searching for "First Amendment" (an entity I know for a fact exists in Freebase):
First amendment, Book by Ashley McConnell (score 142.049927)
First Amendment, Song by Silent Civilian (score 141.980042)
First Amendment, Musical Group (score 141.068527)
First Amendment, TV Episode (score 137.776077)
First Amendment, Song by Silent Civilian (score 126.532188)
Where's the actual amendment?! Unless there's something I'm completely missing, Google's Knowledge Graph does not contain the information I need. This is saddening because this is the closest knowledge base to Freebase that is easy to use: each result that I get is directly linked to the corresponding entity in Freebase. The other option I tried has a much better search yet lacks this.
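For reference, a search like the one above can be sketched against the Knowledge Graph Search API roughly as follows. This is a minimal sketch, not my exact script: it assumes you have your own API key, and the helper names (`build_url`, `parse_results`, `search`) are made up for illustration. As far as I can tell, each result's `@id` comes back with a `kg:` prefix in front of an identifier such as a Freebase-style `/m/...` MID, which is the link to Freebase I care about, though not every entity is guaranteed to have one.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_url(query, api_key, limit=5):
    """Build the request URL for a Knowledge Graph entity search."""
    params = {"query": query, "key": api_key, "limit": limit}
    return KG_ENDPOINT + "?" + urlencode(params)


def parse_results(response):
    """Turn a decoded JSON response into (name, description, id, score) tuples."""
    rows = []
    for element in response.get("itemListElement", []):
        result = element.get("result", {})
        raw_id = result.get("@id", "")
        # Strip the "kg:" prefix so that only the bare identifier remains.
        entity_id = raw_id[3:] if raw_id.startswith("kg:") else raw_id
        rows.append((result.get("name"),
                     result.get("description"),
                     entity_id,
                     element.get("resultScore")))
    return rows


def search(query, api_key, limit=5):
    """Run a live search (needs network access and a valid API key)."""
    with urlopen(build_url(query, api_key, limit)) as handle:
        return parse_results(json.load(handle))
```

Calling `search("First Amendment", api_key)` with a real key should give back rows shaped like the list above: a name, a short description, an identifier, and a score.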

That option is WordNet, and I am content with the amount of information it has. The search for "First Amendment" returns the result I'm looking for:
  • S: (n) First Amendment (an amendment to the Constitution of the United States guaranteeing the right of free expression; includes freedom of assembly and freedom of the press and freedom of religion and freedom of speech)
However, since I need to go from "First Amendment" to the correct entity in Freebase (something Google's product allows but this one does not), it will be difficult to use this service. Simply knowing the definition will not help.

There is a third option: performing a direct search on Freebase itself. However, getting a copy of Freebase up and running is not something I can do with this laptop. Here is an excerpt from the instructions:
make sure you have at least 60GB of memory
Doesn't look like the 8GB I have here is going to do. I'm currently working to get access to a computing cluster at ASU that will be able to put up a copy of Freebase.
