The importance of context in text disambiguation

Some time ago, we explained how novoseek interprets a query and returns relevant publications regardless of which synonym is used in the article or in the query. Using synonyms to extend a search makes possible one of the user’s main goals, and ours as well: finding the best and most comprehensive information on a research area. This seemed all the more important after TechCrunch recently pointed out that NetBase was returning irrelevant, and sometimes outright inconvenient, results because of severe problems in its text-mining techniques and semantic knowledge.

However, the path to returning accurate and comprehensive information to the end user is a tricky one. Once the synonyms of a query word have been analyzed, a second challenging problem arises: disambiguating homonyms.

Homonyms are terms with the same spelling but different meanings. When a search is performed, many of the potential results may deal with a completely different area of interest. This forces the user to try new queries and to check that the system understands the query correctly, so that further searches can be avoided.

Obviously, this takes a long time, and it could be summed up in one sentence: “If only the search engine knew the meaning of the search term, this process could be reduced to minutes.”

How is the homonym disambiguation process performed?
Novoseek looks for the word in the literature and, based on the semantic role of the word in the sentence and an analysis of the context, assigns it to an entry in our built-in biomedical dictionary. Below is a sample image showing the context of the spot, with an extract of an article found for BRCA1.

[Image: spot_context]

As a result of the analysis, we are able to determine whether a document is on-topic or off-topic. For example, CAT is a gene symbol of the human gene catalase, but it is also a homonym for cat the animal and for carnitine acetyltransferase. This means that if “CAT” appears in a document, a system based on text mining has to decide which concept it actually refers to and disambiguate the symbol before proceeding to any higher-level analysis steps.

[Image: CAT]
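
To make the CAT example more concrete, here is a minimal sketch in Python of context-based disambiguation. It is not novoseek's actual implementation: the `SENSES` dictionary, its cue words, and the `disambiguate` function are hypothetical names chosen for illustration, and simple word overlap stands in for the semantic-role and context analysis described above.

```python
import re

# Hypothetical mini-dictionary: each sense of "CAT" with hand-picked context cues.
# A real biomedical dictionary would be far larger and curated.
SENSES = {
    "catalase (human gene)": {"catalase", "peroxide", "oxidative", "antioxidant", "enzyme"},
    "cat (animal)": {"feline", "pet", "veterinary", "kitten", "animal"},
    "carnitine acetyltransferase": {"carnitine", "acetyltransferase", "acetyl", "mitochondrial"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def disambiguate(sentence, senses=SENSES):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word is present, i.e. the context gives
    no evidence for any sense.
    """
    words = tokenize(sentence)
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("CAT activity protects cells from hydrogen peroxide and oxidative stress."))
```

Returning `None` when there is no supporting context is a deliberate choice in this sketch: it is safer to leave a mention unassigned than to attach a document to the wrong concept.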

Furthermore, ambiguity can arise because the same gene can have the same name in different organisms. The analysis of context information must therefore be able to tell which organism is being referred to. At this level, it is crucial for a text mining system to get the analysis right and to associate with a given biological entity only those documents that actually mention it. Errors at this level would propagate throughout the system, and the end result presented to the user would be wrong.

[Image: novoseek_process_homonyms]
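
The organism question can be illustrated in the same spirit. The sketch below is an assumption-laden toy, not a description of our pipeline: `ORGANISM_CUES` and `guess_organism` are hypothetical, and counting species words in the surrounding text is only one simple way to decide which organism a gene mention belongs to.

```python
import re
from collections import Counter

# Hypothetical cue words per organism; a real system would rely on a
# full species dictionary rather than this short hand-made list.
ORGANISM_CUES = {
    "Homo sapiens": {"human", "humans", "patient", "patients"},
    "Mus musculus": {"mouse", "mice", "murine"},
    "Rattus norvegicus": {"rat", "rats"},
}


def guess_organism(text):
    """Return the organism with the most cue words in the text, or None."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for organism, cues in ORGANISM_CUES.items():
        counts[organism] = sum(1 for token in tokens if token in cues)
    organism, hits = counts.most_common(1)[0]
    return organism if hits > 0 else None


print(guess_organism("Brca1 knockout mice develop tumors; murine cells were analyzed."))
```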

In regular search engines you get all documents for a query term, regardless of its meaning. With novoseek you can focus on the meaning you want for your term and retrieve just the documents you are looking for.
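
The difference between the two kinds of search can be sketched as follows. The documents, the concept identifiers such as `gene:catalase`, and both functions are invented for this example; the only point is that once each mention has been resolved to a concept, retrieval can work on concepts instead of raw strings.

```python
# Hypothetical documents, each already annotated with the concept that
# its "CAT" mention was resolved to during text analysis.
DOCUMENTS = [
    {"id": 1, "text": "CAT protects cells from peroxide.", "concepts": {"gene:catalase"}},
    {"id": 2, "text": "The cat is a popular pet.", "concepts": {"organism:cat"}},
    {"id": 3, "text": "CAT binds carnitine.", "concepts": {"gene:carnitine_acetyltransferase"}},
]


def search_by_string(term, documents=DOCUMENTS):
    """Plain keyword search: every document containing the term, whatever it means."""
    return [doc["id"] for doc in documents if term.lower() in doc["text"].lower()]


def search_by_concept(concept_id, documents=DOCUMENTS):
    """Concept search: only documents whose mention was resolved to this concept."""
    return [doc["id"] for doc in documents if concept_id in doc["concepts"]]


print(search_by_string("cat"))
print(search_by_concept("gene:catalase"))
```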

The text analysis is just one of the first steps in novoseek’s text mining technology. The results of these analyses have to be structured and delivered to the user in a fast and easy way. But we’ll talk about this in another post.
