Interview with Christian Blaschke, scientific director of novoseek


This interview with Christian Blaschke, PhD, scientific director at novoseek, was originally published in Spanish under the title "Bioinformatics in the business world" on José María Fernández González’s blog. José María Fernández González is a bioinformatician at CNIO (Madrid, Spain) who has developed web services for iHOP. To share it with the rest of the scientific community and with English-speaking readers, we have translated it and published it here.



I’ve always wondered how bioinformatics-related development works in the business world, because the objectives are different. In the scientific world you almost always have to publish before your competitors, whereas in the business world the objectives are more related to the versatility and robustness of the tools and systems developed.

Therefore, when I got the opportunity to put a series of questions to someone who is “on the other side”, I grabbed the chance. Christian Blaschke, who works at Bioalma, answered my questions about the development of novoseek, a text-mining product.

Christian Blaschke is a graduate in Plant Physiology from the University of Salzburg and has a Ph.D. in Molecular Biology from the Autónoma University in Madrid. He began his career developing data-mining and information extraction systems in the Protein Design Group. Today he is the Research & Development Director and Principal Investigator in several European projects in which Bioalma takes part. He was also the coordinator of the first edition of BioCreAtivE, an assessment of text-mining algorithms. He has been conducting research in text mining for more than 10 years.

  1. In general, for ordinary people, what is novoseek?
  2. It is a Web 2.0 search engine for scientific literature and an alternative to PubMed for searching Medline, full-text articles from PubMedCentral and U.S. Grants. It is based on a unique text-mining technology that analyzes and processes the nearly 20 million publications available in PubMed and the 3 million concepts found in the literature. Our technology takes into account the synonyms and homonyms of the search term, which allows it to return relevant and complete results in the very first search. In addition, a profile (which appears in the left bar of the browser) is created for each search. This profile displays important concepts related to the search so that they can be used as filters to make the search more specific. Thus, the user finds the publications he needs to read in a simpler, faster and more reliable way.

  3. What was the original idea behind creating this tool?
  4. In the late 90s I was fortunate to work with Alfonso Valencia (then working at the National Center for Biotechnology in Madrid) on subjects dealing with text processing and information extraction. He was among the first to work on these subjects in the field of molecular biology and bioinformatics, and I was able to explore many ideas. At that time we were interested in extracting protein interactions and in analyzing DNA microarray results based on the knowledge published in the scientific literature. Later we realized we could offer the benefits of the technologies we had developed to a wider audience and find a way in which biomedical researchers could benefit from them. So at Bioalma we started working on products based on text analysis in the biomedical field. One could say that novoseek is the third generation of this product development and that we have now brought it online.

  5. How many people were necessary for the development of novoseek? Did they / Do they have highly specialized profiles (text mining, databases, etc …)?
  6. We started with a few people and there are currently a dozen of us actively participating in the development of novoseek. We are a multidisciplinary team that includes people trained in many areas, from software engineers, database development experts, bioinformaticians, biochemists and pharmacists to experts in artificial intelligence. In addition, we have long been dealing with texts and analyzing natural language. This is an area in which most of our team has experience.

  7. Are there critical points with the current tools and web systems, such as keeping the information updated and consistent? Did you have / Do you have many issues?
  8. At first it wasn’t easy because the set of documents included in PubMed was much larger than anything we had processed before in our work experience. But I have to say that we have a great team, and today we integrate documents published in PubMed (abstracts of publications) and PubMedCentral (full text) every day.

  9. How do you get feedback from regular users? I mean, do they propose interesting features, or do they help you detect problems or system failures?
  10. Novoseek is a service based on state-of-the-art technology. The people working in the company are quite young, know the internet well and are concerned with constantly improving the user experience. Therefore, user feedback is very important to us. We have opened discussion platforms, each with a particular role. In uservoice, users tend to make suggestions about new developments and usability. We study them and include them in our development “road map”. There are things that are easy to implement and take little time (like export to CiteULike) and others that we need to assess and that may take longer (such as search in figures and images). Twitter (@novoseek) is a tool we use for real-time communication with our users and to share information such as interesting publications, news and interesting links for our community, surveys, or more direct feedback. For example, I remember the time someone asked us if novoseek was down and within 5 minutes, 5 people (including us) told her that it was not.

    I admit that there is a subtle balance between what people want in the web-based service and what we think is good for efficient searches and a nice user experience. In general, user feedback helps us a lot.

  11. If today you had to design from scratch a tool with the same target as novoseek, with the background that you now have, what would you not do?
  12. Our professional education is very technical and this was reflected in our previous products. They were very powerful but sometimes too complex for our target audience. We thought that more (functionality) was better than less and we did not give enough consideration to the point of view of our users. For us this has been quite a journey in which we learned a lot. In recent months we have conducted many usability tests, and we realized that there are elements that are not clear enough. So we are currently working on a redesign of novoseek. This should help users better understand how it differs from PubMed and what it actually brings them.

  13. In the current scientific landscape of web 2.0, web services, bibliographic social networks (such as CiteULike, Zotero 2.0, …), etc., which is beginning to go beyond PubMed or Google Scholar, are you facing many challenges to link (or provide links) to these resources?
  14. Given our work and activity online, we know well the other web 2.0 tools that today are part of the life of a novoseek user. They are tools that we also use ourselves and that we consider important because they complement the service offered by novoseek. It is a requirement that we must meet so that people keep using novoseek. So far, we have done it for CiteULike, and it is pending for Zotero 2.0 and Mendeley. As these web 2.0 services grow in number and their use increases among scientists, novoseek has to become more compatible with them.

  15. Nearly all bioinformatics services today (either academic or commercial) offer programmatic APIs. What can you tell us about yours?
  16. For novoseek’s API we have used REST based on the XML standard because it is relatively simple to use and there are libraries for most programming languages available today.
    As for the functionality it offers, we tried to bring most things that can be done in novoseek to the API. One can do searches based on words and biological concepts (e.g. genes, diseases, drugs or chemicals) to retrieve documents. The documents offer all the entries included in novoseek and these can be used as a basis for new text-mining services. The API also offers the key concepts that are calculated for a search; these relate to the documents returned and characterize that set of documents.
    Our main goal is to offer the possibility of integrating the functionality of novoseek into other platforms, for example to enrich the content of web pages or blogs. Furthermore, it is now very common to do “mash-ups” between different systems to create something totally new. We wanted people to be able to use novoseek in new ways beyond what might occur to us. People who are interested can request an API key at http://api.novoseek.com

  17. What are the future plans for a tool like novoseek?
  18. In the future we want to extract more and more information from the documents indexed in novoseek to allow ever more powerful searches. One problem is that, e.g., in PubMed you cannot search for a person. If you search for “John Smith” the system will return documents where the name refers to different people. Or, in documents where “J Smith” appears as an author, you do not know whether it refers to “John Smith” or “Jeff Smith”. Another problem that requires a lot of work is finding specific information such as what drugs treat a disease or what the genetic causes of a disease are. We want to solve these problems for our users to save them time spent searching, so that they can devote it to actually reading the documents that are relevant to them.

  19. Can you tell us more about the infrastructure needed to provide this service?
  20. At first we set up novoseek on a small cluster of Linux machines installed in our offices in Madrid. But we realized that keeping a 24-hour service running with minimum disruption was not easy. We depended on a single Internet line that failed several times in the first months. The air conditioning system was not reliable enough and we could not withstand power outages of over 15 minutes. After evaluating many options, such as hosting machines in a data center or colocating our own hardware in one, we chose the web services offered by Amazon (known as AWS, Amazon Web Services, consisting of EC2 and S3). Amazon offers what is known today as “the cloud”, a system of virtual machines that are configured in a flexible way. It is easy to create more nodes to meet our growing needs and to pay only for what is actually used. The decision to migrate novoseek to the Amazon platform solved the problems I mentioned before, because it is a very stable environment that has not failed us so far.
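
To make the REST/XML API mentioned in answer 16 a little more concrete, here is a minimal Python sketch of how a client might build a search request and read documents and their concepts from an XML response. The interview does not describe the actual endpoints, parameters or XML schema, so everything named here (`BASE_URL`, the `q` and `key` parameters, the `<results>`, `<document>`, `<title>` and `<concept>` elements, and the sample data) is a hypothetical placeholder for illustration only.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names: the real API is not described in the interview.
BASE_URL = "http://api.example.org/search"


def build_search_url(query, api_key):
    """Build a GET URL for a search request (parameter names are assumptions)."""
    return BASE_URL + "?" + urlencode({"q": query, "key": api_key})


# Invented sample response; the element and attribute names are assumptions, not the real schema.
SAMPLE_RESPONSE = """<results>
  <document id="1">
    <title>Example title A</title>
    <concept type="gene">BRCA1</concept>
    <concept type="disease">breast cancer</concept>
  </document>
  <document id="2">
    <title>Example title B</title>
    <concept type="drug">tamoxifen</concept>
  </document>
</results>"""


def parse_documents(xml_text):
    """Turn the XML response into a list of dicts with id, title and concepts."""
    root = ET.fromstring(xml_text)
    documents = []
    for doc in root.findall("document"):
        documents.append({
            "id": doc.get("id"),
            "title": doc.findtext("title"),
            # Each concept is kept as a (type, name) pair, e.g. ("gene", "BRCA1").
            "concepts": [(c.get("type"), c.text) for c in doc.findall("concept")],
        })
    return documents


if __name__ == "__main__":
    print(build_search_url("p53 cancer", "YOUR_API_KEY"))
    for document in parse_documents(SAMPLE_RESPONSE):
        print(document["id"], document["title"], document["concepts"])
```

Because the response is plain XML, the same parsing step could feed a blog widget or a mash-up of the kind mentioned in the interview; only the element names would need to match whatever the real API returns.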

Thank you to José María Fernández González and Christian Blaschke for the time and dedication they gave to this interview.

You can get an API key here


Improvements in novoseek – March 2010


There have been several major improvements this month in novoseek:

  • Select the Publication Type from the Advanced Search panel
  • Users have been asking for it and it is now available when you are on the Advanced Search panel.

    TIP 1: Hold Ctrl (control) to select several Publication Types

    TIP 2: Learn more about the different Publication Types and their use when looking for scientific publications.
  • Complete authors list for each article
  • To give you more information about the authors of an article, we have updated the metadata of every publication with the complete list of authors.

    - In the search results page, you will see the first two authors and the last author of the publication. Check with a search example

    TIP: when you are looking for a specific author, this author will appear highlighted within the results and you will see 4 authors in total for every publication (the 3 mentioned previously plus the highlighted author you are looking for). Check with this example for Eley Robert

    - In the detail page of a publication, all of the authors are now listed. Check authors in a publication detail page

  • Disambiguation of authors
  • A common problem within the scientific literature is the broad range of text formatting, which affects author names too. Sometimes an author name is written as Last name, First name, sometimes as Last name, initial of First name, etc. We now index all the known aliases of an author to make searches for an author’s publications more comprehensive. Check direct example of author disambiguation

  • Better navigation from one search results page to another
  • Users suggested a more intuitive navigation menu at the bottom of search results pages to switch from one page to another. This is done!
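
The author alias indexing described under "Disambiguation of authors" can be illustrated with a small Python sketch. This is not novoseek's implementation, which the post does not describe; the function names, the set of name formats and the sample author ids are assumptions made for illustration.

```python
def name_aliases(last, first):
    """Return a few common written variants of an author name (an illustrative set only)."""
    initial = first[0]
    return {
        f"{last}, {first}",    # Last name, First name
        f"{last} {first}",     # Last name First name
        f"{last}, {initial}",  # Last name, initial
        f"{last} {initial}",   # Last name initial
        f"{first} {last}",     # First name Last name
    }


def build_alias_index(authors):
    """Map each lower-cased alias to the set of author ids that can be written that way."""
    index = {}
    for author_id, (last, first) in authors.items():
        for alias in name_aliases(last, first):
            index.setdefault(alias.lower(), set()).add(author_id)
    return index


if __name__ == "__main__":
    # Hypothetical author records keyed by an invented id.
    authors = {"a1": ("Smith", "John"), "a2": ("Smith", "Jeff")}
    index = build_alias_index(authors)
    print(index["smith, john"])  # only the author whose full name matches
    print(index["smith j"])      # an initial-only alias can map to several authors
```

Indexing every alias makes a search for any written form reach the same author, while an initial-only form such as "Smith J" still maps to more than one person and needs further disambiguation.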

We didn’t do it


There was quite a surprise yesterday on the World Wide Web when the redesigned version of PubMed was released all of a sudden, as Stephanie Fulton said on Twitter. However, it was almost a non-surprise, as it was taken down almost right away, which prompted Librarian EagleDawg to write about it. In fact, it looks like PubMed expected technical difficulties releasing the redesigned version of its search engine.

Guys, we would like all PubMed users to know that we (novoseek) are not responsible at all for this and that we did not touch or unplug PubMed at any moment ;-) .


[Image: pubmedVSnovoseek2]

A few days left: Win Amazon gift cards. Take the novoseek survey


Take the novoseek survey and win Amazon gift cards!

We would like to remind you that the novoseek survey will close in a few days, so hurry up and take it to enter the drawing to win one of the 10 Amazon gift cards worth $25 each.

We guarantee you that it takes less than 10 minutes ;)

Thanks in advance to you all for your help… Good luck!

From The Cloud


Some days ago, we finished the migration of our production site to the Cloud, more precisely to Amazon EC2. I do not know if “Cloud Computing” needs defining, but in any case I invite you to watch this wonderful video made by the people of Salesforce.com, in which you can find an easy and intuitive definition of Cloud Computing and a list of its benefits.

What benefits does the Cloud offer a search engine like novo|seek? I am probably going to repeat most of the arguments listed in the video, but anyhow:

  • Cost reduction: in the Cloud we pay for what we use: CPU time, storage, bandwidth…
  • Easy scaling: for a growing search engine like novo|seek, scalability is critical. For us, the user experience is very important, and thus so is the QoS (Quality of Service). Dimensioning the servers of an emerging web site is a hard task. If you fall short, any marketing or PR action that drives a lot of traffic to the site can bring the servers to their knees. On the other hand, if you over-dimension your infrastructure, you will have servers getting old inside your data center. Amazon EC2 lets us re-dimension our production infrastructure as our traffic grows.
  • Reduction of the time-to-market and the entry barriers to innovation: the EC2 infrastructure lets us create new server instances quickly and easily, with different sizes and performance. If we need to try new text-mining algorithms or expand our technology to new data sources, the Cloud will allow us to instantiate all the servers required to meet our extra computing needs, and we can forget about finding new room in our crowded data center. Cloud computing lowers the entry barriers to innovation for small and medium-sized companies like us.

We know that we are not the only company in the sector to take advantage of Cloud Computing. BioTeam, for example, is adapting bioinformatics solutions so they can be run on Amazon EC2.

Small and medium-sized companies are not the only ones in The Cloud: big pharmas like Johnson & Johnson or Lilly are developing their first projects on EC2, although a recent report from McKinsey stated that Cloud Computing will not reduce costs for large corporations.

The novo|seek team is sure that moving to The Cloud will improve the quality of the service that we are currently providing, and will let us bring to all of you the innovative features cooking right now in our R&D department.

Greetings from the cloud!