Interview with Christian Blaschke, scientific director of novoseek


This interview with Christian Blaschke, PhD, scientific director at novoseek, was originally published in Spanish under the title “Bioinformatics in the business world” on José María Fernández González’s blog. José María Fernández González is a bioinformatician at CNIO (Madrid, Spain) who has developed web services for iHOP. To share it with the rest of the scientific community and with English-speaking readers, we have translated it and published it here.



I’ve always wondered how bioinformatics-related development works in the business world, because the objectives are different. In the scientific world you almost always have to publish before your competitors, whereas in the business world the objectives have more to do with the versatility and robustness of the tools and systems developed.

Therefore, when I got the opportunity to put a series of questions to someone who is “on the other side”, I took it. Christian Blaschke, who works at Bioalma, answered my questions about the development of novoseek, a text-mining product.

Christian Blaschke is a graduate in Plant Physiology from the University of Salzburg and has a Ph.D. in Molecular Biology from the Autónoma University in Madrid. He began his career developing data-mining and information extraction systems in the Protein Design Group. Today he is the Research & Development Director and Principal Investigator in several European projects in which Bioalma takes part. He was also the coordinator of the first edition of BioCreAtivE, an assessment of text-mining algorithms. He has been conducting research in text mining for more than 10 years.

  1. In general, for ordinary people, what is novoseek?
  2. It is a Web 2.0 search engine for scientific literature and an alternative to PubMed for searching Medline, full-text articles from PubMed Central and U.S. grants. It is based on a unique text-mining technology that analyzes and processes the nearly 20 million publications available in PubMed and the 3 million existing concepts in the literature. Our technology analyzes and takes into account the synonyms and homonyms of the search term, which allows it to return relevant and complete results in the very first search. In addition, a profile (which appears in the left bar of the browser) is created for each search. This profile displays important concepts related to the search so that they can be used as filters to make the search more specific. Thus, the user finds the publications he needs to read in a simpler, faster and more reliable way.

  3. What was the original idea behind this tool?
  4. In the late 90s I was fortunate to work with Alfonso Valencia (then working at the National Center for Biotechnology in Madrid) on subjects dealing with text processing and information extraction. He was among the first to work on these subjects in the field of molecular biology and bioinformatics and I was able to explore many ideas. At that time we were interested in extracting protein interactions and in analyzing DNA microarray results based on the knowledge published in the scientific literature. Later we realized we could offer the benefits of the technologies we had developed to a wider audience and find a way in which biomedical researchers could benefit from them. So at Bioalma we started working on products based on text analysis in the biomedical field. One could say that novoseek is the third generation of this product development and that we have now brought it online.

  5. How many people were necessary for the development of novoseek? Did they / Do they have highly specialized profiles (text mining, databases, etc.)?
  6. We started with a few people and there are currently a dozen of us participating actively in the development of novoseek. We are a multidisciplinary team that includes people trained in many areas: software engineers, database development experts, bioinformaticians, biochemists, pharmacists and experts in artificial intelligence. In addition, we have long been dealing with texts and analyzing natural language. This is an area in which most of our team has experience.

  7. Are there critical points with the current tools and web systems, such as keeping the information updated and consistent? Did you have / Do you have many issues?
  8. At first it wasn’t easy because the set of documents included in PubMed was much larger than anything we had processed before in our work experience. But I have to say that we have a great team and today we integrate documents published in PubMed (abstracts of publications) and PubMed Central (full text) every day.

  9. How do you get feedback from regular users? I mean, do they propose interesting features, or do they help you detect problems or system failures?
  10. Novoseek is a service based on state-of-the-art technology. The people working in the company are quite young, know the internet well and are concerned with constantly improving the user experience, so user feedback is very important to us. We have opened discussion platforms, each with a particular role. On UserVoice, users tend to make suggestions about new developments and usability. We study them and include them in our development “road map”. Some things are easy to implement and take little time (like export to CiteULike) and others need to be assessed and may take longer (such as search in figures and images). Twitter (@novoseek) is a tool we use for real-time communication with our users and to share information such as interesting publications, news and links for our community, surveys, or more direct feedback. For example, I remember the time someone asked us if novoseek was down and within 5 minutes, 5 people (including us) told her that it was not.

    I admit that there is a subtle balance between what people want in the web-based service and what we think is good for efficient searches and a nice user experience. In general, user feedback helps us a lot.

  11. If today you had to start from scratch the design of a tool with the same target as novoseek, having the background that you now have, what would you not do?
  12. Our professional education is very technical and this was reflected in our previous products. They were very powerful but sometimes too complex for our target audience. We thought that more (functionality) was better than less and we did not give enough consideration to the point of view of our users. For us this has been quite a journey in which we learned a lot. In recent months we have conducted many usability tests, and we realized that there are elements that are not clear enough. So we are currently working on a redesign of novoseek. This should help users better understand how it differs from PubMed and what it actually brings to them.

  13. In the current scientific landscape of web 2.0, web services and bibliographic social networks (such as CiteULike, Zotero 2.0, etc.), which is beginning to go beyond PubMed or Google Scholar, are you facing many challenges in linking (or providing links) to these resources?
  14. Given our work and activity online, we are very familiar with the other web 2.0 tools that today are part of the life of a novoseek user. They are tools we also use ourselves and that we consider important because they complement the service offered by novoseek. Integrating with them is a requirement we must meet so that people keep using novoseek. So far, we have done it for CiteULike and it is pending for Zotero 2.0 and Mendeley. As these web 2.0 services grow in number and their use increases among scientists, novoseek has to become more compatible with them.

  15. Nearly all bioinformatics services today (either academic or commercial) offer programmatic APIs. What can you tell us about yours?
  16. For novoseek’s API we have used REST based on the XML standard because it is relatively simple to use and there are libraries for most programming languages available today.
    As for the functionality it offers, we tried to bring most things that can be done in novoseek to the API. One can do searches based on words and biological concepts (such as genes, diseases, drugs or chemicals) to retrieve documents. The documents offer all the entries included in novoseek and these can be used as a basis for new text-mining services. The API also offers the key concepts that are calculated for a search, which relate to the documents returned and characterize this set of documents.
    Our main goal is to offer the possibility to integrate the functionality of novoseek into other platforms, for example to enrich the content of web pages or blogs. Furthermore, it is now very common to do “mash-ups” between different systems to create something totally new. We wanted people to be able to use novoseek in new ways beyond what might occur to us. People interested can request an API key at http://api.novoseek.com

  17. What are the future plans for a tool like novoseek?
  18. In the future we want to extract more and more information from the documents indexed in novoseek to allow ever more powerful searches. One problem is that in PubMed, for example, you cannot search for a person. If you search for “John Smith” the system will return documents where the name refers to different people. Or in documents where “J Smith” appears as an author, you do not know if it refers to “John Smith” or “Jeff Smith”. Another problem that requires a lot of work is finding specific information such as what drugs treat a disease or what the genetic causes of a disease are. We want to solve these problems for our users to save them the time they spend searching, so that they can devote it to actually reading the documents that are relevant to them.

  19. Can you tell us more about the infrastructure needed to provide this service?
  20. At first we set up novoseek on a small cluster of Linux machines installed in our offices in Madrid. But we realized that keeping a 24-hour service running with minimum disruption was not easy. We were depending on a single Internet line that failed several times in the first months. The air conditioning system was not reliable enough and we could not withstand power outages of over 15 minutes. After evaluating many options, such as renting machines in a data center or colocating our own hardware in one, we chose the web services offered by Amazon (known as AWS, Amazon Web Services, in our case EC2 and S3). Amazon offers what is known today as “the cloud”, a system of virtual machines that are configured in a flexible way. It is easy to create more nodes to meet our growing needs and we pay only for what is actually used. The decision to migrate novoseek to the Amazon platform solved the problems I mentioned before because it is a very stable environment that has not failed us so far.
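
To give a rough idea of what working with a REST API that returns XML (as described in answer 16) can look like, here is a minimal Python sketch. It is only an illustration: the base URL, the parameter names (q, key) and the XML layout of the sample response are assumptions made up for this example, not novoseek’s documented API. It builds a search URL, parses a sample XML response into documents, and counts the concepts that characterize the result set.

```python
from collections import Counter
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical base URL and parameter names; the real API is not
# described in enough detail in the interview to reproduce it.
BASE_URL = "http://api.example.org/search"

# Hypothetical response layout, invented for this illustration.
SAMPLE_RESPONSE = """<results>
  <document id="1">
    <title>BRCA1 and breast cancer</title>
    <concept type="gene">BRCA1</concept>
    <concept type="disease">breast cancer</concept>
  </document>
  <document id="2">
    <title>Tamoxifen in breast cancer therapy</title>
    <concept type="drug">tamoxifen</concept>
    <concept type="disease">breast cancer</concept>
  </document>
</results>"""


def build_search_url(query, api_key, base_url=BASE_URL):
    """Build a GET URL for a search (assumed parameters: q, key)."""
    return base_url + "?" + urlencode({"q": query, "key": api_key})


def parse_documents(xml_text):
    """Turn the XML response into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.findall("document"):
        docs.append({
            "id": doc.get("id"),
            "title": doc.findtext("title"),
            "concepts": [(c.get("type"), c.text) for c in doc.findall("concept")],
        })
    return docs


def key_concepts(docs):
    """Count how often each (type, name) concept occurs across documents."""
    counts = Counter(concept for d in docs for concept in d["concepts"])
    return counts.most_common()


if __name__ == "__main__":
    print(build_search_url("BRCA1", "YOUR_API_KEY"))
    for concept, n in key_concepts(parse_documents(SAMPLE_RESPONSE)):
        print(concept, n)
```

In a real mash-up the sample string would be replaced by the body of an HTTP response, and the element names would have to follow whatever the actual API returns.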

Thanks to José María Fernández González and Christian Blaschke for the time and dedication they gave to this interview.

You can get an API key here


Open access vs Free access


We have recently added new articles from PubMed Central to novoseek. This new feature provides the ability to access “full text publications”, and we have noticed that there is a fair amount of misunderstanding regarding what has actually been indexed. So let us explain it in detail.

We have included the Open Access subset of PubMed Central. What is that? Well, Open Access is free online access to research papers. This definition has led to some confusion and misuse of the term “open” access, as it is often considered a synonym for “free” access.

The first definition of open access came out of the Budapest Open Access Initiative and was later revised in Bethesda and Berlin. This led to what Peter Suber calls the BBB open access definition, on which most of the Open Access movement agrees.
The Open Access definition rests on two ideas:

  • Access free of charge
  • Removal of permission barriers

Together, these ideas allow anyone to distribute, copy and produce derivative works.

Interestingly, we’ve observed that most of the time, open access is used as a synonym for free access. This is not quite correct, since open access goes beyond just free access to content. For a better understanding of the differences between them, have a look at the graphic below.

[Figure: open access compared with free access]

PubMed Central is a free digital archive of peer-reviewed biomedical and life sciences literature developed and managed by the NIH. It gives free access to articles, some of which are open access. As we have discussed in previous posts, the NIH public access policy has ensured access to the published results of NIH-funded research. However, it does not say whether this has to be through a free access or an open access policy.

In novoseek, we have analyzed the full text of the open access subset with our text-mining algorithms and made it public. So now you will find full-text articles in which you will be able to highlight all the relevant keywords and enjoy the great features of our technology.

We hope you like this new data set, and we warmly welcome your comments and suggestions.

Free access to biomedical knowledge

Dodo bird

The NIH’s recent measure responds to a paradox in the field of scientific publications.  It is the classic “chicken or egg” scenario.

First, much of the content generation is done at no cost to the “publishing house”.  Researchers, sponsored mainly by public institutions and driven by their own interests and search for knowledge, create new research.  The new content creation is usually based on “free” generation, collaboration and assessment of content.

Second, the publishing companies then add value by managing, editing and distributing that content. In this process, “end-user” scientists have willingly paid publishers for the ability to access this “approved” content.

So the paradox before us is…if the free content is based upon paid content (made available through publishing houses), should the new content be made available for free or follow the publishing payment model?

In fact, there is no black and white answer – but new technologies are creating an even more heated debate.

New technologies and the Internet have simplified the editing  and distribution processes, opening new possibilities for additional  formats and business models, such as open access publications, in  which access to contents is free and redistribution is at your fingertips.  Will the new technologies be able to handle the role that traditional publishers have successfully handled for nearly 100 years in the areas of managing, editing and distributing that content?

In the new model – what becomes of peer review?  In  the traditional method, public institutions fund the management and the peer-review  editing process, since these publications charge researchers for the  distribution and circulation of their work once it is considered  scientifically relevant.

Through PubMed Central, the NIH has created a centralized system for distributing scientific works that have already undergone a peer-review process, and has made NIH-sponsored scientists add a clause to their copyright release to publishing companies so that the final version of their work can be placed in this repository within 12 months after publication. Congress’s proposal seeks to prevent this kind of measure.

We must take into account that the research process relies heavily on the continued publication of new results, which every day become the basis for additional discoveries by scientists. Better access to information and easier exchanges, such as those the Internet offers, would accelerate research, as mentioned by Mr. Akst in The WSJ: “knowledge dissemination is crucial for the creation of wealth and can’t reproduce in isolation.”

It is true that managing a peer-review process is necessary if we want to maintain the quality and excellence of scientific works, but it is in any case a process in which scientists collaborate voluntarily, and one for which new technologies have enabled a new, easier distribution. And if there is an entity willing to lead the centralization of content, is it sensible to approve measures that do not favor this access to content?

Unfortunately I don’t have an answer.  To deny access to valuable medical information just doesn’t seem right.  Neither does denying a commercial business the right to operate when it provides a valuable service.

So – chicken or egg?  All I know is this debate is not likely to go the way of the dodo bird anytime soon.