This interview with Christian Blaschke, PhD, scientific director at novoseek, was originally published in Spanish under the title “Bioinformatics in the business world” on José María Fernández González’s blog. José María Fernández González is a bioinformatician at CNIO (Madrid, Spain) who has developed web services for iHOP. To share it with the rest of the scientific community and with English-speaking readers, we have translated it and published it here.
I’ve always wondered how bioinformatics development works in a business setting, because the objectives are different. In the scientific world you almost always have to publish before your competitors, whereas in the business world the objectives have more to do with the versatility and robustness of the tools and systems developed.
Therefore, when I got the opportunity to put a series of questions to someone who is “on the other side”, I grabbed it. Christian Blaschke, who works at Bioalma, answered my questions about the development of novoseek, a text-mining product.
Christian Blaschke is a graduate in Plant Physiology from the University of Salzburg and has a Ph.D. in Molecular Biology from the Autónoma University in Madrid. He began his career developing data-mining and information-extraction systems in the Protein Design Group. Today he is the Research & Development Director and Principal Investigator in several European projects in which Bioalma takes part. He was also the coordinator of the first edition of BioCreAtivE, an assessment of text-mining algorithms. He has been conducting research in text mining for more than 10 years.
- In general, for ordinary people, what is novoseek?
- What was the original idea behind creating this tool?
- How many people were necessary for the development of novoseek? Did they / Do they have highly specialized profiles (text mining, databases, etc.)?
- Are there critical points with current tools and web systems, such as keeping the information updated and consistent? Did you have / Do you have many issues?
- How do you get feedback from regular users? I mean, do they propose interesting features, or do they help you detect problems or system failures?
- If you had to design a tool with the same goal as novoseek from scratch today, with the experience you now have, what would you not do?
- The current scientific landscape of web 2.0, web services, bibliographic social networks (such as CiteULike, Zotero 2.0, …), etc. is beginning to go beyond PubMed or Google Scholar. Are you facing many challenges in linking (or providing links) to these resources?
- Nearly all bioinformatics services today (either academic or commercial) offer programmatic APIs. What can you tell us about yours?
- What are the future plans for a tool like novoseek?
- Can you tell us more about the infrastructure needed to provide this service?
It is a Web 2.0 search engine for scientific literature and also an alternative to PubMed for searching in Medline, in full-text articles from PubMedCentral and in U.S.
In the late 90s I was fortunate to work with Alfonso Valencia (then working at the National Center for Biotechnology in Madrid) on subjects dealing with text processing and information extraction. He was among the first to work on these subjects in the field of molecular biology and bioinformatics, and I was able to explore many ideas. At that time we were interested in extracting protein interactions and in analyzing DNA microarray results based on the knowledge published in the scientific literature. Later we realized we could offer the technologies we had developed to a wider audience and find a way for biomedical researchers to benefit from them. So at Bioalma we started working on products based on text analysis in the biomedical field. One could say that novoseek is the third generation of this line of products and that we have now brought it online.
We started with a few people and there are currently a dozen of us actively taking part in the development of novoseek. We are a multidisciplinary team that includes people trained in many areas, from software engineers, database development experts, bioinformaticians, biochemists and pharmacists to experts in artificial intelligence. In addition, we have long been dealing with texts and analyzing natural language. This is an area in which most of our team has experience.
At first it wasn’t easy because the set of documents included in PubMed was much larger than anything we had processed before. But I have to say that we have a great team, and today we integrate documents published in PubMed (abstracts of publications) and PubMedCentral (full text) every day.
Novoseek is a service based on state-of-the-art technology. The people working in the company are quite young, they know the internet well and they are concerned with constantly improving the user experience. Therefore, our users’ feedback is very important to us. We have opened discussion platforms that each have a particular role. In uservoice, users tend to send us suggestions about new developments and usability. We study them and include them in our development “road map”. There are things that are easy to implement and take little time (like export to CiteULike) and others that we need to assess and that may take longer (such as search in figures and images). Twitter (@novoseek) is a tool we use for real-time communication with our users and to share information such as interesting publications, news and links for our community, surveys, or more direct feedback. For example, I remember the time someone asked us if novoseek was down, and within 5 minutes 5 people (including us) told her that it was not.
I admit that there is a subtle balance between what people want in the web-based service and what we think is good for efficient searches and a nice user experience. In general, user feedback helps us a lot.
Our professional education is very technical and this was reflected in our previous products. They were very powerful but sometimes too complex for our target audience. We thought that more (functionality) was better than less and we did not sufficiently consider our users’ point of view. For us this has been quite a journey in which we learned a lot. In recent months we have conducted many usability tests, and we realized that some elements are not clear enough. So we are currently working on a redesign of novoseek. This should help users better understand how it differs from PubMed and what it actually brings them.
Given our work and activity online, we know well the other web 2.0 tools that are part of a novoseek user’s life today. They are tools that we also use ourselves and that we consider important because they complement the service offered by novoseek. Supporting them is a requirement we must meet so that people keep using novoseek. So far, we have done it for CiteULike, and it is pending for Zotero 2.0 and Mendeley. As these web 2.0 services grow in number and their use increases among scientists, novoseek has to become more compatible with them.
For novoseek’s API we have used REST with XML because it is relatively simple to use and libraries are available for most of today’s programming languages.
As for the functionality it offers, we tried to bring most things that can be done in novoseek to the API. One can run searches based on words and biological concepts (such as genes, diseases, drugs or chemicals) to retrieve documents. The documents offer all the entries included in novoseek, and these can be used as a basis for new text-mining services. It also returns the key concepts calculated for a search, which relate to the documents returned and characterize that set of documents.
Our main goal is to make it possible to integrate the functionality of novoseek into other platforms, for example to enrich the content of web pages or blogs. Furthermore, it is now very common to do “mash-ups” between different systems to create something totally new. We wanted people to be able to use novoseek in new ways beyond what might occur to us. People who are interested can request an API key at http://api.novoseek.com
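As a rough illustration of what working with a REST/XML search API of this kind can look like, here is a minimal Python sketch. It is not novoseek’s actual API: the base URL, the parameter names (`q`, `key`, `type`) and the XML layout of the sample response are all invented for the example, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "http://api.example.org/search"


def build_search_url(query, api_key, concept_type=None):
    """Build the URL for a search request (assumed parameter names)."""
    params = {"q": query, "key": api_key}
    if concept_type is not None:
        params["type"] = concept_type
    return BASE_URL + "?" + urlencode(params)


# Invented response layout: documents plus the key concepts of the result set.
SAMPLE_RESPONSE = """<results query="aspirin">
  <document id="1"><title>First example title</title></document>
  <document id="2"><title>Second example title</title></document>
  <concept type="drug">aspirin</concept>
  <concept type="disease">headache</concept>
</results>"""


def parse_results(xml_text):
    """Extract (id, title) pairs and (type, name) concept pairs from the XML."""
    root = ET.fromstring(xml_text)
    documents = [(d.get("id"), d.findtext("title")) for d in root.findall("document")]
    concepts = [(c.get("type"), c.text) for c in root.findall("concept")]
    return {"documents": documents, "concepts": concepts}


url = build_search_url("breast cancer", "MY_API_KEY", concept_type="disease")
parsed = parse_results(SAMPLE_RESPONSE)
print(url)
print(parsed["documents"])
print(parsed["concepts"])
```

Because the response is plain XML, the same two steps (build a URL, parse the returned tree) carry over to most other programming languages with a standard HTTP and XML library.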
In the future we want to extract more and more information from the documents indexed in novoseek to allow ever more powerful searches. One problem is that in PubMed, for example, you cannot search for a person. If you search for “John Smith” the system will return documents where the name refers to different people. Or in documents where “J Smith” appears as an author, you do not know whether it refers to “John Smith” or “Jeff Smith”. Another problem that requires a lot of work is finding specific information, such as which drugs treat a disease or what the genetic causes of a disease are. We want to solve these problems for our users to save them time spent searching, so that they can devote it to actually reading the documents that are relevant to them.
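The author-name problem can be made concrete with a small Python sketch. It only illustrates the ambiguity and is not a description of how novoseek handles it: the function names and the sample names are hypothetical, and names are assumed to be written as “First Last”.

```python
from collections import defaultdict


def author_key(full_name):
    """Reduce 'John Smith' to 'J Smith', the abbreviated form seen in author lists."""
    parts = full_name.split()
    return f"{parts[0][0].upper()} {parts[-1]}"


def ambiguous_keys(full_names):
    """Return the abbreviated names that map to more than one distinct full name."""
    groups = defaultdict(set)
    for name in full_names:
        groups[author_key(name)].add(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


authors = ["John Smith", "Jeff Smith", "Anna Jones"]
clashes = ambiguous_keys(authors)
print(clashes)
```

Here “J Smith” is shared by two different people, so the abbreviated author field alone cannot tell them apart; some additional evidence from the document is needed to decide.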
At first we set up novoseek on a small cluster of Linux machines installed in our offices in Madrid. But we realized that keeping a 24-hour service running with minimal disruption was not easy. We depended on a single Internet line that failed several times in the first months. The air conditioning system was not reliable enough and we could not withstand power outages of more than 15 minutes. After evaluating many options, such as renting machines in a data center or colocating our own hardware in one, we chose the web services offered by Amazon (known as AWS, Amazon Web Services, which include EC2 and S3). Amazon offers what is known today as “the cloud”, a system of virtual machines that can be configured flexibly. It is easy to create more nodes to meet our growing needs, and we pay only for what is actually used. The decision to migrate novoseek to the Amazon platform solved the problems I mentioned before, because it is a very stable environment that has not failed us so far.
Thanks to José María Fernández González and Christian Blaschke for the time and dedication they gave to this interview.

