- I have downloaded the workflow file, what next?
A workflow file captures the design of a workflow in XML format. You need Taverna (http://taverna.sourceforge.net) to design, redesign, and run workflows. We also plan to make it possible to run workflows directly from myExperiment.org.
- Why am I getting no or very few results?
You can try increasing ‘maxHits’ (the maximum number of abstracts returned), but in our experience the usual cause is a search result (the list of relevant abstracts) that contains few proteins. Sometimes the best hits have no abstract or only a very short one, and hence few proteins. Often the abstracts are of the wrong ‘flavour’ (e.g. clinical instead of molecular). You can try to select against these flavours (e.g. add ‘-clinical’ to your query). You can also try the new document index ‘MedLine_new’, which includes MeSH terms (e.g. add ‘+mesh:[some mesh term]’).
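The query refinements above can be sketched in a few lines of Python. This is only an illustration of how the query string is put together: the function name `refine_query` is hypothetical, and quoting a multi-word MeSH term as a phrase is an assumption based on standard Lucene syntax, not something the service documents. The ‘mesh’ field is assumed to be available only in the ‘MedLine_new’ index.

```python
def refine_query(base, exclude=(), mesh_terms=()):
    """Append Lucene prohibit (-) and required mesh-field (+mesh:) clauses.

    Hypothetical helper: it only builds the query string that you would
    paste into the workflow's query input.
    """
    parts = [base.strip()]
    # '-term' asks Lucene to drop documents that contain the term.
    parts += [f"-{term}" for term in exclude]
    # '+mesh:"..."' requires the term in the 'mesh' field (assumed field name,
    # taken from the FAQ text; phrase quoting is an assumption).
    parts += [f'+mesh:"{term}"' for term in mesh_terms]
    return " ".join(parts)


print(refine_query("alzheimer", exclude=["clinical"]))
print(refine_query("alzheimer", mesh_terms=["Amyloid beta-Peptides"]))
```

The first call yields `alzheimer -clinical`; the second adds a required MeSH clause and should be used together with the ‘MedLine_new’ document index.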
- I increased maxHits and now the workflow fails; why?
The most likely cause is a memory limitation. You can try increasing the Java heap size (see the Java or Taverna help), but you will probably have to decrease maxHits. The bottleneck is the way Taverna 1 and SOAP web services handle data transfer, combined with the memory limits of Java on your computer. This is addressed in version 2 of Taverna (not out yet), and we are looking at grid-based solutions.
- Do I need to include synonyms in my search query?
Character case variations are catered for. For protein synonyms you can add our protein synonym workflow to a workflow (service and data courteously provided by Martijn Schuemie). Because these workflows do not provide synonyms for every type of entity, we generally do not include them in our more general workflows.
- Why not use NCBI’s PubMed services (eUtils such as eFetch) for retrieving relevant documents? Would that not be better than a service based on Lucene?
Our main objective is to allow customization of the text mining process. The AIDA service has ‘document_index’ as an input, which allows us to choose indexes other than the one for Medline. We are working on a workflow and a Taverna extension that allow you to create your own index. NCBI’s PubMed services may be more optimized for Medline. Note that we found some eUtils hard to get started with in Taverna; for working examples of text mining, look out for workflows by Paul Fisher on myExperiment.org.
- Are documents retrieved from the latest version of Medline?
We cannot guarantee that at the moment. Unlike PubMed, we do not automatically update our index. We try to update it often and can do so on demand. See also the PubMed vs Lucene question.
- Why don’t you extract diseases directly? Why go via proteins and OMIM?
At the moment we have discovery models for ‘named entity recognition’ (NER) trained on news and genomics entities. The latter include protein names, which suit many of our biological needs and are discovered best. We are working on workflows to train NER, for instance on disease names (see future developments).
- Can I search for anything?
In principle you can. For more complicated queries you have to use the Lucene syntax. Note that not every query will return abstracts that contain proteins for protein discovery.
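One practical consequence of using the Lucene syntax is that characters such as ‘-’, ‘(’ or ‘:’ have a special meaning, so a search term like ‘IL-2’ may need escaping. The sketch below backslash-escapes the characters that the classic Lucene query parser treats as special. The helper name `escape_term` is hypothetical, and which Lucene version the service runs is not stated here, so treat the character set as an assumption.

```python
import re

# Characters treated as special by the classic Lucene query parser
# (assumed set; check it against the Lucene version behind the service).
_LUCENE_SPECIAL = re.compile(r'[+\-!(){}\[\]^"~*?:\\&|]')


def escape_term(text):
    """Backslash-escape Lucene special characters in a literal search term."""
    return _LUCENE_SPECIAL.sub(r"\\\g<0>", text)


# Combine two escaped terms with a Boolean operator.
query = f"{escape_term('IL-2')} AND {escape_term('BRCA1(human)')}"
print(query)
```

Escaping only matters for literal terms; operators you intend as Lucene syntax (such as a leading ‘-’ to exclude a term) should be added after escaping.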
- Is the ontology of the RDF workflow extracted fully automatically?
No, we have created a ‘template ontology’ with basic concepts such as protein and disease. We want to use BioAID workflows to do the repetitive work, such as filling in the instances.
- Can only proteins be discovered?
No, we filter protein-tagged entities from a larger set of genomic tags. However, proteins do seem to be discovered best. Besides genomic terms, our current named-entity recognizer can also discover news entities. We are working on workflows that allow you to train a named-entity recognizer to discover entities of your choosing (see future developments).
- Am I free to use and change your workflows any way I like?
You are, but we ask you to comply with the Creative Commons Attribution-Share Alike 3.0 License. In short, that means you can use and adapt the workflows as long as you credit us and also share your work.
- What future developments can we expect?
Our short- to mid-term goal is to provide ‘BioAID’ workflows for a text mining ‘pipeline’, from creating indexes for searching documents to training entity and relationship discovery, and storing the results as semantic models (RDF/OWL). We will share them on myExperiment.org, where we have also set up a text mining network for users and developers.