I tried applyCRF. My experiences:
- The list output combines the ‘B-Protein’ and ‘I-Protein’ of the other output formats
- The resulting names are often more than a gene symbol (e.g. ‘NRG1 gene’)
- Sometimes a B-Protein is truncated and the rest of the gene symbol ends up as the first I-Protein (e.g. ‘DNAse’, ‘I’)
- B-Proteins can contain small errors (e.g. ‘STA1,’); I use a regexp to strip trailing commas and periods from B-Proteins
- Conclusion: I try every variant before validation: the list output (B-Protein+I-Protein) and the B-Protein only (with and without trailing [,.])
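The variant generation from the conclusion above can be sketched in Python. This is only a minimal sketch: it assumes the tagged output has already been parsed into (token, tag) pairs, and the function name `candidate_names` is made up for illustration and is not part of applyCRF.

```python
import re


def candidate_names(tagged_tokens):
    """Return the name variants to validate for each protein mention.

    tagged_tokens: list of (token, tag) pairs, where tag is 'B-Protein',
    'I-Protein' or anything else (assumed input shape, not applyCRF's API).
    """
    # Group each B-Protein with the I-Proteins that directly follow it.
    mentions = []
    current = None
    for token, tag in tagged_tokens:
        if tag == "B-Protein":
            current = [token]
            mentions.append(current)
        elif tag == "I-Protein" and current is not None:
            current.append(token)
        else:
            current = None

    candidates = []
    for tokens in mentions:
        full = " ".join(tokens)                  # list output: B-Protein + I-Protein
        b_only = tokens[0]                       # B-Protein only
        b_clean = re.sub(r"[,.]+$", "", b_only)  # B-Protein without trailing [,.]
        variants = []
        for name in (full, b_only, b_clean):
            if name and name not in variants:    # keep order, drop duplicates
                variants.append(name)
        candidates.append(variants)
    return candidates


example = [
    ("NRG1", "B-Protein"), ("gene", "I-Protein"), ("and", "O"),
    ("STA1,", "B-Protein"), ("binds", "O"),
    ("DNAse", "B-Protein"), ("I", "I-Protein"),
]
for variants in candidate_names(example):
    print(variants)
```

Each mention yields its variants in a fixed order (full name first), so the validation step can try them one by one and stop at the first hit.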
I tested applyCRF with Martijn Schuemie’s UniProt service. The purpose is to filter out false positives. I found that
- Many protein names give no result; these are generally not human (Schuemie’s service is human only)
- The DDBJ service from Japan (GetGene_DDBJentry or GetProd_DDBJentry) can also be used
  - it works on many organisms
  - it can return multiple results: from which organism is the extracted protein name?
  - more unique names could be retrieved if the organism is known
- Conclusion: organism is essential to identify a protein name: a gene symbol is (more or less) unique within a species. Add organism name to document search? What to do when id is not unique?
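To illustrate why the organism matters, here is a small Python sketch of the disambiguation step. It makes assumptions: the lookup results are taken to be already available as (identifier, organism) pairs, the function `resolve_by_organism` is hypothetical, and the identifiers in the example are placeholders, not real database entries. No actual service call is shown.

```python
def resolve_by_organism(hits, organism=None):
    """Narrow lookup hits down to one identifier if possible.

    hits: list of (identifier, organism) pairs (assumed shape of the
    lookup results, not the format of any particular service).
    Returns (result, status), where status is 'unique', 'ambiguous'
    or 'no match'.
    """
    if organism is not None:
        # Keep only hits from the requested organism (case-insensitive).
        hits = [h for h in hits if h[1].lower() == organism.lower()]
    ids = sorted({identifier for identifier, _ in hits})
    if not ids:
        return None, "no match"
    if len(ids) == 1:
        return ids[0], "unique"
    return ids, "ambiguous"


# Placeholder hits for one extracted name found in two organisms.
hits = [("ID_A", "Homo sapiens"), ("ID_B", "Mus musculus")]
print(resolve_by_organism(hits))                  # organism unknown
print(resolve_by_organism(hits, "Homo sapiens"))  # organism known
```

The 'ambiguous' status leaves the open question visible: when the id is still not unique within one organism, the caller has to decide what to do with the remaining list.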
ToDo
- Compare NER and applyCRF in one workflow
- Consider including the organism name in the document search query.