Adaptive Information Disclosure (AID)

Participating in the VL-e project


Complications of extracting protein names

March 30th, 2008

I tried applyCRF. My experiences:

  • The list output combines the ‘B-Protein’ and ‘I-Protein’ tokens that the other output formats report separately
  • The resulting names are often more than a gene symbol (e.g. ‘NRG1 gene’)
  • Sometimes a B-Protein is truncated, and the rest of the gene symbol ends up as the first I-Protein (e.g. ‘DNAse’, ‘I’)
  • B-Proteins can contain small mistakes (e.g. ‘STA1,’); I use a regexp to strip commas and periods from the end of B-Proteins
  • Conclusion: before validation I try every variant: the list output (B-Protein + I-Protein) and the B-Protein alone (with and without a trailing [,.])
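The variants described above can be sketched in a few lines of Python. This is only an illustration, not what applyCRF or the workflow actually does: it assumes the tagger output is available as (token, tag) pairs, and the function names merge_bio and candidates are made up for this example.

```python
import re

# Assumption: tagger output is a list of (token, tag) pairs, where tag is
# 'B-Protein', 'I-Protein' or anything else (e.g. 'O').
TRAILING_PUNCT = re.compile(r"[,.]+$")


def merge_bio(tagged_tokens):
    """Group each B-Protein with the I-Protein tokens that directly follow it."""
    names = []
    current = []
    for token, tag in tagged_tokens:
        if tag == "B-Protein":
            if current:
                names.append(current)
            current = [token]
        elif tag == "I-Protein" and current:
            current.append(token)
        else:
            if current:
                names.append(current)
            current = []
    if current:
        names.append(current)
    return names


def candidates(name_tokens):
    """Return the variants to validate: full name, B-Protein only,
    and B-Protein without trailing commas or periods (duplicates dropped)."""
    full = " ".join(name_tokens)
    head = name_tokens[0]
    cleaned = TRAILING_PUNCT.sub("", head)
    variants = []
    for candidate in (full, head, cleaned):
        if candidate and candidate not in variants:
            variants.append(candidate)
    return variants


tagged = [("The", "O"), ("NRG1", "B-Protein"), ("gene", "I-Protein"),
          ("and", "O"), ("STA1,", "B-Protein"), ("bind", "O")]
for name in merge_bio(tagged):
    print(candidates(name))
```

Each candidate list can then be handed to the validation service in order, stopping at the first variant that gives a result.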

I tested applyCRF with Martijn Schuemie’s UniProt service. The purpose is to filter out false positives. I found that

  • Many protein names give no result; these are generally not human (Schuemie’s service is human only)
  • The DDBJ services from Japan, GetGene_DDBJentry or GetProd_DDBJentry, can also be used
    • they work on many organisms
    • they return multiple results: from which organism is the extracted protein name?
    • more unique names could be retrieved if the organism is known
  • Conclusion: the organism is essential to identify a protein name, because a gene symbol is (more or less) unique within a species. Add the organism name to the document search? What to do when the id is not unique?
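To make the organism question concrete, here is a minimal sketch of how lookup results could be narrowed down once the organism is known. It does not call the UniProt or DDBJ services; the hits are assumed to be already available as dictionaries with hypothetical 'id' and 'organism' keys, and the ids shown are placeholders, not real accession numbers.

```python
def select_by_organism(hits, organism=None):
    """Classify lookup hits as 'unique', 'ambiguous' or 'none'.

    hits: list of dicts with 'id' and 'organism' keys (assumed shape).
    organism: optional organism name; when given, only hits for that
    organism are kept (case-insensitive comparison).
    """
    if organism is not None:
        wanted = organism.lower()
        hits = [h for h in hits if h["organism"].lower() == wanted]
    ids = sorted({h["id"] for h in hits})
    if not ids:
        return "none", ids
    if len(ids) == 1:
        return "unique", ids
    return "ambiguous", ids


# Placeholder hits for one extracted name found in two organisms.
hits = [
    {"id": "HYP-1", "organism": "Homo sapiens"},
    {"id": "HYP-2", "organism": "Mus musculus"},
]

print(select_by_organism(hits))                  # organism unknown
print(select_by_organism(hits, "Homo sapiens"))  # organism known
```

A name that stays 'ambiguous' even with the organism given is the open case from the conclusion above; the sketch only flags it and leaves the decision to a later step.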

ToDo

  • Compare NER and applyCRF in one workflow
  • Consider including the organism name in the document search query

Tags: BioAID_dev · NER
