Our PubMed entity relation extraction and search problem is related to two fields of study: medical entity relation mining and entity-related search systems. In the medical text-mining domain, there is prior work on the relationships among medical entities recorded in knowledge databases [1,2]. The most popular is the Comparative Toxicogenomics Database (CTD), whose data include Chemical-Gene, Chemical-Disease and Gene-Disease relations. Unlike the specific, well-defined relations between Chemical and Gene curated by medical professionals, the relations between Disease and Chemical are very general and have only a few relation types, such as “therapeutic” and “mechanism”, rather than phrases. Another similar …
However, because our system must support online search queries, time-costly NLP techniques such as parse trees are not applicable in our case. Thus, we use POS tag sequences to classify our patterns, which are built from three types of components: Entity, Verb Phrase and Entity Modifier. The entity modifier is a new idea, introduced to specify a sub-level of an entity or to describe a relation that holds under a certain condition. Note that by adding entity modifiers, our system is able not only to refine entity relations from general types (e.g., the only pattern in OpenIE: E VP E) into more specific ones, but also to help users better understand the relation phrases. In Figure XX, an entity modifier is used to explain the occurrence of opposite relation phrases for the same …
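The component-based pattern idea can be illustrated with a minimal Python sketch. It assumes tokens arrive already POS-tagged, with recognized entities carrying a placeholder tag `ENT`. The tag groups, the function names `component_of` and `pattern_signature`, and the example sentences are all hypothetical choices made for illustration, not the system's actual rules.

```python
from itertools import groupby

# Hypothetical tag groups (Penn Treebank style); the exact tags used by the
# system are not given in the text, so these sets are assumptions.
VERB_PHRASE_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "RB", "MD", "TO"}
MODIFIER_TAGS = {"NN", "NNS", "JJ"}


def component_of(tag):
    """Map a POS tag (or the placeholder tag ENT) to a pattern component."""
    if tag == "ENT":
        return "E"
    if tag in VERB_PHRASE_TAGS:
        return "VP"
    if tag in MODIFIER_TAGS:
        return "EM"
    return None  # the tag belongs to none of the three components


def pattern_signature(tagged_tokens):
    """Collapse (token, tag) pairs into a pattern string such as 'E VP EM E'.

    Returns None if any tag falls outside the three components, or if the
    sequence does not start and end with an entity.
    """
    components = [component_of(tag) for _, tag in tagged_tokens]
    if None in components:
        return None
    # Merge runs of the same component, e.g. VP VP -> VP.
    collapsed = [key for key, _ in groupby(components)]
    if len(collapsed) < 3 or collapsed[0] != "E" or collapsed[-1] != "E":
        return None
    return " ".join(collapsed)


# Illustrative inputs only; not taken from the PubMed corpus.
general = [("aspirin", "ENT"), ("reduces", "VBZ"), ("inflammation", "ENT")]
specific = [("metformin", "ENT"), ("lowers", "VBZ"),
            ("blood", "NN"), ("glucose", "ENT")]

print(pattern_signature(general))   # E VP E
print(pattern_signature(specific))  # E VP EM E
```

Because the sketch only scans a tag sequence once, it avoids building a parse tree, which is the property the online-query constraint calls for.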
Most entity pairs co-occur only a few times in the PubMed corpus, and there are often diverse ways to describe the same relation between an entity pair. To overcome this issue and take advantage of the redundancy of the corpus, we cluster synonymous relation phrases.
Relation vector based clustering:
Following the traditional approach to such tasks, we run k-means clustering on relation phrases represented by relation vectors. A relation vector is essentially a bag-of-words model that contains, for each term, its TF-IDF value multiplied by its occurrence frequency. In addition, we observe that the meaning of a relation phrase becomes ambiguous without entity type information. For example, “prevent” and “treat” are similar relations for a chemical and a disease, but should be viewed as different for a gene and a chemical. Thus, we add the query entity pair’s type information to the relation vector. Last but not least, we take relation phrase polarity into account so that relation phrases with a similar semantic sense are clustered together. To leverage the result, we choose the mass center vector of each cluster to represent such