Project description: "Multirelational data mining for bioinformatics", Jan Ramon, postdoctoral research FWO 2003-2006
AIM
During the last decade, Inductive Logic Programming (ILP) has become a well-founded paradigm for relational data mining. During the first years, mainly general theory and general-purpose systems were developed. However, certain application domains have their own features and difficulties. Domain-specific methods that exploit these features can greatly improve performance. In this project we want to extend Inductive Logic Programming to make it more usable for some specific domains. We will pay special attention to the domain of bioinformatics.
OBJECTIVES
Inductive Logic Programming (ILP) is a data mining paradigm that uses first-order logic to represent examples and hypotheses. An important advantage is that richer concepts can be expressed and hence more complex patterns over several relations can be discovered in the data. Another advantage is that background knowledge can be added easily. A possible weakness is the higher computational complexity caused by searching a larger hypothesis space. However, recently developed optimization techniques, together with the possibility to restrict the search space according to the knowledge of the user, largely mitigate this problem.
Bioinformatics is the research area where computational approaches are used for problems in biology and medicine. Such methods can, for example, help in gaining more insight during the search for new drugs. Important factors making data mining, and in particular inductive logic programming, a valuable tool in this domain are: (1) the large volume of available data and (2) the large number of different relations interacting in a complex way. Examples are structure-activity relationship (SAR) analysis, the search for patterns in genomes, and, at a higher level, the analysis of the response of patients to different drugs.
Both the need for relational data mining methods and the value of computational methods in biology are widely accepted. In this project we want to use our expertise in relational data mining to contribute to the field of biology through collaboration with research groups having biological expertise. The objectives are therefore twofold. First, we want to extend Inductive Logic Programming with methods that are important for our application areas. Second, we want to contribute to the application domains by using these new techniques.
DESIGN AND METHODOLOGY
The proposed project aims to improve ILP techniques in domains with large complexity. While we will focus on bioinformatics in particular, these methods will be sufficiently general to be applicable in other domains too. The project consists of two major components: a data mining part, in which we will extend ILP techniques, and a bioinformatics part, which aims at contributing to the application domains.
I) Data mining
In the data mining part of the project, an ILP system will be developed that is optimized for data mining in bio-medical domains. We will pay special attention to the following topics:
- Distance-based methods
- Interaction between agent/experiment and data mining
- Efficiency issues
I.1) Distance-based techniques
Several data mining methods use distance functions to measure the similarity between objects. One then assumes that similar objects will have similar features. For example, if one knows that some substance is active against some disease, then one will also try substances with a similar structure, as one can expect them to have a high probability of being active too. In previous research, we have developed distance functions as well as clustering and instance-based learning algorithms in first-order logic. In the first topic of the data mining part we will further refine these methods. Important aspects are the integration with kernel-based methods and the use of background knowledge in the distance functions. For this part of the project, collaborations with the University of Freiburg (S. Kramer) are planned.
I.2) Interaction between experiments and data mining
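The idea of predicting activity from structurally similar substances can be illustrated with a minimal sketch. It is not the first-order distance developed in the project: it assumes, for illustration only, that each substance is described by a flat set of ground facts and that a Jaccard distance between such sets is an acceptable similarity measure. The names `jaccard_distance`, `nearest_neighbour_label` and the toy molecules are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# A ground fact such as ("bond", "c", "o", "double"); this flat encoding
# is an assumption made for the example, not the project's representation.
Fact = Tuple[str, ...]
Example = FrozenSet[Fact]


def jaccard_distance(a: Example, b: Example) -> float:
    """Return 1 - |a & b| / |a | b|, a distance in [0, 1] between fact sets."""
    if not a and not b:
        return 0.0  # two empty descriptions are considered identical
    return 1.0 - len(a & b) / len(a | b)


def nearest_neighbour_label(query: Example,
                            labelled: List[Tuple[Example, str]]) -> str:
    """Instance-based prediction: copy the label of the closest known example."""
    closest = min(labelled, key=lambda pair: jaccard_distance(query, pair[0]))
    return closest[1]


# Toy data (hypothetical substances, not real measurements).
mol_a = frozenset({("atom", "c"), ("atom", "o"), ("bond", "c", "o", "double")})
mol_b = frozenset({("atom", "c"), ("atom", "n"), ("bond", "c", "n", "single")})
query = frozenset({("atom", "c"), ("atom", "o"), ("bond", "c", "o", "single")})

known = [(mol_a, "active"), (mol_b, "inactive")]
prediction = nearest_neighbour_label(query, known)
```

Here the query shares two of four distinct facts with `mol_a` and one of five with `mol_b`, so it inherits the label of `mol_a`. A relational distance would in addition have to match variables and substructures, which this propositional sketch deliberately ignores.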
In some applications (biology among others), experiments, and therefore also data, are expensive. Hence one should not only try to learn good theories from the available data, but also choose the experiments that generate the data in such a way that they provide maximal information to the learning system. In the active learning setting, experiments and learning phases are interleaved such that the learning system can choose the next experiments using knowledge from the previous ones. Theoretical results on active learning have been published in computational learning theory, and the approach has been applied in domains such as natural language processing. Recently, there has also been interest in active learning in the biological domain. This part of the project aims at contributing to active learning on both a theoretical level (from a computational learning theory viewpoint and from results in related domains such as planning and game theory) and a practical level (implementation in an ILP system). For the applications in the biological domain, we will collaborate with the University of Aberystwyth (Ross King).
I.3) Improving efficiency
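The interleaving of experiments and learning can be sketched with a toy example. It is only an illustration under strong assumptions: the concept to learn is a single threshold over an ordered list of candidates, the response is monotone (once an experiment is positive, all larger candidates are positive too), and experiments are noise-free. The function `learn_threshold` and the simulated experiment are hypothetical and are not part of any ILP system.

```python
from typing import Callable, List, Tuple


def learn_threshold(candidates: List[int],
                    run_experiment: Callable[[int], bool]) -> Tuple[int, int]:
    """Find the index of the first candidate with a positive experiment.

    Each step runs the experiment that splits the set of still-consistent
    hypotheses in half, so every result is maximally informative.
    Returns (index, number of experiments); index == len(candidates)
    means that no candidate is positive.
    """
    lo, hi = 0, len(candidates)  # the first positive index lies in [lo, hi]
    n_experiments = 0
    while lo < hi:
        mid = (lo + hi) // 2
        n_experiments += 1
        if run_experiment(candidates[mid]):
            hi = mid        # positive: the threshold is at mid or earlier
        else:
            lo = mid + 1    # negative: the threshold is after mid
    return lo, n_experiments


# Simulated laboratory: doses 1..16, responses positive from dose 11 upwards.
doses = list(range(1, 17))
index, used = learn_threshold(doses, lambda dose: dose >= 11)
```

With 16 candidate doses the learner needs 4 experiments instead of testing all 16, because each experiment is chosen using the outcome of the previous ones. Realistic settings involve noisy experiments, differing costs and relational hypotheses, none of which this sketch models.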
The third topic of the data mining component concerns efficiency issues. Data mining algorithms often have a high computational cost. This holds in particular for relational algorithms, as they have a larger search space. In recent research, optimizations were developed for existing techniques. In this project we will further refine these optimizations and apply them to the newly developed methods.
II) Bioinformatics
In the bioinformatics component of this project we will investigate problems in the biological domain. To this end, we will collaborate with other research groups that have expertise in this domain. For example, a visit to the University of Natal is planned in 2003 for a collaboration on determining the therapy response of patients to HIV drugs. There will also be collaboration with the Rega Institute of the K.U.Leuven on other projects.
By making the developed techniques available in the inductive knowledge base system of the DTAI research group, they will also be usable in other application areas. Moreover, this will allow us to use the results of the DTAI group in this project. For bioinformatics in particular, we are thinking for example of the work on probabilistic knowledge representation, which allows one to describe biological knowledge more accurately.