ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this...

53
Plate-forme AFIA / Nice, du 30 mai au 3 juin 2005 ATELIER : Apprentissage automatique et bioinformatique Florence D'Alche-Buc Laurent Bréhélin

Transcript of ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this...

Page 1: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

Plate-forme AFIA / Nice, du 30 mai au 3 juin 2005

ATELIER :

Apprentissage automatique et bioinformatique

Florence D'Alche-Buc Laurent Bréhélin

Page 2: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,
Page 3: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

����������� ��������������������! "�#�$��%'&(���)�*��+��,�! "�-�.�� /�1032,����%4%4 5+6 "�87( 5+.�$�)��&9�� 5:1��+�7(��+������;%4���(���-�.+&<�,:9�>=���$=;��+�+.��?(�@�A�.�;�,�B7( C&9���,�.D5���( 5+EGF�H,I�J"K�LNMOF,PRQ,KTSULVH,WXQ,MOY@Z[LVF-Q�Q-H,I1Z@KY@\L]H,I9Z�F-Q4^6_`_aH,I1PbK cd %4 "�)2� 5�9�@ feg�,�(���������9�( 5=�=; f7<hi:9�j%4���$�)�*03�,��2,�*�9�;+.%4 #& �A�>��:1����0k�*��2����(��+��A�.�;�,�47( /+. 5+l2�m"�( 5+n F,Z�H�_aYpo�Y@Z$WGH,W SrqXY)F,WXsutKvY$Z@Z�YwLVF*WGHBY"\6tKvY@Z$Z)Y!x#_yK;JAY z�{| :9�.��%4�A�����/ "�����������������}�,er�) 5=; @~��*�-�>�(�O7( 5+#�;���(�;�O�)�9 5%4�;�$��=��9 "�k���,���O+�X�A� F,��\vKvY$W��XF,��\vS�tKvY@Z$Z)Y[�/I*�OH,WX\�Y"\[�r��Y@�[�pY$�5KT_`_aY �Oz� �,�(eg�*��%4���.�����j7( /�(�;�,%j�,=;D"�@:9=; 5+U "�>�A&(&9�� 5�-�.�;+.+���2� /7( 5+8�;�-�. "�������������9+#7( �~��*��7( "��������=;+& �A��:(�}+��O+.�.m5%4 C%b:9=��.��0k��2� "�-�#��7(�A&9���A�.�;en F,P�KT_`_aY[o>Y@���$YRY"\ n F,Z�H�_aYpo�Y@Z$WGH,W c�c| +���:979�!��e | %4���(� | �@��7(+#�#���(�A�)�B�$�O7( 5+�pI(F,K;�*I(H'E(I�F,WGQ!^�W1��Y"_ � Y$Z"\lLVY.�(M-I}���*I1K��"H c-{| &(&(�) 5�-����+�+.��2� w7�h��*:1���,%4���. 5+#&(�A�peg:(+.�;�,�}7( b 1¡£¢¥¤T+���%4�;=��A�fe¦����2�%4 5�-��&(������+)§# "�f "��&<D"����0%4 5�-�.���.�����(+8+�:1�f=; 5+8&(���*��D5���( "+>¨V©�¢E Z�F,WGª"H,K`� n H,��\uYbY@\/x>H*I(_y��Y$W�«wY@Z � Y"_`_aY�¬ ­�®¯� 5�@�,�(+��.��:9�$�.�����B+.:9&< "�)~O��+�D5 p79 p2,���A&(�( [2�D5�9D"�.�;°�:9 qXY)F,WG±�tM-KT_yK��-�OY²�XY@Z@\ ­-³d =��A�<�,���A�.�;�,�B7( [�(�5�,��:9�!&��,:9��=Thi 5+�����%4�A�������47( C&9���,&9����D"��D"+>7( p&< "�.���. 5+#%4�,=;D5�$:(=� 5+´µK`�AF�¶8F�_aF,K`��H�_aF,S1qGH*WGF�\TMOF,W n MOY$WXS1qXH5¬)Y"_y·*WGY8o8Z@I(F,WGQ�S(t8Y@\3Y@Z�tM-I1W1�,S �G¸ qXH,��M-I(F �(¹ F,sPbKvQ�F,���wY"\6tKvY@Z$Z)Y[o�F�_aQ,K {Oz

º

Page 4: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

»

Page 5: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

¼

Page 6: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

½

Page 7: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

¾

Page 8: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

¿

Page 9: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

À

Page 10: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

Á

Page 11: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

Â

Page 12: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

º�Ã

Page 13: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

º@º

Page 14: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

º)»

Page 15: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

º�¼

Page 16: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

º ½

Page 17: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

����������������� �����������������������������������������������������

���������������� ���������� ������������

������������������������������������������������������� � ����!�"������

##$����������%��"���&�'#()*��������������+�(,�{bernon, mano, glize}@irit.fr

�������-� ���� �������� .� ����� �� .������� .��� ����� �/�..�������0�� ��� ���.�������� �/����������� ��������� ��� �� ����� .��������1��� ��� �/ ���0���� 2������ ���� �3��1���� �������0�������.������4� �� ������ .��� ��� ������� ���� .��.�� � �� ������������ �/��� �������� �� .������ ���� ������ ���� .��.�� � �� ��� ���� ������ ���.�����5� 6/��0������ 7���������8� ��"���� .����..�������0�������1����/��������������������.��������� ����� ������ ���.�������01������������ 9���� ��� ���� ����� ���� �� ��� "����0������ �+. ���������� ������������ ��� �������������.���������.� ��3.���������� ��"�������5�

�� �����������

: ������ ���� ���.������� ��� ���.�������� ���� �3��1���� ������� ���� ����� 0��"���� � �9���������/����/�0����/��9������������������������";������������"����0�������/����.���������������5�6����� ��������� ������������.��������������������3.���1��������������������� ���������+������� �"���� �� /�� .��� �"����5� ������ � ������� � �������� ������ .����.������� ���� ��� ��������/�� � ���� ����������� ��� �3��1��� ������ � .��� ��� ����������� ������� � ���� ������� � ����5� ��� ��� ������� ���� .��.�� � �� ������������ ��� ��� �������� �� .������ 9���� � ������� ����.��.�� � �� ��� ���� ������ ���.�����5� 6�� ���.��+�� � ��� ��� � .���� "����0����� ���� �� �� <� ������������������/����������.��� ��.��������������.�����;����/�������.��������.�3�����0����5�=���� ���������� ��� � ��"������ ��� ���+���� ����� 0������� ���� ���� ����� � �������� ���� 0�������-�����"��>�� 0�������0 �������� 0������� ��0 ��������� ������������������?�

������� ��� 0 ������� ����������� ������ ��� �/���3��� ���� .������� �/�+.������� 0 ����� ���.��� ����� ���� �� ������ .������� ���� �/���3��� ���� �3��1���� ������������ .��� ��� ����������������������������/�+.�����������01���� ���������������� ������"����0����5�6���������������.�����<�@:%�.�������/ �����������������.������/����������������������������������0 ����������������������� ��"����/��������������������/�+.�����������01�������������������������������9���� ���3� ��� ��� ����.� � ��5� :/������ .���� ��� ��"������� �� ��� ��������.������ 2��� �3.��"����0��������.�3�����4���������/�+. ������������������.��"�������� �����"���5�

:����������.�������"����0����3�� �������/�";���������.� ����.��;����������.����������/ ��"������ ���1��� ������������ �������� ��� ���.�������� ��������� �/�� ��������0������ ��� ����������������.����"���� ���/���3������� � ������������������������������5�A�� �������� �� ��������� �3��1���� ���.��+��� �/7���0������ ��������8� ��"���� .��� �..�������0�� �� ���1����/����������� ����� ���� .��������� �� ��� ������ ��� .��� ���� 01��� ��� ���� ����� ���� � .������+. ����������������������2�������.���������.� ��3.���������� ��"�������45�

6��������������������������������.� ����� ������/�"���� ��� ����������/�..�������0�����.� ������ ��.���� ���� ��� �� ��������� �3��1�����������0���� ���.������5�6��������.�������!�@����.�������������� ��� .��"�1��� �� � ���� ������� � ������ �� ������� '� �����.�0 �� ��� � ��������.� ������������������������.� � ���������������5�

º)¾

Page 18: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

�� ���������������������������������������������������

6/�..�������0�� ��� ���.�������� �/��� �������� <� �������� ���� �� ��� "����0�������+. ����������������.��"�1������.��+��2�����"������01�����������������0���������� ��� ������� ����+. ����������<�.�����������.���������"����������� � ��01��4����� �/��.���������������������0�0�������������� 0�������������/��.������������������/��0������������ ������������5� �/���� .�������� ���� ����� � ��� � ��� �� .��� ��.��3��� ���� ���������� �/�..�������0��������������������� ����������������B:� �('C��������/�"����������.��3�������� �����.��������1�������/ ���0����-������������@�@!�2@��.�����������@0���!3�����45�

���� ���������������� �

����3���/�..������.�������3��1�����������.��0 ��������������������0������������/3����.���5�6/���.������.�������� ����������..����������.����.���/�������0��������.����.��������� � ���� �������� .� ��1��� �������� 2�������� ��� �������� ��������?� B���"���DEC45�F�3��0���� ������/�������0��������������7��/ ���0�����.��� ���/������ �����0��"����<�.������ �/����������� �������� ����� ���� ���.������ ����������� �� .������8� BF�3��0�(#C5�:�.�������������� ������� ��������/�������0�������������������3������/�������������������.��+�� � ��� ��� �/���������� ���� �..��������� ������������� <� ������� �� G����� B���.�D$C5� 6��� ������� ��� ������ ����� ���� �� .���� � <� .��.����� ��� �� ����� �..�� �� @�@!� B�������DDC� ������������� ��� ���. ������ ���� ���������� 0�H��� ������� ��� �3��1��� �/�������0����� .���� �..������ <��/�;������ ��+� ���0������ ��� ��� ���������5� 6��� ����������� ����� ���� ���.������ ����3��1���� .���������������������������������/����.���1�����������������.���� �<����. ���������������� ����������5����0��� ������ ����������� �������<���������� �/��0������������3��1������<����0��� ����� ��� ���.�������� 0��"��� ��� ��� �3��1��5� ��� ���.�������� 0��"��� ���0�� ��������������� ����� ���.������ ��� ������� ������� ��� ��� ���I���� ��� ��� ��� � ������ ����� ������.���������������.�����������9������ 5�

:����������..����������@�@!�����3��1����������0������.�����������������.�� ��/�0���������7����. �������8���������3�������������������<�.���������������������������������������<�� ������ ��� ��� ���1��� ���� ��� ���� �0��� �� ������� ��� .���� ���. ������ ������ ���� ������� .���� <��..������� �/������ �������5� @�� ������ �/�� �0��� ��� ���. ������ ���� � ������ ��� ���1���7�.������.�����8��5�5���/����������������I������������������������������������0����������. ��������������� ���������������������������������������/����.�������. �����������������..�� ���!���������%�����. ��������2!%�45�:1����/���0���� ��������/���������������������������������������0���.���� ������� <� ��� ��������� ��/��� ;�0�� ��� ��� .���� ��� ���� 9���� ���. ������� ������ ���� �����������������������������9���J����0������. ������/����.������������5�

���� !���"����"��������� �

6�� ��������0������ ��"��� ��� �/ ����� ���� ��� ������� 7�!��������3���� �����������8� ���� ���0 ���� .����������� ���� ���.����� )*((� 01��� B@������(,� F�����(#C5� ���� ��0������ ���/�����0���/9�����������3�����������������.��� �������������� ���+. ����������2��.����01��� ��"��?4���������������� �9�� ��������5�

6�� �3��1��� ������������ ������ ���� �..������ <� ��������� ������ ��� .�.������� ����������� �� ������� ��� ��� ���������� �� �/�..�3��� ���� ���+� �3.��� ��� �� ����+. ����������-�

K� :��� �� ��� �������.��������� �"������ 0�H��� ��� .��������� ���� .����� <� @:%� .��������� ��� ��������������� .��� ���� ����������� �� "����0��#5� ���� �� ��� ������������������ ��� ��0�� ��� ������.���� ��� ������ ���� 01��� �+.��� �� �������/�+. ����������������� �������������)(((� ��0�������� �5� ����� ���� ���������������������01������/����������������������+�.�������� ������������.�����3.��� ����.. �-����������01�����������+������������������+.������5�

�����������������������������������������������������������#� ��� �������� ���� ������� � �� �����"������� ����� ��� 6�"��������� ����������0�������.��� � �� 2���� LL(,4� ��� �/�%!@� �����������5� º�¿

Page 19: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

K� :��� �� ��� .������� ��� ��� ������ �������� ��������.����� ��� ���.�������� ��� ���������� .���� ���� ����������� �. ��������� ��� �������5� ���� �� ��� �+.������ ��������� � ���� ���.������ ��������.������ 2����� ���� ��� .F� ��� ��� �M*4� .� ����� ���� ���"��� ����������������������.��������������/����������������������������. ������5�

�������<���������� �������+�����"��������� ���������.����3������ �� ����.� �1����������� ���+� �3.��� ��� �� ��� ���� ������� �� ����� ���� ������� ���.�������� ��1�� ���� ������ 2�/���������������<������4����������.�������9������������0������5���������� � � ����������/����������������������/����.����������.������������� �������������������.�������"�����5��

#� $��� �����������������

6/�";�������������� �������������������������3��1��������������.������������..������<������0����������/����/�0��������/���3��1��������5�N�.������������ ����"�����"��������3��1�������������� ����� .������� <� .� ����� �/�+.������� 0 ������� ��� ��� ������� ��"��� �� � ���������/��0�������� ������������ ��� ����������� ��� ���� ����������� ���� ��� ������ ����� ���� �� ����"�����"���5� ��� ����� �/��0������ �������� ��� ��� � ��� .������ .� ����� ���� �� ����������.��������� ����� ��� <�.������ �������������� �����������.������ �������.��������������� ������� ��� ��������� �+. ������� "����0������ ����� ��� ���� ������ �������� <� ��� � ������ .����� <�@:%����������� �����5�

6�� .����1��� ��.�� ���� ��� ������� ��� �����"���� � �/�� ���� ����������� "H��� ���� ��� �+.����������.�������������������01����������������������������0��������/���������������������������� ���0���5�

#����� $���%�������� ���������

6����� ����������.��������������������.��������������3��1��������.�����������.������"����� �/�� ��.� �������� ���� ��� �3��1��� ���� ������ ���� 3� �0��� .���� �;������ ����� ������ � ��������� ���� ��������� �+. ���������5�@��� ������ ������ ��� .���� ;����� ��� ���.�������� ��� �����������/���������������� ����� �������� ������������� ��� � 0��������������������9����.���������.����������������.��� ��������/���������"������5�

������ ����������.�����������������.����������������������� �������������������������� ���.������� ��� �3��1��� �/�..�������0�5� =����� ���� ����� ������ ���� ���� �� ���������/�3�����/���������������������������������3������������"���.��.����������������������������� ���������� ��� ��� �/�;������ .���� �+.������ ��� "��� ������ � ���� ��� �3��1��5� ��� ���� ������ ���������� ���������������������9���� ��.� ��� ���.��������0�������. ������� 2����� ��0����#4�-��

K� ���� �0���� �������.��������� ����0 �� ��� ��.� ������ ���� )*((� �������� ��� 01��� ��� ���������5�����.�������������������������������������������������3.����/�0�����J�

K� ���� �0���� ��������.������ ��.� ������ ���� '(� �������� ��� ���.������ ����� �����/�+301�� �/ �����?� ���� �����0������ ����� ���� �0���� �������.��������� �������� ������������� /�����0������ .��� ����� ��+� ���� ���� �� ��� ������.������� ������..�� �����I���� ��5�

K� ��������0���� ����� ������������ ��.� �������� ������������������������������.��� �������������0����������������������������3.����/�0���5�

6��!�@����������������������� ������3��������/�0��������������5�6��.�����.� ������������������������������ �������������<��� 0���������0������������.����������������.������������<��������;����������������� �������5�6����"���������������� ������������������.����.��� ������ �0��� ���� ��+ � ��"����������5� ���� ����� ������.����� <� �/�������� �/���������O���"�������/���0���.���������������������5�@��������������������+��0������������� �����.���-�

K� ��.�����������.�����<����������� ����.���������.�����/�0���.����������������������������J�

K� ��.�����������.�����<����������� ���������������.�����/�0���.�����������������������������J� º À

Page 20: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

K� ��������������������������������<�� ���������������������/������"��<�(5�

�/�����/������������������/���0��������������.���������������0������������������������������������ ��� ������ ���0��� ��� �������� ��0�������� ��� � ����5� 6�� �3��1��� ���� ����� ��.�"��� ����/���.��������/�..��������������������������������������������5�

#����� !��������������������������������������

:��� ���� @�@!� ��� ���. ������ ��� � ����� ������ �����-� 2#4� ����� ��0��� .��P�� .���� 9��������.� � �����2*4�����������������.�������/�"������<������������������2'4�������������������������� <� �/���������5� :1�� ���� �/��� ��� ���� ��������� /���� .���� � ���� �� �� �0��� .���� ���;�0��� �� ��������� �� ���. ������5� ��� �+��.��� ��� ��� �������� 2#4� /���� .���� � ���� �� ��� 3� ������.� ������ ��"�0�Q� � ��� ����. �����J� �� .���� � ������� ��� �������� 2*4� ����� ��������/��.���������� � ��� ���� ���������� �/������� � ��� ������� ��� ��� ���������� .������ ��� .����������������2'4�/����.����� ���� �5�6/�0�������������������/�0���������1�����������.���������������������������5��

����!%��.������9������ �������3���������������/�0���<�����.����.���������� ��������������������5�:������� ������/�0�������������� �������/�����������������0������������/��������������������������� ��2���.������������������4�.������������5� ���������������0�����9������������������������������������� ���2���.�����������������4�����������������"����5����������������������� � ��� ��� 2���.���������� �������4� ���� .���� � ������� � ����� ��5� !�..����� ���� ���+�01�����������������"���������� �������������� ���� �"����0����5�@���������/������������������ ����� ���+� �0���� �������.��������� ���� � ������� ��� �9��� ������� ���� �� ������1��� ������������������������������������. ���������������������/����������.�������������5��������������������.�����������.������������� ����0�������/������������/�0���������.�����2���������4���� ������� ��� .����� ��� ��� .���� ���"��� ��� �� ���� ��� �������� �/�������� ��� �/������ �0���2���"����45�

����� ���� !%�� .������ ������ 9���� �� ��� <� �� 7������"��>�85� :��� ������ ����� ��� �����"��>�������.��� .���� �� �0��� <� �/ ����� ����� ��� ������� ��� � � ����� �+. ��������� ��� ��� �������� ���� ��.����/�0��5�!���� �������0����������+������������������������ ���.����������������������1������������������������ ���������. ���������� �/�0�������� �����������.����� �;���������������� ���� ��� ������ ����� ��� ������� �+. ��������5�@���� ���� !%�� �� ��� <� ���� ����� ������ �������������-�

K� 6�� ������ � ������ �� .��� �� �0��� <� �� ������ �� � ���� �� ������� <� ��� ������ �� � ��� ��.�����������"��>5� �������� �����������������0��������������� ������� ������0���������.��������������� ����������������.�����������������������.������������ ������������+�.����������������0��������������9��������������5��

K� 6�� ������ � ������ �� .��� �� �0��� <� �� ������ �� � ���� ��. ������� <� ��� ������ �� � ��� ��.��� ��� �����"��>5� ���� ��� � ������� ���������������� ��������� � ������ ���������������.��������������� ��������0���������.�����������������������.������������ ������������+�.����������������0��������������9��������������5��

����������O.���������

������������ ����������������� �����

@0�������� ��������R����������2.��� ���4�

@0�����������.�������'(����������2.F�

�M*?4�

@0����������.���������)*((����������201��4�

A�0����#5������������������0�����������.����������������� 0�5�

º�Á

Page 21: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

���������� ���������� �;�������������.����.������0��������/ ������������������������. ������� ��� ��������� !%��.������ �..���I���� .���� ���� ����/������� �0���� ���� ������ ������������� 9���� � ���" ��5� :�� .������ �� .������ ��� � ����� �..������ <� ��� � ��0������ 0�H��� <� ����� �;�������������������5�

&� ������������� ����������� %���

6��� .����1���� �+. ������� ���� ��� � � �� ��� �/�� ������� ��������� <� ��� � 0������� ���� ����� �3������� �/�0���5� =����� ������� .��������� ��� �� ��� ����������� �� � ����� �� ��������/�0�������������������������� .������������01���<�����������������������/ ��"�����������������0���� ���� ��.� �����5� ������ .����1��� �..������ �� .���� �";������ ��� � ������� ����������/�..�������0�� ��� �/�������� ��� ���� � .������� ���� ��� ������ ��� ����+� ����� ���� �� ����+. ���������5��

��H���<��/�����������.� ��� ������0����*��������.����"���������I��������01�����.���� ���������� ������/�0����0�H���<��������������������������������01����� �����������9�������0������2��#45�M�.�������������� ������������������� ������� ��� �����������0�������� ���� ���� ���� ������������������� �5� ������������.����"����.�1�� ���� ������������ ��� � ��0������������ ������������������� �/������� � ���.������� ��� �������� 01��� 2����"��� ���������� ��+� �0���� S 6*ED�� ���S6�',(T��������� �����2*44��������������������� .�������0�H�����+������������������������������� ��� �3�"������� �/�������� ��/�� 01�� .���1��� ���� �� ������ 2���� ����"��� �� 2'4� ��������������<���������������01�����������S6�',(T�������01��S 6*ED��45�

�������������"�������������<���01���������� ��������������2*4�������������������������"���.����������.�������+����������� ���������+. ������������������.�������"����0������2"������0�4�����������"���.���������� ��.����/�..��������2��������45��������+�����"���.����������/�������� .���� �� �0��� �/�� 0����� ��� �/ ����� ����� ���� �������� �+. ���������� ��� ���� �������������� ��� .��� �/�..�������� 2����� ��0���45� M� �������� ��/��� ���� ��� <� ������� ���� �3����� ��������������/�� 0������������ �����������5�% ���������������������.��������������5� �����/������ ��� ���� ������ ���.� �U�� .���� ������� ��� ������ ��������� ����� ���� ���� <� ��� ��������� ��� ����������������� ������ "����������������������������������������/����� ��������������������������� ���������.���������� ���������������0�������� ������5�

A�0����*5�@������ ����.�����������01������������5�

��

��

��

��

��

º�Â

Page 22: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

'� !�� ���������������������

:��� ��� ���� ���� ���� �� ������ ���� ��� ��������� � ��� �/�..�������0�� � ����� ���� �/��.���� ������������� 0�0�������� ��� ������ � ��� ���� ������ ���� ���������� ������������� �������������. �����5� :/������ .���� ��� /���� .��� .����"��� ��� � ����� ��� ������� ��� 7��������8� .��������.�������.��"�1���.����������������������.������� �������<��"���������������������..������������������������0���������0 ��������������� ����+�������������������������������"���5��

6�� � �������� .��� �������0�������� ���. ������� ���� ���� ����� .� ��� �� ���� .������ ����������������.���� ����������� ���������<����� ��������������.��"�1��������������������. �������� ������/������������� ������� �/��.�����������������������������5��������..�����������/������.������ �����������������10�����/�;�����������������������������1��������� ����O��..����������������������� ����O��..��������/�0���� ����� �������� ����������������� ���.�����������1���������+���� ����������������������������. �����5�=���9������"������.������������������<���������������������������������������.���������/��������������.����������������� 0�������������/�..�������0��������.������������01����/�����������0�H���<��������..����������.��������������"��5� 6��� ��.��� ��� �������� ��� � ����..����� ������� ��.������ ���� �10���� ��� � �;���������������� ���� ����� ����� �0���� 2�� ����O��..������4� ���� ��� ������ ������� �3���������� ���� ���������������.������������������������� ����7��� ���85�:��.�����������������������.��������;�����������0��������� �����������;�����������1���������������.����+��.����/ ��������/����������/����.�����������������������������0���5��

���� ��������� ��������� �..������� �+�������� ������ <� � ������ ��� ���.�������� �/����������5�7�6/=������8�.����+��.�������������/������� ������������<�.����������� �������.����������������� ��� ����+. �������������������.��������3��������������������B��>������(,C�������7�������������������8� ��0���.�� �� ����"��� ��� ��0������� .��������� ��� � ������ �/��0�������� � ��"��������/����0�������/�������� ���������"�������� �������1������� ��� ����������� ����/�+���.���������.��������� �����.���������B6��V(#C5�����������..���������.����������������1����.�������5��� � ��� ������� ��� ����������� ��� ������.���� .� ����� �/��� �������� ������� "�� �� ���� ����� �������� ��� ���� �������� .���� �������� �+. ������������ � ����� ��5�6/�..�������������������.� ��� ������������.�������.�����������������5�

�������������� &�%���� ����� <� ���������� ����6��������"�������� .���������� ��� 6��� ��� �/�%!@� ��� =������������.������� ��������������� �����������.����1����������������.����������������� 0��������������0���/�0 �������%@�5�

(�) �����*���

B@������(,C�@�������!5�������3���R5���"�����65�������+��5����"���������5�65�������5��������������5� ��� ���������� !5=5� @������� !�����03-� �� %���� ���� W��3� F�0�� =������ ���������� �� !��������3���������������A���"����� �������@..����������"����03��������������03�5)L-�L'E�L,*�*((,5�

B���"���DEC����"����=5������������5�@������0�������� ��� ���.��������� �����������-� ������ ��������������������������+�@������0��������������.��������=�5�F���1��#DDE5�

B���.�D$C� ���.�� W5� �������� �X 5� ��� ������ 5� ��� �� ����� ���� .� ��1��� 0��"��+� ��� �� ���� �������������������������!�+�1����;��� ��������.������@:Y!�@� ���<��������=�5�F���1��%��5�#DD$5�

B:� �('C��:� ���5� ���V�>��@5�!����3� 5�������������=5�� ����������������.�����/�.��������������������&�=�5�=3�������*(('5�

B�������DDC����������� 5����.��W5���������� 5�@������3����=���0������.������������������.��������!����M0������������@��.�����@����������!3������A������=���.�����0��������!3������!������W�������#DDD5�

BF�����(#C�F�������5@5��������������65�!����������5���������:5���������3��5�5�!��������3����������:���"���� �����������������!����3�����=+.����������A��������@��3����:����%�������@���������*D�$(�$#�*((#5�

BF�3��0��(#C�F�3��0���A5�����!���������!������0�����������@��.�����3�������=�3���.��������6����!�..����!3������2=M6!!� �"����������5�6��4�*((#5��

B6��V(#C� 6��V� 65�5� !������ �5�5� ���� W������� ����-� �� !���V���� =�������� ���� ���.��������� ����������03���������"���������03���0�������5�#D2#(4-,(#�,()�*((#5�

B��>������(,C���>�������Z5�Z�����Z5�F���5������������5�@���������0���������������������������������������!�����������������������*(2,4�L'$�L,)�*((,5�»$Ã

Page 23: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

Automatic extraction of relevant nodes inbiochemical networks

Sebastien Vast, Pierre Dupont, Yves Deville

Department of Computing Science and EngineeringUniversite catholique de Louvain

Place Sainte Barbe, 2B-1348 Louvain-la-Neuve - Belgium

{svast, pdupont, deville}@info.ucl.ac.behttp://www.info.ucl.ac.be/∼ {svast,pdupont,yde}

Abstract : In this paper we describe a novel method for extracting a set of nodesthat best capture the connections between k given nodes of interest in a biochem-ical network. This method relies on the projection of the nodes of the network,seen as an undirected graph, into an euclidean space. Euclidean distances be-tween nodes in the projected space correspond to their commute time distances inthe original graph, a measure based on a random walk model on the graph. Com-mute time reflects the distance between two nodes while considering all pathsconnecting them. Results on artificial data illustrate the interest of this approach.

Keywords: biochemical network analysis, subgraph mining, commute time dis-tance, spectral graph analysis

1 IntroductionBiochemical networks model interactions between biochemical entities within cells.Metabolism can be viewed as a network of chemical reactions catalyzed by enzymes,and connected via their substrates and products; a metabolic pathway is then a coor-dinated series of reactions. Other types of biochemical networks include regulatoryor signal transduction networks. Several models exist to represent biochemical net-works (Deville et al., 2003). In most cases, these networks can be viewed as directedor undirected graphs. The present work is part of the BioMaze project which aimsto produce computer tools for analyzing biochemical networks. BioMaze extends theAmaze project which aims to build a biochemical database integrating the three typesof networks mentioned above and to provide dedicated query tools (van Helden et al.,2000).

»�º

Page 24: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

CAp 2005 : Apprentissage et Bioinformatique

The specific problem we address here is the extraction of a relevant subgraph of anundirected1 graph, which best explains the relations between k given nodes of interestin this graph. Assume, for instance, we are analyzing the synthesis of pyruvate fromglucose and would like to study the possible influence of the expression of a givengene on a protein, say phospho-fructokinase-2, in the context of the regulation of thismetabolic pathway. In this case, we have 4 nodes of interest in a possibly very largegraph of interactions and we would like to extract a relevant subgraph explaining therelations between these 4 nodes. The methods described in this paper are also applicableto other practical domains.

This paper presents a novel approach to this problem. It relies on the projection ofthe nodes of the graph into an euclidean space. Euclidean distances between nodes inthe projected space correspond to their commute time distances in the original graph, ameasure based on a random walk model on the graph (Saerens et al., 2004). Commutetime reflects the distance between two nodes while considering all paths connectingthem. This contrasts with simpler approaches which would extract only specific pathsbetween each pair of nodes of interest, such as shortest distance or maximal flow paths.Here the goal is the extraction of a relevant subgraph as this is considered to be more in-formative. An inspiring approach to this problem was presented recently in (C. Falout-sos & Tomkins, 2004) but the problem was restricted to 2 nodes of interest. We adopthere a different point of view allowing for a direct solution to the general problem withany number of nodes of interest. We propose to solve the problem in two steps: theextraction of a subset of relevant nodes in the graph followed by the construction of asubgraph connecting them. The present contribution focuses on the first step.

Section 2 proposes a formal statement of the problem we address. Some possiblemethods to solve it are discussed and contrasted with our approach. The theory behindthe notion of commute time distance is summarized in section 3. Section 4 details howto use commute time distances in order to extract a subset of relevant nodes in a graph.Practical experiments are presented in section 5.

2 The problem of extracting a subset of relevant nodesProblem statement: Given a connected undirected graph G = (V, E), where V denotesa set of nodes (or vertices) and E denotes a set of weighted edges, a non-empty setK ⊆ V of nodes of interest and s a strictly positive integer, find a set S ⊆ V \ K ofnodes, with |S| = s, optimizing a goodness function g(S, K). The goodness functiong(S, K) measures how well the s extracted nodes explain the relations between thek = |K| ≥ 2 nodes of interest in the graph.

The goodness function should measure how well the nodes of interest are connectedthrough paths to which the extracted nodes belong. A naive approach to this problemconsists in extracting nodes belonging to shortest paths between pairs of nodes of in-terest. Consider, for instance, the graph depicted in Figure 1 and assume this graphrepresents a road map between cities A and B (i.e. k = 2, in the present case). The

1Even though there is a direction of flow in a metabolic pathway, the type of graph analysis consideredhere does not require directed edges.

»@»

Page 25: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

shortest distance2 approach would typically select nodes C and D belonging to the high-way connecting A and B. However, as soon as one edge is removed along this path (e.g.in case of a traffic jam) no alternative route from A to B goes through C or D. Nodesincluded in the dashed circle are more relevant here as they belong to many alternativeroutes connecting A and B, even though none of these routes might be shorter than thehighway. Thus the goodness function should take into account many alternatives routes,possibly all of them.

C D

A B

Figure 1: Nodes that best capture the connections between A and B are in the dashedcircle as they belong to many alternative routes from A to B, or conversely.

Faloutsos et al. proposed an interesting approach to the more general problem ofextracting a relevant subgraph (C. Faloutsos & Tomkins, 2004). This approach can di-rectly be applied to our problem of extracting a subset of the graph nodes. They restricttheir attention to the case for which k = 2. The goodness function g(S, K) is basedon an electrical analogy. The 2 nodes of interest are respectively considered to be thesource and the sink of an electrical current. The algorithm searches the paths followedby the current fl ow and maximizes the sum of current fl ow in the extracted subgraph.In addition, each node includes some current loss in order to penalize long paths andvery highly connected nodes (hubs)3. This approach takes into account several pathsbetween the nodes of interest. The fraction of current captured by the subgraph dependson the number and weight of such paths. One drawback of this method is that, due tothe current loss, the solution depends on which of the 2 nodes of interest is chosen to bethe current source. We propose in the present work an alternative method which dealswith any number of nodes of interest with no preference a priori defined between them.

2This distance may correspond to the travel time in this particular case.3While it is interesting to extract nodes offering alternative routes to the nodes of interest, hubs do not

explain well the specific relations between the nodes of interest as they are well connected to most nodes.

»$¼

Page 26: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

CAp 2005 : Apprentissage et Bioinformatique

3 Euclidean commute time distanceAs motivated by the discussion in section 2, we are looking for a measure describinghow well several nodes are connected in a graph by considering all possible paths con-necting them. This measure will then be applied to the extraction of a subset of relevantnodes in a graph as detailed in section 4.

The proposed measure relies on a random walk model on the graph. This modelassigns transition probabilities to the edges, so that a random walker will jump fromone node to another with a probability proportional to the weight of the edge connect-ing them. The average commute time4 between nodes i and j computes the averagetime taken by a random walker for reaching node j from node i, and coming back toi. The square root of this quantity is a distance measure between any two nodes calledthe euclidean commute time distance (ECTD). Most of the theory, summarized in thepresent section, was introduced in (Saerens et al., 2004). The application of this dis-tance measure to the extraction of a subset of relevant nodes in a graph is detailed insection 4.

Section 3.1 introduces some notations and, in particular, the Laplacian matrix L of agraph. Section 3.2 details how to compute the ECTD from L.

3.1 The Laplacian matrix of a weighted graphWe consider a weighted undirected graph G = (V, E) with strictly positive weightsbetween each pair of connected nodes. The graph order |V | is also denoted n in thesequel. The larger the weight wij of the edge connecting node i to node j, the easier thecommunication between i and j is assumed to be. Moreover, the weights are requiredto be symmetric (wij = wji). The adjacency matrix A is defined in the usual way:

aij =

{

wij , if node i is connected to node j0 , otherwise.

The diagonal degree matrix D is defined as follows. dii =∑n

l=1 ail and dij = 0,if i 6= j. A related quantity is the graph volume, that is the sum of node degrees:DG =

∑ni=1 dii.

The Laplacian matrix L of the graph is defined as L = A − D. When G has a singleconnected component, the rank of L is n − 1. Moreover, one can easily show that L issymmetric and positive semidefinite (Chung, 1997).

3.2 Computation of the commute time distancesKlein and Randic proposed in (Klein & Randic, 1981) a distance measure betweengraph nodes, called resistance distance which has the property of decreasing when thenumber of paths between two nodes increase. As shown by Chandra (Chandra et al.,

4This notion of commute time is equivalent to the average number of steps a random walker would makeon average to commute between both nodes, since the random walker is assumed to make one step at eachtime clock.

» ½

Page 27: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

1989), this measure can be expressed in terms of the random walk model describedbelow.

A random walk on a graph is a Markov chain describing the sequence of nodes visitedby a random walker. A state of the Markov chain is associated with every node of thegraph. A random variable X(t) represents the current state of the Markov chain at timet. The probability of transiting to state j at time t+1, given the current state is i at timet, is given by:

P (X(t + 1) = j|X(t) = i) = pij = aij/dii.

Thus, from any state i, the probability to jump to a state j is proportional to the weightaij of the edge between i and j. The transition matrix P = [pij ] of the Markov chain isrelated to the degree and adjacency matrices as P = D

−1A.

The average first-passage time m(j|i) is defined as the average number of steps arandom walker, starting in state i, will take to reach state j for the first time. Thesemeasure can be computed by the following recurrence (Norris, 1997):

{

m(j|i) = 1 +∑n

l=1,l 6=j pi,l m(j|l) for i 6= j

m(j|j) = 0(1)

A closely related measure is the average commute time, q(i, j), defined as the averagenumber of steps a random walker, starting in state i, will take to enter state j for the firsttime, and go back to state i for the first time: q(i, j) = m(j|i) + m(i|j). Note that, ingeneral, m(i|j) 6= m(j|i), while the average commute time is symmetric by definition.As shown by several authors, the average commute time is a distance (Klein & Randic,1981; Gobel & Jagers, 1974). Moreover the square root of the average commute timedefines an euclidean distance (Saerens et al., 2004).

A first method for computing euclidean commute time distances is based on the iter-ative solving of the recurrences (1). An alternative approach derives from the Moore-Penrose pseudoinverse of the Laplacian L, denoted by L

+, as proposed in (Saerenset al., 2004):

q(i, j) = DG(l+ii + l+jj − 2l+ij) (2)

If we further define ei as the ith column of the n×n identity matrix, equation (2) canbe rewritten as

q(i, j) = DG(ei − ej)TL

+(ei − ej), (3)

where each node i is represented by a unit base vector ei. These nodes can be mappedinto an euclidean space that preserves the commute time distances as L

+ is positivesemidefinite. Indeed, every positive semidefinite matrix can be transformed to a di-agonal matrix (see, e.g., (Meyer, 2000)), Λ = U

TL

+U, where U is an orthonormal

matrix made of the eigenvectors of L+, U = [u1,u2, . . . ,un−1,un = 0]. Hence, the

commute time distances can be rewritten as:

q(i, j) = DG (x′i − x

′j)

T(x′i − x

′j) (4)

where the following transformations have been applied: xi = UTei, and x

′i = Λ

1/2xi.

So, in this n-dimensional Euclidean space, the transformed node vectors, x′i, are

exactly separated by euclidean commute time distances (up to the scaling factor DG).

»@¾

Page 28: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

CAp 2005 : Apprentissage et Bioinformatique

Close points in this ECTD space represent nodes well connected in the original graphG according to any possible paths between them, and the euclidean distance betweenthem measures this connectivity in the original graph.

The ECTD space has dimensionality n, the graph order, but projection to a subspacepreserving as much information as possible can reduce computation time. The so-calledspectral (or eigenvector) decomposition of L

+ is given by:

L+ = UΛU

T =n−1∑

l=1

λl uluTl (5)

where λ1 > λ2 > . . . > λn−1 > λn = 0 are the eigenvalues of L+, and ul the

associated eigenvectors. The eigenvector expansion of L+ can be computed up to m <

n − 1, by considering only the m largest eigenvalues of L+. This gives rise to an m-

dimensional subspace where the commute time distances are approximately preserved.Finally, since L and L

+ have the same set of eigenvectors but inverse (non zero)eigenvalues, we do not need to explicitly compute the pseudoinverse of L. It is onlynecessary to compute the smallest non zero eigenvalues of L, which correspond tothe largest eigenvalues of L

+, and their associated eigenvectors. Fast iterative meth-ods exists for this purpose (Golub & Loan, 1996; Sorensen, 1996). The complexityfor computing one eigenvalue/eigenvector is O(n2) and the overall complexity for thismethod is thus O(mn2).

4 Node subset minimizing euclidean commute timeThe relevant node subset problem can be easily formulated and solved using the eu-clidean commute time distances between any graph nodes and the nodes of interest.More specifically, we consider the following goodness function :

g(S, K) =∑

i∈S

dr(i, K) (6)

withdr(i, K) = min

W⊆K,|W |=r

j∈W

q(i, j)

Thus, the contribution of each extracted node to the goodness of the subset S is the sumof the commute time distances to its r (1 ≤ r ≤ k) closest nodes of interest in theECTD (sub-)space. In the experiments reported below, we considered the distances tothe two closest nodes of interest, for each extracted node (r = 2). The choice r = kwould correspond to considering the distances to all nodes of interest. On one hand,this would allow to take into account the connectivity to all nodes of interest. On theother hand, as this measure would be more global, the extracted nodes might not beparticularly well connected to any specific node of interest. We will further study thistrade-off in our future work.

Computing an optimal S, which minimizes g for a given number s of nodes to beextracted, is straightforward once the commute time distances between any node of

»$¿

Page 29: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

interest and the other nodes of the graph have been computed. It simply amounts tocompute dr(i, K) for each possible node i of the graph (except the k nodes of interestthemselves) and to return the nodes with the s smallest values.

Figure 2 presents the graph of Figure 1 with nodes indexed in increasing order ac-cording to d2(i, K) (here, K = {A, B}). Unit weight edges were considered in thisexample.

A B

4 1

66

4 2

3

5

Figure 2: The nodes of this graph are labeled in increasing order (ties are assigned thesame rank) according to the sum of their commute time distances to A and B respec-tively.

5 ExperimentsThe ultimate objective of this work is to provide a method for a biologist to automat-ically extract a subset of relevant nodes related to given nodes of interest in a largebiochemical network. In order to assess the performance of the proposed method, pre-liminary experiments with artificial graphs are reported here.

1

10

100

1000

1 10

Num

ber

of n

odes

Degree

Figure 3: Average degree distribution of graphs used for testing.

A set of 10 graphs of 100 nodes were randomly generated using a power-law graphgenerator (Barabasi et al., 2000). For each graph, five nodes were used as initial seeds.Next, 95 nodes were iteratively added and randomly connected to 3 nodes of the current

» À

Page 30: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

CAp 2005 : Apprentissage et Bioinformatique

graph. At each step, the probability of an existing node to be connected to the new nodeis proportional to its current degree. Each generated graph contains a single connectedcomponent and all edges have a unit weight. The degree distribution (averaged over allgenerated graphs) is depicted (in log scales) in Figure 3.

For each graph tested, 10 sets of k nodes of interest were randomly selected. Resultsare reported for k = 2, 4 and 8. In each case, an increasing number of s nodes were

extracted. The distance measure D =

i∈Sd2(i,K)

i∈V \Kd2(i,K)

is the cumulated distance of the

subset S of extracted nodes relative to the distance of the total set V \K of nodes whichcan possibly be extracted. As we aim at minimizing a distance in this case, the smallerD the better.

Comparative results with the method proposed by Faloutsos et al. (C. Faloutsos &Tomkins, 2004) are possible when k = 2. These results are presented in Figure 4. Bothapproaches perform very similarly in this setting, showing that they capture essentiallythe same information (at least for the tested graphs). However, Faloutsos method can-not extract more than 29 % of nodes in this case (28 out of the 98 nodes which canpotentially be extracted with our approach). This comes from the fact that this methodonly extracts nodes on loopless paths between the 2 nodes of interest. Hence a sig-nificant fraction of the graph nodes (here 71 %) may not respect this constraint. Oneone hand, this illustrates an advantage of our approach. On the other hand, Faloutsosmethod is more general as it does not only extract a node subset but a connected sub-graph. Extension of our method to deal with this more general problem is part of ourfuture work.

0 0.1 0.2 0.3 0.4 0.50

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

% extracted Nodes

D

Figure 4: Distance of the extracted subsets for increasing number of extracted nodes.The x axis gives the value of s

n−k , that is the percentage of extracted nodes. The greencurve (circles) corresponds to our approach minimizing commute times, while the bluecurve corresponds to the method of Faloutsos. Results are obtained for k = 2 andaveraged over 100 tests.

»$Á

Page 31: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

Figure 5 illustrates the results of our approach for k = 4 and 8 on the same graphs.Both curves behave similarly and illustrate the generalization of our approach to largersets of nodes of interest. Note that the computational complexity remains essentiallythe same as it is dominated by the computation of the same commute time distances inall cases.

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

% extracted Nodes

D

(a) k = 4

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

% extracted Nodes

D

(b) k = 8

Figure 5: Distance of the extracted subsets for increasing number of extracted nodes fork = 4 or k = 8.In all results presented so far, all edge weights were assumed to be equal (standard

deviation σ = 0). Figure 6 presents the extraction results for another set of 10 graphs of200 nodes for which a normal distribution on edge weight with a much larger standarddeviation (σ > 200) was defined. As slightly better performance is obtained for thesegraphs as the distribution of commute time distances is sharper in this case.

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

D

% extracted nodes

std = 0

std=239

Figure 6: Extraction results for equal weight for all edges (σ = 0) or large standarddeviation (σ = 239) for a normal weight distribution. Average results for 100 tests withk = 2 (10 randomly selected pairs of nodes of interest for each graph).

»$Â

Page 32: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

CAp 2005 : Apprentissage et Bioinformatique

6 Conclusion and future workWe propose in this paper a novel approach to the extraction of nodes in a biochemicalnetwork which best explain the connections between k given nodes of interest in thisnetwork. This approach uses commute time distances between nodes, a measure ofhow well two nodes are connected in a graph by considering all possible paths betweenthem. It is based on the projection of the nodes of the network, seen as an undirectedgraph, into an euclidean space. Euclidean distances between nodes in the projectedECTD space correspond to their commute time distances in the original graph.

Several questions need to be addressed in the future.

1. The commute time distances can be approximated if the nodes of the originalgraph are projected into a subspace of the full ECTD space. A lower dimensionsubspace corresponds to a coarser approximation to the actual commute timeswhile reducing the computational complexity. We will study the trade-off be-tween this complexity and the quality of the set of extracted nodes.

2. Our goodness measure for the extracted node subset is based on the commutetime distance from each extracted node to its two closest nodes of interest. Asdiscussed in section 4, alternative goodness measures will be investigated.

3. The more general problem of a relevant subgraph extraction will be considered.Starting from the set of extracted nodes, some edge selection in the original graphhas to be designed. This should be derived from the fraction of edges responsiblefor the largest part of the commute time distances between nodes.

4. Actual experiments on real biochemical networks and result interpretations bybiologists are also part of the current project. Comparisons between extractedsubgraphs and known pathways could be performed in this regard.

ReferencesBARABASI A., ALBERT R., JEONG H. & BIANCONI G. (2000). Power-law distribution of theworld wide web. Science, 287(12).

C. FALOUTSOS K. M. & TOMKINS A. (2004). Fast discovery of connection subgraphs. In 10thACM Conference on Knowledge Discovery and Data Mining (KDD), volume 2, p. 118–127.

CHANDRA A. K., RAGHAVAN P., RUZZO W. L. & SMOLENSKY R. (1989). The electricalresistance of a graph captures its commute and cover times. In STOC ’89: Proceedings of thetwenty-first annual ACM symposium on Theory of computing, p. 574–586: ACM Press.

CHUNG F. (1997). Spectral graph theory. American Mathematical Society.

DEVILLE Y., GILBERT D., VAN HELDEN J. & WODAK S. J. (2003). An overview of datamodels for the analysis of biochemical pathways. Briefings in Bioinformatics, 4(3), 246–259.

GOBEL F. & JAGERS A. (1974). Random walks on graphs. Stochastic Processes and theirApplications, 2, 311–336.

¼@Ã

Page 33: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

GOLUB G. H. & LOAN C. F. V. (1996). Matrix Computations, 3rd edition. Johns HopkinsUniversity Press.

KLEIN D. & RANDIC M. (1981). Resistance distance. Journal of Mathematical Chemistry, 12,81–95.

MEYER C. D. (2000). Matrix Analysis and Applied Linear Algebra. Society for Industrial andApplied Mathematics.

NORRIS J. (1997). Markov Chains. Cambridge University Press.

SAERENS M., FOUSS F., YEN L. & DUPONT P. (2004). The princial components analysisof a graph, and its relationships to spectral clustering. In Proceedings of the 15th EuropeanConference on Machine Learning (ECML 2004), volume 3201 of Lecture Notes in ArtificialIntelligence, p. 371–383: Springer-Verlag.

SORENSEN D. C. (1996). Implicitly restarted Arnoldi/Lanczos methods for large scale eigen-value calculations. Rapport interne TR-96-40, Rice University.

VAN HELDEN J., NAIM A., MANCUSO R., ELDRIDGE M., WERNISCH L., GILBERT D.& WODAK S. (2000). Representing and analyzing molecular and cellular function using thecomputer. Biol. Chem., 381(9-10), 921–935.

¼*º

Page 34: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

¼"»

Page 35: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

Conformation de biomolecules et apprentissage desinteractions de van der Waals par un systeme multi-agent

adaptatifCamille Besse, Carole Bernon

Institut de Recherche en Informatique de Toulouse,Universite Paul Sabatier,

118, route de Narbonne, 31062 Toulouse Cedex 4Äbesse,bernon Å @irit.fr

Problematique

Les conformations de biomolecules expliquent leurs roles structuraux et/ou fonctionnels dans la nature. Seule-ment, la complexite et la meconnaissance des infl uences inter-atomiques ne permettent pas de calculer ces confor-mations de maniere efficace. Des approches basees sur des methodes d’optimisation globale (Fan, 2002) existentmais se trouvent limitees par leurs hypotheses sur l’evaluation de la fonction energetique a minimiser. L’approcheproposee ici, a comme objectif d’apprendre ces interactions atomiques puis de les composer, de maniere simple,locale et efficace, en utilisant une technologie basee sur les systemes multi-agents adaptatifs. Dans un premiertemps, elle a pour but de resoudre des problemes de conformation par minimisation de l’energie de Lennard-Jonesinduite des interactions de van der Waals, puis, dans un deuxieme temps, elle sera utilisee pour apprendre cesinteractions.

La resolution emergente par auto-organisation cooperative

Un moyen d’apprendre pour un systeme S consiste a transformer sa fonction actuelle f Æ de maniere autonomeafin de s’adapter a l’environnement, considere comme une contrainte qui lui est donnee. Chaque partie P Ç d’unsysteme S realise une fonction partielle f È@É de la fonction globale f Æ . Elle est le resultat de la combinaison desfonctions partielles f È$É . La combinaison etant determinee par l’organisation courante des parties, il s’ensuit quetransformer l’organisation conduit a changer la combinaison des fonctions partielles f È$É et donc a modifier la fonc-tion globale f Æ , devenant par la meme un moyen d’adapter le systeme a l’environnement. La justification theoriquequi guide le processus d’auto-organisation dans les AMAS est le fait qu’un systeme qui possede ses parties eninteractions cooperatives permanentes, realise la fonction souhaitee dans son environnement (George et al., 2003).En elle-meme, l’organisation qui emerge est une organisation observable non predonnee par le concepteur dusysteme. Mais le plus interessant, c’est l’emergence de la fonction du systeme qui est produite par l’organisationentre les agents a un instant donne. Pour realiser cela, tout agent est programme pour chercher a etre en situationcooperative avec les autres agents du systeme : idealement, il devrait recevoir des informations pertinentes pourrealiser sa fonction et transmettre ses resultats vers ceux qui devraient en tirer benefice. L’agent realise en perma-nence sa fonction partielle, mais il agit simultanement sur l’organisation interne du systeme s’il detecte des situa-tions non cooperatives (SNC). Ainsi, la recomposition des fonctions partielles realisees par chaque agent ameneune transformation de la fonction globale du systeme et les etats non cooperatifs dus aux situations imprevues sontprogressivement supprimes. La resolution de problemes avec les AMAS s’articule donc autour de trois notionsessentielles : apprentissage, auto-organisation et emergence. Il y a apprentissage car le processus de resolution estincompletement specifie : au cours de son activite le systeme doit progresser a partir d’observations sur l’environ-nement. Cet apprentissage se realise par un processus d’auto-organisation (cooperative pour les AMAS) entre sesparties constituantes (que nous nommons agents). La solution obtenue est emergente car, ni la specification initialedu systeme ni son algorithme d’apprentissage n’ont a connaıtre ce qui devra etre appris.

¼@¼

Page 36: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

Atelier Bioinformatique & Apprentissage

La resolution emergente de la conformation de molecules

Cette approche ayant fourni des resultats satisfaisants dans des domaines aussi varies que complexes commela prevision de crues ou encore la resolution d’emploi du temps, nous l’avons applique a la recherche de confor-mations de proteines avec pour objectif a terme de simuler du ”docking” de molecules, domaine important enbioinformatique.Une molecule est un AMAS constitue d’agents cooperatifs (les atomes), sa description spatiale est fournie enentree a l’application suivant un fichier au format de la Protein Data Bank (PDB, 2003). Chaque atome de cettemolecule est constamment en interaction avec ses voisins. Nous avons choisi de modeliser ces interactions vial’energie potentielle de Lennard-Jones exprimee comme suit :ÊrËÍÌÍÎbÏ;Ð)Ñ�Ò<ÓBÔ/Õ5ÖG×ÙØÚ4Û6ÜTÝ£ÞRß Ö<×ÙØÚ4ÛÙà�áfâVã�äÚ ä â�ã ÜÚ (1)

ou ×ÙØ represente la distance a laquelle l’energie potentielle est minimale,Ô5å ã�ä Ð.æ ã Ü permettent de regler la valeur

de ce minimum et les coefficients de la pente de la courbe etÚ

la distance entre deux atomes. ×ÙØ åTÔ5å ã�ä å ã Ü sontles parametres que nous nous proposons d’apprendre.Cette modelisation nous a permis d’utiliser les interactions atomiques comme moteur du repliement vers la confor-mation de minimum energetique (Besse, 2003). Nous avons donc cherche a minimiser cette energie sans connaitreles parametres de l’energie de Lennard-Jones. Cette minimisation se fait de maniere cooperative : chaque atometend a diminuer l’energie potentiellle resultant des interactions de Van der Waals, sans jamais accroitre la situationenergetique de son voisin de plus grande energie. De cette maniere, l’energie converge par etapes locales vers unminimum global.L’apprentissage des parametres de la loi de Lennard-Jones peut donc etre mis en place en mesurant l’ecart entrela conformation souhaitee (donnee par une molecule connue) et la conformation obtenue apres utilisation de notrealgorithme sur cette meme molecule deformee par un placement aleatoire de chacun de ses atomes. Les parametressont alors ajustes cooperativement de maniere a ce que la molecule deformee tende un peu plus vers la conforma-tion voulue a chaque cycle d’apprentissage.

Conclusion et perspectives

Nous avons actuellement mis en place la minimisation de l’energie de Lennard-Jones a propos de laquelle nousobtenons des resultats plutot encourageants en termes de temps de resolution. Les conformations finales trouveesne permettent pas de conclure quant a la validite de l’algorithme etant donne que les fonctions ne sont pas connuesa ce jour. Elle a en outre l’avantage par rapport a des methode plus classiques telles que le recuit simule ou les al-gorithmes genetiques, de n’emettre aucune hypothese sur la fonction economique a minimiser. La prochaine etapeconsiste donc a mettre en place cet apprentissage et a le valider a l’aide de molecules connues. Elle sera realisee etexperimentee dans les prochains mois. Toutefois, notre modelisation ne prend en compte ni le milieu (a savoir l’eauentourant necessairement les molecules) ni les interactions electrostatiques dues a la polarite de certains atomes.Cette approche est novatrice dans sa conception du repliement moleculaire et pourrait par la suite faciliter desdecouvertes sur l’implication de tel ou tel acide amine dans le processus de repliement.

ReferencesBESSE C. (2003). Conformation de molecules par un systeme multi-agent adaptatif. In Rapport de Stage Maıtrise, InstitutUniversitaire Professionnalise Systemes Intelligents, Universite Paul Sabatier, Toulouse III.

FAN E. (2002). Global optimization of lennard-jones atomic clusters. In Thesis for the Degree Master of Science of McMasterUniversity, Hamilton, Ontario.

GEORGE J. P., GLEIZES M.-P. & GLIZE P. (2003). Conception de systemes adaptatifs a fonctionnalite emergente : la theorieamas. In Revue d’Intelligence Artificielle, RSTI serie RIA, volume 17, n ç 4, p. 591–626.

PDB T. (2003). The protein data bank. In P. BOURNE & H. WEISSIG, Eds., Structural Bioinformatics, p. 161–179.

¼ ½

Page 37: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

A study of Amino Acids Binary CodesHuaiguo Fu, Engelbert Mephu Nguifo

CRIL-CNRS FRE2499, Universite d’Artois - IUT de LensRue de l’universite SP 16, 62307 Lens cedex. France.

E-mail: è fu,mephu é @cril.univ-artois.fr

Abstract :

If the biochemical properties of Amino acids can be perfectly described with cer-tain binary codes, it can increase the potential of symbolic artificial intelligencemethods to deal with protein folding problem. Thus we study four kinds of bi-nary codes of amino acids (AA). Two codes of them are based respectively onbiochemical properties, and the two others are generated with artificial intelli-gence (AI) methods, and are based on protein structures and alignment, and onDayhoff matrix. In order to give a global significance of each binary code, weuse a hierarchical clustering method to generate different clusters of each binarycodes of amino acids. Each cluster is examined with biochemical properties togive an explanation on the similarity between amino acids that it contains. Tovalidate our examination, a decision tree based machine learning system is usedto characterize the AA clusters obtained with each binary codes. From this exper-imentation, it comes out that one of the AI based codes allows to obtain clustersthat have significant biochemical properties. As a consequence, it appears thateven if attributes of binary codes generated with AI methods, do not separatelycorrespond to a biochemical property, they can be significant in the whole. Con-versely binary codes based on biochemical properties can be insignificant whenforming a whole.

This work allows to take into account biochemical properties of AA when binarycodes are used to redescribe protein primary sequences.

Keywords: AI in Bioinformatics, Amino acids, Classification, Clustering

1 IntroductionA protein is typically built of a series of amino acids. Amino acids may come in a vari-ety of shapes and properties: they may be small or bulky, hydrophobic or hydrophyllic,electrically charged or neutral, etc... hence allowing for very complex shapes and inter-actions to be produced. So the biochemical properties of amino acids are very importantto analyse problems such as protein sequence alignment, protein secondary or tertiarystructures prediction (Kawashima & Kanehisa, 2000; De la Maza, 1994).

¼"¾

Page 38: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

Along with the development of bioinformatics, more and more methods and tech-niques of Artificial Intelligence (AI) are applied to solve problems of molecular biol-ogy such as protein secondary or tertiary structures prediction (Muggleton et al., 1992).Such techniques are generally based on AA mutation matrices (Kawashima & Kane-hisa, 2000). However there are a lot of symbolic AI techniques based on binary repre-sentations that could be applied in this domain. For example, genetic algorithms wereused to predict protein structure (Unger & Moult, 1993; Pedersen & Moult, 1997).These works used single representation which does not catch many biochemical prop-erties of AA. If the biochemical properties of AA can be perfectly described with certainbinary codes, it will improve performance accuracy on the prediction of protein struc-ture and function from its AA sequences (De la Maza, 1994). Hence in this paper wewill focus on binary representation for AA. It is extended work of (Fu & Mephu Nguifo,2004).

Expressing binary rules is more understandable for human-expert and could be help-ful for providing efficient results explanation to the expert. Many symbolic AI systemsdeal with binary representations. They are unable to treat numerical values as that en-codes in AA mutation matrices.

If several research works are being devoted to AA indices or mutation matrices(Kawashima & Kanehisa, 2000), our investigation of the literature gives rise only tofour works on binary representation of AA : Dickerson & Geis (Dickerson & Geis,1969), Marliere & Saurin (Sallantin et al., 1984), De la Maza (De la Maza, 1994), andfinally Gracy & Mephu (Mephu Nguifo, 1993).

De la Maza used primary and secondary structure of proteins as input, and combinedgenetic algorithm and neural networks algorithm, to generate the binary strings to repre-sent AA. Gracy & Mephu applied a simulated annealing algorithm to Dayhoff’s matrixto generate a binary representation of AA.

Dickerson & Geis and Marliere & Saurin proposed binary codes that are based onbiochemical properties of AA. They consider different classification of biochemicalproperties of AA.

1 0 0 . . . 0 1 C0 1 0 . . . 0 1 A0 1 0 . . . 1 0 N0 1 0 . . . 0 0 E

. . . . . .0 1 0 . . . 1 1 S0 1 0 . . . 1 1 T0 1 0 . . . 0 1 I

Binary code

Clustering Decision tree

....

....******+++++++++

??????????

Clusters

Validation clusters Interpretation

Classification

Figure 1: A view of the process of amino acids code comparison: clustering and deci-sion tree.

In this study, we compare and analyse these 4 methods of AA binary representation(see a view of the process of amino acids code comparison in figure 1). In order tosearch for a global significance of each binary code, firstly, we use a clustering algo-

¼@¿

Page 39: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

A study of Amino Acids Binary Codes

rithm to group AA. Then a machine learning system is used to explain the significancesof the clusters using biochemical data. If the clusters perfectly correspond to certainbiochemical properties of Amino acids, we consider this binary code as a good binaryrepresentation.

For our experimentation, we use an hierarchical clustering method (Ward’s method(Lebart et al., 1984)) to generate different clusters of AA. Each cluster is then examinedby hand to give an explanation on the similarity between AA inside the cluster. To dothis, we extend the representation of biochemical properties (hydrophobicity, charged,bulky, ...) of AA proposed by De la Maza (De la Maza, 1994). In order to validate andexplain the clusters of AA, a public domain version of the decision tree based machinelearning system C4.5 (Witten & Frank, 1999) is used to characterize the AA clustersobtained with the binary code.

The paper is organized as follows. The four AA binary representations are presentedin next section. In the third section, we present hierarchical clustering to generate clus-ters of binary representations, and discuss the results obtained. And then a decisiontree system C4.5 is used to validate the clusters of representations of AA in the fourthsection.

2 Binary representations of amino acids

In this section, we describe the four AA binary representations. Binary representationof AA is a table of twenty rows and different columns. Each column is a property whichcan correspond to a biochemical property. Each row corresponds to an AA, and is a bitstring where “0” means that this AA hasn’t the property, and “1” means that this AAhas the property.

2.1 Binary Codes based on biochemical properties

Two representations based on biochemical properties of AA are described by: Dicker-son & Geis (Dickerson & Geis, 1969), and Marliere & Saurin (Sallantin et al., 1984).

2.1.1 Dickerson & Geis’s code

Dikerson & Geis’s binary representation considers following properties of AA: aliphatic,aromatic, charged, polar, size of AA and hydrophobic (see figure 2). The table of theproperties of AA could be easily transform to a binary representation.

Dickerson & Geis make an analysis of some protein sequences of the heavy andlight chains of immunoglobulines to create this representation. A problem with thisrepresentation is that some AA have exactly the same physical and chemical propertiesin their classification, so the binary code of representation is the same, for example, Aand G, I and L, Y and W, can’t be distinguished. A way to solve this problem could beto add additional properties that allows to distinguish them.

¼ À

Page 40: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

Properties List of AA

Hydrophobic M I L V C A G T K H Y W F

Charged D E K R H

Polar Q N S C D E K R H Y W T

Positive K R H

Small P N D T V C S A G

Tiny A G C S

Aliphatic I L V

Aromatic F Y W H

Figure 2: Classification of biochemical properties of amino acids proposed by Dikerson& Geis.

2.1.2 Marliere & Saurin’s code

Marliere & Saurin propose to represent AA with 8 biochemical properties (Sallantinet al., 1984) (see figure 3). On the basis of these 8 biochemical properties, each AA canbe represented by a bit string like with the Dikerson & Geis’s code. For example, weuse 00010100 to represent the amino acid I (Isoleucine).

Properties List of AA

Side chain with less than 4

heavy atoms excluding H

A C G P S T V

Side chain with more than 4

heavy atoms excluding H

E F H K Q R W Y

Linear side chain A C G K M S

Bulky side chain F H I P T W Y

Side chain with an oxygen D E N Q S T Y

Side chain without Z-H(Z=N,

O, S) bond

A F G I L P V

Charged side chain D E H K R

Side chain with less than 3

Carbon-Hydrogen bonds

C D G N S

Figure 3: Biochemical properties of amino acids proposed by Marliere & Saurin.

This coding is a topologic description of AA. Pingand (Pingand, 1990) show thatin such a coding appear sharply particular choices to certain types of studies, becausecertain criteria such as hydrophobicity in particular are debatable. This coding can bespread with the addition of other properties such as hydrophobicity, hydrophilic, etc.. . .

With such coding, it is also necessary to explicitly or implicitly add negation of prop-erties in order to avoid inclusion between two AA codes. For example, the code of L(respectively R) is included into the code of I (respectively K).

¼@Á

Page 41: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

A study of Amino Acids Binary Codes

2.2 Binary codes based on AI methodsDe la Maza’s (De la Maza, 1994) and Gracy & Mephu’s (Mephu Nguifo, 1993) codesapply AI algorithms to generate binary representations. They propose a complete sys-tem to generate and test the binary representations of AA.

They use different techniques, but the whole structure is the same (see figure 4). Inorder to generate best binary representations, they use searching algorithms to find thebest solution for representation of AA, from some data of AA or protein (e.g. Dayhoff’smatrix, primary and secondary structure of protein).

Coding methods :

1. GA & NN based code

2. SA based code

3. Biochemical propertiesbased code

4. Or other methods

Group representation withClustering algorithmsêDescribe and verify clus-ters with decision tree sys-tem

Generate representations Explain representations

Figure 4: The system structure of the methods based on AI methods.

2.2.1 De la Maza’s binary code of amino acids

De la Maza used the primary and secondary structure of proteins to create AA represen-tations that facilitate secondary structure prediction (see figure 5). A genetic algorithmsearches the space of AA representations. The quality of each representation is quan-tified by training a neural network to predict secondary structure using that representa-tion. The genetic algorithm then uses the performance accuracy of the representation toguide its search and to create AA representations (see an example in figure 2.2.1) thatimprove the performance accuracy.

. . . MDLM . . . A

. . . PHW . . . B

. . . CRAY . . . B

. . . ELVIS . . . A

. . . PITT . . . A

. . . . . . . . . . . .primary structure and

secondary structure data

Genetic Algorithm

Neural Network

1 0 0 1 . . . 0 0 1 G

0 1 0 1 . . . 0 1 0 P

0 1 0 0 . . . 1 0 1 C

0 1 0 1 . . . 0 1 0 Q

. . . . . .

0 1 0 1 . . . 0 1 0 D

1 0 0 1 . . . 0 0 1 SBinary code representation

Figure 5: Generating the representations of amino acids with De la Maza’s system.

¼@Â

Page 42: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

A B C D E F G H I J K L M N O P AA1 0 0 1 0 1 1 1 1 0 1 1 1 0 0 0 Alanine1 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 Cysteine1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 0 Aspartic0 0 1 1 1 1 1 0 1 1 0 1 1 1 1 0 Glutamic0 0 1 0 1 0 0 0 0 0 0 1 1 0 1 1 Phenylalanine1 0 0 1 0 0 1 0 1 1 0 0 1 1 0 1 Glycine1 1 1 0 1 0 0 0 1 0 0 0 0 0 1 1 Histidine0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 Isoleucine1 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 Lysine1 0 1 1 1 1 0 1 0 1 1 1 1 0 0 1 Leucine1 1 1 0 1 1 0 1 0 0 1 0 0 0 1 0 Methionine1 0 0 1 1 1 1 0 1 1 0 0 1 0 0 1 Asparagine0 1 0 1 1 0 0 0 1 0 0 0 0 1 0 0 Proline0 0 1 0 0 1 1 0 0 1 0 0 1 1 1 0 Glutamine0 1 0 1 0 0 0 1 0 0 1 1 0 0 1 1 Arginine0 1 0 0 1 1 1 1 0 1 1 1 1 0 1 1 Serine1 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 Threonine1 1 0 1 1 1 0 0 1 1 1 0 0 1 1 1 Valine0 0 0 0 1 1 0 0 1 1 1 1 0 1 1 0 Tryptophan1 0 1 1 0 1 1 0 0 1 0 1 0 1 0 1 Tyrosine

Figure 6: 16 bits representations of de la Maza’s code.

De la Maza (De la Maza, 1994) describes a system that synthesizes regularity ex-posing attributes from large protein databases. After processing primary and secondarystructure data, this system discovers an AA representation that captures what are thoughtto be the three most important AA characteristics (size, charge, and hydrophobicity) fortertiary structure prediction.

2.2.2 Gracy & Mephu’s binary code of amino acids

Gracy & Mephu’s method uses the Dayhoff matrix and simulated annealing algorithmsto generate the binary code of representation of 20 AA (see figure 2.2.2).

Dayhoff Matrixsimilarity matrix be-tween amino acids

Simulated Annealing

1 0 1 0 . . . 1 0 1 I0 1 0 0 . . . 1 0 1 V1 0 1 0 . . . 1 0 1 L0 1 1 0 . . . 1 0 1 M

. . . . . .1 0 0 0 . . . 1 1 0 Y1 0 0 1 . . . 0 1 0 H

Binary code representation

Figure 7: Generating the representations of amino acids in method of Gracy & Mephu.

½ Ã

Page 43: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

A study of Amino Acids Binary Codes

Dayhoff (Dayhoff, 1972) proposed the matrix of similarity (probability of mutation)for AA. This matrix is a reference of major work of molecular biology, especially withthe protein alignment problem or the protein structure prediction. Gracy & Mephutranslate this similarity matrix into a distance matrix between AA, then use a simulatedannealing algorithm to generate an AA binary representation that allows to approximatesuch distance matrix with the Euclidian distance measure.

Simulated annealing (Kirpatrick et al., 1983) is a stochastic computational techniquefor finding near globally-minimum-cost solutions to large optimization problems. Someresearches and applications have shown that simulated annealing is a technique whichhas a high probability of finding the optimal or a near-optimal solution in a reasonabletime.

Using this method, we can get different representations of AA by changing the pa-rameters of algorithms of this method. For example (see figure 8), we report the best24-bits representation for 20 AA (see (Mephu Nguifo, 1993)).

G 1 0 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0 1 0 1 0 0 1P 1 0 0 1 0 1 0 1 1 0 0 1 0 1 1 0 0 1 1 0 1 0 1 0C 0 1 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 0 1A 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 0 0 1 0 1N 1 0 1 0 1 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0Q 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0E 1 0 0 1 1 0 1 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0D 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0S 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 0 1T 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1I 0 1 0 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1V 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1L 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1M 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1F 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1Y 0 1 1 0 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 0 1 1 0W 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 1 0H 0 1 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1 0K 0 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 0 1 1 0 1 0R 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 1 0

Figure 8: 24 bits amino acids representation generated by Gracy & Mephu’s method.

3 Clustering analysis of binary representations of aminoacids

In the previous section, four methods to represent AA with binary codes are presented.However, we face some questions: What’s the significance of each binary representa-

½ º

Page 44: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

tion? Which is a good AA representation?To answer these questions, we use a clustering algorithm and decision tree system to

validate the AA binary representations.We use a clustering method to capture similarities among AA. This is similar to work

done by de la Maza (De la Maza, 1994) to validate its binary code. This work is anextension to other binary representations.

3.1 Hierarchical ClusteringClustering is often used for discovering structure in data. Cluster analysis (Dervin,1996; K.Chan, 1991) provides a description or a reduction in the dimension of the data.In our work, clustering analysis is used for interpretation of binary representation ofAA.

We use different hierarchical clustering methods available inside the SAS dataminingpackage (SAS, 2001): Ward’s method, Average Linkage method, and Centroid method.We focus our interest on the Ward’s method (Lebart et al., 1984) since different trieswith these clustering methods generates the same clusters on each of the binary codes.

The Ward’s method is an agglomerative hierarchical method which starts with eachobject describing a cluster, and then combines them into more inclusive clusters untilonly one cluster remains. The aim of the Ward’s method is to unify groups such thatthe variation inside these groups is not increased too drastically.

For each binary code, we use the clustering algorithm to generate various number ofclusters from 4 to 9. For example, with the 24 bits representation of Gracy & Mephu’smethod, when the number of clusters is 5, the result obtained is:

Cluster 1 : Asp, Glu, His, Lys, Asn, Pro, Gln, Arg.Cluster 2 : Gly, Met, Val.Cluster 3 : Cys, Ser, Thr.Cluster 4 : Ala, Ile, Leu.Cluster 5 : Phe, Trp, Tyr.For our experimentation, we use the 8 bits code of Dikerson and Geis, the 16-bits rep-

resentation of Marliere and Saurin (adding negation of each property), the best 16-bitrepresentation of de la Maza, and the best 24-bit code reported by Gracy and Mephu.We obtain 24 sets of clusters, each set corresponding to a binary code and a fixed num-ber of clusters.

Each cluster is then examined by hand to give an explanation on the similarity be-tween AA inside the cluster.

3.2 Examination of clustersIn order to search for some significance of the binary codes of AA representations, wefirst examine by hand and analyze each cluster that we have obtained, with biochemicalproperties of AA.

The AA biochemical properties are at the basis of the interpretation of clusters ofbinary representations. We modify and extend the representation of chemical propertiesof AA proposed by De la Maza (De la Maza, 1994) (see figure 3.2). Modifications and

½ »

Page 45: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

A study of Amino Acids Binary Codes

extensions come from discussion reported in recent publications on biochemistry (Kruh,1995; Delaunay, 1997). We add some properties such as the mass, the number of atoms,and the hydrophobicity scale.

As an example, the result of 5 clusters with the 24 bits representation of Gracy &Mephu, is well-adapted to biochemical conditions and these clusters of AA have acertain logic of biochemical affinity:

Phenylalanine, Tryptophane and Tyrosine : aromatic AA with cyclic side chain andno charged.

Alanine, Isoleucine and Leucine : AA with side chains aliphatic.Cysteine, Serine and Threonine : the Threonine differs of Serine by a grouping methyl

and the Cysteine differs of serine by the presence of one atom of sulfur in the place ofthe atom of oxygen.

Asp, Glu, His, Lys, Asn, Pro, Gln, Arg: are more hydrophilic.Gly, Met, Val are hydrophobic.

HYB ARO SOU CHA NEU BAS ACI HYL PI -coo ë�ìîí ï ë�ð MM N.A E.H AAy n n n y n n n 6.02 2.34 9.69 X 89 13 +1.8 Ala(A)n n n y n y n y 10.76 2.17 9.04 12.48 174 26 -4.5 Arg(R)n n n n y n n y 5.40 2.00 8.80 X 132 17 -3.5 Asn(N)n n n y n n y y 2.98 2.09 9.82 3.86 133 16 -3.5 Asp(D)n n y n y n n y 5.02 1.74 10.78 8.33 121 14 +2.5 Cys(C)n n n n y n n y 5.65 2.17 9.13 X 147 20 -3.5 Gln(Q)n n n y n n y y 3.22 2.19 9.67 4.25 146 19 -3.5 Glu(E)y n n n y n n n 5.95 2.34 9.60 X 75 10 -0.4 Gly(G)n n n y n y n y 7.59 1.82 9.17 6.00 155 20 -3.2 His(h)y n n n y n n n 6.05 2.40 9.70 X 131 22 +4.5 Ile(I)y n n n y n n n 5.98 2.36 9.60 X 131 22 +3.8 Leu(L)n n n y n y n y 9.74 2.18 8.95 10.53 146 24 -3.9 Lys(K)y n y n y n n n 5.75 2.30 9.20 X 149 20 +1.9 Met(M)y y n n y n n n 5.45 1.80 9.10 X 165 23 +2.8 Phe(F)y n n n y n n n 6.30 2.00 10.00 X 115 17 -1.6 Pro(P)n n n n y n n y 5.68 2.21 9.15 X 105 14 -0.8 Ser(S)n n n n y n n y 6.53 2.63 10.43 X 119 17 -0.7 Thr(T)y y n n y n n n 5.90 2.40 9.40 X 204 27 -0.9 Trp(W)n y n n y n n y 5.65 2.20 9.11 10.07 181 24 -1.3 Tyr(Y)y n n n y n n n 5.95 2.30 9.60 X 117 19 +4.2 Val(V)

Figure 9: The biochemical properties that are used for representation of AA. (y=yes,n=no, HYB= hydrophobic, ARO=aromatic, SOU=sulfur, CHA=charged, NEU=neutral,BAS= basic, ACI=acidic, HYL=hydrophlic, PI= pI value, molecular MM=mass,N.A=number of atom, E.H= hydrophobicity scale ).

¿From the examination of all the sets of clusters obtained, it appears very often thatthere were always two or three clusters with debatable similarity, except for the previousone : set of 5 clusters of 24-bits representation of Gracy & Mephu.

With the coding of de la Maza, we didn’t obtain the same clusters as that reportedin (De la Maza, 1994). This may be due to the fact that de la Maza uses the Cobwebclustering algorithm which is different from the Ward’s method.

With the coding based on biochemical properties, we were unable to find a set withgood clusters.

From this first observation, it appears that coding based on AI method can have a

½ ¼

Page 46: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

good global significance, whenever coding based on biochemical properties could notbe significant in a whole. This global significance arises from the fact that similaritybetween AA is expressed inside the Dayhoff matrix in the case of Gracy and Mephu, orinside protein sequence alignment and protein structure in the case of de la Maza.

A second observation is made on properties of AI based coding. For the binary repre-sentation based on biochemical properties, each column corresponds to one biochemicalproperty. For representations based on AI methods, we find that properties of binarycodes do not separately correspond to biochemical properties.

Through clusters analysis and examination by hand, we obtain a global analysis ofeach binary representation. In the next section, we use decision tree system to validateour analysis.

4 Interpretation of clusters

In order to verify the clusters of binary representations of AA using biochemical data,we use a public domain of decision tree system C4.5 available in the WEKA package(Witten & Frank, 1999), to predict the cluster of an AA given its biochemical properties.

Using the clusters of representations and the database of biochemical properties shownin figure 3.2, we generate a decision tree to explain the clustering in terms of the bio-chemical properties. This decision tree is used to classify instances. Each node of thetree contains a test on an attribute. The branches that exit from a node correspond to theoutcomes of the test. The leaves of the tree contain clusters.

¿From the results of decision tree, the representation based on AI methods can beshown that it corresponds to some biochemical properties of AA in varying degrees. Itvalidates the accuracy of the clusters of AA representations.

However, even with the clusters and biochemical properties reported in de la Mazapaper (De la Maza, 1994), we were unable to find the explanation tree obtained withthe decision tree system as mentioned in his paper.

The results show that the clustering results of the 24-bits representation producedby Gracy & Mephu’s method are correct and can be characterize with biochemicalproperties. This is in concordance with the results of our examinations by hand. Thisshows that the 24-bits representation produced by Gracy & Mephu’s method is one ofthe best binary representations of AA. This corroborates previous results reported in(Mephu Nguifo, 1993; Landes et al., 1996), where this representation allows to findgood alignment when dealing with weakly homologous protein sequences.

For biochemical based representations, the decision tree can’t give an understandableexplanation of the clusters. This is in concordance with our analysis by hand. Thus bio-chemical properties based representations need to be preprocessed before being used,in order to express a whole significance.

¿From the results of clustering and decision tree process, we know that the methodsof De la Maza, and of Gracy & Mephu based on AI methods can generate good binaryrepresentation of AA. These representations are significant in a whole, and allow to takeinto account biochemical properties when dealing with protein primary structure.

½@½

Page 47: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

A study of Amino Acids Binary Codes

Aromatic = yes: 5 (3.0) Aromatic = no Hydrophobicity <= -1.3: 1 (8.0) Hydrophobicity > -1.3: Hydrophobic = yes

PI <= 5.95: 2 (3.0) PI > 5.95: 4 (3.0)

Hydrophobic = no: 3 (3.0)

Figure 10: An example of decision tree( for 5 clusters of 24 bits representation of Gracy& Mephu).

5 ConclusionThis paper reviews two kinds of AA binary representations respectively based on bio-chemical properties, and on searching methods. A comparative study of these codes isdescribed, and provides some significant results.

Properties of binary codes generated with AI methods, do not separately correspondto biochemical properties, but they are significant in a whole, conversely binary codesbased on biochemical properties can be insignificant when forming a whole. And oneof the codes proposed by Gracy and Mephu generates clusters that have significant bio-chemical properties. This fact seems to confirm a previous experimental result obtainedwith the alignment program, N+ONE (Landes et al., 1996).

Good AA representations can facilitate the prediction of proteins secondary or ter-tiary structure, can allow to find good alignment of proteins primary sequences. Thiswork could allow to improve results of AI methods when dealing with protein foldingproblem as it is well-established that data representation is one of the keys of successof AI methods.

ReferencesDAYHOFF M. (1972). Atlas of protein sequence and structure, nat. Biomedical Research Foun-dation,Washington DC.

DE LA MAZA M. (1994). Generate, Test and Explain: Synthesizing Regularity Exposing At-tributes in Large Protein DataBases. In Proc. of the 27th Hawain Intl. Conf. on System Science(HICSS), p. 123–129, Hawa, USA.

DELAUNAY J. (1997). Biochimie generale.

DERVIN C. (1996). Comment interpreter les resultats d’une classification automatique.

DICKERSON R. & GEIS I. (1969). The structure and actions of proteins. Harper ñ RowPublishers,New York, NY, p. 16 – 17.

FU H. & MEPHU NGUIFO E. (2004). Clustering binary codes to express the biochemical prop-erties of amino acids. In The International Conference on Intelligent Information Processing(ICIIP).

KAWASHIMA S. & KANEHISA M. (2000). AAindex: Amino Acid index database. Nucl. Acids.Res., 28(1), 374–.

½ ¾

Page 48: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

K.CHAN P. (1991). Machine learning in molecular biology sequence analysis.

KIRPATRICK S., GELATT C. & VECCHI M. (1983). Optimization by simulated annealing.Science, 220(4598), 671–680.

KRUH J. (1995). Biologie cellulaire et moleculaire.

LANDES C., BRAS F. & DEZELEE, S.& TENINGES D. (1996). The gene 2 of the sigmarhadovirus genome encodes the p protein, the gene 3 encodes a protein related to the reversetranscriptases of retroelements. Virology, 215, 123–142.

LEBART L., MORINEAU A. & WARWICK K. M. (1984). Multivariate descriptive statisticalanalysis. New York.

MEPHU NGUIFO E. (1993). Concevoir une abstraction a partir de ressemblances. These dedoctorat d’universite, Universite de Montpellier II. 276 pages.

MUGGLETON S., KING R. & STERNBERG M. (1992). Protein secondary structure using logic-based machine learning. Protein Engineering, 5((7)), 647–657.

PEDERSEN J. & MOULT J. (1997). Ab initio protein folding simulations with genetic algo-rithms: simulations on the complete sequence of small proteins. Proteins, S1, 179–184.

PINGAND P. (1990). Thesis of PhD. PhD thesis, University of Montpellier.

SALLANTIN J., MARLIERE P. & SAURIN W. (1984). Description logique des contextes spa-tiaux dans les proteines: application a la conception de polypeptides artificiels. In Actes desjournees Point Curie sur l’intelligence artificielle et l’analyse des sequences biologiques, p.141–153, Paris, France: Institut Curie.

SAS (2001). SAS Publishing, Getting started with enterprise miner.

UNGER R. & MOULT J. (1993). Genetic algorithms for protein folding simulations. MolecularBiology, 231 (1), 75–81.

WITTEN I. H. & FRANK E. (1999). Data Mining: Practical Machine Learning Tools andTechniques with Java Implementations.

½ ¿

Page 49: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

òôó�ó �£õ�ö(�Í÷1÷9�ø�úùRûT�ü�ö9�����lö9�£÷ ó �l�ý�ü�÷9�Í�õ ù��þ�>ÿ�� �@÷1�.�����Í�l ý) �ø�����£õ8ö ó ����(÷����µö � ó�� �����£õ8ö9�6ö9�Í��õ�÷ ÷9ü� �Í�£÷ ó (�lö � �Íõ/�£÷��������

����������� �"!$#&%('*),+ -/.102� 354"6�-��879+;:=<>+;?@?@+A#B�:,BC'EDGFH'*I*JLK��M� !N-�FH#>��JPO535!,Q5-(<R-���354M�"-�3SF�TVU�W�XVYZ:,-����5-�!=#R-�Q�-\[]F���^���5_�-

`=��35!aO5�N� O]��!^� �5!135�5-L�5��3�6 -/4M4M-G��O5O5�N�E_cb�-db5-�35�N�M!C.��Mef35-GO]-/�^JP-/.^.����V.aQSgh��O5O��^-���Q5�^-dQ5-/!��3�.^� Ji�j.�-/!R�5� ��Q5k/.^-��^JP�M���M!N.^-�!lO]� 3��R4m�(_n���^��_/.^k��N�M!��o.��M����Q5-1pq�jJi�"4M4M-/!rQ5-aO5�N��.�k/�M�5-/!�s5#R-/.^.^-��O5O��^�*_cb5-�-/!N.$K���!Nk�-(!^35�$4m�Gpt3�!^�M���uQ5-9O��j�M�^-/!,Q5-2pt����v JP-��V.^!$!N�Mv ���"w�_��j.��x6 -�JP-��V.&!N�MJP�M4M���M�N-�!y 'E�{zA|\sV+r4M4"-R�5-&��k�_�-/!^!N�".�-$��3�_�35�i��4M�"v �5-/Ji-/�f.}JL354x.��"O54M-AO5�^k���4m�jK54M-js `=�j.��^-Rb5-�35�N�M!C.��Mef35-&3E.��x~4M�"!^-L3���� �NQ5� �5�5���5_�-/Ji-/�V.�Qgh35��-P4"�M!C.�-LQ5-i'���z�Q5-L.c�j�M4M4"-�!ab�k/.�k/�^� v����5-/!(-�[*.����j�".�-/!9Q53��N-/3QSgh��O5O5�N-��V.��"!^!^��v -LQSg�35�5-(pq��JP�M4M4"-2Q5-G!Nk�ef35-/�5_�-/!(O��^��.^k��Mef35-/!�s]?@�i!^��35O54M-/!^!N-LQ5-L4M�PJPk/.^b5�EQ�--�!C.(��O5OH� �N.^k�-dO��j�2Q5�"�]k��N-��V.�!a.�I*OH-�!9QSg�� �^Q5���5�����5_/-�JP-��V.nF�����.c��JPJP-��V.2-���pt� ��_/.��"� ��Q5-d4M�Q5�"!^OH� �5�MK��M4M�x.�kjFc� 39�5� �SFnQgh-�[�-/JiO�4M-�!@Q5-;!Nk�ef35-���_�-�!}�Sgh��O5O����C.�-/�����V.@O���!��>4m�Rpq�jJi�"4M4M-}�>_n�j����_�~.�k/�^�"!^-�� y _�� �V.��N-\~�-\[E-�JPO54M-/!c|�s���O5���N.^�M�>QSg�35�5-(�^-/O5�^k/!^-/�f.��j.��"� �8Q53d�N-/3uQg���O�O5�^-/�f.^�M!N!���v -9O��j�35���j3�.�� Ji�j.^-��5����Q5k\.�-��NJi�"�5�M!C.�-jFHO�4M35!N�M-�3��^!1k\.c��OH-�!=Q5-Gv�k��5k/����4"�M!��o.��M���8��� 35!aOH-��NJi-\.^.�-/�V.QSg�� K�.^-��5�"��Q5-/!rJi��.^�"pt!�!^OHk�_/�"w�ef35-/!l�=4m�=pq��JP�M4"4M-js�?@�1v k���k����j4M�M!^�j.��"� �G!/gh-\�-/_/.^3����V.;O�����pt3�!^�M���Q5-/!1'E�{z}FEO535�M!&O5���&35�5�xw�_n�j.^�M� ��Q5-�!$��_/�MQ5-/!,��Ji�"�5k�!l�5� ��_n�j����_\.�k��N�M!C.��Mef35-/!�F54M�L�^-/O5�^k/!^-��V.��o~.��"� ��O5���2��3�.^� Ji�j.�-(O]-/�^JP-/.1QSgh��K�.�-/�5�M�1��4m�Ppt���M!$35�5-d4M�*_n�j4M�M!^�j.��"� ��Q5-/!2��� �5-/!a_����5!^-/�N6 k/-�!-/.l35�5-=Ji�*Q5k/4M�M!^�j.��"� �LQ5-=4�g�-��5_cb5�j�m�5-/Ji-/�V.RQ5-$_�-/!R�/� �5-�!/s��=-1O54"35!�Ff4m�9O5�N�M!^-$-��i_�� JPO�.^-=Q5-/!O5�N� O5�^�"k/.^k�!LO5bVI*!^�"_��j~�_cb5�MJP�Mef35-/!LQ5-/!Z��_��"Q5-�!d�jJi�"�5k�!�O]-/�^JP-/.d35��-�k/.���O]-�!^35O5O�4Mk�JP-��V.����M�N-Q5-,v k/�5k��^��4M�"!��j.^�M� �dQ5-,_/-��N.����M��-�!;O]��!^�".^�M� ��!r�(Q�-�!;v �^��35O]-/!;Q5-=)}�nI*4M� �/sV?@�2pq�jJi�"4M4M-AQ5-/!>�uB�zy �u���N� �>B��V.��N�M�5!N�M_2zr�N��.�-/�M�5!�|l�^-/v �^� 3�O]-aQ�-�!,!^k/e*3�-��5_/-�!=��I ���V.,4m��pt���5_/.^�M� ��Q�-2_n�����j4].��^���5!�~JP-�JLK5�^�������"�^-jsf`=��.^�^-1k\.�35Q�-a�GOH� 35�A� K*�N-/_/.��xp�Q5-=_n���^��_/.^k��^�"!^-/�R4m�9pt� �5_\.��M���iQ5-/!,�uB�z�FEJi���M!��35!N!^�{_/-�4"4M-LO54"35!aO5�Nk�_/�M!^-jFSQSg�35�5-d!N� 35!C~�pq��JP�M4M4"-9Q5-�!(�uB�z�ef35-Z4�g�� ���5� JPJP-��R�j.�-/�C~�!NO]-/_��xw�_y -�[���D$�2z&� F��je*35��O]���^�M��-�|2O����G��O5O]��!^�".^�M� ���u4M�u!N� 35!C~�pq��JP�M4M4"-d�5� �E~��>�o.�-���~�!NO]-/_��"w5_ y -�[��09?�z;�RFv 4"I*_�-/�^� 4}pq��_��"4M�".��j.����c|\sHzr�^�j.�� Ji�j.c��~�?rF]3��5-i�"JiO�4Mk�JP-��V.��j.��"� ��Q5-Z�5�j.��^-Z�jO5O5�^�*_cb5-jFv k/�5���N-2Q5-�!,JP��.��xpt!Ref35�SJP� �V.��N-��V.,Q5-2KH� �5!>�Nk�!^3�4".c�o.�!&-����^��O5OH-�4S-\.&O5�^k/_��"!^�M���SFE.����V.,!^35�&4M�_n�j����_\.�k��N�M!^�j.��"� � ��3¡�5�"6�-n��3 Q5-Z4m��pq��JP�M4"4M-G�uB�z�ef35-ZQ5-Z4m��!N� 35!C~�pq��JP�M4M4"-(¢£�j.^-��C~�'EOH-�_��xw�_�s?rghk\.�35Q�-GQ5-(4m�Lpq��JP�M4"4M-1Q�-�!1�uB�z¤O����,�5��.^�^-���O5O5�N�E_cb5-9pq���x.$��O5O����^�j�M.^�^-94m�dOH-��C.��M��-��5_/-�Q5-/!��3�.^� Ji�j.�-/!=�jO5O5�^�"!�s�¥$�5-�OH-��C.��"�5-��5_/-�ef35�@-/!N.=K5�M-/�uk\.c��K54"�M-2OH� 35�,4M-9O5�N-�JP�M-/�&Ji��.^�"p�_n�j����_�~.�k/�^�"!N.��"ef35-�Q�-�!9�uB�z¦-���_�� JPO����^���M!N� ���n6�-�_d_�-/3E[�Q5-Lzr�^��!^�".^-�-\.aQ5-Lz;�^�j.^.�sH?��i�5��.��"� �8Q5-_���§�.nFE-���.�-/�^JP-1QSg�-��N�^-/35�^!&�5k/_�-/!^!��j�M�^-/!&O]��35�&��_/_�-�OE.�-��R35�5-2!^k/ef35-��5_/-9Q����5!>3�����3�.���J��j.^-�F-�!C.a�"�f.^�^�*Q535�x.�-9O����$4�g��M�5Q��M_�-1¨�-��N�^� �$_�� �N�^-/_/.��"�5v�_/� !N.C¨cs�B�4@O]-/�^JP-/.=Q5-�pq�j�M�^-9�N-�!^!N� �N.^�M�$k�vV�j4M-\~JP-��V.&ef35-24m�(v k���k����o.��M����Q5-2JP��.��xpt!lO54"35!>4"� �5v ! y Q5-aX WL�LO54M3�!>Q5-L��W�WGO]��!^�".^�M� ��!c|r-/.,Q5� ��_O5�N�E_cb5-/!=Q�-�4M�d.^� OH� 4M� v��M-1Q�-�!2�8B�z�O]-/�^JP-/.,35�5-9.��N��!$K]���5�5-9Q5�M!N_��N�MJP�M���j.^�M� ��-��V.��N-�4"->�N-/3�R�j.�-/�C~�!NO]-/_��xw�_9-/.$4M-l�N-/3u�5���E~��R�j.�-/�C~�!^OH-�_��xw�_�s

½AÀ

Page 50: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

½ Á

Page 51: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

© ��ª��õ/÷�ö9(ü1ªGö(�Í�õ ÷1ü ó �î�«>��÷ � � ù�� øl � óa¬ � ø � õ � ö9��­>ü/�®�-n�j�*¯Ezrb5�M4"�MO5OH-1°;-/�N.+r_�� 4"-2Q5-�!$JP�M�5-/!&Q5-(z����N�M!�F5�����f.����M��-�K54"-n��3SF5�{:$D1`a#>+

D�4±ghb5-/35�^-R� ²d4M-/!�.�-/_cb5�5� 4"� v �M-/!{Q5�x.�-�!&³��ab5-�3E.;Q5k�K5�x.�³lpt� 3��^�5�"!^!^-/�V.ref3����V.��x.�k&Q5-&Q5���5�5k�-/!!^3��=4"-�!$v ���5-/!1-\.14"-�35�N!=O5�N�EQ�35�".^!�F�4m�d�^-/_�� ��!N.��N35_/.^�M� �uQ5-�!=�M�V.�-/����_\.��"� �5!$Q5�"6�-��^!N-�!=-��V.��N-G_�-/!_���JiOH� !��j�f.^!&!^-9OH� !^�x.��M���5�5-a_/� JiJP-13��uO5�N� K54"��JP-1pt� �5Q��jJi-/�f.���4OH� 35�&4M�L_�� JPO5�Nk�b5-/�5!^�"� �Q53�pt� ��_/.��"� �5�5-/Ji-/�V.�_/-�4M4"354m�j�M�^-8��3 �5�"6�-n��3�!CI*!N.�k/Ji�"ef35-�s,®�-uO5�^k/!^-/�f.^-��^���a35�5-�Jik\.�b5�*Q5-�F!�g��M��!^O5�"�����V.dQ5-/!Z�^k�_/-��V.�-/!��jO5O5�^�*_cb5-/!PQg���O�O5�^-/�f.^�M!N!���v -�!C.c�j.^�M!N.^�Mef35-�Q5���5!d4M-/!d-�!NO���_�-/!ZQ5-´=�"4MKH-��C.1�i�5��IV��3��N-�O5�N�EQ53��M!��j�f.�FH6*�"!����V.2�P�^-/_�� �5!C.��N35�M�N-L35��v �^��O5b5-�Q5-Gv����5-/!a� 3�Q5-GO5�^�j~.�k/�M�5-/!(��O5���N.^�M�aQ5-dQ�� �5�5k/-�!(!^35�94M-/!2v ����-�!�F@�j�M�5!N�{ef35-ZQSgh3��5-i_/� �5���j�M!^!^���5_/-PO5���N.^�M-�4"4M-�Q53v �^��O5b5-¡�µ�^-/_�� �5!C.��N35�M�N-�s=®�gh�"4M4"35!N.^�^-��^���$_�-/.N.�-���O5O5�N�E_cb�-��n6�-�_£3��5-�.�-��V.��j.��x6 - Q5- �^-/_�� �5!�~.��N35_/.^�M� �uQ5-�!=6 � �"-�!1JPk/.���KH� 4M�"e*3�-�!,-/.1Q53��^k/!^-���3�QSg��M�V.�-/����_\.��M���5!=O��^��.^k��Mef35-/!aQ5-�4m�i4"-/6*35�N-'���_/_cb����N� JGI*_�-/!=_/-��^-\6*�M!^�M��-�s

½ Â

Page 52: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

¾$Ã

Page 53: ATELIER - Inria · connecting them. Results on artificial data illustrate the interest of this approach. Keywords: biochemical network analysis, subgraph mining, commute time dis-tance,

¶ �Í�;·��l �6ö9�Í��õúù�� õ���¸�ü$ ó �ü���@ûT�£÷�ö(�����6ö9�Í��õúù�� ó � ó (� � ö � ÷ ù�� ó �]�ö9�kö(�£÷ ���;� � ªµü=���£÷?��"6j��:$��4M���"6�� 4m��F�®�� ���j.^b���� #Rb5-/�SF�®��*_�-/4"I*�5-G<R�^3��j�5QSF]z}-/.^-��9zrbf35�5v5FS']s®�� !^bf3���'f�>��JP�x~Q���!N!,-/.=zr�M-��N�^-(<>��4"Q5�B��5!C.��".^3�.�-2pt� �$02-���� Ji�"_�!&�j�5Q�<l�M� �"��pt� �^Ji�j.^�M_�!/F*¥=�5�x6 -��N!^�x.�k9Q5-�#>��4M�xpt� �^���M-a�dB��C6E�"�5-

?�-�!�Ji��4Mk�_/354M-/!@_�� JPO]���N.c�j�f.�JP� �M��!�Qgh35��-R_�-/�V.c���"�5-lQg��j.^� JP-�!�-/.}Q5-l4"�m���"!^� ��!�_cb5�MJP�Mef35-/!�N� 35-/�V.235���^¹�4M-9pt� �5Q���JP-��V.���4}-/��_cb5�MJP�M-(� �NvV���5�"e*3�-G-/.1K5�M��4M� v �"-�s�+r4M4"-�!$�M�V.�-/�N6*�M-/�5�5-/�f.a-��-/�]-/.>Q����5!&O54"35!^�"-�35�N!>O5�N� K54Mk/J��o.��Mef35-/!R�Nk�-�4"4M-/!R_���JiJP-1_/-�4M4"-�!/F�O����l-\[E-�JPO54"-�F�Q�-24�g�k�4m�jK]� �^�o~.��"� ��Q�-GO5�N�M�5_/�MOH-�!=��_/.^�"pt!1OH� 35�$4M-/!1JPk�Q5�"_n��JP-��V.^!�sH?�-GQ5k\6 -�4"� O5OH-�JP-��V.2Q5-�v ���j�5Q5-�!1K���!^-/!Q5-&Q5���5�5k�-/!rQ5->.�-/4M!{_/� JPO]� !^���V.�!{_cb��MJP�Mef35-�!�!^� 354"�/6�-��V.;4m�1ef35-/!N.^�M� �ZQ5->4M�1_����5!N.^�^35_\.��M���dQ5-JPk/.�b��EQ5-/!L��3E.�� Ji�j.��"ef35-�!9_n�jO���K54"-�!GQgh-/�º-�!C.��MJP-��(4M-/!�O��^� O5�N�Mk\.�k�!�O5bVI*!^�Mef35-/!�F}_cb5�MJP�Mef35-/!-/.$K5�"� 4M��v �Mef35-/!�s

`=��35!rO5�^��O]� !N� �5!;O54"35!^�"-�35�N!r_�4m�j!^!^-/!lQ�-,pt���5_/.^�M� �5!��5��IV��3iO]� 3��r_�-�!;JP� 4Mk/_�354"-�!rQ5-&OH-/.^�".^-.c�j�M4M4"-;-��d-\[EO54"� �".����V.�4"-�35�N!{�^-/O5�^k/!^-/�f.��j.��"� �5!>�n�GFVY���F�-/.rT �Gsj?��=�^-�O��^k�!N-��V.c�o.��M���8�n� �^-/O]� !N-!^3��L35�5-i_��*Q���v -�'5�uB�?@+�Q�-�!LJP� 4Mk/_�354"-�!/F�4M�u�N-�O5�Nk�!^-/�V.c�j.^�M� �»Y��¼!^35��4M-�3��^!Gv�����O5b�-�!LQ5-/!�j.^� Ji-/!1-\.24M�M���M!N� �5!,-/.14m���N-�O5�Nk�!^-/�V.c�j.^�M� � T �½!^35�a4"-�!aQ��M!N.����5_/-�!a-/�f.^�^-LQ�-�!aO��j�M�^-/!aO5�Nk�Q5k�~w��5�"-�!�QSg��o.�� JP-�!/s�#R-/!L�5��IV�j3E[º!N� �V.L3�.^�M4"�M!^k/!�OH� 35�(4�g�-�!C.��"J��j.^�M� � Q5-i4m��JL3�.���v k����M_��x.�kjF�Q5-4m�9.��n[E�M_/�".�k=-/.&Q�-a4�gh��_/.^�"6*�".^k1�j�f.^�x~�_n���5_/k��N-�35!N-2Q5-aJP� 4Mk/_�354"-�!A�M!^!N35-�!&Q�-aK���!N-�!&Q5-aJP� 4"k�_�3�4M-�!O535K�4M�Mef35-/!,-/.,OH-��^JP-/.N.�-/�f.$QSg�� K�.^-��5�"�,Q5-�!,O5�Nk�Q5�"_/.��"� �5!>.^�^��!&w��jK54M-/!�s

¾�º