Presentation Transcript
Exploitation of Structural Similarity in Semi-Structured Bioinformatics Data for Efficient Storage Construction
Dongkyoo Shin (shindk@sejong.ac.kr)
Sejong University, InCoB 2007
Table of contents
Abstract
Background
Methods
Results
Conclusions
Abstract (1)
Background
Much research has addressed storing XML data
Reduce the number of joins between tables
Not well suited to microarray data with its distinctive hierarchy
Hierarchical feature of the microarray data model
A few core values occur repeatedly
New approach for capturing this feature
Classify elements with similar structure into a group
Design a common database table for each group
Abstract (2)
Results
Database schema created by our approach
Remarkably reduces the number of table joins
Improves performance of storing and loading XML-based microarray data
Conclusions
Mining the structural similarity of elements is an efficient way to improve the performance of storing and loading microarray data
Background (1)
DTD (Document Type Definition)-dependent base
Map one element into one table
For each e ∈ E: #(S) ≥ 1 OR #(A) ≥ 1 -> define_Class(e)
For each Se ∈ S -> Add_attributes_of_Class(e)
Se ∈ SequenceType -> Define_multivalued_att(Se, e)
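The one-element-per-table mapping above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original implementation: the `Element` class, the `dtd_dependent_tables` function, the `id` and `_ref` column naming, and the sample element names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for one element declared in a DTD."""
    name: str
    attributes: list = field(default_factory=list)    # A: attribute names of e
    sub_elements: list = field(default_factory=list)  # S: names of sub-elements of e


def dtd_dependent_tables(elements):
    """Map every element with at least one sub-element or attribute to its own table.

    Returns a dict of table name -> list of column names.
    """
    tables = {}
    for e in elements:
        if len(e.sub_elements) >= 1 or len(e.attributes) >= 1:
            # Attributes become columns; each sub-element becomes a reference column.
            columns = ["id"] + list(e.attributes)
            columns += [f"{s}_ref" for s in e.sub_elements]
            tables[e.name] = columns
    return tables


# Illustrative element names only; they are not taken from the MAGE-ML DTD.
elements = [
    Element("Experiment", ["identifier"], ["Description"]),
    Element("Description", ["text"]),
]
tables = dtd_dependent_tables(elements)
```

In this sketch every qualifying element gets its own table, so the schema grows with the DTD, and reading back a nested document needs a join for each parent-to-child reference.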
Background (2)
Inline technique base
Reduce the complexity of the DTD (Document Type Definition)
For each e: #(S) == 1 AND Se ∈ SequenceType
-> Add_Multi-valued_attribute_of_ParentClass(e)
Background (3)
Drawbacks of previous approaches
DTD-dependent
Database schema has the same complexity as the DTD
Inline technique
Depends strongly on the number of omissible elements
New design approach for microarray database
Capture similar structural features of microarray data
Need a fast and simple way to mine the structural features
Background (5)
Microarray data and MAGE (Microarray Gene Expression) standards
Research groups share microarray data with others and use it to answer their biological questions
MGED Society's standard definitions
MIAME (Minimum Information About a Microarray Experiment)
MAGE-OM and MAGE-ML
Exchange object model and format for MIAME
Structural feature of MAGE-OM
A wide variety of objects define the same data types, including complex types
Background (6)
Decision Tree
A simple model that makes classification rules, correlations, and effects between variables easy to understand
Suited to mining structural features of the MAGE-ML DTD itself (not MAGE-ML instances!)
All elements can be classified into three levels:
A root, a mediators group, and a bottoms group
Methods (1)
Classification of core features using decision tree
Terminology for expressing a complexType
e: an element defined in XML schema
E: the set of all elements e
SE: the set of sub-elements of e
a: an attribute of e
A: the set of attributes of e
SA: the set of attributes of all sub-elements of e
complexType: structural information that consists of SE and/or A of e
Lowest Child: an element without a sub-element
Lowest Parent: an element with a sub-element that is one of the Lowest Child elements
PG (Parent Group): a set of candidate elements to be parents of a Lowest Child
LPCG (The Lowest Parent Candidate Group): a set of candidates to be Lowest Parent
LCG (The Lowest Child Group): a set of Lowest child elements
LPG (The Lowest Parent Group): a set of Lowest Parent elements
ULPG (Upper Level Parent Group): a set of upper level parents, including elements that are neither Lowest Child nor Lowest Parent
Methods (2)
Expression of a complexType
A complexType defines structural information of elements
A set of arrays including data type
Definition of structural similarity
SE_{ele_x} = {e_1, e_2, ..., e_n}, SA_{ele_x} = {A_{e_1}, A_{e_2}, ..., A_{e_n}}
complexType(ele_x) = {SE_{ele_x}, SA_{ele_x}}
ele_x and ele_y are structurally similar if complexType(ele_x) == complexType(ele_y)
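One way to make this similarity test concrete is to build a hashable value for each element's complexType and compare the values for equality. The Python sketch below is only an illustration: the `Element` class, the function names, and the sample element names are hypothetical, and including the element's own attribute set A in the comparison is an assumption (taken from the terminology slide, where complexType consists of SE and/or A) so that elements without sub-elements can still be told apart.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical in-memory form of a schema element."""
    name: str
    attributes: dict = field(default_factory=dict)    # A: attribute name -> data type
    sub_elements: list = field(default_factory=list)  # SE: child Element objects


def complex_type(e):
    """Return a hashable (A, SE, SA) description of e.

    Assumption: A (the element's own attributes) is included alongside SE and SA
    so that elements without sub-elements can still be distinguished.
    """
    a = frozenset(e.attributes.items())
    se = frozenset(child.name for child in e.sub_elements)
    sa = frozenset(
        (child.name, frozenset(child.attributes.items()))
        for child in e.sub_elements
    )
    return (a, se, sa)


def structurally_similar(x, y):
    """Two elements are similar when their complexType values are equal."""
    return complex_type(x) == complex_type(y)


# Illustrative elements only; the names are not taken from MAGE-ML.
leaf = Element("Name", {"value": "CDATA"})
parent_a = Element("ParentA", sub_elements=[leaf])
parent_b = Element("ParentB", sub_elements=[leaf])
similar = structurally_similar(parent_a, parent_b)
```

Using frozensets makes the comparison independent of declaration order and lets a complexType serve as a dictionary key when elements are grouped later.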
Methods (3)
Decision Tree for recognizing the core features
Condition 1: If rule 1 is satisfied, then e arrives at LCG. Otherwise, it arrives at PG.
Condition 2: If rule 2 is satisfied, then e and the elements similar to e arrive at a new LCG.
Condition 3: If rule 3 is satisfied, then e arrives at LPG. Otherwise, it arrives at ULPG.
Condition 4: If rule 4 is satisfied, then e and the elements similar to e arrive at a new LPG.
Methods (4)
Classification rules
Rule 1
Decide whether an element belongs to group LCG or PG
For each ei ∈ E {
    if (number of elements in SEei == 0) {
        ei is classified into LCG;
    } else {
        ei is classified into PG;
    }
}
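Rule 1 translates almost directly into Python. In the sketch below, the `rule1` function name, the dictionary representation of E (element name mapped to the list of its sub-element names), and the sample names are assumptions made for illustration.

```python
def rule1(sub_elements):
    """Split elements into LCG (no sub-elements) and PG (at least one).

    sub_elements: dict mapping each element name in E to the list SE of its
    sub-element names.
    """
    lcg, pg = [], []
    for name, se in sub_elements.items():
        if len(se) == 0:
            lcg.append(name)
        else:
            pg.append(name)
    return lcg, pg


# Illustrative names only.
schema = {"Experiment": ["Description"], "Description": []}
lcg, pg = rule1(schema)
```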
Methods (5)
Classification rules
Rule 2
Classify multiple sets of LCG
p = 0;
For each ei ∈ LCG0 {
    Flag = 0;
    If (p > 0) {
        For q = 1 to p
            If (complexType(ei) == complexType(element in LCGq)) {
                ei is classified into LCGq;
                Flag = 1;
            }
    }
    If (Flag == 0) {
        p = p + 1;
        ei is classified into a new group LCGp;
        For each ej ∈ LCG0
            If (complexType(ei) == complexType(ej)) {
                ej is classified into LCGp;
            }
    }
}
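The grouping in Rule 2 can also be sketched with a dictionary keyed by complexType, which replaces the nested loops of the pseudocode with a single pass. This is a swapped-in technique rather than a transcription of the slide; the `rule2` function, the `complex_type` callback, and the sample values are hypothetical.

```python
from collections import defaultdict


def rule2(lcg0, complex_type):
    """Partition LCG0 into groups LCG1..LCGp of elements with equal complexType.

    lcg0: iterable of element names.
    complex_type: function returning a hashable complexType for a name.
    Returns a dict mapping group number p (starting at 1) to its members.
    """
    by_type = defaultdict(list)
    for e in lcg0:
        by_type[complex_type(e)].append(e)
    # Groups are numbered in order of first appearance.
    return {p: members for p, members in enumerate(by_type.values(), start=1)}


# Illustrative complexType values; real ones would come from the schema.
types = {"a": ("identifier",), "b": ("identifier",), "c": ("value",)}
groups = rule2(["a", "b", "c"], types.get)
```

An element with no similar partner still ends up in a group of its own, which matches the pseudocode, where each element is always similar to itself.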
Methods (6)
Classification rules
Rule 3
Separate elements in PG into two groups: LPG and ULPG
For each ei ∈ PG {
    if (SEei contains an element of LCG) {
        ei is classified into LPG;
    } else {
        ei is classified into ULPG;
    }
}
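A Python sketch of Rule 3 follows. The operator in the slide's condition did not survive extraction, so the test used here (an element joins LPG when at least one of its sub-elements is in LCG) is an assumption based on the definition of Lowest Parent in the terminology slide. The `rule3` function and the sample names are also hypothetical.

```python
def rule3(pg, sub_elements, lcg):
    """Split PG into LPG and ULPG.

    pg: list of parent element names.
    sub_elements: dict mapping an element name to the list SE of its sub-element names.
    lcg: collection of Lowest Child element names.
    Assumed reading: an element is a Lowest Parent if any sub-element is in LCG.
    """
    lcg_set = set(lcg)
    lpg, ulpg = [], []
    for e in pg:
        if lcg_set.intersection(sub_elements[e]):
            lpg.append(e)
        else:
            ulpg.append(e)
    return lpg, ulpg


# Illustrative names only.
schema = {"Root": ["Experiment"], "Experiment": ["Description"], "Description": []}
lpg, ulpg = rule3(["Root", "Experiment"], schema, ["Description"])
```

If the intended condition is instead that all sub-elements must belong to LCG, the test would become `set(sub_elements[e]) <= lcg_set`.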
Methods
Classification rules
Rule 4
Classify multiple sets of LPG
p = 0;
For each ei ∈ LPG0 {
    Flag = 0;
    If (p > 0) {
        For q = 1 to p
            If (complexType(ei) == complexType(element in LPGq)) {
                ei is classified into LPGq;
                Flag = 1;
            }
    }
    If (Flag == 0) {
        p = p + 1;
        ei is classified into a new group LPGp;
        For each ej ∈ LPG0
            If (complexType(ei) == complexType(ej)) {
                ej is classified into LPGp;
            }
    }
}
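Rule 4 applies to LPG0 the same grouping that Rule 2 applies to LCG0, so the four rules can be chained into one pass over the schema. The Python sketch below is a hedged end-to-end illustration, not the authors' implementation: the function names, the dictionary-based input, the reading of Rule 3 as "any sub-element is in LCG", and the sample schema are all assumptions.

```python
from collections import defaultdict


def group_by_type(names, complex_type):
    """Rules 2 and 4: collect names with equal complexType into one group."""
    by_type = defaultdict(list)
    for n in names:
        by_type[complex_type(n)].append(n)
    return list(by_type.values())


def classify(sub_elements, complex_type):
    """Apply Rules 1 to 4 in order and return the resulting groups."""
    # Rule 1: no sub-elements -> LCG, otherwise PG.
    lcg0 = [e for e, se in sub_elements.items() if not se]
    pg = [e for e, se in sub_elements.items() if se]
    # Rule 3 (assumed reading): a parent with any sub-element in LCG -> LPG.
    lcg_set = set(lcg0)
    lpg0 = [e for e in pg if lcg_set.intersection(sub_elements[e])]
    ulpg = [e for e in pg if not lcg_set.intersection(sub_elements[e])]
    return {
        "LCG": group_by_type(lcg0, complex_type),  # Rule 2
        "LPG": group_by_type(lpg0, complex_type),  # Rule 4
        "ULPG": ulpg,
    }


# Illustrative schema and complexType labels only.
schema = {"Root": ["P1", "P2"], "P1": ["a"], "P2": ["b"], "a": [], "b": []}
labels = {"Root": "root", "P1": "parent", "P2": "parent", "a": "leaf", "b": "leaf"}
result = classify(schema, labels.get)
```

Each inner list under `"LCG"` and `"LPG"` is a group of structurally similar elements, that is, a candidate for one common database table as described in the abstract.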
Result (1)
Database design by the proposed decision tree
Result (2)
Database space complexity
Time complexity
Result (3)
Reconstructing the XML Document
Conclusions
Proposed approach
Mine elements with structural similarity from the XML Schema for biological information
Experimental result
Mining the structural similarity of the object model is well suited to microarray data and more efficient than previous approaches
Future work
Plan to extend the current classification rules to the root, LCG, LPG, and ULPG, respectively