Annotation and integrated retrieval of microbial genome sequences

Annotation and integrated retrieval
of microbial genome sequences
Hideaki Sugawara
National Institute of Genetics, Japan
[email protected]
September 5th, 2014
WDCM training course, Beijing
Part I
INTRODUCTION
Cost per Human-sized Genome
Data from the NHGRI Genome Sequencing Program (GSP)
http://www.genome.gov/sequencingcosts/
Cost per Raw Megabase of DNA Sequences
Data from the NHGRI Genome Sequencing Program (GSP)
http://www.genome.gov/sequencingcosts/
•  The following "sequence coverage" values were used in calculating the cost per genome:
   –  Sanger-based sequencing (average read length = 500-600 bases): 6-fold coverage
   –  454 sequencing (average read length = 300-400 bases): 10-fold coverage
   –  Illumina and SOLiD sequencing (average read length = 50-100 bases): 30-fold coverage
•  Production cost
   –  Labor, administration, management, utilities, reagents, and consumables
   –  Sequencing instruments and other large equipment (amortized over three years)
   –  Informatics activities directly related to sequence production (e.g., laboratory information management systems and initial data processing)
   –  Shotgun library construction (required for preparing DNA to be sequenced)
   –  Submission of data to a public database
   –  Indirect costs
•  Non-production cost
   –  Quality assessment/control for sequencing projects
   –  Technology development to improve sequencing pipelines
   –  Development of bioinformatics/computational tools to improve sequencing pipelines or to improve downstream sequence analysis
   –  Management of individual sequencing projects
   –  Informatics equipment
   –  Data analysis downstream of initial data processing (e.g., sequence assembly, sequence alignments, identifying variants, and interpretation of results)
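To make the coverage figures above concrete, the following is a minimal sketch of how fold coverage translates into raw megabases and read counts. The genome size of 3 Gb and the use of read-length midpoints are assumptions made for illustration; they are not taken from the NHGRI data, and no cost figures are computed.

```python
import math

# Coverage values from the slide; read lengths are midpoints of the quoted ranges.
PLATFORMS = {
    "Sanger": {"coverage": 6, "read_length": 550},
    "454": {"coverage": 10, "read_length": 350},
    "Illumina/SOLiD": {"coverage": 30, "read_length": 75},
}

GENOME_SIZE = 3_000_000_000  # assumption: a "human-sized" genome of about 3 Gb


def raw_megabases(genome_size: int, coverage: int) -> float:
    """Raw sequence (in megabases) needed for the given fold coverage."""
    return genome_size * coverage / 1_000_000


def reads_needed(genome_size: int, coverage: int, read_length: int) -> int:
    """Number of reads whose total length equals coverage x genome size."""
    return math.ceil(genome_size * coverage / read_length)


for name, p in PLATFORMS.items():
    mb = raw_megabases(GENOME_SIZE, p["coverage"])
    n = reads_needed(GENOME_SIZE, p["coverage"], p["read_length"])
    print(f"{name}: {mb:,.0f} raw Mb, about {n:,} reads")
```

Multiplying the raw megabases by a cost per raw megabase would give a rough production cost per genome, which is how the two charts relate to each other.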
Statistics of Microbial Genome Sequencing
•  Studies
   –  Metagenomic                      472
   –  Non-Metagenomic               18,766
•  Biosamples
   –  Host-associated                1,704
   –  Engineered                       293
   –  Environmental                  3,140
•  Organisms
   –  Archaea                          926
   –  Bacteria                      36,962
   –  Eukarya                        8,624
•  Projects
   –  Complete Projects              6,555
   –  Permanent Drafts              22,551
   –  Incomplete Projects           20,794
   –  Targeted (not yet started) Projects   925
http://www.genomesonline.org/
The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of
the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Accordingly, the U.S. Government retains a
nonexclusive, royalty-free license to publish or reproduce these documents, or allow others to do so, for U.S.
Government purposes. All documents available from this server may be protected under the U.S. and Foreign Copyright
Laws and permission to reproduce them may be required. The public may copy and use this information without
charge, provided that this Notice and any statement of authorship are reproduced on all copies. JGI is not responsible
for the contents of any off-site pages referenced.
Biosample Distribution Map
http://www.genomesonline.org/genomemap
Genomic Encyclopedia of Bacteria and Archaea: Sequencing a Myriad of Type Strains
Kyrpides NC, Hugenholtz P, Eisen JA, Woyke T, Göker M, et al. (2014).
PLoS Biol 12(8): e1001920.
doi:10.1371/journal.pbio.1001920
64 authors/57 institutions
Box 1. The Value of Type and Reference Strains
Figure 1. From a total of approximately 11,000 bacterial and archaeal type strains, 3,285 (30%) have a publicly known genome project.
Box 2. Global Data Standards
"... Accurate estimates of diversity will require not only standards for data but also standard operating procedures for all phases of data generation and collection ..."
Box 3. Creating a Comprehensive Microbial Genomic Framework
"... Currently recognized genome projects have mapped ~2.8% of that known microbial diversity [13]. Sequencing all of the remaining type strains will increase the phylogenetic coverage encompassed and will then approach 15% of the known bacterial and archaeal diversity, ..."
An integrated catalog of reference genes in the human gut microbiome.
Nat Biotechnol. 2014 Aug;32(8):834-41. doi: 10.1038/nbt.2942. Epub 2014 Jul 6.
Li J. et al. / MetaHIT Consortium (http://www.metahit.eu/)
Sources:
249 newly sequenced samples + 1,018 previously sequenced samples (Human Intestinal Tract: MetaHIT); a cohort from three continents (Europe, China and USA)
Results:
•  The integrated gene catalog (IGC) comprises 9,879,896 genes.
•  The catalog includes close-to-complete sets of higher-quality genes for most gut microbes.
•  Analyses of a group of samples from Chinese and Danish individuals using the catalog revealed country-specific gut microbial signatures.
•  This expanded catalog should facilitate quantitative characterization of metagenomic, metatranscriptomic and metaproteomic data from the gut microbiome to understand its variation across populations in human health and disease.
•  Web site: http://meta.genomics.cn
Marine metagenomics projects: 435 projects
Ref: NCBI BioProject DB (http://www.ncbi.nlm.nih.gov/bioproject/)
Query: ((metagenom*) AND marine) NOT marine[Submitter Organization]
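The BioProject query above can also be expressed as a request to the NCBI E-utilities ESearch service. The sketch below only builds the request URL and does not contact NCBI. The query string is copied from the slide; the choice of `rettype=count` is an assumption about what one would want back, and a count retrieved today will differ from the 435 projects reported in 2014.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string as shown on the slide.
QUERY = "((metagenom*) AND marine) NOT marine[Submitter Organization]"


def build_esearch_url(term: str, db: str = "bioproject") -> str:
    """Return an ESearch URL asking only for the number of matching records."""
    params = {"db": db, "term": term, "rettype": "count"}
    return f"{ESEARCH}?{urlencode(params)}"


url = build_esearch_url(QUERY)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, returns a small XML document whose `Count` element holds the number of matching projects.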
Marine Metagenomics: New Tools for the Study and Exploitation of Marine Microbial Metabolism
Kennedy, J. et al. Marine Drugs. 2010; 8(3): 608–628. Published online Mar 15, 2010. doi: 10.3390/md8030608
Figure 1. Enzyme discovery from metagenomes: functional and sequence-based approaches.
Table 1. Marine enzymes discovered from microbial and metagenomic sources.
Activity:
Esterase; Lipase; Cellulase; Chitinase; Amidase; Amylase; Phytase; Protease; Alkane hydroxylase; Xylanase
Habitat:
Antarctic ice; Antarctic seawater; Arctic sediment; Baltic Sea sediment; Coastal solfataric vent; Deep-sea basin; Deep-sea hydrothermal vent; Deep-sea sediment; Estuary; Fish gut; Hydrocarbon seep; Marine hot spring; Marine sediment/sludges; Marine sponge; Sea hare eggs; Sea saltern; Shipworm; Surface seawater; Tidal flat
Scientific Data
Data matters!!
Credit, Reuse, Quality, Discovery, Open, Service
Scientific Data is an open-access, peer-reviewed publication for descriptions of scientifically valuable datasets. Our primary article-type, the Data Descriptor, is designed to make your data more discoverable, interpretable and reusable.
http://www.nature.com/sdata/
http://www.nature.com/sdata/about/principles
The rise of the data-centric research and publication enterprises
Susanna-Assunta Sansone, PhD (Data Consultant, Honorary Academic Editor)
http://www.slideshare.net/SusannaSansone
Part 2 Annotation
MIGAP: MICROBIAL GENOME ANNOTATION PIPELINE
Contents
•  Concept and usage of MiGAP
•  Start MiGAP
•  Basic operations of MiGAP
•  User levels (b-, s- and g-)
•  Pipeline (workflow) editor of g-MiGAP
Concept and usage of MiGAP
MiGAP (Microbial Genome Annotation Pipeline)
De novo annotation of nucleotide sequences of prokaryotic and eukaryotic microbes
Sugawara H, Ohyama A, Mori H and Kurokawa K.
Microbial Genome Annotation Pipeline (MiGAP) for diverse users.
20th Int. Conf. Genome Informatics (Kanagawa, Japan)
2009: S-001, p 1-2.
Statistics of MiGAP usage
44 publications had used MiGAP by 2013.
Sequences are annotated by prediction programs and by BLAST searches of ORFs against reference databases
[Diagram: prediction programs → ORFs → reference databases]
Results are stored as "features and qualifiers":
•  CDS
–  /EC-Number
–  /function
–  /gene
–  /product
•  RBS
–  /note
•  rRNA
–  /note
•  tRNA
–  /note
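The "features and qualifiers" layout above can be pictured as a small data structure: each feature has a key, a location, and a set of named qualifiers. The following is a sketch of that idea only, not MiGAP's internal representation; the class name, the coordinates, and the qualifier values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    key: str                      # feature key, e.g. "CDS", "RBS", "rRNA", "tRNA"
    start: int                    # 1-based start position (illustrative)
    end: int                      # 1-based end position (illustrative)
    strand: str = "+"
    qualifiers: dict[str, str] = field(default_factory=dict)

    def to_lines(self) -> list[str]:
        """Render the feature in a flat-file-like layout: key, location, then qualifiers."""
        location = f"{self.start}..{self.end}"
        if self.strand == "-":
            location = f"complement({location})"
        lines = [f"{self.key:<16}{location}"]
        for name, value in self.qualifiers.items():
            lines.append(f'{"":<16}/{name}="{value}"')
        return lines


# Hypothetical example values.
cds = Feature("CDS", 190, 255, qualifiers={"gene": "exampleA", "product": "hypothetical protein"})
rbs = Feature("RBS", 178, 183, strand="-", qualifiers={"note": "predicted"})

for feature in (cds, rbs):
    print("\n".join(feature.to_lines()))
```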
MiGAP
Microbial Genome Annotation Pipeline
•  De novo annotation of nucleotide sequences of prokaryotic and eukaryotic microbes
•  Data items in the annotation
   –  ORF (CDS) and RBS
      –  de novo prediction of ORF (CDS) by MetaGeneAnnotator (MGA), Glimmer or Augustus
      –  de novo prediction of the ribosome binding site (in the case of MGA) and start codon
   –  rRNA
      –  de novo prediction by RNAmmer
      –  homology search for 16S rRNA
   –  tRNA
      –  de novo prediction by tRNAscan-SE
   –  Translation of nucleotide sequences to amino acid sequences
   –  Inherit annotation of known amino acid sequences by NCBI blast
      –  refer to top hit
•  Pipeline
   –  a chain of data-processing processes or other software entities (ref http://en.wikipedia.org/wiki/Pipeline)
   –  MiGAP is a branched parallel pipeline.
[Diagram: input → branched parallel pipeline → output]
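One of the data items above, the translation of a nucleotide sequence into an amino acid sequence, can be sketched in a few lines. This is an illustration using the standard genetic code, not MiGAP's own code; the function name and the use of "X" for codons containing ambiguous bases are choices made here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf: str) -> str:
    """Translate an ORF codon by codon; '*' marks a stop codon, 'X' an unknown codon."""
    seq = orf.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate("ATGGCCTAA"))  # → MA*
```

The resulting amino acid sequence is what a BLAST search against protein reference databases would take as its query.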
Reference databases in MiGAP
•  RNA
–  prokaryote
   •  5S
   •  16S
   •  23S
–  eukaryote
   •  5.8S
   •  18S
   •  28S
•  CDS (amino acid sequence DB)
•  Ortholog DB
–  COG
–  KOG
–  eggNOG
•  Non-redundant DB
–  NRAA (daily update)
–  TrEMBL (monthly update)
–  RefSeq (bimonthly update)
Input → MiGAP → output
•  Input
   –  Genomic nucleotide sequences
   –  Single or multiple FASTA, or simple text
   –  Multiple contigs: 10,000 or fewer
   –  Please refrain from submitting singletons from short reads of NGS (Next Generation Sequencers)
•  Output
   –  Links to the result files
      –  Log file: pipeline.log
      –  Nucleotide and amino acid sequences: -na, -aa
      –  Sequence input: *.fasta
      –  Feature definition: *.csv, *.annt
      –  Annotation: *.ddbj, *.embl, *.gbk
   –  Compressed file of multiple contigs: .tar.gz
   –  Result after ORF prediction: result.*
   –  Result after annotation: result-a.*
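The input constraints above can be checked locally before submitting a job. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the accepted alphabet (A, C, G, T, N) is an assumption rather than a documented MiGAP rule, and headerless "simple text" input is not handled. The sample sequences are the ones shown in the b-MiGAP workflow figure.

```python
MAX_CONTIGS = 10_000  # upper limit stated on the slide


def read_fasta(text: str) -> dict[str, str]:
    """Parse single- or multi-FASTA text into {name: sequence}."""
    records: dict[str, list[str]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def check_input(records: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the input looks acceptable."""
    problems = []
    if not records:
        problems.append("no sequences found")
    if len(records) > MAX_CONTIGS:
        problems.append(f"{len(records)} contigs exceeds the limit of {MAX_CONTIGS}")
    for name, seq in records.items():
        unexpected = set(seq) - set("ACGTN")  # assumption: nucleotide alphabet only
        if unexpected:
            problems.append(f"{name}: unexpected characters {sorted(unexpected)}")
    return problems


SAMPLE = """\
>Seq1
ATCTTTTTCGGCTTTT
TTTAGTATCCACAGA
GGTTATCGACA
>seq2
CATTTTCACATTACCA
ACCCCTGTGGACAAG
GTTTTT
"""

records = read_fasta(SAMPLE)
print({name: len(seq) for name, seq in records.items()})
print(check_input(records))
```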
Start MiGAP
Login
User registration at the DDBJ site (http://www.ddbj.nig.ac.jp/)
Startup screen of MiGAP
•  Horizontal menu: Logout, Help, Contact Us (the MiGAP admin)
•  On/off switch for the vertical menu
•  Vertical menu
   –  Pipeline: input
   –  Pipeline history: retrieve results
   –  Change User Level: upgrade from b-MiGAP to s-MiGAP and g-MiGAP
   –  Current Process: check running processes
Basic operation of MiGAP
Input
Current process
Cancel jobs in "Pipeline history"
Pipeline history: list of results of your jobs
Click one of your jobs
Click a contig
Click a feature and browse the details
Display alignment
Downloadable output files for post-processing
•  Log File
   –  pipeline.log
•  N.A.
   –  result-na.fasta: nucleotide sequences of ORFs
•  A.A.
   –  result-aa.fasta: amino acid sequences of ORFs
•  CSV
   –  result.csv: features in CSV file by contigs (before the annotation)
   –  result-a.csv: features in CSV files by contigs (after the annotation)
•  GenBank
   –  result.gbk: nucleotide sequence for the submission to GenBank
   –  result-a.gbk: annotation data for the submission to GenBank
•  EMBL
   –  result.embl: nucleotide sequence for the submission to EMBL
   –  result-a.embl: annotation data for the submission to EMBL
•  DDBJ
   –  result.fasta: nucleotide sequences by contigs (ORF)
   –  result.annt: Features table for the submission to DDBJ (ORF)
   –  result.ddbj: DDBJ format (ORF)
   –  result.ddbj: Multiple DDBJ files (ORF)
   –  result-a.fasta: nucleotide sequences by contigs (Annotation)
   –  result-a.annt: Features table for the submission to DDBJ (ORF)
   –  result-a.ddbj: nucleotide sequences by contigs (Annotation)
   –  result-a.ddbj: Features table for the submission to DDBJ (Annotation)
Pipeline.log
Parameters of tools and versions of databases are recorded
Read Parameter File=Done[2012/05/14 09:51:13]
Pipe Line Name=Sample data
Sequence Filename=direct
Read Sequence File=Done[2012/05/14 09:51:14]
Number of Contig=1
Total Length of Sequence=10530
Write Genbank File=Done[2012/05/14 09:58:12]
Write EMBL File=Done[2012/05/14 09:58:12]
Write DDBJ File=Done[2012/05/14 09:58:12]
Write Information File=Done[2012/05/14 09:58:12]
Write Feature File=Done[2012/05/14 09:58:12]
Create Genome Map=Done[2012/05/14 09:51:25]
Create Feature Map=Done[2012/05/14 09:51:25]
Start Time=1336956674148
End Time=1336956685833
Process ID=53067
Waiting List=0
Unexpected Error=
Memory Status=37MB / 15271MB
Detail=End Annotation
A Start Time=1336956685834
A End Time=1336957092535
Metagene=Done[2012/05/14 09:51:20]
Metagene Version=MetaGeneAnnotator 1.0
Metagene Parameter=-m
tRNAscan=Done[2012/05/14 09:51:20]
tRNAscan Version=tRNAscan-SE 1.23
tRNAscan Parameter=
RNAmmer=Done[2012/05/14 09:51:20]
RNAmmer Version=RNAmmer 1.2
RNAmmer Parameter=-S bac -m tsu,lsu
Blast Version=NCBI BLAST 2.2.18
1st DB Name=RefSeq
1st DB Version=20120308
1st DB Count=3
1st DB Revision=release52
Phase-1=Done[3/3]
2nd DB Name=TrEMBL
2nd DB Version=20120222
2nd DB Count=3
2nd DB Revision=release2012_02
Phase-2=Done[3/3]
3rd DB Name=COG
3rd DB Version=20030417
3rd DB Count=3
3rd DB Revision=
Phase-3=Done[3/3]
16S rRNA=Done[2012/05/14 09:51:22]
16S rRNA Parameter=-F F -a 4
16S rRNA Name=16S rRNA
16S rRNA Version=20090220
16S rRNA Count=1
16S rRNA Revision=
A.A. Mapping Name=bbgbk
Annotation=Phase-3 Done[3/3][2012/05/14 09:58:12]
Annotation Parameter=-F F
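Because pipeline.log is a flat list of key=value lines, it is easy to post-process. The sketch below parses a few lines copied from the log above and derives the run times. Reading "Start Time" and "End Time" as Unix timestamps in milliseconds is an assumption; it is consistent with the dates printed elsewhere in the log when interpreted as Japan Standard Time (UTC+9). The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_pipeline_log(text: str) -> dict[str, str]:
    """Split each 'key=value' line at the first '='; lines without '=' are skipped."""
    entries = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()
    return entries


# A few lines copied from the log shown above.
SAMPLE_LOG = """\
Number of Contig=1
Total Length of Sequence=10530
Start Time=1336956674148
End Time=1336956685833
A Start Time=1336956685834
A End Time=1336957092535
Blast Version=NCBI BLAST 2.2.18
1st DB Name=RefSeq
Unexpected Error=
"""

log = parse_pipeline_log(SAMPLE_LOG)

# Assumption: the time fields are epoch milliseconds.
orf_ms = int(log["End Time"]) - int(log["Start Time"])
annotation_ms = int(log["A End Time"]) - int(log["A Start Time"])

JST = timezone(timedelta(hours=9))
started = datetime.fromtimestamp(int(log["Start Time"]) / 1000, tz=JST)

print(f"started: {started:%Y/%m/%d %H:%M:%S}")
print(f"ORF prediction phase: {orf_ms / 1000:.1f} s")
print(f"annotation phase:     {annotation_ms / 1000:.1f} s")
```

For this sample run the gene prediction phase took about 12 seconds and the BLAST-based annotation phase about 7 minutes for a single 10,530-base contig.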
User levels (b-, s-, g-)
b-MiGAP, s-MiGAP, g-MiGAP
•  b-MiGAP: bronze level for novices (up to 10 jobs)
–  default
•  s-MiGAP: silver level for experienced users (HTML)
–  parameter setting
–  databases selection
–  workflow branch
•  g-MiGAP: gold level for advanced users (Applet)
–  parameter setting
–  databases selection
–  workflow branch
Pipeline (workflow) editor of g-MiGAP
Start
Drag and drop AUGUSTUS
Set 5.8S rRNA
Set 18S rRNA
Set 28S rRNA
Include RNAmmer in the pipeline
Set the 1st blast
Set databases that the 3rd blast will use
Insert a branch after the 1st blast
Insert a branch after the 2nd blast
b-MiGAP
[Figure: b-MiGAP workflow, "Microbial Genome Annotation Pipeline", dated 2012/06/05. Example multi-FASTA input:]
>Seq1
ATCTTTTTCGGCTTTT
TTTAGTATCCACAGA
GGTTATCGACA
>seq2
CATTTTCACATTACCA
ACCCCTGTGGACAAG
GTTTTT
[Recoverable elements of the figure: MGA (ORF; Archaea/Bacteria), tRNAscan-SE (tRNA), RNAmmer (rRNA) and a BLAST search against a 16S rRNA DB (16S rRNA); map drawing; three numbered BLAST steps (1, 2, 3), each with a "Hit? / No" branch, using the RefSeq microbial DB, the TrEMBL DB and COG; threshold labels HM:60% OV:60% (twice) and HM:30% OV:30%; a legend for HM and OV of which only "Percent Identity" survives; outputs: qualifiers, annotation, alignment, HTML, DDBJ file download.]
s-MiGAP and g-MiGAP
[Figure: s-MiGAP (HTML) and g-MiGAP (Applet) workflows for a single-sequence FASTA input. Recoverable elements: ORF prediction by MGA or Glimmer (Archaea, Bacteria, Eukaryote), start codon and stop codon options, tRNAscan-SE, RNAmmer, 16S rRNA DB, user-set HM and OV thresholds for each of the three BLAST steps, and selectable databases: RefSeq microbial DB, TrEMBL DB, NR DB, COG DB, KOG DB, eggNOG DB; outputs: map, annotation, alignment, DDBJ file download.]
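The three numbered BLAST steps with their "Hit? / No" branches can be read as a cascade: an ORF is searched against the first database, and only if no acceptable hit is found does it move on to the next one. The sketch below illustrates that reading and is not MiGAP's implementation. The database order follows pipeline.log (RefSeq, TrEMBL, COG). Interpreting HM as percent identity and OV as percent overlap, and assigning 60/60, 60/60 and 30/30 to steps 1 to 3, are assumptions drawn from the figure labels. All hits are made-up data.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hit:
    subject: str
    identity: float   # assumed meaning of "HM": percent identity
    overlap: float    # assumed meaning of "OV": percent of the query covered


@dataclass
class Stage:
    db_name: str
    min_identity: float
    min_overlap: float


# Order from pipeline.log; thresholds as read from the figure (assumption).
STAGES = [Stage("RefSeq", 60, 60), Stage("TrEMBL", 60, 60), Stage("COG", 30, 30)]

# A search function returns the top hit for (orf_id, db_name), or None.
Search = Callable[[str, str], Optional[Hit]]


def annotate(orf_id: str, search: Search, stages=STAGES) -> Optional[tuple[str, Hit]]:
    """Return (database, top hit) from the first stage whose top hit passes its thresholds."""
    for stage in stages:
        hit = search(orf_id, stage.db_name)
        if hit and hit.identity >= stage.min_identity and hit.overlap >= stage.min_overlap:
            return stage.db_name, hit
    return None  # no stage produced an acceptable hit


# Made-up search results standing in for real BLAST output.
FAKE_RESULTS = {
    ("orf1", "RefSeq"): Hit("refseq_protein_1", 95.0, 98.0),
    ("orf2", "RefSeq"): Hit("refseq_protein_2", 40.0, 90.0),   # fails the 60% identity cut
    ("orf2", "COG"): Hit("cog_entry_1", 35.0, 80.0),
}


def fake_search(orf_id: str, db_name: str) -> Optional[Hit]:
    return FAKE_RESULTS.get((orf_id, db_name))


for orf in ("orf1", "orf2", "orf3"):
    print(orf, annotate(orf, fake_search))
```

Searching the most curated database first and falling back to broader ones keeps annotations specific where possible while still assigning a function to more divergent ORFs.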
Part 3 Integrated Retrieval of Microbial Genome Sequences
MICROBEDB.JP: AGGREGATION OF ENVIRONMENTAL, PHENOTYPIC AND GENOMIC DATA FOR THE STUDY AND UTILIZATION OF MICROBES
Contents
•  Background and concept
•  Information technologies behind the scenes
•  Please try it!
Background and concept
Many microbial databases (DBs) exist ...
Ortholog; Taxonomy; Culture Collection; Genome; Pathogen; Gene Function; Metagenome
Which DBs should we use?
From National Research Council (USA)
Microbes inhabit almost every place on Earth and interact with their environments.
Knowledge of microbes has high potential for scientific and commercial applications.
Promoting the Integrated Use of Life Science Databases in Japan
・ FY 2007-2010 "Integrated Database Project"
→ Database Center for Life Science (DBCLS)

・ FY 2011-
→ National Bioscience Database Center (NBDC)

About NBDC
・ Established in April 2011
・ As part of the Japan Science and Technology Agency (JST), a funding agency supported by MEXT

URL: http://biosciencedbc.jp/?lng=en
Activities by NBDC
1. Formulation of strategies related to coordination and integration of DBs, and international cooperation

2. Creation and management of a portal website from existing life science DBs http://biosciencedbc.jp/?lng=en

3. Funding of R&D of new technology necessary for organizing and linking life science DBs

4. Funding of R&D that coordinates existing and emerging DBs in specific research fields
Includes microbes (PI: Ken KUROKAWA)
Aim of MicrobeDB.jp
To integrate several kinds of microbial data (including omics, taxonomy/cultures, habitats) using Semantic Web technology
We integrate the microbial data that can be linked to genomes.
http://microbedb.jp/
How to aggregate diverse data sources to find hidden relationships among them?
Axes: Gene, Taxon, Environment
Data sources:
•  Ortholog: MBGD
•  Taxonomy: NCBI Taxonomy
•  Genome: GTPS/RefSeq
•  Annotation: TogoAnnotation
•  Culture Collection: NBRC/JCM
•  Metadata: INSDC SRA
•  Metagenome: INSDC SRA
Information technologies behind the scenes
RDF is a standard data model of Semantic Web technology
RDF (Resource Description Framework)
A data model that uses triples (Subject – Predicate – Object)
   Subject: <URI>    Predicate: <URI>    Object: <URI> or literal
   Example: gtps:Gene1 rdfs:label "16S rRNA gene"
A URI node can be linked to other nodes: the object (O) of one triple can be the subject (S) of another
   Gene1 – has Function – GO:0003700
   Genome1 – organism – Escherichia coli
   Organism1 – has Genome – Genome1
   Organism1 – inhabit – Lake
   [KO:03043 appears in the figure as a further object node]
Ontology × Triple store: SPARQL search
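The triple model above can be tried out without any RDF software. The sketch below stores the triples from the figure as plain Python tuples and answers a question by chaining pattern matches, which is the kind of traversal a SPARQL query performs in a real triple store. It is an illustration in plain Python, not an RDF library or SPARQL; the function names are hypothetical.

```python
Triple = tuple[str, str, str]  # (subject, predicate, object)

# Triples taken from the figure.
TRIPLES: list[Triple] = [
    ("gtps:Gene1", "rdfs:label", "16S rRNA gene"),
    ("Gene1", "has Function", "GO:0003700"),
    ("Genome1", "organism", "Escherichia coli"),
    ("Organism1", "has Genome", "Genome1"),
    ("Organism1", "inhabit", "Lake"),
]


def match(triples, s=None, p=None, o=None) -> list[Triple]:
    """Return the triples that agree with every non-None part of the pattern."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def organisms_living_in(triples, habitat: str) -> list[str]:
    """Follow inhabit -> has Genome -> organism to name what lives in a habitat."""
    names = []
    for organism, _, _ in match(triples, p="inhabit", o=habitat):
        for _, _, genome in match(triples, s=organism, p="has Genome"):
            for _, _, name in match(triples, s=genome, p="organism"):
                names.append(name)
    return names


print(match(TRIPLES, s="Gene1"))
print(organisms_living_in(TRIPLES, "Lake"))
```

The second query works only because "Genome1" is both the object of one triple and the subject of another, which is the linking property the slide points out.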
When data are prepared in RDF, the database management system automatically recognizes identical resources.
How to integrate the data from two different DBs?
[Figure: DB 1 holds triples such as Gene1 – has Function – GO:0003700, Genome1 – organism – Escherichia coli, Organism1 – has Genome – Genome1 and Organism1 – inhabit – Lake. DB 2 holds triples about Organism 1, Enzyme 1, Compound 1 and Medium 1 with predicates such as "can Use", "can Produce" and "can Grow". The two graphs are joined through shared URIs or through an owl:sameAs link.]
1.  When two DBs use the same URI, their data are already integrated.
2.  If not, you can integrate the data of the two DBs by adding one triple (db1:A owl:sameAs db2:B).
You don't need to place all of these data in one DB management system.
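The effect of adding a single owl:sameAs triple can be sketched as follows. Two small triple sets use different identifiers for the same organism; one linking triple lets a merge step rewrite both identifiers to a common one so the facts line up. This is a simplified illustration in plain Python, not how a triple store implements owl:sameAs reasoning. The identifiers and the assignment of "can Grow" to this organism are illustrative.

```python
SAME_AS = "owl:sameAs"

# Illustrative triples; db1/db2 prefixes mark which database a resource comes from.
DB1 = [
    ("db1:Organism1", "inhabit", "Lake"),
    ("db1:Organism1", "has Genome", "db1:Genome1"),
]
DB2 = [
    ("db2:Organism_1", "can Grow", "db2:Medium1"),
]
LINKS = [
    ("db1:Organism1", SAME_AS, "db2:Organism_1"),  # the single added triple
]


def integrate(*graphs):
    """Merge triple lists, replacing resources declared owl:sameAs by one representative."""
    triples = [t for graph in graphs for t in graph]
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(o)] = find(s)

    return {(find(s), p, find(o)) for s, p, o in triples if p != SAME_AS}


merged = integrate(DB1, DB2, LINKS)
for triple in sorted(merged):
    print(triple)
```

After the merge, the habitat from DB 1 and the growth medium from DB 2 are attached to the same node, so a single query can ask which media support organisms found in lakes.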
How can we determine whether resources in two DBs are the same or not?
You should describe your resources using ontologies.
An ontology is a structured controlled vocabulary for describing the properties and types of resources.
For example, it can answer: What is soil? What is the relationship between soil and sand?
•  MEO (Microbes Environmental Ontology)
•  PDO (Pathogenic Disease Ontology)
•  MCCV (Microbial Culture Collection Vocabulary)
•  MSV (Metagenome Sample Vocabulary)
•  MPO (Microbial Phenotype Ontology)
•  MBGD Ortholog Ontology
Most of them can be obtained from
[Figure: environments (sea water, soil, human gut) → Metagenome (Environment) → sequence similarity search → Genome (Taxon) → gene clustering using sequence similarity → Ortholog (Gene)]
We have converted most of our data to RDF, developed many ontologies, and developed an RDFized microbial DB.
http://microbedb.jp/
More than 1 billion triples!
[Figure: data sources arranged along the Gene, Taxon and Environment axes. Ortholog: MBGD; Taxonomy: NCBI Taxonomy; Genome: GTPS/RefSeq; Annotation: TogoAnnotation; Culture Collection: NBRC/JCM; Metadata: INSDC SRA; Metagenome: INSDC SRA. Red color indicates our collaborators.]
RDF conversion example
JCM/NBRC Culture Collection data
1.  Strain_Number
2.  Other_Collection_Numbers
3.  Name
4.  Organism_Type
5.  History_of_Deposit
6.  Date_of_Isolation
7.  Isolated_from
8.  Geographic_Origin
9.  Status
10. Optimum_Temperature_for_Growth
11. Maximum_Temperature_for_Growth
12. Minimum_Temperature_for_Growth
13. Medium
14. Application
15. Literature
Example of NBRC Culture Collection RDF data
[Figure: RDF graph for the strain nbrc:NBRC_12841, an instance (rdf:type) of :MCCV_000001 (Culture). The graph layout was lost in extraction; the recoverable nodes and labels are listed below.]
•  MCCV terms used as predicates: :MCCV_000012, :MCCV_000014, :MCCV_000017, :MCCV_00018, :MCCV_00022, :MCCV_00023, :MCCV_000025, :MCCV_000026, :MCCV_000027, :MCCV_000028, :MCCV_000033
•  Other predicates: rdf:type, rdfs:label, dc:identifier
•  Predicate labels: "Strain Number", "Application", "Optimal growth temperature", "Type Strain", "History of deposit", "Isolated from"
•  Medium: nbrcmedium:NBRC_227
•  Other collection number: "DSM 40226", <http://www.dsmz.de/catalogues/details/culture/DSM-40226.html>
•  Application: "Thienamycins production ; Vitamin B12 (Cyanocobalamine) production ; Steroid conversion"
•  Taxonomy links: <http://identifiers.org/taxonomy/67274>, <http://www.ncbi.nlm.nih.gov/taxonomy/67274>, <http://purl.uniprot.org/taxonomy/67274>, <http://identifiers.org/taxonomy/67263>, <http://www.ncbi.nlm.nih.gov/taxonomy/67263>, <http://purl.uniprot.org/taxonomy/67263>
•  Typed literals: "false"^^xsd:boolean, "28"^^<http://www.w3.org/2001/XMLSchema#integer>
•  History of deposit: "IFO 12841 <-- SAJ <-- OWU (ISP 5226) <-- Squibb & Sons (F. Arnow, MD 2428, ETH 24234, NIHJ 501)"
•  Isolation source: meo:MEO_0000007, rdfs:label "Soil"
•  Name: "Streptomyces griseus subsp. griseus (Krainsky 1914) Waksman and Henrici 1948"
Overall data structure of MicrobeDB.jp
Stanza Development
To obtain biological knowledge from raw data (sequence and metadata), we developed a variety of "Stanzas". A Stanza is a compact, modular, and reusable application for data analysis.
[Figure: 16S amplicon workflow. fastq → UCLUST (identity > 97%, cov > 90%) → OTUs → remove chimeras (UCHIME reference mode, UCHIME de novo mode) → clean OTUs → taxonomic assignment using RDP Classifier → analyze data by using the Stanza]
Example analyses:
•  Correlation analysis between gene abundance and metadata
•  Comparison of taxonomic composition
Stanza categories in MicrobeDB.jp
Categories: Genes, Taxon, Environment
•  Gene Definition
•  Gene Publication
•  Ortholog Definition
•  Gene Annotation
•  Ortholog Group Members
•  Ortholog Cluster
•  Genome Information
•  GTPS Gene/Genome Feature
•  RefSeq Gene/Genome Feature
•  GTPS Genome
•  GTPS Genome Definition
•  Other Collection Numbers
•  Pathogen Information
•  Phenotype Information
•  RefSeq Genome
•  RefSeq Genome Definition
•  Strain Definition
•  Strain Genome
•  Strain Reference
•  Taxon Definition
•  Taxon Hierarchy
•  Sample Function
•  Mapping to Environment (Chromosome)
•  Mapping to Environment (Plasmid)
•  Ortholog Abundance among Environments
•  Ortholog Abundance in Environment
•  Disease Definition
•  Environment Definition
•  MEO Hierarchy
•  MEO Ontology View
•  Meta16S Sample List
•  Metagenome Sample List
•  Numeric Metadata Histogram
•  Sample Definition
•  Sample Metadata
•  SRS Cross Reference
•  Genome-Sequenced Strains
•  Symptom List
•  Sequenced Genome List
•  Strain List
•  Taxonomic Composition of Genomes
•  Taxonomic Composition of Meta 16S
•  Human Meta Body Mapping
•  Strain Metadata
Stanza Example
・ Gene Annotation
・ Ortholog list
・ Genome Information
Stanza Example
・ Taxonomic composition of 16S rRNA gene amplicon sequencing analysis
・ Functional and taxonomic composition of a metagenome sample
Stanza Example
You can see the distribution pattern of a taxon in the human body.
http://microbedb.jp/
Keyword example: lake
MEO hierarchical structure: meo:pond is_a meo:lake
Strain_A mccv:isolation_source meo:pond → Strain_A is retrieved for "lake"
Results returned for the keyword:
•  Genome-sequenced strains isolated from lake
•  JCM/NBRC strains isolated from lake
•  Abundant orthologs in metagenome samples obtained from lake
•  Metagenome samples obtained from lake
•  Taxonomic composition of 16S amplicon sequencing sampled from lake
MicrobeDB.jp will facilitate the exploration of the existing scattered information on microbes.
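The "lake" example relies on the ontology hierarchy: a strain recorded as isolated from a pond is also returned for "lake" because meo:pond is_a meo:lake. The following is a minimal sketch of that expansion in plain Python. The two facts about meo:pond and Strain_A come from the slide; Strain_B, the term meo:soil, and the function names are hypothetical additions for contrast.

```python
# child term -> parent term (the first entry is from the slide)
IS_A = {
    "meo:pond": "meo:lake",
}

# strain -> isolation source (Strain_A is from the slide; Strain_B is hypothetical)
ISOLATION_SOURCE = {
    "Strain_A": "meo:pond",
    "Strain_B": "meo:soil",
}


def self_and_ancestors(term: str, is_a: dict[str, str]) -> list[str]:
    """Walk up the is_a chain from a term, guarding against cycles."""
    chain = []
    while term is not None and term not in chain:
        chain.append(term)
        term = is_a.get(term)
    return chain


def strains_isolated_from(term: str) -> list[str]:
    """Strains whose isolation source is the term itself or any narrower term."""
    return sorted(
        strain
        for strain, source in ISOLATION_SOURCE.items()
        if term in self_and_ancestors(source, IS_A)
    )


print(strains_isolated_from("meo:lake"))
print(strains_isolated_from("meo:pond"))
```

Without the ontology, a keyword search for "lake" would miss every record annotated with a more specific term.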
Plan of integration between public genome data and user genome data in MicrobeDB.jp
Microbial genome sequence data producer
→ Automatic microbial genome annotation
→ Input metadata related to the genome
→ Convert genome data to the RDF format
→ Integrate public genome data and user genome data
Public microbial genome sequence data: RefSeq, GTPS
Please try it!
Acknowledgements
MiGAP
Dr. Akira OHYAMA
In silico biology
http://www.insilicobiology.jp/

MicrobeDB.jp
Assistant Professor Hiroshi MORI and Professor Ken KUROKAWA
Tokyo Institute of Technology, Graduate School of Bioscience and Biotechnology, Department of Biological Information
http://microbedb.jp/

Supported by:
WFCC-MIRCEN World Data Center for Microorganisms
http://www.wdcm.org/
Bureau of International Cooperation, Chinese Academy of Sciences
China National Committee, the Committee on Data for Science and Technology (CODATA)