by Pjotr Prins, summer 2010
This is the journal of adding BAM/SAM support to BioLib. Here I show what it takes to do a full SWIG mapping of an important C library.
There are three interesting implementations of BAM/SAM support. The Picard library (Java), Samtools API (C) and EMBOSS (C). A description of the Sequence Alignment/Map format (SAM) can be found here. SAM is a textual format, and BAM is the matching binary format. From the specification it is clear that BAM/SAM is a rather extensive format, for large files, and benefits from fast C parsing.
As a matter of fact, the Picard library and the EMBOSS version are derived from SAMtools. For BioLib we decided to map native SAMtools.
For Perl, Python and Ruby, people have also written SAMtools wrappers. The Ruby version is incomplete and the others differ in implementation phylosophy. All these projects need to be tested and maintained. Here we try to consolidate the *low* level API for all three langauges in one SWIG binding.
Some use cases for a BAM/SAM library (by Jan Aerts):
Notice that some of these tasks are read-centric, and can probably be done by iterating over the reads one by one (no memory worries). Others are contig-centric, and this will require much more care.
On the EMBOSS mailing list Peter Cock wrote:
Further down the line it'd be nice to include functionality of GATK as well:
BioLib stores existing project repositories in git - so they can be pulled in as git 'submodules'. This allows keeping track of versions of SAMtools bindings.
Note: don't do this, this is just to show what is needed to add a project to BioLib.
Pull the SAMTools SVN into git:
cd ~/izip/git/opensource/contrib/ git svn clone https://samtools.svn.sourceforge.net/svnroot/samtools/trunk/samtools cd samtools git remote add origin git@github.com:pjotrp/samtools.git git push origin master
and update with
cd samtools git svn rebase git push origin master
Note: don't do this, this is just to show what is needed to add a project to BioLib.
Do not use the same git dir, as above, but add the submodule using the *public* link from github.
cd biolib git checkout -b samtools # create new git branch git submodule add http://github.com/pjotrp/samtools.git ./contrib/samtools
Now we have the source code for SAMtools in place.
To compile the C library in biolib a CMakeLists.txt file needs to be written in ./src/clibs/samtools/src/CMakeLists.txt and the required C files need to be in the accompanying files.txt.
The first compile led to a bug. If _USE_KNETFILE is not defined bam_readers.c breaks on line 28 (x does not exist). So we define _USE_KNETFILE with:
ADD_DEFINITIONS(-D_USE_KNETFILE)
Running
cmake . make
creates a shared library ../build/libbiolib_samtools-0.0.6.so.
(until this point was one hour of work)
First we start with a simple binding. I use Ruby first, as I am happiest in that language.
The first step is top open a file, as defined in the API
The function 'samopen' is defined in sam.h, so we are going to tell SWIG to parse it.
First create the CMakeFile.txt for Ruby in ./src/mappings/swig/ruby/samtools. Next tell SWIG to include sam.h in samtools.i.
Once it compiles, I merely need to add an integration test in ./src/mappings/swig/ruby/test/test_samtools.rb
Which is automatically called with
make test
effectively calling (see Testing/Temporary/LastTest.log):
/usr/bin/ruby -I../samtools ./../test/test_samtools.rb
The test simply invokes:
result = Biolib::Samtools.samopen(samgz,"r",nil)
and SAMtools says samopen no @SQ lines in the header - which is a result.
Now it should be really easy to to the Perl binding. Again, create a CMakeLists.txt file and make sure to read samtools.i. And
$result = samtools::samopen($samgz,"r",undef);
passes.
And similarly for Python:
result = samtools.samopen(samgz,'r',None)
The mapping code for Ruby, Perl and Python can be found here
(until this point was another hour of work)
This patch allows building SAMtools with one of:
./configure --with-samtools --with-perl make make test make install (with priviliges)
./configure --with-samtools --with-python make make test make install (with priviliges)
./configure --with-samtools --with-ruby make make test make install (with priviliges)
Here you can see how much plumbing BioLib does. BioLib locates the Perl/Python/Ruby interpreter, queries build settings, builds the modules, and installs them into the right place. This on Mac OSX, Linux and Windows (cygwin, for now)
The next step is to do something with the open SAM file. We ought to close the file with samclose. samclose is also defined in sam.h. which means the function is alread bound by SWIG - yes, we don't have to do anything!
Next, time to read an alignment from a SAM file. samread is also defined in sam.h. And an example of use is given here. There are two hurdles now. First, the data structure bam1_t in
int samread(samfile_t *fp, bam1_t *b)
is defined in bam.h. So we add that to the SWIG samtools.i file, which now reads
\%include bam.h \%include sam.h
That wasn't too hard, huh? After running SWIG again with
cd src/mappings/swig/(ruby|perl|python)/samtools rm CMakeCache.txt ; cmake . ; make make test
the test still passes. It is educational to read the SWIG generated file ruby_samtoolsRUBY_wrap.c (for Ruby), to see how much of the hard work is being done by SWIG. You'll find getters and setters for the bam1_t data structure, for example _wrap_bam1_t_m_data_get is a getter for the bam.m_data element. This means Ruby/Python/Perl code gets mapped to this getter and will be available as bam.m_data
The next problem with the samread function is that it gets two results! One is the number of bytes read - as returned by the function - but also the 'b' parameter is actually a buffered return value. It is passed in and out as a bam1_t pointer.
This is a strategy often seen in C code to return multiple parameters (btw the SAMtools authors are advised to have clearer naming for these parameters). We need to tell SWIG it is a return value, not an input value. Described here
Meanwhile in the C code the buffer is allocated with
#define bam_init1() ((bam1_t*)calloc(1, sizeof(bam1_t)))
SWIG bindings will actually create a buffer internally - because it copies data into a structure your favourite language understands. So, here, I thought we could do away with buffer allocation. This was an ill-advised side road.
To make this work we need to define a typemap, as bam1_t is a 'complex' structure. To define the typemap we trick SWIG in generating an example by giving it a fake definition:
bam1_t *samread2(samfile_t *fp, bam1_t *b);
for Ruby it generates:
result = (bam1_t *)samread2(arg1,arg2); vresult = SWIG_NewPointerObj(SWIG_as_voidptr(result), SWIGTYPE_p_bam1_t, 0 | 0 ); return vresult;
treating b as an 'opaque' pointer. When we >use
bam1_t samread2(samfile_t *fp, bam1_t *b);
you can see it copies the memory block:
vresult = SWIG_NewPointerObj((bam1_t *)memcpy((bam1_t *)malloc(sizeof(bam1_t)) ,&result,sizeof(bam1_t)), SWIGTYPE_p_bam1_t, SWIG_POINTER_OWN | 0 ); return vresult;
which is passed around, as is. The getters and setters of bam1_t access the contained values in the following way:
res1 = SWIG_ConvertPtr(self, &argp1,SWIGTYPE_p_bam1_t, 0 | 0 );
if (!SWIG_IsOK(res1)) {
SWIG_exception_fail(SWIG_ArgError(res1), Ruby_Format_TypeError( "", "bam1_t
*","data_len", 1, self ));
}
arg1 = (bam1_t *)(argp1);
result = (int) ((arg1)->data_len);
vresult = SWIG_From_int((int)(result));
So, it fetches the memory block, picks out data_len (as a C integer) and converts it to a Ruby type (vresult).
For us, to adapt samread for our purposes we do not need to pass in a buffer, as we can create it just before calling C samread. The data gets copied into a memory block - which is wrapped for Ruby into a Ruby type using SWITYPE_p_bam_t. We free the buffer and leave it to Ruby to handle the memory management of the Ruby type.
For the other languages the procedure will be equal, we just need to provide the stack handling.
This all may seem overly complex. The point is that a powerful toolbox often comes at a price. What SWIG really does is parse C/C++ function definitions and provide automatic bindings for the simple cases. With more complex data types you get all the facilities to modify the way the bindings work. As SWIG contains a macro language it is pretty straightforward to turn a specific case into a generic case. To view how the macro's work read the typemaps in /usr/share/swig1.3. You can see that the generic macros are about 10K lines of macro code, and the specific additions for Ruby are 7K lines. That is a lot of work that has been done for you.
Much of this code is supporting the rather complex C++ template libraries. In bioinformatics we can mostly focus on plain C wrapping, and the example I provided here is as complex as it gets. The only other case I have had is with multidimensional arrays (matrices) of values, or pointers.
So, what I tried first is
\%typemap(argout, fragment="output_helper") bam1_t *b {
VALUE bam1 = SWIG_NewPointerObj((bam1_t *)memcpy((bam1_t *)malloc(sizeof(bam1_t)),arg2,sizeof(bam1_t)), SWIGTYPE_p_bam1_t, SWIG_POINTER_OWN | 0 );
$result = bam1;
}
\%typemap(in,numinputs=0) bam1_t *b(bam1_t bam1temp) {
$1 = &bam1temp;
}
The first call to samread worked! Unfortunately the second bombed out, and when checking the stack trace I found the approach was a little naive:
(gdb) run Starting program: /usr/bin/ruby "-I../samtools" "./../test/test_samtools.rb" Program received signal SIGSEGV, Segmentation fault. Switching to Thread -1211017536 (LWP 21332) 0xb7d9111c in realloc () from /lib/i686/cmov/libc.so.6 (gdb) where #0 0xb7d9111c in realloc () from /lib/i686/cmov/libc.so.6 #1 0xb7c8c58d in sam_read1 () from /usr/local/lib/libbiolib_samtools-0.0.6.so #2 0xb7cbe43b in samread () from /usr/local/lib/libbiolib_samtools-0.0.6.so #3 0xb7cce02f in _wrap_samread () from ../samtools/biolib/samtools.so #4 0xb7f09a90 in rb_rescue () from /usr/lib/libruby1.8.so.1.8
the data array inside bam1_t is being reallocated. Looks like it is reusing the original buffer for all reads. Hmmm. It coulde be there is state being carried around, not only in the fh, but also in bam1_t. OK, this means another approach, we need to create a buffer and pass it in every time. We'll probably also want to use the contents of the data buffer (it is the whole point of this excercise), which is of type uint8_t. A bit strange, presumably it contains characters.
On the one hand we have a continuous buffer, on the other we want to query the data in that buffer. One option is to create a special C query, which would work like:
bambuf = new_bambuffer() numread = samread(fh,bambuf) data = bamdata_get(bambuf)
it would be nicer to query bambuf.data directly. Later we can try to achieve that. The third option is to return multiple values from samread:
bambuf = new_bambuffer() numread,data = samread(fh,bambuf)
Finally, we'll have to find a way to write that same data using samwrite.
Anyway, let's try option one. A persistent buffer returns:
samopen SAM header is present: 2 sequences. >>> num=135 >>> bam.data_len=107 >>> bam.m_data=128 >>>#<SWIG::TYPE_p_uint8_t:0xb7cdd43c> >>> num=154 >>> bam.data_len=99 >>> bam.m_data=128 >>>#<SWIG::TYPE_p_uint8_t:0xb7cdd360> >>> num=123 >>> bam.data_len=79 >>> bam.m_data=128 >>>#<SWIG::TYPE_p_uint8_t:0xb7cdd284> >>> num=-1 >>> bam.data_len=79 >>> bam.m_data=128 >>>#<SWIG::TYPE_p_uint8_t:0xb7cdd1a8>
That works! Now >we add a way to fetch the data defining a function bam1_t_datalist_get. According to the samtools docs data_len is the length of the current data block inside bam1_t. We would like to have them in a native Array, to be returned by
data = bam1_t_datalist_get(bambuf)
By treating data as a Ruby String (these are 8-bit values) we get a result like
> rec,bytesread=3, 123 > bam.core.tid=1 (chromosome nr) > bam.core.pos=87 (chromosome pos) > bam.data_len=79 > bam.m_data=128 >79 "read_28701_28881_323c\0000\002\000\000\022(\030\030(\204B(\204B$\030BD\"\210B\020\e\e\e\e\e\032\e\e\e\e\026\032\031\e\e\e\025\032\e\e\e\e\e\e\e\e\e\e\e\e\026\e\e\e\e"
from
fh = Biolib::Samtools.samopen(fn,"r",nil) count = 0 begin count += 1 bam = Biolib::Samtools.new_bam() bytesread = Biolib::Samtools.samread(fh,bam) break if bytesread < 0 data = Biolib::Samtools.bam1_t_datalist_get(bam) print "\n> rec,bytesread=",count,', ',bytesread print "\n> bam.core.tid=",bam.core.tid," (chromosome nr)" print "\n> bam.core.pos=",bam.core.pos," (chromosome pos)" # print "\n> bam1_qname(bam)=",data print "\n> bam.data_len=",bam.data_len print "\n> bam.m_data=",bam.m_data print "\n>",bam.data_len p data end while true Biolib::Samtools.samclose(fh)
for this I merely had to add the following to samtools.i
void bam1_t_datalist_get(char *datalist, bam1_t *b) {
datalist = (char *)b->data;
}
\%apply char *OUTPUT { char *datalist };
\%typemap(argout) (char *datalist, bam1_t *b) {
int asize = $2->data_len;
$result = rb_str_new($2->data,asize);
}
The reason we have to specify the 'argout' in a typemap is that SWIG can not know the size of the data record - the size is fetched from the data_len value. If we want the result as a list of integers the following works:
\%typemap(argout) (char *datalist, bam1_t *b) {
int asize = $2->data_len;
$result = rb_ary_new();
for (i=0; i<asize; i++)
rb_ary_push($result,INT2NUM($2->data[i]));
}
Now this is taking all a bit long. And I should have looked more closely at the following example, where it is clear a new buffer is used every time!
int pos=0;
int lastpos=0;
b = bam_init1();
while(samread(fp_in, b) > 0) {
lastpos = pos;
pos = b->core.pos;
if(pos != lastpos) {
printf("tid : %d\n",b->core.tid);
printf("pos : %d\n",b->core.pos);
char *name = bam1_qname(b);
char *qual = bam1_qual(b);
int n=0;
char *qseq = (char *) malloc(b->core.l_qseq+1);
char *s = bam1_seq(b);
for(n=0;n<(b->core.l_qseq);n++) {
char v = bam1_seqi(s,n);
qseq[n] = bam_nt16_rev_table[v];
}
qseq[n] = 0;
printf("name : %s\n",name);
printf("qseq : %s\n",qseq);
//printf("s cigar: %s\n",cigar);
printf("qual :");
for(n=0;n<(b->core.l_qseq);n++) {
printf(" %d",qual[n]);
}
printf("\n");
}
bam_destroy1(b);
b = bam_init1();
}
bam_destroy1(b);
Perl and Python implementations are similar, as there is nothing language specific in samtools.i, so far.
See SWIG mapping samtools.i, Ruby test_samtools.rb, Perl test_samtools.pl, and Python test_samtools.py
To test, fetch the github samtools branch of biolib and execute one of:
./configure --with-samtools --(with-perl|with-ruby|with-python) make make test cat ./Testing/Temporary/LastTest.log
BTW, SWIG introduces type safety. If I write:
fh = Biolib::Samtools.samopen(fn,"r",nil) bam = 2 bytesread = Biolib::Samtools.samread(fh,bam)
Ruby responds with:
./../test/test_samtools.rb:13:in `samread': Expected argument 1 of type bam1_t *, but got Fixnum 2 (TypeError)
in SWIG method 'samread'
from ./../test/test_samtools.rb:13
In above examples you may note that SWIG allows for automatic binding to deep data structures. For example we accessed:
print "\n> bam.core.tid=",bam.core.tid," (chromosome nr)"
where bam is a C struct, and core is a contained C struct with tid as a field. The mapping is transparent in the calling language, as both getters and setters are provided.
At this stage I decided to align the naming with the API, so new_bam becomes bam_init1, which requires renaming the method to bl_bam_init1. The work around is needed, because in SAMtools bam_init1 is actually a macro, not a real function. So in samtools.i:
%{
bam1_t *bl_bam_init1() {
return bam_init1();
}
%}
#undef bam_init1
%rename(bam_init1) bl_bam_init1;
Adding samwrite after samread is straightforward, as we are dealing with the same data buffer. To read a SAM file and write it as a BAM file we can do
fh = Biolib::Samtools.samopen(fn,"r",nil) fh2 = Biolib::Samtools.samopen(fn2,"bw",nil) bytesread=0 while bytesread>=0 bam = Biolib::Samtools.bam_init1() bytesread = Biolib::Samtools.samread(fh,bam) break if bytesread < 0 bytesread = Biolib::Samtools.samwrite(fh2,bam) Biolib::Samtools.bam_destroy1(bam) end Biolib::Samtools.samclose(fh) Biolib::Samtools.samclose(fh2)
and we should have covered the first use case mentioned in the introduction. However, it is never that easy.
samopen(fn2,"bw",nil) fails, as the third parameter needs to be a BAM header. Apparently you have to create a valid bam_header_t before writing to file. Functions that return bam_header_t are:
/*! @abstract Read header information from a TAB-delimited list file @param fn_list file name for the list @return a pointer to the header structure @discussion Each line in this file consists of chromosome name and the length of chromosome. */ bam_header_t *sam_header_read2(const char *fn_list); /*! @abstract Read header from a SAM file (if present) @param fp SAM file handler @return pointer to header struct; 0 if no @SQ lines available */ bam_header_t *sam_header_read(tamFile fp);
In the examples of SAMtools the README says:
which implies that the header information is available in ex1.fa.fai, which contains:
seq1 1575 6 60 61 seq2 1584 1614 60 61
faidx is implemented in faidx.c. fai_save writes the index using
fprintf(fp, "%s\t%d\t%lld\t%d\t%d\n", fai->name[i], (int)x.len, (long long)x.offset, (int)x.line_blen, (int)x.line_len);
Meanwhile the import is handled bamtk.c's main and delegated to main_import in sam_view.c. This rewrites the command line (?!) by setting
argc2 = 6; argv2 = calloc(6, sizeof(char*)); argv2[0] = "import", argv2[1] = "-o", argv2[2] = argv[3], argv2[3] = "-bt", argv2[4] = argv[1], argv2[5] = argv[2]; ret = main_samview(argc2, argv2);
I am so tempted to comment on this programming 'style'. Anyway, I digress. The command line initialises fn_out and out_mode to 'b'. '-bt' switch refers to the index, setting fn_list, presumably the index (naming?!). First the fn_list is opened, indeed, by calling bam_header_t *sam_header_read(tamFile fp). So tamFile is simply an open index file. Still, reading the code, I can't figure out how it works, as all parsing depends on SAM headers. Ah, wait, if aux is the same as fn_list, and that is parsed in sam_open itself, using bam_header_t *sam_header_read2(const char *fn). Like an afterthought. Strangely it is an index faid file, and it is not read in with the fai_read function. fai_read is only used by fai_load to create the index.
Good, so we can use sam_header_read2 with an index file and get bam_header_t to pass into sam_open. Only thing is we need to generate the index file first - which is a bit lame. Maybe it is possible to create the index first in memory. As it stands I am just going to use the same steps:
#--- convert SAM+FASTA to BAM file
fafn = 'ex1.fa'
faifn = 'ex1.fa.fai'
samfn = 'ex1.sam'
bamfn = 'ex1.bam.gz'
#--- first create the faifn from the FASTA file
if fai_build(fafn) >= 0
fh1 = Biolib::Samtools.samopen(samfn,"r",nil)
bamheader = sam_header_read2(faifn)
fh2 = Biolib::Samtools.samopen(bamfn,"bw",bamheader)
bytesread=0
while bytesread>=0
bam = Biolib::Samtools.bam_init1()
bytesread = Biolib::Samtools.samread(fh,bam)
break if bytesread < 0
bytesread = Biolib::Samtools.samwrite(fh2,bam)
Biolib::Samtools.bam_destroy1(bam)
end
Biolib::Samtools.samclose(fh)
Biolib::Samtools.samclose(fh2)
A final hiccup is that the third parameter of samopen is used in two ways. The description in sam.h says:
@param aux auxiliary data; if mode[0]=='w', aux points to bam_header_t; if strcmp(mode, "rb")!=0 and @SQ header lines in SAM are absent, aux points the file name of the list of the reference; aux is not used otherwise. If @SQ header lines are present in SAM, aux is not used, either.
A description like this ought to raise red flags with anyone doing a code review. For us, in practice 'aux' is a void * and can point to either a structure or a string. Difficult for SWIG to handle, as it needs to know about one or the other. One way is to write a second wrapper function to handle the String option. The other way is to create a special SWIG typemap to handle two cases. The SWIG manual says somewhere 'If you need to cast a pointer or change its value, consider writing some helper functions instead.'
It always pays to read the SWIG generated code:
res3 = SWIG_ConvertPtr(argv[2],SWIG_as_voidptrptr(&arg3), 0, 0);
if (!SWIG_IsOK(res3)) {
SWIG_exception_fail(SWIG_ArgError(res3), Ruby_Format_TypeError( "", "void const *","samopen", 3, argv[2] ));
}
result = (samfile_t *)samopen((char const *)arg1,(char const *)arg2, (void const *)arg3);
So, it expects a type void *. We can write a function String2voidptr, which is a generic converter. Or we write a special samopen_listfn, I am inclined to write the latter. Now >we have
fh2 = Biolib::Samtools.samopen_listfn(bamfn,"bw",faifn)
Unfortunately the implementation still bombs out for ex1.sam. It says that it creates a header struct with two targets from ex1.fa.fai. However, samread fails with a -1, where ex3.sam reads a few records without a faifn index. This is annoying, as the SAMTools code is pretty hard to read/understand. And this means diving in again.
At this stage it may pay off to use a debugger, as my previous code walks simply have not pointed where this goes wrong. We have used the debugger earlier to do a stack trace, but that won't be enough now. cmake has default support for debugging when you set SET(CMAKE_BUILD_TYPE Debug). This needs to be done for both the linked library and the bindings. Easiest to do from the root with:
cd biolib make clean rm CMakeCache.txt cmake -DSAMTOOLS_LIB:BOOLEAN=TRUE -DBUILD_RUBY:BOOLEAN=TRUE -DDEBUG:BOOLEAN=TRUE . ; make
In the cmake output you'll see 'Build type: Debug for i686'. Start gdb and it faults at bam.c
gdb
file /usr/bin/ruby
run -I. test/test_sam2bam.rb
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1211705664 (LWP 21888)]
0xb7be9021 in bam_format1_core (header=0x0, b=0x8517d80, of=0)
at /home/wrk/izip/git/opensource/biolib/contrib/samtools/bam.c:248
248 else { kputs(header->target_name[c->tid], &str); kputc('\t', &str); }
#--- do a stack dump
(gdb) backtrace
#0 0xb7be9021 in bam_format1_core (header=0x0, b=0x8517d80, of=0)
at /home/wrk/izip/git/opensource/biolib/contrib/samtools/bam.c:248
#1 0xb7c16558 in samwrite (fp=0x8517ce0, b=0x8517d80)
at /home/wrk/izip/git/opensource/biolib/contrib/samtools/sam.c:133
#2 0xb7c356c4 in _wrap_samwrite (argc=2, argv=0xbfe495d0, self=3083180300)
at /home/wrk/izip/git/opensource/biolib/src/mappings/swig/ruby/samtools/ruby_samtoolsRUBY_wrap.c:6645
The last line tells us the fault is caused by a call to samwrite. We can set a breakpoint. For example:
(gdb) break samwrite Breakpoint 1 at 0xb7c164ee: file /home/wrk/izip/git/opensource/biolib/contrib/samtools/sam.c, line 130. run
and, yes, we are walking in the sam.c code. fh->header points to NULL. Hmmm. Anyway, I am not going to bore you with the drill-down. The main thing is that you can use a debugger to inspect code and variables.
IMPORTANT: when you need to recompile the C code, including the SWIG generated code, make sure to build from the root of the biolib source tree, to generate debugging information. Just invoking make should be enough.
One thing I found is the order of samopen mode matters. To open a BAM for writing I used 'bw', but it should be 'wb'! You know what, it worked for ex3.sam, which includes a header:
#--- Convert ex3.sam to ex3.bam.gz - note ex3.sam includes a header
#--- which is used to write the new file
require 'biolib/samtools'
datadir = '../../../../test/data/samtools'
samfn = datadir+'/ex3.sam'
bamfn = '/tmp/ex3.bam.gz'
fh1 = Biolib::Samtools.samopen(samfn,"r",nil)
bamheader = fh1.header # borrow the header from fh1
fh2 = Biolib::Samtools.samopen(bamfn,"wb",bamheader)
bytesread=0
while bytesread >= 0
bambuf = Biolib::Samtools.bam_init1()
bytesread = Biolib::Samtools.samread(fh1,bambuf)
break if bytesread < 0
bytesread = Biolib::Samtools.samwrite(fh2,bambuf)
Biolib::Samtools.bam_destroy1(bambuf)
end
Biolib::Samtools.samclose(fh1)
Biolib::Samtools.samclose(fh2)
Another thing is that samopen does not really give information when a file does not exist.
So, next, we adapt the script to read ex1.sam, which does not come with a header, so we need to create one using a fai index. The following script converts SAM to BAM whether the SAM contains a header or not. This works:
require 'biolib/samtools'
#--- convert SAM+FASTA to BAM file
fafn = datadir+'/ex1.fa'
samfn = datadir+'/ex1.sam.gz'
faifn = datadir+'/ex1.fa.fai'
bamfn = '/tmp/ex1.bam.gz'
raise "#{samfn} does not exist" if !File.exist?(samfn)
fh1 = Biolib::Samtools.samopen(samfn,"r",nil) # use header when no SQ lines
if fh1 == nil or fh1.header.n_targets == 0
print "No header information in #{samfn}, creating fai!\n"
raise "#{fafn} does not exist" if !File.exist?(fafn)
raise "Can not create faifn from #{fafn}" if Biolib::Samtools.fai_build(fafn) < 0
fh1 = Biolib::Samtools.samopen_listfn(samfn,"r",faifn)
end
bamheader = fh1.header
print "targets=",bamheader.n_targets,"\n"
fh2 = Biolib::Samtools.samopen(bamfn,"wb",bamheader)
bytesread=0
while bytesread >= 0
bambuf = Biolib::Samtools.bam_init1()
bytesread = Biolib::Samtools.samread(fh1,bambuf)
print bytesread,"\n"
break if bytesread < 0
bytesread = Biolib::Samtools.samwrite(fh2,bambuf)
Biolib::Samtools.bam_destroy1(bambuf)
end
Biolib::Samtools.samclose(fh1)
Biolib::Samtools.samclose(fh2)
The latest version of the Ruby script is here
The Perl and Python versions will be exactly the same, as we are using SWIG, and there is nothing specific to Ruby. After my testing and modifications, simply compile the new bindings for any language. And we are done. Here the Perl version:
use biolib::samtools;
$datadir = '../../../../test/data/samtools';
$fafn = "$datadir/ex1.fa";
$samfn = "$datadir/ex1.sam.gz";
$faifn = "$datadir/ex1.fa.fai";
$bamfn = "/tmp/ex1.bam.gz";
$fh1 = samtools::samopen($samfn,"r",undef);
if ($fh1==undef or $fh1->{header}->{n_targets} == 0) {
print "Using fai\n";
samtools::fai_build($fafn);
$fh1 = samtools::samopen_listfn($samfn,"r",$faifn);
}
$bamheader = $fh1->{header};
print("targets=",$bamheader->{n_targets},"\n");
$fh2 = samtools::samopen($bamfn,"wb",$bamheader);
$bytesread = 0;
while ($bytesread >= 0) {
$bambuf = samtools::bam_init1();
$bytesread = samtools::samread($fh1,$bambuf);
last if $bytesread < 0;
print ".";
$byteswritten = samtools::samwrite($fh2,$bambuf);
samtools::bam_destroy1($bambuf);
}
samtools::samclose($fh1);
samtools::samclose($fh2);
Note I only had to use the debugger to figure out how SAMtools works! I did not have to modify the C-code, or SWIG, to get the script to work.
In the next phase I'll add more functions, though I would like someone else to actually test and use the code for real.
The mapping of SAMtools proves a little harder than I thought. This has little to do with SWIG bindings, which can be hard too, but more with the way the SAMtools code is organised and documented. The way SAMtools does file handling and parameter passing between functions like samopen and samread is complicated, under-documented and rather brittle.
First, just to be clear: I am grateful to the authors for doing all this work and creating the functionality and wish to thank them for that. They could justly claim that if I have a problem with their code, I am free to fix it. However, I wish that we would get rid of this great tradition in bioinformatics of writing brittle interfaces. Brittle means in the end that it is hard to maintain, also for the original authors. Sometimes it is justified, in the sense that many bioinformatics exercises are one-offs. However, projects like SAMtools are bound to be around for some time, and deserve a little more consideration.
The Bio* project authors can probably do a lot in wrapping above code, so helpful error messages get displayed, parameters get more sane (like the 'wb' vs 'bw', and the use of aux examples above). And buffers (like 'header' and 'bam1') get handled more elegantly. One thing that tripped me up is that if I passed in a non-existing file samopen would just return NULL. Likewise, passing in NULL to samwrite would fail silently.
Above SWIG bindings are a good start. They are efficient, and easy to test and maintain.
What does SWIG offer:
What does BioLib offer in addition:
And something else. EMBOSS comes with SAMtools, now. And BioLib contains EMBOSS. As long as the EMBOSS SAM API matches the SAMtools API we can use exactly the same SWIG bindings! And use either library using the same Perl, Python or Ruby code.
Problems when mapping C libraries:
Speed of mapping depends on the API
If the C API is non-inituitive the mapping is harder. A sane API makes easy binding with SWIG, or any other binding method.
SWIG is not well documented when it comes to hand-written mappings
SWIG has a lot of documentation, which mostly suffices. Still, when special bindings need to be written, in many cases the required macros are not well documented on the web. It would be useful to have a number of examples and use cases of 'specialised' bindings online. As it stands I write a Ruby version first, and next look for the more generic macros and see what they generate for all languages. The SWIG source distribution includes a lot of useful information on low-level access, which should be checked when you get stuck.