2011-03-10 12:54 by pjotrp

1 Adding BAM/SAM support to BioLib

by Pjotr Prins, summer 2010

This is the journal of adding BAM/SAM support to BioLib. Here I show what it takes to do a full SWIG mapping of an important C library.

There are three interesting implementations of BAM/SAM support. The Picard library (Java), Samtools API (C) and EMBOSS (C). A description of the Sequence Alignment/Map format (SAM) can be found here. SAM is a textual format, and BAM is the matching binary format. From the specification it is clear that BAM/SAM is a rather extensive format, for large files, and benefits from fast C parsing.

As a matter of fact, the Picard library and the EMBOSS version are derived from SAMtools. For BioLib we decided to map native SAMtools.

For Perl, Python and Ruby, people have also written SAMtools wrappers. The Ruby version is incomplete and the others differ in implementation phylosophy. All these projects need to be tested and maintained. Here we try to consolidate the *low* level API for all three langauges in one SWIG binding.

1.1 SAM/BAM use cases

Some use cases for a BAM/SAM library (by Jan Aerts):

Notice that some of these tasks are read-centric, and can probably be done by iterating over the reads one by one (no memory worries). Others are contig-centric, and this will require much more care.

On the EMBOSS mailing list Peter Cock wrote:

With regards to converting (aligned) reads in SAM/BAM to FASTQ, there are issues to discuss. Firstly the original distinct read pair names are not (currently) held, although there was a suggestion on the samtools-devel list to hold this in the tags. All you get is the template name, and in my code I was appending /1 or /2 in the Illumina style for the forward and reverse reads. Secondly, to recover the original read orientation, any read mapped to the reverse strand in SAM/BAM must be reverse complemented (and its quality reversed). Later Peter Rice added that it was better to use _f and _r, instead of /1 /2, as the latter will break file names.

Further down the line it'd be nice to include functionality of GATK as well:

1.2 Compiling SAMtools in BioLib

BioLib stores existing project repositories in git - so they can be pulled in as git 'submodules'. This allows keeping track of versions of SAMtools bindings.

1.2.1 Create a git repository of the SAMtools project on github

Note: don't do this, this is just to show what is needed to add a project to BioLib.

Pull the SAMTools SVN into git:

Shell
  cd ~/izip/git/opensource/contrib/
  git svn clone https://samtools.svn.sourceforge.net/svnroot/samtools/trunk/samtools
  cd samtools
  git remote add origin git@github.com:pjotrp/samtools.git
  git push origin master

and update with

Shell
  cd samtools
  git svn rebase
  git push origin master

1.2.2 Add the SAMtools git as a BioLib submodule

Note: don't do this, this is just to show what is needed to add a project to BioLib.

Do not use the same git dir, as above, but add the submodule using the *public* link from github.

Shell
  cd biolib
  git checkout -b samtools   # create new git branch
  git submodule add http://github.com/pjotrp/samtools.git ./contrib/samtools

Now we have the source code for SAMtools in place.

1.2.3 Add SWIG C bindings using CMake

To compile the C library in biolib a CMakeLists.txt file needs to be written in ./src/clibs/samtools/src/CMakeLists.txt and the required C files need to be in the accompanying files.txt.

The first compile led to a bug. If _USE_KNETFILE is not defined bam_readers.c breaks on line 28 (x does not exist). So we define _USE_KNETFILE with:

Cmake
  ADD_DEFINITIONS(-D_USE_KNETFILE)

Running

Shell
  cmake .
  make

creates a shared library ../build/libbiolib_samtools-0.0.6.so.

(until this point was one hour of work)

1.3 Binding SAMTools file samopen, against Ruby

First we start with a simple binding. I use Ruby first, as I am happiest in that language.

The first step is top open a file, as defined in the API

The function 'samopen' is defined in sam.h, so we are going to tell SWIG to parse it.

First create the CMakeFile.txt for Ruby in ./src/mappings/swig/ruby/samtools. Next tell SWIG to include sam.h in samtools.i.

Once it compiles, I merely need to add an integration test in ./src/mappings/swig/ruby/test/test_samtools.rb

Which is automatically called with

Shell
  make test

effectively calling (see Testing/Temporary/LastTest.log):

Shell
  /usr/bin/ruby -I../samtools ./../test/test_samtools.rb

The test simply invokes:

Ruby
  result = Biolib::Samtools.samopen(samgz,"r",nil)

and SAMtools says samopen no @SQ lines in the header - which is a result.

1.4 Binding SAMTools file samopen, against Perl and Python

Now it should be really easy to to the Perl binding. Again, create a CMakeLists.txt file and make sure to read samtools.i. And

Perl
  $result = samtools::samopen($samgz,"r",undef);

passes.

And similarly for Python:

Python
  result = samtools.samopen(samgz,'r',None)

The mapping code for Ruby, Perl and Python can be found here

(until this point was another hour of work)

1.5 Adding SAMtools to the base build

This patch allows building SAMtools with one of:

Shell
  ./configure --with-samtools --with-perl
  make
  make test
  make install (with priviliges)

Shell
  ./configure --with-samtools --with-python
  make
  make test
  make install (with priviliges)

Shell
  ./configure --with-samtools --with-ruby
  make
  make test
  make install (with priviliges)

Here you can see how much plumbing BioLib does. BioLib locates the Perl/Python/Ruby interpreter, queries build settings, builds the modules, and installs them into the right place. This on Mac OSX, Linux and Windows (cygwin, for now)

1.6 Reading a SAM file (samread)

The next step is to do something with the open SAM file. We ought to close the file with samclose. samclose is also defined in sam.h. which means the function is alread bound by SWIG - yes, we don't have to do anything!

Next, time to read an alignment from a SAM file. samread is also defined in sam.h. And an example of use is given here. There are two hurdles now. First, the data structure bam1_t in

C
  int samread(samfile_t *fp, bam1_t *b)

is defined in bam.h. So we add that to the SWIG samtools.i file, which now reads

Swig
  \%include bam.h
  \%include sam.h

That wasn't too hard, huh? After running SWIG again with

Shell
  cd src/mappings/swig/(ruby|perl|python)/samtools
  rm CMakeCache.txt ; cmake . ; make
  make test

the test still passes. It is educational to read the SWIG generated file ruby_samtoolsRUBY_wrap.c (for Ruby), to see how much of the hard work is being done by SWIG. You'll find getters and setters for the bam1_t data structure, for example _wrap_bam1_t_m_data_get is a getter for the bam.m_data element. This means Ruby/Python/Perl code gets mapped to this getter and will be available as bam.m_data

The next problem with the samread function is that it gets two results! One is the number of bytes read - as returned by the function - but also the 'b' parameter is actually a buffered return value. It is passed in and out as a bam1_t pointer.

This is a strategy often seen in C code to return multiple parameters (btw the SAMtools authors are advised to have clearer naming for these parameters). We need to tell SWIG it is a return value, not an input value. Described here

Meanwhile in the C code the buffer is allocated with

Shell
  #define bam_init1() ((bam1_t*)calloc(1, sizeof(bam1_t)))

SWIG bindings will actually create a buffer internally - because it copies data into a structure your favourite language understands. So, here, I thought we could do away with buffer allocation. This was an ill-advised side road.

To make this work we need to define a typemap, as bam1_t is a 'complex' structure. To define the typemap we trick SWIG in generating an example by giving it a fake definition:

C
  bam1_t *samread2(samfile_t *fp, bam1_t *b);

for Ruby it generates:

C
  result = (bam1_t *)samread2(arg1,arg2);
  vresult = SWIG_NewPointerObj(SWIG_as_voidptr(result), SWIGTYPE_p_bam1_t, 0 |
0 );
  return vresult;

treating b as an 'opaque' pointer. When we >use

C
  bam1_t samread2(samfile_t *fp, bam1_t *b);

you can see it copies the memory block:

C
  vresult = SWIG_NewPointerObj((bam1_t *)memcpy((bam1_t *)malloc(sizeof(bam1_t))
,&result,sizeof(bam1_t)), SWIGTYPE_p_bam1_t, SWIG_POINTER_OWN |  0 );
  return vresult;

which is passed around, as is. The getters and setters of bam1_t access the contained values in the following way:

C
  res1 = SWIG_ConvertPtr(self, &argp1,SWIGTYPE_p_bam1_t, 0 |  0 );
  if (!SWIG_IsOK(res1)) {
    SWIG_exception_fail(SWIG_ArgError(res1), Ruby_Format_TypeError( "", "bam1_t
*","data_len", 1, self ));
  }
  arg1 = (bam1_t *)(argp1);
  result = (int) ((arg1)->data_len);
  vresult = SWIG_From_int((int)(result));

So, it fetches the memory block, picks out data_len (as a C integer) and converts it to a Ruby type (vresult).

For us, to adapt samread for our purposes we do not need to pass in a buffer, as we can create it just before calling C samread. The data gets copied into a memory block - which is wrapped for Ruby into a Ruby type using SWITYPE_p_bam_t. We free the buffer and leave it to Ruby to handle the memory management of the Ruby type.

For the other languages the procedure will be equal, we just need to provide the stack handling.

This all may seem overly complex. The point is that a powerful toolbox often comes at a price. What SWIG really does is parse C/C++ function definitions and provide automatic bindings for the simple cases. With more complex data types you get all the facilities to modify the way the bindings work. As SWIG contains a macro language it is pretty straightforward to turn a specific case into a generic case. To view how the macro's work read the typemaps in /usr/share/swig1.3. You can see that the generic macros are about 10K lines of macro code, and the specific additions for Ruby are 7K lines. That is a lot of work that has been done for you.

Much of this code is supporting the rather complex C++ template libraries. In bioinformatics we can mostly focus on plain C wrapping, and the example I provided here is as complex as it gets. The only other case I have had is with multidimensional arrays (matrices) of values, or pointers.

So, what I tried first is

Swig
\%typemap(argout, fragment="output_helper") bam1_t *b {
  VALUE bam1 = SWIG_NewPointerObj((bam1_t *)memcpy((bam1_t *)malloc(sizeof(bam1_t)),arg2,sizeof(bam1_t)), SWIGTYPE_p_bam1_t, SWIG_POINTER_OWN |  0 );
  $result = bam1;
}
\%typemap(in,numinputs=0) bam1_t *b(bam1_t bam1temp) {
  $1 = &bam1temp;
}

The first call to samread worked! Unfortunately the second bombed out, and when checking the stack trace I found the approach was a little naive:

Shell
(gdb) run
  Starting program: /usr/bin/ruby "-I../samtools" "./../test/test_samtools.rb"
  Program received signal SIGSEGV, Segmentation fault.
  Switching to Thread -1211017536 (LWP 21332)
  0xb7d9111c in realloc () from /lib/i686/cmov/libc.so.6
(gdb) where
  #0  0xb7d9111c in realloc () from /lib/i686/cmov/libc.so.6
  #1  0xb7c8c58d in sam_read1 () from /usr/local/lib/libbiolib_samtools-0.0.6.so
  #2  0xb7cbe43b in samread () from /usr/local/lib/libbiolib_samtools-0.0.6.so
  #3  0xb7cce02f in _wrap_samread () from ../samtools/biolib/samtools.so
  #4  0xb7f09a90 in rb_rescue () from /usr/lib/libruby1.8.so.1.8

the data array inside bam1_t is being reallocated. Looks like it is reusing the original buffer for all reads. Hmmm. It coulde be there is state being carried around, not only in the fh, but also in bam1_t. OK, this means another approach, we need to create a buffer and pass it in every time. We'll probably also want to use the contents of the data buffer (it is the whole point of this excercise), which is of type uint8_t. A bit strange, presumably it contains characters.

On the one hand we have a continuous buffer, on the other we want to query the data in that buffer. One option is to create a special C query, which would work like:

Ruby
  bambuf = new_bambuffer()
  numread = samread(fh,bambuf)
  data = bamdata_get(bambuf)

it would be nicer to query bambuf.data directly. Later we can try to achieve that. The third option is to return multiple values from samread:

Ruby
  bambuf = new_bambuffer()
  numread,data = samread(fh,bambuf)

Finally, we'll have to find a way to write that same data using samwrite.

Anyway, let's try option one. A persistent buffer returns:

Shell
samopen SAM header is present: 2 sequences.
>>> num=135
>>> bam.data_len=107
>>> bam.m_data=128
>>>#<SWIG::TYPE_p_uint8_t:0xb7cdd43c>
>>> num=154
>>> bam.data_len=99
>>> bam.m_data=128
>>>#<SWIG::TYPE_p_uint8_t:0xb7cdd360>
>>> num=123
>>> bam.data_len=79
>>> bam.m_data=128
>>>#<SWIG::TYPE_p_uint8_t:0xb7cdd284>
>>> num=-1
>>> bam.data_len=79
>>> bam.m_data=128
>>>#<SWIG::TYPE_p_uint8_t:0xb7cdd1a8>

That works! Now >we add a way to fetch the data defining a function bam1_t_datalist_get. According to the samtools docs data_len is the length of the current data block inside bam1_t. We would like to have them in a native Array, to be returned by

Ruby
  data = bam1_t_datalist_get(bambuf)

By treating data as a Ruby String (these are 8-bit values) we get a result like

Shell
> rec,bytesread=3, 123
> bam.core.tid=1 (chromosome nr)
> bam.core.pos=87 (chromosome pos)
> bam.data_len=79
> bam.m_data=128
>79 "read_28701_28881_323c\0000\002\000\000\022(\030\030(\204B(\204B$\030BD\"\210B\020\e\e\e\e\e\032\e\e\e\e\026\032\031\e\e\e\025\032\e\e\e\e\e\e\e\e\e\e\e\e\026\e\e\e\e"

from

Ruby
fh = Biolib::Samtools.samopen(fn,"r",nil)
count = 0
begin
  count += 1
  bam = Biolib::Samtools.new_bam()
  bytesread = Biolib::Samtools.samread(fh,bam)
  break if bytesread < 0
  data = Biolib::Samtools.bam1_t_datalist_get(bam)
  print "\n> rec,bytesread=",count,', ',bytesread
  print "\n> bam.core.tid=",bam.core.tid," (chromosome nr)"
  print "\n> bam.core.pos=",bam.core.pos," (chromosome pos)"
  # print "\n> bam1_qname(bam)=",data
  print "\n> bam.data_len=",bam.data_len
  print "\n> bam.m_data=",bam.m_data
  print "\n>",bam.data_len
  p data
end while true
Biolib::Samtools.samclose(fh)

for this I merely had to add the following to samtools.i

Shell
  void bam1_t_datalist_get(char *datalist, bam1_t *b) {
    datalist = (char *)b->data;
  }
  \%apply char *OUTPUT { char *datalist };
  \%typemap(argout) (char *datalist, bam1_t *b) {
    int asize = $2->data_len;
    $result = rb_str_new($2->data,asize);
}

The reason we have to specify the 'argout' in a typemap is that SWIG can not know the size of the data record - the size is fetched from the data_len value. If we want the result as a list of integers the following works:

Shell
  \%typemap(argout) (char *datalist, bam1_t *b) {
    int asize = $2->data_len;
    $result = rb_ary_new();
    for (i=0; i<asize; i++)
      rb_ary_push($result,INT2NUM($2->data[i]));
  }

Now this is taking all a bit long. And I should have looked more closely at the following example, where it is clear a new buffer is used every time!

C
  int pos=0;
  int lastpos=0;
  b = bam_init1();
  while(samread(fp_in, b) > 0) {
    lastpos = pos;
    pos = b->core.pos;
    if(pos != lastpos) {
      printf("tid  : %d\n",b->core.tid);
      printf("pos  : %d\n",b->core.pos);
      char *name  = bam1_qname(b);
      char *qual  = bam1_qual(b);
      int n=0;
      char *qseq = (char *) malloc(b->core.l_qseq+1);
      char *s   = bam1_seq(b);
      for(n=0;n<(b->core.l_qseq);n++) {
        char v = bam1_seqi(s,n);
        qseq[n] = bam_nt16_rev_table[v];
      }
      qseq[n] = 0;
      printf("name : %s\n",name);
      printf("qseq : %s\n",qseq);
      //printf("s cigar: %s\n",cigar);
      printf("qual :");
      for(n=0;n<(b->core.l_qseq);n++) {
        printf(" %d",qual[n]);
      }
      printf("\n");
    }
    bam_destroy1(b);
    b = bam_init1();
  }
  bam_destroy1(b);

Perl and Python implementations are similar, as there is nothing language specific in samtools.i, so far.

See SWIG mapping samtools.i, Ruby test_samtools.rb, Perl test_samtools.pl, and Python test_samtools.py

To test, fetch the github samtools branch of biolib and execute one of:

Shell
./configure --with-samtools --(with-perl|with-ruby|with-python)
make
make test
cat ./Testing/Temporary/LastTest.log

1.7 SWIG type safety

BTW, SWIG introduces type safety. If I write:

Shell
  fh = Biolib::Samtools.samopen(fn,"r",nil)
  bam = 2
  bytesread = Biolib::Samtools.samread(fh,bam)

Ruby responds with:

Shell
./../test/test_samtools.rb:13:in `samread': Expected argument 1 of type bam1_t *, but got Fixnum 2 (TypeError)
        in SWIG method 'samread'
        from ./../test/test_samtools.rb:13

1.8 SWIG access to data structures

In above examples you may note that SWIG allows for automatic binding to deep data structures. For example we accessed:

Ruby
  print "\n> bam.core.tid=",bam.core.tid," (chromosome nr)"

where bam is a C struct, and core is a contained C struct with tid as a field. The mapping is transparent in the calling language, as both getters and setters are provided.

1.9 Adding samwrite support

At this stage I decided to align the naming with the API, so new_bam becomes bam_init1, which requires renaming the method to bl_bam_init1. The work around is needed, because in SAMtools bam_init1 is actually a macro, not a real function. So in samtools.i:

Swig
%{
  bam1_t *bl_bam_init1() {
    return bam_init1();
  }
%}
#undef bam_init1
%rename(bam_init1) bl_bam_init1;

Adding samwrite after samread is straightforward, as we are dealing with the same data buffer. To read a SAM file and write it as a BAM file we can do

Ruby
fh = Biolib::Samtools.samopen(fn,"r",nil)
fh2 = Biolib::Samtools.samopen(fn2,"bw",nil)
bytesread=0
while bytesread>=0
  bam = Biolib::Samtools.bam_init1()
  bytesread = Biolib::Samtools.samread(fh,bam)
  break if bytesread < 0
  bytesread = Biolib::Samtools.samwrite(fh2,bam)
  Biolib::Samtools.bam_destroy1(bam)
end
Biolib::Samtools.samclose(fh)
Biolib::Samtools.samclose(fh2)

and we should have covered the first use case mentioned in the introduction. However, it is never that easy.

samopen(fn2,"bw",nil) fails, as the third parameter needs to be a BAM header. Apparently you have to create a valid bam_header_t before writing to file. Functions that return bam_header_t are:

C
/*!
  @abstract       Read header information from a TAB-delimited list file
  @param  fn_list file name for the list
  @return         a pointer to the header structure
  @discussion Each line in this file consists of chromosome name and
  the length of chromosome.
 */
bam_header_t *sam_header_read2(const char *fn_list);
/*!
  @abstract       Read header from a SAM file (if present)
  @param  fp      SAM file handler
  @return         pointer to header struct; 0 if no @SQ lines available
 */
bam_header_t *sam_header_read(tamFile fp);

In the examples of SAMtools the README says:

samtools faidx ex1.fa # index the reference FASTA samtools import ex1.fa.fai ex1.sam.gz ex1.bam # SAM->BAM

which implies that the header information is available in ex1.fa.fai, which contains:

seq1  1575  6 60  61
seq2  1584  1614  60  61

faidx is implemented in faidx.c. fai_save writes the index using

C
fprintf(fp, "%s\t%d\t%lld\t%d\t%d\n", fai->name[i], (int)x.len, (long long)x.offset, (int)x.line_blen, (int)x.line_len);

Meanwhile the import is handled bamtk.c's main and delegated to main_import in sam_view.c. This rewrites the command line (?!) by setting

C
argc2 = 6;
argv2 = calloc(6, sizeof(char*));
argv2[0] = "import", argv2[1] = "-o", argv2[2] = argv[3], argv2[3] = "-bt", argv2[4] = argv[1], argv2[5] = argv[2];
ret = main_samview(argc2, argv2);

I am so tempted to comment on this programming 'style'. Anyway, I digress. The command line initialises fn_out and out_mode to 'b'. '-bt' switch refers to the index, setting fn_list, presumably the index (naming?!). First the fn_list is opened, indeed, by calling bam_header_t *sam_header_read(tamFile fp). So tamFile is simply an open index file. Still, reading the code, I can't figure out how it works, as all parsing depends on SAM headers. Ah, wait, if aux is the same as fn_list, and that is parsed in sam_open itself, using bam_header_t *sam_header_read2(const char *fn). Like an afterthought. Strangely it is an index faid file, and it is not read in with the fai_read function. fai_read is only used by fai_load to create the index.

Good, so we can use sam_header_read2 with an index file and get bam_header_t to pass into sam_open. Only thing is we need to generate the index file first - which is a bit lame. Maybe it is possible to create the index first in memory. As it stands I am just going to use the same steps:

  1. Take a SAM file with FASTA
  2. Generate the index
  3. Load the index as a header
  4. Write BAM file

Ruby
#--- convert SAM+FASTA to BAM file
fafn  = 'ex1.fa'
faifn = 'ex1.fa.fai'
samfn = 'ex1.sam'
bamfn = 'ex1.bam.gz'
#--- first create the faifn from the FASTA file
if fai_build(fafn) >= 0
  fh1 = Biolib::Samtools.samopen(samfn,"r",nil)
  bamheader = sam_header_read2(faifn)
  fh2 = Biolib::Samtools.samopen(bamfn,"bw",bamheader)
  bytesread=0
  while bytesread>=0
    bam = Biolib::Samtools.bam_init1()
    bytesread = Biolib::Samtools.samread(fh,bam)
    break if bytesread < 0
    bytesread = Biolib::Samtools.samwrite(fh2,bam)
    Biolib::Samtools.bam_destroy1(bam)
  end
Biolib::Samtools.samclose(fh)
Biolib::Samtools.samclose(fh2)

A final hiccup is that the third parameter of samopen is used in two ways. The description in sam.h says:

@param aux auxiliary data; if mode[0]=='w', aux points to
bam_header_t; if strcmp(mode, "rb")!=0 and @SQ header lines in SAM
are absent, aux points the file name of the list of the reference;
aux is not used otherwise. If @SQ header lines are present in SAM,

aux is not used, either.

A description like this ought to raise red flags with anyone doing a code review. For us, in practice 'aux' is a void * and can point to either a structure or a string. Difficult for SWIG to handle, as it needs to know about one or the other. One way is to write a second wrapper function to handle the String option. The other way is to create a special SWIG typemap to handle two cases. The SWIG manual says somewhere 'If you need to cast a pointer or change its value, consider writing some helper functions instead.'

It always pays to read the SWIG generated code:

C
  res3 = SWIG_ConvertPtr(argv[2],SWIG_as_voidptrptr(&arg3), 0, 0);
  if (!SWIG_IsOK(res3)) {
    SWIG_exception_fail(SWIG_ArgError(res3), Ruby_Format_TypeError( "", "void const *","samopen", 3, argv[2] ));
  }
  result = (samfile_t *)samopen((char const *)arg1,(char const *)arg2, (void const *)arg3);

So, it expects a type void *. We can write a function String2voidptr, which is a generic converter. Or we write a special samopen_listfn, I am inclined to write the latter. Now >we have

Ruby
fh2 = Biolib::Samtools.samopen_listfn(bamfn,"bw",faifn)

Unfortunately the implementation still bombs out for ex1.sam. It says that it creates a header struct with two targets from ex1.fa.fai. However, samread fails with a -1, where ex3.sam reads a few records without a faifn index. This is annoying, as the SAMTools code is pretty hard to read/understand. And this means diving in again.

At this stage it may pay off to use a debugger, as my previous code walks simply have not pointed where this goes wrong. We have used the debugger earlier to do a stack trace, but that won't be enough now. cmake has default support for debugging when you set SET(CMAKE_BUILD_TYPE Debug). This needs to be done for both the linked library and the bindings. Easiest to do from the root with:

Shell
cd biolib
make clean
rm CMakeCache.txt
cmake -DSAMTOOLS_LIB:BOOLEAN=TRUE -DBUILD_RUBY:BOOLEAN=TRUE -DDEBUG:BOOLEAN=TRUE . ; make

In the cmake output you'll see 'Build type: Debug for i686'. Start gdb and it faults at bam.c

Shell
gdb
file /usr/bin/ruby
run -I. test/test_sam2bam.rb
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1211705664 (LWP 21888)]
0xb7be9021 in bam_format1_core (header=0x0, b=0x8517d80, of=0)
  at /home/wrk/izip/git/opensource/biolib/contrib/samtools/bam.c:248
248             else { kputs(header->target_name[c->tid], &str); kputc('\t', &str); }
#--- do a stack dump
(gdb) backtrace
#0  0xb7be9021 in bam_format1_core (header=0x0, b=0x8517d80, of=0)
    at /home/wrk/izip/git/opensource/biolib/contrib/samtools/bam.c:248
#1  0xb7c16558 in samwrite (fp=0x8517ce0, b=0x8517d80)
    at /home/wrk/izip/git/opensource/biolib/contrib/samtools/sam.c:133
#2  0xb7c356c4 in _wrap_samwrite (argc=2, argv=0xbfe495d0, self=3083180300)
    at /home/wrk/izip/git/opensource/biolib/src/mappings/swig/ruby/samtools/ruby_samtoolsRUBY_wrap.c:6645

The last line tells us the fault is caused by a call to samwrite. We can set a breakpoint. For example:

Shell
(gdb) break samwrite
Breakpoint 1 at 0xb7c164ee: file /home/wrk/izip/git/opensource/biolib/contrib/samtools/sam.c, line 130.
run

and, yes, we are walking in the sam.c code. fh->header points to NULL. Hmmm. Anyway, I am not going to bore you with the drill-down. The main thing is that you can use a debugger to inspect code and variables.

IMPORTANT: when you need to recompile the C code, including the SWIG generated code, make sure to build from the root of the biolib source tree, to generate debugging information. Just invoking make should be enough.

One thing I found is the order of samopen mode matters. To open a BAM for writing I used 'bw', but it should be 'wb'! You know what, it worked for ex3.sam, which includes a header:

Ruby
#--- Convert ex3.sam to ex3.bam.gz - note ex3.sam includes a header
#--- which is used to write the new file
require 'biolib/samtools'
datadir = '../../../../test/data/samtools'
samfn = datadir+'/ex3.sam'
bamfn = '/tmp/ex3.bam.gz'
  fh1 = Biolib::Samtools.samopen(samfn,"r",nil) 
  bamheader = fh1.header # borrow the header from fh1
  fh2 = Biolib::Samtools.samopen(bamfn,"wb",bamheader)  
  bytesread=0
  while bytesread >= 0
    bambuf = Biolib::Samtools.bam_init1()
    bytesread = Biolib::Samtools.samread(fh1,bambuf)
    break if bytesread < 0
    bytesread = Biolib::Samtools.samwrite(fh2,bambuf)
    Biolib::Samtools.bam_destroy1(bambuf)
  end
  Biolib::Samtools.samclose(fh1)
  Biolib::Samtools.samclose(fh2)

Another thing is that samopen does not really give information when a file does not exist.

So, next, we adapt the script to read ex1.sam, which does not come with a header, so we need to create one using a fai index. The following script converts SAM to BAM whether the SAM contains a header or not. This works:

Ruby
require 'biolib/samtools'
#--- convert SAM+FASTA to BAM file
fafn  = datadir+'/ex1.fa'
samfn = datadir+'/ex1.sam.gz'
faifn = datadir+'/ex1.fa.fai'
bamfn = '/tmp/ex1.bam.gz'
raise "#{samfn} does not exist" if !File.exist?(samfn)
fh1 = Biolib::Samtools.samopen(samfn,"r",nil) # use header when no SQ lines
if fh1 == nil or fh1.header.n_targets == 0
  print "No header information in #{samfn}, creating fai!\n"
  raise "#{fafn} does not exist" if !File.exist?(fafn)
  raise "Can not create faifn from #{fafn}" if Biolib::Samtools.fai_build(fafn) < 0 
  fh1 = Biolib::Samtools.samopen_listfn(samfn,"r",faifn)
end
bamheader = fh1.header
print "targets=",bamheader.n_targets,"\n"
fh2 = Biolib::Samtools.samopen(bamfn,"wb",bamheader)  
bytesread=0
while bytesread >= 0
  bambuf = Biolib::Samtools.bam_init1()
  bytesread = Biolib::Samtools.samread(fh1,bambuf)
  print bytesread,"\n"
  break if bytesread < 0
  bytesread = Biolib::Samtools.samwrite(fh2,bambuf)
  Biolib::Samtools.bam_destroy1(bambuf)
end
Biolib::Samtools.samclose(fh1)
Biolib::Samtools.samclose(fh2)

The latest version of the Ruby script is here

The Perl and Python versions will be exactly the same, as we are using SWIG, and there is nothing specific to Ruby. After my testing and modifications, simply compile the new bindings for any language. And we are done. Here the Perl version:

Perl
use biolib::samtools;
$datadir = '../../../../test/data/samtools';
$fafn  = "$datadir/ex1.fa";
$samfn = "$datadir/ex1.sam.gz";
$faifn = "$datadir/ex1.fa.fai";
$bamfn = "/tmp/ex1.bam.gz";
$fh1 = samtools::samopen($samfn,"r",undef);
if ($fh1==undef or $fh1->{header}->{n_targets} == 0) {
  print "Using fai\n";
	samtools::fai_build($fafn);
  $fh1 = samtools::samopen_listfn($samfn,"r",$faifn);
}
$bamheader = $fh1->{header};
print("targets=",$bamheader->{n_targets},"\n");
$fh2 = samtools::samopen($bamfn,"wb",$bamheader);
$bytesread = 0;
while ($bytesread >= 0) {
  $bambuf = samtools::bam_init1();
  $bytesread = samtools::samread($fh1,$bambuf);
	last if $bytesread < 0;
	print ".";
	$byteswritten = samtools::samwrite($fh2,$bambuf);
  samtools::bam_destroy1($bambuf);
}
samtools::samclose($fh1);
samtools::samclose($fh2);

Note I only had to use the debugger to figure out how SAMtools works! I did not have to modify the C-code, or SWIG, to get the script to work.

In the next phase I'll add more functions, though I would like someone else to actually test and use the code for real.

2 Discussion

The mapping of SAMtools proves a little harder than I thought. This has little to do with SWIG bindings, which can be hard too, but more with the way the SAMtools code is organised and documented. The way SAMtools does file handling and parameter passing between functions like samopen and samread is complicated, under-documented and rather brittle.

'''brittle''' (jargon): Said of software that is functional but easily broken by changes in operating environment or configuration, or by any minor tweak to the software itself (from dictionary.com).

First, just to be clear: I am grateful to the authors for doing all this work and creating the functionality and wish to thank them for that. They could justly claim that if I have a problem with their code, I am free to fix it. However, I wish that we would get rid of this great tradition in bioinformatics of writing brittle interfaces. Brittle means in the end that it is hard to maintain, also for the original authors. Sometimes it is justified, in the sense that many bioinformatics exercises are one-offs. However, projects like SAMtools are bound to be around for some time, and deserve a little more consideration.

The Bio* project authors can probably do a lot in wrapping above code, so helpful error messages get displayed, parameters get more sane (like the 'wb' vs 'bw', and the use of aux examples above). And buffers (like 'header' and 'bam1') get handled more elegantly. One thing that tripped me up is that if I passed in a non-existing file samopen would just return NULL. Likewise, passing in NULL to samwrite would fail silently.

Above SWIG bindings are a good start. They are efficient, and easy to test and maintain.

3 Conclusion

What does SWIG offer:

    >Easy binding using a map for every function
  1. Automatic binding of data structures
  2. Type safety
  3. One map binds many languages
  4. One test file works for many languages >Easy updating of bindings - when upgrading C/C++ sources

What does BioLib offer in addition:

  1. Container of many C/C++ Bio libraries and functions
  2. Recent and controlled versions of contained libraries
  3. Support for Perl, Python, Ruby (Java and JVM languages is coming)
  4. One system for deployment on Linux, OSX, Windows
  5. Generated API documentation from SWIG bindings
  6. Generated test files (planned)

And something else. EMBOSS comes with SAMtools, now. And BioLib contains EMBOSS. As long as the EMBOSS SAM API matches the SAMtools API we can use exactly the same SWIG bindings! And use either library using the same Perl, Python or Ruby code.

Problems when mapping C libraries:

Speed of mapping depends on the API

If the C API is non-inituitive the mapping is harder. A sane API makes easy binding with SWIG, or any other binding method.

SWIG is not well documented when it comes to hand-written mappings

SWIG has a lot of documentation, which mostly suffices. Still, when special bindings need to be written, in many cases the required macros are not well documented on the web. It would be useful to have a number of examples and use cases of 'specialised' bindings online. As it stands I write a Ruby version first, and next look for the more generic macros and see what they generate for all languages. The SWIG source distribution includes a lot of useful information on low-level access, which should be checked when you get stuck.


Bibliography


wikiTEXer 0.55 by Pjotr Prins - generated 2011-03-10 12:54 by pjotrp