NCBIx::BigFetch

Robustly retrieve very large NCBI sequence result sets based on keyword searches using NCBI eUtils

NCBIx::BigFetch Ranking & Summary

  • License: Perl Artistic License
  • Price: FREE
  • Publisher Name: Roger A Hall
  • Publisher web site: http://search.cpan.org/~rogerhall/


NCBIx::BigFetch Description

NCBIx::BigFetch is a Perl module for downloading very large result sets of sequences from NCBI given a text query. Its first use had over 11,000,000 sequences as the result of a single keyword search. It uses YAML to create a configuration file that maintains project state in case network or server issues interrupt execution, in which case it may be easily restarted after the last batch.

Downloaded data is organized by "project id" and "base directory" and saved in text files. Each file includes the project id in its name. The project_id and base_dir keys are the only required keys, although you will get the same search for "apoptosis" every time unless you also set the "query" key. In any case, once a project is started, it only needs these two parameters to be reloaded.

Besides the data files, two other files are saved: 1) the initial search result, which includes the WebEnv key, and 2) a configuration file, which saves the parsed data and is used to pick up the download and recover missing batches or sequences.

Results are retrieved in batches whose size depends on the "return_max" key. By default, the "index" starts at 1 and downloads continue until the index exceeds "count".

Occasionally errors happen and entire batches are not downloaded. In this case, the "index" is added to the "missing" list. This list is saved in the configuration file. The missing batches should be downloaded every day rather than put off until the end of the complete run.

Working scripts are included in the script directory:

    fetch-all.pp
    fetch-missing.pp
    fetch-unavailable.pp

The recommended workflow is:

1. Copy the scripts and edit them for a specific project. Use a new number as the project ID.
2. Begin downloading by running fetch-all.pp, which will first submit a query and save the resulting WebEnv key in a project-specific configuration file (using YAML).
3. The next morning, kill the fetch-all.pp process and run fetch-missing.pp until it completes.
4. Restart fetch-all.pp.

If you wish to re-download "not available" sequences, you may run fetch-unavailable.pp. However, they will be downloaded at the end of fetch-all.pp if it completes normally.

If your query result set is so large that your WebEnv times out, simply start a new project with the last index of the previous project, and it will pick up the result set from there (with a new WebEnv). (A planned upgrade will automagically start another search.)

Warning: You may lose a (very) few sequences if your download extends across multiple projects. However, our testing shows that the batches generated with the same query within a few days of each other are largely identical.

SYNOPSIS

    use NCBIx::BigFetch;

    # Parameters
    my $params = { project_id => "1",
                   base_dir   => "/home/user/data",
                   db         => "protein",
                   query      => "apoptosis",
                   return_max => "500" };

    # Start project
    my $project = NCBIx::BigFetch->new( $params );

    # Love the one you're with
    print " AUTHORS: " . $project->authors() . "\n";

    # Attempt all batches of sequences
    while ( $project->results_waiting() ) { $project->get_next_batch(); }

    # Get missing batches
    while ( $project->missing_batches() ) { $project->get_missing_batch(); }

    # Find unavailable ids
    my $ids = $project->unavailable_ids();

    # Retrieve unavailable ids
    foreach my $id ( @$ids ) { $project->get_sequence( $id ); }

Requirements:

  • Perl
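The batching described above ("index" starting at 1, advancing until it exceeds "count", with failed batches recorded in a "missing" list) can be illustrated with a small standalone sketch. This is not the module's actual implementation: the helper name batch_starts, the simulated failure, and the assumption that the index advances by exactly "return_max" per batch are all hypothetical and used only for illustration. No network access or NCBIx::BigFetch call is involved.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NCBIx::BigFetch): list the start index of
# every batch, given the total hit count and the batch size (return_max).
# Assumption: the index advances by return_max after each batch.
sub batch_starts {
    my ( $count, $return_max, $start ) = @_;
    $start = 1 unless defined $start;    # the index starts at 1 by default
    my @starts;
    for ( my $index = $start ; $index <= $count ; $index += $return_max ) {
        push @starts, $index;
    }
    return @starts;
}

# Simulated run: pretend the batch starting at index 1001 fails to download.
my %fails = ( 1001 => 1 );
my @missing;
for my $index ( batch_starts( 2300, 500 ) ) {
    if ( $fails{$index} ) {
        push @missing, $index;    # remember the batch for a later retry
        next;
    }
    # A real run would download records $index .. $index + 499 here.
}

print "batch starts: @{[ batch_starts( 2300, 500 ) ]}\n";
print "missing: @missing\n";
```

With a count of 2300 and a return_max of 500, the sketch lists five batches starting at 1, 501, 1001, 1501 and 2001, and records 1001 as missing. Passing a third argument, for example batch_starts( 2300, 500, 1001 ), mimics starting a new project from the last index of a previous one.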

