Class ExternalSort

java.lang.Object
org.apache.jackrabbit.oak.commons.sort.ExternalSort

public class ExternalSort extends Object
Source copied from a publicly available library.
See Also:
  • https://code.google.com/p/externalsortinginjava
     Goal: offer a generic external-memory sorting program in Java.
    
     It must be : - hackable (easy to adapt) - scalable to large files - sensibly efficient.
    
     This software is in the public domain.
    
     Usage: java org/apache/oak/commons/sort//ExternalSort somefile.txt out.txt
    
     You can change the default maximal number of temporary files with the -t flag: java
     org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt -t 3
    
     You can change the default maximum memory available with the -m flag: java
     org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt -m 8192
    
     For very large files, you might want to use an appropriate flag to allocate more memory to
     the Java VM: java -Xms2G org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt
    
     By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon Elsas, Christan
     Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, Jerry Yang, First published:
     April 2010 originally posted at
     http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
     
  • Field Details

  • Constructor Details

    • ExternalSort

      public ExternalSort()
  • Method Details

    • sort

      public static void sort(File input, File output) throws IOException
      Throws:
      IOException
    • estimateBestSizeOfBlocks

      public static long estimateBestSizeOfBlocks(File filetobesorted, int maxtmpfiles, long maxMemory)
    • sortInBatch

      public static List<File> sortInBatch(File file, Predicate<String> filterPredicate) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      Parameters:
      file - some flat file
      filterPredicate - predicate to keep data which need to be sorted
      Returns:
      a list of temporary flat files
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, Predicate<String> filterPredicate) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      Parameters:
      file - some flat file
      cmp - string comparator
      filterPredicate - predicate to keep data which need to be sorted
      Returns:
      a list of temporary flat files
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, boolean distinct, Predicate<String> filterPredicate) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      Parameters:
      file - some flat file
      cmp - string comparator
      distinct - Pass true if duplicate lines should be discarded.
      filterPredicate - predicate to keep data which need to be sorted
      Returns:
      a list of temporary flat files
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, boolean distinct) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, Predicate<String> filterPredicate) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
      Parameters:
      file - some flat file
      cmp - string comparator
      maxtmpfiles - maximal number of temporary files
      cs - character set to use (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      numHeader - number of lines to preclude before sorting starts
      usegzip - use gzip compression for the temporary files
      filterPredicate - predicate to keep data which need to be sorted
      Returns:
      a list of temporary flat files
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, Compression algorithm) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, Compression algorithm, Predicate<String> filterPredicate) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static <T> List<File> sortInBatch(File file, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType, Predicate<T> filterPredicate) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
      Parameters:
      file - some flat file
      cmp - string comparator
      maxtmpfiles - maximal number of temporary files
      cs - character set to use (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      numHeader - number of lines to preclude before sorting starts
      usegzip - use gzip compression for the temporary files
      typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
      stringToType - function to map custom type to string. Used for storing sorted content back to file
      filterPredicate - predicate to keep data which need to be sorted
      Returns:
      a list of temporary flat files
      Throws:
      IOException
    • sortInBatch

      public static <T> List<File> sortInBatch(File file, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static <T> List<File> sortInBatch(File file, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, Compression algorithm, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static <T> List<File> sortInBatch(File file, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, Compression algorithm, Function<T,String> typeToString, Function<String,T> stringToType, Predicate<T> filterPredicate) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static <T> List<File> sortInBatch(BufferedReader fbr, long actualFileSize, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static <T> List<File> sortInBatch(BufferedReader fbr, long actualFileSize, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType, Predicate<T> filterPredicate) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static <T> List<File> sortInBatch(BufferedReader fbr, long actualFileSize, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, Compression algorithm, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static <T> List<File> sortInBatch(BufferedReader fbr, long actualFileSize, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, Compression algorithm, Function<T,String> typeToString, Function<String,T> stringToType, Predicate<T> filterPredicate) throws IOException
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, Predicate<String> filterPredicate) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
      Parameters:
      file - some flat file
      cmp - string comparator
      maxtmpfiles - maximal number of temporary files
      cs - character set to use (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      filterPredicate - predicate to keep data which need to be sorted
      Returns:
      a list of temporary flat files
      Throws:
      IOException
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct) throws IOException
      Throws:
      IOException
    • sortAndSave

      public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, Predicate<String> filterPredicate) throws IOException
      Sort a list and save it to a temporary file
      Parameters:
      tmplist - data to be sorted
      cmp - string comparator
      cs - charset to use for output (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      filterPredicate - predicate to keep data which need to be sorted
      Returns:
      the file containing the sorted data
      Throws:
      IOException
    • sortAndSave

      public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory) throws IOException
      Throws:
      IOException
    • sortAndSave

      public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip, Predicate<String> filterPredicate) throws IOException
      Sort a list and save it to a temporary file
      Parameters:
      tmplist - data to be sorted
      cmp - string comparator
      cs - charset to use for output (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      filterPredicate - predicate to keep data which need to be sorted
      Returns:
      the file containing the sorted data
      Throws:
      IOException
    • sortAndSave

      public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip) throws IOException
      Throws:
      IOException
    • sortAndSave

      public static <T> File sortAndSave(List<T> tmplist, Comparator<T> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip, Function<T,String> typeToString, Predicate<T> filterPredicate) throws IOException
      Sort a list and save it to a temporary file
      Parameters:
      tmplist - data to be sorted
      cmp - string comparator
      cs - charset to use for output (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct -
      usegzip - assumes we used gzip compression for temporary files
      typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
      filterPredicate - predicate to keep data which need to be sorted
      Returns:
      the file containing the sorted data
      Throws:
      IOException
    • sortAndSave

      public static <T> File sortAndSave(List<T> tmplist, Comparator<T> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip, Function<T,String> typeToString) throws IOException
      Throws:
      IOException
    • sortAndSave

      public static <T> File sortAndSave(List<T> tmplist, Comparator<T> cmp, Charset cs, File tmpdirectory, boolean distinct, Compression algorithm, Function<T,String> typeToString, @Nullable @Nullable Predicate<T> filterPredicate) throws IOException
      Sort a list and save it to a temporary file. In case this method is directly used and path filters predicates are provided, try to avoid usage of ArrayList (tmplist) as removal from ArrayList is O(n) operation.
      Parameters:
      tmplist - data to be sorted
      cmp - string comparator
      cs - charset to use for output (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct -
      algorithm - compression algorithm to use for the temporary files
      typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
      filterPredicate - predicate to keep data which need to be sorted
      Returns:
      the file containing the sorted data
      Throws:
      IOException
    • sortAndSave

      public static <T> File sortAndSave(List<T> tmplist, Comparator<T> cmp, Charset cs, File tmpdirectory, boolean distinct, Compression algorithm, Function<T,String> typeToString) throws IOException
      Throws:
      IOException
    • mergeSortedFiles

      public static int mergeSortedFiles(List<File> files, File outputfile) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files -
      outputfile - file
      Returns:
      The number of lines sorted. (P. Beaudoin)
      Throws:
      IOException
    • mergeSortedFiles

      public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files -
      outputfile - file
      Returns:
      The number of lines sorted. (P. Beaudoin)
      Throws:
      IOException
    • mergeSortedFiles

      public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, boolean distinct) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files -
      outputfile - file
      Returns:
      The number of lines sorted. (P. Beaudoin)
      Throws:
      IOException
    • mergeSortedFiles

      public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files - The List of sorted Files to be merged.
      distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
      outputfile - The output File to merge the results to.
      cmp - The Comparator to use to compare Strings.
      cs - The Charset to be used for the byte to character conversion.
      append - Pass true if result should append to File instead of overwrite. Default to be false for overloading methods.
      usegzip - assumes we used gzip compression for temporary files
      Returns:
      The number of lines sorted. (P. Beaudoin)
      Throws:
      IOException
      Since:
      v0.1.4
    • mergeSortedFiles

      public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, Compression algorithm) throws IOException
      Throws:
      IOException
    • mergeSortedFiles

      public static <T> int mergeSortedFiles(List<File> files, File outputfile, Comparator<T> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
      This merges a bunch of temporary flat files and deletes them on success or error.
      Parameters:
      files - The List of sorted Files to be merged.
      outputfile - The output File to merge the results to.
      cmp - The Comparator to use to compare Strings.
      cs - The Charset to be used for the byte to character conversion.
      distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
      append - Pass true if result should append to File instead of overwrite. Default to be false for overloading methods.
      usegzip - assumes we used gzip compression for temporary files
      typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
      stringToType - function to map custom type to string. Used for storing sorted content back to file
      Throws:
      IOException
      Since:
      v0.1.4
    • mergeSortedFiles

      public static <T> int mergeSortedFiles(List<File> files, File outputfile, Comparator<T> cmp, Charset cs, boolean distinct, boolean append, Compression algorithm, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
      Throws:
      IOException
    • mergeSortedFiles

      public static <T> int mergeSortedFiles(List<File> files, BufferedWriter fbw, Comparator<T> cmp, Charset cs, boolean distinct, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
      This merges a bunch of temporary flat files and deletes them on success or error.
      Parameters:
      files - The List of sorted Files to be merged.
      fbw - Buffered writer used to store the sorted content
      cmp - The Comparator to use to compare Strings.
      cs - The Charset to be used for the byte to character conversion.
      distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
      usegzip - assumes we used gzip compression for temporary files
      typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
      stringToType - function to map custom type to string. Used for storing sorted content back to file
      Throws:
      IOException
      Since:
      v0.1.4
    • mergeSortedFiles

      public static <T> int mergeSortedFiles(List<File> files, BufferedWriter fbw, Comparator<T> cmp, Charset cs, boolean distinct, Compression algorithm, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
      This merges a bunch of temporary flat files and deletes them on success or error.
      Parameters:
      files - The List of sorted Files to be merged.
      fbw - Buffered writer used to store the sorted content
      cmp - The Comparator to use to compare Strings.
      cs - The Charset to be used for the byte to character conversion.
      distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
      algorithm - algorithm for compression by default assumes we used gzip compression for temporary files
      typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
      stringToType - function to map custom type to string. Used for storing sorted content back to file
      Throws:
      IOException
      Since:
      v0.1.4
    • merge

      public static <T> int merge(BufferedWriter fbw, Comparator<T> cmp, boolean distinct, List<org.apache.jackrabbit.oak.commons.sort.BinaryFileBuffer<T>> buffers, Function<T,String> typeToString) throws IOException
      This merges several BinaryFileBuffer to an output writer.
      Parameters:
      fbw - A buffer where we write the data.
      cmp - A comparator object that tells us how to sort the lines.
      distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
      buffers - Where the data should be read.
      typeToString - function to map string to custom type. User for converting line to custom type for the purpose of sorting
      Throws:
      IOException
    • mergeSortedFiles

      public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files - The List of sorted Files to be merged.
      distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
      outputfile - The output File to merge the results to.
      cmp - The Comparator to use to compare Strings.
      cs - The Charset to be used for the byte to character conversion.
      Returns:
      The number of lines sorted. (P. Beaudoin)
      Throws:
      IOException
      Since:
      v0.1.2
    • mergeSortedFiles

      public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files -
      outputfile - file
      cs - character set to use to load the strings
      Returns:
      The number of lines sorted. (P. Beaudoin)
      Throws:
      IOException
    • displayUsage

      public static void displayUsage()
    • main

      public static void main(String[] args) throws IOException
      Throws:
      IOException