Class ExternalSort


  • public class ExternalSort
    extends java.lang.Object
    Source copied from a publicly available library.
    See Also:
    https://code.google.com/p/externalsortinginjava
     Goal: offer a generic external-memory sorting program in Java.
    
     It must be:
     - hackable (easy to adapt)
     - scalable to large files
     - sensibly efficient
    
     This software is in the public domain.
    
     Usage:
       java org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt
    
     You can change the default maximal number of temporary files with the -t flag:
       java org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt -t 3
    
     You can change the default maximum memory available with the -m flag:
       java org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt -m 8192
    
     For very large files, you might want to use an appropriate flag to allocate more memory to
     the Java VM:
       java -Xms2G org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt
    
     By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon Elsas, Christan
     Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, Jerry Yang.
    
     First published: April 2010. Originally posted at
     http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
     
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static java.util.Comparator<java.lang.String> defaultcomparator  
    • Constructor Summary

      Constructors 
      Constructor Description
      ExternalSort()  
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      static void displayUsage()  
      static long estimateBestSizeOfBlocks​(java.io.File filetobesorted, int maxtmpfiles, long maxMemory)  
      static void main​(java.lang.String[] args)  
      static <T> int merge​(java.io.BufferedWriter fbw, java.util.Comparator<T> cmp, boolean distinct, java.util.List<org.apache.jackrabbit.oak.commons.sort.BinaryFileBuffer<T>> buffers, java.util.function.Function<T,​java.lang.String> typeToString)
      This merges several BinaryFileBuffer to an output writer.
      static <T> int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.BufferedWriter fbw, java.util.Comparator<T> cmp, java.nio.charset.Charset cs, boolean distinct, boolean usegzip, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType)
      This merges a bunch of temporary flat files and deletes them on success or error.
      static <T> int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.BufferedWriter fbw, java.util.Comparator<T> cmp, java.nio.charset.Charset cs, boolean distinct, Compression algorithm, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType)
      This merges a bunch of temporary flat files and deletes them on success or error.
      static int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.File outputfile)
      This merges a bunch of temporary flat files
      static int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp)
      This merges a bunch of temporary flat files
      static int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, boolean distinct)
      This merges a bunch of temporary flat files
      static int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs)
      This merges a bunch of temporary flat files
      static int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct)
      This merges a bunch of temporary flat files
      static int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct, boolean append, boolean usegzip)
      This merges a bunch of temporary flat files
      static int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct, boolean append, Compression algorithm)  
      static <T> int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<T> cmp, java.nio.charset.Charset cs, boolean distinct, boolean append, boolean usegzip, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType)
      This merges a bunch of temporary flat files and deletes them on success or error.
      static <T> int mergeSortedFiles​(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<T> cmp, java.nio.charset.Charset cs, boolean distinct, boolean append, Compression algorithm, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType)  
      static void sort​(java.io.File input, java.io.File output)  
      static java.io.File sortAndSave​(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory)  
      static java.io.File sortAndSave​(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, boolean usegzip)  
      static java.io.File sortAndSave​(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, boolean usegzip, java.util.function.Predicate<java.lang.String> filterPredicate)
      Sort a list and save it to a temporary file
      static java.io.File sortAndSave​(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, java.util.function.Predicate<java.lang.String> filterPredicate)
      Sort a list and save it to a temporary file
      static <T> java.io.File sortAndSave​(java.util.List<T> tmplist, java.util.Comparator<T> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, boolean usegzip, java.util.function.Function<T,​java.lang.String> typeToString)  
      static <T> java.io.File sortAndSave​(java.util.List<T> tmplist, java.util.Comparator<T> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, boolean usegzip, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Predicate<T> filterPredicate)
      Sort a list and save it to a temporary file
      static <T> java.io.File sortAndSave​(java.util.List<T> tmplist, java.util.Comparator<T> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, Compression algorithm, java.util.function.Function<T,​java.lang.String> typeToString)  
      static <T> java.io.File sortAndSave​(java.util.List<T> tmplist, java.util.Comparator<T> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, Compression algorithm, java.util.function.Function<T,​java.lang.String> typeToString, @Nullable java.util.function.Predicate<T> filterPredicate)
      Sort a list and save it to a temporary file.
      static <T> java.util.List<java.io.File> sortInBatch​(java.io.BufferedReader fbr, long actualFileSize, java.util.Comparator<T> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType)  
      static <T> java.util.List<java.io.File> sortInBatch​(java.io.BufferedReader fbr, long actualFileSize, java.util.Comparator<T> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType, java.util.function.Predicate<T> filterPredicate)  
      static <T> java.util.List<java.io.File> sortInBatch​(java.io.BufferedReader fbr, long actualFileSize, java.util.Comparator<T> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, Compression algorithm, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType)  
      static <T> java.util.List<java.io.File> sortInBatch​(java.io.BufferedReader fbr, long actualFileSize, java.util.Comparator<T> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, Compression algorithm, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType, java.util.function.Predicate<T> filterPredicate)  
      static java.util.List<java.io.File> sortInBatch​(java.io.File file)  
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<java.lang.String> cmp)  
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<java.lang.String> cmp, boolean distinct)  
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<java.lang.String> cmp, boolean distinct, java.util.function.Predicate<java.lang.String> filterPredicate)
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct)  
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip)  
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, java.util.function.Predicate<java.lang.String> filterPredicate)
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, Compression algorithm)  
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, Compression algorithm, java.util.function.Predicate<java.lang.String> filterPredicate)  
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, java.util.function.Predicate<java.lang.String> filterPredicate)
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<java.lang.String> cmp, java.util.function.Predicate<java.lang.String> filterPredicate)
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      static <T> java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<T> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType)  
      static <T> java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<T> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType, java.util.function.Predicate<T> filterPredicate)
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      static <T> java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<T> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, Compression algorithm, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType)  
      static <T> java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.Comparator<T> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, Compression algorithm, java.util.function.Function<T,​java.lang.String> typeToString, java.util.function.Function<java.lang.String,​T> stringToType, java.util.function.Predicate<T> filterPredicate)  
      static java.util.List<java.io.File> sortInBatch​(java.io.File file, java.util.function.Predicate<java.lang.String> filterPredicate)
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • defaultcomparator

        public static java.util.Comparator<java.lang.String> defaultcomparator
    • Constructor Detail

      • ExternalSort

        public ExternalSort()
    • Method Detail

      • sort

        public static void sort​(java.io.File input,
                                java.io.File output)
                         throws java.io.IOException
        Throws:
        java.io.IOException
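        This page gives no description for sort. Going by the method summary, sorting has two phases: sortInBatch writes sorted temporary files, and mergeSortedFiles merges them. The merge phase can be sketched with a priority queue holding the current line of each file. This is a minimal illustration, not the library's code: MergeSketch, Cursor and mergeSorted are hypothetical names, and whether sort chains exactly these two phases is an assumption.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch class; not part of ExternalSort.
class MergeSketch {

    // One open sorted file plus its current, not yet written, line.
    private static final class Cursor {
        final BufferedReader reader;
        String head;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        void advance() throws IOException {
            head = reader.readLine();
        }
    }

    // Merges already-sorted files into output; returns the number of lines written.
    static int mergeSorted(List<File> sortedFiles, File output, Comparator<String> cmp)
            throws IOException {
        // The queue always exposes the cursor whose current line is smallest.
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>((a, b) -> cmp.compare(a.head, b.head));
        List<BufferedReader> readers = new ArrayList<>();
        int written = 0;
        try (BufferedWriter out =
                Files.newBufferedWriter(output.toPath(), StandardCharsets.UTF_8)) {
            for (File f : sortedFiles) {
                BufferedReader r = Files.newBufferedReader(f.toPath(), StandardCharsets.UTF_8);
                readers.add(r);
                Cursor c = new Cursor(r);
                if (c.head != null) {
                    queue.add(c); // empty files never enter the queue
                }
            }
            while (!queue.isEmpty()) {
                Cursor smallest = queue.poll();
                out.write(smallest.head);
                out.newLine();
                written++;
                smallest.advance();
                if (smallest.head != null) {
                    queue.add(smallest); // re-insert with its next line
                }
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        File a = File.createTempFile("run", ".txt");
        File b = File.createTempFile("run", ".txt");
        Files.write(a.toPath(), Arrays.asList("apple", "melon"));
        Files.write(b.toPath(), Arrays.asList("banana", "zebra"));
        File out = File.createTempFile("merged", ".txt");
        int n = mergeSorted(Arrays.asList(a, b), out, Comparator.naturalOrder());
        System.out.println(n + " lines: " + Files.readAllLines(out.toPath()));
    }
}
```

        Only one line per input file is held in memory at a time, which is what lets the merge scale to inputs larger than the heap.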
      • estimateBestSizeOfBlocks

        public static long estimateBestSizeOfBlocks​(java.io.File filetobesorted,
                                                    int maxtmpfiles,
                                                    long maxMemory)
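        This page gives no description for estimateBestSizeOfBlocks. Judging only from its name and parameters, it has to balance two limits: blocks small enough to fit in memory, and large enough that no more than maxtmpfiles temporary files are created. The sketch below shows one plausible heuristic. The formula, the byte units and the name BlockSizeSketch are all assumptions and are not confirmed by this page.

```java
// Hypothetical sketch class; not part of ExternalSort.
class BlockSizeSketch {

    // Assumed heuristic: divide the file evenly over maxTmpFiles blocks (rounding up),
    // but never go below half of the memory budget, so that small files are not
    // split into needlessly tiny blocks.
    static long estimateBlockSize(long fileSizeBytes, int maxTmpFiles, long maxMemoryBytes) {
        long perFile = fileSizeBytes / maxTmpFiles
                + (fileSizeBytes % maxTmpFiles == 0 ? 0 : 1);
        return Math.max(perFile, maxMemoryBytes / 2);
    }

    public static void main(String[] args) {
        // 1000 bytes over at most 10 files with a 50-byte budget: 100 bytes per block.
        System.out.println(estimateBlockSize(1000, 10, 50));
        // A small file with a generous budget: half the budget wins.
        System.out.println(estimateBlockSize(1000, 1024, 4096));
    }
}
```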
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.function.Predicate<java.lang.String> filterPredicate)
                                                        throws java.io.IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        file - some flat file
        filterPredicate - predicate selecting the lines to keep for sorting
        Returns:
        a list of temporary flat files
        Throws:
        java.io.IOException
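        The batch phase described above (load a block of lines, sort it in memory, write it to a temporary file) can be sketched as follows. This is an illustration under assumptions, not the library's implementation: BatchSketch, sortInBlocks and flush are hypothetical names, blocks are bounded by a line count rather than a memory estimate, and lines rejected by the predicate are assumed to be dropped.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch class; not part of ExternalSort.
class BatchSketch {

    // Reads the input in blocks of at most linesPerBlock kept lines, sorts each
    // block in memory and writes it to its own temporary file.
    static List<File> sortInBlocks(File input, int linesPerBlock, Predicate<String> keep)
            throws IOException {
        List<File> runs = new ArrayList<>();
        List<String> block = new ArrayList<>();
        try (BufferedReader in =
                Files.newBufferedReader(input.toPath(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!keep.test(line)) {
                    continue; // assumption: rejected lines are simply dropped
                }
                block.add(line);
                if (block.size() == linesPerBlock) {
                    runs.add(flush(block));
                }
            }
        }
        if (!block.isEmpty()) {
            runs.add(flush(block)); // last, possibly shorter, block
        }
        return runs;
    }

    // Sorts the block, writes it to a fresh temporary file and empties the block.
    private static File flush(List<String> block) throws IOException {
        Collections.sort(block);
        File tmp = File.createTempFile("sortInBatch", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), block, StandardCharsets.UTF_8);
        block.clear();
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        File input = File.createTempFile("input", ".txt");
        Files.write(input.toPath(), Arrays.asList("pear", "", "apple", "fig", "banana"));
        List<File> runs = sortInBlocks(input, 2, s -> !s.isEmpty());
        for (File run : runs) {
            System.out.println(Files.readAllLines(run.toPath()));
        }
    }
}
```

        Each returned file is sorted on its own; the files still have to be merged to obtain a fully sorted result.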
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file)
                                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.Comparator<java.lang.String> cmp,
                                                               java.util.function.Predicate<java.lang.String> filterPredicate)
                                                        throws java.io.IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        file - some flat file
        cmp - string comparator
        filterPredicate - predicate selecting the lines to keep for sorting
        Returns:
        a list of temporary flat files
        Throws:
        java.io.IOException
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.Comparator<java.lang.String> cmp)
                                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.Comparator<java.lang.String> cmp,
                                                               boolean distinct,
                                                               java.util.function.Predicate<java.lang.String> filterPredicate)
                                                        throws java.io.IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        file - some flat file
        cmp - string comparator
        distinct - Pass true if duplicate lines should be discarded.
        filterPredicate - predicate selecting the lines to keep for sorting
        Returns:
        a list of temporary flat files
        Throws:
        java.io.IOException
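        One way to picture the distinct flag: once a block is sorted, duplicates sit next to each other, so a single pass comparing each line with the previously kept one is enough to discard them. The sketch below assumes that "duplicate" means the comparator returns 0; DistinctSketch and sortDistinct are hypothetical names, not library API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch class; not part of ExternalSort.
class DistinctSketch {

    // Returns the lines sorted by cmp, keeping only the first of each run of equal lines.
    static List<String> sortDistinct(List<String> lines, Comparator<String> cmp) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(cmp);
        List<String> result = new ArrayList<>();
        String previous = null;
        for (String line : sorted) {
            // Assumption: two lines are duplicates when the comparator reports equality.
            if (previous == null || cmp.compare(line, previous) != 0) {
                result.add(line);
                previous = line;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("b", "a", "b", "c", "a");
        System.out.println(sortDistinct(input, Comparator.naturalOrder())); // prints [a, b, c]
    }
}
```

        Note that this only removes duplicates within one block; duplicates spread across several temporary files would have to be handled again during the merge.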
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.Comparator<java.lang.String> cmp,
                                                               boolean distinct)
                                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.Comparator<java.lang.String> cmp,
                                                               int maxtmpfiles,
                                                               long maxMemory,
                                                               java.nio.charset.Charset cs,
                                                               java.io.File tmpdirectory,
                                                               boolean distinct,
                                                               int numHeader,
                                                               boolean usegzip,
                                                               java.util.function.Predicate<java.lang.String> filterPredicate)
                                                        throws java.io.IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        maxtmpfiles - maximal number of temporary files
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        numHeader - number of header lines to skip before sorting starts
        usegzip - use gzip compression for the temporary files
        filterPredicate - predicate selecting the lines to keep for sorting
        Returns:
        a list of temporary flat files
        Throws:
        java.io.IOException
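        The numHeader, cs, tmpdirectory and usegzip parameters can be illustrated with a small sketch that skips leading lines and writes one sorted, gzip-compressed temporary file. It is not the library's code: GzipRunSketch, writeSortedRun and readRun are hypothetical names, the whole input is treated as a single block, and skipped header lines are assumed to be dropped.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch class; not part of ExternalSort.
class GzipRunSketch {

    // Skips the first numHeader lines, sorts the rest and writes them gzip-compressed.
    static File writeSortedRun(File input, int numHeader, Charset cs, File tmpDirectory)
            throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input.toPath(), cs)) {
            String line;
            int seen = 0;
            while ((line = in.readLine()) != null) {
                if (seen++ < numHeader) {
                    continue; // assumption: header lines are dropped, not sorted
                }
                lines.add(line);
            }
        }
        Collections.sort(lines);
        // A null directory makes createTempFile use the default temporary directory.
        File tmp = File.createTempFile("sortInBatch", ".gz", tmpDirectory);
        tmp.deleteOnExit();
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(tmp.toPath())), cs))) {
            for (String l : lines) {
                out.write(l);
                out.newLine();
            }
        }
        return tmp;
    }

    // Reads a gzip-compressed run back into memory (for demonstration only).
    static List<String> readRun(File run, Charset cs) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(run.toPath())), cs))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        File input = File.createTempFile("input", ".csv");
        Files.write(input.toPath(), Arrays.asList("name", "carol", "alice", "bob"));
        File run = writeSortedRun(input, 1, StandardCharsets.UTF_8, null);
        System.out.println(readRun(run, StandardCharsets.UTF_8)); // prints [alice, bob, carol]
    }
}
```

        Compressing temporary files trades CPU time for disk space and I/O, which tends to pay off when the data is repetitive text.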
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.Comparator<java.lang.String> cmp,
                                                               int maxtmpfiles,
                                                               long maxMemory,
                                                               java.nio.charset.Charset cs,
                                                               java.io.File tmpdirectory,
                                                               boolean distinct,
                                                               int numHeader,
                                                               boolean usegzip)
                                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.Comparator<java.lang.String> cmp,
                                                               int maxtmpfiles,
                                                               long maxMemory,
                                                               java.nio.charset.Charset cs,
                                                               java.io.File tmpdirectory,
                                                               boolean distinct,
                                                               int numHeader,
                                                               Compression algorithm)
                                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.Comparator<java.lang.String> cmp,
                                                               int maxtmpfiles,
                                                               long maxMemory,
                                                               java.nio.charset.Charset cs,
                                                               java.io.File tmpdirectory,
                                                               boolean distinct,
                                                               int numHeader,
                                                               Compression algorithm,
                                                               java.util.function.Predicate<java.lang.String> filterPredicate)
                                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static <T> java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                                   java.util.Comparator<T> cmp,
                                                                   int maxtmpfiles,
                                                                   long maxMemory,
                                                                   java.nio.charset.Charset cs,
                                                                   java.io.File tmpdirectory,
                                                                   boolean distinct,
                                                                   int numHeader,
                                                                   boolean usegzip,
                                                                   java.util.function.Function<T,​java.lang.String> typeToString,
                                                                   java.util.function.Function<java.lang.String,​T> stringToType,
                                                                   java.util.function.Predicate<T> filterPredicate)
                                                            throws java.io.IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - comparator for the custom type T
        maxtmpfiles - maximal number of temporary files
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        numHeader - number of header lines to skip before sorting starts
        usegzip - use gzip compression for the temporary files
        typeToString - function to map the custom type to a string. Used for storing sorted content back to file
        stringToType - function to map a string to the custom type. Used for converting a line to the custom type for the purpose of sorting
        filterPredicate - predicate selecting the values to keep for sorting
        Returns:
        a list of temporary flat files
        Throws:
        java.io.IOException
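        The roles of stringToType and typeToString follow from their signatures: Function<String,T> parses a line into the custom type before sorting, and Function<T,String> turns the sorted values back into lines. A minimal in-memory sketch of that round trip, with hypothetical names TypedSortSketch and sortLines and no temporary files involved:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch class; not part of ExternalSort.
class TypedSortSketch {

    // Parses each line to T, sorts with cmp, then serializes the values back to lines.
    static <T> List<String> sortLines(List<String> lines,
                                      Comparator<T> cmp,
                                      Function<T, String> typeToString,
                                      Function<String, T> stringToType) {
        List<T> parsed = new ArrayList<>();
        for (String line : lines) {
            parsed.add(stringToType.apply(line));
        }
        parsed.sort(cmp);
        List<String> out = new ArrayList<>();
        for (T value : parsed) {
            out.add(typeToString.apply(value));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sorting as integers gives numeric order, unlike plain string comparison.
        List<String> sorted = sortLines(
                Arrays.asList("10", "9", "100"),
                Comparator.<Integer>naturalOrder(),
                n -> Integer.toString(n),
                Integer::valueOf);
        System.out.println(sorted); // prints [9, 10, 100]
    }
}
```

        Parsing once up front means the comparator works on typed values instead of re-parsing strings on every comparison.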
      • sortInBatch

        public static <T> java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                                   java.util.Comparator<T> cmp,
                                                                   int maxtmpfiles,
                                                                   long maxMemory,
                                                                   java.nio.charset.Charset cs,
                                                                   java.io.File tmpdirectory,
                                                                   boolean distinct,
                                                                   int numHeader,
                                                                   boolean usegzip,
                                                                   java.util.function.Function<T,​java.lang.String> typeToString,
                                                                   java.util.function.Function<java.lang.String,​T> stringToType)
                                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static <T> java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                                   java.util.Comparator<T> cmp,
                                                                   int maxtmpfiles,
                                                                   long maxMemory,
                                                                   java.nio.charset.Charset cs,
                                                                   java.io.File tmpdirectory,
                                                                   boolean distinct,
                                                                   int numHeader,
                                                                   Compression algorithm,
                                                                   java.util.function.Function<T,​java.lang.String> typeToString,
                                                                   java.util.function.Function<java.lang.String,​T> stringToType)
                                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static <T> java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                                   java.util.Comparator<T> cmp,
                                                                   int maxtmpfiles,
                                                                   long maxMemory,
                                                                   java.nio.charset.Charset cs,
                                                                   java.io.File tmpdirectory,
                                                                   boolean distinct,
                                                                   int numHeader,
                                                                   Compression algorithm,
                                                                   java.util.function.Function<T,​java.lang.String> typeToString,
                                                                   java.util.function.Function<java.lang.String,​T> stringToType,
                                                                   java.util.function.Predicate<T> filterPredicate)
                                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static <T> java.util.List<java.io.File> sortInBatch​(java.io.BufferedReader fbr,
                                                                   long actualFileSize,
                                                                   java.util.Comparator<T> cmp,
                                                                   int maxtmpfiles,
                                                                   long maxMemory,
                                                                   java.nio.charset.Charset cs,
                                                                   java.io.File tmpdirectory,
                                                                   boolean distinct,
                                                                   int numHeader,
                                                                   boolean usegzip,
                                                                   java.util.function.Function<T,​java.lang.String> typeToString,
                                                                   java.util.function.Function<java.lang.String,​T> stringToType)
                                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static <T> java.util.List<java.io.File> sortInBatch​(java.io.BufferedReader fbr,
                                                                   long actualFileSize,
                                                                   java.util.Comparator<T> cmp,
                                                                   int maxtmpfiles,
                                                                   long maxMemory,
                                                                   java.nio.charset.Charset cs,
                                                                   java.io.File tmpdirectory,
                                                                   boolean distinct,
                                                                   int numHeader,
                                                                   boolean usegzip,
                                                                   java.util.function.Function<T,​java.lang.String> typeToString,
                                                                   java.util.function.Function<java.lang.String,​T> stringToType,
                                                                   java.util.function.Predicate<T> filterPredicate)
                                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static <T> java.util.List<java.io.File> sortInBatch​(java.io.BufferedReader fbr,
                                                                   long actualFileSize,
                                                                   java.util.Comparator<T> cmp,
                                                                   int maxtmpfiles,
                                                                   long maxMemory,
                                                                   java.nio.charset.Charset cs,
                                                                   java.io.File tmpdirectory,
                                                                   boolean distinct,
                                                                   int numHeader,
                                                                   Compression algorithm,
                                                                   java.util.function.Function<T,​java.lang.String> typeToString,
                                                                   java.util.function.Function<java.lang.String,​T> stringToType)
                                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static <T> java.util.List<java.io.File> sortInBatch​(java.io.BufferedReader fbr,
                                                                   long actualFileSize,
                                                                   java.util.Comparator<T> cmp,
                                                                   int maxtmpfiles,
                                                                   long maxMemory,
                                                                   java.nio.charset.Charset cs,
                                                                   java.io.File tmpdirectory,
                                                                   boolean distinct,
                                                                   int numHeader,
                                                                   Compression algorithm,
                                                                   java.util.function.Function<T,​java.lang.String> typeToString,
                                                                   java.util.function.Function<java.lang.String,​T> stringToType,
                                                                   java.util.function.Predicate<T> filterPredicate)
                                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.Comparator<java.lang.String> cmp,
                                                               int maxtmpfiles,
                                                               long maxMemory,
                                                               java.nio.charset.Charset cs,
                                                               java.io.File tmpdirectory,
                                                               boolean distinct,
                                                               java.util.function.Predicate<java.lang.String> filterPredicate)
                                                        throws java.io.IOException
        Loads the file in blocks of lines, sorts each block in memory, and writes each sorted block to a temporary file. The temporary files have to be merged later. maxtmpfiles bounds the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        maxtmpfiles - maximal number of temporary files
        maxMemory - maximum memory available for sorting
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        filterPredicate - predicate that selects the lines to keep and sort
        Returns:
        a list of temporary flat files
        Throws:
        java.io.IOException
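        The block-wise approach described above can be sketched with the JDK alone. This is a minimal illustration, not the library's implementation: BatchSortSketch, sortInBlocks and linesPerBlock are hypothetical names, and the block size is a fixed line count here, whereas the real method derives it from maxtmpfiles and maxMemory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class BatchSortSketch {
    // Hypothetical helper: splits 'input' into sorted runs of at most
    // 'linesPerBlock' lines and returns one temporary file per run.
    static List<File> sortInBlocks(File input, Comparator<String> cmp,
                                   int linesPerBlock, Charset cs) throws IOException {
        List<File> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input.toPath(), cs)) {
            List<String> block = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                block.add(line);
                if (block.size() >= linesPerBlock) {
                    runs.add(writeSortedRun(block, cmp, cs));
                    block.clear();
                }
            }
            // Flush the last, possibly shorter, block.
            if (!block.isEmpty()) {
                runs.add(writeSortedRun(block, cmp, cs));
            }
        }
        return runs;
    }

    // Sorts one block in memory and writes it to a new temporary file.
    private static File writeSortedRun(List<String> block, Comparator<String> cmp,
                                       Charset cs) throws IOException {
        block.sort(cmp);
        File tmp = File.createTempFile("sortInBatch", ".flatfile");
        tmp.deleteOnExit();
        try (BufferedWriter out = Files.newBufferedWriter(tmp.toPath(), cs)) {
            for (String s : block) {
                out.write(s);
                out.newLine();
            }
        }
        return tmp;
    }
}
```

        Each returned file is sorted on its own; a merge step is still needed to produce one globally sorted output.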
      • sortInBatch

        public static java.util.List<java.io.File> sortInBatch​(java.io.File file,
                                                               java.util.Comparator<java.lang.String> cmp,
                                                               int maxtmpfiles,
                                                               long maxMemory,
                                                               java.nio.charset.Charset cs,
                                                               java.io.File tmpdirectory,
                                                               boolean distinct)
                                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • sortAndSave

        public static java.io.File sortAndSave​(java.util.List<java.lang.String> tmplist,
                                               java.util.Comparator<java.lang.String> cmp,
                                               java.nio.charset.Charset cs,
                                               java.io.File tmpdirectory,
                                               java.util.function.Predicate<java.lang.String> filterPredicate)
                                        throws java.io.IOException
        Sort a list and save it to a temporary file
        Parameters:
        tmplist - data to be sorted
        cmp - string comparator
        cs - charset to use for output (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        filterPredicate - predicate that selects the lines to keep and sort
        Returns:
        the file containing the sorted data
        Throws:
        java.io.IOException
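        As a rough sketch of what "sort a list and save it to a temporary file" involves, the following uses only JDK calls. SortAndSaveSketch is a hypothetical class and does not reproduce the library's code. It assumes that a null filter means "keep everything" and relies on File.createTempFile accepting a null directory to select the default temporary location.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

class SortAndSaveSketch {
    // Hypothetical stand-in for the documented method: keeps the lines
    // accepted by 'filter', sorts them, and writes them to a temp file.
    static File sortAndSave(List<String> lines, Comparator<String> cmp, Charset cs,
                            File tmpDirectory, Predicate<String> filter) throws IOException {
        // Copy the accepted lines instead of removing rejected ones in place.
        List<String> kept = new ArrayList<>();
        for (String s : lines) {
            if (filter == null || filter.test(s)) {
                kept.add(s);
            }
        }
        kept.sort(cmp);

        // A null directory makes createTempFile use the default temp location.
        File tmp = File.createTempFile("sortAndSave", ".flatfile", tmpDirectory);
        tmp.deleteOnExit();
        try (BufferedWriter out = Files.newBufferedWriter(tmp.toPath(), cs)) {
            for (String s : kept) {
                out.write(s);
                out.newLine();
            }
        }
        return tmp;
    }
}
```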
      • sortAndSave

        public static java.io.File sortAndSave​(java.util.List<java.lang.String> tmplist,
                                               java.util.Comparator<java.lang.String> cmp,
                                               java.nio.charset.Charset cs,
                                               java.io.File tmpdirectory)
                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • sortAndSave

        public static java.io.File sortAndSave​(java.util.List<java.lang.String> tmplist,
                                               java.util.Comparator<java.lang.String> cmp,
                                               java.nio.charset.Charset cs,
                                               java.io.File tmpdirectory,
                                               boolean distinct,
                                               boolean usegzip,
                                               java.util.function.Predicate<java.lang.String> filterPredicate)
                                        throws java.io.IOException
        Sort a list and save it to a temporary file
        Parameters:
        tmplist - data to be sorted
        cmp - string comparator
        cs - charset to use for output (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        usegzip - pass true to write the temporary file with gzip compression
        filterPredicate - predicate that selects the lines to keep and sort
        Returns:
        the file containing the sorted data
        Throws:
        java.io.IOException
      • sortAndSave

        public static java.io.File sortAndSave​(java.util.List<java.lang.String> tmplist,
                                               java.util.Comparator<java.lang.String> cmp,
                                               java.nio.charset.Charset cs,
                                               java.io.File tmpdirectory,
                                               boolean distinct,
                                               boolean usegzip)
                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • sortAndSave

        public static <T> java.io.File sortAndSave​(java.util.List<T> tmplist,
                                                   java.util.Comparator<T> cmp,
                                                   java.nio.charset.Charset cs,
                                                   java.io.File tmpdirectory,
                                                   boolean distinct,
                                                   boolean usegzip,
                                                   java.util.function.Function<T,​java.lang.String> typeToString,
                                                   java.util.function.Predicate<T> filterPredicate)
                                            throws java.io.IOException
        Sort a list and save it to a temporary file
        Parameters:
        tmplist - data to be sorted
        cmp - comparator for the elements of tmplist
        cs - charset to use for output (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        usegzip - pass true to write the temporary file with gzip compression
        typeToString - function that maps the custom type to a string. Used for writing the sorted content to the file
        filterPredicate - predicate that selects the elements to keep and sort
        Returns:
        the file containing the sorted data
        Throws:
        java.io.IOException
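        The usegzip flag refers to gzip-compressed temporary files. A minimal sketch of writing and reading such a file with the JDK's java.util.zip streams follows. GzipTempFileSketch and its two methods are hypothetical helpers for illustration and are not part of ExternalSort.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipTempFileSketch {
    // Writes one line per element through a gzip-compressing stream.
    static void writeLines(File f, List<String> lines, Charset cs) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(f)), cs))) {
            for (String s : lines) {
                out.write(s);
                out.newLine();
            }
        }
    }

    // Reads the lines back; the reader must use the same compression
    // and charset that were used for writing.
    static List<String> readLines(File f, Charset cs) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(f)), cs))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

        This is why the same compression setting has to be passed to both the sorting step and the merging step: a file written through GZIPOutputStream cannot be read as plain text.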
      • sortAndSave

        public static <T> java.io.File sortAndSave​(java.util.List<T> tmplist,
                                                   java.util.Comparator<T> cmp,
                                                   java.nio.charset.Charset cs,
                                                   java.io.File tmpdirectory,
                                                   boolean distinct,
                                                   boolean usegzip,
                                                   java.util.function.Function<T,​java.lang.String> typeToString)
                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • sortAndSave

        public static <T> java.io.File sortAndSave​(java.util.List<T> tmplist,
                                                   java.util.Comparator<T> cmp,
                                                   java.nio.charset.Charset cs,
                                                   java.io.File tmpdirectory,
                                                   boolean distinct,
                                                   Compression algorithm,
                                                   java.util.function.Function<T,​java.lang.String> typeToString,
                                                   @Nullable java.util.function.Predicate<T> filterPredicate)
                                            throws java.io.IOException
        Sort a list and save it to a temporary file. If you call this method directly and provide a filter predicate, avoid passing an ArrayList as tmplist, because removing an element from an ArrayList is an O(n) operation.
        Parameters:
        tmplist - data to be sorted
        cmp - comparator for the elements of tmplist
        cs - charset to use for output (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        algorithm - compression algorithm to use for the temporary files
        typeToString - function that maps the custom type to a string. Used for writing the sorted content to the file
        filterPredicate - predicate that selects the elements to keep and sort (may be null)
        Returns:
        the file containing the sorted data
        Throws:
        java.io.IOException
      • sortAndSave

        public static <T> java.io.File sortAndSave​(java.util.List<T> tmplist,
                                                   java.util.Comparator<T> cmp,
                                                   java.nio.charset.Charset cs,
                                                   java.io.File tmpdirectory,
                                                   boolean distinct,
                                                   Compression algorithm,
                                                   java.util.function.Function<T,​java.lang.String> typeToString)
                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • mergeSortedFiles

        public static int mergeSortedFiles​(java.util.List<java.io.File> files,
                                           java.io.File outputfile)
                                    throws java.io.IOException
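        Merges a list of sorted temporary flat files into a single output file.
        Parameters:
        files - The List of sorted Files to be merged.
        outputfile - The output File to merge the results to.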
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        java.io.IOException
      • mergeSortedFiles

        public static int mergeSortedFiles​(java.util.List<java.io.File> files,
                                           java.io.File outputfile,
                                           java.util.Comparator<java.lang.String> cmp)
                                    throws java.io.IOException
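        Merges a list of sorted temporary flat files into a single output file.
        Parameters:
        files - The List of sorted Files to be merged.
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.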
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        java.io.IOException
      • mergeSortedFiles

        public static int mergeSortedFiles​(java.util.List<java.io.File> files,
                                           java.io.File outputfile,
                                           java.util.Comparator<java.lang.String> cmp,
                                           boolean distinct)
                                    throws java.io.IOException
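        Merges a list of sorted temporary flat files into a single output file.
        Parameters:
        files - The List of sorted Files to be merged.
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        distinct - Pass true if duplicate lines should be discarded.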
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        java.io.IOException
      • mergeSortedFiles

        public static int mergeSortedFiles​(java.util.List<java.io.File> files,
                                           java.io.File outputfile,
                                           java.util.Comparator<java.lang.String> cmp,
                                           java.nio.charset.Charset cs,
                                           boolean distinct,
                                           boolean append,
                                           boolean usegzip)
                                    throws java.io.IOException
        This merges a bunch of temporary flat files
        Parameters:
        files - The List of sorted Files to be merged.
        distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        cs - The Charset to be used for the byte to character conversion.
        append - Pass true to append the result to the File instead of overwriting it. Overloads without this parameter default to false.
        usegzip - pass true if the temporary files were written with gzip compression
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        java.io.IOException
        Since:
        v0.1.4
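        The append and cs parameters map naturally onto how the output writer is opened. The sketch below shows one way to do that with JDK classes. AppendModeSketch.openOutput is a hypothetical helper, and whether the library opens its writer exactly like this is an assumption.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

class AppendModeSketch {
    // Opens 'outputFile' for writing in the given charset.
    // append == true keeps existing content; false truncates the file first.
    static BufferedWriter openOutput(File outputFile, Charset cs, boolean append)
            throws IOException {
        return new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(outputFile, append), cs));
    }
}
```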
      • mergeSortedFiles

        public static int mergeSortedFiles​(java.util.List<java.io.File> files,
                                           java.io.File outputfile,
                                           java.util.Comparator<java.lang.String> cmp,
                                           java.nio.charset.Charset cs,
                                           boolean distinct,
                                           boolean append,
                                           Compression algorithm)
                                    throws java.io.IOException
        Throws:
        java.io.IOException
      • mergeSortedFiles

        public static <T> int mergeSortedFiles​(java.util.List<java.io.File> files,
                                               java.io.File outputfile,
                                               java.util.Comparator<T> cmp,
                                               java.nio.charset.Charset cs,
                                               boolean distinct,
                                               boolean append,
                                               boolean usegzip,
                                               java.util.function.Function<T,​java.lang.String> typeToString,
                                               java.util.function.Function<java.lang.String,​T> stringToType)
                                        throws java.io.IOException
        This merges a bunch of temporary flat files and deletes them on success or error.
        Parameters:
        files - The List of sorted Files to be merged.
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare elements of the custom type.
        cs - The Charset to be used for the byte to character conversion.
        distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
        append - Pass true to append the result to the File instead of overwriting it. Overloads without this parameter default to false.
        usegzip - pass true if the temporary files were written with gzip compression
        typeToString - function that maps the custom type to a string. Used for storing the sorted content back to the file
        stringToType - function that maps a string to the custom type. Used for converting each line to the custom type for the purpose of sorting
        Throws:
        java.io.IOException
        Since:
        v0.1.4
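        Per the signatures, typeToString is a Function<T, String> and stringToType is a Function<String, T>, so the two should be inverses of each other for every line in the files. The sketch below shows such a pair for a made-up type. Entry, its "path|size" line format, and the BY_SIZE comparator are all hypothetical and only illustrate the shape of the arguments.

```java
import java.util.Comparator;
import java.util.function.Function;

class TypeMappingSketch {
    // Hypothetical custom type: a path with a size, stored as "path|size".
    static final class Entry {
        final String path;
        final long size;

        Entry(String path, long size) {
            this.path = path;
            this.size = size;
        }
    }

    // Custom type -> line written to the temporary and output files.
    static final Function<Entry, String> TYPE_TO_STRING =
            e -> e.path + "|" + e.size;

    // Line read from a file -> custom type used for comparisons.
    static final Function<String, Entry> STRING_TO_TYPE = line -> {
        int sep = line.lastIndexOf('|');
        return new Entry(line.substring(0, sep), Long.parseLong(line.substring(sep + 1)));
    };

    // Comparator on the custom type rather than on the raw line.
    static final Comparator<Entry> BY_SIZE = Comparator.comparingLong(e -> e.size);
}
```

        Parsing each line once into the custom type lets the comparator work on typed fields instead of re-parsing strings on every comparison.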
      • mergeSortedFiles

        public static <T> int mergeSortedFiles​(java.util.List<java.io.File> files,
                                               java.io.File outputfile,
                                               java.util.Comparator<T> cmp,
                                               java.nio.charset.Charset cs,
                                               boolean distinct,
                                               boolean append,
                                               Compression algorithm,
                                               java.util.function.Function<T,​java.lang.String> typeToString,
                                               java.util.function.Function<java.lang.String,​T> stringToType)
                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • mergeSortedFiles

        public static <T> int mergeSortedFiles​(java.util.List<java.io.File> files,
                                               java.io.BufferedWriter fbw,
                                               java.util.Comparator<T> cmp,
                                               java.nio.charset.Charset cs,
                                               boolean distinct,
                                               boolean usegzip,
                                               java.util.function.Function<T,​java.lang.String> typeToString,
                                               java.util.function.Function<java.lang.String,​T> stringToType)
                                        throws java.io.IOException
        This merges a bunch of temporary flat files and deletes them on success or error.
        Parameters:
        files - The List of sorted Files to be merged.
        fbw - Buffered writer used to store the sorted content
        cmp - The Comparator to use to compare elements of the custom type.
        cs - The Charset to be used for the byte to character conversion.
        distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
        usegzip - pass true if the temporary files were written with gzip compression
        typeToString - function that maps the custom type to a string. Used for storing the sorted content back to the file
        stringToType - function that maps a string to the custom type. Used for converting each line to the custom type for the purpose of sorting
        Throws:
        java.io.IOException
        Since:
        v0.1.4
      • mergeSortedFiles

        public static <T> int mergeSortedFiles​(java.util.List<java.io.File> files,
                                               java.io.BufferedWriter fbw,
                                               java.util.Comparator<T> cmp,
                                               java.nio.charset.Charset cs,
                                               boolean distinct,
                                               Compression algorithm,
                                               java.util.function.Function<T,​java.lang.String> typeToString,
                                               java.util.function.Function<java.lang.String,​T> stringToType)
                                        throws java.io.IOException
        This merges a bunch of temporary flat files and deletes them on success or error.
        Parameters:
        files - The List of sorted Files to be merged.
        fbw - Buffered writer used to store the sorted content
        cmp - The Comparator to use to compare elements of the custom type.
        cs - The Charset to be used for the byte to character conversion.
        distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
        algorithm - the compression algorithm that was used for the temporary files
        typeToString - function that maps the custom type to a string. Used for storing the sorted content back to the file
        stringToType - function that maps a string to the custom type. Used for converting each line to the custom type for the purpose of sorting
        Throws:
        java.io.IOException
        Since:
        v0.1.4
      • merge

        public static <T> int merge​(java.io.BufferedWriter fbw,
                                    java.util.Comparator<T> cmp,
                                    boolean distinct,
                                    java.util.List<org.apache.jackrabbit.oak.commons.sort.BinaryFileBuffer<T>> buffers,
                                    java.util.function.Function<T,​java.lang.String> typeToString)
                             throws java.io.IOException
        This merges several BinaryFileBuffer to an output writer.
        Parameters:
        fbw - A buffer where we write the data.
        cmp - A comparator object that tells us how to sort the lines.
        distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
        buffers - Where the data should be read.
        typeToString - function that maps the custom type to a string. Used for writing the merged content to the output writer
        Throws:
        java.io.IOException
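        Merging several sorted inputs into one writer is commonly done as a k-way merge driven by a priority queue. The sketch below illustrates that technique on plain BufferedReader inputs. It is not the library's code: KWayMergeSketch and Source are hypothetical, BinaryFileBuffer is replaced by a small reader wrapper, and the return value here counts the lines actually written.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class KWayMergeSketch {
    // Pairs a reader with the line most recently read from it.
    private static final class Source {
        final BufferedReader reader;
        String head;

        Source(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    // Merges already-sorted inputs into 'out'; returns the number of lines written.
    static int merge(BufferedWriter out, Comparator<String> cmp, boolean distinct,
                     List<BufferedReader> inputs) throws IOException {
        // The queue orders sources by their current head line.
        PriorityQueue<Source> queue = new PriorityQueue<>(
                Math.max(1, inputs.size()), (a, b) -> cmp.compare(a.head, b.head));
        for (BufferedReader r : inputs) {
            Source s = new Source(r);
            if (s.head != null) {
                queue.add(s);
            }
        }
        int written = 0;
        String last = null;
        while (!queue.isEmpty()) {
            Source s = queue.poll();
            String line = s.head;
            // With distinct set, skip a line equal to the one written just before.
            if (!distinct || last == null || cmp.compare(line, last) != 0) {
                out.write(line);
                out.newLine();
                last = line;
                written++;
            }
            // Advance this source and re-queue it while it still has lines.
            s.head = s.reader.readLine();
            if (s.head != null) {
                queue.add(s);
            }
        }
        out.flush();
        return written;
    }
}
```

        Because every input is sorted, equal lines leave the queue consecutively, so comparing against the last written line is enough to drop duplicates.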
      • mergeSortedFiles

        public static int mergeSortedFiles​(java.util.List<java.io.File> files,
                                           java.io.File outputfile,
                                           java.util.Comparator<java.lang.String> cmp,
                                           java.nio.charset.Charset cs,
                                           boolean distinct)
                                    throws java.io.IOException
        This merges a bunch of temporary flat files
        Parameters:
        files - The List of sorted Files to be merged.
        distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        cs - The Charset to be used for the byte to character conversion.
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        java.io.IOException
        Since:
        v0.1.2
      • mergeSortedFiles

        public static int mergeSortedFiles​(java.util.List<java.io.File> files,
                                           java.io.File outputfile,
                                           java.util.Comparator<java.lang.String> cmp,
                                           java.nio.charset.Charset cs)
                                    throws java.io.IOException
        This merges a bunch of temporary flat files
        Parameters:
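        files - The List of sorted Files to be merged.
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        cs - character set to use to load the strings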
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        java.io.IOException
      • displayUsage

        public static void displayUsage()
      • main

        public static void main​(java.lang.String[] args)
                         throws java.io.IOException
        Throws:
        java.io.IOException