Implement Tim sort. #239

Closed
wants to merge 8 commits into from

sozelfist
Contributor

Tim sort is a hybrid sorting algorithm that combines insertion sort and merge sort. It was designed to perform well on many kinds of real-world data. The algorithm divides the input into small runs, sorts each run with insertion sort, and then merges the runs as merge sort does. Tim sort runs in O(n log n) time in the worst case and is used by the standard libraries of languages such as Python and Java.

raklaptudirm
raklaptudirm previously approved these changes Apr 16, 2024
@sozelfist
Contributor Author

sozelfist commented Apr 23, 2024

@appgurueu, can you review this PR? It's been sitting here for 3 weeks.

@sozelfist sozelfist requested a review from raklaptudirm April 28, 2024 03:02
@sozelfist
Contributor Author

sozelfist commented Jun 9, 2024

Hey, @appgurueu. Could you please have a look at this PR? It's been here for a month. I think this PR is good enough to be accepted and merged into the master branch.

Contributor

@appgurueu appgurueu left a comment


Sorry for the late review, I needed to take the time to review this properly. I'm afraid there are some issues with this.

To summarize: The tests are currently insufficient and not deduplicated. The implementation is overly simple and loses defining characteristics of Timsort. (In other words, this is not quite a Timsort yet, although it is inspired by the same core idea.)

- Refactor the merge function to handle merge space overhead and add galloping mode
- Make the comparator optional, defaulting to an ascending comparator
- Manage a stack of sorted runs with special size invariants in order to do balanced merges
- Add tests (including edge cases as proposed)
@sozelfist
Contributor Author

Can you review the updates @appgurueu, @raklaptudirm?

- Calculate the minimum run length for Tim sort based on the length of the array
- Add some edge-case tests
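
For readers following along, the minimum run length computation mentioned above might look roughly like the sketch below. It is not taken from the PR: the helper name `minRunLength` and the threshold `MIN_MERGE = 32` are assumptions that follow the convention of Java-style Timsort ports.

```typescript
// Assumption: 32 mirrors the MIN_MERGE value used in Java-style Timsort ports.
const MIN_MERGE = 32

// Hypothetical helper, not copied from the PR.
// Returns n itself when n < MIN_MERGE. Otherwise it keeps the top five bits
// of n and adds 1 if any of the lower bits that were shifted out was set.
const minRunLength = (n: number): number => {
  let remainder = 0
  while (n >= MIN_MERGE) {
    remainder |= n & 1
    n >>= 1
  }
  return n + remainder
}
```

The intent of this choice is that the array length divided by the minimum run length ends up equal to, or slightly below, a power of two, which tends to keep the later merges balanced.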
@sozelfist sozelfist requested a review from appgurueu June 17, 2024 13:19
@sozelfist
Contributor Author

I will break down the implementation in detail as follows:

  1. Constants:

    • MIN_MERGE: Minimum size of a run to be merged, typically 32.
    • MIN_GALLOP: A threshold for switching to galloping mode during merge, typically 7.
  2. Comparator Type:

    • Comparator<T>: Defines a comparator function type to compare elements of type T.
  3. Merge Function:

    • The merge function merges two sorted subarrays using optimized galloping mode.
    • Galloping mode quickly moves through elements when many consecutive elements from one subarray are smaller than those from the other subarray.
      const merge = <T>(
        arr: T[],
        leftIndex: number,
        middleIndex: number,
        rightIndex: number,
        compare: Comparator<T>
      ): void => {
        // ...
        // Implement the main merging logic with galloping mode
      }
  4. Main timSort Function:

    • Step 1: Calculate the minimum run length.

    • Step 2: Identify runs (ascending or descending) and sort them using insertion sort.

    • Step 3: Push identified runs onto a stack and ensure they maintain the size invariant by merging appropriately.

    • Step 4: Merge all remaining runs to produce the final sorted array.

      export const timSort = <T>(arr: T[], compare: Comparator<T>): void => {
        const length = arr.length
      
        // Function to identify runs and sort them
        const findRunsAndSort = (start: number, end: number): void => {
          // ...
        }
      
        // Function to calculate minimum run length
        const minRunLength = (n: number): number => {
          // ...
        }
      
        // Function to push a new run onto the stack
        const pushRun = (
          runs: { start: number; length: number }[],
          start: number,
          length: number
        ) => {
          // ...
        }
      
        // Function to merge two adjacent runs
        const mergeAt = (runs: { start: number; length: number }[], i: number) => {
          // ...
        }
      
        // Function to force collapse all remaining runs
        const mergeForceCollapse = (runs: { start: number; length: number }[]) => {
          // ...
        }
      
        // Function to ensure runs maintain the size invariant
        const mergeCollapse = (runs: { start: number; length: number }[]) => {
          // ...
        }
      
        // Determine the minimum run length
        const minRun = minRunLength(length)
        let runs: { start: number; length: number }[] = []
      
        // Find runs and sort them
        let start = 0
        while (start < length) {
          // Identify runs (ascending or descending) and sort them
          let end = start + 1
          if (end < length && compare(arr[end - 1], arr[end]) <= 0) {
            // Ascending run
            while (end < length && compare(arr[end - 1], arr[end]) <= 0) {
              end++
            }
          } else {
            // Descending run
            while (end < length && compare(arr[end - 1], arr[end]) > 0) {
              end++
            }
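            // Reverse the descending run in place to make it ascending
            // (arr.slice(...).reverse() would only reverse a copy)
            for (let lo = start, hi = end - 1; lo < hi; lo++, hi--) {
              const tmp = arr[lo]
              arr[lo] = arr[hi]
              arr[hi] = tmp
            }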
          }
      
          findRunsAndSort(start, end - 1)
          pushRun(runs, start, end - start)
      
          mergeCollapse(runs)
      
          start = end
        }
      
        // Merge all remaining runs
        mergeForceCollapse(runs)
      }
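
Step 3 above leaves the bodies of `mergeAt` and `mergeCollapse` elided. As a rough illustration only, the sketch below shows one way the size invariant could be enforced, following the logic commonly seen in Java-style Timsort ports. It tracks run bookkeeping only: the `Run` type and this `mergeAt` are hypothetical stand-ins, and a real implementation would also merge the array elements.

```typescript
// Hypothetical run record; the PR uses the same shape inline.
type Run = { start: number; length: number }

// Bookkeeping only: replace runs[i] and runs[i + 1] with one combined run.
// A real mergeAt would also merge the corresponding array elements.
const mergeAt = (runs: Run[], i: number): void => {
  const left = runs[i]
  const right = runs[i + 1]
  runs.splice(i, 2, { start: left.start, length: left.length + right.length })
}

// Restores the invariants (index 0 is the bottom of the stack):
//   runs[i - 2].length > runs[i - 1].length + runs[i].length
//   runs[i - 1].length > runs[i].length
const mergeCollapse = (runs: Run[]): void => {
  while (runs.length > 1) {
    let n = runs.length - 2
    if (
      (n > 0 && runs[n - 1].length <= runs[n].length + runs[n + 1].length) ||
      (n > 1 && runs[n - 2].length <= runs[n - 1].length + runs[n].length)
    ) {
      // Merge the middle run with the smaller of its two neighbours.
      if (runs[n - 1].length < runs[n + 1].length) n--
    } else if (runs[n].length > runs[n + 1].length) {
      break // Both invariants hold.
    }
    mergeAt(runs, n)
  }
}
```

Because only adjacent runs are ever merged, equal elements never jump over each other, which is what keeps the overall sort stable.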

Key Techniques

  1. Hybrid Sorting:

    • Combines insertion sort for small subarrays (runs) and merge sort for larger arrays.
  2. Galloping Mode:

    • Optimizes the merge process by quickly traversing elements when many consecutive elements from one subarray are smaller than those from the other.
  3. Run Identification:

    • Identifies naturally occurring runs (ascending or descending sequences) in the data, which are then sorted and merged.
  4. Stack of Runs:

    • Maintains a stack of runs to be merged, ensuring that the size invariant is maintained for efficient merging.
  5. Insertion Sort:

    • Efficiently sorts small runs within the array.
  6. Adaptive:

    • Adapts to the existing order in the data, making it particularly effective for real-world data which often has partially ordered sequences.
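
To make the galloping idea from the list above more concrete, here is a simplified sketch. It is an assumption-laden illustration rather than the PR's code: the name `gallopRight` is borrowed from common Timsort ports, but this version always starts at the beginning of the range and omits the hint index that full implementations use.

```typescript
type Comparator<T> = (a: T, b: T) => number

// Simplified, hypothetical gallop: counts how many elements of the sorted
// range sorted[from..to) are <= key, so they can be copied in one block.
const gallopRight = <T>(
  key: T,
  sorted: T[],
  from: number,
  to: number,
  compare: Comparator<T>
): number => {
  const n = to - from
  let lo = 0 // every element at an offset below lo is known to be <= key
  let step = 1
  // Exponential phase: probe offsets 0, 2, 6, 14, ... while they are <= key.
  while (lo + step <= n && compare(sorted[from + lo + step - 1], key) <= 0) {
    lo += step
    step *= 2
  }
  // The answer now lies in [lo, hi].
  let hi = Math.min(n, lo + step - 1)
  // Binary phase: narrow the remaining window.
  while (lo < hi) {
    const mid = lo + ((hi - lo) >> 1)
    if (compare(sorted[from + mid], key) <= 0) {
      lo = mid + 1
    } else {
      hi = mid
    }
  }
  return lo
}
```

When one run keeps winning during a merge, a search like this finds the whole block of winners in O(log k) comparisons instead of k, which is where the savings on partially ordered data come from.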

The optimized TimSort provides a highly efficient and robust sorting algorithm for various applications.

@sozelfist
Contributor Author

Can you have a look @appgurueu?

Contributor

@appgurueu appgurueu left a comment


The tests look fine, though please also add a test for sorting stability. (Note that this is not the same as testing duplicates, since you can't detect whether indistinguishable duplicates were swapped. Instead, sort objects consisting of a "key" and a "value" by the "key" using a custom comparator, so that you can then check whether the order of "values" was preserved among objects with equal keys. For example, given [{key: 42, value: 1}, {key: 42, value: 2}], the sorting algorithm should leave the array untouched.)
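
A stability check along these lines could be sketched as follows. The names `Entry`, `byKey`, `isStablySorted` and `sortUnderTest` are hypothetical, and `sortUnderTest` is only a placeholder that delegates to the built-in `Array.prototype.sort` (stable since ES2019); in the PR it would be replaced by the Tim sort implementation.

```typescript
type Entry = { key: number; value: number }
type Comparator<T> = (a: T, b: T) => number

// Placeholder for the sort under test (assumption: in-place, comparator-based).
const sortUnderTest = <T>(arr: T[], compare: Comparator<T>): void => {
  arr.sort(compare)
}

// Compares by key only, so entries with equal keys are indistinguishable
// to the sort and can only keep their order if the sort is stable.
const byKey: Comparator<Entry> = (a, b) => a.key - b.key

// True if keys ascend and, among equal keys, values (original positions) ascend.
const isStablySorted = (entries: Entry[]): boolean =>
  entries.every((entry, i) => {
    if (i === 0) return true
    const prev = entries[i - 1]
    return (
      prev.key < entry.key ||
      (prev.key === entry.key && prev.value < entry.value)
    )
  })

// Each value records the entry's original position.
const keys = [42, 7, 42, 7, 13, 42]
const entries: Entry[] = keys.map((key, index) => ({ key, value: index }))
sortUnderTest(entries, byKey)
const stable = isStablySorted(entries)
```

Storing the original index as the value means any swap of two equal-key entries shows up as a descending pair of values.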

It would also be good to run a couple of iterations of the randomized test.
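
One possible shape for such a repeated randomized test is sketched below. Everything here is illustrative: `SortFn`, `passesRandomizedChecks` and the default parameters are hypothetical, and the built-in sort serves as the reference result. Note that `Math.random` is not seeded, so a failing input should be logged if reproducibility matters.

```typescript
type Comparator<T> = (a: T, b: T) => number
// Assumed signature of the sort under test: in-place, comparator-based.
type SortFn = (arr: number[], compare: Comparator<number>) => void

const ascending: Comparator<number> = (a, b) => a - b

// Runs the sort on several random arrays and compares each result with the
// built-in sort. Returns false on the first mismatch.
const passesRandomizedChecks = (
  sort: SortFn,
  iterations = 100,
  maxLength = 200
): boolean => {
  for (let i = 0; i < iterations; i++) {
    const length = Math.floor(Math.random() * (maxLength + 1))
    const input = Array.from({ length }, () => Math.floor(Math.random() * 1000))
    const expected = [...input].sort(ascending)
    const actual = [...input]
    sort(actual, ascending)
    const same =
      actual.length === expected.length &&
      actual.every((value, j) => value === expected[j])
    if (!same) return false
  }
  return true
}
```

Varying the length per iteration also covers empty and single-element arrays from time to time, on top of the dedicated edge-case tests.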


As an aside, please don't dump what looks like spammy LLM output on me.

@sozelfist
Contributor Author

Thanks for your helpful reviews and suggestions. I will close the PR because I can't go forward on this PR anymore. The algorithm is very complicated in detail and the way you need it is so technically restricted to an open-source project for education.

@sozelfist sozelfist closed this Jun 18, 2024
@sozelfist sozelfist deleted the sort/tim-sort branch June 18, 2024 15:13
@appgurueu
Contributor

> Thanks for your helpful reviews and suggestions.

You're welcome. Thank you for your PR.

> The algorithm is very complicated in detail

Indeed it is. It is probably a bit of an unfortunate choice if you're looking for a particularly "elegant" algorithm to implement.

> the way you need it is so technically restricted to an open-source project for education

It is crucial for education that Timsort (like other algorithms) is represented properly; Timsort specifically thrives on all these little observations, which enable it to save some real-world time on real-world data.

These aren't arbitrary technical restrictions; they are defining characteristics of Timsort. If you call something Timsort, it should be Timsort.

If we were to misrepresent an oversimplified version of Timsort as Timsort, the issue would only get worse down the line: each time someone ports this implementation and decides to cut some corners themselves, the algorithm would stray further from the original.

@sozelfist
Contributor Author

From a reviewer's perspective, I appreciate the steps you are taking to enhance the quality of the PRs. It helps the learner or anyone using these resources for their learning (whether studying or working) right away.


3 participants