Find Median From Data Stream

Problem Statement

The median is the middle value in a sorted list of integers. For lists of even length, there is no middle value, so the median is the mean of the two middle values.

For example:

  • For arr = [1,2,3], the median is 2.
  • For arr = [1,2], the median is (1 + 2) / 2 = 1.5

Implement the MedianFinder class:

  • MedianFinder() initializes the MedianFinder object.
  • void addNum(int num) adds the integer num from the data stream to the data structure.
  • double findMedian() returns the median of all elements so far.

Example 1:

# Input:
median_finder = MedianFinder()
median_finder.add_num(1) # arr = [1]
median_finder.find_median() # return 1.0
median_finder.add_num(2) # arr = [1, 2]
median_finder.find_median() # return 1.5
median_finder.add_num(3) # arr[1, 2, 3]
median_finder.find_median() # return 2.0

Constraints:

  • -100,000 <= num <= 100,000
  • findMedian will only be called after adding at least one integer to the data structure.

You should aim for a solution with O(log n) time for add_num(), O(1) time for find_median(), and O(n) space, where n is the current number of elements.

You should aim for a solution with O(log n) time for add_num(), O(1) time for find_median(), and O(n) space, where n is the current number of elements.

Hint 1

A naive solution would be to store the data stream in an array and sort it each time to find the median, resulting in O(n log n) time for each find_median() call. Can you think of a better way? Perhaps using a data structure that allows efficient insertion and retrieval of the median can make the solution more efficient.

Hint 2

If we divide the array into two parts, we can find the median in O(1) if the left half can efficiently return the maximum and the right half can efficiently return the minimum. These values determine the median. However, the process changes slightly if the total number of elements is odd — in that case, the median is the element from the half with the larger size. Can you think of a data structure which is suitable to implement this?

Hint 3

We can use a Heap (Max-Heap for the left half and Min-Heap for the right half). Instead of dividing the array, we store the elements in these heaps as they arrive in the data stream. But how can you maintain equal halves of elements in these two heaps? How do you implement this?

Hint 4

We initialize a Max-Heap and a Min-Heap. When adding an element, if the element is greater than the minimum element of the Min-Heap, we push it into the Min-Heap; otherwise, we push it into the Max-Heap. If the size difference between the two heaps becomes greater than one, we rebalance them by popping an element from the larger heap and pushing it into the smaller heap. This process ensures that the elements are evenly distributed between the two heaps, allowing us to retrieve the middle element or elements in O(1) time.

Solution

Sorting

class MedianFinder:
def __init__(self):
self.data = []
def add_num(self, num: int) -> None:
self.data.append(num)
def find_median(self) -> float:
self.data.sort()
n = len(self.data)
return (self.data[n // 2] if (n % 2) else
(self.data[n // 2] + self.data[n // 2 - 1]) / 2)

Time complexity: O(1) for add_num(), O(n log n) for find_median(). Space complexity: O(n)

Heap

import heapq
class MedianFinder:
def __init__(self):
# two heaps, large, small, minheap, maxheap
# heaps should be equal size
self.small = [] # max heap
self.large = [] # min heap
def add_num(self, num: int) -> None:
if self.large and num > self.large[0]:
heapq.heappush(self.large, num)
else:
heapq.heappush(self.small, -num)
if len(self.small) > len(self.large) + 1:
val = -heapq.heappop(self.small)
heapq.heappush(self.large, val)
if len(self.large) > len(self.small) + 1:
val = heapq.heappop(self.large)
heapq.heappush(self.small, -val)
def find_median(self) -> float:
if len(self.small) > len(self.large):
return -self.small[0]
elif len(self.large) > len(self.small):
return self.large[0]
return (-self.small[0] + self.large[0]) / 2.0

Time complexity: O(log n) for add_num(), O(1) for find_median(). Space complexity: O(n)