Learn data science and machine learning by building real-world projects on Jovian

## Median of Two Sorted Arrays

In :
``project_name = "median_sorted_arrays" # give it an appropriate name``
In :
``!pip install jovian --upgrade --quiet``
```WARNING: You are using pip version 20.3.3; however, version 21.0.1 is available. You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command. ```
In :
``import jovian``

### Problem Statement

The basic idea of this problem to find a median of two arrays. A median is an element that happens to lay in the middle of a sorted list of elements.

It's easy to find a median when the number of elements in said list is odd, since there is only one element that lays in the middle (at position n/2). If the number of elements is even, then there's more than one middle element. In such case we consider median to be an average of elements at positions n/2-1 and n/2.

### Solution

#### 1. State the problem clearly. Identify the input & output formats.

Problem

The problem is to find a median of two lists (sorted). There are 2 inputs: two lists of sizes `m` and `n`. The output is a single value, representing the median value of the merged lists.

Input

1. `nums1` - a sorted list of numbers
2. `nums2` - a sorted list of numbers

Output

1. median of list being a result of merging `nums1` and `nums2`.

#### 2. Come up with some example inputs & outputs. Try to cover all edge cases.

Our function should be able to handle any set of valid inputs we pass into it. Here's a list of some possible variations we might encounter:

1. the lists contain the same number of elements,
2. the lists contain different number of elements,
3. one of the lists is empty,
4. both lists are empty (should return `None` since there's no middle element).

We'll express our test cases as dictionaries, to test them easily. Each dictionary will contain 2 keys: `input` (a dictionary itself containing one key for each argument to the function and `output` (the expected result from the function).

Create one test case for each of the scenarios listed above. We'll store our test cases in an array called `tests`.

In :
``tests = []``
In :
``````tests.append({
'input': {
"nums1": [1, 2, 3],
"nums2": [4, 5, 6]
},
'output': 3.5
})
tests.append({
'input': {
"nums1": [3, 6, 8, 9, 10, 11],
"nums2": [3, 3, 5]
},
'output': 6
})
tests.append({
'input': {
"nums1": [5, 9, 14, 16],
"nums2": []
},
'output': 11.5
})
tests.append({
'input': {
"nums1": [],
"nums2": [3, 6, 7, 11, 12]
},
'output': 7
})
tests.append({
'input': {
"nums1": [],
"nums2": []
},
'output': None
})
``````

#### 3. Come up with a correct solution for the problem. State it in plain English.

Merging the two lists is not necessary, as we can actually use the method used in merge sort, to merge the list until we approach a middle element(s).

1. Initialize two variables: `i` and `j`, setting their values to 0.
2. Create a new empty list called `merged`.
3. Calculate the lengths of `nums1` and `nums2` lists. Store the lenghts as `m` and `n` respectively.
4. If `m` and `n` is equal to 0, then return `None`.
5. If `m` or `n` is equal to 0, then the median lies only in one of the lists.
1. Return the median of non-empty list (using calculated position).
6. Calculate the `stop_position` as (`m`+`n`) / 2 + 1
7. Calculate the `is_odd` flag as the result of (`m` + `n`) % 2.
8. Compare elements at `nums1[i]` and `nums2[j]`.
1. If `i` is greater or equal to `m` then add `nums2[j]` and increment `j`.
2. Else if `j` is greater or equal to `n` then add `nums1[i]` to `merged` and increment `i`,
3. Else if `nums1[i]` is smaller than `nums2[j]` then add `nums1[i]` to `merged` and increment `i`.
4. Otherwise add `nums2[j]` to `merged` and increment `j`.
9. Repeat step 7 until the `i`+`j` is equal to (`m`+`n`)/2.
10. The median is the last element (or average of two last elements if `m` + `n` is even) of the `merged` list.

I'm also defining a helper function `list_median` that computes a median of a single list. This function is used mainly if one of the lists is empty.
It's pretty straightforward to compute middle element (or elements) given a single list. If we know the length, then we can just calculate the position(s) of middle element(s).
To consider the case when both lists are empty, I added an additional condition inside this function to take care of that.

#### 4. Implement the solution and test it using example inputs. Fix bugs, if any.

In :
``````def list_median(nums):
n = len(nums)
if n == 0:
return None
is_odd = n % 2
if is_odd:
return nums[n//2]
else:
return (nums[n//2] + nums[n//2 - 1])/2
def median_basic(nums1: list, nums2: list) -> float:
i, j = 0, 0
merged = []
m, n = len(nums1), len(nums2)
if m == 0: # the first list is empty, so the median is in the second one
return list_median(nums2)
if n == 0:
return list_median(nums1)

is_odd = (m + n) % 2
stop_position = (m + n)//2 + 1

while (i+j) != stop_position:
if i >= m:
merged.append(nums2[j])
j += 1
elif j >= n:
merged.append(nums1[i])
i += 1
elif nums1[i] < nums2[j]:
merged.append(nums1[i])
i += 1
else:
merged.append(nums2[j])
j += 1
if is_odd:
return merged[-1]
else:
return (merged[-1] + merged[-2])/2``````

Evaluate your function against all the test cases together using the `evaluate_test_cases` (plural) function from `jovian`.

In :
``from jovian.pythondsa import evaluate_test_cases``
In :
``evaluate_test_cases(median_basic, tests, error_only=True);``
``` SUMMARY TOTAL: 5, PASSED: 5, FAILED: 0 ```

#### 5. Analyze the algorithm's complexity and identify inefficiencies, if any.

The main problem with the above approach is that we need to merge the lists up to the moment when we encounter the middle element(s). The complexity of such approach is O((`m`+`n`)/2), so, for large enough `m` and `n` it becomes just O(n).

We can use a binary search approach here to calculate the median without merging the actual lists.

#### 6. Apply the right technique to overcome the inefficiency. Repeat steps 3 to 6.

We don't have to do the actual merging to find the median.

We start with two lists: `nums1` and `nums2`, and calculate their lengths: `m` and `n` respectively.

We know that the merged list would have `m` + `n` elements in total.

The order of elements in merged list is unknown at first. But we know that the median divides such list in half: to the left, all elements are smaller or equal to the median, and to the right all elements are bigger or equal to the median.

If we would merge the lists, some elements from `nums1` and `nums2` would go first because they're smaller or equal to median, followed by elements that are larger or equal to median. So we can divide the lists into sublists that represent the elements that are either to the left of the median or to the right. Let's call those sublists `nums1_left`, `nums2_left`, `nums1_right` and `nums2_right`.
Since the median splits the merged list in half, we know that the summed up lengths of left parts differ by 0 or 1 (depending on the fact that `m` + `n` can be odd or even) from the summed up lengths of right parts.

We'll use binary search to look for the positions that split the `nums1` and `nums2` lists into their left and right parts, so that every element in the left parts is smaller or equal to every element in right parts. We look mainly for two indices: `pos_nums1` and `pos_nums2` (position where we split the list). Summing them up gives us a position that would be in the middle of a merged list `(m + n) // 2`.

Because we want to keep the lengths of both parts so that they resemble a division made by a median, we look only for `pos_nums1` position, while calculating the `pos_nums2` from the knowledge about the length of resulting array. So `pos_nums2 = (m + n + 1) // 2 - pos_nums1`. We add 1 because we want to make sure that the left side is always the one that has an extra element.

Knowing `pos_nums1` and `pos_nums2` we know elements that would be to the left and right of the median. We define them like this:

• `eL1 = nums1[pos_nums1 - 1]` if `pos_nums1` > 0 else we just use a global minimum (because there's no element to the left of 0)
• `eR1 = nums1[pos_nums1]` if `pos_nums1` < `m` else we just use a global maximum (because there's no element to the right of `m`
• `eL2 = nums2[pos_nums2 - 1]` if `pos_nums2` > 0 else we just use a global minimum (because there's no element to the left of 0)
• `eR2 = nums2[pos_nums2]`if `pos_nums2` < `n` else we just use a global maximum (because there's no element to the right of `n`

To keep the "left" parts smaller or equal to median, and "right" parts bigger or equal to median, we have to make sure that:

• `eL1 <= eR1` - this condition is always fulfilled as we split the initially sorted list
• `eL2 <= eR2` - same as above
• `eL1 <= eR2` - this condition makes sures that both left parts (resulting from splitting `nums1` and `nums2`) are to the left of the median
• `eL2 <= eR1` - same as above

First two conditions are always true because the lists are sorted. The last two define moment where we stop binary searching the positions, because we've already know partitioning of both list, such that all elements on the left are smaller or equal to all elements on the right.

To find the median after finding the positions, we use defined earlier elements `eL1`, `eL2`, `eR1` and `eR2`. If `m` + `n` is odd, then the left part is one element longer. We take median as max of `eL1` and `eL2`.

The case differs if it's even, because there are two middle elements. But it's easy to find them. We again look for maximum value of `eL1` and `eL2` - this is because the larger of those elements would be closer to the median after actual merging. We also calculate minimum value of `eR1` and `eR2` - again, minimum of those values would be closer to the median. After finding them, we just calculate their average: `(max(eL1, eL2) + min(eR1, eR2))/2`.

This approach has `O(log(min(m, n))` complexity, because we binary search for the position of the shorter of lists only.

#### 7. Come up with a correct solution for the problem. State it in plain English.

Come with the optimized correct solution and explain it in simple words below:

1. Calculate `m` and `n` as lengths of `nums1` and `nums2`.
2. If `m` > `n` then the first list is longer than the second one. We want to avoid that, so we run the whole function again, swapping the lists.
3. If `m` is less or equal to 2, then use basic version.
4. Calculate `is_odd` as `(m + n) % 2`. This will be used later to decide how to calculate the median.
5. Find `global_min` and `global_max` of both lists.
6. Set `start` and `end` variables to 0 and `m` respectively.
7. While `start` is less or equal to `end` then:
1. Calculate `pos_nums1` (as `start + (end - start) // 2)`) and `pos_nums2` (as `(m + n + 1)//2 - pos_nums1`).
2. Calculate `eL1` and `eL2` (either global minimum or `nums1[pos_nums1 - 1]` and `nums2[pos_nums2 - 1]` respectively).
3. alculate `eR1` and `eR2` (either global maximum or `nums1[pos_nums1]` and `nums2[pos_nums2]` respectively).
4. If `eL1` is smaller or equal to `eR1` and `eL2` is smaller or equal to `eR2`, then we found the partition and we can calculate the median.
1. If `is_odd` then calculate the median as `max(eL1, eL2)`.
2. Otherwise calculate the median as `(max(eL1, eL2) + min(eR1, eR2)) / 2`.
5. Else if `eL1` is bigger than `eR2` then set `end` to `pos_nums1 - 1`.
6. Else set `start` to `pos_nums1` + 1.

Since the basic version already works, let's create the extended test set.

In :
``````extended_tests = list(tests)
import random

for _ in range(int(1e5)):
m, n = random.randint(0, 1000), random.randint(0, 1000)
nums1 = sorted([random.randint(-10000, 10000) for _ in range(m)])
nums2 = sorted([random.randint(-10000, 10000) for _ in range(n)])

extended_tests.append({
'input': {
"nums1": nums1,
"nums2": nums2
},
'output': median_basic(nums1, nums2)
})
``````

#### 8. Implement the solution and test it using example inputs. Fix bugs, if any.

In :
``````def median_improved(nums1: list, nums2: list) -> float:
m, n = len(nums1), len(nums2)
if m > n:
return median_improved(nums2, nums1)
if m <= 2:
return median_basic(nums1, nums2)

is_odd = (m + n) % 2

global_min = min(nums1 + nums2)
global_max = max(nums1 + nums2)

start = 0
end = m
while start <= end:
pos_nums1 = start + (end - start) // 2
pos_nums2 = (m + n + 1)//2 - pos_nums1

eL1 = global_min if pos_nums1 == 0 else nums1[pos_nums1 - 1]
eR1 = global_max if pos_nums1 == m else nums1[pos_nums1]

eL2 = global_min if pos_nums2 == 0 else nums2[pos_nums2 - 1]
eR2 = global_max if pos_nums2 == n else nums2[pos_nums2]

if (eL1 <= eR2) and (eL2 <= eR1):
if is_odd:
return max(eL1, eL2)
else:
return (max(eL1, eL2) + min(eR1, eR2)) / 2
elif eL1 > eR2:
end = pos_nums1 - 1
else:
start = pos_nums1 + 1``````
In :
``evaluate_test_cases(median_improved, extended_tests, error_only=True);``
``` SUMMARY TOTAL: 100005, PASSED: 100005, FAILED: 0 ```
In [ ]:
``jovian.commit(project=project_name, privacy="secret", filename="python-problem-solving-template")``
```[jovian] Attempting to save notebook.. ```