algorithm MIT 03
python code - binary search
- insertion sort
- binary insertion sort
- vanilla insertion sort
1. binary search
Scanning an array: linear time to find a specific key k.
If A[0:n] is sorted into B[0:n], you can find a specific item k in logarithmic time using binary search:
compare k to B[n/2], decide which half to keep, divide and conquer.
Doing this takes logarithmic time:
linear search turns into logarithmic search.
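The divide-and-conquer step above can be sketched in Python (the function name is mine, not from the lecture):

```python
def binary_search(B, k):
    """Return an index of k in the sorted array B, or -1 if k is absent."""
    lo, hi = 0, len(B) - 1
    while lo <= hi:
        mid = (lo + hi) // 2       # compare k to the middle element
        if B[mid] == k:
            return mid
        elif B[mid] < k:
            lo = mid + 1           # keep the right half
        else:
            hi = mid - 1           # keep the left half
    return -1
```

Each iteration halves the remaining range, hence Θ(log n) compares.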
Not-so-obvious application of sorting: data compression.
People use sorting as a subroutine in data compression, ex) document distance.
If you compress a file viewed as a set of items, you can sort the items, which automatically finds duplicates; if you have 100 items that are all identical,
you can compress the file by representing the item once and then having a number associated with the frequency of that item, similar to what document distance does.
1. Exercise: try this. Scan a string, compute the frequency of words that occur multiple times, and collapse each of them to a single entry with its count (write the code for this part).
A piece of literature becomes a bunch of words and frequencies. --> It is something that compresses the input and gives you a different representation.
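A minimal sketch of the exercise (the function name is mine): collapse repeated words into (word, frequency) pairs.

```python
def compress(text):
    """Represent a document as sorted (word, frequency) pairs.

    Duplicates are stored once with a count, like the document-distance idea."""
    freq = {}
    for word in text.split():
        freq[word] = freq.get(word, 0) + 1
    # sorting groups identical items together and gives a canonical representation
    return sorted(freq.items())
```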
The reason there are many different sorting algorithms: the right choice depends on the type of input and the problem.
(Insertion sort)
for i = 1, 2, ..., n-1:
    insert A[i] into the sorted array A[0:i-1] at the right position; this expands the sorted prefix to i+1 elements,
    by pairwise swaps down to the correct position for the number that is initially in A[i].
5 2 4 6 1 3
Key: start with index 1, the second element, because the very first element is a single element and is already sorted by definition.
1. Compare key A[1]=2 against 5; 2 is smaller, so swap: 1 step
2 5 4 6 1 3
2. Then compare key A[2]=4 against the numbers before it and swap: 1 step
2 4 5 6 1 3
3. Compare key A[3]=6 against the number before it; no swap, just pass: 0 steps
2 4 5 6 1 3
4. Key A[4]=1 needs swaps, and it's a lot of work: 4 steps
Four swaps are needed: shifting these four elements is computed in our model as four operations, or four steps. In this structure you must do pairwise swaps.
1 2 4 5 6 3
5. Key A[5]=3 needs three swaps: 3 steps
1 2 3 4 5 6
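The trace above corresponds to this sketch of insertion sort with pairwise swaps:

```python
def insertion_sort(A):
    """Sort A in place: each key moves down by pairwise swaps."""
    for i in range(1, len(A)):          # A[0:i] is already sorted
        j = i
        while j > 0 and A[j - 1] > A[j]:
            A[j - 1], A[j] = A[j], A[j - 1]   # one pairwise swap = one step
            j -= 1
    return A
```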
If I think of a step as being a movement of the key:
Θ(n) steps in terms of key positions. If the list were reverse sorted, each key would have to travel all the way down, about n/2 swaps on average.
Each step (= compare & swap) is Θ(n) swaps in the worst case:
if the key belongs at the very end, you need n-1 swaps, which is Θ(n).
With Θ(n) steps in total, this is an n² algorithm:
Θ(n²) compares and Θ(n²) swaps = a Θ(n²) algorithm.
I have Θ(n) steps and each step is Θ(n).
The unspoken assumption has been that an operation is a compare or a swap of numbers, and that they're essentially equal in cost; the compare can even be the more expensive one.
But if you were comparing records, and the comparison function used for the records was itself a method call or a subroutine, it's quite possible that all the swap does is exchange pointers or references, while the comparison is substantially more expensive.
If executing a swap is just an index swap or a pointer swap, the compare step becomes the more costly operation in terms of counting operations.
Compares >> swaps: the Θ(n²) compare cost dominates the constant factors involved in the Θ(n²) swap cost.
Pairwise swap ---> binary search:
do a binary search on A[0:i-1], which is already sorted, in Θ(log i) time,
for each of those n steps; then you get Θ(n log n) in terms of compares,
since you can do binary search on that part of the array.
Does this help the swaps for an array data structure?
- No: after the binary search you still have to insert into A[0:i-1] <--- here is the problem.
Why don't we have a full-fledged Θ(n log n) algorithm regardless of the cost of compares or swaps?
Because we need to insert A[i] into the right position in A[0:i-1].
When you do that with an array structure, the position might be in the middle, and you have to shift things over to the right; in the worst case you shift a lot of things, and that gets back to a worst-case complexity of Θ(n) per insertion (n-1 shifts, but still Θ(n)).
So binary search in insertion sort gives you Θ(n log n) for compares,
but it's still Θ(n²) for swaps.
What's a simple change to this algorithm that gives better complexity in the case where compares are more expensive, or when I'm only looking at the complexity of compares?
You can take insertion sort and trivially turn it into a Θ(n log n) algorithm, if we're talking about n log n being the number of compares.
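A sketch of that binary insertion sort, using the stdlib bisect module: Θ(n log n) compares, but still Θ(n²) element moves.

```python
import bisect

def binary_insertion_sort(A):
    """Insertion sort, but find each insertion point by binary search."""
    for i in range(1, len(A)):
        key = A[i]
        # Theta(log i) compares to locate the insertion point in sorted A[0:i]
        pos = bisect.bisect_left(A, key, 0, i)
        # ...but the shift is still Theta(i) moves in the worst case
        A[pos + 1:i + 1] = A[pos:i]
        A[pos] = key
    return A
```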
(Merge sort) Not just a constant-factor improvement; we're improving the asymptotic complexity.
The merge step itself is Θ(n).
Complexity: T(n) = c1 (dividing the array) + 2T(n/2) (recursive part) + c·n (merge part: the dominant part).
You can have something else in that last slot, like a different function f(n).
              cn
         cn/2    cn/2
     cn/4  cn/4  cn/4  cn/4
            ....
  c c c c c c ......... c c c   (n leaves)
Add up the work in each level of this tree:
cn (lv 1) + cn (lv 2) + cn (lv 3) + ... + cn (lv 1+log n).
Every level does the same amount of work, modulo constants.
T(n) = (1 + log n) · cn = Θ(n log n)
for T(n) = 2T(n/2) + Θ(n) + Θ(1).
Insertion sort, on numbers: Θ(n²) for swaps.
Merge sort: Θ(n log n).
What is one advantage of insertion sort over merge sort?
Merge sort needs Θ(n) (extra) auxiliary space;
in-place sorting needs Θ(1) (extra) auxiliary space.
Auxiliary space = e.g. the temporary variable you need when you swap two elements:
when you swap a couple of registers, you have to store one of the values in a temporary location and overwrite the other.
Merge sort in Python ≈ 2.2·n log n microseconds; insertion sort ≈ 0.2·n² microseconds.
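The recurrence above can be read off this sketch; the merge loop is the c·n part.

```python
def merge_sort(A):
    """Theta(n log n) time, Theta(n) auxiliary space for the merge."""
    if len(A) <= 1:
        return A                     # Theta(1) base case
    mid = len(A) // 2                # c1: divide the array
    left = merge_sort(A[:mid])       # 2T(n/2): the recursive part
    right = merge_sort(A[mid:])
    merged, i, j = [], 0, 0          # c*n: the merge, the dominant part
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```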
Q: What does the recursion tree look like for T(n) = 2T(n/2) + c·n²?
              cn²
        cn²/4     cn²/4
   cn²/16  cn²/16  cn²/16  cn²/16
            ....
  c c c c ... c c c   (n leaves, tree height = 1 + log n)
Adding up the work at each level:
cn² (dominates) + cn²/2 + cn²/4 + cn²/8 + ... + cn (leaves).
A Θ(n²) algorithm, regardless of where in the tree the work lands:
all of the other levels sum to less than cn² (the root dominates: essentially all the work is done there).
T(n) = 2T(n/2) + Θ(1), with constant c:
         c
       c   c
     c c c c
       ....
  c c c c ... c   (n leaves)
Level sums: c + 2c + 4c + ... + cn (the leaf level dominates: all of the work is done here)
= Θ(n).
The recursion tree is the better way of looking at it:
depending on what the merge part is, in terms of the work being done in the merge routine, you get different versions of the recurrence.
04 Heap
Priority queue:
implements a set S of elements, each element associated with a key.
It is a specification of a data structure in terms of the operations the data structure should perform:
insert(S, x): insert element x into set S
max(S): return the element in S with the largest key
extract_max(S): return the element in S with the largest key and remove it from S
increase_key(S, x, k): increase the value of x's key to the new value k (update, increment, ...)
Maintaining a rep invariant of this data structure, called a heap, allows you to do these operations in an efficient way.
HEAP
- an array structure visualized as a nearly complete binary tree.
What's nice about this heap structure is that you get a tree representation of an array, which lets you do a bunch of interesting things.
Heap as a tree:
what you get out of this visualization.
Heap def:
root of tree = first element (i = 1)
parent(i) = i/2
left child(i) = 2i, right child(i) = 2i + 1
Max-heaps and min-heaps
- have additional properties on top of the basic heap structure.
Max-heap property:
the key of a node >= the keys of its children.
Min-heap property:
the key of a node <= the keys of its children.
We want to maintain the rep invariant of this data structure, the max-heap property, as we modify the heap:
if you go from one heap to another, you start with a max-heap and want to end with a max-heap.
- Extract the max element over and over and you get the elements sorted in decreasing order.
Once we have this heap structure, can maintain the max-heap property, and continually run extract-max on it
... if you can build extract-max in an efficient way
... it's a great sorting algorithm.
How to build a max-heap from an initially unsorted array, which may or may not already be a max-heap:
build_max_heap: produce a max-heap from an arbitrary unsorted array.
max_heapify: correct a single violation of the heap property at a subtree's root (and fix it).
Do this recursively at different levels to build a max-heap from an unordered array.
max_heapify(A, i): assume that the trees rooted at left(i) and right(i) are max-heaps (precondition). max_heapify is allowed to crash and not do anything useful if the precondition is violated; but if the precondition is true, then it has to return a max-heap, correcting this violation.
max_heapify(A, 2):
all you have to do is look at this subtree.
Take the unordered array and turn it into a max-heap; once we do that, run the extract-max deal to sort the array.
- Convert A[1, 2, 3, ..., n] into a max-heap.
- build_max_heap(A):
      for i = n/2 down to 1    # you are working your way up
          max_heapify(A, i)
- Elements A[n/2 + 1, ..., n] are all leaves (this is true for any array, whatever n is).
Leaves are good, because they automatically satisfy the max-heap property;
so every time you call max_heapify, you satisfy the precondition, since the leaves are automatically max-heaps. (The explanation is a bit terse: nodes at index n/2 and below may have children, and each of them definitely has at least one child; the lowest of them have only leaves as children.)
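A 0-indexed sketch of the two routines (the lecture's pseudocode is 1-indexed, so children here are 2i+1 and 2i+2):

```python
def max_heapify(A, i, n):
    """Fix a single violation at i, assuming both subtrees of i are max-heaps."""
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and A[left] > A[largest]:
        largest = left
    if right < n and A[right] > A[largest]:
        largest = right
    if largest != i:
        A[i], A[largest] = A[largest], A[i]
        max_heapify(A, largest, n)    # the violation may have moved down

def build_max_heap(A):
    """Produce a max-heap from an arbitrary unsorted array, bottom-up."""
    n = len(A)
    for i in range(n // 2 - 1, -1, -1):   # A[n//2:] are leaves, already heaps
        max_heapify(A, i, n)
    return A
```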
The naive bound on the complexity of build_max_heap(): O(n log n).
We don't change the algorithm; we change the analysis.
- Seen from the top of the binary recursion coming down, each call does constant work;
the nodes down at the bottom have a constant number of operations.
- Going bottom-up through the binary recursion,
each level up adds 1-2 operations per node, but the number of such nodes shrinks.
Be careful how you count.
For example: n/4 nodes at level 1, n/8 nodes at level 2, ..., 1 node at level log n (which is the root node).
The node count decreases as the work you're doing per node increases.
* Total amount of work in the for loop
= n/4 (1c) + n/8 (2c) + n/16 (3c) + ... + 1 (log n · c).
Either way, this sum converges.
The key observation is that the bracketed series is bounded by a constant,
so pulling out n it takes constant time per element.
Setting n/4 = 2^k, what remains of the expression is proportional to n,
so time complexity = Θ(n).
(This is slowly getting familiar. I wish I could produce proofs like this myself.)
The extraction process:
shrink the tree while ordering the list at the same time... write the algorithm.
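For the extract-based sort, the stdlib heapq gives the min-heap mirror of this: heapify in Θ(n), then repeatedly extract the minimum.

```python
import heapq

def heap_sort(items):
    """Sort by building a min-heap and extracting the minimum n times."""
    heap = list(items)
    heapq.heapify(heap)                  # Theta(n) build, as analyzed above
    out = []
    while heap:
        out.append(heapq.heappop(heap))  # each extract costs O(log n)
    return out
```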
05. BST sort (a perfect data structure for some particular problems)
There are constraints associated with the system that have to be obeyed, and you have to build these constraints, and the checks for these constraints, into your data structure.
Example: reservations for future landings (a runway reservation system).
(Like familiar reservation systems: same-day booking apps, open APIs, and so on.)
This is about adding to the data structure an insert operation that has a constraint associated with it that you need to check:
you wouldn't insert if the constraint was violated; you would if the constraint was satisfied.
k and the time variable are part of the system and need to be modeled, and you have the current notion of time.
Remove from the time list the times of planes that have already landed:
essentially take that particular landing time away from the set R.
So every once in a while, as time increments, you're going to check the data structure;
you can do this every minute, etc.
|R| = n, the size of the set; the goal is to run everything in O(log n): the efficiency requirement.
(Practice this kind of data structure setting and representing work.)
With an unsorted array, pretty much everything you want to do to it is linear <--- this is the efficiency problem.
Insert/append --> constant time: O(1).
But to check against the other elements of the array: it's unsorted, you have no idea where to find the element you want, so you have to scan through the entire array to see whether there's a landing time within k of the current time t you are asked about ---> O(n) linear time.
Unsorted array/list:
insert in O(1) without the check (just put it in without checking the list: constant time);
the check takes order-n time.
Sorted array/list:
through binary search, within logarithmic time you find what we call the insertion point in the sorted array where this 34 is supposed to sit.
How long do the steps below take?
1. Find the insertion point, then look to your left or to your right (Θ(log n)):
   find the smallest i such that R[i] >= t, in O(log n) time.
2. Do the k-minute check (a comparison, O(1)):
   compare R[i] and R[i-1] against t in O(1) time.
3. Do the insertion (when inserting into the array, the other elements have to be shifted):
   sadly, the actual insertion is going to require shifting -- O(n) time, because it's an array.
If you use a heap to insert, one branch can be kept ordered as a min-heap, but you'd have to check the other branches, so that also takes O(n).
A dictionary comes down to the same problem; everything takes O(n).
...
If you can find fast insertion into a sorted structure, in O(log n) time...
------> a BST makes this possible.
The invariant that makes BSTs a different data structure is stronger than the heap invariant:
this BST invariant is more powerful than the heap's, which only speaks about the relationship between a node and its children.
When the order among the children's siblings is unconstrained, insertion with the check takes O(n) time;
but with this characteristic ordering, the left side is always smaller and the right side is always larger.
Do the compare first, then reject the reservation if it fails.
The basic insertion method into a BST has no constraint;
without changing the efficiency of this method, you can augment the insertion.
The beauty of the BST is that while you're finding the place to insert, you can do this check, e.g. with k = 3.
How long it takes is determined by the height of this tree.
Doing either find or insert: the classic BST, the vanilla search tree.
(You can copy this tree into several versions to take reservations or modify them.)
This is accomplished in O(h) time.
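A sketch of the augmented insert (class and method names are mine): the k-minute check happens while descending, so one O(h) pass does both the check and the insert.

```python
class Node:
    def __init__(self, t):
        self.t, self.left, self.right = t, None, None

class RunwaySystem:
    """Unbalanced BST of landing times with a k-minute separation constraint."""
    def __init__(self, k):
        self.root, self.k = None, k

    def request(self, t):
        """Insert landing time t unless it is within k of an existing time."""
        if self.root is None:
            self.root = Node(t)
            return True
        cur = self.root
        while True:
            if abs(t - cur.t) < self.k:   # the constraint check, on the way down
                return False              # reject the reservation
            if t < cur.t:
                if cur.left is None:
                    cur.left = Node(t)
                    return True
                cur = cur.left
            else:
                if cur.right is None:
                    cur.right = Node(t)
                    return True
                cur = cur.right
```

Checking only the nodes on the search path suffices, because the predecessor and successor of t both lie on that path.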
find_min(): keep going left in the BST (max: keep going right), in O(h).
In a min-heap, if you want to find the minimum value it's constant time: just return the root.
next_larger(): in O(h) time.
This is the notion of augmentation:
add a new requirement to the BST without changing logarithmic into linear.
rank(t): how many planes are scheduled to land at times <= t?
How do you compute rank(t) from subtree sizes?
Augment the BST: on insert and delete, modify the subtree sizes (the number of nodes beneath you).
All of this is O(h) --> for h to be log n, something more is needed:
in the worst case h becomes linear, and the tree is not really a tree but a list. So:
Balanced BST.
leaf = height 0.
How do you compute the height of a node?
- Take the max of the heights of the children, + 1:
= max{height(left child), height(right child)} + 1 (a constant amount of work done here).
Data structure augmentation:
our goal is to keep the heights small.
Since we want a log n structure, a natural thing to do is store the heights,
and when they get too big, fix it.
AVL trees:
so that the formula max{height(left child), height(right child)} + 1 can still be used, a missing child is given height -1. (Set definitions up from the start in a form that keeps the formula useful.)
AVL trees
require the heights of the left and right children of every node to differ by at most ±1.
A strategy for maintaining a log n tree structure: keep the left/right height difference within ±1,
to maintain the left side and the right side more or less equal in height / subtree size.
And it turns out to be easy to maintain in log n time.
AVL trees are balanced (the height is always order log n):
imagine and design the worst case --> for every node, let's make the right side have a height 1 larger than the left side.
The worst case is when the right subtree has height 1 more than the left, for every node.
N_h = min. # nodes in an AVL tree of height h.
(This changes the question from n nodes to height h.)
Recursive formula:
N_1 = O(1)
N_h = 1 + N_{h-1} + N_{h-2} (grows exponentially)
h is always < 1.44 log n (found via the connection to Fibonacci numbers); the constant isn't 1, but it's still pretty good.
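A rough version of that bound, without the exact Fibonacci constant:

```latex
N_h = 1 + N_{h-1} + N_{h-2} > 2\,N_{h-2}
\quad\Rightarrow\quad N_h > 2^{h/2}
\quad\Rightarrow\quad h < 2\log_2 n .
```

The Fibonacci-based analysis tightens the constant from 2 down to about 1.44.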
Workflow: generalize the worst-case model, solve that formula in the limit, and predict the running time.
Adjusting and rewriting the algorithm accordingly is the key point.
Insertion is the problem.
1. AVL insertion:
- simple BST insertion
- fix the AVL property
'Even' means no long path has formed on one side: leaves, and nodes that have two children.
* If one side gets long, operations become linear time.
For such tree structures, insertion is the problem.
Sometimes a fix still comes out zig-zag and the height doesn't shrink;
then you fix it with multiple rotations.
But in general, there might be several violations up the tree.
We update the heights based on the heights of the children.
The AVL property is not about the levels; it's about the right subtrees and the left subtrees.
Prevent leaning to one side; keep maintaining a balanced tree.
07
Comparison model:
the idea is to restrict the kind of operations we can do to comparisons (which operations must we restrict ourselves to in order to only compare?).
- All input items are black boxes (abstract data types): this too is a kind of data structure.
Merge sort,
binary search trees,
heaps,
heapsort:
these are all about comparisons.
* Time cost from the lower-bound perspective: how fast can it possibly be? The searching lower bound.
Whatever you do, with comparison operations the best achievable is:
search: log n time;
sort: n log n time.
The idea is the following:
if we know that our algorithms are only comparing items, we can actually draw all the possible things that an algorithm could do.
Decision tree:
any comparison algorithm can be viewed as a tree of all possible comparisons, their outcomes, and the resulting answers.
For any particular n,
the tree can be exponential in size.
However much you preprocess the data, if you have to compare, a search again costs log n.
Structurally it's a binary tree, and at least one leaf is assigned to each answer; many leaves can hold the same value:
there are multiple paths to get to the same answer in a typical algorithm.
So there are at least n leaves.
1) binary tree  2) at least n leaves ----> 3) at least log n height (the height is the worst-case running time).
Two facts:
1. Binary search is optimal in a comparison model.
2. A binary search tree is actually a good way to solve the problem.
In the comparison model, log n is both the best search cost and the limit of the model.
The sorting isn't redone, but each comparison-based search costs log n time, independently of whatever happened before (each individual comparison is constant time).
[Linear-time sorting: integer sorting]
That's a big assumption. If you're not sorting integers, you can map whatever the heck you're sorting into integers; you've represented your thing as an integer of sorts ---> write the algorithm by binding items to indexes.
Assume the n keys being sorted are integers {0, 1, 2, ..., k-1} and, just for completeness, that each fits in a word.
Words are things you can manipulate in constant time ---> you can compare two keys in constant time; to get that for integers, you need to assume that your integers fit in words.
Since we connect words with numbers, we can do more than comparison: add, subtract, divide, etc., which helps us a lot.
- For small enough k, ..., we can sort in O(n) time.
- 1. Counting sort: all it does is sort integers. (Know exactly what each algorithm can do.)
The output array is already laid out by key:
store the counts in an array in memory and write the results out from it.
(To review: Python's key functions.
First: how do you assign keys to the entries of A when building L?
Second: how do you lay the items out in the output, gathered by key?)
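A minimal counting sort sketch for keys drawn from {0, ..., k-1}:

```python
def counting_sort(A, k):
    """Sort integers drawn from {0, ..., k-1} in O(n + k) time, no compares."""
    counts = [0] * k
    for x in A:
        counts[x] += 1        # tally each key
    out = []
    for value in range(k):
        out.extend([value] * counts[value])   # write keys back in order
    return out
```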
- 2. Radix sort:
uses counting sort as a subroutine.
(In Python, a key function takes constant time per item.)
(Revisit and organize the later part after watching the radix sort algorithm again.)
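A sketch of radix sort on nonnegative integers; each pass is a stable per-digit sort (bucket form of counting sort):

```python
def radix_sort(A, base=10):
    """Sort nonnegative ints, least significant digit first."""
    if not A:
        return A
    exp = 1
    while exp <= max(A):
        # stable counting sort keyed on the digit (x // exp) % base
        buckets = [[] for _ in range(base)]
        for x in A:
            buckets[(x // exp) % base].append(x)
        A = [x for bucket in buckets for x in bucket]
        exp *= base
    return A
```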
8. Hashing
In binary search (trees) you don't just look up the key; you find the next larger, the next smaller, successor and predecessor.
A dictionary cares only about this:
"Does this key exist? If so, give me the item with that key."
The underlying premise is that each item has its own unique key.
One way to enforce this: when you insert an item with an existing key, it overwrites whatever was there under that key. (This seems to be the usual convention.) It makes what search does well defined:
either there's one item with that key or there's no item with that key, and it tells you which situation holds.
Motivation:
- docdist: how many identical words do two documents share? etc.; solved very quickly and efficiently
- databases:
there are essentially two kinds of databases in the world, those using hashing and those using search trees;
sometimes you need one, sometimes the other.
Hash-type uses:
- compilers and interpreters
- network routers / servers
- substring search, Google, ...
- string commonalities
- file/directory synchronization
Python's notation for accessing a dictionary is exactly identical to the notation for accessing an array.
Problems with the naive dictionary-as-array:
- 1. Keys may not be nonnegative integers.
Solution to 1: a prehash function
- maps keys to nonnegative integers.
- In theory, keys are finite and discrete:
we know that anything on the computer could ultimately be written down as a string of bits,
and a string of bits represents an integer, ex) each letter of the alphabet carries some particular number.
- In Python, hash(x) is the prehash of x: it maps every object to an integer, or every hashable object, technically.
- Caveat: hash('\0B') == hash('\0\0C').
Ideally: hash(x) = hash(y) <=> x = y,
but in Python this is only mostly true.
You can implement the __hash__ method, but if you ever implement this function, don't mess with it: the prehash really shouldn't change. Make sure it's defined in such a way that it doesn't change over time; otherwise you won't be able to find your items in the table. The default is based on id, the physical location of your object in memory; as long as your object isn't moving around in memory, this is a pretty good prehash function, because no two items occupy the same space in memory.
That's why you shouldn't use a list as a hash key: lists change all the time, since potentially you'd append to the list or whatever.
- 2. Gigantic memory hog.
Solution to 2: hashing (the word means to cut into pieces and mix around).
The big idea is to take all possible keys and reduce them down to some small set of integers:
reduce the universe U of all keys (integers) down to a reasonable size m for the table.
Techniques for dealing with collisions:
1. Chaining: if you hash multiple items to the same position, just store them as a list, a linked list of colliding elements in each slot of the hash table.
If you pick a bad hash function, all the data inserted can end up as one list in a single slot; then time complexity = Θ(n).
Hash function def: h : U --> {0, 1, 2, ..., m-1}, a function mapping the universe into that range.
The hash function has to know what m is.
In reality there's not going to be one hash function; there's going to be at least one for each m. Depending on how big your table is, you use the corresponding hash function.
Simple uniform hashing:
each key is equally likely to be hashed to any slot of the table, independent of where other keys hash.
(Every key choosing its slot with equal probability (uniform), regardless of what came before or after (independent):
a very nice situation if it held, but it can't actually happen; it's an analysis assumption.)
Analysis: expected length of a chain for n keys, m slots
= 1/m + 1/m + ... + 1/m (n terms) = n/m = α = the load factor of the table.
If α is 1, n = m; if α is 10, n is 10 times m, but the chain length is still constant.
So chains are Θ(1) if m = Θ(n).
We need to maintain this property: as long as you set your table size to the right value, roughly n, this will be constant.
(I didn't see at first why this is constant time: the point is that α = n/m stays a constant when m grows with n.)
Running time = O(1 + |chain|)
= O(1 + α); the 1 covers the operations of computing the hash function and finding the slot.
If α is constant, every operation is constant time (in expectation).
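A minimal chaining sketch (fixed m, no table doubling; the class name is mine). Slot lists play the role of the chains:

```python
class ChainedHashTable:
    """Hash table with chaining: each of the m slots holds a list of (key, value)."""
    def __init__(self, m=8):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _slot(self, key):
        return hash(key) % self.m      # prehash, then reduce mod m

    def insert(self, key, value):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:               # existing key: overwrite, as dictionaries do
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self._slot(key)]:   # O(1 + chain length)
            if k == key:
                return v
        return None
```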
How to construct h, the hash function.
HASH FUNCTIONS:
1. Division method:
h(k) = k mod m. This is essentially bad if m has common factors with the keys.
2. Multiplication method:
h(k) = [(a · k) mod 2^w] >> (w - r), where k is a w-bit word and the table size is m = 2^r.
(A poorly chosen method can behave badly for particular key values.)
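Both methods in a sketch; the word size, output width, and multiplier below are my own choices, not fixed by the lecture (any odd multiplier works for the scheme):

```python
W = 64              # machine word size in bits (assumption)
R = 8               # r output bits, i.e. table size m = 2**r
A_ODD = 2654435769  # some odd multiplier; this particular value is my choice

def division_hash(k, m):
    """h(k) = k mod m; weak when m shares factors with the keys."""
    return k % m

def multiplication_hash(k, w=W, r=R, a=A_ODD):
    """h(k) = ((a*k) mod 2^w) >> (w - r): keep the top r bits of the low word."""
    return ((a * k) % (1 << w)) >> (w - r)
```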
THE COOL METHOD, called universal hashing:
roughly understood, it achieves the randomization (uniformity) we wanted, and the worst case occurs only with low probability, which is why it's good.
09 Table Doubling, Karp-Rabin
What should m be?
We want m to be big enough that our structure is fast, but small enough that it's not wasteful in space.
==> We always run into this kind of trade-off.
How to choose m?
2. Open addressing