11.1 If you were integrating a feed of end of day stock price information (open, high, low, and closing price) for 5,000 companies, how would you do it? You are responsible for the development, rollout and ongoing monitoring and maintenance of the feed. Describe the different methods you considered and why you would recommend your approach. The feed is delivered once per trading day in a comma-separated format via an FTP site. The feed will be used by 1000 daily users in a web application.
SOLUTION
Let’s assume we have some scripts which are scheduled to get the data via FTP at the end of the day. Where do we store the data? How do we store the data in such a way that we can do various analyses of it?
Proposal #1
Keep the data in text files. This would be very difficult to manage and update, as well as very hard to query. Keeping unorganized text files would lead to a very inefficient data model.
Proposal #2
We could use a database. This provides the following benefits:
»» Logical storage of data.
»» Facilitates an easy way of doing query processing over the data.
Example: return all stocks having open > N AND closing price < M (see the query sketch below).
Advantages:
»» Makes maintenance easy once installed properly.
»» Rollback, backup, and security can be provided using standard database features. We don’t have to “reinvent the wheel.”
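For example, such a query could be issued through JDBC along the lines of the sketch below. This is illustrative only; the stock_prices table and its column names are assumptions, not part of the problem statement.

import java.sql.*;

public class StockQuery {
    // Sketch: print all stocks whose open exceeded N and whose close fell
    // below M on a given day. Table and column names are hypothetical.
    public static void printMatches(Connection conn, double n, double m, Date day)
            throws SQLException {
        String sql = "SELECT company_name, open, closing_price "
                   + "FROM stock_prices "
                   + "WHERE quote_date = ? AND open > ? AND closing_price < ?";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setDate(1, day);
            stmt.setDouble(2, n);
            stmt.setDouble(3, m);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("company_name") + " "
                            + rs.getDouble("open") + " " + rs.getDouble("closing_price"));
                }
            }
        }
    }
}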
Proposal #3
If requirements are not that broad and we just want to do a simple analysis and distribute the data, then XML could be another good option.
Our data has a fixed format and a fixed size: company_name, open, high, low, closing price. The XML could look like this:
<root>
  <date value="2008-10-12">
    <company name="foo">
      <open>126.23</open>
      <high>130.27</high>
      <low>122.83</low>
      <closingPrice>127.30</closingPrice>
    </company>
    <company name="bar">
      <open>52.73</open>
      <high>60.27</high>
      <low>50.29</low>
      <closingPrice>54.91</closingPrice>
    </company>
  </date>
  <date value="2008-10-11"> . . . </date>
</root>
Benefits:
»» Very easy to distribute. This is one reason that XML is a standard data model to share / distribute data.
»» Efficient parsers are available to parse the data and extract only the desired fields (see the sketch after this list).
»» We can add new data to the XML file by carefully appending it, rather than regenerating the file each day. However, querying the data could be difficult.
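As an illustration, a standard XPath expression can pull out just the desired fields. The sketch below assumes a feed file named feed.xml with the structure shown above; the file name and the queried company and date are only examples.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class FeedReader {
    // Sketch: extract the closing price of company "foo" on 2008-10-12.
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("feed.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "/root/date[@value='2008-10-12']/company[@name='foo']/closingPrice";
        String closingPrice = xpath.evaluate(expr, doc);
        System.out.println("foo closed at " + closingPrice);
    }
}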
11.2 How would you design the data structures for a very large social network (Facebook, LinkedIn, etc)? Describe how you would design an algorithm to show the connection, or path, between two people (e.g., Me -> Bob -> Susan -> Jason -> You).
SOLUTION
Approach:
Forget that we’re dealing with millions of users at first. Design this for the simple case. We can construct a graph by treating each person as a node and drawing an edge between two nodes when the two people are friends with each other.
class Person {
Person[] friends;
// Other info
}
If I want to find the connection between two people, I would start with one person and do a simple breadth first search.
But... oh no! Millions of users!
When we deal with a service the size of Orkut or Facebook, we cannot possibly keep all of our data on one machine. That means that our simple Person data structure from above doesn’t quite work—our friends may not live on the same machine as us. Instead, we can replace our list of friends with a list of their IDs, and traverse as follows:
1. For each friend ID: int machine_index = lookupMachineForUserID(id);
2. Go to machine number machine_index.
3. On that machine, do: Person friend = lookupFriend(id);
There are more optimizations and follow up questions here than we could possibly discuss, but here are just a few thoughts.
Optimization: Reduce Machine Jumps
Jumping from one machine to another is expensive. Instead of randomly jumping from machine to machine with each friend, try to batch these jumps—e.g., if 5 of my friends live on one machine, I should look them up all at once.
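A sketch of what such a batched lookup might look like is below. The helpers lookupMachineForUserID and fetchPersonsFromMachine are hypothetical stand-ins for whatever directory service and remote calls the real system would use.

import java.util.*;

public class FriendFetcher {
    // Hypothetical helpers, assumed to exist elsewhere in the system.
    static int lookupMachineForUserID(int id) { /* directory lookup */ return id % 10; }
    static List<Person> fetchPersonsFromMachine(int machine, List<Integer> ids) {
        /* one remote call for the whole batch */ return new ArrayList<>();
    }

    // Fetch a person's friends, grouping the IDs by machine so that each
    // machine is contacted at most once (the "reduce machine jumps" idea).
    static List<Person> fetchFriends(int[] friendIDs) {
        Map<Integer, List<Integer>> idsByMachine = new HashMap<>();
        for (int id : friendIDs) {
            int machine = lookupMachineForUserID(id);
            idsByMachine.computeIfAbsent(machine, k -> new ArrayList<>()).add(id);
        }
        List<Person> friends = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> entry : idsByMachine.entrySet()) {
            friends.addAll(fetchPersonsFromMachine(entry.getKey(), entry.getValue()));
        }
        return friends;
    }
}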
Optimization: Smart Division of People and Machines
People are much more likely to be friends with people who live in the same country as them. Rather than randomly dividing people up across machines, try to divvy them up by country, city, state, etc. This will reduce the number of jumps.
Question: Breadth First Search usually requires “marking” a node as visited. How do you do that in this case?
Usually, in BFS, we mark a node as visited by setting a flag visited in its node class. Here, we don’t want to do that (there could be multiple searches going on at the same time, so it’s bad to just edit our data). In this case, we could mimic the marking of nodes with a hash table to lookup a node id and whether or not it’s been visited.
Other Follow-Up Questions:
»» In the real world, servers fail. How does this affect you?
»» How could you take advantage of caching?
»» Do you search until the end of the graph (infinite)? How do you decide when to give up?
»» In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use this data to pick where you start traversing?
The following sketch demonstrates the simple, single-machine version of the search. It assumes the Person class above, and it tracks visited nodes in a hash map rather than inside the nodes themselves, as discussed:
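import java.util.*;

public class PathFinder {
    // Breadth first search from source to destination, returning the path
    // (source, ..., destination), or null if the two people are not connected.
    public static List<Person> findPath(Person source, Person destination) {
        Map<Person, Person> previous = new HashMap<>();  // doubles as the "visited" set
        Queue<Person> queue = new LinkedList<>();
        queue.add(source);
        previous.put(source, null);
        while (!queue.isEmpty()) {
            Person current = queue.remove();
            if (current == destination) {
                return buildPath(previous, destination);
            }
            for (Person friend : current.friends) {
                if (!previous.containsKey(friend)) {
                    previous.put(friend, current);
                    queue.add(friend);
                }
            }
        }
        return null;  // no connection found
    }

    private static List<Person> buildPath(Map<Person, Person> previous, Person end) {
        LinkedList<Person> path = new LinkedList<>();
        for (Person p = end; p != null; p = previous.get(p)) {
            path.addFirst(p);
        }
        return path;
    }
}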
11.3 Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB of memory.
FOLLOW UP
What if you have only 10 MB of memory?
SOLUTION
There are a total of 2^32, or about 4 billion, distinct integers possible. We have 1 GB of memory, which is about 8 billion bits.
Thus, with 8 billion bits, we can map all possible integers to a distinct bit with the available memory. The logic is as follows:
1. Create a bit vector (BV) of size 4 billion.
2. Initialize BV with all 0’s
3. Scan all numbers (num) from the file and write BV[num] = 1;
4. Now scan BV again, starting from the 0th index.
5. Return the first index that has a value of 0.
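A sketch of this approach is below. It assumes the integers are non-negative and arrive one per line in a file named numbers.txt; both the file name and the format are illustrative.

import java.io.*;

public class FindMissingInt {
    public static void main(String[] args) throws IOException {
        // 2^26 longs = 2^32 bits = 512 MB, which fits in 1 GB of memory.
        long[] bitVector = new long[1 << 26];
        try (BufferedReader in = new BufferedReader(new FileReader("numbers.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                long num = Long.parseLong(line.trim());            // value in [0, 2^32)
                bitVector[(int) (num >>> 6)] |= 1L << (num & 63);  // BV[num] = 1
            }
        }
        // Scan the bit vector and return the first index whose bit is still 0.
        for (long i = 0; i < (1L << 32); i++) {
            if ((bitVector[(int) (i >>> 6)] & (1L << (i & 63))) == 0) {
                System.out.println("Missing: " + i);
                return;
            }
        }
    }
}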
Follow Up: What if we have only 10 MB memory?
It’s possible to find a missing integer with just two passes of the data set. We can divide up the integers into blocks of some size (we’ll discuss how to decide on a size later). Let’s just assume that we divide up the integers into blocks of 1000. So, block 0 represents the numbers 0 through 999, block 1 represents the numbers 1000 through 1999, etc. Since the range of ints is finite, we know that the number of blocks needed is finite.
In the first pass, we count how many ints are in each block. That is, if we see 552, we know that it is in block 0, so we increment counter[0]. If we see 1425, we know that it is in block 1, so we increment counter[1].
At the end of the first pass, we’ll be able to quickly spot a block that is missing a number. If our block size is 1000, then any block which has fewer than 1000 numbers must be missing a number. Pick any one of those blocks.
In the second pass, we’ll actually look for which number is missing. We can do this by creating a simple bit vector of size 1000. We iterate through the file, and for each number that should be in our block, we set the appropriate bit in the bit vector. By the end, we’ll know which number (or numbers) is missing.
Now we just have to decide what the block size is.
A quick answer is 2^20 values per block. We will need an array of 2^12 block counters and a bit vector of 2^17 bytes. Both of these comfortably fit in 10 * 2^20 bytes.
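A sketch of the two-pass approach with this block size is below, again assuming a hypothetical numbers.txt holding non-negative 32-bit values, one per line, and 4-byte counters as in the analysis above.

import java.io.*;

public class MissingIntSmallMemory {
    static final int BLOCK_SIZE = 1 << 20;      // values per block
    static final String FILE = "numbers.txt";   // illustrative file name

    public static void main(String[] args) throws IOException {
        // Pass 1: count how many values fall into each block (2^12 counters).
        int[] counts = new int[1 << 12];
        try (BufferedReader in = new BufferedReader(new FileReader(FILE))) {
            String line;
            while ((line = in.readLine()) != null) {
                long v = Long.parseLong(line.trim());
                counts[(int) (v / BLOCK_SIZE)]++;
            }
        }

        // A block with fewer than BLOCK_SIZE entries must be missing a value.
        int blockIndex = 0;
        while (counts[blockIndex] >= BLOCK_SIZE) {
            blockIndex++;
        }
        long low = (long) blockIndex * BLOCK_SIZE;

        // Pass 2: bit vector for just that block (2^20 bits = 128 KB).
        long[] bits = new long[BLOCK_SIZE / 64];
        try (BufferedReader in = new BufferedReader(new FileReader(FILE))) {
            String line;
            while ((line = in.readLine()) != null) {
                long v = Long.parseLong(line.trim());
                if (v >= low && v < low + BLOCK_SIZE) {
                    long offset = v - low;
                    bits[(int) (offset >>> 6)] |= 1L << (offset & 63);
                }
            }
        }
        for (int offset = 0; offset < BLOCK_SIZE; offset++) {
            if ((bits[offset >>> 6] & (1L << (offset & 63))) == 0) {
                System.out.println("Missing: " + (low + offset));
                return;
            }
        }
    }
}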
What’s the smallest footprint? When the array of block counters occupies the same memory as the bit vector. Let N = 2^32.
counters (bytes): blocks * 4
bit vector (bytes): (N / blocks) / 8
blocks * 4 = (N / blocks) / 8
blocks^2 = N / 32
blocks = sqrt(N/2)/4
It’s thus possible to find a missing integer using only about 46 KB (or, more exactly, sqrt(2)*2^15 bytes) of memory at any one time.
11.4 You have an array with all the numbers from 1 to N, where N is at most 32,000. The array may have duplicate entries and you do not know what N is. With only 4KB of memory available, how would you print all duplicate elements in the array?
SOLUTION
We have 4KB of memory which means we can address up to 8 * 4 * (2^10) bits. Note that 32* (2^10) bits is greater than 32000. We can create a bit vector with 32000 bits, where each bit represents one integer.
NOTE: While this isn’t an especially difficult problem, it’s important to implement this cleanly. We will define our own bit vector class to hold a large bit vector.
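One possible implementation is sketched below, with a small hand-rolled bit vector class; the class and method names are illustrative.

public class FindDuplicates {
    // 32,000 bits = 4,000 bytes, which fits within the 4 KB budget.
    public static void printDuplicates(int[] array) {
        BitVector bits = new BitVector(32000);
        for (int num : array) {
            int index = num - 1;          // numbers run from 1 to N, bits from 0
            if (bits.get(index)) {
                System.out.println(num);  // already seen: this is a duplicate
            } else {
                bits.set(index);
            }
        }
    }
}

class BitVector {
    private final int[] words;            // 32 bits per int

    BitVector(int size) {
        words = new int[(size + 31) / 32];
    }

    boolean get(int pos) {
        return (words[pos >> 5] & (1 << (pos & 31))) != 0;
    }

    void set(int pos) {
        words[pos >> 5] |= 1 << (pos & 31);
    }
}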
11.5 If you were designing a web crawler, how would you avoid getting into infinite loops?
SOLUTION
First, how does the crawler get into a loop? The answer is very simple: when we re-parse an already parsed page. This would mean that we revisit all the links found in that page, and this would continue in a circular fashion.
Be careful about what the interviewer considers the “same” page. Is it URL or content? One could easily get redirected to a previously crawled page.
So how do we stop visiting an already visited page? The web is a graph-based structure, and we commonly use DFS (depth first search) and BFS (breadth first search) for traversing graphs. We can mark already visited pages the same way that we would in a BFS/DFS.
We can easily show that this algorithm terminates. Each step of the algorithm parses only new pages, never already visited pages. So, if we assume that there are N unvisited pages remaining, then each step reduces that number by one (from N to N - 1). The algorithm must therefore finish after at most N steps. A simple sketch follows.
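The sketch below crawls with a BFS and a visited set keyed by url. The helper extractLinks is hypothetical; in a real crawler it would download the page and return the urls it links to (after normalizing them, per the “same page” discussion above).

import java.util.*;

public class Crawler {
    public static void crawl(String startUrl) {
        Set<String> visited = new HashSet<>();   // marks already parsed pages
        Queue<String> queue = new LinkedList<>();
        queue.add(startUrl);
        visited.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            for (String link : extractLinks(url)) {
                if (visited.add(link)) {         // add() is false if already visited
                    queue.add(link);
                }
            }
        }
    }

    private static List<String> extractLinks(String url) {
        // Hypothetical: fetch the page at url and parse out its links.
        return Collections.emptyList();
    }
}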
SUGGESTIONS AND OBSERVATIONS
»» This question has a lot of ambiguity. Ask clarifying questions!
»» Be prepared to answer questions about coverage.
»» What kind of pages will you hit with a DFS versus a BFS?
»» What will you do when your crawler runs into a honey pot that generates an infinite subgraph for you to wander about?
11.6 You have a billion urls, where each is a huge page. How do you detect the duplicate documents?
SOLUTION
Observations:
1. Pages are huge, so bringing all of them into memory is costly. We need a shorter representation of the pages in memory. A hash is an obvious choice for this.
2. There are a billion urls, so we don’t want to compare every page with every other page (that would be O(n^2)).
Based on the above two observations, we can derive an algorithm as follows:
1. Iterate through the pages and compute the hash of each one.
2. Check whether the hash value is already in the hash table. If it is, throw out the url as a duplicate. If it is not, keep the url and insert it into the hash table (see the sketch after this list).
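A sketch of this single-machine version is below; fetchPage is a hypothetical helper that downloads a page’s contents.

import java.util.*;

public class DedupSingleMachine {
    // Keep a set of page hashes; drop any url whose page hashes to a value
    // we have already seen.
    public static List<String> uniqueUrls(List<String> urls) {
        Set<Integer> seenHashes = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String url : urls) {
            int hash = fetchPage(url).hashCode();   // four byte hash of the contents
            if (seenHashes.add(hash)) {             // false if this hash was seen before
                unique.add(url);
            }
        }
        return unique;
    }

    private static String fetchPage(String url) {
        // Hypothetical: download the document at this url.
        return "";
    }
}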
This algorithm will provide us a list of unique urls. But wait, can this fit on one computer?
How much space does each page take up in the hash table?
»» Each page hashes to a four byte value.
»» Each url is an average of 30 characters, so that’s another 30 bytes at least.
»» Each url takes up roughly 34 bytes.
34 bytes * 1 billion = 31.6 gigabytes. We’re going to have trouble holding that all in memory!
What do we do?
»» We could split this up into files. We’ll have to deal with the file loading / unloading—ugh.
»» We could hash to disk. Size wouldn’t be a problem, but access time might. A hash table on disk would require a random access read for each check and write to store a viewed url. This could take milliseconds waiting for seek and rotational latencies. Elevator algorithms could eliminate random bouncing from track to track.
»» Or, we could split this up across machines, and deal with network latency. Let’s go with this solution, and assume we have n machines.
»» First, we hash the document to get a hash value v
»» v%n tells us which machine this document’s hash table can be found on.
»» v / n is the key used within the hash table located on that machine (see the sketch below).
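A sketch of that routing step is below; sendToMachine is a hypothetical helper standing in for the network call to the machine that owns this hash range.

public class DedupRouter {
    private final int n;    // number of machines

    public DedupRouter(int n) {
        this.n = n;
    }

    public void route(String url, String pageContents) {
        int v = pageContents.hashCode() & 0x7fffffff;  // non-negative four byte hash
        int machineIndex = v % n;   // which machine holds this document's hash table
        int key = v / n;            // key within that machine's table
        sendToMachine(machineIndex, key, url);
    }

    private void sendToMachine(int machineIndex, int key, String url) {
        // Hypothetical: store (key, url) in the remote machine's hash table,
        // reporting a duplicate if the key is already present.
    }
}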
11.7 You have to design a database that can store terabytes of data. It should support efficient range queries. How would you do it?
SOLUTION
Construct an index for each field that requires range queries. Use a B+ tree to implement the index. A B+ tree organizes sorted data for efficient insertion, retrieval and removal of records. Each record is identified by a key (for this problem, it is the field value). Since it is a dynamic, multilevel index, finding the beginning of the range depends only on the height of the tree, which is usually quite small. Record references are stored in the leaves, sorted by the key. Additional records can be found by following a next block reference. Records will be sequentially available until the key value reaches the maximum value specified in the query.
Thus, runtimes will be dominated by the number of elements in a range. Avoid using trees that store data at interior nodes, as traversing the tree will be expensive since it won’t be resident in memory.
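The access pattern itself (find the start of the range in the index, then read keys sequentially until the upper bound is passed) can be illustrated with an in-memory ordered map. Note that java.util.TreeMap is a red-black tree, not a disk-based B+ tree; it is used here only as a stand-in to show the pattern, with made-up record references as values.

import java.util.*;

public class RangeQueryDemo {
    public static void main(String[] args) {
        NavigableMap<Long, String> index = new TreeMap<>();   // field value -> record reference
        index.put(10L, "record-A");
        index.put(25L, "record-B");
        index.put(40L, "record-C");
        index.put(90L, "record-D");

        long low = 20, high = 50;
        // Runtime is dominated by the number of elements in [low, high].
        for (Map.Entry<Long, String> entry : index.subMap(low, true, high, true).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}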