Coding interview: Find median of 2 sorted arrays of int part 1

You are given 2 array of integers as input. They are sorted. Write a function can find the median of all the numbers.

First need to clarify what is median. When given a bunch of numbers, if you sort them, the number right in the middle is called the median. For example if we have [1, 2, 3, 4, 8, 9], the median is 4. When we have even number of numbers, it is the mean of those two in the middle. For example if we have [1, 2, 3, 4, 8, 9, 10], then the median is (4+8)/2=6

Next let’s solve the problem with a simple direct and easy to understand solution:

Really simple, right? And we are very sure it is correct as long as the merge function and median function are correct. Well you probably will be asked to implement them during an interview so here is one version:

Now, what is the problem with this approach? If you say it is too slow then you probably know the speed of the solution is O(m+n) which is linear. And you are right that we can do better than that but there are problems other than time. The action of merge array uses O(m+n) space which we can also avoid. But the bigger problem is that when we try to allocated the space for array c, we may not able to because m+n may overflow, same reason that I had to do the trick when I try to find the median. So let’s avoid it and avoid allocate a new array all together in part 2

 

 

Static over Dynamic

One reason I believe made Unix and Linux are very successful is their philosophy that “Everything is a file”. But the reason that this philosophy made them successful is files are relatively static than other things (Windows registry anyone?), in my opinion.

Another example would be for certain data that changes infrequently, simply put them in a static file and do a code change is much better than save in a database. Even though a code change and re-deploy is needed when change happens, but the performance and reliability is much better. A lot of time, I’ve seen people try to boost the performance by introduce caching at all levels thinking they get the good part from both: flexibility and the performance. Well what they got is complexity since cache invalidation is the second most difficult thing in computer engineering (right after naming things) according to Phil Karlton. One interesting story that I experienced first handed: One day, we accidentally stopped a service that has been running without any issue for 6 month. We thought, OK, let’s just re-start it. But it won’t, giving errors about cannot read a url. Turns out it is trying to do a one time load of some static data from an external team. The developer did not want to write a static file so he gave the justification that this is dynamic if the other team changed the data, we just need to restart the service. The other team had no idea this service depends on this url to start. There is completely no traffic on that and they removed it together with bunch of things 5 months ago.

Lastly, I have been trying to do a dev op work whole day today. I need to start a MariaDB Galera cluster. Our company’s setup script has evolved over the years from Chef to Puppet to finally Salt. I copied a set of working salt script to setup MariaDB Galera cluster from another team, but ended up reading everything in it try to understand what are the dynamic things it is doing based on host name, configurations, etc. I really wish that we are using Docker right now, because it is a file, it is static. Pull a image is basically copy paste a file. It is much much faster and guaranteed works.

Things I value in software/application design

Below are my personal opinions. I will explain some on them in details in future posts.

  • Proper design over hacking.
    • Not the security type of hacking, for the differences, read my post here. There is a Chinese old saying: “Think three times before you act.” My experience currently is that if the developer think half way through before he starts, it will be a relative successful project. And we give this all sorts of fancy terms such as “Agile”, “Bias for action”, “RAD”. I call it stupid.
  • Static over dynamic.
    • I mentioned this preference in my previous post here. Another example I want to give is when I debug some “fancy” code, it gives me a lot of headache when everything is dynamically picked. Reading the code will give you no clue what actually happens and you have to set break points and get dirty.
  • Separation over combination (which can be easier to use short term)
    • There are a lot of “powerful” library and frameworks out there. Their feature list is long and getting longer. But in my opinion, they should “do one thing and do one thing well” (Unix philosophy). And you should also keep this in mind when you write your code.
  • Simplicity over optimization.
    • “Premature optimization is the root of all evil.” —Donald Knuth. Need I say more? Actually one thing I would like to say is that quite often, you don’t even realize you are doing premature optimization. You are simply doing nature things as a well trained engineer, cache this, minify that, etc. I say you should question everything you do all the time if it is absolutely necessary.
  • Encapsulation over extensibility
    • Don’t get me wrong, inheritance and polymorphism are powerful programming concepts. Some libraries and frameworks leverage this and you can simply implement an interface or two to use them. However, I rarely see they get used properly when build internal systems. I see all the time an interface or parent class got one and only one implementer, which only makes the debugging experience horrible because you cannot get to the real code directly from the caller.
  • Configuration over convention (No I did not get the order wrong. I mean it.)
    • Convention over configuration is a software design paradigm advocated and embraced by a lot of people. It even has its own wikipedia page. I hate it, especially those “RAD” frameworks using this as an excuse to create tons of “dark magic”. The result is poor discoverability, hard to maintain code, buggy and hard to debug. An example I have to mention is a PHP framework called Lithium. Just don’t use it.

Can we have fixed versions for all the things?

Recently, we encountered some bower packages version conflicts issues. I believe the story is that we installed a new package called angular-touch. We installed it locally on our dev environment using the following command:

bower install angular-touch –save

The –save will add the package to the bower.json file. However, for some reason, we did not get a warning about that it needs a newer version of angular (1.4.3) than what we had (1.3.6). It may just silently updated it. Only when we committed the change and the build machine start to build from scratch, the issue showed up and we got his error:

Unable to find a suitable version for angular, please choose one:
1) angular#>=1 <1.3.0 which resolved to 1.2.28 and is required by angular-bootstrap#0.12.0
2) angular#1.2.28 which resolved to 1.2.28 and is required by angular-loader#1.2.28
3) angular#1.3.12 which resolved to 1.3.12 and is required by angular-resource#1.3.12
4) angular#1.3.16 which resolved to 1.3.16 and is required by angular-mocks#1.3.16, eMenu-web
5) angular#>= 1.0.8 which resolved to 1.3.16 and is required by angular-ui-router#0.2.10
6) angular#>=1.2.10 which resolved to 1.3.16 and is required by angular-carousel#0.3.12
7) angular#>=1.0.8 which resolved to 1.3.16 and is required by ngGeolocation#0.0.7
8) angular#~1.x which resolved to 1.3.16 and is required by angular-spinner#0.6.2
9) angular#>= 1.2.23 which resolved to 1.3.16 and is required by ngCordova#0.1.15-alpha
10) angular#1.4.3 which resolved to 1.4.3 and is required by angular-touch#1.4.3Prefix the choice with ! to persist it to bower.json

Here is something I really don’t like (that’s probably also why I’m writing a blog about it): Bower, like a lot other package management systems (npm, pip, maven, etc) allows you to specify version ranges. In this example, you can see notations “>=”, “<“, “~1.x” etc. Personally, I like all the things to be fixed. My thought is that deterministic is way better than random. That is probably why I like docker a lot.

I understand why they did this though. Because when your project depends on multiple libraries (let’s say A and B), they may in turn all depends on another library (let’s say C). If A want v1.1 of C and B want v1.2 of C, you got a conflict when A may work with anything greater than v1.1. So, to make it easy, they allow library writers to specify version range and must have some smart logic to pick one of the many versions that satisfies all the requirements (or just pick the latest that satisfies). It is only when they really cannot find anything to satisfies it, they ask a human like above. However in our case, it is too late because it is the build machine installing the packages.

Now, maybe it will be very annoying if we do not have version range and developers have to manually resolve conflicts a lot. Personally I feel much better knowing what exactly is my program using but others may not care and just want to use some library easily. Then I would ask this question: why don’t we have a platform can hold multiple versions of same package and let others use whatever version they need? Maybe it is very hard and need support from the programming language level. But in my opinion, it will be awesome. We will be able to encapsulate a lot better. Yes, the final size of your program may be several times depends on average how many versions of packages you included, but disk space and memory are getting much cheaper these days.

Update:

Dependency Hell

Explain WordPress as a house to a beginner

Today I met with a non-technical person who just started to use WordPress. I used a metaphor to explain to her what WordPress is and I felt that it is pretty good and I’m going to share here.

Imagine all the domain names out there are real estate land. The website we are building for the domain name is going to be the house on the land. You can build it from scratch by write HTML manually but just like if you build a house starting with cutting down some trees for wood, it will be very slow and ugly. WordPress is like a mobile house that is already built, you just need to tow it onto the land.

After the house is in place, you need to pick an existing overall look and feel of the house. This is picking the themes of the site. Unlike the style of the house, the theme can be changed pretty easily from one to another. Just like you can paint one side of the house in a different color, you can tweak specific things by customize the theme, but requires some coding skill.

The house comes with a kitchen. By kitchen I mean the blog functionality. Using that you can create dishes (blogs) and share with the world. However, if you want more rooms (static pages), you can create them. Again, you can create manually by write html of the page but a better way to do it is use some visual editor, like just order the carpet, curtain, furnitures and put them in the room.

Tools like visual editor or other ones that helps you create blogs, or help you SEO (make your house easy to find in a crowed city) are called plugins.

Got a better analogy? Leave a comment below.

Backup strategy of this blog when it is running on docker.

So you can see that I had some backup issue recently and lost couple posts. Honestly, I still have not figured out what was the problem so I’m going to lay out my backup stack and if you can spot anything wrong, leave a comment below.

First, this blog is running on 2 dockers. One for WordPress. It just have the PHP code running, with some credentials as environment variables in the memory, so nothing need to be back up. The other docker runs the MariaDB (basically MySQL), which contains all the the data for the post, comments, users, etc. That is what I want to backup.

backup1

If you read my previous posts about docker, you know that it is mostly read only. So if you start a docker running a database on it, it will lose the data whenever you restart it. Unless you use data volume. So my MariaDB docker has a volume for the folders: /etc/mysql and /var/lib/mysql , where MariaDB will save the data. And they will stay the same when I stop/start the container.

backup2

However, I did not map the data volume to a physical folder on the host. I’m not 100% where the data volume actually is but I think if I destroy the container, the data volume will be gone. I did not map it to a physical folder because the machine is just a VM on Microsoft Azure. I don’t think there is much value mapping it and save it there since the machine can be gone easily.

What I did was to backup the data volume to Amazon S3. I originally used this dockup project and its docker image. But it has some problems with restore. So I used a fork of it here. The fork did not have a docker image, so I built it here. I created 2 services on tutum.co and tell them to mount the data volume from my MariaDB. With one click of a button, I can easily backup/restore my MariaDB data volume as a zip file onto/from Amazon S3.

backup3 backup4

But I did not want to do this manually every time I post something. (Or maybe I should have) So I set up a cron job like thing to back it up every day. I used https://github.com/sunshineo/tutum-schedule which is my fork from https://github.com/alexdebrie/tutum-schedule . What I changed was make the project run a non stop python process directly instead of supervisord. I discussed this with alexdebrie in the tutum slack channel and we all agreed that this is a good idea.

So there you are, something worked pretty well when I tested end of May. I did not post must the month of June but the backup did happen everyday. However, the backed up file never changed even after I made some post. This is a mystery to me.

Docker: compare to virtual machine part 1

They are very different of course. Docker is a type of Linux container (operating-system level software container specific to Linux). Pictures on this page is very helpful on understanding it. Now I’m going to try to explain it to you in my own words and examples hopefully you can relate to.

docker-containers-vs-vms

docker-containers-vs-vms

There are many computers operating systems. In personal computer the top 3 are: Windows, OSX and Linux. Each with many different version and flavors. Linux as a personal computer OS is mostly among developers and after Ubuntu provided a half decent GUI. Most people, when buy a personal computer, got either a PC or a Mac and it comes with Windows and OSX correspondingly. BTW, these 2 camps don’t like each other. Here is the proof.

However, occasionally because of necessity (say you need to use a software only available on the other OS) or boredom, you may need/want to be able to use both Windows and OSX. But you only have one machine, either a PC or a Mac. One way to solve this is to multi-boot. Both PC and Mac support this and you can install both Windows and OSX on a PC or Mac. It is just very hard to do. And you cannot use both operating systems at the same time.

A much easier solution would be use a virtual machine. You need to use a virtual machine software such as Virtual Box, VMWare, etc. There is a long list of such software. Using them, you can create multiple virtual machines easily and relatively faster compare to multi-boot. And you can (actually have to) use them together with the original OS on your Mac or PC. You will notice that each virtual machine you create will be saved by the virtual machine software as a big file. You can easily backup the file, move the file, even copy paste the file to create more identical virtual machines. You will also notice that the virtual machines are slow. Sometimes unbearable depends on how good is your computer. Well have a look at the picture about virtual machines on this page again. Of course it is slow. An operation need to go into the software, then the guest OS, then the virtual machine software (hypervisor in the picture, I’ll explain later), then the host OS, and then the hardware. Especially the guest OS, who has no idea that it is actually running in a virtual machine could really slow things down and it is also why the virtual machine file is very big.

Now in the server world, Linux is dominant, Windows has a meaningful percentage and I’ve never heard of Mac being used as a server though I’m sure it can. And the data center machines are huge monsters compare to personal computers. Take a look at this PowerEdge R920 from Dell. All the cloud service providers such as Amazon EC2, Microsoft Azure, etc all have quite a few data centers with a lot of powerful machines in them. And they rent them out to a lot of other companies to run their services or applications. A service may only need a tiny fraction of one server’s power, and it may suddenly need a lot more (say a e-commerce website during Christmas) and then go back to need very little. The solution is that these huge servers will start and stop virtual machines on demand. They use SSD to do disk I/O faster to help start the virtual machine faster and give it better performance. And also a lot of effort are put into make good hypervisors so the performance is good. Hypervisor is a broader concept including virtual machine software you use on your personal computer but also include specialized software and sometimes are firmware or even hardware that are used in data center with high performance.

(to be continued)

Docker: How did I find it.

I don’t really remember the exact moment I saw Docker for the first time. It’s roughly September 2014 and I definitely did not dig into it for another several months, not love at first sight. Later, my team moved production code from one data center to another in November 2014, I talked about it in one of my previous post. The process was quite chaotic and tedious. Also, we moved from CentOS to Ubuntu at the same time. During that process, I learnt a lot IT stuff that I did not have much previous exposure to as a developer: chef, puppet, salt, etc.
After took a vacation in the holiday season, early 2015 (also around when I dropped the ball on this blog) I started researching (i.e Google) for better technology/system for configure production environment and easy deployment. That’s when docker came back to me and I took a deeper look. The first time I saw the blue whale icon on www.docker.com I though of the famous fail whale from twitter. Their interactive tutorial is pretty cool and fun. I went through it so fast the first time that I missed things and had to do it twice. But I’m totally OK with it, that’s how fun it is. I highly suggest you go do it now if you are a developer or comfortable using a command line window.

docker whale

docker

Twitter fail whale

Twitter fail whale

Docker: I’m pretty sure it’s the future.

Docker is hot and getting hotter. Here is the Google Trend screen shot as of today, May 16th, 2015.

Docker Google Trends result May 16th, 2015 Screen shot

Docker Google Trends result May 16th, 2015


I don’t know when are you reading this post, so here is the real time Google Trend embedded. I bet it is going to be even better than the screen shot.

There are a few good reasons why it is so popular. There are tons of articles on the internet about benefit of docker. I’m not an expert on docker yet, just a normal user so I will only talk about my experience using it and the benefits from my perspective as a developer in my next couple blog posts.

Coding interview: Detect if a linked list has loop

To interviewers: Please do not use this question to interview other people. Because it is basically useless. You probably won’t get much information out of it. If the candidate answered it properly, it is most likely they have seen this question before. And if you have not seen this before, read on.

First, let’s define the linked list node:

You do not need to do the getter and setter but it just shows that you have good coding behavior.

Wrong approaches to the question include but not limited to:
1. Have one pointer keep moving to the next node and see if comes back to the header. The loop does not have to make a complete circle. For example: 1->2->3->4->5->2
2. Go through the list and save node in a hash map (or other type of storage) to see if we get a node that we have already seen before. This mostly works but won’t on huge linked list. You are guaranteed to ran out of memory.

Correct solution: Have 2 pointers, one go through the list at speed 1 node per loop and one go through the list at speed 2 nodes per loop.

Now, have while (true) in production code is not a good idea, but OK here for it simplifies the logic and make code easier to understand.

One small twist you can add to the question is: Fix the loop in the linked list. This boils down to a simple math problem. The solution is after the 2 pointers meet, move one of them back to the head, start to move the again but this time 1 node per loop for both of them. They will meet again and where they meet is the node that the linked list is looped back to so you can break there.

Again, do NOT use this question to interview other people. It really does not tell you much about the candidate.