How To Lose $172,222 Per Second For 45 Minutes
Originally posted at the Python Sweetness blog.
This is probably the most painful bug report I’ve ever read, describing in glorious technicolor the steps leading to Knight Capital’s $460m trading loss due to a software bug that struck late last year, effectively bankrupting the company.
The tale has all the hallmarks of technical debt in a huge, unmaintained, bitrotten codebase (the bug itself due to code that hadn’t been used for almost 9 years), and a really poor, undisciplined dev-ops story.
Highlights:
To enable its customers’ participation in the Retail Liquidity Program (“RLP”) at the New York Stock Exchange, which was scheduled to commence on August 1, 2012, Knight made a number of changes to its systems and software code related to its order handling processes. These changes included developing and deploying new software code in SMARS. SMARS is an automated, high speed, algorithmic router that sends orders into the market for execution. A core function of SMARS is to receive orders passed from other components of Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity, send one or more representative (or “child”) orders to external venues for execution.
13. Upon deployment, the new RLP code in SMARS was intended to replace unused code in the relevant portion of the order router. This unused code previously had been used for functionality called “Power Peg,” which Knight had discontinued using many years earlier. Despite the lack of use, the Power Peg functionality remained present and callable at the time of the RLP deployment. The new RLP code also repurposed a flag that was formerly used to activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag was set to “yes,” the new RLP functionality—rather than Power Peg—would be engaged.
14. When Knight used the Power Peg code previously, as child orders were executed, a cumulative quantity function counted the number of shares of the parent order that had been executed. This feature instructed the code to stop routing child orders after the parent order had been filled completely. In 2003, Knight ceased using the Power Peg functionality. In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.
15. Beginning on July 27, 2012, Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.
16. On August 1, Knight received orders from broker-dealers whose customers were eligible to participate in the RLP. The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server. As a result, this server began sending child orders to certain trading centers for execution.
19. On August 1, Knight also received orders eligible for the RLP but that were designated for pre-market trading. SMARS processed these orders and, beginning at approximately 8:01 a.m. ET, an internal system at Knight generated automated e-mail messages (called “BNET rejects”) that referenced SMARS and identified an error described as “Power Peg disabled.” Knight’s system sent 97 of these e-mail messages to a group of Knight personnel before the 9:30 a.m. market open. Knight did not design these types of messages to be system alerts, and Knight personnel generally did not review them when they were received.
It gets better:
27. On August 1, Knight did not have supervisory procedures concerning incident response. More specifically, Knight did not have supervisory procedures to guide its relevant personnel when significant issues developed. On August 1, Knight relied primarily on its technology team to attempt to identify and address the SMARS problem in a live trading environment. Knight’s system continued to send millions of child orders while its personnel attempted to identify the source of the problem. In one of its attempts to address the problem, Knight uninstalled the new RLP code from the seven servers where it had been deployed correctly. This action worsened the problem, causing additional incoming parent orders to activate the Power Peg code that was present on those servers, similar to what had already occurred on the eighth server.
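Paragraph 14 of the order describes a cumulative quantity function that counted executed shares and stopped child orders once the parent was filled. A minimal sketch of that idea follows. This is not Knight's code: `ParentOrder`, `route_children` and the `execute` callable are hypothetical names invented for illustration. The point is only that the fill count and the loop that checks it live in the same place, so moving one without retesting the other breaks the stop condition.

```python
from dataclasses import dataclass


@dataclass
class ParentOrder:
    """A parent order plus the running count of shares filled by its children."""
    symbol: str
    quantity: int      # total shares requested
    filled: int = 0    # cumulative shares executed so far

    @property
    def remaining(self) -> int:
        return self.quantity - self.filled


def route_children(parent: ParentOrder, child_size: int, execute) -> int:
    """Send child orders until the parent is filled; return how many were sent.

    `execute` is a hypothetical stand-in for a trading venue: it takes
    (symbol, shares) and returns the number of shares actually filled.
    """
    sent = 0
    while parent.remaining > 0:
        shares = min(child_size, parent.remaining)
        filled = execute(parent.symbol, shares)
        if filled <= 0:
            break  # nothing filled; stop rather than spin forever
        # The fill is counted inside the same loop that tests `remaining`.
        parent.filled += filled
        sent += 1
    return sent
```

If `parent.filled` were never updated here, `remaining` would never reach zero and the loop would keep sending children for as long as the venue kept filling them.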
The remainder of the document is definitely worth a read, but notably it recommends new human processes to avoid a similar tragedy. None of the ops failures leading to the bug were really down to humans; they were most likely due to horrible deployment scripts and woeful production monitoring. What kind of cowboy shop doesn’t even have monitoring to ensure a cluster is running a consistent software release!? Not to mention deployment scripts that check return codes.
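As a rough illustration of the monitoring that was missing, here is a minimal sketch of a release consistency check. It assumes each server can report the files it is running. The function names and the `smars1` to `smars8` host names are made up for the example, and nothing here reflects Knight's actual tooling.

```python
import hashlib


def release_fingerprint(files: dict[str, bytes]) -> str:
    """Hash a release's file names and contents into one hex digest."""
    digest = hashlib.sha256()
    for name in sorted(files):          # sorted so ordering cannot change the result
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()


def mismatched_hosts(expected: str, reported: dict[str, str]) -> list[str]:
    """Return the hosts whose reported fingerprint differs from the expected one."""
    return sorted(host for host, fp in reported.items() if fp != expected)


# Seven hosts got the new release; the eighth still runs the old one.
new = release_fingerprint({"router.bin": b"rlp"})
old = release_fingerprint({"router.bin": b"power peg"})
reported = {f"smars{i}": new for i in range(1, 8)}
reported["smars8"] = old
print(mismatched_hosts(new, reported))
```

Comparing against the fingerprint of the intended release, rather than against whatever most hosts happen to run, means the check still works when the majority of the cluster is wrong.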
We can also only hope that references to "written test procedures" for the unused code refer to systematic tests, as opposed to a 10 year old wiki page.
The best part is the fine: $12m, despite the resulting audit also revealing that the system was systematically sending naked shorts.
I feel sorry for these guys. If they were big enough they could have gotten all the trades reversed, like Goldman does.
Sucks being them, I guess.
Moral of the story: be systemically important before you screw up.
The NY Fed is Goldman Sachs
"Sucks being them, I guess"
Good thing it probably wasn't their own money they were losing... unless you count the lost bonus checks that is...
Doubt it. The only trades that are reversed are those that are clearly erroneous. These trades were all kosher and at the market.
Kosher, as long as Rabbi Blankfein blesses them.
wasn't algoagogo an Elvis flick?
I wouldn't even give those yo-yos a second chance.
The US business community has an incredible inability to recognize the value of programming talent, so it ends up with idiots who run bankrupt-the-company type risks without anyone else knowing about it.
If your average businessperson knew how much power IT people in their individual discretion really have over their businesses, they'd shit their pants.
I regularly have their business on the pointy end of my keyboard. The business people think IT is a commodity, until me, or someone sitting in my seat, shows them who really is in charge. Outsource all your IT to India, or China, or Backmanistan, or La La Land, I'm all for it, cause you gonna pay up when you come back begging, if you make it back. It has always been labor vs. capital, but now my labor is capital, hmmmm, the knowledge worker rises, as the parasites look for new ways to deceive, and enslave.
Awesome article
This is nothing. I once misplaced my keys and searched for them for almost an hour.
Jaime Dimon printed this one out, taped it to his chest and had an Asian hooker shit on him/it for pleasure.
I don't care who you are, that's funny right there.
HAHAHAHA
I bet there was a guy who got a bonus that cut out all the human shit.
Probably works for HHS.
Hello! I am Samuel welcome to tex support! How am I about to be helping you?
.
This sounds a little like "the dog ate my homework".
More interesting would be a report on who MADE the $460 million that they "lost". It's a zero-sum game after all in the short term. Was someone aware of the bug, or even perhaps planted it, and decided to get rich rather than get a good employee evaluation?
Unless you're a banker it's far easier to crash and burn something than it is to make it a success. Either outcome can make you money if you know what you are doing.
Enormous fortunes no doubt will be made when our sociopathic CEOs figure that little shortcut out...
"Unless you're a banker it's far easier to crash and burn something than it is to make it a success."
It's always easier to crash and burn something than it is to build something of value, that's the long history of failed collectivism. But what are you sayin? Everyone except Bankers wants to crash and burn something?
He's saying this could have always been a planned destruction, with its principal movers having large positions on the other side of the trade, and the "rogue computer code" being nothing more than a convenient excuse for the destruction. Someone DID get the money that Knight lost - who got it and how much? $460m is more than enough to be worth pulling off a major scam for.
Techies rule the modern world! There's rogue computer code everywhere. Thanks for clarifying what Zerohead is sayin. What do you think of Alan Grayson's recent attack on the Tea Party?
Hate to ruin a good conspiracy, but read article III (1.) of the administrative proceeding. "While processing 212 small retail orders...". Their retail clients were your average piss-poor stock pickers. Besides, any insider who wanted to take the other side of the trade would have to know where it was routed. Since the purpose of an order routing system is to find the best bid/offer, the duplicates would have been routed all over the street.
Yes. Your particular answer is in the details. Keep searchin.......
What do you think of Alan Grayson's recent attack on the Tea Party?
Grayson's a harvard lawyer. I wouldn't trust him to take my garbage out.
I doubt it. I work with the same kind of people. Too much complacence and not enough people who act like real engineers. Just a lot of buffoons who don't know how to do their jobs or what the consequences for failure are. Here's a hint: You can get away with being a fuckup for as long as it doesn't put your company out of business. Then you stop getting a paycheck altogether.
You are so right, sir. It is the pink elephant in the room. Who was on the other side of that trade, baby? How to win by losing. Shucks, there goes 460 million down the durn drain. I'm sorry I programmed it wrong.
It's a nano-traded bot world now, the other side of the trade was all the algos whose programmers and ops people had their shit wired tight.
Who was on the other side? Everyone and anyone who saw and hit their dodgy bids/offers. Sheesh...
def: "systemically important": dangerous, too risky to be allowed to continue, capable of annihilation, also...well connected, major contributor, and also so large as to be able to change the very definition of itself at will. New Mayhem Dictionary
How does one obtain a copy of that tainted "Power Peg" code? I can think of a few .gov assholes that I would like to anonymously share it with. ;-)
Yen... That code is not available at this time. It is being used for the Obamacare website.
lol good one. :-)
Were they both written by the same bunch of penis puffers?
Did they get the proper licensing or just steal it like the rest of their software?
If you download the new Apple Maverick software, your purchasing buying habits will assist us in developing new semi-periphery manufacturing zones. We appreciate the text/email questionnaire feedback sent to your smartphone.
Knight! Where the deal gets done.
That they didn't automate the deploy, ie, someone has to manually copy files is a red flag. Automate that shit and automate the rollback. Test the deployment scripts. Even written procedures for deployment are a sign of failure, documents are seldom read and followed, especially if the process is repeated often. No, this has to be automatic, fully vetted and documented in executable and well tested code. The testing of the deployment code should be automated. Changes in source control trigger automated tests to run. Failing tests or code without tests stop the release cold until it is addressed.
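A minimal sketch of what "automate the deploy and automate the rollback" could look like, with every exit code checked. The names `deploy_all`, `install` and `rollback` are hypothetical. In a real script each step might wrap `subprocess.run(...)` and return its `returncode`, but the steps are injected here so the logic can be tested without touching any servers.

```python
from typing import Callable

# A step takes a host name and returns a process-style exit code (0 = success).
Step = Callable[[str], int]


def deploy_all(hosts: list[str], install: Step, rollback: Step) -> list[str]:
    """Install on every host; on the first failure, roll back and raise.

    Returns the list of updated hosts only if every install succeeded.
    """
    done: list[str] = []
    for host in hosts:
        code = install(host)
        if code != 0:
            # Undo the hosts already updated, newest first, so the
            # cluster ends up on one consistent release again.
            for updated in reversed(done):
                if rollback(updated) != 0:
                    raise RuntimeError(f"rollback failed on {updated}")
            raise RuntimeError(f"install failed on {host} (exit code {code})")
        done.append(host)
    return done
```

This sketch leaves the failed host itself alone. A fuller version would probably also verify or restore that host, since a half-finished install can leave it in an unknown state.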
The problem with many companies is that executive technology decision makers are really not at all qualified or competent enough to make the decisions they make. Why is this? The usual nepotism, fraternal and rotten from the top down answers apply.
I've been trying to make the people in my company do this for two years now. They still don't get it.
I can't get eight servers set up exactly the same way no matter what I do, they always fuck up something! It isn't that fucking hard to automate, even.
The problem is that most companies are HR organized, in other words, not organized to be successful and efficient.
If you can't change your company, change your company.
vagrant up!
Absolutely, "freedogger". They set up seven machines properly, but screwed up the eighth?! This sort of thing absolutely needs to be deployed and verified automatically.
Oh, and the bit about "repurposing" parts of the configuration? That's just sloppiness, and laziness, and is begging for Murphy to come and pay you a visit.
no worries mate.....
compared to oblamascare........
just a flesh wound..... :)
More like a paper cut really.
I bet this place used domain-specific tech outfits that were familiar with NYC WS. This reduces the number of firms that work there and makes the ones that do complacent. I don't think financial outfits have the intelligence to hire the best; it reduces their bonuses....
The problem was traced down to their lack of skilled Cobol programmers.
The SEC complaint was due to their lack of paid congressdouches.
It's all mediocracy, all the time, everywhere now.
werd
Here's a bit of code that would have saved $460 million:
/* */
Problem solved.
LOL agreed!
Whoever does build and deploy on that code base needs to be hung. I am surprised they tried to debug the problem "real-time" while the problem was occurring. Seems like a complete operations issue, since they didn't even have a roll-back plan.
I wouldn't even blame it on the software or the programmer completely!
I'm surprised they didn't just pull out the ethernet cables as soon as they realized something had gone horribly wrong. Were they afraid of losing money during their 45 minutes of running around like headless chickens? Bwahahahah
I see titties.
Hey I want to learn to write code and make a shitload of money. What's the fastest way to get from the above looking like titties, to making a shitload of money writing code?
Thanks for your guidance!
Start programming something. Then keep learning. Looking at titties is a good sign, you likely have an aptitude for writing code. It's all related, just keep looking at titties and build on that. Tell yourself you will get the titties someday if you only can code just a little better - this has worked for tens of thousands of us....
Annnnd, "repurposed a flag"??!!! What the fuck? Are you that tight on space in this million dollar transaction data that you can't have say, an 8-bit field with a magic number in there?
OP=5, Power Peg
OP=6, that stupid new thing we're tryin' out
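The OP=5 / OP=6 idea above, sketched out: give each behaviour its own value and refuse retired ones outright instead of recycling an old flag. `OrderMode`, `RETIRED` and `parse_mode` are invented names for illustration, and the numeric values are just the ones from the comment.

```python
from enum import Enum


class OrderMode(Enum):
    """Hypothetical routing modes; a retired value is never reused."""
    POWER_PEG = 5   # retired years ago, kept only so old messages are recognised
    RLP = 6


RETIRED = {OrderMode.POWER_PEG}


def parse_mode(raw: int) -> OrderMode:
    """Map a raw field value to a mode, rejecting unknown or retired values."""
    try:
        mode = OrderMode(raw)
    except ValueError:
        raise ValueError(f"unknown order mode {raw}") from None
    if mode in RETIRED:
        raise ValueError(f"order mode {mode.name} is retired")
    return mode
```

With this shape, a server still running the old code sees a value it does not know (6) and fails loudly, rather than treating a recycled "yes" as a request for the old behaviour.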
Format C:\
After you write the code, make sure you back up your work with the above function. It will help you protect yourself against the NSA bitches.
Powder Keg would have been a more apropos name.
Someday this will happen to everyone. Like a black hole opening up in cyberspace and vacuuming every last cent into Ed Snowden's Swiss bank account. Then he will buy Marc Rich's place, donate all his proceeds to Amnesty International, and we will all live in a blissful anarchist utopia.
I'm stoned, never mind I just didn't understand very much of this post. They maybe should hire SOME FUCKING HUMANS and stop the madness.
Revenge of the Nerds, Bitchez.
One positive coming out of this is that whoever was on the receiving end of this system did get some liquidity out of it, as its platform is called "Retail Liquidity Program".
Does this tell us where the talent for the Obamacare website came from?
"Nailing Jelly to a Tree",
"The Mythical Man-Month"
"Software Design Patterns"
"Cargo-Cult Programming" (Not actually a book)
Idjuts in "management" treat software the same as manufacturing widgets in a factory -- tain't so!
What amazed me was that this buying-high-and-selling-low code was running for 45 minutes. Bad enough that old code (which must have been screwed up anyway to be buying high and selling low) was put into production without rigorous testing, but the fact that traders, programmers or mid-office risk management could not see in real time what the losses were after the first 2 to 5 minutes is a major fuck up.
They should have just pulled the power cords from the sockets in the data center. Save your firm and be able to apologize and live to execute another day.
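The software version of pulling the power cord is a kill switch on realized losses. A minimal sketch, assuming some upstream component feeds it profit-and-loss changes as trades settle. `LossKillSwitch` and its limit are hypothetical, and a real risk system would track far more than one number.

```python
class LossKillSwitch:
    """Trips once cumulative realized loss reaches a fixed limit."""

    def __init__(self, max_loss: float) -> None:
        self.max_loss = max_loss
        self.pnl = 0.0
        self.tripped = False

    def record(self, pnl_change: float) -> None:
        """Add a realized profit (positive) or loss (negative)."""
        self.pnl += pnl_change
        if self.pnl <= -self.max_loss:
            self.tripped = True   # latches: later gains do not re-arm it

    def allow_order(self) -> bool:
        """Routers would call this before sending each order."""
        return not self.tripped
```

The switch latches on purpose: once it trips, only a human decision should re-enable trading, even if later fills happen to pull the running total back above the limit.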
Classic "oups" moment ;)
And I thought I suck hard at programming....
I've yet to meet a high-level IT manager who knew anything at all about code.
I'll tell you something about code and coders. My last gig was COO of a small company ($15 mill) doing software development. Fairly frequently, in my daily stand-up meetings, I would find out about a coder who was stuck on some problem or other. I would drop by, ask him how it was going and ask to look at the code he was having a problem with. I always got the rolling eyes and a shrug of the shoulders (he had to let me know I was a clueless manager). Very often, after about a minute or two, I would spot the problem and tell him the solution. Usually it was Java code, which I have never developed in, but it is similar enough to C. I would repeat my advice to have someone else look at the code when you are stuck. Coding problems like that are often simple logic errors, faulty assumptions (that variable will never have that value), and, most frequently, simple typos. Incidents like that gave me the reputation of being a coding genius. I'm not, but it is a common delusion of programmers, when there is a problem, to disbelieve that the problem is something trivial and obvious, because coders tend to think they are tech geniuses. No matter how many times I would tell folks to get another pair of eyes when they are stuck, they would just keep trying, staring at the screen and believing that the problem was more complicated than it was.
Once we had a problem in a system with 1 million users. Out of the 1 million there were exactly six who consistently had the system fail to update their records. The team had basically given up on solving the problem and would tell the help desk that they were working on it, but never did until I became aware of it. In five minutes I discovered that all of those users had a particular data field of 1 meg in size. No one else did. Then I found out that there was a hard limit (configuration) of 1 meg on the message size. I told the developer to up the limit to 2 megs and the problem disappeared.
As clueless as many high-level IT managers are, almost every coder is afflicted with the attitude problem that the failure has nothing to do with them and everything to do with some mysterious other factor.
That is really good advice to talk about blocks or being stuck when programming. Often just discussing it before the other party chimes in is enough to solve it.
I would only disagree with your thoughts that it is an attitude problem or the "I'm a tech genius and I can solve it". More often, it's insecurity and fear of being discovered as a technical fraud in a certain area. Or it is just a case of tunnel vision - getting stuck inside the problem and not being able to see it from different angles.
The rolling of eyes at a manager when asked to really explain an issue is typically the frustration that "it will take more minutes than your ADHD mind can handle so why should I even try". (I think you are a rare exception to most managers; hopefully the team became more confident in vetting the complex problems with you as time went on.)
Stubbornness or persistence is a good trait for programming; the opposite, a person that is always pestering others to do the mental lifting for them, is pretty bad for a team. It takes a balance and years to know when you need to step back and ask for help.
Often, you become so focused on solving the problem that it becomes very hard to see outside of this. As you get more frustrated, you tend to add more complexity to try and solve it. Sometimes getting up and walking away from the computer to do something else really helps. A fresh set of eyes (or ears) is often the best. We have a rule to pair up if you spend more than 20 minutes on a technical problem that you can't solve. It is a sign of maturity when programmers are willing to talk about their code with other programmers and even business analysts, end users and managers.
On a complex software project, the whole team has to own the code and the features. If someone is stuck for a long time, the rest of the team needs to swarm and own it. We openly discuss the features and issues, business logic and design. For every feature I write, it feels like ideas from two or three others on my team make their way into the finished result. Same can be said for everyone on the team. They spend about an hour each for every three days of work I do on a feature.
We don't measure performance at an individual's level, the whole team is evaluated on the sum of finished features every three weeks. We try to figure out ways to be better and faster at regular intervals.
It is scary that there are too many coders and other IT types out there that think that crap just randomly happens. Follow the code.
In this article, the company intended to reuse an existing flag for something. They were intentionally using the flag but somehow didn't test out the use of the flag.
Abaco, I spent my working life in exactly the same kind of world you describe. You provided a very good description of the problem.
I recently wrote an email to a friend explaining why I say fuck the critics who blame Sebelius for the failure of the IT project that was supposed to launch Obamacare.
The degradation that has occurred over the years in the IT industry is beyond appalling. Once upon a time 90% of IT people were roughly all using the same toolset. Cobol - CICS - I came from a BAL shop - what I called God's Language then (kiddingly of course).
The unholy matrix of technologies today, together with the almost complete evaporation of standards and procedures and the devolution of things to the point where, for all practical purposes, unit testing occurs when systems are implemented into production, makes me wonder how much longer things will go on before something so catastrophic occurs that civilization will grind to a halt.
People simply have no idea of the schlock that has been programmed and installed to handle critical functions society depends upon. Where I worked we had a multi-million dollar project finally go belly up when I began flooding the Indians who had been contracted with a daily torrent of problems I discovered doing testing. Despair began to set in with them when things got to a point where their hardcoded definitions of storage work areas that should have been coded to be dynamic temp storage requests finally drove them over the cliffs of insanity.
If ever there was an enterprise where the weakest links could hopelessly damage a project, computer systems have to be near the top of the list. When I used to have to provide the work estimates for various pieces of a project I remember one time I assigned the most difficult parts to who I knew was the best programmer with a 500 hour estimate and the easiest part to an inferior programmer with a 200 hour estimate. The hard part was finished in less than 200 and the easier work wasn't complete after 500.
I can't imagine the gawdawful pressure that must have existed wherever the Obamacare deadline was trying to be met. When deadlines are set by people who have no conception of the technical problems that must be solved to achieve a bug free complex system, you have nothing but a recipe for disaster on your hands.
The fun part of this is that there is nobody to blame. Ahahahaha. This shit is so complex that NO ONE COULD SEE IT COMING. Maybe some future chicken shit deployment will trigger WW3!
Maybe? - I think it's a certainty. It's mind bottling how many shitty systems are out there.
In the book "Command and Control" on the failures and near WWIII accidents of SAC, the guy taking over our nuclear deterrent back in the 90's gave credit to god for us not all being dead. He couldn't otherwise explain all the bullets we dodged.
RECORD BONUSES FOR FAILURE!