How To Lose $172,222 Per Second For 45 Minutes

Tyler Durden's picture

Originally posted at the Python Sweetness blog:

This is probably the most painful bug report I’ve ever read, describing in glorious technicolor the steps leading to Knight Capital’s $460m trading loss due to a software bug that struck on August 1, 2012, effectively bankrupting the company.

The tale has all the hallmarks of technical debt in a huge, unmaintained, bitrotten codebase (the bug itself was due to code that hadn’t been used for almost 9 years), and a really poor, undisciplined dev-ops story.

Highlights:

To enable its customers’ participation in the Retail Liquidity Program (“RLP”) at the New York Stock Exchange, which was scheduled to commence on August 1, 2012, Knight made a number of changes to its systems and software code related to its order handling processes. These changes included developing and deploying new software code in SMARS. SMARS is an automated, high speed, algorithmic router that sends orders into the market for execution. A core function of SMARS is to receive orders passed from other components of Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity, send one or more representative (or “child”) orders to external venues for execution.

 

13. Upon deployment, the new RLP code in SMARS was intended to replace unused code in the relevant portion of the order router. This unused code previously had been used for functionality called “Power Peg,” which Knight had discontinued using many years earlier. Despite the lack of use, the Power Peg functionality remained present and callable at the time of the RLP deployment. The new RLP code also repurposed a flag that was formerly used to activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag was set to “yes,” the new RLP functionality—rather than Power Peg—would be engaged.

 

14. When Knight used the Power Peg code previously, as child orders were executed, a cumulative quantity function counted the number of shares of the parent order that had been executed. This feature instructed the code to stop routing child orders after the parent order had been filled completely. In 2003, Knight ceased using the Power Peg functionality. In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.
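To see why that fill counter matters, here is a minimal sketch. The names (`route_child_orders`, `track_fills`) are hypothetical and nothing here claims to resemble Knight's actual code. It shows a router that keeps sending child orders until a cumulative count says the parent is filled. Treating the moved counter as "never credited" is an assumption about the failure mode, and the loop is capped only so the demo terminates.

```python
def route_child_orders(parent_qty, child_qty, track_fills, max_children=1_000):
    """Send child orders until the parent is filled; return how many were sent.

    track_fills=False models the cumulative counter no longer being wired
    into this code path (an assumption, not a documented detail).
    max_children is a safety cap for this demo only.
    """
    filled = 0  # cumulative quantity executed so far
    sent = 0    # number of child orders sent to market
    while filled < parent_qty and sent < max_children:
        sent += 1                  # one more child order goes out
        if track_fills:
            filled += child_qty    # the execution is credited to the parent
    return sent


print(route_child_orders(1_000, 100, track_fills=True))   # 10
print(route_child_orders(1_000, 100, track_fills=False))  # 1000 (hits the cap)
```

With the counter in place the router stops after ten children; without it, the only thing that stops it is the artificial cap.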

 

15. Beginning on July 27, 2012, Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.

 

16. On August 1, Knight received orders from broker-dealers whose customers were eligible to participate in the RLP. The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server. As a result, this server began sending child orders to certain trading centers for execution.

 

19. On August 1, Knight also received orders eligible for the RLP but that were designated for pre-market trading. SMARS processed these orders and, beginning at approximately 8:01 a.m. ET, an internal system at Knight generated automated e-mail messages (called “BNET rejects”) that referenced SMARS and identified an error described as “Power Peg disabled.” Knight’s system sent 97 of these e-mail messages to a group of Knight personnel before the 9:30 a.m. market open. Knight did not design these types of messages to be system alerts, and Knight personnel generally did not review them when they were received.
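For what it's worth, turning a flood of identical error e-mails into an alert does not take much. A rough sketch follows. The "Power Peg disabled" text and the count of 97 come from the order, while the function name, the threshold, and the other message are invented for illustration.

```python
from collections import Counter


def errors_to_escalate(messages, threshold=5):
    """Return the error texts seen at least `threshold` times.

    `threshold` is an arbitrary illustrative value, not a recommendation.
    """
    counts = Counter(messages)
    return sorted(text for text, n in counts.items() if n >= threshold)


# 97 identical rejects before the open, plus some unrelated noise.
inbox = ["Power Peg disabled"] * 97 + ["disk usage 71%"] * 2
print(errors_to_escalate(inbox))  # ['Power Peg disabled']
```

Anything this returns before the market opens is a reason to page a human, not something to leave sitting in a mailbox.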

It gets better:

27. On August 1, Knight did not have supervisory procedures concerning incident response. More specifically, Knight did not have supervisory procedures to guide its relevant personnel when significant issues developed. On August 1, Knight relied primarily on its technology team to attempt to identify and address the SMARS problem in a live trading environment. Knight’s system continued to send millions of child orders while its personnel attempted to identify the source of the problem. In one of its attempts to address the problem, Knight uninstalled the new RLP code from the seven servers where it had been deployed correctly. This action worsened the problem, causing additional incoming parent orders to activate the Power Peg code that was present on those servers, similar to what had already occurred on the eighth server.
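The flag reuse from paragraph 13, the missed server from paragraph 16, and the rollback just described can be sketched together. Everything here is illustrative: the handler names, the `flag` field, and the order contents are made up, and the real SMARS code is not public.

```python
def handle_new_build(order):
    # New release: the flag now selects the RLP path.
    return "RLP" if order.get("flag") == "yes" else "normal"


def handle_old_build(order):
    # Old release: the very same flag still selects the dormant Power Peg path.
    return "POWER_PEG" if order.get("flag") == "yes" else "normal"


# Seven servers got the new build, one was missed.
cluster = [handle_new_build] * 7 + [handle_old_build]

order = {"symbol": "XYZ", "flag": "yes"}
results = [server(order) for server in cluster]
print(results.count("RLP"), results.count("POWER_PEG"))  # 7 1

# Rolling all servers back to the old build while flagged orders keep
# arriving sends every one of them down the Power Peg path.
rolled_back = [handle_old_build] * 8
print({server(order) for server in rolled_back})  # {'POWER_PEG'}
```

The point of the sketch is that a repurposed flag has no safe meaning while two releases coexist: the same order is correct on one machine and dangerous on the next.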

The remainder of the document is definitely worth a read, but notably it recommends new human processes to avoid a similar tragedy. None of the ops failures leading to the bug were really about humans; they were most likely due to horrible deployment scripts and woeful production monitoring. What kind of cowboy shop doesn’t even have monitoring to ensure a cluster is running a consistent software release!? Not to mention deployment scripts that check return codes.
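A consistency check of the kind being asked for is a few lines. This is only a sketch: the host names and release strings are placeholders, and how each server's release is actually collected (a status endpoint, a file hash, a package query) is deliberately left out.

```python
def inconsistent_hosts(versions, expected):
    """Return hosts whose reported release differs from the expected one.

    `versions` maps host name -> release identifier (e.g. a build hash).
    A host reporting None (no answer) counts as inconsistent too.
    """
    return sorted(host for host, v in versions.items() if v != expected)


# Hypothetical cluster: seven hosts on the new release, one left behind.
reported = {f"smars{i}": "rlp-release" for i in range(1, 8)}
reported["smars8"] = "old-release"

bad = inconsistent_hosts(reported, expected="rlp-release")
if bad:
    print("refusing to open for trading; stale hosts:", bad)
```

Run before the open and after every deploy, a check like this turns "nobody noticed the eighth server" into a hard stop.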

We can also only hope that references to "written test procedures" for the unused code refer to systematic tests, as opposed to a 10-year-old wiki page.

The best part is the fine: $12m, despite the resulting audit also revealing that the system was systematically sending naked shorts.

NoDebt's picture

I feel sorry for these guys.  If they were big enough they could have gotten all the trades reversed, like Goldman does.

Sucks being them, I guess.

Moral of the story:  be systemically important before you screw up.

chump666's picture

The NY Fed is Goldman Sachs

ZerOhead's picture

"Sucks being them, I guess"


Good thing it probably wasn't their own money they were losing... unless you count the lost bonus checks that is...

Godisanhftbot's picture

Doubt it. The only trades that are reversed are those that are clearly erroneous. These trades were all kosher and at the market.

ebworthen's picture

Kosher, as long as Rabbi Blankfein blesses them.

fourchan's picture

wasnt algoagogo an elvis flick?

icanhasbailout's picture

The US business community has an incredible inability to recognize the value of programming talent, so it ends up with idiots who run bankrupt-the-company type risks without anyone else knowing about it.

If your average businessperson knew how much power IT people in their individual discretion really have over their businesses, they'd shit their pants.

NickVegas's picture

I regularly have their business on the pointy end of my keyboard. The business people think IT is a commodity, until me, or someone sitting in my seat, shows them who really is in charge. Outsource all your IT to India, or China, or Backmanistan, or La La Land, I'm all for it, cause you gonna pay up when you come back begging, if you make it back. It has always been labor vs. capital, but now my labor is capital, hmmmm, the knowledge worker rises, as the parasites look for new ways to deceive, and enslave.

ninja247's picture

Awesome article

Ignatius's picture

This is nothing.  I once misplaced my keys and searched for them for almost an hour.

Charles Nelson Reilly's picture

Jaime Dimon printed this one out, taped it to his chest and had an Asian hooker shit on him/it for pleasure.

NoDebt's picture

I don't care who you are, that's funny right there.

Joebloinvestor's picture

HAHAHAHA

I bet there was a guy who got a bonus that cut out all the human shit.

 

Probably works for HHS.

RafterManFMJ's picture

Hello! I am Samuel welcome to tex support! How am I about to be helping you?

0b1knob's picture

This sounds a little like "the dog ate my homework".

More interesting would be a report on who MADE the $460 million that they "lost". It's a zero sum game after all in the short term. Was someone aware of the bug, or even perhaps planted it, and decided to get rich rather than get a good employee evaluation?

ZerOhead's picture

Unless you're a banker it's far easier to crash and burn something than it is to make it a success. Either outcome can make you money if you know what you are doing.

Enormous fortunes no doubt will be made when our sociopathic CEO's figure that little shortcut out...

Harbanger's picture

"Unless you're a banker it's far easier to crash and burn something than it is to make it a success."

 

It's always easier to crash and burn something than it is to build something of value, that's the long history of failed collectivism.  But what are you sayin?  Everyone except Bankers wants to crash and burn something? 

icanhasbailout's picture

He's saying this could have always been a planned destruction, with its principal movers having large positions on the other side of the trade, and the "rogue computer code" being nothing more than a convenient excuse for the destruction. Someone DID get the money that Knight lost - who got it and how much? $460m is more than enough to be worth pulling off a major scam for.

Harbanger's picture

Techies rule the modern world!  There's rogue computer code everywhere.  Thanks for clarifying what Zerohead is sayin.  What do you think of Alan Gaysons recent attack on the Tea Party?

zhandax's picture

Hate to ruin a good conspiracy, but read article III (1.) of the administrative proceeding.  "While processing 212 small retail orders...".  Their retail clients were your average piss-poor stock pickers.  Besides, any insider who wanted to take the other side of the trade would have to know where it was routed.  Since the purpose of an order routing system is to find the best bid/offer, the duplicates would have been routed all over the street.

Harbanger's picture

Yes.  Your particular answer is in the details.  Keep searchin.......

What do you think of Alan Gaysons recent attack on the Tea Party?

zhandax's picture

Grayson's a harvard lawyer.  I wouldn't trust him to take my garbage out.

aerojet's picture

I doubt it.  I work with the same kind of people.  Too much complacency and not enough people who act like real engineers.  Just a lot of buffoons who don't know how to do their jobs or what the consequences for failure are.  Here's a hint:  You can get away with being a fuckup for as long as it doesn't put your company out of business.  Then you stop getting a paycheck altogether.

NickVegas's picture

You are so right, sir. It is the pink elephant in the room. Who was on the other side of that trade, baby? How to win by losing. Shucks, there goes 460 million down the durn drain. I'm sorry I programmed it wrong. 

aerojet's picture

It's a nano-traded bot world now, the other side of the trade was all the algos whose programmers and ops people had their shit wired tight.

StandardDeviant's picture

Who was on the other side?  Everyone and anyone who saw and hit their dodgy bids/offers.  Sheesh...

lasvegaspersona's picture

def: "systemically important":  dangerous, too risky to be allowed to continue, capable of annihilation, also...well connected, major contributor, and also so large as to be able to change the very definition of itself at will. New Mahem Dictionary

Yen Cross's picture

  How does one obtain a copy of that tainted "Power Peg" code? I can think of a few .gov assholes that I would like to anonymously share it with. ;-)

 

zorba THE GREEK's picture

Yen... That code is not available at this time. It is being used for the Obamacare website.

dunce's picture

Were they both written by the same bunch of penis puffers?

Grinder74's picture

Did they get the proper licensing or just steal it like the rest of their software?

Atomizer's picture

If you download the new Apple Maverick software, your purchasing buying habits will assist us in developing new semi-periphery manufacturing zones. We appreciate the text/email questionnaire feedback sent to your smartphone.

geotrader's picture

Knight!  Where the deal gets done.

freedogger's picture

That they didn't automate the deploy, i.e., that someone had to manually copy files, is a red flag. Automate that shit and automate the rollback. Test the deployment scripts. Even written procedures for deployment are a sign of failure; documents are seldom read and followed, especially if the process is repeated often. No, this has to be automatic, fully vetted and documented in executable and well tested code. The testing of the deployment code should be automated. Changes in source control trigger automated tests to run. Failing tests or code without tests stop the release cold until they are addressed.

The problem with many companies is that executive technology decision makers are really not at all qualified or competent enough to make the decisions they make. Why is this? The usual nepotism, fraternal and rotten from the top down answers apply. 

aerojet's picture

I've been trying to make the people in my company do this for two years now.  They still don't get it.  

I can't get eight servers set up exactly the same way no matter what I do, they always fuck up something!  It isn't even that fucking hard to automate.

The problem is that most companies are HR organized, in other words, not organized to be successful and efficient.

freedogger's picture

If you can't change your company, change your company.

vagrant up!

StandardDeviant's picture

Absolutely, "freedogger".  They set up seven machines properly, but screwed up the eighth?!  This sort of thing absolutely needs to be deployed and verified automatically.

Oh, and the bit about "repurposing" parts of the configuration?  That's just sloppiness, and laziness, and is begging for Murphy to come and pay you a visit.

4 wheel drift's picture

no worries mate.....

 

compared to oblamascare........

 

just a flesh wound.....    :)

Element's picture

More like a paper cut really.

RaceToTheBottom's picture

I bet this place used domain specific tech outfits, that were familiar with NYC WS.  This reduces the amount of firms that work there and makes the ones that do complacent.  I don't think financial outfits have the intelligence to hire the best, reduces their bonuses....

Stuck on Zero's picture

The problem was traced down to their lack of skilled Cobol programmers.

 

zhandax's picture

The SEC complaint was due to their lack of paid congressdouches.

infinity8's picture

It's all mediocracy, all the time, everywhere now.

 

Bruce Flea's picture

Here's a bit of code that would have saved $460 million:

/* */

Problem solved.

cwwang's picture

LOL agreed!

Whoever does build and deploy on that code base needs to be hung.  I am surprised they tried to debug the problem in real time while it was occurring. Complete operations issue, since it seems they didn't even have a roll-back plan.

I wouldn't even blame it on the software or the programmer completely!

 

 

New World Chaos's picture

I'm surprised they didn't just pull out the ethernet cables as soon as they realized something had gone horribly wrong.  Were they afraid of losing money during their 45 minutes of running around like headless chickens?  Bwahahahah

jballz's picture

 

I see titties.

Hey I want to learn to write code and make a shitload of money. What's the fastest way to get from the above looking like titties to making a shitload of money writing code?

Thanks for your guidance!