PostgreSQL vs MS SQL Server

0. What is this all about?

I work as a data analyst in a global professional services firm (one you have certainly heard of). I have been doing this for about a decade. I have spent that decade dealing with data, database software, database hardware, database users, database programmers and data analysis methods, so I know a fair amount about these things. I frequently come into contact with people who know very little about them - although some of them don't realise it. Over the years I have discussed the question of PostgreSQL vs MS SQL Server many, many times. A well-known principle in IT says: if you're going to do it more than once, automate it. This document is my way of automating that conversation.

Unless otherwise stated, I am referring to PostgreSQL 9.3 and MS SQL Server 2014, even though my experience of MS SQL Server is with versions 2008 R2 and 2012 - in the interests of fairness and relevance I want to compare the latest version of PostgreSQL with the latest version of MS SQL Server. Where I have made claims about MS SQL Server I have done my best to check that they apply to version 2014 by consulting Microsoft's own documentation, although, for reasons I will get to, I have also had to rely largely on Google, Stack Overflow and the users of the internet. I know it isn't scientifically rigorous to do a comparison like this when I don't have equal experience with both databases, but this is not an academic exercise - it's a real-world comparison. I have done my honest best to get my facts about MS SQL Server right, but we all know it is impossible to rid the whole internet of bullshit. If I find out I've got something wrong, I'll fix it.

I am comparing the two databases from the point of view of a data analyst. Maybe MS SQL Server kicks PostgreSQL's ass as an OLTP backend (though I doubt it), but that's not what I'm writing about here, because I'm not an OLTP developer/DBA/sysadmin. Finally, there is an email address at the top right. Please use it if you wish; I will do my best to respond.

DISCLAIMER: all the subjective opinions in here are strictly my own.

1. Why PostgreSQL is much better than MS SQL Server

Oops, spoiler alert. This section is a comparison of the two databases in terms of features relevant to data analytics.

1.1. CSV support

CSV is the de facto standard way of moving structured (i.e. tabular) data around. All RDBMSes can dump data into proprietary formats that nothing else can read, which is fine for backups, replication and the like, but no use at all for migrating data from system X to system Y. A data analytics platform has to be able to look at data from a wide variety of systems and produce outputs that can be read by a wide variety of systems. In practice, this means it needs to be able to ingest and excrete CSV quickly, reliably, repeatably and painlessly. Let's not understate this: a data analytics platform which cannot handle CSV robustly is a broken, useless liability. PostgreSQL's CSV support is top notch.
The COPY TO and COPY FROM commands support the spec outlined in RFC 4180 (which is the closest thing there is to an official CSV standard) as well as a multitude of common and not-so-common variants and dialects. These commands are fast and robust. When an error occurs, they give helpful error messages. Importantly, they will not silently corrupt, misunderstand or alter your data. If PostgreSQL says your import worked, then it worked properly. The slightest whiff of a problem and it abandons the import and throws a helpful error message. (This may sound fussy or inconvenient, but it is actually an example of a well-established design principle, and it makes sense: would you rather find out your import went wrong now, or a month from now when your client complains that your results are off?)

MS SQL Server can neither import nor export CSV. Most people don't believe me when I tell them this. Then, at some point, they see it for themselves. Usually they observe something like:

- MS SQL Server silently truncating a text field
- MS SQL Server's text-encoding handling going wrong
- MS SQL Server throwing an error message because it doesn't understand quoting or escaping (contrary to popular belief, quoting and escaping are not exotic extensions to CSV; they are fundamental concepts in literally every human-readable data serialisation specification. Don't trust anyone who doesn't know what these things are)
- MS SQL Server exporting broken, useless CSV
- Microsoft's horrendous documentation. How did they manage to overcomplicate something as simple as CSV?

This is especially baffling because CSV parsers are trivially easy to write (I wrote one in C and plumbed it into PHP a year or two ago, because I wasn't happy with its native CSV handling. The whole thing took maybe 100 lines of code and three hours - two of which were spent getting to grips with SWIG, which was new to me at the time).

If you don't believe me, download this correctly formatted, standards-compliant UTF-8 CSV file and use MS SQL Server to calculate the average string length (i.e. number of characters) of the last column in that file (it has 50 columns). Go on, try it. (The answer you are looking for is exactly 183.895.) Naturally, working this out is trivially easy in PostgreSQL - in fact, the most time-consuming bit is creating a table with 50 columns to hold the data.

Poor understanding of CSV seems to be endemic at Microsoft; that file will break Access and Excel too. Sad but true: some database programmers I know recently spent a lot of time and effort writing Python code which sanitises CSV so that MS SQL Server can import it. They weren't able to avoid changing the actual data in the process, though. That is as crazy as spending a fortune on Photoshop and then having to write custom code to get it to open a JPEG, only to find that the image has been altered slightly.
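To make the CSV point concrete, here is a minimal sketch of a round trip in PostgreSQL. The table and file names (people, /tmp/people.csv) are invented for illustration, not taken from the challenge file above:

    -- A table to hold the incoming data
    CREATE TABLE people (
        id    integer,
        name  text,
        city  text
    );

    -- Ingest a CSV file with a header row; any malformed row aborts the whole import
    COPY people FROM '/tmp/people.csv' WITH (FORMAT csv, HEADER true);

    -- Excrete query results as CSV, with quoting and escaping handled per RFC 4180
    COPY (SELECT city, count(*) FROM people GROUP BY city)
      TO '/tmp/city_counts.csv' WITH (FORMAT csv, HEADER true);

(Server-side file paths need appropriate privileges; psql's \copy is the client-side equivalent for everyday use.)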
1.2. Ergonomics

Every data analytics platform worth mentioning is Turing complete, which means, give or take, that any of them can do anything any other can do. There is no such thing as "you can do X in software A but you can't do X in software B". You can do anything in anything - all that varies is how hard it is. Good tools make the things you need to do easy; bad tools make them hard. That is what it always boils down to. (This is all conceptually true, if not literally true - for example, no RDBMS I know of can render 3D graphics. But any of them can emulate any calculation a GPU can perform.)

PostgreSQL is clearly written by people who actually care about getting things done. MS SQL Server feels like it was written by people who never have to use MS SQL Server to achieve anything. Here are a few examples:

PostgreSQL supports DROP TABLE IF EXISTS, which is the smart and obvious way of saying "if this table doesn't exist, do nothing, but if it does, get rid of it". In MS SQL Server you have to test for the object's existence yourself before dropping it (both idioms are sketched at the end of this section). Yes, it's only one extra line of code, but notice the mysterious second parameter to the OBJECT_ID function: you have to replace it with 'V' to drop a view, and it's 'P' for a stored procedure. I have not learned the various letters for all the different types of database object (why should I have to?). Notice also that the table name is repeated unnecessarily, so if your concentration slips for a moment it is dead easy to test for one name and drop another. This is a reliable source of annoying, time-wasting errors.

PostgreSQL supports DROP SCHEMA CASCADE, which drops a schema and all the database objects inside it. This is very, very important for a robust analytics delivery methodology, where tear-down-and-rebuild is the underlying principle of repeatable, auditable, collaborative analytics work. There is no such facility in MS SQL Server: you have to drop all the objects in the schema manually, and in the right order, because if you try to drop an object on which another object depends, MS SQL Server simply throws an error. That gives you an idea of how tedious the process can be.

PostgreSQL supports CREATE TABLE AS (also sketched below). Among other things, this means you can highlight everything except the first line of such a statement and execute it as a plain SELECT, which is a useful and common task when developing SQL code. In MS SQL Server, table creation from a query uses SELECT ... INTO instead, so to run the bare SELECT you have to comment out or remove the INTO bit. Yes, commenting out a line is easy; that is not the point. The point is that in PostgreSQL you can perform this simple task without modifying the code, and in MS SQL Server you can't, and that introduces another potential source of bugs and annoyances.

In PostgreSQL you can execute as many SQL statements as you like in a single batch; as long as each statement is terminated with a semicolon, you can execute whatever combination of statements you like. For automated batch processes, repeatable data builds and output tasks, this is important functionality. In MS SQL Server, a CREATE PROCEDURE statement cannot appear halfway through a batch of SQL statements. There is no good reason for this; it is just an arbitrary limitation. It means that extra manual steps are often required to execute a large batch of SQL. Manual steps increase risk and reduce efficiency.
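A minimal sketch of the two idioms discussed above (table and column names are invented for illustration):

    -- PostgreSQL: idempotent drop, then create a table directly from a query
    DROP TABLE IF EXISTS my_table;

    CREATE TABLE city_counts AS
    SELECT city, count(*) AS n
    FROM people
    GROUP BY city;

    -- MS SQL Server equivalents: existence must be tested by hand,
    -- and the new table is named by an INTO buried in the middle of the query
    IF OBJECT_ID('dbo.my_table', 'U') IS NOT NULL
        DROP TABLE dbo.my_table;

    SELECT city, count(*) AS n
    INTO dbo.city_counts
    FROM dbo.people
    GROUP BY city;

Highlighting everything after the first line of the PostgreSQL version gives you a runnable SELECT; doing the same with the T-SQL version does not, because the INTO clause sits inside the query.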
PostgreSQL supports the RETURNING clause, which allows UPDATE, INSERT and DELETE statements to return values from the rows they affected. This is elegant and useful. MS SQL Server has the OUTPUT clause, which requires a separate table variable definition to work. This is clunky and inconvenient and forces the programmer to create and maintain unnecessary boilerplate code.

PostgreSQL supports dollar-quoted string literals. This is extremely useful for generating dynamic SQL, because (a) it lets you avoid tedious and unreliable manual quoting and escaping when literal strings are nested, and (b) since text editors and IDEs tend not to recognise the dollar quotes as string delimiters, syntax highlighting remains functional even inside dynamic SQL code.

PostgreSQL lets you use procedural languages simply by submitting code to the database engine: you write procedural code in Python or Perl or R or JavaScript or any of the other supported languages (see below) right next to your SQL, in the same script. This is convenient, fast, maintainable, easy to review, easy to reuse and so on. In MS SQL Server you can either use the clunky, slow, awkward T-SQL procedural language, or you can use a .NET language to create an assembly and load it into the database. This means your code lives in two different places, and you have to go through a series of GUI-based manual steps to alter it. It makes packaging up all your work in one place harder and more error-prone.

And there are plenty more examples out there. Each of these things, in isolation, may seem like a relatively minor niggle, but the overall effect is that getting real work done in MS SQL Server is significantly harder and more error-prone than in PostgreSQL, and data analysts spend valuable time and energy on workarounds and manual processes instead of concentrating on the actual problem.

Update: it was pointed out to me that one genuinely useful feature MS SQL Server has and PostgreSQL lacks is the ability to declare variables in plain SQL scripts. PostgreSQL can't do this. I wish it could, because there are plenty of uses for such a feature.
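Minimal sketches of the three features just mentioned; all table, column and variable names here are invented for illustration:

    -- PostgreSQL: RETURNING hands back values from the affected rows
    UPDATE accounts
       SET balance = balance - 100
     WHERE id = 42
    RETURNING id, balance;

    -- PostgreSQL: a dollar-quoted literal needs no escaping of the quotes inside it
    SELECT $sql$SELECT 'no escaping needed here'$sql$ AS dynamic_snippet;

    -- MS SQL Server: script-level variables, the one feature missed above
    DECLARE @cutoff date = '2014-01-01';
    SELECT count(*) FROM dbo.accounts WHERE opened_on >= @cutoff;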
1.3. You can run PostgreSQL on Linux, BSD etc. (and Windows, of course)

Anyone who follows developments in IT knows that cross-platform is a thing. Cross-platform support is arguably the killer feature of Java, which is actually a somewhat lumpy, ugly programming language and yet is enormously successful, influential and widespread nonetheless. Microsoft no longer enjoys the monopoly it once had on the desktop, thanks to the rise of Linux and Apple. IT infrastructures are increasingly heterogeneous thanks to the flexibility of cloud services and easy access to powerful virtualisation technology. Cross-platform software is about giving the user control over their infrastructure. (At work I currently manage several PostgreSQL databases, some on Windows and some on Ubuntu Linux. My colleagues and I freely move code and database dumps between them. We use Python and PHP because they, too, work on both operating systems. It all just works.)

Microsoft's policy is, and always has been, vendor lock-in. They don't open-source their code; they don't provide cross-platform versions of their software; they even invented a whole ecosystem, .NET, designed to draw a hard line between Microsoft users and non-Microsoft users. This is good for them, because it safeguards their revenue. It is bad for you, the user, because it restricts your choices and creates unnecessary work for you. (Update: a few days after I published this, Microsoft made me look like a prat by announcing that it was open-sourcing .NET. That's a great step, but let's not crack open the Bollinger just yet.)

Now, this is not a Linux vs Windows document, although I'm sure I'll end up writing one of those at some point. Suffice it to say that for real IT work, Linux (and the UNIX-like family: Solaris, BSD and so on) leaves Windows in the dust. UNIX-like operating systems dominate the server market, cloud services, supercomputing (where they hold a near-monopoly) and technical computing, and with good reason - these systems are designed by techies for techies. As a result, they trade superficial ease of use for enormous power and flexibility. A proper UNIX-like operating system is not just a nice command line - it is an ecosystem of programs, utilities, functionality and support that makes getting real work done efficient and enjoyable. A competent Linux hacker can achieve in a single throwaway line of Bash script a task which would be arduous and time-consuming in Windows. (Example: the other day I was looking through a friend's film collection and he remarked that the total number of files in the filesystem seemed high considering how many films he had, and wondered whether he had accidentally copied a large folder somewhere. I did a recursive count of files per folder for him with a one-line Bash command. The whole thing took about a minute to write and a second to run. It confirmed that some of his folders had a problem and told him which ones they were. How would you do that in Windows?)

For data analytics, an RDBMS doesn't exist in a vacuum; it is part of a tool stack. Therefore its environment matters. MS SQL Server is restricted to Windows, and Windows is simply a poor analytics environment.

1.4. Procedural language features

This is a biggie. Pure declarative SQL is good at what it was designed for - relational data manipulation and querying. You quickly reach its limits if you try to use it for more involved analytical processes, such as complex interest calculations, time series analysis and general algorithm design. SQL database vendors know this, so almost all SQL databases implement some kind of procedural language. This allows a database user to write imperative-style code for more complex or fiddly tasks.

PostgreSQL's procedural language support is exceptional. It is impossible to do it justice in a short space, but here is a sample of the goods. Any of these procedural languages can be used to write stored procedures and functions, or simply dropped into a block of code to be executed inline.

PL/pgSQL: this is PostgreSQL's native procedural language. It's like Oracle's PL/SQL, but more modern and feature-complete.

PL/V8: the V8 JavaScript engine from Google Chrome is available inside PostgreSQL.
This engine is stable, feature-packed and absurdly fast - often approaching the execution speed of compiled, optimised C. Combine that with PostgreSQL's native support for the JSON data type (see below) and you have ultimate power and flexibility in a single package. Even better, PL/V8 supports global (i.e. cross-function-call) state, allowing the user to selectively cache data in RAM for fast random access. Suppose you need to use 100,000 rows of data from table A against each of 1,000,000 rows of data from table B. In traditional SQL you either join those tables (resulting in a 100bn-row intermediate table, which will kill all but the most immense server) or do something like a scalar subquery (or, worse, cursor-based nested loops), resulting in crippling I/O load if the query planner doesn't read your intentions properly. In PL/V8 you simply cache table A in memory and run a function against each row of table B, which gives you RAM-quality access (negligible latency, no random-access penalty, no non-volatile I/O load) to the 100k-row table. I did this on a real piece of work recently - my PostgreSQL/PLV8 code was about 80 times faster than the MS T-SQL solution, and the code was much smaller and more maintainable. Because it took about 23 seconds instead of half an hour to run, I was able to go through 20 run/test/modify cycles in an hour, resulting in fully working, properly tested, bug-free code. Look here for more details on this. (All those run/test/modify cycles were only possible thanks to DROP SCHEMA CASCADE and the freedom to run CREATE FUNCTION statements in the middle of a statement batch, as explained above.)

PL/Python: you get full Python inside PostgreSQL. Python 2 or Python 3, take your pick, and yes, you get the enormous ecosystem of libraries for which Python is justifiably famous. Fancy running an SVM from scikit-learn or some arbitrary-precision arithmetic from gmpy2 in the middle of a SQL query? No problem.

PL/Perl: Perl has been out of fashion for a while now, but its versatility earned it a reputation as the Swiss Army knife of programming languages. In PostgreSQL you have full Perl as a procedural language.

PL/R: R is the de facto standard statistical programming environment in academia and data science, and with good reason - it is free, robust, fully featured and backed by an enormous library of high-quality plugins and add-ons. PostgreSQL lets you use R as a procedural language.

Java, Lua, sh, Tcl, Ruby and PHP are also supported as procedural languages in PostgreSQL.

C: doesn't quite belong in this list because you have to compile it separately, but it is worth a mention. In PostgreSQL it is very easy to create functions which execute compiled, optimised C (or C++ or assembler) in the database backend. This is a power-user feature which provides unrivalled speed and fine control of memory management and resource usage for tasks where performance is critical. I have used it to implement a complex, stateful payment-processing algorithm operating on a million rows of data per second - and that was on a desktop PC.
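As a small illustration of what "procedural code right next to your SQL" looks like, here is a hedged sketch of a PL/V8 function (it assumes the plv8 extension is installed; the function name and logic are invented, not taken from the work described above):

    -- One-off setup, once per database
    CREATE EXTENSION IF NOT EXISTS plv8;

    -- A scalar function written in plain JavaScript, defined and used inline
    CREATE OR REPLACE FUNCTION compound(principal numeric, rate numeric, periods int)
    RETURNS numeric AS $$
        // Ordinary JavaScript, running inside the database engine
        var total = principal;
        for (var i = 0; i < periods; i++) {
            total = total * (1 + rate);
        }
        return total;
    $$ LANGUAGE plv8;

    SELECT compound(1000, 0.01, 12);   -- balance after 12 monthly periods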
MS SQL Server's built-in procedural language (part of its T-SQL extension to SQL) is clunky, slow and feature-poor. It is also prone to subtle errors and bugs, as Microsoft's own documentation sometimes acknowledges. I have never met a database user who likes the T-SQL procedural language.

What about the fact that you can build assemblies in .NET languages and then use them in MS SQL Server? This doesn't count as procedural language support, because you can't submit that code to the database engine directly. Manageability and ergonomics are critical. Dropping Python code inline into your database query is easy and convenient; firing up Visual Studio, managing projects and throwing DLL files around (all in GUI-based processes which can't properly be scripted, version-controlled, automated or reviewed) is awkward, error-prone and unscalable. In any case, this mechanism is restricted to .NET languages.

1.5. Native regular expression support

Regular expressions (regexen or regexes) are as fundamental to analytics work as arithmetic - they are the first choice (and often the only choice) for a huge variety of text-processing tasks. A data analytics tool without regex support is like a bicycle without a saddle - you can still use it, but it's painful.

PostgreSQL has smashing out-of-the-box support for regex. Some examples (sketched in code below): find all rows starting with a repeated digit followed by a vowel; get the first isolated hex string occurring in a field; break a string on whitespace and return each fragment in a separate row; case-insensitively find all the words in a string with at least 10 letters.

MS SQL Server has LIKE, SUBSTRING, PATINDEX and so on, which are not comparable to proper regex support (if you doubt this, try implementing the above examples with them). There are third-party regex libraries for MS SQL Server; they're just not as good as PostgreSQL's support, and the need to obtain and install them separately adds admin overhead. Note also that PostgreSQL's extensive procedural language support also gives you access to several other regex engines and their various features - e.g. Python's regex library provides the added power of positive and negative lookbehind assertions. This is in keeping with the general theme of PostgreSQL giving you all the tools you need to actually get things done.
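A hedged sketch of the four regex tasks listed above, using PostgreSQL's regex operator and functions (table and column names are invented for illustration):

    -- Rows whose text starts with a repeated digit followed by a vowel ("11a...", "22e...")
    SELECT * FROM t WHERE f ~ '^(\d)\1[aeiou]';

    -- First isolated hex string occurring in a field
    SELECT substring(f from '\m[0-9a-f]+\M') FROM t;

    -- Break a string on whitespace, returning one fragment per row
    SELECT regexp_split_to_table(f, '\s+') FROM t;

    -- Case-insensitively find all words with at least 10 letters
    SELECT regexp_matches(f, '\w{10,}', 'gi') FROM t;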
1.6. Custom aggregate functions

This is a feature which, technically, is offered by both PostgreSQL and MS SQL Server. The implementations differ hugely, though. In PostgreSQL, custom aggregates are convenient and simple to use, resulting in fast problem-solving and maintainable code. (The mechanism is sketched in simplified form at the end of this section.) A custom aggregate is specified in terms of an internal state and a way to modify that state as new values are pushed into the aggregate function. In the interest-calculation case I have in mind, we start each customer off with a zero balance and no interest accrued, and on each day we accrue interest appropriately and account for payments and withdrawals. We compound the interest on the 1st of every month. Elegant, eh? Note that the aggregate accepts an ORDER BY clause (because, unlike SUM, MAX and MIN, this aggregate is order-dependent), and that PostgreSQL provides operators for extracting values from JSON objects. So, in about 28 lines of code we've created the framework for monthly compounding interest on bank accounts and used it to calculate final balances. If features are to be added to the methodology (e.g. interest rates depending on debit/credit balance, detection of exceptional circumstances), this belongs right in the transition function and can be written in a language suited to implementing complex logic. (Tragic side note: I have seen large organisations spend tens of thousands of pounds over weeks of work trying to achieve the same thing with poorer tools.)

MS SQL Server, on the other hand, makes this absurdly difficult. Incidentally, the examples in the second link are for implementing a simple string concatenation aggregate. Note the huge amount of code and gymnastics required to implement this simple function (which PostgreSQL provides out of the box, by the way - probably because it's useful). MS SQL Server also doesn't allow an ordering to be specified in the aggregate, which makes the feature useless for my kind of work: with MS SQL Server, the order of string concatenation is random, so the results of a query using this function are non-deterministic (they can change from run to run) and the code will not pass a quality review. The lack of ordering support also breaks code like the interest calculation example above. As far as I can tell, a custom aggregate is the only way to do this in MS SQL Server. (It is actually possible to force MS SQL Server to do a deterministic string concatenation aggregation in pure SQL, but you have to abuse the RECURSIVE query functionality to do it. Although an interesting academic exercise, this results in slow, unreadable, unmaintainable code and is not a realistic solution.)
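The original 28-line interest aggregate is not reproduced here; as a much smaller hedged sketch of the same mechanism (a state value, a transition function, and an ORDER BY inside the aggregate call), consider an order-sensitive concatenation aggregate with invented names:

    -- Transition function: how the state changes as each new value arrives
    CREATE OR REPLACE FUNCTION concat_step(state text, nextval text)
    RETURNS text AS $$
        SELECT CASE WHEN state = '' THEN nextval
                    ELSE state || ', ' || nextval END;
    $$ LANGUAGE sql IMMUTABLE;

    -- The aggregate itself: an initial state plus the transition function
    CREATE AGGREGATE ordered_concat(text) (
        SFUNC    = concat_step,
        STYPE    = text,
        INITCOND = ''
    );

    -- Order-dependent aggregation, with the ordering declared inside the call
    SELECT account_id, ordered_concat(txn_note ORDER BY txn_date)
    FROM transactions
    GROUP BY account_id;

In the interest example described above, the state would instead be a JSON object holding the running balance and accrued interest, and the transition function could be written in PL/V8 rather than SQL.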
1.7. Unicode support

Long gone are the days when ASCII was universal, "character" and "byte" were fungible terms and foreign (from an Anglocentric point of view) text was an exotic exception. Proper international language support is no longer optional. The solution to all of this is Unicode. There are a lot of misconceptions about Unicode out there. It's not a character set, it's not a code page, it's not a file format and it's nothing whatsoever to do with encryption. An exploration of how Unicode works is fascinating but beyond the scope of this document - I heartily recommend Googling it and working through a few examples. The key points about Unicode that are relevant to database functionality are:

- Unicode-encoded text (for our purposes this means either UTF-8 or UTF-16) is a variable-width encoding. In UTF-8 a character can take one, two, three or four bytes to represent. In UTF-16 it is either two or four. This means that operations like taking substrings and measuring string lengths need to be Unicode-aware to work properly.
- Not every sequence of bytes is valid Unicode. Manipulating valid Unicode without knowing it's Unicode is likely to produce something that is not valid Unicode.
- UTF-8 and UTF-16 are not compatible. If you take one file of each type and concatenate them, you will (probably) end up with a file that is neither valid UTF-8 nor valid UTF-16.
- For text that is mostly ASCII, UTF-8 is about twice as space-efficient as UTF-16.

PostgreSQL supports UTF-8. Its CHAR, VARCHAR and TEXT types are, by default, UTF-8, meaning they will only accept UTF-8 data, and all the transformations applied to them, from string concatenation to regular expression searching, are UTF-8-aware. It all just works.

MS SQL Server 2008 does not support UTF-16; it supports UCS-2, a deprecated subset of UTF-16. What this means is that most of the time it will look like it's working fine, and occasionally it will silently corrupt your data. Because it interprets text as a string of wide (i.e. 2-byte) characters, it will happily chop a 4-byte UTF-16 character in half. At best, this results in corrupted data. At worst, something else in your toolchain will break badly and you'll have a disaster on your hands. Apologists for MS are quick to point out that this is unlikely because it would require the data to contain something outside Unicode's basic multilingual plane. That completely misses the point. A database's sole purpose is to store, retrieve and manipulate data. A database which can be broken by putting the wrong data into it is as useless as a router that breaks if you download the wrong file.

MS SQL Server versions since 2012 have supported UTF-16 properly, provided you make sure you select a UTF-16-compatible collation for your database. It is baffling that this is (a) optional and (b) only implemented as of 2012. Better late than never, I suppose.

1.8. Data types that work properly

A common misconception is that all databases have the same types - INT, CHAR, DATE and so on. This is not true. PostgreSQL's type system is really useful and intuitive, free of the annoyances which introduce bugs or slow work down and, as usual, apparently designed with productivity in mind. MS SQL Server's type system, by comparison, feels like beta software. It can't touch the feature set of PostgreSQL's type system, and it is littered with traps waiting to ensnare the unwary user. Let's take a look.

CHAR, VARCHAR and family

PostgreSQL: the docs actively encourage you to simply use the TEXT type. This is a high-performance, UTF-8-validated text storage type which stores strings of up to 1 GB in size. It supports every text operation PostgreSQL is capable of: simple concatenation and substringing; regex searching, matching and splitting; full-text search; casting; character transformation; and so on. If you have text data, stick it in a TEXT field and carry on. Because everything in a TEXT field (or, for that matter, in CHAR and VARCHAR fields) must be UTF-8, there is no problem with encoding incompatibility. Because UTF-8 is the de facto universal text encoding, converting text to and from it is easy and reliable. Because UTF-8 is a superset of ASCII, that conversion is often trivially easy or entirely unnecessary. It all just works.

MS SQL Server: it's a pretty sorry story. The TEXT and NTEXT types exist and stretch to 2 GB. Bafflingly, though, they don't support casting.
Also, don't use them, says MS - they will be removed in a future version of MS SQL Server. You are supposed to use CHAR, VARCHAR and their N-prefixed versions instead. Unfortunately, VARCHAR(MAX) has poor performance characteristics and VARCHAR(8000) (the next-largest size, for some reason) tops out at 8,000 bytes. (It's 4,000 characters for NVARCHAR.) Remember how PostgreSQL's insistence on a single text encoding per database makes everything run smoothly? Not so in MS land: "As with earlier versions of SQL Server, data loss during code page translations is not reported." In other words, MS SQL Server might corrupt your data, and you won't know about it until something else goes wrong. That, quite simply, is a dealbreaker. A data analytics platform which might silently change, corrupt or lose your data is an enormous liability. Consider the absurdity of forking out for a server with expensive ECC RAM as a defence against data corruption caused by cosmic rays, and then running software on it which might corrupt your data anyway.

Date and time types

PostgreSQL: you get DATE, TIME, TIMESTAMP and TIMESTAMP WITH TIME ZONE, all of which do exactly what you would expect. They also have fantastic range and precision, supporting microsecond resolution from the 5th millennium BC to almost 300 millennia in the future. They accept input in a wide variety of formats, and the last one has full support for time zones. They can be converted to and from Unix time, which is very important for interoperability with other systems. They can also take the special values infinity and -infinity. This is not a metaphysico-theologico-philosophical statement, but a hugely useful semantic construction. For example, set a user's password expiry date to infinity to denote that they do not have to change their password. The normal way of doing this is to use NULL or some date far in the future, but those are clumsy hacks - they both involve putting inaccurate information in the database and writing application logic to compensate. What happens when a developer sees NULL or 3499-12-31? If you're lucky, he knows the secret handshake and won't be confused by it. If not, he assumes the date is unknown or that it really does refer to the 4th millennium, and you have a problem. The cumulative effect of hacks, workarounds and kludges like this is unreliable systems, unhappy programmers and increased business risk. Useful semantics like infinity and -infinity let you say what you mean and write consistent, readable application logic.

They also support the INTERVAL type, which is so useful it gets its own section right after this one. Casting and converting date and time types is easy and intuitive - you can cast any of them to TEXT, and the to_char and to_timestamp functions give you ultimate flexibility, allowing conversion in both directions using format strings (sketched below). As usual, it just works. As a data analyst, I care very much about a database's date-handling ability, because dates and times tend to occur in a multitude of different formats and they are usually critical to the analysis itself.
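A hedged sketch of the date facilities described above (the users table and its columns are invented for illustration):

    -- Format a timestamp as text...
    SELECT to_char(timestamp '2001-02-03 16:05:06', 'DD Mon YYYY, HH24:MI');
    -- result: 03 Feb 2001, 16:05

    -- ...and parse text back into a timestamp, both directions driven by a format string
    SELECT to_timestamp('03 Feb 2001, 16:05', 'DD Mon YYYY, HH24:MI');

    -- Unix time round trip
    SELECT extract(epoch FROM now()), to_timestamp(1400000000);

    -- 'infinity' as a meaningful, queryable value
    UPDATE users SET password_expiry = 'infinity' WHERE username = 'admin';
    SELECT * FROM users WHERE password_expiry > now();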
MS SQL Server: dates can only have positive 4-digit years, so they are restricted to 0001 AD to 9999 AD. They do not support infinity and -infinity. They do not support interval types, so date arithmetic is tedious and clunky. You can convert them to and from Unix time, but it's a hack involving adding seconds to the Unix epoch, 1970-01-01T00:00:00Z, which you therefore have to know and be willing to hardcode into your application. Date conversion deserves a special mention, because even by MS SQL Server's shoddy standards it's bloody awful. The CONVERT function takes the place of PostgreSQL's to_char and to_timestamp, but it works with magic numbers: you're simply expected to know that 126 is the code for converting strings in a particular format to a datetime. MSDN provides a table of these magic numbers. I didn't give the same example as for PostgreSQL because I couldn't find a magic number corresponding to the right format for "Saturday 03 Feb 2001". If someone gave you data with such dates in it, I guess you'd have to do some string manipulation (pity the string manipulation facilities in MS SQL Server are almost non-existent...).

INTERVAL

PostgreSQL: the INTERVAL type represents a period of time, such as 30 microseconds or 50 years. It can also be negative, which may seem counterintuitive until you remember that the word "ago" exists. PostgreSQL knows about "ago" too, in fact, and will accept strings like '1 day ago' as interval values (this will be internally represented as an interval of -1 days). Interval values let you do intuitive date arithmetic and store time durations as first-class data values. They work exactly as you expect and can be freely cast and converted to and from anything which makes sense.

MS SQL Server: no support for interval types.

Arrays

PostgreSQL: arrays are supported as a first-class data type, meaning fields in tables, variables in PL/pgSQL, parameters to functions and so on can be arrays. Arrays can contain any data type you like, including other arrays. This is very, very useful. Here are some of the things you can do with arrays:

- store the results of function calls with arbitrarily many return values, such as regex matches
- represent a string as integer word IDs, for use in fast text-matching algorithms
- aggregate multiple data values across groups, for efficient cross-tabulation
- perform row operations using multiple data values without the expense of a join
- accurately and semantically represent array data from other applications in your tool stack
- feed array data to other applications in your tool stack

I can't think of any programming languages which don't support arrays, other than crazy ones like Brainfuck and Malbolge. Arrays are so useful that they are ubiquitous. Any system, especially a data analytics platform, which doesn't support them is crippled.

MS SQL Server: no support for arrays.

JSON

PostgreSQL: full support for JSON, including a large set of utility functions for transforming between JSON types and tables (in both directions), retrieving values from JSON data and constructing JSON data. Parsing and stringification are handled by simple casts, which as a rule in PostgreSQL are intelligent and robust. The PL/V8 procedural language works as seamlessly as you would expect with JSON - in fact, a JSON-type internal state in a custom aggregate (see the aggregate example above) whose transition function is written in PL/V8 provides a declarative/imperative best-of-both-worlds so powerful and convenient it feels like cheating.
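A hedged sketch of the interval, array and JSON behaviour described above (table names and values are invented for illustration):

    -- Intervals: durations as first-class values
    SELECT now() + interval '3 months';
    SELECT interval '1 day ago';                         -- stored as -1 days

    -- Arrays: construct, index (1-based), aggregate into
    SELECT (ARRAY[2, 3, 5, 7])[3];                       -- 5
    SELECT customer_id, array_agg(order_id ORDER BY order_id) AS order_ids
    FROM orders
    GROUP BY customer_id;

    -- JSON: parse with a cast, drill in with operators, build with functions
    SELECT ('{"name": "Ada", "langs": ["SQL", "C"]}'::json -> 'langs') ->> 0;   -- SQL
    SELECT row_to_json(orders) FROM orders LIMIT 1;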
JSON (and its variants, such as JSONB) is of course the de facto standard data transfer format on the web and in several other data platforms, such as MongoDB and ElasticSearch, and in fact in any system with a RESTful interface. Aspiring Analytics-as-a-Service providers, take note.

MS SQL Server: no support for JSON.

HSTORE

PostgreSQL: HSTORE is a PostgreSQL extension which implements a fast key-value store as a data type. Like arrays, this is very useful because virtually every high-level programming language has such a concept (and virtually every programming language has such a concept because it is very useful). JavaScript has objects, PHP has associative arrays, Python has dicts, C++ has std::map and std::unordered_map, Go has maps, and so on. In fact, the notion of a key-value store is so important and useful that there exists a whole class of NoSQL databases which use it as their main storage paradigm. They're called, uh, key-value stores. There are also some fun, unexpected uses of such a data type. A colleague recently asked me if there was a good way to deduplicate a text array. The trick I came up with (sketched below) was to put the array into both the keys and the values of an HSTORE, forcing a dedupe to take place (since key values are unique), then retrieve the keys from the HSTORE. There's that PostgreSQL versatility again.

MS SQL Server: no support for key-value storage.

Range types

PostgreSQL: range types represent, well, ranges. Every database programmer has seen fields called start_date and end_date, and most of them have had to implement logic to detect overlaps. Some have even found out, the hard way, that joins to ranges using BETWEEN can go horribly wrong, for a number of reasons. PostgreSQL's approach is to treat time ranges as first-class data types. Not only can you put a range of time (or INTs or NUMERICs or whatever) into a single data value, you can use a host of built-in operators to manipulate and query ranges safely and quickly. You can even apply specially developed indexes to them to massively accelerate queries that use these operators. In short, PostgreSQL treats ranges with the importance they deserve and gives you the tools to work with them effectively. I'm trying not to make this document a mere list of links to the PostgreSQL docs, but just this once, I suggest you go and see for yourself. (Oh, and if the pre-defined range types don't meet your needs, you can define your own ones. You don't have to touch the source code; the database exposes methods to allow you to do this.)

MS SQL Server: no support for range types.

NUMERIC and DECIMAL

PostgreSQL: NUMERIC (and DECIMAL - they're synonyms) is near-as-dammit arbitrary precision: it supports 131,072 digits before the decimal point and 16,383 digits after the decimal point. If you're running a bank, doing technical computation, landing spaceships on comets or simply doing something where you cannot tolerate rounding errors, you're covered.

MS SQL Server: NUMERIC (and DECIMAL - they're synonyms) supports a maximum of 38 decimal places of precision in total.

XML

PostgreSQL: XML is supported as a data type and the database offers a variety of functions for working with XML. XPath querying is supported.

MS SQL Server: finally, some good news! MS SQL Server has an XML data type too, and offers plenty of support for working with it. (Shame XML is going out of style.)
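A hedged sketch of the HSTORE deduplication trick and a range-type query described above (the extension must be installed; the bookings table, its tsrange column and the values are invented for illustration):

    CREATE EXTENSION IF NOT EXISTS hstore;

    -- Deduplicate a text array: keys of an hstore are unique, so build one
    -- from the array (as both keys and values) and read the keys back out
    SELECT akeys(hstore(ARRAY['a', 'b', 'a', 'c'], ARRAY['a', 'b', 'a', 'c']));
    -- result: three distinct elements, e.g. {a,b,c}

    -- Range types: overlap tests without hand-rolled start/end logic
    SELECT *
    FROM bookings
    WHERE period && tsrange('2014-07-01', '2014-07-14');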
1.9. Scriptability

PostgreSQL can be driven entirely from the command line, and since it works on operating systems with proper command lines (i.e. everything except Windows), this is highly effective and secure. You can SSH to a server and configure PostgreSQL from your mobile phone, if you have to (I have done so more than once). You can automate deployment, performance-tuning, security, admin and analytics tasks with scripts. Scripts are very important because, unlike GUI processes, they can be copied, version-controlled, documented, automated, reviewed, batched and diffed. For serious work, text editors and command lines are king.

MS SQL Server is driven through a GUI. I don't know to what extent it can be automated with PowerShell; I do know that if you Google for help and advice on getting things done in MS SQL Server, you get a lot of people saying "right-click on your database, then click on Tasks...". GUIs do not work well across low-bandwidth or high-latency connections; text-based shells do. As I write, I am preparing to do some sysadmin on a server 3,500 miles away, on a VPN via a shaky WiFi hotspot, and thanking my lucky stars it's an Ubuntu/PostgreSQL box. (Who on Earth wants a GUI on a server anyway?)

1.10. Good external language bindings

PostgreSQL is very, very easy to connect to and use from programming environments, because libpq, its external API, is very well designed and very well documented. This means that writing utilities which plug into PostgreSQL is easy and convenient, which makes the database more versatile and a better fit in an analytics stack. On many occasions I have knocked up a quick program in C or C++ which connects to PostgreSQL, pulls some data out and does some heavy calculations on it, e.g. using multithreading or special CPU instructions - stuff the database itself is not suitable for. I have also written C programs which use setuid to allow normal users to perform certain administrative tasks in PostgreSQL. It is very handy to be able to do this quickly and neatly.

MS SQL Server's external language bindings vary. Sometimes you have to install extra drivers. Sometimes you have to create classes to store the data you are querying, which means knowing at compile time what that data looks like. Most importantly, the documentation is a confusing, tangled mess, which makes getting this done unnecessarily time-consuming and painful.

1.11. Documentation

Data analytics is all about being a jack of all trades. We use a very wide variety of programming languages and tools. (Off the top of my head, the programming and scripting languages I currently work with are PHP, JavaScript, Python, R, C, C++, Go, three dialects of SQL, PL/pgSQL and Bash.) It is hopelessly unrealistic to expect to learn everything you will need to know up front. Getting stuff done frequently depends on reading documentation. A well-documented tool is more useful and allows analysts to be more productive and produce higher-quality work.

PostgreSQL's documentation is excellent. Everything is covered comprehensively, but the documents are not merely reference manuals - they are full of examples, hints, useful advice and guidance. If you are an advanced programmer and really want to get stuck in, you can also simply read PostgreSQL's source code, all of which is openly and freely available. The docs also have a sense of humour: "The first century starts at 0001-01-01 00:00:00 AD, although they did not know it at the time. This definition applies to all Gregorian calendar countries. There is no century number 0, you go from -1 century to 1 century. If you disagree with this, please write your complaint to: Pope, Cathedral Saint-Peter of Roma, Vatican."
MS SQL Server's documentation is all on MSDN, which is an unfriendly, sprawling mess. Because Microsoft is a large corporation and its clients tend to be conservative and humourless, the documentation is "business appropriate" - i.e. officious, boring and dry. Not only does it lack amusing references to the historical role of Catholicism in the development of date arithmetic, it is impenetrably stuffy and hidden behind layers of unnecessary categorisation and ostentatiously capitalised Official Terms. Try this: go to the product documentation page for MS SQL Server 2012 and try to get from there to something useful. Or try reading this gem (not cherry-picked, I promise): "A report part definition is an XML fragment of a report definition file. You create report parts by creating a report definition, and then selecting report items in the report to publish separately as report parts." Has the word "report" started to lose its meaning yet?

(And, of course, MS SQL Server is closed source, so you can't look at the source code. Yes, I know source code is not the same as documentation, but it is occasionally surprisingly useful to be able to simply grep the source for a relevant term and cast an eye over the code and the comments of the developers. It's easy to think of our tools as magical black boxes and to forget that even something as huge and complex as an RDBMS engine is, after all, just a list of instructions written by humans in a human-readable language.)

1.12. Logging that's actually useful

MS SQL Server's logs are spread across several places - error logs, Windows event log, profiler logs, agent logs and setup log. To access these you need varying levels of permissions and you have to use various tools, some of which are GUI-only. Maybe things like Splunk can help to automate the gathering and parsing of these logs. I haven't tried, nor do I know anyone else who has. Google searches on the topic produce surprisingly little information, surprisingly little of which is of any use.

PostgreSQL's logs, by default, are all in one place. By changing a couple of settings in a text file, you can get it to log to CSV (and since we're talking about PostgreSQL, it's proper CSV, not broken CSV). You can easily set the logging level anywhere from "don't bother logging anything" to "full profiling and debugging output". The documentation even contains DDL for a table into which the CSV-format logs can be conveniently imported. You can also log to stderr or the system log or to the Windows event log (provided you're running PostgreSQL on Windows, of course). The logs themselves are human-readable and machine-readable and contain data likely to be of great value to a sysadmin: who logged in and out, at what times, and from where; which queries are being run and by whom; how long they are taking; how many queries are submitted in each batch; and so on. Because the data is well-formatted CSV, it is trivially easy to visualise or analyse it in R or PostgreSQL itself or Python's matplotlib or whatever you like. Overlay this with the wealth of information that Linux utilities like top, iotop and iostat provide and you have easy, reliable access to all the server telemetry you could possibly need.

1.13. Support

How is PostgreSQL going to win this one? Everyone knows that expensive flagship enterprise products from big commercial vendors have incredible support, whereas free software doesn't have any, right? Of course, this is nonsense. Commercial products have support from people who support it because they are paid to.
They do the minimum amount necessary to satisfy the terms of the SLA. As I type this, some IT professionals I know are waiting for a major hardware vendor to help them with a performance issue in a £40,000 server. They've been discussing it with the vendor for weeks; they've spent time and effort running extensive tests and benchmarks at the vendor's request; and so far the vendor's reaction has been a mixture of incompetence, fecklessness and apathy. The £40,000 server is sitting there performing very, very slowly, and its users are working 70-hour weeks to try to stay on schedule.

Over the years I have seen many, many problems with expensive commercial software - everything from bugs to performance issues to incompatibility to insufficient documentation. Sometimes these problems cause a late night or a lost weekend for the user; sometimes they cause missed deadlines and angry clients; sometimes it goes as far as legal and reputational risk. Every single time, the same thing happens: the problem is fixed by the end users, using a combination of blood, sweat, tears, Google and late nights. I have never seen the vendor swoop in to the rescue and make everything OK.

So what is the support for PostgreSQL like? On the two occasions I have asked the PostgreSQL mailing list for help, I have received replies from Tom Lane within 24 hours. Take a moment to click on the link and read the wiki - the guy is not just a lead developer of PostgreSQL, he's a well-known computer programmer. Needless to say, his advice is as good as advice gets. On one of those occasions, where I asked a question about the best way to implement cross-function-call persistent memory allocation, Lane replied with the features of PostgreSQL I should study and suggested solutions to my problem - and for good measure he threw in a list of very good reasons why my tentative solution (a C static variable) was rubbish. You can't buy that kind of support, but you can get it from a community of enthusiastic open source developers. Oh, and did I mention that the total cost of the database software and the helpful advice and recommendations from the acclaimed programmer was £0.00?

Note that by "support" I mean help getting it to work properly. Some people (usually people who don't actually use the product) think of support contracts more in terms of legal coverage - they're not really interested in whether help is forthcoming or not, but they like that there's someone to shout at and, more importantly, blame. I discuss this too, here.

(And if you're really determined to pay someone to help you out, you can of course go to any of the organisations which provide professional support for PostgreSQL. Unlike commercial software vendors, whose support functions are secondary to their main business of selling products, these organisations live or die by the quality of the support they provide, so it is very good.)

1.14. Flexible, scriptable database dumps

I've already talked about scriptability, but database dumps are very important, so they get their own bit here. PostgreSQL's dump utility is extremely flexible, command-line driven (making it easily automatable and scriptable) and well documented (like the rest of PostgreSQL). This makes database migration, replication and backups - three important and scary tasks - controllable, reliable and configurable. Moreover, backups can be in a space-efficient compressed format or in plain SQL, complete with data, making them both human-readable and executable.
A backup can be of a single table or of a whole database cluster. The user gets to do exactly as he pleases. With a little work and careful selection of options, it is even possible to make a DDL-only plain SQL PostgreSQL backup executable in a different RDBMS.

MS SQL Server's backups are in a proprietary, undocumented, opaque binary format.

1.15. Reliability

Neither PostgreSQL nor MS SQL Server is crash-happy, but MS SQL Server does have a bizarre failure mode which I have witnessed more than once: its transaction logs become enormous and prevent the database from working. In theory the logs can be truncated or deleted, but the documentation is full of dire warnings against such action. PostgreSQL simply sits there working and getting things done. I have never seen a PostgreSQL database crash in normal use.

PostgreSQL is relatively bug-free compared to MS SQL Server. I once found a bug in PostgreSQL 8.4 - it was performing a string distance calculation algorithm wrongly. This was a problem for me because I needed to use that algorithm in some fuzzy deduplication code I was writing for work. I looked up the algorithm on Wikipedia, gained a rough idea of how it works, found the implementation in the PostgreSQL source code, wrote a fix and emailed it to one of the PostgreSQL developers. In the next release of PostgreSQL, version 9.0, the bug was fixed. Meanwhile, I applied my fix to my own installation of PostgreSQL 8.4, re-compiled it and kept working. This will be a familiar story to many users of PostgreSQL, and indeed of any large piece of open source software. The community benefits from high-quality free software, and individuals with the appropriate skills do what they can to contribute. Everyone wins. With a closed-source product, you can't fix it yourself - you just raise a bug report, cross your fingers and wait. If MS SQL Server were open source, section 1.1 above would not exist, because I (and probably thousands of other frustrated users) would have damn well written a proper CSV parser and plumbed it in years ago.

1.16. Ease of installing and updating

Does this matter? Well, yes. Infrastructure flexibility is more important than ever and that trend will only continue. Gone are the days of the big fat server install which sits untouched for years on end. These days it's all about fast, reliable, flexible provisioning and keeping up with cutting-edge features. Also, as the saying goes, time is money.

I have installed MS SQL Server several times. I have installed PostgreSQL more times than I can remember - probably at least 50 times. Installing MS SQL Server is very slow. It involves immense downloads (who still uses physical install media?) and lengthy, important-sounding processes with stately progress bars. It might fail if you don't have the right version of .NET or the right Windows service pack installed. It's the kind of thing your sysadmin needs to find a solid block of time for. Installing PostgreSQL the canonical way - from a Linux repo - is as easy as typing a single package-manager command (an apt-get or yum install, depending on your distribution). How long does it take? I just tested this by spinning up a cheap VM in the cloud and installing PostgreSQL that way. It took 16 seconds. That's the total time for the download and the install. As for updates, any software backed by a Linux repo is trivially easy to patch and update by pulling updates from the repo. Because repos are clever and PostgreSQL is not obscenely bloated, downloads are small and fast and application of updates is efficient.
I don't know how easy MS SQL Server is to update. I do know that a lot of production MS SQL Server boxes in certain organisations are still on version 2008 R2, though.

1.17. The contrib modules

As if the enormous feature set of PostgreSQL were not enough, it comes with a set of extensions called contrib modules. There are libraries of functions, types and utilities for doing certain useful things which don't quite fall into the core feature set of the server. There are libraries for fuzzy string matching, fast integer array handling, external database connectivity, cryptography, UUID generation, tree data types and loads, loads more. A few of the modules don't even do anything except provide templates to allow developers and advanced users to develop their own extensions and custom functionality. Of course, these extensions are trivially easy to install. For example, installing the fuzzystrmatch extension is a single statement (sketched below).
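A hedged sketch of installing and using a contrib module (this assumes the contrib packages are present on the server; levenshtein is one of the functions fuzzystrmatch provides):

    -- Install the extension into the current database
    CREATE EXTENSION fuzzystrmatch;

    -- Immediately usable, e.g. edit distance for fuzzy matching
    SELECT levenshtein('analyse', 'analyze');   -- 1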
You can't sue Microsoft just because you didn't do your due diligence when you picked a database. Even if you somehow do successfully blame the vendor, you still have a messed-up job and an angry client, who won't want to hear about MS SQL Server's unfortunate treatment of UTF-16 text as UCS-2, resulting in truncation of a surrogate pair during a substring operation and subsequent failure to identify an incriminating keyword. At best they will continue to demand results (and probably a discount); at worst, they will write you off as incompetent - and who could blame them, when you trusted their job to an RDBMS whose docs unapologetically acknowledge that it might silently corrupt your data? Since the best way to minimise risk is to get the job done right, the best tool to use is the one which is most likely to let you accomplish that. In this case, that's PostgreSQL.

2.2. But what happens if the author of PostgreSQL dies?

Same thing that happens if the author of MS SQL Server dies - nothing. Also, needless to say, "the author of PostgreSQL" is as meaningless as "the author of MS SQL Server". There's no such thing. A senior individual with an IT infrastructure oversight role actually asked me this question once (about Hadoop, not PostgreSQL). There just seems to be a misconception that all open-source software is written by a loner who lives in his mum's basement. This is obviously not true. Large open source projects like PostgreSQL and Hadoop are written by teams of highly skilled developers who are often commercially sponsored. At its heart, the development model of PostgreSQL is just like the development model of MS SQL Server: a large team of programmers is paid by an organisation to write code. There is no single point of failure. There is at least one key difference, though: PostgreSQL's source code is openly available and is therefore reviewed, tweaked, contributed to, improved and understood by a huge community of skilled programmers. That's one of the reasons why it's so much better. Crucially, because open-source software tends to be written by people who care deeply about its quality (often because they have a direct personal stake in ensuring that the software works as well as possible), it is often of the very highest standard (PostgreSQL, Linux, MySQL, XBMC, Hadoop, Android, VLC, Neo4j, Redis, 7Zip, FreeBSD, golang, PHP, Python, R, Nginx, Apache, node.js, Chrome, Firefox...). On the other hand, commercial software is often designed by committee, written in cube farms and developed without proper guidance or inspiration (Microsoft BOB, RealPlayer, Internet Explorer 6, iOS Maps, Lotus Notes, Windows ME, Windows Vista, QuickTime, SharePoint...).

2.3. But open-source software isn't secure/reliable/trustworthy/enterprise-ready/etc.

There's no kind way to say this: anyone who says such a thing is very ignorant, and you should ignore them - or, if you're feeling generous, educate them. Well, I guess I'm feeling generous:

Security: the idea that closed-source is more secure is an old misconception, for many good reasons which I will briefly summarise (but do read the links - they're excellent): secrecy isn't the same as security; an open review process is more likely to find weaknesses than a closed one; and properly reviewed open source software is difficult or impossible to build a back door into.
If you prefer anecdotal evidence to logical arguments, consider that Microsoft Internet Explorer 6, once a flagship closed-source commercial product, is widely regarded as the least secure software ever produced, and that Rijndael, the algorithm behind AES, which governments the world over use to protect top secret information, is an open standard. In any case, relational databases are not security software. In the IT world, security is a bit like "support our troops" in the USA or "think of the children" in the UK - a trump card which overrules all other considerations, including common sense and evidence. Don't fall for it.

Reliability: Windows was at one point renowned for its instability, although these days things are much better. (Supposedly, Windows 9x would spontaneously crash when its internal uptime counter, counting in milliseconds, exceeded the upper bound of an unsigned 32-bit integer, i.e. after 2^32 milliseconds or about 49.7 days. I have always wanted to try this.) Linux dominates the server space, where reliability is key, and Linux boxes routinely achieve uptimes measured in years. Internet Explorer has always failed (and still fails) to comply with web standards, causing websites to break or function improperly; the leaders in the field are the open-source browsers Chrome and Firefox. Lotus Notes is a flaky, crash-happy, evil mess; Thunderbird just works. And I have more than once seen MS SQL Server paralyse itself by letting transaction log files blow up, something PostgreSQL does not do.

Trustworthiness: unless you've been living under a rock for the past couple of years, you know who Edward Snowden is. Thanks to him, we know exactly what you cannot trust: governments and the large organisations they get their hooks into. Since Snowden went public, it is clear that NSA back doors exist in a vast array of products, both hardware and software, that individuals and organisations depend on to keep their data secure. The only defence against this is open code review. The only software that can be subjected to open code review is open source software. If you use proprietary closed-source software, you have no way of knowing what it is really doing under the hood. And thanks to Mr. Snowden, we now know that there is an excellent chance it is giving your secrets away.

As for being enterprise-ready: at the time of writing, 485 of the top 500 supercomputers in the world run on Linux. As of July 2014, Nginx and Apache, two open-source web servers, power over 70% of the million busiest sites on the net. The computers on the International Space Station (the most expensive single man-made object in existence) were moved from Windows to Linux in 2013 in an attempt to improve stability and reliability. The back-end database of Skype (ironically now owned by Microsoft) is PostgreSQL. GCHQ recently reported that Ubuntu Linux is the most secure commonly-available desktop operating system. The Large Hadron Collider is the world's largest scientific experiment. Its supporting IT infrastructure, the Worldwide LHC Computing Grid, is the world's largest computing grid. It handles 30 PB of data per year and spans 36 countries and over 170 computing centres. It runs primarily on Linux. Hadoop, the current darling of many large consultancies looking to earn Big Data credentials, is open-source. And note how many enterprise operating systems wear the word in their names: Red Hat Enterprise Linux, CentOS (Community Enterprise OS), SUSE Linux Enterprise Server, Oracle Linux, IBM Enterprise Linux Server, and so on. The idea that open-source software is not for the enterprise is pure bullshit.
If you work in tech for an organisation which disregards open source, enjoy it while it lasts. They won't be around for long.

2.4. But MS SQL Server can use multiple CPU cores for a single query

This is an advantage for MS SQL Server whenever you're running a query which is CPU-bound and not IO-bound. In real-life data analytics this happens approximately once every three blue moons. On those very rare, very specific occasions when CPU power is truly the bottleneck, you almost certainly should be using something other than an RDBMS. RDBMSes are not for number crunching. This advantage goes away when a server has to do many things at once (as is almost always the case). PostgreSQL uses multiprocessing - different connections run in different processes, and hence on different CPU cores. The scheduler of the OS takes care of this. Also, I suspect this query parallelism is what necessitates the merge method which MS SQL Server custom aggregate assemblies are required to implement: bits of aggregation done in different threads have to be combined with each other, MapReduce-style. I further suspect that this mechanism is what prevents MS SQL Server aggregates from accepting ORDER BY clauses. So, congratulations - you can use more than one CPU core, but you can't do a basic string roll-up.

2.5. But I have MS SQL Server skills, not PostgreSQL skills

You'd rather stick with a clumsy, awkward, unreliable system than spend the trivial amount of effort it takes to learn a slightly different dialect of a straightforward querying language? Well, just hope you never end up in a job interview with me.

2.6. But a billion Microsoft users can't all be wrong

This is a real-life quotation as well, from a senior data analyst I used to work with. I replied, "Well, there are 1.5 billion Muslims and 1.2 billion Catholics. They can't all be right. Ergo, a billion people most certainly can be wrong." (In this particular case, 2.7 billion people are wrong.)

2.7. But if it were really that good then it wouldn't be free

People actually say this too. I feel sorry for these people, because they are unable to conceive of anyone doing anything for any reason other than monetary gain. Presumably they are also unaware of the existence of charities or volunteers or unpaid bloggers or any of the other things people do purely out of a desire to contribute or to create something or simply to take on a challenge. This argument also depends on an assumption that open source development has no benefit for the developer, which is nonsense. The reason large enterprises open-source their code and then pay their teams to continue working on it is because doing so benefits them. If you open up your code and others use it, then you have just gained a completely free source of bug fixes, feature contributions, code review, product testing and publicity. If your product is good enough, it is used by enough people that it starts having an influence on standards, which means broader industry acceptance. You then have a favoured position in the market as a provider of support and deployment services for the software. Open-sourcing your code is often the most sensible course of action even if you are completely self-interested. As a case in point, here I am spending my free time writing a web page about how fabulous PostgreSQL is and then paying my own money to host it. Perhaps Teradata or Oracle are just as amazing, but they're not getting their own pages because I can't afford them, so I don't use them.

2.8. But you're biased

No, I have a preference.
The whole point of this document is to demonstrate, using evidence, that this preference is justified. If you read this and assume that just because I massively prefer PostgreSQL I must be biased, that means you are biased, because you have refused to seriously consider the possibility that it really is better. If you think there's actual evidence that I really am biased, let me know.

2.9. But PostgreSQL is a stupid name

This one is arguably true; it's pretty awkward. It is commonly mispronounced, very commonly misspelt and almost always incorrectly capitalised. It's a good job that stupidness of name is not something serious human beings take into account when they're choosing industrial software products. That being said, "MS SQL Server" is literally the most boring possible name for a SQL server provided by MS. It has anywhere from six to eight syllables, depending on whether or not you abbreviate "Microsoft" and whether you say it "sequel" or "ess queue ell", which is far too many syllables for a product name. Microsoft has a thing for very long names, though - possibly its greatest achievement ever is "Microsoft WinFX Software Development Kit for Microsoft Pre-Release Windows Operating System Code-Named Longhorn, Beta 1 Web Setup". I count 38 syllables. Impressive.

2.10. But SSMS is better than PGAdmin

It's slicker, sure. It's prettier. It has code completion, although I always turn that off because it constantly screws things up, and for every time it helps me out with a field or table name, there's at least one occasion when it does something mental, like auto-correcting a common SQL keyword like "table" to a Microsoft monstrosity like TABULATIONNONTRIVIALDISCOMBOBULATEDMACHIAVELLIANGANGLYONID or something. For actually executing SQL and looking at the results in a GUI, PGAdmin is fine. It's just not spectacular. SSMS is obviously Windows-only. PGAdmin is cross-platform. This is actually quite convenient. You can run PGAdmin in Windows, where you have all your familiar stuff - Office, Outlook etc. - whilst keeping the back-end RDBMS in Linux. This gets you the best of both worlds (even an open source advocate like me admits that if you're a heavy MS Office user, there is no serious alternative). Several guys I work with do this. One point in SSMS's favour is that if you run several row-returning statements in a batch, it will give you all the results. PGAdmin returns only the last result set. This can be a drag when doing data analytics, where you often want to simultaneously query several data sets and compare the results.

There's another thing though: psql. This is PostgreSQL's command-line SQL interface. It's really, really good. It has loads of useful catalog-querying features. It displays tabular data intelligently. It has tab completion which, unlike SSMS's code completion, is actually useful, because it is context-sensitive. So, for example, if you type DROP SCHEMA t and hit tab, it will suggest schema names starting with t (or, if there is only one, auto-fill it for you). It lets you jump around in the file system and use ultra-powerful text editors like vim inline. It automatically keeps a list of executed commands. It provides convenient, useful data import and export functionality, including the COPY ... TO PROGRAM feature, which makes smashing use of pipes and command-line utilities to provide another level of flexibility and control of data (a small sketch follows below). It makes intelligent use of screen space. It is fast and convenient. You can use it over an SSH connection, even a slow one.
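As a taste of the COPY ... TO PROGRAM feature, here is a minimal sketch; the table name and the output path are made-up examples rather than anything from a real system:

    -- Export a query result as CSV, piping it straight through gzip on the server.
    -- (Requires a role that is allowed to run server-side programs.)
    COPY (SELECT * FROM sales WHERE sale_date >= DATE '2014-01-01')
        TO PROGRAM 'gzip > /tmp/sales_2014.csv.gz'
        WITH (FORMAT csv, HEADER true);

The same idea works in the other direction with COPY ... FROM PROGRAM, so a compressed file or the output of any command-line tool can be loaded without an intermediate temporary file.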
Its only serious disadvantage is that it is unsuitable for people who want to be data analysts but are scared of command lines and typing on a keyboard.

2.11. But MS SQL Server can import straight from Excel

Yes. So what? Excel can output to CSV (in a rare moment of sanity, Microsoft made Excel's CSV export code work properly) and PostgreSQL can import CSV. Admittedly, it's an extra step. Is the ability to import straight from Excel a particularly important feature in an analytics platform anyway?

2.12. But PostgreSQL is slower than MS SQL Server

A more accurate rephrasing would be "MS SQL Server is slightly more forgiving if you don't know what you're doing". For certain operations, PostgreSQL is definitely slower than MS SQL Server - the easiest example is probably COUNT(*), which is (I think) always instant in MS SQL Server and in PostgreSQL requires a full table scan (this is due to the different concurrency models they use). PostgreSQL is slow out of the box because its default configuration uses only a tiny amount of system resources - but any system being used for serious work has been tuned properly, so raw out-of-the-box performance is not a worthwhile thing to argue about. I once saw PostgreSQL criticised as slow because it was taking a long time to do some big, complex regex operations on a large table. But everyone knows that regex operations can be very computationally expensive, and in any case, what was PostgreSQL being compared to? Certainly not the MS SQL Server boxes, which couldn't do regexes at all. PostgreSQL's extensive support for very clever indexes, such as range type indexes and trigram indexes, makes it orders of magnitude faster than MS SQL Server for a certain class of operations - but only if you know how to use those features properly (a small sketch follows below). The immense flexibility you get from the great procedural language support and the clever data types allows PostgreSQL-based solutions to outperform MS SQL Server-based solutions by orders of magnitude. See my earlier example. In any case, the argument about speed is never only about computer time; it is about developer time too. That's why high-level languages like PHP and Python are very popular, despite the fact that C kicks the shit out of them when it comes to execution speed. They are slower to run but much faster to use for development. Would you prefer to spend an hour writing maintainable, elegant SQL followed by an hour of runtime, or spend three days writing buggy, desperate workarounds followed by 45 minutes of runtime?
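To make the point about trigram indexes concrete, here is a minimal sketch; the customers table and its name column are made up purely for illustration:

    -- pg_trgm is one of the contrib modules mentioned earlier.
    CREATE EXTENSION pg_trgm;

    -- A GIN trigram index speeds up LIKE '%...%' and similarity searches enormously.
    CREATE INDEX customers_name_trgm_idx
        ON customers USING gin (name gin_trgm_ops);

    -- A fuzzy search that can use the index instead of scanning the whole table:
    SELECT name
    FROM customers
    WHERE name % 'jonsson'                         -- pg_trgm's "similar to" operator
    ORDER BY similarity(name, 'jonsson') DESC
    LIMIT 10;

MS SQL Server has no built-in trigram index type, which is exactly the kind of gap the preceding paragraph is talking about.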
2.13. But you never mentioned such-and-such feature of MS SQL Server

As I said in the banner and the intro, I am comparing these databases from the point of view of a data analyst, because I'm a data analyst and I use them for data analysis. I know about SSRS, SSAS, in-memory column stores and so on, but I haven't mentioned them because I don't use them (or equivalent features). Yes, this means this is not a comprehensive comparison of the two databases, and I never said it would be. It also means that if you care mostly about OLTP or data warehousing, you might not find this document very helpful.

2.14. But Microsoft has open-sourced .NET

Yeah, mere hours after I wrote all about how they're a vendor lock-in monster and are anti-open source. D'oh. However, let's look at this in context. Remember the almighty ruckus when the Office Open XML standard was being created? Microsoft played every dirty trick in the book to ensure that MS Office wouldn't lose its dominance. Successfully, too - the closest alternative, LibreOffice, is still not a viable option, largely because of incompatibility with document formats. The OOXML standard that was finally pushed through is immense, bloated, ambiguous, inconsistent and riddled with errors. That debacle also started with an apparent gesture toward open standards on Microsoft's part. If that seems harsh or paranoid, let's remember that this is an organisation that has been in legal trouble with both the USA and the EU for monopolistic and anticompetitive behaviour and abuse of market power, in the latter case being fined almost half a billion Euros. Then there's the involvement in SCO's potentially Linux-killing lawsuit against IBM. When Steve Ballmer was CEO he described Linux as "a cancer" (although Ballmer also said "There's no chance that the iPhone is going to get any significant market share. No chance", so maybe he just likes to talk nonsense). Microsoft has a long-established policy of preferring conquest to cooperation. So, if they play nice for the next few years and their magnanimous gesture ushers in a new era of interoperability, productivity and harmony, I (and millions of developers who want to get on with creating great things instead of bickering over platforms and standards) will be over the moon. For now, thinking that MS has suddenly become all warm and fuzzy would just be naive.

2.15. But you're insulting / I don't like your tone / you come across as angry / you sound like a fanboy / this is unprofessional / this is a rant

This page is unprofessional by definition - I'm not being paid to write it. That also means I get to use whatever tone I like, and I don't have to hide the way I feel about things. I hope you appreciate the technical content even if you don't like the way I write; if my tone makes this document unreadable for you, then I guess I've lost a reader and you've lost a web page. C'est la vie.

Most people are familiar with the phrase "this will kill two birds with one stone". If you're not, the phrase refers to an approach that addresses two objectives in one action. (Unfortunately, the expression itself is rather unpleasant, as most of us don't want to throw stones at innocent animals.) Today I'm going to cover some basics on two great features in SQL Server: the Columnstore index (available only in SQL Server Enterprise) and the SQL Query Store. Microsoft actually implemented the Columnstore index in SQL 2012 Enterprise, though they've enhanced it in the last two releases of SQL Server. Microsoft introduced the Query Store in SQL Server 2016. So, what are these features and why are they important? Well, I have a demo that will introduce both features and show how they can help us. Before I go any further, I also cover this (and other SQL 2016 features) in my CODE Magazine article on new features in SQL 2016. As a basic introduction, the Columnstore index can help speed up queries that scan/aggregate over large amounts of data, and the Query Store tracks query executions, execution plans, and runtime statistics that you'd normally need to collect manually. Trust me when I say, these are great features. For this demo, I'll be using the Microsoft Contoso Retail Data Warehouse demo database. Loosely speaking, Contoso DW is like "a really big AdventureWorks", with tables containing millions of rows. (The largest AdventureWorks table contains roughly 100,000 rows at most.) You can download the Contoso DW database here: microsoft.com/en-us/download/details.aspx?id=18279.
Contoso DW works very well when you want to test performance on queries against larger tables. Contoso DW contains a standard data warehouse fact table called FactOnlineSales, with 12.6 million rows. That's certainly not the largest data warehouse table in the world, but it's not child's play either. Suppose I want to summarize product sales amount for 2009, and rank the products. I might query the fact table, join to the Product dimension table and use a RANK function (a rough sketch of the query appears at the end of this section). Here's a partial result set of the top 10 rows, by Total Sales. On my laptop (i7, 16 GB of RAM), the query takes anywhere from 3-4 seconds to run. That might not seem like the end of the world, but some users might expect near-instant results (the way you might see near-instant results when using Excel against an OLAP cube). The only index I currently have on this table is a clustered index on a sales key. If I look at the execution plan, SQL Server makes a suggestion to add a covering index to the table. Now, just because SQL Server suggests an index doesn't mean you should blindly create indexes on every "missing index" message. However, in this instance, SQL Server detects that we are filtering based on year, and using the Product Key and Sales Amount. So, SQL Server suggests a covering index, with the DateKey as the index key field. The reason we call this a "covering" index is because SQL Server will "bring along the non-key fields" we used in the query, "for the ride". That way, SQL Server doesn't need to use the table or the clustered index at all; the database engine can simply use the covering index for the query. Covering indexes are popular in certain data warehousing and reporting database scenarios, though they do come at the cost of the database engine maintaining them. (Note: covering indexes have been around for a long time - I haven't yet gotten to the Columnstore index and the Query Store.) So, I will add the covering index (also sketched below). If I re-execute the same query I ran a moment ago (the one that aggregated the sales amount for each product), the query sometimes seems to run about a second faster, and I get a different execution plan, one that uses an Index Seek instead of an Index Scan (using the date key on the covering index to retrieve sales for 2009). So, prior to the Columnstore index, this could be one way to optimize this query in much older versions of SQL Server. It runs a little faster than the first one, and I get an execution plan with an Index Seek instead of an Index Scan. However, there are some issues: the two execution operators "Index Seek" and "Hash Match (Aggregate)" both essentially operate "row by row". Imagine this in a table with hundreds of millions of rows. Related, think about the contents of a fact table: in this case, a single date key value and/or a single product key value might be repeated across hundreds of thousands of rows (remember, the fact table also has keys for geography, promotion, salesman, etc.). So, when the "Index Seek" and "Hash Match" work row by row, they are doing so over values that might be repeated across many other rows.
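For reference, here is a rough sketch of the kind of query and covering index described above. The column names follow the standard Contoso schema, but treat the exact list as an assumption rather than the original listing:

    -- Summarize and rank 2009 product sales from the fact table.
    SELECT  p.ProductName,
            SUM(f.SalesAmount) AS TotalSales,
            RANK() OVER (ORDER BY SUM(f.SalesAmount) DESC) AS SalesRank
    FROM dbo.FactOnlineSales AS f
        JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
    WHERE f.DateKey >= '20090101' AND f.DateKey < '20100101'
    GROUP BY p.ProductName
    ORDER BY SalesRank;

    -- The shape of the covering index SQL Server suggests: keyed on the date,
    -- carrying the other referenced columns along "for the ride".
    CREATE NONCLUSTERED INDEX IX_FactOnlineSales_DateKey_Covering
        ON dbo.FactOnlineSales (DateKey)
        INCLUDE (ProductKey, SalesAmount);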
This is normally where I'd segue to the SQL Server Columnstore index, which offers a scenario to improve the performance of this query in amazing ways. But before I do that, let's go back in time. Let's go back to the year 2010, when Microsoft introduced an add-in for Excel known as PowerPivot. Many people probably remember seeing demos of PowerPivot for Excel, where a user could read millions of rows from an outside data source into Excel. PowerPivot would compress the data, and provide an engine to create Pivot Tables and Pivot Charts that performed at amazing speeds against the compressed data. PowerPivot used an in-memory technology that Microsoft termed "VertiPaq". This in-memory technology in PowerPivot would basically take duplicate business key/foreign key values and compress them down to a single vector. The in-memory technology would also scan/aggregate these values in parallel, in blocks of several hundred at a time. The bottom line is that Microsoft baked a large amount of performance enhancements into the VertiPaq in-memory feature for us to use, right out of the proverbial box. Why am I taking this little stroll down memory lane? Because in SQL Server 2012, Microsoft implemented one of the most important features in the history of their database engine: the Columnstore index. The index is really an index in name only: it is a way to take a SQL Server table and create a compressed, in-memory columnstore that compresses duplicate foreign key values down to single vector values. Microsoft also created a new buffer pool to read these compressed vector values in parallel, creating the potential for huge performance gains. So, I'm going to create a columnstore index on the table, and I'll see how much better (and more efficiently) the query runs, versus the query that runs against the covering index. I'll create a duplicate copy of FactOnlineSales (I'll call it FactOnlineSalesDetailNCCS), and I'll create a columnstore index on the duplicated table; that way I won't interfere with the original table and the covering index in any way. Next, I'll create a columnstore index on the new table (sketched below). Note several things: I've specified several foreign key columns, as well as the Sales Amount. Remember that a columnstore index is not like a traditional row-store index. There is no "key". We are simply indicating which columns SQL Server should compress and place in an in-memory columnstore. To use the analogy of PowerPivot for Excel: when we create a columnstore index, we're telling SQL Server to essentially do the same thing that PowerPivot did when we imported 20 million rows into Excel using PowerPivot. So, I'll re-run the query, this time using the duplicated FactOnlineSalesDetailNCCS table that contains the columnstore index. This query runs instantly, in less than a second. And I can also say that even if the table had hundreds of millions of rows, it would still run at the proverbial "bat of an eyelash".
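Here is a minimal sketch of the duplicated table and its columnstore index; the column list is an assumption based on the walkthrough above, not the original listing:

    -- Make a copy of the fact table so the original table and covering index stay untouched.
    SELECT *
    INTO dbo.FactOnlineSalesDetailNCCS
    FROM dbo.FactOnlineSales;

    -- Compress the listed columns into an in-memory columnstore (SQL Server 2012 syntax).
    CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FactOnlineSalesDetailNCCS_CS
        ON dbo.FactOnlineSalesDetailNCCS
           (DateKey, ProductKey, StoreKey, PromotionKey, CurrencyKey, SalesAmount);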
We could look at the execution plan (and in a few moments, we will), but now it's time to cover the Query Store feature. Imagine for a moment that we ran both queries overnight: the query that used the regular FactOnlineSales table (with the covering index) and then the query that used the duplicated table with the Columnstore index. When we log in the following morning, we'd like to see the execution plan for both queries as they took place, as well as the execution statistics. In other words, we'd like to see the same statistics that we'd be able to see if we ran both queries interactively in SQL Server Management Studio, turned on TIME and IO statistics, and viewed the execution plan right after executing the query. Well, that's what the Query Store allows us to do: we can turn on (enable) the Query Store for a database, which will trigger SQL Server to store query execution and plan statistics so that we can view them later. So, I'm going to enable the Query Store on the Contoso database (and also clear out any caching), and then I'll run the two queries and pretend that I ran them hours ago. According to what I said, the Query Store will capture the execution statistics. So how do I view them? Fortunately, that's quite easy. If I expand the Contoso DW database, I'll see a Query Store folder. The Query Store has tremendous functionality and I'll try to cover much of it in subsequent blog posts. But for right now, I want to view execution statistics on the two queries, and specifically examine the execution operators for the columnstore index. So I'll right-click on Top Resource Consuming Queries and run that option. That gives me a chart like the one below, where I can see execution duration time (in milliseconds) for all queries that have been executed. In this instance, Query 1 was the query against the original table with the covering index, and Query 2 was against the table with the columnstore index. The numbers don't lie: the columnstore index outperformed the original table/covering index by a factor of almost 7 to 1. I can change the metric to look at memory consumption instead. In this case, note that query 2 (the columnstore index query) used far more memory. This demonstrates clearly why the columnstore index represents "in-memory" technology: SQL Server loads the entire columnstore index in memory, and uses a completely different buffer pool with enhanced execution operators to process the index. OK, so we have some graphs to view execution statistics; can we see the execution plan (and execution operators) associated with each execution? Yes, we can. If you click on the vertical bar for the query that used the columnstore index, you'll see the execution plan below. The first thing we see is that SQL Server performed a columnstore index scan, and that represented nearly 100% of the cost of the query. You might be saying, "Wait a minute, the first query used a covering index and performed an index seek, so how can a columnstore index scan be faster?" That's a legitimate question, and fortunately there's an answer. Even when the first query performed an index seek, it still executed "row by row". If I put the mouse over the columnstore index scan operator, I see a tooltip (like the one below) with one important setting: the Execution Mode is BATCH (as opposed to ROW, which is what we had with the first query using the covering index). That BATCH mode tells us that SQL Server is processing the compressed vectors (for any foreign key values that are duplicated, such as the product key and date key) in batches of almost 1,000, in parallel. So SQL Server is still able to process the columnstore index much more efficiently. Additionally, if I place the mouse over the Hash Match (Aggregate) task, I also see that SQL Server is aggregating the columnstore index using Batch mode (although the operator itself represents such a tiny percent of the cost of the query). Finally, you might be asking, "OK, so SQL Server compresses the values in the data, treats the values as vectors, and reads them in blocks of almost a thousand values in parallel, but my query only wanted data for 2009.
So is SQL Server scanning over the entire set of data?" Again, a good question. The answer is, "Not really." Fortunately for us, the new columnstore index buffer pool performs another function called "segment elimination". Basically, SQL Server will examine the vector values for the date key column in the columnstore index, and eliminate the segments that are outside the scope of the year 2009. I'll stop here. In subsequent blog posts I'll cover both the columnstore index and the Query Store in more detail. Essentially, what we've seen here today is that the Columnstore index can significantly speed up queries that scan/aggregate over large amounts of data, and the Query Store will capture query executions and allow us to examine execution and performance statistics later.

In the end, we'd like to produce a result set that shows the following. Notice three things: (1) the columns essentially pivot all of the possible Return Reasons, after showing the sales amount; (2) the result set contains subtotals by the week ending (Sunday) date across all clients (where the Client is NULL); (3) the result set contains a grand total row (where the Client and Date are both NULL). First, before I get into the SQL end, note that we could use the dynamic pivot/matrix capability in SSRS. We would simply need to combine the two result sets by one column and then we could feed the results to the SSRS matrix control, which will spread the return reasons across the columns axis of the report. However, not everyone uses SSRS (though most people should). But even then, sometimes developers need to consume result sets in something other than a reporting tool. So for this example, let's assume we want to generate the result set for a web grid page, and possibly the developer wants to "strip out" the subtotal rows (where I have a ResultSetNum value of 2 and 3) and place them in a summary grid. So bottom line, we need to generate the output above directly from a stored procedure. And as an added twist, next week there could be Return Reason X and Y and Z. So we don't know how many return reasons there could be. We simply want the query to pivot on the possible distinct values for Return Reason. Here is where the T-SQL PIVOT has a restriction: we need to provide it the possible values. Since we won't know that until run-time, we need to generate the query string dynamically using the dynamic SQL pattern. The dynamic SQL pattern involves generating the syntax, piece by piece, storing it in a string, and then executing the string at the end. Dynamic SQL can be tricky, as we have to embed syntax inside a string. But in this case, it's our only true option if we want to handle a variable number of return reasons. I've always found that the best way to create a dynamic SQL solution is by figuring out what the "ideal" generated query would be at the end (in this case, given the Return Reasons we know about), and then reverse-engineering it by piecing it together one part at a time. And so, here is the SQL we need if we knew those Return Reasons (A through D) were static and would not change. The query does the following: it combines the data from SalesData with the data from ReturnData, where we "hard-wire" the word Sales as an Action Type from the Sales table, and then use the Return Reason from the Return data in the same ActionType column. That will give us a clean ActionType column on which to pivot.
We are combining the two SELECT statements into a common table expression (CTE), which is basically a derived table subquery that we subsequently use in the next statement (to PIVOT). Then comes a PIVOT statement against the CTE, which sums the dollars for the Action Type being in one of the possible Action Type values. Note that this isn't the final result set. We are placing this into a CTE that reads from the first CTE. The reason for this is because we want to do multiple groupings at the end. The final SELECT statement reads from PIVOTCTE and combines it with a subsequent query against the same PIVOTCTE, but where we also implement two groupings using the GROUPING SETS feature in SQL 2008: GROUPING by the Week End Date (dbo.WeekEndingDate), and GROUPING for all rows (). So if we knew with certainty that we'd never have more return reason codes, then that would be the solution. However, we need to account for other reason codes. So we need to generate that entire query above as one big string, where we construct the possible return reasons as one comma-separated list. I'm going to show the entire T-SQL code to generate (and execute) the desired query, and then I'll break it out into parts and explain each step. So first, here's the entire code to dynamically generate what I've got above. There are basically five steps we need to cover.

Step 1: we know that somewhere in the mix, we need to generate a string for this in the query: SalesAmount, Reason A, Reason B, Reason C, Reason D. What we can do is build a temporary common table expression that combines the hard-wired "Sales Amount" column with the unique list of possible reason codes. Once we have that in a CTE, we can use the nice little trick of FOR XML PATH('') to collapse those rows into a single string, put a comma in front of each row that the query reads, and then use STUFF to replace the first instance of a comma with an empty space. This is a trick that you can find in hundreds of SQL blogs (a small sketch of it appears after Step 5 below). So this first part builds a string called @ActionString that we can use further down.

Step 2: we also know that we'll want to SUM the generated/pivoted reason columns, along with the standard sales column. So we'll need a separate string for that, which I'll call @SumString. I'll simply take the original @ActionString, and then REPLACE the outer brackets with SUM syntax, plus the original brackets.

Step 3: now the real work begins. Using that original query as a model, we want to generate the original query (starting with the UNION of the two tables), but replacing any references to pivoted columns with the strings we dynamically generated above. Also, while not absolutely required, I've also created a variable for any carriage return/line feed combinations that we want to embed into the generated query (for readability). So we'll construct the entire query into a variable called @SQLPivotQuery.

Step 4: we continue constructing the query, concatenating the syntax we can "hard-wire" with @ActionSelectString (which we generated dynamically to hold all the possible return reason values).

Step 5: finally, we'll generate the final part of the pivot query, which reads from the 2nd common table expression (PIVOTCTE, from the model above) and generates the final SELECT to read from PIVOTCTE and combine it with a 2nd read against PIVOTCTE to implement the grouping sets.
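To make Step 1 concrete, here is a minimal sketch of the STUFF / FOR XML PATH('') trick. The ReturnData table and ReturnReason column are assumptions based on the walkthrough, not the original listing:

    -- Collapse the distinct return reasons into one bracketed, comma-separated list,
    -- e.g. [Reason A],[Reason B],[Reason C],[Reason D]
    DECLARE @ActionString NVARCHAR(MAX);

    SELECT @ActionString =
        STUFF( (SELECT ',' + QUOTENAME(ReturnReason)
                FROM (SELECT DISTINCT ReturnReason FROM dbo.ReturnData) AS d
                ORDER BY ReturnReason
                FOR XML PATH('')),     -- yields one long string: ,[Reason A],[Reason B],...
               1, 1, '');              -- strip the leading comma

    -- @ActionString can now be spliced into the PIVOT ... IN ( ... ) list
    -- of the dynamically built query.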
Finally, we can execute the string using the SQL system stored procedure sp_executesql. So hopefully you can see that the process to follow for this type of effort is: determine what the final query would be, based on your current set of data and values (i.e. build a query model), then write the necessary T-SQL code to generate that query model as a string. Arguably the most important part is determining the unique set of values on which you'll PIVOT, and then collapsing them into one string using the STUFF function and the FOR XML PATH('') trick.

So what's on my mind today? Well, at least 13 items. Two summers ago, I wrote a draft BDR that focused (in part) on the role of education and the value of a good liberal arts background - not just for the software industry but even for other industries as well. One of the themes of this particular BDR emphasized a pivotal and enlightened viewpoint from renowned software architect Allen Holub regarding liberal arts. I'll (faithfully) paraphrase his message: he highlighted the parallels between programming and studying history, by reminding everyone that history is reading and writing (and, I'll add, identifying patterns), and software development is also reading and writing (and again, identifying patterns). And so I wrote an opinion piece that focused on this and other related topics. But until today, I never got around to either publishing or posting it. Every so often I'd think of revising it, and I'd even sit down for a few minutes and make some adjustments to it. But then life in general would get in the way and I'd never finish it.

So what changed? A few weeks ago, fellow CoDe Magazine columnist and industry leader Ted Neward wrote a piece in his regular column, Managed Coder, that caught my attention. The title of the article is On Liberal Arts, and I highly recommend that everyone read it. Ted discusses the value of a liberal arts background, the false dichotomy between a liberal arts background and success in software development, and the need to write/communicate well. He talks about some of his own past encounters with HR personnel management regarding his educational background. He also emphasizes the need to accept and adapt to changes in our industry, as well as the hallmarks of a successful software professional (being reliable, planning ahead, and learning to get past initial conflict with other team members). So it's a great read, as are Ted's other CoDe articles and blog entries. It also got me back to thinking about my views on this (and other topics) as well, and finally motivated me to finish my own editorial. So, better late than never, here are my current Baker's Dozen of Reflections:

I have a saying: "Water freezes at 32 degrees." If you're in a training/mentoring role, you might think you're doing everything in the world to help someone, when in fact they're only feeling a temperature of 34 degrees and therefore things aren't solidifying for them. Sometimes it takes just a little bit more effort or another idea/chemical catalyst or a new perspective - which means those with prior education can draw on different sources. Water freezes at 32 degrees.

Some people can maintain high levels of concentration even with a room full of noisy people. I'm not one of them; occasionally I need some privacy to think through a critical issue. Some people describe this as "you gotta learn to walk away from it". Stated another way, it's a search for the rarefied air.
This past week I spent hours in a half-lit, quiet room with a whiteboard, until I fully understood a problem. It was only then that I could go talk with other developers about a solution. The message here isn't to preach how you should go about your business of solving problems, but rather for everyone to know their strengths and what works, and to use them to your advantage as much as possible.

Some phrases are like fingernails on a chalkboard for me. "Use it as a teaching moment" is one. (Why is it like fingernails on a chalkboard? Because if you're in a mentoring role, you should usually be in teaching-moment mode anyway, however subtly.) Here's another: "I can't really explain it in words, but I understand it." This might sound a bit cold, but if a person truly can't explain something in words, maybe they don't understand it. Sure, a person can have a fuzzy sense of how something works - I can bluff my way through describing how a digital camera works - but the truth is that I don't really understand it all that well. There is a field of study known as epistemology (the study of knowledge). One of the fundamental bases of understanding - whether it's a camera or a design pattern - is the ability to establish context, to identify the chain of related events, the attributes of any components along the way, etc. Yes, understanding is sometimes very hard work, but diving into a topic and breaking it apart is worth the effort. Even those who eschew certification will acknowledge that the process of studying for certification tests will help to fill gaps in knowledge. A database manager is more likely to hire a database developer who can speak extemporaneously (and effortlessly) about transaction isolation levels and triggers, as opposed to someone who sort of knows about it but struggles to describe their usage. There's another corollary here. Ted Neward recommends that developers take up public speaking, blogging, etc. I agree 100%. The process of public speaking and blogging will practically force you to start thinking about topics and breaking down definitions that you might have otherwise taken for granted. A few years ago I thought I understood the T-SQL MERGE statement pretty well. But only after writing about it, speaking about it, and fielding questions from others who had perspectives that never occurred to me did my level of understanding increase exponentially.

I know a story of a hiring manager who once interviewed an author/developer for a contract position. The hiring manager was contemptuous of publications in general, and barked at the applicant, "So, if you're going to work here, would you rather be writing books or writing code?" Yes, I'll grant that in any industry there will be a few pure academics. But what the hiring manager missed was the opportunities for strengthening and sharpening skill sets.

While cleaning out an old box of books, I came across a treasure from the 1980s: Programmers at Work, which contains interviews with a very young Bill Gates, Ray Ozzie, and other well-known names. Every interview and every insight is worth the price of the book. In my view, the most interesting interview was with Butler Lampson, who gave some powerful advice: "To hell with computer literacy. It's absolutely ridiculous. Study mathematics. Learn to think. Read. Write. These things are of more enduring value. Learn how to prove theorems: a lot of evidence has accumulated over the centuries that suggests this skill is transferable to many other things." Butler speaks the truth.
I'll add to that point: learn how to play devil's advocate against yourself. The more you can reality-check your own processes and work, the better off you'll be. The great computer scientist/author Allen Holub made the connection between software development and the liberal arts - specifically, the subject of history. Here was his point: what is history? Reading and writing. What is software development? Among other things, reading and writing. I used to give my students T-SQL essay questions as practice tests. One student joked that I acted more like a law professor. Well, just like Coach Don Haskins said in the movie Glory Road, "my way is hard". I firmly believe in a strong intellectual foundation for any profession. Just like applications can benefit from frameworks, individuals and their thought processes can benefit from human frameworks as well. That's the fundamental basis of scholarship. There is a story that back in the 1970s, IBM expanded their recruiting efforts in the major universities by focusing on the best and brightest of liberal arts graduates. Even then they recognized that the best readers and writers might someday become strong programmer/systems analysts. (Feel free to use that story on any HR type who insists that a candidate must have a computer science degree.) And speaking of history: if for no other reason, it's important to remember the history of product releases. If I'm doing work at a client site that's still using SQL Server 2008 or even (gasp) SQL Server 2005, I have to remember what features were implemented in which versions over time.

Ever have a favorite doctor whom you liked because he/she explained things in plain English, gave you the straight truth, and earned your trust to operate on you? Those are mad skills, and they are the result of experience and HARD WORK that take years and even decades to cultivate. There are no guarantees of job success: focus on the facts, take a few calculated risks when you're sure you can see your way to the finish line, let the chips fall where they may, and never lose sight of being just like that doctor who earned your trust. Even though some days I fall short, I try to treat my client and their data as a doctor would treat patients. Even though a doctor makes more money.

There are many clichés I detest, but here's one I don't hate: "There is no such thing as a bad question." As a former instructor, one thing that drew my ire was hearing someone criticize another person for asking a supposedly stupid question. A question indicates a person acknowledges they have some gap in knowledge they're looking to fill. Yes, some questions are better worded than others, and some questions require additional framing before they can be answered. But the journey from forming a question to an answer is likely to generate an active mental process in others. These are all GOOD things. Many good and fruitful discussions originate with a stupid question.

I work across the board in SSIS, SSAS, SSRS, MDX, PPS, SharePoint, Power BI, DAX - all the tools in the Microsoft BI stack. I still write some .NET code from time to time. But guess what I still spend so much time doing: writing T-SQL code to profile data as part of the discovery process. All application developers should have good T-SQL chops.

Ted Neward writes (correctly) about the need to adapt to technology changes. I'll add to that the need to adapt to client/employer changes. Companies change business rules. Companies acquire other companies (or become the target of an acquisition).
Companies make mistakes in communicating business requirements and specifications. Yes, we can sometimes play a role in helping to manage those changes - and sometimes we're the fly, not the windshield. These changes sometimes cause great pain for everyone, especially the I.T. people. This is why the term "fact of life" exists: we have to deal with it. Just like no developer writes bug-free code every time, no I.T. person deals well with change every single time. One of the biggest struggles I've had in my 28 years in this industry is showing patience and restraint when changes are flying from many different directions. Here is where my prior suggestion about searching for the rarefied air can help. If you can manage to assimilate changes into your thought process without feeling overwhelmed, odds are you'll be a significant asset. In the last 15 months I've had to deal with a huge amount of professional change. It's been very difficult at times, but I've resolved that change will be the norm and I've tried to tweak my own habits as best I can to cope with frequent (and uncertain) change. It's hard, very hard. But as coach Jimmy Dugan said in the movie A League of Their Own: "Of course it's hard. If it wasn't hard, everyone would do it. The hard is what makes it great." A powerful message.

There's been talk in the industry over the last few years about conduct at professional conferences (and conduct in the industry as a whole). Many respected writers have written very good editorials on the topic. Here's my input, for what it's worth. It's a message to those individuals who have chosen to behave badly: Dude, it shouldn't be that hard to behave like an adult. A few years ago, CoDe Magazine Chief Editor Rod Paddock made some great points in an editorial about Codes of Conduct at conferences. It's definitely unfortunate to have to remind people of what they should expect out of themselves. But the problems go deeper. A few years ago I sat on a five-person panel (3 women, 2 men) at a community event on Women in Technology. The other male stated that men succeed in this industry because the Y chromosome gives men an advantage in areas of performance. The individual who made these remarks is a highly respected technology expert, and not some bozo making dongle remarks at a conference or sponsoring a programming contest where first prize is a date with a bikini model. Our world is becoming increasingly polarized (just watch the news for five minutes), sadly with emotion often winning over reason. Even in our industry, I recently heard someone in a position of responsibility bash software tool XYZ based on a ridiculous premise and then give false praise to a competing tool. So many opinions, so many arguments, but here's the key: before taking a stand, do your homework and get the facts. Sometimes both sides are partly right - or wrong. There's only one way to find out: get the facts. As Robert Heinlein wrote, "Facts are your single clue. Get the facts!" Of course, once you get the facts, the next step is to express them in a meaningful and even compelling way. There's nothing wrong with using some emotion in an intellectual debate, but it IS wrong to replace an intellectual debate with emotion and false agenda. A while back I faced resistance to SQL Server Analysis Services from someone who claimed the tool couldn't do feature XYZ. The specifics of XYZ don't matter here. I spent about two hours that evening working up a demo to cogently demonstrate that the original claim was false. In that example, it worked.
I can't swear it will always work, but to me that's the only way. I'm old enough to remember life as a teen in the 1970s. Back then, when a person lost his/her job, (often) it was because the person just wasn't cutting the mustard. Fast-forward to today: a sad fact of life is that even talented people are now losing their jobs because of changing economic conditions. There's never a foolproof method for immunity, but now more than ever it's critical to provide a high level of what I call the Three Vs (value, versatility, and velocity) for your employer/clients. I might not always like working weekends or very late at night to do the proverbial work of two people, but then I remember there are folks out there who would give anything to be working at 1 AM to feed their families and pay their bills. Always be yourself - your BEST self.

Some people need inspiration from time to time. Here's mine: the great sports movie Glory Road. If you've never watched it, and even if you're not a sports fan, I can almost guarantee you'll be moved like never before. And I'll close with this. If you need some major motivation, I'll refer to a story from 2006. Jason McElwain, a high school student with autism, came off the bench to score twenty points in a high school basketball game in Rochester, New York. There's a great YouTube video of it. His mother said it all: "This is the first moment Jason has ever succeeded and is proud of himself. I look at autism as the Berlin Wall. He cracked it."

To anyone who wanted to attend my session at today's SQL Saturday event in DC: I apologize that the session had to be cancelled. I hate to make excuses, but a combination of getting back late from Detroit (client trip), a car that's dead (blown head gasket), and some sudden health issues with my wife have made it impossible for me to attend. Back in August, I did the same session (ColumnStore Index) for PASS as a webinar. You can go to this link to access the video (it'll be streamed, as all PASS videos are streamed). The link does require that you fill out your name and email address, but that's it. And then you can watch the video. Feel free to contact me if you have questions, at kgoff@kevinsgoff.net.

November 15, 2013. Getting started with Windows Azure and creating SQL Databases in the cloud can be a bit daunting, especially if you've never tried out any of Microsoft's cloud offerings. Fortunately, I've created a webcast to help people get started. This is an absolute beginner's guide to creating SQL Databases under Windows Azure. It assumes zero prior knowledge of Azure. You can go to the BDBI Webcasts section of this website and check out my webcast (dated 11/10/2013). Or you can just download the webcast videos right here: here is part 1 and here is part 2. You can also download the slide deck here.

November 03, 2013. Topic this week: SQL Server Snapshot Isolation Levels, added in SQL Server 2005. To this day, there are still many SQL developers, many good SQL developers, who either aren't aware of this feature or haven't had time to look at it. Hopefully this information will help. A companion webcast will be uploaded in the next day; look for it in the BDBI Webcasts section of this blog.

October 26, 2013. I'm going to start a weekly post of T-SQL tips, covering many different versions of SQL Server over the years. Here's a challenge many developers face. I'll whittle it down to a very simple example, but one where the pattern applies to many situations.
Suppose you have a stored procedure that receives a single vendor ID and updates the freight for all orders with that vendor ID:

    CREATE PROCEDURE dbo.UpdateVendorOrders
        @VendorID int
    AS
        -- Adjust the freight for this vendor's orders (the adjustment amount is illustrative).
        UPDATE Purchasing.PurchaseOrderHeader
            SET Freight = Freight + 1
        WHERE VendorID = @VendorID;

Now, suppose we need to run this for a set of vendor IDs. Today we might run it for three vendors, tomorrow for five vendors, the next day for 100 vendors. We want to pass in the vendor IDs. If you've worked with SQL Server, you can probably guess where I'm going with this. The big question is: how do we pass a variable number of vendor IDs? Or, stated more generally, how do we pass an array, or a table of keys, to a procedure? Something along the lines of:

    EXEC dbo.UpdateVendorOrders @SomeListOfVendors

Over the years, developers have come up with different methods. Going all the way back to SQL Server 2000, developers might create a comma-separated list of vendor keys and pass the CSV list as a varchar to the procedure. The procedure would shred the CSV varchar variable into a table variable and then join the PurchaseOrderHeader table to that table variable (to update the Freight for just those vendors in the table). I wrote about this in CoDe Magazine back in early 2005 (code-magazine.com/articleprint.aspx?quickid=0503071&printmode=true, Tip 3). In SQL Server 2005, you could actually create an XML string of the vendor IDs, pass the XML string to the procedure, and then use XQUERY to shred the XML into a table variable. I also wrote about this in CoDe Magazine back in 2007 (code-magazine.com/articleprint.aspx?quickid=0703041&printmode=true, Tip 12). Also, some developers will populate a temp table ahead of time, and then reference the temp table inside the procedure.

All of these certainly work, and developers have had to use these techniques before, because for years there was NO WAY to directly pass a table to a SQL Server stored procedure - until SQL Server 2008, when Microsoft implemented the table type. This FINALLY allowed developers to pass an actual table of rows to a stored procedure. Now, it does require a few steps. We can't just pass any old table to a procedure. It has to be a pre-defined type (a template). So let's suppose we always want to pass a set of integer keys to different procedures. One day it might be a list of vendor keys. The next day it might be a list of customer keys. So we can create a generic table type of keys, one that can be instantiated for customer keys, vendor keys, etc.:

    CREATE TYPE IntKeysTT AS TABLE ( IntKey int NOT NULL );

So I've created a table type called IntKeysTT. It's defined to have one column, an IntKey. Now suppose I want to load it with vendors who have a credit rating of 1, and then take that list of vendor keys and pass it to a procedure:

    DECLARE @VendorList IntKeysTT;

    INSERT INTO @VendorList
    SELECT BusinessEntityID
    FROM Purchasing.Vendor
    WHERE CreditRating = 1;

So, I now have a table type variable - not just any table variable, but a table type variable (that I populated the same way I would populate a normal table variable). It's in server memory (unless it needs to spill to tempdb) and is therefore private to the connection/process. OK, can I pass it to the stored procedure now? Well, not yet - we need to modify the procedure to receive a table type. Here's the code:

    CREATE PROCEDURE dbo.UpdateVendorOrdersFromTT
        @IntKeysTT IntKeysTT READONLY
    AS
        -- Same illustrative freight adjustment, now driven by the passed-in key list.
        UPDATE Purchasing.PurchaseOrderHeader
            SET Freight = Freight + 1
        FROM Purchasing.PurchaseOrderHeader
            JOIN @IntKeysTT AS TempVendorList
                ON PurchaseOrderHeader.VendorID = TempVendorList.IntKey;
Notice how the procedure receives @IntKeysTT as a table type (again, not just a regular table, but a table type). It also receives it as a READONLY parameter. You CANNOT modify the contents of this table type inside the procedure. Usually you won't want to anyway; you simply want to read from it. Well, now you can reference the table type as a parameter and then utilize it in the JOIN statement, as you would any other table variable. So there you have it. A bit of work to set up the table type, but in my view, definitely worth it. Additionally, if you pass values from .NET, you're in luck. You can pass an ADO.NET DataTable (with the parameter's type name set to the name of the table type) to the procedure. For .NET developers who have had to pass CSV lists, XML strings, etc. to a procedure in the past, this is a huge benefit.

Finally, I want to talk about another approach people have used over the years: SQL Server cursors. At the risk of sounding dogmatic, I strongly advise against cursors, unless there is just no other way. Cursors are expensive operations in the server. For instance, someone might use a cursor approach and implement the solution this way:

    DECLARE @VendorID int;

    DECLARE dbcursor CURSOR FAST_FORWARD FOR
        SELECT BusinessEntityID
        FROM Purchasing.Vendor
        WHERE CreditRating = 1;

    OPEN dbcursor;
    FETCH NEXT FROM dbcursor INTO @VendorID;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Call the single-vendor procedure once per key.
        EXEC dbo.UpdateVendorOrders @VendorID;
        FETCH NEXT FROM dbcursor INTO @VendorID;
    END

    CLOSE dbcursor;
    DEALLOCATE dbcursor;

The best thing I'll say about this is that it works. And yes, getting something to work is a milestone. But getting something to work and getting something to work acceptably are two different things. Even if this process only takes 5-10 seconds to run, in those 5-10 seconds the cursor utilizes SQL Server resources quite heavily. That's not a good idea in a large production environment. Additionally, the more rows there are in the cursor to fetch and the more executions of the procedure, the slower it will be. When I ran both processes (the cursor approach and then the table type approach) against a small sampling of vendors (5 vendors), the processing times were 260 ms and 60 ms, respectively. So the table type approach was roughly 4 times faster. But when I ran the two scenarios against a much larger number of vendors (84 vendors), the difference was staggering: 6701 ms versus 207 ms, respectively. So the table type approach was roughly 32 times faster.

Again, the CURSOR approach is definitely the least attractive approach. Even in SQL Server 2005, it would have been better to create a CSV list or an XML string (providing the number of keys could be stored in a scalar variable). But now that there is a table type feature in SQL Server 2008, you can achieve the objective with a feature that's more closely modeled to the way developers are thinking - specifically, how do we pass a table to a procedure? Now we have an answer (a short end-to-end example appears at the end of this post). Hope you find this feature helpful. Feel free to post a comment.
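For completeness, here is what calling the new procedure looks like end to end - a short sketch that simply stitches together the pieces above:

    -- Build the key list and pass it straight to the table-type procedure.
    DECLARE @VendorList IntKeysTT;

    INSERT INTO @VendorList (IntKey)
    SELECT BusinessEntityID
    FROM Purchasing.Vendor
    WHERE CreditRating = 1;

    EXEC dbo.UpdateVendorOrdersFromTT @IntKeysTT = @VendorList;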
