Distributed DBMS - Quick Guide

To run any organization, a well-maintained database is essential. In the past, databases were centralized. However, with increasing globalization, organizations tend to be spread across the globe. They may choose to distribute data over local servers instead of a central database. Thus arrived the concept of Distributed Databases.

This chapter gives an overview of databases and Database Management Systems (DBMS). A database is an ordered collection of related data. A DBMS is a software package for working with a database. A detailed study of DBMS is available in our tutorial named "Learn DBMS". In this chapter, we revise the main concepts so that the study of DDBMS can be done with ease. The three topics covered are database schemas, types of databases, and operations on databases.

Database and Database Management System

A database is an ordered collection of related data that is built for a specific purpose. A database may be organized as a collection of multiple tables, where a table represents a real-world element or entity. Each table has several different fields that represent the characteristic features of the entity.

For example, a company database may include tables for projects, employees, departments, products, and financial records. The fields in the Employee table may be Name, Company_Id, Date_of_Joining, and so forth.

A database management system is a collection of programs that enables the creation and maintenance of a database. A DBMS is available as a software package that facilitates the definition, construction, manipulation, and sharing of data in a database. Definition of a database includes a description of its structure. Construction of a database involves the actual storing of the data on some storage medium. Manipulation refers to retrieving information from the database, updating the database, and generating reports. Sharing of data allows the data to be accessed by different users or programs.

Examples of DBMS Application Areas

Automatic Teller Machines
Train Reservation System
Employee Management System
Student Information System

Examples of DBMS Packages

MySQL
Oracle
SQL Server
dBASE
FoxPro
PostgreSQL, etc.

Database Schemas

A database schema is a description of the database that is specified during database design and is expected to change infrequently. It defines the organization of the data, the relationships among them, and the constraints associated with them.

Databases are often represented through the three-schema architecture, also called the ANSI-SPARC architecture. The goal of this architecture is to separate the user applications from the physical database. The three levels are -

Internal Level having Internal Schema - It describes the physical structure, details of internal storage, and access paths for the database.
Conceptual Level having Conceptual Schema - It describes the structure of the whole database while hiding the details of physical storage of data. It describes entities, attributes with their data types and constraints, user operations, and relationships.
External or View Level having External Schemas or Views - It describes the portion of a database relevant to a particular user or group of users while hiding the rest of the database.
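
The split between the conceptual and external levels can be illustrated with a small sketch using Python's built-in sqlite3 module. The EMPLOYEE table and the EMP_PUBLIC view are hypothetical names chosen only for illustration; the internal level is handled by the storage engine and is not visible in the code.

```python
import sqlite3

# Internal level: physical storage is managed by SQLite (here, in memory).
conn = sqlite3.connect(":memory:")

# Conceptual level: the full logical structure of a hypothetical table.
conn.execute(
    "CREATE TABLE EMPLOYEE (EMP_ID INTEGER PRIMARY KEY, NAME TEXT, SALARY INTEGER)"
)
conn.execute("INSERT INTO EMPLOYEE VALUES (1, 'ASHA', 50000)")

# External level: a view exposing only the part relevant to one user group.
conn.execute("CREATE VIEW EMP_PUBLIC AS SELECT EMP_ID, NAME FROM EMPLOYEE")

cursor = conn.execute("SELECT * FROM EMP_PUBLIC")
public_columns = [col[0] for col in cursor.description]
public_rows = cursor.fetchall()
```

A user who is given only EMP_PUBLIC sees the employee id and name, while the salary column stays hidden.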

Types of DBMS

There are four types of DBMS, classified by data model.

Hierarchical DBMS

In a hierarchical DBMS, the relationships among data in the database are established so that one data element exists as a subordinate of another. The data elements have parent-child relationships and are modeled using the "tree" data structure. These systems are very fast and simple.

Network DBMS

A network DBMS is one where the relationships among data in the database are of the many-to-many type, in the form of a network. The structure is generally complicated due to the existence of numerous many-to-many relationships. A network DBMS is modeled using the "graph" data structure.

Relational DBMS

In relational databases, the database is represented in the form of relations. Each relation models an entity and is represented as a table of values. In a relation or table, a row is called a tuple and denotes a single record. A column is called a field or an attribute and denotes a characteristic property of the entity. RDBMS is the most popular type of database management system.

For example - A Student Relation -

Object Oriented DBMS

Object-oriented DBMS is derived from the model of the object-oriented programming paradigm. They are helpful in representing both persistent data, as stored in databases, and transient data, as found in executing programs. They use small, reusable elements called objects. Each object contains a data part and a set of operations that work upon the data. The object and its attributes are accessed through pointers instead of being stored in relational table models.

For example - A simplified Bank Account object-oriented database -
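
A rough Python sketch of such an object, bundling a data part with the operations that work on it, could look as follows. The attribute and method names (account_no, holder, balance, deposit, withdraw) are assumptions made for illustration and are not taken from any particular object-oriented DBMS.

```python
from dataclasses import dataclass


@dataclass
class BankAccount:
    # Data part of the object.
    account_no: int
    holder: str
    balance: float = 0.0

    # Operations that work upon the data.
    def deposit(self, amount: float) -> None:
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amount

    def withdraw(self, amount: float) -> None:
        if amount <= 0:
            raise ValueError("withdrawal must be positive")
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount


account = BankAccount(account_no=101, holder="ANKIT JHA")
account.deposit(500.0)
account.withdraw(200.0)
```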

Distributed DBMS

A distributed database is a set of interconnected databases that is distributed over a computer network or the internet. A Distributed Database Management System (DDBMS) manages the distributed database and provides mechanisms that make the distribution transparent to users. In these systems, data is intentionally distributed among multiple nodes so that all computing resources of the organization can be optimally used.

Operations on DBMS

The four basic operations on a database are Create, Retrieve, Update, and Delete.

CREATE database structure and populate it with data - Creation of a database relation involves specifying the data structures, data types, and constraints of the data to be stored.

Example - SQL command to create a student table -

CREATE TABLE STUDENT (
   ROLL INTEGER PRIMARY KEY,
   NAME VARCHAR2(25),
   YEAR INTEGER,
   STREAM VARCHAR2(30)
);

Once the data format is defined, the actual data is stored in accordance with the format on some storage medium.

Example - SQL command to insert a single tuple into the student table -

INSERT INTO STUDENT ( ROLL, NAME, YEAR, STREAM) 
VALUES ( 1, 'ANKIT JHA', 1, 'COMPUTER SCIENCE');

RETRIEVE information from the database - Retrieving information generally involves selecting a subset of a table or displaying data from the table after some computations have been done. It is done by querying the table.

Example - To retrieve the names of all students of the Computer Science stream, the following SQL query needs to be executed -

SELECT NAME FROM STUDENT 
WHERE STREAM = 'COMPUTER SCIENCE';

UPDATE information stored and modify database structure - Updating a table involves changing old values in the existing table's rows with new values.

Example - SQL command to change the stream from Electronics to Electronics and Communications -

UPDATE STUDENT 
SET STREAM = 'ELECTRONICS AND COMMUNICATIONS' 
WHERE STREAM = 'ELECTRONICS';

Modifying a database means changing the structure of a table. However, modification of a table is subject to a number of restrictions.

Example - To add a new field or column, say address, to the Student table, we use the following SQL command -

ALTER TABLE STUDENT 
ADD ( ADDRESS VARCHAR2(50) );

DELETE information stored or delete a table as a whole - Deletion of specific information involves removing selected rows from the table that satisfy certain conditions.

Example - To delete all students who are currently in the 4th year when they graduate, we use the SQL command -

DELETE FROM STUDENT 
WHERE YEAR = 4;

Alternatively, the whole table may be removed from the database.

Example - To remove the student table completely, the SQL command used is -

DROP TABLE STUDENT;
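
The four operations in this section can be exercised end to end with Python's built-in sqlite3 module. This is only a sketch: SQLite's dialect differs from the Oracle-style SQL above (VARCHAR instead of VARCHAR2, ADD COLUMN instead of a parenthesized ADD), and the second student row is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE: define the structure, then populate it.
conn.execute(
    "CREATE TABLE STUDENT ("
    " ROLL INTEGER PRIMARY KEY, NAME VARCHAR(25), YEAR INTEGER, STREAM VARCHAR(30))"
)
conn.executemany(
    "INSERT INTO STUDENT (ROLL, NAME, YEAR, STREAM) VALUES (?, ?, ?, ?)",
    [
        (1, "ANKIT JHA", 1, "COMPUTER SCIENCE"),
        (2, "RIYA SEN", 4, "ELECTRONICS"),  # hypothetical extra row
    ],
)

# RETRIEVE: select a subset of the table.
cs_names = [
    row[0]
    for row in conn.execute(
        "SELECT NAME FROM STUDENT WHERE STREAM = 'COMPUTER SCIENCE'"
    )
]

# UPDATE: change stored values, then modify the table structure.
conn.execute(
    "UPDATE STUDENT SET STREAM = 'ELECTRONICS AND COMMUNICATIONS' "
    "WHERE STREAM = 'ELECTRONICS'"
)
streams = [row[0] for row in conn.execute("SELECT STREAM FROM STUDENT ORDER BY ROLL")]
conn.execute("ALTER TABLE STUDENT ADD COLUMN ADDRESS VARCHAR(50)")
columns = [row[1] for row in conn.execute("PRAGMA table_info(STUDENT)")]

# DELETE: remove the rows that satisfy a condition.
conn.execute("DELETE FROM STUDENT WHERE YEAR = 4")
remaining = conn.execute("SELECT COUNT(*) FROM STUDENT").fetchone()[0]
```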

This chapter introduces the concept of DDBMS. In a distributed database, there are a number of databases that may be geographically distributed all over the world. A distributed DBMS manages the distributed database in such a manner that it appears as one single database to users. In the later part of the chapter, we go on to study the factors that lead to distributed databases, along with their advantages and disadvantages.

A distributed database is a collection of multiple interconnected databases, which are spread physically across various locations that communicate via a computer network.

Features

Databases in the collection are logically interrelated with each other. Often they represent a single logical database.
Data is physically stored across multiple sites. Data in each site can be managed by a DBMS independent of the other sites.
The processors in the sites are connected via a network. They do not have any multiprocessor configuration.
A distributed database is not a loosely connected file system.
A distributed database incorporates transaction processing, but it is not synonymous with a transaction processing system.

Distributed Database Management System

A distributed database management system (DDBMS) is a centralized software system that manages a distributed database as if it were all stored in a single location.

Features

It is used to create, retrieve, update, and delete distributed databases.
It synchronizes the database periodically and provides access mechanisms by virtue of which the distribution becomes transparent to the users.
It ensures that data modified at any site is universally updated.
It is used in application areas where large volumes of data are processed and accessed by numerous users simultaneously.
It is designed for heterogeneous database platforms.
It maintains the confidentiality and data integrity of the databases.

Factors Encouraging DDBMS

The following factors encourage moving over to DDBMS -

Distributed Nature of Organizational Units - Most organizations today are subdivided into multiple units that are physically distributed over the globe. Each unit requires its own set of local data. Thus, the overall database of the organization becomes distributed.
Need for Sharing of Data - The multiple organizational units often need to communicate with each other and share their data and resources. This demands common databases or replicated databases that should be used in a synchronized manner.
Support for Both OLTP and OLAP - Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) work upon diversified systems which may have common data. Distributed database systems aid both kinds of processing by providing synchronized data.
Database Recovery - One of the common techniques used in DDBMS is replication of data across different sites. Replication of data automatically helps in data recovery if the database at any site is damaged. Users can access data from other sites while the damaged site is being reconstructed. Thus, database failure may become almost inconspicuous to users.
Support for Multiple Application Software - Most organizations use a variety of application software, each with its specific database support. DDBMS provides a uniform functionality for using the same data among different platforms.

Advantages of Distributed Databases

Following are the advantages of distributed databases over centralized databases.

Modular Development - If the system needs to be expanded to new locations or new units, in centralized database systems, the action requires substantial effort and disruption of the existing functioning. However, in distributed databases, the work simply requires adding new computers and local data to the new site and finally connecting them to the distributed system, with no interruption in current functions.

More Reliable - In case of database failures, the total system of a centralized database comes to a halt. However, in distributed systems, when a component fails, the system continues to function, possibly at reduced performance. Hence DDBMS is more reliable.

Better Response - If data is distributed in an efficient manner, then user requests can be met from local data itself, thus providing faster response. On the other hand, in centralized systems, all queries have to pass through the central computer for processing, which increases the response time.

Lower Communication Cost - In distributed database systems, if data is located locally where it is mostly used, then the communication costs for data manipulation can be minimized. This is not feasible in centralized systems.

Disadvantages of Distributed Databases

Following are some of the problems associated with distributed databases.

Need for complex and expensive software - DDBMS demands complex and often expensive software to provide data transparency and coordination across the several sites.
Processing overhead - Even simple operations may require a large number of communications and additional calculations to provide uniformity in data across the sites.
Data integrity - The need for updating data in multiple sites poses problems of data integrity.
Overheads for improper data distribution - Responsiveness of queries is largely dependent upon proper data distribution. Improper data distribution often leads to very slow response to user requests.

In this part of the tutorial, we will study the different aspects that aid in designing distributed database environments. This chapter starts with the types of distributed databases. Distributed databases can be classified into homogeneous and heterogeneous databases having further divisions. The next section of this chapter discusses the distributed architectures, namely client-server, peer-to-peer, and multi-DBMS. Finally, the different design alternatives like replication and fragmentation are introduced.

Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database environments, each with further sub-divisions, as shown in the following illustration.

Homogeneous Distributed Databases

In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its properties are -

The sites use very similar software.
The sites use identical DBMS or DBMS from the same vendor.
Each site is aware of all other sites and cooperates with other sites to process user requests.
The database is accessed through a single interface as if it is a single database.

Types of Homogeneous Distributed Database

There are two types of homogeneous distributed database -

Autonomous - Each database is independent and functions on its own. They are integrated by a controlling application and use message passing to share data updates.
Non-autonomous - Data is distributed across the homogeneous nodes and a central or master DBMS coordinates data updates across the sites.

Heterogeneous Distributed Databases

In a heterogeneous distributed database, different sites have different operating systems, DBMS products, and data models. Its properties are -

Different sites use dissimilar schemas and software.
The system may be composed of a variety of DBMSs like relational, network, hierarchical, or object oriented.
Query processing is complex due to dissimilar schemas.
Transaction processing is complex due to dissimilar software.
A site may not be aware of other sites, so there is limited cooperation in processing user requests.

Types of Heterogeneous Distributed Databases

Federated - The heterogeneous database systems are independent in nature and integrated together so that they function as a single database system.
Un-federated - The database systems employ a central coordinating module through which the databases are accessed.

Distributed DBMS Architectures

DDBMS architectures are generally developed depending on three parameters -

Distribution - It states the physical distribution of data across the different sites.
Autonomy - It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently.
Heterogeneity - It refers to the uniformity or dissimilarity of the data models, system components, and databases.

Architectural Models

Some of the common architectural models are -

Client-Server Architecture for DDBMS
Peer-to-Peer Architecture for DDBMS
Multi-DBMS Architecture

Client-Server Architecture for DDBMS

This is a two-level architecture where the functionality is divided into servers and clients. The server functions primarily encompass data management, query processing, optimization, and transaction management. Client functions mainly include the user interface. However, clients also have some functions such as consistency checking and transaction management.

The two different client-server architectures are -

Single Server Multiple Client
Multiple Server Multiple Client (shown in the following diagram)

Peer-to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and a server for imparting database services. The peers share their resources with other peers and coordinate their activities.

This architecture generally has four levels of schemas -

Global Conceptual Schema - Depicts the global logical view of data.
Local Conceptual Schema - Depicts logical data organization at each site.
Local Internal Schema - Depicts physical data organization at each site.
External Schema - Depicts the user view of data.

Multi-DBMS Architectures

This is an integrated database system formed by a collection of two or more autonomous database systems.

Multi-DBMS can be expressed through six levels of schemas -

Multi-database View Level - Depicts multiple user views comprising subsets of the integrated distributed database.
Multi-database Conceptual Level - Depicts the integrated multi-database that comprises global logical multi-database structure definitions.
Multi-database Internal Level - Depicts the data distribution across different sites and the multi-database to local data mapping.
Local database View Level - Depicts the public view of local data.
Local database Conceptual Level - Depicts local data organization at each site.
Local database Internal Level - Depicts physical data organization at each site.

There are two design alternatives for multi-DBMS -

Model with multi-database conceptual level.
Model without multi-database conceptual level.

Design Alternatives

The distribution design alternatives for the tables in a DDBMS are as follows -

Non-replicated and non-fragmented
Fully replicated
Partially replicated
Fragmented
Mixed

Non-replicated & Non-fragmented

In this design alternative, different tables are placed at different sites. Data is placed so that it is in close proximity to the site where it is used most. It is most suitable for database systems where the percentage of queries needed to join information in tables placed at different sites is low. If an appropriate distribution strategy is adopted, then this design alternative helps to reduce the communication cost during data processing.

Fully Replicated

In this design alternative, one copy of all the database tables is stored at each site. Since each site has its own copy of the entire database, queries are very fast, requiring negligible communication cost. On the contrary, the massive redundancy in data incurs a huge cost during update operations. Hence, this is suitable for systems where a large number of queries must be handled whereas the number of database updates is low.
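
The read/update trade-off of full replication can be sketched with a toy in-memory model. The class, the site names, and the message counter below are all hypothetical; a real DDBMS would use a proper replication protocol rather than a simple loop.

```python
class FullyReplicatedTable:
    """Toy model: every site holds a complete copy of the table."""

    def __init__(self, site_names):
        self.replicas = {name: {} for name in site_names}
        self.messages_sent = 0  # counts remote writes only

    def read(self, site, key):
        # Reads are served from the local copy, so no network message is needed.
        return self.replicas[site].get(key)

    def write(self, site, key, value):
        # Every copy must be updated; each remote copy costs one message.
        for name, copy in self.replicas.items():
            copy[key] = value
            if name != site:
                self.messages_sent += 1


table = FullyReplicatedTable(["DELHI", "MUMBAI", "PUNE"])
table.write("DELHI", 1, "ANKIT JHA")
value_at_pune = table.read("PUNE", 1)
```

With three sites, one write issued at DELHI costs two remote messages in this model, while the read at PUNE costs none.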

Partially Replicated

Copies of tables or portions of tables are stored at different sites. The distribution of the tables is done in accordance with the frequency of access. This takes into consideration the fact that the frequency of accessing the tables varies considerably from site to site. The number of copies of the tables (or portions) depends on how frequently the access queries execute and the sites which generate the access queries.

Fragmented

In this design, a table is divided into two or more pieces referred to as fragments or partitions, and each fragment can be stored at a different site. This considers the fact that it seldom happens that all data stored in a table is required at a given site. Moreover, fragmentation increases parallelism and provides better disaster recovery. Here, there is only one copy of each fragment in the system, i.e. no redundant data.

The three fragmentation techniques are -

Vertical fragmentation
Horizontal fragmentation
Hybrid fragmentation

Mixed Distribution

This is a combination of fragmentation and partial replication. Here, the tables are initially fragmented in any form (horizontal or vertical), and then these fragments are partially replicated across the different sites according to the frequency of accessing the fragments.

In the last chapter, we introduced different design alternatives. In this chapter, we will study the strategies that aid in adopting the designs. The strategies can be broadly divided into replication and fragmentation. However, in most cases, a combination of the two is used.

Data Replication

Data replication is the process of storing separate copies of the database at two or more sites. It is a popular fault tolerance technique of distributed databases.

Advantages of Data Replication

Reliability - In case of failure of any site, the database system continues to work since a copy is available at another site.
Reduction in Network Load - Since local copies of data are available, query processing can be done with reduced network usage, particularly during prime hours. Data updating can be done at non-prime hours.
Quicker Response - Availability of local copies of data ensures quick query processing and consequently quick response time.
Simpler Transactions - Transactions require fewer joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.

Disadvantages of Data Replication

Increased Storage Requirements - Maintaining multiple copies of data is associated with increased storage costs. The storage space required is a multiple of the storage required for a centralized system.
Increased Cost and Complexity of Data Updating - Each time a data item is updated, the update needs to be reflected in all the copies of the data at the different sites. This requires complex synchronization techniques and protocols.
Undesirable Application-Database Coupling - If complex update mechanisms are not used, removing data inconsistency requires complex coordination at the application level. This results in undesirable coupling between the application and the database.

Some commonly used replication techniques are -

Snapshot replication
Near-real-time replication
Pull replication

Fragmentation

Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are called fragments. Fragmentation can be of three types: horizontal, vertical, and hybrid (a combination of horizontal and vertical). Horizontal fragmentation can further be classified into two techniques: primary horizontal fragmentation and derived horizontal fragmentation.

Fragmentation should be done in such a way that the original table can be reconstructed from the fragments whenever required. This requirement is called "reconstructiveness".

Advantages of Fragmentation

Since data is stored close to the site of usage, the efficiency of the database system is increased.
Local query optimization techniques are sufficient for most queries since data is locally available.
Since irrelevant data is not available at the sites, the security and privacy of the database system can be maintained.

Disadvantages of Fragmentation

When data from different fragments is required, access may be very slow.
In case of recursive fragmentation, the job of reconstruction needs expensive techniques.
Lack of back-up copies of data at different sites may render the database ineffective in case of failure of a site.

Vertical Fragmentation

In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to maintain reconstructiveness, each fragment should contain the primary key field(s) of the table. Vertical fragmentation can be used to enforce privacy of data.

For example, let us consider that a University database keeps records of all registered students in a Student table having the following schema.

STUDENT

Regd_No	Name	Course	Address	Semester	Fees	Marks

Now, the fees details are maintained in the accounts section. In this case, the designer will fragment the database as follows -

CREATE TABLE STD_FEES AS 
   SELECT Regd_No, Fees 
   FROM STUDENT;
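
The reconstructiveness requirement for vertical fragments can be checked with a short sqlite3 sketch: because both fragments carry the primary key Regd_No, a join on that key rebuilds the original rows. The complementary fragment STD_INFO and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT, Course TEXT,"
    " Address TEXT, Semester INTEGER, Fees INTEGER, Marks INTEGER)"
)
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "ANKIT JHA", "Computer Science", "DELHI", 3, 45000, 82),
        (2, "RIYA SEN", "Electronics", "PUNE", 5, 42000, 77),
    ],
)

# Two vertical fragments; both keep the primary key Regd_No.
conn.execute("CREATE TABLE STD_FEES AS SELECT Regd_No, Fees FROM STUDENT")
conn.execute(
    "CREATE TABLE STD_INFO AS"
    " SELECT Regd_No, Name, Course, Address, Semester, Marks FROM STUDENT"
)

# Reconstruction: join the fragments on the shared primary key.
rebuilt = conn.execute(
    "SELECT i.Regd_No, i.Name, i.Course, i.Address, i.Semester, f.Fees, i.Marks"
    " FROM STD_INFO i JOIN STD_FEES f ON i.Regd_No = f.Regd_No"
    " ORDER BY i.Regd_No"
).fetchall()
original = conn.execute("SELECT * FROM STUDENT ORDER BY Regd_No").fetchall()
```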

Horizontal Fragmentation

Horizontal fragmentation groups the tuples of a table in accordance with the values of one or more fields. Horizontal fragmentation should also conform to the rule of reconstructiveness. Each horizontal fragment must have all columns of the original base table.

For example, in the student schema, if the details of all students of the Computer Science course need to be maintained at the School of Computer Science, then the designer will horizontally fragment the database as follows -

CREATE TABLE COMP_STD AS
   SELECT * FROM STUDENT
   WHERE COURSE = 'Computer Science';
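
For horizontal fragments, reconstruction is a union of the fragments rather than a join. The sqlite3 sketch below assumes a reduced three-column STUDENT table and a hypothetical complementary fragment OTHER_STD holding every row that is not in COMP_STD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Course is declared NOT NULL so the two predicates below cover every row.
conn.execute(
    "CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT, Course TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?)",
    [
        (1, "ANKIT JHA", "Computer Science"),
        (2, "RIYA SEN", "Electronics"),
        (3, "AMIT RAO", "Computer Science"),
    ],
)

# Horizontal fragments: same columns, disjoint sets of rows.
conn.execute(
    "CREATE TABLE COMP_STD AS SELECT * FROM STUDENT WHERE Course = 'Computer Science'"
)
conn.execute(
    "CREATE TABLE OTHER_STD AS SELECT * FROM STUDENT WHERE Course <> 'Computer Science'"
)

comp_count = conn.execute("SELECT COUNT(*) FROM COMP_STD").fetchone()[0]

# Reconstruction: union of the fragments.
rebuilt = conn.execute(
    "SELECT * FROM COMP_STD UNION ALL SELECT * FROM OTHER_STD ORDER BY 1"
).fetchall()
original = conn.execute("SELECT * FROM STUDENT ORDER BY Regd_No").fetchall()
```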

Hybrid Fragmentation

In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. This is the most flexible fragmentation technique since it generates fragments with minimal extraneous information. However, reconstruction of the original table is often an expensive task.

Hybrid fragmentation can be done in two alternative ways -

At first, generate a set of horizontal fragments; then generate vertical fragments from one or more of the horizontal fragments.
At first, generate a set of vertical fragments; then generate horizontal fragments from one or more of the vertical fragments.

Distribution transparency is the property of distributed databases by virtue of which the internal details of the distribution are hidden from the users. The DDBMS designer may choose to fragment tables, replicate the fragments, and store them at different sites. However, since users are oblivious of these details, they find the distributed database as easy to use as any centralized database.

The three dimensions of distribution transparency are -

Location transparency
Fragmentation transparency
Replication transparency

Location Transparency

Location transparency ensures that the user can query any table or fragment of a table as if it were stored locally at the user's site. The fact that the table or its fragments are stored at a remote site in the distributed database system should be completely hidden from the end user. The address of the remote site and the access mechanisms are completely hidden.

In order to incorporate location transparency, the DDBMS should have access to an updated and accurate data dictionary and DDBMS directory that contains the details of the locations of data.
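
The role of the directory can be sketched with two plain dictionaries. SITES stands in for the remote sites and CATALOG for the DDBMS directory; both names, the site labels, and the sample rows are hypothetical, and a real system would resolve locations over a network rather than in memory.

```python
# Data held at each (simulated) site.
SITES = {
    "site_a": {"STD_FEES": [(1, 45000)]},
    "site_b": {"STD_INFO": [(1, "ANKIT JHA")]},
}

# Directory: maps each table name to the site that stores it.
CATALOG = {"STD_FEES": "site_a", "STD_INFO": "site_b"}


def scan(table_name):
    """Return all rows of a table; the caller never names a site."""
    site = CATALOG[table_name]  # location lookup is hidden from the user
    return SITES[site][table_name]


fees = scan("STD_FEES")
info = scan("STD_INFO")
```

If a table is moved to another site, only the CATALOG entry changes; the calls to scan stay the same.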

Fragmentation Transparency

Fragmentation transparency enables users to query any table as if it were unfragmented. Thus, it hides the fact that the table the user is querying is actually a fragment or a union of some fragments. It also conceals the fact that the fragments are located at diverse sites.

This is somewhat similar to users of SQL views, where the user may not know that they are using a view of a table instead of the table itself.
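
The analogy with views can be made concrete in sqlite3: a view named STUDENT is defined as the union of two fragments, and the user queries STUDENT without knowing the fragments exist. The fragment OTHER_STD and the sample rows are assumptions, and both fragments live in one local database here, so only fragmentation (not location) is being hidden.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for fragment in ("COMP_STD", "OTHER_STD"):
    conn.execute(
        f"CREATE TABLE {fragment} (Regd_No INTEGER PRIMARY KEY, Name TEXT, Course TEXT)"
    )
conn.execute("INSERT INTO COMP_STD VALUES (1, 'ANKIT JHA', 'Computer Science')")
conn.execute("INSERT INTO OTHER_STD VALUES (2, 'RIYA SEN', 'Electronics')")

# The view presents the fragments as one unfragmented table.
conn.execute(
    "CREATE VIEW STUDENT AS SELECT * FROM COMP_STD UNION ALL SELECT * FROM OTHER_STD"
)

# The user's query mentions only STUDENT.
names = [row[0] for row in conn.execute("SELECT Name FROM STUDENT ORDER BY Regd_No")]
```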

Replication Transparency

Replication transparency ensures that replication of databases is hidden from the users. It enables users to query a table as if only a single copy of the table exists.

Replication transparency is associated with concurrency transparency and failure transparency. Whenever a user updates a data item, the update is reflected in all the copies of the table. However, this operation should not be known to the user. This is concurrency transparency. Also, in case of failure of a site, the user can still proceed with their queries using replicated copies without any knowledge of the failure. This is failure transparency.
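
A toy model of this behavior is sketched below: a write is applied to every copy, and a read falls back to another copy when a site is down. The class and site names are hypothetical, and the model ignores what happens to writes issued while a site is unavailable, which a real DDBMS must handle.

```python
class ReplicatedItem:
    """Toy data item that is copied to every site."""

    def __init__(self, sites):
        self.copies = {site: None for site in sites}
        self.down = set()  # sites currently unavailable

    def write(self, value):
        # The user issues one write; all copies are updated behind the scenes.
        for site in self.copies:
            self.copies[site] = value

    def read(self):
        # The user issues one read; the first available copy answers it.
        for site, value in self.copies.items():
            if site not in self.down:
                return value
        raise RuntimeError("no replica available")


item = ReplicatedItem(["site_a", "site_b"])
item.write(10)
item.down.add("site_a")  # simulate a site failure
value_after_failure = item.read()
```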

Combination of Transparencies

In any distributed database system, the designer should ensure that all the stated transparencies are maintained to a considerable extent. The designer may choose to fragment tables, replicate them, and store them at different sites, all unknown to the end user. However, complete distribution transparency is a tough task and requires considerable design effort.

Database control refers to the task of enforcing regulations so as to provide correct data to authentic users and applications of a database. In order that correct data is available to users, all data should conform to the integrity constraints defined in the database. Besides, data should be screened away from unauthorized users so as to maintain the security and privacy of the database. Database control is one of the primary tasks of the database administrator (DBA).

The three dimensions of database control are -

Authentication
Access rights
Integrity constraints

Authentication

In a distributed database system, authentication is the process through which only legitimate users can gain access to the data resources.

Authentication can be enforced at two levels -

Controlling Access to Client Computer - At this level, user access is restricted while logging in to the client computer that provides the user interface to the database server. The most common method is a username/password combination. However, more sophisticated methods like biometric authentication may be used for high-security data.
Controlling Access to the Database Software - At this level, the database software/administrator assigns some credentials to the user. The user gains access to the database using these credentials. One of the methods is to create a login account within the database server.

Access Rights

A user's access rights refer to the privileges that the user is given regarding DBMS operations, such as the rights to create a table, drop a table, add/delete/update tuples in a table, or query the table.

In distributed environments, since there are a large number of tables and an even larger number of users, it is not feasible to assign individual access rights to users. So, DDBMS defines certain roles. A role is a construct with certain privileges within a database system. Once the different roles are defined, the individual users are assigned one of these roles. Often a hierarchy of roles is defined according to the organization's hierarchy of authority and responsibility.

For example, the following SQL statements create a role "Accountant", grant it privileges on three tables, and then assign this role to user "ABC".

CREATE ROLE ACCOUNTANT; 
GRANT SELECT, INSERT, UPDATE ON EMP_SAL TO ACCOUNTANT; 
GRANT INSERT, UPDATE, DELETE ON TENDER TO ACCOUNTANT; 
GRANT INSERT, SELECT ON EXPENSE TO ACCOUNTANT; 
COMMIT; 
GRANT ACCOUNTANT TO ABC; 
COMMIT;
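
The effect of these grants can be modeled outside the database with a small lookup, which may help when reasoning about who is allowed to do what. The dictionaries mirror the GRANT statements above; the function name is_allowed is hypothetical, and the model ignores role hierarchies and privileges granted directly to users.

```python
# Privileges per role, mirroring the GRANT statements.
ROLE_PRIVILEGES = {
    "ACCOUNTANT": {
        "EMP_SAL": {"SELECT", "INSERT", "UPDATE"},
        "TENDER": {"INSERT", "UPDATE", "DELETE"},
        "EXPENSE": {"INSERT", "SELECT"},
    }
}

# Roles assigned to each user.
USER_ROLES = {"ABC": {"ACCOUNTANT"}}


def is_allowed(user, operation, table):
    """True if any role held by the user carries the privilege on the table."""
    return any(
        operation in ROLE_PRIVILEGES.get(role, {}).get(table, set())
        for role in USER_ROLES.get(user, set())
    )
```

For instance, is_allowed("ABC", "SELECT", "EMP_SAL") is true, while is_allowed("ABC", "SELECT", "TENDER") is false because SELECT was not granted on TENDER.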

Semantic Integrity Control

Semantic integrity control defines and enforces the integrity constraints of the database system.

The integrity constraints are as follows -

Data type integrity constraint
Entity integrity constraint
Referential integrity constraint

Data Type Integrity Constraint

A data type constraint restricts the range of values and the type of operations that can be applied to a field with the specified data type.

For example, let us consider that a table "HOSTEL" has three fields - the hostel number, hostel name, and capacity. The hostel number should start with capital letter "H" and cannot be NULL, and the capacity should not be more than 150. The following SQL command can be used for data definition -

CREATE TABLE HOSTEL ( 
   H_NO VARCHAR2(5) NOT NULL, 
   H_NAME VARCHAR2(15), 
   CAPACITY INTEGER, 
   CHECK ( H_NO LIKE 'H%'), 
   CHECK ( CAPACITY <= 150) 
);
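The effect of these checks can be tried out with a small sketch. It uses Python's built-in sqlite3 module rather than the Oracle-style dialect above, so VARCHAR2 becomes TEXT, and the prefix test is written with substr() because SQLite's LIKE is case-insensitive for ASCII letters by default. The hostel names and the helper try_insert are hypothetical.

```python
import sqlite3

# In-memory SQLite database; TEXT stands in for VARCHAR2 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE HOSTEL (
        H_NO TEXT NOT NULL,
        H_NAME TEXT,
        CAPACITY INTEGER,
        CHECK (substr(H_NO, 1, 1) = 'H'),
        CHECK (CAPACITY <= 150)
    )
""")

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO HOSTEL VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert(("H101", "Ganga", 120))     # satisfies every constraint
bad_prefix = try_insert(("X102", "Yamuna", 100))  # does not start with 'H'
too_large = try_insert(("H103", "Kaveri", 200))   # capacity above 150
null_number = try_insert((None, "Narmada", 90))   # violates NOT NULL
row_count = conn.execute("SELECT COUNT(*) FROM HOSTEL").fetchone()[0]
```

Only the first row is stored; the other three are rejected by the DBMS itself, not by application code.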

Entity Integrity Control

Entity integrity control enforces rules so that each tuple can be uniquely identified from other tuples. For this, a primary key is defined. A primary key is a minimal set of fields that uniquely identifies a tuple. The entity integrity constraint states that no two tuples in a table can have identical values for the primary key, and that no field that is part of the primary key can be NULL.

For example, in the hostel table above, the hostel number can be declared as the primary key through the following SQL statement (omitting the checks) −

CREATE TABLE HOSTEL ( 
   H_NO VARCHAR2(5) PRIMARY KEY, 
   H_NAME VARCHAR2(15), 
   CAPACITY INTEGER 
);

Referential Integrity Constraint

The referential integrity constraint lays down the rules for foreign keys. A foreign key is a field in a table that is the primary key of a related table. The constraint states that the value of a foreign key field must either be among the primary key values of the referenced table or be entirely NULL.

For example, consider a student table in which a student may opt to live in a hostel. To capture this, the primary key of the hostel table is included as a foreign key in the student table. The following SQL statement does this −

CREATE TABLE STUDENT (  
   S_ROLL INTEGER PRIMARY KEY, 
   S_NAME VARCHAR2(25) NOT NULL, 
   S_COURSE VARCHAR2(10), 
   S_HOSTEL VARCHAR2(5) REFERENCES HOSTEL 
);
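As a rough illustration of entity and referential integrity together, the sketch below recreates both tables with Python's sqlite3 module. TEXT replaces VARCHAR2, the sample rows and the helper accepted are hypothetical, and the PRAGMA line is needed because SQLite enforces foreign keys only when asked to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("CREATE TABLE HOSTEL (H_NO TEXT PRIMARY KEY, H_NAME TEXT, CAPACITY INTEGER)")
conn.execute("""
    CREATE TABLE STUDENT (
        S_ROLL INTEGER PRIMARY KEY,
        S_NAME TEXT NOT NULL,
        S_COURSE TEXT,
        S_HOSTEL TEXT REFERENCES HOSTEL(H_NO)
    )
""")

def accepted(sql, params):
    """Return True if the statement succeeds, False on an integrity violation."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

first_hostel = accepted("INSERT INTO HOSTEL VALUES (?, ?, ?)", ("H1", "Ganga", 100))
duplicate_key = accepted("INSERT INTO HOSTEL VALUES (?, ?, ?)", ("H1", "Yamuna", 80))
valid_fk = accepted("INSERT INTO STUDENT VALUES (?, ?, ?, ?)", (1, "Amit", "BCA", "H1"))
null_fk = accepted("INSERT INTO STUDENT VALUES (?, ?, ?, ?)", (2, "Varsha", "BCA", None))
dangling_fk = accepted("INSERT INTO STUDENT VALUES (?, ?, ?, ?)", (3, "Asif", "MCA", "H9"))
```

The duplicate hostel number and the reference to the non-existent hostel "H9" are rejected, while a NULL hostel (a student who does not live in a hostel) is allowed.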

When a query is submitted, it is first scanned, parsed and validated. An internal representation of the query is then created, such as a query tree or a query graph. Next, alternative execution strategies are devised for retrieving the results from the database tables. The process of choosing the most appropriate execution strategy for query processing is called query optimization.

Query Optimization Issues in DDBMS

In DDBMS, query optimization is a crucial task. Its complexity is high because the number of alternative strategies may grow exponentially due to the following factors −

The presence of a number of fragments.
The distribution of the fragments or tables across various sites.
The speed of the communication links.
The disparity in local processing capabilities.

Hence, in a distributed system, the target is often to find a good execution strategy for query processing rather than the best one. The time to execute a query is the sum of the following −

Time to communicate the query to the databases.
Time to execute the local query fragments.
Time to assemble the data from the different sites.
Time to display the results to the application.

Query Processing

Query processing is the set of all activities from the submission of a query to the display of its results. The steps are as shown in the following diagram −

Relational Algebra

Relational algebra defines the basic set of operations of the relational database model. A sequence of relational algebra operations forms a relational algebra expression. The result of this expression represents the result of a database query.

The basic operations are −

Projection
Selection
Union
Intersection
Minus
Join

Projection

The projection operation displays a subset of the fields of a table. This gives a vertical partition of the table.

Syntax in Relational Algebra

$$\pi_{<AttributeList>}{(<Table Name>)}$$

For example, let us consider the following Student database −

STUDENT
Roll_No	Name	Course	Semester	Gender
2	Amit Prasad	BCA	1	Male
4	Varsha Tiwari	BCA	1	Female
5	Asif Ali	MCA	2	Male
6	Joe Wallace	MCA	1	Male
8	Shivani Iyengar	BCA	1	Female

If we want to display the names and courses of all students, we will use the following relational algebra expression −

$$\pi_{Name,Course}{(STUDENT)}$$

Selection

The selection operation displays the subset of the tuples of a table that satisfies certain conditions. This gives a horizontal partition of the table.

Syntax in Relational Algebra

$$\sigma_{<Conditions>}{(<Table Name>)}$$

For example, in the Student table, if we want to display the details of all students who have opted for the MCA course, we will use the following relational algebra expression −

$$\sigma_{Course = \small "MCA"}{(STUDENT)}$$

Combination of Projection and Selection Operations

For most queries, we need a combination of projection and selection operations. There are two ways to write these expressions −

Using a sequence of projection and selection operations.
Using the rename operation to generate intermediate results.

For example, to display the names of all female students of the BCA course −

Relational algebra expression using a sequence of projection and selection operations

$$\pi_{Name}(\sigma_{Gender = \small "Female" AND \: Course = \small "BCA"}{(STUDENT)})$$

Relational algebra expression using the rename operation to generate intermediate results

$$FemaleBCAStudent \leftarrow \sigma_{Gender = \small "Female" AND \: Course = \small "BCA"} {(STUDENT)}$$

$$Result \leftarrow \pi_{Name}{(FemaleBCAStudent)}$$
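The two expressions above can be mimicked in a few lines of Python. This is only a sketch: tables are modelled as lists of dictionaries, and the helper names select and project are hypothetical stand-ins for the σ and π operators.

```python
STUDENT = [
    {"Roll_No": 2, "Name": "Amit Prasad", "Course": "BCA", "Semester": 1, "Gender": "Male"},
    {"Roll_No": 4, "Name": "Varsha Tiwari", "Course": "BCA", "Semester": 1, "Gender": "Female"},
    {"Roll_No": 5, "Name": "Asif Ali", "Course": "MCA", "Semester": 2, "Gender": "Male"},
    {"Roll_No": 6, "Name": "Joe Wallace", "Course": "MCA", "Semester": 1, "Gender": "Male"},
    {"Roll_No": 8, "Name": "Shivani Iyengar", "Course": "BCA", "Semester": 1, "Gender": "Female"},
]

def select(table, predicate):
    """Selection: keep the tuples that satisfy the predicate (horizontal partition)."""
    return [row for row in table if predicate(row)]

def project(table, attributes):
    """Projection: keep only the listed attributes and drop duplicate tuples."""
    seen, result = set(), []
    for row in table:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

# FemaleBCAStudent <- selection, Result <- projection on Name
female_bca_student = select(
    STUDENT, lambda r: r["Gender"] == "Female" and r["Course"] == "BCA"
)
result = project(female_bca_student, ["Name"])
```

Here result holds the two names Varsha Tiwari and Shivani Iyengar, the intermediate variable playing the role of the renamed relation FemaleBCAStudent.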

Union

If P is the result of an operation and Q is the result of another operation, the union of P and Q ($P \cup Q$) is the set of all tuples that are in P, in Q, or in both, without duplicates.

For example, to display all students who are either in Semester 1 or are in the BCA course −

$$Sem1Student \leftarrow \sigma_{Semester = 1}{(STUDENT)}$$

$$BCAStudent \leftarrow \sigma_{Course = \small "BCA"}{(STUDENT)}$$

$$Result \leftarrow Sem1Student \cup BCAStudent$$

Intersection

If P is the result of an operation and Q is the result of another operation, the intersection of P and Q ($P \cap Q$) is the set of all tuples that are in both P and Q.

For example, given the following two schemas −

EMPLOYEE

EmpID	Name	City	Department	Salary

PROJECT

PId	City	Department	Status

To display the names of all cities where a project is located and where an employee also resides −

$$CityEmp \leftarrow \pi_{City}{(EMPLOYEE)}$$

$$CityProject \leftarrow \pi_{City}{(PROJECT)}$$

$$Result \leftarrow CityEmp \cap CityProject$$

Minus

If P is the result of an operation and Q is the result of another operation, P - Q is the set of all tuples that are in P and not in Q.

For example, to list all the departments which do not have an ongoing project (projects with status = ongoing) −

$$AllDept \leftarrow \pi_{Department}{(EMPLOYEE)}$$

$$ProjectDept \leftarrow \pi_{Department} (\sigma_{Status = \small "ongoing"}{(PROJECT)})$$

$$Result \leftarrow AllDept - ProjectDept$$
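Since the results of projections are sets of tuples, the union, intersection and minus examples map directly onto Python's set operators. The rows below are made-up sample data for the EMPLOYEE and PROJECT schemas, not part of the original example.

```python
# Hypothetical sample rows for the EMPLOYEE and PROJECT schemas.
EMPLOYEE = [
    {"EmpID": 1, "Name": "Asha", "City": "Kolkata", "Department": "Sales", "Salary": 40000},
    {"EmpID": 2, "Name": "Ravi", "City": "Hyderabad", "Department": "Design", "Salary": 52000},
    {"EmpID": 3, "Name": "Meena", "City": "Pune", "Department": "Audit", "Salary": 45000},
]
PROJECT = [
    {"PId": "P1", "City": "Kolkata", "Department": "Sales", "Status": "ongoing"},
    {"PId": "P2", "City": "New Delhi", "Department": "Design", "Status": "completed"},
    {"PId": "P3", "City": "Hyderabad", "Department": "Sales", "Status": "ongoing"},
]

# Projections on a single attribute become sets, so duplicates vanish.
city_emp = {e["City"] for e in EMPLOYEE}
city_project = {p["City"] for p in PROJECT}

all_cities = city_emp | city_project      # union
shared_cities = city_emp & city_project   # intersection

all_dept = {e["Department"] for e in EMPLOYEE}
project_dept = {p["Department"] for p in PROJECT if p["Status"] == "ongoing"}
idle_dept = all_dept - project_dept       # minus
```

With this data, shared_cities contains Kolkata and Hyderabad, and idle_dept contains Design and Audit.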

Join

The join operation combines related tuples of two different tables (results of queries) into a single table.

For example, consider two schemas, Customer and Branch, in a Bank database as follows −

CUSTOMER

CustID	AccNo	TypeOfAc	BranchID	DateOfOpening

BRANCH

BranchID	BranchName	IFSCcode	Address

To list the customer details along with the branch details −

$$Result \leftarrow CUSTOMER \bowtie_{Customer.BranchID=Branch.BranchID}{BRANCH}$$

Translating SQL Queries into Relational Algebra

SQL queries are translated into equivalent relational algebra expressions before optimization. A query is first decomposed into smaller query blocks. These blocks are then translated into equivalent relational algebra expressions. Optimization includes optimizing each block and then optimizing the query as a whole.

Examples

Let us consider the following schemas −

EMPLOYEE

EmpID	Name	City	Department	Salary

PROJECT

PId	City	Department	Status

WORKS

EmpID	PID	Hours

Example 1

To display the details of all employees who earn a salary LESS than the average salary, we write the SQL query −

SELECT * FROM EMPLOYEE 
WHERE SALARY < ( SELECT AVG(SALARY) FROM EMPLOYEE );

This query contains one nested sub-query. So, it can be broken down into two blocks.

The inner block is −

SELECT AVG(SALARY) FROM EMPLOYEE;

If the result of this query is AvgSal, then the outer block is −

SELECT * FROM EMPLOYEE WHERE SALARY < AvgSal;

Relational algebra expression for the inner block −

$$AvgSal \leftarrow \Im_{AVG(Salary)}{(EMPLOYEE)}$$

Relational algebra expression for the outer block −

$$\sigma_{Salary < AvgSal}{(EMPLOYEE)}$$
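The decomposition into blocks can be checked against the nested query with Python's sqlite3 module. This is a sketch with three invented employee rows; it runs the nested query once, then the inner and outer blocks separately, and compares the two answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpID INTEGER PRIMARY KEY, Name TEXT, Salary REAL)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [(1, "Asha", 30000), (2, "Ravi", 50000), (3, "Meena", 70000)],  # hypothetical rows
)

# The original nested query, evaluated in one go.
nested = conn.execute(
    "SELECT EmpID FROM EMPLOYEE "
    "WHERE Salary < (SELECT AVG(Salary) FROM EMPLOYEE) ORDER BY EmpID"
).fetchall()

# Inner block: compute AvgSal on its own.
avg_sal = conn.execute("SELECT AVG(Salary) FROM EMPLOYEE").fetchone()[0]

# Outer block: use AvgSal as a plain constant.
two_blocks = conn.execute(
    "SELECT EmpID FROM EMPLOYEE WHERE Salary < ? ORDER BY EmpID", (avg_sal,)
).fetchall()
```

Both nested and two_blocks contain only employee 1, since the average salary of the sample rows is 50000.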

Example 2

To display the project ID and status of all projects of the employee 'Arun Kumar', we write the SQL query −

SELECT PID, STATUS FROM PROJECT 
WHERE PID = ( SELECT PID FROM WORKS WHERE EMPID = ( SELECT EMPID FROM EMPLOYEE 
            WHERE NAME = 'ARUN KUMAR'));

This query contains two nested sub-queries. Thus, it can be broken down into three blocks, as follows −

SELECT EMPID FROM EMPLOYEE WHERE NAME = 'ARUN KUMAR'; 
SELECT PID FROM WORKS WHERE EMPID = ArunEmpID; 
SELECT PID, STATUS FROM PROJECT WHERE PID = ArunPID;

(Here ArunEmpID and ArunPID are the results of the inner queries)

Relational algebra expressions for the three blocks are −

$$ArunEmpID \leftarrow \pi_{EmpID}(\sigma_{Name = \small "Arun Kumar"} {(EMPLOYEE)})$$

$$ArunPID \leftarrow \pi_{PID}(\sigma_{EmpID = ArunEmpID} {(WORKS)})$$

$$Result \leftarrow \pi_{PID, Status}(\sigma_{PID = ArunPID} {(PROJECT)})$$

Computation of Relational Algebra Operators

A relational algebra operator can be computed in many different ways, and each alternative is called an access path.

The computation alternative depends upon three main factors −

Operator type
Available memory
Disk structures

The time to perform a relational algebra operation is the sum of −

Time to process the tuples.
Time to fetch the tuples of the table from disk to memory.

Since the time to process a tuple is much smaller than the time to fetch it from storage, particularly in a distributed system, the number of disk accesses is very often used as the metric for the cost of a relational expression.

Computation of Selection

The computation of a selection operation depends upon the complexity of the selection condition and the availability of indexes on the attributes of the table.

Following are the computation alternatives depending upon the indexes −

No Index − If the table is unsorted and has no index, the selection process involves scanning all the disk blocks of the table. Each block is brought into memory and each tuple in the block is examined to see whether it satisfies the selection condition. If the condition is satisfied, the tuple is output. This is the costliest approach, since every tuple is brought into memory and every tuple is processed.
B+ Tree Index − Most database systems are built upon the B+ Tree index. If the selection condition is based upon the field that is the key of this B+ Tree index, the index is used to retrieve the results. However, processing selection statements with complex conditions may involve a larger number of disk block accesses and, in some cases, a complete scan of the table.
Hash Index − If a hash index is used and its key field is used in the selection condition, retrieving tuples through the index is a simple process. A hash index uses a hash function to find the address of the bucket in which the key values corresponding to a hash value are stored. To find a key value in the index, the hash function is executed and the bucket address is found. The key values in the bucket are then searched. If a match is found, the actual tuple is fetched from the disk block into memory.

Computation of Joins

When we want to join two tables, say P and Q, each tuple in P has to be compared with each tuple in Q to test whether the join condition is satisfied. If the condition is satisfied, the corresponding tuples are concatenated, eliminating duplicate fields, and appended to the result relation. Consequently, this is the most expensive operation.

The common approaches for computing joins are −

Nested-loop Approach

This is the conventional join approach. It can be illustrated through the following pseudocode (tables P and Q, with tuples tuple_p and tuple_q and joining attribute a) −

For each tuple_p in P 
   For each tuple_q in Q
      If tuple_p.a = tuple_q.a Then 
         Concatenate tuple_p and tuple_q and append to Result 
      End If 
   Next tuple_q 
Next tuple_p
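The pseudocode translates almost line for line into Python. In this sketch tables are lists of dictionaries, merging two dictionaries drops the duplicate join field, and the CUSTOMER and BRANCH rows are hypothetical. The function also counts comparisons to show the |P| × |Q| cost.

```python
def nested_loop_join(P, Q, a):
    """Join P and Q on attribute a by comparing every pair of tuples."""
    result, comparisons = [], 0
    for tuple_p in P:
        for tuple_q in Q:
            comparisons += 1
            if tuple_p[a] == tuple_q[a]:
                # Merging the dictionaries keeps a single copy of the join field.
                result.append({**tuple_p, **tuple_q})
    return result, comparisons

# Hypothetical sample data.
CUSTOMER = [
    {"CustID": 1, "BranchID": "B1"},
    {"CustID": 2, "BranchID": "B2"},
    {"CustID": 3, "BranchID": "B1"},
]
BRANCH = [
    {"BranchID": "B1", "BranchName": "Park Street"},
    {"BranchID": "B2", "BranchName": "Salt Lake"},
    {"BranchID": "B3", "BranchName": "Howrah"},
]

joined, comparisons = nested_loop_join(CUSTOMER, BRANCH, "BranchID")
```

Three customers and three branches cost nine comparisons to produce three joined tuples, which is why the approach scales poorly for large tables.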

Sort-merge Approach

In this approach, the two tables are individually sorted on the joining attribute and then the sorted tables are merged. External sorting techniques are adopted, since the number of records is very high and cannot be accommodated in memory. Once the individual tables are sorted, one page of each sorted table is brought into memory, the pages are merged on the joining attribute, and the joined tuples are written out.
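A minimal in-memory sketch of the merge step follows. It assumes both tables fit in memory and uses Python's sorted() in place of an external sort, so it illustrates the merging logic only, not the page-by-page I/O. The function name and sample rows are hypothetical.

```python
def sort_merge_join(P, Q, a):
    """Sort both inputs on attribute a, then merge them in a single pass."""
    P = sorted(P, key=lambda t: t[a])
    Q = sorted(Q, key=lambda t: t[a])
    result, i, j = [], 0, 0
    while i < len(P) and j < len(Q):
        if P[i][a] < Q[j][a]:
            i += 1
        elif P[i][a] > Q[j][a]:
            j += 1
        else:
            key = P[i][a]
            # Find the run of Q tuples that share this key.
            j_end = j
            while j_end < len(Q) and Q[j_end][a] == key:
                j_end += 1
            # Pair every P tuple with this key against the whole Q run.
            while i < len(P) and P[i][a] == key:
                for tuple_q in Q[j:j_end]:
                    result.append({**P[i], **tuple_q})
                i += 1
            j = j_end
    return result

P = [{"a": 2, "p": "x"}, {"a": 1, "p": "y"}, {"a": 2, "p": "z"}]
Q = [{"a": 2, "q": "m"}, {"a": 3, "q": "n"}, {"a": 2, "q": "k"}]
joined = sort_merge_join(P, Q, "a")
```

After sorting, each table is scanned once, so the merge itself is linear in the table sizes apart from runs of duplicate keys.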

Hash-join Approach

This approach comprises two phases: the partitioning phase and the probing phase. In the partitioning phase, the tables P and Q are each broken into a set of disjoint partitions. A common hash function is decided upon and is used to assign tuples to partitions. In the probing phase, the tuples in a partition of P are compared with the tuples in the corresponding partition of Q. If they match, the joined tuples are written out.
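The two phases can be sketched as follows. This is an in-memory illustration under simplifying assumptions: Python's built-in hash() serves as the common hash function, the number of partitions is an arbitrary parameter, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict

def hash_join(P, Q, a, n_partitions=4):
    """Partition both tables with the same hash function, then probe pairwise."""
    # Partitioning phase: tuples with equal join values land in the same partition.
    parts_p, parts_q = defaultdict(list), defaultdict(list)
    for tuple_p in P:
        parts_p[hash(tuple_p[a]) % n_partitions].append(tuple_p)
    for tuple_q in Q:
        parts_q[hash(tuple_q[a]) % n_partitions].append(tuple_q)

    # Probing phase: only corresponding partitions are compared.
    result = []
    for number, bucket_p in parts_p.items():
        lookup = defaultdict(list)
        for tuple_p in bucket_p:
            lookup[tuple_p[a]].append(tuple_p)
        for tuple_q in parts_q.get(number, []):
            for tuple_p in lookup.get(tuple_q[a], []):
                result.append({**tuple_p, **tuple_q})
    return result

P = [{"a": 2, "p": "x"}, {"a": 1, "p": "y"}, {"a": 2, "p": "z"}]
Q = [{"a": 2, "q": "m"}, {"a": 3, "q": "n"}, {"a": 2, "q": "k"}]
joined = hash_join(P, Q, "a")
```

Tuples that hash to different partitions are never compared, which is where the saving over the nested-loop approach comes from.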

Once the alternative access paths for computing a relational algebra expression are derived, the optimal access path is determined. In this chapter, we will look into query optimization in a centralized system, while in the next chapter we will study query optimization in a distributed system.

In a centralized system, query processing is done with the following aims −

Minimize the response time of a query (the time taken to produce the results for the user's query).
Maximize system throughput (the number of requests processed in a given amount of time).
Reduce the amount of memory and storage required for processing.
Increase parallelism.

Query Parsing and Translation

Initially, the SQL query is scanned. Then it is parsed to look for syntactical errors and correctness of data types. If the query passes this step, it is decomposed into smaller query blocks. Each block is then translated into an equivalent relational algebra expression.

Steps for Query Optimization

Query optimization involves three steps, namely query tree generation, plan generation, and query plan code generation.

Step 1 − Query Tree Generation

A query tree is a tree data structure that represents a relational algebra expression. The tables of the query are represented as leaf nodes. The relational algebra operations are represented as internal nodes. The root represents the query as a whole.

During execution, an internal node is executed whenever its operand tables are available. The node is then replaced by the result table. This process continues for all internal nodes until the root node is executed and replaced by the result table.
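A toy version of such a tree and its bottom-up evaluation is sketched below for the expression π_EmpID(σ_EName = "ArunKumar"(EMPLOYEE)). The Node class, the evaluate function and the two employee rows are hypothetical, and only the table, select and project operators are covered.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    op: str                                          # "table", "select" or "project"
    child: Optional["Node"] = None                   # operand of an internal node
    rows: Optional[List[dict]] = None                # data held by a leaf node
    predicate: Optional[Callable[[dict], bool]] = None
    attributes: Optional[List[str]] = None

def evaluate(node: Node) -> List[dict]:
    """Evaluate the operand first, then replace the node by its result table."""
    if node.op == "table":
        return node.rows
    operand = evaluate(node.child)
    if node.op == "select":
        return [r for r in operand if node.predicate(r)]
    if node.op == "project":
        return [{a: r[a] for a in node.attributes} for r in operand]
    raise ValueError(f"unknown operator: {node.op}")

EMPLOYEE = [
    {"EmpID": 1, "EName": "ArunKumar", "Salary": 50000},
    {"EmpID": 2, "EName": "MeenaRao", "Salary": 42000},
]

# Root: projection; internal node: selection; leaf: the EMPLOYEE table.
tree = Node(
    "project",
    attributes=["EmpID"],
    child=Node(
        "select",
        predicate=lambda r: r["EName"] == "ArunKumar",
        child=Node("table", rows=EMPLOYEE),
    ),
)
result = evaluate(tree)
```

The recursion reaches the leaf first, so the selection runs before the projection, matching the order in which operand tables become available.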

For example, let us consider the following schemas −

EMPLOYEE

EmpID	EName	Salary	DeptNo	DateOfJoining

DEPARTMENT

DNo	DName	Location

Example 1

Let us consider the following query.

$$\pi_{EmpID} (\sigma_{EName = \small "ArunKumar"} {(EMPLOYEE)})$$

The corresponding query tree will be −

Example 2

Let us consider another query involving a join.

$$\pi_{EName, Salary} (\sigma_{DName = \small "Marketing"} {(DEPARTMENT)} \bowtie_{DNo=DeptNo}{(EMPLOYEE)})$$

Following is the query tree for the above query.

Step 2 − Query Plan Generation

After the query tree is generated, a query plan is made. A query plan is an extended query tree that includes access paths for all the operations in the query tree. Access paths specify how the relational operations in the tree should be performed. For example, a selection operation can have an access path that gives details about the use of a B+ tree index for the selection.

Besides, a query plan also states how the intermediate tables should be passed from one operator to the next, how temporary tables should be used and how operations should be pipelined or combined.

Step 3 − Code Generation

Code generation is the final step in query optimization. It produces the executable form of the query, whose form depends upon the type of the underlying operating system. Once the query code is generated, the Execution Manager runs it and produces the results.

Approaches to Query Optimization

Among the approaches for query optimization, exhaustive search and heuristics-based algorithms are the most widely used.

Exhaustive Search Optimization

In these techniques, all possible query plans for a query are initially generated and then the best plan is selected. Though these techniques provide the best solution, they have exponential time and space complexity owing to the large solution space. The dynamic programming technique is an example.

Heuristic Based Optimization

Heuristic based optimization uses rule-based optimization approaches for query optimization. These algorithms have polynomial time and space complexity, which is lower than the exponential complexity of exhaustive search-based algorithms. However, these algorithms do not necessarily produce the best query plan.

Some of the common heuristic rules are −

Perform select and project operations before join operations. This is done by moving the select and project operations down the query tree. This reduces the number of tuples available for join.
Perform the most restrictive select/project operations first, before the other operations.
Avoid cross-product operations, since they produce very large intermediate tables.

This chapter discusses query optimization in a distributed database system.

Distributed Query Processing Architecture

In a distributed database system, processing a query comprises optimization at both the global and the local level. The query enters the database system at the client or controlling site. Here, the user is validated, and the query is checked, translated and optimized at the global level.

The architecture can be represented as −

Mapping Global Queries into Local Queries

The process of mapping global queries to local ones can be realized as follows −

The tables required in a global query have fragments distributed across multiple sites. The local databases have information only about local data. The controlling site uses the global data dictionary to gather information about the distribution and reconstructs the global view from the fragments.
If there is no replication, the global optimizer runs local queries at the sites where the fragments are stored. If there is replication, the global optimizer selects the site based upon communication cost, workload and server speed.
The global optimizer generates a distributed execution plan so that the least amount of data transfer occurs across the sites. The plan states the location of the fragments, the order in which the query steps need to be executed and the processes involved in transferring intermediate results.
The local queries are optimized by the local database servers. Finally, the local query results are merged together through a union operation in the case of horizontal fragments and a join operation in the case of vertical fragments.

For example, let us consider that the following Project schema is horizontally fragmented according to City, the cities being New Delhi, Kolkata and Hyderabad.

PROJECT

PId	City	Department	Status

Suppose there is a query to retrieve the details of all projects whose status is "ongoing".

The global query will be −

$$\sigma_{Status = \small "ongoing"}{(PROJECT)}$$

The query at the New Delhi server will be −

$$\sigma_{Status = \small "ongoing"}{(NewD\_PROJECT)}$$

The query at the Kolkata server will be −

$$\sigma_{Status = \small "ongoing"}{(Kol\_PROJECT)}$$

The query at the Hyderabad server will be −

$$\sigma_{Status = \small "ongoing"}{(Hyd\_PROJECT)}$$

In order to get the overall result, we need to union the results of the three queries as follows −

$$\sigma_{Status = \small "ongoing"}{(NewD\_PROJECT)} \cup \sigma_{Status = \small "ongoing"}{(Kol\_PROJECT)} \cup \sigma_{Status = \small "ongoing"}{(Hyd\_PROJECT)}$$
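The mapping from the global query to the three local queries can be simulated in plain Python. In this sketch each fragment is a list held in a dictionary keyed by a fragment name; the project rows and the function local_query are hypothetical, and no real network transfer takes place.

```python
# Hypothetical horizontal fragments of PROJECT, one per city.
FRAGMENTS = {
    "NewD_PROJECT": [
        {"PId": "P1", "City": "New Delhi", "Department": "Sales", "Status": "ongoing"},
        {"PId": "P2", "City": "New Delhi", "Department": "Design", "Status": "completed"},
    ],
    "Kol_PROJECT": [
        {"PId": "P3", "City": "Kolkata", "Department": "Audit", "Status": "ongoing"},
    ],
    "Hyd_PROJECT": [
        {"PId": "P4", "City": "Hyderabad", "Department": "Sales", "Status": "completed"},
        {"PId": "P5", "City": "Hyderabad", "Department": "Design", "Status": "ongoing"},
    ],
}

def local_query(fragment):
    """The selection each local server runs on its own fragment."""
    return [row for row in fragment if row["Status"] == "ongoing"]

# Each site filters locally; only the matching tuples travel to the controlling site.
local_results = {site: local_query(rows) for site, rows in FRAGMENTS.items()}

# Union of the local results (the fragments are disjoint, so no duplicates arise).
global_result = [row for rows in local_results.values() for row in rows]

# Reference answer: the same selection on the reconstructed global table.
PROJECT = [row for rows in FRAGMENTS.values() for row in rows]
reference = [row for row in PROJECT if row["Status"] == "ongoing"]
```

Only three of the five tuples leave their sites, yet global_result matches the answer obtained from the full PROJECT table.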

Distributed Query Optimization

Distributed query optimization requires evaluating a large number of query trees, each of which produces the required query result. This is primarily due to the presence of a large amount of replicated and fragmented data. Hence, the target is to find a good, near-optimal solution rather than the best one.

The main issues for distributed query optimization are −

Optimal utilization of resources in the distributed system.
Query trading.
Reduction of the solution space of the query.

Optimal Utilization of Resources in the Distributed System

A distributed system has a number of database servers at the various sites to perform the operations pertaining to a query. Following are the approaches for optimal resource utilization −

Operation Shipping − In operation shipping, the operation is run at the site where the data is stored and not at the client site. The results are then transferred to the client site. This is appropriate for operations whose operands are available at the same site. Examples: Select and Project operations.

Data Shipping − In data shipping, the data fragments are transferred to the database server, where the operations are executed. This is used for operations whose operands are distributed across different sites. It is also appropriate in systems where the communication costs are low and the local processors are much slower than the client server.

Hybrid Shipping − This is a combination of data and operation shipping. Here, the data fragments are transferred to high-speed processors, where the operation runs. The results are then sent to the client site.

Query Trading

In the query trading algorithm for distributed database systems, the controlling/client site for a distributed query is called the buyer and the sites where the local queries execute are called sellers. The buyer formulates a number of alternatives for choosing sellers and for reconstructing the global results. The target of the buyer is to achieve the optimal cost.

The algorithm starts with the buyer assigning sub-queries to the seller sites. The optimal plan is created from the locally optimized query plans proposed by the sellers, combined with the communication cost of reconstructing the final result. Once the global optimal plan is formulated, the query is executed.

Reduction of Solution Space of the Query

An optimal solution generally involves reducing the solution space so that the cost of the query and of data transfer is reduced. This can be achieved through a set of heuristic rules, just like the heuristics in centralized systems.

Following are some of the rules −

Perform selection and projection operations as early as possible. This reduces the data flow over the communication network.
Simplify operations on horizontal fragments by eliminating selection conditions that are not relevant to a particular site.
For join and union operations involving fragments located at multiple sites, transfer the fragmented data to the site where most of the data is present and perform the operation there.
Use the semi-join operation to qualify the tuples that are to be joined. This reduces the amount of data transferred, which in turn reduces the communication cost.
Merge the common leaves and sub-trees in a distributed query tree.
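The semi-join rule in the list above can be illustrated with a small simulation. It assumes CUSTOMER is stored at a hypothetical site A and a BRANCH fragment at a hypothetical site B; the function name, the rows and the shipment counters are all invented for the sketch, and "shipping" is just passing values between variables.

```python
def semi_join_plan(customers_at_a, branches_at_b, attr="BranchID"):
    """Join via semi-join and report how much data would cross the network."""
    # Step 1 (site B to site A): ship only the distinct join values.
    join_values = {b[attr] for b in branches_at_b}
    # Step 2 (site A): keep only the customers that will find a partner.
    reduced = [c for c in customers_at_a if c[attr] in join_values]
    # Step 3 (site A to site B): ship the reduced table and finish the join at B.
    by_key = {b[attr]: b for b in branches_at_b}
    joined = [{**c, **by_key[c[attr]]} for c in reduced]
    return joined, len(join_values), len(reduced)

# Hypothetical data: five customers at site A, two branches at site B.
CUSTOMER = [
    {"CustID": 1, "BranchID": "B1"},
    {"CustID": 2, "BranchID": "B2"},
    {"CustID": 3, "BranchID": "B7"},
    {"CustID": 4, "BranchID": "B8"},
    {"CustID": 5, "BranchID": "B1"},
]
BRANCH = [
    {"BranchID": "B1", "BranchName": "Park Street"},
    {"BranchID": "B2", "BranchName": "Salt Lake"},
]

joined, values_shipped, tuples_shipped = semi_join_plan(CUSTOMER, BRANCH)
```

Instead of sending all five customer tuples to site B, the plan sends two join values one way and three tuples the other way. The saving grows with the share of tuples that have no join partner.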

This chapter discusses the various aspects of transaction processing. We will also study the low-level tasks included in a transaction, the transaction states and the properties of a transaction. In the last portion, we will look at schedules and the serializability of schedules.

Transactions

A transaction is a program comprising a collection of database operations that is executed as a logical unit of data processing. The operations performed in a transaction include one or more database operations such as inserting, deleting, updating or retrieving data. It is an atomic process that is either performed to completion entirely or not performed at all. A transaction that only retrieves data without updating any data is called a read-only transaction.

Each high-level operation can be divided into a number of low-level tasks or operations. For example, a data update operation can be divided into three tasks −

read_item() − reads the data item from storage into main memory.
modify_item() − changes the value of the item in main memory.
write_item() − writes the modified value from main memory to storage.

Database access is restricted to the read_item() and write_item() operations. Likewise, for all transactions, read and write form the basic database operations.

Transaction Operations

The low-level operations performed in a transaction are −

begin_transaction − A marker that specifies the start of transaction execution.
read_item or write_item − Database operations that may be interleaved with main memory operations as part of the transaction.
end_transaction − A marker that specifies the end of the transaction.
commit − A signal to specify that the transaction has been successfully completed in its entirety and will not be undone.
rollback − A signal to specify that the transaction has been unsuccessful and so all temporary changes in the database are undone. A committed transaction cannot be rolled back.

Transaction States

A transaction may go through a subset of five states: active, partially committed, committed, failed and aborted.

Active − The initial state that the transaction enters is the active state. The transaction remains in this state while it is executing read, write or other operations.
Partially Committed − The transaction enters this state after the last statement of the transaction has been executed.
Committed − The transaction enters this state after it has completed successfully and the system checks have issued the commit signal.
Failed − The transaction goes from the partially committed state or the active state to the failed state when it is discovered that normal execution can no longer proceed or the system checks fail.
Aborted − This is the state after the transaction has been rolled back after failure and the database has been restored to the state it was in before the transaction began.

The following state transition diagram depicts the states of a transaction and the low-level transaction operations that cause the changes in state.
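The state changes described above can be written down as a small transition table. This is a sketch, not a reproduction of the diagram: the event name "abort" (for a detected failure) and the choice of which operation labels each arrow are assumptions, and run_transaction is a hypothetical helper.

```python
# (current state, event) -> next state; event names are partly assumed.
TRANSITIONS = {
    ("active", "end_transaction"): "partially committed",
    ("active", "abort"): "failed",
    ("partially committed", "commit"): "committed",
    ("partially committed", "abort"): "failed",
    ("failed", "rollback"): "aborted",
}

def run_transaction(events):
    """Replay a list of events and return the final state of the transaction."""
    state = "active"  # begin_transaction puts the transaction in the active state
    for event in events:
        if event in ("read_item", "write_item"):
            if state != "active":
                raise ValueError(f"{event} not allowed in state {state!r}")
            continue  # reads and writes keep the transaction active
        if (state, event) not in TRANSITIONS:
            raise ValueError(f"{event} not allowed in state {state!r}")
        state = TRANSITIONS[(state, event)]
    return state

success = run_transaction(["read_item", "write_item", "end_transaction", "commit"])
failure = run_transaction(["read_item", "abort", "rollback"])
```

Because no entry leaves the committed state, an attempt to roll back a committed transaction raises an error, which mirrors the rule that a committed transaction cannot be rolled back.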

Desirable Properties of Transactions

Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation, and Durability.

Atomicity − This property states that a transaction is an atomic unit of processing, that is, either it is performed in its entirety or not performed at all. No partial update should exist.
Consistency − A transaction should take the database from one consistent state to another consistent state. It should not adversely affect any data item in the database.
Isolation − A transaction should be executed as if it is the only one in the system. There should not be any interference from the other concurrent transactions that are simultaneously running.
Durability − If a committed transaction brings about a change, that change should be durable in the database and not lost in case of any failure.

Schedules and Conflicts

In a system with a number of simultaneous transactions, a schedule is the total order of execution of their operations. Given a schedule S comprising n transactions, say T1, T2, T3, ..., Tn, the operations of any transaction Ti must appear in S in the same order in which they appear in Ti.

Types of Schedules

There are two types of schedules −

Serial Schedules − In a serial schedule, at any point of time, only one transaction is active, i.e. there is no overlapping of transactions. This is depicted in the following graph −

Parallel Schedules − In a parallel schedule, more than one transaction is active at the same time, i.e. the transactions contain operations that overlap in time. This is depicted in the following graph −

Conflicts in Schedules

In a schedule comprising multiple transactions, a conflict occurs when two active transactions perform incompatible operations. Two operations are said to be in conflict when all of the following three conditions hold simultaneously −

The two operations are parts of different transactions.
Both the operations access the same data item.
At least one of the operations is a write_item() operation, i.e. it tries to modify the data item.
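The three conditions translate directly into a predicate. In this sketch an operation is a small named tuple; the type Op and the function conflicts are hypothetical names.

```python
from typing import NamedTuple

class Op(NamedTuple):
    txn: str    # transaction that issues the operation, e.g. "T1"
    kind: str   # "read" or "write"
    item: str   # data item accessed, e.g. "X"

def conflicts(a: Op, b: Op) -> bool:
    """True when all three conflict conditions hold for the two operations."""
    return (
        a.txn != b.txn                        # different transactions
        and a.item == b.item                  # same data item
        and "write" in (a.kind, b.kind)       # at least one write
    )

read_write = conflicts(Op("T1", "read", "X"), Op("T2", "write", "X"))
read_read = conflicts(Op("T1", "read", "X"), Op("T2", "read", "X"))
other_item = conflicts(Op("T1", "write", "X"), Op("T2", "write", "Y"))
same_txn = conflicts(Op("T1", "read", "X"), Op("T1", "write", "X"))
```

Only the first pair is a conflict. Two reads never conflict, and neither do operations on different items or within the same transaction.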

Serializability

A serializable schedule of 'n' transactions is a parallel schedule that is equivalent to a serial schedule comprising the same 'n' transactions. A serializable schedule retains the correctness of a serial schedule while achieving the better CPU utilization of a parallel schedule.

Equivalence of Schedules

Equivalence of two schedules can be of the following types −

Result equivalence − Two schedules producing identical results are said to be result equivalent.
View equivalence − Two schedules are said to be view equivalent if every read operation reads the value produced by the same write operation in both schedules, and the final write on each data item is performed by the same transaction in both.
Conflict equivalence − Two schedules are said to be conflict equivalent if both contain the same set of transactions and have the same order of conflicting pairs of operations.
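One standard way to test whether a schedule is conflict equivalent to some serial schedule is the precedence graph test, which is not described in this text but follows from the definition of conflicts: draw an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj, and check that the graph has no cycle. The sketch below assumes a schedule is a list of (transaction, "r" or "w", item) triples; the function names are hypothetical.

```python
def precedence_edges(schedule):
    """Edges (Ti, Tj): an operation of Ti conflicts with a later operation of Tj."""
    edges = set()
    for i, (ti, kind_i, item_i) in enumerate(schedule):
        for tj, kind_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "w" in (kind_i, kind_j):
                edges.add((ti, tj))
    return edges

def is_conflict_serializable(schedule):
    """True when the precedence graph is acyclic."""
    edges = precedence_edges(schedule)
    remaining = {txn for txn, _, _ in schedule}
    while remaining:
        # Transactions with no incoming edge from a remaining transaction.
        free = {
            n for n in remaining
            if not any(v == n and u in remaining for u, v in edges)
        }
        if not free:
            return False  # every remaining node has a predecessor: a cycle exists
        remaining -= free
    return True

# T1 finishes with X before T2 touches it: equivalent to the serial order T1, T2.
s1 = [("T1", "r", "X"), ("T1", "w", "X"), ("T2", "r", "X"), ("T2", "w", "X")]
# T2 writes X between T1's read and write: edges T1 -> T2 and T2 -> T1.
s2 = [("T1", "r", "X"), ("T2", "w", "X"), ("T1", "w", "X")]

ok = is_conflict_serializable(s1)
not_ok = is_conflict_serializable(s2)
```

For s1 the graph has the single edge T1 → T2, so the schedule is serializable; for s2 the two edges form a cycle, so no equivalent serial order exists.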

Concurrency controlling techniques ensure that multiple transactions are executed simultaneously while maintaining the ACID properties of the transactions and serializability in the schedules.

In this chapter, we will study the various approaches for concurrency control.

Locking Based Concurrency Control Protocols

Locking-based concurrency control protocols use the concept of locking data items. A lock is a variable associated with a data item that determines whether read/write operations can be performed on that data item. Generally, a lock compatibility matrix is used which states whether a data item can be locked by two transactions at the same time.

Locking-based concurrency control systems can use either one-phase or two-phase locking protocols.

One-phase Locking Protocol

In this method, each transaction locks an item before use and releases the lock as soon as it has finished using it. This locking method provides for maximum concurrency but does not always enforce serializability.

Two-phase Locking Protocol

In this method, all locking operations precede the first lock-release or unlock operation. The transaction comprises two phases. In the first phase, a transaction only acquires the locks it needs and does not release any lock. This is called the expanding or growing phase. In the second phase, the transaction releases its locks and cannot request any new locks. This is called the shrinking phase.

Every schedule in which all transactions follow the two-phase locking protocol is guaranteed to be serializable. However, this approach provides low parallelism between two conflicting transactions.
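Whether a single transaction obeys the two-phase rule can be checked by scanning its lock and unlock requests in order. The sketch below assumes a transaction is given as a list of ("lock" or "unlock", item) pairs; is_two_phase is a hypothetical helper, not part of any lock manager API.

```python
def is_two_phase(operations):
    """True if no lock is requested after the first unlock."""
    shrinking = False
    for action, _item in operations:
        if action == "unlock":
            shrinking = True      # the shrinking phase has begun
        elif action == "lock" and shrinking:
            return False          # a new lock in the shrinking phase breaks 2PL
    return True

# Growing phase (two locks) followed by shrinking phase (two unlocks).
two_phase = is_two_phase(
    [("lock", "X"), ("lock", "Y"), ("unlock", "X"), ("unlock", "Y")]
)
# One-phase style: X is released before Y is locked.
not_two_phase = is_two_phase(
    [("lock", "X"), ("unlock", "X"), ("lock", "Y"), ("unlock", "Y")]
)
```

The second sequence is what the one-phase protocol produces: each item is released as soon as it is no longer needed, which is why that protocol cannot guarantee serializability.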

Timestamp Concurrency Control Algorithms

Timestamp-based concurrency control algorithms use a transaction’s timestamp to coordinate concurrent access to a data item to ensure serializability. A timestamp is a unique identifier given by DBMS to a transaction that represents the transaction’s start time.

These algorithms ensure that transactions commit in the order dictated by their timestamps. An older transaction should commit before a younger transaction, since the older transaction enters the system before the younger one.

Timestamp-based concurrency control techniques generate serializable schedules such that the equivalent serial schedule is arranged in order of the age of the participating transactions.

Some of the timestamp-based concurrency control algorithms are −

Basic timestamp ordering algorithm.
Conservative timestamp ordering algorithm.
Multiversion algorithm based upon timestamp ordering.

Timestamp-based ordering follows three rules to enforce serializability −

Access Rule − When two transactions try to access the same data item simultaneously, for conflicting operations, priority is given to the older transaction. This causes the younger transaction to wait for the older transaction to commit first.
Late Transaction Rule − If a younger transaction has written a data item, then an older transaction is not allowed to read or write that data item. This rule prevents the older transaction from committing after the younger transaction has already committed.
Younger Transaction Rule − A younger transaction can read or write a data item that has already been written by an older transaction.
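
The effect of these rules can be sketched with the textbook form of basic timestamp ordering, in which each data item remembers the timestamps of its youngest reader and writer. This is a simplified illustration, not a complete scheduler: the class and function names are hypothetical, a smaller timestamp is assumed to mean an older transaction, and a rejected operation would in practice cause the transaction to be aborted and restarted.

```python
class Item:
    """A data item with the timestamps of its youngest reader and writer."""

    def __init__(self):
        self.read_ts = 0
        self.write_ts = 0


def try_read(item, ts):
    # Late transaction: a younger transaction has already written the item.
    if ts < item.write_ts:
        return False
    item.read_ts = max(item.read_ts, ts)
    return True


def try_write(item, ts):
    # Late transaction: a younger transaction has already read or written it.
    if ts < item.read_ts or ts < item.write_ts:
        return False
    item.write_ts = ts
    return True


x = Item()
print(try_write(x, 5))  # first writer is accepted
print(try_read(x, 3))   # older reader arrives after a younger write
print(try_read(x, 7))   # younger reader is accepted
```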

Optimistic Concurrency Control Algorithm

In systems with low conflict rates, the task of validating every operation for serializability may lower performance. In these cases, the test for serializability is postponed to just before commit. Since the conflict rate is low, the probability of aborting transactions which are not serializable is also low. This approach is called the optimistic concurrency control technique.

In this approach, a transaction’s life cycle is divided into the following three phases −

Execution Phase − A transaction fetches data items to memory and performs operations upon them.
Validation Phase − A transaction performs checks to ensure that committing its changes to the database passes the serializability test.
Commit Phase − A transaction writes the modified data items in memory back to the disk.

This algorithm uses three rules to enforce serializability in validation phase −

Rule 1 − Given two transactions T_i and T_j, if T_i is reading the data item which T_j is writing, then T_i’s execution phase cannot overlap with T_j’s commit phase. T_j can commit only after T_i has finished execution.

Rule 2 − Given two transactions T_i and T_j, if T_i is writing the data item that T_j is reading, then T_i’s commit phase cannot overlap with T_j’s execution phase. T_j can start executing only after T_i has already committed.

Rule 3 − Given two transactions T_i and T_j, if T_i is writing the data item which T_j is also writing, then T_i’s commit phase cannot overlap with T_j’s commit phase. T_j can start to commit only after T_i has already committed.
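
One common way to implement a check in the spirit of these rules is backward validation: at validation time, a transaction's read set is compared with the write sets of the transactions that committed while it was executing. The sketch below assumes read and write sets are Python sets of item names; the function name is hypothetical and this is only one possible validation scheme, not necessarily the one the rules above were derived from.

```python
def validate(read_set, overlapping_committed_write_sets):
    """Pass validation if nothing this transaction read was overwritten
    by a transaction that committed while it was executing."""
    return all(read_set.isdisjoint(ws) for ws in overlapping_committed_write_sets)


# T_j read x and y; a transaction that wrote only z committed meanwhile.
print(validate({"x", "y"}, [{"z"}]))
# T_j read x; a transaction that wrote x committed meanwhile, so T_j must abort.
print(validate({"x"}, [{"z"}, {"x"}]))
```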

Concurrency Control in Distributed Systems

In this section, we will see how the above techniques are implemented in a distributed database system.

Distributed Two-phase Locking Algorithm

The basic principle of distributed two-phase locking is the same as that of the basic two-phase locking protocol. However, in a distributed system there are sites designated as lock managers. A lock manager controls lock acquisition requests from transaction monitors. In order to enforce co-ordination between the lock managers at various sites, at least one site is given the authority to see all transactions and detect lock conflicts.

Depending upon the number of sites that can detect lock conflicts, distributed two-phase locking approaches can be of three types −

Centralized two-phase locking − In this approach, one site is designated as the central lock manager. All the sites in the environment know the location of the central lock manager and obtain locks from it during transactions.
Primary copy two-phase locking − In this approach, a number of sites are designated as lock control centers. Each of these sites has the responsibility of managing a defined set of locks. All the sites know which lock control center is responsible for managing the locks of which data table or fragment.
Distributed two-phase locking − In this approach, there are a number of lock managers, where each lock manager controls locks of data items stored at its local site. The location of the lock manager is based upon data distribution and replication.

Distributed Timestamp Concurrency Control

In a centralized system, the timestamp of any transaction is determined by the physical clock reading. But, in a distributed system, a site’s local physical or logical clock readings cannot be used as global timestamps, since they are not globally unique. So, a timestamp comprises a combination of the site ID and that site’s clock reading.
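
A minimal sketch of such a global timestamp follows. It assumes the clock reading is compared first and the site ID is used only to break ties; that ordering, and the names used, are illustrative assumptions rather than something the text prescribes.

```python
from typing import NamedTuple


class GlobalTimestamp(NamedTuple):
    clock: int    # local clock or counter reading at the originating site
    site_id: int  # makes the pair unique across sites and breaks ties


a = GlobalTimestamp(clock=10, site_id=2)
b = GlobalTimestamp(clock=10, site_id=1)
c = GlobalTimestamp(clock=9, site_id=3)

# Tuples compare field by field, so sorting yields oldest first.
ordered = sorted([a, b, c])
print(ordered)
```

Here a and b carry the same clock reading but remain distinct and totally ordered because their site IDs differ.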

For implementing timestamp ordering algorithms, each site has a scheduler that maintains a separate queue for each transaction manager. During a transaction, a transaction manager sends a lock request to the site’s scheduler. The scheduler puts the request into the corresponding queue in increasing timestamp order. Requests are processed from the front of the queues in the order of their timestamps, i.e. the oldest first.
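
The queueing behaviour described above can be sketched as follows. The class and method names are hypothetical, timestamps are assumed to be (clock, site ID) tuples, and the sketch ignores the question of when it is safe to serve a request (a conservative scheduler would wait until every queue is non-empty).

```python
import heapq


class SiteScheduler:
    """Keeps one timestamp-ordered queue per transaction manager."""

    def __init__(self):
        self.queues = {}  # transaction manager id -> heap of (timestamp, request)

    def enqueue(self, tm_id, timestamp, request):
        heapq.heappush(self.queues.setdefault(tm_id, []), (timestamp, request))

    def next_request(self):
        """Pop the oldest request across all queues, or None if all are empty."""
        heads = [(q[0], tm) for tm, q in self.queues.items() if q]
        if not heads:
            return None
        _, tm = min(heads)
        return heapq.heappop(self.queues[tm])


s = SiteScheduler()
s.enqueue("tm1", (5, 1), "read X")
s.enqueue("tm2", (3, 2), "write Y")
s.enqueue("tm1", (4, 1), "read Z")
print(s.next_request())  # the request with the smallest timestamp
```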

Conflict Graphs

Another method is to create conflict graphs. For this, transaction classes are defined. A transaction class contains two sets of data items called the read set and the write set. A transaction belongs to a particular class if the transaction’s read set is a subset of the class’ read set and the transaction’s write set is a subset of the class’ write set. In the read phase, each transaction issues its read requests for the data items in its read set. In the write phase, each transaction issues its write requests.

A conflict graph is created for the classes to which active transactions belong. This contains a set of vertical, horizontal, and diagonal edges. A vertical edge connects two nodes within a class and denotes conflicts within the class. A horizontal edge connects two nodes across two classes and denotes a write-write conflict among different classes. A diagonal edge connects two nodes across two classes and denotes a write-read or a read-write conflict among two classes.

The conflict graphs are analyzed to ascertain whether two transactions within the same class or across two different classes can be run in parallel.

Distributed Optimistic Concurrency Control Algorithm

The distributed optimistic concurrency control algorithm extends the optimistic concurrency control algorithm. For this extension, two rules are applied −

Rule 1 − According to this rule, a transaction must be validated locally at all sites when it executes. If a transaction is found to be invalid at any site, it is aborted. Local validation guarantees that the transaction maintains serializability at the sites where it has been executed. After a transaction passes local validation test, it is globally validated.

Rule 2 − According to this rule, after a transaction passes the local validation test, it should be globally validated. Global validation ensures that if two conflicting transactions run together at more than one site, they commit in the same relative order at all the sites where they run together. This may require a transaction to wait for the other conflicting transaction after validation and before commit. This requirement makes the algorithm less optimistic, since a transaction may not be able to commit as soon as it is validated at a site.

This chapter gives an overview of deadlock handling mechanisms in database systems. We’ll study the deadlock handling mechanisms in both centralized and distributed database systems.

What are Deadlocks?

A deadlock is a state of a database system with two or more transactions in which each transaction is waiting for a data item that is locked by some other transaction. A deadlock can be indicated by a cycle in the wait-for-graph. This is a directed graph in which the vertices denote transactions and an edge from one transaction to another denotes that the first is waiting for a data item locked by the second.

For example, consider a wait-for-graph in which transaction T1 is waiting for data item X which is locked by T3, T3 is waiting for Y which is locked by T2, and T2 is waiting for Z which is locked by T1. Hence, a waiting cycle is formed, and none of the transactions can proceed.
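
The cycle in this example can be found mechanically with a depth-first search. The sketch below assumes the wait-for-graph is stored as a dictionary that maps each transaction to the transactions it is waiting for; the function name and representation are hypothetical.

```python
def has_cycle(wait_for):
    """Depth-first search for a cycle in a wait-for-graph."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in wait_for.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return True  # reached a node already on the current path
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(wait_for))


# T1 waits for T3, T3 waits for T2, T2 waits for T1.
deadlocked = {"T1": ["T3"], "T3": ["T2"], "T2": ["T1"]}
print(has_cycle(deadlocked))
```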

Deadlock Handling in Centralized Systems

There are three classical approaches for deadlock handling, namely −

Deadlock prevention.
Deadlock avoidance.
Deadlock detection and removal.

All of the three approaches can be incorporated in both a centralized and a distributed database system.

Deadlock Prevention

The deadlock prevention approach does not allow any transaction to acquire locks that will lead to deadlocks. The convention is that when more than one transaction requests a lock on the same data item, only one of them is granted the lock.

One of the most popular deadlock prevention methods is pre-acquisition of all the locks. In this method, a transaction acquires all the locks before starting to execute and retains the locks for the entire duration of the transaction. If another transaction needs any of the already acquired locks, it has to wait until all the locks it needs are available. Using this approach, the system is prevented from being deadlocked since none of the waiting transactions is holding any lock.

Deadlock Avoidance

The deadlock avoidance approach handles deadlocks before they occur. It analyzes the transactions and the locks to determine whether or not waiting leads to a deadlock.

The method can be briefly stated as follows. Transactions start executing and request data items that they need to lock. The lock manager checks whether the lock is available. If it is available, the lock manager allocates the data item and the transaction acquires the lock. However, if the item is locked by some other transaction in an incompatible mode, the lock manager runs an algorithm to test whether keeping the transaction in a waiting state will cause a deadlock or not. Accordingly, the algorithm decides whether the transaction can wait or one of the transactions should be aborted.

There are two algorithms for this purpose, namely wait-die and wound-wait. Let us assume that there are two transactions, T1 and T2, where T1 tries to lock a data item which is already locked by T2. The algorithms are as follows −

Wait-Die − If T1 is older than T2, T1 is allowed to wait. Otherwise, if T1 is younger than T2, T1 is aborted and later restarted.
Wound-Wait − If T1 is older than T2, T2 is aborted and later restarted. Otherwise, if T1 is younger than T2, T1 is allowed to wait.
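
Both algorithms reduce to a comparison of transaction ages. The following sketch assumes that each transaction carries a timestamp and that a smaller timestamp means an older transaction; the function names and the returned strings are hypothetical. Here the requester is T1 and the holder is T2.

```python
def wait_die(requester_ts, holder_ts):
    """An older requester waits; a younger requester dies (is aborted)."""
    return "wait" if requester_ts < holder_ts else "abort requester"


def wound_wait(requester_ts, holder_ts):
    """An older requester wounds (aborts) the holder; a younger one waits."""
    return "abort holder" if requester_ts < holder_ts else "wait"


# T1 (timestamp 1) is older than T2 (timestamp 2).
print(wait_die(1, 2), "|", wound_wait(1, 2))
# T1 (timestamp 2) is younger than T2 (timestamp 1).
print(wait_die(2, 1), "|", wound_wait(2, 1))
```

In both schemes it is always the younger transaction that is aborted, so an old transaction cannot be restarted indefinitely.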

Deadlock Detection and Removal

The deadlock detection and removal approach runs a deadlock detection algorithm periodically and removes deadlock in case there is one. It does not check for deadlock when a transaction places a request for a lock. When a transaction requests a lock, the lock manager checks whether it is available. If it is available, the transaction is allowed to lock the data item; otherwise the transaction is allowed to wait.

Since no precautions are taken while granting lock requests, some of the transactions may be deadlocked. To detect deadlocks, the lock manager periodically checks whether the wait-for-graph has cycles. If the system is deadlocked, the lock manager chooses a victim transaction from each cycle. The victim is aborted, rolled back, and restarted later. Some of the methods used for victim selection are −

Choose the youngest transaction.
Choose the transaction with the fewest data items.
Choose the transaction that has performed the least number of updates.
Choose the transaction having the least restart overhead.
Choose the transaction which is common to two or more cycles.

This approach is primarily suited for systems having a low number of transactions and where fast response to lock requests is needed.

Deadlock Handling in Distributed Systems

Transaction processing in a distributed database system is also distributed, i.e. the same transaction may be processed at more than one site. The two main deadlock handling concerns in a distributed database system that are not present in a centralized system are transaction location and transaction control. Once these concerns are addressed, deadlocks are handled through deadlock prevention, deadlock avoidance, or deadlock detection and removal.

Transaction Location

Transactions in a distributed database system are processed at multiple sites and use data items at multiple sites. The amount of data processing is not uniformly distributed among these sites. The time period of processing also varies. Thus the same transaction may be active at some sites and inactive at others. When two conflicting transactions are located at a site, it may happen that one of them is in an inactive state. This condition does not arise in a centralized system. This concern is called the transaction location issue.

This concern may be addressed by the Daisy Chain model. In this model, a transaction carries certain details when it moves from one site to another. Some of the details are the list of tables required, the list of sites required, the list of visited tables and sites, the list of tables and sites that are yet to be visited, and the list of acquired locks with their types. After a transaction terminates by either commit or abort, the information should be sent to all the concerned sites.

Transaction Control

Transaction control is concerned with designating and controlling the sites required for processing a transaction in a distributed database system. There are many options regarding the choice of where to process the transaction and how to designate the center of control, like −

One server may be selected as the center of control.
The center of control may travel from one server to another.
The responsibility of controlling may be shared by a number of servers.

Distributed Deadlock Prevention

Just as in centralized deadlock prevention, in the distributed deadlock prevention approach a transaction should acquire all the locks before starting to execute. This prevents deadlocks.

The site where the transaction enters is designated as the controlling site. The controlling site sends messages to the sites where the data items are located to lock the items. Then it waits for confirmation. When all the sites have confirmed that they have locked the data items, the transaction starts. If any site or communication link fails, the transaction has to wait until it has been repaired.

Though the implementation is simple, this approach has some drawbacks −

Pre-acquisition of locks takes a long time because of communication delays. This increases the time required for the transaction.
In case of site or link failure, a transaction has to wait for a long time until the sites recover. Meanwhile, in the running sites, the items are locked. This may prevent other transactions from executing.
If the controlling site fails, it cannot communicate with the other sites. These sites continue to keep the locked data items in their locked state, thus resulting in blocking.

Distributed Deadlock Avoidance

As in a centralized system, distributed deadlock avoidance handles deadlock prior to occurrence. Additionally, in distributed systems, the transaction location and transaction control issues need to be addressed. Due to the distributed nature of the transaction, the following conflicts may occur −

Conflict between two transactions in the same site.
Conflict between two transactions in different sites.

In case of conflict, one of the transactions may be aborted or allowed to wait as per distributed wait-die or distributed wound-wait algorithms.

Let us assume that there are two transactions, T1 and T2. T1 arrives at Site P and tries to lock a data item which is already locked by T2 at that site. Hence, there is a conflict at Site P. The algorithms are as follows −

Distributed Wait-Die
- If T1 is older than T2, T1 is allowed to wait. T1 can resume execution after Site P receives a message that T2 has either committed or aborted successfully at all sites.
- If T1 is younger than T2, T1 is aborted. The concurrency control at Site P sends a message to all sites that T1 has visited to abort T1. The controlling site notifies the user when T1 has been successfully aborted at all the sites.
Distributed Wound-Wait
- If T1 is older than T2, T2 needs to be aborted. If T2 is active at Site P, Site P aborts and rolls back T2 and then broadcasts this message to other relevant sites. If T2 has left Site P but is active at Site Q, Site P broadcasts that T2 has been aborted; Site Q then aborts and rolls back T2 and sends this message to all sites.
- If T1 is younger than T2, T1 is allowed to wait. T1 can resume execution after Site P receives a message that T2 has completed processing.

Distributed Deadlock Detection

Just as in the centralized deadlock detection approach, deadlocks are allowed to occur and are removed if detected. The system does not perform any checks when a transaction places a lock request. For implementation, global wait-for-graphs are created. Existence of a cycle in the global wait-for-graph indicates a deadlock. However, it is difficult to spot deadlocks since transactions wait for resources across the network.

Alternatively, deadlock detection algorithms can use timers. Each transaction is associated with a timer which is set to a time period in which a transaction is expected to finish. If a transaction does not finish within this time period, the timer goes off, indicating a possible deadlock.

Another tool used for deadlock handling is a deadlock detector. In a centralized system, there is one deadlock detector. In a distributed system, there can be more than one deadlock detector. A deadlock detector can find deadlocks for the sites under its control. There are three alternatives for deadlock detection in a distributed system, namely −

Centralized Deadlock Detector − One site is designated as the central deadlock detector.
Hierarchical Deadlock Detector − A number of deadlock detectors are arranged in hierarchy.
Distributed Deadlock Detector − All the sites participate in detecting deadlocks and removing them.

This chapter looks into replication control, which is required to maintain consistent data in all sites. We will study the replication control techniques and the algorithms required for replication control.

As discussed earlier, replication is a technique used in distributed databases to store multiple copies of a data table at different sites. The problem with having multiple copies in multiple sites is the overhead of maintaining data consistency, particularly during update operations.

In order to maintain mutually consistent data in all sites, replication control techniques need to be adopted. There are two approaches for replication control, namely −

Synchronous Replication Control
Asynchronous Replication Control

Synchronous Replication Control

In the synchronous replication approach, the database is synchronized so that all the replicas always have the same value. A transaction requesting a data item will have access to the same value at all the sites. To ensure this uniformity, a transaction that updates a data item is expanded so that it makes the update in all the copies of the data item. Generally, the two-phase commit protocol is used for the purpose.

For example, let us consider a data table PROJECT(PId, PName, PLocation). We need to run a transaction T1 that updates PLocation to ‘Mumbai’, if PLocation is ‘Bombay’. If there are no replicas, the operations in transaction T1 will be −

Begin T1: 
   Update PROJECT Set PLocation = 'Mumbai' 
   Where PLocation = 'Bombay'; 
End T1;

If the data table has two replicas in Site A and Site B, T1 needs to spawn two children T1A and T1B corresponding to the two sites. The expanded transaction T1 will be −

Begin T1: 
   Begin T1A : 
      Update PROJECT Set PLocation = 'Mumbai' 
      Where PLocation = 'Bombay'; 
   End T1A;  
	
   Begin T1B : 
      Update PROJECT Set PLocation = 'Mumbai'
      Where PLocation = 'Bombay'; 
   End T1B; 
	
End T1;
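
The expanded transaction can be tried out with two in-memory SQLite databases standing in for Site A and Site B. This is only a sketch under stated assumptions: the sample row, the helper names make_site and run_t1, and the simple "commit both or roll back both" logic are illustrative, and the code is not a full two-phase commit protocol (it has no prepare phase and does not survive a crash between the two commits).

```python
import sqlite3


def make_site():
    """Create one replica of PROJECT holding a single sample row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE PROJECT (PId INTEGER, PName TEXT, PLocation TEXT)")
    conn.execute("INSERT INTO PROJECT VALUES (1, 'Alpha', 'Bombay')")
    conn.commit()
    return conn


def run_t1(sites):
    """Run the same update at every replica; commit all or roll back all."""
    sql = "UPDATE PROJECT SET PLocation = 'Mumbai' WHERE PLocation = 'Bombay'"
    try:
        for conn in sites:      # child transactions T1A, T1B
            conn.execute(sql)
    except sqlite3.Error:
        for conn in sites:
            conn.rollback()
        return False
    for conn in sites:
        conn.commit()
    return True


site_a, site_b = make_site(), make_site()
committed = run_t1([site_a, site_b])
locations = [
    conn.execute("SELECT PLocation FROM PROJECT").fetchone()[0]
    for conn in (site_a, site_b)
]
print(committed, locations)
```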

Asynchronous Replication Control

In the asynchronous replication approach, the replicas do not always maintain the same value. One or more replicas may store an outdated value, and a transaction can see different values. The process of bringing all the replicas to the current value is called synchronization.

A popular method of synchronization is the store and forward method. In this method, one site is designated as the primary site and the other sites are secondary sites. The primary site always contains updated values. All the transactions first enter the primary site. These transactions are then queued for application at the secondary sites. A secondary site is updated using the rollout method only when a transaction is scheduled to execute on it.

Replication Control Algorithms

Some of the replication control algorithms are −

Master-slave replication control algorithm.
Distributed voting algorithm.
Majority consensus algorithm.
Circulating token algorithm.

Master-Slave Replication Control Algorithm

There is one master site and ‘N’ slave sites. A master algorithm runs at the master site to detect conflicts. A copy of slave algorithm runs at each slave site. The overall algorithm executes in the following two phases −

Transaction acceptance/rejection phase − When a transaction enters the transaction monitor of a slave site, the slave site sends a request to the master site. The master site checks for conflicts. If there aren’t any conflicts, the master sends an “ACK+” message to the slave site which then starts the transaction application phase. Otherwise, the master sends an “ACK-” message to the slave which then rejects the transaction.
Transaction application phase − Upon entering this phase, the slave site where the transaction entered broadcasts a request to all slaves for executing the transaction. On receiving the requests, the peer slaves execute the transaction and send an “ACK” to the requesting slave on completion. After the requesting slave has received “ACK” messages from all its peers, it sends a “DONE” message to the master site. The master understands that the transaction has been completed and removes it from the pending queue.

Distributed Voting Algorithm

This comprises ‘N’ peer sites, all of which must “OK” a transaction before it starts executing. Following are the two phases of this algorithm −

Distributed transaction acceptance phase − When a transaction enters the transaction manager of a site, it sends a transaction request to all other sites. On receiving a request, a peer site resolves conflicts using priority based voting rules. If all the peer sites are “OK” with the transaction, the requesting site starts the application phase. If any of the peer sites does not “OK” a transaction, the requesting site rejects the transaction.
Distributed transaction application phase − Upon entering this phase, the site where the transaction has entered broadcasts a request to all peer sites for executing the transaction. On receiving the requests, the peer sites execute the transaction and send an “ACK” message to the requesting site on completion. After the requesting site has received “ACK” messages from all its peers, it lets the transaction manager know that the transaction has been completed.

Majority Consensus Algorithm

This is a variation of the distributed voting algorithm, where a transaction is allowed to execute when a majority of the peers “OK” a transaction. This is divided into three phases −

Voting phase − When a transaction enters the transaction manager of a site, it sends a transaction request to all other sites. On receiving a request, a peer site tests for conflicts using voting rules and keeps the conflicting transactions, if any, in pending queue. Then, it sends either an “OK” or a “NOT OK” message.
Transaction acceptance/rejection phase − If the requesting site receives a majority “OK” on the transaction, it accepts the transaction and broadcasts “ACCEPT” to all the sites. Otherwise, it broadcasts “REJECT” to all the sites and rejects the transaction.
Transaction application phase − When a peer site receives a “REJECT” message, it removes this transaction from its pending list and reconsiders all deferred transactions. When a peer site receives an “ACCEPT” message, it applies the transaction and rejects all the deferred transactions in the pending queue which are in conflict with this transaction. It sends an “ACK” to the requesting site on completion.
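
The difference between the distributed voting algorithm and the majority consensus algorithm lies in how the votes are tallied. The sketch below contrasts the two decisions. It assumes votes arrive as a list of “OK” or “NOT OK” strings and that a majority means strictly more than half of the votes received; the function names and these conventions are hypothetical.

```python
def unanimous_decision(votes):
    """Distributed voting: every peer site must vote OK."""
    return "ACCEPT" if all(v == "OK" for v in votes) else "REJECT"


def majority_decision(votes):
    """Majority consensus: strictly more than half of the votes must be OK."""
    ok = sum(1 for v in votes if v == "OK")
    return "ACCEPT" if ok > len(votes) / 2 else "REJECT"


votes = ["OK", "OK", "NOT OK"]
print(unanimous_decision(votes))  # one dissenting site is enough to reject
print(majority_decision(votes))   # two out of three is a majority
```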

Circulating Token Algorithm

In this approach the transactions in the system are serialized using a circulating token and executed accordingly against every replica of the database. Thus, all the transactions are accepted, i.e. none is rejected. This has two phases −

Transaction serialization phase − In this phase, all transactions are scheduled to run in a serialization order. Each transaction at each site is assigned a unique ticket from a sequential series, indicating the order of the transaction. Once a transaction has been assigned a ticket, it is broadcast to all the sites.
Transaction application phase − When a site receives a transaction along with its ticket, it places the transaction for execution according to its ticket. After the transaction has finished execution, this site broadcasts an appropriate message. A transaction ends when it has completed execution in all the sites.
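
The application phase can be sketched as a replica that buffers incoming transactions and applies them strictly in ticket order, even when messages arrive out of order. The class name, the assumption that tickets start at 1 and have no gaps, and the use of plain strings as transactions are all illustrative.

```python
import heapq


class ReplicaSite:
    """Applies transactions strictly in ticket order."""

    def __init__(self):
        self.next_ticket = 1
        self.pending = []   # min-heap of (ticket, transaction)
        self.applied = []   # transactions in the order they were executed

    def receive(self, ticket, txn):
        heapq.heappush(self.pending, (ticket, txn))
        # Apply every buffered transaction whose turn has come.
        while self.pending and self.pending[0][0] == self.next_ticket:
            _, ready = heapq.heappop(self.pending)
            self.applied.append(ready)
            self.next_ticket += 1


site = ReplicaSite()
site.receive(2, "T_b")   # arrives early, so it is buffered
site.receive(1, "T_a")   # unblocks both T_a and T_b
site.receive(3, "T_c")
print(site.applied)
```

Since every replica applies the same tickets in the same order, all replicas pass through the same sequence of states.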

A database management system is susceptible to a number of failures. In this chapter we will study the failure types and commit protocols. In a distributed database system, failures can be broadly categorized into soft failures, hard failures and network failures.

Soft Failure

A soft failure is the type of failure that causes loss of data in the volatile memory of the computer and not in the persistent storage. Here, the information stored in non-persistent storage like main memory, buffers, caches or registers is lost. Soft failures are also known as system crashes. The various types of soft failures are as follows −

Operating system failure.
Main memory crash.
Transaction failure or abortion.
System generated error like integer overflow or divide-by-zero error.
Failure of supporting software.
Power failure.

Hard Failure

A hard failure is the type of failure that causes loss of data in the persistent or non-volatile storage like disk. Disk failure may cause corruption of data in some disk blocks or failure of the total disk. The causes of a hard failure are −

Power failure.
Faults in media.
Read-write malfunction.
Corruption of information on the disk.
Read/write head crash of disk.

Recovery from disk failures can be short, if there is a new, formatted, and ready-to-use disk on reserve. Otherwise, duration includes the time it takes to get a purchase order, buy the disk, and prepare it.

Network Failure

Network failures are prevalent in distributed or network databases. These comprise the errors induced in the database system due to the distributed nature of the data and the transfer of data over the network. The causes of network failure are as follows −

Communication link failure.
Network congestion.
Information corruption during transfer.
Site failures.
Network partitioning.

Commit Protocols

Any database system should guarantee that the desirable properties of a transaction are maintained even after failures. If a failure occurs during the execution of a transaction, it may happen that not all the changes brought about by the transaction are committed. This makes the database inconsistent. Commit protocols prevent this scenario using either transaction undo (rollback) or transaction redo (roll forward).

Commit Point

The point of time at which the decision is made whether to commit or abort a transaction is known as the commit point. Following are the properties of a commit point −

It is a point of time when the database is consistent.
At this point, the modifications brought about by the transaction can be seen by the other transactions. All transactions can have a consistent view of the database.
At this point, all the operations of the transaction have been successfully executed and their effects have been recorded in the transaction log.
At this point, a transaction can be safely undone, if required.
At this point, a transaction releases all the locks held by it.

Transaction Undo

The process of undoing all the changes made to a database by a transaction is called transaction undo or transaction rollback. This is mostly applied in case of soft failure.

Transaction Redo

The process of reapplying the changes made to a database by a transaction is called transaction redo or transaction roll forward. This is mostly applied for recovery from a hard failure.

Transaction Log

A transaction log is a sequential file that keeps track of transaction operations on database items. As the log is sequential in nature, it is processed sequentially either from the beginning or from the end.

Purposes of a transaction log −

To support commit protocols to commit or abort transactions.
To aid database recovery after failure.

A transaction log is usually kept on the disk, so that it is not affected by soft failures. Additionally, the log is periodically backed up to an archival storage like magnetic tape to protect it from disk failures as well.

Lists in Transaction Logs

The transaction log maintains five types of lists depending upon the status of the transaction. These lists help the recovery manager to ascertain the status of a transaction. The status and the corresponding lists are as follows −

A transaction that has a transaction start record and a transaction commit record, is a committed transaction – maintained in commit list.
A transaction that has a transaction start record and a transaction failed record but not a transaction abort record, is a failed transaction – maintained in failed list.
A transaction that has a transaction start record and a transaction abort record is an aborted transaction – maintained in abort list.
A transaction that has a transaction start record and a transaction before-commit record is a before-commit transaction, i.e. a transaction where all the operations have been executed but not committed – maintained in before-commit list.
A transaction that has a transaction start record but no records of before-commit, commit, abort or failed, is an active transaction – maintained in active list.
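
The classification above can be sketched as a function over the record types found in the log for one transaction. The record names mirror the ones used in the list; the function name and the order in which the checks are made (commit first, then abort, failed and before-commit) are assumptions for illustration.

```python
def classify(records):
    """Map the set of log record types of one transaction to its list."""
    found = set(records)
    if "commit" in found:
        return "commit list"
    if "abort" in found:
        return "abort list"
    if "failed" in found:          # failed but not yet aborted
        return "failed list"
    if "before-commit" in found:   # all operations executed, not committed
        return "before-commit list"
    return "active list"           # only a start record so far


print(classify(["start", "before-commit", "commit"]))
print(classify(["start", "failed"]))
print(classify(["start"]))
```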

Immediate Update and Deferred Update

Immediate Update and Deferred Update are two methods for maintaining transaction logs.

In immediate update mode, when a transaction executes, the updates made by the transaction are written directly onto the disk. The old values and the updated values are written onto the log before writing to the database on disk. On commit, the changes made to the disk are made permanent. On rollback, changes made by the transaction in the database are discarded and the old values are restored into the database from the old values stored in the log.

In deferred update mode, when a transaction executes, the updates made to the database by the transaction are recorded in the log file. On commit, the changes in the log are written onto the disk. On rollback, the changes in the log are discarded and no changes are applied to the database.
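
The two modes can be contrasted with a toy model in which the database is a Python dictionary and the log is a list. This is a sketch only: the class names are hypothetical, there is no real disk I/O, and an old value of None is taken to mean that the key did not exist before the write.

```python
class ImmediateUpdate:
    """Writes go to the database at once; the log keeps old and new values."""

    def __init__(self, db):
        self.db, self.log = db, []

    def write(self, key, value):
        self.log.append((key, self.db.get(key), value))  # log before writing
        self.db[key] = value

    def commit(self):
        self.log.clear()  # the changes are already in the database

    def rollback(self):
        for key, old, _new in reversed(self.log):  # restore the old values
            if old is None:
                self.db.pop(key, None)
            else:
                self.db[key] = old
        self.log.clear()


class DeferredUpdate:
    """Writes go only to the log; the database changes on commit."""

    def __init__(self, db):
        self.db, self.log = db, []

    def write(self, key, value):
        self.log.append((key, value))  # the database is left untouched

    def commit(self):
        for key, value in self.log:
            self.db[key] = value
        self.log.clear()

    def rollback(self):
        self.log.clear()  # nothing to undo in the database


db = {"x": 1}
t = ImmediateUpdate(db)
t.write("x", 2)
t.rollback()
print(db)  # the old value of x has been restored from the log
```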

In order to recuperate from database failure, database management systems resort to a number of recovery management techniques. In this chapter, we will study the different approaches for database recovery.

The typical strategies for database recovery are −

In case of soft failures that result in inconsistency of the database, the recovery strategy includes transaction undo or rollback. However, sometimes, transaction redo may also be adopted to bring the database to a consistent state.
In case of hard failures resulting in extensive damage to database, recovery strategies encompass restoring a past copy of the database from archival backup. A more current state of the database is obtained through redoing operations of committed transactions from transaction log.

Recovery from Power Failure

Power failure causes loss of information in the non-persistent memory. When power is restored, the operating system and the database management system restart. The recovery manager initiates recovery from the transaction logs.

In case of immediate update mode, the recovery manager takes the following actions −

Transactions which are in active list and failed list are undone and written on the abort list.
Transactions which are in before-commit list are redone.
No action is taken for transactions in commit or abort lists.

In case of deferred update mode, the recovery manager takes the following actions −

Transactions which are in the active list and failed list are written onto the abort list. No undo operations are required since the changes have not been written to the disk yet.
Transactions which are in before-commit list are redone.
No action is taken for transactions in commit or abort lists.
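
The two sets of actions differ only in whether active and failed transactions need an undo. A compact sketch of this decision follows; the function name, the status strings and the mode strings are hypothetical labels for the lists and modes described above.

```python
def recovery_action(status, mode):
    """Action after a power failure.

    status: 'active', 'failed', 'before-commit', 'commit' or 'abort'
    mode:   'immediate' or 'deferred'
    """
    if status in ("active", "failed"):
        if mode == "immediate":
            return "undo, then write to abort list"
        return "write to abort list"  # deferred: nothing reached the disk
    if status == "before-commit":
        return "redo"
    return "no action"


for status in ("active", "before-commit", "commit"):
    print(status, "->", recovery_action(status, "immediate"))
```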

Recovery from Disk Failure

A disk failure or hard crash causes a total database loss. To recover from this hard crash, a new disk is prepared, then the operating system is restored, and finally the database is recovered using the database backup and the transaction log. The recovery method is the same for both immediate and deferred update modes.

The recovery manager takes the following actions −

The transactions in the commit list and before-commit list are redone and written onto the commit list in the transaction log.
The transactions in the active list and failed list are undone and written onto the abort list in the transaction log.

Checkpointing

A checkpoint is a point in time at which records are written from the buffers onto the database. As a consequence, in case of a system crash, the recovery manager does not have to redo the transactions that committed before the checkpoint. Periodic checkpointing shortens the recovery process.

The two types of checkpointing techniques are −

Consistent checkpointing
Fuzzy checkpointing

Consistent Checkpointing

Consistent checkpointing creates a consistent image of the database at the checkpoint. During recovery, only those transactions that lie to the right of the last checkpoint on the timeline (that is, those that had not committed by then) are undone or redone. The transactions to the left of the last consistent checkpoint are already committed and need not be processed again. The actions taken for checkpointing are −

The active transactions are suspended temporarily.
All changes in main-memory buffers are written onto the disk.
A “checkpoint” record is written in the transaction log.
The transaction log is written to the disk.
The suspended transactions are resumed.

If the transaction log is also archived in step 4, this checkpointing aids recovery from both disk failures and power failures; otherwise it aids recovery from power failures only.
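
The five checkpointing steps can be traced in a small in-memory model. This is only a sketch under simplifying assumptions: the `ToyDatabase` class and all of its attributes are hypothetical, "disk" is just a dictionary, and suspension is modelled by moving transaction ids between two sets.

```python
class ToyDatabase:
    """In-memory stand-in for a DBMS, used only to show the step order."""

    def __init__(self):
        self.disk = {}           # durable data
        self.buffers = {}        # changes still in main memory
        self.log_buffer = []     # log records not yet on disk
        self.log_on_disk = []    # durable transaction log
        self.active = set()      # running transactions
        self.suspended = set()   # transactions paused by a checkpoint

    def write(self, tx, key, value):
        self.active.add(tx)
        self.buffers[key] = value
        self.log_buffer.append(("update", tx, key, value))

    def consistent_checkpoint(self):
        # Step 1: suspend the active transactions.
        self.suspended, self.active = self.active, set()
        # Step 2: write all main-memory buffer changes to disk.
        self.disk.update(self.buffers)
        self.buffers.clear()
        # Step 3: add a "checkpoint" record to the transaction log.
        self.log_buffer.append(("checkpoint",))
        # Step 4: write the transaction log to disk.
        self.log_on_disk.extend(self.log_buffer)
        self.log_buffer.clear()
        # Step 5: resume the suspended transactions.
        self.active, self.suspended = self.suspended, set()


db = ToyDatabase()
db.write("T1", "x", 1)
db.consistent_checkpoint()
print(db.disk, db.log_on_disk)
```

After the call, everything written before the checkpoint is on "disk" and the last durable log record is the checkpoint marker, which is what lets recovery ignore earlier committed work.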

Fuzzy Checkpointing

In fuzzy checkpointing, the list of all active transactions is written to the log at the time of the checkpoint. In case of power failure, the recovery manager processes only those transactions that were active at the checkpoint or started later. The changes of transactions that committed before the checkpoint have already been written to the disk and hence need not be redone.

Example of Checkpointing

Let us consider a system in which the time of checkpointing is t_check and the time of system crash is t_fail. Let there be four transactions T_a, T_b, T_c and T_d such that −

T_a commits before checkpoint.
T_b starts before checkpoint and commits before system crash.
T_c starts after checkpoint and commits before system crash.
T_d starts after checkpoint and was active at the time of system crash.

The situation is depicted in the following diagram −

The actions that are taken by the recovery manager are −

Nothing is done with T_a.
Transaction redo is performed for T_b and T_c.
Transaction undo is performed for T_d.
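
The decision rule behind this example fits in a few lines of Python. The timestamps below are made-up values chosen only so that the four transactions fall into the four cases; `recovery_action` is a hypothetical helper, and a commit time of `None` marks a transaction that was still active at the crash.

```python
def recovery_action(commit_time, t_check, t_fail):
    """Decide what the recovery manager does with one transaction."""
    if commit_time is not None and commit_time < t_check:
        return "nothing"   # committed before the checkpoint
    if commit_time is not None and commit_time < t_fail:
        return "redo"      # committed between checkpoint and crash
    return "undo"          # still active at the time of the crash


t_check, t_fail = 10, 20

# name -> (start time, commit time); all values are illustrative
transactions = {
    "T_a": (1, 5),
    "T_b": (7, 14),
    "T_c": (12, 17),
    "T_d": (15, None),
}

for name, (_start, commit) in transactions.items():
    print(name, recovery_action(commit, t_check, t_fail))
```

Note that the start time does not affect the decision: T_b and T_c are treated alike because both committed after the checkpoint and before the crash.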

Transaction Recovery Using UNDO / REDO

Transaction recovery is done to eliminate the adverse effects of faulty transactions rather than to recover from a failure. Faulty transactions include all transactions that have changed the database into an undesired state, as well as the transactions that have used values written by those faulty transactions.

Transaction recovery in these cases is a two-step process −

UNDO all faulty transactions and transactions that may be affected by the faulty transactions.
REDO all transactions that are not faulty but have been undone due to the faulty transactions.

Steps for the UNDO operation are −

If the faulty transaction has performed an INSERT, the recovery manager deletes the inserted data item(s).
If the faulty transaction has performed a DELETE, the recovery manager re-inserts the deleted data item(s) from the log.
If the faulty transaction has performed an UPDATE, the recovery manager removes the new value by writing back the before-update value from the log.

Steps for the REDO operation are −

If the transaction has done INSERT, the recovery manager generates an insert from the log.
If the transaction has done DELETE, the recovery manager generates a delete from the log.
If the transaction has done UPDATE, the recovery manager generates an update from the log.
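
The UNDO and REDO rules can be tried out on a dictionary that stands in for the database. The log record layout (`tx`, `op`, `key`, `before`, `after`) and the two function names are assumptions made for this sketch; real systems use their own log formats.

```python
def undo_transaction(db, log, tx):
    """Reverse the effects of tx, processing its log records newest first."""
    for rec in reversed(log):
        if rec["tx"] != tx:
            continue
        if rec["op"] == "INSERT":
            db.pop(rec["key"], None)         # delete the inserted item
        elif rec["op"] == "DELETE":
            db[rec["key"]] = rec["before"]   # re-insert the deleted item
        elif rec["op"] == "UPDATE":
            db[rec["key"]] = rec["before"]   # restore the before-update value


def redo_transaction(db, log, tx):
    """Re-apply the effects of tx, processing its log records oldest first."""
    for rec in log:
        if rec["tx"] != tx:
            continue
        if rec["op"] in ("INSERT", "UPDATE"):
            db[rec["key"]] = rec["after"]    # regenerate the insert or update
        elif rec["op"] == "DELETE":
            db.pop(rec["key"], None)         # regenerate the delete


log = [
    {"tx": "T1", "op": "UPDATE", "key": "a", "before": 1, "after": 5},
    {"tx": "T1", "op": "INSERT", "key": "b", "before": None, "after": 2},
    {"tx": "T1", "op": "DELETE", "key": "c", "before": 9, "after": None},
]

db = {"a": 5, "b": 2}            # state after T1 has run
undo_transaction(db, log, "T1")
after_undo = dict(db)            # state as it was before T1
redo_transaction(db, log, "T1")
after_redo = dict(db)            # state after T1 again
print(after_undo, after_redo)
```

Undo walks the log backwards and redo walks it forwards, so that multiple changes to the same item are reversed or replayed in the right order.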

In a local database system, to commit a transaction the transaction manager only has to convey the decision to commit to the recovery manager. However, in a distributed system, the transaction manager must convey the decision to all the servers at the various sites where the transaction is being executed and enforce it uniformly. When processing is complete at a site, the local transaction reaches the partially committed state and waits for the transactions at all other sites to reach their partially committed states. When the site receives the message that all the sites are ready to commit, it starts to commit. In a distributed system, either all sites commit or none of them does.

The different distributed commit protocols are −

One-phase commit
Two-phase commit
Three-phase commit

Distributed One-phase Commit

Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a controlling site and a number of slave sites where the transaction is being executed. The steps in distributed commit are −

After each slave has locally completed its transaction, it sends a “DONE” message to the controlling site.
The slaves wait for a “Commit” or “Abort” message from the controlling site. This waiting time is called the window of vulnerability.
When the controlling site has received a “DONE” message from each slave, it makes a decision to commit or abort. This is called the commit point. Then, it sends this decision to all the slaves.
On receiving this message, a slave either commits or aborts and then sends an acknowledgement message to the controlling site.

Distributed Two-phase Commit

Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The steps performed in the two phases are as follows −

Phase 1: Prepare Phase

After each slave has locally completed its transaction, it sends a “DONE” message to the controlling site. When the controlling site has received a “DONE” message from all the slaves, it sends a “Prepare” message to the slaves.
The slaves vote on whether they still want to commit or not. If a slave wants to commit, it sends a “Ready” message.
A slave that does not want to commit sends a “Not Ready” message. This may happen when the slave has conflicting concurrent transactions or there is a timeout.

Phase 2: Commit/Abort Phase

After the controlling site has received a “Ready” message from all the slaves −
- The controlling site sends a “Global Commit” message to the slaves.
- The slaves apply the transaction and send a “Commit ACK” message to the controlling site.
- When the controlling site receives a “Commit ACK” message from all the slaves, it considers the transaction committed.
After the controlling site has received the first “Not Ready” message from any slave −
- The controlling site sends a “Global Abort” message to the slaves.
- The slaves abort the transaction and send an “Abort ACK” message to the controlling site.
- When the controlling site receives an “Abort ACK” message from all the slaves, it considers the transaction aborted.
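
The message flow of the two phases can be simulated in a single process. The sketch below assumes that every slave has already sent its “DONE” message, uses plain method calls in place of network messages, and ignores timeouts and site failures; the `Slave` class and `two_phase_commit` function are hypothetical names.

```python
class Slave:
    """A participating site that has finished its local work."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "done"

    def prepare(self):
        # Phase 1: vote on whether this site still wants to commit.
        return "Ready" if self.can_commit else "Not Ready"

    def finish(self, decision):
        # Phase 2: obey the global decision and acknowledge it.
        if decision == "Global Commit":
            self.state = "committed"
            return "Commit ACK"
        self.state = "aborted"
        return "Abort ACK"


def two_phase_commit(slaves):
    """Run both phases from the controlling site's point of view."""
    decision = "Global Commit"
    for slave in slaves:
        if slave.prepare() == "Not Ready":
            decision = "Global Abort"   # the first "Not Ready" settles it
            break
    acks = [slave.finish(decision) for slave in slaves]
    expected = "Commit ACK" if decision == "Global Commit" else "Abort ACK"
    if all(ack == expected for ack in acks):
        return "committed" if decision == "Global Commit" else "aborted"
    return "unknown"


all_ready = [Slave("site-1"), Slave("site-2")]
one_refuses = [Slave("site-1"), Slave("site-2", can_commit=False)]
print(two_phase_commit(all_ready), two_phase_commit(one_refuses))
```

In the second run, site-1 voted “Ready” but still ends up aborted, which illustrates the all-or-nothing outcome that the protocol enforces.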

Distributed Three-phase Commit

The steps in distributed three-phase commit are as follows −

Phase 1: Prepare Phase

The steps are the same as in distributed two-phase commit.

Phase 2: Prepare to Commit Phase

The controlling site issues an “Enter Prepared State” broadcast message.
The slave sites vote “OK” in response.

Phase 3: Commit / Abort Phase

The steps are the same as in two-phase commit, except that the “Commit ACK”/“Abort ACK” message is not required.

In this chapter, we will look into the threats that a database system faces and the measures of control. We will also study cryptography as a security tool.

Database Security and Threats

Data security is an essential aspect of any database system. It is of particular importance in distributed systems because of the large number of users, fragmented and replicated data, multiple sites and distributed control.

Threats in a Database

Availability loss − Availability loss refers to database objects being unavailable to legitimate users.
Integrity loss − Integrity loss occurs when unacceptable operations are performed upon the database either accidentally or maliciously. This may happen while creating, inserting, updating or deleting data. It results in corrupted data leading to incorrect decisions.
Confidentiality loss − Confidentiality loss occurs due to unauthorized or unintentional disclosure of confidential information. It may result in illegal actions, security threats and loss of public confidence.

Measures of Control

The measures of control can be broadly divided into the following categories −

Access Control − Access control includes the security mechanisms in a database management system that protect against unauthorized access. A user can gain access to the database only after logging in through a valid user account. Each user account is password protected.
Flow Control − Distributed systems involve a lot of data flow from one site to another and also within a site. Flow control prevents data from being transferred in such a way that it can be accessed by unauthorized agents. A flow policy lists the channels through which information can flow. It also defines security classes for data as well as transactions.
Data Encryption − Data encryption refers to encoding sensitive data when it is to be communicated over public channels. Even if an unauthorized agent gains access to the data, the agent cannot understand it since it is in an incomprehensible format.

What is Cryptography?

Cryptography is the science of encoding information before sending via unreliable communication paths so that only an authorized receiver can decode and use it.

The coded message is called cipher text and the original message is called plain text. The process of converting plain text to cipher text by the sender is called encoding or encryption. The process of converting cipher text to plain text by the receiver is called decoding or decryption.

The entire procedure of communicating using cryptography can be illustrated through the following diagram −

Conventional Encryption Methods

In conventional cryptography, encryption and decryption are done using the same secret key. Here, the sender encrypts the message with an encryption algorithm using a copy of the secret key. The encrypted message is then sent over public communication channels. On receiving the encrypted message, the receiver decrypts it with the corresponding decryption algorithm using the same secret key.

Security in conventional cryptography depends on two factors −

A sound algorithm which is known to all.
A randomly generated, preferably long secret key known only by the sender and the receiver.

The most famous conventional cryptography algorithm is Data Encryption Standard or DES.

The advantage of this method is its easy applicability. However, the greatest problem of conventional cryptography is sharing the secret key between the communicating parties. The ways to send the key are cumbersome and highly susceptible to eavesdropping.
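
The idea of one shared secret key serving both directions can be shown with a deliberately simple cipher. The sketch below is not DES: it XORs the message with a random key of the same length (a one-time pad), which is an assumption made here only because it fits in a few lines of standard-library Python. The function name `xor_bytes` and the sample message are illustrative.

```python
import secrets


def xor_bytes(data, key):
    """XOR each message byte with the key byte at the same position."""
    if len(key) < len(data):
        raise ValueError("key must be at least as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))


plain_text = b"transfer 100 to account 42"

# The secret key is random and known only to sender and receiver.
secret_key = secrets.token_bytes(len(plain_text))

cipher_text = xor_bytes(plain_text, secret_key)   # sender encrypts
recovered = xor_bytes(cipher_text, secret_key)    # receiver decrypts

print(cipher_text.hex())
print(recovered)
```

The same function and the same key are used on both sides, which is exactly why the key itself has to reach the receiver over some secure path.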

Public Key Cryptography

In contrast to conventional cryptography, public key cryptography uses two different keys, referred to as the public key and the private key. Each user generates a pair consisting of a public key and a private key. The user then puts the public key in an accessible place. When a sender wants to send a message, he encrypts it using the public key of the receiver. On receiving the encrypted message, the receiver decrypts it using his private key. Since the private key is not known to anyone but the receiver, no other person who receives the message can decrypt it.
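
The key pair mechanism can be illustrated with textbook RSA on very small numbers. The primes, the exponent and the helper name `apply_key` below are chosen only for readability; real keys are thousands of bits long and real systems add padding, so this sketch is in no way secure. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
# Receiver's key generation with toy primes (illustrative values only).
p, q = 61, 53
n = p * q                    # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent, coprime with phi
d = pow(e, -1, phi)          # private exponent: inverse of e modulo phi

public_key = (e, n)          # published in an accessible place
private_key = (d, n)         # known only to the receiver


def apply_key(value, key):
    """Raise value to the key's exponent modulo the key's modulus."""
    exponent, modulus = key
    return pow(value, exponent, modulus)


message = 65                                       # must be smaller than n
cipher = apply_key(message, public_key)            # sender encrypts
decrypted = apply_key(cipher, private_key)         # receiver decrypts
print(cipher, decrypted)
```

Anyone can compute `cipher` from the public key, but recovering `message` requires `d`, which only the receiver holds.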

The most popular public key cryptography algorithms are the RSA algorithm and the Diffie–Hellman algorithm. This method is very secure for sending private messages. However, it involves a lot of computation and so proves to be inefficient for long messages.

The solution is to use a combination of conventional and public key cryptography. The secret key is encrypted using public key cryptography before it is shared between the communicating parties. Then, the message is sent using conventional cryptography with the aid of the shared secret key.

Digital Signatures

A Digital Signature (DS) is an authentication technique based on public key cryptography that is used in e-commerce applications. It associates a unique mark with an individual within the body of his message. This helps others to authenticate valid senders of messages.

Typically, a user’s digital signature varies from message to message in order to provide security against counterfeiting. The method is as follows −

The sender takes a message, calculates the message digest and signs the digest with his private key.
The sender then appends the signed digest to the plaintext message.
The message is sent over the communication channel.
The receiver removes the appended signed digest and recovers the digest using the sender’s public key.
The receiver then takes the plaintext message and runs it through the same message digest algorithm.
If the results of step 4 and step 5 match, the receiver knows that the message is intact and authentic.
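
The six steps can be followed in a short sketch that combines SHA-256 from the standard library with textbook RSA. Everything about the key is an assumption made for illustration: the modulus is the product of two well-known Mersenne primes, the digest is reduced modulo that small modulus, and no padding is used, so this must not be mistaken for a secure signature scheme. The names `digest`, `sign` and `verify` are hypothetical.

```python
import hashlib

# Toy RSA key pair of the sender (illustrative, far too small for real use).
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (Python 3.8+)


def digest(message):
    """Message digest as an integer, reduced so that it is smaller than N."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message):
    """Steps 1 and 2: compute the digest and sign it with the private key."""
    return pow(digest(message), D, N)


def verify(message, signature):
    """Steps 4 to 6: recover the digest with the public key and compare it
    with a digest computed afresh from the received plaintext."""
    return pow(signature, E, N) == digest(message)


message = b"Pay 500 to supplier 7"
signature = sign(message)              # sent together with the plaintext

print(verify(message, signature))
print(verify(b"Pay 900 to supplier 7", signature))
```

Because the signature is derived from the digest, it changes whenever the message changes, which is what makes a copied signature useless on a different message.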

A distributed system needs more security measures than a centralized system, since there are many users, diversified data, multiple sites and distributed control. In this chapter, we will look into the various facets of distributed database security.

In distributed communication systems, there are two types of intruders −

Passive eavesdroppers − They monitor the messages and get hold of private information.
Active attackers − They not only monitor the messages but also corrupt data by inserting new data or modifying existing data.

Security measures encompass security in communications, security in data and data auditing.

Communications Security

In a distributed database, a lot of data communication takes place owing to the diversified location of data, users and transactions. So, it demands secure communication between users and databases and between the different database environments.

Security in communication encompasses the following −

Data should not be corrupted during transfer.
The communication channel should be protected against both passive eavesdroppers and active attackers.
In order to achieve the above stated requirements, well-defined security algorithms and protocols should be adopted.

Two popular, consistent technologies for achieving end-to-end secure communications are −

Secure Sockets Layer (SSL) protocol or its successor, the Transport Layer Security (TLS) protocol.
Virtual Private Networks (VPN).
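
As an illustration of the first option, Python's standard `ssl` module can set up a TLS-protected client connection. This is a minimal sketch: the function `open_secure_channel` is a hypothetical helper, the host and port would be those of a database site or proxy that actually speaks TLS, and the function is defined here without being called.

```python
import socket
import ssl

# Default settings verify the server certificate and its host name,
# which guards against both eavesdropping and impersonation.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse older versions


def open_secure_channel(host, port):
    """Return a TLS-wrapped socket connected to host:port."""
    raw_socket = socket.create_connection((host, port))
    return context.wrap_socket(raw_socket, server_hostname=host)


print(context.verify_mode, context.check_hostname)
```

Data written to the returned socket is encrypted and integrity-protected in transit, which addresses the requirements listed above for the communication channel.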

Data Security

In distributed systems, it is imperative to adopt measures to secure data in addition to communications. The data security measures are −

Authentication and authorization − These are the access control measures adopted to ensure that only authentic users can use the database. Digital certificates are used to provide authentication. In addition, login is restricted through a username/password combination.
Data encryption − The two approaches for data encryption in distributed systems are −
- Internal to distributed database approach: The user applications encrypt the data and then store the encrypted data in the database. For using the stored data, the applications fetch the encrypted data from the database and then decrypt it.
- External to distributed database: The distributed database system has its own encryption capabilities. The user applications store data and retrieve them without realizing that the data is stored in an encrypted form in the database.
Validated input − In this security measure, the user application checks for each input before it can be used for updating the database. An un-validated input can cause a wide range of exploits like buffer overrun, command injection, cross-site scripting and corruption in data.
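
Input validation can be sketched with the standard `sqlite3` module. The table layout, the `NAME_PATTERN` rule and the `add_employee` function are assumptions made for this example; the point is that the application checks each value first and then passes it to the database as a bound parameter instead of pasting it into the SQL text.

```python
import re
import sqlite3

# Hypothetical rule: names are letters plus a few punctuation marks,
# at most 50 characters long.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z .'-]{0,49}")


def add_employee(conn, name):
    """Validate the input, then insert it using a parameterized query."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError("rejected input")
    # The "?" placeholder keeps the value out of the SQL statement text.
    conn.execute("INSERT INTO employee (name) VALUES (?)", (name,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

add_employee(conn, "Ada Lovelace")

try:
    add_employee(conn, "x'); DROP TABLE employee; --")
    rejected = False
except ValueError:
    rejected = True

rows = conn.execute("SELECT name FROM employee").fetchall()
print(rejected, rows)
```

The validation step rejects the malicious string outright, and the placeholder would have kept it from being interpreted as SQL even if the check had been looser.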

Data Auditing

A database security system needs to detect and monitor security violations in order to determine which security measures it should adopt. It is often very difficult to detect a breach of security at the time it occurs. One method of identifying security violations is to examine audit logs. Audit logs contain information such as −

Date, time and site of failed access attempts.
Details of successful access attempts.
Vital modifications in the database system.
Access of huge amounts of data, particularly from databases in multiple sites.

All the above information gives an insight into the activities in the database. A periodic analysis of the log helps to identify any unusual activity along with its site and time of occurrence. This log is ideally stored on a separate server so that it is inaccessible to attackers.
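
A periodic analysis of this kind can be as simple as counting failed access attempts per site and user. The record layout, the sample entries and the threshold in the sketch below are all invented for illustration; a real audit log would be read from the separate log server.

```python
from collections import Counter

# Hypothetical audit records; every value here is made up.
audit_log = [
    {"time": "2024-01-05T02:14:00", "site": "site-A", "user": "u17", "event": "login_failed"},
    {"time": "2024-01-05T02:14:09", "site": "site-A", "user": "u17", "event": "login_failed"},
    {"time": "2024-01-05T02:14:21", "site": "site-A", "user": "u17", "event": "login_failed"},
    {"time": "2024-01-05T09:00:03", "site": "site-B", "user": "u02", "event": "login_ok"},
    {"time": "2024-01-05T09:30:45", "site": "site-B", "user": "u02", "event": "login_failed"},
]


def suspicious_sources(log, threshold=3):
    """Return (site, user) pairs with at least `threshold` failed logins."""
    failures = Counter(
        (entry["site"], entry["user"])
        for entry in log
        if entry["event"] == "login_failed"
    )
    return {key: count for key, count in failures.items() if count >= threshold}


print(suspicious_sources(audit_log))
```

A single failed login is normal; it is the repeated failures from one site and account that stand out and deserve a closer look.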