Structured Files : Structured Files Chapter 19
What The Record Manager Does : What The Record Manager Does Storage allocation: store tuples in file blocks
Tuple addressing: give tuple an id identifier
provide fast access via that id.
Enumeration: fast enumeration of all relation’s tuples
Content addressing: give fast accessible via attribute values.
Maintenance: update/delete a tuple and its access paths.
Protection: support for security
encrypt or tuple-granularity access control.
Outline : Outline Representing values
Representing records
Storing records in pages and across pages
Organizing records (entry, relative, key, hash)
Examples of fix/log/log logic.
Record Allocation in a Page : Record Allocation in a Page
Recall:
File is a collection of fixed-length pages (blocks).
File and buffer managers map files to disc/RAM slot on disk block page page body Block
Trailer Page
Dir Page
Head Block
Head checksum
Page Declares : Page Declares typedef struct /* global page numbers */
{ FILENO fileno; /*file where the page lives */
uint pageno; /* page number within the file */
} PAGEID, *PAGEIDP; /* */
typedef struct
PAGEID thatsme; /* identifies the page */
PAGE_TYPE page_type; /* see description above */
OBJID object_id; /* internal id of the relation,index,etc. */
LSN safe_up_to; /* page LSN for the WAL - protocol */
PAGEID previous; /* often pages are members of doubly */
PAGEID next; /* linked lists */
PAGE_STATE status; /* valid,in-doubt,copy of something,etc*/
int no_entries; /* # entries in page dir (see below) */
int unused; /* free bytes not in freespace */
int freespace; /* # contiguous free bytes for data */
char stuff[]; /* will grow */
} PAGE_HEADER, * PAGE_PTR; /* */
Different uses of pages : Different uses of pages Data: Homogeneous record storage
Cluster: like Data except many different record types
Index (access path): hashed or B-tree
Free-space bitmap: describes status of ³ 4,000 other pages.
Directory: meta-data about this or other files
Page Directory: Points to Records on Page : Page Directory: Points to Records on Page Record id is: File, Page, Directory_offset Page Header 1st Tuple 2nd 2nd Tuple 3rd Tuple 4thTuple 5th Tuple 2 1 3 4 5 Page directory grows in this direction Tuples are inserted in this direction
Accessing a Record : Accessing a Record Read by TID: Lock record shared
locate page
Get semaphore shared
follow directory offset
copy tuple
Give semaphore
Insert by TID: Lock record exclusive
locate page
Get semaphore exclusive
Find space
Insert
log insert (tid, new value).
update page LSN, header, directory,
Give semaphore
Accessing a Record : Accessing a Record Delete by TID: Lock record exclusive
locate page
Get semaphore exclusive
Add record to free space
Log delete (tid, old value).
update page LSN, header, directory,
Give semaphore
Update TID: much like delete-&-insert
Finding Space for Insert / Update : Finding Space for Insert / Update If tuple fits in page contiguous free-space: easy.
If tuple fits in page free space: reorganize (compress)
Physiological logging makes this cheap.
If tuple does not fit then:
leave forwarding address on page.
Optionally leave record prefix on page.
Segment record among several pages.
tid
Finding space within a file : Finding space within a file Free space table:
Summarizes status of many pages
(8KB page => 64Kb => 500MB of 8KB data pages)
Good for clustered & contiguous allocation
bitmap should be transaction protected
If transaction aborts, page is freed again.
Alternatively, treat bitmap as a hint
Rebuild periodically. p1 p2 p3 p4 p5 p6 f17 . . . . . . . . . . ··· p7 . . . P19 P20 P21 ..... f2 F19 21 f3 f4 f5 f6 f7 ··· Free space directories
Finding space within a file : Finding space within a file Free space cursor/list
Chain should be transaction protected
Else: rebuild at restart
do not trust pointers
(free page may be allocated).
empty_page_anchor point_of_insert . . file catalog chain of empty pages page for next insert
Tuple Allocation - I : Tuple Allocation - I The first strategy maintains a pointer to the “current block for insert” (CBI). When that
block fills up, an empty block is requested from a system service, which then becomes the
new “current block for insert”. And so on. This is the sequential insert strategy. Questions: What happens, when the pointer arrives at the last block? How do we reclaim
space freed by deleted tuples? CBI: where next?
Incremental Space Expansion - I : Incremental Space Expansion - I When the list of empty blocks is exhausted, there are two options to find space for
new tuples. Let us assume the following configuration: And so on. This works as long as enough space is freed up by deleted tuples. If there
are only few gaps, finding space for a new tuple can become very expensive, because
many blocks have to be probed sequentially. The first option is to let the CBI pointer circulate over the set of allocated blocks,
assuming that space is released by deleted tuples. The need to probe blocks that are completely filled can be avoided by maintaining
a an array of bits that contains one bit per block indicating whether a block is full:
Naming Tuples (records) : Naming Tuples (records) Relative byte address:
file, offset in file: OK for insert-then-read-only DBs
record can't easily grow.
deleted space not easily reclaimed.
Tuple Identifier
file, page, index: The design shown below.
Main disadvantage: expensive reorganization (fixing overflows) dir_index 3 7446 pageno fileid nodeid 7446 5127 this tuple pseudo -TID dir_index fileid nodeid this tuple 3 7446 pageno 7446
Implementing Database Keys : Implementing Database Keys Address record via directory
Address has a ID to allow for invalidation
ID never reused.
Pointer can be swizzled.
Popular with network & OO DBs
Naming Tuples via Primary Key : Naming Tuples via Primary Key {Entry Sequenced, Relative}: primary key is physical addr
{Hash, B-tree}: primary key is content (primary key)
Primary Key an alternative to DBkey
B-tree clusters related data
Problems: B-tree access is slower than Hash. Hash & B-tree keys not fixed length but neither is node.db_key
Benefit: key can grow to LARGE databases Good for distributed/partitioned data
It’s religious.
Datatype Representation : Datatype Representation E: External representation: ASCII, ISO Latin1, Unicode,...
P: Programming language representation
many: PL/1, Cobol, C, all have different VARCHAR
many type mismatches between P and F
: interval, datetime, user,...
F: File representation: "native" types (e.g.: null values, ....).
Lots of mapping functions.
It would be great if F-1(F(x)) = x for these functions, but....
Called the impedance mismatch between DB and PL
Datatype Representations : Datatype Representations P _ F: Implies a special language
(all other languages are 2nd class)
E _ F: Use characters for everything.
Problem: E changes from country to country!
(all other languages are 2nd class)
No easy way out of this.
Unicode will help most of us and make E_F more attractive
Representing Records : Representing Records
Representing Records : Representing Records struct relations{
Uint relation_no; /* internal id for the relation */
char * owner; /* user id of the creator */
long creation_date; /* date when it was created */
PAGENO current_point_of_insert; /* free space done via */
PAGENO empty_page_anchor; /* free space cursor method */
Uint no_of_attributes; /*#attributes in relation */
Uint no_of_fixed_atts; /* # fixed-length attributes */
Uint no_of_var_atts; /* # variable-length attributes */
struct attributes * p_attr;} /* pointer to the attributes array */
struct attributes[]; /* attributes array */
{ char * attribute_name; /* external name of the attribute */
Uint attribute_position; /* index of the field in the tuple (1,2,...) */
char attribute_type; /* this encodes the SQL - type definition */
Boolean var_length; /* is it variable_length field ? */
Boolean nulls_allowed; /* can field assume NULL value ? */
char * default_value; /* value assumed if none stored in tuple */
Uint field_length; /* maximum length of field */
int accumulated_offset; /* explained later */
Uint significant_digits; /* for data type FIXED */
char * encryption_key; /* if the value encrypted */
char * rest;} /* further information on the attribute */
Representing Records : Representing Records Generic header
(rid, tid, #fields)
all fixed length encoding
(fat records, fast-simple code
max < page path length)
variable fields have length
(short records, slow code)
type-length-value
(simple slow code, easy reorg)
fixed + ptrs to variables.
(compact, fast code)
m 3 4 tuple length F1 F2 F3 F4 F5 F6 2 4 8 10 n m L F1 F2 F3 F4 F5 F6 3 4 2 4 L 3 4 2 4 n F 1 F 2 F 3 F 4 F 5 F general prefix to all tuple representations relation-id tuple-id tuple length number of fields in the tuple or actual tuple length number of fields name number of fields 6
Representing Records (Reuter Recommends) : Representing Records (Reuter Recommends)
Some Details : Some Details Representing null values:
missing field
special value
extra field
bitmap
Representing keys
efficient comparison is important
store "conditioned" key so simple byte-compare.
Flip integer sign (so negative sorts low)
Flip float so exponent first, mantissa second, flipped signs
Compress varchars.
MANY refinements.
Want an order-preserving compression.
Fat Records (Longer Than a Page) : Fat Records (Longer Than a Page) Record must fit on page.
Long fields segregated
to separate page: may be good in some cases (Multi-media DBs)
Overflow page chains
Segment record across pages
Obese records (Longer Than 10 Pages) : Obese records (Longer Than 10 Pages) If record is super-large, then may want to index into it quickly.
“Obvious" design is standard tree.
Record is root of tree.
Grow levels when one fills.
Allows blob growth, update,...
Non-Normalized Relations : Non-Normalized Relations
Structured File Definition : Structured File Definition
File Layouts : File Layouts Unstructured:
a sequence of bytes
Structured,
Entry Sequenced.
Records inserted at end
Records cannot grow
key is RBA (relative byte address)
Relative:
fixed size record slots
records limited by that size
key is relative record number
Associative File Types : Associative File Types Hashed:
Records addressed by key field(s)
bucket has list of records
overflow to other buckets
or to overflow pages.
Key Sequenced
Records addressed by keyfield(s)
Records in sorted order.
either sorting or b-tree or...
Parameters at Create : Parameters at Create Database
Record type (fields)
Key
Organization { Entry Sequenced, Relative, Hashed, Key Sequenced }
Block size (page size)
Extent size (storage area)
Partitioning (among discs or nodes) by key.
Attributes: access control
allocation and archive strategy
transactional
lifetime, zero on free, and on and on ....
Parameters at Create : Parameters at Create "Secondary" indices.
Primary key is....(e.g. customer number).
Secondary key is social security number
Non-Unique secondary key is Last_Name, First_name
Secondary indices can be {unique or not } and {hashed or Key Sequenced }
index is like a table.
fields of index are:
secondary key, primary key
So can define index on any
kind of base table
Secondary Index Example : Secondary Index Example Base table is key-sequenced on CustomerNumber.
Index table is key sequence on Name-CustomerNumber.
Index can be a replica of the base table in another order. Transaction recovery and locking keeps them consistent.
Tuple management system
Maintains indices (insert, update, delete)
Navigates to base table via secondary index as one request.
What happens when you open a relation? : What happens when you open a relation? Many files get opened. Read directory (catalog)
Partitions,
Indices
Once OPEN, Application can SCAN the relation : Once OPEN, Application can SCAN the relation Scan is a row & column subset
SELECT
FROM
WHERE
With a specified start/stop key
AND BETWEEN AND
In a specified order (supported by a secondary index)
ASCENDING | DESCENDING
A locking protocol {Serializable | Repeatable Read | Committed Read
Uncommited Read | Skip Uncommitted |…}
TIMEOUT
SCAN States : SCAN States
SCAN States: How they change : SCAN States: How they change On error, scan state does not change.
On open,
scan is {before | after} the {first | last} set element
if scan is {ascending | descending}
On fetch next:
if {not end of set | at end of set}
scan is {at next | before first | after last } element
On insert
scan is at element
On delete
scan is at the missing element
SCAN States: How they change : SCAN States: How they change On update: scan position is not affected.
if tuple moves (because ordering attributes affected)
scan key position is unchanged
Can create Halloween problem (give everybody a 10% raise)
But scan enumerates entire set.
SCAN Data structure : SCAN Data structure enum
SCAN_STATE { TOP, ON, BOTTOM, BETWEEN, NIL }; /* the 5 scan states */
enum
ISOLATION { UNCOMMITTED_READ,..., SERIALIZABLE, READ_PAST, BOUNCE };
typedef struct
{ Uint scanid; /* handle for scan; returned by open_scan*/
TRID owner; /* which transaction uses the scan */
FILE * fileid; /* handle of file the scan is defined on */
char * scan_key; /* specification of scan key attribute(s) */
char * start_key; /* lower bound of scan range */
char * stop_key; /* upper bound of scan range */
char * filter; /* qualifying predicate for all tuples in scan */
ISOLATION isol_degree; /* locking policy for tuples accessed */
SCAN_STATE scan_state; /* state of scan pointer */
char scan_key[ ]; /* scan key the scan is before, at, or after */
} SCANCB, * SCANCBP;
Entry Sequenced File Insert : Entry Sequenced File Insert fix page descriptor page
find eof page
fix eof data page
if no space in page
< see next slide for transaction to advance page>
unfix descriptor page
add record to page (updating on-page directory)
generate log record (new value) and update page LSN.
compute lock name of record (based on TID).
get lock on record
unfix data page.
To make this work, MUST be assured lock is available
Otherwise page sem can (undetected)deadlock with lock wait
So, UNDO of entry-sequence insert does not free the space,
it just invalidates the record.
Entry Sequenced File Insert If EOF page or File is Full : Entry Sequenced File Insert If EOF page or File is Full Begin new transaction (will not abort if insert aborts)
to extend file EOF page. (leaves insert transaction)
unfix directory page
if file full, panic()
start a top-level transaction
fix the directory
advance the page eof updating directory and freespace
log the changes
fix the data page
format it
log the change
unfix the directory and data page
commit the transaction & resume insert transaction
fix directory, fix eof, check to see that there is room for the record.
Top level transaction
to extend file
Entry Sequenced Operations : Entry Sequenced Operations Delete by RBA.
get record lock
(node,
file,
RBA) exclusive
if {timeout, deadlock, error}
return error;
Fix page
Mark record invalid
Generate log record
Update page lsn
Unfix page. Read by RBA.
get record lock
(node,
file,
RBA) shared
if {timeout, deadlock, error} return error;
Fix page
if record valid copy to buffer
Unfix page
Return record or null Note: both must test that RBA <= EOF. Update, ReadNext, ... are similar.
Relative Files : Relative Files Records fit in fixed-length slots
Operation on slots.
Separate transactions extend the file EOF (allocate and format pages)
Relative Files : Relative Files {Read | Insert | Update | Delete} by key are all easy
Insert "near" key works by:
Plan A:
look at page
Look at neighbor pages (left, right, left, right,...)
Plan B:
allocate overflow page for base page
Plan C:
Look in free-space bit-map or byte (%full) map.
Key Sequenced or Hashed Files : Key Sequenced or Hashed Files Key sequenced
is subject of next chapter.
File Clustering : File Clustering Different record types kept in same page/file
For example: Master and detail records of an invoice. Detail records always accessed if master is.
Situation: Master key : InvoiceNo Detail key: InvoiceNo Foreign Key References Master+ SequenceNo
Technique: Hash or Key sequence Master on InvoiceNo Hash or Key Sequence Detail on InvoiceNo+SequenceNo in same table.
Clustering different record types in a page : Clustering different record types in a page One disc request gets the entire order.
Concept works for any storage hierarchy
Is natural for Hierarchical database systems.
Summary : Summary Representing values
Representing records
storing records in pages and across pages
Organizing records (entry, relative, key, hash)
Examples of fix/log/log logic.
Updating ....
Catch the
buzz on authorSTREAM
Copyright © 2002-2008 authorSTREAM. All rights reserved.
Updating ....
|
Send to Blogs and Networks |
 |