Distributed Databases
Parallel databases: architecture, query evaluation, query optimization
Distributed databases: architectures, data storage, catalog management, query processing, transactions
Introduction
A parallel database system is designed to improve performance through parallelism: loading data, building indexes, and evaluating queries. Data may be stored in a distributed way, but solely for performance reasons.
A distributed database system is physically stored across several sites, each managed by an independent DBMS. Distribution is driven by local ownership and availability, as well as performance.
Parallel Databases
Motivation
How long does it take to scan a 1 terabyte table at 10 MB/s?
1 TB = 1,099,511,627,776 bytes = 1024^4 = 2^40 bytes
10 MB = 10,485,760 bytes
1,099,511,627,776 / 10,485,760 ≈ 104,858 seconds
104,858 / (60 * 60 * 24) ≈ 1.2 days!
Using 1,000 processors in parallel, the time can be reduced to under two minutes
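The arithmetic above can be checked in a few lines of Python (the 1,000-processor figure assumes ideal, linear speed-up):

```python
# Back-of-the-envelope check of the scan-time figures above.
table_bytes = 2 ** 40               # 1 TB = 1024^4 bytes
rate = 10 * 2 ** 20                 # 10 MB/s = 10,485,760 bytes/s
seconds = table_bytes / rate        # single-processor scan time
days = seconds / (60 * 60 * 24)
parallel_seconds = seconds / 1000   # ideal speed-up with 1,000 processors
```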
Coarse-Grain and Fine-Grain
A coarse-grain parallel machine consists of a small number of processors; most current high-end computers fall into this category. A fine-grain parallel machine uses thousands of smaller processors, and is also referred to as a massively parallel machine.
Performance
Both throughput and response time can be improved by parallelism.
Throughput: the number of tasks completed in a given time. Processing many small tasks in parallel increases throughput.
Response time: the time it takes to complete a single task. Subtasks of a large transaction can be performed in parallel, reducing response time.
Speed-Up, Scale-Up
Speed-up: more resources mean less time for a given amount of data.
Scale-up: if resources increase in proportion to the data size, time is constant.
[Figures: ideal throughput and ideal response time as functions of the degree of parallelism]
Parallel Database Architecture
Where possible a parallel database should carry out evaluation steps in parallel; there are many opportunities for parallelism in a relational database. There are three main parallel DBMS architectures: shared nothing, shared memory, and shared disk.
Shared Memory Architecture
Multiple CPUs are attached to an interconnection network and access a common region of main memory. Similar to a conventional system.
Good for moderate parallelism: communication overhead is low, and OS services control the CPUs.
Interference increases with size: as CPUs are added, memory contention becomes a bottleneck, and adding more CPUs eventually slows the system down.
[Figure: shared memory architecture - CPUs (P) and disks (D) on an interconnection network, sharing global memory]
Shared Disk Architecture
Each CPU has private memory and direct access to data through the interconnection network.
Good for moderate parallelism, but suffers from interference in the interconnection network, which acts as a bottleneck.
Not a good solution for a large-scale parallel system.
[Figure: shared disk architecture - CPUs (P), each with private memory (M), accessing shared disks (D) over an interconnection network]
Shared Nothing Architecture
Each CPU has local memory and disk space; no two CPUs access the same storage area.
All CPU communication is through the network, which increases complexity.
Linear speed-up: operation time decreases in proportion to the increase in CPUs.
Linear scale-up: performance is maintained if the CPU increase is proportional to the data.
[Figure: shared nothing architecture - each CPU (P) with its own memory (M) and disk (D), connected only by the interconnection network]
Parallel Query Evaluation
A relational query execution plan is a tree, or graph, of relational algebra operators. Operators in a query tree can be executed in parallel: if one operator consumes the output of another, there is pipelined parallelism; otherwise the operators can be evaluated independently. An operator blocks if it does not produce any output until it has consumed all its inputs, and pipelined parallelism is limited by blocking operators.
Single Operator Evaluation
Individual operators can be evaluated in parallel by partitioning the input data. In data-partitioned parallel evaluation the input data is partitioned, worked on in parallel, and the results are then combined. Tables are horizontally partitioned: different rows are assigned to different processors.
Data Partitioning
Partition using a round-robin algorithm
Partition using hashing
Partition using ranges of field values
Round-Robin
Partition using a round-robin algorithm: assign record i to processor i mod n (similar to RAID systems). Suitable for evaluating queries that access the entire table; less efficient for queries that access ranges of values and for queries on equality.
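A minimal sketch of round-robin assignment (the function name and the use of in-memory lists are illustrative; real systems route records to disks or processors):

```python
def round_robin_partition(records, n):
    """Round-robin: record i goes to partition i mod n."""
    partitions = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        partitions[i % n].append(rec)
    return partitions

parts = round_robin_partition(list(range(10)), 3)  # 3 'processors'
```

Partition sizes never differ by more than one record, which is why round-robin works well for full-table scans but gives no help locating a specific value.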
Hash Partitioning
Partition using hashing: a hash function based on selected attributes is applied to each record to determine its processor. The data remains evenly distributed as the table grows or shrinks over time. Good for equality selections on the partitioning attributes: only one disk is used, leaving the others free. Also useful for sequential scans where the partitioning attributes form a candidate key.
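A sketch of hash partitioning, assuming Python's built-in `hash` stands in for the DBMS's hash function (all names here are illustrative). Note how an equality selection on the partitioning attribute touches exactly one partition:

```python
def hash_partition(records, key, n):
    """Each record goes to partition hash(key(record)) mod n."""
    partitions = [[] for _ in range(n)]
    for rec in records:
        partitions[hash(key(rec)) % n].append(rec)
    return partitions

emps = [{'empID': i} for i in range(12)]
parts = hash_partition(emps, lambda r: r['empID'], 4)

# An equality selection empID = 5 can be routed to a single partition:
target = hash(5) % 4
```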
Range Partitioning
Partition using ranges of field values. Ranges are chosen from the sort key values; each range should contain the same number of records, and each disk holds one range. If a range is too large this can lead to data skew, and skew can lead to the processors with large partitions becoming bottlenecks. Good for equality selections and range selections.
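A sketch of range partitioning, assuming each boundary is the inclusive upper bound of its partition (the function and boundary convention are illustrative, not from the slides):

```python
import bisect

def range_partition(records, key, boundaries):
    """Partition i holds keys <= boundaries[i]; keys above the last
    boundary fall into the final partition."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        partitions[bisect.bisect_left(boundaries, key(rec))].append(rec)
    return partitions

# Keys 0..29 split into three even ranges: <=9, 10..19, >=20.
parts = range_partition(list(range(30)), lambda r: r, [9, 19])
```

A range selection such as `10 <= key <= 15` only needs to visit the middle partition; badly chosen boundaries would instead concentrate records (and work) on one processor.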
Data Skew
Both hash and range partitioning may result in data skew, where some partitions are larger or smaller than others. Skew can dramatically reduce the speed-up obtained from parallelism. In range partitioning skew can be reduced by using histograms: a histogram records the number of records per range of attribute values, and is used to derive even partitions.
Parallel Evaluation Code
Parallel data streams are used to provide data for relational operators. The streams can come from different disks or from the output of other operators. Streams are merged or split: merged to provide the inputs for a relational operator, and split as needed to parallelize processing. These operations can buffer data, and should be able to halt the operators that provide their input data. A parallel evaluation consists of a network of relational, merge, and split operators.
Types of Parallelism
Inter-query parallelism: different queries or transactions execute in parallel. Throughput is increased, but response time is not. Easy to support in a shared-memory system.
Intra-query parallelism: executing a single query in parallel to speed up large queries, which in turn can entail intra-operation or inter-operation parallelism, or both.
Parallel Operations
Scanning and loading: pages can be read in parallel while scanning a relation, and the results merged. If hash or range partitioning is used, selections can be directed to the relevant processors.
Sorting
Joins
Sorting
The simplest sort method is for each processor to sort its portion of the table and then merge the sorted records; the merging phase may limit the amount of parallelism.
A better method is to first redistribute the records over the processors using range partitioning on the sort attributes. Each processor then sorts its set of records, and the sets of sorted records are retrieved in order. To make the partitions even, the data in the processors can be sampled.
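The better method above can be sketched as follows; sampling every k-th record to pick split points is one simple choice of sampling strategy (the function name and sampling rate are illustrative):

```python
def parallel_range_sort(records, n):
    """Redistribute via range partitioning on sampled split points,
    sort each partition 'locally', then read the runs back in order."""
    sample = sorted(records[::max(1, len(records) // 50)])  # sample for even boundaries
    boundaries = [sample[len(sample) * i // n] for i in range(1, n)]
    partitions = [[] for _ in range(n)]
    for r in records:
        partitions[sum(b <= r for b in boundaries)].append(r)
    runs = [sorted(p) for p in partitions]     # local sorts, parallel in practice
    return [x for run in runs for x in run]    # runs are already in global order

data = [(i * 37) % 100 for i in range(100)]    # a permutation of 0..99
```

No merge phase is needed: because partition i holds only keys below partition i+1, concatenating the sorted runs yields the globally sorted result.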
Joins
Join algorithms can be parallelized; parallelization is most effective for hash or sort-merge joins, and parallel hash join is widely used.
The process for parallel hash join is: first partition the two tables across the processors using the same hash function; join the records locally, using any join algorithm; then merge the results of the local joins. The union of these results is the join of the two tables.
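The three steps can be sketched in Python; a local hash join is used as the per-processor join algorithm, and the "processors" are simulated by a loop (all names are illustrative):

```python
def parallel_hash_join(R, S, r_key, s_key, n):
    """Partition both tables with the same hash function, join each
    partition pair locally, then union the local results."""
    def split(recs, key):
        parts = [[] for _ in range(n)]
        for rec in recs:
            parts[hash(key(rec)) % n].append(rec)
        return parts
    Rp, Sp = split(R, r_key), split(S, s_key)
    out = []
    for i in range(n):                      # one local join per 'processor'
        index = {}
        for r in Rp[i]:                     # build phase on the R partition
            index.setdefault(r_key(r), []).append(r)
        for s in Sp[i]:                     # probe phase on the S partition
            out.extend((r, s) for r in index.get(s_key(s), []))
    return out

R = [('R', k) for k in (1, 2, 3, 4, 5)]
S = [('S', k) for k in (2, 3, 3, 7)]
out = parallel_hash_join(R, S, lambda t: t[1], lambda t: t[1], 3)
```

Correctness rests on both tables being split with the same hash function: records that join always share a key, so they always land in the same partition pair.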
Improved Parallel Hash Join
If the tables are very large, parallel hash join may have a high cost at each processor: if each partition is large, multiple passes will be required for the local joins. An alternative approach is to use all processors for each partition. Partition the tables using a hash function h1, so that each partition of the smaller relation fits into the combined memory of the processors. Then process each partition using all processors, using a second hash function h2 to determine which processor to send each record to.
Joins on Inequalities
Partitioning is not suitable for joins on inequalities, such as R ⋈R.a < S.b S, since every record in R could join with any record in S. Fragment and replicate joins can be used instead. In an asymmetric fragment and replicate join, one of the relations is partitioned and the other is replicated across all partitions.
Fragment and Replicate
Each relation can be both fragmented and replicated, into m fragments of R and n fragments of S; however, m * n processors are required. This works with any join condition, and is used when partitioning is not possible.
[Figure: an m x n grid of processors P(i,j); processor P(i,j) joins fragment Ri with fragment Sj]
Other Operations
Selection: the table may already be partitioned on the selection attribute; if not, it can be scanned in parallel.
Duplicate elimination: use parallel sorting.
Projection: can be performed by scanning.
Aggregation: partition by the grouping attribute. If records do have to be transferred between processors, it may be possible to send just partial results; the final result can then be calculated from the partial results (e.g. sum).
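The partial-results idea can be sketched for two aggregates (function names are illustrative). Sum decomposes directly; average does not, so each processor ships a (sum, count) pair instead:

```python
def parallel_sum(partitions):
    """Each processor ships only its partial sum; the query site combines."""
    partials = [sum(p) for p in partitions]        # local work, parallel in practice
    return sum(partials)

def parallel_avg(partitions):
    """avg is not directly decomposable: ship (sum, count) pairs instead."""
    partials = [(sum(p), len(p)) for p in partitions]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count
```

Shipping one pair per processor rather than every record is what makes the communication cost independent of the table size.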
Costs of Parallelism
Using parallel processors reduces the time to perform an operation, possibly to as little as 1/n of the original cost, where n is the number of processors. However, there are also additional costs: start-up costs for initiating the operation; skew, which may reduce the speed-up; contention for resources, resulting in delays; and the cost of assembling the final result.
Parallel Query Optimization
As well as parallelizing individual operators, different operators can be processed in parallel: different processors perform different operations, and the result of one operator can be pipelined into another. Note that sorting and the partitioning phase of hash join block pipelines. Multiple independent operations can be executed concurrently, using bushy, rather than left-deep, join trees.
Parallel vs. Serial Plans
The best serial plan may not be the best parallel plan; parallelization also introduces further complexity into query optimization. Consider a table partitioned across two nodes, with a local secondary index: node 1 contains names between A and M, and node 2 contains names between N and Z. For the selection name < "Noober", node 1 should scan its partition, but node 2 should use the name index.
Parallel System Design
In a large-scale parallel system the chances of failure increase, so such systems should be designed to operate even if a processor or disk fails. Data can be replicated across multiple processors. Failed processors or disks are tracked, and requests are re-routed to the backup.
Summary
Architecture: shared-memory is easy, but costly and does not scale well; shared-nothing is cheap and scales well, but is harder to implement.
Both intra-operation and inter-operation parallelism are possible.
Most relational algebra operations can be performed in parallel.
How the data is partitioned across processors is very important.
Distributed Databases
Introduction
A distributed database is motivated by a number of factors.
Increased availability: if a site containing a table goes down, the table may still be available if a copy is maintained at another site.
Distributed access to data: an organization may have branches in several cities, and access patterns are typically affected by locality.
Analysis of distributed data: distributed systems must support integrated access.
Ideal
Data is stored at several sites, each managed by an independent DBMS. The system should make the fact that data is distributed transparent to the user.
Distributed data independence: users should not need to know where the data is located, and queries that access several sites should be optimized accordingly.
Distributed transaction atomicity: users should be able to write transactions that access several sites in the same way as local transactions.
Reality
Users may have to be aware of where data is located; distributed data independence and distributed transaction atomicity may not be supported, as these properties can be hard to support efficiently. Sites may be connected by a slow long-distance network. Consider a global system: the administrative overhead of viewing the data as a single unified collection may be prohibitively expensive.
Distributed vs. Parallel
Distributed and shared-nothing parallel systems appear similar. In practice they are often very different, since distributed databases are typically geographically separated, separately administered, have slower interconnections, and may have both local and global transactions.
Types of Distributed Database
Homogeneous: data is distributed, but every site runs the same DBMS software.
Heterogeneous, or multidatabase: different sites run different DBMSs, and the sites are connected to enable access to data. These require standards for gateway protocols. A gateway protocol is an API that allows external applications access to the database, e.g. ODBC and JDBC. Gateways add a layer of processing, and may not be able to entirely mask the differences between servers.
Distributed DBMS Architecture
Client-Server Collaborating Server Middleware
Client-Server Systems
One or more client processes and one or more server processes A client process sends a query to any one server process Clients are responsible for UI Servers manage data and execute transactions
A popular architecture Relatively simple to implement Servers do not have to deal with user-interactions Users can run a GUI on clients
Communication between client and server should be as set-oriented as possible e.g. stored procedures vs. cursors
Collaborating Server Systems
Client-server systems do not allow a single query to access multiple servers, as this would require breaking the query into sub-queries to be executed at different sites and merging the answers to the sub-queries; the client would have to be overly complex. In a collaborating server system the distinction between clients and servers is eliminated: there is a collection of database servers, each able to run transactions against local data. When a query is received that requires data from other servers, the server generates the appropriate sub-queries.
Middleware Systems
Designed to allow a single query to access multiple servers, but Without requiring all servers to be capable of managing
multi-site query execution Often used to integrate legacy systems
Requires one database server (the middleware) capable of managing multi-server queries Other servers only handle local queries and
transactions The special server coordinates queries and transactions The middleware server typically doesn’t maintain any
data
Distributed Data
Storing Distributed Data
In a distributed system tables are stored across several sites Accessing a table stored elsewhere incurs
message-passing costs A single table may be replicated or
fragmented across several sites Fragments are stored at the sites where they
are most often accessed Several replicas of a table may be stored at
different sites Fragmentation and replication can be combined
Fragmentation
Fragmentation consists of breaking a table into smaller tables, or fragments The fragments are stored instead of the original table Possibly at different sites
Fragmentation can either be vertical or horizontal
TID  empID  fName     lName    age  city
1    111    Sam       Spade    43   Chicago
2    222    Peter     Whimsey  51   Surrey
3    333    Sherlock  Holmes   35   Surrey
4    444    Anita     Blake    29   Boston

(A horizontal fragment is a subset of the rows; a vertical fragment is a subset of the columns.)
Managing Fragmentation
Records that belong to a horizontal fragment are usually identified by a selection query, e.g. all the records that relate to a particular city, achieving locality and reducing communication costs. A horizontally fragmented table can be recreated by computing the union of the fragments; the fragments are usually required to be disjoint.
Records belonging to a vertical fragment are identified by a projection query. The collection of vertical fragments must be a lossless-join decomposition, so a unique tuple ID is often assigned to each record.
Replication
Replication entails storing several copies of a table, or of table fragments, for:
Increased availability of data, which protects against failure of individual sites and failure of communication links.
Faster query evaluation: queries can execute faster by using a local copy of a table.
There are two kinds of replication, synchronous and asynchronous, which differ in how replicas are kept current when the table is modified.
Managing Distributed Catalogs
Distributing data across sites adds complexity: it is important to track where replicated or fragmented tables are stored. Each replica or fragment must be uniquely named, and naming should be performed locally. A global relation name consists of {birth site, local name}, where the birth site is the site where the table was created.
A site catalog records the fragments and replicas at a site, and tracks replicas of tables created at the site. To locate a table, look up its birth site catalog; the birth site never changes, even if the table is moved.
Distributed Queries
Distributed Query Processing
Estimating the cost of an evaluation plan must include communication costs: the number of page reads and writes, and the number of pages that must be sent from one site to another. Pages may need to be shipped between a number of sites: the sites where the data is located, the site where the result is computed, and the site that initiated the query.
Single Table Queries
Simple, one-table queries are affected by fragmentation and replication.
If a table is horizontally fragmented, a query may have to be evaluated at multiple sites and the union of the results computed; selections that only require data at one site can be executed just at that site.
If a table is vertically fragmented, the fragments have to be joined on the common attribute.
If a table is replicated, the shipping costs have to be considered to determine which site to use.
Joins in a Distributed DBMS
Joins of tables at different sites can be very expensive
There are a number of strategies for computing joins Fetch as needed Ship to one site Semijoins and Bloomjoins
Fetch As Needed Joins
Designate one table as the outer relation, and compute the join at that site, fetching records of the inner relation as needed. The cost depends on:
The size of the relations.
Whether the inner relation is cached at the outer relation's site; if not, communication costs are incurred each time the inner relation is read.
The size of the result relation: if the size of the result (R ⋈ S) is greater than R + S, it is cheaper to ship both relations to the query site.
Ship to One Site
In this strategy, relations are shipped to a site and the join carried out at that site
The site can be one of the sites involved in the join The result has to be shipped from where it was
computed to the site where the query was posed
Alternatively both input relations can be shipped to the site where the query was originally posed The join is then computed at that site
Semi-joins and Bloom-joins
Consider a join between two relations R and S at different sites, London and Vancouver, and assume that S (the inner relation) is to be shipped to London, where the join will be computed. Some S records may not join with any R record, so shipping costs can be reduced by shipping only those S records that will actually join to R records. There are two techniques that reduce the number of S records to be shipped: semi-joins and Bloom-joins.
Semi-joins
At the first site (London), compute the projection of R on the join columns and ship this relation to the second site (Vancouver). At Vancouver, compute the join of the projection and S; the result of this join is the reduction of S with respect to R. Ship the reduction of S to London, and at London compute the join of the reduction of S and R.
The effectiveness of this technique depends on how much smaller the reduction of S is compared to S.
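The Vancouver side of the semi-join can be sketched as follows, with Python sets standing in for the shipped projection (names are illustrative):

```python
def semijoin_reduction(R, S, r_key, s_key):
    """Ship only the join-column projection of R; return the S records
    that can actually join (the reduction of S with respect to R)."""
    projection = {r_key(r) for r in R}            # shipped London -> Vancouver
    return [s for s in S if s_key(s) in projection]

R = [('r', k) for k in (1, 2, 3)]
S = [('s', k) for k in (2, 3, 9)]
reduction = semijoin_reduction(R, S, lambda t: t[1], lambda t: t[1])
```

Only the reduction (here two of the three S records) is shipped back to London, where the final join runs locally.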
Bloomjoins
Bloom-joins are similar to semi-joins, except that a bit vector is sent to the second site instead of a projection. The vector has size k, and each record in R is hashed to it on the join attribute; a bit is set to 1 if a record hashes to it. The reduction of S is then computed by hashing the records of S to the bit vector: only those records that hash to a bit with the value 1 are included in the reduction. The cost to send the bit vector is less than the cost to send the projection of R on the join attribute, but some unwanted records of S may be included in the reduction.
Cost Based Optimization
The basic cost-based approach is to consider a set of plans and pick the cheapest. Communication costs must be considered, local autonomy must be respected, and some operations can be carried out in parallel. The query site generates a global plan with suggested local plans; local sites are allowed to change their suggested plans if they can improve them.
Distributed Transactions
Updating Distributed Data
If data is distributed, this should be transparent to users: users should be able to ask queries without having to worry about where tables are stored, and transactions should be atomic actions regardless of data fragmentation or replication. If so, all copies of a replicated relation must be modified before the transaction commits; this is referred to as synchronous replication. Another approach, asynchronous replication, allows copies of a relation to differ. It is more efficient, but compromises data independence.
Synchronous Replication
There are two techniques for ensuring that a transaction sees the same values regardless of which copy of an object it accesses.
In voting, a transaction must write a majority of copies to modify an object, and must read enough copies to ensure that it sees at least one most recent copy. For example, with 10 copies of an object, at least 6 copies must be written, and at least 5 read. The copies include a version number so that it is possible to tell which copy is the latest.
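The quorum arithmetic behind the example can be captured in one predicate (the function name is illustrative):

```python
def quorum_valid(n, read_q, write_q):
    """A read quorum must overlap every write quorum (r + w > n), and two
    write quorums must overlap each other (2w > n), so a reader can always
    find the highest version number among the copies it reads."""
    return read_q + write_q > n and 2 * write_q > n
```

With 10 copies, writing 6 and reading 5 satisfies both conditions; shrinking either quorum breaks one of the overlap guarantees.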
Synchronous Replication …
Voting is not generally an efficient technique: reading an object requires that multiple copies be read, and objects are typically read more often than they are written.
The read-any write-all policy allows any single copy to be read, but all copies must be written when an object is written. Writes are slower relative to voting, but reads are fast, particularly if a local copy is available. Read-any write-all is usually used for synchronous replication.
Synchronous Replication Cost
Synchronous replication is expensive. Before an update transaction is committed, it must obtain X locks on all copies of the data. This may entail sending lock requests to remote sites and waiting for the locks to be confirmed, all while holding its other locks. If sites or communication links fail, the transaction cannot commit until they are back up. Committing the transaction requires sending multiple messages as part of a commit protocol. An alternative is to use asynchronous replication.
Asynchronous Replication
A transaction is allowed to commit before all the copies have been changed. Readers still only look at a single copy; users must be aware of which copy they are reading, and that copies may be out of sync. There are two approaches to asynchronous replication: peer-to-peer and primary site.
Peer-to-Peer Replication
More than one copy can be designated as updatable (a master), and changes to the masters must be propagated to the other copies. If two masters are changed, a conflict resolution strategy must be used. Peer-to-peer replication is best used when conflicts do not arise:
Where each master site owns a disjoint fragment, usually a horizontal fragment.
Where update rights are only held by one master at a time; a backup site may gain update rights if the main site fails.
Primary Site Replication
One copy of a table is designated as the primary, or master, copy. Users register or publish primary copies, and other sites subscribe to the table (or fragments of it) by creating secondary copies. Secondary copies cannot be directly updated; changes to the primary copy must be propagated to the secondary copies. First, the changes made by committed transactions are captured; then the changes are applied to the secondary copies.
Capture Step
Log-based capture creates an update record from the recovery log when it is written to stable storage. Log changes that affect replicated tables are written to a change data table (CDT). Note that aborted transactions must, at some point, be removed from the CDT. Another approach is procedural capture, where a trigger invokes a procedure that takes a snapshot of the primary copy. Log-based capture is cheaper and has less delay, but relies on proprietary log details.
Apply Step
The apply step takes the captured changes and propagates them to the secondary copies. Changes can be continuously pushed from the master whenever a CDT is generated, or periodically requested (pulled) by the copies, with a timer or application controlling the frequency of the requests. Log-based capture with continuous apply minimizes delay, and is a cheaper substitute for synchronous replication. Procedural capture with application-driven apply gives the most flexibility.
Data Warehousing
Complex decision support queries that require data from multiple sites are popular. To improve query efficiency, all the data can be copied to one site, which is then queried; these data collections are called data warehouses. Warehouses use asynchronous replication: the source data is typically controlled by different DBMSs, and often has to be cleaned when creating the replicas. Procedural capture with application-driven apply is best suited to this environment.
Distributed Transactions
Transactions may be submitted at one site but can access data at other sites. The transaction manager breaks the transaction into sub-transactions that execute at different sites, and the sub-transactions are submitted to those sites. The transaction manager at the initial site must coordinate the activity of the sub-transactions.
Distributed concurrency control: locks and deadlocks must be managed across sites.
Distributed recovery: transaction atomicity must be ensured across sites.
Distributed Locking Schemes
In centralized locking, a single site is in charge of handling lock and unlock requests; this is vulnerable to single-site failure and bottlenecks.
In primary copy locking, all locking is done at the primary copy site for an object; reading a copy of an object usually requires communication with two sites.
In fully distributed locking, lock requests are handled by the lock manager at the local site. X locks must be set at all sites when copies are modified, while S locks are set only at the local site.
There are other protocols for locking replicated data.
Distributed Deadlock
If deadlock detection is being used (rather than prevention), the scheme must be modified:
Centralized: send all local waits-for graphs to a central site.
Hierarchical: organize the sites into a hierarchy and send local graphs to the parent.
Timeout: abort a transaction if it waits too long.
Communication delays can cause phantom deadlocks.
[Figure: local waits-for graphs involving T1 and T2 at sites A and B, and the global graph formed by their union, which reveals the deadlock]
Distributed Recovery
Recovery in a distributed system is more complex, as new kinds of failure can occur: communication failures, and failures at remote sites where sub-transactions are executing. To ensure atomicity, either all or none of a transaction's sub-transactions must commit, and this property must be guaranteed regardless of site or communication failure. This is achieved using a commit protocol.
Normal Execution
During normal execution each site maintains a log
Transactions are logged where they execute
The transaction manager at the originating site is called the coordinator
Transaction managers at sub-transaction sites are referred to as subordinates
The most widely used commit protocol is two-phase commit The 2PC protocol for normal execution starts
when the user commits a transaction
Two Phase Commit (2PC)
The coordinator sends prepare messages. Each subordinate decides whether to abort or commit, force-writes an abort or prepare log record, and sends a no or yes message to the coordinator.
If the coordinator receives unanimous yes votes, it force-writes a commit record and sends commit messages; otherwise, it force-writes an abort record and sends abort messages.
Subordinates force-write abort or commit log records and send acknowledge messages to the coordinator.
When all acknowledge messages have been received, the coordinator writes an end log record.
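The coordinator's side of these steps can be sketched as follows; messages and log writes are represented as a trace rather than real I/O, and the function assumes all acknowledgements eventually arrive (names are illustrative):

```python
def two_phase_commit(votes):
    """Coordinator side of 2PC: collect votes, force-write the decision,
    broadcast it, and write an end record once all acks are in."""
    trace = [('send', 'prepare')]                 # phase 1: voting
    decision = 'commit' if all(v == 'yes' for v in votes) else 'abort'
    trace.append(('force-log', decision))         # decision reaches stable storage
    trace.append(('send', decision))              # phase 2: termination
    trace.append(('log', 'end'))                  # after all acks are received
    return decision, trace

outcome, _ = two_phase_commit(['yes', 'yes', 'yes'])
```

Note the ordering: the decision is force-logged before the decision messages go out, which is exactly the "log before message" rule described in the notes below.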
2PC Notes
2PC requires two rounds of messages Voting phase Termination phase
Any site’s transaction manager can unilaterally abort a transaction
Log records describing decisions are always forced to stable storage before the message is sent
Log records include the record type, transaction ID, and coordinator ID The coordinator’s commit or abort log record includes the
IDs of all subordinates
2PC Recovery
If there is a commit or abort log record for a transaction T, but no end record, T must be redone. If the site is the coordinator, it keeps sending commit or abort messages until all acknowledgements are received.
If there is a prepare log record for T, but no commit or abort record, the site is a subordinate: the coordinator is repeatedly contacted to determine T's status, until a commit or abort message is received.
If there is no prepare log record for T, the transaction is unilaterally aborted, and an abort message is sent if another site enquires.
Blocking and Site Failures
If a coordinator fails, the subordinates may be unable to determine whether to commit or abort, and the transaction is blocked until the coordinator recovers.
What happens if a remote site does not respond during the commit protocol?
If the site is the coordinator, the transaction should be aborted.
If the site is a subordinate that has not voted yes, it should abort the transaction.
If the site is a subordinate that has voted yes, it is blocked until the coordinator responds.
2PC Comments
The acknowledge messages are used to tell the coordinator that it can forget a transaction; until all acknowledge messages are received, it must keep T in the transaction table.
The coordinator may fail after sending prepare messages but before writing a commit or abort record; on recovery, it has no information about the transaction's status from before the crash, so it subsequently aborts the transaction. If another site enquires about T, the recovery process responds with an abort message.
If a sub-transaction doesn't perform updates, its commit or abort status is irrelevant.
2PC with Presumed Abort
When a coordinator aborts T, it can undo T and remove it from the transaction table immediately: if there is no information about T, it is presumed to be aborted. Similarly, subordinates do not need to send ack messages on abort, as the coordinator does not have to wait for acks to abort a transaction. Abort log records do not have to be force-written, since the default decision is to abort a transaction.
2PC with Presumed Abort …
If a sub-transaction does not perform updates, it responds to prepare with a reader message and writes no log records. If the coordinator receives a reader message, it is treated as a yes vote, but no further messages are sent to that subordinate. If all sub-transactions are readers, the second phase of the protocol is not required, and the transaction can be removed from the transaction table.
Cloud Databases
Cloud Computing
In cloud computing, a vendor supplies computing resources as a service: a large number of computers are connected through a communication network such as the internet … The client runs applications and stores data using these resources, and can access them with little effort.
Web Databases
Web applications have to be highly scalable: applications may have hundreds of millions of users, requiring data to be partitioned across thousands of processors. There are a number of systems for data storage on the cloud, such as Bigtable (from Google). They do not necessarily guarantee the ACID properties; they drop ACID …
Data Representation
Many web data storage systems are not built around an SQL data model Such as NoSql DBs or BigTable Some support semi-structured data
Many web applications manage without extensive query language support
Data storage systems often allow multiple versions of data items to be stored Versions can be identified by timestamp
Partitioning Data
Data is often partitioned using hash or range partitioning; such partitions are referred to as tablets, and partitioning is performed dynamically as required. It is necessary to know which site contains a particular tablet: a tablet controller site tracks the partitioning function and can map a request to the appropriate site. The mapping information can be replicated to a set of router sites, so that the controller does not become a bottleneck.
Traditional DBs on the Cloud
A cloud database introduces a number of challenges to making a DB ACID compliant: locking, ensuring transactions are atomic, and frequent communication between sites. In addition, there are a number of issues that relate to both DBs and data storage: replication is controlled by the cloud vendor, and there are security and legal issues.