ETL Contents


Contents

Staging Area
ODS (Operational Data Store)
Data Warehouse and Data Mart: Definition & Difference
    Data Warehousing
    Data Mart
    Difference between Data Warehousing and Data Mart
Stored Procedure
Debugger
    Creating Breakpoints
        Error Breakpoints
        Data Breakpoints
Teradata
    1. BTEQ (Basic Teradata Query)
        BTEQ commands
            Session control commands
            File control commands
            Sequence control commands
            Format control commands
        BTEQ modes
            Export Data
            Export Report
            Export INDICDATA
            Export DIF
    2. Teradata FastLoad
    3. Teradata MultiLoad
    4. TPump (Teradata Parallel Data Pump)
    5. TPT (Teradata Parallel Transporter)
    FastExport
    Locks in Teradata
UNIX

Staging Area

A staging area is a place where you hold temporary tables on the data warehouse server. Staging tables are connected to the work area or to fact tables. We basically need a staging area to hold the data, and to perform data cleansing and merging, before loading the data into the warehouse.


In the absence of a staging area, the data load will have to go from the OLTP system to the OLAP system directly, which in fact will severely hamper the performance of the OLTP system. This is the primary reason for the existence of a staging area. In addition, it also offers a platform for carrying out data cleansing.

A staging area is a temporary schema used to:

1. Do flat mapping, i.e. dumping all the OLTP data into it without applying any business rules. Pushing data into staging takes less time because no business rules or transformations are applied to it.
2. Perform data cleansing and validation, for example using First Logic.

A staging area is like a large table with data separated from its sources, to be loaded into the data warehouse in the required format. If we attempt to load data directly from OLTP, it might mess up the OLTP system because of format changes between a warehouse and OLTP. Keeping the OLTP data intact is very important for both the OLTP system and the warehouse.

Depending on the complexity of the business rules, we may require a staging area; its basic purpose is to clean the OLTP source data and gather it in one place, as in the sketch below. It is essentially a temporary database area: staging data is used for further processing, and after that it can be deleted.

ODS (Operational Data Store)

An ODS can be described as a snapshot of the OLTP system. It sits between the staging area and the data warehouse, and it acts as a source for the EDW (Enterprise Data Warehouse). An ODS is a relational, normalized data store containing the transaction data and the current values of master data from the OLTP system; it is more normalised than the EDW, and it holds OLTP-type data such as day-to-day transactions.

An ODS does not store the history of master data such as the customer, store, and product. When the value of the master data in the OLTP system changes, the ODS is updated accordingly. An ODS integrates data from several OLTP systems and, unlike a data warehouse, an ODS is updatable.

Because an ODS contains integrated data from several OLTP systems, it is an ideal place to be used for customer support.

Normally the dimension tables remain at the ODS (SCD types can be applied in the ODS), whereas the facts flow on to the EDW. More importantly, client reporting requirements determine what to keep in the ODS versus the EDW.


Data Warehouse and Data Mart: Definition & Difference

Data warehousing and data marts are tools used in data storage. With the passage of time, small companies become big, and this is when they realize that they have amassed huge amounts of data in various departments of the organization. Every department has its own database that works well for that department. But when organizations intend to sort data from various departments for sales, marketing or making plans for the future, the process is referred to as data mining. Data warehouses and data marts are two tools that help companies in this regard. Just what the difference between a data warehouse and a data mart is, and how they compare with each other, is what this article intends to explain.

Data Warehousing

A data warehouse is a collection of data marts representing historical data from different operations in the company. This data is stored in a structure optimized for querying and data analysis. Table design, dimensions and organization should be consistent throughout a data warehouse so that reports or queries across the data warehouse are consistent. A data warehouse can also be viewed as a database for historical data from different functions within a company.

This is the place where all the data of a company is stored. It is actually a very fast computer system with a large storage capacity. It contains data from all the departments of the company and is constantly updated to delete redundant data. This tool can answer all complex queries pertaining to data.

Data Mart

A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses, usually smaller than the corporate data warehouse.

It is an indexing and extraction system. Instead of putting the data from all the departments of a company into one warehouse, a data mart contains the databases of separate departments and can come up with information using multiple databases when asked.

IT managers of growing companies are often unsure whether they should make use of data marts or instead switch over to the more complex and more expensive data warehouse. These tools are easily available in the market, but they pose a dilemma to IT managers.

Difference between Data Warehousing and Data Mart

It is important to note that there are huge differences between these two tools, though they may serve the same purpose. Firstly, a data mart contains the programs, data, software and hardware of a specific department of a company. There can be separate data marts for finance, sales, production or marketing. All these data marts are different, but they can be coordinated. The data mart of one department is different from the data mart of another department, and though indexed, this system is not suitable for a huge database, as it is designed to meet the requirements of a particular department.

Data warehousing is not limited to a particular department; it represents the database of a complete organization. The data stored in a data warehouse is more detailed, though indexing is light as it has to store huge amounts of information. It is also difficult to manage and takes a long time to process. It follows that data marts are quick and easy to use, as they make use of small amounts of data. Data warehousing is also more expensive for the same reason.


Stored Procedure

Database administrators create stored procedures to automate tasks that are too complicated for standard SQL statements. A Stored Procedure transformation is an important tool for leveraging existing stored procedures from within PowerCenter.

Not all databases support stored procedures, and stored procedure syntax varies depending on the database. You might use stored procedures to complete the following tasks:

- Check the status of a target database before loading data into it.
- Determine if enough space exists in a database.
- Perform a specialized calculation: complex calculations where database procedures perform better than using multiple transformations in Informatica. Complex calculations can be broken into multiple stored procedures and used as Stored Procedure transformations in the same ETL; the output of one Stored Procedure transformation can act as the input to another.
- Drop and recreate indexes: for bulk loading, we need to drop the index first and recreate the index after the load is done.
- Reuse complex procedures across different ETLs in the form of Stored Procedure transformations.

You might use a stored procedure to perform a query or calculation that you would otherwise make part of a mapping. For example, if you already have a well-tested stored procedure for calculating sales tax, you can perform that calculation through the stored procedure instead of recreating the same calculation in an Expression transformation. It is best practice not to run unnecessary instances of stored procedures, as they can impact performance.
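A minimal sketch of such a sales-tax procedure in SQL (the procedure name, parameters and the flat 8.25% rate are all illustrative assumptions, not from the text):

-- Hypothetical procedure: names and the 8.25% rate are illustrative.
CREATE PROCEDURE calc_sales_tax
  (IN  sale_amount DECIMAL(10,2),
   OUT tax_amount  DECIMAL(10,2))
BEGIN
  -- Apply a flat sales-tax rate to the input amount.
  SET tax_amount = sale_amount * 0.0825;
END;

A mapping could then invoke it, e.g. CALL calc_sales_tax(100.00, :tax), through the Stored Procedure transformation rather than re-implementing the arithmetic in an Expression transformation.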

Each time a stored procedure runs during a mapping, the session must wait for the stored procedure to complete in the database. You have two possible options to avoid this:

- Reduce the row count. Use an active transformation prior to the Stored Procedure transformation to reduce the number of rows that must be passed to the stored procedure. Or, create an expression that tests the values before passing them to the stored procedure, so that only values that really need the procedure are passed.
- Create an expression. Most of the logic used in stored procedures can be easily replicated using expressions in the Designer.

Debugger

The Debugger is used to find logical or data errors.

The Debugger is an integral part of the Informatica PowerCenter Mapping Designer, and it helps you troubleshoot logical error or data error conditions in an Informatica mapping. The Debugger user interface shows the step-by-step execution path of a mapping and how the source data is transformed in the mapping. Features like breakpoints and expression evaluation make the debugging process easy.

The Debugger can only be used with a VALID mapping.

Note: add the Debugger toolbar in the Designer; it makes things easier to work with.

Understand Debugger Interface


The Debugger user interface is integrated with the Mapping Designer. Once you invoke the Debugger, you get a few additional windows that display debugging information, such as the Instance window, which shows how the data is transformed at a transformation instance, and the Target window, which shows what data is written to the target.

1. Instance Window: View transformation data. This window is refreshed as the Debugger progresses from one transformation to another. You can choose a specific transformation from the drop-down list to see what the data looks like at that particular transformation instance for a particular source row.

2. Target Window: View target data. You can see whether the record is going to be inserted, updated, deleted or rejected. If there are multiple target instances, you can choose the target instance name from the drop-down window to see its data.

3. Mapping Window: Shows the step-by-step execution path of a mapping. It highlights the transformation instance being processed and shows the breakpoints set up on different transformations.

4. Debugger Log : This window shows messages from the Debugger.

Creating Breakpoints

When you are running a Debugger session, you may not be interested in seeing the data transformations at every transformation instance, only at specific transformations where you expect a logical or data error. For example, you might want to see what is going wrong in the Expression transformation EXP_INSERT_UPDATE for a specific customer record, say CUST_ID = 1001. By setting a breakpoint, you can pause the Debugger on a specific transformation or when a specific condition is satisfied. You can set two types of breakpoints.

Error Breakpoints: When you create an error breakpoint, the Debugger pauses when the Integration Service encounters error conditions such as a transformation error. You can also set the number of errors to skip for each breakpoint before the Debugger pauses.

Data Breakpoints: When you create a data breakpoint, the Debugger pauses when the data breakpoint condition evaluates to true. You can set the number of rows to skip for each breakpoint before the Debugger pauses.

Teradata

Teradata is so advanced in the data-loading department that other database vendors can't hold a candle to it. A Teradata data warehouse brings enormous amounts of data into the system. This is an area that most companies overlook when purchasing a data warehouse: most company officials think loading data is simply that, just loading data, but in reality it is more than that. Teradata provides several load and unload utilities for your specific requirements. I am going to explain the Teradata data-loading utilities one by one: their features, uses and best practices.


1. BTEQ (Basic Teradata Query)

BTEQ (Basic Teradata Query), pronounced Bee-Tek, is a general-purpose, command-based tool that provides an interactive or batch interface for submitting SQL statements, importing and exporting data, and generating reports. It imports and exports data at the row level, provides report formatting, and returns queried data to the screen, a file or a printer.

BTEQ commands

Basically there are four types of BTEQ commands:

Session control commands – Begins and ends BTEQ sessions, and controls session characteristics

File control commands – Specifies input and output formats and identifies information sources and destinations

Sequence control commands – Controls the sequence in which other BTEQ commands and SQL statements are executed

Format control commands – Controls the format of screen and printer output.

BTEQ modes

There are also four different BTEQ export modes:

Export Data – BTEQ allows multiple techniques for exporting data. We usually think of an export as moving data off Teradata to a normal flat file. That is example number one, and it is called RECORD mode.

Export Report – BTEQ can take your SQL output report, include the headers, and export it all together, so it looks like an electronic report. That is EXPORT REPORT mode, also called Field mode.


Export INDICDATA – When there are NULLs in your data and you export them to a mainframe, the mainframe application could run into problems, so INDICDATA flags the NULL values.

Export DIF – The last mode is DIF, used when you want certain flat files to be usable by PC applications that utilize the Data Interchange Format. A minimal script sketch follows.
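A minimal BTEQ sketch of a RECORD-mode export (the logon string, file name and table are illustrative assumptions, not from the text):

.LOGON tdpid/username,password;
.EXPORT DATA FILE = empdata.dat;
/* RECORD mode: raw rows are written to the flat file */
SELECT emp_num, emp_name
FROM empdb.emp_table;
.EXPORT RESET;
/* .EXPORT REPORT FILE = ... would instead produce Field-mode output with headers */
.LOGOFF;
.QUIT;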

2. Teradata FastLoad

FastLoad, as the name suggests, can load vast amounts of data from flat files on a host into empty tables in Teradata with lightning-like speed. This lightning speed is possible because it does not use the transient journal. FastLoad was developed to load millions of records into empty Teradata tables. It loads data into empty tables in 64K blocks, and its only DML operation is INSERT.

FastLoad divides its job into two phases: Phase 1, the Acquisition Phase, and Phase 2, the Application Phase. In Phase 1, data is retrieved from the mainframe or server and moved over the network into Teradata. The data moves in 64K blocks and is stored in worktables on the AMPs. When all of the data has been moved from the server or mainframe flat file, then in Phase 2 each AMP hashes its worktable rows so that each row transfers to the worktable on the proper destination AMP.

Rules before using Teradata FastLoad:

- The target tables must be empty
- No Secondary Indexes are allowed on the target table
- No Referential Integrity is allowed
- No Triggers are allowed at load time
- No Join Indexes are allowed on the target table
- No AMPs may go down while FastLoad is processing
- No more than one data type conversion is allowed per column during a FastLoad

As FastLoad does not support indexes or triggers, it is better practice to drop these before running FastLoad; after loading the table you can enable them again. A minimal script sketch follows.
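A minimal FastLoad sketch that loads a comma-delimited file into an empty table (the logon string, file, table and field names are illustrative assumptions):

LOGON tdpid/username,password;
DATABASE empdb;
SET RECORD VARTEXT ",";
DEFINE emp_num  (VARCHAR(10)),
       emp_name (VARCHAR(50))
FILE = empdata.csv;
BEGIN LOADING emp_table ERRORFILES emp_err1, emp_err2;
INSERT INTO emp_table VALUES (:emp_num, :emp_name);
END LOADING;
LOGOFF;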

3. Teradata MultiLoad

FastLoad is used only to load data into empty tables, whereas the Teradata MultiLoad utility can load, update and delete large tables in Teradata in bulk mode. MultiLoad has the capability to load multiple tables at one time from either a LAN or channel environment. This feature-rich utility can perform multiple types of DML tasks, including INSERT, UPDATE, DELETE and UPSERT, on up to five empty or populated target tables at a time. These DML functions may be run either solo or in combination, against one or more tables. For these reasons, MultiLoad is the utility of choice when it comes to loading populated tables in the batch environment. As the volume of data loaded or updated in a single block grows, MultiLoad's performance improves. MultiLoad runs on a variety of client platforms, operates in a fail-safe mode and is fully recoverable.

Rules before using Teradata MultiLoad:

- Unique Secondary Indexes are not supported on a target table
- Referential Integrity is not supported
- No Triggers at load time
- No concatenation of input files is allowed
- No Join Indexes


As MultiLoad does not support these indexes and triggers, it is better practice to drop them before running MultiLoad; after loading the table you can enable them again. A minimal script sketch follows.
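A minimal MultiLoad sketch that inserts rows from a delimited file (log table, layout, file and field names are illustrative assumptions):

.LOGTABLE empdb.emp_ml_log;
.LOGON tdpid/username,password;
.BEGIN MLOAD TABLES empdb.emp_table;
.LAYOUT emp_layout;
.FIELD emp_num  * VARCHAR(10);
.FIELD emp_name * VARCHAR(50);
.DML LABEL ins_emp;
INSERT INTO empdb.emp_table VALUES (:emp_num, :emp_name);
.IMPORT INFILE empdata.txt FORMAT VARTEXT ',' LAYOUT emp_layout APPLY ins_emp;
.END MLOAD;
.LOGOFF;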

4. TPump (Teradata Parallel Data Pump)

Both FastLoad and MultiLoad assemble massive volumes of data rows into 64K blocks and then move those blocks. TPump does NOT move data in large blocks. Instead, it loads data one row at a time, using row hash locks. Because it locks at this level, and not at the table level like MultiLoad, TPump can make many simultaneous, or concurrent, updates on a table.

TPump is used to perform inserts, updates, deletes and upserts from flat files into populated Teradata tables at row level. There are certain benefits to using TPump: the target tables can have Secondary Indexes, Referential Integrity, Triggers and Join Indexes, and TPump can pump data at varying rates.

But TPump also has certain limitations, listed here (a script sketch follows the list):

- No concatenation of input data files is allowed
- TPump will not process aggregates, arithmetic functions or exponentiation
- The use of the SELECT function is not allowed
- No more than four IMPORT commands may be used in a single load task
- Dates before 1900 or after 1999 must be represented by the yyyy format for the year portion of the date, not the default format of yy
- On some network-attached systems, the maximum file size when using TPump is 2GB
- TPump performance will be diminished if Access Logging is used
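A minimal TPump sketch performing row-at-a-time inserts (the session count, PACK/RATE values, error table and all object names are illustrative assumptions; real jobs tune these):

.LOGTABLE empdb.emp_tp_log;
.LOGON tdpid/username,password;
.BEGIN LOAD SESSIONS 4 ERRORTABLE empdb.emp_tp_err PACK 20 RATE 1000;
.LAYOUT emp_layout;
.FIELD emp_num  * VARCHAR(10);
.FIELD emp_name * VARCHAR(50);
.DML LABEL ins_emp;
INSERT INTO empdb.emp_table VALUES (:emp_num, :emp_name);
.IMPORT INFILE empdata.txt FORMAT VARTEXT ',' LAYOUT emp_layout APPLY ins_emp;
.END LOAD;
.LOGOFF;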

5. TPT (Teradata Parallel Transporter)

The Teradata Parallel Transporter (TPT) utility combines the BTEQ, FastLoad, MultiLoad, TPump and FastExport utilities into one comprehensive language utility. This allows TPT to insert data into tables, export data from tables, and update tables.

TPT works around the concepts of operators and data streams. There is an operator to read source data and pass the contents of that source to a data stream, where another operator is responsible for taking the data stream and loading it to disk.

Whereas other utilities usually process multiple data sources in a serial manner, Teradata Parallel Transporter can access multiple data sources in parallel. Teradata Parallel Transporter also allows different specifications for different data sources and, if their data is UNION-compatible, merges them together.

[Figure: Teradata Parallel Transporter infrastructure]


There are four main components of Teradata Parallel Transporter (a job script sketch follows):

Load Operator: Uses the FastLoad protocol. It is a parallel load utility designed to move large volumes of data, collected from data sources on channel- and network-attached clients, into empty tables in the Teradata Database.

Update Operator: Uses the MultiLoad protocol. It can update, insert and upsert large volumes of data into empty or populated tables, and/or delete data from tables.

Export Operator: Uses the FastExport protocol. It can export large data sets from Teradata tables or views and bring the data to a client system for processing, for generating large reports, or for loading into a smaller database.

Stream Operator: Uses the TPump protocol. It can load near-real-time data into the data warehouse, and can insert, update, upsert and delete data in the Teradata Database, particularly in environments where data warehouse maintenance overlaps normal working hours. Because the TPump protocol uses row hash locks, users can run queries even while they are updating the Teradata system.
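A minimal TPT job sketch pairing a file-reading producer with a Load operator (the job name, schema, attribute values and file name are all illustrative assumptions):

DEFINE JOB load_emp_job
DESCRIPTION 'Sketch: load a delimited flat file into an empty table'
(
  DEFINE SCHEMA emp_schema
  ( emp_num VARCHAR(10), emp_name VARCHAR(50) );

  /* Producer operator: reads the source file into a data stream */
  DEFINE OPERATOR file_reader
  TYPE DATACONNECTOR PRODUCER
  SCHEMA emp_schema
  ATTRIBUTES
  ( VARCHAR FileName = 'empdata.txt',
    VARCHAR Format = 'Delimited',
    VARCHAR TextDelimiter = ',' );

  /* Consumer operator: loads the stream using the FastLoad protocol */
  DEFINE OPERATOR load_op
  TYPE LOAD
  SCHEMA *
  ATTRIBUTES
  ( VARCHAR TdpId = 'tdpid',
    VARCHAR UserName = 'username',
    VARCHAR UserPassword = 'password',
    VARCHAR TargetTable = 'empdb.emp_table' );

  APPLY ('INSERT INTO empdb.emp_table VALUES (:emp_num, :emp_name);')
  TO OPERATOR (load_op)
  SELECT * FROM OPERATOR (file_reader);
);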

FastExport

FastExport, as the name spells out, exports data from Teradata to a flat file. BTEQ also does the same thing; the main difference is that BTEQ exports data in rows, while FastExport exports data in 64K blocks. So if you need to export data with lightning speed, FastExport is the best choice.

FastExport is a 64K-block utility, so it falls under the limit of 15 block utilities: a system can't run more than a combination of 15 FastLoads, MultiLoads and FastExports at once.

Basic fundamentals of FastExport:

1. FastExport EXPORTS data from Teradata.
2. FastExport only supports the SELECT statement.
3. Choose FastExport over BTEQ when exporting more than half a million rows.

4. FastExport supports multiple SELECT statements and multiple tables in a single run.
5. FastExport supports conditional logic, conditional expressions, arithmetic calculations, and data conversions.
6. FastExport does NOT support error files or error limits.
7. FastExport supports user-written routines: INMODs and OUTMODs.

Sample FastExport script:

.LOGTABLE Empdb.Emp_Table_log;
.LOGON TD/USERNAME,PWD;
.BEGIN EXPORT SESSIONS 12;
.EXPORT OUTFILE C:\TEMP\EMPDATA.txt FORMAT BINARY;
SELECT EMP_NUM (CHAR(10))
      ,EMP_NAME (CHAR(50))
      ,SALARY (CHAR(10))
      ,EMP_PHONE (CHAR(10))
FROM Empdb.Emp_Table;
.END EXPORT;
.LOGOFF;

FastExport modes

FastExport has two modes: RECORD and INDICATOR. RECORD mode is the default, but you can use INDICATOR mode if required. The difference between the two is that INDICATOR mode sets the indicator bits to 1 for column values containing NULLs.

FastExport formats

FastExport can export data in the formats below:

- FASTLOAD
- BINARY
- TEXT
- UNFORMAT

Locks in Teradata

Locking prevents multiple users who are trying to access or change the same data simultaneously from violating data integrity. This concurrency control is implemented by locking the target data.

Locks are automatically acquired during the processing of a request and released when the request is terminated.


Levels of locking

Locks may be applied at three levels:

- Database locks: apply to all tables and views in the database.
- Table locks: apply to all rows in the table or view.
- Row hash locks: apply to a group of one or more rows in a table.

There are four kinds of locks in Teradata:

Exclusive - Exclusive locks are placed when a DDL is fired on the database or table, meaning that the database object is undergoing structural changes. Concurrent accesses will be blocked. Compatibility: none.

Write - A write lock is placed during a DML operation. INSERT, UPDATE and DELETE will trigger a write lock. It may allow users to fire SELECT queries, but data consistency will not be ensured. Compatibility: Access locks (for users not concerned with data consistency; the underlying data may change and the user may get a "dirty read").

Read - This lock happens due to SELECT access. A read lock is not compatible with Exclusive or Write locks. Compatibility: other Read locks and Access locks.

Access - Placed when a user uses the LOCKING FOR ACCESS phrase. An access lock allows users to read a database object that is already under a write lock or read lock. An access lock is not compatible with an Exclusive lock, does not ensure data integrity, and may lead to a "stale read".

Categorised by level, we can have locks at database, table or row level.

Row Hash Lock: A row hash lock is a one-AMP operation in which the primary index is utilized in the WHERE clause of the query. How it helps: rather than locking the entire table, Teradata locks only those rows that have the same row hash value as generated from the WHERE clause.
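A short SQL sketch of these ideas (the table and column names are illustrative; it assumes emp_num is the primary index of the hypothetical table):

/* Access lock: read a table that may be under a write lock;
   fast, but the result may be a "dirty read". */
LOCKING TABLE empdb.emp_table FOR ACCESS
SELECT emp_num, emp_name
FROM empdb.emp_table;

/* Row hash lock: a primary-index predicate lets Teradata lock only
   the matching row hash on one AMP instead of the whole table. */
SELECT emp_name
FROM empdb.emp_table
WHERE emp_num = '1001';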

UNIX

cat --- for creating and displaying short files
chmod --- change permissions
cd --- change directory
cp --- for copying files
date --- display date
echo --- echo argument
ftp --- connect to a remote machine to download or upload files
grep --- search file
head --- display first part of file
ls --- see what files you have
lpr --- standard print command (see also print)
more --- use to read files
mkdir --- create directory
mv --- for moving and renaming files
ncftp --- especially good for downloading files via anonymous ftp
print --- custom print command (see also lpr)
pwd --- find out what directory you are in
rm --- remove a file
rmdir --- remove directory
rsh --- remote shell
setenv --- set an environment variable
sort --- sort file
tail --- display last part of file
tar --- create an archive, add or extract files
telnet --- log in to another machine
wc --- count characters, words, lines

cat

This is one of the most flexible Unix commands. We can use it to create, view and concatenate files. For our first example we create a three-item English-Spanish dictionary in a file called "dict".

% cat >dict
red rojo
green verde
blue azul
<control-D>
%

<control-D> stands for "hold the control key down, then tap 'd'". The symbol > tells the computer that what is typed is to be put into the file dict. To view a file we use cat in a different way:

% cat dict
red rojo
green verde
blue azul
%

If we wish to add text to an existing file we do this:

% cat >>dict
white blanco
black negro
<control-D>
%

Now suppose that we have another file tmp that looks like this:

% cat tmp
cat gato
dog perro
%

Then we can join dict and tmp like this:

% cat dict tmp >dict2


We could check the number of lines in the new file like this:

% wc -l dict2
7 dict2

The command wc counts things --- the number of characters, words, and lines in a file.

chmod

This command is used to change the permissions of a file or directory. For example to make a file essay.001 readable by everyone, we do this:

% chmod a+r essay.001

To make a file, e.g., a shell script mycommand executable, we do this

% chmod +x mycommand

Now we can run mycommand as a command.

To check the permissions of a file, use ls -l . For more information on chmod, use man chmod.

cd

Use cd to change directory. Use pwd to see what directory you are in.

% cd english
% pwd
/u/ma/jeremy/english
% ls
novel poems
% cd novel
% pwd
/u/ma/jeremy/english/novel
% ls
ch1 ch2 ch3 journal scrapbook
% cd ..
% pwd
/u/ma/jeremy/english
% cd poems
% cd
% pwd
/u/ma/jeremy

Jeremy began in his home directory, then went to his english subdirectory. He listed this directory using ls, found that it contained two entries, both of which happen to be directories. He cd'd to the directory novel, and found that he had gotten only as far as chapter 3 in his writing. Then he used cd .. to jump back one level. If he had wanted to jump back one level and then go to poems, he could have said cd ../poems. Finally he used cd with no argument to jump back to his home directory.


cp

Use cp to copy files or directories.

% cp foo foo.2

This makes a copy of the file foo.

% cp ~/poems/jabber .

This copies the file jabber in the directory poems to the current directory. The symbol "." stands for the current directory. The symbol "~" stands for the home directory.

date

Use this command to check the date and time.

% date
Fri Jan 6 08:52:42 MST 1995

echo

The echo command echoes its arguments. Here are some examples:

% echo this
this
% echo $EDITOR
/usr/local/bin/emacs
% echo $PRINTER
b129lab1

Things like PRINTER are so-called environment variables. This one stores the name of the default printer --- the one that print jobs will go to unless you take some action to change things. The dollar sign before an environment variable is needed to get the value in the variable. Try the following to verify this:

% echo PRINTER
PRINTER

ftp

Use ftp to connect to a remote machine, then upload or download files. See also: ncftp

Example 1: We'll connect to the machine fubar.net, then change directory to mystuff, then download the file homework11:

% ftp solitude
Connected to fubar.net.
220 fubar.net FTP server (Version wu-2.4(11) Mon Apr 18 17:26:33 MDT 1994) ready.
Name (solitude:carlson): jeremy
331 Password required for jeremy.
Password:
230 User jeremy logged in.
ftp> cd mystuff
250 CWD command successful.
ftp> get homework11


ftp> quit

Example 2: We'll connect to the machine fubar.net, then change directory to mystuff, then upload the file collected-letters:

% ftp solitude
Connected to fubar.net.
220 fubar.net FTP server (Version wu-2.4(11) Mon Apr 18 17:26:33 MDT 1994) ready.
Name (solitude:carlson): jeremy
331 Password required for jeremy.
Password:
230 User jeremy logged in.
ftp> cd mystuff
250 CWD command successful.
ftp> put collected-letters
ftp> quit

The ftp program sends files in ascii (text) format unless you specify binary mode:

ftp> binary
ftp> put foo
ftp> ascii
ftp> get bar

The file foo was transferred in binary mode, the file bar was transferred in ascii mode.

grep

Use this command to search for information in a file or files. For example, suppose that we have a file dict whose contents are

red rojo
green verde
blue azul
white blanco
black negro

Then we can look up items in our file like this:

% grep red dict
red rojo
% grep blanco dict
white blanco
% grep brown dict
%

Notice that no output was returned by grep brown. This is because "brown" is not in our dictionary file.

Grep can also be combined with other commands. For example, if one had a file of phone numbers named "ph", one entry per line, then the following command would give an alphabetical list of all persons whose name contains the string "Fred".

% grep Fred ph | sort
Alpha, Fred: 333-6565
Beta, Freddie: 656-0099
Frederickson, Molly: 444-0981
Gamma, Fred-George: 111-7676
Zeta, Frederick: 431-0987

The symbol "|" is called "pipe." It pipes the output of the grep command into the input of the sort command.

For more information on grep, consult

% man grep

head

Use this command to look at the head of a file. For example,

% head essay.001

displays the first 10 lines of the file essay.001. To see a specific number of lines, do this:

% head -n 20 essay.001

This displays the first 20 lines of the file.

ls

Use ls to see what files you have. Your files are kept in something called a directory.

% ls
foo     letter2
foobar  letter3
letter1 maple-assignment1
%

Note that you have six files. There are some useful variants of the ls command:

% ls l*
letter1 letter2 letter3
%

Note what happened: all the files whose names begin with "l" are listed. The asterisk (*) is the "wildcard" character. It matches any string.

lpr

This is the standard Unix command for printing a file. It stands for the ancient "line printer." See

% man lpr

for information on how it works. See print for information on our local intelligent print command.

mkdir

Use this command to create a directory.


% mkdir essays

To get "into" this directory, do

% cd essays

To see what files are in essays, do this:

% ls

There shouldn't be any files there yet, since you just made it. To create files, see cat or emacs.

more

More is a command used to read text files. For example, we could do this:

% more poems

The effect of this is to let you read the file "poems". It probably will not fit in one screen, so you need to know how to "turn pages". Here are the basic commands:

q --- quit more
spacebar --- read next page
return key --- read next line
b --- go back one page

For still more information, use the command man more.

mv

Use this command to change the names of files and directories.

% mv foo foobar

The file that was named foo is now named foobar.

ncftp

Use ncftp for anonymous ftp --- that means you don't have to have a password.

% ncftp ftp.fubar.net
Connected to ftp.fubar.net
> get jokes.txt

The file jokes.txt is downloaded from the machine ftp.fubar.net.

print

This is a moderately intelligent print command.


% print foo
% print notes.ps
% print manuscript.dvi

In each case print does the right thing, regardless of whether the file is a text file (like foo), a postscript file (like notes.ps), or a dvi file (like manuscript.dvi). In these examples the file is printed on the default printer. To see what this is, do

% print

and read the message displayed. To print on a specific printer, do this:

% print foo jwb321
% print notes.ps jwb321
% print manuscript.dvi jwb321

To change the default printer, do this:

% setenv PRINTER jwb321

pwd

Use this command to find out what directory you are working in.

% pwd
/u/ma/jeremy
% cd homework
% pwd
/u/ma/jeremy/homework
% ls
assign-1 assign-2 assign-3
% cd
% pwd
/u/ma/jeremy
%

Jeremy began by working in his "home" directory. Then he cd'd into his homework subdirectory. Cd means "change directory". He used pwd to check to make sure he was in the right place, then used ls to see if all his homework files were there. (They were.) Then he cd'd back to his home directory.

rm

Use rm to remove files from your directory.

% rm foo
remove foo? y
% rm letter*
remove letter1? y
remove letter2? y
remove letter3? n
%

The first command removed a single file. The second command was intended to remove all files beginning with the string "letter." However, our user (Jeremy?) decided not to remove letter3.


rmdir

Use this command to remove a directory. For example, to remove a directory called "essays", do this:

% rmdir essays

A directory must be empty before it can be removed. To empty a directory, use rm.

rsh

Use this command if you want to work on a computer different from the one you are currently working on. One reason to do this is that the remote machine might be faster. For example, the command

% rsh solitude

connects you to the machine solitude. This is one of our public workstations and is fairly fast.

See also: telnet

setenv

% echo $PRINTER
labprinter
% setenv PRINTER myprinter
% echo $PRINTER
myprinter

sort

Use this command to sort a file. For example, suppose we have a file dict with contents

red rojo
green verde
blue azul
white blanco
black negro

Then we can do this:

% sort dict
black negro
blue azul
green verde
red rojo
white blanco

Here the output of sort went to the screen. To store the output in a file we do this:

% sort dict >dict.sorted

You can check the contents of the file dict.sorted using cat, more, or emacs.

tail

Use this command to look at the tail of a file. For example,


% tail essay.001

displays the last 10 lines of the file essay.001. To see a specific number of lines, do this:

% tail -n 20 essay.001

This displays the last 20 lines of the file.

tar

Use tar to create compressed archives of directories and files, and also to extract directories and files from an archive. Example:

% tar -tvzf foo.tar.gz

displays the file names in the compressed archive foo.tar.gz while

% tar -xvzf foo.tar.gz

extracts the files.

telnet

Use this command to log in to another machine from the machine you are currently working on. For example, to log in to the machine "solitude", do this:

% telnet solitude

See also: rsh.

wc

Use this command to count the number of characters, words, and lines in a file. Suppose, for example, that we have a file dict with contents

red rojo
green verde
blue azul
white blanco
black negro

Then we can do this:

% wc dict
5 10 56 dict

This shows that dict has 5 lines, 10 words, and 56 characters.

The word count command has several options, as illustrated below:

% wc -l dict
5 dict
% wc -w dict
10 dict
% wc -c dict
56 dict