Extending Piwik At R7.com


Description

How we deployed the Piwik web analytics system to handle a huge amount of unpredictable traffic, adding some cloud and modern scalability techniques.

Files: https://github.com/lorieri/piwik-presentation


Page 1: Extending Piwik At R7.com

Extending Piwik at r7.com

Phase 1 – Collecting data

Adding some cloud and modern scalability to a traditional LAMP stack

leonardo lorieri, r7.com system architect, 'lorieri at gmail.com', feb/2012

Page 2: Extending Piwik At R7.com

Why Piwik?

 - Open Source = flexible, understandable, free!

 - Great interface

 - Mobile app

 - REST API

 - Developers know the market's needs

 - Efficient on small machines

 - Lots of possible improvements

 - Lots of improvements already in the roadmap

 - Great and supportive community (Thank you all!)

Page 3: Extending Piwik At R7.com

Our Plan, goals and trade-offs

 - Don't change the original code
   - reduces development and maintenance costs

 - Count only visits and page views
   - to be fast and focused (even though you can still use the .js tracker, it is easy to get lost in the UI's beauty and all its functionality)

 - Handle odd, unexpected traffic peaks
   - from TV announcements

 - Count not only websites
   - media delivery, internal searches, debugging

 - At least 99% accuracy

 - Have numbers to compare with other analytics tools

 - We've lost P3P for now

Page 4: Extending Piwik At R7.com

Our big problem - The TV Effect

from Gaiser's presentation at http://www.slideshare.net/rgaiser/r7-no-aws-qcon-sp-2011

Traffic peak during a TV Show

Page 5: Extending Piwik At R7.com

Regular Piwik Setup

based on Rodrigo Campos presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes

 - Apache/Nginx - Php - Mysql

Page 6: Extending Piwik At R7.com

Bigger Piwik Setup

based on Rodrigo Campos presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes

 - Apache/Nginx - Php

 - MySql

Page 7: Extending Piwik At R7.com

Regular Php Scaling Piwik Setup

based on Rodrigo Campos presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes

 - Apache/Nginx - Php

 - MySql Replication (slave for backup only; Piwik is not "slave ready")

Load balancer/Nginx

Page 8: Extending Piwik At R7.com

Two problems, one easy solution

 - Problem: data collection for the TV Effect
   Easy solution: make it asynchronous

 - Problem: data processing
   Hard solution: huge ($$$) servers and complex tuning

Page 9: Extending Piwik At R7.com

Asynchronous Piwik Setup

based on Rodrigo Campos presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes

 - Nginx - NOT even Php
   (manages user cookies)

 - Load balancer/Nginx

 - MySql Master - Apache+Php for the Admin UI - Archive cron

 - MySql Slave - Perl/Python worker to process logs
   (accesses the logs)

[Diagram: visits carrying the user cookie hit the load balancer/Nginx pool with an <img src=> request; the worker on the MySql slave reads the access logs and replays them through the REST API; Admin/Reports are served from the master.]

Page 10: Extending Piwik At R7.com

Nginx (more details later)

 - Small virtual machines can handle thousands of requests per second

 - Visits are split into logs by virtual host

 - HttpUserIdModule
   - automatically creates and handles user id cookies

 - HttpLogModule
   - formats the log as NCSA combined (logging cookies and referrers)

 - HttpEmptyGifModule
   - responds with an empty gif

 - HttpHeadersModule
   - expires -1;

   (all modules are available in Ubuntu's nginx-extras package)

 - Logrotate
   - unix tool to rotate logs
   - every 5 minutes for us

 

Page 11: Extending Piwik At R7.com

Log processing "Worker" (more details later)

 - Copy and uncompress the available logs

 - Format them as REST API requests
   - force the visit date
   - force the client ip
   - force the idVisitor
   - force the User Agent
   - use the log's referrer as the URL
   - use the referrer as the page title (useful to log multiple hostnames)

 - Send the requests to the Piwik server in parallel
   - we are using 270 concurrent requests, which makes 1300 requests per second

Page 12: Extending Piwik At R7.com

Mysql Master (more details later)

 - The only machine that has to be huge, saving money

 - The Piwik admin and reports interface lives here
   - it could be somewhere else, but the machine is huge anyway

 - Mysql tuning

 - Raid tuning

 - Linux networking tuning
   - the same on all machines, to handle many concurrent tcp connections

Page 13: Extending Piwik At R7.com

Php tuning (more details later)

 - Max execution time
 - Max input time  <--- a bug in php reports it as max execution time
 - Max memory limit
 - Apc
 - Apc shm_size

Some problems:

 - Consider Apache: it is slower than nginx, but more stable, much easier to debug, and it is easier to control concurrency
 - MySQLi is more stable and has better debugging than Mysql_pdo
 - mod_php is more stable and easier to debug than fastcgi

Page 14: Extending Piwik At R7.com

Piwik tuning

Follow the rules: http://piwik.org/faq/new-to-piwik/#faq_137

 - disable unused plugins

 - since the cookies come from nginx, you can set this in config.ini:

[Tracker]
trust_visitors_cookies = 1

Page 15: Extending Piwik At R7.com

Handling the TV Effect

nginx requests/second > maximum requests/second at traffic peaks
 - autoscaling guarantees it
 - autoscaling provides scheduled capacity changes

total requests in a day < (REST API requests/second) * (seconds in a day)
 - even though peak requests vary by 1000% in a short time, the total daily traffic is easily handled when it comes through a queue at a fixed requests/second rate; it just takes some extra time to catch up

maximum Apache concurrent requests > maximum concurrent worker connections
 - the program that processes the logs cannot make more requests than Apache can handle; we configure Apache for 1000 concurrent requests and the worker for 260 concurrent requests, so Apache has some free slots for other admin tasks

mysql max connections > apache concurrent requests
 - otherwise you will get "too many connections"

archive.php performance > REST API input rate
 - you can't input more than archive.php is able to process, otherwise you will end up with logs you will never be able to archive (a toy numeric check of these rules follows below)
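
To make the rules concrete, here is a toy check in Python. The 1300 requests/s, 1000 Apache slots and 260 worker connections come from the slides; the daily total and the mysql limit are made-up numbers for illustration, not our real traffic.

# Toy check of the capacity rules above; only the 1300/1000/260 numbers
# come from the slides, the rest are assumptions for illustration.
API_RPS = 1300                # fixed rate the worker feeds the REST API
APACHE_MAX_CLIENTS = 1000     # apache concurrent requests
WORKER_CONCURRENCY = 260      # concurrent worker connections
MYSQL_MAX_CONNECTIONS = 1200  # assumed; must exceed apache's concurrency
TOTAL_REQUESTS_A_DAY = 50e6   # assumed daily total collected by nginx

SECONDS_A_DAY = 24 * 60 * 60

# peaks don't matter as long as the daily total fits through the queue
assert TOTAL_REQUESTS_A_DAY < API_RPS * SECONDS_A_DAY

# the worker must never outrun apache, and apache must never outrun mysql
assert MYSQL_MAX_CONNECTIONS > APACHE_MAX_CLIENTS > WORKER_CONCURRENCY

# catch-up time: how long the worker takes to drain one day of logs
print "a full day of logs drains in %.1f hours" % (TOTAL_REQUESTS_A_DAY / API_RPS / 3600)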

Page 16: Extending Piwik At R7.com

Our real setup, how we deployed it

AWS Autoscaling for the Nginx machines
  - easy high availability; it increases and decreases the number of collecting machines automatically, saving money
  - logrotate runs when a machine is "terminated", to make sure no requests are lost

AWS SNS
  - easily notifies us when a new log file is ready to use, making it easy to synchronize the file processing
  - notifies multiple queues
  - having multiple queues, we can use the same logs in multiple analysis tools; we use one queue for web analysis and another for flash player debugging

AWS SQS
  - easy queue service, so we don't need expensive and complex high availability setups for it

AWS S3
  - cheap and virtually unlimited storage
  - very easy access to files
  - durable; Amazon guarantees better durability than regular data centers

Nginx
  - embedded perl script to get the real IP on Amazon (the perl module is also included in Ubuntu's package)

Logrotate
  - added an s3cmd command (the package is also available on Ubuntu) to upload the log to an S3 bucket, and an SNS CLI command to send a notification to SNS once the upload is finished

Page 17: Extending Piwik At R7.com

Our setup diagram

[Diagram] Visits -> ELB (Elastic Load Balancer) -> nginx autoscaling pool -> S3 bucket (one file per virtualhost, per machine, for each 5 minutes) -> SNS notifications (one notification per S3 file) -> SQS queues -> worker (mysql slave, apache, piwik api, python-boto, python-twisted) -> mysql connection -> "BigAssMySql" (mysql master, piwik) <- Piwik users (admin/reports). Other workers/processors for other projects consume the other queues; the worker and MySQL machines run in our datacenter.

Page 18: Extending Piwik At R7.com

Our Worker - Part 1

Our choice for the REST API was based on the same PHP scaling philosophy: small standalone processes that are easy to multiply. Also, as with MySQL replication, it is easier and healthier to process lots of small pieces than to freeze the servers with huge processes.

To send the requests in parallel we used python twisted, as shown in this blog post:
http://oubiwann.blogspot.com/2008/06/async-batching-with-twisted-walkthrough.html

We installed apache and piwik on the mysql slave machine (of course the php connects to the master's mysql), then we tuned apache, mysql and the tcp connections (as shown before). We access the REST API via http://127.0.0.1/.

Starting from the twisted blog post mentioned above, we changed maxRun to 260, added some logging and error handling (we check whether a gif was returned and its size; otherwise we log the failed request to be reprocessed later), and we implemented the callLater mentioned in the blog's comments, with 0.03 seconds. A sketch of the pattern follows below.
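
For illustration, a minimal sketch of that parallel-request pattern (Python 2 / Twisted, the era of the post), using a DeferredSemaphore instead of the post's batching loop; log_failed() and its retry-log path are hypothetical names, not our real code:

# Minimal sketch: replay tracking URLs against the local Apache+Piwik with
# at most MAX_RUN requests in flight; failed requests go to a retry log.
from twisted.internet import defer, reactor
from twisted.web.client import getPage

MAX_RUN = 260  # concurrent requests, as in our setup

def log_failed(url):
    # hypothetical retry log; failed requests are reprocessed later
    open('/tmp/piwik_failed_requests.log', 'a').write(url + '\n')

def check_gif(body, url):
    # a good Piwik answer is a gif; anything else is logged for retry
    if not body.startswith('GIF'):
        log_failed(url)

def replay(urls):
    sem = defer.DeferredSemaphore(MAX_RUN)
    ds = []
    for url in urls:
        d = sem.run(getPage, url)
        d.addCallbacks(check_gif, lambda f, u=url: log_failed(u),
                       callbackArgs=(url,))
        ds.append(d)
    defer.DeferredList(ds).addCallback(lambda _: reactor.stop())
    reactor.run()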

To get messages and files from Amazon, we are using python-boto

Page 19: Extending Piwik At R7.com

Our Worker - Part 2

Work flow:

 - check for new messages in the AWS SQS queue
 - if there is a message, it means a new file is available; the message contains an s3 file path
 - with the s3 file path, download the file and uncompress it
 - transform the NCSA log lines into REST API request URLs
 - put the URLs in an array
 - delete the message from the queue

 - run the twisted reactor on that array, making requests to the Piwik server in parallel
 - if a request fails, log it to be reprocessed later and raise an alarm in the monitoring system (we use zabbix, btw; for more information: http://lorieri.github.com/zabbix/)

Note: it is good to have one SNS topic and one SQS queue per virtual host if you have too many. Python details later.

Page 20: Extending Piwik At R7.com

Better cost management

- Contributing to Piwik and sharing our ideas brings more ideas and more improvements, and one of the consequences is lower costs

- CPU on Amazon is cheap and you pay for what you use, by time

- Traffic on Amazon is cheap and you pay for what you use, with no long-term contracts

- By dividing the work we can better manage resources, like having only one or two huge machines for the MySQLs and lots of small virtual nginx machines in an autoscaling setup. It is easy to decouple the worker processing to other machines

- Not changing Piwik's code reduces maintenance and development costs

- High Availability on Amazon is easy and cheap

- Storage durability on Amazon is automatic and cheap

- Storage retrieval and management on Amazon is very easy and fast

- Distribution control on Amazon is easy and cheap

- Having an easy way to access the logs makes it simple to replay traffic, so you can run tests as much as you need and try as many tools as you want, improving resource usage and reducing costs

- Amazon reduces its prices and improves its services all the time

Page 21: Extending Piwik At R7.com

Not only web analytics

We are also using Piwik to log video plays

Once a user hits the play button in the flash player, it triggers a GET request similar to this:

http://player.mysite.com/CATEGORY/VIDEONAME

And we use the video name as the action's page title.

It will appear in Piwik's Actions interface divided by category, and by video name in the actions page titles, as in the sketch below.
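
A minimal sketch of that mapping on the worker side (Python 2; the idsite value and the example URL are made-up, and the parameter names follow the tracking API shown on Page 41):

# Sketch: a flash-player hit becomes a Piwik action whose page title is
# the video name; idsite and the example URL are made-up values.
import urllib

play_url = 'http://player.mysite.com/NEWS/some-video-name'
category, video_name = play_url.split('/')[-2:]  # category comes from the URL path

params = urllib.urlencode({
    'idsite': 2,               # assumed site id for the player logs
    'rec': 1,
    'url': play_url,           # keeps the /CATEGORY/VIDEONAME split in Actions
    'action_name': video_name, # the video name shows up as the page title
})
request = 'http://127.0.0.1/piwik.php?%s' % params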

Page 22: Extending Piwik At R7.com

Real Numbers

Sorry, we can't provide real numbers, but we can run tests and show how far we can go.

 - Collecting data
   Nginx: as many requests per second as we need; it is just a matter of adding more cheap nginx virtual machines

 - REST API
   Running outside the master machine we've got 1500 requests/s; our Mysql Master has 2 quad-core cpus, 64GB of memory and Raid 10

 - Download of logs
   If you run inside Amazon, the traffic is free, the bandwidth is huge and the latency is small. We download the logs outside Amazon and it is not our bottleneck yet

 - Distribution of task control
   SNS and SQS do it for us; not a bottleneck yet

 - We are still testing how much data we can archive over a month or two; it is already possible to archive one hour of 1000 requests/s in 30 minutes (including the S3 download and uncompressing), i.e. roughly 2000 requests/s of sustained processing, enough to log 50 million requests a day. But the tests are still at an early stage.

Page 23: Extending Piwik At R7.com

CODE OR GTFO!

It is hard to show all the code. Most of the tools are regular tools from Ubuntu and Amazon, and some others are relevant only to us. But some code and a few links can help a lot.

Unfortunately I can't teach everything, including how to use all the Amazon tools, but the key points will be shown, like how to get the real user IP address on Amazon and some of the linux and mysql tuning.

Page 24: Extending Piwik At R7.com

MySql tuning details - Raid

Raid: http://hwraid.le-vert.net/
Our Raid: http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS

Our commands:

Check battery status:
/usr/sbin/megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood'|grep ': Yes'

Check that the write-back cache is on:
/usr/sbin/megacli -LDInfo -LAll -aAll|tee /tmp/chefraidstatus |grep 'Default Cache Policy: WriteBack'

Turn on the cache:
/usr/sbin/megacli -LDSetProp Cached -LALL -aALL

Turn off the cache in case the battery is not good:
/usr/sbin/megacli -LDSetProp Direct -LALL -aALL

Turn on the HDD cache:
/usr/sbin/megacli -LDSetProp EnDskCache -LAll -aAll

Turn on the adaptive read-ahead cache:
/usr/sbin/megacli -LDSetProp ADRA -LALL -aALL

DO NOT FORGET TO MONITOR THE RAID: there are tools for it on the website above

Page 25: Extending Piwik At R7.com

MySql tuning details - Innodb

All the mysql tunings you can find here:
http://www.slideshare.net/matsunobu/linux-and-hw-optimizations-for-mysql-7614520

Our tuning:

table_cache=1024
tmp_table_size=6G
max_heap_table_size=6G
thread_cache=16
query_cache_size=1G
query_cache_limit=4M
default-storage-engine = InnoDB
expire_logs_days = 5
ignore-builtin-innodb
plugin-load=innodb=ha_innodb_plugin.so
max_binlog_size = 1024M
skip-name-resolve
innodb_flush_log_at_trx_commit=2
innodb_thread_concurrency=32  # we have 16 cpu threads
innodb_buffer_pool_size = 40G # we have 64G of memory
innodb_flush_method=O_DIRECT
innodb_additional_mem_pool_size=100M
innodb_log_buffer_size = 18M
innodb_log_file_size = 300M
interactive_timeout = 999999
wait_timeout = 999999

Page 26: Extending Piwik At R7.com

Linux tuning

/etc/sysctl.conf:

vm.swappiness = 0
net.core.somaxconn = 1024
net.ipv4.tcp_rmem = 4096 4096 16777216
net.ipv4.tcp_wmem = 4096 4096 16777216
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_tw_reuse = 1
net.core.netdev_max_backlog = 5000
net.ipv4.ip_local_port_range = 2000 65535
fs.file-max = 999999

/etc/security/limits.conf (max open files):

*               -       nofile         999999

/etc/default/nginx:

ULIMIT="-n 999999"

Page 27: Extending Piwik At R7.com

Nginx confs - Getting Real User IP on AWS ELB

# apt-get install nginx-extras

To get the real user IP in an Elastic Load Balancer setup, add these lines inside the http context in /etc/nginx/nginx.conf:

perl_set $ip 'sub {
        my $r = shift;
        local $_ = $r->header_in("X-Forwarded-For");

        # XXX only works well because we know the AWS network uses 10.x.x.x ip addresses
        # Thanks Zed9h
        my $ip0 = m{.*\b(
                (?:
                        \d|
                        1[1-9]|
                        [2-9]\d|
                        [12]\d{2}
                )\.\d+\.\d+\.\d+
        )\b}xo && $1;

        # $ip0 ne $ip1 && "$ip0 ne $ip1\t\t$_"; # debug
        $ip0 || $r->remote_addr
}';

(Thanks Zed for the Perl script)

Page 28: Extending Piwik At R7.com

Nginx confs - Adding a virtual host (1/2)

create a file on /etc/nginx/sites-available/VHOST.conf

server {
        listen 80; ## listen for ipv4; this line is default and implied
        server_name VHOST.MYSITE.com;
        root /usr/share/nginx/www;
        index index.html index.htm;

        userid on;
        userid_name uid;
        userid_domain MYSITE.com;
        userid_expires max;

        set $myid $uid_got;

        location = /crossdomain.xml {
                echo "<?xml version=\"1.0\"?><!DOCTYPE cross-domain-policy SYSTEM \"http://www.macromedia.com/xml/dtds/cross-domain-policy.dtd\"><cross-domain-policy><allow-access-from domain=\"*\" /></cross-domain-policy>";
                expires modified +24h;
                access_log off;
                error_log /var/log/nginx/error.log;
        }

        location / {
                if ($uid_got = ""){
                        set $myid $uid_set;
                }

                expires -1;
#               return 204;  # use this if you want an empty response
                empty_gif;   # use this if you want an empty gif response
        }

Page 29: Extending Piwik At R7.com

Nginx confs - Adding a virtual host (2/2)

        location /healthcheck {
                try_files $uri $uri $uri =404;
                access_log off;
                error_log /var/log/nginx/error.log;
        }

        location /nginx_status {
                stub_status on;
                allow 127.0.0.1;
                deny all;
                access_log off;
                error_log /var/log/nginx/error.log;
        }

        # !!!!!!!!!!!!!!!!!!!!
        # the log format is for Amazon AWS only, if you have the real IP, change
        # the ip variable to $remote_addr

        log_format VHOST '$ip - $remote_user [$time_local]  '
                         '"$request" $status $body_bytes_sent '
                         '"$http_referer" "$http_user_agent" '
                         '"$myid"';

        # /mnt is the AWS's fastest partition
        access_log /mnt/log/nginx/VHOST.access.log VHOST;
        error_log /mnt/log/nginx/VHOST.error.log;
}

Page 30: Extending Piwik At R7.com

Testing Nginx

$ curl localhost/
the result must be a gif

$ curl -I localhost/
a cookie must be set
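
The same two checks can be scripted; a minimal sketch in Python 2, assuming the uid cookie name from the vhost config above:

# Sketch: verify the collector answers with a gif and sets the uid cookie.
import urllib2

resp = urllib2.urlopen('http://localhost/')
assert resp.read().startswith('GIF'), 'expected a gif response'
cookie = resp.headers.get('Set-Cookie', '')
assert 'uid=' in cookie, 'expected the uid cookie to be set'
print 'OK:', cookie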

Page 31: Extending Piwik At R7.com

Php tuning details

# apt-get install php-apc

Create a file /etc/php5/conf.d/piwik.ini:

memory_limit = 15G
max_execution_time = 0
max_input_time = 0
apc.shm_size = 64

* Piwik tuning on previous slides

Page 32: Extending Piwik At R7.com

AWS SNS

It is out of scope to teach how to create an Autoscaling group, an SNS topic, an S3 bucket and an SQS queue.

We will only show how we use them. Create an SNS topic on Amazon, "MYTOPIC", and attach an SQS queue, "MYQUEUE", to it.

Install the SNS client and unzip it somewhere, let's say /usr/local/bin. Download from Amazon the file: SimpleNotificationServiceCli-2010-03-31.zip

Install the JDK:
# apt-get install openjdk-6-jdk

Create a .conf file with a key and secret to access SNS, let's say /usr/local/sns.conf:

AWSAccessKeyId=XXXXXXXXXXXX
AWSSecretKey=XXXXXXXXX

Create a source file at /usr/local/sns_env.source:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/bin/SimpleNotificationServiceCli-1.0.2.3/bin/
export AWS_SNS_HOME=/usr/local/bin/SimpleNotificationServiceCli-1.0.2.3/
export EC2_REGION=us-east-1
export AWS_CREDENTIAL_FILE=/usr/local/sns.conf

Page 33: Extending Piwik At R7.com

AWS S3 and s3cmd

Create an S3 bucket on Amazon

$ apt-get install s3cmd
$ s3cmd --configure
$ cp ~/.s3cfg /usr/local/s3cmd.cfg

Page 34: Extending Piwik At R7.com

Log rotate (1/3)

The script was ripped from Ubuntu's init and logrotate scripts and put in a cronjob. You could use logrotate instead, though.

In the crontab:

*/5 * * * *  nice /bin/bash /usr/local/bin/VHOST.sendS3.sh >> /mnt/log/VHOST.send.log 2>&1

sendS3.sh:

#!/bin/bash
date # print the date to the log
DEBUGS3=`mktemp`
atexit() {
        rm -f $DEBUGS3
}
trap atexit 0

BUCKET="MYBUCKET"
PROJECT="MYVHOST"
ARCHIVEDIR="/mnt/MYVHOST/"
S3CMD_CONF="/usr/local/s3cmd.cfg"
ORIGINPATH="/mnt/log/nginx/VHOST.access.log"
SNS_ENV="/usr/local/sns_env.source"
SNS_TOPIC="MYTOPIC"
HOST=`hostname` # we use the instance-id on amazon
DATE=$(date --utc +%Y%m%d_%H%M%S)
DATEDIR=$(date --utc +%Y/%m/%d)

POSTPATH="$PROJECT/$DATEDIR/$PROJECT-$DATE-$HOST.log"
LOCALPATH="$ARCHIVEDIR/$POSTPATH"
GZLOCALPATH="$LOCALPATH.gz"
REMOTEPATH="s3://$BUCKET/$POSTPATH.gz"

Page 35: Extending Piwik At R7.com

Log rotate (2/3)

echo "->Trying file: $REMOTEPATH"

LOCALDIR="$(dirname "$LOCALPATH")"

#sleep 1 recomended by nginx's wikimkdir -p "$LOCALDIR" &&mv "$ORIGINPATH" "$LOCALPATH" &&{ [ ! -f /var/run/nginx.pid ] || kill -USR1 `cat /var/run/nginx.pid` ; }  &&sleep 1 &&gzip "$LOCALPATH" &&{ MD5=$(/usr/bin/md5sum "$GZLOCALPATH" | awk '{ print $1 }') ; }

#try 3 timesif [ -z "$MD5" ]then        echo "ERROR ON MD5"        OK=1else        OK=$(/usr/bin/s3cmd -d --no-progress -c "$S3CMD_CONF" put "$GZLOCALPATH" "$REMOTEPATH" 2>&1 |grep -q "DEBUG: MD5 sums: computed=$MD5, received=\"$MD5\"";echo $?)        if [ "$OK" -eq "1" ]        then                OK=$(/usr/bin/s3cmd -d --no-progress -c "$S3CMD_CONF" put "$GZLOCALPATH" "$REMOTEPATH" 2>&1 |grep -q "DEBUG: MD5 sums: computed=$MD5, received=\"$MD5\"";echo $?)                if [ "$OK" -eq "1" ]                then                        /usr/bin/s3cmd -d --no-progress -c "$S3CMD_CONF" put "$GZLOCALPATH" "$REMOTEPATH" 2>&1|tee "$DEBUGS3"                        OK=$(grep -q "DEBUG: MD5 sums: computed=$MD5, received=\"$MD5\"" "$DEBUGS3"; echo $?)                fi        fifi

Page 36: Extending Piwik At R7.com

Log rotate (3/3)

# if ok, publish a message on SNS
if [ "$OK" = "0" ]
then
        source "$SNS_ENV"
        echo -n '-> Message: '
        sns-publish "$SNS_TOPIC" --message "$REMOTEPATH"
        OK=${PIPESTATUS[0]}
fi

echo "OK=$OK"
# for monitoring
#/usr/bin/zabbix_sender -s "$HOST" -z XXXXXX.com -k XXXXXX -o "$OK"

Page 37: Extending Piwik At R7.com

Rotating and uploading logs on reboot and shutdown

This is a protection for the Autoscaling group, where machines are created and terminated all the time.

Create a file at /etc/init.d/VHOST.sendme.sh:

#!/bin/bash
/bin/echo TERMINATED `date --utc` >> /mnt/log/nginx/VHOST.access.log
/usr/bin/nice -20 /bin/bash /usr/local/bin/VHOST.sendS3.sh

Then execute:

# update-rc.d VHOST.sendme.sh stop 21 0 6 .

(the dot at the end of the line is required)

Page 38: Extending Piwik At R7.com

Worker details (1/3)

I'm not a developer; my worker python code is too ugly to be shown. It is very similar to the blog post mentioned earlier; the only additions are downloading the S3 files and reading the messages from SQS. The functions are similar to these:

Connect to S3 and SQS:

from boto.sqs.connection import SQSConnection
from boto.s3.connection import S3Connection
from boto.sqs.message import RawMessage # for SNS messages
import json

print "connecting to sqs"
logging.info("connecting to sqs")
connsqs = SQSConnection('xxxxxxxxxxxx', 'xxxxxxxxxxxxx')

print "connecting to s3"
logging.info("connecting to s3")
conns3 = S3Connection('xxxxxxxxxxxxxxxx', 'xxxxxxxxxxxxx')

Page 39: Extending Piwik At R7.com

Worker details (2/3)

Reading and deleting SQS messages, and putting results in an array:

print "getting queue"logging.info("getting queue")my_queue = connsqs.get_queue('MYQUEUE')my_queue.set_message_class(RawMessage) #raw messages from SNS

maxmsgs = 10msgs = []msg = my_queue.read()while msg:        logging.info("getting message")        msgsingle = json.loads(msg.get_body())['Message']        logging.info(msgsingle)        msgs.append(msgsingle) 

        logging.info("deleting message")        my_queue.delete_message(msg)        if len(msgs) < maxmsg :                logging.info("getting more messages")                msg = my_queue.read()        else:                msg = False

Page 40: Extending Piwik At R7.com

Worker details (3/3)

Getting files from S3 and putting lines in an array:

lines = []
filename = '/tmp/tmppiwikpy.%s.txt' % os.getpid()
for msg_data in msgs:
        llog = "trying file " + msg_data
        logging.info(llog)
        if "s3://MYBUCKET/" in msg_data:
                s3obj = msg_data.replace("s3://MYBUCKET/", "")
                llog = "downloading " + msg_data
                logging.info(llog)
                key = conns3.get_bucket('MYBUCKET').get_key(s3obj)
                key.get_contents_to_filename(filename)
                llog = "decompressing file " + msg_data
                logging.info(llog)
                fgz = gzip.open(filename, 'r')
                line = fgz.readline()
                while line:
                        lines.append(line)
                        line = fgz.readline()
                llog = "closing and deleting the temporary file"
                logging.info(llog)
                fgz.close()
                os.remove(filename)

Page 41: Extending Piwik At R7.com

Piwik REST API

Check it here:http://piwik.org/docs/tracking-api/#toc-tracking-api-rest-documentation

Your worker script has to create a URL like this:

http://127.0.0.1/piwik.php?action_name=NAME&idsite=XX&rand=RANDOMNUMBER&rec=1&url=URL&cip=USER_IP&token_auth=YOUR_PIWIK_ADMIN_TOKEN&_id=COOKIE_FROM_NGINX&cdt=DATE_OF_VISIT

The _id sent for the cookie from Nginx is the first 16 characters of the cookie's md5 sum, the same as Piwik computes internally. The date of visit must be in UTC.
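
A minimal sketch of that construction for one NCSA combined log line (Python 2; the regex and the date helper are illustrative, while the parameters, the _id rule and the UTC rule are the ones stated above; the User Agent is forced by sending it as the replay request's own User-Agent header rather than as a URL parameter):

# Sketch: one NCSA combined log line (from the nginx log_format above)
# -> a Piwik tracking request URL. Regex and helpers are illustrative.
import hashlib
import random
import re
import urllib
from datetime import datetime, timedelta

LOG_RE = re.compile(r'(?P<ip>\S+) - \S+ \[(?P<date>[^\]]+)\]\s+'
                    r'"(?P<request>[^"]*)" \d+ \S+ '
                    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" "(?P<uid>[^"]*)"')

def ncsa_to_utc(stamp):
    # "10/Feb/2012:15:04:05 -0200" -> "2012-02-10 17:04:05" (UTC)
    local, offset = stamp.rsplit(' ', 1)
    dt = datetime.strptime(local, '%d/%b/%Y:%H:%M:%S')
    delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
    return (dt - delta if offset[0] == '+' else dt + delta).strftime('%Y-%m-%d %H:%M:%S')

def to_piwik_url(line, idsite, token_auth):
    m = LOG_RE.match(line)
    return 'http://127.0.0.1/piwik.php?' + urllib.urlencode({
        'idsite': idsite,
        'rec': 1,
        'rand': random.randint(0, 2 ** 31),
        'action_name': m.group('referer'),   # referrer as the page title (Page 11)
        'url': m.group('referer'),           # log's referrer as the URL (Page 11)
        'cip': m.group('ip'),                # force the client ip
        'cdt': ncsa_to_utc(m.group('date')), # date of visit, in UTC
        '_id': hashlib.md5(m.group('uid')).hexdigest()[:16],
        'token_auth': token_auth,
    })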

Page 42: Extending Piwik At R7.com

Others / Next steps

 - If you deploy this outside Amazon, it makes sense to send the log lines to a queue
   and have lots of small workers reading and replaying them into Piwik. It is
   easier to handle, skip or reprocess a failed line than an entire log file

 - We still have room to improve: we are not using SSD cards, and we haven't done
   any partitioning or sharding

 - The visit logs and action logs will need to be changed in order to make the database
   cheaper and more scalable

 - Our next step will be trying to improve the archives, probably our next bottleneck

Page 43: Extending Piwik At R7.com

What is missing on Piwik

 - split the read and write connections to the mysql database, so we can get
   the benefits of mysql replication, like running a dedicated slave for the
   archive.php selects and a dedicated slave for non-admin users

 - create a database per website. It is easier to maintain and reduces the
   mysql index sizes so they fit in memory. You can partition the tables
   by idsite, which helps

 - people reading this presentation and sending feedback :)
   please use Piwik's forums for this

 - feature request: optionally have a mysql connection per website, or be
   able to configure Piwik's interface to import data from other Piwik
   installations, keeping all websites in a single place. That way we could
   have smaller databases for each website. (Zabbix has a similar feature)

 - feature request: have optional mysql connection profiles, so we can
   set smaller buffers for smaller tasks, improving memory usage

Page 44: Extending Piwik At R7.com

Thanks Piwik !

Now we have modern analytics for old problems

and a modern scaling setup for a traditional LAMP stack

Thanks Zed (aka Carlo) for all the programming support, Gaiser for all the Amazon tips, and Matt for all the Piwik tips.

And thanks to R7 managers Denis and Vechiato for believing in this and providing the time and resources to make it happen, and to R7 director Brandi for reviewing this and allowing us to share it.