Solr Distributed Indexing in WalmartLabs: Presented by Shenghua Wan, WalmartLabs
![Page 1: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/1.jpg)
October 13-16, 2016 • Austin, TX
![Page 2: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/2.jpg)
Shenghua Wan
Sr. Software Engineer, @WalmartLabs
Solr Distributed Indexing in WalmartLabs
![Page 3: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/3.jpg)
Background
• Search big data, part of the Polaris Search Team in WalmartLabs
• Audience management, Acxiom Inc.
• HPC computational scientist, UTSW Medical Center
![Page 4: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/4.jpg)
Our perspective:
• To help make Solr indexing more scalable
• From a big data engineer's perspective
• Solr/Lucene internals are not covered in this talk
![Page 5: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/5.jpg)
Problem definition
• Input: 96 gzipped XML files
• Output: 3 shards of binary indexes, one for every 32 XML files
• Dedicated indexing servers are not scalable
• Indexing takes at least 4 hours in the dev environment, which slows down development iteration
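As a minimal sketch of the input/output layout above, the mapping from 96 input files to 3 shards can be written as a contiguous grouping of 32 files per shard. Only the counts come from the talk; the file names and the assumption that files are grouped in sorted order are hypothetical.

```python
# Hypothetical illustration: group 96 input files into 3 shards of 32 files each.
# The part-NNN.xml.gz naming scheme and the contiguous grouping are assumptions.

NUM_FILES = 96
NUM_SHARDS = 3
FILES_PER_SHARD = NUM_FILES // NUM_SHARDS  # 32 files feed each shard


def shard_for_file(file_index: int) -> int:
    """Return the 0-based shard that a 0-based file index belongs to."""
    if not 0 <= file_index < NUM_FILES:
        raise ValueError(f"file index out of range: {file_index}")
    return file_index // FILES_PER_SHARD


def files_by_shard() -> dict:
    """Build a shard -> list of (hypothetical) file names mapping."""
    groups = {shard: [] for shard in range(NUM_SHARDS)}
    for i in range(NUM_FILES):
        groups[shard_for_file(i)].append(f"part-{i:03d}.xml.gz")
    return groups


if __name__ == "__main__":
    for shard, names in files_by_shard().items():
        print(shard, len(names), names[0], names[-1])
```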
![Page 6: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/6.jpg)
Existing “Wheels” for Solr Distributed Indexing
• “Indexing Files via Solr and Java MapReduce” (Adam Smieszny, since 2012)
• LuceneIndexOutputFormat (Twitter’s Elephant-Bird, since 2013)
• MapReduceIndexerTool (Mark Miller, since late 2013)
![Page 7: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/7.jpg)
Existing “Wheels” for Solr Distributed Indexing
☐ “Indexing Files via Solr and Java MapReduce” (Adam Smieszny, since 2012)
☐ LuceneIndexOutputFormat (Twitter’s Elephant-Bird, since 2013)
✓ MapReduceIndexerTool (Mark Miller, since late 2013)
This tool is closest to our use case.
![Page 8: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/8.jpg)
Start from MapReduceIndexerTool
Anatomy of this tool
• MorphlineMapper: uses Morphlines to convert a document to a SolrInputDocument
• SolrRecordWriter: creates an embedded Solr instance to index the document
• TreeMergeRecordWriter: merges multiple binary indexes into one
References:
1. https://github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce
2. https://github.com/markrmiller/solr-map-reduce-example
![Page 9: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/9.jpg)
Our Challenges
• Not using SolrCloud
• Not using ZooKeeper
• Solr version 4.0 (when we did the experiments)
• Environment
  ◦ Hadoop version 1
  ◦ MapR File System
• XML input format
• Easy to maintain and debug
• Documentation
A runnable example with source code is the best. Thanks to https://github.com/markrmiller/solr-map-reduce-example.
![Page 10: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/10.jpg)
Customize Design to Our Use Case
Breaking down into two fundamental utilities
• Index Generator: replaces Morphlines with XmlInputFormat from Apache Mahout and reuses SolrOutputFormat
• Index Merger: reuses TreeMergeOutputFormat
References:
1. https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
2. https://github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce
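The idea behind feeding XML to the Index Generator can be sketched as follows: scan the input for a start tag and an end tag, and hand each enclosed fragment to a mapper as one record. This is a simplified Python illustration of that scanning approach, not the Mahout class itself. The `<doc>` tag, the sample input, and the function name `iter_xml_records` are assumptions made for the example.

```python
import gzip
import io
from typing import Iterator


def iter_xml_records(stream, start_tag: bytes, end_tag: bytes) -> Iterator[bytes]:
    """Yield every fragment that starts with start_tag and ends with end_tag.

    For brevity this sketch reads the whole stream into memory; a real
    input format would scan the stream incrementally.
    """
    data = stream.read()
    pos = 0
    while True:
        start = data.find(start_tag, pos)
        if start == -1:
            return  # no further records
        end = data.find(end_tag, start + len(start_tag))
        if end == -1:
            return  # an unterminated trailing record is dropped
        end += len(end_tag)
        yield data[start:end]
        pos = end


if __name__ == "__main__":
    # Hypothetical input: two <doc> records inside a gzipped XML file.
    raw = b"<docs><doc><id>1</id></doc>\n<doc><id>2</id></doc></docs>"
    compressed = gzip.compress(raw)
    with gzip.open(io.BytesIO(compressed), "rb") as f:
        for record in iter_xml_records(f, b"<doc>", b"</doc>"):
            print(record.decode("utf-8"))
```

Each yielded fragment would then be parsed and converted into one document for the indexer.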
![Page 11: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/11.jpg)
Customize Design to Our Use Case – cont.
Breaking down into two fundamental utilities
• Index Generator
• Index Merger
More complicated logic can be built on top of these two simple map-only jobs.
Where is the reduce step? Our use case does not need it, and we want the jobs lean and fast. But you may need it.
![Page 12: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/12.jpg)
Experiments and Observations
• Index Generation
  ✓ CPU-bound
  ✓ Scales easily and runs in parallel
  ✓ Map-only wins 12-15% over Map-Reduce in our experiments
  ✓ ~5 GB of decompressed XML documents indexed within 10 minutes using 7x3 mappers
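To put the last figure in perspective, here is a small back-of-the-envelope calculation of the implied indexing throughput. It assumes that "7x3 mappers" means 21 concurrent mappers and that 1 GB is 1024 MB; because the run finished within 10 minutes, the results are lower bounds.

```python
# Rough throughput implied by "~5 GB indexed within 10 minutes using 7x3 mappers".
# Assumptions: 7x3 means 21 mappers running concurrently, and 1 GB = 1024 MB.

def throughput_mb_per_s(size_gb: float, minutes: float) -> float:
    """Aggregate throughput in MB/s for size_gb processed in the given minutes."""
    return size_gb * 1024 / (minutes * 60)


if __name__ == "__main__":
    mappers = 7 * 3
    total = throughput_mb_per_s(5, 10)
    print(f"aggregate: at least {total:.1f} MB/s")
    print(f"per mapper: at least {total / mappers:.2f} MB/s")
```

The low per-mapper rate is consistent with the observation that index generation is CPU-bound rather than limited by IO.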
![Page 13: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/13.jpg)
Experiments and Observations – cont.
• Index Merging
  ✓ IO-bound: disk and network, and the network was our pain point
  ✓ Two stages: logical merge and optimize
    ◦ Logical merge: file movement
    ◦ Optimize: reduce the number of index segments
![Page 14: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/14.jpg)
Experiments and Observations – cont.
n-Way Merge: merging n shards of roughly the same size into 1
Nothing suspicious
![Page 15: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/15.jpg)
Experiments and Observations – cont.
n-Way Merge: merging n shards of roughly the same size into 1
Why does the time rise sharply all of a sudden?
• Too many shards
• Resource contention
![Page 16: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/16.jpg)
Experiments and Observations – cont.
n-Way Merge: merging n shards of roughly the same size into 1
Optimize time >> logical merge time: 5x to 8x (the 64-way merge is an exception, considered an outlier because of the shared environment)
![Page 17: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/17.jpg)
New Challenges
After contacting the cluster owner team, we were told that the cluster, which consists of almost five dozen nodes, is connected by 1 Gb/s Ethernet.
![Page 18: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/18.jpg)
Experiments and Observations – cont.
How about a “tree”-structured merge?
It seems attractive
![Page 19: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/19.jpg)
Experiments and Observations – cont.
Comparing hierarchical merge and n-way merge total time
Kind of unexpected
![Page 20: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/20.jpg)
Experiments and Observations
Comparing hierarchical merge and n-way merge total time
Relatively isolated environment: no network, only disk IO (4 cores x 2 threads)
• Hierarchical merge: 4 small reads + 2 large reads
• n-way merge: 4 small reads
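The read counts above can be reproduced with a simple IO model. The sketch below compares the total volume read by a one-pass n-way merge with that of a level-by-level (hierarchical) merge. It assumes that a merged index is as large as the sum of its inputs and ignores writes, optimize time, and parallelism; the function names are illustrative.

```python
# Simplified IO model for merging shards.
# Assumption: a merged index is as large as the sum of its inputs.

def n_way_merge_reads(shard_sizes):
    """Volume read when all shards are merged in a single pass."""
    return sum(shard_sizes)


def tree_merge_reads(shard_sizes, fan_in=2):
    """Volume read when shards are merged level by level, fan_in at a time."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    total = 0
    level = list(shard_sizes)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fan_in):
            group = level[i:i + fan_in]
            if len(group) > 1:
                total += sum(group)  # every index in a merged group is read once
            next_level.append(sum(group))
        level = next_level
    return total


if __name__ == "__main__":
    shards = [1, 1, 1, 1]  # four shards of equal (unit) size
    print("n-way reads:", n_way_merge_reads(shards))
    print("tree reads:", tree_merge_reads(shards))
```

With four equal shards, the hierarchical merge reads every byte twice: once from the small shards and once more from the two intermediate indexes.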
![Page 21: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/21.jpg)
Lessons Learnt
• Index generation in parallel is easy
• Merging is not
• N-way merging of all shards in one pass is better than hierarchical merging
• Data locality is key
![Page 22: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/22.jpg)
Our Solutions
• Plan A: “Hey, Sir/Madam, could you please get us a 48 Gb/s InfiniBand network ASAP? Or 10 Gb/s is also fine.”
• Plan B: A small dedicated indexing Hadoop cluster (starting from one node)
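To see why Plan A is tempting, the following sketch estimates the ideal time to move an index across the network at the 1 Gb/s rate reported by the cluster owners and at the two rates requested in Plan A. The 10 GB index size is a hypothetical figure, and the calculation ignores protocol overhead and contention, so real transfers would be slower.

```python
# Ideal transfer time at line rate, ignoring protocol overhead and contention.
# The 10 GB index size is a hypothetical value chosen for illustration.

def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Seconds to move size_gb (decimal gigabytes) over a link of link_gbps gigabits/s."""
    size_bits = size_gb * 1e9 * 8
    return size_bits / (link_gbps * 1e9)


if __name__ == "__main__":
    index_size_gb = 10  # hypothetical
    for gbps in (1, 10, 48):
        seconds = transfer_seconds(index_size_gb, gbps)
        print(f"{gbps:>2} Gb/s: {seconds:.1f} s")
```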
![Page 23: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/23.jpg)
Our Solutions
A small dedicated indexing Hadoop cluster (starting from one node)

| Environment | Disk IO (MB/s) |
| --- | --- |
| Shared | ~44 |
| Mac Pro (SSD) | ~250 |
| Dedicated | ~202 |

Dedicated cluster:
• 1 node
• 32 cores
• 128 GB memory
![Page 24: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/24.jpg)
Tips
Tunable parameters
• Split Size (Map-Reduce)
• Batch Size (Solr Index)
• RAM Buffer Size (Solr Index)
• Max number of Segments (Solr Index)
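As an illustration of the batch size parameter, the sketch below groups documents into fixed-size batches before they are handed to an indexer. It is a generic Python example, not code from the tool; the helper name `in_batches` and the sample documents are assumptions.

```python
# Generic batching helper: send documents to the indexer in groups of
# batch_size instead of one at a time. Names here are illustrative only.

def in_batches(docs, batch_size):
    """Yield lists of at most batch_size documents from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


if __name__ == "__main__":
    docs = ({"id": i} for i in range(5))  # hypothetical documents
    for batch in in_batches(docs, batch_size=2):
        print(len(batch), batch)
```

Larger batches mean fewer calls into the indexer at the cost of more memory held per mapper, which is why this value is worth tuning together with the RAM buffer size.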
![Page 25: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/25.jpg)
Opportunities
Some parts are missing from our tool because our use case does not need them, but you may want them:
1. Reduce functions (deduplication, other processing logic)
2. Try Spark or an equivalent (the bottleneck is the embedded Solr instance when merging)
![Page 26: Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs](https://reader031.fdocuments.in/reader031/viewer/2022022414/5870675a1a28ab48378b5395/html5/thumbnails/26.jpg)
Thanks! We are hiring!
Questions?